1. Source Data Acquisition#
The data source is Bilibili video comments.
When a new page of comments is loaded, the browser requests a URL of the following form:
https://api.bilibili.com/x/v2/reply/main?callback=jQuery172020167562580015508_1653393655707&jsonp=jsonp&next=4&type=1&oid=768584323&mode=3&plat=1&_=1653396013955
where
next: page number
oid: video av number
Requesting this URL with a header such as {"user-agent": ..., "referer": "https://www.bilibili.com"} returns the comments as a JSON file.
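A minimal sketch of fetching one page through this endpoint (the reply structure data["data"]["replies"] is described in the next section; header values follow the example above):
import requests

headers = {
    "user-agent": "Mozilla/5.0",
    "referer": "https://www.bilibili.com",
}
url = "https://api.bilibili.com/x/v2/reply/main?jsonp=jsonp&next=1&type=1&oid=768584323"
resp = requests.get(url, headers=headers)
data = resp.json()
# Each page carries its comments under data["data"]["replies"]
print(len(data["data"]["replies"]), "comments on this page")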
2. Data Preprocessing#
Taking av=13662970 ("Your Name") as an example:
url = "https://api.bilibili.com/x/v2/reply/main?jsonp=jsonp&next=%d&type=1&oid=13662970"
- Vary the next page value to crawl 10 pages of JSON (later extended to 100 pages, i.e. 100 JSON files).
- Read the JSON files and extract each comment from data["data"]["replies"][i]["content"]["message"].
- Each JSON file contains 20 comments, so loop i from 0 to 19 to collect all of them.
- Output all comments.
3. Data Storage and Management#
- Obtain the 100 pages (2,000 comments) and write them to comment.txt, as sketched below.
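A sketch of this step, writing one comment per line (a simplification; the dejson.py script listed later uses the csv module instead):
import json

comments = []
for page in range(100):  # 100 saved pages
    with open("./json/{}.json".format(page), "r", encoding="utf8") as f:
        data = json.load(f)
    for reply in data["data"]["replies"]:  # 20 comments per page
        comments.append(reply["content"]["message"])

with open("comment.txt", "w", encoding="utf8") as fp:
    fp.write("\n".join(comments))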
4. Model Selection#
- Use Jieba for word frequency analysis.
- The raw frequency results contain many meaningless words.
- Create ignore_dict.txt and exclude those words based on the results (a minimal sketch follows this list).
- After removal, generate the word cloud with the excluded words filtered out.
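A minimal sketch of the frequency count with the ignore list applied (assuming comment.txt and ignore_dict.txt as described; the full word_frequency.py script is listed later):
import jieba
from collections import Counter

with open("comment.txt", encoding="utf-8") as f:
    text = f.read()
with open("ignore_dict.txt", encoding="utf-8") as f:
    ignore = set(f.read().split())

# Segment, drop single characters and ignored words, then count
words = [w for w in jieba.lcut(text) if len(w) >= 2 and w not in ignore]
for word, count in Counter(words).most_common(20):
    print(word, count)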
5. Visual Analysis#
file_path="res.csv"
df=pd.read_csv(file_path,encoding="gbk")
df.columns = ["word","frequency"]
print(df.head())
plt.bar(df["word"],df["frequency"])
plt.xlabel('Word')
plt.ylabel('Frequency')
plt.show()
- Read CSV and plot.
Relationship Between User Level and Likes#
Build a like list and a user-level list from the replies in each JSON file:
like.append(data["data"]["replies"][i]["like"])
level.append(data["data"]["replies"][i]["member"]["level_info"]["current_level"])
# Plotting
plt.bar(level,like)
plt.xlabel('User Level')
plt.ylabel('Total Likes')
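Note that plt.bar draws one bar per (x, y) pair, so repeated level values overlap rather than add up. A sketch that sums likes per level before plotting (not the original script; variable names follow like_level.py below):
from collections import defaultdict
import matplotlib.pyplot as plt

totals = defaultdict(int)
for lv, lk in zip(level, like):  # the like/level lists built above
    totals[lv] += lk

levels_sorted = sorted(totals)
plt.bar(levels_sorted, [totals[lv] for lv in levels_sorted])
plt.xlabel('User Level')
plt.ylabel('Total Likes')
plt.show()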
- Core Code and Tool List
Libraries used: requests, jieba, re, pandas, matplotlib, json, csv
Process:
- from_bilibili_get_json.py---Get the comment JSON.
- like_level.py---Get the like counts of users at different levels from the JSON and plot a bar chart.
- dejson.py---Collect all comments into comment.txt.
- word_frequency.py---Word frequency analysis.
- visualization.py---Visualization.
from_bilibili_get_json.py
import requests,json
headerbili={
"user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/100.0.4896.127 Safari/537.36 Edg/100.0.1185.50",
"referer":"https://www.bilibili.com"
}
'''
next: page number
oid: video av number
'''
def get_json(page):
url="https://api.bilibili.com/x/v2/reply/main?jsonp=jsonp&next=%d&type=1&oid=13662970"
data = requests.get(url % page, headers=headerbili).text.encode("utf8")
comm=json.loads(data)
print(comm)
    '''
    ensure_ascii controls whether the output may contain non-ASCII characters.
    Chinese characters are outside the ASCII range, so set ensure_ascii=False.
    '''
with open("./json/{}.json".format(page),"w" ,encoding='utf-8') as f:
f.write(json.dumps(comm, ensure_ascii=False))
for i in range(100):  # 100 pages
    get_json(i)
    print(i, '\n')
# vedioreg = 'class="tap-router">(.*?)</a>'
# aurllist = re.findall(vedioreg, rank, re.S | re.M)
# print(aurllist)
like_level.py
import json
import matplotlib.pyplot as plt
plt.rcParams['font.sans-serif'] = ['SimHei']  # display Chinese labels correctly
plt.rcParams['axes.unicode_minus'] = False  # display minus signs correctly
# (In Python 2, Chinese string literals would also need the u'' prefix)
like= []
level=[]
for e in range(100):
with open("./json/{}.json".format(e),"r",encoding="utf8") as f:
data=json.load(f)
for i in range(20):
like.append(data["data"]["replies"][i]["like"])
level.append(data["data"]["replies"][i]["member"]["level_info"]["current_level"])
print(like,level)
plt.bar(level,like)
plt.xlabel('User Level')
plt.ylabel('Total Likes')
plt.show()
dejson.py
import json,csv
comment=[] # Comment list
for e in range(100):
with open("./json/{}.json".format(e),"r",encoding="utf8") as f:
data=json.load(f)
for i in range(20):
comment.append(data["data"]["replies"][i]["content"]["message"])
with open("comment.txt","w",encoding="utf8") as fp:
writer = csv.writer(fp)
writer.writerow(comment)
print("Written", comment) # Comments in txt file
word_frequency.py
import jieba,re
# Remove punctuation
def get_text(file_name):
    with open(file_name, 'r', encoding='utf-8') as fr:
        text = fr.read()
    # Punctuation to delete
    del_ch = ['《', ',', '》', '\n', '。', '、', ';', '"',
              ':', ',', '!', '?', ' ']
    for ch in del_ch:
        text = text.replace(ch, '')
    return text
file_name = 'comment.txt'
text = get_text(file_name)
vlist = jieba.lcut(text) # Call Jieba for word segmentation, return list
res_dict = {}
# Perform word frequency statistics
for i in vlist:
res_dict[i] = res_dict.get(i,0) + 1
res_list = list(res_dict.items())
#print(res_list)
# Sort in descending order
res_list.sort(key = lambda x:x[1], reverse = True)
fin_res_list = []
# Remove single-character tokens
for item in res_list:
    if len(item[0]) >= 2:
        fin_res_list.append(item)
# Load the ignore list built from earlier results (space- or newline-separated)
with open('ignore_dict.txt', 'r', encoding='utf-8') as f:
    ignore_words = set(f.read().split())
# Keep the top 1000 words, skip ignored ones, and write word,frequency pairs
with open("res.csv", "w") as fa:  # system default encoding; visualization.py reads it back with encoding="gbk"
    for word, count in fin_res_list[:1000]:
        word = re.sub(r'[\n ]', '', word)  # strip stray whitespace
        if len(word) < 1 or word in ignore_words:
            continue
        fa.write(word + "," + str(count) + "\n")
visualization.py
import pandas as pd
import matplotlib.pyplot as plt
import wordcloud,jieba
plt.rcParams['font.sans-serif'] = ['SimHei']  # display Chinese labels correctly
plt.rcParams['axes.unicode_minus'] = False  # display minus signs correctly
# (In Python 2, Chinese string literals would also need the u'' prefix)
# File path
file_path="res.csv"
df = pd.read_csv(file_path, encoding="gbk", header=None)  # res.csv has no header row
# Bar chart
def _bar():
df.columns = ["word","frequency"]
print(df.head())
plt.bar(df["word"],df["frequency"])
plt.xlabel('Word')
plt.ylabel('Frequency')
plt.show()
_bar()
# Word cloud
with open("comment.txt","r",encoding="utf8")as f:
txt=f.read()
stopwords = ["的","了","我","时","在","你","看","到","没","不","就","是","人","也","有","和","会",
"一个","没有","时候","弹幕","自己","什么","时间","知道","就是","现在","真的","已经","还是","这个","看到","可以","因为"
,"你们","才能","不是","但是","那个","最后","每秒","所以","他们","觉得","怎么样","一样","可能","举报","这部","大家","不能","当时"
,"一直","一次","然后","还有","这样","评论","如果","那么","为什么","第一次","感谢","只是","这些","之后","忘记","一下","虽然","为了"
,"一定","今天","这么","不会","这里","去年","两个","以后","地方","那些","这种","怎么","其实","起来","应该","---","发生","只有","天气"
,"今年","很多","好像","所有","一部","出来","找到","之子","一遍","谢谢","告诉","东西","永远","的话","五块","一句","之前","过去","一年"
,"一天","终于","选择","对于","非常","突然"]
w=wordcloud.WordCloud(background_color="white",font_path="msyh.ttc",height=600,width=800,stopwords=stopwords)
w.generate(" ".join(jieba.lcut(txt)))
w.to_file("word_cloud.png")
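The same ignore list could also be reused as the word-cloud stopwords instead of hard-coding them; a minimal sketch (assuming the ignore_dict.txt listed in the next section):
import jieba
import wordcloud

with open("comment.txt", "r", encoding="utf8") as f:
    txt = f.read()
with open("ignore_dict.txt", "r", encoding="utf-8") as f:
    stopwords = set(f.read().split())  # space- or newline-separated words

w = wordcloud.WordCloud(background_color="white", font_path="msyh.ttc",
                        height=600, width=800, stopwords=stopwords)
w.generate(" ".join(jieba.lcut(txt)))
w.to_file("word_cloud.png")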
Ignored words (ignore_dict.txt)
Written based on the frequency results:
一个 没有 没有 时候 弹幕 自己 什么 时间 知道 就是 现在 真的 已经 还是 看到 可以 因为 你们 才能 不是 但是 每秒 所以 他们 觉得 怎么样 一样 举报 这部 大家 不能 当时 一直 一次 然后 还有 这样 评论 如果 那么 为什么 第一次 感谢 只是 这些 之后 忘记 一下 虽然 为了 一定 今天 这么 不会 这里 去年 两个 以后 地方 那些 这种 怎么 其实 起来 应该 --- 发生 只有 天气 今年 很多 好像 所有 一部 出来 找到 之子 一遍 谢谢 告诉 东西 永远 的话 五块 一句 之前 过去 一年 一天 终于 选择 对于 非常 突然
- Conclusion and Insights
I originally planned to crawl TapTap comment data, following an analysis in a blog post, but found that its comment endpoint requires slider verification. Bilibili's comment endpoint, by contrast, can be accessed without any verification.
- References
Correct Method to Remove Stop Words from a WordCloud Word Cloud Map, Luo Luo Pan's Blog, CSDN
Using Jieba for Word Segmentation and Removing Stop Words in Python 2.7, Yan456jie's Blog, CSDN