1. Source Data Acquisition#
The data source is Bilibili video comments.
When a new page of comments is loaded, the browser requests a URL of the following form:
https://api.bilibili.com/x/v2/reply/main?callback=jQuery172020167562580015508_1653393655707&jsonp=jsonp&next=4&type=1&oid=768584323&mode=3&plat=1&_=1653396013955
where
next: page number
oid: video av number
Requesting this URL with a header such as {"user-agent": ..., "referer": "https://www.bilibili.com"} returns the comments as a JSON file.
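A minimal sketch of fetching one page through this endpoint (the reply structure data["data"]["replies"] is described in the next section; header values follow the example above):
import requests

headers = {
    "user-agent": "Mozilla/5.0",
    "referer": "https://www.bilibili.com",
}
url = "https://api.bilibili.com/x/v2/reply/main?jsonp=jsonp&next=1&type=1&oid=768584323"
resp = requests.get(url, headers=headers)
data = resp.json()
# Each page carries its comments under data["data"]["replies"]
print(len(data["data"]["replies"]), "comments on this page")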
2. Data Preprocessing#
Taking av=13662970 ("Your Name") as an example:
url = "https://api.bilibili.com/x/v2/reply/main?jsonp=jsonp&next=%d&type=1&oid=13662970"
- Vary the next page value to crawl 10 pages of JSON (later extended to 100 pages, i.e. 100 JSON files).
- Read the JSON files and extract each comment from data["data"]["replies"][i]["content"]["message"].
- Each JSON file contains 20 comments, so loop i from 0 to 19 to collect all of them.
- Output all comments.
3. Data Storage and Management#
- Obtain the 100 pages (2,000 comments) and write them to comment.txt, as sketched below.
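A sketch of this step, writing one comment per line (a simplification; the dejson.py script listed later uses the csv module instead):
import json

comments = []
for page in range(100):  # 100 saved pages
    with open("./json/{}.json".format(page), "r", encoding="utf8") as f:
        data = json.load(f)
    for reply in data["data"]["replies"]:  # 20 comments per page
        comments.append(reply["content"]["message"])

with open("comment.txt", "w", encoding="utf8") as fp:
    fp.write("\n".join(comments))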
4. Model Selection#
- Use Jieba for word frequency analysis.
- The raw frequency results contain many meaningless words.
- Create ignore_dict.txt and exclude those words based on the results (a minimal sketch follows this list).
- After removal, generate the word cloud with the excluded words filtered out.
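A minimal sketch of the frequency count with the ignore list applied (assuming comment.txt and ignore_dict.txt as described; the full word_frequency.py script is listed later):
import jieba
from collections import Counter

with open("comment.txt", encoding="utf-8") as f:
    text = f.read()
with open("ignore_dict.txt", encoding="utf-8") as f:
    ignore = set(f.read().split())

# Segment, drop single characters and ignored words, then count
words = [w for w in jieba.lcut(text) if len(w) >= 2 and w not in ignore]
for word, count in Counter(words).most_common(20):
    print(word, count)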
5. Visual Analysis#
file_path="res.csv"
df=pd.read_csv(file_path,encoding="gbk")
df.columns = ["word","frequency"]
print(df.head())
plt.bar(df["word"],df["frequency"])
plt.xlabel('Word')
plt.ylabel('Frequency')
plt.show()
- Read CSV and plot.
Relationship Between User Level and Likes#
Build a like list and a user-level list from the replies in each JSON file:
like.append(data["data"]["replies"][i]["like"])
level.append(data["data"]["replies"][i]["member"]["level_info"]["current_level"])
# Plotting
plt.bar(level,like)
plt.xlabel('User Level')
plt.ylabel('Total Likes')
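Note that plt.bar draws one bar per (x, y) pair, so repeated level values overlap rather than add up. A sketch that sums likes per level before plotting (not the original script; variable names follow like_level.py below):
from collections import defaultdict
import matplotlib.pyplot as plt

totals = defaultdict(int)
for lv, lk in zip(level, like):  # the like/level lists built above
    totals[lv] += lk

levels_sorted = sorted(totals)
plt.bar(levels_sorted, [totals[lv] for lv in levels_sorted])
plt.xlabel('User Level')
plt.ylabel('Total Likes')
plt.show()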
- Core Code and Tool List
Libraries used: requests, jieba, re, pandas, matplotlib, json, csv
Process:
- from_bilibili_get_json.py---Get the comment JSON.
- like_level.py---Get the like counts of users at different levels from the JSON and plot a bar chart.
- dejson.py---Collect all comments into comment.txt.
- word_frequency.py---Word frequency analysis.
- visualization.py---Visualization.
from_bilibili_get_json.py
import requests,json
headerbili={
"user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/100.0.4896.127 Safari/537.36 Edg/100.0.1185.50",
"referer":"https://www.bilibili.com"
}
'''
next: page number
oid: video av number
'''
def get_json(page):
url="https://api.bilibili.com/x/v2/reply/main?jsonp=jsonp&next=%d&type=1&oid=13662970"
data = requests.get(url % page, headers=headerbili).text.encode("utf8")
comm=json.loads(data)
print(comm)
    '''
    ensure_ascii controls whether the output may contain non-ASCII characters.
    Chinese characters are outside the ASCII range, so set ensure_ascii=False.
    '''
with open("./json/{}.json".format(page),"w" ,encoding='utf-8') as f:
f.write(json.dumps(comm, ensure_ascii=False))
for i in range(100):  # 100 pages
    get_json(i)
    print(i, '\n')
# vedioreg = 'class="tap-router">(.*?)</a>'
# aurllist = re.findall(vedioreg, rank, re.S | re.M)
# print(aurllist)
like_level.py
import json
import matplotlib.pyplot as plt
plt.rcParams['font.sans-serif'] = ['SimHei']  # display Chinese labels correctly
plt.rcParams['axes.unicode_minus'] = False  # display minus signs correctly
# (In Python 2, Chinese string literals would also need the u'' prefix)
like= []
level=[]
for e in range(100):
with open("./json/{}.json".format(e),"r",encoding="utf8") as f:
data=json.load(f)
for i in range(20):
like.append(data["data"]["replies"][i]["like"])
level.append(data["data"]["replies"][i]["member"]["level_info"]["current_level"])
print(like,level)
plt.bar(level,like)
plt.xlabel('User Level')
plt.ylabel('Total Likes')
plt.show()
dejson.py
import json,csv
comment=[] # Comment list
for e in range(100):
with open("./json/{}.json".format(e),"r",encoding="utf8") as f:
data=json.load(f)
for i in range(20):
comment.append(data["data"]["replies"][i]["content"]["message"])
with open("comment.txt","w",encoding="utf8") as fp:
writer = csv.writer(fp)
writer.writerow(comment)
print("Written", comment) # Comments in txt file
word_frequency.py
import jieba,re
# Remove punctuation
def get_text(file_name):
    with open(file_name, 'r', encoding='utf-8') as fr:
        text = fr.read()
    # Punctuation to delete
    del_ch = ['《', ',', '》', '\n', '。', '、', ';', '"',
              ':', ',', '!', '?', ' ']
    for ch in del_ch:
        text = text.replace(ch, '')
    return text
file_name = 'comment.txt'
text = get_text(file_name)
vlist = jieba.lcut(text) # Call Jieba for word segmentation, return list
res_dict = {}
# Perform word frequency statistics
for i in vlist:
res_dict[i] = res_dict.get(i,0) + 1
res_list = list(res_dict.items())
#print(res_list)
# Sort in descending order
res_list.sort(key = lambda x:x[1], reverse = True)
fin_res_list = []
# Remove single-character tokens
for item in res_list:
    if len(item[0]) >= 2:
        fin_res_list.append(item)
# Load the ignore list built from earlier results (space- or newline-separated)
with open('ignore_dict.txt', 'r', encoding='utf-8') as f:
    ignore_words = set(f.read().split())
# Keep the top 1000 words, skip ignored ones, and write word,frequency pairs
with open("res.csv", "w") as fa:  # system default encoding; visualization.py reads it back with encoding="gbk"
    for word, count in fin_res_list[:1000]:
        word = re.sub(r'[\n ]', '', word)  # strip stray whitespace
        if len(word) < 1 or word in ignore_words:
            continue
        fa.write(word + "," + str(count) + "\n")
visualization.py
import pandas as pd
import matplotlib.pyplot as plt
import wordcloud,jieba
plt.rcParams['font.sans-serif'] = ['SimHei']  # display Chinese labels correctly
plt.rcParams['axes.unicode_minus'] = False  # display minus signs correctly
# (In Python 2, Chinese string literals would also need the u'' prefix)
# File path
file_path="res.csv"
df = pd.read_csv(file_path, encoding="gbk", header=None)  # res.csv has no header row
# Bar chart
def _bar():
df.columns = ["word","frequency"]
print(df.head())
plt.bar(df["word"],df["frequency"])
plt.xlabel('Word')
plt.ylabel('Frequency')
plt.show()
_bar()
# Word cloud
with open("comment.txt","r",encoding="utf8")as f:
txt=f.read()
stopwords = ["的","了","我","时","在","你","看","到","没","不","就","是","人","也","有","和","会",
"一个","没有","时候","弹幕","自己","什么","时间","知道","就是","现在","真的","已经","还是","这个","看到","可以","因为"
,"你们","才能","不是","但是","那个","最后","每秒","所以","他们","觉得","怎么样","一样","可能","举报","这部","大家","不能","当时"
,"一直","一次","然后","还有","这样","评论","如果","那么","为什么","第一次","感谢","只是","这些","之后","忘记","一下","虽然","为了"
,"一定","今天","这么","不会","这里","去年","两个","以后","地方","那些","这种","怎么","其实","起来","应该","---","发生","只有","天气"
,"今年","很多","好像","所有","一部","出来","找到","之子","一遍","谢谢","告诉","东西","永远","的话","五块","一句","之前","过去","一年"
,"一天","终于","选择","对于","非常","突然"]
w=wordcloud.WordCloud(background_color="white",font_path="msyh.ttc",height=600,width=800,stopwords=stopwords)
w.generate(" ".join(jieba.lcut(txt)))
w.to_file("word_cloud.png")
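The same ignore list could also be reused as the word-cloud stopwords instead of hard-coding them; a minimal sketch (assuming the ignore_dict.txt listed in the next section):
import jieba
import wordcloud

with open("comment.txt", "r", encoding="utf8") as f:
    txt = f.read()
with open("ignore_dict.txt", "r", encoding="utf-8") as f:
    stopwords = set(f.read().split())  # space- or newline-separated words

w = wordcloud.WordCloud(background_color="white", font_path="msyh.ttc",
                        height=600, width=800, stopwords=stopwords)
w.generate(" ".join(jieba.lcut(txt)))
w.to_file("word_cloud.png")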
Ignored words (ignore_dict.txt)
Written based on the frequency results:
一个 没有 没有 时候 弹幕 自己 什么 时间 知道 就是 现在 真的 已经 还是 看到 可以 因为 你们 才能 不是 但是 每秒 所以 他们 觉得 怎么样 一样 举报 这部 大家 不能 当时 一直 一次 然后 还有 这样 评论 如果 那么 为什么 第一次 感谢 只是 这些 之后 忘记 一下 虽然 为了 一定 今天 这么 不会 这里 去年 两个 以后 地方 那些 这种 怎么 其实 起来 应该 --- 发生 只有 天气 今年 很多 好像 所有 一部 出来 找到 之子 一遍 谢谢 告诉 东西 永远 的话 五块 一句 之前 过去 一年 一天 终于 选择 对于 非常 突然
- Conclusion and Insights
I originally planned to crawl TapTap comment data, following an analysis in a blog post, but found that its comment endpoint requires slider verification. Bilibili's comment endpoint, by contrast, can be accessed without any verification.
- References
Correct Method to Remove Stop Words from a WordCloud Word Cloud Map, Luo Luo Pan's Blog, CSDN
Using Jieba for Word Segmentation and Removing Stop Words in Python 2.7, Yan456jie's Blog, CSDN