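Writing a Python crawler to scrape the Douban movie Top 100 and user avatars involves the steps below, shown as two examples. The first example scrapes movie titles and ratings; the second scrapes user avatars and saves them locally.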
Example 1, step 1: use the requests library to send an HTTP request and fetch the page's HTML:
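import requests
# The ranked list is served at /top250; this request fetches its first page.
url = 'https://movie.douban.com/top250'
# Send a browser-like User-Agent; the default one sent by requests may be rejected.
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.3'
}
response = requests.get(url, headers=headers, timeout=10)
response.raise_for_status()  # stop early on an HTTP error status
html = response.text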
Example 1, step 2: use the BeautifulSoup library to parse the HTML and extract each movie's title and rating:
from bs4 import BeautifulSoup
soup = BeautifulSoup(html, 'html.parser')
# Each movie entry sits in a <div class="info"> block.
movies = soup.find_all('div', {'class': 'info'})
for movie in movies:
    # find() returns the first match, i.e. the first title span of the entry.
    title_tag = movie.find('span', {'class': 'title'})
    rating_tag = movie.find('span', {'class': 'rating_num'})
    title = title_tag.get_text()
    rating = rating_tag.get_text()
    print(title, rating)
Output:
肖申克的救赎 9.7
霸王别姬 9.6
...
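The request in step 1 returns a single list page, so covering the whole Top 100 takes several requests. Below is a minimal sketch of building the page URLs. It assumes (the original does not state this) that the list shows 25 films per page and accepts a start query parameter as the offset. The names BASE_URL, PAGE_SIZE and top_list_urls are hypothetical.

```python
from urllib.parse import urlencode

BASE_URL = 'https://movie.douban.com/top250'
PAGE_SIZE = 25  # assumption: number of films shown on one list page


def top_list_urls(total, page_size=PAGE_SIZE):
    """Return the list-page URLs needed to cover the first `total` films."""
    urls = []
    # Assumption: the `start` query parameter is the zero-based offset of a page.
    for start in range(0, total, page_size):
        urls.append(BASE_URL + '?' + urlencode({'start': start}))
    return urls


for page_url in top_list_urls(100):
    print(page_url)
```

Each URL can then be fetched and parsed exactly as in steps 1 and 2, ideally with a short pause between requests so the site is not hit in a burst.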
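Example 2, step 1: get user IDs and avatar URLs
Assume we have fetched the HTML of a movie detail page. The page contains a comment section, and each comment carries the commenter's information and avatar.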
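import requests
from bs4 import BeautifulSoup
url = 'https://movie.douban.com/subject/1292052/'
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.3'
}
response = requests.get(url, headers=headers, timeout=10)
response.raise_for_status()  # stop early on an HTTP error status
html = response.text
soup = BeautifulSoup(html, 'html.parser')
# Assumes each comment is a <div class="comment"> that contains a <div class="avatar">.
comments = soup.find_all('div', {'class': 'comment'})
for comment in comments:
    avatar_tag = comment.find('div', {'class': 'avatar'})
    if avatar_tag is None:
        continue  # skip comments that have no avatar block
    avatar_url = avatar_tag.find('img')['src']
    # The profile link ends with '/<user_id>/', so the ID is the second-to-last piece.
    user_id = avatar_tag.find('a')['href'].split('/')[-2]
    print(user_id, avatar_url)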
Output:
bruce-lcs https://img1.doubanio.com/icon/u1368583-4.jpg
mengxiaoshuang https://img1.doubanio.com/icon/u3443720-154.jpg
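Taking split('/')[-2] only works when the profile link ends with a trailing slash. Here is a slightly more tolerant sketch that parses the URL and takes its last non-empty path segment. The profile URL format used here is an assumption inferred from the output above, and user_id_from_profile_url is a hypothetical helper name.

```python
from urllib.parse import urlparse


def user_id_from_profile_url(profile_url):
    """Return the last non-empty path segment of a profile URL, or None."""
    # urlparse separates the path from any query string or fragment.
    path = urlparse(profile_url).path
    segments = [part for part in path.split('/') if part]
    return segments[-1] if segments else None


# Assumed URL shape; works with or without the trailing slash.
print(user_id_from_profile_url('https://www.douban.com/people/bruce-lcs/'))
print(user_id_from_profile_url('https://www.douban.com/people/bruce-lcs'))
```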
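Example 2, step 2: download the avatar image to a local file
Given the avatar URL, use the requests library to send an HTTP request and get the image's binary data, then write that data to a local file with Python's built-in open() function.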
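import requests
url = 'https://img1.doubanio.com/icon/u1368583-4.jpg'
response = requests.get(url, timeout=10)
response.raise_for_status()  # do not save an error page as an image
# Open in binary mode ('wb') because response.content is raw bytes.
with open('avatar.jpg', 'wb') as f:
    f.write(response.content)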
The current directory now contains a file named avatar.jpg, which is the user's avatar image.
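The download above writes one fixed file name, so a second avatar would overwrite the first. One possible way to save several avatars is sketched below, naming each file after its user ID. The helpers avatar_filename and save_avatars and the fetch parameter are hypothetical and not part of the original. Here fetch stands for any callable that takes a URL and returns the image bytes.

```python
import os
import posixpath
from urllib.parse import urlparse


def avatar_filename(user_id, avatar_url):
    """Build a local file name such as 'bruce-lcs.jpg' from a user ID and avatar URL."""
    # Reuse the extension found in the URL path; fall back to '.jpg' if there is none.
    ext = posixpath.splitext(urlparse(avatar_url).path)[1] or '.jpg'
    return user_id + ext


def save_avatars(avatars, out_dir, fetch):
    """Save every (user_id, avatar_url) pair into out_dir and return the file paths.

    `fetch` is any callable that takes a URL and returns the image bytes.
    """
    os.makedirs(out_dir, exist_ok=True)
    saved = []
    for user_id, avatar_url in avatars:
        path = os.path.join(out_dir, avatar_filename(user_id, avatar_url))
        with open(path, 'wb') as f:
            f.write(fetch(avatar_url))
        saved.append(path)
    return saved
```

With requests, fetch could be a small wrapper such as lambda u: requests.get(u, timeout=10).content, and avatars could be the (user_id, avatar_url) pairs collected in step 1.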