How to Scrape Documents from the Web with Python

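Below is a complete walkthrough of scraping documents from the web with Python.

Installing a request library

A request library is the core tool for fetching web data in Python; common options include requests and urllib. This guide uses requests, so it needs to be installed first.

There are several ways to install requests. From the command line, you can use pip: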

pip install requests
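To confirm that the installation worked, a quick check like the following can help. It is only a sketch: the helper name installed_version is made up for illustration, and it uses the standard library's importlib.metadata (Python 3.8+) to look up package versions. It also checks beautifulsoup4, which the parsing section later in this guide relies on.

```python
from importlib import metadata


def installed_version(package):
    """Return the installed version string of a package, or None if it is missing."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


# Report the state of the two packages this guide uses.
for name in ("requests", "beautifulsoup4"):
    print(name, installed_version(name) or "not installed")
```

If either line reports "not installed", run pip install for that package before continuing.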

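Sending a request and reading the response

The requests.get() method sends a GET request and returns the response, whose HTML is available as text. For example, to fetch the HTML of https://www.python.org/, you can use the following code: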

import requests

url = "https://www.python.org/"
response = requests.get(url)

print(response.text)

The code above sends a GET request and prints the HTML of the response to the console.
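The snippet above sets no timeout and does not check whether the request succeeded. A slightly more defensive version might look like the sketch below. The function name fetch_html and the 10-second timeout are assumptions chosen for illustration, not part of the original example.

```python
import requests


def fetch_html(url, timeout=10.0):
    """Return the page body as text, or None if the request fails."""
    try:
        response = requests.get(url, timeout=timeout)
        # Turn 4xx/5xx status codes into exceptions instead of parsing an error page.
        response.raise_for_status()
    except requests.RequestException as exc:
        print(f"Request failed: {exc}")
        return None
    return response.text


# A URL without a scheme is rejected before any network traffic is sent,
# so this call exercises the error path and returns None.
print(fetch_html("not-a-valid-url"))
```

Calling fetch_html("https://www.python.org/") would return the same HTML as the original snippet when the request succeeds.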

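Parsing the HTML

Once you have the HTML, a parsing library turns it into structured data that is easier to process and analyze. Common choices are BeautifulSoup and lxml; this guide uses BeautifulSoup, which is installed with pip install beautifulsoup4 and imported as bs4.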

from bs4 import BeautifulSoup

soup = BeautifulSoup(response.text, "html.parser")

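# Get the page title (soup.title is None when the page has no <title> tag)
title = soup.title.string if soup.title else None

# Collect the href attribute of every link on the page
links = [link.get("href") for link in soup.find_all("a")]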

print(title)
print(links)

The code above parses the HTML into a BeautifulSoup object; the page title and all links can then be read from the soup object.
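The href values collected this way are often relative paths, and some a tags have no href at all. One way to handle both cases is sketched below with urllib.parse.urljoin from the standard library. The function name extract_links and the sample markup are made up for illustration.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup


def extract_links(html, base_url):
    """Return absolute URLs for every <a> tag that carries an href attribute."""
    soup = BeautifulSoup(html, "html.parser")
    links = []
    # href=True skips anchors that have no href attribute.
    for anchor in soup.find_all("a", href=True):
        links.append(urljoin(base_url, anchor["href"]))
    return links


# Illustrative markup, not taken from a real page.
sample_html = (
    "<html><head><title>Demo</title></head><body>"
    '<a href="/about/">About</a>'
    '<a name="top">no href</a>'
    '<a href="https://docs.python.org/3/">Docs</a>'
    "</body></html>"
)
print(extract_links(sample_html, "https://www.python.org/"))
```

With real pages, pass response.text as html and the URL you requested as base_url.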

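Example 1: Scraping the Douban Movie Top250

Now for a practical example: using Python to scrape the ranking data of the Douban Movie Top250.

The ranking page is at https://movie.douban.com/top250. The implementation relies mainly on two libraries, requests and BeautifulSoup.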

import requests
from bs4 import BeautifulSoup

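url = "https://movie.douban.com/top250"
# Douban tends to reject the default requests User-Agent, so send a browser-like one.
headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36"}
response = requests.get(url, headers=headers, timeout=10)
response.raise_for_status()
soup = BeautifulSoup(response.text, "html.parser")

# Each movie entry sits in a <div class="info"> block.
movie_list = soup.find_all("div", class_="info")
for index, movie in enumerate(movie_list, start=1):
    title = movie.find("span", class_="title").string
    rating = movie.find("span", class_="rating_num").string
    print("{}. {} - {}".format(index, title, rating))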

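The code above sends the request, parses the HTML with BeautifulSoup, extracts the title and rating of each movie, and prints the ranked results. A single request covers only the entries on the first page of the list.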
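The loop in this example fails with an AttributeError if any entry lacks a title or rating element. The sketch below shows one way to guard against that. It runs on made-up sample markup that only imitates the class names used above (info, title, rating_num), and the function name parse_movies is hypothetical.

```python
from bs4 import BeautifulSoup

# Made-up markup that imitates the structure the example relies on.
SAMPLE_HTML = """
<div class="info">
  <span class="title">Movie A</span>
  <span class="rating_num">9.1</span>
</div>
<div class="info">
  <span class="title">Movie B</span>
</div>
"""


def parse_movies(html):
    """Return (title, rating) pairs; a missing field becomes None."""
    soup = BeautifulSoup(html, "html.parser")
    movies = []
    for block in soup.find_all("div", class_="info"):
        title_tag = block.find("span", class_="title")
        rating_tag = block.find("span", class_="rating_num")
        title = title_tag.get_text(strip=True) if title_tag else None
        rating = rating_tag.get_text(strip=True) if rating_tag else None
        movies.append((title, rating))
    return movies


for rank, (title, rating) in enumerate(parse_movies(SAMPLE_HTML), start=1):
    print(f"{rank}. {title} - {rating}")
```

Keeping the parsing in a function that takes a string also makes it possible to test the extraction logic without hitting the site.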

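Example 2: Scraping a WeChat official account's article history

Python can also scrape the historical article list of a WeChat official account, for data analysis or further development. For example, to fetch the article history of a particular account, you can use the following code: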

import requests
from bs4 import BeautifulSoup

url = "https://mp.weixin.qq.com/mp/profile_ext?action=home&__biz=MjM5MDMzNjAzMQ==&scene=124&#wechat_redirect"
cookies = {
    # Cookie values copied from the browser after logging in to the WeChat official account
    "devicetype": "Windows 7",
    "version": "62060201",
    "lang": "zh_CN",
    "pass_ticket": "xxxxxxxx",
    "wap_sid2": "xxxxxxxx",
    "reward_uin": "xxxxxx",
    "pgv_pvid": "xxxxxxxxxx",
    "tvfe_boss_uuid": "xxxxxxxxxxxxxxxx",
    "ua_id": "xxxxxxxxxxxxx"
}

headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/58.0.3029.110 Safari/537.36 Edge/16.16299"
}

response = requests.get(url, cookies=cookies, headers=headers)
soup = BeautifulSoup(response.text, "html.parser")

articles = []
for article in soup.find_all("h4", class_="weui_media_title"):
    title = article.string
    # The attribute name "hrefs" is kept as given; confirm it against the page's actual markup.
    link = article.parent.get("hrefs")
    articles.append((title, link))

print(articles)

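The code above does not log in by itself. It reuses the Cookies of a browser session that is already logged in, together with a browser User-Agent, so that the request is treated as coming from that session. It then parses the returned HTML with BeautifulSoup and prints the collected article data.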
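When several requests need the same Cookies and headers, a requests.Session can hold them in one place. The sketch below only builds the request and inspects it without sending anything. The cookie values are placeholders, the User-Agent string is shortened, and the query parameters are copied from the URL in the example.

```python
import requests

# A session keeps headers and cookies across every request made through it.
session = requests.Session()
session.headers.update({"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"})
# Placeholder values; real ones come from a logged-in browser session.
session.cookies.update({"pass_ticket": "PLACEHOLDER", "wap_sid2": "PLACEHOLDER"})

request = requests.Request(
    "GET",
    "https://mp.weixin.qq.com/mp/profile_ext",
    params={"action": "home", "__biz": "MjM5MDMzNjAzMQ==", "scene": "124"},
)
# prepare_request merges the session's headers and cookies into the request.
prepared = session.prepare_request(request)

print(prepared.url)
print(prepared.headers["User-Agent"])
print(prepared.headers.get("Cookie"))
```

To actually send it, call session.send(prepared, timeout=10), or simply use session.get() with the same URL and params.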

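These two examples show the basic approach to scraping web data with Python. In practice, pay attention to the HTTP request headers, Cookies, and similar details a site expects, since a request that lacks them may be rejected.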

Article link: http://task.lmcjl.com/news/6623.html
