Example code for scraping Baidu Cloud links with Python urllib

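This is a complete walkthrough of scraping Baidu Cloud share links in Python, using requests to fetch the pages and urllib.parse to URL-encode the results: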

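Prerequisites

Before starting, you should have Python installed along with the common scraping libraries requests and BeautifulSoup, and be familiar with URL encoding.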
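
As a quick refresher on URL encoding, the following minimal sketch shows how `urllib.parse.quote` percent-encodes every character that is neither unreserved nor listed in `safe`, and how `unquote` reverses it. The sample URL is made up for illustration.

```python
from urllib.parse import quote, unquote

# Hypothetical URL containing a space and non-ASCII characters.
raw = "https://example.com/share/my file?name=测试"

# Characters listed in `safe` are left untouched; other characters that are
# not letters, digits, or one of "_.-~" are percent-encoded as UTF-8 bytes.
encoded = quote(raw, safe="/:&?=")
print(encoded)

# unquote() reverses the encoding.
decoded = unquote(encoded)
print(decoded == raw)
```

The `safe` argument matters here: without `:`, `/`, `?`, `&`, and `=` in it, the URL's own delimiters would be encoded as well.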

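Approach

  1. Use the requests library to request the Baidu Cloud share page and get the page's HTML;
  2. Use the BeautifulSoup library to parse the HTML and extract the Baidu Cloud share links;
  3. URL-encode the links. Because Baidu Cloud share links may expire, save the extracted links for later use.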
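
Step 3 calls for saving the extracted links, while the scripts in this article only print them. Below is a minimal sketch of one way to persist them, assuming a plain text file with one link per line is sufficient. The function name `save_links`, the file name, and the sample links are hypothetical.

```python
from pathlib import Path


def save_links(links, path):
    """Append links to a text file, one per line, skipping duplicates.

    Returns the number of links that were newly written.
    """
    target = Path(path)
    existing = set()
    if target.exists():
        existing = set(target.read_text(encoding="utf-8").splitlines())
    # dict.fromkeys() removes duplicates within `links` while keeping order.
    new_links = [link for link in dict.fromkeys(links) if link not in existing]
    with target.open("a", encoding="utf-8") as fh:
        for link in new_links:
            fh.write(link + "\n")
    return len(new_links)


if __name__ == "__main__":
    # Hypothetical links, for illustration only.
    count = save_links(
        ["https://pan.baidu.com/s/1AbCdEf", "https://pan.baidu.com/s/1AbCdEf"],
        "baidu_links.txt",
    )
    print(f"saved {count} new link(s)")
```

Appending instead of overwriting keeps links collected in earlier runs, which suits the stated goal of holding on to links that may later expire.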

Code implementation

Here is a sample script, using a share page on 机器学习大街 as the example:

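import urllib.parse

import requests
from bs4 import BeautifulSoup

# Share page to scrape
page_url = 'http://www.jiqizhixin.com/share/detail/38d5fde8-2f5a-464c-9e0d-03f57088deaa'
response = requests.get(page_url, timeout=10)
response.raise_for_status()  # stop early on an HTTP error
soup = BeautifulSoup(response.text, 'html.parser')
# Select the <a> tags with class "downbtn"
links = soup.find_all('a', class_='downbtn')

for link in links:
    href = link.get('href')
    if not href:  # skip anchors that have no href attribute
        continue
    # Rewrite the share link and append the download parameters
    url = href.replace('pan.baidu.com/s/', 'www.baidupcs.com/rest/2.0/pcs/file')
    url = url.replace('?', '&')
    url = url.replace('=', '/')
    url = url + "&method=download&access_token=null&app_id=250528"
    # Percent-encode everything except the characters listed in safe
    print(urllib.parse.quote(url, safe="'/:&?=.,;~"))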
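
To see what the loop above produces without making a network request, the rewriting steps can be pulled into a pure function and applied to a sample link. This is only a sketch: the function name `rewrite_share_link` and the sample `href` are hypothetical, and the substitutions are copied from the script as-is, not taken from any documented Baidu API.

```python
import urllib.parse


def rewrite_share_link(href):
    """Apply the same string substitutions as the loop body in the script."""
    url = href.replace('pan.baidu.com/s/', 'www.baidupcs.com/rest/2.0/pcs/file')
    url = url.replace('?', '&')
    url = url.replace('=', '/')
    url = url + "&method=download&access_token=null&app_id=250528"
    return urllib.parse.quote(url, safe="'/:&?=.,;~")


# Hypothetical share link with a query string.
sample = 'https://pan.baidu.com/s/1AbCdEf?pwd=1234'
print(rewrite_share_link(sample))
# https://www.baidupcs.com/rest/2.0/pcs/file1AbCdEf&pwd/1234&method=download&access_token=null&app_id=250528
```

The output shows that the share ID is concatenated directly after `file` and that `=` in the original query string becomes `/`, so it is worth checking the result against the URL format you actually need.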

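Examples

  1. Example 1: scraping the share link of an image on Baidu Cloud

import urllib.parse

import requests
from bs4 import BeautifulSoup

page_url = 'https://pan.baidu.com/s/1IrkY2Jw2gGj6s-CL5SDQHw'
response = requests.get(page_url, timeout=10)
response.raise_for_status()  # stop early on an HTTP error
soup = BeautifulSoup(response.text, 'html.parser')
# Select the <a> tags with class "new-dbtn"
links = soup.find_all('a', class_='new-dbtn')

for link in links:
    href = link.get('href')
    if not href:  # skip anchors that have no href attribute
        continue
    # Rewrite the share link and append the download parameters
    url = href.replace('pan.baidu.com/s/', 'www.baidupcs.com/rest/2.0/pcs/file')
    url = url.replace('?', '&')
    url = url.replace('=', '/')
    url = url + "&method=download&access_token=null&app_id=250528"
    # Percent-encode everything except the characters listed in safe
    print(urllib.parse.quote(url, safe="'/:&?=.,;~"))

  2. Example 2: scraping the share link of a movie on Baidu Cloud

import urllib.parse

import requests
from bs4 import BeautifulSoup

page_url = 'https://pan.baidu.com/s/1TTmmdmfR8dIFrw_Js98eyQ'
response = requests.get(page_url, timeout=10)
response.raise_for_status()  # stop early on an HTTP error
soup = BeautifulSoup(response.text, 'html.parser')
# Select the <a> tags with class "down-btn"
links = soup.find_all('a', class_='down-btn')

for link in links:
    href = link.get('href')
    if not href:  # skip anchors that have no href attribute
        continue
    # Rewrite the share link and append the download parameters
    url = href.replace('pan.baidu.com/s/', 'www.baidupcs.com/rest/2.0/pcs/file')
    url = url.replace('?', '&')
    url = url.replace('=', '/')
    url = url + "&method=download&access_token=null&app_id=250528"
    # Percent-encode everything except the characters listed in safe
    print(urllib.parse.quote(url, safe="'/:&?=.,;~"))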

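As both examples show, we use the requests library to fetch the Baidu Cloud share page, parse the returned HTML with BeautifulSoup to extract the share links, and then URL-encode those links, which completes the scraping of Baidu Cloud share links.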
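
The scripts in this article differ only in the page URL and the CSS class passed to `find_all`, so the link-extraction step can be written once and parameterized by class name. Below is a minimal sketch of that idea. It uses the standard library's `html.parser` instead of BeautifulSoup so that it runs without third-party packages; `LinkCollector`, `extract_links`, and the sample HTML are hypothetical.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect href values of <a> tags that carry a given CSS class."""

    def __init__(self, css_class):
        super().__init__()
        self.css_class = css_class
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag != 'a':
            return
        attr_map = dict(attrs)
        # The class attribute may hold several space-separated class names.
        classes = (attr_map.get('class') or '').split()
        href = attr_map.get('href')
        if self.css_class in classes and href:
            self.hrefs.append(href)


def extract_links(html, css_class):
    """Return the hrefs of all <a class="css_class"> tags found in html."""
    collector = LinkCollector(css_class)
    collector.feed(html)
    collector.close()
    return collector.hrefs


# Hypothetical page fragment, for illustration only.
sample_html = '''
<a class="downbtn big" href="https://pan.baidu.com/s/1AbCdEf">Download</a>
<a class="other" href="https://example.com/">Other</a>
<a class="downbtn">No href</a>
'''
print(extract_links(sample_html, 'downbtn'))
# ['https://pan.baidu.com/s/1AbCdEf']
```

Calling `extract_links(html, 'new-dbtn')` or `extract_links(html, 'down-btn')` on the HTML of the corresponding page would cover the two examples with the same function.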

Article link: http://task.lmcjl.com/news/7087.html
