Here is a complete walkthrough of scraping Baidu Cloud (pan.baidu.com) share links with Python urllib.

Before starting, you should have Python installed along with the common scraping libraries requests and BeautifulSoup, and be familiar with URL encoding.
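If URL encoding is unfamiliar, the minimal sketch below shows how urllib.parse.quote and urllib.parse.unquote behave; the sample string is made up purely for illustration:

import urllib.parse

# quote() percent-encodes characters that are unsafe in a URL;
# the `safe` parameter lists characters to leave untouched
encoded = urllib.parse.quote('a b/c?d=e', safe='/')
print(encoded)                         # a%20b/c%3Fd%3De

# unquote() reverses the encoding
print(urllib.parse.unquote(encoded))   # a b/c?d=e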
Here is a sample script, using a share page from 机器学习大街 (jiqizhixin.com) as an example:
import requests
from bs4 import BeautifulSoup
import urllib.parse

url = 'http://www.jiqizhixin.com/share/detail/38d5fde8-2f5a-464c-9e0d-03f57088deaa'
response = requests.get(url)
html = response.text
soup = BeautifulSoup(html, 'html.parser')

# Find every download button on the page
links = soup.find_all('a', class_='downbtn')
for link in links:
    # Rewrite the share URL to point at the PCS file API
    url = link.get('href').replace('pan.baidu.com/s/', 'www.baidupcs.com/rest/2.0/pcs/file')
    url = url.replace('?', '&')
    url = url.replace('=', '/')
    url = url + '&method=download&access_token=null&app_id=250528'
    # Percent-encode the result, leaving URL structural characters intact
    print(urllib.parse.quote(url, safe="'/:&?=.,;~"))
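To make the string rewriting concrete, here is an illustrative trace of the replace chain on a hypothetical href. The share id and query string are invented; this only traces the string operations, not whether the resulting URL is accepted by the server:

# Hypothetical href pulled from a download button (invented for illustration)
href = 'https://pan.baidu.com/s/1AbCdEf?pwd=x9k2'

# Step 1: swap the share path for the PCS file-API path
step1 = href.replace('pan.baidu.com/s/', 'www.baidupcs.com/rest/2.0/pcs/file')
# step1 == 'https://www.baidupcs.com/rest/2.0/pcs/file1AbCdEf?pwd=x9k2'

# Step 2: the original '?' becomes '&'
step2 = step1.replace('?', '&')
# step2 == 'https://www.baidupcs.com/rest/2.0/pcs/file1AbCdEf&pwd=x9k2'

# Step 3: every '=' becomes '/'
step3 = step2.replace('=', '/')
# step3 == 'https://www.baidupcs.com/rest/2.0/pcs/file1AbCdEf&pwd/x9k2'

# Step 4: append the download parameters
final = step3 + '&method=download&access_token=null&app_id=250528'
print(final)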
The same approach works when fetching a pan.baidu.com share page directly; note that the download button's class here is 'new-dbtn':

import requests
from bs4 import BeautifulSoup
import urllib.parse

url = 'https://pan.baidu.com/s/1IrkY2Jw2gGj6s-CL5SDQHw'
response = requests.get(url)
html = response.text
soup = BeautifulSoup(html, 'html.parser')

# Same rewriting as above; only the button class differs
links = soup.find_all('a', class_='new-dbtn')
for link in links:
    url = link.get('href').replace('pan.baidu.com/s/', 'www.baidupcs.com/rest/2.0/pcs/file')
    url = url.replace('?', '&')
    url = url.replace('=', '/')
    url = url + '&method=download&access_token=null&app_id=250528'
    print(urllib.parse.quote(url, safe="'/:&?=.,;~"))
A third example targets another share page, where the button class is 'down-btn':

import requests
from bs4 import BeautifulSoup
import urllib.parse

url = 'https://pan.baidu.com/s/1TTmmdmfR8dIFrw_Js98eyQ'
response = requests.get(url)
html = response.text
soup = BeautifulSoup(html, 'html.parser')

links = soup.find_all('a', class_='down-btn')
for link in links:
    url = link.get('href').replace('pan.baidu.com/s/', 'www.baidupcs.com/rest/2.0/pcs/file')
    url = url.replace('?', '&')
    url = url.replace('=', '/')
    url = url + '&method=download&access_token=null&app_id=250528'
    print(urllib.parse.quote(url, safe="'/:&?=.,;~"))
As these three examples show, we use requests to fetch each share page, parse the returned HTML with BeautifulSoup to extract the download links, rewrite those links toward the PCS file API, and URL-encode the result, which accomplishes the goal of scraping Baidu Cloud share links.
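Since the three scripts differ only in the page URL and the download button's CSS class, the repeated logic can be folded into a single helper. The sketch below is my own refactoring of the code above, not part of the original; the function name extract_pan_links is hypothetical:

import requests
from bs4 import BeautifulSoup
import urllib.parse

def extract_pan_links(page_url, button_class):
    """Fetch page_url and print a rewritten, URL-encoded download link
    for every <a> element whose class matches button_class."""
    html = requests.get(page_url).text
    soup = BeautifulSoup(html, 'html.parser')
    for link in soup.find_all('a', class_=button_class):
        url = link.get('href').replace('pan.baidu.com/s/',
                                       'www.baidupcs.com/rest/2.0/pcs/file')
        url = url.replace('?', '&').replace('=', '/')
        url += '&method=download&access_token=null&app_id=250528'
        print(urllib.parse.quote(url, safe="'/:&?=.,;~"))

# The three examples above then reduce to three calls:
extract_pan_links('http://www.jiqizhixin.com/share/detail/38d5fde8-2f5a-464c-9e0d-03f57088deaa', 'downbtn')
extract_pan_links('https://pan.baidu.com/s/1IrkY2Jw2gGj6s-CL5SDQHw', 'new-dbtn')
extract_pan_links('https://pan.baidu.com/s/1TTmmdmfR8dIFrw_Js98eyQ', 'down-btn')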