OS | Browser | User-Agent string |
---|---|---|
Mac | Chrome | Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.75 Safari/537.36 |
Mac | Firefox | Mozilla/5.0 (Macintosh; Intel Mac OS X 10.12; rv:65.0) Gecko/20100101 Firefox/65.0 |
Mac | Safari | Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_6) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/12.0.3 Safari/605.1.15 |
Windows | Edge | Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/64.0.3282.140 Safari/537.36 Edge/18.17763 |
Windows | IE | Mozilla/5.0 (Windows NT 10.0; WOW64; Trident/7.0; rv:11.0) like Gecko |
Windows | Chrome | Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3626.121 Safari/537.36 |
iOS | Chrome | Mozilla/5.0 (iPhone; CPU iPhone OS 7_0_4 like Mac OS X) AppleWebKit/537.51.1 (KHTML, like Gecko) CriOS/31.0.1650.18 Mobile/11B554a Safari/8536.25 |
iOS | Safari | Mozilla/5.0 (iPhone; CPU iPhone OS 8_3 like Mac OS X) AppleWebKit/600.1.4 (KHTML, like Gecko) Version/8.0 Mobile/12F70 Safari/600.1.4 |
Android | Chrome | Mozilla/5.0 (Linux; Android 4.2.1; M040 Build/JOP40D) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/31.0.1650.59 Mobile Safari/537.36 |
Android | Webkit | Mozilla/5.0 (Linux; U; Android 4.4.4; zh-cn; M351 Build/KTU84P) AppleWebKit/534.30 (KHTML, like Gecko) Version/4.0 Mobile Safari/534.30 |

Browser name | Chrome |
---|---|
Browser version | 88.0.4324.182 |
Platform | Windows |
UA string | Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.182 Safari/537.36 |
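
# Import the module
import urllib.request

# Send a GET request to the test site
response = urllib.request.urlopen('http://httpbin.org/get')
html = response.read().decode()
print(html)

After the program runs, it prints the request header information shown below: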
{
  "args": {},
  # Request header information
  "headers": {
    "Accept-Encoding": "identity",
    "Host": "httpbin.org",
    "User-Agent": "Python-urllib/3.7", # the User-Agent is carried in the request headers
    "X-Amzn-Trace-Id": "Root=1-6034954b-1cb061183308ae920668ec4c"
  },
  "origin": "121.17.25.194",
  "url": "http://httpbin.org/get"
}
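
The default value echoed by httpbin can also be inspected locally, without sending a request. The following is a minimal sketch, assuming the standard CPython `urllib`: an opener created by `build_opener()` keeps its default headers in the `addheaders` list. The variable names are illustrative only.

```python
import sys
import urllib.request

# build_opener() returns an OpenerDirector; its addheaders list holds the
# headers urllib attaches when a request does not set them itself.
opener = urllib.request.build_opener()
default_headers = dict(opener.addheaders)

# urllib stores the header name as 'User-agent'.
default_ua = default_headers["User-agent"]
print(default_ua)  # "Python-urllib/" followed by the running Python's major.minor version
```

The version suffix follows the interpreter in use, so on Python 3.7 this should match the `Python-urllib/3.7` value in the output above.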
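The response from httpbin shows that the User-Agent is Python-urllib/3.7, which plainly marks the request as coming from a crawler rather than a browser. The User-Agent therefore has to be rebuilt so that the request appears to come from a browser. The following code uses urllib.request.Request() to rebuild the User-Agent information: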
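
from urllib import request

# Define the variables: URL and headers
url = 'http://httpbin.org/get'  # send the request to the test site
# Rebuild the request headers to pose as Firefox on Mac;
# any browser UA string from the table above can be used instead
headers = {
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.12; rv:65.0) Gecko/20100101 Firefox/65.0'
}
# 1. Create the request object, wrapping the UA information
req = request.Request(url=url, headers=headers)
# 2. Send the request and get the response object
res = request.urlopen(req)
# 3. Extract the response content
html = res.read().decode('utf-8')
print(html)

The program produces the following output: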
{
  "args": {},
  "headers": {
    "Accept-Encoding": "identity",
    "Host": "httpbin.org",
    # Now posing as Firefox on Mac
    "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.12; rv:65.0) Gecko/20100101 Firefox/65.0",
    "X-Amzn-Trace-Id": "Root=1-6034a52f-372ca79027da685c3712e5f6"
  },
  "origin": "121.17.25.194",
  "url": "http://httpbin.org/get"
}
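
Rather than reading the output by eye, a script can confirm which User-Agent the server saw. The sketch below assumes the response body has the same JSON shape as the output above; `echoed_user_agent` is a hypothetical helper, and `sample_body` is a trimmed stand-in for a real response so the example runs without network access.

```python
import json


def echoed_user_agent(body):
    """Return the User-Agent that httpbin reports back in its /get response body."""
    return json.loads(body)["headers"]["User-Agent"]


# Stand-in for res.read().decode('utf-8'); the annotation line is dropped
# because '#' comments are not valid JSON.
sample_body = """
{
  "args": {},
  "headers": {
    "Host": "httpbin.org",
    "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.12; rv:65.0) Gecko/20100101 Firefox/65.0"
  },
  "url": "http://httpbin.org/get"
}
"""

ua = echoed_user_agent(sample_body)
print(ua)
# True would mean the default urllib identity leaked through.
print("Python-urllib" in ua)
```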
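Rebuilding the User-Agent string this way addresses the case where a site blocks crawlers by inspecting the User-Agent. It is, of course, only the first step in dealing with anti-crawling measures. The UA can also be rebuilt with other modules, such as requests, which is covered in later sections.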
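
Since any browser UA from the table can be used, one possible variation is to pick one at random for each request. This is a sketch under stated assumptions, not a method the article prescribes: `USER_AGENTS` and `build_request` are hypothetical names, and the list holds three strings copied from the table.

```python
import random
from urllib import request

# A few UA strings copied from the table at the top of this article.
USER_AGENTS = [
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.12; rv:65.0) Gecko/20100101 Firefox/65.0",
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3626.121 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_6) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/12.0.3 Safari/605.1.15",
]


def build_request(url, user_agents=USER_AGENTS):
    """Return a Request whose User-Agent is chosen at random from user_agents.

    Hypothetical helper for illustration; it is not part of urllib.
    """
    headers = {"User-Agent": random.choice(user_agents)}
    return request.Request(url=url, headers=headers)


req = build_request("http://httpbin.org/get")
# urllib normalizes the stored header name to 'User-agent'.
print(req.get_header("User-agent"))
```

The request object is only built here, not sent; passing `req` to `request.urlopen()` sends it in the same way as the earlier example.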