The benefits of keep-alive (HTTP persistent connections): a crawler issues a large number of requests per unit of time, which puts pressure on both its own machine and the remote server, all the more so if every request opens a new connection. If the server supports keep-alive, the crawler can let many requests share one connection and do more with less: fewer connections are opened and closed per unit of time, more effective requests get through, and the load placed on the target server drops noticeably.
HTTP clients implement the HTTP protocol to varying degrees, and some do not support keep-alive at all. In Python, requests supports it out of the box: a Session keeps connections open in its pool and reuses them.
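A quick way to see whether a client really reuses connections is to turn urllib3's logging up to DEBUG: every "Starting new HTTPS connection" line marks a brand-new TCP connection. The snippet below is only a minimal sketch (www.example.com stands in for a real target); plain requests.get() calls open a new connection each time, while a Session opens one and keeps reusing it.

import logging

import requests

# surface urllib3's DEBUG messages; every "Starting new HTTPS connection"
# line means a fresh TCP connection was opened
logging.basicConfig(level=logging.DEBUG)

# without a Session: each call opens its own connection
for _ in range(2):
    requests.get("https://www.example.com/")

# with a Session: the first call opens a connection, the second one reuses it
s = requests.Session()
for _ in range(2):
    s.get("https://www.example.com/")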
Below is the implementation with requests (taken from my own project code, trimmed and abstracted).
import sys
import time

import requests


def getSession():
    s = requests.Session()
    # mount the adapter for both schemes; the example URL below is https
    adapter = requests.adapters.HTTPAdapter(pool_connections=1, pool_maxsize=1,
                                            max_retries=0, pool_block=False)
    s.mount('http://', adapter)
    s.mount('https://', adapter)
    return s


def main():
    # tasks, sn and sid were abstracted away from the project code;
    # minimal placeholders so the example runs on its own
    tasks = ["task-1", "task-2"]
    sn = 0
    sid = "session-1"

    # start time of the current session
    st = time.time()
    # init the first session
    s = getSession()
    # init the keep-alive timeout value (seconds)
    kato = 5

    # loop over the work queue; failed tasks are pushed back onto the front
    while tasks:
        task = tasks.pop(0)
        # use time of the current session
        ut = time.time() - st
        # rebuild the session once the server-side keep-alive timeout is near
        if ut >= kato:
            s = getSession()
            # reset the start time of the current session
            st = time.time()

        url = "https://www.example.com/%s" % task
        # rotate these values to bypass anti-spider measures
        headers = {'user-agent': "a new ua", "Cookie": "a new cookie id"}

        # get the response
        try:
            r = s.get(url, headers=headers, allow_redirects=False)
            # read the server's keep-alive timeout, keeping a 3-second safety margin
            # (still needs a more robust parser, see the sketch below)
            kato = int(r.headers["Keep-Alive"].replace("timeout=", "").split(",")[0]) - 3
        except Exception as e:
            tasks.insert(0, task)
            print(str(e))
            continue

        # handle the response according to the status_code, etc.
        info = ""
        if r.status_code == 404:
            pass
        elif r.status_code == 301:
            pass
        elif r.status_code == 200:
            info = "your info"
        elif r.status_code == 403:
            # anti-spider triggered: requeue the task, back off, then start a fresh session
            tasks.insert(0, task)
            print("anti-spider triggered, will sleep for 5 minutes")
            time.sleep(300)
            s = getSession()
            st = time.time()
            continue
        else:
            print(sid, r.status_code)
            sys.exit()

        sn += 1
        print("%s, %s, %s" % (sn, sid, info))


if __name__ == "__main__":
    main()