Five Ways to Set an IP Proxy
When we write a crawler to fetch the data we want, the fetching is done automatically by a machine, at high intensity and high speed, which usually puts heavy pressure on the target site's servers. If the same IP repeatedly crawls the same pages, it is very likely to get banned. This article introduces proxy techniques for avoiding such bans; even so, when building a crawler you should still add appropriate delay code to reduce the impact on the target site, as sketched below.
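A minimal sketch of such a delay between requests (the URL list and the 1-3 second range are just illustrative placeholders):

import random
import time

import requests

# Placeholder list of pages to fetch
urls = ["http://targetwebsite.com/page/{}".format(i) for i in range(1, 6)]
for url in urls:
    requests.get(url)
    # Pause a random 1-3 seconds between requests to ease the load on the server
    time.sleep(random.uniform(1, 3))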
1. Setting a proxy with requests:
import requests

proxies = {
    "http": "http://192.10.1.10:8080",
    "https": "http://193.121.1.10:9080",
}
requests.get("http://targetwebsite.com", proxies=proxies)
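As a quick way to confirm the proxy is actually in use, one option is to request a service that echoes the caller's IP. This is a minimal sketch; httpbin.org/ip and the 10-second timeout are illustrative choices, not part of the original example:

import requests

proxies = {
    "http": "http://192.10.1.10:8080",
    "https": "http://193.121.1.10:9080",
}
try:
    # httpbin.org/ip returns the IP address the request arrived from
    resp = requests.get("http://httpbin.org/ip", proxies=proxies, timeout=10)
    print(resp.json())
except requests.RequestException as exc:
    print("proxy request failed:", exc)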
2. Setting a proxy with Selenium + Chrome:
from selenium import webdriver

PROXY = "192.206.133.227:8080"
chrome_options = webdriver.ChromeOptions()
chrome_options.add_argument('--proxy-server={0}'.format(PROXY))
browser = webdriver.Chrome(chrome_options=chrome_options)
browser.get('http://www.targetwebsize.com')
print(browser.page_source)
browser.close()
3. Setting a proxy with Selenium + PhantomJS:
# Use the DesiredCapabilities (proxy settings) parameter to start a new session with the proxy applied
from selenium import webdriver
from selenium.webdriver.common.proxy import ProxyType

proxy = webdriver.Proxy()
proxy.proxy_type = ProxyType.MANUAL
proxy.http_proxy = '192.25.171.51:8080'
# Add the proxy settings to webdriver.DesiredCapabilities.PHANTOMJS
proxy.add_to_capabilities(webdriver.DesiredCapabilities.PHANTOMJS)
browser = webdriver.PhantomJS()
browser.start_session(webdriver.DesiredCapabilities.PHANTOMJS)
browser.get('http://www.targetwebsize.com')
print(browser.page_source)
# To fall back to a direct (system) connection, just set proxy_type again
proxy.proxy_type = ProxyType.DIRECT
proxy.add_to_capabilities(webdriver.DesiredCapabilities.PHANTOMJS)
browser.start_session(webdriver.DesiredCapabilities.PHANTOMJS)
4. Setting a proxy in the Scrapy crawler framework:
Add the proxy IPs in settings.py:
PROXIES = ['http://173.207.95.27:8080',
           'http://111.8.100.99:8080',
           'http://126.75.99.113:8080',
           'http://68.146.165.226:3128']
Then add the following code to middlewares.py.
import random

import scrapy
from scrapy import signals


class ProxyMiddleware(object):
    '''Set the proxy'''

    def __init__(self, ip):
        self.ip = ip

    @classmethod
    def from_crawler(cls, crawler):
        return cls(ip=crawler.settings.get('PROXIES'))

    def process_request(self, request, spider):
        # Pick a random proxy from the PROXIES list for each outgoing request
        ip = random.choice(self.ip)
        request.meta['proxy'] = ip
Finally, add our custom class to the downloader middleware settings, as follows.
DOWNLOADER_MIDDLEWARES = {
    'myproject.middlewares.ProxyMiddleware': 543,
}
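As a side note beyond the custom middleware above, Scrapy's built-in HttpProxyMiddleware also honors a proxy set per request in request.meta. A minimal sketch, with the spider name, URL, and proxy address as placeholders:

import scrapy


class ExampleSpider(scrapy.Spider):
    name = 'example'

    def start_requests(self):
        # Attach the proxy to this single request instead of using a custom middleware
        yield scrapy.Request('http://targetwebsite.com',
                             callback=self.parse,
                             meta={'proxy': 'http://173.207.95.27:8080'})

    def parse(self, response):
        self.logger.info('fetched %s', response.url)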
5. Setting a proxy with Python's asynchronous aiohttp:
import aiohttp
import async_timeout

proxy = "http://192.121.1.10:9080"
async with aiohttp.ClientSession() as session:
    async with session.get("http://python.org", proxy=proxy) as resp:
        print(resp.status)
# HTTPS method 1: use a SOCKS connector from the aiohttp_socks package
# connector = SocksConnector.from_url('socks5://localhost:1080', rdns=True)
# async with aiohttp.ClientSession(connector=connector) as sess:

# HTTPS method 2: pass the proxy to each request via the proxy= argument
# (aiohttp does not honor a requests-style session.proxies attribute)
async with aiohttp.ClientSession() as session:
    headers = {
        'content-type': 'image/gif',
        'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.77 Safari/537.36'
    }
    cookies = {'cookies_are': 'working'}
    proxy = "http://127.0.0.1:1080"
    url = 'http://www.targetwebsize.com'  # target page to fetch
    with async_timeout.timeout(10):  # cap the request at 10 seconds
        # async with session.get(url, proxy="http://54.222.232.0:3128") as res:
        async with session.get(url, headers=headers, cookies=cookies,
                               proxy=proxy, verify_ssl=False) as res:
            text = await res.text()
            print(text)
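To actually run either aiohttp snippet, the async with code has to live inside a coroutine that is submitted to the event loop. A minimal sketch of that wrapper (the coroutine name main is an arbitrary choice):

import asyncio

import aiohttp

async def main():
    proxy = "http://192.121.1.10:9080"
    async with aiohttp.ClientSession() as session:
        async with session.get("http://python.org", proxy=proxy) as resp:
            print(resp.status)

loop = asyncio.get_event_loop()
loop.run_until_complete(main())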