Crawler: Scrapy proxy with HttpProxyMiddleware

 

  1. Using HttpProxyMiddleware
  2. settings.py
      USER_AGENT_LIST = [
          'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/535.7 (KHTML, like Gecko) Chrome/16.0.912.36 Safari/535.7',
          'Mozilla/5.0 (Windows NT 6.2; Win64; x64; rv:16.0) Gecko/16.0 Firefox/16.0',
          'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_3) AppleWebKit/534.55.3 (KHTML, like Gecko) Version/5.1.3 Safari/534.53.10',
      ]
      # Address of the local proxy that requests are routed through
      HTTP_PROXY = 'http://127.0.0.1:8888'
      DOWNLOADER_MIDDLEWARES = {
          'tutorial.middlewares.RandomUserAgentMiddleware': 400,
          'tutorial.middlewares.ProxyMiddleware': 410,
          # Disable the built-in User-Agent middleware so the random one takes effect.
          # Old releases used 'scrapy.contrib.downloadermiddleware.useragent.UserAgentMiddleware'.
          'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': None,
      }
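
To make the priorities above easier to read, here is a small standalone sketch (plain Python, no Scrapy import) that lists the enabled downloader middlewares in the order their process_request hooks run: lower numbers run first, and a value of None disables the entry. The helper name effective_order is made up for this illustration, and the 750 entry for the built-in HttpProxyMiddleware is an assumption based on its default position in recent Scrapy releases.

```python
def effective_order(middlewares):
    """Return the enabled middleware paths sorted by priority, lowest first."""
    # Entries mapped to None are treated as disabled and dropped.
    enabled = {path: prio for path, prio in middlewares.items() if prio is not None}
    return sorted(enabled, key=enabled.get)


DOWNLOADER_MIDDLEWARES = {
    'tutorial.middlewares.RandomUserAgentMiddleware': 400,
    'tutorial.middlewares.ProxyMiddleware': 410,
    # None switches the built-in User-Agent middleware off.
    'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': None,
    # Assumed default slot of the built-in proxy middleware (illustration only).
    'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware': 750,
}

for path in effective_order(DOWNLOADER_MIDDLEWARES):
    print(path)
```

Under these assumptions ProxyMiddleware (410) runs before the built-in HttpProxyMiddleware, so request.meta['proxy'] is already set when the built-in middleware sees the request.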

 

3. middlewares.py

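import random


class RandomUserAgentMiddleware(object):
    def process_request(self, request, spider):
        # scrapy.conf was removed in later Scrapy releases; read settings from the spider
        user_agents = spider.settings.getlist('USER_AGENT_LIST')
        if user_agents:
            # setdefault keeps a User-Agent that the request already carries
            request.headers.setdefault('User-Agent', random.choice(user_agents))


class ProxyMiddleware(object):
    def process_request(self, request, spider):
        # request.meta['proxy'] is the key Scrapy reads to pick the proxy for a request
        proxy = spider.settings.get('HTTP_PROXY')
        if proxy:
            request.meta['proxy'] = proxy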
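
The effect of the two middlewares on a single request can be tried out without installing Scrapy. The sketch below is only an approximation: FakeRequest is a hypothetical stand-in with plain dicts for headers and meta (a real scrapy.Request stores headers as case-insensitive bytes), and apply_user_agent and apply_proxy are illustrative names that mirror the two process_request methods.

```python
import random


class FakeRequest:
    """Hypothetical stand-in for scrapy.Request: only headers and meta dicts."""

    def __init__(self):
        self.headers = {}
        self.meta = {}


def apply_user_agent(request, user_agents):
    """Pick a random User-Agent, leaving an already-set header untouched."""
    if user_agents:
        request.headers.setdefault('User-Agent', random.choice(user_agents))


def apply_proxy(request, proxy_url):
    """Route the request through proxy_url if one is configured."""
    if proxy_url:
        request.meta['proxy'] = proxy_url


USER_AGENT_LIST = [
    'Mozilla/5.0 (Windows NT 6.2; Win64; x64; rv:16.0) Gecko/16.0 Firefox/16.0',
    'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/535.7 (KHTML, like Gecko) Chrome/16.0.912.36 Safari/535.7',
]

request = FakeRequest()
apply_user_agent(request, USER_AGENT_LIST)
apply_proxy(request, 'http://127.0.0.1:8888')
print(request.headers['User-Agent'])
print(request.meta['proxy'])
```

Checking for an empty list before calling random.choice matters, because random.choice raises IndexError on an empty sequence.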

4. Set the proxy to HTTP_PROXY = 'http://127.0.0.1:8888', then use Charles to capture the crawler's traffic and inspect the requests it sends.
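
Before starting the crawl it can help to confirm that something is actually listening on the proxy port, since otherwise every request would fail to connect. The following is a minimal sketch using only the standard library; proxy_is_listening is a hypothetical helper name, and the host and port are taken from the HTTP_PROXY value above.

```python
import socket


def proxy_is_listening(host, port, timeout=1.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and unreachable hosts.
        return False


if proxy_is_listening('127.0.0.1', 8888):
    print('Proxy port is open; requests should show up in Charles.')
else:
    print('Nothing is listening on 127.0.0.1:8888; start Charles first.')
```

This only checks that the port accepts TCP connections, not that the listener is really Charles or that it forwards traffic correctly.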




 

References:

[1] http://www.php101.cn/2015/03/27/Scrapy%E4%B9%8B%E6%97%85(1)%E4%BD%BF%E7%94%A8http_proxy/

[2] http://pkmishra.github.io/blog/2013/03/18/how-to-run-scrapy-with-TOR-and-multiple-browser-agents-part-1-mac/ [detailed, simple]

[3] http://www.kuaidaili.com/free/inha/ [free proxy pool]

[4] https://husless.github.io/2015/07/01/using-scrapy-with-proxies/

[5] http://www.cnblogs.com/rwxwsblog/p/4575894.html
