

Python Crawler in Practice: Crawling Douban Images with Scrapy

Published: 2022-03-25 10:24 | Source: github

This walkthrough uses Scrapy to crawl all of the personal photos of a film star on Douban.

Monica Bellucci serves as the example.

1. First, open a terminal in the directory where the project should live and run scrapy startproject banciyuan to create the Scrapy project.

The generated project structure is as follows.
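
For reference, scrapy startproject produces this standard tree:

banciyuan/
├── scrapy.cfg            # deploy configuration
└── banciyuan/            # the project's Python module
    ├── __init__.py
    ├── items.py          # item definitions
    ├── middlewares.py    # spider/downloader middlewares
    ├── pipelines.py      # item pipelines
    ├── settings.py       # project settings
    └── spiders/          # spider code goes here
        └── __init__.py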

2. To make it easy to run the Scrapy project from PyCharm, create a main.py:

# main.py: launch the spider programmatically so the IDE can run it
from scrapy import cmdline

cmdline.execute("scrapy crawl banciyuan".split())
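
main.py just wraps the Scrapy CLI, so the same spider can equally be started from a terminal in the project root:

scrapy crawl banciyuan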

Then open Edit Configurations in PyCharm and point a run configuration at main.py; once that is set, running main.py launches the Scrapy project.

3. Analyze the HTML of the photo-list page and create the corresponding spider:

from scrapy import Spider
import scrapy

from banciyuan.items import BanciyuanItem


class BanciyuanSpider(Spider):
    name = 'banciyuan'
    allowed_domains = ['movie.douban.com']
    start_urls = ["https://movie.douban.com/celebrity/1025156/photos/"]
    url = "https://movie.douban.com/celebrity/1025156/photos/"

    def parse(self, response):
        # Read the page count from the last link in the paginator.
        num = response.xpath('//div[@class="paginator"]/a[last()]/text()').extract_first('')
        print(num)
        # Each list page holds 30 photos; request every page in turn.
        for i in range(int(num)):
            suffix = '?type=C&start=' + str(i * 30) + '&sortby=like&size=a&subtype=a'
            yield scrapy.Request(url=self.url + suffix, callback=self.get_page)

    def get_page(self, response):
        # Collect the link to each photo's detail page on this list page.
        href_list = response.xpath('//div[@class="article"]//div[@class="cover"]/a/@href').extract()
        for href in href_list:
            yield scrapy.Request(url=href, callback=self.get_info)

    def get_info(self, response):
        # On the detail page, grab the full-size image URL and the page title.
        src = response.xpath(
            '//div[@class="article"]//div[@class="photo-show"]//div[@class="photo-wp"]/a[1]/img/@src').extract_first('')
        title = response.xpath('//div[@id="content"]/h1/text()').extract_first('')
        item = BanciyuanItem()
        item['title'] = title
        item['src'] = [src]
        yield item
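
Before running the full spider, the XPath expressions can be sanity-checked interactively. Launched from inside the project directory (so the project's USER_AGENT applies), scrapy shell lets you try the same selectors used in parse and get_page:

scrapy shell "https://movie.douban.com/celebrity/1025156/photos/"
>>> response.xpath('//div[@class="paginator"]/a[last()]/text()').extract_first('')
>>> response.xpath('//div[@class="article"]//div[@class="cover"]/a/@href').extract()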

4. items.py

# Define here the models for your scraped items
#
# See documentation in:
# https://docs.scrapy.org/en/latest/topics/items.html
import scrapy


class BanciyuanItem(scrapy.Item):
    # define the fields for your item here like:
    src = scrapy.Field()
    title = scrapy.Field()

pipelines.py

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://docs.scrapy.org/en/latest/topics/item-pipeline.html
import scrapy
from scrapy.pipelines.images import ImagesPipeline


class BanciyuanPipeline(ImagesPipeline):
    def get_media_requests(self, item, info):
        # Pass the item along in meta so file_path can read its fields later.
        yield scrapy.Request(url=item['src'][0], meta={'item': item})

    def file_path(self, request, response=None, info=None, *, item=None):
        # Save each image as <first token of the page title>/<original file name>.
        item = request.meta['item']
        image_name = item['src'][0].split('/')[-1]
        # To convert .webp names, reassign: image_name = image_name.replace('.webp', '.jpg')
        path = '%s/%s' % (item['title'].split(' ')[0], image_name)
        return path
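
To trace what file_path produces, here is a minimal standalone check of the same string logic; the URL and page title below are illustrative values, not taken from a real crawl:

# Illustrative values, assuming the photo page title starts with the name:
src = ['https://img1.doubanio.com/view/photo/l/public/p123456789.jpg']
title = '莫妮卡·贝鲁奇 Monica Bellucci的图片'

image_name = src[0].split('/')[-1]                  # 'p123456789.jpg'
print('%s/%s' % (title.split(' ')[0], image_name))  # 莫妮卡·贝鲁奇/p123456789.jpg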

settings.py

# Scrapy settings for banciyuan project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
#  https://docs.scrapy.org/en/latest/topics/settings.html
#  https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
#  https://docs.scrapy.org/en/latest/topics/spider-middleware.html
BOT_NAME = 'banciyuan'
SPIDER_MODULES = ['banciyuan.spiders']
NEWSPIDER_MODULE = 'banciyuan.spiders'

# Crawl responsibly by identifying yourself (and your website) on the user-agent
USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.80 Safari/537.36'

# Obey robots.txt rules
ROBOTSTXT_OBEY = False

# Configure maximum concurrent requests performed by Scrapy (default: 16)
#CONCURRENT_REQUESTS = 32

# Configure a delay for requests for the same website (default: 0)
# See https://docs.scrapy.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
#DOWNLOAD_DELAY = 3
# The download delay setting will honor only one of:
#CONCURRENT_REQUESTS_PER_DOMAIN = 16
#CONCURRENT_REQUESTS_PER_IP = 16

# Disable cookies (enabled by default)
#COOKIES_ENABLED = False

# Disable Telnet Console (enabled by default)
#TELNETCONSOLE_ENABLED = False

# Override the default request headers:
#DEFAULT_REQUEST_HEADERS = {
#   'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
#   'Accept-Language': 'en',
#}

# Enable or disable spider middlewares
# See https://docs.scrapy.org/en/latest/topics/spider-middleware.html
#SPIDER_MIDDLEWARES = {
#    'banciyuan.middlewares.BanciyuanSpiderMiddleware': 543,
#}

# Enable or disable downloader middlewares
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
#DOWNLOADER_MIDDLEWARES = {
#    'banciyuan.middlewares.BanciyuanDownloaderMiddleware': 543,
#}

# Enable or disable extensions
# See https://docs.scrapy.org/en/latest/topics/extensions.html
#EXTENSIONS = {
#    'scrapy.extensions.telnet.TelnetConsole': None,
#}

# Configure item pipelines
# See https://docs.scrapy.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
    'banciyuan.pipelines.BanciyuanPipeline': 1,
}

# Store downloaded images under ./images
IMAGES_STORE = './images'
# Enable and configure the AutoThrottle extension (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/autothrottle.html
#AUTOTHROTTLE_ENABLED = True
# The initial download delay
#AUTOTHROTTLE_START_DELAY = 5
# The maximum download delay to be set in case of high latencies
#AUTOTHROTTLE_MAX_DELAY = 60
# The average number of requests Scrapy should be sending in parallel to
# each remote server
#AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
# Enable showing throttling stats for every response received:
#AUTOTHROTTLE_DEBUG = False
# Enable and configure HTTP caching (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
#HTTPCACHE_ENABLED = True
#HTTPCACHE_EXPIRATION_SECS = 0
#HTTPCACHE_DIR = 'httpcache'
#HTTPCACHE_IGNORE_HTTP_CODES = []
#HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'
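
One dependency note: Scrapy's ImagesPipeline requires Pillow for image handling, so install it alongside Scrapy if it is not already present:

pip install scrapy Pillow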

5. Crawl results
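
With the settings above, the downloaded files land under IMAGES_STORE, grouped by the first space-separated token of each photo page's title. A sketch of the expected layout (file names are illustrative):

images/
└── 莫妮卡·贝鲁奇/
    ├── p123456789.jpg
    ├── p123456790.jpg
    └── ...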


This concludes the walkthrough of crawling Douban images with Scrapy. For more on scraping with Scrapy, search this site's earlier articles.
