Python Scraping in Practice: Crawling Douban Images with Scrapy
Author: 濯君 · Published: 2023-06-08 10:56:20
Tags: Python, Scrapy
Use Scrapy to crawl all of the personal photos of a film star on Douban,
taking Monica Bellucci (莫妮卡·贝鲁奇) as the example.
1. First, open a terminal in the directory where the project should live and create the Scrapy project:
scrapy startproject banciyuan
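The generated project structure follows Scrapy's default template; a sketch of the layout (reconstructed from the standard template, since the original screenshot is not part of this text):

banciyuan/
    scrapy.cfg
    banciyuan/
        __init__.py
        items.py
        middlewares.py
        pipelines.py
        settings.py
        spiders/
            __init__.py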
2. To make it easy to run the Scrapy project from PyCharm, create a main.py:
from scrapy import cmdline

# Equivalent to running `scrapy crawl banciyuan` on the command line
cmdline.execute("scrapy crawl banciyuan".split())
Then open Run → Edit Configurations and add a Python run configuration that executes main.py, with the working directory set to the project root (the directory containing scrapy.cfg). Once that is set up, the Scrapy project can be started simply by running main.py.
3. Analyze the HTML of the photo pages and create the corresponding spider:
from scrapy import Spider
import scrapy

from banciyuan.items import BanciyuanItem


class BanciyuanSpider(Spider):
    name = 'banciyuan'
    allowed_domains = ['movie.douban.com']
    start_urls = ["https://movie.douban.com/celebrity/1025156/photos/"]
    url = "https://movie.douban.com/celebrity/1025156/photos/"

    def parse(self, response):
        # The last link in the paginator holds the total number of list pages
        num = response.xpath('//div[@class="paginator"]/a[last()]/text()').extract_first('')
        print(num)
        for i in range(int(num)):
            # Each list page shows 30 thumbnails, offset via the `start` parameter
            suffix = '?type=C&start=' + str(i * 30) + '&sortby=like&size=a&subtype=a'
            yield scrapy.Request(url=self.url + suffix, callback=self.get_page)

    def get_page(self, response):
        # Collect the detail-page URL of every photo on the list page
        href_list = response.xpath('//div[@class="article"]//div[@class="cover"]/a/@href').extract()
        # print(href_list)
        for href in href_list:
            yield scrapy.Request(url=href, callback=self.get_info)

    def get_info(self, response):
        # Extract the full-size image URL and the page title (used later as the folder name)
        src = response.xpath(
            '//div[@class="article"]//div[@class="photo-show"]//div[@class="photo-wp"]/a[1]/img/@src').extract_first('')
        title = response.xpath('//div[@id="content"]/h1/text()').extract_first('')
        # print(response.body)
        item = BanciyuanItem()
        item['title'] = title
        item['src'] = [src]
        yield item
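To sanity-check the XPath expressions before running a full crawl, Scrapy's interactive shell is handy. The selectors below are the ones used in parse and get_page; note that Douban's markup may have changed since this post was written:

scrapy shell "https://movie.douban.com/celebrity/1025156/photos/"
>>> response.xpath('//div[@class="paginator"]/a[last()]/text()').extract_first('')
>>> response.xpath('//div[@class="article"]//div[@class="cover"]/a/@href').extract()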
4. items.py
# Define here the models for your scraped items
#
# See documentation in:
# https://docs.scrapy.org/en/latest/topics/items.html
import scrapy
class BanciyuanItem(scrapy.Item):
    # define the fields for your item here like:
    src = scrapy.Field()
    title = scrapy.Field()
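Note that the spider stores src as a single-element list (item['src'] = [src]). Only item['src'][0] is used by the custom pipeline below, so a plain string would work just as well; the list form simply mirrors the image_urls convention of the stock ImagesPipeline.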
pipelines.py
# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://docs.scrapy.org/en/latest/topics/item-pipeline.html
# useful for handling different item types with a single interface
from itemadapter import ItemAdapter
from scrapy.pipelines.images import ImagesPipeline
import scrapy
class BanciyuanPipeline(ImagesPipeline):
    def get_media_requests(self, item, info):
        # Request the image and pass the item along via request meta
        yield scrapy.Request(url=item['src'][0], meta={'item': item})

    def file_path(self, request, response=None, info=None, *, item=None):
        item = request.meta['item']
        image_name = item['src'][0].split('/')[-1]
        # To rename .webp files, remember str.replace returns a new string:
        # image_name = image_name.replace('.webp', '.jpg')
        path = '%s/%s' % (item['title'].split(' ')[0], image_name)
        return path
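Two practical notes, both standard ImagesPipeline behavior rather than anything specific to this spider: ImagesPipeline requires Pillow (pip install Pillow), and the string returned by file_path is joined onto the IMAGES_STORE setting from settings.py, so each download lands at a path of the shape

./images/<title-prefix>/<image_name>

where <title-prefix> is item['title'].split(' ')[0].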
settings.py
# Scrapy settings for banciyuan project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
# https://docs.scrapy.org/en/latest/topics/settings.html
# https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
# https://docs.scrapy.org/en/latest/topics/spider-middleware.html
BOT_NAME = 'banciyuan'
SPIDER_MODULES = ['banciyuan.spiders']
NEWSPIDER_MODULE = 'banciyuan.spiders'
# Crawl responsibly by identifying yourself (and your website) on the user-agent
USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.80 Safari/537.36'
# Obey robots.txt rules
ROBOTSTXT_OBEY = False
# Configure maximum concurrent requests performed by Scrapy (default: 16)
#CONCURRENT_REQUESTS = 32
# Configure a delay for requests for the same website (default: 0)
# See https://docs.scrapy.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
#DOWNLOAD_DELAY = 3
# The download delay setting will honor only one of:
#CONCURRENT_REQUESTS_PER_DOMAIN = 16
#CONCURRENT_REQUESTS_PER_IP = 16
# Disable cookies (enabled by default)
#COOKIES_ENABLED = False
# Disable Telnet Console (enabled by default)
#TELNETCONSOLE_ENABLED = False
# Override the default request headers:
#DEFAULT_REQUEST_HEADERS = {
# 'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
# 'Accept-Language': 'en',
#}
# Enable or disable spider middlewares
# See https://docs.scrapy.org/en/latest/topics/spider-middleware.html
#SPIDER_MIDDLEWARES = {
# 'banciyuan.middlewares.BanciyuanSpiderMiddleware': 543,
#}
# Enable or disable downloader middlewares
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html
#DOWNLOADER_MIDDLEWARES = {
# 'banciyuan.middlewares.BanciyuanDownloaderMiddleware': 543,
#}
# Enable or disable extensions
# See https://docs.scrapy.org/en/latest/topics/extensions.html
#EXTENSIONS = {
# 'scrapy.extensions.telnet.TelnetConsole': None,
#}
# Configure item pipelines
# See https://docs.scrapy.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
    'banciyuan.pipelines.BanciyuanPipeline': 1,
}
IMAGES_STORE = './images'
# Enable and configure the AutoThrottle extension (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/autothrottle.html
#AUTOTHROTTLE_ENABLED = True
# The initial download delay
#AUTOTHROTTLE_START_DELAY = 5
# The maximum download delay to be set in case of high latencies
#AUTOTHROTTLE_MAX_DELAY = 60
# The average number of requests Scrapy should be sending in parallel to
# each remote server
#AUTOTHROTTLE_TARGET_CONCURRENCY = 1.0
# Enable showing throttling stats for every response received:
#AUTOTHROTTLE_DEBUG = False
# Enable and configure HTTP caching (disabled by default)
# See https://docs.scrapy.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
#HTTPCACHE_ENABLED = True
#HTTPCACHE_EXPIRATION_SECS = 0
#HTTPCACHE_DIR = 'httpcache'
#HTTPCACHE_IGNORE_HTTP_CODES = []
#HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'
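With everything in place, the crawl can be started either by running main.py from PyCharm or directly from the command line in the project root:

scrapy crawl banciyuan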
5. Crawl results: the images are saved under ./images (the IMAGES_STORE path), grouped into one folder per photo-page title.
Reference
Source code
Original post: https://blog.csdn.net/zzldm/article/details/117425949

