Tags: Python web-scraping study notes
Creating a Scrapy project
Open cmd and change into a directory of your choice. In my case:
1. Type F: to switch to the F: drive.
2. Type cd F:\pycharm文件\学习 to enter my working folder.
Now the Scrapy project can be created from the command prompt:
scrapy startproject blog_Scrapy
This creates the project's starter files in that directory.
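The screenshot of the generated files did not survive here; for reference, the skeleton that scrapy startproject blog_Scrapy produces normally looks like this (the standard Scrapy layout, reconstructed from memory rather than from the original screenshot):

```text
blog_Scrapy/
    scrapy.cfg            # deploy configuration
    blog_Scrapy/          # the project's Python module
        __init__.py
        items.py          # item definitions
        middlewares.py    # spider / downloader middlewares
        pipelines.py      # item pipelines
        settings.py       # project settings
        spiders/          # spiders go in here
            __init__.py
```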
Open the project directory in PyCharm and open items.py; you'll see the default template. Modify the code to:
import scrapy


class BlogScrapyItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    title = scrapy.Field()    # the post title
    link = scrapy.Field()     # the post link
    content = scrapy.Field()  # the post content
Fetching the cnblogs homepage
In the same command window, cd into blog_Scrapy and run:
scrapy genspider blog https://www.cnblogs.com/
Here https://www.cnblogs.com/ is the cnblogs address and blog is the name of the spider file to generate.
A blog.py file appears under the spiders directory. Opening blog.py in PyCharm shows the generated code; I modified parse() to fetch the page and save it locally:
import scrapy


class BlogSpider(scrapy.Spider):
    name = 'blog'
    allowed_domains = ['https://www.cnblogs.com/']
    start_urls = ['http://https://www.cnblogs.com//']

    def parse(self, response):
        print(response.text)
        filename = "index.html"
        with open(filename, 'w', encoding='utf-8') as f:
            f.write(response.text)
Back in the command window, I ran scrapy crawl blog and got the following:
F:\pycharm文件\学习\blog_Scrapy>scrapy crawl blog
2020-05-23 12:01:47 [scrapy.utils.log] INFO: Scrapy 2.1.0 started (bot: blog_Scrapy)
2020-05-23 12:01:47 [scrapy.utils.log] INFO: Versions: lxml 4.4.1.0, libxml2 2.9.9, cssselect 1.1.0, parsel 1.5.2, w3lib 1.21.0, Twisted 20.3.0, Python 3.7.4 (default, Aug 9 2019, 18:34:13) [MSC v.1915 64 bit (AMD64)], pyOpenSSL 19.0.0 (OpenSSL 1.1.1d 10 Sep 2019), cryptography 2.7, Platform Windows-10-10.0.18362-SP0
2020-05-23 12:01:47 [scrapy.utils.log] DEBUG: Using reactor: twisted.internet.selectreactor.SelectReactor
2020-05-23 12:01:47 [scrapy.crawler] INFO: Overridden settings:
{
'BOT_NAME': 'blog_Scrapy',
'NEWSPIDER_MODULE': 'blog_Scrapy.spiders',
'ROBOTSTXT_OBEY': True,
'SPIDER_MODULES': ['blog_Scrapy.spiders']}
2020-05-23 12:01:47 [scrapy.extensions.telnet] INFO: Telnet Password: 906a4b6f938df8c7
2020-05-23 12:01:47 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
'scrapy.extensions.telnet.TelnetConsole',
'scrapy.extensions.logstats.LogStats']
2020-05-23 12:01:47 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware',
'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
'scrapy.downloadermiddlewares.retry.RetryMiddleware',
'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
'scrapy.downloadermiddlewares.stats.DownloaderStats']
2020-05-23 12:01:47 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
'scrapy.spidermiddlewares.referer.RefererMiddleware',
'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
'scrapy.spidermiddlewares.depth.DepthMiddleware']
2020-05-23 12:01:47 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2020-05-23 12:01:47 [scrapy.core.engine] INFO: Spider opened
2020-05-23 12:01:47 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2020-05-23 12:01:47 [py.warnings] WARNING: e:\anaconda\lib\site-packages\scrapy\spidermiddlewares\offsite.py:65: URLWarning: allowed_domains accepts only domains, not URLs. Ignoring URL entry https://www.cnblogs.com/ in allowed_domains.
warnings.warn(message, URLWarning)
2020-05-23 12:01:47 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2020-05-23 12:01:49 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET http://https/robots.txt> (failed 1 times): DNS lookup failed: no results for hostname lookup: https.
2020-05-23 12:01:52 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET http://https/robots.txt> (failed 2 times): DNS lookup failed: no results for hostname lookup: https.
2020-05-23 12:01:54 [scrapy.downloadermiddlewares.retry] ERROR: Gave up retrying <GET http://https/robots.txt> (failed 3 times): DNS lookup failed: no results for hostname lookup: https.
2020-05-23 12:01:54 [scrapy.downloadermiddlewares.robotstxt] ERROR: Error downloading <GET http://https/robots.txt>: DNS lookup failed: no results for hostname lookup: https.
Traceback (most recent call last):
File "e:\anaconda\lib\site-packages\twisted\internet\defer.py", line 1416, in _inlineCallbacks
result = result.throwExceptionIntoGenerator(g)
File "e:\anaconda\lib\site-packages\twisted\python\failure.py", line 512, in throwExceptionIntoGenerator
return g.throw(self.type, self.value, self.tb)
File "e:\anaconda\lib\site-packages\scrapy\core\downloader\middleware.py", line 44, in process_request
return (yield download_func(request=request, spider=spider))
File "e:\anaconda\lib\site-packages\twisted\internet\defer.py", line 654, in _runCallbacks
current.result = callback(current.result, *args, **kw)
File "e:\anaconda\lib\site-packages\twisted\internet\endpoints.py", line 982, in startConnectionAttempts
"no results for hostname lookup: {}".format(self._hostStr)
twisted.internet.error.DNSLookupError: DNS lookup failed: no results for hostname lookup: https.
2020-05-23 12:01:56 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET http://https//www.cnblogs.com//> (failed 1 times): DNS lookup failed: no results for hostname lookup: https.
2020-05-23 12:01:58 [scrapy.downloadermiddlewares.retry] DEBUG: Retrying <GET http://https//www.cnblogs.com//> (failed 2 times): DNS lookup failed: no results for hostname lookup: https.
2020-05-23 12:02:01 [scrapy.downloadermiddlewares.retry] ERROR: Gave up retrying <GET http://https//www.cnblogs.com//> (failed 3 times): DNS lookup failed: no results for hostname lookup: https.
2020-05-23 12:02:01 [scrapy.core.scraper] ERROR: Error downloading <GET http://https//www.cnblogs.com//>
Traceback (most recent call last):
File "e:\anaconda\lib\site-packages\twisted\internet\defer.py", line 1416, in _inlineCallbacks
result = result.throwExceptionIntoGenerator(g)
File "e:\anaconda\lib\site-packages\twisted\python\failure.py", line 512, in throwExceptionIntoGenerator
return g.throw(self.type, self.value, self.tb)
File "e:\anaconda\lib\site-packages\scrapy\core\downloader\middleware.py", line 44, in process_request
return (yield download_func(request=request, spider=spider))
File "e:\anaconda\lib\site-packages\twisted\internet\defer.py", line 654, in _runCallbacks
current.result = callback(current.result, *args, **kw)
File "e:\anaconda\lib\site-packages\twisted\internet\endpoints.py", line 982, in startConnectionAttempts
"no results for hostname lookup: {}".format(self._hostStr)
twisted.internet.error.DNSLookupError: DNS lookup failed: no results for hostname lookup: https.
2020-05-23 12:02:01 [scrapy.core.engine] INFO: Closing spider (finished)
2020-05-23 12:02:01 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{
'downloader/exception_count': 6,
'downloader/exception_type_count/twisted.internet.error.DNSLookupError': 6,
'downloader/request_bytes': 1314,
'downloader/request_count': 6,
'downloader/request_method_count/GET': 6,
'elapsed_time_seconds': 13.762848,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2020, 5, 23, 4, 2, 1, 290473),
'log_count/DEBUG': 4,
'log_count/ERROR': 4,
'log_count/INFO': 10,
'log_count/WARNING': 1,
'retry/count': 4,
'retry/max_reached': 2,
'retry/reason_count/twisted.internet.error.DNSLookupError': 4,
"robotstxt/exception_count/<class 'twisted.internet.error.DNSLookupError'>": 1,
'robotstxt/request_count': 1,
'scheduler/dequeued': 3,
'scheduler/dequeued/memory': 3,
'scheduler/enqueued': 3,
'scheduler/enqueued/memory': 3,
'start_time': datetime.datetime(2020, 5, 23, 4, 1, 47, 527625)}
2020-05-23 12:02:01 [scrapy.core.engine] INFO: Spider closed (finished)
According to the book, an index.html file should now appear in the project directory, but I searched for a long time and couldn't find it.
After looking things up, I suspected the cause was a missing user agent.
So I opened settings.py and, following what I'd read, changed this section to:
USER_AGENT = 'blog_Scrapy (+http://www.yourdomain.com)'
# Obey robots.txt rules
ROBOTSTXT_OBEY = False
and changed the default request headers to:
DEFAULT_REQUEST_HEADERS = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Language': 'en',
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.113 Safari/537.36',
}
I ran the crawl again:
F:\pycharm文件\学习\blog_Scrapy>scrapy crawl blog
I hit Enter thinking that would fix it, but the result was exactly the same as before. Still nothing.
I went over the code once more and finally found the cause: the auto-generated start_urls is not a valid URL at all; it stacks two schemes, http:// in front of https://. How did I miss that? Change it to:
start_urls = ['https://www.cnblogs.com/']
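The URLWarning in the log above also flags a related mistake: allowed_domains accepts bare domains, not URLs, so the auto-generated entry was being ignored. As a small side note (my own sketch, standard library only), the bare host can be pulled out of a URL like this:

```python
from urllib.parse import urlparse

url = 'https://www.cnblogs.com/'
domain = urlparse(url).netloc  # drop the scheme and path, keep only the host
print(domain)  # www.cnblogs.com
```

So allowed_domains = ['www.cnblogs.com'] (or simply ['cnblogs.com']) would be the clean value here.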
Run the crawl once more:
F:\pycharm文件\学习\blog_Scrapy>scrapy crawl blog
This time it printed the page (only an excerpt; the full output is far too long):
</p>
<div class="post_item_foot">
<a href="https://www.cnblogs.com/DeeZeng/" class="lightblue">DeeZeng</a>
发布于 2020-05-23 11:59
<span class="article_comment"><a href="https://www.cnblogs.com/DeeZeng/p/12932393.html#commentform" title="0001-01-01 08:05" class="gray">
评论(0)</a></span><span class="article_view"><a href="https://www.cnblogs.com/DeeZeng/p/12932393.html" class="gray">阅读(8)</a></span></div>
</div>
<div class="clear"></div>
</div>
<div class="post_item">
<div class="digg">
<div class="diggit" onclick="DiggPost('poloyy',12938778,567504,1)">
<span class="diggnum" id="digg_count_12938778">0</span>
</div>
<div class="clear"></div>
<div id="digg_tip_12938778" class="digg_tip">
</div>
</div>
<div class="post_item_body">
<h3><a class="titlelnk" href="https://www.cnblogs.com/poloyy/p/12938778.html" target="_blank">Robot Framework(4)- 测试套件的基本使用</a></h3>
<p class="post_item_summary">
<a href="https://www.cnblogs.com/poloyy/" target="_blank"><img width="48" height="48" class="pfs" src="https://pic.cnblogs.com/face/1896874/20191225235209.png" alt=""/></a> 如果你还想从头学起Robot Framework,可以看看这个系列的文章哦! https://www.cnblogs.com/poloyy/category/1770899.html 前言 因为是基于Pycharm 去写的,所以这里重点讲在Pycharm 写 RF 的语法格式和使用 我们在Pycha ...
</p>
<div class="post_item_foot">
<a href="https://www.cnblogs.com/poloyy/" class="lightblue">小菠萝测试笔记</a>
发布于 2020-05-23 11:57
<span class="article_comment"><a href="https://www.cnblogs.com/poloyy/p/12938778.html#commentform" title="0001-01-01 08:05" class="gray">
评论(0)</a></span><span class="article_view"><a href="https://www.cnblogs.com/poloyy/p/12938778.html" class="gray">阅读(11)</a></span></div>
</div>
<div class="clear"></div>
</div>
<div class="post_item">
And the long-awaited index.html finally appeared in the project directory.
Open it and take a look.