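I'm new to Scrapy, and I'm trying to scrape new pages from the links found in items I have already scraped. Specifically, I want to scrape Dropbox file-sharing links from Google search results and store them in a JSON file. After collecting these links, I want to open a new page for each one to verify that the link is valid. If it is valid, I also want to store the file name in the JSON file.
I use a DropboxItem with the fields 'link', 'filename', 'status' and 'err_msg' to store each scraped item, and in the parse function I try to issue an asynchronous request for each scraped link. But the parse_file_page function never seems to be called. Does anyone know how to implement this kind of two-step crawl?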
class DropboxSpider(Spider):
    name = "dropbox"
    allowed_domains = ["google.com"]
    start_urls = [
        "https://www.google.com/#filter=0&q=site:www.dropbox.com/s/&start=0"
    ]

    def parse(self,response):
        sel = Selector(response)
        sites = sel.xpath("//h3[@class='r']")
        items = []
        for site in sites:
            item = DropboxItem()
            link = site.xpath('a/@href').extract()
            item['link'] = link
            link = ''.join(link)
            #I want to parse a new page with url=link here
            new_request = Request(link,callback=self.parse_file_page)
            new_request.meta['item'] = item
            items.append(item)
        return items

    def parse_file_page(self,response):
        #item passed from request
        item = response.meta['item']
        #selector
        sel = Selector(response)
        content_area = sel.xpath("//div[@id='shmodel-content-area']")
        filename_area = content_area.xpath("div[@class='filename shmodel-filename']")
        if filename_area:
            filename = filename_area.xpath("span[@id]/text()").extract()
            if filename:
                item['filename'] = filename
                item['status'] = "normal"
        else:
            err_area = content_area.xpath("div[@class='err']")
            if err_area:
                err_msg = err_area.xpath("h3/text()").extract()
                item['err_msg'] = err_msg
                item['status'] = "error"
        return item
Thanks to @ScrapyNovice for the answer. I modified the code, and it now looks like this:
    def parse(self,response):
        sel = Selector(response)
        sites = sel.xpath("//h3[@class='r']")
        #items = []
        for site in sites:
            item = DropboxItem()
            link = site.xpath('a/@href').extract()
            item['link'] = link
            link = ''.join(link)
            print 'link!!!!!!=',link
            new_request = Request(link,callback=self.parse_file_page)
            new_request.meta['item'] = item
            yield new_request
            #items.append(item)
        yield item
        return
        #return item #Note,when I simply return item here,got an error msg "SyntaxError: 'return' with argument inside generator"

    def parse_file_page(self,response):
        #item passed from request
        print 'parse_file_page!!!'
        item = response.meta['item']
        #selector
        sel = Selector(response)
        content_area = sel.xpath("//div[@id='shmodel-content-area']")
        filename_area = content_area.xpath("div[@class='filename shmodel-filename']")
        if filename_area:
            filename = filename_area.xpath("span[@id]/text()").extract()
            if filename:
                item['filename'] = filename
                item['status'] = "normal"
                item['err_msg'] = "none"
                print 'filename=',filename
        else:
            err_area = content_area.xpath("div[@class='err']")
            if err_area:
                err_msg = err_area.xpath("h3/text()").extract()
                item['filename'] = "null"
                item['err_msg'] = err_msg
                item['status'] = "error"
                print 'err_msg',err_msg
            else:
                item['filename'] = "null"
                item['err_msg'] = "unknown_err"
                item['status'] = "error"
                print 'unknown err'
        return item
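The control flow has actually become quite strange. When I run "scrapy crawl dropbox -o items_dropbox.json -t json" against a local file (a downloaded page of Google search results), I see output like this: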
2014-05-31 08:40:35-0400 [scrapy] INFO: Scrapy 0.22.2 started (bot: tutorial)
2014-05-31 08:40:35-0400 [scrapy] INFO: Optional features available: ssl,http11
2014-05-31 08:40:35-0400 [scrapy] INFO: Overridden settings: {'NEWSPIDER_MODULE': 'tutorial.spiders','FEED_FORMAT': 'json','SPIDER_MODULES': ['tutorial.spiders'],'FEED_URI': 'items_dropbox.json','BOT_NAME': 'tutorial'}
2014-05-31 08:40:35-0400 [scrapy] INFO: Enabled extensions: FeedExporter,LogStats,TelnetConsole,CloseSpider,WebService,CoreStats,SpiderState
2014-05-31 08:40:35-0400 [scrapy] INFO: Enabled downloader middlewares: HttpAuthMiddleware,DownloadTimeoutMiddleware,UserAgentMiddleware,RetryMiddleware,DefaultHeadersMiddleware,MetaRefreshMiddleware,HttpCompressionMiddleware,RedirectMiddleware,CookiesMiddleware,ChunkedTransferMiddleware,DownloaderStats
2014-05-31 08:40:35-0400 [scrapy] INFO: Enabled spider middlewares: HttpErrorMiddleware,OffsiteMiddleware,RefererMiddleware,UrlLengthMiddleware,DepthMiddleware
2014-05-31 08:40:35-0400 [scrapy] INFO: Enabled item pipelines:
2014-05-31 08:40:35-0400 [dropbox] INFO: Spider opened
2014-05-31 08:40:35-0400 [dropbox] INFO: Crawled 0 pages (at 0 pages/min),scraped 0 items (at 0 items/min)
2014-05-31 08:40:35-0400 [scrapy] DEBUG: Telnet console listening on 0.0.0.0:6023
2014-05-31 08:40:35-0400 [scrapy] DEBUG: Web service listening on 0.0.0.0:6080
2014-05-31 08:40:35-0400 [dropbox] DEBUG: Crawled (200)
And now the JSON file contains only:
[{"link": ["http://www.dropbox.com/s/9x8924gtb52ksn6/Phonesky.apk"]},{"status": "error","err_msg": "unknown_err","link": ["http://www.google.com/intl/en/webmasters/#utm_source=en-wmxmsg&utm_medium=wmxmsg&utm_campaign=bm&authuser=0"],"filename": "null"}]
Best answer
You're creating a Request and setting its callback correctly, but you never do anything with it.