python - Unable to scrape news headings from Hacker News
I want to scrape the top news articles' headlines and links from Hacker News.
Here is my code:
    import scrapy
    from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor

    class HnItem(scrapy.Item):
        title = scrapy.Field()
        link = scrapy.Field()

    class HnSpider(scrapy.Spider):
        name = "hn"
        allowed_domains = ["https://news.ycombinator.com"]
        start_urls = ["https://news.ycombinator.com/"]

        def parse(self, response):
            item = HnItem()
            item['title'] = response.xpath('//*[@id="hnmain"]/tbody/tr[3]/td/table/tbody/tr[1]/td[3]/a/text()').extract()
            item['link'] = response.xpath('//*[@id="hnmain"]/tbody/tr[3]/td/table/tbody/tr[1]/td[3]/a/@href').extract()
            print item['title']
            print item['link']
But it returns an empty list.
P.S. I'm a beginner in Python and in Scrapy.
Here is what I ended up with when I tried creating the spider:
    import scrapy

    class HnItem(scrapy.Item):
        title = scrapy.Field()
        link = scrapy.Field()

    class HnSpider(scrapy.Spider):
        name = 'hackernews'
        allowed_domains = ['news.ycombinator.com']  # see Javier's comment
        start_urls = ['http://news.ycombinator.com/']

        def parse(self, response):
            sel = scrapy.Selector(response)
            item = HnItem()
            # these XPaths can be made more generic
            item['title'] = sel.xpath("//tr[@class='athing']/td[3]/a[@href]/text()").extract()
            item['link'] = sel.xpath("//tr[@class='athing']/td[3]/a/@href").extract()
            # do whatever you want with the item: print, return, etc.
            print(item['title'])
            print(item['link'])
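A likely reason the XPaths in the question match nothing is the tbody steps: browsers tend to insert tbody elements into the DOM they display, while the raw HTML that Scrapy downloads may not contain them. The sketch below illustrates that effect with the standard library only. The markup is a hypothetical, simplified stand-in, not the real Hacker News page, and it uses xml.etree.ElementTree rather than Scrapy's selectors.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified stand-in for the raw page source.
# Note that the table has no <tbody> element.
RAW_HTML = """
<html><body>
<table id="hnmain">
  <tr class="athing">
    <td>1.</td>
    <td></td>
    <td><a href="https://example.com/story">Example story</a></td>
  </tr>
</table>
</body></html>
"""

root = ET.fromstring(RAW_HTML)

# Path copied "browser style", with a tbody step: nothing matches,
# because the raw source has no <tbody>.
with_tbody = root.findall(".//table[@id='hnmain']/tbody/tr/td[3]/a")

# Same idea without the tbody step: the link is found.
without_tbody = root.findall(".//tr[@class='athing']/td[3]/a")

titles = [a.text for a in without_tbody]
links = [a.get("href") for a in without_tbody]

print(with_tbody)
print(titles)
print(links)
```

If the real page behaves the same way, dropping the tbody steps (as the spider's XPaths do) should be enough to get non-empty results.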
You can run the spider from the command line with: scrapy runspider path/to/your_spider.py
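On the allowed_domains change: Scrapy expects bare domain names there, not full URLs, which is why 'https://news.ycombinator.com' was replaced with 'news.ycombinator.com'. As a minimal sketch, assuming you would rather derive the value from start_urls than type it by hand, the standard library can strip the scheme and path:

```python
from urllib.parse import urlparse

start_urls = ["https://news.ycombinator.com/"]

# Keep only the host name of each start URL (no scheme, no path).
allowed_domains = sorted({urlparse(url).hostname for url in start_urls})

print(allowed_domains)
```

This is only a convenience; writing the domain directly in the spider works just as well.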