python - Unable to scrape news headings from Hacker News

- May 15, 2014

I want to scrape the top news article's headline and link from Hacker News.

Here is my code:

    import scrapy
    from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor

    class HnItem(scrapy.Item):
        title = scrapy.Field()
        link = scrapy.Field()

    class HnSpider(scrapy.Spider):
        name = "hn"
        allowed_domains = ["https://news.ycombinator.com"]
        start_urls = ["https://news.ycombinator.com/"]

        def parse(self, response):
            item = HnItem()
            item['title'] = response.xpath('//*[@id="hnmain"]/tbody/tr[3]/td/table/tbody/tr[1]/td[3]/a/text()').extract()
            item['link'] = response.xpath('//*[@id="hnmain"]/tbody/tr[3]/td/table/tbody/tr[1]/td[3]/a/@href').extract()
            print item['title']
            print item['link']

But it returns an empty list.

P.S. I am a beginner in Python and in Scrapy.

Here is what I ended up with when I tried creating the spider:

    import scrapy

    class HnItem(scrapy.Item):
        title = scrapy.Field()
        link = scrapy.Field()

    class HnSpider(scrapy.Spider):
        name = 'hackernews'
        # Domain only, no scheme (see Javier's comment)
        allowed_domains = ['news.ycombinator.com']
        start_urls = ['http://news.ycombinator.com/']

        def parse(self, response):
            sel = scrapy.Selector(response)
            item = HnItem()

            # These XPaths could be made more generic
            item['title'] = sel.xpath("//tr[@class='athing']/td[3]/a[@href]/text()").extract()
            item['link'] = sel.xpath("//tr[@class='athing']/td[3]/a/@href").extract()

            # Do whatever you want with the item: print, return, etc.
            print(item['title'])
            print(item['link'])

You can run the spider from the command line with: scrapy runspider path/to/your_spider.py
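One likely reason the spider in the question returns an empty list is the tbody steps in its XPaths. Browsers add tbody elements to tables when they build the DOM, so a path copied from the browser's developer tools may name an element that is not present in the raw HTML that Scrapy downloads. The sketch below illustrates the effect using only the standard library's xml.etree.ElementTree rather than Scrapy's selectors. The markup is a simplified, hypothetical stand-in for the page, not the real Hacker News source, and the example.com URLs are placeholders.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified stand-in for the raw page source (an assumption,
# not the real Hacker News markup). Note that it contains no <tbody>.
RAW_HTML = """
<table id="hnmain">
  <tr class="athing">
    <td>1.</td>
    <td></td>
    <td><a href="http://example.com/a">First story</a></td>
  </tr>
  <tr class="athing">
    <td>2.</td>
    <td></td>
    <td><a href="http://example.com/b">Second story</a></td>
  </tr>
</table>
"""

# The parsed root element is the <table> itself.
root = ET.fromstring(RAW_HTML.strip())

# A browser-style path that expects <tbody> matches nothing in the raw source.
with_tbody = root.findall("./tbody/tr/td[3]/a")

# A path keyed on the row class, with no <tbody> step, finds both links.
without_tbody = root.findall(".//tr[@class='athing']/td[3]/a")

# Keep each title together with its own link instead of two parallel lists.
stories = [(a.text, a.get("href")) for a in without_tbody]

print(with_tbody)
print(stories)
```

The same idea carries over to the spider: select the rows first, then read the title and link from each row, so a headline always stays paired with its own URL.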
