web - How to make a Python web crawler infinite and record each link only once -


With the help of thenewboston I was able to create a nice little web crawler in Python. After watching his videos I played around with it and added a couple of things. I've tried to make it infinite, so that it crawls every link on every page with every link recorded, but I have failed in doing so. I also have the problem of recording the same link more than once. How would I go about fixing these problems?

This is my code.

import requests
from bs4 import BeautifulSoup

def spider(max_pages):
    page = 1
    while page <= max_pages:
        url = ''
        source_code = requests.get(url)
        plain_text = source_code.text
        soup = BeautifulSoup(plain_text, "html.parser")
        for link in soup.find_all("a"):
            href = link.get("href")
            title = link.get("title")
            links = []
            #print(href)
            #print(title)
            try:
                get_single_user_data(href)
            except:
                pass
        page += 1

def get_single_user_data(user_url):
    source_code = requests.get(user_url)
    plain_text = source_code.text
    soup = BeautifulSoup(plain_text, "html.parser")
    #for item_name in soup.find_all('span', {'id':'mm-saledscprc'}):
    #    print(item_name.string)
    for link in soup.find_all("a"):
        href = link.get("href")
        print(href)

spider(1)

I've tried to make it infinite, so that it crawls every link on every page with every link recorded

That's not going to happen unless you have a decently sized datacentre, but let's set that aside for the sake of it. You need a larger starting pool of websites whose crawled links lead on to other websites, and you'll get far enough. Start with the outbound links from reddit or something.
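A minimal sketch of that idea: seed a queue with a handful of starting pages and keep feeding newly discovered links back into it, so the crawl moves from site to site instead of stopping at one page. The seed URLs here are placeholders, not from the original post.

from collections import deque

import requests
from bs4 import BeautifulSoup

# Hypothetical seed pool; any link-rich pages will do.
frontier = deque([
    "https://www.reddit.com/",
    "https://news.ycombinator.com/",
])

while frontier:
    url = frontier.popleft()
    try:
        response = requests.get(url, timeout=5)
    except requests.RequestException:
        continue  # skip pages that fail to load
    soup = BeautifulSoup(response.text, "html.parser")
    for link in soup.find_all("a"):
        href = link.get("href")
        # Feed absolute links back into the queue so the
        # crawl keeps going from page to page.
        if href and href.startswith("http"):
            frontier.append(href)

Note that without deduplication this queue grows without bound and revisits the same pages endlessly, which is exactly your second problem.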

I have the problem of recording the same link more than once.

I recommend using a hash table to record the websites you've already visited, and checking whether a link is in it before visiting it.
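A minimal version of that check, using Python's built-in set as the hash table and building on the queue-based crawl sketched above (the seed URL is again a placeholder):

from collections import deque

import requests
from bs4 import BeautifulSoup

frontier = deque(["https://www.reddit.com/"])  # placeholder seed
visited = set()  # hash table of URLs we've already crawled

while frontier:
    url = frontier.popleft()
    if url in visited:
        continue  # already recorded this link; skip it
    visited.add(url)
    print(url)  # each link is printed exactly once
    try:
        response = requests.get(url, timeout=5)
    except requests.RequestException:
        continue
    soup = BeautifulSoup(response.text, "html.parser")
    for link in soup.find_all("a"):
        href = link.get("href")
        if href and href.startswith("http") and href not in visited:
            frontier.append(href)

Membership tests on a set are O(1) on average, so the check stays cheap even after millions of recorded links.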

