web - How to make a Python web crawler infinite and record each link only once -
With the help of thenewboston's tutorials I was able to create a nice little web crawler in Python. After watching the videos I played around with it and added a couple of things. I've tried to make it infinite, so that every link on every page it visits gets recorded, but I have failed to do so. I also have the problem of recording the same link more than once. How do I go about fixing these problems?
Here is the code.
import requests
from bs4 import BeautifulSoup

def spider(max_pages):
    page = 1
    while page <= max_pages:
        url = ''
        source_code = requests.get(url)
        plain_text = source_code.text
        soup = BeautifulSoup(plain_text, "html.parser")
        for link in soup.findAll("a"):
            href = link.get("href")
            title = link.get("title")
            links = []
            #print(href)
            #print(title)
            try:
                get_single_user_data(href)
            except:
                pass
        page += 1

def get_single_user_data(user_url):
    source_code = requests.get(user_url)
    plain_text = source_code.text
    soup = BeautifulSoup(plain_text, "html.parser")
    #for item_name in soup.findAll('span', {'id':'mm-saledscprc'}):
    #    print(item_name.string)
    for link in soup.findAll("a"):
        href = link.get("href")
        print(href)

spider(1)
"I've tried to make it infinite, so that every link on every page gets recorded"
That's not going to happen unless you have a decently sized datacentre, and even then it wouldn't be worth doing for the sake of it. What you do need is a larger starting pool of websites, so that crawled links lead on to other websites, and you'll get far enough. Start with the outbound links from reddit or something.
"I have the problem of recording the same link more than once"
I recommend recording the links you have visited in a hash table (in Python, a set), and checking whether a link is already there before visiting it.
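A minimal sketch of that idea, not the asker's exact code: a `set` serves as the hash table of visited pages and a deque holds the crawl frontier. The function names, the `max_pages` cap, and the timeout are illustrative choices, not anything from the original post.

```python
from collections import deque
from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup

def extract_links(html, base_url):
    """Return the set of absolute URLs found in the page's <a> tags."""
    soup = BeautifulSoup(html, "html.parser")
    return {urljoin(base_url, a["href"]) for a in soup.find_all("a", href=True)}

def crawl(seed_url, max_pages=50):
    visited = set()               # hash table of pages already fetched
    frontier = deque([seed_url])  # links waiting to be crawled
    while frontier and len(visited) < max_pages:
        url = frontier.popleft()
        if url in visited:        # already recorded: skip the duplicate
            continue
        visited.add(url)
        try:
            html = requests.get(url, timeout=5).text
        except requests.RequestException:
            continue              # unreachable page: move on
        for href in extract_links(html, url):
            if href not in visited:
                frontier.append(href)
    return visited
```

Because membership tests on a set are constant time on average, the `url in visited` check stays cheap no matter how many pages have been crawled; the same check on a list would slow down linearly.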