regex - Python BeautifulSoup find_all re.compile finding anything within a set of tags -


Here is the HTML data:

<td>4.2.2</td>, <td align="center"><a href="https://blah.org/blah-4.2.2.zip">zip</a> (<a  href="https://blah.org/blah-4.2.2.zip.md5">md5</a> | <a href="https://blah.org/blah-4.2.2.zip.sha1">sha1</a>)</td>, <td align="center"><a href="https://blah.org/blah-.2.2.tar.gz">tar.gz</a> (<a href="https://blah.org/blah-4.2.2.tar.gz.md5">md5</a>|<ahref="https://blah.org/blah-4.2.2.tar.gz.sha1">sha1</a>)</td>, <td align="center"><a href="https://blah.org/blah-4.2.2-iis.zip">iiszip</a> (<a href="https://blah.org/blah-4.2.2-iis.zip.md5">md5</a> | <a href="https://blah.org/blah-4.2.2-iis.zip.sha1">sha1</a>)</td>, <td>4.2.1</td>, <td align="center"><a href="https://blah.org/blah-4.2.1.zip">zip</a> (<a href="https://blah.org/blah-4.2.1.zip.md5">md5</a> | <a href="https://blah.org/blah-4.2.1.zip.sha1">sha1</a>)</td>, <td align="center"><a href="https://blah.org/blah-4.2.1.tar.gz">tar.gz</a> (<a href="https://blah.org/blah-4.2.1.tar.gz.md5">md5</a> | <a href="https://blah.org/blah-4.2.1.tar.gz.sha1">sha1</a>)</td>, <td align="center"><a href="https://blah.org/blah-4.2.1-iis.zip">iis zip</a> (<a href="https://blah.org/blah-4.2.1-iis.zip.md5">md5</a> | <a href="https://blah.org/blah-4.2.1-iis.zip.sha1">sha1</a>)</td>, <td>4.2</td> <td>1.0-platinum</td> 

etc.

I want to iterate down the page and pull out the version numbers within:

<td>4.2.2</td>

tags, for example:

4.2.2

4.2.1

4.2

1.0-platinum

So far I have tried:

for tag in html.find_all('tbody', limit=1, string=re.compile("\<td\>(.*?)\<\/td\>")):
    print(tag.content)

nothing

rpart = html.find('tbody')
for tds in rpart.find_all('td'):
    print(tds.find_all('\<td\>(.*?)\<\/td>'))

nothing

results=rpart.find_all('td', tds=re.compile("\<td\>(.*?)\<\/td\>")) 

nothing

wphtml.find('tbody').find_all('td', tds=re.compile('\<td\>(.*?)\<\/td\>')) 

nothing

for p in rpart.find_all('td', digits=re.compile('\<td\>(.*?)\<\/td\>')):
    print(p.contents)

nothing

I did notice that rpart is of type "ResultSet", and I'm willing to bet I'm missing something small. What on God's earth am I doing wrong?

First off, there is a missing space in the last <a> tag in the third <td> (it reads <ahref=...>). That might be causing problems when parsing with BeautifulSoup.

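There are 2 ways you can pull out the text from what you provided:

  1. BeautifulSoup:
    from bs4 import BeautifulSoup
    html = BeautifulSoup(htmlstring, 'html.parser')
    # align=None is meant to keep only the <td> tags without an align attribute
    for tag in html.find_all('td', align=None):
        print(tag.string)
  2. Pure regex (no BeautifulSoup):

    import re
    # the literal <td> only matches cells that carry no attributes
    for val in re.findall(r'<td>(.*?)</td>', htmlstring):
        print(val)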
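To make the two options easier to compare, here is a minimal, self-contained sketch. The `html_string` sample is a shortened, hypothetical stand-in for the markup in the question, and the explicit `has_attr('align')` check is my own substitution for the `align=None` filter, chosen because it does not depend on how a given bs4 release treats a `None` attribute value.

```python
import re
from bs4 import BeautifulSoup

# Shortened, hypothetical sample modelled on the question's markup.
html_string = (
    '<table><tbody><tr>'
    '<td>4.2.2</td>'
    '<td align="center"><a href="https://blah.org/blah-4.2.2.zip">zip</a></td>'
    '<td>4.2.1</td>'
    '<td align="center"><a href="https://blah.org/blah-4.2.1.zip">zip</a></td>'
    '<td>1.0-platinum</td>'
    '</tr></tbody></table>'
)

soup = BeautifulSoup(html_string, 'html.parser')

# Option 1: keep only the <td> tags that have no align attribute.
versions_bs = [
    str(td.string)
    for td in soup.find_all('td')
    if not td.has_attr('align')
]

# Option 2: the literal "<td>" never matches '<td align="center">',
# so only the attribute-free cells are captured.
versions_re = re.findall(r'<td>(.*?)</td>', html_string)

print(versions_bs)
print(versions_re)
```

Both lists should contain the same version strings for this sample. The regex route is shorter but breaks as soon as the plain cells gain an attribute or whitespace inside the tag, which is the usual argument for going through the parser.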

As best I can tell, because BeautifulSoup is searching through tag names when you use the "find_all" function, a re.compile passed as its first argument is used to find tag names that match the pattern. For example, if you wanted to find the "tbody" and "td" tags, you could use this:

# anchored so that only the tag names "td" and "tbody" match
for tag in html.find_all(re.compile('^t(d|body)$')):
    print(tag.string)
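As a small runnable illustration of matching on tag names, the sketch below uses a made-up one-row table (not the question's data). The pattern is anchored with `^` and `$` because bs4 applies the regex with `search()`, so an unanchored pattern would also hit any tag name that merely contains a match.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical one-row table, just enough to have several tag names.
soup = BeautifulSoup(
    '<table><tbody><tr><td>4.2.2</td><th>note</th></tr></tbody></table>',
    'html.parser',
)

# The regex is tested against each tag's *name*, not its text or markup.
name_pattern = re.compile(r'^(td|tbody)$')

# find_all walks the tree in document order, so <tbody> comes before <td>.
matched_names = [tag.name for tag in soup.find_all(name_pattern)]
print(matched_names)
```

Here `table`, `tr` and `th` are skipped because their names do not match, regardless of what they contain.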

From each tag found, you can access its attributes or the value/string between the opening and closing tags. I've not found a way to use BeautifulSoup to find tags by their values/strings.

Here's a reference with a couple of examples in case it helps: BeautifulSoup documentation - A regular expression

Also, in BeautifulSoup, a re.compile in "find_all" is used for "filtering/matching", not for capture groups. Meaning, the regex is only a pattern to match against. You can't use (.*?) to extract part of the value being compared in this situation.
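One addition to the point about filtering versus capturing: bs4 releases from 4.4.0 on accept a `string=` keyword in `find_all` (older releases call it `text=`), and a compiled regex passed there is tested against the tag's text rather than its markup. It is still only a filter, so any capture groups have to be applied in a second step. The sketch below assumes such a release; the sample HTML and the `version_pattern` regex are hypothetical.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical sample: two version cells, one link cell, one non-version cell.
soup = BeautifulSoup(
    '<table><tr>'
    '<td>4.2.2</td>'
    '<td align="center"><a href="#">zip</a></td>'
    '<td>1.0-platinum</td>'
    '<td>n/a</td>'
    '</tr></table>',
    'html.parser',
)

# Group 1: dotted number; group 2: optional suffix after a hyphen.
version_pattern = re.compile(r'^(\d+(?:\.\d+)*)(?:-(\w+))?$')

# Step 1 (filter): keep <td> tags whose text matches the pattern.
# bs4 only uses the regex as a yes/no test here; groups are ignored.
cells = soup.find_all('td', string=version_pattern)

# Step 2 (extract): run the regex ourselves to reach the capture groups.
parsed = []
for cell in cells:
    match = version_pattern.search(cell.string)
    parsed.append((match.group(1), match.group(2)))

print(parsed)
```

This also explains why a pattern such as `<td>(.*?)</td>` finds nothing when handed to `find_all`: the text being tested never contains the surrounding tags.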

