regex - Python BeautifulSoup find_all re.compile finding anything within a set of tags
Here is the HTML data:
<td>4.2.2</td>, <td align="center"><a href="https://blah.org/blah-4.2.2.zip">zip</a> (<a href="https://blah.org/blah-4.2.2.zip.md5">md5</a> | <a href="https://blah.org/blah-4.2.2.zip.sha1">sha1</a>)</td>, <td align="center"><a href="https://blah.org/blah-.2.2.tar.gz">tar.gz</a> (<a href="https://blah.org/blah-4.2.2.tar.gz.md5">md5</a>|<ahref="https://blah.org/blah-4.2.2.tar.gz.sha1">sha1</a>)</td>, <td align="center"><a href="https://blah.org/blah-4.2.2-iis.zip">iiszip</a> (<a href="https://blah.org/blah-4.2.2-iis.zip.md5">md5</a> | <a href="https://blah.org/blah-4.2.2-iis.zip.sha1">sha1</a>)</td>, <td>4.2.1</td>, <td align="center"><a href="https://blah.org/blah-4.2.1.zip">zip</a> (<a href="https://blah.org/blah-4.2.1.zip.md5">md5</a> | <a href="https://blah.org/blah-4.2.1.zip.sha1">sha1</a>)</td>, <td align="center"><a href="https://blah.org/blah-4.2.1.tar.gz">tar.gz</a> (<a href="https://blah.org/blah-4.2.1.tar.gz.md5">md5</a> | <a href="https://blah.org/blah-4.2.1.tar.gz.sha1">sha1</a>)</td>, <td align="center"><a href="https://blah.org/blah-4.2.1-iis.zip">iis zip</a> (<a href="https://blah.org/blah-4.2.1-iis.zip.md5">md5</a> | <a href="https://blah.org/blah-4.2.1-iis.zip.sha1">sha1</a>)</td>, <td>4.2</td> <td>1.0-platinum</td>
etc.
I want to iterate down the page and pull out the version numbers within the
<td>4.2.2</td>
tags, e.g.:
4.2.2
4.2.1
4.2
1.0-platinum
So far I have tried:
for tag in html.find_all('tbody', limit=1, string=re.compile("\<td\>(.*?)\<\/td\>")):
    print(tag.content)
nothing
rpart = html.find('tbody')
for tds in rpart.find_all('td'):
    print(tds.find_all('\<td\>(.*?)\<\/td>'))
nothing
results=rpart.find_all('td', tds=re.compile("\<td\>(.*?)\<\/td\>"))
nothing
wphtml.find('tbody').find_all('td', tds=re.compile('\<td\>(.*?)\<\/td\>'))
nothing
for p in rpart.find_all('td', digits=re.compile('\<td\>(.*?)\<\/td\>')):
    print(p.contents)
nothing
I did notice that rpart is of type "ResultSet", and I'm willing to bet I'm missing something small. What on God's earth am I doing wrong?
First off, there is a missing space in the last <a> tag in the third <td> (it reads <ahref=...>). That might be causing problems when parsing with BeautifulSoup.
There are two ways you can pull out the text from the data you provided:
- BeautifulSoup:
html = BeautifulSoup(htmlstring, 'html.parser')
for tag in html.find_all('td', align=None):
    print(tag.string)
- Pure regex (no BeautifulSoup):
for val in re.findall(r'<td>(.*?)</td>', htmlstring):
    print(val)
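To make the two approaches above easy to try, here is a minimal self-contained sketch. The `html_string` sample is a hypothetical, shortened stand-in for the question's data (with the missing space already fixed), and the helper names are my own. Instead of relying on how `align=None` is handled by a given Beautiful Soup version, this variant filters with `has_attr`, which should give the same cells for data shaped like this.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical, shortened sample in the same shape as the question's data.
html_string = """
<table><tbody>
<tr><td>4.2.2</td><td align="center"><a href="https://blah.org/blah-4.2.2.zip">zip</a> (<a href="https://blah.org/blah-4.2.2.zip.md5">md5</a>)</td></tr>
<tr><td>4.2.1</td><td align="center"><a href="https://blah.org/blah-4.2.1.zip">zip</a></td></tr>
<tr><td>4.2</td></tr>
<tr><td>1.0-platinum</td></tr>
</tbody></table>
"""


def versions_with_bs4(markup):
    """Return the text of every <td> that has no align attribute."""
    soup = BeautifulSoup(markup, "html.parser")
    return [
        td.get_text(strip=True)
        for td in soup.find_all("td")
        if not td.has_attr("align")
    ]


def versions_with_regex(markup):
    """Return whatever sits between a bare <td> and the next </td>."""
    return re.findall(r"<td>(.*?)</td>", markup)


bs4_versions = versions_with_bs4(html_string)
regex_versions = versions_with_regex(html_string)
print(bs4_versions)
print(regex_versions)
```

The regex version only works here because the version cells are written as a bare `<td>` with no attributes; the parser-based version is less sensitive to how the markup is spelled.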
As best I can tell, because BeautifulSoup searches through tag names when you use the "find_all" function, a pattern from re.compile is used to find tag names that match it. For example, if you wanted to find the "tbody" and "td" tags, you could use this:
for tag in html.find_all(re.compile('^t(d|body)$')):
    print(tag.string)
From the tag found, you can access its attributes or the value/string between the opening and closing tag. I've not found a way to use BeautifulSoup to find tags by their values/strings.
Here's a reference with a couple of examples, in case it helps: BeautifulSoup documentation - regular expression
Also, in BeautifulSoup, re.compile in "find_all" is used for "filtering/matching", not for capture groups. Meaning, the regex pattern only has to match. You can't use (.*?) to extract part of the value being compared in this situation.
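For what it's worth, Beautiful Soup 4.4.0 and later accept a `string` argument to `find_all` (earlier versions called it `text`), which filters on a tag's `.string` rather than on its name. The sketch below shows both kinds of filter side by side. The sample markup is a hypothetical cut-down of the question's table. Because the pattern is compared against the text inside the tag and not against the markup, a pattern containing `<td>` would not be expected to match anything, which may explain the empty results in the question.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical cut-down sample of the question's table.
html_string = (
    "<table><tbody>"
    '<tr><td>4.2.2</td><td align="center">'
    '<a href="https://blah.org/blah-4.2.2.zip">zip</a></td></tr>'
    "<tr><td>1.0-platinum</td></tr>"
    "</tbody></table>"
)
soup = BeautifulSoup(html_string, "html.parser")

# Regex as the first argument: matched against tag NAMES.
names = [tag.name for tag in soup.find_all(re.compile(r"^t(d|body)$"))]

# Regex as string=: matched against each <td>'s .string (its text).
# The pattern simply asks for text that starts with a digit.
versions = [str(td.string) for td in soup.find_all("td", string=re.compile(r"^\d"))]

print(names)     # ['tbody', 'td', 'td', 'td']
print(versions)  # ['4.2.2', '1.0-platinum']
```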
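If you do need capture groups, one way to get them is to let `find_all` do the filtering and then run the same compiled pattern over each tag's text yourself. This is only a sketch: the sample markup, the "n/a" cell, and the major/minor pattern are assumptions made for illustration, not something from the question.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical sample; the "n/a" cell is there to show the filter at work.
html_string = (
    "<table><tbody><tr>"
    "<td>4.2.2</td><td>4.2</td><td>1.0-platinum</td><td>n/a</td>"
    "</tr></tbody></table>"
)
soup = BeautifulSoup(html_string, "html.parser")

# Assumed pattern: leading "major.minor" digits of a version string.
version_re = re.compile(r"^(\d+)\.(\d+)")

parsed = []
for td in soup.find_all("td", string=version_re):  # step 1: filter only
    match = version_re.search(td.string)            # step 2: capture groups
    parsed.append((int(match.group(1)), int(match.group(2))))

print(parsed)  # [(4, 2), (4, 2), (1, 0)]
```

The filter decides which tags come back; the capture groups only become available once you apply the pattern to `td.string` directly.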