An even better syntax and more API:

1) No more web2pyHTMLParser; use TAG(...) instead. There is also flatten(), which removes the tags:
>>> a=TAG('<div>Hello<span>world</span></div>')
>>> print a
<div>Hello<span>world</span></div>
>>> print a.element('span')
<span>world</span>
>>> print a.flatten()
Helloworld

2) search by multiple conditions, including regex

for example, find all external links in a page

>>> import re, urllib
>>> html = urllib.urlopen('http://web2py.com').read()
>>> elements = TAG(html).elements('a',_href=re.compile('^http'))
>>> for e in elements: print e['_href']
http://web2py.com/book
http://www.python.org
http://mycti.cti.depaul.edu/people/facultyInfo_mycti.asp?id=343
....

I think we just blew BeautifulSoup out of the water.

Massimo

On May 25, 7:59 am, mdipierro <mdipie...@cs.depaul.edu> wrote:
> The entire code is 40 lines and uses the python built-in html parser.
> It will not be a problem to maintain it. Actually we could even use
> this simplify both XML(...,sanitize) and gluon.contrib.markdown.WIKI
>
> On May 25, 12:50 am, Thadeus Burgess <thade...@thadeusb.com> wrote:
>
> > > So why our own?
>
> > Because it converts it into web2py helpers.
>
> > And you don't have to deal with installing anything other than web2py.
>
> > --
> > Thadeus
>
> > On Tue, May 25, 2010 at 12:14 AM, Kevin Bowling <kevin.bowl...@gmail.com>
> > wrote:
> > > Hmm, I wonder if this is worth the possible maintenance cost? It also
> > > transcends the role of a web framework and now you are getting into
> > > network programming.
>
> > > I have a currently deployed screen scraping app and found PyQuery to
> > > be more than adequate. There is also lxml directly, or Beautiful
> > > Soup. A simple import away and they integrate with web2py or anything
> > > else just fine. So why our own?
>
> > > Regards,
> > > Kevin
>
> > > On May 24, 9:35 pm, mdipierro <mdipie...@cs.depaul.edu> wrote:
> > >> New in trunk. Screen scraping capabilities.
>
> > >> Example:
> > >> >>> import re
> > >> >>> from gluon.html import web2pyHTMLParser
> > >> >>> from urllib import urlopen
> > >> >>> html=urlopen('http://nobelprize.org/nobel_prizes/physics/laureates/1921/einstein-bi...()
> > >> >>> tree=web2pyHTMLParser(html).tree ### NEW!!
> > >> >>> elements=tree.elements('div') # search by tag type
> > >> >>> elements=tree.elements(_id="Einstein") # search by attribute value (id for example)
> > >> >>> elements=tree.elements(find='Einstein') # text search NEW!!
> > >> >>> elements=tree.elements(find=re.compile('Einstein')) # search via regex NEW!!
> > >> >>> print elements[0]
> > >> <title>Albert Einstein - Biography</title>
> > >> >>> print elements[0][0]
> > >> Albert Einstein - Biography
> > >> >>> elements[0].append(SPAN(' modified'))
> > >> <title>Albert Einstein - Biography<span>modified</span></title>
> > >> >>> print tree
> > >> <html lang="en" xml:lang="en" xmlns="http://www.w3.org/1999/xhtml">
> > >> <head>
> > >> <title>Albert Einstein - Biography<span>modified</span></title>
> > >> ...
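
For anyone wondering how a search like elements('a', _href=re.compile('^http')) and a flatten() can fit in a few dozen lines on top of the built-in parser, here is a minimal Python 3 sketch. It is not the gluon.html code. Node, TreeBuilder and parse are hypothetical names made up for illustration, and it relies only on html.parser from the standard library. It handles tag search, attribute conditions (plain string or compiled regex) and tag stripping, nothing more.

```python
import re
from html.parser import HTMLParser

# Void elements never get a closing tag, so they must not stay open on the stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class Node:
    """Hypothetical stand-in for a helper object: tag, attributes, children."""

    def __init__(self, tag, attrs=None):
        self.tag = tag
        self.attrs = dict(attrs or {})
        self.children = []  # mix of Node objects and plain strings

    def flatten(self):
        """Return the text content with all tags removed."""
        return "".join(c.flatten() if isinstance(c, Node) else c
                       for c in self.children)

    def elements(self, tag=None, **conditions):
        """Return descendants matching the tag and every attribute condition.

        Conditions use a leading underscore (_href=...) and may be a plain
        string (exact match) or a compiled regex (search).
        """
        found = []
        for child in self.children:
            if isinstance(child, Node):
                if child._matches(tag, conditions):
                    found.append(child)
                found.extend(child.elements(tag, **conditions))
        return found

    def _matches(self, tag, conditions):
        if tag is not None and self.tag != tag:
            return False
        for key, wanted in conditions.items():
            value = self.attrs.get(key.lstrip("_"))
            if value is None:
                return False
            if hasattr(wanted, "search"):      # compiled regex
                if not wanted.search(value):
                    return False
            elif value != wanted:              # plain string
                return False
        return True


class TreeBuilder(HTMLParser):
    """Builds a Node tree from the parser's start/end/data callbacks."""

    def __init__(self):
        super().__init__()
        self.root = Node("root")
        self.stack = [self.root]

    def handle_starttag(self, tag, attrs):
        node = Node(tag, attrs)
        self.stack[-1].children.append(node)
        if tag not in VOID_TAGS:
            self.stack.append(node)

    def handle_endtag(self, tag):
        # Pop back to the nearest open tag with this name; ignore stray closers.
        for i in range(len(self.stack) - 1, 0, -1):
            if self.stack[i].tag == tag:
                del self.stack[i:]
                break

    def handle_data(self, data):
        self.stack[-1].children.append(data)


def parse(html):
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    return builder.root


# Usage on an inline sample page (no network access needed).
page = ('<div>Hello<span>world</span> '
        '<a href="http://web2py.com/book">book</a>'
        '<a href="/local">local</a></div>')
tree = parse(page)
external = tree.elements("a", _href=re.compile("^http"))
for e in external:
    print(e.attrs["href"])
```

The stack-based builder is the design choice that keeps this short: each start tag pushes a node, each end tag pops back to its opener, and both the search and flatten() are then plain recursive walks over the children.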