Re: Trying to understand html.parser.HTMLParser
Andrew Berg, 19.05.2011 02:39:
> On 2011.05.18 03:30 AM, Stefan Behnel wrote:
>> Well, it pretty clearly states that on the PyPI page, but I also added
>> it to the project home page now. lxml 2.3 works with any CPython version
>> from 2.3 to 3.2.
> Thank you. I never would've looked at PyPI for info on a project that has
> its own site.

You should, especially for standardised information like this.

Stefan

--
http://mail.python.org/mailman/listinfo/python-list
Re: Trying to understand html.parser.HTMLParser
On 2011.05.16 02:26 AM, Karim wrote:
> Use regular expressions for bad HTML, or BeautifulSoup (google it); below
> is an example to extract all HTML links.

Actually, using a regex wasn't so bad:

    import re
    import urllib.request

    url = 'http://x264.nl/x264/?dir=./64bit/8bit_depth'
    # urlopen() returns a bytes object, need to get a normal string
    page = str(urllib.request.urlopen(url).read(), encoding='utf-8')
    rev_re = re.compile('revision[0-9][0-9][0-9][0-9]')
    num_re = re.compile('[0-9][0-9][0-9][0-9]')
    # only need the first item since the first listing is the latest revision
    rev = rev_re.findall(page)[0]  # page is already a str here
    num = num_re.findall(rev)[0]   # findall() always returns a list
    print(num)

print(num) prints out the revision number: 1995. 'revision1995' might be
useful, so I saved that to rev. This actually works pretty well for
consistently formatted lists. I suppose I went about this the wrong way -
I thought I needed to parse the HTML to get the links and do simple regexes
on those, but I can just do simple regexes on the entire HTML document.
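The two-regex pass above can also be collapsed into a single pattern with a
capturing group (a variation sketched here, not from the thread; the sample
HTML snippet is made up to mimic the directory listing):

```python
import re

# made-up snippet imitating the x264.nl listing's link text
page = '<a href="revision1995/">revision1995</a><a href="revision1990/">revision1990</a>'

# capture just the digits; search() returns the first match,
# which in this listing is the latest revision
m = re.search(r'revision([0-9]{4})', page)
num = m.group(1)
print(num)  # prints 1995
```

One pattern means one place to update if the site ever switches to
five-digit revision numbers.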
Re: Trying to understand html.parser.HTMLParser
On 05/19/2011 11:35 PM, Andrew Berg wrote:
> On 2011.05.16 02:26 AM, Karim wrote:
>> Use regular expressions for bad HTML, or BeautifulSoup (google it);
>> below is an example to extract all HTML links.
>
> Actually, using a regex wasn't so bad:
>
>     import re
>     import urllib.request
>
>     url = 'http://x264.nl/x264/?dir=./64bit/8bit_depth'
>     # urlopen() returns a bytes object, need to get a normal string
>     page = str(urllib.request.urlopen(url).read(), encoding='utf-8')
>     rev_re = re.compile('revision[0-9][0-9][0-9][0-9]')
>     num_re = re.compile('[0-9][0-9][0-9][0-9]')
>     # only need the first item; the first listing is the latest revision
>     rev = rev_re.findall(page)[0]
>     num = num_re.findall(rev)[0]  # findall() always returns a list
>     print(num)
>
> prints out the revision number: 1995. 'revision1995' might be useful, so
> I saved that to rev. This actually works pretty well for consistently
> formatted lists. I suppose I went about this the wrong way - I thought I
> needed to parse the HTML to get the links and do simple regexes on those,
> but I can just do simple regexes on the entire HTML document.

Great! Use whatever works well and is easy to code; the simpler, the
better. For more complicated link searches, where an overly complex regex
would be bug-prone, you can derive from the HTMLParser code I gave, adding
the max comparison. Either way, you have a choice of approaches, which is
cool; you're not stuck with only one solution.

Cheers,
Karim
Re: Trying to understand html.parser.HTMLParser
Andrew Berg wrote:
> ElementTree doesn't seem to have been updated in a long time, so I'll
> assume it won't work with Python 3.

I don't know how to use it, but you'll find ElementTree as xml.etree in
Python 3.

~Ethan~
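In Python 3 that looks like the sketch below (the toy document is made up
for illustration; note that ElementTree expects well-formed markup, which
the broken page discussed in this thread is not):

```python
import xml.etree.ElementTree as ET

# a well-formed toy document; the real x264.nl listing would need
# fixing before ElementTree could parse it
xml = '<index><a href="revision1995/">revision1995</a></index>'
root = ET.fromstring(xml)

# iter('a') walks every <a> element; get('href') reads its attribute
links = [a.get('href') for a in root.iter('a')]
print(links)  # prints ['revision1995/']
```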
Re: Trying to understand html.parser.HTMLParser
Andrew Berg, 17.05.2011 03:05:
> lxml looks promising, but it doesn't say anywhere whether it'll work on
> Python 3 or not

Well, it pretty clearly states that on the PyPI page, but I also added it
to the project home page now. lxml 2.3 works with any CPython version from
2.3 to 3.2.

Stefan
Re: Trying to understand html.parser.HTMLParser
On 2011.05.18 03:30 AM, Stefan Behnel wrote:
> Well, it pretty clearly states that on the PyPI page, but I also added
> it to the project home page now. lxml 2.3 works with any CPython version
> from 2.3 to 3.2.

Thank you. I never would've looked at PyPI for info on a project that has
its own site. I'll take a look at it.
Re: Trying to understand html.parser.HTMLParser
On 05/17/2011 03:05 AM, Andrew Berg wrote:
> On 2011.05.16 02:26 AM, Karim wrote:
>> Use regular expressions for bad HTML, or BeautifulSoup (google it);
>> below is an example to extract all HTML links:
>>
>>     linksList = re.findall('<a href=(.*?)>.*?</a>', htmlSource)
>>     for link in linksList:
>>         print link
>
> I was afraid I might have to use regexes (mostly because I could never
> understand them). Even the BeautifulSoup website itself admits it's awful
> with Python 3 - only the admittedly broken 3.1.0 will work with Python 3
> at all. ElementTree doesn't seem to have been updated in a long time, so
> I'll assume it won't work with Python 3. lxml looks promising, but it
> doesn't say anywhere whether it'll work on Python 3 or not, which is
> puzzling since the latest release was only a couple months ago. Actually,
> if I'm going to use regex, I might as well try to implement Versions* in
> Python. Thanks for the answers!
>
> *http://en.totalcmd.pl/download/wfx/net/Versions (original, made for
> Total Commander) and
> https://addons.mozilla.org/en-US/firefox/addon/versions-wfx_versions/
> (clone implemented as a Firefox add-on; it's so wonderful, I even wrote
> the docs for it!)

Andrew,

I wrote a class with HTMLParser to get only one link for a given project,
cf. below:

    import HTMLParser  # Python 2 module name; Python 3 renamed it html.parser

    class ResultsLinkParser(HTMLParser.HTMLParser):
        """Class ResultsLinkParser inherits from HTMLParser to extract
        the original 'Submission date' of a bug. This customized parser
        deals with the 'View Defect' HTML page from Clear DDTS.
        """

        def __init__(self):
            HTMLParser.HTMLParser.__init__(self)
            self._link = None

        def handle_starttag(self, tag, attrs):
            """Implement the standard HTMLParser customizing method."""
            if tag == 'frame':
                try:
                    attributes = dict(attrs)
                    if attributes['name'] == 'indexframe':
                        self._link = attributes['src']
                except KeyError as e:
                    print("WARNING: Attribute '{keyname}' from frame tag "
                          "in QueryResult page does not exist!".format(keyname=e))

        def link(self):
            """Return the html link of the query results page."""
            return self._link

You can use it and just modify it to get the latest revision: add some
code (and change the tag 'name' from my example) to compare each revision
number against a running max, keeping the max to compare with the next
value. I'll let you add that little code; just create
self._revision = None in __init__(self) to hold the current max revision.
After parser.feed() you can read the value via parser._revision, or add a
public parser.revision() method to get it.

Cheers,
Karim
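Karim's max-comparison suggestion might look like this in Python 3 (a
hypothetical sketch; the class name, the four-digit assumption, and the
sample input are mine, not from the thread):

```python
import re
from html.parser import HTMLParser

class MaxRevisionParser(HTMLParser):
    """Track the highest revisionNNNN link seen while feeding HTML."""

    def __init__(self):
        super().__init__()
        self._revision = None  # current max revision number, per Karim's note

    def handle_starttag(self, tag, attrs):
        if tag != 'a':
            return
        href = dict(attrs).get('href') or ''
        m = re.search(r'revision([0-9]{4})', href)  # assumes 4-digit revisions
        if m:
            num = int(m.group(1))
            if self._revision is None or num > self._revision:
                self._revision = num  # keep the max, compare with the next

    def revision(self):
        """Public accessor for the max revision, as Karim suggests."""
        return self._revision

parser = MaxRevisionParser()
parser.feed('<a href="revision1995/">a</a><a href="revision2000/">b</a>')
print(parser.revision())  # prints 2000
```

Unlike taking the first link, tracking the max does not depend on the
listing being sorted newest-first.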
Re: Trying to understand html.parser.HTMLParser
On 05/16/2011 03:06 AM, David Robinow wrote:
> On Sun, May 15, 2011 at 4:45 PM, Andrew Berg bahamutzero8...@gmail.com
> wrote:
>> I'm trying to understand why HTMLParser.feed() isn't returning the
>> whole page. My test script is this:
>>
>>     import urllib.request
>>     import html.parser
>>
>>     class MyHTMLParser(html.parser.HTMLParser):
>>         def handle_starttag(self, tag, attrs):
>>             if tag == 'a' and attrs:
>>                 print(tag, '-', attrs)
>>
>>     url = 'http://x264.nl/x264/?dir=./64bit/8bit_depth'
>>     page = urllib.request.urlopen(url).read()
>>     parser = MyHTMLParser()
>>     parser.feed(str(page))
>>
>> I can do print(page) and get the entire HTML source, but
>> parser.feed(str(page)) only spits out the information for the top links
>> and none of the revision links. Ultimately, I just want to find the name
>> of the first revision link (right now it's revision1995; when a new
>> build is uploaded it will be revision2000 or whatever). I figure this is
>> a relatively simple page; once I understand all of this, I can move on
>> to more complicated pages.
>
> You've got bad HTML. Look closely and you'll see that there's no space
> between the revision strings and the style tag following. The parser
> doesn't like this. I don't know a solution other than fixing the html.
> (I created a local copy, edited it and it worked.)

Hello,

Use regular expressions for bad HTML, or BeautifulSoup (google it); below
is an example to extract all HTML links:

    linksList = re.findall('<a href=(.*?)>.*?</a>', htmlSource)
    for link in linksList:
        print link

Cheers,
Karim
Re: Trying to understand html.parser.HTMLParser
On 2011.05.16 02:26 AM, Karim wrote:
> Use regular expressions for bad HTML, or BeautifulSoup (google it); below
> is an example to extract all HTML links:
>
>     linksList = re.findall('<a href=(.*?)>.*?</a>', htmlSource)
>     for link in linksList:
>         print link

I was afraid I might have to use regexes (mostly because I could never
understand them). Even the BeautifulSoup website itself admits it's awful
with Python 3 - only the admittedly broken 3.1.0 will work with Python 3 at
all. ElementTree doesn't seem to have been updated in a long time, so I'll
assume it won't work with Python 3. lxml looks promising, but it doesn't
say anywhere whether it'll work on Python 3 or not, which is puzzling since
the latest release was only a couple months ago. Actually, if I'm going to
use regex, I might as well try to implement Versions* in Python. Thanks for
the answers!

*http://en.totalcmd.pl/download/wfx/net/Versions (original, made for Total
Commander) and
https://addons.mozilla.org/en-US/firefox/addon/versions-wfx_versions/
(clone implemented as a Firefox add-on; it's so wonderful, I even wrote the
docs for it!)
Re: Trying to understand html.parser.HTMLParser
On Sun, May 15, 2011 at 4:45 PM, Andrew Berg bahamutzero8...@gmail.com
wrote:
> I'm trying to understand why HTMLParser.feed() isn't returning the whole
> page. My test script is this:
>
>     import urllib.request
>     import html.parser
>
>     class MyHTMLParser(html.parser.HTMLParser):
>         def handle_starttag(self, tag, attrs):
>             if tag == 'a' and attrs:
>                 print(tag, '-', attrs)
>
>     url = 'http://x264.nl/x264/?dir=./64bit/8bit_depth'
>     page = urllib.request.urlopen(url).read()
>     parser = MyHTMLParser()
>     parser.feed(str(page))
>
> I can do print(page) and get the entire HTML source, but
> parser.feed(str(page)) only spits out the information for the top links
> and none of the revision links. Ultimately, I just want to find the name
> of the first revision link (right now it's revision1995; when a new build
> is uploaded it will be revision2000 or whatever). I figure this is a
> relatively simple page; once I understand all of this, I can move on to
> more complicated pages.

You've got bad HTML. Look closely and you'll see that there's no space
between the revision strings and the style tag following. The parser
doesn't like this. I don't know a solution other than fixing the html.
(I created a local copy, edited it and it worked.)
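To pinpoint where a parse of broken markup goes off the rails, one option
is a tracing subclass that records every start tag with its position (a
debugging sketch with made-up input, not David's actual edit of the page):

```python
import html.parser

class TraceParser(html.parser.HTMLParser):
    """Record every start tag plus its (line, column), so a diff against
    the raw source shows where the parser stopped reporting tags."""

    def __init__(self):
        super().__init__()
        self.seen = []

    def handle_starttag(self, tag, attrs):
        # getpos() reports the parser's current (line, offset)
        self.seen.append((tag, self.getpos()))

p = TraceParser()
# toy input; the real broken markup from x264.nl is not reproduced here
p.feed('<a href="revision1995/">revision1995</a><style>.s{}</style>')
print([tag for tag, pos in p.seen])  # prints ['a', 'style']
```

Feeding the real page and comparing the recorded positions against the raw
HTML would show exactly which construct the parser choked on.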