Re: Trying to understand html.parser.HTMLParser

2011-05-19 Thread Stefan Behnel

Andrew Berg, 19.05.2011 02:39:

On 2011.05.18 03:30 AM, Stefan Behnel wrote:

Well, it pretty clearly states that on the PyPI page, but I also added it
to the project home page now. lxml 2.3 works with any CPython version from
2.3 to 3.2.

Thank you. I never would've looked at PyPI for info on a project that
has its own site.


You should, especially for standardised information like this.

Stefan

--
http://mail.python.org/mailman/listinfo/python-list


Re: Trying to understand html.parser.HTMLParser

2011-05-19 Thread Andrew Berg
On 2011.05.16 02:26 AM, Karim wrote:
 Use regular expressions for bad HTML, or BeautifulSoup (google it); below
 is an example to extract all html links:
Actually, using regex wasn't so bad:
 import re
 import urllib.request

 url = 'http://x264.nl/x264/?dir=./64bit/8bit_depth'
 # urlopen() returns a bytes object; decode it to get a normal string
 page = str(urllib.request.urlopen(url).read(), encoding='utf-8')
 rev_re = re.compile('revision[0-9][0-9][0-9][0-9]')
 num_re = re.compile('[0-9][0-9][0-9][0-9]')
 # only need the first item since the first listing is the latest revision
 rev = rev_re.findall(page)[0]
 num = num_re.findall(rev)[0]  # findall() always returns a list
 print(num)
This prints out the revision number, 1995. The full string 'revision1995'
might be useful too, so I saved that to rev.

This actually works pretty well for consistently formatted lists. I
suppose I went about this the wrong way - I thought I needed to parse
the HTML to get the links and do simple regexes on those, but I can just
do simple regexes on the entire HTML document.
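
For what it's worth, the two patterns can also be collapsed into one with a
capture group. A minimal sketch of that variant (same URL and page-layout
assumptions as above):

import re
import urllib.request

url = 'http://x264.nl/x264/?dir=./64bit/8bit_depth'
page = urllib.request.urlopen(url).read().decode('utf-8')
# re.search() returns the first match, so no list indexing is needed.
match = re.search(r'revision([0-9]{4})', page)
if match:
    print(match.group(0))  # the whole match, e.g. 'revision1995'
    print(match.group(1))  # just the captured digits, e.g. '1995'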
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Trying to understand html.parser.HTMLParser

2011-05-19 Thread Karim

On 05/19/2011 11:35 PM, Andrew Berg wrote:

On 2011.05.16 02:26 AM, Karim wrote:

Use regular expressions for bad HTML, or BeautifulSoup (google it); below
is an example to extract all html links:

Actually, using regex wasn't so bad:

import re
import urllib.request

url = 'http://x264.nl/x264/?dir=./64bit/8bit_depth'
# urlopen() returns a bytes object; decode it to get a normal string
page = str(urllib.request.urlopen(url).read(), encoding='utf-8')
rev_re = re.compile('revision[0-9][0-9][0-9][0-9]')
num_re = re.compile('[0-9][0-9][0-9][0-9]')
# only need the first item since the first listing is the latest revision
rev = rev_re.findall(page)[0]
num = num_re.findall(rev)[0]  # findall() always returns a list
print(num)

This prints out the revision number, 1995. The full string 'revision1995'
might be useful too, so I saved that to rev.

This actually works pretty well for consistently formatted lists. I
suppose I went about this the wrong way - I thought I needed to parse
the HTML to get the links and do simple regexes on those, but I can just
do simple regexes on the entire HTML document.

Good for you!
Use whatever works well and is easy to code; the simpler, the better.
For more complicated link searches, to avoid overly complex and bug-prone
regexes, you can derive from the HTMLParser code I gave earlier, adding a
max comparison (a sketch appears under that message below).
Anyway, you get a choice, which is cool - not being stuck on only one
solution.

Cheers
Karim
--
http://mail.python.org/mailman/listinfo/python-list


Re: Trying to understand html.parser.HTMLParser

2011-05-19 Thread Ethan Furman

Andrew Berg wrote:

ElementTree doesn't seem to have been updated in a long time, so I'll
assume it won't work with Python 3.


I don't know how to use it, but you'll find ElementTree as xml.etree in 
Python 3.
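
For reference, a minimal usage sketch: ElementTree expects well-formed
markup, so it would still choke on the broken page discussed in this
thread, but on a tidy, made-up fragment it looks like this:

from xml.etree import ElementTree

snippet = '<html><body><a href="x264-r1995.exe">revision1995</a></body></html>'
root = ElementTree.fromstring(snippet)
for link in root.iter('a'):  # iterate over every 'a' element
    print(link.get('href'), link.text)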


~Ethan~
--
http://mail.python.org/mailman/listinfo/python-list


Re: Trying to understand html.parser.HTMLParser

2011-05-18 Thread Stefan Behnel

Andrew Berg, 17.05.2011 03:05:

lxml looks promising, but it doesn't say anywhere whether it'll work on
Python 3 or not


Well, it pretty clearly states that on the PyPI page, but I also added it 
to the project home page now. lxml 2.3 works with any CPython version from 
2.3 to 3.2.


Stefan

--
http://mail.python.org/mailman/listinfo/python-list


Re: Trying to understand html.parser.HTMLParser

2011-05-18 Thread Andrew Berg
On 2011.05.18 03:30 AM, Stefan Behnel wrote:
 Well, it pretty clearly states that on the PyPI page, but I also added it 
 to the project home page now. lxml 2.3 works with any CPython version from 
 2.3 to 3.2.
Thank you. I never would've looked at PyPI for info on a project that
has its own site. I'll take a look at it.
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Trying to understand html.parser.HTMLParser

2011-05-17 Thread Karim

On 05/17/2011 03:05 AM, Andrew Berg wrote:

On 2011.05.16 02:26 AM, Karim wrote:

Use regular expressions for bad HTML, or BeautifulSoup (google it); below
is an example to extract all html links:

linksList = re.findall('<a href=(.*?)>.*?</a>', htmlSource)
for link in linksList:
    print link

I was afraid I might have to use regexes (mostly because I could never
understand them).
Even the BeautifulSoup website itself admits it's awful with Python 3 -
only the admittedly broken 3.1.0 will work with Python 3 at all.
ElementTree doesn't seem to have been updated in a long time, so I'll
assume it won't work with Python 3.
lxml looks promising, but it doesn't say anywhere whether it'll work on
Python 3 or not, which is puzzling since the latest release was only a
couple months ago.

Actually, if I'm going to use regex, I might as well try to implement
Versions* in Python.

Thanks for the answers!

*http://en.totalcmd.pl/download/wfx/net/Versions (original, made for
Total Commander) and
https://addons.mozilla.org/en-US/firefox/addon/versions-wfx_versions/
(clone implemented as a Firefox add-on; it's so wonderful, I even wrote
the docs for it!)


Andrew,

I wrote a class with HTMLParser to get only one link for a given
project; see below:


class ResultsLinkParser(HTMLParser.HTMLParser):
    """Class ResultsLinkParser inherits from HTMLParser to extract
    the original 'Submission date' of a bug.
    This customized parser deals with the 'View Defect' HTML
    page from Clear DDTS.
    """

    def __init__(self):
        HTMLParser.HTMLParser.__init__(self)
        self._link = None

    def handle_starttag(self, tag, attrs):
        """Implement the standard HTMLParser customizing method."""
        if tag == 'frame':
            try:
                attributes = dict(attrs)
                if attributes['name'] == 'indexframe':
                    self._link = attributes['src']
            except KeyError, e:
                print("WARNING: Attribute '{keyname}' from frame tag "
                      "in QueryResult page does not exist!".format(keyname=e))

    def link(self):
        """Return the html link of the query results page."""
        return self._link

You can use it and just modify it to get the latest revision: add some code
(and change the tag and the 'name' attribute of my example) to compare each
revision number against the current max and keep the larger one for the
next comparison. Just create self._revision = None in __init__(self) to
hold the current max revision. After parser.feed() you can read the value
via parser._revision, or through a public parser.revision() method. A
sketch along these lines follows.
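
Here is a minimal sketch of that modification, written against Python 3's
html.parser rather than Karim's Python 2 module; the 'a' tag and the
'revisionNNNN' link text are assumptions about the target page:

import re
from html.parser import HTMLParser

class MaxRevisionParser(HTMLParser):
    """Track the highest revisionNNNN number seen in link text."""

    def __init__(self):
        super().__init__()
        self._revision = None  # current max revision number, or None
        self._in_link = False

    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            self._in_link = True

    def handle_endtag(self, tag):
        if tag == 'a':
            self._in_link = False

    def handle_data(self, data):
        # Only look at text inside a link.
        if self._in_link:
            m = re.match(r'revision([0-9]{4})', data)
            if m:
                num = int(m.group(1))
                if self._revision is None or num > self._revision:
                    self._revision = num

    def revision(self):
        """Return the highest revision number found, or None."""
        return self._revision

Usage would be: parser = MaxRevisionParser(); parser.feed(page);
print(parser.revision()).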


Cheers
Karim


--
http://mail.python.org/mailman/listinfo/python-list


Re: Trying to understand html.parser.HTMLParser

2011-05-16 Thread Karim

On 05/16/2011 03:06 AM, David Robinow wrote:

On Sun, May 15, 2011 at 4:45 PM, Andrew Berg bahamutzero8...@gmail.com wrote:

I'm trying to understand why HTMLParser.feed() isn't returning the whole
page. My test script is this:

import urllib.request
import html.parser
class MyHTMLParser(html.parser.HTMLParser):
    def handle_starttag(self, tag, attrs):
        if tag == 'a' and attrs:
            print(tag, '-', attrs)

url = 'http://x264.nl/x264/?dir=./64bit/8bit_depth'
page = urllib.request.urlopen(url).read()
parser = MyHTMLParser()
parser.feed(str(page))

I can do print(page) and get the entire HTML source, but
parser.feed(str(page)) only spits out the information for the top links
and none of the revision links. Ultimately, I just want to find
the name of the first revision link (right now it's
'revision1995'; when a new build is uploaded it will be 'revision2000'
or whatever). I figure this is a relatively simple page; once I
understand all of this, I can move on to more complicated pages.

You've got bad HTML. Look closely and you'll see that there's no space
between the revision strings and the style tag that follows.
The parser doesn't like this. I don't know a solution other than
fixing the html.
(I created a local copy, edited it and it worked.)

Hello,

Use regular expressions for bad HTML, or BeautifulSoup (google it); below
is an example to extract all html links:


linksList = re.findall('<a href=(.*?)>.*?</a>', htmlSource)
for link in linksList:
    print link
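
A quick self-contained demo of that pattern (the sample markup is made up,
and print is written as a Python 3 function here):

import re

htmlSource = ('<a href="x264-r1995.exe">revision1995</a>'
              '<a href="x264-r1990.exe">revision1990</a>')
linksList = re.findall('<a href=(.*?)>.*?</a>', htmlSource)
for link in linksList:
    print(link)  # prints "x264-r1995.exe" then "x264-r1990.exe",
                 # quotes included, since the pattern captures them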

Cheers
Karim
--
http://mail.python.org/mailman/listinfo/python-list


Re: Trying to understand html.parser.HTMLParser

2011-05-16 Thread Andrew Berg
On 2011.05.16 02:26 AM, Karim wrote:
 Use regular expressions for bad HTML, or BeautifulSoup (google it); below
 is an example to extract all html links:

 linksList = re.findall('<a href=(.*?)>.*?</a>', htmlSource)
 for link in linksList:
     print link
I was afraid I might have to use regexes (mostly because I could never
understand them).
Even the BeautifulSoup website itself admits it's awful with Python 3 -
only the admittedly broken 3.1.0 will work with Python 3 at all.
ElementTree doesn't seem to have been updated in a long time, so I'll
assume it won't work with Python 3.
lxml looks promising, but it doesn't say anywhere whether it'll work on
Python 3 or not, which is puzzling since the latest release was only a
couple months ago.

Actually, if I'm going to use regex, I might as well try to implement
Versions* in Python.

Thanks for the answers!

*http://en.totalcmd.pl/download/wfx/net/Versions (original, made for
Total Commander) and
https://addons.mozilla.org/en-US/firefox/addon/versions-wfx_versions/
(clone implemented as a Firefox add-on; it's so wonderful, I even wrote
the docs for it!)
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Trying to understand html.parser.HTMLParser

2011-05-15 Thread David Robinow
On Sun, May 15, 2011 at 4:45 PM, Andrew Berg bahamutzero8...@gmail.com wrote:
 I'm trying to understand why HTMLParser.feed() isn't returning the whole
 page. My test script is this:

 import urllib.request
 import html.parser
 class MyHTMLParser(html.parser.HTMLParser):
     def handle_starttag(self, tag, attrs):
         if tag == 'a' and attrs:
             print(tag, '-', attrs)

 url = 'http://x264.nl/x264/?dir=./64bit/8bit_depth'
 page = urllib.request.urlopen(url).read()
 parser = MyHTMLParser()
 parser.feed(str(page))

 I can do print(page) and get the entire HTML source, but
 parser.feed(str(page)) only spits out the information for the top links
 and none of the revision links. Ultimately, I just want to find
 the name of the first revision link (right now it's
 'revision1995'; when a new build is uploaded it will be 'revision2000'
 or whatever). I figure this is a relatively simple page; once I
 understand all of this, I can move on to more complicated pages.
You've got bad HTML. Look closely and you'll see that there's no space
between the revision strings and the style tag that follows.
The parser doesn't like this. I don't know a solution other than
fixing the html.
(I created a local copy, edited it and it worked.)
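
That hand edit could also be done programmatically before feeding the
parser. A hedged sketch, assuming the glitch is exactly the missing space
described above:

import re
import urllib.request

url = 'http://x264.nl/x264/?dir=./64bit/8bit_depth'
page = urllib.request.urlopen(url).read().decode('utf-8')
# Re-insert the missing space after each 'revisionNNNN' string so that
# html.parser sees properly separated tags.
page = re.sub(r'(revision[0-9]+)<', r'\1 <', page)
# parser.feed(page) should now see the revision links as well.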
-- 
http://mail.python.org/mailman/listinfo/python-list