Re: [Tutor] Another regular expression question

Kent Johnson Wed, 14 Sep 2005 06:38:44 -0700

Bernard Lebel wrote:
> Thanks Alan,
> 
> I'll check BeautifulSoup asap.
> 
> I'm using regex simply because I have no clue where to start to parse
> XML. I have read the various xml tools available in the Python
> library, however I'm at a complete loss as to what to make of them. Many
> of them seem to use some programming standards, which I am completely
> unfamiliar with (this is the first time that I dig into XML writing
> and parsing).
> 
> I don't know where to start to learn about all these standards, and as
> usual with new programming things, the documentation is hard to
> swallow (it usually is written more as a reference than a proper user
> guide/tutorial). I have to admit this is very frustrating, so if I'm
> looking at things from a wrong perspective please advise me, I need
> it.

I agree that the Python XML story is confusing, even for the modules in the
standard library. Worse, the (IMO) best solutions are not to be found in the
standard lib or PyXML at all.

The std lib and PyXML are based on the DOM and SAX standards. These standards 
were designed to be "language-neutral" - there are implementations in Python, 
Java and other languages. The good side of this is, if you learn how to use 
them, the knowledge is pretty portable to other languages. The bad side is, the 
APIs defined by the standard are IMO clunky and painful to use, especially in 
Python.
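
To give a feel for what "clunky" means here, this is a rough sketch of a DOM-style lookup using xml.dom.minidom from the standard library. The sample document and the element and attribute names (object, parameter, name, size) are made up for illustration; they are not from Bernard's file.

```python
from xml.dom import minidom

# Hypothetical sample data, invented for this illustration.
SAMPLE = """<scene>
  <object name="cube"><parameter name="size">2</parameter></object>
  <object name="sphere"><parameter name="size">5</parameter></object>
</scene>"""

def dom_sizes(xml_text):
    """Map each object's name to the text of its 'size' parameter."""
    doc = minidom.parseString(xml_text)
    sizes = {}
    for obj in doc.getElementsByTagName("object"):
        for param in obj.getElementsByTagName("parameter"):
            if param.getAttribute("name") != "size":
                continue
            # In the DOM, element text lives in child text nodes,
            # so it has to be collected by hand.
            text = "".join(
                node.data
                for node in param.childNodes
                if node.nodeType == node.TEXT_NODE
            )
            sizes[obj.getAttribute("name")] = text
    return sizes

print(dom_sizes(SAMPLE))
```

Note how much of the function is bookkeeping (node types, child lists) rather than the actual question being asked of the data.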

There is a current thread on comp.lang.python discussing this with good 
suggestions and pointers to more info:
http://groups.google.com/group/comp.lang.python/browse_frm/thread/a48891aa645ead13/dcd8fdc20b4b191b?hl=en#dcd8fdc20b4b191b

My personal preference is ElementTree. Beautiful Soup is good too, though I have
only tried it with HTML. If I were running on Linux I would try lxml, which uses
the ElementTree API and adds full XPath support. Amara looks like the Cadillac
solution - big and cushy. I haven't tried it. Uche's articles (referenced in 
the thread above) have pointers to many other choices but these seem to be the 
most popular.
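
As a minimal sketch of why I like ElementTree, here is one way to pull a specific tag out of a document. It assumes the copy of ElementTree that ships with current Python as xml.etree.ElementTree (the separately downloaded package uses a different import path). The sample XML and the names in it are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample data, invented for this illustration.
SAMPLE = """<scene>
  <object name="cube"><parameter name="size">2</parameter></object>
  <object name="sphere"><parameter name="size">5</parameter></object>
</scene>"""

def et_sizes(xml_text):
    """Map each object's name to the text of its 'size' parameter."""
    root = ET.fromstring(xml_text)
    sizes = {}
    for obj in root.findall("object"):
        for param in obj.findall("parameter"):
            # Attributes are read with get(); element text is just .text.
            if param.get("name") == "size":
                sizes[obj.get("name")] = param.text
    return sizes

print(et_sizes(SAMPLE))
```

Elements behave much like ordinary Python objects here: attributes come from get(), text from .text, and children from findall().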

My favorite XML lib is actually dom4j which is in Java. It works great with 
Jython.

Kent

> 
> So right now I'm just taking a shortcut and using ultra-simple
> re-based parser to retrieve the tags I'm looking for. I know it will
> probably be slow, but hopefully I'll get familiar with sophisticated
> parsing in the future and improve my code. As it stands right now,
> even the re syntax is not super easy to learn.

For what you are doing, re seems fine to me. You can get into trouble using
regexes with XML because of nested tags, variations in spelling and order, and
probably a bunch of other things. But for simple stuff it can work fine.
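
A small sketch of both sides of that, using a made-up tag (this is not Bernard's actual pattern or data): the regex finds the tag when it is written exactly as expected, and silently finds nothing when the same information is spelled a little differently.

```python
import re

# Hypothetical pattern: matches only this exact spelling of the opening tag.
TAG_RE = re.compile(r'<parameter name="size">(.*?)</parameter>')

def regex_sizes(xml_text):
    """Return the text of every 'size' parameter the pattern can see."""
    return TAG_RE.findall(xml_text)

# Written exactly the way the pattern expects.
simple = '<parameter name="size">2</parameter>'

# Same information to an XML parser, but with an extra attribute
# and single quotes, so the literal pattern no longer matches.
reordered = "<parameter type='int' name='size'>2</parameter>"

print(regex_sizes(simple))
print(regex_sizes(reordered))
```

If you control the files and know they are always written the same way, the first case is all you will ever see, which is why re can be good enough for simple jobs.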

Kent

> 
> 
> Kent: That works (of course!). Thanks a bunch once again!
> 
> 
> Thanks
> Bernard
> 
> On 9/14/05, Alan G <[EMAIL PROTECTED]> wrote:
> 
>>Hi Bernard,
>>
>>
>>>Hello, yet another regular expression question :-)
>>>
>>>So I have this xml file that I'm trying to find a
>>>specific tag in.
>>
>>I'm always suspicious when I see regular expression
>>and xml/html in the same context. regex are not good
>>for parsing xml/html files and it's usually much easier
>>to use a proper parser - such as beautiful soup.
>>
>>http://www.crummy.com/software/BeautifulSoup/
>>
>>Is there any special reason why you are using a regex
>>sledgehammer to crack this particular nut? Or is it
>>just to gain experience using regex?
>>
>>Alan G.
>>
> 
> _______________________________________________
> Tutor maillist  -  [email protected]
> http://mail.python.org/mailman/listinfo/tutor
> 
> 