I would like to investigate (and possibly implement it) the possibility of 
using Python for processing html pages.

The actual work would look something like this:
* Retrieve pages from the net that are in any number of formats such as XML, 
XHML, HTML, HTML, with major errors in it
* Create a usable DOM for the files (considering the fact that they may have 
malformed html) OR...  extract the stuff I need directly from the potentially 
malformed html.
* If the DOM route is used, then I would need something to retrieve stuff from 
certain areas of the DOM.
Additional features needed:

I wonder, is this a good place to talk about this?

I know the goal is XML, but I think this still fits.  What libraries should I 
be looking into to do things like this?  I would prefer to look at all the 
options, if possible.
_______________________________________________
XML-SIG maillist  -  XML-SIG@python.org
http://mail.python.org/mailman/listinfo/xml-sig

Reply via email to