[XML-SIG] HTML Processing

Ar18 Mon, 07 May 2007 18:55:18 -0700

I would like to investigate (and possibly implement) the use of Python for 
processing HTML pages.


The actual work would look something like this:
* Retrieve pages from the net that are in any number of formats, such as XML, 
XHTML, HTML, and HTML with major errors in it
* Create a usable DOM for the files (considering the fact that they may 
contain malformed HTML) OR... extract the stuff I need directly from the 
potentially malformed HTML.
* If the DOM route is used, then I would need something to retrieve stuff from 
certain areas of the DOM.
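
To make the "extract directly" option concrete, here is a rough sketch of 
what I have in mind. It assumes a current Python 3 and uses only the standard 
library's html.parser, which keeps going past unclosed tags instead of 
raising. The LinkExtractor class and the extract_links helper are 
hypothetical names made up for this illustration, and pulling out link 
targets is just a stand-in for "the stuff I need".

```python
from html.parser import HTMLParser


class LinkExtractor(HTMLParser):
    """Collect href values from <a> tags (hypothetical helper class)."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.links = []

    def handle_starttag(self, tag, attrs):
        # Tag names arrive lowercased; attrs is a list of (name, value) pairs.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value is not None:
                    self.links.append(value)


def extract_links(html_text):
    """Return every href found in html_text, in document order."""
    parser = LinkExtractor()
    parser.feed(html_text)
    parser.close()
    return parser.links


# Deliberately malformed input: the <p> and both <a> elements are never closed.
broken = (
    '<html><body><p>Unclosed paragraph '
    '<a href="http://example.com/a">one'
    '<a href="/b">two</body>'
)
print(extract_links(broken))
```

This only reacts to start tags and builds no tree at all, so it would not 
cover the DOM route; for that I assume a separate, error-tolerant tree 
builder would be needed.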
Additional features needed:

I wonder, is this a good place to talk about this?

I know the focus of this list is XML, but I think this still fits.  What 
libraries should I be looking into for this kind of work?  I would prefer to 
look at all the options, if possible.
_______________________________________________
XML-SIG maillist  -  [email protected]
http://mail.python.org/mailman/listinfo/xml-sig
