I would like to investigate (and possibly implement it) the possibility of using Python for processing html pages.
The actual work would look something like this: * Retrieve pages from the net that are in any number of formats such as XML, XHML, HTML, HTML, with major errors in it * Create a usable DOM for the files (considering the fact that they may have malformed html) OR... extract the stuff I need directly from the potentially malformed html. * If the DOM route is used, then I would need something to retrieve stuff from certain areas of the DOM. Additional features needed: I wonder, is this a good place to talk about this? I know the goal is XML, but I think this still fits. What libraries should I be looking into to do things like this? I would prefer to look at all the options, if possible. _______________________________________________ XML-SIG maillist - XML-SIG@python.org http://mail.python.org/mailman/listinfo/xml-sig