[EMAIL PROTECTED] wrote:
> I would like to investigate (and possibly implement) the possibility of
> using Python for processing HTML pages.
>
> The actual work would look something like this:
> * Retrieve pages from the net that are in any number of formats, such as XML,
>   XHTML, HTML, and HTML with major errors in it
> * Create a usable DOM for the files (considering the fact that they may have
>   malformed HTML) OR... extract the stuff I need directly from the potentially
>   malformed HTML.
> * If the DOM route is used, then I would need something to retrieve stuff
>   from certain areas of the DOM.
> Additional features needed:
>
> I wonder, is this a good place to talk about this?
>
> I know the goal is XML, but I think this still fits. What libraries should I
> be looking into to do things like this? I would prefer to look at all the
> options, if possible.
> _______________________________________________
> XML-SIG maillist - XML-SIG@python.org
> http://mail.python.org/mailman/listinfo/xml-sig

I wrote an application to do just this. I found that the existing xml.dom module had some serious bugs, had not been touched since 2004, and had no easy way of creating and inserting subtrees in the DOM, or of working with subsets of the DOM. It looks like it was written, then abandoned for some reason. Not sure why.

I tried to use ElementTree from effbot, but also with no success. It is not DOM compliant, and its nesting is odd. For example, text appearing after a <p>...</p> tag on the same line is stuffed into a 'tail' attribute of the same node, instead of being made into a sibling node of the <p> node. I found it very odd, and not useful for DOM manipulation at all. I wrote to Mr. Lundh, and got an indifferent response.
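
The 'tail' behaviour described above can be reproduced with a minimal sketch. It assumes a Python version that ships xml.etree.ElementTree and xml.dom.minidom in the standard library; the sample markup and the two helper function names are made up for illustration and are not part of either library.

```python
import xml.etree.ElementTree as ET
from xml.dom import minidom

# Hypothetical sample: text follows the closing </p> inside the same parent.
SAMPLE = "<div><p>inside</p>after</div>"


def elementtree_view(markup):
    """Return (text, tail) of the <p> element as ElementTree stores them."""
    root = ET.fromstring(markup)
    p = root.find("p")
    # ElementTree keeps the trailing text on the <p> element itself.
    return p.text, p.tail


def dom_view(markup):
    """Return (is_text_node, data) for the node that follows <p> in a DOM tree."""
    doc = minidom.parseString(markup)
    p = doc.getElementsByTagName("p")[0]
    # In a DOM tree the trailing text is a separate sibling Text node.
    sibling = p.nextSibling
    return sibling.nodeType == sibling.TEXT_NODE, sibling.data


if __name__ == "__main__":
    print(elementtree_view(SAMPLE))
    print(dom_view(SAMPLE))
```

In the ElementTree case the <div> has a single child and the trailing text hangs off that child; in the minidom case the <div> has two children, the <p> element and a text node.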
I ended up writing my own DOM tree manager, which is DOM 2 compliant for the most part. A range() interface still needs to be fully written, which will allow it to reference anywhere in the tag structure arbitrarily. Right now I limit my DOM referencing to well-defined components of the tags and elements; I have not yet written the code for completely unrestricted referencing of content in any node, across any range. Once that is added, the module will be complete and even more DOM 2 compliant. But that functionality is not required for my app, so I may not get the chance to write it.

The module can work with any subtree and insert it using array syntax. The nesting is exactly what you'd expect in a DOM structure.

If you want this module, and you reach the point where you can help me debug and improve it, then contact me and we shall talk about the details. Based on this e-mail, it sounds like you're not there yet.

For serving a RESTful front-end app, I highly recommend CherryPy.

Best,
Gloria