Well, I've been quite busy lately working on the NekoHTML parser for Xerces2 and I'm pleased to announce the latest version, NekoHTML 0.6, is available for download at the following location:
http://www.apache.org/~andyc/nekohtml/doc/index.html

There are a *lot* of changes and additions in this version. Here's a list of what's new:

* added property to allow custom document filters to be appended to the default NekoHTML parser pipeline;
* added convenience filters for serializing HTML documents and removing elements from the document event stream;
* added samples to demonstrate the filtering feature;
* added experimental functionality to allow applications to dynamically insert content into the HTML document stream;
* added a minimal Xerces2 Jar file containing just the files required for using the HTMLConfiguration class directly to alleviate full dependence on Xerces2 distribution;
* applied patch from Serge Proskuryakov to fix handling of misplaced <title> within <body>;
* fixed minor tag balancing bug; and
* re-organized and added new documentation.

The coolest features added to this version are the ability to append custom document filters to the parsing pipeline by setting a property, and the (currently experimental) ability to dynamically insert new content into the document parsing stream.

I have included a variety of simple (but quite useful) samples of the new filter functionality. One filter is an HTML serializer which has the ability to change the encoding of the document as it's being serialized -- this includes changing the META[@http-equiv='content-type']/@content attribute on the way out. Another filter strips elements (and attributes) from the document stream. This one is useful for stripping out everything but rich-text elements, for example. I'm thinking about writing a related filter that converts the remaining rich-text elements to text, which would be a good way of producing vanilla text documents that retain the "richness".

I have also included an identity transform which basically filters out all of the events synthesized by the tag balancer. Why would you want to do this?
Well, you might want to receive all of the warnings/errors reported by the tag balancer without wanting the elements that were generated to make the document well-formed.

Adding custom filters is incredibly easy. Simply make an array of objects that implement the XMLDocumentFilter interface from XNI and set the appropriate property on the parser. For example:

    // Accept only these elements; the second argument lists
    // the attributes to keep on each accepted element.
    ElementRemover remover = new ElementRemover();
    remover.acceptElement("b", null);
    remover.acceptElement("i", null);
    remover.acceptElement("u", null);
    remover.acceptElement("a", new String[] { "href" });

    // Filters run in array order: strip first, then serialize.
    XMLDocumentFilter[] filters = { remover, new Writer() };

    SAXParser parser = new SAXParser();
    parser.setProperty("http://cyberneko.org/html/properties/filters", filters);

But this is all covered in the docs, which I have expanded and improved. I've separated the existing docs into multiple pages and added a bunch of information about the filters, etc. And now it's finally all on my public website so you don't have to download the package to peruse the information.

The other big feature (which took me longer to implement today than I thought) is the ability to insert content into the document parsing stream. I've labeled it as "experimental" because I'm not entirely convinced yet that it's a good way to do it -- I'm referring to the public API here. There is now a method on the HTMLConfiguration called "pushInputSource" which allows you to push a new input source onto the stack of readers. This is the same thing we do in the Xerces2 implementation (albeit in a more roundabout way), and it has the net effect of changing where the parser is scanning. When the end of that stream is reached, the parser pops it off and continues where it left off. Pretty cool.

There is a new sample called Script in the src/sample/ directory that shows how it is used. Again, there's more information in the new documentation. Like I said, it's experimental because I may think of a "cleaner" way of allowing applications to do this.
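For readers who want a picture of the push/pop behavior described above, here is a minimal sketch that uses only java.io. It is not NekoHTML code: the ReaderStack class, its methods, and the '|' insertion marker are all hypothetical names made up for illustration, and the real pushInputSource works on XNI input sources rather than plain readers.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayDeque;
import java.util.Deque;

/** Hypothetical illustration only; not part of NekoHTML or Xerces2. */
public class ReaderStack {
    private final Deque<Reader> readers = new ArrayDeque<>();

    public ReaderStack(Reader initial) {
        readers.push(initial);
    }

    /** Scanning switches to this reader until it is exhausted. */
    public void push(Reader reader) {
        readers.push(reader);
    }

    /** Returns the next character, or -1 once every reader is exhausted. */
    public int read() throws IOException {
        while (!readers.isEmpty()) {
            int c = readers.peek().read();
            if (c != -1) {
                return c;
            }
            // End of the current stream: pop it and resume the one below.
            readers.pop().close();
        }
        return -1;
    }

    /** Replaces the made-up '|' marker with content pushed mid-scan. */
    public static String demo() {
        ReaderStack in = new ReaderStack(new StringReader("<p>A|B</p>"));
        StringBuilder out = new StringBuilder();
        try {
            int c;
            while ((c = in.read()) != -1) {
                if (c == '|') {
                    in.push(new StringReader("<b>inserted</b>"));
                } else {
                    out.append((char) c);
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(demo()); // prints <p>A<b>inserted</b>B</p>
    }
}
```

The scanning loop never needs to know that the source changed underneath it, which is the net effect described above: the scanner simply continues where it left off once the pushed stream runs dry.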
But then again, if it works, why fix it? So I'll just have to see how it goes.

And lastly, I wanted to mention that this distribution now includes a minimal Xerces Jar file for convenience. This Jar just contains the XNI framework and the Xerces2 utility classes that are used by the NekoHTML impl. So, if you are using the HTMLConfiguration class directly (and *not* using the DOMParser or SAXParser which have more dependencies), then you can just use the NekoHTML Jar file and the minimal Xerces Jar file. This greatly reduces the size of the required files. I see a huge savings because I write directly to XNI. Compare for yourself:

      42k  nekohtml.jar
      35k  lib/xercesMinimal.jar
     131k  lib/xmlParserAPIs.jar
    1760k  lib/xercesImpl.jar

Okay, that's all for now. Enjoy!

-- 
Andy Clark * [EMAIL PROTECTED]
