[Announce] NekoHTML 0.6 Available

Andy Clark Sun, 12 May 2002 07:33:32 -0700

Well, I've been quite busy lately working on the NekoHTML 
parser for Xerces2 and I'm pleased to announce the latest 
version, NekoHTML 0.6, is available for download at the
following location:


  http://www.apache.org/~andyc/nekohtml/doc/index.html

There are a *lot* of changes and additions in this version.
Here's a list of what's new:

  * Added property to allow custom document filters to be 
    appended to the default NekoHTML parser pipeline; 
  * added convenience filters for serializing HTML documents 
    and removing elements from the document event stream; 
  * added samples to demonstrate the filtering feature; 
  * added experimental functionality to allow applications 
    to dynamically insert content into the HTML document 
    stream; 
  * added a minimal Xerces2 Jar file containing just the 
    files required for using the HTMLConfiguration class 
    directly to alleviate full dependence on the Xerces2 
    distribution; 
  * applied patch from Serge Proskuryakov to fix handling 
    of misplaced <title> within <body>; 
  * fixed minor tag balancing bug; and 
  * re-organized and added new documentation.

The coolest features added to this version are the ability
to append custom document filters to the parsing pipeline
by setting a property, and the (currently experimental)
ability to dynamically insert new content into the document
parsing stream.

I have included a variety of simple (but quite useful)
samples of the new filter functionality. One filter is an
HTML serializer which has the ability to change the encoding
of the document as it's being serialized -- this includes
changing the META[@http-equiv='content-type']/@content attribute
on the way out. 
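
To make the encoding change concrete, here is a small sketch
of the two steps involved: rewriting the charset parameter in
the content value and re-encoding the text. It is plain Java
and does not use the NekoHTML serializer; the class and method
names (CharsetRewriteSketch, withCharset, encode) are made up
for illustration only.

```java
import java.nio.charset.Charset;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative sketch only: hypothetical helper, not part of NekoHTML.
final class CharsetRewriteSketch {
    private static final Pattern CHARSET =
        Pattern.compile("charset\\s*=\\s*[^;\\s]+", Pattern.CASE_INSENSITIVE);

    // Returns the content value with its charset parameter set to the new
    // encoding, appending the parameter if the value did not have one.
    static String withCharset(String content, String charset) {
        Matcher m = CHARSET.matcher(content);
        String param = "charset=" + charset;
        if (m.find()) {
            return m.replaceFirst(Matcher.quoteReplacement(param));
        }
        return content + "; " + param;
    }

    // Encodes the document text in the target encoding.
    static byte[] encode(String text, String charset) {
        return text.getBytes(Charset.forName(charset));
    }

    public static void main(String[] args) {
        System.out.println(withCharset("text/html; charset=ISO-8859-1", "UTF-8"));
        System.out.println(encode("\u00e9", "UTF-8").length + " byte(s) in UTF-8");
    }
}
```

The point is only that the declared charset and the bytes
actually written have to change together.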

Another filter strips elements (and attrs) from the document 
stream. This one is useful for stripping out everything but 
rich-text elements, for example. I'm thinking about writing
a related filter that converts the remaining rich-text
elements to text which would be a good way of producing
vanilla text documents that retain the "richness".

I have also included an identity transform which basically
filters out all of the events synthesized by the tag
balancer. Why would you want to do this? Well, you might
want to receive all of the warnings/errors reported by
the tag balancer without wanting the elements that were
generated to make the document well-formed.
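
As a rough picture of what such a filter does, here is a toy
model in plain Java. The ToyEvent and ToyIdentityFilter types
and the "synthesized" flag are assumptions made for this
sketch; they are not the XNI or NekoHTML classes, and the real
filter may detect synthesized events differently.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Toy model only: hypothetical types, NOT the XNI or NekoHTML API.
final class ToyEvent {
    final String text;          // markup or character data, e.g. "<p>" or "Hi"
    final boolean synthesized;  // true if the (imagined) tag balancer invented it

    ToyEvent(String text, boolean synthesized) {
        this.text = text;
        this.synthesized = synthesized;
    }
}

final class ToyIdentityFilter {
    // Passes through only the events that came from the original document.
    static List<ToyEvent> filter(List<ToyEvent> events) {
        List<ToyEvent> kept = new ArrayList<ToyEvent>();
        for (ToyEvent e : events) {
            if (!e.synthesized) {
                kept.add(e);
            }
        }
        return kept;
    }

    // Concatenates the event text so the result is easy to inspect.
    static String render(List<ToyEvent> events) {
        StringBuilder out = new StringBuilder();
        for (ToyEvent e : events) {
            out.append(e.text);
        }
        return out.toString();
    }

    // Models a document that contained only "<p>Hi"; the balancer is
    // imagined to have added the surrounding and closing tags.
    static String demo() {
        List<ToyEvent> balanced = Arrays.asList(
            new ToyEvent("<html>", true),
            new ToyEvent("<body>", true),
            new ToyEvent("<p>", false),
            new ToyEvent("Hi", false),
            new ToyEvent("</p>", true),
            new ToyEvent("</body>", true),
            new ToyEvent("</html>", true));
        return render(filter(balanced));
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```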

Adding custom filters is incredibly easy. Simply make an
array of objects that implement the XMLDocumentFilter 
interface from XNI and set the appropriate property on
the parser. For example:

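  // Keep only basic rich-text elements; all other elements are stripped.
  ElementRemover remover = new ElementRemover();
  remover.acceptElement("b", null);
  remover.acceptElement("i", null);
  remover.acceptElement("u", null);
  // For <a>, keep the href attribute.
  remover.acceptElement("a", new String[] { "href" });

  // The remover is followed by the serializer filter.
  // (Writer here is the NekoHTML filter, not java.io.Writer.)
  XMLDocumentFilter[] filters = { remover, new Writer() };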

  SAXParser parser = new SAXParser();
  parser.setProperty("http://cyberneko.org/html/properties/filters",
                     filters);
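
If the chaining idea is unfamiliar, the following toy model
may help. It is self-contained Java that imitates a remover
stage feeding a writer stage. ToyHandler, ToyRemover and
ToyWriter are hypothetical stand-ins, not the XNI
XMLDocumentFilter interface, and the stages are linked by
hand here, whereas with NekoHTML you only set the property.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;

// Toy model only: hypothetical types, NOT the XNI XMLDocumentFilter interface.
interface ToyHandler {
    void startElement(String name, Map<String, String> attrs);
    void endElement(String name);
    void characters(String text);
}

// Last stage: serializes whatever events reach it.
final class ToyWriter implements ToyHandler {
    final StringBuilder out = new StringBuilder();

    public void startElement(String name, Map<String, String> attrs) {
        out.append('<').append(name);
        for (Map.Entry<String, String> a : attrs.entrySet()) {
            out.append(' ').append(a.getKey())
               .append("=\"").append(a.getValue()).append('"');
        }
        out.append('>');
    }

    public void endElement(String name) {
        out.append("</").append(name).append('>');
    }

    public void characters(String text) {
        out.append(text);
    }
}

// Middle stage: forwards accepted elements and attributes; text always passes.
final class ToyRemover implements ToyHandler {
    private final Map<String, Set<String>> accepted =
        new HashMap<String, Set<String>>();
    private final ToyHandler next;

    ToyRemover(ToyHandler next) {
        this.next = next;
    }

    void acceptElement(String name, String... attrNames) {
        accepted.put(name, new HashSet<String>(Arrays.asList(attrNames)));
    }

    public void startElement(String name, Map<String, String> attrs) {
        Set<String> allowed = accepted.get(name);
        if (allowed == null) {
            return; // drop the tag itself but keep its content
        }
        Map<String, String> kept = new LinkedHashMap<String, String>();
        for (Map.Entry<String, String> a : attrs.entrySet()) {
            if (allowed.contains(a.getKey())) {
                kept.put(a.getKey(), a.getValue());
            }
        }
        next.startElement(name, kept);
    }

    public void endElement(String name) {
        if (accepted.containsKey(name)) {
            next.endElement(name);
        }
    }

    public void characters(String text) {
        next.characters(text);
    }
}

final class ToyPipelineDemo {
    // Feeds <div><a href onclick>go</a></div> through remover -> writer.
    static String run() {
        ToyWriter writer = new ToyWriter();
        ToyRemover remover = new ToyRemover(writer);
        remover.acceptElement("b");
        remover.acceptElement("a", "href");

        Map<String, String> linkAttrs = new LinkedHashMap<String, String>();
        linkAttrs.put("href", "x.html");
        linkAttrs.put("onclick", "evil()");

        remover.startElement("div", new LinkedHashMap<String, String>());
        remover.startElement("a", linkAttrs);
        remover.characters("go");
        remover.endElement("a");
        remover.endElement("div");
        return writer.out.toString();
    }

    public static void main(String[] args) {
        System.out.println(run());
    }
}
```

In this model the <div> tags and the onclick attribute never
reach the writer, while the link and its text do.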

But this is all covered in the docs which I have
expanded and improved. I've separated the existing docs
into multiple pages and added a bunch of information 
about the filters, etc. And now it's finally all on my
public website so you don't have to download the package
to peruse the information.

The other big feature (which took me longer to implement
today than I thought) is the ability to insert content
into the document parsing stream. I've labeled it as
"experimental" because I'm not entirely convinced yet
that it's a good way to do it -- I'm referring to the
public API here.

There is now a method on the HTMLConfiguration called
"pushInputSource" which allows you to push a new input
source onto the stack of readers. This is the same thing
we do in the Xerces2 implementation (albeit in a more
roundabout way) but it has the net effect of changing where
the parser is scanning. When the end of that stream is
reached, the parser pops it off and continues where it
left off. Pretty cool.
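
The push/pop behaviour can be sketched with a stack of
java.io.Reader objects. This is a toy scanner written for
illustration, not the HTMLConfiguration code; the class
ToyStackedScanner and the use of '#' as an insertion point
are assumptions of the sketch, and the signature of the real
pushInputSource method is not shown here.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayDeque;
import java.util.Deque;

// Toy model only: a hypothetical scanner, NOT the HTMLConfiguration class.
final class ToyStackedScanner {
    private final Deque<Reader> readers = new ArrayDeque<Reader>();

    ToyStackedScanner(Reader initial) {
        readers.push(initial);
    }

    // Scanning switches to this reader until it is exhausted.
    void pushInputSource(Reader inserted) {
        readers.push(inserted);
    }

    // Returns the next character, or -1 once every reader is exhausted.
    int read() throws IOException {
        while (!readers.isEmpty()) {
            int c = readers.peek().read();
            if (c != -1) {
                return c;
            }
            readers.pop().close(); // end of this stream: resume the one beneath
        }
        return -1;
    }

    // Demo: '#' in the document marks where the application injects content.
    static String scan(String document, String injected) {
        ToyStackedScanner scanner =
            new ToyStackedScanner(new StringReader(document));
        StringBuilder seen = new StringBuilder();
        try {
            int c;
            while ((c = scanner.read()) != -1) {
                if (c == '#') {
                    scanner.pushInputSource(new StringReader(injected));
                } else {
                    seen.append((char) c);
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return seen.toString();
    }

    public static void main(String[] args) {
        System.out.println(scan("<p>#</p>", "<b>hi</b>"));
    }
}
```

Here the scanner sees the injected text in the middle of the
document and then carries on with the rest of the original
input once the injected reader runs dry.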

There is a new sample called Script in the src/sample/
directory that shows how it is used. Again, there's more
information in the new documentation.

Like I said, it's experimental because I may think of
a "cleaner" way of allowing applications to do this.
But then again, if it works, why fix it? So I'll just
have to see how it goes.

And lastly, I wanted to mention that this distribution
now includes a minimal Xerces Jar file for convenience.
This Jar just contains the XNI framework and the Xerces2
utility classes that are used by the NekoHTML impl. So,
if you are using the HTMLConfiguration class directly
(and *not* using the DOMParser or SAXParser which have
more dependencies), then you can just use the NekoHTML
Jar file and the minimal Xerces Jar file. This greatly
reduces the size of the required files. 

I see a huge savings because I write directly to XNI.
Compare for yourself:

    42k nekohtml.jar
    35k lib/xercesMinimal.jar

   131k lib/xmlParserAPIs.jar
  1760k lib/xercesImpl.jar
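
Adding those numbers up gives 77k for the minimal setup
against 1933k for the full one. The snippet below just does
that sum; it assumes the full setup is nekohtml.jar plus both
standard Xerces2 Jars, and the class name is made up.

```java
// Sizes in kilobytes, copied from the listing above.
final class JarFootprintSketch {
    static final int NEKOHTML = 42;
    static final int XERCES_MINIMAL = 35;
    static final int XML_PARSER_APIS = 131;
    static final int XERCES_IMPL = 1760;

    // NekoHTML plus the minimal Xerces Jar.
    static int minimalTotal() {
        return NEKOHTML + XERCES_MINIMAL;
    }

    // Assumption: the full setup is NekoHTML plus both standard Xerces2 Jars.
    static int fullTotal() {
        return NEKOHTML + XML_PARSER_APIS + XERCES_IMPL;
    }

    public static void main(String[] args) {
        System.out.println("minimal: " + minimalTotal() + "k");
        System.out.println("full:    " + fullTotal() + "k");
    }
}
```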

Okay, that's all for now. Enjoy!

-- 
Andy Clark * [EMAIL PROTECTED]

