Re: Getting the position of a node in the input stream (using Neko)

Martin Jericho 23 Aug 2002 00:20:02 -0000

----- Original Message -----
From: "Andy Clark" <[EMAIL PROTECTED]>
To: <[EMAIL PROTECTED]>
Sent: Thursday, August 22, 2002 2:41 AM
Subject: Re: Getting the position of a node in the input stream (using Neko)

> Martin Jericho wrote:
> > Thanks for your quick reply Andy. I was half expecting it to be this sort
> > of problem, but I was then puzzled that you can track the line and column
> > number.
>
> That's standard locator information provided by the SAX
> interfaces. So we implement that in Xerces and I decided
> to implement the same thing in NekoHTML. But in neither
> case do we track "character offsets", which I think has
> limited usefulness but others disagree.

Hopefully my arguments below will help to convince you of their usefulness.

>
> > 1. Insert some code at particular points, but it is imperative that the
> > rest of the html remains EXACTLY the same. This is not possible using the
> > Writer filter, as some properties, such as
> > http://cyberneko.org/html/properties/names/elems, do not have a "no-change"
> > option. Even if this option were present, there are still some changes made
>
> Because "no-change" has the potential of producing XML
> that is not well-formed. And the whole purpose of NekoHTML
> is to parse HTML and make it appear as XML.

So Neko has to do this because otherwise the underlying Xerces parser would not be able to parse it. Is that right? This would not concern me anyway if character positions were reported; I was just using it as an example to demonstrate that you can't get Neko to output the original source unchanged.

> Take
> the following instance:
>
> <tAbLe> ... </TaBlE>
>
> How do you handle this and still make it well formed
> in an XML sense? NekoHTML lets you transform these to
> uppercase, lowercase, or just to match the end tag w/
> whatever the start tag is. The latter option will
> produce the following:
>
> <tAbLe> ... </tAbLe>
>
> > to the output. One example is that it inserts a <COLGROUP> element, and
> > </col> tags in the wrong places. (sorry I haven't had time to report these
>
> Please let me know more detail about these bugs so
> that I can fix them. Minimal sample files would be
> preferable.

I have attached the relevant files.

>
> > bugs). The point is that I don't want to have to worry about Neko being
> > able to regenerate the original source verbatim, all I want is the character
> > positions so I can insert the code myself.
>
> Is your HTML string generated? Or serialized into a
> String object?

The HTML is created by an end user, probably using some kind of GUI tool. I want to do a one-off parse of it to insert some Velocity tags and store it in a database for later generation of dynamic content. The reason it is so important that the HTML remains exactly the same (apart from the deliberate changes) is that the GUI HTML editors are fine-tuned for all the little idiosyncrasies of any client. The end user may want the page to display in Netscape 4.5, but if Neko goes and makes even the slightest of changes, we know all too well how easy Netscape is to break.
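
To make the use case concrete, here is a minimal sketch of the splicing step, assuming the character offsets have already been obtained somehow. The class name OffsetSplicer, the method insertAll, and the Velocity snippets in the usage note are only illustrative; none of this is something Neko or XNI provides. Insertions are applied from the highest offset down, so the lower offsets stay valid and every other character of the source is left untouched.

```java
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper: splices snippets into a source string at given
// character offsets without altering any other character.
final class OffsetSplicer {

    // insertions maps an offset in the ORIGINAL source to the text to insert
    // there (one snippet per offset, since the map has unique keys).
    static String insertAll(String source, Map<Integer, String> insertions) {
        TreeMap<Integer, String> sorted = new TreeMap<>(insertions);
        StringBuilder out = new StringBuilder(source);
        // Walk from the highest offset to the lowest so that an insertion
        // never shifts the position of one that is still to be applied.
        for (Map.Entry<Integer, String> e : sorted.descendingMap().entrySet()) {
            int offset = e.getKey();
            if (offset < 0 || offset > source.length()) {
                throw new IllegalArgumentException("offset out of range: " + offset);
            }
            out.insert(offset, e.getValue());
        }
        return out.toString();
    }
}
```

For example, with the source `<tAbLe><TD>x</td></TaBlE>`, inserting `#if($show)` at offset 11 and `#end` at offset 12 gives `<tAbLe><TD>#if($show)x#end</td></TaBlE>`, with the mixed-case tags preserved exactly.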

Neko is an excellent HTML parser, but its usefulness is limited by the fact that the developer doesn't have full control over the output. Having the original character offsets in an Augmentations object (?) would provide the ultimate flexibility that would make it useful in many more applications.

I get the feeling that this would have to be implemented in the XNI framework rather than as a Neko improvement. I would love to get involved myself but, like most other people, I already have too much work on my plate.

>
> > Is there any way of doing this, assuming a unicode character set?
>
> Not unless you've stored the offsets for each separate
> line within the document String. Then you could use the
> line/column information.
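
A minimal sketch of such a line-offset table, assuming 1-based line and column numbers (as SAX locators report them), columns counted in Java chars, and \n, \r\n or \r as line terminators. The class name LineOffsets is hypothetical. Note that a SAX Locator typically reports the position just after the text of the event, so the resulting offset is not necessarily the start of a tag.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: records where each line starts in a document string
// so that a (line, column) pair can be turned into a character offset.
final class LineOffsets {
    private final int[] lineStarts;

    LineOffsets(String text) {
        List<Integer> starts = new ArrayList<>();
        starts.add(0); // line 1 starts at offset 0
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (c == '\r' && i + 1 < text.length() && text.charAt(i + 1) == '\n') {
                i++; // treat \r\n as a single line terminator
            }
            if (c == '\r' || c == '\n') {
                starts.add(i + 1); // next line begins after the terminator
            }
        }
        lineStarts = new int[starts.size()];
        for (int k = 0; k < lineStarts.length; k++) {
            lineStarts[k] = starts.get(k);
        }
    }

    // Assumption: line and column are both 1-based.
    int toOffset(int line, int column) {
        if (line < 1 || line > lineStarts.length || column < 1) {
            throw new IllegalArgumentException("bad position " + line + ":" + column);
        }
        return lineStarts[line - 1] + (column - 1);
    }
}
```
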
>
> > By the way, congratulations and thanks for your efforts so far. I can see
> > Neko being useful to me in future projects, but for my current problem I may
> > have to use JTidy.
>
> By all means. JTidy is a really nice tool.

I have since found out that JTidy doesn't do this either. The fields I thought were character offsets are only relative to an internal buffer, not the original source. In fact, the only HTML parser I have found that does it is the one in the Swing package, although I still haven't tested it properly. If that doesn't do what I want, I might even have to write my own from scratch.
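
For reference, a minimal sketch of how the Swing parser exposes positions, assuming the standard javax.swing.text.html classes. The class name SwingTagPositions and the name@pos string format are just for illustration. The pos argument is simply what the callback reports; whether it always matches the offset in the original source (for example with \r\n line endings, or for tags the parser implies) would still need to be verified.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

// Hypothetical helper: lists every tag the Swing HTML parser reports,
// together with the position it passes to the callback.
final class SwingTagPositions {

    static List<String> tagPositions(String html) {
        final List<String> found = new ArrayList<>();
        HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
            @Override
            public void handleStartTag(HTML.Tag tag, MutableAttributeSet attrs, int pos) {
                found.add(tag + "@" + pos);
            }

            @Override
            public void handleSimpleTag(HTML.Tag tag, MutableAttributeSet attrs, int pos) {
                found.add(tag + "@" + pos);
            }

            @Override
            public void handleEndTag(HTML.Tag tag, int pos) {
                found.add("/" + tag + "@" + pos);
            }
        };
        try {
            // true = ignore any charset declared inside the document
            new ParserDelegator().parse(new StringReader(html), callback, true);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return found;
    }
}
```

Printing the result for a small document and comparing each reported position against the source string is a quick way to check how far the positions can be trusted.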

> > Also, I very nearly didn't find Neko during my search for HTML parsers. Two
> > reasons for this:
> > - You have to dig pretty deep into the Xerces documentation to find out that
> > it is capable of parsing HTML. Even the FAQ says it is not possible! I
>
> Xerces is *not* capable of parsing HTML because HTML is
> not XML. However, we do mention that an HTML parser
> configuration is possible with the XNI framework. And
> perhaps NekoHTML will make it into the Xerces download
> or at least as a kind of sub-project that's explicitly
> mentioned in the Xerces pages.
>
> > - The Neko home page (http://www.apache.org/~andyc/neko/doc/html/index.html)
> > does not contain any meta keywords or anything to make it easy for search
> > engines to find it. Have you registered it with any search engines?
>
> Not really. It's rather quiet 'cause I concentrate on
> Xerces developers. However, I do announce on Freshmeat
> so that makes a lot of people aware of its existence.
>
> > Something else I found to be quite ironic, considering one of the prime uses
> > of Neko, is that its homepage isn't even valid HTML!
>
> Ummm... that's sort of the point. To show that NekoHTML
> can even parse its own sloppy documentation. :)
>
> > I found it when Google came up with one of your posts on a mailing list
> > archive. It would be a pity if people start using inferior products simply
> > because they don't know Neko exists.
>
> True. I should work harder to let people know that it
> is available.
>
> Thanks for the insight and I'd really like to hear more
> about the bugs you're experiencing. Type at you later...
>
> --
> Andy Clark * [EMAIL PROTECTED]
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: [EMAIL PROTECTED]
> For additional commands, e-mail: [EMAIL PROTECTED]

Title: Document Title

This	is a
simple	table


Test.java
Description: Binary data
