Re: HTML decoder is splitting tokens

Koji Sekiguchi Sat, 29 Aug 2009 03:02:40 -0700

Anders,

Thank you for the explanation.

> which could be written in HTML like this:
>
> use <tt>&lt;p&gt;</tt> to mark a paragraph

Ok.

> so the mapping char filter would map it into:
>
> use <tt><p></tt> to mark a paragraph

This is correct when you have the mapping definition:

"&lt;" => "<"
"&gt;" => ">"
   :              :

But I thought you would not define those, and would have only:

"&uuml;" => "ü"
"&auml;" => "ä"
   :             :

Wouldn't that solve your problem?
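
To illustrate the restricted mapping, here is a minimal Python sketch. The
names MAPPING, map_chars and strip_tags are hypothetical, and the regex-based
stripper is only a stand-in for the real HTML stripper, which may handle
entities such as &lt; on its own:

```python
import re

# Hypothetical restricted mapping: only non-markup entities are listed,
# so "&lt;" and "&gt;" are deliberately absent.
MAPPING = {"&uuml;": "ü", "&auml;": "ä"}

def map_chars(text, mapping):
    # Stand-in for a mapping char filter: plain string replacement.
    for entity, char in mapping.items():
        text = text.replace(entity, char)
    return text

def strip_tags(text):
    # Stand-in for an HTML stripper: drops anything that looks like a tag.
    return re.sub(r"<[^>]+>", "", text)

source = "<p>f&uuml;r <tt>&lt;p&gt;</tt></p>"
result = strip_tags(map_chars(source, MAPPING))
print(result)  # für &lt;p&gt;
```

Because &lt; and &gt; are never turned into real angle brackets before the
stripper runs, the stripper in this sketch cannot mistake them for a tag.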

Thank you,

Koji

Anders Melchiorsen wrote:

Koji Sekiguchi <k...@r.email.ne.jp> writes:

Thank you for attaching the patch. Sorry again, I don't have enough
time to investigate the patch or the problem you have, but I'd
recommend that you open a JIRA issue and attach the patch so that
I or someone else can look into it later.


Sorry, learning an issue tracker every time I find a bug in some
project is too much trouble. I wouldn't mind if someone else transfers
my previous mail, though.

And I didn't understand this part of your previous mail:

Adding MappingCharFilterFactory in front of the HTML stripper (so
that the latter will not see the entity) does work as expected.
That is, until I try strings like "use &lt;p&gt; to mark a
paragraph", where the HTML stripper will then remove parts of the
actual text. So this approach will not work.


Entity mapping and tag removal have to happen in one pass to keep
fidelity.

Let's say that we are analyzing a tutorial on writing HTML. It might
contain the text:

    use <p> to mark a paragraph

which could be written in HTML like this:

    use <tt>&lt;p&gt;</tt> to mark a paragraph

so the mapping char filter would map it into:

    use <tt><p></tt> to mark a paragraph

which is already wrong. Next, the HTML stripper would remove the tags:

    use to mark a paragraph

and we have now lost a part of the original text.
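
The same loss can be reproduced with a small Python sketch. This is only an
illustration under assumptions: two_pass and one_pass are hypothetical
helpers, html.unescape stands in for the mapping char filter, and a simple
regex stands in for the HTML stripper; none of this is Solr's actual code.

```python
import re
from html import unescape

TAG = re.compile(r"<[^>]+>")
# Either a tag or an entity, matched in a single left-to-right scan.
TOKEN = re.compile(r"<[^>]+>|&#?\w+;")

def two_pass(markup):
    # Pass 1: decode entities (stand-in for the mapping char filter).
    decoded = unescape(markup)
    # Pass 2: strip everything that now looks like a tag.
    return TAG.sub("", decoded)

def one_pass(markup):
    # Tags are dropped and entities decoded in the same scan, so the
    # decoded "<" and ">" are never re-examined as markup.
    def handle(match):
        token = match.group()
        return "" if token.startswith("<") else unescape(token)
    return TOKEN.sub(handle, markup)

source = "use <tt>&lt;p&gt;</tt> to mark a paragraph"
print(two_pass(source))  # use  to mark a paragraph
print(one_pass(source))  # use <p> to mark a paragraph
```

In the two-pass version the decoded <p> is indistinguishable from real
markup by the time the stripper sees it, which is why the text is lost.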


Cheers,
Anders.
