Re: Motivation for white space after entities in HTMLStripReader

Grant Ingersoll Fri, 21 Nov 2008 19:34:02 -0800

It is an attempt to make things work properly with the highlighter (so that character offsets stay correct). I believe it works most of the time, but there might still be a few issues; check JIRA.
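
To make the offset point concrete, here is a minimal sketch in Java. It is not the actual HTMLStripReader code: the class name, the method name and the tiny entity table are all made up for illustration, and it assumes Java 9+ for Map.of. It decodes an entity and pads with spaces so the output has the same length as the input, which keeps every later character at its original offset.

```java
import java.util.Map;

// Hypothetical demo class; not part of Solr or Lucene.
class EntityPaddingDemo {
    // Tiny illustrative entity table (assumption); a real stripper knows many more.
    private static final Map<String, Character> ENTITIES =
        Map.of("oacute", '\u00F3', "amp", '&', "lt", '<', "gt", '>');

    /** Decodes known named entities and pads with spaces so output length equals input length. */
    public static String decodePadded(String in) {
        StringBuilder out = new StringBuilder(in.length());
        int i = 0;
        while (i < in.length()) {
            char c = in.charAt(i);
            // Look for the terminating ';' only when we are at an '&'.
            int semi = (c == '&') ? in.indexOf(';', i) : -1;
            Character decoded =
                (semi > i) ? ENTITIES.get(in.substring(i + 1, semi)) : null;
            if (decoded == null) {
                // Not a known entity: copy the character through unchanged.
                out.append(c);
                i++;
                continue;
            }
            // Emit the decoded character, then one space for each remaining
            // character of the entity, so later offsets do not shift.
            out.append(decoded.charValue());
            for (int k = i + 1; k <= semi; k++) {
                out.append(' ');
            }
            i = semi + 1;
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String raw = "l&oacute;d is cold";
        String padded = decodePadded(raw);
        // An offset found in the stripped text points at the same word in the raw text.
        int start = padded.indexOf("cold");
        System.out.println("[" + padded + "]");
        System.out.println(raw.substring(start, start + 4)); // prints "cold"
    }
}
```

The trade-off is visible in the output: the offsets line up with the raw input, but the padding lands in the middle of the word, which is the term-splitting problem described in the quoted message below.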


-Grant


On Nov 21, 2008, at 5:29 PM, Dawid Weiss wrote:

Hi folks. What's the motivation for padding an entity with whitespace in HTMLStripReader, apparently with exactly as many characters as are needed to match the length of the original entity? It basically looks like this:
"l&oacute;d"

(UTF: lód, "ice" in Polish) is translated into:

"ló       d"

This happens with both numeric and named entities. Needless to say, these added spaces in the character stream do no good, as they effectively split a single term "lód" into two meaningless terms "l" and "d".

I can fix this in the code easily, but it looks like it was intentional, so before I write test cases and file a JIRA issue I would like to understand what the original reasons might have been (I really don't see anything this would be useful for). Apologies if I'm being dim here.

Dawid


--------------------------
Grant Ingersoll

Lucene Helpful Hints:
http://wiki.apache.org/lucene-java/BasicsOfPerformance
http://wiki.apache.org/lucene-java/LuceneFAQ
