It is an attempt to make things work properly with the highlighter
(so that offsets into the original text stay correct). I believe it
works most of the time, but there may still be a few issues; check JIRA.
-Grant
On Nov 21, 2008, at 5:29 PM, Dawid Weiss wrote:
Hi folks. What's the motivation for adding a run of white spaces
after an entity reference in HTMLStripReader? It basically
looks like this:
"lód"
(UTF: lód, "ice" in Polish) is translated into:
"ló d"
This happens both with numeric entities and named entities. Needless
to say, these added spaces in the character stream do no good as
they effectively split a single term "lód" into two meaningless
terms "ló" and "d".
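
To make the trade-off concrete, here is a minimal sketch in Python (not
the actual HTMLStripReader code). It assumes the reader pads each decoded
entity with spaces so the output has the same length as the input, which
would keep character offsets aligned for the highlighter. The entity
table and the names strip_padded and strip_plain are hypothetical,
made up for illustration only.

```python
import re

# Hypothetical, tiny entity table for illustration only.
ENTITIES = {"oacute": "\u00f3", "amp": "&"}
ENTITY_RE = re.compile(r"&(#\d+|[A-Za-z]+);")

def decode(match):
    """Return the character for a numeric or named entity reference."""
    name = match.group(1)
    if name.startswith("#"):
        return chr(int(name[1:]))
    # Unknown named entities are left untouched.
    return ENTITIES.get(name, match.group(0))

def strip_padded(text):
    """Decode entities and pad with spaces so the length is unchanged."""
    def repl(m):
        ch = decode(m)
        return ch + " " * (len(m.group(0)) - len(ch))
    return ENTITY_RE.sub(repl, text)

def strip_plain(text):
    """Decode entities without padding; offsets shift left."""
    return ENTITY_RE.sub(decode, text)

source = "l&oacute;d"
padded = strip_padded(source)
plain = strip_plain(source)

# Padding keeps every later character at its original offset...
print(len(source), len(padded), source.index("d"), padded.index("d"))
# ...but a whitespace-based tokenizer now sees two terms instead of one.
print(padded.split(), plain.split())
```

In this sketch the padded form preserves offsets at the cost of
splitting the term, while the unpadded form keeps the term whole but
would need a separate offset correction to highlight the original text.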
I can fix this in the code easily, but it looks like it was
intentional, so before I write test cases and file a JIRA issue I
would like to understand what the original reasons might have been
(I really don't see anything this would be useful for). Apologies if
I'm being dim here.
Dawid
--------------------------
Grant Ingersoll
Lucene Helpful Hints:
http://wiki.apache.org/lucene-java/BasicsOfPerformance
http://wiki.apache.org/lucene-java/LuceneFAQ