Sure, a patch would be fine.

On Nov 22, 2008, at 4:31 AM, Dawid Weiss wrote:


Thanks Grant. You mean this issue: https://issues.apache.org/jira/browse/SOLR-42 , I see now. This is a problem for me only, I guess, because I use HTMLStripReader independently of the Lucene architecture. This class is public, would it make sense if I provided a patch that would switch the whitespace emitting functionality on and off, depending on a particular person's use case?

Dawid

Grant Ingersoll wrote:
It is an attempt to make things work properly with the highlighter (so that offsets are correct). I believe it works most of the time, but there may still be a few issues; check JIRA.
-Grant
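
To illustrate the offset argument, here is a small Java sketch. It is not taken from Solr: the class name OffsetSketch, the helper sliceOriginal, and the sample strings are all made up for the example. It assumes the highlighter finds a term in the stripped text and then reuses that character offset against the original markup.

```java
// Illustrative only; no Solr or Lucene classes are used here.
final class OffsetSketch {

    // Find `term` in the stripped text, then cut the same character
    // range out of the original markup, as an offset-based highlighter would.
    static String sliceOriginal(String original, String stripped, String term) {
        int start = stripped.indexOf(term);
        return original.substring(start, start + term.length());
    }

    public static void main(String[] args) {
        String original = "l&oacute;d rocks";
        // Padded form: the 8-character entity becomes 1 letter + 7 spaces,
        // so every later character keeps its original position.
        String padded = "l\u00F3" + " ".repeat(7) + "d rocks";
        // Compact form: the entity collapses to 1 letter, shifting what follows.
        String compact = "l\u00F3d rocks";

        System.out.println(sliceOriginal(original, padded, "rocks"));  // rocks
        System.out.println(sliceOriginal(original, compact, "rocks")); // cute;
    }
}
```

With padding, the offset of "rocks" is the same in both strings; without it, the offset points into the middle of the entity.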
On Nov 21, 2008, at 5:29 PM, Dawid Weiss wrote:

Hi folks. What's the motivation for adding whitespace after each decoded entity in HTMLStripReader? It basically looks like this:

"l&oacute;d"

(UTF: lód, "ice" in Polish) is translated into:

"ló       d"

This happens both with numeric entities and named entities. Needless to say, these added spaces in the character stream do no good as they effectively split a single term "lód" into two meaningless terms "l" and "d".
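
For anyone who wants to reproduce the effect outside Solr, here is a minimal Java sketch of the behavior described above. It is not the HTMLStripReader code: the class EntityPadSketch, the method decode, the padToLength flag, and the two-entry entity table are all hypothetical, and only decimal numeric entities are handled.

```java
import java.util.Map;

// Illustrative sketch only; this is not how HTMLStripReader is implemented.
final class EntityPadSketch {

    // Tiny hypothetical entity table; a real stripper knows many more names.
    private static final Map<String, Character> NAMED =
            Map.of("oacute", '\u00F3', "amp", '&');

    // Decode entities in `in`. When padToLength is true, each decoded
    // character is followed by spaces so the output keeps the input's length.
    static String decode(String in, boolean padToLength) {
        StringBuilder out = new StringBuilder();
        int i = 0;
        while (i < in.length()) {
            char c = in.charAt(i);
            int semi = (c == '&') ? in.indexOf(';', i) : -1;
            Character decoded = (semi > i + 1) ? lookup(in.substring(i + 1, semi)) : null;
            if (decoded == null) {
                // Not a recognized entity: copy the character through unchanged.
                out.append(c);
                i++;
                continue;
            }
            out.append(decoded.charValue());
            if (padToLength) {
                // The entity spans (semi - i + 1) chars; one was just emitted.
                for (int k = 0; k < semi - i; k++) {
                    out.append(' ');
                }
            }
            i = semi + 1;
        }
        return out.toString();
    }

    // Resolve the text between '&' and ';' to a character, or null if unknown.
    private static Character lookup(String body) {
        if (body.startsWith("#")) {
            try {
                int cp = Integer.parseInt(body.substring(1)); // decimal form only
                return (cp > 0 && cp <= 0xFFFF) ? Character.valueOf((char) cp) : null;
            } catch (NumberFormatException e) {
                return null;
            }
        }
        return NAMED.get(body);
    }
}
```

Calling decode("l&oacute;d", true) gives "ló" followed by seven spaces and "d", which a whitespace tokenizer splits into two terms, while decode("l&oacute;d", false) gives the single term "lód". A boolean switch like this is one possible shape for turning the padding on and off.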

I can fix this in the code easily, but it looks like it was intentional, so before I write test cases and file a JIRA issue I would like to understand what the original reasons might have been (I really don't see anything this would be useful for). Apologies if I'm being dim here.

