Re: Motivation for white space after entities in HTMLStripReader

Grant Ingersoll Sat, 22 Nov 2008 14:07:17 -0800

Sure, a patch would be fine.

On Nov 22, 2008, at 4:31 AM, Dawid Weiss wrote:

Thanks Grant. You mean this issue: https://issues.apache.org/jira/browse/SOLR-42, I see now. This is a problem for me only, I guess, because I use HTMLStripReader independently of the Lucene architecture. This class is public, so would it make sense if I provided a patch that would switch the whitespace-emitting functionality on and off, depending on a particular person's use case?
Dawid

Grant Ingersoll wrote:
It is an attempt at making things work properly with the highlighter (such that offsets are correct). I believe it works most of the time, but there still might be a few issues; check JIRA.
-Grant
On Nov 21, 2008, at 5:29 PM, Dawid Weiss wrote:
Hi folks. What's the motivation for adding exactly that many whitespace characters after an entity in HTMLStripReader? It basically looks like this:
"l&oacute;d"

(UTF: lód, "ice" in Polish) is translated into:

"ló       d"
This happens both with numeric entities and named entities. Needless to say, these added spaces in the character stream do no good, as they effectively split a single term "lód" into two meaningless terms "l" and "d".
I can fix this in the code easily, but it looks like it was intentional, so before I write test cases and file a JIRA issue I would like to understand what the original reasons might have been (I really don't see anything this would be useful for). Apologies if I'm being dim here.
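
To make the trade-off discussed in this thread concrete, here is a minimal sketch in Java. It is not the HTMLStripReader code: the class EntityPaddingSketch, the decode method, the preserveOffsets flag and the tiny entity table are all hypothetical names made up for illustration. With the flag on, each decoded entity is padded with spaces to the length of the original markup, so character offsets in the output line up with the input (which is what a highlighter needs). With the flag off, the entity collapses to a single character and "lód" stays one term.

```java
import java.util.Map;

// Hypothetical illustration only; not the actual HTMLStripReader implementation.
class EntityPaddingSketch {

    // Tiny illustrative entity table (assumption); a real decoder covers the full HTML set.
    private static final Map<String, Character> NAMED = Map.of(
            "oacute", '\u00F3',
            "amp", '&',
            "lt", '<',
            "gt", '>');

    // Decodes entities in the input. If preserveOffsets is true, every decoded
    // entity is followed by enough spaces to keep the output as long as the input.
    static String decode(String in, boolean preserveOffsets) {
        StringBuilder out = new StringBuilder();
        int i = 0;
        while (i < in.length()) {
            char c = in.charAt(i);
            int semi = (c == '&') ? in.indexOf(';', i + 1) : -1;
            Character decoded = (semi > i) ? lookup(in.substring(i + 1, semi)) : null;
            if (decoded == null) {
                // Not a recognised entity: copy the character through unchanged.
                out.append(c);
                i++;
                continue;
            }
            out.append(decoded.charValue());
            if (preserveOffsets) {
                // The entity spans (semi - i + 1) input chars; one output char is
                // already written, so pad with the remainder.
                int pad = semi - i;
                for (int k = 0; k < pad; k++) {
                    out.append(' ');
                }
            }
            i = semi + 1;
        }
        return out.toString();
    }

    // Resolves a named entity or a decimal numeric entity (BMP only) to a character.
    private static Character lookup(String name) {
        if (name.startsWith("#")) {
            try {
                int cp = Integer.parseInt(name.substring(1));
                return (cp > 0 && cp <= 0xFFFF) ? Character.valueOf((char) cp) : null;
            } catch (NumberFormatException e) {
                return null;
            }
        }
        return NAMED.get(name);
    }
}
```

Calling EntityPaddingSketch.decode("l&oacute;d", true) yields an output of the same length as the input, with the padding spaces after the ó, while passing false yields the compact "lód". A switch of this kind is the sort of on/off behaviour proposed above; how the real class would expose it is left open here.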
