On Wed, Jan 6, 2010 at 4:30 PM, Erick Erickson <erickerick...@gmail.com> wrote:

> Hmmm, I'll have to defer to the highlighter experts here....
>
>
I've looked at the source code for the highlighter, and I think I know
what's going on. I haven't had time to play with this yet, so I could be
wrong, but this is my impression.

The highlighter builds a highlighted fragment by reading tokens in, and
appending their contents to a string buffer.

Now, every time a token is appended to a fragment, it adds the "whitespace"
between the previous token and the current token (this isn't strictly
whitespace, but really anything that was removed from the source text by the
tokenizer, like punctuation etc.).

I believe what is happening in my case is that the leading ". " is the
"whitespace" between the last token (of the previous fragment) and the first
token of the current fragment.

And, of course, the trailing punctuation is being cut off because
the fragment builder never APPENDS "whitespace" after the last token; it
only prepends the "whitespace" that comes before each token.
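
A minimal sketch of the behavior described above. This is not Lucene's code:
FragmentSketch, Tok, tokenize and buildFragments are hypothetical names, and
the tokenizer is a naive stand-in that keeps only runs of letters and digits.
It only illustrates how prepending the gap text before each token puts ". "
at the start of the next fragment and drops the final terminator.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class FragmentSketch {
    /** Hypothetical token: start/end character offsets into the source text. */
    static final class Tok {
        final int start, end;
        Tok(int start, int end) { this.start = start; this.end = end; }
    }

    /** Naive stand-in tokenizer: every run of letters/digits is one token. */
    static List<Tok> tokenize(String text) {
        List<Tok> toks = new ArrayList<>();
        int i = 0;
        while (i < text.length()) {
            if (Character.isLetterOrDigit(text.charAt(i))) {
                int start = i;
                while (i < text.length() && Character.isLetterOrDigit(text.charAt(i))) i++;
                toks.add(new Tok(start, i));
            } else {
                i++;
            }
        }
        return toks;
    }

    /**
     * Builds fragments by prepending, before each token, whatever source text
     * lies between the previous token's end and this token's start.
     * A new fragment begins before each token whose index is in breakBefore.
     */
    static List<String> buildFragments(String text, List<Tok> tokens, List<Integer> breakBefore) {
        List<String> frags = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        int lastEnd = 0;
        for (int i = 0; i < tokens.size(); i++) {
            Tok t = tokens.get(i);
            if (breakBefore.contains(i)) {
                // Close the old fragment BEFORE the gap text is written,
                // so the gap lands at the head of the new fragment.
                frags.add(current.toString());
                current = new StringBuilder();
            }
            if (t.start > lastEnd) {
                current.append(text, lastEnd, t.start); // the "whitespace" (gap)
            }
            current.append(text, t.start, t.end);       // the token itself
            lastEnd = Math.max(lastEnd, t.end);
        }
        // Nothing after the last token is ever appended.
        frags.add(current.toString());
        return frags;
    }

    public static void main(String[] args) {
        String text = "First sentence. Second sentence.";
        // Assume the fragmenter decides to break before the third token.
        for (String f : buildFragments(text, tokenize(text), Arrays.asList(2))) {
            System.out.println("[" + f + "]");
        }
    }
}
```

With that input the first fragment comes out as "First sentence" and the
second as ". Second sentence", with no trailing period on either.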

You can see the code that does this, taken from
Highlighter#getBestTextFragments (line 233 in Lucene 3.0.0), here:

http://gist.github.com/271515

If I do what I said in my second email (add preserveOriginal=1 to the
WordDelimiterFilter), things work because the ending punctuation is stored
with the token, and just the real whitespace is prepended by this code.

I'm not sure what the right solution is, but for now I'm just trimming the
leading punctuation and the space after it on the client side, and leaving
the sentence without its terminator.
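
For illustration, that kind of client-side trim might look like the sketch
below. It assumes a Java client and a hypothetical helper named
trimLeadingPunctuation; the regex is one guess at "leading punctuation plus
a space", not the exact rule used here.

```java
import java.util.regex.Pattern;

class TrimSketch {
    // One or more ASCII punctuation characters at the very start of the
    // fragment, followed by at least one whitespace character.
    private static final Pattern LEADING_GAP = Pattern.compile("^\\p{Punct}+\\s+");

    /** Strips a leading run of punctuation plus the whitespace after it, if present. */
    static String trimLeadingPunctuation(String fragment) {
        return LEADING_GAP.matcher(fragment).replaceFirst("");
    }

    public static void main(String[] args) {
        System.out.println(trimLeadingPunctuation(". Second sentence"));
    }
}
```

A fragment that starts with a word is returned unchanged, and punctuation
that is directly attached to the first word (an opening parenthesis, say) is
left alone because no whitespace follows it. Highlight markup in the fragment
is not touched as long as the fragment does not start with it.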

-- 
Caleb Land
