On Wed, Jan 6, 2010 at 4:30 PM, Erick Erickson erickerick...@gmail.com wrote:
Hmmm, I'll have to defer to the highlighter experts here
I've looked at the source code for the highlighter, and I think I know
what's going on. I haven't had time to play with this yet, so I could be
wrong, but
Regular expressions won't work well for sentence boundary detection.
If you want something free, you could plug in OpenNLP or GATE. Or LingPipe,
but that's not free.
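To illustrate the regex problem, here is a minimal sketch, not taken from the thread: the pattern, the function name naive_sentences, and the sample text are all made up for the example. It splits on sentence punctuation followed by whitespace, which is about what a simple regex-based splitter would do, and it trips over ordinary abbreviations.

```python
import re

# Naive rule (an assumption for this sketch): a sentence ends at
# '.', '!' or '?' when that character is followed by whitespace.
NAIVE_BOUNDARY = re.compile(r"(?<=[.!?])\s+")


def naive_sentences(text):
    """Split text wherever the naive boundary pattern matches."""
    return NAIVE_BOUNDARY.split(text.strip())


# Two real sentences, with abbreviations and a decimal number.
sample = "Dr. Smith paid $3.50 for it. He left at 5 p.m. yesterday."
sentences = naive_sentences(sample)

for s in sentences:
    print(repr(s))
```

The sample holds two sentences, but the naive rule reports four pieces because "Dr." and "p.m." look like sentence ends. Patching the pattern with an abbreviation list helps only until the next unlisted abbreviation shows up, which is why a trained sentence detector tends to do better.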
Otis
--
Sematext -- http://sematext.com/ -- Solr - Lucene - Nutch
----- Original Message -----
From: Caleb Land
Hmmm, the name WordDelimiterFilterFactory might be leading
you astray. Its purpose isn't to break things up into words
that have anything to do with grammatical rules. Rather, its
purpose is to break up strings of funky characters into
searchable stuff. See:
I've looked at the docs/source for WordDelimiterFilter, and I understand
what it does now.
Here is my configuration:
http://gist.github.com/270590
I've tried the StandardTokenizerFactory instead of the
WhitespaceTokenizerFactory, but I get the same problem as before, the
period from the
Hmmm, I'll have to defer to the highlighter experts here
Erick
On Wed, Jan 6, 2010 at 3:23 PM, Caleb Land redhatd...@gmail.com wrote:
I've looked at the docs/source for WordDelimiterFilter, and I understand
what it does now.
Here is my configuration:
http://gist.github.com/270590
I've tracked this problem down to the fact that I'm using the
WordDelimiterFilter. I don't quite understand what's happening, but if I
add preserveOriginal=1 as an option, everything looks fine. I think it has
to do with the period being stripped in the token stream.
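A toy sketch of what "the period being stripped" could look like, under loud assumptions: toy_word_delimiter is a hypothetical stand-in written for this example, not Lucene's WordDelimiterFilter, which has many more rules (case changes, numbers, catenation and so on). It only shows how keeping the original token alongside the split parts lets the trailing period survive in the token stream.

```python
import re


def toy_word_delimiter(token, preserve_original=False):
    """Toy stand-in, NOT the real WordDelimiterFilter.

    Splits a token on runs of non-alphanumeric characters and drops
    the delimiters. With preserve_original=True, the untouched token
    is emitted as well whenever splitting changed anything.
    """
    parts = [p for p in re.split(r"[^A-Za-z0-9]+", token) if p]
    if preserve_original and parts != [token]:
        return [token] + parts
    return parts


# Whitespace tokenization first, as with WhitespaceTokenizerFactory.
tokens = "The quick fox jumped.".split()

stripped = [toy_word_delimiter(t) for t in tokens]
preserved = [toy_word_delimiter(t, preserve_original=True) for t in tokens]

print(stripped[-1])   # last token without the original preserved
print(preserved[-1])  # last token with the original preserved
```

In the first case the last token is only "jumped", so nothing in the stream covers the period any more; in the second, "jumped." is still there next to "jumped". Whether this is what actually confuses the highlighter is a guess on top of the guess quoted above.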
On Tue, Jan 5, 2010 at 2:05