Re: Basic sentence parsing with the regex highlighter fragmenter

2010-01-07 Thread Caleb Land
On Wed, Jan 6, 2010 at 4:30 PM, Erick Erickson erickerick...@gmail.com wrote: Hmmm, I'll have to defer to the highlighter experts here. I've looked at the source code for the highlighter, and I think I know what's going on. I haven't had time to play with this yet, so I could be wrong, but

Re: Basic sentence parsing with the regex highlighter fragmenter

2010-01-07 Thread Otis Gospodnetic
Regular expressions won't work well for sentence boundary detection. If you want something free, you could plug in OpenNLP or GATE. Or LingPipe, but that's not free. Otis -- Sematext -- http://sematext.com/ -- Solr - Lucene - Nutch - Original Message From: Caleb Land
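To illustrate the point about regular expressions, here is a minimal sketch of a naive regex sentence splitter tripping over an abbreviation. The pattern, the function name `naive_sentences`, and the sample text are assumptions made up for this illustration. They are not taken from the thread or from the Solr regex fragmenter.

```python
import re

# Naive rule (an assumption for this sketch): a sentence ends at '.', '!'
# or '?' when that character is followed by whitespace.
NAIVE_BOUNDARY = re.compile(r"(?<=[.!?])\s+")


def naive_sentences(text):
    """Split text at every terminator that is followed by whitespace."""
    return NAIVE_BOUNDARY.split(text.strip())


sample = "Dr. Smith paid $3.50 for coffee. He left at noon."
for sentence in naive_sentences(sample):
    # "Dr." is reported as a sentence of its own, which is wrong.
    print(repr(sentence))
```

The decimal point in "$3.50" survives only because no whitespace follows it. The abbreviation "Dr." has no such protection, which is the kind of case a dedicated sentence detector is meant to handle.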

Re: Basic sentence parsing with the regex highlighter fragmenter

2010-01-06 Thread Erick Erickson
Hmmm, the name WordDelimiterFilterFactory might be leading you astray. Its purpose isn't to break things up into words that have anything to do with grammatical rules. Rather, its purpose is to break up strings of funky characters into searchable stuff. See:

Re: Basic sentence parsing with the regex highlighter fragmenter

2010-01-06 Thread Caleb Land
I've looked at the docs/source for WordDelimiterFilter, and I understand what it does now. Here is my configuration: http://gist.github.com/270590 I've tried the StandardTokenizerFactory instead of the WhitespaceTokenizerFactory, but I get the same problem as before, the period from the

Re: Basic sentence parsing with the regex highlighter fragmenter

2010-01-06 Thread Erick Erickson
Hmmm, I'll have to defer to the highlighter experts here. Erick On Wed, Jan 6, 2010 at 3:23 PM, Caleb Land redhatd...@gmail.com wrote: I've looked at the docs/source for WordDelimiterFilter, and I understand what it does now. Here is my configuration: http://gist.github.com/270590

Re: Basic sentence parsing with the regex highlighter fragmenter

2010-01-05 Thread Caleb Land
I've tracked this problem down to the fact that I'm using the WordDelimiterFilter. I don't quite understand what's happening, but if I add preserveOriginal=1 as an option, everything looks fine. I think it has to do with the period being stripped in the token stream. On Tue, Jan 5, 2010 at 2:05
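The suspected effect can be sketched with a toy model. This is not Solr's WordDelimiterFilter, which applies more rules than this. The function `toy_word_delimiter` and its `preserve_original` flag are hypothetical stand-ins that only mimic the idea of dropping delimiter characters and optionally keeping the unmodified token.

```python
import re


def toy_word_delimiter(token, preserve_original=False):
    """Toy model only: split a token on non-alphanumeric characters.

    The delimiter characters (including a trailing period) are dropped.
    With preserve_original=True the unmodified token is emitted as well,
    loosely imitating what the preserveOriginal=1 option is described as
    doing in the message above.
    """
    parts = [p for p in re.split(r"[^A-Za-z0-9]+", token) if p]
    if preserve_original and parts != [token]:
        return [token] + parts
    return parts


tokens = "end of sentence. Next one".split()
without = [toy_word_delimiter(t) for t in tokens]
with_original = [toy_word_delimiter(t, preserve_original=True) for t in tokens]

print(without)        # the period after "sentence" is no longer in any token
print(with_original)  # "sentence." is kept alongside "sentence"
```

In this toy model, once the period is gone from every token, nothing downstream can see the sentence boundary. Keeping the original token is one way that information could survive, which matches the guess in the message but is not confirmed by it.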