Regular expressions won't work well for sentence boundary detection. If you want something free, you could plug in OpenNLP or GATE. Or LingPipe, but that's not free.
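
To illustrate why, here is a minimal sketch in plain Python (the `re` module, not Solr's regex fragmenter, so it says nothing about how the highlighter applies slop or fragment size). The function name `split_sentences` and the second test string are made up for illustration. It runs the pattern from the question on the Lorem ipsum passage, where it does fine, and then on text with abbreviations, where it falls apart:

```python
import re

# Pattern taken from the original question: a word character, then up to
# 200 non-terminator characters, then one sentence terminator.
SENTENCE_RE = re.compile(r"\w[^.!?]{0,200}[.!?]")


def split_sentences(text):
    """Return every non-overlapping match of SENTENCE_RE in text."""
    return SENTENCE_RE.findall(text)


passage = (
    "Lorem ipsum dolor sit amet, consectetur adipiscing elit. "
    "Nulla a neque a ipsum accumsan iaculis at id lacus. "
    "Sed magna velit, aliquam ut congue vitae, molestie quis nunc."
)

# On clean prose the pattern finds three sentences, each with its period.
for sentence in split_sentences(passage):
    print(sentence)

# Hypothetical example with abbreviations: two real sentences, but every
# period is treated as a boundary, so the pattern yields six pieces.
tricky = "Dr. Smith arrived at 5 p.m. on Jan. 5. He left early."
for piece in split_sentences(tricky):
    print(repr(piece))
```

The pattern has no way to tell a period that ends a sentence from one inside "Dr." or "p.m.", which is the kind of case the trained sentence detectors in those toolkits are built to handle.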

Otis
--
Sematext -- http://sematext.com/ -- Solr - Lucene - Nutch

----- Original Message ----
> From: Caleb Land <caleb.l...@gmail.com>
> To: solr-user@lucene.apache.org
> Sent: Tue, January 5, 2010 2:05:18 PM
> Subject: Basic sentence parsing with the regex highlighter fragmenter
>
> Hello,
> I'm using Solr 1.4, and I'm trying to get the regex fragmenter to parse
> basic sentences, and I'm running into a problem.
>
> I'm using the default regex specified in the example solr configuration:
>
> [-\w ,/\n\"']{20,200}
>
> But I am using a larger fragment size (140) with a slop of 1.0.
>
> Given the passage:
>
> Lorem ipsum dolor sit amet, consectetur adipiscing elit. Nulla a neque a
> ipsum accumsan iaculis at id lacus. Sed magna velit, aliquam ut congue
> vitae, molestie quis nunc.
>
> When I search for "Nulla" (the first word of the second sentence) and grab
> the first highlighted snippet, this is what I get:
>
> . Nulla a neque a ipsum accumsan iaculis at id lacus
>
> As you can see, there's a leading period from the previous sentence and the
> period from the current sentence is missing.
>
> I understand this regex isn't that advanced, but I've tried everything I can
> think of, regex-wise, to get this to work, and I always end up with this
> problem.
>
> For example, I've tried: \w[^.!?]{0,200}[.!?]
>
> Which seems like it should include the ending punctuation, but it doesn't,
> so I think I'm missing something.
>
> Does anybody know a regex that works?
> --
> Caleb Land