Regular expressions won't work well for sentence boundary detection:
abbreviations, decimal numbers, and quoted text all break simple
punctuation rules. If you want something free, you could plug in OpenNLP
or GATE. LingPipe is another option, but it's not free.
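
To illustrate, here is a minimal Python sketch (plain `re`, not Solr or the
highlighter itself) that runs the second pattern from the message quoted below
over the Lorem ipsum passage and over a made-up sentence pair. The `tricky`
text and the `split_sentences` helper are illustrative assumptions, not
anything from Solr.

```python
import re

# Pattern copied from the quoted message: a word character, up to 200
# non-terminator characters, then one sentence terminator.
SENTENCE_RE = re.compile(r"\w[^.!?]{0,200}[.!?]")

# Passage from the quoted message (line breaks joined with spaces).
passage = (
    "Lorem ipsum dolor sit amet, consectetur adipiscing elit. "
    "Nulla a neque a ipsum accumsan iaculis at id lacus. "
    "Sed magna velit, aliquam ut congue vitae, molestie quis nunc."
)

# Hypothetical text with abbreviations and a decimal number.
tricky = "Dr. Smith paid $3.50 for coffee. He left at 5 p.m. sharp."


def split_sentences(text):
    """Return every non-overlapping match of SENTENCE_RE in text."""
    return SENTENCE_RE.findall(text)


for label, text in (("passage", passage), ("tricky", tricky)):
    print(label)
    for piece in split_sentences(text):
        print("  ", repr(piece))
```

On the clean passage the pattern yields three matches, each ending with its
period, so in a standalone regex engine the terminator is included. On the
second text the same pattern cuts after "Dr.", inside "$3.50", and inside
"p.m.", which is the kind of failure a trained sentence detector is meant to
avoid.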

 Otis
--
Sematext -- http://sematext.com/ -- Solr - Lucene - Nutch



----- Original Message ----
> From: Caleb Land <caleb.l...@gmail.com>
> To: solr-user@lucene.apache.org
> Sent: Tue, January 5, 2010 2:05:18 PM
> Subject: Basic sentence parsing with the regex highlighter fragmenter
> 
> Hello,
> I'm using Solr 1.4, and I'm trying to get the regex fragmenter to parse
> basic sentences, and I'm running into a problem.
> 
> I'm using the default regex specified in the example solr configuration:
> 
> [-\w ,/\n\"']{20,200}
> 
> But I am using a larger fragment size (140) with a slop of 1.0.
> 
> Given the passage:
> 
> Lorem ipsum dolor sit amet, consectetur adipiscing elit. Nulla a neque a
> ipsum accumsan iaculis at id lacus. Sed magna velit, aliquam ut congue
> vitae, molestie quis nunc.
> 
> When I search for "Nulla" (the first word of the second sentence) and grab
> the first highlighted snippet, this is what I get:
> 
> . Nulla a neque a ipsum accumsan iaculis at id lacus
> 
> As you can see, there's a leading period from the previous sentence and the
> period from the current sentence is missing.
> 
> I understand this regex isn't that advanced, but I've tried everything I can
> think of, regex-wise, to get this to work, and I always end up with this
> problem.
> 
> For example, I've tried: \w[^.!?]{0,200}[.!?]
> 
> Which seems like it should include the ending punctuation, but it doesn't,
> so I think I'm missing something.
> 
> Does anybody know a regex that works?
> -- 
> Caleb Land
