Basic sentence parsing with the regex highlighter fragmenter

Caleb Land Tue, 05 Jan 2010 11:05:49 -0800

Hello,
I'm using Solr 1.4 and trying to get the regex fragmenter to split text into
basic sentences, but I'm running into a problem.


I'm using the default regex specified in the example solr configuration:

[-\w ,/\n\"']{20,200}

But I am using a larger fragment size (140) with a slop of 1.0.
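
To make the setup concrete, here is a minimal Python sketch of the highlighting parameters described above. The field name "text" is a placeholder I made up for illustration. The length bounds assume that slop is the fraction by which a fragment may deviate from the fragment size, which is how the Solr highlighting documentation describes hl.regex.slop.

```python
from urllib.parse import urlencode

FRAGSIZE = 140
SLOP = 1.0

params = {
    "q": "Nulla",
    "hl": "true",
    "hl.fl": "text",  # hypothetical field name, not from my real schema
    "hl.fragmenter": "regex",
    "hl.fragsize": FRAGSIZE,
    "hl.regex.slop": SLOP,
    # Default pattern from the example solrconfig.xml
    "hl.regex.pattern": r"""[-\w ,/\n\"']{20,200}""",
}

# URL-encoded query string to append to the select handler URL
query_string = urlencode(params)

# Assumed fragment length window implied by fragsize and slop
min_len = int(FRAGSIZE * (1 - SLOP))
max_len = int(FRAGSIZE * (1 + SLOP))
print(query_string)
print(min_len, max_len)
```

Under that assumption, a slop of 1.0 lets a fragment be anywhere from 0 to 280 characters long.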

Given the passage:

Lorem ipsum dolor sit amet, consectetur adipiscing elit. Nulla a neque a
ipsum accumsan iaculis at id lacus. Sed magna velit, aliquam ut congue
vitae, molestie quis nunc.

When I search for "Nulla" (the first word of the second sentence) and grab
the first highlighted snippet, this is what I get:

. <em>Nulla</em> a neque a ipsum accumsan iaculis at id lacus

As you can see, there's a leading period from the previous sentence and the
period from the current sentence is missing.
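
As a sanity check on the pattern alone, here is a small Python sketch that runs the default regex over the passage. It uses Python's re module rather than Solr's fragmenter, so it only shows what the character class can match, not how Solr chooses fragment boundaries.

```python
import re

passage = (
    "Lorem ipsum dolor sit amet, consectetur adipiscing elit. "
    "Nulla a neque a ipsum accumsan iaculis at id lacus. "
    "Sed magna velit, aliquam ut congue vitae, molestie quis nunc."
)

# Default pattern from the example config; note that '.' is not in the class
default_pattern = re.compile(r"""[-\w ,/\n\"']{20,200}""")

matches = default_pattern.findall(passage)
for m in matches:
    print(repr(m))
```

Because the period is not in the character class, each match is a run of text between periods, and no match can contain the sentence-ending punctuation.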

I understand this regex isn't that advanced, but I've tried everything I can
think of, regex-wise, to get this to work, and I always end up with this
problem.

For example, I've tried: \w[^.!?]{0,200}[.!?]

Which seems like it should include the ending punctuation, but it doesn't,
so I think I'm missing something.
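
For comparison, here is the same kind of check for the pattern I tried, again in plain Python and outside Solr. It is only a sketch of how the regex behaves in isolation.

```python
import re

passage = (
    "Lorem ipsum dolor sit amet, consectetur adipiscing elit. "
    "Nulla a neque a ipsum accumsan iaculis at id lacus. "
    "Sed magna velit, aliquam ut congue vitae, molestie quis nunc."
)

# Word character, then up to 200 non-terminators, then one terminator
sentence_pattern = re.compile(r"\w[^.!?]{0,200}[.!?]")

sentences = sentence_pattern.findall(passage)
for s in sentences:
    print(repr(s))
```

On its own, the pattern matches each of the three sentences with its closing period, so the missing punctuation in the snippet may come from how the fragmenter uses the matches rather than from the regex itself.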

Does anybody know a regex that works?
-- 
Caleb Land
