I've tracked this problem down to my use of the WordDelimiterFilter. I don't fully understand the mechanism yet, but if I add preserveOriginal="1" as an option on that filter, the highlighted fragments look fine. I think it has to do with the period being stripped from the token stream.
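As a sanity check outside Solr, here is a minimal sketch that runs the second regex from my original message against the raw passage. It uses Python's re module rather than the Java regex engine Solr uses, so it assumes the two behave the same for this simple pattern, and it does not exercise the highlighter or the analyzer at all.

```python
import re

# Passage and pattern copied from the quoted message below.
PASSAGE = (
    "Lorem ipsum dolor sit amet, consectetur adipiscing elit. "
    "Nulla a neque a ipsum accumsan iaculis at id lacus. "
    "Sed magna velit, aliquam ut congue vitae, molestie quis nunc."
)
SENTENCE_RE = re.compile(r"\w[^.!?]{0,200}[.!?]")

# Collect every non-overlapping match against the untokenized text.
sentences = SENTENCE_RE.findall(PASSAGE)
for sentence in sentences:
    print(repr(sentence))
```

On the raw text each match starts at a word character and ends with its own period, which is consistent with the regex itself being fine and the punctuation getting lost somewhere in analysis.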
On Tue, Jan 5, 2010 at 2:05 PM, Caleb Land <caleb.l...@gmail.com> wrote:

> Hello,
> I'm using Solr 1.4, and I'm trying to get the regex fragmenter to parse
> basic sentences, and I'm running into a problem.
>
> I'm using the default regex specified in the example solr configuration:
>
> [-\w ,/\n\"']{20,200}
>
> But I am using a larger fragment size (140) with a slop of 1.0.
>
> Given the passage:
>
> Lorem ipsum dolor sit amet, consectetur adipiscing elit. Nulla a neque a
> ipsum accumsan iaculis at id lacus. Sed magna velit, aliquam ut congue
> vitae, molestie quis nunc.
>
> When I search for "Nulla" (the first word of the second sentence) and grab
> the first highlighted snippet, this is what I get:
>
> . <em>Nulla</em> a neque a ipsum accumsan iaculis at id lacus
>
> As you can see, there's a leading period from the previous sentence and the
> period from the current sentence is missing.
>
> I understand this regex isn't that advanced, but I've tried everything I
> can think of, regex-wise, to get this to work, and I always end up with this
> problem.
>
> For example, I've tried: \w[^.!?]{0,200}[.!?]
>
> Which seems like it should include the ending punctuation, but it doesn't,
> so I think I'm missing something.
>
> Does anybody know a regex that works?
> --
> Caleb Land
>

--
Caleb Land