I've tracked this problem down to my use of the WordDelimiterFilter. I don't
fully understand what's happening, but if I add preserveOriginal="1" as an
option on that filter, the snippets look fine. I think it has to do with the
period being stripped from the token stream.
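
To illustrate what I think is going on, here is a toy Python model. It is
not Solr's highlighter code, just a sketch under one assumption: a fragment
can only be cut at token offsets, so any text that is not inside a token
ends up at the front of the next fragment. The function name fragments() and
the two token patterns are made up for the illustration; they stand in for
an analysis chain that drops the period and one that keeps it.

```python
import re

TEXT = (
    "Lorem ipsum dolor sit amet, consectetur adipiscing elit. "
    "Nulla a neque a ipsum accumsan iaculis at id lacus. "
    "Sed magna velit, aliquam ut congue vitae, molestie quis nunc."
)

# The sentence regex from my original message, applied to the raw text
# to find the offsets where a new fragment is allowed to start.
SENTENCE = re.compile(r"\w[^.!?]{0,200}[.!?]")


def fragments(text, token_pattern):
    """Toy model: split at sentence starts, but cut only at token offsets."""
    hotspots = {m.start() for m in SENTENCE.finditer(text)}
    frags = []
    frag_start = 0  # offset where the current fragment's text begins
    last_end = 0    # end offset of the most recent token
    for tok in re.finditer(token_pattern, text):
        if tok.start() in hotspots and tok.start() > 0:
            # Close the current fragment at the end of the previous token;
            # untokenized text after it goes to the next fragment.
            frags.append(text[frag_start:last_end])
            frag_start = last_end
        last_end = tok.end()
    frags.append(text[frag_start:last_end])
    return frags


# Hypothetical stand-in: tokens without punctuation (period stripped).
stripped = fragments(TEXT, r"\w+")
# Hypothetical stand-in: token offsets that keep a trailing period.
preserved = fragments(TEXT, r"\w+\.?")

print(repr(stripped[1]))
# '. Nulla a neque a ipsum accumsan iaculis at id lacus'
print(repr(preserved[1].strip()))
# 'Nulla a neque a ipsum accumsan iaculis at id lacus.'
```

In this model the first result has the same shape as the snippet I was
getting (leading period, trailing period missing), and the second matches
what I see with preserveOriginal="1". Whether Solr behaves this way
internally is my guess, not something I have confirmed in the source.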

On Tue, Jan 5, 2010 at 2:05 PM, Caleb Land <caleb.l...@gmail.com> wrote:

> Hello,
> I'm using Solr 1.4, and I'm trying to get the regex fragmenter to parse
> basic sentences, and I'm running into a problem.
>
> I'm using the default regex specified in the example solr configuration:
>
> [-\w ,/\n\"']{20,200}
>
> But I am using a larger fragment size (140) with a slop of 1.0.
>
> Given the passage:
>
> Lorem ipsum dolor sit amet, consectetur adipiscing elit. Nulla a neque a
> ipsum accumsan iaculis at id lacus. Sed magna velit, aliquam ut congue
> vitae, molestie quis nunc.
>
> When I search for "Nulla" (the first word of the second sentence) and grab
> the first highlighted snippet, this is what I get:
>
> . <em>Nulla</em> a neque a ipsum accumsan iaculis at id lacus
>
> As you can see, there's a leading period from the previous sentence and the
> period from the current sentence is missing.
>
> I understand this regex isn't that advanced, but I've tried everything I
> can think of, regex-wise, to get this to work, and I always end up with this
> problem.
>
> For example, I've tried: \w[^.!?]{0,200}[.!?]
>
> Which seems like it should include the ending punctuation, but it doesn't,
> so I think I'm missing something.
>
> Does anybody know a regex that works?
> --
> Caleb Land
>



-- 
Caleb Land
