Re: Basic sentence parsing with the regex highlighter fragmenter

Erick Erickson Wed, 06 Jan 2010 06:14:02 -0800

Hmmm, the name WordDelimiterFilterFactory might be leading
you astray. Its purpose isn't to break things up into "words"
that have anything to do with grammatical rules. Rather, its
purpose is to break up strings of funky characters into
searchable stuff. See:
http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.WordDelimiterFilterFactory


In the grammatical sense, PowerShot should just be
PowerShot, not power shot (which is what WordDelimiterFilterFactory
gives you, options permitting). So I think you probably want
one of the other analyzers....
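
To make the splitting concrete, here is a rough Python sketch. It is a toy
approximation only, not Solr's actual code: the function name
split_word_parts and the preserve_original flag are made up for
illustration, and the real filter has many more options (see the wiki link
above). It shows how splitting on case changes and punctuation turns
PowerShot into two parts and silently drops a trailing period, and how
keeping the original token alongside the parts retains it.

```python
import re


def split_word_parts(token, preserve_original=False):
    """Toy approximation of word-delimiter splitting (hypothetical helper)."""
    # Break on runs of non-alphanumeric characters; this discards punctuation.
    chunks = [c for c in re.split(r"[^A-Za-z0-9]+", token) if c]
    parts = []
    for chunk in chunks:
        # Break on lower-to-upper case transitions, e.g. "PowerShot".
        parts.extend(re.findall(r"[A-Z]+[a-z0-9]*|[a-z0-9]+", chunk))
    # Optionally keep the untouched token in front of its parts.
    if preserve_original and parts != [token]:
        parts.insert(0, token)
    return parts


print(split_word_parts("PowerShot"))
print(split_word_parts("elit."))
print(split_word_parts("elit.", preserve_original=True))
```

Under this sketch the period in "elit." only survives when the original
token is preserved, which is at least consistent with what Caleb saw after
adding preserveOriginal="1".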

Have you tried any other analyzers? StandardAnalyzer might be
more friendly....
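
As a sanity check on the sentence regex from the original message, it can
be run on the sample passage outside Solr. The sketch below uses Python's
re module rather than Solr's regex fragmenter, so it only shows what the
pattern matches on the raw text; it says nothing about how the fragmenter
combines the pattern with token offsets. The names PASSAGE, SENTENCE_RE and
split_sentences are made up for this example.

```python
import re

# The sample passage from the original message, joined onto one line.
PASSAGE = (
    "Lorem ipsum dolor sit amet, consectetur adipiscing elit. "
    "Nulla a neque a ipsum accumsan iaculis at id lacus. "
    "Sed magna velit, aliquam ut congue vitae, molestie quis nunc."
)

# The pattern Caleb tried: a word character, up to 200 characters that are
# not sentence-ending punctuation, then the closing punctuation mark.
SENTENCE_RE = re.compile(r"\w[^.!?]{0,200}[.!?]")


def split_sentences(text):
    """Return every non-overlapping match of the sentence pattern."""
    return SENTENCE_RE.findall(text)


for sentence in split_sentences(PASSAGE):
    print(repr(sentence))
```

On the raw text each match starts at the first word of a sentence and ends
with its period, so the pattern on its own does not look like the culprit.
That fits Caleb's observation that the problem goes away once the token
stream keeps the original tokens.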

HTH
Erick

On Tue, Jan 5, 2010 at 5:18 PM, Caleb Land <caleb.l...@gmail.com> wrote:

> I've tracked this problem down to the fact that I'm using the
> WordDelimiterFilter. I don't quite understand what's happening, but if I
> add preserveOriginal="1" as an option, everything looks fine. I think it
> has to do with the period being stripped in the token stream.
>
> On Tue, Jan 5, 2010 at 2:05 PM, Caleb Land <caleb.l...@gmail.com> wrote:
>
> > Hello,
> > I'm using Solr 1.4, and I'm trying to get the regex fragmenter to parse
> > basic sentences, and I'm running into a problem.
> >
> > I'm using the default regex specified in the example solr configuration:
> >
> > [-\w ,/\n\"']{20,200}
> >
> > But I am using a larger fragment size (140) with a slop of 1.0.
> >
> > Given the passage:
> >
> > Lorem ipsum dolor sit amet, consectetur adipiscing elit. Nulla a neque a
> > ipsum accumsan iaculis at id lacus. Sed magna velit, aliquam ut congue
> > vitae, molestie quis nunc.
> >
> > When I search for "Nulla" (the first word of the second sentence) and
> > grab the first highlighted snippet, this is what I get:
> >
> > . <em>Nulla</em> a neque a ipsum accumsan iaculis at id lacus
> >
> > As you can see, there's a leading period from the previous sentence and
> > the period from the current sentence is missing.
> >
> > I understand this regex isn't that advanced, but I've tried everything I
> > can think of, regex-wise, to get this to work, and I always end up with
> > this problem.
> >
> > For example, I've tried: \w[^.!?]{0,200}[.!?]
> >
> > Which seems like it should include the ending punctuation, but it
> > doesn't, so I think I'm missing something.
> >
> > Does anybody know a regex that works?
> > --
> > Caleb Land
> >
>
>
>
> --
> Caleb Land
>
