Re: Basic sentence parsing with the regex highlighter fragmenter

Erick Erickson Wed, 06 Jan 2010 13:30:49 -0800

Hmmm, I'll have to defer to the highlighter experts here....

Erick


On Wed, Jan 6, 2010 at 3:23 PM, Caleb Land <redhatd...@gmail.com> wrote:

> I've looked at the docs/source for WordDelimiterFilter, and I understand
> what it does now.
>
> Here is my configuration:
>
> http://gist.github.com/270590
>
> I've tried the StandardTokenizerFactory instead of the
> WhitespaceTokenizerFactory, but I get the same problem as before: the
> period from the previous sentence shows up and the period from the current
> sentence is cut off of highlighter fragments.
>
> I've tried the WhitespaceTokenizer with the StandardFilter, and this kinda
> works, but to match a word at the end of a sentence, you need to search for
> the period at the end of the sentence (the period is being tokenized along
> with the word).
>
> In any case, if I use the WordDelimiterFilter or add preserveOriginal="1",
> everything seems to work. (If I remove the WordDelimiterFilter, the periods
> are indexed with the word they're connected to, and searching for those
> words doesn't match unless the user includes the period)
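A rough illustration of the point above, assuming nothing about Solr's
actual classes: Python's str.split() stands in for a whitespace tokenizer
and str.strip() for a filter that drops trailing punctuation. The matches()
helper is hypothetical. It only shows why a term indexed as "lacus." is not
found by a search for "lacus".

```python
# Toy model only: str.split() stands in for a whitespace tokenizer and
# str.strip() for a filter that drops punctuation; neither is Solr code.
sentence = "Nulla a neque a ipsum accumsan iaculis at id lacus."

whitespace_tokens = sentence.split()
stripped_tokens = [token.strip(".,!?") for token in whitespace_tokens]


def matches(term, tokens):
    """Exact-term lookup (hypothetical helper): the term must equal a token."""
    return term in tokens


print(matches("lacus", whitespace_tokens))   # False: the token is "lacus."
print(matches("lacus.", whitespace_tokens))  # True: period included
print(matches("lacus", stripped_tokens))     # True: punctuation stripped
```

The real analysis chain also tracks token offsets, which this sketch ignores.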
>
> I'm trying to go through the code to understand how this works.
>
> On Wed, Jan 6, 2010 at 9:13 AM, Erick Erickson <erickerick...@gmail.com> wrote:
>
> > Hmmm, the name WordDelimiterFilterFactory might be leading
> > you astray. Its purpose isn't to break things up into "words"
> > that have anything to do with grammatical rules. Rather, its
> > purpose is to break up strings of funky characters into
> > searchable stuff. See:
> >
> > http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.WordDelimiterFilterFactory
> >
> > In the grammatical sense, PowerShot should just be
> > PowerShot, not power shot (which is what WordDelimiterFilterFactory
> > gives you, options permitting). So I think you probably want
> > one of the other analyzers....
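To make the PowerShot example concrete, here is a minimal sketch in Python.
It is only an approximation of the splitting idea (case changes, digits and
non-alphanumeric characters as boundaries), not the filter's real
implementation; the SUBWORD pattern and split_subwords() are made up for
illustration, and lowercasing would be a separate step.

```python
import re

# Hypothetical approximation: a subword is a capitalized or lowercase run,
# an all-caps run, or a run of digits. Anything else is a delimiter.
SUBWORD = re.compile(r"[A-Z]?[a-z]+|[A-Z]+(?![a-z])|\d+")


def split_subwords(token):
    """Return the subword parts of a token, dropping delimiters."""
    return SUBWORD.findall(token)


print(split_subwords("PowerShot"))  # ['Power', 'Shot']
print(split_subwords("Wi-Fi"))      # ['Wi', 'Fi']
print(split_subwords("lacus."))     # ['lacus'], the period is dropped
```

Note that the trailing period disappears along with the other delimiters.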
> >
> > Have you tried any other analyzers? StandardAnalyzer might be
> > more friendly....
> >
> > HTH
> > Erick
> >
> > On Tue, Jan 5, 2010 at 5:18 PM, Caleb Land <caleb.l...@gmail.com> wrote:
> >
> > > I've tracked this problem down to the fact that I'm using the
> > > WordDelimiterFilter. I don't quite understand what's happening, but if I
> > > add preserveOriginal="1" as an option, everything looks fine. I think it
> > > has to do with the period being stripped in the token stream.
> > >
> > > On Tue, Jan 5, 2010 at 2:05 PM, Caleb Land <caleb.l...@gmail.com> wrote:
> > >
> > > > Hello,
> > > > I'm using Solr 1.4, and I'm trying to get the regex fragmenter to parse
> > > > basic sentences, and I'm running into a problem.
> > > >
> > > > I'm using the default regex specified in the example Solr configuration:
> > > >
> > > > [-\w ,/\n\"']{20,200}
> > > >
> > > > But I am using a larger fragment size (140) with a slop of 1.0.
> > > >
> > > > Given the passage:
> > > >
> > > > Lorem ipsum dolor sit amet, consectetur adipiscing elit. Nulla a neque a
> > > > ipsum accumsan iaculis at id lacus. Sed magna velit, aliquam ut congue
> > > > vitae, molestie quis nunc.
> > > >
> > > > When I search for "Nulla" (the first word of the second sentence) and grab
> > > > the first highlighted snippet, this is what I get:
> > > >
> > > > . <em>Nulla</em> a neque a ipsum accumsan iaculis at id lacus
> > > >
> > > > As you can see, there's a leading period from the previous sentence and the
> > > > period from the current sentence is missing.
> > > >
> > > > I understand this regex isn't that advanced, but I've tried everything I
> > > > can think of, regex-wise, to get this to work, and I always end up with this
> > > > problem.
> > > >
> > > > For example, I've tried: \w[^.!?]{0,200}[.!?]
> > > >
> > > > Which seems like it should include the ending punctuation, but it doesn't,
> > > > so I think I'm missing something.
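As a sanity check on the two patterns quoted above, here is a small sketch
that runs them against the raw passage with Python's re module. This is an
assumption-laden stand-in: Solr uses Java regexes and the fragmenter works
against the token stream, so this only shows what each pattern can match on
plain text, not what the highlighter will return.

```python
import re

PASSAGE = (
    "Lorem ipsum dolor sit amet, consectetur adipiscing elit. "
    "Nulla a neque a ipsum accumsan iaculis at id lacus. "
    "Sed magna velit, aliquam ut congue vitae, molestie quis nunc."
)

# Default pattern from the example configuration. The character class has
# no period in it, so no match can ever contain one.
DEFAULT_PATTERN = re.compile(r"""[-\w ,/\n"']{20,200}""")

# The sentence-oriented pattern: a word character, a run of anything that
# is not sentence punctuation, then one closing punctuation mark.
SENTENCE_PATTERN = re.compile(r"\w[^.!?]{0,200}[.!?]")

default_matches = DEFAULT_PATTERN.findall(PASSAGE)
sentence_matches = SENTENCE_PATTERN.findall(PASSAGE)

for match in default_matches:
    print(repr(match))
for match in sentence_matches:
    print(repr(match))
```

On plain text the second pattern does pick up "Nulla ... lacus." with its
period, while the default pattern cannot, which suggests the missing
punctuation comes from somewhere other than the regex itself.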
> > > >
> > > > Does anybody know a regex that works?
> > > > --
> > > > Caleb Land
> > > >
> > >
> > >
> > >
> > > --
> > > Caleb Land
> > >
> >
>
>
>
> --
> Caleb Land
>
