Hmmm, I'll have to defer to the highlighter experts here....

Erick

On Wed, Jan 6, 2010 at 3:23 PM, Caleb Land <redhatd...@gmail.com> wrote:

> I've looked at the docs/source for WordDelimiterFilter, and I understand
> what it does now.
>
> Here is my configuration:
>
> http://gist.github.com/270590
>
> I've tried the StandardTokenizerFactory instead of the
> WhitespaceTokenizerFactory, but I get the same problem as before: the
> period from the previous sentence shows up and the period from the current
> sentence is cut off of highlighter fragments.
>
> I've tried the WhitespaceTokenizer with the StandardFilter, and this kinda
> works, but to match a word at the end of a sentence, you need to search for
> the period at the end of the sentence (the period is being tokenized along
> with the word).
>
> In any case, if I use the WordDelimiterFilter or add preserveOriginal="1",
> everything seems to work. (If I remove the WordDelimiterFilter, the periods
> are indexed with the word they're connected to, and searching for those
> words doesn't match unless the user includes the period.)
>
> I'm trying to go through the code to understand how this works.
>
> On Wed, Jan 6, 2010 at 9:13 AM, Erick Erickson <erickerick...@gmail.com> wrote:
>
> > Hmmm, the name WordDelimiterFilterFactory might be leading
> > you astray. Its purpose isn't to break things up into "words"
> > that have anything to do with grammatical rules. Rather, its
> > purpose is to break up strings of funky characters into
> > searchable stuff. See:
> >
> > http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.WordDelimiterFilterFactory
> >
> > In the grammatical sense, PowerShot should just be
> > PowerShot, not power shot (which is what WordDelimiterFilterFactory
> > gives you, options permitting). So I think you probably want
> > one of the other analyzers....
> >
> > Have you tried any other analyzers? StandardAnalyzer might be
> > more friendly....
> >
> > HTH
> > Erick
> >
> > On Tue, Jan 5, 2010 at 5:18 PM, Caleb Land <caleb.l...@gmail.com> wrote:
> >
> > > I've tracked this problem down to the fact that I'm using the
> > > WordDelimiterFilter. I don't quite understand what's happening, but if I
> > > add preserveOriginal="1" as an option, everything looks fine. I think it
> > > has to do with the period being stripped in the token stream.
> > >
> > > On Tue, Jan 5, 2010 at 2:05 PM, Caleb Land <caleb.l...@gmail.com> wrote:
> > >
> > > > Hello,
> > > > I'm using Solr 1.4, and I'm trying to get the regex fragmenter to parse
> > > > basic sentences, and I'm running into a problem.
> > > >
> > > > I'm using the default regex specified in the example solr configuration:
> > > >
> > > > [-\w ,/\n\"']{20,200}
> > > >
> > > > But I am using a larger fragment size (140) with a slop of 1.0.
> > > >
> > > > Given the passage:
> > > >
> > > > Lorem ipsum dolor sit amet, consectetur adipiscing elit. Nulla a neque a
> > > > ipsum accumsan iaculis at id lacus. Sed magna velit, aliquam ut congue
> > > > vitae, molestie quis nunc.
> > > >
> > > > When I search for "Nulla" (the first word of the second sentence) and grab
> > > > the first highlighted snippet, this is what I get:
> > > >
> > > > . <em>Nulla</em> a neque a ipsum accumsan iaculis at id lacus
> > > >
> > > > As you can see, there's a leading period from the previous sentence and the
> > > > period from the current sentence is missing.
> > > >
> > > > I understand this regex isn't that advanced, but I've tried everything I
> > > > can think of, regex-wise, to get this to work, and I always end up with this
> > > > problem.
> > > >
> > > > For example, I've tried: \w[^.!?]{0,200}[.!?]
> > > >
> > > > Which seems like it should include the ending punctuation, but it doesn't,
> > > > so I think I'm missing something.
> > > >
> > > > Does anybody know a regex that works?
> > > >
> > > > --
> > > > Caleb Land
> > >
> > > --
> > > Caleb Land
>
> --
> Caleb Land
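
Editorial note: the two patterns quoted in the thread can be checked against the sample passage outside Solr. The sketch below is a minimal illustration using Python's `re` module, not the Solr regex fragmenter, and it assumes the two patterns behave the same way under Python's regex engine as under Java's. The variable names are made up for the example. It shows only what each pattern matches as a plain regex: the default pattern can never include a period, while the sentence pattern does capture the trailing one. That is consistent with the observation later in the thread that the missing period comes from the token stream and not from the regex itself.

```python
import re

# Sample passage from the original message.
passage = (
    "Lorem ipsum dolor sit amet, consectetur adipiscing elit. "
    "Nulla a neque a ipsum accumsan iaculis at id lacus. "
    "Sed magna velit, aliquam ut congue vitae, molestie quis nunc."
)

# Default pattern from the example Solr configuration, as quoted in the thread.
# The character class has no '.', so a match always stops before a period.
default_pattern = re.compile(r"""[-\w ,/\n\"']{20,200}""")

# Sentence-oriented pattern tried in the thread: a word character, then up to
# 200 non-terminators, then one sentence terminator.
sentence_pattern = re.compile(r"\w[^.!?]{0,200}[.!?]")

default_matches = [m.group(0) for m in default_pattern.finditer(passage)]
sentence_matches = [m.group(0) for m in sentence_pattern.finditer(passage)]

for label, matches in (("default", default_matches), ("sentence", sentence_matches)):
    print(label)
    for text in matches:
        # repr() makes leading spaces and trailing periods visible.
        print("   ", repr(text))
```

As a plain regex, the sentence pattern matches the second sentence with its closing period, so whatever trims the period in the highlighted snippet happens after the pattern is applied.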