Hi Christopher,

ShingleFilter(Factory), by design, inserts underscores for empty positions, so that you don't get shingles created from non-contiguous tokens.
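To make the by-design behavior concrete, here is a minimal Python sketch of the idea. It is a toy model, not Lucene's actual code. It assumes each token carries a position increment, and that a token removed by an earlier filter shows up as an increment greater than 1 on the next token. The function names and the (term, increment) tuple layout are made up for illustration.

```python
from typing import List, Tuple

FILLER = "_"  # assumed placeholder, matching the '_' seen in analysis.jsp


def fill_gaps(tokens: List[Tuple[str, int]]) -> List[str]:
    """Expand (term, position_increment) pairs into one slot per position.

    An increment greater than 1 means earlier filters removed tokens;
    each missing position becomes a FILLER slot.
    """
    slots: List[str] = []
    for term, increment in tokens:
        slots.extend([FILLER] * (increment - 1))
        slots.append(term)
    return slots


def bigram_shingles(tokens: List[Tuple[str, int]]) -> List[str]:
    """Build two-token shingles over adjacent positions, fillers included."""
    slots = fill_gaps(tokens)
    return [
        f"{left} {right}"
        for left, right in zip(slots, slots[1:])
        if not (left == FILLER and right == FILLER)  # skip filler-only pairs
    ]


if __name__ == "__main__":
    # "wizard of oz" after a stop filter has removed "of":
    # "oz" arrives with a position increment of 2.
    print(bigram_shingles([("wizard", 1), ("oz", 2)]))  # ['wizard _', '_ oz']
```

Without the filler, a naive shingler would emit "wizard oz", a shingle built from two tokens that were never adjacent in the input.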
It would probably be better to treat empty positions as edges, like an end-of-stream followed by a beginning-of-stream, and only output meaningful token n-grams instead of these underscore things. I can't imagine what use they are.

There should probably also be an option to ignore position gaps and generate shingles as if the tokens were really contiguous.

Anybody have a different opinion?

Steve

On 02/11/2010 at 10:03 PM, Christopher Ball wrote:
> I think I am making some progress - the key suggestion was to look at
> the analysis.jsp which I foolishly had forgotten =(.
>
> I think it is actually a bug in the ShingleFilterFactory when it is used
> subsequent to another Filter which removes tokens, e.g.
> StopFilterFactory or WordDelimiterFactory. The Analyzer clearly shows
> that anytime a token is dropped the ShingleFilterFactory picks up a
> mysterious '_'.
>
> For example, I enter "w'w oa". The WordDelimiterFactory removes the
> "w'w" token but then the ShingleFilterFactory shows "_ oa". Drop the
> apostrophe to create "ww oa" and the ShingleFilterFactory shows "oa".
> The same occurs if I have the StopFilterFactory remove tokens.
>
> I'd be grateful if anyone else can replicate this behavior.
>
> Christopher
>
> -----Original Message-----
> From: Ahmet Arslan [mailto:iori...@yahoo.com]
> Sent: Thursday, February 11, 2010 12:40 PM
> To: solr-user@lucene.apache.org
> Subject: RE: The Riddle of the Underscore and the Dollar Sign . . .
>
> > Unfortunately, the underscore is
> > being quite resilient =(
> >
> > I tried the solr.MappingCharFilterFactory and know the
> > mapping is working as
> > I am changing "c" => "q" just fine. But the underscore
> > refuses to go!
> >
> > I am baffled . . .
>
> I just activated name="textCharNorm" in example schema.xml and added
> "_" => "xxx" to mapping-ISOLatin1Accent.txt
> I verified from http://localhost:8983/solr/admin/analysis.jsp that
> replacement is done without problems. Can you also test analysis.jsp?
>
> Maybe your documents have underscores with different Unicode values. I
> know three different Unicode characters that all look like "-". If
> that's the case you need to find their Unicode values and write them into
> mappings.txt.
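For the lookalike-character check suggested in the quoted message, a small Python sketch like the one below could list the code point and Unicode name of each character in a suspect string. The helper name `describe_chars` and the sample string are illustrative assumptions, not something from the thread.

```python
import unicodedata
from typing import List, Tuple


def describe_chars(text: str) -> List[Tuple[str, str, str]]:
    """Return (character, code point, Unicode name) for each character."""
    return [
        (ch, f"U+{ord(ch):04X}", unicodedata.name(ch, "UNKNOWN"))
        for ch in text
    ]


if __name__ == "__main__":
    # Sample: ASCII underscore, fullwidth underscore, and three dash lookalikes.
    sample = "_\uFF3F-\u2010\u2013"
    for ch, code_point, name in describe_chars(sample):
        print(code_point, name)
```

Any character that turns out not to be the plain U+005F underscore could then get its own line in the mapping file, in the same `"source" => "target"` form used above. To my knowledge the mapping file accepts `\uXXXX` escapes in the source string, but that is worth confirming against the shipped mapping-ISOLatin1Accent.txt.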