Hi Christopher,

ShingleFilter(Factory), by design, inserts underscores for empty positions, so 
that you don't get shingles created from non-contiguous tokens.

It would probably be better to treat empty positions as edges, like an 
end-of-stream followed by a beginning-of-stream, and only output meaningful 
token n-grams, instead of these underscore-things - I can't imagine what use they 
are.  There should probably also be an option to ignore position gaps and 
generate shingles as if the tokens were really contiguous.
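
To make the three behaviors concrete, here is a rough Python sketch. It is
not Lucene's code: the shingles() function, its gap_mode parameter, and the
(term, position increment) input format are all invented for illustration.
It only emits shingles of one fixed size and leaves out the unigrams the
real filter can also output.

```python
def shingles(tokens, size=2, gap_mode="filler", filler="_"):
    """Build word n-grams from (term, position_increment) pairs.

    gap_mode is a hypothetical switch for this sketch:
      "filler" - put a filler token in every empty position
      "edge"   - treat a gap like end-of-stream + beginning-of-stream
      "ignore" - pretend the remaining tokens are contiguous
    """
    # Expand the stream into one slot per position; None marks a gap
    # left behind by a removed token (increment > 1).
    slots = []
    for term, increment in tokens:
        slots.extend([None] * (increment - 1))
        slots.append(term)

    if gap_mode == "ignore":
        runs = [[t for t in slots if t is not None]]
    elif gap_mode == "edge":
        runs, run = [], []
        for t in slots:
            if t is None:
                if run:
                    runs.append(run)
                run = []
            else:
                run.append(t)
        if run:
            runs.append(run)
    else:  # "filler"
        runs = [[filler if t is None else t for t in slots]]

    # Slide a window of `size` tokens over each contiguous run.
    out = []
    for run in runs:
        for i in range(len(run) - size + 1):
            out.append(" ".join(run[i:i + size]))
    return out


# "please divide the sentence" after a stop filter removed "the":
# "sentence" arrives with a position increment of 2.
stream = [("please", 1), ("divide", 1), ("sentence", 2)]
print(shingles(stream))                     # filler tokens in the gap
print(shingles(stream, gap_mode="edge"))    # gap acts as a boundary
print(shingles(stream, gap_mode="ignore"))  # gap is skipped over
```

With "filler" the gap shows up as "divide _" and "_ sentence"; with "edge"
only "please divide" survives; with "ignore" you also get "divide sentence".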

Anybody have a different opinion?

Steve

On 02/11/2010 at 10:03 PM, Christopher Ball wrote:
> I think I am making some progress - the key suggestion was to look at
> the analysis.jsp which I foolishly had forgotten =(.
> 
> I think it is actually a bug in the ShingleFilterFactory when it is used
> after another filter that removes tokens, e.g. StopFilterFactory or
> WordDelimiterFilterFactory. The analysis page clearly shows that any
> time a token is dropped, the ShingleFilterFactory picks up a
> mysterious '_'.
> 
> For example, I enter "w'w oa". The WordDelimiterFilterFactory removes
> the "w'w" token, but then the ShingleFilterFactory shows "_ oa". Drop
> the apostrophe to create "ww oa" and the ShingleFilterFactory shows
> "oa". The same occurs if I have the StopFilterFactory remove tokens.
> 
> I'd be grateful if anyone else can replicate this behavior.
> 
> Christopher
> 
> -----Original Message-----
> From: Ahmet Arslan [mailto:iori...@yahoo.com]
> Sent: Thursday, February 11, 2010 12:40 PM
> To: solr-user@lucene.apache.org
> Subject: RE: The Riddle of the Underscore and the Dollar Sign . . .
> 
> 
> > Unfortunately, the underscore is
> > being quite resilient =(
> > 
> > I tried the solr.MappingCharFilterFactory and know the
> > mapping is working as
> > I am changing "c" => "q" just fine. But the underscore
> > refuses to go!
> > 
> > I am baffled . . .
> 
> I just activated name="textCharNorm" in example schema.xml and added
> "_" => "xxx" to mapping-ISOLatin1Accent.txt
> I verified from http://localhost:8983/solr/admin/analysis.jsp that
> replacement is done without problems. Can you also test analysis.jsp?
> 
> Maybe your documents have underscores with different Unicode values. I
> know three different Unicode characters that all look like "-". If
> that's the case, you need to find their Unicode values and write them
> into mappings.txt.
> 
> 
> 
> 
>

