Re: Punctuation marks in documents prevent recognition of synonyms at indexing?

G.S.J. Lobbestael Sun, 27 Sep 2009 04:56:31 -0700

Thanks, this helps. 
But our synonym file has some 16,000 sets of synonyms.



Should the wiki warn users about the following?
- With a WhitespaceTokenizerFactory, synonyms are not expanded at indexing 
when the synonym is directly followed by a punctuation mark, as in the text 
"... synonym[punctuation mark] ..."

- The individual synonyms in your synonym file should be written in the form 
they would have after passing through the tokenizer that comes before the 
SynonymFilterFactory

With a WhitespaceTokenizerFactory:
Flaubert's Parrot, Julian Barnes
A History of the World in 10½ Chapters, Julian Barnes
England\, England, Julian Barnes
Arthur & George, Julian Barnes
Absalom\, Absalom!, William Faulkner
k-nearest neighbors algorithm, k-NN, k nn

With a StandardTokenizerFactory:
Flaubert's Parrot, Julian Barnes
A History of the World in 10 Chapters, Julian Barnes
England England, Julian Barnes
Arthur George, Julian Barnes
Absalom Absalom, William Faulkner
k nearest neighbors algorithm, k-NN, k nn, knn 

This means that when you change the TokenizerFactory you might also have to 
change your synonym file. The change may be irreversible, though: you can't 
reconstruct the first version from the second one.
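
To illustrate both points, here is a rough Python sketch. The two functions 
are simplified stand-ins that I made up for this mail, not Solr's actual 
tokenizers: one splits on whitespace only, the other is assumed to keep just 
letters, digits and word-internal apostrophes.

```python
import re

def whitespace_tokenize(text):
    # Split on runs of whitespace only; punctuation stays attached to tokens.
    return text.split()

def punctuation_stripping_tokenize(text):
    # Hypothetical, crude stand-in for a tokenizer that drops punctuation:
    # keep runs of ASCII letters/digits, allowing apostrophes inside a word.
    return re.findall(r"[A-Za-z0-9]+(?:'[A-Za-z0-9]+)*", text)

doc = "He reread Absalom, Absalom! last year."

# With whitespace splitting the bare token "Absalom" never appears,
# so a synonym entry written as "Absalom" has nothing to match.
print(whitespace_tokenize(doc))
# ['He', 'reread', 'Absalom,', 'Absalom!', 'last', 'year.']
print("Absalom" in whitespace_tokenize(doc))
# False

# Two different raw entries collapse to the same tokens, which is why the
# original synonym file cannot be rebuilt from the converted one.
print(punctuation_stripping_tokenize("Arthur & George"))
# ['Arthur', 'George']
print(punctuation_stripping_tokenize("Arthur George"))
# ['Arthur', 'George']
```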

Would it be possible for Solr to apply the Tokenizer in use while reading the 
synonym file? Then the user would only need the original synonym file, and 
there could not be a conflict.
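
What I have in mind could look roughly like the Python sketch below. It is 
only an illustration of the idea, not how Solr reads the file: the function 
names are hypothetical, the tokenizer is a simplified stand-in, and I assume 
that entries are separated by commas and that "\," is a literal comma.

```python
import re

def strip_tokenize(text):
    # Hypothetical stand-in for the tokenizer configured on the field:
    # keep ASCII letters/digits and word-internal apostrophes only.
    return re.findall(r"[A-Za-z0-9]+(?:'[A-Za-z0-9]+)*", text)

def split_synonym_line(line):
    # Split on commas that are not escaped with a backslash,
    # then turn each "\," back into a plain comma.
    parts = re.split(r"(?<!\\),", line)
    return [part.replace("\\,", ",").strip() for part in parts]

def normalize_line(line, tokenize):
    # Send every entry through the same tokenizer the documents go through,
    # so the entries end up in the form the indexed tokens will have.
    return [" ".join(tokenize(entry)) for entry in split_synonym_line(line)]

print(normalize_line(r"England\, England, Julian Barnes", strip_tokenize))
# ['England England', 'Julian Barnes']
print(normalize_line(r"Absalom\, Absalom!, William Faulkner", strip_tokenize))
# ['Absalom Absalom', 'William Faulkner']
print(normalize_line(r"England\, England, Julian Barnes", str.split))
# ['England, England', 'Julian Barnes']
```

The original synonym file stays untouched; only the in-memory entries change 
with the tokenizer that is passed in.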

regards
geert
> > You lose the WordDelimiterFilterFactory functionality:
> > 
> > Syn.txt has: ADC, HIV-dementie
> > Search on "ADC" doesn't find document with "HIV-dementie".
> 
> The synonym filter can handle multi-word synonyms. Change Syn.txt to:
> ADC, HIV dementie
> 
> And search on "ADC" will find document with "HIV-dementie".
> 
> hope this helps.
