Unfortunately the current SynonymFilter cannot handle posInc != 1 ... we could perhaps try to fix this ... patches welcome :)
So for now it's best to place SynonymFilter before StopFilter, and before any other filters that may create graph tokens (posLen > 1, posInc == 0).

Mike McCandless

http://blog.mikemccandless.com

On Mon, Sep 23, 2013 at 2:45 AM, <david.dav...@correo.aeat.es> wrote:
> Hi,
>
> I am having a problem applying StopFilterFactory and
> SynonymFilterFactory. The problem is that SynonymFilter removes the gaps
> that were previously left by the StopFilterFactory. I'm applying the
> filters at query time, because users need to change synonym lists
> frequently.
>
> This is my schema, and an example of the issue:
>
> String: "documentacion para agentes"
>
> org.apache.solr.analysis.WhitespaceTokenizerFactory
> {luceneMatchVersion=LUCENE_35}
> position     1              2     3
> term text    documentación  para  agentes
> startOffset  0              14    19
> endOffset    13             18    26
>
> org.apache.solr.analysis.LowerCaseFilterFactory
> {luceneMatchVersion=LUCENE_35}
> position     1              2     3
> term text    documentación  para  agentes
> startOffset  0              14    19
> endOffset    13             18    26
>
> org.apache.solr.analysis.StopFilterFactory {words=stopwords_intranet.txt,
> ignoreCase=true, enablePositionIncrements=true,
> luceneMatchVersion=LUCENE_35}
> position     1              3
> term text    documentación  agentes
> startOffset  0              19
> endOffset    13             26
>
> org.apache.solr.analysis.SynonymFilterFactory
> {synonyms=sinonimos_intranet.txt, expand=true, ignoreCase=true,
> luceneMatchVersion=LUCENE_35}
> position     1              2
> term text    documentación  agente
>              archivo        agentes
> type         SYNONYM        SYNONYM
>              SYNONYM        SYNONYM
> startOffset  0              19
>              0              19
> endOffset    13             26
>              13             26
>
> As you can see, the positions should be 1 and 3, but SynonymFilter removes
> the gap and moves the token from position 3 to 2.
> I've got the same problem with Solr 3.5 and 4.0.
> I don't know if it's a bug or an error in my configuration. In other
> schemas that I have worked with, I had always put the SynonymFilter
> before the StopFilter, but in this one I preferred this order because of
> the large number of synonyms in the list (i.e. I don't want to generate
> a lot of synonyms for a word that I really wanted to remove).
>
> Thanks,
>
> David Dávila Atienza
> AEAT - Departamento de Informática Tributaria
> Subdirección de Tecnologías de Análisis de la Información e Investigación
> del Fraude
> Área de Infraestructuras
> Teléfono: 915831543
> Extensión: 31543
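
For reference, the ordering suggested at the top of this reply could look roughly like the query analyzer below. This is only a sketch: the tokenizer, filter classes and attribute values are taken from the analysis output quoted above, while the field type name `text_intranet` is a placeholder that does not come from the original schema, and the `solr.` short class names are assumed to resolve as usual.

```xml
<!-- "text_intranet" is a hypothetical name; only the filter order matters here. -->
<fieldType name="text_intranet" class="solr.TextField">
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <!-- Synonyms run first, while every incoming token still has posInc == 1. -->
    <filter class="solr.SynonymFilterFactory"
            synonyms="sinonimos_intranet.txt"
            expand="true"
            ignoreCase="true"/>
    <!-- Stop words are removed last, so the gaps they leave are not collapsed
         by a later SynonymFilter. -->
    <filter class="solr.StopFilterFactory"
            words="stopwords_intranet.txt"
            ignoreCase="true"
            enablePositionIncrements="true"/>
  </analyzer>
</fieldType>
```

The trade-off is the one raised in the original question: with this order, synonyms are expanded before stop words are dropped, so a word that is later removed may still go through synonym matching first.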