Re: Synonym Filter with Nutch

2009-11-13 Thread Andrzej Bialecki

Dharan Althuru wrote:

Hi,


We are trying to incorporate a synonym filter during indexing with Nutch. As
I understand it, Nutch doesn’t have a synonym indexing plug-in by default.
Can we extend IndexFilter in Nutch to incorporate the synonym filter available
in Lucene (using WordNet) or a custom synonym plug-in, without any negative
impact on existing Nutch indexing (e.g., bigram handling)?


Synonym expansion should be done when the text is analyzed (using 
Analyzers), so you can reuse Lucene's synonym filter.
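
To make "expansion at analysis time" concrete, here is a minimal, library-free sketch of what a synonym filter does to a token stream: each synonym is emitted at the same position as the original term, so a query for any variant matches the same spot in the document. This is only an illustration and not Lucene's API; the class name, the Token type and the tiny synonym map are all hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Hypothetical, library-free illustration; this is NOT Lucene's synonym filter.
class SynonymExpansionSketch {

    /** A term together with its position in the token stream. */
    static final class Token {
        final String term;
        final int position;

        Token(String term, int position) {
            this.term = term;
            this.position = position;
        }

        @Override
        public String toString() {
            return term + "@" + position;
        }
    }

    /**
     * Lower-cases each input term and emits it, followed by its synonyms.
     * Synonyms share the position of the original term, which mimics a
     * position increment of zero in a real analysis chain.
     */
    static List<Token> expand(List<String> terms, Map<String, List<String>> synonyms) {
        List<Token> out = new ArrayList<>();
        int position = 0;
        for (String raw : terms) {
            String term = raw.toLowerCase(Locale.ROOT);
            out.add(new Token(term, position));
            List<String> alternatives = synonyms.getOrDefault(term, Collections.emptyList());
            for (String synonym : alternatives) {
                out.add(new Token(synonym, position));
            }
            position++;
        }
        return out;
    }

    public static void main(String[] args) {
        // Assumed sample data, for illustration only.
        Map<String, List<String>> synonyms = new HashMap<>();
        synonyms.put("fast", Arrays.asList("quick", "rapid"));

        // Prints [fast@0, quick@0, rapid@0, car@1]
        System.out.println(expand(Arrays.asList("Fast", "car"), synonyms));
    }
}
```

Because the expansion is stored in the index, the cost is paid once at indexing time (a larger index) rather than on every query.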


Unfortunately, this happens at different stages depending on whether you 
use the built-in Lucene indexer, or the Solr indexer.


If you use the Lucene indexer, this happens in LuceneWriter, and the 
only way to affect it is to implement an analysis plugin, so that your 
analyzer is returned from AnalyzerFactory instead of the default one. 
See e.g. analysis-fr for an example of how to implement such a plugin.


However, when you index to Solr you need to configure Solr's 
analysis chain, i.e. in your schema.xml you need to add the synonym 
filter to the indexing analysis chain of your fieldType.
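
As a rough sketch of the Solr side, a fieldType along these lines puts the synonym filter in the index-time analyzer only. The fieldType name "text_syn" and the file name "synonyms.txt" are assumptions for illustration; adapt the tokenizer and the other filters to whatever your existing schema uses.

```xml
<!-- Hypothetical fieldType: the name "text_syn" and "synonyms.txt" are assumptions -->
<fieldType name="text_syn" class="solr.TextField" positionIncrementGap="100">
  <!-- Synonyms are expanded once, when documents are indexed -->
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
            ignoreCase="true" expand="true"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
  <!-- No synonym filter here, so queries are not expanded a second time -->
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```

Fields that should get synonym expansion would then use type="text_syn" in their field definitions.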


--
Best regards,
Andrzej Bialecki 
 ___. ___ ___ ___ _ _   __
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com



Synonym Filter with Nutch

2009-11-12 Thread Dharan Althuru
Hi,


We are trying to incorporate a synonym filter during indexing with Nutch. As
I understand it, Nutch doesn’t have a synonym indexing plug-in by default.
Can we extend IndexFilter in Nutch to incorporate the synonym filter available
in Lucene (using WordNet) or a custom synonym plug-in, without any negative
impact on existing Nutch indexing (e.g., bigram handling)?


Another option we are considering is to look up synonyms at query time.
But this might cause performance issues as we scale the system to, say, more
than 100M pages.



Can someone please suggest the best way to incorporate a synonym filter in
Nutch?



Thank you.



Regards,

Dharan