Re: Stemming and accents (HunspellStemFilterFactory)

2012-02-15 Thread Jan Høydahl
Or, if you know that you'll always strip accents in your search, you could
pre-process your pt_PT.dic to remove the accents from it and use that custom
dictionary in Solr instead.
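
A minimal sketch of that pre-processing step, in Python. It assumes the usual
Hunspell layout of one "word/FLAGS" entry per line and a UTF-8 encoded file
(check the SET line in your .aff file, since this is only a guess for pt_PT).
The function names and the fold_dic_file helper are made up for illustration;
they are not part of Solr or Hunspell.

```python
import unicodedata


def strip_accents(text):
    """Drop combining marks after canonical decomposition (NFD)."""
    decomposed = unicodedata.normalize("NFD", text)
    stripped = "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")
    return unicodedata.normalize("NFC", stripped)


def fold_dic_entry(line):
    """Fold only the word part of a 'word/FLAGS' entry; flags stay untouched."""
    word, sep, rest = line.partition("/")
    return strip_accents(word) + sep + rest


def fold_dic_file(src, dst, encoding="utf-8"):
    """Write an accent-free copy of a .dic file (hypothetical helper)."""
    with open(src, encoding=encoding) as fin, open(dst, "w", encoding=encoding) as fout:
        for line in fin:
            fout.write(fold_dic_entry(line))
```

For example, fold_dic_entry("fórum/p") gives "forum/p". Affix rules in the
.aff file that contain accented characters would probably need the same
treatment, and folding can make two entries collide, so treat this as a
starting point. With a folded dictionary the accent filter has to run before
the Hunspell filter in the analysis chain.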

Another alternative could be to extend HunSpellFilter so that it can take in 
the class name of a TokenFilter class to apply when parsing the dictionary into 
memory.

--
Jan Høydahl, search solution architect
Cominvent AS - www.cominvent.com
Solr Training - www.solrtraining.com

On 14. feb. 2012, at 16:27, Chantal Ackermann wrote:

 Hi Bráulio,
 
 I don't know HunspellStemFilterFactory specifically, but concerning
 accents:
 
 There are several accent filters that will remove accents from your
 tokens. If the Hunspell filter factory requires the accents, then simply
 add the accent filters after Hunspell in your index and query filter
 chains.
 
 You would then have Hunspell produce the tokens as the result of the
 stemming, and only afterwards would the accents be removed (your example:
 'forum' instead of 'fórum'). Do the same on the query side in case
 someone inputs accents.
 
 Accent filters are:
 http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.ICUTokenizerFactory
 (lowercases, as well!)
 http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.ASCIIFoldingFilterFactory
 
 and others on that page.
 
 Chantal
 
 
 On Tue, 2012-02-14 at 14:48 +0100, Bráulio Bhavamitra wrote:
 Hello all,
 
 I'm evaluating the HunspellStemFilterFactory and found that it works with a
 pt_PT dictionary.
 
 For example, if I search for 'fóruns' it stems it to 'fórum' and then finds
 'fórum' references.
 
 But if I search for 'foruns' (without the accent),
 then HunspellStemFilterFactory cannot stem the
 word, as it does not exist in its dictionary.
 
 Is there any way to make HunspellStemFilterFactory work regardless of accent
 differences?
 
 best,
 bráulio
 



Re: Stemming and accents (HunspellStemFilterFactory)

2012-02-14 Thread Chantal Ackermann
Hi Bráulio,

I don't know HunspellStemFilterFactory specifically, but concerning
accents:

There are several accent filters that will remove accents from your
tokens. If the Hunspell filter factory requires the accents, then simply
add the accent filters after Hunspell in your index and query filter
chains.

You would then have Hunspell produce the tokens as the result of the
stemming, and only afterwards would the accents be removed (your example:
'forum' instead of 'fórum'). Do the same on the query side in case
someone inputs accents.
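
To make the ordering concrete, here is a toy simulation in Python of a chain
that stems first and folds accents afterwards. The TOY_STEMS table is a
made-up stand-in for the pt_PT dictionary, and the sketch assumes the stemmer
passes unknown words through unchanged; nothing here calls Solr or Hunspell.

```python
import unicodedata

# Hypothetical stand-in for the dictionary lookup; real Hunspell uses affix rules.
TOY_STEMS = {"fóruns": "fórum", "fórum": "fórum"}


def toy_stem(token):
    """Return the dictionary stem, or the token itself if it is unknown."""
    return TOY_STEMS.get(token, token)


def fold(token):
    """Remove accents by dropping combining marks after NFD decomposition."""
    decomposed = unicodedata.normalize("NFD", token)
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")


def analyze(token):
    """Stem first, fold accents afterwards (same chain for index and query)."""
    return fold(toy_stem(token.lower()))
```

In this toy, analyze("fóruns") and analyze("fórum") both give "forum", so
accented input matches regardless of number. analyze("foruns") stays
"foruns", because the unaccented plural is not found in the dictionary before
folding happens.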

Accent filters are:
http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.ICUTokenizerFactory
(lowercases, as well!)
http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.ASCIIFoldingFilterFactory

and others on that page.

Chantal


On Tue, 2012-02-14 at 14:48 +0100, Bráulio Bhavamitra wrote:
 Hello all,
 
 I'm evaluating the HunspellStemFilterFactory and found that it works with a
 pt_PT dictionary.
 
 For example, if I search for 'fóruns' it stems it to 'fórum' and then finds
 'fórum' references.
 
 But if I search for 'foruns' (without the accent),
 then HunspellStemFilterFactory cannot stem the
 word, as it does not exist in its dictionary.
 
 Is there any way to make HunspellStemFilterFactory work regardless of accent
 differences?
 
 best,
 bráulio