Re: Has anyone HunspellStemFilterFactory working?

Сергей Бирюков Wed, 14 Nov 2012 03:09:30 -0800

Rob, regarding your "problem" with

'SET charset'

The word 'charset' is a placeholder: it must be replaced with the name of a character set (i.e. the encoding).
For example, you can write 'SET UTF-8'.
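
A quick way to see what the parser sees, as a rough sketch: the two helper functions below (first_directive and declared_charset are names made up for this illustration) print the first non-comment line of an affix file and the encoding named on its SET line. It assumes a POSIX shell with awk and an affix file without a byte-order mark.

```shell
#!/bin/sh
# Hypothetical helpers for inspecting a hunspell .aff file.

# Print the first line that is neither blank nor a '#' comment.
# This is the line the error message in the quoted mail talks about.
first_directive() {
    LC_ALL=C awk '{ sub(/\r$/, "") } !/^[ \t]*(#|$)/ { print; exit }' "$1"
}

# Print the encoding named on the first SET line, if the file has one.
declared_charset() {
    LC_ALL=C awk '{ sub(/\r$/, "") } $1 == "SET" { print $2; exit }' "$1"
}
```

For the en_GB.aff from the error message, `first_directive en_GB.aff` should print the 'FLAG num' line that the exception complains about. If `declared_charset en_GB.aff` finds a SET line further down, moving that line to the top is probably safer than writing a new one, since the name should match the encoding the files are really stored in.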


BUT...

----

Be careful!

At least for Russian morphology, HunspellStemFilterFactory has bug(s) in its algorithm.


A simple comparison with the original hunspell library shows a huge difference.


For example, on Linux x86_64 (Ubuntu 12.10):

1) INSTALL:
# sudo apt-get install hunspell hunspell-ru


2) TEST with the string "мама мыла раму мелом"
(it means: "mom was_washing frame (with) chalk"):

2.1) OS hunspell library
# echo "мама мыла раму мелом" | hunspell -d ru_RU -D -m
gives results:
...
    LOADED DICTIONARY:
    /usr/share/hunspell/ru_RU.aff
    /usr/share/hunspell/ru_RU.dic

    мама  -> мама
    мыла  -> мыло | мыть     <<< as noun | as verb
    раму  -> рама
    мелом -> мел

2.2) Solr's HunspellStemFilterFactory
fieldType config:

    <fieldType name="text_hunspell" class="solr.TextField" positionIncrementGap="100">
      <analyzer>
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.HunspellStemFilterFactory" dictionary="ru_RU.dic" affix="ru_RU.aff" ignoreCase="true"/>
      </analyzer>
    </fieldType>

gives results:
    мама  -> мама | мама                   : FAILED: duplicate word
    мыла  -> мыть | мыло                   : SUCCESS: all OK
    раму  -> рама | расти                  : FAILED: the second word is wrong and superfluous
    мелом -> мести | метить | месть | мел  : FAILED: only the last word is correct, the others are superfluous
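
The comparison above can be scripted roughly as follows. This is only a sketch: the function names are made up for illustration, and the stem lists are pasted in by hand as 'stem | stem' strings rather than read from hunspell or Solr.

```shell
#!/bin/sh
# Hypothetical helpers for comparing two "stem | stem | ..." result lines.

# Split a result line into one trimmed stem per line.
split_stems() {
    printf '%s\n' "$1" | tr '|' '\n' | awk 'NF { $1 = $1; print }'
}

# Print stems that occur more than once in a result line.
duplicate_stems() {
    split_stems "$1" | LC_ALL=C sort | LC_ALL=C uniq -d
}

# Print stems of the second line that are missing from the first
# (first argument: hunspell result, second argument: Solr result).
excess_stems() {
    reference=$(split_stems "$1")
    split_stems "$2" | LC_ALL=C grep -Fxv -e "$reference" || true
}
```

With the results listed above, `duplicate_stems 'мама | мама'` prints мама, `excess_stems 'рама' 'рама | расти'` prints расти, and `excess_stems 'мыло | мыть' 'мыть | мыло'` prints nothing, because the order of the stems does not matter.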


----------

That's why I have been using a JNA (v3.2.7) binding to the original (system) libhunspell.so for a long time :)


----
Best regards
  Sergey Biryukov
  Moscow, Russian Federation


14.11.2012 04:18, Rob Koeling wrote:

If so, would you be willing to share the .dic and .aff files with me?
When I try to load a dictionary file, Solr is complaining that:

java.lang.RuntimeException: java.io.IOException: Unable to load hunspell data! [dictionary=en_GB.dic,affix=en_GB.aff]
     at org.apache.solr.schema.IndexSchema.<init>(IndexSchema.java:116)
.......
Caused by: java.text.ParseException: The first non-comment line in the affix file must be a 'SET charset', was: 'FLAG num'
     at org.apache.lucene.analysis.hunspell.HunspellDictionary.getDictionaryEncoding(HunspellDictionary.java:306)
     at org.apache.lucene.analysis.hunspell.HunspellDictionary.<init>(HunspellDictionary.java:130)
     at org.apache.lucene.analysis.hunspell.HunspellStemFilterFactory.inform(HunspellStemFilterFactory.java:103)
     ... 46 more

When I change the first line to 'SET charset' it is still not happy. I got
the dictionary files from the OpenOffice website.

I'm using Solr 4.0 (but had the same problem with 3.6)

   - Rob
