File based wordlists for spellchecker

Tomasz Wegrzanowski Mon, 14 Nov 2011 20:52:55 -0800

Hi,

I have a very large index, and I'm trying to add a spell checker for it.
I don't want to copy all the text in the index to an extra spell field, since
that would be prohibitively big, and the index is already close to as big as
it can reasonably be. So I just want to extract word frequencies as I index,
for offline processing.


After some filtering I get something like this (word, frequency):

a       122958495
aa      834203
aaa     175206
aaaa    22389
aaab    1522
aaai    1050
aaas    6384
aab     8109
aabb    1906
aac     35100
aacc    1692
aachen  11723

I wanted to use FileBasedSpellChecker, but it doesn't support frequencies,
so its recommendations are consistently horrible. Increasing the frequency
cutoff won't really help that much: it will still suggest less frequent words
over equally similar, more frequent words.
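
To make the ranking I'm after concrete, here is a minimal sketch in plain Java. It is not Solr or Lucene code: the class name FrequencyRanker and its methods are hypothetical, and it assumes the word list fits in memory as a map from word to frequency. Candidates are ordered by edit distance first, and ties between equally similar words go to the more frequent one.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Map;

// Hypothetical helper for illustration only; not part of Solr or Lucene.
class FrequencyRanker {

    // Parses one "word<whitespace>frequency" line, e.g. "aachen  11723".
    static Map.Entry<String, Long> parseLine(String line) {
        String[] parts = line.trim().split("\\s+");
        return Map.entry(parts[0], Long.parseLong(parts[1]));
    }

    // Plain Levenshtein edit distance using two rolling rows.
    static int editDistance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j;
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1),
                                   prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    // Returns words within maxDistance of the query: closest first,
    // then higher frequency first, then alphabetical for determinism.
    static List<String> suggest(String query, Map<String, Long> freqs, int maxDistance) {
        List<String> candidates = new ArrayList<>();
        for (String word : freqs.keySet()) {
            if (!word.equals(query) && editDistance(query, word) <= maxDistance) {
                candidates.add(word);
            }
        }
        candidates.sort(
            Comparator.comparingInt((String w) -> editDistance(query, w))
                .thenComparing((String w) -> freqs.get(w), Comparator.<Long>reverseOrder())
                .thenComparing(Comparator.<String>naturalOrder()));
        return candidates;
    }
}
```

With the sample list above, a query such as "aaax" with maxDistance 1 would put "aaa" ahead of "aaab" and "aaai", because all three are one edit away and "aaa" is far more frequent. A linear scan like this would not scale to a full vocabulary; it only illustrates the ordering.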

What's the easiest way to get this working?
Presumably I'd need to create a separate index with just these words.
How do I get frequencies there, without actually creating 11723 records with
"aachen" in them etc.?

I can do some small Java coding if need be.
I'm already using the 3.x branch (mostly for edismax, plus some unrelated
minor patches).

Thanks,
Tomasz
