Hello all,

We have a large number of languages, all of which we currently index in a
single index.  The paper below uses character n-grams as a substitute for
language-specific stemming and got good results with a number of
morphologically complex languages.  Has anyone tried doing this with Solr?

They also got fairly good results (at least for the more complex
languages) by simply truncating words.
We would be very interested to hear about any experience using either of
these approaches for multiple languages.
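
To make the two approaches above concrete, here is a rough Python sketch of
what I understand them to mean. It is only an illustration, not the paper's
exact method and not Solr code: the function names, the lengths (4 and 5),
the whitespace tokenization, and the choice to keep n-grams within word
boundaries are all my own assumptions.

```python
def char_ngrams(word, n=4):
    """Return the overlapping character n-grams of a word.

    Words no longer than n are kept whole (an assumption of this sketch).
    """
    if len(word) <= n:
        return [word]
    return [word[i:i + n] for i in range(len(word) - n + 1)]


def truncate(word, k=5):
    """Keep only the first k characters of a word."""
    return word[:k]


def index_terms(text, mode="ngram", size=4):
    """Turn text into index terms using n-grams or truncation.

    Tokenization here is plain lowercasing plus whitespace splitting,
    which is a simplification for illustration only.
    """
    words = text.lower().split()
    if mode == "ngram":
        return [gram for w in words for gram in char_ngrams(w, size)]
    return [truncate(w, size) for w in words]


if __name__ == "__main__":
    print(char_ngrams("juggling", 4))
    print(index_terms("Juggling jugglers", mode="truncate", size=5))
```

With truncation, "juggling" and "jugglers" both reduce to the same term, which
is the stemming-like effect; with n-grams, the two words share several
4-grams, so a query on one can still match the other.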


The test collections were pretty much short newswire stories, so the other
question is whether similar results might be expected for longer documents.

Paul McNamee, Charles Nicholas, and James Mayfield. 2009. Addressing
morphological variation in alphabetic languages. In *Proceedings of the
32nd international ACM SIGIR conference on Research and development in
information retrieval* (SIGIR '09). ACM, New York, NY, USA, 75-82.
DOI=10.1145/1571941.1571957 http://doi.acm.org/10.1145/1571941.1571957

Tom Burton-West

http://www.hathitrust.org/blogs/large-scale-search
