Hello all,

We have documents in a large number of languages, all of which we currently index in a single index. The paper below uses character n-grams as a substitute for language-specific stemming and got good results with a number of morphologically complex languages. Has anyone tried doing this with Solr?
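To make the idea concrete, here is a minimal Python sketch of what I understand by character n-gram tokenization. It is only an illustration, not the paper's exact procedure and not Solr code: the function name `char_ngrams`, the default of n=4, the lowercasing, and the choice to keep n-grams inside word boundaries are all my own assumptions.

```python
def char_ngrams(text, n=4):
    """Return overlapping character n-grams for each whitespace-separated word.

    Assumptions for illustration only: text is lowercased, n-grams do not
    cross word boundaries, and words of n characters or fewer are kept whole.
    """
    grams = []
    for word in text.lower().split():
        if len(word) <= n:
            # Short words are indexed as-is.
            grams.append(word)
        else:
            # Slide a window of width n across the word.
            grams.extend(word[i:i + n] for i in range(len(word) - n + 1))
    return grams


if __name__ == "__main__":
    print(char_ngrams("indexing"))
    print(char_ngrams("indexes"))
```

With n=4, "indexing" and "indexes" share the n-grams "inde" and "ndex", so a query on one form can match the other without any language-specific stemmer.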
They also got fairly good results (at least for the more complex languages) by simply truncating words. We would be very interested to hear about any experience using either of these approaches for multiple languages.

The test collections were mostly short newswire stories, so the other question is whether similar results might be expected for longer documents.

Paul McNamee, Charles Nicholas, and James Mayfield. 2009. Addressing morphological variation in alphabetic languages. In *Proceedings of the 32nd international ACM SIGIR conference on Research and development in information retrieval* (SIGIR '09). ACM, New York, NY, USA, 75-82. DOI=10.1145/1571941.1571957
http://doi.acm.org/10.1145/1571941.1571957

Tom Burton-West
http://www.hathitrust.org/blogs/large-scale-search