First, I'd search the mail archive for the topic of languages; it's been discussed often, and there's a wealth of information there that might be of benefit, far more than I can remember.
As to whether your approach will be "too big, too slow...", you really haven't given enough information to go on. Here are a few questions whose answers would help:

- How many e-mails are you indexing? Indexing 10,000 e-mails is one thing; indexing 1,000,000,000 is quite another.
- Are you indexing attachments?
- How many users do you expect to be using this system?
- What are your target response times?
- What is your design queries-per-second?
- How dynamic is the index? That is, how many e-mails do you expect to add per day, and what latency can you live with between the time an e-mail is indexed and when it becomes searchable?

Best
Erick

On Tue, Jan 27, 2009 at 3:05 PM, Alejandro Valdez < alejandro.val...@gmail.com> wrote:

> Hi, I plan to use Solr to index a large number of documents extracted
> from e-mail bodies. Such documents could be in different languages,
> and a single document could be in more than one language. In the same
> way, the query string could contain words in different languages.
>
> I read that a common approach to indexing multilingual documents is to
> use some algorithm (n-gram) to determine the document language, then
> apply a stemmer, and finally index the documents in a separate index
> per language.
>
> Since neither the document language nor the query language can be
> detected reliably, I think it makes no sense to use a stemmer on them,
> because a stemmer is tied to a specific language.
>
> My plan is to index all the documents in the same index, without any
> stemming (the users will have to search for the exact words they are
> looking for).
>
> But I'm not sure if this approach will make the index too big or too
> slow, or if there is a better way to index this kind of document.
>
> Any suggestion would be much appreciated.
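For readers curious about the n-gram language-detection step mentioned in the quoted mail, here is a minimal sketch of the idea: build character n-gram frequency profiles for each known language and pick the language whose profile best overlaps the input. This is a toy illustration under my own assumptions (the two hard-coded sample sentences stand in for real training corpora, and the overlap score is a deliberately crude measure); it is not what Solr itself does, and a production system would use a dedicated detector trained on large corpora.

```python
from collections import Counter

def char_ngrams(text, n=3):
    """Build a frequency profile of character n-grams (trigrams by default)."""
    text = " ".join(text.lower().split())  # normalize case and whitespace
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))

def overlap(profile_a, profile_b):
    """Crude similarity: sum of shared n-gram counts (min of the two frequencies)."""
    return sum(min(profile_a[g], profile_b[g]) for g in profile_a if g in profile_b)

def detect_language(text, reference_profiles):
    """Return the language whose reference profile best overlaps the text."""
    profile = char_ngrams(text)
    return max(reference_profiles,
               key=lambda lang: overlap(profile, reference_profiles[lang]))

# Toy reference profiles; a real detector would train these on large corpora.
references = {
    "en": char_ngrams("the quick brown fox jumps over the lazy dog and the cat"),
    "es": char_ngrams("el rapido zorro marron salta sobre el perro perezoso y el gato"),
}

print(detect_language("the dog and the fox", references))  # -> en
print(detect_language("el perro y el gato", references))   # -> es
```

As the quoted mail points out, this kind of detection is unreliable on short or mixed-language text (exactly the e-mail-snippet case), which is the argument for skipping per-language stemming in the first place.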