François

I think there is a language identification tool in the Nutch code base; otherwise, I have written one in Perl which could easily be translated to Java. I won't have access to it for 10 days (I am traveling), but I am happy to send you a link to it when I get back (and to anyone else who wants it).

Cheers

François

On Mar 25, 2011, at 11:59 AM, Grant Ingersoll wrote:

> You are looking for a language identification tool. You could check
> https://issues.apache.org/jira/browse/SOLR-1979 for the start of this.
> Otherwise, you have to roll your own or buy a third-party one.
>
> On Mar 24, 2011, at 12:24 PM, fr.jur...@voila.fr wrote:
>
>> Hello Solrists,
>>
>> As it says in the subject line, I'm looking for a Java component that,
>> given an ISO 639-1 code or some equivalent,
>> would return a Lucene Analyzer ready to gobble documents in the
>> corresponding language.
>> Solr looks like it has to contain one,
>> only I've not been able to locate it so far;
>> can you point me to the spot?
>>
>> I've found org.apache.solr.analysis,
>> and things like org.apache.lucene.analysis.bg &c in lucene/modules,
>> with many classes which I'm sure are related; however, the factory itself
>> still eludes me.
>> I mean the Java class.method that'd decide, on request, what to do with all
>> these packages
>> to bring the requisite object into existence, once the language is specified.
>> Where should I look? Or was I mistaken & Solr has nothing of the kind, at
>> least in Java?
>> Thanks in advance for your help.
>>
>> Best regards,
>> François Jurain.
>
> --------------------------
> Grant Ingersoll
> http://www.lucidimagination.com/
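
For what it's worth, the kind of factory asked about in the quoted message (ISO 639-1 code in, analyzer out) could be sketched roughly as below if you end up rolling your own. This is only a sketch under assumptions: the class name AnalyzerRegistry and its methods are hypothetical and are not part of Solr or Lucene, and the analyzer type is left as a type parameter so the snippet compiles without Lucene on the classpath. In real use the type parameter would be org.apache.lucene.analysis.Analyzer.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.function.Supplier;

/**
 * Hypothetical registry (not a Solr or Lucene class) that maps ISO 639-1
 * codes to suppliers of analyzers. A is generic so the sketch stands alone;
 * with Lucene it would be org.apache.lucene.analysis.Analyzer.
 */
public class AnalyzerRegistry<A> {
    private final Map<String, Supplier<A>> byLanguage = new HashMap<>();
    private final Supplier<A> fallback;

    /** The fallback supplier is used for null or unregistered codes. */
    public AnalyzerRegistry(Supplier<A> fallback) {
        this.fallback = fallback;
    }

    /** Registers a supplier for one language code; returns this for chaining. */
    public AnalyzerRegistry<A> register(String iso639, Supplier<A> supplier) {
        byLanguage.put(normalize(iso639), supplier);
        return this;
    }

    /** Builds a fresh analyzer for the code, or the fallback if none is registered. */
    public A forLanguage(String iso639) {
        Supplier<A> supplier = (iso639 == null) ? null : byLanguage.get(normalize(iso639));
        if (supplier == null) {
            supplier = fallback;
        }
        return supplier.get();
    }

    // Codes are matched case-insensitively and ignoring surrounding whitespace.
    private static String normalize(String code) {
        return code.trim().toLowerCase(Locale.ROOT);
    }
}
```

With Lucene on the classpath you would register something like a supplier that constructs org.apache.lucene.analysis.bg.BulgarianAnalyzer under "bg", with a general-purpose analyzer as the fallback. The constructor arguments depend on the Lucene version in use, so I have left them out here.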