Hi,

Does anyone have an answer to this question, which I posted last week? I know how to generate language profiles for Nutch, but not for Tika.
--
Jan Høydahl, search solution architect
Cominvent AS - www.cominvent.com
Training in Europe - www.solrtraining.com

On 20. aug. 2010, at 22.07, Jan Høydahl / Cominvent wrote:

> Hi,
>
> What is the procedure to add a language profile to LanguageIdentifier? Do we
> use Wikipedia as the training set?
>
> I'd like to add some languages relevant for Norway.
> In Norway there are two official languages: nb and nn. These are recommended
> over the common "no" tag.
>
> We also have a third language, Sami, which comes in Northern Sami and
> Southern Sami variants. The referenced ISO-639 list
> (http://www.w3.org/WAI/ER/IG/ert/iso639.htm) is obsolete, as it does not list
> any of these. A better list is
> http://www.loc.gov/standards/iso639-2/php/code_list.php
>
> What if we have a requirement to represent language dialects such as en-US
> and en-GB? ISO-639 does not deal with these. Perhaps it is better to switch to
> RFC 5646 and the IANA Language Subtag Registry
> (http://rishida.net/utils/subtags/), which uses ISO-639-1 and ISO-639-2 but
> allows for region variants as well?
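
To illustrate the en-US / en-GB point, here is a minimal sketch of how RFC 5646 tags split into a language subtag and an optional region subtag. It assumes Java 7 or later and uses only the standard `java.util.Locale` class; it says nothing about how Tika's LanguageIdentifier stores or names its profiles, and the class name `LanguageTagDemo` is made up for the example.

```java
import java.util.Locale;

// Hypothetical demo class: shows the structure of RFC 5646 language tags
// using the JDK's Locale parser. It does not touch Tika at all.
public class LanguageTagDemo {

    // Returns the primary language subtag, e.g. "en" for "en-GB".
    static String language(String tag) {
        return Locale.forLanguageTag(tag).getLanguage();
    }

    // Returns the region subtag, or an empty string if the tag has none.
    static String region(String tag) {
        return Locale.forLanguageTag(tag).getCountry();
    }

    public static void main(String[] args) {
        // nb/nn: Norwegian Bokmål/Nynorsk; se: Northern Sami (ISO 639-1);
        // sma: Southern Sami (ISO 639-2, no two-letter code).
        String[] tags = {"en-US", "en-GB", "nb", "nn", "se", "sma"};
        for (String tag : tags) {
            System.out.println(tag + " -> language=" + language(tag)
                    + ", region=" + region(tag));
        }
    }
}
```

With tags like these, a profile keyed on the full tag could distinguish en-US from en-GB, while one keyed only on the language subtag would treat both as "en".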
