Re: Issues with Language detection in Solr

Jack Krupansky Fri, 18 Oct 2013 11:23:04 -0700

I would say that in general you need at least 15 or 20 words in a text field for language to be detected reasonably well. Sure, sometimes it can work for 8 to 12 words, but flip a coin how reliable it will be.

You haven't shown us any true text fields. I would say that language detection against simple name fields is a misuse of the language detection feature. I mean, it is designed for larger blocks of text, not very short phrases.
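To make the word-count advice concrete, here is a minimal sketch of a client-side guard that only trusts a detector when the text is long enough and otherwise falls back to a default language. Everything in it is an assumption for illustration: the names `MIN_WORDS`, `has_enough_words`, and `language_or_fallback` are hypothetical, the `detect` argument stands in for whatever detector you actually call, and none of this is part of Solr's API. The 15-word cutoff is simply the lower bound mentioned above, not a measured value.

```python
# Hypothetical guard: skip language detection on very short text.
# MIN_WORDS reflects the "at least 15 or 20 words" rule of thumb; tune it for your data.
MIN_WORDS = 15


def has_enough_words(text, min_words=MIN_WORDS):
    """Return True if the text has at least min_words whitespace-separated tokens."""
    return len(text.split()) >= min_words


def language_or_fallback(text, detect, fallback="en", min_words=MIN_WORDS):
    """Call detect(text) only when the text is long enough; otherwise return fallback.

    `detect` is any callable that maps a string to a language code
    (a stand-in for a real detector, supplied by the caller).
    """
    if has_enough_words(text, min_words):
        return detect(text)
    return fallback
```

With the name and address from the question ("EXPLOITS VALLEY HIGHGREENWOOD 19 GREENWOOD AVE", six words), this guard would never call the detector and would return the fallback instead.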


See some examples in my e-book.

-- Jack Krupansky

-----Original Message-----
From: vibhoreng04
Sent: Friday, October 18, 2013 2:01 PM
To: solr-user@lucene.apache.org
Subject: Issues with Language detection in Solr

Hi All,

I am trying to detect the language of the business name field and the address field. I am using Solr's LangDetect (Google library), not Tika. It works OK in most cases, but in some it detects the language wrongly. For example, the document:

"OrgName": "EXPLOITS VALLEY HIGHGREENWOOD",
"StreetLine1": "19 GREENWOOD AVE",
"StreetLine2": "",
"SOrgName": "EXPLOITS VALLEY HIGHGREENWOOD",
"StandardizedStreetLine1": "19 GREENWOOD AVE",
"language_s": [ "de" ]

Language is detected as German (de) here, which is wrong. Below is my configuration:

+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
OrgName,StreetLine1,StreetLine2,SOrgName,StandardizedStreetLine1
language_s 0.9 en
+++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

Why is there an issue? Why is the language detection wrong? Please help!

Vibhor



--

View this message in context: http://lucene.472066.n3.nabble.com/Issues-with-Language-detection-in-Solr-tp4096433.html
Sent from the Solr - User mailing list archive at Nabble.com.
