I am trying to use the “langid.map.individual” setting so that field “a” is
detected as, say, English and mapped to “a_en”, while in the same document
field “b” is detected as, say, German and mapped to “b_de”.
What happens in my tests is that the global language is detected (for example,
German), and BOTH fields are mapped to “_de” as a result. I cannot get
individual detection or mapping to work. Am I misunderstanding the purpose of
this setting?
Here is the resulting document from my test:
----------------
{
  "id": "1005!22345",
  "language": [
    "de"
  ],
  "a_de": "A title that should be detected as English with high confidence",
  "b_de": "Die Einführung einer anlasslosen Speicherung von Passagierdaten für alle Flüge aus einem Nicht-EU-Staat in die EU und umgekehrt ist näher gerückt. Der Ausschuss des EU-Parlaments für bürgerliche Freiheiten, Justiz und Inneres (LIBE) hat heute mit knapper Mehrheit für einen entsprechenden Richtlinien-Entwurf der EU-Kommission gestimmt. Bürgerrechtler, Grüne und Linke halten die geplante Richtlinie für eine andere Form der anlasslosen Vorratsdatenspeicherung, die alle Flugreisenden zu Verdächtigen mache.",
  "_version_": 1508494723734569000
}
----------------
I expected “a_de” to be “a_en”, and the “language” multi-valued field to have
“en” and “de”.
Here is my configuration in solrconfig.xml:
--------------------
<updateRequestProcessorChain name="langid" default="true">
  <processor class="org.apache.solr.update.processor.LangDetectLanguageIdentifierUpdateProcessorFactory">
    <lst name="defaults">
      <str name="langid">true</str>
      <str name="langid.fl">a,b</str>
      <str name="langid.map">true</str>
      <str name="langid.map.individual">true</str>
      <str name="langid.langField">language</str>
      <str name="langid.map.lcmap">af:uns,ar:uns,bg:uns,bn:uns,cs:uns,da:uns,el:uns,et:uns,fa:uns,fi:uns,gu:uns,he:uns,hi:uns,hr:uns,hu:uns,id:uns,ja:uns,kn:uns,ko:uns,lt:uns,lv:uns,mk:uns,ml:uns,mr:uns,ne:uns,nl:uns,no:uns,pa:uns,pl:uns,ro:uns,ru:uns,sk:uns,sl:uns,so:uns,sq:uns,sv:uns,sw:uns,ta:uns,te:uns,th:uns,tl:uns,tr:uns,uk:uns,ur:uns,vi:uns,zh-cn:uns,zh-tw:uns</str>
      <str name="langid.fallback">en</str>
    </lst>
  </processor>
  <processor class="solr.LogUpdateProcessorFactory" />
  <processor class="solr.RunUpdateProcessorFactory" />
</updateRequestProcessorChain>
--------------------
The debug output of lang detect, during indexing, is as follows:
-------------------
DEBUG - 2015-08-03 14:37:54.450; org.apache.solr.update.processor.LanguageIdentifierUpdateProcessor; Language detected de with certainty 0.9999964723182276
DEBUG - 2015-08-03 14:37:54.450; org.apache.solr.update.processor.LanguageIdentifierUpdateProcessor; Detected main document language from fields [a, b]: de
DEBUG - 2015-08-03 14:37:54.450; org.apache.solr.update.processor.LangDetectLanguageIdentifierUpdateProcessor; Appending field a
DEBUG - 2015-08-03 14:37:54.451; org.apache.solr.update.processor.LangDetectLanguageIdentifierUpdateProcessor; Appending field b
DEBUG - 2015-08-03 14:37:54.453; org.apache.solr.update.processor.LanguageIdentifierUpdateProcessor; Language detected de with certainty 0.9999964723182276
DEBUG - 2015-08-03 14:37:54.453; org.apache.solr.update.processor.LanguageIdentifierUpdateProcessor; Mapping field a using individually detected language de
DEBUG - 2015-08-03 14:37:54.454; org.apache.solr.update.processor.LanguageIdentifierUpdateProcessor; Doing mapping from a with language de to field a_de
DEBUG - 2015-08-03 14:37:54.454; org.apache.solr.update.processor.LanguageIdentifierUpdateProcessor; Mapping field 1005!22345 to de
DEBUG - 2015-08-03 14:37:54.454; org.eclipse.jetty.webapp.WebAppClassLoader; loaded class org.apache.solr.common.SolrInputField from WebAppClassLoader=525571@80503
DEBUG - 2015-08-03 14:37:54.454; org.apache.solr.update.processor.LanguageIdentifierUpdateProcessor; Removing old field a
DEBUG - 2015-08-03 14:37:54.455; org.apache.solr.update.processor.LangDetectLanguageIdentifierUpdateProcessor; Appending field a
DEBUG - 2015-08-03 14:37:54.455; org.apache.solr.update.processor.LangDetectLanguageIdentifierUpdateProcessor; Appending field b
DEBUG - 2015-08-03 14:37:54.456; org.apache.solr.update.processor.LanguageIdentifierUpdateProcessor; Language detected de with certainty 0.9999980402022373
DEBUG - 2015-08-03 14:37:54.456; org.apache.solr.update.processor.LanguageIdentifierUpdateProcessor; Mapping field b using individually detected language de
DEBUG - 2015-08-03 14:37:54.456; org.apache.solr.update.processor.LanguageIdentifierUpdateProcessor; Doing mapping from b with language de to field b_de
DEBUG - 2015-08-03 14:37:54.456; org.apache.solr.update.processor.LanguageIdentifierUpdateProcessor; Mapping field 1005!22345 to de
DEBUG - 2015-08-03 14:37:54.456; org.apache.solr.update.processor.LanguageIdentifierUpdateProcessor; Removing old field b
-------------
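From this, my takeaway is that every time the
LangDetectLanguageIdentifierUpdateProcessor is asked to detect the language, it
uses field a AND field b: “Appending field a” and “Appending field b” are both
logged before each “individually detected” result. But I can’t tell for certain
from this output alone.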
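To make that suspicion concrete, here is a toy Python sketch. It is not Solr or langdetect code: the stopword-counting toy_detect and the two mapping helpers are hypothetical names made up purely for illustration. It assumes the detector is handed the concatenation of all langid.fl fields on every call, which I have not confirmed against the Solr source.

```python
# Toy illustration only. Nothing here is Solr or langdetect API;
# all names are hypothetical and the "detector" just counts stopwords.
EN_WORDS = {"a", "the", "that", "should", "be", "as", "with", "title", "high"}
DE_WORDS = {"die", "der", "des", "von", "für", "und", "ist", "einer",
            "hat", "mit", "aus", "einem", "einen"}


def toy_detect(text):
    """Return 'en' or 'de' by counting hits against tiny stopword lists."""
    words = text.lower().split()
    en_hits = sum(w in EN_WORDS for w in words)
    de_hits = sum(w in DE_WORDS for w in words)
    return "en" if en_hits > de_hits else "de"


doc = {
    "a": "A title that should be detected as English with high confidence",
    "b": ("Die Einführung einer anlasslosen Speicherung von Passagierdaten "
          "für alle Flüge aus einem Nicht-EU-Staat in die EU und umgekehrt "
          "ist näher gerückt. Der Ausschuss des EU-Parlaments für "
          "bürgerliche Freiheiten, Justiz und Inneres (LIBE) hat heute mit "
          "knapper Mehrheit für einen entsprechenden Richtlinien-Entwurf "
          "der EU-Kommission gestimmt."),
}


def map_concatenated(fields):
    """Suspected behaviour: detect on ALL fields joined, for every field."""
    combined = " ".join(fields.values())
    return {f"{name}_{toy_detect(combined)}": value
            for name, value in fields.items()}


def map_per_field(fields):
    """Expected behaviour: detect on each field's own text only."""
    return {f"{name}_{toy_detect(value)}": value
            for name, value in fields.items()}


print(sorted(map_concatenated(doc)))  # field names under the suspected behaviour
print(sorted(map_per_field(doc)))     # field names under the expected behaviour
```

With this toy detector, the longer German text dominates the concatenated input, so both fields get the “_de” suffix, which is what I see in my test. Detecting on each field separately gives the “a_en” and “b_de” names I was expecting.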
Any insight appreciated.
Regards,
David