[Wikidata-bugs] [Maniphest] [Commented On] T150891: Find a good way to represent multi-lingual text fields in Elastic

dcausse Thu, 17 Nov 2016 03:40:38 -0800

dcausse added a comment.

There are tons of possibilities, and the solution depends heavily on the use cases you'd like to support. I think more precise examples would definitely help.

Note that the term representation in Elastic is not merely intended as a search index, but also for retrieving all labels/descriptions for a given subject.

@daniel can you elaborate on this point?

As pointed out by Stas, index expansion might not be doable if we plan to leverage language-specific analyzers: it's not possible to mix different index analyzers on the same field.

Going with a one-field-per-language approach is certainly doable. The use cases you'd like to support are still unclear to me, but the following setup might work:

- a plain "all" field with ICU tokenization to support exact match; all languages could be merged into this single field. We would have to verify that term collisions between languages do not cause too much trouble.
- a field per language with language-aware analyzers (stemming support if available). Using copy_to can automatically copy this content to the plain "all" field.

The input doc would look like:

{
  labels: {
    en: ["This entity", "Entity"],
    fr: ["Cette entité"],
    ...
  }
}
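
To make the two-part setup above more concrete, here is a minimal sketch of index settings and mappings, written as a small Python helper that builds the request body. Several things are assumptions and not part of the proposal itself: the analyzer name plain_icu, the target field name labels_all, the "text" field type (older Elasticsearch versions used "string"), the choice of a lowercase filter, and the language-to-analyzer table passed in at the bottom. The icu_tokenizer requires the analysis-icu plugin.

```python
def build_label_mapping(languages, all_field="labels_all"):
    """Build index settings/mappings for per-language label fields.

    `languages` maps a language code to the name of an analyzer to use
    for that language. All names here are illustrative assumptions.
    """
    settings = {
        "analysis": {
            "analyzer": {
                # Hypothetical analyzer name; icu_tokenizer comes from
                # the analysis-icu plugin.
                "plain_icu": {
                    "type": "custom",
                    "tokenizer": "icu_tokenizer",
                    "filter": ["lowercase"],
                }
            }
        }
    }

    per_language = {}
    for code, analyzer in languages.items():
        per_language[code] = {
            "type": "text",
            "analyzer": analyzer,
            # copy_to feeds the same input text into the plain "all" field,
            # where it is analyzed with that field's own analyzer.
            "copy_to": all_field,
        }

    mappings = {
        "properties": {
            all_field: {"type": "text", "analyzer": "plain_icu"},
            "labels": {"properties": per_language},
        }
    }
    return {"settings": settings, "mappings": mappings}


# Assumed analyzer choices; de_ch reusing "german" is only a guess.
body = build_label_mapping(
    {"en": "english", "fr": "french", "de": "german", "de_ch": "german"}
)
```

The resulting dict could be sent as the body of an index-creation request. Languages without a dedicated analyzer would need some fallback analyzer, which this sketch does not try to choose.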

A query (for a Swiss German user) would look like this, assuming you always want to fall back to English stems (remove the labels.en clause if you're OK with falling back only on exact matches):

labels_all:query^0.5 OR labels.de_ch:query^2 OR labels.de:query^1 OR labels.en:query^0.5
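
As a rough sketch of how such a query string could be assembled from a language fallback chain, the hypothetical helper below emits one boosted clause per field, each carrying the search term. The function name, the (code, boost) pair format, and the assumption that the term is a single word with no query-syntax special characters are all illustrative.

```python
def build_label_query(term, fallback_chain, all_field="labels_all", all_boost=0.5):
    """Build a query_string-style expression with per-language boosts.

    `fallback_chain` is a list of (language_code, boost) pairs, most
    specific language first. `term` is assumed to be a single word that
    needs no escaping.
    """
    # Exact-match clause against the plain "all" field.
    clauses = [f"{all_field}:{term}^{all_boost}"]
    # One stemmed clause per language in the fallback chain.
    for code, boost in fallback_chain:
        clauses.append(f"labels.{code}:{term}^{boost}")
    return " OR ".join(clauses)


# Swiss German user falling back to German, then English.
query = build_label_query("query", [("de_ch", 2), ("de", 1), ("en", 0.5)])
```

Dropping the ("en", 0.5) entry from the chain gives the variant that falls back to English only through exact matches on the plain "all" field.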

Concerning performance, it's hard to tell, but we recently switched to a per-field builder (we query 14 fields) on the top 10 wikis and it seems to be OK so far.

ICU tokenization is important here as it's extremely convenient: it tokenizes text by first detecting the script and then applying script-specific tokenization, e.g. when it detects Traditional Chinese script it switches to a dictionary-based tokenizer. The drawback is that it can split words written using mixed scripts (e.g. ßeta => ["ß", "eta"]).

@aude could you add a link to the experiment you started? I remember that it was going in the right direction.

Overall, given all the possibilities and the diversity of languages, it's really hard to anticipate what will work, so I'd suggest investing more time in experimenting with various techniques.

TASK DETAIL

https://phabricator.wikimedia.org/T150891

To: aude, dcausse
Cc: EBernhardson, dcausse, hoo, Ricordisamoa, aude, Deskana, StudiesWorld, Aklapper, Smalyshev, Tobi_WMDE_SW, thiemowmde, JanZerebecki, gerritbot, Jonas, daniel, EBjune, mschwarzer, Avner, debt, Gehel, D3r1ck01, FloNight, Izno, Wikidata-bugs, jayvdb, Mbch331, jeremyb
