dcausse added a comment.
>> For Wikibase, I am experimenting with unnested fields for multilingual content, though then we have hundreds of unnested fields and not sure there is some point where it's too many?
>
> Since Elastic says nested field is separate document anyway (https://www.elastic.co/guide/en/elasticsearch/reference/2.3/nested.html towards the end) the question is what is better - having a lot of fields in the same doc or separate one. We'll need to check that.
> Maybe @dcausse can help with it?

The drawback of nested fields is that, like you said, they create one subdocument per nested field. Then, for performance reasons, a bitset will be loaded into RAM to join parent and child. Note that nested fields won't allow you to set a specific analyzer for a specific language.

The only advantage of nested fields in this case is that they let you manage the list of supported languages without any mapping change. You'll be able to add a new language and still query like this: "label.text:HIQaH AND label.lang:KLINGON". Word frequencies will still be mixed and you'll have to analyze all the languages with the same analyzer.

This mapping question is really tough. Ideally we'd like to:

1. Have proper scoring for a specific language: if I'm French I want to properly weight French words against French content. I don't want to decrease the weight of a term because it's popular in another language: if I search for "Thé" in French I don't want to decrease its weight because "the" is a common word in English.
2. Have proper analysis for a specific language: I don't want a word stemmed by a French analyzer to collide with a stem from another language.
3. Boost French content if I'm French, and possibly additional languages in the same query.
4. Change language boost values at query time (if we rely on index-time boosting techniques we will never be able to tune the system).
5. Boost a particular field over another: I'd like to have documents that match labels first and then those that match descriptions. In other words, the language boost needs to be combined with the field type boost.
6. Add query-independent factors to the scoring formula (number of statements, number of labels, any particular metadata).

Nested fields will probably break 1 and 2. 3 will be OK but sub-optimal. Considering that they also have some perf drawbacks, I don't consider them a good fit here (unless adding new languages in real time is a blocker).

At this stage I'd continue to investigate the solution proposed by @aude (unnested subfields). Too many subfields is generally not a good idea, but I don't see another option that fits the requirements above.

One optimization that could work is to use a kind of "allfield" that will be used only for fast filtering; the language subfields will then be used only for scoring. With a filter like that the query will be slightly simpler, since you'll be able to use simple disjunctions. I haven't thought about all the implementation details, but I suppose something like https://github.com/yakaz/elasticsearch-analysis-combo could help to build such a field (as long as it's used only for filtering).

TASK DETAIL
  https://phabricator.wikimedia.org/T89733
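
As a rough illustration of requirements 3 to 6 combined with the "allfield" filtering idea, here is a minimal sketch in Python that only builds the query body as a dict. The field names (`all`, `labels.<lang>`, `descriptions.<lang>`, `statement_count`), the helper `build_query` and the boost values are all assumptions made up for the example; they are not the actual Wikibase mapping, and the real query would need tuning.

```python
import json

# Hypothetical field-type boosts: labels should outrank descriptions.
FIELD_BOOSTS = {"labels": 3.0, "descriptions": 1.0}


def build_query(text, lang_boosts, field_boosts=FIELD_BOOSTS):
    """Filter on a catch-all field, then score on per-language subfields."""
    should = []
    for lang, lang_boost in lang_boosts.items():
        for field, field_boost in field_boosts.items():
            should.append({
                "match": {
                    # e.g. "labels.fr": one unnested subfield per language,
                    # each with its own analyzer and term statistics.
                    f"{field}.{lang}": {
                        "query": text,
                        # Query-time language boost combined with field boost.
                        "boost": lang_boost * field_boost,
                    }
                }
            })
    return {
        "query": {
            "function_score": {
                "query": {
                    "bool": {
                        # Assumed "all" field: used only to select candidates,
                        # it does not contribute to the score.
                        "filter": [
                            {"match": {"all": {"query": text, "operator": "and"}}}
                        ],
                        # Simple disjunction over the language subfields.
                        "should": should,
                    }
                },
                # Query-independent factor, here an assumed statement counter.
                "field_value_factor": {
                    "field": "statement_count",
                    "modifier": "log1p",
                    "missing": 0,
                },
                "boost_mode": "sum",
            }
        }
    }


body = build_query("thé", {"fr": 2.0, "en": 0.5})
print(json.dumps(body, ensure_ascii=False, indent=2))
```

Because the boosts are plain query parameters, they can be changed per request without reindexing, which is the point of requirement 4.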
