[Wikidata-bugs] [Maniphest] [Commented On] T89733: Allow ContentHandler to expose structured data to the search engine.

dcausse Tue, 10 May 2016 02:05:37 -0700

dcausse added a comment.


  >> For Wikibase, I am experimenting with unnested fields for multilingual 
content, though then we have hundreds of unnested fields and not sure there is 
some point where it's too many?
  > 
  > Since Elastic says nested field is separate document anyway 
(https://www.elastic.co/guide/en/elasticsearch/reference/2.3/nested.html 
towards the end) the question is what is better - having a lot of fields in the 
same doc or separate one. We'll need to check that.
  >  Maybe @dcausse can help with it?
  
  The drawback of nested fields is that, as you said, they create one 
subdocument per nested field. Then, for performance reasons, a bitset is loaded 
into RAM to join parent and child.
  Note that nested fields won't allow you to set a specific analyzer for a 
specific language. The only advantage of nested fields in this case is that 
they let you manage the list of supported languages without any mapping 
change. You'll be able to add a new language and still query like this: 
"label.text:HIQaH AND label.lang:KLINGON".
  Word frequencies will still be mixed and you'll have to analyze all the 
languages with the same analyzer.
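
  For illustration, a minimal sketch of what such a nested mapping and query 
could look like, written as Python dicts in Elasticsearch 2.x syntax. The 
"label" field with "lang" and "text" subfields follows the query string above; 
the mapping details and the helper name are assumptions, not what is deployed.

```python
# Sketch only: one nested "label" object per language, all sharing one analyzer.
nested_mapping = {
    "properties": {
        "label": {
            "type": "nested",
            "properties": {
                # Language code stored as an exact, unanalyzed value.
                "lang": {"type": "string", "index": "not_analyzed"},
                # A single analyzer for every language: the limitation noted above.
                "text": {"type": "string", "analyzer": "standard"},
            },
        }
    }
}


def nested_label_query(text, lang):
    """Hypothetical helper: match `text` in labels of one language."""
    return {
        "nested": {
            "path": "label",
            "query": {
                "bool": {
                    "must": [
                        {"match": {"label.text": text}},
                        {"term": {"label.lang": lang}},
                    ]
                }
            },
        }
    }


query = nested_label_query("HIQaH", "KLINGON")
```

  Adding a language here needs no mapping change, since the language is just a 
value of "label.lang".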
  
  This mapping question is really tough. Ideally we'd like to:
  
  1. have proper scoring for a specific language: if I'm French I want French 
words to be weighted properly against French content. I don't want to decrease 
the weight of a term because it's popular in another language: if I search for 
Thé in French I don't want to decrease its weight because it's a common word in 
English.
  2. have proper analysis for a specific language: I don't want a word stemmed 
by a French analyzer to collide with a stem from another language.
  3. I want to boost French content if I'm French, and possibly additional 
languages in the same query.
  4. I'd like to change language boost values at query time (if we rely on 
index-time boosting techniques we will never be able to tune the system)
  5. I also want to boost a particular field over another: I'd like documents 
that match labels to come first, then those that match descriptions. In 
other words: the language boost needs to be combined with the field type boost.
  6. I want to add query-independent factors to the scoring formula (number of 
statements/labels/any particular metadata)
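
  As a rough sketch of how points 3 to 6 could combine at query time, the 
snippet below builds a function_score query around a multi_match. The field 
names ("labels.<lang>", "descriptions.<lang>", "statement_count"), the boost 
values and the helper names are all assumptions made for illustration.

```python
# Hypothetical boost tables, supplied at query time so they can be tuned freely.
LANG_BOOSTS = {"fr": 2.0, "en": 1.0}
FIELD_BOOSTS = {"labels": 3.0, "descriptions": 1.0}


def boosted_fields(lang_boosts, field_boosts):
    """Combine the field-type boost with the language boost for every subfield."""
    return [
        f"{field}.{lang}^{field_boost * lang_boost}"
        for field, field_boost in field_boosts.items()
        for lang, lang_boost in lang_boosts.items()
    ]


def build_query(text, lang_boosts, field_boosts, popularity_field="statement_count"):
    """Text relevance multiplied by a query-independent factor."""
    return {
        "function_score": {
            "query": {
                "multi_match": {
                    "query": text,
                    "fields": boosted_fields(lang_boosts, field_boosts),
                }
            },
            # log2p keeps the factor positive when the count is 0 or missing.
            "field_value_factor": {
                "field": popularity_field,
                "modifier": "log2p",
                "missing": 0,
            },
            "boost_mode": "multiply",
        }
    }


query = build_query("thé", LANG_BOOSTS, FIELD_BOOSTS)
```

  Because the boosts live in the query rather than in the index, changing them 
needs no reindexing.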
  
  Nested fields will probably break 1 and 2; 3 will be OK but sub-optimal. 
Considering that they also have some performance drawbacks, I don't consider 
them a good fit here (unless adding new languages in real time is a blocker).
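
  A toy calculation of why mixed word frequencies break point 1. It uses the 
classic Lucene formula idf = 1 + ln(N / (df + 1)); the tiny corpora are made 
up, and tokens are assumed to be lowercased and accent-folded so that "Thé" 
and "the" end up as the same term.

```python
import math


def classic_idf(term, docs):
    """Classic Lucene IDF: 1 + ln(numDocs / (docFreq + 1))."""
    doc_freq = sum(1 for doc in docs if term in doc)
    return 1.0 + math.log(len(docs) / (doc_freq + 1))


# Made-up label tokens, already lowercased and accent-folded.
french_labels = [
    ["the", "vert"],
    ["cafe", "noir"],
    ["maison", "bleue"],
    ["chat", "noir"],
]
english_labels = [
    ["the", "green", "tea"],
    ["the", "black", "coffee"],
    ["the", "blue", "house"],
    ["the", "black", "cat"],
]

# One field per language: "the" is rare among French labels, so it weighs more.
idf_french_only = classic_idf("the", french_labels)

# One shared field: English usage inflates the document frequency.
idf_mixed = classic_idf("the", french_labels + english_labels)
```

  In the shared field the French term is penalised only because the same 
token is frequent in English.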
  
  At this stage I'd continue to investigate the solution proposed by @aude 
(unnested subfields). Too many subfields is generally not a good idea, but I 
don't see another option that fits the requirements above.
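
  To make the unnested option concrete, here is a small sketch that generates 
one subfield per language, each with its own analyzer, in Elasticsearch 2.x 
mapping syntax. The field names "labels" and "descriptions", the language list 
and the fallback to the "standard" analyzer are assumptions for illustration.

```python
# Built-in Elasticsearch language analyzers; anything else falls back to "standard".
ANALYZERS = {"en": "english", "fr": "french", "de": "german"}


def multilingual_field(languages):
    """One plain (unnested) subfield per language code."""
    return {
        "properties": {
            lang: {"type": "string", "analyzer": ANALYZERS.get(lang, "standard")}
            for lang in languages
        }
    }


def build_mapping(languages):
    """Hypothetical mapping with per-language labels and descriptions."""
    return {
        "properties": {
            "labels": multilingual_field(languages),
            "descriptions": multilingual_field(languages),
        }
    }


mapping = build_mapping(["en", "fr", "de", "tlh"])

# The number of leaf fields grows as (field types) x (languages).
field_count = sum(len(f["properties"]) for f in mapping["properties"].values())
```

  The trade-off is visible in field_count: every new language adds one 
subfield per field type and requires a mapping update.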
  
  One optimization that could work is to use a kind of "allfield" that is 
used only for fast filtering; the language subfields are then used only for 
scoring. With a filter like that the query will be slightly simpler since 
you'll be able to use simple disjunctions. I haven't thought through all the 
implementation details, but I suppose something like 
https://github.com/yakaz/elasticsearch-analysis-combo could help to build such 
a field (as long as it's used only for filtering).
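
  A rough sketch of the query shape this would give, assuming a hypothetical 
"all_labels" field that holds every language's tokens and "labels.<lang>" 
subfields for scoring. How the allfield itself gets populated is left open.

```python
def filtered_language_query(text, lang_boosts, all_field="all_labels"):
    """Filter on the catch-all field, score on the per-language subfields."""
    return {
        "bool": {
            # Filter context: selects candidates, contributes nothing to the score.
            "filter": [
                {"match": {all_field: {"query": text, "operator": "and"}}}
            ],
            # Simple disjunction: each language subfield adds to the score.
            "should": [
                {"match": {f"labels.{lang}": {"query": text, "boost": boost}}}
                for lang, boost in lang_boosts.items()
            ],
        }
    }


query = filtered_language_query("thé", {"fr": 2.0, "en": 1.0})
```

  With a filter clause present the should clauses are optional, so they only 
affect ranking, not which documents match.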

TASK DETAIL
  https://phabricator.wikimedia.org/T89733

EMAIL PREFERENCES
  https://phabricator.wikimedia.org/settings/panel/emailpreferences/

To: daniel, dcausse
Cc: DannyH, hoo, Deskana, RobLa-WMF, Tgr, Yurik, dcausse, JanZerebecki, 
Smalyshev, matthiasmullie, aude, Ricordisamoa, Krenair, MZMcBride, bd808, 
brion, Manybubbles, Aklapper, daniel, D3r1ck01, Izno, Luke081515, 
Wikidata-bugs, GWicke, jayvdb, fbstj, Jackmcbarn, Mbch331, Jay8g, Ltrlg, Legoktm



_______________________________________________
Wikidata-bugs mailing list
[email protected]
https://lists.wikimedia.org/mailman/listinfo/wikidata-bugs
