Smalyshev added a comment.

  I've looked into how we currently map fields in CirrusSearch, and we have these broad types:
  
  - Date - a match for INDEX_TYPE_DATETIME
  - Integer - may be a match for INDEX_TYPE_QUANTITY, but I think we need 
separate types for floats and integers. ElasticSearch has support for both.
  - String - that would be INDEX_TYPE_TEXT.
  - Keyword string, which disables analysis; there is also a case-folded 
variant. Probably INDEX_TYPE_IDENTIFIER?
  - Composite field - e.g. redirect is namespace+title, where namespace is long 
and title is string.
  - geo_point - obviously INDEX_TYPE_GEOPOINT.
  - We do not seem to have anything like INDEX_TYPE_MULTILINGUAL, and it may be 
hard to do, as I imagine the analyzers would differ between languages. 
We do have the ability to define different subfields in ElasticSearch, but I'm 
not sure it'd be OK with 800 subfields.
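
  To make the list above concrete, here is a rough sketch of such a type
  table. It is written in Python purely for illustration and is not
  CirrusSearch code. INDEX_TYPE_INTEGER and INDEX_TYPE_FLOAT are hypothetical
  names for the integer/float split suggested above, and es_mapping_for() is a
  made-up helper. The mapping values assume ElasticSearch 2.x, where a keyword
  is a string with analysis turned off. Composite and multilingual fields are
  left out because it is not clear yet how they would map.

```python
# Hypothetical table: generic index type -> ElasticSearch 2.x field mapping.
# INDEX_TYPE_INTEGER and INDEX_TYPE_FLOAT are assumed names, not existing
# constants; they stand in for the proposed integer/float split.
ES_MAPPINGS = {
    "INDEX_TYPE_DATETIME": {"type": "date"},
    "INDEX_TYPE_INTEGER": {"type": "long"},
    "INDEX_TYPE_FLOAT": {"type": "double"},
    "INDEX_TYPE_TEXT": {"type": "string"},
    "INDEX_TYPE_IDENTIFIER": {"type": "string", "index": "not_analyzed"},
    "INDEX_TYPE_GEOPOINT": {"type": "geo_point"},
}


def es_mapping_for(index_type):
    """Return a copy of the ElasticSearch mapping for a generic index type."""
    try:
        return dict(ES_MAPPINGS[index_type])
    except KeyError:
        raise ValueError("no ElasticSearch mapping for %s" % index_type)
```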
  
  We also have a bunch of fields with custom configurations for ElasticSearch, 
such as analyzers, options, etc. Very frequently used are index options: 
https://www.elastic.co/guide/en/elasticsearch/reference/2.3/index-options.html
  
  Also very frequently used are subfields with different analyzers than the 
main field. Many of these definitions are similar, but defined in 
ElasticSearch-specific way, so we need some way to define engine-specific 
options. 
  The base for it may be functions buildKeywordField(), 
buildLowercaseKeywordField(), buildLongField(), buildStringField() in 
MappingConfigBuilder.php.
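
  One possible shape for engine-specific options, sketched in Python for
  illustration only: the field is declared generically and carries optional
  hints keyed by engine name, which the engine's builder merges in.
  build_string_field() here is a hypothetical stand-in, not the PHP function
  of the similar name in MappingConfigBuilder.php, and the "hints" structure
  is an assumption. The option names index_options and fields are the
  ElasticSearch 2.x mapping parameters.

```python
def build_string_field(name, hints=None, engine="elasticsearch"):
    """Build a string field mapping, merging in hints meant for `engine`.

    `hints` maps an engine name to a dict of options only that engine
    understands; hints addressed to other engines are ignored.
    """
    mapping = {"type": "string"}
    mapping.update((hints or {}).get(engine, {}))
    return {name: mapping}


title = build_string_field("title", hints={
    "elasticsearch": {
        "index_options": "offsets",
        # Subfield with a different analyzer than the main field.
        "fields": {"plain": {"type": "string", "analyzer": "standard"}},
    },
})
```

  An extension that does not care about the engine would simply pass no
  hints and get the default definition for the bucket.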
  
  We also must ensure namespacing - extensions should not create fields with 
the same name as existing fields.
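
  A minimal sketch of how such a check could work, in Python for illustration.
  FieldRegistry and its methods are hypothetical names; nothing like this is
  claimed to exist in core or CirrusSearch today. The idea is only that
  registration fails loudly when a name is already taken.

```python
class FieldRegistry:
    """Collects field definitions and rejects duplicate field names."""

    def __init__(self):
        self._fields = {}

    def register(self, name, definition, owner):
        """Register a field for `owner`; raise if the name is already taken."""
        if name in self._fields:
            existing_owner = self._fields[name][0]
            raise ValueError(
                "field %r from %s clashes with a field from %s"
                % (name, owner, existing_owner)
            )
        self._fields[name] = (owner, definition)

    def names(self):
        """Return all registered field names in sorted order."""
        return sorted(self._fields)
```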
  
  `getWeight()` may be too simplistic - at least for ElasticSearch, there are 
more tweaks to determine relevancy, and index-time boosting is officially 
called a "bad idea" in the manual: 
https://www.elastic.co/guide/en/elasticsearch/reference/2.3/index-boost.html
  I would consider weighting to be part of the query definition and drop it 
from the index field.
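
  For illustration, query-time weighting could look roughly like the Python
  sketch below. build_query() and the example field names and weights are
  made up. The "field^boost" syntax in a multi_match query is standard
  ElasticSearch, so the weights live in the query and can be tuned without
  reindexing.

```python
def build_query(text, weights):
    """Build a multi_match query with per-field boosts applied at query time."""
    fields = ["%s^%s" % (name, weight) for name, weight in sorted(weights.items())]
    return {"query": {"multi_match": {"query": text, "fields": fields}}}


# Example weights are arbitrary; they are not taken from any real config.
query = build_query("some search text", {"title": 3, "text": 1})
```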
  
  If we reverse it and ask for field data for a specific engine, this implies 
we have to have a lot of knowledge about the engine inside the specific 
extension. It looks like for most cases this knowledge is not strictly 
required: while a particular field definition can be very complex, fields do 
group into a number of large buckets with similar tweaks to them.

TASK DETAIL
  https://phabricator.wikimedia.org/T89733

To: daniel, Smalyshev
Cc: DannyH, hoo, Deskana, RobLa-WMF, Tgr, Yurik, dcausse, JanZerebecki, 
Smalyshev, matthiasmullie, aude, Ricordisamoa, Krenair, MZMcBride, bd808, 
brion, Manybubbles, Aklapper, daniel, D3r1ck01, Izno, Luke081515, 
Wikidata-bugs, GWicke, jayvdb, fbstj, Jackmcbarn, Mbch331, Jay8g, Ltrlg, Legoktm


