Smalyshev added a comment.
I've looked into how we map fields in CirrusSearch now, and we have these broad types:

- Date: a match for INDEX_TYPE_DATETIME.
- Integer: may be a match for INDEX_TYPE_QUANTITY, but I think we need separate types for floats and integers. ElasticSearch has support for both.
- String: that would be INDEX_TYPE_TEXT.
- Keyword string, which disables analysis, with a case-folded keyword string as a variant. Probably INDEX_TYPE_IDENTIFIER?
- Composite field: e.g. redirect is namespace+title, where namespace is a long and title is a string.
- geo_point: obviously INDEX_TYPE_GEOPOINT.

We do not seem to have anything like INDEX_TYPE_MULTILINGUAL, and it may be hard to do, as I imagine the analyzers would be different for different languages. We do have the ability to have different subfields in ElasticSearch, but I'm not sure it'd be OK with 800 subfields.

We also have a bunch of fields with custom configurations for ElasticSearch, such as analyzers, options, etc. Index options are very frequently used: https://www.elastic.co/guide/en/elasticsearch/reference/2.3/index-options.html

Also very frequently used are subfields with different analyzers than the main field. Many of these definitions are similar, but they are defined in an ElasticSearch-specific way, so we need some way to define engine-specific options. The base for it may be the functions buildKeywordField(), buildLowercaseKeywordField(), buildLongField() and buildStringField() in MappingConfigBuilder.php.

We also must ensure namespacing: extensions should not create fields with the same name as existing fields.

`getWeight()` may be too simplistic. At least for ElasticSearch, there are more tweaks that determine relevancy, and index-time boosting is officially called a "bad idea" in the manual: https://www.elastic.co/guide/en/elasticsearch/reference/2.3/index-boost.html

I would consider weighting to be part of the query definition and drop it from the index field.
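
To make the "generic type plus engine-specific options" idea concrete, here is a rough sketch in Python rather than PHP. It is not CirrusSearch code: the function names, the INDEX_TYPE_INTEGER / INDEX_TYPE_FLOAT split, and the string values of the constants are assumptions for illustration only. The mapping fragments follow the ElasticSearch 2.x style (`string` with `not_analyzed` for keywords).

```python
# Illustrative sketch only, not CirrusSearch code.
# INDEX_TYPE_INTEGER and INDEX_TYPE_FLOAT are hypothetical replacements for a
# single INDEX_TYPE_QUANTITY; all function names here are made up.

INDEX_TYPE_TEXT = "text"
INDEX_TYPE_IDENTIFIER = "identifier"
INDEX_TYPE_INTEGER = "integer"      # hypothetical
INDEX_TYPE_FLOAT = "float"          # hypothetical
INDEX_TYPE_DATETIME = "datetime"
INDEX_TYPE_GEOPOINT = "geopoint"

# One "bucket" per generic type, expressed as an ElasticSearch 2.x mapping.
ELASTIC_BASE_MAPPINGS = {
    INDEX_TYPE_TEXT: {"type": "string"},
    INDEX_TYPE_IDENTIFIER: {"type": "string", "index": "not_analyzed"},
    INDEX_TYPE_INTEGER: {"type": "long"},
    INDEX_TYPE_FLOAT: {"type": "double"},
    INDEX_TYPE_DATETIME: {"type": "date"},
    INDEX_TYPE_GEOPOINT: {"type": "geo_point"},
}


def elastic_field(index_type, engine_options=None):
    """Build a mapping fragment from a generic type plus optional
    engine-specific tweaks (e.g. index_options, subfields)."""
    if index_type not in ELASTIC_BASE_MAPPINGS:
        raise ValueError("unknown index type: %s" % index_type)
    mapping = dict(ELASTIC_BASE_MAPPINGS[index_type])
    mapping.update(engine_options or {})
    return mapping


def add_field(mappings, name, index_type, engine_options=None):
    """Register a field, refusing to overwrite an existing name so that
    extensions cannot clobber each other's fields."""
    if name in mappings:
        raise KeyError("field already defined: %s" % name)
    mappings[name] = elastic_field(index_type, engine_options)
    return mappings


mappings = {}
add_field(mappings, "title", INDEX_TYPE_TEXT, {"index_options": "offsets"})
add_field(mappings, "entity_id", INDEX_TYPE_IDENTIFIER)
```

With this shape, an extension only names the generic type, and the engine-specific part stays an optional dictionary that other engines could ignore. Note there is no weight in the field definition; under the suggestion above it would live in the query instead.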
If we reverse it and ask for field data for a specific engine, this implies we have to have a lot of knowledge about the engine inside the specific extension. It looks like in most cases this knowledge is not strictly required: while a particular field definition can be very complex, fields do group into a number of large buckets with similar tweaks applied to them.

TASK DETAIL
https://phabricator.wikimedia.org/T89733
