aude added a comment.

**Extra fields**
- labels
- descriptions
- aliases
- entity_type

(and potentially simple statements, such as for looking up items by identifier, which would actually be simpler to implement since it is not multilingual content)

**Modify the mapping in Elastic Search to add extra 'fields'**

I suggest we use the CirrusSearchMappingConfig hook to add the extra fields to the mapping, to start with. We can introduce 'field mapping builder' objects that build the mapping data structure for Elastic and, as a first step, use these directly in the hook handlers. Later, we can perhaps add an interface to the Content objects that exposes these fields for mapping, and use the 'field mapping builder' objects indirectly.

**Populate the extra fields during indexing**

I suggest (as a start) that we use the CirrusSearchBuildDocumentParse hook to have the extra fields indexed when indexing a page. At some point, we may want to add something to EntityContent (and Content generally) to expose these fields (https://phabricator.wikimedia.org/T78011) and implement a way for the SearchEngine implementations to consume them.

For now, I propose we introduce objects that build these data structures for the extra fields, with a generic interface. We can use these objects directly in the hook handlers, or indirectly via EntityContent (or just Content).

While we want better integration with EntityContent, it would also be nice to keep the Elastic Search Wikibase code clearly separated so that it is reusable.

**Multilingual indexing**

Multiple fields by language

  "page": {
    "dynamic": "false",
    "_all": { "enabled": false },
    "properties": {
      "label_de": { "type": "string" },
      "label_en": { "type": "string" },
      "label_es": { "type": "string" }
    }
  }

Multiple fields has the disadvantage that there would potentially be a very large number of these fields.
(One for every language * three term types.)

Nested type

  "page": {
    "dynamic": "false",
    "_all": { "enabled": false },
    "properties": {
      "labels": {
        "type": "nested",
        "properties": {
          "de": { "type": "string" },
          "en": { "type": "string" },
          "es": { "type": "string" }
        }
      }
    }
  }

Nested can be a problem when the nesting gets very large, which it would. To start with, this is what I am experimenting with, but I am not convinced this is what we want.

Other possible options for multilingual content:

- Language-specific child documents - might be nice and, if feasible, might be best. For language fallback, search / lookup could request a handful of languages and not have to retrieve all child documents.
- Multiple per-language indices - a lot of stuff would be duplicated, unless the per-language indices were child documents. I don't think we want this.

**Searching**

We should introduce an EntitySearch (or TermSearch) interface that SearchEntities and other code can use. We can also introduce a TermLookup implementation based on Elastic for things that use TermLookup.

There is some special syntax that can be used when searching with Cirrus, such as insource or incategory. If we want special syntax for things like labels, then we might want a hook added to Cirrus for this. The existing code that handles the special syntax is very complex; it would be good to factor it out and split it up to make it easier, nicer, and less bug-prone to hook into. If there can be a generic interface for this syntax, that would be even nicer.

**TODO**

- We still need to figure out better how to handle display text.

TASK DETAIL
https://phabricator.wikimedia.org/T117548
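To compare the two mapping layouts from the multilingual indexing section, here is a rough sketch of what a 'field mapping builder' could produce. It is written in Python purely for illustration and is not the proposed Wikibase code: the function names, the `TERM_FIELDS` table, and the choice to generate fields for all three term types are assumptions. Only the `page` / `properties` structure and the `string` and `nested` types come from the examples above.

```python
from typing import Dict, Iterable

# Hypothetical table: flat-field prefix -> nested field name, for the
# three term types (labels, descriptions, aliases).
TERM_FIELDS = {"label": "labels", "description": "descriptions", "alias": "aliases"}


def build_per_language_fields(languages: Iterable[str]) -> Dict[str, dict]:
    """Flat layout: one field per (term type, language), e.g. label_en."""
    langs = list(languages)
    return {
        f"{prefix}_{lang}": {"type": "string"}
        for prefix in TERM_FIELDS
        for lang in langs
    }


def build_nested_fields(languages: Iterable[str]) -> Dict[str, dict]:
    """Nested layout: one nested field per term type, one sub-field per language."""
    langs = list(languages)
    return {
        nested_name: {
            "type": "nested",
            "properties": {lang: {"type": "string"} for lang in langs},
        }
        for nested_name in TERM_FIELDS.values()
    }


def page_mapping(properties: Dict[str, dict]) -> dict:
    """Wrap the properties in the 'page' mapping shape used in the examples."""
    return {
        "page": {
            "dynamic": "false",
            "_all": {"enabled": False},
            "properties": properties,
        }
    }


if __name__ == "__main__":
    langs = ["de", "en", "es"]
    # The flat layout grows as len(langs) * 3 top-level fields; the nested
    # layout keeps 3 top-level fields, each with len(langs) sub-fields.
    print(len(build_per_language_fields(langs)))
    print(len(build_nested_fields(langs)))
```

Keeping both builders behind the same `Dict[str, dict]` return shape would let the hook handler switch layouts without other changes, which is roughly the point of having a generic interface for these objects.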
