[Wikidata-bugs] [Maniphest] [Updated] T117548: [Task] Find the best way to get labels into Elastic

aude Mon, 16 Nov 2015 07:02:58 -0800

aude added a comment.

**Extra fields**


- labels
- descriptions
- aliases
- entity_type

(and potentially simple statements, such as for looking up items by identifier, 
which would actually be simpler to implement since identifiers are not 
multilingual content)

**Modify the mapping in Elastic Search to add extra 'fields'**

Suggest we use the CirrusSearchMappingConfig hook to add stuff to the mapping, 
to start with.  We can introduce 'field mapping builder' objects that build the 
mapping data structure for elastic, and as a first step, use these more 
directly with the hooks.  Later, we can perhaps expose an interface in the 
Content objects that exposes these fields for mapping, and use the 'field 
mapping builder' objects indirectly.
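
A minimal sketch of the 'field mapping builder' idea, in Python rather than the PHP the real hook handler would be written in. The class and function names are hypothetical, and the config layout (a "page" type with "properties") is assumed from the mapping examples further down in this comment. The "not_analyzed" setting for entity_type is also just an assumption about how that field would be indexed.

```python
from abc import ABC, abstractmethod


class FieldMappingBuilder(ABC):
    """Hypothetical interface: builds one fragment of the Elastic mapping."""

    @abstractmethod
    def build(self) -> dict:
        """Return a {field_name: field_definition} dict."""


class EntityTypeMappingBuilder(FieldMappingBuilder):
    def build(self) -> dict:
        # Assumption: entity_type is only matched exactly, so it is not analyzed.
        return {"entity_type": {"type": "string", "index": "not_analyzed"}}


def on_mapping_config(config: dict, builders: list) -> dict:
    """Stand-in for a CirrusSearchMappingConfig hook handler.

    Merges the fields from each builder into the page properties.
    """
    properties = config.setdefault("page", {}).setdefault("properties", {})
    for builder in builders:
        properties.update(builder.build())
    return config
```

The hook handler only iterates over builders, so the same builder objects could later be handed out by Content objects without changing them.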

**Populate the extra fields during indexing**

Suggest (as a start) that we use the CirrusSearchBuildDocumentParse hook to have 
extra fields indexed when indexing a page. At some point, we may want to add 
something to EntityContent (and Content generally) to expose these fields 
(https://phabricator.wikimedia.org/T78011) and implement a way for the 
SearchEngine implementations to consume these.

For now, I propose we introduce objects that build these data structures for 
the extra fields, with a generic interface. We can directly use these objects 
in the hook handlers, or indirectly use them via EntityContent (or just 
Content). At the same time that we want better integration with EntityContent, 
it would be nice to have clear separation of the Elastic Search Wikibase code 
so that it is reusable.
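
Roughly what I have in mind for the objects with a generic interface, sketched in Python for brevity (the real code would be PHP). The names FieldsBuilder, TermFieldsBuilder, EntityTypeFieldBuilder and build_document_fields are hypothetical, and the entity is represented as a simplified dict rather than a real Wikibase entity object.

```python
from typing import Protocol


class FieldsBuilder(Protocol):
    """Hypothetical generic interface for building extra document fields."""

    def build_fields(self, entity: dict) -> dict: ...


class TermFieldsBuilder:
    """Copies labels, descriptions and aliases, keyed by language code."""

    def build_fields(self, entity: dict) -> dict:
        return {
            "labels": dict(entity.get("labels", {})),
            "descriptions": dict(entity.get("descriptions", {})),
            "aliases": {
                lang: list(values)
                for lang, values in entity.get("aliases", {}).items()
            },
        }


class EntityTypeFieldBuilder:
    def build_fields(self, entity: dict) -> dict:
        return {"entity_type": entity["type"]}


def build_document_fields(entity: dict, builders: list) -> dict:
    """What a hook handler (or EntityContent) would call during indexing."""
    fields = {}
    for builder in builders:
        fields.update(builder.build_fields(entity))
    return fields
```

Because the builders know nothing about hooks or Content, they can be called directly from a hook handler now and from EntityContent later.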

**Multilingual indexing**

multiple fields by language

  "page": {
    "dynamic": "false",
    "_all": {
      "enabled": false
    },
    "properties": {
      "label_de": {
        "type": "string"
      },
      "label_en": {
        "type": "string"
      },
      "label_es": {
        "type": "string"
      }
    }
  }

Multiple fields have the disadvantage that there would potentially be a very 
large number of them (one for every language * three term types).
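
To make the growth concrete, here is a small Python sketch that generates the per-language properties. The field naming for descriptions and aliases ("description_de", "alias_de") is an assumption that just follows the "label_de" pattern above, and the function name is hypothetical.

```python
TERM_TYPES = ("label", "description", "alias")


def per_language_properties(language_codes):
    """Build one string field per (term type, language) pair."""
    return {
        f"{term_type}_{lang}": {"type": "string"}
        for term_type in TERM_TYPES
        for lang in language_codes
    }


properties = per_language_properties(["de", "en", "es"])
print(len(properties))  # 3 languages * 3 term types = 9 fields
```

The field count is always len(language_codes) * 3, so a wiki with a few hundred languages would end up with on the order of a thousand fields in the mapping.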

Nested type

  "page": {
    "dynamic": "false",
    "_all": {
      "enabled": false
    },
    "properties": {
      "labels": {
        "type": "nested",
        "properties": {
          "de": {
            "type": "string"
          },
          "en": {
            "type": "string"
          },
          "es": {
            "type": "string"
          }
        }
      }
    }
  }

Nested can be a problem when the nesting gets very large, which it would.  To 
start with, this is what I am experimenting with, but I am not convinced this 
is what we want.

Other possible options for multilingual content:

- Language-specific child documents - might be nice and if feasible, might be 
best.  For language fallback, search / lookup could request a handful of 
languages and not have to retrieve all child documents.
- Multiple per-language indices - a lot of stuff would be duplicated, unless 
the per-language indices were child documents.  I don't think we want this.
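
For the language fallback case mentioned under child documents, the lookup only needs the languages in the fallback chain. A minimal Python sketch, assuming the labels for the requested languages have already been retrieved into a dict; the function name and the example chain are hypothetical.

```python
def pick_label(labels_by_language, fallback_chain):
    """Return (language, label) for the first language in the chain that has a label."""
    for lang in fallback_chain:
        label = labels_by_language.get(lang)
        if label:
            return lang, label
    return None


# Only the languages in the chain were requested, not every language.
retrieved = {"de": "Wien", "en": "Vienna"}
print(pick_label(retrieved, ["de-at", "de", "en"]))  # ('de', 'Wien')
```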

**Searching**

We should introduce an EntitySearch (or TermSearch) interface that 
SearchEntities and other stuff can use.

We can also introduce a TermLookup implementation based on Elastic for things 
that use TermLookup.
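
A rough sketch of what such an interface could look like, in Python for brevity. The method signature is an assumption, not an existing Wikibase API, and the in-memory class is only a stand-in to show that callers depend on the interface rather than on Elastic.

```python
from abc import ABC, abstractmethod


class EntitySearch(ABC):
    """Hypothetical interface that SearchEntities and others could use."""

    @abstractmethod
    def search(self, text, language_code, entity_type, limit=10):
        """Return IDs of matching entities."""


class InMemoryEntitySearch(EntitySearch):
    """Stand-in backend doing a case-insensitive prefix match on labels.

    An Elastic-backed class would implement the same method.
    """

    def __init__(self, documents):
        # {entity_id: {"entity_type": str, "labels": {language_code: text}}}
        self._documents = documents

    def search(self, text, language_code, entity_type, limit=10):
        needle = text.casefold()
        hits = [
            entity_id
            for entity_id, doc in sorted(self._documents.items())
            if doc["entity_type"] == entity_type
            and doc["labels"].get(language_code, "").casefold().startswith(needle)
        ]
        return hits[:limit]
```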

There is some special syntax that can be used when searching with Cirrus, such 
as insource or incategory.

If we want special syntax for stuff like labels, then we might want a hook 
added to Cirrus for this. The existing code where the special syntax is handled 
is very complex; it would be good if it were factored out and split up to make 
it easier, nicer and less bug-prone to hook into.  If there can be a generic 
interface for this syntax, that would be even nicer.
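
One possible shape for a generic interface, sketched in Python: a registry that maps a keyword to a handler, so extensions can add keywords without touching the parser. This is not how Cirrus currently works; the "label:" keyword, the handler signature and the tuple returned by the handler are all hypothetical.

```python
import re

# keyword:value pairs, where the value is a single unquoted token
KEYWORD_RE = re.compile(r"(\w+):(\S+)")


def parse_query(query, handlers):
    """Split a query into remaining free text and filters from known keywords.

    Keywords without a registered handler are left in the text untouched.
    """
    filters = []

    def replace(match):
        keyword, value = match.group(1), match.group(2)
        handler = handlers.get(keyword)
        if handler is None:
            return match.group(0)
        filters.append(handler(value))
        return ""

    remaining = KEYWORD_RE.sub(replace, query)
    return " ".join(remaining.split()), filters


# A Wikibase hook handler could register its own keyword like this.
handlers = {"label": lambda value: ("labels", value)}
print(parse_query("label:Berlin capital incategory:Cities", handlers))
```

Here incategory is deliberately not registered, to show that unknown keywords pass through for other code to handle.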

**TODO**

- We still need to figure out better how to handle display text.


TASK DETAIL
  https://phabricator.wikimedia.org/T117548

EMAIL PREFERENCES
  https://phabricator.wikimedia.org/settings/panel/emailpreferences/

To: aude
Cc: aude, Aklapper, Wikidata-bugs, Mbch331



_______________________________________________
Wikidata-bugs mailing list
[email protected]
https://lists.wikimedia.org/mailman/listinfo/wikidata-bugs
