WMDE-leszek created this task.
WMDE-leszek added projects: Wikidata, Lexicographical data.
Herald added a subscriber: Aklapper.

TASK DESCRIPTION

Currently, users can store only a single lemma or form representation for a given language/language variant code. In other words, I could have a lexeme with lemmas:

  • color, en-us
  • colour, en-gb

but I couldn't have:

  • color, en
  • colour, en

In the latter case, I would either need to use a different language code for each lemma variant, or change the model so that color and colour are separate lexemes (with a single lemma variant each). The same applies to form representations.

When editing on the lexeme page, the user is not informed of this limitation. The UI pretends that two lemmas/representations are defined with the same language code (as seen in the screenshots below), but after reloading the page, which loads the data from the database, the user sees that only the latest entry for the language code is actually stored.

F15879879: multiple-lemmas.png
F15879884: mutliple-representations.png
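
The overwriting described above is what one would expect from terms keyed by language code. A minimal Python sketch, assuming a plain dict as a stand-in for the stored data (the helper name `set_lemma` is hypothetical and is not part of WikibaseLexeme):

```python
def set_lemma(lemmas, language, value):
    """Store a lemma in a language => term map (hypothetical helper)."""
    # Assigning to an existing key replaces the previous term,
    # so only the latest value per language code survives.
    lemmas[language] = {"language": language, "value": value}
    return lemmas


lemmas = {}
set_lemma(lemmas, "en", "color")
set_lemma(lemmas, "en", "colour")  # replaces "color" without any warning

variants = {}
set_lemma(variants, "en-us", "color")
set_lemma(variants, "en-gb", "colour")  # distinct codes, both are kept
```

With the same code used twice, `lemmas` ends up with a single "en" entry; with distinct variant codes, `variants` keeps both.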

The question of whether it should be possible to store multiple form representations and/or lemmas was briefly discussed in an email conversation in 2017 (with a particular focus on form representations). At that time @daniel concluded, referring to "lit" and "lighted" as two representations of a hypothetical form:

"lit" and "lighted" indeed have to be separate forms (with the same grammatical features set). The justification is that statements that apply to one may not apply to the other: rhymes-with, first known usage, region or time of use, example quotation, syllabification, pronunciation audio - even different spellings! All these would be different, so "lit" and "lighted" have to be separate forms.

Two representations leading to different sets of statements, and thus being representations of two different forms, seemed a convincing argument to me at first. Now I am having second thoughts:

  • should some possible statements be a decisive factor on how to model form representations?
  • are we certain there will always be a statement that makes the distinction clear?
  • should the modelling always force users to find the distinction that would either lead to creating a language variant code, or to modelling as separate forms?
  • with lemmas in mind, statements couldn't be used as a tool to distinguish different variants, i.e. different variant codes would be required. Should users be forced to always provide those different variant codes?

Question: Should users be allowed to provide only a single lemma, or a single form representation, for a given language variant code?

Note: the question of how to handle this in the UI might be considered at some later point. The way lexeme data is currently stored in the database also assumes there is a single lemma or form representation for a given language code (i.e. they are stored as language => data maps). If this were to change, the code reading from the database would potentially need to handle data stored in the old format too. If a change of the behaviour and/or the internal storage format is likely, it might be worth making it before WikibaseLexeme is deployed to the production environment, so there is no need to deal with the legacy format.
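
To illustrate the legacy-format concern, here is a rough Python sketch of a reader that accepts both shapes. The legacy shape mirrors the current language => data map; the list-valued shape is purely an assumption made for illustration, since no new format has been proposed, and `normalize_terms` is a hypothetical name:

```python
def normalize_terms(stored):
    """Return {language: [value, ...]} for either storage shape.

    Legacy shape (current):      {"en": {"language": "en", "value": "colour"}}
    Hypothetical new shape:      {"en": [{"language": "en", "value": "color"}, ...]}
    """
    normalized = {}
    for language, entry in stored.items():
        # A legacy entry is a single term; wrap it so both shapes are lists.
        entries = entry if isinstance(entry, list) else [entry]
        normalized[language] = [term["value"] for term in entries]
    return normalized


legacy = {"en": {"language": "en", "value": "colour"}}
hypothetical = {
    "en": [
        {"language": "en", "value": "color"},
        {"language": "en", "value": "colour"},
    ]
}
```

Every reader would need a branch like this for as long as legacy data exists, which is the cost that changing the format before production deployment would avoid.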

Incidentally, the getentities API uses the same structure for lemmas and form representations as the internal storage. The current format can be observed there (example taken from https://wikidata-lexeme.wmflabs.org/api.php?action="">):

"L560": {
    ...
    "lemmas": {
        "en": {
            "language": "en",
            "value": "colour"
        }
    },
    ...
    "forms": [
        {
            "id": "L560-F1",
            "representations": {
                "en-gb": {
                    "language": "en-gb",
                    "value": "colour"
                },
                "en-us": {
                    "language": "en-us",
                    "value": "color"
                }
            },
            ...
        }
        ...
    ]
    ...
}
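
As a rough illustration of how a client might read this structure, the following Python sketch parses a trimmed copy of the example (only the fields shown above, with the elided parts left out) and collects the representations per form. The function name `representations_by_form` is hypothetical:

```python
import json

# Trimmed copy of the example entity; the elided fields are simply omitted.
SAMPLE = json.loads("""
{
    "lemmas": {
        "en": {"language": "en", "value": "colour"}
    },
    "forms": [
        {
            "id": "L560-F1",
            "representations": {
                "en-gb": {"language": "en-gb", "value": "colour"},
                "en-us": {"language": "en-us", "value": "color"}
            }
        }
    ]
}
""")


def representations_by_form(entity):
    """Map each form ID to its {language code: representation} pairs."""
    return {
        form["id"]: {
            code: term["value"]
            for code, term in form["representations"].items()
        }
        for form in entity["forms"]
    }


result = representations_by_form(SAMPLE)
```

Because the representations are an object keyed by language code, each code can appear at most once per form in this format.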

TASK DETAIL
https://phabricator.wikimedia.org/T190033

To: WMDE-leszek
Cc: Lydia_Pintscher, WMDE-leszek, Aklapper, daniel, Lahi, Gq86, Cinemantique, GoranSMilovanovic, QZanden, LawExplorer, Wikidata-bugs, aude, Darkdadaah, Mbch331