mrephabricator added a comment.
This may be verging on pedantry, but I will say that the principle of "one
form per combination of grammatical features" does not sound broadly applicable
enough to follow for each language. Maybe I am missing something and this is
just a convention for certain languages.
In any case, here are some examples which illustrate where this would not be
a helpful model. In Punjabi, an alternate form with identical grammatical
features could represent any combination of the following:
- An alternative pronunciation of the same form, represented by mutual
"alternative form" property links without mutual "homophone form" links
- An alternative spelling of the same form in any or all of the spelling
variants/orthographies represented, represented by mutual "alternative form"
links and mutual "homophone form" links.
- If the spelling varies only for one representation--which actually is
not as common as I initially expected--the other representation(s) are
duplicated exactly. This may seem somewhat tedious, but for the time being it
is an effective way to store the useful information that where spelling varies
in one writing system, only one spelling is accepted in the other.
- Dialectal or regional variants of the same form, most often simply
indicated with "variety of form" set to "unknown value," as usually no
empirical evidence exists to assign the form to a specific named dialect or say
anything more specific than "this form will vary depending on who you talk to."
- Shortened or contracted variants of the same form, indicated with mutual
"alternative form" property links and "short form" as a grammatical feature on
the shorter form.
- Versions of forms which are only for use in spoken language / dialogue as
opposed to versions of forms which are only used in writing. For example, for
some forms on a Punjabi verb, the form will get inflected twice for grammatical
number and/or person, once on an infixed part of the form, and once on the
suffixed ending of the form, but in spoken/colloquial language it is acceptable
to use a form which is only inflected once.
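To make the first two cases above concrete, here is a minimal sketch of how the two link types could be read together. The `Form` class, the form IDs, and the helper names are hypothetical and are not the Wikibase data model; only the rule itself (mutual "alternative form" links, with or without mutual "homophone form" links) comes from the list above.

```python
from dataclasses import dataclass, field


@dataclass
class Form:
    """Hypothetical stand-in for a lexeme form; not the Wikibase data model."""
    form_id: str
    alternative_forms: set = field(default_factory=set)  # "alternative form" links
    homophone_forms: set = field(default_factory=set)    # "homophone form" links


def mutual(a, b, attr):
    """True if each form links to the other through the given property."""
    return b.form_id in getattr(a, attr) and a.form_id in getattr(b, attr)


def variant_kind(a, b):
    """Classify two forms using only the link pattern described above."""
    if not mutual(a, b, "alternative_forms"):
        return "not linked as alternatives"
    if mutual(a, b, "homophone_forms"):
        # Same pronunciation, so the difference must be in the spelling.
        return "alternative spelling"
    return "alternative pronunciation"


# Made-up form IDs, for illustration only.
f1 = Form("F1", alternative_forms={"F2"})
f2 = Form("F2", alternative_forms={"F1"})
f3 = Form("F3", alternative_forms={"F4"}, homophone_forms={"F4"})
f4 = Form("F4", alternative_forms={"F3"}, homophone_forms={"F3"})

print(variant_kind(f1, f2))  # alternative pronunciation
print(variant_kind(f3, f4))  # alternative spelling
```

The sketch covers only the pronunciation and spelling cases; dialectal, shortened, and spoken-only variants would need the additional properties and features mentioned in the list.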
Notably, all of the above will only apply to particular inflections of a given
lexeme. Take this verb for example:
https://www.wikidata.org/wiki/Lexeme:L688582 . Of the 99 forms documented so
far, 30 are "alternate forms" that share their grammatical features with
another form. If we were to create 30 separate lexemes to represent
this 1 word, how would we represent the rest of the context that is important
for understanding what these inflections represent, or indicate for example
that ਹਸਾਏਂਗੀ and ਹਸਾਵੇਂਗੀ are interchangeable spelling + pronunciation options
for second person + feminine + singular + additive + causative + subjunctive +
definite, but that only ਹਸਾਵਾਂਗੀ is acceptable as a spelling + pronunciation
option for first person + feminine + singular + additive + causative +
subjunctive + definite? On other lexemes, the same grammatical feature
combination may permit variation. (This is ultimately governed by the final
phoneme of the verb's root, a restriction which only ever applies to the
gender-inflected, written/formal first person subjunctive definite forms.)
Splitting the word into separate lexemes would be an unsustainable model. I am
relatively conservative about what constitutes a
separate lexeme; I tend to base it primarily on a combination of part of speech
+ mode of derivation rather than pronunciation or spelling variation,
especially since the latter factors generally don't have any bearing on how and
where a lexeme can be used according to the internal logic of the language.
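As a rough illustration of why keeping these forms on one lexeme preserves that context, the sketch below groups forms by their full feature set. The three representations and the feature labels are taken from the example above; the tuple layout and function names are assumptions made for illustration, not how Wikidata stores forms.

```python
from collections import defaultdict

# Features shared by all three example forms, as listed above.
SHARED = {"feminine", "singular", "additive", "causative", "subjunctive", "definite"}

# (representation, feature set) pairs kept together on one lexeme.
forms = [
    ("ਹਸਾਏਂਗੀ", frozenset(SHARED | {"second person"})),
    ("ਹਸਾਵੇਂਗੀ", frozenset(SHARED | {"second person"})),
    ("ਹਸਾਵਾਂਗੀ", frozenset(SHARED | {"first person"})),
]


def group_by_features(forms):
    """Map each feature combination to every representation that carries it."""
    groups = defaultdict(list)
    for representation, features in forms:
        groups[features].append(representation)
    return dict(groups)


def combinations_with_variation(forms):
    """Keep only the feature combinations that have more than one form."""
    return {
        features: representations
        for features, representations in group_by_features(forms).items()
        if len(representations) > 1
    }


variation = combinations_with_variation(forms)
for features, representations in variation.items():
    print(sorted(features), representations)
```

With all forms on one lexeme, the query shows at a glance that the second person combination permits two options while the first person combination permits one; spread across separate lexemes, that comparison would no longer be a simple grouping.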
I am inclined to agree that the specific purpose of the numbered Q-item
language code patch is hard to discern. I think what may be the case here is
that each of the concerns brought up in this thread has a different solution.
Theoretically, there is no upper limit on the number of variations a form can
have, and it could become confusing if languages started to have long vertical
strips of representations, some of which are governed by a consistent
heuristic, and some of which are arbitrary. What may be productive is the
addition of various properties for use on lexeme forms which offer more nuanced
ways to model the different languages discussed here.
TASK DETAIL
https://phabricator.wikimedia.org/T236593
To: mrephabricator
Cc: mrephabricator, LucasWerkmeister, C933103, AGutman-WMF, mxn, So9q, Ijon,
daniel, Asaf, Mahir256, Danmichaelo, Fnielsen, Lucas_Werkmeister_WMDE, Denny,
Lydia_Pintscher, jeblad, jhsoby, Astuthiodit_1, karapayneWMDE, Invadibot,
maantietaja, ItamarWMDE, Akuckartz, Nandana, Lahi, Gq86, GoranSMilovanovic,
QZanden, LawExplorer, _jensen, rosalieper, Scott_WUaS, Wikidata-bugs, aude,
Mbch331