[Wikidata-bugs] [Maniphest] T236593: Cannot enter multiple forms for the same language variant

LucasWerkmeister Wed, 20 Jul 2022 13:14:23 -0700

LucasWerkmeister added a comment.


  It’s still not clear to me which problem the `-x-Q123-1` patch is trying to 
solve. Several languages have been mentioned in this task, but which of them 
would benefit from this system? I feel like for several of them, we’ve already 
reached the conclusion that separate forms are in fact the way to go.
  
  I’d like to extract a general rule from @Fnielsen’s comment above 
(T236593#5610903 <https://phabricator.wikimedia.org/T236593#5610903>): if you 
need separate statements, then you need separate forms or lexemes. (I think 
this is a sufficient condition, though it might not be a necessary one.) 
Pronunciation (whether pronunciation audio 
<https://www.wikidata.org/wiki/Property:P443> or IPA transcription 
<https://www.wikidata.org/wiki/Property:P898>) is probably the most significant 
kind of statement here: if a speaker would pronounce the spellings differently, 
then they should be different forms – regardless of whether the difference is a 
completely different ending as in octopuses/octopi, or just an extra schwa as 
in aft(e)nen. (I don’t find the hyphenation example as convincing… don’t you 
need a different hyphenation for every spelling variant, even for cases that 
really should just be multiple representations of one form? E.g. co‧lor/co‧lour 
– that could just be multiple statements on the same form, with different 
monolingual text language codes.)
  
  I suspect this rule covers the Norwegian example that originally motivated 
this task: I feel like “parametere” and “parametre” are probably pronounced 
differently, much like “aftnen” and “aftenen” are pronounced differently in 
Danish according to Finn. For Vietnamese chữ Nôm, I feel like @mxn’s comment at 
T236593#8024999 <https://phabricator.wikimedia.org/T236593#8024999> goes in a 
similar direction, though I admit I find the whole Chinese-characters part of 
this discussion hard to follow.
  
  For the cases where you really only want to have one form with multiple 
representations, I still agree with @daniel’s comment (T236593#5610378 
<https://phabricator.wikimedia.org/T236593#5610378>): “you make up a code for 
each of the spellings”. In practice, the only way to “make up a code” that we 
currently support is to append `-x-Q12345` to an existing, established 
language code. As far as I understand, this solution works well for Hebrew: 
e.g. ספר/סֵפֶר (L67105) <https://www.wikidata.org/wiki/Lexeme:L67105> (the 
“book” word) uses the language codes `he` and `he-x-Q21283070`, where Q21283070 
<https://www.wikidata.org/wiki/Q21283070> represents Tiberian vocalization, the 
orthography with diacritics. At some point, an editorial decision was made that 
the spelling without diacritics “deserves” the unsuffixed `he` language code 
(instead of both spellings using an `-x-Q12345` language code), which I think 
is reasonable: data reusers who don’t care about the different spellings can 
use the most standard language code (`he`) and its single representation per 
form.
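  
  To illustrate what that means for a data reuser, here is a minimal Python 
sketch. It assumes a form's representations are available as a plain mapping 
from language code to spelling; `pick_representation` is a made-up helper name, 
not part of any Wikibase API, and the only data in it is the L67105 example 
above.

```python
def pick_representation(representations, base, preferred_item=None):
    """Return the spelling for `base-x-<preferred_item>` if the form has one,
    otherwise fall back to the plain `base` code.

    `representations` is assumed to map language codes to spellings;
    this is an illustration, not how Wikibase exposes the data.
    """
    if preferred_item is not None:
        variant = representations.get(f"{base}-x-{preferred_item}")
        if variant is not None:
            return variant
    return representations.get(base)


# Representations of the "book" form, keyed by language code.
book = {
    "he": "ספר",
    "he-x-Q21283070": "סֵפֶר",  # Q21283070 = Tiberian vocalization
}

# A reuser who does not care about spelling variants just asks for `he`.
plain = pick_representation(book, "he")

# A reuser who wants diacritics names the orthography item explicitly.
vocalized = pick_representation(book, "he", "Q21283070")
```

  A reuser who asks for an orthography the form does not have still gets the 
unsuffixed `he` spelling back, which is what makes giving one spelling the 
standard code convenient.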
  
  Allowing people to append an integer number to the item ID adds a second way 
to make up a code, and one that seems less useful to me: without knowing what 
the number means, how do I know which form representation to use? To me this 
runs counter to the goal of “allow[ing] the consumer to choose which variant 
they prefer”. For the languages that appear to need multiple representations 
for the same language code per form (e.g. the Indian languages @Mahir256 
mentioned in T236593#5608530 
<https://phabricator.wikimedia.org/T236593#5608530>?), is it not possible to 
make the item ID approach work, by creating more special-purpose items? 
Wikidata editors would then decide which of the possible spellings 
“deserves” the standard language code, and which additional items need to be 
created (“spelling with character X”, “spelling with sequence Y”?). I 
understand that not all languages have standardized spellings where you can use 
a single item ID to refer to the spelling variants of a wide range of lexemes 
(like in Hebrew), but I think it should still be possible to describe different 
spellings using items that carry more meaning than just a number.
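  
  A rough Python sketch of the difference, from a consumer's point of view. The 
pattern below is only my assumption of what a code with the proposed numeric 
suffix would look like (`<base>-x-Q<id>-<n>`); it is not the validation that 
Wikibase performs, `split_code` is a hypothetical helper, and `nb-x-Q123-1` is a 
placeholder code, not a real one.

```python
import re

# Assumed shape: <base>-x-Q<digits>, optionally followed by -<digits>.
CODE = re.compile(r"^(?P<base>.+?)-x-(?P<item>Q[1-9]\d*)(?:-(?P<index>\d+))?$")


def split_code(code):
    """Split a language code into base code, item ID and numeric index."""
    match = CODE.match(code)
    if match is None:
        # No private-use part: the whole string is the base code.
        return {"base": code, "item": None, "index": None}
    index = match.group("index")
    return {
        "base": match.group("base"),
        "item": match.group("item"),
        "index": int(index) if index is not None else None,
    }


standard = split_code("he")
with_item = split_code("he-x-Q21283070")
with_index = split_code("nb-x-Q123-1")  # placeholder code for illustration
```

  The `item` part can be looked up on Wikidata to find out which orthography it 
stands for. The `index` part is just an integer: nothing in the code itself 
tells the consumer what distinguishes spelling 1 from spelling 2.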

TASK DETAIL
  https://phabricator.wikimedia.org/T236593


