[Wikidata-bugs] [Maniphest] T236593: Cannot enter multiple forms for the same language variant

2022-08-04 Thread mxn
mxn added a comment.


  The approach I was forced to take with Vietnamese (separate lexemes per word 
per writing system, “translations” from one writing system to another) has some 
downsides. For one thing, the criteria for a translation between vi and vi-Hani 
must be stricter than the criteria for a translation between vi and en; 
otherwise there would be no way to distinguish these transcriptions from 
translations more generally. In principle, it would follow that every 
simplified Chinese character should also have a separate lexeme from the 
corresponding traditional character(s), as on Wiktionary, and we could even 
take this to the extreme that “colour” is the en-GB “translation” of “color” in 
en-US.
  
  On a practical level, this separate lexeme approach means any Wiktionary 
template similar to https://en.wiktionary.org/wiki/Template:vi-readings would 
need to look up translations, while a template generating a table of 
translations of an English sense would need to know to ignore vi-Hani 
statements or merge them with vi statements. In a Vietnamese dictionary, it’s 
also normal to list the other words represented by the same characters. 
Currently, such a template on Wiktionary requires a series of expensive calls 
to look up second-order lexemes. (A rejected property proposal would streamline 
that somewhat.)
  
  It would be nice to be able to more strongly link representations in the two 
Vietnamese writing systems, but allowing multiple representations to have the 
same language code would only be a partial solution anyways. A full solution 
would be able to limit some statements to certain representations of a form. 
Otherwise, how would one indicate that one representation is now rare, having 
been supplanted by the other, independently of any broader linguistic shift, or 
that two sources disagree about whether that change has even occurred?
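  
  As a rough illustration of the template-side burden described above, here is a 
minimal Python sketch that folds vi-Hani entries into the matching vi entries 
when building a translation table. The record layout, the `transcription_of` 
field, the lexeme IDs and the sample rows are all assumptions made up for this 
example; they are not the Wikibase data model or any Wiktionary module API.

```python
# Hypothetical, flattened "translation" records for one English sense.
# Field names and lexeme IDs are placeholders, not the Wikibase JSON layout.
TRANSLATIONS = [
    {"lexeme": "L-vi", "lang": "vi", "lemma": "chữ"},
    {"lexeme": "L-vi-Hani", "lang": "vi-Hani", "lemma": "𡨸",
     "transcription_of": "L-vi"},
    {"lexeme": "L-fr", "lang": "fr", "lemma": "caractère"},
]


def translation_table(records):
    """Return one row per real translation, folding transcriptions into it."""
    rows = {}
    # First pass: every record that is not a transcription becomes a row.
    for record in records:
        if "transcription_of" not in record:
            rows[record["lexeme"]] = {
                "lang": record["lang"],
                "lemma": record["lemma"],
                "other_scripts": [],
            }
    # Second pass: attach each transcription to the row it transcribes.
    for record in records:
        target = record.get("transcription_of")
        if target in rows:
            rows[target]["other_scripts"].append(record["lemma"])
    return list(rows.values())


table = translation_table(TRANSLATIONS)
```

  The point of the sketch is only that the consumer has to know which 
"translations" are really transcriptions; with the separate-lexeme approach, 
that knowledge has to live in every such template.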

TASK DETAIL
  https://phabricator.wikimedia.org/T236593

EMAIL PREFERENCES
  https://phabricator.wikimedia.org/settings/panel/emailpreferences/

To: mxn
Cc: mrephabricator, LucasWerkmeister, C933103, AGutman-WMF, mxn, So9q, Ijon, 
daniel, Asaf, Mahir256, Danmichaelo, Fnielsen, Lucas_Werkmeister_WMDE, Denny, 
Lydia_Pintscher, jeblad, jhsoby, Astuthiodit_1, karapayneWMDE, Invadibot, 
maantietaja, ItamarWMDE, Akuckartz, Nandana, Lahi, Gq86, GoranSMilovanovic, 
QZanden, LawExplorer, _jensen, rosalieper, Scott_WUaS, Wikidata-bugs, aude, 
Mbch331
___
Wikidata-bugs mailing list -- wikidata-bugs@lists.wikimedia.org
To unsubscribe send an email to wikidata-bugs-le...@lists.wikimedia.org


[Wikidata-bugs] [Maniphest] T236593: Cannot enter multiple forms for the same language variant

2022-08-04 Thread mrephabricator
mrephabricator added a comment.


  This may be verging on pedantry, but I will say that the principle of "one 
form per combination of grammatical features" does not sound broadly applicable 
enough to follow for each language. Maybe I am missing something and this is 
just a convention for certain languages.
  
  In any case, here are some examples which illustrate where this would not be 
a helpful model. In Punjabi, an alternate form with identical grammatical 
features could represent any combination of the following:
  
  - An alternative pronunciation of the same form, represented by mutual 
"alternative form" property links without mutual "homophone form" links
  - An alternative spelling of the same form in any or all of the spelling 
variants/orthographies represented, represented by mutual "alternative form" 
links and mutual "homophone form" links.
    - If the spelling varies only for one representation--which actually is 
not as common as I initially expected--the other representation(s) are 
duplicated exactly. This may seem somewhat tedious, but for the time being it 
is an effective way to store the useful information that where spelling varies 
in one writing system, only one spelling is accepted in the other.
  - Dialectal or regional variants of the same form, most often simply 
indicated with "variety of form" set to "unknown value," as usually no 
empirical evidence exists to assign the form to a specific named dialect or say 
anything more specific than "this form will vary depending on who you talk to."
  - Shortened or contracted variants of the same form, indicated with mutual 
"alternative form" property links and "short form" as a grammatical feature on 
the shorter form.
  - Versions of forms which are only for use in spoken language / dialogue as 
opposed to versions of forms which are only used in writing. For example, for 
some forms on a Punjabi verb, the form will get inflected twice for grammatical 
number and/or person, once on an infixed part of the form, and once on the 
suffixed ending of the form, but in spoken/colloquial language it is acceptable 
to use a form which is only inflected once.
  
  Notably all of the above will only apply to particular inflections of a given 
lexeme. If we take this verb for example, 
https://www.wikidata.org/wiki/Lexeme:L688582 , there are 30 forms with 
"alternate forms" that share grammatical features with another so far out of 
the 99 forms documented. If we were to create 30 separate lexemes to represent 
this 1 word, how would we represent the rest of the context that is important 
for understanding what these inflections represent, or indicate for example 
that ਹਸਾਏਂਗੀ and ਹਸਾਵੇਂਗੀ are interchangeable spelling + pronunciation options 
for second person + feminine + singular + additive + causative + subjunctive + 
definite, but that only ਹਸਾਵਾਂਗੀ is acceptable as a spelling + pronunciation 
option for first person + feminine + singular + additive + causative + 
subjunctive + definite? On other lexemes, the same grammatical feature 
combination may permit variation. (This is ultimately governed by the final 
phoneme of the root in a verb which only ever applies to the gender-inflected, 
written/formal first person subjunctive definite forms.) That would be an 
unsustainable model. I am relatively conservative about what constitutes a 
separate lexeme; I tend to base it primarily on a combination of part of speech 
+ mode of derivation rather than pronunciation or spelling variation, 
especially since the latter factors generally don't have any bearing on how and 
where a lexeme can be used according to the internal logic of the language.
  
  I am inclined to agree that the specific purpose of the numbered Q-item 
language code patch is hard to discern. I think what may be the case here is 
that each of the concerns brought up in this thread has a different solution. 
Theoretically, there is no upper limit on the number of variations a form can 
have, and it could become confusing if languages started to have long vertical 
strips of representations, some of which are governed by a consistent 
heuristic, and some of which are arbitrary. What may be productive is the 
addition of various properties for use on lexeme forms which offer more nuanced 
ways to model the different languages discussed here.
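  
  To make the idea of forms that share grammatical features concrete, here is a 
minimal Python sketch that groups a lexeme's forms by feature set and reports 
the groups with more than one member. The form records, the IDs and the plain 
string feature labels are hypothetical stand-ins based on the example above 
(real forms reference grammatical features by item ID), and the sketch ignores 
the "alternative form" links entirely.

```python
from collections import defaultdict

# Features common to the three example forms discussed above.
COMMON = {"feminine", "singular", "additive", "causative",
          "subjunctive", "definite"}

# Hypothetical, simplified form records; the IDs are placeholders.
FORMS = [
    {"id": "F-a", "representation": "ਹਸਾਏਂਗੀ",
     "features": COMMON | {"second person"}},
    {"id": "F-b", "representation": "ਹਸਾਵੇਂਗੀ",
     "features": COMMON | {"second person"}},
    {"id": "F-c", "representation": "ਹਸਾਵਾਂਗੀ",
     "features": COMMON | {"first person"}},
]


def forms_sharing_features(forms):
    """Return lists of representations whose feature sets are identical."""
    groups = defaultdict(list)
    for form in forms:
        # frozenset makes the feature set hashable and order-independent.
        groups[frozenset(form["features"])].append(form["representation"])
    return [reps for reps in groups.values() if len(reps) > 1]


shared = forms_sharing_features(FORMS)
```

  Here only the two second-person forms end up grouped together, which is the 
kind of per-inflection variation that splitting into separate lexemes would 
obscure.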


[Wikidata-bugs] [Maniphest] T236593: Cannot enter multiple forms for the same language variant

2022-07-25 Thread LucasWerkmeister
LucasWerkmeister added a comment.


  In T236593#8093121, @C933103 wrote:
  
  > As an English example, some religious people might refuse to write the name 
"God" out in full, as this would constitute idolatry. For this we can tag it 
as en-x-Q, where Q refers to the religious group of people, but there is more 
than one alternative way to write "God". They can write "G-d", "G*d", "G_d", 
"G-o-d", and so on. Whether a hyphen or an underscore is used makes no 
contextual difference, and the exact symbol used in place of the original 
letter doesn't affect pronunciation or religious connection. Hence all of 
these alternatives should be tagged en-x-Q, and with the patch it would be 
possible to have "en-x-Q-1" be "G-d" and "en-x-Q-2" be "G*d". I can't see how 
more specific labels could be useful in differentiating "G-d" and "G*d".
  
  I don’t follow this example. If you think all of these potential forms are 
significant, and all of them should be tracked in Wikidata, then why do you 
want to combine them all under a single item ID where nobody can tell them 
apart? To me it makes more sense (assuming this data is notable at all) to have 
separate items like “bowdlerized using hyphens”, “bowdlerized using asterisks”, 
etc., which can be subclasses of a more general “avoiding idolatry” item, have 
other statements indicating which character is being used, and so on. 
(“Bowdlerized” definitely isn’t the right word here, but I don’t know what the 
right word is, sorry.)
  
  In T236593#8097326, @AGutman-WMF wrote:
  
  > @LucasWerkmeister I agree with you that if two variants have two different 
pronunciations, they should probably be split into two different lexemes (in 
general, I think we should avoid having multiple forms with the same 
grammatical features within one lexeme). There is some leeway, however, in this 
rule, since different dialects may have slightly different pronunciations which 
we still want to group into a single lexeme/form. For instance American English 
"color" and British English "colour" are in fact pronounced slightly 
differently, but it would be overkill to split them, since the difference in 
pronunciation is systematic between the dialects.
  
  That’s fair, and I actually almost wrote “if //the same// speaker would 
pronounce them…” in my comment :) I’m not sure how exactly to phrase the rule, 
but mainly I’m glad to have found some rule at all (which I’m not sure I really 
understood, at least consciously, back in 2019 when I was apparently sitting 
next to @jhsoby).



[Wikidata-bugs] [Maniphest] T236593: Cannot enter multiple forms for the same language variant

2022-07-25 Thread AGutman-WMF
AGutman-WMF added a comment.


  @Asaf Insofar as two forms are considered distinct lexemes, it is probably the 
case that not all statements hold for both forms (e.g. the pronunciation may be 
different, and possibly other details such as etymology). If the two forms are 
close enough (e.g. just minor dialectal pronunciation details), then we may 
indeed lump them together in one lexeme as if they were spelling variants (and 
then my suggested patch may become relevant). Even if we decide to split them, 
we may of course link the two lexemes to each other, using various properties 
such as "synonym of" or "derived from" etc. Anyhow, my suggested patch would 
make it easier to lump such variants together, as it allows re-using the 
same basic language code for several spelling variants.



[Wikidata-bugs] [Maniphest] T236593: Cannot enter multiple forms for the same language variant

2022-07-22 Thread Asaf
Asaf added a comment.


  I apologize if I missed something, but if we do end up separating into 
different *lexemes*, how do we retain the value of all the descriptive work 
done on one lexeme (presumably the more common or standard form) that 
equally-well describes the form in the other lexeme? Do we rely on some sameAs 
property and then on applications and re-users to consider that property and 
auto-merge/import statements from the other lexeme?
  
  To give a concrete example, if the rich lexeme currently at 
https://www.wikidata.org/wiki/Lexeme:L189 were split, how would we make sure 
the sample sentences, etymology, etc., would be discoverable from the other 
lexeme?
  
  To my mind, that's the main disadvantage of any solution that would involve 
separating into multiple lexemes.



[Wikidata-bugs] [Maniphest] T236593: Cannot enter multiple forms for the same language variant

2022-07-22 Thread AGutman-WMF
AGutman-WMF added a comment.


  @LucasWerkmeister I agree with you that if two variants have two different 
pronunciations, they should probably be split into two different lexemes (in 
general, I think we should avoid having multiple forms with the same 
grammatical features within one lexeme). There is some leeway, however, in this 
rule, since different dialects may have slightly different pronunciations which 
we still want to group into a single lexeme/form. For instance American English 
"color" and British English "colour" are in fact pronounced slightly 
differently, but it would be overkill to split them, since the difference in 
pronunciation is systematic between the dialects.
  
  Moreover, I agree that in general we should qualify variant spellings by a 
meaningful identifier (and indeed, my proposal requires this, as the integers 
can only qualify already Q-qualified language codes), but as @C933103 mentioned 
above, there are situations where there is no meaningful way to qualify two 
spellings (or at least, the editor hasn't thought of such a qualification 
yet). In these cases, the integer-qualified codes still allow listing the 
variants as spelling variants, instead of adding spurious forms or lexemes. 
If at some future point a more meaningful qualification is found (e.g. maybe 
use Q209316 for spellings with added e-s in Norwegian?), the codes can easily 
be altered, while restructuring spurious forms as spelling variants is more 
difficult.



[Wikidata-bugs] [Maniphest] T236593: Cannot enter multiple forms for the same language variant

2022-07-20 Thread C933103
C933103 added a comment.


  In T236593#8092471, @LucasWerkmeister wrote:
  
  > It’s still not clear to me which problem the `-x-Q123-1` patch is trying to 
solve. Several languages have been mentioned in this task, but which of them 
would benefit from this system? I feel like for several of them, we’ve already 
reached the conclusion that separate forms are in fact the way to go.
  >
  > I’d like to extract a general rule from @Fnielsen’s comment above 
(T236593#5610903): if you 
need separate statements, then you need separate forms or lexemes. (I think 
this is a sufficient condition, though it might not be a necessary one.) 
Pronunciation (whether pronunciation audio or IPA transcription) is probably 
the most significant 
kind of statement here: if a speaker would pronounce the spellings differently, 
then they should be different forms – regardless of whether the difference is a 
completely different ending as in octopuses/octopi, or just an extra schwa as 
in aft(e)nen. (I don’t find the hyphenation example as convincing… don’t you 
need a different hyphenation for every spelling variant, even for cases that 
really should just be multiple representations of one form? E.g. co‧lor/co‧lour 
– that could just be multiple statements on the same form, with different 
monolingual text language codes.)
  >
  > I suspect this rule covers the Norwegian example that originally motivated 
this task: I feel like “parametere” and “parametre” are probably pronounced 
differently, much like “aftnen” and “aftenen” are pronounced differently in 
Danish according to Finn. For Vietnamese chữ Nôm, I feel like @mxn’s comment at 
T236593#8024999 goes in a 
similar direction, though I admit I find the whole Chinese-characters part of 
this discussion hard to follow.
  >
  > For the cases where you really only want to have one form with multiple 
representations, I still agree with @daniel’s comment (T236593#5610378): 
“you make up a code for 
each of the spellings”. In practice, the only way to “make up a code” that we 
currently support is to append -x-Q//12345// to an existing, established 
language code. As far as I understand, this solution works well for Hebrew: 
e.g. ספר/סֵפֶר (L67105) (the 
“book” word) uses the language codes `he` and `he-x-Q21283070`, where Q21283070 
represents Tiberian vocalization, the 
orthography with diacritics. At some point, an editorial decision was made that 
the spelling without diacritics “deserves” the unsuffixed `he` language code 
(instead of both spellings using an -x-Q//12345// language code), which I think 
is reasonable: data reusers who don’t care about the different spellings can 
use the most standard language code (`he`) and its single representation per 
form.
  >
  > Allowing people to append an integer number to the item ID adds a second 
way to make up a code, and one that seems less useful to me: without knowing 
what the number means, how do I know which form representation to use? To me 
this runs counter to the goal of “allow[ing] the consumer to choose which 
variant they prefer”. For the languages that appear to need multiple 
representations for the same language code per form (e.g. the Indian languages 
@Mahir256 mentioned in T236593#5608530?), is it not possible to 
make the item ID approach work, by creating more special-purpose items? 
Wikidata editors would then make a decision which of the possible spellings 
“deserves” the standard language code, and which additional items need to be 
created (“spelling with character X”, “spelling with sequence Y”?). I 
understand that not all languages have standardized spellings where you can use 
a single item ID to refer to the spelling variants of a wide range of lexemes 
(like in Hebrew), but I think it should still be possible to describe different 
spellings using items that carry more meaning than just a number.
  
  As an English example, some religious people might refuse to write the name 
"God" out in full, as this would constitute idolatry. For this we can tag it 
as en-x-Q, where Q refers to the religious group of people, but there is more 
than one alternative way to write "God". They can write "G-d", "G*d", "G_d", 
"G-o-d", and so on. Whether a hyphen or an underscore is used makes no 
contextual difference, and the exact symbol used in place of the original 
letter doesn't affect pronunciation or religious connection. Hence all of 
these alternatives should be tagged en-x-Q, and with the patch it would be 
possible to have "en-x-Q-1" be "G-d" and "en-x-Q-2" be "G*d". I can't see how 
more specific labels could be useful in differentiating "G-d" and "G*d".

[Wikidata-bugs] [Maniphest] T236593: Cannot enter multiple forms for the same language variant

2022-07-20 Thread LucasWerkmeister
LucasWerkmeister added a comment.


  It’s still not clear to me which problem the `-x-Q123-1` patch is trying to 
solve. Several languages have been mentioned in this task, but which of them 
would benefit from this system? I feel like for several of them, we’ve already 
reached the conclusion that separate forms are in fact the way to go.
  
  I’d like to extract a general rule from @Fnielsen’s comment above 
(T236593#5610903): if you 
need separate statements, then you need separate forms or lexemes. (I think 
this is a sufficient condition, though it might not be a necessary one.) 
Pronunciation (whether pronunciation audio or IPA transcription) is probably 
the most significant 
kind of statement here: if a speaker would pronounce the spellings differently, 
then they should be different forms – regardless of whether the difference is a 
completely different ending as in octopuses/octopi, or just an extra schwa as 
in aft(e)nen. (I don’t find the hyphenation example as convincing… don’t you 
need a different hyphenation for every spelling variant, even for cases that 
really should just be multiple representations of one form? E.g. co‧lor/co‧lour 
– that could just be multiple statements on the same form, with different 
monolingual text language codes.)
  
  I suspect this rule covers the Norwegian example that originally motivated 
this task: I feel like “parametere” and “parametre” are probably pronounced 
differently, much like “aftnen” and “aftenen” are pronounced differently in 
Danish according to Finn. For Vietnamese chữ Nôm, I feel like @mxn’s comment at 
T236593#8024999 goes in a 
similar direction, though I admit I find the whole Chinese-characters part of 
this discussion hard to follow.
  
  For the cases where you really only want to have one form with multiple 
representations, I still agree with @daniel’s comment (T236593#5610378): 
“you make up a code for 
each of the spellings”. In practice, the only way to “make up a code” that we 
currently support is to append -x-Q//12345// to an existing, established 
language code. As far as I understand, this solution works well for Hebrew: 
e.g. ספר/סֵפֶר (L67105) (the 
“book” word) uses the language codes `he` and `he-x-Q21283070`, where Q21283070 
represents Tiberian vocalization, the 
orthography with diacritics. At some point, an editorial decision was made that 
the spelling without diacritics “deserves” the unsuffixed `he` language code 
(instead of both spellings using an -x-Q//12345// language code), which I think 
is reasonable: data reusers who don’t care about the different spellings can 
use the most standard language code (`he`) and its single representation per 
form.
  
  Allowing people to append an integer number to the item ID adds a second way 
to make up a code, and one that seems less useful to me: without knowing what 
the number means, how do I know which form representation to use? To me this 
runs counter to the goal of “allow[ing] the consumer to choose which variant 
they prefer”. For the languages that appear to need multiple representations 
for the same language code per form (e.g. the Indian languages @Mahir256 
mentioned in T236593#5608530?), is it not possible to 
make the item ID approach work, by creating more special-purpose items? 
Wikidata editors would then make a decision which of the possible spellings 
“deserves” the standard language code, and which additional items need to be 
created (“spelling with character X”, “spelling with sequence Y”?). I 
understand that not all languages have standardized spellings where you can use 
a single item ID to refer to the spelling variants of a wide range of lexemes 
(like in Hebrew), but I think it should still be possible to describe different 
spellings using items that carry more meaning than just a number.
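  
  As a sketch of what "use the most standard language code" could look like for 
a data reuser, the following Python snippet picks a representation for the 
Hebrew example above. The flat `{code: spelling}` dict and the helper name 
`pick_representation` are assumptions for illustration only; actual form data 
would first have to be reduced to this shape.

```python
# Representations of the Hebrew "book" form from the example above,
# reduced to a plain {language code: spelling} mapping (an assumed shape).
REPRESENTATIONS = {
    "he": "ספר",
    "he-x-Q21283070": "סֵפֶר",
}


def pick_representation(representations, base_code, variant_item=None):
    """Return the requested variant if present, else the base-code spelling."""
    if variant_item is not None:
        code = f"{base_code}-x-{variant_item}"
        if code in representations:
            return representations[code]
    # Reusers who do not care about variants land here.
    return representations[base_code]


plain = pick_representation(REPRESENTATIONS, "he")
vocalized = pick_representation(REPRESENTATIONS, "he", "Q21283070")
fallback = pick_representation(REPRESENTATIONS, "he", "Q1")
```

  The fallback only works because one spelling was given the unsuffixed code; 
with purely numbered codes, the consumer would have no comparable basis for 
choosing.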



[Wikidata-bugs] [Maniphest] T236593: Cannot enter multiple forms for the same language variant

2022-07-16 Thread Ijon
Ijon added a comment.


  @AGutman-WMF - yes, I think your approach makes sense.  It would be good to 
auto-suggest those custom language codes in data-entry.



[Wikidata-bugs] [Maniphest] T236593: Cannot enter multiple forms for the same language variant

2022-07-16 Thread Ijon
Ijon added a comment.


  @daniel - this would work very well for Hebrew, for example, where the two 
orthographies have a formal name known to all speakers, but less well when the 
variations are due to lack of standardization, as in the Bangla case mentioned 
by @Mahir256.



[Wikidata-bugs] [Maniphest] T236593: Cannot enter multiple forms for the same language variant

2022-07-12 Thread AGutman-WMF
AGutman-WMF added a comment.


  I believe the current situation, where multiple forms are added to account 
for spelling variations, goes against the spirit of the lexicographical data 
model, and in particular the idea that there should be exactly one form for 
each combination of grammatical features. Therefore I think it is important to 
unblock this situation, and I think my proposal is a simple way to go forward.
  
  @mxn @Fnielsen @jhsoby @Ijon @daniel would you mind chiming in on this?



[Wikidata-bugs] [Maniphest] T236593: Cannot enter multiple forms for the same language variant

2022-06-30 Thread AGutman-WMF
AGutman-WMF added a comment.


  I've now created a patch that allows associating several spelling variants 
with the same private language code.
  
  If the patch gets merged, it will allow associating spelling variants of 
forms or lexemes with codes like `da-x-Q123-1`, `da-x-Q123-2` (on top of an 
existing `da-x-Q123`) etc. In other words, the same private-use language code 
(qualified by some Q-id) can be reused with an arbitrary integer number 
following it. These numbers may represent some order of preference (for 
instance, according to frequency, or word-length), but they can also be 
arbitrary if no distinguishing criterion is provided. I believe such a 
representation of variant spellings is better than duplicating the form with 
its grammatical features and other statements. If a statement should apply only 
to a specific spelling variant, it is possible to qualify it (using for 
instance the Subject Form property or some other tailor-made property).
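  
  For illustration, here is a small Python sketch of how a consumer might take 
such codes apart. The regular expression reflects only the shape described in 
this comment (a base code, `-x-`, a Q-id, and an optional integer suffix); it 
is an assumption made for this example and not the validation logic of the 
patch or of Wikibase. The name `parse_code` is likewise hypothetical.

```python
import re

# Assumed shape: <base>-x-Q<digits>[-<digits>], e.g. "da-x-Q123-2".
CODE_RE = re.compile(
    r"^(?P<base>[A-Za-z]+(?:-[A-Za-z0-9]+)*?)"
    r"-x-(?P<item>Q\d+)"
    r"(?:-(?P<index>\d+))?$"
)


def parse_code(code):
    """Split a code into (base, item ID or None, integer suffix or None)."""
    match = CODE_RE.match(code)
    if match is None:
        # Plain codes such as "da" carry no private-use part.
        return (code, None, None)
    index = match.group("index")
    return (match.group("base"), match.group("item"),
            int(index) if index else None)


parsed = [parse_code(c) for c in ("da", "da-x-Q123", "da-x-Q123-2")]
```

  Note that the parsed integer says nothing by itself about what distinguishes 
one variant from another; that meaning would have to come from a convention or 
from qualifying statements.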
  
  I believe this solution would solve the initial problem associated with this 
ticket ("Cannot enter multiple forms for the same language variant"), but is 
this still of relevance?



[Wikidata-bugs] [Maniphest] T236593: Cannot enter multiple forms for the same language variant

2022-06-27 Thread C933103
C933103 added a comment.


  In T236593#5610378, @daniel wrote:
  
  > I recall that we had long discussions about this when initially deciding on 
the data model. In technical terms, the question was whether we would allow 
only a single literal value for a spelling variant, or a list or set of words. 
Allowing a list or set would enable the kind of flexibility @jhsoby is asking 
for. But the down side is that it introduces ambiguity when listing forms (you 
would always have to list all of them, in undefined order), and when generating 
text (which one should you use)?
  >
  > If I recall correctly, we decided that we want to give the consumer of the 
data maximum control over which variant they prefer, by forcing the producer to 
provide different variant codes for all different spellings. We had discussions 
about how to encode this in the variant (language) codes, and how to represent 
it in the UI, but decided to leave that for later.
  >
  > So, the solution that we envisioned when originally discussing this about 
four years ago was: you make up a code for each of the spellings, in a way that 
allows the consumer to choose which variant they prefer. If that is done by 
encoding a region or a rhyme or a tradition or school or whatever will depend 
on the language. If it's a stylistic choice, name the style.
  >
  > The same approach can be used for historical spellings. codes could look 
something like de-x-hist-nd-15jh or something (this code is totally made up and 
probably linguistically nonsense).
  
  The underlying assumption behind this decision is that different spellings 
must each be associated with a certain variant, or that some spellings are 
preferred over others, or that some spelling is more commonly used for a given 
spoken variant, sociolect, etc. than the other spellings.
  
  None of these assumptions is correct when it comes to non-Chinese languages 
that use Chinese characters, or even some Chinese languages that need to apply 
Chinese characters.
  
  Examples from Vietnamese chữ Nôm have already been presented above. Other 
examples include Japanese ateji, where kanji are used for native Japanese words 
(except in cases where a fully established transliteration exists), its 
historical Korean equivalent, and languages like Cantonese, where non-Mandarin 
words need to be expressed in Chinese characters.



[Wikidata-bugs] [Maniphest] T236593: Cannot enter multiple forms for the same language variant

2022-06-27 Thread mxn
mxn added a comment.


  In T236593#8026331, @mxn wrote:
  
  > If it is so important that forms not be used for orthographic variants of a 
non-alphabetic writing system, then the alternative approach would be to store 
the //quốc ngữ// and //chữ Nôm// representations in separate lexemes, as though 
they’re different languages. We could link individual //quốc ngữ// and //chữ 
Nôm// senses together as translations. This would be broadly consistent with 
the approach taken on every Wiktionary and render this ticket moot for 
Vietnamese, but it bends the definition of a language quite a bit.
  
  I’ve implemented this approach, so this feature request is no longer of great 
importance to Vietnamese. A side benefit is that it’s now possible to say that 
a //Nôm// character is a “translation” of some senses of a //quốc ngữ// word 
but not others (because of semantic distinctions that were only necessary to 
indicate in //chữ Nôm//).



[Wikidata-bugs] [Maniphest] T236593: Cannot enter multiple forms for the same language variant

2022-06-27 Thread Fnielsen
Fnielsen added a comment.


  @AGutman-WMF Spelling variants in Ordregister are each associated with a 
specific identifier. If the spelling variants are just a representation, then 
it is not possible to associate the identifier with the specific representation 
(unless a new property is proposed). On the other hand if the spelling variant 
is associated with a separate form then the Wikidata property can be used for 
the Ordregister spelling variant identifier.



[Wikidata-bugs] [Maniphest] T236593: Cannot enter multiple forms for the same language variant

2022-06-24 Thread mxn
mxn added a comment.


  In T236593#8025472, @AGutman-WMF wrote:
  
  > @mxn If these are purely orthographic variants (i.e. the pronunciation is 
the same) I would list them under a single lexeme. And in that case, the most 
natural way would be to list them as spelling variants rather than distinct 
forms.
  
  This assumption is only valid in an environment with purely 
phonetic/alphabetic writing systems. But in Chinese, two characters that are 
“spelled” distinctly but carry the same semantics and pronunciation would still 
have distinct lexemes. This also makes it possible to indicate that the two 
characters are pronounced similarly in one dialect but differently in another.
  
  //Chữ Nôm// is a Chinese-based writing system that adds a phonosemantic 
aspect. If not for its relationship to the //quốc ngữ// alphabet, every 
character would clearly get its own lexeme, just like in Chinese. Any 
similarity in pronunciation would be irrelevant, because this writing system 
makes finer semantic distinctions than any alphabet would. For example, the 
difference between 𬖾 and 頗 (both interchangeable written forms of //phở//) is 
that 𬖾 combines 頗 with the component 米 as a disambiguator, clarifying that it 
has to do with rice (because phở noodles are made of rice), as opposed to 
whatever 頗 originally meant in Chinese. This is only one of many possible ways 
in which characters may be used interchangeably but can carry different 
nuances. Yet all this is secondary to the fact that the two characters are 
equivalent to //phở//, which makes no such distinctions.
  
  To further illustrate the difficulty, if you look at a //quốc ngữ//–to–//chữ 
Nôm// dictionary and a //chữ Nôm//–to–//quốc ngữ// dictionary by the same 
author, the entries will not line up, just as there isn’t a one-to-one 
correspondence between the English-to-German and German-to-English halves of an 
English–German dictionary. If you look up “bỏ” in this dictionary, you’ll get 
three characters from the source “vhn” corresponding to two 
different senses of //bỏ//. Any Vietnamese dictionary would have just one entry 
for these two senses of //bỏ//, because Vietnamese speakers no longer 
illustrate semantics in writing.
  
  If it is so important that forms not be used for orthographic variants of a 
non-alphabetic writing system, then the alternative approach would be to store 
the //quốc ngữ// and //chữ Nôm// representations in separate lexemes, as though 
they’re different languages. We could link individual //quốc ngữ// and //chữ 
Nôm// senses together as translations. This would be broadly consistent with 
the approach taken on every Wiktionary and render this ticket moot for 
Vietnamese, but it bends the definition of a language quite a bit.
  
  > To attach statements to specific variants, I believe that you can qualify 
statements using the "subject form" property
  
  This is for statements on senses. If we somehow combine all the //Nôm// 
characters into a single form, then it would make sense to qualify sources and 
P5425 statements by an “applies to representation” property, but even this would 
get messy with compounds.
  
  > (although, aside, I must admit I don't understand the need for the "Han 
character in this lexeme" property; what novel information does it bring on top 
of the orthography itself?)
  
  Translingual data about a Han character is stored in an item. There’s a need 
to connect this translingual data to individual senses via language-specific 
forms.



[Wikidata-bugs] [Maniphest] T236593: Cannot enter multiple forms for the same language variant

2022-06-24 Thread AGutman-WMF
AGutman-WMF added a comment.


  @Fnielsen as far as I see, each variant spelling forms its own set of 
inflected forms, so you have a paradigm related to //mørklægge// and another 
paradigm related to the variant spelling //mørkelægge//. So conceptually you 
don't have a single list of forms, but rather two distinct lists of forms. For 
this reason (and since the pronunciation slightly differs) it may make sense to 
separate them into two distinct lexemes.
  
  However, if you want to follow the system of the Ordregister, please note 
that the identifiers have 3 parts: the first corresponds to the "lexeme", the 
second one to the "inflectional form", and the third one to the "spelling 
variant" level. 
  So if you want to follow this system, you shouldn't list the spelling 
variants as separate forms, but rather as spelling variants of the same forms. 
If you want to attach statements to the spelling variants, you could use the  
"subject form" property, as I suggested above.
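  
  As a rough illustration of the three-level structure described above, the 
following Python sketch splits an identifier into its lexeme, inflectional 
form, and spelling variant parts, then groups spelling variants under the form 
they belong to. The dot-separated layout "COR.<lexeme>.<form>.<variant>", the 
sample identifiers, and the names `CorId`, `split_cor_id` and `group_variants` 
are assumptions made for this example, not taken from the Ordregister 
documentation.

```python
from collections import defaultdict
from typing import NamedTuple


class CorId(NamedTuple):
    lexeme: str   # first part: the "lexeme" level
    form: str     # second part: the "inflectional form" level
    variant: str  # third part: the "spelling variant" level


def split_cor_id(identifier: str) -> CorId:
    """Split an identifier assumed to look like COR.<lexeme>.<form>.<variant>."""
    prefix, *parts = identifier.split(".")
    if prefix != "COR" or len(parts) != 3 or not all(parts):
        raise ValueError(f"Unexpected identifier layout: {identifier!r}")
    return CorId(*parts)


def group_variants(identifiers):
    """Map each (lexeme, form) pair to the spelling variants listed for it."""
    forms = defaultdict(list)
    for identifier in identifiers:
        cor = split_cor_id(identifier)
        forms[(cor.lexeme, cor.form)].append(cor.variant)
    return dict(forms)


# Made-up identifiers: one form with two spellings, another form with one.
sample = ["COR.1.10.01", "COR.1.10.02", "COR.1.20.01"]
grouped = group_variants(sample)
```

  Under this reading, two identifiers that differ only in the last part would 
be two representations of one form rather than two forms.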



[Wikidata-bugs] [Maniphest] T236593: Cannot enter multiple forms for the same language variant

2022-06-24 Thread Fnielsen
Fnielsen added a comment.


  @AGutman-WMF https://www.wikidata.org/wiki/Lexeme:L348129 does have the same 
inflection. The Ordregister is presumably also for machines and lumps forms 
together. For instance, https://ordregister.dk/id/COR.53473/ corresponding to 
https://www.wikidata.org/wiki/Lexeme:L250372 lumps 6 different forms together 
in the lexeme. And each has a separate COR identifier.



[Wikidata-bugs] [Maniphest] T236593: Cannot enter multiple forms for the same language variant

2022-06-24 Thread AGutman-WMF
AGutman-WMF changed the task status from "Open" to "In Progress".
AGutman-WMF added a comment.


  I'm working on a patch to allow multiple forms associated with the same 
private language code.



[Wikidata-bugs] [Maniphest] T236593: Cannot enter multiple forms for the same language variant

2022-06-24 Thread AGutman-WMF
AGutman-WMF added a comment.


  @Fnielsen given that the pronunciation of these forms is in fact different 
(according to the X-Sampa notation), and each has its own distinct inflection 
set, I would treat these as two distinct (synonymous) lexemes. I don't see the 
advantage of lumping all these forms in one entry. Of course, in a dictionary 
intended for human consumption it is convenient to list them together, but in a 
machine-readable dictionary, such as Wikidata, these should really be treated 
as two distinct lexemes.



[Wikidata-bugs] [Maniphest] T236593: Cannot enter multiple forms for the same language variant

2022-06-24 Thread Fnielsen
Fnielsen added a comment.


  I have entered this Danish lexeme today: 
https://www.wikidata.org/wiki/Lexeme:L348129. In authoritative works 
https://ordnet.dk/ddo/ordbog?query=m%C3%B8rkel%C3%A6gge&search=Den+Danske+Ordbog
 , https://dsn.dk/ordbog/ro/moerkelaegge/ and https://ordregister.dk/ they are 
regarded as one lexeme. The Ordregister has one identifier for the lexeme and 
we have different Ordregister form identifiers for each of the variants. The 
hyphenation is different between the variants.



[Wikidata-bugs] [Maniphest] T236593: Cannot enter multiple forms for the same language variant

2022-06-24 Thread AGutman-WMF
AGutman-WMF added a comment.


  @mxn If these are purely orthographic variants (i.e. the pronunciation is the 
same) I would list them under a single lexeme. And in that case, the most 
natural way would be to list them as spelling variants rather than distinct 
forms.
  
  To attach statements to specific variants, I believe that you can qualify 
statements using the "subject form" property (although, as an aside, I must 
admit I don't understand the need for the "Han character in this lexeme" 
property; what novel information does it bring on top of the orthography 
itself?)



[Wikidata-bugs] [Maniphest] T236593: Cannot enter multiple forms for the same language variant

2022-06-24 Thread mxn
mxn added a comment.


  In T236593#8017255, @mxn wrote:
  
  > In T236593#8015993, @AGutman-WMF wrote:
  >
  >> The ideal solution would be to allow (in the language code validator) 
arbitrary language codes including a rank identifier. For instance, for 
Vietnamese one should be able to use codes such as vi-x-Q8201-1, vi-x-Q8201-2 
etc. Currently this doesn't pass the validation as one gets the error //Invalid 
Item ID "Q8201-1"//.
  >
  > It sounds like representations need the ability to have qualifiers…
  
  To elaborate, each //Nôm// character needs a different set of “Han character 
in this lexeme” statements (multiple statements for compound words), different 
sources, probably other things that aren’t coming to mind. It’s not that I 
don’t want to give the multiple-representation approach a try, but how else 
would hủy bỏ/huỷ bỏ and ký hiệu/kí hiệu be modeled but to keep the characters 
in separate forms?
  
  In principle, each character should even get its own lexeme, but since each 
//Nôm// character is an alternative form of a //quốc ngữ// word, the various 
spellings of that word would need to be duplicated as lemmas of each such 
lexeme. It ends up being a lot of redundancy and room for error. I had tried 
this approach at one point, with very redundant lexemes for phở, 𬖾, and 頗, 
but it seemed like needless complication for both editors and data consumers.



[Wikidata-bugs] [Maniphest] T236593: Cannot enter multiple forms for the same language variant

2022-06-24 Thread AGutman-WMF
AGutman-WMF added a comment.


  In T236593#8016636, @Fnielsen wrote:
  
  > In Danish, we are currently using multiple forms and linking them with 
https://www.wikidata.org/wiki/Property:P8530 See also the discussion at 
https://www.wikidata.org/wiki/Wikidata:Property_proposal/Alternative_form
  
  I think this is not ideal either, because it would mean validating by Lexical 
Masks becomes more difficult. I would argue that in such cases the right way is 
to use a distinct language code for the variant spelling (possibly arbitrarily 
selecting one as the variant), as I did for demonstration on 
https://www.wikidata.org/wiki/Lexeme:L229388 (though I only did it on the 
lexeme header, one could use the same system for each of the forms). If the 
pronunciation is different as well, I think they should be seen as different 
lexemes.



[Wikidata-bugs] [Maniphest] T236593: Cannot enter multiple forms for the same language variant

2022-06-21 Thread mxn
mxn added a comment.


  In T236593#8015993, @AGutman-WMF wrote:
  
  > The ideal solution would be to allow (in the language code validator) 
arbitrary language codes including a rank identifier. For instance, for 
Vietnamese one should be able to use codes such as vi-x-Q8201-1, vi-x-Q8201-2 
etc. Currently this doesn't pass the validation as one gets the error //Invalid 
Item ID "Q8201-1"//.
  
  It sounds like representations need the ability to have qualifiers…



[Wikidata-bugs] [Maniphest] T236593: Cannot enter multiple forms for the same language variant

2022-06-21 Thread Fnielsen
Fnielsen added a comment.


  In Danish, we are currently using multiple forms and linking them with 
https://www.wikidata.org/wiki/Property:P8530 See also the discussion at 
https://www.wikidata.org/wiki/Wikidata:Property_proposal/Alternative_form



[Wikidata-bugs] [Maniphest] T236593: Cannot enter multiple forms for the same language variant

2022-06-21 Thread AGutman-WMF
AGutman-WMF added a comment.


  The ideal solution would be to allow (in the language code validator) 
arbitrary language codes including a rank identifier. For instance, for 
Vietnamese one should be able to use codes such as vi-x-Q8201-1, vi-x-Q8201-2 
etc. Currently this doesn't pass the validation as one gets the error //Invalid 
Item ID "Q8201-1"//.
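  
  A minimal Python sketch of what such a relaxed check could look like, 
assuming codes of the shape "<language>-x-<item ID>" with an optional numeric 
rank appended. The pattern and the name `parse_variant_code` are hypothetical 
and written only for illustration; this is not the code of the actual language 
code validator.

```python
import re

# Hypothetical shape: base language, "-x-", an item ID, optional "-<rank>".
PRIVATE_CODE = re.compile(
    r"(?P<lang>[a-z]{2,3})-x-(?P<item>Q[1-9][0-9]*)(?:-(?P<rank>[1-9][0-9]*))?"
)


def parse_variant_code(code):
    """Return (language, item ID, rank or None); raise ValueError otherwise."""
    match = PRIVATE_CODE.fullmatch(code)
    if match is None:
        raise ValueError(f"Invalid language code: {code!r}")
    rank = match.group("rank")
    return match.group("lang"), match.group("item"), int(rank) if rank else None


first = parse_variant_code("vi-x-Q8201-1")
second = parse_variant_code("vi-x-Q8201-2")
unranked = parse_variant_code("vi-x-Q8201")
```

  Treating the rank as a separate, optional component keeps the item ID itself 
valid, so "Q8201-1" is never looked up as an item.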



[Wikidata-bugs] [Maniphest] T236593: Cannot enter multiple forms for the same language variant

2022-06-21 Thread mxn
mxn added a comment.


  Nearly every Vietnamese lexeme would be affected by this issue, because one 
of the two writing systems for the language is phonetic while the other is 
phonosemantic, resulting in a many-to-many relationship between the two 
writing systems.
  
  In T236593#5610378, @daniel wrote:
  
  > So, the solution that we envisioned when originally discussing this about 
four years ago was: you make up a code for each of the spellings, in a way that 
allows the consumer to choose which variant they prefer. If that is done by 
encoding a region or a rhyme or a tradition or school or whatever will depend 
on the language. If it's a stylistic choice, name the style.
  
  This isn’t always possible. Vietnamese //chữ Nôm// is unstandardized, so a 
single author may use multiple characters interchangeably for the same word 
(with the same pronunciation and meaning). There isn’t any “style” to speak of: 
an author’s choice to use one character for “and” has little if any bearing on 
their choice of character for “or”. If Wikidata had existed a century or 
more ago, we might’ve chosen to create a separate lexeme for each 
//Nôm// character, in which case it might be possible to model //quốc ngữ// 
spellings as dialectal representations of //Nôm// forms. But in the 21st 
century, //Nôm// characters must be subordinate to //quốc ngữ// words.
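  
  The constraint at issue can be pictured with a small Python sketch, assuming 
a form's representations behave like a mapping from language code to a single 
spelling, as described earlier in this thread. The helper 
`add_representation` is hypothetical and only mimics the restriction; it is 
not Wikibase code.

```python
def add_representation(representations, code, text):
    """Add a spelling under a language code, refusing a second one per code."""
    if code in representations:
        raise ValueError(f"Form already has a representation for {code!r}")
    representations[code] = text


# One form of the word, with its quốc ngữ spelling.
form = {"vi": "phở"}

# The first Nôm character fits under its own code.
add_representation(form, "vi-Hani", "𬖾")

# An interchangeable second character has no code of its own to go under.
try:
    add_representation(form, "vi-Hani", "頗")
    rejected = False
except ValueError:
    rejected = True
```

  Since neither character belongs to a nameable style or region, there is no 
principled second code to invent for 頗.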
