Pamputt added a comment.
In T195740#4237172, @VIGNERON wrote: Tough questions.
Caveat: I'm not an expert (feel free to correct me), but I came across several language issues in my decade and a half as a Wikimedian.
My point of view is that the current system for monolingual text is good but not sufficient for the needs of lexemes. One (anecdotal) example: the 'fro' and 'frm' codes were approved (within a week, see T181823; sometimes it can take months) and were added a month ago, but they are still not fully integrated (you need to enter the code itself; it's not possible to just type the name of the language). For items, I understand the caution and we can wait to be sure, but for lexemes this seems too slow to me.
I like the idea of a « list of items that have instance of/subclass of language », but that's not enough, as it doesn't take the script into account.
In the end, I think we need to move closer to the IETF BCP 47 system of tags (a worldwide standard, well documented and already used by a lot of online dictionaries; see the BCP 47 text). We need several lists: one for languages (ISO 639-1), one for territories (ISO 3166), one for scripts (ISO 15924; here the list is quite short), and maybe other subtags (dialects, variants and private use). This is important to reflect the lemmas precisely, like French before or after the 1990 reform, or German before/after the 1996 reform; de-CH-1996 is given as an example in BCP 47 itself.
More importantly, we need the ability to combine these lists, as in BCP 47.
Some examples of why we need more flexibility and granularity:
- the Kazakh language was written in Arab, then in Latn, then in Cyrl, and now they are going back to Latn... 4 (or 5) scripts is probably a record, but it's not unusual for a language to have at least two scripts (I'm thinking of Serbian: sr-Latn and sr-Cyrl, or Chinese: zh-Hant and zh-Hans).
- one more extreme example: the Bornholm amulet has an inscription in Latin but written in runes. So if we want to model it correctly, we need the code "la-Runr". Runic Latin is not common and has a limited corpus, but still, a Wikidatian could want to work on that, just add this word and move on (to another strange epigraphy, like 4th-century proto-Turkish written in Chinese characters).
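To illustrate what "combining the lists" could look like, here is a minimal sketch. The `LanguageTag` class and its field names are hypothetical and only for illustration (nothing here is existing Wikidata code); it simply joins the optional subtags in the order BCP 47 uses (language, script, region, variant) and does not check the subtags against any registry.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class LanguageTag:
    """Hypothetical container for the subtags discussed above."""
    language: str                   # ISO 639 code, e.g. "sr"
    script: Optional[str] = None    # ISO 15924 code, e.g. "Cyrl"
    region: Optional[str] = None    # ISO 3166 code, e.g. "CH"
    variant: Optional[str] = None   # variant subtag, e.g. "1996"

    def __str__(self) -> str:
        # BCP 47 order: language, then script, then region, then variant.
        parts = [self.language, self.script, self.region, self.variant]
        return "-".join(p for p in parts if p)


examples = [
    LanguageTag("sr", script="Latn"),
    LanguageTag("sr", script="Cyrl"),
    LanguageTag("la", script="Runr"),
    LanguageTag("de", region="CH", variant="1996"),
]
tags = [str(t) for t in examples]
print(tags)
```

Each list stays short and independent; the granularity comes from combining them only when a lemma needs it.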
My view is more a technical one (a code is valid in its format); for the social and validation part (a code is valid with regard to the content encoded), I'm not sure how best to handle it (should we leave it entirely to the community? And can it be coordinated with the LangCom?)
PS: anyhow, attention should be paid to capitalization: ISO 639 is all lowercase, ISO 3166 is all uppercase, and ISO 15924 has only the first letter in uppercase (the same letters can exist in different codes: 'ca' for Catalan and 'CA' for Canada).
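A rough sketch of that capitalization convention, assuming a simplified tag made only of language, script, region and variant subtags (the `normalize_case` helper is hypothetical, and it deliberately ignores extension and private-use subtags, which follow different rules):

```python
def normalize_case(tag: str) -> str:
    """Apply the conventional casing to a simplified BCP 47-style tag."""
    parts = tag.split("-")
    # The first subtag is the language: always lowercase.
    result = [parts[0].lower()]
    for subtag in parts[1:]:
        if len(subtag) == 4 and subtag.isalpha():
            # Four letters: script (ISO 15924), first letter uppercase.
            result.append(subtag.capitalize())
        elif len(subtag) == 2 and subtag.isalpha():
            # Two letters after the language: region (ISO 3166), uppercase.
            result.append(subtag.upper())
        else:
            # Anything else (e.g. a variant such as "1996"): lowercase.
            result.append(subtag.lower())
    return "-".join(result)


print(normalize_case("SR-latn"))     # sr-Latn
print(normalize_case("DE-ch-1996"))  # de-CH-1996
print(normalize_case("CA"))          # ca (language position, so Catalan)
print(normalize_case("fr-ca"))       # fr-CA (region position, so Canada)
```

The position of a subtag, not only its letters, decides how it is read, which is why 'ca' and 'CA' do not collide.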
As explained in my previous message, I agree we need to specify at least language and script. For the rest (country, orthography reform, ...), I think the best way to store this kind of information is to use properties on the lexeme itself. The advantage of properties is that they are really flexible, so we can decide a posteriori what kind of information we want to store in a lexeme.
Let us take the example of the "de-CH-1996" code: am I supposed to use this code only for words in Swiss German following the 1996 reform? If the same lexeme is used both in Swiss German and in "standard German", should I use this code? If a Swiss German lexeme has not been modified by the 1996 reform, should I use "de-CH-1996", "de-CH" or both? All are valid, and I think it will be really difficult to manage.
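To make the property-based alternative concrete, here is a minimal sketch. Everything in it is an assumption for illustration: the `Lexeme` class, the "used in" property name and the sample values are hypothetical and are not the actual Wikidata data model. Only language and script go into the code; the rest is attached as statements that can be added later.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Lexeme:
    """Hypothetical lexeme: the code carries only language and script."""
    lemma: str
    language: str   # ISO 639 code, e.g. "de"
    script: str     # ISO 15924 code, e.g. "Latn"
    statements: Dict[str, List[str]] = field(default_factory=dict)

    @property
    def code(self) -> str:
        # Only language and script are encoded in the code itself.
        return f"{self.language}-{self.script}"

    def add_statement(self, prop: str, value: str) -> None:
        # Statements can be added at any time, for any property.
        self.statements.setdefault(prop, []).append(value)


# Illustrative data only: one lexeme used in more than one territory.
lexeme = Lexeme("Beispiel", language="de", script="Latn")
lexeme.add_statement("used in", "Switzerland")
lexeme.add_statement("used in", "Germany")

print(lexeme.code)
print(lexeme.statements)
```

With this shape, a lexeme shared by Swiss German and standard German keeps a single code, and no choice between "de-CH-1996" and "de-CH" has to be made.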
Cc: Pamputt, Liuxinyu970226, Micru, VIGNERON, Lydia_Pintscher, Lea_Lacroix_WMDE, Aklapper, Lahi, Gq86, GoranSMilovanovic, QZanden, LawExplorer, Wikidata-bugs, aude, Darkdadaah, Mbch331
_______________________________________________ Wikidata-bugs mailing list [email protected] https://lists.wikimedia.org/mailman/listinfo/wikidata-bugs
