I've got the revisions to the revisions on the paper sitting on Gary's desk (was hoping we'd get this online today, but the day's getting old, so tomorrow is looking more likely). So, I'll return to this discussion and try to respond to some of the weekend's flurry of messages.

On 09/16/2000 08:21:04 AM Michael Everson wrote:

[snip]

>The Ethnologue lists six different Ancash Quechua, five different Huánuco
>Quechuas, and a lot of other Quechuas besides. It's got five kinds of
>Italian. How do we evaluate this? And I don't know how many Zapotecos,
>there are too many to count. Do we just accept that it's all been evaluated?
>
>Well, then we find errors, and we point them out. And we say, that's why
>we're worried about this database. But Peter says that's not good enough,
>it's only "anecdotal", and indeed the burden is placed on us to improve the
>Ethnologue by filing reports.

What I mean here, Michael, is this: in the first paragraph above, you haven't demonstrated that problems exist; you've merely implied that problems exist based on the assumption that there shouldn't be more than one Ancash Quechua, etc. This is the kind of thing I'm referring to as anecdotal: "it's wrong because I don't agree with it". There is a reason why six different Ancash Quechuas, etc. are listed: research has indicated that there are that many related but distinct, mutually non-intelligible speech varieties that have made use of the name "Ancash Quechua".

>I've got Meillet and Cohen's 1924 _Les langues du monde_ here on my desk in
>front of me. Like the Ethnologue, it deals with the languages of the world.
>It has big lists in it. Would I accept those uncritically either? No.

This seems to me to be an important issue: can people involved in creating standardized systems of language identifiers trust the judgements of experts from the field of linguistics? I think the answer must be yes for two reasons:

1. People creating IT standards cannot be experts in all fields, and certainly cannot all be experts in linguistics, especially of all the different languages and language families of the world. When dealing with something outside their field of expertise, there must be a willingness to trust the judgements of experts in that domain, and I think this applies in this case.

2. The position that those controlling a system of language identifiers must hold the expertise and be able to determine how to "tile the plane" of language variations around the world is based on an invalid assumption: that there is only one, correct way to tile the plane for use in IT. There is not one single, correct categorization of languages. This is one of the key points Gary and I have made in our paper.

>I recognize the need for more languages. My concern with the Ethnologue is
>with its classification.

This seems to argue in favour of the preceding point: there is no single consensus on how to enumerate the world's languages, since different people use different definitions for different purposes. The only solution to that impossible situation is a system that allows for alternate namespaces, each based on different particular definitions and maintained by different authorities.

In various messages, it has sounded like you agree with us that the international standards process could never cope with providing the thousands of tags that some existing users need. We are in agreement that the list of 6000+ Ethnologue codes can't serve as *the* international standard; and we agree that you could never get everybody to agree on a list that large - this is precisely our point about categorization. Thus if you recognize the need for more language tags, then you must like our idea of namespaces, since that gives us a way to have well-documented codes that anybody can use to address the full scope of the world's languages, without requiring that the whole world own the codes.
It seems that, in the same way that the XML community couldn't agree on a single worldwide tag set and so adopted namespaces, so must the IT community do this for language tagging.

>You know
>how much a fuss there was just because the code for Yiddish was changed
>from ji to yi? Well how much fuss is there going to be if we find out that
>Upper Kinauri and Lower Kinauri shouldn't really have been given two
>different codes? Because we DON'T want to change codes once they have been
>used in an RFC 1766 context.

This is somewhat overstated. Changing the code for a given meaning from "abc" to "def" is a serious problem, and it is understandable that people would be upset. And that is something that the Ethnologue staff is committed never to do.

This is different from changing the categorization based on improved knowledge, such as merging two categories or splitting a category into two. That is something that would only be done if it conformed to the operational definition and was motivated by improved knowledge. And given the operational definition, this is actually what users would probably prefer to have happen - if they assume the categories are defined in a certain way, and they gain a better understanding of the real-world categories, they want the codes for the categories to reflect the current best practices.

There is no problem to users and to existing data provided there is clear documentation as to what codes mean and regarding what changes occurred when. This is exactly what we have argued is needed to deal with the dynamic nature of language, and is precisely how the Ethnologue will be maintained.
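
To make the distinction above a bit more concrete, here is a minimal Python sketch of a code list that records splits and merges as dated changes and refuses to give any code a second meaning. It is only an illustration under my own assumptions, not a description of how the Ethnologue database actually works: the class and field names, the namespace label, the codes "xxa" through "xxd", and the date are all hypothetical, and the sketch assumes that a split or merge retires the old codes and mints new ones.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Set, Tuple


@dataclass(frozen=True)
class Change:
    """One documented recategorization (all field names are hypothetical)."""
    effective: date
    kind: str                    # "split" or "merge"
    retired: Tuple[str, ...]     # codes that stop being current
    introduced: Tuple[str, ...]  # brand-new codes that replace them
    note: str                    # why the change was made


class CodeRegistry:
    """Codes for one namespace, plus the history of how they changed."""

    def __init__(self, namespace: str, initial_codes: List[str]) -> None:
        self.namespace = namespace
        self._active: Set[str] = set(initial_codes)
        self._ever_used: Set[str] = set(initial_codes)
        self.history: List[Change] = []

    def apply(self, change: Change) -> None:
        unknown = set(change.retired) - self._active
        if unknown:
            raise ValueError(f"not active: {sorted(unknown)}")
        # A code is never given a second meaning, even after retirement.
        reused = set(change.introduced) & self._ever_used
        if reused:
            raise ValueError(f"already used: {sorted(reused)}")
        self._active -= set(change.retired)
        self._active |= set(change.introduced)
        self._ever_used |= set(change.introduced)
        self.history.append(change)

    def successors(self, code: str) -> Set[str]:
        """Active codes that now cover what `code` once covered."""
        if code not in self._ever_used:
            raise KeyError(code)
        current = {code}
        for change in self.history:  # history is kept in the order applied
            if current & set(change.retired):
                current = (current - set(change.retired)) | set(change.introduced)
        return current


# Hypothetical example: one category is split after new survey work.
registry = CodeRegistry("example-ns", ["xxa", "xxb"])
registry.apply(Change(date(2001, 1, 1), "split", ("xxa",), ("xxc", "xxd"),
                      "survey found two mutually non-intelligible varieties"))
print(sorted(registry.successors("xxa")))  # ['xxc', 'xxd']
```

Data tagged "xxa" before the split keeps a well-defined meaning, because the history says what the code covered and when it was retired, while new data can use the finer-grained codes.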
This is also not a problem for many users, including business users, for two reasons: the languages that are most likely to undergo such recategorization are of interest to a relatively limited number of users, and the categorization used by many users, including business users, will generally be based on a different operational definition, such that the codes they are using (preferably from a different namespace) would not be affected by such changes.

For instance, suppose a better understanding of regional Thai varieties results in a revision to the categorization of those languages in a namespace defined in terms of mutually non-intelligible speech varieties. That would have no effect on a categorization, based on a different definition, that is interested in only a single language variety based primarily on a common written form, "Standard Thai"; business users and many other users would continue to use "tha". But those users for whom the individual speech varieties are important will get codes that reflect the improved understanding, and for them, that is exactly what they want, provided that they can also maintain their older data based on the earlier understanding.

>Therefore I am wary of such a huge list. Do you really find this so
>unreasonable?

Only inasmuch as I don't think all the issues have been considered.
When we really think about the entire set of problems involved in language identification, the only real solutions seem to be:

- clarify what are the operational definitions on which categorizations are based

- create distinct namespaces based upon distinct operational definitions and maintained by agencies with expertise for the given domain (with an assumption of some minimal criteria to be met for creating a distinct namespace, including the need to ensure avoidance of synonyms with ISO 639-x in particular)

- have some mechanism to handle the dynamic nature of language and of our knowledge of languages in a *controlled* manner that helps users rather than creating problems for users

- provide adequate documentation as to the meaning of codes; this must include some measure of encyclopedic information that is freely available online, and maintained on an on-going basis (it's not about to go out of print)

The Ethnologue is only a part of a broad solution that we are proposing, though we feel it makes a valuable contribution to that solution particularly because it conforms to the criteria described above and because it immediately overcomes existing problems of scale.

- Peter

---------------------------------------------------------------------------
Peter Constable

Non-Roman Script Initiative, SIL International
7500 W. Camp Wisdom Rd., Dallas, TX 75236, USA
Tel: +1 972 708 7485
E-mail: <[EMAIL PROTECTED]>

