For those who are interested, here is the Perl script, makeISO639.pl,
that I use to create the listing for JSword.
So that names sort better, I'm using the "inverted" name, which
puts the family name in front of the qualifier.
This means that all the Zapotec languages sort together.
(Note: I have to run the output through native2ascii to create a
property file.)
******************************************************************************
#!/usr/bin/perl
# This script creates a Java property file from SIL's ISO639-3 files.
# Those files change frequently in both content and layout,
# so adjust this program as needed.
#
# The files are currently downloaded from:
# http://www.sil.org/iso639-3/iso-639-3_20090210.tab
# http://www.sil.org/iso639-3/iso-639-3_Name_Index_20090210.tab
# http://www.sil.org/iso639-3/iso-639-3_Retirements_20090126.tab
#
# Run the program as:
# makeISO639.pl > iso639.txt
#
# Sort the file if desired with:
# makeISO639.pl | sort -t = -k 2 > iso639.txt
#
# Convert it from UTF-8 to Java's ASCII representation with:
# native2ascii -encoding utf-8 iso639.txt > iso639.properties
use strict;
use warnings;
use Unicode::Normalize;
binmode(STDOUT, ":utf8");
my $nameIndex = "iso-639-3_Name_Index_20090210.tab";
my $langCodes = "iso-639-3_20090210.tab";
my $deadCodes = "iso-639-3_Retirements_20090126.tab";
my %names = ();
open(my $nameIndexFile, "<:utf8", $nameIndex)
    or die "Cannot open $nameIndex: $!\n";
# skip the header line
my $firstLine = <$nameIndexFile>;
while (<$nameIndexFile>)
{
    # strip MS-DOS line endings
    s/\r//;
    chomp();
    # skip blank lines
    next if (/^$/);
    # ensure it is normalized to NFC
    $_ = NFC($_);
    my @line = split(/\t/, $_);
    # key on the code and name columns; the value is the inverted name
    $names{$line[0],$line[1]} = $line[2];
}
close($nameIndexFile);
open(my $langFile, "<:utf8", $langCodes)
    or die "Cannot open $langCodes: $!\n";
# skip the header line
$firstLine = <$langFile>;
while (<$langFile>)
{
    # strip MS-DOS line endings
    s/\r//;
    chomp();
    # skip blank lines
    next if (/^$/);
    # ensure it is normalized to NFC
    $_ = NFC($_);
    my @line = split(/\t/, $_);
    # exclude extinct languages
    next if ($line[5] eq 'E');
    # look up the inverted name for this code and reference name
    my $name = $names{$line[0],$line[6]};
    # also emit the two-letter code, when the language has one
    print "$line[3]=$name\n" if ($line[3]);
    print "$line[0]=$name\n";
}
close($langFile);
# The dead codes file is iso-8859-1. This may change at some date.
open(my $deadFile, "<:encoding(iso-8859-1)", $deadCodes)
    or die "Cannot open $deadCodes: $!\n";
# skip the header line
$firstLine = <$deadFile>;
while (<$deadFile>)
{
    # strip MS-DOS line endings
    s/\r//;
    chomp();
    # skip blank lines
    next if (/^$/);
    # ensure it is normalized to NFC
    $_ = NFC($_);
    my @line = split(/\t/, $_);
    # retired codes keep the name given in the retirements file
    print "$line[0]=$line[1]\n";
}
close($deadFile);
******************************************************************************
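For anyone without native2ascii at hand, the conversion step mentioned in the script header can be approximated in a few lines of Java. This is a minimal sketch, not the tool itself: the class name AsciiEscaper and the sample entry are made up for illustration, and it only escapes characters outside ASCII in the form a property file expects.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Properties;

// Hypothetical helper for illustration; not part of JSword or the JDK.
final class AsciiEscaper {
    // Replace every UTF-16 code unit above 0x7F with a backslash-u escape
    // (four hex digits), leaving plain ASCII untouched.
    static String escape(String text) {
        StringBuilder out = new StringBuilder(text.length());
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (c < 0x80) {
                out.append(c);
            } else {
                out.append(String.format("\\u%04x", (int) c));
            }
        }
        return out.toString();
    }

    // Escape one "key=value" line, parse it as a property file would be
    // parsed, and return the value that comes back out.
    static String roundTrip(String key, String value) {
        Properties props = new Properties();
        try {
            props.load(new StringReader(escape(key + "=" + value)));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return props.getProperty(key);
    }

    public static void main(String[] args) {
        // Sample entry chosen for illustration; the a-ring is written as an
        // escape so that this source file itself stays ASCII.
        String name = "Norwegian Bokm\u00e5l";
        System.out.println(escape("nob=" + name));
        System.out.println(name.equals(roundTrip("nob", name)));
    }
}
```

Running main prints the escaped line and then true, since java.util.Properties decodes the escapes again when the file is loaded.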
On Nov 9, 2009, at 2:01 PM, DM Smith wrote:
Here is a list of the proposed changes for the last update of 2009
(review ends December 15, so I think we can expect a new listing
shortly after that):
http://www.sil.org/iso639-3/chg_requests.asp
The last column gives the reason for the request.
Perhaps of interest are some Iranian languages.
In His Service,
DM
On Nov 9, 2009, at 1:32 PM, DM Smith wrote:
On 11/09/2009 11:51 AM, Karl Kleinpaste wrote:
DM Smith<[email protected]> writes:
ISO-639-3 is a changing set of codes.
...
These all changed on 2009-01-16.
What is the point of "standardized" abbreviations if the "standard" is
not fixed? "ckw" is replaced with "cak", "tzz" with "tzo"? For whose
benefit is that, other than as a make-work issue for people like us?
I don't know all the history, and what I know may be a bit faulty.
There are about 7500 languages. The beginnings of ISO-639 were
in the Ethnologue, started in 1950. ISO-639-1 was adopted in 1988.
ISO-639-2 was adopted in 1998 and covered about 400 languages.
ISO-639-3 was given to SIL in 2002 and the first adoption of it was
published in 2007. So only a few years ago, the list was quite
small. At that time, some of our modules had Ethnologue codes of the
form x-aaa or x-yyy-aaa.
At this point ISO-639-3 encompasses all two- and three-letter codes. It is
actively maintained and updates happen at least once a year.
Much of the effort to define languages revolves around literacy and
Bible translation. It is widely held that the return of Christ is
predicated on the gospel being preached to every tongue and there
is an effort to get the Bible into every spoken language. Many
languages have no alphabet. My daughter and her husband spent the
summer finalizing the alphabets for 3 closely related languages. At
this point they, and the team that they were on, believe that these
are 3 distinct languages and not merely dialects of each other. As
such, they would have three different codes and language names. If
these were later found to be merely dialectally different, the 3
alphabets might be merged into one and the 3 different codes and
their names would be replaced with a single code and name.
If you look at the reasons for retirement, many of them are 'M', that
is, merging several codes into one code.
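For code that still meets retired identifiers in older modules, a merge can be handled with a small lookup. The following is only a sketch: the class name RetiredCodes is hypothetical, and the table holds just the two replacements quoted earlier in this thread, not the full retirements list.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical lookup table; only the two pairs quoted in this thread
// are filled in. A real table would be built from the retirements file.
final class RetiredCodes {
    private static final Map<String, String> MERGED_INTO = new HashMap<>();
    static {
        MERGED_INTO.put("ckw", "cak");
        MERGED_INTO.put("tzz", "tzo");
    }

    // Return the replacement for a retired code, or the code itself
    // when it is not in the table.
    static String current(String code) {
        return MERGED_INTO.getOrDefault(code, code);
    }

    public static void main(String[] args) {
        System.out.println("ckw -> " + current("ckw"));
        System.out.println("tzo -> " + current("tzo"));
    }
}
```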
On a similar note, the two-letter codes are not stable either.
Hebrew used to have the code 'iw'; now it has the code 'he'.
Likewise, Indonesian used to have the code 'in', but now it
is 'id'. Now with the latest CLDR, 'in' is an alias for 'id'.
These two have bitten me, as Java silently transforms the current
code to the obsolete one. 'iw', Hebrew, bit me a few years back.
Indonesian, 'in', bit me last week when Tonny supplied an Indonesian
translation for JSword. We had to name the resource files with the
obsolete name to get it to work.
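To see which code a given JVM actually uses, a small check like the following may help. It is a sketch under assumptions: the class name LegacyLanguageCodes and the base name "Messages" are invented for illustration, and the output depends on the Java version. JVMs of the vintage discussed here report 'iw' and 'in'; as far as I know, Java 17 and later keep 'he' and 'id' unless the java.locale.useOldISOCodes system property is set.

```java
import java.util.Locale;

// Hypothetical helper for illustration; "Messages" below is a made-up
// resource bundle base name, not one used by JSword.
final class LegacyLanguageCodes {
    // The language code this JVM stores for the requested one.
    static String jvmCode(String requested) {
        return new Locale(requested).getLanguage();
    }

    // The property file name one would expect a bundle lookup to use
    // for that language on this JVM.
    static String bundleFileName(String baseName, String requested) {
        return baseName + "_" + jvmCode(requested) + ".properties";
    }

    public static void main(String[] args) {
        for (String code : new String[] {"he", "id"}) {
            System.out.println(code + " -> " + jvmCode(code)
                    + " (" + bundleFileName("Messages", code) + ")");
        }
    }
}
```

If the second column differs from the first, the resource files need the name in the second column, which is the situation described above.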
In Him,
DM
_______________________________________________
sword-devel mailing list: [email protected]
http://www.crosswire.org/mailman/listinfo/sword-devel
Instructions to unsubscribe/change your settings at above page