Re: Dataset for all ISO639 code sorted by country/territory?

2016-11-10 Thread Andrew West
On 10 November 2016 at 17:56, Doug Ewell wrote:
>
> Keep in mind that the CLDR table documents 675 of the world's best-known
> languages, counting variants such as three different orthographies of
> Uzbek.

Oddly, it seems that there are over 1.2 billion speakers of Cantonese
in China, but no speakers of Mandarin (the biggest language by number
of speakers in the world).

Andrew


RE: Dataset for all ISO639 code sorted by country/territory?

2016-11-10 Thread Doug Ewell
Mats Blakstad wrote:

> For myself, I was not actually considering the number of speakers in
> each country, but mapping languages to the countries/territories where
> the language originated or has been spoken traditionally.

And that is where I think you'll have disagreement on the details.

> So I guess what matters is which language people mostly expect to find
> under the country/territory.

Yep, that's the challenge.

> Would it be possible to extend this dataset to all languages and start
> building an open source data set for language-territory mapping?
> http://www.unicode.org/cldr/charts/latest/supplemental/language_territory_information.html

That's a good question for the CLDR folks, who have their own mailing
list.

Keep in mind that the CLDR table documents 675 of the world's best-known
languages, counting variants such as three different orthographies of
Uzbek. While anything is possible, extending this to "all languages,"
e.g. the other 6,300 lesser-known living languages, might require a bit
of time and money.

There is also a resource in the "UDHR in Unicode" project that might be
worth investigating, though it too is an imperfect match with what you
seem to be looking for.

--
Doug Ewell | Thornton, CO, US | ewellic.org




Re: Dataset for all ISO639 code sorted by country/territory?

2016-11-10 Thread Mats Blakstad
On 20 September 2016 at 18:34, Doug Ewell wrote:

> > Is there any dataset that contains all languages in the world sorted
> > by country/territory?
>
> As others have pointed out, be careful about how slippery this slope can
> get. Everyone has his or her own opinion about how many speakers of
> Language X in country Y need to be identified, estimated, or conjectured
> in order to say that "language X is spoken in country Y."
>

For myself, I was not actually considering the number of speakers in each
country, but mapping languages to the countries/territories where the
language originated or has been spoken traditionally.
For instance, in Norway we have many immigrants from Pakistan, but I
doubt any of them would expect to see Urdu sorted under Norway, even though
many people in Norway speak Urdu.
They would expect to see it under Pakistan, which is their heritage
country. I guess this is largely an identity issue as well.

I do understand that it is not easy to get a perfect language-country
mapping, and I guess the mapping also depends on the use.
For myself, I want people to be able to sort languages by
country/territory to make it easier to build lists of translations. I
think it can be useful to sort by territory instead of providing one
very long list of languages.
So I guess what matters is which language people mostly expect to find
under the country/territory.


>
> > I managed to find a dataset on the website of Ethnologue, though it
> > doesn't look like open source, need to check with them exactly how I'm
> > allowed to use it:
> > http://www.ethnologue.com/codes/download-code-tables
>
> The readme file included in the downloadable zip file makes SIL's terms
> very clear. Basically you need to credit SIL as the source of the data,
> not change it, and not make the data directly available for others to
> download. It's best not to get caught up in "open source" as if any
> other terms would make the data totally unusable.
>
>
I agree that a dataset is not unusable just because it is not open source,
but for my purposes I do in fact need a downloadable file!

I tried contacting SIL, but they will only sell the dataset for a fee and will
not grant an open source license.

Would it be possible to extend this dataset to all languages and start
building an open source data set for language-territory mapping?
http://www.unicode.org/cldr/charts/latest/supplemental/language_territory_information.html
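
The kind of language-territory mapping discussed in this thread can be pictured with a minimal Python sketch. It assumes the data is available as XML shaped like the territoryInfo section of CLDR's supplemental data; the element and attribute names used here (territory, languagePopulation, populationPercent) are an assumption about that layout, the sample figures are invented for illustration, and languages_by_territory is a hypothetical helper, not part of any CLDR tooling.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Hypothetical sample shaped like a CLDR territoryInfo section.
# The percentages are made up for illustration; they are NOT real CLDR data.
SAMPLE = """
<territoryInfo>
  <territory type="NO">
    <languagePopulation type="nb" populationPercent="90" officialStatus="official"/>
    <languagePopulation type="ur" populationPercent="1"/>
  </territory>
  <territory type="PK">
    <languagePopulation type="ur" populationPercent="80" officialStatus="official"/>
  </territory>
</territoryInfo>
"""


def languages_by_territory(xml_text, min_percent=0.0):
    """Group language codes under territory codes.

    min_percent is an illustrative cut-off: a language is listed under a
    territory only if its share of the population reaches this value.
    """
    mapping = defaultdict(list)
    root = ET.fromstring(xml_text)
    for territory in root.iter("territory"):
        for lang in territory.iter("languagePopulation"):
            share = float(lang.get("populationPercent", "0"))
            if share >= min_percent:
                mapping[territory.get("type")].append(lang.get("type"))
    return dict(mapping)


if __name__ == "__main__":
    # With no cut-off, Urdu appears under both Norway and Pakistan.
    print(languages_by_territory(SAMPLE))
    # With a 5 percent cut-off, Urdu appears only under Pakistan.
    print(languages_by_territory(SAMPLE, min_percent=5))
```

The cut-off parameter makes the disagreement described earlier in the thread concrete: whether Urdu is listed under Norway depends entirely on a threshold that someone has to choose, and a population share says nothing about where a language is traditionally spoken.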


Re: Dataset for all ISO639 code sorted by country/territory?

2016-09-20 Thread Doug Ewell
Mats Blakstad wrote:

> Is there any dataset that contains all languages in the world sorted
> by country/territory?

As others have pointed out, be careful about how slippery this slope can
get. Everyone has his or her own opinion about how many speakers of
Language X in country Y need to be identified, estimated, or conjectured
in order to say that "language X is spoken in country Y."

> I managed to find a dataset on the website of Ethnologue, though it
> doesn't look like open source, need to check with them exactly how I'm
> allowed to use it:
> http://www.ethnologue.com/codes/download-code-tables

The readme file included in the downloadable zip file makes SIL's terms
very clear. Basically you need to credit SIL as the source of the data,
not change it, and not make the data directly available for others to
download. It's best not to get caught up in "open source" as if any
other terms would make the data totally unusable.

--
Doug Ewell | Thornton, CO, US | ewellic.org




Re: Dataset for all ISO639 code sorted by country/territory?

2016-09-17 Thread Mats Blakstad
I managed to find a dataset on the website of Ethnologue, though it doesn't
look like open source; I need to check with them exactly how I'm allowed to
use it:
http://www.ethnologue.com/codes/download-code-tables

Thanks for the explanation, Philippe. I know it is not an easy issue. I am
looking for different resources on the web, so any specific links or feedback
would be helpful.

On 17 September 2016 at 13:35, Philippe Verdy wrote:

> Not all languages are sorted, only those for which data has been released
> in CLDR.
> Languages also frequently belong to several countries/territories at the
> same time, with different official or recognized status (which is itself
> independent of the number of actual speakers, a figure that is very often
> only roughly estimated).
> Some countries publish official statistics about their national or
> regional languages, but these statistics are frequently old, or
> underestimated or overestimated for political reasons. Some languages are
> counted together as if they were one, or are simply discarded when they are
> considered locally to be secondary languages, even where the official
> language is only superficially understood but is recorded as the primary one.
> Statistics also overlook native speakers living abroad in a diaspora, and
> people who learn the language as a second language in foreign countries.
>
>
> 2016-09-17 11:19 GMT+02:00 Mats Blakstad:
>
>> Hi
>>
>> Is there any dataset that contains all languages in the world sorted by
>> country/territory?
>>
>> I found this at Unicode, but it seems to contain only the most spoken
>> languages in each country and not the smaller ones:
>> http://www.unicode.org/cldr/charts/latest/supplemental/territory_language_information.html
>>
>> Thanks in advance for help.
>>
>> Best regards
>> Mats Blakstad
>>
>
>


Re: Dataset for all ISO639 code sorted by country/territory?

2016-09-17 Thread Philippe Verdy
Not all languages are sorted, only those for which data has been released
in CLDR.
Languages also frequently belong to several countries/territories at the
same time, with different official or recognized status (which is itself
independent of the number of actual speakers, a figure that is very often
only roughly estimated).
Some countries publish official statistics about their national or
regional languages, but these statistics are frequently old, or
underestimated or overestimated for political reasons. Some languages are
counted together as if they were one, or are simply discarded when they are
considered locally to be secondary languages, even where the official
language is only superficially understood but is recorded as the primary one.
Statistics also overlook native speakers living abroad in a diaspora, and
people who learn the language as a second language in foreign countries.


2016-09-17 11:19 GMT+02:00 Mats Blakstad:

> Hi
>
> Is there any dataset that contains all languages in the world sorted by
> country/territory?
>
> I found this at Unicode, but it seems to contain only the most spoken
> languages in each country and not the smaller ones:
> http://www.unicode.org/cldr/charts/latest/supplemental/territory_language_information.html
>
> Thanks in advance for help.
>
> Best regards
> Mats Blakstad
>


Re: Dataset for all ISO639 code sorted by country/territory?

2016-09-17 Thread Otto Stolz

Hello,

On 2016-09-17 at 11:19, Mats Blakstad wrote:

> Is there any dataset that contains all languages in the world sorted by
> country/territory?


Have you tried , already?

Also, 
and 
may provide partial answers.

Best wishes,
  Otto Stolz