[silk] The weirdest languages

Udhay Shankar N Fri, 05 Jul 2013 05:20:55 -0700

Much more, including the full spreadsheet with all 21 'weirdness
features' for all the languages, at the URL below.


Also, it amuses me that this list says the most 'normal' language is
Hindi. :-)

Thoughts?

Udhay

http://idibon.com/the-weirdest-languages/

We’re in the business of natural language processing with lots of
different languages. In the last six months, we’ve worked on (big
breath): English, Portuguese (Brazilian and from Portugal), Spanish,
Italian, French, Russian, German, Turkish, Arabic, Japanese, Greek,
Mandarin Chinese, Persian, Polish, Dutch, Swedish, Serbian, Romanian,
Korean, Hungarian, Bulgarian, Hindi, Croatian, Czech, Ukrainian,
Finnish, Hebrew, Urdu, Catalan, Slovak, Indonesian, Malay, Vietnamese,
Bengali, Thai, and a bit on Latvian, Estonian, Lithuanian, Kurdish,
Yoruba, Amharic, Zulu, Hausa, Kazakh, Sindhi, Punjabi, Tagalog, Cebuano,
Danish, and Navajo.

Natural language processing (NLP) is about finding patterns in
language—for example, taking heaps of unstructured text and
automatically pulling out its structure. The open secret about NLP is
that it’s very English-centric. English is far and away the language
that linguists have worked on the most and it’s also the language that
has the most available resources for computer science projects (and more
data is almost always better in computer science). So one of the best
ways to test an NLP system is to try languages other than English. The
better a system can deal with diverse data, the more confident you can
be in its ability to handle unseen data.

To this end, we might choose to define “weirdness” in terms of English.
But that’s a pretty irritating definition. Let’s try to do something
different.

A global method for linguistic outliers

The World Atlas of Language Structures evaluates 2,676 different
languages in terms of a bunch of different language features. These
features include word order, types of sounds, ways of doing negation,
and a lot of other things—192 different language features in total.

So rather than take an English-centric view of the world, WALS allows us
to take a worldwide view. That is, we evaluate each language in terms of
how unusual it is for each feature. For example, English word order is
subject-verb-object—there are 1,377 languages that are coded for word
order in WALS and 35.5% of them have SVO word order. Meanwhile only 8.7%
of languages start with a verb—like Welsh, Hawaiian and Majang—so
cross-linguistically, starting with a verb is unusual. For what it’s
worth, 41.0% of the world’s languages are actually SOV order. (Aside:
I’ve done some work with Hawaiian and Majang and that’s how I learned
that verbs are a big commitment for me. I’m just not ready for verbs
when I open my mouth.)

The data in WALS is fairly sparse, so we restrict ourselves to the 165
features that have at least 100 languages in them (at this stage we also
knock out languages that have fewer than 10 of these—dropping us down to
1,693 languages).
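
The two sparsity cut-offs described above can be sketched roughly as follows. This is only an illustration, not the code behind the post: the data layout (a dict mapping each language to its coded features) and the name `filter_sparse` are assumptions; only the thresholds of 100 and 10 come from the text.

```python
from collections import Counter

MIN_LANGS_PER_FEATURE = 100   # threshold quoted in the text
MIN_FEATURES_PER_LANG = 10    # threshold quoted in the text


def filter_sparse(data, min_langs=MIN_LANGS_PER_FEATURE,
                  min_feats=MIN_FEATURES_PER_LANG):
    """Hypothetical helper. `data` maps language -> {feature: value};
    a feature is present for a language only if it is coded for it."""
    # Step 1: count how many languages are coded for each feature.
    coverage = Counter(f for feats in data.values() for f in feats)
    kept_features = {f for f, n in coverage.items() if n >= min_langs}

    # Step 2: restrict every language to the kept features, then drop
    # languages left with too few of them.
    filtered = {}
    for lang, feats in data.items():
        kept = {f: v for f, v in feats.items() if f in kept_features}
        if len(kept) >= min_feats:
            filtered[lang] = kept
    return kept_features, filtered
```

The feature filter runs first because the language cut-off is stated in terms of "these" features, that is, the ones that survived the first step.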

Now, one problem is that if you just stop there you have a huge amount
of collinearity. Part of this is just the nature of the features listed
in WALS—there’s one for overall subject/object/verb order and then
separate ones for object/verb and subject/verb. Ideally, we’d like to
judge weirdness based on unrelated features. We can focus in on features
that aren’t strongly correlated with each other (between two correlated
features, we pick the one that has more languages coded for it). We end
up with 21 features in total.
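
One way to picture the pruning step is a greedy pass over the features. The post does not say which correlation measure or cut-off was used, so everything specific here is an assumption made for illustration: Cramér's V as the association measure, the 0.5 threshold, and the names `cramers_v` and `prune_correlated`.

```python
import math
from collections import Counter


def cramers_v(data, f1, f2):
    """Cramér's V between two categorical features, computed over the
    languages that are coded for both (assumed measure, not from the post)."""
    pairs = [(feats[f1], feats[f2]) for feats in data.values()
             if f1 in feats and f2 in feats]
    n = len(pairs)
    rows = Counter(a for a, _ in pairs)
    cols = Counter(b for _, b in pairs)
    k = min(len(rows), len(cols))
    if n == 0 or k < 2:
        return 0.0  # association undefined here; treat as unrelated
    observed = Counter(pairs)
    chi2 = 0.0
    for a, row_total in rows.items():
        for b, col_total in cols.items():
            expected = row_total * col_total / n
            chi2 += (observed[(a, b)] - expected) ** 2 / expected
    return math.sqrt(chi2 / (n * (k - 1)))


def prune_correlated(data, threshold=0.5):
    """Keep a set of features in which no pair reaches the threshold."""
    coverage = Counter(f for feats in data.values() for f in feats)
    kept = []
    # Visit the best-covered features first, so that of two correlated
    # features the one with more coded languages is the one that stays.
    for f in sorted(coverage, key=lambda f: (-coverage[f], f)):
        if all(cramers_v(data, f, g) < threshold for g in kept):
            kept.append(f)
    return kept
```

With a feature for overall subject/object/verb order and a separate one for object/verb order, a pass like this would keep whichever of the two has more languages coded and drop the other.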

For each value that a language has, we calculate the relative frequency
of that value for all the other languages that are coded for it. So if
we had included the subject/object/verb order feature, English would’ve
gotten a value of 0.355 (we actually normalized these values according to
the overall entropy for each feature, so it wasn’t exactly 0.355, but you get
the idea). The Weirdness Index is then an average across the 21 unique
structural features. But because different features have different
numbers of values and we want to reduce skewing, we actually take the
harmonic mean (and because we want bigger numbers = more weird, we
actually subtract the mean from one). In this blog post, I’ll only
report languages that have a value filled in for at least two-thirds of
features (239 languages).
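
A rough sketch of the index itself, under stated simplifications: it skips the entropy normalization (the post does not give its exact form), and it computes each value's share over all coded languages, including the language being scored, rather than over "all the other languages". The names `value_shares` and `weirdness_index` are made up for this example.

```python
from collections import Counter
from statistics import harmonic_mean


def value_shares(data):
    """For each feature, the share of coded languages having each value."""
    counts = {}
    for feats in data.values():
        for f, v in feats.items():
            counts.setdefault(f, Counter())[v] += 1
    return {f: {v: c / sum(cnt.values()) for v, c in cnt.items()}
            for f, cnt in counts.items()}


def weirdness_index(data, n_features, min_coverage=2 / 3):
    """1 minus the harmonic mean of a language's value shares, reported
    only for languages with enough features filled in."""
    shares = value_shares(data)
    index = {}
    for lang, feats in data.items():
        if len(feats) < min_coverage * n_features:
            continue  # fewer than two-thirds of the features are coded
        freqs = [shares[f][v] for f, v in feats.items()]
        index[lang] = 1 - harmonic_mean(freqs)  # bigger = more weird
    return index
```

The harmonic mean is pulled toward the smallest shares, so a single very rare value raises a language's score more than it would under a plain average.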

The outlier (weirdest) languages

The language that is most different from the majority of all other
languages in the world is a verb-initial tonal language spoken by 6,000
people in Oaxaca, Mexico, known as Chalcatongo Mixtec (aka San Miguel el
Grande Mixtec). Number two is spoken in Siberia by 22,000 people: Nenets
(that’s where we get the word parka from). Number three is Choctaw,
spoken by about 10,000 people, mostly in Oklahoma.

But here’s the rub—some of the weirdest languages in the world are ones
you’ve heard of: German, Dutch, Norwegian, Czech, Spanish, and Mandarin.
And actually, English is #33 in the Language Weirdness Index.

The weirdest languages in the world

The 25 weirdest languages of the world. In North America: Chalcatongo
Mixtec, Choctaw, Mesa Grande Diegueño, Kutenai, and Zoque; in South
America: Paumarí and Trumai; in Australia/Oceania: Pitjantjatjara and
Lavukaleve; in Africa: Harar Oromo, Iraqw, Kongo, Mumuye, Ju|’hoan, and
Khoekhoe; in Asia: Nenets, Eastern Armenian, Abkhaz, Ladakhi, and
Mandarin; and in Europe: German, Dutch, Norwegian, Czech, and Spanish.

By the way, how awesome of a name is “Pitjantjatjara”? (Also: can you
guess which one of the internal syllables is silent?)

<snip>

-- 
((Udhay Shankar N)) ((udhay @ pobox.com)) ((www.digeratus.com))
