RE: [Mt-list] [OT] Terminology relative to NLP for (African) pi-languages and pi-pairs, towards more oral systems

2007-01-15 Thread Harold Somers
 
 - under-resourced language was introduced in
 2005 by a native English colleague I can't remember the name 
 of at the moment, to translate the title and description of a 
 workshop dedicated to NLP for pi-languages at TALN-05.

My feeling is that this term was in use before 2005. For example, Steve
Bird, writing in February 2004, gives here
(http://itre.cis.upenn.edu/~myl/languagelog/archives/000481.html) a nice
list of alternative terms including "under-resourced".

I think I coined the term "non-indigenous minority language" (NIML) in
1997, which enjoyed a brief period of currency, to cover the idea of a
language spoken by a local minority (e.g. Urdu in the UK), but to get
away from the predominant association at that time between the term
"minority language", in Europe at least, and languages like Welsh,
Breton and so on.

In the USA, the term "low-density language" seemed to be prevalent: I
first came across it at the 1998 AMTA conference, in the title of Jones
and Havrilla's paper. In their paper they gloss "low density" as
languages of low diffusion, world minority languages, i.e. languages
for which major online resources are typically not available. As I
always liked to point out at that time, one of the world's top 3
languages (in terms of numbers of speakers), namely Hindi-Urdu, was
certainly a low-density language in this sense, but not a world minority
language; but at least the question of resources was foregrounded by
this term. I used the term in a 2001 paper.

The term "lesser-studied languages" (sic) was current in 2000, when it
was used in the title of a NATO Advanced Study symposium organized by
Kemal Oflazer. This list has already discussed (in Feb 2005) the
appropriateness or otherwise of "lesser" as opposed to "less", so let's
not rerun that one (but as a pernickety native speaker I might point out
that a less-spoken language is spoken by fewer, not less, people!)

 Last bit of thought: we should be more precise about the term 
 language.
 For example, Chinese is not 1 language, but several (no 
 oral intercomprehension), with their
 dialects: Mandarin, Cantonese, Wu (Shanghainese is a dialect 
 of Wu), Fujien, etc. Or Arabic, for that matter. And that is 

Mandarin and Cantonese are mutually non-comprehensible languages, not
dialects. To describe them as dialects of Chinese would be like
describing French and German as dialects of "European".

But it is fairly difficult to be very precise about the term "language".
As was famously stated, possibly by Max Weinreich, "A language is a
dialect with an army and navy" (though see
http://en.wikipedia.org/wiki/Language_is_a_dialect_with_an_army_and_navy
). Although he was referring to Yiddish, Flemish~Dutch is the example
that always springs to my mind.

___
Mt-list mailing list


Re: [Mt-list] Re: MT-List digest, Vol 1 #36 - 2 msgs

2004-07-11 Thread Harold Somers
 Given the above trend, I think an effective response is to explicitly 
 say in an EBMT paper yes I am doing EBMT but creating the example 
 phrases and their translation by hand; some SMT is creating the 
[...]
 the bigger point, though, is: why should one not make comparisons to 
 SMT-style EBMT?  A serious weakness of EBMT has always been the 
 bottleneck of building the example patterns and their translations 
 manually.  

Unless I've badly misunderstood all the papers I have read, EBMT does not 
build anything by hand. Existing translated texts are used as sources of 
examples which are sought out and reused on the fly. In some reported 
experiments, the examples were handpicked, or pruned to get rid of awkward 
cases, but I don't think this idea is taken seriously as the way to do EBMT.
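
To make the "sought out and reused on the fly" step concrete, here is a 
minimal sketch of the retrieval half only. It is an illustration under 
stated assumptions, not any particular system's method: the toy example 
base, the function names and the use of Python's difflib as the similarity 
measure are all my own choices for the sake of the example.

```python
from difflib import SequenceMatcher

# Hypothetical toy example base: (source, translation) pairs that would
# normally be drawn from existing translated texts, not written by hand.
EXAMPLES = [
    ("the cat sleeps", "le chat dort"),
    ("the dog barks", "le chien aboie"),
    ("where is the station", "où est la gare"),
]


def similarity(a, b):
    """Word-level similarity in [0, 1] between two sentences (assumed measure)."""
    return SequenceMatcher(None, a.split(), b.split()).ratio()


def closest_example(sentence, examples=EXAMPLES):
    """Return the stored (source, translation) pair most similar to the input."""
    return max(examples, key=lambda pair: similarity(sentence, pair[0]))


if __name__ == "__main__":
    print(closest_example("the cat sleeps soundly"))
```

Recombining the retrieved fragments into a full translation is the harder 
part and is deliberately left out of this sketch.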

One recent flavour of EBMT has been to extract similar examples beforehand and 
generalise them, giving translation templates which could be likened to 
old-fashioned transfer rules. But again this is done automatically. This seems to 
me to be very close to the latest trend in phrase-based SMT, and if I had been 
someone who had worked on this idea and called it EBMT I would find it quite 
galling not to be cited, or to be asked to compare it with an approach that 
actually postdates what I had done. 
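
As a rough illustration of what generalising similar examples into a 
template can look like, here is a deliberately naive sketch. It assumes two 
example pairs of equal length that differ in exactly one word on each side; 
the function names and the variable marker "X" are hypothetical, and real 
systems rely on proper word alignment rather than this positional shortcut.

```python
def _merge(first, second, var):
    """Replace the single differing word of two equal-length sentences by var.

    Returns None when the sentences differ in length or in more (or fewer)
    than one position.
    """
    a, b = first.split(), second.split()
    if len(a) != len(b):
        return None
    diff = [i for i, (x, y) in enumerate(zip(a, b)) if x != y]
    if len(diff) != 1:
        return None
    i = diff[0]
    return " ".join(a[:i] + [var] + a[i + 1:])


def generalise(pair_a, pair_b, var="X"):
    """Build a (source template, target template) from two similar examples."""
    src = _merge(pair_a[0], pair_b[0], var)
    tgt = _merge(pair_a[1], pair_b[1], var)
    if src is None or tgt is None:
        return None
    return src, tgt


if __name__ == "__main__":
    print(generalise(("I like tea", "J'aime le thé"),
                     ("I like coffee", "J'aime le café")))
```

The resulting pair reads much like a small transfer rule with one open 
slot, which is the resemblance noted above.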

It seems to me EBMT is very misunderstood ... at the other end of the scale 
are writers who don't distinguish EBMT and translation memory, but that's 
another hobby horse.
