[Mt-list] PhD Position - Representation Learning for Sign Language Translation Using Linguistic and Knowledge-based Constraints

2021-01-07 Thread Vincent Vandeghinste
PHD POSITION - REPRESENTATION LEARNING FOR SIGN LANGUAGE TRANSLATION 
USING LINGUISTIC AND KNOWLEDGE-BASED CONSTRAINTS


Within the context of the SignON project funded by the European Horizon 
2020 programme, the Centre for Computational Linguistics (CCL), part of 
the ComForT research unit at KU Leuven, seeks to hire a PhD student to 
carry out research on the subject of representation learning for sign 
language translation.

Website unit [1]

PROJECT

The SignON project, which unites 17 European partners, aims to 
facilitate the exchange of information among deaf, hard of hearing, and 
hearing individuals across Europe by developing automatic sign language 
translation tools. Automatic sign language translation (the task of 
automatically translating a visual-gestural sign language utterance to 
an oral language utterance and vice versa) is an application that has 
the potential to reduce communicative barriers for millions of people. 
The World Health Organisation reports that there are about 466 million 
people in the world today with disabling hearing loss; and according to 
the World Federation of the Deaf over 70 million people communicate 
primarily via a sign language.


Sign languages are, just like verbal languages, highly structured 
systems governed by a set of linguistic rules. There are, however, also 
linguistic characteristics of signed languages that are modality 
specific. As a consequence, sign language translation cannot be 
considered as a one-to-one mapping from signs to spoken language words. 
Recent machine learning methods have greatly improved the state of the art 
in natural language processing applications, including the 
multi-modal problem of sign language translation. However, due to the 
inherent complexity of the task, most approaches do not favour an 
end-to-end approach (i.e., directly translating sign to text), but first 
transform the signs to an intermediate, gloss-based transcription (sign 
to gloss), and in a second step translate the intermediate 
representation to verbal language (gloss to text). Using glosses as an 
interface for sign language translation is fairly successful, but it also 
poses a number of problems: gloss annotations are an imprecise and often 
impoverished representation that does not do justice to the complex, 
multi-channel production of sign language.
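
As a rough sketch of the two-step pipeline described above (all names and 
mappings below are illustrative stand-ins, not actual project components; 
real systems use neural recognizers and translators in place of the lookup 
tables):

```python
# Toy two-stage sign language translation pipeline:
# stage 1 maps sign segments to glosses, stage 2 maps glosses to text.

SIGN_TO_GLOSS = {  # stand-in for a sign-recognition model
    "<video:index-finger-point>": "IX-2",
    "<video:book-sign>": "BOOK",
    "<video:give-sign>": "GIVE",
}

GLOSS_TO_TEXT = {  # stand-in for a gloss-to-text translation model
    ("IX-2", "BOOK", "GIVE"): "You give the book.",
}

def sign_to_gloss(segments):
    """Stage 1 (sign to gloss): label each sign segment with a gloss."""
    return tuple(SIGN_TO_GLOSS[s] for s in segments)

def gloss_to_text(glosses):
    """Stage 2 (gloss to text): translate the gloss sequence; fall back
    to the raw glosses when no translation is known."""
    return GLOSS_TO_TEXT.get(glosses, " ".join(glosses))

def translate(segments):
    return gloss_to_text(sign_to_gloss(segments))
```

The lossiness criticized above shows up immediately: whatever the glosses 
fail to record (non-manual markers, simultaneity) is unavailable to stage 2.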


The PhD candidate will focus on the intermediate representation that 
functions as an interface between sign language and verbal language in 
the context of sign language translation. Research will be carried out 
along two tracks:


 	* Firstly, the project will consider the development of a 
multi-faceted interlingual representation for sign language translation 
that can function as a sufficiently rich interface between sign language 
and verbal language, and is tailored towards machine learning methods. 
Crucially, the representation needs to be sufficiently rich to capture 
the intricacies of elaborate, multi-channel sign language, but at the 
same time lenient enough to be incorporated into a classification-based 
optimization objective that is inherent to machine learning approaches. 
This task will be carried out in close cooperation with 
linguistically trained sign language experts; the representation will be 
developed using Flemish Sign Language as a test-bed, but the resulting 
representational framework should be generally applicable. Additionally, 
the representational framework will be augmented with various 
knowledge-based resources (such as WordNet and FrameNet) as well as 
machine-learning-based optimizations (e.g., informed by word and 
sentence embeddings).


 	* Secondly, the project will examine how the resulting representations 
can be exploited as soft constraints to improve the output predictions 
of a neural machine translation architecture for sign language. 
Specifically, the linguistic knowledge that is encoded within the 
representation can be used to constrain the neural network's output 
probability distribution. Learning-based approaches suffer from a lack 
of resources: large-scale annotated sign language corpora are few and 
far between. As a consequence, the resulting output predictions are 
potentially syntactically unsound, semantically improbable, or otherwise 
linguistically incongruous. By augmenting the network output with 
representation-based constraints, modeled as prior distributions over the 
neural network's output distribution, such discrepancies can be 
mitigated. Additionally, the knowledge encoded in the representational 
framework can be used to rerank the various candidates yielded by the 
neural network architecture.
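
A minimal sketch of both ideas (hypothetical function names and toy scores, 
not the project's actual architecture): log-linearly interpolating a 
linguistic prior with the network's output distribution, and reranking 
n-best candidates by a linguistic plausibility score:

```python
import math

def constrain(log_probs, prior_log_probs, weight=1.0):
    """Combine network log-probabilities with a linguistic prior
    (log-linear interpolation), then renormalize to a distribution."""
    combined = {w: lp + weight * prior_log_probs.get(w, math.log(1e-6))
                for w, lp in log_probs.items()}
    z = math.log(sum(math.exp(v) for v in combined.values()))
    return {w: v - z for w, v in combined.items()}

def rerank(candidates, linguistic_score, alpha=1.0):
    """Rerank n-best candidates by model score plus a weighted
    linguistic plausibility score derived from the representation."""
    return sorted(candidates,
                  key=lambda c: c["score"] + alpha * linguistic_score(c["text"]),
                  reverse=True)
```

A candidate the network slightly prefers can thus be overruled when the 
representation-based prior judges it linguistically improbable.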


PROFILE

 	* You hold a Master's degree in linguistics or computer science, or an 
equivalent education.
 	* You have solid programming skills.
 	* You exhibit excellent proficiency in English and good communication 
skills.
 	* Working knowledge of Dutch is 

Re: [Mt-list] Context window size

2021-01-07 Thread Andras Kornai
Peter,

thanks to you and Yvon, who found this wonderful quotation 
(http://www.mt-archive.info/50/Weaver-1949.pdf). The question is: where does 
the “folk wisdom” of N=3 come from? 

Thanks again,
Andras

> On Jan 6, 2021, at 11:18 PM, Peter Kolb wrote:
> 
> The question of context window size is raised (but not answered) by Warren 
> Weaver in his Memorandum from 1949, under the heading "Meaning and Context":
> 
> If one examines the words in a book, one at a time as through an opaque mask 
> with a hole in it one word wide, then it is obviously impossible to 
> determine ... the meaning of the words. "Fast" may mean "rapid"; or it may 
> mean "motionless"; and there is no way of telling which.
> But, if one lengthens the slit in the opaque mask, until one can see not only 
> the central word in question but also say N words on either side, then if N 
> is large enough one can unambiguously decide the meaning of the central word. 
> ...
> The practical question is: "What minimum value of N will, at least in a 
> tolerable fraction of cases, lead to the correct choice of meaning for the 
> central word?"
> 
> (in Locke & Booth, Machine Translation of Languages, 1955, p. 20)
> 
> Regards,
> Peter Kolb
> 
> On Wed, Jan 6, 2021 at 08:25, Andras Kornai wrote:
> When I started to learn about these things, it was Received Wisdom that to 
> disambiguate a word, or to provide a translation equivalent, a context of 3 
> words on each side of the target are almost always sufficient. 
> (Counterexamples could always be constructed, but for the statistical 
> majority of the cases three on each side would be fine.) But where does this 
> piece of wisdom originate? Weaver? Salton? Sparck-Jones? Bar-Hillel? Any 
> pointers to the literature, including pointers to counterarguments, would be 
> greatly appreciated. 
> 
> Thank you,
> Andras Kornai
> ___
> Mt-list site list
> Mt-list@eamt.org
> http://lists.eamt.org/mailman/listinfo/mt-list



Re: [Mt-list] Context window size

2021-01-07 Thread Peter Kolb
The question of context window size is raised (but not answered) by Warren
Weaver in his Memorandum from 1949, under the heading "Meaning and Context":

*If one examines the words in a book, one at a time as through an opaque
mask with a hole in it one word wide, then it is obviously impossible to
determine ... the meaning of the words. "Fast" may mean "rapid"; or it may
mean "motionless"; and there is no way of telling which.*
*But, if one lengthens the slit in the opaque mask, until one can see not
only the central word in question but also say N words on either side, then
if N is large enough one can unambiguously decide the meaning of the
central word. ...*
*The practical question is: "What minimum value of N will, at least in a
tolerable fraction of cases, lead to the correct choice of meaning for the
central word?"*

(in Locke & Booth, Machine Translation of Languages, 1955, p. 20)
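
As a toy illustration of the opaque-mask idea (my sketch, not from the 
memorandum), extracting N words on either side of the target amounts to:

```python
def context_window(tokens, i, n):
    """Return up to n tokens on either side of position i --
    Weaver's 'slit in the opaque mask' widened to 2n+1 words."""
    return tokens[max(0, i - n):i], tokens[i + 1:i + 1 + n]

# Weaver's own ambiguous example word: "fast"
sent = "he tied the boat fast to the dock".split()
left, right = context_window(sent, sent.index("fast"), 3)
```

With N=3 the window already contains "tied" and "dock", enough here to 
favour the "motionless" reading; Weaver's practical question is how often 
that suffices.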

Regards,
Peter Kolb

On Wed, Jan 6, 2021 at 08:25, Andras Kornai wrote:

> When I started to learn about these things, it was Received Wisdom that to
> disambiguate a word, or to provide a translation equivalent, a context of 3
> words on each side of the target are almost always sufficient.
> (Counterexamples could always be constructed, but for the statistical
> majority of the cases three on each side would be fine.) But where does
> this piece of wisdom originate? Weaver? Salton? Sparck-Jones? Bar-Hillel?
> Any pointers to the literature, including pointers to counterarguments,
> would be greatly appreciated.
>
> Thank you,
> Andras Kornai
>


[Mt-list] CFP: Final Call -- The 6th Arabic Natural Language Processing Workshop (WANLP-6 2021)

2021-01-07 Thread Samia Touileb
**apologies for cross-posting**


 Final Call for Papers 

The 6th Arabic Natural Language Processing Workshop (WANLP-6 2021) will be 
co-located with EACL 2021.

We invite submissions on topics of natural language processing that include, 
but are not limited to, the following:

  - Basic core technologies: morphological analysis, disambiguation, 
tokenization, POS tagging, named entity detection, chunking, parsing, semantic 
role labeling, Arabic dialect modeling, etc.
  - Applications: machine translation, speech recognition, speech synthesis, 
optical character recognition, pedagogy, assistive technologies, social media 
analytics, sentiment analysis, summarization, dialogue systems, etc.
  - Resources: lexicons, dictionaries, annotated and unannotated corpora, etc.

Submissions may include work in progress as well as finished work that has not 
been previously published. Submissions must have a clear focus on specific 
issues pertaining to the Arabic language whether it is standard Arabic, 
dialectal, or mixed. Papers on other languages sharing problems faced by Arabic 
NLP researchers such as Semitic languages or languages using Arabic script are 
welcome. Additionally, papers on efforts using Arabic resources but targeting 
other languages are also welcome. Descriptions of commercial systems are 
welcome, but authors should be willing to discuss the details of their work.

*Shared Task*

Two shared tasks will be associated with the workshop this year:


  - Shared Task 1: NADI 2021 -- Arabic dialect identification. This shared 
task targets fine-grained dialect identification, with new datasets and an 
effort to distinguish both modern standard Arabic (MSA) and dialects (DA) 
according to their geographical origin.

  - Shared Task 2: Sarcasm and Sentiment Detection in Arabic. This shared 
task will focus on analysing tweets, identifying their sentiment and 
whether each tweet is sarcastic or not.


*Important Dates*

   - February 1, 2021: Workshop Paper Due Date
   - February 22, 2021: Notification of Acceptance
   - March 1, 2021: Camera-ready papers due
   - April 19-20, 2021: Workshop Dates


*Submission Details*

This year we invite two types of research papers (long and short), demo papers, 
and shared task description papers. Long research papers may consist of up to 8 
pages of content, plus unlimited references. Short research papers, demo 
papers, and shared task description papers may consist of up to 4 pages of 
content, plus unlimited references. Submissions will be done via softconf.

*Submission Link*: https://www.softconf.com/eacl2021/WANLP2021/


*WANLP 2021 Organizing Committee*

General Chair:
  - Nizar Habash, New York University Abu Dhabi, UAE.

Program Chairs:
  - Houda Bouamor, Carnegie Mellon University in Qatar.
  - Hazem Hajj, American University of Beirut, Lebanon.
  - Walid Magdy, University of Edinburgh, Scotland.
  - Wajdi Zaghouani, Hamad Bin Khalifa University, Qatar.

Publication Chair:
  - Fethi Bougares, University of Le Mans, France.
  - Nadi Tomeh, LIPN, Université Paris 13, Sorbonne Paris Cité.

Publicity Chair:
  - Ibrahim Abu Farha, University of Edinburgh, Scotland.
  - Samia Touileb, University of Oslo, Norway.

Ex-General Chairs / Advisors:
  - Wassim El-Hajj, American University of Beirut, Lebanon.
  - Imed Zitouni, Google, USA.


*Advisory Committee:*
Muhammad Abdul-Mageed, Ahmed Ali, Hend Alkhalifa, Houda Bouamor, Fethi 
Bougares, Khalid Choukri, Kareem Darwish, Mona Diab, Mahmoud El-Haj, Samhaa 
El-Beltagy, Wassim El-Hajj, Nizar Habash, Lamia Hadrich Belguith, Hazem Hajj, 
Walid Magdy, Khaled Shaalan, Kamel Smaili, Nadi Tomeh, Wajdi Zaghouani, Imed 
Zitouni.

*Shared Task 1:*
  - Muhammad Abdul-Mageed, Chiyu Zhang (The University of British Columbia, 
Canada), Nizar Habash (New York University Abu Dhabi), and Houda Bouamor 
(Carnegie Mellon University, Qatar).

*Shared Task 2:*
- Ibrahim Abu Farha (The University of Edinburgh, UK), Wajdi Zaghouani 
(Hamad Bin Khalifa University Doha, Qatar), and Walid Magdy (The University of 
Edinburgh, UK)


For questions or comments regarding WANLP-6, please contact Ibrahim Abu Farha 
(i.abufarha AT ed.ac.uk) and Samia Touileb (samiat AT ifi.uio.no).



Samia Touileb

Postdoc
Language Technology Group
Section for Machine Learning
Department of Informatics, University of Oslo

