FYI FOLKS

---------- Forwarded message ---------
From: Mcgibbney, Lewis J (398M) <lewis.j.mcgibb...@jpl.nasa.gov>
Date: Mon, Jul 22, 2019 at 9:01 AM
Subject: FW: [EXTERNAL] July 2019 Newsletter - LDC
To: lewis john mcgibbney <lewi...@apache.org>






Dr. Lewis John McGibbney Ph.D., B.Sc.(Hons)

Data Scientist III

Computer Science for Data Intensive Applications Group (398M)

Instrument Software and Science Data Systems Section (398)

Jet Propulsion Laboratory

California Institute of Technology

4800 Oak Grove Drive

Pasadena, California 91109-8099

Mail Stop : 158-256C

Tel:  (+1) (818)-393-7402

Cell: (+1) (626)-487-3476

Fax:  (+1) (818)-393-1190

Email: lewis.j.mcgibb...@jpl.nasa.gov

ORCID: orcid.org/0000-0003-2185-928X






 Dare Mighty Things



*From: *Ldc-customers1 <ldc-customers1-boun...@ldc.upenn.edu> on behalf of
Penn LDC <l...@ldc.upenn.edu>
*Date: *Monday, July 15, 2019 at 10:05 AM
*To: *Penn LDC <l...@ldc.upenn.edu>
*Subject: *[EXTERNAL] July 2019 Newsletter - LDC



*In this newsletter:*

*Fall 2019 LDC Data Scholarship Program*

*LDC data and commercial technology development*



*New Publications:*

The DKU-JNU-EMA Electromagnetic Articulography Database
<https://catalog.ldc.upenn.edu/LDC2019S14>

Phrase Detectives Corpus Version 2
<https://catalog.ldc.upenn.edu/LDC2019T10>

First DIHARD Challenge Evaluation - Nine Sources
<https://catalog.ldc.upenn.edu/LDC2019S12>

First DIHARD Challenge Evaluation – SEEDLingS
<https://catalog.ldc.upenn.edu/LDC2019S13>





*Fall 2019 LDC Data Scholarship Program*

Student applications for the Fall 2019 LDC Data Scholarship program are
being accepted now through September 15, 2019. This scholarship program
provides eligible students with access to LDC data at no cost. Students
must complete an application consisting of a data use proposal and letter
of support from their advisor.



For application requirements and program rules, please visit the LDC Data
Scholarship page
<https://www.ldc.upenn.edu/language-resources/data/data-scholarships>.


*LDC data and commercial technology development*

For-profit organizations are reminded that an LDC membership is a
prerequisite for obtaining a commercial license to almost all LDC
databases. Non-member organizations, including non-member for-profit
organizations, cannot use LDC data to develop or test products for
commercialization, nor can they use LDC data in any commercial product or
for any commercial purpose. LDC data users should consult corpus-specific
license agreements for limitations on the use of certain corpora. Visit the
Licensing <https://www.ldc.upenn.edu/data-management/using/licensing> page
for further information.




*New Publications:*



(1) The DKU-JNU-EMA Electromagnetic Articulography Database
<https://catalog.ldc.upenn.edu/LDC2019S14> was developed by Duke Kunshan
University <https://dukekunshan.edu.cn/en> and Jinan University
<https://english.jnu.edu.cn/> and contains approximately 10 hours of
articulography and speech data in Mandarin, Cantonese, Hakka, and Teochew
Chinese from two to seven native speakers for each dialect.



Articulatory measurements were made using the NDI electromagnetic
articulography wave research system to capture real-time vocal tract
variable trajectories. Six sensors were placed at various locations in each
subject's mouth, and one reference sensor was placed on the bridge of the
nose. For simultaneous recording of speech signals, subjects also wore a
head-mounted close-talk microphone.


Speakers took part in four types of recording sessions: one in which they
read complete sentences or short texts, and three in which they read sets
of related words sharing a specific consonant, vowel, or tone.



DKU-JNU-EMA Electromagnetic Articulography Database is distributed via web
download.


2019 Subscription Members will automatically receive copies of this corpus.
2019 Standard Members may request a copy as part of their 16 free
membership corpora. Non-members may license this data for $1000.


*


(2) Phrase Detectives Corpus Version 2
<https://catalog.ldc.upenn.edu/LDC2019T10> was developed by the School of
Computer Science and Electronic Engineering at the University of Essex
<https://www.essex.ac.uk/csee/> and consists of approximately 407,000
tokens across 537 documents anaphorically annotated by the Phrase
Detectives Game <https://anawiki.essex.ac.uk/phrasedetectives>, an online
interactive "game-with-a-purpose" (GWAP) designed to collect data about
English anaphoric coreference.



This release constitutes a new version of the Phrase Detectives Corpus (
LDC2017T08 <https://catalog.ldc.upenn.edu/LDC2017T08>), adding
significantly more annotated tokens to the data set and supplying players’
judgments and a silver label annotation based on the probabilistic
aggregation method for anaphoric information for each markable.



The documents in the corpus are taken from Wikipedia
<https://www.wikipedia.org/> articles and from narrative text in Project
Gutenberg <https://www.gutenberg.org/>. The annotation is a simplified form
of the coding scheme used in The ARRAU Corpus of Anaphoric Information (
LDC2013T22 <https://catalog.ldc.upenn.edu/LDC2013T22>).



Phrase Detectives Corpus Version 2 is distributed via web download.



2019 Subscription Members will automatically receive copies of this corpus.
2019 Standard Members may request a copy as part of their 16 free
membership corpora. Non-members may license this data at no cost.

*


(3) First DIHARD Challenge Evaluation - Nine Sources
<https://catalog.ldc.upenn.edu/LDC2019S12> was developed by LDC and
contains approximately 18 hours of English and Chinese speech data along
with corresponding annotations used in support of the First DIHARD Challenge
<https://coml.lscp.ens.fr/dihard/2018/index.html>.


The First DIHARD Challenge was an attempt to reinvigorate work on
diarization through a shared task focusing on "hard" diarization; that is,
speech diarization for challenging corpora where there was an expectation
that existing state-of-the-art systems would fare poorly. As such, it
included speech from a wide sampling of domains representing diversity in
number of speakers, speaker demographics, interaction style, recording
quality, and environmental conditions as follows (all sources are in
English unless otherwise indicated):



·         Autism Diagnostic Observation Schedule (ADOS) interviews

·         Conversations in Restaurants

·         DCIEM/HCRC map task (LDC96S38
<https://catalog.ldc.upenn.edu/LDC96S38>)

·         Audiobook recordings from LibriVox <https://librivox.org/>

·         Meeting speech collected by LDC in 2001 for the ROAR project
(see, e.g., ISL Meeting Speech Part 1 (LDC2004S05
<https://catalog.ldc.upenn.edu/LDC2004S05>))

·         2001 U.S. Supreme Court oral arguments

·         Mixer 6 Speech (LDC2013S03
<https://catalog.ldc.upenn.edu/LDC2013S03>)

·         Chinese video collected by LDC as part of the Video Annotation
for Speech Technologies (VAST) project

·         YouthPoint radio interviews



This release, when combined with First DIHARD Challenge Evaluation -
SEEDLingS (LDC2019S13 <https://catalog.ldc.upenn.edu/LDC2019S13>), contains
the evaluation set audio data and annotation as well as the official
scoring tool. The development data for the First DIHARD Challenge is also
available from LDC as Eight Sources (LDC2019S09
<https://catalog.ldc.upenn.edu/LDC2019S09>) and SEEDLingS (LDC2019S10
<https://catalog.ldc.upenn.edu/LDC2019S10>).



First DIHARD Challenge Evaluation - Nine Sources is distributed via web
download.



2019 Subscription Members will automatically receive copies of this corpus.
2019 Standard Members may request a copy as part of their 16 free
membership corpora. Non-members may license this data for $300.


*


(4) First DIHARD Challenge Evaluation – SEEDLingS
<https://catalog.ldc.upenn.edu/LDC2019S13> was developed by Duke University
and LDC and contains approximately two hours of English child language
recordings along with corresponding annotations used in support of the First
DIHARD Challenge <https://coml.lscp.ens.fr/dihard/2018/index.html>.



The source data was drawn from the SEEDLingS
<https://homebank.talkbank.org/access/Password/Bergelson.html> (The Study
of Environmental Effects on Developing Linguistic Skills) corpus, designed
to investigate how infants' early linguistic and environmental input plays
a role in their learning. Recordings for SEEDLingS were generated in the
home environment of 44 infants from 6-18 months of age in the Rochester,
New York area. A subset of that data was annotated by LDC for use in the
First DIHARD Challenge.



This release, when combined with First DIHARD Challenge Evaluation - Nine
Sources (LDC2019S12 <https://catalog.ldc.upenn.edu/LDC2019S12>), contains
the evaluation set audio data and annotation as well as the official
scoring tool. The development data for the First DIHARD Challenge is also
available from LDC as Eight Sources (LDC2019S09
<https://catalog.ldc.upenn.edu/LDC2019S09>) and SEEDLingS (LDC2019S10
<https://catalog.ldc.upenn.edu/LDC2019S10>).



First DIHARD Challenge Evaluation – SEEDLingS is distributed via web
download.



2019 Subscription Members will receive copies of this corpus provided they
have submitted a completed copy of the special license agreement. 2019
Standard Members may request a copy as part of their 16 free membership
corpora. Non-members may license this data for $50.



*



Membership Office

Linguistic Data Consortium <http://ldc.upenn.edu>

University of Pennsylvania

T: +1-215-573-1275

E: l...@ldc.upenn.edu

M: 3600 Market St. Suite 810

      Philadelphia, PA 19104









-- 
http://home.apache.org/~lewismc/
http://people.apache.org/keys/committer/lewismc
