Hello all,

Quick reminder that we will be starting our monthly Research Showcase on *Machine Translation on Wikipedia* in 30 minutes. Join us at https://www.youtube.com/live/O7AqvHgqUVk.

Best,
Kinneret

On Fri, Jul 19, 2024 at 3:12 PM Kinneret Gordon <[email protected]> wrote:
> Hi all,
>
> The next Research Showcase will be live-streamed next Wednesday, July 24,
> at 9:30 AM PDT / 16:30 UTC. Find your local time here
> <https://zonestamp.toolforge.org/1721838600>. The theme for this showcase
> is *Machine Translation on Wikipedia*.
>
> You are welcome to watch via the YouTube stream:
> https://www.youtube.com/live/O7AqvHgqUVk. As usual, you can join the
> conversation in the YouTube chat as soon as the showcase goes live.
>
> This month's presentations:
>
> The Promise and Pitfalls of AI Technology in Bridging Digital Language
> Divide
> By *Kai Zhu, Bocconi University*
>
> Machine translation technologies have the potential to bridge knowledge
> gaps across languages, promoting more inclusive access to information
> regardless of native languages. This study examines the impact of
> integrating Google Translate into Wikipedia's Content Translation system
> in January 2019. Employing a natural experiment design and
> difference-in-differences strategy, we analyze how this translation
> technology shock influenced the dynamics of content production and
> accessibility on Wikipedia across over a hundred languages. We find that
> this technology integration leads to a 149% increase in content
> production through translation, driven by existing editors becoming more
> productive as well as an expansion of the editor base. Moreover, we
> observe that machine translation enhances the propagation of biographical
> and geographical information, helping to close these knowledge gaps in
> the multilingual context. However, our findings also underscore the need
> for continued efforts to mitigate the preexisting systemic barriers.
> Our study contributes to our knowledge of the evolving role of artificial
> intelligence in shaping knowledge dissemination through enhanced language
> translation capabilities.
>
> Implications of Using Inorganic Content in Arabic Wikipedia Editions
> By *Saied Alshahrani and Jeanna Matthews, Clarkson University*
>
> Wikipedia articles (content pages) are among the widely used training
> corpora for NLP tasks and systems, yet these articles are not always
> created, generated, or even edited organically by native speakers; some
> are automatically created, generated, or translated using Wikipedia bots
> or off-the-shelf translation tools like Google Translate without human
> revision or supervision. First, we analyzed the three Arabic Wikipedia
> editions, Arabic (AR), Egyptian Arabic (ARZ), and Moroccan Arabic (ARY),
> and found that they suffer from a few serious issues, such as large-scale
> automatic creation and translation from English to Arabic, all without
> human involvement, generating content (articles) that lacks not only
> linguistic richness and diversity but also cultural richness and
> meaningful representation of the Arabic language and its native speakers.
> Second, we studied the performance implications of using such inorganic,
> unrepresentative articles to train NLP tasks or systems, intrinsically
> evaluating the performance of two main NLP upstream tasks, namely word
> representation and language modeling, using word analogy and fill-mask
> evaluations. We found that most of the models trained on the organic and
> representative content outperformed or, at worst, performed on par with
> the models trained on inorganic content generated using bots or
> translated using templates, demonstrating that training on
> unrepresentative content not only impacts the representation of native
> speakers but also impacts the performance of NLP tasks or systems.
> We recommend against using automatically created, generated, or
> translated articles on Wikipedia when the task is representation-based,
> such as measuring the opinions, sentiments, or perspectives of native
> speakers. We also suggest that when registered users employ automated
> creation or translation, their contributions be marked differently than
> “registered user” for better transparency, perhaps as “registered user
> (automation-assisted)”.
>
> Best,
> Kinneret
_______________________________________________
Wiki-research-l mailing list -- [email protected]
To unsubscribe send an email to [email protected]
