[Wikidata-bugs] [Maniphest] T360296: [Analytics] Implement data process to identify missing Wiktionary entries

Manuel Mon, 18 Mar 2024 04:14:41 -0700

Manuel created this task.
Manuel added projects: Wikidata, Epic, Wikidata Analytics (Kanban).


TASK DESCRIPTION
  As a Wiktionary user, I want to know the most common words ("entries") 
that are missing from a specific Wiktionary project.
  
  Scope
  -----
  
  - Identify the original CSV for the "I miss you ..." table in 
https://analytics.wikimedia.org/published/datasets/wmde-analytics-engineering/Wiktionary/
  - (Re)create a data process that generates the table daily (daily for now so 
that we can evaluate the resource investment and usage)
  - Some entries need to be filtered out ("Main_Page" and "main_Page")
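  The filtering step could look roughly like the following. This is a minimal 
sketch, not the actual implementation: the function name `filter_entries` and 
the constant `EXCLUDED_TITLES` are hypothetical, and only the two titles named 
above are taken from this task.

```python
# Titles named in this task as needing to be filtered out.
EXCLUDED_TITLES = frozenset({"Main_Page", "main_Page"})


def filter_entries(titles, excluded=EXCLUDED_TITLES):
    """Return the titles with excluded ones dropped, preserving input order.

    Hypothetical helper: matching is exact and case-sensitive, which is an
    assumption made for this sketch.
    """
    return [title for title in titles if title not in excluded]
```

  Keeping the excluded titles in one set would make it easy to extend the list 
later if more entries turn out to need filtering.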
  
  Context
  -------
  
  **Wiktionaries** describe words coming from their own languages as well as 
other languages.  Pages on Wiktionaries are called "entries". Example: en:tree 
<https://en.wiktionary.org/wiki/pain>.
  
  The **Cognate extension** provides automatic links between two pages of 
different language versions of Wiktionary that have the same title (including a 
few normalization rules). For example, fr:tree 
<https://fr.wiktionary.org/wiki/tree> and en:tree 
<https://en.wiktionary.org/wiki/tree> are linked to each other. These links 
then show up as automatic interwiki links.
  
  There was also a **Wiktionary Cognate dashboard** that helped the community 
analyze the data of the extension.
  
  This community tool included an **"I miss you..." table/dashboard**.
  
  - Users could select a particular Wiktionary from a drop-down menu. A 
table then showed the top 1,000 entries (page titles) found in other 
Wiktionaries that are absent from the selected project.
  - The idea was to give the editors of a language version some ideas on 
what new pages to create on their home wiki. For example, someone editing 
French Wiktionary would be interested in the words (whatever the language) 
that already have a page on many other Wiktionaries, but not on the French 
one. Those are probably the most interesting/useful pages to create. That's 
why users want a list of the entries that already exist in a lot of 
languages, but not theirs.
  - The data was originally updated every 6 hours.
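  The ranking described above can be sketched as follows. This is an 
illustration under stated assumptions, not the original dashboard's logic: the 
function name `missing_entries` and the input shape (a dict mapping a wiki 
code to its set of page titles) are hypothetical, and titles are compared by 
exact match, so Cognate's normalization rules are not applied here.

```python
from collections import Counter


def missing_entries(titles_by_wiki, target_wiki, top_n=1000):
    """Rank titles absent from target_wiki by how many other wikis have them.

    titles_by_wiki: dict of wiki code -> set of page titles (assumed shape).
    Returns a list of (title, number_of_other_wikis) pairs, most common first.
    """
    present = titles_by_wiki.get(target_wiki, set())
    counts = Counter()
    for wiki, titles in titles_by_wiki.items():
        if wiki == target_wiki:
            continue
        for title in titles:
            # Exact title match; Cognate's normalization is NOT modeled here.
            if title not in present:
                counts[title] += 1
    # Sort by count (descending), then by title so the output is deterministic.
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return ranked[:top_n]


if __name__ == "__main__":
    sample = {
        "fr": {"arbre"},
        "en": {"tree", "arbre", "pain"},
        "de": {"tree", "Baum"},
    }
    for title, wiki_count in missing_entries(sample, "fr"):
        print(title, wiki_count)
```

  With the sample data, "tree" ranks first for "fr" because it exists in two 
other wikis, while "arbre" is skipped because the French wiki already has it.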
  
  https://meta.wikimedia.org/wiki/Wiktionary_Cognate_Dashboard#I_Miss_You_tab
  
  This is just for context; this task is only about implementing the data 
process to create public CSVs.
  
  Notes
  -----
  
  - Some technical details of the original work were documented in this task: 
{T166487#4425588 <https://phabricator.wikimedia.org/T166487#4425588>}
  
  Acceptance criteria
  -------------------
  
  [ ] We know which CSV was the source for the "I miss you ..." table
  [ ] A data process is generating the respective CSV daily
    [ ] Some entries are filtered out ("Main_Page" and "main_Page")
    [ ] The CSV is published again in 
https://analytics.wikimedia.org/published/datasets/wmde-analytics-engineering/Wiktionary/

TASK DETAIL
  https://phabricator.wikimedia.org/T360296

EMAIL PREFERENCES
  https://phabricator.wikimedia.org/settings/panel/emailpreferences/

To: Manuel
Cc: Aklapper, Pamputt, AndrewTavis_WMDE, JeanFred, Lydia_Pintscher, MarcoSwart, 
Manuel, me, Danny_Benjafield_WMDE, Astuthiodit_1, BeautifulBold, Suran38, 
karapayneWMDE, Invadibot, maantietaja, Peteosx1x, NavinRizwi, ItamarWMDE, 
Akuckartz, Nandana, Lahi, Gq86, GoranSMilovanovic, QZanden, KimKelting, 
LawExplorer, _jensen, rosalieper, Scott_WUaS, Wikidata-bugs, aude, Dinoguy1000, 
Mbch331

