[Wikidata-bugs] [Maniphest] T360296: [Analytics] Implement data process to identify missing Wiktionary entries

2024-06-11 Thread AndrewTavis_WMDE
AndrewTavis_WMDE added a comment.


  Hi @MarcoSwart  Thanks for the communication here :) I guess I'm a bit 
confused by how the other one would be used. You're roughly talking about:
  
  | word_that_is_missing_from_a_wiktionary | 
number_of_wiktionaries_that_do_have_it |
  | MOST_MISSING_WORD  | 156
|
  | NEXT_MOST_MISSING_WORD | 155
|
  | ...| ...
|
  |
  
  With that we're missing the `Wiktionary` column, so then editors wouldn't 
have the ability to easily know if their Wiktionary needed that word or not? 
Maybe it can be gotten from another part of the data process. Let me explain :)
  
  What's planned for this data process at this point is two outputs:
  
  - Missing Entries (I miss you ...) as described above 
 - per Wiktionary what are 
the 1,000 most popular missing words
  - Most Popular - the most popular entries across all Wiktionaries
  
  Maybe Most Popular would serve your interests above? This would be a CSV with 
say the 10,000 or 100,000 or whatever you all would need most popular entries 
across all Wiktionaries. All of this updating on a daily basis. Would that work 
for you?
  
  Please let me know if I'm understanding correctly, by the way! Appreciate 
your feedback :)

TASK DETAIL
  https://phabricator.wikimedia.org/T360296

EMAIL PREFERENCES
  https://phabricator.wikimedia.org/settings/panel/emailpreferences/

To: AndrewTavis_WMDE
Cc: Michael, ECohen_WMDE, Aklapper, Pamputt, AndrewTavis_WMDE, JeanFred, 
Lydia_Pintscher, MarcoSwart, Manuel, Danny_Benjafield_WMDE, S8321414, 
Astuthiodit_1, karapayneWMDE, Invadibot, maantietaja, ItamarWMDE, Akuckartz, 
Dringsim, Nandana, Lahi, Gq86, GoranSMilovanovic, QZanden, KimKelting, 
LawExplorer, _jensen, rosalieper, Scott_WUaS, Wikidata-bugs, aude, Mbch331
___
Wikidata-bugs mailing list -- wikidata-bugs@lists.wikimedia.org
To unsubscribe send an email to wikidata-bugs-le...@lists.wikimedia.org


[Wikidata-bugs] [Maniphest] T360296: [Analytics] Implement data process to identify missing Wiktionary entries

2024-06-11 Thread MarcoSwart
MarcoSwart added a comment.


  To me, it seems difficult to combine 188 separate datasets myself: they will 
contain lots of duplicates, because a large part of the smaller wiktionaries 
will share the same missing entries. Would it be possible to combine proposal 1 
with a comprehensive CSV that contains all entries that are part of at least 
one of the 188 CSVs in the first column and the total of Wiktionaries that have 
it in the second column? Because of the duplicates I expect the number of rows 
in this file to be significantly lower than 188,000.

TASK DETAIL
  https://phabricator.wikimedia.org/T360296

EMAIL PREFERENCES
  https://phabricator.wikimedia.org/settings/panel/emailpreferences/

To: AndrewTavis_WMDE, MarcoSwart
Cc: Michael, ECohen_WMDE, Aklapper, Pamputt, AndrewTavis_WMDE, JeanFred, 
Lydia_Pintscher, MarcoSwart, Manuel, Danny_Benjafield_WMDE, S8321414, 
Astuthiodit_1, karapayneWMDE, Invadibot, maantietaja, ItamarWMDE, Akuckartz, 
Dringsim, Nandana, Lahi, Gq86, GoranSMilovanovic, QZanden, KimKelting, 
LawExplorer, _jensen, rosalieper, Scott_WUaS, Wikidata-bugs, aude, Mbch331
___
Wikidata-bugs mailing list -- wikidata-bugs@lists.wikimedia.org
To unsubscribe send an email to wikidata-bugs-le...@lists.wikimedia.org


[Wikidata-bugs] [Maniphest] T360296: [Analytics] Implement data process to identify missing Wiktionary entries

2024-06-11 Thread AndrewTavis_WMDE
AndrewTavis_WMDE added a comment.


  Talked further with WMF about this just now. One basic question for the end 
users: would it make it more convenient for you all if the exported datasets 
were per Wiktionary? There are two options here, with missing entries being 
used as an example:
  
  1. We export one file that has all missing entries for all Wiktionaries
- 188,000 rows x 3 columns
- 188,000 rows = the 1,000 most popular missing entries for each Wiktionary 
(there are 188 in the data)
- 3 columns
  - The Wiktionary
  - The word that's missing from it
  - The total of the other Wiktionaries that have it
  2. We export 188 CSVs, each of length 1,000 with the above columns
  
  Reason for option 1 or 2 and not both is that we don't want to keep the data 
in duplicate both in the published datasets directories and in the data lake. 
Option 1 is easier, but we can figure out Option 2 if that would be your all's 
preference.
  
  So the baseline question for each option is:
  
  1. If you're only working on one Wiktionary, would you be ok with getting it 
as a subset from the whole dataset?
  2. If you're working on more than one Wiktionary, would you be ok with 
getting the separate datasets and combining them?
  
  Let us know which would be better for your workflow! And thanks for your 
continued interest in this. Great talks today about the various options we have 


TASK DETAIL
  https://phabricator.wikimedia.org/T360296

EMAIL PREFERENCES
  https://phabricator.wikimedia.org/settings/panel/emailpreferences/

To: AndrewTavis_WMDE
Cc: Michael, ECohen_WMDE, Aklapper, Pamputt, AndrewTavis_WMDE, JeanFred, 
Lydia_Pintscher, MarcoSwart, Manuel, Danny_Benjafield_WMDE, S8321414, 
Astuthiodit_1, karapayneWMDE, Invadibot, maantietaja, ItamarWMDE, Akuckartz, 
Dringsim, Nandana, Lahi, Gq86, GoranSMilovanovic, QZanden, KimKelting, 
LawExplorer, _jensen, rosalieper, Scott_WUaS, Wikidata-bugs, aude, Mbch331
___
Wikidata-bugs mailing list -- wikidata-bugs@lists.wikimedia.org
To unsubscribe send an email to wikidata-bugs-le...@lists.wikimedia.org


[Wikidata-bugs] [Maniphest] T360296: [Analytics] Implement data process to identify missing Wiktionary entries

2024-06-06 Thread AndrewTavis_WMDE
AndrewTavis_WMDE added a comment.


  Hi @MarcoSwart, sorry for changing the status without explanation. Was in a 
meeting and we were moving things around, but obviously context should have 
been added. This is stalled for now as we're waiting for WMF to advise us on 
the best way forward on migrating data from MariaDB to HDFS. The data processes 
we need to use for this cannot be run directly on MariaDB in a sustainable way 
that's in line with long term supported data practices, so first we need to 
migrate the data to the private data cluster, and then our normal workflows 
take over. This migration is non-standard, and they're looking into how best to 
support/guide us.
  
  By the sounds of it they're allotting the budget of a Staff Engineer to help 
with this soon. The data pipeline and the needed queries are basically done, so 
what we're waiting on is the process to migrate the data as a final step. From 
there we'll get the process up and running such that the data at the very least 
will be exported to the published datasets folders 
 on a daily basis.
  
  As far as a dashboard is concerned, we're also in the midst of looking into a 
more sustainable solution for presenting information to the public. This is 
similarly tied to WMF's efforts on this front. For now we hope that an export 
to the published datasets will suffice such that the community can then take 
the data and model it as they wish. I'd be happy to help people with simple 
Python scripts to get the data loaded into data frames and more workable states 
once that's done! I'd put an estimate on the data process as end of month if 
things work out with WMF's resources, but if not then it's August as I'm away 
for most of July (no later than that though).
  
  Please let me know if you have further questions, and again sorry for the 
confusion!

TASK DETAIL
  https://phabricator.wikimedia.org/T360296

EMAIL PREFERENCES
  https://phabricator.wikimedia.org/settings/panel/emailpreferences/

To: AndrewTavis_WMDE
Cc: Michael, ECohen_WMDE, Aklapper, Pamputt, AndrewTavis_WMDE, JeanFred, 
Lydia_Pintscher, MarcoSwart, Manuel, Danny_Benjafield_WMDE, S8321414, 
Astuthiodit_1, karapayneWMDE, Invadibot, maantietaja, ItamarWMDE, Akuckartz, 
Dringsim, Nandana, Lahi, Gq86, GoranSMilovanovic, QZanden, KimKelting, 
LawExplorer, _jensen, rosalieper, Scott_WUaS, Wikidata-bugs, aude, Mbch331
___
Wikidata-bugs mailing list -- wikidata-bugs@lists.wikimedia.org
To unsubscribe send an email to wikidata-bugs-le...@lists.wikimedia.org


[Wikidata-bugs] [Maniphest] T360296: [Analytics] Implement data process to identify missing Wiktionary entries

2024-06-06 Thread MarcoSwart
MarcoSwart added a comment.


  I am a simple editor who volunteered to give feedback on the original 
Wiktionary Cognate Dashboard. We used it on Dutch Wiktionary as a means to help 
editors prioritizing new lemmas to add. As soon as I discovered it didn't work 
anymore, I have asked for a remedy 18 months ago. Is there someone who 
understands that it's really dispiriting to see the status changed from Open to 
Stalled, without a comprehensible explanation and no indication of the expected 
moment a solution will be implemented?

TASK DETAIL
  https://phabricator.wikimedia.org/T360296

EMAIL PREFERENCES
  https://phabricator.wikimedia.org/settings/panel/emailpreferences/

To: AndrewTavis_WMDE, MarcoSwart
Cc: Michael, ECohen_WMDE, Aklapper, Pamputt, AndrewTavis_WMDE, JeanFred, 
Lydia_Pintscher, MarcoSwart, Manuel, Danny_Benjafield_WMDE, S8321414, 
Astuthiodit_1, karapayneWMDE, Invadibot, maantietaja, ItamarWMDE, Akuckartz, 
Dringsim, Nandana, Lahi, Gq86, GoranSMilovanovic, QZanden, KimKelting, 
LawExplorer, _jensen, rosalieper, Scott_WUaS, Wikidata-bugs, aude, Mbch331
___
Wikidata-bugs mailing list -- wikidata-bugs@lists.wikimedia.org
To unsubscribe send an email to wikidata-bugs-le...@lists.wikimedia.org


[Wikidata-bugs] [Maniphest] T360296: [Analytics] Implement data process to identify missing Wiktionary entries

2024-06-06 Thread AndrewTavis_WMDE
AndrewTavis_WMDE changed the task status from "Open" to "Stalled".

TASK DETAIL
  https://phabricator.wikimedia.org/T360296

EMAIL PREFERENCES
  https://phabricator.wikimedia.org/settings/panel/emailpreferences/

To: AndrewTavis_WMDE
Cc: Michael, ECohen_WMDE, Aklapper, Pamputt, AndrewTavis_WMDE, JeanFred, 
Lydia_Pintscher, MarcoSwart, Manuel, Danny_Benjafield_WMDE, S8321414, 
Astuthiodit_1, karapayneWMDE, Invadibot, maantietaja, ItamarWMDE, Akuckartz, 
Dringsim, Nandana, Lahi, Gq86, GoranSMilovanovic, QZanden, KimKelting, 
LawExplorer, _jensen, rosalieper, Scott_WUaS, Wikidata-bugs, aude, Mbch331
___
Wikidata-bugs mailing list -- wikidata-bugs@lists.wikimedia.org
To unsubscribe send an email to wikidata-bugs-le...@lists.wikimedia.org


[Wikidata-bugs] [Maniphest] T360296: [Analytics] Implement data process to identify missing Wiktionary entries

2024-06-04 Thread AndrewTavis_WMDE
AndrewTavis_WMDE added a comment.


  There's now a draft for the DAGs 

 open on GitLab. There's still lots to do as WMF wants to sync on suggestions 
they'll give me on how to do the MariaDB to HDFS data transfer, but the DAGs 
are mapped out and the hive queries they're calling have been prepared :)

TASK DETAIL
  https://phabricator.wikimedia.org/T360296

EMAIL PREFERENCES
  https://phabricator.wikimedia.org/settings/panel/emailpreferences/

To: AndrewTavis_WMDE
Cc: Michael, ECohen_WMDE, Aklapper, Pamputt, AndrewTavis_WMDE, JeanFred, 
Lydia_Pintscher, MarcoSwart, Manuel, Danny_Benjafield_WMDE, S8321414, 
Astuthiodit_1, karapayneWMDE, Invadibot, maantietaja, ItamarWMDE, Akuckartz, 
Dringsim, Nandana, Lahi, Gq86, GoranSMilovanovic, QZanden, KimKelting, 
LawExplorer, _jensen, rosalieper, Scott_WUaS, Wikidata-bugs, aude, Mbch331
___
Wikidata-bugs mailing list -- wikidata-bugs@lists.wikimedia.org
To unsubscribe send an email to wikidata-bugs-le...@lists.wikimedia.org


[Wikidata-bugs] [Maniphest] T360296: [Analytics] Implement data process to identify missing Wiktionary entries

2024-06-04 Thread Manuel
Manuel removed a project: Epic.

TASK DETAIL
  https://phabricator.wikimedia.org/T360296

EMAIL PREFERENCES
  https://phabricator.wikimedia.org/settings/panel/emailpreferences/

To: AndrewTavis_WMDE, Manuel
Cc: Michael, ECohen_WMDE, Aklapper, Pamputt, AndrewTavis_WMDE, JeanFred, 
Lydia_Pintscher, MarcoSwart, Manuel, Danny_Benjafield_WMDE, S8321414, 
Astuthiodit_1, karapayneWMDE, Invadibot, maantietaja, ItamarWMDE, Akuckartz, 
Dringsim, Nandana, Lahi, Gq86, GoranSMilovanovic, QZanden, KimKelting, 
LawExplorer, _jensen, rosalieper, Scott_WUaS, Wikidata-bugs, aude, Mbch331, me, 
BeautifulBold, Suran38, Peteosx1x, NavinRizwi, Dinoguy1000
___
Wikidata-bugs mailing list -- wikidata-bugs@lists.wikimedia.org
To unsubscribe send an email to wikidata-bugs-le...@lists.wikimedia.org


[Wikidata-bugs] [Maniphest] T360296: [Analytics] Implement data process to identify missing Wiktionary entries

2024-06-03 Thread AndrewTavis_WMDE
AndrewTavis_WMDE updated the task description.

TASK DETAIL
  https://phabricator.wikimedia.org/T360296

EMAIL PREFERENCES
  https://phabricator.wikimedia.org/settings/panel/emailpreferences/

To: AndrewTavis_WMDE
Cc: Michael, ECohen_WMDE, Aklapper, Pamputt, AndrewTavis_WMDE, JeanFred, 
Lydia_Pintscher, MarcoSwart, Manuel, me, Danny_Benjafield_WMDE, S8321414, 
Astuthiodit_1, BeautifulBold, Suran38, karapayneWMDE, Invadibot, maantietaja, 
Peteosx1x, NavinRizwi, ItamarWMDE, Akuckartz, Dringsim, Nandana, Lahi, Gq86, 
GoranSMilovanovic, QZanden, KimKelting, LawExplorer, _jensen, rosalieper, 
Scott_WUaS, Wikidata-bugs, aude, Dinoguy1000, Mbch331
___
Wikidata-bugs mailing list -- wikidata-bugs@lists.wikimedia.org
To unsubscribe send an email to wikidata-bugs-le...@lists.wikimedia.org


[Wikidata-bugs] [Maniphest] T360296: [Analytics] Implement data process to identify missing Wiktionary entries

2024-06-03 Thread AndrewTavis_WMDE
AndrewTavis_WMDE updated the task description.

TASK DETAIL
  https://phabricator.wikimedia.org/T360296

EMAIL PREFERENCES
  https://phabricator.wikimedia.org/settings/panel/emailpreferences/

To: AndrewTavis_WMDE
Cc: Michael, ECohen_WMDE, Aklapper, Pamputt, AndrewTavis_WMDE, JeanFred, 
Lydia_Pintscher, MarcoSwart, Manuel, me, Danny_Benjafield_WMDE, S8321414, 
Astuthiodit_1, BeautifulBold, Suran38, karapayneWMDE, Invadibot, maantietaja, 
Peteosx1x, NavinRizwi, ItamarWMDE, Akuckartz, Dringsim, Nandana, Lahi, Gq86, 
GoranSMilovanovic, QZanden, KimKelting, LawExplorer, _jensen, rosalieper, 
Scott_WUaS, Wikidata-bugs, aude, Dinoguy1000, Mbch331
___
Wikidata-bugs mailing list -- wikidata-bugs@lists.wikimedia.org
To unsubscribe send an email to wikidata-bugs-le...@lists.wikimedia.org


[Wikidata-bugs] [Maniphest] T360296: [Analytics] Implement data process to identify missing Wiktionary entries

2024-06-03 Thread AndrewTavis_WMDE
AndrewTavis_WMDE updated the task description.

TASK DETAIL
  https://phabricator.wikimedia.org/T360296

EMAIL PREFERENCES
  https://phabricator.wikimedia.org/settings/panel/emailpreferences/

To: AndrewTavis_WMDE
Cc: Michael, ECohen_WMDE, Aklapper, Pamputt, AndrewTavis_WMDE, JeanFred, 
Lydia_Pintscher, MarcoSwart, Manuel, me, Danny_Benjafield_WMDE, S8321414, 
Astuthiodit_1, BeautifulBold, Suran38, karapayneWMDE, Invadibot, maantietaja, 
Peteosx1x, NavinRizwi, ItamarWMDE, Akuckartz, Dringsim, Nandana, Lahi, Gq86, 
GoranSMilovanovic, QZanden, KimKelting, LawExplorer, _jensen, rosalieper, 
Scott_WUaS, Wikidata-bugs, aude, Dinoguy1000, Mbch331
___
Wikidata-bugs mailing list -- wikidata-bugs@lists.wikimedia.org
To unsubscribe send an email to wikidata-bugs-le...@lists.wikimedia.org


[Wikidata-bugs] [Maniphest] T360296: [Analytics] Implement data process to identify missing Wiktionary entries

2024-06-03 Thread AndrewTavis_WMDE
AndrewTavis_WMDE added a comment.


  wmde/analytics/hql/airflow_jobs/wiktionary_cognate 

 on GitLab now has all the needed queries to for missing entries, most popular 
entries and comparing Wiktionaries. Was easier to write all three at once 
rather than lose some context later. Note that these are Hive queries as the 
goal is to
  
  I've discussed the further infrastructure needs at length with a data 
engineer at WMF, with the steps from here being:
  
  - I need to write a PySpark job that gets the `cognate_wiktionary` tables 
from the MariaDB instance and puts them on HDFS on a daily basis
- This will go in wmde/analytics/spark 

- Note  that this is relatively uncharted territory (it can be done with 
current long term supported tools, but will be a new type of job)
  - From there we need a DAG that will eventually include all three processes 
discussed above
- The reason we'll do a DAG for all three is that each will rely on the 
PySpark job to migrate the data from MariaDB to HDFS
- We can start with just doing missing entries as an output for this task, 
and then other tasks can add the other two to the DAG

TASK DETAIL
  https://phabricator.wikimedia.org/T360296

EMAIL PREFERENCES
  https://phabricator.wikimedia.org/settings/panel/emailpreferences/

To: AndrewTavis_WMDE
Cc: Michael, ECohen_WMDE, Aklapper, Pamputt, AndrewTavis_WMDE, JeanFred, 
Lydia_Pintscher, MarcoSwart, Manuel, me, Danny_Benjafield_WMDE, S8321414, 
Astuthiodit_1, BeautifulBold, Suran38, karapayneWMDE, Invadibot, maantietaja, 
Peteosx1x, NavinRizwi, ItamarWMDE, Akuckartz, Dringsim, Nandana, Lahi, Gq86, 
GoranSMilovanovic, QZanden, KimKelting, LawExplorer, _jensen, rosalieper, 
Scott_WUaS, Wikidata-bugs, aude, Dinoguy1000, Mbch331
___
Wikidata-bugs mailing list -- wikidata-bugs@lists.wikimedia.org
To unsubscribe send an email to wikidata-bugs-le...@lists.wikimedia.org


[Wikidata-bugs] [Maniphest] T360296: [Analytics] Implement data process to identify missing Wiktionary entries

2024-05-28 Thread AndrewTavis_WMDE
AndrewTavis_WMDE added a comment.


  I've been asking around about the data source and connecting the tables and 
have yet to get concrete answers. Based on general assumptions of the names of 
the tables/columns though, the path forward for getting missing entries for a 
Wiktionary will be to:
  
  - Start with `cognate_wiktionary.cognate_sites`
  - Join to `cognate_wiktionary.cognate_pages` (`cognate_sites.cgsi_key = 
cognate_pages.cgpa_site`)
  - Join to `cognate_wiktionary.cognate_titles` (`cognate_pages.cgpa_title = 
cognate_titles.cgti_raw_key` - note the use of `cgti_raw_key`)
  - Use `cognate_titles.cgti_normalized_key` as a means of checking which 
Wiktionary entries are shared/missing across projects
  
  Putting this here as documentation :)

TASK DETAIL
  https://phabricator.wikimedia.org/T360296

EMAIL PREFERENCES
  https://phabricator.wikimedia.org/settings/panel/emailpreferences/

To: AndrewTavis_WMDE
Cc: Michael, ECohen_WMDE, Aklapper, Pamputt, AndrewTavis_WMDE, JeanFred, 
Lydia_Pintscher, MarcoSwart, Manuel, me, Danny_Benjafield_WMDE, S8321414, 
Astuthiodit_1, BeautifulBold, Suran38, karapayneWMDE, Invadibot, maantietaja, 
Peteosx1x, NavinRizwi, ItamarWMDE, Akuckartz, Dringsim, Nandana, Lahi, Gq86, 
GoranSMilovanovic, QZanden, KimKelting, LawExplorer, _jensen, rosalieper, 
Scott_WUaS, Wikidata-bugs, aude, Dinoguy1000, Mbch331
___
Wikidata-bugs mailing list -- wikidata-bugs@lists.wikimedia.org
To unsubscribe send an email to wikidata-bugs-le...@lists.wikimedia.org


[Wikidata-bugs] [Maniphest] T360296: [Analytics] Implement data process to identify missing Wiktionary entries

2024-05-23 Thread AndrewTavis_WMDE
AndrewTavis_WMDE updated the task description.

TASK DETAIL
  https://phabricator.wikimedia.org/T360296

EMAIL PREFERENCES
  https://phabricator.wikimedia.org/settings/panel/emailpreferences/

To: AndrewTavis_WMDE
Cc: Michael, ECohen_WMDE, Aklapper, Pamputt, AndrewTavis_WMDE, JeanFred, 
Lydia_Pintscher, MarcoSwart, Manuel, me, Danny_Benjafield_WMDE, S8321414, 
Astuthiodit_1, BeautifulBold, Suran38, karapayneWMDE, Invadibot, maantietaja, 
Peteosx1x, NavinRizwi, ItamarWMDE, Akuckartz, Dringsim, Nandana, Lahi, Gq86, 
GoranSMilovanovic, QZanden, KimKelting, LawExplorer, _jensen, rosalieper, 
Scott_WUaS, Wikidata-bugs, aude, Dinoguy1000, Mbch331
___
Wikidata-bugs mailing list -- wikidata-bugs@lists.wikimedia.org
To unsubscribe send an email to wikidata-bugs-le...@lists.wikimedia.org


[Wikidata-bugs] [Maniphest] T360296: [Analytics] Implement data process to identify missing Wiktionary entries

2024-03-18 Thread AndrewTavis_WMDE
AndrewTavis_WMDE added a comment.


  Thanks! I'll give an estimate on the timing of this once we've finished up 
T341330:  [Analytics] Airflow implementation of unique ips accessing Wikidata's 
REST API metrics . I'll need to 
check to see that the cognate_wiktionary table 
 is an appropriate source, 
but here's hoping as the original source is unclear and this is the only 
structured data source I've seen. Maybe there's an API for the extension that 
could also be used within the job. Note that the plan for this is an Airflow 
DAG that leverages the aforementioned job that will be running on WMF's 
infrastructure.

TASK DETAIL
  https://phabricator.wikimedia.org/T360296

EMAIL PREFERENCES
  https://phabricator.wikimedia.org/settings/panel/emailpreferences/

To: AndrewTavis_WMDE
Cc: Michael, ECohen_WMDE, Aklapper, Pamputt, AndrewTavis_WMDE, JeanFred, 
Lydia_Pintscher, MarcoSwart, Manuel, me, Danny_Benjafield_WMDE, Astuthiodit_1, 
BeautifulBold, Suran38, karapayneWMDE, Invadibot, maantietaja, Peteosx1x, 
NavinRizwi, ItamarWMDE, Akuckartz, Nandana, Lahi, Gq86, GoranSMilovanovic, 
QZanden, KimKelting, LawExplorer, _jensen, rosalieper, Scott_WUaS, 
Wikidata-bugs, aude, Dinoguy1000, Mbch331
___
Wikidata-bugs mailing list -- wikidata-bugs@lists.wikimedia.org
To unsubscribe send an email to wikidata-bugs-le...@lists.wikimedia.org


[Wikidata-bugs] [Maniphest] T360296: [Analytics] Implement data process to identify missing Wiktionary entries

2024-03-18 Thread AndrewTavis_WMDE
AndrewTavis_WMDE updated the task description.

TASK DETAIL
  https://phabricator.wikimedia.org/T360296

EMAIL PREFERENCES
  https://phabricator.wikimedia.org/settings/panel/emailpreferences/

To: AndrewTavis_WMDE
Cc: Michael, ECohen_WMDE, Aklapper, Pamputt, AndrewTavis_WMDE, JeanFred, 
Lydia_Pintscher, MarcoSwart, Manuel, me, Danny_Benjafield_WMDE, Astuthiodit_1, 
BeautifulBold, Suran38, karapayneWMDE, Invadibot, maantietaja, Peteosx1x, 
NavinRizwi, ItamarWMDE, Akuckartz, Nandana, Lahi, Gq86, GoranSMilovanovic, 
QZanden, KimKelting, LawExplorer, _jensen, rosalieper, Scott_WUaS, 
Wikidata-bugs, aude, Dinoguy1000, Mbch331
___
Wikidata-bugs mailing list -- wikidata-bugs@lists.wikimedia.org
To unsubscribe send an email to wikidata-bugs-le...@lists.wikimedia.org


[Wikidata-bugs] [Maniphest] T360296: [Analytics] Implement data process to identify missing Wiktionary entries

2024-03-18 Thread AndrewTavis_WMDE
AndrewTavis_WMDE claimed this task.
AndrewTavis_WMDE updated the task description.

TASK DETAIL
  https://phabricator.wikimedia.org/T360296

EMAIL PREFERENCES
  https://phabricator.wikimedia.org/settings/panel/emailpreferences/

To: AndrewTavis_WMDE
Cc: Michael, ECohen_WMDE, Aklapper, Pamputt, AndrewTavis_WMDE, JeanFred, 
Lydia_Pintscher, MarcoSwart, Manuel, me, Danny_Benjafield_WMDE, Astuthiodit_1, 
BeautifulBold, Suran38, karapayneWMDE, Invadibot, maantietaja, Peteosx1x, 
NavinRizwi, ItamarWMDE, Akuckartz, Nandana, Lahi, Gq86, GoranSMilovanovic, 
QZanden, KimKelting, LawExplorer, _jensen, rosalieper, Scott_WUaS, 
Wikidata-bugs, aude, Dinoguy1000, Mbch331
___
Wikidata-bugs mailing list -- wikidata-bugs@lists.wikimedia.org
To unsubscribe send an email to wikidata-bugs-le...@lists.wikimedia.org


[Wikidata-bugs] [Maniphest] T360296: [Analytics] Implement data process to identify missing Wiktionary entries

2024-03-18 Thread Manuel
Manuel added a subscriber: ECohen_WMDE.
Manuel added a project: Wikidata Integration in Wikimedia projects.

TASK DETAIL
  https://phabricator.wikimedia.org/T360296

EMAIL PREFERENCES
  https://phabricator.wikimedia.org/settings/panel/emailpreferences/

To: Manuel
Cc: ECohen_WMDE, Aklapper, Pamputt, AndrewTavis_WMDE, JeanFred, 
Lydia_Pintscher, MarcoSwart, Manuel, me, Danny_Benjafield_WMDE, Astuthiodit_1, 
BeautifulBold, Suran38, karapayneWMDE, Invadibot, maantietaja, Peteosx1x, 
NavinRizwi, ItamarWMDE, Akuckartz, Michael, Nandana, Lahi, Gq86, 
GoranSMilovanovic, QZanden, KimKelting, LawExplorer, _jensen, rosalieper, 
Scott_WUaS, Wikidata-bugs, aude, Dinoguy1000, Mbch331
___
Wikidata-bugs mailing list -- wikidata-bugs@lists.wikimedia.org
To unsubscribe send an email to wikidata-bugs-le...@lists.wikimedia.org


[Wikidata-bugs] [Maniphest] T360296: [Analytics] Implement data process to identify missing Wiktionary entries

2024-03-18 Thread Manuel
Manuel updated the task description.

TASK DETAIL
  https://phabricator.wikimedia.org/T360296

EMAIL PREFERENCES
  https://phabricator.wikimedia.org/settings/panel/emailpreferences/

To: Manuel
Cc: Aklapper, Pamputt, AndrewTavis_WMDE, JeanFred, Lydia_Pintscher, MarcoSwart, 
Manuel, me, Danny_Benjafield_WMDE, Astuthiodit_1, BeautifulBold, Suran38, 
karapayneWMDE, Invadibot, maantietaja, Peteosx1x, NavinRizwi, ItamarWMDE, 
Akuckartz, Nandana, Lahi, Gq86, GoranSMilovanovic, QZanden, KimKelting, 
LawExplorer, _jensen, rosalieper, Scott_WUaS, Wikidata-bugs, aude, Dinoguy1000, 
Mbch331
___
Wikidata-bugs mailing list -- wikidata-bugs@lists.wikimedia.org
To unsubscribe send an email to wikidata-bugs-le...@lists.wikimedia.org


[Wikidata-bugs] [Maniphest] T360296: [Analytics] Implement data process to identify missing Wiktionary entries

2024-03-18 Thread Manuel
Manuel created this task.
Manuel added projects: Wikidata, Epic, Wikidata Analytics (Kanban).

TASK DESCRIPTION
  As a Wiktionary user, I want to know what are the most common words 
("entries") that are missing from a specific Wiktionary project.
  
  Scope
  -
  
  - Identify the original CSV for the "I miss you ..." table in 
https://analytics.wikimedia.org/published/datasets/wmde-analytics-engineering/Wiktionary/
  - (Re)create a data process that generates the table daily (daily for now so 
that we can evaluate the resource investment and usage)
  - Some entries need to be filtered out ("Main_Page" and "main_Page")
  
  Context
  ---
  
  **Wiktionaries** describe words coming from their own languages as well as 
other languages.  Pages on Wiktionaries are called "entries". Example: en:tree 
.
  
  The **Cognate extension** provides automatic links between two pages of 
different language versions of Wiktionary that have the same title (including a 
few normalization rules). So for example, fr:tree 
 and en:tree 
. These links then show up as automatic 
interwikilinks.
  
  There was also a **Wiktionary Cognate dashboard** that helped the community 
analyze the data of the extension.
  
  This community tool included an **"I miss you..." table/dashboard**.
  
  - The users could select a particular Wiktionary from a drop-down menu. A 
table then showed a table encompassing the top 1,000 enties (page titles) found 
in other Wiktionaries that are absent from the selected project.
  - The idea was to give to the editors of a language version, some ideas on 
what new pages to create on their home wiki. So, if someone is editing French 
Wiktionary, they would be interested in the words (whatever the language), that 
already have a page on many other Wiktionaries, but not the French one. That's 
probably the most interesting/useful pages to create. That's why users want a 
list of the entries that already exist in a lot of languages, but not theirs.
  - The data was originally updated every 6 hours.
  
  https://meta.wikimedia.org/wiki/Wiktionary_Cognate_Dashboard#I_Miss_You_tab
  
  This is just for context, this task ist only about implementing the data 
process to create public CSVs.
  
  Notes
  -
  
  - Some tech details of the original work was documented in this task: 
{T166487#4425588 }
  
  Acceptance criteria
  ---
  
  [ ] We know which CSV was the source for the "I miss you ..." table
  [ ] A data process is generating the respective CSV daily
[ ] Some entries are filtered out ("Main_Page" and "main_Page")
[ ] The CSV is published in 
https://analytics.wikimedia.org/published/datasets/wmde-analytics-engineering/Wiktionary/
 again

TASK DETAIL
  https://phabricator.wikimedia.org/T360296

EMAIL PREFERENCES
  https://phabricator.wikimedia.org/settings/panel/emailpreferences/

To: Manuel
Cc: Aklapper, Pamputt, AndrewTavis_WMDE, JeanFred, Lydia_Pintscher, MarcoSwart, 
Manuel, me, Danny_Benjafield_WMDE, Astuthiodit_1, BeautifulBold, Suran38, 
karapayneWMDE, Invadibot, maantietaja, Peteosx1x, NavinRizwi, ItamarWMDE, 
Akuckartz, Nandana, Lahi, Gq86, GoranSMilovanovic, QZanden, KimKelting, 
LawExplorer, _jensen, rosalieper, Scott_WUaS, Wikidata-bugs, aude, Dinoguy1000, 
Mbch331
___
Wikidata-bugs mailing list -- wikidata-bugs@lists.wikimedia.org
To unsubscribe send an email to wikidata-bugs-le...@lists.wikimedia.org