[Wikidata-bugs] [Maniphest] T358254: [Analytics] Investigate effort of selective legacy migrations to Airflow

AndrewTavis_WMDE Fri, 15 Mar 2024 10:23:40 -0700

AndrewTavis_WMDE added a comment.


  Moving on to the Usage Dashboard, what we're looking for is the following 
two tables:
  
  | Project | Project Type | Total Articles | Percent Articles Using WD | Total Articles Using WD | Percent Articles With Sitelinks | Total Articles With Sitelinks |
  
  
  
  | Project Type | Total Articles | Percent Articles Using WD | Total Articles Using WD | Percent Articles With Sitelinks | Total Articles With Sitelinks |
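  To make the relationship between the two tables concrete, here is a minimal Python sketch. It assumes (this is not confirmed from the existing code) that the per-project-type table is simply a rollup of the per-project table. The function name and dictionary keys are hypothetical, chosen to mirror the column headers above.

```python
from collections import defaultdict

# Hypothetical count columns, mirroring the dashboard column headers.
COUNT_KEYS = ("total_articles", "articles_using_wd", "articles_with_sitelinks")


def rollup_by_project_type(rows):
    """Sum per-project counts by project type and recompute the percentages."""
    totals = defaultdict(lambda: dict.fromkeys(COUNT_KEYS, 0))
    for row in rows:
        agg = totals[row["project_type"]]
        for key in COUNT_KEYS:
            agg[key] += row[key]

    result = []
    for project_type in sorted(totals):
        agg = totals[project_type]
        total = agg["total_articles"]

        def pct(count):
            # Guard against project types with no articles at all.
            return round(100.0 * count / total, 2) if total else 0.0

        result.append(
            {
                "project_type": project_type,
                "total_articles": total,
                "percent_articles_using_wd": pct(agg["articles_using_wd"]),
                "total_articles_using_wd": agg["articles_using_wd"],
                "percent_articles_with_sitelinks": pct(agg["articles_with_sitelinks"]),
                "total_articles_with_sitelinks": agg["articles_with_sitelinks"],
            }
        )
    return result
```

  Under that assumption, the percentages have to be recomputed from the summed counts rather than averaged across projects, since projects differ a lot in size.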
  
  The process that produces the above tables is similarly confusing. Tables 
are loaded into the server code that have no apparent relation to the outputs, 
like `wdcm_project_category.csv`, which loads per-project counts for 
categories like `Architectural Structure`. Maybe aggregates of the categories 
are being used to derive the outputs, but it's all quite messy, and if that is 
the case then it's not a fluid data process...
  
  Generally we're looking for the process that creates the table 
`USER_NAME.wdcm_clients_wb_entity_usage` that the frontend is using. Searching 
the entire Wikidata Analytics code for `wdcm_clients_wb_entity_usage` mostly 
turns up print statements with progress reports related to this table and code 
reading from it. The file 
WikidataAnalytics/_engines/_wdcmModules/WDCM_Sqoop_Clients.R 
<https://github.com/wikimedia/analytics-wmde-WD-WikidataAnalytics/blob/master/_engines/_wdcmModules/WDCM_Sqoop_Clients.R>
 is where the table is dropped, created and filled. Sqoop 
<https://sqoop.apache.org/> is Apache software for transferring bulk data 
between Hadoop and relational databases. The source table is 
`wbc_entity_usage`, and the intermediate destination before the user table is 
`tmp/wmde/analytics/wdcm/wdcmsqoop/wdcm_clients_wb_entity_usage` (copied to 
the user table in the same file).
  
  The documentation for `wbc_entity_usage` is found 
[here](https://www.mediawiki.org/wiki/Wikibase/Schema/wbc_entity_usage). I 
would suggest that we find someone with greater knowledge of this table and 
plan out how to recreate the data such that the steps being taken are checked 
and verified along the way. We'd want this to be primarily a query-based job 
rather than an R-based one, so working from R files that were not peer 
reviewed in the first place and that use systems (R, Sqoop, etc.) that we 
won't be using doesn't seem like the best use of time.
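  As a rough illustration of what a query-based step could look like, here is a small Python sketch that runs against an in-memory SQLite stand-in for `wbc_entity_usage`. The column names (`eu_row_id`, `eu_entity_id`, `eu_aspect`, `eu_page_id`) follow my reading of the schema documentation linked above. The definition of "using WD" as "has at least one usage row" is an assumption that would need checking with someone who knows the table, and the helper names and demo rows are made up for the example. The real job would presumably run as SQL on the cluster rather than on SQLite.

```python
import sqlite3


def build_demo_db():
    """Create an in-memory stand-in for one client wiki's wbc_entity_usage table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        """
        CREATE TABLE wbc_entity_usage (
            eu_row_id    INTEGER PRIMARY KEY,
            eu_entity_id TEXT    NOT NULL,
            eu_aspect    TEXT    NOT NULL,
            eu_page_id   INTEGER NOT NULL
        )
        """
    )
    # Made-up demo rows: page 1 uses Q42 in two ways, pages 2 and 3 use Q64.
    conn.executemany(
        "INSERT INTO wbc_entity_usage (eu_entity_id, eu_aspect, eu_page_id) "
        "VALUES (?, ?, ?)",
        [("Q42", "S", 1), ("Q42", "L.en", 1), ("Q64", "T", 2), ("Q64", "S", 3)],
    )
    return conn


def count_pages_using_wikidata(conn):
    """Count distinct pages with at least one entity usage row (assumed definition)."""
    cur = conn.execute("SELECT COUNT(DISTINCT eu_page_id) FROM wbc_entity_usage")
    return cur.fetchone()[0]
```

  Counting distinct `eu_page_id` values matters because a single page can have several usage rows (one per entity and aspect), so a plain row count would overstate the number of pages.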
  
  We got confirmation that `cognate_wiktionary` is a source for the Wiktionary 
Cognate data as well, so we're covered as far as baseline data sources 🎉

TASK DETAIL
  https://phabricator.wikimedia.org/T358254

To: AndrewTavis_WMDE
Cc: ECohen_WMDE, AndrewTavis_WMDE, Manuel, Aklapper, Danny_Benjafield_WMDE, 
Astuthiodit_1, karapayneWMDE, Invadibot, maantietaja, ItamarWMDE, Akuckartz, 
Michael, Nandana, Lahi, Gq86, GoranSMilovanovic, QZanden, KimKelting, 
LawExplorer, _jensen, rosalieper, Scott_WUaS, Wikidata-bugs, aude, Mbch331
