AndrewTavis_WMDE added a comment.
Moving on to the Usage Dashboard, we're looking for the following two tables:

| Project | Project Type | Total Articles | Percent Articles Using WD | Total Articles Using WD | Percent Articles With Sitelinks | Total Articles With Sitelinks |
| --- | --- | --- | --- | --- | --- | --- |

| Project Type | Total Articles | Percent Articles Using WD | Total Articles Using WD | Percent Articles With Sitelinks | Total Articles With Sitelinks |
| --- | --- | --- | --- | --- | --- |

The process to produce the above tables is similarly quite confusing. There are tables being loaded into the server code that have no obvious relation to the outputs, like `wdcm_project_category.csv`, which loads per-project counts for categories like `Architectural Structure`. Maybe aggregates of the categories are being used to do this, but it's all quite messy, and if that is the case then it's not a fluid data process...

Generally we're looking for the process that creates the table `USER_NAME.wdcm_clients_wb_entity_usage` that the frontend is using. Searching the entire Wikidata Analytics code for `wdcm_clients_wb_entity_usage` mostly turns up print statements with progress reports related to this table, plus code reading from the table. The file WikidataAnalytics/_engines/_wdcmModules/WDCM_Sqoop_Clients.R <https://github.com/wikimedia/analytics-wmde-WD-WikidataAnalytics/blob/master/_engines/_wdcmModules/WDCM_Sqoop_Clients.R> is where the table is dropped, created and filled. Sqoop <https://sqoop.apache.org/> is Apache software for transferring bulk data between Hadoop and relational databases. The original source table is `wbc_entity_usage`, and the first destination before the user table is `tmp/wmde/analytics/wdcm/wdcmsqoop/wdcm_clients_wb_entity_usage` (copied to the user table in the same file). The documentation for `wbc_entity_usage` is found [here](https://www.mediawiki.org/wiki/Wikibase/Schema/wbc_entity_usage).
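
As a rough illustration of how the two target tables relate to each other, here is a minimal Python sketch that assumes we already have per-project counts. The field names (`using_wd`, `with_sitelinks`), the helper functions and all of the numbers are hypothetical and are not taken from the existing WDCM code. It only shows that the per-project-type table can be derived by summing the per-project counts and then recomputing the percentages, rather than averaging per-project percentages.

```python
from collections import defaultdict

# Hypothetical per-project counts; the numbers are made up for illustration.
rows = [
    {"project": "enwiki", "project_type": "Wikipedia",
     "total_articles": 1000, "using_wd": 800, "with_sitelinks": 950},
    {"project": "dewiki", "project_type": "Wikipedia",
     "total_articles": 500, "using_wd": 300, "with_sitelinks": 450},
    {"project": "enwiktionary", "project_type": "Wiktionary",
     "total_articles": 200, "using_wd": 20, "with_sitelinks": 0},
]


def percent(part, whole):
    """Return part/whole as a percentage rounded to 2 places (0.0 if whole is 0)."""
    return round(100.0 * part / whole, 2) if whole else 0.0


def per_project_table(rows):
    """Add the two percentage columns to each per-project row."""
    return [
        {
            **row,
            "percent_using_wd": percent(row["using_wd"], row["total_articles"]),
            "percent_with_sitelinks": percent(row["with_sitelinks"], row["total_articles"]),
        }
        for row in rows
    ]


def per_project_type_table(rows):
    """Sum the counts per project type, then recompute percentages from the sums."""
    totals = defaultdict(lambda: {"total_articles": 0, "using_wd": 0, "with_sitelinks": 0})
    for row in rows:
        bucket = totals[row["project_type"]]
        for key in ("total_articles", "using_wd", "with_sitelinks"):
            bucket[key] += row[key]
    return {
        project_type: {
            **counts,
            "percent_using_wd": percent(counts["using_wd"], counts["total_articles"]),
            "percent_with_sitelinks": percent(counts["with_sitelinks"], counts["total_articles"]),
        }
        for project_type, counts in totals.items()
    }


by_project = per_project_table(rows)
by_type = per_project_type_table(rows)
```

In a query-based job the same two steps would presumably be a `GROUP BY` on project and a second `GROUP BY` on project type, but how the underlying counts should be derived from `wbc_entity_usage` is exactly what still needs to be confirmed.
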
I would suggest that we find someone with greater knowledge of this table and plan out how to recreate the data such that the steps being taken are checked and verified along the way. We'd be making this primarily a query-based job rather than an R-based one, so working from R files that were not peer reviewed in the first place, and that use systems (R, Sqoop, etc.) we won't be using, doesn't seem like the best use of time.

We got confirmation that `cognate_wiktionary` is a source for the Wiktionary Cognate data as well, so we're covered as far as baseline data sources 🎉

TASK DETAIL
https://phabricator.wikimedia.org/T358254
