JAllemandou created this task. JAllemandou added projects: Product-Analytics, Structured-Data-Backlog, Wikidata-Query-Service, Wikidata, Data-Engineering, Discovery-Search (Current work), Patch-For-Review, Data-Engineering-Kanban. Restricted Application removed a project: Patch-For-Review.
TASK DESCRIPTION

The airflow job should:
- be run weekly on Mondays.
- wait for the source data to be available:
  - the source folder is of the form `hdfs://analytics-hadoop/wmf/data/raw/commons/dumps/mediainfo-json/YYYYMMDD`
  - the source folder contains a file named `_IMPORTED` once the source data has been successfully imported into the folder
- run a Spark job that reads the source data and writes it to Hive:
  - the Spark job lives in the `refinery-job.jar` archive, which we need as a dependency for the job
  - the Spark job class is `org.wikimedia.analytics.refinery.job.structureddata.jsonparse.JsonDumpConverter`
  - the main parameters of the job are the input folder, the output Hive table, and the snapshot (time partition) being created. The output Hive table will be `structured_data.commons_entity` and the `snapshot` will be of the form `YYYY-MM-DD`. See the class for the detailed list of parameters :)

TASK DETAIL
https://phabricator.wikimedia.org/T299059
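The parameter wiring described above can be sketched in plain Python, independent of Airflow itself. The helper and flag names below are hypothetical (see `JsonDumpConverter` for the real parameter list); the HDFS path layout, `_IMPORTED` flag file, job class, output table, and snapshot format come from the task description.

```python
from datetime import date

# Values taken from the task description.
SOURCE_BASE = "hdfs://analytics-hadoop/wmf/data/raw/commons/dumps/mediainfo-json"
JOB_CLASS = "org.wikimedia.analytics.refinery.job.structureddata.jsonparse.JsonDumpConverter"
OUTPUT_TABLE = "structured_data.commons_entity"


def source_folder(run_date: date) -> str:
    """Folder holding the weekly dump for a given run date (YYYYMMDD)."""
    return f"{SOURCE_BASE}/{run_date:%Y%m%d}"


def import_flag(run_date: date) -> str:
    """File whose presence signals that the import into the folder finished."""
    return f"{source_folder(run_date)}/_IMPORTED"


def spark_submit_args(run_date: date) -> list:
    """Argument list for spark-submit; jar path and flag names are assumptions."""
    return [
        "--class", JOB_CLASS,
        "refinery-job.jar",
        "--input-path", source_folder(run_date),     # hypothetical flag name
        "--output-table", OUTPUT_TABLE,              # hypothetical flag name
        "--snapshot", f"{run_date:%Y-%m-%d}",        # snapshot is YYYY-MM-DD
    ]


print(source_folder(date(2022, 1, 17)))
# hdfs://analytics-hadoop/wmf/data/raw/commons/dumps/mediainfo-json/20220117
```

In the actual DAG, an HDFS sensor would poll for the `_IMPORTED` file and, once it appears, a Spark-submit task would run with arguments like the ones above.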