JAllemandou created this task.
JAllemandou added projects: Product-Analytics, Structured-Data-Backlog, Wikidata-Query-Service, Wikidata, Data-Engineering, Discovery-Search (Current work), Patch-For-Review, Data-Engineering-Kanban.
Restricted Application removed a project: Patch-For-Review.

TASK DESCRIPTION
  The Airflow job should:
  
  - be run weekly on Mondays.
  - wait for the source data to be available:
    - the source folder is of the form
      `hdfs://analytics-hadoop/wmf/data/raw/commons/dumps/mediainfo-json/YYYYMMDD`
    - the source folder contains a file named `_IMPORTED` once the source
      data has been successfully imported into the folder
  - run a Spark job that reads the source data and writes it to Hive:
    - the Spark job lives in the `refinery-job.jar` archive, which the
      Airflow job needs as a dependency
    - the Spark job class is
      `org.wikimedia.analytics.refinery.job.structureddata.jsonparse.JsonDumpConverter`
    - the main parameters of the job are the input folder, the output Hive
      table, and the snapshot (time partition) being created. The output
      Hive table will be `structured_data.commons_entity` and the
      `snapshot` will be of the form `YYYY-MM-DD`. See the class for the
      detailed list of parameters :) A rough sketch of the resulting DAG
      follows below.
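
  A minimal sketch of such a DAG, assuming stock Airflow providers
  (WebHdfsSensor, SparkSubmitOperator); the production DAG would use the
  in-house wrappers instead, and the jar location, connection defaults, and
  JsonDumpConverter flag names below are assumptions, not the authoritative
  parameter list:

```python
from datetime import datetime

from airflow import DAG
from airflow.providers.apache.hdfs.sensors.web_hdfs import WebHdfsSensor
from airflow.providers.apache.spark.operators.spark_submit import (
    SparkSubmitOperator,
)

# Dated source folder for the run; {{ ds_nodash }} renders as YYYYMMDD.
SOURCE_DIR = "/wmf/data/raw/commons/dumps/mediainfo-json/{{ ds_nodash }}"

with DAG(
    dag_id="commons_mediainfo_dump_to_hive",
    start_date=datetime(2022, 1, 3),  # a Monday
    schedule_interval="0 0 * * 1",    # weekly, on Mondays
    catchup=False,
) as dag:
    # Block until the importer drops the _IMPORTED flag file, i.e. until
    # the dump has been fully copied into the dated source folder.
    wait_for_dump = WebHdfsSensor(
        task_id="wait_for_mediainfo_dump",
        filepath=SOURCE_DIR + "/_IMPORTED",
        poke_interval=60 * 60,         # re-check hourly
        timeout=2 * 24 * 60 * 60,      # give up after two days
    )

    # Parse the JSON dump and write it to the Hive table as a new
    # snapshot partition. The jar path and flag names are illustrative
    # only; JsonDumpConverter defines the real parameter list.
    convert = SparkSubmitOperator(
        task_id="convert_mediainfo_dump",
        application="hdfs:///wmf/refinery/current/artifacts/refinery-job.jar",
        java_class=(
            "org.wikimedia.analytics.refinery.job."
            "structureddata.jsonparse.JsonDumpConverter"
        ),
        application_args=[
            "--input_path", "hdfs://analytics-hadoop" + SOURCE_DIR,
            "--output_table", "structured_data.commons_entity",
            "--snapshot", "{{ ds }}",  # YYYY-MM-DD
        ],
    )

    wait_for_dump >> convert
```

  Gating on the `_IMPORTED` flag file rather than on the folder's mere
  existence is what makes the wait step safe: the Spark job can never pick
  up a partially imported dump.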

TASK DETAIL
  https://phabricator.wikimedia.org/T299059
