> > Maybe something exists already in Hadoop
The page properties table is already loaded into Hadoop on a monthly basis (wmf_raw.mediawiki_page_props). I haven't played with it much, but Hive also has JSON-parsing goodies, so give it a shot and let me know if you get stuck.

In general, data from the databases can be sqooped into Hadoop. We do this for large pipelines like edit history <https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Edit_data_loading> and it's very easy <https://github.com/wikimedia/analytics-refinery/blob/master/python/refinery/sqoop.py#L505> to add a table. We're also looking at replicating the whole database on a more frequent basis, but we have to do some groundwork first to allow incremental updates (see Apache Iceberg if you're interested).
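By way of a rough sketch of the Hive side: the built-in get_json_object UDF can pull fields out of a string column holding JSON. Note that the snapshot partition name, the JSON path, and the property name below are assumptions for illustration, not the actual schema, so adjust to what you see in DESCRIBE:

```sql
-- Sketch only: get_json_object is a standard Hive UDF, but the
-- partition column ('snapshot'), JSON path, and pp_propname value
-- here are hypothetical placeholders.
SELECT
  pp_page,
  pp_propname,
  get_json_object(pp_value, '$.some_field') AS some_field  -- hypothetical path
FROM wmf_raw.mediawiki_page_props
WHERE snapshot = '2020-01'               -- assumed monthly partition name
  AND pp_propname = 'example_prop'       -- hypothetical property
LIMIT 10;
```

json_tuple is also available if you need several fields from the same JSON blob in one pass.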
_______________________________________________
Wikitech-l mailing list
[email protected]
https://lists.wikimedia.org/mailman/listinfo/wikitech-l
