> > Maybe something exists already in Hadoop
The page properties table is already loaded into Hadoop on a monthly basis (wmf_raw.mediawiki_page_props). I haven't played with it much, but Hive also has JSON-parsing goodies, so give it a shot and let me know if you get stuck.

In general, data from the databases can be sqooped into Hadoop. We do this for large pipelines like edit history <https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Edit_data_loading> and it's very easy <https://github.com/wikimedia/analytics-refinery/blob/master/python/refinery/sqoop.py#L505> to add a table. We're also looking at replicating the whole database on a more frequent basis, but we have to do some groundwork first to allow incremental updates (see Apache Iceberg if you're interested).
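By way of a rough sketch of the Hive side: the built-in get_json_object UDF can pull fields out of a string column holding JSON. Note that the snapshot partition name, the JSON path, and the property name below are assumptions for illustration, not the actual schema, so adjust to what you see in DESCRIBE:

```sql
-- Sketch only: get_json_object is a standard Hive UDF, but the
-- partition column ('snapshot'), JSON path, and pp_propname value
-- here are hypothetical placeholders.
SELECT
  pp_page,
  pp_propname,
  get_json_object(pp_value, '$.some_field') AS some_field  -- hypothetical path
FROM wmf_raw.mediawiki_page_props
WHERE snapshot = '2020-01'               -- assumed monthly partition name
  AND pp_propname = 'example_prop'       -- hypothetical property
LIMIT 10;
```

json_tuple is also available if you need several fields from the same JSON blob in one pass.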
_______________________________________________
Wikitech-l mailing list
[email protected]
https://lists.wikimedia.org/mailman/listinfo/wikitech-l
