dcausse created this task. dcausse added projects: Wikidata, Wikidata-Query-Service. Restricted Application added a subscriber: Aklapper.
TASK DESCRIPTION

As of today the data-reload cookbook performs multiple tasks on the wdqs host being reloaded:
- copy the dumps from the snapshot machines to a local folder
- munge
- import into blazegraph

It would be interesting to have a more flexible process that sources its data from hdfs/hive directly (or indirectly via swift?) so that we could reuse the data computed by jobs running in hadoop (munging, graph splitting).

It is not yet clear how to achieve this precisely, but the goal would be a set of tools that can be given a wdqs host, a target blazegraph port/namespace, and a hive partition for the source data, and then schedule a data-reload.

Design: TBD.

AC:
- the system is designed and this phab task updated
- a wdqs host can be loaded using the triples stored in a hive partition
- the load process can be resumed if it failed (except if blazegraph has corrupted its journal)
- the loading time must be shorter than the classic data-reload (by roughly one day, since the munge step is no longer needed)

TASK DETAIL
https://phabricator.wikimedia.org/T349069

EMAIL PREFERENCES
https://phabricator.wikimedia.org/settings/panel/emailpreferences/

To: dcausse
Cc: dr0ptp4kt, bking, BTullis, dcausse, Aklapper, Danny_Benjafield_WMDE, Astuthiodit_1, AWesterinen, karapayneWMDE, Invadibot, maantietaja, ItamarWMDE, Akuckartz, Nandana, Namenlos314, Lahi, Gq86, Lucas_Werkmeister_WMDE, GoranSMilovanovic, QZanden, EBjune, merbst, LawExplorer, _jensen, rosalieper, Scott_WUaS, Jonas, Xmlizer, jkroll, Wikidata-bugs, Jdouglas, aude, Tobias1984, Manybubbles, Mbch331
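Since the design is TBD, here is one possible shape for the resumable-load requirement in the AC: a driver that takes the wdqs host, blazegraph port/namespace, and source hive partition, and checkpoints completed chunks so a failed run can be resumed. All names below (`ReloadSpec`, `reload_from_partition`, `load_chunk`) are hypothetical illustrations, not an existing tool.

```python
# Hypothetical sketch of a resumable data-reload driver.
# Assumes the hive partition has already been munged into loadable chunks;
# how a chunk is actually pushed to blazegraph is abstracted as load_chunk.
import json
import os
from dataclasses import dataclass


@dataclass
class ReloadSpec:
    wdqs_host: str        # e.g. "wdqs1010.eqiad.wmnet" (illustrative)
    blazegraph_port: int  # target blazegraph port
    namespace: str        # target blazegraph namespace
    hive_partition: str   # e.g. "snapshot=20231016" (illustrative)


def reload_from_partition(spec, chunks, load_chunk, checkpoint_path):
    """Load each chunk once, recording progress in a checkpoint file so a
    failed run can be re-invoked and will skip already-loaded chunks."""
    done = set()
    if os.path.exists(checkpoint_path):
        with open(checkpoint_path) as f:
            done = set(json.load(f))
    for chunk in chunks:
        if chunk in done:
            continue  # already loaded in a previous (failed) run
        load_chunk(spec, chunk)  # e.g. POST triples to blazegraph
        done.add(chunk)
        # persist after every chunk so a crash loses at most one chunk
        with open(checkpoint_path, "w") as f:
            json.dump(sorted(done), f)
    return done
```

The per-chunk checkpoint is what makes the "load process can be resumed" criterion cheap to satisfy, as long as blazegraph's journal itself is intact.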
_______________________________________________ Wikidata-bugs mailing list -- [email protected]
