dcausse created this task. dcausse added projects: Wikidata, Wikidata-Query-Service. Restricted Application added a subscriber: Aklapper.
TASK DESCRIPTION

As of today the data-reload cookbook performs multiple tasks on the wdqs host being reloaded:
- copy the dumps from the snapshot machines to a local folder
- munge
- import into blazegraph

It would be interesting to have a more flexible process that sources its data from hdfs/hive directly (or indirectly via swift?) so that we could reuse the data computed by jobs running in hadoop (munging, graph splitting).

It is not yet clear how to achieve this precisely, but the goal would be a set of tools that can be given a wdqs host, a target blazegraph port/namespace, and a hive partition for the source data, and then schedule a data-reload.

Design: TBD.

AC:
- the system is designed and this phab task updated
- a wdqs host can be loaded using the triples stored in a hive partition
- the load process can be resumed if it failed (except if blazegraph has corrupted its journal)
- the loading time must be shorter than the classic data-reload (by roughly one day, since the munge step is no longer needed)

TASK DETAIL
https://phabricator.wikimedia.org/T349069

EMAIL PREFERENCES
https://phabricator.wikimedia.org/settings/panel/emailpreferences/

To: dcausse
Cc: dr0ptp4kt, bking, BTullis, dcausse, Aklapper, Danny_Benjafield_WMDE, Astuthiodit_1, AWesterinen, karapayneWMDE, Invadibot, maantietaja, ItamarWMDE, Akuckartz, Nandana, Namenlos314, Lahi, Gq86, Lucas_Werkmeister_WMDE, GoranSMilovanovic, QZanden, EBjune, merbst, LawExplorer, _jensen, rosalieper, Scott_WUaS, Jonas, Xmlizer, jkroll, Wikidata-bugs, Jdouglas, aude, Tobias1984, Manybubbles, Mbch331
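Since the design is TBD, here is one possible shape for the resumable-load requirement in the AC: a driver that takes the wdqs host, blazegraph port/namespace, and source hive partition, and checkpoints completed chunks so a failed run can be resumed. All names below (`ReloadSpec`, `reload_from_partition`, `load_chunk`) are hypothetical illustrations, not an existing tool.

```python
# Hypothetical sketch of a resumable data-reload driver.
# Assumes the hive partition has already been munged into loadable chunks;
# how a chunk is actually pushed to blazegraph is abstracted as load_chunk.
import json
import os
from dataclasses import dataclass


@dataclass
class ReloadSpec:
    wdqs_host: str        # e.g. "wdqs1010.eqiad.wmnet" (illustrative)
    blazegraph_port: int  # target blazegraph port
    namespace: str        # target blazegraph namespace
    hive_partition: str   # e.g. "snapshot=20231016" (illustrative)


def reload_from_partition(spec, chunks, load_chunk, checkpoint_path):
    """Load each chunk once, recording progress in a checkpoint file so a
    failed run can be re-invoked and will skip already-loaded chunks."""
    done = set()
    if os.path.exists(checkpoint_path):
        with open(checkpoint_path) as f:
            done = set(json.load(f))
    for chunk in chunks:
        if chunk in done:
            continue  # already loaded in a previous (failed) run
        load_chunk(spec, chunk)  # e.g. POST triples to blazegraph
        done.add(chunk)
        # persist after every chunk so a crash loses at most one chunk
        with open(checkpoint_path, "w") as f:
            json.dump(sorted(done), f)
    return done
```

The per-chunk checkpoint is what makes the "load process can be resumed" criterion cheap to satisfy, as long as blazegraph's journal itself is intact.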
_______________________________________________ Wikidata-bugs mailing list -- [email protected]
