[Wikidata-bugs] [Maniphest] T286938: Create a plan for a final streaming updater rollout as source of truth for blazegraph instances

dcausse Tue, 24 Aug 2021 01:53:50 -0700

dcausse added a comment.


  In T286938#7302853 <https://phabricator.wikimedia.org/T286938#7302853>, 
@EBernhardson wrote:
  
  > A couple thoughts, perhaps one will even be useful:
  >
  >> start import on wdqs1009 and wdqs2008 with --skolemize: best case 10 days 
(import from 2 machines to maximize the chances of success)
  >
  > I have some memory that we thought this could be sped up by skolemizing 
in hadoop, which currently runs weekly and takes a few hours.  How far are we 
from being able to feed those outputs into blazegraph, and would we expect much 
improvement? Or maybe the process is fragile enough that it's not worth adding 
risks here.
  
  Indeed, munging on a single core takes around 20 hours IIRC (about 8% of 
the import time) compared to 3 hours in hadoop. Unfortunately we don't have a 
process to serialize the resulting hive table back to plain TTL files and ship 
them to the target machine. I don't think anything there is inherently 
complicated, but these data-sharing/transfer tasks tend to be complex to put in 
place and stabilize (this one does not have to be automated though).
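
As a rough cross-check of the figures above, here is a minimal sketch of the arithmetic. It uses only the approximate numbers quoted in this comment (20 hours, 8%, 3 hours); the variable and function names are made up for illustration. The implied total of about 250 hours, a little over 10 days, lines up with the "best case 10 days" estimate.

```python
# Approximate figures taken from the comment above ("IIRC"), not measurements.
MUNGE_HOURS_SINGLE_CORE = 20   # munging on a single core
MUNGE_SHARE_OF_IMPORT = 0.08   # munging is ~8% of the import time
MUNGE_HOURS_HADOOP = 3         # the same step run in hadoop


def import_estimate():
    """Return (implied total import hours, hours saved, saved share of import)."""
    total_hours = MUNGE_HOURS_SINGLE_CORE / MUNGE_SHARE_OF_IMPORT
    saved_hours = MUNGE_HOURS_SINGLE_CORE - MUNGE_HOURS_HADOOP
    return total_hours, saved_hours, saved_hours / total_hours


total, saved, share = import_estimate()
print(f"implied import time: {total:.0f}h (~{total / 24:.1f} days)")
print(f"saved by munging in hadoop: {saved}h ({share:.1%} of the import)")
```

The saving is real but small relative to the whole import, which is consistent with treating the hive-to-TTL export as optional for this rollout.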
  
  >> start data-transfer + updater-consumer activation, wdqs2008 -> all codfw 
machines (EST: 2 to 3 days: 3h/machine*7)
  >>
  >> - Figure out if there is a way to optimize and parallelize this process
  >
  > With 7 machines, I guess we could cut it to 3 steps by also copying from 
the machines we copied to in a previous step. That plausibly brings the runtime 
down to a single day. The next step is live deployment, so mostly it frees us 
up for testing the service Thurs/Fri before we go live. It should mostly amount 
to starting the transfer from more machines each round.
  >
  > 1. a->b
  > 2. a->c, b->d
  > 3. a->e, b->f, c->g
  
  Makes sense, thanks!
  Given that some of these tasks will be launched manually, I guess it would 
make sense to make these actions more concrete and write them down as you did, 
but with real hostnames.
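
The round-based copy plan quoted above can be generated mechanically. The following is only a sketch: `fanout_rounds` is a hypothetical helper, the hosts are the placeholder letters from the quoted list rather than real hostnames, and the 3 hours per copy comes from the "3h/machine" estimate. It also assumes that copies within a round run fully in parallel without slowing each other down.

```python
def fanout_rounds(source, targets):
    """Plan copy rounds where every host that already has the data
    serves at most one new host per round."""
    have_data = [source]
    pending = list(targets)
    rounds = []
    while pending:
        # Snapshot the senders: hosts filled in this round only start
        # serving in the next round.
        senders = list(have_data)
        copies = []
        for sender in senders:
            if not pending:
                break
            receiver = pending.pop(0)
            copies.append((sender, receiver))
            have_data.append(receiver)
        rounds.append(copies)
    return rounds


HOURS_PER_COPY = 3  # from the "3h/machine" estimate; an approximation

rounds = fanout_rounds("a", ["b", "c", "d", "e", "f", "g"])
for number, copies in enumerate(rounds, start=1):
    print(f"{number}. " + ", ".join(f"{src}->{dst}" for src, dst in copies))
print(f"estimated wall time: {len(rounds) * HOURS_PER_COPY}h")
```

Swapping the letters for real hostnames would give the written-down list of manual steps; the estimate ignores the updater-consumer activation and any checks between rounds.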
  
  >> except wdqs1010 that we could use as source for emergency rollback
  >
  > I worry about having only a single source for emergency rollback. If we 
think we still need that option then keeping at least two copies would be 
typical, but do we have enough machines to keep two back reasonably?  Also it 
might be worth figuring out how we can decide when the emergency rollback can 
be decom'd, but then again we could wait until it's obvious that we can't go 
back anymore.
  
  True, I think we can keep one additional machine in codfw from the internal 
cluster.
  I think blockers are likely to be detected while the spare DC is being 
migrated, but it might be good to keep these two machines for a couple of 
months.

TASK DETAIL
  https://phabricator.wikimedia.org/T286938

EMAIL PREFERENCES
  https://phabricator.wikimedia.org/settings/panel/emailpreferences/

To: dcausse
Cc: EBernhardson, RKemper, Aklapper, dcausse, Gehel, MPhamWMF, Zbyszko, 
Invadibot, maantietaja, CBogen, Akuckartz, Nandana, Namenlos314, Lahi, Gq86, 
Lucas_Werkmeister_WMDE, GoranSMilovanovic, QZanden, EBjune, merbst, 
LawExplorer, _jensen, rosalieper, Scott_WUaS, Jonas, Xmlizer, jkroll, 
Wikidata-bugs, Jdouglas, aude, Tobias1984, Manybubbles, Mbch331
