dcausse added a comment.
In T286938#7302853 <https://phabricator.wikimedia.org/T286938#7302853>, @EBernhardson wrote:

> A couple of thoughts, perhaps one will even be useful:
>
>> start import on wdqs1009 and wdqs2008 with --skolemize: best case 10 days (import from 2 machines to maximize the chances of success)
>
> I have some memory that we thought this could be sped up with skolemizing in hadoop, which currently runs weekly and takes a few hours. How far are we from being able to feed those outputs into blazegraph, and would we expect much improvement? Or maybe the process is fragile enough that it's not worth adding risks here.

Indeed, munging on a single core takes around 20 hours IIRC (roughly 8% of the import time), compared to 3 hours in Hadoop. Unfortunately, we don't have a process to serialize the resulting Hive table back to plain TTL files and ship them to the target machine. I don't think anything there is complicated, but these data-sharing/transfer tasks tend to be complex to put in place and stabilize (though this one does not have to be automated).

>> start data-transfer + updater-consumer activation, wdqs2008 -> all codfw machines (EST: 2 to 3 days: 3h/machine*7)
>>
>> - Figure out if there is a way to optimize and parallelize this process
>
> With 7 machines, I guess we could cut it to 3 steps by also copying from the machines we copied to in a previous step. Plausibly brings runtime to a single day; the next step is live deployment, so mostly it frees us up for testing the service Thurs/Fri before we go live. Should mostly amount to starting the transfer from more machines each round.
>
> 1. a->b
> 2. a->c, b->d
> 3. a->e, b->f, c->g

Makes sense, thanks! Given that some of these tasks will be launched manually, I guess it would make sense to make these actions more concrete and write them down with real hostnames, as you did here.

>> except wdqs1010 that we could use as source for emergency rollback
>
> I worry about having only a single source for emergency rollback.
> If we think we still need that option, then keeping at least two copies would be typical, but do we have enough machines to reasonably keep two back? Also, it might be worth figuring out how we can decide when the emergency rollback can be decom'd, but then again we could wait until it's obvious that we can't go back anymore.

True, I think we can keep one additional machine in codfw from the internal cluster. I think blockers are likely to be detected while the spare DC is being migrated, but it might be good to keep these two machines for a couple of months.
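To sanity-check the round-based fan-out from the a->b / a->c, b->d / a->e, b->f, c->g list, here is a minimal Python sketch. It assumes each copy takes about 3 hours (the "3h/machine" figure from the plan), that a host can serve only one transfer at a time, and that the network is not a bottleneck; `fanout_schedule`, `HOURS_PER_TRANSFER`, and the letter hostnames are hypothetical placeholders, not real tooling or hosts.

```python
HOURS_PER_TRANSFER = 3  # assumption: the "3h/machine" estimate from the plan


def fanout_schedule(source, targets):
    """Plan copies in rounds (hypothetical helper): every host that already
    holds the data serves exactly one new host per round, so the number of
    holders doubles each round."""
    holders = [source]
    pending = list(targets)
    rounds = []
    while pending:
        transfers = []
        # Iterate over a snapshot: hosts filled in this round only serve next round.
        for src in list(holders):
            if not pending:
                break
            dst = pending.pop(0)
            transfers.append((src, dst))
            holders.append(dst)
        rounds.append(transfers)
    return rounds


if __name__ == "__main__":
    targets = ["b", "c", "d", "e", "f", "g"]
    rounds = fanout_schedule("a", targets)
    for i, transfers in enumerate(rounds, start=1):
        print(f"{i}. " + ", ".join(f"{s}->{d}" for s, d in transfers))
    # Wall-clock estimate: one transfer duration per round vs. one per target.
    print(f"sequential: {len(targets) * HOURS_PER_TRANSFER}h, "
          f"fan-out: {len(rounds) * HOURS_PER_TRANSFER}h")
```

With six targets this reproduces the three rounds listed above (about 9h instead of 18h). With seven targets, d also copies to an eighth host in round 3, so it is still three rounds, about 9h instead of 21h, not counting updater-consumer activation.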
