dcausse added a comment.
In T244590#5893018 <https://phabricator.wikimedia.org/T244590#5893018>, @Ottomata wrote:

> COOL! :)
>
>> it's important to note that the state of step 3 is tightly coupled with its dump and thus we will have to instantiate a new stream per imported dump. In other words a wdqs system imported using dump Y will have to consume the RDF stream generated from an initial state based on this same dump. This means that the RDF stream will be named against a particular dump instance.
>
> Hm. Would it be possible instead to lambda architecture this part? Instead of having to reload from a full dump and then recreate a new stream, could accomplish the same cleanups by backfilling from a batch job in Hadoop? I'm not sure I fully understand the 'cleanups' here. Are they not do-able with the stream because events representing some of the state changes don't exist (yet)?

Yes, I hope that in the future, once the stream has been stabilized, reloading the system will become less necessary, and that a fresh and consistent dump can be reconstructed (daily?) from the stream itself.

Reloading from the dump generated by MW is something we need anyway in order to bootstrap the system, and at the beginning it will be needed to work around:

- bug fixes (bugs where the data is simply lost)
- lost events (undetected failures or bugs in MW)
- cleanup

The cleanup operation mentioned here is a sort of "garbage collection": put simply, we need to detect unused resources (subgraphs) in the graph, and the stream itself cannot know this unless we keep another large state doing reference counting.

The solution proposed here is to simply spawn a new system from time to time (the dump generated by MW is clean) so that we do the cleanup and fix lost events at the same time. I agree with you that this is not ideal, and leveraging more batch jobs and/or more state in the stream will help minimize the need for a full reload.
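
To make the "large state doing reference counting" a bit more concrete, here is a rough sketch of what such a state could look like. It is only an illustration, not what the updater does: the class name `SharedNodeRefCounter`, the `is_shared` predicate and the `ref:` prefix are all made up for the example, and a real job would keep this state in the stream processor rather than in a Python dict.

```python
from collections import Counter
from typing import Callable, Iterable, Set, Tuple

Triple = Tuple[str, str, str]  # (subject, predicate, object)


class SharedNodeRefCounter:
    """Hypothetical state: counts how many triples point at each shared node."""

    def __init__(self, is_shared: Callable[[str], bool]) -> None:
        # is_shared decides which objects are shared resources worth tracking
        # (an assumption of this sketch, e.g. reference or value nodes).
        self._is_shared = is_shared
        self._counts: Counter = Counter()

    def apply_diff(self, added: Iterable[Triple], removed: Iterable[Triple]) -> Set[str]:
        """Apply one diff and return the shared nodes that are no longer referenced."""
        touched: Set[str] = set()
        for _, _, obj in added:
            if self._is_shared(obj):
                self._counts[obj] += 1
        for _, _, obj in removed:
            if self._is_shared(obj):
                self._counts[obj] -= 1
                touched.add(obj)
        # A node whose count dropped to zero is an orphan: its subgraph can be deleted.
        orphans = {node for node in touched if self._counts[node] <= 0}
        for node in orphans:
            del self._counts[node]
        return orphans


counter = SharedNodeRefCounter(lambda node: node.startswith("ref:"))
counter.apply_diff(added=[("s1", "p", "ref:a"), ("s2", "p", "ref:a")], removed=[])
counter.apply_diff(added=[], removed=[("s1", "p", "ref:a")])  # still referenced by s2
orphaned = counter.apply_diff(added=[], removed=[("s2", "p", "ref:a")])
```

After the last call `orphaned` contains `ref:a`, since nothing points at it anymore. The catch is the one mentioned above: the counter has to hold an entry for every shared node in the graph, and a single lost event leaves the counts wrong until the next full reload.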
TASK DETAIL
  https://phabricator.wikimedia.org/T244590
