Hi Guillaume,

Which file system does Blazegraph run on: NFS, ext4, or something else? Specifically, which file system holds the Journal files that are written and read? [1]

I ask because, looking at the code, it seems there could be cases where errors around file locking go unreported.

[1] https://github.com/blazegraph/database/blob/master/bigdata-core/bigdata/src/java/com/bigdata/journal/FileMetadata.java

Thad
https://www.linkedin.com/in/thadguidry/
https://calendly.com/thadguidry/

On Wed, Feb 22, 2023 at 5:06 AM Guillaume Lederrey <[email protected]> wrote:

> Hello all!
>
> TL;DR: We expect to successfully complete the recent data reload on
> Wikidata Query Service soon, but we've encountered multiple failures
> related to the size of the graph, and anticipate that this issue may worsen
> in the future. Although we succeeded this time, we cannot guarantee that
> future reload attempts will be successful given the current trend of the
> data reload process. Thank you for your understanding and patience.
>
> Longer version:
>
> WDQS is updated from a stream of recent changes on Wikidata, with a
> maximum delay of ~2 minutes. This process was improved as part of the WDQS
> Streaming Updater project to ensure data coherence[1]. However, the update
> process is still imperfect and can lead to data inconsistencies in some
> cases[2][3]. To address this, we reload the data from dumps a few times per
> year to reinitialize the system from a known good state.
>
> The recent reload of data from dumps started in mid-December and was
> initially met with some issues related to download and instabilities in
> Blazegraph, the database used by WDQS[4]. Loading the data into Blazegraph
> takes a couple of weeks due to the size of the graph, and we had multiple
> attempts where the reload failed after >90% of the data had been loaded.
> Our understanding of the issue is that a "race condition" in Blazegraph[5],
> where subtle timing changes lead to corruption of the journal in some rare
> cases, is to blame.[6]
>
> We want to reassure you that the last reload job was successful on one of
> our servers. The data still needs to be copied over to all of the WDQS
> servers, which will take a couple of weeks, but should not bring any
> additional issues. However, reloading the full data from dumps is becoming
> more complex as the data size grows, and we wanted to let you know why the
> process took longer than expected. We understand that data inconsistencies
> can be problematic, and we appreciate your patience and understanding while
> we work to ensure the quality and consistency of the data on WDQS.
>
> Thank you for your continued support and understanding!
>
>
> Guillaume
>
>
> [1] https://phabricator.wikimedia.org/T244590
> [2] https://phabricator.wikimedia.org/T323239
> [3] https://phabricator.wikimedia.org/T322869
> [4] https://phabricator.wikimedia.org/T323096
> [5] https://en.wikipedia.org/wiki/Race_condition#In_software
> [6] https://phabricator.wikimedia.org/T263110
>
> --
> *Guillaume Lederrey* (he/him)
> Engineering Manager
> Wikimedia Foundation <https://wikimediafoundation.org/>
> _______________________________________________
> Wikidata mailing list -- [email protected]
> Public archives at
> https://lists.wikimedia.org/hyperkitty/list/[email protected]/message/7QTJBRU2T3J22SNV4TGBRML4QNBGCEOU/
> To unsubscribe send an email to [email protected]
>
