Hi Balduin,

I have no experience with the Fuseki persistent storage, and maybe I do not fully understand the reasons for your current migration workflow, but I think you can make it much faster and safer by using the Apache Jena Java API for TDB. The API is hopefully the same for TDB2.

Proposed workflow:
- make a backup (optional, but: better safe than sorry)
- shut down your Fuseki server
- load your data directly from Java via the TDB API ( https://jena.apache.org/documentation/tdb/java_api.html#using-a-directory-name )
- perform your migration directly on the persisted data while using transactions ( https://jena.apache.org/documentation/tdb/tdb_transactions.html ) -> if anything goes wrong, you can simply abort the transaction and nothing has changed
- commit and end the program
- start Fuseki again
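Roughly, that could look something like this in Java (a minimal sketch, assuming a TDB1 store; for TDB2 the same pattern should work via TDB2Factory.connectDataset; the directory path and the example update are placeholders, not your actual data model):

    import org.apache.jena.query.Dataset;
    import org.apache.jena.query.ReadWrite;
    import org.apache.jena.tdb.TDBFactory;
    import org.apache.jena.update.UpdateAction;

    public class OfflineMigration {
        public static void main(String[] args) {
            // Open the persisted store directly by its directory
            // (Fuseki must be stopped first). The path is a placeholder.
            Dataset dataset = TDBFactory.createDataset("/fuseki/databases/mydb");

            dataset.begin(ReadWrite.WRITE);
            try {
                // Placeholder migration step: any SPARQL update or other
                // programmatic changes to the dataset go here.
                UpdateAction.parseExecute(
                    "PREFIX ex: <http://example.org/> "
                  + "INSERT { ?s ex:status 'migrated' } WHERE { ?s a ex:Thing }",
                    dataset);

                dataset.commit();    // make all changes durable at once
            } catch (RuntimeException e) {
                dataset.abort();     // if anything goes wrong, nothing is changed
                throw e;
            } finally {
                dataset.end();
            }
            dataset.close();
        }
    }

If the abort() is reached (or the program dies before commit()), the store is left exactly as it was.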
I hope it works out for you.

Arne

On Fri, 9 Feb 2024 at 15:18, <[email protected]> wrote:

> Hi Andy,
>
> > If I understand correctly, this is a schema change requiring the data
> > to change.
>
> Correct, but we don't enforce any schema on the database level (no SHACL
> involved); that's only done programmatically in the application.
>
> > The transformation of the data to the updated data model could be done
> > offline, that would reduce downtime. If the data is being continuously
> > updated, that's harder because the offline copy will get out of step
> > with the live data.
> >
> > How often does the data change (not due to application logic changes)?
>
> The data might theoretically change constantly, so just doing it offline
> on a copy isn't really possible.
> A compromise I've been thinking about, which would still be better than
> downtime, would be a read-only mode for the duration of the migration.
> But for now, the application doesn't support something like this yet.
> (And if we can get the downtime to something reasonable, that would be
> good enough.)
>
> > Do you have a concrete example of such a change?
>
> These changes can vary from very simple to very complex:
> - The simplest case would maybe be that a certain property that used to
> be optional on a certain type of resource becomes mandatory; for all
> instances where it is not present, a default value needs to be supplied.
> => this we could easily do with a SPARQL update.
> - The most complex case I encountered so far was roughly this:
> Given that in graph A (representing the data model for a subset of the
> data) a particular statement is present on something of type P (defining
> some kind of property), and that in graph B (the subset of data
> corresponding to the model in A) a certain statement holds true for all
> V (which have a reference to P), then P should be modified. If the
> statement does not hold true for all V, then each V where it does not
> must be modified to become a more complex object.
> (More concretely: V represents a text value. If P says that V may
> contain markup, then check if any V contains markup. If not, change P to
> say that it does not contain markup; if any V contains markup, then all
> Vs that represent text without markup need to be changed to contain text
> with markup. Text without markup here represents a bit of reification
> around a string literal; text with markup follows a sophisticated
> standoff markup model, and even if no markup is present, it needs to
> contain information on the nature of the markup that is used.)
> => this is something I would not know how to, or feel comfortable,
> attempting in SPARQL, so it needs to happen in code.
>
> Long story short: some simple changes I could easily do in SPARQL; the
> more complex ones would require me to be able to do the changes in code,
> but it might be possible to set it up in such a way that the code
> essentially has read access to the data and can generate update queries
> from that.
>
> Our previous setup worked like this (durations not measured, just from
> experience):
> On application start, if a migration needs doing, the application won't
> start right away but kicks that process off:
> - download an entire dump of Fuseki to a file on disk (ca. 20 min.)
> - load the dump into an in-memory Jena model (10 min. -> plus huge
> memory consumption that will always grow in proportion to our data)
> - perform the migration on the in-memory model (1 sec. - 1 min.)
> - dump the model to a file on disk
> - drop all graphs from Fuseki (20 min.)
> - upload the dump into Fuseki (20 min.)
> Then the application would start... so at least 1h of downtime, clearly
> room for improvement.
> The good thing about this approach is that if the migration fails, the
> data is not corrupted, because the data loaded in Fuseki is not affected.
>
> My best bet at this point is to say we take the risk of data corruption
> (thank god for backups!) and operate on the live Fuseki database. This
> cuts out the time-consuming downloading, uploading, etc. and solves the
> memory issue of loading the entire database into an in-memory model.
> Then the migration is either just SPARQL or a programmatic series of
> database interactions leading to update queries.
> Then we would probably be down from 1h of downtime to 1 min., which
> would be a huge improvement.
>
> Does that sound reasonable? Are there better ways? Anything I'm missing?
>
> Best & thanks! (and sorry for the wall of text)
> Balduin
>
>
> -----Original Message-----
> From: Andy Seaborne <[email protected]>
> Sent: Friday, 9 February 2024 14:12
> To: [email protected]
> Subject: Re: Database Migrations in Fuseki
>
> Hi Balduin,
>
> On 07/02/2024 11:05, Balduin Landolt wrote:
> > Hi everyone,
> >
> > we're storing data in Fuseki as a persistence layer for our application
> > backend; the data is structured according to the application logic.
> > Whenever something changes in our application logic, we have to do a
> > database migration so that the data conforms to the updated model.
> > Our current solution for that is very home-spun, not exactly stable,
> > and comes with a lot of downtime, so we try to avoid it whenever
> > possible.
>
> If I understand correctly, this is a schema change requiring the data to
> change.
>
> The transformation of the data to the updated data model could be done
> offline, that would reduce downtime. If the data is being continuously
> updated, that's harder because the offline copy will get out of step
> with the live data.
>
> How often does the data change (not due to application logic changes)?
>
> > I'm now looking into how this could be improved in the future. My
> > double question is:
> > 1) is there any tooling I missed to help with this process? (In the
> > SQL world, for example, there are out-of-the-box solutions for this.)
> > 2) and if not, more broadly, does anyone have any hints on how I could
> > best go about this?
>
> Do you have a concrete example of such a change? Maybe change-in-place
> is possible, but that depends on how updates happen and how the data
> feeds change with the application logic change.
>
>     Andy
>
> > Thanks in advance!
> > Balduin
> >
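For the "operate on the live Fuseki database" option, here is a minimal sketch using Jena's RDFConnection against the running endpoint (assuming a recent Jena version; the endpoint URL, prefix, and property/class names are invented placeholders, and note that each update is a separate HTTP request, not one long transaction):

    import org.apache.jena.rdfconnection.RDFConnection;

    public class LiveMigration {
        public static void main(String[] args) {
            // Connect to the running Fuseki dataset endpoint (placeholder URL).
            try (RDFConnection conn = RDFConnection.connect("http://localhost:3030/mydb")) {
                // Simple case: supply a default value wherever a now-mandatory
                // property is missing (prefix and names are invented).
                conn.update(
                    "PREFIX ex: <http://example.org/> "
                  + "INSERT { ?s ex:requiredProp 'default' } "
                  + "WHERE  { ?s a ex:Resource . "
                  + "         FILTER NOT EXISTS { ?s ex:requiredProp ?any } }");

                // Complex case: read the data first, decide in code what needs
                // changing, and send further updates the same way.
                conn.querySelect(
                    "SELECT ?s WHERE { ?s a <http://example.org/Resource> }",
                    row -> {
                        // inspect each result row and queue follow-up updates here
                    });
            }
        }
    }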
