Hi Balduin,

I have no experience with Fuseki's persistent storage and maybe I do not
fully understand the reasons for your current migration workflow, but I
think you can make it much faster and safer by using the Apache Jena Java
API for TDB.
The API is hopefully the same for TDB2.
Proposed workflow:
- make a backup (optional, but better safe than sorry)
- shut down your Fuseki server
- load your data directly from Java via the TDB API (
https://jena.apache.org/documentation/tdb/java_api.html#using-a-directory-name
)
- perform your migration directly on the persisted data inside a
transaction (
https://jena.apache.org/documentation/tdb/tdb_transactions.html)
  -> if anything goes wrong, you can simply abort the transaction and
nothing has changed
- commit and end the program (see the sketch below)
- start Fuseki again
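
Something like this, as a minimal untested sketch (the database directory
path is a placeholder, take it from your Fuseki configuration; for TDB2 the
equivalent entry point should be org.apache.jena.tdb2.TDB2Factory.connectDataset):

    import org.apache.jena.query.Dataset;
    import org.apache.jena.query.ReadWrite;
    import org.apache.jena.rdf.model.Model;
    import org.apache.jena.tdb.TDBFactory;

    public class Migration {
        public static void main(String[] args) {
            // open the TDB directory that your Fuseki dataset is configured to use
            Dataset dataset = TDBFactory.createDataset("/fuseki/databases/ds");
            dataset.begin(ReadWrite.WRITE);
            try {
                Model model = dataset.getDefaultModel();
                // ... perform the migration here, on the default model and/or
                // on named graphs via dataset.getNamedModel(...) ...
                dataset.commit();  // make the changes durable
            } catch (Exception e) {
                dataset.abort();   // if anything goes wrong, nothing has changed
            } finally {
                dataset.end();
            }
        }
    }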

I hope it works out for you.
Arne


On Fri, Feb 9, 2024 at 15:18
<[email protected]> wrote:

> Hi Andy,
>
> > If I understand correctly, this is a schema change requiring the data to
> change.
>
> Correct, but we don't enforce any schema on the database level (no SHACL
> involved), that's only done programmatically in the application.
>
> > The transformation of the data to the updated data model could be done
> offline, that would reduce downtime. If the data is being continuously
> updated, that's harder because the offline copy will get out of step with
> the live data.
> > How often does the data change (not due to application logic changes)?
>
> The data might theoretically change constantly, so just doing it offline
> on a copy isn't really possible.
> A compromise I've been thinking about, which would still be better than
> downtime, would be a read-only mode for the duration of the migration. But
> for now, the application doesn't support something like this yet.
> (And if we can get the downtime to something reasonable, that would be
> good enough.)
>
> > Do you have a concrete example of such a change?
>
> These changes can vary from very simple to very complex:
> - The simplest case would be that a certain property that used to be
> optional on a certain type of resource becomes mandatory; for all instances
> where it is not present, a default value needs to be supplied.
>   => this we could easily do with a SPARQL update (see the sketch after
> this list).
> - The most complex case I encountered so far was roughly this:
>   Given that in graph A (representing the data model for a subset of the
> data) a particular statement on something of type P (defining some kind of
> property) is present, and that in graph B (the subset of data corresponding
> to the model in A) a certain statement holds true for all V (which have a
> reference to P), then P should be modified. If the statement does not hold
> true for all V, then each V where it does not must be modified to become a
> more complex object.
>   (More concretely: V represents a text value. If P says that V may
> contain markup, then check if any V contains markup. If not, change P to
> say that it does not contain markup;  if any V contains markup, then all Vs
> that represent text without markup need to be changed to contain text with
> markup. Text without markup here represents a bit of reification around a
> string literal; text with markup follows a sophisticated standoff markup
> model, and even if no markup is present, it needs to contain information on the
> nature of the markup that is used.)
>   => this is something I would not know how to, or feel comfortable
> attempting in SPARQL, so it needs to happen in code.
>
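> For the simple case, a rough sketch of what that update could look like
> when run from code (the type and property names are made up for
> illustration):
>
>     import org.apache.jena.query.Dataset;
>     import org.apache.jena.system.Txn;
>     import org.apache.jena.update.UpdateExecutionFactory;
>     import org.apache.jena.update.UpdateFactory;
>
>     public class SimpleMigration {
>         public static void run(Dataset dataset) {
>             // supply a default wherever the now-mandatory property is missing
>             String update =
>                 "PREFIX ex: <http://example.org/>\n"
>                 + "INSERT { ?s ex:nowMandatory \"default value\" }\n"
>                 + "WHERE  { ?s a ex:SomeType .\n"
>                 + "         FILTER NOT EXISTS { ?s ex:nowMandatory ?any } }";
>             // run inside a write transaction
>             Txn.executeWrite(dataset, () ->
>                 UpdateExecutionFactory.create(UpdateFactory.create(update), dataset)
>                     .execute());
>         }
>     }
>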
> Long story short: some simple changes I could easily do in SPARQL; the
> more complex ones would require doing the changes in code, but it might be
> possible to set it up in such a way that the code essentially has read
> access to the data and generates update queries from that.
>
> Our previous setup worked like this (durations not measured, just from
> experience):
> On application start, if a migration needs doing, the application won't
> start right away, but first kicks that process off:
> - download an entire dump of Fuseki to a file on disk (ca. 20 min.)
> - load the dump into an in-memory Jena model (10 min. -> plus huge memory
> consumption that will keep growing proportionally as our data grows)
> - perform the migration on the in-memory model (1 sec. - 1 min.)
> - dump the model to a file on disk
> - drop all graphs from Fuseki (20 min.)
> - upload the dump into Fuseki (20 min.)
> Then the application would start... so at least 1h of downtime, clearly
> room for improvement.
> The good thing about this approach is that if the migration fails, the
> data would not be corrupted, because the data loaded in Fuseki is not
> affected.
>
> My best bet at this point is to accept the risk of data corruption (thank
> god for backups!) and operate on the live Fuseki database. This cuts out
> the time-consuming downloading, uploading etc. and solves the memory issue
> of loading the entire database into an in-memory model. Then the migration
> is either just SPARQL or a programmatic series of database interactions
> leading to update queries.
> That would probably bring us down from 1h of downtime to 1 min., which
> would be a huge improvement.
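>
> Something like this, roughly (the endpoint URL and the query/update
> strings are just placeholders):
>
>     import java.util.ArrayList;
>     import java.util.List;
>     import org.apache.jena.rdfconnection.RDFConnection;
>     import org.apache.jena.rdfconnection.RDFConnectionFactory;
>
>     public class LiveMigration {
>         public static void main(String[] args) {
>             // connect to the running Fuseki dataset over HTTP
>             try (RDFConnection conn =
>                     RDFConnectionFactory.connect("http://localhost:3030/ds")) {
>                 // 1) read what needs migrating
>                 List<String> affected = new ArrayList<>();
>                 conn.querySelect(
>                     "SELECT ?s WHERE { ?s a <http://example.org/SomeType> }",
>                     row -> affected.add(row.getResource("s").getURI()));
>                 // 2) generate and send the update queries
>                 // (each update is its own HTTP request, so the migration as a
>                 // whole is not atomic; batching into one update would help)
>                 for (String uri : affected) {
>                     conn.update("INSERT DATA { <" + uri
>                         + "> <http://example.org/migrated> true }");
>                 }
>             }
>         }
>     }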
>
> Does that sound reasonable? Are there better ways? Anything I'm missing?
>
> Best & thanks! (and sorry for the wall of text)
> Balduin
>
>
> -----Original Message-----
> From: Andy Seaborne <[email protected]>
> Sent: Friday, 9 February 2024 14:12
> To: [email protected]
> Subject: Re: Database Migrations in Fuseki
>
> Hi Balduin,
>
> On 07/02/2024 11:05, Balduin Landolt wrote:
> > Hi everyone,
> >
> > we're storing data in Fuseki as the persistence layer for our application
> > backend; the data is structured according to the application logic.
> > Whenever something changes in our application logic, we have to do a
> > database migration so that the data conforms to the updated model.
> > Our current solution for that is very home-spun, not exactly stable, and
> > comes with a lot of downtime, so we try to avoid it whenever possible.
>
> If I understand correctly, this is a schema change requiring the data to
> change.
>
> The transformation of the data to the updated data model could be done
> offline, that would reduce downtime. If the data is being continuously
> updated, that's harder because the offline copy will get out of step with
> the live data.
>
> How often does the data change (not due to application logic changes)?
>
> > I'm now looking into how this could be improved in the future. My
> > double question is:
> > 1) is there any tooling I missed to help with this process? (In the SQL
> > world, for example, there are out-of-the-box solutions for that.)
> > 2) and if not, more broadly, does anyone have any hints on how I could
> > best go about this?
>
> Do you have a concrete example of such a change? Maybe change-in-place is
> possible, but that depends on how updates happen and how the data feeds
> change with the application logic change.
>
>      Andy
>
> >
> > Thanks in advance!
> > Balduin
> >
>
>
>
>
