I don't know if my experience is helpful, but I went a different way to solve these sorts of problems.

I avoid adding business logic while generating the RDF from the source data systems. It is almost entirely a simple transliteration from one format to another (I use an R2RML mapping.) The purpose is to combine data about the same subject from different source systems.

Separately, I wrote rules that check for several different desired constraints on the data. I think this corresponds to your "application logic". The source rules are written in RIF, then translated to HTML for display and SPARQL for execution.

Using the rules (in SPARQL format) I materialize additional triples and add them to the datastore.
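
To give a flavour, a rule of that kind might end up as a SPARQL update roughly
like this (the vocabulary here is invented, only the shape of the rule matters):

  PREFIX ex: <http://example.org/ns#>

  # Invented rule: anyone who manages at least one project is also a project manager.
  INSERT { ?person a ex:ProjectManager }
  WHERE  {
    ?person ex:manages ?project .
    ?project a ex:Project .
  }

Each rule becomes one such INSERT/WHERE; running them adds the derived triples
to the datastore.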

This approach decouples data from business logic and lets each evolve separately. When the source data schema changes, I change the R2RML mapping. When the business logic changes, I change the RIF rules.

The logic (rules) may be applied in any of several different ways to suit your needs (for example using a different rules engine, or ShEx or SHACL, or generating results on the fly instead of materializing them back into the datastore).
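
For instance, the invented rule from above, applied on the fly rather than
materialized, is just the CONSTRUCT form of the same query:

  PREFIX ex: <http://example.org/ns#>

  # Same invented rule, evaluated at query time instead of being written back.
  CONSTRUCT { ?person a ex:ProjectManager }
  WHERE  {
    ?person ex:manages ?project .
    ?project a ex:Project .
  }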

Regards,
--Paul

On 2/9/24 08:17, [email protected] wrote:
Hi Andy,

> If I understand correctly, this is a schema change requiring the data to change.
Correct, but we don't enforce any schema on the database level (no SHACL 
involved); that's only done programmatically in the application.

> The transformation of the data to the updated data model could be done offline;
> that would reduce downtime. If the data is being continuously updated, that's
> harder because the offline copy will get out of step with the live data.
> How often does the data change (not due to application logic changes)?
The data might theoretically change constantly, so just doing it offline on a 
copy isn't really possible.
A compromise I've been thinking about, which would still be better than 
downtime, would be a read-only mode for the duration of the migration. But the 
application doesn't support anything like that yet.
(And if we can get the downtime to something reasonable, that would be good 
enough.)

> Do you have a concrete example of such a change?
These changes can vary from very simple to very complex:
- The simplest case is probably that a property that used to be optional on a 
certain type of resource becomes mandatory; for all instances where it is not 
present, a default value needs to be supplied.
   => this we could easily do with a SPARQL update (see the sketch below this list).
- The most complex case I encountered so far was roughly this:
   Given that in graph A (representing the data model for a subset of the data), a 
particular statement on something of type P (defining some kind of property) is 
present, and that in graph B (the subset of data corresponding to the model in A), a 
certain statement holds true for all V (which have a reference to P), then P 
should be modified. If the statement does not hold true for all V, then each V 
where it does not must be modified to become a more complex object.
   (More concretely: V represents a text value. If P says that V may contain 
markup, then check whether any V contains markup. If not, change P to say that it 
does not contain markup; if any V does contain markup, then all Vs that represent 
text without markup need to be changed to contain text with markup. Text 
without markup here is a bit of reification around a string literal; text with 
markup follows a sophisticated standoff markup model, and even if no markup is 
present, it needs to contain information on the nature of the markup that is used.)
   => this is something I would not know how to do, or feel comfortable 
attempting, in SPARQL, so it needs to happen in code.
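
For the simple case above, the update would be roughly along these lines (the
class and property names are just placeholders, not our real vocabulary):

  PREFIX ex: <http://example.org/ns#>

  # Supply a default value wherever the newly mandatory property is missing.
  INSERT { ?r ex:requiredProp "some default" }
  WHERE  {
    ?r a ex:SomeResourceType .
    FILTER NOT EXISTS { ?r ex:requiredProp ?existing }
  }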

Long story short: Some simple changes I could easily do in SPARQL; the more 
complex ones would require me to be able to do the changes in code, but it 
might be possible to have it set up in such a way that the code essentially has 
read access to the data and can generate update queries from that.
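
For the complex case, for example, the code could first run a read query along
these lines (placeholder names again) and then generate one update per result:

  PREFIX ex: <http://example.org/ns#>

  # Find all text values referring to P that do not yet carry standoff markup.
  SELECT ?v
  WHERE {
    GRAPH <http://example.org/graph/B> {
      ?v ex:hasPropertyDefinition <http://example.org/property/P> .
      FILTER NOT EXISTS { ?v ex:hasStandoffMarkup ?markup }
    }
  }

The code would then build a DELETE/INSERT for each ?v (turning the plain text
value into the more complex markup object) and send those as ordinary update
queries, so it only ever needs read access plus the ability to fire updates.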

Our previous setup worked like this (durations not measured, just from 
experience):
On application start, if a migration needs doing, the application won't start 
right away, but kicks that process off first:
- download an entire dump of Fuseki to a file on disk (ca. 20 min.)
- load the dump into an in-memory Jena model (10 min.; plus huge memory 
consumption that will keep growing in proportion to our data)
- perform the migration on the in-memory model (1 sec. - 1 min.)
- dump the model to a file on disk
- drop all graphs from Fuseki (20 min.)
- upload the dump into Fuseki (20 min.)
Then the application would start... so at least 1 h of downtime; clearly room for 
improvement.
The good thing about this approach is that if the migration fails, the data 
would not be corrupted, because the data loaded in Fuseki is not affected.

My best bet at this point is to say that we take the risk of data corruption 
(thank god for backups!) and operate on the live Fuseki database. This cuts out 
the time-consuming downloading, uploading etc. and solves the memory issue of 
loading the entire database into an in-memory model. The migration is then 
either just SPARQL or a programmatic series of database interactions leading to 
update queries.
That would probably take us from 1 h of downtime down to about 1 min., which 
would be a huge improvement.

Does that sound reasonable? Are there better ways? Anything I'm missing?

Best & thanks! (and sorry for the wall of text)
Balduin


-----Original Message-----
From: Andy Seaborne <[email protected]>
Sent: Friday, 9 February 2024 14:12
To: [email protected]
Subject: Re: Database Migrations in Fuseki

Hi Balduin,

On 07/02/2024 11:05, Balduin Landolt wrote:
> Hi everyone,
>
> we're storing data in Fuseki as persistence for our application
> backend; the data is structured according to the application logic.
> Whenever something changes in our application logic, we have to do a
> database migration so that the data conforms to the updated model.
> Our current solution to that is very home-spun, not exactly stable, and
> comes with a lot of downtime, so we try to avoid it whenever possible.
If I understand correctly, this is a schema change requiring the data to change.

The transformation of the data to the updated data model could be done offline; 
that would reduce downtime. If the data is being continuously updated, that's 
harder because the offline copy will get out of step with the live data.

How often does the data change (not due to application logic changes)?

> I'm now looking into how this could be improved in the future. My
> double question is:
> 1) is there any tooling I missed, to help with this process? (In the SQL
> world, for example, there are out-of-the-box solutions for that.)
> 2) and if not, more broadly, does anyone have any hints on how I could
> best go about this?
Do you have a concrete example of such a change? Maybe change-in-place is 
possible, but that depends on how updates happen and how the data feeds change 
with the application logic change.

      Andy

> Thanks in advance!
> Balduin


