Hi Andy,

> If I understand correctly, this is a schema change requiring the data to
> change.
Correct, but we don't enforce any schema on the database level (no SHACL involved); that's only done programmatically in the application.

> The transformation of the data to the updated data model could be done
> offline, that would reduce downtime. If the data is being continuously
> updated, that's harder because the offline copy will get out of step with
> the live data.
> How often does the data change (not due to application logic changes)?

The data might theoretically change constantly, so just doing it offline on a copy isn't really possible. A compromise I've been thinking about, which would still be better than downtime, would be a read-only mode for the duration of the migration. But for now, the application doesn't support anything like that. (And if we can get the downtime to something reasonable, that would be good enough.)

> Do you have a concrete example of such a change?

These changes can vary from very simple to very complex:

- The simplest case would be that a certain property that used to be optional on a certain type of resource becomes mandatory; for all instances where it is not present, a default value needs to be supplied. => This we could easily do with a SPARQL update.

- The most complex case I encountered so far was roughly this: Given that in graph A (representing the data model for a subset of the data) a particular statement on something of type P (defining some kind of property) is present, and in graph B (the subset of data corresponding to the model in A) a certain statement holds true for all V (which have a reference to P), then P should be modified. If the statement does not hold true for all V, then each V where it does not must be modified to become a more complex object. (More concretely: V represents a text value. If P says that V may contain markup, then check whether any V contains markup. If not, change P to say that it does not contain markup; if any V contains markup, then all Vs that represent text without markup need to be changed to contain text with markup. Text without markup here is a bit of reification around a string literal; text with markup follows a sophisticated standoff markup model, and even if no markup is present, it needs to contain information on the nature of the markup that is used.) => This is something I would not know how to, or feel comfortable, attempting in SPARQL, so it needs to happen in code.

Long story short: some simple changes I could easily do in SPARQL; the more complex ones would require me to do the changes in code, but it might be possible to set it up in such a way that the code essentially has read access to the data and can generate update queries from that.

Our previous setup worked like this (durations not measured, just from experience): on application start, if a migration needs doing, the application won't start right away, but kicks that process off:

- download an entire dump of Fuseki to a file on disk (ca. 20 min.)
- load the dump into an in-memory Jena model (10 min., plus huge memory consumption that will always grow in proportion to our data)
- perform the migration on the in-memory model (1 sec. - 1 min.)
- dump the model to a file on disk
- drop all graphs from Fuseki (20 min.)
- upload the dump into Fuseki (20 min.)

Then the application would start... so at least 1h of downtime; clearly room for improvement. The good thing about this approach is that if the migration fails, the data is not corrupted, because the data loaded in Fuseki is not affected.

My best bet at this point is to say we take the risk of data corruption (thank god for backups!) and operate on the live Fuseki database. This cuts out the time-consuming downloading, uploading, etc., and solves the memory issue of loading the entire database into an in-memory model.
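For the simple "optional property becomes mandatory" case, such a migration can run in place as a single SPARQL update against the live dataset. A sketch, with hypothetical placeholder names (ex:Thing, ex:status, and the default "unknown" are made up, not our actual vocabulary):

```sparql
# Hypothetical vocabulary: ex:status becomes mandatory on ex:Thing,
# so supply the default "unknown" wherever it is missing.
PREFIX ex: <http://example.org/>

INSERT {
  GRAPH ?g { ?s ex:status "unknown" }
}
WHERE {
  GRAPH ?g {
    ?s a ex:Thing .
    FILTER NOT EXISTS { ?s ex:status ?any }
  }
}
```

Because the WHERE clause only matches instances lacking the property, the update is idempotent: re-running it after a partial failure does no harm.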
Then the migration is either just SPARQL or a programmatic series of database interactions leading to update queries. That would probably bring us down from 1h of downtime to about 1 min., which would be a huge improvement.

Does that sound reasonable? Are there better ways? Anything I'm missing?

Best & thanks! (and sorry for the wall of text)
Balduin

-----Original Message-----
From: Andy Seaborne <a...@apache.org>
Sent: Friday, 9 February 2024 14:12
To: users@jena.apache.org
Subject: Re: Database Migrations in Fuseki

Hi Balduin,

On 07/02/2024 11:05, Balduin Landolt wrote:
> Hi everyone,
>
> we're storing data in Fuseki as a persistence for our application
> backend, the data is structured according to the application logic.
> Whenever something changes in our application logic, we have to do a
> database migration, so that the data conforms to the updated model.
> Our current solution to that is very home-spun, not exactly stable and
> comes with a lot of downtime, so we try to avoid it whenever possible.

If I understand correctly, this is a schema change requiring the data to change.

The transformation of the data to the updated data model could be done offline, that would reduce downtime. If the data is being continuously updated, that's harder because the offline copy will get out of step with the live data.

How often does the data change (not due to application logic changes)?

> I'm now looking into how this could be improved in the future. My
> double question is:
> 1) is there any tooling I missed, to help with this process? (In SQL
> world for example, there are out of the box solutions for that.)
> 2) and if not, more broadly, does anyone have any hints on how I could
> best go about this?

Do you have a concrete example of such a change? Maybe change-in-place is possible, but that depends on how updates happen, how the data feeds change with the application logic change.

    Andy

> Thanks in advance!
> Balduin
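P.S. To make the "code generates update queries" idea a bit more concrete for the markup case (all names here are hypothetical placeholders, not our actual vocabulary): one branch of the decision can even be folded into a single in-place update, so that P is only flipped when no V under it carries markup:

```sparql
# Hypothetical vocabulary: rewrite the definition of ?p to disallow markup,
# but only if no ex:TextValue referencing ?p actually carries markup.
PREFIX ex: <http://example.org/>

DELETE { GRAPH ?g { ?p ex:mayHaveMarkup true } }
INSERT { GRAPH ?g { ?p ex:mayHaveMarkup false } }
WHERE {
  GRAPH ?g { ?p a ex:Property ; ex:mayHaveMarkup true }
  FILTER NOT EXISTS {
    ?v ex:definedBy ?p ;
       ex:hasMarkup ?m .
  }
}
```

If some V does carry markup, the per-V restructuring (reified string literal into a standoff markup object) would stay in application code, which reads the affected Vs and emits one generated update per V.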