Hi Arne, hi Paul, sorry for the late response and thank you for your suggestions.
@Arne I haven't considered that possibility yet; that sounds intriguing. The immediate issue I see is that our Fuseki lives in a separate container and the application only communicates with it through HTTP, so we don't really have access to its dataset. But there may be ways around that (roughly along the lines of the sketch at the very end of this mail).

@Paul the workflow you describe is how I wish our application worked. But we use Fuseki as our data persistence, not just as a projection onto data, so there is no concept of "data sources" or any data transformation that I could tweak.

In any case, I think I have an idea what my options are and will see from there, thanks!

Best,
Balduin

On Mon, Feb 12, 2024 at 11:59 PM Arne Bernhardt <[email protected]> wrote:

> Hi Balduin,
>
> I have no experience with the Fuseki persistent storage and maybe I do not fully understand the reasons for your current migration workflow, but I think you can make it much faster and safer by using the Apache Jena Java API for TDB. The API is hopefully the same for TDB2.
>
> Proposed workflow:
> - make a backup (optional, but: better safe than sorry)
> - shut down your Fuseki server
> - load your data directly from Java via the TDB API (https://jena.apache.org/documentation/tdb/java_api.html#using-a-directory-name)
> - perform your migration directly on the persisted data while using transactions (https://jena.apache.org/documentation/tdb/tdb_transactions.html)
>   -> if anything goes wrong, you can simply abort the transaction and nothing has changed
> - commit and end the program
> - start Fuseki again
>
> I hope it works out for you.
> Arne
>
>
> On Fri, Feb 9, 2024 at 15:18, <[email protected]> wrote:
>
> > Hi Andy,
> >
> > > If I understand correctly, this is a schema change requiring the data to change.
> >
> > Correct, but we don't enforce any schema on the database level (no SHACL involved); that's only done programmatically in the application.
> >
> > > The transformation of the data to the updated data model could be done offline, that would reduce downtime. If the data is being continuously updated, that's harder because the offline copy will get out of step with the live data.
> > > How often does the data change (not due to application logic changes)?
> >
> > The data might theoretically change constantly, so just doing it offline on a copy isn't really possible.
> > A compromise I've been thinking about, which would still be better than downtime, would be a read-only mode for the duration of the migration. But for now, the application doesn't support anything like that yet.
> > (And if we can get the downtime to something reasonable, that would be good enough.)
> >
> > > Do you have a concrete example of such a change?
> >
> > These changes can vary from very simple to very complex:
> > - The simplest case would maybe be that a certain property that used to be optional on a certain type of resource becomes mandatory; for all instances where it is not present, a default value needs to be supplied.
> >   => this we could easily do with a SPARQL update.
> > - The most complex case I encountered so far was roughly this:
> >   Given that in graph A (representing the data model for a subset of the data) a particular statement on something of type P (defining some kind of property) is present, and that in graph B (the subset of data corresponding to the model in A) a certain statement holds true for all V (which have a reference to P), then P should be modified. If the statement does not hold true for all V, then each V where it does not must be modified to become a more complex object.
> >   (More concretely: V represents a text value. If P says that V may contain markup, then check whether any V contains markup. If not, change P to say that it does not contain markup; if any V does contain markup, then all Vs that represent text without markup need to be changed to contain text with markup. Text without markup here is a bit of reification around a string literal; text with markup follows a sophisticated standoff markup model, and even if no markup is present, it needs to contain information on the nature of the markup that is used.)
> >   => this is something I would not know how to, or feel comfortable, attempting in SPARQL, so it needs to happen in code.
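(Just to make the simpler "optional property becomes mandatory" case above concrete, here is a minimal, untested sketch of such an update; ex:Thing, ex:newlyRequired and the default literal are invented placeholders, not the actual model:)

    PREFIX ex: <http://example.org/ontology#>

    # Fill in a default wherever the now-mandatory property is missing.
    INSERT { ?s ex:newlyRequired "default value" }
    WHERE {
      ?s a ex:Thing .
      FILTER NOT EXISTS { ?s ex:newlyRequired ?existing }
    }

(Because it only touches resources that lack the property, re-running it is harmless.)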
> > Long story short: some simple changes I could easily do in SPARQL; the more complex ones would require me to be able to do the changes in code, but it might be possible to set it up in such a way that the code essentially has read access to the data and can generate update queries from that.
> >
> > Our previous setup worked like this (durations not measured, just from experience):
> > On application start, if a migration needs doing, the application won't start right away but kicks that process off:
> > - download an entire dump of Fuseki to a file on disk (ca. 20 min.)
> > - load the dump into an in-memory Jena model (10 min., plus huge memory consumption that will always grow in proportion to our data)
> > - perform the migration on the in-memory model (1 sec. - 1 min.)
> > - dump the model to a file on disk
> > - drop all graphs from Fuseki (20 min.)
> > - upload the dump into Fuseki (20 min.)
> > Then the application would start... so at least 1h of downtime, clearly room for improvement.
> > The good thing about this approach is that if the migration fails, the data is not corrupted, because the data loaded in Fuseki is not affected.
> >
> > My best bet at this point is to say we take the risk of data corruption (thank god for backups!) and operate on the live Fuseki database. This cuts out the time-consuming downloading, uploading etc. and solves the memory issue of loading the entire database into an in-memory model. The migration is then either just SPARQL or a programmatic series of database interactions leading to update queries.
> > That would probably take us down from 1h of downtime to about 1 min, which would be a huge improvement.
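(For the "read the live data, generate updates in code" variant just described, a rough sketch of how it could look with Jena's RDFConnection; the endpoint URL, vocabulary and query are made-up placeholders, and RDFConnection.connect(...) assumes a reasonably recent Jena -- older versions use RDFConnectionFactory.connect(...) instead:)

    import java.util.ArrayList;
    import java.util.List;

    import org.apache.jena.query.QuerySolution;
    import org.apache.jena.rdfconnection.RDFConnection;

    public class LiveMigration {
        public static void main(String[] args) {
            // Placeholder: the Fuseki dataset endpoint the application already talks to.
            String dataset = "http://localhost:3030/ds";

            try (RDFConnection conn = RDFConnection.connect(dataset)) {
                List<String> updates = new ArrayList<>();

                // Read phase: find the resources that need migrating (placeholder query).
                conn.querySelect(
                    "PREFIX ex: <http://example.org/ontology#> "
                        + "SELECT ?v WHERE { ?v a ex:TextValue . "
                        + "  FILTER NOT EXISTS { ?v ex:markupType ?m } }",
                    (QuerySolution row) -> {
                        String v = row.getResource("v").getURI();
                        // Decide in code what each resource should become
                        // and emit an update operation for it.
                        updates.add("PREFIX ex: <http://example.org/ontology#> "
                            + "INSERT DATA { <" + v + "> ex:markupType ex:NoMarkup }");
                    });

                // Write phase: send everything as a single update request, so the
                // server applies the whole change set in one go.
                if (!updates.isEmpty()) {
                    conn.update(String.join(" ;\n", updates));
                }
            }
        }
    }

(For very large change sets the updates would need to be batched rather than sent as one giant request, but the shape stays the same, and nothing ever has to be loaded fully into application memory.)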
> > Does that sound reasonable? Are there better ways? Anything I'm missing?
> >
> > Best & thanks! (and sorry for the wall of text)
> > Balduin
> >
> >
> > -----Original Message-----
> > From: Andy Seaborne <[email protected]>
> > Sent: Friday, 9 February 2024 14:12
> > To: [email protected]
> > Subject: Re: Database Migrations in Fuseki
> >
> > Hi Balduin,
> >
> > On 07/02/2024 11:05, Balduin Landolt wrote:
> > > Hi everyone,
> > >
> > > we're storing data in Fuseki as a persistence for our application backend; the data is structured according to the application logic.
> > > Whenever something changes in our application logic, we have to do a database migration so that the data conforms to the updated model.
> > > Our current solution to that is very home-spun, not exactly stable, and comes with a lot of downtime, so we try to avoid it whenever possible.
> >
> > If I understand correctly, this is a schema change requiring the data to change.
> >
> > The transformation of the data to the updated data model could be done offline, that would reduce downtime. If the data is being continuously updated, that's harder because the offline copy will get out of step with the live data.
> >
> > How often does the data change (not due to application logic changes)?
> >
> > > I'm now looking into how this could be improved in the future. My double question is:
> > > 1) is there any tooling I missed, to help with this process? (In SQL world, for example, there are out-of-the-box solutions for that.)
> > > 2) and if not, more broadly, does anyone have any hints on how I could best go about this?
> >
> > Do you have a concrete example of such a change? Maybe change-in-place is possible, but that depends on how updates happen and how the data feeds change with the application logic change.
> >
> > Andy
> >
> > > Thanks in advance!
> > > Balduin
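P.S. For completeness, a minimal sketch of the TDB-direct route Arne describes above (this is the sketch referred to at the top of this mail). It assumes TDB2 and that the dataset directory Fuseki uses is reachable from the Java process (the path below is only illustrative); Fuseki has to be stopped while it runs. For classic TDB1 it would be TDBFactory.createDataset(...) instead.

    import org.apache.jena.query.Dataset;
    import org.apache.jena.system.Txn;
    import org.apache.jena.tdb2.TDB2Factory;
    import org.apache.jena.update.UpdateAction;

    public class OfflineMigration {
        public static void main(String[] args) {
            // Illustrative path: point this at the on-disk database Fuseki serves.
            Dataset dataset = TDB2Factory.connectDataset("/fuseki/databases/ds");

            // Everything inside executeWrite is one transaction: if the code throws,
            // the transaction is aborted and the on-disk data stays untouched;
            // if it returns normally, the changes are committed.
            Txn.executeWrite(dataset, () -> {
                // Inspect/modify via the Model API (dataset.getDefaultModel(), ...)
                // and/or run SPARQL updates directly against the dataset, e.g.:
                UpdateAction.parseExecute(
                    "PREFIX ex: <http://example.org/ontology#> "
                        + "INSERT { ?s ex:newlyRequired \"default value\" } "
                        + "WHERE  { ?s a ex:Thing . "
                        + "  FILTER NOT EXISTS { ?s ex:newlyRequired ?x } }",
                    dataset);
            });

            dataset.close();
        }
    }

Afterwards Fuseki can simply be started again on the same directory.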
