Hi Andy & Lorenz,

Thanks for the clarification and support.

Best regards,
Gaspar

> On 14 Jul 2022, at 19:36, Andy Seaborne <[email protected]> wrote:
>
> On 07/07/2022 16:19, Lorenz Buehmann wrote:
>> I think we should wait for Andy here with further input, as he's the person
>> who basically designed and implemented all the fancy stuff and will surely
>> have better advice.
>> @Andy Did you read the whole discussion, and can you verify that it's
>> expected behavior that lots of daily updates lead to such a big growth of
>> the node table files?
>
> Sorry for the delay.
>
> There is no Lucene index by default.
>
> SPO.dat is not node table related - it is the base level of the SPO B+Tree.
> SPO.idn is the tree above the base level, and SPO.bpt keeps the pointers to
> the root block and some size information.
>
> The issue looks to be the large number of small updates. TDB2 uses a
> copy-on-write MVCC scheme, which means transactions can proceed without
> needing latches (database locks) but has the consequence of needing
> compaction.
>
> TDB1 with Fuseki is worth a try. It does not use that scheme. It does grow,
> but much more slowly. It is limited in the size of updates it can handle, but
> the limit is nowhere near what you describe.
>
> Also worth trying is compaction and deletion:
>
> /$/compact/db_name?deleteOld=true
>
> which will delete the old database after compaction (only the one just
> compacted; older ones can be deleted manually).
>
> Andy
>
>> On 07.07.22 10:53, Bartalus Gáspár wrote:
>>> Hi Lorenz,
>>>
>>> Would you recommend using TDB1 instead of TDB2 for our use case? What would
>>> be the differences?
>>> We are using Fuseki 4.5.0, btw.
>>>
>>> Gaspar
>>>
>>>> On 6 Jul 2022, at 14:39, Bartalus Gáspár
>>>> <[email protected]> wrote:
>>>>
>>>> Hi,
>>>>
>>>> Most of the updates are DELETE/INSERT queries, i.e.
>>>>
>>>> DELETE {?s ?p ?oldValue}
>>>> INSERT {?s ?p ?newValue}
>>>> WHERE {
>>>>   OPTIONAL {?s ?p ?oldValue}
>>>>   #derive ?newValue from somewhere
>>>> }
>>>>
>>>> We also have some separate DELETE queries and INSERT queries.
>>>>
>>>> I've tried HTTP POST /$/compact/db_name and as a result the files are
>>>> getting back to normal size. However, as far as I can tell the old files
>>>> are also kept. This is the folder structure I see:
>>>> - databases/db_name/Data-0001 - with the old large files
>>>> - databases/db_name/Data-0002 - presumably the result of the compact
>>>>   operation, with normal file sizes.
>>>>
>>>> Is there also some operation (HTTP or CLI) that would keep only one (the
>>>> latest) data folder, i.e. delete the old files from Data-0001?
>>>>
>>>> Gaspar
>>>>
>>>>> On 6 Jul 2022, at 12:52, Lorenz Buehmann
>>>>> <[email protected]> wrote:
>>>>>
>>>>> Ok, interesting.
>>>>>
>>>>> So we have
>>>>>
>>>>> - 150k triples, a rather small dataset
>>>>> - loaded into 10 MB node table files
>>>>> - 10 updates every 5s
>>>>> - which makes up to 24 * 60 * 60 / 5 * 10 ~ 200k updates per day
>>>>> - and leads to 10 GB node table files
>>>>>
>>>>> Can you share the shape of those update queries?
>>>>>
>>>>> After doing a "compact" operation, are the files getting back to "normal"
>>>>> size?
>>>>>
>>>>> On 06.07.22 10:36, Bartalus Gáspár wrote:
>>>>>> Hi Lorenz,
>>>>>>
>>>>>> Thanks for the quick feedback and the clarification on Lucene indexes.
>>>>>>
>>>>>> Here are my answers to your questions:
>>>>>> - We are uploading 7 ttl files to our dataset, where 1 is larger (6 MB)
>>>>>>   and the others are below 200 KB.
>>>>>> - The overall number of triples after data upload is ~150000.
>>>>>> - We have around 10 SPARQL UPDATE queries that are executed on a regular
>>>>>>   and frequent basis, i.e. every 5 seconds. We also have 5 such queries
>>>>>>   that are executed each minute. But most of the time they do not produce
>>>>>>   any outcome, i.e. the dataset is not altered, and when they do, there
>>>>>>   are just a couple of triples that are added to the dataset.
>>>>>> - These *.dat files start from ~10 MB in size, and after a day or so some
>>>>>>   of them grow to ~10 GB.
>>>>>>
>>>>>> We have ~300 blank nodes, and ~half of the triples have a literal in the
>>>>>> object position, so ~75000.
>>>>>>
>>>>>> Best regards,
>>>>>> Gaspar
>>>>>>
>>>>>>> On 6 Jul 2022, at 10:55, Lorenz Buehmann
>>>>>>> <[email protected]> wrote:
>>>>>>>
>>>>>>> Hi and welcome Gaspar.
>>>>>>>
>>>>>>> Those files do contain the node tables.
>>>>>>>
>>>>>>> A Lucene index is never computed by default and would be contained in
>>>>>>> Lucene-specific index files.
>>>>>>>
>>>>>>> Can you give some details about
>>>>>>>
>>>>>>> - the size of the files
>>>>>>> - the number of triples
>>>>>>> - the number of triples added/removed/changed
>>>>>>> - the frequency of updates
>>>>>>> - how much the files grow
>>>>>>> - what kind of data you insert? Lots of blank nodes? Or literals?
>>>>>>>
>>>>>>> Also, did you try a compact operation over time?
>>>>>>>
>>>>>>> Lorenz
>>>>>>>
>>>>>>> On 06.07.22 09:40, Bartalus Gáspár wrote:
>>>>>>>> Hi Jena support team,
>>>>>>>>
>>>>>>>> We are experiencing an issue with Jena Fuseki databases. In the
>>>>>>>> databases folder we see some files called SPO.dat, OSP.dat, etc., and
>>>>>>>> the size of these files is growing quickly. From our understanding
>>>>>>>> these files contain the Lucene indexes. We have two questions:
>>>>>>>>
>>>>>>>> 1. Why are these files growing rapidly, although the underlying data
>>>>>>>>    (triples) are not being changed, or only slightly changed?
>>>>>>>> 2. Can we disable indexing easily, since we are not using full text
>>>>>>>>    searches in our SPARQL queries?
>>>>>>>>
>>>>>>>> Our usage of Jena Fuseki:
>>>>>>>>
>>>>>>>> * Start the server with `fuseki-server --port 3030`
>>>>>>>> * Create databases with HTTP POST to
>>>>>>>>   `/$/datasets?state=active&dbType=tdb2&dbName=db_name`
>>>>>>>> * Upload ttl files with HTTP POST to /db_name/data
>>>>>>>>
>>>>>>>> Thanks in advance for your feedback, and if you'd require more input
>>>>>>>> from our side, please let me know.
>>>>>>>>
>>>>>>>> Best regards,
>>>>>>>> Gaspar Bartalus
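
The compaction call discussed in the thread (HTTP POST to `/$/compact/db_name?deleteOld=true`) can be scripted. Below is a minimal sketch in Python, not a definitive implementation. It assumes the server is reachable at `http://localhost:3030` (the port comes from the thread, the host is an assumption), that the admin endpoint needs no authentication, and that the dataset is named `db_name` as in the thread. The function names `build_compact_request` and `compact` are hypothetical helpers introduced only for illustration.

```python
from urllib.parse import quote
from urllib.request import Request, urlopen


def build_compact_request(base_url: str, db_name: str, delete_old: bool = True) -> Request:
    """Build (but do not send) the POST request for the compact endpoint.

    base_url and db_name are assumptions for illustration, e.g.
    "http://localhost:3030" and "db_name".
    """
    # Path taken from the thread: /$/compact/<dataset name>
    url = f"{base_url.rstrip('/')}/$/compact/{quote(db_name, safe='')}"
    if delete_old:
        # Query parameter taken from the thread: removes the database
        # directory that was just compacted.
        url += "?deleteOld=true"
    return Request(url, method="POST")


def compact(base_url: str, db_name: str, delete_old: bool = True) -> int:
    """Send the compaction request and return the HTTP status code."""
    with urlopen(build_compact_request(base_url, db_name, delete_old)) as response:
        return response.status


# Example (requires a running Fuseki server, so it is left commented out):
# status = compact("http://localhost:3030", "db_name")
```

Separating request construction from sending makes it possible to inspect the URL before anything touches the server. How often such a call should run is not covered in the thread and would depend on the update load.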
