Re: [MASSMAIL]Re: Large *.dat files in Fuseki

Andy Seaborne Thu, 14 Jul 2022 09:36:35 -0700



On 07/07/2022 16:19, Lorenz Buehmann wrote:

I think we should wait for Andy here for further input, as he's the person who basically designed and implemented all the fancy stuff and surely has better advice.
@Andy Did you read the whole discussion, and can you verify that it's expected behavior that lots of daily updates lead to such a big growth of the node table files?


Sorry for the delay.

There is no Lucene index by default.

SPO.dat is not related to the node table - it is the base level of the SPO B+Tree. SPO.idn is the tree above the base level, and SPO.bpt keeps the pointers to the root block and some size information.

The issue looks to be the large number of small updates. TDB2 uses a copy-on-write MVCC scheme, which means transactions can proceed without needing latches (database locks) but has the consequence of needing compaction.
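To illustrate why such a scheme needs compaction, here is a toy sketch in Python. It is not TDB2's implementation: the class name `CopyOnWriteFile`, the 8192-byte block size, and the block counts are all assumptions made for the illustration. Each write appends a new version of a block instead of overwriting the old one, so the file keeps growing until a compaction copies only the live versions into a fresh file.

```python
# Toy model of a copy-on-write block file; NOT TDB2's actual implementation.
class CopyOnWriteFile:
    def __init__(self, block_size=8192):  # block size is an assumed value
        self.block_size = block_size
        self.blocks = []   # every block version ever written (append-only)
        self.live = {}     # logical block id -> index of its current version

    def write(self, block_id, payload):
        """Commit a change by appending a new version; the old one stays on disk."""
        self.blocks.append((block_id, payload))
        self.live[block_id] = len(self.blocks) - 1

    def size_bytes(self):
        return len(self.blocks) * self.block_size

    def compact(self):
        """Return a fresh file holding only the live version of each block."""
        fresh = CopyOnWriteFile(self.block_size)
        for block_id, index in sorted(self.live.items()):
            fresh.write(block_id, self.blocks[index][1])
        return fresh


store = CopyOnWriteFile()
for block_id in range(100):      # initial load: 100 distinct blocks
    store.write(block_id, "v0")
for n in range(10_000):          # many small updates that all hit one block
    store.write(7, f"v{n + 1}")

compacted = store.compact()
print(store.size_bytes(), compacted.size_bytes())
```

In this model the amount of live data never changes, yet the file grows with every update; only the compacted copy returns to the original size.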

TDB1 with Fuseki is worth a try. It does not use this scheme. It does grow, but much more slowly. It is limited in the size of updates it can handle, but the limit is nowhere near what you describe.


Also worth trying is compaction and deletion:

/$/compact/db_name?deleteOld=true

which will delete the old database after compaction (only the one just compacted; older ones can be deleted manually).
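A minimal sketch of issuing that request from Python, using only the standard library. The base URL `http://localhost:3030` and the helper name `compact_request` are assumptions for the example; the path and the `deleteOld=true` parameter are the ones given above. The request is only built here, and the lines that would send it are left commented out.

```python
import urllib.parse
import urllib.request


def compact_request(base_url, db_name, delete_old=True):
    """Build (but do not send) the POST request for the compact endpoint."""
    url = f"{base_url.rstrip('/')}/$/compact/{urllib.parse.quote(db_name)}"
    if delete_old:
        url += "?deleteOld=true"
    # An empty body with method="POST" makes this an HTTP POST.
    return urllib.request.Request(url, data=b"", method="POST")


req = compact_request("http://localhost:3030", "db_name")
print(req.get_method(), req.full_url)

# To run it against a live server (the admin endpoint may require credentials):
# with urllib.request.urlopen(req) as resp:
#     print(resp.status, resp.read().decode())
```

With frequent small updates, a request like this could be scheduled periodically so that the data directory does not accumulate old copies.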


    Andy

On 07.07.22 10:53, Bartalus Gáspár wrote:
Hi Lorenz,
Would you recommend using TDB1 instead of TDB2 for our use case? What would be the differences?
We are using Fuseki 4.5.0, btw.

Gaspar
On 6 Jul 2022, at 14:39, Bartalus Gáspár <[email protected]> wrote:
Hi,

Most of the updates are DELETE/INSERT queries, e.g.:

DELETE {?s ?p ?oldValue}
INSERT {?s ?p ?newValue}
WHERE {
  OPTIONAL {?s ?p ?oldValue}
  #derive ?newValue from somewhere
}

We also have some separate DELETE queries and INSERT queries.
I’ve tried HTTP POST /$/compact/db_name and as a result the files are getting back to normal size. However, as far as I can tell the old files are also kept. This is the folder structure I see:
- databases/db_name/Data-0001 - with the old large files
- databases/db_name/Data-0002 - presumably the result of the compact operation with normal file sizes.
Is there also some operation (HTTP or CLI) that would keep only one (the latest) data folder, i.e. delete the old files from Data-0001?
Gaspar
On 6 Jul 2022, at 12:52, Lorenz Buehmann <[email protected]> wrote:
Ok, interesting

so

we have

- 150k triples, rather small dataset

- loaded into 10MB node table files

- 10 updates every 5s

- which makes up to 24 * 60 * 60 / 5 * 10 ~ 200k updates per day

- and leads to 10GB node table files
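As a quick check of the arithmetic in the list above, a short Python sketch. The figures (10 updates every 5 seconds, growth to roughly 10 GB in about a day) come from the thread; using decimal gigabytes and treating every update as a write is a simplifying assumption, since the thread notes that most updates do not alter the dataset.

```python
seconds_per_day = 24 * 60 * 60
updates_per_day = seconds_per_day // 5 * 10   # 10 updates every 5 seconds
print(updates_per_day)                         # 172800, i.e. a bit under 200k

# Rough average growth per update if a file reaches ~10 GB after one day
# (assumes decimal GB and that every update contributes equally).
growth_bytes = 10 * 10**9
print(round(growth_bytes / updates_per_day))   # about 57870 bytes per update
```

Under these assumptions the average growth is on the order of tens of kilobytes per update.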


Can you share the shape of those update queries?
After doing a "compact" operation, are the files getting back to "normal" size?
On 06.07.22 10:36, Bartalus Gáspár wrote:
Hi Lorenz,

Thanks for the quick feedback and the clarification on Lucene indexes.

Here are my answers to your questions:
- We are uploading 7 ttl files to our dataset, where one is larger (6 MB) and the others are below 200 KB.
- The overall number of triples after data upload is ~150000.
- We have around 10 SPARQL UPDATE queries that are executed on a regular and frequent basis, i.e. every 5 seconds. We also have 5 such queries that are executed each minute. But most of the time they do not produce any outcome, i.e. the dataset is not altered, and when they do, there are just a couple of triples that are added to the dataset.
- These *.dat files start from ~10 MB in size, and after a day or so some of them grow to ~10 GB.
- We have ~300 blank nodes, and ~half of the triples have a literal in the object position, so ~75000.
Best regards,
Gaspar
On 6 Jul 2022, at 10:55, Lorenz Buehmann <[email protected]> wrote:
Hi and welcome Gaspar.


Those files do contain the node tables.
A Lucene index is never computed by default and would be contained in Lucene-specific index files.
Can you give some details about the

- size of the files
- the number of triples
- the number of triples added/removed/changed
- the frequency of updates
- how much the files grow
- what kind of data you insert? Lots of blank nodes? Or literals?

Also, did you try a compact operation at some point?

Lorenz

On 06.07.22 09:40, Bartalus Gáspár wrote:
Hi Jena support team,
We are experiencing an issue with Jena Fuseki databases. In the databases folder we see some files called SPO.dat, OSP.dat, etc., and the size of these files is growing quickly. From our understanding these files contain the Lucene indexes. We would have two questions:
1. Why are these files growing rapidly, although the underlying data (triples) are not being changed, or only slightly changed?
2. Can we disable indexing easily, since we are not using full-text searches in our SPARQL queries?
Our usage of Jena Fuseki:

* Start the server with `fuseki-server --port 3030`
* Create databases with HTTP POST to `/$/datasets?state=active&dbType=tdb2&dbName=db_name`
* Upload ttl files with HTTP POST to /db_name/data
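The two HTTP steps in the list above could be scripted roughly as follows. This is a sketch under assumptions: the base URL `http://localhost:3030` is inferred from the port in the start command, the helper names are hypothetical, and the `text/turtle` content type for the upload is an assumption not stated in the thread. The requests are only constructed and inspected, not sent.

```python
import urllib.parse
import urllib.request

BASE = "http://localhost:3030"  # assumed from `--port 3030`


def create_dataset_request(db_name, db_type="tdb2"):
    """Build the POST that creates a dataset, with parameters in the query string."""
    query = urllib.parse.urlencode(
        {"state": "active", "dbType": db_type, "dbName": db_name}
    )
    return urllib.request.Request(f"{BASE}/$/datasets?{query}", data=b"", method="POST")


def upload_turtle_request(db_name, turtle_bytes):
    """Build the POST that uploads Turtle data to the dataset's data endpoint."""
    return urllib.request.Request(
        f"{BASE}/{db_name}/data",
        data=turtle_bytes,
        headers={"Content-Type": "text/turtle"},  # assumed content type
        method="POST",
    )


create = create_dataset_request("db_name")
upload = upload_turtle_request("db_name", b"<urn:s> <urn:p> <urn:o> .\n")
print(create.full_url)
print(upload.full_url, upload.get_header("Content-type"))

# Sending is left to the caller, e.g. urllib.request.urlopen(create)
```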
Thanks in advance for your feedback, and if you require more input from our side, please let me know.
Best regards,
Gaspar Bartalus
