Thanks a lot for the explanations!

Harri

On 24.9.2021 12.15, Andy Seaborne wrote:
Inline...

On 24/09/2021 09:13, Harri Kiiskinen wrote:
Hi all,

and thanks for the support! I did manage to resolve the problem by modifying the query, detailed comments below.

Harri K.

On 23.9.2021 22.47, Andy Seaborne wrote:
I guess you are using TDB2 if you have -Xmx2G. TDB1 will use even more heap space.

Yes, TDB2.

All those named variables mean that the intermediate results are being held onto. That includes the "no change" case. It looks like REPLACE and no change is still a new string.

I was afraid this might be the case.

There is at least 8 Gbytes just there by my rough calculation.

-Xmx12G was not enough, so even more, I guess.

Things to try:

1/
Replace the use of named variables by a single expression
REPLACE (REPLACE( .... ))

This did the trick. Combining all the replaces to one as above was enough to keep the memory use below 7 GB.
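
A minimal sketch of the single-expression form, for readers of the archive. Only the 'uͤ' to 'ü' replacement appears in this thread; the other two patterns, the variable name ?clean and the vice: namespace IRI are assumptions for illustration.

```sparql
PREFIX vice: <http://example.org/vice#>   # hypothetical namespace IRI

WITH vice:pageocrdata_clean
DELETE { ?page vice:ocrtext ?ocr }
INSERT { ?page vice:ocrtext ?clean }
WHERE {
  ?page vice:ocrtext ?ocr .
  # One nested expression: no intermediate ?ocr1, ?ocr2, ... variables are bound,
  # so only the final string is kept per solution row.
  BIND(REPLACE(REPLACE(REPLACE(?ocr, 'aͤ', 'ä'), 'oͤ', 'ö'), 'uͤ', 'ü') AS ?clean)
  # Skip rows where nothing changed.
  FILTER (?ocr != ?clean)
}
```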

I also tried replacing the BINDs with the Jena-specific LET constructs (https://jena.apache.org/documentation/query/assignment.html) but that had no effect – is LET just a pre-SPARQL-1.1 addition that is practically the same as BIND, or is there a meaningful difference between the two?

LET is sort of the same as BIND.

You can assign to already set variables in which case the values must match (c.f. FILTER). It makes the assignment order independent.

It means

   LET(...AS ?X)
   pattern involving ?X

and

   pattern involving ?X
   LET(...AS ?X)

have the same results.
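
As a concrete sketch of the first ordering: the Jena assignment documentation writes the construct as LET (?var := expression), and it needs the ARQ extended syntax rather than standard SPARQL 1.1. The literal value and the vice: namespace IRI below are assumptions for illustration.

```sparql
# ARQ extended syntax; not standard SPARQL 1.1.
PREFIX vice: <http://example.org/vice#>   # hypothetical namespace IRI

SELECT ?page
WHERE {
  LET (?X := "example text")      # hypothetical value
  ?page vice:ocrtext ?X .         # pattern involving ?X
}
```

Swapping the two lines inside WHERE gives the second ordering. With BIND that second ordering is a syntax error in SPARQL 1.1, because ?X would already be in use in the group before the BIND.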


2/ (expanding on Macros' email):
If you are using TDB2:

First transaction:
COPY vice:pageocrdata TO vice:pageocrdata_clean
or
insert {
     graph vice:pageocrdata_clean {
       ?page vice:ocrtext ?X .
     }
   }
   where {
     graph vice:pageocrdata {
       ?page vice:ocrtext ?X .
     }
   }

Then apply the changes:

WITH vice:pageocrdata_clean
DELETE { ?page vice:ocrtext ?ocr }
INSERT { ?page vice:ocrtext ?ocr7 }
WHERE {
     ?page vice:ocrtext ?ocr .
     BIND(replace(?ocr,'uͤ','ü') AS ?ocr7)
     FILTER (?ocr != ?ocr7)
}

Is there a big difference between working within one graph and inter-graph update operations?

No - it's all quads.

But here the COPY means there isn't a lot of intermediate calculation of RDF terms. The two graphs share terms.

Then the DELETE-INSERT only modifies things that need changing.
In the original setup all ?ocr1-7 are internally new strings whether the REPLACE did anything or not.
Just asking because I'm compartmentalizing my data into different graphs quite a lot, but if it is significantly more expensive, I may have to rethink some processes, like the one shown above.

3/
If you are on TDB1 and none of that works, maybe reduce the internal transaction space as well.

It so happens that SELECT LIMIT OFFSET is predictable for a persistent database (this is not portable!!).

WHERE {
    {
      SELECT ?ocr
      { graph vice:pageocrdata { ?page vice:ocrtext ?ocr . } }
      OFFSET ... LIMIT ...
    }
    All the BIND
}

(or filter by prefix: first the ?ocr values that start with "A", then those that start with "B", and so on)
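
A sketch of one such prefix batch, assuming the single 'uͤ' replacement from earlier in the thread. The variable name ?clean and the vice: namespace IRI are assumptions. STRSTARTS is case-sensitive, and the batches together have to cover every possible first character, otherwise some rows are never processed.

```sparql
PREFIX vice: <http://example.org/vice#>   # hypothetical namespace IRI

WITH vice:pageocrdata_clean
DELETE { ?page vice:ocrtext ?ocr }
INSERT { ?page vice:ocrtext ?clean }
WHERE {
  ?page vice:ocrtext ?ocr .
  # This batch only touches texts starting with "A"; rerun with "B", "C", ...
  FILTER (STRSTARTS(?ocr, "A"))
  BIND(REPLACE(?ocr, 'uͤ', 'ü') AS ?clean)
  FILTER (?ocr != ?clean)
}
```

Each batch runs as its own update, so the working set per transaction stays smaller than in the all-at-once version.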

     Andy

Ah, yes, of course, this may become handy with even larger datasets.

BTW : replace(str(?ocr), ...
Any URIs will turn into strings and any language tags will be lost.
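
A small query to check this on your own setup. The sample literal and the variable names are assumptions for illustration. STR() returns a plain string, so the second LANG() call returns the empty string; the first column shows what REPLACE does with the tag when it is given the literal directly.

```sparql
SELECT (LANG(?direct) AS ?directLang) (LANG(?viaStr) AS ?viaStrLang)
WHERE {
  BIND("fuͤr"@de AS ?ocr)                          # hypothetical sample literal
  BIND(REPLACE(?ocr, 'uͤ', 'ü') AS ?direct)        # REPLACE on the literal as-is
  BIND(REPLACE(STR(?ocr), 'uͤ', 'ü') AS ?viaStr)   # STR() first: the @de tag is gone
}
```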

Yes, that is unnecessary.



--
Tutkijatohtori / post-doctoral researcher
Viral Culture in the Early Nineteenth-Century Europe (ViCE)
Movie Making Finland: Finnish fiction films as audiovisual big data, 1907–2017 (MoMaF)
Turun yliopisto / University of Turku
