Re: Heap space problem with insert where

Harri Kiiskinen Fri, 24 Sep 2021 04:22:24 -0700

Thanks a lot for the explanations!

Harri


On 24.9.2021 12.15, Andy Seaborne wrote:

Inline...

On 24/09/2021 09:13, Harri Kiiskinen wrote:
Hi all,
and thanks for the support! I did manage to resolve the problem by modifying the query, detailed comments below.
Harri K.

On 23.9.2021 22.47, Andy Seaborne wrote:
I guess you are using TDB2 if you have -Xmx2G. TDB1 will use even more heap space.
Yes, TDB2.
All those named variables mean that the intermediate results are being held onto. That includes the "no change" case. It looks like a REPLACE that changes nothing still produces a new string.
I was afraid this might be the case.
There is at least 8 Gbytes just there by my rough calculation.
-Xmx12G was not enough, so even more, I guess.
Things to try:

1/
Replace the use of named variables by a single expression
REPLACE (REPLACE( .... ))
This did the trick. Combining all the replaces into one as above was enough to keep the memory use below 7 GB.
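
A minimal sketch of what the single-expression form might look like. The vice: prefix IRI is an assumption, and so are the second and third replacement pairs; only 'uͤ' to 'ü' appears in this thread.

```sparql
PREFIX vice: <http://example.org/vice#>   # hypothetical namespace IRI

INSERT {
  GRAPH vice:pageocrdata_clean { ?page vice:ocrtext ?clean }
}
WHERE {
  GRAPH vice:pageocrdata { ?page vice:ocrtext ?ocr }
  # One nested expression bound to one variable, so no chain of
  # ?ocr1 ... ?ocr7 strings is carried along in every solution row.
  BIND(REPLACE(REPLACE(REPLACE(?ocr,
         'uͤ', 'ü'),        # pair taken from this thread
         'aͤ', 'ä'),        # assumed pair, for illustration only
         'oͤ', 'ö')         # assumed pair, for illustration only
       AS ?clean)
}
```

The intermediate strings are still computed, but they are not named, so they do not have to stay in each solution row.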
I also tried replacing the BINDs with the Jena-specific LET constructs (https://jena.apache.org/documentation/query/assignment.html), but that had no effect. Is LET just a pre-SPARQL-1.1 addition that is practically the same as BIND, or is there a meaningful difference between the two?
LET is sort of the same as BIND.
You can assign to already set variables, in which case the values must match (c.f. FILTER). It makes the assignment order-independent.
It means

   LET(...AS ?X)
   pattern involving ?X

and

   pattern involving ?X
   LET(...AS ?X)

have the same results.
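
A sketch of the two orderings, assuming ARQ's extended syntax is enabled and that LET is written as LET (?var := expr) as in the Jena assignment documentation linked above. The vice: prefix IRI and the literal value are hypothetical.

```sparql
PREFIX vice: <http://example.org/vice#>   # hypothetical namespace IRI

# Assignment first: ?ocr is set, then the pattern has to match it.
SELECT ?page WHERE {
  LET (?ocr := "some text")
  GRAPH vice:pageocrdata { ?page vice:ocrtext ?ocr }
}
```

```sparql
PREFIX vice: <http://example.org/vice#>   # hypothetical namespace IRI

# Pattern first: ?ocr is already set, so LET only keeps rows where
# the existing value matches, much like FILTER(?ocr = "some text").
SELECT ?page WHERE {
  GRAPH vice:pageocrdata { ?page vice:ocrtext ?ocr }
  LET (?ocr := "some text")
}
```

With BIND, the second form is not legal SPARQL 1.1, because the variable introduced by BIND must not already be used earlier in the group.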
2/ (expanding on Macros' email):
If you are using TDB2:

First transaction:
COPY vice:pageocrdata TO vice:pageocrdata_clean
or
insert {
     graph vice:pageocrdata_clean {
       ?page vice:ocrtext ?X .
     }
   }
   where {
     graph vice:pageocrdata {
       ?page vice:ocrtext ?X .
     }
   }

Then a second transaction applies the changes:

WITH vice:pageocrdata_clean
DELETE { ?page vice:ocrtext ?ocr }
INSERT { ?page vice:ocrtext ?ocr7 }
WHERE {
     ?page vice:ocrtext ?ocr .
     BIND(replace(?ocr,'uͤ','ü') AS ?ocr7)
     FILTER (?ocr != ?ocr7)
}
Is there a big difference in working within one graph as compared to inter-graph update operations?
No - it's all quads.
But here the COPY means there isn't a lot of intermediate calculation of RDF terms. The two graphs share terms.
Then the DELETE-INSERT only modifies things that need changing.
In the original setup all ?ocr1-7 are internally new strings whether the REPLACE did anything or not.
Just asking because I'm compartmentalizing my data into different graphs quite a lot, but if it is significantly more expensive, I may have to rethink some processes, like the one shown above.
3/
If it is TDB1 and none of that works, maybe reduce the internal transaction space as well.
It so happens that SELECT LIMIT OFFSET is predictable for a persistent database (this is not portable!!).
WHERE {
    {
      SELECT ?page ?ocr
      { graph vice:pageocrdata { ?page vice:ocrtext ?ocr . } }
      OFFSET ... LIMIT ...
    }
    All the BIND
}

(or filter by prefix: first where ?ocr starts with "A", then with "B", and so on)
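
A sketch of one such batch, run once per slice in its own transaction. The batch size of 10000, the vice: prefix IRI and the target variable ?clean are assumptions. As noted above, relying on OFFSET/LIMIT without ORDER BY being stable is not portable.

```sparql
PREFIX vice: <http://example.org/vice#>   # hypothetical namespace IRI

INSERT {
  GRAPH vice:pageocrdata_clean { ?page vice:ocrtext ?clean }
}
WHERE {
  {
    # Only this slice of the data is processed in this update.
    SELECT ?page ?ocr
    WHERE { GRAPH vice:pageocrdata { ?page vice:ocrtext ?ocr } }
    OFFSET 0 LIMIT 10000      # next run: OFFSET 10000, then 20000, ...
  }
  BIND(REPLACE(?ocr, 'uͤ', 'ü') AS ?clean)
}
```

The prefix-filter variant might replace the subquery with a filter like this one, changing the letter on each run:

```sparql
PREFIX vice: <http://example.org/vice#>   # hypothetical namespace IRI

SELECT ?page ?ocr
WHERE {
  GRAPH vice:pageocrdata { ?page vice:ocrtext ?ocr }
  FILTER(STRSTARTS(STR(?ocr), "A"))   # then "B", "C", ...
}
```

Texts that start with a character outside the chosen letters would need a final catch-all run.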

     Andy
Ah, yes, of course, this may come in handy with even larger datasets.
BTW : replace(str(?ocr), ...
Any URIs will turn into strings and any language tags will be lost.
Yes, that is unnecessary.
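
A small sketch of the difference, using a made-up sample literal. Per the SPARQL 1.1 definition of REPLACE, the result is a literal of the same kind as the first argument, so the language tag should survive only when STR is left out.

```sparql
SELECT ?kept ?lost
WHERE {
  BIND("Buͤcher"@de AS ?ocr)                      # hypothetical sample value
  BIND(REPLACE(?ocr, 'uͤ', 'ü') AS ?kept)         # language tag @de is kept
  BIND(REPLACE(STR(?ocr), 'uͤ', 'ü') AS ?lost)    # STR() drops the tag first
}
```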



--
Tutkijatohtori / post-doctoral researcher
Viral Culture in the Early Nineteenth-Century Europe (ViCE)

Movie Making Finland: Finnish fiction films as audiovisual big data, 1907–2017 (MoMaF)

Turun yliopisto / University of Turku
