Thanks a lot for the explanations!
Harri
On 24.9.2021 12.15, Andy Seaborne wrote:
Inline...
On 24/09/2021 09:13, Harri Kiiskinen wrote:
Hi all,
and thanks for the support! I did manage to resolve the problem by
modifying the query, detailed comments below.
Harri K.
On 23.9.2021 22.47, Andy Seaborne wrote:
I guess you are using TDB2 if you have -Xmx2G. TDB1 will use even more
heap space.
Yes, TDB2.
All those named variables mean that the intermediate results are
being held onto. That includes the "no change" case. It looks like
REPLACE and no change is still a new string.
I was afraid this might be the case.
There are at least 8 GB just there, by my rough calculation.
-Xmx12G was not enough, so even more, I guess.
Things to try:
1/
Replace the use of named variables by a single expression
REPLACE (REPLACE( .... ))
This did the trick. Combining all the replaces into one as above was
enough to keep the memory use below 7 GB.
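
For readers following along, a minimal sketch of the nested form might look like the query below. The PREFIX IRI is a placeholder and the 'aͤ' and 'oͤ' pairs are hypothetical; only the 'uͤ' to 'ü' replacement appears in this thread. The point is that a single BIND names only the final result, so no intermediate ?ocr1, ?ocr2, ... variables are kept in each solution.

```sparql
# Placeholder namespace (assumption); substitute the real vice: IRI.
PREFIX vice: <http://example.org/vice#>

SELECT ?page ?clean
WHERE {
  GRAPH vice:pageocrdata { ?page vice:ocrtext ?ocr . }
  # One expression, one named variable. The first two pairs are
  # hypothetical examples; 'uͤ' -> 'ü' is the one from the thread.
  BIND(REPLACE(REPLACE(REPLACE(?ocr, 'aͤ', 'ä'), 'oͤ', 'ö'), 'uͤ', 'ü') AS ?clean)
}
```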
I also tried replacing the BINDs with the Jena-specific
LET constructs
(https://jena.apache.org/documentation/query/assignment.html) but that
had no effect. Is LET just a pre-SPARQL-1.1 addition that is
practically the same as BIND, or is there a meaningful difference between
the two?
LET is sort of the same as BIND.
You can assign to already set variables in which case the values must
match (c.f. FILTER). It makes the assignment order independent.
It means
LET(...AS ?X)
pattern involving ?X
and
pattern involving ?X
LET(...AS ?X)
have the same results.
2/ (expanding on Macros' email):
If you are using TDB2:
First transaction:
COPY vice:pageocrdata TO vice:pageocrdata_clean
or
insert {
  graph vice:pageocrdata_clean {
    ?page vice:ocrtext ?X .
  }
}
where {
  graph vice:pageocrdata {
    ?page vice:ocrtext ?X .
  }
}
Second transaction, which applies the changes:
WITH vice:pageocrdata_clean
DELETE { ?page vice:ocrtext ?ocr }
INSERT { ?page vice:ocrtext ?ocr7 }
WHERE {
?page vice:ocrtext ?ocr .
BIND(replace(?ocr,'uͤ','ü') AS ?ocr7)
FILTER (?ocr != ?ocr7)
}
Is there a big difference between working within one graph and
inter-graph update operations?
No - it's all quads.
But here the COPY means there isn't a lot of intermediate calculation of
RDF terms. The two graphs share terms.
Then the DELETE-INSERT only modifies things that need changing.
In the original setup all ?ocr1-7 are internally new strings whether the
REPLACE did anything or not.
Just asking because I'm compartmentalizing
my data into different graphs quite a lot, but if it is significantly
more expensive, I may have to rethink some processes, like the one shown above.
3/
If TDB1 and none of that works, maybe reduce the internal transaction
space as well
It so happens that SELECT LIMIT OFFSET is predictable for a
persistent database (this is not portable!!).
WHERE {
  {
    SELECT ?page ?ocr
    { graph vice:pageocrdata { ?page vice:ocrtext ?ocr . } }
    OFFSET ... LIMIT ...
  }
  All the BINDs
}
(or filter by prefix: ?ocr starts with "A", then with "B", and so on)
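
As a rough sketch of the prefix-filter variant, one batch could look like the update below, run once per starting letter. The PREFIX IRI is a placeholder, ?clean is an illustrative variable name, and the target graph follows the vice:pageocrdata_clean example from suggestion 2/; none of this is a tested recipe.

```sparql
# Placeholder namespace (assumption); substitute the real vice: IRI.
PREFIX vice: <http://example.org/vice#>

WITH vice:pageocrdata_clean
DELETE { ?page vice:ocrtext ?ocr }
INSERT { ?page vice:ocrtext ?clean }
WHERE {
  ?page vice:ocrtext ?ocr .
  # One batch per run: change "A" to "B", "C", ... in later runs.
  FILTER (STRSTARTS(?ocr, "A"))
  BIND(REPLACE(?ocr, 'uͤ', 'ü') AS ?clean)
  # Only touch triples whose text actually changes.
  FILTER (?ocr != ?clean)
}
```

Texts that start with a digit, punctuation, or a lowercase letter would need their own batches, so the set of prefixes has to cover whatever the data actually contains.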
Andy
Ah, yes, of course, this may come in handy with even larger datasets.
BTW : replace(str(?ocr), ...
Any URIs will turn into strings and any language tags will be lost.
Yes, that is unnecessary.
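
To illustrate the point about str(), a small self-contained query (the literal "Muͤller"@de is made up for the example) shows the difference. In SPARQL 1.1, REPLACE returns a literal of the same kind as its first argument, so the language tag survives only when str() is left out.

```sparql
SELECT ?kept ?lost
WHERE {
  # Hypothetical sample value with a German language tag.
  VALUES ?ocr { "Muͤller"@de }
  # REPLACE on the tagged literal keeps the @de tag.
  BIND(REPLACE(?ocr, 'uͤ', 'ü') AS ?kept)
  # STR() strips the tag first, so the result is a plain string.
  BIND(REPLACE(STR(?ocr), 'uͤ', 'ü') AS ?lost)
}
```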
--
Tutkijatohtori / post-doctoral researcher
Viral Culture in the Early Nineteenth-Century Europe (ViCE)
Movie Making Finland: Finnish fiction films as audiovisual big data,
1907–2017 (MoMaF)
Turun yliopisto / University of Turku