Perhaps so; but as a tool, Jena, and SPARQL in general, is very suitable for managing and processing data so that the processes can be described and repeated. For example, in this case, processing the results of the OCR is very quick compared to the actual OCR process, so I prefer to store the original results of the OCR somewhere, and do post-processing – which may require other stages than just the one presented here – later. For any external solution, I'd have to store the original text somewhere in any case, and keep track of the file names etc.

In this case, the actual run of the corrected SPARQL took only some tens of seconds, which is rather good, especially compared to the amount of time it would take to write the necessary scripts and data management for making this simple process repeatable with external solutions.

And in fact, if a database cannot be used for managing and processing data, I don't know what it should be used for :-)

Harri


On 24.9.2021 11.21, Marco Neumann wrote:
All that said, I would think you'd be best advised to run this type of
operation outside of Jena during preprocessing with CLI tools such as grep,
sed, awk or ack.

On Fri, Sep 24, 2021 at 9:14 AM Harri Kiiskinen <[email protected]>
wrote:

Hi all,

and thanks for the support! I did manage to resolve the problem by
modifying the query, detailed comments below.

Harri K.

On 23.9.2021 22.47, Andy Seaborne wrote:
I guess you are using TDB2 if you have -Xmx2G. TDB1 will use even more
heap space.

Yes, TDB2.

All those named variables mean that the intermediate results are being
held onto. That includes the "no change" case. It looks like REPLACE and
no change is still a new string.

I was afraid this might be the case.

There is at least 8 Gbytes just there by my rough calculation.

-Xmx12G was not enough, so even more, I guess.

Things to try:

1/
Replace the use of named variables by a single expression
REPLACE (REPLACE( .... ))

This did the trick. Combining all the replaces into one expression as
above was enough to keep the memory use below 7 GB.
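For the archive, the combined query now looks roughly like this (a
sketch based on my original query below; graph and property names as in
my data):

    insert {
      graph vice:pageocrdata_clean {
        ?page vice:ocrtext ?ocr7 .
      }
    }
    where {
      graph vice:pageocrdata {
        ?page vice:ocrtext ?ocr .
      }
      # a single expression, so no named intermediate strings are held per row
      bind (replace(replace(replace(replace(replace(replace(replace(
              str(?ocr),'ſ','s'),'uͤ','ü'),'aͤ','ä'),'oͤ','ö'),
              "[⸗—]\n",''),"\n",' '),"[ ]+",' ') as ?ocr7)
    }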

I also tried replacing the BIND's with the Jena-specific LET-constructs
(https://jena.apache.org/documentation/query/assignment.html) but that
had no effect – is the LET just a pre-SPARQL-1.1 addition that is
practically same as BIND, or is there a meaningful difference between
the two?

2/ (expanding on Marco's email):
If you are using TDB2:

First transaction:
COPY vice:pageocrdata TO vice:pageocrdata_clean
or
insert {
      graph vice:pageocrdata_clean {
        ?page vice:ocrtext ?X .
      }
    }
    where {
      graph vice:pageocrdata {
        ?page vice:ocrtext ?X .
      }
    }

then apply the changes:

WITH vice:pageocrdata_clean
DELETE { ?page vice:ocrtext ?ocr }
INSERT { ?page vice:ocrtext ?ocr7 }
WHERE {
      ?page vice:ocrtext ?ocr .
      BIND(replace(?ocr,'uͤ','ü') AS ?ocr7)
      FILTER (?ocr != ?ocr7)
}

Is there a big difference in working within one graph as compared to
inter-graph update operations? Just asking because I'm compartmentalizing
my data into different graphs quite a lot, but if it is significantly more
expensive, I may have to rethink some processes, like the one shown above.

3/
If TDB1 and none of that works, maybe reduce the internal transaction
space as well

It so happens that SELECT LIMIT OFFSET is predictable for a persistent
database (this is not portable!!).

WHERE {
     {
       SELECT ?ocr
       { graph vice:pageocrdata { ?page vice:ocrtext ?ocr . } }
       OFFSET ... LIMIT ...
     }
     All the BIND
}

(or filter by prefix: first the ?ocr strings starting with "A", then with "B", and so on)

      Andy

Ah, yes, of course, this may come in handy with even larger datasets.

BTW : replace(str(?ocr), ...
Any URIs will turn into strings and any language tags will be lost.

Yes, that is unnecessary.
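(If the language tags mattered, one could re-attach them after cleaning,
roughly like

    bind (STRLANG(replace(str(?ocr),'ſ','s'), lang(?ocr)) as ?ocr1)

though STRLANG raises an error for a literal without a language tag, so
this assumes every ?ocr actually carries one.)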

On 23/09/2021 16:28, Marco Neumann wrote:
"not to bind" to be read as "just bind once"

On Thu, Sep 23, 2021 at 4:25 PM Marco Neumann <[email protected]>
wrote:

set -Xmx to 8G and try not to bind the variable, to see if this
alleviates the issue.

On Thu, Sep 23, 2021 at 12:41 PM Harri Kiiskinen
<[email protected]>
wrote:

Hi!

I'm trying to run a simple update query that reads strings from one
graph, processes them, and stores to another:



------------------------------------------------------------------------------


    insert {
      graph vice:pageocrdata_clean {
        ?page vice:ocrtext ?ocr7 .
      }
    }
    where {
      graph vice:pageocrdata {
        ?page vice:ocrtext ?ocr .
      }
      bind (replace(str(?ocr),'ſ','s') as ?ocr1)
      bind (replace(?ocr1,'uͤ','ü') as ?ocr2)
      bind (replace(?ocr2,'aͤ','ä') as ?ocr3)
      bind (replace(?ocr3,'oͤ','ö') as ?ocr4)
      bind (replace(?ocr4,"[⸗—]\n",'') as ?ocr5)
      bind (replace(?ocr5,"\n",' ') as ?ocr6)
      bind (replace(?ocr6,"[ ]+",' ') as ?ocr7)
    }


-------------------------------------------------------------------------------


The source graph has some 250,000 triples that match the WHERE criteria.
The strings are one to two thousand characters in length.

I'm running the query using the Fuseki web UI, and it ends each time with
"Bad Request (#400) Java heap space". The Fuseki log does not show any
error except for the Bad Request #400. I'm quite surprised by this
problem, because the update operation is simple and straightforward data
processing, with no ordering etc.

I started with -Xmx2G, but even increasing the heap to -Xmx12G only
increases the time it takes for Fuseki to return the same error.

Is there something wrong with the SPARQL above? Is there something
that
increases the memory use unnecessarily?

Best,

Harri Kiiskinen



--


---
Marco Neumann
KONA





--
Tutkijatohtori / post-doctoral researcher
Viral Culture in the Early Nineteenth-Century Europe (ViCE)
Movie Making Finland: Finnish fiction films as audiovisual big data,
1907–2017 (MoMaF)
Turun yliopisto / University of Turku





