All that said, I would think you'd be best advised to run this type of
operation outside of Jena during preprocessing with CLI tools such as grep,
sed, awk or ack.
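For example (a sketch, not run against the real data): if the graph is exported to N-Triples, literal newlines inside string literals are escaped as the two-character sequence \n, so even the line-joining rules become plain sed substitutions. The export/reload steps and file names below are assumed, not taken from the thread:

```shell
# Hypothetical sketch: the same clean-up as a sed filter over an
# N-Triples dump. Export/reload steps and file names are assumptions:
#
#   riot --output=NT ... > pageocrdata.nt
#   clean < pageocrdata.nt > pageocrdata_clean.nt
#
clean() {
  sed -e 's/ſ/s/g' \
      -e 's/uͤ/ü/g' \
      -e 's/aͤ/ä/g' \
      -e 's/oͤ/ö/g' \
      -e 's/[⸗—]\\n//g' \
      -e 's/\\n/ /g' \
      -e 's/  */ /g'
}

# Sample run on one literal:
printf 'Paſſion\n' | clean    # → Passion
```

Note that `\\n` in the sed scripts matches the literal backslash-n escape as it appears in the N-Triples serialization, not a real newline, which is what makes the line-oriented tools workable here.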

On Fri, Sep 24, 2021 at 9:14 AM Harri Kiiskinen <harri.kiiski...@utu.fi>
wrote:

> Hi all,
>
> and thanks for the support! I did manage to resolve the problem by
> modifying the query, detailed comments below.
>
> Harri K.
>
> On 23.9.2021 22.47, Andy Seaborne wrote:
> > I guess you are using TDB2 if you have -Xmx2G. TDB1 will use even more
> > heap space.
>
> Yes, TDB2.
>
> > All those named variables mean that the intermediate results are being
> > held onto. That includes the "no change" case. It looks like REPLACE and
> > no change is still a new string.
>
> I was afraid this might be the case.
>
> > There is at least 8 Gbytes just there by my rough calculation.
>
> -Xmx12G was not enough, so even more, I guess.
>
> > Things to try:
> >
> > 1/
> > Replace the use of named variables by a single expression
> > REPLACE (REPLACE( .... ))
>
> This did the trick. Combining all the replaces into one expression as
> above was enough to keep the memory use below 7 GB.
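For readers following along, the combined single-expression form would look roughly like this (a sketch reconstructed from the original query below; ?clean stands in for the final ?ocr7):

```sparql
insert {
  graph vice:pageocrdata_clean {
    ?page vice:ocrtext ?clean .
  }
}
where {
  graph vice:pageocrdata {
    ?page vice:ocrtext ?ocr .
  }
  # One nested expression; no intermediate per-row bindings are retained.
  bind (replace(replace(replace(replace(replace(replace(replace(
          str(?ocr), 'ſ','s'), 'uͤ','ü'), 'aͤ','ä'), 'oͤ','ö'),
          "[⸗—]\n",''), "\n",' '), "[ ]+",' ') as ?clean)
}
```

A FILTER(?ocr != ?clean) guard, as in Andy's example further down, could be added the same way to skip unchanged strings.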
>
> I also tried replacing the BIND's with the Jena-specific LET-constructs
> (https://jena.apache.org/documentation/query/assignment.html) but that
> had no effect – is the LET just a pre-SPARQL-1.1 addition that is
> practically the same as BIND, or is there a meaningful difference between
> the two?
>
> > 2/ (expanding on Marco's email):
> > If you are using TDB2:
> >
> > First transaction:
> > COPY vice:pageocrdata TO vice:pageocrdata_clean
> > or
> > insert {
> >      graph vice:pageocrdata_clean {
> >        ?page vice:ocrtext ?X .
> >      }
> >    }
> >    where {
> >      graph vice:pageocrdata {
> >        ?page vice:ocrtext ?X .
> >      }
> >    }
> >
> > then apply the changes:
> >
> > WITH vice:pageocrdata_clean
> > DELETE { ?page vice:ocrtext ?ocr }
> > INSERT { ?page vice:ocrtext ?ocr7 }
> > WHERE {
> >      ?page vice:ocrtext ?ocr .
> >      BIND(replace(?ocr,'uͤ','ü') AS ?ocr7)
> >      FILTER (?ocr != ?ocr7)
> > }
>
> Is there a big difference in working within one graph as compared to
> intergraph update operations? Just asking because I'm compartmentalizing
> my data into different graphs quite a lot, but if that is significantly
> more expensive, I may have to rethink some processes, like the one shown
> above.
>
> > 3/
> > If TDB1 and none of that works, maybe reduce the internal transaction
> > space as well
> >
> > It so happens that SELECT LIMIT OFFSET is predictable for a persistent
> > database (this is not portable!!).
> >
> > WHERE {
> >     {
> >       SELECT ?page ?ocr
> >       WHERE { graph vice:pageocrdata { ?page vice:ocrtext ?ocr . } }
> >       OFFSET ... LIMIT ...
> >     }
> >     # all the BINDs
> > }
> >
> > (or filter in batches: ?ocr starting with "A", then with "B", ...)
> >
> >      Andy
>
> Ah, yes, of course, this may come in handy with even larger datasets.
>
> > BTW : replace(str(?ocr), ...
> > Any URIs will turn into strings and any language tags will be lost.
>
> Yes, that is unnecessary.
>
> > On 23/09/2021 16:28, Marco Neumann wrote:
> >> "not to bind" to be read as "just bind once"
> >>
> >> On Thu, Sep 23, 2021 at 4:25 PM Marco Neumann <marco.neum...@gmail.com>
> >> wrote:
> >>
> >>> set -Xmx to 8G and try not to bind the variable to see if this
> >>> alleviates the issue.
> >>>
> >>> On Thu, Sep 23, 2021 at 12:41 PM Harri Kiiskinen
> >>> <harri.kiiski...@utu.fi>
> >>> wrote:
> >>>
> >>>> Hi!
> >>>>
> >>>> I'm trying to run a simple update query that reads strings from one
> >>>> graph, processes them, and stores to another:
> >>>>
> >>>>
> >>>>
> ------------------------------------------------------------------------------
>
> >>>>
> >>>>    insert {
> >>>>      graph vice:pageocrdata_clean {
> >>>>        ?page vice:ocrtext ?ocr7 .
> >>>>      }
> >>>>    }
> >>>>    where {
> >>>>      graph vice:pageocrdata {
> >>>>        ?page vice:ocrtext ?ocr .
> >>>>      }
> >>>>      bind (replace(str(?ocr),'ſ','s') as ?ocr1)
> >>>>      bind (replace(?ocr1,'uͤ','ü') as ?ocr2)
> >>>>      bind (replace(?ocr2,'aͤ','ä') as ?ocr3)
> >>>>      bind (replace(?ocr3,'oͤ','ö') as ?ocr4)
> >>>>      bind (replace(?ocr4,"[⸗—]\n",'') as ?ocr5)
> >>>>      bind (replace(?ocr5,"\n",' ') as ?ocr6)
> >>>>      bind (replace(?ocr6,"[ ]+",' ') as ?ocr7)
> >>>>    }
> >>>>
> >>>>
> -------------------------------------------------------------------------------
>
> >>>>
> >>>> The source graph has some 250,000 triples that match the WHERE
> >>>> criteria.
> >>>> The strings are one to two thousand characters in length.
> >>>>
> >>>> I'm running the query using the Fuseki web UI, and it ends each time
> >>>> with
> >>>> "Bad Request (#400) Java heap space". The Fuseki log does not show
> >>>> any error except for the Bad Request #400. I'm quite surprised by
> >>>> this problem, because the update operation is simple and
> >>>> straightforward data processing, with no ordering etc.
> >>>>
> >>>> I started with -Xmx2G, but even increasing the heap to -Xmx12G only
> >>>> increases the time it takes for Fuseki to return the same error.
> >>>>
> >>>> Is there something wrong with the SPARQL above? Is there something
> that
> >>>> increases the memory use unnecessarily?
> >>>>
> >>>> Best,
> >>>>
> >>>> Harri Kiiskinen
> >>>>
> >>>
> >>>
> >>> --
> >>>
> >>>
> >>> ---
> >>> Marco Neumann
> >>> KONA
> >>>
> >>>
> >>
>
>
> --
> Tutkijatohtori / post-doctoral researcher
> Viral Culture in the Early Nineteenth-Century Europe (ViCE)
> Movie Making Finland: Finnish fiction films as audiovisual big data,
> 1907–2017 (MoMaF)
> Turun yliopisto / University of Turku
>


-- 


---
Marco Neumann
KONA
