Re: OutOfMemoryError when converting a large TTL to a JSON-LD

Andy Seaborne Sat, 20 Jul 2019 13:48:00 -0700

The JSON-LD writer needs the whole graph in-memory.

The Turtle parse is streaming, but the output needs collecting into a graph in memory to give to the JSON-LD writer. That needs RAM, as does the processing JSON-LD performs in order to write.


Turtle to N-Triples is streaming.

at com.github.jsonldjava.core.RDFDataset$IRI.<init>(RDFDataset.java:317)
at com.github.jsonldjava.core.RDFDataset$Quad.<init>(RDFDataset.java:52)
at com.github.jsonldjava.core.RDFDataset.addQuad(RDFDataset.java:540)
at org.apache.jena.riot.writer.JenaRDF2JSONLD.parse(JenaRDF2JSONLD.java:85)
at org.apache.jena.riot.writer.JsonLDWriter.toJsonLDJavaAPI(JsonLDWriter.java:205)
at org.apache.jena.riot.writer.JsonLDWriter.serialize(JsonLDWriter.java:178)
at org.apache.jena.riot.writer.JsonLDWriter.write(JsonLDWriter.java:139)

...
> at org.apache.jena.riot.RDFDataMgr.write(RDFDataMgr.java:1162)
> at riotcmd.CmdLangParse$1.postParse(CmdLangParse.java:334)

It's writing the JSON-LD.  Parsing input has completed.

The JSON-LD engine is github/jsonld-java.
https://github.com/jsonld-java

More below ...

On 19/07/2019 15:39, Ankit Dangi wrote:

Splitting the file into smaller chunks is an option I'm considering if
there isn't any viable alternative that could be fixed with a configuration
symbol either in "riot" or property in JVM_ARGS.

- Ankit

On Fri, Jul 19, 2019 at 10:35 AM Ankit Dangi wrote:

Please find my comments inline below.

On Fri, Jul 19, 2019 at 8:58 AM ajs6f wrote:

You're dealing with two formats that both require context to be parsed.
In other words, they build up information in the heap as they are parsed.
Ideally, you could switch to stream-able formats like NTriples, but if that
is not a choice you can make, you could try a couple of things. You could
go Turtle -> NTriples then NTriples -> JSON-LD. This might work a bit
better because you don't have to build up the state in heap for _both_
contextual formats at the same time. I don't know what kind of use to which
you intend to put the JSON-LD, but if you can use multiple files in that
use, you might try splitting the file and processing it in pieces.
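As a rough sketch of the "split and process in pieces" idea: N-Triples puts one complete triple on each line, so the file can be cut at line boundaries without a parser. The function name split_ntriples, the chunk file naming, and the chunk size are all illustrative assumptions, not part of Jena. Note that blank node labels are scoped to a single file, so triples that share a blank node but land in different chunks will no longer refer to the same node after each chunk is converted separately.

```python
import os


def split_ntriples(src_path, out_dir, lines_per_chunk):
    """Split an N-Triples file into numbered chunk files (hypothetical helper).

    Cuts only at line boundaries, so no triple is ever split in two.
    Returns the list of chunk file paths in the order they were written.
    """
    if lines_per_chunk < 1:
        raise ValueError("lines_per_chunk must be at least 1")
    os.makedirs(out_dir, exist_ok=True)

    chunk_paths = []
    out = None
    written = 0
    try:
        with open(src_path, "r", encoding="utf-8") as src:
            for line in src:
                stripped = line.strip()
                # Blank lines and comment lines carry no triple; skip them.
                if not stripped or stripped.startswith("#"):
                    continue
                # Start a new chunk file when none is open or the current one is full.
                if out is None or written == lines_per_chunk:
                    if out is not None:
                        out.close()
                    path = os.path.join(out_dir, "chunk-%05d.nt" % len(chunk_paths))
                    out = open(path, "w", encoding="utf-8")
                    chunk_paths.append(path)
                    written = 0
                out.write(stripped + "\n")
                written += 1
    finally:
        if out is not None:
            out.close()
    return chunk_paths
```

Each resulting chunk is a valid N-Triples file in its own right and could be converted on its own, at the cost of producing several independent JSON-LD documents rather than one.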

I understand better -- thank you. I gave it a try (in the sequence below)
with a similar setup as earlier with only 1 type of GC. The Step-1 bloated
the 3.2G .ttl file to a 5G .nt file in about a minute, but Step-2 took
about 10 mins and ended up with a similar error (stack trace at the end).


10 mins - the GC is working very hard.

No significant improvement so far.

Step-1: $ riotcmd.turtle --time --verbose --syntax=TURTLE
--output=N-Triples large_file.ttl -Xmx40G -XX:+OptimizeStringConcat
-XX:+UseConcMarkSweepGC
-Dlog4j.configuration=file:~/apache-jena-3.12.0/jena-log4j.properties >
large_file.ttl.nt

Presumably "riotcmd.turtle" is your own script.

Right - NT is more verbose than Turtle. N-Triples writes out everything; Turtle has prefix names and, e.g., does not repeat the subject for every triple.


Step-2: $ riotcmd.ntriples --time --verbose --syntax=N-Triples
--output=JSON-LD large_file.ttl.nt -Xmx40G -XX:+OptimizeStringConcat
-XX:+UseConcMarkSweepGC
-Dlog4j.configuration=file:~/apache-jena-3.12.0/jena-log4j.properties


JSON-LD needs to analyse the whole graph before it writes out.

(As do the prettiest forms of Turtle and RDF/XML; it's just that Turtle and RDF/XML have streamable, less pretty forms as well.)

To see the differences, try (on a small sample) "riot --output=TTL" and "riot --pretty=TTL".


For more details:
https://jena.apache.org/documentation/io/rdf-output.html
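
The difference between a streaming writer and a whole-graph writer can be illustrated with a toy model. This is only a sketch of the general idea, not Jena's or jsonld-java's actual code; the function names and the tuple representation of triples are assumptions made for the example. The flat writer emits a line as each triple arrives, while the grouped writer cannot emit anything until it has collected every triple, which is why its memory use grows with the size of the graph.

```python
from collections import OrderedDict


def write_streaming(triples):
    """Yield one output line per triple as soon as it arrives (flat form)."""
    for s, p, o in triples:
        yield "%s %s %s ." % (s, p, o)


def write_grouped(triples):
    """Yield one block per subject (pretty form).

    Every triple must be read and held in memory before the first block
    can be written, because a later triple may belong to an earlier subject.
    """
    by_subject = OrderedDict()
    for s, p, o in triples:
        by_subject.setdefault(s, []).append((p, o))
    for s, pairs in by_subject.items():
        body = " ;\n    ".join("%s %s" % (p, o) for p, o in pairs)
        yield "%s %s ." % (s, body)
```

Pulling the first item from write_streaming consumes a single triple from its input, whereas pulling the first item from write_grouped consumes the whole input.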

Can you tell us a bit more about your use case? There might be another
approach someone can recommend.

Two building blocks or pieces of code that both work with different RDF
formats -- interfacing them via a format translator such as this. Open to
suggestions?

Stack trace follows:

Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
at java.util.LinkedHashMap.newNode(LinkedHashMap.java:256)
at java.util.HashMap.putVal(HashMap.java:631)
at java.util.HashMap.put(HashMap.java:612)
at com.github.jsonldjava.core.RDFDataset$BlankNode.<init>(RDFDataset.java:341)
at com.github.jsonldjava.core.RDFDataset$Quad.<init>(RDFDataset.java:62)
at com.github.jsonldjava.core.RDFDataset$Quad.<init>(RDFDataset.java:51)
at com.github.jsonldjava.core.RDFDataset.addQuad(RDFDataset.java:540)
at org.apache.jena.riot.writer.JenaRDF2JSONLD.parse(JenaRDF2JSONLD.java:85)
at org.apache.jena.riot.writer.JsonLDWriter.toJsonLDJavaAPI(JsonLDWriter.java:205)
at org.apache.jena.riot.writer.JsonLDWriter.serialize(JsonLDWriter.java:178)
at org.apache.jena.riot.writer.JsonLDWriter.write(JsonLDWriter.java:139)
at org.apache.jena.riot.writer.JsonLDWriter.write(JsonLDWriter.java:145)
at org.apache.jena.riot.RDFWriter.write$(RDFWriter.java:207)
at org.apache.jena.riot.RDFWriter.output(RDFWriter.java:165)
at org.apache.jena.riot.RDFWriter.output(RDFWriter.java:112)
at org.apache.jena.riot.RDFWriterBuilder.output(RDFWriterBuilder.java:178)
at org.apache.jena.riot.RDFDataMgr.write$(RDFDataMgr.java:1277)
at org.apache.jena.riot.RDFDataMgr.write(RDFDataMgr.java:1162)
at riotcmd.CmdLangParse$1.postParse(CmdLangParse.java:334)
at riotcmd.CmdLangParse.exec$(CmdLangParse.java:170)
at riotcmd.CmdLangParse.exec(CmdLangParse.java:128)
at jena.cmd.CmdMain.mainMethod(CmdMain.java:93)
at jena.cmd.CmdMain.mainRun(CmdMain.java:58)
at jena.cmd.CmdMain.mainRun(CmdMain.java:45)
at riotcmd.ntriples.main(ntriples.java:30)

ajs6f

On Jul 19, 2019, at 8:18 AM, Ankit Dangi wrote:

Hi,

I am using Apache Jena 3.12.0 with OpenJDK version 1.8.0_212 on a 64-Bit
Ubuntu 18.04.2 LTS (bionic) server with no changes to any default
configurations.

I have a 3.2G sized-Turtle (.ttl) RDF file that has ~25M triples that I'd
like to convert to a JSON-LD representation. I first looked at jena.rdfcat
which suggested I should be using 'riot' instead. I then tried
riotcmd.turtle with 2 different GCs with up to 40G max-heap size but in
about 12 mins it ran into a "java.lang.OutOfMemoryError: Java heap space"
(stack trace at the end).

$ cd apache-jena-3.12.0/bin

FAILED-1: $ riotcmd.turtle --time --verbose --syntax=TURTLE
--output=JSON-LD large_file.ttl -Xmx40G -XX:+OptimizeStringConcat
-XX:+UseG1GC -XX:+UseStringDeduplication
-XX:+PrintStringDeduplicationStatistics
-Dlog4j.configuration=file:~/apache-jena-3.12.0/jena-log4j.properties

FAILED-2: $ riotcmd.turtle --time --verbose --syntax=TURTLE
--output=JSON-LD large_file.ttl -Xmx40G -XX:+OptimizeStringConcat
-XX:+UseConcMarkSweepGC
-Dlog4j.configuration=file:~/apache-jena-3.12.0/jena-log4j.properties



Question: I believe I may be missing some parameters or configurations that
I could fine-tune. Any suggestions on what I could try? If not, are there
any alternate mechanisms by which I could convert the large TTL to
JSON-LD?

Stack trace follows below:

Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
at java.util.LinkedHashMap.newNode(LinkedHashMap.java:256)
at java.util.HashMap.putVal(HashMap.java:631)
at java.util.HashMap.put(HashMap.java:612)
at com.github.jsonldjava.core.RDFDataset$IRI.<init>(RDFDataset.java:317)
at com.github.jsonldjava.core.RDFDataset$Quad.<init>(RDFDataset.java:52)
at com.github.jsonldjava.core.RDFDataset.addQuad(RDFDataset.java:540)
at org.apache.jena.riot.writer.JenaRDF2JSONLD.parse(JenaRDF2JSONLD.java:85)
at org.apache.jena.riot.writer.JsonLDWriter.toJsonLDJavaAPI(JsonLDWriter.java:205)
at org.apache.jena.riot.writer.JsonLDWriter.serialize(JsonLDWriter.java:178)
at org.apache.jena.riot.writer.JsonLDWriter.write(JsonLDWriter.java:139)
at org.apache.jena.riot.writer.JsonLDWriter.write(JsonLDWriter.java:145)
at org.apache.jena.riot.RDFWriter.write$(RDFWriter.java:207)
at org.apache.jena.riot.RDFWriter.output(RDFWriter.java:165)
at org.apache.jena.riot.RDFWriter.output(RDFWriter.java:112)
at org.apache.jena.riot.RDFWriterBuilder.output(RDFWriterBuilder.java:178)
at org.apache.jena.riot.RDFDataMgr.write$(RDFDataMgr.java:1277)
at org.apache.jena.riot.RDFDataMgr.write(RDFDataMgr.java:1162)
at riotcmd.CmdLangParse$1.postParse(CmdLangParse.java:334)
at riotcmd.CmdLangParse.exec$(CmdLangParse.java:170)
at riotcmd.CmdLangParse.exec(CmdLangParse.java:128)
at jena.cmd.CmdMain.mainMethod(CmdMain.java:93)
at jena.cmd.CmdMain.mainRun(CmdMain.java:58)
at jena.cmd.CmdMain.mainRun(CmdMain.java:45)
at riotcmd.turtle.main(turtle.java:30)



--
Ankit Dangi


