Splitting the file into smaller chunks is an option I'm considering if there isn't a viable alternative that could be enabled with a configuration option in "riot" or a property in JVM_ARGS.
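If splitting does turn out to be the way to go, the N-Triples intermediate makes it safe, since N-Triples is line-oriented (one triple per line) and a plain line-based split yields syntactically valid chunks — unlike splitting the original Turtle, where prefix declarations and multi-line terms would break. A minimal sketch with illustrative file names and chunk size (the real chunk size and the riot invocation are the parts to adapt):

```shell
# Stand-in for large_file.ttl.nt: three triples, one per line.
# (Illustrative data; in practice this file comes from the Turtle -> N-Triples step.)
printf '<http://example.org/s> <http://example.org/p> "x" .\n%.0s' 1 2 3 > sample.nt

# Split into chunks of at most two triples each: sample_00.nt, sample_01.nt, ...
# Line-based splitting is safe because every N-Triples line is a complete triple.
split -l 2 -d --additional-suffix=.nt sample.nt sample_

# Each chunk can then be converted on its own, e.g.:
#   for f in sample_*.nt; do riot --output=JSON-LD "$f" > "${f%.nt}.jsonld"; done
wc -l sample_*.nt
```

The result is multiple JSON-LD documents rather than one, which only works if the downstream consumer can accept the data in pieces.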
- Ankit

On Fri, Jul 19, 2019 at 10:35 AM Ankit Dangi <dangian...@gmail.com> wrote:

> Please find my comments inline below.
>
> On Fri, Jul 19, 2019 at 8:58 AM ajs6f <aj...@apache.org> wrote:
>
>> You're dealing with two formats that both require context to be parsed.
>> In other words, they build up information in the heap as they are parsed.
>> Ideally, you could switch to a streamable format like N-Triples, but if
>> that is not a choice you can make, you could try a couple of things. You
>> could go Turtle -> N-Triples, then N-Triples -> JSON-LD. This might work
>> a bit better because you don't have to build up the state in heap for
>> _both_ contextual formats at the same time. I don't know what kind of use
>> you intend to put the JSON-LD to, but if that use can take multiple
>> files, you might try splitting the file and processing it in pieces.
>
> I understand better -- thank you. I gave it a try (in the sequence below)
> with a setup similar to the earlier one, using only one type of GC. Step 1
> bloated the 3.2G .ttl file to a 5G .nt file in about a minute, but Step 2
> took about 10 minutes and ended with a similar error (stack trace at the
> end). No significant improvement so far.
>
> Step 1:
> $ riotcmd.turtle --time --verbose --syntax=TURTLE --output=N-Triples \
>     large_file.ttl -Xmx40G -XX:+OptimizeStringConcat \
>     -XX:+UseConcMarkSweepGC \
>     -Dlog4j.configuration=file:~/apache-jena-3.12.0/jena-log4j.properties \
>     > large_file.ttl.nt
>
> Step 2:
> $ riotcmd.ntriples --time --verbose --syntax=N-Triples --output=JSON-LD \
>     large_file.ttl.nt -Xmx40G -XX:+OptimizeStringConcat \
>     -XX:+UseConcMarkSweepGC \
>     -Dlog4j.configuration=file:~/apache-jena-3.12.0/jena-log4j.properties
>
>> Can you tell us a bit more about your use case? There might be another
>> approach someone can recommend.
>
> Two building blocks or pieces of code that each work with a different RDF
> format -- interfacing them via a format translator such as this. Open to
> suggestions?
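One thing worth verifying against the install before tuning further: in the Apache Jena distribution, the wrapper scripts in bin/ take JVM options from the JVM_ARGS environment variable (with a modest default heap), so flags such as -Xmx40G placed after the script name may be passed to the tool as program arguments rather than reaching the JVM at all. If that is what is happening here, the runs above used the default heap, not 40G. A minimal sketch, assuming the scripts behave as described:

```shell
# Set JVM options in the environment the Jena bin/ scripts read them from.
# (Assumption to check: inspect the wrapper script on your install, e.g.
#  grep JVM_ARGS apache-jena-3.12.0/bin/riot)
export JVM_ARGS="-Xmx40G -XX:+UseConcMarkSweepGC"

# Then invoke the tool with only its own options on the command line, e.g.:
#   riotcmd.turtle --time --verbose --syntax=TURTLE --output=N-Triples \
#       large_file.ttl > large_file.ttl.nt
echo "$JVM_ARGS"
```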
>
> Stack trace follows:
>
> Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
>     at java.util.LinkedHashMap.newNode(LinkedHashMap.java:256)
>     at java.util.HashMap.putVal(HashMap.java:631)
>     at java.util.HashMap.put(HashMap.java:612)
>     at com.github.jsonldjava.core.RDFDataset$BlankNode.<init>(RDFDataset.java:341)
>     at com.github.jsonldjava.core.RDFDataset$Quad.<init>(RDFDataset.java:62)
>     at com.github.jsonldjava.core.RDFDataset$Quad.<init>(RDFDataset.java:51)
>     at com.github.jsonldjava.core.RDFDataset.addQuad(RDFDataset.java:540)
>     at org.apache.jena.riot.writer.JenaRDF2JSONLD.parse(JenaRDF2JSONLD.java:85)
>     at org.apache.jena.riot.writer.JsonLDWriter.toJsonLDJavaAPI(JsonLDWriter.java:205)
>     at org.apache.jena.riot.writer.JsonLDWriter.serialize(JsonLDWriter.java:178)
>     at org.apache.jena.riot.writer.JsonLDWriter.write(JsonLDWriter.java:139)
>     at org.apache.jena.riot.writer.JsonLDWriter.write(JsonLDWriter.java:145)
>     at org.apache.jena.riot.RDFWriter.write$(RDFWriter.java:207)
>     at org.apache.jena.riot.RDFWriter.output(RDFWriter.java:165)
>     at org.apache.jena.riot.RDFWriter.output(RDFWriter.java:112)
>     at org.apache.jena.riot.RDFWriterBuilder.output(RDFWriterBuilder.java:178)
>     at org.apache.jena.riot.RDFDataMgr.write$(RDFDataMgr.java:1277)
>     at org.apache.jena.riot.RDFDataMgr.write(RDFDataMgr.java:1162)
>     at riotcmd.CmdLangParse$1.postParse(CmdLangParse.java:334)
>     at riotcmd.CmdLangParse.exec$(CmdLangParse.java:170)
>     at riotcmd.CmdLangParse.exec(CmdLangParse.java:128)
>     at jena.cmd.CmdMain.mainMethod(CmdMain.java:93)
>     at jena.cmd.CmdMain.mainRun(CmdMain.java:58)
>     at jena.cmd.CmdMain.mainRun(CmdMain.java:45)
>     at riotcmd.ntriples.main(ntriples.java:30)
>
>> ajs6f
>>
>> > On Jul 19, 2019, at 8:18 AM, Ankit Dangi <dangian...@gmail.com> wrote:
>> >
>> > Hi,
>> >
>> > I am using Apache Jena 3.12.0 with OpenJDK version 1.8.0_212 on a
>> > 64-bit Ubuntu 18.04.2 LTS (bionic) server with no changes to any
>> > default configurations.
>> >
>> > I have a 3.2G Turtle (.ttl) RDF file with ~25M triples that I'd like
>> > to convert to a JSON-LD representation. I first looked at jena.rdfcat,
>> > which suggested I should be using 'riot' instead. I then tried
>> > riotcmd.turtle with 2 different GCs and up to a 40G max heap, but in
>> > about 12 minutes it ran into a "java.lang.OutOfMemoryError: Java heap
>> > space" (stack trace at the end).
>> >
>> > $ cd apache-jena-3.12.0/bin
>> >
>> > FAILED-1:
>> > $ riotcmd.turtle --time --verbose --syntax=TURTLE --output=JSON-LD \
>> >     large_file.ttl -Xmx40G -XX:+OptimizeStringConcat -XX:+UseG1GC \
>> >     -XX:+UseStringDeduplication -XX:+PrintStringDeduplicationStatistics \
>> >     -Dlog4j.configuration=file:~/apache-jena-3.12.0/jena-log4j.properties
>> >
>> > FAILED-2:
>> > $ riotcmd.turtle --time --verbose --syntax=TURTLE --output=JSON-LD \
>> >     large_file.ttl -Xmx40G -XX:+OptimizeStringConcat \
>> >     -XX:+UseConcMarkSweepGC \
>> >     -Dlog4j.configuration=file:~/apache-jena-3.12.0/jena-log4j.properties
>> >
>> > Question: I believe I may be missing some parameters or configuration
>> > that I could fine-tune. Any suggestions on what I could try? If not,
>> > is there an alternate mechanism by which I could convert the large TTL
>> > file to JSON-LD?
>> >
>> > Stack trace follows below:
>> >
>> > Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
>> >     at java.util.LinkedHashMap.newNode(LinkedHashMap.java:256)
>> >     at java.util.HashMap.putVal(HashMap.java:631)
>> >     at java.util.HashMap.put(HashMap.java:612)
>> >     at com.github.jsonldjava.core.RDFDataset$IRI.<init>(RDFDataset.java:317)
>> >     at com.github.jsonldjava.core.RDFDataset$Quad.<init>(RDFDataset.java:52)
>> >     at com.github.jsonldjava.core.RDFDataset.addQuad(RDFDataset.java:540)
>> >     at org.apache.jena.riot.writer.JenaRDF2JSONLD.parse(JenaRDF2JSONLD.java:85)
>> >     at org.apache.jena.riot.writer.JsonLDWriter.toJsonLDJavaAPI(JsonLDWriter.java:205)
>> >     at org.apache.jena.riot.writer.JsonLDWriter.serialize(JsonLDWriter.java:178)
>> >     at org.apache.jena.riot.writer.JsonLDWriter.write(JsonLDWriter.java:139)
>> >     at org.apache.jena.riot.writer.JsonLDWriter.write(JsonLDWriter.java:145)
>> >     at org.apache.jena.riot.RDFWriter.write$(RDFWriter.java:207)
>> >     at org.apache.jena.riot.RDFWriter.output(RDFWriter.java:165)
>> >     at org.apache.jena.riot.RDFWriter.output(RDFWriter.java:112)
>> >     at org.apache.jena.riot.RDFWriterBuilder.output(RDFWriterBuilder.java:178)
>> >     at org.apache.jena.riot.RDFDataMgr.write$(RDFDataMgr.java:1277)
>> >     at org.apache.jena.riot.RDFDataMgr.write(RDFDataMgr.java:1162)
>> >     at riotcmd.CmdLangParse$1.postParse(CmdLangParse.java:334)
>> >     at riotcmd.CmdLangParse.exec$(CmdLangParse.java:170)
>> >     at riotcmd.CmdLangParse.exec(CmdLangParse.java:128)
>> >     at jena.cmd.CmdMain.mainMethod(CmdMain.java:93)
>> >     at jena.cmd.CmdMain.mainRun(CmdMain.java:58)
>> >     at jena.cmd.CmdMain.mainRun(CmdMain.java:45)
>> >     at riotcmd.turtle.main(turtle.java:30)
>> >
>> > --
>> > Ankit Dangi
>
> --
> Ankit Dangi

--
Ankit Dangi
LTI at CMU, ex-MLD
https://dangiankit.info