Splitting the file into smaller chunks is an option I'm considering if there isn't a viable alternative that could be enabled with a configuration option in "riot" or a property in JVM_ARGS.
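If splitting does turn out to be the way to go, the N-Triples intermediate makes it safe, since N-Triples is line-oriented (one triple per line) and a plain line-based split yields syntactically valid chunks — unlike splitting the original Turtle, where prefix declarations and multi-line terms would break. A minimal sketch with illustrative file names and chunk size (the real chunk size and the riot invocation are the parts to adapt):

```shell
# Stand-in for large_file.ttl.nt: three triples, one per line.
# (Illustrative data; in practice this file comes from the Turtle -> N-Triples step.)
printf '<http://example.org/s> <http://example.org/p> "x" .\n%.0s' 1 2 3 > sample.nt

# Split into chunks of at most two triples each: sample_00.nt, sample_01.nt, ...
# Line-based splitting is safe because every N-Triples line is a complete triple.
split -l 2 -d --additional-suffix=.nt sample.nt sample_

# Each chunk can then be converted on its own, e.g.:
#   for f in sample_*.nt; do riot --output=JSON-LD "$f" > "${f%.nt}.jsonld"; done
wc -l sample_*.nt
```

The result is multiple JSON-LD documents rather than one, which only works if the downstream consumer can accept the data in pieces.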
- Ankit

On Fri, Jul 19, 2019 at 10:35 AM Ankit Dangi <dangian...@gmail.com> wrote:

> Please find my comments inline below.
>
> On Fri, Jul 19, 2019 at 8:58 AM ajs6f <aj...@apache.org> wrote:
>
>> You're dealing with two formats that both require context to be parsed.
>> In other words, they build up information in the heap as they are parsed.
>> Ideally, you could switch to a streamable format like N-Triples, but if
>> that is not a choice you can make, you could try a couple of things. You
>> could go Turtle -> N-Triples, then N-Triples -> JSON-LD. This might work
>> a bit better because you don't have to build up the state in heap for
>> _both_ contextual formats at the same time. I don't know what kind of use
>> you intend to put the JSON-LD to, but if that use can take multiple
>> files, you might try splitting the file and processing it in pieces.
>
> I understand better -- thank you. I gave it a try (in the sequence below)
> with a setup similar to the earlier one, using only one type of GC. Step 1
> bloated the 3.2G .ttl file to a 5G .nt file in about a minute, but Step 2
> took about 10 minutes and ended with a similar error (stack trace at the
> end). No significant improvement so far.
>
> Step 1:
> $ riotcmd.turtle --time --verbose --syntax=TURTLE --output=N-Triples \
>     large_file.ttl -Xmx40G -XX:+OptimizeStringConcat \
>     -XX:+UseConcMarkSweepGC \
>     -Dlog4j.configuration=file:~/apache-jena-3.12.0/jena-log4j.properties \
>     > large_file.ttl.nt
>
> Step 2:
> $ riotcmd.ntriples --time --verbose --syntax=N-Triples --output=JSON-LD \
>     large_file.ttl.nt -Xmx40G -XX:+OptimizeStringConcat \
>     -XX:+UseConcMarkSweepGC \
>     -Dlog4j.configuration=file:~/apache-jena-3.12.0/jena-log4j.properties
>
>> Can you tell us a bit more about your use case? There might be another
>> approach someone can recommend.
>
> Two building blocks or pieces of code that each work with a different RDF
> format -- interfacing them via a format translator such as this. Open to
> suggestions?
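One thing worth verifying against the install before tuning further: in the Apache Jena distribution, the wrapper scripts in bin/ take JVM options from the JVM_ARGS environment variable (with a modest default heap), so flags such as -Xmx40G placed after the script name may be passed to the tool as program arguments rather than reaching the JVM at all. If that is what is happening here, the runs above used the default heap, not 40G. A minimal sketch, assuming the scripts behave as described:

```shell
# Set JVM options in the environment the Jena bin/ scripts read them from.
# (Assumption to check: inspect the wrapper script on your install, e.g.
#  grep JVM_ARGS apache-jena-3.12.0/bin/riot)
export JVM_ARGS="-Xmx40G -XX:+UseConcMarkSweepGC"

# Then invoke the tool with only its own options on the command line, e.g.:
#   riotcmd.turtle --time --verbose --syntax=TURTLE --output=N-Triples \
#       large_file.ttl > large_file.ttl.nt
echo "$JVM_ARGS"
```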
>
> Stack trace follows:
>
> Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
>     at java.util.LinkedHashMap.newNode(LinkedHashMap.java:256)
>     at java.util.HashMap.putVal(HashMap.java:631)
>     at java.util.HashMap.put(HashMap.java:612)
>     at com.github.jsonldjava.core.RDFDataset$BlankNode.<init>(RDFDataset.java:341)
>     at com.github.jsonldjava.core.RDFDataset$Quad.<init>(RDFDataset.java:62)
>     at com.github.jsonldjava.core.RDFDataset$Quad.<init>(RDFDataset.java:51)
>     at com.github.jsonldjava.core.RDFDataset.addQuad(RDFDataset.java:540)
>     at org.apache.jena.riot.writer.JenaRDF2JSONLD.parse(JenaRDF2JSONLD.java:85)
>     at org.apache.jena.riot.writer.JsonLDWriter.toJsonLDJavaAPI(JsonLDWriter.java:205)
>     at org.apache.jena.riot.writer.JsonLDWriter.serialize(JsonLDWriter.java:178)
>     at org.apache.jena.riot.writer.JsonLDWriter.write(JsonLDWriter.java:139)
>     at org.apache.jena.riot.writer.JsonLDWriter.write(JsonLDWriter.java:145)
>     at org.apache.jena.riot.RDFWriter.write$(RDFWriter.java:207)
>     at org.apache.jena.riot.RDFWriter.output(RDFWriter.java:165)
>     at org.apache.jena.riot.RDFWriter.output(RDFWriter.java:112)
>     at org.apache.jena.riot.RDFWriterBuilder.output(RDFWriterBuilder.java:178)
>     at org.apache.jena.riot.RDFDataMgr.write$(RDFDataMgr.java:1277)
>     at org.apache.jena.riot.RDFDataMgr.write(RDFDataMgr.java:1162)
>     at riotcmd.CmdLangParse$1.postParse(CmdLangParse.java:334)
>     at riotcmd.CmdLangParse.exec$(CmdLangParse.java:170)
>     at riotcmd.CmdLangParse.exec(CmdLangParse.java:128)
>     at jena.cmd.CmdMain.mainMethod(CmdMain.java:93)
>     at jena.cmd.CmdMain.mainRun(CmdMain.java:58)
>     at jena.cmd.CmdMain.mainRun(CmdMain.java:45)
>     at riotcmd.ntriples.main(ntriples.java:30)
>
>> ajs6f
>>
>> > On Jul 19, 2019, at 8:18 AM, Ankit Dangi <dangian...@gmail.com> wrote:
>> >
>> > Hi,
>> >
>> > I am using Apache Jena 3.12.0 with OpenJDK version 1.8.0_212 on a
>> > 64-bit Ubuntu 18.04.2 LTS (bionic) server with no changes to any
>> > default configurations.
>> >
>> > I have a 3.2G Turtle (.ttl) RDF file with ~25M triples that I'd like
>> > to convert to a JSON-LD representation. I first looked at jena.rdfcat,
>> > which suggested I should be using 'riot' instead. I then tried
>> > riotcmd.turtle with 2 different GCs and up to a 40G max heap, but in
>> > about 12 minutes it ran into a "java.lang.OutOfMemoryError: Java heap
>> > space" (stack trace at the end).
>> >
>> > $ cd apache-jena-3.12.0/bin
>> >
>> > FAILED-1:
>> > $ riotcmd.turtle --time --verbose --syntax=TURTLE --output=JSON-LD \
>> >     large_file.ttl -Xmx40G -XX:+OptimizeStringConcat -XX:+UseG1GC \
>> >     -XX:+UseStringDeduplication -XX:+PrintStringDeduplicationStatistics \
>> >     -Dlog4j.configuration=file:~/apache-jena-3.12.0/jena-log4j.properties
>> >
>> > FAILED-2:
>> > $ riotcmd.turtle --time --verbose --syntax=TURTLE --output=JSON-LD \
>> >     large_file.ttl -Xmx40G -XX:+OptimizeStringConcat \
>> >     -XX:+UseConcMarkSweepGC \
>> >     -Dlog4j.configuration=file:~/apache-jena-3.12.0/jena-log4j.properties
>> >
>> > Question: I believe I may be missing some parameters or configuration
>> > that I could fine-tune. Any suggestions on what I could try? If not,
>> > is there an alternate mechanism by which I could convert the large TTL
>> > file to JSON-LD?
>> >
>> > Stack trace follows below:
>> >
>> > Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
>> >     at java.util.LinkedHashMap.newNode(LinkedHashMap.java:256)
>> >     at java.util.HashMap.putVal(HashMap.java:631)
>> >     at java.util.HashMap.put(HashMap.java:612)
>> >     at com.github.jsonldjava.core.RDFDataset$IRI.<init>(RDFDataset.java:317)
>> >     at com.github.jsonldjava.core.RDFDataset$Quad.<init>(RDFDataset.java:52)
>> >     at com.github.jsonldjava.core.RDFDataset.addQuad(RDFDataset.java:540)
>> >     at org.apache.jena.riot.writer.JenaRDF2JSONLD.parse(JenaRDF2JSONLD.java:85)
>> >     at org.apache.jena.riot.writer.JsonLDWriter.toJsonLDJavaAPI(JsonLDWriter.java:205)
>> >     at org.apache.jena.riot.writer.JsonLDWriter.serialize(JsonLDWriter.java:178)
>> >     at org.apache.jena.riot.writer.JsonLDWriter.write(JsonLDWriter.java:139)
>> >     at org.apache.jena.riot.writer.JsonLDWriter.write(JsonLDWriter.java:145)
>> >     at org.apache.jena.riot.RDFWriter.write$(RDFWriter.java:207)
>> >     at org.apache.jena.riot.RDFWriter.output(RDFWriter.java:165)
>> >     at org.apache.jena.riot.RDFWriter.output(RDFWriter.java:112)
>> >     at org.apache.jena.riot.RDFWriterBuilder.output(RDFWriterBuilder.java:178)
>> >     at org.apache.jena.riot.RDFDataMgr.write$(RDFDataMgr.java:1277)
>> >     at org.apache.jena.riot.RDFDataMgr.write(RDFDataMgr.java:1162)
>> >     at riotcmd.CmdLangParse$1.postParse(CmdLangParse.java:334)
>> >     at riotcmd.CmdLangParse.exec$(CmdLangParse.java:170)
>> >     at riotcmd.CmdLangParse.exec(CmdLangParse.java:128)
>> >     at jena.cmd.CmdMain.mainMethod(CmdMain.java:93)
>> >     at jena.cmd.CmdMain.mainRun(CmdMain.java:58)
>> >     at jena.cmd.CmdMain.mainRun(CmdMain.java:45)
>> >     at riotcmd.turtle.main(turtle.java:30)
>> >
>> > --
>> > Ankit Dangi
>
> --
> Ankit Dangi

--
Ankit Dangi
LTI at CMU, ex-MLD
https://dangiankit.info