Exporting and importing invalid IRIs

Brian Vosburgh Fri, 12 Aug 2022 20:15:37 -0700

Hello, Jena Community.

TL;DR: Is there a way to export and/or import a graph that contains invalid
IRIs (typically, IRIs with spaces in them)?


Details:
When I try to write out a graph that contains the IRI
<http://foo.com/bar baz> with a method like this:

static void writeModelTo(String baseURI, Model model, OutputStream out) {
    RDFWriter.create()
        .base(baseURI)
        .format(RDFFormat.TURTLE_BLOCKS)
        .source(model)
        .output(out);
}

the result is an error with a stack trace like this:

Caused by: org.apache.jena.irix.IRIException: <http://foo.com/bar baz>
Code: 17/WHITESPACE in PATH: A single whitespace character. These
match no grammar rules of URIs/IRIs. These characters are permitted in
RDF URI References, XML system identifiers, and XML Schema anyURIs.
    at org.apache.jena.irix.IRIProviderJenaIRI.exceptions(IRIProviderJenaIRI.java:256) ~[jena-core-4.3.2.jar:4.3.2]
    at org.apache.jena.irix.IRIProviderJenaIRI.newIRIxJena(IRIProviderJenaIRI.java:137) ~[jena-core-4.3.2.jar:4.3.2]
    at org.apache.jena.irix.IRIProviderJenaIRI.create(IRIProviderJenaIRI.java:145) ~[jena-core-4.3.2.jar:4.3.2]
    at org.apache.jena.irix.IRIx.create(IRIx.java:54) ~[jena-core-4.3.2.jar:4.3.2]
    at org.apache.jena.riot.out.NodeFormatterTTL.abbrevByBase(NodeFormatterTTL.java:100) ~[jena-arq-4.3.2.jar:4.3.2]
    at org.apache.jena.riot.out.NodeFormatterTTL.formatURI(NodeFormatterTTL.java:84) ~[jena-arq-4.3.2.jar:4.3.2]
    at org.apache.jena.riot.out.NodeFormatterBase.formatURI(NodeFormatterBase.java:70) ~[jena-arq-4.3.2.jar:4.3.2]
    at org.apache.jena.riot.out.NodeFormatterBase.format(NodeFormatterBase.java:43) ~[jena-arq-4.3.2.jar:4.3.2]
    at org.apache.jena.riot.writer.WriterStreamRDFBase.outputNode(WriterStreamRDFBase.java:159) ~[jena-arq-4.3.2.jar:4.3.2]
    at org.apache.jena.riot.writer.WriterStreamRDFBlocks.writePredicateObjectList(WriterStreamRDFBlocks.java:161) ~[jena-arq-4.3.2.jar:4.3.2]
    at org.apache.jena.riot.writer.WriterStreamRDFBlocks.printBatch(WriterStreamRDFBlocks.java:140) ~[jena-arq-4.3.2.jar:4.3.2]
    at org.apache.jena.riot.writer.WriterStreamRDFBlocks.printBatchTriples(WriterStreamRDFBlocks.java:126) ~[jena-arq-4.3.2.jar:4.3.2]
    at org.apache.jena.riot.writer.WriterStreamRDFBatched.finishBatchTriples(WriterStreamRDFBatched.java:100) ~[jena-arq-4.3.2.jar:4.3.2]
    at org.apache.jena.riot.writer.WriterStreamRDFBatched.batch(WriterStreamRDFBatched.java:74) ~[jena-arq-4.3.2.jar:4.3.2]
    at org.apache.jena.riot.writer.WriterStreamRDFBatched.print(WriterStreamRDFBatched.java:88) ~[jena-arq-4.3.2.jar:4.3.2]
    at org.apache.jena.riot.writer.WriterStreamRDFBase.triple(WriterStreamRDFBase.java:116) ~[jena-arq-4.3.2.jar:4.3.2]
    at org.apache.jena.riot.system.StreamRDFOps.sendTriplesToStream(StreamRDFOps.java:122) ~[jena-arq-4.3.2.jar:4.3.2]
    at org.apache.jena.riot.system.StreamRDFOps.sendGraphToStream(StreamRDFOps.java:108) ~[jena-arq-4.3.2.jar:4.3.2]
    at org.apache.jena.riot.writer.TurtleWriterBlocks.output(TurtleWriterBlocks.java:36) ~[jena-arq-4.3.2.jar:4.3.2]
    at org.apache.jena.riot.writer.TurtleWriterBase.output$(TurtleWriterBase.java:53) ~[jena-arq-4.3.2.jar:4.3.2]
    at org.apache.jena.riot.writer.TurtleWriterBase.write(TurtleWriterBase.java:47) ~[jena-arq-4.3.2.jar:4.3.2]
    at org.apache.jena.riot.RDFWriter.write$(RDFWriter.java:236) ~[jena-arq-4.3.2.jar:4.3.2]
    at org.apache.jena.riot.RDFWriter.output(RDFWriter.java:195) ~[jena-arq-4.3.2.jar:4.3.2]
    at org.apache.jena.riot.RDFWriter.output(RDFWriter.java:146) ~[jena-arq-4.3.2.jar:4.3.2]
    at org.apache.jena.riot.RDFWriterBuilder.output(RDFWriterBuilder.java:205) ~[jena-arq-4.3.2.jar:4.3.2]

Likewise, when I import a graph that contains the IRI
<http://foo.com/bar baz#xxxx> with a method like this:

static Model modelFrom(InputStream in, String baseURI) {
    Model model = ModelFactory.createDefaultModel();
    RDFParser.create()
        .source(in)
        .lang(Lang.TURTLE)
        .base(baseURI)
        .parse(model);
    return model;
}

the result is an error with a stack trace like this:

Caused by: org.apache.jena.riot.RiotException: [line: 30, col: 29] Bad
character in IRI (space): <http://foo.com/bar[space]...>
    at org.apache.jena.riot.system.ErrorHandlerFactory$ErrorHandlerStd.error(ErrorHandlerFactory.java:156)
    at org.apache.jena.riot.tokens.TokenizerText.error(TokenizerText.java:1334)
    at org.apache.jena.riot.tokens.TokenizerText.readIRI(TokenizerText.java:532)
    at org.apache.jena.riot.tokens.TokenizerText.parseToken(TokenizerText.java:194)
    at org.apache.jena.riot.tokens.TokenizerText.hasNext(TokenizerText.java:90)
    at org.apache.jena.atlas.iterator.PeekIterator.fill(PeekIterator.java:50)
    at org.apache.jena.atlas.iterator.PeekIterator.next(PeekIterator.java:92)
    at org.apache.jena.riot.lang.LangEngine.nextToken(LangEngine.java:98)
    at org.apache.jena.riot.lang.LangTurtleBase.predicateObjectList(LangTurtleBase.java:340)
    at org.apache.jena.riot.lang.LangTurtleBase.triples(LangTurtleBase.java:314)
    at org.apache.jena.riot.lang.LangTurtleBase.triplesSameSubject(LangTurtleBase.java:178)
    at org.apache.jena.riot.lang.LangTurtle.oneTopLevelElement(LangTurtle.java:46)
    at org.apache.jena.riot.lang.LangTurtleBase.runParser(LangTurtleBase.java:79)
    at org.apache.jena.riot.lang.LangBase.parse(LangBase.java:43)
    at org.apache.jena.riot.RDFParserRegistry$ReaderRIOTLang.read(RDFParserRegistry.java:186)
    at org.apache.jena.riot.RDFParser.read(RDFParser.java:366)
    at org.apache.jena.riot.RDFParser.parseNotUri(RDFParser.java:356)
    at org.apache.jena.riot.RDFParser.parse(RDFParser.java:306)
    at org.apache.jena.riot.RDFParser.parse(RDFParser.java:252)
    at org.apache.jena.riot.RDFParser.parse(RDFParser.java:261)
    at org.apache.jena.riot.RDFParserBuilder.parse(RDFParserBuilder.java:576)

Is there any way to configure the Turtle Writer and/or Reader to simply log
these errors and continue processing, assuming the errors are recoverable?
There appear to be a number of ways to configure IRI validation for writing
and reading, but the indirection was too deep for me to work out which
configuration the Turtle Writer and Reader actually use.

Configuring the Turtle Reader/Writer validation would be very helpful (for
me, at least) for several reasons:

   - Often, I have no control over the contents of the graphs, but I still
   want to export and import the graphs.
   - It seems reasonable that, if I can store an invalid IRI in a Jena TDB,
   I should be able to export that data and re-import it. This would allow me
   to restore a graph, invalid data and all, to its original state from its
   previously-exported Turtle file.
   - These exceptions stop the export/import dead in its tracks. Therefore,
   if a graph has multiple invalid IRIs, the export/import must be executed at
   least once for each invalid IRI, after each error is fixed. It would be
   much nicer (particularly for a large graph) to report multiple errors per
   execution.
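
To illustrate the last point, here is a minimal sketch of the kind of
"report everything in one pass" behavior I have in mind. It is plain Java,
not a Jena API: the class and method names are hypothetical, and it assumes
the IRIs have already been pulled out of the graph as strings. It only
checks for whitespace, the case I hit most often, and does not attempt full
IRI validation.

```java
import java.util.ArrayList;
import java.util.List;

public class InvalidIriReport {

    /**
     * Returns every IRI string that contains at least one whitespace
     * character, in the order encountered. (Hypothetical helper; whitespace
     * is the only problem it looks for.)
     */
    static List<String> findIrisWithWhitespace(Iterable<String> iris) {
        List<String> bad = new ArrayList<>();
        for (String iri : iris) {
            if (iri.chars().anyMatch(Character::isWhitespace)) {
                bad.add(iri);
            }
        }
        return bad;
    }

    public static void main(String[] args) {
        // Sample IRIs standing in for the ones extracted from a graph.
        List<String> iris = List.of(
            "http://foo.com/bar baz",
            "http://foo.com/ok",
            "http://foo.com/bar baz#xxxx");

        // Log every offender instead of stopping at the first one.
        for (String iri : findIrisWithWhitespace(iris)) {
            System.err.println("Invalid IRI (whitespace): <" + iri + ">");
        }
    }
}
```

A check like this would run before the export, so that all offending IRIs
are known up front rather than discovered one failed run at a time.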

I would greatly appreciate any and all help. :-)
Brian
