Hi Brian,
On 13/08/2022 04:13, Brian Vosburgh wrote:
Hello, Jena Community.
TL;DR: Is there a way I can export and/or import a graph with invalid IRIs;
typically, IRIs with space in them?
The assumption that URIs are valid runs through the system. Change just the
parser and something else will break, in Jena or in application code.
This includes the UI, because a URI containing a space may now have a line
break put in it when displayed.
If Jena processed URIs with spaces, then we're in the situation where
Jena produces illegal RDF for other systems.
It's not a simple matter of turning a check off or on.
I'm sure your colleagues can add to the background here.
Details:
When I try to write out a graph that contains the IRI <http://foo.com/bar
baz> with a method like this:
static void writeModelTo(String baseURI, Model model, OutputStream out) {
    RDFWriter.create()
        .base(baseURI)
        .format(RDFFormat.TURTLE_BLOCKS)
        .source(model)
        .output(out);
}
the result is an error with a stack trace like this:
FWIW this only happens because there is a base URI. To sort out base
abbreviation, the code has to look at the data URI as a URI, not a string,
and apply URI resolution rules, which is where it breaks. Otherwise the data
is printed, as it is for N-Triples.
The database can be written out.
That is for spaces - other bad characters like '>' or '\n' or '{' are worse.
(Translating to %xx is wrong. In URIs, %20 is not a space; it is 3
characters, %-2-0, and compares as such. Think of %7E, which is '~'.)
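To make the %xx point concrete, here is a minimal sketch in plain Java with no Jena involved. The class name IriEquality, the helper sameIri and the ~user URIs are made up for illustration. It assumes only that RDF compares IRIs as character strings, with no percent-decoding.

```java
final class IriEquality {
    // RDF compares IRIs as plain character sequences; nothing is
    // percent-decoded or normalized before the comparison.
    static boolean sameIri(String a, String b) {
        return a.equals(b);
    }

    public static void main(String[] args) {
        // A space versus the three characters '%', '2', '0'.
        System.out.println(sameIri("http://foo.com/bar baz", "http://foo.com/bar%20baz"));
        // '~' versus the three characters '%', '7', 'E'.
        System.out.println(sameIri("http://foo.com/~user", "http://foo.com/%7Euser"));
    }
}
```

Both lines print false, so rewriting a space to %20 on export gives a different node from the one that was stored.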
Caused by: org.apache.jena.irix.IRIException: <http://foo.com/bar baz>
Code: 17/WHITESPACE in PATH: A single whitespace character. These
match no grammar rules of URIs/IRIs. These characters are permitted in
RDF URI References, XML system identifiers, and XML Schema anyURIs.
~[jena-core-4.3.2.jar:4.3.2]
at
org.apache.jena.riot.out.NodeFormatterTTL.abbrevByBase(NodeFormatterTTL.java:100)
Likewise, when I import a graph that contains the IRI <http://foo.com/bar
baz#xxxx> with a method like this:
static Model modelFrom(InputStream in, String baseURI) {
    Model model = ModelFactory.createDefaultModel();
    RDFParser.create()
        .source(in)
        .lang(Lang.TURTLE)
        .base(baseURI)
        .parse(model);
    return model;
}
the result is an error with a stack trace like this:
Caused by: org.apache.jena.riot.RiotException: [line: 30, col: 29] Bad
character in IRI (space): <http://foo.com/bar[space]...>
Actually you are asking for spaces in certain places - i.e. URI parsing
and internal to the path/querystring/fragment components.
Otherwise: what about:
<http://foo.com/bar >
< http://foo.com/bar>
<ht tp://foo.com/bar>
<http:/ /foo.com/bar>
Or:
----
<http://foo.com/
bar>
----
Is there any way to configure the Turtle Writer and/or Reader to simply log
these errors and continue processing, assuming the issues are not too
fatal? It appears there are a number of ways to configure the IRI writing
and reading validation, but the indirection was a bit too deep for me to
figure out how to configure the validation used by the Turtle Writer and
Reader.
Configuring the Turtle Reader/Writer validation would be very helpful (for
me, at least) for several reasons:
- Often, I have no control over the contents of the graphs, but I still
want to export and import the graphs.
- It seems reasonable that, if I can store an invalid IRI in a Jena TDB,
I should be able to export that data and re-import it. This would allow me
to restore a graph, invalid data and all, to its original state from its
previously-exported Turtle file.
For historical reasons, the Jena API does not check URIs.
You do control the application code of your system! Using SHACL maybe?
- These exceptions stop the export/import dead in its tracks. Therefore,
if a graph has multiple invalid IRIs, the export/import must be executed at
least once for each invalid IRI, after each error is fixed. It would be
much nicer (particularly for a large graph) to report multiple errors per
execution.
You can write a SPARQL update to fix the URIs in place.
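Before fixing anything, it may help to list every problem URI in one pass rather than hitting them one exception at a time. The following is only a sketch in plain Java, not Jena's own checking. The names IriCharCheck, isForbidden and findBadIris are hypothetical. It tests only for the characters that the Turtle grammar excludes from an IRIREF (U+0000 to U+0020 and < > " { } | ^ ` \), which is far less than full RFC 3986/3987 validation. It assumes you have already collected the URI strings from the graph yourself.

```java
import java.util.ArrayList;
import java.util.List;

final class IriCharCheck {
    // Characters the Turtle grammar excludes from an IRIREF:
    // U+0000 to U+0020, and < > " { } | ^ ` \
    static boolean isForbidden(char c) {
        return c <= 0x20 || "<>\"{}|^`\\".indexOf(c) >= 0;
    }

    // Returns one message per offending IRI instead of stopping at the first.
    static List<String> findBadIris(List<String> iris) {
        List<String> problems = new ArrayList<>();
        for (String iri : iris) {
            for (int i = 0; i < iri.length(); i++) {
                char c = iri.charAt(i);
                if (isForbidden(c)) {
                    problems.add(String.format(
                        "<%s>: forbidden character U+%04X at index %d", iri, (int) c, i));
                    break; // report only the first bad character of each IRI
                }
            }
        }
        return problems;
    }
}
```

Feeding it the strings "http://foo.com/bar baz" and "http://foo.com/ok" yields a single report for the first one, pointing at the space at index 18. The resulting list tells you which URIs an update needs to rewrite.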
I would greatly appreciate any and all help. :-)
Brian
Andy