Hello Andy,

First of all, thanks for the answer. I have added my replies to your comments inline below.


Comments inline and at the end ...

On 27/01/15 10:57, Lorenz Bühmann wrote:
Hello,

when I run the SPARQL query on the DBpedia endpoint
http://dbpedia.org/sparql

CONSTRUCT {
<http://dbpedia.org/resource/Leipzig> ?p0 ?o0.
}
WHERE {
<http://dbpedia.org/resource/Leipzig> ?p0 ?o0.
}


by using the code


String query = "CONSTRUCT {\n" +
        "<http://dbpedia.org/resource/Trey_Parker> ?p0 ?o0.\n" +
        "?o0 ?p1 ?o1.\n" +
        "}\n" +
        "WHERE {\n" +
        "<http://dbpedia.org/resource/Trey_Parker> ?p0 ?o0.\n" +
        "OPTIONAL{\n" +
        "?o0 ?p1 ?o1.\n" +
        "}}";
com.hp.hpl.jena.sparql.engine.http.QueryEngineHTTP qe =
        new com.hp.hpl.jena.sparql.engine.http.QueryEngineHTTP("http://dbpedia.org/sparql", query);
qe.setDefaultGraphURIs(Collections.singletonList("http://dbpedia.org"));
Model model = qe.execConstruct();
qe.close();


I get an exception thrown by the Turtle parser:

11:48:30,550 ErrorHandlerFactory$ErrorLogger - [line: 263, col: 45] Bad IRI: <http://th.dbpedia.org/resource/หมวดหมู่:ผู้กำกับภาพยนตร์ชาว อเมริกัน> Code: 47/NOT_NFKC in PATH: The IRI is not in Unicode Normal Form KC.

This is a warning - the parser emits the data and continues ...

(I'm somewhat tempted to turn the NF tests off - while strictly correct, few people worry or understand NF - feedback welcome).

From my point of view, these warnings are quite confusing, although I usually tend to ignore warnings of this kind.
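
To illustrate why these IRIs trigger NOT_NFKC, here is a minimal Java sketch that uses only the JDK's java.text.Normalizer. It is not how Jena's IRI library performs its check, and the names NfkcCheck, isNfkc and toNfkc are made up for this example. The sample characters are written as Unicode escapes: U+207F (superscript n, which occurs in the zh_min_nan IRI) and U+0E33 (Thai SARA AM, which occurs in the Thai IRIs). Both have compatibility decompositions, so a string containing either of them is not in NFKC.

```java
import java.text.Normalizer;

class NfkcCheck {
    /** True if the text is already in Unicode Normalization Form KC. */
    static boolean isNfkc(String text) {
        return Normalizer.isNormalized(text, Normalizer.Form.NFKC);
    }

    /** Returns the NFKC form of the text. */
    static String toNfkc(String text) {
        return Normalizer.normalize(text, Normalizer.Form.NFKC);
    }

    public static void main(String[] args) {
        // U+207F SUPERSCRIPT LATIN SMALL LETTER N after "i" and U+00E1 (a with acute).
        String superscriptN = "i\u00E1\u207F";
        // U+0E33 THAI CHARACTER SARA AM on its own.
        String saraAm = "\u0E33";

        System.out.println("superscript n in NFKC? " + isNfkc(superscriptN));
        System.out.println("sara am in NFKC?       " + isNfkc(saraAm));
        // NFKC replaces the superscript letter with a plain "n".
        System.out.println("normalized equals plain form? "
                + toNfkc(superscriptN).equals("i\u00E1n"));
    }
}
```

Normalizing such an IRI yields a different string, and therefore a different IRI, so the check can only warn about the data; it cannot repair it.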


11:48:30,553 ErrorHandlerFactory$ErrorLogger - [line: 263, col: 45] Bad IRI: <http://th.dbpedia.org/resource/หมวดหมู่:ผู้กำกับภาพยนตร์ชาว อเมริกัน> Code: 56/COMPATIBILITY_CHARACTER in PATH: TODO
11:48:30,557 ErrorHandlerFactory$ErrorLogger - [line: 288, col: 45] Bad IRI: <http://zh_min_nan.dbpedia.org/resource/Category:Bí-kok_tiān-iáⁿ_tō-ián> Code: 47/NOT_NFKC in PATH: The IRI is not in Unicode Normal Form KC.
11:48:30,557 ErrorHandlerFactory$ErrorLogger - [line: 288, col: 45] Bad IRI: <http://zh_min_nan.dbpedia.org/resource/Category:Bí-kok_tiān-iáⁿ_tō-ián> Code: 56/COMPATIBILITY_CHARACTER in PATH: TODO
11:48:30,574 ErrorHandlerFactory$ErrorLogger - [line: 440, col: 13] Bad IRI: <http://th.dbpedia.org/resource/หมวดหมู่:ผู้อำนวยการสร้างรายการ โทรทัศน์ชาวอเมริกัน> Code: 47/NOT_NFKC in PATH: The IRI is not in Unicode Normal Form KC.
11:48:30,575 ErrorHandlerFactory$ErrorLogger - [line: 440, col: 13] Bad IRI: <http://th.dbpedia.org/resource/หมวดหมู่:ผู้อำนวยการสร้างรายการ โทรทัศน์ชาวอเมริกัน> Code: 56/COMPATIBILITY_CHARACTER in PATH: TODO

and now we have a real error.

What's line 513? (You can get the response by using curl or wget).
Well, from what I can see line 513 contains

ns56:Лауреати_премії_«Еммі» ,

so I guess the char « is unknown for some reason.
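
For anyone who wants to look at a given line of the raw response without curl or wget, the following is a rough Java 11+ sketch using only the JDK HTTP client. The names FetchLine, buildUri and lineAt are hypothetical. It assumes the endpoint accepts the standard SPARQL protocol parameters query and default-graph-uri and honours an Accept header of text/turtle. The endpoint may order triples differently between runs, so line 513 is not guaranteed to hold the same content each time.

```java
import java.net.URI;
import java.net.URLEncoder;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.nio.charset.StandardCharsets;

class FetchLine {
    /** Builds a SPARQL protocol GET URI with URL-encoded parameters. */
    static URI buildUri(String endpoint, String defaultGraph, String query) {
        return URI.create(endpoint
                + "?default-graph-uri=" + URLEncoder.encode(defaultGraph, StandardCharsets.UTF_8)
                + "&query=" + URLEncoder.encode(query, StandardCharsets.UTF_8));
    }

    /** Returns the 1-based line of the body, or null if it does not exist. */
    static String lineAt(String body, int lineNumber) {
        String[] lines = body.split("\r?\n", -1);
        return (lineNumber >= 1 && lineNumber <= lines.length) ? lines[lineNumber - 1] : null;
    }

    public static void main(String[] args) throws Exception {
        String query = "CONSTRUCT { <http://dbpedia.org/resource/Trey_Parker> ?p0 ?o0 . ?o0 ?p1 ?o1 . } "
                + "WHERE { <http://dbpedia.org/resource/Trey_Parker> ?p0 ?o0 . OPTIONAL { ?o0 ?p1 ?o1 . } }";
        URI uri = buildUri("http://dbpedia.org/sparql", "http://dbpedia.org", query);

        // Without arguments, only print the request URI (it can be passed to curl).
        if (args.length == 0) {
            System.out.println(uri);
            return;
        }

        // With a line number as argument, fetch the Turtle response and print that line.
        int lineNumber = Integer.parseInt(args[0]);
        HttpRequest request = HttpRequest.newBuilder(uri)
                .header("Accept", "text/turtle")
                .GET()
                .build();
        HttpResponse<String> response = HttpClient.newHttpClient()
                .send(request, HttpResponse.BodyHandlers.ofString(StandardCharsets.UTF_8));
        System.out.println(lineAt(response.body(), lineNumber));
    }
}
```

Run with 513 as the only argument, this prints the line the parser complained about, provided the response is unchanged.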

11:48:30,584 ErrorHandlerFactory$ErrorLogger - [line: 513, col: 24]
Unknown char: «(171;0x00AB)

The actual error comes from looking for a new Turtle token and not finding a start-of-token marker like < or " or a digit. So the parser assumes a prefixed name (which does not start with an identifying character).

It might be badly written data (some unescaped significant character earlier in the triple). It's a structural problem with the data sent back.
OK, so the DBpedia endpoint, i.e. Virtuoso, seems to return structurally invalid data. I'll probably have to file an issue or at least ask on their mailing list.
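
One possible reading of the error, sketched in Java: in the Turtle grammar, the local part of a prefixed name may only contain characters from the PN_CHARS production (plus ':' and '.', and escape sequences). Cyrillic letters fall inside those ranges, but U+00AB (left guillemet) does not. The sketch below is a simplification and not Jena's tokenizer: it ignores percent and backslash escapes as well as the special rules for the first and last character, and the names PnLocalCheck and firstIllegalIndex are hypothetical. The sample string is the local name from line 513, written with Unicode escapes.

```java
class PnLocalCheck {
    /** PN_CHARS_BASE ranges from the Turtle grammar. */
    static boolean isPnCharsBase(int cp) {
        return (cp >= 'A' && cp <= 'Z') || (cp >= 'a' && cp <= 'z')
                || (cp >= 0x00C0 && cp <= 0x00D6) || (cp >= 0x00D8 && cp <= 0x00F6)
                || (cp >= 0x00F8 && cp <= 0x02FF) || (cp >= 0x0370 && cp <= 0x037D)
                || (cp >= 0x037F && cp <= 0x1FFF) || (cp >= 0x200C && cp <= 0x200D)
                || (cp >= 0x2070 && cp <= 0x218F) || (cp >= 0x2C00 && cp <= 0x2FEF)
                || (cp >= 0x3001 && cp <= 0xD7FF) || (cp >= 0xF900 && cp <= 0xFDCF)
                || (cp >= 0xFDF0 && cp <= 0xFFFD) || (cp >= 0x10000 && cp <= 0xEFFFF);
    }

    /** PN_CHARS: PN_CHARS_BASE plus '_', '-', digits and a few combining ranges. */
    static boolean isPnChars(int cp) {
        return isPnCharsBase(cp) || cp == '_' || cp == '-' || (cp >= '0' && cp <= '9')
                || cp == 0x00B7 || (cp >= 0x0300 && cp <= 0x036F)
                || (cp >= 0x203F && cp <= 0x2040);
    }

    /** Index of the first character not allowed in a local name, or -1 if none. */
    static int firstIllegalIndex(String local) {
        for (int i = 0; i < local.length(); ) {
            int cp = local.codePointAt(i);
            if (!isPnChars(cp) && cp != ':' && cp != '.') {
                return i;
            }
            i += Character.charCount(cp);
        }
        return -1;
    }

    public static void main(String[] args) {
        String local = "\u041B\u0430\u0443\u0440\u0435\u0430\u0442\u0438_" // first Cyrillic word
                + "\u043F\u0440\u0435\u043C\u0456\u0457_"                  // second Cyrillic word
                + "\u00AB\u0415\u043C\u043C\u0456\u00BB";                  // guillemets around third word
        System.out.println("first illegal index: " + firstIllegalIndex(local));
    }
}
```

If this reading is right, a serializer would have to write such an IRI in full between < and > rather than abbreviate it to a prefixed name.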

(Hmm - the stack trace does not seem to quite agree with the current codebase. What version are you running?)
I used JENA ARQ 2.11.2, but now updated to

JENA ARQ 2.12.1
JENA Core 2.12.1
JENA IRI 1.1.1

The stacktrace seems to be the same as before:

WARN - [line: 263, col: 45] Bad IRI: <http://th.dbpedia.org/resource/หมวดหมู่:ผู้กำกับภาพยนตร์ชาว อเมริกัน> Code: 47/NOT_NFKC in PATH: The IRI is not in Unicode Normal Form KC.
WARN - [line: 263, col: 45] Bad IRI: <http://th.dbpedia.org/resource/หมวดหมู่:ผู้กำกับภาพยนตร์ชาว อเมริกัน> Code: 56/COMPATIBILITY_CHARACTER in PATH: TODO
WARN - [line: 288, col: 45] Bad IRI: <http://zh_min_nan.dbpedia.org/resource/Category:Bí-kok_tiān-iáⁿ_tō-ián> Code: 47/NOT_NFKC in PATH: The IRI is not in Unicode Normal Form KC.
WARN - [line: 288, col: 45] Bad IRI: <http://zh_min_nan.dbpedia.org/resource/Category:Bí-kok_tiān-iáⁿ_tō-ián> Code: 56/COMPATIBILITY_CHARACTER in PATH: TODO
WARN - [line: 440, col: 13] Bad IRI: <http://th.dbpedia.org/resource/หมวดหมู่:ผู้อำนวยการสร้างรายการ โทรทัศน์ชาวอเมริกัน> Code: 47/NOT_NFKC in PATH: The IRI is not in Unicode Normal Form KC.
WARN - [line: 440, col: 13] Bad IRI: <http://th.dbpedia.org/resource/หมวดหมู่:ผู้อำนวยการสร้างรายการ โทรทัศน์ชาวอเมริกัน> Code: 56/COMPATIBILITY_CHARACTER in PATH: TODO
ERROR - [line: 513, col: 24] Unknown char: «(171;0x00AB)
Exception in thread "main" org.apache.jena.riot.RiotException: [line: 513, col: 24] Unknown char: «(171;0x00AB)
    at org.apache.jena.riot.system.ErrorHandlerFactory$ErrorHandlerStd.fatal(ErrorHandlerFactory.java:136)
    at org.apache.jena.riot.lang.LangEngine.raiseException(LangEngine.java:163)
    at org.apache.jena.riot.lang.LangEngine.nextToken(LangEngine.java:106)
    at org.apache.jena.riot.lang.LangTurtleBase.triplesNode(LangTurtleBase.java:368)
    at org.apache.jena.riot.lang.LangTurtleBase.objectList(LangTurtleBase.java:350)
    at org.apache.jena.riot.lang.LangTurtleBase.predicateObjectItem(LangTurtleBase.java:288)
    at org.apache.jena.riot.lang.LangTurtleBase.predicateObjectList(LangTurtleBase.java:281)
    at org.apache.jena.riot.lang.LangTurtleBase.triples(LangTurtleBase.java:250)
    at org.apache.jena.riot.lang.LangTurtleBase.triplesSameSubject(LangTurtleBase.java:191)
    at org.apache.jena.riot.lang.LangTurtle.oneTopLevelElement(LangTurtle.java:44)
    at org.apache.jena.riot.lang.LangTurtleBase.runParser(LangTurtleBase.java:90)
    at org.apache.jena.riot.lang.LangBase.parse(LangBase.java:42)
    at org.apache.jena.riot.RDFParserRegistry$ReaderRIOTLang.read(RDFParserRegistry.java:182)
    at org.apache.jena.riot.RDFDataMgr.process(RDFDataMgr.java:906)
    at org.apache.jena.riot.RDFDataMgr.read(RDFDataMgr.java:257)
    at org.apache.jena.riot.RDFDataMgr.read(RDFDataMgr.java:231)
    at org.apache.jena.riot.RDFDataMgr.read(RDFDataMgr.java:221)
    at com.hp.hpl.jena.sparql.engine.http.QueryEngineHTTP.execModel(QueryEngineHTTP.java:432)
    at com.hp.hpl.jena.sparql.engine.http.QueryEngineHTTP.execConstruct(QueryEngineHTTP.java:387)
    at com.hp.hpl.jena.sparql.engine.http.QueryEngineHTTP.execConstruct(QueryEngineHTTP.java:382)
    at org.aksw.jena_sparql_api.virtfix.QueryExecutionFactoryVirtFix.main(QueryExecutionFactoryVirtFix.java:80)

Exception in thread "main" org.apache.jena.riot.RiotException: [line: 513, col: 24] Unknown char: «(171;0x00AB)
     at org.apache.jena.riot.system.ErrorHandlerFactory$ErrorHandlerStd.fatal(ErrorHandlerFactory.java:136)
     at org.apache.jena.riot.lang.LangEngine.raiseException(LangEngine.java:163)
     at org.apache.jena.riot.lang.LangEngine.nextToken(LangEngine.java:106)
     at org.apache.jena.riot.lang.LangTurtleBase.triplesNode(LangTurtleBase.java:368)
     at org.apache.jena.riot.lang.LangTurtleBase.objectList(LangTurtleBase.java:350)
     at org.apache.jena.riot.lang.LangTurtleBase.predicateObjectItem(LangTurtleBase.java:288)
     at org.apache.jena.riot.lang.LangTurtleBase.predicateObjectList(LangTurtleBase.java:281)
     at org.apache.jena.riot.lang.LangTurtleBase.triples(LangTurtleBase.java:250)
     at org.apache.jena.riot.lang.LangTurtleBase.triplesSameSubject(LangTurtleBase.java:191)
     at org.apache.jena.riot.lang.LangTurtle.oneTopLevelElement(LangTurtle.java:44)
     at org.apache.jena.riot.lang.LangTurtleBase.runParser(LangTurtleBase.java:90)
     at org.apache.jena.riot.lang.LangBase.parse(LangBase.java:42)
     at org.apache.jena.riot.RDFParserRegistry$ReaderRIOTLang.read(RDFParserRegistry.java:169)
     at org.apache.jena.riot.RDFDataMgr.process(RDFDataMgr.java:859)
     at org.apache.jena.riot.RDFDataMgr.read(RDFDataMgr.java:255)
     at org.apache.jena.riot.RDFDataMgr.read(RDFDataMgr.java:229)
     at org.apache.jena.riot.RDFDataMgr.read(RDFDataMgr.java:219)
     at com.hp.hpl.jena.sparql.engine.http.QueryEngineHTTP.execModel(QueryEngineHTTP.java:431)
     at com.hp.hpl.jena.sparql.engine.http.QueryEngineHTTP.execConstruct(QueryEngineHTTP.java:387)
     at com.hp.hpl.jena.sparql.engine.http.QueryEngineHTTP.execConstruct(QueryEngineHTTP.java:382)



When I force the QueryEngineHTTP to request RDFXML instead of TURTLE
which is somehow the default setting it works without exception.

My questions are:

1. Is it a bug in Virtuoso and a wrong character is returned or is it
some problem within the Turtle parser?

For the NFC warnings, it is mostly that the data is not in NFC, rather than the Virtuoso engine messing with it.

2. Is there a way to change the accept format from outside the
QueryEngineHTTP class?

QueryEngineHTTP.setModelContentType(String)

3. Is there a way to ignore such kind of triples such that I get some
warning but the parser does not terminate with an error?

Not really (see above about suppressing the check), but you can configure your logging to not output anything.

Thanks in advance.

Kind regards,
Lorenz


    Andy



Kind regards,
Lorenz

--
Lorenz Bühmann
AKSW group, University of Leipzig
Group: http://aksw.org - semantic web research center
