Yeah guys, sorry, I'm dumb and didn't scroll down far enough to see Andy's last inline comment pointing to TDB as the source of the encoding issue.

Anyway, Andy has already spotted the source of the issue, so as usual it will be fixed soon, I think.

On 25.04.20 10:53, Andy Seaborne wrote:
> JENA-1890, PR#735
>
> On 25/04/2020 08:34, Lorenz Buehmann wrote:
>> Hi,
>>
>> I tried with cURL + riot CLI tools manually and can't reproduce the
>> parsing issue, neither with RDF/XML nor with Turtle.
>
> The problem is in TDB. In fact the use of \u is not part of the
> problem directly. The parser step works and the database is loaded
> correctly.
>
> Encoding of URI terms in TDB1 (not TDB2) was added in JENA-1793/Jena 3.14.0
> using "_" as the hex marker; so like %XX but as _XX. It allows illegal
> URIs (spaces :-() to be handled by the database.
>
> The decoder is also more general - it can decode multibyte codepoints
> written as %xx%xx but (bug) it gets bytes and chars mixed up at one
> point.
>
> When all the characters before the _ are single byte in UTF-8 it works,
> but "사용_" has multi-byte characters before the _. The decoder then
> accesses the string and it can be off the end.
>
>     Andy
>
>> curl -L -H "Accept: text/turtle" http://dbpedia.org/resource/User_guide > /tmp/test.ttl
>> curl -L -H "Accept: application/rdf+xml" http://dbpedia.org/resource/User_guide > /tmp/test.rdf
>>
>> I know that a few years ago DBpedia (resp. its Virtuoso backend) had
>> some issues with serialization, but this was fixed a long time ago.
>>
>> Also, I don't understand what you mean by "suspicious"? The parser can
>> easily convert the UTF-8 encoded URIs as expected:
>>
>> riot --check /tmp/test.ttl
>>
>> <http://dbpedia.org/resource/User_guide>
>> <http://www.w3.org/2002/07/owl#sameAs>
>> <http://nl.dbpedia.org/resource/Handleiding> .
>> <http://dbpedia.org/resource/User_guide>
>> <http://www.w3.org/2002/07/owl#sameAs>
>> <http://cs.dbpedia.org/resource/Uživatelská_příručka> .
>> <http://dbpedia.org/resource/User_guide>
>> <http://www.w3.org/2002/07/owl#sameAs>
>> <http://wikidata.dbpedia.org/resource/Q1057179> .
>> <http://dbpedia.org/resource/User_guide>
>> <http://www.w3.org/2002/07/owl#sameAs>
>> <http://www.wikidata.org/entity/Q1057179> .
>> <http://dbpedia.org/resource/User_guide>
>> <http://www.w3.org/2002/07/owl#sameAs>
>> <http://ko.dbpedia.org/resource/사용_설명서> .
>> <http://dbpedia.org/resource/User_guide>
>> <http://www.w3.org/2002/07/owl#sameAs>
>> <http://es.dbpedia.org/resource/Guía_del_usuario> .
>> <http://dbpedia.org/resource/User_guide>
>> <http://www.w3.org/2002/07/owl#sameAs>
>> <http://ja.dbpedia.org/resource/マニュアル> .
>> <http://dbpedia.org/resource/User_guide>
>> <http://www.w3.org/2002/07/owl#sameAs>
>> <http://it.dbpedia.org/resource/Manuale> .
>> <http://dbpedia.org/resource/User_guide>
>> <http://www.w3.org/2002/07/owl#sameAs>
>> <http://rdf.freebase.com/ns/m.04mqbf> .
>> <http://dbpedia.org/resource/User_guide>
>> <http://www.w3.org/2002/07/owl#sameAs>
>> <http://fr.dbpedia.org/resource/Mode_d'emploi> .
>> <http://dbpedia.org/resource/User_guide>
>> <http://www.w3.org/2002/07/owl#sameAs>
>> <http://yago-knowledge.org/resource/User_guide> .
>> <http://dbpedia.org/resource/User_guide>
>> <http://www.w3.org/2002/07/owl#sameAs>
>> <http://de.dbpedia.org/resource/Gebrauchsanleitung> .
>> <http://dbpedia.org/resource/User_guide>
>> <http://www.w3.org/2002/07/owl#sameAs>
>> <http://id.dbpedia.org/resource/Manual_pengguna> .
>> <http://dbpedia.org/resource/User_guide>
>> <http://www.w3.org/2002/07/owl#sameAs>
>> <http://dbpedia.org/resource/User_guide> .
>>
>> On 24.04.20 22:33, Jean-Marc Vanel wrote:
>>> On Fri, 24 Apr 2020 at 22:17, Andy Seaborne wrote:
>>>
>>>> On 24/04/2020 15:17, Jean-Marc Vanel wrote:
>>>>> How to reproduce with 3.14.0
>>>>>
>>>>> bin/tdbloader --loc TDB --graph=http://dbpedia.org/resource/User_guide \
>>>>>     --verbose http://dbpedia.org/resource/User_guide
>>>> Did the log say anything?
>>>>
>>> NO, nothing special, neither with --debug .
>>>
>>>> As this is a remote URL, did it all arrive and parse without warnings?
>>> No warning.
>>>
>>>> Was the database fresh or was there data in it to start with?
>>> database fresh, of course.
>>>
>>>
>>>>> echo "
>>>>> CONSTRUCT {
>>>>> <http://dbpedia.org/resource/User_guide>
>>>>> ?P ?O . }
>>>>> WHERE { GRAPH ?G {
>>>>> <http://dbpedia.org/resource/User_guide>
>>>>> ?P ?O . } }
>>>>> LIMIT
>>>>> # 30 # OK
>>>>> 35 # KO !!!
>>>>> " > /tmp/const.ql
>>>>>
>>>>> bin/tdbquery --debug --loc=TDB --query /tmp/const.ql
>>>>>
>>>>> And here is the stack:
>>>>>
>>>>> 16:14:23 ERROR BindingTDB :: get1(?O)
>>>>> java.lang.StringIndexOutOfBoundsException: String index out of range: 39
>>>>> at java.lang.String.charAt(String.java:658)
>>>>> at org.apache.jena.atlas.lib.StrUtils.decodeHex(StrUtils.java:212)
>>>>> at org.apache.jena.tdb.store.nodetable.NodecSSE.decode(NodecSSE.java:121)
>>>>>
>>>> If the load was clean, the database is intact and it is a decoding bug
>>>> in Jena for a URI. The data has a lot of encoded \u terms but it's a URI
>>>> in the object position causing a problem. (I don't see why these are
>>>> encoded - it's not necessary).
>>>>
>>> Indeed these URIs are suspect:
>>>
>>> <http://fr.dbpedia.org/resource/Mode_d\u0027emploi> ,
>>> <http://es.dbpedia.org/resource/Gu\u00EDa_del_usuario> .
>>>
>>> <http://ja.dbpedia.org/resource/\u30DE\u30CB\u30E5\u30A2\u30EB> ,
>>> <http://cs.dbpedia.org/resource/U\u017Eivatelsk\u00E1_p\u0159\u00EDru\u010Dka> ,
>>> <http://ko.dbpedia.org/resource/\uC0AC\uC6A9_\uC124\uBA85\uC11C> .
>>>
>>>
>>>>     Andy
>>>>
>>>> ...
>>>>> at tdb.tdbquery.main(tdbquery.java:33)
>>>>>
>>>>> NOTE : no problem with apache-jena-3.10.0-SNAPSHOT !?
>>>>>
>>>>>
>>>>> Jean-Marc Vanel
>>>>> <http://semantic-forms.cc:9112/display?displayuri=http://jmvanel.free.fr/jmv.rdf%23me>
>>>>> +33 (0)6 89 16 29 52
>>>>> Twitter: @jmvanel , @jmvanel_fr ; chat: irc://irc.freenode.net#eulergui
>>>>> Chroniques jardin
>>>>> <http://semantic-forms.cc:1952/history?uri=http%3A%2F%2Fdbpedia.org%2Fresource%2FChronicle>
>>>>>
>>
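
For the archive, here is a rough sketch of the kind of byte/char mix-up Andy describes in his reply quoted above. It is only an illustration in Python, not the actual StrUtils.decodeHex code: the function names are made up, and I am assuming that a literal "_" in a URI gets stored as "_5F" and that the buggy path uses a UTF-8 byte offset as a character index. The real TDB1 encoding may differ in its details.

```python
def encode_hex(text: str, marker: str = "_", unsafe: str = " ") -> str:
    """Escape the marker itself and any 'unsafe' character as marker + two
    hex digits per UTF-8 byte (hypothetical scheme, modelled on %XX)."""
    out = []
    for ch in text:
        if ch == marker or ch in unsafe:
            out.append("".join(f"{marker}{b:02X}" for b in ch.encode("utf-8")))
        else:
            out.append(ch)
    return "".join(out)


def decode_hex(text: str, marker: str = "_") -> str:
    """Correct decoder: walk the string by character index, collect the
    resulting bytes, and decode them as UTF-8 once at the end."""
    buf = bytearray()
    i = 0
    while i < len(text):
        ch = text[i]
        if ch == marker:
            buf.append(int(text[i + 1:i + 3], 16))
            i += 3
        else:
            buf.extend(ch.encode("utf-8"))
            i += 1
    return buf.decode("utf-8")


def decode_hex_buggy(text: str, marker: str = "_") -> str:
    """Buggy variant (assumed shape of the bug): it scans the UTF-8 bytes,
    but reads the hex digits from the string using the byte offset."""
    raw = text.encode("utf-8")
    buf = bytearray()
    i = 0
    while i < len(raw):
        if raw[i] == ord(marker):
            # BUG: i is a byte offset, but text is indexed by character.
            buf.append(int(text[i + 1] + text[i + 2], 16))
            i += 3
        else:
            buf.append(raw[i])
            i += 1
    return buf.decode("utf-8")
```

With an ASCII-only prefix, byte offsets and character indexes coincide, so decode_hex_buggy("User_5Fguide") still works. With "사용_5F설명서" the marker sits at byte offset 6 but character index 2, and the string is only 8 characters long, so the buggy variant indexes past the end of the string, which is the same kind of failure as the StringIndexOutOfBoundsException in the stack trace.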
