JENA-1890, PR#735

On 25/04/2020 08:34, Lorenz Buehmann wrote:
Hi,

I tried with cURL + riot CLI tools manually and can't reproduce the
parsing issue, neither with RDF/XML nor with Turtle.

The problem is in TDB. The use of \u is not directly part of the problem: the parser step works and the database is loaded correctly.


Encoding of URI terms in TDB1 (not TDB2) was added in JENA-1793/Jena 3.14.0, using "_" as the hex marker: like %XX but written as _XX. It allows illegal URIs (spaces :-() to be handled by the database.
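As a rough illustration of the _XX scheme described above, here is a minimal Python sketch. It is not the Jena code (which is Java). The function name encode_hex and the exact set of escaped characters (the marker itself plus space) are assumptions made for the example. The marker has to be escaped as well, otherwise a literal "_" in a URI could not be told apart from the start of an escape.

```python
MARKER = "_"
# Assumption for this sketch: only the marker itself and the space are escaped.
ESCAPED = {MARKER, " "}


def encode_hex(s, marker=MARKER, escaped=ESCAPED):
    """Replace each escaped character by marker + two hex digits per UTF-8 byte."""
    out = []
    for ch in s:
        if ch in escaped:
            # One _XX group per UTF-8 byte of the character.
            out.extend(f"{marker}{b:02X}" for b in ch.encode("utf-8"))
        else:
            out.append(ch)
    return "".join(out)


print(encode_hex("http://dbpedia.org/resource/User_guide"))
print(encode_hex("http://example/a b"))
```

Under these assumptions every "_" in a stored URI becomes "_5F", so URIs such as .../User_guide always go through the decoder when read back.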

The decoder is also more general - it can decode multibyte codepoints written as %xx%xx but (bug) it gets bytes and chars mixed up at one point.

When all the characters before the _ are single-byte in UTF-8 it works, but "사용_" has multi-byte characters before the _. The decoder then indexes the string with a byte offset, which can be past the end of the string.
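The bytes/chars mix-up can be sketched in Python as follows. This is a hypothetical reconstruction of the class of bug, not the actual StrUtils.decodeHex source. The names decode_hex_buggy and decode_hex_fixed are invented for the example, and it assumes that a literal "_" is stored as "_5F".

```python
def decode_hex_buggy(s, marker="_"):
    """Walks the UTF-8 bytes, but reads the hex digits from the string by character index."""
    data = s.encode("utf-8")
    out = bytearray()
    j = 0
    while j < len(data):
        if data[j] != ord(marker):
            out.append(data[j])
            j += 1
            continue
        # BUG: j is a byte offset, but s is indexed by character.
        hi = s[j + 1]
        lo = s[j + 2]
        out.append(int(hi + lo, 16))
        j += 3
    return out.decode("utf-8")


def decode_hex_fixed(s, marker="_"):
    """Same loop, but the hex digits are read from the byte array as well."""
    data = s.encode("utf-8")
    out = bytearray()
    j = 0
    while j < len(data):
        if data[j] != ord(marker):
            out.append(data[j])
            j += 1
            continue
        out.append(int(data[j + 1:j + 3].decode("ascii"), 16))
        j += 3
    return out.decode("utf-8")


ascii_only = "http://dbpedia.org/resource/User_5Fguide"
korean = "http://ko.dbpedia.org/resource/사용_5F설명서"

print(decode_hex_buggy(ascii_only))   # byte and char offsets agree, so this works
print(decode_hex_fixed(korean))
try:
    decode_hex_buggy(korean)
except IndexError as e:
    print("buggy decoder failed:", e)
```

In the Korean example the "_" is at character index 33 but byte offset 37, and the encoded string is 39 characters long, so the second hex-digit lookup goes to index 39. Under the stated assumptions that is the same index as in the "String index out of range: 39" exception quoted further down. The sketch does no validation of malformed escapes.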

    Andy

curl -L -H "Accept: text/turtle" http://dbpedia.org/resource/User_guide > /tmp/test.ttl
curl -L -H "Accept: application/rdf+xml" http://dbpedia.org/resource/User_guide > /tmp/test.rdf


I know that a few years ago DBpedia (or rather its Virtuoso backend) had
some issues with serialization, but this was fixed a long time ago.

Also, I don't understand what you mean by "suspicious"? The parser can
easily convert the UTF-8 encoded URIs as expected:

riot --check /tmp/test.ttl

<http://dbpedia.org/resource/User_guide>
<http://www.w3.org/2002/07/owl#sameAs>
<http://nl.dbpedia.org/resource/Handleiding> .
<http://dbpedia.org/resource/User_guide>
<http://www.w3.org/2002/07/owl#sameAs>
<http://cs.dbpedia.org/resource/Uživatelská_příručka> .
<http://dbpedia.org/resource/User_guide>
<http://www.w3.org/2002/07/owl#sameAs>
<http://wikidata.dbpedia.org/resource/Q1057179> .
<http://dbpedia.org/resource/User_guide>
<http://www.w3.org/2002/07/owl#sameAs>
<http://www.wikidata.org/entity/Q1057179> .
<http://dbpedia.org/resource/User_guide>
<http://www.w3.org/2002/07/owl#sameAs>
<http://ko.dbpedia.org/resource/사용_설명서> .
<http://dbpedia.org/resource/User_guide>
<http://www.w3.org/2002/07/owl#sameAs>
<http://es.dbpedia.org/resource/Guía_del_usuario> .
<http://dbpedia.org/resource/User_guide>
<http://www.w3.org/2002/07/owl#sameAs>
<http://ja.dbpedia.org/resource/マニュアル> .
<http://dbpedia.org/resource/User_guide>
<http://www.w3.org/2002/07/owl#sameAs>
<http://it.dbpedia.org/resource/Manuale> .
<http://dbpedia.org/resource/User_guide>
<http://www.w3.org/2002/07/owl#sameAs>
<http://rdf.freebase.com/ns/m.04mqbf> .
<http://dbpedia.org/resource/User_guide>
<http://www.w3.org/2002/07/owl#sameAs>
<http://fr.dbpedia.org/resource/Mode_d'emploi> .
<http://dbpedia.org/resource/User_guide>
<http://www.w3.org/2002/07/owl#sameAs>
<http://yago-knowledge.org/resource/User_guide> .
<http://dbpedia.org/resource/User_guide>
<http://www.w3.org/2002/07/owl#sameAs>
<http://de.dbpedia.org/resource/Gebrauchsanleitung> .
<http://dbpedia.org/resource/User_guide>
<http://www.w3.org/2002/07/owl#sameAs>
<http://id.dbpedia.org/resource/Manual_pengguna> .
<http://dbpedia.org/resource/User_guide>
<http://www.w3.org/2002/07/owl#sameAs>
<http://dbpedia.org/resource/User_guide> .

On 24.04.20 22:33, Jean-Marc Vanel wrote:
Le ven. 24 avr. 2020 à 22:17, Andy Seaborne <a...@apache.org> a écrit :

On 24/04/2020 15:17, Jean-Marc Vanel wrote:
How to reproduce with 3.14.0

bin/tdbloader --loc TDB --graph=http://dbpedia.org/resource/User_guide \
    --verbose http://dbpedia.org/resource/User_guide
Did the log say anything?

No, nothing special, not even with --debug.

As this is a remote URL, did it all arrive and parse without warnings?
No warning.

Was the database fresh or was there data in it to start with?
database fresh, of course.


echo "
CONSTRUCT {
   <http://dbpedia.org/resource/User_guide>
    ?P ?O . }
WHERE { GRAPH ?G {
   <http://dbpedia.org/resource/User_guide>
    ?P ?O . } }
LIMIT
# 30 # OK
35 # KO !!!
" > /tmp/const.ql

bin/tdbquery --debug --loc=TDB --query /tmp/const.ql

And here is the stack:

16:14:23 ERROR BindingTDB           :: get1(?O)
java.lang.StringIndexOutOfBoundsException: String index out of range: 39
at java.lang.String.charAt(String.java:658)
at org.apache.jena.atlas.lib.StrUtils.decodeHex(StrUtils.java:212)
at org.apache.jena.tdb.store.nodetable.NodecSSE.decode(NodecSSE.java:121)
If the load was clean, the database is intact and it is a decoding bug
in Jena for a URI. The data has a lot of encoded \u terms, but it's a URI
in the object position that is causing the problem. (I don't see why these
are encoded - it's not necessary.)

Indeed, these URIs are suspect:

<http://fr.dbpedia.org/resource/Mode_d\u0027emploi> ,
<http://es.dbpedia.org/resource/Gu\u00EDa_del_usuario> .

<http://ja.dbpedia.org/resource/\u30DE\u30CB\u30E5\u30A2\u30EB> ,
<http://cs.dbpedia.org/resource/U\u017Eivatelsk\u00E1_p\u0159\u00EDru\u010Dka> ,
<http://ko.dbpedia.org/resource/\uC0AC\uC6A9_\uC124\uBA85\uC11C> .
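To check that the \u-escaped forms above denote the same URIs as the UTF-8 forms in the riot output, the escapes can be expanded by hand. A minimal Python sketch, with unescape_u as a made-up helper name; it only handles four-digit \uXXXX escapes, which is enough for these examples, and it is not how riot does it:

```python
import re

# Matches a backslash, 'u', and exactly four hex digits.
_U4 = re.compile(r"\\u([0-9A-Fa-f]{4})")


def unescape_u(iri):
    """Expand \\uXXXX escapes into the characters they denote."""
    return _U4.sub(lambda m: chr(int(m.group(1), 16)), iri)


for escaped in (
    r"http://fr.dbpedia.org/resource/Mode_d\u0027emploi",
    r"http://es.dbpedia.org/resource/Gu\u00EDa_del_usuario",
    r"http://ko.dbpedia.org/resource/\uC0AC\uC6A9_\uC124\uBA85\uC11C",
):
    print(unescape_u(escaped))
```

The expanded strings are ordinary URIs with non-ASCII characters, so the escaping on its own does not make them invalid.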


      Andy

...
at tdb.tdbquery.main(tdbquery.java:33)

NOTE: no problem with apache-jena-3.10.0-SNAPSHOT!?


Jean-Marc Vanel
<http://semantic-forms.cc:9112/display?displayuri=http://jmvanel.free.fr/jmv.rdf%23me>
+33 (0)6 89 16 29 52
Twitter: @jmvanel , @jmvanel_fr ; chat: irc://irc.freenode.net#eulergui
   Chroniques jardin
<http://semantic-forms.cc:1952/history?uri=http%3A%2F%2Fdbpedia.org%2Fresource%2FChronicle>

