Re: Some malediction in http://dbpedia.org/resource/User_guide => StringIndexOutOfBoundsException in TDB

2020-04-25 Thread Lorenz Buehmann
Yeah guys,

sorry, I'm dumb and didn't scroll down far enough to see Andy's last inline
comment referring to TDB w.r.t. the encoding issue.

Anyway, Andy has already spotted the source of the issue, so, as usual, it
will be fixed soon, I think.


On 25.04.20 10:53, Andy Seaborne wrote:
> JENA-1890, PR#735
>
> On 25/04/2020 08:34, Lorenz Buehmann wrote:
>> Hi,
>>
>> I tried with cURL + riot CLI tools manually and can't reproduce the
>> parsing issue, neither with RDF/XML nor with Turtle.
>
> The problem is in TDB. In fact the use of \u is not part of the
> problem directly.  The parser step works and the database is loaded
> correctly.
>
>
> Encoding of URI terms in TDB1 (not TDB2) was added in JENA-1793/Jena 3.14.0,
> using "_" as the hex marker, so it is like %XX but written as _XX. It allows
> illegal URIs (spaces :-() to be handled by the database.
>
> The decoder is also more general - it can decode multibyte codepoints
> written as %xx%xx but (bug) it gets bytes and chars mixed up at one
> point.
>
> When all the characters before the _ are single byte in UTF-8 it works
> but "사용_" has multi-byte characters before the _. The decoder then
> accesses the string and it can be off the end.
>
>     Andy
>
>> curl -L -H "Accept: text/turtle" http://dbpedia.org/resource/User_guide > /tmp/test.ttl
>> curl -L -H "Accept: application/rdf+xml" http://dbpedia.org/resource/User_guide > /tmp/test.rdf
>>
>>
>> I know, that a few years ago DBpedia (resp. its Virtuoso backend) had
>> some issues with serialization, but this has been fixed long time ago.
>>
>> Also, I don't understand what you mean by "suspicious"? The parser can
>> easily convert the UTF-8 encoded URIs as expected:
>>
>> riot --check /tmp/test.ttl
>>
>> [riot output: 14 triples; their URIs did not survive in the archived message]
>>
>> On 24.04.20 22:33, Jean-Marc Vanel wrote:
>>> On Fri, 24 Apr 2020 at 22:17, Andy Seaborne wrote:
>>>
>>>> On 24/04/2020 15:17, Jean-Marc Vanel wrote:
>>>>> How to reproduce with 3.14.0
>>>>>
>>>>> bin/tdbloader --loc TDB --graph=http://dbpedia.org/resource/User_guide \
>>>>>     --verbose http://dbpedia.org/resource/User_guide
>>>> Did the log say anything?
>>>>
>>> NO, nothing special, neither with --debug .
>>>
>>> As this is a remote URL, did it all arrive and parse without warnings?
>>> No warning.
>>>
>>> Was the database fresh or was there data in it to start with?
>>> database fresh, of course.
>>>
>>>
>>>>> echo "
>>>>> CONSTRUCT {
>>>>>
>>>>>     ?P ?O . }
>>>>> WHERE { GRAPH ?G {
>>>>>
>>>>>     ?P ?O . } }
>>>>> LIMIT
>>>>> # 30 # OK
>>>>> 35 # KO !!!
>>>>> " > /tmp/const.ql
>>>>>
>>>>> bin/tdbquery --debug --loc=TDB --query /tmp/const.ql
>>>>>
>>>>> And here is the *stack*:
>>>>>
>>>>> 16:14:23 ERROR BindingTDB   :: get1(?O)
>>>>> java.lang.StringIndexOutOfBoundsException: String index out of range: 39
>>>>> at java.lang.String.charAt(String.java:658)
>>>>> at org.apache.jena.atlas.lib.StrUtils.decodeHex(StrUtils.java:212)
>>>>> at org.apache.jena.tdb.store.nodetable.NodecSSE.decode(NodecSSE.java:121)
>>>>>
>>>> If the load

Re: Some malediction in http://dbpedia.org/resource/User_guide => StringIndexOutOfBoundsException in TDB

2020-04-25 Thread Andy Seaborne

JENA-1890, PR#735

On 25/04/2020 08:34, Lorenz Buehmann wrote:

Hi,

I tried with cURL + riot CLI tools manually and can't reproduce the
parsing issue, neither with RDF/XML nor with Turtle.


The problem is in TDB. In fact the use of \u is not part of the problem 
directly.  The parser step works and the database is loaded correctly.



Encoding of URI terms in TDB1 (not TDB2) was added in JENA-1793/Jena 3.14.0,
using "_" as the hex marker, so it is like %XX but written as _XX. It allows
illegal URIs (spaces :-() to be handled by the database.
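
To make the "_XX" idea concrete, here is a minimal Java sketch. It is not Jena's code: the class and method names are made up for illustration, and the set of escaped characters (just the space and the marker itself) is an assumption chosen so that the encoding stays reversible.

```java
// Illustrative sketch only; names and the escaped character set are assumptions,
// not the Jena implementation.
class UnderscoreHexSketch {
    static final char MARKER = '_';

    // Assumption: escape the space (illegal in a URI) and the marker itself,
    // so that a literal underscore cannot be confused with an escape.
    public static boolean needsEscape(char c) {
        return c == ' ' || c == MARKER;
    }

    // Writes each escaped character as MARKER followed by two hex digits,
    // the same shape as percent-encoding but with '_' in place of '%'.
    public static String encode(String s) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (needsEscape(c)) {
                out.append(MARKER).append(String.format("%02X", (int) c));
            } else {
                out.append(c);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(encode("http://example/a b"));  // http://example/a_20b
        System.out.println(encode("User_guide"));          // User_5Fguide
    }
}
```

Under this assumption every underscore in a stored URI becomes an escape sequence, which is why an ordinary-looking URI such as User_guide exercises the decoder at all.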


The decoder is also more general - it can decode multibyte codepoints 
written as %xx%xx but (bug) it gets bytes and chars mixed up at one point.


When all the characters before the _ are single byte in UTF-8 it works 
but "사용_" has multi-byte characters before the _. The decoder then 
accesses the string and it can be off the end.
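
A small Java sketch of how such a byte/char mix-up can run off the end of a string. This is only an illustration of the failure mode, not the actual StrUtils.decodeHex code; the class, the method names and the stored form "_5F" are assumptions.

```java
import java.nio.charset.StandardCharsets;

// Illustrative sketch only; it is not the Jena decoder.
class ByteCharMixupSketch {
    // Position of the first marker, counted in Java chars (UTF-16 units).
    public static int charIndexOfMarker(String s) {
        return s.indexOf('_');
    }

    // Position of the first marker, counted in UTF-8 bytes.
    public static int byteIndexOfMarker(String s) {
        byte[] utf8 = s.getBytes(StandardCharsets.UTF_8);
        for (int i = 0; i < utf8.length; i++) {
            if (utf8[i] == '_') {
                return i;
            }
        }
        return -1;
    }

    // Deliberately wrong: reads the two hex digits after the marker, but uses
    // the byte offset as if it were an index into the String.
    public static String hexDigitsViaByteOffset(String s) {
        int i = byteIndexOfMarker(s);
        return "" + s.charAt(i + 1) + s.charAt(i + 2);
    }

    public static void main(String[] args) {
        String ascii = "User_5Fguide";
        // The Korean example from the message, written with Unicode escapes
        // and followed by an assumed escape sequence.
        String korean = "\uC0AC\uC6A9_5F";

        // All characters before the marker are single-byte, so both counts agree.
        System.out.println(hexDigitsViaByteOffset(ascii));  // 5F

        // Two 3-byte characters precede the marker: char index 2, byte index 6.
        System.out.println(charIndexOfMarker(korean) + " vs " + byteIndexOfMarker(korean));

        try {
            hexDigitsViaByteOffset(korean);  // charAt(7) on a string of length 5
        } catch (StringIndexOutOfBoundsException e) {
            System.out.println("off the end: " + e.getMessage());
        }
    }
}
```

With only single-byte characters before the marker the two indexes coincide and the bug stays hidden, which matches the observation that most URIs decode fine.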


Andy


curl -L -H "Accept: text/turtle" http://dbpedia.org/resource/User_guide > /tmp/test.ttl
curl -L -H "Accept: application/rdf+xml" http://dbpedia.org/resource/User_guide > /tmp/test.rdf


I know that a few years ago DBpedia (or rather its Virtuoso backend) had
some issues with serialization, but these were fixed a long time ago.

Also, I don't understand what you mean by "suspicious"? The parser can
easily convert the UTF-8 encoded URIs as expected:

riot --check /tmp/test.ttl



[riot output: 14 triples; their URIs did not survive in the archived message]

On 24.04.20 22:33, Jean-Marc Vanel wrote:

On Fri, 24 Apr 2020 at 22:17, Andy Seaborne wrote:


On 24/04/2020 15:17, Jean-Marc Vanel wrote:

How to reproduce with 3.14.0

bin/tdbloader --loc TDB --graph=http://dbpedia.org/resource/User_guide \
    --verbose http://dbpedia.org/resource/User_guide

Did the log say anything?


NO, nothing special, neither with --debug .

As this is a remote URL, did it all arrive and parse without warnings?
No warning.

Was the database fresh or was there data in it to start with?
database fresh, of course.



echo "
CONSTRUCT {
   
?P ?O . }
WHERE { GRAPH ?G {
   
?P ?O . } }
LIMIT
# 30 # OK
35 # KO !!!
" > /tmp/const.ql

bin/tdbquery --debug --loc=TDB --query /tmp/const.ql

And here is the *stack*:

16:14:23 ERROR BindingTDB   :: get1(?O)
java.lang.StringIndexOutOfBoundsException: String index out of range: 39
at java.lang.String.charAt(String.java:658)
at org.apache.jena.atlas.lib.StrUtils.decodeHex(StrUtils.java:212)
at org.apache.jena.tdb.store.nodetable.NodecSSE.decode(NodecSSE.java:121)

If the load was clean, the database is intact and it is a decoding bug
in Jena for a URI. The data has a lot of encoded \u terms, but it's a URI
in the object position that is causing the problem.  (I don't see why these
are encoded - it's not necessary).


Indeed, these URIs are suspect:

 ,
 .

 ,
<http://cs.dbpedia.org/resource/U\u017Eivatelsk\u00E1_p\u0159\u00EDru\u010Dka> ,
 .



  Andy

...

at tdb.tdbquery.main(tdbquery.java:33)

NOTE : no problem with 

Re: Some malediction in http://dbpedia.org/resource/User_guide => StringIndexOutOfBoundsException in TDB

2020-04-25 Thread Jean-Marc Vanel
As was stated by Andy, this is not a parsing issue.
riot is not reporting anything, nor is rapper.
This is an issue with how TDB renders the URI once it has been stored in
TDB.

Jean-Marc Vanel

+33 (0)6 89 16 29 52
Twitter: @jmvanel , @jmvanel_fr ; chat: irc://irc.freenode.net#eulergui
 Chroniques jardin



On Sat, 25 Apr 2020 at 09:34, Lorenz Buehmann <buehm...@informatik.uni-leipzig.de> wrote:

> Hi,
>
> I tried with cURL + riot CLI tools manually and can't reproduce the
> parsing issue, neither with RDF/XML nor with Turtle.
>
> curl -L -H "Accept: text/turtle" http://dbpedia.org/resource/User_guide > /tmp/test.ttl
> curl -L -H "Accept: application/rdf+xml" http://dbpedia.org/resource/User_guide > /tmp/test.rdf
>
>
> I know, that a few years ago DBpedia (resp. its Virtuoso backend) had
> some issues with serialization, but this has been fixed long time ago.
>
> Also, I don't understand what you mean by "suspicious"? The parser can
> easily convert the UTF-8 encoded URIs as expected:
>
> riot --check /tmp/test.ttl
>
> [riot output: 14 triples; their URIs did not survive in the archived message]
>
> On 24.04.20 22:33, Jean-Marc Vanel wrote:
> > On Fri, 24 Apr 2020 at 22:17, Andy Seaborne wrote:
> >
> >> On 24/04/2020 15:17, Jean-Marc Vanel wrote:
> >>> How to reproduce with 3.14.0
> >>>
> >>> bin/tdbloader --loc TDB --graph=http://dbpedia.org/resource/User_guide \
> >>>    --verbose http://dbpedia.org/resource/User_guide
> >> Did the log say anything?
> >>
> > NO, nothing special, neither with --debug .
> >
> > As this is a remote URL, did it all arrive and parse without warnings?
> > No warning.
> >
> > Was the database fresh or was there data in it to start with?
> > database fresh, of course.
> >
> >
> >>> echo "
> >>> CONSTRUCT {
> >>>   
> >>>?P ?O . }
> >>> WHERE { GRAPH ?G {
> >>>   
> >>>?P ?O . } }
> >>> LIMIT
> >>> # 30 # OK
> >>> 35 # KO !!!
> >>> " > /tmp/const.ql
> >>>
> >>> bin/*tdbquery* --debug --loc=TDB --query /tmp/const.ql
> >>>
> >>> And here is the *stack*:
> >>>
> >>> 16:14:23 ERROR BindingTDB   :: get1(?O)
> >>> java.lang.StringIndexOutOfBoundsException: String index out of range: 39
> >>> at java.lang.String.charAt(String.java:658)
> >>> at org.apache.jena.atlas.lib.StrUtils.decodeHex(StrUtils.java:212)
> >>> at org.apache.jena.tdb.store.nodetable.NodecSSE.decode(NodecSSE.java:121)
> >> If the load was clean, the database is intact and it is a decoding bug
> >> in Jena for an URI. The data has a lot of encoded \u terms but its a URI
> >> in the object position causing a problem.  (I don't see why these are
> >> encoded - it's not necessary).
> >>
> > Indeed these URI are suspect:
> >
> >  ,
> >  .
> >
> >  ,
> > <
> >
> 

Re: Some malediction in http://dbpedia.org/resource/User_guide => StringIndexOutOfBoundsException in TDB

2020-04-25 Thread Lorenz Buehmann
Hi,

I tried with cURL + riot CLI tools manually and can't reproduce the
parsing issue, neither with RDF/XML nor with Turtle.

curl -L -H "Accept: text/turtle" http://dbpedia.org/resource/User_guide > /tmp/test.ttl
curl -L -H "Accept: application/rdf+xml" http://dbpedia.org/resource/User_guide > /tmp/test.rdf


I know that a few years ago DBpedia (or rather its Virtuoso backend) had
some issues with serialization, but these were fixed a long time ago.

Also, I don't understand what you mean by "suspicious"? The parser can
easily convert the UTF-8 encoded URIs as expected:

riot --check /tmp/test.ttl



[riot output: 14 triples; their URIs did not survive in the archived message]

On 24.04.20 22:33, Jean-Marc Vanel wrote:
> On Fri, 24 Apr 2020 at 22:17, Andy Seaborne wrote:
>
>> On 24/04/2020 15:17, Jean-Marc Vanel wrote:
>>> How to reproduce with 3.14.0
>>>
>>> bin/tdbloader --loc TDB --graph=http://dbpedia.org/resource/User_guide \
>>>    --verbose http://dbpedia.org/resource/User_guide
>> Did the log say anything?
>>
> NO, nothing special, neither with --debug .
>
> As this is a remote URL, did it all arrive and parse without warnings?
> No warning.
>
> Was the database fresh or was there data in it to start with?
> database fresh, of course.
>
>
>>> echo "
>>> CONSTRUCT {
>>>   
>>>?P ?O . }
>>> WHERE { GRAPH ?G {
>>>   
>>>?P ?O . } }
>>> LIMIT
>>> # 30 # OK
>>> 35 # KO !!!
>>> " > /tmp/const.ql
>>>
>>> bin/*tdbquery* --debug --loc=TDB --query /tmp/const.ql
>>>
>>> And here is the *stack*:
>>>
>>> 16:14:23 ERROR BindingTDB   :: get1(?O)
>>> java.lang.StringIndexOutOfBoundsException: String index out of range: 39
>>> at java.lang.String.charAt(String.java:658)
>>> at org.apache.jena.atlas.lib.StrUtils.decodeHex(StrUtils.java:212)
>>> at org.apache.jena.tdb.store.nodetable.NodecSSE.decode(NodecSSE.java:121)
>> If the load was clean, the database is intact and it is a decoding bug
>> in Jena for an URI. The data has a lot of encoded \u terms but its a URI
>> in the object position causing a problem.  (I don't see why these are
>> encoded - it's not necessary).
>>
> Indeed these URI are suspect:
>
>  ,
>  .
>
>  ,
> <http://cs.dbpedia.org/resource/U\u017Eivatelsk\u00E1_p\u0159\u00EDru\u010Dka> ,
>  .
>
>
>>  Andy
>>
>> ...
>>> at tdb.tdbquery.main(tdbquery.java:33)
>>>
>>> NOTE : no problem with apache-jena-3.10.0-SNAPSHOT !?
>>>
>>>
>>> Jean-Marc Vanel
>>> <
>> http://semantic-forms.cc:9112/display?displayuri=http://jmvanel.free.fr/jmv.rdf%23me
>>> +33 (0)6 89 16 29 52
>>> Twitter: @jmvanel , @jmvanel_fr ; chat: irc://irc.freenode.net#eulergui
>>>   Chroniques jardin
>>> <
>> http://semantic-forms.cc:1952/history?uri=http%3A%2F%2Fdbpedia.org%2Fresource%2FChronicle
>>>



Re: Some malediction in http://dbpedia.org/resource/User_guide => StringIndexOutOfBoundsException in TDB

2020-04-24 Thread Jean-Marc Vanel
On Fri, 24 Apr 2020 at 22:17, Andy Seaborne wrote:

>
> On 24/04/2020 15:17, Jean-Marc Vanel wrote:
> > How to reproduce with 3.14.0
> >
> > bin/tdbloader --loc TDB --graph=http://dbpedia.org/resource/User_guide \
> >    --verbose http://dbpedia.org/resource/User_guide
>
> Did the log say anything?
>

No, nothing special, not even with --debug.

> As this is a remote URL, did it all arrive and parse without warnings?

No warning.

> Was the database fresh or was there data in it to start with?

Database fresh, of course.


> > echo "
> > CONSTRUCT {
> >   
> >?P ?O . }
> > WHERE { GRAPH ?G {
> >   
> >?P ?O . } }
> > LIMIT
> > # 30 # OK
> > 35 # KO !!!
> > " > /tmp/const.ql
> >
> > bin/*tdbquery* --debug --loc=TDB --query /tmp/const.ql
> >
> > And here is the *stack*:
> >
> > 16:14:23 ERROR BindingTDB   :: get1(?O)
> > java.lang.StringIndexOutOfBoundsException: String index out of range: 39
> > at java.lang.String.charAt(String.java:658)
> > at org.apache.jena.atlas.lib.StrUtils.decodeHex(StrUtils.java:212)
> > at org.apache.jena.tdb.store.nodetable.NodecSSE.decode(NodecSSE.java:121)
>
> If the load was clean, the database is intact and it is a decoding bug
> in Jena for an URI. The data has a lot of encoded \u terms but its a URI
> in the object position causing a problem.  (I don't see why these are
> encoded - it's not necessary).
>

Indeed, these URIs are suspect:

 ,
 .

 ,
<http://cs.dbpedia.org/resource/U\u017Eivatelsk\u00E1_p\u0159\u00EDru\u010Dka> ,
 .


>  Andy
>
> ...
> > at tdb.tdbquery.main(tdbquery.java:33)
> >
> > NOTE : no problem with apache-jena-3.10.0-SNAPSHOT !?
> >
> >
> > Jean-Marc Vanel
> > <
> http://semantic-forms.cc:9112/display?displayuri=http://jmvanel.free.fr/jmv.rdf%23me
> >
> > +33 (0)6 89 16 29 52
> > Twitter: @jmvanel , @jmvanel_fr ; chat: irc://irc.freenode.net#eulergui
> >   Chroniques jardin
> > <
> http://semantic-forms.cc:1952/history?uri=http%3A%2F%2Fdbpedia.org%2Fresource%2FChronicle
> >
> >
>


Re: Some malediction in http://dbpedia.org/resource/User_guide => StringIndexOutOfBoundsException in TDB

2020-04-24 Thread Andy Seaborne




On 24/04/2020 15:17, Jean-Marc Vanel wrote:

> How to reproduce with 3.14.0
>
> bin/tdbloader --loc TDB --graph=http://dbpedia.org/resource/User_guide \
>    --verbose http://dbpedia.org/resource/User_guide


Did the log say anything?

As this is a remote URL, did it all arrive and parse without warnings?

Was the database fresh or was there data in it to start with?


> echo "
> CONSTRUCT {
>
>    ?P ?O . }
> WHERE { GRAPH ?G {
>
>    ?P ?O . } }
> LIMIT
> # 30 # OK
> 35 # KO !!!
> " > /tmp/const.ql
>
> bin/tdbquery --debug --loc=TDB --query /tmp/const.ql
>
> And here is the *stack*:
>
> 16:14:23 ERROR BindingTDB   :: get1(?O)
> java.lang.StringIndexOutOfBoundsException: String index out of range: 39
> at java.lang.String.charAt(String.java:658)
> at org.apache.jena.atlas.lib.StrUtils.decodeHex(StrUtils.java:212)
> at org.apache.jena.tdb.store.nodetable.NodecSSE.decode(NodecSSE.java:121)


If the load was clean, the database is intact and it is a decoding bug
in Jena for a URI. The data has a lot of encoded \u terms, but it's a URI
in the object position that is causing the problem.  (I don't see why these
are encoded - it's not necessary).
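
To illustrate why the \u encoding is not necessary: a \uXXXX escape in a Turtle IRI denotes the same character as the unescaped form, so both spellings name the same URI. The following Java sketch resolves the four-hex-digit escapes in one of the URIs from the data. The class and method names are made up for this example, and it handles only the four-digit form.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative sketch only: resolves backslash-u escapes with four hex digits.
class IriEscapeSketch {
    // A literal backslash, the letter u, then exactly four hex digits.
    private static final Pattern U_ESCAPE = Pattern.compile("\\\\u([0-9A-Fa-f]{4})");

    public static String unescape(String s) {
        Matcher m = U_ESCAPE.matcher(s);
        StringBuilder out = new StringBuilder();
        int last = 0;
        while (m.find()) {
            out.append(s, last, m.start());                       // text before the escape
            out.append((char) Integer.parseInt(m.group(1), 16));  // the escaped character
            last = m.end();
        }
        out.append(s, last, s.length());                          // remaining tail
        return out.toString();
    }

    public static void main(String[] args) {
        // The doubled backslashes keep the Java compiler from resolving the
        // escapes itself, so the string holds them as plain text.
        String escaped = "http://cs.dbpedia.org/resource/"
                + "U\\u017Eivatelsk\\u00E1_p\\u0159\\u00EDru\\u010Dka";
        System.out.println(unescape(escaped));
    }
}
```

The decoded local name is the Czech title for "User guide"; the underscore in it is an ordinary character of the URI.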


Andy

...

> at tdb.tdbquery.main(tdbquery.java:33)
>
> NOTE : no problem with apache-jena-3.10.0-SNAPSHOT !?
>
>
> Jean-Marc Vanel
>
> +33 (0)6 89 16 29 52
> Twitter: @jmvanel , @jmvanel_fr ; chat: irc://irc.freenode.net#eulergui
>   Chroniques jardin