Re: Jena riot --validate : I can't spot the errror

Andy Seaborne Mon, 14 Jun 2021 10:56:49 -0700

PS Try Jena 4.1.0.

There were changes relating to:


* Handling of Unicode surrogate pairs

* not being sensitive to the Unicode version supported by the Java runtime platform.

Depending on how you use the data, it may output as one or two U+FFFD ("unmappable") characters but that's an output issue.


    Andy

On 14/06/2021 15:50, Andy Seaborne wrote:

Hi Jerven,
This is java.nio.charset.MalformedInputException wrapped up. Unfortunately, the line/col aren't right because the standard Java decoder does not provide the information. It will be after the reported point in the input so slicing the front off should get you a smaller file to look at, then binary chop to find a slice containing the error.
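As an alternative to slicing and binary chopping by hand, a strict line-by-line decode can report where the bytes stop being valid UTF-8. The following is a minimal Python sketch, not part of Jena; the function name find_bad_utf8 is made up for illustration, and Python's strict decoder may not reject exactly the same input as the Java decoder. It relies on the fact that the byte 0x0A never occurs inside a multi-byte UTF-8 sequence, so splitting on newlines before decoding is safe.

```python
import io


def find_bad_utf8(stream):
    """Yield (line_no, byte_col, byte_offset, bad_bytes) for each line of a
    binary stream that is not valid UTF-8.

    line_no and byte_col are 1-based; byte_offset is 0-based from the start
    of the stream. Only the first problem on each line is reported.
    """
    offset = 0  # byte offset of the start of the current line
    for line_no, raw in enumerate(stream, start=1):
        try:
            raw.decode("utf-8", errors="strict")
        except UnicodeDecodeError as exc:
            # exc.start / exc.end are byte indexes into this line
            yield line_no, exc.start + 1, offset + exc.start, raw[exc.start:exc.end]
        offset += len(raw)


# Small self-contained demonstration on in-memory data.
sample = io.BytesIO(b"ok\n" + "caf\u00e9\n".encode("utf-8") + b"bad \xff here\n")
for hit in find_bad_utf8(sample):
    print(hit)
```

For a real file, pass a handle opened in binary mode, for example open("swisslipids.ttl", "rb"). The byte offset can then be given to a hex dump tool to inspect the surrounding bytes.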
For efficiency reasons (and it does make a notable difference), RIOT grabs large chunks of characters from the UTF-8 => java characters decoder (128K chunks). Down side - encoding errors are reported as "somewhere" and can be anywhere in the chunk.
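To illustrate the trade-off, here is a rough Python sketch of chunked decoding. It is not RIOT's code: the function name first_bad_chunk is hypothetical, and the chunk size here counts bytes rather than decoded characters. Decoding a whole chunk at once is fast, but when it fails the caller only learns which chunk was being processed.

```python
import codecs
import io


def first_bad_chunk(stream, chunk_size=128 * 1024):
    """Return the index of the first chunk whose decoding fails, or None
    if the whole binary stream is valid UTF-8.

    A sequence split across a chunk boundary is buffered by the incremental
    decoder, so its error surfaces while the following chunk is processed.
    """
    decoder = codecs.getincrementaldecoder("utf-8")(errors="strict")
    index = 0
    while True:
        chunk = stream.read(chunk_size)
        try:
            # final=True on the empty read flushes any incomplete sequence
            decoder.decode(chunk, final=not chunk)
        except UnicodeDecodeError:
            return index  # the error is "somewhere" in (or just before) this chunk
        if not chunk:
            return None
        index += 1


data = b"a" * 10 + b"\xff" + b"b" * 10
print(first_bad_chunk(io.BytesIO(data), chunk_size=4))
```

With a chunk size of 4 bytes the bad byte at offset 10 is reported only as "chunk 2", which mirrors the coarse position described above.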
There is a Jena tool "utf8" which more carefully translates UTF-8 - it may help pinpoint the error. It's better in 4.1.0: in 4.0.0 it got left in verbose mode - it needs a small slice of data and it may not find the error helpfully because of the different ways it can fail.
I'd be interested in knowing whether mapping to "unmappable" U+FFFD would help - but it's a silent translation so not a perfect solution.
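A small Python sketch of what such a mapping looks like, and of one way to make it less silent by counting the substitutions. This is an illustration only, not Jena behaviour; the function name decode_with_replacement is hypothetical, and how many U+FFFD a given bad sequence produces can differ between decoders.

```python
REPLACEMENT = "\ufffd"


def decode_with_replacement(raw):
    """Decode bytes as UTF-8, mapping malformed input to U+FFFD.

    Returns (text, introduced), where introduced is the number of U+FFFD
    characters added by the decoder, excluding any U+FFFD that was already
    validly encoded in the input (bytes EF BF BD).
    """
    text = raw.decode("utf-8", errors="replace")
    already_present = raw.count(REPLACEMENT.encode("utf-8"))
    return text, text.count(REPLACEMENT) - already_present


text, introduced = decode_with_replacement(b"SLM:rank \xff SLM:Isomeric_Subspecies ;")
print(repr(text), introduced)
```

A non-zero count at least signals that the data was altered, even though the original bytes are no longer recoverable from the decoded text.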
     Andy

Is this connected with a Q today:
https://stackoverflow.com/questions/67970538/is-it-possible-to-ignore-riotparseexception-in-apache-jena
On 14/06/2021 14:59, jerven Bolleman wrote:
Dear Jena team,

I have a turtle file that fails validation with the following error.

riot --validate swisslipids.ttl
15:19:08 ERROR riot :: [line: 5794892, col: 6 ] Bad character encoding
But I can't spot the error on that line so I did a hexdump.


sed -n '5794892p' swisslipids.ttl | hexdump -C
00000000  20 20 53 4c 4d 3a 72 61  6e 6b 20 53 4c 4d 3a 49  |  SLM:rank SLM:I|
00000010  73 6f 6d 65 72 69 63 5f  53 75 62 73 70 65 63 69  |someric_Subspeci|
00000020  65 73 20 3b 0a                                    |es ;.|
00000025

Which is the same as a different earlier line which passed

sed -n '5794877p' swisslipids.ttl | hexdump -C
00000000  20 20 53 4c 4d 3a 72 61  6e 6b 20 53 4c 4d 3a 49  |  SLM:rank SLM:I|
00000010  73 6f 6d 65 72 69 63 5f  53 75 62 73 70 65 63 69  |someric_Subspeci|
00000020  65 73 20 3b 0a                                    |es ;.|
00000025
The file is unfortunately too large to attach at 29MB of xz compressed data. I would be more than happy to share it or a subset.
Riot version is 4.0.0

Regards,
Jerven

PS nearby lines with their line numbers

5794876 SLM:000501095 a owl:Class ;
5794877   SLM:rank SLM:Isomeric_Subspecies ;
5794878   rdfs:label "(12S)-hydroperoxy-(5Z,8Z,10E,14Z,17Z)-eicosapentaenoate" ;
5794879   skos:altLabel "(12S)-Hp-(5Z,8Z,10E,14Z,17Z)-EPE" ;
5794880   rdfs:subClassOf SLM:000501324 ;
5794881   chebislash:inchi "InChI=1S/C20H30O4/c1-2-3-4-5-10-13-16-19(24-23)17-14-11-8-6-7-9-12-15-18-20(21)22/h3-4,7-11,13-14,17,19,23H,2,5-6,12,15-16,18H2,1H3,(H,21,22)/p-1/b4-3-,9-7-,11-8-,13-10-,17-14+/t19-/m0/s1";
5794882   chebislash:inchikey "HDMYXONNVAOHFR-UOLHMMFFSA-M" ;
5794883   owl:equivalentClass
5794884  CHEBI:90772
5794885  ;
5794886   rdfs:seeAlso lipidmaps:LMFA03070012 ;
5794887   chebislash:charge "-1" ;
5794888   chebislash:smiles '''C(=C\\C\\C=C/C=C/[C@H](C/C=C\\C/C=C\\CC)OO)\\CCCC([O-])=O''' ;
5794889   SLM:citation citation:22984144 ;
5794890   chebislash:formula "C20H29O4" .
5794891 SLM:000501145 a owl:Class ;
5794892   SLM:rank SLM:Isomeric_Subspecies ;
