On 08/12/2024 11:42, zPlus wrote:
> However, just by renaming the file to something else such as "file.json"
> everything works as aspected.
"file.jsonld", not "file.json"
The file name passed in is used for the base URI for the parsing process.
The file name string isn't in Unicode Normalization Form C (NFC).
Εισαγωγή_στον_στοχαστικό_λογισμό_2015_Χελιώτης.jsonld
By chopping it up, I got it down to the eta ή.
I think it is because there are two ways of writing it:
either as a single Unicode codepoint for the character with its accent
(one codepoint, ή), or as the unaccented character η followed by a
combining character for the accent ´, which modifies the preceding η
(two Unicode codepoints).
There is a picture in figure 1 of https://unicode.org/reports/tr15/.
Both ways display the same but they are different Unicode codepoints.
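
To make the difference concrete, here is a small sketch using only the JDK. The class name EtaForms and the helper codepoints() are made up for illustration, and I am assuming the decomposed form in the file name is eta (U+03B7) followed by the combining acute accent (U+0301); I have not checked the actual bytes of the file name.

```java
import java.text.Normalizer;
import java.util.stream.Collectors;

public class EtaForms {
    // Precomposed form: U+03AE GREEK SMALL LETTER ETA WITH TONOS
    public static final String COMPOSED = "\u03AE";
    // Decomposed form (assumed): U+03B7 GREEK SMALL LETTER ETA
    // followed by U+0301 COMBINING ACUTE ACCENT
    public static final String DECOMPOSED = "\u03B7\u0301";

    // Render a string as a space-separated list of U+XXXX codepoints.
    public static String codepoints(String s) {
        return s.codePoints()
                .mapToObj(cp -> String.format("U+%04X", cp))
                .collect(Collectors.joining(" "));
    }

    public static void main(String[] args) {
        System.out.println(codepoints(COMPOSED));
        System.out.println(codepoints(DECOMPOSED));
        // The two strings render alike but are not equal as Java strings.
        System.out.println(COMPOSED.equals(DECOMPOSED));
        // Normalizing the decomposed form to NFC yields the precomposed form.
        String nfc = Normalizer.normalize(DECOMPOSED, Normalizer.Form.NFC);
        System.out.println(nfc.equals(COMPOSED));
    }
}
```

The string comparison is codepoint-by-codepoint, so the two spellings only compare equal after both have been normalized to the same form.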
Most of the parsing work for JSON-LD is done by a separate subsystem
(Titanium JSON-LD), but the base URI is set by Jena, which is why it is
the file name that triggers this.
> I don't know if there is any such requirement in the IRI standard?
See the note in RDF Concepts
https://www.w3.org/TR/rdf11-concepts/#section-IRIs
I think the advice on NFC used to be stronger. The jena-iri code is
quite old.
All - there is a new IRI parser coming along which is more up-to-date
with the URI RFCs, more maintainable and faster. It does not check for NFC.
Should it? That check is another pass over the string (to utilize the
JDK code for NFC checking) and is not zero-cost.
If most other systems don't check for NFC or carefully produce NFC,
there is not so much value in checking.
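
For reference, the extra pass would look roughly like the sketch below. This is not the jena-iri code nor the new parser; the class BaseIriNfcCheck and the method isNfc() are hypothetical names, and the ASCII shortcut is just one way the cost might be reduced for the common case. The only real API used is java.text.Normalizer from the JDK.

```java
import java.text.Normalizer;

public class BaseIriNfcCheck {
    // Hypothetical helper: returns true if the IRI string is in NFC.
    public static boolean isNfc(String iri) {
        // Cheap shortcut: a pure-ASCII string is always in NFC,
        // so the JDK normalization check can be skipped.
        boolean ascii = true;
        for (int i = 0; i < iri.length(); i++) {
            if (iri.charAt(i) >= 0x80) {
                ascii = false;
                break;
            }
        }
        if (ascii)
            return true;
        // Otherwise this is the additional pass over the string.
        return Normalizer.isNormalized(iri, Normalizer.Form.NFC);
    }

    public static void main(String[] args) {
        // Illustrative file names only; the second uses eta + combining accent.
        String plain = "file:///tmp/file.jsonld";
        String decomposed = "file:///tmp/\u03B7\u0301.jsonld";
        System.out.println(plain + " NFC? " + isNfc(plain));
        System.out.println(decomposed + " NFC? " + isNfc(decomposed));
    }
}
```

Even with the shortcut, any IRI containing non-ASCII characters still pays for the full check, which is the cost in question.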
Andy