On 08/12/2024 11:42, zPlus wrote:
> However, just by renaming the file to something else such as "file.json"
> everything works as expected.

"file.jsonld", not "file.json"

The file name passed in is used for the base URI for the parsing process.

The file name string isn't in Unicode Normalization Form C (NFC).

Εισαγωγή_στον_στοχαστικό_λογισμό_2015_Χελιώτης.jsonld

By chopping it up, I got it down to the eta ή.

I think it is because there are two ways of writing it.

It can be a single Unicode codepoint for the character and accent together (one codepoint, ή), or it can be the unaccented character η followed by a combining character for the accent ´ that modifies the preceding η, which makes two Unicode codepoints.

There is a picture in figure 1 of https://unicode.org/reports/tr15/.

Both ways display the same but they are different Unicode codepoints.
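
To make the difference concrete, here is a small sketch in Python, using its standard unicodedata module rather than the JDK code Jena would use. It assumes the precomposed form is U+03AE and the decomposed form is U+03B7 followed by U+0301 (combining acute accent); I have not checked which codepoints are actually in the file name above.

```python
import unicodedata

composed = "\u03ae"          # eta with accent as a single codepoint
decomposed = "\u03b7\u0301"  # plain eta followed by a combining acute accent

# The two strings render alike but do not compare equal.
print(composed == decomposed)
print(len(composed), len(decomposed))

# Normalizing to NFC folds the two-codepoint form into the single codepoint.
print(unicodedata.normalize("NFC", decomposed) == composed)

# is_normalized (Python 3.8+) reports whether a string is already in NFC.
print(unicodedata.is_normalized("NFC", composed))
print(unicodedata.is_normalized("NFC", decomposed))
```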

Most of the parsing work for JSON-LD is done by a separate subsystem (Titanium JSON-LD), but the base URI is set by Jena, which is why it is the file name that triggers this.

> I don't know if there is any such requirement in the IRI standard?

See the note in RDF Concepts
https://www.w3.org/TR/rdf11-concepts/#section-IRIs

I think the advice on NFC used to be stronger. The jena-iri code is quite old.

All - there is a new IRI parser coming along which is more up-to-date with the URI RFCs, more maintainable and faster. It does not check for NFC. Should it? That check is another pass over the string (to make use of the JDK code for NFC checking) and is not zero-cost.

If most other systems don't check for NFC or carefully produce NFC, there is not so much value in checking.
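
For discussion, an optional check could look roughly like the sketch below. It is in Python for brevity and is not the Jena code; the helper name is_nfc_iri and the example IRIs are made up for illustration. The ASCII test is a cheap scan that avoids the normalization pass for the common case, since an ASCII-only string is always in NFC.

```python
import unicodedata

def is_nfc_iri(iri: str) -> bool:
    """Hypothetical helper: report whether an IRI string is already in NFC."""
    # ASCII-only strings are always in NFC, so skip the normalization check.
    if iri.isascii():
        return True
    # Otherwise this is the extra pass over the string.
    return unicodedata.is_normalized("NFC", iri)

print(is_nfc_iri("file:///data/file.jsonld"))
print(is_nfc_iri("file:///data/\u03ae.jsonld"))        # precomposed eta
print(is_nfc_iri("file:///data/\u03b7\u0301.jsonld"))  # eta + combining accent
```

Whether a failed check should be a warning or an error is a separate question; the sketch only reports the result.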

    Andy
