Hi Holger,

How big is the database?
What sort of framing are you aiming to do?
Using framing to select some triples from a large database isn't the way to extract them, as you've discovered. Framing can touch anywhere in the JSON document.

This recent thread is relevant --
https://lists.apache.org/thread/3mrcyf1ccry78rkxxb6vqsm4okfffzfl

That JSON-LD file is 280 million triples.

Its structure is

[{"@context": <url> , ... }
,{"@context": <url> , ... }
,{"@context": <url> , ... }
...
,{"@context": <url> , ... }
]

9 million array entries.

It looks to me like it has been produced by text manipulation: taking each entity, writing a separate, self-contained JSON-LD object, then concatenating them textually into one big array. That, or a tool specially designed to write large JSON-LD, e.g. for the outer array.

It's the same context URL repeated in every entry. Dereferencing it each time would amount to a denial-of-service attack, except that Titanium first reads the whole file as JSON and runs out of space.

The JSON-LD algorithms do assume the whole document is available. Titanium is a faithful implementation of the spec.

It is hard to work with.

In JSON the whole object needs to be seen: repeated member names (where, de facto, the last duplicate wins) and "@context" appearing at the end of an object are both possible. These are cases that don't occur in XML. Streaming JSON or JSON-LD is going to have to relax the strictness somehow.
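For illustration, here is a hypothetical fragment that is legal JSON but that a streaming processor can't interpret until the object is complete: most parsers keep only the second "name", and a conforming JSON-LD processor must apply the "@context" even though it arrives last.

```json
{
  "name": "first value",
  "name": "second value",
  "@context": "https://example.org/context.jsonld"
}
```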

JSON-LD is designed around the assumption of small/medium sized data.

And this affects writing: that large file looks like it was specially written, or at least written with a tool designed for large JSON-LD, e.g. for the outer array.


Jena could do with some RDFFormats + writers for JSON-LD at scale. One obvious one would extend WriterStreamRDFBatched, where a batch is a subject and its immediate triples, then write similarly to the case above except with a single context and the array under "@graph".
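A minimal sketch of that idea, using plain string terms instead of Jena Nodes. Class and method names here are illustrative; a real implementation would extend Jena's WriterStreamRDFBatched, receive each subject's triples as a batch, and use proper JSON string escaping and RDF term serialization.

```java
import java.util.Iterator;
import java.util.List;

public class StreamingJsonLdWriterSketch {

    /** Stand-in for an RDF triple, with all terms as plain strings. */
    record Triple(String subject, String predicate, String object) {}

    /**
     * Write one shared "@context", then one node object per subject inside a
     * top-level "@graph" array. Triples are assumed to arrive grouped by
     * subject (which is what subject-batching provides), so each node object
     * is flushed as soon as the subject changes and memory stays bounded.
     */
    static String write(String contextUrl, Iterator<Triple> triples) {
        StringBuilder out = new StringBuilder();
        out.append("{\"@context\": \"").append(contextUrl).append("\",\n \"@graph\": [\n");
        String current = null;
        while (triples.hasNext()) {
            Triple t = triples.next();
            if (!t.subject().equals(current)) {
                if (current != null) out.append("},\n");   // close previous node object
                out.append("  {\"@id\": \"").append(t.subject()).append("\"");
                current = t.subject();
            }
            out.append(", \"").append(t.predicate())
               .append("\": \"").append(t.object()).append("\"");
        }
        if (current != null) out.append("}");
        out.append("\n ]}\n");
        return out.toString();
    }

    public static void main(String[] args) {
        Iterator<Triple> data = List.of(
            new Triple("http://ex/alice", "name", "Alice"),
            new Triple("http://ex/alice", "knows", "http://ex/bob"),
            new Triple("http://ex/bob", "name", "Bob")).iterator();
        System.out.print(write("https://example.org/context.jsonld", data));
    }
}
```

Because each node object is written and discarded as soon as the subject changes, the writer never holds more than one subject's triples in memory, unlike building a full Titanium RdfDataset first.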

https://www.w3.org/TR/json-ld11/#example-163-same-description-in-json-ld-context-shared-among-node-objects

That doesn't solve the reading side - a companion reader would be needed that stream-reads JSON.
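One way such a reader could start, sketched under stated assumptions: split a huge top-level JSON array into its element texts without parsing the whole file, by tracking string state and brace/bracket depth, then hand each element to an ordinary JSON-LD processor (e.g. Titanium) one at a time. This is illustrative only; a real reader would work over a streaming Reader/InputStream rather than a String, and would handle non-object array elements.

```java
import java.util.ArrayList;
import java.util.List;

public class TopLevelArraySplitter {

    /** Return the text of each object/array element of a top-level JSON array. */
    static List<String> split(String json) {
        List<String> elements = new ArrayList<>();
        int depth = 0, start = -1;
        boolean inString = false, escaped = false;
        for (int i = 0; i < json.length(); i++) {
            char c = json.charAt(i);
            if (inString) {                      // skip everything inside strings
                if (escaped) escaped = false;
                else if (c == '\\') escaped = true;
                else if (c == '"') inString = false;
                continue;
            }
            switch (c) {
                case '"' -> inString = true;
                case '{', '[' -> {
                    if (depth == 1 && start < 0) start = i;  // entering an element
                    depth++;
                }
                case '}', ']' -> {
                    depth--;
                    if (depth == 1 && start >= 0) {          // element closed
                        elements.add(json.substring(start, i + 1));
                        start = -1;
                    }
                }
                default -> {}                    // commas, whitespace, scalars
            }
        }
        return elements;
    }

    public static void main(String[] args) {
        String doc = "[{\"@context\": \"c\", \"a\": 1},\n {\"@context\": \"c\", \"b\": [2, 3]}]";
        for (String element : split(doc))
            System.out.println(element);
    }
}
```

Each extracted element is then small enough to run through the full JSON-LD algorithms on its own, which sidesteps Titanium needing the entire 9-million-entry array in memory.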

Contributions welcome!

    Andy

On 10/07/2024 12:36, Holger Knublauch wrote:
I am working on serializing partial RDF graphs to JSON-LD using the 
Jena-Titanium bridge.

Problem: For Titanium to "see" the triples it needs to have a complete copy. 
See JenaTitanium.convert, which copies all Jena triples into a corresponding RdfDataset. 
This cannot scale if the graph is backed by a database and we only want to export 
certain triples (especially for Framing). Titanium's RdfGraph does not provide an incremental 
function similar to Graph.find() but only returns a complete Java List of all triples.

Has anyone here run into the same problem and what would be a solution?

I guess one solution would be an incremental algorithm that "walks" a @context 
and JSON-LD frame document to collect all required Jena triples, producing a sub-graph 
that can then be sent to Titanium. But the complexity of such an algorithm is similar to 
having to implement my own JSON-LD engine, which feels like overkill.

Holger
