Hi Holger,
How big is the database?
What sort of framing are you aiming to do?
Using framing to select a subset of a large database isn't really a way
to extract triples, as you've discovered. Framing can touch anywhere
in the JSON document.
This recent thread is relevant --
https://lists.apache.org/thread/3mrcyf1ccry78rkxxb6vqsm4okfffzfl
That JSON-LD file is 280 million triples.
Its structure is
[{"@context": <url> , ... }
,{"@context": <url> , ... }
,{"@context": <url> , ... }
...
,{"@context": <url> , ... }
]
9 million array entries.
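That outer-array layout is, at least in principle, streamable: each entry is self-contained, so a reader can decode one entry at a time instead of materialising the whole array. A rough illustration in Python (the helper is hypothetical, not anything Titanium or Jena provides, and real code would read from a file incrementally rather than hold the whole text in memory):

```python
import json

def stream_array_entries(text):
    """Yield the entries of a top-level JSON array one at a time,
    without building the whole array as one in-memory list."""
    decoder = json.JSONDecoder()
    i = text.index('[') + 1
    while True:
        # Skip whitespace and the ',' separators between entries.
        while i < len(text) and text[i] in ' \t\r\n,':
            i += 1
        if i >= len(text) or text[i] == ']':
            return
        entry, i = decoder.raw_decode(text, i)
        yield entry

doc = ('[{"@context": "http://example/ctx", "@id": "a"},\n'
       ' {"@context": "http://example/ctx", "@id": "b"}]')
ids = [e["@id"] for e in stream_array_entries(doc)]
```

Each yielded entry could then be expanded as its own small JSON-LD document.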
It looks to me like it was produced by text manipulation: taking each
entity, writing a separate, self-contained JSON-LD object, then joining
them into one big array as text. That, or a tool designed specially to
write large JSON-LD (e.g. the outer array).
That repeats the same context URL for every entry, which would amount to
a denial-of-service attack on the server hosting the context, except
that Titanium reads the whole file as JSON first and runs out of space.
The JSON-LD algorithms do assume the whole document is available, and
Titanium is a faithful implementation of the spec. That makes large
documents hard to work with.
In JSON, the whole object needs to be seen: repeated member names
(where, de facto, the last duplicate wins) and "@context" appearing at
the end are both possible. These are cases that don't occur in XML.
Streaming JSON or JSON-LD is going to have to relax that strictness somehow.
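For example, with duplicate member names most parsers (Python's json module included) keep the last value, and a streaming processor cannot know the final value until the whole object has been read:

```python
import json

# De facto behaviour of most JSON parsers: the last duplicate
# member wins, so the object is only settled once it is complete.
obj = json.loads('{"a": 1, "a": 2}')
```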
JSON-LD is designed around the assumption of small/medium-sized data,
and this affects writing as well: a file that large has to be specially
written, or produced by a tool designed specifically for writing large
JSON-LD (e.g. the outer array).
Jena could do with some RDFFormats + writers for JSON-LD at scale. One
obvious one extends WriterStreamRDFBatched, where a batch is a subject
and its immediate triples, and writes much like the case above except
with a single shared context and the array under "@graph".
https://www.w3.org/TR/json-ld11/#example-163-same-description-in-json-ld-context-shared-among-node-objects
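As a language-neutral sketch of that output shape (the actual writer would be Java extending WriterStreamRDFBatched; the helper name here is hypothetical): emit one "@context", then stream one node object per subject batch into a "@graph" array:

```python
import io
import json

def write_graph_document(out, context_url, node_objects):
    """Stream-write {"@context": <url>, "@graph": [ ... ]},
    serialising one node object at a time so the whole graph
    never has to be held in memory at once."""
    out.write('{"@context": ' + json.dumps(context_url) + ',\n "@graph": [\n')
    first = True
    for node in node_objects:
        if not first:
            out.write(',\n')
        out.write('  ' + json.dumps(node))
        first = False
    out.write('\n]}\n')

buf = io.StringIO()
write_graph_document(buf, "http://example/ctx",
                     [{"@id": "a", "p": 1}, {"@id": "b", "p": 2}])
doc = json.loads(buf.getvalue())
```

The point of the shape is that each "@graph" entry is independent, so a companion reader could consume the array entry by entry.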
That doesn't solve the reading side - a companion reader would be needed
that stream-reads JSON.
Contributions welcome!
Andy
On 10/07/2024 12:36, Holger Knublauch wrote:
I am working on serializing partial RDF graphs to JSON-LD using the
Jena-Titanium bridge.
Problem: For Titanium to "see" the triples it needs to have a complete copy.
See JenaTitanium.convert, which copies all Jena triples into a corresponding RdfDataset.
This cannot scale if the graph is backed by a database and we only want to export
certain triples (esp. for framing). Titanium's RdfGraph does not provide an incremental
function similar to Graph.find(), but only returns a complete Java List of all triples.
Has anyone here run into the same problem and what would be a solution?
I guess one solution would be an incremental algorithm that "walks" a @context
and JSON-LD frame document to collect all required Jena triples, producing a sub-graph
that can then be sent to Titanium. But the complexity of such an algorithm is similar to
having to implement my own JSON-LD engine, which feels like overkill.
Holger