One million triples in memory isn't very many these days.

Just make sure there aren't a lot of "1 millions" in memory at the same time.

As you describe it, it does sound like the example below. For each "entity", i.e. closure, produce a JSON-LD document, then stream/assemble the documents wrapped in

{ "@context" : URL , "@graph" : [ .... ] }

to avoid repeated "@context"
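
A rough sketch of that assembly, assuming each per-entity object has already been
serialized as a self-contained JSON-LD node object without its own "@context":

    import java.io.IOException;
    import java.io.Writer;

    // Sketch: stream many per-entity JSON-LD objects into one document
    // that shares a single "@context" via "@graph".
    class GraphAssembler {
        static void writeAll(Writer out, Iterable<String> entityObjects, String contextUrl)
                throws IOException {
            out.write("{ \"@context\" : \"" + contextUrl + "\" , \"@graph\" : [\n");
            boolean first = true;
            for (String obj : entityObjects) {
                if (!first)
                    out.write(",\n");
                out.write(obj);      // one self-contained entity object, no "@context"
                first = false;
            }
            out.write("\n] }\n");
        }
    }

Because this is plain text assembly, nothing on the writing side needs to hold the
whole document in memory.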

    Andy

On 11/07/2024 06:03, Holger Knublauch wrote:
Hi Andy,

thanks for your response. To clarify, it would be a scenario such as a TDB with 1 million
triples, where the request is to produce a JSON-LD document from the "closure"
around a given resource (in TopBraid's Source Code panel when the user navigates to a
resource, or through API calls). In other words: the input is a Jena Graph, a start node
and a JSON-LD frame document, and the output should be a JSON-LD document describing the
node and all reachable triples described by the frame.
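
For illustration only, a minimal sketch of that kind of closure extraction over a plain
Jena Graph. It simply follows all URI and blank-node objects from the start node; a real
version would let the frame document decide which edges to follow:

    import java.util.ArrayDeque;
    import java.util.Deque;
    import java.util.HashSet;
    import java.util.Set;

    import org.apache.jena.graph.Graph;
    import org.apache.jena.graph.Node;
    import org.apache.jena.graph.Triple;
    import org.apache.jena.sparql.graph.GraphFactory;

    // Sketch: copy only the triples reachable from a start node into a
    // small in-memory graph that can then be handed to Titanium.
    class Closure {
        static Graph reachable(Graph source, Node start) {
            Graph result = GraphFactory.createDefaultGraph();
            Set<Node> visited = new HashSet<>();
            Deque<Node> queue = new ArrayDeque<>();
            queue.add(start);
            while (!queue.isEmpty()) {
                Node subject = queue.pop();
                if (!visited.add(subject))
                    continue;
                source.find(subject, Node.ANY, Node.ANY).forEachRemaining(t -> {
                    result.add(t);
                    if (t.getObject().isURI() || t.getObject().isBlank())
                        queue.add(t.getObject());
                });
            }
            return result;
        }
    }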

So it sounds like Titanium cannot really be used for this as its algorithms can 
only operate on their own in-memory copy of a graph, and we cannot copy all 1 
million triples into memory each time.

Holger


On 10 Jul 2024, at 5:53 PM, Andy Seaborne <a...@apache.org> wrote:

Hi Holger,

How big is the database?
What sort of framing are you aiming to do?
Using framing to select a subset of a large database doesn't feel like the way to
extract triples, as you've discovered. Framing can touch anywhere in the JSON
document.

This recent thread is relevant --
https://lists.apache.org/thread/3mrcyf1ccry78rkxxb6vqsm4okfffzfl

That JSON-LD file is 280 million triples.

Its structure is

[{"@context": <url> , ... }
,{"@context": <url> , ... }
,{"@context": <url> , ... }
...
,{"@context": <url> , ... }
]

9 million array entries.

It looks to me like it has been produced by text manipulation: taking each
entity, writing a separate, self-contained JSON-LD object and then, by text, making
a big array. That, or a tool designed specially to write large JSON-LD, e.g. the
outer array.

That's the same context URL 9 million times, which would be a denial-of-service
attack on the context server, except that Titanium reads the whole file as JSON
first and runs out of space.

The JSON-LD algorithms do assume the whole document is available. Titanium is a 
faithful implementation of the spec.

A file like that is hard to work with.

In JSON the whole object needs to be seen: repeated member names (where, de facto,
the last duplicate wins) and "@context" appearing at the end of the object are both
possible - cases that don't occur in XML. Streaming JSON or JSON-LD is going to have
to relax the strictness somehow.

JSON-LD is designed around the assumption of small/medium sized data.

And this affects writing: as noted above, that large file looks like it was written
specially, or at least with a tool designed for large JSON-LD output, e.g. the outer
array.


Jena could do with some RDFFormats + writers for JSON-LD at scale. One obvious one
extends WriterStreamRDFBatched, where a batch is a subject and its immediate triples,
then writes similarly to the case above, except with a single context and the array
under "@graph".

https://www.w3.org/TR/json-ld11/#example-163-same-description-in-json-ld-context-shared-among-node-objects
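
A rough sketch of the shape such a writer could take - written here against the plain
StreamRDF contract rather than WriterStreamRDFBatched, assuming triples arrive grouped
by subject, and with the per-subject JSON production very much simplified (no literals,
datatypes, bnodes or lists):

    import java.io.PrintStream;
    import java.util.ArrayList;
    import java.util.List;

    import org.apache.jena.graph.Node;
    import org.apache.jena.graph.Triple;
    import org.apache.jena.riot.system.StreamRDFBase;

    // Sketch: one "@context" up front, then one "@graph" array entry
    // per subject, flushed batch by batch.
    class StreamJsonLDWriter extends StreamRDFBase {
        private final PrintStream out;
        private final String contextUrl;
        private final List<Triple> batch = new ArrayList<>();
        private Node current = null;
        private boolean firstEntry = true;

        StreamJsonLDWriter(PrintStream out, String contextUrl) {
            this.out = out;
            this.contextUrl = contextUrl;
        }

        @Override public void start() {
            out.printf("{ \"@context\" : \"%s\" , \"@graph\" : [%n", contextUrl);
        }

        @Override public void triple(Triple triple) {
            if (current != null && !current.equals(triple.getSubject()))
                flushBatch();
            current = triple.getSubject();
            batch.add(triple);
        }

        @Override public void finish() {
            flushBatch();
            out.println("] }");
        }

        private void flushBatch() {
            if (batch.isEmpty())
                return;
            if (!firstEntry)
                out.println(",");
            firstEntry = false;
            // Deliberately naive: assumes URI subjects and objects only.
            out.printf("{ \"@id\" : \"%s\"", current.getURI());
            for (Triple t : batch)
                out.printf(" , \"%s\" : { \"@id\" : \"%s\" }",
                           t.getPredicate().getURI(), t.getObject().getURI());
            out.print(" }");
            batch.clear();
            current = null;
        }
    }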

That doesn't solve the reading side - a companion reader would be needed that 
stream-reads JSON.
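
As a starting point, a minimal sketch of such a reader for the outer-array shape above,
using the jakarta.json streaming parser to hand one array element at a time to a
callback:

    import java.io.Reader;

    import jakarta.json.Json;
    import jakarta.json.JsonObject;
    import jakarta.json.stream.JsonParser;
    import jakarta.json.stream.JsonParser.Event;

    // Sketch: stream-read a huge top-level JSON array without ever
    // materializing the whole document; only one element is in memory
    // at a time.
    class ArrayStreamReader {
        interface Sink { void accept(JsonObject element); }

        static void read(Reader in, Sink sink) {
            try (JsonParser parser = Json.createParser(in)) {
                if (!parser.hasNext() || parser.next() != Event.START_ARRAY)
                    throw new IllegalArgumentException("expected a top-level array");
                while (parser.hasNext()) {
                    Event ev = parser.next();
                    if (ev == Event.END_ARRAY)
                        break;
                    if (ev == Event.START_OBJECT)
                        sink.accept(parser.getObject());  // consumes up to END_OBJECT
                }
            }
        }
    }

Each element could then be processed as a small, self-contained JSON-LD document.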

Contributions welcome!

    Andy

On 10/07/2024 12:36, Holger Knublauch wrote:
I am working on serializing partial RDF graphs to JSON-LD using the
Jena-Titanium bridge.
Problem: for Titanium to "see" the triples, it needs a complete copy.
See JenaTitanium.convert, which copies all Jena triples into a corresponding RdfDataset.
This cannot scale if the graph is backed by a database and we only want to export
certain triples (especially for framing). Titanium's RdfGraph does not provide an
incremental function similar to Graph.find() but only returns a complete Java List
of all triples.
Has anyone here run into the same problem and what would be a solution?
I guess one solution would be an incremental algorithm that "walks" a @context
and JSON-LD frame document to collect all required Jena triples, producing a sub-graph
that can then be sent to Titanium. But the complexity of such an algorithm is close to
implementing my own JSON-LD engine, which feels like overkill.
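
To illustrate the sub-graph route, a sketch of converting only a pre-extracted
sub-graph instead of the whole database-backed graph, assuming the
JenaTitanium.convert(DatasetGraph) entry point mentioned above:

    import org.apache.jena.graph.Graph;
    import org.apache.jena.query.DatasetFactory;
    import org.apache.jena.rdf.model.ModelFactory;
    import org.apache.jena.riot.system.JenaTitanium;

    import com.apicatalog.rdf.RdfDataset;

    // Sketch: only the extracted sub-graph is copied into Titanium's
    // in-memory RdfDataset, not the full database.
    class ToTitanium {
        static RdfDataset convertSubgraph(Graph subgraph) {
            return JenaTitanium.convert(
                DatasetFactory.wrap(ModelFactory.createModelForGraph(subgraph))
                              .asDatasetGraph());
        }
    }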
Holger
