On 05/07/2024 16:16, Andy Seaborne wrote:

The file authorities-gnd_entityfacts.jsonld.gz has a structure you can exploit.

Each line is an entry in a JSON array and it is complete - it has the @context on each line.

There are 9 million lines.

A possibility is to split the file on newlines and
clean up each line:
* Drop the last line (the "]").
* Remove the first character of each line (which is ",", or "[" on the first line).

Parse each line.

Each element of the array starts

{"@context":"https://hub.culturegraph.org/entityfacts/context/v1/entityfacts.jsonld"

so there are 9 million @context URLs and the parser is going to do a network call for each. On the whole file, that's days and that's if the far end does not decide it is a denial of service attack!
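
As a rough check on "days", here is a back-of-envelope sketch. It assumes one fetch per line, done one after another, at an illustrative 50 ms per fetch (an assumed figure, not a measurement). The function name fetch_days is made up for this example.

```shell
# Estimate whole days spent on sequential @context fetches.
# $1 = number of lines (one fetch per line)
# $2 = assumed milliseconds per fetch (illustrative, not measured)
fetch_days() {
    echo $(( $1 * $2 / 1000 / 86400 ))
}

fetch_days 9000000 50    # prints 5
```

Slower fetches scale this linearly, so 100 ms per fetch gives about 10 days.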

(Confirmed by watching the network on an extract: Titanium isn't caching, at least in the way Jena sets it up. It may be possible to get Titanium to cache, but the file size limit is still a problem.)

Split the file into chunks - say 100k lines per file, 90-odd files - then
fix up each file so it is legal JSON-LD with one overall @context at the start.

Run riot on all the files.

riot will parse one file, writing N-Triples, and then do the next, so the maximum RAM needed is that for one 100k-line chunk.
This ran with the default JVM size (16G on my desktop machine).

I got a few transient network errors fetching the context. So it would be better to parse each file to its own N-Triples file, making it easy to redo a single chunk.
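
A per-chunk run could be sketched roughly as below. This is only a sketch under assumptions: riot is on the PATH, it writes N-Triples to stdout, and it exits non-zero when a chunk fails. The function names, the ".nt" output naming and the retry count of 3 are all made up for illustration.

```shell
# Parse one chunk to its own N-Triples file, retrying on failure.
# $1 = chunk file, $2 = max attempts (hypothetical default: 3)
parse_chunk() {
    pc_chunk=$1
    pc_out="${pc_chunk}.nt"
    pc_max=${2:-3}
    pc_n=1
    while [ "$pc_n" -le "$pc_max" ]; do
        # Write to a temporary name so a failed run never leaves a partial .nt
        if riot --syntax jsonld "$pc_chunk" > "${pc_out}.tmp"; then
            mv "${pc_out}.tmp" "$pc_out"
            return 0
        fi
        echo "attempt $pc_n failed for $pc_chunk" >&2
        pc_n=$((pc_n + 1))
    done
    rm -f "${pc_out}.tmp"
    return 1
}

# Parse every chunk, skipping those that already have a non-empty .nt file.
parse_all() {
    for pa_chunk in x??-ld; do
        [ -e "$pa_chunk" ] || continue
        [ -s "${pa_chunk}.nt" ] && continue
        parse_chunk "$pa_chunk" || echo "FAILED: $pa_chunk" >&2
    done
}
```

Calling parse_all again after a failure only redoes the chunks that did not complete.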

There are a few bad URIs in the data, which cause Titanium to skip bits.

About 280 million triples.

    Andy

## In a directory "Files":
split -l 100000  ... the downloaded, uncompressed file.

for X in x??
do
    echo "== $X"
    (
        # One object, one context, @graph array
        cat header
        # Convert start array to ",", remove @context and trailing array
        sed -e 's/^\[/,/' \
            -e 's!"@context":"https://hub.culturegraph.org/entityfacts/context/v1/entityfacts.jsonld",!!' \
            -e '/^\]$/d' \
            "$X"
        cat footer
    ) > "${X}-ld"
done

to give files "xaa-ld" etc.

(I avoided the .jsonld extension because it triggers editor help, and the files are so big that this is really slow.)

riot --syntax jsonld x??-ld

The parsing step took 20 minutes.


"header" is
---
{ "@context":"https://hub.culturegraph.org/entityfacts/context/v1/entityfacts.jsonld",
  "@graph" : [
  {}
---
The {} is there because the first data line starts with ",".

"footer" is
---
]}
---
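
To see what the loop above produces, here is a small self-contained sketch on a made-up three-line chunk. The two records and their example.org @id values are invented for illustration (the real records are much larger), and the names CTX, DEMO and wrap_chunk are hypothetical. It uses "|" as the sed delimiter instead of "!" so it can be pasted into an interactive shell.

```shell
# Demo on a tiny, made-up chunk, run in a scratch directory.
CTX='https://hub.culturegraph.org/entityfacts/context/v1/entityfacts.jsonld'
DEMO=$(mktemp -d)

# header and footer files as described above
printf '{ "@context":"%s",\n  "@graph" : [\n  {}\n' "$CTX" > "$DEMO/header"
printf ']}\n' > "$DEMO/footer"

# Sample first chunk: "[" on the first line, "," on later lines, "]" last.
printf '[{"@context":"%s","@id":"http://example.org/1"}\n' "$CTX" >  "$DEMO/xaa"
printf ',{"@context":"%s","@id":"http://example.org/2"}\n' "$CTX" >> "$DEMO/xaa"
printf ']\n' >> "$DEMO/xaa"

# $1 = chunk file, $2 = header file, $3 = footer file
wrap_chunk() {
    cat "$2"
    # Turn the leading "[" into ",", drop the per-line @context, drop the "]" line
    sed -e 's/^\[/,/' \
        -e "s|\"@context\":\"$CTX\",||" \
        -e '/^\]$/d' \
        "$1"
    cat "$3"
}

wrap_chunk "$DEMO/xaa" "$DEMO/header" "$DEMO/footer" > "$DEMO/xaa-ld"
cat "$DEMO/xaa-ld"
```

The output is one JSON object with a single @context, and the two records appear as ",{...}" entries after the {} in the @graph array.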
