On 05/07/2024 16:16, Andy Seaborne wrote:
The file authorities-gnd_entityfacts.jsonld.gz has a structure you can
exploit.
Each line is an entry in a JSON array and it is complete - it has the
@context on each line.
There are 9 million lines.
A possibility is to split the file on newlines, then
clean up each line:
* Drop the last line (the "]").
* Remove the first character of each line (which is ",", or "[" on
the first line).
Then parse each line.
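
The clean-up step could be sketched roughly like this. It is only an
illustration: "clean_lines" is a made-up helper name, and it assumes the
layout described above (a "[" or "," as the first character of each
record, and a final line holding just "]").

```shell
# Hypothetical helper: turn the one-record-per-line JSON array into
# one standalone JSON object per line. Reads stdin, writes stdout.
clean_lines() {
    # 1. drop the closing "]" line
    # 2. strip the leading "," (or "[" on the first line)
    sed -e '/^\]$/d' -e 's/^[,[]//'
}
```

It would be used along the lines of

    gzip -dc authorities-gnd_entityfacts.jsonld.gz | clean_lines > one-per-line

where "one-per-line" is just an example output name. Each output line
still carries its own @context.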
Each element of the array starts
{"@context":"https://hub.culturegraph.org/entityfacts/context/v1/entityfacts.jsonld"
so there are 9 million @context URLs and the parser is going to do a
network call for each. On the whole file, that's days and that's if the
far end does not decide it is a denial of service attack!
(Confirmed by watching the network on an extract: Titanium isn't
caching, at least in the way Jena sets it up. It may be possible to get
Titanium to cache, but the file size limit is still a problem.)
Instead, split the file into chunks - say 100k lines per file, 90 odd
files - then fix up each file so it is legal JSON-LD with one overall
@context at the start of each file.
Run riot on all the files.
riot will parse one file, writing N-Triples, and then do the next, so
the maximum RAM needed is for one 100k chunk.
This was with the default JVM size (16G on my desktop machine).
I got a few transient network errors fetching the context, so it would
be better to parse each file to its own N-Triples file; then it is easy
to redo a chunk file.
A few bad URIs in the data cause Titanium to skip bits.
About 280 million triples.
Andy
## In a directory Files:
split -l 100000 ...    # "..." is the downloaded, uncompressed file
for X in x??
do
    echo "== $X"
    (
        # One object, one context, @graph array
        cat header
        # Convert the array start "[" to ",", remove the per-line
        # @context, and delete the trailing "]" line
        sed -e 's/^\[/,/' \
            -e 's!"@context":"https://hub.culturegraph.org/entityfacts/context/v1/entityfacts.jsonld",!!' \
            -e '/^\]$/d' \
            "$X"
        cat footer
    ) > "${X}-ld"
done
This gives files "xaa-ld" etc.
(I avoided the .jsonld extension because it triggers editor help, and
the files are so big that this is really slow.)
riot --syntax jsonld x??-ld
The parser step took 20 minutes.
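
To make it easy to redo a single chunk after a transient network error,
each chunk can be parsed to its own N-Triples file. A rough sketch, not
something I ran in this form: "parse_chunks", the ".nt" suffix and
"failed-chunks.txt" are made-up names, and it assumes riot exits
non-zero when a parse fails.

```shell
# Hypothetical helper: parse each chunk to its own N-Triples file and
# record the chunks that failed so they can be re-run later.
# Assumption: riot returns a non-zero exit status on a failed parse.
parse_chunks() {
    : > failed-chunks.txt            # start with an empty failure list
    for f in "$@"
    do
        if riot --syntax jsonld "$f" > "$f.nt"
        then
            echo "ok   $f"
        else
            echo "FAIL $f"
            echo "$f" >> failed-chunks.txt
            rm -f "$f.nt"            # do not keep partial output
        fi
    done
}
```

Call it as

    parse_chunks x??-ld

and afterwards re-run only the files listed in failed-chunks.txt.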
"header" is
---
{
"@context":"https://hub.culturegraph.org/entityfacts/context/v1/entityfacts.jsonld",
"@graph" : [
{}
---
The {} is there because the first data line then starts with ",".
"footer" is
---
]}
---
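
For trying the fixup on a small extract before running it over all the
chunks, something like the following could be used. It writes the same
"header" and "footer" files as above; "fixup_chunk" is a made-up
wrapper name around the same sed expressions.

```shell
# Write the two wrapper files described above.
cat > header <<'EOF'
{
"@context":"https://hub.culturegraph.org/entityfacts/context/v1/entityfacts.jsonld",
"@graph" : [
{}
EOF
printf ']}\n' > footer

# Hypothetical wrapper: read one chunk file ($1), write one JSON-LD
# document with a single @context and a @graph array to stdout.
fixup_chunk() {
    cat header
    sed -e 's/^\[/,/' \
        -e 's!"@context":"https://hub.culturegraph.org/entityfacts/context/v1/entityfacts.jsonld",!!' \
        -e '/^\]$/d' \
        "$1"
    cat footer
}
```

Running "fixup_chunk" on a two or three line extract and looking at the
output by eye is a quick check that the commas and brackets line up.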