Thank you again!

Rereading your answers, I checked on the utilities xargs and riot, which I had never used before. Then I understood your approach (thank you for including the command line!) and followed it. It indeed produces lots of warnings, and I also had a hard error in the riot output, which I could fix with rapper. Then it loaded...
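(In case it helps anyone else: the fix was a rapper round-trip over the file riot choked on, roughly along these lines - the file name is only illustrative:

  # re-parse the offending file with rapper (Raptor) and emit N-Triples
  # (file name is illustrative)
  rapper -i rdfxml -o ntriples broken-record.rdf >> data.nt

rapper re-parses the RDF/XML and re-serializes it as N-Triples, which got past the construct riot rejected.)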

Still: why would Project Gutenberg select such a format?

Andrew





On 10/07/2017 12:52 PM, Andy Seaborne wrote:


On 07/10/17 17:06, Andrew U. Frank wrote:
Thank you - your link explains why the solution of calling s-put for each individual file is so slow.

Practically, I will just wait out the 10 hours and then extract the triples from the store.

I admire your patience!

I've just downloaded the RDF, converted it to N-Triples and loaded it into TDB. 55,688 files converted to N-Triples: 7,949,706 triples.

# time the conversion: batch all .rdf files through riot, appending N-Triples
date ; ( find . -name \*.rdf | xargs riot ) >> data.nt ; date

(Load time was 83s / disk is an SSD)
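For reference, the TDB bulk load of the combined file is a one-liner along these lines (the database directory "DB" is just an example):

  # bulk-load the concatenated N-Triples into an empty TDB database
  tdbloader --loc=DB data.nt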

Then I loaded it into Fuseki, into a different, empty database, and it took ~82 seconds (Java had already started).
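A single upload of the combined file works the same way; with the SOH tools that is roughly (the dataset name "ds" is an example):

  # one HTTP PUT of the whole file into the default graph
  s-put http://localhost:3030/ds/data default data.nt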

There are a few RDF warnings:

It uses mixed case host names sometimes:
  http://fr.Wikipedia.org

Some literals are in non-canonical UTF-8:
  "String not in Unicode Normal Form C"

Doesn't stop the process - they are only warnings.

    Andy

Can you understand why somebody would select this format? What is the advantage?

Andrew



On 10/07/2017 10:52 AM, zPlus wrote:
Hello Andrew,

If I understand this correctly, I think I stumbled on the same problem
before. Concatenating XML files will indeed not work. My solution was
to convert all the XML files to N-Triples, then concatenate all those
triples into a single file, and finally load only this file.
Ultimately, what I ended up with is this loop [1]. The idea is to call
RIOT with a list of files as input, instead of calling RIOT on every
single file.
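In sketch form, assuming the files sit under rdf-files/ (the real script is at [1]):

  # one batched riot invocation over the whole file list,
  # writing a single N-Triples file to load afterwards
  find rdf-files/ -name '*.rdf' | xargs riot --output=ntriples > all.nt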

I hope this helps.

----
[1] https://notabug.org/metadb/pipeline/src/master/build.sh#L54

----- Original Message -----
From: [email protected]
To:"[email protected]" <[email protected]>
Cc:
Sent:Sat, 7 Oct 2017 10:17:18 -0400
Subject:loading many small rdf/xml files

  I have to load the Project Gutenberg catalog in RDF/XML format. This
  is a collection of about 50,000 files, each containing a single record
  as attached.

  If I try to concatenate these files into a single one, the result is
  not legal RDF/XML - there are XML doc headers:

  <rdf:RDF xml:base="http://www.gutenberg.org/">

  and similar, which can only occur once per file.

  I found a way to load each file individually with s-put and a loop,
  but this runs extremely slowly - it has already been running for more
  than 10 hours; each file takes half a second to load (Fuseki running
  on localhost).
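  The loop is essentially one s-put call, i.e. one HTTP request, per
  file - something along these lines (the file layout is illustrative):

  # one HTTP request per file; at half a second each, hours for 50,000 files
  for f in rdf-files/*.rdf ; do
      s-put http://localhost:3030/ds/data default "$f"
  done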

  I am sure there is a better way?

  Thank you for the help!

  Andrew

  --
  em.o.Univ.Prof. Dr. sc.techn. Dr. h.c. Andrew U. Frank
                                   +43 1 58801 12710 direct
  Geoinformation, TU Wien          +43 1 58801 12700 office
  Gusshausstr. 27-29               +43 1 55801 12799 fax
  1040 Wien Austria                +43 676 419 25 72 mobil





--
em.o.Univ.Prof. Dr. sc.techn. Dr. h.c. Andrew U. Frank
                                 +43 1 58801 12710 direct
Geoinformation, TU Wien          +43 1 58801 12700 office
Gusshausstr. 27-29               +43 1 55801 12799 fax
1040 Wien Austria                +43 676 419 25 72 mobil
