RDF/XML was the first RDF syntax.

On Sat, 7 Oct 2017 at 20.27, Andrew U. Frank <[email protected]> wrote:
> thank you again!
>
> rereading your answers, i checked on the utilities xargs and riot, which
> i had not ever used before. then i understood your approach (thank you
> for putting the command line in!) and followed your approach. it indeed
> produces lots of warnings and i had also a hard error in the riot
> output, which i could fix with rapper. then it loaded....
>
> still: why would project gutenberg select such a format?
>
> andrew
>
> On 10/07/2017 12:52 PM, Andy Seaborne wrote:
> >
> > On 07/10/17 17:06, Andrew U. Frank wrote:
> >> thank you - your link indicates why the solution with calling s-put
> >> for each individual file is so slow.
> >>
> >> practically - i will just wait the 10 hours and then extract the
> >> triples from the store.
> >
> > I admire your patience!
> >
> > I've just downloaded the RDF, converted it to N-triples and loaded it
> > into TDB. 55688 files converted to N-triples : 7,949,706 triples.
> >
> > date ; ( find . -name \*.rdf | xargs riot ) >> data.nt ; date
> >
> > (Load time was 83s / disk is an SSD)
> >
> > Then I loaded it into Fuseki into a different, empty database and it
> > took ~82 seconds (java had already started).
> >
> > There are a few RDF warnings:
> >
> > It uses mixed case host names sometimes:
> > http://fr.Wikipedia.org
> >
> > Some literals are in non-canonical UTF-8:
> > "String not in Unicode Normal Form C"
> >
> > Doesn't stop the process - they are only warnings.
> >
> > Andy
> >
> >> can you understand, why somebody would select this format? what is
> >> the advantage?
> >>
> >> andrew
> >>
> >> On 10/07/2017 10:52 AM, zPlus wrote:
> >>> Hello Andrew,
> >>>
> >>> if I understand this correctly, I think I stumbled on the same problem
> >>> before. Concatenating XML files will not work indeed. My solution was
> >>> to convert all XML files to N-Triples, then concatenate all those
> >>> triples into a single file, and finally load only this file.
> >>> Ultimately, what I ended up with is this loop [1]. The idea is to call
> >>> RIOT with a list of files as input, instead of calling RIOT on every
> >>> file.
> >>>
> >>> I hope this helps.
> >>>
> >>> ----
> >>> [1] https://notabug.org/metadb/pipeline/src/master/build.sh#L54
> >>>
> >>> ----- Original Message -----
> >>> From: [email protected]
> >>> To: "[email protected]" <[email protected]>
> >>> Cc:
> >>> Sent: Sat, 7 Oct 2017 10:17:18 -0400
> >>> Subject: loading many small rdf/xml files
> >>>
> >>> i have to load the Gutenberg projects catalog in rdf/xml format. this
> >>> is a collection of about 50,000 files, each containing a single
> >>> record as attached.
> >>>
> >>> if i try to concatenate these files into a single one the result is
> >>> not legal rdf/xml - there are xml doc headers:
> >>>
> >>> <rdf:RDF xml:base="http://www.gutenberg.org/">
> >>>
> >>> and similar, which can only occur once per file.
> >>>
> >>> i found a way to load each file individually with s-put and a loop,
> >>> but this runs extremely slowly - it is already running for more than
> >>> 10 hours; each file takes half a second to load (fuseki running as
> >>> localhost).
> >>>
> >>> i am sure there is a better way?
> >>>
> >>> thank you for the help!
> >>>
> >>> andrew
> >>>
> >>> --
> >>> em.o.Univ.Prof. Dr. sc.techn. Dr. h.c. Andrew U. Frank
> >>>                                 +43 1 58801 12710 direct
> >>> Geoinformation, TU Wien          +43 1 58801 12700 office
> >>> Gusshausstr. 27-29               +43 1 55801 12799 fax
> >>> 1040 Wien Austria                +43 676 419 25 72 mobil
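
For anyone finding this thread later, the batch approach quoted above (hand riot many files per invocation, append everything to one N-Triples file, then load that single file) can be sketched roughly as below. This is only a sketch under assumptions: the function name convert_all and the RIOT_CMD override are made up for illustration and are not part of Jena. The -print0 / -0 pair is an addition to the quoted command so that file names containing spaces survive, and it assumes a find/xargs that supports those options (GNU and BSD do).

```shell
#!/bin/sh
# Sketch: convert many small RDF/XML files into one N-Triples file in a
# single batch, instead of starting one process (or one HTTP request) per file.
# RIOT_CMD is a hypothetical override introduced for this sketch; it defaults
# to Jena's `riot`, the tool used in the thread.
convert_all() {
    src_dir=$1       # directory tree holding the *.rdf files
    out_file=$2      # combined output file, e.g. data.nt
    : > "$out_file"  # start from an empty output file
    # xargs passes many file names to each invocation, so the converter
    # (and its JVM) is started only a handful of times.
    find "$src_dir" -name '*.rdf' -print0 |
        xargs -0 "${RIOT_CMD:-riot}" >> "$out_file"
}
```

Something like `convert_all . data.nt` would then produce the single file to load in one go, as was done with TDB and Fuseki in the quoted message. Parser warnings go to stderr and do not end up in the output file.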
