Re: loading many small rdf/xml files

Martynas Jusevičius Sat, 07 Oct 2017 11:32:36 -0700

RDF/XML was the first RDF syntax.

On Sat, 7 Oct 2017 at 20.27, Andrew U. Frank <[email protected]>
wrote:


> thank you again!
>
> rereading your answers, i checked on the utilities xargs and riot, which
> i had not ever used before. then i understood your approach (thank you
> for putting the command line in!) and followed your approach. it indeed
> produces lots of warnings and i had also a hard error in the riot
> output, which i could fix with rapper. then it loaded....
>
> still: why would project gutenberg select such a format?
>
> andrew
>
>
>
>
>
> On 10/07/2017 12:52 PM, Andy Seaborne wrote:
> >
> >
> > On 07/10/17 17:06, Andrew U. Frank wrote:
> >> thank you - your link indicates why the solution with calling s-put
> >> for each individual file is so slow.
> >>
> >> practically - i will just wait the 10 hours and then extract the
> >> triples from the store.
> >
> > I admire your patience!
> >
> > I've just downloaded the RDF, converted it to N-triples and loaded it
> > into TDB. 55688 files converted to N-triples : 7,949,706 triples.
> >
> > date ; ( find . -name \*.rdf | xargs riot ) >> data.nt ; date
> >
> > (Load time was 83s / disk is an SSD)
> >
> > Then I loaded it into Fuseki into a different, empty database and it
> > took ~82 seconds (java had already started).
> >
> > There are a few RDF warnings:
> >
> > It uses mixed case host names sometimes:
> >   http://fr.Wikipedia.org
> >
> > Some literals are in non-canonical UTF-8:
> >   "String not in Unicode Normal Form C"
> >
> > Doesn't stop the process - they are only warnings.
> >
> >     Andy
> >
> >> can you understand, why somebody would select this format? what is
> >> the advantage?
> >>
> >> andrew
> >>
> >>
> >>
> >> On 10/07/2017 10:52 AM, zPlus wrote:
> >>> Hello Andrew,
> >>>
> >>> if I understand this correctly, I think I stumbled on the same problem
> >>> before. Concatenating XML files will not work indeed. My solution was
> >>> to convert all XML files to N-Triples, then concatenate all those
> >>> triples into a single file, and finally load only this file.
> >>> Ultimately, what I ended up with is this loop [1]. The idea is to call
> >>> RIOT with a list of files as input, instead of calling RIOT on every
> >>> file.
> >>>
> >>> I hope this helps.
> >>>
> >>> ----
> >>> [1] https://notabug.org/metadb/pipeline/src/master/build.sh#L54
> >>>
> >>> ----- Original Message -----
> >>> From: [email protected]
> >>> To:"[email protected]" <[email protected]>
> >>> Cc:
> >>> Sent:Sat, 7 Oct 2017 10:17:18 -0400
> >>> Subject:loading many small rdf/xml files
> >>>
> >>>   i have to load the Gutenberg projects catalog in rdf/xml format. this is
> >>>   a collection of about 50,000 files, each containing a single record as
> >>>   attached.
> >>>
> >>>   if i try to concatenate these files into a single one the result is not
> >>>   legal rdf/xml - there are xml doc headers:
> >>>
> >>>   <rdf:RDF xml:base="http://www.gutenberg.org/">
> >>>
> >>>   and similar, which can only occur once per file.
> >>>
> >>>   i found a way to load each file individually with s-put and a loop, but
> >>>   this runs extremely slowly - it has already been running for more than 10
> >>>   hours; each file takes half a second to load (fuseki running as
> >>>   localhost).
> >>>
> >>>   i am sure there is a better way?
> >>>
> >>>   thank you for the help!
> >>>
> >>>   andrew
> >>>
> >>>   --
> >>>   em.o.Univ.Prof. Dr. sc.techn. Dr. h.c. Andrew U. Frank
> >>>   +43 1 58801 12710 direct
> >>>   Geoinformation, TU Wien +43 1 58801 12700 office
> >>>   Gusshausstr. 27-29 +43 1 55801 12799 fax
> >>>   1040 Wien Austria +43 676 419 25 72 mobil
> >>>
> >>>
> >>>
> >>
>
> --
> em.o.Univ.Prof. Dr. sc.techn. Dr. h.c. Andrew U. Frank
>                                   +43 1 58801 12710 direct
> Geoinformation, TU Wien          +43 1 58801 12700 office
> Gusshausstr. 27-29               +43 1 55801 12799 fax
> 1040 Wien Austria                +43 676 419 25 72 mobil
>
>
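The batch conversion discussed in the thread (hand riot many files per invocation, collect the N-Triples in one file, then load that file once) can be sketched roughly as below. This is only an illustration under some assumptions: the function name convert_all and the RIOT_CMD override are hypothetical, riot is assumed to be on the PATH, and -print0 / -0 are an addition to the quoted command that guards against unusual file names.

```shell
#!/bin/sh
# Sketch: convert every *.rdf file under a directory into one N-Triples file.
# convert_all and RIOT_CMD are hypothetical names introduced for illustration.
# RIOT_CMD defaults to "riot" (the Apache Jena command-line parser).

convert_all() {
    src_dir=$1      # directory tree holding the small RDF/XML files
    out_file=$2     # single output file collecting the converter's output

    # Start from an empty output file.
    : > "$out_file"

    # xargs packs many file names onto each command line, so the
    # converter process is started only a few times rather than
    # once per file.
    find "$src_dir" -name '*.rdf' -print0 |
        xargs -0 ${RIOT_CMD:-riot} >> "$out_file"
}
```

Calling convert_all . data.nt from the catalog directory would correspond to the find/xargs command quoted above; the resulting data.nt can then be loaded in a single step instead of one s-put per file.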
