Jacek,

Sorry, I wasn't paying close attention to this thread, but saw your last
comment and wanted to chime in.

Of course Jena/Fuseki (or any other RDF system) can't compete with SQL at
the things SQL is good at.

I load 760M triples in about 8 hours on a Linux VM on what is by now
probably a middle-of-the-road machine with an SSD and maybe 8 GB of RAM. As
Andy said, the bulk loader must load into an empty store to get its
performance benefits. You can just feed all your input files to it at
once--I load a couple of dozen Turtle files at a time. I don't use named
graphs, so I don't know if that makes any difference.
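
For reference, a minimal sketch of that kind of invocation (the database
directory and file names here are placeholders; the target directory
should be empty so the bulk loader can take its fast path):

  tdbloader --loc=/data/tdb molecule.ttl activity.ttl assays.ttl

All the files go in as a single run against the empty database, rather
than as separate updates afterwards.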

I am still surprised at how fast most queries go, although some require
tweaking based on knowledge of the data. I'm not sure a "select distinct
type" query is a good indicator of general usefulness.

Anyway, you need to select the best tools to get the job done, and it
sounds like you've done enough testing to determine that. I just didn't
want others to think that was a general or common experience with
Jena/Fuseki.

Regards,
--Paul

On Fri, 2014-11-14 at 22:25 +0000, Jacek Grzebyta wrote:
> Hi guys,
> 
> I give up. Thanks for your help, but unfortunately nothing works. I tried
> different triple stores in different combinations. None of them can cope
> with large data on a non-server machine. Fuseki was the most efficient,
> though: it took only ca. 10 minutes to return results for the query:
> 
> select distinct ?class {
> [] a ?class
> }
> 
> I had to kill BigData(R) after half an hour, and I wasn't able to load the
> data into Virtuoso at all...
> I will use PostgreSQL + D2RQ instead.
> 
> 
> Thanks a lot,
> Jacek
> 
> 
> 
> 
> On 12 November 2014 11:46, Andy Seaborne <a...@apache.org> wrote:
> 
> > Hi Jacek,
> >
> > On 11/11/14 16:07, Jacek Grzebyta wrote:
> >
> >> Dear All,
> >>
> >> I have a problem with creating a (read-only) copy of a very large
> >> database. The original data are located at
> >> ftp://ftp.ebi.ac.uk/pub/databases/chembl/ChEMBL-RDF/latest/.
> >> I want to load at least all the large files (activity, assays and
> >> molecule). The biggest file (molecule) contains ca. 400M triples and it
> >> took several hours to load. First I loaded 'molecule' into a named graph.
> >> Then I started to load the next file (activity) into another named graph
> >> in the same store. After 2 hours I found that only ~2M triples had loaded.
> >>
> >
> > Loading all the data at once into an empty store is faster than adding
> > data in separate updates.  The bulk loader can only do anything special on
> > an empty store.
> >
> >> Please help me. How should I organise the database? Should I load each
> >> named graph into a separate TDB store (is that faster)? Maybe I should
> >> not use named graphs if I do not need them: I just wanted to split the
> >> data into graphs based on the subject. Maybe it is better to use a
> >> separate store for each data subject. But if I have 3 or more stores with
> >> 400M triples in each, then the final service would be painfully slow. I
> >> am planning to do SPARQL queries without a reasoner but expect quite
> >> large output.
> >>
> >
> > There isn't a single "best" that I know of.  The details of your data and
> > access requirements will matter.
> >
> > Some thoughts:
> >
> > If the SPARQL queries need to match across different data sources, then the
> > data will need to be in one store.  Federated query is unlikely to help.
> >
> > If you don't need to keep the data separate, load into the default graph.
> > It's less work and it's easier to write the queries (alternatively, use
> > unionDefaultGraph for the latter - but it's slower to load into named
> > graphs than into the default graph).
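> >
> > As a rough sketch (database directory and dataset name are placeholders),
> > the union default graph can be switched on when starting Fuseki over the
> > TDB directory:
> >
> >   fuseki-server --loc=/data/tdb --set tdb:unionDefaultGraph=true /ds
> >
> > With that setting, queries against the default graph see the union of all
> > the named graphs.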
> >
> > Make sure you have the right hardware: RAM and an SSD.  It sounds as if
> > the later "2M in 2 hours" might be suffering from some kind of hardware
> > bottleneck.
> >
> > If you can borrow a large SSD machine to load the data, you can load and
> > prepare the database on that machine and use different hardware for query.
> > Querying needs adequate RAM but is less SSD-sensitive if the RAM is large
> > enough.
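> >
> > For example (directory and dataset name are placeholders), the prepared
> > TDB directory can simply be copied to the query machine and served there:
> >
> >   fuseki-server --loc=/data/tdb /chembl
> >
> > Leaving out --update keeps the service read-only, which fits the
> > read-only copy described above.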
> >
> >
> >> Thanks a lot,
> >> Jacek
> >>
> >>
> >         Andy
> >
> >

