Re: JENA Loader Benchmarks

Marco Neumann Tue, 18 Jun 2019 07:38:55 -0700

I agree, it would be desirable to have funding for these requests and more.
Too bad there isn't currently a commercial entity that actively helps drive
this valuable project.



On Tue, Jun 18, 2019 at 2:37 PM Andy Seaborne <[email protected]> wrote:

>
>
> On 18/06/2019 13:44, Marco Neumann wrote:
> > Andy, just one observation. there seems to be quite some data replication
> > going on in the respective tdb / tdb2 folder.
> >
> > Is it possible to instruct tdb/tdb2 to create a database with only the
> > default graph?
>
> In theory you can set the indexes you want via StoreParams - it works
> for choices but I would not be surprised if the code assumed at least
> one quads index. Fixable.
>
> > It seems to be quite safe to manually remove the files that contain
> > G-indexes from disk while maintaining query consistency in the default
> > graph, and it would reduce the tdb database footprint on disk by 1/3.
> >
>
> They aren't as big as you think they are :-)
>
> Try this:
>
> No DB2.
>
> tdb2.tdbquery --loc DB2 'ASK{}'
> Ask => Yes
>
> du -sh DB2
> 216K    DB2
>
> So an empty database takes 216K on disk.
>
> (this is Linux/ext4 filesystem)
>
> ~ >> ll DB2/Data-0001/
>
> loads of 8M files.
>
> How come there are files that are 8M but the entire thing is 216K?
>
> They are sparse files.
> The space is not allocated.
>
> Some systems (Mac for example) report the size of the files added up,
> not the space used.
>
> total 204
> -rw-r--r-- 1 afs afs      24 Jun 18 14:26 GOSP.bpt
> -rw-r--r-- 1 afs afs 8388608 Jun 18 14:26 GOSP.dat
> -rw-r--r-- 1 afs afs 8388608 Jun 18 14:26 GOSP.idn
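
The sparse-file effect described above can be reproduced without TDB2. The following is a minimal sketch, assuming GNU coreutils (truncate, stat -c, du --apparent-size) on a Linux filesystem that supports sparse files such as ext4. The temporary directory and the file name sparse.dat are made up for illustration, and this only demonstrates the effect; it is not a description of how TDB2 creates its files.

```shell
#!/bin/sh
# Sketch: create an 8 MiB sparse file and compare its apparent size
# with the space actually allocated on disk.
# Assumes GNU coreutils on Linux; "sparse.dat" is a hypothetical name.
tmpdir=$(mktemp -d)
file="$tmpdir/sparse.dat"

# Set the file length to 8 MiB without writing any data blocks.
truncate -s 8M "$file"

# File length in bytes, the figure that ls -l reports.
apparent=$(stat -c %s "$file")

# Allocated space: number of blocks times the size of one block.
allocated=$(( $(stat -c %b "$file") * $(stat -c %B "$file") ))

echo "apparent:  $apparent bytes"
echo "allocated: $allocated bytes"

du -h --apparent-size "$file"   # adds up file lengths
du -h "$file"                   # adds up allocated blocks, like du -sh

rm -rf "$tmpdir"
```

On a system whose tools add up file lengths, the first du figure is the one you would see, which is why the same database can look much larger there.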
>
> > not to speak of an option for LZW compression a la HDT.
>
> That would be good if I had time. Anyone got any spare funding?!
>
> I'm not sure how the HDT (java) project is doing.
> Like all open source projects, it needs time and energy, and executing a
> steady state still requires backing.
>
> I currently think RocksDB is a possible choice. Initial experiments showed
> it works but needs tuning work. The new storage architecture
> (jena-dboe-storage) would make it even easier to build.
>
>      Andy
>
>
> >
> >
> >
> > On Fri, Jun 14, 2019 at 8:03 PM Andy Seaborne <[email protected]> wrote:
> >
> >>
> >>
> >> On 14/06/2019 18:13, Marco Neumann wrote:
> >>> I am collecting jena loader benchmarks. if you have results please post
> >>> them directly.
> >>>
> >>> http://www.lotico.com/index.php/JENA_Loader_Benchmarks
> >>
> >> tdb2.tdbloader has variations controlled by --loader.
> >>
> >> --loader=
> >> Loader to use: 'basic', 'phased' (default), 'sequential', 'parallel' or
> >> 'light'
> >>
> >> "basic" is a super naive parser-add triple loop - it is used if a loader
> >> can't cope with an already loaded database.
> >>
> >> "phased" is a balanced loader that does not saturate the machine. Some
> >> parallelism.
> >>
> >> "sequential" is the tdbloader algorithm for TDB2, more for reference.
> >>
> >> "parallel" is as much parallelism as it wants. (5 for triples, more for
> >> quads)
> >>
> >> "light" is two-threaded. Slightly lighter than "phased".
> >>
> >> See LoaderPlans.
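
To compare the loader variants listed above on the same input, a wrapper along these lines may help. It is a rough sketch, not a Jena feature: it measures wall-clock time from outside the process and estimates the triple count from the line count of an N-Triples file. The file name data.nt, the DB-<loader> directory names, and the helper triples_per_second are all made up for illustration; only --loader and --loc are taken from the tools themselves.

```shell
#!/bin/sh
# Sketch: time each tdb2.tdbloader variant on the same input file.
# DATA and the DB-<loader> directory names are hypothetical placeholders.
DATA="${DATA:-data.nt}"

# Integer triples per second from a triple count and elapsed seconds.
triples_per_second() {
    count=$1
    seconds=$2
    # Treat sub-second runs as one second to avoid dividing by zero.
    [ "$seconds" -gt 0 ] || seconds=1
    echo $(( count / seconds ))
}

if command -v tdb2.tdbloader >/dev/null 2>&1 && [ -f "$DATA" ]; then
    # In N-Triples one line is roughly one triple
    # (blank and comment lines are counted too, so this is an estimate).
    triples=$(wc -l < "$DATA")
    for loader in basic phased sequential parallel light; do
        db="DB-$loader"
        rm -rf "$db"                 # start each run from an empty database
        start=$(date +%s)
        tdb2.tdbloader --loader="$loader" --loc "$db" "$DATA"
        end=$(date +%s)
        elapsed=$(( end - start ))
        echo "$loader: ${elapsed}s, $(triples_per_second "$triples" "$elapsed") triples/s"
    done
else
    echo "tdb2.tdbloader or $DATA not found; skipping load"
fi
```

Whole-second resolution is coarse, so this is only meaningful for loads that run for minutes. Each variant gets its own empty directory so that no run starts from an already loaded database.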
> >>
> >>> On a linux machine I am using "time" to collect data.
> >>>
> >>> Is there a flag on tdb2.tdbloader to report time and triples per
> >>> second?
> >>>
> >>> I have noticed that storage space use for tdbloader2 is significantly
> >>> smaller on disk compared to tdbloader and tdb2.tdbloader. Is there a
> >>> straightforward explanation for this?
> >>>
> >>
> >
> >
>


-- 


---
Marco Neumann
KONA
