Re: JENA Loader Benchmarks

Andy Seaborne Tue, 18 Jun 2019 06:37:26 -0700



On 18/06/2019 13:44, Marco Neumann wrote:

Andy, just one observation. there seems to be quite some data replication
going on in the respective tdb / tdb2 folder.

Is it possible to instruct tdb/tdb2 only to create a database with one
default graph?

In theory you can set the indexes you want via StoreParams - it works for choices but I would not be surprised if the code assumed at least one quads index. Fixable.

It seems to be quite safe to remove files from disk that
contain G-indexes manually and maintain query consistency in the default
graph and it would reduce the tdb database footprint on disk by 1/3.


They aren't as big as you think they are :-)

Try this:

No DB2.

tdb2.tdbquery --loc DB2 'ASK{}'
Ask => Yes

du -sh DB2
216K    DB2

So an empty database takes 216K on disk.

(this is Linux/ext4 filesystem)

~ >> ll DB2/Data-0001/

loads of 8M files.

How come there are files that are 8M but the entire thing is 216K?

They are sparse files.
The space is not allocated.

Some systems (Mac for example) report the size of the files added up, not the space used.


total 204
-rw-r--r-- 1 afs afs      24 Jun 18 14:26 GOSP.bpt
-rw-r--r-- 1 afs afs 8388608 Jun 18 14:26 GOSP.dat
-rw-r--r-- 1 afs afs 8388608 Jun 18 14:26 GOSP.idn
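
The sparse-file effect can be reproduced without TDB2 at all. The following is a minimal sketch, assuming Linux with GNU coreutils (truncate, stat -c, du --apparent-size); the scratch file name sparse.dat is made up for illustration and is not a TDB2 file.

```shell
# Create an 8M sparse file in a scratch directory and compare its
# length with the space actually allocated for it on disk.
# Assumption: GNU coreutils on Linux; sparse.dat is a hypothetical name.
tmpdir=$(mktemp -d)
f="$tmpdir/sparse.dat"

truncate -s 8388608 "$f"          # sets the file length; writes no data

apparent=$(stat -c %s "$f")       # file length in bytes (what ls -l shows)
blocks=$(stat -c %b "$f")         # number of blocks allocated
blocksize=$(stat -c %B "$f")      # size in bytes of each block counted by %b
allocated=$((blocks * blocksize)) # space used on disk

echo "apparent size:  $apparent bytes"
echo "allocated size: $allocated bytes"

du -h --apparent-size "$f"        # file lengths added up
du -h "$f"                        # space used

rm -rf "$tmpdir"
```

On a filesystem that supports sparse files, the allocated size should come out far smaller than the 8M length, which is the same gap as between the 8M index files and the 216K reported by du for the whole directory.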

not to speak of an option for LZW compression a la HDT.


That would be good if I had time. Anyone got any spare funding?!

I'm not sure how the HDT (java) project is doing.

Like all open source projects, it needs time and energy, and executing a steady state still requires backing.

I currently think RocksDB is a possible choice. Initial experiments showed it works but needs tuning work. The new storage architecture (jena-dboe-storage) would make it even easier to build.


    Andy




On Fri, Jun 14, 2019 at 8:03 PM Andy Seaborne <[email protected]> wrote:



On 14/06/2019 18:13, Marco Neumann wrote:

I am collecting jena loader benchmarks. if you have results please post
them directly.

http://www.lotico.com/index.php/JENA_Loader_Benchmarks


tdb2.tdbloader has variations controlled by --loader.

--loader=
Loader to use: 'basic', 'phased' (default), 'sequential', 'parallel' or
'light'

"basic" is a super naive parser-add triple loop - it is used if a loader
can't cope with an already loaded database.

"phased" is a balanced loader that does not saturate the machine. Some
parallelism.

"sequential" is the tdbloader algorithm for TDB2, more for reference.

"parallel" is as much parallelism as it wants. (5 for triples, more for
quads)

"light" is two threaded. Slightly lighter than "phased".

See LoaderPlans.
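
For collecting comparable numbers, one way to run each variant is sketched below. This is only an illustration: the input file data.nt and the DB2-<loader> directory names are assumptions, and the loop skips a variant if tdb2.tdbloader or the data file is not available. Each variant loads into its own fresh directory so the runs do not affect each other.

```shell
# Time each tdb2.tdbloader variant against the same input file.
# Assumptions: data.nt and the DB2-<loader> directories are made-up names.
LOADERS="basic phased sequential parallel light"
DATA="${DATA:-data.nt}"

for loader in $LOADERS; do
  if command -v tdb2.tdbloader >/dev/null 2>&1 && [ -f "$DATA" ]; then
    echo "== $loader =="
    # A separate, empty database location per loader variant.
    time tdb2.tdbloader --loader="$loader" --loc "DB2-$loader" "$DATA"
  else
    echo "skipping $loader: tdb2.tdbloader or $DATA not found"
  fi
done
```

Running du -sh on each DB2-<loader> directory afterwards gives the space used rather than the file lengths added up.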

On a linux machine I am using "time" to collect data.

Is there a flag on tdb2.tdbloader to report time and triples per second?

I have noticed that storage space use for tdbloader2 is significantly
smaller on disk compared to tdbloader and tdb2.tdbloader. Is there a
straight forward explanation here?
