On 18/06/2019 13:44, Marco Neumann wrote:
Andy, just one observation: there seems to be quite some data replication
going on in the respective tdb / tdb2 folder.
Is it possible to instruct tdb/tdb2 to create a database with only one
default graph?
In theory you can set the indexes you want via StoreParams - it works
for choices but I would not be surprised if the code assumed at least
one quads index. Fixable.
It seems to be quite safe to manually remove the files on disk that
contain G-indexes while maintaining query consistency in the default
graph, and it would reduce the tdb database footprint on disk by 1/3.
They aren't as big as you think they are :-)
Try this:
Start with no DB2 directory.
tdb2.tdbquery --loc DB2 'ASK{}'
Ask => Yes
du -sh DB2
216K DB2
So it is 216K bytes on disk when empty.
(this is Linux/ext4 filesystem)
~ >> ll DB2/Data-0001/
loads of 8M files.
How come there are files that are 8M but the entire thing is 216K?
They are sparse files.
The space is not allocated.
Some systems (Mac for example) report the size of the files added up,
not the space used.
total 204
-rw-r--r-- 1 afs afs 24 Jun 18 14:26 GOSP.bpt
-rw-r--r-- 1 afs afs 8388608 Jun 18 14:26 GOSP.dat
-rw-r--r-- 1 afs afs 8388608 Jun 18 14:26 GOSP.idn
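The sparse-file point can be checked outside TDB. The following is a minimal sketch, assuming a POSIX system where os.stat exposes st_blocks (it is not available on Windows); the file name sparse.dat is made up for the illustration and has nothing to do with TDB's own files. It creates an 8M file without writing any data, then compares the apparent size with the space actually allocated.

```python
import os
import tempfile


def apparent_and_allocated(path):
    """Return (apparent_bytes, allocated_bytes) for a file."""
    st = os.stat(path)
    # st_size is what "ls -l" shows; st_blocks is counted in 512-byte
    # units on POSIX systems and reflects what "du" reports.
    return st.st_size, st.st_blocks * 512


with tempfile.TemporaryDirectory() as tmp:
    # Hypothetical file name, for illustration only.
    path = os.path.join(tmp, "sparse.dat")
    with open(path, "wb") as f:
        # Set the length to 8M without writing any bytes.
        f.truncate(8 * 1024 * 1024)
    apparent, allocated = apparent_and_allocated(path)
    print(f"apparent: {apparent} bytes, allocated: {allocated} bytes")
```

On a filesystem that supports sparse files the allocated figure should be far smaller than the apparent one; on a filesystem that does not, the two will be about the same.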
not to speak of an option for LZW compression a la HDT.
That would be good if I had time. Anyone got any spare funding?!
I'm not sure how the HDT (java) project is doing.
Like all open source projects, it needs time and energy, and even maintaining a
steady state still requires backing.
I currently think RocksDB is a possible choice. Initial experiments showed
it works but needs tuning work. The new storage architecture
(jena-dboe-storage) would make it even easier to build.
Andy
On Fri, Jun 14, 2019 at 8:03 PM Andy Seaborne <[email protected]> wrote:
On 14/06/2019 18:13, Marco Neumann wrote:
I am collecting Jena loader benchmarks. If you have results, please post
them directly.
http://www.lotico.com/index.php/JENA_Loader_Benchmarks
tdb2.tdbloader has variations controlled by --loader.
--loader=
Loader to use: 'basic', 'phased' (default), 'sequential', 'parallel' or
'light'
"basic" is a super naive parser-add triple loop - it is used if a loader
can't cope with an already loaded database.
"phased" is a balanced loader that does not saturate the machine. Some
parallelism.
"sequential" is the tdbloader algorithm for TDB2, more for reference.
"parallel" is as much parallelism as it wants. (5 for triples, more for
quads)
"light" is two threaded. Slightly lighter than "phased".
See LoaderPlans.
On a linux machine I am using "time" to collect data.
Is there a flag on tdb2.tdbloader to report time and triples per second?
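For collecting benchmark numbers, the wall-clock approach with "time" can also be scripted. This is only a sketch, assuming the input is N-Triples with one triple per line, so that counting non-blank, non-comment lines gives the triple count. The helper names and the file name data.nt are illustrative and not part of Jena; the commented-out command uses only the --loader and --loc options mentioned in this thread.

```python
import subprocess
import time


def count_ntriples(path):
    """Count triples in an N-Triples file (one per non-blank, non-comment line)."""
    n = 0
    with open(path, encoding="utf-8") as f:
        for line in f:
            s = line.strip()
            if s and not s.startswith("#"):
                n += 1
    return n


def timed_run(cmd):
    """Run a command and return the elapsed wall-clock time in seconds."""
    start = time.perf_counter()
    subprocess.run(cmd, check=True)  # raises CalledProcessError on failure
    return time.perf_counter() - start


def triples_per_second(triples, seconds):
    """Load rate; guards against a zero elapsed time."""
    return triples / seconds if seconds > 0 else float("inf")


# Example use (assumes tdb2.tdbloader is on the PATH and data.nt exists):
# elapsed = timed_run(["tdb2.tdbloader", "--loader=phased", "--loc", "DB2", "data.nt"])
# print(triples_per_second(count_ntriples("data.nt"), elapsed))
```

Counting lines only works for N-Triples; for other syntaxes the triple count would have to come from elsewhere.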
I have noticed that storage space use for tdbloader2 is significantly
smaller on disk compared to tdbloader and tdb2.tdbloader. Is there a
straightforward explanation here?