On 31/07/2020 10:37, Rob Vesse wrote:
I don’t have any specific queries to suggest since I have no familiarity with 
the database but an observation on the query shown

Scanning the whole database and counting the triples inherently requires a full 
traversal of the B-Tree so what you have shown so far is a corner case extreme. 
 Of course more complex queries could in fact require multiple traversals over 
the whole database and see a wider performance difference but YMMV

Rob

From: Wolfgang Fahl <[email protected]>
Organisation: BITPlan GmbH
Reply to: <[email protected]>
Date: Friday, 31 July 2020 at 08:51
To: <[email protected]>
Subject: Difference in query speed for rotating disk and SSD

Dear Apache Jena users,

the experience with the 
http://wiki.bitplan.com/index.php/Get_your_own_copy_of_WikiData trials and the 
unanswered question
https://stackoverflow.com/questions/61813248/jena-tdbloader-performance-and-limits
 led me to the assumption that it would be possible
to run the wikidata import for Jena on a costly 4 TB SSD but then use the 
resulting database on much cheaper rotating disk and see not much of a 
performance difference for queries then running from the SSD.

My assumption was based on the 
https://jena.apache.org/documentation/tdb/architecture.html and the mentioned 
use of https://en.wikipedia.org/wiki/B+_tree.
I thought the B+ tree approach is optimized for making sure that not too many 
time costly seeks are necessary when fetching data during a query.

My experiment at 
http://wiki.bitplan.com/index.php/WikiData_Import_2020-07-15#log_for_query 
shows a different result for the query:
SELECT (COUNT(*) as ?Triples) WHERE { ?s ?p ?o}

This requires going through every block of one of the B+Trees (the SPO one unless it is default-union-graph and it's GSPO).

How long this takes not only depends on disk/SSD but also caching.

TDB uses the OS file system cache for the B+Trees.

Your queries are [5] and [7]. If only small queries have been done, then not only is the disk cache cold, Java has not JITed much code.

Now, if the data is all loaded at once, it will, by chance, be laid out on disk roughly sequentially for the SPO index which means the OS will tend to read ahead a bit - not much because the amount fo work per block is not much so the OS does not have much chance to read ahead. Read head only has any effect on large scans.

It takes 31.501 secs on a rotating disk which is only a bit slower than the SSD 
in raw i/o but has the seek time of a rotating disk while the SSD does not have 
this performance penalty and the query takes 5.516 secs for the SSD.

The important factor for this query (which is not typical) is the I/O throughput of the interface. If a query hops around the graph, latency is the more important factor.


Would other queries see the same factor 6 difference or does the speed 
difference depend on the query? Please suggest some queries that i might test 
and then I will report the results here.

Depends on query and history of the server since launch and amount of RAM.

For an SSD, the speed ration of SSD/RAM is less that disk/RAM and the latency is less (how much depends on the way the SSD is connected e.g. NVMe vs SATA3 ... number of PCIe lanes ...)

Writing a cache that is matched to the needs of a database is hard - lots of factors, different workloads - which makes it a high-resource task.

Resources includes both people and hardware for experimentation and testing. Given the resources available, there is a limit to how much can be done.


Cheers

   Wolfgang

Reply via email to