Thompsonbry.systap added a comment.
The data storage on the disk is identical for the single machine and HA replication cluster (HAJournalServer) modes. In fact, you can take a compressed snapshot file from the HA replication cluster, decompress it, and open it as a standard Journal (a minimal sketch of this follows below).

BlazeGraph does support incremental truth maintenance. Whether or not this is a good idea depends on the application. We generally recommend that people use exactly those inference rules that they really require; a sketch of the relevant configuration properties is given below. A lot of the inference rules in the standard RDF Schema model are useless: either they do not add useful information, or they add information that could just as easily be maintained by the application when it applies writes to the database. Based on what I have heard so far, it seems like it might be more efficient to chase paths (such as geospatial containment) in the query rather than eagerly materializing that information during writes, but this is worth testing (see the property-path sketch below).

BlazeGraph has pretty good sustained throughput for updates. I suspect that it will not be at all difficult to hit your targets, especially if you meld individual page updates together into a single write set for commit purposes (see the batching sketch below). However, the current work on group commit support for the REST API may mean that this will no longer be necessary. See http://trac.bigdata.com/ticket/566

The biggest factors in the commit rate are:

- The number of property values / edges modified in the batch. Very small commits have a relatively large overhead; batched commits have much better throughput.
- The number of durable commits, since actually syncing to the disk is a relatively slow operation. Melding write sets together at the application layer addresses this by reducing the number of distinct commit points.
- The IO system. SSD is good.
- The number of write buffers. This is more important for very large batched updates.
- The CPU speed. Faster CPUs will update the on-page representation more quickly.
- The shape of the data. For this application the write rate will likely depend on how the per-property/link metadata is handled, and is likely to differ between a simple graph and a graph carrying all of the additional per-property/link metadata. I would suggest performance testing the different representation schemes for both write rate and query performance.
- The branching factors of the indices. Index updates tend to be well clustered for SPO(C). However, you can often reduce the IO wait induced by scattered updates on the non-clustered indices (OSP, which has the value as the first component in the key, and POS, which has the edge type as the first component in the key). Reduce their target branching factors so those indices have smaller pages, then let the RWStore coalesce the page allocations onto the same 8k pages; this reduces the number of actual IOs (see the branching-factor sketch after this list). These issues might also interact with the manner in which the per-property/per-link metadata are represented, since that could influence the locality of updates on these indices.
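On the snapshot point: a minimal sketch of opening a decompressed HA snapshot as a standard Journal, assuming the com.bigdata.journal API of Blazegraph 1.x and a hypothetical file path:

```java
import java.util.Properties;

import com.bigdata.journal.Journal;
import com.bigdata.journal.Options;

public class OpenSnapshot {
    public static void main(final String[] args) {
        final Properties p = new Properties();
        // Hypothetical path to a snapshot copied from the HA cluster and
        // then decompressed.
        p.setProperty(Options.FILE, "/data/decompressed-snapshot.jnl");
        final Journal journal = new Journal(p);
        try {
            // The file reads exactly like a single-machine store.
            System.out.println("last commit time: " + journal.getLastCommitTime());
        } finally {
            journal.close();
        }
    }
}
```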
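On the inference discussion: the axiomatic closure and incremental truth maintenance are driven by configuration properties. A sketch, assuming the option names documented for Blazegraph 1.x (verify them against your release):

```java
import java.util.Properties;

public class InferenceConfig {
    // Use exactly the inference the application needs; here, none at all.
    public static Properties minimalInference() {
        final Properties p = new Properties();
        // No eager RDFS/OWL axioms (assumed option name from the 1.x docs):
        p.setProperty(
                "com.bigdata.rdf.store.AbstractTripleStore.axiomsClass",
                "com.bigdata.rdf.axioms.NoAxioms");
        // Switch off incremental truth maintenance on the SAIL:
        p.setProperty("com.bigdata.rdf.sail.truthMaintenance", "false");
        return p;
    }
}
```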
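On chasing paths in the query: with SPARQL 1.1 property paths, a containment hierarchy can be walked at query time rather than materialized during writes. ex:someFeature and ex:containedIn are hypothetical names used only for illustration:

```java
public class PathQuery {
    // The "+" path operator follows one or more ex:containedIn edges at
    // query time, so the transitive closure never has to be stored.
    public static final String QUERY =
            "PREFIX ex: <http://example.org/>\n" +
            "SELECT ?region WHERE { ex:someFeature ex:containedIn+ ?region }";
}
```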
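On melding write sets: a sketch of application-level batching through the Sesame/OpenRDF repository API that BlazeGraph exposes (the class and method names here are illustrative; begin()/commit() are per Sesame 2.7):

```java
import java.util.List;

import org.openrdf.model.Statement;
import org.openrdf.repository.Repository;
import org.openrdf.repository.RepositoryConnection;
import org.openrdf.repository.RepositoryException;

public class BatchedWriter {
    // One durable commit per batch amortizes the sync-to-disk cost over
    // all of the property values / edges in the batch.
    public static void writeBatch(final Repository repo,
            final List<Statement> batch) throws RepositoryException {
        final RepositoryConnection conn = repo.getConnection();
        try {
            conn.begin();
            for (Statement stmt : batch) {
                conn.add(stmt);
            }
            conn.commit(); // a single commit point for the whole write set
        } catch (RepositoryException ex) {
            conn.rollback();
            throw ex;
        } finally {
            conn.close();
        }
    }
}
```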
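On the branching factors: per-index overrides can be supplied as properties. A sketch assuming the default "kb" namespace and the per-index property pattern used by Blazegraph 1.x; the value 64 is only a starting point for testing, not a recommendation:

```java
import java.util.Properties;

public class BranchingFactors {
    // Smaller pages for the non-clustered indices reduce the IO wait from
    // scattered updates; the RWStore can then coalesce several small page
    // allocations onto the same 8k disk page.
    public static Properties smallPagesForScatteredIndices() {
        final Properties p = new Properties();
        p.setProperty(
                "com.bigdata.namespace.kb.spo.OSP.com.bigdata.btree.BTree.branchingFactor",
                "64");
        p.setProperty(
                "com.bigdata.namespace.kb.spo.POS.com.bigdata.btree.BTree.branchingFactor",
                "64");
        return p;
    }
}
```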
TASK DETAIL
https://phabricator.wikimedia.org/T90114
