Thompsonbry.systap added a comment.
The data storage on the disk is identical for the single machine and HA replication cluster (HAJournalServer) modes. In fact, you can take a compressed snapshot file from the HA replication cluster, decompress it, and open it as a standard Journal (a minimal sketch of this follows below).

BlazeGraph does support incremental truth maintenance. Whether or not this is a good idea depends on the application. We generally recommend that people use exactly those inference rules that they really require; a sketch of the relevant configuration properties is given below. A lot of the inference rules in the standard RDF Schema model are useless: either they do not add useful information, or they add information that could just as easily be maintained by the application when it applies writes to the database. Based on what I have heard so far, it seems like it might be more efficient to chase paths (such as geospatial containment) in the query rather than eagerly materializing that information during writes, but this is worth testing (see the property-path sketch below).

BlazeGraph has pretty good sustained throughput for updates. I suspect that it will not be at all difficult to hit your targets, especially if you meld individual page updates together into a single write set for commit purposes (see the batching sketch below). However, the current work on group commit support for the REST API may mean that this will no longer be necessary. See http://trac.bigdata.com/ticket/566

The biggest factors in the commit rate are:

- The number of property values / edges modified in the batch. Very small commits have a relatively large overhead; batched commits have much better throughput.
- The number of durable commits, since actually syncing to the disk is a relatively slow operation. Melding write sets together at the application layer addresses this by reducing the number of distinct commit points.
- The IO system. SSD is good.
- The number of write buffers. This is more important for very large batched updates.
- The CPU speed. Faster CPUs will update the on-page representation more quickly.
- The shape of the data. For this application the write rate will likely depend on how the per-property/link metadata is handled, and is likely to differ between a simple graph and a graph carrying all of the additional per-property/link metadata. I would suggest performance testing the different representation schemes for both write rate and query performance.
- The branching factors of the indices. Index updates tend to be well clustered for SPO(C). However, you can often reduce the IO wait induced by scattered updates on the non-clustered indices (OSP, which has the value as the first component in the key, and POS, which has the edge type as the first component in the key). Reduce their target branching factors so those indices have smaller pages, then let the RWStore coalesce the page allocations onto the same 8k pages; this reduces the number of actual IOs (see the branching-factor sketch after this list). These issues might also interact with the manner in which the per-property/per-link metadata are represented, since that could influence the locality of updates on these indices.
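On the snapshot point: a minimal sketch of opening a decompressed HA snapshot as a standard Journal, assuming the com.bigdata.journal API of Blazegraph 1.x and a hypothetical file path:

```java
import java.util.Properties;

import com.bigdata.journal.Journal;
import com.bigdata.journal.Options;

public class OpenSnapshot {
    public static void main(final String[] args) {
        final Properties p = new Properties();
        // Hypothetical path to a snapshot copied from the HA cluster and
        // then decompressed.
        p.setProperty(Options.FILE, "/data/decompressed-snapshot.jnl");
        final Journal journal = new Journal(p);
        try {
            // The file reads exactly like a single-machine store.
            System.out.println("last commit time: " + journal.getLastCommitTime());
        } finally {
            journal.close();
        }
    }
}
```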
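On the inference discussion: the axiomatic closure and incremental truth maintenance are driven by configuration properties. A sketch, assuming the option names documented for Blazegraph 1.x (verify them against your release):

```java
import java.util.Properties;

public class InferenceConfig {
    // Use exactly the inference the application needs; here, none at all.
    public static Properties minimalInference() {
        final Properties p = new Properties();
        // No eager RDFS/OWL axioms (assumed option name from the 1.x docs):
        p.setProperty(
                "com.bigdata.rdf.store.AbstractTripleStore.axiomsClass",
                "com.bigdata.rdf.axioms.NoAxioms");
        // Switch off incremental truth maintenance on the SAIL:
        p.setProperty("com.bigdata.rdf.sail.truthMaintenance", "false");
        return p;
    }
}
```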
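On chasing paths in the query: with SPARQL 1.1 property paths, a containment hierarchy can be walked at query time rather than materialized during writes. ex:someFeature and ex:containedIn are hypothetical names used only for illustration:

```java
public class PathQuery {
    // The "+" path operator follows one or more ex:containedIn edges at
    // query time, so the transitive closure never has to be stored.
    public static final String QUERY =
            "PREFIX ex: <http://example.org/>\n" +
            "SELECT ?region WHERE { ex:someFeature ex:containedIn+ ?region }";
}
```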
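On melding write sets: a sketch of application-level batching through the Sesame/OpenRDF repository API that BlazeGraph exposes (the class and method names here are illustrative; begin()/commit() are per Sesame 2.7):

```java
import java.util.List;

import org.openrdf.model.Statement;
import org.openrdf.repository.Repository;
import org.openrdf.repository.RepositoryConnection;
import org.openrdf.repository.RepositoryException;

public class BatchedWriter {
    // One durable commit per batch amortizes the sync-to-disk cost over
    // all of the property values / edges in the batch.
    public static void writeBatch(final Repository repo,
            final List<Statement> batch) throws RepositoryException {
        final RepositoryConnection conn = repo.getConnection();
        try {
            conn.begin();
            for (Statement stmt : batch) {
                conn.add(stmt);
            }
            conn.commit(); // a single commit point for the whole write set
        } catch (RepositoryException ex) {
            conn.rollback();
            throw ex;
        } finally {
            conn.close();
        }
    }
}
```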
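On the branching factors: per-index overrides can be supplied as properties. A sketch assuming the default "kb" namespace and the per-index property pattern used by Blazegraph 1.x; the value 64 is only a starting point for testing, not a recommendation:

```java
import java.util.Properties;

public class BranchingFactors {
    // Smaller pages for the non-clustered indices reduce the IO wait from
    // scattered updates; the RWStore can then coalesce several small page
    // allocations onto the same 8k disk page.
    public static Properties smallPagesForScatteredIndices() {
        final Properties p = new Properties();
        p.setProperty(
                "com.bigdata.namespace.kb.spo.OSP.com.bigdata.btree.BTree.branchingFactor",
                "64");
        p.setProperty(
                "com.bigdata.namespace.kb.spo.POS.com.bigdata.btree.BTree.branchingFactor",
                "64");
        return p;
    }
}
```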
TASK DETAIL
https://phabricator.wikimedia.org/T90114
