Re: [Wikidata] Scaling Wikidata Query Service

Ted Thibodeau Jr Mon, 17 Jun 2019 11:56:35 -0700

Hello, Stas --

On Jun 13, 2019, at 07:52 PM, Stas Malyshev <smalys...@wikimedia.org> wrote:
> 
> Hi!
> 
>> It handles data locality across a shared-nothing cluster just fine, i.e., you
>> can interact with any node in a Virtuoso cluster and experience identical
>> behavior (every node looks like a single node in the eyes of the operator).
> 
> Does this mean no sharding, i.e. each server stores the full DB?


No.

The full DB is automatically sharded across all Virtuoso instances in an 
Elastic Cluster, and each instance *appears* to store the full DB -- i.e., you 
can issue a query to any instance in an Elastic Cluster, if you have the 
relevant communication details (typically IP address and port number), and you 
will get the same results from it as from any other instance in that Elastic 
Cluster.
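
As a toy illustration of that behavior only (this is not Virtuoso's actual
partitioning or query-routing scheme; the class name, the hash-by-subject
placement, and the sample triples are all assumptions made for the sketch),
each triple is stored on exactly one shard, yet a lookup returns the same
answer whichever node receives it:

```python
import zlib


class ToyCluster:
    """Hypothetical shared-nothing cluster: one set of triples per node."""

    def __init__(self, node_count):
        self.shards = [set() for _ in range(node_count)]

    def _owner(self, subject):
        # Deterministic hash, so placement is stable across runs.
        return zlib.crc32(subject.encode("utf-8")) % len(self.shards)

    def insert(self, triple):
        # Each triple lives on exactly one shard, chosen by its subject.
        self.shards[self._owner(triple[0])].add(triple)

    def query(self, entry_node, subject):
        """Ask any node; it forwards the lookup to the owning shard."""
        if not 0 <= entry_node < len(self.shards):
            raise IndexError("no such node")
        owner = self._owner(subject)
        return sorted(t for t in self.shards[owner] if t[0] == subject)


cluster = ToyCluster(node_count=3)
cluster.insert(("ex:item1", "ex:label", '"first"'))
cluster.insert(("ex:item1", "ex:type", "ex:Thing"))
cluster.insert(("ex:item2", "ex:label", '"second"'))

# The same query issued to each of the three nodes.
answers = [cluster.query(node, "ex:item1") for node in range(3)]
```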

(I am generally specific about Elastic Cluster vs Replication Cluster, because 
these are different though complementary technologies, implemented via 
different Modules in Virtuoso.)


> This is the model we're using currently, but given the growth of the data it
> may be unsustainable on current hardware. I see in your tables that Uniprot
> has about 30B triples, but I wonder what update loads there look like. Our
> main issue is that the hardware we have now is showing its limits when
> there are a lot of updates in parallel with significant query load. So I
> wonder if the "single server holds everything" model is sustainable in the
> long term.

Your questions are unsurprising, and are among the reasons for the benchmark
efforts of the LDBC --

   http://ldbcouncil.org/benchmarks/

Uniprot does not get a lot of updates, and it is running on a single instance 
-- i.e., there's no cluster involved at all, neither Elastic (Shared-Nothing) 
Cluster nor Replication Cluster -- so it's probably not the best example for
your workflows.

I think the LDBC's Social Networking Benchmark (SNB) is likely to be the 
closest to the Wikidata update and query patterns, so you may find these 
articles interesting --

1. SNB Interactive, Part 1: What is SNB Interactive Really About?
   https://virtuoso.openlinksw.com/blog/vdb/blog/?id=1835

2. SNB Interactive, Part 2: Modeling Choices
   https://virtuoso.openlinksw.com/blog/vdb/blog/?id=1837

3. SNB Interactive, Part 3: Choke Points and Initial Run on Virtuoso
   https://virtuoso.openlinksw.com/blog/vdb/blog/?id=1842



>> There are live instances of Virtuoso that demonstrate its capabilities. If 
>> you want to explore shared-nothing cluster capabilities then our live LOD 
>> Cloud cache is the place to start [1][2][3]. If you want to see the
>> single-server open source edition, then you have DBpedia, DBpedia-Live,
>> Uniprot and many other nodes in the LOD Cloud to choose from. All of these
>> instances are highly connected.
> 
> Again, here the question is not so much "can you load 7bn triples into
> Virtuoso" - we know we can. What we want to figure out is whether, given the
> specific query/update patterns we have now, it is going to give us
> significantly better performance, allowing us to support our projected
> growth. And also possibly whether Virtuoso has ways to make our update
> workflow more efficient - e.g. right now if one triple changes in a Wikidata
> item, we're essentially downloading and updating the whole item (not exactly,
> since triples that stay the same are preserved, but it requires a lot of data
> transfer to express that in SPARQL). Would there be ways to update things
> more efficiently?

The first thing that will improve your performance is to break out of the 
"stored as JSON blobs" pattern you've been using.

Updates should not require a full download of the named graph (which I think is 
what your JSON Blobs amount to) followed by an upload of the entire revised 
named graph.

Even if you *query* the full content of an existing named graph, determine the
necessary changes locally, and then submit an update query which includes a
full set of DELETE + INSERT statements (this "full set" including only the
*changed* triples), you should find a significant reduction in the volume of
data transferred.
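
A minimal sketch of that flow, assuming the old and new states of one item
are already available locally as lists of serialized (subject, predicate,
object) terms. The function names, graph IRI, and sample triples are
hypothetical, and this does not describe Wikidata's existing updater. Only
the DELETE DATA / INSERT DATA forms come from SPARQL 1.1 Update:

```python
def diff_triples(old, new):
    """Return (to_delete, to_insert): only the triples that changed."""
    old_set, new_set = set(old), set(new)
    return sorted(old_set - new_set), sorted(new_set - old_set)


def build_update(graph_iri, to_delete, to_insert):
    """Build one SPARQL 1.1 Update request touching only changed triples.

    Assumes no blank nodes: DELETE DATA does not allow them.
    """
    def block(keyword, triples):
        body = "\n".join(f"    {s} {p} {o} ." for s, p, o in triples)
        return f"{keyword} DATA {{\n  GRAPH <{graph_iri}> {{\n{body}\n  }}\n}}"

    parts = []
    if to_delete:
        parts.append(block("DELETE", to_delete))
    if to_insert:
        parts.append(block("INSERT", to_insert))
    return " ;\n".join(parts)


# Hypothetical item whose label changed while its description did not.
ITEM = "<http://example.org/entity/item1>"
LABEL = "<http://www.w3.org/2000/01/rdf-schema#label>"
DESC = "<http://example.org/prop/description>"

old = [(ITEM, LABEL, '"old label"@en'), (ITEM, DESC, '"a description"@en')]
new = [(ITEM, LABEL, '"new label"@en'), (ITEM, DESC, '"a description"@en')]

to_delete, to_insert = diff_triples(old, new)
update = build_update("http://example.org/graph/item1", to_delete, to_insert)
```

The unchanged description triple never appears in the request, so the
payload grows with the size of the change rather than the size of the item.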

The live parallel to such regular updates is DBpedia-Live, which started from a 
static load of dump files, and has been (and is still) continuously updated by 
an RDF feed based on the Wikipedia update firehose.  The same RDF feed is made
available to users of our AMI-based DBpedia-Live mirror (currently being
refreshed, and soon to be made available for new users) --

   https://aws.amazon.com/marketplace/pp/B012DSCFEK


>> Virtuoso handles both shared-nothing clusters and replication i.e., you can 
>> have a cluster configuration used in conjunction with a replication topology 
>> if your solution requires that.
> 
> Replication could certainly be useful, I think, if it's faster to update a
> single server and then replicate than to simultaneously update all servers
> (that's what is happening now).

There are multiple Replication strategies which might be used, as well as 
multiple Replication Cluster topologies which might be considered, and none of 
them is inherently the fastest.

That said, periodic monolithic replication of an entire dataset or DB would 
certainly not be faster than propagation of DIFFs from the master to the 
replica(s).  Replication via periodic cumulative DIFFs *may* be faster than 
incremental DIFFs that are dispatched after every change, but this depends on 
many variables.
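
One of those variables is how often the same triples change between
replication cycles. The sketch below is illustrative only: it assumes each
DIFF is a (deleted, inserted) pair of triple sets that is valid against the
state before it, and the function name and data shape are hypothetical
rather than a description of how Virtuoso replication is implemented. It
folds a run of incremental DIFFs into one cumulative DIFF, cancelling
changes that undo one another:

```python
def coalesce(diffs):
    """Fold ordered (deleted, inserted) set pairs into one net DIFF."""
    net_deleted, net_inserted = set(), set()
    for deleted, inserted in diffs:
        for triple in deleted:
            if triple in net_inserted:
                # Added earlier in this window, now removed: cancels out.
                net_inserted.discard(triple)
            else:
                net_deleted.add(triple)
        for triple in inserted:
            if triple in net_deleted:
                # Removed earlier in this window, now restored: cancels out.
                net_deleted.discard(triple)
            else:
                net_inserted.add(triple)
    return net_deleted, net_inserted


A = ("ex:item1", "ex:label", '"A"')
B = ("ex:item1", "ex:label", '"B"')
C = ("ex:item1", "ex:label", '"C"')

# Three incremental DIFFs: the label goes A -> B -> C -> A.
round_trip = coalesce([({A}, {B}), ({B}, {C}), ({C}, {A})])

# Two incremental DIFFs: the label goes A -> B -> C.
two_steps = coalesce([({A}, {B}), ({B}, {C})])
```

In the first case the cumulative DIFF is empty, and in the second it carries
one deletion and one insertion instead of two of each.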

This page of cluster topology diagrams starts with Replication-only and 
progresses to Elastic-only.  (There are no illustrations of a combined 
Replicating-Elastic-Cluster on this page.)

   http://vos.openlinksw.com/owiki/wiki/VOS/VirtClusteringDiagrams

Any Replication Cluster topology and methodology -- including zero Replication 
-- may be combined with an Elastic (Shared-Nothing) Cluster setup.  Generally 
speaking, when these are combined, an entire Elastic Cluster would take the 
place of each Single-Server Instance in a given Replication topology.

I hope this helps your understanding of the available options.

Ted



--
A: Yes.                          http://www.idallen.com/topposting.html
| Q: Are you sure?           
| | A: Because it reverses the logical flow of conversation.
| | | Q: Why is top posting frowned upon?

Ted Thibodeau, Jr.           //               voice +1-781-273-0900 x32
Senior Support & Evangelism  //        mailto:tthibod...@openlinksw.com
                             //              http://twitter.com/TallTed
OpenLink Software, Inc.      //              http://www.openlinksw.com/
         20 Burlington Mall Road, Suite 322, Burlington MA 01803
     Weblog    -- http://www.openlinksw.com/blogs/
     Community -- https://community.openlinksw.com/
     LinkedIn  -- http://www.linkedin.com/company/openlink-software/
     Twitter   -- http://twitter.com/OpenLink
     Facebook  -- http://www.facebook.com/OpenLinkSoftware
Universal Data Access, Integration, and Management Technology Providers


