Re: [Wikidata] Scaling Wikidata Query Service

2019-06-26 Thread Eric Prud'hommeaux
On Mon, Jun 17, 2019 at 09:41:51PM +0200, Finn Aarup Nielsen wrote:
> 
> Changing the subject a bit:
> 
> I am surprised to see how many SPARQL requests go to the endpoint when
> performing a ShEx validation with the shex-simple Toolforge tool. They are
> all very simple and complete quickly. For each Wikidata item tested, one of
> our tests [1] makes tens of requests. That is, testing 100 Wikidata items
> may yield thousands of requests to the endpoint in rapid succession.
> 
> I suppose that given the simple SPARQL queries, these kinds of requests
> might not load WDQS very much.

It's true; they require no joins and are designed to be answerable by
looking only at the index. That said, given that they impose virtually
no load, running them with API access to Blazegraph's getStatements() [2]
would make validation thousands of times faster and eliminate parsing
and query-planning time on the SPARQL server.
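
For illustration, a minimal sketch of that approach against RDF4J's
RepositoryConnection API (a MemoryStore stands in here for the
Blazegraph-backed repository the server would use; Q42/P31 are just
example terms, and this is my sketch, not Eric's tool):

import org.eclipse.rdf4j.model.IRI;
import org.eclipse.rdf4j.model.Statement;
import org.eclipse.rdf4j.model.ValueFactory;
import org.eclipse.rdf4j.model.impl.SimpleValueFactory;
import org.eclipse.rdf4j.repository.Repository;
import org.eclipse.rdf4j.repository.RepositoryConnection;
import org.eclipse.rdf4j.repository.RepositoryResult;
import org.eclipse.rdf4j.repository.sail.SailRepository;
import org.eclipse.rdf4j.sail.memory.MemoryStore;

public class GetStatementsProbe {
    public static void main(String[] args) {
        // Stand-in for the store; on the server this would be the
        // Blazegraph-backed repository.
        Repository repo = new SailRepository(new MemoryStore());
        repo.init();
        ValueFactory vf = SimpleValueFactory.getInstance();
        IRI item = vf.createIRI("http://www.wikidata.org/entity/Q42");
        IRI p31 = vf.createIRI("http://www.wikidata.org/prop/direct/P31");
        try (RepositoryConnection conn = repo.getConnection()) {
            // Same answer as "SELECT ?o WHERE { wd:Q42 wdt:P31 ?o }", but
            // resolved as a direct index lookup: no SPARQL parsing, no
            // query planning.
            try (RepositoryResult<Statement> stmts =
                     conn.getStatements(item, p31, null, false)) {
                while (stmts.hasNext()) {
                    System.out.println(stmts.next().getObject());
                }
            }
        }
        repo.shutDown();
    }
}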


> [1] 
> https://tools.wmflabs.org/shex-simple/wikidata/packages/shex-webapp/doc/shex-simple.html?data=Endpoint:%20https://query.wikidata.org/sparql=[]=%2F%2Fwww.wikidata.org%2Fwiki%2FSpecial%3AEntitySchemaText%2FE65
[2] 
https://www.programcreek.com/java-api-examples/?class=org.eclipse.rdf4j.repository.RepositoryConnection=getStatements

> Finn
> http://people.compute.dtu.dk/faan/
> 



Re: [Wikidata] Scaling Wikidata Query Service

2019-06-25 Thread Ted Thibodeau Jr

On Jun 17, 2019, at 03:41 PM, Finn Aarup Nielsen  wrote:
> 
> 
> Changing the subject a bit:

Well... Changing the subject a *lot*, to an extent probably 
worthy of its own subject line, and an entirely new thread,
not least because it seems more relevant to the "shex-simple 
Toolforge tool" you reference than to anything in this thread
about scaling the back-end.

Ted

> I am surprised to see how many SPARQL requests go to the endpoint when 
> performing a ShEx validation with the shex-simple Toolforge tool. They are 
> all very simple and complete quickly. For each Wikidata item tested, one of 
> our tests [1] makes tens of requests. That is, testing 100 Wikidata items may 
> yield thousands of requests to the endpoint in rapid succession.
> 
> I suppose that given the simple SPARQL queries, these kinds of requests might 
> not load WDQS very much.
> 
> 
> [1] 
> https://tools.wmflabs.org/shex-simple/wikidata/packages/shex-webapp/doc/shex-simple.html?data=Endpoint:%20https://query.wikidata.org/sparql=[]=%2F%2Fwww.wikidata.org%2Fwiki%2FSpecial%3AEntitySchemaText%2FE65





--
A: Yes.  http://www.idallen.com/topposting.html
| Q: Are you sure?   
| | A: Because it reverses the logical flow of conversation.
| | | Q: Why is top posting frowned upon?

Ted Thibodeau, Jr.   //   voice +1-781-273-0900 x32
Senior Support & Evangelism  //mailto:tthibod...@openlinksw.com
 //  http://twitter.com/TallTed
OpenLink Software, Inc.  //  http://www.openlinksw.com/
 20 Burlington Mall Road, Suite 322, Burlington MA 01803
 Weblog-- http://www.openlinksw.com/blogs/
 Community -- https://community.openlinksw.com/
 LinkedIn  -- http://www.linkedin.com/company/openlink-software/
 Twitter   -- http://twitter.com/OpenLink
 Facebook  -- http://www.facebook.com/OpenLinkSoftware
Universal Data Access, Integration, and Management Technology Providers








Re: [Wikidata] Scaling Wikidata Query Service

2019-06-22 Thread Thad Guidry
In the enterprise where I work as a Data Architect, we approach scaling in
many ways, but there's no question that the age-old technique of SORTING
lines everything up for systems and CPUs to massively ingest and pipeline
across IO boundaries.  Sometimes this involves more indices, lots of
duplicated data, and scatter/gather techniques.  WHAT to sort and HOW to
sort will vary widely with the queries that a system is expected to serve
well.  So sorted indices of many kinds (where data is duplicated) are
necessary to achieve extremely fast IO for a broad database such as
Wikidata.
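
As a toy illustration of that duplicate-and-sort idea (a sketch of the
general technique, not anything Wikidata actually runs): keep the same
triples in three sort orders (SPO, POS, OSP) so that a pattern with a
bound first term becomes a contiguous range scan instead of a full scan.
Real stores encode keys as bytes; a sorted set shows the access pattern.

import java.util.TreeSet;

public class PermutationIndexes {
    static final char SEP = '\u0000'; // sorts before every other character

    static final TreeSet<String> spo = new TreeSet<>();
    static final TreeSet<String> pos = new TreeSet<>();
    static final TreeSet<String> osp = new TreeSet<>();

    // Duplicate every triple into three sorted permutations.
    static void add(String s, String p, String o) {
        spo.add(s + SEP + p + SEP + o);
        pos.add(p + SEP + o + SEP + s);
        osp.add(o + SEP + s + SEP + p);
    }

    public static void main(String[] args) {
        add("Q42", "P31", "Q5");      // Douglas Adams - instance of - human
        add("Q42", "P106", "Q36180"); // Douglas Adams - occupation - writer
        add("Q64", "P31", "Q515");    // Berlin - instance of - city

        // Pattern (Q42, ?p, ?o): a contiguous range scan of the SPO order.
        String prefix = "Q42" + SEP;
        for (String key : spo.subSet(prefix, prefix + '\uffff')) {
            System.out.println(key.replace(SEP, ' '));
        }
    }
}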

Scaling problems can be categorized in a few buckets:
1. Increase of data queries. (READ)
2. Increase of data writes.  (WRITE)
3. There is no 3.  Because all scale problems boil down to IO (READ/WRITE)
and how you approach fast IO.

Google is known to replicate data at different levels of abstraction
(metaschema, indices, meta-relations) across entire regions of the world in
order to achieve fast IO.  With a nearly unlimited budget they can certainly
MOVE THINGS FAST, and can afford to be extremely wasteful and smart with
data replication techniques.

IBM approaches the scale problem via Polymorphic stores that support
multiple indices and db structures, both in-memory and graph-like.
Essentially, duplicating the hell out of the data in many, many ways and
trading space and memory for extremely high performance on queries.
https://queue.acm.org/detail.cfm?id=3332266

Juan Sequeda (now at data.world) and his team at Capsenta also seem to use
polymorphic storage to bridge SPARQL and relational DBs.  I'm unsure of
the actual architecture but would love to hear more about it; I've
followed Juan for some time.
https://www.zdnet.com/article/data-world-joins-forces-with-capsenta-to-bring-knowledge-graph-based-data-management-and-consumer-grade-ui-to-the-enterprise/


It is unfortunate that Wikidata doesn't have the hardware resources to
duplicate and sort data in myriad ways to achieve better scale.  On the
software side, we all know what the capabilities of the various stacks are,
but we often don't have the "time" or "hardware" to truly flex the
"software" stack's muscles and allow fast IO.

Thad
https://www.linkedin.com/in/thadguidry/


Re: [Wikidata] Scaling Wikidata Query Service

2019-06-22 Thread Marco Neumann
Thibaut, while it's certainly exciting to see continued work on the
development of storage solutions, and hybrids are most likely part of the
future story here, I'd also want to stay as close as possible to existing
Semantic Web / Linked Data standards like RDF and SPARQL to
guarantee interop and extensibility, no matter what mix of underlying
tech is deployed under the hood.

On Fri, Jun 21, 2019 at 11:56 PM Thibaut DEVERAUX <
thibaut.dever...@gmail.com> wrote:

> Dear,
>
> I've seen this suggestion on Quora:
>
> https://www.quora.com/Wouldnt-a-mix-database-system-that-handle-both-JSON-documents-and-graph-functions-like-ArangoDB-provide-a-better-scalability-to-enormous-knowledge-graphs-like-Wikidata-than-a-classical-quadstore
>
>
> I'm not qualified enough to know if it is relevant, but this could be
> some brainstorming material.
>
> Regards
>
>
>
> On Wed, Jun 19, 2019 at 19:45, Finn Aarup Nielsen wrote:
>
>>
>> Changing the subject a bit:
>>
>> I am surprised to see how many SPARQL requests go to the endpoint when
>> performing a ShEx validation with the shex-simple Toolforge tool. They
>> are all very simple and complete quickly. For each Wikidata item tested,
>> one of our tests [1] makes tens of requests. That is, testing 100
>> Wikidata items may yield thousands of requests to the endpoint in rapid
>> succession.
>>
>> I suppose that given the simple SPARQL queries, these kinds of requests
>> might not load WDQS very much.
>>
>>
>> [1]
>>
>> https://tools.wmflabs.org/shex-simple/wikidata/packages/shex-webapp/doc/shex-simple.html?data=Endpoint:%20https://query.wikidata.org/sparql=[]=%2F%2Fwww.wikidata.org%2Fwiki%2FSpecial%3AEntitySchemaText%2FE65
>>
>>
>> Finn
>> http://people.compute.dtu.dk/faan/
>>


-- 
Marco Neumann
KONA


Re: [Wikidata] Scaling Wikidata Query Service

2019-06-21 Thread Thibaut DEVERAUX
Dear,

I've seen this suggestion on Quora:
https://www.quora.com/Wouldnt-a-mix-database-system-that-handle-both-JSON-documents-and-graph-functions-like-ArangoDB-provide-a-better-scalability-to-enormous-knowledge-graphs-like-Wikidata-than-a-classical-quadstore


I'm not qualified enough to know if it is relevant, but this could be
some brainstorming material.

Regards



On Wed, Jun 19, 2019 at 19:45, Finn Aarup Nielsen wrote:

>
> Changing the subject a bit:
>
> I am surprised to see how many SPARQL requests go to the endpoint when
> performing a ShEx validation with the shex-simple Toolforge tool. They
> are all very simple and complete quickly. For each Wikidata item tested,
> one of our tests [1] makes tens of requests. That is, testing 100
> Wikidata items may yield thousands of requests to the endpoint in rapid
> succession.
>
> I suppose that given the simple SPARQL queries, these kinds of requests
> might not load WDQS very much.
>
>
> [1]
>
> https://tools.wmflabs.org/shex-simple/wikidata/packages/shex-webapp/doc/shex-simple.html?data=Endpoint:%20https://query.wikidata.org/sparql=[]=%2F%2Fwww.wikidata.org%2Fwiki%2FSpecial%3AEntitySchemaText%2FE65
>
>
> Finn
> http://people.compute.dtu.dk/faan/
>


Re: [Wikidata] Scaling Wikidata Query Service

2019-06-19 Thread Finn Aarup Nielsen


Changing the subject a bit:

I am surprised to see how many SPARQL requests go to the endpoint when 
performing a ShEx validation with the shex-simple Toolforge tool. They 
are all very simple and complete quickly. For each Wikidata item tested, 
one of our tests [1] makes tens of requests. That is, testing 100 
Wikidata items may yield thousands of requests to the endpoint in rapid 
succession.


I suppose that given the simple SPARQL queries, these kinds of requests 
might not load WDQS very much.



[1] 
https://tools.wmflabs.org/shex-simple/wikidata/packages/shex-webapp/doc/shex-simple.html?data=Endpoint:%20https://query.wikidata.org/sparql=[]=%2F%2Fwww.wikidata.org%2Fwiki%2FSpecial%3AEntitySchemaText%2FE65



Finn
http://people.compute.dtu.dk/faan/



Re: [Wikidata] Scaling Wikidata Query Service

2019-06-17 Thread Stas Malyshev
Hi!

> The documented limits for FDB state that it supports up to 100TB of
> data. That is 100 times more than what WDQS needs at the moment.

"Support" is such a multi-faceted word. It can mean "it works very well
with such an amount of data and is faster than the alternatives" or "it is
guaranteed not to break up to this number but breaks after it" or "it
would work, given massive amounts of memory, super-fast hardware and a
very specific set of queries, but you'd really have to make an effort to
get it to work" and everything in between. The devil is always in the
details, which this seemingly simple word "supports" is rife with.

> I am offering my full-time services, it is up to you to decide what will
> happen.

I wish you luck with the grant, though I personally think that expecting
to have a production-ready service that can replace WDQS in 6 months is
a bit too optimistic. I might be completely wrong on this, of course. If
you just plan to load the Wikidata data set and evaluate the queries to
ensure they are fast and produce proper results on the setup you propose,
then it can be done. Good luck!
-- 
Stas Malyshev
smalys...@wikimedia.org



Re: [Wikidata] Scaling Wikidata Query Service

2019-06-17 Thread Ted Thibodeau Jr
Hello, Stas --

On Jun 13, 2019, at 07:52 PM, Stas Malyshev  wrote:
> 
> Hi!
> 
>> It handles data locality across a shared nothing cluster just fine i.e., you 
>> can interact with any node in a Virtuoso cluster and experience identical 
>> behavior (every node looks like a single node in the eyes of the operator).
> 
> Does this mean no sharding, i.e. each server stores the full DB?

No.

The full DB is automatically sharded across all Virtuoso instances in an 
Elastic Cluster, and each instance *appears* to store the full DB -- i.e., you 
can issue a query to any instance in an Elastic Cluster, if you have the 
relevant communication details (typically IP address and port number), and you 
will get the same results from it as from any other instance in that Elastic 
Cluster.

(I am generally specific about Elastic Cluster vs Replication Cluster, because 
these are different though complementary technologies, implemented via 
different Modules in Virtuoso.)


> This is the model we're using currently, but given the growth of the data it 
> may not be sustainable on current hardware. I see in your tables that Uniprot 
> has about 30B triples, but I wonder what the update loads look like there. Our 
> main issue is that the hardware we have now is showing its limits when 
> there's a lot of updates in parallel to significant query load. So I wonder 
> if the "single server holds everything" model is sustainable in the long term.

Your questions are unsurprising, and are one of the reasons for the benchmark 
efforts of the LDBC --

   http://ldbcouncil.org/benchmarks/

Uniprot does not get a lot of updates, and it is running on a single instance 
-- i.e., there's no cluster involved at all, neither Elastic (Shared-Nothing) 
Cluster nor Replication Cluster -- so it's probably not the best example for 
your workflows.

I think the LDBC's Social Networking Benchmark (SNB) is likely to be the 
closest to the Wikidata update and query patterns, so you may find these 
articles interesting --

1. SNB Interactive, Part 1: What is SNB Interactive Really About?
   https://virtuoso.openlinksw.com/blog/vdb/blog/?id=1835

2. SNB Interactive, Part 2: Modeling Choices
   https://virtuoso.openlinksw.com/blog/vdb/blog/?id=1837

3. SNB Interactive, Part 3: Choke Points and Initial Run on Virtuoso
   https://virtuoso.openlinksw.com/blog/vdb/blog/?id=1842



>> There are live instances of Virtuoso that demonstrate its capabilities. If 
>> you want to explore shared-nothing cluster capabilities then our live LOD 
>> Cloud cache is the place to start [1][2][3]. If you want to see the 
>> single-server open source edition then you have DBpedia, DBpedia-Live, 
>> Uniprot and many other nodes in the LOD Cloud to choose from. All of these 
>> instances are highly connected.
> 
> Again, here the question is not so much "can you load 7bn triples into 
> Virtuoso" - we know we can. What we want to figure out is whether, given the 
> specific query/update patterns we have now, it is going to give us 
> significantly better performance, allowing us to support our projected 
> growth. And also possibly whether Virtuoso has ways to make our update 
> workflow more optimal - e.g. right now if one triple changes in a Wikidata 
> item, we're essentially downloading and updating the whole item (not exactly, 
> since triples that stay the same are preserved, but it requires a lot of data 
> transfer to express that in SPARQL). Would there be ways to update things 
> more efficiently?

The first thing that will improve your performance is to break out of the 
"stored as JSON blobs" pattern you've been using.

Updates should not require a full download of the named graph (which I think is 
what your JSON Blobs amount to) followed by an upload of the entire revised 
named graph.

Even if you *query* the full content of an existing named graph, determine the 
necessary changes locally, and then submit an update query which includes a 
full set of DELETE + INSERT statements (this "full set" only including the 
*changed* triples), you should find a significant reduction in data throughput.
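
For illustration, a minimal sketch of such a delta update through RDF4J;
the endpoint URL and the label change are hypothetical, and this is not
Wikidata's actual update pipeline:

import org.eclipse.rdf4j.repository.RepositoryConnection;
import org.eclipse.rdf4j.repository.sparql.SPARQLRepository;

public class DeltaUpdate {
    public static void main(String[] args) {
        // Hypothetical SPARQL 1.1 Update endpoint.
        SPARQLRepository repo =
            new SPARQLRepository("http://localhost:9999/sparql");
        repo.init();
        // Only the changed triples travel over the wire, not the whole
        // named graph: one DELETE DATA plus one INSERT DATA.
        String delta =
            "PREFIX wd: <http://www.wikidata.org/entity/>\n" +
            "PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>\n" +
            "DELETE DATA { wd:Q42 rdfs:label \"Old label\"@en } ;\n" +
            "INSERT DATA { wd:Q42 rdfs:label \"New label\"@en }";
        try (RepositoryConnection conn = repo.getConnection()) {
            conn.prepareUpdate(delta).execute();
        }
        repo.shutDown();
    }
}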

The live parallel to such regular updates is DBpedia-Live, which started from a 
static load of dump files, and has been (and is still) continuously updated by 
an RDF feed based on the Wikipedia update firehose.  The same RDF feed is made 
available to users of our DBpedia-Live mirror AMI (currently being 
refreshed, and soon to be made available for new users) --

   https://aws.amazon.com/marketplace/pp/B012DSCFEK


>> Virtuoso handles both shared-nothing clusters and replication i.e., you can 
>> have a cluster configuration used in conjunction with a replication topology 
>> if your solution requires that.
> 
> Replication could certainly be useful; I think it's faster to update a 
> single server and then replicate than to simultaneously update all servers 
> (that's what is happening now).

There are multiple Replication strategies which might be used, as 

Re: [Wikidata] Scaling Wikidata Query Service

2019-06-17 Thread Sebastian Hellmann

Hi Amirouche,

On 16.06.19 23:01, Amirouche Boubekki wrote:
> On Wed, Jun 12, 2019 at 19:27, Amirouche Boubekki
> <amirouche.boube...@gmail.com> wrote:
>
>> Hello Sebastian,
>>
>> First thanks a lot for the reply. I started to believe that what I
>> was saying was complete nonsense.
>>
>> On Wed, Jun 12, 2019 at 16:51, Sebastian Hellmann
>> <hellm...@informatik.uni-leipzig.de> wrote:
>>
>>> Hi Amirouche,
>>>
>>> Any open data projects that are running open databases with
>>> FoundationDB and WiredTiger? Where can I query them?
>>
>> Thanks for asking. I will set up a WiredTiger instance of
>> Wikidata. I need a few days, maybe a week (or two :)).
>>
>> I could set up FoundationDB on a single machine instead but it will
>> require more time (maybe one more week).
>>
>> Also, it will not support geo-queries. I will try to make
>> labelling work but with a custom syntax (inspired by SPARQL).
>
> I figured that anything that is not SPARQL will not be convincing.
> Getting my engine 100% compatible is a lot of work.
>
> The example deployment I have given in the previous message should be
> enough to convince you that FoundationDB can store WDQS.


Don't get me wrong, I don't want you to set it up. I am asking about a
reference project that has:

1. open data and an open database

2. a decent amount of data

3. several years of running it.

Like OpenStreetMap and PostgreSQL, MediaWiki/Wikipedia -> MySQL, DBpedia
-> Virtuoso.

That would be a very good point in its favor. Otherwise I would consider it
a sales trap, i.e. some open source which does not really work until you
switch to the commercial product; same for Neptune.

Right now I think only Apple knows how to use it. Any other reference
projects?


--
All the best,
Sebastian Hellmann

Director of Knowledge Integration and Linked Data Technologies (KILT) 
Competence Center

at the Institute for Applied Informatics (InfAI) at Leipzig University
Executive Director of the DBpedia Association
Projects: http://dbpedia.org, http://nlp2rdf.org, 
http://linguistics.okfn.org, https://www.w3.org/community/ld4lt 


Homepage: http://aksw.org/SebastianHellmann
Research Group: http://aksw.org


Re: [Wikidata] Scaling Wikidata Query Service

2019-06-16 Thread Amirouche Boubekki
Hello Sebastian and Stas,

On Wed, Jun 12, 2019 at 19:27, Amirouche Boubekki <
amirouche.boube...@gmail.com> wrote:

> Hello Sebastian,
>
> First thanks a lot for the reply. I started to believe that what I was
> saying was complete nonsense.
>
> On Wed, Jun 12, 2019 at 16:51, Sebastian Hellmann <
> hellm...@informatik.uni-leipzig.de> wrote:
>
>> Hi Amirouche,
>>
>> Any open data projects that are running open databases with FoundationDB
>> and WiredTiger? Where can I query them?
>>
>
> Thanks for asking. I will set up a WiredTiger instance of Wikidata. I need
> a few days, maybe a week (or two :)).
>
> I could set up FoundationDB on a single machine instead but it will require
> more time (maybe one more week).
>
> Also, it will not support geo-queries. I will try to make labelling work
> but with a custom syntax (inspired by SPARQL).
>

I figured that anything that is not SPARQL will not be convincing. Getting
my engine 100% compatible is a lot of work.

The example deployment I have given in the previous message should be
enough to convince you that FoundationDB can store WDQS.

The documented limits for FDB state that it supports up to 100TB of
data. That is 100 times more than what WDQS needs at the moment.

Anyway, I updated my proposal in the wiki to support the Wikimedia
Foundation's transition to a new solution, to reflect the new requirements:
the required space was reduced from 12T SSD to 6T SSD, based on an FDB
forum topic and an optimisation I will make in my engine. That proposal is
biased toward getting an FDB prototype. It could be reworked to emphasize
the fact that a benchmarking tool must be put together to be able to tell
which solution is best.

My estimations might be off, especially the 1 month of GCP credits.

To be honest, WDQS is low-hanging fruit compared to the goal of building
a portable Wikidata.

I am offering my full-time services, it is up to you to decide what will
happen.


Re: [Wikidata] Scaling Wikidata Query Service

2019-06-14 Thread Kingsley Idehen
On 6/13/19 7:55 PM, Stas Malyshev wrote:
> Hi!
>
>> Data living in an RDBMS engine distinct from Virtuoso is handled via the
>> engine's Virtual Database module i.e., you can build powerful RDF Views
>> over ODBC- or JDBC-accessible data using Virtuoso. These views also have
>> the option of being materialized etc..
> Yes, but the way the data are stored now is a JSON blob within a text
> field in MySQL. I do not see how an RDF View over ODBC would help it any -
> of course Virtuoso would be able to fetch JSON text for a single item,
> but then what? We'd need to run queries across millions of items,
> fetching and parsing JSON for every one of them every time is
> unfeasible. Not to mention this JSON is not an accurate representation
> of the RDF data model. So I don't think it is worth spending time in
> this direction... I just don't see how any query engine could work with
> that storage.
> -- Stas Malyshev smalys...@wikimedia.org


The point I am trying to make is that Virtuoso can integrate data from
external DBMS systems in a variety of ways.

ODBC and JDBC are simply APIs for accessing external DBMS systems.

What you really need here is a clear project definition and a discussion
with us about how it would be implemented.

Despite the fact that Virtuoso is a hardcore DBMS, it's also a hardcore
Data Virtualization platform for handling relations represented in a
variety of ways using a plethora of protocols.

I am an email away if you want to explore this further.

-- 
Regards,

Kingsley Idehen   
Founder & CEO 
OpenLink Software   
Home Page: http://www.openlinksw.com
Community Support: https://community.openlinksw.com
Weblogs (Blogs):
Company Blog: https://medium.com/openlink-software-blog
Virtuoso Blog: https://medium.com/virtuoso-blog
Data Access Drivers Blog: 
https://medium.com/openlink-odbc-jdbc-ado-net-data-access-drivers

Personal Weblogs (Blogs):
Medium Blog: https://medium.com/@kidehen
Legacy Blogs: http://www.openlinksw.com/blog/~kidehen/
  http://kidehen.blogspot.com

Profile Pages:
Pinterest: https://www.pinterest.com/kidehen/
Quora: https://www.quora.com/profile/Kingsley-Uyi-Idehen
Twitter: https://twitter.com/kidehen
Google+: https://plus.google.com/+KingsleyIdehen/about
LinkedIn: http://www.linkedin.com/in/kidehen

Web Identities (WebID):
Personal: http://kingsley.idehen.net/public_home/kidehen/profile.ttl#i
: 
http://id.myopenlink.net/DAV/home/KingsleyUyiIdehen/Public/kingsley.ttl#this





Re: [Wikidata] Scaling Wikidata Query Service

2019-06-13 Thread Stas Malyshev
Hi!

> Data living in an RDBMS engine distinct from Virtuoso is handled via the
> engine's Virtual Database module i.e., you can build powerful RDF Views
> over ODBC- or JDBC-accessible data using Virtuoso. These views also have
> the option of being materialized etc..

Yes, but the way the data are stored now is a JSON blob within a text
field in MySQL. I do not see how an RDF View over ODBC would help it any -
of course Virtuoso would be able to fetch JSON text for a single item,
but then what? We'd need to run queries across millions of items,
fetching and parsing JSON for every one of them every time is
unfeasible. Not to mention this JSON is not an accurate representation
of the RDF data model. So I don't think it is worth spending time in
this direction... I just don't see how any query engine could work with
that storage.
-- 
Stas Malyshev
smalys...@wikimedia.org



Re: [Wikidata] Scaling Wikidata Query Service

2019-06-13 Thread Stas Malyshev
Hi!

> It handles data locality across a shared nothing cluster just fine i.e.,
> you can interact with any node in a Virtuoso cluster and experience
> identical behavior (every node looks like a single node in the eyes of
> the operator).

Does this mean no sharding, i.e. each server stores the full DB? This is
the model we're using currently, but given the growth of the data it may
not be sustainable on current hardware. I see in your tables that
Uniprot has about 30B triples, but I wonder what the update loads look
like there. Our main issue is that the hardware we have now is showing its
limits when there's a lot of updates in parallel to significant query
load. So I wonder if the "single server holds everything" model is
sustainable in the long term.

> There are live instances of Virtuoso that demonstrate its capabilities.
> If you want to explore shared-nothing cluster capabilities then our live
> LOD Cloud cache is the place to start [1][2][3]. If you want to see the
> single-server open source edition then you have DBpedia, DBpedia-Live,
> Uniprot and many other nodes in the LOD Cloud to choose from. All of
> these instances are highly connected.

Again, here the question is not so much "can you load 7bn triples
into Virtuoso" - we know we can. What we want to figure out is whether,
given the specific query/update patterns we have now, it is going to give
us significantly better performance, allowing us to support our projected
growth.
And also possibly whether Virtuoso has ways to make our update workflow
more optimal - e.g. right now if one triple changes in a Wikidata item,
we're essentially downloading and updating the whole item (not exactly,
since triples that stay the same are preserved, but it requires a lot of
data transfer to express that in SPARQL). Would there be ways to update
things more efficiently?

> Virtuoso handles both shared-nothing clusters and replication i.e., you
> can have a cluster configuration used in conjunction with a replication
> topology if your solution requires that.

Replication could certainly be useful; I think it's faster to update a
single server and then replicate than to simultaneously update all servers
(that's what is happening now).

-- 
Stas Malyshev
smalys...@wikimedia.org



Re: [Wikidata] Scaling Wikidata Query Service

2019-06-13 Thread Stas Malyshev
Hi!

> Unlike most sites, we do have our own custom frontend in front of
> Virtuoso. We did this to allow more styling, as well as to be flexible
> and change implementations at our whim, e.g. we double-parse the SPARQL
> queries and even rewrite some to be friendlier. I suggest you do the
> same no matter which DB you use in the end, and we would be willing to
> open-source ours (it is in Java, and uses RDF4J and some ugly JSPX, but
> it works; if not to use, then at least as an inspiration). We did this to
> avoid being locked into endpoint-specific features.

It would be interesting to know more about this, if this is open source.
Is there any more information about it online?

> Pragmatically, while WDQS is a graph database, the queries are actually
> very relational. And none of the standard graph algorithms are used. To

If you mean algorithms like A* or PageRank, then yes, they are not used
much (likely also because SPARQL has no standard support for any of
these), though Blazegraph implements some of them as custom services.

> be honest RDF is actually a relational system, which means that
> relational techniques are very good at answering them. The sole issue is
> recursive queries (e.g. rdfs:subClassOf+), for which the Virtuoso
> implementation is adequate but not great.

Yes, path queries are pretty popular on WDQS too, especially given that
many relationships like administrative/territorial placement or
ownership are hierarchical and transitive, which often requires path
queries.
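
For illustration, a sketch of the kind of transitive path query meant
here, runnable against the public WDQS endpoint with RDF4J (P131, the
administrative-placement property, walked transitively from Berlin, Q64):

import org.eclipse.rdf4j.query.BindingSet;
import org.eclipse.rdf4j.query.TupleQueryResult;
import org.eclipse.rdf4j.repository.RepositoryConnection;
import org.eclipse.rdf4j.repository.sparql.SPARQLRepository;

public class PathQuery {
    public static void main(String[] args) {
        SPARQLRepository repo =
            new SPARQLRepository("https://query.wikidata.org/sparql");
        repo.init();
        // "+" makes the engine follow P131 edges transitively, i.e. every
        // administrative unit Berlin is directly or indirectly located in.
        String q =
            "PREFIX wd: <http://www.wikidata.org/entity/>\n" +
            "PREFIX wdt: <http://www.wikidata.org/prop/direct/>\n" +
            "SELECT ?parent WHERE { wd:Q64 wdt:P131+ ?parent }";
        try (RepositoryConnection conn = repo.getConnection();
             TupleQueryResult result = conn.prepareTupleQuery(q).evaluate()) {
            while (result.hasNext()) {
                BindingSet bs = result.next();
                System.out.println(bs.getValue("parent"));
            }
        }
        repo.shutDown();
    }
}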

> This is why recovering physical schemata from RDF data is such a
> powerful optimization technique [1]. i.e. you tend to do joins not
> traversals. This is not always true but I strongly suspect it will hold
> for the vast majority of the Wikidata Query Service case.

Would be interesting to see if we can apply anything from the article.
Thanks for the link!

-- 
Stas Malyshev
smalys...@wikimedia.org



Re: [Wikidata] Scaling Wikidata Query Service

2019-06-13 Thread Kingsley Idehen
On 6/12/19 1:11 PM, Stas Malyshev wrote:
>> That will be vendor lock-in for Wikidata and Wikimedia, along with all the
>> poor souls that try to interop with it.
> Since Virtuoso is using standard SPARQL, it won't be too much of a
> vendor lock in, though of course the standard does not cover all, so
> some corners are different in all SPARQL engines. This is why even
> migration between SPARQL engines, even excluding operational aspects, is
> non-trivial. Of course, migration to any non-SPARQL engine would be an
> order of magnitude more disruptive, so right now we do not seriously
> consider doing that.
>

Hi Stas,

Yes, Virtuoso supports the W3C SPARQL and ANSI SQL standards. The most
important aspect of Virtuoso's design and vision boils down to using
open standards on the front- and back-ends to enable maximum flexibility
for its users.

There is nothing more important to us than open standards. For instance,
we even extend SQL using SPARQL before entering the realm of
non-standard extensions.


-- 
Regards,

Kingsley Idehen
Founder & CEO, OpenLink Software
Home Page: http://www.openlinksw.com





Re: [Wikidata] Scaling Wikidata Query Service

2019-06-12 Thread Amirouche Boubekki
On Wed, Jun 12, 2019 at 19:11, Stas Malyshev wrote:

> Hi!
>
> >> So there needs to be some smarter solution, one that we'd be unlikely
> >> to develop in-house
> >
> > Big cat, small fish. As Wikidata continues to grow, it will have specific
> > needs.
> > Needs that are unlikely to be solved by off-the-shelf solutions.
>
> Here I think it's a good place to remind that we're not Google, and
> developing a new database engine in-house is probably a bit beyond our
> resources and budgets.


Today, the problem is not the same as the one MySQL, PostgreSQL, Blazegraph
and OpenLink had when they started working on their respective databases.
See below.


> Fitting an existing solution to our goals - sure, but developing
> something new of that scale is probably not going to happen.

It will.

> > FoundationDB and WiredTiger are respectively used at Apple (among other
> > companies) and MongoDB since 3.2 all over the world. WiredTiger is also
> > used at Amazon.
>
> I believe they are, but I think for our particular goals we have to
> limit ourselves to a set of solutions that are a proven good match for
> our case.
>

See the other mail I just sent. We are at a turning point in database
engineering history. The latest database systems that were built are all
based on ordered key-value stores; see the Google Spanner paper [0].

Thanks to WT/MongoDB and Apple, those are readily available, in widespread
use and fully open source. It is only missing a few pieces to make it work
in a fully backward-compatible way with WDQS (at scale).

[0] https://ai.google/research/pubs/pub39966


> > That will be vendor lock-in for Wikidata and Wikimedia, along with all
> > the poor souls that try to interop with it.
>
> Since Virtuoso is using standard SPARQL, it won't be too much of a
> vendor lock in, though of course the standard does not cover all, so
> some corners are different in all SPARQL engines.


There is a big chance that the same thing that happened with the WWW will
happen with RDF. That is, one big player owning all the implementations.


> This is why even migration between SPARQL engines, even excluding
> operational aspects, is non-trivial.

I agree.


> Of course, migration to any non-SPARQL engine would be an order of
> magnitude more disruptive, so right now we do not seriously consider
> doing that.

I also agree.

>
> As I already mentioned, there's a difference between "you can do it" and
> "you can do it efficiently". [...] The tricky part starts when you need to
> run millions
> of queries on 10B triples database. If your backend is not optimal for
> that task, it's not going to perform.
>

I already did small benchmarks against Blazegraph. I will do more intensive
benchmarks using Wikidata (and reduce the requirements in terms of SSD).


Thanks for the reply.


Re: [Wikidata] Scaling Wikidata Query Service

2019-06-12 Thread Amirouche Boubekki
Hello Sebastian,

First thanks a lot for the reply. I started to believe that what I was
saying was complete nonsense.

On Wed, Jun 12, 2019 at 16:51, Sebastian Hellmann <
hellm...@informatik.uni-leipzig.de> wrote:

> Hi Amirouche,
> On 12.06.19 14:07, Amirouche Boubekki wrote:
>
> > > So there needs to be some smarter solution, one that we'd be
> > > unlikely to develop in-house
> >
> > Big cat, small fish. As Wikidata continues to grow, it will have
> > specific needs.
> > Needs that are unlikely to be solved by off-the-shelf solutions.
>
> Are you suggesting to develop the database in-house?
>
Yes! At least part of it. The domain-specific part.

> even MediaWiki uses MySQL
>
Yes, but that is not because of its technical merits. Similarly for PHP.
Historically, PHP and MySQL were easy to set up and easy to use
but otherwise difficult to work with. This is/was painful enough that
nowadays the go-to RDBMS is PostgreSQL, even if MySQL is still
very popular [0][1]. Those are technical reasons. Also, I agree that it is
not because MySQL had no ACID guarantees when it started in
1995 that nowadays it is a bad choice.

[0] https://trends.google.com/trends/explore?q=MySQL,PostgreSQL
[1] https://stackshare.io/stackups/mysql-vs-postgresql

> > > but one that has already been verified by industry experience and
> > > other deployments.
> >
> > FoundationDB and WiredTiger are respectively used at Apple (among other
> > companies) and MongoDB since 3.2 all over the world. WiredTiger is also
> > used at Amazon.
>
> Let's not talk about MongoDB, it is irrelevant and very mixed.
>

I am giving an example deployment of WiredTiger. WiredTiger is an ordered
key-value store that has been the storage engine of MongoDB since 3.2. It
was created by an independent company that MongoDB later acquired. It is
still GPLv2 or v3. Among the founders is one of the engineers who created
bsddb, which Oracle bought. Also, I am not saying WiredTiger solves all
the problems of MongoDB. I am just saying that because WiredTiger has been
the storage backend of MongoDB since 3.2, it has seen widespread usage and
testing.

> Some say it is THE solution for scalability, others have said it was the
> biggest disappointment.

Some people gave warnings about the technical issues of MongoDB before 3.2.
Also, caveat emptor. The situation is better than a few years back. After
all, it has been open source / free software / source-available software
since the beginning.

Like I said above, WiredTiger is not the solution to all problems. I cited
WiredTiger as a possible tool for building a cluster similar to the current
one, where machines have full copies of the data. The advantage of
WiredTiger is that it is easier to set up (compared to a distributed
database) but it still requires fine-tuning / configuration. Also, there
are many other ordered key-value stores in the wild. I have documented
those in this document:

https://github.com/scheme-requests-for-implementation/srfi-167/blob/master/libraries.md

In particular, if WDQS doesn't want to use ACID transactions, there might
be a better solution. Other popular options are LMDB (used in OpenLDAP) and
RocksDB by Facebook (a LevelDB fork). But again, that is ONE possibility;
my design / database works with any of the libraries described in the
above libraries.md url.
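
To make the ordered-key-value idea concrete, here is a minimal sketch (my
illustration, not the actual engine) that lays triples out as FoundationDB
tuple keys so that a lookup pattern becomes a prefix range read; it assumes
a reachable FDB cluster and the official Java bindings:

import com.apple.foundationdb.Database;
import com.apple.foundationdb.FDB;
import com.apple.foundationdb.KeyValue;
import com.apple.foundationdb.Range;
import com.apple.foundationdb.tuple.Tuple;

public class FdbTriples {
    public static void main(String[] args) {
        FDB fdb = FDB.selectAPIVersion(610);
        try (Database db = fdb.open()) {
            // Write one triple under the "spo" permutation; the key IS the
            // data, so the value can stay empty.
            db.run(tr -> {
                tr.set(Tuple.from("spo", "Q42", "P31", "Q5").pack(),
                       new byte[0]);
                return null;
            });
            // Pattern (Q42, ?p, ?o) becomes a prefix range read over the
            // ordered key space - no query planner involved.
            Range range = Range.startsWith(Tuple.from("spo", "Q42").pack());
            db.run(tr -> {
                for (KeyValue kv : tr.getRange(range)) {
                    System.out.println(Tuple.fromBytes(kv.getKey()));
                }
                return null;
            });
        }
    }
}

Writing the same triple under the other permutations ("pos", "osp") is the
space-for-locality duplication discussed earlier in this thread.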

My recommendation for a production cluster is to use FoundationDB,
because it can scale horizontally and provides single / double / triple
replication. If a node is down, writes and reads can still continue if you
have enough machines up.

WiredTiger would be better suited to a single machine (and my database can
support both WiredTiger and FoundationDB with the same code base).

> Do FoundationDB and WiredTiger have any track record for hosting open data
> projects or being chosen by open data projects?

tl;dr: I don't know.

Like I said previously, WiredTiger is used in many contexts; among others,
it is used at Amazon Web Services (AWS).

FoundationDB is used at Apple; I don't remember which services rely on it,
but at least the Data Science team
relies on it. The main contributor did a lightning talk about it:

   Entity Store: A FoundationDB Layer for Versioned Entities with Fine
Grained 

That is the use case that looks the most like Wikidata.

More on the popularity contest: it is used at Wavefront (owned by VMware),
which is an analytics tool.
Here is a talk:

  Running FDB at scale 

JanusGraph has an FDB backend, see the talk:

  The JanusGraph FoundationDB Storage Adapter 

It is also used at Snowflake, which is
apparently a data warehouse; here is the talk:

   How FoundationDB powers SnowflakeDB's metadata


It is also used at SkuVault as a multi-model database, see the forum topic:


https://forums.foundationdb.org/t/success-story-foundationdb-at-skuvault/336


Again, I think the popularity of a tool is a hint. For 

Re: [Wikidata] Scaling Wikidata Query Service

2019-06-12 Thread Stas Malyshev
Hi!

>> So there needs to be some smarter solution, one that we'd be unlikely
>> to develop in-house
>
> Big cat, small fish. As Wikidata continues to grow, it will have specific
> needs.
> Needs that are unlikely to be solved by off-the-shelf solutions.

Here I think it's a good place to remind that we're not Google, and
developing a new database engine in-house is probably a bit beyond our
resources and budgets. Fitting an existing solution to our goals - sure,
but developing something new of that scale is probably not going to happen.

> FoundationDB and WiredTiger are respectively used at Apple (among other
> companies) and MongoDB since 3.2 all over the world. WiredTiger is also
> used at Amazon.

I believe they are, but I think for our particular goals we have to
limit ourselves to a set of solutions that are a proven good match for
our case.

>> We also have a plan for improving the throughput of Blazegraph, which
>> we're working on now.
> 
> What is the phabricator ticket? Please.

You can see the WDQS task board here:
https://phabricator.wikimedia.org/tag/wikidata-query-service/

> That will be vendor lock-in for Wikidata and Wikimedia, along with all
> the poor souls that try to interop with it.

Since Virtuoso is using standard SPARQL, it won't be too much of a
vendor lock in, though of course the standard does not cover all, so
some corners are different in all SPARQL engines. This is why even
migration between SPARQL engines, even excluding operational aspects, is
non-trivial. Of course, migration to any non-SPARQL engine would be an
order of magnitude more disruptive, so right now we do not seriously
consider doing that.

> It has two backends: MMAP and rocksdb.

Sure, but I was talking about the data model - ArangoDB sees the data as a
set of documents. The RDF approach is a bit different.

> ArangoDB is a multi-model database, it supports:

As I already mentioned, there's a difference between "you can do it" and
"you can do it efficiently". Graphs are simple creatures, and can be
modeled on many backends - KV, document, relational, column store,
whatever you have. The tricky part starts when you need to run millions
of queries on a 10B-triple database. If your backend is not optimal for
that task, it's not going to perform.

-- 
Stas Malyshev
smalys...@wikimedia.org



Re: [Wikidata] Scaling Wikidata Query Service

2019-06-12 Thread Sebastian Hellmann

Hi Amirouche,

On 12.06.19 14:07, Amirouche Boubekki wrote:
> > So there needs to be some smarter solution, one that we'd be unlikely
> > to develop in-house
>
> Big cat, small fish. As Wikidata continues to grow, it will have
> specific needs.
> Needs that are unlikely to be solved by off-the-shelf solutions.

Are you suggesting to develop the database in-house? Even MediaWiki uses
MySQL.





> > but one that has already been verified by industry experience and
> > other deployments.
>
> FoundationDB and WiredTiger are respectively used at Apple (among
> other companies) and MongoDB since 3.2 all over the world. WiredTiger
> is also used at Amazon.

Let's not talk about MongoDB, it is irrelevant and very mixed. Some say
it is THE solution for scalability, others have said it was the biggest
disappointment.


Do FoundationDB and WiredTiger have any track record for hosting open
data projects or being chosen by open data projects? PostgreSQL and
MySQL are widely used, e.g. OpenStreetMap. Virtuoso by DBpedia, the
LOD Cloud cache and Uniprot.


I don't know FoundationDB or WiredTiger, but in the past there were often
these OS projects published by large corporations that worked in-house,
but not in the OS variant. Apache UIMA was one such example. Maybe
Blazegraph works much better if you move to Neptune; that could be a
sales hook.


Any open data projects that are running open databases with FoundationDB 
and WiredTiger? Where can I query them?





> > "Evaluation of Metadata Representations in RDF stores"
>
> I don't understand how this is related to the scaling issues.

Not 100% pertinent, but do you have a better paper?


> > [About the proprietary version of Virtuoso], I dare say [it must have
> > an] enormous advantage for us to consider running it in production.
>
> That will be vendor lock-in for Wikidata and Wikimedia, along with all
> the poor souls that try to interop with it.

Actually Uniprot and Kingsley suggested hosting the OS version. Sounded
like this will hold for 5 more years, which is probably the average
lifecycle. There is also SPARQL, which normally doesn't do vendor
lock-ins. Maybe you mean that nobody can rent 15 servers and install the
same setup as WMF for Wikidata. That would be true. Switching always
seems possible though.



--
All the best,
Sebastian Hellmann

Director of Knowledge Integration and Linked Data Technologies (KILT) 
Competence Center

at the Institute for Applied Informatics (InfAI) at Leipzig University
Executive Director of the DBpedia Association
Projects: http://dbpedia.org, http://nlp2rdf.org, 
http://linguistics.okfn.org, https://www.w3.org/community/ld4lt 


Homepage: http://aksw.org/SebastianHellmann
Research Group: http://aksw.org


Re: [Wikidata] Scaling Wikidata Query Service

2019-06-12 Thread Kingsley Idehen
On 6/11/19 12:06 PM, Andra Waagmeester wrote:
>
> On Tue, Jun 11, 2019 at 11:23 AM Jerven Bolleman et al wrote:
>
> >> So we have been playing this game for ten years now: everybody
> >> tries other databases, but then most people come back to Virtuoso.
>
> Nothing bad about Virtuoso, on the contrary, they are a prime
> infrastructure provider (except maybe their trademark SPARQL query:
> "select distinct ?Concept where {[] a ?Concept}" ;). But I personally
> think that replacing the current WDQS with Virtuoso would be a bad
> idea. Not from a performance perspective, but more for the signal it
> gives. If indeed, as you state, Virtuoso is the only viable solution in
> the field, this field is nothing more than a niche. We really need
> more competition to get things done.
> Since both DBpedia and UniProt are indeed already running on Virtuoso
> - where it is doing a prime job -, having Wikidata running on another
> vendor's infrastructure does provide us with the much-needed benchmark.
> The benchmark seems to be telling some of us already that there is
> room for other alternatives. So it is fulfilling its benchmark role.
> Is there really no room for improvement with Blazegraph? How about
> GraphDB?


Hi Andra,

The goal is to provide a solution to a problem. Unfortunately, it has
ended up in a product debate. I've struggled with the logic of a
demonstrable solution being challenged by a lack of alternatives.

The fundamental goal of Linked Data is to enable Data Publication and
Access that applies capabilities delivered by HTTP to modern Data
Access, Integration, and Management.

The Linked Data meme outlining Linked Data principles has existed since
2006. Like others, we digested the meme and applied it to our existing
SQL RDBMS en route to producing a solution that made the vision in the
paper a reality, as demonstrated by DBpedia, DBpedia-Live, Uniprot, our
LOD Cloud Cache, and many other nodes in the massive LOD Cloud [1].

Virtuoso's role in the LOD Cloud is an example of what happens when open
standards are understood and appropriately applied to a problem, with a
little innovation.

Links:

[1]
https://medium.com/virtuoso-blog/what-is-the-linked-open-data-cloud-and-why-is-it-important-1901a7cb7b1f
-- What is the LOD Cloud, and why is it important?

[2]
https://medium.com/virtuoso-blog/what-is-small-data-and-why-is-it-important-fbf5f267884
-- What is Small Data, and why is it important?

-- 
Regards,

Kingsley Idehen
Founder & CEO, OpenLink Software
Home Page: http://www.openlinksw.com





Re: [Wikidata] Scaling Wikidata Query Service

2019-06-12 Thread Amirouche Boubekki
On Sun, Jun 9, 2019 at 23:18, Amirouche Boubekki <
amirouche.boube...@gmail.com> wrote:

> I made a proposal for a grant at
> https://meta.wikimedia.org/wiki/Grants:Project/WDQS_On_FoundationDB
>
> Mind the fact that this is not about the versioned quadstore. It is about a
> simple triplestore; it is mainly missing bindings for FoundationDB and
> SPARQL syntax.
>
> Also, I will probably need help to interface with the geo and label services.
>
> Feedback welcome!
>

I got "feedback" in other threads on the same topic that I will quote
and reply to.

> So there needs to be some smarter solution, one that we'd be unlikely to
> develop in-house

Big cat, small fish. As Wikidata continues to grow, it will have specific
needs.
Needs that are unlikely to be solved by off-the-shelf solutions.

> but one that has already been verified by industry experience and other
> deployments.

FoundationDB and WiredTiger are respectively used at Apple (among other
companies) and MongoDB since 3.2 all over the world. WiredTiger is also
used at Amazon.

> We also have a plan for improving the throughput of Blazegraph, which
> we're working on now.

What is the phabricator ticket? Please.

> "Evaluation of Metadata Representations in RDF stores"

I don't understand how this is related to the scaling issues.

> [About the proprietary version of Virtuoso], I dare say [it must have an]
> enormous advantage for us to consider running it in production.

That will be vendor lock-in for Wikidata and Wikimedia, along with all the
poor souls that try to interop with it.

> This project seems to be still very young.

The first commit is from 2011.

> ArangoDB seems to be a document database inside.

It has two backends: MMAP and RocksDB.

> While I would be very interested if somebody took it upon themselves to
> model Wikidata in terms of ArangoDB documents,

It looks like a bounty.

ArangoDB is a multi-model database, it supports:

- Document
- Graph
- Key-Value

> load the whole data and see what the resulting performance would be, I am
> not sure it would be wise for us to invest our team's - very limited
> currently - resources into that.

I am biased. I would advise against trying ArangoDB. This is another
short-term solution.

> the concept of having a single data store is probably not realistic at
> least within foreseeable timeframes.

Incorrect. My solution is in the foreseeable future.

> We use a separate data store for search (ElasticSearch) and will probably
> have to have a separate one for queries, whatever the mechanism would be.

It would be interesting to read how much "resource" is poured into keeping
all those synchronized:

- ElasticSearch
- MySQL
- Blazegraph

Maybe some Redis?


Re: [Wikidata] Scaling Wikidata Query Service

2019-06-11 Thread Marco Neumann
and of course not to forget the fully open source  SPARQL 1.1 compliant RDF
database Apache Jena with TDB. Did you already evaluate Apache Jena for use
in wikidata?



On Tue, Jun 11, 2019 at 5:07 PM Andra Waagmeester  wrote:

>
>
> On Tue, Jun 11, 2019 at 11:23 AM Jerven Bolleman et al wrote:
>
>>
>> >> So we have been playing this game for ten years now: everybody tries
>> >> other databases, but then most people come back to Virtuoso.
>>
>
> Nothing bad about Virtuoso, on the contrary, they are a prime
> infrastructure provider (except maybe their trademark SPARQL query: "select
> distinct ?Concept where {[] a ?Concept}" ;). But I personally think that
> replacing the current WDQS with Virtuoso would be a bad idea. Not from a
> performance perspective, but more for the signal it gives. If indeed, as
> you state, Virtuoso is the only viable solution in the field, this field is
> nothing more than a niche. We really need more competition to get things
> done.
> Since both DBpedia and UniProt are indeed already running on Virtuoso -
> where it is doing a prime job -, having Wikidata running on another
> vendor's infrastructure does provide us with the much-needed benchmark. The
> benchmark seems to be telling some of us already that there is room for
> other alternatives. So it is fulfilling its benchmark role.
> Is there really no room for improvement with Blazegraph? How about GraphDB?
>


-- 
Marco Neumann
KONA


Re: [Wikidata] Scaling Wikidata Query Service

2019-06-11 Thread Andra Waagmeester
On Tue, Jun 11, 2019 at 11:23 AM Jerven Bolleman et al wrote:

>
> >> So we have been playing this game for ten years now: everybody tries
> >> other databases, but then most people come back to Virtuoso.
>

Nothing bad about Virtuoso, on the contrary, they are a prime
infrastructure provider (except maybe their trademark SPARQL query: "select
distinct ?Concept where {[] a ?Concept}" ;). But I personally think that
replacing the current WDQS with Virtuoso would be a bad idea. Not from a
performance perspective, but more for the signal it gives. If indeed, as
you state, Virtuoso is the only viable solution in the field, this field is
nothing more than a niche. We really need more competition to get things
done.
Since both DBpedia and UniProt are indeed already running on Virtuoso -
where it is doing a prime job -, having Wikidata running on another
vendor's infrastructure does provide us with the much-needed benchmark. The
benchmark seems to be telling some of us already that there is room for
other alternatives. So it is fulfilling its benchmark role.
Is there really no room for improvement with Blazegraph? How about GraphDB?


Re: [Wikidata] Scaling Wikidata Query Service

2019-06-11 Thread Kingsley Idehen
On 6/10/19 4:25 PM, Stas Malyshev wrote:
>> Just a note here: Virtuoso is also a full RDBMS, so you could probably
>> keep the wikibase db in the same cluster and fix the asynchronicity. That is
> Given how the original data is stored (a JSON blob inside a MySQL table) it
> would not be very useful. In general, the graph data model and the Wikitext
> data model on top of which Wikidata is built are very, very different, and
> expecting the same storage to serve both - at least without very major and
> deep refactoring of the code on both sides - is not currently very
> realistic. And of course moving any of the wiki production databases to
> Virtuoso would be a non-starter. Given that the original Wikidata database
> stays on MySQL - which I think is a reasonable assumption - there would
> need to be a data migration pipeline for data to come from MySQL to
> whatever the WDQS NG storage is.
>

Hi Stas,

Data living in an RDBMS engine distinct from Virtuoso is handled via the
engine's Virtual Database module i.e., you can build powerful RDF Views
over ODBC- or JDBC-accessible data using Virtuoso. These views also have
the option of being materialized, etc.

[1]
https://medium.com/virtuoso-blog/conceptual-data-virtualization-for-sql-and-rdf-using-open-standards-24520925c7ce
-- Conceptual Data Virtualization using Virtuoso

[2]
https://medium.com/virtuoso-blog/generate-relational-tables-to-rdf-relational-graphs-mappings-using-virtuosos-rdb2rdf-wizard-c4b83402599a
-- RDF Views generation over SQL RDBMS data sources using the Virtuoso
Wizard


-- 
Regards,

Kingsley Idehen
Founder & CEO, OpenLink Software
Home Page: http://www.openlinksw.com



smime.p7s
Description: S/MIME Cryptographic Signature
___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata


Re: [Wikidata] Scaling Wikidata Query Service

2019-06-11 Thread Kingsley Idehen
On 6/10/19 4:46 PM, Stas Malyshev wrote:
> Hi!
>
>> thanks for the elaboration. I can understand the background much better.
>> I have to admit, that I am also not a real expert, but very close to the
>> real experts like Vidal and Rahm who are co-authors of the SWJ paper or
>> the OpenLink devs.
> If you know anybody at OpenLink that would be interested in trying to
> evaluate such thing (i.e. how Wikidata could be hosted on Virtuso) and
> provide support for this project, it would be interesting to discuss it.
> While open-source thing is still a barrier and in general the
> requirements are different, at least discussing it and maybe getting
> some numbers might be useful.
>
> Thanks,
> -- Stas Malyshev smalys...@wikimedia.org


I am listening.

I am only a ping away.

-- 
Regards,

Kingsley Idehen   
Founder & CEO 
OpenLink Software   
Home Page: http://www.openlinksw.com
Community Support: https://community.openlinksw.com
Weblogs (Blogs):
Company Blog: https://medium.com/openlink-software-blog
Virtuoso Blog: https://medium.com/virtuoso-blog
Data Access Drivers Blog: 
https://medium.com/openlink-odbc-jdbc-ado-net-data-access-drivers

Personal Weblogs (Blogs):
Medium Blog: https://medium.com/@kidehen
Legacy Blogs: http://www.openlinksw.com/blog/~kidehen/
  http://kidehen.blogspot.com

Profile Pages:
Pinterest: https://www.pinterest.com/kidehen/
Quora: https://www.quora.com/profile/Kingsley-Uyi-Idehen
Twitter: https://twitter.com/kidehen
Google+: https://plus.google.com/+KingsleyIdehen/about
LinkedIn: http://www.linkedin.com/in/kidehen

Web Identities (WebID):
Personal: http://kingsley.idehen.net/public_home/kidehen/profile.ttl#i
: 
http://id.myopenlink.net/DAV/home/KingsleyUyiIdehen/Public/kingsley.ttl#this



smime.p7s
Description: S/MIME Cryptographic Signature
___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata


Re: [Wikidata] Scaling Wikidata Query Service

2019-06-11 Thread Kingsley Idehen
On 6/10/19 3:49 PM, Guillaume Lederrey wrote:
> On Mon, Jun 10, 2019 at 9:03 PM Sebastian Hellmann
>  wrote:
>> Hi Guillaume,
>>
>> On 10.06.19 16:54, Guillaume Lederrey wrote:
>>
>> Hello!
>>
>> On Mon, Jun 10, 2019 at 4:28 PM Sebastian Hellmann
>>  wrote:
>>
>> Hi Guillaume,
>>
>> On 06.06.19 21:32, Guillaume Lederrey wrote:
>>
>> Hello all!
>>
>> There have been a number of concerns raised about the performance and
>> scaling of Wikidata Query Service. We share those concerns and we are
>> doing our best to address them. Here is some info about what is going
>> on:
>>
>> In an ideal world, WDQS should:
>>
>> * scale in terms of data size
>> * scale in terms of number of edits
>> * have low update latency
>> * expose a SPARQL endpoint for queries
>> * allow anyone to run any queries on the public WDQS endpoint
>> * provide great query performance
>> * provide a high level of availability
>>
>> Scaling graph databases is a "known hard problem", and we are reaching
>> a scale where there are no obvious easy solutions to address all the
>> above constraints. At this point, just "throwing hardware at the
>> problem" is not an option anymore. We need to go deeper into the
>> details and potentially make major changes to the current architecture.
>> Some scaling considerations are discussed in [1]. This is going to take
>> time.
>>
>> I am not sure how to evaluate this correctly. Scaling databases in general 
>> is a "known hard problem" and graph databases a sub-field of it, which are 
>> optimized for graph-like queries as opposed to column stores or relational 
>> databases. If you say that "throwing hardware at the problem" does not help, 
>> you are admitting that Blazegraph does not scale for what is needed by 
>> Wikidata.
>>
>> Yes, I am admitting that Blazegraph (at least in the way we are using
>> it at the moment) does not scale to our future needs. Blazegraph does
>> have support for sharding (what they call "Scale Out"). And yes, we
>> need to have a closer look at how that works. I'm not the expert here,
>> so I won't even try to assert if that's a viable solution or not.
>>
>> Yes, sharding is what you need, I think, instead of replication. This is the 
>> technique where data is repartitioned into more manageable chunks across 
>> servers.
> Well, we need sharding for scalability and replication for
> availability, so we do need both. The hard problem is sharding.
>
>> Here is a good explanation of it:
>>
>> http://vos.openlinksw.com/owiki/wiki/VOS/VOSArticleWebScaleRDF
> Interesting read. I don't see how Virtuoso addresses data locality, it
> looks like sharding of their RDF store is just hash based (I'm
> assuming some kind of uniform hash).


It handles data locality across a shared-nothing cluster just fine, i.e.,
you can interact with any node in a Virtuoso cluster and experience
identical behavior (every node looks like a single node in the eyes of
the operator).


>  I'm not enough of an expert on
> graph databases, but I doubt that a highly connected graph like
> Wikidata will be able to scale reads without some way to address data
> locality. Obviously, this needs testing.
>
>> http://docs.openlinksw.com/virtuoso/ch-clusterprogramming/


There are live instances of Virtuoso that demonstrate its capabilities.
If you want to explore shared-nothing cluster capabilities, then our live
LOD Cloud cache is the place to start [1][2][3]. If you want to see the
single-server open source edition, you have DBpedia, DBpedia-Live,
Uniprot and many other nodes in the LOD Cloud to choose from. All of
these instances are highly connected.

If you want to get into the depths of Linked Data regarding query
processing pipelines that include URI (or Super Key) de-reference, you
can take a look at our URIBurner Service [4][5].

Virtuoso handles both shared-nothing clusters and replication i.e., you
can have a cluster configuration used in conjunction with a replication
topology if your solution requires that.

Virtuoso is a full-blown SQL RDBMS that leverages SPARQL and a SQL
extension for handling challenges associated with Entity Relationship
Graphs represented as RDF statement collections. You can even use SPARQL
inside SQL from any ODBC- or JDBC-compliant app or service etc..
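
As a rough illustration of the SPARQL-inside-SQL usage (a sketch only; it
assumes the pyodbc library and a locally configured Virtuoso ODBC DSN
named "VOS", with placeholder credentials):

    # Sketch: run a SPARQL query through Virtuoso's SQL channel via ODBC.
    import pyodbc

    con = pyodbc.connect("DSN=VOS;UID=dba;PWD=dba")  # placeholder DSN/creds
    cur = con.cursor()
    # In Virtuoso, a SQL statement starting with the SPARQL keyword is
    # executed as SPARQL ("SPASQL").
    cur.execute("SPARQL SELECT DISTINCT ?Concept WHERE { [] a ?Concept } LIMIT 5")
    for row in cur.fetchall():
        print(row[0])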


Links:

[1] http://lod.openlinksw.com

[2]
https://twitter.com/search?f=tweets=default=%23PermID%20%40kidehen=typd
-- query samplings via links included in tweets

[3] https://tinyurl.com/y47prg9h -- SPARQL transitive option applied to
a skos taxonomy tree

[4] https://linkeddata.uriburner.com -- this service provides Linked
Data transformation combined with an ability to de-ref URI-variables and
URI-constants in the body of a query as part of the solution production
pipeline; it also includes a service that adds image processing to the
aforementioned pipeline via the PivotViewer module for data visualization

[5]
https://medium.com/virtuoso-blog/what-is-small-data-and-why-is-it-important-fbf5f267884
-- About Small Data (use of 

Re: [Wikidata] Scaling Wikidata Query Service

2019-06-11 Thread Kingsley Idehen
On 6/10/19 10:54 AM, Guillaume Lederrey wrote:
>> - Virtuoso has proven quite useful. I don't want to advertise here, but the
>> thing they have going for DBpedia uses ridiculous hardware, i.e. 64GB RAM,
>> and it is also the open source version, not the professional one with
>> clustering and repartitioning capability. So we have been playing this game
>> for ten years now: everybody tries other databases, but then most people
>> come back to Virtuoso. I have to admit that OpenLink is maintaining the
>> hosting for DBpedia themselves, so they know how to optimise. They normally
>> have large banks as customers, with millions of write transactions per
>> hour. In LOD2 they also implemented column-store features with MonetDB and
>> repartitioning in clusters.
> I'm not entirely sure how to read the above (and a quick look at
> the Virtuoso website does not give me the answer either), but it looks
> like the sharding / partitioning options are only available in the
> enterprise version. That probably makes it a non-starter for us.
>

Virtuoso Cluster Edition is as described by Sebastian in an earlier post
to this thread [1]. Online, that's behind our LOD Cloud cache, which hosts
40 billion+ triples, but still uses ridiculously cheap hardware for
the shared-nothing cluster.

As Jerven has already articulated [2], the single-server open source
edition of Virtuoso can also scale to 40 billion+ triples, as
demonstrated by Uniprot amongst others.

There's a publicly available Google Spreadsheet that provides insights
into a variety of Virtuoso configurations that you can also look at
regarding resource requirements [3].

Bottom line: Virtuoso has no fundamental issues with performance, scale,
or security (most haven't hit this bump yet, but it's coming!) regarding
RDF data deployed in line with Linked Data principles.

We are always open to collaboration with anyone (or any group) seeking to
fully exploit the power and promise of a Semantic Web derived from
Linked Data :)

Links:

[1] https://lists.wikimedia.org/pipermail/wikidata/2019-June/013132.html
-- Sebastian Hellman comment

[2] https://lists.wikimedia.org/pipermail/wikidata/2019-June/013143.html
-- Jerven Bolleman comment

[3]
https://docs.google.com/spreadsheets/d/1-stlTC_WJmMU3xA_NxA1tSLHw6_sbpjff-5OITtrbFw/

-- Virtuoso configurations sample spreadsheet

[4] https://hub.docker.com/u/openlink/ -- Docker Hub offerings

[5] https://aws.amazon.com/marketplace/pp/B00ZWMSNOG -- Amazon
Marketplace BYOL Edition

[6] https://aws.amazon.com/marketplace/pp/B011VMCZ8K -- Amazon
Marketplace PAGO Edition

[7] https://github.com/openlink/virtuoso-opensource -- Github

[8] http://download.openlinksw.com -- Download Site


-- 
Regards,

Kingsley Idehen   
Founder & CEO 
OpenLink Software   
Home Page: http://www.openlinksw.com
Community Support: https://community.openlinksw.com
Weblogs (Blogs):
Company Blog: https://medium.com/openlink-software-blog
Virtuoso Blog: https://medium.com/virtuoso-blog
Data Access Drivers Blog: 
https://medium.com/openlink-odbc-jdbc-ado-net-data-access-drivers

Personal Weblogs (Blogs):
Medium Blog: https://medium.com/@kidehen
Legacy Blogs: http://www.openlinksw.com/blog/~kidehen/
  http://kidehen.blogspot.com

Profile Pages:
Pinterest: https://www.pinterest.com/kidehen/
Quora: https://www.quora.com/profile/Kingsley-Uyi-Idehen
Twitter: https://twitter.com/kidehen
Google+: https://plus.google.com/+KingsleyIdehen/about
LinkedIn: http://www.linkedin.com/in/kidehen

Web Identities (WebID):
Personal: http://kingsley.idehen.net/public_home/kidehen/profile.ttl#i
: 
http://id.myopenlink.net/DAV/home/KingsleyUyiIdehen/Public/kingsley.ttl#this



smime.p7s
Description: S/MIME Cryptographic Signature
___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata


Re: [Wikidata] Scaling Wikidata Query Service

2019-06-11 Thread Jerven Bolleman

Hi Guillaume, All,

As the lead developer of sparql.uniprot.org, one of the few SPARQL
endpoints with much more data (7x) than Wikidata and significant
external users, I can chime in with our experience of hosting data with
Virtuoso. All in all, I am very happy with it, and it has made our
endpoint possible and useful on a shoestring budget.


We, like Wikidata, have an asynchronous loading process and allow anyone to
run analytical queries on our SPARQL endpoint with generous timeouts.

We have two servers, each with 256GB of RAM and 8TB of raw (consumer-grade)
SSD space. These are whitebox AMD machines from 2014, and the main cost
at the time was the RAM. The setup was relatively cheap (cheaper than
what is documented at
https://www.mediawiki.org/wiki/Wikidata_Query_Service/Implementation#Hardware)


Even in 2014 we already had more data than you do now.

There is a third, multi-use server which does the loading of data
offline. (This is now a larger new Epyc server with more RAM and more SSD,
but it is used for much more than just the RDF loading.)


Unlike most sites, we have our own custom frontend in front of
Virtuoso. We did this to allow more styling, as well as to stay flexible
and change implementations at our whim, e.g. we double-parse the SPARQL
queries and even rewrite some to be friendlier. I suggest you do the
same no matter which DB you use in the end, and we would be willing to
open source ours (it is in Java, and uses RDF4J and some ugly JSPX, but
it works; if not to use, then at least as an inspiration). We did this to
avoid being locked into endpoint-specific features.
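
To make that concrete, here is a rough Python sketch of that kind of
front-end handling (illustrative only: our real code is Java/RDF4J, and
the LIMIT-capping policy below is invented for the example):

    # Sketch: "double parse" an incoming SPARQL query before forwarding it.
    # rdflib's parser rejects malformed queries; the LIMIT cap is a made-up
    # policy, just to show the kind of rewriting a front end can do.
    import re
    from rdflib.plugins.sparql import prepareQuery

    def sanitize(query: str, max_rows: int = 10000) -> str:
        prepareQuery(query)  # raises a parse error on bad syntax
        if re.search(r"\bSELECT\b", query, re.I) and \
           not re.search(r"\bLIMIT\b", query, re.I):
            query += f"\nLIMIT {max_rows}"  # naive, illustrative rewrite
        return query

    print(sanitize("SELECT ?s WHERE { ?s ?p ?o }"))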


We use the open source edition of Virtuoso, and do not need the sharding
etc. features. We use the CAIS (Cheap Array of Independent Servers ;)
approach to resilience. OpenLink Software, the company behind Virtuoso,
can deliver support for the open source edition, and if you are
interested I suggest you talk to them.


Virtuoso 7 has become very resilient over the years, and does not need
much hand-holding anymore (in 2015 this was different). Of course we
have aggressive auto-restart code, but this is rarely triggered these
days, even while the inbound queries are getting more complex.


Some of the tricks you have built into WDQS are going to be a pain to
redo in Virtuoso. But I don't see anything impossible there.


Pragmatically, while WDQS is a graph database, the queries are actually
very relational, and none of the standard graph algorithms are used. To
be honest, RDF is actually a relational system, which means that
relational techniques are very good at answering these queries. The sole
issue is recursive queries (e.g. rdfs:subClassOf+), for which the
Virtuoso implementation is adequate but not great.
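
For concreteness, the recursive pattern in question looks like this (a
small sketch against the public endpoint, which predefines the wd:/wdt:
prefixes; the SPARQLWrapper library and the example class are my own
choices, not anything from WDQS itself):

    # Sketch: a recursive class-hierarchy query via a SPARQL 1.1 property
    # path. wdt:P279+ walks "subclass of" transitively; this is exactly the
    # pattern whose implementation quality varies between stores.
    from SPARQLWrapper import SPARQLWrapper, JSON

    sparql = SPARQLWrapper("https://query.wikidata.org/sparql")
    sparql.setQuery("""
    SELECT ?class WHERE {
      ?class wdt:P279+ wd:Q7725634 .  # transitive subclasses of "literary work"
    } LIMIT 10
    """)
    sparql.setReturnFormat(JSON)
    for row in sparql.query().convert()["results"]["bindings"]:
        print(row["class"]["value"])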


This is why recovering physical schemata from RDF data is such a
powerful optimization technique [1]: you tend to do joins, not
traversals. This is not always true, but I strongly suspect it will hold
for the vast majority of the Wikidata Query Service workload.
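
As a toy illustration of what recovering a physical schema buys you (a
sketch of the general idea only, not the actual technique from [1]):
group triples by predicate, and a star-shaped pattern becomes a plain
join:

    # Sketch: "vertical partitioning" - one pseudo-table per predicate, so
    # a pattern like { ?x :name ?n ; :born ?b } becomes a two-table join.
    from collections import defaultdict

    triples = [
        ("q1", "name", "Ada"),  ("q1", "born", 1815),
        ("q2", "name", "Alan"), ("q2", "born", 1912),
    ]

    tables = defaultdict(dict)  # predicate -> {subject: object}
    for s, p, o in triples:
        tables[p][s] = o

    # "Join" the name and born tables on the shared subject column.
    for s in tables["name"].keys() & tables["born"].keys():
        print(s, tables["name"][s], tables["born"][s])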

I hope this was helpful, and I am willing to answer further questions.

Regards,
Jerven




[1] https://research.vu.nl/files/61555276/complete%20dissertation.pdf
and the associated work done by Orri Erling, which unfortunately
has not yet landed in the Virtuoso master branch.




On 6/10/19 9:49 PM, Guillaume Lederrey wrote:

On Mon, Jun 10, 2019 at 9:03 PM Sebastian Hellmann
 wrote:


Hi Guillaume,

On 10.06.19 16:54, Guillaume Lederrey wrote:

Hello!

On Mon, Jun 10, 2019 at 4:28 PM Sebastian Hellmann
 wrote:

Hi Guillaume,

On 06.06.19 21:32, Guillaume Lederrey wrote:

Hello all!

There have been a number of concerns raised about the performance and
scaling of Wikidata Query Service. We share those concerns and we are
doing our best to address them. Here is some info about what is going
on:

In an ideal world, WDQS should:

* scale in terms of data size
* scale in terms of number of edits
* have low update latency
* expose a SPARQL endpoint for queries
* allow anyone to run any queries on the public WDQS endpoint
* provide great query performance
* provide a high level of availability

Scaling graph databases is a "known hard problem", and we are reaching
a scale where there are no obvious easy solutions to address all the
above constraints. At this point, just "throwing hardware at the
problem" is not an option anymore. We need to go deeper into the
details and potentially make major changes to the current architecture.
Some scaling considerations are discussed in [1]. This is going to take
time.

I am not sure how to evaluate this correctly. Scaling databases in general is a "known hard 
problem" and graph databases a sub-field of it, which are optimized for graph-like queries as 
opposed to column stores or relational databases. If you say that "throwing hardware at the 
problem" does not help, you are admitting that Blazegraph does not scale for what is needed by 
Wikidata.

Yes, I am admitting that Blazegraph (at least in the way we are using
it at the moment) does not scale to our future needs.

Re: [Wikidata] Scaling Wikidata Query Service

2019-06-10 Thread Sebastian Hellmann
Yes, I can ask. I am talking a lot with them as we are redeploying 
DBpedia live and also pushing the new DBpedia to them soon.


I think they also had a specific issue with how Wikidata does linked
data, but I didn't get it, as it was mentioned too briefly.


All the best,

Sebastian


On 10.06.19 22:46, Stas Malyshev wrote:

Hi!


thanks for the elaboration. I can understand the background much better.
I have to admit, that I am also not a real expert, but very close to the
real experts like Vidal and Rahm who are co-authors of the SWJ paper or
the OpenLink devs.

If you know anybody at OpenLink that would be interested in trying to
evaluate such thing (i.e. how Wikidata could be hosted on Virtuso) and
provide support for this project, it would be interesting to discuss it.
While open-source thing is still a barrier and in general the
requirements are different, at least discussing it and maybe getting
some numbers might be useful.

Thanks,

--
All the best,
Sebastian Hellmann

Director of Knowledge Integration and Linked Data Technologies (KILT) 
Competence Center

at the Institute for Applied Informatics (InfAI) at Leipzig University
Executive Director of the DBpedia Association
Projects: http://dbpedia.org, http://nlp2rdf.org, 
http://linguistics.okfn.org, https://www.w3.org/community/ld4lt 


Homepage: http://aksw.org/SebastianHellmann
Research Group: http://aksw.org
___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata


Re: [Wikidata] Scaling Wikidata Query Service

2019-06-10 Thread Stas Malyshev
Hi!

> thanks for the elaboration. I can understand the background much better.
> I have to admit, that I am also not a real expert, but very close to the
> real experts like Vidal and Rahm who are co-authors of the SWJ paper or
> the OpenLink devs.

If you know anybody at OpenLink that would be interested in trying to
evaluate such thing (i.e. how Wikidata could be hosted on Virtuso) and
provide support for this project, it would be interesting to discuss it.
While open-source thing is still a barrier and in general the
requirements are different, at least discussing it and maybe getting
some numbers might be useful.

Thanks,
-- 
Stas Malyshev
smalys...@wikimedia.org

___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata


Re: [Wikidata] Scaling Wikidata Query Service

2019-06-10 Thread Sebastian Hellmann

Hi Stas,

thanks for the elaboration. I can understand the background much better.
I have to admit that I am also not a real expert, but I am very close to
the real experts, like Vidal and Rahm, who are co-authors of the SWJ paper,
or the OpenLink devs.


I am also spoiled, because OpenLink handles the hosting for DBpedia and
also DBpedia-Live, with ca. 130k updates per day for the English
Wikipedia. I think this is the most recent report:
https://medium.com/virtuoso-blog/dbpedia-usage-report-as-of-2018-01-01-8cae1b81ca71
Then again DBpedia didn't grow for a while, but we made a "Best of" now 
[1]. But  will not host it all.


[1] https://svn.aksw.org/papers/2019/ISWC_FlexiFusion/public.pdf

I also see that your context is difficult. Maybe you can custom-shard /
scale out Blazegraph based on the queries and then replicate the sharded
clusters - a mix between sharding and replication, maybe just 3 x 3
servers instead of 9 replicated ones or 9 servers full of shards. There
are not many options here, given your open-source requirement. I guess
you are already caching static content as much as possible.


This also matches pretty much what I know, but it really is all second-hand,
as my expertise is more focused on what's inside the database.


All the best,

Sebastian

On 10.06.19 22:02, Stas Malyshev wrote:

Hi!


I am not sure how to evaluate this correctly. Scaling databases in
general is a "known hard problem" and graph databases a sub-field of it,
which are optimized for graph-like queries as opposed to column stores
or relational databases. If you say that "throwing hardware at the
problem" does not help, you are admitting that Blazegraph does not scale
for what is needed by Wikidata.

I think this is over-generalizing. We have a database that grew 10x over
the last 4 years. We have certain hardware and software limits, both
with existing hardware and in principle by hardware we could buy. We
also have certain issues specific to graph databases that make scaling
harder - for example, document databases, like ElasticSearch, and
certain models of relational databases, shard easily. Sharding something
like Wikidata graph is much harder, especially if the underlying
database knows nothing about specifics of Wikidata data (which would be
the case for all off-the-shelf databases). If we just randomly split the
triples between several servers, we'd probably be just modeling a large
but extremely slow disk. So there needs to be some smarter solution,
one that we'd be unlikely to develop in-house, but one that has already been
verified by industry experience and other deployments.

Is the issue specific to Blazegraph and can the issue be solved by
switching platform? Maybe, we do not know yet. We do not have any better
solution that guarantees us better scalability identified, but we have a
plan on looking for that solution, given the resources. We also have a
plan on improving the throughput of Blazegraph, which we're working on now.

A non-sharded model might be hard to sustain indefinitely, but it is not
clear it can't work in the short term; it is also not clear that a
sharded model would deliver a clear performance win, as it will have to
involve network latencies inside the queries, which can significantly
affect performance. This can only be resolved by proper testing and
evaluation of the candidate solutions.


Then it is not a "cluster" in the sense of databases. It is more a
redundancy architecture like RAID 1. Is this really how BlazeGraph does

I do not think our time here would be productively spent arguing
semantics about what should and should not be called a "cluster". We
call that setup a cluster, and I think now we all understand what we're
talking about.


it? Don't they have a proper cluster solution, where they repartition
data across servers? Or are these independent servers a Wikimedia staff
homebuild?

If you mean sharded or replicated setup, as far as I know, Blazegraph
does not support that (there's some support for replication IIRC but
replication without sharding probably won't give us much improvement).
We have a plan to evaluate a solution that does shard, given necessary
resources.


Some info here:

- We evaluated some stores according to their performance:
http://www.semantic-web-journal.net/content/evaluation-metadata-representations-rdf-stores-0
"Evaluation of Metadata Representations in RDF stores"

Thanks for the link, it looks very interesting, I'll read it and see
which parts we could use here.


- Virtuoso has proven quite useful. I don't want to advertise here, but
the thing they have going for DBpedia uses ridiculous hardware, i.e.
64GB RAM, and it is also the open source version, not the professional
one with clustering and repartitioning capability. So we have been
playing this game for ten years now: everybody tries other databases,
but then most people come back to Virtuoso. I have to admit that
OpenLink is maintaining the hosting for DBpedia themselves, so they know
how to optimise. They normally have large banks as customers, with
millions of write transactions per hour.

Re: [Wikidata] Scaling Wikidata Query Service

2019-06-10 Thread Stas Malyshev
Hi!

> Yes, sharding is what you need, I think, instead of replication. This is
> the technique where data is repartitioned into more manageable chunks
> across servers.

Agreed, if we are to get any solution that is not constrained by
hardware limits of a single server, we can not avoid looking at sharding.

> Here is a good explanation of it:
> 
> http://vos.openlinksw.com/owiki/wiki/VOS/VOSArticleWebScaleRDF

Thanks, very interesting article. I'd certainly like to know how
this works with a database on the order of 10 bln. triples and queries
both accessing and updating random subsets of them. Updates are not covered
very thoroughly there - this is, I suspect, because many databases of that
size do not have as active a (non-append) update workload as we do.
Maybe they still manage to solve it; if so, I'd very much like to know
about it.

> Just a note here: Virtuoso is also a full RDMS, so you could probably
> keep wikibase db in the same cluster and fix the asynchronicity. That is

Given how the original data is stored (a JSON blob inside a MySQL table) it
would not be very useful. In general, the graph data model and the Wikitext
data model on top of which Wikidata is built are very, very different, and
expecting the same storage to serve both - at least without very major and
deep refactoring of the code on both sides - is not currently very
realistic. And of course moving any of the wiki production databases to
Virtuoso would be a non-starter. Given that the original Wikidata database
stays on MySQL - which I think is a reasonable assumption - there would
need to be a data migration pipeline for data to come from MySQL to
whatever the WDQS NG storage is.
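
To sketch the shape of such a pipeline (illustrative only; the real
updater deals with qualifiers, references, datatypes, deletes and much
more, and the simplification to wdt: triples below is mine):

    # Sketch: one step of a Wikidata->RDF pipeline, reading the entity JSON
    # (the same document stored as a blob in MySQL) and emitting naive
    # entity-valued triples.
    import requests

    qid = "Q42"
    doc = requests.get(
        f"https://www.wikidata.org/wiki/Special:EntityData/{qid}.json"
    ).json()

    for prop, claims in doc["entities"][qid]["claims"].items():
        for claim in claims:
            snak = claim["mainsnak"]
            if snak.get("datavalue", {}).get("type") == "wikibase-entityid":
                obj = snak["datavalue"]["value"]["id"]
                print(f"wd:{qid} wdt:{prop} wd:{obj} .")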

> also true for any mappers like Sparqlify:
> http://aksw.org/Projects/Sparqlify.html However, these shift the
> problem, then you need a sharded/repartitioned relational database

Yes, relational-RDF bridges are known, but my experience is they usually
are not very performant (the difference between "you can do it" and "you
can do it fast" is sometimes very significant), and in our case it would be
useless anyway, as Wikidata data is not really stored in a relational
database per se - it's stored in a JSON blob opaquely saved in a relational
database structure that knows nothing about Wikidata. Yes, it's not the
ideal structure for optimal performance of Wikidata itself, but I do not
foresee this changing, at least in any short term. Again, we could of
course have a data export pipeline to whatever storage format we want -
essentially we already have one - but the concept of having a single data
store is probably not realistic, at least within foreseeable timeframes.
We use a separate data store for search (Elasticsearch) and will probably
have to have a separate one for queries, whatever the mechanism will be.

Thanks,
-- 
Stas Malyshev
smalys...@wikimedia.org

___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata


Re: [Wikidata] Scaling Wikidata Query Service

2019-06-10 Thread Stas Malyshev
Hi!

> I am not sure how to evaluate this correctly. Scaling databases in
> general is a "known hard problem" and graph databases a sub-field of it,
> which are optimized for graph-like queries as opposed to column stores
> or relational databases. If you say that "throwing hardware at the
> problem" does not help, you are admitting that Blazegraph does not scale
> for what is needed by Wikidata. 

I think this is over-generalizing. We have a database that grew 10x over
the last 4 years. We have certain hardware and software limits, both
with existing hardware and in principle by hardware we could buy. We
also have certain issues specific to graph databases that make scaling
harder - for example, document databases, like ElasticSearch, and
certain models of relational databases, shard easily. Sharding something
like Wikidata graph is much harder, especially if the underlying
database knows nothing about specifics of Wikidata data (which would be
the case for all off-the-shelf databases). If we just randomly split the
triples between several servers, we'd probably be just modeling a large
but extremely slow disk. So there needs to be some smarter solution,
one that we'd be unlikely to develop in-house, but one that has already been
verified by industry experience and other deployments.

Is the issue specific to Blazegraph and can the issue be solved by
switching platform? Maybe, we do not know yet. We do not have any better
solution that guarantees us better scalability identified, but we have a
plan on looking for that solution, given the resources. We also have a
plan on improving the throughput of Blazegraph, which we're working on now.

A non-sharded model might be hard to sustain indefinitely, but it is not
clear it can't work in the short term; it is also not clear that a
sharded model would deliver a clear performance win, as it will have to
involve network latencies inside the queries, which can significantly
affect performance. This can only be resolved by proper testing and
evaluation of the candidate solutions.

> Then it is not a "cluster" in the sense of databases. It is more a
> redundancy architecture like RAID 1. Is this really how BlazeGraph does

I do not think our time here would be productively spent arguing
semantics about what should and should not be called a "cluster". We
call that setup a cluster, and I think now we all understand what we're
talking about.

> it? Don't they have a proper cluster solution, where they repartition
> data across servers? Or are these independent servers a Wikimedia staff
> homebuild?

If you mean sharded or replicated setup, as far as I know, Blazegraph
does not support that (there's some support for replication IIRC but
replication without sharding probably won't give us much improvement).
We have a plan to evaluate a solution that does shard, given necessary
resources.

> Some info here:
> 
> - We evaluated some stores according to their performance:
> http://www.semantic-web-journal.net/content/evaluation-metadata-representations-rdf-stores-0
>  
> "Evaluation of Metadata Representations in RDF stores" 

Thanks for the link, it looks very interesting, I'll read it and see
which parts we could use here.

> - Virtuoso has proven quite useful. I don't want to advertise here, but
> the thing they have going for DBpedia uses ridiculous hardware, i.e.
> 64GB RAM, and it is also the open source version, not the professional
> one with clustering and repartitioning capability. So we have been
> playing this game for ten years now: everybody tries other databases,
> but then most people come back to Virtuoso. I have to admit that OpenLink
> is maintaining the hosting for DBpedia themselves, so they know how to
> optimise. They normally have large banks as customers, with millions of
> write transactions per hour. In LOD2 they also implemented column-store
> features with MonetDB and repartitioning in clusters.

I do not know the details of your usage scenario, so before we get into
comparisons, I'd like to understand:

1. Do your servers provide live synchronized updates with Wikidata or
DBpedia? How many updates per second can that server process?
2. How many queries per second is this server serving? What kind of
queries are those?

We did a preliminary, very limited evaluation of Virtuoso for hosting
Wikidata, and it looks like it can load and host the necessary data
(though it does not support some customizations we have now, and we could
not evaluate whether such customizations are possible), but it would
require a significant time investment to port all the functionality to it.
Unfortunately, the lack of resources did not allow us to do a fuller
evaluation.

Also, as I understand, the "professional" capabilities of Virtuoso are
closed-source and require a paid license, which would probably be a
problem for running it on WMF infrastructure unless we reach some kind of
special arrangement. Since this arrangement will probably not include
open-sourcing the enterprise part of Virtuoso, it should deliver a very
Re: [Wikidata] Scaling Wikidata Query Service

2019-06-10 Thread Guillaume Lederrey
On Mon, Jun 10, 2019 at 9:03 PM Sebastian Hellmann
 wrote:
>
> Hi Guillaume,
>
> On 10.06.19 16:54, Guillaume Lederrey wrote:
>
> Hello!
>
> On Mon, Jun 10, 2019 at 4:28 PM Sebastian Hellmann
>  wrote:
>
> Hi Guillaume,
>
> On 06.06.19 21:32, Guillaume Lederrey wrote:
>
> Hello all!
>
> There have been a number of concerns raised about the performance and
> scaling of Wikidata Query Service. We share those concerns and we are
> doing our best to address them. Here is some info about what is going
> on:
>
> In an ideal world, WDQS should:
>
> * scale in terms of data size
> * scale in terms of number of edits
> * have low update latency
> * expose a SPARQL endpoint for queries
> * allow anyone to run any queries on the public WDQS endpoint
> * provide great query performance
> * provide a high level of availability
>
> Scaling graph databases is a "known hard problem", and we are reaching
> a scale where there are no obvious easy solutions to address all the
> above constraints. At this point, just "throwing hardware at the
> problem" is not an option anymore. We need to go deeper into the
> details and potentially make major changes to the current architecture.
> Some scaling considerations are discussed in [1]. This is going to take
> time.
>
> I am not sure how to evaluate this correctly. Scaling databases in general is 
> a "known hard problem" and graph databases a sub-field of it, which are 
> optimized for graph-like queries as opposed to column stores or relational 
> databases. If you say that "throwing hardware at the problem" does not help, 
> you are admitting that Blazegraph does not scale for what is needed by 
> Wikidata.
>
> Yes, I am admitting that Blazegraph (at least in the way we are using
> it at the moment) does not scale to our future needs. Blazegraph does
> have support for sharding (what they call "Scale Out"). And yes, we
> need to have a closer look at how that works. I'm not the expert here,
> so I won't even try to assert if that's a viable solution or not.
>
> Yes, sharding is what you need, I think, instead of replication. This is the 
> technique where data is repartitioned into more manageable chunks across 
> servers.

Well, we need sharding for scalability and replication for
availability, so we do need both. The hard problem is sharding.

> Here is a good explanation of it:
>
> http://vos.openlinksw.com/owiki/wiki/VOS/VOSArticleWebScaleRDF

Interesting read. I don't see how Virtuoso addresses data locality; it
looks like the sharding of their RDF store is just hash-based (I'm
assuming some kind of uniform hash). I'm not enough of an expert on
graph databases, but I doubt that a highly connected graph like
Wikidata will be able to scale reads without some way to address data
locality. Obviously, this needs testing.
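
To illustrate the locality concern, here is a toy sketch (my own
illustration, not Virtuoso's actual partitioning function): hashing
triples to shards by subject keeps each entity's outgoing edges
together, but any join that hops to another entity still crosses shards:

    # Sketch: hash-partition triples by subject across N shards. A pattern
    # like { ?a :p ?b . ?b :q ?c } must fetch ?b's triples from whatever
    # shard ?b hashes to, so multi-hop joins pay network round trips.
    import hashlib

    N_SHARDS = 4

    def shard_of(subject: str) -> int:
        digest = hashlib.sha1(subject.encode()).digest()
        return int.from_bytes(digest[:4], "big") % N_SHARDS

    toy_triples = [("Q42", "P31", "Q5"), ("Q5", "P279", "Q215627")]
    for t in toy_triples:
        print(f"shard {shard_of(t[0])}: {t}")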

> http://docs.openlinksw.com/virtuoso/ch-clusterprogramming/
>
>
> Sharding, scale-out or repartitioning is a classical enterprise feature for
> open-source databases. I am rather surprised that Blazegraph is fully GPL
> without an enterprise edition. But then they really sounded like their goal
> as a company was to be bought by a bigger fish, in this case Amazon Web
> Services. What is their deal? Are they offering support?
>
> So if you go open-source, I think you will have a hard time finding good free
> databases with sharding/repartitioning. FoundationDB, as proposed in the
> grant [1], is from Apple.
>
> [1] https://meta.wikimedia.org/wiki/Grants:Project/WDQS_On_FoundationDB
>
>
> I mean, try the sharding feature. At some point, though, it might be worth
> considering going enterprise. Corporate open source often has a twist.

Closed source is not an option. We have strong open source
requirements to deploy anything in our production environment.

> Just a note here: Virtuoso is also a full RDBMS, so you could probably keep
> the wikibase db in the same cluster and fix the asynchronicity. That is also
> true for any mappers like Sparqlify: http://aksw.org/Projects/Sparqlify.html
> However, these shift the problem; then you need a sharded/repartitioned
> relational database

There is no plan to move the Wikibase storage out of MySQL at the
moment. In any case, having low coupling between the primary storage
for Wikidata and a secondary storage for complex querying is a sound
architectural principle. This asynchronous update process is most
probably going to stay in place, just because it makes a lot of sense.

Thanks for the discussion so far! It is always interesting to have outside ideas!

   Have fun!

 Guillaume

>
> All the best,
>
> Sebastian
>
>
>
> From [1]:
>
> At the moment, each WDQS cluster is a group of independent servers, sharing 
> nothing, with each server independently updated and each server holding a 
> full data set.
>
> Then it is not a "cluster" in the sense of databases. It is more a redundancy 
> architecture like RAID 1. Is this really how BlazeGraph does it? Don't they 
> have a proper cluster solution, where they repartition data 

Re: [Wikidata] Scaling Wikidata Query Service

2019-06-10 Thread Sebastian Hellmann

Hi Guillaume,

On 10.06.19 16:54, Guillaume Lederrey wrote:

Hello!

On Mon, Jun 10, 2019 at 4:28 PM Sebastian Hellmann
 wrote:

Hi Guillaume,

On 06.06.19 21:32, Guillaume Lederrey wrote:

Hello all!

There have been a number of concerns raised about the performance and
scaling of Wikidata Query Service. We share those concerns and we are
doing our best to address them. Here is some info about what is going
on:

In an ideal world, WDQS should:

* scale in terms of data size
* scale in terms of number of edits
* have low update latency
* expose a SPARQL endpoint for queries
* allow anyone to run any queries on the public WDQS endpoint
* provide great query performance
* provide a high level of availability

Scaling graph databases is a "known hard problem", and we are reaching
a scale where there are no obvious easy solutions to address all the
above constraints. At this point, just "throwing hardware at the
problem" is not an option anymore. We need to go deeper into the
details and potentially make major changes to the current architecture.
Some scaling considerations are discussed in [1]. This is going to take
time.

I am not sure how to evaluate this correctly. Scaling databases in general is a "known hard 
problem" and graph databases a sub-field of it, which are optimized for graph-like queries as 
opposed to column stores or relational databases. If you say that "throwing hardware at the 
problem" does not help, you are admitting that Blazegraph does not scale for what is needed by 
Wikidata.

Yes, I am admitting that Blazegraph (at least in the way we are using
it at the moment) does not scale to our future needs. Blazegraph does
have support for sharding (what they call "Scale Out"). And yes, we
need to have a closer look at how that works. I'm not the expert here,
so I won't even try to assert if that's a viable solution or not.


Yes, sharding is what you need, I think, instead of replication. This is 
the technique where data is repartitioned into more manageable chunks 
across servers.


Here is a good explanation of it:

http://vos.openlinksw.com/owiki/wiki/VOS/VOSArticleWebScaleRDF

http://docs.openlinksw.com/virtuoso/ch-clusterprogramming/


Sharding, scale-out or repartitioning is a classical enterprise feature
for open-source databases. I am rather surprised that Blazegraph is fully
GPL without an enterprise edition. But then they really sounded like
their goal as a company was to be bought by a bigger fish, in this case
Amazon Web Services. What is their deal? Are they offering support?


So if you go open-source, I think you will have a hard time finding good
free databases with sharding/repartitioning. FoundationDB, as proposed in
the grant [1], is from Apple.


[1] https://meta.wikimedia.org/wiki/Grants:Project/WDQS_On_FoundationDB


I mean, try the sharding feature. At some point, though, it might be worth
considering going enterprise. Corporate open source often has a twist.


Just a note here: Virtuoso is also a full RDBMS, so you could probably
keep the wikibase db in the same cluster and fix the asynchronicity. That
is also true for any mappers like Sparqlify:
http://aksw.org/Projects/Sparqlify.html However, these shift the
problem; then you need a sharded/repartitioned relational database



All the best,

Sebastian





 From [1]:

At the moment, each WDQS cluster is a group of independent servers, sharing 
nothing, with each server independently updated and each server holding a full 
data set.

Then it is not a "cluster" in the sense of databases. It is more a redundancy
architecture like RAID 1. Is this really how BlazeGraph does it? Don't they
have a proper cluster solution, where they repartition data across servers?
Or are these independent servers a Wikimedia staff homebuild?

It all depends on your definition of a cluster. We have groups of
machines collectively serving some coherent traffic, but each machine
is completely independent from the others. So yes, the comparison to RAID 1
is adequate.


Some info here:

- We evaluated some stores according to their performance: 
http://www.semantic-web-journal.net/content/evaluation-metadata-representations-rdf-stores-0
  "Evaluation of Metadata Representations in RDF stores"

Thanks for the link! That looks quite interesting!


- Virtuoso has proven quite useful. I don't want to advertise here, but the
thing they have going for DBpedia uses ridiculous hardware, i.e. 64GB RAM,
and it is also the open source version, not the professional one with
clustering and repartitioning capability. So we have been playing this game
for ten years now: everybody tries other databases, but then most people
come back to Virtuoso. I have to admit that OpenLink is maintaining the
hosting for DBpedia themselves, so they know how to optimise. They normally
have large banks as customers, with millions of write transactions per hour.
In LOD2 they also implemented column-store features with MonetDB and
repartitioning in clusters.

I'm not entirely sure how to read the above (and a quick look at the
Virtuoso website does not give me the answer either), but it looks like
the sharding / partitioning options are only available in the enterprise
version.

Re: [Wikidata] Scaling Wikidata Query Service

2019-06-10 Thread Guillaume Lederrey
Hello!

On Mon, Jun 10, 2019 at 4:28 PM Sebastian Hellmann
 wrote:
>
> Hi Guillaume,
>
> On 06.06.19 21:32, Guillaume Lederrey wrote:
>
> Hello all!
>
> There have been a number of concerns raised about the performance and
> scaling of Wikidata Query Service. We share those concerns and we are
> doing our best to address them. Here is some info about what is going
> on:
>
> In an ideal world, WDQS should:
>
> * scale in terms of data size
> * scale in terms of number of edits
> * have low update latency
> * expose a SPARQL endpoint for queries
> * allow anyone to run any queries on the public WDQS endpoint
> * provide great query performance
> * provide a high level of availability
>
> Scaling graph databases is a "known hard problem", and we are reaching
> a scale where there are no obvious easy solutions to address all the
> above constraints. At this point, just "throwing hardware at the
> problem" is not an option anymore. We need to go deeper into the
> details and potentially make major changes to the current architecture.
> Some scaling considerations are discussed in [1]. This is going to take
> time.
>
> I am not sure how to evaluate this correctly. Scaling databases in general is 
> a "known hard problem" and graph databases a sub-field of it, which are 
> optimized for graph-like queries as opposed to column stores or relational 
> databases. If you say that "throwing hardware at the problem" does not help, 
> you are admitting that Blazegraph does not scale for what is needed by 
> Wikidata.

Yes, I am admitting that Blazegraph (at least in the way we are using
it at the moment) does not scale to our future needs. Blazegraph does
have support for sharding (what they call "Scale Out"). And yes, we
need to have a closer look at how that works. I'm not the expert here,
so I won't even try to assert if that's a viable solution or not.

> From [1]:
>
> At the moment, each WDQS cluster is a group of independent servers, sharing 
> nothing, with each server independently updated and each server holding a 
> full data set.
>
> Then it is not a "cluster" in the sense of databases. It is more a redundancy
> architecture like RAID 1. Is this really how BlazeGraph does it? Don't they
> have a proper cluster solution, where they repartition data across servers?
> Or are these independent servers a Wikimedia staff homebuild?

It all depends on your definition of a cluster. We have groups of
machines collectively serving some coherent traffic, but each machine
is completely independent from the others. So yes, the comparison to RAID 1
is adequate.

> Some info here:
>
> - We evaluated some stores according to their performance: 
> http://www.semantic-web-journal.net/content/evaluation-metadata-representations-rdf-stores-0
>   "Evaluation of Metadata Representations in RDF stores"

Thanks for the link! That looks quite interesting!

> - Virtuoso has proven quite useful. I don't want to advertise here, but the
> thing they have going for DBpedia uses ridiculous hardware, i.e. 64GB RAM,
> and it is also the open source version, not the professional one with
> clustering and repartitioning capability. So we have been playing this game
> for ten years now: everybody tries other databases, but then most people
> come back to Virtuoso. I have to admit that OpenLink is maintaining the
> hosting for DBpedia themselves, so they know how to optimise. They normally
> have large banks as customers, with millions of write transactions per hour.
> In LOD2 they also implemented column-store features with MonetDB and
> repartitioning in clusters.

I'm not entirely sure how to read the above (and a quick look at
the Virtuoso website does not give me the answer either), but it looks
like the sharding / partitioning options are only available in the
enterprise version. That probably makes it a non-starter for us.

> - I recently heard a presentation from ArangoDB and they had a good cluster
> concept as well, although I don't know anybody who has tried it. The slides
> seemed to make sense.

Nice, another one to add to our list of options to test.

> All the best,
>
> Sebastian
>
>
>
>
> Reasonably, addressing all of the above constraints is unlikely to
> ever happen. Some of the constraints are non negotiable: if we can't
> keep up with Wikidata in term of data size or number of edits, it does
> not make sense to address query performance. On some constraints, we
> will probably need to compromise.
>
> For example, the update process is asynchronous. It is by nature
> expected to lag. In the best case, this lag is measured in minutes,
> but can climb to hours occasionally. This is a case of prioritizing
> stability and correctness (ingesting all edits) over update latency.
> And while we can work to reduce the maximum latency, this will still
> be an asynchronous process and needs to be considered as such.
>
> We currently have one Blazegraph expert working with us to address a
> number of performance and stability issues. We
> are planning to hire an additional engineer to help us support the
> service in the long term.

Re: [Wikidata] Scaling Wikidata Query Service

2019-06-10 Thread Sebastian Hellmann

Hi Guillaume,

On 06.06.19 21:32, Guillaume Lederrey wrote:

Hello all!

There have been a number of concerns raised about the performance and
scaling of Wikidata Query Service. We share those concerns and we are
doing our best to address them. Here is some info about what is going
on:

In an ideal world, WDQS should:

* scale in terms of data size
* scale in terms of number of edits
* have low update latency
* expose a SPARQL endpoint for queries
* allow anyone to run any queries on the public WDQS endpoint
* provide great query performance
* provide a high level of availability

Scaling graph databases is a "known hard problem", and we are reaching
a scale where there are no obvious easy solutions to address all the
above constraints. At this point, just "throwing hardware at the
problem" is not an option anymore. We need to go deeper into the
details and potentially make major changes to the current architecture.
Some scaling considerations are discussed in [1]. This is going to take
time.


I am not sure how to evaluate this correctly. Scaling databases in 
general is a "known hard problem" and graph databases a sub-field of it, 
which are optimized for graph-like queries as opposed to column stores 
or relational databases. If you say that "throwing hardware at the 
problem" does not help, you are admitting that Blazegraph does not scale 
for what is needed by Wikidata.


From [1]:

At the moment, each WDQS cluster is a group of independent servers, 
sharing nothing, with each server independently updated and each 
server holding a full data set.


Then it is not a "cluster" in the sense of databases. It is more a
redundancy architecture like RAID 1. Is this really how BlazeGraph does
it? Don't they have a proper cluster solution, where they repartition
data across servers? Or are these independent servers a Wikimedia staff
homebuild?


Some info here:

- We evaluated some stores according to their performance: 
http://www.semantic-web-journal.net/content/evaluation-metadata-representations-rdf-stores-0 
"Evaluation of Metadata Representations in RDF stores"


- Virtuoso has proven quite useful. I don't want to advertise here, but
the thing they have going for DBpedia uses ridiculous hardware, i.e.
64GB RAM, and it is also the open source version, not the professional
one with clustering and repartitioning capability. So we have been
playing this game for ten years now: everybody tries other databases,
but then most people come back to Virtuoso. I have to admit that OpenLink
is maintaining the hosting for DBpedia themselves, so they know how to
optimise. They normally have large banks as customers, with millions of
write transactions per hour. In LOD2 they also implemented column-store
features with MonetDB and repartitioning in clusters.


- I recently heard a presentation from ArangoDB and they had a good
cluster concept as well, although I don't know anybody who has tried it.
The slides seemed to make sense.


All the best,

Sebastian





Reasonably, addressing all of the above constraints is unlikely to
ever happen. Some of the constraints are non negotiable: if we can't
keep up with Wikidata in term of data size or number of edits, it does
not make sense to address query performance. On some constraints, we
will probably need to compromise.

For example, the update process is asynchronous. It is by nature
expected to lag. In the best case, this lag is measured in minutes,
but can climb to hours occasionally. This is a case of prioritizing
stability and correctness (ingesting all edits) over update latency.
And while we can work to reduce the maximum latency, this will still
be an asynchronous process and needs to be considered as such.

We currently have one Blazegraph expert working with us to address a
number of performance and stability issues. We
are planning to hire an additional engineer to help us support the
service in the long term. You can follow our current work in phabricator [2].

If anyone has experience with scaling large graph databases, please
reach out to us, we're always happy to share ideas!

Thanks all for your patience!

Guillaume

[1] https://wikitech.wikimedia.org/wiki/Wikidata_query_service/ScalingStrategy
[2] https://phabricator.wikimedia.org/project/view/1239/


--
All the best,
Sebastian Hellmann

Director of Knowledge Integration and Linked Data Technologies (KILT) 
Competence Center

at the Institute for Applied Informatics (InfAI) at Leipzig University
Executive Director of the DBpedia Association
Projects: http://dbpedia.org, http://nlp2rdf.org, 
http://linguistics.okfn.org, https://www.w3.org/community/ld4lt 


Homepage: http://aksw.org/SebastianHellmann
Research Group: http://aksw.org
___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata


Re: [Wikidata] Scaling Wikidata Query Service

2019-06-09 Thread Amirouche Boubekki
I made a proposal for a grant at
https://meta.wikimedia.org/wiki/Grants:Project/WDQS_On_FoundationDB

Mind that this is not about the versioned quadstore. It is about a
simple triplestore; it is mainly missing bindings for FoundationDB and
SPARQL syntax.
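
For context, the usual key-value layout for such a triplestore looks
something like this (a sketch of the general technique, not the
proposal's actual code; it assumes the fdb Python bindings and a
reachable FoundationDB cluster):

    # Sketch: store each triple under several key orderings (SPO, POS, OSP)
    # so any single-variable pattern can be answered by one range read.
    import fdb

    fdb.api_version(620)
    db = fdb.open()  # assumes a local FoundationDB cluster file

    def put_triple(tr, s, p, o):
        for name, key in (("spo", (s, p, o)),
                          ("pos", (p, o, s)),
                          ("osp", (o, s, p))):
            tr[fdb.tuple.pack((name,) + key)] = b""

    @fdb.transactional
    def add(tr):
        put_triple(tr, "Q42", "P31", "Q5")

    add(db)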

Also, I will probably need help to interface with the geo and label services.

Feedback welcome!
___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata


Re: [Wikidata] Scaling Wikidata Query Service

2019-06-07 Thread Amirouche Boubekki
Le jeu. 6 juin 2019 à 21:33, Guillaume Lederrey  a
écrit :

> Hello all!
>
> There have been a number of concerns raised about the performance and
> scaling of Wikidata Query Service. We share those concerns and we are
> doing our best to address them. Here is some info about what is going
> on:
>
> In an ideal world, WDQS should:
>
> * scale in terms of data size
> * scale in terms of number of edits
> * have low update latency
> * expose a SPARQL endpoint for queries
> * allow anyone to run any queries on the public WDQS endpoint
> * provide great query performance
> * provide a high level of availability
>

I will add that, in an ideal world, setting up Wikidata - i.e. the interface
that allows edits, the entity search service, and WDQS - should be easier.

Wikidata tools should be (more) accessible.


> Scaling graph databases is a "known hard problem", and we are reaching
> a scale where there are no obvious easy solutions to address all the
> above constraints. At this point, just "throwing hardware at the
> problem" is not an option anymore.


Reasonably, addressing all of the above constraints is unlikely to
> ever happen.


never say never ;-)


> For example, the update process is asynchronous. It is by nature
> expected to lag. In the best case, this lag is measured in minutes,
> but can climb to hours occasionally. This is a case of prioritizing
> stability and correctness (ingesting all edits) over update latency.
> And while we can work to reduce the maximum latency, this will still
> be an asynchronous process and needs to be considered as such.
>


> We currently have one Blazegraph expert working with us to address a
> number of performance and stability issues. We
> are planning to hire an additional engineer to help us support the
> service in the long term. You can follow our current work in phabricator
> [2].
>
> If anyone has experience with scaling large graph databases, please
> reach out to us, we're always happy to share ideas!
>

Good luck!


> Thanks all for your patience!
>
>Guillaume
>
> [1]
> https://wikitech.wikimedia.org/wiki/Wikidata_query_service/ScalingStrategy


Here is my point of view regarding some discussion happening in the talk
page:

> Giving up on SPARQL.

There is an ongoing effort to draft a 1.2
version of SPARQL. It is the right time to give some feedback.

Also, look at https://github.com/w3c/EasierRDF/

> JanusGraph (successor of Titan, now part of
DataStax) - written in Java, using scalable data storage (Cassandra/HBase)
and indexing engines (Elasticsearch/Solr), queryable

That would make Wikidata much less accessible, even if JanusGraph has an
Oracle Berkeley DB backend. The full-text search and geospatial indices run
in yet another process.

> I can't think of any other way than transforming the wikidata RDF
representation to a more suitable one for graph-properties engines

FWIW, OpenCog's AtomSpace has a Neo4j backend, but they do not use it.

Also, property-graph engines make it slow to represent facts like:

("wikidata", "used-by", "opencog")
("wikidata", "used-by", "google")

That is, one has to create a hyper-edge if you want to be able to query
those facts.
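
To unpack that (a toy sketch, with made-up names): in a triple store the
predicate is ordinary, queryable data, while a property-graph model
typically reifies each statement as its own edge object:

    # Sketch: the same facts as triples vs. property-graph style edge
    # objects. In the triple form the predicate is queried like any value;
    # in the edge-object form each statement is reified ("hyper-edge").
    triples = [
        ("wikidata", "used-by", "opencog"),
        ("wikidata", "used-by", "google"),
    ]
    print([o for s, p, o in triples
           if s == "wikidata" and p == "used-by"])

    edges = [  # property-graph style: one object per statement
        {"from": "wikidata", "label": "used-by", "to": "opencog"},
        {"from": "wikidata", "label": "used-by", "to": "google"},
    ]
    print([e["to"] for e in edges
           if e["from"] == "wikidata" and e["label"] == "used-by"])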


> [2] https://phabricator.wikimedia.org/project/view/1239/



Best regards,


Amirouche ~ amz3
___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata


Re: [Wikidata] Scaling Wikidata Query Service

2019-06-06 Thread Daniel Mietchen
Thanks, Guillaume - this is very helpful, and it would be great to
have similar information posted/collected on other kinds of limits
and potential approaches to addressing them.

Some weeks ago, we started a project to keep track of such limits,
and I have added pointers to your information there:
https://www.wikidata.org/wiki/Wikidata:WikiProject_Limits_of_Wikidata .

If anyone is aware of similar discussions for any of the other limits,
please edit that page to include pointers to those discussions.

Thanks!

Daniel

On Thu, Jun 6, 2019 at 9:33 PM Guillaume Lederrey
 wrote:
>
> Hello all!
>
> There have been a number of concerns raised about the performance and
> scaling of Wikidata Query Service. We share those concerns and we are
> doing our best to address them. Here is some info about what is going
> on:
>
> In an ideal world, WDQS should:
>
> * scale in terms of data size
> * scale in terms of number of edits
> * have low update latency
> * expose a SPARQL endpoint for queries
> * allow anyone to run any queries on the public WDQS endpoint
> * provide great query performance
> * provide a high level of availability
>
> Scaling graph databases is a "known hard problem", and we are reaching
> a scale where there are no obvious easy solutions to address all the
> above constraints. At this point, just "throwing hardware at the
> problem" is not an option anymore. We need to go deeper into the
> details and potentially make major changes to the current architecture.
> Some scaling considerations are discussed in [1]. This is going to take
> time.
>
> Reasonably, addressing all of the above constraints is unlikely to
> ever happen. Some of the constraints are non-negotiable: if we can't
> keep up with Wikidata in terms of data size or number of edits, it does
> not make sense to address query performance. On some constraints, we
> will probably need to compromise.
>
> For example, the update process is asynchronous. It is by nature
> expected to lag. In the best case, this lag is measured in minutes,
> but can climb to hours occasionally. This is a case of prioritizing
> stability and correctness (ingesting all edits) over update latency.
> And while we can work to reduce the maximum latency, this will still
> be an asynchronous process and needs to be considered as such.
>
> We currently have one Blazegraph expert working with us to address a
> number of performance and stability issues. We
> are planning to hire an additional engineer to help us support the
> service in the long term. You can follow our current work in phabricator [2].
>
> If anyone has experience with scaling large graph databases, please
> reach out to us, we're always happy to share ideas!
>
> Thanks all for your patience!
>
>Guillaume
>
> [1] https://wikitech.wikimedia.org/wiki/Wikidata_query_service/ScalingStrategy
> [2] https://phabricator.wikimedia.org/project/view/1239/
>
> --
> Guillaume Lederrey
> Engineering Manager, Search Platform
> Wikimedia Foundation
> UTC+2 / CEST
>


Re: [Wikidata] Scaling Wikidata Query Service

2019-06-06 Thread Gerard Meijssen
Hoi,
Thank you for this answer. It helps to understand and appreciate the work
that is done. Without updates like this, it becomes increasingly hard to
stay confident that our future will remain bright.
Thanks,
   GerardM

On Thu, 6 Jun 2019 at 21:33, Guillaume Lederrey 
wrote:

> Hello all!
>
> There have been a number of concerns raised about the performance and
> scaling of Wikidata Query Service. We share those concerns and we are
> doing our best to address them. Here is some info about what is going
> on:
>
> In an ideal world, WDQS should:
>
> * scale in terms of data size
> * scale in terms of number of edits
> * have low update latency
> * expose a SPARQL endpoint for queries
> * allow anyone to run any queries on the public WDQS endpoint
> * provide great query performance
> * provide a high level of availability
>
> Scaling graph databases is a "known hard problem", and we are reaching
> a scale where there are no obvious easy solutions to address all the
> above constraints. At this point, just "throwing hardware at the
> problem" is not an option anymore. We need to go deeper into the
> details and potentially make major changes to the current architecture.
> Some scaling considerations are discussed in [1]. This is going to take
> time.
>
> Reasonably, addressing all of the above constraints is unlikely to
> ever happen. Some of the constraints are non-negotiable: if we can't
> keep up with Wikidata in terms of data size or number of edits, it does
> not make sense to address query performance. On some constraints, we
> will probably need to compromise.
>
> For example, the update process is asynchronous. It is by nature
> expected to lag. In the best case, this lag is measured in minutes,
> but can climb to hours occasionally. This is a case of prioritizing
> stability and correctness (ingesting all edits) over update latency.
> And while we can work to reduce the maximum latency, this will still
> be an asynchronous process and needs to be considered as such.
>
> We currently have one Blazegraph expert working with us to address a
> number of performance and stability issues. We
> are planning to hire an additional engineer to help us support the
> service in the long term. You can follow our current work in phabricator
> [2].
>
> If anyone has experience with scaling large graph databases, please
> reach out to us, we're always happy to share ideas!
>
> Thanks all for your patience!
>
>Guillaume
>
> [1]
> https://wikitech.wikimedia.org/wiki/Wikidata_query_service/ScalingStrategy
> [2] https://phabricator.wikimedia.org/project/view/1239/
>
> --
> Guillaume Lederrey
> Engineering Manager, Search Platform
> Wikimedia Foundation
> UTC+2 / CEST
>