date:20190610

gerritbot added a comment.


  Change 516444 had a related patch set uploaded (by Smalyshev; owner: 
Smalyshev):
  [operations/puppet@production] Set up dumps for mediainfo RDF generation
  
  https://gerrit.wikimedia.org/r/516444

TASK DETAIL
  https://phabricator.wikimedia.org/T221917

EMAIL PREFERENCES
  https://phabricator.wikimedia.org/settings/panel/emailpreferences/

To: gerritbot
Cc: WMDE-leszek, Poyekhali, Steinsplitter, Aklapper, Lydia_Pintscher, 
Bugreporter, Tgr, Ramsey-WMF, Jarekt, Addshore, Tpt, MB-one, Salgo60, 
Lucas_Werkmeister_WMDE, Smalyshev, E.S.A-Sheild, darthmon_wmde, joker88john, 
CucyNoiD, Nandana, NebulousIris, JKSTNK, Gaboe420, Versusxo, 
Majesticalreaper22, Giuliamocci, Adrian1985, Cpaulf30, Lahi, PDrouin-WMF, Gq86, 
Baloch007, E1presidente, Cparle, Darkminds3113, Anooprao, SandraF_WMF, 
Bsandipan, Lordiis, GoranSMilovanovic, Adik2382, Th3d3v1ls, Ramalepe, Liugev6, 
QZanden, EBjune, Tramullas, Acer, merbst, LawExplorer, WSH1906, Lewizho99, 
Maathavan, Silverfish, _jensen, rosalieper, Jonas, Xmlizer, Susannaanas, 
Wong128hk, Jane023, jkroll, Wikidata-bugs, Jdouglas, Base, matthiasmullie, 
aude, Tobias1984, El_Grafo, Dinoguy1000, Manybubbles, Ricordisamoa, Wesalius, 
Fabrice_Florin, Raymond, Jdforrester-WMF, Mbch331, Keegan
___
Wikidata-bugs mailing list
Wikidata-bugs@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata-bugs

[Wikidata-bugs] [Maniphest] [Commented On] T221917: Create RDF dump of structured data on Commons

gerritbot added a comment.


  Change 516441 had a related patch set uploaded (by Smalyshev; owner: 
Smalyshev):
  [mediawiki/extensions/Wikibase@master] Add option --ignore-missing to dumper
  
  https://gerrit.wikimedia.org/r/516441

TASK DETAIL
  https://phabricator.wikimedia.org/T221917

EMAIL PREFERENCES
  https://phabricator.wikimedia.org/settings/panel/emailpreferences/

To: gerritbot
Cc: WMDE-leszek, Poyekhali, Steinsplitter, Aklapper, Lydia_Pintscher, 
Bugreporter, Tgr, Ramsey-WMF, Jarekt, Addshore, Tpt, MB-one, Salgo60, 
Lucas_Werkmeister_WMDE, Smalyshev, darthmon_wmde, Nandana, JKSTNK, Lahi, 
PDrouin-WMF, Gq86, E1presidente, Cparle, Anooprao, SandraF_WMF, 
GoranSMilovanovic, QZanden, EBjune, Tramullas, Acer, merbst, LawExplorer, 
Silverfish, _jensen, rosalieper, Jonas, Xmlizer, Susannaanas, Wong128hk, 
Jane023, jkroll, Wikidata-bugs, Jdouglas, Base, matthiasmullie, aude, 
Tobias1984, El_Grafo, Dinoguy1000, Manybubbles, Ricordisamoa, Wesalius, 
Fabrice_Florin, Raymond, Jdforrester-WMF, Mbch331, Keegan
___
Wikidata-bugs mailing list
Wikidata-bugs@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata-bugs

[Wikidata-bugs] [Maniphest] [Updated] T221917: Create RDF dump of structured data on Commons

gerritbot added a project: Patch-For-Review.

TASK DETAIL
  https://phabricator.wikimedia.org/T221917

EMAIL PREFERENCES
  https://phabricator.wikimedia.org/settings/panel/emailpreferences/

To: gerritbot
Cc: WMDE-leszek, Poyekhali, Steinsplitter, Aklapper, Lydia_Pintscher, 
Bugreporter, Tgr, Ramsey-WMF, Jarekt, Addshore, Tpt, MB-one, Salgo60, 
Lucas_Werkmeister_WMDE, Smalyshev, E.S.A-Sheild, darthmon_wmde, joker88john, 
CucyNoiD, Nandana, NebulousIris, JKSTNK, Gaboe420, Versusxo, 
Majesticalreaper22, Giuliamocci, Adrian1985, Cpaulf30, Lahi, PDrouin-WMF, Gq86, 
Baloch007, E1presidente, Cparle, Darkminds3113, Anooprao, SandraF_WMF, 
Bsandipan, Lordiis, GoranSMilovanovic, Adik2382, Th3d3v1ls, Ramalepe, Liugev6, 
QZanden, EBjune, Tramullas, Acer, merbst, LawExplorer, WSH1906, Lewizho99, 
Maathavan, Silverfish, _jensen, rosalieper, Jonas, Xmlizer, Susannaanas, 
Wong128hk, Jane023, jkroll, Wikidata-bugs, Jdouglas, Base, matthiasmullie, 
aude, Tobias1984, El_Grafo, Dinoguy1000, Manybubbles, Ricordisamoa, Wesalius, 
Fabrice_Florin, Raymond, Jdforrester-WMF, Mbch331, Keegan
___
Wikidata-bugs mailing list
Wikidata-bugs@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata-bugs

[Wikidata-bugs] [Maniphest] [Unblock] T221916: Create RDF export for structured data stored for files

2019-06-10 Thread Smalyshev

Smalyshev closed subtask T222995: Decide which prefixes to use for MediaInfo 
RDF as Resolved.

TASK DETAIL
  https://phabricator.wikimedia.org/T221916

EMAIL PREFERENCES
  https://phabricator.wikimedia.org/settings/panel/emailpreferences/

To: Smalyshev
Cc: WMDE-leszek, Poyekhali, Steinsplitter, Aklapper, Lydia_Pintscher, Tgr, 
Ramsey-WMF, Jarekt, Addshore, Tpt, MB-one, Salgo60, Lucas_Werkmeister_WMDE, 
Smalyshev, E.S.A-Sheild, darthmon_wmde, joker88john, CucyNoiD, Nandana, 
NebulousIris, JKSTNK, Gaboe420, Versusxo, Majesticalreaper22, Giuliamocci, 
Adrian1985, Cpaulf30, Lahi, PDrouin-WMF, Gq86, Baloch007, E1presidente, Cparle, 
Darkminds3113, Anooprao, SandraF_WMF, Bsandipan, Lordiis, GoranSMilovanovic, 
Adik2382, Th3d3v1ls, Ramalepe, Liugev6, QZanden, EBjune, Tramullas, Acer, 
merbst, LawExplorer, WSH1906, Lewizho99, Maathavan, Silverfish, _jensen, 
rosalieper, Cirdan, Jonas, Xmlizer, Susannaanas, Wong128hk, Jane023, jkroll, 
Wikidata-bugs, Jdouglas, Base, matthiasmullie, aude, Tobias1984, El_Grafo, 
Dinoguy1000, Manybubbles, Ricordisamoa, Wesalius, Fabrice_Florin, Raymond, 
Jdforrester-WMF, Mbch331, Keegan
___
Wikidata-bugs mailing list
Wikidata-bugs@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata-bugs

[Wikidata-bugs] [Maniphest] [Closed] T222995: Decide which prefixes to use for MediaInfo RDF

2019-06-10 Thread Smalyshev

Smalyshev closed this task as "Resolved".
Smalyshev claimed this task.

TASK DETAIL
  https://phabricator.wikimedia.org/T222995

EMAIL PREFERENCES
  https://phabricator.wikimedia.org/settings/panel/emailpreferences/

To: Smalyshev
Cc: Denny, mkroetzsch, Mholloway, Cparle, daniel, Lucas_Werkmeister_WMDE, 
Salgo60, MB-one, Tpt, Addshore, Jarekt, Ramsey-WMF, Tgr, Lydia_Pintscher, 
Aklapper, Steinsplitter, Poyekhali, WMDE-leszek, Smalyshev, darthmon_wmde, 
Nandana, JKSTNK, Lahi, PDrouin-WMF, Gq86, E1presidente, Anooprao, SandraF_WMF, 
GoranSMilovanovic, QZanden, EBjune, Tramullas, Acer, merbst, LawExplorer, 
Silverfish, _jensen, rosalieper, Jonas, Xmlizer, Susannaanas, Wong128hk, 
Jane023, jkroll, Wikidata-bugs, Jdouglas, Base, matthiasmullie, aude, 
Tobias1984, El_Grafo, Dinoguy1000, Manybubbles, Ricordisamoa, Wesalius, 
Fabrice_Florin, Raymond, Jdforrester-WMF, Mbch331, Keegan
___
Wikidata-bugs mailing list
Wikidata-bugs@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata-bugs

[Wikidata-bugs] [Maniphest] [Merged] T225364: Class undefined: Wikibase\DataModel\Entity\ItemId in WikibaseLib.entitytypes.php on line 35

2019-06-10 Thread matmarex

matmarex closed this task as a duplicate of T189560: mediawiki/vendor REL1_* no 
longer ship dependencies for wmf extensions that are not in the mediawiki 
tarball.

TASK DETAIL
  https://phabricator.wikimedia.org/T225364

EMAIL PREFERENCES
  https://phabricator.wikimedia.org/settings/panel/emailpreferences/

To: matmarex
Cc: Paladox, Aklapper, Reedy, darthmon_wmde, Redabr4, Zanziii, Sadisticturd, 
Nandana, A.S.Kochergin, Lahi, Gq86, GoranSMilovanovic, RazeSoldier, QZanden, 
LawExplorer, _jensen, rosalieper, Jonas, MGChecker, Wikidata-bugs, aude, 
Lydia_Pintscher, Mbch331, hashar
___
Wikidata-bugs mailing list
Wikidata-bugs@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata-bugs

[Wikidata] [OM-2019] Final CFP: 14th workshop on Ontology Matching collocated with ISWC

2019-06-10 Thread Pavel Shvaiko

... Most often we need to integrate together data sources that were not
aiming at their
integration while being designed, thus, increasing the difficulty of the
matching operation.
Even if a good progress has been made in the matching field as such,
ontology
matching may appear to be virtually impossible. Indeed, for finding the
correspondences
between entities, it is necessary to understand their meaning ...

--
   FINAL CALL FOR CONTRIBUTIONS
   THE SUBMISSION DEADLINE IS APPROACHING ON JUNE 28TH, 2019
--

The Fourteenth International Workshop on
ONTOLOGY MATCHING
(OM-2019)
 http://om2019.ontologymatching.org/
   October 26th or 27th, 2019, ISWC Workshop Program,
  Auckland, New Zealand


BRIEF DESCRIPTION AND OBJECTIVES
Ontology matching is a key interoperability enabler for the Semantic Web,
as well as a useful technique in some classical data integration tasks
dealing with the semantic heterogeneity problem. It takes ontologies
as input and determines as output an alignment, that is, a set of
correspondences between the semantically related entities of those
ontologies.
These correspondences can be used for various tasks, such as ontology
merging, data interlinking, query answering or process mapping.
Thus, matching ontologies enables the knowledge and data expressed
with the matched ontologies to interoperate.

The workshop has three goals:
1.
To bring together leaders from academia, industry and user institutions
to assess how academic advances are addressing real-world requirements.
The workshop will strive to improve academic awareness of industrial
and final user needs, and therefore, direct research towards those needs.
Simultaneously, the workshop will serve to inform industry and user
representatives about existing research efforts that may meet their
requirements. The workshop will also investigate how the ontology
matching technology is going to evolve, especially with respect to
data interlinking, process mapping and web table matching tasks.

2.
To conduct an extensive and rigorous evaluation of ontology matching
and instance matching (link discovery) approaches through
the OAEI (Ontology Alignment Evaluation Initiative) 2019 campaign:
http://oaei.ontologymatching.org/2019/

3. To examine new uses, similarities and differences from database
schema matching, which has received decades of attention
but is just beginning to transition to mainstream tools.


This year, in sync with the main conference, we encourage submissions
specifically devoted to: (i) datasets, benchmarks and replication studies,
services, software, methodologies, protocols and measures
(not necessarily related to OAEI), and (ii) application of
the matching technology in real-life scenarios and assessment
of its usefulness to the final users.


TOPICS of interest include but are not limited to:
Business and use cases for matching (e.g., big, open, closed data);
Requirements to matching from specific application scenarios (e.g.,
public sector, homeland security);
Application of matching techniques in real-world scenarios (e.g., with
environmental data);
Formal foundations and frameworks for matching;
Matching and knowledge graphs;
Matching and deep learning;
Matching and embeddings;
Matching and big data;
Matching and linked data;
Instance matching, data interlinking and relations between them;
Privacy-aware matching;
Process model matching;
Large-scale and efficient matching techniques;
Matcher selection, combination and tuning;
User involvement (including both technical and organizational aspects);
Explanations in matching;
Social and collaborative matching;
Uncertainty in matching;
Reasoning with alignments;
Alignment coherence and debugging;
Alignment management;
Matching for traditional applications (e.g., data science);
Matching for emerging applications (e.g., web tables, knowledge graphs).


SUBMISSIONS
Contributions to the workshop can be made in terms of technical papers and
posters/statements of interest addressing different issues of ontology
matching
as well as participating in the OAEI 2019 campaign. Long technical papers
should
be of max. 12 pages. Short technical papers should be of max. 5 pages.
Posters/statements of interest should not exceed 2 pages.
All contributions have to be prepared using the LNCS Style:
http://www.springer.com/computer/lncs?SGWID=0-164-6-793341-0
and should be submitted in PDF format (no later than June 28th, 2019)
through the workshop submission site at:

https://www.easychair.org/conferences/?conf=om2019

Contributors to the OAEI 2019 campaign have to follow the campaign
conditions
and

[Wikidata-bugs] [Maniphest] [Updated] T194101: Add more metrics to WDQS

2019-06-10 Thread Maintenance_bot

Maintenance_bot removed a project: Patch-For-Review.

TASK DETAIL
  https://phabricator.wikimedia.org/T194101

EMAIL PREFERENCES
  https://phabricator.wikimedia.org/settings/panel/emailpreferences/

To: Maintenance_bot
Cc: Aklapper, Gehel, Smalyshev, darthmon_wmde, ET4Eva, Nandana, Lahi, Gq86, 
Darkminds3113, Lucas_Werkmeister_WMDE, GoranSMilovanovic, QZanden, EBjune, 
merbst, LawExplorer, Avner, _jensen, rosalieper, Jonas, FloNight, Xmlizer, 
jkroll, Wikidata-bugs, Jdouglas, aude, Tobias1984, Manybubbles, Mbch331, 
E.S.A-Sheild, joker88john, CucyNoiD, NebulousIris, Gaboe420, Versusxo, 
Majesticalreaper22, Giuliamocci, Adrian1985, Cpaulf30, Baloch007, Bsandipan, 
Lordiis, Adik2382, Th3d3v1ls, Ramalepe, Liugev6, WSH1906, Lewizho99, Maathavan
___
Wikidata-bugs mailing list
Wikidata-bugs@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata-bugs

[Wikidata-bugs] [Maniphest] [Commented On] T194101: Add more metrics to WDQS

gerritbot added a comment.


  Change 514645 **merged** by jenkins-bot:
  [wikidata/query/rdf@master] Add metering of federation requests
  
  https://gerrit.wikimedia.org/r/514645

TASK DETAIL
  https://phabricator.wikimedia.org/T194101

EMAIL PREFERENCES
  https://phabricator.wikimedia.org/settings/panel/emailpreferences/

To: gerritbot
Cc: Aklapper, Gehel, Smalyshev, E.S.A-Sheild, darthmon_wmde, joker88john, 
ET4Eva, CucyNoiD, Nandana, NebulousIris, Gaboe420, Versusxo, 
Majesticalreaper22, Giuliamocci, Adrian1985, Cpaulf30, Lahi, Gq86, Baloch007, 
Darkminds3113, Bsandipan, Lordiis, Lucas_Werkmeister_WMDE, GoranSMilovanovic, 
Adik2382, Th3d3v1ls, Ramalepe, Liugev6, QZanden, EBjune, merbst, LawExplorer, 
WSH1906, Avner, Lewizho99, Maathavan, _jensen, rosalieper, Jonas, FloNight, 
Xmlizer, jkroll, Wikidata-bugs, Jdouglas, aude, Tobias1984, Manybubbles, Mbch331
___
Wikidata-bugs mailing list
Wikidata-bugs@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata-bugs

Re: [Wikidata] Scaling Wikidata Query Service

Yes, I can ask. I am talking a lot with them as we are redeploying 
DBpedia live and also pushing the new DBpedia to them soon.


I think, they also had a specific issue with how Wikidata does linked 
data, but I didn't get it, as it was mentioned too briefly.


All the best,

Sebastian


On 10.06.19 22:46, Stas Malyshev wrote:

Hi!


thanks for the elaboration. I can understand the background much better.
I have to admit, that I am also not a real expert, but very close to the
real experts like Vidal and Rahm who are co-authors of the SWJ paper or
the OpenLink devs.

If you know anybody at OpenLink that would be interested in trying to
evaluate such thing (i.e. how Wikidata could be hosted on Virtuso) and
provide support for this project, it would be interesting to discuss it.
While open-source thing is still a barrier and in general the
requirements are different, at least discussing it and maybe getting
some numbers might be useful.

Thanks,

--
All the best,
Sebastian Hellmann

Director of Knowledge Integration and Linked Data Technologies (KILT) 
Competence Center

at the Institute for Applied Informatics (InfAI) at Leipzig University
Executive Director of the DBpedia Association
Projects: http://dbpedia.org, http://nlp2rdf.org, 
http://linguistics.okfn.org, https://www.w3.org/community/ld4lt 


Homepage: http://aksw.org/SebastianHellmann
Research Group: http://aksw.org
___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata

Re: [Wikidata] Scaling Wikidata Query Service

2019-06-10 Thread Stas Malyshev

Hi!

> thanks for the elaboration. I can understand the background much better.
> I have to admit, that I am also not a real expert, but very close to the
> real experts like Vidal and Rahm who are co-authors of the SWJ paper or
> the OpenLink devs.

If you know anybody at OpenLink that would be interested in trying to
evaluate such thing (i.e. how Wikidata could be hosted on Virtuso) and
provide support for this project, it would be interesting to discuss it.
While open-source thing is still a barrier and in general the
requirements are different, at least discussing it and maybe getting
some numbers might be useful.

Thanks,
-- 
Stas Malyshev
smalys...@wikimedia.org

___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata

Re: [Wikidata] Scaling Wikidata Query Service

Hi Stas,

thanks for the elaboration. I can understand the background much better.
I have to admit, that I am also not a real expert, but very close to the
real experts like Vidal and Rahm who are co-authors of the SWJ paper or
the OpenLink devs.

I am also spoiled, because OpenLink solves the hosting for DBpedia and
also DBpedia-live with ca. 130k updates per day for the English
Wikipedia. I think, this one is the recent report:
https://medium.com/virtuoso-blog/dbpedia-usage-report-as-of-2018-01-01-8cae1b81ca71
Then again DBpedia didn't grow for a while, but we made a "Best of" now
[1]. But will not host it all.

[1] https://svn.aksw.org/papers/2019/ISWC_FlexiFusion/public.pdf

I also see that your context is difficult. Maybe you can custom
shard/scaleout Blazegraph based on the queries and then replicate the
sharded clusters. Like a mix between sharding and replication, maybe
just 3 * 3 servers instead of 9 replicated ones or 9 servers full of
shards. Not so many options here, if you have the OS requirement. I
guess you are caching static content already as much as possible.

This also seems pretty much what I know, but it really is all second
hand as my expertise is more focused on what's inside the database.

All the best,

Sebastian

On 10.06.19 22:02, Stas Malyshev wrote:

Hi!

I am not sure how to evaluate this correctly. Scaling databases in
general is a "known hard problem" and graph databases a sub-field of it,
which are optimized for graph-like queries as opposed to column stores
or relational databases. If you say that "throwing hardware at the
problem" does not help, you are admitting that Blazegraph does not scale
for what is needed by Wikidata.

I think this is over-generalizing. We have a database that grew 10x over
the last 4 years. We have certain hardware and software limits, both
with existing hardware and in principle by hardware we could buy. We
also have certain issues specific to graph databases that make scaling
harder - for example, document databases, like ElasticSearch, and
certain models of relational databases, shard easily. Sharding something
like Wikidata graph is much harder, especially if the underlying
database knows nothing about specifics of Wikidata data (which would be
the case for all off-the-shelf databases). If we just randomly split the
triples between several servers, we'd probably be just modeling a large
but an extremely slow disk. So there needs to be some smarter solution,
one that we'd unlike to develop inhouse but one that has already been
verified by industry experience and other deployments.

Is the issue specific to Blazegraph and can the issue be solved by
switching platform? Maybe, we do not know yet. We do not have any better
solution that guarantees us better scalability identified, but we have a
plan on looking for that solution, given the resources. We also have a
plan on improving the throughput of Blazegraph, which we're working on now.

Non-sharding model might be hard to sustain indefinitely, but it is not
clear it can't work in the short term, and also it is not clear that
sharding model would deliver clear performance win, as it will have to
involve network latencies inside the queries, which can significantly
affect performance. This can only resolved by proper testing evaluation
of the candidate solutions.

Then it is not a "cluster" in the sense of databases. It is more a
redundancy architecture like RAID 1. Is this really how BlazeGraph does

I do not think our time here would be productively spent arguing
semantics about what should and should not be called a "cluster". We
call that setup a cluster, and I think now we all understand what we're
talking about.

it? Don't they have a proper cluster solution, where they repartition
data across servers? Or is this independent servers a wikimedia staff
homebuild?

If you mean sharded or replicated setup, as far as I know, Blazegraph
does not support that (there's some support for replication IIRC but
replication without sharding probably won't give us much improvement).
We have a plan to evaluate a solution that does shard, given necessary
resources.

Some info here:

- We evaluated some stores according to their performance:
http://www.semantic-web-journal.net/content/evaluation-metadata-representations-rdf-stores-0
"Evaluation of Metadata Representations in RDF stores"

Thanks for the link, it looks very interesting, I'll read it and see
which parts we could use here.

- Virtuoso has proven quite useful. I don't want to advertise here, but
the thing they have going for DBpedia uses ridiculous hardware, i.e.
64GB RAM and it is also the OS version, not the professional with
clustering and repartition capability. So we are playing the game since
ten years now: Everybody tries other databases, but then most people
come back to virtuoso. I have to admit that OpenLink is maintaining the
hosting for DBpedia themselves, so they know how to optimise. They
normally do large

Re: [Wikidata] Scaling Wikidata Query Service

2019-06-10 Thread Stas Malyshev

Hi!

> Yes, sharding is what you need, I think, instead of replication. This is
> the technique where data is repartitioned into more manageable chunks
> across servers.

Agreed, if we are to get any solution that is not constrained by
hardware limits of a single server, we can not avoid looking at sharding.

> Here is a good explanation of it:
> 
> http://vos.openlinksw.com/owiki/wiki/VOS/VOSArticleWebScaleRDF

Thanks, very interesting article. I'd certainly would like to know how
this works with database in the size of 10 bln. triples and queries both
accessing and updating random subsets of them. Updates are not covered
very thoroughly there - this is, I suspect, because many databases of 10
bln. size do not have as active (non-append) update workload as we do.
Maybe they still manage to solve it, if so, I'd very much like to know
about it.

> Just a note here: Virtuoso is also a full RDMS, so you could probably
> keep wikibase db in the same cluster and fix the asynchronicity. That is

Given how the original data is stored (JSON blob inside mysql table) it
would not be very useful. In general, graph data model and Wikitext data
model on top of which Wikidata is built are very, very different, and
expecting same storage to serve both - at least without very major and
deep refactoring of the code on both sides - is not currently very
realistic. And of course moving any of the wiki production databases to
Virtuoso would be a non-starter. Given than original Wikidata database
stays on Mysql - which I think is a reasonable assumption - there would
need to be a data migration pipeline for data to come from Mysql to
whatever is the WDQS NG storage.

> also true for any mappers like Sparqlify:
> http://aksw.org/Projects/Sparqlify.html However, these shift the
> problem, then you need a sharded/repartitioned relational database

Yes, relational-RDF bridges are known but my experience is they usually
are not very performant (the difference in "you can do it" and "you can
do it fast" is sometimes very significant) and in our case, it would be
useless anyway as Wikidata data is not really stored in relational
database per se - it's stored in JSON blob opaquely saved in relational
database structure that knows nothing about Wikidata. Yes, it's not the
ideal structure for optimal performance of Wikidata itself, but I do not
foresee this changing, at least in any short term. Again, we could of
course have data export pipeline to whatever storage format we want -
essentially we already have one - but the concept of having single data
store is probably not realistic at least within foreseeable timeframes.
We use separate data store for search (ElasticSearch) and probably will
have to have separate one for queries, whatever would be the mechanism.

Thanks,
-- 
Stas Malyshev
smalys...@wikimedia.org

___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata

Re: [Wikidata] Scaling Wikidata Query Service

2019-06-10 Thread Stas Malyshev

Hi!

> I am not sure how to evaluate this correctly. Scaling databases in
> general is a "known hard problem" and graph databases a sub-field of it,
> which are optimized for graph-like queries as opposed to column stores
> or relational databases. If you say that "throwing hardware at the
> problem" does not help, you are admitting that Blazegraph does not scale
> for what is needed by Wikidata. 

I think this is over-generalizing. We have a database that grew 10x over
the last 4 years. We have certain hardware and software limits, both
with existing hardware and in principle by hardware we could buy. We
also have certain issues specific to graph databases that make scaling
harder - for example, document databases, like ElasticSearch, and
certain models of relational databases, shard easily. Sharding something
like Wikidata graph is much harder, especially if the underlying
database knows nothing about specifics of Wikidata data (which would be
the case for all off-the-shelf databases). If we just randomly split the
triples between several servers, we'd probably be just modeling a large
but an extremely slow disk. So there needs to be some smarter solution,
one that we'd unlike to develop inhouse but one that has already been
verified by industry experience and other deployments.

Is the issue specific to Blazegraph and can the issue be solved by
switching platform? Maybe, we do not know yet. We do not have any better
solution that guarantees us better scalability identified, but we have a
plan on looking for that solution, given the resources. We also have a
plan on improving the throughput of Blazegraph, which we're working on now.

Non-sharding model might be hard to sustain indefinitely, but it is not
clear it can't work in the short term, and also it is not clear that
sharding model would deliver clear performance win, as it will have to
involve network latencies inside the queries, which can significantly
affect performance. This can only resolved by proper testing evaluation
of the candidate solutions.

> Then it is not a "cluster" in the sense of databases. It is more a
> redundancy architecture like RAID 1. Is this really how BlazeGraph does

I do not think our time here would be productively spent arguing
semantics about what should and should not be called a "cluster". We
call that setup a cluster, and I think now we all understand what we're
talking about.

> it? Don't they have a proper cluster solution, where they repartition
> data across servers? Or is this independent servers a wikimedia staff
> homebuild?

If you mean sharded or replicated setup, as far as I know, Blazegraph
does not support that (there's some support for replication IIRC but
replication without sharding probably won't give us much improvement).
We have a plan to evaluate a solution that does shard, given necessary
resources.

> Some info here:
> 
> - We evaluated some stores according to their performance:
> http://www.semantic-web-journal.net/content/evaluation-metadata-representations-rdf-stores-0
>  
> "Evaluation of Metadata Representations in RDF stores" 

Thanks for the link, it looks very interesting, I'll read it and see
which parts we could use here.

> - Virtuoso has proven quite useful. I don't want to advertise here, but
> the thing they have going for DBpedia uses ridiculous hardware, i.e.
> 64GB RAM and it is also the OS version, not the professional with
> clustering and repartition capability. So we are playing the game since
> ten years now: Everybody tries other databases, but then most people
> come back to virtuoso. I have to admit that OpenLink is maintaining the
> hosting for DBpedia themselves, so they know how to optimise. They
> normally do large banks as customers with millions of write transactions
> per hour. In LOD2 they also implemented column store features with
> MonetDB and repartitioning in clusters.

I do not know the details of your usage scenario, so before we get into
comparisons, I'd like to understand:

1. Do your servers provide live synchronized updates with Wikdiata or
DBPedia? How many updates per second that server can process?
2. How many queries per second this server is serving? What kind of
queries are those?

We did preliminary very limited evaluation of Virtuoso for hosting
Wikidata, and it looks like it can load and host the necessary data
(though it does not support some customizations we have now and we could
not evaluate whether such customizations are possible) but it would
require significant time investment to port all the functionality to it.
Unfortunately, the lack of resources did not allow us to do fuller
evaluation.

Also, as I understand, "professional" capabilities of Virtuoso are
closed-source and require paid license, which probably would be a
problem to run it on WMF infrastructure unless we reach some kind of
special arrangement. Since this arrangement will probably not include
open-sourcing the enterprise part of Virtuoso, it should deliver a very

Re: [Wikidata] Scaling Wikidata Query Service

2019-06-10 Thread Guillaume Lederrey

On Mon, Jun 10, 2019 at 9:03 PM Sebastian Hellmann
 wrote:
>
> Hi Guillaume,
>
> On 10.06.19 16:54, Guillaume Lederrey wrote:
>
> Hello!
>
> On Mon, Jun 10, 2019 at 4:28 PM Sebastian Hellmann
>  wrote:
>
> Hi Guillaume,
>
> On 06.06.19 21:32, Guillaume Lederrey wrote:
>
> Hello all!
>
> There has been a number of concerns raised about the performance and
> scaling of Wikdata Query Service. We share those concerns and we are
> doing our best to address them. Here is some info about what is going
> on:
>
> In an ideal world, WDQS should:
>
> * scale in terms of data size
> * scale in terms of number of edits
> * have low update latency
> * expose a SPARQL endpoint for queries
> * allow anyone to run any queries on the public WDQS endpoint
> * provide great query performance
> * provide a high level of availability
>
> Scaling graph databases is a "known hard problem", and we are reaching
> a scale where there are no obvious easy solutions to address all the
> above constraints. At this point, just "throwing hardware at the
> problem" is not an option anymore. We need to go deeper into the
> details and potentially make major changes to the current architecture.
> Some scaling considerations are discussed in [1]. This is going to take
> time.
>
> I am not sure how to evaluate this correctly. Scaling databases in general is 
> a "known hard problem" and graph databases a sub-field of it, which are 
> optimized for graph-like queries as opposed to column stores or relational 
> databases. If you say that "throwing hardware at the problem" does not help, 
> you are admitting that Blazegraph does not scale for what is needed by 
> Wikidata.
>
> Yes, I am admitting that Blazegraph (at least in the way we are using
> it at the moment) does not scale to our future needs. Blazegraph does
> have support for sharding (what they call "Scale Out"). And yes, we
> need to have a closer look at how that works. I'm not the expert here,
> so I won't even try to assert if that's a viable solution or not.
>
> Yes, sharding is what you need, I think, instead of replication. This is the 
> technique where data is repartitioned into more manageable chunks across 
> servers.

Well, we need sharding for scalability and replication for
availability, so we do need both. The hard problem is sharding.

> Here is a good explanation of it:
>
> http://vos.openlinksw.com/owiki/wiki/VOS/VOSArticleWebScaleRDF

Interesting read. I don't see how Virtuoso addresses data locality, it
looks like sharding of their RDF store is just hash based (I'm
assuming some kind of uniform hash). I'm not enough of an expert on
graph databases, but I doubt that a highly connected graph like
Wikidata will be able to scale reads without some way to address data
locality. Obviously, this needs testing.

> http://docs.openlinksw.com/virtuoso/ch-clusterprogramming/
>
>
> Sharding, scale-out or repartitioning is a classical enterprise feature for 
> Open-source databases. I am rather surprised that Blazegraph is full GPL 
> without an enterprise edition. But then they really sounded like their goal 
> as a company was to be bought by a bigger fish, in this case Amazon Web 
> Services. What is their deal? They are offering support?
>
> So if you go open-source, I think you will have a hard time finding good free 
> databases sharding/repartition. FoundationDB as proposed in the grant [1]is 
> from Apple
>
> [1] https://meta.wikimedia.org/wiki/Grants:Project/WDQS_On_FoundationDB
>
>
> I mean try the sharding feature. At some point though it might be worth 
> considering to go enterprise. Corporate Open Source often has a twist.

Closed source is not an option. We have strong open source
requirements to deploy anything in our production environment.

> Just a note here: Virtuoso is also a full RDMS, so you could probably keep 
> wikibase db in the same cluster and fix the asynchronicity. That is also true 
> for any mappers like Sparqlify: http://aksw.org/Projects/Sparqlify.html 
> However, these shift the problem, then you need a sharded/repartitioned 
> relational database

There is no plan to move the Wikibase storage out of MySQL at the
moment. In any case, having a low coupling between the primary storage
for wikidata and a secondary storage for complex querying is a sound
architectural principle. This asynchronous update process is most
probably going to stay in place, just because it makes a lot of sense.

Thanks for the discussion so far! It is always interesting to have outside idea!

   Have fun!

 Guillaume

>
> All the best,
>
> Sebastian
>
>
>
> From [1]:
>
> At the moment, each WDQS cluster is a group of independent servers, sharing 
> nothing, with each server independently updated and each server holding a 
> full data set.
>
> Then it is not a "cluster" in the sense of databases. It is more a redundancy 
> architecture like RAID 1. Is this really how BlazeGraph does it? Don't they 
> have a proper cluster solution, where they repartition data

Re: [Wikidata] Scaling Wikidata Query Service

Hi Guillaume,

On 10.06.19 16:54, Guillaume Lederrey wrote:

Hello!

On Mon, Jun 10, 2019 at 4:28 PM Sebastian Hellmann
wrote:

Hi Guillaume,

On 06.06.19 21:32, Guillaume Lederrey wrote:

Hello all!

There has been a number of concerns raised about the performance and
scaling of Wikdata Query Service. We share those concerns and we are
doing our best to address them. Here is some info about what is going
on:

In an ideal world, WDQS should:

* scale in terms of data size
* scale in terms of number of edits
* have low update latency
* expose a SPARQL endpoint for queries
* allow anyone to run any queries on the public WDQS endpoint
* provide great query performance
* provide a high level of availability

Scaling graph databases is a "known hard problem", and we are reaching
a scale where there are no obvious easy solutions to address all the
above constraints. At this point, just "throwing hardware at the
problem" is not an option anymore. We need to go deeper into the
details and potentially make major changes to the current architecture.
Some scaling considerations are discussed in [1]. This is going to take
time.

I am not sure how to evaluate this correctly. Scaling databases in general is a "known hard
problem" and graph databases a sub-field of it, which are optimized for graph-like queries as
opposed to column stores or relational databases. If you say that "throwing hardware at the
problem" does not help, you are admitting that Blazegraph does not scale for what is needed by
Wikidata.

Yes, I am admitting that Blazegraph (at least in the way we are using
it at the moment) does not scale to our future needs. Blazegraph does
have support for sharding (what they call "Scale Out"). And yes, we
need to have a closer look at how that works. I'm not the expert here,
so I won't even try to assert if that's a viable solution or not.

Yes, sharding is what you need, I think, instead of replication. This is
the technique where data is repartitioned into more manageable chunks
across servers.

Here is a good explanation of it:

http://vos.openlinksw.com/owiki/wiki/VOS/VOSArticleWebScaleRDF

http://docs.openlinksw.com/virtuoso/ch-clusterprogramming/

Sharding, scale-out or repartitioning is a classical enterprise feature
for Open-source databases. I am rather surprised that Blazegraph is full
GPL without an enterprise edition. But then they really sounded like
their goal as a company was to be bought by a bigger fish, in this case
Amazon Web Services. What is their deal? They are offering support?

So if you go open-source, I think you will have a hard time finding good
free databases sharding/repartition. FoundationDB as proposed in the
grant [1]is from Apple

[1] https://meta.wikimedia.org/wiki/Grants:Project/WDQS_On_FoundationDB

I mean try the sharding feature. At some point though it might be worth
considering to go enterprise. Corporate Open Source often has a twist.

Just a note here: Virtuoso is also a full RDMS, so you could probably
keep wikibase db in the same cluster and fix the asynchronicity. That is
also true for any mappers like Sparqlify:
http://aksw.org/Projects/Sparqlify.html However, these shift the
problem, then you need a sharded/repartitioned relational database

All the best,

Sebastian

From [1]:

At the moment, each WDQS cluster is a group of independent servers, sharing
nothing, with each server independently updated and each server holding a full
data set.

Then it is not a "cluster" in the sense of databases. It is more a redundancy
architecture like RAID 1. Is this really how BlazeGraph does it? Don't they have a proper
cluster solution, where they repartition data across servers? Or is this independent
servers a wikimedia staff homebuild?

It all depends on your definition of a cluster. We have groups of
machine collectively serving some coherent traffic, but each machine
is completely independent from others. So yes, the comparison to RAID1
is adequate.

Some info here:

Thanks for the link! That looks quite interesting!

- Virtuoso has proven quite useful. I don't want to advertise here, but the
thing they have going for DBpedia uses ridiculous hardware, i.e. 64GB RAM and
it is also the OS version, not the professional with clustering and repartition
capability. So we are playing the game since ten years now: Everybody tries
other databases, but then most people come back to virtuoso. I have to admit
that OpenLink is maintaining the hosting for DBpedia themselves, so they know
how to optimise. They normally do large banks as customers with millions of
write transactions per hour. In LOD2 they also implemented column store
features with MonetDB and repartitioning in clusters.

I'm not entirely sure how to read the above

[Wikidata-bugs] [Maniphest] [Updated] T217018: Upgrade Wikibase extension to use newer phan

2019-06-10 Thread Jdforrester-WMF

Jdforrester-WMF added a parent task: T225456: Run phan on PHP 7.2, not PHP 7.0.

TASK DETAIL
  https://phabricator.wikimedia.org/T217018

EMAIL PREFERENCES
  https://phabricator.wikimedia.org/settings/panel/emailpreferences/

To: Jdforrester-WMF
Cc: Jdforrester-WMF, matej_suchanek, gerritbot, Aklapper, 
Lucas_Werkmeister_WMDE, E.S.A-Sheild, darthmon_wmde, joker88john, CucyNoiD, 
Nandana, NebulousIris, Gaboe420, Versusxo, Majesticalreaper22, Giuliamocci, 
Adrian1985, Cpaulf30, Lahi, Gq86, Baloch007, Darkminds3113, Bsandipan, Lordiis, 
GoranSMilovanovic, Adik2382, Th3d3v1ls, Ramalepe, Liugev6, QZanden, 
LawExplorer, WSH1906, Lewizho99, Maathavan, _jensen, rosalieper, Wikidata-bugs, 
aude, Mbch331
___
Wikidata-bugs mailing list
Wikidata-bugs@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata-bugs

[Wikidata-bugs] [Maniphest] [Updated] T217018: Upgrade Wikibase extension to use newer phan

2019-06-10 Thread Jdforrester-WMF

Jdforrester-WMF added a comment.


  This is blocking T225456  because 
the version of phan Wikibase master is using is so old. :-(

TASK DETAIL
  https://phabricator.wikimedia.org/T217018

EMAIL PREFERENCES
  https://phabricator.wikimedia.org/settings/panel/emailpreferences/

To: Jdforrester-WMF
Cc: Jdforrester-WMF, matej_suchanek, gerritbot, Aklapper, 
Lucas_Werkmeister_WMDE, E.S.A-Sheild, darthmon_wmde, joker88john, CucyNoiD, 
Nandana, NebulousIris, Gaboe420, Versusxo, Majesticalreaper22, Giuliamocci, 
Adrian1985, Cpaulf30, Lahi, Gq86, Baloch007, Darkminds3113, Bsandipan, Lordiis, 
GoranSMilovanovic, Adik2382, Th3d3v1ls, Ramalepe, Liugev6, QZanden, 
LawExplorer, WSH1906, Lewizho99, Maathavan, _jensen, rosalieper, Wikidata-bugs, 
aude, Mbch331
___
Wikidata-bugs mailing list
Wikidata-bugs@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata-bugs

[Wikidata-bugs] [Maniphest] [Commented On] T170281: Raise PHP version requirement of Wikibase (and its related extensions) to 5.6

2019-06-10 Thread Jdforrester-WMF

Jdforrester-WMF added a comment.


  Can we declare this done? :-)

TASK DETAIL
  https://phabricator.wikimedia.org/T170281

EMAIL PREFERENCES
  https://phabricator.wikimedia.org/settings/panel/emailpreferences/

To: Jdforrester-WMF
Cc: Jdforrester-WMF, Paladox, Ricordisamoa, PokestarFan, thiemowmde, Reedy, 
Lucas_Werkmeister_WMDE, aude, hoo, ArielGlenn, daniel, Lydia_Pintscher, Jonas, 
Aleksey_WMDE, WMDE-leszek, Aklapper, darthmon_wmde, Premeditated, Nandana, 
Lahi, Gq86, GoranSMilovanovic, QZanden, LawExplorer, _jensen, rosalieper, 
Wikidata-bugs, Mbch331
___
Wikidata-bugs mailing list
Wikidata-bugs@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata-bugs

[Wikidata-bugs] [Maniphest] [Updated] T198557: Remove support for legacy pre-MCR schema

CCicalese_WMF edited projects, added MW-1.34-release; removed MW-1.33-release.
CCicalese_WMF added a comment.


  Agree, this should be moved to 1.34.

TASK DETAIL
  https://phabricator.wikimedia.org/T198557

EMAIL PREFERENCES
  https://phabricator.wikimedia.org/settings/panel/emailpreferences/

To: CCicalese_WMF
Cc: Krinkle, BPirkle, Fjalapeno, CCicalese_WMF, Aklapper, daniel, 
darthmon_wmde, WDoranWMF, Premeditated, holger.knust, EvanProdromou, Nandana, 
Lahi, Gq86, GoranSMilovanovic, RazeSoldier, QZanden, LawExplorer, JJMC89, 
_jensen, rosalieper, Agabi10, Pchelolo, Wikidata-bugs, aude, Jdforrester-WMF, 
Mbch331, Ltrlg, MGChecker
___
Wikidata-bugs mailing list
Wikidata-bugs@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata-bugs

[Wikidata-bugs] [Maniphest] [Updated] T198492: Drop rev_text_id and ar_text_id when running update.php

CCicalese_WMF edited projects, added MW-1.34-release; removed MW-1.33-release.

TASK DETAIL
  https://phabricator.wikimedia.org/T198492

EMAIL PREFERENCES
  https://phabricator.wikimedia.org/settings/panel/emailpreferences/

To: CCicalese_WMF
Cc: Aklapper, Anomie, Jdforrester-WMF, Tgr, daniel, darthmon_wmde, WDoranWMF, 
Premeditated, holger.knust, EvanProdromou, Nandana, Lahi, Gq86, 
GoranSMilovanovic, RazeSoldier, QZanden, LawExplorer, JJMC89, _jensen, 
rosalieper, Agabi10, Pchelolo, Wikidata-bugs, aude, Mbch331, Ltrlg, MGChecker
___
Wikidata-bugs mailing list
Wikidata-bugs@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata-bugs

[Wikidata-bugs] [Maniphest] [Commented On] T138332: interwiki links to/from Multilingual Wikisource

2019-06-10 Thread Jarekt

Jarekt added a comment.


  I run into this problem, when dealing with unusual sitelinks, which are not 
displayed. For example Q62081484  
seems to be an empty item with nothing in it other than label in italian, but 
closer look 

 shows that it also has sitelink to 
https://wikisource.org/wiki/Li_fioreti_de_mess%C3%AA_san_Francescho. At the 
moment GUI does not allow you to see, or change it. According to this query 

 there are only 11 items with sitelinks to old wikisource. We need to either 
remove them and prevent future creation, or show sitelink in GUI. Perhaps in 
"other sites" section, or as mulwikisource in wikisource section.

TASK DETAIL
  https://phabricator.wikimedia.org/T138332

EMAIL PREFERENCES
  https://phabricator.wikimedia.org/settings/panel/emailpreferences/

To: Jarekt
Cc: Jarekt, Ash_Crow, jhsoby, PokestarFan, Nikki, Liuxinyu970226, Candalua, 
Mbch331, Lydia_Pintscher, Thibaut120094, Aklapper, Yann, Zppix, darthmon_wmde, 
MattLongCT, Premeditated, DannyS712, Nandana, Lahi, Gq86, GoranSMilovanovic, 
Mahir256, QZanden, LawExplorer, _jensen, rosalieper, Bodhisattwa, Samwilson, 
Wikidata-bugs, aude, jayvdb, Shizhao, Billinghurst, jayantanth, Krenair
___
Wikidata-bugs mailing list
Wikidata-bugs@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata-bugs

[Wikidata-bugs] [Maniphest] [Updated] T214308: Force usage of MCR aware database schema

CCicalese_WMF set the point value for this task to "1".

TASK DETAIL
  https://phabricator.wikimedia.org/T214308

EMAIL PREFERENCES
  https://phabricator.wikimedia.org/settings/panel/emailpreferences/

To: CCicalese_WMF
Cc: Aklapper, CCicalese_WMF, Fjalapeno, daniel, darthmon_wmde, WDoranWMF, 
Premeditated, EvanProdromou, Nandana, Lahi, Gq86, GoranSMilovanovic, 
RazeSoldier, QZanden, LawExplorer, JJMC89, _jensen, rosalieper, Agabi10, 
Pchelolo, MGChecker, Wikidata-bugs, aude, Jdforrester-WMF, Mbch331, Ltrlg
___
Wikidata-bugs mailing list
Wikidata-bugs@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata-bugs

[Wikidata-bugs] [Maniphest] [Updated] T214308: Force usage of MCR aware database schema

CCicalese_WMF moved this task from Inbox to Team 3 on the Core Platform Team 
Kanban board.
CCicalese_WMF edited projects, added Core Platform Team Kanban (Team 3); 
removed Core Platform Team Kanban.

TASK DETAIL
  https://phabricator.wikimedia.org/T214308

WORKBOARD
  https://phabricator.wikimedia.org/project/board/3696/

EMAIL PREFERENCES
  https://phabricator.wikimedia.org/settings/panel/emailpreferences/

To: CCicalese_WMF
Cc: Aklapper, CCicalese_WMF, Fjalapeno, daniel, darthmon_wmde, WDoranWMF, 
Premeditated, EvanProdromou, Nandana, Lahi, Gq86, GoranSMilovanovic, 
RazeSoldier, QZanden, LawExplorer, JJMC89, _jensen, rosalieper, Agabi10, 
Pchelolo, MGChecker, Wikidata-bugs, aude, Jdforrester-WMF, Mbch331, Ltrlg
___
Wikidata-bugs mailing list
Wikidata-bugs@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata-bugs

[Wikidata-bugs] [Maniphest] [Updated] T214308: Force usage of MCR aware database schema

CCicalese_WMF edited projects, added Core Platform Team Kanban; removed Core 
Platform Team Backlog (Later).

TASK DETAIL
  https://phabricator.wikimedia.org/T214308

EMAIL PREFERENCES
  https://phabricator.wikimedia.org/settings/panel/emailpreferences/

To: CCicalese_WMF
Cc: Aklapper, CCicalese_WMF, Fjalapeno, daniel, darthmon_wmde, WDoranWMF, 
Premeditated, EvanProdromou, Nandana, Lahi, Gq86, GoranSMilovanovic, 
RazeSoldier, QZanden, LawExplorer, JJMC89, _jensen, rosalieper, Agabi10, 
Pchelolo, MGChecker, Wikidata-bugs, aude, Jdforrester-WMF, Mbch331, Ltrlg, 
holger.knust
___
Wikidata-bugs mailing list
Wikidata-bugs@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata-bugs

Re: [Wikidata] Scaling Wikidata Query Service

2019-06-10 Thread Guillaume Lederrey

Hello!

On Mon, Jun 10, 2019 at 4:28 PM Sebastian Hellmann
 wrote:
>
> Hi Guillaume,
>
> On 06.06.19 21:32, Guillaume Lederrey wrote:
>
> Hello all!
>
> There has been a number of concerns raised about the performance and
> scaling of Wikdata Query Service. We share those concerns and we are
> doing our best to address them. Here is some info about what is going
> on:
>
> In an ideal world, WDQS should:
>
> * scale in terms of data size
> * scale in terms of number of edits
> * have low update latency
> * expose a SPARQL endpoint for queries
> * allow anyone to run any queries on the public WDQS endpoint
> * provide great query performance
> * provide a high level of availability
>
> Scaling graph databases is a "known hard problem", and we are reaching
> a scale where there are no obvious easy solutions to address all the
> above constraints. At this point, just "throwing hardware at the
> problem" is not an option anymore. We need to go deeper into the
> details and potentially make major changes to the current architecture.
> Some scaling considerations are discussed in [1]. This is going to take
> time.
>
> I am not sure how to evaluate this correctly. Scaling databases in general is 
> a "known hard problem" and graph databases a sub-field of it, which are 
> optimized for graph-like queries as opposed to column stores or relational 
> databases. If you say that "throwing hardware at the problem" does not help, 
> you are admitting that Blazegraph does not scale for what is needed by 
> Wikidata.

Yes, I am admitting that Blazegraph (at least in the way we are using
it at the moment) does not scale to our future needs. Blazegraph does
have support for sharding (what they call "Scale Out"). And yes, we
need to have a closer look at how that works. I'm not the expert here,
so I won't even try to assert if that's a viable solution or not.

> From [1]:
>
> At the moment, each WDQS cluster is a group of independent servers, sharing 
> nothing, with each server independently updated and each server holding a 
> full data set.
>
> Then it is not a "cluster" in the sense of databases. It is more a redundancy 
> architecture like RAID 1. Is this really how BlazeGraph does it? Don't they 
> have a proper cluster solution, where they repartition data across servers? 
> Or is this independent servers a wikimedia staff homebuild?

It all depends on your definition of a cluster. We have groups of
machine collectively serving some coherent traffic, but each machine
is completely independent from others. So yes, the comparison to RAID1
is adequate.

> Some info here:
>
> - We evaluated some stores according to their performance: 
> http://www.semantic-web-journal.net/content/evaluation-metadata-representations-rdf-stores-0
>   "Evaluation of Metadata Representations in RDF stores"

Thanks for the link! That looks quite interesting!

> - Virtuoso has proven quite useful. I don't want to advertise here, but the 
> thing they have going for DBpedia uses ridiculous hardware, i.e. 64GB RAM and 
> it is also the OS version, not the professional with clustering and 
> repartition capability. So we are playing the game since ten years now: 
> Everybody tries other databases, but then most people come back to virtuoso. 
> I have to admit that OpenLink is maintaining the hosting for DBpedia 
> themselves, so they know how to optimise. They normally do large banks as 
> customers with millions of write transactions per hour. In LOD2 they also 
> implemented column store features with MonetDB and repartitioning in clusters.

I'm not entirely sure how to read the above (and a quick look at
virtuoso website does not give me the answer either), but it looks
like the sharding / partitioning options are only available in the
enterprise version. That probably makes it a non starter for us.

> - I recently heard a presentation from Arango-DB and they had a good cluster 
> concept as well, although I don't know anybody who tried it. The slides 
> seemed to make sense.

Nice, another one to add to our list of options to test.

> All the best,
>
> Sebastian
>
>
>
>
> Reasonably, addressing all of the above constraints is unlikely to
> ever happen. Some of the constraints are non negotiable: if we can't
> keep up with Wikidata in term of data size or number of edits, it does
> not make sense to address query performance. On some constraints, we
> will probably need to compromise.
>
> For example, the update process is asynchronous. It is by nature
> expected to lag. In the best case, this lag is measured in minutes,
> but can climb to hours occasionally. This is a case of prioritizing
> stability and correctness (ingesting all edits) over update latency.
> And while we can work to reduce the maximum latency, this will still
> be an asynchronous process and needs to be considered as such.
>
> We currently have one Blazegraph expert working with us to address a
> number of performance and stability issues. We
> are planning to hire an

[Wikidata-bugs] [Maniphest] [Closed] T78134: Adapt red links on the translation - part 2

2019-06-10 Thread Pginer-WMF

Pginer-WMF closed this task as "Resolved".
Pginer-WMF claimed this task.
Pginer-WMF added a comment.


  Content translation supports red links, and specific improvements are 
captured in specific tickets (T197787 
)

TASK DETAIL
  https://phabricator.wikimedia.org/T78134

EMAIL PREFERENCES
  https://phabricator.wikimedia.org/settings/panel/emailpreferences/

To: Pginer-WMF
Cc: Pginer-WMF, Arrbee, Aklapper, darthmon_wmde, Premeditated, Nandana, Lahi, 
Gq86, Bigdog222, GoranSMilovanovic, chapulina, QZanden, LawExplorer, _jensen, 
rosalieper, Nizil, Ixocactus, Wikidata-bugs, aude, Amire80, Jsahleen, 
Nikerabbit, santhosh, KartikMistry, Mbch331
___
Wikidata-bugs mailing list
Wikidata-bugs@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata-bugs

Re: [Wikidata] Scaling Wikidata Query Service

Hi Guillaume,

On 06.06.19 21:32, Guillaume Lederrey wrote:

Hello all!

In an ideal world, WDQS should:

From [1]:

At the moment, each WDQS cluster is a group of independent servers,
sharing nothing, with each server independently updated and each
server holding a full data set.

Then it is not a "cluster" in the sense of databases. It is more a
redundancy architecture like RAID 1. Is this really how BlazeGraph does
it? Don't they have a proper cluster solution, where they repartition
data across servers? Or is this independent servers a wikimedia staff
homebuild?

Some info here:

- I recently heard a presentation from Arango-DB and they had a good
cluster concept as well, although I don't know anybody who tried it. The
slides seemed to make sense.

All the best,

Sebastian

Reasonably, addressing all of the above constraints is unlikely to
ever happen. Some of the constraints are non negotiable: if we can't
keep up with Wikidata in term of data size or number of edits, it does
not make sense to address query performance. On some constraints, we
will probably need to compromise.

For example, the update process is asynchronous. It is by nature
expected to lag. In the best case, this lag is measured in minutes,
but can climb to hours occasionally. This is a case of prioritizing
stability and correctness (ingesting all edits) over update latency.
And while we can work to reduce the maximum latency, this will still
be an asynchronous process and needs to be considered as such.

We currently have one Blazegraph expert working with us to address a
number of performance and stability issues. We
are planning to hire an additional engineer to help us support the
service in the long term. You can follow our current work in phabricator [2].

If anyone has experience with scaling large graph databases, please
reach out to us, we're always happy to share ideas!

Thanks all for your patience!

Guillaume

[1] https://wikitech.wikimedia.org/wiki/Wikidata_query_service/ScalingStrategy
[2] https://phabricator.wikimedia.org/project/view/1239/

--
All the best,
Sebastian Hellmann

Director of Knowledge Integration and Linked Data Technologies (KILT)
Competence Center

at the Institute for Applied Informatics (InfAI) at Leipzig University
Executive Director of the DBpedia Association
Projects: http://dbpedia.org, http://nlp2rdf.org,
http://linguistics.okfn.org, https://www.w3.org/community/ld4lt

Homepage: http://aksw.org/SebastianHellmann
Research Group: http://aksw.org
___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata

[Wikidata-bugs] [Maniphest] [Updated] T222624: Unable to remove last alias from entity

2019-06-10 Thread Aklapper

Aklapper added a project: Wikidata.
Aklapper added a comment.


  Assuming this is about #Wikidata 
.

TASK DETAIL
  https://phabricator.wikimedia.org/T222624

EMAIL PREFERENCES
  https://phabricator.wikimedia.org/settings/panel/emailpreferences/

To: Aklapper
Cc: Matthias_Geisler_WMDE, Jakob_WMDE, Aklapper, Tarrow, darthmon_wmde, 
Premeditated, Nandana, Lahi, Gq86, GoranSMilovanovic, QZanden, LawExplorer, 
_jensen, rosalieper, Wikidata-bugs, aude, Lydia_Pintscher, Mbch331
___
Wikidata-bugs mailing list
Wikidata-bugs@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata-bugs

[Wikidata-bugs] [Maniphest] [Commented On] T203696: Drastically reduce the number of ResourceLoader modules that WikibaseClient registers