[Wikidata-bugs] [Maniphest] [Commented On] T221917: Create RDF dump of structured data on Commons

2019-06-10 Thread Smalyshev
Smalyshev added a comment.


  > Can we get away with a 'dumpwikibaserdf.sh' and some well-chosen variables 
for both wikidata and commons (or at worst two much, much shorter scripts) 
instead of a whole new dumpcommonsrdf.sh?
  
  I thought about it, and it might be possible, but it would make the script 
even more complex and even less readable, as it would end up being variables 
piled on top of variables on top of variables. That's why I didn't go that 
route: it is too easy to make an error while piling those variables on top of 
each other, and too hard to find one afterwards. I understand copy-pasting is 
usually bad, but I wonder if we want to tolerate it here for the sake of 
readability. If you think it's a must, I can redo it using a common script, 
but that would probably make the common part harder to understand and verify.
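
  For illustration, here is a rough sketch of the shared-script idea (Python 
pseudocode rather than the actual shell patch; all names and paths are made 
up): per-project settings live in one table and everything else is derived, 
which is exactly where the variables-on-variables risk comes from.

    # Hypothetical sketch only - not the real dumps code.
    PROJECTS = {
        "wikidata":    {"wiki": "wikidatawiki", "dump": "wikidata"},
        "commonswiki": {"wiki": "commonswiki",  "dump": "commonswiki"},
    }

    def output_path(project: str, date: str, fmt: str = "ttl") -> str:
        cfg = PROJECTS[project]
        # e.g. other/wikibase/commonswiki/20190610/commonswiki-20190610-all.ttl.gz
        return "other/wikibase/{0}/{1}/{0}-{1}-all.{2}.gz".format(
            cfg["dump"], date, fmt)

    if __name__ == "__main__":
        print(output_path("commonswiki", "20190610"))

  Every such derived name is one more thing a reviewer has to expand in their 
head, which is the readability cost being weighed against copy-pasting.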
  
  > Mediainfo is the first name I would have chosen too but maybe we can find 
something else that will keep future dump maintainers from swearing at us.
  
  Now is a good time for proposals, while we have not committed to anything 
yet. That said, given that it would live under other/wikibase/commonswiki, I 
don't think it can be confused with anything else.
  
  > Last comment is that I really dislike the links under 
xmldatadumps/public/{wikidata,commonswiki}. That directory should be only for 
xml/sql dumps and we should be moving away from having links to other stuff 
there, not adding new ones. We need a plan...
  
  I just did what the wikidata dump is already doing. I do not have an opinion 
on which way is better, but I think it should be symmetric for both RDF dumps: 
either both have the links, or neither does. So you are welcome to propose any 
model you prefer; I have no preference here except for consistency.

TASK DETAIL
  https://phabricator.wikimedia.org/T221917

EMAIL PREFERENCES
  https://phabricator.wikimedia.org/settings/panel/emailpreferences/

To: Smalyshev
Cc: ArielGlenn, WMDE-leszek, Poyekhali, Steinsplitter, Aklapper, 
Lydia_Pintscher, Bugreporter, Tgr, Ramsey-WMF, Jarekt, Addshore, Tpt, MB-one, 
Salgo60, Lucas_Werkmeister_WMDE, Smalyshev, Daryl-TTMG, RomaAmorRoma, 
E.S.A-Sheild, darthmon_wmde, joker88john, CucyNoiD, Nandana, NebulousIris, 
JKSTNK, Gaboe420, Versusxo, Majesticalreaper22, Giuliamocci, Adrian1985, 
Cpaulf30, Lahi, PDrouin-WMF, Gq86, Baloch007, E1presidente, Cparle, 
Darkminds3113, Anooprao, SandraF_WMF, Bsandipan, Lordiis, GoranSMilovanovic, 
Adik2382, Th3d3v1ls, Ramalepe, Liugev6, QZanden, EBjune, Tramullas, Acer, 
merbst, LawExplorer, WSH1906, Lewizho99, Maathavan, Silverfish, _jensen, 
rosalieper, Jonas, Xmlizer, Susannaanas, Wong128hk, Jane023, jkroll, 
Wikidata-bugs, Jdouglas, Base, matthiasmullie, aude, Tobias1984, El_Grafo, 
Dinoguy1000, Manybubbles, Ricordisamoa, Wesalius, Fabrice_Florin, Raymond, 
Jdforrester-WMF, Mbch331, Keegan
___
Wikidata-bugs mailing list
Wikidata-bugs@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata-bugs


[Wikidata-bugs] [Maniphest] [Commented On] T221917: Create RDF dump of structured data on Commons

2019-06-10 Thread ArielGlenn
ArielGlenn added a comment.


  A couple thoughts on the above patchset:
  
  Can we get away with a 'dumpwikibaserdf.sh' and some well-chosen variables 
for both wikidata and commons (or at worst two much, much shorter scripts) 
instead of a whole new dumpcommonsrdf.sh?
  
  Additionally, we already have media, mediatitles, mediacounts and imageinfo 
being served out of other/. Mediainfo is the first name I would have chosen 
too, but maybe we can find something else that will keep future dump 
maintainers from swearing at us.
  
  Last comment is that I really dislike the links under 
xmldatadumps/public/{wikidata,commonswiki}. That directory should be only for 
xml/sql dumps and we should be moving away from having links to other stuff 
there, not adding new ones. We need a plan...
  
  I'm trying to think of these dumps in the context of other projects possibly 
using structured data at some point, and how we can facilitate new dumps in the 
future.

TASK DETAIL
  https://phabricator.wikimedia.org/T221917

EMAIL PREFERENCES
  https://phabricator.wikimedia.org/settings/panel/emailpreferences/

To: ArielGlenn
Cc: ArielGlenn, WMDE-leszek, Poyekhali, Steinsplitter, Aklapper, 
Lydia_Pintscher, Bugreporter, Tgr, Ramsey-WMF, Jarekt, Addshore, Tpt, MB-one, 
Salgo60, Lucas_Werkmeister_WMDE, Smalyshev, Daryl-TTMG, RomaAmorRoma, 
E.S.A-Sheild, darthmon_wmde, joker88john, CucyNoiD, Nandana, NebulousIris, 
JKSTNK, Gaboe420, Versusxo, Majesticalreaper22, Giuliamocci, Adrian1985, 
Cpaulf30, Lahi, PDrouin-WMF, Gq86, Baloch007, E1presidente, Cparle, 
Darkminds3113, Anooprao, SandraF_WMF, Bsandipan, Lordiis, GoranSMilovanovic, 
Adik2382, Th3d3v1ls, Ramalepe, Liugev6, QZanden, EBjune, Tramullas, Acer, 
merbst, LawExplorer, WSH1906, Lewizho99, Maathavan, Silverfish, _jensen, 
rosalieper, Jonas, Xmlizer, Susannaanas, Wong128hk, Jane023, jkroll, 
Wikidata-bugs, Jdouglas, Base, matthiasmullie, aude, Tobias1984, El_Grafo, 
Dinoguy1000, Manybubbles, Ricordisamoa, Wesalius, Fabrice_Florin, Raymond, 
Jdforrester-WMF, Mbch331, Keegan
___
Wikidata-bugs mailing list
Wikidata-bugs@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata-bugs


[Wikidata-bugs] [Maniphest] [Commented On] T221917: Create RDF dump of structured data on Commons

2019-06-10 Thread gerritbot
gerritbot added a comment.


  Change 516444 had a related patch set uploaded (by Smalyshev; owner: 
Smalyshev):
  [operations/puppet@production] Set up dumps for mediainfo RDF generation
  
  https://gerrit.wikimedia.org/r/516444

TASK DETAIL
  https://phabricator.wikimedia.org/T221917

EMAIL PREFERENCES
  https://phabricator.wikimedia.org/settings/panel/emailpreferences/

To: gerritbot
Cc: WMDE-leszek, Poyekhali, Steinsplitter, Aklapper, Lydia_Pintscher, 
Bugreporter, Tgr, Ramsey-WMF, Jarekt, Addshore, Tpt, MB-one, Salgo60, 
Lucas_Werkmeister_WMDE, Smalyshev, E.S.A-Sheild, darthmon_wmde, joker88john, 
CucyNoiD, Nandana, NebulousIris, JKSTNK, Gaboe420, Versusxo, 
Majesticalreaper22, Giuliamocci, Adrian1985, Cpaulf30, Lahi, PDrouin-WMF, Gq86, 
Baloch007, E1presidente, Cparle, Darkminds3113, Anooprao, SandraF_WMF, 
Bsandipan, Lordiis, GoranSMilovanovic, Adik2382, Th3d3v1ls, Ramalepe, Liugev6, 
QZanden, EBjune, Tramullas, Acer, merbst, LawExplorer, WSH1906, Lewizho99, 
Maathavan, Silverfish, _jensen, rosalieper, Jonas, Xmlizer, Susannaanas, 
Wong128hk, Jane023, jkroll, Wikidata-bugs, Jdouglas, Base, matthiasmullie, 
aude, Tobias1984, El_Grafo, Dinoguy1000, Manybubbles, Ricordisamoa, Wesalius, 
Fabrice_Florin, Raymond, Jdforrester-WMF, Mbch331, Keegan
___
Wikidata-bugs mailing list
Wikidata-bugs@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata-bugs


[Wikidata-bugs] [Maniphest] [Commented On] T221917: Create RDF dump of structured data on Commons

2019-06-10 Thread gerritbot
gerritbot added a comment.


  Change 516441 had a related patch set uploaded (by Smalyshev; owner: 
Smalyshev):
  [mediawiki/extensions/Wikibase@master] Add option --ignore-missing to dumper
  
  https://gerrit.wikimedia.org/r/516441

TASK DETAIL
  https://phabricator.wikimedia.org/T221917

EMAIL PREFERENCES
  https://phabricator.wikimedia.org/settings/panel/emailpreferences/

To: gerritbot
Cc: WMDE-leszek, Poyekhali, Steinsplitter, Aklapper, Lydia_Pintscher, 
Bugreporter, Tgr, Ramsey-WMF, Jarekt, Addshore, Tpt, MB-one, Salgo60, 
Lucas_Werkmeister_WMDE, Smalyshev, darthmon_wmde, Nandana, JKSTNK, Lahi, 
PDrouin-WMF, Gq86, E1presidente, Cparle, Anooprao, SandraF_WMF, 
GoranSMilovanovic, QZanden, EBjune, Tramullas, Acer, merbst, LawExplorer, 
Silverfish, _jensen, rosalieper, Jonas, Xmlizer, Susannaanas, Wong128hk, 
Jane023, jkroll, Wikidata-bugs, Jdouglas, Base, matthiasmullie, aude, 
Tobias1984, El_Grafo, Dinoguy1000, Manybubbles, Ricordisamoa, Wesalius, 
Fabrice_Florin, Raymond, Jdforrester-WMF, Mbch331, Keegan
___
Wikidata-bugs mailing list
Wikidata-bugs@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata-bugs


[Wikidata-bugs] [Maniphest] [Updated] T221917: Create RDF dump of structured data on Commons

2019-06-10 Thread gerritbot
gerritbot added a project: Patch-For-Review.

TASK DETAIL
  https://phabricator.wikimedia.org/T221917

EMAIL PREFERENCES
  https://phabricator.wikimedia.org/settings/panel/emailpreferences/

To: gerritbot
Cc: WMDE-leszek, Poyekhali, Steinsplitter, Aklapper, Lydia_Pintscher, 
Bugreporter, Tgr, Ramsey-WMF, Jarekt, Addshore, Tpt, MB-one, Salgo60, 
Lucas_Werkmeister_WMDE, Smalyshev, E.S.A-Sheild, darthmon_wmde, joker88john, 
CucyNoiD, Nandana, NebulousIris, JKSTNK, Gaboe420, Versusxo, 
Majesticalreaper22, Giuliamocci, Adrian1985, Cpaulf30, Lahi, PDrouin-WMF, Gq86, 
Baloch007, E1presidente, Cparle, Darkminds3113, Anooprao, SandraF_WMF, 
Bsandipan, Lordiis, GoranSMilovanovic, Adik2382, Th3d3v1ls, Ramalepe, Liugev6, 
QZanden, EBjune, Tramullas, Acer, merbst, LawExplorer, WSH1906, Lewizho99, 
Maathavan, Silverfish, _jensen, rosalieper, Jonas, Xmlizer, Susannaanas, 
Wong128hk, Jane023, jkroll, Wikidata-bugs, Jdouglas, Base, matthiasmullie, 
aude, Tobias1984, El_Grafo, Dinoguy1000, Manybubbles, Ricordisamoa, Wesalius, 
Fabrice_Florin, Raymond, Jdforrester-WMF, Mbch331, Keegan
___
Wikidata-bugs mailing list
Wikidata-bugs@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata-bugs


[Wikidata-bugs] [Maniphest] [Unblock] T221916: Create RDF export for structured data stored for files

2019-06-10 Thread Smalyshev
Smalyshev closed subtask T222995: Decide which prefixes to use for MediaInfo 
RDF as Resolved.

TASK DETAIL
  https://phabricator.wikimedia.org/T221916

EMAIL PREFERENCES
  https://phabricator.wikimedia.org/settings/panel/emailpreferences/

To: Smalyshev
Cc: WMDE-leszek, Poyekhali, Steinsplitter, Aklapper, Lydia_Pintscher, Tgr, 
Ramsey-WMF, Jarekt, Addshore, Tpt, MB-one, Salgo60, Lucas_Werkmeister_WMDE, 
Smalyshev, E.S.A-Sheild, darthmon_wmde, joker88john, CucyNoiD, Nandana, 
NebulousIris, JKSTNK, Gaboe420, Versusxo, Majesticalreaper22, Giuliamocci, 
Adrian1985, Cpaulf30, Lahi, PDrouin-WMF, Gq86, Baloch007, E1presidente, Cparle, 
Darkminds3113, Anooprao, SandraF_WMF, Bsandipan, Lordiis, GoranSMilovanovic, 
Adik2382, Th3d3v1ls, Ramalepe, Liugev6, QZanden, EBjune, Tramullas, Acer, 
merbst, LawExplorer, WSH1906, Lewizho99, Maathavan, Silverfish, _jensen, 
rosalieper, Cirdan, Jonas, Xmlizer, Susannaanas, Wong128hk, Jane023, jkroll, 
Wikidata-bugs, Jdouglas, Base, matthiasmullie, aude, Tobias1984, El_Grafo, 
Dinoguy1000, Manybubbles, Ricordisamoa, Wesalius, Fabrice_Florin, Raymond, 
Jdforrester-WMF, Mbch331, Keegan
___
Wikidata-bugs mailing list
Wikidata-bugs@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata-bugs


[Wikidata-bugs] [Maniphest] [Closed] T222995: Decide which prefixes to use for MediaInfo RDF

2019-06-10 Thread Smalyshev
Smalyshev closed this task as "Resolved".
Smalyshev claimed this task.

TASK DETAIL
  https://phabricator.wikimedia.org/T222995

EMAIL PREFERENCES
  https://phabricator.wikimedia.org/settings/panel/emailpreferences/

To: Smalyshev
Cc: Denny, mkroetzsch, Mholloway, Cparle, daniel, Lucas_Werkmeister_WMDE, 
Salgo60, MB-one, Tpt, Addshore, Jarekt, Ramsey-WMF, Tgr, Lydia_Pintscher, 
Aklapper, Steinsplitter, Poyekhali, WMDE-leszek, Smalyshev, darthmon_wmde, 
Nandana, JKSTNK, Lahi, PDrouin-WMF, Gq86, E1presidente, Anooprao, SandraF_WMF, 
GoranSMilovanovic, QZanden, EBjune, Tramullas, Acer, merbst, LawExplorer, 
Silverfish, _jensen, rosalieper, Jonas, Xmlizer, Susannaanas, Wong128hk, 
Jane023, jkroll, Wikidata-bugs, Jdouglas, Base, matthiasmullie, aude, 
Tobias1984, El_Grafo, Dinoguy1000, Manybubbles, Ricordisamoa, Wesalius, 
Fabrice_Florin, Raymond, Jdforrester-WMF, Mbch331, Keegan
___
Wikidata-bugs mailing list
Wikidata-bugs@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata-bugs


[Wikidata-bugs] [Maniphest] [Merged] T225364: Class undefined: Wikibase\DataModel\Entity\ItemId in WikibaseLib.entitytypes.php on line 35

2019-06-10 Thread matmarex
matmarex closed this task as a duplicate of T189560: mediawiki/vendor REL1_* no 
longer ship dependencies for wmf extensions that are not in the mediawiki 
tarball.

TASK DETAIL
  https://phabricator.wikimedia.org/T225364

EMAIL PREFERENCES
  https://phabricator.wikimedia.org/settings/panel/emailpreferences/

To: matmarex
Cc: Paladox, Aklapper, Reedy, darthmon_wmde, Redabr4, Zanziii, Sadisticturd, 
Nandana, A.S.Kochergin, Lahi, Gq86, GoranSMilovanovic, RazeSoldier, QZanden, 
LawExplorer, _jensen, rosalieper, Jonas, MGChecker, Wikidata-bugs, aude, 
Lydia_Pintscher, Mbch331, hashar
___
Wikidata-bugs mailing list
Wikidata-bugs@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata-bugs


[Wikidata] [OM-2019] Final CFP: 14th workshop on Ontology Matching collocated with ISWC

2019-06-10 Thread Pavel Shvaiko
... Most often we need to integrate data sources that were not designed with
integration in mind, which increases the difficulty of the matching
operation. Even if good progress has been made in the matching field as
such, ontology matching may appear to be virtually impossible. Indeed, to
find the correspondences between entities, it is necessary to understand
their meaning ...

--
   FINAL CALL FOR CONTRIBUTIONS
   THE SUBMISSION DEADLINE IS APPROACHING ON JUNE 28TH, 2019
--

The Fourteenth International Workshop on
ONTOLOGY MATCHING
(OM-2019)
 http://om2019.ontologymatching.org/
   October 26th or 27th, 2019, ISWC Workshop Program,
  Auckland, New Zealand


BRIEF DESCRIPTION AND OBJECTIVES
Ontology matching is a key interoperability enabler for the Semantic Web,
as well as a useful technique in some classical data integration tasks
dealing with the semantic heterogeneity problem. It takes ontologies
as input and determines as output an alignment, that is, a set of
correspondences between the semantically related entities of those
ontologies.
These correspondences can be used for various tasks, such as ontology
merging, data interlinking, query answering or process mapping.
Thus, matching ontologies enables the knowledge and data expressed
with the matched ontologies to interoperate.
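
To make this concrete, a correspondence is commonly represented as a pair of
entities plus a relation and a confidence value; the following toy sketch
(with invented URIs) shows what such an alignment looks like:

    # Toy illustration of an alignment: a set of correspondences
    # (entity from ontology 1, entity from ontology 2, relation, confidence).
    alignment = [
        ("http://ex.org/onto1#Author", "http://ex.org/onto2#Writer",  "=", 0.92),
        ("http://ex.org/onto1#Paper",  "http://ex.org/onto2#Article", "=", 0.87),
        ("http://ex.org/onto1#Person", "http://ex.org/onto2#Writer",  ">", 0.60),
    ]

    # Correspondences like these can drive ontology merging, data
    # interlinking (e.g. owl:sameAs links) or query rewriting.
    for e1, e2, rel, conf in alignment:
        print(e1, rel, e2, "confidence:", conf)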

The workshop has three goals:
1. To bring together leaders from academia, industry and user institutions
to assess how academic advances are addressing real-world requirements.
The workshop will strive to improve academic awareness of industrial
and final user needs, and therefore, direct research towards those needs.
Simultaneously, the workshop will serve to inform industry and user
representatives about existing research efforts that may meet their
requirements. The workshop will also investigate how the ontology
matching technology is going to evolve, especially with respect to
data interlinking, process mapping and web table matching tasks.

2. To conduct an extensive and rigorous evaluation of ontology matching
and instance matching (link discovery) approaches through
the OAEI (Ontology Alignment Evaluation Initiative) 2019 campaign:
http://oaei.ontologymatching.org/2019/

3. To examine new uses, similarities and differences from database
schema matching, which has received decades of attention
but is just beginning to transition to mainstream tools.


This year, in sync with the main conference, we encourage submissions
specifically devoted to: (i) datasets, benchmarks and replication studies,
services, software, methodologies, protocols and measures
(not necessarily related to OAEI), and (ii) application of
the matching technology in real-life scenarios and assessment
of its usefulness to the final users.


TOPICS of interest include but are not limited to:
Business and use cases for matching (e.g., big, open, closed data);
Requirements to matching from specific application scenarios (e.g.,
public sector, homeland security);
Application of matching techniques in real-world scenarios (e.g., with
environmental data);
Formal foundations and frameworks for matching;
Matching and knowledge graphs;
Matching and deep learning;
Matching and embeddings;
Matching and big data;
Matching and linked data;
Instance matching, data interlinking and relations between them;
Privacy-aware matching;
Process model matching;
Large-scale and efficient matching techniques;
Matcher selection, combination and tuning;
User involvement (including both technical and organizational aspects);
Explanations in matching;
Social and collaborative matching;
Uncertainty in matching;
Reasoning with alignments;
Alignment coherence and debugging;
Alignment management;
Matching for traditional applications (e.g., data science);
Matching for emerging applications (e.g., web tables, knowledge graphs).


SUBMISSIONS
Contributions to the workshop can be made as technical papers and
posters/statements of interest addressing different issues of ontology
matching, as well as by participating in the OAEI 2019 campaign. Long
technical papers should be at most 12 pages, short technical papers at
most 5 pages, and posters/statements of interest at most 2 pages.
All contributions have to be prepared using the LNCS Style:
http://www.springer.com/computer/lncs?SGWID=0-164-6-793341-0
and should be submitted in PDF format (no later than June 28th, 2019)
through the workshop submission site at:

https://www.easychair.org/conferences/?conf=om2019

Contributors to the OAEI 2019 campaign have to follow the campaign
conditions
and 

[Wikidata-bugs] [Maniphest] [Updated] T194101: Add more metrics to WDQS

2019-06-10 Thread Maintenance_bot
Maintenance_bot removed a project: Patch-For-Review.

TASK DETAIL
  https://phabricator.wikimedia.org/T194101

EMAIL PREFERENCES
  https://phabricator.wikimedia.org/settings/panel/emailpreferences/

To: Maintenance_bot
Cc: Aklapper, Gehel, Smalyshev, darthmon_wmde, ET4Eva, Nandana, Lahi, Gq86, 
Darkminds3113, Lucas_Werkmeister_WMDE, GoranSMilovanovic, QZanden, EBjune, 
merbst, LawExplorer, Avner, _jensen, rosalieper, Jonas, FloNight, Xmlizer, 
jkroll, Wikidata-bugs, Jdouglas, aude, Tobias1984, Manybubbles, Mbch331, 
E.S.A-Sheild, joker88john, CucyNoiD, NebulousIris, Gaboe420, Versusxo, 
Majesticalreaper22, Giuliamocci, Adrian1985, Cpaulf30, Baloch007, Bsandipan, 
Lordiis, Adik2382, Th3d3v1ls, Ramalepe, Liugev6, WSH1906, Lewizho99, Maathavan
___
Wikidata-bugs mailing list
Wikidata-bugs@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata-bugs


[Wikidata-bugs] [Maniphest] [Commented On] T194101: Add more metrics to WDQS

2019-06-10 Thread gerritbot
gerritbot added a comment.


  Change 514645 merged by jenkins-bot:
  [wikidata/query/rdf@master] Add metering of federation requests
  
  https://gerrit.wikimedia.org/r/514645

TASK DETAIL
  https://phabricator.wikimedia.org/T194101

EMAIL PREFERENCES
  https://phabricator.wikimedia.org/settings/panel/emailpreferences/

To: gerritbot
Cc: Aklapper, Gehel, Smalyshev, E.S.A-Sheild, darthmon_wmde, joker88john, 
ET4Eva, CucyNoiD, Nandana, NebulousIris, Gaboe420, Versusxo, 
Majesticalreaper22, Giuliamocci, Adrian1985, Cpaulf30, Lahi, Gq86, Baloch007, 
Darkminds3113, Bsandipan, Lordiis, Lucas_Werkmeister_WMDE, GoranSMilovanovic, 
Adik2382, Th3d3v1ls, Ramalepe, Liugev6, QZanden, EBjune, merbst, LawExplorer, 
WSH1906, Avner, Lewizho99, Maathavan, _jensen, rosalieper, Jonas, FloNight, 
Xmlizer, jkroll, Wikidata-bugs, Jdouglas, aude, Tobias1984, Manybubbles, Mbch331
___
Wikidata-bugs mailing list
Wikidata-bugs@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata-bugs


Re: [Wikidata] Scaling Wikidata Query Service

2019-06-10 Thread Sebastian Hellmann
Yes, I can ask. I am talking a lot with them as we are redeploying 
DBpedia live and also pushing the new DBpedia to them soon.


I think they also had a specific issue with how Wikidata does linked 
data, but I didn't catch it, as it was mentioned only briefly.


All the best,

Sebastian


On 10.06.19 22:46, Stas Malyshev wrote:

> Hi!
>
>> thanks for the elaboration. I can understand the background much better.
>> I have to admit that I am also not a real expert, but very close to the
>> real experts like Vidal and Rahm, who are co-authors of the SWJ paper, or
>> the OpenLink devs.
>
> If you know anybody at OpenLink who would be interested in trying to
> evaluate such a thing (i.e. how Wikidata could be hosted on Virtuoso) and
> provide support for this project, it would be interesting to discuss it.
> While the open-source thing is still a barrier and in general the
> requirements are different, at least discussing it and maybe getting
> some numbers might be useful.
>
> Thanks,

--
All the best,
Sebastian Hellmann

Director of Knowledge Integration and Linked Data Technologies (KILT) 
Competence Center

at the Institute for Applied Informatics (InfAI) at Leipzig University
Executive Director of the DBpedia Association
Projects: http://dbpedia.org, http://nlp2rdf.org, 
http://linguistics.okfn.org, https://www.w3.org/community/ld4lt 


Homepage: http://aksw.org/SebastianHellmann
Research Group: http://aksw.org
___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata


Re: [Wikidata] Scaling Wikidata Query Service

2019-06-10 Thread Stas Malyshev
Hi!

> thanks for the elaboration. I can understand the background much better.
> I have to admit that I am also not a real expert, but very close to the
> real experts like Vidal and Rahm, who are co-authors of the SWJ paper, or
> the OpenLink devs.

If you know anybody at OpenLink who would be interested in trying to
evaluate such a thing (i.e. how Wikidata could be hosted on Virtuoso) and
provide support for this project, it would be interesting to discuss it.
While the open-source thing is still a barrier and in general the
requirements are different, at least discussing it and maybe getting
some numbers might be useful.

Thanks,
-- 
Stas Malyshev
smalys...@wikimedia.org

___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata


Re: [Wikidata] Scaling Wikidata Query Service

2019-06-10 Thread Sebastian Hellmann

Hi Stas,

thanks for the elaboration. I can understand the background much better. 
I have to admit that I am also not a real expert, but very close to the 
real experts like Vidal and Rahm, who are co-authors of the SWJ paper, or 
the OpenLink devs.


I am also spoiled, because OpenLink solves the hosting for DBpedia and 
also DBpedia-live, with ca. 130k updates per day for the English 
Wikipedia. I think this is the most recent report: 
https://medium.com/virtuoso-blog/dbpedia-usage-report-as-of-2018-01-01-8cae1b81ca71 
Then again, DBpedia didn't grow for a while, but we made a "Best of" now 
[1]. But they will not host it all.


[1] https://svn.aksw.org/papers/2019/ISWC_FlexiFusion/public.pdf

I also see that your context is difficult. Maybe you can custom-shard / 
scale out Blazegraph based on the queries and then replicate the sharded 
clusters - like a mix between sharding and replication: maybe just 3 * 3 
servers instead of 9 replicated ones, or 9 servers full of shards. There 
are not many options here, given the open-source requirement. I guess you 
are caching static content already as much as possible.
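
As a back-of-envelope illustration of that mix (numbers invented, Python
just for the arithmetic):

    SERVERS = 9
    TRIPLES = 10_000_000_000  # ~10 bln triples, order of magnitude only

    full_replication = {"copies_of_each_triple": SERVERS,
                        "triples_per_server": TRIPLES}
    sharded_3x3 = {"copies_of_each_triple": 3,
                   "triples_per_server": TRIPLES // 3}

    # Same 9 servers: 3 shards x 3 replicas keeps 3 copies of every triple
    # for availability, at a third of the per-server data size - the catch
    # is that queries now have to cross shards.
    print(full_replication)
    print(sharded_3x3)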


This also matches pretty much what I know, but it really is all second 
hand, as my expertise is focused more on what's inside the database.


All the best,

Sebastian

On 10.06.19 22:02, Stas Malyshev wrote:

> Hi!
>
>> I am not sure how to evaluate this correctly. Scaling databases in
>> general is a "known hard problem", and graph databases are a sub-field
>> of it, optimized for graph-like queries as opposed to column stores
>> or relational databases. If you say that "throwing hardware at the
>> problem" does not help, you are admitting that Blazegraph does not
>> scale for what is needed by Wikidata.
>
> I think this is over-generalizing. We have a database that grew 10x over
> the last 4 years. We have certain hardware and software limits, both
> with existing hardware and in principle by hardware we could buy. We
> also have certain issues specific to graph databases that make scaling
> harder - for example, document databases, like ElasticSearch, and
> certain models of relational databases, shard easily. Sharding something
> like the Wikidata graph is much harder, especially if the underlying
> database knows nothing about the specifics of Wikidata data (which would
> be the case for all off-the-shelf databases). If we just randomly split
> the triples between several servers, we'd probably be just modeling a
> large but extremely slow disk. So there needs to be some smarter
> solution, one that we would not like to develop in-house but one that
> has already been verified by industry experience and other deployments.
>
> Is the issue specific to Blazegraph, and can the issue be solved by
> switching platforms? Maybe; we do not know yet. We have not identified
> any better solution that guarantees us better scalability, but we have
> a plan for looking for that solution, given the resources. We also have
> a plan for improving the throughput of Blazegraph, which we're working
> on now.
>
> A non-sharding model might be hard to sustain indefinitely, but it is
> not clear it can't work in the short term, and it is also not clear that
> a sharding model would deliver a clear performance win, as it will have
> to involve network latencies inside the queries, which can significantly
> affect performance. This can only be resolved by proper testing and
> evaluation of the candidate solutions.
>
>> Then it is not a "cluster" in the sense of databases. It is more a
>> redundancy architecture like RAID 1. Is this really how BlazeGraph does
>
> I do not think our time here would be productively spent arguing
> semantics about what should and should not be called a "cluster". We
> call that setup a cluster, and I think now we all understand what we're
> talking about.
>
>> it? Don't they have a proper cluster solution, where they repartition
>> data across servers? Or are these independent servers a wikimedia staff
>> homebuild?
>
> If you mean a sharded or replicated setup, as far as I know, Blazegraph
> does not support that (there's some support for replication IIRC, but
> replication without sharding probably won't give us much improvement).
> We have a plan to evaluate a solution that does shard, given the
> necessary resources.
>
>> Some info here:
>>
>> - We evaluated some stores according to their performance:
>> http://www.semantic-web-journal.net/content/evaluation-metadata-representations-rdf-stores-0
>> "Evaluation of Metadata Representations in RDF stores"
>
> Thanks for the link, it looks very interesting. I'll read it and see
> which parts we could use here.
>
>> - Virtuoso has proven quite useful. I don't want to advertise here, but
>> the thing they have going for DBpedia uses ridiculous hardware, i.e.
>> 64GB RAM, and it is also the OS version, not the professional one with
>> clustering and repartition capability. So we have been playing the game
>> for ten years now: everybody tries other databases, but then most people
>> come back to Virtuoso. I have to admit that OpenLink is maintaining the
>> hosting for DBpedia themselves, so they know how to optimise. They
>> normally do large

Re: [Wikidata] Scaling Wikidata Query Service

2019-06-10 Thread Stas Malyshev
Hi!

> Yes, sharding is what you need, I think, instead of replication. This is
> the technique where data is repartitioned into more manageable chunks
> across servers.

Agreed, if we are to get any solution that is not constrained by
hardware limits of a single server, we can not avoid looking at sharding.

> Here is a good explanation of it:
> 
> http://vos.openlinksw.com/owiki/wiki/VOS/VOSArticleWebScaleRDF

Thanks, very interesting article. I would certainly like to know how
this works with a database on the order of 10 bln triples and queries
both accessing and updating random subsets of them. Updates are not
covered very thoroughly there - this is, I suspect, because many databases
of 10 bln size do not have as active a (non-append) update workload as we
do. Maybe they still manage to solve it; if so, I'd very much like to
know about it.

> Just a note here: Virtuoso is also a full RDBMS, so you could probably
> keep wikibase db in the same cluster and fix the asynchronicity. That is

Given how the original data is stored (a JSON blob inside a MySQL table),
it would not be very useful. In general, the graph data model and the
Wikitext data model on top of which Wikidata is built are very, very
different, and expecting the same storage to serve both - at least without
very major and deep refactoring of the code on both sides - is not
currently very realistic. And of course moving any of the wiki production
databases to Virtuoso would be a non-starter. Given that the original
Wikidata database stays on MySQL - which I think is a reasonable
assumption - there would need to be a data migration pipeline for data to
come from MySQL to whatever is the WDQS NG storage.
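
As a minimal sketch of what one stage of such a pipeline would do (field
names follow the public Wikidata entity JSON; the wd:/wdt: simplification
to direct claims is illustrative, and ranks, qualifiers and non-entity
datatypes are ignored):

    # Hedged sketch: map one entity JSON document (as stored opaquely in
    # MySQL / as published in the JSON dumps) to direct-claim triples.
    def entity_json_to_triples(doc):
        subject = "wd:" + doc["id"]
        for prop, statements in doc.get("claims", {}).items():
            for st in statements:
                snak = st["mainsnak"]
                if snak["snaktype"] != "value":
                    continue  # skip novalue/somevalue snaks
                value = snak["datavalue"]["value"]
                if isinstance(value, dict) and "id" in value:
                    yield (subject, "wdt:" + prop, "wd:" + value["id"])

    doc = {"id": "Q42",
           "claims": {"P31": [{"mainsnak": {
               "snaktype": "value",
               "datavalue": {"value": {"id": "Q5"}}}}]}}
    print(list(entity_json_to_triples(doc)))  # [('wd:Q42', 'wdt:P31', 'wd:Q5')]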

> also true for any mappers like Sparqlify:
> http://aksw.org/Projects/Sparqlify.html However, these shift the
> problem, then you need a sharded/repartitioned relational database

Yes, relational-RDF bridges are known, but in my experience they are
usually not very performant (the difference between "you can do it" and
"you can do it fast" is sometimes very significant), and in our case it
would be useless anyway, as Wikidata data is not really stored in a
relational database per se - it's stored in a JSON blob opaquely saved in
a relational database structure that knows nothing about Wikidata. Yes,
it's not the ideal structure for optimal performance of Wikidata itself,
but I do not foresee this changing, at least in the short term. Again, we
could of course have a data export pipeline to whatever storage format we
want - essentially we already have one - but the concept of having a
single data store is probably not realistic, at least within foreseeable
timeframes. We use a separate data store for search (ElasticSearch) and
will probably have to have a separate one for queries, whatever the
mechanism.

Thanks,
-- 
Stas Malyshev
smalys...@wikimedia.org

___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata


Re: [Wikidata] Scaling Wikidata Query Service

2019-06-10 Thread Stas Malyshev
Hi!

> I am not sure how to evaluate this correctly. Scaling databases in
> general is a "known hard problem", and graph databases are a sub-field
> of it, optimized for graph-like queries as opposed to column stores
> or relational databases. If you say that "throwing hardware at the
> problem" does not help, you are admitting that Blazegraph does not
> scale for what is needed by Wikidata. 

I think this is over-generalizing. We have a database that grew 10x over
the last 4 years. We have certain hardware and software limits, both
with existing hardware and in principle by hardware we could buy. We
also have certain issues specific to graph databases that make scaling
harder - for example, document databases, like ElasticSearch, and
certain models of relational databases, shard easily. Sharding something
like the Wikidata graph is much harder, especially if the underlying
database knows nothing about the specifics of Wikidata data (which would
be the case for all off-the-shelf databases). If we just randomly split
the triples between several servers, we'd probably be just modeling a
large but extremely slow disk. So there needs to be some smarter
solution, one that we would not like to develop in-house but one that
has already been verified by industry experience and other deployments.
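
A toy sketch of that failure mode (shard count and placement invented for
illustration):

    NUM_SHARDS = 4

    def shard_of(triple):
        # uniform hash over the whole triple: no data locality at all
        return hash(triple) % NUM_SHARDS

    triples = [
        ("Q42", "P31", "Q5"),     # Douglas Adams - instance of - human
        ("Q42", "P19", "Q350"),   # Douglas Adams - place of birth - Cambridge
        ("Q350", "P17", "Q145"),  # Cambridge - country - United Kingdom
    ]

    placement = {}
    for t in triples:
        placement.setdefault(shard_of(t), []).append(t)

    # Even this tiny two-hop path (Q42 -> Q350 -> Q145) is likely spread
    # over several shards; at 10 bln triples every join becomes a network
    # round-trip. Grouping by subject, hash(t[0]), would at least keep an
    # entity's star of triples together, but path queries still cross shards.
    print(placement)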

Is the issue specific to Blazegraph, and can the issue be solved by
switching platforms? Maybe; we do not know yet. We have not identified
any better solution that guarantees us better scalability, but we have
a plan for looking for that solution, given the resources. We also have
a plan for improving the throughput of Blazegraph, which we're working
on now.

A non-sharding model might be hard to sustain indefinitely, but it is
not clear it can't work in the short term, and it is also not clear that
a sharding model would deliver a clear performance win, as it will have
to involve network latencies inside the queries, which can significantly
affect performance. This can only be resolved by proper testing and
evaluation of the candidate solutions.

> Then it is not a "cluster" in the sense of databases. It is more a
> redundancy architecture like RAID 1. Is this really how BlazeGraph does

I do not think our time here would be productively spent arguing
semantics about what should and should not be called a "cluster". We
call that setup a cluster, and I think now we all understand what we're
talking about.

> it? Don't they have a proper cluster solution, where they repartition
> data across servers? Or are these independent servers a wikimedia staff
> homebuild?

If you mean a sharded or replicated setup, as far as I know, Blazegraph
does not support that (there's some support for replication IIRC, but
replication without sharding probably won't give us much improvement).
We have a plan to evaluate a solution that does shard, given the
necessary resources.

> Some info here:
> 
> - We evaluated some stores according to their performance:
> http://www.semantic-web-journal.net/content/evaluation-metadata-representations-rdf-stores-0
>  
> "Evaluation of Metadata Representations in RDF stores" 

Thanks for the link, it looks very interesting. I'll read it and see
which parts we could use here.

> - Virtuoso has proven quite useful. I don't want to advertise here, but
> the thing they have going for DBpedia uses ridiculous hardware, i.e.
> 64GB RAM, and it is also the OS version, not the professional one with
> clustering and repartition capability. So we have been playing the game
> for ten years now: everybody tries other databases, but then most people
> come back to Virtuoso. I have to admit that OpenLink is maintaining the
> hosting for DBpedia themselves, so they know how to optimise. They
> normally do large banks as customers, with millions of write transactions
> per hour. In LOD2 they also implemented column store features with
> MonetDB and repartitioning in clusters.

I do not know the details of your usage scenario, so before we get into
comparisons, I'd like to understand:

1. Do your servers provide live synchronized updates with Wikidata or
DBpedia? How many updates per second can that server process?
2. How many queries per second this server is serving? What kind of
queries are those?

We did a preliminary, very limited evaluation of Virtuoso for hosting
Wikidata, and it looks like it can load and host the necessary data
(though it does not support some customizations we have now, and we could
not evaluate whether such customizations are possible), but it would
require significant time investment to port all the functionality to it.
Unfortunately, lack of resources did not allow us to do a fuller
evaluation.

Also, as I understand it, the "professional" capabilities of Virtuoso are
closed-source and require a paid license, which would probably be a
problem for running it on WMF infrastructure unless we reach some kind of
special arrangement. Since this arrangement will probably not include
open-sourcing the enterprise part of Virtuoso, it should deliver a very

Re: [Wikidata] Scaling Wikidata Query Service

2019-06-10 Thread Guillaume Lederrey
On Mon, Jun 10, 2019 at 9:03 PM Sebastian Hellmann
 wrote:
>
> Hi Guillaume,
>
> On 10.06.19 16:54, Guillaume Lederrey wrote:
>
> Hello!
>
> On Mon, Jun 10, 2019 at 4:28 PM Sebastian Hellmann
>  wrote:
>
> Hi Guillaume,
>
> On 06.06.19 21:32, Guillaume Lederrey wrote:
>
> Hello all!
>
> There have been a number of concerns raised about the performance and
> scaling of Wikidata Query Service. We share those concerns and we are
> doing our best to address them. Here is some info about what is going
> on:
>
> In an ideal world, WDQS should:
>
> * scale in terms of data size
> * scale in terms of number of edits
> * have low update latency
> * expose a SPARQL endpoint for queries
> * allow anyone to run any queries on the public WDQS endpoint
> * provide great query performance
> * provide a high level of availability
>
> Scaling graph databases is a "known hard problem", and we are reaching
> a scale where there are no obvious easy solutions to address all the
> above constraints. At this point, just "throwing hardware at the
> problem" is not an option anymore. We need to go deeper into the
> details and potentially make major changes to the current architecture.
> Some scaling considerations are discussed in [1]. This is going to take
> time.
>
> I am not sure how to evaluate this correctly. Scaling databases in general is 
> a "known hard problem" and graph databases a sub-field of it, which are 
> optimized for graph-like queries as opposed to column stores or relational 
> databases. If you say that "throwing hardware at the problem" does not help, 
> you are admitting that Blazegraph does not scale for what is needed by 
> Wikidata.
>
> Yes, I am admitting that Blazegraph (at least in the way we are using
> it at the moment) does not scale to our future needs. Blazegraph does
> have support for sharding (what they call "Scale Out"). And yes, we
> need to have a closer look at how that works. I'm not the expert here,
> so I won't even try to assert whether that's a viable solution or not.
>
> Yes, sharding is what you need, I think, instead of replication. This is the 
> technique where data is repartitioned into more manageable chunks across 
> servers.

Well, we need sharding for scalability and replication for
availability, so we do need both. The hard problem is sharding.

> Here is a good explanation of it:
>
> http://vos.openlinksw.com/owiki/wiki/VOS/VOSArticleWebScaleRDF

Interesting read. I don't see how Virtuoso addresses data locality; it
looks like the sharding of their RDF store is just hash-based (I'm
assuming some kind of uniform hash). I'm not enough of an expert on
graph databases, but I doubt that a highly connected graph like
Wikidata will be able to scale reads without some way to address data
locality. Obviously, this needs testing.

> http://docs.openlinksw.com/virtuoso/ch-clusterprogramming/
>
>
> Sharding, scale-out or repartitioning is a classical enterprise feature for 
> Open-source databases. I am rather surprised that Blazegraph is full GPL 
> without an enterprise edition. But then they really sounded like their goal 
> as a company was to be bought by a bigger fish, in this case Amazon Web 
> Services. What is their deal? Are they offering support?
>
> So if you go open-source, I think you will have a hard time finding a good 
> free database with sharding/repartitioning. FoundationDB, as proposed in the 
> grant [1], is from Apple
>
> [1] https://meta.wikimedia.org/wiki/Grants:Project/WDQS_On_FoundationDB
>
>
> I mean try the sharding feature. At some point though it might be worth 
> considering to go enterprise. Corporate Open Source often has a twist.

Closed source is not an option. We have strong open source
requirements to deploy anything in our production environment.

> Just a note here: Virtuoso is also a full RDBMS, so you could probably keep 
> wikibase db in the same cluster and fix the asynchronicity. That is also true 
> for any mappers like Sparqlify: http://aksw.org/Projects/Sparqlify.html 
> However, these shift the problem, then you need a sharded/repartitioned 
> relational database

There is no plan to move the Wikibase storage out of MySQL at the
moment. In any case, low coupling between the primary storage for
Wikidata and a secondary storage for complex querying is a sound
architectural principle. This asynchronous update process is most
probably going to stay in place, just because it makes a lot of sense.
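
A rough sketch of that decoupling (all function names are placeholders,
not the real WDQS updater API):

    import time

    def poll_recent_changes(cursor):            # placeholder
        return [], cursor                       # (changed entity ids, new cursor)

    def fetch_entity_rdf(entity_id):            # placeholder
        return ""

    def apply_to_triple_store(entity_id, rdf):  # placeholder
        pass

    def update_loop(cursor, iterations=3):
        # Asynchronous by design: the secondary store lags the wiki by
        # however far the cursor trails "now" - minutes in the best case,
        # hours under load.
        for _ in range(iterations):
            changes, cursor = poll_recent_changes(cursor)
            for entity_id in changes:
                apply_to_triple_store(entity_id, fetch_entity_rdf(entity_id))
            time.sleep(1)

    update_loop(cursor="2019-06-10T00:00:00Z")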

Thanks for the discussion so far! It is always interesting to have outside ideas!

   Have fun!

 Guillaume

>
> All the best,
>
> Sebastian
>
>
>
> From [1]:
>
> At the moment, each WDQS cluster is a group of independent servers, sharing 
> nothing, with each server independently updated and each server holding a 
> full data set.
>
> Then it is not a "cluster" in the sense of databases. It is more a redundancy 
> architecture like RAID 1. Is this really how BlazeGraph does it? Don't they 
> have a proper cluster solution, where they repartition data 

Re: [Wikidata] Scaling Wikidata Query Service

2019-06-10 Thread Sebastian Hellmann

Hi Guillaume,

On 10.06.19 16:54, Guillaume Lederrey wrote:

> Hello!
>
> On Mon, Jun 10, 2019 at 4:28 PM Sebastian Hellmann
>  wrote:
>
>> Hi Guillaume,
>>
>> On 06.06.19 21:32, Guillaume Lederrey wrote:
>>
>>> Hello all!
>>>
>>> There have been a number of concerns raised about the performance and
>>> scaling of Wikidata Query Service. We share those concerns and we are
>>> doing our best to address them. Here is some info about what is going
>>> on:
>>>
>>> In an ideal world, WDQS should:
>>>
>>> * scale in terms of data size
>>> * scale in terms of number of edits
>>> * have low update latency
>>> * expose a SPARQL endpoint for queries
>>> * allow anyone to run any queries on the public WDQS endpoint
>>> * provide great query performance
>>> * provide a high level of availability
>>>
>>> Scaling graph databases is a "known hard problem", and we are reaching
>>> a scale where there are no obvious easy solutions to address all the
>>> above constraints. At this point, just "throwing hardware at the
>>> problem" is not an option anymore. We need to go deeper into the
>>> details and potentially make major changes to the current architecture.
>>> Some scaling considerations are discussed in [1]. This is going to take
>>> time.
>>
>> I am not sure how to evaluate this correctly. Scaling databases in
>> general is a "known hard problem", and graph databases are a sub-field
>> of it, optimized for graph-like queries as opposed to column stores or
>> relational databases. If you say that "throwing hardware at the problem"
>> does not help, you are admitting that Blazegraph does not scale for what
>> is needed by Wikidata.
>
> Yes, I am admitting that Blazegraph (at least in the way we are using
> it at the moment) does not scale to our future needs. Blazegraph does
> have support for sharding (what they call "Scale Out"). And yes, we
> need to have a closer look at how that works. I'm not the expert here,
> so I won't even try to assert whether that's a viable solution or not.


Yes, sharding is what you need, I think, instead of replication. This is 
the technique where data is repartitioned into more manageable chunks 
across servers.


Here is a good explanation of it:

http://vos.openlinksw.com/owiki/wiki/VOS/VOSArticleWebScaleRDF

http://docs.openlinksw.com/virtuoso/ch-clusterprogramming/


Sharding, scale-out or repartitioning is a classical enterprise feature 
for open-source databases. I am rather surprised that Blazegraph is full 
GPL without an enterprise edition. But then they really sounded like 
their goal as a company was to be bought by a bigger fish, in this case 
Amazon Web Services. What is their deal? Are they offering support?


So if you go open-source, I think you will have a hard time finding a 
good free database with sharding/repartitioning. FoundationDB, as 
proposed in the grant [1], is from Apple.


[1] https://meta.wikimedia.org/wiki/Grants:Project/WDQS_On_FoundationDB


I mean: try the sharding feature. At some point, though, it might be worth 
considering going enterprise. Corporate open source often has a twist.


Just a note here: Virtuoso is also a full RDBMS, so you could probably 
keep wikibase db in the same cluster and fix the asynchronicity. That is 
also true for any mappers like Sparqlify: 
http://aksw.org/Projects/Sparqlify.html However, these shift the 
problem, then you need a sharded/repartitioned relational database



All the best,

Sebastian





>> From [1]:
>>
>> At the moment, each WDQS cluster is a group of independent servers,
>> sharing nothing, with each server independently updated and each server
>> holding a full data set.
>>
>> Then it is not a "cluster" in the sense of databases. It is more a
>> redundancy architecture like RAID 1. Is this really how BlazeGraph does
>> it? Don't they have a proper cluster solution, where they repartition
>> data across servers? Or are these independent servers a wikimedia staff
>> homebuild?
>
> It all depends on your definition of a cluster. We have groups of
> machines collectively serving some coherent traffic, but each machine
> is completely independent from the others. So yes, the comparison to
> RAID 1 is adequate.
>
>> Some info here:
>>
>> - We evaluated some stores according to their performance:
>> http://www.semantic-web-journal.net/content/evaluation-metadata-representations-rdf-stores-0
>> "Evaluation of Metadata Representations in RDF stores"
>
> Thanks for the link! That looks quite interesting!
>
>> - Virtuoso has proven quite useful. I don't want to advertise here, but
>> the thing they have going for DBpedia uses ridiculous hardware, i.e.
>> 64GB RAM, and it is also the OS version, not the professional one with
>> clustering and repartition capability. So we have been playing the game
>> for ten years now: everybody tries other databases, but then most people
>> come back to Virtuoso. I have to admit that OpenLink is maintaining the
>> hosting for DBpedia themselves, so they know how to optimise. They
>> normally do large banks as customers, with millions of write
>> transactions per hour. In LOD2 they also implemented column store
>> features with MonetDB and repartitioning in clusters.
>
> I'm not entirely sure how to read the above 

[Wikidata-bugs] [Maniphest] [Updated] T217018: Upgrade Wikibase extension to use newer phan

2019-06-10 Thread Jdforrester-WMF
Jdforrester-WMF added a parent task: T225456: Run phan on PHP 7.2, not PHP 7.0.

TASK DETAIL
  https://phabricator.wikimedia.org/T217018

EMAIL PREFERENCES
  https://phabricator.wikimedia.org/settings/panel/emailpreferences/

To: Jdforrester-WMF
Cc: Jdforrester-WMF, matej_suchanek, gerritbot, Aklapper, 
Lucas_Werkmeister_WMDE, E.S.A-Sheild, darthmon_wmde, joker88john, CucyNoiD, 
Nandana, NebulousIris, Gaboe420, Versusxo, Majesticalreaper22, Giuliamocci, 
Adrian1985, Cpaulf30, Lahi, Gq86, Baloch007, Darkminds3113, Bsandipan, Lordiis, 
GoranSMilovanovic, Adik2382, Th3d3v1ls, Ramalepe, Liugev6, QZanden, 
LawExplorer, WSH1906, Lewizho99, Maathavan, _jensen, rosalieper, Wikidata-bugs, 
aude, Mbch331
___
Wikidata-bugs mailing list
Wikidata-bugs@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata-bugs


[Wikidata-bugs] [Maniphest] [Updated] T217018: Upgrade Wikibase extension to use newer phan

2019-06-10 Thread Jdforrester-WMF
Jdforrester-WMF added a comment.


  This is blocking T225456 because 
the version of phan that Wikibase master is using is so old. :-(

TASK DETAIL
  https://phabricator.wikimedia.org/T217018

EMAIL PREFERENCES
  https://phabricator.wikimedia.org/settings/panel/emailpreferences/

To: Jdforrester-WMF
Cc: Jdforrester-WMF, matej_suchanek, gerritbot, Aklapper, 
Lucas_Werkmeister_WMDE, E.S.A-Sheild, darthmon_wmde, joker88john, CucyNoiD, 
Nandana, NebulousIris, Gaboe420, Versusxo, Majesticalreaper22, Giuliamocci, 
Adrian1985, Cpaulf30, Lahi, Gq86, Baloch007, Darkminds3113, Bsandipan, Lordiis, 
GoranSMilovanovic, Adik2382, Th3d3v1ls, Ramalepe, Liugev6, QZanden, 
LawExplorer, WSH1906, Lewizho99, Maathavan, _jensen, rosalieper, Wikidata-bugs, 
aude, Mbch331
___
Wikidata-bugs mailing list
Wikidata-bugs@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata-bugs


[Wikidata-bugs] [Maniphest] [Commented On] T170281: Raise PHP version requirement of Wikibase (and its related extensions) to 5.6

2019-06-10 Thread Jdforrester-WMF
Jdforrester-WMF added a comment.


  Can we declare this done? :-)

TASK DETAIL
  https://phabricator.wikimedia.org/T170281

EMAIL PREFERENCES
  https://phabricator.wikimedia.org/settings/panel/emailpreferences/

To: Jdforrester-WMF
Cc: Jdforrester-WMF, Paladox, Ricordisamoa, PokestarFan, thiemowmde, Reedy, 
Lucas_Werkmeister_WMDE, aude, hoo, ArielGlenn, daniel, Lydia_Pintscher, Jonas, 
Aleksey_WMDE, WMDE-leszek, Aklapper, darthmon_wmde, Premeditated, Nandana, 
Lahi, Gq86, GoranSMilovanovic, QZanden, LawExplorer, _jensen, rosalieper, 
Wikidata-bugs, Mbch331
___
Wikidata-bugs mailing list
Wikidata-bugs@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata-bugs


[Wikidata-bugs] [Maniphest] [Updated] T198557: Remove support for legacy pre-MCR schema

2019-06-10 Thread CCicalese_WMF
CCicalese_WMF edited projects, added MW-1.34-release; removed MW-1.33-release.
CCicalese_WMF added a comment.


  Agree, this should be moved to 1.34.

TASK DETAIL
  https://phabricator.wikimedia.org/T198557

EMAIL PREFERENCES
  https://phabricator.wikimedia.org/settings/panel/emailpreferences/

To: CCicalese_WMF
Cc: Krinkle, BPirkle, Fjalapeno, CCicalese_WMF, Aklapper, daniel, 
darthmon_wmde, WDoranWMF, Premeditated, holger.knust, EvanProdromou, Nandana, 
Lahi, Gq86, GoranSMilovanovic, RazeSoldier, QZanden, LawExplorer, JJMC89, 
_jensen, rosalieper, Agabi10, Pchelolo, Wikidata-bugs, aude, Jdforrester-WMF, 
Mbch331, Ltrlg, MGChecker
___
Wikidata-bugs mailing list
Wikidata-bugs@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata-bugs


[Wikidata-bugs] [Maniphest] [Updated] T198492: Drop rev_text_id and ar_text_id when running update.php

2019-06-10 Thread CCicalese_WMF
CCicalese_WMF edited projects, added MW-1.34-release; removed MW-1.33-release.

TASK DETAIL
  https://phabricator.wikimedia.org/T198492

EMAIL PREFERENCES
  https://phabricator.wikimedia.org/settings/panel/emailpreferences/

To: CCicalese_WMF
Cc: Aklapper, Anomie, Jdforrester-WMF, Tgr, daniel, darthmon_wmde, WDoranWMF, 
Premeditated, holger.knust, EvanProdromou, Nandana, Lahi, Gq86, 
GoranSMilovanovic, RazeSoldier, QZanden, LawExplorer, JJMC89, _jensen, 
rosalieper, Agabi10, Pchelolo, Wikidata-bugs, aude, Mbch331, Ltrlg, MGChecker
___
Wikidata-bugs mailing list
Wikidata-bugs@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata-bugs


[Wikidata-bugs] [Maniphest] [Commented On] T138332: interwiki links to/from Multilingual Wikisource

2019-06-10 Thread Jarekt
Jarekt added a comment.


  I ran into this problem when dealing with unusual sitelinks that are not 
displayed. For example, Q62081484 seems to be an empty item with nothing in 
it other than a label in Italian, but a closer look shows that it also has a 
sitelink to 
https://wikisource.org/wiki/Li_fioreti_de_mess%C3%AA_san_Francescho. At the 
moment the GUI does not allow you to see or change it. According to this 
query there are only 11 items with sitelinks to the old wikisource. We need 
to either remove them and prevent future creation, or show the sitelink in 
the GUI - perhaps in the "other sites" section, or as mulwikisource in the 
wikisource section.

TASK DETAIL
  https://phabricator.wikimedia.org/T138332

EMAIL PREFERENCES
  https://phabricator.wikimedia.org/settings/panel/emailpreferences/

To: Jarekt
Cc: Jarekt, Ash_Crow, jhsoby, PokestarFan, Nikki, Liuxinyu970226, Candalua, 
Mbch331, Lydia_Pintscher, Thibaut120094, Aklapper, Yann, Zppix, darthmon_wmde, 
MattLongCT, Premeditated, DannyS712, Nandana, Lahi, Gq86, GoranSMilovanovic, 
Mahir256, QZanden, LawExplorer, _jensen, rosalieper, Bodhisattwa, Samwilson, 
Wikidata-bugs, aude, jayvdb, Shizhao, Billinghurst, jayantanth, Krenair
___
Wikidata-bugs mailing list
Wikidata-bugs@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata-bugs


[Wikidata-bugs] [Maniphest] [Updated] T214308: Force usage of MCR aware database schema

2019-06-10 Thread CCicalese_WMF
CCicalese_WMF set the point value for this task to "1".

TASK DETAIL
  https://phabricator.wikimedia.org/T214308

EMAIL PREFERENCES
  https://phabricator.wikimedia.org/settings/panel/emailpreferences/

To: CCicalese_WMF
Cc: Aklapper, CCicalese_WMF, Fjalapeno, daniel, darthmon_wmde, WDoranWMF, 
Premeditated, EvanProdromou, Nandana, Lahi, Gq86, GoranSMilovanovic, 
RazeSoldier, QZanden, LawExplorer, JJMC89, _jensen, rosalieper, Agabi10, 
Pchelolo, MGChecker, Wikidata-bugs, aude, Jdforrester-WMF, Mbch331, Ltrlg
___
Wikidata-bugs mailing list
Wikidata-bugs@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata-bugs


[Wikidata-bugs] [Maniphest] [Updated] T214308: Force usage of MCR aware database schema

2019-06-10 Thread CCicalese_WMF
CCicalese_WMF moved this task from Inbox to Team 3 on the Core Platform Team 
Kanban board.
CCicalese_WMF edited projects, added Core Platform Team Kanban (Team 3); 
removed Core Platform Team Kanban.

TASK DETAIL
  https://phabricator.wikimedia.org/T214308

WORKBOARD
  https://phabricator.wikimedia.org/project/board/3696/

EMAIL PREFERENCES
  https://phabricator.wikimedia.org/settings/panel/emailpreferences/

To: CCicalese_WMF
Cc: Aklapper, CCicalese_WMF, Fjalapeno, daniel, darthmon_wmde, WDoranWMF, 
Premeditated, EvanProdromou, Nandana, Lahi, Gq86, GoranSMilovanovic, 
RazeSoldier, QZanden, LawExplorer, JJMC89, _jensen, rosalieper, Agabi10, 
Pchelolo, MGChecker, Wikidata-bugs, aude, Jdforrester-WMF, Mbch331, Ltrlg
___
Wikidata-bugs mailing list
Wikidata-bugs@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata-bugs


[Wikidata-bugs] [Maniphest] [Updated] T214308: Force usage of MCR aware database schema

2019-06-10 Thread CCicalese_WMF
CCicalese_WMF edited projects, added Core Platform Team Kanban; removed Core 
Platform Team Backlog (Later).

TASK DETAIL
  https://phabricator.wikimedia.org/T214308

EMAIL PREFERENCES
  https://phabricator.wikimedia.org/settings/panel/emailpreferences/

To: CCicalese_WMF
Cc: Aklapper, CCicalese_WMF, Fjalapeno, daniel, darthmon_wmde, WDoranWMF, 
Premeditated, EvanProdromou, Nandana, Lahi, Gq86, GoranSMilovanovic, 
RazeSoldier, QZanden, LawExplorer, JJMC89, _jensen, rosalieper, Agabi10, 
Pchelolo, MGChecker, Wikidata-bugs, aude, Jdforrester-WMF, Mbch331, Ltrlg, 
holger.knust
___
Wikidata-bugs mailing list
Wikidata-bugs@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata-bugs


Re: [Wikidata] Scaling Wikidata Query Service

2019-06-10 Thread Guillaume Lederrey
Hello!

On Mon, Jun 10, 2019 at 4:28 PM Sebastian Hellmann
 wrote:
>
> Hi Guillaume,
>
> On 06.06.19 21:32, Guillaume Lederrey wrote:
>
> Hello all!
>
> There have been a number of concerns raised about the performance and
> scaling of Wikidata Query Service. We share those concerns and we are
> doing our best to address them. Here is some info about what is going
> on:
>
> In an ideal world, WDQS should:
>
> * scale in terms of data size
> * scale in terms of number of edits
> * have low update latency
> * expose a SPARQL endpoint for queries
> * allow anyone to run any queries on the public WDQS endpoint
> * provide great query performance
> * provide a high level of availability
>
> Scaling graph databases is a "known hard problem", and we are reaching
> a scale where there are no obvious easy solutions to address all the
> above constraints. At this point, just "throwing hardware at the
> problem" is not an option anymore. We need to go deeper into the
> details and potentially make major changes to the current architecture.
> Some scaling considerations are discussed in [1]. This is going to take
> time.
>
> I am not sure how to evaluate this correctly. Scaling databases in general is 
> a "known hard problem", and graph databases are a sub-field of it, 
> optimized for graph-like queries as opposed to column stores or relational 
> databases. If you say that "throwing hardware at the problem" does not help, 
> you are admitting that Blazegraph does not scale for what is needed by 
> Wikidata.

Yes, I am admitting that Blazegraph (at least in the way we are using
it at the moment) does not scale to our future needs. Blazegraph does
have support for sharding (what they call "Scale Out"). And yes, we
need to have a closer look at how that works. I'm not the expert here,
so I won't even try to assert whether that's a viable solution or not.

> From [1]:
>
> At the moment, each WDQS cluster is a group of independent servers, sharing 
> nothing, with each server independently updated and each server holding a 
> full data set.
>
> Then it is not a "cluster" in the sense of databases. It is more a redundancy 
> architecture like RAID 1. Is this really how BlazeGraph does it? Don't they 
> have a proper cluster solution, where they repartition data across servers? 
> Or are these independent servers a wikimedia staff homebuild?

It all depends on your definition of a cluster. We have groups of
machines collectively serving some coherent traffic, but each machine
is completely independent from the others. So yes, the comparison to
RAID 1 is adequate.

> Some info here:
>
> - We evaluated some stores according to their performance: 
> http://www.semantic-web-journal.net/content/evaluation-metadata-representations-rdf-stores-0
>   "Evaluation of Metadata Representations in RDF stores"

Thanks for the link! That looks quite interesting!

> - Virtuoso has proven quite useful. I don't want to advertise here, but the 
> instance they run for DBpedia uses ridiculously modest hardware, i.e. 64GB 
> RAM, and it is the open-source version, not the professional one with 
> clustering and repartitioning capability. We have been playing this game for 
> ten years now: everybody tries other databases, but most people come back to 
> Virtuoso. I have to admit that OpenLink maintains the hosting for DBpedia 
> themselves, so they know how to optimise it. Their usual customers are large 
> banks with millions of write transactions per hour. In LOD2 they also 
> implemented column-store features with MonetDB and repartitioning in clusters.

I'm not entirely sure how to read the above (and a quick look at the
Virtuoso website does not give me the answer either), but it looks
like the sharding / partitioning options are only available in the
enterprise version. That probably makes it a non-starter for us.

> - I recently heard a presentation from ArangoDB and they have a good cluster 
> concept as well, although I don't know anybody who has tried it. The slides 
> seemed to make sense.

Nice, another one to add to our list of options to test.

> All the best,
>
> Sebastian
>
>
>
>
> Reasonably, addressing all of the above constraints is unlikely to
> ever happen. Some of the constraints are non-negotiable: if we can't
> keep up with Wikidata in terms of data size or number of edits, it does
> not make sense to address query performance. On some constraints, we
> will probably need to compromise.
>
> For example, the update process is asynchronous. It is by nature
> expected to lag. In the best case, this lag is measured in minutes,
> but can climb to hours occasionally. This is a case of prioritizing
> stability and correctness (ingesting all edits) over update latency.
> And while we can work to reduce the maximum latency, this will still
> be an asynchronous process and needs to be considered as such.
>
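
To give a concrete way of observing that lag: the updater records the
timestamp of the last change it applied in the graph itself, under
schema:dateModified, so anyone can measure how far behind the server they
hit is. A minimal Python sketch against the public endpoint (the User-Agent
string is just a placeholder -- pick something meaningful):

    import requests
    from datetime import datetime, timezone

    QUERY = "SELECT ?t WHERE { <http://www.wikidata.org> schema:dateModified ?t }"

    response = requests.get(
        "https://query.wikidata.org/sparql",
        params={"query": QUERY, "format": "json"},
        headers={"User-Agent": "lag-check-example/0.1"},  # placeholder
        timeout=60,
    )
    value = response.json()["results"]["bindings"][0]["t"]["value"]
    last_update = datetime.fromisoformat(value.replace("Z", "+00:00"))
    print("approximate lag:", datetime.now(timezone.utc) - last_update)
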
> We currently have one Blazegraph expert working with us to address a
> number of performance and stability issues. We are planning to hire an
> additional engineer to help us support the service in the long term.

[Wikidata-bugs] [Maniphest] [Closed] T78134: Adapt red links on the translation - part 2

2019-06-10 Thread Pginer-WMF
Pginer-WMF closed this task as "Resolved".
Pginer-WMF claimed this task.
Pginer-WMF added a comment.


  Content translation supports red links, and specific improvements are 
captured in specific tickets (T197787).

TASK DETAIL
  https://phabricator.wikimedia.org/T78134

EMAIL PREFERENCES
  https://phabricator.wikimedia.org/settings/panel/emailpreferences/

To: Pginer-WMF
Cc: Pginer-WMF, Arrbee, Aklapper, darthmon_wmde, Premeditated, Nandana, Lahi, 
Gq86, Bigdog222, GoranSMilovanovic, chapulina, QZanden, LawExplorer, _jensen, 
rosalieper, Nizil, Ixocactus, Wikidata-bugs, aude, Amire80, Jsahleen, 
Nikerabbit, santhosh, KartikMistry, Mbch331
___
Wikidata-bugs mailing list
Wikidata-bugs@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata-bugs


Re: [Wikidata] Scaling Wikidata Query Service

2019-06-10 Thread Sebastian Hellmann

Hi Guillaume,

On 06.06.19 21:32, Guillaume Lederrey wrote:

Hello all!

There have been a number of concerns raised about the performance and
scaling of Wikidata Query Service. We share those concerns and we are
doing our best to address them. Here is some info about what is going
on:

In an ideal world, WDQS should:

* scale in terms of data size
* scale in terms of number of edits
* have low update latency
* expose a SPARQL endpoint for queries
* allow anyone to run any queries on the public WDQS endpoint
* provide great query performance
* provide a high level of availability

Scaling graph databases is a "known hard problem", and we are reaching
a scale where there are no obvious easy solutions to address all the
above constraints. At this point, just "throwing hardware at the
problem" is not an option anymore. We need to go deeper into the
details and potentially make major changes to the current architecture.
Some scaling considerations are discussed in [1]. This is going to take
time.


I am not sure how to evaluate this correctly. Scaling databases in 
general is a "known hard problem", and graph databases are a sub-field 
of it, optimized for graph-like queries as opposed to column stores or 
relational databases. If you say that "throwing hardware at the 
problem" does not help, you are admitting that Blazegraph does not 
scale to what Wikidata needs.


From [1]:

At the moment, each WDQS cluster is a group of independent servers, 
sharing nothing, with each server independently updated and each 
server holding a full data set.


Then it is not a "cluster" in the database sense. It is more a 
redundancy architecture like RAID 1. Is this really how Blazegraph 
does it? Don't they have a proper cluster solution, where they 
repartition data across servers? Or are these independent servers a 
Wikimedia staff homebuild?


Some info here:

- We evaluated some stores according to their performance: 
http://www.semantic-web-journal.net/content/evaluation-metadata-representations-rdf-stores-0 
"Evaluation of Metadata Representations in RDF stores"


- Virtuoso has proven quite useful. I don't want to advertise here, but 
the instance they run for DBpedia uses ridiculously modest hardware, 
i.e. 64GB RAM, and it is the open-source version, not the professional 
one with clustering and repartitioning capability. We have been playing 
this game for ten years now: everybody tries other databases, but most 
people come back to Virtuoso. I have to admit that OpenLink maintains 
the hosting for DBpedia themselves, so they know how to optimise it. 
Their usual customers are large banks with millions of write 
transactions per hour. In LOD2 they also implemented column-store 
features with MonetDB and repartitioning in clusters.


- I recently heard a presentation from ArangoDB and they have a good 
cluster concept as well, although I don't know anybody who has tried 
it. The slides seemed to make sense.


All the best,

Sebastian





Reasonably, addressing all of the above constraints is unlikely to
ever happen. Some of the constraints are non-negotiable: if we can't
keep up with Wikidata in terms of data size or number of edits, it does
not make sense to address query performance. On some constraints, we
will probably need to compromise.

For example, the update process is asynchronous. It is by nature
expected to lag. In the best case, this lag is measured in minutes,
but can climb to hours occasionally. This is a case of prioritizing
stability and correctness (ingesting all edits) over update latency.
And while we can work to reduce the maximum latency, this will still
be an asynchronous process and needs to be considered as such.

We currently have one Blazegraph expert working with us to address a
number of performance and stability issues. We
are planning to hire an additional engineer to help us support the
service in the long term. You can follow our current work in phabricator [2].

If anyone has experience with scaling large graph databases, please
reach out to us, we're always happy to share ideas!

Thanks all for your patience!

Guillaume

[1] https://wikitech.wikimedia.org/wiki/Wikidata_query_service/ScalingStrategy
[2] https://phabricator.wikimedia.org/project/view/1239/


--
All the best,
Sebastian Hellmann

Director of Knowledge Integration and Linked Data Technologies (KILT) 
Competence Center

at the Institute for Applied Informatics (InfAI) at Leipzig University
Executive Director of the DBpedia Association
Projects: http://dbpedia.org, http://nlp2rdf.org, 
http://linguistics.okfn.org, https://www.w3.org/community/ld4lt 


Homepage: http://aksw.org/SebastianHellmann
Research Group: http://aksw.org
___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata


[Wikidata-bugs] [Maniphest] [Updated] T222624: Unable to remove last alias from entity

2019-06-10 Thread Aklapper
Aklapper added a project: Wikidata.
Aklapper added a comment.


  Assuming this is about #Wikidata.

TASK DETAIL
  https://phabricator.wikimedia.org/T222624

EMAIL PREFERENCES
  https://phabricator.wikimedia.org/settings/panel/emailpreferences/

To: Aklapper
Cc: Matthias_Geisler_WMDE, Jakob_WMDE, Aklapper, Tarrow, darthmon_wmde, 
Premeditated, Nandana, Lahi, Gq86, GoranSMilovanovic, QZanden, LawExplorer, 
_jensen, rosalieper, Wikidata-bugs, aude, Lydia_Pintscher, Mbch331
___
Wikidata-bugs mailing list
Wikidata-bugs@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata-bugs


[Wikidata-bugs] [Maniphest] [Commented On] T203696: Drastically reduce the number of ResourceLoader modules that WikibaseClient registers

2019-06-10 Thread gerritbot
gerritbot added a comment.


  Change 516240 had a related patch set uploaded (by Krinkle; owner: Krinkle):
  [mediawiki/extensions/Wikibase@master] Consolidate more view/ modules 
resources (16)
  
  https://gerrit.wikimedia.org/r/516240

TASK DETAIL
  https://phabricator.wikimedia.org/T203696

EMAIL PREFERENCES
  https://phabricator.wikimedia.org/settings/panel/emailpreferences/

To: gerritbot
Cc: gerritbot, Addshore, Aklapper, Krinkle, Jdforrester-WMF, E.S.A-Sheild, 
darthmon_wmde, Premeditated, joker88john, CucyNoiD, Nandana, NebulousIris, 
Gaboe420, Versusxo, Majesticalreaper22, Giuliamocci, Adrian1985, Cpaulf30, 
Imarlier, Lahi, Gq86, Baloch007, Darkminds3113, Bsandipan, Lordiis, 
GoranSMilovanovic, Adik2382, Th3d3v1ls, Ramalepe, Liugev6, QZanden, 
LawExplorer, Vali.matei, WSH1906, Lewizho99, Maathavan, _jensen, rosalieper, 
Wikidata-bugs, aude, Mbch331
___
Wikidata-bugs mailing list
Wikidata-bugs@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata-bugs


[Wikidata-bugs] [Maniphest] [Changed Subscribers] T212843: [EPIC] Access to Wikidata's lexicographical data from Wiktionaries and other WMF sites

2019-06-10 Thread deryckchan
deryckchan added a subscriber: Fnielsen.
deryckchan added a comment.


  There's a Wikimania 2019 submission about Lexemes, by @Fnielsen: 
https://wikimania.wikimedia.org/wiki/2019:Languages/Wikidata_lexemes

TASK DETAIL
  https://phabricator.wikimedia.org/T212843

EMAIL PREFERENCES
  https://phabricator.wikimedia.org/settings/panel/emailpreferences/

To: deryckchan
Cc: Fnielsen, RexxS, Pamputt, Mike_Peel, MarcoSwart, Geertivp, Liuxinyu970226, 
Addshore, Jdforrester-WMF, deryckchan, Lydia_Pintscher, Lea_Lacroix_WMDE, 
darthmon_wmde, Premeditated, Nandana, Mringgaard, Lahi, Gq86, Cinemantique, 
GoranSMilovanovic, QZanden, LawExplorer, _jensen, rosalieper, jberkel, 
Psychoslave, Wikidata-bugs, aude, GPHemsley, Shizhao, Nemo_bis, Darkdadaah, 
Mbch331, Ltrlg, Krenair
___
Wikidata-bugs mailing list
Wikidata-bugs@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata-bugs