[Wikidata-bugs] [Maniphest] T270764: Wikidata Truthy dump is missing important metadata triples

2023-10-02 Thread mkroetzsch
mkroetzsch added a comment.


  @Lydia_Pintscher Are you asking about the discrepancy in the counts, or about 
the general idea of this issue report?
  
  I must admit that I do not get the significance of the SPARQL queries above. 
The properties reported as missing seem to exist and work as expected on 
query.wikidata.org, so one does not need expensive GROUP BY queries to compute 
them (and I think this was the main point of adding them).
  
  I don't see a reason why one should not add some such aggregates, which are 
already available and can be exposed with an established vocabulary, to the 
truthy dump as well, other than a general concern that the dump would get too 
big when going down that road.
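
  For illustration, a sketch of the difference (assuming the 
wikibase:statements count predicate as exposed on query.wikidata.org, with the 
standard prefixes of that endpoint):

    # Expensive: derive the statement count per item via aggregation
    SELECT ?item (COUNT(?statement) AS ?count) WHERE {
      ?item ?p ?statement .
      [] wikibase:claim ?p .
    } GROUP BY ?item

    # Cheap: read the precomputed count directly from a metadata triple
    SELECT ?item ?count WHERE {
      ?item wikibase:statements ?count .
    }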

TASK DETAIL
  https://phabricator.wikimedia.org/T270764

EMAIL PREFERENCES
  https://phabricator.wikimedia.org/settings/panel/emailpreferences/

To: mkroetzsch
Cc: Lydia_Pintscher, mkroetzsch, Nicksinch, Aklapper, VladimirAlexiev, 
Danny_Benjafield_WMDE, Astuthiodit_1, karapayneWMDE, Invadibot, maantietaja, 
ItamarWMDE, Akuckartz, Nandana, Lahi, Gq86, GoranSMilovanovic, QZanden, 
LawExplorer, _jensen, rosalieper, Scott_WUaS, Wikidata-bugs, aude, Mbch331
___
Wikidata-bugs mailing list -- wikidata-bugs@lists.wikimedia.org
To unsubscribe send an email to wikidata-bugs-le...@lists.wikimedia.org


[Wikidata-bugs] [Maniphest] [Commented On] T244341: Wikibase RDF dump: stop using blank nodes for encoding unknown values and OWL constraints

2020-02-08 Thread mkroetzsch
mkroetzsch added a comment.


  In T244341#5862287 <https://phabricator.wikimedia.org/T244341#5862287>, 
@Jheald wrote:
  
  > Please don't think or refer to the blank nodes as "unknown values".
  
  I fully agree. The use of the word "unknown" in the UI was a mistake that 
stuck. The intention was always to mean "unspecified" without any epistemic 
connotation. That is: an unspecified value only makes a positive statement 
("there is a value for this property") and no negative one ("we [who exactly?] 
do not know this value").

TASK DETAIL
  https://phabricator.wikimedia.org/T244341

EMAIL PREFERENCES
  https://phabricator.wikimedia.org/settings/panel/emailpreferences/

To: mkroetzsch
Cc: Jheald, Daniel_Mietchen, mkroetzsch, Denny, Lucas_Werkmeister_WMDE, 
Aklapper, dcausse, darthmon_wmde, Nandana, Lahi, Gq86, GoranSMilovanovic, 
QZanden, EBjune, merbst, LawExplorer, _jensen, rosalieper, Scott_WUaS, Jonas, 
Xmlizer, jkroll, Smalyshev, Wikidata-bugs, Jdouglas, aude, Tobias1984, 
Manybubbles, Mbch331
___
Wikidata-bugs mailing list
Wikidata-bugs@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata-bugs


[Wikidata-bugs] [Maniphest] [Commented On] T244341: Wikibase RDF dump: stop using blank nodes for encoding unknown values and OWL constraints

2020-02-07 Thread mkroetzsch
mkroetzsch added a comment.


  Hi,
  
  Using the same value for "unknown" is a very bad idea and should not be 
considered. You already found out why. This highlights another general design 
principle: the RDF data should encode meaning in structure in a direct way. If 
two triples have the same RDF term as object, then they should represent 
relationships to the same thing, without any further conditions on the shape of 
that term. Otherwise, SPARQL does not work well. For example, the property 
paths you can write with * have no way of performing extra tests on the nodes 
they traverse, so if you want to use * in queries in a meaningful way, the 
meaning of a chain must not depend on the shape of the terms along it.
  
  This principle is also why we chose bnodes in the first place. OWL also has a 
standard way of encoding the information that some property has an 
(unspecified) value, but the encoding of this looks more like what we have in 
the case of negation (no value) now. If we had used this, one would need 
completely different query patterns for people with an unspecified date of 
death and for people with a specified one. In contrast, the current 
bnode encoding allows you to ask a query for everybody with a date of death 
without having to know if it is given explicitly or left unspecified (you don't 
even have to know that the latter is possible). This should be kept in mind: 
the encoding is not just for "use cases" where you are interested in the 
special situation (e.g., someone having unspecified date of death) but also for 
all other queries dealing with data of the relevant kind. For this reason, the 
RDF structure for encoding unspecified values should, as much as possible, look 
like the cases where there are values.
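
  A sketch of that uniform query shape, using the real Wikidata property P570 
(date of death) and the standard query.wikidata.org prefixes; the blank node 
stands in for an unspecified value:

    # Matches people with an explicit date of death as well as people whose
    # date of death is recorded as an unspecified value (a blank node);
    # the OWL-restriction style would need a structurally different pattern:
    SELECT ?person ?dod WHERE {
      ?person wdt:P570 ?dod .
    }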
  
  I am not aware of any other option for encoding "there is a value but we know 
nothing more about it" in RDF or OWL besides the two options I mentioned. The 
proposal to use a made-up IRI instead of a bnode gives identity to the unknown 
(even if that identity has no meaning in our data yet). It works in many 
unspecified-value use cases where bnodes work, but not in all. The three main 
confusions possible are:
  
  1. confusing a placeholder "unspecified" IRI with a real IRI that is expected 
in normal cases (imagine using a FILTER on URL-type property values),
  2. believing that the data changed when only the placeholder IRI has changed 
(imagine someone deleting and re-adding a qualifier with "unspecified" -- if 
it's a bnode, the outcome is the same in terms of RDF semantics, but if you use 
placeholder IRIs, you need to know their special meaning to compare the two RDF 
data sets correctly)
  3. accidental or deliberate uses of placeholder IRIs in other places (imagine 
somebody puts your placeholders as value into a URL-type property)
  
  Case 3 can probably be disallowed by the software (if one thinks of it).
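
  To illustrate confusion 1, a hypothetical sketch (the placeholder scheme in 
the comment is made up; P856, official website, is a real URL-type property):

    # A FILTER that inspects URL-type values also sees placeholder IRIs,
    # unless the query author knows their special scheme:
    SELECT ?item ?url WHERE {
      ?item wdt:P856 ?url .
      FILTER(STRSTARTS(STR(?url), "http"))
      # extra knowledge needed to exclude placeholders, e.g.:
      # FILTER(!STRSTARTS(STR(?url), "http://example.org/placeholder/"))
    }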
  
  Another technical issue with the approach is that you would need to use 
placeholder IRIs also with datatype properties that normally require RDF 
literals. RDF engines will tolerate this, and for SPARQL use cases it's not a 
huge difference from tolerating bnodes there. But it does put the data outside 
of OWL, which does not allow properties to be for literals and IRIs at the same 
time. Unfortunately, there is no equivalent of creating a placeholder IRI for 
things like xsd:int or xsd:string in RDF (in OWL, you can write this with a 
class expression, but it will be structurally different from other cases where 
this data is set).
  
  For the encoding of OWL negation, I am not sure if switching this (internal, 
structural) bnode to a (generated, unique) IRI would make any difference. One 
would have to check with the standard to see if this is allowed. I would 
imagine that it just works. In this case, sharing the same auxiliary IRI 
between all negative statements that refer to the same property should also 
work.
  
  So: dropping in placeholder IRIs is the "second best thing" to encode bnodes, 
but it gives up several advantages and introduces some problems (and of course 
inevitably breaks existing queries). Before doing such a change, there should 
be a clearer argument as to why this would help, and in which cases. The linked 
PDF that is posted here for motivation does not speak about updates, and indeed 
if you look at Aidan's work, he has done a lot of interesting analysis with 
bnodes that would not make any sense without them (e.g., related to comparing 
RDF datasets; related to my point 2 above). I am not a big fan of bnodes 
either, but what we try to encode here is what they have genuinely been 
invented for, and any alternative also has its issues.

TASK DETAIL
  https://phabricator.wikimedia.org/T244341

EMAIL PREFERENCES
  https://phabricator.wikimedia.org/settings/panel/emailpreferences/

To: mkroetzsch
Cc: mkroetzsch, Denny, Lucas_Werkmeister_

[Wikidata-bugs] [Maniphest] [Commented On] T216842: Specify license of Wikibase ontology

2019-02-24 Thread mkroetzsch
mkroetzsch added a comment.


  CC0 seems to be fine. Using the same license as for the rest seems to be the 
easiest choice for everybody.

TASK DETAIL
  https://phabricator.wikimedia.org/T216842

EMAIL PREFERENCES
  https://phabricator.wikimedia.org/settings/panel/emailpreferences/

To: mkroetzsch
Cc: Lydia_Pintscher, daniel, Denny, mkroetzsch, abian, Aklapper, Nandana, Lahi, 
Gq86, GoranSMilovanovic, QZanden, LawExplorer, _jensen, Wikidata-bugs, aude, 
Mbch331
___
Wikidata-bugs mailing list
Wikidata-bugs@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata-bugs


[Wikidata-bugs] [Maniphest] [Commented On] T112127: [Story] Move RDF ontology from beta to release status

2018-10-17 Thread mkroetzsch
mkroetzsch added a comment.
Well, for classes and properties, one would use owl:equivalentClass and owl:equivalentProperty rather than sameAs to encode this point. But I agree that this will hardly be considered by any consumer.

TASK DETAIL
  https://phabricator.wikimedia.org/T112127

EMAIL PREFERENCES
  https://phabricator.wikimedia.org/settings/panel/emailpreferences/

To: mkroetzsch
Cc: Addshore, Lydia_Pintscher, Tpt, Lucas_Werkmeister_WMDE, hoo, Ricordisamoa, 
mkroetzsch, gerritbot, daniel, Aklapper, Smalyshev, CucyNoiD, Nandana, 
NebulousIris, Gaboe420, Versusxo, Majesticalreaper22, Giuliamocci, Adrian1985, 
Cpaulf30, Lahi, Gq86, Baloch007, Darkminds3113, Bsandipan, Lordiis, 
GoranSMilovanovic, Adik2382, Th3d3v1ls, Ramalepe, Liugev6, QZanden, EBjune, 
merbst, LawExplorer, Avner, Lewizho99, Maathavan, Gehel, Jonas, FloNight, 
Xmlizer, jkroll, Wikidata-bugs, Jdouglas, aude, Tobias1984, Manybubbles, Mbch331
___
Wikidata-bugs mailing list
Wikidata-bugs@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata-bugs


[Wikidata-bugs] [Maniphest] [Commented On] T190875: Security review for Wikidata queries data release proposal

2018-07-02 Thread mkroetzsch
mkroetzsch added a comment.
This is good news -- thanks for the careful review! The lack of specific threat models for this data was also a challenge for us, for similar reasons, but it is a good sign that, many years after the first SPARQL data releases, no realistic danger to user anonymity is known. The footer is still a good idea for general community awareness. People who do have concerns about their anonymity could be encouraged to come forward with scenarios that we should take into account.

TASK DETAIL
  https://phabricator.wikimedia.org/T190875

EMAIL PREFERENCES
  https://phabricator.wikimedia.org/settings/panel/emailpreferences/

To: Bawolff, mkroetzsch
Cc: JBennett, Adrian_Bielefeldt, Bawolff, APalmer_WMF, Aklapper, DarTar, 
mkroetzsch, EBjune, Smalyshev, leila, Lahi, Gq86, GoranSMilovanovic, QZanden, 
LawExplorer, Avner, dpatrick, ZhouZ, Luke081515, Wikidata-bugs, aude, 
Capt_Swing, JanZerebecki, Slaporte, csteipp, Mbch331, Jay8g, Krenair, Legoktm
___
Wikidata-bugs mailing list
Wikidata-bugs@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata-bugs


[Wikidata-bugs] [Maniphest] [Commented On] T190875: Security review for Wikidata queries data release proposal

2018-04-03 Thread mkroetzsch
mkroetzsch added a comment.
Hi,

The code is here: https://github.com/Wikidata/QueryAnalysis
It was not written for general re-use, so it might be a bit messy in places. The code includes the public Wikidata example queries as test data that can be used without accessing any confidential information.

We have a list of query types ordered by frequency. However, there are millions of query types, and the most frequent are those created by bots. I can dig up a pointer to the local file where we have it, if this is what you want. If you are interested in a broader analysis of the data, you could take a look at a recent workshop paper of ours: https://iccl.inf.tu-dresden.de/web/Inproceedings3196/en
It has detailed statistics of SPARQL feature distributions and discusses some findings.

TASK DETAIL
  https://phabricator.wikimedia.org/T190875

EMAIL PREFERENCES
  https://phabricator.wikimedia.org/settings/panel/emailpreferences/

To: mkroetzsch
Cc: Bawolff, APalmer_WMF, Aklapper, DarTar, mkroetzsch, EBjune, Smalyshev, 
leila, Lahi, Gq86, GoranSMilovanovic, QZanden, LawExplorer, Avner, dpatrick, 
ZhouZ, Luke081515, Mpaulson, Wikidata-bugs, aude, Capt_Swing, jayvdb, 
JanZerebecki, Slaporte, csteipp, Mbch331, Jay8g, Krenair, Legoktm
___
Wikidata-bugs mailing list
Wikidata-bugs@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata-bugs


[Wikidata-bugs] [Maniphest] [Commented On] T183020: Investigate the possibility to release Wikidata queries

2017-12-16 Thread mkroetzsch
mkroetzsch added a comment.
I agree with Stas: regular data releases are desirable, but need further thought. The task is easier for our current case since we already know what is in the data. For a regular process, one has to be very careful to monitor potential future issues. By releasing historic data, we avoid exploits that could be theoretically possible based on detailed knowledge of the methodology.

Regarding external IDs, one could whitelist unproblematic IDs that can be preserved, and obfuscate others. I agree that authority control IDs might identify humans, since they scope over so many things that are tied to particular humans (books, authors, etc.) that one could have a hypothetical situation where the interest in a particular item would already suggest who asked the query. I don't think something similar is even theoretically plausible for other IDs (e.g., for proteins or stars). Even for book ids, the lack of user traces makes it very hard to exploit this data further (the certainty you get from a single query being asked can hardly be high, and a query that helps you to guess who asked it will often not be interesting in its own right -- most likely you would want to know what else the identified person has asked). Anyway, we could restrict the "numerical strings are ok" rule to whitelisted properties for our current release. The main reason we have it at all is things like BlazeGraph's "radius" service parameter, which has to be a number but is given as a string (I think the gas service might have similar cases).

There is a general limitation to potential exploits of SPARQL logs for breaching someone's privacy. If you don't control the software that formulated the query, then you can only connect queries to people if you already knew that only this person would ask this query. But then you learn very little by observing the query! On the other hand, if you control the software, then it would usually be easy to gather user data more directly, without needing the detour across some SPARQL logs released months later.  One exception that might be relevant in the future is the use of SPARQL from Lua built-ins or MediaWiki tags on Wikipedia pages, which could in theory expose some page traffic. This is not relevant for our historic logs, and it would be hard to fully exploit due to parser caches and crawler-based hits, but it might become a theoretical issue nonetheless. To avoid it, one could either filter all Wikipedia servers from the logs, or use a separate SPARQL service for such requests (as discussed in Berlin), whose logs would not be released.

Considering our current dataset, it seems that even the obfuscation of strings is more than one would have to do, but in the future one might indeed have to add external URLs if they become more common in queries.

TASK DETAIL
  https://phabricator.wikimedia.org/T183020

EMAIL PREFERENCES
  https://phabricator.wikimedia.org/settings/panel/emailpreferences/

To: Smalyshev, mkroetzsch
Cc: mkroetzsch, Smalyshev, DarTar, leila, Aklapper, Lahi, Gq86, 
GoranSMilovanovic, QZanden, Avner, Wikidata-bugs, aude, Capt_Swing, Mbch331
___
Wikidata-bugs mailing list
Wikidata-bugs@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata-bugs


[Wikidata-bugs] [Maniphest] [Commented On] T143819: Data request for logs from SparQL interface at query.wikidata.org

2016-09-10 Thread mkroetzsch
mkroetzsch added a comment.
@AndrewSu As I just replied to Benjamin Good on this matter, it is a bit too early for this, since we only obtained basic technical access very recently. We have not had a chance to extract any community-shareable data sets yet, and it is clear that it will require some time to get clearance for such data even after we believe it is ready.

In the long run, I would find some collaboration very interesting, but we need to lay the foundations for this first, which will likely take a few more months.

TASK DETAIL
  https://phabricator.wikimedia.org/T143819

EMAIL PREFERENCES
  https://phabricator.wikimedia.org/settings/panel/emailpreferences/

To: mkroetzsch
Cc: mkroetzsch, leila, debt, thiemowmde, Jonas, Smalyshev, AndrewSu, Aklapper, 
I9606, mschwarzer, Avner, Gehel, D3r1ck01, FloNight, Xmlizer, Izno, jkroll, 
Wikidata-bugs, Jdouglas, aude, Deskana, Manybubbles, Mbch331
___
Wikidata-bugs mailing list
Wikidata-bugs@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata-bugs


[Wikidata-bugs] [Maniphest] [Commented On] T126862: Datatype for chemical formulae on Wikidata

2016-02-25 Thread mkroetzsch
mkroetzsch added a comment.


  Re parsing strings: You are skipping the first step here. The question is not 
which format is better for advanced interpretation, but which format is 
specified at all. Whatever your proposal is, I have not seen any //syntactic// 
description of it yet. If -- in addition to having a specified syntax -- it can 
also be parsed for more complex features, that's a nice capability. But let's 
maybe start by saying what the proposed "structured data" format actually is.
  
  Re multilingual text: There is nothing controversial about the technical 
aspects you refer to. One could make all complex datatypes we have into items 
and only use primitive datatypes instead. There are many reasons against this, 
so we decided to have datatypes instead for representing complex value objects.
  
  It is a pity that you are not trying to explain what you propose but focus on 
attacking current proposals instead (how Wikidata editors store chemical 
formulas now, how the multilingual text datatype was planned). Since you are 
not providing much detail, I have now also reviewed the gerrit patch. There is 
nothing in there that would enable you to search for "the nucleon masses of 
atoms which co-occur with at least 6 carbon atoms". The input text is simply 
sent to a LaTeX formatter like mathematical markup. There is no semantic 
interpretation or structured data there at all. There is also no syntax 
specification in this part of the code, so it seems the specification is 
"whatever the current version of MediaWiki does with text in ce-tags". All the 
doubts I raised in my first post remain valid.

TASK DETAIL
  https://phabricator.wikimedia.org/T126862

EMAIL PREFERENCES
  https://phabricator.wikimedia.org/settings/panel/emailpreferences/

To: mkroetzsch
Cc: ArthurPSmith, mkroetzsch, Chemical-Markup, gerritbot, daniel, 
Lydia_Pintscher, Aklapper, Physikerwelt, Izno, Wikidata-bugs, Prod, aude, 
fredw, Pkra, scfc, Mbch331, Ltrlg



___
Wikidata-bugs mailing list
Wikidata-bugs@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata-bugs


[Wikidata-bugs] [Maniphest] [Commented On] T127929: [Story] Add a new datatype for linking to creators of artwork and more (smart URI)

2016-02-24 Thread mkroetzsch
mkroetzsch added a comment.


  +1 sounds like a workable design

TASK DETAIL
  https://phabricator.wikimedia.org/T127929

EMAIL PREFERENCES
  https://phabricator.wikimedia.org/settings/panel/emailpreferences/

To: mkroetzsch
Cc: mkroetzsch, Aklapper, daniel, Steinsplitter, Lydia_Pintscher, Izno, 
Wikidata-bugs, aude, El_Grafo, Mbch331



___
Wikidata-bugs mailing list
Wikidata-bugs@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata-bugs


[Wikidata-bugs] [Maniphest] [Updated] T126862: Datatype for chemical formulae on Wikidata

2016-02-15 Thread mkroetzsch
mkroetzsch added a comment.

Re chemical markup for semantics: this is true for Wikitext, where you cannot 
otherwise know that "C" is carbon. It does not apply to Wikidata, where you 
already get the same information from the property used. Think of P274 
(https://www.wikidata.org/wiki/Property:P274) as a way of putting text into 
"semantic markup" on Wikipedia.

In general, one should not confuse the task of adding semantic markup to a wiki 
text with what we do in Wikidata. We did the former in Semantic MediaWiki, and 
this approach never made it into Wikipedia. The communities in Wikipedia prefer 
readability of the source text over semantic markup, and the decision therefore 
was to move "semantic" information to a separate place, Wikidata.

> The advantage of a data type vs. a property is that a service can enhance the 
> input data with additional information and which can thereafter be used by 
> third party services.

You can do this in any case. Services already run on all kinds of property 
values on Wikidata. If you are talking about an enhanced UI that provides 
in-value annotation, then I don't see what exactly you refer to. Is anybody 
developing/designing/planning such a UI? I don't think this is needed for 
chemicals, where automated entity recognition would be fairly trivial to do 
automatically, so I would not spend any effort on this.

> Can you provide a like that justifies the mulitlingual text for example?

Missing word in sentence? Answers for both possible interpretations:

- Use case: needed for all translated texts (e.g., slogans/mottos of 
organisations, quotes of people, usage notes for properties, ...)
- Technical need: the type requires a new value type, since its content is 
structurally distinct from all datatypes we have. Related to this, it requires 
a new UI, new JSON structures, and a new RDF encoding. Can't be done in a gadget.


TASK DETAIL
  https://phabricator.wikimedia.org/T126862

EMAIL PREFERENCES
  https://phabricator.wikimedia.org/settings/panel/emailpreferences/

To: mkroetzsch
Cc: mkroetzsch, Chemical-Markup, gerritbot, daniel, Lydia_Pintscher, Aklapper, 
Physikerwelt, Izno, Wikidata-bugs, Prod, aude, fredw, Pkra, scfc, Mbch331, Ltrlg



___
Wikidata-bugs mailing list
Wikidata-bugs@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata-bugs


[Wikidata-bugs] [Maniphest] [Commented On] T126862: Datatype for chemical formulae on Wikidata

2016-02-15 Thread mkroetzsch
mkroetzsch added a comment.

I really wonder if the introduction of all kinds of specific markup languages 
in Wikidata is the right way to go. We could just have a Wikitext datatype, 
since it seems that Wikitext became the gold standard for all these special 
data types recently. Mark-up over semantics. By this I mean that the choice of 
format is focussed on presentation, not on data exchange. I am not an expert in 
chemical modelling (but then again, is anyone in this thread?), but it seems 
that this mark-up centric approach is fairly insufficient and naive.

I am also missing the requirements analysis. How many infoboxes are currently 
using any chemical formula with this special markup at all? If you look at a 
page like https://en.wikipedia.org/wiki/Ethanol, you see there is no such 
markup in the whole page. Neither is there in 
https://en.wikipedia.org/wiki/Photosynthesis. Who really needs this in 
Wikidata? Aren't there many other forms of notation in chemistry (and biology) 
that would be equally important?

There are some really fundamental datatypes currently missing, notably 
multi-lingual texts and geo shapes. This is the level on which datatypes are 
useful. Presentational things can be done by gadgets, as we already have for 
URL links shown with IDs. There is no need to codify this in the data model. 
Communities can solve these simple problems already without changing datatypes 
of existing properties (which is costly since existing applications and tools 
need to be updated each time).

It is also notable that https://www.wikidata.org/wiki/Property:P274 has better 
format documentation than what is proposed here (at least this thread contains 
no documentation of what the proposed format actually consists of). 
They even define a regular language for the possible content. Their direct, 
text-based formatting is preferable in many cases.


TASK DETAIL
  https://phabricator.wikimedia.org/T126862

EMAIL PREFERENCES
  https://phabricator.wikimedia.org/settings/panel/emailpreferences/

To: mkroetzsch
Cc: mkroetzsch, Chemical-Markup, gerritbot, daniel, Lydia_Pintscher, Aklapper, 
Physikerwelt, Izno, Wikidata-bugs, Prod, aude, fredw, Pkra, scfc, Mbch331, Ltrlg



___
Wikidata-bugs mailing list
Wikidata-bugs@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata-bugs


[Wikidata-bugs] [Maniphest] [Commented On] T126349: RDF export for the math data type should not export input texvc string but its MathML representation

2016-02-10 Thread mkroetzsch
mkroetzsch added a comment.

> The MathML expression includes the TeX representation, which can be used in
>  LaTeX documents and also to create new statements.

That would address the conversion back from MathML to TeX. With this in place, 
we could indeed use MathML in JSON and RDF, if we wanted (assuming that this is 
doable for you, that is, that there is a suitable TeX->MathML conversion 
available in Wikibase).


TASK DETAIL
  https://phabricator.wikimedia.org/T126349

EMAIL PREFERENCES
  https://phabricator.wikimedia.org/settings/panel/emailpreferences/

To: Physikerwelt, mkroetzsch
Cc: thiemowmde, mkroetzsch, fredw, gerritbot, daniel, Lydia_Pintscher, 
Tobias1984, Bene, Wikidata-bugs, Physikerwelt, Tpt, Liuxinyu970226, Rits, 
Ricordisamoa, Sannita, Micru, MGChecker, Aklapper, WickieTheViking, Llyrian, 
TomT0m, ArthurPSmith, Izno, Prod, aude, Pkra, scfc, Mbch331, Ltrlg



___
Wikidata-bugs mailing list
Wikidata-bugs@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata-bugs


[Wikidata-bugs] [Maniphest] [Commented On] T126349: RDF export for the math data type should not export input texvc string but its MathML representation

2016-02-10 Thread mkroetzsch
mkroetzsch added a comment.

The format should be the same as in JSON. If MathML is preferred there, then 
this is fine with me. If LaTeX is preferred, we can also use this. It seems 
that MathML would be a more reasonable data exchange format, but Moritz was 
suggesting in his emails that he does not think it to be usable enough today, 
so there might be practical reasons to avoid it.

In any case, I am strongly against using a different format in RDF and JSON. 
Otherwise any tool-chain that uses RDF and JSON (e.g., a bot that uses SPARQL 
to fetch relevant information) would have to implement this conversion, maybe 
even back and forth. Tools that do large-scale processing (e.g., Wikidata 
Toolkit generating custom RDF dumps from JSON) would need to implement this 
conversion internally, even if there were a web service (latency). It would 
be really a lot of work, without a clear benefit. Indeed, whatever format you 
pick, for whatever reason you pick it, the same reason should apply to all 
exchange formats alike.

If you use MathML but keep the TeX-like input syntax, then external users will 
also need a web service that can convert back and forth between these 
representations:

- TeX->MathML is needed, e.g., for a query UI where users enter data to search 
for in SPARQL
- MathML -> TeX is needed, e.g., to display the (raw) value of a math property 
to a user after it was returned from SPARQL

The two conversions do not need to be exact inverses, but they should hopefully 
stabilise after one round-trip. I think using MathML as the main exchange 
format would be doable, given such tool support exists. In particular, I am not 
concerned about showing different things to users than we use in our exchange 
formats. We are doing similar things with other types (dates are also written 
in a user syntax and then converted into an internal data model). The 
representation of dates is not the same in RDF and in JSON either, but the data 
structure is the same (same components that make up a date), which is very 
different from the situation of TeX vs. MathML.
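
A sketch of the two alternatives, for a hypothetical math-valued property 
ex:P999 (the property and item names are made up, standard prefixes assumed; 
the point is only that JSON and RDF should agree on one of them):

    # Alternative 1: the TeX-like input string as a plain literal
    wd:Q1 ex:P999 "\\frac{1}{2}" .

    # Alternative 2: MathML as an XML literal
    wd:Q1 ex:P999
      "<math><mfrac><mn>1</mn><mn>2</mn></mfrac></math>"^^rdf:XMLLiteral .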


TASK DETAIL
  https://phabricator.wikimedia.org/T126349

EMAIL PREFERENCES
  https://phabricator.wikimedia.org/settings/panel/emailpreferences/

To: Physikerwelt, mkroetzsch
Cc: thiemowmde, mkroetzsch, fredw, gerritbot, daniel, Lydia_Pintscher, 
Tobias1984, Bene, Wikidata-bugs, Physikerwelt, Tpt, Liuxinyu970226, Rits, 
Ricordisamoa, Sannita, Micru, MGChecker, Aklapper, WickieTheViking, Llyrian, 
TomT0m, ArthurPSmith, Izno, Prod, aude, Pkra, scfc, Mbch331, Ltrlg



___
Wikidata-bugs mailing list
Wikidata-bugs@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata-bugs


[Wikidata-bugs] [Maniphest] [Commented On] T99820: [Task] Add reference to ontology.owl to the RDF output

2015-11-23 Thread mkroetzsch
mkroetzsch added a comment.

In https://phabricator.wikimedia.org/T99820#1820662, @daniel wrote:

> Looking at the link, it seems to me we'd (trivially) meet these requirements.


Yes, that's what I meant. :-)

> But I'm not sure about the fine details, e.g. regarding the version IRI. But 
> if you are sure we are meeting the formal requirements, fine with me.

The version IRI is an optional aspect that we can include if we have a good one 
(I guess we might). We should give an ontology IRI and say that it is of type 
owl:Ontology. Other bits of information about this ontology can be added, but 
there are not many requirements there. There is also not so much said in the 
standard about how to version ontologies in general, so this is something left 
to us.

> We should then probably explicitly state that the dump is an ontology, 
> though...

Not sure on which level you mean this. In RDF or in the user documentation? I 
thought that this discussion was about stating this in RDF, and this would 
already be "explicit". I am not sure it needs much documentation elsewhere. 
Mainly, we are putting this in to have a place where we can have the license 
(a user-requested feature) and other meta information (like export date and 
imported version of the Wikibase ontology). I agree that we should mention 
these features in the documentation.
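
A minimal sketch of such a header in Turtle (the dump IRI, date, and version 
IRI below are made-up placeholders; cc:license and schema:dateModified are one 
established way to attach the license and export date):

    @prefix owl: <http://www.w3.org/2002/07/owl#> .
    @prefix cc: <http://creativecommons.org/ns#> .
    @prefix schema: <http://schema.org/> .

    <http://www.wikidata.org/dump> a owl:Ontology ;
        owl:versionIRI <http://www.wikidata.org/dump/20151123> ;
        cc:license <http://creativecommons.org/publicdomain/zero/1.0/> ;
        schema:dateModified "2015-11-23"^^<http://www.w3.org/2001/XMLSchema#date> .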


TASK DETAIL
  https://phabricator.wikimedia.org/T99820

EMAIL PREFERENCES
  https://phabricator.wikimedia.org/settings/panel/emailpreferences/

To: mkroetzsch
Cc: gerritbot, Aklapper, mkroetzsch, daniel, Smalyshev, jkroll, Wikidata-bugs, 
Jdouglas, aude, Deskana, Manybubbles, Mbch331



___
Wikidata-bugs mailing list
Wikidata-bugs@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata-bugs


[Wikidata-bugs] [Maniphest] [Commented On] T99820: [Task] Add reference to ontology.owl to the RDF output

2015-11-19 Thread mkroetzsch
mkroetzsch added a comment.

> ...and if we consider our data dump to be an ontology, then what isn't an 
> ontology?

The word "ontology" has different meanings in different contexts. Here, we only 
mean the notion of "ontology" denoted by the term owl:Ontology as used in the 
W3C OWL standard. As Stas says, this is essentially nothing more than a 
collection of OWL-compatible statements (called "OWL axioms" in the standard). 
No deeper meaning involved. See http://www.w3.org/TR/owl2-syntax/#Ontologies 
for a definition.


TASK DETAIL
  https://phabricator.wikimedia.org/T99820

EMAIL PREFERENCES
  https://phabricator.wikimedia.org/settings/panel/emailpreferences/

To: mkroetzsch
Cc: gerritbot, Aklapper, mkroetzsch, daniel, Smalyshev, jkroll, Wikidata-bugs, 
Jdouglas, aude, Deskana, Manybubbles, Mbch331



___
Wikidata-bugs mailing list
Wikidata-bugs@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata-bugs


[Wikidata-bugs] [Maniphest] [Commented On] T118860: [Tracking] Represent derived data in the data model

2015-11-18 Thread mkroetzsch
mkroetzsch added a comment.

I don't want to detail every bit here, but it should be clear that one can 
easily eliminate the dependency on $db in the formatter code. The Sites object 
I mentioned is an example. It is *not* static in our implementation. You can 
make it an interface. You can inject Sites (or a mocked version of it) for 
testing -- this is what we do. The only dependency you will retain is that the 
formatting code, or some code that calls the formatting code, must know where 
to get the URL from.

All of this will work in any "sane" architecture -- I don't see where you need 
a role manager for this. In particular, you can always pull out dependencies by 
injecting interface-based objects instead, and this has nothing to do with how 
you represent the "derived data" in memory using objects. The role-based 
approach discussed here simply seems to be a generalised version of this 
pattern, with a one-solution-fits-all interface ("Role") instead of 
task-specific interfaces ("Sites" etc.). The reason why I am not convinced by 
this here is that the tasks at hand are quite diverse and refer to different 
objects. So for any particular object (such as SiteLink) you might not have 
many possible roles available, probably just one, and in such a situation the 
complexity of the general solution might be avoidable.

Maybe it's clearer if I say it in terms of the simpler "hash map of additional 
data" approach that @adrianheine mentioned: it seems to me you are adding such 
hashmaps to all objects (using a somewhat complicated way to encode the 
hashmap), just to have one or two entries per object in the end. In such a 
case, rather than using a hashmap, it would be better to use a member variable 
that can be null if the additional data is not there.


TASK DETAIL
  https://phabricator.wikimedia.org/T118860

EMAIL PREFERENCES
  https://phabricator.wikimedia.org/settings/panel/emailpreferences/

To: mkroetzsch
Cc: mkroetzsch, adrianheine, hoo, thiemowmde, aude, Jonas, JanZerebecki, 
JeroenDeDauw, Aklapper, StudiesWorld, daniel, Wikidata-bugs, Mbch331



___
Wikidata-bugs mailing list
Wikidata-bugs@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata-bugs


[Wikidata-bugs] [Maniphest] [Commented On] T118860: [Tracking] Represent derived data in the data model

2015-11-18 Thread mkroetzsch
mkroetzsch added a comment.

@daniel As long as it works for you, this is all fine by me, but in my 
experience with PHP this could cost a lot of memory, which could be a problem 
for the long item pages that already caused problems in the past.

> But it requires the serialization and formatting code to depend on the lookup 
> services


I know that it's always nice in software architecture to reduce the number of 
dependencies (for all kinds of reasons). However, I don't see a strong reason 
why code that formats a sitelink should not depend on the facility that 
provides the URL to link to. When things have a conceptual dependency, it is 
not bad design to have a code dependency there as well. I don't think there is 
any reason to have duplicate code in either solution -- that's just a matter of 
coordinating work within the team (not saying it is easy, but architecture 
alone will not solve this ...).

> Relying on global state like that for conveniance is what MediaWiki is doing 
> all over the place, with horrible results for testability and modularity


The state would not be global (I was over-simplifying). Of course you would 
have a $db object that provides the access to the sites table. Reading from a 
table is a stateless operation, so there is no state (global or local) involved 
and you can indeed use static code if you like. But of course you could also 
use a Sites object like in WDTK. Regarding testing, I guess you already have a 
mock db connection object anyway (otherwise, how would you test db read 
operations ...). I don't see a reason why this solution should be any less 
modular or testable than what you propose. There is also no need to have any 
duplicate code.


TASK DETAIL
  https://phabricator.wikimedia.org/T118860

EMAIL PREFERENCES
  https://phabricator.wikimedia.org/settings/panel/emailpreferences/

To: mkroetzsch
Cc: mkroetzsch, adrianheine, hoo, thiemowmde, aude, Jonas, JanZerebecki, 
JeroenDeDauw, Aklapper, StudiesWorld, daniel, Wikidata-bugs, Mbch331



___
Wikidata-bugs mailing list
Wikidata-bugs@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata-bugs


[Wikidata-bugs] [Maniphest] [Commented On] T118860: [Tracking] Represent derived data in the data model

2015-11-18 Thread mkroetzsch
mkroetzsch added a subscriber: mkroetzsch.
mkroetzsch added a comment.

Structurally, this would work, but it seems like a very general solution with a 
lot of overhead. I am not sure that this pattern works well in PHP, where the cost 
of creating additional objects is huge. I also wonder whether it really is good 
to manage all those (very different!) types of "derived" information in a 
uniform way. The examples given belong to very different objects and are based 
on very different inputs (some things requiring external data sources, some 
not). I find it a bit unmotivated to architecturally unify things that are 
conceptually and technically so very different. The motivation given for 
choosing this solution starts from the premise that one has to find a single 
solution that works for all cases, including some "edge cases". Without this 
assumption, one would be free to solve the different problems individually, 
using what is best for each, instead of being forced to go for some least 
common denominator.

To pick just one example, consider the (article) URL of a SiteLink. To create 
it, one needs to have access to the content of the sites table. In WDTK, we 
encapsulate the sites table in an object (called Sites). To find out the URL of 
a SiteLink, one has to call a method of the Sites object (called 
getArticleUrl() or something), which takes a SiteLink as an input. This design 
is simple and efficient, uses no additional memory for storing new objects or 
values, and clearly locates the responsibility for computing this information 
(the Sites table). In a single-site setting (like you have in PHP), there is 
only one sites table, and you can access it statically, so the caller does not 
even need to have a reference to a Sites object as in WDTK. I therefore don't 
see any benefit in creating a role object for this simple task. It's just more 
indirection, without any convenience for the software developer or any gain in 
performance.

The situation is similar in several of the other cases mentioned. In the end, 
this is a choice for PHP development, which won't affect my work, so I'll leave 
it to you, but it seems you are making your life harder than necessary by going 
for a complicated solution instead of several simple solutions. There is only 
going to be a small number of different kinds of "derived data" ever, and there 
is hardly any place where such data is used in a way that does not need to 
understand its meaning (mainly for serialization).

For JSON exports of such data, I don't think this approach would make sense 
(but I suppose this is not intended here). There, one would simply use optional 
keys for including additional data. Likewise for RDF, where one would not want 
to introduce additional "role" objects that create another layer of indirection 
to access derived values (of course, RDF has a tradition of dealing with 
derived values, and nobody expects special structures there). I suppose that 
this proposal has nothing to do with JSON or RDF, but just to be sure we are on 
the same page.

The term "data model" is a bit over-used in our context -- maybe it would make 
sense to indicate in this bug report that it is specific to the object model 
used in the PHP implementation, and has no implications for other 
implementations or export formats.


TASK DETAIL
  https://phabricator.wikimedia.org/T118860

EMAIL PREFERENCES
  https://phabricator.wikimedia.org/settings/panel/emailpreferences/

To: mkroetzsch
Cc: mkroetzsch, adrianheine, hoo, thiemowmde, aude, Jonas, JanZerebecki, 
JeroenDeDauw, Aklapper, StudiesWorld, daniel, Wikidata-bugs, Mbch331



___
Wikidata-bugs mailing list
Wikidata-bugs@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata-bugs


[Wikidata-bugs] [Maniphest] [Commented On] T113168: [Story] Make it possible to alter only Statements with a certain property

2015-09-22 Thread mkroetzsch
mkroetzsch added a comment.

This was a suggestion we came up with when discussing during WikiCon. People 
are asking for a way to edit the data they pull into infobox templates. 
Clearly, doing this in place will be a long-term effort that needs a 
complicated solution and many more design discussions. Until this is in place, 
people can only link to Wikidata. Unfortunately, people often feel intimidated 
by what they see there, because they get a very long page that takes a long 
time to load and contains all kinds of data that they have not seen in the infobox.

The goal of this task is to offer a cheap intermediate solution that improves 
the editing experience tremendously without much implementation effort. People 
would be able to link to a limited item page that shows only statements for one 
property and that has a button to view everything. The statement editing UI is 
actually not so bad if you see only one statement (or a smaller group of 
statements) -- it should be pretty self-explanatory. This might allow more 
people to make individual changes without oversimplifying things.


TASK DETAIL
  https://phabricator.wikimedia.org/T113168

EMAIL PREFERENCES
  https://phabricator.wikimedia.org/settings/panel/emailpreferences/

To: mkroetzsch
Cc: mkroetzsch, Lydia_Pintscher, Aklapper, hoo, Wikidata-bugs, aude



___
Wikidata-bugs mailing list
Wikidata-bugs@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata-bugs


[Wikidata-bugs] [Maniphest] [Commented On] T111770: Decide how to represent quantities with units in the "truthy" RDF mapping

2015-09-11 Thread mkroetzsch
mkroetzsch added a comment.

Note that this discussion is no longer just about the wdt property values 
(called "truthy" above). Simple values are now used on several levels in the 
RDF encoding.

In general, the same argument as for coordinates applies: if we cannot do it 
right, then better not do it at all (i.e., use a bnode until we have a format). 
This might always be necessary in some cases (e.g., even if we convert units, 
there might be cases where conversion is not possible).

I agree with the advantages and disadvantages of using a custom datatype. 
Without BlazeGraph support for this, one would not be able to do range queries 
over such data, which would make it pretty useless. We might as well use 
strings in this case.

The normalisation of units by converting them to a base unit would still leave 
important problems. If there were a community-controlled way to define 
conversions, there would be the problem that the "main" unit that the RDF data 
is normalised to might change. This would change the content and meaning of 
simple values even though actual property values have not changed. Somehow 
declaring this in other triples in the RDF dump would not solve this, since we 
assume many fixed (standing) queries to be used which would not be able to 
adapt automatically to a new unit declaration. The normalisation scheme would 
also create problems for incremental update: a single change in the conversion 
definitions would require changes in millions of simple values that are part of 
the export of items that have not changed at all.

A possible solution to work around the absence of a datatype, and even the 
absence of conversion support, would be to create properties like "P1234inCm" 
and "P1234inInch". They would have plain number values that work in range 
queries. This would basically simulate the custom datatype with very similar 
effect on query answering (users would need to adjust queries to specify the 
unit that is queried for, but they would at least be sure that the data they 
query refers to this unit). The downside is that you need a different property 
for each unit, and that therefore you still have no good value to use for the 
simple value properties. However, I think this is how other datasets are doing 
it (has anybody checked DBpedia?).
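
A sketch of how such simulated unit-properties would be queried (P1234inCm is 
the made-up property name from above; standard prefixes assumed):

    # Plain number values allow range queries; the unit is fixed by the
    # property name itself:
    SELECT ?item ?length WHERE {
      ?item wdt:P1234inCm ?length .
      FILTER(?length >= 150 && ?length < 200)
    }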


TASK DETAIL
  https://phabricator.wikimedia.org/T111770

EMAIL PREFERENCES
  https://phabricator.wikimedia.org/settings/panel/emailpreferences/

To: mkroetzsch
Cc: Denny, mkroetzsch, Smalyshev, Aklapper, daniel, jkroll, Wikidata-bugs, 
Jdouglas, aude, Deskana, Manybubbles, JanZerebecki



___
Wikidata-bugs mailing list
Wikidata-bugs@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata-bugs


[Wikidata-bugs] [Maniphest] [Commented On] T111770: Decide how to represent quantities with units in the "truthy" RDF mapping

2015-09-11 Thread mkroetzsch
mkroetzsch added a comment.

If we could distinguish quantity-type properties that require a unit from those 
that do not allow units, there would be another option. Then we could use a 
compound value as the "simple" value for all properties with unit to simulate 
the missing datatype. On the query level, this would be fully equivalent to 
having a custom datatype, since one can specify the unit and the (ranged) 
number individually. (The P1234inCm properties, in contrast, support only the 
number, not queries that refer to the unit.)

Using a compound value as a simple value is fine. It's not worse than a bnode 
if you do not want to look into the inner structure, but it has additional 
features for those who want. The only problem is that you should not mix number 
literals with URIs that refer to compound values for the same property -- this 
is why one would need to fix in the property datatype whether units are 
required (always there) or forbidden (never there). Mixing this would not work.
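
A sketch of querying such a compound simple value (wikibase:quantityAmount and 
wikibase:quantityUnit are the predicates already used on the complex value 
nodes; attaching such a node directly as the simple value is the proposal here, 
and P1234 is a placeholder):

    SELECT ?item ?amount ?unit WHERE {
      ?item wdt:P1234 ?value .              # ?value is a compound node
      ?value wikibase:quantityAmount ?amount ;
             wikibase:quantityUnit ?unit .
      FILTER(?amount > 100)                 # ranged query over the number
    }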


TASK DETAIL
  https://phabricator.wikimedia.org/T111770

EMAIL PREFERENCES
  https://phabricator.wikimedia.org/settings/panel/emailpreferences/

To: mkroetzsch
Cc: Denny, mkroetzsch, Smalyshev, Aklapper, daniel, jkroll, Wikidata-bugs, 
Jdouglas, aude, Deskana, Manybubbles, JanZerebecki



___
Wikidata-bugs mailing list
Wikidata-bugs@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata-bugs


[Wikidata-bugs] [Maniphest] [Updated] T111770: Decide how to represent quantities with units in the "truthy" RDF mapping

2015-09-11 Thread mkroetzsch
mkroetzsch added a comment.

I think the discussion now lists all main ideas on how to handle this in RDF, 
but most of them are not feasible because of the very general way in which 
Wikibase implements unit support now. Given that there is no special RDF 
datatype for units and given that we have neither conversion support nor any 
kind of way to restrict whether a property must/must not have units, only one 
of the options is actually possible now: **export as string** (no range queries, but 
minimally more informative than just using a blank node).

It would be possible to export data as numbers for unit-modified properties 
such as "P1234inCm" //in addition//. This can only be an additional feature 
though, since we still need a simple value in any case. It might not be worth 
doing this, since one can always use the complex value to access the number 
anyway. Note that the properties "P1234inCm" would need to have very 
complicated, lengthy, unreadable names since units in Wikibase are represented 
not as "cm", and not even as item ids, but as 
full URIs. But you cannot use a URI within another URI directly -- you would 
need to escape certain characters. Moreover, the resulting string might not be 
allowed as a local name in abbreviations like wdt:P1234inCm, so users would 
have to type the full URI. Therefore, it seems that using the (already 
existing) complex values 
in such queries would actually be more readable.


TASK DETAIL
  https://phabricator.wikimedia.org/T111770

EMAIL PREFERENCES
  https://phabricator.wikimedia.org/settings/panel/emailpreferences/

To: mkroetzsch
Cc: Denny, mkroetzsch, Smalyshev, Aklapper, daniel, jkroll, Wikidata-bugs, 
Jdouglas, aude, Deskana, Manybubbles, JanZerebecki



___
Wikidata-bugs mailing list
Wikidata-bugs@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata-bugs


[Wikidata-bugs] [Maniphest] [Commented On] T101837: [Story] switch default rdf format to full (include statements)

2015-09-10 Thread mkroetzsch
mkroetzsch added a comment.

Including more data (within reason) will not be a problem (other than a 
performance/bandwidth problem for your servers).

However, if there are further ideas and small improvements that will take time 
to implement, it would be good to switch to "dump" as the default right now. It 
is already a big improvement over the current (statement-free) default. Further 
improvements can then be done in small steps.


TASK DETAIL
  https://phabricator.wikimedia.org/T101837

EMAIL PREFERENCES
  https://phabricator.wikimedia.org/settings/panel/emailpreferences/

To: mkroetzsch
Cc: mkroetzsch, Smalyshev, Rybesh, Jimkont, Klaraspina, Aklapper, daniel, 
Lydia_Pintscher, Wikidata-bugs, aude



___
Wikidata-bugs mailing list
Wikidata-bugs@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata-bugs


[Wikidata-bugs] [Maniphest] [Commented On] T101837: [Story] switch default rdf format to full (include statements)

2015-09-09 Thread mkroetzsch
mkroetzsch added a comment.

Data on the referenced entities does not have to be included as long as one can 
get this data by resolving these entities' URIs. However, some basic data 
(ontology header, license information) should be in each single entity export.


TASK DETAIL
  https://phabricator.wikimedia.org/T101837

EMAIL PREFERENCES
  https://phabricator.wikimedia.org/settings/panel/emailpreferences/

To: mkroetzsch
Cc: mkroetzsch, Smalyshev, Rybesh, Jimkont, Klaraspina, Aklapper, daniel, 
Lydia_Pintscher, Wikidata-bugs, aude



___
Wikidata-bugs mailing list
Wikidata-bugs@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata-bugs


[Wikidata-bugs] [Maniphest] [Commented On] T101837: [Story] switch default rdf format to full (include statements)

2015-09-09 Thread mkroetzsch
mkroetzsch added a subscriber: mkroetzsch.
mkroetzsch added a comment.

On the mailing list, Stas brought up the question of "which RDF" should be 
delivered by the linked data URIs by default. Our dumps contain data in 
multiple encodings (simple and complex), and the PHP code can create several 
variants of RDF based on parameters now.

I think the default should be to simply return all data that is in the dumps. 
This would address the BBC's use case of building a linked data crawler that 
fetches live data rather than using dumps. Such a crawler would not have any 
way to specify which part of RDF is needed, since linked data is such an 
extremely simple, parameter-free API.


TASK DETAIL
  https://phabricator.wikimedia.org/T101837

EMAIL PREFERENCES
  https://phabricator.wikimedia.org/settings/panel/emailpreferences/

To: mkroetzsch
Cc: mkroetzsch, Smalyshev, Rybesh, Jimkont, Klaraspina, Aklapper, daniel, 
Lydia_Pintscher, Wikidata-bugs, aude



___
Wikidata-bugs mailing list
Wikidata-bugs@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata-bugs


[Wikidata-bugs] [Maniphest] [Commented On] T85444: get Wikidata added to LOD cloud

2015-09-08 Thread mkroetzsch
mkroetzsch added a subscriber: mkroetzsch.
mkroetzsch added a comment.

As another useful feature, this will also allow us to have our SPARQL endpoint 
monitored at http://sparqles.ai.wu.ac.at/. Basic registration should not be too 
much work; please look into it (I don't want to create an account for Wikimedia 
;-).


TASK DETAIL
  https://phabricator.wikimedia.org/T85444

EMAIL PREFERENCES
  https://phabricator.wikimedia.org/settings/panel/emailpreferences/

To: mkroetzsch
Cc: mkroetzsch, AnjaJentzsch, Aklapper, Lucie, Lydia_Pintscher, daniel, 
Wikidata-bugs, aude



___
Wikidata-bugs mailing list
Wikidata-bugs@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata-bugs


[Wikidata-bugs] [Maniphest] [Commented On] T73349: [Bug] Fix empty map serialization behaviour

2015-08-24 Thread mkroetzsch
mkroetzsch added a comment.

It seems that the Web API module wbeditentity is also returning empty lists when 
creating new items (at least on test.wikidata.org). Is this the same bug or a 
different component?


TASK DETAIL
  https://phabricator.wikimedia.org/T73349

EMAIL PREFERENCES
  https://phabricator.wikimedia.org/settings/panel/emailpreferences/

To: mkroetzsch
Cc: Addshore, Aklapper, Ricordisamoa, Liuxinyu970226, Jimkont, JeroenDeDauw, 
Wikidata-bugs, mkroetzsch, JanZerebecki, Lydia_Pintscher, aude, Malyacko



___
Wikidata-bugs mailing list
Wikidata-bugs@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata-bugs


[Wikidata-bugs] [Maniphest] [Commented On] T105432: Drop wikibase:quantityUnit for now from RDF dump

2015-08-05 Thread mkroetzsch
mkroetzsch added a comment.

If not dropped, then it should be fixed. The value "1" (a string literal) is 
not correct. Units should be represented by URIs, not by literals.
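
For illustration (the value node is schematic; Q174728 is the item for 
centimetre):

    # Current (incorrect): the unit as a string literal
    _:valueNode wikibase:quantityUnit "1" .

    # Expected: the unit as a URI, e.g. the item for the actual unit
    _:valueNode wikibase:quantityUnit <http://www.wikidata.org/entity/Q174728> .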


TASK DETAIL
  https://phabricator.wikimedia.org/T105432

EMAIL PREFERENCES
  https://phabricator.wikimedia.org/settings/panel/emailpreferences/

To: Smalyshev, mkroetzsch
Cc: gerritbot, mkroetzsch, daniel, Aklapper, Smalyshev, jkroll, Wikidata-bugs, 
Jdouglas, aude, Manybubbles, JanZerebecki, Malyacko, P.Copp



___
Wikidata-bugs mailing list
Wikidata-bugs@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata-bugs


[Wikidata-bugs] [Maniphest] [Commented On] T102717: https switch changed wdata prefix to https:

2015-06-23 Thread mkroetzsch
mkroetzsch added a comment.

While I did say that pretty much all URIs I know use http, I do not have any 
reason to believe that https would cause problems. It is maybe not as 
extensively tested, but in most contexts it should work fine.

A bigger issue is that some people are already using our http URIs.


TASK DETAIL
  https://phabricator.wikimedia.org/T102717

EMAIL PREFERENCES
  https://phabricator.wikimedia.org/settings/panel/emailpreferences/

To: Smalyshev, mkroetzsch
Cc: JanZerebecki, mkroetzsch, thiemowmde, Lydia_Pintscher, gerritbot, daniel, 
Aklapper, Smalyshev, jkroll, Wikidata-bugs, Jdouglas, aude, Manybubbles



___
Wikidata-bugs mailing list
Wikidata-bugs@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata-bugs


[Wikidata-bugs] [Maniphest] [Commented On] T95316: Comparison of the existing Wikidata RDF dumps

2015-06-17 Thread mkroetzsch
mkroetzsch added a comment.

In https://phabricator.wikimedia.org/T95316#1373937, @Lydia_Pintscher wrote:

> Are there any differences we're missing? Are we ok with these differences?


I will do a complete review of the updated RDF mapping in the course of the next 
week. I will report back then if there is anything missing in the diff.

Also, what is the expected outcome of this bug? A table like the one posted by 
Lucie? Or something with more detail? Some rows in the current table are 
probably only understood by people who already know both dumps ;-) Is this 
meant to be only for our internal information?

Another relevant note here might be that the plan is to fully align WDTK 
mappings with the updated RDF dumps, so that many of the above differences will 
go away 
(the split into several files would remain though). We just did not do this 
while we were still discussing the updated RDF mapping.

@Lucie:

- The second row difference is just a consequence of what was already stated in 
the first row (NTriples vs Turtle). Maybe this can be merged/deleted.
- It seems that the entry in the row "labels (aliases/descriptions)" only 
refers to labels. The properties skos:prefLabel and schema:name are not used 
for descriptions or aliases in either dump, AFAIK.
- It would make sense to distinguish differences in distribution/surface syntax 
(which format, how many files, which compression algorithm, ...) from real 
differences in the RDF model (=differences that matter for SPARQL users).


TASK DETAIL
  https://phabricator.wikimedia.org/T95316

EMAIL PREFERENCES
  https://phabricator.wikimedia.org/settings/panel/emailpreferences/

To: Lucie, mkroetzsch
Cc: Lydia_Pintscher, mkroetzsch, daniel, Smalyshev, Aklapper, Lucie, 
Wikidata-bugs, aude



___
Wikidata-bugs mailing list
Wikidata-bugs@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata-bugs


[Wikidata-bugs] [Maniphest] [Commented On] T102155: find a way to surface rdf/json representation in item UI

2015-06-12 Thread mkroetzsch
mkroetzsch added a comment.

> we once planned a popup box with links to the various formats. It would be 
> shown when you click on the Q-id in the title.


A pop-up box is a good solution if there are several options, but the Qid is 
not a good place to trigger it, since it gives no hint that it can be clicked 
or what one would get by doing so. There is a standard icon used for "share" 
that looks very similar to the RDF icon and is sometimes used in this way 
(google "share icon" or search for "share" on 
http://marcoceppi.github.io/bootstrap-glyphicons/). This could be used to 
indicate the popup box. One could put it somewhere in the title region (I would 
prefer the top-right corner).


TASK DETAIL
  https://phabricator.wikimedia.org/T102155

EMAIL PREFERENCES
  https://phabricator.wikimedia.org/settings/panel/emailpreferences/

To: Snaterlicious, mkroetzsch
Cc: daniel, Nikki, mkroetzsch, Aklapper, Lydia_Pintscher, Wikidata-bugs, aude



___
Wikidata-bugs mailing list
Wikidata-bugs@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata-bugs


[Wikidata-bugs] [Maniphest] [Commented On] T101752: Introduce ExternalEntityId

2015-06-10 Thread mkroetzsch
mkroetzsch added a comment.

I think this is a useful change if you want Wikibase sites to be able to refer 
to other Wikibase sites. In WDTK, all of our EntityId objects are "external", 
of course. A lesson learned for us was that it is not enough to know the base 
URI in all cases. You sometimes need URLs for API, file path, and page path in 
addition to the plain URI. MediaWiki already has a solution for this in the 
form of the sites table. I would suggest to use this, to store pairs 
(sitekey, localEntityId), and to have the URI prefix stored in the sites table. 
It's cleaner than storing the actual URI string (which might change if an 
external site is reconfigured!) in the actual values on the page.

A strategy for introducing this without breaking anything much is to keep the 
local wiki's sitekey as the default setting in all cases. So callers who are not 
aware of the external site support can keep sending local ids and will get 
the right thing. Only when they read data they will have to mind the new 
information (but that's always the case if you enable linking to external 
entities).

But, overall, I think it would be good to make this change. Commons will want 
to link to Wikidata, for example, but also many Wikibase instances outside of 
WMF will benefit from the ability to link to Wikidata content.


TASK DETAIL
  https://phabricator.wikimedia.org/T101752

EMAIL PREFERENCES
  https://phabricator.wikimedia.org/settings/panel/emailpreferences/

To: mkroetzsch
Cc: mkroetzsch, JeroenDeDauw, Aklapper, daniel, Wikidata-bugs, aude



___
Wikidata-bugs mailing list
Wikidata-bugs@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata-bugs


[Wikidata-bugs] [Maniphest] [Commented On] T99907: Human-readable serialization of TimeValue precisions in RDF

2015-05-21 Thread mkroetzsch
mkroetzsch added a comment.

A big advantage of the numbers is that you can search for values where the 
precision is at least a certain value (e.g., dates with precision day or 
above). This would be lost when using URIs.


TASK DETAIL
  https://phabricator.wikimedia.org/T99907

EMAIL PREFERENCES
  https://phabricator.wikimedia.org/settings/panel/emailpreferences/

To: mkroetzsch
Cc: mkroetzsch, Smalyshev, daniel, thiemowmde, Aklapper, Wikidata-bugs, aude



___
Wikidata-bugs mailing list
Wikidata-bugs@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata-bugs


[Wikidata-bugs] [Maniphest] [Commented On] T94064: Date of +0000-01-01 is allowed but undefined in wikibase but is not allowed in xsd:dateTime as implemented by blazegraph

2015-05-19 Thread mkroetzsch
mkroetzsch added a comment.

@Jc3s5h You are right that date conversion only makes sense in a certain range. 
I think the software should disallow day-precision dates in prehistoric eras 
(certainly everything before -10000). There are no records that could possibly 
justify this precision, and the question of calendar conversion becomes moot. 
Do you think 4713 BCE would be enough already, or do you think there could be a 
reason to find more complex algorithms to get calendar support that extends 
further into the past?


TASK DETAIL
  https://phabricator.wikimedia.org/T94064

EMAIL PREFERENCES
  https://phabricator.wikimedia.org/settings/panel/emailpreferences/

To: Smalyshev, mkroetzsch
Cc: thiemowmde, Jc3s5h, Lydia_Pintscher, Denny, Manybubbles, daniel, 
mkroetzsch, Smalyshev, JanZerebecki, Aklapper, jkroll, Wikidata-bugs, Jdouglas, 
aude



___
Wikidata-bugs mailing list
Wikidata-bugs@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata-bugs


[Wikidata-bugs] [Maniphest] [Commented On] T97195: Create real URLs for wikidata ontology

2015-05-11 Thread mkroetzsch
mkroetzsch added a comment.

Sounds good.

I am not aware of any best practice re http vs. https, but all URIs I know 
use http as the protocol.


TASK DETAIL
  https://phabricator.wikimedia.org/T97195

EMAIL PREFERENCES
  https://phabricator.wikimedia.org/settings/panel/emailpreferences/

To: mkroetzsch
Cc: Lydia_Pintscher, Denny, mkroetzsch, Manybubbles, daniel, Aklapper, 
Smalyshev, Wikidata-bugs, aude



___
Wikidata-bugs mailing list
Wikidata-bugs@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata-bugs


[Wikidata-bugs] [Maniphest] [Commented On] T94747: Make decision on RDF ontology prefix

2015-04-03 Thread mkroetzsch
mkroetzsch added a comment.

I agree with the proposal of @Smalyshev.


TASK DETAIL
  https://phabricator.wikimedia.org/T94747

REPLY HANDLER ACTIONS
  Reply to comment or attach files, or !close, !claim, !unsubscribe or !assign 
username.

EMAIL PREFERENCES
  https://phabricator.wikimedia.org/settings/panel/emailpreferences/

To: mkroetzsch
Cc: Denny, mkroetzsch, daniel, Lydia_Pintscher, Aklapper, Smalyshev, jkroll, 
Wikidata-bugs, Jdouglas, aude, GWicke, Manybubbles, JanZerebecki



___
Wikidata-bugs mailing list
Wikidata-bugs@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata-bugs


[Wikidata-bugs] [Maniphest] [Commented On] T94747: Make decision on RDF ontology prefix

2015-04-03 Thread mkroetzsch
mkroetzsch added a comment.

@daniel Changing the base URIs does not work as a way to communicate breaking 
changes to users of RDF. You can change them, but there is no way to make users 
notice this change, and it will just break a few more queries. It's just not 
how RDF works. Most of our test queries do not even mention any wikibase 
ontology URI, yet they are likely to be broken by changes to come. If you think 
that we need a way to warn users of such changes, you need to think of another 
way of doing this.

Here is how I would do ontology and RDF model versioning:

- Ontology URIs are never changed. They are all based on the same base URI 
(prefix).
- If an update would change the meaning of an ontology element in fundamental 
ways, a new URI (new local name) is used rather than redefining an existing URI.
- The ontology needs to have a file (with OWL property and class declarations, 
some labels, etc.). The file name of this file (and thus its URL) should 
include a version number.
- The ontology URIs should redirect to the most recent version of the ontology 
file. An improved setup would use content negotiation to redirect to an HTML 
documentation or to the ontology file. One can use URL fragments for local 
names in both cases.
- The versioned ontology file is imported into the data dump using an OWL 
import statement.
- In addition to the ontology import triple, the data contains a triple that 
defines the version of the RDF model used. This triple uses an annotation 
property on the dataset ontology element (i.e., the subject of the import 
triple, not the Wikibase ontology file).

Then users can query for the RDF model version to check compatibility. Like in 
Daniel's proposal, there is no warning if they don't do this check, but in 
contrast to Daniel's proposal, users have a way to find out which version of 
the model is used, and the versioning can refer to the whole RDF model (not 
just to the Wikibase ontology, which most of our current example queries are 
not even referring to).
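To make this concrete, a rough sketch of the dump header triples and the 
compatibility check (all URIs and the annotation property name are hypothetical 
placeholders, not a finished proposal):

  PREFIX owl: <http://www.w3.org/2002/07/owl#>
  PREFIX wikibase: <http://wikiba.se/ontology#>
  INSERT DATA {
    # dataset ontology element in the dump header:
    <http://www.wikidata.org/dump> a owl:Ontology ;
      owl:imports <http://wikiba.se/ontology-1.0.owl> ;   # versioned ontology file
      wikibase:rdfModelVersion "1.0.0" .                  # hypothetical annotation property
  }

  # What a consumer could run to check compatibility:
  SELECT ?version WHERE { ?dataset wikibase:rdfModelVersion ?version . }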


TASK DETAIL
  https://phabricator.wikimedia.org/T94747

REPLY HANDLER ACTIONS
  Reply to comment or attach files, or !close, !claim, !unsubscribe or !assign 
username.

EMAIL PREFERENCES
  https://phabricator.wikimedia.org/settings/panel/emailpreferences/

To: mkroetzsch
Cc: Denny, mkroetzsch, daniel, Lydia_Pintscher, Aklapper, Smalyshev, jkroll, 
Wikidata-bugs, Jdouglas, aude, GWicke, Manybubbles, JanZerebecki



___
Wikidata-bugs mailing list
Wikidata-bugs@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata-bugs


[Wikidata-bugs] [Maniphest] [Commented On] T94064: Date of +0000-01-01 is allowed but undefined in wikibase but is not allowed in xsd:dateTime as implemented by blazegraph

2015-03-31 Thread mkroetzsch
mkroetzsch added a comment.

@Smalyshev You comment on my Item 1 by referring to BlazeGraph and Virtuoso. 
However, my Item 1 is about reading Wikidata, not about exporting to RDF. Your 
concerns about BlazeGraph compatibility are addressed by my item 2. I hope this 
clarifies this part.

As for the wrong dates, I simply say that we do not know how to fix them, since 
the errors are not sufficiently systematic. At best we can replace one kind of 
error with another kind of error. I agree with you that wrong dates are a bad 
thing, but a bad thing that is beyond our power to fix. We should focus on the 
RDF export and rely on others to do their work, so that everything will run 
smoothly in the end. At the current stage of the RDF work, the issue is of 
relatively minor relevance compared to the problems it is causing elsewhere. 
All of our current RDF exports and several applications that people are using 
suffer from the same errors in Wikidata. We need to fix it at the root, not in 
each consumer.


TASK DETAIL
  https://phabricator.wikimedia.org/T94064

REPLY HANDLER ACTIONS
  Reply to comment or attach files, or !close, !claim, !unsubscribe or !assign 
username.

EMAIL PREFERENCES
  https://phabricator.wikimedia.org/settings/panel/emailpreferences/

To: Smalyshev, mkroetzsch
Cc: Lydia_Pintscher, Denny, Manybubbles, daniel, mkroetzsch, Smalyshev, 
JanZerebecki, Aklapper, jkroll, Wikidata-bugs, Jdouglas, aude, GWicke



___
Wikidata-bugs mailing list
Wikidata-bugs@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata-bugs


[Wikidata-bugs] [Maniphest] [Commented On] T94064: Date of +0000-01-01 is allowed but undefined in wikibase but is not allowed in xsd:dateTime as implemented by blazegraph

2015-03-30 Thread mkroetzsch
mkroetzsch added a comment.

@Smalyshev

Re "halting the work on the query engine/producer code now": The WDTK RDF 
exports are generated based on the original specification. There is no 
technical issue with this and it does not block development to do just this. 
The reason we are in a blocker situation is that you want to move forward with 
an implementation that is different from the RDF model we proposed and that 
goes against our original specification, so that Denny and I are fundamentally 
disagreeing with your design. If you want to return to the original plan, 
please do it and move on. If not, then better wait until Lydia has a conclusion 
for what to do with dates, rather than implementing your point of view without 
consensus. For me, this is a benchmark of whether or not our current discussion 
setup is working.

Here is why I am optimistic that we can align with RDF 1.1 and ISO 8601:2000 
before the query engine would even go live: Basically all calendar-accurate BCE 
dates will be revised and many of them will be changed because of the ongoing 
date review. We can well fix the year zero issue at the same time. Thus we can 
as well work on the hypothesis that dates are in ISO 8601:2000 as originally 
intended. From the feedback we got from the SPARQL group, it seems that this 
would be preferable, if we can make it work technically. The date review is a 
great opportunity to get the whole internal representation back on track.

Re "deep value model": the core of the issue is that you propose to represent 
dates as the original string. Denny and I have clarified that we don't find 
this an acceptable representation for dates. As opposed to the XSD 1.0 issue, 
this proposal leads to a completely different structure in RDF and queries. 
There is no upgrade path from this implementation to the one we actually want. 
If we can agree on getting rid of this first, this would be a good start to 
move on. Changing from XSD 1.0 to XSD 1.1 is a minor issue in comparison, and 
one which can be deferred in implementation until we have BlazeGraph support 
for this.


TASK DETAIL
  https://phabricator.wikimedia.org/T94064

REPLY HANDLER ACTIONS
  Reply to comment or attach files, or !close, !claim, !unsubscribe or !assign 
username.

EMAIL PREFERENCES
  https://phabricator.wikimedia.org/settings/panel/emailpreferences/

To: Smalyshev, mkroetzsch
Cc: Lydia_Pintscher, Denny, Manybubbles, daniel, mkroetzsch, Smalyshev, 
JanZerebecki, Aklapper, jkroll, Wikidata-bugs, Jdouglas, aude, GWicke



___
Wikidata-bugs mailing list
Wikidata-bugs@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata-bugs


[Wikidata-bugs] [Maniphest] [Commented On] T94064: Date of +0000-01-01 is allowed but undefined in wikibase but is not allowed in xsd:dateTime as implemented by blazegraph

2015-03-30 Thread mkroetzsch
mkroetzsch added a comment.

> @mkroetzsch I already listed a few of the tools that implement XSD 1.0 style 
> BCE years and I read your answer as to say that you know of no tools that 
> implement XSD 1.1 style BCE years.


Then you misread my answer. Almost all tools that exist today use the 2000 
version of the ISO standard. A prominent example is ECMAScript, and thus all 
JavaScript implementations, and virtually every JavaScript timeline 
implementation. See 
http://www.ecma-international.org/ecma-262/5.1/#sec-15.9.1.15 and the examples 
in the following section to see this. RDF tools are an understandable 
exception, not because people there think one should cling to the old standard, 
but because RDF 1.1 was only standardised in 2014. It is natural that existing 
RDF implementations have more pressing upgrade work to do than to fix BCE date 
handling. I am sure they will all move to the new standard in due course.

It is worth noting that all of the ECMAScript documents use "ISO 8601" to refer 
to ISO 8601:2000, just like we did in the Wikidata data model specification. 
It seems that most people are not confused by this.

Having dug up ECMA from the Web, I can now also safely say that JSON exports 
should definitely use ISO 8601:2000 dates.

The new situation therefore is: ISO, W3C, and all JavaScript implementations 
vs. a subset of the developers in the WMDE office. I am very unhappy about the 
amount of my time I have to put into digging up for you what the rest of the 
world is thinking. It's a nice position to put yourself in, asking others to 
find specific arguments against your position and assuming you are right if 
they don't have the time or knowledge to do it. Now I am myself far from being 
an expert in JavaScript or even in all details of SPARQL 1.1, but if I don't 
know something I try to find out before taking part in discussions like this.


TASK DETAIL
  https://phabricator.wikimedia.org/T94064

REPLY HANDLER ACTIONS
  Reply to comment or attach files, or !close, !claim, !unsubscribe or !assign 
username.

EMAIL PREFERENCES
  https://phabricator.wikimedia.org/settings/panel/emailpreferences/

To: Smalyshev, mkroetzsch
Cc: Lydia_Pintscher, Denny, Manybubbles, daniel, mkroetzsch, Smalyshev, 
JanZerebecki, Aklapper, jkroll, Wikidata-bugs, Jdouglas, aude, GWicke



___
Wikidata-bugs mailing list
Wikidata-bugs@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata-bugs


[Wikidata-bugs] [Maniphest] [Commented On] T94064: Date of +0000-01-01 is allowed but undefined in wikibase but is not allowed in xsd:dateTime as implemented by blazegraph

2015-03-30 Thread mkroetzsch
mkroetzsch added a comment.

@Smalyshev We really want the same thing: move on with minimal disturbance as 
quickly as possible. As you rightly say, the data we generate right now is not 
meant for production use but for testing. We must make sure that our production 
environment will understand dates properly, but it's still some time before 
that. Here is my proposal summed up:

1. Implement RDF export now as if Wikidata would encode all dates in ISO 
8601:2000 (proleptic Gregorian with year 0 encoding -1BCE)
2. Have a switch in the RDF export code that allows us to export to RDF 1.1 or 
to RDF 1.0.

Item 1 will ensure that we can work with the dates as they will most likely be 
when we enter production. With the discovery that all of JavaScript relies on 
ISO 8601:2000, there is not much of a question that we will have this corrected 
in the end. It would be a waste of programming time to work around issues that 
others are already trying to fix as we speak. We can still implement 
reinterpretations of the internal dates when we find that the internal format 
is still broken when we want to release this (I hope this won't happen).

Item 2 is a compromise. It will ensure that we can use BlazeGraph even before 
the xsd:date bug is fixed. Has anyone reported the issue to them yet? It might 
well be that they are quicker fixing this than we are finishing this 
discussion. I am sure they would also like to conform to SPARQL 1.1.

I agree with you that some dates will not be interpreted as intended, but this 
is unavoidable (whatever rule we pick, we will always have some dates that are 
not as intended, already because of the calendar model mix-up). We have to rely 
on the ongoing review to get this fixed. This should not worry us right now, as 
it affects everyone (including actual production uses of Wikidata data from the 
API or JSON dumps).


TASK DETAIL
  https://phabricator.wikimedia.org/T94064

REPLY HANDLER ACTIONS
  Reply to comment or attach files, or !close, !claim, !unsubscribe or !assign 
username.

EMAIL PREFERENCES
  https://phabricator.wikimedia.org/settings/panel/emailpreferences/

To: Smalyshev, mkroetzsch
Cc: Lydia_Pintscher, Denny, Manybubbles, daniel, mkroetzsch, Smalyshev, 
JanZerebecki, Aklapper, jkroll, Wikidata-bugs, Jdouglas, aude, GWicke



___
Wikidata-bugs mailing list
Wikidata-bugs@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata-bugs


[Wikidata-bugs] [Maniphest] [Commented On] T94064: Date of +0000-01-01 is allowed but undefined in wikibase but is not allowed in xsd:dateTime as implemented by blazegraph

2015-03-30 Thread mkroetzsch
mkroetzsch added a comment.

@Smalyshev P.S. Your finding of 0000 years in our Virtuoso instance is quite 
peculiar given that this endpoint is based on RDF 1.0 dumps as they are 
currently generated in WDTK using this code: 
https://github.com/Wikidata/Wikidata-Toolkit/blob/a9f676bfbc2df545d386bfa72e5130fa280521a9/wdtk-rdf/src/main/java/org/wikidata/wdtk/rdf/values/TimeValueConverter.java#L112-L117


TASK DETAIL
  https://phabricator.wikimedia.org/T94064

REPLY HANDLER ACTIONS
  Reply to comment or attach files, or !close, !claim, !unsubscribe or !assign 
username.

EMAIL PREFERENCES
  https://phabricator.wikimedia.org/settings/panel/emailpreferences/

To: Smalyshev, mkroetzsch
Cc: Lydia_Pintscher, Denny, Manybubbles, daniel, mkroetzsch, Smalyshev, 
JanZerebecki, Aklapper, jkroll, Wikidata-bugs, Jdouglas, aude, GWicke



___
Wikidata-bugs mailing list
Wikidata-bugs@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata-bugs


[Wikidata-bugs] [Maniphest] [Commented On] T94064: Date of +0000-01-01 is allowed but undefined in wikibase but is not allowed in xsd:dateTime as implemented by blazegraph

2015-03-29 Thread mkroetzsch
mkroetzsch added a comment.

@Smalyshev @Lydia_Pintscher Dates without years should not be allowed by the 
time datatype. They are impossible to order, almost impossible to query, and 
they do not have any meaning whatsoever in combination with a preferred 
calendar model. All the arguments @Denny has already given elsewhere for why we 
should unify dates to Proleptic Gregorian internally apply here too. My 
suspicion is that the existing dates of this form are simply a glitch in the 
UI, where users got the impression that dates without years are recognized and 
pressing save silently set the year to zero without them seeing the change in 
meaning. If this is an important use case, then we should develop a day-of-year 
datatype that supports this, or suggest the community to use dedicated 
properties/qualifiers to encode this. However, other datatype extensions would 
be much more important than this rare case (e.g., units of measurement).

The above proposal of @Smalyshev is simply to use RDF 1.0 for export and to 
assume XSD 1.0 (non-ISO) dates to be used in Wikidata. After all the discussion 
here, I am completely baffled by this proposal. It goes against all current 
standards, and against the view of the SPARQL working group. The additional 
proposal to revert to the "dates are just strings" view for deep values ignores 
the original design and documentation, and dismisses the recommendations that 
Denny and I have been making via email. It seems we have reached an impasse 
here.

**I suggest to freeze the RDF-time encoding discussions now** until we have 
established a joint understanding what dates in Wikidata mean. As soon as we 
export dates to RDF, we are defining their meaning indirectly via the RDF 
semantics, and this bug report is not the right place for doing this.


TASK DETAIL
  https://phabricator.wikimedia.org/T94064

REPLY HANDLER ACTIONS
  Reply to comment or attach files, or !close, !claim, !unsubscribe or !assign 
username.

EMAIL PREFERENCES
  https://phabricator.wikimedia.org/settings/panel/emailpreferences/

To: Smalyshev, mkroetzsch
Cc: Lydia_Pintscher, Denny, Manybubbles, daniel, mkroetzsch, Smalyshev, 
JanZerebecki, Aklapper, jkroll, Wikidata-bugs, Jdouglas, aude, GWicke



___
Wikidata-bugs mailing list
Wikidata-bugs@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata-bugs


[Wikidata-bugs] [Maniphest] [Commented On] T94064: Date of +0000-01-01 is allowed but undefined in wikibase but is not allowed in xsd:dateTime as implemented by blazegraph

2015-03-29 Thread mkroetzsch
mkroetzsch added a comment.

> @mkroetzsch Do you know of some widely used software that implements XSD 1.1 
> handling of BCE dates?


Many applications that process dates are based on ISO rather than on XSD. 
Java's SimpleDateFormat class, for example, is based on ISO and thus interprets 
year numbers like XSD 1.1. I would assume that most time-processing 
applications, e.g., JavaScript timelines, use the same. Only XSD-based 
implementations tend to have legacy handling. For many RDF tools it is really 
hard to tell without digging into their code (usually they don't document this 
detail, and they use own implementations rather than relying on any XSD 
library). But I think it is fair to assume that ISO has a much larger market 
share, and that XSD 1.0 implementations will be updated at some point.

> I think the best way forward is to leave the lexical 0 in the year fraction 
> as undefined in Wikidata. Yes its used, but AFAIK it was always undefined.


Our original specification of Wikidata said: "The calendar model used for 
saving the data is always the proleptic Gregorian calendar according to ISO 
8601." This is the specification that Denny and I support, but there have been 
changes recently. WMDE is currently in the process of reviewing these changes 
to gauge the impact they have had in the data over time, and to come up with 
ideas how to recover to a consistent state. We have to await their report and 
suggestions before deciding what to do in RDF.


TASK DETAIL
  https://phabricator.wikimedia.org/T94064

REPLY HANDLER ACTIONS
  Reply to comment or attach files, or !close, !claim, !unsubscribe or !assign 
username.

EMAIL PREFERENCES
  https://phabricator.wikimedia.org/settings/panel/emailpreferences/

To: Smalyshev, mkroetzsch
Cc: Lydia_Pintscher, Denny, Manybubbles, daniel, mkroetzsch, Smalyshev, 
JanZerebecki, Aklapper, jkroll, Wikidata-bugs, Jdouglas, aude, GWicke



___
Wikidata-bugs mailing list
Wikidata-bugs@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata-bugs


[Wikidata-bugs] [Maniphest] [Commented On] T94064: Date of +0000-01-01 is allowed in wikibase but has no meaning as xsd:dateTime

2015-03-27 Thread mkroetzsch
mkroetzsch added a comment.

Note that all current data representation formats assume that 
0000-01-01T00:00:00 is a valid representation:

- XML Schema 1.1: http://www.w3.org/TR/xmlschema11-2/#dateTime
- RDF 1.1: http://www.w3.org/TR/rdf11-concepts/#section-Datatypes
- OWL 2: http://www.w3.org/TR/owl2-syntax/#Datatype_Maps

Moreover, XML Schema 1.1 argues that this change was made in order to agree 
with existing usage, and I would agree there: many existing documents used the 
ISO interpretation of years even before it became official. In other words, if 
we want to export data to RDF, we should definitely conform with current usage 
and standards. I imagine that it would be easy for BlazeGraph to use either 
semantics if we asked for support there.
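As a small illustration of the difference (a sketch; the outcome depends on 
which XSD version the engine implements):

  PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>
  SELECT ?d WHERE {
    # year 0000 denotes 1 BCE under XSD 1.1 / ISO 8601:2000:
    VALUES ?d { "0000-01-01T00:00:00Z"^^xsd:dateTime }
    # true under XSD 1.1; the literal is ill-formed under XSD 1.0:
    FILTER(?d < "0001-01-01T00:00:00Z"^^xsd:dateTime)
  }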

Regarding the intention of SPARQL 1.1, I now have sent an enquiry to the former 
SPARQL WG:
http://lists.w3.org/Archives/Public/public-sparql-dev/2015JanMar/0031.html
which will hopefully lead to further clarification on this matter.


TASK DETAIL
  https://phabricator.wikimedia.org/T94064

REPLY HANDLER ACTIONS
  Reply to comment or attach files, or !close, !claim, !unsubscribe or !assign 
username.

EMAIL PREFERENCES
  https://phabricator.wikimedia.org/settings/panel/emailpreferences/

To: Smalyshev, mkroetzsch
Cc: Denny, Manybubbles, daniel, mkroetzsch, Smalyshev, JanZerebecki, Aklapper, 
jkroll, Wikidata-bugs, Jdouglas, aude, GWicke



___
Wikidata-bugs mailing list
Wikidata-bugs@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata-bugs


[Wikidata-bugs] [Maniphest] [Commented On] T93451: Data format updates for RDF export

2015-03-27 Thread mkroetzsch
mkroetzsch added a comment.

> Don't see why it would be this many. It'd be like 4 additional rows per 
> property:


I was referring to the labels. For some use cases, it could be convenient if 
each of the property variants also had the rdfs:label of the property 
item. For example, RDF browsers will not be able to label a property variant 
such as :P1234q (or whatever we use) if we don't include any label for it. But 
including all labels (up to 300 languages) for all variants would lead to a lot 
of triples in the dump.


TASK DETAIL
  https://phabricator.wikimedia.org/T93451

REPLY HANDLER ACTIONS
  Reply to comment or attach files, or !close, !claim, !unsubscribe or !assign 
username.

EMAIL PREFERENCES
  https://phabricator.wikimedia.org/settings/panel/emailpreferences/

To: Smalyshev, mkroetzsch
Cc: gerritbot, Denny, mkroetzsch, daniel, Manybubbles, Aklapper, Smalyshev, 
jkroll, Wikidata-bugs, Jdouglas, aude, GWicke, JanZerebecki



___
Wikidata-bugs mailing list
Wikidata-bugs@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata-bugs


[Wikidata-bugs] [Maniphest] [Commented On] T94064: Date of +0000-01-01 is allowed but undefined in wikibase but is not allowed in xsd:dateTime as implemented by blazegraph

2015-03-27 Thread mkroetzsch
mkroetzsch added a comment.

> we don't know what year it was but it was July 4th


Ouch. Where has this been designed? Can you point to the specification of this?

@Denny, is this intended? Dates without a year are extremely hard to handle in 
queries and don't work at all like the normal dates we have. This should be a 
different datatype.


TASK DETAIL
  https://phabricator.wikimedia.org/T94064

REPLY HANDLER ACTIONS
  Reply to comment or attach files, or !close, !claim, !unsubscribe or !assign 
username.

EMAIL PREFERENCES
  https://phabricator.wikimedia.org/settings/panel/emailpreferences/

To: Smalyshev, mkroetzsch
Cc: Lydia_Pintscher, Denny, Manybubbles, daniel, mkroetzsch, Smalyshev, 
JanZerebecki, Aklapper, jkroll, Wikidata-bugs, Jdouglas, aude, GWicke



___
Wikidata-bugs mailing list
Wikidata-bugs@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata-bugs


[Wikidata-bugs] [Maniphest] [Commented On] T93451: Data format updates for RDF export

2015-03-27 Thread mkroetzsch
mkroetzsch added a comment.

All RDF tools should be able to handle resources without labels (no matter if 
used as subject, predicate, or object). But data browsers or other UIs will 
simply show the URL (or an automatically created abbreviated version of it) to 
the user. So instead of "instance of" it would read something like 
"http://www.wikidata.org/entity/P31c". Nevertheless, we can accept this for 
now. AFAIK there are no widely used generic RDF data browsers anyway, and it's 
much more likely that people will first create Wikidata-aware interfaces.


TASK DETAIL
  https://phabricator.wikimedia.org/T93451

REPLY HANDLER ACTIONS
  Reply to comment or attach files, or !close, !claim, !unsubscribe or !assign 
username.

EMAIL PREFERENCES
  https://phabricator.wikimedia.org/settings/panel/emailpreferences/

To: Smalyshev, mkroetzsch
Cc: gerritbot, Denny, mkroetzsch, daniel, Manybubbles, Aklapper, Smalyshev, 
jkroll, Wikidata-bugs, Jdouglas, aude, GWicke, JanZerebecki



___
Wikidata-bugs mailing list
Wikidata-bugs@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata-bugs


[Wikidata-bugs] [Maniphest] [Changed Subscribers] T94019: Generate RDF from JSON

2015-03-27 Thread mkroetzsch
mkroetzsch added a subscriber: mkroetzsch.

TASK DETAIL
  https://phabricator.wikimedia.org/T94019

REPLY HANDLER ACTIONS
  Reply to comment or attach files, or !close, !claim, !unsubscribe or !assign 
username.

EMAIL PREFERENCES
  https://phabricator.wikimedia.org/settings/panel/emailpreferences/

To: mkroetzsch
Cc: mkroetzsch, Aklapper, daniel, Wikidata-bugs, aude



___
Wikidata-bugs mailing list
Wikidata-bugs@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata-bugs


[Wikidata-bugs] [Maniphest] [Changed Subscribers] T94064: Date of +0000-01-01 is allowed but undefined in wikibase but is not allowed in xsd:dateTime as implemented by blazegraph

2015-03-27 Thread mkroetzsch
mkroetzsch added a subscriber: Lydia_Pintscher.
mkroetzsch added a comment.

Yes, the discussion on SPARQL has converged surprisingly quickly to the view 
that XSD 1.1 is both normative and intended in SPARQL 1.1 (by the way, I can 
only recommend this list if you have SPARQL questions, or the analogous list 
for RDF -- people are usually very quick and helpful in answering queries, esp. 
if you say why you need it ;-).

Your findings about Virtuoso and BlazeGraph show that it might be hard to find 
a conforming processor right now. However, I would still hope that it can be 
done, since the transformation of the values is quite easy after all. In fact, 
I think that neither of these projects is very likely to have customers who 
have cared about BCE years so far ;-). Technically, it should not be too hard: 
even if you use an XSD 1.0 library in most places, you could surely find an XSD 
1.1 library to use with date functions, or you could transform dates internally 
before passing them to the XSD 1.0 operator functions. If such transformation 
is not efficient enough, one could also convert all input to XSD 1.0 dates on 
loading (or before) and then merely translate dates in queries and results 
accordingly (should not be a big performance issue since these datasets are 
rather small). However, I think it should be easy to take the few affected 
XPath functions from another library or to implement them
directly. Julian day calculation is a very simple algorithm 
(https://en.wikipedia.org/wiki/Julian_day#Calculation), and this is all you 
need for date comparisons, calendar conversion, and time intervals. One way or 
the other, implementations will most likely have to do some custom extensions 
of internal date handling, unless the standard XSD libraries can cope with the 
age of the universe.

Data publication is another issue. It's clear that we need to use XSD 1.1 when 
we publish RDF online, since this is what the current RDF specification 
requires. Applications that find our data on the web cannot know what we 
discussed and there is no way of telling them. They can only assume that we are 
using the current standard.

For JSON and Wikidata internally, the main reference is probably ISO 8601 not 
XSD (I don't actually know what JSON says here, but usually it says nothing 
about things other than primitive JavaScript types). I'd find it hard to 
explain why we would choose to deviate from that. Year 0000 has been legal in 
ISO for many years before Wikidata was even started. @Lydia_Pintscher recently 
triggered an action to review the dates stored internally in Wikidata, so this 
issue should probably be part of it (esp. since all BCE dates that are exact to 
the year should use the Julian calendar). As you said, we already have year 0000 in 
values, and it is likely that we have other negative year numbers entered by 
the same bots (assuming ISO semantics). So we need to change many dates one or 
the other way in any case. I think most technical consumers will appreciate if 
we stick to ISO since it is easier to do calculations with.

Moreover, now that every standard has agreed to use the same format, time is 
working for those who go with it.


TASK DETAIL
  https://phabricator.wikimedia.org/T94064

REPLY HANDLER ACTIONS
  Reply to comment or attach files, or !close, !claim, !unsubscribe or !assign 
username.

EMAIL PREFERENCES
  https://phabricator.wikimedia.org/settings/panel/emailpreferences/

To: Smalyshev, mkroetzsch
Cc: Lydia_Pintscher, Denny, Manybubbles, daniel, mkroetzsch, Smalyshev, 
JanZerebecki, Aklapper, jkroll, Wikidata-bugs, Jdouglas, aude, GWicke



___
Wikidata-bugs mailing list
Wikidata-bugs@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata-bugs


[Wikidata-bugs] [Maniphest] [Commented On] T72385: Wikidata JSON dump: file directory location should follow standard patterns

2015-03-26 Thread mkroetzsch
mkroetzsch added a comment.

@Smalyshev Yes, this is what I was saying. @hoo was proposing to create a 
special directory for truthy based on offline discussion in the office.


TASK DETAIL
  https://phabricator.wikimedia.org/T72385

REPLY HANDLER ACTIONS
  Reply to comment or attach files, or !close, !claim, !unsubscribe or !assign 
username.

EMAIL PREFERENCES
  https://phabricator.wikimedia.org/settings/panel/emailpreferences/

To: ArielGlenn, mkroetzsch
Cc: Manybubbles, JanZerebecki, Smalyshev, aude, daniel, Wikidata-bugs, 
Nemo_bis, mkroetzsch, Svick, ArielGlenn, Lydia_Pintscher, hoo, jeremyb



___
Wikidata-bugs mailing list
Wikidata-bugs@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata-bugs


[Wikidata-bugs] [Maniphest] [Commented On] T93451: Data format updates for RDF export

2015-03-26 Thread mkroetzsch
mkroetzsch added a comment.

@Smalyshev Yes, using lower-case local names for properties is a widely used 
convention and we should definitely follow that for our ontology. However, I 
would rather not change case of our P1234 property ids when they occur in 
property URIs, since Wikibase ids might be case sensitive in the future 
(Commons files will have their filename as id, and even if standard MW is 
first-letter case-insensitive in articles, it can be configured to be 
otherwise). It would also create some confusion if one had to write 
"p1234" in some interfaces and "P1234" in others (maybe even both would occur 
in RDF since we have a P1234 entity and several related properties).


TASK DETAIL
  https://phabricator.wikimedia.org/T93451

REPLY HANDLER ACTIONS
  Reply to comment or attach files, or !close, !claim, !unsubscribe or !assign 
username.

EMAIL PREFERENCES
  https://phabricator.wikimedia.org/settings/panel/emailpreferences/

To: Smalyshev, mkroetzsch
Cc: Denny, mkroetzsch, daniel, Manybubbles, Aklapper, Smalyshev, jkroll, 
Wikidata-bugs, Jdouglas, aude, GWicke, JanZerebecki



___
Wikidata-bugs mailing list
Wikidata-bugs@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata-bugs


[Wikidata-bugs] [Maniphest] [Commented On] T93207: Better namespace URI for the wikibase ontology

2015-03-25 Thread mkroetzsch
mkroetzsch added a comment.

@daniel Changing URIs of the ontology vocabulary is silently producing wrong 
results as well. I understand the problems you are trying to solve. I am just 
saying that changing the URIs does not actually solve them.

@adrianheine You are right. My example was less suitable than I had thought. 
The reason is that for an API, you can report an error if somebody uses the 
wrong actions. This is exactly what you cannot do in RDF. If someone uses the 
wrong URIs in a SPARQL query, he will just get the wrong results (or maybe, 
coincidentally, the correct results) without any warning.


TASK DETAIL
  https://phabricator.wikimedia.org/T93207

REPLY HANDLER ACTIONS
  Reply to comment or attach files, or !close, !claim, !unsubscribe or !assign 
username.

EMAIL PREFERENCES
  https://phabricator.wikimedia.org/settings/panel/emailpreferences/

To: mkroetzsch
Cc: adrianheine, Manybubbles, Smalyshev, mkroetzsch, Denny, Lydia_Pintscher, 
Aklapper, daniel, Wikidata-bugs, aude, Krenair, Dzahn



___
Wikidata-bugs mailing list
Wikidata-bugs@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata-bugs


[Wikidata-bugs] [Maniphest] [Commented On] T72385: Wikidata JSON dump: file directory location should follow standard patterns

2015-03-25 Thread mkroetzsch
mkroetzsch added a comment.

@hoo Thanks for the heads up! I do have comments.

(1) I would remove the "full" and "truthy" distinction from the path and rather 
make this part of the dump type (for example "statements" and 
"truthy-statements"). The reason is that we have many full dumps (terms, 
sitelinks, statements, properties), which can be readily exported in RDF and 
JSON, but we have only one truthy dump and it really is mainly for RDF (at 
least we did not discuss a JSON format for single-triple statements). 
Therefore, it does not seem worthwhile to make a top-level distinction in the 
directory structure for this. For consumers, it is easier if a dump file is 
addressed with four components (projectname, dumptype, date, file format). The 
truthy/full distinction would be another parameter that does not seem to add 
any functionality.

(2) My comment right at the beginning of this bug report was to have 
timestamped subdirectories, just like we have for the main dumps. Maybe you 
have reasons for not having these, but could you explain them here?


TASK DETAIL
  https://phabricator.wikimedia.org/T72385

REPLY HANDLER ACTIONS
  Reply to comment or attach files, or !close, !claim, !unsubscribe or !assign 
username.

EMAIL PREFERENCES
  https://phabricator.wikimedia.org/settings/panel/emailpreferences/

To: ArielGlenn, mkroetzsch
Cc: Manybubbles, JanZerebecki, Smalyshev, aude, daniel, Wikidata-bugs, 
Nemo_bis, mkroetzsch, Svick, ArielGlenn, Lydia_Pintscher, hoo, jeremyb



___
Wikidata-bugs mailing list
Wikidata-bugs@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata-bugs


[Wikidata-bugs] [Maniphest] [Commented On] T72385: Wikidata JSON dump: file directory location should follow standard patterns

2015-03-25 Thread mkroetzsch
mkroetzsch added a comment.

@Lydia_Pintscher I understand this problem, but if you put different dumps for 
different times all in one directory, won't this become quite big over time and 
hard to use? Maybe one should group dumps by how often they are created (and 
have date-directories only below that). For some cases, there does not seem to 
be any problem. For example, creating all RDF dumps from the JSON dump takes 
about 3-6h in total (on labs). So this is easily doable on the same day as the 
JSON dump generation. I am sure that we could also generate alternative JSON 
dumps in comparable time (maybe add an hour to the RDF if you do it in one 
batch). The slow part seems to be the DB export that leads to the first JSON 
dump -- once you have this the other formats should be relatively quick to do.


TASK DETAIL
  https://phabricator.wikimedia.org/T72385

REPLY HANDLER ACTIONS
  Reply to comment or attach files, or !close, !claim, !unsubscribe or !assign 
username.

EMAIL PREFERENCES
  https://phabricator.wikimedia.org/settings/panel/emailpreferences/

To: ArielGlenn, mkroetzsch
Cc: Manybubbles, JanZerebecki, Smalyshev, aude, daniel, Wikidata-bugs, 
Nemo_bis, mkroetzsch, Svick, ArielGlenn, Lydia_Pintscher, hoo, jeremyb



___
Wikidata-bugs mailing list
Wikidata-bugs@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata-bugs


[Wikidata-bugs] [Maniphest] [Commented On] T72385: Wikidata JSON dump: file directory location should follow standard patterns

2015-03-25 Thread mkroetzsch
mkroetzsch added a comment.

> All of these dumps will be generated by exporting from the DB.


Why would one want to do this? The JSON dump contains all information we need 
for building the other dumps, and it seems that the generation from the JSON 
dump is much faster, avoids any load on the DB, and would guarantee consistent 
state of all files (same revision status). Moreover, we already have code for 
doing it now (which will be updated to agree with any changes in RDF export 
structures we want).


TASK DETAIL
  https://phabricator.wikimedia.org/T72385

REPLY HANDLER ACTIONS
  Reply to comment or attach files, or !close, !claim, !unsubscribe or !assign 
username.

EMAIL PREFERENCES
  https://phabricator.wikimedia.org/settings/panel/emailpreferences/

To: ArielGlenn, mkroetzsch
Cc: Manybubbles, JanZerebecki, Smalyshev, aude, daniel, Wikidata-bugs, 
Nemo_bis, mkroetzsch, Svick, ArielGlenn, Lydia_Pintscher, hoo, jeremyb



___
Wikidata-bugs mailing list
Wikidata-bugs@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata-bugs


[Wikidata-bugs] [Maniphest] [Commented On] T72385: Wikidata JSON dump: file directory location should follow standard patterns

2015-03-25 Thread mkroetzsch
mkroetzsch added a comment.

@Smalyshev

Re "what does consistent mean": to be based on the same input data. All dumps 
are based on Wikidata content. If they are based on the same content, they are 
consistent, otherwise they are not.

Re discussing RDF dump partitioning in 
https://phabricator.wikimedia.org/T93488: Agreed. We are not discussing which 
RDF dumps to have here, only whether they are likely to be well organised by 
distinguishing "full" and "truthy" as a primary categorisation that sits above 
format (RDF vs. JSON and other matters).


TASK DETAIL
  https://phabricator.wikimedia.org/T72385

REPLY HANDLER ACTIONS
  Reply to comment or attach files, or !close, !claim, !unsubscribe or !assign 
username.

EMAIL PREFERENCES
  https://phabricator.wikimedia.org/settings/panel/emailpreferences/

To: ArielGlenn, mkroetzsch
Cc: Manybubbles, JanZerebecki, Smalyshev, aude, daniel, Wikidata-bugs, 
Nemo_bis, mkroetzsch, Svick, ArielGlenn, Lydia_Pintscher, hoo, jeremyb



___
Wikidata-bugs mailing list
Wikidata-bugs@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata-bugs


[Wikidata-bugs] [Maniphest] [Commented On] T72385: Wikidata JSON dump: file directory location should follow standard patterns

2015-03-25 Thread mkroetzsch
mkroetzsch added a comment.

@JanZerebecki:

Re using the same code: That's not essential here. All we want is that the 
dumps are the same. It's also not necessary to develop the code twice, since it 
is already there twice anyway. It's just the question whether we want to use a slow 
method that keeps people waiting for the dumps for days (as they already do now 
with many other dumps), or a fast one that you can run anywhere (even without 
DB access; on a laptop if you like). The fact that we must have the code in PHP 
too makes it possible to go back to the slow system if it should ever be 
needed, so there is no lock-in. Dump file generation is also not 
operation-critical for Wikidata (the internal SPARQL query will likely be based 
on a live feed, not on dumps). What's not to like?

Re consistency: I meant that the dumps would contain the same information, not 
that they reflect a consistent state of the site. If it is important for you to 
have a defined state, then the dump-based file generation is also your friend: 
one can do the same with the full history dump, where one could exactly specify 
the revision to dump. Probably still as fast as the DB method, but guaranteed 
to provide a globally consistent snapshot (yes, I know, modulo deletions). Not 
sure if this type of consistency is relevant though. Having a guarantee that 
the dump files in various formats are based on the same data, however, would be 
quite useful (e.g., in SPARQL, where you often mix data from truthy and full 
dumps in one query).
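For example, a query of the following shape (prefixes and property URIs are 
illustrative only, since the final conventions are still under discussion) 
joins a truthy triple with statement-level data and would silently misbehave if 
the two dumps reflected different revisions:

  PREFIX ent: <http://www.wikidata.org/entity/>
  PREFIX truthy: <http://www.wikidata.org/prop/direct/>
  PREFIX p: <http://www.wikidata.org/prop/>
  SELECT ?person ?statement WHERE {
    ?person truthy:P31 ent:Q5 .      # simple triple from the truthy dump
    ?person p:P31 ?statement .       # statement node from the full dump
  }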

Recall that we are discussing this here since Lydia said that the slowness of 
the DB-based exports would be a reason for why we cannot have an (otherwise 
convenient) date-based directory structure. I agree with Lydia that this would 
be a blocker, but in this case it's really one that we can easily remove. The 
code I am talking about is at https://github.com/Wikidata/Wikidata-Toolkit, 
well tested, extensively documented, and partially WMF-funded. Why not make 
this into a community engagement success story? :-)


TASK DETAIL
  https://phabricator.wikimedia.org/T72385

REPLY HANDLER ACTIONS
  Reply to comment or attach files, or !close, !claim, !unsubscribe or !assign 
username.

EMAIL PREFERENCES
  https://phabricator.wikimedia.org/settings/panel/emailpreferences/

To: ArielGlenn, mkroetzsch
Cc: Manybubbles, JanZerebecki, Smalyshev, aude, daniel, Wikidata-bugs, 
Nemo_bis, mkroetzsch, Svick, ArielGlenn, Lydia_Pintscher, hoo, jeremyb



___
Wikidata-bugs mailing list
Wikidata-bugs@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata-bugs


[Wikidata-bugs] [Maniphest] [Commented On] T93451: Data format updates for RDF export

2015-03-22 Thread mkroetzsch
mkroetzsch added a comment.

> is there any existing ontology we may want to use to create such links 
> between entity:P1234 and v:P1234 or q:P1234? Or should we just invent our own?


We would have to make new URIs here. This depends on which/how many variants of 
RDF property URIs we use: we should use a different link for each kind of RDF 
property variant. For example, we could have :P1234 wikibase:qualifierProperty 
q:P1234.

> Also, if we never use entity:P1234 in statements, to look it up (e.g. for 
> type, etc. if we add type to property export, or for properties) one would 
> have to do additional hop with something like: ?entity wikibase:represents 
> v:P1234 instead of just using it directly. Not sure if it's a big issue.


I would say that it is not a big issue since most of the RDF properties we use 
will always have that problem. If we use the property entity as an RDF 
property, it would only replace one of the uses of property variants. In all 
other places, you would still need the additional hop to get the label.
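Sketched as a query (reusing the hypothetical wikibase:qualifierProperty link 
from above):

  PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
  PREFIX wikibase: <http://wikiba.se/ontology#>
  SELECT ?label WHERE {
    ?statement ?qualProp ?qualValue .               # some qualifier triple
    ?entity wikibase:qualifierProperty ?qualProp ;  # the additional hop
            rdfs:label ?label .
    FILTER(LANG(?label) = "en")
  }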

It would be good if the linked data export for all RDF property variants could 
include the entity labels. I would not add them to the dumps though (1000 x 300 
x 5 is a lot of additional triples).

For convenient SPARQL-based access, we should provide query interfaces that 
retrieve labels for IRIs that occur in query results so that users don't have 
to SPARQL for the label. Such post-query labelling is done in WDQ. It will be 
easy to extend this to property variants without even looking at the RDF graph. 
This will make the SPARQL queries much lighter in general.


TASK DETAIL
  https://phabricator.wikimedia.org/T93451

REPLY HANDLER ACTIONS
  Reply to comment or attach files, or !close, !claim, !unsubscribe or !assign 
username.

EMAIL PREFERENCES
  https://phabricator.wikimedia.org/settings/panel/emailpreferences/

To: Smalyshev, mkroetzsch
Cc: Denny, mkroetzsch, daniel, Manybubbles, Aklapper, Smalyshev, jkroll, 
Wikidata-bugs, Jdouglas, aude, GWicke, JanZerebecki



___
Wikidata-bugs mailing list
Wikidata-bugs@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata-bugs


[Wikidata-bugs] [Maniphest] [Commented On] T93451: Data format updates for RDF export

2015-03-21 Thread mkroetzsch
mkroetzsch added a comment.

> Also, it was suggested that we may want to change the fact that we use 
> entity:P1234 in link Entity-Statement and give it a distinct URL. However, 
> then it is not clear what would be the link between entity:P1234 and the rest 
> of the data.


This is a good point. It affects all property variants (qualifiers, values, 
...) that we generate. We should have explicit links from the entity P1234 to 
every RDF property that we use to model this Wikidata property in different 
contexts. WDTK already has an open issue on this: 
https://github.com/Wikidata/Wikidata-Toolkit/issues/84


TASK DETAIL
  https://phabricator.wikimedia.org/T93451

REPLY HANDLER ACTIONS
  Reply to comment or attach files, or !close, !claim, !unsubscribe or !assign 
username.

EMAIL PREFERENCES
  https://phabricator.wikimedia.org/settings/panel/emailpreferences/

To: Smalyshev, mkroetzsch
Cc: Denny, mkroetzsch, daniel, Manybubbles, Aklapper, Smalyshev, jkroll, 
Wikidata-bugs, Jdouglas, aude, GWicke, JanZerebecki



___
Wikidata-bugs mailing list
Wikidata-bugs@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata-bugs


[Wikidata-bugs] [Maniphest] [Commented On] T93207: Better namespace URI for the wikibase ontology

2015-03-20 Thread mkroetzsch
mkroetzsch added a comment.

@daniel It makes sense to use wikibase rather than wikidata, but I don't think 
it matters very much at all. We should just define it rather sooner than later.

As for the versioning, I don't see how to convince you. Four more attempts:

- Try to apply your proposal to the MediaWiki API: Every API action should 
contain the MW version number. I think from your experience with MW it should 
be easier for you to see why this would be a bad idea. RDF is the same, but it 
affects a lot more APIs.

- Another argument is that, of course, changing URIs does not give users any 
warnings about the change either. Their queries will just return different 
results, but there won't be any error message or the like. This behaviour is 
exactly the same as for other kinds of breaking changes. You just add a new 
kind of breaking change that is sure to break everybody's usage (not just the 
users who use BCE dates, to stay in your example), but the breakage is still 
subtle and hard to notice in a running system. URI versioning does not 
implement any kind of "fail fast" principle that you would want for announcing 
breaking changes. There is no standard way of announcing breaking changes via 
an RDF or SPARQL API; you need to work on your community communication to get 
this done (e.g., one could send notes about breaking changes well in advance to 
wikidata-tech and gather feedback).

- You gave an example where a well-informed group of experts decided against 
your recommendation. I know of many other examples where URIs were initially 
created to contain a version number that was then never changed even after 
major updates (FOAF for example), again because experts in the field deemed 
that this was a sensible way to go. I would also claim some expertise in this 
area. Your view is natural for somebody who has not worked much with 
ontologies. Many smart people have thought similar ten years ago (you can see a 
lot of 0.1 and 1.0 version numbers in vocabulary URIs. Even the SMW 
ontology includes a version number in the URIs; of course it also never 
changed). Experience shows that the case where you would ever want to do such a 
drastic thing is the case where you use completely new URIs anyway (and 
probably give the project another name, too).

- You could always decide to do the versioning later on if you must. There is 
no problem going from URIs like http://wikiba.se/ontology#... to URIs like 
http://wikiba.se/ontology-2.0.0#... There is no standard way of encoding 
version information in URIs, and you would not write a SPARQL query to extract 
it from there. However, in most cases, if you really change the meaning of one 
URI, you would rather use a new URI for this one thing only and keep all the 
other URIs as they are.


TASK DETAIL
  https://phabricator.wikimedia.org/T93207

REPLY HANDLER ACTIONS
  Reply to comment or attach files, or !close, !claim, !unsubscribe or !assign 
username.

EMAIL PREFERENCES
  https://phabricator.wikimedia.org/settings/panel/emailpreferences/

To: mkroetzsch
Cc: adrianheine, Manybubbles, Smalyshev, mkroetzsch, Denny, Lydia_Pintscher, 
Aklapper, daniel, Wikidata-bugs, aude, Krenair, Dzahn



___
Wikidata-bugs mailing list
Wikidata-bugs@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata-bugs


[Wikidata-bugs] [Maniphest] [Commented On] T93207: Better namespace URI for the wikibase ontology

2015-03-19 Thread mkroetzsch
mkroetzsch added a comment.

@daniel: Have you wondered why XML Schema decided against changing their URIs? 
It is by far the most disruptive thing that you could possibly do. Ontologies 
don't work like software libraries where you download a new version and build 
your tool against it, changing identifiers as required. Changing all URIs of an 
ontology (even if only on a major version increment) will break third-party 
applications and established usage patterns in a single step. There is no 
mechanism in place to do this smoothly. You never want to do this. Even 
changing a single URI can be very costly, and is probably not what you want if 
the breaking change affects only a diminishing part of your users (How many BCE 
dates are there in XSD files? How many of those were already assuming the ISO 
reading anyway?).

Having a version number in URLs does not solve the problem of versioning: it 
just creates an obligation for you to change *all* URIs whenever you update the 
ontology. If you really ever want to make a breaking change to one URI, then 
you just create a new URI for this purpose and keep the old one defined as it 
was. Then you stop using the old one in the data. Clean, easy, and usually 
without any disruption to 99% of the users (depending on which URI you changed 
of course ;-). This introduction of new names is completely independent of your 
ontology document version.
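In triples, that pattern could look like this (URIs invented for illustration; 
owl:deprecated is the standard OWL annotation for it):

  PREFIX owl: <http://www.w3.org/2002/07/owl#>
  INSERT DATA {
    # keep the old URI defined, but mark it as superseded:
    <http://wikiba.se/ontology#oldName> owl:deprecated true .
    # introduce the replacement alongside it:
    <http://wikiba.se/ontology#newName> a owl:ObjectProperty .
  }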

Besides all this, breaking changes are extremely rare. The example you gave 
(changing the meaning of an XML Schema datatype) does not apply to us, since we 
cannot do such things in our ontology. In essence, our ontology is just a 
declaration of technical vocabulary. Most changes you could make do not cause 
any incompatibility -- the Semantic Web is built on an open-world assumption so 
that additions of information to the ontology never break anything. The 
only potentially breaking change to an ontology is when you delete some 
information, but even there it is hard to see how it should break a specific 
application.

Summing up, the only breaking change to an ontology is to change an important 
URI that many people rely on. The current proposal is to introduce a mechanism 
for doing exactly this.


TASK DETAIL
  https://phabricator.wikimedia.org/T93207

REPLY HANDLER ACTIONS
  Reply to comment or attach files, or !close, !claim, !unsubscribe or !assign 
username.

EMAIL PREFERENCES
  https://phabricator.wikimedia.org/settings/panel/emailpreferences/

To: mkroetzsch
Cc: adrianheine, Manybubbles, Smalyshev, mkroetzsch, Denny, Lydia_Pintscher, 
Aklapper, daniel, Wikidata-bugs, aude, Krenair, Dzahn



___
Wikidata-bugs mailing list
Wikidata-bugs@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata-bugs


[Wikidata-bugs] [Maniphest] [Commented On] T93207: Better namespace URI for the wikibase ontology

2015-03-19 Thread mkroetzsch
mkroetzsch added a comment.

Hi Daniel.

Good point, I agree that this should change. A URL based on wikiba.se seems to 
be the best. I don't think we need to worry about domain ownership here (why 
would anybody sell this domain? Is it not WMF-owned?)

I think it is not a good idea to change ontology URLs based on the version of
the ontology. External tools will depend on the exact URI, and you will thus
soon be locked into the version string for all time. You can see this in
FOAF, which has always been at http://xmlns.com/foaf/0.1/ although it is long
past v0.1 these days. Basically, putting the version into your URIs is like
putting a version number into every class name of your program. I would
suggest maintaining versioned ontology files (somewhere) and defining the
version in the ontology header, while keeping the identifiers as they were.
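
For concreteness, a sketch of "version in the ontology header, identifiers
unchanged" (Python/rdflib; the version string and archive IRI are made up):

  from rdflib import Graph, Literal, URIRef
  from rdflib.namespace import OWL, RDF

  ONT = URIRef("http://wikiba.se/ontology")
  g = Graph()

  # The ontology document carries the version; the term URIs stay stable.
  g.add((ONT, RDF.type, OWL.Ontology))
  g.add((ONT, OWL.versionInfo, Literal("1.0")))  # bumped on each release
  g.add((ONT, OWL.versionIRI,
         URIRef("http://wikiba.se/ontology/1.0")))  # archived copy

  print(g.serialize(format="turtle"))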


TASK DETAIL
  https://phabricator.wikimedia.org/T93207

REPLY HANDLER ACTIONS
  Reply to comment or attach files, or !close, !claim, !unsubscribe or !assign 
username.

EMAIL PREFERENCES
  https://phabricator.wikimedia.org/settings/panel/emailpreferences/

To: mkroetzsch
Cc: adrianheine, Manybubbles, Smalyshev, mkroetzsch, Denny, Lydia_Pintscher, 
Aklapper, daniel, Wikidata-bugs, aude, Krenair, Dzahn



___
Wikidata-bugs mailing list
Wikidata-bugs@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata-bugs


[Wikidata-bugs] [Maniphest] [Created] T91117: Empty JSON maps serialized as empty lists in XML dumps

2015-02-27 Thread mkroetzsch
mkroetzsch created this task.
mkroetzsch added subscribers: mkroetzsch, Lydia_Pintscher.
mkroetzsch added a project: Wikibase-DataModel-Serialization.
Restricted Application added a subscriber: Aklapper.

TASK DESCRIPTION
  The XML dumps of Wikidata contain many JSON serialization errors where empty 
maps {} are wrongly serialized as empty lists []. This affects many different
fields, such as "claims", "aliases", and "descriptions", and it seems likely
that all empty maps are serialized wrongly.
  
  This is a major inconvenience to all users who for some reason or other 
prefer XML over the JSON dumps (which do not have the problem). In particular, 
the XML dumps are needed to get access to the full history of the pages. Using 
the same JSON as used in the JSON dumps would fix the problem.
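
To see why this matters to consumers, compare the two forms (Python sketch;
the root cause is presumably PHP's empty array, which json_encode renders as
[] rather than {}):

  import json

  # The JSON dumps correctly emit an empty map for, e.g., "claims":
  good = '{"claims": {}}'
  # The XML dumps embed an empty list instead:
  bad = '{"claims": []}'

  for raw in (good, bad):
      claims = json.loads(raw)["claims"]
      print(raw, "->", type(claims).__name__)  # dict vs. list

  # Any consumer that expects a map (e.g. claims[property_id]) fails on
  # the second form with a TypeError.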

TASK DETAIL
  https://phabricator.wikimedia.org/T91117

REPLY HANDLER ACTIONS
  Reply to comment or attach files, or !close, !claim, !unsubscribe or !assign 
username.

EMAIL PREFERENCES
  https://phabricator.wikimedia.org/settings/panel/emailpreferences/

To: mkroetzsch
Cc: Lydia_Pintscher, mkroetzsch, Aklapper, Wikidata-bugs



___
Wikidata-bugs mailing list
Wikidata-bugs@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata-bugs


[Wikidata-bugs] [Maniphest] [Commented On] T89949: RDF mapping should not assert that .../entity/Q123 is-a Wikidata item

2015-02-20 Thread mkroetzsch
mkroetzsch added a comment.

In https://phabricator.wikimedia.org/T89949#1052731, @daniel wrote:

> Nik tells me that the HA features in Virtuoso are only available in the
> closed source enterprise version. That basically means WMF is not going to
> use it in production.


Yes, I guessed that this would cause issues. I don't know which other tool 
could deliver the performance you need though. 4Store is free, too, but may not 
be active enough since 5Store became the main (closed) product; it may also 
lack some features you need. Beyond this, the only free options (beyond 
research prototypes) are Jena and Sesame (OpenRDF). I think they won't scale to 
what we need.


TASK DETAIL
  https://phabricator.wikimedia.org/T89949

REPLY HANDLER ACTIONS
  Reply to comment or attach files, or !close, !claim, !unsubscribe or !assign 
username.

EMAIL PREFERENCES
  https://phabricator.wikimedia.org/settings/panel/emailpreferences/

To: daniel, mkroetzsch
Cc: Manybubbles, Denny, Smalyshev, mkroetzsch, Aklapper, daniel, Wikidata-bugs, 
Jdouglas, aude



___
Wikidata-bugs mailing list
Wikidata-bugs@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata-bugs


[Wikidata-bugs] [Maniphest] [Commented On] T89949: RDF mapping should not assert that .../entity/Q123 is-a Wikidata item

2015-02-19 Thread mkroetzsch
mkroetzsch added a comment.

The RDF should certainly contain information about the entity type of exported 
data. This is essential to ensure that the RDF data contains all the 
information that is found in the JSON (other than the ordering). As I read it, 
things that are of rdf:type Item are things that are described by an item on
Wikidata. If this is not obvious to anybody who uses the data (maybe somebody
really thinks that Washington himself is an item?!), we can always emphasize
this in the documentation of the Item class. I therefore suggest closing this
issue as invalid. It's just a matter of how we document our ontology. In
particular, it should not be assumed that any triple in RDF has a self-evident 
ground truth associated with it that one can grasp just by reading the URIs (or 
their labels), though I think confusion is very unlikely here since we do not 
export any RDF data about item documents.

The comment on rdfs:Resource seems to be a misinterpretation of the spec. In 
RDF, we certainly distinguish between a thing and its description, it is just 
that the description itself is yet another thing (in short: everything is a 
thing). I don't think that this has any bearing on how we want to encode the 
entity type in our RDF.

In general, I would suggest to stick to the RDF encoding that Denny and I have 
worked out and published, as it is used in the existing dumps. We can always 
discuss changes if really needed, but we should not start to re-discuss things 
that are already done. What is needed now is implementation, not design.


TASK DETAIL
  https://phabricator.wikimedia.org/T89949

REPLY HANDLER ACTIONS
  Reply to comment or attach files, or !close, !claim, !unsubscribe or !assign 
username.

EMAIL PREFERENCES
  https://phabricator.wikimedia.org/settings/panel/emailpreferences/

To: mkroetzsch
Cc: Smalyshev, mkroetzsch, Aklapper, daniel, Wikidata-bugs, Jdouglas, aude



___
Wikidata-bugs mailing list
Wikidata-bugs@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata-bugs


[Wikidata-bugs] [Maniphest] [Updated] T89949: RDF mapping should not assert that .../entity/Q123 is-a Wikidata item

2015-02-19 Thread mkroetzsch
mkroetzsch added a comment.

Our primary goal is to encode the JSON information in RDF, and possibly to 
enrich this information where it makes sense in an RDF-context (e.g., by adding 
links to other datasets). The JSON data includes the entity type, so it is 
clear that we want to encode it in RDF in some way. As I said, my understanding 
is that Q42 *is* an item for a suitable sense of "item", just as
P31 is a property in this sense. In neither
case are we referring to the HTML page or any other electronic document. The 
confusion arises from your preconception of the item class referring to a 
document or description, which in turn is understandable given our lack of 
up-to-date documentation for this vocabulary.

We can just invent new vocabulary as needed for exporting data about the HTML 
item pages, e.g., by introducing an ItemDocument class to refer to the 
documents. I think including data in RDF that we do not even have in our JSON 
exports is secondary for now. In fact, creating RDF exports for data that is 
collected by MediaWiki for every page seems a much bigger task that is hardly 
addressed in a satisfactory way by the snippet pasted above. Something like the 
SIOC vocabulary should probably be used there, and a suitable linked-data 
interface for accessing all revisions would be needed. I am not in favour of 
creating a makeshift solution now that mixes data that is special to Wikibase 
with data that should be there for all MW installations. MW should have its
own linked-data export for page metadata, and Wikidata should merely link to
the relevant URIs from its data exports.


TASK DETAIL
  https://phabricator.wikimedia.org/T89949

REPLY HANDLER ACTIONS
  Reply to comment or attach files, or !close, !claim, !unsubscribe or !assign 
username.

EMAIL PREFERENCES
  https://phabricator.wikimedia.org/settings/panel/emailpreferences/

To: mkroetzsch
Cc: Smalyshev, mkroetzsch, Aklapper, daniel, Wikidata-bugs, Jdouglas, aude



___
Wikidata-bugs mailing list
Wikidata-bugs@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata-bugs


[Wikidata-bugs] [Maniphest] [Commented On] T89949: RDF mapping should not assert that .../entity/Q123 is-a Wikidata item

2015-02-19 Thread mkroetzsch
mkroetzsch added a comment.

Thanks for adding Denny. Long reply, but details matter here.

I agree that there are different things one could talk about (document, real 
thing). However, for now I am mainly interested in talking about the latter, 
since this should be our primary concern in Wikibase (the document is a thing 
MediaWiki has to care about).

Now you argue that certain triples should not be given for the real thing 
(whether related to the reified statements or to the item type). However, your 
arguments do not have an objective foundation: the only reason why you do not 
want to have certain triples is that you interpret them to mean something that 
would not be true, whereas I am interpreting the very same triples in a way 
that would be correct. In other words, we have a dispute about the meaning of 
triples.

Interestingly, the triples we discuss primarily refer to vocabulary that we 
created for the very purpose of being used in these triples. How could it mean 
the wrong thing? Only if we define it to mean the wrong thing. Therefore, let's 
just define those triples to mean the right thing and we are all set. There is 
no technical discussion to be had here; it's all about desired or undesired 
interpretations.

If you look at the RDF structures that we get if we want to represent all of 
our data (without even including its order), then it should be clear that these 
structures simply do not have any self-evident natural interpretation. We 
need to tell people what they mean. Let's just tell them what we think they 
should mean. The only thing to keep in mind (and this is also what you are 
saying) is that we cannot use the same URI to mean different things in 
different contexts, so we need different URIs for referring to the class of 
real-world items and for referring to the class of item documents. I do not see 
a problem with this.

An RDF document represents a graph. It is a purely abstract, mathematical 
model. Nothing is said there about the real world or about documents or about 
truth. It's all in our heads. The reason why we are so careful to distinguish 
documents from real things etc. is that we want to make sure that the data as a 
whole (taking all RDF from one site together) still makes sense. Yet, we are 
free to define this sense. We could define a property that means "is described
on a Wikipedia page that was once edited by someone who was born in". Would
this say something about the real George Washington? Sure.

For the same reason, please do not confuse reification in RDF (which represents 
a triple without stating that the triple is true) with reification in our 
export (which simply uses an auxiliary resource to make a statement). Using 
auxiliary nodes in RDF data is a common technique that does not have any impact
on whether you are saying something about the real world or whether you are 
saying that a document made a certain claim. In particular, it is not any 
stronger or more direct to use one triple to represent a statement than to use 
a group of triples around an auxiliary node. You always need to document in 
your ontology what your RDF structures express.

Whether we need to use different subject URIs for simplified and for reified 
exports I don't know. Maybe it also depends on the exact way in which the 
simplified export is created. Already in our RDF exports, we are using 
different property URIs in both cases, so even in the union of the datasets 
there would never be any doubt as to which triple belongs to which view on the 
data. Moreover, many triples are the same in both views (e.g., labels). 
Therefore I am inclined to think that there is no need for different URIs 
there. (I don't see the connection to your "truthy" projection if you just use it
for answering queries, unless of course you are returning query results in RDF 
so that these results would turn into another kind of RDF export that needs to 
be consistent with those we have now.)
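
To illustrate the point about distinct property URIs with a sketch
(Python/rdflib; the namespace URIs here are placeholders, not the ones used in
the published export):

  from rdflib import Graph, Namespace

  WD = Namespace("http://www.wikidata.org/entity/")
  D  = Namespace("http://example.org/direct/")     # simplified view
  S  = Namespace("http://example.org/statement/")  # reified view

  g = Graph()

  # Simplified view: one direct triple.
  g.add((WD.Q42, D.P19, WD.Q350))

  # Reified view: an auxiliary statement node with its own property URIs,
  # leaving room to attach qualifiers and references.
  stmt = WD["Q42-S1"]  # made-up statement id
  g.add((WD.Q42, S.P19, stmt))
  g.add((stmt, S.P19_value, WD.Q350))

Since the two views never share a property URI, their union stays unambiguous.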


TASK DETAIL
  https://phabricator.wikimedia.org/T89949

REPLY HANDLER ACTIONS
  Reply to comment or attach files, or !close, !claim, !unsubscribe or !assign 
username.

EMAIL PREFERENCES
  https://phabricator.wikimedia.org/settings/panel/emailpreferences/

To: daniel, mkroetzsch
Cc: Denny, Smalyshev, mkroetzsch, Aklapper, daniel, Wikidata-bugs, Jdouglas, 
aude



___
Wikidata-bugs mailing list
Wikidata-bugs@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata-bugs


[Wikidata-bugs] [Maniphest] [Commented On] T89949: RDF mapping should not assert that .../entity/Q123 is-a Wikidata item

2015-02-19 Thread mkroetzsch
mkroetzsch added a comment.

Now my reply was so long that the ticket has already been closed in the 
meantime :-D Anyway, those are my two (or more) cents on this topic ;-) I don't 
think the paper goes into these topics very much (as they are not so much 
technical as philosophical).


TASK DETAIL
  https://phabricator.wikimedia.org/T89949

REPLY HANDLER ACTIONS
  Reply to comment or attach files, or !close, !claim, !unsubscribe or !assign 
username.

EMAIL PREFERENCES
  https://phabricator.wikimedia.org/settings/panel/emailpreferences/

To: daniel, mkroetzsch
Cc: Denny, Smalyshev, mkroetzsch, Aklapper, daniel, Wikidata-bugs, Jdouglas, 
aude



___
Wikidata-bugs mailing list
Wikidata-bugs@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata-bugs


[Wikidata-bugs] [Maniphest] [Commented On] T72385: Wikidata JSON dump: file directory location should follow standard patterns

2015-02-16 Thread mkroetzsch
mkroetzsch added a comment.

I think "json" should be in the path somewhere. It does not have to be at the
top level, but it would not be good if dump files of several types all end up
in one shared directory. The only way for tools to detect and download dumps
automatically is to look at the HTML directory listings, and these listings
should not change their appearance (again). Note that different types of dumps
will be created at different intervals, so a combined directory that contains
several types of dumps would look quite messy in the end.

We could have wikibase-dumps/wikidatawiki/json if you prefer this over 
something like other/wikibase-json/wikidatawiki. However, the latter seems to 
be more consistent with /other/incr/wikidatawiki. I don't care much about the 
details, but it would be good to have something systematic in the end: either 
other/projectname/dumptype or other/dumptype/projectname seems most logical. 
Also, I think that dumptype could already mention wikibase if desired, so 
that there is no need for an extra directory wikibase-dumps on the path. The 
thing to avoid is to introduce a new directory structure for every new kind of 
dump (and wikibase-dumps smells a lot like this, even if there is a faint 
possibility that there will be more dumps of this kind in the future -- do you 
actually have any plans to move our RDF dumps from 
http://tools.wmflabs.org/wikidata-exports/rdf/ to the dumps site? Could be 
done, but not sure if it is needed.).
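
As a sketch of why the listing must stay predictable, this is roughly what an
automated consumer does (Python; the listing content is made up):

  import re

  # A made-up HTML directory listing of the kind tools have to scrape:
  listing = '''
  <a href="20150201/">20150201/</a>
  <a href="20150215/">20150215/</a>
  '''

  # Detect available dumps from the listing and pick the newest one.
  dates = re.findall(r'href="(\d{8})/"', listing)
  print("latest dump:", max(dates))  # 20150215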


TASK DETAIL
  https://phabricator.wikimedia.org/T72385

REPLY HANDLER ACTIONS
  Reply to comment or attach files, or !close, !claim, !unsubscribe or !assign 
username.

EMAIL PREFERENCES
  https://phabricator.wikimedia.org/settings/panel/emailpreferences/

To: ArielGlenn, mkroetzsch
Cc: Wikidata-bugs, Nemo_bis, mkroetzsch, Svick, ArielGlenn, Lydia_Pintscher, 
jeremyb-phone, hoo, jeremyb



___
Wikidata-bugs mailing list
Wikidata-bugs@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata-bugs


[Wikidata-bugs] [Maniphest] [Commented On] T86524: use data model implementation for import

2015-01-12 Thread mkroetzsch
mkroetzsch added a comment.

I don't know about the details of the import task discussed here, but for the 
record: we are happy to support this use of WDTK by helping to update our 
implementation where necessary.


TASK DETAIL
  https://phabricator.wikimedia.org/T86524

REPLY HANDLER ACTIONS
  Reply to comment or attach files, or !close, !claim, !unsubscribe or !assign 
username.

EMAIL PREFERENCES
  https://phabricator.wikimedia.org/settings/panel/emailpreferences/

To: mkroetzsch
Cc: Aklapper, JanZerebecki, daniel, mkroetzsch, Manybubbles, jkroll, Smalyshev, 
Wikidata-bugs, aude, GWicke



___
Wikidata-bugs mailing list
Wikidata-bugs@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata-bugs


[Wikidata-bugs] [Maniphest] [Commented On] T86278: Define which data the query service would store

2015-01-11 Thread mkroetzsch
mkroetzsch added a comment.

In https://phabricator.wikimedia.org/T86278#969184, @Multichill wrote:

> I would like to turn it around. We should support indexing everything:
>
> ...
>
> The fact that we're not creative enough to make up queries for everything
> doesn't mean it isn't useful.


I have to disagree with this approach of designing a (query) system. Of course
supporting "everything" would be nice, but "indexing" always needs to be viewed
in the context of a query language. Depending on the query features you
support, "indexing" may mean completely different things. The requirement that
"everything should be indexed" is fuzzy and vague.

For example, Wikidata Query supports regular path queries (there, Kleene-star
recursion is accessed with an operator called TREE). When you say that
references should be indexed, do you mean that it should be possible to
navigate through references within regular path expressions? How about
qualifiers? I think Wikidata Query only supports TREE in the main statements.
On the other hand, Wikidata Query does not support any kind of cyclic queries
("Find people whose parents are married to each other.") even though the
relevant data is indexed. The point is that you need adequate index structures
for each query feature. You may have indexed everything and still not be able
to run the queries you want.
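
For illustration, the cyclic example as a naive join over a toy triple set
(Python; the data is invented):

  from itertools import permutations

  # Toy (subject, property, object) triples:
  triples = {
      ("ada", "parent", "carl"), ("ada", "parent", "dana"),
      ("bob", "parent", "carl"), ("bob", "parent", "erin"),
      ("carl", "married_to", "dana"), ("dana", "married_to", "carl"),
  }

  # "People whose parents are married to each other": the join pattern
  # x->p1, x->p2, p1->p2 contains a cycle, which tree-shaped query
  # features cannot express no matter how the data is indexed.
  for x in {s for (s, p, o) in triples if p == "parent"}:
      parents = [o for (s, p, o) in triples if s == x and p == "parent"]
      if any((p1, "married_to", p2) in triples
             for p1, p2 in permutations(parents, 2)):
          print(x)  # ada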

Moreover, I am creative enough to come up with queries that would be very hard
to implement, yet this does not mean that we should support them. Both
"everything" and "everything we are creative enough for" are poor design
principles.

So how to move forward? The usual approach to design a practical system is to 
collect use cases and requirements (example queries) and then make a clear 
decision what should be supported and what shouldn't. One can always revise 
this decision later, but it's still much better to make it explicit than to 
just go along and see what we get. In all of this, it must be understood that 
supporting everything is an impossible task. The question is merely where to 
draw the line(s).


TASK DETAIL
  https://phabricator.wikimedia.org/T86278

REPLY HANDLER ACTIONS
  Reply to comment or attach files, or !close, !claim, !unsubscribe or !assign 
username.

EMAIL PREFERENCES
  https://phabricator.wikimedia.org/settings/panel/emailpreferences/

To: Smalyshev, mkroetzsch
Cc: Aklapper, Smalyshev, Lydia_Pintscher, Multichill, Magnus, daniel, 
JeroenDeDauw, JanZerebecki, aude, mkroetzsch, Denny, Sjoerddebruin, 
Tobi_WMDE_SW, jkroll, Wikidata-bugs, GWicke, Manybubbles



___
Wikidata-bugs mailing list
Wikidata-bugs@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata-bugs


[Wikidata-bugs] [Maniphest] [Commented On] T86278: Define which data the query service would store

2015-01-11 Thread mkroetzsch
mkroetzsch added a comment.

@Smalyshev My suggestion was just about the surface appearance, not about the 
inner workings. I am saying that the following two phrases have the same 
structure:

- Find things with a *sitelink* that *has badge* *featured*.
- Find things with a *population* that has *point in time* *2014*.

If you look at it like this, you use "badge" like a (special) qualifier
property and "featured" like a value. This does not mean that query
answering will be any less efficient than with another syntax. The query engine
would easily parse the queries in any case, without ambiguity, and know that
sitelinks are (possibly) stored in a different way internally. Same for labels.
The reason I was suggesting this unification here is that it also answers the
question of what "indexing" this data means: any query that works over
statements would also be expected to work for sitelinks, even if different
structures are used internally.
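
A tiny sketch of the shared shape (Python; ids and values illustrative):

  # One pattern shape for both phrases: (subject, property, value, qualifiers).
  patterns = [
      ("?thing", "sitelink", "?link", {"has badge": "featured"}),
      ("?thing", "P1082",    "?pop",  {"point in time": "2014"}),
  ]
  for subj, prop, value, quals in patterns:
      print(f"find {subj} with {prop} {value} such that {quals}")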


TASK DETAIL
  https://phabricator.wikimedia.org/T86278

REPLY HANDLER ACTIONS
  Reply to comment or attach files, or !close, !claim, !unsubscribe or !assign 
username.

EMAIL PREFERENCES
  https://phabricator.wikimedia.org/settings/panel/emailpreferences/

To: Smalyshev, mkroetzsch
Cc: Aklapper, Smalyshev, Lydia_Pintscher, Multichill, Magnus, daniel, 
JeroenDeDauw, JanZerebecki, aude, mkroetzsch, Denny, Sjoerddebruin, 
Tobi_WMDE_SW, jkroll, Wikidata-bugs, GWicke, Manybubbles



___
Wikidata-bugs mailing list
Wikidata-bugs@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata-bugs


[Wikidata-bugs] [Maniphest] [Updated] T86278: Define which data the query service would store

2015-01-11 Thread mkroetzsch
mkroetzsch added a comment.

> This is not correct, original structure can be recovered

Then I misunderstood the transformation that was proposed. My impression was
that a statement with three qualifier snaks P1 V1, P1 V2, P2 V3 would be
stored as two statements, one with qualifiers P1 V1, P2 V3, and one with
qualifiers P1 V2, P2 V3. In this case, one would not be able to distinguish
this from the case where two statements with two qualifiers each had been
given originally. Could you explain what kind of transformation you had in
mind?

> Scan of the database shows there are no entries generating more than 15
> qualifier splits


My point was that an attacker could craft a single statement that makes you 
index millions of statements. It's clear that such statements are not in the 
current data, since they are hopefully not needed. Again this depends on my 
(possibly incorrect) understanding of your intended transformation.
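
Under that reading, the loss of information is easy to demonstrate (Python
sketch; snak values symbolic):

  from itertools import product

  # One statement with qualifier snaks P1:{V1,V2} and P2:{V3} ...
  original = {"P1": ["V1", "V2"], "P2": ["V3"]}

  # ... split into one statement per combination of qualifier values:
  split = [dict(zip(original, combo)) for combo in product(*original.values())]
  print(split)
  # [{'P1': 'V1', 'P2': 'V3'}, {'P1': 'V2', 'P2': 'V3'}]
  # This is exactly what two independently given statements would look
  # like, so the original grouping cannot be recovered.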


TASK DETAIL
  https://phabricator.wikimedia.org/T86278

REPLY HANDLER ACTIONS
  Reply to comment or attach files, or !close, !claim, !unsubscribe or !assign 
username.

EMAIL PREFERENCES
  https://phabricator.wikimedia.org/settings/panel/emailpreferences/

To: Smalyshev, mkroetzsch
Cc: Aklapper, Smalyshev, Lydia_Pintscher, Multichill, Magnus, daniel, 
JeroenDeDauw, JanZerebecki, aude, mkroetzsch, Denny, Sjoerddebruin, 
Tobi_WMDE_SW, jkroll, Wikidata-bugs, GWicke, Manybubbles



___
Wikidata-bugs mailing list
Wikidata-bugs@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata-bugs


[Wikidata-bugs] [Maniphest] [Commented On] T86278: Define which data the query service would store

2015-01-11 Thread mkroetzsch
mkroetzsch added a comment.

@JanZerebecki I understand what you are saying about what "indexing" means
here. Makes sense to me. What you are saying about my example query sounds as
if you are planning to implement query execution manually. I hope this is not 
the case and you can just give the query to Titan to get it answered for you.

You mentioned splitting certain statements with duplicate qualifiers. This 
changes the structure, and the original structure is no longer represented and 
can no longer be faithfully recovered. I don't know if this is an issue with 
the current data (which duplicate qualifiers are actually used?) but it is an 
issue in general. It also means that with a mere 40 qualifiers (20 properties, 
each with two values) on one forged statement, I could create a million 
statements in your index -- a possible DOS attack vector?
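
The combinatorics behind that number (Python):

  from math import prod

  # A forged statement with 20 qualifier properties, two values each,
  # i.e. 40 qualifier snaks in total:
  value_counts = [2] * 20

  # Statements a combination-based split would have to index:
  print(prod(value_counts))  # 1048576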

An alternative technique for handling such cases is to use a special encoding 
for overflowing values. This is done successfully in some database 
implementations, but it may mean additional checks during query answering, and 
depending on how low in the query answering process you can do these checks, 
they may or may not add significant cost (hence it's a pity that Titan does not 
support this efficiently out of the box).


TASK DETAIL
  https://phabricator.wikimedia.org/T86278

REPLY HANDLER ACTIONS
  Reply to comment or attach files, or !close, !claim, !unsubscribe or !assign 
username.

EMAIL PREFERENCES
  https://phabricator.wikimedia.org/settings/panel/emailpreferences/

To: Smalyshev, mkroetzsch
Cc: Aklapper, Smalyshev, Lydia_Pintscher, Multichill, Magnus, daniel, 
JeroenDeDauw, JanZerebecki, aude, mkroetzsch, Denny, Sjoerddebruin, 
Tobi_WMDE_SW, jkroll, Wikidata-bugs, GWicke, Manybubbles



___
Wikidata-bugs mailing list
Wikidata-bugs@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata-bugs


[Wikidata-bugs] [Maniphest] [Commented On] T86278: Define which data the query service would store

2015-01-11 Thread mkroetzsch
mkroetzsch added a comment.

@Smalyshev My point is merely that sitelinks and labels //can// be handled like
statements. Since statements must be supported anyway, it would be sensible to
reuse the data structures and query expressions defined for them. I don't think
that confusion is likely, since the query language will not use the colloquial
names from my examples. Properties of Wikidata will always be referred to by
their P-id, whereas something like "has badge" would not have an id of this
form. So it's not like having a reserved label "has badge" that competes with
Wikidata property labels.

Structurally, however, statements, sitelinks, and labels can all be represented
in the same data structure as far as querying is concerned. Maybe you would
like it better if you viewed it as a separate, independent data structure
"qualified triple" that we would use to represent statements, sitelinks, and
labels? There is no conceptual mix-up between the high-level terms here, just
taking advantage of similar structure at a low level. If you look at common
query languages like SQL, SPARQL, Cypher, etc., you can see that they are
always based on a relatively small set of structural primitives that do not
have a domain-specific meaning. You can always build UIs that use
domain-specific terms like "sitelink" and make them appear separate, but for
implementers and API users it is very useful if some things can be unified.
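
A minimal sketch of such a primitive (Python; all names hypothetical):

  from dataclasses import dataclass

  @dataclass(frozen=True)
  class QualifiedTriple:
      """One low-level record type for statements, sitelinks, and labels."""
      subject: str
      property: str
      value: str
      qualifiers: tuple = ()

  rows = [
      QualifiedTriple("Q64", "P1082", "3500000", (("P585", "2014"),)),
      QualifiedTriple("Q64", "sitelink/dewiki", "Berlin", (("badge", "featured"),)),
      QualifiedTriple("Q64", "label/en", "Berlin"),
  ]
  print(rows[0])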


TASK DETAIL
  https://phabricator.wikimedia.org/T86278

REPLY HANDLER ACTIONS
  Reply to comment or attach files, or !close, !claim, !unsubscribe or !assign 
username.

EMAIL PREFERENCES
  https://phabricator.wikimedia.org/settings/panel/emailpreferences/

To: Smalyshev, mkroetzsch
Cc: Aklapper, Smalyshev, Lydia_Pintscher, Multichill, Magnus, daniel, 
JeroenDeDauw, JanZerebecki, aude, mkroetzsch, Denny, Sjoerddebruin, 
Tobi_WMDE_SW, jkroll, Wikidata-bugs, GWicke, Manybubbles



___
Wikidata-bugs mailing list
Wikidata-bugs@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata-bugs