[Wikidata] New Wikidata maps

2015-06-25 Thread Markus Kroetzsch

Hi all,

Inspired by the recent work of Adam, I have now recreated the Wikidata 
maps that Denny made some years ago:


https://ddll.inf.tu-dresden.de/web/Wikidata/Maps-06-2015/en

There are some interesting observations to be made there, and in any 
case the images are quite pretty.


The code is available online as one of the Wikidata Toolkit examples [1] 
for anyone who wants to create more/different maps. Building all of the 
maps takes about half an hour on my laptop, once the dump is downloaded.
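
The maps themselves are rendered from the coordinate data in the dump by 
the Java code in the Toolkit examples. Purely as a rough illustration of 
the kind of data being plotted (not the actual map-generation code), a 
SPARQL query along these lines would retrieve items with a coordinate 
location (P625), assuming the usual wd:/wdt: prefixes of a Wikidata SPARQL 
endpoint:

  # Sketch: items with a coordinate location; LIMIT only to keep results small
  SELECT ?item ?coord WHERE {
    ?item wdt:P625 ?coord .
  }
  LIMIT 1000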


Cheers,

Markus

[1] https://github.com/Wikidata/Wikidata-Toolkit/tree/master/wdtk-examples

--
Markus Kroetzsch
Faculty of Computer Science
Technische Universität Dresden
+49 351 463 38486
http://korrekt.org/

___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata


Re: [Wikidata] Properties for family relationships in Wikidata

2015-08-17 Thread Markus Kroetzsch
!). On the other hand, these are properties with a very narrowly
defined scope, and we actively *want* them to be comprehensively
symmetric - every parent article should list all their children on
Wikidata, and every child article should list their parent and all
their siblings.

Perhaps it's worth reconsidering whether to allow symmetry for a
specifically defined class of properties - would an automatically
symmetric P26 really swamp the system? It would be great if the system
could match up relationships and fill in missing parent/child,
sibling, and spouse links. I can't be the only one who regularly adds
one half of the relationship and forgets to include the other!

A bot looking at all of these and filling in the gaps might be a
useful approach... but it would break down if someone tries to remove
one of the symmetric entries without also removing the other, as the
bot would probably (eventually) fill it back in. Ultimately, an
automatic symmetry would seem best.
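
For illustration, a minimal SPARQL sketch of the kind of gap such a bot (or 
an automatic symmetry mechanism) would have to find, assuming the usual 
wd:/wdt: prefixes and P26 = spouse:

  # Items whose spouse statement is not mirrored on the other item
  SELECT ?a ?b WHERE {
    ?a wdt:P26 ?b .
    FILTER NOT EXISTS { ?b wdt:P26 ?a }
  }
  LIMIT 100

The same pattern works for parent/child (P22/P25 vs. P40) or siblings, with 
the inverse property substituted accordingly.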

Thoughts on either of these? If there is interest I will write up a
formal proposal on-wiki.




--
Markus Kroetzsch
Faculty of Computer Science
Technische Universität Dresden
+49 351 463 38486
http://korrekt.org/

___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata


[Wikidata] Learning from DBpedia experiences

2015-07-31 Thread Markus Kroetzsch

Hi all,

Vladimir Alexiev has published an interesting post on issues he 
encountered when using DBpedia at OntoText, drawing many comparisons to 
Wikidata [1]. While this post praises the practices of Wikidata in 
several places, I think there are a few important insights for Wikidata 
there as well. There are certainly points where Wikidata is running into 
similar issues (if you have had a look at our class hierarchy recently, 
you know what I mean). Even in the cases where we are doing well, it 
might be valuable to obtain some feedback and confirmation from 
another perspective.


Cheers,

Markus

[1] 
http://vladimiralexiev.github.io/pres/20150209-dbpedia/dbpedia-problems-long.html


--
Markus Kroetzsch
Faculty of Computer Science
Technische Universität Dresden
+49 351 463 38486
http://korrekt.org/

___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata


Re: [Wikidata] Announcing Wikidata Taxonomy Browser (beta)

2015-10-22 Thread Markus Kroetzsch

On 22.10.2015 19:29, Dario Taraborelli wrote:

I’m constantly getting 500 errors.



I also observed short outages in the past, and I sometimes had to run a 
request twice to get an answer. It seems that the hosting on bitbucket 
is not very reliable. At the moment, this is still a first preview of 
the tool without everything set up as it should be. The tool should 
certainly move to Wikimedia labs in the future.


Markus


--
Markus Kroetzsch
Faculty of Computer Science
Technische Universität Dresden
+49 351 463 38486
http://korrekt.org/

___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata


[Wikidata] Announcing Wikidata Taxonomy Browser (beta)

2015-10-22 Thread Markus Kroetzsch

Hi all,

I am happy to announce a new tool [1], written by Serge Stratan, which 
allows you to browse the taxonomy (subclass of & instance of relations) 
between Wikidata's most important class items. For example, here is the 
Wikidata taxonomy for Pizza (discussed recently on this list):


http://sergestratan.bitbucket.org?draw=true=s0=177,2095,7802,28877,35120,223557,386724,488383,666242,736427,746549,2424752,1513,16686448


== What you see there ==

Solid green lines mean "subclass of" relations (subclasses are lower), 
while dashed purple lines are "instance of" relations (instances are 
lower). Drag and zoom the view as usual. Hover over items for more 
information. Click on arrows with numbers to display upper or lower 
neighbours. Right-click on classes to get more options.


The sidebar on the left shows statistics and presumed problems in the 
data (redundancies and likely errors). You can select a report type to 
see the reports, and click on any line to show the error. If you search 
for a class in the search field, the errors will be narrowed down to 
issues related to the taxonomy of this class.


The toolbar at the top has options to show and hide items based on the 
current selection (left click on any box).


Edges in red are the wrong way around (top to bottom). This occurs only 
when there are cycles in the "taxonomy".



== Micro tutorial ==

(1) Enter "Unicorn" in the search box, press return.
(2) Zoom out a bit by scrolling your mouse/touchpad
(3) Click on the "Unicorn" item box. It becomes blue (selected).
(4) Click "Expand up" in the toolbar at the top
(5) Zoom out to see the taxonomy of unicorn
(6) Find the class "Fictional Horse" (directly above unicorn) and click 
its downwards arrow labelled "3" to see all three child items of 
"fictional horse".

(7) Click the share button on the top right to get a link to this view.

You can also create your own share link manually by just changing the 
Qids in the URL as you like.



== Status and limitations ==

This is a prototype and it still has some limits:

* It only shows "proper" classes that have at least one instance or 
subclass. This is to reduce the overall data size and load time.
* The data is based on dumps (the date is shown on the right). It is not 
a live view.
* The layout is sometimes too dense. You can find a "hidden" option to 
make it more spacious behind the sidebar (click "Sidebar" to see it). This 
helps to disentangle larger graphs.
* There are some minor bugs in the UI. You sometimes need to click more 
than once until the right thing happens.
* The help page at http://sergestratan.bitbucket.org/howtouse.html does 
not explain everything in detail yet.


It is planned to work on some of these limitations in the future.

The hope is that this tool will reveal many errors in Wikidata's 
taxonomy that are otherwise hard to detect. For example, you can see 
easily that every "Ship" is an "Event" in Wikidata, that every "Hobbit" 
is a "Fantasy Race", and that every "Monday" is both a "Mathematical 
object" and a "Unit of measurement".


Feedback is welcome (on the tool; better start new threads for feedback 
on the Wikidata taxonomy ;-),


Markus


[1] http://sergestratan.bitbucket.org

--
Markus Kroetzsch
Faculty of Computer Science
Technische Universität Dresden
+49 351 463 38486
http://korrekt.org/

___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata


[Wikidata] Wikidata reasoning (Was: Properties for family relationships in Wikidata)

2015-08-27 Thread Markus Kroetzsch

 Then there is the issue that there is no theory of how the
 machine-interpretable information that is associated with entities in 
Wikidata

 is to be processed.   All the processing is currently done using
 uninterpretable procedures.  For example, on
 https://www.wikidata.org/wiki/Property_talk:P22 there is information 
that is

 used to control some piece of code that checks to see that the subject of
 https://www.wikidata.org/wiki/Property:P21 belongs to person (Q215627) or
 fictional character (Q95074).  However, there is no theory showing 
how this
 interacts with other parts of Wikidata, even such inherent parts of 
Wikidata

 as https://www.wikidata.org/wiki/Property:P31

 In fact, there is even difficulty of determining simple truth in 
Wikidata.

 Two sources can conflict, and Wikidata is not in the position of being an
 arbiter for such conflicts, certainly not in general.  To make the 
situation
 even more complex, Wikidata has a temporal aspect as well and has a 
need to

 admit exceptions to general statements.

 So what can be done?  Any solution is going to be tricky.  That is 
not to say
 that some solutions cannot be found by looking at systems and 
standards that

 are already being used for storing large amounts of complex information.
 However, any solution is going to have to be carefully tailored to 
meet the
 requirements of Wikidata and Wikidatans.  (Is there an official term 
for the

 people who are putting Wikidata and Wikidata information together?)

 There is also a big chicken-and-egg problem here - a good solution to 
reliable

 machine-interpretation of Wikidata information requires, for example,
 consistent use of instance of, subclass, and subproperty; but what 
counts as a
 consistent use of these fundamental properties depends on a formal 
theory of

 what they mean.


 I, for one, would find even just the attempt to solve this problem vastly
 interesting, and I have been doing some exploration as to what might be
 needed.  My company is interested in using Wikidata as a source of 
background
 information, but finds that the lack of a good theory of Wikidata 
information

 is problematic, so I have some cover for spending time on this problem.

 Anyway, if there is interest in machine interpretation of Wikidata
 information, if only to detect potential anomalies, I, and probably 
others,

 would be motivated to spend more time on trying to come up with potential
 solutions, hopefully in a collaborative effort that includes not just
 theoreticians but also Wikidatans.

 In the case of the hierarchy Stubbs is associated with, the maintainers
 have assumed that all mayors are, without exception, humans, or they somehow
 thought that if there were exceptions to this, the machines could
 somehow detect and apply them in each case. Both of those methods are, I
 think we agree, wrong, and we should find out why it's happening.

 Is there a tool where one can put in a Wikidata item and it extracts
 declarations based on higher properties like subclass or instance of?
 Like if I were to input the item for Stubbs, it would travel the
 hierarchy and tell me what would be assumed about Stubbs based on the
 declarations further up in the tree.

 Yes, it is called a reasoner.  The design of a reasoner would very 
likely be
 one result of the sort of work described above, but without such work 
it is
 very hard to figure out just what is supposed to be done in any 
except the

 simple cases.

 - Svavar Kjarrval

 Peter F. Patel-Schneider
 Nuance Communications


 ___
 Wikidata mailing list
 Wikidata@lists.wikimedia.org
 https://lists.wikimedia.org/mailman/listinfo/wikidata



--
Markus Kroetzsch, Departmental Lecturer
Department of Computer Science, University of Oxford
Room 306, Parks Road, OX1 3QD Oxford, United Kingdom
+44 (0)1865 283529   http://korrekt.org/

___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata


Re: [Wikidata] Item count

2015-09-08 Thread Markus Kroetzsch

On 08.09.2015 00:39, Daniel Kinzler wrote:

Thanks for investigating, Markus!


Unfortunately, none of my results can explain the missing 4 million 
items; they just tell us what is *not* the problem. Is there anything 
else that should be checked, or do you think the problem is just 
somewhere in the counting in MediaWiki (i.e., the 4 million items are 
not special at all, just overlooked for some reason)?


Markus



Am 07.09.2015 um 22:54 schrieb Markus Krötzsch:

On 07.09.2015 22:10, Markus Krötzsch wrote:

On 07.09.2015 21:48, Markus Krötzsch wrote:
...


I'll count how many of each we have. Back in 30min.


This does not seem to be the explanation after all. I could only find 33
items in total that have no data at all. If I also count items that have
nothing but descriptions or aliases, I get 589.

Will check for duplicates next.


Update: there are no duplicate items in the dump.

Markus


___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata






--
Markus Kroetzsch
Faculty of Computer Science
Technische Universität Dresden
+49 351 463 38486
http://korrekt.org/

___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata


[Wikidata] SQID evolved again: references

2016-06-02 Thread Markus Kroetzsch

Dear all,

By popular demand, SQID now also shows references for most statements 
(collapsed by default, of course). You can see it, e.g., here:


http://tools.wmflabs.org/sqid/#/view?id=Q42
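
For those who prefer to look at the underlying data directly: references are 
attached to statement nodes via prov:wasDerivedFrom in the RDF exports, so a 
sketch along these lines fetches them for one property (assuming the p:/prov: 
prefixes predefined by the query service; P69 is "educated at"):

  SELECT ?statement ?refProperty ?refValue WHERE {
    wd:Q42 p:P69 ?statement .                   # statement nodes, not just values
    ?statement prov:wasDerivedFrom ?reference .
    ?reference ?refProperty ?refValue .         # the snaks recorded in each reference
  }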

Cheers,

Markus

--
Markus Kroetzsch
Faculty of Computer Science
Technische Universität Dresden
+49 351 463 38486
http://korrekt.org/

___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata


Re: [Wikidata] Possibility of data lock

2016-06-10 Thread Markus Kroetzsch

On 10.06.2016 12:53, Sandra Fauconnier wrote:



On 10 Jun 2016, at 12:39, Yellowcard <yellowc...@wikipedia.de> wrote:



However, there are single statements (!) that
are proven to be correct (possibly in connection with qualifiers) and
are not subject to change in the future. Locking these statements
would make it much less risky to obtain them and use them directly in
Wikipedia. What would be the disadvantage of this, given that slightly
experienced users can still edit them and the lock is only a protection
against anonymous vandalism?


I agree 100%, and would like to add (again) that this would also make our data 
more reliable for re-use outside Wikimedia projects.

There’s a wide range of possibilities between locking harshly (no-one can edit 
it anymore) and leaving stuff entirely open. I disagree that just one tiny step 
away from ‘entirely open’ betrays the wiki principle.


I don't want to argue about principles :-). What matters is just how 
users perceive things. If they come to the site and want to make a 
change, and they cannot, you have to make sure that they understand why 
and what to do to fix it. The more you require them to learn and do 
before they are allowed to contribute, the more of them you will lose 
along the way. If a statement is not editable, then the (new or old) 
user has to:


(1) be told why this is the case
(2) be told what to do to change it anyway or at least to tell somebody 
to have a look at it (because there will always be errors, even in the 
"fixed" statements)


These things are difficult to get right.

There has been a lot of discussion in recent years as to why new editors turn 
their back on Wikipedia after a short time, and one major cause is that 
many of their first edits get reverted very quickly, often by automated 
tools. I think the reasons for the reverting are often valid, so it is a 
short-term improvement to the content, yet it severely hurts the 
community. Therefore, whenever one discusses new measures to stop or undo 
edits, one should also discuss how to avoid this negative effect.


I completely agree that there is a lot of middle ground to consider, 
without having to go to an extreme lock-down. However, things tend to 
develop over time, and I think it is fair to say that many Wikipedias 
have become more closed as they evolved. I am not eager to speed up this 
(natural, unavoidable?) process for Wikidata too much.


The pros and cons of flagged revisions have been discussed in breadth on 
several Wikipedias, with diverse outcomes, so it is probably a tricky 
thing to settle on a conclusion here.


Best regards,

Markus

--
Markus Kroetzsch
Faculty of Computer Science
Technische Universität Dresden
+49 351 463 38486
http://korrekt.org/

___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata


Re: [Wikidata] Possibility of data lock

2016-06-10 Thread Markus Kroetzsch

On 10.06.2016 12:49, jay...@gmail.com wrote:

Wikis also have auto patrolled rights.
The fundamental issue here is that modification of a well-qualified /
sourced statement should be extremely rare, as facts rarely change. This
is a level of granularity at which Wikidata promises to make a fundamental
difference to how content grows, and how editing on wikis is managed
needs to reflect that.


This is what I meant when asking for more advanced watching technology. 
One particularly interesting thing to watch for is cases where the 
value or qualifiers of a statement are changed, but the reference given 
is not changed. I think this could be a symptom of a typical 
misunderstanding by newcomers (the population changed, so we just "edit" 
it to fix the number, rather than creating a new population value that 
supersedes the historic one).


Markus


--
Markus Kroetzsch
Faculty of Computer Science
Technische Universität Dresden
+49 351 463 38486
http://korrekt.org/

___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata


Re: [Wikidata] Possibility of data lock

2016-06-10 Thread Markus Kroetzsch

On 10.06.2016 12:20, Jane Darnell wrote:

Yes I agree, so I guess I disagree with the idea of a "data lock". I do
however, recognize the desire for a "data lock" which arises out of a
personal frustration with good-faith Wikidata editor behavior. Many of
these unnecessary edits & subsequent reversions on Wikidata could be
avoided when a warning is sent to the good faith editor who makes the
same mistake for the nth time. We should be investigating ways to build
tooling to address this issue, as I believe a lot of the mistakes are
caused by Wikidata beginners who don't understand Wikidata. Don't
forget, the matters are complicated by the fact that most editors don't
speak a common language except for the labels on the items and
properties they are "edit warring" over. I expect that eventually the
need for this will decrease as the number of wikipedians in all language
versions slowly get onboarded in the proper use of Wikidata.


Hopefully. We also need some stronger inter-language coordination to 
support this (on a non-technical level). For example, the "allowed" 
values for P21 (sex or gender) as given in the description and usage 
guides (P2559) are different from language to language, and do not agree 
with the values actually used:


http://tools.wmflabs.org/sqid/#/view?id=P21

The usage notes (which exist in only a few languages) have more agreement 
than the descriptions. For example, German is stricter in that its 
description asks editors to use the given values *exclusively*, but it is 
more inclusive in that it allows Genderqueer (Q48270) as a value. I 
wonder if there are any editors who check these discrepancies in the 
"soft" part that the labels and descriptions constitute.


Markus





On Fri, Jun 10, 2016 at 12:01 PM, Markus Bärlocher wrote:

Dear all

I confirm this view:

 > the way in which Wikipedia is working:
 > The power to edit is the foundation of all Wikimedia projects.
 > Any attempt to shut out some "undesired" users will also reduce the
 > inflow of competent, well-meaning users.
 > Wikipedia and Wikidata alike are built upon the bet that there are
 > more things to be gained than to be lost by being open.

Best regards,
Markus


___
Wikidata mailing list
Wikidata@lists.wikimedia.org 
https://lists.wikimedia.org/mailman/listinfo/wikidata




___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata




___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata


Re: [Wikidata] Possibility of data lock

2016-06-10 Thread Markus Kroetzsch
 do for technical reasons.


Best regards,

Markus



On 09.06.2016 06:42, Biyanto Rebin wrote:

Hello,

Wikidata is a great collaborative database, I love it, but sometimes
when I encounter some vandalism, it makes me so angry. Just like today,
I've found Q879704 <https://www.wikidata.org/wiki/Q879704>, where the label,
description and alias in Indonesian (ID) were vandalised into something else,
and it happened in 2014
<https://www.wikidata.org/wiki/Special:Contributions/202.67.33.46>!

Now, I'm just curious: would everyone agree if (maybe) in the future the
Wikidata team could lock some cells that are already fixed? For example, Javanese
(Q33549) = P31: language (Q315). Or do we just leave Wikidata always open
for everyone to edit?

PS: I'm making WikiProjects Languages in Indonesia
<https://www.wikidata.org/wiki/Wikidata:WikiProject_Languages_in_Indonesia>,
anyone who want to join please let me know :)


​Best regards,​

--

Biyanto Rebin | Ketua Umum (/Chair/) 2016-2018
Wikimedia Indonesia
Nomor Ponsel: +62 8989 037379
Surel: biyanto.re...@wikimedia.or.id


Dukung upaya kami membebaskan pengetahuan:
http://wikimedia.or.id/wiki/Wikimedia_Indonesia:Donasi


___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata




--
Markus Kroetzsch
Faculty of Computer Science
Technische Universität Dresden
+49 351 463 38486
http://korrekt.org/

___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata


[Wikidata] another WDQS update lag example

2016-06-03 Thread Markus Kroetzsch

Hi,

Another recent example of a statement that does not seem to have been 
updated: one of the WDQS servers seems to think the population of 
Denmark is 5, though this was fixed on the 1st of June. The other server 
has the correct value, it seems.


It's hard to reproduce since only one of the servers has it. I got the 
error on this query: http://tinyurl.com/hnpxgyh (this is a variant of a 
not-so-simple example query I just built: "German states, ordered by the 
number of company headquarters per million inhabitants" [1]). For trying 
it out, this query might be nicer:


SELECT ?population WHERE {
  wd:Q35 wdt:P1082 ?population .   # Q35 = Denmark, P1082 = population
  FILTER (?population < 200)       # only return the suspiciously small value(s)
}

since you can change the "200" to trick the caching.

Cheers,

Markus

[1] 
https://www.mediawiki.org/wiki/Wikibase/Indexing/SPARQL_Query_Examples#German_states.2C_ordered_by_the_number_of_company_headquarters_per_million_inhabitants


Side remark: using arithmetic operations on query results is a great way 
to get even more misleading statistics out of Wikidata ;-) It does not 
seem as if we use that feature in many example queries yet.
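
For anyone who wants to try such arithmetic, here is a minimal sketch of a 
per-capita computation using BIND (assumptions: the usual wd:/wdt: prefixes, 
Q30 = United States, Q5 = human, P27 = country of citizenship, P1082 = 
population; the inner count may be slow or time out on the public endpoint):

  SELECT ?perMillion WHERE {
    { SELECT (COUNT(*) AS ?people) WHERE {
        ?person wdt:P31 wd:Q5 ;
                wdt:P27 wd:Q30 .       # humans with US citizenship
    } }
    wd:Q30 wdt:P1082 ?population .
    BIND(?people * 1000000 / ?population AS ?perMillion)
  }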


--
Markus Kroetzsch
Faculty of Computer Science
Technische Universität Dresden
+49 351 463 38486
http://korrekt.org/

___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata


Re: [Wikidata] Wikidata humans per million inhabitants

2016-06-04 Thread Markus Kroetzsch

On 04.06.2016 12:40, Andrew Gray wrote:

Hi Markus,

Fun! Here's the same query with one additional caveat: it only counts
people known to have been born since 1900.

http://tinyurl.com/jgzxvlq

This removes anyone who is definitely dead (but doesn't have a birth
date), but also cuts out anyone who is alive but where we don't know
either a birth or death date.  So it's a more conservative estimate.


Good point. The imprecision of some dates may be an issue with this 
approach, since people born in the 20th century could have any date 
within this range stored (Wikibase is not consistent internally 
regarding the handling of these values). Maybe going from 1900 to 1899 
could avoid all "born in 20th century" cases.



(It's a pity WD can't say "dates unknown, but definitely alive"...)


You could do this: birth date "some value" (a.k.a. "unknown" [1]), death 
date "no value". Of course, for this to work, all living people should 
have "death date" set to "no value" ...


As long as the absence of a death date is considered an indication of 
someone being alive, the only option would be to add death date 
"unknown" to those who are certainly not alive, and to consider all 
others to be alive.


Both of these approaches are workable, but both require that the 
approaches are applied consistently across Wikidata.




Orders are still much the same, but the numbers returned drop substantially
- from 21k to 15k for Finland, but only 27k to 25k for Sweden. It
seems Finland has more people, but Sweden has better-documented ones
:-)


Nice observation. Or maybe the Finnish just get very old ;-). Or maybe 
there are more people with unknown birth dates there (affected by the 
imprecision I was mentioning above: really alive, but filtered out by your 
approach). Either way, this could be something to look at.


Cheers

Markus

[1] A regrettable misnaming, since it has an epistemological component 
that is not part of the technical usage.


A.


On 4 June 2016 at 00:04, Markus Kroetzsch
<markus.kroetz...@tu-dresden.de> wrote:

Hi,

Here is a little fun query to show the relative prominence of several
countries' populations on Wikidata [1]:

http://tinyurl.com/zlq9bfv

Doing this for all countries (not just for EU countries) times out, but you
can get individual numbers for each country using BIND, as for the US:

http://tinyurl.com/huouz39

(576 Wikidata people per million inhabitants) or for China (6 Wikidata
people per million inhabitants). This may serve to show some regional biases but
also some natural effects.

Interestingly, it seems we already have almost 0.4% of the current
population of Finland on Wikidata.

Cheers,

Markus

[1]
https://www.mediawiki.org/wiki/Wikibase/Indexing/SPARQL_Query_Examples#Wikidata_people_per_million_inhabitants_for_all_EU_countries

--
Markus Kroetzsch
Faculty of Computer Science
Technische Universität Dresden
+49 351 463 38486
http://korrekt.org/

___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata








___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata


Re: [Wikidata] SQID evolved

2016-06-02 Thread Markus Kroetzsch

On 02.06.2016 15:55, Markus Kroetzsch wrote:

On 02.06.2016 12:11, Gerard Meijssen wrote:

Hoi,
Check out the head of government
<http://tools.wmflabs.org/sqid/#/view?id=P6>. Having both is probably
redundant.


I see. The order in SQID is not good there, since Obama is at the bottom
of a long list. We will fix the order (in a future release).


Tracked at https://github.com/Wikidata/WikidataClassBrowser/issues/70

Markus



And I was wondering why we have only two of the US presidents under head
of state. It's really a strange way of modelling this, but then the
question is if it is really redundant or not (I don't know). I guess the
two properties are different and only happen to coincide for the US.

Markus


On 2 June 2016 at 09:07, Markus Kroetzsch
<markus.kroetz...@tu-dresden.de>
wrote:

Hi,

On 02.06.2016 08:13, Gerard Meijssen wrote:

Hoi,
It is great to see it evolve so well. Congratulations.

There are a few things that may be considered. Mr Obama for
instance is
the current president, it has the necessary indication but squid
does
not consider this yet.


Not sure what you mean here. When I open Q30 in SQID, I can see
Obama listed as the president, with the correct times, and marked as
a preferred statement (this is the star). What's missing for you?

Cheers,

Markus


--
Markus Kroetzsch
Faculty of Computer Science
Technische Universität Dresden
+49 351 463 38486
http://korrekt.org/

___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata




___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata






--
Markus Kroetzsch
Faculty of Computer Science
Technische Universität Dresden
+49 351 463 38486
http://korrekt.org/

___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata


Re: [Wikidata] SQID evolved

2016-06-02 Thread Markus Kroetzsch

Hi,

On 02.06.2016 08:13, Gerard Meijssen wrote:

Hoi,
It is great to see it evolve so well. Congratulations.

There are a few things that may be considered. Mr Obama for instance is
the current president, it has the necessary indication but squid does
not consider this yet.


Not sure what you mean here. When I open Q30 in SQID, I can see Obama 
listed as the president, with the correct times, and marked as a 
preferred statement (this is the star). What's missing for you?


Cheers,

Markus

--
Markus Kroetzsch
Faculty of Computer Science
Technische Universität Dresden
+49 351 463 38486
http://korrekt.org/

___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata


Re: [Wikidata] Property about time of researching

2016-05-26 Thread Markus Kroetzsch

On 27.05.2016 07:06, Biyanto Rebin wrote:

Hello all,

Do we have specific property about time of researching before it's
published?
I'm trying to find in here:
https://www.wikidata.org/wiki/Special:WhatLinksHere/Q18636219
and here:
https://www.wikidata.org/wiki/Wikidata:List_of_properties/all#Dates

But I couldn't find it.


Not sure what you mean ("time of researching before it's published" 
seems to be a very vague notion).


For what it's worth, here is a complete list of all properties of datatype 
"time" that Wikidata has at the moment:


http://tools.wmflabs.org/sqid/#/browse?type=properties=8:Time

You can use the filter controls on the right to widen the search (your 
query sounds as if you are looking for a property to measure the length 
of an interval of time, so maybe you want to look among the properties 
of type Quantity).
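
The same list can also be obtained directly from the SPARQL endpoint; a 
sketch, assuming the wikibase:/bd: prefixes predefined there and wikibase:Time 
as the datatype IRI for time-valued properties:

  SELECT ?property ?propertyLabel WHERE {
    ?property a wikibase:Property ;
              wikibase:propertyType wikibase:Time .
    SERVICE wikibase:label { bd:serviceParam wikibase:language "en" . }
  }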


Markus


___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata


Re: [Wikidata] Op-ed on Wikipedia Signpost regarding Wikidata licensing

2016-06-16 Thread Markus Kroetzsch

On 16.06.2016 17:45, nicolasm...@tutanota.com wrote:

FYI:
https://en.wikipedia.org/wiki/Wikipedia:Wikipedia_Signpost/2016-06-15/Op-ed


Picking licenses is a complex topic, and it is extremely important to 
some people -- projects have split over this. I understand. But do 
emotions always have to be put above facts? Does the cause justify all 
means, even against our own principles of rigour and truthfulness that 
are otherwise so important in our projects? Here is what I mean:


You say that Microsoft donated to Wikidata. Is it possible that you have 
just made this up since it fits the picture you want to paint? No 
concerns about misinforming your readers here?


You claim that Google is using Wikidata content. I have not seen any 
proof of this. I have challenged Mr. Kolbe about this before, and indeed 
it seems that he is now avoiding this claim in the text you cite. The 
fact that Google stopped working on the Freebase imports does not seem 
to suggest that they are very interested in the data right now [1]. 
Maybe you have new information you would like to share with us? It would 
surely be of interest to many people here.


You mention "vain threats made by those who wish to use us as mere free 
labor for their enterprises". Which threats? Who made them? What are 
they threatening with? Are you just trying to stir the emotions of the 
reader, making them wish to rebel against some imagined enemy?


We can discuss which licence will lead to the best return on investment 
for the Wikidata community, whether it is desirable that restrictive data 
licenses become legally binding worldwide, and who would really benefit 
from this change in legislation(s). But being untruthful for the sake of 
argument is not a good start for such a discussion.


Markus

[1] I think this is nothing to be ashamed of -- Google is huge and their 
own internal data is likely much larger than what we have in Wikidata 
today. We may get there yet. Most importantly, our data is available 
freely while Google's is not.


--
Markus Kroetzsch
Faculty of Computer Science
Technische Universität Dresden
+49 351 463 38486
http://korrekt.org/

___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata


Re: [Wikidata] Op-ed on Wikipedia Signpost regarding Wikidata licensing

2016-06-16 Thread Markus Kroetzsch
In addition to my previous critique of the unsourced claims here, I 
have also made a comment on the talk page regarding my own position in 
this matter, which I replicate here for completeness:


"""
Current legislations do not support the licensing of individual facts, 
only of databases as a whole, and only in some countries. What you are 
asking for is Wikidata to lobby for the introduction of new notions of 
"copyright" which do not exist today. Yes, you could use these laws to 
enforce attribution and share-alike, but companies will also use the 
same laws to enforce conditions on using "their" facts. This is not 
desirable. Plain data is free from such legal control, and this is the 
position of the EFF (see this recent article [1]) and also of many 
people in our community. Concepts like the infamous illegal prime [2] 
express the fundamental opposition that free culture proponents have 
against putting terms and conditions on data items. By suggesting that 
laws should be more restrictive, the article is arguing against some of 
the basic freedoms we are supporting with our movement. --Markus 
Krötzsch 22:43, 16 June 2016 (UTC)

"""

In particular, it should be noted that the Electronic Frontier 
Foundation is fully supporting the approach of Wikidata: "raw data 
itself is not copyrightable, but there are still good reasons to 
explicitly assert its public domain status" [1].


Markus

[1] 
https://www.eff.org/deeplinks/2016/06/open-government-data-act-would-uh-open-government-data 


[2] https://en.wikipedia.org/wiki/Illegal_prime


On 16.06.2016 17:45, nicolasm...@tutanota.com wrote:

FYI:
https://en.wikipedia.org/wiki/Wikipedia:Wikipedia_Signpost/2016-06-15/Op-ed

Nicolas Maia
--
Enviado seguramente pelo Tutanota. Torne sua caixa de correio
criptografada hoje mesmo! https://tutanota.com


___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata




--
Markus Kroetzsch
Faculty of Computer Science
Technische Universität Dresden
+49 351 463 38486
http://korrekt.org/

___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata


[Wikidata] SQID evolved

2016-06-21 Thread Markus Kroetzsch

Hi all:

A small but useful change got deployed on SQID [1] today: there is now a 
search field at the top that allows you to select an item directly. It 
only covers items, since this is what the Wikidata API supports (and 
since it is not clear how to mix property matches in with the item 
matches in the results without obscuring your view). To search for 
properties by label, you can use the text filter in the Properties 
browser [2].


There have also been some mostly internal changes to the i18n system, 
and more of the UI has been internationalised so that it can be 
translated. If you'd like to contribute UI translations, see [3]. We 
would like to connect to translatewiki.net at some point, so if you 
would like to help with this, this is very welcome [4].


To see what else we are currently working on (and to add more to our 
list of todos), please see our issues [5].


Cheers,

Markus

[1] http://tools.wmflabs.org/sqid/
[2] http://tools.wmflabs.org/sqid/#/browse?type=properties
[3] https://github.com/Wikidata/SQID/tree/master/src/lang
[4] https://github.com/Wikidata/SQID/issues/71
[5] https://github.com/Wikidata/SQID/issues

--
Markus Kroetzsch
Faculty of Computer Science
Technische Universität Dresden
+49 351 463 38486
http://korrekt.org/

___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata


Re: [Wikidata] Op-ed on Wikipedia Signpost regarding Wikidata licensing

2016-06-18 Thread Markus Kroetzsch

On 18.06.2016 08:37, Gerard Meijssen wrote:

Hoi,
I have written my opinion on the licensing of Wikidata data.. [1]


I agree with your position there.

It's nice to have an argument that appeals to our goals of sharing and 
altruism. My former argument was more about the undesirable legal 
implications that such a strengthening of copyright law would imply -- a 
warning of the negative effects -- but it's good to also remember the 
positive effects that the current situation brings us.


Markus



[1]
http://ultimategerardm.blogspot.nl/2016/06/wikidata-has-cc-0-license-this-should.html

2016-06-16 17:45 GMT+02:00 <nicolasm...@tutanota.com>:

FYI:
https://en.wikipedia.org/wiki/Wikipedia:Wikipedia_Signpost/2016-06-15/Op-ed


Nicolas Maia
--
Enviado seguramente pelo Tutanota. Torne sua caixa de correio
criptografada hoje mesmo! https://tutanota.com

___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata




___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata




--
Markus Kroetzsch
Faculty of Computer Science
Technische Universität Dresden
+49 351 463 38486
http://korrekt.org/

___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata


Re: [Wikidata] Machine-readable Wikidata ontology/schema?

2016-06-23 Thread Markus Kroetzsch

On 23.06.2016 07:13, Stas Malyshev wrote:

Hi!


A quick search only returned those tables so far:
https://www.wikidata.org/wiki/Wikidata:List_of_properties/all

Any formal representation would work: OWL, etc.


There's basic OWL with Wikibase ontology here:
http://wikiba.se/ontology-1.0.owl
The properties can be found in the general dump (
https://dumps.wikimedia.org/wikidatawiki/entities/  )
described as outlined here:

https://www.mediawiki.org/wiki/Wikibase/Indexing/RDF_Dump_Format#Properties

There's no separate file, RDF, OWL or otherwise, with only properties,
AFAIK.


There is one for the initial (prototype) dumps [1], file 
wikidata-properties.nt.gz. Adjusting this to the RDF encoding used in 
the Wikidata SPARQL Service would be doable (mostly some URIs have 
changed, but there is a simple mapping).


With the small number of properties, it should also be easy to get much 
of their data with a SPARQL query (depending on what you need). Does 
BlazeGraph support CONSTRUCT?
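
For example, a minimal CONSTRUCT sketch (assuming the wikibase: prefix as 
predefined by the query service) that extracts all property declarations 
together with their datatypes:

  CONSTRUCT {
    ?property a wikibase:Property ;
              wikibase:propertyType ?type .
  }
  WHERE {
    ?property a wikibase:Property ;
              wikibase:propertyType ?type .
  }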


In fact, depending on what you want to do with the data, you may find 
other formats that list all properties useful, esp. the property list 
used in SQID [2]. You can download the JSON file with the underlying 
data (see the link in the README of the github project for SQID).


Both our RDF dumps and the SQID file are generated using Wikidata 
Toolkit. You could use this too if you want custom exports that are not 
easy to get through the SPARQL endpoint.


Markus


[1] The most recent one is already two months old though; there seems to be 
a bug with the generator: 
http://tools.wmflabs.org/wikidata-exports/rdf/index.php?content=dump_download.php=20160425

[2] http://tools.wmflabs.org/sqid/#/browse?type=properties






--
Markus Kroetzsch
Faculty of Computer Science
Technische Universität Dresden
+49 351 463 38486
http://korrekt.org/

___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata


Re: [Wikidata] Op-ed on Wikipedia Signpost regarding Wikidata licensing

2016-06-20 Thread Markus Kroetzsch

On 20.06.2016 19:57, Stas Malyshev wrote:

Hi!


Lest we forget
https://www.google.com/?gws_rd=ssl#q=tina+turner

Look at the right side Google Knowledge Graph panel Wikipedia is
displayed as sources of information.  Wikipedia gets attribution and a
bit of free advertising.


Compared to bold prominent link to Wikipedia as the first result of the
same search, I wonder how much traffic a little link tucked into the
sidebar on the right actually gets.



I think at least one group of people will read the small links very 
carefully: spammers who want to get visibility on Google. I am not sure 
if it would be so good for us at the current stage (with our current 
anti-spam infrastructure) to widely advertise Wikidata as an entry point 
to Google (which I still believe it is not, but things might change, and 
the question of how we want this to be displayed might come up 
eventually). The SEO people already try to convince others that Google 
is using Wikidata, but I think they are also aware of the fact that this 
is anything but clear so far. When Google really starts using our 
data in significant ways, and this becomes publicly documented, we need 
to be prepared for a lot more spam than we are seeing now.


Markus

--
Markus Kroetzsch
Faculty of Computer Science
Technische Universität Dresden
+49 351 463 38486
http://korrekt.org/

___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata


Re: [Wikidata] Op-ed on Wikipedia Signpost regarding Wikidata licensing

2016-06-21 Thread Markus Kroetzsch

Hi Stas,

Well said! The irony of it all is that more restrictive license terms 
would be such a big obstacle mainly to smaller users. Companies like 
Google have both the lawyers and the IT support to handle any kind of 
license. You can see it from the sources that Google already shows for 
their knowledge graph displays (not just Wikipedia, but also many 
others). They have all the infrastructure in place to use our data 
whatever license we pick, and the copyleft nature of some of their 
sources' licenses does not seem to affect them either.


Markus


On 20.06.2016 20:20, Stas Malyshev wrote:

Hi!


Current legislations do not support the licensing of individual facts,
only of databases as a whole, and only in some countries. What you are


Added to that, even if it *were* possible to copyright facts, I think
using restrictive license (and make no mistake, any license that
requires people to do specific things in exchange for data access *is*
restrictive) makes a lot of trouble for any people using the data. This
is especially true for data that is meant for automatic processing - you
will have to add code to track licenses for each data unit, figure out
how exactly to comply with the license (which would probably require
professional help, always expensive), track license-contaminated data
throughout the mixed databases, verify all outputs to ensure only
properly-licensed data goes out... It presents so much trouble many
people would just not bother with it. It would hinder exactly the thing
opens source excels at - creating community of people building on each
other's work by means of incremental contribution and wide participation.
Want to create a cool visualization based on Wikidata? Talk to a lawyer
first. Want to kickstart your research exploration using Wikidata facts? To
the lawyer you go. Want to write an article on, say, gender balance in
science over the ages and places, and feature Wikidata facts as an
example? Where's that lawyer's email again?
You get the picture, I hope. How many people would decide "well, it
would be cool but I have no time and resource to figure out all the
license issues" and not do the next cool thing they could do? Is it
something we really want to happen?

And all that trouble to no benefit to anyone - there's absolutely no
threat of Wikidata database being taken over and somehow subverted by
"enterprises", whatever that nebulous term means. In fact, if Google
example shows us anything, it's that "enterprises" are not very good at
it and don't really want it. Would they benefit from the free and open
data? Of course they would, as would everybody. The world - including
everybody, including "enterprises" - benefited enormously from free and
open participatory culture, be it open source software or free data. It
is a *good thing*, not something to be afraid of!

Wikidata data is meant for free use and reuse. Let's not erect
artificial barriers to it out of misguided fear to somehow benefit
somebody "wrong".




--
Markus Kroetzsch
Faculty of Computer Science
Technische Universität Dresden
+49 351 463 38486
http://korrekt.org/

___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata


Re: [Wikidata] How would I add data to Wikidata in voice AI - and anticipating brain wave head sets?

2016-06-25 Thread Markus Kroetzsch

Dear Scott,

Support for brain wave head sets is not on our current development plan, 
but pull requests are always welcome.


Finding SQID pages in Google will be difficult, since it is an 
all-Javascript application without static pages. Google may not see much 
of the content of our pages.


Best,

Markus


On 24.06.2016 19:48, Scott MacLeod wrote:

Hi Markus and Wikidatans,

I'm resending my email of a few minutes ago in a new thread, and
wondering about how all this might lead to very early
Wikidata/SQID/Wikipedia planning for anticipating brain wave head sets?
In anticipating voice use of Wikidata by other organizations (e.g.
Google Voice +), is there anything different Wikidata might plan for in
anticipating brain wave headsets - and even an open source / CC brain
wave head set? ...


Hi Markus, Egon, Gu, Joachim and Wikidatans,

In Google voice in Android on my smartphone  I asked the following
orally, and this is what was transcribed and understood AI -wise, and
also these were the "hits" I got via Google search:

1
"How would I add data to Wiki data" (no question mark was transcribed)

https://m.wikidata.org/wiki/Help:About_data

https://m.wikidata.org/wiki/Wikidata:Data_donation

2
"How would I get data using SQID from Wikidata"

but Google Voice doesn't recognize SQID yet, and it kept searching on
"SQUID"

3
Then I asked orally/in voice ..


"How would I edit Wikipedia"

... and I heard read back to me out loud, as well as saw in text at the
top of the Google search hit list:

"To edit the whole page at once, click the "edit this page" tab at the
top. To edit just one section, click the "edit" link to the right of the
section heading. To edit on Wikipedia, you type in a special markup
language called wikitext. See the cheat sheet for the most basic
wikitext codes." (good to see wikitext mentioned, #Wikimedia-Office hour
folks from June 21 at 2pm PT)

https://en.m.wikipedia.org/wiki/Wikipedia:FAQ/Editing

...and below this was this link ...

https://en.m.wikipedia.org/wiki/Help:Editing

4
Then I asked in a further experimental mode (and Google Voice
transcribed the following)

"how would I add MIT opencourseware to Wiki data" (capitalization and
spacing by Google Voice)

And I didn't get any relevant "hits" linking these two databases, or, in
particular, allowing me to add MIT OCW in 7 languages to Wikidata just
using my voice.


5

I then asked

"What is the Douglas Adams Wiki data page"

And got this ..
https://m.wikidata.org/wiki/Q42

6
Then spelling out S-Q-I-D, and after a few further different attempts, I
was able to orally ask

"What is the Wiki data S-Q-I-D page for Douglas Adams"

And I got the 2nd hit in Google search of

http://osdir.com/ml/general/2016-06/msg28035.html

Which was your email to me Markus on Wednesday, June 22 (in a "osdir"
which I hadn't seen before)! Cool ...

7

I then asked in Google Voice ...


"What is the Douglas Adams Wikipedia page"

https://en.m.wikipedia.org/wiki/Douglas_Adams

All this is great and exciting, and the very initial steps to engaging
Wikidata with voice, as I see it, but via Google Voice (with its
TensorFlow), and interoperability between organizations' data is so
important and generative (especially given CC Wikidata and Google's
opennesses), ... but not yet in a Wikidata voice project for example.

8

So, what are Wikidata plans for voice with AI ,machine learning and
machine translation?

What are Wikidata plans for anticipating Google Voice (with TensorFlow)
and by extension all other interesting voice AI projects (IBM, Amazon,
Siri, etc ... )?


I'm curious further about planning for how "voice AI" projects (e.g.
Google Voice) will engage Wikidata /SQID / Wikipedia and for robust
interactivity and knowledge generation, and even, for example, my coding
of Wikidata by voice.

The way such voice processes will further newly engage SQID references
(which are such a varied, rich and remarkable source of data) could
dramatically boost Wikidatas and SQIDs generativty, as well.

How might I begin to plan to use "voice" with Wikidata and SQID (and
also re adding CC MIT OCW in 7 languages to CC World University and
School, per an extended correspondence with MIT OCW's Dean Cecilia
d'Oliveira this springs when WUaS was in a UC Berkeley Law class - and
for anticipating student applications this autumn)? Could prospective
students begin to apply in voice to WUaS, for example, as if applying to
MIT or Stanford from around the world, first in English?

Thank you.

Best regards, Scott

http://worlduniversityandschool.org/



___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata




--
Markus Kroetzsch
Faculty of Computer Science
Technische Universität Dresden

[Wikidata] Wikidata RDF PHP export bug

2016-02-08 Thread Markus Kroetzsch

Hi,

I just noticed a bug in the RDF live exports of Wikidata: they still use 
the base URI <http://wikiba.se/ontology-beta#> for all Wikidata 
vocabulary terms. The correct base URI would be 
<http://wikiba.se/ontology#>. I guess this has been forgotten and never 
got noticed yet (not sure if there are consumers of the live exports).


The SPARQL query service uses the correct URIs in its example queries 
and data. The URIs in the ontology documents at wikiba.se are also 
correct, so this only seems to affect the PHP code.


Cheers,

Markus

[1] wikiba.se/ontology

--
Markus Kroetzsch
Faculty of Computer Science
Technische Universität Dresden
+49 351 463 38486
http://korrekt.org/

___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata


Re: [Wikidata] Wikidata RDF PHP export bug

2016-02-08 Thread Markus Kroetzsch

On 08.02.2016 19:35, Stas Malyshev wrote:

Hi!


I just noticed a bug in the RDF live exports of Wikidata: they still use
the base URI <http://wikiba.se/ontology-beta#> for all Wikidata
vocabulary terms. The correct base URI would be
<http://wikiba.se/ontology#>. I guess this has been forgotten and never
got noticed yet (not sure if there are consumers of the live exports).


It's not forgotten - in fact, we have an issue for that,
https://phabricator.wikimedia.org/T112127 - but we never got to defining
the point when we do it. One can argue RDF mapping is still not complete
- we do not support units fully, and we may have to add stuff for
geo-coordinates too - but one can argue it's good enough to be 1.0 and
I'd agree with it. But we need to take decision on this. Please feel
free to also comment on the task.


My two points on this matter are:

(1) Whatever we do, the same thing should use the same URI in SPARQL 
results and live exports.
(2) It is better to version ontology documents rather than the URIs 
themselves. Otherwise everybody will have to change all of their queries 
(in wiki pages, in code, in their heads, ...) once we get out of beta ;-)


Cheers,

Markus

--
Markus Kroetzsch
Faculty of Computer Science
Technische Universität Dresden
+49 351 463 38486
http://korrekt.org/

___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata


Re: [Wikidata] upcoming deployments/features

2016-02-05 Thread Markus Kroetzsch

On 04.02.2016 18:59, Daniel Kinzler wrote:

Am 04.02.2016 um 08:03 schrieb Markus Krötzsch:

Data model updates are costly. Don't make them on a week's notice, without prior
discussion, and without having any documentation ready to give to data users. It
would also be good to announce breaking technical changes more prominently on
wikidata-tech as well.


These have been discussed for months, if not years. Especially identifiers.


Citation needed ;-) Note that my emails were about math, not about the 
identifiers. Also, discussing something is not enough. In the end, 
you need to give us the technical details so we can fix our tools. I 
knew that you planned to introduce identifier types, but I still don't 
know the RDF IRI for this new type.




I do not consider adding new data types a breaking change. Converting existing
properties to a different data type is a breaking change to the data-set, not to
the model or the software.


No, sorry, this is just wrong. The datatypes are part of the model, not 
of the data. Changing the format of JSON to include new, hitherto 
unknown types might break a tool (and not break others). It will depend 
on the function of the tool (and its implementation technique) but some 
will break.


For example, a tool that converts Wikidata to RDF will have a problem if 
it encounters something that it cannot translate. This is hard to 
recover from, since you cannot even declare the exported property as a 
property at all unless you probe your data to find hints in the form of 
values that use this data (I don't think any existing export tool would 
work like this). As a result, you fail to export a significant part of 
the property definitions, which makes the dumps invalid for OWL, or you 
have to omit big parts of the data. This is clearly breaking essential 
functionality.


An earlier email in this thread reported that the recent changes also 
break pywikibot, but maybe this was another part of the changes and the 
one-week time-to-update I complain about does not apply there (maybe 
this break was clear earlier so the team there had more time to adjust?).


You really need to give people more time to accommodate data model 
changes -- and you should start counting the time when you have finished 
and publicised the documentation of a change. I can't believe that my 
question on the type IDs in JSON and RDF is still unanswered. I would 
really like to release an update of our software and online tools next 
week, so it would be good to know by then. Does nobody know this yet, or 
do you not have enough resources to say it, or what is the problem? We 
are no longer in those early years where everything was new and there 
were no applications to break.


Markus

--
Markus Kroetzsch
Faculty of Computer Science
Technische Universität Dresden
+49 351 463 38486
http://korrekt.org/

___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata


Re: [Wikidata] WDQS stability

2016-02-05 Thread Markus Kroetzsch

Hi Stas,

Thanks for the update. Maybe this is a good opportunity to say that you 
and everybody involved in WDQS are doing a tremendous job in maintaining 
this service. Even with the small glitches in the last week, this is 
still one of the most reliable public SPARQL endpoints that I have seen. 
This is not a small achievement, considering how the load is 
continuously shifting and changing (e.g., if someone announces a tool 
that queries the hitherto neglected "GAS Service" every time that a user 
clicks on the link!). And on top of all this, we are getting some very 
quick email responses here whenever there is an issue (or even just a 
usage question). So thanks for all the efforts -- this is an absolutely 
crucial piece of infrastructure, and it is good to see it in such 
professional hands.


Cheers,

Markus



On 04.02.2016 22:10, Stas Malyshev wrote:

Hi!

As it was noted on the list, we recently tried to update Blazegraph -
software running Wikidata Query Service - to version 2.0, which has
numerous bugfixes and performance improvements, and some infrastructure
for future work on Geospatial search, etc.

Unfortunately, it seems, as it sometimes happens with new major
releases, that there are certain bugs in it, and yet more unfortunately,
one of the bugs seems to be of a race condition nature, which is very
hard to trigger on test environment, and that, when triggered, seriously
impacts the stability of the service. All this lead to WDQS service
being somewhat unstable last couple of days.

Due to this, I have rolled the production deployment back to pre-2.0
state. This means the service should be stable again and not experience
glitches anymore. I'll be watching it just in case and if you notice
anything that looks broken (like queries producing weird exceptions -
timeout does not count - or service being down, etc.) please ping me.

In the meantime, we will look for the cause of instability, and once it
is identified and fixed, we'll try the Blazegraph 2.0 roll-out again,
with the fixes applied. I'll send a note to the list when it happens.

Thanks,




--
Markus Kroetzsch
Faculty of Computer Science
Technische Universität Dresden
+49 351 463 38486
http://korrekt.org/

___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata


Re: [Wikidata] [Wikidata-tech] Technical information about the new "math" and "external-id" data types

2016-02-05 Thread Markus Kroetzsch

On 05.02.2016 12:19, Daniel Kinzler wrote:

As Lydia announced, we are going to deploy support for two new data types soon
(think of "data types" as "property types", as opposed to "value types"):

...

The datatypes themselves are declared as follows:

wd:P708 a wikibase:Property ;
wikibase:propertyType wikibase:ExternalId .

wd:P717 a wikibase:Property ;
wikibase:propertyType wikibase:Math .

Accordingly, the URIs of the datatypes (not the types of the literals!) are:
<http://wikiba.se/ontology-beta#ExternalId>
<http://wikiba.se/ontology-beta#Math>


Thanks, this is all I need to know. We will have a new release in time.

...



Here are some changes concerning the math and external-id data types that we are
considering or planning for the future.

* For the Math datatype, we may want to provide a type URI for the RDF string
literal that indicates that the format is indeed TeX.
Perhaps we could use <http://purl.org/xtypes/Fragment-LaTeX>.


+1 to this, especially if the string can actually be guaranteed to be 
LaTeX (not just regarding special commands, but also in general -- not 
sure if the current datatype does any type checking for the string).




* For the ExternalId data type, we would like to use resource URIs for external
IDs (in "direct claims"), if possible. This would only work if we know the base
URI for the property  (provided by a statement on the property definition). For
properties with no base URI set, we would still use plain string literals.


Note that your "base URI" on Wikidata is called "URI pattern for RDF 
resource" (https://www.wikidata.org/wiki/Property:P1921). We are already 
using this in RDF exports. This is not specific to identifier properties 
but can be used with any string property where IRIs make sense.




In our example above, the base URI for P708 might be
<https://tardis.net/allonzy/>. The Turtle snippet would read:

wd:Q2209 a wikibase:Item ;
   wdt:P717 "\\sin x^2 + \\cos_b x ^ 2 = e^{2 \\tfrac\\pi{i}}"
^^purl:Fragment-LaTeX;
   wdt:P708 <https://tardis.net/allonzy/BADWOLF> .


Going from string literals to IRIs changes the property type in 
incompatible ways. To keep existing queries (with filters etc.) working, 
it is better to add the URI as an extra triple rather than having it 
replace the main (string) id value. This is also important for users who 
want to display the data returned by a query in a way that looks like on 
Wikidata (you don't want to extract the string value from the IRI with 
string operations). This is also how it is currently implemented in the 
RDF exports.




However, the full representation of the statement would still use the original
string literal:

wds:Q2209-24942a17-4791-a49d-6469-54e581eade55 a wikibase:Statement,
wikibase:BestRank ;
wikibase:rank wikibase:NormalRank ;
ps:P708 "BADWOLF" .


We would also like to provide the full URI of the external resource in JSON,
making us a good citizen of the web of linked data. We plan to do this using a
mechanism we call "derived values", which we also plan to use for other kinds of
normalization in the JSON output. The idea is to include additional data values
in the JSON representation of a Snak:

{
"snaktype": "value",
"property": "P708",
"datavalue": {
"value": "BADWOLF",
"type": "string"
},
"datavalue-uri": {
"value": "https://tardis.net/allonzy/BADWOLF;,
"type": "string"
},
"datatype": "external-id"
}

In some cases, such as ISBNs, we would want a URL as well as a URI:
   {
 "snaktype": "value",
 "property": "P708",
 "datavalue": {
   "value": "3827370191",
   "type": "string"
 },
 "datavalue-uri": {
   "value": "urn:isbn:3827370191",
   "type": "string"
 },
 "datavalue-url": {
   "value": "https://www.wikidata.org/wiki/Special:BookSources/3827370191;,
   "type": "string"
 },
 "datatype": "external-id"
   }

The base URL would be given as a statement on the property, just like the base 
URI.

We plan to use the same mechanism for giving Quantities in a standard unit,
providing thumbnail URLs for CommonsMedia values, etc.


I think I already commented on this in other places. Wasn't there a 
tracker item where the derived values were discussed? Something to keep 
in mind here is that many properties have multiple URIs and URLs 
associated. This is no problem in RDF, but your above encoding might not 
work for this case.


Markus

--
Markus Kroetzsch
Faculty of Computer Science
Technische Universität Dresden
+49 351 463 38486
http://korrekt.org/

___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata


Re: [Wikidata] SPARQL CONSTRUCT results truncated

2016-02-13 Thread Markus Kroetzsch

And here is another comment on this interesting topic :-)

I just realised how close the service is to answering the query. It 
turns out that you can in fact get the whole set of (currently >324000 
result items) together with their GND identifiers as a download *within 
the timeout* (I tried several times without any errors). This is a 63M 
json result file with >640K individual values, and it downloads in no 
time on my home network. The query I use is simply this:


PREFIX wd: <http://www.wikidata.org/entity/>
PREFIX wdt: <http://www.wikidata.org/prop/direct/>

select ?item ?gndId
where {
  ?item wdt:P227 ?gndId ; # get gnd ID
wdt:P31  wd:Q5  . # instance of human
} ORDER BY ASC(?gndId) LIMIT 10

(don't run this in vain: even with the limit, the ORDER clause requires 
the service to compute all results every time someone runs this. Also be 
careful when removing the limit; your browser may hang on an HTML page 
that large; better use the SPARQL endpoint directly to download the 
complete result file.)
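
Roughly what I mean by using the SPARQL endpoint directly: a minimal
Python sketch that streams the full JSON result straight to a file
(the file name and chunk size are arbitrary):

import requests

QUERY = """
PREFIX wd: <http://www.wikidata.org/entity/>
PREFIX wdt: <http://www.wikidata.org/prop/direct/>
SELECT ?item ?gndId WHERE {
  ?item wdt:P227 ?gndId ; # get gnd ID
        wdt:P31  wd:Q5  . # instance of human
} ORDER BY ASC(?gndId)
"""

# Stream the response to disk so the large result never sits in memory.
response = requests.get('https://query.wikidata.org/sparql',
                        params={'query': QUERY, 'format': 'json'},
                        stream=True)
with open('gnd-ids.json', 'wb') as out:
    for chunk in response.iter_content(chunk_size=1 << 20):
        out.write(chunk)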


It seems that the timeout is only hit when adding more information 
(labels and wiki URLs) to the result.


So it seems that we are not actually very far away from being able to 
answer the original query even within the timeout. Certainly not as far 
away as I first thought. It might not be necessary at all to switch to a 
different approach (though it would be interesting to know how long LDF 
takes to answer the above -- our current service takes less than 10sec).


Cheers,

Markus


On 13.02.2016 11:40, Peter Haase wrote:

Hi,

you may want to check out the Linked Data Fragment server in Blazegraph:
https://github.com/blazegraph/BlazegraphBasedTPFServer

Cheers,
Peter

On 13.02.2016, at 01:33, Stas Malyshev <smalys...@wikimedia.org> wrote:

Hi!


The Linked data fragments approach Osma mentioned is very interesting
(particularly the bit about setting it up on top of an regularily
updated existing endpoint), and could provide another alternative,
but I have not yet experimented with it.


There is apparently this: https://github.com/CristianCantoro/wikidataldf
though not sure what it its status - I just found it.

In general, yes, I think checking out LDF may be a good idea. I'll put
it on my todo list.

--
Stas Malyshev
smalys...@wikimedia.org

___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata



___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata




--
Markus Kroetzsch
Faculty of Computer Science
Technische Universität Dresden
+49 351 463 38486
http://korrekt.org/

___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata


Re: [Wikidata] SPARQL CONSTRUCT results truncated

2016-02-13 Thread Markus Kroetzsch

On 13.02.2016 22:56, Markus Kroetzsch wrote:

And here is another comment on this interesting topic :-)

I just realised how close the service is to answering the query. It
turns out that you can in fact get the whole set of (currently >324000
result items) together with their GND identifiers as a download *within
the timeout* (I tried several times without any errors). This is a 63M
json result file with >640K individual values, and it downloads in no
time on my home network. The query I use is simply this:

PREFIX wd: <http://www.wikidata.org/entity/>
PREFIX wdt: <http://www.wikidata.org/prop/direct/>

select ?item ?gndId
where {
   ?item wdt:P227 ?gndId ; # get gnd ID
 wdt:P31  wd:Q5  . # instance of human
} ORDER BY ASC(?gndId) LIMIT 10

(don't run this in vain: even with the limit, the ORDER clause requires
the service to compute all results every time someone runs this. Also be
careful when removing the limit; your browser may hang on an HTML page
that large; better use the SPARQL endpoint directly to download the
complete result file.)


P.S. For those who are interested, here is the direct link to the 
complete result (remove the line break [1]):


https:
//query.wikidata.org/sparql?query=PREFIX+wd%3A+%3Chttp%3A%2F%2Fwww.wikidata.org%2Fentity%2F%3E%0D%0APREFIX+wdt%3A+%3Chttp%3A%2F%2Fwww.wikidata.org%2Fprop%2Fdirect%2F%3E%0D%0Aselect+%3Fitem+%3FgndId+where+{+%3Fitem+wdt%3AP227+%3FgndId+%3B+wdt%3AP31++wd%3AQ5+.+}+ORDER+BY+ASC%28%3FgndId%29&format=json

Markus

[1] Is the service protected against internet crawlers that find such 
links in the online logs of this email list? It would be a pity if we 
would have to answer this query tens of thousands of times for many 
years to come just to please some spiders who have no use for the result.




It seems that the timeout is only hit when adding more information
(labels and wiki URLs) to the result.

So it seems that we are not actually very far away from being able to
answer the original query even within the timeout. Certainly not as far
away as I first thought. It might not be necessary at all to switch to a
different approach (though it would be interesting to know how long LDF
takes to answer the above -- our current service takes less than 10sec).

Cheers,

Markus


On 13.02.2016 11:40, Peter Haase wrote:

Hi,

you may want to check out the Linked Data Fragment server in Blazegraph:
https://github.com/blazegraph/BlazegraphBasedTPFServer

Cheers,
Peter

On 13.02.2016, at 01:33, Stas Malyshev <smalys...@wikimedia.org> wrote:

Hi!


The Linked data fragments approach Osma mentioned is very interesting
(particularly the bit about setting it up on top of an regularily
updated existing endpoint), and could provide another alternative,
but I have not yet experimented with it.


There is apparently this: https://github.com/CristianCantoro/wikidataldf
though not sure what it its status - I just found it.

In general, yes, I think checking out LDF may be a good idea. I'll put
it on my todo list.

--
Stas Malyshev
smalys...@wikimedia.org

___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata



___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata







--
Markus Kroetzsch
Faculty of Computer Science
Technische Universität Dresden
+49 351 463 38486
http://korrekt.org/

___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata


Re: [Wikidata] SPARQL CONSTRUCT results truncated

2016-02-13 Thread Markus Kroetzsch

On 13.02.2016 23:50, Kingsley Idehen wrote:
...

Markus and others interested in this matter,

What about using OFFSET and LIMIT to address this problem? That's what
we advice users of the DBpedia endpoint (and others we publish) to do.

We have to educate people about query implications and options. Even
after that, you have the issue of timeouts (which aren't part of the
SPARQL spec) that can be used to produce partial results (notified via
HTTP headers), but that's something that comes after the basic scrolling
functionality of OFFSET and LIMIT are understood.


I think this does not help here. If I only ask for part of the data (see 
my previous email), I can get all 300K results in 9.3sec. The size of 
the result does not seem to be the issue. If I add further joins to the 
query, the time needed seems to go above 10sec (timeout) even with a 
LIMIT. Note that you need to order results for using LIMIT in a reliable 
way, since the data changes by the minute and the "natural" order of 
results would change as well. I guess with a blocking operator like 
ORDER BY in the equation, the use of LIMIT does not really save much 
time (other than for final result serialisation and transfer, which 
seems pretty quick).
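
Just to illustrate what this would look like: the paging pattern below is
easy enough to write, but every page still forces the service to sort the
complete result before applying OFFSET, so little is gained (the page size
is arbitrary):

import requests

PAGE_SIZE = 10000
QUERY_TEMPLATE = """
PREFIX wd: <http://www.wikidata.org/entity/>
PREFIX wdt: <http://www.wikidata.org/prop/direct/>
SELECT ?item ?gndId WHERE {{
  ?item wdt:P227 ?gndId ;
        wdt:P31  wd:Q5  .
}} ORDER BY ASC(?gndId) LIMIT {limit} OFFSET {offset}
"""

offset = 0
while True:
    # ORDER BY gives stable pages, but it is a blocking operation: each
    # request sorts the full result again before slicing out one page.
    query = QUERY_TEMPLATE.format(limit=PAGE_SIZE, offset=offset)
    result = requests.get('https://query.wikidata.org/sparql',
                          params={'query': query, 'format': 'json'}).json()
    rows = result['results']['bindings']
    if not rows:
        break
    # ... process rows ...
    offset += PAGE_SIZE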


Markus



[1]
http://stackoverflow.com/questions/20937556/how-to-get-all-companies-from-dbpedia
[2] https://sourceforge.net/p/dbpedia/mailman/message/29172307/



___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata




--
Markus Kroetzsch
Faculty of Computer Science
Technische Universität Dresden
+49 351 463 38486
http://korrekt.org/

___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata


[Wikidata] SPARQL returns bnodes for some items

2016-02-26 Thread Markus Kroetzsch

Hi Stas, hi all,

I just noted that BlazeGraph seems to contain a few erroneous triples. 
The following query, for example, returns a blank node "t7978245":


SELECT ?superClass WHERE {
  <http://www.wikidata.org/entity/Q595133> p:P279/ps:P279 ?superClass
}

https://query.wikidata.org/#SELECT%20%3FsuperClass%20WHERE%20{%0A%20%20%3Chttp%3A%2F%2Fwww.wikidata.org%2Fentity%2FQ595133%3E%20p%3AP279%2Fps%3AP279%20%3FsuperClass%0A}

I stumbled upon six cases like this (for P279): Q595133 (shown above), 
Q1691488, Q11259005, Q297106, Q1293664, and Q539558. This would be less 
than 0.001% of the 623,963 P279 statements, but it's still enough to 
have application code trip over the unexpected return format ;-).


Best

Markus

--
Markus Kroetzsch
Faculty of Computer Science
Technische Universität Dresden
+49 351 463 38486
http://korrekt.org/

___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata


Re: [Wikidata] Wikidata Propbrowse

2016-02-14 Thread Markus Kroetzsch

On 14.02.2016 18:03, Hay (Husky) wrote:

On Sun, Feb 14, 2016 at 4:40 PM, Markus Kroetzsch
<markus.kroetz...@tu-dresden.de> wrote:

I suspect that https://query.wikidata.org can count how many times each
property is used.



Amazingly, you can (I was surprised):

https://query.wikidata.org/#SELECT%20%3FanyProp%20%28count%28*%29%20as%20%3Fcount%29%0AWHERE%20{%0A%20%20%20%20%3Fpid%20%3FanyProp%20%3FsomeValue%20.%0A}%0AGROUP%20BY%20%3FanyProp%0AORDER%20BY%20DESC%28%3Fcount%29

That's a really nice find! Any idea how to filter the query so you
only get the property statements?


I would just filter this in code; a more complex SPARQL query is just 
getting slower. Here is a little example Python script that gets all the 
data you need:


https://github.com/Wikidata/WikidataClassBrowser/blob/master/helpers/python/fetchPropertyStatitsics.py
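
The gist of the filtering is something like the following (a simplified
sketch, not the actual script; the only trick is that all Wikibase
property namespaces share the prefix .../prop/):

import requests

# Count how often each predicate is used, then keep only Wikibase
# property namespaces (wdt:, p:, ps:, pq:, pr:) in code.
QUERY = """
SELECT ?anyProp (COUNT(*) AS ?count) WHERE {
  ?pid ?anyProp ?someValue .
} GROUP BY ?anyProp ORDER BY DESC(?count)
"""

result = requests.get('https://query.wikidata.org/sparql',
                      params={'query': QUERY, 'format': 'json'}).json()

PROPERTY_NS = 'http://www.wikidata.org/prop/'
for row in result['results']['bindings']:
    uri = row['anyProp']['value']
    if uri.startswith(PROPERTY_NS):
        print(uri, row['count']['value'])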

I intend to use this in our upcoming new class/property browser as well. 
Maybe it would actually make sense to merge the two applications at some 
point (the focus of our tool is classes and their connection to 
properties, as in the existing Miga tool, but a property browser is an 
integral part of this).


Markus


___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata


Re: [Wikidata] Wikidata Propbrowse

2016-02-14 Thread Markus Kroetzsch




--
Markus Kroetzsch
Faculty of Computer Science
Technische Universität Dresden
+49 351 463 38486
http://korrekt.org/

___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata


Re: [Wikidata] Status and ETA External ID conversion

2016-03-07 Thread Markus Kroetzsch

On 07.03.2016 09:13, Lydia Pintscher wrote:

On Mon, Mar 7, 2016 at 2:57 AM Tom Morris <tfmor...@gmail.com
<mailto:tfmor...@gmail.com>> wrote:

On Sun, Mar 6, 2016 at 5:31 PM, Lydia Pintscher
<lydia.pintsc...@wikimedia.de <mailto:lydia.pintsc...@wikimedia.de>>
wrote:

On Sun, Mar 6, 2016 at 10:56 PM Stas Malyshev
<smalys...@wikimedia.org <mailto:smalys...@wikimedia.org>> wrote:

Is there a process somewhere of how the checking is done,
what are
criteria, etc.? I've read
https://www.wikidata.org/wiki/User:Addshore/Identifiers but
there's a
lot of discussion but not clear if it ever come to some end.
Also not
clear what the process is - should I just move a property I
like to
"good to convert"? Should I run it through some checklist
first? Should
I ask somebody?


Yes. Good ones should be moved to good to convert. If no-one
disagrees we'll convert them.


So, no decision criteria? Just whatever we individually like?

What are the rules for "disputed" - is some process for
review planned?


Let's concentrate on the ones people can agree on for now. We'll
tackle the ones that are disputed in the next step. If editors
can't sort it out I will make an executive decision at some
point but I don't think this will be needed.


I think the fact that some obvious good identifiers like IMDb have
been blocked has made potential contributors unsure how to evaluate
other candidates which would also, on the surface, seem obviously good.

Perhaps since the criteria aren't being used, someone could just
delete all the proposed criteria from the page and replace the old
text with something like "Whatever you, personally, think is best"
so that people know what's expected of them? That might help break
the logjam. I know it would make me more comfortable in contributing.



Ok. I think we're making this much more complicated than necessary. The
question you should ask yourself is: Does this identify a concept in
another database/website/...? Nice to have: a website to link to.
Once we have that we can look at corner cases and exceptions.


The community actually already has a class for such properties:

"Wikidata property representing a unique identifier" 
http://www.wikidata.org/entity/Q19847637


In general, the community uses several classes for properties that could 
have been used for UI organisation, rather than introducing new 
datatypes. The current discussion is caused mainly by the fact that 
there is just *one* new datatype, but many types of identifiers based on 
different criteria -- so people argue which one the new datatype should 
represent. The classes used on properties are much less controversial, 
because one just has one for each criterion that people consider 
relevant. For example, there also is


"multi-source external identifier"
http://www.wikidata.org/entity/Q21264328

There are many other classes that could be used in the interface, e.g., 
"Wikidata property for human relationships" 
<http://www.wikidata.org/entity/Q22964231> that one could use very well 
to group properties. One would not need to use all classes to group 
properties: there would be a (short) list that the community would 
decide on. I think this is the best approach to get reasonable property 
groups Reasonator-style into Wikidata at some point. It works much 
better than creating new datatypes for each case, it can build on 
existing data (rather than starting new discussions on datatype 
conversion), and it has the advantage that it can also group properties 
of different types.
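
For example, the members of such a group can already be fetched with one
simple query; a small sketch (it just prints the property IRIs, no labels):

import requests

# All properties that are instances of "Wikidata property representing
# a unique identifier" (Q19847637).
QUERY = """
PREFIX wd:  <http://www.wikidata.org/entity/>
PREFIX wdt: <http://www.wikidata.org/prop/direct/>
SELECT ?prop WHERE { ?prop wdt:P31 wd:Q19847637 . }
"""

result = requests.get('https://query.wikidata.org/sparql',
                      params={'query': QUERY, 'format': 'json'}).json()
for row in result['results']['bindings']:
    print(row['prop']['value'])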


Markus


--
Markus Kroetzsch
Faculty of Computer Science
Technische Universität Dresden
+49 351 463 38486
http://korrekt.org/

___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata


Re: [Wikidata] Status and ETA External ID conversion

2016-03-06 Thread Markus Kroetzsch

On 06.03.2016 22:56, Stas Malyshev wrote:

Hi!


The community is checking each property to verify it should be converted:

https://www.wikidata.org/wiki/User:Addshore/Identifiers/0

https://www.wikidata.org/wiki/User:Addshore/Identifiers/1

https://www.wikidata.org/wiki/User:Addshore/Identifiers/2


Is there a process somewhere of how the checking is done, what are
criteria, etc.? I've read
https://www.wikidata.org/wiki/User:Addshore/Identifiers but there's a
lot of discussion but not clear if it ever come to some end. Also not
clear what the process is - should I just move a property I like to
"good to convert"? Should I run it through some checklist first? Should
I ask somebody?
What are the rules for "disputed" - is some process for review planned?

I think some more definite statement would help, especially to people
willing to contribute.


+1 I have had the same questions.

In your case, however, the answer probably is: you cannot contribute 
there at all, since you are a Wikimedia employee and this is a 
content-related community discussion. ;-)


Best,

Markus






--
Markus Kroetzsch
Faculty of Computer Science
Technische Universität Dresden
+49 351 463 38486
http://korrekt.org/

___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata


Re: [Wikidata] Status and ETA External ID conversion

2016-03-07 Thread Markus Kroetzsch

On 06.03.2016 23:31, Stas Malyshev wrote:

Hi!


In your case, however, the answer probably is: you cannot contribute
there at all, since you are a Wikimedia employee and this is a
content-related community discussion. ;-)


Many WMF employees contribute to wikis in their non-work time, as far as
I know. I don't even seek to participate in the discussion (though I
don't think WMF employment would disqualify me from contributing in
volunteer capacity, given my affiliations - as they are - are clearly
stated) - but only to know the results so I could contribute in editor
capacity, following whatever rules are there.


Yes, sure, your free time is a different matter. I just thought you were 
speaking as a WMF employee here, since you were using this email address. I am 
probably over-sensitive there since I am used to the very strict 
policies of WMDE. They are very careful to keep paid and private 
activities separate by using different accounts.


Markus

--
Markus Kroetzsch
Faculty of Computer Science
Technische Universität Dresden
+49 351 463 38486
http://korrekt.org/

___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata


Re: [Wikidata] SQID: the new "Wikidata classes and properties browser"

2016-04-20 Thread Markus Kroetzsch

On 20.04.2016 21:58, Stas Malyshev wrote:

Hi!


Nice work! I especially like the ability to filter the properties by
usage amount here:
https://tools.wmflabs.org/sqid/#/browse?type=properties This makes it
super easy to find unused or nearly unused properties for example.


Yes! Also some usage that seems strange - e.g., why use P31 or P279 in a
reference?


Planned future work: have links to queries on each property page where 
you can see/browse the actual places of each kind of usage ;-) Should be 
ready soonish.


Regarding spurious uses in qualifiers: these are sometimes caused by 
property example statements on the property pages, like here:


http://localhost/wdcb/#/view?id=P553

But I have no idea about the references.

Markus


--
Markus Kroetzsch
Faculty of Computer Science
Technische Universität Dresden
+49 351 463 38486
http://korrekt.org/

___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata


Re: [Wikidata] SQID: the new "Wikidata classes and properties browser"

2016-04-20 Thread Markus Kroetzsch

On 20.04.2016 22:29, Markus Kroetzsch wrote:

On 20.04.2016 21:58, Stas Malyshev wrote:

Hi!


Nice work! I especially like the ability to filter the properties by
usage amount here:
https://tools.wmflabs.org/sqid/#/browse?type=properties This makes it
super easy to find unused or nearly unused properties for example.


Yes! Also some usage that seems strange - e.g., why use P31 or P279 in a
reference?


Planned future work: have links to queries on each property page where
you can see/browse the actual places of each kind of usage ;-) Should be
ready soonish.

Regarding spurious uses in qualifiers: these are sometimes caused by
property example statements on the property pages, like here:

http://localhost/wdcb/#/view?id=P553


And this is my local copy ... what I meant was:

http://tools.wmflabs.org/sqid/#/view?id=P553



But I have no idea about the references.

Markus





___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata


Re: [Wikidata] SQID: the new "Wikidata classes and properties browser"

2016-04-21 Thread Markus Kroetzsch

On 21.04.2016 22:27, Gerard Meijssen wrote:

Hoi,
A question.. I do understand that proper i18n is essential. But the
localisation could be / should be done at translatewiki,net.


I love translatewiki.net for PHP projects, but can I actually use it to 
do translations on a JavaScript project?


Markus

--
Markus Kroetzsch
Faculty of Computer Science
Technische Universität Dresden
+49 351 463 38486
http://korrekt.org/

___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata


Re: [Wikidata] SQID: the new "Wikidata classes and properties browser"

2016-04-21 Thread Markus Kroetzsch

On 21.04.2016 22:43, Gerard Meijssen wrote:

Hoi,
They not only but also internationalise PHP.


Ok, we need to check this. We use Angular Translate for i18n, and it 
might be that one would first have to develop a new converter for 
translatewiki to use its message files. Could still be worthwhile at 
some point, since we do have quite a lot of messages already.


Markus




On 21 April 2016 at 22:40, Markus Kroetzsch
<markus.kroetz...@tu-dresden.de <mailto:markus.kroetz...@tu-dresden.de>>
wrote:

On 21.04.2016 22:27, Gerard Meijssen wrote:

Hoi,
A question.. I do understand that proper i18n is essential. But the
localisation could be / should be done at translatewiki.net.


I love translatewiki.net <http://translatewiki.net> for PHP
projects, but can I actually use it to do translations on a
JavaScript project?


Markus

--
    Markus Kroetzsch
Faculty of Computer Science
Technische Universität Dresden
+49 351 463 38486 <tel:%2B49%20351%20463%2038486>
http://korrekt.org/

___
Wikidata mailing list
Wikidata@lists.wikimedia.org <mailto:Wikidata@lists.wikimedia.org>
https://lists.wikimedia.org/mailman/listinfo/wikidata




___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata




--
Markus Kroetzsch
Faculty of Computer Science
Technische Universität Dresden
+49 351 463 38486
http://korrekt.org/

___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata


Re: [Wikidata] Multiple properties/identifiers for the same resource

2016-04-27 Thread Markus Kroetzsch

On 27.04.2016 21:13, Sebastian Burgstaller wrote:

Hi everyone,

I am lately facing the following problem: There are many (biomedical)
resources we import data from, which consist of several parts. And for
each of these parts, they use either a different identifier structure,
or they use the same identifier structure but with different accession
URLs. This is valid for very essential resources like ChEMBL (e.g.
compounds, targets, assays), miRNA database, IUPHAR and others

In order to represent and link to these resources properly in Wikidata,
how should we do this? The "easy" way is to just propose properties for
each of these parts of a resource, which also allows to specify the
proper formatter url. But this certainly would create several properties
for the same resource.

The other way would be to specify a set of formatter urls, but this
fails currently anyway, as this has not been implemented (yet). Maybe we
could specify formatter urls on a value basis which could override the
formatter url specified in the property? But I guess this requires
substantial dev time in Wikibase.

What are your thoughts/ideas?


Doing this in Wikidata is tricky and takes time. I don't even see how to 
do it well (note that external tools like Reasonator or SQID would also 
need to implement the same smart resolution mechanism). Having several 
properties for the same thing just because of different ID types used 
does not seem very compelling either.


How about building a little external referrer service that redirects IDs 
to the correct resource based on their structure? This could be a simple 
PHP-based web service hosted on Labs. In the end, the formatter URL is 
just for users to click on, so as long as you end up at the right place, 
this little indirection is maybe no problem.
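
To make the idea a bit more concrete, here is a rough sketch of such a
referrer (in Python rather than PHP, just for illustration; the identifier
patterns and target URLs are invented placeholders, not the real ones):

import re
from http.server import BaseHTTPRequestHandler, HTTPServer

# Placeholder rules: map identifier shapes to target URL templates.
RULES = [
    (re.compile(r'^CHEMBL\d+$'), 'https://example.org/chembl/{id}'),
    (re.compile(r'^MI\d{7}$'),   'https://example.org/mirna/{id}'),
]

class Redirector(BaseHTTPRequestHandler):
    def do_GET(self):
        ext_id = self.path.lstrip('/')
        for pattern, template in RULES:
            if pattern.match(ext_id):
                self.send_response(302)  # redirect to the matching target
                self.send_header('Location', template.format(id=ext_id))
                self.end_headers()
                return
        self.send_error(404, 'no rule matches this identifier')

if __name__ == '__main__':
    HTTPServer(('', 8080), Redirector).serve_forever()

The formatter URL of the Wikidata property would then simply point at this
service (".../$1"), and the service decides where to send the user.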


Cheers,

Markus



Thanks!

Sebastian


___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata




--
Markus Kroetzsch
Faculty of Computer Science
Technische Universität Dresden
+49 351 463 38486
http://korrekt.org/

___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata


Re: [Wikidata] Wikidata ontology

2016-04-30 Thread Markus Kroetzsch

On 01.05.2016 01:34, Jan Macura wrote:

Hi all

I've been using the <http://wikidata.org/ontology#> namespace for
datatype properties for some time (more than a year).
Now I can see everywhere only the <http://wikiba.se/ontology#> ns.
Was there some reason for change? Are these two somehow compatible? Will
the first one be deprecated?


Hi Jan,

We have revised the OWL/RDF encoding as part of the work on the SPARQL 
query service, and the URIs have been changed in the process. Many of 
the same terms can still be found under the new namespace. The new 
namespace reflects that the ontology is the same for any site using this 
software, not just Wikidata.


There have also been further modifications in the RDF export, e.g., in 
relation to how certain values are encoded (e.g., geo coordinates use 
WKT in Wikidata, but used a custom type with planet in our initial RDF 
dumps). Another major extension was that simplified data values (encoded 
as single resources) are now available on every level -- statement, 
qualifier, reference -- and some new properties had to be introduced for 
this. Finally, there are also some changes in the URIs used for various 
RDF properties. All in all, the basic encoding (with statements, 
references, and complex values represented by own RDF resources) is the 
same, but the syntactic details changed quite a bit between our original 
ISWC publication and the launch of the SPARQL service.
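
To give one concrete example of the current encoding (a sketch from
memory; it relies on the prefixes that query.wikidata.org predefines, and
uses Dresden's coordinate location P625): the simplified value is a plain
WKT literal at the statement level, while the full value is a separate
resource carrying the details.

import requests

# Simplified value (ps:) vs. full value node (psv:) for one statement.
QUERY = """
SELECT ?wkt ?lat ?long WHERE {
  wd:Q1731 p:P625 ?statement .
  ?statement ps:P625  ?wkt ;        # simplified value: a WKT literal
             psv:P625 ?valueNode .  # full value as its own resource
  ?valueNode wikibase:geoLatitude  ?lat ;
             wikibase:geoLongitude ?long .
}
"""

result = requests.get('https://query.wikidata.org/sparql',
                      params={'query': QUERY, 'format': 'json'}).json()
print(result['results']['bindings'])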


Cheers,

Markus

--
Markus Kroetzsch
Faculty of Computer Science
Technische Universität Dresden
+49 351 463 38486
http://korrekt.org/

___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata


Re: [Wikidata] Multiple properties/identifiers for the same resource

2016-04-29 Thread Markus Kroetzsch
I tend to agree with Jerven. He is right to say that URIs work best as 
identifiers. However, some things should still be kept in mind:


* The strings we are talking about are in fact IDs and not ambiguous: no 
string id identifies multiple objects.
* The problem is in finding the right web page to refer a user to for 
each ID. URIs are often distinct from the URLs that users would like to 
read. It is even possible that there are already official URIs for some 
of the datasets we were talking about, and that these URIs do not help 
us in finding the right URL either.


In some datasets, the problem might be solved by switching to URIs, but 
this requires a working content negotiation to redirect users when they 
open the URI in their browser. I have some doubts that we can find this 
for the problematic cases, given that they don't even have a simple 
redirection service for finding their URLs.


Moreover, there is the technical problem that the design that has been 
selected for distinguishing external IDs in Wikidata is such that these 
IDs must be of type string.


In a perfect world, Jerven's approach would still be the cleanest, I 
believe, but it might be impractical at the moment.


Cheers,

Markus


On 29.04.2016 15:29, Jerven Tjalling Bolleman wrote:

Could I be so bold to suggest that in Wikidata we should strive
to use external URI's for identifiers not Strings.

For example in Wikidata, there are a lot of UniProt accessions.
e.g. behind the property https://www.wikidata.org/wiki/P352
and there is a formatter for a URL.

I think this is the wrong way round, there should be an URL/URI there
and a formatter to generate a local string for display purposes.

And of course for chembl the URL/URI to use would be

<http://rdf.ebi.ac.uk/resource/chembl/molecule/CHEMBL101690>

There a 2 advantages to this. It allows easier federates queries from
the source databases into wikidata (no URI conversions etc..)
The second is that these URIs are clearly not ambiguous.

Regards,
Jerven

On 28/04/16 23:49, Julie McMurry wrote:

"One should also point out to the authorities maintaining these IDs

that they should spend some effort on producing a workable solution for
this. It seems they should be the first to provide a resolver service
(or maybe it would be an "ID search engine" if it is so complicated).

With the qualifiers in place, Wikidata can also be used to achieve this,
of course, but it seems we are just manually reverse engineering
something that should be done at the site of whoever is controlling the
ID registration."

Well said, Markus. A most hearty agreement here on my side and one
colleagues and I have been trying to raise awareness of for a long time
now (http://bit.ly/id-guidance). One of the challenges is that databases
are already being asked to do more with less. They can see the utility
of such a service to others, but when I've asked DBs before (not naming
names), traction has been limp (I've yet to ask Chembl). Sometimes it
works out though. For instance, KEGG used to have 12 different
type-specific URLs, corresponding to:

kegg.compound
kegg.disease
kegg.drug
kegg.environ
kegg.genes
kegg.genome
kegg.glycan
kegg.metagenome
kegg.module
kegg.orthology
kegg.pathway
kegg.reaction

Thankfully, they've collapsed those to a single URL pattern.

The databases that find it the toughest are not those who simply don't
embed typing, but rather those that don't embed typing AND ALSO have
local identifiers that would otherwise collide. For instance, a
prominent bio database is in this boat (not naming names) and would like
to make things better but it is hard and messy due to the collisions.

FYI 345 of the 560+ records in the identifiers.org
<http://identifiers.org> corpus are type-specific at the level of
identifiers.org <http://identifiers.org>'s namespace; these roll up to
~300 providers.

The question though is what WikiData is trying to accomplish. Say you
encounter the chembl ID CHEMBL308052
<http://linkedchemistry.info/chembl/chemblid/CHEMBL308052> do you need
to retrieve the type of the entity for reasons other than determining
what URL to use?

How are you representing entity labels / IDs to users?

Best,
Julie









___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata






--
Markus Kroetzsch
Faculty of Computer Science
Technische Universität Dresden
+49 351 463 38486
http://korrekt.org/

___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata


Re: [Wikidata] EKAW2016 - Call for Doctoral Consortium Papers

2016-05-05 Thread Markus Kroetzsch
.cs.unibo.it/?q=callforpapers).


== IMPORTANT DATES ==

 - Abstract Submission: September 8th, 2016
 - Full Paper Submission: September 15th
 - Notification: October 6th, 2016
 - Camera-Ready: October 13th
 - Doctoral Consortium: November 19th-20th, 2016


== CHAIRS ==

 - Valentina Presutti (STLab, ISTC-CNR, Italy)
 - Mathieu d’Aquin (KMi, The Open University, UK)


___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata




--
Markus Kroetzsch
Faculty of Computer Science
Technische Universität Dresden
+49 351 463 38486
http://korrekt.org/

___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata


Re: [Wikidata] EKAW2016 - Call for Doctoral Consortium Papers

2016-05-05 Thread Markus Kroetzsch

On 05.05.2016 22:17, Lydia Pintscher wrote:

On Thu, May 5, 2016 at 9:47 PM Markus Kroetzsch
<markus.kroetz...@tu-dresden.de <mailto:markus.kroetz...@tu-dresden.de>>
wrote:

Dear Andrea,

It does not give a good impression if your first and only message to an
email list is a cross-posted advertisement that does not make any effort
to clarify the relationship to the topic of the list. We don't usually
have such calls here (as opposed to other lists, where you have many
messages of this type every day). If you have a relevant message for the
Wikidata community, please take your time to write a personalised email
that clarifies why people should be interested.


Indeed. I am putting people who do this on moderation.


Off-topic: would it make sense to have someone else be in charge of 
this? I know you are highly efficient, but it seems an odd side-job for a 
software product owner to also moderate the community mailing list ;-) 
I'd rather have you focus on development and deployment, where nobody 
else can help out.


Cheers,

Markus


--
Markus Kroetzsch
Faculty of Computer Science
Technische Universität Dresden
+49 351 463 38486
http://korrekt.org/

___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata


[Wikidata] SPARQL service timeouts

2016-04-18 Thread Markus Kroetzsch

Hi,

I have the impression that some not-so-easy SPARQL queries that used to 
run just below the timeout are now timing out regularly. Has there been 
a change in the setup that may have caused this, or are we maybe seeing 
increased query traffic [1]?


Cheers,

Markus


[1] The deadline for the Int. Semantic Web Conf. is coming up, so it 
might be that someone is running experiments on the system to get their 
paper finished. It has been observed for other endpoints that traffic 
increases at such times. This community sometimes is the greatest enemy 
of its own technology ... (I recently had to IP-block an RDF crawler 
from one of my sites after it had ignored robots.txt completely).


--
Markus Kroetzsch
Faculty of Computer Science
Technische Universität Dresden
+49 351 463 38486
http://korrekt.org/

___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata


[Wikidata] SQID: the new "Wikidata classes and properties browser"

2016-04-19 Thread Markus Kroetzsch

Hi all,

As promised a while ago, we have reworked our "Wikidata Classes and 
Properties" browser. I am happy to introduce the first beta version of a 
new app, called SQID:


http://tools.wmflabs.org/sqid/

It is a complete rewrite of the earlier application. Much faster, more 
usable, much more up-to-date information, supported on all reasonable 
browsers, and with tons of new features. Try it yourself, or read on for 
the main functions and current dev plans:



== Browse classes and properties ==

You can use it to find properties and class items by all kinds of 
filtering settings:


http://tools.wmflabs.org/sqid/#/browse?type=properties
http://tools.wmflabs.org/sqid/#/browse?type=classes

New features:
* Sort results by a criterion of choice
* Powerful, easy-to-use filtering interface
* Search properties by label, datatype, qualifiers used, or co-occurring 
properties
* Search classes by label, (indirect) superclass or by properties used 
on instances of the class

* All property statistics and some class statistics are updated every hour

== View Wikidata entities ==

The goal was to have a page for every property and every class, but we 
ended up having a generic data browser that can show all (live) data + 
some additional data for classes and properties (this is our main goal, 
but the other data is often helpful to understand the context). The UI 
is modelled after Reasonator as the quasi-standard of how Wikidata 
should look, but if you look beyond the surface you can see many 
differences in what SQID will (or will not) display.


Examples:
* Dresden, a plain item with a lot of data:
  http://tools.wmflabs.org/sqid/#/view?id=Q1731
* Volcano, a class with many subclasses:
  http://tools.wmflabs.org/sqid/#/view?id=Q8072
* sex or gender, a frequently used property:
  http://tools.wmflabs.org/sqid/#/view?id=P21

Notable features:
* Fast display
* All statement data with all qualifiers shown
* Extra statistical and live query data embedded

== General Wikidata statistics ==

As a minor feature, we also publish statistics on the weekly full 
Wikidata dump (which we process to get some of the statistics):


http://tools.wmflabs.org/sqid/#/status

Don't trust the main page -- find out how many entities there really are ;-)


== Plans, todos, feedback, contributions ==

We are on github, so please make feature requests and bug reports there:

https://github.com/Wikidata/WikidataClassBrowser/issues

Pull requests are welcome too.

Known limitations of the current version:

* Data update still a bit shaky. We refresh most statistical data every 
hour (entity data is live anyway), but you may not see this unless you 
clear your browser cache. This will be easier in the future.

* Entity data browser does not show sitelinks and references yet.
* Incoming properties not shown yet on entity pages
* I18N not complete yet (if you would like to try the current dev 
status, see, e.g.: http://tools.wmflabs.org/sqid/#/view?id=Q318&lang=de)


Moreover, we are also planning to integrate more data displays, better 
live statistics, and editing capabilities. Developers who want to help 
there are welcome. SQID can also be a platform for other data display 
ideas (it's built using AngularJS, so integration is easy).


== And what about Miga? ==

The old Miga-based app at 
http://tools.wmflabs.org/wikidata-exports/miga/ will be retired in due 
course. Please update your links.


== Credits ==

I have had important development support from Markus Damm, Michael 
Günther and Georg Wild. We are funded by the German Research Foundation 
DFG. All of us are at TU Dresden. Complex statistics are computed with 
Wikidata Toolkit. Live query results come from the Wikidata SPARQL Query 
Service.


Enjoy,

Markus

--
Markus Kroetzsch
Faculty of Computer Science
Technische Universität Dresden
+49 351 463 38486
http://korrekt.org/

___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata


Re: [Wikidata] SPARQL service timeouts

2016-04-18 Thread Markus Kroetzsch

On 18.04.2016 21:56, Markus Kroetzsch wrote:

Thanks, the dashboard is interesting.

I am trying to run this query:

SELECT ?subC ?supC WHERE { ?subC p:P279/ps:P279 ?supC }

It is supposed to return a large result set. But I am only running it
once per week. It used to work fine, but today I could not get it to
succeed a single time.


Actually, the query seems to work as it should. I am investigating why I 
get an error in some cases on my machine.


Markus



On 18.04.2016 21:40, Stas Malyshev wrote:

Hi!


I have the impression that some not-so-easy SPARQL queries that used to
run just below the timeout are now timing out regularly. Has there been
a change in the setup that may have caused this, or are we maybe seeing
increased query traffic [1]?


We've recently run on a single server for couple of days due to
reloading of the second one, so this may have made it a bit slower. But
that should be gone now, we're back to two. Other than that, not seeing
anything abnormal in
https://grafana.wikimedia.org/dashboard/db/wikidata-query-service


[1] The deadline for the Int. Semantic Web Conf. is coming up, so it
might be that someone is running experiments on the system to get their
paper finished. It has been observed for other endpoints that traffic
increases at such times. This community sometimes is the greatest enemy
of its own technology ... (I recently had to IP-block an RDF crawler
from one of my sites after it had ignored robots.txt completely).


We don't have any blocks or throttle mechanisms right now. But if we see
somebody making serious negative impact on the service, we may have to
change that.







--
Markus Kroetzsch
Faculty of Computer Science
Technische Universität Dresden
+49 351 463 38486
http://korrekt.org/

___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata


Re: [Wikidata] SPARQL service timeouts

2016-04-18 Thread Markus Kroetzsch

On 18.04.2016 22:21, Markus Kroetzsch wrote:

On 18.04.2016 21:56, Markus Kroetzsch wrote:

Thanks, the dashboard is interesting.

I am trying to run this query:

SELECT ?subC ?supC WHERE { ?subC p:P279/ps:P279 ?supC }

It is supposed to return a large result set. But I am only running it
once per week. It used to work fine, but today I could not get it to
succeed a single time.


Actually, the query seems to work as it should. I am investigating why I
get an error in some cases on my machine.


Ok, I found that this is not so easy to reproduce reliably. The symptom 
I am seeing is a truncated JSON response, which just stops in the middle 
of the data (at a random location, but usually early on), and which is 
*not* followed by any error message. The stream just ends.


So far, I could only get this in Java, not in Python, and it does not 
always happen. If successful, the result is about 250M in size. The 
following Python script can retrieve it:


import requests
SPARQL_SERVICE_URL = 'https://query.wikidata.org/sparql'
query = """SELECT ?subC ?supC WHERE { ?subC p:P279/ps:P279 ?supC }"""
print requests.get(SPARQL_SERVICE_URL, params={'query': query, 'format': 
'json'}).text


(output should be redirected to a file)

I will keep an eye on the issue, but I don't know how to debug this any 
further now, since it started to work without me changing any code.


I also wonder how to read the dashboard after all. In spite of me 
repeating an experiment that creates a 250M result file five times 
in the past few minutes, the "Bytes out" figure remains below a few MB 
for most of the time.


Markus




On 18.04.2016 21:40, Stas Malyshev wrote:

Hi!


I have the impression that some not-so-easy SPARQL queries that used to
run just below the timeout are now timing out regularly. Has there been
a change in the setup that may have caused this, or are we maybe seeing
increased query traffic [1]?


We've recently run on a single server for couple of days due to
reloading of the second one, so this may have made it a bit slower. But
that should be gone now, we're back to two. Other than that, not seeing
anything abnormal in
https://grafana.wikimedia.org/dashboard/db/wikidata-query-service


[1] The deadline for the Int. Semantic Web Conf. is coming up, so it
might be that someone is running experiments on the system to get their
paper finished. It has been observed for other endpoints that traffic
increases at such times. This community sometimes is the greatest enemy
of its own technology ... (I recently had to IP-block an RDF crawler
from one of my sites after it had ignored robots.txt completely).


We don't have any blocks or throttle mechanisms right now. But if we see
somebody making serious negative impact on the service, we may have to
change that.










--
Markus Kroetzsch
Faculty of Computer Science
Technische Universität Dresden
+49 351 463 38486
http://korrekt.org/

___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata


Re: [Wikidata] SQID: the new "Wikidata classes and properties browser"

2016-04-21 Thread Markus Kroetzsch

On 21.04.2016 10:22, Neubert, Joachim wrote:

Hi Markus,

Great work!

One short question: Is there a way to switch to labels and descriptions in 
another language, by URL or otherwise?


This feature is not fully implemented yet, but there is partial support 
that can be tried out already. To use it, add the "lang=" 
parameter to the view URLs, like so:


http://tools.wmflabs.org/sqid/#/view?id=Q1339&lang=de

Once you have done this, all links will retain this setting, so you can 
happily browse along.


Limitations:
* It only works for the entity view perspective, not for the property 
and class browser
* All labels will be translated to any language (if available), but the 
application interface right now is only available in English and German


The feature will be extended further in the future, including support 
for user-provided interface translations.


Markus



Cheers, Joachim


-Original Message-
From: Wikidata [mailto:wikidata-boun...@lists.wikimedia.org] On behalf of
Markus Kroetzsch
Sent: Tuesday, 19 April 2016 21:45
To: Discussion list for the Wikidata project.
Subject: [Wikidata] SQID: the new "Wikidata classes and properties browser"

Hi all,

As promised a while ago, we have reworked our "Wikidata Classes and
Properties" browser. I am happy to introduce the first beta version of a new
app, called SQID:

http://tools.wmflabs.org/sqid/

It is a complete rewrite of the earlier application. Much faster, more usable,
much more up-to-date information, supported on all reasonable browsers, and
with tons of new features. Try it yourself, or read on for the main functions 
and
current dev plans:


== Browse classes and properties ==

You can use it to find properties and class items by all kinds of filtering 
settings:

http://tools.wmflabs.org/sqid/#/browse?type=properties
http://tools.wmflabs.org/sqid/#/browse?type=classes

New features:
* Sort results by a criterion of choice
* Powerful, easy-to-use filtering interface
* Search properties by label, datatype, qualifiers used, or co-occurring
properties
* Search classes by label, (indirect) superclass or by properties used on
instances of the class
* All property statistics and some class statistics are updated every hour

== View Wikidata entities ==

The goal was to have a page for every property and every class, but we ended
up having a generic data browser that can show all (live) data + some
additional data for classes and properties (this is our main goal, but the other
data is often helpful to understand the context). The UI is modelled after
Reasonator as the quasi-standard of how Wikidata should look, but if you look
beyond the surface you can see many differences in what SQID will (or will not)
display.

Examples:
* Dresden, a plain item with a lot of data:
http://tools.wmflabs.org/sqid/#/view?id=Q1731
* Volcano, a class with many subclasses:
http://tools.wmflabs.org/sqid/#/view?id=Q8072
* sex or gender, a frequently used property:
http://tools.wmflabs.org/sqid/#/view?id=P21

Notable features:
* Fast display
* All statement data with all qualifiers shown
* Extra statistical and live query data embedded

== General Wikidata statistics ==

As a minor feature, we also publish statistics on the weekly full Wikidata dump
(which we process to get some of the statistics):

http://tools.wmflabs.org/sqid/#/status

Don't trust the main page -- find out how many entities there really are ;-)


== Plans, todos, feedback, contributions ==

We are on github, so please make feature requests and bug reports there:

https://github.com/Wikidata/WikidataClassBrowser/issues

Pull requests are welcome too.

Known limitations of the current version:

* Data update still a bit shaky. We refresh most statistical data every hour
(entity data is live anyway), but you may not see this unless you clear your
browser cache. This will be easier in the future.
* Entity data browser does not show sitelinks and references yet.
* Incoming properties not shown yet on entity pages
* I18N not complete yet (if you would like to try the current dev status, see,
e.g.: http://tools.wmflabs.org/sqid/#/view?id=Q318&lang=de)

Moreover, we are also planning to integrate more data displays, better live
statistics, and editing capabilities. Developers who want to help there are
welcome. SQID can also be a platform for other data display ideas (it's built
using AngularJS, so integration is easy).

== And what about Miga? ==

The old Miga-based app at
http://tools.wmflabs.org/wikidata-exports/miga/ will be retired in due course.
Please update your links.

== Credits ==

I have had important development support from Markus Damm, Michael
Günther and Georg Wild. We are funded by the German Research Foundation
DFG. All of us are at TU Dresden. Complex statistics are computed with
Wikidata Toolkit. Live query results come from the Wikidata SPARQL Query
Service.

Enjoy,

Markus

--
Markus Kroetzsch
Faculty of Computer Science
Tech

Re: [Wikidata] Grammatical display of units

2016-07-28 Thread Markus Kroetzsch

Hi Stas,

Good point. Could we not just have a monolingual text string property 
that gives the preferred writing of the unit when used after a number? I 
don't think the plural/singular issue is very problematic, since you 
would have plural almost everywhere, even for "1.0 metres". So maybe we 
just need one alternative label for most languages? Or are there 
languages with more complex grammar rules for units?
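
To sketch what I have in mind on the display side (the extra label and
all data below are invented for illustration; "after_number" stands for
the proposed monolingual string):

# Display code could fall back to the "after a number" form if present.
UNIT_LABELS = {
    ('Q11573', 'en'): {'label': 'metre', 'after_number': 'metres'},
    ('Q11573', 'de'): {'label': 'Meter', 'after_number': 'Meter'},
}

def format_quantity(amount, unit_id, lang):
    entry = UNIT_LABELS.get((unit_id, lang), {})
    unit = entry.get('after_number') or entry.get('label', unit_id)
    return '{} {}'.format(amount, unit)

print(format_quantity('1.0', 'Q11573', 'en'))  # "1.0 metres"
print(format_quantity('185', 'Q11573', 'de'))  # "185 Meter"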


Best regards,

Markus


On 27.07.2016 21:18, Stas Malyshev wrote:

Hi!

Right now, quantities with units are displayed by attaching unit name to
the number. While it gives the idea of what is going on, it is somewhat
ungrammatical in English (83 kilogram, 185 centimetre, etc.) [1] and in
other languages - i.e. in Russian it's 83 килограмм, 185 сантиметр -
instead of the correct "83 килограмма", "185 сантиметров". For some
units, the norms are kind of tricky and fluid (e.g. see [2]), and they
are not even identical across all units in the same language, but the
common theme is that there are grammatical rules on how to do it and
we're ignoring them right now.

I think we do have some means to grammatically display numbers - for
example, number of references is displayed correctly in English and
Russian. As I understand, it is done by using certain formats in message
strings, and these formats are supported in the code in Language
classes. So, I wonder if we should maybe have an (optional) property
that defines the same format for units? We could then reuse the same
code to display units in proper grammatical way.

Alternatively, we could use short units display [3] - i.e. cm instead of
centimetre - and then plurals are not required. However, this relies on
units having short names, and for some units short names can be rather
obscure, and maybe in some language short names need grammatical forms
too. Given that we do not link unit names, it would be rather confusing
(btw, why don't we?). Some units may not have short forms at all.

And the short names do not exactly match the languages - rather, they
usually match the script (i.e. Cyrillic, or Latin, or Hebrew) - and we
may not even have data on which language uses which script, in a useful
form. So using short forms is very tricky.

Any other ideas on this topic? Do we have a ticket tracking this
somewhere? I looked but couldn't find it.

[1]
http://english.stackexchange.com/questions/22082/are-units-in-english-singular-or-plural
[2]
https://ru.wikipedia.org/wiki/%D0%9E%D0%B1%D1%81%D1%83%D0%B6%D0%B4%D0%B5%D0%BD%D0%B8%D0%B5_%D0%92%D0%B8%D0%BA%D0%B8%D0%BF%D0%B5%D0%B4%D0%B8%D0%B8:%D0%9E%D1%84%D0%BE%D1%80%D0%BC%D0%BB%D0%B5%D0%BD%D0%B8%D0%B5_%D1%81%D1%82%D0%B0%D1%82%D0%B5%D0%B9#.D0.A1.D0.BA.D0.BB.D0.BE.D0.BD.D0.B5.D0.BD.D0.B8.D0.B5_.D0.B5.D0.B4.D0.B8.D0.BD.D0.B8.D1.86_.D0.B8.D0.B7.D0.BC.D0.B5.D1.80.D0.B5.D0.BD.D0.B8.D1.8F
[3] https://phabricator.wikimedia.org/T86528




___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata


Re: [Wikidata] An attribute for "famous person"

2016-08-02 Thread Markus Kroetzsch

On 02.08.2016 13:11, Ghislain ATEMEZING wrote:

Thanks Yuri. I will try to define a kind of metric for those having a
number of wikipedia entries. For example, a person with 127 entries
would be "famous" while another with just 10 is not "famous"...


Side remark @Stas: it could be very helpful to have the number of 
Wikimedia project articles stored as a numeric value for a new property 
in RDF. Doing a SPARQL query that computes this number and does 
something with it afterwards almost always times out. The number could 
be very useful as a heuristic "popularity" measure that can also help to 
give the most "important" items first in a number of queries.
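
To make this concrete, the kind of aggregate I mean looks roughly as follows (a 
sketch for the "famous French people" example from this thread; counting 
sitelinked pages via schema:about is what tends to time out at the moment):

SELECT ?person (COUNT(?article) AS ?sitelinks) WHERE {
  ?person wdt:P31 wd:Q5 ;          # instance of: human
          wdt:P27 wd:Q142 .        # country of citizenship: France
  ?article schema:about ?person .  # one row per sitelinked Wikimedia page
}
GROUP BY ?person
ORDER BY DESC(?sitelinks)
LIMIT 100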


Best,

Markus




On Tue, 2 Aug 2016 at 12:52, Yuri Astrakhan
(>) wrote:

Any person in wikidata is "famous" - otherwise they wouldn't be
notable and therefore wouldn't be there))
If you prefer the stricter notability requirement (as used by
Wikipedia), search only for those that have a Wikipedia page


On Aug 2, 2016 1:44 PM, "Ghislain ATEMEZING"
>
wrote:

Ahoy,
I am curious to know if there is a way to know that a given
person is "famous" in Wikidata. I want for example to retrieve
"all famous French people born after a given date".

Thanks in advance for your help.

Best,
Ghislain
--
---
"Love all, trust a few, do wrong to none" (W. Shakespeare)
Web: http://atemezing.org

___
Wikidata mailing list
Wikidata@lists.wikimedia.org 
https://lists.wikimedia.org/mailman/listinfo/wikidata

___
Wikidata mailing list
Wikidata@lists.wikimedia.org 
https://lists.wikimedia.org/mailman/listinfo/wikidata

--
---
"Love all, trust a few, do wrong to none" (W. Shakespeare)
Web: http://atemezing.org


___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata




___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata


Re: [Wikidata] An attribute for "famous person"

2016-08-02 Thread Markus Kroetzsch

On 02.08.2016 22:28, Yuri Astrakhan wrote:

Is there a way we could have more than just the number of language
links? Eg number of incoming links from other wikipedia pages?


One could have other data added to the store, but this may be more work 
depending on what you want. You ask about links from "wikipedia pages". 
If you really mean this (and not Wikidata items), then this would be a 
lot of work to do since one would have to update RDF when (any) 
Wikipedia page changes. I guess we do not have infrastructure for doing 
this in a live update mode. Also note that the number of these links is 
different in each language, so one would have to store many numbers. 
Overall, this link count would really be (meta)data about Wikipedia 
pages and their relations, and not so much about Wikidata. I think you 
could get such Wikipedia-specific data from DBpedia, but I am not sure 
how well their live endpoint keeps track of this data (since it is 
tricky). Maybe an offline solution that combines RDF dumps is the most 
practical approach for now if you really need this data.


Even storing the number of incoming links (properties) from other 
Wikidata items would actually be tricky. Currently, the RDF data about 
each item only depends on the content of this item's Wikidata page. The 
number of inlinks depends on other Wikidata pages, and therefore it is 
much more work to keep it up to date when there are edits.


Markus





On Aug 2, 2016 10:41 PM, "Markus Kroetzsch"
<markus.kroetz...@tu-dresden.de <mailto:markus.kroetz...@tu-dresden.de>>
wrote:

On 02.08.2016 20:59, Daniel Kinzler wrote:

    On 02.08.2016 20:19, Markus Kroetzsch wrote:

Oh, there is a little misunderstanding here. I have not
suggested to create a
property "number of sitelinks in this document". What I
propose instead is to
create a property "number of sitelinks for the document
associated with this
entity". The domain of this suggested property is entity.
The advantage of this
proposal over the thing that you understood is that it makes
queries much
simpler, since you usually want to sort items by this value,
not documents. One
could also have a property for number of sitelinks per
document, but I don't
think it has such a clear use case.


"number of sitelinks for the document associated with this
entity" strikes me as
semantically odd, which was the point of my earlier mail. I'd
much rather have
"number of sitelinks in this document". You are right that the
primary use would
be to "rank" items, and that it would be more convenient to have
the count
associated directly with the item (the entity), but I fear it
will lead to a
blurring of the line between information about the entity, and
information about
the document. That is already a common point of confusion, and
I'd rather keep
that separation very clear. I also don't think that one level of
indirection
would be horribly complicated.

To me it's just natural to include the sitelink info on the same
level as we
provide a timestamp or revision id: for the document.


I just proposed the simple and straightforward way to solve the
practical problem at hand. It leads to shorter, more readable
queries that execute faster. (I don't claim originality for this; it
is the obvious solution to the problem and most people would arrive
at exactly the same conclusion).

Your concern is based on the assumption that there is some kind of
psychological effect that a particular RDF encoding would have on
users. I don't think that there is any such effect. Our users will
not confuse the city of Paris with an RDF document just because of
some data in the RDF store.

Markus

--
Prof. Dr. Markus Kroetzsch
Knowledge-Based Systems Group
Faculty of Computer Science
TU Dresden
+49 351 463 38486 <tel:%2B49%20351%20463%2038486>
https://iccl.inf.tu-dresden.de/web/KBS/en

___
Wikidata mailing list
Wikidata@lists.wikimedia.org <mailto:Wikidata@lists.wikimedia.org>
https://lists.wikimedia.org/mailman/listinfo/wikidata



___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata




___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata


Re: [Wikidata] An attribute for "famous person"

2016-08-02 Thread Markus Kroetzsch

On 03.08.2016 02:49, Stas Malyshev wrote:

Hi!


Oh, there is a little misunderstanding here. I have not suggested to
create a property "number of sitelinks in this document". What I propose
instead is to create a property "number of sitelinks for the document
associated with this entity". The domain of this suggested property is


I think this is covered by https://phabricator.wikimedia.org/T129046 -
which seeks to add page props (which already have sitelinks count I
think but we can define any that we want) to RDF. I kind of neglected it
due to the lack of demand, but it should not be that hard to do.



If you think it is best to implement a more general feature that adds 
even more properties, then I am sure nobody will complain, but it sounds 
like more work to me. The number I was asking for is something that you 
can easily compute from the data that you process already. You can also 
compute the number in a SPARQL query from the RDF. It is a completely 
redundant piece of information. Its only purpose is to make SPARQL 
queries that currently time out fast. In databases, such things are 
called "materialized views".
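
For comparison, with such a materialized number in the store, the query from 
this thread would become a simple lookup. A sketch, assuming a hypothetical 
predicate (the name wikibase:sitelinkCount is made up here; whatever the 
converter would actually emit works just as well):

SELECT ?person ?count WHERE {
  ?person wdt:P31 wd:Q5 ;
          wdt:P27 wd:Q142 ;
          wikibase:sitelinkCount ?count .   # hypothetical materialized value
}
ORDER BY DESC(?count)
LIMIT 100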


This leads to a slightly different perspective than the one you'd have 
in T129046. By adding page props, you want to add "new" information from 
another source, and questions like data modelling etc. come to the fore. 
With a materialized view, you just add some query results back to the 
database for technical reasons that are specific to the database. The 
two motivations might lead to different requirements at some point 
(e.g., if you want to add another materialized query result to the RDF 
you may have to extend page props, which involves more dependencies than 
if you just extend the RDF converter).


Markus



___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata


Re: [Wikidata] An attribute for "famous person"

2016-08-02 Thread Markus Kroetzsch

On 03.08.2016 02:51, Stas Malyshev wrote:

Hi!


Is there a way we could have more than just the number of language
links? Eg number of incoming links from other wikipedia pages?


If we implement T129046 we can have any page props we want to :)
Of course, adding them for the whole DB would require new dump and
either data reload or some manual work... which can be solved too with
some effort. I wonder if Wikidata Toolkit has tools to do filtering like
this (i.e. "get me specific triple from all entities in this dump"). If
not, it probably should have ;)


Are you asking if WDTK can make RDF exports that contain only some parts 
of the RDF data? Yes, this is possible. We did not put too much effort 
in the RDF export though since WMF has reimplemented this anyway. 
Probably would need some de-dusting.


If you were asking about triple filtering on RDF data that you already 
have (not "filtered generation" of new RDF), then this is not something 
that WDTK aims at (since WDTK does not read RDF data in the first 
place). However, you can often achieve this with grep if the RDF data is 
in ntriples format.


Markus



___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata


Re: [Wikidata] Render sparql queries using the Histropedia timeline engine

2016-08-11 Thread Markus Kroetzsch

On 11.08.2016 13:40, Daniel Kinzler wrote:

Hi Navino!

Thank you for your awesome work!

Since this has caused some confusion again recently, I want to caution you about
a major gotcha regarding dates in RDF and JSON: they use different conventions
to represent years BCE. I just updated our JSON spec to reflect that reality,
see .

There is a lot of confusion about this issue throughout the linked data web,
since the convention changed between XSD 1.0 (which uses -0044 to represent 44
BCE, and -0001 to represent 1 BCE) and XSD 1.1 (which uses -0043 to represent 44
BCE, and +0000 to represent 1 BCE). Our JSON uses the traditional numbering (1
BCE is -0001), while RDF uses the astronomical numbering (1 BCE is +0000).


Is this still true? We have discussed this at length over a year ago [1] 
and there is really not much complication or "fun" about this. It is 
actually quite simple: the whole world has agreed on using +0000 to mean 
1 BCE in technical contexts. It's just nicer to calculate with.


In particular, the JSON export is at odds with JavaScript itself (!), 
which also treats year +0000 as 1 BCE, of course:


http://www.ecma-international.org/ecma-262/6.0/#sec-extended-years

Besides JavaScript, the exact same convention is used by ISO 8601, XML 
Schema, RDF, SPARQL, and in other programming languages that support BCE 
dates, such as Java (see SimpleDateFormat class).
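
A concrete example of how this already looks in our RDF/SPARQL (Q1048/P570, 
Julius Caesar's date of death, 15 March 44 BCE, serves as an arbitrary test case):

SELECT ?dod WHERE { wd:Q1048 wdt:P570 ?dod }
# expected value: "-0043-03-15T00:00:00Z"^^xsd:dateTime, i.e. year -0043 denotes 44 BCE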


Can we file a bug against the current JSON export to have this fixed? It 
would be very good if our JSON would agree with ISO, W3C, JavaScript, 
our own RDF export, and all astronomers ;-)


This would affect users of BCE dates, such as Histropedia, so it would 
be good if any users of such dates could comment on what they prefer.


Cheers,

Markus

[1] https://phabricator.wikimedia.org/T94064


___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata


Re: [Wikidata] (Ab)use of "deprecated"

2016-08-11 Thread Markus Kroetzsch

On 11.08.2016 18:45, Andra Waagmeester wrote:

On Thu, Aug 11, 2016 at 4:15 PM, Markus Kroetzsch
<markus.kroetz...@tu-dresden.de <mailto:markus.kroetz...@tu-dresden.de>>
wrote:


has a statement "population: 20,086 (point in time: 2011)" that is
confirmed by a reference. Nevertheless, the statement is marked as
"deprecated". This would mean that the statement "the population was
20,086 in 2011" is wrong. As far as I can tell, this is not the case.


I wouldn't say that with a deprecated rank, that statement is "wrong". I
consider the term deprecated to indicate that a given statement is no
longer valid in the context of a given resource (reference). I agree, in
this specific case the use of the deprecated rank is wrong, since no
references are given to that specific statement.
Nevertheless, I think it is possible to have disagreeing resources on an
identical statement, where two identical statements exists, one with
rank "deprecated" and one with rank "normal". It is up to the user to
decide which source s/he trusts.


The status "deprecated" is part of the claim of the statement. The 
reference is supposed to support this claim, which in this case is also 
the claim that it is deprecated. The status is not meant to deprecate a 
reference (not saying that this is never useful, potentially, but you 
can only use it in one way, and it seems much more practical if 
deprecated statements get references that explain why they are deprecated).






It seems that somebody wanted to indicate that this old population
is no longer current. This is achieved not by deprecating the old
value, but by setting another (newer) value as "preferred".


I would argue that this is better done by using qualifiers (e.g. start
date, end date). If a statement on the population size would be set to
preferred, but isn't monitored for quite some time, it can be difficult
to see if the "preferred" statement is still accurate, whereas a
qualifier would give a better indication that that stament might need an
update.


Sure, there should always be qualifiers as needed, and we already have 
qualifiers like start and end date in most cases. However, one should 
still set the "best" statement(s) to be preferred as a help for users of 
the data. When you use the data in queries or in Lua, it would be very hard 
to analyse all statements' qualifiers to find out which one is currently 
the best. The "preferred" rank gives a simple shortcut there. In SPARQL, 
for example, the best ranked statements will be used in the simplified 
"direct" properties in the wdt: namespace. Users who want to get all the 
details can still use the qualifiers, but this leads to more complicated 
queries.
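
To illustrate the difference with the Kleinmachnow example from this thread 
(P1082 is population, P585 the "point in time" qualifier):

# Only the best-ranked ("truthy") value:
SELECT ?population WHERE { wd:Q104192 wdt:P1082 ?population }

# All statements, with their rank and "point in time" qualifier:
SELECT ?population ?time ?rank WHERE {
  wd:Q104192 p:P1082 ?st .
  ?st ps:P1082 ?population ;
      wikibase:rank ?rank .
  OPTIONAL { ?st pq:P585 ?time . }
}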


Best regards,

Markus



Cheers,

Andra



___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata




___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata


Re: [Wikidata] Breaking change in JSON serialization?

2016-08-11 Thread Markus Kroetzsch

Dear all,

There have been some interesting discussions about breaking changes 
here, but before we continue in this direction, let me repeat that I did 
not start this thread to define what is a "breaking change" in JSON. 
There are JSON libraries that define this in a strict way (siding with 
Peter) and browsers that are more tolerant (siding with Daniel). I don't 
think we can come to definite conclusions here. Format versioning, as 
Stas suggests, can't be a bad thing.


However, all I was asking for was to get a little email when JSON is 
changed. It is not necessary to discuss if this is really necessary due 
to some higher principle. Even if my software tolerates the change, I 
should *always* know about new information being available. It is 
usually there for a purpose, so my software should do better than "not 
breaking".


Lydia has already confirmed early on that suitable notification emails 
should be sent in the future, so I don't see a need to continue this 
particular discussion. Daniel's position seemed to be a mix of "I told 
you so" and "you volunteers should write better code", which is of 
little help to me or my users. It would be good to rethink how to 
approach the community in such cases, to make sure that a coherent and 
welcoming message is sent to contributors. (On that note, all the best 
on your new job, Léa! -- communicating with this crowd can be a 
challenge at times ;-).


Markus


On 11.08.2016 22:35, Stas Malyshev wrote:

Hi!


My view is that this tool should be extremely cautious when it sees new data
structures or fields.  The tool should certainly not continue to output
facts without some indication that something is suspect, and preferably
should refuse to produce output under these circumstances.


I don't think I agree. I find tools that are too picky about details
that are not important to me hard to use, and I'd very much prefer a
tool where I am in control of which information I need and which I don't
need.


What can happen if the tool instead continues to operate without complaint
when new data structures are seen?  Consider what would happen if the tool
was written for a version of Wikidata that didn't have rank, i.e., claim
objects did not have a rank name/value pair.  If ranks were then added,
consumers of the output of the tool would have no way of distinguishing
deprecated information from other information.


Ranks are a bit unusual because ranks are not just informational change,
it's a semantic change. It introduces a concept of a statement that has
different semantics than the rest. Of course, such change needs to be
communicated - it's like I would make format change "each string
beginning with letter X needs to be read backwards" but didn't tell the
clients. Of course this is a breaking change if it changes semantics.

What I was talking are changes that don't break semantics, and majority
of additions are just that.


Of course this is an extreme case.  Most changes to the Wikidata JSON dump
format will not cause such severe problems.  However, given the current
situation with how the Wikidata JSON dump format can change, the tool cannot
determine whether any particular change will affect the meaning of what it
produces.  Under these circumstances it is dangerous for a tool that
extracts information from the Wikidata JSON dump to continue to produce
output when it sees new data structures.


The tool can not. It's not possible to write a tool that would derive
semantics just from JSON dump, or even detect semantic changes. Semantic
changes can be anywhere, it doesn't have to be additional field - it can
be in the form of changing the meaning of the field, or format, or
datatype, etc. Of course the tool can not know that - people should know
that and communicate it. Again, that's why I think we need to
distinguish changes that break semantics and changes that don't, and
make the tools robust against the latter - but not the former because
it's impossible. For dealing with the former, there is a known and
widely used solution - format versioning.


This does make consuming tools sensitive to changes to the Wikidata JSON
dump format that are "non-breaking".  To overcome this problem there should
be a way for tools to distinguish changes to the Wikidata JSON dump format
that do not change the meaning of existing constructs in the dump from those
that can.  Consuming tools can then continue to function without problems
for the former kind of change.


As I said, format versioning. Maybe even semver or some suitable
modification of it. RDF exports BTW already carry version. Maybe JSON
exports should too.




___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata


[Wikidata] (Ab)use of "deprecated"

2016-08-11 Thread Markus Kroetzsch

Dear all,

As you may know, statements in Wikidata can be marked as "preferred" or 
"deprecated" to distinguish them from the "normal" ones.


I found that many items have perfectly valid historical statements 
marked as "deprecated". For example, our showcase item "Kleinmachnow"


https://www.wikidata.org/wiki/Q104192

has a statement "population: 20,086 (point in time: 2011)" that is 
confirmed by a reference. Nevertheless, the statement is marked as 
"deprecated". This would mean that the statement "the popluation was 
20,086 in 2011" is wrong. As far as I can tell, this is not the case.


It seems that somebody wanted to indicate that this old population is no 
longer current. This is achieved not by deprecating the old value, but 
by setting another (newer) value as "preferred".


Similar problems occur for the mayor of this town.

I hope there is no deeper confusion in the community regarding this 
intended use of "preferred" and "deprecated". Most other items are using 
it correctly, it seems. The fact that it occurs in a showcase item is 
still making me a bit concerned.


Cheers,

Markus



--
Prof. Dr. Markus Kroetzsch
Knowledge-Based Systems Group
Faculty of Computer Science
TU Dresden
+49 351 463 38486
https://iccl.inf.tu-dresden.de/web/KBS/en

___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata


Re: [Wikidata] Info box proposal

2016-08-03 Thread Markus Kroetzsch
Mh. Is this actually leading anywhere? I can see both views, but there 
is a danger that things are getting non-constructive here. A particular 
issue in my view is playing the "Wikipedia-vs-Wikidata" card. I don't 
see things in this way, and I hope most Wikipedia and Wikidata editors 
don't either.


Of course there are different interfaces and different pitfalls for each 
system. Let's face it: both are far from perfect when it comes to UI. 
People use them because they are extremely important projects, in spite 
-- not because -- of the UIs. I have also read about missing 
documentation on how to do things. Again, I don't think either project 
really shines here. There often is documentation if you know where to 
look, but if you just come by the page and want to work, it is very 
difficult to find it. Things could be much better.


Therefore, any approach that looks only at current editors (who already 
have made a lot of effort to wrap their heads around one of the 
not-always-intuitive processes and interfaces) is necessarily too 
limited. Their tolerance to the "other" UI will be as low as anybody's 
(ask someone on the street how nice they find either template editing or 
Wikidata input forms -- you'll get similar views). At the same time, 
current users often have a kind of Stockholm syndrome towards the UI 
they are used to. We have to take their views very serious, but we must 
not build our sites only for the people who already use them now.


The question therefore is not at all which of the current UIs is better, 
but rather how both can be improved. For this list, this mainly leads to 
the question how Wikidata can be improved. The practical insights 
gathered with different editor groups around the world are useful here. 
The findings need to be split into small, actionable units and 
prioritized. Then they will be fixed.


For this to work, it is completely irrelevant if more people like one UI 
or more people like the other. Since the UIs are doing completely 
different things, we won't be able to replace one by the other anyway. 
All we can do is to improve on our side. For this reason, any 
"vs"-themed discussion can only be harmful, attracting trolls who love 
to chime in whenever there is critique, and frustrate contributors who 
would rather like to get things done than to argue.


As for the (little) project that started this discussion, I think it 
should not be overrated in its scope. If people don't find the current 
UI usable enough, they will not switch to use it until we have our 
processes improved. But having other pieces of the puzzle in place will 
increase the pressure on Wikidata to fix remaining pain points, and 
possibly do exactly what Erika is asking for: make the voice of current 
Wikipedia editors (even) more relevant to ongoing Wikidata development.


Peace,

Markus


On 03.08.2016 19:24, Federico Leva (Nemo) wrote:

Brill Lyle, 03/08/2016 19:20:

I am not saying editing Wiki Markup on Wikidata. Is that what you are
describing?


No.

Nemo

___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata



___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata


Re: [Wikidata] Breaking change in JSON serialization?

2016-08-04 Thread Markus Kroetzsch

On 04.08.2016 11:45, Lydia Pintscher wrote:

On Thu, Aug 4, 2016 at 9:27 AM, Markus Kroetzsch
<markus.kroetz...@tu-dresden.de> wrote:

Hi,

It seems that some changes have been made to the JSON serialization
recently:

https://github.com/Wikidata/Wikidata-Toolkit/issues/237

Could somebody from the dev team please comment on this? Is this going to be
in the dumps as well or just in the API? Are further changes coming up? Are
we ever going to get email notifications of API changes implemented by the
team rather than having to fix the damage after they happened?


Hey Markus,

Sorry. You are right in that I should have announced the addition. It
slipped through. As we've said before we don't consider adding fields
a breaking change. Nonetheless I should have announced it.

For your particular usecase it seems the addition is actually useful
because the entity id no longer needs to be created from the entity
type and ID field.
https://stackoverflow.com/questions/5455014/ignoring-new-fields-on-json-objects-using-jackson
might also be helpful.


Well, I know how to fix it. I just need a week or so of time to 
implement and release the fix (not because it takes so long, but because 
I need to find a slot of time to do it).


Markus


___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata


[Wikidata] Breaking change in JSON serialization?

2016-08-04 Thread Markus Kroetzsch

Hi,

It seems that some changes have been made to the JSON serialization 
recently:


https://github.com/Wikidata/Wikidata-Toolkit/issues/237

Could somebody from the dev team please comment on this? Is this going 
to be in the dumps as well or just in the API? Are further changes 
coming up? Are we ever going to get email notifications of API changes 
implemented by the team rather than having to fix the damage after they 
happened?


Markus

--
Prof. Dr. Markus Kroetzsch
Knowledge-Based Systems Group
Faculty of Computer Science
TU Dresden
+49 351 463 38486
https://iccl.inf.tu-dresden.de/web/KBS/en

___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata


Re: [Wikidata] (Ab)use of "deprecated"

2016-08-14 Thread Markus Kroetzsch

On 12.08.2016 17:24, Jean-Luc Léger wrote:

On 2016-08-11 22:29, Markus Kroetzsch wrote:

On 11.08.2016 18:45, Andra Waagmeester wrote:

On Thu, Aug 11, 2016 at 4:15 PM, Markus Kroetzsch
<markus.kroetz...@tu-dresden.de
<mailto:markus.kroetz...@tu-dresden.de>
<mailto:markus.kroetz...@tu-dresden.de
<mailto:markus.kroetz...@tu-dresden.de>>>
wrote:


has a statement "population: 20,086 (point in time: 2011)"
that is
confirmed by a reference. Nevertheless, the statement is
marked as
"deprecated". This would mean that the statement "the
population was
20,086 in 2011" is wrong. As far as I can tell, this is not
the case.


I wouldn't say that with a deprecated rank, that statement is
"wrong". I
consider the term deprecated to indicate that a given statement is no
longer valid in the context of a given resource (reference). I
agree, in
this specific case the use of the deprecated rank is wrong, since no
references are given to that specific statement.
Nevertheless, I think it is possible to have disagreeing
resources on an
identical statement, where two identical statements exists, one with
rank "deprecated" and one with rank "normal". It is up to the
user to
decide which source s/he trusts.


The status "deprecated" is part of the claim of the statement. The
reference is supposed to support this claim, which in this case is
also the claim that it is deprecated. The status is not meant to
deprecate a reference (not saying that this is never useful,
potentially, but you can only use it in one way, and it seems much
more practical if deprecated statements get references that explain
why they are deprecated).


Yes. I think a complete deprecated statement should look like this :

Rank: Deprecated
Value: 
Qualifier: P2241:reason for deprecation + 

References
* P248:Stated in (or any other property for a reference)   --> a
reference where the value is true (explaining why we added it)
  Value: 
  + any additional qualifiers
* P1310:statement disputed by  --> a
reference explaining why the claim is deprecated
  Value: 
  + any additional qualifiers


I am afraid that this is not a good approach, and it will lead to 
problems in the future. The status "deprecated" refers to the *complete 
claim, including all qualifiers*. So if you add a qualifier P2241, it 
would also be part of what is "deprecated", which is clearly not 
intended here. This is part of the general data structure in Wikidata, 
and tools using the data would expect this to hold true. Ranks are a 
built-in feature of the software, so this aspect is not really open to 
interpretation.


What you are doing here is giving up part of the pre-defined structure 
and replacing it by some local (site-specific) consensus. I know that 
this might be a bit subtle and not so easy to see at first, but it is a 
big step away from structured data that is easy to share across 
applications.


For example, imagine an application wants to compare "normal" statements 
with "deprecated" statements to see if there is any apparent 
contradiction (the same statement being given with both ranks). This 
would no longer work if you add meta-information to deprecated 
statements in the form of qualifiers. For a software tool, an additional 
qualifier simply changes the meaning. Imagine that one statement has an 
additional "end date" qualifier that the other one is lacking -- 
clearly, it would be perfectly reasonable that the statement with the 
end date is deprecated while the one that has only a start but no end is 
not. Technically, there is no difference between this situation and the 
situation where you add a new qualifier "P2241".


Now you could say: "Software should know the special meaning of P2241 
and treat it accordingly." But this is only working for one site 
(Wikidata in this case). A future Wikibase-enabled Commons or Wiktionary 
would use different properties. You end up with having to change 
software for each site, and severely reducing interoperability across 
sites (imagine you want to combine data from two sites before processing 
it).


Even if you are only interested in a single site (Wikidata), you are 
changing the way in which statements should be interpreted over time. If 
the community uses qualifiers to change the data model like this, then 
the current definition of these qualifiers dictates how statements 
should be interpreted. Then if you want to analyse history, things can 
be very difficult.


What to do? It is quite simple: P2241 clearly belongs into the reference 
of a deprecated statement.

Re: [Wikidata] Breaking change in JSON serialization?

2016-08-04 Thread Markus Kroetzsch

Daniel,

You present arguments on issues that I would never even bring up. I 
think we fully agree on many things here. Main points of misunderstanding:


* I was not talking about the WMDE definition of "breaking change". I 
just meant "a change that breaks things". You can define this term for 
yourself as you like and I won't argue with this.


* I would never say that it is "right" that things break in this case. 
It's annoying. However, it is the standard behaviour of widely used JSON 
parsing libraries. We won't discuss it away.


* I am not arguing that the change as such is bad. I just need to know 
about it to fix things before they break.


* I am fully aware of many places where my software should be improved, 
but I cannot fix all of them just to be prepared if a change should 
eventually happen (if it ever happens). I need to know about the next 
thing that breaks so I can prioritize this.


* The best way to fix this problem is to annotate all Jackson classes 
with the respective switch individually. The global approach you linked 
to requires that all users of the classes implement the fix, which is 
not working in a library.


* When I asked for announcements, I did not mean an information of the 
type "we plan to add more optional bits soonish". This ancient wiki page 
of yours that mentions that some kind of change should happen at some 
point is even more vague. It is more helpful to learn about changes when 
you know how they will look and when they will happen. My assumption is 
that this is a "low cost" improvement that is not too much to ask for.


* I did not follow what you want to make an "official policy" for. 
Software won't behave any differently just because there is a policy 
saying that it should.


Markus


On 04.08.2016 16:48, Daniel Kinzler wrote:

Hi Markus!

I would like to elaborate a little on what Lydia said.

On 04.08.2016 09:27, Markus Kroetzsch wrote:

It seems that some changes have been made to the JSON serialization recently:

https://github.com/Wikidata/Wikidata-Toolkit/issues/237


This specific change has been announced in our JSON spec for as long as the
document exists.
<https://www.mediawiki.org/wiki/Wikibase/DataModel/JSON#wikibase-entityid> says:


WARNING: wikibase-entityid may in the future change to be represented as a
single string literal, or may even be dropped in favor of using the string
value type to reference entities.

NOTE: There is currently no reliable mechanism for clients to generate a
prefixed ID or a URL from the information in the data value.


That was the problem: With the current format, all clients needed a hard coded
mapping of entity types to prefixes, in order to construct ID strings from the
JSON serialization of ID values. That means no entity types can be added without
breaking clients. This has now been fixed.


Of course, it would have been good to announce this in advance. However, it is
not a breaking change, and we do not plan to treat additions as breaking 
changes.

Adding something to a public interface is not a breaking change. Adding a method
to an API isn't, adding an element to XML isn't, and adding a key to JSON isn't
- unless there is a spec that explicitly states otherwise.

These are "mix and match" formats, in which anything that isn't forbidden is
allowed. It's the responsibility of the client to accommodate such changes. This
is simple best practice - a HTTP client shouldn't choke on header fields it
doesn't know, etc. See <https://en.wikipedia.org/wiki/Robustness_principle>.


If you use a library that is touchy about extra data per default, configure it
to be more accommodating, see for instance
<https://stackoverflow.com/questions/14343477/how-do-you-globally-set-jackson-to-ignore-unknown-properties-within-spring>.


Could somebody from the dev team please comment on this? Is this going to be in
the dumps as well or just in the API?


Yes, we use the same basic serialization for the API and the dumps. For the
future, note that some parts (such as sitelink URLs) are optional, and we plan
to add more optional bits (such as normalized quantities) soonish.


Are further changes coming up?


Yes. The next one in the pipeline is Quantities without upperBound and
lowerBound, see <https://phabricator.wikimedia.org/T115270>. That IS a breaking
change, and the implementation is thus blocked on announcing it, see
<https://gerrit.wikimedia.org/r/#/c/302248/>.

Furthermore, we will probably remove the entity-type and numeric-id fields from
the serialization of EntityIdValues eventually. But there is no concrete plan
for that at the moment.

When we remove the old fields for ItemId and PropertyId, that IS a breaking
change, and will be announced as such.


Are we ever
going to get email notifications of API changes implemented by the team rather
than having to fix the damage after they happened?


We a

Re: [Wikidata] Wikidata query performance paper

2016-08-06 Thread Markus Kroetzsch

Hi Aidan,

Thanks, very interesting, though I have not read the details yet.

I wonder if you have compared the actual query results you got from the 
different stores. As far as I know, Neo4J actually uses a very 
idiosyncratic query semantics that is neither compatible with SPARQL 
(not even on the BGP level) nor with SQL (even for SELECT-PROJECT-JOIN 
queries). So it is difficult to compare it to engines that use SQL or 
SPARQL (or any other standard query language, for that matter). In this 
sense, it may not be meaningful to benchmark it against such systems.


Regarding Virtuoso, the reason for not picking it for Wikidata was the 
lack of load-balancing support in the open source version, not the 
performance of a single instance.


Best regards,

Markus


On 06.08.2016 18:19, Aidan Hogan wrote:

Hey all,

Recently we wrote a paper discussing the query performance for Wikidata,
comparing different possible representations of the knowledge-base in
Postgres (a relational database), Neo4J (a graph database), Virtuoso (a
SPARQL database) and BlazeGraph (the SPARQL database currently in use)
for a set of equivalent benchmark queries.

The paper was recently accepted for presentation at the International
Semantic Web Conference (ISWC) 2016. A pre-print is available here:

http://aidanhogan.com/docs/wikidata-sparql-relational-graph.pdf

Of course there are some caveats with these results in the sense that
perhaps other engines would perform better on different hardware, or
different styles of queries: for this reason we tried to use the most
general types of queries possible and tried to test different
representations in different engines (we did not vary the hardware).
Also in the discussion of results, we tried to give a more general
explanation of the trends, highlighting some strengths/weaknesses for
each engine independently of the particular queries/data.

I think it's worth a glance for anyone who is interested in the
technology/techniques needed to query Wikidata.

Cheers,
Aidan


P.S., the paper above is a follow-up to a previous work with Markus
Krötzsch that focussed purely on RDF/SPARQL:

http://aidanhogan.com/docs/reification-wikidata-rdf-sparql.pdf

(I'm not sure if it was previously mentioned on the list.)

P.P.S., as someone who's somewhat of an outsider but who's been watching
on for a few years now, I'd like to congratulate the community for
making Wikidata what it is today. It's awesome work. Keep going. :)

___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata



___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata


Re: [Wikidata] Wikidata query performance paper

2016-08-08 Thread Markus Kroetzsch

On 07.08.2016 22:58, Stas Malyshev wrote:

Hi!


the area for a long time). I guess the more difficult question then, is,
which RDF/SPARQL implementation to choose (since any such implementation
should cover as least points 1, 2 and 4 in a similar way), which in turn
reduces down to the distinguishing questions of performance, licensing,
distribution, maturity, tech support, development community, and
non-standard features (keyword search), etc.


We indeed had a giant spreadsheet in which a dozen of potential
solutions (some of them were eliminated very early, but some put up a
robust fight :) were evaluated on about 50 criteria. Of course, some of
them were hard to formalize, and some number were a bit arbitrary, but
that's what we did and Blazegraph came out with the best score.



If you want to go into Wikidata history, here is the "giant spreadsheet" 
Stas was referring to:


https://docs.google.com/spreadsheets/d/1MXikljoSUVP77w7JKf9EXN40OB-ZkMqT8Y5b2NYVKbU/edit?usp=sharing

Some criteria there are obviously rather vague and subjective, but even 
when disregarding the scoring, it shows which systems have been looked at.


Markus


___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata


Re: [Wikidata] An attribute for "famous person"

2016-08-02 Thread Markus Kroetzsch

On 02.08.2016 20:06, Daniel Kinzler wrote:

On 02.08.2016 18:41, Andrew Gray wrote:

I'd agree with both interpretations - the majority of people in Wikidata are
Using the existence of Wikipedia articles as a threshold, as suggested, seems a
pretty good test - it's flawed, of course, but it's easy to check for and works
as a first approximation of "probably is actually famous".


If we want to have the number of sitelinks in RDF, let's please make sure that
this number is associated with the item *document* uri, not with the concept
uri. After all, the person doesn't have links, the item document does.


Oh, there is a little misunderstanding here. I have not suggested to 
create a property "number of sitelinks in this document". What I propose 
instead is to create a property "number of sitelinks for the document 
associated with this entity". The domain of this suggested property is 
entity. The advantage of this proposal over the thing that you 
understood is that it makes queries much simpler, since you usually want 
to sort items by this value, not documents. One could also have a 
property for number of sitelinks per document, but I don't think it has 
such a clear use case.


Markus



___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata


Re: [Wikidata] An attribute for "famous person"

2016-08-02 Thread Markus Kroetzsch

On 02.08.2016 20:59, Daniel Kinzler wrote:

On 02.08.2016 20:19, Markus Kroetzsch wrote:

Oh, there is a little misunderstanding here. I have not suggested to create a
property "number of sitelinks in this document". What I propose instead is to
create a property "number of sitelinks for the document associated with this
entity". The domain of this suggested property is entity. The advantage of this
proposal over the thing that you understood is that it makes queries much
simpler, since you usually want to sort items by this value, not documents. One
could also have a property for number of sitelinks per document, but I don't
think it has such a clear use case.


"number of sitelinks for the document associated with this entity" strikes me as
semantically odd, which was the point of my earlier mail. I'd much rather have
"number of sitelinks in this document". You are right that the primary use would
be to "rank" items, and that it would be more convenient to have the count
associated directly with the item (the entity), but I fear it will lead to a
blurring of the line between information about the entity, and information about
the document. That is already a common point of confusion, and I'd rather keep
that separation very clear. I also don't think that one level of indirection
would be horribly complicated.

To me it's just natural to include the sitelink info on the same level as we
provide a timestamp or revision id: for the document.



I just proposed the simple and straightforward way to solve the 
practical problem at hand. It leads to shorter, more readable queries 
that execute faster. (I don't claim originality for this; it is the 
obvious solution to the problem and most people would arrive at exactly 
the same conclusion).


Your concern is based on the assumption that there is some kind of 
psychological effect that a particular RDF encoding would have on 
users. I don't think that there is any such effect. Our users will not 
confuse the city of Paris with an RDF document just because of some data 
in the RDF store.


Markus

--
Prof. Dr. Markus Kroetzsch
Knowledge-Based Systems Group
Faculty of Computer Science
TU Dresden
+49 351 463 38486
https://iccl.inf.tu-dresden.de/web/KBS/en

___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata


Re: [Wikidata] NLP text corpus annotated with Wikidata entities?

2017-02-05 Thread Markus Kroetzsch

On 05.02.2017 15:47, Samuel Printz wrote:

Hello everyone,

I am looking for a text corpus that is annotated with Wikidata entites.
I need this for the evaluation of an entity linking tool based on
Wikidata, which is part of my bachelor thesis.

Does such a corpus exist?

Ideal would be a corpus annotated in the NIF format [1], as I want to
use GERBIL [2] for the evaluation. But it is not necessary.


I don't know of any such corpus, but Wikidata is linked with Wikipedia 
in all languages. You can therefore take any Wikipedia article and find, 
with very little effort, the Wikidata entity for each link in the text.
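
For example, the query service exposes these sitelinks via schema:about, so 
resolving a Wikipedia URL to its item is a one-line query (Douglas Adams as an 
arbitrary example):

SELECT ?item WHERE {
  <https://en.wikipedia.org/wiki/Douglas_Adams> schema:about ?item .
}
# returns wd:Q42

When working from dumps instead, the sitelink records in each item document give 
you the same mapping without any service calls.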


The downside of this is that Wikipedia pages do not link all occurrences 
of all linkable entities. You can get a higher coverage when taking only 
the first paragraph of each page, but many things will still not be linked.


However, you could also take any existing Wikipedia-page annotated 
corpus and translate the links to Wikidata in the same way.


Finally, DBpedia also is linked to Wikipedia (in fact, the local names 
of entities are Wikipedia article names). So if you find any 
DBpedia-annotated corpus, you can also translate it to Wikidata easily.


Good luck,

Markus

P.S. If you build such a corpus from another resource, it would be nice 
if you could publish it for others to save some effort :-)




Thanks for hints!
Samuel

[1] https://site.nlp2rdf.org/
[2] http://aksw.org/Projects/GERBIL.html


___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata



___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata


Re: [Wikidata] Resolver for Sqid?

2017-02-16 Thread Markus Kroetzsch

Hi Joachim,

Thanks for the suggestion. We are at it. Preview:

http://tools.wmflabs.org/sqid/dev/#/view?quick=VIAF:12307054

But which of these special keys (like "VIAF") should the service 
support? (note that there are 1100+ external id properties ...)


The other syntax with prop=P227=120434059 is not implemented yet.

Best regards,

Markus

On 07.02.2017 16:00, Neubert, Joachim wrote:

For wikidata, there exists a resolver at
https://tools.wmflabs.org/wikidata-todo/resolver.php, which allows me to
build URLs such as



https://tools.wmflabs.org/wikidata-todo/resolver.php?quick=VIAF:12307054
, or

https://tools.wmflabs.org/wikidata-todo/resolver.php?prop=P227=120434059



in order to address wikidata items directly from their external identifiers.



Squid is more appealing for viewing the items. Does a similar mechanism
exist there?



Cheers, Joachim





___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata



___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata


Re: [Wikidata] Resolver for Sqid?

2017-02-17 Thread Markus Kroetzsch
Thanks, Magnus. Could you also change the SQID target URL to the stable 
version (remove "dev/" from URL)?


Best,

Markus

On 16.02.2017 23:52, Magnus Manske wrote:

I have extended the resolver to include squid and reasonator as targets:

https://tools.wmflabs.org/wikidata-todo/resolver.php?quick=VIAF:12307054=sqid

https://tools.wmflabs.org/wikidata-todo/resolver.php?quick=VIAF:12307054=reasonator

On Thu, Feb 16, 2017 at 10:32 PM Markus Kroetzsch
<markus.kroetz...@tu-dresden.de <mailto:markus.kroetz...@tu-dresden.de>>
wrote:

Hi Joachim,

Thanks for the suggestion. We are at it. Preview:

http://tools.wmflabs.org/sqid/dev/#/view?quick=VIAF:12307054

But which of these special keys (like "VIAF") should the service
support? (note that there are 1100+ external id properties ...)

The other syntax with prop=P227=120434059 is not implemented yet.

Best regards,

Markus

On 07.02.2017 16:00, Neubert, Joachim wrote:
> For wikidata, there exists a resolver at
> https://tools.wmflabs.org/wikidata-todo/resolver.php, which allows
me to
> build URLs such as
>
>
>
>
https://tools.wmflabs.org/wikidata-todo/resolver.php?quick=VIAF:12307054
> , or
>
>

https://tools.wmflabs.org/wikidata-todo/resolver.php?prop=P227=120434059
>
>
>
> in order to address wikidata items directly from their external
identifiers.
>
>
>
> Squid is more appealing for viewing the items. Does a similar
mechanism
> exist there?
>
>
>
> Cheers, Joachim
>
>
>
>
>
> ___
> Wikidata mailing list
> Wikidata@lists.wikimedia.org <mailto:Wikidata@lists.wikimedia.org>
> https://lists.wikimedia.org/mailman/listinfo/wikidata
>

___
Wikidata mailing list
Wikidata@lists.wikimedia.org <mailto:Wikidata@lists.wikimedia.org>
https://lists.wikimedia.org/mailman/listinfo/wikidata



___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata



___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata


[Wikidata] Wikidata Toolkit 0.7.0 released

2016-08-05 Thread Markus Kroetzsch

Dear all,

I hereby announce the release of Wikidata Toolkit 0.7.0 [1], the Java 
library for programming with Wikidata and Wikibase.


This is a maintenance release that implements several fixes to ensure 
that WDTK can be used with recent Wikidata API outputs and future 
Wikidata JSON dumps.


The new version also ships the code used to generate the basic 
statistics used in the back of the SQID Wikidata Browser [2].


Maven users can get the library directly from Maven Central (see [1]); 
this is the preferred method of installation. There is also an 
all-in-one JAR at github [3] and of course the sources [4] and updated 
JavaDocs [5].


As usual, feedback is welcome. Developers are also invited to contribute 
via github.


Cheers,

Markus

[1] https://www.mediawiki.org/wiki/Wikidata_Toolkit
[2] https://tools.wmflabs.org/sqid/
[3] https://github.com/Wikidata/Wikidata-Toolkit/releases
[4] https://github.com/Wikidata/Wikidata-Toolkit/
[5] http://wikidata.github.io/Wikidata-Toolkit/


--
Prof. Dr. Markus Kroetzsch
Knowledge-Based Systems Group
Faculty of Computer Science
TU Dresden
+49 351 463 38486
https://iccl.inf.tu-dresden.de/web/KBS/en

___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata


Re: [Wikidata] A property to exemplify SPARQL queries associated witha property

2016-08-25 Thread Markus Kroetzsch

On 25.08.2016 07:40, Gerard Meijssen wrote:

Hoi,
For many categories we have exactly that in Reasonator. This
functionality is based on "is a list of".


Yes, I was thinking of this when making this proposal. The thing is that 
"is a list of" is not a very powerful way to describe lists. It can only 
do very simple things. Examples like the "list of inventors killed by 
their own invention", which are easy to do in SPARQL, are not possible 
there.
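
For reference, that example is a short SPARQL query, but already well beyond 
what "is a list of" can express (P61 is "discoverer or inventor", P509 is "cause 
of death"):

SELECT DISTINCT ?inventor ?inventorLabel WHERE {
  ?inventor wdt:P509 ?invention .   # cause of death ...
  ?invention wdt:P61 ?inventor .    # ... which they invented or discovered themselves
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en" . }
}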


I should add that there are also problems with having arbitrary SPARQL 
queries. They are on the other end of the spectrum from where "is a 
list of" is: they can express too many details (sorting order, label 
recall, query optimiser settings, etc.). When using SPARQL for list 
descriptions, the community should try to use the simplest query 
possible without any extras for formatting and sorting, or it will again 
be hard to use this.


Markus

> It functions not for all

categories or all content for a category. Wikidata often shows more data
particularly when the categories from "other" Wikipedias have been
harvested.

It would be relatively easy to continuously harvest data from Wikipedias
based on such categories.
Thanks,
 GerardM

https://tools.wmflabs.org/reasonator/?q=Q8328346

On 24 August 2016 at 14:21, Navino Evans > wrote:

If you could store queries, you could also store queries for
each item that is about a list of things, so that the query
returns exactly the things that should be in the list ... could
be useful.


This also applies to a huge number of Wikipedia categories (the non
subjective ones). It would be extremely useful to have queries
describing them attached to the Wikidata items for the categories.

On 24 August 2016 at 02:31, Ananth Subray > wrote:

मा

From: Stas Malyshev 
Sent: ‎24-‎08-‎2016 12:33 AM
To: Discussion list for the Wikidata project.

Subject: Re: [Wikidata] A property to exemplify SPARQL queries
associated witha property

Hi!

> Relaying a question from a brief discussion on Twitter [1], I
am curious
> to hear how people feel about the idea of creating a "SPARQL
query
> example" property for properties, modeled after "Wikidata property
> example" [2]?

Might be nice, but we need a good way to present the query in the UI
(see below).

> This would allow people to discover queries that exemplify how the
> property is used in practice. Does the approach make sense or
would it
> stretch too much the scope of properties of properties? Are
there better
> ways to reference SPARQL examples and bring them closer to
their source?

I think it may be a good idea to start thinking about some way of
storing queries on Wikidata maybe? On one hand, they are just
strings,
on the other hand, they are code - like CSS or Javascript - and
storing
them just as strings may be inconvenient. Maybe .sparql file
extension
handler like we have for .js and .json and so on?

--
Stas Malyshev
smalys...@wikimedia.org 

___
Wikidata mailing list
Wikidata@lists.wikimedia.org 
https://lists.wikimedia.org/mailman/listinfo/wikidata


___
Wikidata mailing list
Wikidata@lists.wikimedia.org 
https://lists.wikimedia.org/mailman/listinfo/wikidata





--
___

The Timeline of Everything

www.histropedia.com 

Twitter | Facebook | Google+ | LinkedIn



___
Wikidata mailing list
Wikidata@lists.wikimedia.org 
https://lists.wikimedia.org/mailman/listinfo/wikidata






Re: [Wikidata] SPARQL query increased timeouts?

2016-09-05 Thread Markus Kroetzsch
Thanks. Meanwhile, I have also implemented a fall-back query for SQID 
that does not use the aggregate but only finds distinct results instead. 
This enables us to update at least the labels automatically from SPARQL 
(downside: I no longer have reliable statistics on whether the problem 
is fixed).
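
For the record, the fallback is essentially the query below: the per-class 
counts are dropped and only the distinct classes with their labels are fetched 
(a sketch; the deployed query may differ in details):

SELECT DISTINCT ?cl ?clLabel WHERE {
  ?i wdt:P31 ?cl .
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en" . }
}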


Markus

On 05.09.2016 11:27, Guillaume Lederrey wrote:

Hello!

I just had a quick look at some of our graphs [1][2]. It does look
like we've had a slight increase in number of queries per second and
correlated higher IO. I'm not sure at this point if this is related to
the timeout you see. I'll dig a bit deeper...

Thanks for the patience!

[1] https://grafana.wikimedia.org/dashboard/db/wikidata-query-service
[2] https://ganglia.wikimedia.org/latest/?r=month=Wikidata+Query+Service+eqiad

On Sun, Sep 4, 2016 at 2:26 PM, Markus Kroetzsch
<markus.kroetz...@tu-dresden.de> wrote:

Hi,

SQID uses a somewhat challenging SPARQL query to refresh its statistical
data for the current usage of classes [1]. This is done once per hour, with
one retry after 60sec if the first attempt times out. In the past, timeouts
have been common, but it usually worked after a while.

Since a few days, however, the query always times out. In spite of the 48
attempts throughout each day, the query did not succeed once since
8/30/2016, 8:12:28 PM [2].

Possible explanations:
* WDQS experiences more load now (every day, every hour).
* The query got slower since for some reason the overall number of P31
statements increased in a sudden way (or for some reason crossed some
threshold).
* There have been technical changes to WDQS that reduce performance.

I don't have statistics on the success rate of the problematic query in past
weeks, so I cannot say if the timeout rate had increased before the current
week.

Does anybody have further information or observations that could help to
clarify what is going on? We can rewrite our software to use simpler queries
if this one fails now, but it seems like a step backwards.

Best regards,

Markus


[1] Here is the query:

SELECT ?cl ?clLabel ?c WHERE {
  { SELECT ?cl (count(*) as ?c) WHERE { ?i wdt:P31 ?cl } GROUP BY ?cl }
SERVICE wikibase:label {
  bd:serviceParam wikibase:language "en" .
  }
}

[2] https://tools.wmflabs.org/sqid/#/status

--
Prof. Dr. Markus Kroetzsch
Knowledge-Based Systems Group
Faculty of Computer Science
TU Dresden
+49 351 463 38486
https://iccl.inf.tu-dresden.de/web/KBS/en

___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata







___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata


Re: [Wikidata] Item Label from ItemId

2016-09-01 Thread Markus Kroetzsch

On 31.08.2016 22:14, Sumit Asthana wrote:

Hi,
I've written a code to scrape Wikidata dump following Wikidata Toolkit
examples.

In processItemDocument, I have extracted the target entityId for the
property 'instanceof' for the current item. However I'm unable to find a
way to get the label of the target entity given that I have the
entityId, but not the entityDocument? Help would be appreciated :)


When you process a dump, you don't have random access to the data of all 
entities -- you just get to see them in order. Depending on your 
situation, there are several ways to go forward:


(1) You can use the Wikidata Toolkit API support to query the labels 
from Wikidata. This can be done in bulk at the end of the dump 
processing (fewer requests, since you can ask for many labels at once), 
or you can do it each time you need a label (more requests, slower, but 
easiest to implement). In the latter case, you should probably cache 
labels locally in a hashmap or similar to avoid repeated requests.


This solution works well if you have a small or medium amount of labels. 
Otherwise, the API requests will take too long to be practical. 
Moreover, this solution will give you *current* labels from Wikidata. If 
you want to make sure that the labels are at a similar revision as your 
dump data (e.g., for historic analyses), then you must get them from the 
dump, not from the Web.
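
As an illustration of the bulk variant of (1), here is a small sketch
using the wbgetentities API to fetch English labels in batches (the
batch size of 50 is the usual limit for non-bot accounts; treat the
details as an untested sketch rather than production code):

import requests

API_URL = "https://www.wikidata.org/w/api.php"

def fetch_labels(qids, language="en"):
    """Fetch labels for a list of Q-ids in batches of 50."""
    labels = {}
    for i in range(0, len(qids), 50):
        params = {
            "action": "wbgetentities",
            "ids": "|".join(qids[i:i + 50]),
            "props": "labels",
            "languages": language,
            "format": "json",
        }
        data = requests.get(API_URL, params=params).json()
        for qid, entity in data.get("entities", {}).items():
            label = entity.get("labels", {}).get(language)
            if label:
                labels[qid] = label["value"]
    return labels

# e.g. fetch_labels(["Q42", "Q64"]) -> {"Q42": "Douglas Adams", ...}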


(2) If you need large amounts of labels (in the order of millions), then 
Web requests will not be practical. In this case, the easiest solution 
is to process the dump twice: first you collect all qids that you care 
about, second you gather all of their labels. Takes twice the time, but 
is very scalable: it will work for all data sizes (provided you can 
store the qids/labels while your program is running; if your local 
memory is very limited, you will need to use a database for this, which 
would slow down things more).


(1+2) You can do a combined approach of (1) and (2): do a single pass; 
remember all ids that you need labels for; if you find such an id in the 
dump, store the label; for ids that you did not find (because they 
occurred before you knew you needed them), do Web API queries after the 
dump processing.


(3) If you need to run such analyses a lot, you could also build up a 
label database locally: just write a small program that processes the 
dump and stores the label(s) for each id in an on-disk database. Then
your actual program can get the labels from this database rather than 
asking the API. If your label set is not so large, you can also store 
the labels in a file that you load into memory when you need it. In 
fact, for the case of "class" items (things with an incoming P31 link), 
you can find such a file online:


http://tools.wmflabs.org/sqid/data/classes.json

It contains some more information, but also all English labels. This is 
26M, so quite manageable.


(4) If the items that you need labels for can be described easily (e.g., 
"all items with incoming P31 links") and are not too many (e.g., around 
10), then you can use SPARQL to get all labels at once. This may 
(sometimes) time out if the result set is big. For example, the 
following query gets you all P31-targets + the number of their direct 
"best rank" instances:


SELECT ?cl ?clLabel ?c WHERE {
  { SELECT ?cl (count(*) as ?c) WHERE { ?i wdt:P31 ?cl } GROUP BY ?cl }
  SERVICE wikibase:label {
bd:serviceParam wikibase:language "en" .
  }
}

Do *not* run this in your browser! There are too many results to 
display. Use the query service API programmatically instead. This query 
times out in as much as half of the cases, but so far I could always get 
it to return a complete result after a few attempts (you have to wait 
for at least 60sec before trying again).
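
A sketch of how (4) could look from a script, waiting 60 seconds before
retrying when an attempt fails (the endpoint URL, user agent, and
number of retries are my choices for illustration):

import time
import requests

SPARQL_ENDPOINT = "https://query.wikidata.org/sparql"

CLASS_QUERY = """
SELECT ?cl ?clLabel ?c WHERE {
  { SELECT ?cl (count(*) as ?c) WHERE { ?i wdt:P31 ?cl } GROUP BY ?cl }
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en" . }
}
"""

def run_query(query, retries=5, wait=60):
    """Run a query, waiting `wait` seconds before each new attempt."""
    for attempt in range(retries):
        r = requests.get(
            SPARQL_ENDPOINT,
            params={"query": query, "format": "json"},
            headers={"User-Agent": "label-example/0.1"},
        )
        if r.status_code == 200:
            return r.json()["results"]["bindings"]
        time.sleep(wait)  # most likely a timeout; try again later
    raise RuntimeError("query failed after %d attempts" % retries)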



My applications now do a single pass in WDTK for only the "hard" things, 
and then complete the output file using (4) with a Python script filling 
in labels. If the Python script's query does not time out, then the 
update of all labels takes less than a minute in this way. We had an 
implementation of (1+2) at some point, but it was more complicated to 
program and less efficient in this case. We did not have a reason to do 
(3) since we process each dump only once, so the effort of creating a 
label file does not pay off compared to (2).


Best regards,

Markus


___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata


Re: [Wikidata] List of WP-languages and language short code

2016-09-08 Thread Markus Kroetzsch

On 07.09.2016 21:48, Jan Macura wrote:

Hi all,

In a "[Wikidata] How many languages supports
Wikibase/Wikidata?" thread beginning May 26, Jan Macura writes


Hi all,

looking into [1] I read that Wikidata supports 358 languages. Is it
still true? For example, I tried to add label in language coded as
"nan" (defined in ISO 639-3) and it worked. However it didn't worked
for e.g. "arb", which is also part of the ISO 639-3 standard. So how
many?

Thanks
 Jan

[1] VRANDEČIĆ, Denny, KRÖTZSCH, Markus. Wikidata:
A Free Collaborative Knowledgebase. /Communications of the ACM/.
2014-10, Vol. 57 No. 10, pp. 78–85. DOI 10.1145/2629489.
http://cacm.acm.org/magazines/2014/10/178785-wikidata/fulltext

Thanks for helping to focus these unfolding Wikdaata language
development questions, Markus.

Scott

thanks Scott! You reminded me, that I still haven't received appropriate
answer to my original question... ping Markus et al.


Such questions refer to current implementation details and can only be 
answered by the developers.


Best,

Markus



Thanks
 Jan


___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata




___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata


[Wikidata] SPARQL query increased timeouts?

2016-09-04 Thread Markus Kroetzsch

Hi,

SQID uses a somewhat challenging SPARQL query to refresh its statistical 
data for the current usage of classes [1]. This is done once per hour, 
with one retry after 60sec if the first attempt times out. In the past, 
timeouts have been common, but it usually worked after a while.


Since a few days, however, the query always times out. In spite of the 
48 attempts throughout each day, the query did not succeed once since 
8/30/2016, 8:12:28 PM [2].


Possible explanations:
* WDQS experiences more load now (every day, every hour).
* The query got slower because the overall number of P31 statements
increased suddenly for some reason (or crossed some threshold).

* There have been technical changes to WDQS that reduce performance.

I don't have statistics on the success rate of the problematic query in 
past weeks, so I cannot say if the timeout rate had increased before the 
current week.


Does anybody have further information or observations that could help to
clarify what is going on? We can rewrite our software to use simpler 
queries if this one fails now, but it seems like a step backwards.


Best regards,

Markus


[1] Here is the query:

SELECT ?cl ?clLabel ?c WHERE {
  { SELECT ?cl (count(*) as ?c) WHERE { ?i wdt:P31 ?cl } GROUP BY ?cl }
SERVICE wikibase:label {
  bd:serviceParam wikibase:language "en" .
  }
}

[2] https://tools.wmflabs.org/sqid/#/status

--
Prof. Dr. Markus Kroetzsch
Knowledge-Based Systems Group
Faculty of Computer Science
TU Dresden
+49 351 463 38486
https://iccl.inf.tu-dresden.de/web/KBS/en

___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata


[Wikidata] SPARQL power users and developers

2016-09-30 Thread Markus Kroetzsch

Dear SPARQL users,

We are starting a research project to investigate the use of the 
Wikidata SPARQL Query Service, with the goal to gain insights that may 
help to improve Wikidata and the query service [1]. Currently, we are 
still waiting for all data to become available. Meanwhile, we would like 
to ask for your input.


Preliminary analyses show that the use of the SPARQL query service 
varies greatly over time, presumably because power users and software 
tools are running large numbers of queries. For a meaningful analysis, 
we would like to understand such high-impact biases in the data. We 
therefore need your help:


(1) Are you a SPARQL power user who sometimes runs large numbers of 
queries (over 10,000)? If so, please let us know how your queries might 
typically look so we can identify them in the logs.


(2) Are you the developer of a tool that launches SPARQL queries? If so, 
then please let us know if there is any way to identify your queries.


If (1) or (2) applies to you, then it would be good if you could include 
an identifying comment into your SPARQL queries in the future, to make 
it easier to recognise them. In return, this would enable us to provide 
you with statistics on the usage of your tool [2].


Further feedback is welcome.

Cheers,

Markus


[1] https://meta.wikimedia.org/wiki/Research:Understanding_Wikidata_Queries

[2] Pending permission by the WMF. Like all Wikimedia usage data, the 
query logs are under strict privacy protection, so we will need to get 
clearance before sharing any findings with the public. We hope, however, 
that there won't be any reservations against publishing non-identifying 
information.


--
Prof. Dr. Markus Kroetzsch
Knowledge-Based Systems Group
Faculty of Computer Science
TU Dresden
+49 351 463 38486
https://iccl.inf.tu-dresden.de/web/KBS/en

___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata


Re: [Wikidata] Greater than 400 char limit for Wikidata string data types

2016-09-19 Thread Markus Kroetzsch

On 19.09.2016 18:12, Lydia Pintscher wrote:

On Mon, Sep 19, 2016 at 6:19 AM, Denny Vrandečić  wrote:

Can you figure out what a good limit would be for these two use cases? I.e.
what would support 99%, 99.9%, and 100%?


Yes this would be extremely helpful. In general I agree that we can
now be more relaxed about this than we were at the beginning because
you all understand that Wikidata isn't a place to store long free
text. However I still think we need to have some measures in place.
One thing we could maybe do is a new datatype for longer text but I'm
undecided about this yet. I still don't feel too good about making
every string property several thousand characters long.


I am not excited about having another new datatype for this. The
proposed difference of 400 vs. 2000 chars does not seem so fundamental,
and the limits are rather arbitrary too, so it seems like too much
detail on the user level to name these things in special ways.
Datatypes should be used if they have a benefit to the user (easier
input, better display) and not to enforce constraints. There are very
many relevant constraints, and length is hardly the most important one,
so we should not give it the prominence of having its own type.


Best,

Markus


___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata


Re: [Wikidata] Deleting properties / items in test.wikidata.org

2016-09-16 Thread Markus Kroetzsch

Hi,

I don't think you need to worry at all about cluttering 
test.wikidata.org. I guess it is purged regularly anyway.


Best,

Markus

On 14.09.2016 20:32, Legoktm wrote:

Hi,

On 09/08/2016 08:54 AM, Loic Dachary wrote:

Hi,

But I was not able to figure out how to remove them afterwards, to not clutter 
test.wikidata.org.


You would just use the normal MediaWiki API delete feature[1]. Pywikibot
abstracts it as Page.delete(...).

[1] https://www.mediawiki.org/wiki/API:Delete

-- Legoktm

___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata




___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata


Re: [Wikidata] Greater than 400 char limit for Wikidata string data types

2016-09-16 Thread Markus Kroetzsch

On 13.09.2016 11:39, Sebastian Burgstaller wrote:

Hi all,

I think this topic might have been discussed many months ago. For
certain data types in the chemical compound space (P233, canonical
smiles, P2017 isomeric smiles and P234 Inchi key) a higher character
limit than 400 would be really helpful (1500 to 2000 chars (I sense
that this might cause problems with SPARQL)). Are there any plans on
implementing this? In general, for quality assurance, many string
property types would profit from a fixed max string length.


FWIW, I recall that the main reason for the char limit originally was to 
discourage the use of Wikidata for textual content. Simply put, we did 
not want Wikipedia articles in the data. Long texts could also make 
copyright/license issues more relevant (though, in theory, a copyrighted 
poem could be rather short).


However, given that we now have such a well-informed community with
established practices and good quality checks, it seems unproblematic
to lift the character limit. I don't think there are major technical
reasons for having it. Surely, Blazegraph (the WMF SPARQL engine)
should not expect texts to be short, and I would be surprised if it
did. So I would not expect problems on this side.


Best,
Markus




Best,
Sebastian

Sebastian Burgstaller-Muehlbacher, PhD
Research Associate
Andrew Su Lab
MEM-216, Department of Molecular and Experimental Medicine
The Scripps Research Institute
10550 North Torrey Pines Road
La Jolla, CA 92037
@sebotic

___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata




___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata


Re: [Wikidata] SPARQL query increased timeouts?

2016-09-07 Thread Markus Kroetzsch

On 07.09.2016 03:05, Stas Malyshev wrote:

Hi!


I bet wikibase:label has to be reimplemented in some other way to prove
efficient...


Yes, label service may be inefficient sometimes. I'll look into how it
can be improved.



However, the query without the counting but with the labels included 
also works. Probably we need to do two queries instead of one ...
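
Roughly, the two queries could look as follows (my reconstruction for
illustration, not the actual SQID code): the aggregate runs without the
label service, and a second query then fetches labels only for the
classes that were found (batched if there are many):

COUNT_QUERY = """
SELECT ?cl (count(*) AS ?c) WHERE { ?i wdt:P31 ?cl } GROUP BY ?cl
"""

def label_query(class_iris, language="en"):
    """Second query: labels only for the classes found by COUNT_QUERY."""
    values = " ".join("<%s>" % iri for iri in class_iris)
    return """
SELECT ?cl ?clLabel WHERE {
  VALUES ?cl { %s }
  SERVICE wikibase:label { bd:serviceParam wikibase:language "%s" . }
}""" % (values, language)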


Markus

___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata


Re: [Wikidata] Aggregate info on Wikidata items

2016-08-27 Thread Markus Kroetzsch

On 27.08.2016 07:18, Sumit Asthana wrote:

Hi,

I'm trying to use offline wikidata dump
 but when I
run an example from Wikidata Toolkit - EntityStatisticsProcessor
,
I hit the following error - https://dpaste.de/TNpd.

Apparently it is unable to parse the dump but I can't seem to figure it
out. Help would be appreciated :)


This happens if your dump download was incomplete. It seems that 
(recently) the download is sometimes interrupted and needs to be resumed 
to get the whole file. Our implementation is not smart enough to fix 
this and ends up with an incomplete dump.


You can download the dump in any way you like, including using a browser 
with "safe as". I prefer to use wget. You just need to put it into the 
right directory where WDTK also puts dumps. When you start WDTK, it 
reports the file to be downloaded and the place where it puts the 
download, so this is one way to find out.


Dump files are the ones found at 
https://dumps.wikimedia.org/other/wikidata/ (with the file names used 
there). They go into the directory named like 
./dumpfiles/wikidatawiki/json-20160801 (for the dump 
https://dumps.wikimedia.org/other/wikidata/20160801.json.gz). The 
dumpfiles directory is under the directory from where you run your program.


Best,

Markus




-Thanks,
Sumit


On Sat, Aug 27, 2016 at 1:18 AM, Stas Malyshev > wrote:

Hi!

> For example "I want to know the number of statements on an average with
> dead external reference links".

Since there are over a million links in references, you probably may
want to use dump - either JSON or RDF, and looking for references there.
It would be relatively easy to find those in reference statements.
However, checking a million links might require some careful planning :)
--
Stas Malyshev
smalys...@wikimedia.org 

___
Wikidata mailing list
Wikidata@lists.wikimedia.org 
https://lists.wikimedia.org/mailman/listinfo/wikidata





___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata




___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata


Re: [Wikidata] SPARQL power users and developers

2016-09-30 Thread Markus Kroetzsch

On 30.09.2016 20:47, Denny Vrandečić wrote:

Markus, do you have access to the corresponding HTTP request logs? The
fields there might be helpful (although I might be overtly optimistic
about it)


Yes, we can access all logs. For bot-based queries, this should be very 
helpful indeed. I can still think of several cases where this won't help 
much:


* People writing a quick Python (or whatever) script to run thousands of 
queries, without setting a meaningful user agent.
* Web applications like Reasonator or SQID that cause the client to run 
SPARQL queries when viewing a page (in this case, the user agent that 
gets logged is the user's browser).


But, yes, we will definitely look at all signals that we can get from 
the data.


Best,

Markus





On Fri, Sep 30, 2016 at 11:38 AM Yuri Astrakhan
<yastrak...@wikimedia.org <mailto:yastrak...@wikimedia.org>> wrote:

I guess I qualify for #2 several times:
* The  &  support access to the geoshapes
service, which in turn can make requests to WDQS. For example, see
https://en.wikipedia.org/wiki/User:Yurik/maplink  (click on
"governor's link")

* The  wiki tag supports the same geoshapes service, as well
as direct queries to WDQS. This graph uses both (one to get all
countries, the other is to get the list of disasters)
https://www.mediawiki.org/wiki/Extension:Graph/Demo/Sparql/Largest_disasters

* There has been some discussion to allow direct WDQS querying from
maps too - e.g. to draw points of interest based on Wikidata (very
easy to implement, but we should be careful to cache it properly)

Since all these queries are called from either nodejs or our
javascript, we could attach extra headers, like X-Analytics, which
is already handled by Varnish.  Also, NodeJS queries could set the
user agent string.


On Fri, Sep 30, 2016 at 10:44 AM Markus Kroetzsch
<markus.kroetz...@tu-dresden.de
<mailto:markus.kroetz...@tu-dresden.de>> wrote:

On 30.09.2016 16:18, Andra Waagmeester wrote:
> Would it help if I add the following header to every large
batch of queries?
>
> ###
> # access: (http://query.wikidata.org
> or
https://query.wikidata.org/bigdata/namespace/wdq/sparql?query={SPARQL}

<https://query.wikidata.org/bigdata/namespace/wdq/sparql?query=%7BSPARQL%7D>
.)
> # contact: email, acountname, twittername etc
> # bot: True/False
> # .
> ##

This is already more detailed than what I had in mind. Having a
way to
tell apart bots and tools from "organic" queries would already
be great.
We are mainly looking for something that will help us to understand
sudden peaks of activity. For this, it might be enough to have a
short
signature (a URL could be given, but a tool name with a version
would
also be fine). This is somewhat like the "user agent" field in HTTP.

But you are right that some formatting convention may help
further here.
How about this:

#TOOL:

Then one could look for comments of this form without knowing
all the
tools upfront. Of course, this is just a hint in any case, since one
could always use the same comment in any manually written query.

    Best regards,

Markus

>
> On Fri, Sep 30, 2016 at 4:00 PM, Markus Kroetzsch
> <markus.kroetz...@tu-dresden.de
<mailto:markus.kroetz...@tu-dresden.de>
<mailto:markus.kroetz...@tu-dresden.de
<mailto:markus.kroetz...@tu-dresden.de>>>
> wrote:
>
> Dear SPARQL users,
>
> We are starting a research project to investigate the use
of the
> Wikidata SPARQL Query Service, with the goal to gain
insights that
> may help to improve Wikidata and the query service [1].
Currently,
> we are still waiting for all data to become available.
Meanwhile, we
> would like to ask for your input.
>
> Preliminary analyses show that the use of the SPARQL query
service
> varies greatly over time, presumably because power users and
> software tools are running large numbers of queries. For a
> meaningful analysis, we would like to understand such
high-impact
> biases in the data. We therefore need your help:
>
> (1) Are you a SPARQL power user who sometimes runs large
numbers of
> queries (over 10,000)? If so, please let us know how your
queries
> might typically look so we can identify

Re: [Wikidata] SPARQL power users and developers

2016-09-30 Thread Markus Kroetzsch

On 30.09.2016 19:50, Andra Waagmeester wrote:

Just curious while we are on the topic. When you are inspecting the
headers to separate between "organic" queries and bot queries, would it
be possible to count the times a set of properties is used in the
different queries? This would be a nice way to demonstrate to original
external resources how "their" data is used and which combination of
properties are used together with "their" properties (eg. P351 for ncbi
gene or P699 for the disease ontology). It would be interesting to know
how often for example those two properties are used in one single query.


Yes, we definitely want to do such analyses. The first task is to clean 
up and group/categorize queries so we can get a better understanding (if 
a property is used in 100K queries a day, it would still be nice to know 
if they come from a single script or from many users).


Once we have this, we would like to analyse for content (which 
properties and classes are used, etc.) but also for query features (how
many OPTIONALs, GROUP BYs, etc. are used). Ideas on what to analyse 
further are welcome. Of course, SPARQL can only give a partial idea of 
"usage", since Wikidata content can be used in ways that don't involve 
SPARQL. Moreover, counting raw numbers of queries can also be 
misleading: we have had cases where a single query result was discussed 
by hundreds of people (e.g. the Panama papers query that made it to Le 
Monde online), but in the logs it will still show up only as a single 
query among millions.


Best,

Markus



On Fri, Sep 30, 2016 at 4:44 PM, Markus Kroetzsch
<markus.kroetz...@tu-dresden.de <mailto:markus.kroetz...@tu-dresden.de>>
wrote:

On 30.09.2016 16:18, Andra Waagmeester wrote:

Would it help if I add the following header to every large batch
of queries?

###
# access: (http://query.wikidata.org
or
https://query.wikidata.org/bigdata/namespace/wdq/sparql?query={SPARQL}

<https://query.wikidata.org/bigdata/namespace/wdq/sparql?query=%7BSPARQL%7D>
.)
# contact: email, acountname, twittername etc
# bot: True/False
# .
##


This is already more detailed than what I had in mind. Having a way
to tell apart bots and tools from "organic" queries would already be
great. We are mainly looking for something that will help us to
understand sudden peaks of activity. For this, it might be enough to
have a short signature (a URL could be given, but a tool name with a
version would also be fine). This is somewhat like the "user agent"
field in HTTP.

But you are right that some formatting convention may help further
here. How about this:

#TOOL:

Then one could look for comments of this form without knowing all
the tools upfront. Of course, this is just a hint in any case, since
one could always use the same comment in any manually written query.

    Best regards,

Markus


On Fri, Sep 30, 2016 at 4:00 PM, Markus Kroetzsch
<markus.kroetz...@tu-dresden.de
<mailto:markus.kroetz...@tu-dresden.de>
<mailto:markus.kroetz...@tu-dresden.de
<mailto:markus.kroetz...@tu-dresden.de>>>

wrote:

Dear SPARQL users,

We are starting a research project to investigate the use of the
Wikidata SPARQL Query Service, with the goal to gain
insights that
may help to improve Wikidata and the query service [1].
Currently,
we are still waiting for all data to become available.
Meanwhile, we
would like to ask for your input.

Preliminary analyses show that the use of the SPARQL query
service
varies greatly over time, presumably because power users and
software tools are running large numbers of queries. For a
meaningful analysis, we would like to understand such
high-impact
biases in the data. We therefore need your help:

(1) Are you a SPARQL power user who sometimes runs large
numbers of
queries (over 10,000)? If so, please let us know how your
queries
might typically look so we can identify them in the logs.

(2) Are you the developer of a tool that launches SPARQL
queries? If
so, then please let us know if there is any way to identify your
queries.

If (1) or (2) applies to you, then it would be good if you could
include an identifying comment into your SPARQL queries in the
future, to make it easier to recognise them. In return, this
would
enable us to provide you with statistics on the usage of
your tool [2].

Further feedback is welcom

Re: [Wikidata] SPARQL power users and developers

2016-09-30 Thread Markus Kroetzsch

On 30.09.2016 16:18, Andra Waagmeester wrote:

Would it help if I add the following header to every large batch of queries?

###
# access: (http://query.wikidata.org
or https://query.wikidata.org/bigdata/namespace/wdq/sparql?query={SPARQL} .)
# contact: email, acountname, twittername etc
# bot: True/False
# .
##


This is already more detailed than what I had in mind. Having a way to 
tell apart bots and tools from "organic" queries would already be great. 
We are mainly looking for something that will help us to understand 
sudden peaks of activity. For this, it might be enough to have a short 
signature (a URL could be given, but a tool name with a version would 
also be fine). This is somewhat like the "user agent" field in HTTP.


But you are right that some formatting convention may help further here. 
How about this:


#TOOL:

Then one could look for comments of this form without knowing all the 
tools upfront. Of course, this is just a hint in any case, since one 
could always use the same comment in any manually written query.
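
For example, a tool could simply prepend such a comment to every query
it sends; a minimal sketch (the tool name and version are placeholders):

import requests

SPARQL_ENDPOINT = "https://query.wikidata.org/sparql"
TOOL_COMMENT = "#TOOL: example-tool/1.0\n"   # placeholder name/version

def run_tagged_query(query):
    """Send a query with an identifying comment for the logs."""
    return requests.get(
        SPARQL_ENDPOINT,
        params={"query": TOOL_COMMENT + query, "format": "json"},
    ).json()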


Best regards,

Markus



On Fri, Sep 30, 2016 at 4:00 PM, Markus Kroetzsch
<markus.kroetz...@tu-dresden.de <mailto:markus.kroetz...@tu-dresden.de>>
wrote:

Dear SPARQL users,

We are starting a research project to investigate the use of the
Wikidata SPARQL Query Service, with the goal to gain insights that
may help to improve Wikidata and the query service [1]. Currently,
we are still waiting for all data to become available. Meanwhile, we
would like to ask for your input.

Preliminary analyses show that the use of the SPARQL query service
varies greatly over time, presumably because power users and
software tools are running large numbers of queries. For a
meaningful analysis, we would like to understand such high-impact
biases in the data. We therefore need your help:

(1) Are you a SPARQL power user who sometimes runs large numbers of
queries (over 10,000)? If so, please let us know how your queries
might typically look so we can identify them in the logs.

(2) Are you the developer of a tool that launches SPARQL queries? If
so, then please let us know if there is any way to identify your
queries.

If (1) or (2) applies to you, then it would be good if you could
include an identifying comment into your SPARQL queries in the
future, to make it easier to recognise them. In return, this would
enable us to provide you with statistics on the usage of your tool [2].

Further feedback is welcome.

Cheers,

Markus


[1]
https://meta.wikimedia.org/wiki/Research:Understanding_Wikidata_Queries
<https://meta.wikimedia.org/wiki/Research:Understanding_Wikidata_Queries>

[2] Pending permission by the WMF. Like all Wikimedia usage data,
the query logs are under strict privacy protection, so we will need
to get clearance before sharing any findings with the public. We
hope, however, that there won't be any reservations against
publishing non-identifying information.

--
Prof. Dr. Markus Kroetzsch
Knowledge-Based Systems Group
Faculty of Computer Science
TU Dresden
+49 351 463 38486 <tel:%2B49%20351%20463%2038486>
https://iccl.inf.tu-dresden.de/web/KBS/en
<https://iccl.inf.tu-dresden.de/web/KBS/en>

___
Wikidata mailing list
Wikidata@lists.wikimedia.org <mailto:Wikidata@lists.wikimedia.org>
https://lists.wikimedia.org/mailman/listinfo/wikidata
<https://lists.wikimedia.org/mailman/listinfo/wikidata>




___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata




___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata


Re: [Wikidata] [Analytics] SPARQL power users and developers

2016-10-04 Thread Markus Kroetzsch

Hi again,

The solutions discussed here seem to be quite a bit more general than 
what I was thinking about. Of course it would be nice to have a uniform, 
cross-client way to indicate tools in any MW Web service or API, but 
this is a slightly bigger (and probably more long-term) goal than what I 
had in mind. It is a good idea to suggest a standard approach to tool 
developers there and to have a documentation page on that, but it would 
take some time until this is adopted by enough tools to work.


For our present task, we just need some more signals we can use. 
Analysing SPARQL queries requires us to parse them anyway, so comments 
are fine. In general, the data we are looking at has a lot of noise, so 
we cannot rely on a single field. We will combine user agents, 
X-analytics, query comments, and also query shapes (if you get 1M+ 
similar-looking queries in one hour, you know it's a bot). With the
current data, the query shape is often our main clue, so comments would 
already be a big step forward.
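
To illustrate how a non-browser client could combine these signals,
here is a sketch that sets a user agent, the X-Analytics keys proposed
by Yuri below, and an in-query comment. All concrete values are
placeholders, and whether such X-Analytics keys are accepted from
clients is part of the ongoing discussion:

import requests

SPARQL_ENDPOINT = "https://query.wikidata.org/sparql"

HEADERS = {
    # shows up in the HTTP request logs
    "User-Agent": "example-bot/0.1 (https://example.org/contact)",
    # keys as proposed below: tool, toolver, contact
    "X-Analytics": "tool=example-bot;toolver=0.1;contact=ops@example.org",
}

def run_query(query):
    # the comment is the only signal that survives inside the query text
    tagged = "#TOOL: example-bot/0.1\n" + query
    return requests.get(
        SPARQL_ENDPOINT,
        params={"query": tagged, "format": "json"},
        headers=HEADERS,
    ).json()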


Best,

Markus


On 04.10.2016 07:05, Yuri Astrakhan wrote:

For consistency between all possible clients, we seem to have only two
options:  either part of the query, or the X-Analytics header.   The
user-agent header is not really an option because it is not available
for all types of clients, and we want to have just one way for everyone.
Headers other than X-Analytics will need custom handling, whereas we
already have plenty of Varnish code to deal with X-Analytics header,
split it into parts, and for Hive to parse it. Yes it will be an extra
line of code in JS ($.ajax instead of $.get), but I am sure this is not
such a big deal if we provide cookie cutter code. Parsing query string
in varnish/hive is also some complex extra work, so lets keep
X-Analytics. Proposed required values (semicolon separated):
* tool=
* toolver=
* contact=mailto:em...@example.com>, +1.212.555.1234, ...>

Bikeshedding ?   See also:  https://wikitech.wikimedia.org/wiki/X-Analytics

On Tue, Oct 4, 2016 at 12:45 AM Stas Malyshev > wrote:

Hi!

> Using custom HTTP headers would, of course, complicate calls for the
> tool authors (i.e., myself). $.ajax instead of $.get and all that. I
> would be less inclined to change to that.

Yes, if you're using browser, you probably can't change user agent. In
that case I guess we need either X-Analytics or put it in the query. Or
maybe Referer header would be fine then - it is also recorded. If
Referer is distinct enough it can be used then.

--
Stas Malyshev
smalys...@wikimedia.org 

___
Analytics mailing list
analyt...@lists.wikimedia.org 
https://lists.wikimedia.org/mailman/listinfo/analytics



___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata




___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata


Re: [Wikidata] Wikidata ontology

2017-01-05 Thread Markus Kroetzsch

Hi Rüdiger,

Daniel refers to several independent aspects of Wikidata:

(1) The ontology is not separated from the data. Schematic information 
is mostly managed by encoding it in data as well. Therefore, if you want 
some of it (but not the rest), then some extraction will be necessary. 
The Wikidata SPARQL service is your friend for not-too-big (up to some 
100K triples) on-the-fly data exports, enough to get the whole class 
hierarchy, for example. We also have created some ontology-like excerpts 
in the past [1]. These have been done offline by processing the data 
dump using Wikidata Toolkit.
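
For example, a query along the following lines (P279 is "subclass of")
exports the class hierarchy with English labels; for the full graph
this can be too large for a single browser request, so it is better
sent from a script, possibly restricted to the part of the hierarchy
one actually needs:

# Sketch of a hierarchy export via the SPARQL service (P279 = subclass of).
HIERARCHY_QUERY = """
SELECT ?class ?classLabel ?superclass WHERE {
  ?class wdt:P279 ?superclass .
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en" . }
}
"""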


(2) The ontology is very lightweight. Wikidata mostly encodes properties 
and their types, some hierarchical information on properties and 
classes, and some "weak" hints on things like domain and range for some 
properties. So there are no complex OWL axioms there. This is also the 
reason why the ontology should not contain any logical contradictions -- 
when Daniel refers to "contradictions" I guess he means incoherences in 
the overall modelling (which contradict human intuition).


(3) The ontology may change at any time. This is a consequence of (1) 
and the fact that Wikidata is controlled by a global community.


For all of these reasons, there cannot be one "Wikidata ontology" but 
there might still be many useful ontological things you can get without 
too much effort.


If you are interested in learning about the classes and properties used 
in Wikidata to get an informal idea of its current schema and content, 
then you could also browse this data in SQID [2].


Best regards,

Markus

[1] 
http://tools.wmflabs.org/wikidata-exports/rdf/exports/20160801/dump_download.html

[2] https://tools.wmflabs.org/sqid/#/browse?type=properties

On 05.01.2017 16:15, Daniel Kinzler wrote:

Am 04.01.2017 um 11:00 schrieb Léa Lacroix:

Hello,

You can find it here: http://wikiba.se/ontology-1.0.owl

If you have questions regarding the ontology, feel free to ask.



Please note that this is the *wikibase* ontology, which defines the meta-model
for the information on Wikidata. It models statements, sitelinks, source
references, etc.

This ontology does not model "real world" concepts or properties like location
or color or children, etc. Modeling on this level is done on Wikidata itself,
there is no fixed RDF or OWL schema or ontology.

The best you can get in terms of "downloading the wikidata ontology" would be to
download all properties and all the items representing classes. We currently
don't have a separate dump for these. Also, do not expect this to be a concise
or consistent model that can be used for reasoning. You are bound to find
contradictions and loose ends.




___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata


Re: [Wikidata] I'm calling it. We made it ;-)

2017-01-01 Thread Markus Kroetzsch
Very nice! Also great to see journalists getting their hands dirty with
some raw data.


Happy New Year everyone :-)

Markus

On 31.12.2016 11:57, Lydia Pintscher wrote:

Folks,

We're now officially mainstream ;-)
https://www.buzzfeed.com/katiehasty/song-ends-melody-lingers-in-2016?utm_term=.nszJxrKqR#.sknE4nVAg


Cheers
Lydia



___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata


Re: [Wikidata] Wikidata ontology

2017-01-07 Thread Markus Kroetzsch

On 07.01.2017 15:27, Gerard Meijssen wrote:

Hoi,
The biggest casualty of the current mess is that people like me do not
care at all about it. It cannot be explained, nobody is interested in
explaining it and consequently there is little use for it. It is "must
have" so it is there.. fine, lets move on.


The subclass of and instance of statements are actually used in very 
many WDQS queries, often with * expressions to navigate the hierarchy. 
WDQ also had a special feature TREE for this purpose. So I'd say that 
this part of the data is rather important to Wikidata. But if you find 
little use in it, that's ok too. Nevertheless, we should try to fix the 
modelling errors there, since they will affect many other people's 
Wikidata experience.


Cheers,

Markus



On 7 January 2017 at 10:39, Markus Kroetzsch
<markus.kroetz...@tu-dresden.de <mailto:markus.kroetz...@tu-dresden.de>>
wrote:



On 06.01.2017 18:24, Thomas Douillard wrote:

Same entity can be treated both as class and individual


This is valid for OWL as well.


Yes, and since Wikidata does not feature very powerful ontological
statements, you could treat this like in OWL 2 DL semantically as
well, i.e., a weak approach where the "class" and the "instance" are
not really identified works.

Nevertheless, using the ontology might still be challenging
depending on what you want to do with it, since there are quite a
few meta-levels (classes of classes of classes ...) that are not
cleanly separated. When I last checked, we even had some instance-of
cycles ;-) Even this is not a technical problem for the OWL
semantics, but maybe for some tools and approaches.

Cheers,

Markus


2017-01-05 22:21 GMT+01:00 Stas Malyshev
<smalys...@wikimedia.org <mailto:smalys...@wikimedia.org>
<mailto:smalys...@wikimedia.org <mailto:smalys...@wikimedia.org>>>:

Hi!

> The best you can get in terms of "downloading the wikidata
ontology" would be to
> download all properties and all the items representing
classes. We currently
> don't have a separate dump for these. Also, do not expect
this to be a concise
> or consistent model that can be used for reasoning. You
are bound to find
> contradictions and lose ends.

Also, Wikidata Toolkit
(https://github.com/Wikidata/Wikidata-Toolkit
<https://github.com/Wikidata/Wikidata-Toolkit>
<https://github.com/Wikidata/Wikidata-Toolkit
<https://github.com/Wikidata/Wikidata-Toolkit>>)
can be used to generate something like taxonomy - see e.g.


http://tools.wmflabs.org/wikidata-exports/rdf/exports/20160801/dump_download.html

<http://tools.wmflabs.org/wikidata-exports/rdf/exports/20160801/dump_download.html>


<http://tools.wmflabs.org/wikidata-exports/rdf/exports/20160801/dump_download.html

<http://tools.wmflabs.org/wikidata-exports/rdf/exports/20160801/dump_download.html>>

But one has to be careful with it as Wikidata may not (and
frequently
does not) follow assumptions that are true for proper OWL
models - there
are no limits on what can be considered a class, a subclass, an
instance, etc. Same entity can be treated both as class and
individual,
and there may be some weird structures, including even
outright errors
such as cycles in subclass graph, etc. And, of course, it
changes all
the time :)

--
Stas Malyshev
smalys...@wikimedia.org <mailto:smalys...@wikimedia.org>
<mailto:smalys...@wikimedia.org <mailto:smalys...@wikimedia.org>>

___
Wikidata mailing list
Wikidata@lists.wikimedia.org
<mailto:Wikidata@lists.wikimedia.org>
<mailto:Wikidata@lists.wikimedia.org
<mailto:Wikidata@lists.wikimedia.org>>
https://lists.wikimedia.org/mailman/listinfo/wikidata
<https://lists.wikimedia.org/mailman/listinfo/wikidata>
<https://lists.wikimedia.org/mailman/listinfo/wikidata
<https://lists.wikimedia.org/mailman/listinfo/wikidata>>




___
Wikidata mailing list
Wikidata@lists.wikimedia.org <mailto:Wikidata@lists.wikimedia.org>
https://lists.wikimedia.org/mailman/listinfo/wikidata
<https://lists.wikimedia.org/mailman/listinfo/wikidata>


___
Wikidata mailing list
Wikidata@lists.wikimedia.org <mailto:Wikidata@lists.wikim

Re: [Wikidata] Wikidata ontology

2017-01-09 Thread Markus Kroetzsch

On 09.01.2017 12:55, Gerard Meijssen wrote:

Hoi,
It is in the logic. When a king is a monarch and a monarch is a
politician I am fine. But when people insist that a "King of Iberia" is
a subclass it does not make sense. People hold the office of and it is
singular. When such things result in struggles, I think we have a problem.


I would say: "Every King of Iberia was also a king."

Only the "current king of Iberia" is a single person, but Wikidata is 
about all of history, so there are many such kings. The office of "King 
of Iberia" is still singular (it is a singular class) and it can have 
its own properties etc. I would therefore say (without having checked 
the page):


King of Iberia  instance of  office
King of Iberia  subclass of  king



I have asked in the past to explain the nonsense on items like monarch.
When I look at Reasonator there is so much that is plain problematic
that it is best to ignore it. What complicates it is that the ontology
seems to end with politician and that is a travesty in and of itself.
With other "occupations" there is a wealth of upper levels that seem to
be completely arbitrary and when asked I find it reasonable that nobody
steps up to explain because the consequences of answers are problematic.


I agree that there are many cases that need to be modelled in a more 
coherent way. I can only imagine progress in this area to happen on a 
case-by-case basis. One really has to look into the details and check 
what works best in each case. Often there is no wrong or right here, but 
there is a choice how to model things. But once a choice is made, it 
should be applied coherently throughout.


Regards,

Markus

___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata


Re: [Wikidata] Wikidata ontology

2017-01-07 Thread Markus Kroetzsch



On 06.01.2017 18:24, Thomas Douillard wrote:

Same entity can be treated both as class and individual


This is valid for OWL as well.


Yes, and since Wikidata does not feature very powerful ontological
statements, you could treat this semantically as in OWL 2 DL, i.e., a
weak approach works in which the "class" and the "instance" are not
really identified with each other.


Nevertheless, using the ontology might still be challenging depending on 
what you want to do with it, since there are quite a few meta-levels 
(classes of classes of classes ...) that are not cleanly separated. When 
I last checked, we even had some instance-of cycles ;-) Even this is not 
a technical problem for the OWL semantics, but maybe for some tools and 
approaches.


Cheers,

Markus



2017-01-05 22:21 GMT+01:00 Stas Malyshev >:

Hi!

> The best you can get in terms of "downloading the wikidata ontology" 
would be to
> download all properties and all the items representing classes. We 
currently
> don't have a separate dump for these. Also, do not expect this to be a 
concise
> or consistent model that can be used for reasoning. You are bound to find
> contradictions and lose ends.

Also, Wikidata Toolkit (https://github.com/Wikidata/Wikidata-Toolkit
)
can be used to generate something like taxonomy - see e.g.

http://tools.wmflabs.org/wikidata-exports/rdf/exports/20160801/dump_download.html



But one has to be careful with it as Wikidata may not (and frequently
does not) follow assumptions that are true for proper OWL models - there
are no limits on what can be considered a class, a subclass, an
instance, etc. Same entity can be treated both as class and individual,
and there may be some weird structures, including even outright errors
such as cycles in subclass graph, etc. And, of course, it changes all
the time :)

--
Stas Malyshev
smalys...@wikimedia.org 

___
Wikidata mailing list
Wikidata@lists.wikimedia.org 
https://lists.wikimedia.org/mailman/listinfo/wikidata





___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata



___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata


Re: [Wikidata] SPARQL service: "too many requests"

2016-12-19 Thread Markus Kroetzsch



On 19.12.2016 07:04, Stas Malyshev wrote:

Hi!


in SQID, I now frequently see WDQS responses of type 429 when trying to
load a page (doing this will usually issue a few dozen queries for
larger pages). How many SPARQL queries are users allowed to ask in a
certain time and how should tools behave when they hit this limit?


Right now, we have a limit of 5 parallel requests from the same IP. If
the requests are one after another, there should be no limitation. So I
would make sure the code does not make too much requests in parallel, at
the same time.



Ok, thanks, we should be able to implement this with some synchronisation.
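
A sketch of the pattern, in Python for illustration (SQID itself is a
JavaScript application): a semaphore keeps at most five SPARQL requests
in flight at the same time, matching the limit described above:

import threading
import requests

SPARQL_ENDPOINT = "https://query.wikidata.org/sparql"
MAX_PARALLEL = 5                     # current per-IP limit described above
slots = threading.BoundedSemaphore(MAX_PARALLEL)

def run_query(query):
    """Never have more than MAX_PARALLEL requests open at once."""
    with slots:
        r = requests.get(
            SPARQL_ENDPOINT, params={"query": query, "format": "json"}
        )
        r.raise_for_status()
        return r.json()["results"]["bindings"]

# A page view can start its few dozen queries in worker threads; the
# semaphore makes the extra ones wait instead of triggering HTTP 429.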

Best

Markus

___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata


Re: [Wikidata] Linked data fragment enabled on the Query Service

2016-12-21 Thread Markus Kroetzsch

Hi Stas,

Thanks for the info. Yes, all my comments apply to the ldf demo. I 
understand that it is a demo, and what the motivation is on paper, but 
if it returns incorrect results, then it is of little use. You can get 
those without any load on server or client ;-).


Also, there should be some way of doing queries that don't run on WDQS 
already, i.e., there must be something that times out now but can be 
done with ldf in a reasonable time [1]. Or are federated queries the 
main goal here? (that's still useful, but I hope that WDQS will also 
support a whitelisted set of external endpoints at some time)


Best,

Markus

[1] Upper bound for "reasonable": time it takes to download the RDF 
dump, install Blazegraph locally without timeout, load the dump, and get 
your query answered there ;-)


On 21.12.2016 18:10, Stas Malyshev wrote:

Hi!


(1) The results do not seem to be correct. The example query related to
films returns 55 results, while on the official endpoint it returns 128.
It seems that this is not because of missing data, but because of wrong
multiplicities (the correct result has several rows repeated multiple
times). Is there an implicit DISTINCT applied in this service somewhere?
Are there any other changes from the normal SPARQL semantics?


You mean the results on http://ldfclient.wmflabs.org/ or the results
returned directly from the endpoint?

Note that http://ldfclient.wmflabs.org/ is just a demo. It's not a
production service, it's just a showcase of what can be done using LDF.
So it's possible SPARQL implementation there is somehow buggy or
different from others. I think it relies on
https://www.npmjs.com/package/sparqljs and
https://www.npmjs.com/package/ldf-client - however I can't really vouch
on what happens in that code. I'll try to see where the differences come
from, but not sure it's worth spending too much time debugging a demo
service. If however something is wrong with the patterns themselves that
would be a serious issue.

This example is just a demo of how pattern fragments enable to take
SPARQL work out of the server to the client. It's not intended as a
production SPARQL service :)


(2) It is really slow. The sample query took 55s on my machine
(producing only half of the results), while it takes 0.2s on WDQS. I am
afraid that hard queries which would timeout on WDQS might take too long
to be used at all.


Well, comparing it to WDQS is not really fair :) It's a JS
implementation of SPARQL, running running in your browser and loading
data over the network. Still 55 s it too much - it finishes in around
10s for me. Maybe your network is slower?

I think if it is run on real hardware on stronger JS or Java engine
though it might be faster. Also, I'm sure there are other LDF clients,
such as Java one: http://linkeddatafragments.org/software/


However, I would still like to try it with one of our harder queries.
Can I use the service from a program (my harder queries have too many
results to be displayed in a browser -- this is why they are hard ;-)?


Yes. Please tell me if something goes wrong. You may want to use
non-browser client though.


Ideally, I would like to use it like a SPARQL service that I send a
request to. Is this possible?


Not for this one. This is a triple pattern service, by design - it can
be used by a client-side SPARQL implementation but it does not include
one. Including full SPARQL implementation into it would be contradictory
- the whole point of this is to shift the workload to the client, if we
shift it back to the server, we're back to the regular SPARQL endpoint
with all the limitations we must put on it. However, I'm pretty sure
something like Client.js or Client.java mentioned above can do SPARQL
queries - it's how the demo works.



___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata


Re: [Wikidata] Can LDF scale?

2016-12-23 Thread Markus Kroetzsch



On 23.12.2016 01:29, Stas Malyshev wrote:
...


I don't think we plan to invest more time than we already did into it.
The endpoint is up now, we don't really plan to do anything additional
with it - it's for the users now to see if it's useful. We'll be
watching to see whether it is not overtaxing resources and not dragging
SPARQL part down, but otherwise for now that's all the investment we're
doing for now. If we come up with some use case helpful for us we would
then implement it but nothing planned yet.


Sounds good. I support this as a community project, and I remain curious 
about the results, just as long as it does not affect the production 
usage of Wikidata.


Markus





___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata


Re: [Wikidata] Can LDF scale? (Was: Linked data fragment enabled on the Query Service)

2016-12-23 Thread Markus Kroetzsch



On 23.12.2016 01:16, Stas Malyshev wrote:

Hi!


So I did some light benchmarking, and it looks like a single server can
do 700 to 800 rps for TPF queries without significant rise in the load
(which is understandable since it's almost all IO). Single request
median time seems to be around 150ms and 99% time around 500ms.
This quick test was done on 150 parallel threads.


I've re-run the benchmark with best-practices setting on 150 threads
while randomizing the patterns I look up and it gave me over 1000 rps
with average response time around 150 ms. The load was slightly higher
but nowhere near the max.

So these are the parameters so far (remember that's for one server, so 3
servers ideally are supposed to do 3x of that).


Maybe I am slightly confused here. The number of 1000 requests per 
second seems to be too low if a single query leads to 100 rps, no? Or do 
you mean 1000K rps?


Of course adding more servers will help, like it also does with 
full-fledged SPARQL. But then there is no advantage compared to SPARQL. 
We know that we can do 20-30 SPARQL queries per second with two servers. 
If query execution times were the same for TPF (!), then this would
be 2000-3000 rps already. If this requires two servers as well, then 
there is no real advantage.


Best,

Markus


___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata

