[Wikidata] Re: State of the (Wiki)data

2022-11-02 Thread Markus Krötzsch

Dear all,

Thanks, Romaine, for this detailed and careful analysis of the 
situation. Much of it is spot-on. One of the main insights here is that 
we need more uniformity. In many places, Wikidata is still used like 
some exotic "structured" format for entering plain texts that make 
sense to human readers but prevent or confuse automated usage. The key 
is to "see" collections of items rather than single pages.


It seems Wikidata needs more stakeholder communities for specific 
areas (say, sports events) to oversee and guide the modelling of items 
of this kind. We need more WikiProjects.


Regarding the question of whether solutions need to be technical or 
social, I'd say both must go together. I, too, have often been 
disheartened by the sheer effort it would require to add even the most 
obvious statements to a larger set of items. Geography is a good 
example: there are so many nearby places that share the same 
geo-administrative history (take a look at the country, P17, of 
Dresden, Q1731), yet it is practically impossible to add this to any 
significant number of the thousands of German cities ... Here, as in 
many of the cases Romaine has described, the technical limitations may 
smother necessary community activity. (The specific case might also be 
an example of something where an approach of "data sharing" is needed, 
i.e. a modeling paradigm that simply allows us to say "this place has 
the same history of P17 statements as this other place"; but that's not 
the main topic of this post).


New tools may also enable and encourage the growth of communities that 
have not formed in the past decade. One aspect here might be that it is 
difficult for communities to appreciate the result of their efforts. For 
example, it is very difficult to create a uniform appearance for a group 
of pages, if only because the order of statements (within a group for 
the same property) is so hard to change, and because the pages are 
already very long. Even if one can achieve complete semantic uniformity, 
one will not currently have much opportunity to "see" this success. 
There are unsolved challenges here that cannot be compared with the 
relatively simple and small data that one can find in a typical 
Wikipedia infobox. External developers and maybe even researchers could 
contribute here, but they would also benefit from the input and concrete 
ideas of WikiProjects (Romaine's email already had quite a number of 
directly implementable ideas in it ... this kind of constructive input 
is already half of the solution).


Cheers,

Markus


On 31/10/2022 23:40, Romaine Wiki wrote:
Yesterday it was 10 years since Wikidata was founded, and two weeks 
ago Wikidata reached 100 million items. This is a good moment to see 
what we have (and don't have), to look back a bit, and also to express 
some hope for the future.


The idea to describe this started back in September, and since then I 
have done various analyses to get a picture. This will not be a 
complete overview, however, as there are too many factors involved; it 
is just a general picture of what I came across.


(Spoiler: This e-mail gets more structure further below. :-p)

== Structured? ==

Wikidata, it is said, contains structured data. I think we need to be 
more precise about this: it is how the data is stored that is 
structured. And this structure is _only_ present on an individual item. 
If we zoom out a little and view multiple items of a series, the data 
among items is often missing, fragmented, differently organised, and 
sometimes even problematic. On the multi-item level (series level), it 
depends entirely on whether a user has done all the work to synchronise 
the various items with each other or not.


*Example:* I came across a series of items about a certain sports 
tournament, with an edition organised each year for 50 years in a row. 
For P31 (instance of), on 5 items it was called an event, on 25 items a 
sporting event, on 13 items a tournament, on some others a competition, 
and a few had no P31 at all. To be clear, each edition had the same 
setup, was for the same sport, everything the same. The articles on 
Wikipedia are better structured!


This is just a simple series of items. Zooming out another level, the 
differences between series are huge, which makes the quality low.


How is a new item added? In the past ten years many items have been 
added with bots/tools based on the articles on Wikipedia. (Yes, I 
ignore other kinds of additions here.) In the future, many items will 
still be created when an article on Wikipedia has been created. In the 
worst case, the user adds the sitelink and the item stays empty 
(practically useless!). A little better: the user adds P31/P279 
(instance of/subclass of) (not useful by itself, but it helps). Better 
still: other statements are added as well (the item becomes useful). 
Better again when a user checks one or two other items in the series. 
Much better when a user checks all items

Re: [Wikidata] An answer to Lydia Pintscher regarding its considerations on Wikidata and CC-0

2017-11-30 Thread Markus Krötzsch

Dear Mathieu,

Your post demands my response since I was there when CC0 was first 
chosen (i.e., in the April meeting). I won't discuss your other claims 
here -- the discussions on the Wikidata list are already doing this, and 
I agree with Lydia that no shouting is necessary here.


Nevertheless, I must at least testify to what John wrote in his earlier 
message (quote included below this email for reference): it was not 
Denny's decision to go for CC0, but the outcome of a discussion among 
several people who had worked with open data for some time before 
Wikidata was born. I have personally supported this choice and still do. 
I have never received any money directly or indirectly from Google, 
though -- full disclosure -- I got several T-shirts for supervising in 
Summer of Code projects.


At no time did Google or any other company take part in our discussions 
in the zeroth hour of Wikidata. And why should they? From what I can see 
on their web page, Google has no problem with all kinds of different 
license terms in the data they display. Also, I can tell you that we 
would have reacted in a very allergic way to such attempts, so if any 
company had approached us, this would quite likely have backfired. But, 
believe it or not, when we started it was all but clear that this would 
become a relevant project at all, and no major company even cared to 
lobby us. It was still mostly a few hackers getting together in varying 
locations in Berlin. There was a lot of fun, optimism, and excitement in 
this early phase of Wikidata (well, I guess we are still in this phase).


So please do not start emails with made-up stories around past events 
that you have not even been close to (calling something "research" is no 
substitute for methodology and rigour). Putting unsourced personal 
attacks against community members before all other arguments is a 
reckless way of maximising effect, and such rhetoric can damage our 
movement beyond this thread or topic. Our main strength is not our 
content but our community, and I am glad to see that many have already 
responded to you in such a measured and polite way.


Peace,

Markus


On 30.11.2017 09:55, John Erling Blad wrote:
> Licensing was discussed at the start of the project, as in the start of
> developing code for the project, and as I recall the arguments for
> CC0 were valid and sound. That was long before Denny started working for
> Google.
>
> As I recall it was mentioned during the first week of the project (first week
> of April), and the discussion re-emerged during the first week of
> development. That must have been week 4 or 5 (first week of May), as the
> delivery of the laptops was delayed. I was against CC0 as I expected
> problems with reuse of external data. The arguments for CC0 convinced me.
>
> And yes, Denny argued for CC0, as did Daniel, and I believe Jeroen and
> Jens did too.



___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata


Re: [Wikidata-tech] Order of claims on entity page

2017-11-29 Thread Markus Krötzsch

Dear Vlad,

Ordering claims on a page as you suggest would not work well, since 
several other orders must take precedence over the order you suggest. 
First of all, statements are grouped by property and you don't want to 
change this. Hence, you cannot use the order across statements of 
different properties, since this would force you in some cases to 
ungroup (which would have other disadvantages).


Second, it makes sense to order statements of one property by other 
aspects, e.g., by time, to make it possible for humans to find 
something. Hence, again, we are not free to use the order to encode 
further information.


So what remains is to order qualifiers inside statements, but there it 
is rarely relevant (usually there are only a few qualifiers, and all of 
them can be seen at once without getting tired).


In summary, order does not lend itself as a way to encode much 
additional information, since there are usability concerns that make you 
want to change order in different contexts (or maybe for different 
users), since order cannot be preserved when remixing data, and since it 
is overall too implicit for people to build up a shared understanding of 
what it is supposed to mean (you don't want fights about whether some 
item has to be in fourth or fifth position of some list based on some 
vague understanding of "quality" or "trustworthiness" -- it would be 
very hard to find objective arguments for or against a particular order).


Cheers,

Markus

On 29.11.2017 12:45, Владимир Рябцев wrote:

OK Lydia, what is the purpose of giving order of qualifiers then?

Along with helping to give the user a better representation of the data, the order 
can be useful in automated processing of properties. To my mind, it should start with 
the most important entity data. Moreover, in case of a contradiction, I would 
assume that the first properties are "ranked" higher. After all, we are humans and 
pay more attention to the top of a page; our minds may get a bit tired by the end of 
it. In an ideal world you are right that order does not matter, but in 
reality it may help algorithms.

Vlad


On 29 Nov 2017, at 14:19, Lydia Pintscher wrote:


On Wed, Nov 29, 2017 at 11:14 AM, Владимир Рябцев wrote:
Thanks for the link with sorted properties. Is this page updated
automatically or maintained manually by someone? In the latter case this seems
fragile to me, because the order may become outdated at some point.


Yes it is maintained by hand by the editors.


It is curious that when properties are used as qualifiers we have a separate
field specifying the order (called 'qualifiers-order'). Why not add the
same at the top level of the entity definition?


It is just a heading to make the page more manageable - it doesn't
have a meaning beyond that.


Cheers
Lydia

--
Lydia Pintscher - http://about.me/lydia.pintscher
Product Manager for Wikidata

Wikimedia Deutschland e.V.
Tempelhofer Ufer 23-24
10963 Berlin
www.wikimedia.de

Wikimedia Deutschland - Gesellschaft zur Förderung Freien Wissens e. V.

Eingetragen im Vereinsregister des Amtsgerichts Berlin-Charlottenburg
unter der Nummer 23855 Nz. Als gemeinnützig anerkannt durch das
Finanzamt für Körperschaften I Berlin, Steuernummer 27/029/42207.

___
Wikidata-tech mailing list
Wikidata-tech@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata-tech


___
Wikidata-tech mailing list
Wikidata-tech@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata-tech




___
Wikidata-tech mailing list
Wikidata-tech@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata-tech


[Wikidata] SPARQL service: "too many requests"

2016-12-17 Thread Markus Krötzsch

Dear SPARQL team,

in SQID, I now frequently see WDQS responses with HTTP status 429 ("too 
many requests") when trying to load a page (doing this will usually 
issue a few dozen queries for larger pages). How many SPARQL queries are 
users allowed to ask in a given time, and how should tools behave when 
they hit this limit?
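
For now, a client could at least back off and retry; a minimal sketch 
(not an official recommendation, and it assumes a numeric Retry-After 
header may or may not be sent):

import time
import requests

SPARQL_URL = "https://query.wikidata.org/sparql"

def run_query(query, max_retries=5):
    """Send a SPARQL query, backing off whenever the server answers 429."""
    delay = 1.0
    for attempt in range(max_retries):
        r = requests.get(SPARQL_URL,
                         params={"query": query, "format": "json"})
        if r.status_code != 429:
            r.raise_for_status()
            return r.json()
        # Honour a numeric Retry-After header if present,
        # otherwise back off exponentially.
        retry_after = r.headers.get("Retry-After")
        wait = float(retry_after) if retry_after and retry_after.isdigit() else delay
        time.sleep(wait)
        delay *= 2
    raise RuntimeError("giving up after repeated 429 responses")

result = run_query("SELECT ?o WHERE { wd:Q42 wdt:P31 ?o } LIMIT 5")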


Best regards,

Markus

___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata


Re: [Wikidata] Browsing concepts and entities

2016-06-08 Thread Markus Krötzsch

On 08.06.2016 13:34, Satya Gadepalli wrote:

I want to look up concepts and entities in Wikidata by their name, even
if it contains typos or omissions.

Can I do this using Wikidata-Toolkit?


No, there is no error-tolerant string matching function in there. If no 
other tool can help you, Wikidata Toolkit could be used to get access to 
all labels and aliases, so that you can run the (slow) search yourself 
(deciding for each label whether you like it or not using custom code). 
But this is not the same as a live search interface.




Can I achieve this using a SPARQL query from the web interface?


SPARQL has several string matching functions available, including 
general regular expression matching. Running regular expressions over 
all labels and aliases for a big language will still take time, possibly 
too long for the timeout (since there is no easy way to index for 
arbitrary regexps). However, specialised patterns, such as searching for 
words by their initial letters, are quite fast. See the example query 
"Rock bands starting with M" for illustration.


If nothing else helps you, you could load the relevant data into a more 
specialised string-searching database such as Lucene. Wikidata Toolkit 
can parse the dumps for you in this case, so you don't have to implement 
the dump file decompression and parsing yourself, but you would have to 
write code to fill your DB. This would only give you a static version of 
the data, though; if you want live updates, that is more work.


Regards,

Markus



___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata


Re: [Wikidata] Accessing qualifier-specific data

2016-05-19 Thread Markus Krötzsch

On 19.05.2016 14:51, Markus Krötzsch wrote:

Here is a simple SPARQL query to get population numbers from (any time
in) 2015 of (arbitrary types of) entities, limited to 100 results:

SELECT ?entity ?entityLabel ?population ?time
WHERE
{
 ?entity p:P1082 ?statement .
 ?statement ps:P1082 ?population .
 ?statement pq:P585 ?time .
 FILTER (
   ?time > "2015-01-01T00:00:00Z"^^xsd:dateTime &&
   ?time < "2015-12-31T23:59:59Z"^^xsd:dateTime
 )

 SERVICE wikibase:label {
 bd:serviceParam wikibase:language "en" .
 }
}
LIMIT 100

See http://tinyurl.com/gwzubox

You can replace ?entity by something like wd:Q64 to query the population
of a specific place: http://tinyurl.com/jnajczu (I changed to 2014 here
since there are no 2015 figures for Berlin).

You could also add other qualifiers to narrow down statements further,
but of course only if Wikidata has such information in the first place.
I don't see many qualifiers other than P585 being used with population
statements, so this is probably of little use.


Well, there are some useful qualifiers in some cases, e.g., 
determination method. Here are estimated populations between 2000 and 
2015: http://tinyurl.com/zp6ymwr
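
The query behind that link is not repeated here, but a sketch of how 
such a qualifier is accessed (P459 is the determination method property, 
analogous to the time qualifier P585 used above) looks like this:

import requests

# Population statements together with their determination-method qualifier.
QUERY = """
SELECT ?entity ?population ?time ?method WHERE {
  ?entity p:P1082 ?statement .
  ?statement ps:P1082 ?population ;
             pq:P585 ?time ;
             pq:P459 ?method .
}
LIMIT 100
"""

r = requests.get("https://query.wikidata.org/sparql",
                 params={"query": QUERY, "format": "json"})
print(len(r.json()["results"]["bindings"]),
      "statements with a determination method")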


See https://tools.wmflabs.org/sqid/#/view?id=P1082 for more qualifiers.

Markus




Cheers

Markus

On 19.05.2016 14:35, Yetkin Sakal wrote:

The only way I could find to retrieve it is through the API.

https://www.wikidata.org/w/api.php?action=wbgetclaims&entity=Q2674064&property=P1082


How do I go about picking a population type (urban, rural, etc.) and
accessing its value? I cannot see such a qualifier, so what is the right
way to do it?


On Thursday, May 19, 2016 9:50 AM, Gerard Meijssen
<gerard.meijs...@gmail.com> wrote:


Hoi,
So 2015 is preferred, how do I then get the data for 1984?
Thanks.
 GerardM

On 18 May 2016 at 21:00, Stas Malyshev <smalys...@wikimedia.org> wrote:

Hi!

 > Is there any chance we can access qualifier-specific data on Wikidata?
 > For instance, we have two population properties on
 > https://www.wikidata.org/wiki/Q2674064 and want to access the value of
 > the first one (i.e., the population value for 2015).

What should happen is that 2015 value should be marked as preferred.
That is regardless of data access. Then probably something like this:

https://www.mediawiki.org/wiki/Extension:Wikibase_Client/Lua#mw.wikibase.entity:getBestStatements

can be used (not sure if there's better way to achieve the same).
Also not sure what #property would return...
--
Stas Malyshev
smalys...@wikimedia.org

___
Wikidata mailing list
Wikidata@lists.wikimedia.org <mailto:Wikidata@lists.wikimedia.org>
https://lists.wikimedia.org/mailman/listinfo/wikidata



___
Wikidata mailing list
Wikidata@lists.wikimedia.org <mailto:Wikidata@lists.wikimedia.org>
https://lists.wikimedia.org/mailman/listinfo/wikidata




___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata






___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata


Re: [Wikidata-tech] Wikidata entity dumps on Labs

2016-05-11 Thread Markus Krötzsch

On 11.05.2016 08:28, Marius Hoch wrote:

This time it took quite a long time to produce the dump in the first place
(until after 8 pm UTC for the gzip version; the bzip2 one didn't even
finish until Tuesday).

I presume that is due to one of the shards picking a slow database slave
which significantly slows that shard down. We should get new database
slaves soon, thus I presume that this problem is going to disappear soon.

Cheers,

Marius


That's alright, I was not actually worried about slow dump generation. 
What I noticed was that the dumps are available online many hours before 
they appear on Labs. I would like to use the central dump on labs 
instead of downloading my own copy each time, but right now this delays 
dump processing further. I was wondering who is providing the central 
entity dumps on labs.


Cheers,

Markus



On 10.05.2016 12:05, Markus Krötzsch wrote:

Pushing this up a bit again. The 9 May dump is not available on labs
yet. There is just the empty directory

/public/dumps/public/wikidatawiki/entities/20160509/

I really wonder why it might be taking so long.

Markus


On 02.05.2016 21:36, Markus Kroetzsch wrote:

Hi,

I noticed that there is considerable delay between the weekly Wikidata
JSON dump appearing online and the file appearing on the Labs servers
[1]. For example, the 20160502 dump is online right now, but there is
only an empty directory for this date on Labs.

In retrospect, the file modification dates on Labs make it look as if
the files had been there earlier, but they were not available at this
time last week either. As it is now, it is
faster to download the dump instead of waiting for the file to show up
in the central location, but it's probably not intended that each tool
gets its own copy. For a weekly dump, half a day of delay is
significant.

Any ideas (including whom to ask)?

Cheers,

Markus

[1] Under /public/dumps/public/wikidatawiki/entities/




___
Wikidata-tech mailing list
Wikidata-tech@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata-tech



___
Wikidata-tech mailing list
Wikidata-tech@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata-tech



___
Wikidata-tech mailing list
Wikidata-tech@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata-tech


Re: [Wikidata-tech] Wikidata entity dumps on Labs

2016-05-10 Thread Markus Krötzsch
Pushing this up a bit again. The 9 May dump is not available on labs 
yet. There is just the empty directory


/public/dumps/public/wikidatawiki/entities/20160509/

I really wonder why it might be taking so long.

Markus


On 02.05.2016 21:36, Markus Kroetzsch wrote:

Hi,

I noticed that there is considerable delay between the weekly Wikidata
JSON dump appearing online and the file appearing on the Labs servers
[1]. For example, the 20160502 dump is online right now, but there is
only an empty directory for this date on Labs.

In retrospect, the file modification dates on Labs make it look as if
the files had been there earlier, but they were not available at this
time last week either. As it is now, it is
faster to download the dump instead of waiting for the file to show up
in the central location, but it's probably not intended that each tool
gets its own copy. For a weekly dump, half a day of delay is significant.

Any ideas (including whom to ask)?

Cheers,

Markus

[1] Under /public/dumps/public/wikidatawiki/entities/




___
Wikidata-tech mailing list
Wikidata-tech@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata-tech


Re: [Wikidata] Wikimedia Blog post: "TED is partnering with the Wikimedia community..."

2016-04-22 Thread Markus Krötzsch

On 22.04.2016 18:41, Pine W wrote:

Good news blog post:
https://blog.wikimedia.org/2016/04/22/ted-wikimedia-collaboration/


Interesting. I just checked, and, lo and behold!, we really have TED 
talk information in Wikidata :-)


http://tools.wmflabs.org/sqid/#/view?id=Q23058816

The page shows general statistics on the various types of TED talks we 
have, and lets you browse to concrete examples. Almost 2,000 talks in 
total as of last Monday.


Markus


___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata


Re: [Wikidata] SPARQL service timeouts

2016-04-19 Thread Markus Krötzsch

On 19.04.2016 11:33, Addshore wrote:

Also per https://phabricator.wikimedia.org/T126730 and
https://gerrit.wikimedia.org/r/#/c/274864/8 requests to the query
service are now cached for 60 seconds.
I expect this will include error results from timeouts, so retrying a
request within the same 60 seconds as the first won't even reach the
WDQS servers now.


Maybe this could be the answer. Is it possible that the cache stores the 
truncated result but not the Java exception? Then the behaviour could be 
a timeout that is simply not reported properly. Ideally, partial results 
should not be cached, or the "timeout" itself should be cached so that a 
repeated request (within 60 sec) returns an immediate timeout rather 
than a broken result set.


Cheers,

Markus



On 19 April 2016 at 10:05, Addshore wrote:

In the case we are discussing here, the truncated JSON is caused by
Blazegraph deciding it has been sending data for too long and then
stopping (as I understand).
Thus you will only see a spike on the graph for the amount of data
actually sent from the server, not the size of the result blazegraph
was trying to send back.

I also ran into this with some simple queries that returned big sets
of data.
Although with my issue I did actually also see a Java exception
somewhere.

On 18 April 2016 at 21:51, Markus Kroetzsch wrote:

On 18.04.2016 22:21, Markus Kroetzsch wrote:

On 18.04.2016 21:56, Markus Kroetzsch wrote:

Thanks, the dashboard is interesting.

I am trying to run this query:

SELECT ?subC ?supC WHERE { ?subC p:P279/ps:P279 ?supC }

It is supposed to return a large result set. But I am
only running it
once per week. It used to work fine, but today I could
not get it to
succeed a single time.


Actually, the query seems to work as it should. I am
investigating why I
get an error in some cases on my machine.


Ok, I found that this is not so easy to reproduce reliably. The
symptom I am seeing is a truncated JSON response, which just
stops in the middle of the data (at a random location, but
usually early on), and which is *not* followed by any error
message. The stream just ends.

So far, I could only get this in Java, not in Python, and it
does not always happen. If successful, the result is about 250M
in size. The following Python script can retrieve it:

import requests
SPARQL_SERVICE_URL = 'https://query.wikidata.org/sparql'
query = """SELECT ?subC ?supC WHERE { ?subC p:P279/ps:P279 ?supC
}"""
print requests.get(SPARQL_SERVICE_URL, params={'query': query,
'format': 'json'}).text

(output should be redirected to a file)

I will keep an eye on the issue, but I don't know how to debug
this any further now, since it started to work without me
changing any code.

I also wonder how to read the dashboard after all. In spite of
me repeating an experiment that creates a 250M result file five
times in the past few minutes, the "Bytes out" figure
remains below a few MB for most of the time.


Markus



On 18.04.2016 21:40, Stas Malyshev wrote:

Hi!

I have the impression that some not-so-easy
SPARQL queries that used to
run just below the timeout are now timing out
regularly. Has there been
a change in the setup that may have caused this,
or are we maybe seeing
increased query traffic [1]?


We've recently run on a single server for couple of
days due to
reloading of the second one, so this may have made
it a bit slower. But
that should be gone now, we're back to two. Other
than that, not seeing
anything abnormal in

https://grafana.wikimedia.org/dashboard/db/wikidata-query-service

[1] The deadline for the Int. Semantic Web Conf.
is coming up, so it
might be that someone is running experiments on
the system to get their
paper finished. It has been observed for other
endpoints that traffic
increases at such times. This community
sometimes is the greatest enemy
of its own technology ... (I recently had to
  

Re: [Wikidata] SPARQL service timeouts

2016-04-19 Thread Markus Krötzsch

On 19.04.2016 11:05, Addshore wrote:

In the case we are discussing here, the truncated JSON is caused by
Blazegraph deciding it has been sending data for too long and then stopping
(as I understand).
Thus you will only see a spike on the graph for the amount of data
actually sent from the server, not the size of the result blazegraph was
trying to send back.


I successfully got five files of 250M JSON each, but even those 
successful queries did not show up in the stats. The five files came in 
three different versions (slightly different sizes), so they did not all 
come from a common cache either. Maybe the size is counted in terms of 
compressed or otherwise "raw" results?




I also ran into this with some simple queries that returned big sets of
data.
Although with my issue I did actually also see a Java exception somewhere.


I know the case where large result sets end in a Java timeout exception. 
This occurs reproducibly when you retrieve all humans or something like 
that. However, in my case, the behaviour is not always reproducible and 
there is no Java exception at the end of the output; it just stops in 
the middle of the file.


Markus



On 18 April 2016 at 21:51, Markus Kroetzsch wrote:

On 18.04.2016 22:21, Markus Kroetzsch wrote:

On 18.04.2016 21:56, Markus Kroetzsch wrote:

Thanks, the dashboard is interesting.

I am trying to run this query:

SELECT ?subC ?supC WHERE { ?subC p:P279/ps:P279 ?supC }

It is supposed to return a large result set. But I am only
running it
once per week. It used to work fine, but today I could not
get it to
succeed a single time.


Actually, the query seems to work as it should. I am
investigating why I
get an error in some cases on my machine.


Ok, I found that this is not so easy to reproduce reliably. The
symptom I am seeing is a truncated JSON response, which just stops
in the middle of the data (at a random location, but usually early
on), and which is *not* followed by any error message. The stream
just ends.

So far, I could only get this in Java, not in Python, and it does
not always happen. If successful, the result is about 250M in size.
The following Python script can retrieve it:

import requests
SPARQL_SERVICE_URL = 'https://query.wikidata.org/sparql'
query = """SELECT ?subC ?supC WHERE { ?subC p:P279/ps:P279 ?supC }"""
print requests.get(SPARQL_SERVICE_URL, params={'query': query,
'format': 'json'}).text

(output should be redirected to a file)

I will keep an eye on the issue, but I don't know how to debug this
any further now, since it started to work without me changing any code.

I also wonder how to read the dashboard after all. In spite of me
repeating an experiment that creates a 250M result file five
times in the past few minutes, the "Bytes out" figure remains below
a few MB for most of the time.


Markus



On 18.04.2016 21:40, Stas Malyshev wrote:

Hi!

I have the impression that some not-so-easy SPARQL
queries that used to
run just below the timeout are now timing out
regularly. Has there been
a change in the setup that may have caused this, or
are we maybe seeing
increased query traffic [1]?


We've recently run on a single server for couple of days
due to
reloading of the second one, so this may have made it a
bit slower. But
that should be gone now, we're back to two. Other than
that, not seeing
anything abnormal in

https://grafana.wikimedia.org/dashboard/db/wikidata-query-service

[1] The deadline for the Int. Semantic Web Conf. is
coming up, so it
might be that someone is running experiments on the
system to get their
paper finished. It has been observed for other
endpoints that traffic
increases at such times. This community sometimes is
the greatest enemy
of its own technology ... (I recently had to
IP-block an RDF crawler
from one of my sites after it had ignored robots.txt
completely).


We don't have any blocks or throttle mechanisms right
now. But if we see
somebody making serious negative impact on the service,
we may have to
change that.







--
Markus Kroetzsch
Faculty of Computer Science
Technische 

Re: [Wikidata] Status and ETA External ID conversion

2016-03-06 Thread Markus Krötzsch
Another reason why "uniqueness" is not such a good criterion: it cannot 
be applied to decide the type of a newly created property (no 
statements, no uniqueness score). In general, the fewer statements there 
are for a property, the more likely they are to be unique. The criterion 
rewards data incompleteness (example: if Luca deletes the six multiple 
ids he mentioned, then the property could be converted -- and he could 
later add the statements again). If you think about it, it does not seem 
like a very good idea to make the datatype of a property depend on its 
current usage in Wikidata.


Markus

On 05.03.2016 17:15, Markus Krötzsch wrote:

Hi,

I agree with Egon that the uniqueness requirement is rather weird. What
it means is that a thing is only considered an "identifier" if it points
to a database that uses a similar granularity for modelling the world as
Wikidata. If the external database is more fine-grained than Wikidata
(several ids for one item), then it is not a valid "identifier",
according to the uniqueness idea. I wonder what good this may do. In
particular, anybody who cares about uniqueness can easily determine it
from the data without any property type that says this.

Markus


On 05.03.2016 15:35, Egon Willighagen wrote:

On Sat, Mar 5, 2016 at 3:25 PM, Lydia Pintscher
<lydia.pintsc...@wikimedia.de> wrote:

On Sat, Mar 5, 2016 at 3:17 PM Egon Willighagen
<egon.willigha...@gmail.com>

What is the exact process? Do you just plan to wait longer to see if
anyone supports/contradicts my tagging? Should I get other Wikidata
users and contributors to back up my suggestion?


Add them to the list Katie linked if you think they should be
converted. We
wait a bit to see if anyone disagrees and I also do a quick sanity
check for
each property myself before conversion.


I am adding comments for now. I am also looking at the comments for
what it takes to be "identifier":

https://www.wikidata.org/wiki/User:Addshore/Identifiers#Characteristics_of_external_identifiers


What is the resolution on these? There are some strong, often
contradictory, opinions...

For example, the uniqueness requirement is interesting... if an
identifier must be unique for a single Wikidata entry, this
effectively disqualifies most identifiers used in the life
sciences... simply because Wikidata rarely has the exact same concept
as the remote database.

I'm sure we can give examples from any life science field, but
consider a gene: the concept of a gene in Wikidata is not like a gene
sequence in a DNA sequence database. Hence, an identifier from that
database could not be linked as "identifier" to that Wikidata entry.

Same for most identifiers for small organic compounds (like drugs,
metabolites, etc). I already commented on CAS (P231) and InChI (P234),
both are used as identifier, but none are unique to concepts used as
"types" in Wikidata. The CAS for formaldehyde and formaline is
identical. The InChI may be unique, but only of you strongly type the
definition of a chemical graph instead of a substance (as is now)...
etc.

So, in order to make a decision which chemical identifiers should be
marked as "identifier" type depends on resolution of those required
characteristics...

Can you please inform me about the state of those characteristics
(accepted or declined)?

Egon


Cheers
Lydia
--
Lydia Pintscher - http://about.me/lydia.pintscher
Product Manager for Wikidata

Wikimedia Deutschland e.V.
Tempelhofer Ufer 23-24
10963 Berlin
www.wikimedia.de

Wikimedia Deutschland - Gesellschaft zur Förderung Freien Wissens e. V.

Eingetragen im Vereinsregister des Amtsgerichts Berlin-Charlottenburg
unter
der Nummer 23855 Nz. Als gemeinnützig anerkannt durch das Finanzamt für
Körperschaften I Berlin, Steuernummer 27/029/42207.

___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata










___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata


Re: [Wikidata] Status and ETA External ID conversion

2016-03-05 Thread Markus Krötzsch

On 05.03.2016 14:45, Lydia Pintscher wrote:

On Sat, Mar 5, 2016 at 1:28 PM Markus Krötzsch
<mar...@semantic-mediawiki.org <mailto:mar...@semantic-mediawiki.org>>
wrote:

Thanks, Katie. I see that the external ID datatype does not work as
planned. At least I thought the original idea was to clean up the UI by
moving hard-to-understand string IDs to a separate section. From the
discussions on these pages, I see that the community uses criteria that
are completely unrelated to UI aspects, but have something to do with
the degree to which the property encodes a one-to-one mapping. I guess
this is also valid, but won't be useful for UI purposes. I will need to
use another solution for my case then.


Give it another 2 to 3 weeks and it'll get there. More and more editors
are exposed to the separation in the UI now and start noticing the ones
that intuitively should be moved into the identifier section.


Ok, let's see what happens. I am not saying that the other criteria 
applied now in the discussions are bad. It's just a different use of the 
datatype than I would have expected.


Markus



Cheers
Lydia
--
Lydia Pintscher - http://about.me/lydia.pintscher
Product Manager for Wikidata

Wikimedia Deutschland e.V.
Tempelhofer Ufer 23-24
10963 Berlin
www.wikimedia.de <http://www.wikimedia.de>

Wikimedia Deutschland - Gesellschaft zur Förderung Freien Wissens e. V.

Eingetragen im Vereinsregister des Amtsgerichts Berlin-Charlottenburg
unter der Nummer 23855 Nz. Als gemeinnützig anerkannt durch das
Finanzamt für Körperschaften I Berlin, Steuernummer 27/029/42207.


___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata




___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata


[Wikidata] Status and ETA External ID conversion

2016-03-05 Thread Markus Krötzsch

Hi,

I noticed that many id properties still use the string datatype 
(including extremely frequent ids like 
https://www.wikidata.org/wiki/Property:P213 and 
https://www.wikidata.org/wiki/Property:P227).


Why is the conversion so slow, and when is it supposed to be completed?

Cheers,

Markus

___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata


Re: [Wikidata-tech] Caches for Special:EntityData json

2016-02-29 Thread Markus Krötzsch

Hi,

Yes, 31 days is indeed quite long.

Another workaround for some people might be to use the API 
(action=wbgetentities) instead. However, this was not so easy for me 
since my JavaScript application complains about the cross-site request 
here, even though this action is only for reading and has no sensitive 
functionality (such as login). I am sure that there is some way around 
this, but I would have to look into it. I will use the purge for now.
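
For completeness, the API route is straightforward from a non-browser 
client, which avoids the cross-site issue entirely; a sketch:

import requests

API = "https://www.wikidata.org/w/api.php"

def get_entity(qid):
    """Fetch current entity data via the read-only wbgetentities action."""
    params = {"action": "wbgetentities", "ids": qid, "format": "json"}
    r = requests.get(API, params=params)
    r.raise_for_status()
    return r.json()["entities"][qid]

entity = get_entity("Q17444909")
print(sorted(entity["claims"]))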


Thanks,

Markus


On 29.02.2016 23:59, Stas Malyshev wrote:

Hi!


Output from Special:EntityData is cached for 31 days. Looking at the code, it
seems we are not automatically purging the web caches when an entity is edited -
please file a ticket for that. I think we originally decided against it for
performance reasons (there are quite a few URLs to purge for every edit), but I
suppose we should look into that again.


That would be nice. Right now I'm using cache-defeating URL to fetch
data in WDQS because obviously getting 31-day old data is not good. But
if data for Special:EntityData URLs would be purged on edit that could
allow to simplify that part a bit and maybe also save some performance
impact when running multiple WDQS instances.


You can force the cache to be purged by setting action=purge in the request.
Note that this will purge all serializations of the entity, not just the one
requested.


That also what would happen on edit, I assume.




___
Wikidata-tech mailing list
Wikidata-tech@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata-tech


[Wikidata-tech] Caches for Special:EntityData json

2016-02-29 Thread Markus Krötzsch

Hi,

I found that Special:EntityData returns outdated JSON data that is not 
in agreement with the page. I have fetched the data using wget to ensure 
that no browser cache is in the way. Concretely, I have been looking at


https://www.wikidata.org/wiki/Special:EntityData/Q17444909.json

where I recently changed the P279 value from Q217594 to Q16889133. Of 
course, this might no longer be a valid example when you read this email 
(in case the cache gets updated at some point).


Is this a bug in the configuration of the HTTP (or other) cache, or is 
this the desired behaviour? When will the cache be cleared?
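
One workaround is to request an explicit purge before fetching; a sketch 
(assuming a GET purge is accepted; a POST may be required depending on 
configuration):

import requests

URL = "https://www.wikidata.org/wiki/Special:EntityData/Q17444909.json"

# A plain GET may return a stale, cached serialization.
stale = requests.get(URL).json()

# Request a purge first, then fetch again; this is a sketch, not a
# documented guarantee.
requests.get(URL, params={"action": "purge"})
fresh = requests.get(URL).json()

# The P279 value should now reflect the recent edit.
print(fresh["entities"]["Q17444909"]["claims"]["P279"][0]
      ["mainsnak"]["datavalue"]["value"]["id"])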


Thanks,

Markus

___
Wikidata-tech mailing list
Wikidata-tech@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata-tech


Re: [Wikidata] SPARQL returns bnodes for some items

2016-02-26 Thread Markus Krötzsch

On 26.02.2016 13:32, James Heald wrote:

These are used as placeholders for the meta-values "unknown value" and
"no value" aren't they ?


Oh, right! I had not considered it possible that any subclass-of 
statement would use "unknown value". All classes could at least be 
subclasses of "Entity" if nothing else can be said about them. For some 
of the cases in the list I found, it is also unclear why they should be 
classes at all.


Anyway, this at least clarifies that there is no problem with the RDF 
export/import here and instead we just have some strange data.


Markus




On 26/02/2016 12:27, Markus Kroetzsch wrote:

Hi Stas, hi all,

I just noted that BlazeGraph seems to contain a few erroneous triples.
The following query, for example, returns a blank node "t7978245":

SELECT ?superClass WHERE {
  <http://www.wikidata.org/entity/Q595133> p:P279/ps:P279 ?superClass
}

https://query.wikidata.org/#SELECT%20%3FsuperClass%20WHERE%20{%0A%20%20%3Chttp%3A%2F%2Fwww.wikidata.org%2Fentity%2FQ595133%3E%20p%3AP279%2Fps%3AP279%20%3FsuperClass%0A}



I stumbled upon six cases like this (for P279): Q595133 (shown above),
Q1691488, Q11259005, Q297106, Q1293664, and Q539558. This would be less
than 0.001% of the 623,963 P279 statements, but it's still enough to
have application code trip over the unexpected return format ;-).
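
Until the data is fixed, application code can defend itself by filtering 
such placeholders out explicitly; a workaround sketch:

import requests

# Skip "unknown value" placeholders (blank nodes) in the result.
QUERY = """
SELECT ?subC ?supC WHERE {
  ?subC p:P279/ps:P279 ?supC .
  FILTER(!isBlank(?supC))
}
LIMIT 100
"""

r = requests.get("https://query.wikidata.org/sparql",
                 params={"query": QUERY, "format": "json"})
print(len(r.json()["results"]["bindings"]), "rows, none of them blank nodes")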

Best

Markus




___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata



___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata


Re: [Wikidata] from Freebase to Wikidata: the great migration

2016-02-23 Thread Markus Krötzsch

On 23.02.2016 16:30, Tom Morris wrote:
...



Or the paper might be off. Addressing the flaws in the paper would
require a full paper in its own right.


Criticising papers is good academic practice. Doing so without factual 
support, however, is not. You may be right, but you should try to 
produce a bit more evidence than your intuition.


[...]


The paper says in section 4, "At the time of writing (January, 2016),
the tool has been used by more than a hundred users who performed about
90,000 approval or rejection actions." which probably means ~80,000 new
statements (since ~10% get rejected). My 106K number is from the current
dashboard.


As Gerard has pointed out before, he prefers to re-enter statements 
instead of approving them. This means that the real number of "imported" 
statements is higher than what is shown in the dashboard (how much so 
depends on how many statements Gerard and others with this approach have 
added). It seems that one should rather analyse the number of statements 
that are already in Wikidata than just the ones that were approved directly.


Markus


___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata


Re: [Wikidata] from Freebase to Wikidata: the great migration

2016-02-21 Thread Markus Krötzsch

On 21.02.2016 20:37, Tom Morris wrote:

On Sun, Feb 21, 2016 at 11:41 AM, Markus Krötzsch
<mar...@semantic-mediawiki.org <mailto:mar...@semantic-mediawiki.org>>
wrote:

On 18.02.2016 15:59, Lydia Pintscher wrote:


Thomas, Denny, Sebastian, Thomas, and I have published a paper
which was
accepted for the industry track at WWW 2016. It covers the migration
from Freebase to Wikidata. You can now read it here:
http://research.google.com/pubs/archive/44818.pdf


Is it possible that you have actually used the flawed statistics
from the Wikidata main page regarding the size of the project? 14.5M
items in Aug 2015 seems far too low a number. Our RDF exports from
mid August already contained more than 18.4M items. It would be nice
to get this fixed at some point. There are currently almost 20M
items, and the main page still shows only 16.5M.


Numbers are off throughout the paper.  They also quote 48M instead of
58M topics for Freebase and mischaracterize some other key points. They
key number is that 3.2 billion facts for 58 million topics has generated
106,220 new statements for Wikidata. If my calculator had more decimal
places, I could tell you what percentage that is.


Obviously, any tool can only import statements for which we have items 
and properties at all, so the number of importable facts is much lower. 
I don't think anyone at Google could change this (they cannot override 
notability criteria, and they cannot even lead discussions to propose 
new content).


Markus


___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata


Re: [Wikidata] from Freebase to Wikidata: the great migration

2016-02-21 Thread Markus Krötzsch

On 18.02.2016 15:59, Lydia Pintscher wrote:

Hey everyone :)

Thomas, Denny, Sebastian, Thomas, and I have published a paper which was
accepted for the industry track at WWW 2016. It covers the migration
from Freebase to Wikidata. You can now read it here:
http://research.google.com/pubs/archive/44818.pdf


Congratulations!

Is it possible that you have actually used the flawed statistics from 
the Wikidata main page regarding the size of the project? 14.5M items in 
Aug 2015 seems far too low a number. Our RDF exports from mid August 
already contained more than 18.4M items. It would be nice to get this 
fixed at some point. There are currently almost 20M items, and the main 
page still shows only 16.5M.


Markus


___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata


Re: [Wikidata] from Freebase to Wikidata: the great migration

2016-02-21 Thread Markus Krötzsch

On 21.02.2016 16:00, Gerard Meijssen wrote:

Hoi,
I add statements of the primary sources tool in preference to add them
myself (Primary Sources takes more time).

I am still of the strongest opinion that given the extremely
disappointing number of added statements the Primary Sources tool is a
failure.


What is the number of added statements you refer to?

Markus



It is sad that all the good work of Freebase is lost in this way. It is
sad that we cannot even discuss this and consider alternatives.
Thanks,
GerardM

On 18 February 2016 at 18:07, Maximilian Klein wrote:

Congratulations on a fantastic project and a your acceptance in WWW2016.

Make a great day,
Max Klein ‽ http://notconfusing.com/

On Thu, Feb 18, 2016 at 10:54 AM, Federico Leva (Nemo)
> wrote:

Lydia Pintscher, 18/02/2016 15:59:

Thomas, Denny, Sebastian, Thomas, and I have published a
paper which was
accepted for the industry track at WWW 2016. It covers the
migration
from Freebase to Wikidata. You can now read it here:
http://research.google.com/pubs/archive/44818.pdf


Nice!

> Concluding, in a fairly short amount of time, we have been
> able to provide the Wikidata community with more than
> 14 million new Wikidata statements using a customizable

I must admit that, despite knowing the context, I wasn't able to
understand whether this is the number of "mapped"/"translated"
statements or the number of statements actually added via the
primary sources tool. I assume the latter given paragraph 5.3:

> after removing duplicates and facts already contained in Wikidata,
> we obtain 14 million new statements. If all these statements were
> added to Wikidata, we would see a 21% increase of the number of
> statements in Wikidata.


I was confused about that too. "the [Primary Sources] tool has been
used by more than a hundred users who performed about
90,000 approval or rejection actions. More than 14 million
statements have been uploaded in total."  I think that means that ≤
90,000 items or statements were added, out of the 14 million available
to be added through the Primary Sources tool.


Nemo

___
Wikidata mailing list
Wikidata@lists.wikimedia.org 
https://lists.wikimedia.org/mailman/listinfo/wikidata



___
Wikidata mailing list
Wikidata@lists.wikimedia.org 
https://lists.wikimedia.org/mailman/listinfo/wikidata




___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata




___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata


[Wikidata] Undoing merges (bibliographic articles are not humans)

2016-02-18 Thread Markus Krötzsch

Hi all,

What is the correct process to undo merges? There are three cases where 
a bibliographical article (in Wikisource) has been accidentally merged 
with the human the article is about:


http://www.wikidata.org/entity/Q85393
http://www.wikidata.org/entity/Q312607
http://www.wikidata.org/entity/Q320923

The merge in each case should be undone, and the "main subject" property 
should be set instead, like here:


https://www.wikidata.org/wiki/Q15985561

But the "undo" option in the history just deletes the merged-in 
statements (it seems) without restoring the old page. What to do?


Markus

___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata


Re: [Wikidata] SPARQL endpoint caching

2016-02-17 Thread Markus Krötzsch

On 17.02.2016 10:34, Magnus Manske wrote:



On Wed, Feb 17, 2016 at 7:16 AM Stas Malyshev wrote:


Well, again the problem is that one use case that I think absolutely
needs caching - namely, exporting data to graphs, maps, etc. deployed on
wiki pages - is also the one not implemented yet because we don't have
cache (not only, but one of the things we need) so we've got chicken and
egg problem here :) Of course, we can just choose something now based on
educated guess and change it later if it works badly. That's probably
what we'll do.

Wouldn't those usecases be wrapped in an extension or WMF-controlled
JavaScript? In that case, queries could always indicate that use, and
they could be cached, for hours if need be. No reason to put everything
behind a long cache by default, just because of those controllable cases.

Also, what about creating more independent blazegraph instances? One (or
more) could be for wiki extension queries, with long cache; others could
be for Labs use (internal network only?), a "general" external-facing
server, etc.

If the problem (and it's not even certain we have one) can be mitigated
or solved with throwing a few more VMs at it, I'm all for it :-)


+1 for adding some servers before building complicated caching solutions

I think long caching periods for wiki queries could lead to user 
frustration for the reasons I gave in my other post. But maybe one can 
simply give the user a way to say "please recompute this query now" to 
avoid this.


Another thing one could do for wiki-based (and other) queries is to use 
caches as fallbacks in case of timeouts: "we will try our best to give 
you a fresh result, but if current load is too high, we will at least 
give you some older result." This makes most sense for wiki-based 
queries which repeat reliably over time (so it makes sense to keep one 
result, however old it is, for all queries still used on some wiki 
page). Would be more work to implement, so probably not the first thing 
to do without any real wiki usage experiences yet.


Markus


___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata


Re: [Wikidata] SPARQL endpoint caching

2016-02-17 Thread Markus Krötzsch

On 17.02.2016 09:54, Katie Filbert wrote:
...



I think it would be nice if having a graph with query on a page does not
too much adversely affect the time it takes to save a page. (e.g. if
running the query takes 20 seconds..., and instead reuse cached query
results)  And not have such usage kill / overwhelm the query service, is
also important.

If we incorporate entity usage or something like that, then maybe that
could be used to handle cache invalidation in cases something used in a
query changed.


This might be one of the more complex cache maintenance strategies that 
I had delegated to BlazeGraph above. It is not too hard to monitor 
objects in a query result for changes, but for cache invalidation to 
work reliably, you would also have to watch out for items that are only 
becoming part of the result because of the changes. For example, a query 
for the largest cities would need to be updated if someone creates a new 
city (item) that is larger than all other cities. So you have to monitor 
all items to update the query, not just those used in the current result.


Markus




___
Wikidata mailing list
Wikidata@lists.wikimedia.org 
https://lists.wikimedia.org/mailman/listinfo/wikidata




--
Katie Filbert
Wikidata Developer

Wikimedia Germany e.V. | Tempelhofer Ufer 23-24, 10963 Berlin
Phone (030) 219 158 26-0

http://wikimedia.de

Wikimedia Germany - Society for the Promotion of free knowledge eV
Entered in the register of Amtsgericht Berlin-Charlottenburg under the
number 23 855 as recognized as charitable by the Inland Revenue for
corporations I Berlin, tax number 27/681/51985.


___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata




___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata


Re: [Wikidata] SPARQL CONSTRUCT results truncated

2016-02-16 Thread Markus Krötzsch

Hi Joachim,

I think SERVICE queries should be working, but maybe Stas knows more 
about this. Even if they are disabled, this should result in a proper 
error message rather than a NullPointerException. Looks like a bug.


Markus


On 16.02.2016 13:56, Neubert, Joachim wrote:

Hi Markus,

Great that you checked that out. I can confirm that the simplified query worked 
for me, too. It took 15.6s and revealed roughly the same number of results 
(323789).

When I loaded the results into http://zbw.eu/beta/sparql/econ_pers/query, an endpoint for 
"economics-related" persons, it matched with 36050 persons (supposedly the "most 
important" 8 percent of our set).

What I would normally do to get the corresponding Wikipedia site URLs is a query against the 
Wikidata endpoint, which references the relevant Wikidata URIs via a "service" 
clause:

PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
PREFIX schema: <http://schema.org/>
#
construct {
   ?gnd schema:about ?sitelink .
}
where {
   service  {
 ?gnd skos:prefLabel [] ;
  skos:exactMatch ?wd .
 filter(contains(str(?wd), 'wikidata'))
   }
   ?sitelink schema:about ?wd ;
 schema:inLanguage ?language .
   filter (contains(str(?sitelink), 'wikipedia'))
   filter (lang(?wdLabel) = ?language && ?language in ('en', 'de'))
}

This however results in a java error.

If "service" clauses are supposed to work in the wikidata endpoint, I'd happily 
provide addtitional details in phabricator.

For now, I'll get the data via your java example code :)

Cheers, Joachim

-Ursprüngliche Nachricht-
Von: Wikidata [mailto:wikidata-boun...@lists.wikimedia.org] Im Auftrag von 
Markus Kroetzsch
Gesendet: Samstag, 13. Februar 2016 22:56
An: Discussion list for the Wikidata project.
Betreff: Re: [Wikidata] SPARQL CONSTRUCT results truncated

And here is another comment on this interesting topic :-)

I just realised how close the service is to answering the query. It turns out that 
you can in fact get the whole result set (currently >324,000 items) together 
with their GND identifiers as a download *within the timeout* (I tried several times 
without any errors). This is a 63 MB JSON result file with >640K individual values, 
and it downloads in no time on my home network. The query I use is simply this:

PREFIX wd:  <http://www.wikidata.org/entity/>
PREFIX wdt: <http://www.wikidata.org/prop/direct/>

select ?item ?gndId
where {
  ?item wdt:P227 ?gndId ; # get GND ID
        wdt:P31  wd:Q5  . # instance of human
} ORDER BY ASC(?gndId) LIMIT 10

(don't run this in vain: even with the limit, the ORDER clause requires the 
service to compute all results every time someone runs this. Also be careful 
when removing the limit; your browser may hang on an HTML page that large; 
better use the SPARQL endpoint directly to download the complete result file.)
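
For such downloads, streaming the (order-free, limit-free) result 
straight to a file works well; a sketch:

import requests

QUERY = """
SELECT ?item ?gndId WHERE {
  ?item wdt:P227 ?gndId ;   # GND identifier
        wdt:P31  wd:Q5 .    # instance of human
}
"""

# Stream the large JSON result directly to disk instead of holding it
# in memory or rendering it in a browser.
r = requests.get("https://query.wikidata.org/sparql",
                 params={"query": QUERY, "format": "json"},
                 stream=True)
r.raise_for_status()
with open("gnd-humans.json", "wb") as out:
    for chunk in r.iter_content(chunk_size=1 << 20):
        out.write(chunk)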

It seems that the timeout is only hit when adding more information (labels and 
wiki URLs) to the result.

So it seems that we are not actually very far away from being able to answer 
the original query even within the timeout. Certainly not as far away as I 
first thought. It might not be necessary at all to switch to a different 
approach (though it would be interesting to know how long LDF takes to answer 
the above -- our current service takes less than 10sec).

Cheers,

Markus


On 13.02.2016 11:40, Peter Haase wrote:

Hi,

you may want to check out the Linked Data Fragment server in Blazegraph:
https://github.com/blazegraph/BlazegraphBasedTPFServer

Cheers,
Peter

On 13.02.2016, at 01:33, Stas Malyshev  wrote:

Hi!


The Linked Data Fragments approach Osma mentioned is very
interesting (particularly the bit about setting it up on top of a
regularly updated existing endpoint), and could provide another
alternative, but I have not yet experimented with it.


There is apparently this:
https://github.com/CristianCantoro/wikidataldf
though not sure what it its status - I just found it.

In general, yes, I think checking out LDF may be a good idea. I'll
put it on my todo list.

--
Stas Malyshev
smalys...@wikimedia.org

___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata



___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata




--
Markus Kroetzsch
Faculty of Computer Science
Technische Universität Dresden
+49 351 463 38486
http://korrekt.org/

___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata
___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata




___
Wikidata mailing list

Re: [Wikidata] Wikidata Propbrowse

2016-02-15 Thread Markus Krötzsch

On 15.02.2016 11:52, Hay (Husky) wrote:

On Mon, Feb 15, 2016 at 10:56 AM, André Costa  wrote:

Would it be possible to set the language used to search with? Whilst I most
often use English on Wikidata I'm sure a lot of people don't.

Not yet. The query takes quite a while, so it's not done in realtime but
every 24 hours, and then it's compiled into the HTML list. Adding
multi-language support would be a bit more cumbersome. I'm open to
pull requests though ;)


Yes, as usual: it is easy to support *any* language, but tricky to 
support *all* languages.




Markus wrote:

I would just filter this in code; a more complex SPARQL query only gets 
slower.
Here is a little example Python script that gets all the data you need:

Ah, excellent. In that case i'll just do a query and filter in Python.
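
(A minimal sketch of such a query-and-filter approach, assuming the public 
SPARQL endpoint at https://query.wikidata.org/sparql, its predeclared 
wd:/wdt:/wikibase: prefixes, and the Python requests library; Q19847637 is 
"Wikidata property representing a unique identifier".)

import requests

ENDPOINT = "https://query.wikidata.org/sparql"
QUERY = """
SELECT ?p ?pLabel ?instanceOf WHERE {
  ?p a wikibase:Property .
  OPTIONAL { ?p wdt:P31 ?instanceOf . }
  SERVICE wikibase:label { bd:serviceParam wikibase:language "en" . }
}
"""

def fetch_properties():
    r = requests.get(ENDPOINT, params={"query": QUERY},
                     headers={"Accept": "application/sparql-results+json"})
    r.raise_for_status()
    return r.json()["results"]["bindings"]

# Client-side filtering: keep only properties that are an instance of
# "Wikidata property representing a unique identifier" (Q19847637).
IDENTIFIER_CLASS = "http://www.wikidata.org/entity/Q19847637"
rows = fetch_properties()
identifier_props = {row["p"]["value"] for row in rows
                    if row.get("instanceOf", {}).get("value") == IDENTIFIER_CLASS}
print(len(identifier_props), "identifier properties")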


I intend to use this in our upcoming new class/property browser as well.
Maybe it would actually make sense to merge the two applications at some point

I hope the propbrowser will be made irrelevant by improvements in
other tools and the main Wikidata site ;)


We shall see. For now, it is certainly not obsolete. There might also be 
different tools with different specialisations.


Markus




-- Hay

___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata




___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata


Re: [Wikidata] Wikidata Propbrowse

2016-02-14 Thread Markus Krötzsch

On 14.02.2016 16:04, Hay (Husky) wrote:

On Sun, Feb 14, 2016 at 3:53 PM, Jane Darnell  wrote:

Now I suddenly understand why we should have "properties for properties"  so
we can categorize these things. It would be nice to have a list of
"authority control"  properties and also the number of times a property is
used.

You can already do that in propbrowse. Just filter for 'unique
identifier' and you'll get all properties that are an instance of
Q19847637 (Wikidata property representing a unique identifier).

Considering the number of times a property is used: i think that would be
really interesting. Unfortunately, i don't know of a way to easily get
a count of how many times a property is used. The 'linkshere'
functionality in the API doesn't give a count.
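
(One way to get such a count, as a rough illustration: a COUNT aggregate 
against the public SPARQL endpoint. P227 is used purely as an example property 
here, and such aggregates can be slow or even time out for very frequent 
properties.)

import requests

def statement_count(prop_id):
    # Counts truthy (direct) statements for the given property, e.g. "P227".
    query = "SELECT (COUNT(*) AS ?c) WHERE { ?s wdt:%s ?o . }" % prop_id
    r = requests.get("https://query.wikidata.org/sparql",
                     params={"query": query},
                     headers={"Accept": "application/sparql-results+json"})
    r.raise_for_status()
    return int(r.json()["results"]["bindings"][0]["c"]["value"])

print(statement_count("P227"))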


Filtering by usage count is something we already have in Miga. For 
example, here is a list of authority control properties that are used at 
least 1000 times on Wikidata:


http://tools.wmflabs.org/wikidata-exports/miga/#_cat=Properties/Related%20properties=VIAF%20identifier/Datatype=String/Items%20with%20such%20statements=1000%20-%20100

Currently requires Google Chrome or Opera to view. We are about to 
reimplement this application completely to make it use live data and be 
compatible with all browsers. Should be ready later this month.


Regarding the data source, we will use a mix of live data and cached 
precomputed data for more expensive statistics that you cannot query for 
every second.


Markus


___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata


Re: [Wikidata] Wikidata Propbrowse

2016-02-14 Thread Markus Krötzsch
Very nice. Showing the shortened property classifications under "use" is 
a very good idea!


Markus

On 14.02.2016 15:11, Jane Darnell wrote:

Wow Hay, this is super useful

On Sun, Feb 14, 2016 at 8:50 AM, Hay (Husky) wrote:

Awesome, thanks! :)

-- Hay

On Sun, Feb 14, 2016 at 1:36 PM, Yuri Astrakhan
wrote:
 > Well done! Absolutely love it!  I'm already using it to build
SPARQL queries
 > for the wikidata visualizations [1].
 >
 > [1]: http://en.wikipedia.beta.wmflabs.org/wiki/Sparql
 >
 > On Sun, Feb 14, 2016 at 2:44 PM, Hay (Husky) wrote:
 >>
 >> Hey everyone,
 >> it seems we're getting new properties every day. Currently there are
 >> over 2000 properties on Wikidata, and for me personally it's
becoming
 >> a bit difficult to see the forest through the trees. Of course there
 >> are a couple of places where they are listed, but they're either
a bit
 >> unwieldy ([1]), broken ([2]) or not detailed enough ([3]).
 >>
 >> Anyway, code is better than complaining so i hacked up a tool:
 >> https://tools.wmflabs.org/hay/propbrowse/
 >>
 >> You can filter the list by typing in a couple of characters, see
 >> either a detailed or compact list and it's possible to sort the list
 >> by different properties (id, label, example, etc.).
 >>
 >> This list is updated every night and code is available on Github
([4],
 >> [5])
 >>
 >> Let me know if you think it is useful and if there's anything
you're missing.
 >>
 >> -- Hay / Husky
 >>
 >> [1]: https://www.wikidata.org/wiki/Wikidata:List_of_properties
 >> [2]:
 >>

https://www.wikidata.org/wiki/Special:MyLanguage/Wikidata:List_of_properties/Summary_table
 >> [3]: https://www.wikidata.org/wiki/Special:ListProperties
 >> [4]:
https://github.com/hay/wiki-tools/tree/master/public_html/propbrowse
 >> [5]:
https://github.com/hay/wiki-tools/tree/master/etc/wikidata-props
 >>
 >> ___
 >> Wikidata mailing list
 >> Wikidata@lists.wikimedia.org 
 >> https://lists.wikimedia.org/mailman/listinfo/wikidata
 >
 >
 >
 > ___
 > Wikidata mailing list
 > Wikidata@lists.wikimedia.org 
 > https://lists.wikimedia.org/mailman/listinfo/wikidata
 >

___
Wikidata mailing list
Wikidata@lists.wikimedia.org 
https://lists.wikimedia.org/mailman/listinfo/wikidata




___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata




___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata


Re: [Wikidata-tech] WDTK searchEntities

2016-02-13 Thread Markus Krötzsch

[Moving to wikidata-tech; previous conversation inline below]

Hi Polyglot,

ah, now I see. The Wikidata Toolkit method you call is looking for items 
by Wikipedia page title, not for items by label. Labels and titles are 
not related in Wikidata. The search by title is supported by the 
wbgetentities API action for which we have a wrapper class, but this API 
action does not support the search by label.


In fact, I am not sure that there is any API action for doing what you 
want. There is only wbsearchentities, but this search will return near 
matches and also look for aliases. Maybe this is not a big issue for 
long strings as in your case, but for shorter strings you would get many 
results and you would still need to check if they really match.
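
(A rough sketch of that approach: call wbsearchentities and then check for an 
exact label match on the client side. The school name from this thread is used 
as the example search string, and the Python requests library is assumed.)

import requests

def search_items_by_label(label, language="en"):
    # wbsearchentities also matches aliases and near matches, so the caller
    # still has to verify that the label really is an exact match.
    params = {
        "action": "wbsearchentities",
        "search": label,
        "language": language,
        "type": "item",
        "format": "json",
    }
    r = requests.get("https://www.wikidata.org/w/api.php", params=params)
    r.raise_for_status()
    return r.json().get("search", [])

label = "Kasega Church of Uganda Primary School"
exact = [hit for hit in search_items_by_label(label) if hit.get("label") == label]
print(exact)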


Anyway, you are right that it would be nice if we would implement 
support for the label/alias search as well. For this, we need to make a 
wrapper class for wbsearchentities. I created an issue to track this:


https://github.com/Wikidata/Wikidata-Toolkit/issues/228

Cheers,

Markus


On 13.02.2016 23:22, Jo wrote:

Hi Markus,

I'm searching for a wikidata item with that label. It would be even
better if it were possible to search for a label/description combination.

This is the item I'm looking for:
https://www.wikidata.org/wiki/Q22695926

I mostly want to make sure that I'm not creating duplicate entries in
Wikidata, most of those schools are not noteworthy enough to get an
article on Wikipedia, but since they have objects in Openstreetmap, I
would think they are interesting enough for Wikidata.

Polyglot

2016-02-13 23:13 GMT+01:00 Markus Krötzsch
<mar...@semantic-mediawiki.org>:

Hi Jo,

You are searching for an item that is assigned to the article
"Kasega Church of Uganda Primary School" on English Wikipedia.
However, there is not article of this name on English Wikipedia.
Maybe there is a typo? Can you tell me which Wikidata item should be
returned here?

Cheers,

Markus

P.S. If you agree, I would prefer to continue this discussion on
wikidata-tech for the benefit of others who may have similar questions.



On 13.02.2016 14:47, Jo wrote:

Hi Marcus,

I had started to write my own implementation of a Wikidata bot in
Jython, so I could use it in JOSM, but still get to code in
Python. This
worked well for a while, but now apparently something was
changed to the
login API.

Anyway, I can't code for all possible things that can go wrong, so it
makes more sense to reuse an existing framework.

What I want to do is add items, but I want to check if they already
exist first. Try as I may, I can't seem to retrieve the items I
create
myself, like:


   Kasega Church of Uganda Primary School

Douglas Adams, on the other hand doesn't pose a problem.


I can't figure out why this is. Some things can be found, others
can't.
I tried with a few more entries from recent changes.


In my own bot, I had more success with searchEntities than with
getEntities. Was this implemented in WDTK?

I hope you can help. I'm stuck, as it doesn't make a lot of sense to
continue with the conversion if I can't even get a trivial
thing like
this to work.

from org.wikidata.wdtk.datamodel.helpers import Datamodel
from org.wikidata.wdtk.datamodel.helpers import ItemDocumentBuilder
from org.wikidata.wdtk.datamodel.helpers import ReferenceBuilder
from org.wikidata.wdtk.datamodel.helpers import StatementBuilder
from org.wikidata.wdtk.datamodel.interfaces import DatatypeIdValue
from org.wikidata.wdtk.datamodel.interfaces import EntityDocument
from org.wikidata.wdtk.datamodel.interfaces import ItemDocument
from org.wikidata.wdtk.datamodel.interfaces import ItemIdValue
from org.wikidata.wdtk.datamodel.interfaces import PropertyDocument
from org.wikidata.wdtk.datamodel.interfaces import PropertyIdValue
from org.wikidata.wdtk.datamodel.interfaces import Reference
from org.wikidata.wdtk.datamodel.interfaces import Statement
from org.wikidata.wdtk.datamodel.interfaces import StatementDocument
from org.wikidata.wdtk.datamodel.interfaces import StatementGroup
from org.wikidata.wdtk.wikibaseapi import ApiConnection
from org.wikidata.wdtk.util import WebResourceFetcherImpl
from org.wikidata.wdtk.wikibaseapi import LoginFailedException
from org.wikidata.wdtk.wikibaseapi import WikibaseDataEditor
from org.wikidata.wdtk.wikibaseapi import WikibaseDataFetcher
from org.wikidata.wdtk.wikibaseapi.apierrors import
MediaWikiApiErrorException
# print dir(ItemDocument)
# p

Re: [Wikidata] SPARQL CONSTRUCT results truncated

2016-02-12 Thread Markus Krötzsch

On 12.02.2016 00:04, Stas Malyshev wrote:

Hi!


We basically have two choices: either we offer a limited interface that only
allows for a narrow range of queries to be run at all. Or we offer a very
general interface that can run arbitrary queries, but we impose limits on time
and memory consumption. I would actually prefer the first option, because it's
more predictable, and doesn't get people's hopes up too far. What do you think?


That would require implementing a pretty smart SPARQL parser... I don't
think it's worth the investment of time. I'd rather put caps on runtime
and maybe also on parallel queries per IP, to ensure fair access. We may
also have a way to run longer queries - in fact, we'll need it anyway if
we want to automate lists - but that is longer term, we'll need to
figure out infrastructure for that and how we allocate access.


+1

Restricting queries syntactically to be "simpler" is what we did in 
Semantic MediaWiki (because MySQL did not support time/memory limits per 
query). It is a workaround, but it will not prevent long-running queries 
unless you make the syntactic restrictions really severe (and thereby 
forbid many simple queries, too). I would not do it if there is support 
for time/memory limits instead.


In the end, even the SPARQL engines are not able to predict reliably how 
complicated a query is going to be -- it's an important part of their 
work (for optimising query execution), but it is also very difficult.


Markus






___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata


Re: [Wikidata] SPARQL CONSTRUCT results truncated

2016-02-12 Thread Markus Krötzsch

On 12.02.2016 10:01, Osma Suominen wrote:

12.02.2016, 10:43, Markus Krötzsch wrote:


Restricting queries syntactically to be "simpler" is what we did in
Semantic MediaWiki (because MySQL did not support time/memory limits per
query). It is a workaround, but it will not prevent long-running queries
unless you make the syntactic restrictions really severe (and thereby
forbid many simple queries, too). I would not do it if there is support
for time/memory limits instead.


Would providing a Linked Data Fragments server [1] help here? It seems
to be designed exactly for situations like this, where you want to
provide a SPARQL query service over a large amount of linked data, but are
worried about server performance, particularly for complex, long-running
queries. Linked Data Fragments pushes some of the heavy processing to
the client side, which parses and executes the SPARQL queries.

Dynamically updating the data might be an issue here, but some of the
server implementations support running on top of a SPARQL endpoint [2].
I think that from the perspective of the server this means that a heavy,
long-running SPARQL query is broken up already on the client side into
several small, simple SPARQL queries that are relatively easy to serve.


There already is such a service for Wikidata (Cristian Consonni has set 
it up a while ago). You could try whether the query works there. I think 
that such queries would be rather challenging for a server of this type, 
since they require you to iterate almost all of the data client-side. 
Note that both "instance of human" and "has a GND identifier" are not 
very selective properties. In this sense, the queries may not be 
"relatively easy to serve" in this particular case.


Markus



-Osma

[1] http://linkeddatafragments.org/

[2]
https://github.com/LinkedDataFragments/Server.js#configure-the-data-sources





___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata


Re: [Wikidata] SPARQL CONSTRUCT results truncated

2016-02-11 Thread Markus Krötzsch

Hi Joachim,

I think the problem is not to answer your query in 5min or so (Wikidata 
Toolkit on my laptop takes 27min without a database, by simply parsing 
the whole data file, so any database that already has the data should be 
much faster). The bigger issue is that you would have to configure the 
site to run for 5min before timeout. This would mean that other queries 
that never terminate (because they are really hard) also can run for at 
least this time. It seems that this could easily cause the service to 
break down.


Maybe one could have an "unstable" service on a separate machine that 
does the same as WDQS but with a much more liberal timeout and less 
availability (if it's overloaded a lot, it will just be down more often, 
but you would know when you use it that this is the deal).


Cheers,

Markus


On 11.02.2016 15:54, Neubert, Joachim wrote:

Hi Stas,

Thanks for your answer. You asked how long the query runs: 8.21 sec (having 
processed 6443 triples), in an example invocation. If roughly linear, that 
could mean 800-1500 sec for the whole set. However, I would expect a clearly 
shorter runtime: I routinely use queries of similar complexity and result sizes 
on ZBW's public endpoints. One arbitrarily selected query which extracts data 
from GND runs for less than two minutes to produce 1.2m triples.

Given the size of Wikidata, I wouldn't consider such a use abusive. Of course, 
if you have lots of competing queries and resources are limited, it is 
completely legitimate to implement some policy which formulates limits and 
enforces them technically (throttle down long-running queries, or limit the 
number of produced triples, or the execution time, or whatever seems reasonable 
and can be implemented).

Anyway, in this case (truncation in the middle of a statement), it looks much 
more like some technical bug (or an obscure timeout somewhere down the way). 
The execution time and the result size varies widely:

5.44s empty result
8.60s 2090 triples
5.44s empty result
22.70s 27352 triples

Can you reproduce this kind of results with the given query, or with other 
supposedly longer-running queries?

Thanks again for looking into this.

Cheers, Joachim

PS. I plan to set up my own Wikidata SPARQL endpoint to do more complex things, but that 
depends on a new machine which will be available in a few months. For now, I'd just like to 
know which of "our" persons (economists and the like) have Wikipedia pages.

PPS. From my side, I would have much preferred to build a query which asks for 
exactly the GND IDs I'm interested in (about 430,000 out of millions of GNDs). 
This would have led to a much smaller result - but I cannot squeeze that query 
into a GET request ...
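
(One way around the GET length limit, sketched here with the Python requests 
library: send the query in a POST body, which the SPARQL protocol allows, and 
restrict the result with a VALUES block. The two GND IDs below are placeholders, 
and a real run over ~430,000 IDs would of course have to be split into chunks.)

import requests

gnd_ids = ["123456789", "987654321"]   # placeholders; use real GND IDs here

query = """
SELECT ?gndId ?item WHERE {
  VALUES ?gndId { %s }
  ?item wdt:P227 ?gndId .
}
""" % " ".join('"%s"' % g for g in gnd_ids)

r = requests.post("https://query.wikidata.org/sparql",
                  data={"query": query},
                  headers={"Accept": "application/sparql-results+json"})
r.raise_for_status()
print(r.json()["results"]["bindings"])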


-Original Message-
From: Wikidata [mailto:wikidata-boun...@lists.wikimedia.org] On Behalf Of Stas 
Malyshev
Sent: Thursday, 11 February 2016 01:35
To: Discussion list for the Wikidata project.
Subject: Re: [Wikidata] SPARQL CONSTRUCT results truncated

Hi!


I try to extract all mappings from wikidata to the GND authority file,
along with the according wikipedia pages, expecting roughly 500,000 to
1m triples as result.


As a starting note, I don't think extracting 1M triples is the best way to 
use the query service. If you need to do processing that returns such big result 
sets - in millions - maybe processing the dump - e.g. with Wikidata Toolkit at 
https://github.com/Wikidata/Wikidata-Toolkit - would be a better idea?


However, with various calls, I get far fewer triples (about 2,000 to
10,000). The output seems to be truncated in the middle of a statement, e.g.


It may be some kind of timeout because of the quantity of the data being sent. 
How long does such a request take?

--
Stas Malyshev
smalys...@wikimedia.org

___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata
___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata




___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata


Re: [Wikidata] SPARQL CONSTRUCT results truncated

2016-02-11 Thread Markus Krötzsch

Hi Joachim,

Here is a short program that solves your problem:

https://github.com/Wikidata/Wikidata-Toolkit-Examples/blob/master/src/examples/DataExtractionProcessor.java

It is in Java, so, you need that (and Maven) to run it, but that's the 
only technical challenge ;-). You can run the program in various ways as 
described in the README:


https://github.com/Wikidata/Wikidata-Toolkit-Examples

The program I wrote puts everything into a CSV file, but you can of 
course also write RDF triples if you prefer this, or any other format 
you wish. The code should be easy to modify.


On a first run, the tool will download the current Wikidata dump, which 
takes a while (it's about 6G), but after this you can find and serialise 
all results in less than half an hour (for a processing rate of around 
10K items/second). A regular laptop is enough to run it.
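
(For readers who prefer Python over Java, a rough equivalent that streams the 
JSON dump line by line might look like the sketch below. It assumes the usual 
latest-all.json.bz2 layout, i.e. one entity per line inside a big JSON array, 
and the standard "mainsnak"/"datavalue" statement structure; the Java program 
linked above remains the more complete reference.)

import bz2
import json

def iter_entities(path):
    # The dump is a JSON array with one entity per line, so it can be
    # streamed without loading everything into memory.
    with bz2.open(path, "rt", encoding="utf-8") as f:
        for line in f:
            line = line.strip().rstrip(",")
            if not line or line in ("[", "]"):
                continue
            yield json.loads(line)

with open("gnd-mappings.csv", "w", encoding="utf-8") as out:
    for entity in iter_entities("latest-all.json.bz2"):
        claims = entity.get("claims", {})
        p31_values = [c.get("mainsnak", {}).get("datavalue", {}).get("value", {})
                      for c in claims.get("P31", [])]
        # Older dumps only carry "numeric-id"; newer ones also have "id".
        is_human = any(v.get("id") == "Q5" or v.get("numeric-id") == 5
                       for v in p31_values if isinstance(v, dict))
        if not is_human:
            continue
        for c in claims.get("P227", []):
            gnd = c.get("mainsnak", {}).get("datavalue", {}).get("value")
            if isinstance(gnd, str):
                out.write("%s,%s\n" % (entity.get("id"), gnd))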


Cheers,

Markus


On 11.02.2016 01:34, Stas Malyshev wrote:

Hi!


I try to extract all mappings from wikidata to the GND authority file,
along with the according wikipedia pages, expecting roughly 500,000 to
1m triples as result.


As a starting note, I don't think extracting 1M triples is the best
way to use the query service. If you need to do processing that returns such
big result sets - in millions - maybe processing the dump - e.g. with
Wikidata Toolkit at https://github.com/Wikidata/Wikidata-Toolkit - would
be a better idea?


However, with various calls, I get far fewer triples (about 2,000 to
10,000). The output seems to be truncated in the middle of a statement, e.g.


It may be some kind of timeout because of the quantity of the data being
sent. How long does such a request take?




___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata


Re: [Wikidata] SPARQL CONSTRUCT results truncated

2016-02-11 Thread Markus Krötzsch

On 11.02.2016 15:01, Gerard Meijssen wrote:

Hoi,
What I hear is that the intentions were wrong in that you did not
anticipate people to get actual meaningful requests out of it.

When you state "we have two choices", you imply that it is my choice as
well. It is not. The answer that I am looking for is yes, it does not
function as we would like, we are working on it and in the mean time we
will ensure that toolkit is available on Labs for the more complex queries.

Wikidata is a service and the service is in need of being better.


Gerard, do you realise how far away from technical reality your wishes 
are? We are far ahead of the state of the art in what we already have 
for Wikidata: two powerful live query services + a free toolkit for 
batch analyses + several Web APIs for live lookups. I know of no site of 
this scale that is anywhere near this in terms of functionality. You can 
always ask for more, but you should be a bit reasonable too, or people 
will just ignore you.


Markus



On 11 February 2016 at 12:32, Daniel Kinzler
wrote:

On 11.02.2016 at 10:17, Gerard Meijssen wrote:
> Your response is technical and seriously, query is a tool and it should 
function
> for people. When the tool is not good enough fix it.

What I hear: "A hammer is a tool, it should work for people. Tearing
down a
building with it takes forever, so fix the hammer!"

The query service was never intended to run arbitrarily large or complex
queries. Sure, would be nice, but that also means committing an
arbitrary amount
of resources to a single request. We don't have arbitrary amounts of
resources.

We basically have two choices: either we offer a limited interface
that only
allows for a narrow range of queries to be run at all. Or we offer a
very
general interface that can run arbitrary queries, but we impose
limits on time
and memory consumption. I would actually prefer the first option,
because it's
more predictable, and doesn't get people's hopes up too far. What do
you think?

Oh, and +1 for making it easy to use WDT on labs.

--
Daniel Kinzler
Senior Software Developer

Wikimedia Deutschland
Gesellschaft zur Förderung Freien Wissens e.V.

___
Wikidata mailing list
Wikidata@lists.wikimedia.org 
https://lists.wikimedia.org/mailman/listinfo/wikidata




___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata




___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata


Re: [Wikidata] SPARQL CONSTRUCT results truncated

2016-02-11 Thread Markus Krötzsch

Hi Joachim,

Stas would be the right person to discuss service parameters and the 
possible setup of more servers with other parameters. He is part of the 
team at WMF who is in charge of the SPARQL ops.


You note that "it isn’t always obvious what is right and what the 
limitations of a tool are". I think this is the key point here. There is 
not enough experience with the SPARQL service yet to define very clear 
guidelines on what works and what doesn't. On this mailing list, we have 
frequently been reminded to use LIMIT in queries to make sure they 
terminate and don't overstress the server, but I guess this is not part 
of the official documentation you refer to. There was no decision 
against supporting bigger queries either -- it just did not come up as a 
major demand yet, since typical applications that use SPARQL so far 
require 10s to 1000s of results but not 100,000s to millions. To be 
honest, I would not have expected this to work so well in practice that 
it could be considered here. It is interesting to learn that you are 
already using SPARQL for generating custom data exports. It's probably 
not the most typical use of a query service, but at least the query 
language could support this usage in principle.


Cheers,

Markus



On 11.02.2016 19:32, Neubert, Joachim wrote:

Hi Lydia,

I agree on using the right tool for the job. Yet, it isn’t always
obvious what is right and what the limitations of a tool are.

For me, it’s perfectly ok when a query runs for 20 minutes, when it
spares me some hours of setting up a specific environment for one
specific dataset (and doing it again when I need current data two month
later). And it would be no issue if the query runs much longer, in
situations where it competes with several others. But of course, that’s
not what I want to experience when I use a wikidata service to drive,
e.g., an autosuggest function for selecting entities.

So, can you agree to Markus' suggestion that an experimental “unstable”
endpoint could solve different use cases and expectations?

And do you think the policies and limitations of different access
strategies could be documented? These could include a high-reliability
interface for a narrow range of queries (as Daniel suggests as his
preferred option). And on the other end of the spectrum something that
allows people to experiment freely. Finally, the latter kind of
interface could allow new patterns of usage to evolve, with perhaps a
few of them worthwhile to become part of an optimized, highly reliable
query set.

I could imagine that such a documentation of (and perhaps discussion on)
different options and access strategies, limitations and tradeoffs could
address Gerard's demand to give people what they need, or at least let them
make informed choices when restrictions are unavoidable.

Cheers, Joachim

*From:* Wikidata [mailto:wikidata-boun...@lists.wikimedia.org] *On Behalf
Of* Lydia Pintscher
*Sent:* Thursday, 11 February 2016 17:55
*To:* Discussion list for the Wikidata project.
*Subject:* Re: [Wikidata] SPARQL CONSTRUCT results truncated

On Thu, Feb 11, 2016 at 5:53 PM Gerard Meijssen
wrote:

Hoi,

Markus when you read my reply on the original question you will see
that my approach is different. The first thing that I pointed out
was that a technical assumption has little to do with what people
need. I indicated that when this is the approach, the answer is fix
it. The notion that a large number of returns is outrageous is not
of this time.

My approach was one where I even offered a possible solution, a crutch.

The approach Daniel took was to make me look ridiculous. His choice,
not mine. I stayed polite and told him that his answers are not my
answers and why. The point that I make is that Wikidata is a
service. It will increasingly be used for the most outrageous
queries and people will expect it to work because why else do we put
all this data in there. Why else is this the data hub for Wikipedia.
Why else

Do appreciate that the aim of the WMF is to share in the sum of all
available knowledge. When the current technology is what we have to
make do with, fine for now. Say so, but do not ridicule me for
saying that it is not good enough, it is not now and it will
certainly not be in the future...

Thanks,

GerardM

Gerard, it all boils down to using the right tool for the job. Nothing
more - nothing less. Let's get back to making Wikidata rock.

Cheers
Lydia

--

Lydia Pintscher - http://about.me/lydia.pintscher

Product Manager for Wikidata

Wikimedia Deutschland e.V.

Tempelhofer Ufer 23-24

10963 Berlin

www.wikimedia.de 

Wikimedia Deutschland - Gesellschaft zur Förderung Freien Wissens e. V.

Eingetragen im Vereinsregister des Amtsgerichts Berlin-Charlottenburg
unter der Nummer 23855 Nz. Als gemeinnützig anerkannt 

[Wikidata] Wikidata Toolkit 0.6.0 released

2016-02-10 Thread Markus Krötzsch

Hi all,

I am happy to announce the release of Wikidata Toolkit 0.6.0 [1], the 
Java library for programming with Wikidata and Wikibase.


The most prominent new feature of this release is improved support for 
writing bots (full support for maxlag and edit throttling, simpler code 
through convenience methods, fixed a previous issue with API access). In 
addition, the new version introduces support for the new Wikidata 
property types "external-id" and "math".


We have also improved our documentation by creating an example project 
that shows how to use Wikidata Toolkit as a library in your own, 
stand-alone Java code [2].


The bot code in the examples is used in actual bots, and was used for 
thousands of edits on Wikidata (e.g., some may have noticed that the 
annoying "+-1" after population numbers and the like has become quite 
rare recently ;-).


Maven users can get the library directly from Maven Central (see [1]); 
this is the preferred method of installation. There is also an 
all-in-one JAR at github [3] and of course the sources [4] and updated 
JavaDocs [5].


As usual, feedback is welcome. Developers are also invited to contribute 
via github.


Cheers,

Markus

[1] https://www.mediawiki.org/wiki/Wikidata_Toolkit
[2] https://github.com/Wikidata/Wikidata-Toolkit-Examples
[3] https://github.com/Wikidata/Wikidata-Toolkit/releases
[4] https://github.com/Wikidata/Wikidata-Toolkit/
[5] http://wikidata.github.io/Wikidata-Toolkit/

--
Markus Kroetzsch
Faculty of Computer Science
Technische Universität Dresden
+49 351 463 38486
http://korrekt.org/

___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata


Re: [Wikidata-tech] On interface stability and forward compatibility

2016-02-05 Thread Markus Krötzsch

Hi Daniel,

I feel that this tries to evade the real issue by making formal rules 
about what kind of "breaking" you have to care about. It would be better 
to define "breaking change" based on its consequences: if important 
services will stop working, then you should make sure you announce it in 
time so this will not happen. This requires you to talk to people on 
this list. I think the whole proposal below is mainly trying to give you 
some justification to avoid communication with your stakeholders. This 
is not the way to go.


This said, it is always nice to have some guidelines as to what is 
likely to change and what isn't. It is probably enough to give some 
warnings about this ("there might be additional keys in this map in the 
future" or "there might be additional datatype URIs in the future"). 
However, this is no recipe to avoid breaking changes. In particular, the 
guideline to ignore snaks of properties that have no understandable 
declaration is just codifying a controlled way of failing, not avoiding 
failure:


* Browsing interfaces (e.g., Reasonator, Miga Class & Property Browser) 
are expected to show all data to users. If they don't, this is breaking 
them.
* Query services are expected to use all data. If you do an aggregate 
query to count all properties on Wikidata, then the number returned will 
not be incomplete but simply wrong if the service ignores half of the data.
* Editing tools (including bot frameworks) are most heavily affected, 
since they might create duplicates of statements if they fail to see 
some of the data following your guideline.


This does not mean that your guideline is unreasonable -- in fact, I 
think this is what most tools are doing anyway. But as the examples 
show, it's not enough to prevent major service disruptions that would 
affect many people. The guideline that tools should sometimes raise an 
alert or issue a warning does not work well in many cases, since we have a 
complex ecosystem with many inter-dependent services (for example, how 
should a SPARQL Web service communicate problems that occurred when 
importing the data? All of them, or somehow only the ones that might have 


Our tools rely on being able to use all data, and the easiest way to 
ensure that they will work is to announce technical changes to the JSON 
format well in advance using this list. For changes that affect a 
particular subset of widely used tools, it would also be possible to 
seek the feedback from the main contributors of these tools at 
design/development time. I am sure everybody here is trying their best 
to keep up with whatever changes you implement, but it is not always 
possible for all of us to sacrifice part of our weekend on short notice 
for making a new release before next Wednesday.


Cheers,

Markus


On 05.02.2016 13:10, Daniel Kinzler wrote:

Hi all!

In the context of introducing the new "math" and "external-id" data types, the
question came up whether this introduction constitutes a breaking change to the
data model. The answer to this depends on whether you take the "English" or the
"German" approach to interpreting the format: According to
, in
England, "everything which is not forbidden is allowed", while, in Germany, the
opposite applies, so "everything which is not allowed is forbidden".

In my mind, the advantage of formats like JSON, XML and RDF is that they provide
good discovery by eyeballing, and that they use a mix-and-match approach. In
this context, I favour the English approach: anything not explicitly forbidden
in the JSON or RDF is allowed.

So I think clients should be written in a forward-compatible way: they should
handle unknown constructs or values gracefully.
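
(A minimal sketch of such graceful handling for a single Wikibase JSON snak; 
the set of known value types listed here is an assumption based on the current 
JSON format and would have to be kept up to date.)

import warnings

KNOWN_VALUE_TYPES = {"string", "wikibase-entityid", "time", "quantity",
                     "globecoordinate", "monolingualtext"}

def process_snak(snak):
    # "novalue" and "somevalue" snaks carry no datavalue at all.
    datavalue = snak.get("datavalue")
    if datavalue is None:
        return None
    # Skip, with a warning, rather than fail on an unknown value type.
    if datavalue.get("type") not in KNOWN_VALUE_TYPES:
        warnings.warn("Skipping snak of property %s with unknown value type %r"
                      % (snak.get("property"), datavalue.get("type")))
        return None
    return datavalue["value"]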


In this vein, I would like to propose a few guiding principles for the design of
client libraries that consume Wikibase RDF and particularly JSON output:

* When encountering an unknown structure, such as an unexpected key in a JSON
encoded object, the consumer SHOULD skip that structure. Depending on context
and use case, a warning MAY be issued to alert the user that some part of the
data was not processed.

* When encountering a malformed structure, such as missing a required key in a
JSON encoded object, the consumer MAY skip that structure, but then a warning
MUST be issued to alert the user that some part of the data was not processed.
If the structure is not skipped, the consumer MUST fail with a fatal error.

* Clients MUST make a clear distinction of data types and values types: A Snak's
data type determines the interpretation of the value, while the type of the
Snak's data value specifies the structure of the value representation.

* Clients SHOULD be able to process a Snak about a Property of unknown data
type, as long as the value type is known. In such a case, the client SHOULD fall
back to the behaviour defined for the value type. If this is not possible, 

Re: [Wikidata] How to reach the wikipedia abstract propert?

2016-02-04 Thread Markus Krötzsch

Hi,

For the record, I have heard a similar question recently. Maybe we could 
actually offer the abstracts as a service or otherwise "virtual" 
property that is simply added to the query result at the end. With the 
API Finn mentions (I did not know this, thanks!), it seems that this is 
not so hard.
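
(A rough sketch of that combination: SPARQL for the English Wikipedia sitelink, 
then the TextExtracts API for the plain-text intro. Douglas Adams (Q42) is used 
purely as an example item, and the Python requests library is assumed.)

import requests
from urllib.parse import unquote

SPARQL = "https://query.wikidata.org/sparql"
query = """
SELECT ?article WHERE {
  ?article schema:about wd:Q42 ;
           schema:isPartOf <https://en.wikipedia.org/> .
}
"""
r = requests.get(SPARQL, params={"query": query},
                 headers={"Accept": "application/sparql-results+json"})
r.raise_for_status()
article = r.json()["results"]["bindings"][0]["article"]["value"]
title = unquote(article.rsplit("/", 1)[1]).replace("_", " ")

# Fetch the plain-text introduction of that article.
r = requests.get("https://en.wikipedia.org/w/api.php", params={
    "action": "query", "prop": "extracts", "exintro": 1, "explaintext": 1,
    "redirects": 1, "titles": title, "format": "json",
})
r.raise_for_status()
pages = r.json()["query"]["pages"]
print(next(iter(pages.values()))["extract"])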


Another thing to point out is that, if the goal is to display short 
"previews" of articles in a web page, it might be better to use the 
Wikipedia mobile view instead. It should be possible to embed it into 
any web page with a bit of Javascript and it gives much nicer (yet space 
efficient) results.


Cheers,

Markus


On 04.02.2016 17:47, Finn Årup Nielsen wrote:

Hi Miriam,


I am not aware of the ability of Wikidata to return the abstract. I
think you have to use Wikipedia.

https://en.wikipedia.org/w/api.php?format=jsonfm&action=query&prop=extracts&explaintext=&exsentences=2&titles=SPARQL


https://stackoverflow.com/questions/8555320/is-there-a-clean-wikipedia-api-just-for-retrieve-content-summary


This gives you:

{
 "batchcomplete": "",
 "query": {
 "pages": {
 "2574343": {
 "pageid": 2574343,
 "ns": 0,
 "title": "SPARQL",
 "extract": "SPARQL (pronounced \"sparkle\", a recursive
acronym for SPARQL Protocol and RDF Query Language) is an RDF query
language, that is, a semantic query language for databases, able to
retrieve and manipulate data stored in Resource Description Framework
(RDF) format. It was made a standard by the RDF Data Access Working
Group (DAWG) of the World Wide Web Consortium, and is recognized as one
of the key technologies of the semantic web."
 }
 }
 }
}


best regards
Finn


On 02/04/2016 01:03 PM, Miriam Allalouf wrote:

Hi,

We are accessing Wikidata using embedded SPARQL queries.

We need to have the abstract of an article (namely to retrieve it from
the Wikipedia article).

I can get the site link of an article – though I cannot get a specific
paragraph,  particularly the abstract of the article.

It works for me when I use DBpedia, but I want to do it using
Wikidata.

Please let me know how.

Thanks a lot,

Miriam

=

Miriam Allalouf, PhD

Software Engineering Department, JCE

Academic Head of Mahar Project

Mobile tel: +972-52-3664129
email: miria...@jce.ac.il
Azrieli College of Engineering Jerusalem - JCE



___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata







___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata


Re: [Wikidata] upcoming deployments/features

2016-02-04 Thread Markus Krötzsch

On 03.02.2016 12:44, John Erling Blad wrote:

It is a bit strange to define a data type in terms of a library of
functions in another language.
Or is it just me who thinks this is a bit odd?

What about MathML?


The arxiv report that Moritz posted (after you had already asked your 
question) says that he has improved the tooling to translate into 
MathML, so this is in the picture to some extent. I would not consider 
it as a workable input format (try writing in MathML! ;-). On the other 
hand, there is AsciiMathML, but I don't know how useful that is (and 
people know LaTeX much better). One could consider having MathML as the 
internal format, and AsciiMathML and LaTeX as input options for writing 
it, but this seems like a big project to get to work. I guess we can 
exclude the option of using a visual input interface for math (Microsoft 
tried that in Word once, they have a lot of developers, and yet ...).


Markus



On Wed, Feb 3, 2016 at 12:06 PM, Markus Krötzsch
<mar...@semantic-mediawiki.org>
wrote:

For a consumer, the main practical questions would be:

(1) What subset of LaTeX exactly do you need to support to display
the math expressions in Wikidata?
(2) As a follow up: does MathJAX work to display this? If not, what
does?

Cheers,

Markus

On 02.02.2016 10:01, Moritz Schubotz wrote:

The string is interpreted by the math extension in the same way
as the
Math extension interprets the text between the <math> tags.
There is an API to extract identifiers and the packages required to
render the input with regular latex from here:
http://api.formulasearchengine.com/v1/?doc
or also

https://en.wikipedia.org/api/rest_v1/?doc#!/Math/post_media_math_check_type
(The wikipedia endpoint has been opened to the public just
moments ago)
In the future, we are planning to provide additional semantics
from there.
If you have additional questions, please contact me directly,
since I'm
not a member on the list.
Moritz

 On Tue, Feb 2, 2016 at 8:53 AM, Lydia Pintscher <lydia.pintsc...@wikimedia.de> wrote:

  On Mon, Feb 1, 2016 at 8:44 PM Markus Krötzsch <mar...@semantic-mediawiki.org> wrote:

 On 01.02.2016 17:14, Lydia Pintscher wrote:
  > Hey folks :)
  >
  > I just sat down with Katie to plan the next
important feature
  > deployments that are coming up this month. Here is
the plan:
  > * new datatype for mathematical expressions: We'll
get it live on
  > test.wikidata.org <http://test.wikidata.org> tomorrow and then bring it
  > to wikidata.org <http://wikidata.org> on the 9th

 Documentation? What will downstream users like us need
to do to
 support
 this? How is this mapped to JSON? How is this mapped to
RDF?


 It is a string representing markup for the Math extension.
You can
 already test it here:
http://wikidata.beta.wmflabs.org/wiki/Q117940.
 See also
https://en.wikipedia.org/wiki/Help:Displaying_a_formula.
 Maybe Moritz wants to say  bit more as his students created the
 datatype.

 Cheers
 Lydia
 --
 Lydia Pintscher - http://about.me/lydia.pintscher
 Product Manager for Wikidata

 Wikimedia Deutschland e.V.
 Tempelhofer Ufer 23-24
 10963 Berlin
www.wikimedia.de <http://www.wikimedia.de> <http://www.wikimedia.de>

 Wikimedia Deutschland - Gesellschaft zur Förderung Freien
Wissens e. V.

 Eingetragen im Vereinsregister des Amtsgerichts
 Berlin-Charlottenburg unter der Nummer 23855 Nz. Als
gemeinnützig
 anerkannt durch das Finanzamt für Körperschaften I Berlin,
 Steuernummer 27/029/42207.




--
Moritz Schubotz
TU Berlin, Fakultät IV
DIMA - Sekr. EN7
Raum EN742
Einsteinufer 17
D-10587 Berlin
Germany

Tel.: +49 30 314 22784

Re: [Wikidata] upcoming deployments/features

2016-02-04 Thread Markus Krötzsch
ation.org/ . In 
addition, I'm an offsite collaborator of the National Institute of Standards 
and Technology in the USA and I really appreciate standards.

Moritz Schubotz
TU Berlin, Fakultät IV
DIMA - Sekr. EN7
Raum E-N 741
Einsteinufer 17
D-10587 Berlin
Germany

Tel.: +49 30 314 22784
Mobil:+49 1578 047 1397
E-Mail: schub...@tu-berlin.de
Skype: Schubi87
ICQ: 200302764
Msn: mor...@schubotz.de


-Original Message-
From: Markus Krötzsch [mailto:mar...@semantic-mediawiki.org]
Sent: Thursday, 4 February 2016 08:20
To: Schubotz, Moritz; Discussion list for the Wikidata project.
Subject: Re: AW: AW: [Wikidata] upcoming deployments/features

Hi Moritz,

On 03.02.2016 15:25, Schubotz, Moritz wrote:

Hi Markus,

I think we agree on the goals cf. http://arxiv.org/abs/1404.6179 By
the way the texvc dialect is now 13 years old at least.
For now it's required to be 100% compatible with the texvc dialect in order to 
use Wikidata in MediaWiki instances.
However, for the future there are also plans to support more markup.
But all new options are blocked by
https://phabricator.wikimedia.org/T74240

Mathoid, the service that converts the texvc dialect to MathML, SVG + PNG can 
also be used without a MediaWiki instance.
I posted links to the Restbase Web UI before.

api.formulasearchengine.com (with experimental features)
de.wikipedia.org/api (stable)


This is the API you said "has been opened to the public just moments ago" and which 
describes itself as "currently in beta testing"? That seems a bit shaky to say the least. 
In your email, you said that this API was for extracting LaTeX package names and identifiers, not 
for rendering content, so I have not looked at it for this purpose. How does this compare to 
MathJax in terms of usage? Are the output types similar?
It seems your solution adds the dependency on an external server, so this 
cannot be used in offline mode, I suppose? How does it support styling of 
content for your own application, e.g., how do you select the fonts to be used?

I think we agree that real documentation should be a bit more than an 
unexplained link in an email. Anyway, it is not your role to provide 
documentation on new Wikidata features or to make sure that stakeholders are 
taken along when new features are deployed, so don't worry too much about this. 
I am sure your students did a good job implementing this, and from there on it 
is really in other people's hands.

Cheers,

Markus





On 03.02.2016 at 14:31, Markus Krötzsch wrote:

Hi Moritz,

I must say that this is not very reassuring. So basically what we
have in this datatype now is a "LaTeX-like" markup language that is
only supported by one implementation that was created for MediaWiki,
and partially by a LaTeX package that you created.


Markus, this TeX dialect is not a new invention by Moritz. It's what
the Math extension for MediaWiki has been using for over a decade now,
and it's used on hundreds of thousands of pages on Wikipedia. All that
we are doing now is making this same exact syntax available for
property values on wikibase, using the same exact code for rendering it.

I think having consistent handling for math formulas between wikitext
and wikibase is the right thing to do. Of course it would have been
nice for MediaWiki to not invent its own TeX dialect for this, but
it's 10 years too late for that complaint now.

Moritz, I seem to recall that the new Math extension uses a standalone
service for rendering TeX to PNG, SVG, or MathML. Can that service
easily be used outside the context of MediaWiki?







___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata


[Wikidata] SPARQL service slow?

2016-02-03 Thread Markus Krötzsch

Hi,

is it me or is the SPARQL service very slow right now?

Thanks,

Markus


___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata


Re: [Wikidata] upcoming deployments/features

2016-02-03 Thread Markus Krötzsch

Hi Moritz,

I must say that this is not very reassuring. So basically what we have 
in this datatype now is a "LaTeX-like" markup language that is only 
supported by one implementation that was created for MediaWiki, and 
partially by a LaTeX package that you created.


Why did you not just use LaTeX? Do we really need additional commands 
here? Just think about the interoperability of the data you are 
creating. Just because some command alias is used on Wikipedia does not 
mean we need it -- we can simply translate it to standard LaTeX on import.
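
(For the two aliases mentioned in this thread, such a translation is tiny; a 
hedged sketch in Python follows, with the alias table deliberately limited to 
\and and \or. A real import would need the complete list of texvc-specific 
commands.)

import re

TEXVC_ALIASES = {r"\and": r"\land", r"\or": r"\lor"}

def texvc_to_latex(src):
    for alias, standard in TEXVC_ALIASES.items():
        # Match the alias only when it is not followed by another letter.
        pattern = re.escape(alias) + r"(?![A-Za-z])"
        src = re.sub(pattern, lambda _m, repl=standard: repl, src)
    return src

print(texvc_to_latex(r"a \and b \or c"))   # prints: a \land b \lor c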


The question about MathJax I asked because consumers of the data clearly 
need some way to display this markup. MathJax is a widely used library 
that can support this. Or maybe MathJax already supports your custom 
extensions somehow (I am not that familiar with it)? If not, then what 
other ways are there of embedding your custom math markup language into 
my application?


Cheers,

Markus


On 03.02.2016 13:52, Schubotz, Moritz wrote:

Hi Markus,

it's not exactly a subset of LaTeX. Some commands were added, some were removed. 
A large portion of those parts are documented here
https://en.wikipedia.org/wiki/Help:Displaying_a_formula , and a complete list 
is available from  http://1drv.ms/1RtoZoW
For LaTeX users I created a LaTeX macro package so that they can copy and paste 
texvc style LaTeX code to regular LaTeX documents
https://www.ctan.org/pkg/texvc
However, there are two issues with this package: \and  and \or are not 
supported by my LaTeX package since redefining those commands caused internal 
problems. However, most of the time people use the standard LaTeX commands \lor 
and \land anyhow. Altogether, enwiki contains 969 \lor and 1581 \land.
Statistics on the usage frequencies are available from
https://gitlab.tubit.tu-berlin.de/data/wikiFormulae/tree/master

w.r.t 2) I have no idea how that relates to MathJax? Can you explain the 
background of your question?

Best
Moritz


Moritz Schubotz
TU Berlin, Fakultät IV
DIMA - Sekr. EN7
Raum E-N 741
Einsteinufer 17
D-10587 Berlin
Germany

Tel.: +49 30 314 22784
Mobil:+49 1578 047 1397
E-Mail: schub...@tu-berlin.de
Skype: Schubi87
ICQ: 200302764
Msn: mor...@schubotz.de


-Original Message-
From: Markus Krötzsch [mailto:mar...@semantic-mediawiki.org]
Sent: Wednesday, 3 February 2016 12:06
To: Discussion list for the Wikidata project.; Lydia Pintscher
Cc: Schubotz, Moritz
Subject: Re: [Wikidata] upcoming deployments/features

For a consumer, the main practical questions would be:

(1) What subset of LaTeX exactly do you need to support to display the math 
expressions in Wikidata?
(2) As a follow up: does MathJAX work to display this? If not, what does?

Cheers,

Markus

On 02.02.2016 10:01, Moritz Schubotz wrote:

The string is interpreted by the math extension in the same way as the
Math extension interprets the text between the <math> tags.
There is an API to extract identifiers and the packages required to
render the input with regular latex from here:
http://api.formulasearchengine.com/v1/?doc
or also
https://en.wikipedia.org/api/rest_v1/?doc#!/Math/post_media_math_check
_type (The wikipedia endpoint has been opened to the public just
moments ago) In the future, we are planning to provide additional
semantics from there.
If you have additional questions, please contact me directly, since
I'm not a member on the list.
Moritz

On Tue, Feb 2, 2016 at 8:53 AM, Lydia Pintscher
<lydia.pintsc...@wikimedia.de> wrote:

 On Mon, Feb 1, 2016 at 8:44 PM Markus Krötzsch
 <mar...@semantic-mediawiki.org> wrote:

 On 01.02.2016 17:14, Lydia Pintscher wrote:
  > Hey folks :)
  >
  > I just sat down with Katie to plan the next important feature
  > deployments that are coming up this month. Here is the plan:
  > * new datatype for mathematical expressions: We'll get it live on
 > test.wikidata.org <http://test.wikidata.org> tomorrow and then bring it
 > to wikidata.org <http://wikidata.org> on the 9th

 Documentation? What will downstream users like us need to do to
 support
 this? How is this mapped to JSON? How is this mapped to RDF?


 It is a string representing markup for the Math extension. You can
 already test it here: http://wikidata.beta.wmflabs.org/wiki/Q117940.
 See also https://en.wikipedia.org/wiki/Help:Displaying_a_formula.
 Maybe Moritz wants to say  bit more as his students created the
 datatype.

 Cheers
 Lydia
 --
 Lydia Pintscher - http://about.me/lydia.pintscher
 Product Manager for Wikidata

 Wikimedia Deutschland e.V.
 Tempelhofer Ufer 23-24
 10963 Berlin
 www.wikimedia.de

Re: [Wikidata] upcoming deployments/features

2016-02-03 Thread Markus Krötzsch

On 03.02.2016 14:38, Daniel Kinzler wrote:

Am 03.02.2016 um 14:31 schrieb Markus Krötzsch:

Hi Moritz,

I must say that this is not very reassuring. So basically what we have in this
datatype now is a "LaTeX-like" markup language that is only supported by one
implementation that was created for MediaWiki, and partially by a LaTeX package
that you created.


Markus, this TeX dialect is not a new invention by Moritz. It's what the Math
extension for MediaWiki has been using for over a decade now, and it's used on
hundreds of thousands of pages on Wikipedia. All that we are doing now is making
this same exact syntax available for property values on wikibase, using the same
exact code for rendering it.

I think having consistent handling for math formulas between wikitext and
wikibase is the right thing to do. Of course it would have been nice for
MediaWiki to not invent its own TeX dialect for this, but it's 10 years too late
for that complaint now.


I do not agree with this argument. One could use a simplified version 
that is compatible with Wikipedia *and* with the rest of the world. We 
do not have MediaWiki markup in our text data, in spite of it being 
widely used on Wikipedia for many years -- instead, we now introduce a 
subset of it (the part you could put into <math>). If we have settled 
for a subset, why not use one that works with more commonly used tools 
as well? I don't think that MediaWiki LaTeX users would find it very 
hard to go back to the LaTeX they use elsewhere (in their own documents, 
on StackExchange, etc.).


A question you should ask when making extensions to the Wikidata data 
model is how much it will cost your data users to keep supporting 
Wikidata content in full. Such little twists, for a few extra commands, 
are creating extra work for many people.


The initial announcement one week before roll-out in the live system is 
not ideal either, adding some urgency to make this even more expensive. 
Now, four days later, even the final JSON datatype id for this has not 
been communicated yet ... we have 5 days left to update our code and 
make new releases. Of course, this schedule would leave no time for 
downstream users of our tools to update to the new version.


Data model updates are costly. Don't make them on a week's notice, 
without prior discussion, and without having any documentation ready to 
give to data users. It would also be good to announce breaking technical 
changes more prominently on wikidata-tech as well.


It would also be nice to include some motivation in your announcement 
(like "we expect at least 100K items -- 0.5% of current items -- to use 
properties of this type"). In the case of math, I could find some 
infoboxes that use LaTeX, so I guess this is what you are aiming for? I 
am not sure how many pages use such data though (inline math is of 
course very frequent, but it's not something you would store on 
Wikidata). If everybody can see why this is really needed, it will also 
increase acceptance in spite of some technical quirks.


Regards,

Markus




Moritz, I seem to recall that the new Math extension uses a standalone service
for rendering TeX to PNG, SVG, or MathML. Can that service easily be used
outside the context of MediaWiki?




___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata


Re: [Wikidata] upcoming deployments/features

2016-02-03 Thread Markus Krötzsch

Hi Moritz,

On 03.02.2016 15:25, Schubotz, Moritz wrote:

Hi Markus,

I think we agree on the goals cf. http://arxiv.org/abs/1404.6179
By the way the texvc dialect is now 13 years old at least.
For now it's required to be 100% compatible with the texvc dialect in order to 
use Wikidata in MediaWiki instances.
However, for the future there are also plans to support more markup.
But all new options are blocked by https://phabricator.wikimedia.org/T74240

Mathoid, the service that converts the texvc dialect to MathML, SVG + PNG can 
also be used without a MediaWiki instance.
I posted links to the Restbase Web UI before.

api.formulasearchengine.com (with experimental features)
de.wikipedia.org/api (stable)


This is the API you said "has been opened to the public just moments 
ago" and which describes itself as "currently in beta testing"? That 
seems a bit shaky to say the least. In your email, you said that this 
API was for extracting LaTeX package names and identifiers, not for 
rendering content, so I have not looked at it for this purpose. How does 
this compare to MathJax in terms of usage? Are the output types similar? 
It seems your solution adds the dependency on an external server, so 
this cannot be used in offline mode, I suppose? How does it support 
styling of content for your own application, e.g., how do you select the 
fonts to be used?


I think we agree that real documentation should be a bit more than an 
unexplained link in an email. Anyway, it is not your role to provide 
documentation on new Wikidata features or to make sure that stakeholders 
are taken along when new features are deployed, so don't worry too much 
about this. I am sure your students did a good job implementing this, 
and from there on it is really in other people's hands.


Cheers,

Markus





On 03.02.2016 at 14:31, Markus Krötzsch wrote:

Hi Moritz,

I must say that this is not very reassuring. So basically what we have in this
datatype now is a "LaTeX-like" markup language that is only supported by one
implementation that was created for MediaWiki, and partially by a LaTeX package
that you created.


Markus, this TeX dialect is not a new invention by Moritz. It's what the Math
extension for MediaWiki has been using for over a decade now, and it's used on
hundreds of thousands of pages on Wikipedia. All that we are doing now is making
this same exact syntax available for property values on wikibase, using the same
exact code for rendering it.

I think having consistent handling for math formulas between wikitext and
wikibase is the right thing to do. Of course it would have been nice for
MediaWiki to not invent its own TeX dialect for this, but it's 10 years too late
for that complaint now.

Moritz, I seem to recall that the new Math extension uses a standalone service
for rendering TeX to PNG, SVG, or MathML. Can that service easily be used
outside the context of MediaWiki?




___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata


Re: [Wikidata] weekly summary #194

2016-02-02 Thread Markus Krötzsch

On 02.02.2016 02:07, Michael Karpeles wrote:

Well, https://angryloki.github.io/wikidata-graph-builder will change my
life, this is amazing. Thank you AngryLoki and all the hundreds of
layers of contributors which lead to a tool like this. Also Lydia et al,
thanks for the hard work in keeping these updates going.


Indeed, this is very nifty. I also note that this uses some special 
features of our SPARQL endpoint that I did not know about (the "gas 
service"). It seems that this is a proprietary extension of BlazeGraph, 
which comes in very handy here.
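
(For reference, a sketch of the kind of query such tools send, using the gas 
service for a breadth-first traversal of the subclass hierarchy. The gas: 
parameter names shown are assumptions to be checked against the current 
Blazegraph and WDQS documentation, and Q5 / P279 are used purely as an example 
start node and link type.)

query = """
PREFIX gas: <http://www.bigdata.com/rdf/gas#>
SELECT ?class ?depth WHERE {
  SERVICE gas:service {
    gas:program gas:gasClass "com.bigdata.rdf.graph.analytics.BFS" ;
                gas:in wd:Q5 ;                     # start at "human"
                gas:traversalDirection "Reverse" ; # follow incoming edges ...
                gas:linkType wdt:P279 ;            # ... of type "subclass of"
                gas:maxIterations 3 ;
                gas:out ?class ;
                gas:out1 ?depth .
  }
}
"""
# The string can then be sent to https://query.wikidata.org/sparql like any
# other query, e.g. with requests.get(..., params={"query": query}).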


Best

Markus



On Mon, Feb 1, 2016 at 9:44 AM, Lydia Pintscher
wrote:

Hey everyone :)

Here's your summary of what's been happening around Wikidata over
the past week.


  Events /Press/Blogs
  

  * Replicator: Wikidata import tool


  * The Facebook of German Playwrights

  * Language usage on Wikidata

  * Past: FOSDEM (slides of talk by Lucie

)


  Other Noteworthy Stuff

  * Please help us classify a bunch of edits to improve
anti-vandalism tools on Wikidata


  * Over 18000 people who made at least one edit over the last month!
  * some visualizations:
  o Family tree of King Halo, race horse

  o Doctoral students of Gauss


  o Tributaries of the Danube


  o Children of Kronos


  o Influenced by Leibnitz


  o Graphs


  o Zika virus papers


  * KasparBot is now removing all PersonData template usages from
English Wikipedia. These templates added machine-readable information to
articles.
  * Wikiversity will get the first phase of Wikidata support
(language links) on February 23rd.
  * Upcoming deployments of new datatypes, In Other Projects
Sidebar, Article Placeholder and more


  * WD-FIST  now
supports SPARQL queries


  Did you know?

  * Newest properties
: National
Historic Sites of Canada ID

  * Query example: horses



Re: [Wikidata] upcoming deployments/features

2016-02-01 Thread Markus Krötzsch

On 01.02.2016 17:14, Lydia Pintscher wrote:

Hey folks :)

I just sat down with Katie to plan the next important feature
deployments that are coming up this month. Here is the plan:
* new datatype for mathematical expressions: We'll get it live on
test.wikidata.org  tomorrow and then bring it
to wikidata.org  on the 9th


Documentation? What will downstream users like us need to do to support 
this? How is this mapped to JSON? How is this mapped to RDF?


Markus


* Article Placeholder: We'll get it to test.wikipedia.org
 on the 9th
* new datatype for identifiers: we'll bring it to wikidata.org
 on the 16th. We'll convert existing properties
according to the list on
https://www.wikidata.org/wiki/User:Addshore/Identifiers in two rounds on
17th and 18th.
* In Other Projects Sidebar: We'll enable it by default on 16th for all
projects that do not opt-out on https://phabricator.wikimedia.org/T103102.
* interwiki links via Wikidata for Wikiversity: We'll enable phase 1 on
Wikiversity on the 23rd.


Cheers
Lydia
--
Lydia Pintscher - http://about.me/lydia.pintscher
Product Manager for Wikidata

Wikimedia Deutschland e.V.
Tempelhofer Ufer 23-24
10963 Berlin
www.wikimedia.de 

Wikimedia Deutschland - Gesellschaft zur Förderung Freien Wissens e. V.

Eingetragen im Vereinsregister des Amtsgerichts Berlin-Charlottenburg
unter der Nummer 23855 Nz. Als gemeinnützig anerkannt durch das
Finanzamt für Körperschaften I Berlin, Steuernummer 27/029/42207.


___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata






Re: [Wikidata-tech] Long QIDs in Wikidata dump

2016-01-08 Thread Markus Krötzsch

Hi David,

Those are the ids of statements. They are formed using the statement's 
UUID, which typically uses the item id (sometimes with a lower-case "q") 
as its first part. However, the exact form of the IDs should not be used 
to find out what the thing is or which item it belongs to: all of this 
information is encoded in triples in the dump, and these should be used 
to find out if a URI refers to an item or a statement etc.
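
As a small illustration of that last point: instead of parsing the ID
string, one can ask the data which item a statement node belongs to. A
sketch against today's SPARQL endpoint (whose vocabulary differs in details
from the dump discussed here), with a made-up statement URI as placeholder:

    import requests

    ENDPOINT = "https://query.wikidata.org/sparql"

    # Placeholder URI; substitute a statement URI taken from the dump.
    stmt = "http://www.wikidata.org/entity/statement/Q42-EXAMPLE-UUID"

    # The item that links to this node via a p: property is its owner; this
    # works without relying on the internal structure of the ID string.
    query = """
    SELECT ?item ?prop WHERE {
      ?item ?prop <%s> .
      FILTER(STRSTARTS(STR(?prop), "http://www.wikidata.org/prop/P"))
    }
    """ % stmt

    r = requests.get(ENDPOINT, params={"query": query, "format": "json"})
    print(r.json()["results"]["bindings"])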


Cheers,

Markus

On 08.01.2016 12:45, David Przybilla wrote:

While playing with the dump, I came across some entries like the following:



 "/m/021821" .

what does the `Q12258SCD97A47E-A0CA-453F-B01A-DEE8829139BF` stand for ?


___
Wikidata-tech mailing list
Wikidata-tech@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata-tech






Re: [Wikidata] Miga Classes and Properties Browser

2015-12-16 Thread Markus Krötzsch

On 16.12.2015 11:10, Gerard Meijssen wrote:

Hoi,
In the WDQ database all the data on P and Q values exist. It is stable
and it has proven itself over the last years as flexible and very fast.
Why build another database that is specific to one goal when another
database already exists that largely provides a similar function? A
database that can be even run at real time??


I think you misunderstood my email. I was not saying that WDQ cannot be 
used. I was merely asking you how one could use it, as I am not aware of 
any method to get the information that we display from this tool 
(certainly not through the Web interface, but I am not familiar with any 
other).


Markus


___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata


Re: [Wikidata] Miga Classes and Properties Browser

2015-12-16 Thread Markus Krötzsch

On 16.12.2015 09:06, Gerard Meijssen wrote:

Hoi,
The WDQ database is likely to include all the information that you need.
It is optimised to be fast. Did you consider to use it?


Can you explain this a bit more? How would I use the WDQ database for 
this application? (I guess you don't mean via the Web API)


Thanks,

Markus



On 15 December 2015 at 22:54, Markus Krötzsch
<mar...@semantic-mediawiki.org <mailto:mar...@semantic-mediawiki.org>>
wrote:

Hi,

Something to be noted here is that initial loading is quite a bit
slower than it used to be, since there are a lot more classes now.
We are looking into options of making this faster, but this might
need a full rewrite to become really fast. The good thing is that
loading only has to happen once per month (until the next data update).

There is a known issue with character encoding now, leading to "?"
in some labels/descriptions. We are looking into it.

Another new feature is that we now also count properties used on
property pages:


http://tools.wmflabs.org/wikidata-exports/miga/#_cat=Properties/Uses%20in%20properties=1%20-%202

Compared to the old data, we have a lot more objects in some classes
now (it's amazing how many asteroids, first names, bands, and legal
cases there are ...). It seems we have more than 10,000 galaxies on
Wikidata already.

Here are our top-100 classes by number of instances:


http://tools.wmflabs.org/wikidata-exports/miga/#_cat=Classes/Number%20of%20direct%20instances=1%20-%201000

Other classes have a lot of subclasses rather than instances:


http://tools.wmflabs.org/wikidata-exports/miga/#_cat=Classes/Number%20of%20direct%20subclasses=1000%20-%2020

And of course, as usual, you can browse individual properties to see
the classes of objects they are used on ("What kind of things have a
diameter?") and browse classes to see which properties are typical
for them ("What kind of statements do we have about poems?").

Cheers,

Markus



On 14.12.2015 22:00, Markus Damm wrote:

Hi all,

there is some good news: I updated the Miga Classes and Properties
Browser which collects several statistics about classes and
properties
used in Wikidata. In the future it will be updated monthly.

You can find it here:
http://tools.wmflabs.org/wikidata-exports/miga/

Hint: Since Miga uses WebSQL, the browser does not run in Internet
Explorer or Mozilla Firefox.

Best regards,
Markus

___
Wikidata mailing list
Wikidata@lists.wikimedia.org <mailto:Wikidata@lists.wikimedia.org>
https://lists.wikimedia.org/mailman/listinfo/wikidata









Re: [Wikidata] Miga Classes and Properties Browser

2015-12-15 Thread Markus Krötzsch

Hi,

Something to be noted here is that initial loading is quite a bit slower 
than it used to be, since there are a lot more classes now. We are 
looking into options of making this faster, but this might need a full 
rewrite to become really fast. The good thing is that loading only has 
to happen once per month (until the next data update).


There is a known issue with character encoding now, leading to "?" in 
some labels/descriptions. We are looking into it.


Another new feature is that we now also count properties used on 
property pages:


http://tools.wmflabs.org/wikidata-exports/miga/#_cat=Properties/Uses%20in%20properties=1%20-%202

Compared to the old data, we have a lot more objects in some classes now 
(it's amazing how many asteroids, first names, bands, and legal cases 
there are ...). It seems we have more than 10,000 galaxies on Wikidata 
already.


Here are our top-100 classes by number of instances:

http://tools.wmflabs.org/wikidata-exports/miga/#_cat=Classes/Number%20of%20direct%20instances=1%20-%201000

Other classes have a lot of subclasses rather than instances:

http://tools.wmflabs.org/wikidata-exports/miga/#_cat=Classes/Number%20of%20direct%20subclasses=1000%20-%2020

And of course, as usual, you can browse individual properties to see the 
classes of objects they are used on ("What kind of things have a 
diameter?") and browse classes to see which properties are typical for 
them ("What kind of statements do we have about poems?").


Cheers,

Markus


On 14.12.2015 22:00, Markus Damm wrote:

Hi all,

there is some good news: I updated the Miga Classes and Properties
Browser which collects several statistics about classes and properties
used in Wikidata. In the future it will be updated monthly.

You can find it here: http://tools.wmflabs.org/wikidata-exports/miga/

Hint: Since Miga uses WebSQL, the browser does not run in Internet
Explorer or Mozilla Firefox.

Best regards,
Markus

___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata





Re: [Wikidata] Query Service Examples

2015-12-11 Thread Markus Krötzsch

Thanks, nice.

Two comments:
* Activating the tag cloud in the UI was not very intuitive to me. I 
thought this was the search button or something.
* The tag cloud pop-up is half off-screen for me, and I cannot move it 
to be fully visible (Firefox, Linux).


Cheers,

Markus

On 10.12.2015 17:12, Jonas Kress wrote:


Hey,

the new query example dialog has just been released on
query.wikidata.org .

It looks like this:

[inline image 2: the new query examples dialog]

It has this cool feature to filter queries via tag cloud:

[inline image 3: filtering queries via the tag cloud]


The sample queries are parsed from this wiki page
.
When a query defines an item or property use via Q template, those will
be shown in the tag cloud.

Please feel free to add new fancy queries!

Cheers,
Jonas

--

Jonas Kress
Software Developer Wikimedia Deutschland e. V. | Tempelhofer Ufer 23-24
| 10963 Berlin Phone: +49 (0)30 219 158 26-0 http://wikimedia.de

Imagine a world, in which every single human being can freely share in the sum 
of all knowledge. That‘s our commitment.

Wikimedia Deutschland - Gesellschaft zur Förderung Freien Wissens e. V. 
Eingetragen im Vereinsregister des Amtsgerichts Berlin-Charlottenburg unter der 
Nummer 23855 B. Als gemeinnützig anerkannt durch das Finanzamt für 
Körperschaften I Berlin, Steuernummer 27/681/51985.



___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata






Re: [Wikidata] [Wikimedia-l] Quality issues

2015-12-09 Thread Markus Krötzsch

On 08.12.2015 00:02, Andreas Kolbe wrote:

Hi Markus,

...



Apologies for the late reply.

While you indicated that you had crossposted this reply to Wikimedia-l,
it didn't turn up in my inbox. I only saw it today, after Atlasowa
pointed it out on the Signpost op-ed's talk page.[1]


Yes, we have too many communication channels. Let me only reply briefly 
now, to the first point:



 > This prompted me to reply. I wanted to write an email that merely
says: > "Really? Where did you get this from?" (Google using Wikidata
content)

Multiple sources, including what appears to be your own research group's
writing:[2]


What this page suggested was that Freebase being shut down means 
that Google will use Wikidata as a source. Note that the short intro 
text on the page did not say anything else about the subject, so I am 
surprised that this sufficed to convince you about the truth of that 
claim (it seems that other things I write with more support don't have 
this effect). Anyway, I am really sorry to hear that this 
quickly-written intro on the web has misled you. When I wrote this after 
Google had made their Freebase announcement last year, I really believed 
that this was the obvious implication. However, I was jumping to 
conclusions there without having first-hand evidence. I guess many 
people did the same. I fixed the statement now.


To be clear: I am not saying that Google is not using Wikidata. I just 
don't know. However, if you make a little effort, there is a lot of 
evidence that Google is not using Wikidata as a source, even when it 
could. For example, population numbers are off, even in cases where they 
refer to the same source and time, and Google also shows many statements 
and sources that are not in Wikidata at all (and not even in Primary 
Sources).


I still don't see any problem if Google would be using Wikidata, but 
that's another discussion.


You mention "multiple sources".
{{Which}}?

Markus



___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata


Re: [Wikidata] [Wikimedia-l] Quality issues

2015-12-09 Thread Markus Krötzsch
P.S. Meanwhile, your efforts in other channels are already leading some 
people to vandalise Wikidata just to make a point [1].


Markus

[1] 
http://forums.theregister.co.uk/forum/1/2015/12/08/wikidata_special_report/



On 09.12.2015 11:32, Markus Krötzsch wrote:

On 08.12.2015 00:02, Andreas Kolbe wrote:

Hi Markus,

...



Apologies for the late reply.

While you indicated that you had crossposted this reply to Wikimedia-l,
it didn't turn up in my inbox. I only saw it today, after Atlasowa
pointed it out on the Signpost op-ed's talk page.[1]


Yes, we have too many communication channels. Let me only reply briefly
now, to the first point:


 > This prompted me to reply. I wanted to write an email that merely
says: > "Really? Where did you get this from?" (Google using Wikidata
content)

Multiple sources, including what appears to be your own research group's
writing:[2]


What this page suggested was that Freebase being shut down means
that Google will use Wikidata as a source. Note that the short intro
text on the page did not say anything else about the subject, so I am
surprised that this sufficed to convince you about the truth of that
claim (it seems that other things I write with more support don't have
this effect). Anyway, I am really sorry to hear that this
quickly-written intro on the web has misled you. When I wrote this after
Google had made their Freebase announcement last year, I really believed
that this was the obvious implication. However, I was jumping to
conclusions there without having first-hand evidence. I guess many
people did the same. I fixed the statement now.

To be clear: I am not saying that Google is not using Wikidata. I just
don't know. However, if you make a little effort, there is a lot of
evidence that Google is not using Wikidata as a source, even when it
could. For example, population numbers are off, even in cases where they
refer to the same source and time, and Google also shows many statements
and sources that are not in Wikidata at all (and not even in Primary
Sources).

I still don't see any problem if Google would be using Wikidata, but
that's another discussion.

You mention "multiple sources".
{{Which}}?

Markus





___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata


Re: [Wikidata] Wikidata Analyst, a tool to comprehensively analyze quality of Wikidata

2015-12-08 Thread Markus Krötzsch

Hi Amir,

Very nice, thanks! I like the general approach of having a stand-alone 
tool for analysing the data, and maybe pointing you to issues. Like a 
dashboard for Wikidata editors.


What backend technology are you using to produce these results? Is this 
live data or dumped data? One could also get those numbers from the 
SPARQL endpoint, but performance might be problematic (since you compute 
averages over all items; a custom approach would of course be much 
faster but then you have the data update problem).
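
To make the SPARQL option concrete, here is a rough sketch of how one such
ratio could be computed on the public endpoint; it may well time out for
very common properties, which is exactly the performance concern above.
P1082 (population) is used as the example property:

    import requests

    ENDPOINT = "https://query.wikidata.org/sparql"

    # Count all P1082 statements and how many carry at least one reference.
    # The prov: prefix is predeclared on query.wikidata.org.
    query = """
    SELECT (COUNT(DISTINCT ?st) AS ?statements)
           (COUNT(DISTINCT ?referenced) AS ?withReference)
    WHERE {
      ?item p:P1082 ?st .
      OPTIONAL { ?st prov:wasDerivedFrom ?r . BIND(?st AS ?referenced) }
    }
    """
    r = requests.get(ENDPOINT, params={"query": query, "format": "json"})
    print(r.json()["results"]["bindings"][0])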


An obvious feature request would be to display entity ids as links to 
the appropriate page, and maybe with their labels (in a language of your 
choice).


But overall very nice.

Regards,

Markus


On 08.12.2015 18:48, Amir Ladsgroup wrote:

Hey,
There have been several discussions regarding the quality of information in
Wikidata. I wanted to work on the quality of Wikidata, but we don't have any
good source of information to see where we are ahead and where we are
behind. So I thought the best thing I can do is to make something that
shows people in detail how well sourced our data is. So here we
have http://tools.wmflabs.org/wd-analyst/index.php

You can give only a property (let's say P31) and it gives you the four
most used values plus an analysis of sources and overall quality (check this
out), and then you can see that about ~33% of them are sourced, of which
29.1% are based on Wikipedia.
You can also give a property and multiple values you want. Let's say you want
to compare P27:Q183 (country of citizenship: Germany) and P27:Q30 (US);
check this out. And you can see that US biographies are more abundant
(300K versus 200K) but German biographies are more descriptive
(3.8 descriptions per item versus 3.2).

One important note: compare P31:Q5 (a trivial statement), where 46% of the
statements are not sourced at all and 49% are based on Wikipedia, *but* get
these statistics for the population property (P1082): it is not a trivial
statement and we need to be careful about it. It turns out there is
slightly more than one reference per statement and only 4% of them are
based on Wikipedia. So we can relax and enjoy this highly-sourced data.

Requests:

  * Please tell me whether do you want this tool at all
  * Please suggest more ways to analyze and catch unsourced materials

Future plan (if you agree to keep using this tool):

  * Support more datatypes (e.g. date of birth based on year, coordinates)
  * Sitelink-based and reference-based analysis (to check how much of
articles of, let's say, Chinese Wikipedia are unsourced)

  * Free-style analysis: There is a database for this tool that can be
used for way more applications. You can get the most unsourced
statements of P31 and then you can go to fix them. I'm trying to
build a playground for this kind of tasks)

I hope you like this and rock on!

Best


___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata






Re: [Wikidata] REST API for Wikidata

2015-12-02 Thread Markus Krötzsch

On 02.12.2015 23:17, Martynas Jusevičius wrote:

JSON-LD does add complexity over plain JSON -- because it also can be
interpreted as RDF. And that makes all the difference.

The importance of this distinction cannot be overstated. If one views
some custom JSON and JSON-LD (and by extension, RDF) as two
alternative formats for doing the same thing, then clearly one fails
to grasp what the semantics and web of data are about.


Well, we have only one database. So, in our case, all exports are, 
indeed, "alternative formats for the same thing". :-)


But, more seriously: we already support RDF exports, LOD-style and in 
dumps. The question for JSON-LD is not "do we appreciate the semantic 
clarity of RDF" but "should we add another RDF syntax to our exports"? I 
guess most RDF tools will be happy with NTriples, so there might not be 
any benefit for them to have JSON-LD in addition. The other question is 
if the API that Jeroen announced here is actually something that RDF 
crawlers would like to use (in which case it would help to support at 
least the RDF formats we already have elsewhere). That's what I meant 
when referring to use cases.
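
For reference, the LOD-style access mentioned here is just a matter of
fetching Special:EntityData with the desired extension; a minimal sketch:

    import requests

    # Per-entity RDF export; swap .nt for .ttl, .rdf or .json as needed.
    r = requests.get("https://www.wikidata.org/wiki/Special:EntityData/Q42.nt")
    print(r.text.splitlines()[0])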


Markus



On Wed, Dec 2, 2015 at 8:46 PM, Jeroen De Dauw  wrote:

Hey,

Conal, thanks for explaining. The items are indeed not linked. Indeed, the
current format hides the fact that they are items altogether. Perhaps this
is going a step too far, and it is better to take an approach similar to
that of Hay: still have a dedicated wikibase-item data type for which the
value includes the id, but also the label. What are your thoughts on using
this approach [0]? (Everyone is welcome to comment on this.) As you can see,
it makes the format significantly more verbose and is not quite as trivial
to use when you don't care about the links or ids. It's still a lot simpler
than dealing with the canonical Wikibase item format, and perhaps strikes a
better balance than what I created initially.

[0] https://gist.github.com/JeroenDeDauw/fc17f9fdd2e4567a17ff


And if the data serialization is JSON, why not use JSON-LD which is
all the rage these days? http://www.w3.org/TR/json-ld/


Thanks for the suggestion. I'm not really familiar with JSON-LD and quickly
read through the examples there. While this is certainly interesting, I
wonder what the actual benefits are, going on the assumption that most
developers are unfamiliar with this format. It does add complexity, so I'm
quite hesitant to use this in the main format. That said, this might be a
good candidate for an additional response format. (And adding additional
formats to the API in a clean way ought to be quite easy, see
https://github.com/JeroenDeDauw/QueryrAPI/issues/39)

Cheers

--
Jeroen De Dauw - http://www.bn2vs.com
Software craftsmanship advocate
~=[,,_,,]:3

___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata





Re: [Wikidata] Preferred rank -- choices for infoboxes, versus SPARQL

2015-11-29 Thread Markus Krötzsch

On 28.11.2015 16:51, Federico Leva (Nemo) wrote:

Gerard Meijssen, 28/11/2015 07:05:

A big city is what? A city with more than a given number of inhabitants?
If so it is redundant because it can be inferred.


Criteria might be defined by local law and/or require some
administrative act. That's how it works in Italy, for instance.


German actually has a noun "Großstadt" ("big city"), which has a fixed 
meaning to be a city with at least 100,000 inhabitants. The noun is now 
widely used as a native expression in everyday German, unlike the 
descriptive English translation "big city" (which feels to English 
speakers like "große Stadt" would feel to German speakers). It is an 
example of a concept that natively exists in some languages but not in 
others.


The important clarification to Gerard's reply is that this is not a case 
of an over-specific Wikipedia category that somehow became a Wikidata 
class. Rather, it is a concept that is natural to some languages and not 
to others. It's a challenge for Wikidata to deal with this, clearly. 
Nevertheless, from my point of view, a classification that integrates 
concepts from many cultures/languages is preferable over many disjointed 
classifications that are perfectly aligned with one particular 
culture/language.


Markus


___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata


Re: [Wikidata] Odd results from wdqs

2015-11-27 Thread Markus Krötzsch

On 27.11.2015 15:22, Magnus Manske wrote:

It was the "absolute terms" problem here  ;-)


But 3MB uncompressed string data does not seem to be so big in absolute 
terms, or are you referring to something else (I got this number from 
the long pages special)? Parsing a 3MB string may need some extra 
memory, but the data you get in the end should not be much bigger than 
the original string, or should it?


Markus



On Fri, Nov 27, 2015 at 2:12 PM Markus Krötzsch
<mar...@semantic-mediawiki.org <mailto:mar...@semantic-mediawiki.org>>
wrote:

On 25.11.2015 16:05, Lydia Pintscher wrote:
 > On Mon, Nov 23, 2015 at 10:54 PM, Magnus Manske
 > <magnusman...@googlemail.com
<mailto:magnusman...@googlemail.com>> wrote:
 >> Well, my import code chokes on the last two JSON dumps (16th and
23rd). As
 >> it fails about half an hour or so in, debugging is ...
inefficient. Unless
 >> there is something that has changed with the dump itself (new
data type or
 >> so), and someone tells me, it will be quite some time (days,
weeks) until I
 >> figure it out.
 >
 > To update everyone here as well: Magnus has been able to pinpoint the
 > problem and fix the tools. They're catching up again. The issue was
 > one the extremely big pages that have have recently been created for
 > research papers: https://www.wikidata.org/wiki/Special:LongPages

Thanks for explaining. This explains why we did not see any problems or
unusual behaviour in Wikidata Toolkit. I guess Java simply does not care
about how long pages are, as long as they are not very big in absolute
terms.

Markus


___
Wikidata mailing list
Wikidata@lists.wikimedia.org <mailto:Wikidata@lists.wikimedia.org>
https://lists.wikimedia.org/mailman/listinfo/wikidata





Re: [Wikidata] Preferred rank -- choices for infoboxes, versus SPARQL

2015-11-27 Thread Markus Krötzsch

Hi James,

I would immediately agree to the following measures to alleviate your 
problem:


(1) If some instance-of statements are historic (i.e., no longer valid), 
then one should make the current ones "preferred" and leave the historic 
ones "normal", just like for, e.g., population numbers. This would get 
rid of the rather inappropriate "Free imperial city" label for Frankfurt.


(2) If some classes are redundant, they could be removed (e.g., if we 
already have "Big city" we do not need "city"). However, community might 
decide to prefer the direct use of a main class (such as "Human"), even 
if redundant.


The other issues you mention are more tricky. Especially issues of 
translation/cultural specificity. The most specific classes are not 
always the ones that all languages would want to see, e.g., if the 
concept of the class is not known in that language.


Possible options for solving your problem:

* Make a whitelist of classes you want to show at all in the template, 
and default to "city" if none of them occurs.

* Make a blacklist of classes you want to hide.
* Instead of blacklist or whitelist, show only classes that have a 
Wikipedia page in your language; default to "city" if there are none.
* Try to generalise overly specific classes (change "big city" to "city" 
etc.). I don't know if there is a good programmatic approach for this, 
or if you would have to make a substitution list or something, which 
would not be very maintainable.
* Do not use instance-of information like this in the infobox. It might 
sound radical, but I am not sure if "instance of" is really working very 
well for labelling things in the way you expect. Instance-of can refer 
to many orthogonal properties of an object, in essentially random order, 
while a label should probably focus on certain aspects only.


For obvious reasons, ranks of statements cannot be used to record 
language-specific preferences.


Cheers,

Markus

On 27.11.2015 15:58, James Heald wrote:

Some items have quite a lot of "instance of" statements, connecting them
to quite a few different classes.

For example, Frankfurt is currently an instance of seven different classes,
 https://www.wikidata.org/wiki/Q1794

and Glasgow is currently an instance of five different classes:
 https://www.wikidata.org/wiki/Q4093

This can produce quite a pile-up of descriptions in the
description/subtitle section of an infobox -- for example, as on the
Spanish page for Frankfurt at
 https://es.wikipedia.org/wiki/Fr%C3%A1ncfort_del_Meno
in the section between the infobox title and the picture.


Question:

Is it an appropriate use of ranking, to choose a few of the values to
display, and set those values to be "preferred rank" ?

It would be useful to have wider input, as to whether it is a good thing
as to whether this is done widely.

Discussions are open at
https://www.wikidata.org/wiki/Wikidata:Project_chat#Preferred_and_normal_rank

and
https://www.wikidata.org/wiki/Wikidata:Bistro#Rang_pr.C3.A9f.C3.A9r.C3.A9

-- but these have so far been inconclusive, and have got slightly taken
over by questions such as

* how well terms really do map from one language to another --
near-equivalences that may be near enough for sitelinks may be jarring
or insufficient when presented boldly up-front in an infobox.

(For example, the French translation "ville" is rather unspecific, and
perhaps inadequate in what it conveys, compared to "city" in English or
"ciudad" in Spanish; "town" in English (which might have over 100,000
inhabitants) doesn't necessarily match "bourg" in French or "Kleinstadt"
in German).

* whether different-language wikis may seek different degrees of
generalisation or specificity in such sub-title areas, depending on how
"close" the subject is to that wiki.

(For readers in some languages, some fine distinctions may be highly
relevant and familiar, whereas for other language groups that level of
detail may be undesirably obscure).


There is also the question of the effect of promoting some values to
"preferred rank" for the visibility of other values in SPARQL -- in
particular when so queries are written assuming they can get away with
using just the simple "truthy" wdt:... form of properties.

However, making eg the value "city" preferred for Glasgow means that it
will no longer be returned in searches for its other values, if these
have been written using "wdt:..." -- so it will now be missed in a
simple-level query for "council areas", the current top-level
administrative subdivisions of Scotland, or for historically-based
"registration counties" -- and this problem will become more pronounced
if the practice becomes more widespread of making some values
"preferred" (and so other values invisible, at least for queries using
wdt:...).

 From a SPARQL point of view, what would actually be very helpful would be
to add a (new) fourth rank -- "misleading without qualifier", below
"normal" but above "deprecated" -- for statements that *are* 

Re: [Wikidata] [Wiki-research-l] Quality issues

2015-11-21 Thread Markus Krötzsch

On 21.11.2015 12:21, Jane Darnell wrote:

+1
I think many Wikipedians are control freaks who like to think their
articles are the endpoint in any internet search on their article
subjects. We really need to suppress the idea that the data they have
curated so painstakingly over the years is less valuable because it is
not on Wikidata or disagrees with data on Wikidata in some way. We can
and should let these people continue to thrive on Wikipedia without
pressuring them to look at their data on Wikidata, which might confuse
and overwhelm them. They figured out Wikipedia at some point and
presumably some of them have figured out Commons. In future they may
figure out Wikidata, but that will be on their own terms and in their
own individual way.



Yes, one can also understand the point of view of many seasoned 
Wikipedians. Because of the popularity of the platform, large parts of 
their daily work consist in defending "their" content against all kinds 
of absurd ideas and changes for the worse. Rather than writing new, 
better content, their main work is in rejecting content that is worse. 
They therefore are spending a lot of time on talk pages, having debates 
with people whom most of us would simply ignore on the Internet, but 
which they cannot ignore if they want to protect what has been achieved 
already. Doing this is hard work, since Wikipedia rejects the notion of 
personal standing or seniority as a basis for "trusting" someone to be 
right -- every puny battle of opinions has to be fought out on the talk 
page. The only thing to allude to is some abstract notion of "quality" 
-- and a complex system of policies and processes.


This tough work hardens people and gives them a negative bias towards 
change, especially towards process changes that might lead to reduced 
control. They worry (not unreasonably!) that Wikidata does not have this 
community of gate keepers that can fend off the irrational and the 
misguided. They also worry that they themselves may not have enough time 
to take on this task, watching yet another site in addition to what they 
already do in their Wikipedias.


Conversely, people on Wikidata are (not unreasonably!) frustrated when 
being met with the same distrust as the average Internet freak that 
Wikipedians are fighting off on a daily basis, rather than being 
accepted as members of the Wikimedia community who are working towards 
the same goal.


Considering all this, it is amazing what has been achieved already :-)

Markus


___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata


Re: [Wikidata] [Wiki-research-l] Quality issues

2015-11-21 Thread Markus Krötzsch

On 20.11.2015 09:18, Federico Leva (Nemo) wrote:

Gerard Meijssen, 20/11/2015 08:18:

At this moment there
are already those at Wikidata that argue not to bother about Wikipedia
quality because in their view, Wikipedians do not care about its own
quality.


And some wikipedians say the same of Wikidata. So "quality" in such
discussions is just a red herring used to raise matters of control (i.e.
power and social structure). Replace "quality" with "the way I do
things" in all said discussions and suddenly things will make more sense.


+1 to this accurate analysis

What we need to overcome this is more mutual trust, and more personal 
overlaps between communities. There are already some remarkable projects 
where the boundary between "Wikipedian" and "Wikidatista" (or what's our 
demonym now?) has vanished. I think these will naturally grow and 
prosper as Wikidata becomes better and better (bigger, more reliable, 
more usable, etc.), but it will take some patience and we should not 
expect Wikipedia veterans to change their processes overnight to 
accommodate Wikidata. I think the right strategy is to do this 
grass-roots style, not by expecting big policy changes, but by showing 
the gain of Wikidata to individual domains one by one.


Markus



The first step to improve the situation, imho, is to banish the word
"quality".

Nemo

___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata





Re: [Wikidata] WDQS updates have stopped

2015-11-19 Thread Markus Krötzsch

On 19.11.2015 10:40, Gerard Meijssen wrote:

Hoi,
Because once it is a requirement and not a recommendation, it will be
impossible to reverse this. The insidious creep of more rules and
requirements will make Wikidata increasingly less of a wiki. Arguably
most of the edits done by bot are of a higher quality than those done by
hand. It is for the people maintaining the SPARQL environment to ensure
that it is up to the job as it does not affect Wikidata itself.

I think each of these arguments holds its own. Together they are
hopefully potent enough to prevent such silliness.


Maybe it would not be that bad. I actually think that many bots right 
now are slower than they could be because they are afraid to overload 
the site. If bots would check the lag, they could operate close to the 
maximum load that the site can currently handle, which is probably more 
than most bots are doing now.
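
For illustration, checking the lag is already built into the MediaWiki API:
a bot can pass a maxlag parameter and back off whenever the server reports
that it is lagging. A minimal sketch, not tied to any particular bot
framework:

    import time
    import requests

    API = "https://www.wikidata.org/w/api.php"

    def post_with_maxlag(session, params, maxlag=5, pause=10):
        """Send an API request, retrying politely while the server is lagging."""
        params = dict(params, maxlag=maxlag, format="json")
        while True:
            reply = session.post(API, data=params).json()
            if reply.get("error", {}).get("code") == "maxlag":
                time.sleep(pause)  # server overloaded: wait and try again
                continue
            return reply

    # session would be a logged-in requests.Session(); params would contain
    # the actual edit action and its token.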


The "requirement" vs. "recommendation" thing is maybe not so relevant, 
since bot rules (mandatory or not) are currently not enforced in any 
strong way. Basically, the whole system is based on mutual trust and 
this is how it should stay.


Markus

___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata


Re: [Wikidata] "Implementing" OWL RL in SPARQL (Was: qwery.me - simpler queries for wikidata)

2015-11-13 Thread Markus Krötzsch

On 12.11.2015 22:09, Peter F. Patel-Schneider wrote:

On 11/12/2015 09:10 AM, Markus Krötzsch wrote:
[...]

On the other hand, it is entirely possible to implement correct OWL QL (note:
*QL* not *RL*) reasoning in SPARQL without even using "rules" that need any
recursive evaluation [3]. This covers all of RDFS, and indeed some of the
patterns in these queries are quite well-known to Wikidata users too (e.g.,
using "subclassOf*" in a query). Depending on how much of OWL QL you want to
support, the SPARQL queries you get in this case are more or less simple. This
work also gives arguments as to why this style of SPARQL-based implementation
does (most likely) not exist for OWL RL [3].


Does OWL QL cover *all* of RDFS, even things like subproperties of
rdfs:subclassOf and rdfs:subPropertyOf?


No, surely not. What I meant is the RDFS-fragment of OWL DL here (which 
is probably what RDFS processors are most likely to implement, too).


I think I recall you showing P-hardness of RDFS proper a while ago, 
which would obviously preclude translation into single SPARQL 1.1 
queries (unless NL=P).


Markus


___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata


Re: [Wikidata] Data model explanation and protection

2015-11-11 Thread Markus Krötzsch

On 11.11.2015 11:33, Thomas Douillard wrote:

There is a proposal for some kind of class disjointness :
https://www.wikidata.org/wiki/Wikidata:Property_proposal/Generic#subclass this
is here for a while now, maybe a few more supporters would speed up the
process :)


Interesting. This looks like a more complex modelling that combines 
"union of" with "disjoint classes". I would prefer to have the simpler 
modelling primitives before introducing such a shortcut.


There is also an slight mismatch between the Wikidata statement format 
(with main value and qualifiers) and the use for assigning a set of 
values (two classes, that are equally important). It's clear that we 
have to do something like this if we want to make such statements, but I 
would prefer an encoding where the classes are both in qualifiers, e.g.:


 Disjoint union of SOME VALUE
   of <class 1>
   of <class 2>

This is also similar to what is done in OWL, and we already have the 
"of" qualifier.


Will add this comment.



I think a proposal for "DisjointWith" was rejected a long time ago. But
another one could pass.


Yes, I think we should revisit this decision in the light of the new 
requirements and our grown experience in working with Wikidata.


Markus





___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata


Re: [Wikidata] Data model explanation and protection

2015-11-10 Thread Markus Krötzsch

On 29.10.2015 05:41, Benjamin Good wrote:

For what its worth, I tend to agree with Peter here.  It makes sense to
me to add constraints akin to 'disjoint with' at the class level.


+1 for having this. This does not preclude to have an additional 
mechanism on the instance level if needed to augment the main thing, but 
the classes are an easier way to start.


This can also help with detecting other issues that are unrelated to 
merging. For instance, nothing should be an event and an airplane at the 
same time.


We need a common approach on how to deal with ambiguous Wikipedia 
articles. One option would be to create an "auxiliary" item that is not 
linked to Wikipedia in such a case, but that is used to represent some 
aspects of the "main" item that would otherwise be incompatible.


Benjamin is right that these issues are not specific to the bio domain. 
It's rather the opposite: the bio domain is one of the domains that is 
advanced enough to notice these problems ...



The
problem I see is that we don't exactly have classes here as the term is
used elsewhere.  I guess in wikidata, a 'class' is any entity that
happens to be used in a subclassOf claim ?


In this case, one can leave this to the user: two items that are 
specified to be disjoint classes are classes.


In the Wikidata Taxonomy Browser, we consider items as classes if one of 
the following is true:

(1) they have a "subclass of" statement
(2) they are the target of a "subclass of" statement
(3) they are the target of an "instance of" statement

We then (mostly) ignore the classes that do not have own instances or 
own subclasses (the "leafs" in the taxonomy), since these are very many:

* The above criterion leads to over 200,000 class items.
* Only about 20,000 of them have instances or subclasses.
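
The three criteria above translate directly into a SPARQL query. A sketch
that counts such class items on the public endpoint (it may time out, since
it touches a lot of data):

    import requests

    ENDPOINT = "https://query.wikidata.org/sparql"

    # An item counts as a class if it has, or is the target of, a P279
    # (subclass of) statement, or is the target of a P31 (instance of) one.
    query = """
    SELECT (COUNT(DISTINCT ?class) AS ?classes) WHERE {
      { ?class wdt:P279 ?x } UNION
      { ?x wdt:P279 ?class } UNION
      { ?x wdt:P31 ?class }
    }
    """
    r = requests.get(ENDPOINT, params={"query": query, "format": "json"})
    print(r.json()["results"]["bindings"][0]["classes"]["value"])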



Another way forward could be to do this using properties rather than
classes.  I think this could allow use to use the constraint-checking
infrastructure that is already in place?  You could add a constraint on
a property that it is 'incompatible with' another property.  In the
protein/gene case we could pragmatically use Property:P351 (entrez gene
id), incompatible with Property:P352 (uniprot gene id).  More
semantically, we could use 'encoded by' incompatible-with 'encodes' or
'genomic start'


I think the constraint checking infrastructure should be able to handle 
both approaches equally well. If "disjoint with" is a statement, one 
could even check this constraint in SPARQL (possibly further restricting 
to query only for constraint violations in a particular domain).
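
For example, a check for a single disjointness declaration could be as
simple as the following sketch; the two class items are placeholders
(assumed here to be Q1656682 for event and Q197 for airplane, echoing the
example above), so substitute the classes actually declared disjoint:

    import requests

    ENDPOINT = "https://query.wikidata.org/sparql"

    # Items that are (directly or transitively) instances of both classes
    # violate the disjointness constraint.
    query = """
    SELECT ?item WHERE {
      ?item wdt:P31/wdt:P279* wd:Q1656682 .
      ?item wdt:P31/wdt:P279* wd:Q197 .
    } LIMIT 100
    """
    r = requests.get(ENDPOINT, params={"query": query, "format": "json"})
    for b in r.json()["results"]["bindings"]:
        print(b["item"]["value"])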


Cheers,

Markus



On Wed, Oct 28, 2015 at 5:08 PM, Peter F. Patel-Schneider


___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata


Re: [Wikidata] Blazegraph

2015-10-27 Thread Markus Krötzsch

On 27.10.2015 15:34, Paul Houle wrote:

One thing I really liked about Kasabi was that it had a simple interface
for people to enter queries and share them with people.  The
"Information Workbench" from fluidOps does something similar although I
never seen it open to the public.  A database of queries also is a great
tool for testing both the code and the documentation,  both of the
reference and cookbook kind.


Have you had a look at http://wikidata.metaphacts.com/? It has some 
interesting data presentation/visualisation features that are tied in 
with a SPARQL endpoint over Wikidata (not sure if it is the same one now).




I see no reason why one instance of Blazegraph is having all the fun.
With a good RDF dump,  people should be loading Wikidata into all sorts
of triple stores and since Wikidata is not that terribly big at this
time,  "alternative" endpoints ought to be cheap and easy to run


Definitely. However, there is some infrastructural gap between loading a 
dump once in a while and providing a *live* query service. 
Unfortunately, there are no standard technologies that would routinely 
enable live updates of RDF stores, and Wikidata is rather low-tech when 
it comes to making its edits available to external tools. One could set 
up the code that is used to update query.wikidata.org (I am sure it's 
available somewhere), but it's still some extra work.


Regards,

Markus






On Mon, Oct 26, 2015 at 11:31 AM, Kingsley Idehen
> wrote:

On 10/25/15 10:51 AM, James Heald wrote:

Hi Gerard.  Blazegraph is the name of the open-source SPARQL
engine being used to provide the Wikidata SPARQL service.

So Blazegraph **is** available to all of us, at
https://query.wikidata.org/ , via
both the query editor, and the SPARQL API endpoint.

It's convenient to talk describe some issues with the SPARQL
service being "Blazegraph issues", if the issues appear to lie
with the query engine.

Other query engines that other people be running might be running
might have other specific issues, eg "Virtuoso issues".  But it is
Blazegraph that the Discovery team and Wikidata have decided to go
with.


The beauty of SPARQL is that you can use URLs to show query results
(and even query definitions). Ultimately, engine aside, there is
massive utility in openly sharing queries and then determining what
might the real problem.

Let's use open standards to work in as open a fashion as is possible.

--
Regards,

Kingsley Idehen 
Founder & CEO
OpenLink Software
Company Web:http://www.openlinksw.com
Personal Weblog 1:http://kidehen.blogspot.com
Personal Weblog 2:http://www.openlinksw.com/blog/~kidehen
Twitter Profile:https://twitter.com/kidehen
Google+ Profile:https://plus.google.com/+KingsleyIdehen/about
LinkedIn Profile:http://www.linkedin.com/in/kidehen
Personal WebID:http://kingsley.idehen.net/dataspace/person/kidehen#this


___
Wikidata mailing list
Wikidata@lists.wikimedia.org 
https://lists.wikimedia.org/mailman/listinfo/wikidata




--
Paul Houle

*Applying Schemas for Natural Language Processing, Distributed Systems,
Classification and Text Mining and Data Lakes*

(607) 539 6254paul.houle on Skype ontolo...@gmail.com


:BaseKB -- Query Freebase Data With SPARQL
http://basekb.com/gold/

Legal Entity Identifier Lookup
https://legalentityidentifier.info/lei/lookup/


Join our Data Lakes group on LinkedIn
https://www.linkedin.com/grp/home?gid=8267275



___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata






Re: [Wikidata] Announcing Wikidata Taxonomy Browser (beta)

2015-10-25 Thread Markus Krötzsch

On 25.10.2015 02:18, Kingsley Idehen wrote:

On 10/24/15 10:51 AM, Markus Krötzsch wrote:

On 24.10.2015 12:29, Martynas Jusevičius wrote:

I don't see how cycle queries can be a requirement for SPARQL engines if
they are not part of SPARQL spec? The closest thing you have is property
paths.


We were talking about *cyclic data* not cyclic queries (which you can
also create easily using BGPs, but that's unrelated here). Apparently,
BlazeGraph has performance issues when computing a path expression
over a cyclic graph.

Markus


Markus,

Out of curiosity, can you share a SPARQL query example (text or query
results url) that demonstrates your point?


You mean a query with BlazeGraph having performance issues? That problem 
was reported by Stas. He should have examples. In any case, it is always 
a combination of query and data.


Markus


___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata


Re: [Wikidata] Announcing Wikidata Taxonomy Browser (beta)

2015-10-24 Thread Markus Krötzsch

On 24.10.2015 12:29, Martynas Jusevičius wrote:

I don't see how cycle queries can be a requirement for SPARQL engines if
they are not part of SPARQL spec? The closest thing you have is property
paths.


We were talking about *cyclic data* not cyclic queries (which you can 
also create easily using BGPs, but that's unrelated here). Apparently, 
BlazeGraph has performance issues when computing a path expression over 
a cyclic graph.


Markus



On Sat, 24 Oct 2015 at 09:37, James Heald > wrote:

On 24/10/2015 00:50, Stas Malyshev wrote:
 > Hi!
 >
 >> least one Wikipedia) are considered to refer to equivalent
classes on
 >> Wikidata, which could be expressed by a small subclass-of cycle. For
 >
 > We can do it, but I'd rather we didn't. The reason is that it would
 > require engine that queries such data (e.g. SPARQL engine) to be
 > comfortable with cycles in property paths (especially ones with + and
 > *), and not every one is (Blazegraph for example looks like does not
 > handle them out of the box). It can be dealt with, I assume, but why
 > create trouble for ourselves?

It should be a basic requirement of any SPARQL engine that it should be
able to handle path queries that contain cycles.

For example, consider equivalence relationships like P460 "said to be
the same as", which is being used to link given names together.

If we want to find all the names in a particular equivalence class, and
eg rank them by their incidence count, as is done in the 'query'
columns at
https://www.wikidata.org/wiki/Wikidata:WikiProject_Names/given-name_variants

then being able to handle cycles in path queries is a basic requirement
for the job.

 -- James.


___
Wikidata mailing list
Wikidata@lists.wikimedia.org 
https://lists.wikimedia.org/mailman/listinfo/wikidata





Re: [Wikidata] Announcing Wikidata Taxonomy Browser (beta)

2015-10-24 Thread Markus Krötzsch

On 24.10.2015 09:36, James Heald wrote:

On 24/10/2015 00:50, Stas Malyshev wrote:

Hi!


least one Wikipedia) are considered to refer to equivalent classes on
Wikidata, which could be expressed by a small subclass-of cycle. For


We can do it, but I'd rather we didn't. The reason is that it would
require engine that queries such data (e.g. SPARQL engine) to be
comfortable with cycles in property paths (especially ones with + and
*), and not every one is (Blazegraph for example looks like does not
handle them out of the box). It can be dealt with, I assume, but why
create trouble for ourselves?


It should be a basic requirement of any SPARQL engine that it should be
able to handle path queries that contain cycles.

For example, consider equivalence relationships like P460 "said to be
the same as", which is being used to link given names together.

If we want to find all the names in a particular equivalence class, and
eg rank them by their incidence count, as is done in the 'query' columns at
https://www.wikidata.org/wiki/Wikidata:WikiProject_Names/given-name_variants


then being able to handle cycles in path queries is a basic requirement
for the job.


I agree. Even if we discourage cycles in other cases, there is still no 
guarantee that there won't be any, so the engine should be robust 
against this.


On the other hand, we have to live with the technical infrastructure we 
got. If BlazeGraph does not handle cycles well, we should encourage 
their team to work on fixing this, but at the same time we need to work 
around the issue for a while.


"Said to be the same as" is a good example of a case where cycles are 
unavoidable. A possible workaround in this case is to make sure that the 
transitive closure of "said to be the same as" is already in the data, 
such that the path "P460+" returns the same results as a mere "P460" 
would. It's not ideal, but maybe workable.
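
As a sketch of the kind of query meant here (the starting item is just a
placeholder): following P460 in both directions with a * path collects the
whole equivalence class, whether or not the transitive closure is
materialised in the data.

    import requests

    ENDPOINT = "https://query.wikidata.org/sparql"

    # wd:Q42 is a placeholder; substitute a given-name item of interest.
    query = """
    SELECT DISTINCT ?name WHERE {
      wd:Q42 (wdt:P460|^wdt:P460)* ?name .
    }
    """
    r = requests.get(ENDPOINT, params={"query": query, "format": "json"})
    for b in r.json()["results"]["bindings"]:
        print(b["name"]["value"])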


Markus


___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata


Re: [Wikidata] Announcing Wikidata Taxonomy Browser (beta)

2015-10-23 Thread Markus Krötzsch

On 23.10.2015 11:16, Gerard Meijssen wrote:

Hoi,
The problem with tools like this is that they get a moment of attention.
Particularly when they are stand alone, not integrated, they will lose
interest.


Problems, problems, ...



Would it be an option to host this tool on Labs?


Yes, this is planned for the future, especially to automate regular data 
updates, which Serge now has to do manually. Besides the changed URL, 
this move would make a big difference for users. What you see right now 
is a first prototype beta-release that is meant to gather user feedback 
on how to develop this tool further.


Markus



On 22 October 2015 at 21:27, Markus Kroetzsch
>
wrote:

On 22.10.2015 19:29, Dario Taraborelli wrote:

I’m constantly getting 500 errors.


I also observed short outages in the past, and I sometimes had to
run a request twice to get an answer. It seems that the hosting on
bitbucket is not very reliable. At the moment, this is still a first
preview of the tool without everything set up as it should be. The
tool should certainly move to Wikimedia labs in the future.

Markus



--
Markus Kroetzsch
Faculty of Computer Science
Technische Universität Dresden
+49 351 463 38486 
http://korrekt.org/

___
Wikidata mailing list
Wikidata@lists.wikimedia.org 
https://lists.wikimedia.org/mailman/listinfo/wikidata






Re: [Wikidata] need help in running Wikidata-Toolkit-0.5.0 examples

2015-10-20 Thread Markus Krötzsch
[Maybe let's move this to wikidata-tech -- including the Wikidata 
discussion list here for the last time; please remove it in your reply]



Dear Satya,

Both issues you encountered seem to be caused by how you run the 
examples. It seems that the project is not configured/compiled properly 
yet. To help you, I need to know how you downloaded and ran WDTK. There 
are two main options:


(1) Develop stand-alone code that works with the released libraries, as 
provided by Maven Central. This method is described here:

https://www.mediawiki.org/wiki/Wikidata_Toolkit#Beginner.27s_guide

(2) Extend the Wikidata Toolkit project by adding own examples etc. In 
this case, you would download (clone/branch) the code and build all of 
Wikidata Toolkit locally. Our developers' guide describes how to set up 
our project in Eclipse:


https://www.mediawiki.org/wiki/Wikidata_Toolkit/Eclipse_setup

Tests are only included in the source code, so I suppose you have 
followed (2)? Or did you just download the code and compile it with 
Maven locally (no Eclipse)?


Best regards,

Markus


On 20.10.2015 13:31, Satya Gadepalli wrote:

Issue 1: Tests are failing as I am not able to
find testdump-20150512.json.gz

Any Idea where i can find this file?

Issue 2: FetchOnlineDataExample is failing due to MediaWikiApiErrorException


E:\temp\WikiData\Execute>call java -cp  wdtk-examples-0.5.0.jar
org.wikidata.wdtk.examples.FetchOnlineDataExample
Picked up JAVA_TOOL_OPTIONS: -Dfile.encoding=UTF8
Error: A JNI error has occurred, please check your installation and try
again
Exception in thread "main" java.lang.NoClassDefFoundError:
org/wikidata/wdtk/wikibaseapi/apierrors/MediaWikiApiErrorException
 at java.lang.Class.getDeclaredMethods0(Native Method)
 at java.lang.Class.privateGetDeclaredMethods(Class.java:2701)
 at java.lang.Class.privateGetMethodRecursive(Class.java:3048)
 at java.lang.Class.getMethod0(Class.java:3018)
 at java.lang.Class.getMethod(Class.java:1784)
 at
sun.launcher.LauncherHelper.validateMainClass(LauncherHelper.java:544)
 at
sun.launcher.LauncherHelper.checkAndLoadMain(LauncherHelper.java:526)
Caused by: java.lang.ClassNotFoundException:
org.wikidata.wdtk.wikibaseapi.apierrors.MediaWikiApiErrorException
 at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
 at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
 at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:331)
 at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
 ... 7 more

Any help?

thx


___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata






Re: [Wikidata] Announcing wdq command line client

2015-10-07 Thread Markus Krötzsch

Hi Jakob,

Very handy, thanks.

Markus


On 07.10.2015 14:04, Jakob Voß wrote:

Hi,

Based on a script by Marius Hoch
(https://github.com/mariushoch/asparagus) I created a command line
client to access Wikidata Query Service. The current release 0.2.0
includes the following features:

* adding of default namespaces if needed
* multiple output formats
* abbreviate Wikidata URIs if requested
* directly export to Excel (requires additional CPAN module)
* fancy colors

Usage Examples are given at https://github.com/nichtich/wdq#examples

Bug reports and feature requests at
https://github.com/nichtich/wdq/issues

Pull requests are welcome if you know some Perl. By now it's ~300 LoC and
~200 lines of documentation; the largest contribution would be a WDQ to
SPARQL translator as implemented here: https://github.com/smalyshev/wdq2sparql

Cheers
Jakob






Re: [Wikidata] Duplicate identifiers (redirects & non-redirects)

2015-10-01 Thread Markus Krötzsch

On 01.10.2015 00:58, Ricordisamoa wrote:

I think Tom is referring to external identifiers such as MusicBrainz
artist ID etc. and whether
Wikidata items should show all of them or 'preferred' ones only as we
did for VIAF redirects.


There are also other cases where external sites have duplicates that are 
not reconciled (yet). For example, Q46843 has multiple GeoNames Ids:


http://sws.geonames.org/7602447
http://sws.geonames.org/2954602

The second was suggested by Freebase, the first is what Wikipedia had. I 
think the first is better (polygon rather than bounding box), so I made 
this preferred. This is a situation where we should keep multiple 
identifiers, since the external database really has two ids that are not 
integrated yet.


Now if the external site reconciles the ids, we have these options:
(1) Keep everything as is (one main id marked as "preferred")
(2) Make the redirect ids deprecated on Wikidata (show people that we 
are aware of the ids but they should not be used)

(3) Delete the redirect ids

I think (2) would be cleanest, since it prevents unaware users from 
re-adding the old ids. (3) would also be ok once the old id is no longer in 
circulation.


Is there any benefit in removing old ids completely? I guess constraint 
reports will work better (but maybe constraint reports should not count 
deprecated statements in single value constraints ...). Other than this, 
I don't see a big reason to spend time on removing some ids. It's not 
wrong to claim that these are ids, just slightly redundant, and the old 
ids might still be useful for integrating with web sources that were not 
updated when the redirect happened.
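
As an aside, such a count could simply skip deprecated statements. A minimal 
sketch with Wikidata Toolkit (just an illustration of the idea, not how the 
actual constraint reports are implemented):

import org.wikidata.wdtk.datamodel.interfaces.Statement;
import org.wikidata.wdtk.datamodel.interfaces.StatementGroup;
import org.wikidata.wdtk.datamodel.interfaces.StatementRank;

public class ValueCounter {
  // Count the statements in one property group, ignoring deprecated ones,
  // so that old, deprecated ids do not trigger single-value warnings.
  public static int countUsableStatements(StatementGroup sg) {
    int count = 0;
    for (Statement s : sg.getStatements()) {
      if (s.getRank() != StatementRank.DEPRECATED) {
        count++;
      }
    }
    return count;
  }
}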


Markus




Il 01/10/2015 00:48, Addshore ha scritto:



On 30 September 2015 at 20:58, Tom Morris wrote:

I think I've seen something somewhere saying that the prevailing
sentiment is that obsolete identifiers which are just redirects to
a new identifier should be removed.


I hope not. See my post at
http://addshore.com/2015/04/redirects-on-wikidata/ Redirects should
remain!

Also see http://addshore.com/2015/09/un-deleting-50-wikidata-items/


There's also the case of sites like MusicBrainz which keep the
non-canonical IDs without redirecting to the canonical ID, but
will tell you which ID is preferred, e.g. Fritz Kreisler


https://musicbrainz.org/artist/590fcad4-2ba4-43bc-a22f-a4bb9b496fe8
https://musicbrainz.org/artist/627ac6c2-ee5c-4120-8af3-ab00345447f5
https://musicbrainz.org/artist/bf6d6ce1-ce88-40e6-9424-11d11d2e54ea

where all the tabs for the second two pages actually point to the
first, canonical entry.

Is there an established policy for either the redirect or
non-redirect case?


See
https://www.wikidata.org/wiki/Wikidata:Deletion_policy#Deletion_of_items_.28Phase_I.29
which says "Items should not be deleted when - The item redirects to
another item"

Also see https://www.wikidata.org/wiki/Help:Merge#Create_redirect
which says redirects should be created when items are merged


I'd argue that even the obsolete identifiers are useful for
inbound resolution and reconciliation. Aggressively pruning them
just makes more work for people, because they must resolve the
identifier that they have in hand to its canonical form (probably
by hitting the issuing authority) before using it for Wikidata
lookups.

What do others think?

Tom





--
Addshore




Re: [Wikidata] Importing Freebase (Was: next Wikidata office hour)

2015-09-28 Thread Markus Krötzsch

Hi Thad,

thanks for your support. I think this can be really useful. Now just to 
clarify: I am not developing or maintaining the Primary Sources tool, I 
just want to see more Freebase data being migrated :-) I think making 
the mapping more complete is clearly necessary and valuable, but maybe 
someone with more insight into the current progress on that level can 
comment in more detail.


Markus


On 28.09.2015 20:44, Thad Guidry wrote:

Markus, Lydia...

It looks like TPT had another page where the WD Properties were being
mapped to Freebase here:
https://www.wikidata.org/wiki/Wikidata:WikiProject_Freebase/Mapping

Do you need help in filling that out more ?

Thad
+ThadGuidry 





Re: [Wikidata] Italian Wikipedia imports gone haywire ?

2015-09-28 Thread Markus Krötzsch

On 28.09.2015 13:31, Luca Martinelli wrote:

2015-09-28 11:16 GMT+02:00 Markus Krötzsch <mar...@semantic-mediawiki.org>:

If this is the case, then maybe it
should just be kept as an intentionally broad property that captures what we
now find in the Wikipedias.


+1, the broader the application of a certain property is, the better.
We really don't need to be 100% specific with a property, if we can
exploit qualifiers.


I would not completely agree to this: otherwise we could just have a 
property "related to" and use qualifiers for the rest ;-) It's always 
about finding the right balance for each case. Many properties (probably 
most) have a predominant natural definition that is quite clear. Take 
"parent" as a simple example of a property that can have a very strict 
definition (biological parent) and still be practically useful and easy 
to understand. The trouble is often with properties that have a 
legal/political meaning since they are different in each legislation 
(which in itself changes over space and time). "Twin city" is such a 
case; "mayor" is another; also classes like "company" are like this. I 
think we do well to stick to the "folk terminology" in such cases, which 
lacks precision but caters to our users.


This can then be refined in the mid and long term (maybe using 
qualifiers, more properties, or new editing conventions). Each domain 
could have a dedicated Wikiproject to work this out (the Wikiproject 
Names is a great example of such an effort [1]).


Markus

[1] https://www.wikidata.org/wiki/Wikidata:WikiProject_Names




Re: [Wikidata] Italian Wikipedia imports gone haywire ?

2015-09-28 Thread Markus Krötzsch

Hi,

Important discussion (but please don't get angry over such things -- 
some emails sounded a bit rough to my taste if I may say so :-).


Property definitions are an important issue, and ours are too vague in 
general. However, some properties need to be quite broad to be useful: 
they need some semantic wiggle room to allow them to be used in slightly 
different situations rather than having hundreds of hardly-used (but 
very precise) properties that are not natural. If such broadness is 
intended for a property, it should of course still be documented.


As it is now, "twin cities" seems to cover all cities that have some form of 
bilateral partnership contract that defines such a status. One could 
use a qualifier to specify which kind of contract it is (if someone can 
find out what the main types of partner cities are!). However, to be 
honest, it is hard to see an application where this information would be 
relevant, other than for trivia (the last time I needed such information 
was in a pub quiz ;-) and for display in Wikipedia pages. If this is the 
case, then maybe it should just be kept as an intentionally broad 
property that captures what we now find in the Wikipedias. The 
ontologists among us could better spend their time on properties like 
"part of".


Cheers,

Markus


On 27.09.2015 23:45, Thad Guidry wrote:


On Sun, Sep 27, 2015 at 4:37 PM, Jan Ainali wrote:

2015-09-27 23:03 GMT+02:00 Thad Guidry:

I have added my viewpoint to the P190 Discussion page.


Great, thanks!


Are you seriously saying that a Talk Page (where many are 2 - 10
pages scrollable) is the culmination of a definition of a
property in Wikidata ?


In theory, only the top box should be needed. In practice, since
everything in the Wikimedia movement is a work in progress, the talk
pages are almost always worth reading, especially if you have doubts
on how to use it.


​Good to know.  I will use it profusely from now on.
​


You expect folks to read 10 pages of Talk to fully understand
the intent and how to use a Property in Wikidata ?


No, _I_ do not expect "folks" to read it. But I do expect people who
want to improve the use of a property to read it before they setup
their own definitions of that property.


​As I shall from now on, knowing its importance, or more to the point,
where I can help with confusion.
​
Thad
+ThadGuidry 




Re: [Wikidata] next Wikidata office hour

2015-09-24 Thread Markus Krötzsch

On 24.09.2015 23:48, James Heald wrote:

Has anybody actually done an assessment on Freebase and its reliability?

Is it *really* too unreliable to import wholesale?


From experience with the Primary Sources tool proposals, the quality is 
mixed. Some things it proposes are really very valuable, but other 
things are also just wrong. I added a few very useful facts and fitting 
references based on the suggestions, but I also rejected others. Not 
sure what the success rate is for the cases I looked at, but my feeling 
is that some kind of "supervised import" approach is really needed when 
considering the total amount of facts.


An issue is that it is often fairly hard to tell if a suggestion is true 
or not (mainly in cases where no references are suggested to check). In 
other cases, I am just not sure if a fact is correct for the property 
used. For example, I recently ended up accepting "architect: Charles 
Husband" for Lovell Telescope (Q555130), but to be honest I am not sure 
that this is correct: he was the leading engineer contracted to design 
the telescope, which seems different from an architect; no official web 
site uses the word "architect" it seems; I could not find a better 
property though, and it seemed "good enough" to accept it (as opposed to 
the post code of the location of this structure, which apparently was 
just wrong).




Are there any stats/progress graphs as to how the actual import is in
fact going?


It would indeed be interesting to see which percentage of proposals are 
being approved (and stay in Wikidata after a while), and whether there 
is a pattern (100% approval on some type of fact that could then be 
merged more quickly; or very low approval on something else that would 
maybe better be revisited for mapping errors or other systematic problems).


Markus




   -- James.


On 24/09/2015 19:35, Lydia Pintscher wrote:

On Thu, Sep 24, 2015 at 8:31 PM, Tom Morris  wrote:

This is to add MusicBrainz to the primary source tool, not anything
else?



It's apparently worse than that (which I hadn't realized until I
re-read the
transcript).  It sounds like it's just going to generate little warning
icons for "bad" facts and not lead to the recording of any new facts
at all.

17:22:33  we'll also work on getting the extension
deployed that
will help with checking against 3rd party databases
17:23:33  the result of constraint checks and checks
against 3rd
party databases will then be used to display little indicators next to a
statement in case it is problematic
17:23:47  i hope this way more people become aware of
issues and
can help fix them
17:24:35  Do you have any names of databases that are
supported? :)
17:24:59  sjoerddebruin: in the first version the german
national library. it can be extended later


I know Freebase is deemed to be nasty and unreliable, but is MusicBrainz
considered trustworthy enough to import directly or will its facts
need to
be dripped through the primary source soda straw one at a time too?


The primary sources tool and the extension that helps us check against
other databases are two independent things.
Imports from Musicbrainz have been happening for a very long time already.


Cheers
Lydia






Re: [Wikidata] Announcing the release of the Wikidata Query Service

2015-09-09 Thread Markus Krötzsch

Good morning :-)

On 09.09.2015 00:45, Stas Malyshev wrote:

Hi!


P.S. I am not convinced yet of this non-standard extension of SPARQL to
fetch labels. Its behaviour based on the variables given in SELECT seems


You don't have to use variables in SELECT, it's just a shortcut.


What I meant with my comment is that the SPARQL semantics does not allow 
for extensions that modify the query semantics based on the selected 
variables. Even if this is optional, it changes some fundamental 
assumptions about how SPARQL works.


...


Nothing prevents us from creating such a UI, but for many purposes having
to do an extra step to see labels does not seem optimal for me, especially
if data is intended for human consumption.


I agree that creating such a UI should not be left to WMF or WMDE 
developers. The SPARQL web API is there for everybody to use. One could 
also start from general SPARQL tools such as YASGUI (about.yasgui.org) 
as a basis for a SPARQL editor.


An extra step should not be needed. Users would just use a query page 
like the one we have now. Only the display of the result table would be 
modified so that there is a language selector above the table: if a 
language is selected, all URIs that refer to Wikidata entities will get 
a suitable label as their anchor text. One could also have the option 
to select "no language" where only the item ids are shown.





results), one could as well have some JavaScript there to beautify the
resulting item URIs based on client-side requests. Maybe some consumers


I'm not sure what you mean by "beautify". If you mean to fetch labels,
querying labels separately would slow things down significantly.


There should be no slowdown in the SPARQL service, since the labels 
would be fetched client-side. There would be a short delay between the 
arrival of the results and the fetching of the labels. We already have 
similar delays when opening a Wikidata page (site names, for example, 
take a moment to fetch). Wikidata Query/Autolist also uses this method 
to fetch labels client-side, and the delay is not too big (it is 
possible to fetch up to 50 labels in one request).
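
For illustration, the whole client-side step is one call to the public 
wbgetentities module; a small sketch in Java (a JavaScript version in the 
result page would make the same request, error handling omitted):

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URL;
import java.net.URLEncoder;

public class LabelFetcher {
  // Fetch the labels of up to 50 entities in one wbgetentities call.
  // The returned JSON maps each entity id to its labels in the given language.
  public static String fetchLabelsJson(String[] ids, String language) throws Exception {
    String idList = URLEncoder.encode(String.join("|", ids), "UTF-8");
    URL url = new URL("https://www.wikidata.org/w/api.php?action=wbgetentities"
        + "&props=labels&format=json&languages=" + language + "&ids=" + idList);
    StringBuilder result = new StringBuilder();
    try (BufferedReader in = new BufferedReader(
        new InputStreamReader(url.openStream(), "UTF-8"))) {
      String line;
      while ((line = in.readLine()) != null) {
        result.append(line);
      }
    }
    return result.toString();
  }
}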


With beautify I mean that one could do further things, such as having a 
tooltip when hovering over entities in results that shows more 
information (or maybe fetches an additional description). That's where 
people can be creative. I think these kinds of features are best placed 
in the hands of community members. We can easily have several SPARQL UIs.





really need to get labels from SPARQL, but at least the users who want
to see results right away would not need this.


Then why wouldn't these users just ignore the label service altogether?


Because all examples are using it, and many users are learning SPARQL 
now from the examples. This is the main reason why I care at all. After 
all, every SPARQL processor has some built-in extensions that deviate 
from the standard.


Markus



Re: [Wikidata] Wikidata Live RDF?

2015-09-09 Thread Markus Krötzsch

On 09.09.2015 08:30, Stas Malyshev wrote:

Hi!


Now that the SPARQL endpoint is "official", will the live RDF data
(which you get, e.g., via Special:EntityData) also be switched to show
the content used in SPARQL? Since this is already implemented, I guess


I think it might be a good idea.
We have a number of "flavors" in the data now, that include different
aspects of RDF. E.g.
https://www.wikidata.org/wiki/Special:EntityData/Q1.ttl?flavor=simple
produces only "simple" statements, flavor=full produces everything we
know, and flavor=dump produces the same things we have in RDF dump. We
can of course create new flavors with different aspects included or
excluded. So what should default RDF content display?

We also have a tracking bug for this:
https://phabricator.wikimedia.org/T101837 so maybe we should discuss it
there.


Ah, thanks for reminding me. I have commented there. I think the default 
should be to simply return all data that is in the dumps.


Markus






Re: [Wikidata] Announcing the release of the Wikidata Query Service

2015-09-08 Thread Markus Krötzsch

On 08.09.2015 23:56, Denny Vrandečić wrote:

Anyone an idea why this query has trouble when I add the OPTIONAL keyword?

*http://tinyurl.com/pgsujp2*

Doesn't look much harder than the queries in the examples.


Looking at the bottom of the exceptions, you can see that the system 
simply refuses to run the query. The problem seems to be with the 
labelling service. If you don't select the label for ?head, it works fine:


http://tinyurl.com/p6rfpgv

It seems the implementation of the label service needs some improvement 
to support unbound variables (which should then return unbound labels, 
rather than throw a runtime exception ;-).


Markus



On Tue, Sep 8, 2015 at 1:39 PM Andra Waagmeester wrote:

Hi Denny,

The following R script
  (https://gist.github.com/andrawaag/2b8c831ab4dd70b16cf2) plots
wikidata content on a worldmap in R.


Andra


On Tue, Sep 8, 2015 at 10:19 PM, Denny Vrandečić wrote:

Is there a write up of how you are using Wikidata from R? That
sounds quite cool.


On Tue, Sep 8, 2015, 13:03 Andra Waagmeester wrote:

I think a blinking banner would work on the main side.
However, you would miss people like me, who use the service
mainly from within R or my text editor (Text mate with
turtle bundle). I wouldn't see a blinking banner.


On Tue, Sep 8, 2015 at 9:55 PM, Lydia Pintscher wrote:

On Tue, Sep 8, 2015 at 9:52 PM, Stas Malyshev wrote:
> Hi!
>
>> Yes it is the continuation of the beta on labs.
>> Stas: Do you want to turn that into a redirect now?
>
> Not sure yet what to do with it. I want to keep the labs 
setup for
> continued development work, especially when potentially 
breaking things,
> but we do want to redirect most of the people now to main 
endpoint as it
> is much better at handling the load.

Fair enough. A big blinking banner at the top of the
page might do then?


Cheers
Lydia

--
Lydia Pintscher - http://about.me/lydia.pintscher
Product Manager for Wikidata

Wikimedia Deutschland e.V.
Tempelhofer Ufer 23-24
10963 Berlin
www.wikimedia.de 

Wikimedia Deutschland - Gesellschaft zur Förderung
Freien Wissens e. V.

Eingetragen im Vereinsregister des Amtsgerichts
Berlin-Charlottenburg
unter der Nummer 23855 Nz. Als gemeinnützig anerkannt
durch das
Finanzamt für Körperschaften I Berlin, Steuernummer
27/681/51985 .



Re: [Wikidata] Announcing the release of the Wikidata Query Service

2015-09-08 Thread Markus Krötzsch

On 09.09.2015 00:07, Markus Krötzsch wrote:

On 08.09.2015 23:56, Denny Vrandečić wrote:

Anyone an idea why this query has trouble when I add the OPTIONAL
keyword?

*http://tinyurl.com/pgsujp2*

Doesn't look much harder than the queries in the examples.


Looking at the bottom of the exceptions, you can see that the system
simply refuses to run the query. The problem seems to be with the
labelling service. If you don't select the label for ?head, it works fine:

http://tinyurl.com/p6rfpgv

It seems the implementation of the label service needs some improvement
to support unbound variables (which should then return unbound labels,
rather than throw a runtime exception ;-).


P.S. I am not convinced yet of this non-standard extension of SPARQL to 
fetch labels. Its behaviour based on the variables given in SELECT seems 
to contradict SPARQL (where SELECT is applied to the results of a query 
*after* they are computed, without having any direct influence on the 
meaning of the WHERE part).


A simple UI that would query the Web API on demand might be better, and 
not put any load on the query service. Considering that the main query 
service is not the W3C conforming SPARQL endpoint (which gives you raw 
results), one could as well have some JavaScript there to beautify the 
resulting item URIs based on client-side requests. Maybe some consumers 
really need to get labels from SPARQL, but at least the users who want 
to see results right away would not need this.


Markus



Re: [Wikidata] Announcing the release of the Wikidata Query Service

2015-09-08 Thread Markus Krötzsch

On 09.09.2015 00:13, Stas Malyshev wrote:

Hi!


Anyone an idea why this query has trouble when I add the OPTIONAL keyword?

*http://tinyurl.com/pgsujp2*

Doesn't look much harder than the queries in the examples.


It's not because it's harder. It's because ?head can be unbound, and you
can not apply label service to unbound variables. If you drop ?headLabel
then it works. It is a downside of the label service, not sure yet how
to fix it (feel free to submit the Phabricator issue, maybe myself or
somebody else has an idea later).


Why can't the label service just return unbound labels for unbound inputs?

Markus





[Wikidata] Wikidata Live RDF?

2015-09-08 Thread Markus Krötzsch

Hi,

Now that the SPARQL endpoint is "official", will the live RDF data 
(which you get, e.g., via Special:EntityData) also be switched to show 
the content used in SPARQL? Since this is already implemented, I guess 
this would be a minor switch. It could be very useful for users who want 
to learn about how to use SPARQL, since it can help a lot to see the 
data for one item to know what to query for.


Cheers,

Markus





Re: [Wikidata] Source statistics

2015-09-07 Thread Markus Krötzsch
P.S. If you want to do this yourself to play with it, below is the 
relevant information on how I wrote this code (looks a bit clumsy in 
email, but I don't have time now to set up a tutorial page ;-).


Markus


(1) I modified the example program "EntityStatisticsProcessor" that is 
part of Wikidata Toolkit [1].

(2) I added a new field to count references:

final HashMap<Reference, Integer> refStatistics = new HashMap<>();

(3) The example program already downloads and processes all items and 
properties in the most recent dump. You just have to add the counting. 
Essentially, this is the code I run on every ItemDocument and 
PropertyDocument:


public void countReferences(StatementDocument statementDocument) {
  for (StatementGroup sg : statementDocument.getStatementGroups()) {
for (Statement s : sg.getStatements()) {
  for (Reference r : s.getReferences()) {
if (!refStatistics.containsKey(r)) {
  refStatistics.put(r, 1);
} else {
  refStatistics.put(r, refStatistics.get(r) + 1);
}
  }
}
  }
}

(the example already has a method "countStatements" that does these 
iterations, so you can also insert the code there).



(4) To print the output to a file, I sort the hash map by values first. 
Here's some standard code for how to do this:


try (PrintStream out = new PrintStream(
    ExampleHelpers.openExampleFileOuputStream("reference-counts.txt"))) {
  List<Entry<Reference, Integer>> list = new LinkedList<Entry<Reference, Integer>>(
      refStatistics.entrySet());

  Collections.sort(list, new Comparator<Entry<Reference, Integer>>() {
    @Override
    public int compare(Entry<Reference, Integer> o1, Entry<Reference, Integer> o2) {
      return o2.getValue().compareTo(o1.getValue());
    }
  });

  int singleRefs = 0;
  for (Entry<Reference, Integer> entry : list) {
    if (entry.getValue() > 1) {
      out.println(entry.getValue() + " x " + entry.getKey());
    } else {
      singleRefs++;
    }
  }
  out.println("... and another " + singleRefs
      + " references that occurred just once.");
} catch (IOException e) {
  e.printStackTrace();
}

This code I put into the existing method writeFinalResults() that is 
called at the end.


As I said, this runs in about 30min on my laptop, but downloading the 
dump file for the first time takes a bit longer.



[1] 
https://github.com/Wikidata/Wikidata-Toolkit/blob/v0.5.0/wdtk-examples/src/main/java/org/wikidata/wdtk/examples/EntityStatisticsProcessor.java


On 07.09.2015 15:49, Markus Krötzsch wrote:

Hi André,

I just made a small counting program with Wikidata Toolkit to count
unique references. Running it on the most recent dump took about 30min.
I uploaded the results:

http://tools.wmflabs.org/wikidata-exports/statistics/20150831/reference-counts-50.txt


The file lists all references that are used at least 50 times, ordered
by number of use. There were 593778 unique references for 35485364
referenced statements (out of 69942556 statements in total).

416480 of the references are used only once. If you want to see all
references used at least twice, this is a slightly longer file:

http://tools.wmflabs.org/wikidata-exports/statistics/20150831/reference-counts.txt.gz


Best regards,

Markus


On 07.09.2015 13:25, André Costa wrote:

Hi all!

I'm wondering if there is a way (SQL, api, tool or otherwise) for
finding out how often a particular source is used on Wikidata.

The background is a collaboration with two GLAMs where we have used their
open (and CC0) datasets to add and/or source statements on Wikidata for
items on which they can be considered an authority. Now I figured it
would be nice to give them back a number for just how big the impact was.

While I can find out how many items should be affected I couldn't find
an easy way, short of analysing each of these, for how many statements
were affected.

Any suggestions would be welcome.

Some details: Each reference is a P248 claim + P577 claim (where the
latter may change)

Cheers,
André / Lokal_Profil
André Costa | GLAM-tekniker, Wikimedia Sverige | andre.co...@wikimedia.se | +46 (0)733-964574

Stöd fri kunskap, bli medlem i Wikimedia Sverige.
Läs mer på blimedlem.wikimedia.se <http://blimedlem.wikimedia.se/>





Re: [Wikidata] Source statistics

2015-09-07 Thread Markus Krötzsch

On 07.09.2015 14:25, Edgard Marx wrote:

Is not an updated version, but

dbtrends.aksw.org 


I am getting an error there. Is the server down maybe?

Markus



best,
Edgard

On Mon, Sep 7, 2015 at 1:25 PM, André Costa wrote:

Hi all!

I'm wondering if there is a way (SQL, api, tool or otherwise) for
finding out how often a particular source is used on Wikidata.

The background is a collaboration with two GLAMs where we have used
their open (and CC0) datasets to add and/or source statements on
Wikidata for items on which they can be considered an authority. Now
I figured it would be nice to give them back a number for just how
big the impact was.

While I can find out how many items should be affected I couldn't
find an easy way, short of analysing each of these, for how many
statements were affected.

Any suggestions would be welcome.

Some details: Each reference is a P248 claim + P577 claim (where the
latter may change)

Cheers,
André / Lokal_Profil
André Costa | GLAM-tekniker, Wikimedia Sverige
|andre.co...@wikimedia.se 
|+46 (0)733-964574

Stöd fri kunskap, bli medlem i Wikimedia Sverige.
Läs mer på blimedlem.wikimedia.se 




Re: [Wikidata] Wikidata Toolkit 0.5.0 released

2015-09-02 Thread Markus Krötzsch

On 02.09.2015 08:11, Gerard Meijssen wrote:

Hoi,
It reads like a lot of work went in there.. Congratulations.

I understand that it uses the live data and that it can be using dumps
of other installations :) Now how do I use it from Labs ?


Labs has Java installed. You can just copy any Java application to your 
home directory there and run it. We are running a WDTK-based application 
on labs to generate the RDF exports, for example.


Markus




Re: [Wikidata] First version for units is ready for testing!

2015-09-01 Thread Markus Krötzsch

On 01.09.2015 05:11, Stas Malyshev wrote:
...


That's a very interesting point, I think this can be handled by
establishing:
1. Types/classes of units, such as "measure of length"
2. Designated standard unit - e.g. "meter" is a "standard measure of
length".
3. Conversion properties - e.g. "foot" is 0.305 "meters"
4. Having RDF exports contain values converted to standardized measures
- i.e. every quantity that has unit that is "measure of length" will
also have a value expressed in "meters".

Then reconciling data between instances would be just reconciling
properties for the above (which can be made easier by finding some
existing ontology featuring units and relating to it), and then matching
entities for standard measures.


Mapping a unit entity to a (unique) external IRI could be done by a 
suitable property that is then used during RDF export, as we already do 
with other special properties. It still leaves each wiki with the (huge) 
effort of defining all units, their labels, and conversions manually.
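
Just to make step (4) above concrete: the normalization itself is trivial 
once a conversion statement is there; a sketch (the factor 0.305 is the 
foot-to-metre example from the list above):

public class UnitNormalizer {
  // Convert a quantity (and its bounds) to the designated standard unit of
  // its dimension, using a factor such as "1 foot = 0.305 metre" taken from
  // a conversion statement on the unit item.
  public static double[] toStandardUnit(double amount, double lowerBound,
      double upperBound, double factorToStandard) {
    return new double[] { amount * factorToStandard,
        lowerBound * factorToStandard, upperBound * factorToStandard };
  }
}

The interesting part is of course not this multiplication, but agreeing on 
where the factors and the designated standard units are defined.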


It would be great if it were possible for a Wikibase site to (also) 
use the Wikidata unit entities as if they were local entities. One could 
probably just fetch their labels and conversions through the API, which 
I suppose the UI is already doing now for units that are "local". This 
feature would be a bit like MediaWiki's InstantCommons for Wikidata 
units. Ideally, this would lead to a situation where most units in most 
Wikibase sites are taken from Wikidata, with only a few special things 
defined locally as needed.


Markus



Re: [Wikidata] First version for units is ready for testing!

2015-09-01 Thread Markus Krötzsch

On 01.09.2015 09:56, Stas Malyshev wrote:

Hi!


Also, I don't see a reason why the JSON encoding should use an IRI


It probably doesn't have to, just Q-id would be enough. "1" is OK too,
but a bit confusing - if the rest would be Q-ids, then it makes sense to
make all of the Q-ids. Other option would be to make it just null or
something special like that.


My reasoning is that people who are using JSON now are already coping 
with "1" in some way. So any negative effects that it might have are 
already there ;-).


I am not concerned with uniformity, since it is not clear anyway which 
strings are considered valid "unit IRIs" by Wikibase now. On the JSON 
level, this is a"string"; there is no "IRI" datatype there. It's 
different in RDF, where we must use a valid IRI as a value because RDF 
knows this "datatype".





string does not seem to help anyone. I would suggest keeping the "1" as
a marker for "no unit". Of course, this "1" would never be shown in the


It is possible, but "1" the looks like "magic value", which is usually
bad design since one needs to check for it all the time. It would be
nicer if there could be a way to avoid it.


Indeed, but any value will be "magic" in this sense. Using the IRI of 
Q199 will just be a superficial cure to this -- code would still have to 
check for equality with this IRI explicitly in many places, making it a 
"special value" again. The cleanest solution would be to omit the "unit" 
key if there is no unit, but this would break some existing 
implementations that do not check if "unit" is present or not, and which 
may therefore get errors when trying to access it blindly. Maybe using 
"null" would be a compromise.





Wikibase and elsewhere. If we create a special IRI for denoting this
situation, it will be better distinguished from other (regular) units,
and there will be no dependency on the current content of Wikidata's Q199.


We already have such dependencies - e.g. in calendars and globes - so it
won't be anything new. But let's see what the Wikidata team thinks about
it :)



I agree for calendars. For globes, I don't think that Wikibase treats 
them as "special values". They are just "some IRI" with no code 
depending on what exactly they are. It will be different for our "1", 
which will be treated differently on the software level. For calendars, 
there was still some advantage in using items (we needed the labels), but I 
don't see any advantage for the case of "1". I think your concern about 
"special values" is exactly what I had in mind, but I had not put it as 
clearly yet :-).


Markus






Re: [Wikidata-tech] how is the datetime value with precision of one year stored

2015-09-01 Thread Markus Krötzsch

On 01.09.2015 05:17, Stas Malyshev wrote:

Hi!


I would have thought that the correct approach would be to encode these
values as gYear, and just record the four-digit year.


While we do have a ticket for that
(https://phabricator.wikimedia.org/T92009) it's not that simple since
many triple stores consider dateTime and gYear to be completely
different types and as such some queries between them would not work.



I agree. Our original RDF exports in Wikidata Toolkit are still using 
gYear, but I am not sure that this is a practical approach. In 
particular, this does not solve the encoding of time precisions in RDF. 
It only introduces some special cases for year (and also for month and 
day), but it cannot be used to encode decades, centuries, etc.


My current view is that it would be better to encode the actual time 
point with maximal precision, and to keep the Wikidata precision 
information independently. This applies to the full encoding of time 
values (where you have a way to give the precision as a separate value).


For the simple encoding, where the task is to encode a Wikidata time in 
a single RDF literal, things like gYear would make sense. At least full 
precision times (with time of day!) would be rather misleading there.


In any case, when using full precision times for cases with limited 
precision, it would be good to create a time point for RDF based on a 
uniform rule. Easiest option that requires no calendar support: use the 
earliest second that is within the given interval. So "20th century" 
would always lead to the time point "1900-01-01T00:00:00". If this is 
not done, it will be very hard to query for all uses of "20th century" 
in the data.
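
A sketch of this rule (precision codes as in the Wikidata data model: 7 = 
century, 8 = decade, 9 = year, 10 = month, 11 = day; negative years and 
calendar details are ignored here, so this is only an illustration):

public class TimeNormalizer {
  // Map a limited-precision Wikidata time to the earliest second of its interval.
  public static String earliestTimePoint(long year, int month, int day, int precision) {
    if (precision == 7) {          // century
      year = year - (year % 100);
      month = 1;
      day = 1;
    } else if (precision == 8) {   // decade
      year = year - (year % 10);
      month = 1;
      day = 1;
    } else if (precision == 9) {   // year
      month = 1;
      day = 1;
    } else if (precision == 10) {  // month
      day = 1;
    }
    return String.format("%04d-%02d-%02dT00:00:00", year, month, day);
  }
}

With this, any "20th century" value comes out as 1900-01-01T00:00:00, no 
matter which year within 1900-1999 was stored.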


Markus



Re: [Wikidata-tech] API JSON format for warnings

2015-09-01 Thread Markus Krötzsch

On 01.09.2015 16:57, Thiemo Mättig wrote:

Hi,

 > I now identified another format for API warnings [...] from action
"watch"

I'm not absolutely sure, but I think this is not really an other format.
The "warnings" field contains a list of modules. For each module you can
either have a list of "messages", or a plain string. In the latter case
the string is stored with the "*" key you see in both examples.

The relevant code that creates these "warnings" structures for Wikibase
can be found in the ApiErrorReporter class.

The { "name": ..., "parameters": ..., "html": { "*": ... } } thing you
see is a rendering of a Message object. The "html" key can be seen in
\ApiErrorFormatter::addWarningOrError, the "*" is a result from the
Message class.

Hope that helps.


Yes, this is very helpful. Thanks. I had looked at this PHP code, but I 
could not see these things there (strings like "*" are not very 
distinctive, so I was not sure which "*" I am looking at ;-).


Markus




Re: [Wikidata-tech] API JSON format for warnings

2015-09-01 Thread Markus Krötzsch
I now identified another format for API warnings. For example, I got the 
following from action "watch":


"warnings": {
"watch": {
"*": "The title parameter has been deprecated."
}
}

For comparison, here is again what I got from wbeditentity:

"warnings" :
  {"wbeditentity":
{"messages":
   [{"name":"wikibase-self-conflict-patched",
 "parameters":[],
 "html": { "*":"Your edit was patched into the latest version, 
overriding some of your own intermediate changes."}

   }]
}
  }

For action "paraminfo", I managed to trigger multiple warnings. Guess 
how three warnings are reported there!


"warnings": {
"paraminfo": {
"*": "The mainmodule parameter has been deprecated.\nThe 
pagesetmodule parameter has been deprecated.\nThe module \"main\" does 
not have a submodule \"foo\""

}
}

I will now implement these two forms. The "html":{"*":"..."} form seems 
a bit risky to implement (will it always be "html"? will it always be 
"*"?), but I could not get any other warning in such a form, so this is 
the one I will support.
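
For the record, this is roughly what I plan to do for the two shapes (a 
sketch using Jackson, operating on the per-module object inside "warnings"; 
the exact keys are the assumptions discussed above):

import java.util.ArrayList;
import java.util.List;

import com.fasterxml.jackson.databind.JsonNode;

public class WarningParser {
  // Handles both { "*": "line\nline" } and
  // { "messages": [ { "name": ..., "html": { "*": ... } }, ... ] }.
  public static List<String> extractWarnings(JsonNode moduleWarnings) {
    List<String> result = new ArrayList<>();
    JsonNode messages = moduleWarnings.path("messages");
    if (messages.isArray()) {
      for (JsonNode message : messages) {
        // fall back to the message name if no rendered text is given
        result.add(message.path("html").path("*")
            .asText(message.path("name").asText()));
      }
    } else if (moduleWarnings.has("*")) {
      // several warnings can be packed into one newline-separated string
      for (String line : moduleWarnings.get("*").asText().split("\n")) {
        result.add(line);
      }
    }
    return result;
  }
}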


I wasted some time trying to trace this in the PHP code, but did 
not get to the point where the "messages" or even "html" key is 
inserted. I got as far as ApiResult.php, where messages end up being 
added in addValue(). It seems that this is the same for all modules, 
more or less. I lost the trace after this. I have no idea what happens 
with the thus "added" messages or how they might surface again elsewhere 
in this code. There are various JsonFormatter classes but they are very 
general and do not mention "messages". Neither do the actual ApiMessage 
objects.


Markus



On 30.08.2015 14:22, Markus Krötzsch wrote:



A partial answer would also be helpful (maybe some of my questions are
more tricky than others).

Thanks,

Markus


On 28.08.2015 10:41, Markus Krötzsch wrote:

Hi,

I am wondering how errors and warnings are reported through the API, and
which errors and warnings are possible. There is some documentation on
Wikidata errors [1], but I could not find documentation on how the
warning messages are communicated in JSON. I have seen structures like
this:

{ "warnings" :
   {"wbeditentity":
 {"messages":
[{"name":"wikibase-self-conflict-patched",
  "parameters":[],
  "html": { "*":"Your edit was patched into the latest version,
overriding some of your own intermediate changes."}
}]
 }
   }
}

I don't know how to provoke more warnings, or multiple warnings in one
request, so I found it hard to guess how this pattern generalises. Some
questions:

* What is the purpose of the map with the "*" key? Which other keys but
"*" could this map have?
* The key "wbeditentity" points to a list. Is this supposed to encode
multiple warnings of this type?
* I guess the "name" is a message name, and "parameters" are message
"arguments" (as they are called in action="query") for the message?
* Is this the JSON pattern used in all warnings or can there also be
other responses from wbeditentity?
* Is this the JSON pattern used for warnings in all Wikibase actions or
can there also be other responses from other actions?
* Is there a list of relevant warning codes anywhere?
* Is there a list of relevant error codes anywhere? The docs in [1]
point to paraminfo (e.g.,
http://www.wikidata.org/w/api.php?action=paraminfo&modules=wbeditentity)
but there are no errors mentioned there.

Thanks,

Markus

[1] https://www.mediawiki.org/wiki/Wikibase/API#Errors



Re: [Wikidata] First version for units is ready for testing!

2015-08-31 Thread Markus Krötzsch

Hi,

Great news, but I have some questions ;-) It seems that the unit can be 
any IRI that starts with http or https, even if not referring to an 
entity of the wiki.


Then what happens with the "unit":"1" that we currently have in JSON? It 
seems that one cannot enter this string in the field, and I guess it 
would (rightly) be invalid as a unit. Will Wikibase continue to use this 
(invalid) string as a placeholder for "no unit"?


If every wiki needs to define its own unit items to see labels, how is 
data exchange supposed to work? Will the RDF export then contain a 
different IRI for "meter" when exporting data from Wikidata and from 
Commons (or whatever other future instance)?


Cheers,

Markus


On 31.08.2015 17:53, Lydia Pintscher wrote:

Hi everyone :)

We've finally done all the groundwork for unit support. I'd love for
you to give the first version a try on the test system here:
http://wikidata.beta.wmflabs.org/wiki/Q23950

There are a few known issues still but since this is one of the things
holding back Wikidata I made the call to release now and work on these
remaining things after that. What I know is still missing:
* We're showing the label of the item of the unit. We should be
showing the symbol of the unit in the future.
(https://phabricator.wikimedia.org/T77983)
* We can't convert between units yet - we only have the groundwork for
it so far. (https://phabricator.wikimedia.org/T77978)
* The items representing often-used units should be ranked higher in
the selector. (https://phabricator.wikimedia.org/T110673)
* When editing an existing value you see the URL of unit's item. This
should be replaced by the label.
(https://phabricator.wikimedia.org/T110675)
* When viewing a diff of a unit change you see the URL of the unit's
item. This should be replaced by the label.
(https://phabricator.wikimedia.org/T108808)
* We need to think some more about the automatic edit summaries for
unit-related changes. (https://phabricator.wikimedia.org/T108807)

If you find any bugs or if you are missing other absolutely critical
things please let me know here or file a ticket on
phabricator.wikimedia.org. If everything goes well we can get this on
Wikidata next Wednesday.


Cheers
Lydia




___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata


Re: [Wikidata-tech] API JSON format for warnings

2015-08-30 Thread Markus Krötzsch

Push

A partial answer would also be helpful (maybe some of my questions are 
more tricky than others).


Thanks,

Markus


On 28.08.2015 10:41, Markus Krötzsch wrote:

Hi,

I am wondering how errors and warnings are reported through the API, and
which errors and warnings are possible. There is some documentation on
Wikidata errors [1], but I could not find documentation on how the
warning messages are communicated in JSON. I have seen structures like
this:

{ "warnings" :
   {"wbeditentity":
     {"messages":
        [{"name":"wikibase-self-conflict-patched",
          "parameters":[],
          "html": { "*":"Your edit was patched into the latest version,
overriding some of your own intermediate changes."}
        }]
     }
   }
}

I don't know how to provoke more warnings, or multiple warnings in one
request, so I found it hard to guess how this pattern generalises. Some
questions:

* What is the purpose of the map with the "*" key? Which other keys but
"*" could this map have?
* The key "wbeditentity" points to a list. Is this supposed to encode
multiple warnings of this type?
* I guess the "name" is a message name, and "parameters" are message
"arguments" (as they are called in action="query") for the message?
* Is this the JSON pattern used in all warnings or can there also be
other responses from wbeditentity?
* Is this the JSON pattern used for warnings in all Wikibase actions or
can there also be other responses from other actions?
* Is there a list of relevant warning codes anywhere?
* Is there a list of relevant error codes anywhere? The docs in [1]
point to paraminfo (e.g.,
http://www.wikidata.org/w/api.php?action=paraminfo&modules=wbeditentity)
but there are no errors mentioned there.

Thanks,

Markus

[1] https://www.mediawiki.org/wiki/Wikibase/API#Errors



[Wikidata-tech] API JSON format for warnings

2015-08-28 Thread Markus Krötzsch

Hi,

I am wondering how errors and warnings are reported through the API, and 
which errors and warnings are possible. There is some documentation on 
Wikidata errors [1], but I could not find documentation on how the 
warning messages are communicated in JSON. I have seen structures like this:


{ "warnings" :
  {"wbeditentity":
    {"messages":
       [{"name":"wikibase-self-conflict-patched",
         "parameters":[],
         "html": { "*":"Your edit was patched into the latest version,
overriding some of your own intermediate changes."}

       }]
    }
  }
}

I don't know how to provoke more warnings, or multiple warnings in one 
request, so I found it hard to guess how this pattern generalises. Some 
questions:


* What is the purpose of the map with the "*" key? Which other keys but 
"*" could this map have?
* The key "wbeditentity" points to a list. Is this supposed to encode 
multiple warnings of this type?
* I guess the "name" is a message name, and "parameters" are message 
"arguments" (as they are called in action="query") for the message?
* Is this the JSON pattern used in all warnings or can there also be 
other responses from wbeditentity?
* Is this the JSON pattern used for warnings in all Wikibase actions or 
can there also be other responses from other actions?

* Is there a list of relevant warning codes anywhere?
* Is there a list of relevant error codes anywhere? The docs in [1] 
point to paraminfo (e.g., 
http://www.wikidata.org/w/api.php?action=paraminfo&modules=wbeditentity) 
but there are no errors mentioned there.


Thanks,

Markus

[1] https://www.mediawiki.org/wiki/Wikibase/API#Errors



[Wikidata-tech] Deleting labels/descriptions/aliases via API

2015-08-26 Thread Markus Krötzsch

Hi,

How do you delete, say, the English label of an entity via wbeditentity? 
I could not find documentation on this. Whatever the answer, I guess it 
is the same for descriptions, right?


How about aliases? I know that writing one English alias will delete all 
existing aliases, but how can you write no English aliases?


In either case, I do not want to use the "clear" flag, of course :-)

Thanks,

Markus



[Wikidata-tech] Question on bot flag in wbeditentity

2015-08-26 Thread Markus Krötzsch

Hi,

I wondered why wbeditentity has a parameter "bot". The documentation 
says that this will mark the edit as a bot edit, but only if the user is 
in the bot group. In other words, users in the bot group can use this 
parameter to decide if they want to have their API-based edit flagged as 
bot or not. Is there any reason why a user in bot group would *not* want 
their API-based edit flagged as bot?


Cheers,

Markus



Re: [Wikidata-tech] Question on bot flag in wbeditentity

2015-08-26 Thread Markus Krötzsch

On 26.08.2015 12:27, Addshore wrote:

The reason our bot flag works like this is to remain consistent with all
other mediawiki API modules.

See https://www.mediawiki.org/wiki/Manual:Bots#The_.22bot.22_flag
Mainly, "Not all users with this right are bots."


Ok, but when a user edits a wiki through the browser, this is not 
related to the API anyway, is it? I wonder in which cases a user with a 
bot flag makes an edit through the API that should not be flagged as bot.


Background: I am currently implementing write access to Wikidata. I am 
inclined to set the bot flag on all requests to simplify the interface.


Cheers,

Markus



Cheers

Addshore

On 26 August 2015 at 10:51, Markus Krötzsch
mar...@semantic-mediawiki.org wrote:

Hi,

I wondered why wbeditentity has a parameter bot. The documentation
says that this will mark the edit as a bot edit, but only if the
user is in the bot group. In other words, users in the bot group can
use this parameter to decide if they want to have their API-based
edit flagged as bot or not. Is there any reason why a user in bot
group would *not* want their API-based edit flagged as bot?

Cheers,

Markus





--
Addshore




Re: [Wikidata] Wikidata needs your votes

2015-08-22 Thread Markus Krötzsch

On 21.08.2015 16:55, Jane Darnell wrote:

it would help if you included the link in these reminders!


Here you go:

https://www.land-der-ideen.de/ausgezeichnete-orte/preistraeger/wikida

- yellow vote button upper right, enter email (the terms and conditions 
in German ensure that it will only be used for this voting and not 
shared), check box, check emails, click link.


Markus



2015-08-21 16:07 GMT+02:00 Lydia Pintscher lydia.pintsc...@wikimedia.de:

On Tue, Aug 18, 2015 at 2:37 PM, Alessandro Marzocchi
alemar...@gmail.com wrote:
  Position 7.
  ... good, I hope.

We have today, Saturday and Sunday left to vote.


Cheers
Lydia

--
Lydia Pintscher - http://about.me/lydia.pintscher
Product Manager for Wikidata

Wikimedia Deutschland e.V.
Tempelhofer Ufer 23-24
10963 Berlin
www.wikimedia.de

Wikimedia Deutschland - Gesellschaft zur Förderung Freien Wissens e. V.

Eingetragen im Vereinsregister des Amtsgerichts Berlin-Charlottenburg
unter der Nummer 23855 Nz. Als gemeinnützig anerkannt durch das
Finanzamt für Körperschaften I Berlin, Steuernummer 27/681/51985.



Re: [Wikidata] Wikidata needs your votes

2015-08-22 Thread Markus Krötzsch

Of course I meant

https://www.land-der-ideen.de/ausgezeichnete-orte/preistraeger/wikidata

Markus

On 22.08.2015 14:03, Markus Krötzsch wrote:

On 21.08.2015 16:55, Jane Darnell wrote:

it would help if you included the link in these reminders!


Here you go:

https://www.land-der-ideen.de/ausgezeichnete-orte/preistraeger/wikidata

- yellow vote button upper right, enter email (the terms and conditions
in German ensure that it will only be used for this voting and not
shared), check box, check emails, click link.

Markus



2015-08-21 16:07 GMT+02:00 Lydia Pintscher lydia.pintsc...@wikimedia.de:

On Tue, Aug 18, 2015 at 2:37 PM, Alessandro Marzocchi
alemar...@gmail.com wrote:
  Position 7.
  ... good, I hope.

We have today, Saturday and Sunday left to vote.


Cheers
Lydia

--
Lydia Pintscher - http://about.me/lydia.pintscher
Product Manager for Wikidata

Wikimedia Deutschland e.V.
Tempelhofer Ufer 23-24
10963 Berlin
www.wikimedia.de

Wikimedia Deutschland - Gesellschaft zur Förderung Freien Wissens
e. V.

Eingetragen im Vereinsregister des Amtsgerichts Berlin-Charlottenburg
unter der Nummer 23855 Nz. Als gemeinnützig anerkannt durch das
Finanzamt für Körperschaften I Berlin, Steuernummer 27/681/51985.



Re: [Wikidata] Naming Conventions for URIs

2015-08-21 Thread Markus Krötzsch
For the benefit of those of you reading only the Wikidata list: this 
question has been answered on the semantic web list now. The conclusion 
is that the problem does not affect SPARQL and RDF 1.1 Turtle syntax: 
you can start local names with numbers in both cases.


XML, RDF and SPARQL all have abbreviations of the form prefix:local, 
but they are defined differently and subject to different restrictions. 
The feature of XML is called QName (qualified name) and should not be 
confused with the features of RDF and SPARQL, called prefixed names.


The limitation Paul mentioned is a historic shortcoming of XML. This is 
also the reason why XML syntax is not well-suited for encoding RDF: in 
fact, there are many valid RDF graphs that cannot be encoded in RDF/XML 
syntax at all.


Markus


On 20.08.2015 17:36, Paul Houle wrote:

Tell me if I am right or wrong about this.

If I am coining a URI for something that has an identifier in an outside
system is is straightforward to append the identifier (possibly modified
a little) to a prefix,  such as

http://dbpedia.org/resource/Stellarator

Then you can write

@prefix dbpedia: <http://dbpedia.org/resource/>

and then refer to the concept (in either Turtle or SPARQL) as
dbpedia:Stellarator.

I will take one step further than this and say that for pedagogical and
other coding situations,  the extra length of prefix declarations is an
additional cognitive load on top of all the other cognitive loads of
dealing with the system,  so in the name of concision you can do
something like

@base <http://dbpedia.org/resource/>
@prefix : <http://dbpedia.org/ontology/>

and then you can write :someProperty and <Stellarator>, and your
queries are looking very simple.

The production for a QName  cannot begin with a number so it is not
correct to write something like

dbpedia:100

or expect to have the full URI squashed to that.  This kind of gotcha
will drive newbies nuts,  and the realization of RDF as a universal
solvent requires squashing many of them.

Another example is

isbn:9971-5-0210-0

If you look at the @base declaration above,  you see a way to get around
this,  because with the base above you can write

<100> which works just fine in the dbpedia case.

I like what Wikidata did with using fairly dense sequential integers for
the ids, so a Wikidata entity URI looks like

http://www.wikidata.org/entity/Q4876286

whose identifier is always a valid QName local part, so you can write

@base <http://www.wikidata.org/entity/> .
@prefix wd: <http://www.wikidata.org/entity/> .

and then you can write

wd:Q4876286
<Q4876286>

and it is all fine, because (i) Wikidata added the alpha prefix, (ii)
started at the beginning with it, and (iii) made up a plausible
explanation for why it is that way. Freebase mids have the same property,
so :BaseKB has it too.

I think customers would expect to be able to give us

isbn:0884049582

and have it just work, but because a local name starting with a digit is
never a valid QName, you can encode the URI like this:

http://isbn.example.com/I0884049582

and then write

isbn:I0884049582
<I0884049582>

which is not too bad.  Note,  however,  if you want to write

<0884049582>, you have to write out the full URI

http://isbn.example.com/I0884049582

because,  at least with the Jena framework,  the same thing happens if
you write

@base <http://isbn.example.com/I> .

or

@base <http://isbn.example.com/> .

so you can't choose a representation which supports both that mode of
expression and the prefix (isbn:) mode.
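
(The underlying reason: relative IRI resolution per RFC 3986 replaces the
last path segment of the base, so the trailing "I" is dropped either way. A
tiny Turtle sketch, with ex:title as a made-up property:)

@base <http://isbn.example.com/I> .
@prefix ex: <http://example.org/> .

# <0884049582> resolves to <http://isbn.example.com/0884049582>, not to
# <http://isbn.example.com/I0884049582>: the base's final segment "I"
# is replaced during resolution.
<0884049582> ex:title "An example book" .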

Now what bugs me is what to do in the case of something which might
or might not be numeric. What internal prefix would be most acceptable
to end users?


--
Paul Houle

*Applying Schemas for Natural Language Processing, Distributed Systems,
Classification and Text Mining and Data Lakes*

(607) 539 6254, paul.houle on Skype, ontolo...@gmail.com

:BaseKB -- Query Freebase Data With SPARQL
http://basekb.com/gold/

Legal Entity Identifier Lookup
https://legalentityidentifier.info/lei/lookup/

Join our Data Lakes group on LinkedIn
https://www.linkedin.com/grp/home?gid=8267275







___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata


Re: [Wikidata] Properties for family relationships in Wikidata

2015-08-20 Thread Markus Krötzsch

On 20.08.2015 14:51, Andrew Gray wrote:

As someone with an extensive collection of Hindi-speaking relatives, I
agree entirely with the complexity here. Never did a language have
such specialised ways of identifying your relations :-)


This is in fact exactly where inferred relations can make life easier. 
Instead of storing many different culture-specific properties on 
Wikidata (which would lead to a lengthy page with a lot of 
culture-specific relations), one can infer their values from existing 
data on the fly. It is not necessary to show these inferences to all 
users in all contexts, but one can offer them to users who are 
interested in this (e.g., in Reasonator, based on the language setting).
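
As a rough sketch of such an on-the-fly inference, in SPARQL (using the
wdt: prefix of the query service and the mother/brother properties P25/P7
listed further down in this thread; "mother's brother" is just one example
of a kin term that several languages distinguish):

PREFIX wdt: <http://www.wikidata.org/prop/direct/>

# Compute "mother's brother" from existing mother (P25) and brother (P7)
# statements instead of storing it as a separate property.
SELECT ?person ?maternalUncle WHERE {
  ?person wdt:P25 ?mother .
  ?mother wdt:P7 ?maternalUncle .
}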


There are still some steps needed until we can have this, but I can see 
a great chance there to make Wikidata more adapted to the cultural 
diversity of its users while keeping the underlying data simple.


Markus



However, we already seem to manage fine with simple relation
properties like spouse or child, without significant language
complications, and as long as all we're doing is putting these on more
items rather than inferring more complex relationships, I think we
should be okay.

Andrew.

On 17 August 2015 at 17:58, Gerard Meijssen gerard.meijs...@gmail.com wrote:

Hoi,
When you make these inferences, you have to appreciate how English-oriented
they are. In many cultures there are specific names for older sisters and
brothers and for younger sisters and brothers. There are names for uncles
and aunts from the mother's side that differ from those from the father's
side.

Inferences are language-specific. They may have a place, but they are not
obvious when you look at the scale of Wikidata.
Thanks,
   GerardM

On 17 August 2015 at 14:47, Markus Kroetzsch
markus.kroetz...@tu-dresden.de wrote:


Hi Andrew,

I am very interested in this, especially in the second aspect (how to
handle symmetry). There are many cases where we have two or more ways to say
the same thing on Wikidata (symmetric properties are only one case). It
would be useful to draw these inferences so that they can be used for queries
and maybe also in the UI.
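
A simplified sketch of the symmetric case for spouse (P26), ignoring
qualifiers and references and using the query service's wdt: prefix:

PREFIX wdt: <http://www.wikidata.org/prop/direct/>

# Materialise the missing direction of spouse (P26) statements.
CONSTRUCT { ?b wdt:P26 ?a }
WHERE {
  ?a wdt:P26 ?b .
  FILTER NOT EXISTS { ?b wdt:P26 ?a }
}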

This can also help to solve some of the other problems you mention: for
those who would like to have properties son and daughter, one could
infer their values automatically from other statements, without editors
having to maintain this data at all.
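
For instance, a hypothetical "daughter" relation could be derived roughly as
follows (P40 is the child property listed below; P21 = sex or gender,
Q6581072 = female, and the ex:daughter IRI are assumptions made only for
this sketch):

PREFIX wd: <http://www.wikidata.org/entity/>
PREFIX wdt: <http://www.wikidata.org/prop/direct/>
PREFIX ex: <http://example.org/inferred/>

# Derive "daughter" from child (P40) plus sex or gender (P21) = female.
CONSTRUCT { ?parent ex:daughter ?child }
WHERE {
  ?parent wdt:P40 ?child .
  ?child wdt:P21 wd:Q6581072 .
}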

A possible way to maintain these statements on wiki would be to use a
special reference to encode that they have been inferred (and from what).
This would make it possible to maintain them automatically without the
problem of human editors ending up wrestling with bots ;-) Moreover, it
would not require any change in the software on which Wikidata is running.

For the cases you mentioned, I don't think that there is a problem with
too many inferred statements. There are surely cases where it would not be
practical (in the current system) to store inferred data, but family
relationships are usually not problematic. In fact, they are very useful to
human readers.

Of course, the community needs to fully control what is inferred, and this
has to be done in-wiki. We already have symmetry information in constraints,
but for useful inference we might have to be stricter. The current
constraints also cover some not-so-strict cases where exceptions are likely
(e.g., most people have only one gender, but this is not a strong rule; on
the other hand, one is always the child of one's mother by definition).

One also has to be careful with qualifiers etc. For example, the start and
end of a spouse statement should be copied to its symmetric version, but
there might also be qualifiers that should not be copied like this. I would
like to work on a proposal for how to specify such things. It would be good
to coordinate there.

A first step (even before adding any statement to Wikidata) could be to
add inferred information to the query services and RDF exports. This will
make it easier to solve part of the problem first without having too many
discussions in parallel.

Best regards,

Markus



On 17.08.2015 13:29, Andrew Gray wrote:


Hi all,

I've recently been thinking about how we handle family/genealogical
relationships in Wikidata - this is, potentially, a really valuable
source of information for researchers to have available in a
structured form, especially now we're bringing together so many
biographical databases.

We currently have the following properties to link people together:

* spouses (P26) and cohabitants (P451) - not gendered
* parents (P22/P25) and step-parents (P43/P44) - gendered
* siblings (P7/P9) - gendered
* children (P40) - not gendered (and oddly no step-children?)
* a generic related to (P1038) for more distant relationships

There are two big things that jump out here.

** First, gender. Parents are split by gender while children are not
(we have mother/father not son/daughter). Siblings are likewise
gendered, and spouses are not. These are 
