Re: [Wikidata] Apologies

2015-06-03 Thread Stas Malyshev
Hi!

 This query shows all things owned by National Trust, with English label,
 description, and coordinates (where available):
 
 http://tinyurl.com/ntpx8qf
 
 Click execute to run it. There are ways to get this result in
 different formats (not embedded in HTML) but I don't find this right now.

If you just want the results and not the GUI, you can ask the endpoint
directly:
http://tinyurl.com/ph3wn4m

(this is
https://wdqs-beta.wmflabs.org/bigdata/namespace/wdq/sparql?query= and
then SPARQL, URL-encoded).

If you want it in JSON, you'll need to add the header:
Accept:application/sparql-results+json

(not easy to do from a browser, unfortunately, unless you use a tool like
Postman in Chrome); otherwise you'll get the default XML.
That's the endpoint the GUI uses; it then parses the results and
presents them in a more human-friendly form.

 other one (30k still worked for me, while the WMF experimental endpoint
 currently times out even at 10k -- the service is running on a virtual
 machine that is not very powerful right now; this will change soon). The

Yes, this service currently has a 30-second cap, so if the query takes
longer, sorry :) The cap will of course be raised significantly (and
performance will be better too) once we get it to production.
-- 
Stas Malyshev
smalys...@wikimedia.org

___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata


Re: [Wikidata] Update frequency on the Wikidata Query API

2015-06-18 Thread Stas Malyshev
Hi!

 I have seen that SPARQL query service and it indeed is an interesting
 alternative. In terms of stability and update frequency how different is
 the SPARQL query service from the Wikidata Query API?

In terms of stability: it's beta, so while we try to keep it up and
running smoothly, it is not out of the question that it could be taken
down at any moment, either because we found a bug or because we need to
update something, and the data model can change too. We do not expect
substantial changes in the data model anymore, and we try to keep it up
and running (it doesn't help that we are in the middle of a large labs
outage right now:
https://wikitech.wikimedia.org/wiki/Incident_documentation/20150617-LabsNFSOutage
) and synced continuously (i.e. no more than minutes behind Wikidata
edits), but as long as it's beta we can give no guarantees on anything.
We're working hard to make it production-quality, but that will take a
bit more time.

The difference between WDQ and WDQS/SPARQL is that SPARQL is a
full-featured language for querying triple-based (RDF) data sets, and
allows very complex queries. It is also a standard in the linked data world.
You can use the translator (http://tools.wmflabs.org/wdq2sparql/w2s.php)
- once the labs outage ends, of course - to convert between WDQ syntax
and SPARQL. Also check out the other links on the WDQS beta page for short
intros on how things are done with SPARQL and examples of the kinds of
queries you can run.

-- 
Stas Malyshev
smalys...@wikimedia.org

___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata


Re: [Wikidata] Update frequency on the Wikidata Query API

2015-06-18 Thread Stas Malyshev
Hi!

 How often is the WDQ api really being updated? Is it possible to query
 wikidata live, with WDQ and if not, are there alternatives that would
 allow this?

We currently have a SPARQL query service in beta[1], which is updated
constantly from Wikidata. Note that since it's beta it is not yet stable,
either operationally or data-model-wise, so please be aware of this; it
also has timeout limits that, for now, won't allow you to run queries that
are too complex. But if you want to check it out and see if it fits
your use case, you are most welcome.

[1] http://wdqs-beta.wmflabs.org/
-- 
Stas Malyshev
smalys...@wikimedia.org

___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata


Re: [Wikidata] accessing data from a wikidata concept page

2015-06-11 Thread Stas Malyshev
Hi!

 The links would point to the standard export URLs:
 * https://www.wikidata.org/wiki/Special:EntityData/Q423111.json
 * https://www.wikidata.org/wiki/Special:EntityData/Q423111.rdf

Speaking of these, shouldn't we also have link rel=alternate for the
export formats in the page header?

-- 
Stas Malyshev
smalys...@wikimedia.org

___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata


Re: [Wikidata] Lists of things as entities in Wikidata

2015-06-15 Thread Stas Malyshev
Hi!

 In Freebase, we had bot scripts that went through and removed Lists of
 Things topic entities since they are lists of entities and not useful
 clumped together and normalized in a graph database.

Why delete them? Wikidata has a number of things which are not your
standard entity - lists, sources, news, quotes, service entries,
narrative articles (e.g.
https://en.wikipedia.org/wiki/Control_of_fire_by_early_humans - it's not
exactly an entity like "human" or "fire"), etc. So I don't think an
approach that singles out and excludes lists would help much - an
application that needs individual entities like "Douglas Adams" or
"London" and excludes other types will have to exclude much more than
just lists - but I think the approach of asking for exactly what you need
and ignoring the rest may prove more efficient. I'm not sure there is a
really well-defined criterion for what an individual entity actually is -
I'm sure you have one that matches your application, but some other
application may have a completely different one.
Generally, this can be solved by better classification, I think, but so
far I'm not sure what to base this classification on.
-- 
Stas Malyshev
smalys...@wikimedia.org

___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata


Re: [Wikidata] Orphaned items

2015-05-31 Thread Stas Malyshev
Hi!

 As a practical suggestion for helping:
 http://tools.wmflabs.org/wikidata-todo/random_item_without_instance.php

I would also suggest
http://tools.wmflabs.org/wikidata-todo/important_blank_items.php which
lists the most-linked items from the wikis that have no connection to
other items whatsoever. Some of them are tough to classify or link to
anything, but some are rather obvious.

-- 
Stas Malyshev
smalys...@wikimedia.org

___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata


Re: [Wikidata] Orphaned items

2015-05-31 Thread Stas Malyshev
Hi!

 If an item has no statements, no sitelinks, and isn't used anywhere, how do 
 you
 tell what it even *is*? The label only? Is that sufficient and/or useful? What
 would be lost by deleting it? Maybe, if it has labels in many languages, with

Unless its purpose is obvious (i.e. the label/description/talk page
describes it clearly), I'd say it might be more dangerous to keep it
around: if some people start to use it with different meanings, and
others then add independent articles on a wiki which produce different
items with the same meaning, in multiple languages, pretty soon we'd
have quite a mess on our hands. An empty item by itself with no links,
no good labels and little or no data (like "John Smith, human" and
that's it) is not worth much, IMHO.

Yes, I don't have good formal criteria for "obvious", so I imagine we'd
have to take it on a case-by-case basis, or maybe think some up.

-- 
Stas Malyshev
smalys...@wikimedia.org

___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata


Re: [Wikidata] Announcing Wikidata Taxonomy Browser (beta)

2015-10-27 Thread Stas Malyshev
Hi!

> It should be a basic requirement of any SPARQL engine that it should be
> able to handle path queries that contain cycles.

So I did some simple checks, and on simple examples Blazegraph handles
cycles just fine. However, on more complex queries, the cycles seem to
be causing trouble. I don't yet know why; I'll look into it further,
probably next week.

So the problem is not "handling cycles" in general, it is handling some
specific data set, and most probably it is a consequence of some bug. I'll
report back when I have more data about what exactly triggers the bug.
-- 
Stas Malyshev
smalys...@wikimedia.org

___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata


Re: [Wikidata] birthday present: improved query.wikidata.org

2015-10-29 Thread Stas Malyshev
Hi!

> Would it be possible to have the results optionally in another format?
> When it is possible to have the results in a format where it can be
> picked up by one of Magnus's tools it would be actually useful. For best
> results it needs to be configurable; not everyone wants or needs the
> same tooling..

We have export formats - CSV, TSV, JSON, etc. (look for the "Download
results" link on the right side). If you would like to see another
format that is not supported now, please ask (best in the form of a
Phabricator ticket, but writing on the feedback page or by mail works too).

-- 
Stas Malyshev
smalys...@wikimedia.org

___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata


Re: [Wikidata] Announcing Wikidata Taxonomy Browser (beta)

2015-10-23 Thread Stas Malyshev
Hi!

> least one Wikipedia) are considered to refer to equivalent classes on
> Wikidata, which could be expressed by a small subclass-of cycle. For

We can do it, but I'd rather we didn't. The reason is that it would
require any engine that queries such data (e.g. a SPARQL engine) to be
comfortable with cycles in property paths (especially ones with + and
*), and not every one is (Blazegraph, for example, does not look like it
handles them out of the box). It can be dealt with, I assume, but why
create trouble for ourselves?

> We also have/had cycles involving instance-of, which is definitely an
> error. ;-)

Right. So I think we need to mark properties that should not form cycles
with
https://www.wikidata.org/wiki/Q18647519 (asymmetric property) and have
constraint-checking scripts/bots find such cases and alert about them.
-- 
Stas Malyshev
smalys...@wikimedia.org

___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata


Re: [Wikidata] how to map other identifiers to Wikidata entity IDs

2015-11-08 Thread Stas Malyshev
Hi!

> I found you can do it one-by-one in Wikidata Query [3] and in Wikidata
> Query Serivce [4] but neither seems amenable to doing a query on the fly
> "Get me the Wikidata item for each of these 100 ISBNs "2-7071-1620-3", ...

At least in SPARQL, this would be easy to do:

PREFIX wdt: <http://www.wikidata.org/prop/direct/>
SELECT ?book ?isbn WHERE {
 VALUES ?isbn { "2-7071-1620-3" "2-7071-1620-4" "2-7071-1620-5" ... }
 ?book  wdt:P957 ?isbn
}

Unless I misunderstand what you mean here.

-- 
Stas Malyshev
smalys...@wikimedia.org

___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata


Re: [Wikidata] Use of Sparql service is going through the roof

2015-11-06 Thread Stas Malyshev
Hi!

> Does anyone know what's going on with the Sparql service ?
> 
> Up until a couple of days ago, the most hits ever in one day was about
> 6000.
> 
> But according to
>  http://searchdata.wmflabs.org/wdqs/
> 
> two days ago suddenly there were 6.77 *million* requests, and yesterday
> over 21 million.
> 
> Does anyone know what sort of requests these are, and whether they are
> all coming from the same place ?

It looks like yes, they are coming from the same place, and in that place
there seems to be a bot doing something wrong. So if anybody knows whose
bot it is, please ask that person to seek advice and guidance (which I
would be glad to provide) on how to make it work properly :)

-- 
Stas Malyshev
smalys...@wikimedia.org

___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata


Re: [Wikidata] Use of Sparql service is going through the roof

2015-11-06 Thread Stas Malyshev
Hi!

> Might this be affecting our searches? The following query times out very
> quickly on Chrome, and runs forever in Firefox before crashing the whole
> browser (or is there a problem with my query?)

The symptoms you describe seem to suggest you have too many results for
this query and the browser runs out of memory. Try this query with LIMIT 10
first and see what happens.

As for the bot activity affecting other users, the effect seems to be
negligible, so if this query is slow, it is slow on its own merits :)


-- 
Stas Malyshev
smalys...@wikimedia.org

___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata


Re: [Wikidata] diseases as classes in Wikidata - was Re: An Ambitious Wikidata Tutorial

2015-10-19 Thread Stas Malyshev
Hi!

> Similarly the "Diary of Anne Frank" is an instance of a memoir or a
> literary work but is a subclass of book (because there are lots of
> physical books with that name). Literary works have authors and
> publishers. Books have numbers of pages and printers and physical locations.

I'm not sure I understand this. What is the difference between "instance
of memoir" and "subclass of book"? You could argue with literally the
same words that it is also a "subclass of memoir", and again, since very
rarely is any specific physical book notable enough (maybe excluding
things like the first Gutenberg Bible, etc.), we would have virtually no
instances of "book" at all. I do not think people think that way - if you
ask somebody whether "Diary of Anne Frank" is an example of a book or a
class, I think most people would say it's an example of a book and not a
class. Unless we plan to seek out and record every printed physical copy of
that book, I don't see any practical reason to describe it as a class.
This class - and hundreds of thousands of other book titles, maybe with
rare exceptions such as the Gutenberg Bible - would never have any
instances. So my question is: what is the use of modeling something as
a class if there will never be any instances of that class modeled?

-- 
Stas Malyshev
smalys...@wikimedia.org

___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata


Re: [Wikidata] WDQS updates have stopped

2015-11-18 Thread Stas Malyshev
Hi!

>> Usually the data gets updated every minute or two, but it's over 11 hours
>> now.
> 
> My best guess looking at things right now is that SuccuBot is making a
> huge number of edits and the updater for the query service might not
> be able to handle that yet. Stas: Could you have a look?

Yes, it looks like there's a large volume of updates, so the service is
several hours behind, but it seems to be catching up now. What is the bot
doing?

-- 
Stas Malyshev
smalys...@wikimedia.org

___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata


Re: [Wikidata] WDQS updates have stopped

2015-11-18 Thread Stas Malyshev
Hi!

>> Yes, looks like there's a large volume of updates, so the service is
>> several hours behind, but it seems to be catching up now. What's the bot
>> is doing?
> 
> https://www.wikidata.org/wiki/Special:Contributions/SuccuBot

The last set of edits seems suspect to me - e.g. adding copies of the
English label to a bunch of species as the Russian label without them
even having a ruwiki entry. I'm not sure that's a good thing, but yes, it
would generate quite a big load on updates, especially as it seems to be
adding hundreds (if not thousands) of labels per minute. I've also added
a note on Succu's talk page to discuss it.

-- 
Stas Malyshev
smalys...@wikimedia.org

___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata


Re: [Wikidata] First version for units is ready for testing!

2015-08-31 Thread Stas Malyshev
Hi!

> We've finally done all the groundwork for unit support. I'd love for
> you to give the first version a try on the test system here:
> http://wikidata.beta.wmflabs.org/wiki/Q23950

This is awesome, congratulations to Wikidata team on this milestone!

> There are a few known issues still but since this is one of the things
> holding back Wikidata I made the call to release now and work on these
> remaining things after that. What I know is still missing:

Looking at it, I also notice one has to go into edit mode to get from the
unit label to its entity. I wonder if it's possible to make it easier in
the UI to get to the entity of the unit and see its URL.

Also, is there any work/planning around marking standard units with
specific properties and establishing classes (i.e. measures of length,
weight, etc.) which can be considered convertible? If we want to be able
to run queries against quantities with units (and I think we do, don't
we?) then we would need to figure out the common basis at least for the
common units. I wonder if it's tracked somewhere?

-- 
Stas Malyshev
smalys...@wikimedia.org

___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata


Re: [Wikidata] First version for units is ready for testing!

2015-09-01 Thread Stas Malyshev
Hi!

> Also, I don't see a reason why the JSON encoding should use an IRI

It probably doesn't have to; just a Q-id would be enough. "1" is OK too,
but a bit confusing - if the rest are Q-ids, then it makes sense to
make them all Q-ids. Another option would be to make it just null or
something special like that.

> string does not seem to help anyone. I would suggest keeping the "1" as
> a marker for "no unit". Of course, this "1" would never be shown in the

It is possible, but "1" then looks like a "magic value", which is usually
bad design, since one needs to check for it all the time. It would be
nicer if there were a way to avoid it.

> Wikibase and elsewhere. If we create a special IRI for denoting this
> situation, it will be better distinguished from other (regular) units,
> and there will be no dependency on the current content of Wikidata's Q199.

We already have such dependencies - e.g. in calendars and globes - so it
won't be anything new. But let's see what the Wikidata team thinks about
it :)

-- 
Stas Malyshev
smalys...@wikimedia.org

___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata


Re: [Wikidata] Announcing the release of the Wikidata Query Service

2015-09-08 Thread Stas Malyshev
Hi!

> As suggested somewhere far above, it would be great for the community to
> catalogue the queries that are most important for their use cases that
> do not do well on the SPARQL endpoint.  Its likely that the list isn't
> going to be super-long (in terms of query structure), hence it might
> make sense to establish dedicated, optimized web services (that exist
> apart from the endpoint) to call upon when those kinds of queries need
> to be executed.  

Good idea. As a preliminary point, I think the same basics as with most
other query engines (SQL, etc.) apply:

- non-restrictive queries with tons of results will be slow -
i.e. "list of all humans" is probably not a good question to ask :)

- negative searches are usually slower - e.g. "all humans without
images" will be slow, since that query would have to inspect the records
for every human

- unbound paths/traversals will usually be slower (unfortunately many
queries that have TREE in WDQ are of that kind), especially if there are
a lot of starting points for the traversal (again, "all humans that", etc...)

It is also a good idea to put a LIMIT on queries when experimenting, i.e.
if you intended to write a query that asks for 10 records but accidentally
wrote one that returns 10 million, it's much nicer to discover that with a
suitable limit than by waiting for the query to time out and then trying
to figure out why it happened.
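For instance, a trivial sketch of the "use a LIMIT while experimenting"
advice (the P31/Q5 "instance of human" pattern is only an illustration of a
deliberately broad query):

PREFIX wd: <http://www.wikidata.org/entity/>
PREFIX wdt: <http://www.wikidata.org/prop/direct/>
SELECT ?human WHERE {
  ?human wdt:P31 wd:Q5 .  # instance of: human - deliberately unrestrictive
}
LIMIT 10                  # caps the result set while experimenting

Dropping the LIMIT turns it into exactly the kind of "list of all humans"
query mentioned above.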

Yes, I realize all this has to go to some page in the manual eventually :)
-- 
Stas Malyshev
smalys...@wikimedia.org

___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata


Re: [Wikidata] Announcing the release of the Wikidata Query Service

2015-09-08 Thread Stas Malyshev
Hi!

> Yes it is the continuation of the beta on labs.
> Stas: Do you want to turn that into a redirect now?

Not sure yet what to do with it. I want to keep the labs setup for
continued development work, especially when potentially breaking things,
but we do want to redirect most people now to the main endpoint, as it
is much better at handling the load.

-- 
Stas Malyshev
smalys...@wikimedia.org

___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata


Re: [Wikidata] Announcing the release of the Wikidata Query Service

2015-09-08 Thread Stas Malyshev
Hi!

> Before turning that it into a redirect, it might be worth looking at
> the content. There seems to be some discrepancy between the results
> from http://wdqs-beta.wmflabs.org/ and those from
> http://query.wikidata.org. When submitted to both endpoints the
> following query returns different results:

These are fed from the same data and are supposed to run the same code, so
I wonder how there came to be a difference... The query actually shows
there are two values for Q181391 on P699 - "987" and "DOID:987".

I'll investigate that. It may be some kind of bug. Thanks for bringing
it to my attention.

-- 
Stas Malyshev
smalys...@wikimedia.org

___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata


Re: [Wikidata] Announcing the release of the Wikidata Query Service

2015-09-08 Thread Stas Malyshev
Hi!

> Anyone an idea why this query has a trouble when I add the OPTIONAL keyword?
> 
> *http://tinyurl.com/pgsujp2*
> 
> Doesn't look much harder than the queries in the examples.

It's not because it's harder. It's because ?head can be unbound, and you
cannot apply the label service to unbound variables. If you drop ?headLabel
then it works. It is a downside of the label service; I'm not sure yet how
to fix it (feel free to submit a Phabricator issue, maybe I or somebody
else will have an idea later).
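The query behind the link isn't reproduced here, but as a rough sketch of
the workaround - a manual label lookup instead of the label service, with
the country/head-of-government pattern used purely for illustration - an
OPTIONAL rdfs:label block tolerates an unbound variable:

PREFIX wd: <http://www.wikidata.org/entity/>
PREFIX wdt: <http://www.wikidata.org/prop/direct/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
SELECT ?country ?head ?headLabel WHERE {
  ?country wdt:P31 wd:Q6256 .           # instance of: country
  OPTIONAL { ?country wdt:P6 ?head . }  # head of government; may be unbound
  OPTIONAL {                            # manual label lookup, safe for unbound ?head
    ?head rdfs:label ?headLabel .
    FILTER(LANG(?headLabel) = "en")
  }
}
LIMIT 20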

-- 
Stas Malyshev
smalys...@wikimedia.org

___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata


Re: [Wikidata] Announcing the release of the Wikidata Query Service

2015-09-08 Thread Stas Malyshev
Hi!

> It seems the implementation of the label service needs some improvement
> to support unbound variables (which should then return unbound labels,
> rather than throw a runtime exception ;-).

Not sure whether this is possible (need to research it) but if it is,
yes, that probably would be the way to fix it.

-- 
Stas Malyshev
smalys...@wikimedia.org

___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata


Re: [Wikidata] Source statistics

2015-09-07 Thread Stas Malyshev
Hi!

> A small fix though: I think you should better use count(?statement)
> rather than count(?ref), right?

Yes, of course, my mistake - I modified it from different query and
forgot to change it.

> I have tried a similar query on the public test endpoint on labs
> earlier, but it timed out for me (I was using a very common reference
> though ;-). For rarer references, live queries are definitely the better
> approach.

Works for me for Q216047; I didn't check others, though. For popular
references, the labs one may be too slow, indeed. A faster one is coming
"real soon now" :)

-- 
Stas Malyshev
smalys...@wikimedia.org

___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata


Re: [Wikidata] Importing Freebase (Was: next Wikidata office hour)

2015-09-28 Thread Stas Malyshev
Hi!

> I see that 19.6k statements have been approved through the tool, and
> 5.1k statements have been rejected - which means that about 1 in 5
> statements is deemed unsuitable by the users of primary sources.

From my (limited) experience with Primary Sources, there are several
kinds of things there that I have rejected:

- Unsourced statements that contradict what is written in Wikidata
- Duplicate claims already existing in Wikidata
- Duplicate claims with worse data (i.e. less accurate location, less
specific categorization, etc.) or unnecessary qualifiers (such as adding
information which is already contained in the item to the item's
qualifiers - e.g. a zip code for a building)
- Source references that do not exist (404, etc.)
- Source references that do exist but either duplicate an existing one (a
number of sources just refer to a different URL for the same data) or do
not contain the information they should (e.g. a link to a newspaper's
homepage instead of the specific article)
- Claims that are almost obviously invalid (e.g. "United Kingdom" as a
genre of a play)

I think at least some of these - esp. references that do not exist and
duplicates with no refs - could be removed automatically, thus raising
the relative quality of the remaining items.

OTOH, some of the entries can be made self-evident - i.e. if we are talking
about a movie and Freebase has an IMDB ID or a Netflix ID, it may be quite
easy to check whether that ID is valid and refers to a movie by the same
name, which should be enough to merge it.

Not sure if those one-off things are worth bothering with; just putting it
out there to consider.

-- 
Stas Malyshev
smalys...@wikimedia.org

___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata


Re: [Wikidata] Why do these two SPARQL queries take such different times to run?

2015-09-09 Thread Stas Malyshev
Hi!

> here's a query to find multiple humans with nationality:Greece that have
> the same day of birth and day of death:
>   http://tinyurl.com/ow6lpen
> It produces one pair, and executes in about 0.6 seconds.
> 
> Here's a query to try to add item numbers and labels to the previous
> search:
>   http://tinyurl.com/ovjwzc9
> 
> It *just* completes, taking just over 60 seconds to execute.

It looks like some issue with nested queries in Blazegraph; I've sent a
report to them and will see what they say.

> Obviously the second query as written at the moment involves a
> sub-query, which inevitably must make it a bit slower -- but given the
> solution set of the sub-query only has two rows, and an exact date for a
> given property ought to be a fairly quick key to look up, why is the
> second query taking 100 times longer than the first ?

Yes, in theory it should be fast, so I suspect some kind of bug.
-- 
Stas Malyshev
smalys...@wikimedia.org

___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata


Re: [Wikidata] Duplicate identifiers (redirects & non-redirects)

2015-10-05 Thread Stas Malyshev
Hi!

> It seems like the constraint checker could check for either only one
> "Preferred" or all but one "Deprecated" which would allow editors to
> evolve in whichever way they wanted.

It should probably consider the "best rank" ones - i.e. if Preferred
statements exist, then the Preferred ones; otherwise the Normal ones, but
never the Deprecated ones.
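As a rough sketch of one half of that check - flagging cases with more than
one Preferred-rank statement for the same property; this is only an
illustration against the WDQS data model, not the constraint checker's
actual query, and it will likely be slow without narrowing to one property:

PREFIX wikibase: <http://wikiba.se/ontology#>
SELECT ?item ?prop (COUNT(?st) AS ?preferred) WHERE {
  ?item ?prop ?st .
  ?st wikibase:rank wikibase:PreferredRank .  # only Preferred-rank statements
}
GROUP BY ?item ?prop
HAVING(COUNT(?st) > 1)
LIMIT 10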
-- 
Stas Malyshev
smalys...@wikimedia.org

___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata


Re: [Wikidata] Using the SPARQL endpoint of WIkidata outside its GUI

2015-09-23 Thread Stas Malyshev
Hi!

> With the release of http://query.wikidata.org, this long address doesn't
> seem to work anymore, nor the same pattern in query.wikidata.org
> <http://query.wikidata.org>
> (i.e.: http://query.wikidata.org/bigdata/namespace/wdq/sparql) both
> return a http 301 header. 

Should probably be https://query.wikidata.org/bigdata/namespace/wdq/sparql

That works for me.
-- 
Stas Malyshev
smalys...@wikimedia.org

___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata


Re: [Wikidata] query.wikidata.org and wikibase:Statement ?

2016-01-08 Thread Stas Malyshev
Hi!

> statements (about 2.5M) and on the question if SPARQL could list all
> entries in Wikidata that do not have statements. I played a bit with

Technically, it could, but since there are so many of them, the query might
not finish in time. The problem is that since there are no indexes on
something not existing, what probably happens is that the database goes
entity by entity trying to find one that doesn't have a statement,
and that is slow. I think there may be a bug in the LIMIT implementation,
or maybe it's just indeed taking too long...

> combinations of OPTIONAL and FILTER-BOUND and FILTER NOT EXIST...
> something like:
> 
> PREFIX wikibase: <http://wikiba.se/ontology#>
> SELECT DISTINCT ?entry ?label ?statement WHERE {
>   ?entry rdfs:label ?label . FILTER (lang(?label) = "en")
>   FILTER NOT EXISTS {
> ?statement ?prop ?entry ;
>   wikibase:rank ?rank .
>   }
> } LIMIT 5

This query also seems a bit wrong, since it looks for ?entry as the object,
not the subject.
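For reference, a corrected sketch with ?entry as the subject would look
like the following (it is still likely to be slow or time out, for the
reasons above):

PREFIX wikibase: <http://wikiba.se/ontology#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
SELECT DISTINCT ?entry ?label WHERE {
  ?entry rdfs:label ?label . FILTER(LANG(?label) = "en")
  FILTER NOT EXISTS {
    ?entry ?prop ?statement .          # entity -> statement node
    ?statement wikibase:rank ?rank .   # every statement node carries a rank
  }
} LIMIT 5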

> But there was something else I noted... statements are not typed...
> that would probably kick in some index, rather than the above query,
> and the documentation actually speaks about wikibase:Statement [1] but
> if I search for anything rdf:type-d as such, then it finds nothing in
> the SPARQL end point:

Right, please check out:
https://www.mediawiki.org/wiki/Wikibase/Indexing/RDF_Dump_Format#WDQS_data_differences

wikibase:Statement is omitted from the database for performance
reasons. You could still match statements by URL, by converting them with
str() and then using the substr() function, but that probably wouldn't
help much, since there are a lot of statements and the filtering would not
be very selective.
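A sketch of that URL-based matching (using STRSTARTS rather than SUBSTR,
which amounts to the same thing here; the prefix is the wds: statement
namespace from the dump format) would be:

SELECT ?statement ?p ?o WHERE {
  ?statement ?p ?o .
  # statement nodes live under this prefix, but the filter is not selective
  FILTER(STRSTARTS(STR(?statement), "http://www.wikidata.org/entity/statement/"))
} LIMIT 10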
-- 
Stas Malyshev
smalys...@wikimedia.org

___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata


Re: [Wikidata] REST API for Wikidata

2015-11-30 Thread Stas Malyshev
Hi!

> * About this API: http://queryr.wmflabs.org
> * Documentation: http://queryr.wmflabs.org/about/docs
> * API root: http://queryr.wmflabs.org/api

I like the idea of a truly REST-style API for Wikidata. Some comments from
a first look:

* I'm not sure items and properties should really be separate, rather than
having a common /entity/ endpoint.

* You can do /items/{item_id}/data/{property_label} but not by property
ID. In fact, the property label seems to work only in English, and
some properties have pretty unwieldy names even in English - e.g. P357
(I know it's obsolete, but it may still have data and people may have to
use it).

* http://queryr.wmflabs.org/api/items/Q42/data/occupation returns only
one value; shouldn't it return multiple ones?
-- 
Stas Malyshev
smalys...@wikimedia.org

___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata


Re: [Wikidata] WDQS updates have stopped

2015-11-18 Thread Stas Malyshev
Hi!

The service has caught up now, but in the near future I would like to
ask the bots to throttle their edits a bit, for now. In the meantime,
I'll look into speeding up the update process further, but given the
nature of the database it may still be possible to temporarily overload
it with a large enough update stream. So keeping bot updates under 10 per
second would be nice (this is somewhat arbitrary, from a back-of-the-
envelope calculation, so don't take the exact figure *too* seriously).
Note that this should not be too hard a limit - it allows updating
every single record now in Wikidata in about 2 weeks, which seems to be
OK for most tasks. But it is a limitation and, as I said, I'll work to
eventually get rid of it.

Thanks,
-- 
Stas Malyshev
smalys...@wikimedia.org

___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata


Re: [Wikidata] WDQS URL shortener

2016-06-03 Thread Stas Malyshev
Hi!

> I could have used another one on my own I guess, but the current
> implementation is much faster and less error prone when dealing with
> monster sparqling urls... 
> 
> please find a way to keep it

There are no plans to remove URL shortening. There are plans to switch the
URL shortener to Wikimedia's own one, which is supposed to be coming
eventually, but before that, we plan to use existing ones. We might
change providers if it turns out there is a better one, but we do not
plan to remove the functionality.

-- 
Stas Malyshev
smalys...@wikimedia.org

___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata


Re: [Wikidata] Possibility of data lock

2016-06-10 Thread Stas Malyshev
Hi!

> (1) Statement- and property-level watching for changes (seeing
> problematic changes should come before disallowing them)

This might be harder to do, as it's not really possible - internally - to
edit just one statement. It looks like editing one statement, but the
data for the whole entity is stored together, so you are actually always
editing the whole entity. Which means additional code is needed to filter
the watches. Though if the number of watches is small it may be fine.

An additional question is what exactly a "statement" is. We do not have any
user-exposed identity for statements, and the internal identity can change
with each edit. So if you edit a statement - for how long is it the
same statement as before? (Ship of Theseus problem :)

> (2) Statement- and property-level history (it's currently really hard to
> find out who last changed a property, e.g., to contact the person or to
> look at their other edits)

Again, due to the two issues above it may be a bit hard to construct -
though we do have diffs, relating diffs to a specific statement may be
tricky. Properties, though, are probably much easier, since they have a
clear identity.

It should be possible at least to have some code that answers
questions like "which edit was the last one to touch this property" and
"which edits changed this property's statements", though it is not
trivial and I don't think it exists now.

> (3) Statement- and property-level protection (for the hard cases, mostly
> temporarily, same policies as for page-level protection)
> (4) Statement-level patrolling (can I approve a more recent change to
> P31 without approving an older change to P580?)

This may be possible but I'm not sure it's necessary. The watch approach
below may be more effective and more consistent with the project spirit,
I think.

> (5) Query-based watching: if you want to watch all property changes for
> a large set of articles, you need better tools

I think a tool that takes a query and creates a list, and allows you to:
1. See how the list changes over time
2. Mark some items on the list as "ok" and some as "not ok"
3. Alert people on changes in the list

would make watching for such changes much easier.

> (6) More work on edit filters (preventing some edits based on content,
> e.g., the use of shortened URLs as a homepage)
> (7) Better UIs to prevent accidental edits (e.g., I can see a lot of
> cases where people have entered qualifier information as new statement
> values)

That's a good case for watching too - we have properties that are
predominantly used in qualifiers, and even marked as such. It should not be
hard to make auto-lists of violations and have people look at them.

> (8) Further work on easy-to-customise quality analysis and display of
> related results (the constraint service is great, but hard to use in a
> targeted way to find errors in a specific area). While data-vandalism
> can have far-reaching consequences, it also is much harder to hide if
> the community has the right tools at hand.
> (9) Better data importing infrastructures (some problems mentioned in
> this thread seem to be caused by a multi-stage data import approach that
> only works if nothing changes in the meantime; I am sure one could get
> this fixed without relying on user-editable data).

-- 
Stas Malyshev
smalys...@wikimedia.org

___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata


Re: [Wikidata] Possibility of data lock

2016-06-10 Thread Stas Malyshev
Hi!

> More concretely: While the item about Barack Obama as a whole will
> probably change ever and ever again, the value of his birthday statement
> should never change: The current value is proven to be true and can't
> change by its nature. What would be the problem about protecting this
> specific value in this specific statement?

Small wrinkle here: one may want to change it by adding a reference or
removing a reference that went stale, or replacing it with an archived
version. It is also not out of the question that we may add a qualifier to
it (e.g. if we introduce a new qualifier that didn't exist before) or
remove one (if we deprecate it, for example). So while the value is not
likely to change, other components of the claim very well might.

-- 
Stas Malyshev
smalys...@wikimedia.org

___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata


Re: [Wikidata] Federated queries between wikidata and e.g. sparql.uniprot.org

2016-05-26 Thread Stas Malyshev
Hi!

> What do we need to take into account to figure this out? What are the
> constraints/concerns/etc?

Hmm... Looks like I may have spoken too soon. While it is not a problem to
make *WDQS* accept federated queries, it may be a problem to make
federated queries actually *work*, because the production machines do not
seem to have direct access to the internet.

Which means that if we want to make it work, we'd need to ask ops for
one-off exceptions to firewall policies, which are usually frowned upon -
they are hard to maintain and in general require extra effort and aren't
really best practice.

Maybe some of the use cases might be better served by a TPF server
(https://github.com/blazegraph/BlazegraphBasedTPFServer) - this does not
enable federated queries per se, but it makes it easier to produce content
that can be queried externally.
-- 
Stas Malyshev
smalys...@wikimedia.org

___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata


Re: [Wikidata] Federated queries between wikidata and e.g. sparql.uniprot.org

2016-05-26 Thread Stas Malyshev
Hi!

> Dear Wikidata developers and contributors,
> 
> I see that it is currently not possible to run federated queries from
> wikidata to other sparql endpoints. I understand why you are not
> allowing this in the general case. However, it would be nice to allow
> this in the special case.
> 
> Would a patch allowing limited known remote sparql endpoints to
> org.wikidata.query.rdf.blazegraph.WikibaseContextListener
> be possible?

Technically, it is possible and shouldn't be very hard to do. We need to
figure out which endpoints we want to allow.
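For illustration, if such an endpoint were allowed, a federated query would
use the standard SPARQL 1.1 SERVICE clause. A rough sketch - P352 is the
UniProt ID property on the Wikidata side, and the purl.uniprot.org URI
pattern plus the type lookup on the remote side are only assumptions for
the example:

PREFIX wdt: <http://www.wikidata.org/prop/direct/>
SELECT ?protein ?uniprotId ?type WHERE {
  ?protein wdt:P352 ?uniprotId .                  # Wikidata items with a UniProt ID
  BIND(IRI(CONCAT("http://purl.uniprot.org/uniprot/", ?uniprotId)) AS ?entry)
  SERVICE <https://sparql.uniprot.org/sparql> {   # remote endpoint via SPARQL 1.1 federation
    ?entry a ?type .
  }
} LIMIT 10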

-- 
Stas Malyshev
smalys...@wikimedia.org

___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata


Re: [Wikidata] WDQS URL shortener

2016-06-01 Thread Stas Malyshev
Hi!

> that makes sense. It sounds like in the short term (and until we
> have a Wikimedia-operated shortener), using full URLs from WDQS is –
> alas – the only way to go. One option we haven't mentioned would be
> for WDQS itself to support URL shortening, I have no idea where that
> would sit in terms of priorities.

I don't think it'd be very high. It would require building
infrastructure that the wiki extension already has, for the sole purpose of
implementing functionality that is already provided by that extension.
So I think the best way here is just to wait until it's deployed, and
maybe gently prod the responsible people from time to time :)

-- 
Stas Malyshev
smalys...@wikimedia.org

___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata


Re: [Wikidata] WDQS URL shortener

2016-06-01 Thread Stas Malyshev
Hi!

> Hi there, may I ask what link shorteners provide you that w3id does not?
> Eg baked in metrics or 10char urls? Just curious why you would want to
> reimplement.

From what I can see, the target audience of w3id is a relatively small
set of very stable URL prefixes that are used a lot and never change. It
also does not aim at making these URLs shorter; it aims at making them
stable. Adding a URL namespace to it is a manual process, and individual
URLs are not stored.

The target of URL shorteners is a much bigger set of URLs, many of which
are relatively low-use or transient, but which can be created by automatic
means in great volumes; a shortener makes the URLs shorter and aims at
storing, at least for a while, each URL as an individual data piece.

So, for our purposes w3id would not be very useful.
-- 
Stas Malyshev
smalys...@wikimedia.org

___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata


Re: [Wikidata] Machine-readable Wikidata ontology/schema?

2016-06-22 Thread Stas Malyshev
Hi!

> A quick search only returned those tables so far:
> https://www.wikidata.org/wiki/Wikidata:List_of_properties/all
> 
> Any formal representation would work: OWL, etc.

There's a basic OWL file with the Wikibase ontology here:
http://wikiba.se/ontology-1.0.owl
The properties can be found in the general dump (
https://dumps.wikimedia.org/wikidatawiki/entities/ ),
described as outlined here:

https://www.mediawiki.org/wiki/Wikibase/Indexing/RDF_Dump_Format#Properties

There's no separate file, RDF, OWL or otherwise, with only the properties,
AFAIK.
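If the query service is an option, a sketch of pulling the property list
from there - assuming the wikibase:Property typing, wikibase:propertyType
and rdfs:label described in the dump format are all loaded, which I
believe they are - could look like:

PREFIX wikibase: <http://wikiba.se/ontology#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
SELECT ?property ?label ?type WHERE {
  ?property a wikibase:Property ;         # property entities
            wikibase:propertyType ?type ; # datatype of the property
            rdfs:label ?label .
  FILTER(LANG(?label) = "en")
}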
-- 
Stas Malyshev
smalys...@wikimedia.org

___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata


Re: [Wikidata] Op-ed on Wikipedia Signpost regarding Wikidata licensing

2016-06-20 Thread Stas Malyshev
Hi!

> Current legislations do not support the licensing of individual facts,
> only of databases as a whole, and only in some countries. What you are

Added to that, even if it *were* possible to copyright facts, I think
using a restrictive license (and make no mistake, any license that
requires people to do specific things in exchange for data access *is*
restrictive) makes a lot of trouble for anyone using the data. This
is especially true for data that is meant for automatic processing - you
would have to add code to track licenses for each data unit, figure out
how exactly to comply with the license (which would probably require
professional help, always expensive), track license-contaminated data
throughout the mixed databases, and verify all outputs to ensure only
properly-licensed data goes out... It presents so much trouble that many
people would just not bother with it. It would hinder exactly the thing
open source excels at - creating a community of people building on each
other's work by means of incremental contribution and wide participation.
Want to create a cool visualization based on Wikidata? Talk to a lawyer
first. Want to kickstart your research exploration using Wikidata facts? To
the lawyer you go. Want to write an article on, say, gender balance in
science over the ages and places, and feature Wikidata facts as an
example? Where's that lawyer's email again?
You get the picture, I hope. How many people would decide "well, it
would be cool, but I have no time and resources to figure out all the
license issues" and not do the next cool thing they could do? Is that
something we really want to happen?

And all that trouble to no benefit to anyone - there's absolutely no
threat of the Wikidata database being taken over and somehow subverted by
"enterprises", whatever that nebulous term means. In fact, if the Google
example shows us anything, it's that "enterprises" are not very good at
it and don't really want it. Would they benefit from the free and open
data? Of course they would, as would everybody. The world - everybody,
including "enterprises" - has benefited enormously from free and
open participatory culture, be it open source software or free data. It
is a *good thing*, not something to be afraid of!

Wikidata's data is meant for free use and reuse. Let's not erect
artificial barriers to it out of a misguided fear of somehow benefiting
somebody "wrong".
-- 
Stas Malyshev
smalys...@wikimedia.org

___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata


Re: [Wikidata] counting gendered items for the gendergap project

2016-02-06 Thread Stas Malyshev
Hi!

> Is there any way to find out if a clothing item (dress, hat, shoes, etc)
> is gendered? Any ideas how to do this? Thanks in advance

I don't think we currently have anything for this. We could have something
like "subclass/instance of" "male clothing" or "female clothing", but these
things are very culture-specific (including changing within a culture
with the passage of time), so I'm not sure it would be easy to represent
this in Wikidata.

-- 
Stas Malyshev
smalys...@wikimedia.org

___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata


Re: [Wikidata] SPARQL CONSTRUCT results truncated

2016-02-10 Thread Stas Malyshev
Hi!

> I try to extract all mappings from wikidata to the GND authority file,
> along with the according wikipedia pages, expecting roughly 500,000 to
> 1m triples as result.

As a starting note, I don't think extracting 1M triples is the best way to
use the query service. If you need to do processing that returns such
big result sets - in the millions - maybe processing the dump - e.g. with
the Wikidata Toolkit at https://github.com/Wikidata/Wikidata-Toolkit -
would be a better idea?

> However, with various calls, I get much less triples (about 2,000 to
> 10,000). The output seems to be truncated in the middle of a statement, e.g.

It may be some kind of timeout because of the quantity of data being
sent. How long does such a request take?

-- 
Stas Malyshev
smalys...@wikimedia.org

___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata


[Wikidata] WDQS stability

2016-02-04 Thread Stas Malyshev
Hi!

As was noted on the list, we recently tried to update Blazegraph - the
software running the Wikidata Query Service - to version 2.0, which has
numerous bugfixes and performance improvements, plus some infrastructure
for future work on geospatial search, etc.

Unfortunately, it seems, as sometimes happens with new major
releases, that there are certain bugs in it, and yet more unfortunately,
one of the bugs seems to be of a race-condition nature, which is very
hard to trigger in a test environment and which, when triggered, seriously
impacts the stability of the service. All this led to the WDQS service
being somewhat unstable over the last couple of days.

Due to this, I have rolled the production deployment back to the pre-2.0
state. This means the service should be stable again and not experience
glitches anymore. I'll be watching it just in case, and if you notice
anything that looks broken (like queries producing weird exceptions -
timeouts don't count - or the service being down, etc.) please ping me.

In the meantime, we will look for the cause of the instability, and once it
is identified and fixed, we'll try the Blazegraph 2.0 roll-out again,
with the fixes applied. I'll send a note to the list when that happens.

Thanks,
-- 
Stas Malyshev
smalys...@wikimedia.org

___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata


Re: [Wikidata] Wikidata RDF PHP export bug

2016-02-08 Thread Stas Malyshev
Hi!

> I just noticed a bug in the RDF live exports of Wikidata: they still use
> the base URI <http://wikiba.se/ontology-beta#> for all Wikidata
> vocabulary terms. The correct base URI would be
> <http://wikiba.se/ontology#>. I guess this has been forgotten and never
> got noticed yet (not sure if there are consumers of the live exports).

It's not forgotten - in fact, we have an issue for that,
https://phabricator.wikimedia.org/T112127 - but we never got around to
defining the point at which we do it. One can argue the RDF mapping is
still not complete - we do not support units fully, and we may have to add
stuff for geo-coordinates too - but one can also argue it's good enough to
be 1.0, and I'd agree with that. But we need to make a decision on this.
Please feel free to also comment on the task.


-- 
Stas Malyshev
smalys...@wikimedia.org

___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata


Re: [Wikidata] SPARQL CONSTRUCT results truncated

2016-02-11 Thread Stas Malyshev
Hi!

> For me, it’s perfectly ok when a query runs for 20 minutes, when it
> spares me some hours of setting up a specific environment for one
> specific dataset (and doing it again when I need current data two month
> later). And it would be no issue if the query runs much longer, in
> situations where it competes with several others. But of course, that’s
> not what I want to experience when I use a wikidata service to drive,
> e.g., an autosuggest function for selecting entities.

I understand that, but this is a shared server which is supposed to
serve many users, and if we allowed 20-minute queries to run on this
service, soon enough it would become unusable. This is why we have the
30-second limit on the server.

Now, we have considered having an option for a server or setup that
allows running longer queries, but currently we don't have one. It would
require some budget allocation and work to make it, so it's not
something we can have right now. There are use cases for very long
queries and very large results; the current public service endpoint is
just not good at serving them, because it's not what it was meant for.

> And do you think the policies and limitations of different access
> strategies could be documented? These could include a high-reliability

I agree that the limitations had better be documented; the problem is we
don't know everything we may need to document, such as "what queries may
be bad". When I see something like "I want to download a
million-row dataset" I know it's probably a bit too much. But I can't
have a hard rule that says 1M-1 is OK, but 1M is too much.

> preferred option). And on the other end of the spectrum something what
> allows people to experiment freely. Finally, the latter kind of

I'm not sure how I could maintain an endpoint that would allow people to
do anything they want and still provide an adequate experience for
everybody. Maybe if we had infinite hardware resources... but we do not.

Otherwise, it is possible - and should not be extremely hard - to set up
one's own instance of the Query Service and use it for experimenting
with heavy lifting. Of course, that would require resources - but
there's no magic here, it'd require resources from us too, both in terms
of hardware and of people to maintain it. So some things we can do
now, some things we will be able to do later, and some things we
probably will not be able to offer with any adequate quality.
-- 
Stas Malyshev
smalys...@wikimedia.org

___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata


Re: [Wikidata] SPARQL CONSTRUCT results truncated

2016-02-11 Thread Stas Malyshev
Hi!

> 5.44s empty result
> 8.60s 2090 triples
> 5.44s empty result
> 22.70s 27352 triples

That looks weirdly random. I'll check out what is going on there.

-- 
Stas Malyshev
smalys...@wikimedia.org

___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata


Re: [Wikidata] upcoming deployments/features

2016-02-02 Thread Stas Malyshev
Hi!

> Can you try again please? And in an in-cognito window? I just tried it
> and it works for me: https://test.wikidata.org/wiki/Q649 We've had some
> issues with local store though.

Weird - it does work for me incognito but not when logged in.

> The datatype changes but the value type stays string. So depending on
> what they use they might need to be adapted.

The RDF export seems to be fine, except that we need to update the OWL and
docs for the new types; I'll check pywikibot a bit later.

-- 
Stas Malyshev
smalys...@wikimedia.org

___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata


Re: [Wikidata] Wikidata - short biographies

2016-01-31 Thread Stas Malyshev
Hi!

> 
> ** For all English bios:*
> 
> SELECT *
> WHERE
> {
>?s <http://schema.org/description> ?o .   
>filter(lang(?o)='en').
> }

Please don't run this on query.wikidata.org, though, without adding a LIMIT.
Otherwise you'd be trying to download several million data items,
which would probably time out anyway. Add something like "LIMIT 10" to it.
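For example, the limited version of the query above would simply be:

SELECT *
WHERE
{
  ?s <http://schema.org/description> ?o .
  FILTER(LANG(?o) = 'en')
}
LIMIT 10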

Thanks,
-- 
Stas Malyshev
smalys...@wikimedia.org

___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata


Re: [Wikidata] SPARQL service slow?

2016-02-03 Thread Stas Malyshev
Hi!

> 1. Is that stability now reached?
> 2. Is there some sort of flag-mechanism, that would indicate if the
> SPARQL endpoint is stable, being upgraded or had change in state?

It's supposed to be stable now, though there's at least one known bug
(it doesn't affect you unless you run the affected query, but if you see
NotMaterializedException you probably hit it; it's being worked on and
will be fixed soon - ping me if you want the boring details).

In general, the upgrade went fine, so I didn't send a note, but then
apparently (of course, just after I had waited to ensure everything was
going fine, which it was, and gone to sleep) something went wrong, and I'm
still not sure what exactly - I'll continue to research that, but
whatever happened doesn't seem to have repeated itself so far. In the
future, I'll make an announcement when I do potentially risky stuff.

We have a number of plans for implementing new functions which will
involve a full DB reload at least once - which means the DB will be
temporarily reset to the state of the latest weekly dump (i.e. 2-3 days
behind the current state of Wikidata) and will then re-sync with the
current state over the course of a day or so. I'll write about that
before we do it; depending on how development goes, it could be at the end
of this month or sometime next month.

Thanks,
-- 
Stas Malyshev
smalys...@wikimedia.org

___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata


Re: [Wikidata] SPARQL service slow?

2016-02-03 Thread Stas Malyshev
Hi!

> Well, we only noticed what was up due to this email!
> Take a look at https://phabricator.wikimedia.org/T119915

Yes, we need to look into it. The problem is that the service has two
failure modes:

1. Completely dead, rejecting all queries. This would be caught by
Icinga and alerted on.

2. Crawling slow but still partially alive - just performing very, very
badly. For this one, we do not have an adequate alert system. This failure
mode is rare, but we've seen it happen, both due to somebody sending
a torrent of heavy queries and due to some bug scenarios. Icinga does not
catch it because it only checks very basic queries, and those still finish
under the timeout.

-- 
Stas Malyshev
smalys...@wikimedia.org

___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata


Re: [Wikidata] weekly summary #194

2016-02-02 Thread Stas Malyshev
Hi!

> Indeed, this is very nifty. I also note that this uses some special
> features of our SPARQL endpoint that I did not know about (the "gas
> service"). It seems that this is a proprietary extension of BlazeGraph,
> which comes in very handy here.

Yes, it's described here:
https://wiki.blazegraph.com/wiki/index.php/RDF_GAS_API and it's a
service implementing basic graph algorithms such as BFS, shortest path,
PageRank, etc. I personally haven't used it much, but it may be very
useful for tasks which are naturally expressed as graph traversals.
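As a rough sketch of what the GAS service looks like in a WDQS query - the
start item ("entity", Q35120), the P279 link type and the iteration cap are
only illustrative choices, with the parameter names taken from the
Blazegraph documentation linked above:

PREFIX gas: <http://www.bigdata.com/rdf/gas#>
PREFIX wd: <http://www.wikidata.org/entity/>
PREFIX wdt: <http://www.wikidata.org/prop/direct/>
SELECT ?item ?depth WHERE {
  SERVICE gas:service {
    gas:program gas:gasClass "com.bigdata.rdf.graph.analytics.BFS" ;  # breadth-first search
                gas:in wd:Q35120 ;                  # start node: "entity"
                gas:traversalDirection "Reverse" ;  # walk incoming edges, i.e. subclasses
                gas:linkType wdt:P279 ;             # follow only subclass-of edges
                gas:out ?item ;                     # visited node
                gas:out1 ?depth ;                   # distance from the start node
                gas:maxIterations 3 .
  }
} LIMIT 50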

-- 
Stas Malyshev
smalys...@wikimedia.org

___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata


Re: [Wikidata] SPARQL service slow?

2016-02-03 Thread Stas Malyshev
Hi!

> is it me or is the SPARQL service very slow right now?

I upgraded it to Blazegraph 2.0 yesterday, and it looks like there was
some glitch there. I've restarted it and now it seems to be fine. I'll
keep watching it and see if the problem repeats.

-- 
Stas Malyshev
smalys...@wikimedia.org

___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata


[Wikidata] Upcoming WDQS upgrade to Blazegraph 2.0

2016-02-23 Thread Stas Malyshev
Hi!

It looks like all the issues that we have had with Blazegraph 2.0 are
now fixed, so I will attempt to upgrade our install to 2.0 again
sometime around noon PDT tomorrow. There should be no visible changes
except for a brief restart of each server, which should not be externally
noticeable (since we have two, the other one would just take over).
However, if something bad happens, there might be a brief disruption of
service.

I'll send a message when it's done, and if you notice anything weird
after the upgrade, please ping me or submit an issue in Phabricator.

Thanks,
-- 
Stas Malyshev
smalys...@wikimedia.org

___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata


Re: [Wikidata] Upcoming WDQS upgrade to Blazegraph 2.0

2016-02-25 Thread Stas Malyshev
Hi!

> So, is there an equivalent for WDQ "AROUND[]" now? :-)

Not yet :) I.e., we could simulate it by accessing longitude/latitude
manually, but in order to use BG 2.0's geospatial index, we need to add
code to specifically encode our coordinate literals in a way compatible
with this index. So there's still work to do, see
https://phabricator.wikimedia.org/T123565
There are some interesting challenges due to the fact that our
coordinates include globes, which is not a very common thing, but we're
working on supporting it.
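
To illustrate the manual simulation mentioned above (a rough, untested
sketch using the "full value" coordinate nodes; the bounding box is an
arbitrary area around Paris, so it's a box, not a true circular AROUND):

SELECT ?item ?lat ?lon WHERE {
  ?item p:P625/psv:P625 ?coord .          # full coordinate value node
  ?coord wikibase:geoLatitude ?lat ;
         wikibase:geoLongitude ?lon .
  FILTER(?lat > 48.5 && ?lat < 49.2 && ?lon > 2.0 && ?lon < 2.7)
} LIMIT 100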
-- 
Stas Malyshev
smalys...@wikimedia.org

___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata


Re: [Wikidata] Littar won second prize

2016-01-20 Thread Stas Malyshev
Hi!

> My 'Littar' (literature radar) website/app won second prize in DBC's
> (Danish Library Center) app competition last week at the Data Science Day.
> 
> Littar displays narrative locations from literary works on a map, -
> presently only Danish locations. Data comes from Wikidata P840 and
> presently colored according to P136 using the Leaflet marker. Text is
> from the P1683 qualifier under P840:
> 
> http://fnielsen.github.io/littar/

Congratulations, very nice project!

-- 
Stas Malyshev
smalys...@wikimedia.org

___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata


Re: [Wikidata] This SPARQL query no longer works, any ideas why?

2016-01-25 Thread Stas Malyshev
Hi!

> Wikidata SPARQL aficionados,
> 
> This SPARQL query worked for several weeks, but quit working a few days
> ago:

No idea what happened, I'll look into it.

-- 
Stas Malyshev
smalys...@wikimedia.org

___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata


Re: [Wikidata] This SPARQL query no longer works, any ideas why?

2016-01-25 Thread Stas Malyshev
Hi!

> Stas,
> 
> I've narrowed it down to the ORDER BY clause.  Changing from:
> 
> ORDER BY ?propUrl ?valUrl
> 
> to:
> 
> ORDER BY ?propLabel ?valLabel

Seems to be caused by the recent fix for T113374, which did not work as
expected. I have rolled back the deployment for now and will investigate
later why it broke.

Thanks for the report!
-- 
Stas Malyshev
smalys...@wikimedia.org

___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata


[Wikidata] SPARQL endpoint caching

2016-02-16 Thread Stas Malyshev
Hi!

With Wikidata Query Service usage rising and more use cases being
found, it is time to consider caching infrastructure for results, since
queries are expensive. One of the questions I would like to solicit
feedback on is the following:

Should the default SPARQL endpoint be cached or uncached? If cached,
which default cache duration would be good for most users? The cache, of
course, applies only to the results of the same (identical) query.
Please also note that the following is not an implementation plan, but
rather an opinion poll; whatever we end up deciding, there will be an
announcement with the actual plan before we do it.

Also, whichever default we choose, there should be a possibility to get
both cached and uncached results. The question is, when you access the
endpoint with no options, which one you get. So the possible variants are:

1. query.wikidata.org/sparql is uncached; to get a cached result you use
something like query.wikidata.org/sparql?cached=120 to get a result no
older than 120 seconds.
PRO: least surprise for default users.
CON: relies on the goodwill of tool writers; if somebody doesn't know
about the cache option and uses the same query heavily, we would have to
ask them to use the parameter.

2. query.wikidata.org/sparql is cached for a short duration (e.g. 1
minute) by default; if you'd like a fresh result, you do something like
query.wikidata.org/sparql?cached=0. If you're fine with an older result,
you can use query.wikidata.org/sparql?cached=3600 and get a cached
result if it's still in the cache, but by default you never get a result
older than 1 minute. This is of course assuming Varnish magic can do
this; if not, the scheme has to be amended.
PRO: performance improvement while keeping default results reasonably fresh.
CON: it is not obvious that the result may not be the freshest data and
can be stale, so if you update something in Wikidata and query again
within a minute, you can be surprised.

3. query.wikidata.org/sparql is cached for a long duration (e.g. hours)
by default; if you'd like a fresher result you do something like
query.wikidata.org/sparql?cache=120 to get a result no older than 2
minutes, or cache=0 if you want an uncached one.
PRO: best performance improvement for most queries; works well with
queries that display data that rarely changes, such as lists, etc.
CON: for people who don't know about the cache option, it may be rather
confusing not to be able to get up-to-date results.

So we'd like to hear - especially from current SPARQL endpoint users -
what you think about these and which would work for you.

Also, for the users of the WDQS GUI - provided we have both cached and
uncached options, which one should the GUI return by default? Should it
always be uncached? Performance there is not the major question - the
traffic to the GUI is pretty low - but rather convenience. Of course, if
you run a cached query from the GUI and the data is in the cache, you
can get results much faster for some queries. OTOH, in many cases it may
be important to be able to access the actual up-to-date content, not the
cached version.

I also created a poll: https://phabricator.wikimedia.org/V8
so please feel free to vote for your favorite option.

OK, this letter is long enough already so I'll stop here and wait to
hear what everybody's thinking.

Thanks in advance,
-- 
Stas Malyshev
smalys...@wikimedia.org

___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata


Re: [Wikidata] SPARQL endpoint caching

2016-02-16 Thread Stas Malyshev
Hi!

> How often does *exactly* the same query get run within 2 minutes ?

It depends on where the query is coming from. E.g. if there's a graph
backed by a query, then a lot of people can be viewing the graph and
running the query. The same goes if somebody publishes a link to some
query, e.g. during a talk or in an article, and a bunch of people come
to look at it. It depends on the use case.
Some use cases - like graphs - we are just planning, so we can't really
rely on statistics here.

> I'd guess there's probably only a very few queries like that though.

Well, maybe - we don't really know yet. That's why I want to hear
opinions on this :)

-- 
Stas Malyshev
smalys...@wikimedia.org

___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata


Re: [Wikidata] SPARQL endpoint caching

2016-02-16 Thread Stas Malyshev
Hi!

> I agree, we should look at some actual traffic to see how many queries
> /could/ be cached in a 2/5/10/60 min window. Maybe remove the example
> queries from those numbers, to separate the "production" and testing
> usage. Also, look at query runtime; if only "cheap" queries would be
> cached, there is no point in caching.

Makes sense, but some of the use cases are not implemented yet, and I'm
kind of scared of allowing them without caching - e.g. graph embedding -
so it's hard to rely on past data.

> Once you run a query, you know both the runtime and the result size.
> Maybe expensive queries with a huge result set could be cached longer by
> default, and cheap/small queries not at all? If you expect your recent
> Wikidata edit to change the results from 3 to 4, you should see that
> ASAP; if the change would be 50.000 to 50.001, it seems less critical
> somehow.

That sounds like a good idea, we'll need to check if Varnish allows us
to do tricks like this...

-- 
Stas Malyshev
smalys...@wikimedia.org

___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata


Re: [Wikidata] Make federated queries possible / was: SPARQL CONSTRUCT results truncated

2016-02-18 Thread Stas Malyshev
Hi!

> Now, obviously endpoints referenced in a federated query via a
> service clause have to be open - so any attacker could send his
> queries directly instead of squeezing them through some other
> endpoint. The only scenario I can think of is that an attackers IP
> already is blocked by the attacked site. If (instead of much more
> common ways to fake an IP) the attacker would choose to do it by
> federated queries through WDQS, this _could_ result in WDQS being
> blocked by this endpoint.

This is not what we are concerned with. What we are concerned with is
that federation essentially requires you to run an open proxy - i.e. to
allow anybody to send requests to any URL. This is not acceptable to us,
because it means somebody could abuse it both to try to access our
internal infrastructure and to launch attacks on other sites using our
site as a platform.
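
For reference, a federated query is just a SERVICE clause pointing at an
arbitrary URL chosen by the query author, e.g. (hypothetical endpoint):

SELECT * WHERE {
  SERVICE <http://example.org/sparql> {   # any URL the author puts here
    ?s ?p ?o
  }
} LIMIT 10

so allowing it means our servers would open connections to whatever URL
appears in the query.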

We could allow access to specific whitelisted endpoints if there is
enough demand, but so far we haven't found any way to allow access to
arbitrary SPARQL endpoints without essentially allowing anybody to
launch arbitrary network connections from our servers.

> provide for the linked data cloud. This must not involve the
> highly-protected production environment, but could be solved by an
> additional unstable/experimental endpoint under another address.

The problem is that we cannot run a production-quality endpoint in a
non-production environment. We could set up an endpoint on Labs, but
that endpoint would be underpowered and we would not be able to
guarantee any quality of service there. To serve the amount of Wikidata
data and updates, the machines need certain hardware capabilities, which
Labs machines currently do not have.

Additionally, I'm not sure running an open proxy even there would be a
good idea. Unfortunately, in the internet environment of today there is
no lack of players who would want to abuse such a thing for nefarious
purposes.

We will keep looking for a solution to this, but so far we haven't found one.

Thanks,
-- 
Stas Malyshev
smalys...@wikimedia.org

___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata


Re: [Wikidata] Undoing merges (bibliographic articles are not humans)

2016-02-18 Thread Stas Malyshev
Hi!

> If I get the question correctly, after undoing changes in the target
> page, you need to go to the redirect page and undo changes made there.

Yes, I tried to do that and it works fine. You may need to click on the
"redirected from" link when going to the redirect page, and then go to
the history and restore the pre-merge version.

-- 
Stas Malyshev
smalys...@wikimedia.org

___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata


Re: [Wikidata] SPARQL endpoint caching

2016-02-19 Thread Stas Malyshev
Hi!

> 
> I'll do a presentation next week, in which I intend to demonstrate
> that I can add a Wikidata value online, which then is available
> immediately for my application - as well as for the whole rest of the
> world. (In Library Land, that's a real blast, because business
> processes related to authority data often take weeks or month ...)

I think we'll always have some way to run an uncached query. The only
question is how easy it would be - i.e. whether you'd need to add a
parameter, click a checkbox, etc.

-- 
Stas Malyshev
smalys...@wikimedia.org

___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata


Re: [Wikidata] SPARQL CONSTRUCT results truncated

2016-02-13 Thread Stas Malyshev
Hi!

> [1] Is the service protected against internet crawlers that find such
> links in the online logs of this email list? It would be a pity if we
> would have to answer this query tens of thousands of times for many
> years to come just to please some spiders who have no use for the result.

That's a very good point. We currently do not have a robots.txt file on
the service. We should have one. I'll fix it ASAP.

GUI links do not run the query until you click, so they are safe from
bots anyway. But direct links to the SPARQL endpoint do run the query
(it's the API, after all :), so robots.txt is needed there.

-- 
Stas Malyshev
smalys...@wikimedia.org

___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata


Re: [Wikidata] SPARQL CONSTRUCT results truncated

2016-02-13 Thread Stas Malyshev
Hi!

> you may want to check out the Linked Data Fragment server in Blazegraph:
> https://github.com/blazegraph/BlazegraphBasedTPFServer

Thanks, I will check it out!

-- 
Stas Malyshev
smalys...@wikimedia.org

___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata


Re: [Wikidata] Status and ETA External ID conversion

2016-03-10 Thread Stas Malyshev
Hi!

> Couldn't you use P460 when there is doubt?
> 
> https://www.wikidata.org/wiki/Property:P460

P460's type is Item, which means it is a relation between two Wikidata
items. External ID is a relation between a Wikidata item and something
outside Wikidata.

-- 
Stas Malyshev
smalys...@wikimedia.org

___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata


Re: [Wikidata] Recognizing deleted resource in the Wikidata SPARQL endpoint

2016-03-18 Thread Stas Malyshev
Hi!

> Is there an alternative to SPARQL that can either check for deleted
> records or give only non-deleted ones? currently, I am checking if
> they are deleted by making a "ping" using HTTP head requests but this
> takes a huge amount of time since I need to check about 70k
> resources.

Well, the SPARQL data store is not supposed to contain any deleted
entries... But it looks like there's some bug there. If you give me the
list of the "bad" entries, it's easy to update them. Considerably harder
is finding *why* they weren't updated in the first place. I'm still
looking into it.

-- 
Stas Malyshev
smalys...@wikimedia.org

___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata


Re: [Wikidata] Wordnet mappings

2016-04-12 Thread Stas Malyshev
Hi!

> Is there a property for WordnetId?

The list of properties is here:
https://www.wikidata.org/wiki/Wikidata:List_of_properties

I don't see anything for Wordnet there.

-- 
Stas Malyshev
smalys...@wikimedia.org

___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata


Re: [Wikidata] Recognizing deleted resource in the Wikidata SPARQL endpoint

2016-03-19 Thread Stas Malyshev
Hi!

> I am currently running an experiment to figure out how many Wikidata
> entries refer to identifiers in our dataset (i.e. using property P727)
> but I am receiving in the results entries that have apparently been
> deleted/deprecated (e.g. http://www.wikidata.org/entity/Q18573617)... is

Could you send me the query and the items you see that are wrong?

> there a way to detect them using SPARQL, perhaps some meta-property or
> some information in a statement, or is it simply because the endpoint is
> not in sync with the main repo. 

Short answer - unfortunately, no. Longer answer in
https://phabricator.wikimedia.org/T128947#2104017


-- 
Stas Malyshev
smalys...@wikimedia.org

___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata


[Wikidata] Fwd: [Ops] Wikidata Query Service (WQDS) regular deployment window

2016-03-23 Thread Stas Malyshev
Hi!

FYI to those whom it may concern - we plan to institute regular WDQS
deployments on Mondays for both code and GUI. Not much is going to
change, except that regular deployments will happen at a predictable
time instead of "whenever I feel like it" :) This does not preclude
emergency deployments in case something is broken, of course, but
hopefully it will introduce more predictability as the service matures.
Guillaume will usually be doing the deployments, with me stepping in if
he cannot.

 Forwarded Message 
Subject: [Ops] Wikidata Query Service (WQDS) regular deployment window
Date: Tue, 22 Mar 2016 22:04:21 +0100
From: Guillaume Lederrey 
To: A public mailing list about Wikimedia Search and Discovery projects
, Operations Engineers


Hello!

After discussion with Stas, we want to have a regular deployment
window for Wikidata Query Service. This should help give better
visibility on when new versions arrive and help track issues with
those new versions. I will take care of the deployments (with Stas'
support, of course).

The deployment window is: every Monday, from 7pm CET (10am PST - 5pm
UTC) starting from Monday April 11th.

Let me know if you have any question or if you know of another place
where I should publicize this deployment window.

  Take care,

Guillaume

-- 
Guillaume Lederrey
Operations Engineer, Discovery
Wikimedia Foundation

___
Ops mailing list
o...@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/ops



___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata


Re: [Wikidata] Fwd: [Ops] Wikidata Query Service (WQDS) regular deployment window

2016-03-24 Thread Stas Malyshev
Hi!

> Could we also have some daily deployed test site, so new features could
> be tried out and tested?

There's a test server at wdq-beta.eqiad.wmflabs, on which I routinely
test stuff. You could use it too. I'm not updating it daily, though -
basically whenever I need to test something.

We could have another instance that is auto-deployed from the deployment
repo daily by scripts, if necessary.
-- 
Stas Malyshev
smalys...@wikimedia.org

___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata


Re: [Wikidata] SPARQL/BlazeGraph: label service performance

2016-03-06 Thread Stas Malyshev
Hi!

> There is a performance issue with the labelling service. Using labels
> makes even simple queries time out. For example this one:
> 
> SELECT $p $pLabel
> WHERE {
>$p wdt:P31 _:bnode .
>SERVICE wikibase:label { bd:serviceParam wikibase:language "en" . }
> } LIMIT 11

I suspect the issue here may be that it tries to calculate the full set
of values before applying the service. That may make sense if the
service is external, but if it is internal and the result set is huge,
it obviously does not work well.

Another alternative, since you are just looking for English labels, is
to use the direct query approach:

SELECT $p $pLabel
WHERE {
   $p wdt:P31 _:bnode .
   OPTIONAL {
   $p rdfs:label $pLabel .
   FILTER(lang($pLabel) = "en")
   }
} LIMIT 11


This seems to work just fine. You lose a bit of added value on the
service (nicer no-label labels) but you gain a lot of speed.

In any case, I'll raise this issue with Blazegraph, and it may also be
worth submitting a Phabricator issue about it.
-- 
Stas Malyshev
smalys...@wikimedia.org

___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata


Re: [Wikidata] Status and ETA External ID conversion

2016-03-06 Thread Stas Malyshev
Hi!

> The community is checking each property to verify it should be converted:
> 
> https://www.wikidata.org/wiki/User:Addshore/Identifiers/0
> 
> https://www.wikidata.org/wiki/User:Addshore/Identifiers/1
> 
> https://www.wikidata.org/wiki/User:Addshore/Identifiers/2

Is there a process documented somewhere for how the checking is done,
what the criteria are, etc.? I've read
https://www.wikidata.org/wiki/User:Addshore/Identifiers but there's a
lot of discussion and it's not clear whether it ever came to a
conclusion. It's also not clear what the process is - should I just move
a property I like to "good to convert"? Should I run it through some
checklist first? Should I ask somebody?
What are the rules for "disputed" - is some review process planned?

I think a more definite statement would help, especially for people
willing to contribute.
-- 
Stas Malyshev
smalys...@wikimedia.org

___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata


Re: [Wikidata] Status and ETA External ID conversion

2016-03-06 Thread Stas Malyshev
Hi!

> In your case, however, the answer probably is: you cannot contribute
> there at all, since you are a Wikimedia employee and this is a
> content-related community discussion. ;-)

Many WMF employees contribute to wikis in their non-work time, as far as
I know. I don't even seek to participate in the discussion (though I
don't think WMF employment would disqualify me from contributing in a
volunteer capacity, given my affiliations - such as they are - are
clearly stated) - but only to know the results, so I could contribute in
an editor capacity, following whatever rules are there.
-- 
Stas Malyshev
smalys...@wikimedia.org

___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata


Re: [Wikidata] Status and ETA External ID conversion

2016-03-07 Thread Stas Malyshev
Hi!

> Yes, sure, your free time is a different matter. I just thought you are
> speaking as a WMF employee here, since you were using this email. I am

It's Sunday here, so no :) I do use two separate logins for WMF official
and volunteer work on Wiki, but using two emails is too cumbersome for me.

> probably over-sensitive there since I am used to the very strict
> policies of WMDE. They are very careful to keep paid and private
> activities separate by using different accounts.

Surely, it is common in WMF too. But again, two email accounts seem
excessive to me. It's usually pretty clear from the context, but if
needed, I will clarify.
-- 
Stas Malyshev
smalys...@wikimedia.org

___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata


[Wikidata] upcoming change in RDF format data

2016-04-03 Thread Stas Malyshev
Hi!

We are committing a patch that implements a change in RDF format output,
specifically how we output coordinates as WKT points.

If you do not use RDF format exports and specifically WKT coordinate
literals there, this change has no effect for you.

When we first implemented it, we chose to make it "Point(latitude
longitude)". Unfortunately, it turns out the standard way in WKT is
Point(longitude latitude), and that's how most of the tools that
implement the WKT format understand it. In general, geo-data formats are
split on this question, see http://www.macwright.org/lonlat/. But WKT is
pretty universally in the lon-lat camp, so we have to follow the
established practice.
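
For example, a point at latitude 52.5 and longitude 13.4 (roughly
Berlin; approximate values, just to illustrate the order) is written as
"Point(52.5 13.4)" in format 0.0.1 and will be written as
"Point(13.4 52.5)" in 0.0.2.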

As such, we are changing the WKT representation and we are bumping the
format version (reported as schema:softwareVersion on RDF dumps/exports)
from 0.0.1 to 0.0.2 so that the tools could adjust properly.

See more details in: https://phabricator.wikimedia.org/T130049

Thanks,
-- 
Stas Malyshev
smalys...@wikimedia.org

___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata


[Wikidata] geospatial search preview

2016-04-03 Thread Stas Malyshev
Hi all!

I would like to present to you a preview of an upcoming Wikidata Query
Service feature, namely geospatial search - on
http://geotest.wmflabs.org/. With this, you will be able to search
things by their geographic coordinates, e.g.:

Objects within 10km of Paris:
http://tinyurl.com/jaxr3bn

Airports within 100km from Berlin:
http://tinyurl.com/gtbmqz3

(try also the map view!)

The purpose of the preview is to let people play with it and collect
feedback on what works, what doesn't and what you'd like to see
added/changed/removed.

This is a preview implementation, so do not use it for anything beyond
admiring the marvels of modern technology :) and providing feedback. The
full release on query.wikidata.org will follow sometime in late
April-early May, hopefully.

What is implemented?
- Coordinate storage & indexing as WKT literals
- Search within radius and within bounding box
- Support for different globes

What is missing but would be added soon?
- Distances as search output and as separate function
- Documentation
- You tell me

Thanks,
-- 
Stas Malyshev
smalys...@wikimedia.org

___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata


Re: [Wikidata] SQID: the new "Wikidata classes and properties browser"

2016-04-20 Thread Stas Malyshev
Hi!

> Nice work! I especially like the ability to filter the properties by
> usage amount here:
> https://tools.wmflabs.org/sqid/#/browse?type=properties This makes it
> super easy to find unused or nearly unused properties for example.

Yes! Also some usage that seems strange - e.g., why use P31 or P279 in a
reference?

-- 
Stas Malyshev
smalys...@wikimedia.org

___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata


Re: [Wikidata] question on claim-filtered search/dump and works on a Wikidata subset search engine

2016-04-28 Thread Stas Malyshev
Hi!

> feature that needs to be snappy. So the alternative approach I have been
> working on to is to get a subset of a Wikidata dump and put it in an
> ElasticSearch instance.

The linked data fragments implementation would probably be useful for
that, and I think it would be a good idea to get one eventually for the
Wikidata Query Service, but not yet. Also, we do have an ElasticSearch
index for Wikidata (that's what drives search on the site), so it would
be possible to integrate it with the Query Service too (there's some
support for it in Blazegraph), but it's still not done. So for now I
think we don't have a ready-made solution yet. You could still try to
prefix-search or regex-search on the query service, but depending on the
query it may be too slow right now.
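
To give a rough idea of what I mean (just a sketch - the claim and the
label prefix here are arbitrary examples), something like:

SELECT ?item ?label WHERE {
  ?item wdt:P31 wd:Q11424 ;            # filter by some claim, e.g. "instance of: film"
        rdfs:label ?label .
  FILTER(LANG(?label) = "en")
  FILTER(STRSTARTS(?label, "Star "))   # prefix match on the label
} LIMIT 20

works, but it has to scan a lot of labels, which is why it can be slow
for broad claims.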

> *Question:
> *What is the best way to get all the entities matching a given claim?
> My answer so far was downloading a dump, then filtering the entities by
> claim,  but are there better/less resource-intensive ways?

Probably not currently without some outside tools. When we get LDF
support, then that may be the way :)

-- 
Stas Malyshev
smalys...@wikimedia.org

___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata


[Wikidata] Query service issue

2016-04-28 Thread Stas Malyshev
Hi!

During the data reload for enabling the geospatial service, we
discovered a problem with the Wikidata dumps
(https://phabricator.wikimedia.org/T133924). The effect of this problem
is that some items are missing from the dump. Dumps starting with
20160418 are affected; previous ones seem to be fine.

The immediate fix for this is to reload the data from a correct dump
(20160411) and re-sync the data since then. Unfortunately, this may
take some time (a day or so for the reload, and another day or so for
the resync), and until then you'll see some missing data on
query.wikidata.org. Please be patient until then.

I apologize for the inconvenience caused, and will continue to research
the cause of the missing data and then fix it. I'll update the ticket
when we have new info.

Thanks,
-- 
Stas Malyshev
smalys...@wikimedia.org

___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata


Re: [Wikidata] Geospatial search for Wikidata Query Service is up

2016-05-23 Thread Stas Malyshev
Hi!

> Very nice! From where is the map-data? Open street map?

Yes, see: https://www.mediawiki.org/wiki/Maps

-- 
Stas Malyshev
smalys...@wikimedia.org

___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata


[Wikidata] Geospatial search distance enabled

2016-05-16 Thread Stas Malyshev
Hi!

The geospatial search in WDQS now includes the ability to sort by
distance. This can be achieved in one of two ways:

1. The circular search now has a clause to fetch the distance from the center:
bd:serviceParam wikibase:distance ?distance .

This places the distance from the center into the ?distance variable for
each found point, which can then be used for sorting.

2. The geof:distance function - this function returns the distance
between any two points. It can be used on any two points, regardless of
whether the search service is used.

In both cases, distances are returned in kilometers; other units are not
currently supported (maybe in the future).
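
Putting it together, a minimal sketch (the center item and radius are
arbitrary here):

SELECT ?place ?location ?distance WHERE {
  wd:Q64 wdt:P625 ?berlinLoc .                     # coordinates of Berlin
  SERVICE wikibase:around {
    ?place wdt:P625 ?location .
    bd:serviceParam wikibase:center ?berlinLoc .
    bd:serviceParam wikibase:radius "10" .         # kilometers
    bd:serviceParam wikibase:distance ?distance .  # distance from the center
  }
}
ORDER BY ?distance
LIMIT 20

The same number could also be computed outside the service with
geof:distance(?berlinLoc, ?location).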

Please tell me if you notice anything wrong or have any
comments/suggestions.

Thanks,
-- 
Stas Malyshev
smalys...@wikimedia.org

___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata


Re: [Wikidata] Wikidata Query Service problems

2016-05-13 Thread Stas Malyshev
Hi!

> I am having problems with the Wikidata Query Service. Sometimes I just
> get an black page at other time I get an interface but without a
> responsive edit field and the message "Data last updated: [connecting]"
> on the green status field. The issue has been going on for some days now.

Probably linked to recent Varnish issues:
https://phabricator.wikimedia.org/T134989

It's being worked on.

> I know there has been recent instabilities, cf.
> https://lists.wikimedia.org/pipermail/wikidata/2016-May/008674.html ,
> but https://phabricator.wikimedia.org/T134238 'Query service fails with
> "Too many open files"' should be resolved as of 9 May. A new one

No, that's a different one, and it has been fixed.

-- 
Stas Malyshev
smalys...@wikimedia.org

___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata


Re: [Wikidata] Wikidata ontology

2016-05-01 Thread Stas Malyshev
Hi!

>  If geo-coordinates use WKT in Wikidata (which I can't see anywhere
> there), does it mean, that the original idea of /{latitude, longitude,
> altitude, precision, globe}/ format was abandoned?

Coordinates are WKT in the RDF output of Wikidata, when represented as a
single literal. That's not the original representation inside Wikibase
(that still has separate elements and can also be seen in the JSON dump)
and not the only RDF representation - there's also a "full value"
representation, described here:
https://www.mediawiki.org/wiki/Wikibase/Indexing/RDF_Dump_Format#Globe_coordinate

This one is harder to index and search for, but it allows you to do
lookups that depend on a specific part of the coordinate (like: which
objects are on Mars, or what is located on the equator?).
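
For example, a query using the full-value form to find items with
coordinates on Mars could look roughly like this (a sketch, untested
here):

SELECT ?item ?lat ?lon WHERE {
  ?item p:P625 ?statement .             # "coordinate location" statement
  ?statement psv:P625 ?value .          # full value node of the statement
  ?value wikibase:geoGlobe wd:Q111 ;    # globe: Mars
         wikibase:geoLatitude ?lat ;
         wikibase:geoLongitude ?lon .
} LIMIT 20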

> Oh, and another thing: when I download "wikidata-properties.nt" from
> WDTK dump files site [1], there is <http://wikidata.org/ontology#> used
> everywhere. So... is the WB ontology somehow translated to WD ontology?

Hmm, not sure about that one, Markus should know more about it.

-- 
Stas Malyshev
smalys...@wikimedia.org

___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata


Re: [Wikidata] Sparql query for "evidence based practice"

2016-05-05 Thread Stas Malyshev
Hi!

> One of the things Reasonator does not do is provide information on
> qualifiers. I am interested particularly in Q2354820
> <http://www.wikidata.org/wiki/Q2354820> and where it is used.

If you mean usage as a qualifier's value, this would probably work:

SELECT * WHERE {
  ?st ?pred wd:Q2354820 .
  ?p wikibase:qualifier ?pred .
} LIMIT 10

but no result is produced, so I assume Q2354820 is not used as a
qualifier value (unless I'm missing something).

-- 
Stas Malyshev
smalys...@wikimedia.org

___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata


Re: [Wikidata] Sparql query for "evidence based practice"

2016-05-05 Thread Stas Malyshev
Hi!

> Hoi,
> Thanks Stas :) This is one example where Q2354820 is used.
> 
> http://tools.wmflabs.org/reasonator/?=24013782

OK, this looks like a new one and the query now returns it.

-- 
Stas Malyshev
smalys...@wikimedia.org

___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata


Re: [Wikidata] Geospatial search for Wikidata Query Service is up

2016-05-10 Thread Stas Malyshev
Hi!

> Are the distances available somehow? I.e., the distances between the
> wikibase:center and the searched locations?

There is a distance function - see
https://www.mediawiki.org/wiki/Wikidata_query_service/User_Manual#Distance_function
- and there will also soon be a way to retrieve distances when searching
around a point - right now they have to be recalculated, and this is a
waste since the search already knows the distance, but the current
version discards it.

There's a slight bug in the current version (see the example above for
the workaround - the hint to disable analytic mode is required in some
situations) - that's why it hasn't been announced yet.

But for many applications geof:distance already works. The rest will be
implemented and announced soon, probably sometime next week or the week
after that.

-- 
Stas Malyshev
smalys...@wikimedia.org

___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata


[Wikidata] Geospatial search for Wikidata Query Service is up

2016-05-09 Thread Stas Malyshev
Hi!

After a number of difficulties and unexpected setbacks[1] I am happy to
announce that geospatial search for Wikidata Query Service is now
deployed and functional. You can now search for items within a certain
radius of a point and within a box defined by two points - more detailed
instructions are in the User Manual[2]. See also query examples[3] such
as "airports within 100km of Berlin": http://tinyurl.com/zxy8o64

There are still a couple of things to complete, namely sorting by
distance (coming soon) and units support (maybe). Overall progress of
the task is tracked by T133566[4].

[1] https://lists.wikimedia.org/pipermail/wikidata/2016-May/008674.html
[2]
https://www.mediawiki.org/wiki/Wikidata_query_service/User_Manual#Geospatial_search
[3]
https://www.mediawiki.org/wiki/Wikibase/Indexing/SPARQL_Query_Examples#Airports_within_100km_of_Berlin
[4] https://phabricator.wikimedia.org/T123565
-- 
Stas Malyshev
smalys...@wikimedia.org

___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata


Re: [Wikidata] SPARQL service timeouts

2016-04-18 Thread Stas Malyshev
Hi!

> I have the impression that some not-so-easy SPARQL queries that used to
> run just below the timeout are now timing out regularly. Has there been
> a change in the setup that may have caused this, or are we maybe seeing
> increased query traffic [1]?

We've recently run on a single server for a couple of days due to the
reloading of the second one, so this may have made it a bit slower. But
that should be over now - we're back to two. Other than that, I'm not
seeing anything abnormal in
https://grafana.wikimedia.org/dashboard/db/wikidata-query-service

> [1] The deadline for the Int. Semantic Web Conf. is coming up, so it
> might be that someone is running experiments on the system to get their
> paper finished. It has been observed for other endpoints that traffic
> increases at such times. This community sometimes is the greatest enemy
> of its own technology ... (I recently had to IP-block an RDF crawler
> from one of my sites after it had ignored robots.txt completely).

We don't have any blocks or throttling mechanisms right now. But if we
see somebody making a serious negative impact on the service, we may
have to change that.

-- 
Stas Malyshev
smalys...@wikimedia.org

___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata


Re: [Wikidata] SPARQL service timeouts

2016-04-19 Thread Stas Malyshev
Hi!

> I'm late to the game, but a quick look into the nginx logs does not
> show all that much. I see a few connection refused, but that should
> translate in an HTTP 502 error, not in a partial answer.
> 
> I'm really not good at reading VCL, but it seems that we do have some
> rules in our Varnish config to cache pages in error. This would make
> sense as pages in error tend to be expensive, so we probably want to
> ensure the same error is capped at a maximum rate.

Note that if a page produced some data and then timed out - which is
possible for Blazegraph queries - then by the time the error happens
part of the response has been sent already, so there's no way to set an
error HTTP code etc. Thus such responses are not distinguishable from
valid replies, at least not without looking at the content.

-- 
Stas Malyshev
smalys...@wikimedia.org

___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata


Re: [Wikidata] Date of birth and name correlations

2016-04-16 Thread Stas Malyshev
Hi!

> Vaguely related: The full list of items "different from" (P1889) another
> item sure does give an interesting read though. *
> http://tinyurl.com/h23q2et

Some pairs there are definitely unexpected. And I'm not sure about these
two:

https://www.wikidata.org/wiki/Q1787424
https://www.wikidata.org/wiki/Q166542

Some days I have a feeling those should be P460... ;)

-- 
Stas Malyshev
smalys...@wikimedia.org

___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata


[Wikidata] Grammatical display of units

2016-07-27 Thread Stas Malyshev
Hi!

Right now, quantities with units are displayed by attaching the unit
name to the number. While it gives the idea of what is going on, it is
somewhat ungrammatical in English (83 kilogram, 185 centimetre, etc.)
[1] and in other languages - e.g. in Russian it's 83 килограмм, 185
сантиметр - instead of the correct "83 килограмма", "185 сантиметров".
For some units, the norms are kind of tricky and fluid (e.g. see [2]),
and they are not even identical across all units in the same language,
but the common theme is that there are grammatical rules on how to do it
and we're ignoring them right now.

I think we do have some means to grammatically display numbers - for
example, the number of references is displayed correctly in English and
Russian. As I understand, it is done by using certain formats in message
strings, and these formats are supported in the code in the Language
classes. So, I wonder if we should maybe have an (optional) property
that defines the same format for units? We could then reuse the same
code to display units in a proper grammatical way.
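To make it a bit more concrete, the value of such a property could look
similar to the PLURAL syntax already used in MediaWiki messages - e.g.
something like "$1 {{PLURAL:$1|kilogram|kilograms}}" for English, or
"$1 {{PLURAL:$1|килограмм|килограмма|килограммов}}" for Russian (this is
just a sketch of the idea, not a proposal for the exact format).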

Alternatively, we could use the short units display [3] - i.e. cm
instead of centimetre - and then plurals are not required. However, this
relies on units having short names, and for some units short names can
be rather obscure, and maybe in some languages short names need
grammatical forms too. Given that we do not link unit names, it would be
rather confusing (btw, why don't we?). Some units may not have short
forms at all.

And the short names do not exactly match the languages - rather, they
usually match the script (i.e. Cyrillic, or Latin, or Hebrew) - and we
may not even have data on which language uses which script, in a useful
form. So using short forms is very tricky.

Any other ideas on this topic? Do we have a ticket tracking this
somewhere? I looked but couldn't find it.

[1]
http://english.stackexchange.com/questions/22082/are-units-in-english-singular-or-plural
[2]
https://ru.wikipedia.org/wiki/%D0%9E%D0%B1%D1%81%D1%83%D0%B6%D0%B4%D0%B5%D0%BD%D0%B8%D0%B5_%D0%92%D0%B8%D0%BA%D0%B8%D0%BF%D0%B5%D0%B4%D0%B8%D0%B8:%D0%9E%D1%84%D0%BE%D1%80%D0%BC%D0%BB%D0%B5%D0%BD%D0%B8%D0%B5_%D1%81%D1%82%D0%B0%D1%82%D0%B5%D0%B9#.D0.A1.D0.BA.D0.BB.D0.BE.D0.BD.D0.B5.D0.BD.D0.B8.D0.B5_.D0.B5.D0.B4.D0.B8.D0.BD.D0.B8.D1.86_.D0.B8.D0.B7.D0.BC.D0.B5.D1.80.D0.B5.D0.BD.D0.B8.D1.8F
[3] https://phabricator.wikimedia.org/T86528
-- 
Stas Malyshev
smalys...@wikimedia.org

___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata


Re: [Wikidata] Grammatical display of units

2016-07-27 Thread Stas Malyshev
Hi!

> Where are the names of those units translated at the moment?

I assume on the Wikidata items for them - those are just labels for
Wikidata items (as units are items).

> If these are MediaWiki messages, grammar rules for them can be added
> fairly easily. If I can see where they are now, I could probably make a
> quite demo patch to show how it can be done.

I don't think we can put grammar rules in labels, that's why I proposed
a special property as an option.

-- 
Stas Malyshev
smalys...@wikimedia.org

___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata


Re: [Wikidata] Grammatical display of units

2016-07-29 Thread Stas Malyshev
Hi!

> You mean the MediaWiki message processing code? This would probably be

Yes, exactly.

> powerful enough for units as well, but it works based on message strings
> that look a bit like MW template calls. Someone has to enter such
> strings for all units (and languages). This would be doable but the
> added power comes at the price of more difficult editing of such message
> strings instead of plain labels.

True. OTOH, we already have non-plain strings in the database - e.g.
math formulae - so that would be another example of such strings. It's
not ideal but would be a start, and maybe we can have some gadgets later
to deal with it :)

>> Oh yes :) Russian is one, but I'm sure there are others.
>>
> 
> Forgive my ignorance; I was not able to read the example you gave there.

Sorry, it's hard to give examples in foreign languages that would be
comprehensible :) The gist of it is that Russian, like many other
inflected languages, changes nouns by grammatical case, and uses
different cases for different numbers of items (i.e. 1, 2 and 5 will use
three different cases). Labels are of course in the singular nominative
case, which is wrong for many numbers.
-- 
Stas Malyshev
smalys...@wikimedia.org

___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata


Re: [Wikidata] Controversy around Wikimania talks

2016-07-31 Thread Stas Malyshev
Hi!

> Ask yourself what it is about.. It is about the Wikimania talks. What
> was done is removing all the Wikimania talks without any discussion.

I wonder whether Wikidata is really the best platform to host Wikimania
talks and information about them. While I have no doubt these are
excellent talks of great interest to the Wiki community, their
notability in the larger world is a more difficult question.
Specifically, would we create an item for every talk at every major
conference (not considering copyright etc. questions now)? We have a lot
of conferences with much wider attendance than Wikimania happening each
year.

Now, Wikimania is of course special - for the Wiki movement. And having
*some* repository for this content and knowledge would be completely
appropriate. However, is that repository Wikidata - which purports to be
a repository of knowledge of general public interest? I am much less
sure of that. Unless we take on the wider mission of accepting data
about talks at any conference of note - which may be possible, but I'm
not sure whether it should be done... If yes, then of course a clear
policy statement to that effect would be helpful - so that people who
are not sure about it, like me, would know what consensus the community
has arrived at.
-- 
Stas Malyshev
smalys...@wikimedia.org

___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata


Re: [Wikidata] An attribute for "famous person"

2016-08-03 Thread Stas Malyshev
Hi!

On 8/2/16 11:36 PM, Jane Darnell wrote:
> Would page props also give me the creation date of the Wikipedia page in
> that specific sitelink? Because this is something I needed when

I don't think so, and I don't think such data should be in Wikidata or
the WDQS database - it's Wikipedia administrative data and should stay
there.

An external service can combine data from these sources, but I don't
think that falls under WDQS's tasks.
-- 
Stas Malyshev
smalys...@wikimedia.org

___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata


Re: [Wikidata] An attribute for "famous person"

2016-08-02 Thread Stas Malyshev
Hi!

> If you think it is best to implement a more general feature that adds
> even more properties, then I am sure nobody will complain, but it sounds
> like more work to me. The number I was asking for is something that you

I don't think it's *much* more work, and I planned to do this work
anyway :) Of course, it may happen that I am wrong about how much work
it is, and then I might reconsider.

> compute the number in a SPARQL query from the RDF. It is a completely
> redundant piece of information. It's only purpose is to make SPARQL
> queries that currently time out fast. In databases, such things are
> called "materialized views".

Speaking of which, Blazegraph does have support for inferring data, but
I don't want to open that particular can of worms just yet.

> This leads to a slightly different perspective than the one you'd have
> in T129046. By adding page props, you want to add "new" information from
> another source, and questions like data modelling etc. come to the fore.
> With a materialized view, you just add some query results back to the
> database for technical reasons that are specific to the database. The
> two motivations might lead to different requirements at some point
> (e.g., if you want to add another materialized query result to the RDF
> you may have to extend page props, which involves more dependencies than
> if you just extend the RDF converter).

While in theory this is true, we don't have any process that allows us
to do literal materialized views on the current platform (there are
named queries, but that's not the same, I think). Inference "kind of"
might be that, but doing it that way would probably be very inefficient
for this particular case. There are of course other ways to achieve the
same thing, so I'll look into various options, but so far page props
doesn't sound like that bad an idea to me.

-- 
Stas Malyshev
smalys...@wikimedia.org

___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata


Re: [Wikidata] Grammatical display of units

2016-08-01 Thread Stas Malyshev
Hi!

> Well, I think we could sidestep the grammar issue by using unit symbols. We

True, but what unit symbol does "apple" have? It's actually used as a
measure of height (bonus points if you can guess on which item :). Even
if we don't go that far: while SI units probably all have short names,
for non-SI units, especially older and rarer ones, it may very well not
be the case.

Another tricky part is that short names are not connected to languages
right now. I.e. if your interface language is Serbian, which short name
do we use? What if it's Farsi? We'd need to change how we relate units &
unit symbols then.

> An alternative is to use MediaWiki i18n messages instead of entity labels. 
> E.g.
> if the unit is Q11573, we could check if MediaWiki:wikibase-unit-Q11573 
> exists,
> and if it does, use it. We'd get internationalization including support for
> plurals for free.

That may work, but the downside of this is that it is linked to the unit
ID - so if we wanted to use it for, say, Commons data, we'd have to
somehow link "metre" on Wikidata with "metre" on Commons.

-- 
Stas Malyshev
smalys...@wikimedia.org

___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata


Re: [Wikidata] (Ab)use of "deprecated"

2016-08-11 Thread Stas Malyshev
Hi!

> I would argue that this is better done by using qualifiers (e.g. start
> data, end data).  If a statement on the population size would be set to
> preferred, but isn't monitored for quite some time, it can be difficult
> to see if the "preferred" statement is still accurate, whereas a
> qualifier would give a better indication that that stament might need an
> update. 

Right now this bot:
https://www.wikidata.org/wiki/User:PreferentialBot
watches statements like "population" that have multiple values with
different time qualifiers but no current preference.

What it doesn't currently do is verify that the preferred value refers
to the latest date. It probably shouldn't fix such cases automatically
(because there may be a valid reason why the latest is not the best,
e.g. some population estimates are more precise than others), but it
could alert about them. This can be added if needed.
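
Roughly, the cases it looks for are the ones matched by a query like
this (just a sketch of the idea, not how the bot is actually
implemented):

SELECT DISTINCT ?item WHERE {
  ?item p:P1082 ?st1, ?st2 .     # two population statements...
  ?st1 pq:P585 ?t1 .             # ...with different "point in time" qualifiers
  ?st2 pq:P585 ?t2 .
  FILTER(?t1 != ?t2)
  FILTER NOT EXISTS { ?item p:P1082/wikibase:rank wikibase:PreferredRank }
} LIMIT 20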

-- 
Stas Malyshev
smalys...@wikimedia.org

___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata


Re: [Wikidata] Breaking change in JSON serialization?

2016-08-11 Thread Stas Malyshev
Hi!

> My view is that this tool should be extremely cautious when it sees new data
> structures or fields.  The tool should certainly not continue to output
> facts without some indication that something is suspect, and preferably
> should refuse to produce output under these circumstances.

I don't think I agree. I find tools that are too picky about details
that are not important to me hard to use, and I'd very much prefer a
tool where I am in control of which information I need and which I don't
need.

> What can happen if the tool instead continues to operate without complaint
> when new data structures are seen?  Consider what would happen if the tool
> was written for a version of Wikidata that didn't have rank, i.e., claim
> objects did not have a rank name/value pair.  If ranks were then added,
> consumers of the output of the tool would have no way of distinguishing
> deprecated information from other information.

Ranks are a bit unusual because they are not just an informational
change, they are a semantic change. They introduce the concept of a
statement that has different semantics than the rest. Of course, such a
change needs to be communicated - it's as if I made a format change like
"each string beginning with the letter X needs to be read backwards" but
didn't tell the clients. Of course that is a breaking change, since it
changes semantics.

What I was talking about are changes that don't break semantics, and the
majority of additions are just that.

> Of course this is an extreme case.  Most changes to the Wikidata JSON dump
> format will not cause such severe problems.  However, given the current
> situation with how the Wikidata JSON dump format can change, the tool cannot
> determine whether any particular change will affect the meaning of what it
> produces.  Under these circumstances it is dangerous for a tool that
> extracts information from the Wikidata JSON dump to continue to produce
> output when it sees new data structures.

The tool cannot. It's not possible to write a tool that would derive
semantics just from the JSON dump, or even detect semantic changes. A
semantic change can be anywhere; it doesn't have to be an additional
field - it can take the form of changing the meaning of a field, or its
format, or its datatype, etc. Of course the tool cannot know that -
people should know that and communicate it. Again, that's why I think we
need to distinguish between changes that break semantics and changes
that don't, and make the tools robust against the latter - but not the
former, because that's impossible. For dealing with the former, there is
a known and widely used solution - format versioning.

> This does make consuming tools sensitive to changes to the Wikidata JSON
> dump format that are "non-breaking".  To overcome this problem there should
> be a way for tools to distinguish changes to the Wikidata JSON dump format
> that do not change the meaning of existing constructs in the dump from those
> that can.  Consuming tools can then continue to function without problems
> for the former kind of change.

As I said, format versioning. Maybe even semver or some suitable
modification of it. RDF exports BTW already carry a version. Maybe JSON
exports should too.
-- 
Stas Malyshev
smalys...@wikimedia.org

___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata

