Re: [Wikidata] Apologies
Hi! This query shows all things owned by the National Trust, with English label, description, and coordinates (where available): http://tinyurl.com/ntpx8qf Click "Execute" to run it. There are ways to get this result in different formats (not embedded in HTML), but I can't find them right now. If you just want the results and not the GUI, you can ask the endpoint directly: http://tinyurl.com/ph3wn4m (this is https://wdqs-beta.wmflabs.org/bigdata/namespace/wdq/sparql?query= followed by the SPARQL, URL-encoded). If you want it in JSON, you'll need to add the header Accept: application/sparql-results+json (not easy to do from a browser, unfortunately, unless you use a tool like Postman in Chrome) - otherwise you'll get the default XML. That's the endpoint the GUI uses; the GUI then parses the results and presents them in a more human-friendly form. The other one still worked for me at 30k, while the WMF experimental endpoint currently times out even at 10k -- the service is running on a virtual machine that is not very powerful right now; this will change soon. Yes, this service has a 30-second cap currently, so if the query takes longer, sorry :) The cap will of course be raised significantly (and performance will also be better) once we get it to production. -- Stas Malyshev smalys...@wikimedia.org ___ Wikidata mailing list Wikidata@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikidata
Re: [Wikidata] Update frequency on the Wikidata Query API
Hi! I have seen that SPARQL query service and it indeed is an interesting alternative. In terms of stability and update frequency how different is the SPARQL query service from the Wikidata Query API? In terms of stability: it's beta, so while we try to keep it up and running smoothly, it is not out of the question that it can be taken down at any moment, either because we found a bug or because we need to update something, and the data model can change too. We do not expect substantial changes in the data model anymore, and we try to keep it up and running (it doesn't help that we are in the middle of a large labs outage right now: https://wikitech.wikimedia.org/wiki/Incident_documentation/20150617-LabsNFSOutage ) and synced continuously (i.e. no more than minutes behind Wikidata edits), but as long as it's beta we can give no guarantees on anything. We're working hard to make it production-quality, but that will take a bit more time. The difference between WDQ and WDQS/SPARQL is that SPARQL is a full-featured language for querying triple-based (RDF) data sets, and allows very complex queries. It is also a standard in the linked data world. You can use the translator (http://tools.wmflabs.org/wdq2sparql/w2s.php) - once the labs outage ends, of course - to convert between WDQ syntax and SPARQL. Also check out the other links on the WDQS beta page for short intros about how things are done with SPARQL and examples of which queries you can run. -- Stas Malyshev smalys...@wikimedia.org ___ Wikidata mailing list Wikidata@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikidata
Re: [Wikidata] Update frequency on the Wikidata Query API
Hi! How often is the WDQ api really being updated? Is it possible to query wikidata live, with WDQ and if not, are there alternatives that would allow this? We currently have a SPARQL query service in beta[1], which is updated constantly from Wikidata. Note that since it's beta it is not yet stable, either operationally or data-model-wise, so please be aware of this; it also has timeout limits that, for now, won't allow you to run queries that are too complex. But if you want to check it out and see if it fits your use case, you are most welcome. [1] http://wdqs-beta.wmflabs.org/ -- Stas Malyshev smalys...@wikimedia.org ___ Wikidata mailing list Wikidata@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikidata
Re: [Wikidata] accessing data from a wikidata concept page
Hi! The links would point to the standard export URLs: * https://www.wikidata.org/wiki/Special:EntityData/Q423111.json * https://www.wikidata.org/wiki/Special:EntityData/Q423111.rdf Speaking of these, shouldn't we also have link rel=alternate entries for the export formats in the page header? -- Stas Malyshev smalys...@wikimedia.org ___ Wikidata mailing list Wikidata@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikidata
Re: [Wikidata] Lists of things as entities in Wikidata
Hi! In Freebase, we had bot scripts that went through and removed Lists of Things topic entities since they are lists of entities and not useful clumped together and normalized in a graph database. Why delete them? Wikidata has a number of things which are not your standard entity - lists, sources, news, quotes, service entries, narrative articles (e.g. https://en.wikipedia.org/wiki/Control_of_fire_by_early_humans - it's not exactly an entity like human or fire), etc. So I don't think an approach that singles out and excludes lists would help much - an application that needs individual entities like Douglas Adams or London and excludes other types will have to exclude much more than just lists - but I think the approach of asking for exactly what you need and ignoring the rest may prove more efficient. I'm not sure there are really well-defined criteria to specify what an individual entity actually is - I'm sure you have one that matches your application, but some other application may have a completely different one. Generally, this can be solved by better classification, I think, but so far I'm not sure what to base this classification on. -- Stas Malyshev smalys...@wikimedia.org ___ Wikidata mailing list Wikidata@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikidata
Re: [Wikidata] Orphaned items
Hi! As a practical suggestion for helping: http://tools.wmflabs.org/wikidata-todo/random_item_without_instance.php I would also suggest http://tools.wmflabs.org/wikidata-todo/important_blank_items.php which lists the most-linked items from wikis that have no connection to other items whatsoever. Some of them are tough to classify or link to anything, but some are rather obvious. -- Stas Malyshev smalys...@wikimedia.org ___ Wikidata mailing list Wikidata@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikidata
Re: [Wikidata] Orphaned items
Hi! If an item has no statements, no sitelinks, and isn't used anywhere, how do you tell what it even *is*? The label only? Is that sufficient and/or useful? What would be lost by deleting it? Maybe, if it has labels in many languages, with ... Unless its purpose is obvious (i.e. the label/description/talk page describes it clearly) I'd say it might be more dangerous to keep it around: if some people start to use it with different meanings, and then people add independent articles on a Wiki which produce different items with the same meaning, in multiple languages, pretty soon we'd have quite a mess on our hands. An empty item by itself with no links, no good labels and no data or almost no data (like "John Smith, human" and that's it) is not worth much, IMHO. Yes, I don't have good formal criteria for "obvious", so I imagine we'd have to take it on a case-by-case basis or maybe think about some. -- Stas Malyshev smalys...@wikimedia.org ___ Wikidata mailing list Wikidata@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikidata
Re: [Wikidata] Announcing Wikidata Taxonomy Browser (beta)
Hi! > It should be a basic requirement of any SPARQL engine that it should be > able to handle path queries that contain cycles. So I did some simple checks, and on simple examples Blazegraph handles cycles just fine. However, on more complex queries, the cycles seem to be causing trouble. I don't know yet why, I'll look at it further, probably next week. So the problem is not "handling cycles" in general, it is handling some specific data set, and most probably is a consequence of some bug. I'll report when I have more data about what exactly triggers the bug. -- Stas Malyshev smalys...@wikimedia.org ___ Wikidata mailing list Wikidata@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikidata
Re: [Wikidata] birthday present: improved query.wikidata.org
Hi! > Would it be possible to have the results optionally in another format? > When it is possible to have the results in a format where it can be > picked up by one of Magnus's tools it would be actually useful. For best > results it needs to be configurable; not everyone wants or needs the > same tooling. We have export formats - CSV, TSV, JSON, etc. (look for the "Download results" link on the right side). If you would like to see any other format that is not supported now, please ask (best in the form of a Phabricator ticket, but writing on the feedback page or mail works too). -- Stas Malyshev smalys...@wikimedia.org ___ Wikidata mailing list Wikidata@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikidata
Re: [Wikidata] Announcing Wikidata Taxonomy Browser (beta)
Hi! > least one Wikipedia) are considered to refer to equivalent classes on > Wikidata, which could be expressed by a small subclass-of cycle. For We can do it, but I'd rather we didn't. The reason is that it would require any engine that queries such data (e.g. a SPARQL engine) to be comfortable with cycles in property paths (especially ones with + and *), and not every one is (Blazegraph, for example, looks like it does not handle them out of the box). It can be dealt with, I assume, but why create trouble for ourselves? > We also have/had cycles involving instance-of, which is definitely an > error. ;-) Right. So I think we need to mark properties that should not form cycles with https://www.wikidata.org/wiki/Q18647519 (asymmetric property) and have constraint-checking scripts/bots find such cases and alert about them. -- Stas Malyshev smalys...@wikimedia.org ___ Wikidata mailing list Wikidata@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikidata
Re: [Wikidata] how to map other identifiers to Wikidata entity IDs
Hi! > I found you can do it one-by-one in Wikidata Query [3] and in Wikidata > Query Service [4] but neither seems amenable to doing a query on the fly > "Get me the Wikidata item for each of these 100 ISBNs "2-7071-1620-3", ... At least in SPARQL, this would be easy to do:
  PREFIX wdt: <http://www.wikidata.org/prop/direct/>
  SELECT ?book ?isbn WHERE {
    VALUES ?isbn { "2-7071-1620-3" "2-7071-1620-4" "2-7071-1620-5" ... }
    ?book wdt:P957 ?isbn
  }
Unless I misunderstand what you mean here. -- Stas Malyshev smalys...@wikimedia.org ___ Wikidata mailing list Wikidata@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikidata
Re: [Wikidata] Use of Sparql service is going through the roof
Hi! > Does anyone know what's going on with the Sparql service ? > > Up until a couple of days ago, the most hits ever in one day was about > 6000. > > But according to > http://searchdata.wmflabs.org/wdqs/ > > two days ago suddenly there were 6.77 *million* requests, and yesterday > over 21 million. > > Does anyone know what sort of requests these are, and whether they are > all coming from the same place ? Looks like yes, they are coming from the same place, and in that place there seems to be a bot doing something wrong. So if anybody knows whose bot it is, please ask that person to seek advice and guidance (which I would be glad to provide) on how to make it work properly :) -- Stas Malyshev smalys...@wikimedia.org ___ Wikidata mailing list Wikidata@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikidata
Re: [Wikidata] Use of Sparql service is going through the roof
Hi! > Might this be affecting our searches? The following query times out very > quickly on Chrome, and runs forever in Firefox before crashing the whole > browser (or is there a problem with my query?) The symptoms you describe seem to suggest you have too many results for this query and the browser runs out of memory. Try this query with LIMIT 10 first and see what happens. As for the bot activities affecting other users, the effect seems to be negligible, so if this query is slow, it is slow on its own merits :) -- Stas Malyshev smalys...@wikimedia.org ___ Wikidata mailing list Wikidata@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikidata
Re: [Wikidata] diseases as classes in Wikidata - was Re: An Ambitious Wikidata Tutorial
Hi! > Similarly the "Diary of Anne Frank" is an instance of a memoir or a > literary work but is a subclass of book (because there are lots of > physical books with that name). Literary works have authors and > publishers. Books have numbers of pages and printers and physical locations. I'm not sure I understand this. What is the difference between "instance of memoir" and "subclass of book"? You could argue with literally the same words that it is also a "subclass of memoir", and again, since very rarely is any specific physical book notable enough (maybe excluding things like the first Gutenberg Bible, etc.) we would have virtually no instances of book at all. I do not think people think that way - if you ask somebody whether the "Diary of Anne Frank" is an example of a book or a class, I think most people would say it's an example of a book and not a class. Unless we plan to seek out and record every printed physical copy of that book, I don't see any practical reason to describe it as a class. This class - and hundreds of thousands of other book titles, maybe with rare exceptions like the Gutenberg Bible, etc. - would never have any instances. So my question is - what is the use of modeling something as a class if there won't ever be any instances of the class modeled? -- Stas Malyshev smalys...@wikimedia.org ___ Wikidata mailing list Wikidata@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikidata
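For illustration, the modeling choice mostly shows up on the query side. A minimal sketch (assuming the usual Wikidata properties P31 "instance of" and P279 "subclass of", and Q571 "book"; not a query from the thread) that finds works modeled as instances of book or of any subclass of book:
  PREFIX wd: <http://www.wikidata.org/entity/>
  PREFIX wdt: <http://www.wikidata.org/prop/direct/>
  SELECT ?work WHERE {
    # items that are instances of book, or of any subclass of book
    ?work wdt:P31/wdt:P279* wd:Q571 .
  } LIMIT 100
A query expecting titles to be subclasses would instead use ?work wdt:P279* wd:Q571, and the two patterns return different sets - which is the practical consequence of the modeling debate above.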
Re: [Wikidata] WDQS updates have stopped
Hi! >> Usually the data gets updated every minute or two, but it's over 11 hours >> now. > > My best guess looking at things right now is that SuccuBot is making a > huge number of edits and the updater for the query service might not > be able to handle that yet. Stas: Could you have a look? Yes, looks like there's a large volume of updates, so the service is several hours behind, but it seems to be catching up now. What is the bot doing? -- Stas Malyshev smalys...@wikimedia.org ___ Wikidata mailing list Wikidata@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikidata
Re: [Wikidata] WDQS updates have stopped
Hi! >> Yes, looks like there's a large volume of updates, so the service is >> several hours behind, but it seems to be catching up now. What is the bot >> doing? > > https://www.wikidata.org/wiki/Special:Contributions/SuccuBot The last set of edits seems suspect to me - e.g. adding copies of the English label to a bunch of species as the Russian label without the item even having a ruwiki entry. I'm not sure it's a good thing, but yes, that would generate quite a big load on updating, especially since it seems to be adding hundreds (if not thousands) of labels per minute. I've also added a note on Succu's talk page to discuss it. -- Stas Malyshev smalys...@wikimedia.org ___ Wikidata mailing list Wikidata@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikidata
Re: [Wikidata] First version for units is ready for testing!
Hi! > We've finally done all the groundwork for unit support. I'd love for > you to give the first version a try on the test system here: > http://wikidata.beta.wmflabs.org/wiki/Q23950 This is awesome, congratulations to the Wikidata team on this milestone! > There are a few known issues still but since this is one of the things > holding back Wikidata I made the call to release now and work on these > remaining things after that. What I know is still missing: Looking at it, I also notice one has to go into edit mode to get from the unit label to its entity. I wonder if it's possible to make it easier in the UI to get to the entity of the unit and see its URL. Also, is there any work/planning around marking standard units with specific properties and establishing classes (i.e. measures of length, weight, etc.) which can be considered convertible? If we want to be able to run queries against quantities with units (and I think we do, don't we?) then we would need to figure out the common basis at least for common units. I wonder if it's tracked somewhere? -- Stas Malyshev smalys...@wikimedia.org ___ Wikidata mailing list Wikidata@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikidata
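On the query side, a hedged sketch of what such unit-aware queries could look like, assuming unit data eventually reaches the query service with quantity values exposed as full value nodes carrying wikibase:quantityAmount and wikibase:quantityUnit (as described in the RDF dump format; the ontology namespace in use at the time may still be the -beta one), e.g. counting which units are actually used:
  PREFIX wikibase: <http://wikiba.se/ontology#>
  SELECT ?unit (COUNT(*) AS ?uses) WHERE {
    # any quantity value node with an amount and a unit entity
    ?value wikibase:quantityAmount ?amount ;
           wikibase:quantityUnit ?unit .
  }
  GROUP BY ?unit
  ORDER BY DESC(?uses)
  LIMIT 20
Grouping quantities by unit like this is roughly what any "common basis" for convertible units would have to build on.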
Re: [Wikidata] First version for units is ready for testing!
Hi! > Also, I don't see a reason why the JSON encoding should use an IRI It probably doesn't have to, just the Q-id would be enough. "1" is OK too, but a bit confusing - if the rest are Q-ids, then it makes sense to make all of them Q-ids. Another option would be to make it just null or something special like that. > string does not seem to help anyone. I would suggest keeping the "1" as > a marker for "no unit". Of course, this "1" would never be shown in the It is possible, but "1" then looks like a "magic value", which is usually bad design since one needs to check for it all the time. It would be nicer if there could be a way to avoid it. > Wikibase and elsewhere. If we create a special IRI for denoting this > situation, it will be better distinguished from other (regular) units, > and there will be no dependency on the current content of Wikidata's Q199. We already have such dependencies - e.g. in calendars and globes - so it won't be anything new. But let's see what the Wikidata team thinks about it :) -- Stas Malyshev smalys...@wikimedia.org ___ Wikidata mailing list Wikidata@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikidata
Re: [Wikidata] Announcing the release of the Wikidata Query Service
Hi! > As suggested somewhere far above, it would be great for the community to > catalogue the queries that are most important for their use cases that > do not do well on the SPARQL endpoint. Its likely that the list isn't > going to be super-long (in terms of query structure), hence it might > make sense to establish dedicated, optimized web services (that exist > apart from the endpoint) to call upon when those kinds of queries need > to be executed. Good idea. As a preliminary point, I think the same basics as with most other query engines (SQL, etc.) apply: - non-restrictive queries with tons of results will be slow - i.e. "list of all humans" is probably not a good question to ask :) - negative searches are usually slower - e.g. "all humans without images" will be slow, since that query would have to inspect the records for every human - unbound paths/traversals will usually be slower (unfortunately many queries that use TREE in WDQ are of this kind), especially if there are a lot of starting points for the traversal (again, "all humans that", etc...) It is also a good idea to put a LIMIT on queries when experimenting, i.e. if you intended to write a query that asks for 10 records but accidentally wrote one that returns 10 million, it's much nicer to discover it with a suitable limit than waiting for the query to time out and then trying to figure out why it happened. Yes, I realize all this has to go to some page in the manual eventually :) -- Stas Malyshev smalys...@wikimedia.org ___ Wikidata mailing list Wikidata@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikidata
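To make the "restrictive query plus LIMIT" advice concrete, a minimal sketch (assuming P31 "instance of", P106 "occupation", Q5 "human", and Q169470 "physicist"; not a query from the thread) that narrows the result set instead of scanning all humans:
  PREFIX wd: <http://www.wikidata.org/entity/>
  PREFIX wdt: <http://www.wikidata.org/prop/direct/>
  SELECT ?person WHERE {
    ?person wdt:P31 wd:Q5 ;        # instance of: human
            wdt:P106 wd:Q169470 .  # occupation: physicist - the restrictive part
  } LIMIT 100
Dropping the occupation triple and the LIMIT would turn this into exactly the kind of open-ended "all humans" query the list above warns about.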
Re: [Wikidata] Announcing the release of the Wikidata Query Service
Hi! > Yes it is the continuation of the beta on labs. > Stas: Do you want to turn that into a redirect now? Not sure yet what to do with it. I want to keep the labs setup for continued development work, especially when potentially breaking things, but we do want to redirect most people now to the main endpoint as it is much better at handling the load. -- Stas Malyshev smalys...@wikimedia.org ___ Wikidata mailing list Wikidata@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikidata
Re: [Wikidata] Announcing the release of the Wikidata Query Service
Hi! > Before turning that into a redirect, it might be worth looking at > the content. There seems to be some discrepancy between the results > from http://wdqs-beta.wmflabs.org/ and those from > http://query.wikidata.org. When submitted to both endpoints the > following query returns different results: These are fed from the same data and are supposed to run the same code, so I wonder how there came to be a difference... The query actually shows there are two values for Q181391 on P699 - 987 and DOID:987. I'll investigate that. It may be some kind of bug. Thanks for bringing it to my attention. -- Stas Malyshev smalys...@wikimedia.org ___ Wikidata mailing list Wikidata@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikidata
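A quick way to see the two values mentioned above on either endpoint (a sketch; P699 is the Disease Ontology ID property):
  PREFIX wd: <http://www.wikidata.org/entity/>
  PREFIX wdt: <http://www.wikidata.org/prop/direct/>
  SELECT ?doid WHERE {
    # should list both "987" and "DOID:987" if the stale value is present
    wd:Q181391 wdt:P699 ?doid .
  }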
Re: [Wikidata] Announcing the release of the Wikidata Query Service
Hi! > Anyone an idea why this query has trouble when I add the OPTIONAL keyword? > > *http://tinyurl.com/pgsujp2* > > Doesn't look much harder than the queries in the examples. It's not because it's harder. It's because ?head can be unbound, and you cannot apply the label service to unbound variables. If you drop ?headLabel then it works. It is a downside of the label service, and I'm not sure yet how to fix it (feel free to submit a Phabricator issue, maybe I or somebody else will have an idea later). -- Stas Malyshev smalys...@wikimedia.org ___ Wikidata mailing list Wikidata@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikidata
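For reference, the working shape of such a query - an analogous sketch, not the original tinyurl query (assumptions: P31 "instance of", Q6256 "country", P35 "head of state") - asks for labels only on variables that are always bound:
  PREFIX wd: <http://www.wikidata.org/entity/>
  PREFIX wdt: <http://www.wikidata.org/prop/direct/>
  PREFIX wikibase: <http://wikiba.se/ontology#>
  PREFIX bd: <http://www.bigdata.com/rdf#>
  SELECT ?country ?countryLabel ?head WHERE {
    ?country wdt:P31 wd:Q6256 .            # instance of: country
    OPTIONAL { ?country wdt:P35 ?head . }  # head of state may be missing...
    # ...so only ?countryLabel is requested; asking for ?headLabel would hit
    # the unbound-variable problem described above
    SERVICE wikibase:label { bd:serviceParam wikibase:language "en" . }
  }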
Re: [Wikidata] Announcing the release of the Wikidata Query Service
Hi! > It seems the implementation of the label service needs some improvement > to support unbound variables (which should then return unbound labels, > rather than throw a runtime exception ;-). Not sure whether this is possible (need to research it) but if it is, yes, that probably would be the way to fix it. -- Stas Malyshev smalys...@wikimedia.org ___ Wikidata mailing list Wikidata@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikidata
Re: [Wikidata] Source statistics
Hi! > A small fix though: I think you should better use count(?statement) > rather than count(?ref), right? Yes, of course, my mistake - I modified it from a different query and forgot to change it. > I have tried a similar query on the public test endpoint on labs > earlier, but it timed out for me (I was using a very common reference > though ;-). For rarer references, live queries are definitely the better > approach. Works for me for Q216047, didn't check others though. For popular references, the labs one may indeed be too slow. A faster one is coming "real soon now" :) -- Stas Malyshev smalys...@wikimedia.org ___ Wikidata mailing list Wikidata@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikidata
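The corrected shape of such a source-statistics query - counting statements rather than reference nodes - would be roughly as follows (a sketch, assuming references are attached via prov:wasDerivedFrom and cite the source with the "stated in" reference property pr:P248; Q216047 reused from the example above):
  PREFIX wd: <http://www.wikidata.org/entity/>
  PREFIX pr: <http://www.wikidata.org/prop/reference/>
  PREFIX prov: <http://www.w3.org/ns/prov#>
  SELECT (COUNT(?statement) AS ?count) WHERE {
    ?statement prov:wasDerivedFrom ?ref .  # statement carries a reference node...
    ?ref pr:P248 wd:Q216047 .              # ...which cites the given source via "stated in"
  }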
Re: [Wikidata] Importing Freebase (Was: next Wikidata office hour)
Hi! > I see that 19.6k statements have been approved through the tool, and > 5.1k statements have been rejected - which means that about 1 in 5 > statements is deemed unsuitable by the users of primary sources. From my (limited) experience with Primary Sources, there are several kinds of things there that I have rejected: - Unsourced statements that contradict what is written in Wikidata - Duplicate claims already existing in Wikidata - Duplicate claims with worse data (i.e. less accurate location, less specific categorization, etc.) or unnecessary qualifiers (such as adding information which is already contained in the item to the item's qualifiers - e.g. a zip code for a building) - Source references that do not exist (404, etc.) - Source references that do exist but either duplicate an existing one (a number of sources just refer to a different URL of the same data) or do not contain the information they should (e.g. a link to a newspaper's homepage instead of the specific article) - Claims that are almost obviously invalid (e.g. "United Kingdom" as the genre of a play) I think at least some of these - esp. references that do not exist and duplicates with no refs - could be removed automatically, thus raising the relative quality of the remaining items. OTOH, some of the entries could be made self-evident - i.e. if we are talking about a movie and Freebase has an IMDB ID or Netflix ID, it may be quite easy to check whether that ID is valid and refers to a movie of the same name, which should be enough to merge it. Not sure if those one-off things are worth bothering with, just putting it out there to consider. -- Stas Malyshev smalys...@wikimedia.org ___ Wikidata mailing list Wikidata@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikidata
Re: [Wikidata] Why do these two SPARQL queries take such different times to run?
Hi! > here's a query to find multiple humans with nationality:Greece that have > the same day of birth and day of death: > http://tinyurl.com/ow6lpen > It produces one pair, and executes in about 0.6 seconds. > > Here's a query to try to add item numbers and labels to the previous > search: > http://tinyurl.com/ovjwzc9 > > It *just* completes, taking just over 60 seconds to execute. It looks like some issue with nested queries in Blazegraph; I've sent a report to them and will see what they say. > Obviously the second query as written at the moment involves a > sub-query, which inevitably must make it a bit slower -- but given the > solution set of the sub-query only has two rows, and an exact date for a > given property ought to be a fairly quick key to look up, why is the > second query taking 100 times longer than the first ? Yes, in theory it should be fast, so I suspect some kind of bug. -- Stas Malyshev smalys...@wikimedia.org ___ Wikidata mailing list Wikidata@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikidata
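For context, the fast inner query discussed above was essentially of this shape (a reconstruction, since the original is behind a tinyurl; assumptions: P27 "country of citizenship", P569 "date of birth", P570 "date of death", Q41 "Greece"):
  PREFIX wd: <http://www.wikidata.org/entity/>
  PREFIX wdt: <http://www.wikidata.org/prop/direct/>
  SELECT ?dob ?dod WHERE {
    ?p1 wdt:P27 wd:Q41 ; wdt:P569 ?dob ; wdt:P570 ?dod .
    ?p2 wdt:P27 wd:Q41 ; wdt:P569 ?dob ; wdt:P570 ?dod .
    FILTER(?p1 != ?p2)   # two distinct people sharing both dates
  }
The slow variant wraps roughly this as a subquery and joins the shared dates back to fetch the item IDs and labels, which is the nesting that appears to trigger the Blazegraph issue.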
Re: [Wikidata] Duplicate identifiers (redirects & non-redirects)
Hi! > It seems like the constraint checker could check for either only one > "Preferred" or all but one "Deprecated" which would allow editors to > evolve in whichever way they wanted. It should probably consider "best rank" ones - i.e. if Preferred values exist, then the Preferred ones; otherwise the Normal ones, but never the Deprecated ones. -- Stas Malyshev smalys...@wikimedia.org ___ Wikidata mailing list Wikidata@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikidata
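In query terms, such rank-aware checks can be sketched with the statement-level predicates and wikibase:rank from the RDF mapping (the truthy wdt: predicates already apply best-rank selection for you); P227 is used here only as an example identifier property:
  PREFIX p: <http://www.wikidata.org/prop/>
  PREFIX ps: <http://www.wikidata.org/prop/statement/>
  PREFIX wikibase: <http://wikiba.se/ontology#>
  SELECT ?item ?id WHERE {
    ?item p:P227 ?st .                          # every statement for the identifier property
    ?st ps:P227 ?id ;
        wikibase:rank ?rank .
    FILTER(?rank != wikibase:DeprecatedRank)    # never count deprecated values
    # a constraint checker would additionally prefer PreferredRank values when any exist
  } LIMIT 100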
Re: [Wikidata] Using the SPARQL endpoint of WIkidata outside its GUI
Hi! > With the release of http://query.wikidata.org, this long address doesn't > seem to work anymore, nor the same pattern in query.wikidata.org > <http://query.wikidata.org> > (i.e.: http://query.wikidata.org/bigdata/namespace/wdq/sparql) both > return a http 301 header. Should probably be https://query.wikidata.org/bigdata/namespace/wdq/sparql That works for me. -- Stas Malyshev smalys...@wikimedia.org ___ Wikidata mailing list Wikidata@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikidata
Re: [Wikidata] query.wikidata.org and wikibase:Statement ?
Hi! > statements (about 2.5M) and on the question if SPARQL could list all > entries in Wikidata that do not have statements. I played a bit with Technically, it could, but since there are so many of them, the query might not finish in time. The problem is that since there are no indexes on something not existing, what probably happens is that the database goes entity by entity trying to find one that doesn't have a statement, and that is slow. I think there may be a bug with the LIMIT implementation, or maybe it's just indeed taking too long... > combinations of OPTIONAL and FILTER-BOUND and FILTER NOT EXIST... > something like: > > PREFIX wikibase: <http://wikiba.se/ontology#> > SELECT DISTINCT ?entry ?label ?statement WHERE { > ?entry rdfs:label ?label . FILTER (lang(?label) = "en") > FILTER NOT EXISTS { > ?statement ?prop ?entry ; > wikibase:rank ?rank . > } > } LIMIT 5 This query also seems a bit wrong since it looks for ?entry as the object, not the subject. > But there was something else I noted... statements are not typed... > that would probably kick in some index, rather than the above query, > and the documentation actually speaks about wikibase:Statement [1] but > if I search for anything rdf:type-d as such, then it finds nothing in > the SPARQL end point: Right, please check out: https://www.mediawiki.org/wiki/Wikibase/Indexing/RDF_Dump_Format#WDQS_data_differences wikibase:Statement is omitted from the database for performance reasons. You could still match statements by URL by converting them with str() and then using the substr() function, but that probably wouldn't help much since there are a lot of statements, so the filtering would not be very selective. -- Stas Malyshev smalys...@wikimedia.org ___ Wikidata mailing list Wikidata@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikidata
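A corrected sketch of the quoted query, with ?entry in the subject position (still likely to be slow for the reasons described above):
  PREFIX wikibase: <http://wikiba.se/ontology#>
  PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
  SELECT DISTINCT ?entry ?label WHERE {
    ?entry rdfs:label ?label . FILTER(lang(?label) = "en")
    FILTER NOT EXISTS {
      ?entry ?prop ?statement .         # the entity points to a statement node...
      ?statement wikibase:rank ?rank .  # ...which is what actual statements carry
    }
  } LIMIT 5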
Re: [Wikidata] REST API for Wikidata
Hi! > * About this API: http://queryr.wmflabs.org > * Documentation: http://queryr.wmflabs.org/about/docs > * API root: http://queryr.wmflabs.org/api I like the idea of a truly REST-style API for Wikidata. Some comments from a first look: * I'm not sure items and properties should really be separate, rather than having a common /entity/ endpoint. * You can do /items/{item_id}/data/{property_label} but not look up by property ID. In fact, the property label seems to work only in English, and some properties have pretty unwieldy names even in English - e.g. P357 (I know it's obsolete, but it may still have data and people may have to use it). * http://queryr.wmflabs.org/api/items/Q42/data/occupation returns only one value; shouldn't it return multiple ones? -- Stas Malyshev smalys...@wikimedia.org ___ Wikidata mailing list Wikidata@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikidata
Re: [Wikidata] WDQS updates have stopped
Hi! The service has caught up now, but for the time being I would like to ask that bots throttle their edits a bit. In the meantime, I'll look into speeding up the update process further, but given the nature of the database it may still be possible to temporarily overload it with a large enough update stream. So keeping bot updates under 10 per second would be nice (this is somewhat arbitrary, from a back-of-the-envelope calculation, so don't take the exact figure *too* seriously). Note that this should not be too hard a limit - it allows updating every single record now in Wikidata in about 2 weeks, which seems to be OK for most tasks. But it is a limitation and, as I said, I'll work to eventually get rid of it. Thanks, -- Stas Malyshev smalys...@wikimedia.org ___ Wikidata mailing list Wikidata@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikidata
Re: [Wikidata] WDQS URL shortener
Hi! > I could have used another one on my own I guess, but the current > implementation is much faster and less error prone when dealing with > monster sparqling urls... > > please find a way to keep it There are no plans to remove URL shortening. There are plans to switch the URL shortener to Wikimedia's own one, which is supposed to be coming up eventually, but before that, we plan to use existing ones. We might change the provider if it turns out there is a better one, but we do not plan to remove the functionality. -- Stas Malyshev smalys...@wikimedia.org ___ Wikidata mailing list Wikidata@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikidata
Re: [Wikidata] Possibility of data lock
Hi! > (1) Statement- and property-level watching for changes (seeing > problematic changes should come before disallowing them) This might be harder to do, as it's not really possible - internally - to edit just one statement. It looks like editing one statement, but the data for the whole entity is stored together, so you actually always edit the whole entity. Which means additional code is needed to filter the watches. Though if the number of watches is small it may be fine. An additional question is what exactly a "statement" is. We do not have any user-exposed identity for statements, and the internal identity can change with each edit. So if you edit the statement - for how long is it the same statement as before? (Ship of Theseus problem :) > (2) Statement- and property-level history (it's currently really hard to > find out who last changed a property, e.g., to contact the person or to > look at their other edits) Again, due to the above two, this may be a bit hard to construct - though we do have diffs, relating diffs to a specific statement may be tricky. Properties, though, are probably much easier since they have a clear identity. It should be possible at least to have some code that answers questions like "which edit was the last one to touch this property" and "which edits changed this property's statements", though it is not trivial and I don't think it exists now. > (3) Statement- and property-level protection (for the hard cases, mostly > temporarily, same policies as for page-level protection) > (4) Statement-level patrolling (can I approve a more recent change to > P31 without approving an older change to P580?) This may be possible but I'm not sure it's necessary. The watch approach below may be more effective and more consistent with the project spirit, I think. > (5) Query-based watching: if you want to watch all property changes for > a large set of articles, you need better tools I think a tool that takes a query and creates a list, and allows one to: 1. See how the list changes over time 2. Mark some items on the list as "ok" and some as "not ok" 3. Alert people on changes in the list would make watching for such changes much easier. > (6) More work on edit filters (preventing some edits based on content, > e.g., the use of shortened URLs as a homepage) > (7) Better UIs to prevent accidental edits (e.g., I can see a lot of > cases where people have entered qualifier information as new statement > values) That's a good case for watching too - we have properties that are predominantly used in qualifiers, and even marked as such. It should not be hard to make auto-lists of violations and have people look at them. > (8) Further work on easy-to-customise quality analysis and display of > related results (the constraint service is great, but hard to use in a > targeted way to find errors in a specific area). While data-vandalism > can have far-reaching consequences, it also is much harder to hide if > the community has the right tools at hand. > (9) Better data importing infrastructures (some problems mentioned in > this thread seem to be caused by a multi-stage data import approach that > only works if nothing changes in the meantime; I am sure one could get > this fixed without relying on user-editable data). -- Stas Malyshev smalys...@wikimedia.org ___ Wikidata mailing list Wikidata@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikidata
Re: [Wikidata] Possibility of data lock
Hi! > More concretely: While the item about Barack Obama as a whole will > probably change ever and ever again, the value of his birthday statement > should never change: The current value is proven to be true and can't > change by its nature. What would be the problem about protecting this > specific value in this specific statement? Small wrinkle here: one may want to change it by adding a reference, removing a reference that went stale, or replacing it with an archived version. It is also not out of the question that we may add a qualifier to it (e.g. if we introduce a new qualifier that didn't exist before) or remove one (if we deprecate it, for example). So while the value is not likely to change, other components of the claim very well might. -- Stas Malyshev smalys...@wikimedia.org ___ Wikidata mailing list Wikidata@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikidata
Re: [Wikidata] Federated queries between wikidata and e.g. sparql.uniprot.org
Hi! > What do we need to take into account to figure this out? What are the > constraints/concerns/etc? Hmm... Looks like I may have spoken too soon. While it is not a problem to make *WDQS* accept federated queries, it may be a problem to make federated queries actually *work*, because production machines do not seem to have direct access to the internet. Which means if we want to make it work, we'd need to ask ops for one-off exceptions to firewall policies, which are usually frowned upon - they are hard to maintain, in general require extra effort, and aren't really best practice. Maybe some of the use cases might be better served by a TPF server (https://github.com/blazegraph/BlazegraphBasedTPFServer) - this does not enable federated queries per se, but makes it possible to produce content that can be queried externally more easily. -- Stas Malyshev smalys...@wikimedia.org ___ Wikidata mailing list Wikidata@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikidata
Re: [Wikidata] Federated queries between wikidata and e.g. sparql.uniprot.org
Hi! > Dear Wikidata developers and contributors, > > I see that it is currently not possible to run federated queries from > wikidata to other sparql endpoints. I understand why you are not > allowing this in the general case. However, it would be nice to allow > this in the special case. > > Would a patch allowing limited known remote sparql endpoints to > org.wikidata.query.rdf.blazegraph.WikibaseContextListener > be possible? Technically, it is possible and shouldn't be very hard to do. We need to figure out which endpoints we want to allow. -- Stas Malyshev smalys...@wikimedia.org ___ Wikidata mailing list Wikidata@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikidata
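For what it's worth, once an endpoint such as sparql.uniprot.org were allowed, a federated query would just use the standard SPARQL 1.1 SERVICE clause, roughly as sketched below (assumptions: P352 is the UniProt protein ID property, UniProt uses the purl.uniprot.org/uniprot/ URI scheme, and the remote graph pattern is only indicative):
  PREFIX wdt: <http://www.wikidata.org/prop/direct/>
  SELECT ?item ?uniprot ?p ?o WHERE {
    ?item wdt:P352 ?uniprot .   # Wikidata items carrying a UniProt ID
    BIND(IRI(CONCAT("http://purl.uniprot.org/uniprot/", ?uniprot)) AS ?protein)
    SERVICE <https://sparql.uniprot.org/sparql> {   # remote endpoint, if whitelisted
      ?protein ?p ?o .                              # indicative pattern only
    }
  } LIMIT 10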
Re: [Wikidata] WDQS URL shortener
Hi! > that makes sense. It sounds like in the short term (and until we > have a Wikimedia-operated shortener), using full URLs from WDQS is – > alas – the only way to go. One option we haven't mentioned would be > for WDQS itself to support URL shortening, I have no idea where that > would sit in terms of priorities. I don't think it'd be very high. It would require building infrastructure that the Wiki extension already has, and for the sole purpose of implementing functionality that is already provided by that extension. So I think the best way here is just to wait until it's deployed, and maybe gently prod the responsible people from time to time :) -- Stas Malyshev smalys...@wikimedia.org ___ Wikidata mailing list Wikidata@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikidata
Re: [Wikidata] WDQS URL shortener
Hi! > Hi there, may I ask what link shorteners provide you that w3id does not? > Eg baked in metrics or 10char urls? Just curious why you would want to > reimplement. From what I can see, the target audience of w3id is a relatively small set of very stable URL prefixes that are used a lot and never change. It also does not aim at making these URLs shorter; it aims at making them stable. Adding a URL namespace to it is a manual process, and individual URLs are not stored. The target of URL shorteners is a much bigger set of URLs, many of which are relatively low-use or transient, but which can be created via automatic means in great volumes, which make the URL shorter, and which are aimed at storing, at least for a while, each URL as an individual data piece. So, for our purposes w3id would not be very useful. -- Stas Malyshev smalys...@wikimedia.org ___ Wikidata mailing list Wikidata@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikidata
Re: [Wikidata] Machine-readable Wikidata ontology/schema?
Hi! > A quick search only returned those tables so far: > https://www.wikidata.org/wiki/Wikidata:List_of_properties/all > > Any formal representation would work: OWL, etc. There's a basic OWL file with the Wikibase ontology here: http://wikiba.se/ontology-1.0.owl The properties can be found in the general dump ( https://dumps.wikimedia.org/wikidatawiki/entities/ ), described as outlined here: https://www.mediawiki.org/wiki/Wikibase/Indexing/RDF_Dump_Format#Properties There's no separate file, RDF, OWL or otherwise, with only the properties, AFAIK. -- Stas Malyshev smalys...@wikimedia.org ___ Wikidata mailing list Wikidata@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikidata
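If a live list of properties and their datatypes is enough, a sketch against the query service may also work (assuming properties are typed wikibase:Property and carry wikibase:propertyType, as the dump format documentation describes):
  PREFIX wikibase: <http://wikiba.se/ontology#>
  PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
  SELECT ?property ?label ?type WHERE {
    ?property a wikibase:Property ;
              wikibase:propertyType ?type ;   # e.g. string, item, time, ...
              rdfs:label ?label .
    FILTER(lang(?label) = "en")
  }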
Re: [Wikidata] Op-ed on Wikipedia Signpost regarding Wikidata licensing
Hi! > Current legislations do not support the licensing of individual facts, > only of databases as a whole, and only in some countries. What you are Added to that, even if it *were* possible to copyright facts, I think using a restrictive license (and make no mistake, any license that requires people to do specific things in exchange for data access *is* restrictive) creates a lot of trouble for any people using the data. This is especially true for data that is meant for automatic processing - you will have to add code to track licenses for each data unit, figure out how exactly to comply with the license (which would probably require professional help, always expensive), track license-contaminated data throughout the mixed databases, verify all outputs to ensure only properly-licensed data goes out... It presents so much trouble that many people would just not bother with it. It would hinder exactly the thing open source excels at - creating a community of people building on each other's work by means of incremental contribution and wide participation. Want to create a cool visualization based on Wikidata? Talk to a lawyer first. Want to kickstart your research exploration using Wikidata facts? To the lawyer you go. Want to write an article on, say, gender balance in science over the ages and places, and feature Wikidata facts as an example? Where's that lawyer's email again? You get the picture, I hope. How many people would decide "well, it would be cool but I have no time and resources to figure out all the license issues" and not do the next cool thing they could do? Is that something we really want to happen? And all that trouble to no benefit to anyone - there's absolutely no threat of the Wikidata database being taken over and somehow subverted by "enterprises", whatever that nebulous term means. In fact, if the Google example shows us anything, it's that "enterprises" are not very good at it and don't really want it. Would they benefit from the free and open data? Of course they would, as would everybody. The world - including everybody, including "enterprises" - has benefited enormously from free and open participatory culture, be it open source software or free data. It is a *good thing*, not something to be afraid of! Wikidata data is meant for free use and reuse. Let's not erect artificial barriers to it out of a misguided fear of somehow benefiting somebody "wrong". -- Stas Malyshev smalys...@wikimedia.org ___ Wikidata mailing list Wikidata@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikidata
Re: [Wikidata] counting gendered items for the gendergap project
Hi! > Is there any way to find out if a clothing item (dress, hat, shoes, etc) > is gendered? Any ideas how to do this? Thanks in advance I don't think we currently have anything; we could have something like "subclass/instance of" "male clothing" or "female clothing", but these things are very culture-specific (including changing within a culture with the passage of time), so I'm not sure it would be easy to represent this in Wikidata. -- Stas Malyshev smalys...@wikimedia.org ___ Wikidata mailing list Wikidata@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikidata
Re: [Wikidata] SPARQL CONSTRUCT results truncated
Hi! > I try to extract all mappings from wikidata to the GND authority file, > along with the according wikipedia pages, expecting roughly 500,000 to > 1m triples as result. As a starting note, I don't think extracting 1M triples is the best way to use the query service. If you need to do processing that returns such big result sets - in the millions - maybe processing the dump - e.g. with Wikidata Toolkit at https://github.com/Wikidata/Wikidata-Toolkit - would be a better idea? > However, with various calls, I get much less triples (about 2,000 to > 10,000). The output seems to be truncated in the middle of a statement, e.g. It may be some kind of timeout because of the quantity of data being sent. How long does such a request take? -- Stas Malyshev smalys...@wikimedia.org ___ Wikidata mailing list Wikidata@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikidata
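If the extraction does have to go through the endpoint after all, fetching it in pages rather than in one response is less likely to be cut off. A sketch (assumptions: P227 "GND identifier", the schema:about / schema:isPartOf sitelink modeling of WDQS; increase OFFSET to page through, at the cost of the ORDER BY that keeps paging deterministic):
  PREFIX wdt: <http://www.wikidata.org/prop/direct/>
  PREFIX schema: <http://schema.org/>
  SELECT ?item ?gnd ?article WHERE {
    ?item wdt:P227 ?gnd .                                    # GND identifier
    OPTIONAL {
      ?article schema:about ?item ;
               schema:isPartOf <https://en.wikipedia.org/> . # English Wikipedia article, if any
    }
  }
  ORDER BY ?item
  LIMIT 50000 OFFSET 0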
[Wikidata] WDQS stability
Hi! As was noted on the list, we recently tried to update Blazegraph - the software running the Wikidata Query Service - to version 2.0, which has numerous bugfixes and performance improvements, plus some infrastructure for future work on geospatial search, etc. Unfortunately, it seems, as sometimes happens with new major releases, that there are certain bugs in it, and yet more unfortunately, one of the bugs seems to be of a race-condition nature, which is very hard to trigger in a test environment, and which, when triggered, seriously impacts the stability of the service. All this led to the WDQS service being somewhat unstable over the last couple of days. Due to this, I have rolled the production deployment back to the pre-2.0 state. This means the service should be stable again and not experience glitches anymore. I'll be watching it just in case, and if you notice anything that looks broken (like queries producing weird exceptions - timeout does not count - or the service being down, etc.) please ping me. In the meantime, we will look for the cause of the instability, and once it is identified and fixed, we'll try the Blazegraph 2.0 roll-out again, with the fixes applied. I'll send a note to the list when it happens. Thanks, -- Stas Malyshev smalys...@wikimedia.org ___ Wikidata mailing list Wikidata@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikidata
Re: [Wikidata] Wikidata RDF PHP export bug
Hi! > I just noticed a bug in the RDF live exports of Wikidata: they still use > the base URI <http://wikiba.se/ontology-beta#> for all Wikidata > vocabulary terms. The correct base URI would be > <http://wikiba.se/ontology#>. I guess this has been forgotten and never > got noticed yet (not sure if there are consumers of the live exports). It's not forgotten - in fact, we have an issue for that, https://phabricator.wikimedia.org/T112127 - but we never got around to defining the point at which we do it. One can argue the RDF mapping is still not complete - we do not support units fully, and we may have to add stuff for geo-coordinates too - but one can also argue it's good enough to be 1.0, and I'd agree with that. We need to take a decision on this. Please feel free to also comment on the task. -- Stas Malyshev smalys...@wikimedia.org ___ Wikidata mailing list Wikidata@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikidata
Re: [Wikidata] SPARQL CONSTRUCT results truncated
Hi! > For me, it’s perfectly ok when a query runs for 20 minutes, when it > spares me some hours of setting up a specific environment for one > specific dataset (and doing it again when I need current data two month > later). And it would be no issue if the query runs much longer, in > situations where it competes with several others. But of course, that’s > not what I want to experience when I use a wikidata service to drive, > e.g., an autosuggest function for selecting entities. I understand that, but this is a shared server which is supposed to serve many users, and if we allowed 20-minute queries on this service, soon enough it would become unusable. This is why we have a 30-second limit on the server. Now, we have considered having an option for a server or setup that allows running longer queries, but currently we don't have one. It would require some budget allocation and work to make it, so it's not something we can have right now. There are use cases for very long queries and very large results; the current public service endpoint is just not good at serving them, because it's not what it was meant for. > And do you think the policies and limitations of different access > strategies could be documented? These could include a high-reliability I agree that the limitations had better be documented; the problem is we don't know everything we may need to document, such as which queries may be bad. When I see something like "I want to download a million-row dataset" I know it's probably a bit too much. But I can't have a hard rule that says 1M-1 is ok, but 1M is too much. > preferred option). And on the other end of the spectrum something what > allows people to experiment freely. Finally, the latter kind of I'm not sure how I could maintain an endpoint that would allow people to do anything they want and still provide an adequate experience for everybody. Maybe if we had infinite hardware resources... but we do not. Otherwise, it is possible - and should not be extremely hard - to set up one's own instance of the Query Service and use it for experimenting with heavy lifting. Of course, that would require resources - but there's no magic here; it'd require resources from us too, both in terms of hardware and of people to maintain it. So some things we can do now, some things we will be able to do later, and some things we probably would not be able to offer with any adequate quality. -- Stas Malyshev smalys...@wikimedia.org ___ Wikidata mailing list Wikidata@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikidata
Re: [Wikidata] SPARQL CONSTRUCT results truncated
Hi! > 5.44s empty result > 8.60s 2090 triples > 5.44s empty result > 22.70s 27352 triples That looks weirdly random. I'll check out what is going on there. -- Stas Malyshev smalys...@wikimedia.org ___ Wikidata mailing list Wikidata@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikidata
Re: [Wikidata] upcoming deployments/features
Hi! > Can you try again please? And in an in-cognito window? I just tried it > and it works for me: https://test.wikidata.org/wiki/Q649 We've had some > issues with local store though. Weird, it does work for me incognito but not when logged in. > The datatype changes but the value type stays string. So depending on > what they use they might need to be adapted. The RDF export seems to be fine, except that we need to update the OWL and docs for the new types; I'll check pywikibot a bit later. -- Stas Malyshev smalys...@wikimedia.org ___ Wikidata mailing list Wikidata@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikidata
Re: [Wikidata] Wikidata - short biographies
Hi! > > ** For all English bios:* > > SELECT * > WHERE > { > ?s <http://schema.org/description> ?o . > filter(lang(?o)='en'). > } Please don't run this on query.wikidata.org though - please add a LIMIT. Otherwise you'd be trying to download several million data items, which would probably time out anyway. Add something like "LIMIT 10" to it. Thanks, -- Stas Malyshev smalys...@wikimedia.org ___ Wikidata mailing list Wikidata@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikidata
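The same query with the suggested limit added, for reference:
  SELECT * WHERE {
    ?s <http://schema.org/description> ?o .
    FILTER(lang(?o) = 'en')
  }
  LIMIT 10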
Re: [Wikidata] SPARQL service slow?
Hi! > 1. Is that stability now reached? > 2. Is there some sort of flag-mechanism, that would indicate if the > SPARQL endpoint is stable, being upgraded or had change in state? It's supposed to be stable now, though there's at least one known bug (not affecting you unless you run the affected query, but if you see NotMaterializedException you have probably hit it; it's being worked on and will be fixed soon - ping me if you want the boring details). In general, the upgrade went fine so I didn't send a note, but then apparently (of course, as soon as I had waited to ensure everything was going fine, which it was, and went to sleep) something went wrong, and I'm still not sure what exactly - I'll continue to research that, but whatever happened doesn't seem to have repeated itself so far. In the future, I'll make an announcement when I do potentially risky stuff. We have a number of plans for implementing new functions which will involve a full DB reload at least once - which will mean the DB will be temporarily reset to the state of the latest weekly dump (i.e. 2-3 days behind the current state of Wikidata) and then re-sync with the current state over the course of a day or so. I'll write about that additionally before we do it; depending on how development is going, it could be the end of this month or somewhere next month. Thanks, -- Stas Malyshev smalys...@wikimedia.org ___ Wikidata mailing list Wikidata@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikidata
Re: [Wikidata] SPARQL service slow?
Hi! > Well, we only noticed what was up due to this email! > Take a look at https://phabricator.wikimedia.org/T119915 Yes, we need to look into it. The problem is that the service has two failure modes: 1. Completely dead, rejecting all queries. This would be caught by Icinga and alerted on. 2. Crawling slow, but still partially alive, just performing very, very badly. For this one, we do not have an adequate alert system. This failure mode is rare, but we've seen it happen, both due to somebody sending a torrent of heavy queries and due to some bug scenarios. Icinga does not catch it because it only checks very basic queries, and those still finish under the timeout. -- Stas Malyshev smalys...@wikimedia.org ___ Wikidata mailing list Wikidata@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikidata
Re: [Wikidata] weekly summary #194
Hi! > Indeed, this is very nifty. I also note that this uses some special > features of our SPARQL endpoint that I did not know about (the "gas > service"). It seems that this is a proprietary extension of BlazeGraph, > which comes in very handy here. Yes, it's described here: https://wiki.blazegraph.com/wiki/index.php/RDF_GAS_API and it's a service implementing basic graph algorithms such as BFS, shortest path, PageRank, etc. I personally haven't used it much, but it may be very useful for tasks which are naturally expressed as graph traversals. -- Stas Malyshev smalys...@wikimedia.org ___ Wikidata mailing list Wikidata@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikidata
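A rough sketch of a typical use, based on the GAS API documentation linked above (parameter names should be checked against that page; P279 "subclass of" and Q5 "human" are chosen only for illustration), running a breadth-first search along subclass-of edges:
  PREFIX wd: <http://www.wikidata.org/entity/>
  PREFIX wdt: <http://www.wikidata.org/prop/direct/>
  PREFIX gas: <http://www.bigdata.com/rdf/gas#>
  SELECT ?item ?depth WHERE {
    SERVICE gas:service {
      gas:program gas:gasClass "com.bigdata.rdf.graph.analytics.BFS" ;  # breadth-first search
                  gas:in wd:Q5 ;            # start node
                  gas:linkType wdt:P279 ;   # follow subclass-of edges
                  gas:out ?item ;           # visited node
                  gas:out1 ?depth ;         # distance from the start node
                  gas:maxIterations 4 .
    }
  }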
Re: [Wikidata] SPARQL service slow?
Hi! > is it me or is the SPARQL service very slow right now? I upgraded it to Blazegraph 2.0 yesterday and it looks like there was some glitch there. I've restarted it and now it seems to be fine. I'll keep watching it and see if it repeats. -- Stas Malyshev smalys...@wikimedia.org ___ Wikidata mailing list Wikidata@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikidata
[Wikidata] Upcoming WDQS upgrade to Blazegraph 2.0
Hi! It looks like all the issues that we have had with Blazegraph 2.0 are now fixed, so I will attempt to upgrade our install to 2.0 again sometime around noon PDT tomorrow. There should be no visible changes except for a brief restart of each server, which should not be externally visible (since we have two, the other one will just take over). However, if something bad happens, there might be a brief disruption of service. I'll send a message when it's done, and if you notice anything weird after the upgrade, please ping me or submit an issue in Phabricator. Thanks, -- Stas Malyshev smalys...@wikimedia.org ___ Wikidata mailing list Wikidata@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikidata
Re: [Wikidata] Upcoming WDQS upgrade to Blazegraph 2.0
Hi! > So, is there an equivalent for WDQ "AROUND[]" now? :-) Not yet :) I.e., we could simulate it by accessing longitude/latitude manually, but in order to use BG 2.0's geospatial index, we need to add code to specifically encode our coordinate literals in a way compatible with that index. So there's still work to do, see https://phabricator.wikimedia.org/T123565 There are some interesting challenges due to the fact that our coordinates include globes, which is not a very common thing, but we're working on supporting it. -- Stas Malyshev smalys...@wikimedia.org ___ Wikidata mailing list Wikidata@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikidata
Re: [Wikidata] Littar won second prize
Hi! > My 'Littar' (literature radar) website/app won second prize in DBC's > (Danish Library Center) app competition last week at the Data Science Day. > > Littar displays narrative locations from literary works on a map, - > presently only Danish locations. Data comes from Wikidata P840 and > presently colored according to P136 using the Leaflet marker. Text is > from the P1683 qualifier under P840: > > http://fnielsen.github.io/littar/ Congratulations, very nice project! -- Stas Malyshev smalys...@wikimedia.org ___ Wikidata mailing list Wikidata@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikidata
Re: [Wikidata] This SPARQL query no longer works, any ideas why?
Hi! > Wikidata SPARQL aficionados, > > This SPARQL query worked for several weeks, but quit working a few days > ago: No idea what happened, I'll look into it. -- Stas Malyshev smalys...@wikimedia.org ___ Wikidata mailing list Wikidata@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikidata
Re: [Wikidata] This SPARQL query no longer works, any ideas why?
Hi! > Stas, > > I've narrowed it down to the ORDER BY clause. Changing from: > > ORDER BY ?propUrl ?valUrl > > to: > > ORDER BY ?propLabel ?valLabel Seems to be caused by recent fix for T113374, which did not work as expected. I have rolled back the deployment for now and will investigate why it broke later. Thanks for the report! -- Stas Malyshev smalys...@wikimedia.org ___ Wikidata mailing list Wikidata@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikidata
[Wikidata] SPARQL endpoint caching
Hi! With Wikidata Query Service usage rising and more use cases being found, it is time to consider caching infrastructure for results, since queries are expensive. One of the questions I would like to solicit feedback on is the following: Should the default SPARQL endpoint be cached or uncached? If cached, which default cache duration would be good for most users? The cache, of course, applies to the results of the same (identical) query only. Please also note the following is not an implementation plan, but rather an opinion poll; whatever we end up deciding, we will make an announcement with the actual plan before we do it. Also, whichever default we choose, there should be a possibility to get both cached and uncached results. The question is, when you access the endpoint with no options, which one would it be. So the possible variants are: 1. query.wikidata.org/sparql is uncached; to get a cached result you use something like query.wikidata.org/sparql?cached=120 to get a result no older than 120 seconds. PRO: least surprise for default users. CON: relies on the goodwill of tool writers; if somebody doesn't know about the cache option and uses the same query heavily, we would have to ask them to use the parameter. 2. query.wikidata.org/sparql is cached for a short duration (e.g. 1 minute) by default; if you'd like a fresh result, you do something like query.wikidata.org/sparql?cached=0. If you're fine with an older result, you can use query.wikidata.org/sparql?cached=3600 and get a cached result if it's still in the cache, but by default you never get a result older than 1 minute. This of course assumes the Varnish magic can do this; if not, the scheme has to be amended. PRO: performance improvement while keeping default results reasonably fresh. CON: it is not obvious that the result may not be the freshest data but can be stale, so if you update something in Wikidata and query again within a minute, you can be surprised. 3. query.wikidata.org/sparql is cached for a long duration (e.g. hours) by default; if you'd like a fresher result you do something like query.wikidata.org/sparql?cached=120 to get a result no older than 2 minutes, or cached=0 if you want an uncached one. PRO: best performance improvement for most queries; works well with queries that display data that rarely changes, such as lists, etc. CON: for people who don't know about the cache option, it may be rather confusing not to be able to get up-to-date results. So we'd like to hear - especially from current SPARQL endpoint users - what you think about these and which would work for you. Also, for the users of the WDQS GUI - provided we have cached and uncached options, which one should the GUI return by default? Should it always be uncached? Performance there is not a major question - the traffic to the GUI is pretty low - but rather convenience. Of course, if you run a cached query from the GUI and the data is in the cache, you can get results much faster for some queries. OTOH, it may be important in many cases to be able to access the actual up-to-date content, not a cached version. I also created a poll: https://phabricator.wikimedia.org/V8 so please feel free to vote for your favorite option. OK, this letter is long enough already, so I'll stop here and wait to hear what everybody's thinking. Thanks in advance, -- Stas Malyshev smalys...@wikimedia.org ___ Wikidata mailing list Wikidata@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikidata
Re: [Wikidata] SPARQL endpoint caching
Hi! > How often does *exactly* the same query get run within 2 minutes ? It depends on where the query is coming from. E.g. if there's a graph backed by a query, then a lot of people can be seeing the graph and running the query. Same if somebody publishes a link to some query, e.g. during a talk or in an article, and a bunch of people come to look at it. It depends on the use case. Some use cases - like graphs - we are just planning, so we can't really rely on statistics here. > I'd guess there's probably only a very few queries like that though. Well, maybe - we don't really know yet. That's why I want to hear opinions on this :) -- Stas Malyshev smalys...@wikimedia.org ___ Wikidata mailing list Wikidata@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikidata
Re: [Wikidata] SPARQL endpoint caching
Hi! > I agree, we should look at some actual traffic to see how many queries > /could/ be cached in a 2/5/10/60 min window. Maybe remove the example > queries from those numbers, to separate the "production" and testing > usage. Also, look at query runtime; if only "cheap" queries would be > cached, there is no point in caching. Makes sense, but some of the use cases are not implemented yet, and I'm kind of scared of allowing them without caching - e.g. graph embedding - so it's hard to rely on past data. > Once you run a query, you know both the runtime and the result size. > Maybe expensive queries with a huge result set could be cached longer by > default, and cheap/small queries not at all? If you expect your recent > Wikidata edit to change the results from 3 to 4, you should see that > ASAP; if the change would be 50.000 to 50.001, it seems less critical > somehow. That sounds like a good idea, we'll need to check if Varnish allows us to do tricks like this... -- Stas Malyshev smalys...@wikimedia.org ___ Wikidata mailing list Wikidata@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikidata
Re: [Wikidata] Make federated queries possible / was: SPARQL CONSTRUCT results truncated
Hi! > Now, obviously endpoints referenced in a federated query via a > service clause have to be open - so any attacker could send his > queries directly instead of squeezing them through some other > endpoint. The only scenario I can think of is that an attackers IP > already is blocked by the attacked site. If (instead of much more > common ways to fake an IP) the attacker would choose to do it by > federated queries through WDQS, this _could_ result in WDQS being > blocked by this endpoint. This is not what we are concerned with. What we are concerned with is that federation essentially requires you to run an open proxy - i.e. to allow anybody to send requests to any URL. This is not acceptable to us because it means somebody could abuse it both to try and access our internal infrastructure and to launch attacks on other sites using our site as a platform. We could, if there is enough demand, allow access to specific whitelisted endpoints, but so far we haven't found any way to allow access to any SPARQL endpoint without essentially allowing anybody to launch arbitrary network connections from our server. > provide for the linked data cloud. This must not involve the > highly-protected production environment, but could be solved by an > additional unstable/experimental endpoint under another address. The problem is we cannot run a production-quality endpoint in a non-production environment. We could set up an endpoint on the Labs, but this endpoint would be underpowered and we wouldn't be able to guarantee any quality of service there. To serve the amount of Wikidata data and updates, the machines should have certain hardware capabilities, which Labs machines currently do not have. Additionally, I'm not sure running an open proxy even there would be a good idea. Unfortunately, in the internet environment of today there is no lack of players that would want to abuse such a thing for nefarious purposes. We will keep looking for a solution to this, but so far we haven't found one. Thanks, -- Stas Malyshev smalys...@wikimedia.org ___ Wikidata mailing list Wikidata@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikidata
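For readers who have not used federation, here is a minimal sketch of what such a query looks like; the external endpoint (DBpedia) and the resource are only examples, and this is exactly the kind of arbitrary outbound request discussed above, so it would not be allowed on the WDQS endpoint itself.

    # Hypothetical federated query: pull an English abstract for Berlin from an
    # external SPARQL endpoint via a SERVICE clause. Any URL could appear as the
    # SERVICE target, which is the open-proxy concern described in the mail.
    SELECT ?item ?abstract WHERE {
      BIND(wd:Q64 AS ?item)    # Berlin, chosen for illustration
      SERVICE <https://dbpedia.org/sparql> {
        <http://dbpedia.org/resource/Berlin> <http://dbpedia.org/ontology/abstract> ?abstract .
        FILTER(lang(?abstract) = "en")
      }
    }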
Re: [Wikidata] Undoing merges (bibliographic articles are not humans)
Hi! > If I get the question correctly, after undoing changes in the target > page, you need to go to the redirect page and undo changes made there. Yes, I tried to do that and it works fine. You may need to click on the "redirected from" link when going to the redirect page, then go to the history and restore the pre-merge version. -- Stas Malyshev smalys...@wikimedia.org ___ Wikidata mailing list Wikidata@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikidata
Re: [Wikidata] SPARQL endpoint caching
Hi! > > I'll do a presentation next week, in which I intend to demonstrate > that I can add a Wikidata value online, which then is available > immediately for my application - as well as for the whole rest of the > world. (In Library Land, that's a real blast, because business > processes related to authority data often take weeks or month ...) I think we'll always have some way to run an un-cached query. The question is only how easy it would be - i.e. whether you would need to add a parameter, click a checkbox, etc. -- Stas Malyshev smalys...@wikimedia.org ___ Wikidata mailing list Wikidata@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikidata
Re: [Wikidata] SPARQL CONSTRUCT results truncated
Hi! > [1] Is the service protected against internet crawlers that find such > links in the online logs of this email list? It would be a pity if we > would have to answer this query tens of thousands of times for many > years to come just to please some spiders who have no use for the result. That's a very good point. We currently do not have a robots.txt file on the service. We should have one. I'll fix it ASAP. GUI links do not run the query until clicked, so they are safe from bots anyway. But direct links to the sparql endpoint do run the query (it's the API after all :) So robots.txt is needed there. -- Stas Malyshev smalys...@wikimedia.org ___ Wikidata mailing list Wikidata@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikidata
Re: [Wikidata] SPARQL CONSTRUCT results truncated
Hi! > you may want to check out the Linked Data Fragment server in Blazegraph: > https://github.com/blazegraph/BlazegraphBasedTPFServer Thanks, I will check it out! -- Stas Malyshev smalys...@wikimedia.org ___ Wikidata mailing list Wikidata@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikidata
Re: [Wikidata] Status and ETA External ID conversion
Hi! > Couldn't you use P460 when there is doubt? > > https://www.wikidata.org/wiki/Property:P460 P460's type is Item, which means it is a relation between two Wikidata items. An external ID is a relation between a Wikidata item and something outside Wikidata. -- Stas Malyshev smalys...@wikimedia.org ___ Wikidata mailing list Wikidata@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikidata
Re: [Wikidata] Recognizing deleted resource in the Wikidata SPARQL endpoint
Hi! > Is there an alternative to SPARQL that can either check for deleted > records or give only non-deleted ones? currently, I am checking if > they are deleted by making a "ping" using HTTP head requests but this > takes a huge amount of time since I need to check about 70k > resources. Well, the SPARQL data store is not supposed to contain any deleted entries... But it looks like there's some bug there. If you give me the list of the "bad" entries, it's easy to update them. Considerably harder is to find *why* they weren't updated in the first place. I'm still looking into it. -- Stas Malyshev smalys...@wikimedia.org ___ Wikidata mailing list Wikidata@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikidata
Re: [Wikidata] Wordnet mappings
Hi! > Is there a property for WordnetId? The list of properties is here: https://www.wikidata.org/wiki/Wikidata:List_of_properties I don't see anything there for Wordnet. -- Stas Malyshev smalys...@wikimedia.org ___ Wikidata mailing list Wikidata@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikidata
Re: [Wikidata] Recognizing deleted resource in the Wikidata SPARQL endpoint
Hi! > I am currently running an experiment to figure out how many Wikidata > entries refer to identifiers in our dataset (i.e. using property P727) > but I am receiving in the results entries that have apparently been > deleted/deprecated (e.g. http://www.wikidata.org/entity/Q18573617)... is Could you send me the query and the items you see that are wrong? > there a way to detect them using SPARQL, perhaps some meta-property or > some information in a statement, or is it simply because the endpoint is > not in sync with the main repo. Short answer - unfortunately, no. Longer answer in https://phabricator.wikimedia.org/T128947#2104017 -- Stas Malyshev smalys...@wikimedia.org ___ Wikidata mailing list Wikidata@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikidata
[Wikidata] Fwd: [Ops] Wikidata Query Service (WDQS) regular deployment window
Hi! FYI to those whom it may concern - we plan to institute regular WDQS deployments on Mondays for both code and GUI. Not much is going to change except that regular deployments will happen at a predictable time instead of "whenever I feel like it" :) This does not preclude emergency deployments in case something is broken, of course, but hopefully it will introduce more predictability as the service matures. Guillaume will usually be doing the deployments, with me stepping in if he cannot. Forwarded Message Subject: [Ops] Wikidata Query Service (WDQS) regular deployment window Date: Tue, 22 Mar 2016 22:04:21 +0100 From: Guillaume Lederrey To: A public mailing list about Wikimedia Search and Discovery projects, Operations Engineers Hello! After discussion with Stas, we want to have a regular deployment window for Wikidata Query Service. This should help give better visibility on when a new version arrives and help track issues with those new versions. I will take care of the deployments (with Stas' support, of course). The deployment window is: every Monday, from 7pm CET (10am PST - 5pm UTC), starting from Monday April 11th. Let me know if you have any questions or if you know of another place where I should publicize this deployment window. Take care, Guillaume -- Guillaume Lederrey Operations Engineer, Discovery Wikimedia Foundation ___ Ops mailing list o...@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/ops ___ Wikidata mailing list Wikidata@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikidata
Re: [Wikidata] Fwd: [Ops] Wikidata Query Service (WDQS) regular deployment window
Hi! > Could we also have some daily deployed test site, so new features could > be tried out and tested? There's a test server at wdq-beta.eqiad.wmflabs, on which I routinely test stuff. You could use it too. I don't update it daily, though; rather, whenever I need to test something. We could have another instance that is auto-deployed from the deployment repo daily by scripts, if necessary. -- Stas Malyshev smalys...@wikimedia.org ___ Wikidata mailing list Wikidata@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikidata
Re: [Wikidata] SPARQL/BlazeGraph: label service performance
Hi! > There is a performance issue with the labelling service. Using labels > makes even simple queries time out. For example this one: > > SELECT $p $pLabel > WHERE { >    $p wdt:P31 _:bnode . >    SERVICE wikibase:label { bd:serviceParam wikibase:language "en" . } > } LIMIT 11 I suspect the issue here may be that it tries to calculate the full set of values before applying the service. That may make sense if the service is external, but if it is internal and the result set is huge, it obviously does not work. Another alternative, since you are just looking for English labels, is to use the direct query approach: SELECT $p $pLabel WHERE { $p wdt:P31 _:bnode . OPTIONAL { $p rdfs:label $pLabel . FILTER(lang($pLabel) = "en") } } LIMIT 11 This seems to work just fine. You lose a bit of added value on the service (nicer no-label labels) but you gain a lot of speed. In any case, I'll raise this issue with Blazegraph, and it may also be worth submitting a Phabricator issue about it. -- Stas Malyshev smalys...@wikimedia.org ___ Wikidata mailing list Wikidata@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikidata
Re: [Wikidata] Status and ETA External ID conversion
Hi! > The community is checking each property to verify it should be converted: > > https://www.wikidata.org/wiki/User:Addshore/Identifiers/0 > > https://www.wikidata.org/wiki/User:Addshore/Identifiers/1 > > https://www.wikidata.org/wiki/User:Addshore/Identifiers/2 Is there a process documented somewhere for how the checking is done, what the criteria are, etc.? I've read https://www.wikidata.org/wiki/User:Addshore/Identifiers but there's a lot of discussion and it's not clear whether it ever came to a conclusion. It's also not clear what the process is - should I just move a property I like to "good to convert"? Should I run it through some checklist first? Should I ask somebody? What are the rules for "disputed" - is some process for review planned? I think a more definite statement would help, especially for people willing to contribute. -- Stas Malyshev smalys...@wikimedia.org ___ Wikidata mailing list Wikidata@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikidata
Re: [Wikidata] Status and ETA External ID conversion
Hi! > In your case, however, the answer probably is: you cannot contribute > there at all, since you are a Wikimedia employee and this is a > content-related community discussion. ;-) Many WMF employees contribute to wikis in their non-work time, as far as I know. I don't even seek to participate in the discussion (though I don't think WMF employment would disqualify me from contributing in a volunteer capacity, given that my affiliations - such as they are - are clearly stated) - but only to know the results so I could contribute in an editor capacity, following whatever rules are there. -- Stas Malyshev smalys...@wikimedia.org ___ Wikidata mailing list Wikidata@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikidata
Re: [Wikidata] Status and ETA External ID conversion
Hi! > Yes, sure, your free time is a different matter. I just thought you are > speaking as a WMF employee here, since you were using this email. I am It's Sunday here, so no :) I do use two separate logins for WMF official and volunteer work on Wiki, but using two emails is too cumbersome for me. > probably over-sensitive there since I am used to the very strict > policies of WMDE. They are very careful to keep paid and private > activities separate by using different accounts. Surely, it is common in WMF too. But again, two email accounts seem excessive to me. Usually it's pretty clear from the context, but if needed, I will clarify. -- Stas Malyshev smalys...@wikimedia.org ___ Wikidata mailing list Wikidata@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikidata
[Wikidata] upcoming change in RDF format data
Hi! We are committing a patch that implements a change in the RDF format output, specifically how we output coordinates as WKT points. If you do not use the RDF format exports, and specifically the WKT coordinate literals there, this change has no effect for you. When we first implemented it, we chose to make it "Point(latitude longitude)". Unfortunately, it turns out the standard way in WKT is Point(longitude latitude), and that's how most of the tools that implement the WKT format understand it. In general, geo-data formats are split on this question, see http://www.macwright.org/lonlat/. But WKT is pretty universally in the lon-lat camp, so we have to follow the established practice. As such, we are changing the WKT representation and bumping the format version (reported as schema:softwareVersion on RDF dumps/exports) from 0.0.1 to 0.0.2 so that tools can adjust properly. See more details in: https://phabricator.wikimedia.org/T130049 Thanks, -- Stas Malyshev smalys...@wikimedia.org ___ Wikidata mailing list Wikidata@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikidata
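To make the change concrete, here is a small before/after sketch of the literal form; the coordinates (roughly the Eiffel Tower) are approximate and only for illustration, and geo: stands for the GeoSPARQL wktLiteral datatype used in the exports.

    old (format 0.0.1): "Point(48.858 2.294)"^^geo:wktLiteral    (latitude first)
    new (format 0.0.2): "Point(2.294 48.858)"^^geo:wktLiteral    (longitude first, standard WKT)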
[Wikidata] geospatial search preview
Hi all! I would like to present to you a preview of an upcoming Wikidata Query Service feature, namely geospatial search - on http://geotest.wmflabs.org/. With this, you will be able to search things by their geographic coordinates, e.g.: Objects within 10km of Paris: http://tinyurl.com/jaxr3bn Airports within 100km from Berlin: http://tinyurl.com/gtbmqz3 (try also the map view!) The purpose of the preview is to let people play with it and collect feedback on what works, what doesn't and what you'd like to see added/changed/removed. This is a preview implementation, so do not use it for anything beyond admiring the marvels of modern technology :) and providing feedback. The full release on query.wikidata.org will follow sometime in late April-early May, hopefully. What is implemented? - Coordinate storage & indexing as WKT literals - Search within radius and within bounding box - Support for different globes What is missing but would be added soon? - Distances as search output and as separate function - Documentation - You tell me Thanks, -- Stas Malyshev smalys...@wikimedia.org ___ Wikidata mailing list Wikidata@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikidata
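For those who want to try the preview without following the short links, here is roughly what the "within 10km of Paris" query looks like in the preview's around-service syntax; the center point and radius are illustrative, and the wdt:/bd:/wikibase:/geo: prefixes are assumed to be the GUI defaults.

    # Things with coordinates within 10 km of a point near central Paris.
    SELECT ?place ?location WHERE {
      SERVICE wikibase:around {
        ?place wdt:P625 ?location .
        bd:serviceParam wikibase:center "Point(2.3522 48.8566)"^^geo:wktLiteral .
        bd:serviceParam wikibase:radius "10" .   # kilometres
      }
    }
    LIMIT 500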
Re: [Wikidata] SQID: the new "Wikidata classes and properties browser"
Hi! > Nice work! I especially like the ability to filter the properties by > usage amount here: > https://tools.wmflabs.org/sqid/#/browse?type=properties This makes it > super easy to find unused or nearly unused properties for example. Yes! Also some usage that seems strange - e.g., why use P31 or P279 in a reference? -- Stas Malyshev smalys...@wikimedia.org ___ Wikidata mailing list Wikidata@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikidata
Re: [Wikidata] question on claim-filtered search/dump and works on a Wikidata subset search engine
Hi! > feature that needs to be snappy. So the alternative approach I have been > working on to is to get a subset of a Wikidata dump and put it in an > ElasticSearch instance. The linked data fragments implementation would probably be useful for that, and I think it would be a good idea to get one eventually for the Wikidata Query Service, but not yet. Also, we do have an ElasticSearch index for Wikidata (that's what drives search on the site), so it would be possible to integrate it with the Query Service too (there's some support for it in Blazegraph), but that's still not done. So for now I think we don't have a ready-made solution yet. You could still try to prefix-search or regex-search on the query service, but depending on the query it may be too slow right now. > *Question: > *What is the best way to get all the entities matching a given claim? > My answer so far was downloading a dump, then filtering the entities by > claim, but are there better/less resource-intensive ways? Probably not currently without some outside tools. When we get LDF support, that may be the way :) -- Stas Malyshev smalys...@wikimedia.org ___ Wikidata mailing list Wikidata@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikidata
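For the "entities matching a given claim" part, a direct query on the query service is often enough when the claim is a simple property/value pair; the property and value below (instance of = film) are placeholders for whatever claim is needed, and large result sets may still hit the timeout mentioned above.

    # All items with a given claim; swap in the property and value you care about.
    SELECT ?item WHERE {
      ?item wdt:P31 wd:Q11424 .   # instance of (P31) = film (Q11424), illustrative
    }
    LIMIT 1000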
[Wikidata] Query service issue
Hi! During the data reload to enable the geospatial service, we discovered a problem with Wikidata dumps (https://phabricator.wikimedia.org/T133924). The effect of this problem is that some items are missing from the dump. Dumps starting with 20160418 are affected; previous ones seem to be fine. The immediate fix for this is to reload the data from a correct dump (20160411) and re-sync the data since then. Unfortunately, this may take some time (a day or so for the reload, and another day or so for the resync), and until then you'll see some missing data on query.wikidata.org. Please be patient until then. I apologize for the inconvenience caused, and will continue to research the cause of the missing data and then fix it. I'll update the ticket when we have new info. Thanks, -- Stas Malyshev smalys...@wikimedia.org ___ Wikidata mailing list Wikidata@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikidata
Re: [Wikidata] Geospatial search for Wikidata Query Service is up
Hi! > Very nice! From where is the map-data? Open street map? Yes, see: https://www.mediawiki.org/wiki/Maps -- Stas Malyshev smalys...@wikimedia.org ___ Wikidata mailing list Wikidata@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikidata
[Wikidata] Geospatial search distance enabled
Hi! The geospatial search in WDQS now includes the ability to sort by distance. This can be achieved in one of two ways: 1. Circular search now has a clause to fetch the distance from the center: bd:serviceParam wikibase:distance ?distance . This places the distance from the center into the ?distance variable for each found point, which can then be used for sorting. 2. geof:distance function - this function returns the distance between any two points. It can be used on any two points, regardless of usage of the search service. In both cases, distances are returned in kilometers; other units are not currently supported (maybe in the future). Please tell me if you notice anything wrong or have any comments/suggestions. Thanks, -- Stas Malyshev smalys...@wikimedia.org ___ Wikidata mailing list Wikidata@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikidata
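Putting the two approaches together, here is a rough sketch; the center point, radius, and second point are illustrative, and the prefixes are assumed to be the WDQS defaults.

    # 1. Distance returned by the around service and used for sorting:
    SELECT ?place ?distance WHERE {
      SERVICE wikibase:around {
        ?place wdt:P625 ?loc .
        bd:serviceParam wikibase:center "Point(13.3889 52.5170)"^^geo:wktLiteral .
        bd:serviceParam wikibase:radius "100" .
        bd:serviceParam wikibase:distance ?distance .   # kilometres from the center
      }
    }
    ORDER BY ?distance

    # 2. geof:distance between two arbitrary points (also kilometres):
    SELECT (geof:distance("Point(13.3889 52.5170)"^^geo:wktLiteral,
                          "Point(2.3522 48.8566)"^^geo:wktLiteral) AS ?km)
    WHERE { }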
Re: [Wikidata] Wikidata Query Service problems
Hi! > I am having problems with the Wikidata Query Service. Sometimes I just > get an black page at other time I get an interface but without a > responsive edit field and the message "Data last updated: [connecting]" > on the green status field. The issue has been going on for some days now. Probably linked to the recent Varnish issues: https://phabricator.wikimedia.org/T134989 It's being worked on. > I know there has been recent instabilities, cf. > https://lists.wikimedia.org/pipermail/wikidata/2016-May/008674.html , > but https://phabricator.wikimedia.org/T134238 'Query service fails with > "Too many open files"' should be resolved as of 9 May. A new one No, that's a different one and it is fixed. -- Stas Malyshev smalys...@wikimedia.org ___ Wikidata mailing list Wikidata@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikidata
Re: [Wikidata] Wikidata ontology
Hi! > If geo-coordinates use WKT in Wikidata (which I can't see anywhere > there), does it mean, that the original idea of /{latitude, longitude, > altitude, precision, globe}/ format was abandoned? Coordinates are WKT in the RDF output of Wikidata, when represented as a single literal. That's not the original representation inside Wikibase (which still has separate elements and can also be seen in the JSON dump), and it's not the only RDF representation - there's also the "full value" representation, described here: https://www.mediawiki.org/wiki/Wikibase/Indexing/RDF_Dump_Format#Globe_coordinate This one is harder to index and search for, but it allows you to do lookups that depend on a specific part of the coordinate (like which objects are on Mars, or what is located on the equator). > Oh, and another thing: when I download "wikidata-properties.nt" from > WDTK dump files site [1], there is <http://wikidata.org/ontology#> used > everywhere. So... is the WB ontology somehow translated to WD ontology? Hmm, not sure about that one, Markus should know more about it. -- Stas Malyshev smalys...@wikimedia.org ___ Wikidata mailing list Wikidata@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikidata
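As an illustration of a lookup against the full-value representation, here is a rough sketch of the "which objects are on Mars" case; Q111 is used for Mars as an assumption, and the p:/psv:/wikibase:/wd: prefixes are assumed to be the WDQS defaults.

    # Items whose coordinate value is placed on Mars, via the full-value node.
    SELECT ?item ?lat ?lon WHERE {
      ?item p:P625/psv:P625 ?coord .
      ?coord wikibase:geoGlobe wd:Q111 ;     # globe = Mars (assumed item ID)
             wikibase:geoLatitude ?lat ;
             wikibase:geoLongitude ?lon .
    }
    LIMIT 100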
Re: [Wikidata] Sparql query for "evidence based practice"
Hi! > One of the things Reasonator does not do is provide information on > qualifiers. I am interested particularly in Q2354820 > <http://www.wikidata.org/wiki/Q2354820> and where it is used. If you mean usage as a qualifier's value, this would probably work: SELECT * WHERE { ?st ?pred wd:Q2354820 . ?p wikibase:qualifier ?pred . } LIMIT 10 but no result is produced, so I assume Q2354820 is not used as a qualifier value (unless I'm missing something). -- Stas Malyshev smalys...@wikimedia.org ___ Wikidata mailing list Wikidata@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikidata
Re: [Wikidata] Sparql query for "evidence based practice"
Hi! > Hoi, > Thanks Stas :) This is one example where Q2354820 is used. > > http://tools.wmflabs.org/reasonator/?=24013782 OK, this looks like a new one and the query now returns it. -- Stas Malyshev smalys...@wikimedia.org ___ Wikidata mailing list Wikidata@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikidata
Re: [Wikidata] Geospatial search for Wikidata Query Service is up
Hi! > Are the distances available somehow? I.e., the distances between the > wikibase:center and the searched locations? There is a distance function - see https://www.mediawiki.org/wiki/Wikidata_query_service/User_Manual#Distance_function - and there will also soon be a way to retrieve distances when searching around a point - right now they are recalculated, which is a waste since the search already knows the distance, but the current version discards it. There's a slight bug in the current version (see the example above for the workaround - the hint to disable analytic mode is required in some situations) - that's why it was not announced yet. But for many applications geof:distance already works. The rest will be implemented and announced soon, probably sometime next week or the week after that. -- Stas Malyshev smalys...@wikimedia.org ___ Wikidata mailing list Wikidata@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikidata
[Wikidata] Geospatial search for Wikidata Query Service is up
Hi! After a number of difficulties and unexpected setbacks[1] I am happy to announce that geospatial search for Wikidata Query Service is now deployed and functional. You can now search for items within a certain radius of a point and within a box defined by two points - more detailed instructions are in the User Manual[2]. See also query examples[3] such as "airports within 100km of Berlin": http://tinyurl.com/zxy8o64 There are still a couple of things to complete, namely sorting by distance (coming soon) and units support (maybe). Overall progress of the task is tracked by T133566[4]. [1] https://lists.wikimedia.org/pipermail/wikidata/2016-May/008674.html [2] https://www.mediawiki.org/wiki/Wikidata_query_service/User_Manual#Geospatial_search [3] https://www.mediawiki.org/wiki/Wikibase/Indexing/SPARQL_Query_Examples#Airports_within_100km_of_Berlin [4] https://phabricator.wikimedia.org/T123565 -- Stas Malyshev smalys...@wikimedia.org ___ Wikidata mailing list Wikidata@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikidata
Re: [Wikidata] SPARQL service timeouts
Hi! > I have the impression that some not-so-easy SPARQL queries that used to > run just below the timeout are now timing out regularly. Has there been > a change in the setup that may have caused this, or are we maybe seeing > increased query traffic [1]? We've recently run on a single server for a couple of days due to the reloading of the second one, so this may have made it a bit slower. But that should be gone now; we're back to two. Other than that, I'm not seeing anything abnormal in https://grafana.wikimedia.org/dashboard/db/wikidata-query-service > [1] The deadline for the Int. Semantic Web Conf. is coming up, so it > might be that someone is running experiments on the system to get their > paper finished. It has been observed for other endpoints that traffic > increases at such times. This community sometimes is the greatest enemy > of its own technology ... (I recently had to IP-block an RDF crawler > from one of my sites after it had ignored robots.txt completely). We don't have any blocks or throttle mechanisms right now. But if we see somebody making a serious negative impact on the service, we may have to change that. -- Stas Malyshev smalys...@wikimedia.org ___ Wikidata mailing list Wikidata@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikidata
Re: [Wikidata] SPARQL service timeouts
Hi! > I'm late to the game, but a quick look into the nginx logs does not > show all that much. I see a few connection refused, but that should > translate in an HTTP 502 error, not in a partial answer. > > I'm really not good at reading VCL, but it seems that we do have some > rules in our Varnish config to cache pages in error. This would make > sense as pages in error tend to be expensive, so we probably want to > ensure the same error is capped at a maximum rate. Note that if the page produced some data and then timed out - which is possible for Blazegraph queries - then by the time the error happens, part of the response has already been sent, so there's no way to set an error HTTP code etc. Thus such responses are not distinguishable from valid replies, at least not without looking into the content. -- Stas Malyshev smalys...@wikimedia.org ___ Wikidata mailing list Wikidata@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikidata
Re: [Wikidata] Date of birth and name correlations
Hi! > Vaguely related: The full list of items "different from" (P1889) another > item sure does give an interesting read though. * > http://tinyurl.com/h23q2et Some pairs there are definitely unexpected. And I'm not sure about these two: https://www.wikidata.org/wiki/Q1787424 https://www.wikidata.org/wiki/Q166542 Some days I have a feeling those should be P460... ;) -- Stas Malyshev smalys...@wikimedia.org ___ Wikidata mailing list Wikidata@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikidata
[Wikidata] Grammatical display of units
Hi! Right now, quantities with units are displayed by attaching the unit name to the number. While it gives the idea of what is going on, it is somewhat ungrammatical in English (83 kilogram, 185 centimetre, etc.) [1] and in other languages - i.e. in Russian it's 83 килограмм, 185 сантиметр - instead of the correct "83 килограмма", "185 сантиметров". For some units, the norms are kind of tricky and fluid (e.g. see [2]), and they are not even identical across all units in the same language, but the common theme is that there are grammatical rules on how to do it and we're ignoring them right now. I think we do have some means to grammatically display numbers - for example, the number of references is displayed correctly in English and Russian. As I understand, it is done by using certain formats in message strings, and these formats are supported in the code in Language classes. So, I wonder if we should maybe have an (optional) property that defines the same format for units? We could then reuse the same code to display units in a proper grammatical way. Alternatively, we could use short unit display [3] - i.e. cm instead of centimetre - and then plurals are not required. However, this relies on units having short names, and for some units short names can be rather obscure, and maybe in some languages short names need grammatical forms too. Given that we do not link unit names, it would be rather confusing (btw, why don't we?). Some units may not have short forms at all. And the short names do not exactly match the languages - rather, they usually match the script (i.e. Cyrillic, or Latin, or Hebrew) - and we may not even have data on which language uses which script, in a useful form. So using short forms is very tricky. Any other ideas on this topic? Do we have a ticket tracking this somewhere? I looked but couldn't find it. [1] http://english.stackexchange.com/questions/22082/are-units-in-english-singular-or-plural [2] https://ru.wikipedia.org/wiki/%D0%9E%D0%B1%D1%81%D1%83%D0%B6%D0%B4%D0%B5%D0%BD%D0%B8%D0%B5_%D0%92%D0%B8%D0%BA%D0%B8%D0%BF%D0%B5%D0%B4%D0%B8%D0%B8:%D0%9E%D1%84%D0%BE%D1%80%D0%BC%D0%BB%D0%B5%D0%BD%D0%B8%D0%B5_%D1%81%D1%82%D0%B0%D1%82%D0%B5%D0%B9#.D0.A1.D0.BA.D0.BB.D0.BE.D0.BD.D0.B5.D0.BD.D0.B8.D0.B5_.D0.B5.D0.B4.D0.B8.D0.BD.D0.B8.D1.86_.D0.B8.D0.B7.D0.BC.D0.B5.D1.80.D0.B5.D0.BD.D0.B8.D1.8F [3] https://phabricator.wikimedia.org/T86528 -- Stas Malyshev smalys...@wikimedia.org ___ Wikidata mailing list Wikidata@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikidata
Re: [Wikidata] Grammatical display of units
Hi! > Where are the names of those units translated at the moment? I assume on the Wikidata items for them; those are just labels for Wikidata items (as units are items). > If these are MediaWiki messages, grammar rules for them can be added > fairly easily. If I can see where they are now, I could probably make a > quite demo patch to show how it can be done. I don't think we can put grammar rules in labels; that's why I proposed a special property as an option. -- Stas Malyshev smalys...@wikimedia.org ___ Wikidata mailing list Wikidata@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikidata
Re: [Wikidata] Grammatical display of units
Hi! > You mean the MediaWiki message processing code? This would probably be Yes, exactly. > powerful enough for units as well, but it works based on message strings > that look a bit like MW template calls. Someone has to enter such > strings for all units (and languages). This would be doable but the > added power comes at the price of more difficult editing of such message > strings instead of plain labels. True. OTOH, we already have non-plain strings in the database - e.g. math formulae - so that would be another example of such strings. It's not ideal but would be a start, and maybe we can have some gadgets later to deal with it :) >> Oh yes :) Russian is one, but I'm sure there are others. >> > > Forgive my ignorance; I was not able to read the example you gave there. Sorry, it's hard to give examples in foreign languages that would be comprehensible :) The gist of it is that Russian, like many other inflected languages, changes nouns by grammatical case, and uses different cases for different numbers of items (i.e. 1, 2, and 5 will use three different cases). Labels are of course in the singular nominative case, which is wrong for many numbers. -- Stas Malyshev smalys...@wikimedia.org ___ Wikidata mailing list Wikidata@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikidata
Re: [Wikidata] Controversy around Wikimania talks
Hi! > Ask yourself what it is about.. It is about the Wikimania talks. What > was done is removing all the Wikimania talks without any discussion. I wonder whether Wikidata is really the best platform to host Wikimania talks and information about them. While I have no doubt these are excellent talks of great interest to the Wiki community, their notability in the larger world is a more difficult question. Specifically, would we create an item for every talk even for a major conference (not considering copyright etc. questions now)? We have a lot of conferences with much wider attendance than Wikimania happening each year. Now, Wikimania is of course special - for the Wiki movement. And having *some* repository for this content and knowledge would be completely appropriate. However, is that repository Wikidata - which purports to be a repository of knowledge of general public interest? I am much less sure of that. Unless we take on the wider mission of accepting data about talks at any conference of note - which may be possible, but I'm not sure whether it should be done... If yes, then of course a clear policy statement to that effect would be helpful - so people like me who are not sure about it would know what the community consensus has arrived at. -- Stas Malyshev smalys...@wikimedia.org ___ Wikidata mailing list Wikidata@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikidata
Re: [Wikidata] An attribute for "famous person"
Hi! On 8/2/16 11:36 PM, Jane Darnell wrote: > Would page props also give me the creation date of the Wikipedia page in > that specific sitelink? Because this is something I needed when I don't think so, and I don't think such data should be in the Wikidata or WDQS database - it's Wikipedia administrative data and should stay there. An external service can combine data from these sources, but I don't think that falls under WDQS's tasks. -- Stas Malyshev smalys...@wikimedia.org ___ Wikidata mailing list Wikidata@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikidata
Re: [Wikidata] An attribute for "famous person"
Hi! > If you think it is best to implement a more general feature that adds > even more properties, then I am sure nobody will complain, but it sounds > like more work to me. The number I was asking for is something that you I don't think it's *much* more work, and I planned to do this work anyway :) Of course, it may happen that I am wrong about how much work it is, and then I might reconsider. > compute the number in a SPARQL query from the RDF. It is a completely > redundant piece of information. It's only purpose is to make SPARQL > queries that currently time out fast. In databases, such things are > called "materialized views". Speaking of which, Blazegraph does have support for inferring data, but I don't want to open that particular can of worms just yet. > This leads to a slightly different perspective than the one you'd have > in T129046. By adding page props, you want to add "new" information from > another source, and questions like data modelling etc. come to the fore. > With a materialized view, you just add some query results back to the > database for technical reasons that are specific to the database. The > two motivations might lead to different requirements at some point > (e.g., if you want to add another materialized query result to the RDF > you may have to extend page props, which involves more dependencies than > if you just extend the RDF converter). While in theory this is true, we don't have any process that allows us to do literal materialized views on the current platform (there are named queries, but that's not the same thing, I think). Inference "kind of" might be that, but doing it that way would probably be very inefficient for this particular case. There are of course other ways to achieve the same thing, so I'll look into various options, but so far page props doesn't sound like that bad an idea to me. -- Stas Malyshev smalys...@wikimedia.org ___ Wikidata mailing list Wikidata@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikidata
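For context, this is the kind of aggregate being discussed: computing it on the fly from the sitelink triples is exactly what tends to time out, which is why a precomputed ("materialized") number is attractive. The class used below (humans, P31/Q5) is only an example, and on the full data set this sketch would very likely exceed the query timeout.

    # Sitelink count per item, computed on the fly from schema:about triples.
    SELECT ?item (COUNT(?article) AS ?sitelinks) WHERE {
      ?item wdt:P31 wd:Q5 .          # humans, illustrative and deliberately large
      ?article schema:about ?item .  # one triple per sitelinked article
    }
    GROUP BY ?item
    ORDER BY DESC(?sitelinks)
    LIMIT 50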
Re: [Wikidata] Grammatical display of units
Hi! > Well, I think we could sidestep the grammar issue by using unit symbols. We True, but what unit symbol does "apple" have? It's actually used as a measure of height (bonus points if you can guess on which item :). Even if we don't go that far: while SI units probably all have short names, for non-SI units, especially older and rarer ones, that may very well not be the case. Another tricky part is that short names are not connected to languages right now. I.e., if your interface language is Serbian, which short name should be used? What if it's Farsi? We'd need to change how we relate units & unit symbols then. > An alternative is to use MediaWiki i18n messages instead of entity labels. > E.g. if the unit is Q11573, we could check if MediaWiki:wikibase-unit-Q11573 > exists, > and if it does, use it. We'd get internationalization including support for > plurals for free. That may work, but the downside is that it is linked to the unit ID - so if we wanted to use it for, say, Commons data, we'd have to somehow link "metre" on Wikidata with "metre" on Commons. -- Stas Malyshev smalys...@wikimedia.org ___ Wikidata mailing list Wikidata@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikidata
Re: [Wikidata] (Ab)use of "deprecated"
Hi! > I would argue that this is better done by using qualifiers (e.g. start > data, end data). If a statement on the population size would be set to > preferred, but isn't monitored for quite some time, it can be difficult > to see if the "preferred" statement is still accurate, whereas a > qualifier would give a better indication that that stament might need an > update. Right now this bot: https://www.wikidata.org/wiki/User:PreferentialBot watches statements like "population" that have multiple values with different time qualifiers but no current preference. What it doesn't currently do is verify that the preferred one refers to the latest date. It probably shouldn't fix such cases (because there may be a valid reason why the latest is not the best, e.g. some population estimates are more precise than others), but it could alert about them. This can be added if needed. -- Stas Malyshev smalys...@wikimedia.org ___ Wikidata mailing list Wikidata@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikidata
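For illustration, here is a sketch of the kind of check such a bot might run: items with more than one time-qualified population value and no statement marked preferred. P1082 (population) and P585 (point in time) follow the email's example; this is not the bot's actual code, and the prefixes are assumed to be the WDQS defaults.

    SELECT ?item (COUNT(?st) AS ?values) WHERE {
      ?item p:P1082 ?st .
      ?st pq:P585 ?when .                       # only time-qualified values
      FILTER NOT EXISTS {
        ?item p:P1082/wikibase:rank wikibase:PreferredRank .
      }
    }
    GROUP BY ?item
    HAVING(COUNT(?st) > 1)
    LIMIT 100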
Re: [Wikidata] Breaking change in JSON serialization?
Hi! > My view is that this tool should be extremely cautious when it sees new data > structures or fields. The tool should certainly not continue to output > facts without some indication that something is suspect, and preferably > should refuse to produce output under these circumstances. I don't think I agree. I find tools that are too picky about details that are not important to me hard to use, and I'd very much prefer a tool where I am in control of which information I need and which I don't need. > What can happen if the tool instead continues to operate without complaint > when new data structures are seen? Consider what would happen if the tool > was written for a version of Wikidata that didn't have rank, i.e., claim > objects did not have a rank name/value pair. If ranks were then added, > consumers of the output of the tool would have no way of distinguishing > deprecated information from other information. Ranks are a bit unusual because they are not just an informational change; they are a semantic change. They introduce the concept of a statement that has different semantics from the rest. Of course, such a change needs to be communicated - it's as if I made a format change like "each string beginning with the letter X needs to be read backwards" but didn't tell the clients. Of course this is a breaking change if it changes semantics. What I was talking about are changes that don't break semantics, and the majority of additions are just that. > Of course this is an extreme case. Most changes to the Wikidata JSON dump > format will not cause such severe problems. However, given the current > situation with how the Wikidata JSON dump format can change, the tool cannot > determine whether any particular change will affect the meaning of what it > produces. Under these circumstances it is dangerous for a tool that > extracts information from the Wikidata JSON dump to continue to produce > output when it sees new data structures. The tool cannot. It's not possible to write a tool that would derive semantics just from the JSON dump, or even detect semantic changes. A semantic change can be anywhere; it doesn't have to be an additional field - it can take the form of changing the meaning of a field, or its format, or datatype, etc. Of course the tool cannot know that - people should know that and communicate it. Again, that's why I think we need to distinguish changes that break semantics from changes that don't, and make the tools robust against the latter - but not the former, because that's impossible. For dealing with the former, there is a known and widely used solution - format versioning. > This does make consuming tools sensitive to changes to the Wikidata JSON > dump format that are "non-breaking". To overcome this problem there should > be a way for tools to distinguish changes to the Wikidata JSON dump format > that do not change the meaning of existing constructs in the dump from those > that can. Consuming tools can then continue to function without problems > for the former kind of change. As I said, format versioning. Maybe even semver or some suitable modification of it. RDF exports BTW already carry a version. Maybe JSON exports should too. -- Stas Malyshev smalys...@wikimedia.org ___ Wikidata mailing list Wikidata@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikidata