Re: [Wikidata] Make federated queries possible / was: SPARQL CONSTRUCT results truncated

2016-02-19 Thread Neubert, Joachim
Hi Stas,

Thanks for your explanation! I'll perhaps have to do some tests on my own systems ...

Cheers, Joachim

-----Original Message-----
From: Wikidata [mailto:wikidata-boun...@lists.wikimedia.org] On Behalf Of Stas
Malyshev
Sent: Thursday, 18 February 2016 19:12
To: Discussion list for the Wikidata project.
Subject: Re: [Wikidata] Make federated queries possible / was: SPARQL CONSTRUCT
results truncated

Hi!

> Now, obviously endpoints referenced in a federated query via a service 
> clause have to be open - so any attacker could send his queries 
> directly instead of squeezing them through some other endpoint. The 
> only scenario I can think of is that an attacker's IP is already 
> blocked by the attacked site. If (instead of much more common ways to 
> fake an IP) the attacker chose to do it by federated queries 
> through WDQS, this _could_ result in WDQS being blocked by this 
> endpoint.

This is not what we are concerned with. What we are concerned with is that 
federation essentially requires you to run an open proxy - i.e. to allow 
anybody to send requests to any URL. This is not acceptable to us because this 
means somebody could abuse this both to try and access our internal 
infrastructure and to launch attacks to other sites using our site as a 
platform.

We could allow, if there is enough demand, access to specific whitelisted 
endpoints, but so far we haven't found any way to allow access to arbitrary SPARQL 
endpoints without essentially allowing anybody to launch arbitrary network 
connections from our server.

> provide for the linked data cloud. This need not involve the 
> highly-protected production environment, but could be solved by an 
> additional unstable/experimental endpoint under another address.

The problem is we cannot run a production-quality endpoint in a non-production 
environment. We could set up an endpoint on Labs, but this endpoint would 
be underpowered and we would not be able to guarantee any quality of service there. 
To serve the amount of Wikidata data and updates, the machines need 
certain hardware capabilities, which Labs machines currently do not have.

Additionally, I'm not sure running an open proxy even there would be a good idea. 
Unfortunately, in today's internet environment there is no lack of players 
who would want to abuse such a thing for nefarious purposes.

We will keep looking for a solution to this, but so far we haven't found one.

Thanks,
--
Stas Malyshev
smalys...@wikimedia.org

___
Wikidata mailing list
Wikidata@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata


Re: [Wikidata] Make federated queries possible / was: SPARQL CONSTRUCT results truncated

2016-02-18 Thread Ruben Verborgh
Hi Joachim,

> To me, a crucial point seems to be that I'm trying to look up a large number 
> of distinct entities in two endpoints and join them. In the "real life" case 
> discussed here, about 430,000 "economists" extracted from GND and about 
> 320,000 "persons with GND id" from Wikidata. The result of the join is about 
> 30,000 Wikidata items, for which the German and English Wikipedia site links 
> are required.

The query plan a regular TPF client would come up with would probably not 
differ from that of (most) SPARQL federation engines, so they would be 
similarly slow.

However…

You might know that TPF is an interface that allows for auto-discoverable 
extensions. Recently, we published an extension of TPF that uses Bloom filters 
to perform faster joins [1]. The trade-off is that the server needs to perform 
an extra operation (but if this saves thousands of other requests, that might 
be worthwhile). The public implementation works, but is still preliminary; 
however, if there is interest in such cases, we might speed things up. Let us 
know!
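As a rough sketch of the idea (a toy Bloom filter over made-up URIs; the real AMF interface differs in its details, so treat this purely as an illustration of why the trade-off can pay off):

```python
import hashlib

class BloomFilter:
    """Toy Bloom filter; for illustration only, not the AMF implementation."""
    def __init__(self, size_bits=8192, num_hashes=4):
        self.size = size_bits
        self.k = num_hashes
        self.bits = bytearray(size_bits // 8)

    def _positions(self, item):
        # Derive k bit positions from salted SHA-256 digests.
        for i in range(self.k):
            h = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(h[:8], "big") % self.size

    def add(self, item):
        for p in self._positions(item):
            self.bits[p // 8] |= 1 << (p % 8)

    def might_contain(self, item):
        # False positives are possible; false negatives are not.
        return all(self.bits[p // 8] & (1 << (p % 8)) for p in self._positions(item))

# The server publishes a filter over the join-column values it holds ...
server_ids = {f"wd:Q{i}" for i in range(0, 1000, 3)}
bf = BloomFilter()
for uri in server_ids:
    bf.add(uri)

# ... and the client discards candidates that certainly cannot join,
# before spending any HTTP requests on them.
candidates = [f"wd:Q{i}" for i in range(10)]
likely = [c for c in candidates if bf.might_contain(c)]
```

Every true member passes the filter, so the client only risks a few wasted requests on false positives, never a missed join result.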

Best,

Ruben

[1] http://linkeddatafragments.org/publications/iswc2015-amf.pdf


Re: [Wikidata] Make federated queries possible / was: SPARQL CONSTRUCT results truncated

2016-02-18 Thread Stas Malyshev
Hi!

> Now, obviously endpoints referenced in a federated query via a
> service clause have to be open - so any attacker could send his
> queries directly instead of squeezing them through some other
> endpoint. The only scenario I can think of is that an attacker's IP
> is already blocked by the attacked site. If (instead of much more
> common ways to fake an IP) the attacker chose to do it by
> federated queries through WDQS, this _could_ result in WDQS being
> blocked by this endpoint.

This is not what we are concerned with. What we are concerned with is
that federation essentially requires you to run an open proxy - i.e. to
allow anybody to send requests to any URL. This is not acceptable to us
because this means somebody could abuse this both to try and access our
internal infrastructure and to launch attacks to other sites using our
site as a platform.

We could allow, if there is enough demand, access to specific
whitelisted endpoints, but so far we haven't found any way to allow
access to arbitrary SPARQL endpoints without essentially allowing anybody to
launch arbitrary network connections from our server.
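The whitelisting idea could, hypothetically, look like the following check in front of every SERVICE dispatch (the host names are invented for illustration; WDQS has no such list):

```python
from urllib.parse import urlparse

# Hypothetical allow-list of federation targets; entries are invented.
ALLOWED_SERVICE_HOSTS = {"sparql.example.org", "data.example.net"}

def service_allowed(url: str) -> bool:
    """Permit a SERVICE call only if it targets a whitelisted host over HTTP(S).

    Everything else, including internal IPs and non-HTTP schemes, is refused,
    so the engine cannot be used as an open proxy.
    """
    parts = urlparse(url)
    return parts.scheme in ("http", "https") and (parts.hostname or "") in ALLOWED_SERVICE_HOSTS

print(service_allowed("https://sparql.example.org/sparql"))  # allowed host
print(service_allowed("http://10.0.0.5/internal"))           # internal IP, refused
```

The key design point is that the check is a closed allow-list, not a block-list: anything not explicitly whitelisted, including requests aimed at internal infrastructure, is rejected.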

> provide for the linked data cloud. This need not involve the
> highly-protected production environment, but could be solved by an
> additional unstable/experimental endpoint under another address.

The problem is we cannot run a production-quality endpoint in a
non-production environment. We could set up an endpoint on Labs, but
this endpoint would be underpowered and we would not be able to guarantee
any quality of service there. To serve the amount of Wikidata data and
updates, the machines need certain hardware capabilities, which
Labs machines currently do not have.

Additionally, I'm not sure running an open proxy even there would be a good
idea. Unfortunately, in today's internet environment there is no
lack of players who would want to abuse such a thing for nefarious purposes.

We will keep looking for a solution to this, but so far we haven't found one.

Thanks,
-- 
Stas Malyshev
smalys...@wikimedia.org



Re: [Wikidata] Make federated queries possible / was: SPARQL CONSTRUCT results truncated

2016-02-18 Thread Neubert, Joachim
Dear Ruben,

LDF seems a very promising solution for building a reliable Linked Data production 
environment with high scalability at relatively low cost.

However, I'm not sure whether the solution works well on queries like the ones 
discussed here (see below). It would be very interesting to learn how exactly 
such a query would be handled in an LDF client/server setting.

To me, a crucial point seems to be that I'm trying to look up a large number of 
distinct entities in two endpoints and join them. In the "real life" case 
discussed here, about 430,000 "economists" extracted from GND and about 320,000 
"persons with GND id" from Wikidata. The result of the join is about 30,000 
Wikidata items, for which the German and English Wikipedia site links are 
required.

How could an LDF client get this information effectively?

Cheers, Joachim

> PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
> PREFIX schema: <http://schema.org/>
> #
> construct {
>?gnd schema:about ?sitelink .
> }
> where {
># the relevant wikidata items have already been 
># identified and loaded to the econ_pers endpoint in a 
># previous step
>service <http://zbw.eu/beta/sparql/econ_pers/query> {
>  ?gnd skos:prefLabel [] ;
>   skos:exactMatch ?wd .
>  filter(contains(str(?wd), 'wikidata'))
>}
>?sitelink schema:about ?wd ;
>  schema:inLanguage ?language .
>filter (contains(str(?sitelink), 'wikipedia'))
>filter (lang(?wdLabel) = ?language && ?language in ('en', 'de')) }
>
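The join this query expresses, matching the ?wd values from the econ_pers endpoint against sitelinks in Wikidata, can be sketched client-side as a hash join (a minimal sketch with made-up bindings; the real binding sets hold roughly 430,000 and 320,000 rows):

```python
# Hypothetical bindings, standing in for results from the two endpoints.
gnd_to_wd = [("gnd:ex1", "wd:Q1"), ("gnd:ex2", "wd:Q2"), ("gnd:ex3", "wd:Q5")]
wd_sitelinks = [("wd:Q1", "enwiki:A", "en"),
                ("wd:Q2", "dewiki:B", "de"),
                ("wd:Q4", "enwiki:C", "en")]

# Build a hash index on the join variable (?wd), keeping only en/de links.
index = {}
for wd, link, lang in wd_sitelinks:
    if lang in ("en", "de"):
        index.setdefault(wd, []).append(link)

# Probe with the other side: one dict lookup per row instead of a nested scan.
result = [(gnd, link) for gnd, wd in gnd_to_wd for link in index.get(wd, [])]
# result == [("gnd:ex1", "enwiki:A"), ("gnd:ex2", "dewiki:B")]
```

With an index on the smaller side, the join costs one lookup per probe row; the hard part in the federated setting is not the join itself but shipping hundreds of thousands of bindings between endpoints.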

-----Original Message-----
From: Ruben Verborgh [mailto:ruben.verbo...@ugent.be] 
Sent: Thursday, 18 February 2016 14:02
To: wikidata@lists.wikimedia.org
Cc: Neubert, Joachim
Subject: Re: [Wikidata] Make federated queries possible / was: SPARQL CONSTRUCT 
results truncated

Dear all,

I don't mean to hijack the thread, but for federation purposes, you might be 
interested in a Triple Pattern Fragments interface [1]. TPF offers lower server 
cost to reach high availability, at the expense of slower queries and higher 
bandwidth [2]. This is possible because the client performs most of the query 
execution.

I noticed the Wikidata SPARQL endpoint has had an excellent track record so far 
(congratulations on this), so the TPF solution might not be necessary for 
server cost / availability reasons.

However, TPF is an excellent solution for federated queries. In (yet to be 
published) experiments, we have verified that the TPF client/server solution 
performs on par with state-of-the-art federation frameworks based on SPARQL 
endpoints for many simple and complex queries. Furthermore, there are no 
security problems etc. ("open proxy"), because all federation is performed by 
the client.

You can see a couple of example queries here with other datasets:
- Works by writers born in Stockholm (VIAF and DBpedia - http://bit.ly/writers-stockholm)
- Books by Swedish Nobel prize winners that are in the Harvard Library (VIAF, DBpedia, Harvard - http://bit.ly/swedish-nobel-harvard)

It might be a quick win to set up a TPF interface on top of the existing SPARQL 
endpoint.
If you want any info, don't hesitate to ask.

Best,

Ruben

[1] http://linkeddatafragments.org/in-depth/
[2] http://linkeddatafragments.org/publications/iswc2014.pdf



Re: [Wikidata] Make federated queries possible / was: SPARQL CONSTRUCT results truncated

2016-02-18 Thread Ruben Verborgh
Dear all,

I don't mean to hijack the thread, but for federation purposes, you might be 
interested in a Triple Pattern Fragments interface [1]. TPF offers lower server 
cost to reach high availability, at the expense of slower queries and higher 
bandwidth [2]. This is possible because the client performs most of the query 
execution.

I noticed the Wikidata SPARQL endpoint has had an excellent track record so far 
(congratulations on this), so the TPF solution might not be necessary for 
server cost / availability reasons.

However, TPF is an excellent solution for federated queries. In (yet to be 
published) experiments, we have verified that the TPF client/server solution
performs on par with state-of-the-art federation frameworks based on SPARQL 
endpoints for many simple and complex queries. Furthermore, there are no 
security problems etc. ("open proxy"), because all federation is performed by 
the client.
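A minimal sketch of how such a client proceeds, with toy in-memory "fragments" standing in for the HTTP requests a TPF server would answer (all names and data are invented):

```python
# A toy fragment source: returns triples matching a pattern (None = wildcard),
# simulating one Triple Pattern Fragments request.
DATA_B = [
    ("wd:Q1", "schema:about", "enwiki:A"),
    ("wd:Q2", "schema:about", "dewiki:B"),
]

def fragment(s=None, p=None, o=None, data=DATA_B):
    return [t for t in data
            if (s is None or t[0] == s)
            and (p is None or t[1] == p)
            and (o is None or t[2] == o)]

# Bind join, as a TPF client would do it: take each solution from the first
# source and ask the second source only for triples matching that binding.
solutions_a = [{"wd": "wd:Q1"}, {"wd": "wd:Q3"}]
joined = []
for mu in solutions_a:
    for s, p, o in fragment(s=mu["wd"], p="schema:about"):
        joined.append({**mu, "sitelink": o})
# joined == [{"wd": "wd:Q1", "sitelink": "enwiki:A"}]
```

Because each server only ever answers single triple-pattern requests, the federation logic (and any security exposure that comes with it) lives entirely in the client.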

You can see a couple of example queries here with other datasets:
– Works by writers born in Stockholm (VIAF and DBpedia – 
http://bit.ly/writers-stockholm)
– Books by Swedish Nobel prize winners that are in the Harvard Library (VIAF, 
DBpedia, Harvard – http://bit.ly/swedish-nobel-harvard)

It might be a quick win to set up a TPF interface on top of the existing SPARQL 
endpoint.
If you want any info, don't hesitate to ask.

Best,

Ruben

[1] http://linkeddatafragments.org/in-depth/
[2] http://linkeddatafragments.org/publications/iswc2014.pdf


[Wikidata] Make federated queries possible / was: SPARQL CONSTRUCT results truncated

2016-02-18 Thread Neubert, Joachim
From Stas' answer to https://phabricator.wikimedia.org/T127070 I learned the 
Wikidata Query Service does not "allow external federated queries ... for 
security reasons (it's basically open proxy)."

Now, obviously endpoints referenced in a federated query via a service clause 
have to be open - so any attacker could send his queries directly instead of 
squeezing them through some other endpoint. The only scenario I can think of is 
that an attacker's IP is already blocked by the attacked site. If (instead of 
much more common ways to fake an IP) the attacker chose to do it by 
federated queries through WDQS, this _could_ result in WDQS being blocked by 
this endpoint.

This is a quite unlikely scenario - in the seven years I have been on SPARQL 
mailing lists, I cannot remember this kind of attack ever having been reported - 
but of course it is legitimate to secure production environments against any 
conceivable attack vector.

However, I think it should be possible to query Wikidata with this kind of 
query. Federated SPARQL queries are a basic building block for Linked Open 
Data, and blocking them breaks many uses Wikidata could provide for the linked 
data cloud. This need not involve the highly-protected production environment, 
but could be solved by an additional unstable/experimental endpoint under 
another address.

As an additional illustrative argument: there is an immense difference between 
referencing something in a service clause and getting a result in a few 
seconds, versus having to use the Wikidata Toolkit. To get the initial query for 
this thread answered by the example program Markus kindly provided at 
https://github.com/Wikidata/Wikidata-Toolkit-Examples/blob/master/src/examples/DataExtractionProcessor.java
 (and which worked perfectly - thanks again!), it took me 
- more than five hours to download the dataset (in my work environment, wired to 
the DFN network)
- 20 min to execute the query
- considerable time to fiddle with the Java code for the query if I had to 
adapt it (+ another 20 min to execute it again)

For many parts of the world, or even for users in Germany with a slow DSL 
connection, the first point alone would prohibit any use. And even with a good 
internet connection, a new or occasional user would quite probably turn away 
when offered this procedure instead of getting a "normal" LOD conformant query 
answered in a few seconds.

Again, I very much value your work and your determination to set up a service 
with very high availability and performance. Please, make the great Wikidata 
LOD available in less demanding settings, too. It should be possible for users 
to run more advanced SPARQL queries for LOD uses in an environment where you 
cannot guarantee a high level of reliability.

Cheers, Joachim

-----Original Message-----
From: Wikidata [mailto:wikidata-boun...@lists.wikimedia.org] On Behalf Of 
Neubert, Joachim
Sent: Tuesday, 16 February 2016 15:48
To: 'Discussion list for the Wikidata project.'
Subject: Re: [Wikidata] SPARQL CONSTRUCT results truncated

Thanks Markus, I've created https://phabricator.wikimedia.org/T127070 with the 
details.

-----Original Message-----
From: Wikidata [mailto:wikidata-boun...@lists.wikimedia.org] On Behalf Of 
Markus Krötzsch
Sent: Tuesday, 16 February 2016 14:57
To: Discussion list for the Wikidata project.
Subject: Re: [Wikidata] SPARQL CONSTRUCT results truncated

Hi Joachim,

I think SERVICE queries should be working, but maybe Stas knows more about 
this. Even if they are disabled, this should result in some message rather 
than a NullPointerException. Looks like a bug.

Markus


On 16.02.2016 13:56, Neubert, Joachim wrote:
> Hi Markus,
>
> Great that you checked that out. I can confirm that the simplified query 
> worked for me, too. It took 15.6s and returned roughly the same number of 
> results (323,789).
>
> When I loaded the results into http://zbw.eu/beta/sparql/econ_pers/query, an 
> endpoint for "economics-related" persons, it matched 36,050 persons 
> (supposedly the "most important" 8 percent of our set).
>
> What I normally would do to get the corresponding Wikipedia site URLs is a query 
> against the Wikidata endpoint, which references the relevant Wikidata URIs 
> via a "service" clause:
>
> PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
> PREFIX schema: <http://schema.org/>
> #
> construct {
>?gnd schema:about ?sitelink .
> }
> where {
>service <http://zbw.eu/beta/sparql/econ_pers/query> {
>  ?gnd skos:prefLabel [] ;
>   skos:exactMatch ?wd .
>  filter(contains(str(?wd), 'wikidata'))
>}
>?sitelink schema:about ?wd ;
>  schema:inLanguage ?language .
>filter (contains(str(?sitelink), 'wikipedia'))
>filter (lang(?wdLabel) = ?language && ?language in ('en', 'de')) }
>
> This, however, results in a Java error.
>
> If "service" clauses are supposed to work in the wikidata endpoint, I'd 
> happily provide addtitional details