Re: SPARQL-query to get data from wikidata not working

Lorenz B. Tue, 28 Jan 2020 06:20:25 -0800

Comments and debugging notes inline

On 1/28/20 9:58 AM, Andy Seaborne wrote:
>
>
> On 28/01/2020 07:50, Lorenz Buehmann wrote:
>> Yes, the intermediate result is large, I tried it on CLI:
>>
>> bin/rsparql --service https://query.wikidata.org/sparql "select *
>> {?wikidata_link <http://www.wikidata.org/prop/direct/P18> ?image}" >
>> /tmp/res.sparql
>>
>> It will most likely lead to an OOM error - unless you increase the JVM
>> heap memory. That is because either a large JSON or XML object will be
>> returned here and has to be parsed and processed.
>
> Then one possibility is it is pushing the Fuseki query execution into
> GC overload.  It is possible that the GC is working very hard and
> making minimal progress only to be triggered again for a full-GC.  The
> fuseki start-up script sets the JVM heap size but not to a huge
> amount. 3 million rows might trigger GC problems.  You can override it
> with JVM_ARGS.


Yep, that might be the reason, as was also noted in the comments on the
SO post. It would be interesting to see more of the Fuseki logs. Running
for 1h without any exception might point to this GC issue, but it still
seems unexpected - even for 3 million rows, which is large but nowadays
not really considered huge.


>
>>
>> The subquery hint from Andy is a nice workaround, but indeed you would
>> get only partial results - this might not contain the Wikidata resources
>> from your dataset, thus, the result would be incomplete or even empty.
>>
>
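
To make the trade-off concrete, the subselect workaround would look
roughly like the sketch below. The LIMIT value of 1000 is an arbitrary
number I picked for illustration, and I have not run this exact query.

```sparql
PREFIX wdt: <http://www.wikidata.org/prop/direct/>

SELECT ?wikidata_link ?image
WHERE {
  SERVICE <https://query.wikidata.org/sparql> {
    # The LIMIT is part of the subquery, so it is sent to the remote endpoint.
    SELECT ?wikidata_link ?image
    WHERE { ?wikidata_link wdt:P18 ?image }
    LIMIT 1000
  }
}
```

The remote side returns at most 1000 arbitrary rows, which is why the
entities from the local dataset may simply not be among them.
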
> The thing is that SERVICE used to execute the other way - sending
> several small requests.  Except that causes other problems when the
> query has a large number of possibilities to try from earlier in the
> query.

So it's using bindings from the "outer" query, i.e. inlining data,
right? I remember somebody suggesting to improve this with a smarter
batch approach, i.e. to reduce the number of requests to the remote
service. Is this already integrated, and can it be configured somehow?
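
Just to illustrate what I mean by batching (this is a sketch of the
idea, not a claim about what Jena sends or supports): several candidate
resources could be shipped in one remote request via a VALUES block
instead of one request per binding. The three entities are the ones
from the test data, written with http IRIs.

```sparql
PREFIX wd:  <http://www.wikidata.org/entity/>
PREFIX wdt: <http://www.wikidata.org/prop/direct/>

SELECT ?wikidata_link ?image
WHERE {
  # All candidate entities travel in one request instead of one request each.
  VALUES ?wikidata_link { wd:Q180727 wd:Q154556 wd:Q154770 }
  ?wikidata_link wdt:P18 ?image .
}
```

With a large local dataset the VALUES block would have to be split into
chunks of some maximum size, which is the part that would need to be
configurable.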


Anyway, I had more time now and tried the initial query with the given
data, which in fact contains just 3 Wikidata resources. It finishes
instantly with an empty result. I also set the log level to ALL, so I can
confirm exactly 3 requests like

SELECT  *
  WHERE
    { <https://www.wikidata.org/entity/Q180727>
                <http://www.wikidata.org/prop/direct/P18>  ?image
    }

to the Wikidata service. That means it works as Andy said: single
requests are sent to Wikidata.


So my question is: why would this not work in Fuseki? That seems odd
for such a tiny dataset and just 3 single requests, unless the real
dataset is much larger and would lead to millions of single requests to
Wikidata ...


By the way, the result is empty because the Wikidata URIs in the dataset
are wrong: they must start with http, not https.
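
The direct fix is to correct the wd: prefix in the data. If the data
cannot be changed, a possible workaround is to rewrite the scheme inside
the query before calling the service. This is an untested sketch; the
variable name ?wd_entity is made up for the example, and whether the
binding is passed into SERVICE depends on how the engine executes it.

```sparql
PREFIX owl:  <http://www.w3.org/2002/07/owl#>
PREFIX wdt:  <http://www.wikidata.org/prop/direct/>
PREFIX foaf: <http://xmlns.com/foaf/0.1/>

SELECT ?name ?image
WHERE {
  ?s foaf:name ?name .
  ?s owl:sameAs ?wikidata_link .
  FILTER regex(str(?wikidata_link), "wikidata")
  # Rewrite the scheme so the IRI starts with http, as Wikidata entity IRIs do.
  BIND(IRI(REPLACE(STR(?wikidata_link), "^https://", "http://")) AS ?wd_entity)
  SERVICE <https://query.wikidata.org/sparql> {
    ?wd_entity wdt:P18 ?image .
  }
}
LIMIT 10
```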



>
> GC thrashing sounds possible.
>
>     Andy
>
>>
>> On 27.01.20 23:47, Andy Seaborne wrote:
>>> The query will try to pull a lot of data from
>>> query.wikidata.org/sparql.
>>>
>>> (see comments on the SO query)
>>>
>>> What might help is to write a subselect with LIMIT  inside the SERVICE
>>> and put a limit on that. That pushes a LIMIT to the far end which, as
>>> written, does not happen.
>>>
>>>      Andy
>>>
>>> On 27/01/2020 19:50, jani wrote:
>>>> Hi everybody,
>>>>
>>>> I am trying to get some image links from Wikidata by running a SPARQL
>>>> query from my local Jena Fuseki instance. I want to merge them with
>>>> data from my local graph. Unfortunately, the query doesn't deliver any
>>>> data; it just keeps running without any error message.
>>>>
>>>> The sparql-query:
>>>>
>>>> PREFIX owl: <http://www.w3.org/2002/07/owl#>
>>>> PREFIX wd: <http://www.wikidata.org/entity/>
>>>> PREFIX wdt: <http://www.wikidata.org/prop/direct/>
>>>> PREFIX foaf: <http://xmlns.com/foaf/0.1/>
>>>>
>>>> SELECT ?name ?image WHERE {
>>>>   ?s foaf:name ?name.
>>>>   ?s owl:sameAs ?wikidata_link.
>>>>   FILTER regex(str(?wikidata_link), "wikidata").
>>>>   SERVICE <https://query.wikidata.org/sparql> {
>>>>     ?wikidata_link wdt:P18 ?image.
>>>>   }
>>>> }
>>>> LIMIT 10
>>>>
>>>> The test data I have in my local graph on the Jena Fuseki server:
>>>>
>>>> @base <http://dmt.de/pages> .
>>>> @prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
>>>> @prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
>>>> @prefix foaf: <http://xmlns.com/foaf/0.1/> .
>>>> @prefix dbp: <http://dbpedia.org/resource/> .
>>>> @prefix wd: <https://www.wikidata.org/entity/> .
>>>> @prefix owl: <http://www.w3.org/2002/07/owl#> .
>>>>
>>>> <#john-cage> a foaf:Person ;
>>>>   foaf:name "John Cage";
>>>>   owl:sameAs dbp:John_Cage, wd:Q180727.
>>>> <#karlheinz-stockhausen> a foaf:Person ;
>>>>   foaf:name "Karlheinz Stockhausen";
>>>>   owl:sameAs dbp:Karlheinz_Stockhausen, wd:Q154556.
>>>> <#arnold-schoenberg> a foaf:Person;
>>>>   foaf:name "Arnold Schönberg";
>>>>   owl:sameAs dbp:Arnold_Schoenberg, wd:Q154770.
>>>>
>>>> I tried a similar query for DBpedia data, which ran perfectly.
>>>>
>>>> PREFIX owl: <http://www.w3.org/2002/07/owl#>
>>>> PREFIX dbp: <http://dbpedia.org/resource/>
>>>> PREFIX foaf: <http://xmlns.com/foaf/0.1/>
>>>> PREFIX dbo: <http://dbpedia.org/ontology/>
>>>>
>>>> SELECT ?name ?dbpedia_link ?birthplace WHERE {
>>>>   ?s foaf:name ?name.
>>>>   ?s owl:sameAs ?dbpedia_link.
>>>>   FILTER regex(str(?dbpedia_link),"dbpedia.org").
>>>>   SERVICE <https://dbpedia.org/sparql> {
>>>>     ?dbpedia_link dbo:birthPlace ?birthplace.
>>>>   }
>>>> }
>>>> LIMIT 10
>>>>
>>>> Any Ideas? Thanks in advance!
>>>>
>>>> Jan Seipel
>>>>
>>>> PS: also got this question on stackoverflow:
>>>> https://stackoverflow.com/questions/59937684/sparql-query-to-get-data-from-wikidata-not-working
>>>>
>>>>
>>>>
>>>>
>>
-- 
Lorenz Bühmann
AKSW group, University of Leipzig
Group: http://aksw.org - semantic web research center
