On 19.04.2016 11:05, Addshore wrote:
In the case we are discussing here, the truncated JSON is caused by
Blazegraph deciding it has been sending data for too long and then
stopping (as I understand it).
Thus you will only see a spike on the graph for the amount of data
actually sent from the server, not for the size of the result
Blazegraph was trying to send back.

I successfully retrieved five files of 250M JSON each, but even those successful queries did not show up in the stats. The five files came in three different versions (with slightly different sizes), so they did not all come from a common cache either. Maybe the size is counted in terms of compressed or otherwise "raw" results?
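For results of this size, a streaming download keeps memory use flat and writes whatever the server actually sends, so a truncated transfer simply produces a shorter file. A minimal sketch (the function name and chunk size are my own choices; only the standard `requests` API is assumed):

```python
import requests

SPARQL_SERVICE_URL = 'https://query.wikidata.org/sparql'
QUERY = 'SELECT ?subC ?supC WHERE { ?subC p:P279/ps:P279 ?supC }'

def download_result(query, out_path, chunk_size=1 << 16):
    """Stream a SPARQL JSON result to disk chunk by chunk.

    With stream=True the ~250M response is never held in memory; if the
    server cuts the stream short, the file simply ends where the data
    stopped.
    """
    resp = requests.get(SPARQL_SERVICE_URL,
                        params={'query': query, 'format': 'json'},
                        stream=True, timeout=600)
    resp.raise_for_status()
    with open(out_path, 'wb') as f:
        for chunk in resp.iter_content(chunk_size=chunk_size):
            f.write(chunk)
    resp.close()
```

This replaces the shell redirection of a single `.text` read and also avoids any encoding round-trip, since the bytes are written as received.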


I also ran into this with some simple queries that returned large
result sets, although in my case I did also see a Java exception
somewhere.

I know the case where large result sets end in a Java timeout exception; this happens reproducibly when you retrieve all humans or something like that. In my case, however, the behaviour is not always reproducible and there is no Java exception at the end of the output: it just stops in the middle of the file.
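Since the stream ends without any error message, one way to tell a truncated file from a complete one is simply to try to parse it: a cut-off response cannot be valid JSON. A small sketch (the helper name is mine, not part of any WDQS API):

```python
import json

def is_complete_json(path):
    """Check whether a downloaded result file is complete JSON.

    A response that stops in the middle of the data cannot parse, so a
    failed json.load() distinguishes truncation from a normal result
    even when no error message follows the output.
    """
    try:
        with open(path, encoding='utf-8') as f:
            json.load(f)
        return True
    except ValueError:  # json.JSONDecodeError subclasses ValueError
        return False
```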

Markus


On 18 April 2016 at 21:51, Markus Kroetzsch
<[email protected]>
wrote:

    On 18.04.2016 22:21, Markus Kroetzsch wrote:

        On 18.04.2016 21:56, Markus Kroetzsch wrote:

            Thanks, the dashboard is interesting.

            I am trying to run this query:

            SELECT ?subC ?supC WHERE { ?subC p:P279/ps:P279 ?supC }

            It is supposed to return a large result set. But I am only
            running it
            once per week. It used to work fine, but today I could not
            get it to
            succeed a single time.


        Actually, the query seems to work as it should. I am
        investigating why I
        get an error in some cases on my machine.


    Ok, I found that this is not so easy to reproduce reliably. The
    symptom I am seeing is a truncated JSON response, which just stops
    in the middle of the data (at a random location, but usually early
    on), and which is *not* followed by any error message. The stream
    just ends.

    So far, I could only get this in Java, not in Python, and it does
    not always happen. If successful, the result is about 250M in size.
    The following Python script can retrieve it:

    import requests
    SPARQL_SERVICE_URL = 'https://query.wikidata.org/sparql'
    query = """SELECT ?subC ?supC WHERE { ?subC p:P279/ps:P279 ?supC }"""
    print(requests.get(SPARQL_SERVICE_URL,
                       params={'query': query, 'format': 'json'}).text)

    (output should be redirected to a file)

    I will keep an eye on the issue, but I don't know how to debug this
    any further now, since it started to work without me changing any code.

    I also wonder how to read the dashboard after all. Although I
    have repeated an experiment that creates a 250M result file five
    times in the past few minutes, the "Bytes out" figure remains
    below a few MB most of the time.


    Markus



            On 18.04.2016 21:40, Stas Malyshev wrote:

                Hi!

                    I have the impression that some not-so-easy SPARQL
                    queries that used to
                    run just below the timeout are now timing out
                    regularly. Has there been
                    a change in the setup that may have caused this, or
                    are we maybe seeing
                    increased query traffic [1]?


                We've recently run on a single server for a couple of
                days due to reloading of the second one, so this may
                have made it a bit slower. But that should be gone now,
                we're back to two. Other than that, not seeing anything
                abnormal in
                
https://grafana.wikimedia.org/dashboard/db/wikidata-query-service

                    [1] The deadline for the Int. Semantic Web Conf. is
                    coming up, so it
                    might be that someone is running experiments on the
                    system to get their
                    paper finished. It has been observed for other
                    endpoints that traffic
                    increases at such times. This community sometimes is
                    the greatest enemy
                    of its own technology ... (I recently had to
                    IP-block an RDF crawler
                    from one of my sites after it had ignored robots.txt
                    completely).


                We don't have any blocks or throttle mechanisms right
                now. But if we see somebody making a serious negative
                impact on the service, we may have to change that.







    --
    Markus Kroetzsch
    Faculty of Computer Science
    Technische Universität Dresden
    +49 351 463 38486
    http://korrekt.org/

    _______________________________________________
    Wikidata mailing list
    [email protected]
    https://lists.wikimedia.org/mailman/listinfo/wikidata




--
Addshore





