I'm late to the game, but a quick look into the nginx logs does not
show all that much. I see a few "connection refused" errors, but those
should translate into an HTTP 502, not into a partial answer.
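
Something like this (just a sketch, untested, reusing the query Markus
posted; the timeout value is arbitrary) should tell a clean HTTP error
such as a 502 apart from a 200 whose body was cut off:

    import json
    import requests

    SPARQL_SERVICE_URL = 'https://query.wikidata.org/sparql'
    query = "SELECT ?subC ?supC WHERE { ?subC p:P279/ps:P279 ?supC }"

    # timeout is an arbitrary safety net, not a WDQS setting
    r = requests.get(SPARQL_SERVICE_URL,
                     params={'query': query, 'format': 'json'},
                     timeout=300)
    print("HTTP status: %d" % r.status_code)
    try:
        json.loads(r.text)
        print("body parses as JSON (%d bytes)" % len(r.content))
    except ValueError:
        print("body does NOT parse -> truncated at %d bytes" % len(r.content))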

I'm really not good at reading VCL, but it seems we do have rules in
our Varnish config to cache error pages. That would make sense: error
pages tend to be expensive to generate, so we probably want to make
sure the same error is not recomputed more often than some maximum rate.
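
And something similar, sending the same request twice, should tell us
whether an error answer really comes back from the Varnish cache on a
retry. Again only a sketch; I'm assuming the standard Age and X-Cache
response headers are visible from the outside:

    import requests

    SPARQL_SERVICE_URL = 'https://query.wikidata.org/sparql'
    query = "SELECT ?subC ?supC WHERE { ?subC p:P279/ps:P279 ?supC }"

    # send the identical request twice and compare the cache headers
    for attempt in (1, 2):
        r = requests.get(SPARQL_SERVICE_URL,
                         params={'query': query, 'format': 'json'},
                         timeout=300)
        print("attempt %d: status=%d, Age=%s, X-Cache=%s" % (
            attempt, r.status_code,
            r.headers.get('Age', '-'), r.headers.get('X-Cache', '-')))

If the second attempt reports a non-zero Age (or a cache hit) together
with the same broken body, that would confirm the 60 second caching is
also applied to errors.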

I'll keep looking. But transient errors are hard...

On Tue, Apr 19, 2016 at 11:44 AM, Addshore <[email protected]> wrote:
> Yes, the size reported there will be the compressed size, so the actual
> bytes over the wire!
>
> Looking further at the patch, it looks like some nginx settings were also
> changed when caching was enabled; those may be worth looking at as well.
>
> On 19 April 2016 at 10:42, Markus Krötzsch <[email protected]>
> wrote:
>>
>> On 19.04.2016 11:33, Addshore wrote:
>>>
>>> Also per https://phabricator.wikimedia.org/T126730 and
>>> https://gerrit.wikimedia.org/r/#/c/274864/8 requests to the query
>>> service are now cached for 60 seconds.
>>> I expect this will include error results from timeouts, so retrying a
>>> request within the same 60 seconds as the first won't even reach the
>>> WDQS servers now.
>>
>>
>> Maybe this could be the answer. Is it possible that the cache stores the
>> truncated result but not the Java exception? Then the behaviour could be a
>> timeout which just is not reported properly. Ideally, partial results should
>> not be cached or the "timeout" should be cached so that a renewed request
>> (in 60sec) returns an immediate timeout rather than a broken result set.
>>
>> Cheers,
>>
>> Markus
>>
>>>
>>> On 19 April 2016 at 10:05, Addshore <[email protected]> wrote:
>>>
>>>     In the case we are discussing here, the truncated JSON is caused by
>>>     Blazegraph deciding it has been sending data for too long and then
>>>     stopping (as I understand it).
>>>     Thus you will only see a spike on the graph for the amount of data
>>>     actually sent from the server, not for the size of the result
>>>     Blazegraph was trying to send back.
>>>
>>>     I also ran into this with some simple queries that returned big sets
>>>     of data.
>>>     Although in my case I did also see a Java exception somewhere.
>>>
>>>     On 18 April 2016 at 21:51, Markus Kroetzsch
>>>     <[email protected]> wrote:
>>>
>>>         On 18.04.2016 22:21, Markus Kroetzsch wrote:
>>>
>>>             On 18.04.2016 21:56, Markus Kroetzsch wrote:
>>>
>>>                 Thanks, the dashboard is interesting.
>>>
>>>                 I am trying to run this query:
>>>
>>>                 SELECT ?subC ?supC WHERE { ?subC p:P279/ps:P279 ?supC }
>>>
>>>                 It is supposed to return a large result set. But I am
>>>                 only running it
>>>                 once per week. It used to work fine, but today I could
>>>                 not get it to
>>>                 succeed a single time.
>>>
>>>
>>>             Actually, the query seems to work as it should. I am
>>>             investigating why I
>>>             get an error in some cases on my machine.
>>>
>>>
>>>         Ok, I found that this is not so easy to reproduce reliably. The
>>>         symptom I am seeing is a truncated JSON response, which just
>>>         stops in the middle of the data (at a random location, but
>>>         usually early on), and which is *not* followed by any error
>>>         message. The stream just ends.
>>>
>>>         So far, I could only get this in Java, not in Python, and it
>>>         does not always happen. If successful, the result is about 250M
>>>         in size. The following Python script can retrieve it:
>>>
>>>         import requests
>>>         SPARQL_SERVICE_URL = 'https://query.wikidata.org/sparql'
>>>         query = """SELECT ?subC ?supC WHERE { ?subC p:P279/ps:P279 ?supC }"""
>>>         print requests.get(SPARQL_SERVICE_URL,
>>>                            params={'query': query, 'format': 'json'}).text
>>>
>>>         (output should be redirected to a file)
>>>
>>>         I will keep an eye on the issue, but I don't know how to debug
>>>         this any further now, since it started to work without me
>>>         changing any code.
>>>
>>>         I also wonder how to read the dashboard after all. Although I have
>>>         repeated an experiment that creates a 250M result file five times in
>>>         the past few minutes, the "Bytes out" figure has stayed below a few
>>>         MB for most of the time.
>>>
>>>
>>>         Markus
>>>
>>>
>>>
>>>                 On 18.04.2016 21:40, Stas Malyshev wrote:
>>>
>>>                     Hi!
>>>
>>>                         I have the impression that some not-so-easy
>>>                         SPARQL queries that used to
>>>                         run just below the timeout are now timing out
>>>                         regularly. Has there been
>>>                         a change in the setup that may have caused this,
>>>                         or are we maybe seeing
>>>                         increased query traffic [1]?
>>>
>>>
>>>                     We've recently been running on a single server for a
>>>                     couple of days due to reloading of the second one, so
>>>                     this may have made it a bit slower. But that should be
>>>                     gone now; we're back to two. Other than that, I'm not
>>>                     seeing anything abnormal in
>>>
>>> https://grafana.wikimedia.org/dashboard/db/wikidata-query-service
>>>
>>>                         [1] The deadline for the Int. Semantic Web Conf.
>>>                         is coming up, so it
>>>                         might be that someone is running experiments on
>>>                         the system to get their
>>>                         paper finished. It has been observed for other
>>>                         endpoints that traffic
>>>                         increases at such times. This community
>>>                         sometimes is the greatest enemy
>>>                         of its own technology ... (I recently had to
>>>                         IP-block an RDF crawler
>>>                         from one of my sites after it had ignored
>>>                         robots.txt completely).
>>>
>>>
>>>                     We don't have any blocks or throttling mechanisms
>>>                     right now. But if we see somebody making a serious
>>>                     negative impact on the service, we may have to change
>>>                     that.
>>>
>>>
>>>
>>>
>>>
>>>
>>>
>>>         --
>>>         Markus Kroetzsch
>>>         Faculty of Computer Science
>>>         Technische Universität Dresden
>>>         +49 351 463 38486
>>>         http://korrekt.org/
>>>
>>>
>>>
>>>
>>>
>>>     --
>>>     Addshore
>>>
>>>
>>>
>>>
>>> --
>>> Addshore
>>>
>>>
>>>
>>
>>
>
>
>
>
> --
> Addshore
>
>



-- 
Guillaume Lederrey
Operations Engineer, Discovery
Wikimedia Foundation

_______________________________________________
Wikidata mailing list
[email protected]
https://lists.wikimedia.org/mailman/listinfo/wikidata
