I found an upstream issue on Kubernetes: https://github.com/kubernetes/kubernetes/issues/48998

Would volume mount errors contribute to the 404 count?

On Mon, 14 Aug 2017 at 11:55 Andrew Lau <and...@andrewklau.com> wrote:

> On Mon, 14 Aug 2017 at 02:10 Clayton Coleman <ccole...@redhat.com> wrote:
>
>> On Aug 13, 2017, at 12:07 AM, Andrew Lau <and...@andrewklau.com> wrote:
>>
>> I ended up finding that one of my rules had a wrong filter that was
>> returning a +inf value. Most of the errors (the count being over 2000) were
>> 404 on QGET, but they fell well below 0.01% after I fixed the wrong filter
>> rule. Is it normal to get this number of 404 requests?
>>
>> I would expect you to only see a QGET 404 when a user asks for a key that
>> is not there. That often happens when two components are racing to observe
>> / react to watch changes and one is deleted before the other, or a user
>> mistypes a command. So I would not expect zero, but it should be
>> proportional to the amount of change in the cluster.
>
> I'm seeing a lot of these errors on nodes. It looks like a CronJob has
> already finished and the node is trying to clean up a container that has
> already stopped. Would these be what contributes to the 404 errors?
>
> dockerd-current[4909]: E0814 11:15:32.934040 12646 remote_runtime.go:109]
> StopPodSandbox "c2974678a43dd96483effabc95a0f27018b65f6bd527e451f0aba0d81e43404c"
> from runtime service failed: rpc error: code = 2 desc = NetworkPlugin cni
> failed to teardown pod "test-cron-1502552700-93sv9_test" network: CNI failed
> to retrieve network namespace path: Error: No such container:
> c2974678a43dd96483effabc95a0f27018b65f6bd527e451f0aba0d81e43404c
> E0814 11:15:32.934053 12646 kuberuntime_gc.go:138] Failed to stop sandbox
> "c2974678a43dd96483effabc95a0f27018b65f6bd527e451f0aba0d81e43404c" before
> removing: rpc error: code = 2 desc = NetworkPlugin cni failed to teardown
> pod "test-cron-1502552700-93sv9_test" network: CNI failed to retrieve
> network namespace path: Error: No such container:
> c2974678a43dd96483effabc95a0f27018b65f6bd527e451f0aba0d81e43404c
>
>> This is only on a staging cluster, so the number of keys is small. There
>> are only about 100 namespaces with dummy objects.
>>
>> I went through the etcd logs and didn't see anything abnormal. API server
>> response times jumped to around 0.15s for GET and 1.5s for POST. This is
>> approximately a 50% increase from what I saw with 1.5.
>>
>> What percentile? Ideally RTT as measured on etcd would be 2-4ms for a
>> simple single-item get at the 50th and 90th. Lists will be substantially
>> larger, and QGET also reports lists, so the distribution isn't well
>> represented.
>
> 99th percentile. I was using:
>
> histogram_quantile(.99, sum without (instance,resource)
> (apiserver_request_latencies_bucket{verb!~"CONNECT|WATCHLIST|WATCH|PROXY"})
> ) / 1e6
>
>> If that's 1.5 seconds for POST and isn't "max", something is horribly
>> wrong.
>
> It was averaging 1.5 seconds; however, since pulling down the latest etcd
> containers it seems to have come back down to 0.20.
>
>> Here's the 99th percentile for a very large production cluster in
>> milliseconds.
>> Pod was around 100ms, but there haven't been any pods created in the last
>> 30 minutes:
>>
>> Element                                                                                Value
>> {job="kubernetes-apiservers",quantile="0.99",resource="pods",verb="PATCH"}             4.225
>> {job="kubernetes-apiservers",quantile="0.99",resource="pods",verb="POST"}              NaN
>> {job="kubernetes-apiservers",quantile="0.99",resource="pods",verb="PUT"}               97.814
>> {job="kubernetes-apiservers",quantile="0.99",resource="pods",verb="CONNECT"}           NaN
>> {job="kubernetes-apiservers",quantile="0.99",resource="pods",verb="DELETE"}            77.235
>> {job="kubernetes-apiservers",quantile="0.99",resource="pods",verb="DELETECOLLECTION"}  11.649
>> {job="kubernetes-apiservers",quantile="0.99",resource="pods",verb="GET"}               217.616
>> {job="kubernetes-apiservers",quantile="0.99",resource="pods",verb="LIST"}              NaN
>>
>> Prometheus query: sum without (instance)
>> (apiserver_request_latencies_summary{verb!="WATCH",resource="pods",quantile="0.99"})/1000
>>
>> Note that GET is unusually high; we're not positive it isn't being
>> misreported in Prometheus in 3.6.
>
> Element                                                                                Value
> {job="kubernetes-apiservers",quantile="0.99",resource="pods",verb="PUT"}               28.592
> {job="kubernetes-apiservers",quantile="0.99",resource="pods",verb="DELETE"}            NaN
> {job="kubernetes-apiservers",quantile="0.99",resource="pods",verb="CONNECT"}           NaN
> {job="kubernetes-apiservers",quantile="0.99",resource="pods",verb="GET"}               25.11
> {job="kubernetes-apiservers",quantile="0.99",resource="pods",verb="WATCHLIST"}         569000.319
> {job="kubernetes-apiservers",quantile="0.99",resource="pods",verb="POST"}              NaN
> {job="kubernetes-apiservers",quantile="0.99",resource="pods",verb="LIST"}              NaN
> {job="kubernetes-apiservers",quantile="0.99",resource="pods",verb="DELETECOLLECTION"}  NaN
>
>> There were 3 leader changes, as I tried upgrading the docker etcd_container
>> to see if that would fix the issue (for some reason the ansible upgrade
>> script doesn't upgrade this(?)).
>>
>> On Sun, 13 Aug 2017 at 01:23 Clayton Coleman <ccole...@redhat.com> wrote:
>>
>>> How big is your etcd working set in terms of number of keys? How many
>>> namespaces? If keys < 50k, then I would suspect a software, hardware, or
>>> network issue in between the masters and etcd. HTTP etcd failures should
>>> only happen when the master is losing elections and being turned over, or
>>> the election/heartbeat timeout is too low for your actual network latency.
>>> Double-check the etcd logs and verify that you aren't seeing any election
>>> failures or turnover.
>>>
>>> What metrics is the apiserver side returning related to etcd latencies
>>> and failures?
>>>
>>> On Aug 12, 2017, at 11:07 AM, Andrew Lau <and...@andrewklau.com> wrote:
>>>
>>> etcd data is on dedicated drives and AWS reports idle and burst capacity
>>> around 90%
>>>
>>> On Sun, 13 Aug 2017 at 00:28 Clayton Coleman <ccole...@redhat.com> wrote:
>>>
>>>> Check how much IO is being used by etcd and how much you have
>>>> provisioned.
>>>>
>>>> > On Aug 12, 2017, at 5:32 AM, Andrew Lau <and...@andrewklau.com> wrote:
>>>> >
>>>> > Post upgrade to 3.6, I'm noticing the API server seems to be responding
>>>> > a lot slower, and my etcd metric etcd_http_failed_total is returning a
>>>> > large number of failed GET requests.
>>>> >
>>>> > Has anyone seen this?
>>>> > _______________________________________________
>>>> > users mailing list
>>>> > users@lists.openshift.redhat.com
>>>> > http://lists.openshift.redhat.com/openshiftmm/listinfo/users
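
For anyone hitting the same thing, a few Prometheus queries that may help
separate the 404 count from the latency question. The metric and label names
below (etcd_http_received_total / etcd_http_failed_total with "method" and
"code" labels, etcd_server_leader_changes_seen_total, and the apiserver-side
etcd_request_latencies_summary) are assumptions based on the etcd 3.x and
Kubernetes 1.6-era exporters, so adjust them to whatever your build actually
exposes:

  # Fraction of etcd v2 HTTP requests failing with a 404, per method
  # (QGET is the method in question in this thread).
  sum by (method) (rate(etcd_http_failed_total{code="404"}[5m]))
    / sum by (method) (rate(etcd_http_received_total[5m]))

  # Leader turnover over the last hour, to rule out election churn.
  increase(etcd_server_leader_changes_seen_total[1h])

  # apiserver-side view of etcd latency: 99th percentile in milliseconds,
  # worst instance per operation/type (the summary is reported in microseconds).
  max by (operation, type) (etcd_request_latencies_summary{quantile="0.99"}) / 1000

If the 404 ratio stays small and tracks CronJob/namespace churn, that matches
the point above that QGET 404s are proportional to the amount of change in the
cluster; a rising leader-change count, or a large gap between etcd-side and
apiserver-side latencies, would point back at the network or disks between the
masters and etcd.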
_______________________________________________
users mailing list
users@lists.openshift.redhat.com
http://lists.openshift.redhat.com/openshiftmm/listinfo/users