On Mon, 14 Aug 2017 at 02:10 Clayton Coleman <[email protected]> wrote:

>
>
> On Aug 13, 2017, at 12:07 AM, Andrew Lau <[email protected]> wrote:
>
> I ended up finding that one of my rules had a wrong filter that was
> returning a +inf value.  Most of the errors (a count of over 2000) were
> 404s on QGET, but the rate fell well below 0.01% after I fixed the wrong
> filter rule.  Is it normal to get this number of 404 requests?
>
>
> I would expect you to only see QGET 404s when a user asks for a key that is
> not there.  That often happens when two components are racing to observe /
> react to watch changes and one is deleted before the other, or a user
> mistypes a command.  So I would not expect zero, but it should be
> proportional to the amount of change in the cluster.
>

I'm seeing a lot of these errors on nodes. It looks like a CronJob has
already finished and the node is trying to clean up a container that's
already stopped. Would these be contributing to the 404 errors?

dockerd-current[4909]: E0814 11:15:32.934040   12646 remote_runtime.go:109]
StopPodSandbox
"c2974678a43dd96483effabc95a0f27018b65f6bd527e451f0aba0d81e43404c" from
runtime service failed: rpc error: code = 2 desc = NetworkPlugin cni failed
to teardown pod "test-cron-1502552700-93sv9_test" network: CNI failed to
retrieve network namespace path: Error: No such container:
c2974678a43dd96483effabc95a0f27018b65f6bd527e451f0aba0d81e43404c
E0814 11:15:32.934053   12646 kuberuntime_gc.go:138] Failed to stop sandbox
"c2974678a43dd96483effabc95a0f27018b65f6bd527e451f0aba0d81e43404c" before
removing: rpc error: code = 2 desc = NetworkPlugin cni failed to teardown
pod "test-cron-1502552700-93sv9_test" network: CNI failed to retrieve
network namespace path: Error: No such container:
c2974678a43dd96483effabc95a0f27018b65f6bd527e451f0aba0d81e43404c
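
For reference, a rough way to keep an eye on the QGET 404 proportion mentioned
earlier, assuming the etcd2 HTTP metrics etcd_http_failed_total and
etcd_http_received_total carry the method and code labels:

  sum(rate(etcd_http_failed_total{method="QGET",code="404"}[5m]))
/ sum(rate(etcd_http_received_total{method="QGET"}[5m]))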



>
>
> This is only on a staging cluster, so the number of keys is small. There
> are only about ~100 namespaces with dummy objects.
>
> I went through the etcd logs and didn't see anything abnormal. API server
> response times jumped to around 0.15s for GET and 1.5s for POST. That's
> approximately a 50% increase from what I saw on 1.5.
>
>
> What percentile?  Ideally RTT as measured on etcd would be 2-4ms for a
> simple single-item get at the 50th and 90th percentiles.  Lists will be
> substantially larger, and QGET also reports lists, so the distribution
> isn't well represented.
>

99th percentile.  The query I was using:

histogram_quantile(0.99, sum without (instance, resource)
  (apiserver_request_latencies_bucket{verb!~"CONNECT|WATCHLIST|WATCH|PROXY"})
) / 1e6
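
Side note: since that takes the quantile over the raw cumulative buckets, it
reflects every request since the apiserver started.  For a recent window, I
believe the usual pattern is to wrap the buckets in rate(), roughly:

histogram_quantile(0.99,
  sum by (le, verb) (
    rate(apiserver_request_latencies_bucket{verb!~"CONNECT|WATCHLIST|WATCH|PROXY"}[5m])
  )
) / 1e6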


>
> If that's 1.5 seconds for POST and isn't "max", something is horribly
> wrong.
>

It was averaging around 1.5 seconds; however, since pulling down the latest
etcd containers it seems to have come back down to 0.20s.


>
> Here's the 99th percentile for a very large production cluster, in
> milliseconds.  Pod POST was around 100ms, but there haven't been any pods
> created in the last 30 minutes:
>
> Element  Value
> {job="kubernetes-apiservers",quantile="0.99",resource="pods",verb="PATCH"}  4.225
> {job="kubernetes-apiservers",quantile="0.99",resource="pods",verb="POST"}  NaN
> {job="kubernetes-apiservers",quantile="0.99",resource="pods",verb="PUT"}  97.814
> {job="kubernetes-apiservers",quantile="0.99",resource="pods",verb="CONNECT"}  NaN
> {job="kubernetes-apiservers",quantile="0.99",resource="pods",verb="DELETE"}  77.235
> {job="kubernetes-apiservers",quantile="0.99",resource="pods",verb="DELETECOLLECTION"}  11.649
> {job="kubernetes-apiservers",quantile="0.99",resource="pods",verb="GET"}  217.616
> {job="kubernetes-apiservers",quantile="0.99",resource="pods",verb="LIST"}  NaN
>
> Prometheus query:
> sum without (instance) (apiserver_request_latencies_summary{verb!="WATCH",resource="pods",quantile="0.99"}) / 1000
>
> Note that GET is unusually high; we're not positive it isn't being
> misreported in Prometheus in 3.6.
>
>
Element  Value
{job="kubernetes-apiservers",quantile="0.99",resource="pods",verb="PUT"}  28.592
{job="kubernetes-apiservers",quantile="0.99",resource="pods",verb="DELETE"}  NaN
{job="kubernetes-apiservers",quantile="0.99",resource="pods",verb="CONNECT"}  NaN
{job="kubernetes-apiservers",quantile="0.99",resource="pods",verb="GET"}  25.11
{job="kubernetes-apiservers",quantile="0.99",resource="pods",verb="WATCHLIST"}  569000.319
{job="kubernetes-apiservers",quantile="0.99",resource="pods",verb="POST"}  NaN
{job="kubernetes-apiservers",quantile="0.99",resource="pods",verb="LIST"}  NaN
{job="kubernetes-apiservers",quantile="0.99",resource="pods",verb="DELETECOLLECTION"}  NaN
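
The 569000 for WATCHLIST is presumably just how long the watch connections
are held open rather than a real latency problem, so when graphing I filter
those verbs out, the same way as in the histogram query above, e.g.:

sum without (instance)
  (apiserver_request_latencies_summary{verb!~"WATCH|WATCHLIST|CONNECT|PROXY",resource="pods",quantile="0.99"}) / 1000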


>
> There were 3 leader changes, because I tried upgrading the docker
> etcd_container to see if that would fix the issue (for some reason the
> Ansible upgrade script doesn't upgrade it?).
>
>
>
> On Sun, 13 Aug 2017 at 01:23 Clayton Coleman <[email protected]> wrote:
>
>> How big is your etcd working set in terms of number of keys?  How many
>> namespaces?  If keys <50k then I would suspect a software, hardware, or
>> network issue in between the masters and etcd.  HTTP etcd failures should
>> only happen when the master is losing elections and being turned over, or
>> the election/heartbeat timeout is too low for your actual network latency.
>> Double-check the etcd logs and verify that you aren't seeing any election
>> failures or turnover.
>>
>> What metrics is the apiserver side returning related to etcd latencies
>> and failures?
>>
>> On Aug 12, 2017, at 11:07 AM, Andrew Lau <[email protected]> wrote:
>>
>> etcd data is on dedicated drives, and AWS reports idle and burst capacity
>> around 90%.
>>
>> On Sun, 13 Aug 2017 at 00:28 Clayton Coleman <[email protected]> wrote:
>>
>>> Check how much IO is being used by etcd and how much you have
>>> provisioned.
>>>
>>>
>>> > On Aug 12, 2017, at 5:32 AM, Andrew Lau <[email protected]> wrote:
>>> >
>>> > Post upgrade to 3.6, I'm noticing the API server seems to be responding
>>> a lot slower, and my etcd metric etcd_http_failed_total is reporting a
>>> large number of failed GET requests.
>>> >
>>> > Has anyone seen this?
>>>
>>