Re: High number of 4xx requests on etcd (3.6 upgrade)

2017-08-13 Thread Andrew Lau
I found an upstream Kubernetes issue:
https://github.com/kubernetes/kubernetes/issues/48998

Would volume mount errors contribute to the 404 count?

On Mon, 14 Aug 2017 at 11:55 Andrew Lau  wrote:

> On Mon, 14 Aug 2017 at 02:10 Clayton Coleman  wrote:
>
>>
>>
>> On Aug 13, 2017, at 12:07 AM, Andrew Lau  wrote:
>>
>> I ended up finding that one of my rules had a wrong filter that was
>> returning a +inf value.  Most of the errors (count being over 2000) were
>> 404 on QGET but fell well below 0.01% after I fixed the wrong filter rule.
>> Is it normal to get this number of 404 requests?
>>
>>
>> I would expect you to only see QGET 404 when a user asks for a key that
>> is not there.  That often happens when two components are racing to observe
>> / react to watch changes and one is deleted before the other, or a user
>> mistypes a command.  So I would not expect zero, but it should be proportional
>> to the amount of change in the cluster.
>>
>
> I'm seeing a lot of these errors on nodes. It looks like a CronJob has
> already finished and the node is trying to clean up a container that's
> already stopped. Could these be contributing to the 404 errors?
>
> dockerd-current[4909]: E0814 11:15:32.934040   12646
> remote_runtime.go:109] StopPodSandbox
> "c2974678a43dd96483effabc95a0f27018b65f6bd527e451f0aba0d81e43404c" from
> runtime service failed: rpc error: code = 2 desc = NetworkPlugin cni failed
> to teardown pod "test-cron-1502552700-93sv9_test" network: CNI failed to
> retrieve network namespace path: Error: No such container:
> c2974678a43dd96483effabc95a0f27018b65f6bd527e451f0aba0d81e43404c
> E0814 11:15:32.934053   12646 kuberuntime_gc.go:138] Failed to stop
> sandbox "c2974678a43dd96483effabc95a0f27018b65f6bd527e451f0aba0d81e43404c"
> before removing: rpc error: code = 2 desc = NetworkPlugin cni failed to
> teardown pod "test-cron-1502552700-93sv9_test" network: CNI failed to
> retrieve network namespace path: Error: No such container:
> c2974678a43dd96483effabc95a0f27018b65f6bd527e451f0aba0d81e43404c
>
>
>
>>
>>
>> This is only on a staging cluster, so the number of keys is small. There
>> are only about 100 namespaces with dummy objects.
>>
>> I went through the etcd logs and didn't see anything abnormal. API server
>> response times jumped to around 0.15s for GET and 1.5s for POST. This is
>> approximately a 50% increase from what I saw with 1.5.
>>
>>
>> What percentile?  Ideally RTT as measured on etcd would be 2-4ms for a
>> simple single-item get at the 50th and 90th percentiles.  Lists will be substantially
>> larger and QGET also reports lists, so the distribution isn't well
>> represented.
>>
>
> 99th percentile. The query I was using:
>
> histogram_quantile(.99, sum without (instance,resource)
> (apiserver_request_latencies_bucket{verb!~"CONNECT|WATCHLIST|WATCH|PROXY"})
> ) / 1e6
>
>
>>
>> If that's 1.5 seconds for POST and isn't "max", something is horribly
>> wrong.
>>
>
> It was averaging 1.5 seconds; however, since pulling down the latest etcd
> containers it seems to have come back down to 0.20s.
>
>
>>
>> Here's the 99th percentile for a very large production cluster in
>> milliseconds.  Pod was around 100ms, but there haven't been any pods created in
>> the last 30 minutes:
>>
>> Element    Value
>> {job="kubernetes-apiservers",quantile="0.99",resource="pods",verb="PATCH"}
>> 4.225
>> {job="kubernetes-apiservers",quantile="0.99",resource="pods",verb="POST"}
>> NaN
>> {job="kubernetes-apiservers",quantile="0.99",resource="pods",verb="PUT"}
>> 97.814
>>
>> {job="kubernetes-apiservers",quantile="0.99",resource="pods",verb="CONNECT"}
>> NaN
>>
>> {job="kubernetes-apiservers",quantile="0.99",resource="pods",verb="DELETE"}
>> 77.235
>>
>> {job="kubernetes-apiservers",quantile="0.99",resource="pods",verb="DELETECOLLECTION"}
>> 11.649
>> {job="kubernetes-apiservers",quantile="0.99",resource="pods",verb="GET"}
>> 217.616
>> {job="kubernetes-apiservers",quantile="0.99",resource="pods",verb="LIST"}
>> NaN
>>
>> Prometheus query sum without (instance)
>> (apiserver_request_latencies_summary{verb!="WATCH",resource="pods",quantile="0.99"})/1000
>>
>> Note that GET is unusually high; we're not positive it isn't being
>> misreported in Prometheus in 3.6.
>>
>>
> Element    Value
> {job="kubernetes-apiservers",quantile="0.99",resource="pods",verb="PUT"}
> 28.592
> {job="kubernetes-apiservers",quantile="0.99",resource="pods",verb="DELETE"}
> NaN
>
> {job="kubernetes-apiservers",quantile="0.99",resource="pods",verb="CONNECT"}
> NaN
> {job="kubernetes-apiservers",quantile="0.99",resource="pods",verb="GET"}
> 25.11
>
> {job="kubernetes-apiservers",quantile="0.99",resource="pods",verb="WATCHLIST"}
> 569000.319
> {job="kubernetes-apiservers",quantile="0.99",resource="pods",verb="POST"}
> NaN
> {job="kubernetes-apiservers",quantile="0.99",resource="pods",verb="LIST"}
> NaN
>
> {job="kubernetes-apiservers",quantile="0.99",resource="pods",verb="DELETECOLLECTION"}
> NaN
>
>
>>
>> There were 3 leader changes, as I tried upgrading docker etcd_container to
>> see if that would fix the issue (for some reason the ansible upgrade script
>> doesn't upgrade this(?)).
>

Re: High number of 4xx requests on etcd (3.6 upgrade)

2017-08-13 Thread Andrew Lau
On Mon, 14 Aug 2017 at 02:10 Clayton Coleman  wrote:

>
>
> On Aug 13, 2017, at 12:07 AM, Andrew Lau  wrote:
>
> I ended up finding that one of my rules had a wrong filter that was
> returning a +inf value.  Most of the errors (count being over 2000) were
> 404 on QGET but fell well below 0.01% after I fixed the wrong filter rule.
> Is it normal to get this number of 404 requests?
>
>
> I would expect you to only see QGET 404 when a user asks for a key that is
> not there.  That often happens when two components are racing to observe /
> react to watch changes and one is deleted before the other, or a user
> mistypes a command.  So I would not expect zero, but it should be proportional
> to the amount of change in the cluster.
>

I'm seeing a lot of these errors on nodes. It looks like a CronJob has
already finished and the node is trying to clean up a container that's
already stopped. Could these be contributing to the 404 errors?

dockerd-current[4909]: E0814 11:15:32.934040   12646 remote_runtime.go:109]
StopPodSandbox
"c2974678a43dd96483effabc95a0f27018b65f6bd527e451f0aba0d81e43404c" from
runtime service failed: rpc error: code = 2 desc = NetworkPlugin cni failed
to teardown pod "test-cron-1502552700-93sv9_test" network: CNI failed to
retrieve network namespace path: Error: No such container:
c2974678a43dd96483effabc95a0f27018b65f6bd527e451f0aba0d81e43404c
E0814 11:15:32.934053   12646 kuberuntime_gc.go:138] Failed to stop sandbox
"c2974678a43dd96483effabc95a0f27018b65f6bd527e451f0aba0d81e43404c" before
removing: rpc error: code = 2 desc = NetworkPlugin cni failed to teardown
pod "test-cron-1502552700-93sv9_test" network: CNI failed to retrieve
network namespace path: Error: No such container:
c2974678a43dd96483effabc95a0f27018b65f6bd527e451f0aba0d81e43404c



>
>
> This is only on a staging cluster, so the number of keys is small. There
> are only about 100 namespaces with dummy objects.
>
> I went through the etcd logs and didn't see anything abnormal. API server
> response times jumped to around 0.15s for GET and 1.5s for POST. This is
> approximately a 50% increase from what I saw with 1.5.
>
>
> What percentile?  Ideally RTT as measured on etcd would be 2-4ms for a
> simple single-item get at the 50th and 90th percentiles.  Lists will be substantially
> larger and QGET also reports lists, so the distribution isn't well
> represented.
>

99th percentile. The query I was using:

histogram_quantile(.99, sum without (instance,resource)
(apiserver_request_latencies_bucket{verb!~"CONNECT|WATCHLIST|WATCH|PROXY"})
) / 1e6
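
For a rolling view, a windowed variant of the same query should give the p99
per verb (a sketch, assuming the same apiserver_request_latencies_bucket
metric, which reports microseconds):

# p99 apiserver request latency in seconds over the last 5 minutes, per verb
histogram_quantile(0.99,
  sum by (le, verb) (
    rate(apiserver_request_latencies_bucket{verb!~"CONNECT|WATCHLIST|WATCH|PROXY"}[5m])
  )
) / 1e6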


>
> If that's 1.5 seconds for POST and isn't "max", something is horribly
> wrong.
>

It was averaging 1.5 seconds; however, since pulling down the latest etcd
containers it seems to have come back down to 0.20s.


>
> Here's the 99th percentile for a very large production cluster in
> milliseconds.  Pod was around 100ms, but there haven't been any pods created in
> the last 30 minutes:
>
> Element    Value
> {job="kubernetes-apiservers",quantile="0.99",resource="pods",verb="PATCH"}
> 4.225
> {job="kubernetes-apiservers",quantile="0.99",resource="pods",verb="POST"}
> NaN
> {job="kubernetes-apiservers",quantile="0.99",resource="pods",verb="PUT"}
> 97.814
>
> {job="kubernetes-apiservers",quantile="0.99",resource="pods",verb="CONNECT"}
> NaN
> {job="kubernetes-apiservers",quantile="0.99",resource="pods",verb="DELETE"}
> 77.235
>
> {job="kubernetes-apiservers",quantile="0.99",resource="pods",verb="DELETECOLLECTION"}
> 11.649
> {job="kubernetes-apiservers",quantile="0.99",resource="pods",verb="GET"}
> 217.616
> {job="kubernetes-apiservers",quantile="0.99",resource="pods",verb="LIST"}
> NaN
>
> Prometheus query sum without (instance)
> (apiserver_request_latencies_summary{verb!="WATCH",resource="pods",quantile="0.99"})/1000
>
> Note that GET is unusually high; we're not positive it isn't being
> misreported in Prometheus in 3.6.
>
>
Element    Value
{job="kubernetes-apiservers",quantile="0.99",resource="pods",verb="PUT"}
28.592
{job="kubernetes-apiservers",quantile="0.99",resource="pods",verb="DELETE"}
NaN
{job="kubernetes-apiservers",quantile="0.99",resource="pods",verb="CONNECT"}
NaN
{job="kubernetes-apiservers",quantile="0.99",resource="pods",verb="GET"}
25.11
{job="kubernetes-apiservers",quantile="0.99",resource="pods",verb="WATCHLIST"}
569000.319
{job="kubernetes-apiservers",quantile="0.99",resource="pods",verb="POST"}
NaN
{job="kubernetes-apiservers",quantile="0.99",resource="pods",verb="LIST"}
NaN
{job="kubernetes-apiservers",quantile="0.99",resource="pods",verb="DELETECOLLECTION"}
NaN


>
> There were 3 leader changes, as I tried upgrading docker etcd_container to
> see if that would fix the issue (for some reason the ansible upgrade script
> doesn't upgrade this(?)).
>
>
>
> On Sun, 13 Aug 2017 at 01:23 Clayton Coleman  wrote:
>
>> How big is your etcd working set in terms of number of keys?  How many
>> namespaces?  If keys <50k then I would suspect a software, hardware, or
>> network issue between masters and etcd.  Http etcd failures should only
>> happen when the master is losing elections and being turned over, or the
>> election/heartbeat timeout is too low for your actual network latency.

Re: High number of 4xx requests on etcd (3.6 upgrade)

2017-08-13 Thread Clayton Coleman
On Aug 13, 2017, at 12:07 AM, Andrew Lau  wrote:

I ended up finding that one of my rules had a wrong filter that was
returning a +inf value.  Most of the errors (count being over 2000) were
404 on QGET but fell well below 0.01% after I fixed the wrong filter rule.
Is it normal to get this number of 404 requests?


I would expect you to only see QGET 404 when a user asks for a key that is
not there.  That often happens when two components are racing to observe /
react to watch changes and one is deleted before the other, or a user
mistypes a command.  So I would not expect zero, but it should be proportional
to the amount of change in the cluster.


This is only on a staging cluster, so the number of keys is small. There
are only about 100 namespaces with dummy objects.

I went through the etcd logs and didn't see anything abnormal. API server
response times jumped to around 0.15s for GET and 1.5s for POST. This is
approximately a 50% increase from what I saw with 1.5.


What percentile?  Ideally RTT as measured on etcd would be 2-4ms for a simple
single-item get at the 50th and 90th percentiles.  Lists will be substantially larger and
QGET also reports lists, so the distribution isn't well represented.

If that's 1.5 seconds for POST and isn't "max", something is horribly wrong.

Here's the 99th percentile for a very large production cluster in
milliseconds.  Pod was around 100ms, but there haven't been any pods created in
the last 30 minutes:

Element    Value
{job="kubernetes-apiservers",quantile="0.99",resource="pods",verb="PATCH"}
4.225
{job="kubernetes-apiservers",quantile="0.99",resource="pods",verb="POST"}
NaN
{job="kubernetes-apiservers",quantile="0.99",resource="pods",verb="PUT"}
97.814
{job="kubernetes-apiservers",quantile="0.99",resource="pods",verb="CONNECT"}
NaN
{job="kubernetes-apiservers",quantile="0.99",resource="pods",verb="DELETE"}
77.235
{job="kubernetes-apiservers",quantile="0.99",resource="pods",verb="DELETECOLLECTION"}
11.649
{job="kubernetes-apiservers",quantile="0.99",resource="pods",verb="GET"}
217.616
{job="kubernetes-apiservers",quantile="0.99",resource="pods",verb="LIST"}
NaN

Prometheus query sum without (instance)
(apiserver_request_latencies_summary{verb!="WATCH",resource="pods",quantile="0.99"})/1000

Note that GET is unusually high; we're not positive it isn't being
misreported in Prometheus in 3.6.


There were 3 leader changes, as I tried upgrading docker etcd_container to
see if that would fix the issue (for some reason the ansible upgrade script
doesn't upgrade this(?)).



On Sun, 13 Aug 2017 at 01:23 Clayton Coleman  wrote:

> How big is your etcd working set in terms of number of keys?  How many
> namespaces?  If keys <50k then I would suspect a software, hardware, or
> network issue between masters and etcd.  Http etcd failures should only
> happen when the master is losing elections and being turned over, or the
> election/heartbeat timeout is too low for your actual network latency.
> Double check the etcd logs and verify that you aren't seeing any election
> failures or turnover.
>
> What metrics is the apiserver side returning related to etcd latencies
> and failures?
>
> On Aug 12, 2017, at 11:07 AM, Andrew Lau  wrote:
>
> etcd data is on dedicated drives, and AWS reports idle and burst capacity
> around 90%.
>
> On Sun, 13 Aug 2017 at 00:28 Clayton Coleman  wrote:
>
>> Check how much IO is being used by etcd and how much you have provisioned.
>>
>>
>> > On Aug 12, 2017, at 5:32 AM, Andrew Lau  wrote:
>> >
>> > Post upgrade to 3.6, I'm noticing the API server seems to be responding
>> a lot slower, and my etcd metric etcd_http_failed_total is returning a
>> large number of failed GET requests.
>> >
>> > Has anyone seen this?
>>
>


Re: High number of 4xx requests on etcd (3.6 upgrade)

2017-08-12 Thread Andrew Lau
I ended up finding that one of my rules had a wrong filter that was
returning a +inf value.  Most of the errors (count being over 2000) were
404 on QGET but fell well below 0.01% after I fixed the wrong filter rule.
Is it normal to get this number of 404 requests?
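
A query along these lines gives that proportion (a sketch; it assumes etcd's
v2 HTTP metrics etcd_http_failed_total and etcd_http_received_total, with
method and code labels, are being scraped):

# fraction of QGET requests returning 404 over the last 5 minutes
sum(rate(etcd_http_failed_total{method="QGET",code="404"}[5m]))
/
sum(rate(etcd_http_received_total{method="QGET"}[5m]))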

This is only on a staging cluster, so the number of keys is small. There
are only about 100 namespaces with dummy objects.

I went through the etcd logs and didn't see anything abnormal. API server
response times jumped to around 0.15s for GET and 1.5s for POST. This is
approximately a 50% increase from what I saw with 1.5.

There were 3 leader changes, as I tried upgrading docker etcd_container to
see if that would fix the issue (for some reason the ansible upgrade script
doesn't upgrade this(?)).



On Sun, 13 Aug 2017 at 01:23 Clayton Coleman  wrote:

> How big is your etcd working set in terms of number of keys?  How many
> namespaces?  If keys <50k then I would suspect a software, hardware, or
> network issue between masters and etcd.  Http etcd failures should only
> happen when the master is losing elections and being turned over, or the
> election/heartbeat timeout is too low for your actual network latency.
> Double check the etcd logs and verify that you aren't seeing any election
> failures or turnover.
>
> What metrics is the apiserver side returning related to etcd latencies
> and failures?
>
> On Aug 12, 2017, at 11:07 AM, Andrew Lau  wrote:
>
> etcd data is on dedicated drives, and AWS reports idle and burst capacity
> around 90%.
>
> On Sun, 13 Aug 2017 at 00:28 Clayton Coleman  wrote:
>
>> Check how much IO is being used by etcd and how much you have provisioned.
>>
>>
>> > On Aug 12, 2017, at 5:32 AM, Andrew Lau  wrote:
>> >
>> > Post upgrade to 3.6, I'm noticing the API server seems to be responding
>> a lot slower, and my etcd metric etcd_http_failed_total is returning a
>> large number of failed GET requests.
>> >
>> > Has anyone seen this?
>>
>


Re: High number of 4xx requests on etcd (3.6 upgrade)

2017-08-12 Thread Clayton Coleman
How big is your etcd working set in terms of number of keys?  How many
namespaces?  If keys <50k then I would suspect a software, hardware, or
network issue between masters and etcd.  Http etcd failures should only
happen when the master is losing elections and being turned over, or the
election/heartbeat timeout is too low for your actual network latency.
Double check the etcd logs and verify that you aren't seeing any election
failures or turnover.

What metrics is the apiserver side returning related to etcd latencies and
failures?
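
Two queries along these lines would cover both checks, as a rough sketch
(assuming etcd's etcd_server_leader_changes_seen_total counter and the
apiserver's etcd_request_latencies_summary metric, reported in microseconds,
are being scraped):

# leader turnover over the last hour; should stay at 0 on a healthy cluster
increase(etcd_server_leader_changes_seen_total[1h])

# p99 etcd request latency (seconds) as seen by the apiserver, worst instance per operation
max by (operation) (etcd_request_latencies_summary{quantile="0.99"}) / 1e6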

On Aug 12, 2017, at 11:07 AM, Andrew Lau  wrote:

etcd data is on dedicated drives, and AWS reports idle and burst capacity
around 90%.

On Sun, 13 Aug 2017 at 00:28 Clayton Coleman  wrote:

> Check how much IO is being used by etcd and how much you have provisioned.
>
>
> > On Aug 12, 2017, at 5:32 AM, Andrew Lau  wrote:
> >
> > Post upgrade to 3.6, I'm noticing the API server seems to be responding a
> lot slower, and my etcd metric etcd_http_failed_total is returning a large
> number of failed GET requests.
> >
> > Has anyone seen this?
>


Re: High number of 4xx requests on etcd (3.6 upgrade)

2017-08-12 Thread Andrew Lau
etcd data is on dedicated drives, and AWS reports idle and burst capacity
around 90%.

On Sun, 13 Aug 2017 at 00:28 Clayton Coleman  wrote:

> Check how much IO is being used by etcd and how much you have provisioned.
>
>
> > On Aug 12, 2017, at 5:32 AM, Andrew Lau  wrote:
> >
> > Post upgrade to 3.6, I'm noticing the API server seems to be responding a
> lot slower, and my etcd metric etcd_http_failed_total is returning a large
> number of failed GET requests.
> >
> > Has anyone seen this?
>


Re: High number of 4xx requests on etcd (3.6 upgrade)

2017-08-12 Thread Clayton Coleman
Check how much IO is being used by etcd and how much you have provisioned.
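
One rough way to see that from etcd's side, as a sketch (assuming the
etcd_disk_wal_fsync_duration_seconds histogram is being scraped):

# p99 WAL fsync latency over the last 5 minutes; sustained high values point at slow or contended disks
histogram_quantile(0.99,
  sum by (le, instance) (rate(etcd_disk_wal_fsync_duration_seconds_bucket[5m]))
)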


> On Aug 12, 2017, at 5:32 AM, Andrew Lau  wrote:
>
> Post upgrade to 3.6 I'm noticing the API server seems to be responding a lot 
> slower and my etcd metrics etcd_http_failed_total is returning a large number 
> of failed GET requests.
>
> Has anyone seen this?



High number of 4xx requests on etcd (3.6 upgrade)

2017-08-12 Thread Andrew Lau
Post upgrade to 3.6, I'm noticing the API server seems to be responding a
lot slower, and my etcd metric etcd_http_failed_total is returning a large
number of failed GET requests.
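
A breakdown along these lines shows where the failures land (a sketch,
assuming the etcd_http_failed_total counter carries method and code labels):

# etcd HTTP failure rate by method and status code over the last 5 minutes
sum by (method, code) (rate(etcd_http_failed_total[5m]))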

Has anyone seen this?