On Mon, 14 Aug 2017 at 02:10 Clayton Coleman <[email protected]> wrote:
> On Aug 13, 2017, at 12:07 AM, Andrew Lau <[email protected]> wrote:
>
> > I ended up finding that one of my rules had a wrong filter that was
> > returning a +inf value. Most of the errors (the count being over 2000) were
> > 404 on QGET, but they fell well below 0.01% after I fixed the wrong filter
> > rule. Is it normal to get this number of 404 requests?
>
> I would expect you to only see QGET 404 when a user asks for a key that is
> not there. That often happens when two components are racing to observe /
> react to watch changes and one is deleted before the other, or a user
> mistypes a command. So I would not expect zero, but it should be
> proportional to the amount of change in the cluster.

I'm seeing a lot of these errors on the nodes. It looks like a CronJob has
already finished and the node is trying to clean up a container that has
already stopped. Would these be contributing to the 404 errors?

dockerd-current[4909]: E0814 11:15:32.934040   12646 remote_runtime.go:109] StopPodSandbox "c2974678a43dd96483effabc95a0f27018b65f6bd527e451f0aba0d81e43404c" from runtime service failed: rpc error: code = 2 desc = NetworkPlugin cni failed to teardown pod "test-cron-1502552700-93sv9_test" network: CNI failed to retrieve network namespace path: Error: No such container: c2974678a43dd96483effabc95a0f27018b65f6bd527e451f0aba0d81e43404c
E0814 11:15:32.934053   12646 kuberuntime_gc.go:138] Failed to stop sandbox "c2974678a43dd96483effabc95a0f27018b65f6bd527e451f0aba0d81e43404c" before removing: rpc error: code = 2 desc = NetworkPlugin cni failed to teardown pod "test-cron-1502552700-93sv9_test" network: CNI failed to retrieve network namespace path: Error: No such container: c2974678a43dd96483effabc95a0f27018b65f6bd527e451f0aba0d81e43404c

> > This is only on a staging cluster, so the number of keys is small. There
> > are only about 100 namespaces with dummy objects.
> >
> > I went through the etcd logs and didn't see anything abnormal. API server
> > response times jumped to around 0.15 for GET and 1.5 for POST. This is
> > approximately a 50% increase over what I saw with 1.5.
>
> What percentile? Ideally RTT as measured on etcd would be 2-4ms for a
> simple single-item get at the 50th and 90th. Lists will be substantially
> larger, and QGET also reports lists, so the distribution isn't well
> represented.

99th percentile. I was using:

histogram_quantile(.99, sum without (instance,resource) (apiserver_request_latencies_bucket{verb!~"CONNECT|WATCHLIST|WATCH|PROXY"})) / 1e6

> If that's 1.5 seconds for POST and isn't "max", something is horribly
> wrong.

It was averaging 1.5 seconds; however, since pulling down the latest etcd
containers it seems to have come back down to 0.20.
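(A side note on the histogram_quantile query above: because it sums the raw
cumulative _bucket counters, it reflects latency over the whole lifetime of
the apiserver processes rather than recent behaviour. A rate-based variant
over a recent window would look roughly like the following; the 5m window is
just an example:)

histogram_quantile(0.99, sum without (instance,resource) (rate(apiserver_request_latencies_bucket{verb!~"CONNECT|WATCHLIST|WATCH|PROXY"}[5m]))) / 1e6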
> Here's the 99th percentile for a very large production cluster, in
> milliseconds. Pod was around 100ms, but there haven't been any pods
> created in the last 30 minutes:
>
> Element                                                                                  Value
> {job="kubernetes-apiservers",quantile="0.99",resource="pods",verb="PATCH"}               4.225
> {job="kubernetes-apiservers",quantile="0.99",resource="pods",verb="POST"}                NaN
> {job="kubernetes-apiservers",quantile="0.99",resource="pods",verb="PUT"}                 97.814
> {job="kubernetes-apiservers",quantile="0.99",resource="pods",verb="CONNECT"}             NaN
> {job="kubernetes-apiservers",quantile="0.99",resource="pods",verb="DELETE"}              77.235
> {job="kubernetes-apiservers",quantile="0.99",resource="pods",verb="DELETECOLLECTION"}    11.649
> {job="kubernetes-apiservers",quantile="0.99",resource="pods",verb="GET"}                 217.616
> {job="kubernetes-apiservers",quantile="0.99",resource="pods",verb="LIST"}                NaN
>
> Prometheus query:
> sum without (instance) (apiserver_request_latencies_summary{verb!="WATCH",resource="pods",quantile="0.99"})/1000
>
> Note that GET is unusually high; we're not positive it isn't being
> misreported in Prometheus in 3.6.
>
> Element                                                                                  Value
> {job="kubernetes-apiservers",quantile="0.99",resource="pods",verb="PUT"}                 28.592
> {job="kubernetes-apiservers",quantile="0.99",resource="pods",verb="DELETE"}              NaN
> {job="kubernetes-apiservers",quantile="0.99",resource="pods",verb="CONNECT"}             NaN
> {job="kubernetes-apiservers",quantile="0.99",resource="pods",verb="GET"}                 25.11
> {job="kubernetes-apiservers",quantile="0.99",resource="pods",verb="WATCHLIST"}           569000.319
> {job="kubernetes-apiservers",quantile="0.99",resource="pods",verb="POST"}                NaN
> {job="kubernetes-apiservers",quantile="0.99",resource="pods",verb="LIST"}                NaN
> {job="kubernetes-apiservers",quantile="0.99",resource="pods",verb="DELETECOLLECTION"}    NaN

> > There were 3 leader changes, as I tried upgrading the docker etcd_container
> > to see if that would fix the issue (for some reason the ansible upgrade
> > script doesn't upgrade this?).
> >
> > On Sun, 13 Aug 2017 at 01:23 Clayton Coleman <[email protected]> wrote:
> >
>> How big is your etcd working set in terms of number of keys? How many
>> namespaces? If keys <50k then I would suspect a software, hardware, or
>> network issue in between the masters and etcd. HTTP etcd failures should
>> only happen when the master is losing elections and being turned over, or
>> the election/heartbeat timeout is too low for your actual network latency.
>> Double-check the etcd logs and verify that you aren't seeing any election
>> failures or turnover.
>>
>> What metrics is the apiserver side returning related to etcd latencies
>> and failures?
>>
>> On Aug 12, 2017, at 11:07 AM, Andrew Lau <[email protected]> wrote:
>>
>> etcd data is on dedicated drives, and AWS reports idle and burst capacity
>> around 90%
>>
>> On Sun, 13 Aug 2017 at 00:28 Clayton Coleman <[email protected]> wrote:
>>
>>> Check how much IO is being used by etcd and how much you have
>>> provisioned.
>>>
>>> > On Aug 12, 2017, at 5:32 AM, Andrew Lau <[email protected]> wrote:
>>> >
>>> > Post upgrade to 3.6 I'm noticing the API server seems to be responding
>>> > a lot slower, and my etcd metric etcd_http_failed_total is returning a
>>> > large number of failed GET requests.
>>> >
>>> > Has anyone seen this?
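One more thing, for anyone comparing notes: the etcd-side metrics that should
show whether this is an election or disk problem rather than an apiserver
problem are roughly the following (metric names are from etcd's own
Prometheus endpoint; the windows and selectors are only placeholders for your
setup):

# leader elections seen by each member over the last hour (ideally 0)
increase(etcd_server_leader_changes_seen_total[1h])

# 99th percentile WAL fsync latency; the usual guidance is to keep this well under 10ms
histogram_quantile(0.99, rate(etcd_disk_wal_fsync_duration_seconds_bucket[5m]))

# QGET failure ratio, which is easier to reason about than the raw etcd_http_failed_total count
rate(etcd_http_failed_total{method="QGET"}[5m]) / rate(etcd_http_received_total{method="QGET"}[5m])

(I'm assuming etcd_http_received_total here; depending on the etcd version
the v2 HTTP metric names may differ slightly.)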
_______________________________________________
users mailing list
[email protected]
http://lists.openshift.redhat.com/openshiftmm/listinfo/users
