I found an upstream issue on Kubernetes: https://github.com/kubernetes/kubernetes/issues/48998

Would volume mount errors contribute to the 404 count?

On Mon, 14 Aug 2017 at 11:55 Andrew Lau <and...@andrewklau.com> wrote:

> On Mon, 14 Aug 2017 at 02:10 Clayton Coleman <ccole...@redhat.com> wrote:
>
>> On Aug 13, 2017, at 12:07 AM, Andrew Lau <and...@andrewklau.com> wrote:
>>
>> I ended up finding that one of my rules had a wrong filter that was
>> returning a +inf value. Most of the errors (the count being over 2000) were
>> 404 on QGET, but they fell well below 0.01% after I fixed the wrong filter
>> rule. Is it normal to get this number of 404 requests?
>>
>> I would expect you to only see a QGET 404 when a user asks for a key that
>> is not there. That often happens when two components are racing to observe
>> / react to watch changes and one is deleted before the other, or a user
>> mistypes a command. So I would not expect zero, but it should be
>> proportional to the amount of change in the cluster.
>
> I'm seeing a lot of these errors on nodes. It looks like a CronJob has
> already finished and the node is trying to clean up a container that has
> already stopped. Would these be what contributes to the 404 errors?
>
> dockerd-current[4909]: E0814 11:15:32.934040 12646 remote_runtime.go:109]
> StopPodSandbox "c2974678a43dd96483effabc95a0f27018b65f6bd527e451f0aba0d81e43404c"
> from runtime service failed: rpc error: code = 2 desc = NetworkPlugin cni
> failed to teardown pod "test-cron-1502552700-93sv9_test" network: CNI failed
> to retrieve network namespace path: Error: No such container:
> c2974678a43dd96483effabc95a0f27018b65f6bd527e451f0aba0d81e43404c
> E0814 11:15:32.934053 12646 kuberuntime_gc.go:138] Failed to stop sandbox
> "c2974678a43dd96483effabc95a0f27018b65f6bd527e451f0aba0d81e43404c" before
> removing: rpc error: code = 2 desc = NetworkPlugin cni failed to teardown
> pod "test-cron-1502552700-93sv9_test" network: CNI failed to retrieve
> network namespace path: Error: No such container:
> c2974678a43dd96483effabc95a0f27018b65f6bd527e451f0aba0d81e43404c
>
>> This is only on a staging cluster, so the number of keys is small. There
>> are only about 100 namespaces with dummy objects.
>>
>> I went through the etcd logs and didn't see anything abnormal. API server
>> response times jumped to around 0.15s for GET and 1.5s for POST. This is
>> approximately a 50% increase from what I saw with 1.5.
>>
>> What percentile? Ideally RTT as measured on etcd would be 2-4ms for a
>> simple single-item get at the 50th and 90th. Lists will be substantially
>> larger, and QGET also reports lists, so the distribution isn't well
>> represented.
>
> 99th percentile. I was using:
>
> histogram_quantile(.99, sum without (instance,resource)
> (apiserver_request_latencies_bucket{verb!~"CONNECT|WATCHLIST|WATCH|PROXY"})
> ) / 1e6
>
>> If that's 1.5 seconds for POST and isn't "max", something is horribly
>> wrong.
>
> It was averaging 1.5 seconds; however, since pulling down the latest etcd
> containers it seems to have come back down to 0.20.
>
>> Here's the 99th percentile for a very large production cluster in
>> milliseconds.
>> Pod was around 100ms, but there haven't been any pods created in the last
>> 30 minutes:
>>
>> Element                                                                                Value
>> {job="kubernetes-apiservers",quantile="0.99",resource="pods",verb="PATCH"}             4.225
>> {job="kubernetes-apiservers",quantile="0.99",resource="pods",verb="POST"}              NaN
>> {job="kubernetes-apiservers",quantile="0.99",resource="pods",verb="PUT"}               97.814
>> {job="kubernetes-apiservers",quantile="0.99",resource="pods",verb="CONNECT"}           NaN
>> {job="kubernetes-apiservers",quantile="0.99",resource="pods",verb="DELETE"}            77.235
>> {job="kubernetes-apiservers",quantile="0.99",resource="pods",verb="DELETECOLLECTION"}  11.649
>> {job="kubernetes-apiservers",quantile="0.99",resource="pods",verb="GET"}               217.616
>> {job="kubernetes-apiservers",quantile="0.99",resource="pods",verb="LIST"}              NaN
>>
>> Prometheus query: sum without (instance)
>> (apiserver_request_latencies_summary{verb!="WATCH",resource="pods",quantile="0.99"})/1000
>>
>> Note that GET is unusually high; we're not positive it isn't being
>> misreported in Prometheus in 3.6.
>
> Element                                                                                Value
> {job="kubernetes-apiservers",quantile="0.99",resource="pods",verb="PUT"}               28.592
> {job="kubernetes-apiservers",quantile="0.99",resource="pods",verb="DELETE"}            NaN
> {job="kubernetes-apiservers",quantile="0.99",resource="pods",verb="CONNECT"}           NaN
> {job="kubernetes-apiservers",quantile="0.99",resource="pods",verb="GET"}               25.11
> {job="kubernetes-apiservers",quantile="0.99",resource="pods",verb="WATCHLIST"}         569000.319
> {job="kubernetes-apiservers",quantile="0.99",resource="pods",verb="POST"}              NaN
> {job="kubernetes-apiservers",quantile="0.99",resource="pods",verb="LIST"}              NaN
> {job="kubernetes-apiservers",quantile="0.99",resource="pods",verb="DELETECOLLECTION"}  NaN
>
>> There were 3 leader changes, as I tried upgrading the docker etcd_container
>> to see if that would fix the issue (for some reason the ansible upgrade
>> script doesn't upgrade this(?)).
>>
>> On Sun, 13 Aug 2017 at 01:23 Clayton Coleman <ccole...@redhat.com> wrote:
>>
>>> How big is your etcd working set in terms of number of keys? How many
>>> namespaces? If keys < 50k, then I would suspect a software, hardware, or
>>> network issue in between the masters and etcd. HTTP etcd failures should
>>> only happen when the master is losing elections and being turned over, or
>>> the election/heartbeat timeout is too low for your actual network latency.
>>> Double-check the etcd logs and verify that you aren't seeing any election
>>> failures or turnover.
>>>
>>> What metrics is the apiserver side returning related to etcd latencies
>>> and failures?
>>>
>>> On Aug 12, 2017, at 11:07 AM, Andrew Lau <and...@andrewklau.com> wrote:
>>>
>>> etcd data is on dedicated drives and AWS reports idle and burst capacity
>>> around 90%
>>>
>>> On Sun, 13 Aug 2017 at 00:28 Clayton Coleman <ccole...@redhat.com> wrote:
>>>
>>>> Check how much IO is being used by etcd and how much you have
>>>> provisioned.
>>>>
>>>> > On Aug 12, 2017, at 5:32 AM, Andrew Lau <and...@andrewklau.com> wrote:
>>>> >
>>>> > Post upgrade to 3.6, I'm noticing the API server seems to be responding
>>>> > a lot slower, and my etcd metric etcd_http_failed_total is returning a
>>>> > large number of failed GET requests.
>>>> >
>>>> > Has anyone seen this?
>>>> > _______________________________________________
>>>> > users mailing list
>>>> > users@lists.openshift.redhat.com
>>>> > http://lists.openshift.redhat.com/openshiftmm/listinfo/users
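
For anyone hitting the same thing, a few Prometheus queries that may help
separate the 404 count from the latency question. The metric and label names
below (etcd_http_received_total / etcd_http_failed_total with "method" and
"code" labels, etcd_server_leader_changes_seen_total, and the apiserver-side
etcd_request_latencies_summary) are assumptions based on the etcd 3.x and
Kubernetes 1.6-era exporters, so adjust them to whatever your build actually
exposes:

  # Fraction of etcd v2 HTTP requests failing with a 404, per method
  # (QGET is the method in question in this thread).
  sum by (method) (rate(etcd_http_failed_total{code="404"}[5m]))
    / sum by (method) (rate(etcd_http_received_total[5m]))

  # Leader turnover over the last hour, to rule out election churn.
  increase(etcd_server_leader_changes_seen_total[1h])

  # apiserver-side view of etcd latency: 99th percentile in milliseconds,
  # worst instance per operation/type (the summary is reported in microseconds).
  max by (operation, type) (etcd_request_latencies_summary{quantile="0.99"}) / 1000

If the 404 ratio stays small and tracks CronJob/namespace churn, that matches
the point above that QGET 404s are proportional to the amount of change in the
cluster; a rising leader-change count, or a large gap between etcd-side and
apiserver-side latencies, would point back at the network or disks between the
masters and etcd.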
_______________________________________________
users mailing list
users@lists.openshift.redhat.com
http://lists.openshift.redhat.com/openshiftmm/listinfo/users