Re: ZCX task monitoring, anyone?
Yuksel, thank you for that information, and I'm sorry I haven't responded sooner (other work took precedence, as is so often the case). Your modification to the 'run' command has worked exactly as required. I now have cAdvisor running on my zCX system and overall CPU utilisation is down to a much more reasonable 5-6% - a very welcome change!

Thank you again
Regards
Sean

-- For IBM-MAIN subscribe / signoff / archive access instructions, send email to lists...@listserv.ua.edu with the message: INFO IBM-MAIN
Re: ZCX task monitoring, anyone?
Hi Sean,

cAdvisor polls for metrics once a second by default. Though I have not seen such high CPU utilization with the default setting, it is worth running cAdvisor with a different setting, one that you can define explicitly when you start cAdvisor. I'd recommend that you try a 10s or 15s interval and see if either helps. To do that, you need to specify the "--housekeeping_interval" parameter. Here is how you'd do it on zCX (this sets it to 10 seconds):

docker run -v /:/rootfs:ro -v /var/run:/var/run:ro -v /sys:/sys:ro -v /var/lib/docker/:/var/lib/docker:ro -v /dev/disk/:/dev/disk:ro -p 8080:8080 -d --network monitoring --name=cadvisor ibmcom/cadvisor-s390x:0.33.0 cadvisor --housekeeping_interval=10s

Note that I modified the instructions by adding "cadvisor --housekeeping_interval=10s".

Also, Prometheus polls cAdvisor, but cAdvisor's data collection is not triggered by Prometheus polling. It has its own polling cycle and it keeps the metrics in memory. When Prometheus polls cAdvisor, it returns the last set of collected metrics from memory.

Yuksel Gunal
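The long run command above lends itself to a small wrapper script, so the interval can be checked before the container is started. This is only a sketch: the image, network, and volume arguments are copied from Yuksel's command, while the wrapper itself, its default value, and its simple validation are illustrative additions, not part of the original instructions.

```shell
# Sketch of a launcher for cAdvisor on zCX. The docker arguments mirror the
# run command above; the wrapper and its validation are illustrative only.
INTERVAL="${1:-10s}"

# cAdvisor takes Go-style durations; accept simple forms such as 10s or 1m.
case "$INTERVAL" in
  [0-9]*s|[0-9]*m) ;;
  *) echo "invalid interval: $INTERVAL" >&2; exit 1 ;;
esac

CMD="docker run -v /:/rootfs:ro -v /var/run:/var/run:ro -v /sys:/sys:ro \
 -v /var/lib/docker/:/var/lib/docker:ro -v /dev/disk/:/dev/disk:ro \
 -p 8080:8080 -d --network monitoring --name=cadvisor \
 ibmcom/cadvisor-s390x:0.33.0 cadvisor --housekeeping_interval=$INTERVAL"

# Print the command; on the zCX host you would run it with: eval "$CMD"
echo "$CMD"
```

With no argument the script uses the 10s value from the post; passing 15s as the first argument tries the gentler setting Yuksel also suggests.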
Re: ZCX task monitoring, anyone?
I particularly hate the URL garbling that I'm currently stuck with. Take outlook - please!

--
Shmuel (Seymour J.) Metz
http://mason.gmu.edu/~smetz3
Re: ZCX task monitoring, anyone?
OK, so the scrape_interval part of your answer is something I can quickly understand and deal with. I'll put that to one side for now, because my interest is with cAdvisor and how to control it.

To take the parallel example for Prometheus from the IBM Redbook, I have to:
1. create a 'prometheus.yml' file
2. create a Dockerfile, which features a COPY command referring to that 'prometheus.yml' file
3. do a 'docker build' of the Prometheus image
4. then I can 'run' the image.
With subsequent 'runs' of Prometheus I can skip steps 1-3, as they have already been done.

For cAdvisor, there is no 'build' to be done, thus no copying of a yml file. The only command I know of for cAdvisor is the 'run' command I detailed earlier in this thread. (If the cAdvisor image does not exist, it is automatically downloaded before being started.)

In conceptual terms, am I right in thinking that I'm downloading a 'program' that has already been prepared for execution? If that is true, then the value of any control parameters appears to be hard-coded within the program. Given this, I'm not sure I have any control over the container manifest for cAdvisor.

Regards
Sean
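Sean's four steps can be sketched end to end. Assumptions to flag: the scrape target 'cadvisor:8080' follows from cAdvisor's published port on the 'monitoring' network earlier in the thread, while the base image name 'prom/prometheus', the job name, and the 30s interval are illustrative guesses rather than the Redbook's exact values.

```shell
# Step 1 (illustrative values): a minimal prometheus.yml.
cat > prometheus.yml <<'EOF'
global:
  scrape_interval: 30s
scrape_configs:
  - job_name: cadvisor
    static_configs:
      - targets: ['cadvisor:8080']
EOF

# Step 2: a Dockerfile whose COPY bakes the config into the image.
cat > Dockerfile <<'EOF'
FROM prom/prometheus
COPY prometheus.yml /etc/prometheus/prometheus.yml
EOF

# Steps 3 and 4 need the zCX docker host, so they are only echoed here.
echo "docker build -t my-prometheus ."
echo "docker run -d --network monitoring --name prometheus -p 9090:9090 my-prometheus"
```

Subsequent runs can indeed skip steps 1-3, as Sean says, because the built image already contains the config.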
Re: ZCX task monitoring, anyone?
Housekeeping interval is part of the container manifest, as it governs normal operation, not just performance metric collection. As such, it is specified wherever you have your container manifest defined (for example, a .yaml file, or by HTTP endpoint or HTTP server). You can also use the command-line "kubelet" tool.

Scrape_interval is the value for how often Prometheus asks cAdvisor for data from the collection cache, so it affects the cpu used by cAdvisor to prepare and send this data to Prometheus.

As for how Docker monitoring works, you are right that there is overlap in the open-source tools, but the hierarchy is: cAdvisor collects the metrics and also does some aggregation and processing. You can use just cAdvisor. Prometheus is a layer on top, getting metrics from cAdvisor, and provides both better-quality reporting and also alerting (which cAdvisor does not). The next layer is Grafana, which is a generalized metric analytics and visualization tool (not just for Docker). For larger-scale, more complex container environments you need all 3.

In a z/OS context these 3 tools became integrated circa 30 years ago, but for Unix they are not. Splitting the processing like this has both good and bad points (for example, Grafana can run in a separate Docker container) but definitely burns more cpu (a lot more). If you are not careful, the measurement tooling can cost more than the application being measured, even though it is "free".
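To make Attila's ".yaml manifest" point concrete: in a Kubernetes-style pod spec, a flag like --housekeeping_interval sits in the container's args list. This is hypothetical for the zCX setup described in this thread (which uses plain docker run, not kubelet); the pod name, the 15s value, and the overall shape are illustrative only.

```shell
# Hypothetical pod manifest showing where cAdvisor's flag would go;
# written to a file here only to make the shape concrete.
cat > cadvisor-pod.yaml <<'EOF'
apiVersion: v1
kind: Pod
metadata:
  name: cadvisor
spec:
  containers:
    - name: cadvisor
      image: ibmcom/cadvisor-s390x:0.33.0
      args:
        - --housekeeping_interval=15s
      ports:
        - containerPort: 8080
EOF
```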
Re: ZCX task monitoring, anyone?
*sigh*. Oh, how I wish that various e-mail clients would quit re-formatting stuff. My previous response here was so nice & neat & tidy before I hit 'Send'. Reading that response back via IBM-MAIN makes me look like a complete illiterate...

Sean
Re: ZCX task monitoring, anyone?
Hi Attila - thanks for the pointers, but I'm not sure how to act upon them.

The start-up for cAdvisor that I'm using doesn't feature any pointer to a parameter list, and despite much googling I don't see any mention of such a thing. Everything keeps referring back to Prometheus and then on to Grafana.

My cAdvisor start-up (taken directly from the IBM Redbook and slightly modified to comply with local restrictions):

docker network create monitoring
docker run --name cadvisor -v /sys:/sys:ro -v /var/lib/docker/:/var/lib/docker:ro -v /dev/disk:/dev/disk:ro -d --network monitoring ibmcom/cadvisor-s390x:0.33.0

Perhaps I'm looking at things the wrong way, but my current understanding is:
cAdvisor (and also Node exporter) collect various usage stats;
Prometheus then gathers that data and does some sort of pre-processing of it (it doesn't tell cAdvisor to 'do something' - it just passively makes use of the data that cAdvisor collects);
Grafana takes the data from Prometheus and uses it to generate various graphs/tables/reports.

My situation is that when I run cAdvisor on its own - no other containers at all - then it floods as many processors as I define in the zCX start.json file.

Whilst cAdvisor is running, I can go to the relevant web page and I can see that it is producing meters/charts, etc. all on its own. Since that is the case, what is the point of Grafana?

I have a prometheus.yml file that features the term 'scrape_interval' (but not 'housekeeping'), but that file is for use by Prometheus, isn't it? How does it affect the amount of work that cAdvisor is doing, since I haven't even started that container yet?

Regards
Sean
Re: ZCX task monitoring, anyone?
Check your values for housekeeping interval and scrape_interval. Recommended is 15s and 30s (which makes for a 60-second rate window). A small value for housekeeping interval will cause cAdvisor cpu usage to be high, while scrape_interval affects Prometheus cpu usage. It is entirely possible to cause data collection to use 100% of the z/OS cpu -- remember that on Unix systems the rule of thumb is 40% overhead for uncaptured cpu time, while z/OS is far more efficient and runs well under 10%. You will see this behaviour in zCX containers; it isn't going to measure the same as z/OS workload. The optimizations in Unix have the premise that cpu time is low cost (as is memory), while z/OS considers cpu to be high cost and path length worth saving. Same for the subsystems in z/OS and performance monitors.

On Wed, Aug 26, 2020 at 11:43 PM Sean Gleann wrote:

> Allan - "...count the beans differently..." Yes, I'm beginning to get used to that concept. For instance, with the CPU Utilisation data that I *have* been able to retrieve, the metric given is not 'CPU%' but 'Number of cores'. I'm having to do some rapid re-orienting of my way of thinking.
> As for the memory size, I've got "mem-gb" : 2 defined in my start.json file, but I've not seen any indication of paging load at all in my testing.
>
> Michael - 5 zIIPs? I wish! Nope - these are all general-purpose processors.
> The z/OS system I'm using is a z/VM guest on a system run by an external supplier, so I'm not sure if defining zIIPs would actually achieve anything. (Is it possible to dedicate a zIIP engine to a specific z/VM guest? That's a road I've not yet gone down.)
> With regard to the WLM definitions, I followed the advice in the Redbook and I'm reasonably certain I've got it right. Having said that, cross-refer to a thread that I started earlier this week, titled "WLM Query". The response to that led to me defining a resource group to cap the started task to 10MSU, which resulted in a CPU% Util value of roughly 5% - something I could be happy with.
> Under that cap, the started task ran, yes, but it ran like a three-legged dog (my apologies to limb-count-challenged canines). Start-up of the task, from the START command to the "server is listening..." message took over an hour, and STOP-command-to-task-termination took approx. 30 minutes. (SSH-ing to the task was a bit of a joke, too. Responses to simple commands like 'docker ps -a' could be seen 'painting' across the screen, character-by-character...)
> As a result, I've moved away from trying to limit the task for the time being. I'm concentrating on attempting to get cAdvisor to be a bit less greedy.
>
> Regards
> Sean
>
> On Wed, 26 Aug 2020 at 13:49, Michael Babcock wrote:
>
> > I can’t check my zCX out right now since my internet is down.
> >
> > You are running these on zIIP engines correct? Must be nice to have 5 zIIPs! And have the WLM parts in place? Although it probably wouldn’t make much difference during startup/shutdown.
> >
> > On Wed, Aug 26, 2020 at 3:40 AM Sean Gleann wrote:
> >
> > > Can anyone offer advice, please, with regard to monitoring the system resource consumption of a zCX Container task?
> > >
> > > I've got a zCX Container task running on a 'sandbox' system where - as yet - I'm not collecting any RMF/SMF data. Because of that, my only source of system usage is the SDSF DA panel. I feel that the numbers I see there are... 'questionable' is the best word I can think of.
> > >
> > > Firstly, the EXCP-count for the task goes up to about 15360 during the initial start-up phase, but then it stays there until the STOP command is issued. At that point, EXCP-count starts rising again, until the task finally terminates. The explanation for that is probably because all the I/O is being handled internally at the 'Linux' level - the task must be doing *some* I/O, right? - but the data isn't getting back to SDSF for some reason. Without the benefit of SMF data to examine, I'm wondering if this is part of a larger problem.
> > >
> > > The other thing that troubles me is the CPU% busy value. My sandbox system has 5 engines defined, and in the 'start.json' file that controls the zCX Container task, I've specified a 'cpu' value of 4. During the start-up phase for the Container started task, SDSF shows CPU% values of approx 80%, but when the task is finally initialised, this drops to 'tickover' rates of about 1%. I'm happy with that - the initial start-up of *any* task as complex as a zCX Container is likely to cause high CPU usage, and the subsequent drop to the 1% levels is fine by me.
> > >
> > > But... Once the Container task is
Re: ZCX task monitoring, anyone?
Allan - "...count the beans differently...' Yes, I'm beginning to get used to that concept. For instance, with the CPU Utilisation data that I *have* been able to retrieve, the metric given is not 'CPU%', but 'Number of cores'. I'm having to do some rapid re-orienting to my way of thinking. As for the memory size, I've got "mem-gb" : 2 defined in my start.json file, but I've not seen any indication of paging load at all in my testing. Michael - 5 zIIPs? I wish! Nope - these are all general-purpose processors. The z/OS system I'm using is a z/VM guest on a system run by an external supplier, so I'm not sure if defining zIIPs would actually achieve anything (Is it possible to dedicate a zIIP engine to a specific z/VM guest? That's a road I've not yet gone down). With regard to the WLM definitions, I followed the advice in the red book and I'm reasonably certain I've got it right. Having said that, cross-refer to a thread that I started earlier this week, titled "WLM Query" The response to that led to me defining a resource group to cap the started task to 10MSU, which resulted in a CPU% Util value of roughly 5% - something I could be happy with. Under that cap, the started task ran, yes, but it ran like a three-legged dog (my apologies to limb-count-challenged canines). Start-up of the task, from the START command to the "server is listening..." message took over an hour, and STOP-command-to-task-termination took approx. 30 minutes. (SSH-ing to the task was a bit of a joke, too. Responses to simple commands like 'docker ps -a' could be seen 'painting' across the screen, character-by-character...) As a result, I've moved away from trying to limit the task for the time being. I'm concentrating on attempting to get cadvisor to be a bit less greedy. Regards Sean On Wed, 26 Aug 2020 at 13:49, Michael Babcock wrote: > I can’t check my zCX out right now since my internet is down. > > You are running these on zIIP engines correct? Must be nice to have 5 > zIIPs! 
And have the WLM parts in place? Although it probably wouldn’t > make much difference during startup/shutdown. > > On Wed, Aug 26, 2020 at 3:40 AM Sean Gleann wrote: > > > Can anyone offer advice, please, with regard to monitoring the system > > > > resource consumption of a zcx Container task? > > > > > > > > I've got a zcx Container task running on a 'sandbox' system where - as > yet > > > > - I'm not collecting any RMF/SMF data. Because of that, my only source of > > > > system usage is the SDSF DA panel. I feel that the numbers I see there > > > > are... 'questionable' is the best word I can think of. > > > > > > > > Firstly, the EXCP-count for the task goes up to about 15360 during the > > > > initial start-up phase, but then it stays there until the STOP command is > > > > issued. At that point, EXCP-count starts rising again, until the task > > > > finally terminates. The explanation for that is probably because all the > > > > I/O is being handled internally at the 'Linux' level - the task must be > > > > doing *some* I/O, right? - but the data isn't getting back to SDSF for > some > > > > reason. Without the benefit of SMF data to examine, I'm wondering if this > > > > is part of a larger problem. > > > > > > > > The other thing that troubles me is the CPU% busy value. My sandbox > system > > > > has 5 engines defined, and in the 'start.json' file that controls the zcx > > > > Container task, I've specified a 'cpu' value of 4. During the start-up > > > > phase for the Container started task, SDSF shows CPU% values of approx > 80%, > > > > but when the task is finally initialised, this drops to 'tickover' rates > of > > > > about 1%. I'm happy with that - the initial start-up of *any* task as > > > > complex as a zcx Container is likely to cause high CPU usage, and the > > > > subsequent drop to the 1% levels is fine by me. > > > > > > > > But... 
Once the Container task is started and I've ssh'd into it, I then > > > > want to monitor its 'internal' system consumption. I've been using the > > > > 'Getting Started...' redbook as my guide throughout all this project, and > > > > it talks about using "Nodeexporter", "Cadvisor", "Prometheus" and > "Grafana" > > > > as tools for this. I've got all those things installed and I can start > and > > > > stop them quite happily, but I've found that using Cadvisor on it's own > can > > > > drive CPU% levels back up to 80% for the entire time it is running. If a > > > > system is running flat-out when all it is doing is monitoring itself, > well, > > > > there's something wrong somewhere... I'm trying to find an idiot's guide > to > > > > controlling what Cadvisor does, but as yet I've been unsuccessful. > > > > > > > > Regards > > > > Sean > > > > > > > > -- > > > > For IBM-MAIN subscribe / signoff / archive access instructions, > > > > send email to lists...@listserv.ua.edu with the message: INFO IBM-MAIN > > > > -- > Michael
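Sean's observation that the retrieved metric is 'Number of cores' rather than 'CPU%' is just a unit conversion: divide the cores consumed by the cores allocated to the zCX instance. A minimal sketch, assuming the 4-CPU allocation from start.json mentioned in the thread; the helper function itself is hypothetical, not part of any tool discussed here:

```python
# Convert a cAdvisor-style "number of cores" CPU reading into a
# percentage of the CPUs allocated to the zCX instance.
# The 4-CPU allocation matches the 'cpu' value in start.json that
# Sean mentions; the function name is a hypothetical helper.

def cores_to_percent(cores_used: float, cpus_allocated: int) -> float:
    """Express a 'number of cores' metric as CPU% of the allocation."""
    if cpus_allocated <= 0:
        raise ValueError("cpus_allocated must be positive")
    return 100.0 * cores_used / cpus_allocated

# Illustrative reading: 0.05 cores consumed on a 4-CPU zCX instance
print(cores_to_percent(0.05, 4))   # roughly 1.25 (percent)
```

Going the other way, a 'CPU%' figure from SDSF divided by 100 and multiplied by the allocation gives the cores figure Grafana shows.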
Re: ZCX task monitoring, anyone?
I can't check my zCX out right now since my internet is down.

You are running these on zIIP engines, correct? Must be nice to have 5 zIIPs! And have the WLM parts in place? Although it probably wouldn't make much difference during startup/shutdown.

On Wed, Aug 26, 2020 at 3:40 AM Sean Gleann wrote:
> Can anyone offer advice, please, with regard to monitoring the system
> resource consumption of a zCX Container task?

--
Michael Babcock
OneMain Financial
z/OS Systems Programmer, Lead
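For reference, wiring Prometheus to scrape cAdvisor takes only a short scrape_configs entry. A minimal sketch, assuming cAdvisor is reachable by its container name on port 8080 (the name and port used for it elsewhere in this thread) and that both containers share the same Docker network; the job name and 15s interval are illustrative:

```yaml
# Minimal prometheus.yml fragment for scraping cAdvisor inside the
# zCX instance. Container name "cadvisor" and port 8080 come from the
# docker run command earlier in the thread; the interval is illustrative.
scrape_configs:
  - job_name: cadvisor
    scrape_interval: 15s
    static_configs:
      - targets: ['cadvisor:8080']
```

Note that Prometheus's scrape interval is independent of cAdvisor's own housekeeping interval: as Yuksel explains, cAdvisor collects on its own cycle and serves the last set of metrics from memory when scraped.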
Re: ZCX task monitoring, anyone?
I can't speak too much to the EXCP counts, but re: CPU usage - Unix systems "count the beans" very differently than z/OS. The "system overhead" (uncaptured CPU time) for z/OS averages about 10%; the Unix systems I have encountered routinely exceed 40%. Perhaps looking at a Unix process through a z/OS lens (RMF global data capture) is distorting things?

I would also check the region size allocated to the zCX address space. Paging might be masquerading as I/O due to the z/OS lens being used.

HTH,

-----Original Message-----
From: IBM Mainframe Discussion List On Behalf Of Sean Gleann
Sent: Wednesday, August 26, 2020 3:40 AM
To: IBM-MAIN@LISTSERV.UA.EDU
Subject: ZCX task monitoring, anyone?

> Can anyone offer advice, please, with regard to monitoring the system
> resource consumption of a zCX Container task?
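For readers unfamiliar with the file being discussed: the 'cpu' and 'mem-gb' values live in the zCX instance's start.json. A fragment sketch showing only the two fields mentioned in this thread - a real start.json contains other provisioning fields that are not shown here:

```json
{
  "cpu": 4,
  "mem-gb": 2
}
```

With those values, the instance is capped at 4 virtual CPUs and 2 GB of memory regardless of what the underlying z/OS image has available.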
ZCX task monitoring, anyone?
Can anyone offer advice, please, with regard to monitoring the system resource consumption of a zCX Container task?

I've got a zCX Container task running on a 'sandbox' system where - as yet - I'm not collecting any RMF/SMF data. Because of that, my only source of system usage is the SDSF DA panel. I feel that the numbers I see there are... 'questionable' is the best word I can think of.

Firstly, the EXCP count for the task goes up to about 15360 during the initial start-up phase, but then it stays there until the STOP command is issued. At that point, the EXCP count starts rising again, until the task finally terminates. The explanation is probably that all the I/O is being handled internally at the 'Linux' level - the task must be doing *some* I/O, right? - but the data isn't getting back to SDSF for some reason. Without the benefit of SMF data to examine, I'm wondering if this is part of a larger problem.

The other thing that troubles me is the CPU% busy value. My sandbox system has 5 engines defined, and in the 'start.json' file that controls the zCX Container task, I've specified a 'cpu' value of 4. During the start-up phase for the Container started task, SDSF shows CPU% values of approx. 80%, but when the task is finally initialised, this drops to 'tickover' rates of about 1%. I'm happy with that - the initial start-up of *any* task as complex as a zCX Container is likely to cause high CPU usage, and the subsequent drop to the 1% level is fine by me.

But... Once the Container task is started and I've SSH'd into it, I then want to monitor its 'internal' system consumption. I've been using the 'Getting Started...' redbook as my guide throughout this project, and it talks about using "Nodeexporter", "Cadvisor", "Prometheus" and "Grafana" as tools for this. I've got all those things installed and I can start and stop them quite happily, but I've found that using Cadvisor on its own can drive CPU% levels back up to 80% for the entire time it is running. If a system is running flat out when all it is doing is monitoring itself, well, there's something wrong somewhere... I'm trying to find an idiot's guide to controlling what Cadvisor does, but as yet I've been unsuccessful.

Regards
Sean

--
For IBM-MAIN subscribe / signoff / archive access instructions,
send email to lists...@listserv.ua.edu with the message: INFO IBM-MAIN