Re: ZCX task monitoring, anyone?

2020-09-09 Thread Sean Gleann
Yuksel,
Thank you for that information, and I'm sorry I haven't responded sooner
(other work took precedence, as is so often the case).

Your modification to the 'run' command has worked exactly as required. I
now have cAdvisor running on my zCX system, and overall CPU utilisation is
down to a much more reasonable 5-6% - a very welcome change!

Thank you again.

Regards
Sean


Re: ZCX task monitoring, anyone?

2020-09-02 Thread Yuksel Gunal
Hi Sean,

cAdvisor polls for metrics once a second by default.  Though I have not seen
such high CPU utilization with the default setting, it is worth running
cAdvisor with a different setting - one that you can define explicitly when
you start cAdvisor.  I'd recommend that you try a 10s or 15s interval and see
if either helps.  To do that, you need to specify the "--housekeeping_interval"
parameter.  Here is how you'd do it on zCX (this sets it to 10 seconds):

docker run -v /:/rootfs:ro -v /var/run:/var/run:ro -v /sys:/sys:ro \
  -v /var/lib/docker/:/var/lib/docker:ro -v /dev/disk/:/dev/disk:ro \
  -p 8080:8080 -d --network monitoring --name=cadvisor \
  ibmcom/cadvisor-s390x:0.33.0 cadvisor --housekeeping_interval=10s

Note that I modified the instructions by appending "cadvisor
--housekeeping_interval=10s" after the image name.

Also, Prometheus polls cAdvisor, but cAdvisor's data collection is not
triggered by Prometheus polling.  cAdvisor has its own polling cycle and
keeps the metrics in memory; when Prometheus polls it, cAdvisor returns the
last set of collected metrics from memory.
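
If you want to see exactly what Prometheus receives, you can scrape the
endpoint yourself (replace the host name below with that of your zCX
instance):

curl http://zcx-host:8080/metrics | grep container_cpu_usage_seconds_total

Two requests made within the same housekeeping interval should return the
same values, since both are answered from the in-memory cache.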

Yuksel Gunal



Re: ZCX task monitoring, anyone?

2020-08-27 Thread Seymour J Metz
I particularly hate the URL garbling that I'm currently stuck with. Take
Outlook - please!


--
Shmuel (Seymour J.) Metz
http://mason.gmu.edu/~smetz3



From: IBM Mainframe Discussion List  on behalf of 
Sean Gleann 
Sent: Thursday, August 27, 2020 4:14 AM
To: IBM-MAIN@LISTSERV.UA.EDU
Subject: Re: ZCX task monitoring, anyone?

Re: ZCX task monitoring, anyone?

2020-08-27 Thread Sean Gleann
OK, the scrape_interval part of your answer is something I can quickly
understand and deal with.
I'll put that to one side for now, because my interest is in cAdvisor and
how to control it.

To take the parallel example for Prometheus from the IBM Redbook, I have
to:
 1. create a 'prometheus.yml' file
 2. create a Dockerfile, which features a COPY command referring to that
'prometheus.yml' file
 3. do a 'docker build' of the Prometheus image
 4. then 'run' the image.
With subsequent 'runs' of Prometheus I can skip steps 1-3, as they have
already been done. (A sketch of the step-2 Dockerfile follows.)
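
For illustration, the Dockerfile in step 2 can be as small as the two lines
below. The base image name/tag is a guess on my part - use whatever s390x
Prometheus image the Redbook actually names - and /etc/prometheus/ is just
Prometheus's conventional config location:

FROM ibmcom/prometheus-s390x:latest
COPY prometheus.yml /etc/prometheus/prometheus.yml

Step 3 is then 'docker build -t my-prometheus .' run from the directory
holding both files.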
For cAdvisor, there is no 'build' to be done, and thus no copying of a yml
file. The only command I know of for cAdvisor is the 'run' command I
detailed earlier in this thread.
(If the cAdvisor image does not exist locally, it is automatically
downloaded before being started.)
In conceptual terms, am I right in thinking that I'm downloading a
'program' that has already been prepared for execution?
If that is true, then the values of any control parameters appear to be
hard-coded within the program.
Given this, I'm not sure I have any control over the container manifest for
cAdvisor.
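
(I believe the standard Docker CLI can at least show what such a prepared
image will execute - its entrypoint and default arguments - for example:

docker inspect --format '{{.Config.Entrypoint}} {{.Config.Cmd}}' ibmcom/cadvisor-s390x:0.33.0

and that anything appended after the image name on 'docker run' replaces
that default command, though I haven't confirmed this on zCX.)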

Regards
Sean


Re: ZCX task monitoring, anyone?

2020-08-27 Thread Attila Fogarasi
The housekeeping interval is part of the container manifest, as it governs
normal operation, not just performance metric collection.  As such, it is
specified wherever you have your container manifest defined (for example,
in a .yaml file, or by HTTP endpoint or HTTP server).  You can also use the
command-line "kubelet" tool.
Scrape_interval is the value for how often Prometheus asks cAdvisor for
data from the collection cache, so it affects the CPU used by cAdvisor to
prepare and send that data to Prometheus.
As for how Docker monitoring works, you are right that there is overlap in
the open-source tools, but the hierarchy is this: cAdvisor collects the
metrics and also does some aggregation and processing, and you can use just
cAdvisor.  Prometheus is a layer on top, getting metrics from cAdvisor, and
provides both better-quality reporting and alerting (which cAdvisor does
not).  The next layer is Grafana, which is a generalized metric analytics
and visualization tool (not just for Docker).  For larger-scale, more
complex container environments you need all three.
In a z/OS context these three tools became integrated circa 30 years ago,
but for Unix they are not.  Splitting the processing like this has both
good and bad points (for example, Grafana can run in a separate Docker
container) but definitely burns more CPU - a lot more.  If you are not
careful, the measurement tooling can cost more than the application being
measured, even though it is "free".
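
For example, a minimal prometheus.yml fragment (the job name is just a
placeholder, and the target matches the container name used elsewhere in
this thread - on the same user-defined Docker network, Prometheus can
resolve it by name):

global:
  scrape_interval: 30s
scrape_configs:
  - job_name: 'cadvisor'
    static_configs:
      - targets: ['cadvisor:8080']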


Re: ZCX task monitoring, anyone?

2020-08-27 Thread Sean Gleann
*sigh*. Oh, how I wish that various e-mail clients would quit re-formatting
stuff. My previous response here was so nice & neat & tidy before I hit
'Send'. Reading that response back via IBM-MAIN makes me look like a
complete illiterate...

Sean


Re: ZCX task monitoring, anyone?

2020-08-27 Thread Sean Gleann
Hi Attila - thanks for the pointers, but I'm not sure how to act on them.

The start-up for cAdvisor that I'm using doesn't feature any pointer to a
parameter list, and despite much googling I don't see any mention of such a
thing. Everything keeps referring back to Prometheus and then on to Grafana.
My cAdvisor start-up (taken directly from the IBM Redbook and slightly
modified to comply with local restrictions):
docker network create monitoring
docker run --name cadvisor -v /sys:/sys:ro \
  -v /var/lib/docker/:/var/lib/docker:ro -v /dev/disk:/dev/disk:ro \
  -d --network monitoring ibmcom/cadvisor-s390x:0.33.0

Perhaps I'm looking at things the wrong way, but my current understanding
is:
cAdvisor (and also Nodeexporter) collects various usage stats;
Prometheus then gathers that data and does some sort of pre-processing of
it (it doesn't tell cAdvisor to 'do something' - it just passively makes
use of the data that cAdvisor collects);
Grafana takes the data from Prometheus and uses it to generate various
graphs/tables/reports.

My situation is that when I run cAdvisor on its own - no other containers
at all - it floods as many processors as I define in the zCX start.json
file.
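
(One way to confirm that it is the cAdvisor container itself consuming the
CPU, using nothing but the standard Docker CLI:

docker stats --no-stream cadvisor

which reports the container's own CPU and memory usage at that instant.)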

Whilst cAdvisor is running, I can go to the relevant web page and see that
it is producing meters/charts, etc., all on its own. Since that is the
case, what is the point of Grafana?

I have a prometheus.yml file that features the term 'scrape_interval' (but
not 'housekeeping'), but that file is for use by Prometheus, isn't it? How
does it affect the amount of work that cAdvisor is doing, given that I
haven't even started that container yet?

Regards
Sean


Re: ZCX task monitoring, anyone?

2020-08-26 Thread Attila Fogarasi
Check your values for the housekeeping interval and scrape_interval.
The recommended values are 15s and 30s (which makes for a 60-second rate
window). A small housekeeping interval will cause cAdvisor CPU usage to be
high, while scrape_interval affects Prometheus CPU usage.  It is entirely
possible to cause data collection to use 100% of the z/OS CPU -- remember
that on Unix systems the rule of thumb is 40% overhead for uncaptured CPU
time, while z/OS is far more efficient and runs well under 10%.  You will
see this behaviour in zCX containers; it isn't going to measure the same as
a z/OS workload.  The optimizations in Unix are premised on CPU time being
low cost (as is memory), while z/OS considers CPU to be high cost and path
length worth saving.  The same goes for the subsystems in z/OS and
performance monitors.
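
To spell out the 60-second rate window: with a 30s scrape_interval, a 1m
range is the smallest guaranteed to contain the two samples a rate
calculation needs, e.g. in PromQL:

rate(container_cpu_usage_seconds_total[1m])

Halve the window to 30s and it will often hold only one sample, so the
rate comes back empty.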


Re: ZCX task monitoring, anyone?

2020-08-26 Thread Sean Gleann
Allan - "...count the beans differently..." Yes, I'm beginning to get used
to that concept. For instance, with the CPU utilisation data that I *have*
been able to retrieve, the metric given is not 'CPU%' but 'number of
cores'. I'm having to do some rapid re-orienting of my way of thinking.
As for the memory size, I've got "mem-gb" : 2 defined in my start.json
file, but I've not seen any indication of paging load at all in my testing.
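
(For reference, the two start.json values mentioned in this thread - this
is only a fragment, with every other field omitted:

{
  "cpu"    : 4,
  "mem-gb" : 2
}

The "cpu" value of 4 is the one quoted in my original post.)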

Michael - 5 zIIPs? I wish! Nope - these are all general-purpose processors.
The z/OS system I'm using is a z/VM guest on a system run by an external
supplier, so I'm not sure whether defining zIIPs would actually achieve
anything. (Is it possible to dedicate a zIIP engine to a specific z/VM
guest? That's a road I've not yet gone down.)
With regard to the WLM definitions, I followed the advice in the Redbook
and I'm reasonably certain I've got it right. Having said that, cross-refer
to a thread that I started earlier this week, titled "WLM Query".
The response to that led to me defining a resource group to cap the started
task at 10 MSU, which resulted in a CPU% utilisation value of roughly 5% -
something I could be happy with.
Under that cap the started task ran, yes, but it ran like a three-legged
dog (my apologies to limb-count-challenged canines).
Start-up of the task, from the START command to the "server is
listening..." message, took over an hour, and
STOP-command-to-task-termination took approx. 30 minutes.
(SSH-ing to the task was a bit of a joke, too. Responses to simple commands
like 'docker ps -a' could be seen 'painting' across the screen, character
by character...)
As a result, I've moved away from trying to limit the task for the time
being, and I'm concentrating on attempting to get cAdvisor to be a bit less
greedy.

Regards
Sean


Re: ZCX task monitoring, anyone?

2020-08-26 Thread Michael Babcock
I can't check my zCX right now since my internet is down.

You are running these on zIIP engines, correct? Must be nice to have 5
zIIPs! And do you have the WLM parts in place? Although it probably
wouldn't make much difference during start-up/shutdown.

--
Michael Babcock
OneMain Financial
z/OS Systems Programmer, Lead



Re: ZCX task monitoring, anyone?

2020-08-26 Thread Allan Staller
I can't speak too much to the EXCP counts, but re: CPU usage.

Unix systems "count the beans" very differently than z/OS.
The "system overhead" (uncaptured CPU time) for z/OS averages about 10%.
The Unix systems I have encountered routinely exceed 40%.

Perhaps looking at a Unix process through a z/OS lens (RMF global data
capture) is distorting things?
I would also check the region size allocated to the zCX address space -
paging might be masquerading as I/O due to the z/OS lens being used.

HTH,

-Original Message-
From: IBM Mainframe Discussion List  On Behalf Of 
Sean Gleann
Sent: Wednesday, August 26, 2020 3:40 AM
To: IBM-MAIN@LISTSERV.UA.EDU
Subject: ZCX task monitoring, anyone?



ZCX task monitoring, anyone?

2020-08-26 Thread Sean Gleann
Can anyone offer advice, please, with regard to monitoring the system
resource consumption of a zCX container task?

I've got a zCX container task running on a 'sandbox' system where - as yet
- I'm not collecting any RMF/SMF data. Because of that, my only source of
system usage data is the SDSF DA panel. I feel that the numbers I see there
are... 'questionable' is the best word I can think of.

Firstly, the EXCP count for the task goes up to about 15360 during the
initial start-up phase, but then it stays there until the STOP command is
issued. At that point, the EXCP count starts rising again, until the task
finally terminates. The explanation is probably that all the I/O is being
handled internally at the 'Linux' level - the task must be doing *some*
I/O, right? - but the data isn't getting back to SDSF for some reason.
Without the benefit of SMF data to examine, I'm wondering if this is part
of a larger problem.

The other thing that troubles me is the CPU% busy value. My sandbox system
has 5 engines defined, and in the 'start.json' file that controls the zCX
container task, I've specified a 'cpu' value of 4. During the start-up
phase for the container started task, SDSF shows CPU% values of approx.
80%, but once the task is fully initialised, this drops to 'tickover' rates
of about 1%. I'm happy with that - the initial start-up of *any* task as
complex as a zCX container is likely to cause high CPU usage, and the
subsequent drop to the 1% level is fine by me.

But... Once the container task is started and I've ssh'd into it, I then
want to monitor its 'internal' system consumption. I've been using the
'Getting Started...' Redbook as my guide throughout this project, and it
talks about using "Nodeexporter", "Cadvisor", "Prometheus" and "Grafana"
as tools for this. I've got all those things installed and I can start and
stop them quite happily, but I've found that using cAdvisor on its own can
drive CPU% levels back up to 80% for the entire time it is running. If a
system is running flat out when all it is doing is monitoring itself, well,
there's something wrong somewhere... I'm trying to find an idiot's guide to
controlling what cAdvisor does, but as yet I've been unsuccessful.
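
(One thought: perhaps the binary can document itself. I'd guess that

docker run --rm ibmcom/cadvisor-s390x:0.33.0 cadvisor --help

would list the supported flags, but I haven't proven that yet.)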

Regards
Sean

--
For IBM-MAIN subscribe / signoff / archive access instructions,
send email to lists...@listserv.ua.edu with the message: INFO IBM-MAIN