Re: [opnfv-tech-discuss] [test-wg] Monitoring dashboard for long duration test
Hi Julien,

Thanks a lot for your review.

Yes, I'll make sure there is POD-level support for metrics. The data duration is configurable, so we can set it to show daily data instead of instantaneous traffic. I'll use a Jinja2 template for the dashboard file; thanks for pointing it out.

Thanks,
Rutuja

On Mon, Nov 27, 2017 at 8:52 PM, Julien wrote:
> It's really cool.
>
> As Kubi mentioned, it would be useful to support POD-level metrics,
> including containers, nodes, and disk/memory usage.
> For network traffic, the instantaneous rate is too sensitive; can we
> provide daily data instead?
> The config file "prototype_prometheus_dashboard" in review is 2000 lines
> long. I suggest using a simple Jinja2 template to produce this file. It
> will be easier to use and maintain.
>
> BR/Julien

___
opnfv-tech-discuss mailing list
opnfv-tech-discuss@lists.opnfv.org
https://lists.opnfv.org/mailman/listinfo/opnfv-tech-discuss
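The difference between instantaneous and daily traffic figures can be shown with a little arithmetic. The sketch below is not taken from the dashboard under review: the function name, the hourly sampling, and the byte counts are all made up for illustration. It averages a cumulative byte counter over a chosen window, which is roughly what widening the range of a Prometheus rate() query would do.

```python
from typing import List, Tuple

# (unix_seconds, cumulative_bytes); hypothetical samples of a monotonic counter
Sample = Tuple[float, float]


def average_rate(samples: List[Sample], window_s: float) -> float:
    """Average bytes/s over the last window_s seconds of a cumulative counter."""
    if len(samples) < 2:
        raise ValueError("need at least two samples")
    end_t, end_v = samples[-1]
    # Keep only the samples that fall inside the window ending at the last sample.
    inside = [s for s in samples if s[0] >= end_t - window_s]
    if len(inside) < 2:
        raise ValueError("window contains fewer than two samples")
    start_t, start_v = inside[0]
    return (end_v - start_v) / (end_t - start_t)


def build_samples() -> List[Sample]:
    """Made-up data: 23 quiet hours at 100 B/s, then one busy hour at 1000 B/s."""
    samples: List[Sample] = [(0.0, 0.0)]
    for hour in range(1, 25):
        step = 3_600_000.0 if hour == 24 else 360_000.0
        t, v = samples[-1]
        samples.append((t + 3600.0, v + step))
    return samples


if __name__ == "__main__":
    data = build_samples()
    print("last hour :", average_rate(data, 3600.0), "B/s")
    print("last day  :", average_rate(data, 86400.0), "B/s")
```

With this data the one-hour window reports the burst in full, while the one-day window smooths it into a much lower average, which is the trade-off behind asking for daily data.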
Re: [opnfv-tech-discuss] [test-wg] Monitoring dashboard for long duration test
It's really cool.

As Kubi mentioned, it would be useful to support POD-level metrics, including containers, nodes, and disk/memory usage.
For network traffic, the instantaneous rate is too sensitive; can we provide daily data instead?
The config file "prototype_prometheus_dashboard" in review is 2000 lines long. I suggest using a simple Jinja2 template to produce this file. It will be easier to use and maintain.

BR/Julien

On Sun, Nov 26, 2017 at 6:05 PM, Rutuja Surve wrote:
> Hi Kubi,
> Thanks for reviewing the dashboard.
> It is possible to monitor multiple hosts (the jump server and its
> corresponding compute and controller nodes) with this dashboard. The
> 'instance' label of every metric corresponds to the IP address of the
> node, hence it's possible to filter by node IP. The whole physical
> deployment is configured in the pod.yaml file, where we can see information
> about the compute and controller nodes.
> We have scripts for installing the statistics-collecting daemons (Cadvisor
> and Collectd) on the jump server and the client nodes (compute and
> controller), which send the metrics to the jump server.
> 'Load' corresponds to the CPU load and is best explained by the
> Prometheus query used to collect it:
>
> node_load1{instance=~"$server:.*"} / count by(job, instance)(count by(job, instance, cpu)(node_cpu{instance=~"$server:.*"}))
>
> CPU usage per container corresponds to:
>
> sum(rate(container_cpu_usage_seconds_total{name=~".+"}[$interval])) by (name) * 100
>
> So if the summed rate over the interval exceeds 1, it can cross 100%
> and reach up to 400%. It's more about how the query is framed.
> The network traffic panel apparently monitors the HTTP port of the host.
>
> Do let me know if you have more questions.
>
> Thanks,
> Rutuja
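The idea of producing the long dashboard file from a short description can be sketched as follows. This is only an illustration under stated assumptions: it uses Python's standard json module as a stand-in for Jinja2 so that it runs without extra packages, and the panel fields, the PANELS list, and the function names are hypothetical simplifications, not the actual layout of "prototype_prometheus_dashboard". With Jinja2, the same panel list would feed a loop in the template instead.

```python
import json
from typing import Dict, List

# Hypothetical panel list: one short entry per graph instead of a hand-written
# JSON object per panel. The queries are the two quoted in this thread.
PANELS: List[Dict[str, str]] = [
    {
        "title": "Load",
        "expr": 'node_load1{instance=~"$server:.*"} / count by(job, instance)'
                '(count by(job, instance, cpu)(node_cpu{instance=~"$server:.*"}))',
    },
    {
        "title": "CPU usage per container",
        "expr": 'sum(rate(container_cpu_usage_seconds_total{name=~".+"}'
                '[$interval])) by (name) * 100',
    },
]


def build_panel(panel_id: int, title: str, expr: str) -> Dict:
    """Expand one short entry into a (simplified, assumed) panel object."""
    return {
        "id": panel_id,
        "title": title,
        "type": "graph",
        "datasource": "Prometheus",
        "targets": [{"expr": expr}],
    }


def build_dashboard(title: str, panels: List[Dict[str, str]]) -> str:
    """Render the whole dashboard as a JSON string."""
    doc = {
        "title": title,
        "panels": [
            build_panel(i, p["title"], p["expr"])
            for i, p in enumerate(panels, start=1)
        ],
    }
    return json.dumps(doc, indent=2)


if __name__ == "__main__":
    print(build_dashboard("Long duration test", PANELS))
```

Adding a metric then means adding one entry to the list rather than copying a block of JSON. Note that json.dumps escapes the double quotes inside each query as \", which is how such queries look when stored in a JSON file.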
Re: [opnfv-tech-discuss] [test-wg] Monitoring dashboard for long duration test
Hi Kubi,

Thanks for reviewing the dashboard.

It is possible to monitor multiple hosts (the jump server and its corresponding compute and controller nodes) with this dashboard. The 'instance' label of every metric corresponds to the IP address of the node, hence it's possible to filter by node IP. The whole physical deployment is configured in the pod.yaml file, where we can see information about the compute and controller nodes.
We have scripts for installing the statistics-collecting daemons (Cadvisor and Collectd) on the jump server and the client nodes (compute and controller), which send the metrics to the jump server.

'Load' corresponds to the CPU load and is best explained by the Prometheus query used to collect it:

node_load1{instance=~"$server:.*"} / count by(job, instance)(count by(job, instance, cpu)(node_cpu{instance=~"$server:.*"}))

CPU usage per container corresponds to:

sum(rate(container_cpu_usage_seconds_total{name=~".+"}[$interval])) by (name) * 100

So if the summed rate over the interval exceeds 1, meaning the container consumes more than one CPU-second per second (more than one core), the value crosses 100%; it can reach up to 400%. It's more about how the query is framed.
The network traffic panel apparently monitors the HTTP port of the host.

Do let me know if you have more questions.

Thanks,
Rutuja

On Wed, Nov 22, 2017 at 8:25 AM, Gaoliang (kubi) wrote:
> Hi Rutuja,
>
> The dashboard looks pretty good ☺
>
> Just a few questions about the dashboard.
>
> What kind of SUT can you monitor? A single host, or 5 hosts (an OPNFV
> physical HA deployment)? It seems that it can be filtered by node IP. Do
> we have a whole view of a physical deployment POD?
>
> What does “Load” mean? CPU load? Why can “CPU usage” go up to 400%?
> Does “Network Traffic” monitor one port or all ports of the host?
>
> Regards,
> Kubi
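The arithmetic behind the two queries in this message can be worked through in a few lines. The sketch below only mirrors the math; the function names and the numbers are invented for illustration and nothing here talks to Prometheus. In the first query the nested count yields the number of distinct cpu labels per instance, i.e. the core count, so the result is the load per core. In the second, rate() of a CPU-seconds counter is CPU-seconds consumed per second, so multiplying by 100 gives a percentage of one core.

```python
def load_per_core(load1: float, cpu_count: int) -> float:
    """node_load1 divided by the number of cores on the same instance."""
    if cpu_count <= 0:
        raise ValueError("cpu_count must be positive")
    return load1 / cpu_count


def container_cpu_percent(cpu_seconds_start: float,
                          cpu_seconds_end: float,
                          interval_s: float) -> float:
    """Per-second increase of a CPU-seconds counter, as a percentage of one core."""
    if interval_s <= 0:
        raise ValueError("interval_s must be positive")
    return (cpu_seconds_end - cpu_seconds_start) / interval_s * 100.0


if __name__ == "__main__":
    # Made-up numbers: a 1-minute load of 6.0 on an 8-core node.
    print(load_per_core(6.0, 8))
    # Made-up numbers: 240 CPU-seconds consumed in a 60 s interval, which is
    # four cores' worth of work and therefore reads as 400%.
    print(container_cpu_percent(1000.0, 1240.0, 60.0))
```

A container that keeps four cores fully busy accumulates four CPU-seconds per second, which is why the panel can legitimately show values above 100%.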
Re: [opnfv-tech-discuss] [test-wg] Monitoring dashboard for long duration test
Hi Rutuja,

The dashboard looks pretty good ☺

Just a few questions about the dashboard.

What kind of SUT can you monitor? A single host, or 5 hosts (an OPNFV physical HA deployment)? It seems that it can be filtered by node IP. Do we have a whole view of a physical deployment POD?

What does “Load” mean? CPU load? Why can “CPU usage” go up to 400%? Does “Network Traffic” monitor one port or all ports of the host?

Regards,
Kubi

From: test-wg-boun...@lists.opnfv.org [mailto:test-wg-boun...@lists.opnfv.org] On Behalf Of Rutuja Surve
Sent: Tuesday, November 21, 2017 4:42 PM
To: opnfv-tech-discuss@lists.opnfv.org; test...@lists.opnfv.org
Subject: [test-wg] Monitoring dashboard for long duration test

Hi,

I am currently working on a Bottlenecks intern project focusing on monitoring/dashboarding for long duration tests.
We are close to a prototype. We need your opinions/comments on how to organize the dashboard, what metrics/plugins should be included, etc.
The screenshots/details of the pre-prototype dashboard are provided below. Please comment on them.
Currently, the dashboard is not publicly accessible. If you'd like to know more details/operations, please refer to the gerrit patch:

https://gerrit.opnfv.org/gerrit/#/c/47567/

and give your review there, or attend the Bottlenecks meeting tomorrow (Wednesday) at 8:30 am IST, where I will provide regular progress reports and show customization of the dashboard.

Please also find the screenshot PDF of the dashboard attached to this e-mail.
We are using Prometheus for querying and as the data source, the Cadvisor and Collectd plugins for collecting system metrics, and Grafana for displaying the dashboard.

Thanks,
Rutuja