Re: [Xen-devel] [PATCH 4/4] tools: add total/local memory bandwith monitoring
On Tue, Jan 06, 2015 at 10:29:18AM +0000, Andrew Cooper wrote:
> On 06/01/15 10:09, Chao Peng wrote:
>> On Mon, Jan 05, 2015 at 12:39:42PM +0000, Wei Liu wrote:
>>> On Tue, Dec 23, 2014 at 04:54:39PM +0800, Chao Peng wrote:
>>> [...]
>>>> +static int libxl__psr_cmt_get_mem_bandwidth(libxl__gc *gc, uint32_t domid,
>>>> +    xc_psr_cmt_type type, uint32_t socketid, uint32_t *bandwidth)
>>>> +{
>>>> +    uint64_t sample1, sample2;
>>>> +    uint32_t upscaling_factor;
>>>> +    int rc;
>>>> +
>>>> +    rc = libxl__psr_cmt_get_l3_monitoring_data(gc, domid,
>>>> +                                               type, socketid, &sample1);
>>>> +    if (rc < 0)
>>>> +        return ERROR_FAIL;
>>>> +
>>>> +    usleep(10000);
>>>> +
>>>> +    rc = libxl__psr_cmt_get_l3_monitoring_data(gc, domid,
>>>> +                                               type, socketid, &sample2);
>>>> +    if (rc < 0)
>>>> +        return ERROR_FAIL;
>>>> +
>>>> +    if (sample2 < sample1) {
>>>> +        LOGE(ERROR, "event counter overflowed between two samplings");
>>>> +        return ERROR_FAIL;
>>>> +    }
>>>> +
>>> What's the likelihood of counter overflows? Can we handle this more
>>> gracefully? Say, retry (with maximum retry cap) when counter overflows?
>> The likelihood is very small here. Hardware guarantees the counter will
>> not overflow in one second even under maximum platform bandwidth
>> conditions, and we only sleep 0.01 second here.
>>
>> I'd like to adopt your suggestion to retry once that happens. Only one
>> retry should be needed, and it should correct the overflow.
>>
>> Thanks,
>> Chao
> You have no possible way of guaranteeing that the actual elapsed time
> between the two samples is less than 1 second. On a very heavily loaded
> system, even regular task scheduling could cause an actual elapsed time
> of more than one second in that snippet of code.

On further thought, this cannot be made correct if implemented only in
the tool stack, because the duration between the two samples cannot be
guaranteed. Even if sample2 > sample1 here, the data may still be wrong,
as the hardware counter may have overflowed more than once during the
period. What the hardware guarantees is that at most one overflow (which
software can correct) happens when the duration between two samples is
less than 1 second.

So only data obtained from two samples taken less than 1 second apart is
valid, and the duration must be checked before the data is used. That
means something must be done in the hypervisor. My initial solution is:

Add a new hypercall to get both the counter value and the timestamp at
that moment (the two operations should be atomic). (It looks like a bad
fit to add this to the existing resource_op hypercall.)

Any suggestions?

Thanks,
Chao

_______________________________________________
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel
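The tool-stack side of the proposed hypercall could look something like the sketch below: the hypervisor returns each counter value together with a timestamp taken atomically, so the caller can verify the window and correct a single overflow by modular subtraction. This is a minimal illustration only; the struct layout, helper name, and the 24-bit counter width are assumptions, not part of the posted patch or of any existing Xen interface.

```c
#include <stdint.h>

#define PSR_CMT_COUNTER_WIDTH 24            /* assumed; the real width is CPU-specific */
#define NSEC_PER_SEC          1000000000ULL

struct psr_cmt_sample {
    uint64_t counter;   /* raw event counter, as returned by the hypothetical hypercall */
    uint64_t ts_ns;     /* timestamp taken atomically with the counter read */
};

/* Returns 0 and stores the corrected counter delta when the window was
 * under 1 s (so at most one overflow can have occurred); returns -1 when
 * the window was too wide and the delta cannot be trusted. */
static int psr_cmt_counter_delta(const struct psr_cmt_sample *s1,
                                 const struct psr_cmt_sample *s2,
                                 uint64_t *delta)
{
    const uint64_t mask = (1ULL << PSR_CMT_COUNTER_WIDTH) - 1;

    if (s2->ts_ns - s1->ts_ns >= NSEC_PER_SEC)
        return -1;

    /* Modular subtraction absorbs a single wrap-around. */
    *delta = (s2->counter - s1->counter) & mask;
    return 0;
}
```

With atomically paired timestamps, the "more than one overflow" case is excluded by construction whenever the check passes, which is exactly the guarantee the tool stack alone cannot provide.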
Re: [Xen-devel] [PATCH 4/4] tools: add total/local memory bandwith monitoring
On 06/01/15 10:09, Chao Peng wrote:
> On Mon, Jan 05, 2015 at 12:39:42PM +0000, Wei Liu wrote:
>> On Tue, Dec 23, 2014 at 04:54:39PM +0800, Chao Peng wrote:
>> [...]
>>> +static int libxl__psr_cmt_get_mem_bandwidth(libxl__gc *gc, uint32_t domid,
>>> +    xc_psr_cmt_type type, uint32_t socketid, uint32_t *bandwidth)
>>> +{
>>> +    uint64_t sample1, sample2;
>>> +    uint32_t upscaling_factor;
>>> +    int rc;
>>> +
>>> +    rc = libxl__psr_cmt_get_l3_monitoring_data(gc, domid,
>>> +                                               type, socketid, &sample1);
>>> +    if (rc < 0)
>>> +        return ERROR_FAIL;
>>> +
>>> +    usleep(10000);
>>> +
>>> +    rc = libxl__psr_cmt_get_l3_monitoring_data(gc, domid,
>>> +                                               type, socketid, &sample2);
>>> +    if (rc < 0)
>>> +        return ERROR_FAIL;
>>> +
>>> +    if (sample2 < sample1) {
>>> +        LOGE(ERROR, "event counter overflowed between two samplings");
>>> +        return ERROR_FAIL;
>>> +    }
>>> +
>> What's the likelihood of counter overflows? Can we handle this more
>> gracefully? Say, retry (with maximum retry cap) when counter overflows?
> The likelihood is very small here. Hardware guarantees the counter will
> not overflow in one second even under maximum platform bandwidth
> conditions, and we only sleep 0.01 second here.
>
> I'd like to adopt your suggestion to retry once that happens. Only one
> retry should be needed, and it should correct the overflow.
>
> Thanks,
> Chao

You have no possible way of guaranteeing that the actual elapsed time
between the two samples is less than 1 second. On a very heavily loaded
system, even regular task scheduling could cause an actual elapsed time
of more than one second in that snippet of code.

~Andrew
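Andrew's objection cannot be removed in the tool stack, but it can at least be detected there by timing the window with a monotonic clock and rejecting sample pairs that took too long. The sketch below shows the timing check only; `timespec_diff_ns()` and `window_within_bound()` are helpers introduced here for illustration, not libxl functions.

```c
#include <stdint.h>
#include <time.h>

/* Nanoseconds elapsed between two monotonic-clock readings. */
static uint64_t timespec_diff_ns(const struct timespec *start,
                                 const struct timespec *end)
{
    return (uint64_t)((int64_t)(end->tv_sec - start->tv_sec) * 1000000000LL
                      + (end->tv_nsec - start->tv_nsec));
}

/* Non-zero when the pair of samples was taken less than 1 s apart,
 * i.e. within the hardware's single-overflow guarantee. */
static int window_within_bound(const struct timespec *start,
                               const struct timespec *end)
{
    return timespec_diff_ns(start, end) < 1000000000ULL;
}
```

In the patch this would amount to calling `clock_gettime(CLOCK_MONOTONIC, ...)` around the two `libxl__psr_cmt_get_l3_monitoring_data()` calls and failing (or retrying) when `window_within_bound()` is false. It still cannot make the window short, only prove after the fact whether it was.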
Re: [Xen-devel] [PATCH 4/4] tools: add total/local memory bandwith monitoring
On Tue, Jan 06, 2015 at 10:29:18AM +0000, Andrew Cooper wrote:
> On 06/01/15 10:09, Chao Peng wrote:
>> On Mon, Jan 05, 2015 at 12:39:42PM +0000, Wei Liu wrote:
>>> On Tue, Dec 23, 2014 at 04:54:39PM +0800, Chao Peng wrote:
>>> [...]
>>>> +static int libxl__psr_cmt_get_mem_bandwidth(libxl__gc *gc, uint32_t domid,
>>>> +    xc_psr_cmt_type type, uint32_t socketid, uint32_t *bandwidth)
>>>> +{
>>>> +    uint64_t sample1, sample2;
>>>> +    uint32_t upscaling_factor;
>>>> +    int rc;
>>>> +
>>>> +    rc = libxl__psr_cmt_get_l3_monitoring_data(gc, domid,
>>>> +                                               type, socketid, &sample1);
>>>> +    if (rc < 0)
>>>> +        return ERROR_FAIL;
>>>> +
>>>> +    usleep(10000);
>>>> +
>>>> +    rc = libxl__psr_cmt_get_l3_monitoring_data(gc, domid,
>>>> +                                               type, socketid, &sample2);
>>>> +    if (rc < 0)
>>>> +        return ERROR_FAIL;
>>>> +
>>>> +    if (sample2 < sample1) {
>>>> +        LOGE(ERROR, "event counter overflowed between two samplings");
>>>> +        return ERROR_FAIL;
>>>> +    }
>>>> +
>>> What's the likelihood of counter overflows? Can we handle this more
>>> gracefully? Say, retry (with maximum retry cap) when counter overflows?
>> The likelihood is very small here. Hardware guarantees the counter will
>> not overflow in one second even under maximum platform bandwidth
>> conditions, and we only sleep 0.01 second here.
>>
>> I'd like to adopt your suggestion to retry once that happens. Only one
>> retry should be needed, and it should correct the overflow.
>>
>> Thanks,
>> Chao
> You have no possible way of guaranteeing that the actual elapsed time
> between the two samples is less than 1 second. On a very heavily loaded
> system, even regular task scheduling could cause an actual elapsed time
> of more than one second in that snippet of code.

Yes, that's true. So the retry cap Wei suggested will be applied.

Thanks,
Chao
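The retry-with-a-cap scheme agreed on above can be modeled as pure logic, separate from the libxl plumbing. In the sketch below, `samples[]` stands in for successive reads of the hardware counter (each attempt consumes two reads); the array-based interface and `MAX_SAMPLE_RETRIES` value are illustrative assumptions, not libxl code.

```c
#include <stdint.h>

#define MAX_SAMPLE_RETRIES 3    /* assumed cap; the thread does not fix a number */

/* Walks pairs of counter readings until one pair shows no overflow
 * (second reading >= first) or the retry cap is exhausted.  Returns 0
 * and stores the delta on success, -1 on failure. */
static int sample_delta_with_retry(const uint64_t *samples, int nsamples,
                                   uint64_t *delta)
{
    int reads = 0;

    for (int attempt = 0; attempt < MAX_SAMPLE_RETRIES; attempt++) {
        if (reads + 2 > nsamples)
            return -1;                    /* ran out of readings */

        uint64_t s1 = samples[reads++];
        uint64_t s2 = samples[reads++];

        if (s2 >= s1) {                   /* no overflow observed */
            *delta = s2 - s1;
            return 0;
        }
        /* Counter wrapped between the two reads: retry. */
    }
    return -1;                            /* retry cap exhausted */
}
```

In the real code each pair of reads would be two `libxl__psr_cmt_get_l3_monitoring_data()` calls separated by the sleep; the cap ensures the function cannot loop forever on a pathologically loaded system.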
Re: [Xen-devel] [PATCH 4/4] tools: add total/local memory bandwith monitoring
On Tue, Dec 23, 2014 at 04:54:39PM +0800, Chao Peng wrote:
[...]
> +static int libxl__psr_cmt_get_mem_bandwidth(libxl__gc *gc, uint32_t domid,
> +    xc_psr_cmt_type type, uint32_t socketid, uint32_t *bandwidth)
> +{
> +    uint64_t sample1, sample2;
> +    uint32_t upscaling_factor;
> +    int rc;
> +
> +    rc = libxl__psr_cmt_get_l3_monitoring_data(gc, domid,
> +                                               type, socketid, &sample1);
> +    if (rc < 0)
> +        return ERROR_FAIL;
> +
> +    usleep(10000);
> +
> +    rc = libxl__psr_cmt_get_l3_monitoring_data(gc, domid,
> +                                               type, socketid, &sample2);
> +    if (rc < 0)
> +        return ERROR_FAIL;
> +
> +    if (sample2 < sample1) {
> +        LOGE(ERROR, "event counter overflowed between two samplings");
> +        return ERROR_FAIL;
> +    }
> +

What's the likelihood of counter overflows? Can we handle this more
gracefully? Say, retry (with maximum retry cap) when counter overflows?

> +    rc = xc_psr_cmt_get_l3_upscaling_factor(CTX->xch, &upscaling_factor);
> +    if (rc < 0) {
> +        LOGE(ERROR, "failed to get L3 upscaling factor");
> +        return ERROR_FAIL;
> +    }
> +
> +    *bandwidth = (sample2 - sample1) * 100 * upscaling_factor / 1024;
> +    return rc;
> +}
> +
> +int libxl_psr_cmt_get_total_mem_bandwidth(libxl_ctx *ctx, uint32_t domid,
> +    uint32_t socketid, uint32_t *bandwidth)
> +{
> +    GC_INIT(ctx);
> +    int rc;
> +
> +    rc = libxl__psr_cmt_get_mem_bandwidth(gc, domid,
> +        XC_PSR_CMT_TOTAL_MEM_BANDWIDTH, socketid, bandwidth);
> +    GC_FREE;
> +    return rc;
> +}
> +
> +int libxl_psr_cmt_get_local_mem_bandwidth(libxl_ctx *ctx, uint32_t domid,
> +    uint32_t socketid, uint32_t *bandwidth)
> +{
> +    GC_INIT(ctx);
> +    int rc;
> +
> +    rc = libxl__psr_cmt_get_mem_bandwidth(gc, domid,
> +        XC_PSR_CMT_LOCAL_MEM_BANDWIDTH, socketid, bandwidth);
> +    GC_FREE;
> +    return rc;
> +}
> +
>  /*
>   * Local variables:
>   * mode: C
> diff --git a/tools/libxl/libxl_types.idl b/tools/libxl/libxl_types.idl
> index f7fc695..8029a39 100644
> --- a/tools/libxl/libxl_types.idl
> +++ b/tools/libxl/libxl_types.idl
> @@ -693,4 +693,6 @@ libxl_event = Struct("event",[
>  libxl_psr_cmt_type = Enumeration("psr_cmt_type", [
>      (1, "CACHE_OCCUPANCY"),
> +    (2, "TOTAL_MEM_BANDWIDTH"),
> +    (3, "LOCAL_MEM_BANDWIDTH"),
>      ])
> diff --git a/tools/libxl/xl_cmdimpl.c b/tools/libxl/xl_cmdimpl.c
> index f4534ec..e0435dd 100644
> --- a/tools/libxl/xl_cmdimpl.c
> +++ b/tools/libxl/xl_cmdimpl.c
> @@ -7867,6 +7867,16 @@ static void psr_cmt_print_domain_l3_info(libxl_dominfo *dominfo,
>                                           socketid, &data) )
>              printf("%13u KB", data);
>          break;
> +    case LIBXL_PSR_CMT_TYPE_TOTAL_MEM_BANDWIDTH:
> +        if ( !libxl_psr_cmt_get_total_mem_bandwidth(ctx, dominfo->domid,

Coding style.

> +                                                    socketid, &data) )
> +            printf("%11u KB/s", data);
> +        break;
> +    case LIBXL_PSR_CMT_TYPE_LOCAL_MEM_BANDWIDTH:
> +        if ( !libxl_psr_cmt_get_local_mem_bandwidth(ctx, dominfo->domid,

Ditto.

Wei.
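The bandwidth formula in the patch combines three conversions: the counter delta over a 0.01 s window is multiplied by 100 to get events per second, each event is worth `upscaling_factor` bytes, and the result is divided by 1024 to get KB/s. A worked model, with the helper name and example values introduced purely for illustration:

```c
#include <stdint.h>

/* Model of the patch's conversion:
 *   bandwidth (KB/s) = (sample2 - sample1) * 100 * upscaling_factor / 1024
 * assuming the two samples are taken 0.01 s apart and upscaling_factor
 * is the number of bytes represented by one counter tick. */
static uint32_t mem_bandwidth_kbps(uint64_t sample1, uint64_t sample2,
                                   uint32_t upscaling_factor)
{
    return (uint32_t)((sample2 - sample1) * 100 * upscaling_factor / 1024);
}
```

For example, a delta of 1024 ticks with a 1024-byte upscaling factor corresponds to 1024 * 1024 bytes in 0.01 s, i.e. 100 MiB/s. Note this formula is only meaningful when the overflow and window concerns discussed earlier in the thread are addressed.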