Re: [Xen-devel] [PATCH 4/4] tools: add total/local memory bandwidth monitoring

2015-01-15 Thread Chao Peng
On Tue, Jan 06, 2015 at 10:29:18AM +0000, Andrew Cooper wrote:
 On 06/01/15 10:09, Chao Peng wrote:
  On Mon, Jan 05, 2015 at 12:39:42PM +0000, Wei Liu wrote:
  On Tue, Dec 23, 2014 at 04:54:39PM +0800, Chao Peng wrote:
  [...]
  +static int libxl__psr_cmt_get_mem_bandwidth(libxl__gc *gc, uint32_t domid,
  +    xc_psr_cmt_type type, uint32_t socketid, uint32_t *bandwidth)
  +{
  +    uint64_t sample1, sample2;
  +    uint32_t upscaling_factor;
  +    int rc;
  +
  +    rc = libxl__psr_cmt_get_l3_monitoring_data(gc, domid,
  +        type, socketid, &sample1);
  +    if (rc < 0)
  +        return ERROR_FAIL;
  +
  +    usleep(10000);
  +
  +    rc = libxl__psr_cmt_get_l3_monitoring_data(gc, domid,
  +        type, socketid, &sample2);
  +    if (rc < 0)
  +        return ERROR_FAIL;
  +
  +    if (sample2 < sample1) {
  +        LOGE(ERROR, "event counter overflowed between two samplings");
  +        return ERROR_FAIL;
  +    }
  +
  What's the likelihood of counter overflows? Can we handle this more
  gracefully? Say, retry (with maximum retry cap) when counter overflows?
  The likelihood is very small here. Hardware guarantees the counter will
  not overflow in one second even under maximum platform bandwidth conditions.
  And we only sleep 0.01 second here. 
 
  I'd like to adopt your suggestion and retry once when that happens.
  A single retry should be enough to correct the overflow.
 
  Thanks,
  Chao
 
 You have no possible way of guaranteeing that the actual elapsed time
 between the two samples is less than 1 second.  On a very heavily loaded
 system, even regular task scheduling could cause an actual elapsed time
 of more than one second in that snippet of code.
 
On further thought, this cannot be made correct if it is implemented only in
the tool stack, because the duration between the two samples cannot be
guaranteed. Even if we get sample2 > sample1 here, the data may still be
wrong, since the hardware counter may have overflowed more than once during
that period.

What the hardware guarantees is that at most one overflow can happen (which
software can correct) when the duration between two samples is less than 1
second. So only data obtained from two samples whose actual duration is less
than 1 second is valid.
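
For reference, the single-overflow correction itself is straightforward,
something like this (illustrative only; "width", the counter width in bits,
is a made-up parameter that real code would have to obtain from hardware
enumeration):

    #include <stdint.h>

    /* Correct at most one counter overflow in software; only valid if
     * the two samples are known to be less than 1 second apart. */
    static uint64_t delta_one_overflow(uint64_t sample1, uint64_t sample2,
                                       unsigned int width)
    {
        uint64_t max = (width >= 64) ? UINT64_MAX : ((1ULL << width) - 1);

        if (sample2 >= sample1)
            return sample2 - sample1;       /* no overflow observed */

        /* exactly one overflow: wrap the difference around the maximum */
        return (max - sample1) + sample2 + 1;
    }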

The duration must be checked before the data can be used, and that means
something must be done in the hypervisor.

My initial solution: add a new hypercall that returns both the counter value
and the timestamp at that moment (the two reads must be atomic).
(It does not look appropriate to add this to the existing resource_op
hypercall.)
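
Something like the following, just to illustrate the shape (nothing of this
exists in Xen today; the structure, the names and the tsc-frequency
parameter are all made up):

    #include <stdint.h>

    /* Hypothetical output of a hypercall that reads the event counter
     * and a timestamp atomically in the hypervisor. */
    struct psr_cmt_sample {
        uint64_t counter;   /* raw event counter value */
        uint64_t tsc;       /* timestamp taken at the same instant */
    };

    /* With an atomic (counter, tsc) pair, the tool stack can compute
     * bandwidth from the real elapsed time instead of an assumed sleep. */
    static double bytes_per_sec(const struct psr_cmt_sample *s1,
                                const struct psr_cmt_sample *s2,
                                uint64_t tsc_hz, uint32_t upscaling_factor)
    {
        double seconds = (double)(s2->tsc - s1->tsc) / (double)tsc_hz;
        return (double)(s2->counter - s1->counter) * upscaling_factor
               / seconds;
    }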


Any suggestions?

Thanks,
Chao
 
 

___
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel


Re: [Xen-devel] [PATCH 4/4] tools: add total/local memory bandwidth monitoring

2015-01-06 Thread Andrew Cooper
On 06/01/15 10:09, Chao Peng wrote:
 On Mon, Jan 05, 2015 at 12:39:42PM +0000, Wei Liu wrote:
 On Tue, Dec 23, 2014 at 04:54:39PM +0800, Chao Peng wrote:
 [...]
 +static int libxl__psr_cmt_get_mem_bandwidth(libxl__gc *gc, uint32_t domid,
 +    xc_psr_cmt_type type, uint32_t socketid, uint32_t *bandwidth)
 +{
 +    uint64_t sample1, sample2;
 +    uint32_t upscaling_factor;
 +    int rc;
 +
 +    rc = libxl__psr_cmt_get_l3_monitoring_data(gc, domid,
 +        type, socketid, &sample1);
 +    if (rc < 0)
 +        return ERROR_FAIL;
 +
 +    usleep(10000);
 +
 +    rc = libxl__psr_cmt_get_l3_monitoring_data(gc, domid,
 +        type, socketid, &sample2);
 +    if (rc < 0)
 +        return ERROR_FAIL;
 +
 +    if (sample2 < sample1) {
 +        LOGE(ERROR, "event counter overflowed between two samplings");
 +        return ERROR_FAIL;
 +    }
 +
 What's the likelihood of counter overflows? Can we handle this more
 gracefully? Say, retry (with maximum retry cap) when counter overflows?
 The likelihood is very small here. Hardware guarantees the counter will
 not overflow in one second even under maximum platform bandwidth conditions.
 And we only sleep 0.01 second here. 

 I'd like to adopt your suggestion and retry once when that happens.
 A single retry should be enough to correct the overflow.

 Thanks,
 Chao

You have no possible way of guaranteeing that the actual elapsed time
between the two samples is less than 1 second.  On a very heavily loaded
system, even regular task scheduling could cause an actual elapsed time
of more than one second in that snippet of code.
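
Measuring the window in the tool stack can at least detect when it has been
blown, though it cannot prevent it. A sketch (illustrative; the helper name
is made up), using clock_gettime(CLOCK_MONOTONIC, ...) around the two reads:

    #include <time.h>

    /* Reject a pair of samples if the actual elapsed time between them
     * reached the 1-second overflow bound. */
    static int sample_window_ok(const struct timespec *t1,
                                const struct timespec *t2)
    {
        double elapsed = (double)(t2->tv_sec - t1->tv_sec) +
                         (double)(t2->tv_nsec - t1->tv_nsec) / 1e9;
        return elapsed < 1.0;
    }

    /* usage sketch:
     *   clock_gettime(CLOCK_MONOTONIC, &t1);  read sample1;
     *   usleep(10000);                        read sample2;
     *   clock_gettime(CLOCK_MONOTONIC, &t2);
     *   if (!sample_window_ok(&t1, &t2))      retry or fail;
     */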

~Andrew


___
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel


Re: [Xen-devel] [PATCH 4/4] tools: add total/local memory bandwidth monitoring

2015-01-06 Thread Chao Peng
On Tue, Jan 06, 2015 at 10:29:18AM +0000, Andrew Cooper wrote:
 On 06/01/15 10:09, Chao Peng wrote:
  On Mon, Jan 05, 2015 at 12:39:42PM +0000, Wei Liu wrote:
  On Tue, Dec 23, 2014 at 04:54:39PM +0800, Chao Peng wrote:
  [...]
  +static int libxl__psr_cmt_get_mem_bandwidth(libxl__gc *gc, uint32_t domid,
  +    xc_psr_cmt_type type, uint32_t socketid, uint32_t *bandwidth)
  +{
  +    uint64_t sample1, sample2;
  +    uint32_t upscaling_factor;
  +    int rc;
  +
  +    rc = libxl__psr_cmt_get_l3_monitoring_data(gc, domid,
  +        type, socketid, &sample1);
  +    if (rc < 0)
  +        return ERROR_FAIL;
  +
  +    usleep(10000);
  +
  +    rc = libxl__psr_cmt_get_l3_monitoring_data(gc, domid,
  +        type, socketid, &sample2);
  +    if (rc < 0)
  +        return ERROR_FAIL;
  +
  +    if (sample2 < sample1) {
  +        LOGE(ERROR, "event counter overflowed between two samplings");
  +        return ERROR_FAIL;
  +    }
  +
  What's the likelihood of counter overflows? Can we handle this more
  gracefully? Say, retry (with maximum retry cap) when counter overflows?
  The likelihood is very small here. Hardware guarantees the counter will
  not overflow in one second even under maximum platform bandwidth conditions.
  And we only sleep 0.01 second here. 
 
  I'd like to adopt your suggestion and retry once when that happens.
  A single retry should be enough to correct the overflow.
 
  Thanks,
  Chao
 
 You have no possible way of guaranteeing that the actual elapsed time
 between the two samples is less than 1 second.  On a very heavily loaded
 system, even regular task scheduling could cause an actual elapsed time
 of more than one second in that snippet of code.
Yes, that's true. So the retry cap Wei suggested will be applied. Roughly
like this (a sketch only; take_sample stands in for the real L3
monitoring-data read, and the cap value is a placeholder):
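
    #include <stdint.h>
    #include <unistd.h>

    #define NR_SAMPLE_RETRIES 4

    typedef int (*take_sample_fn)(uint64_t *sample);

    /* Returns 0 with a valid delta, or -1 if a read failed or every
     * attempt observed a counter overflow. */
    static int sample_delta_with_retry(take_sample_fn take_sample,
                                       uint64_t *delta)
    {
        uint64_t s1, s2;
        int retry;

        for (retry = 0; retry < NR_SAMPLE_RETRIES; retry++) {
            if (take_sample(&s1) < 0)
                return -1;
            usleep(10000);               /* 0.01s between samples */
            if (take_sample(&s2) < 0)
                return -1;
            if (s2 >= s1) {              /* no overflow observed */
                *delta = s2 - s1;
                return 0;
            }
            /* counter wrapped: try again, up to the cap */
        }
        return -1;
    }
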
Thanks.
Chao
 
 
 

___
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel


Re: [Xen-devel] [PATCH 4/4] tools: add total/local memory bandwidth monitoring

2015-01-05 Thread Wei Liu
On Tue, Dec 23, 2014 at 04:54:39PM +0800, Chao Peng wrote:
[...]
 +static int libxl__psr_cmt_get_mem_bandwidth(libxl__gc *gc, uint32_t domid,
 +    xc_psr_cmt_type type, uint32_t socketid, uint32_t *bandwidth)
 +{
 +    uint64_t sample1, sample2;
 +    uint32_t upscaling_factor;
 +    int rc;
 +
 +    rc = libxl__psr_cmt_get_l3_monitoring_data(gc, domid,
 +        type, socketid, &sample1);
 +    if (rc < 0)
 +        return ERROR_FAIL;
 +
 +    usleep(10000);
 +
 +    rc = libxl__psr_cmt_get_l3_monitoring_data(gc, domid,
 +        type, socketid, &sample2);
 +    if (rc < 0)
 +        return ERROR_FAIL;
 +
 +    if (sample2 < sample1) {
 +        LOGE(ERROR, "event counter overflowed between two samplings");
 +        return ERROR_FAIL;
 +    }
 +

What's the likelihood of counter overflows? Can we handle this more
gracefully? Say, retry (with maximum retry cap) when counter overflows?

 +    rc = xc_psr_cmt_get_l3_upscaling_factor(CTX->xch, &upscaling_factor);
 +    if (rc < 0) {
 +        LOGE(ERROR, "failed to get L3 upscaling factor");
 +        return ERROR_FAIL;
 +    }
 +
 +    *bandwidth = (sample2 - sample1) * 100 * upscaling_factor / 1024;
 +    return rc;
 +}
 +
 +int libxl_psr_cmt_get_total_mem_bandwidth(libxl_ctx *ctx, uint32_t domid,
 +    uint32_t socketid, uint32_t *bandwidth)
 +{
 +    GC_INIT(ctx);
 +    int rc;
 +
 +    rc = libxl__psr_cmt_get_mem_bandwidth(gc, domid,
 +        XC_PSR_CMT_TOTAL_MEM_BANDWIDTH, socketid, bandwidth);
 +    GC_FREE;
 +    return rc;
 +}
 +
 +int libxl_psr_cmt_get_local_mem_bandwidth(libxl_ctx *ctx, uint32_t domid,
 +    uint32_t socketid, uint32_t *bandwidth)
 +{
 +    GC_INIT(ctx);
 +    int rc;
 +
 +    rc = libxl__psr_cmt_get_mem_bandwidth(gc, domid,
 +        XC_PSR_CMT_LOCAL_MEM_BANDWIDTH, socketid, bandwidth);
 +    GC_FREE;
 +    return rc;
 +}
 +
  /*
   * Local variables:
   * mode: C
 diff --git a/tools/libxl/libxl_types.idl b/tools/libxl/libxl_types.idl
 index f7fc695..8029a39 100644
 --- a/tools/libxl/libxl_types.idl
 +++ b/tools/libxl/libxl_types.idl
 @@ -693,4 +693,6 @@ libxl_event = Struct("event",[
  
  libxl_psr_cmt_type = Enumeration("psr_cmt_type", [
      (1, "CACHE_OCCUPANCY"),
 +    (2, "TOTAL_MEM_BANDWIDTH"),
 +    (3, "LOCAL_MEM_BANDWIDTH"),
      ])
 diff --git a/tools/libxl/xl_cmdimpl.c b/tools/libxl/xl_cmdimpl.c
 index f4534ec..e0435dd 100644
 --- a/tools/libxl/xl_cmdimpl.c
 +++ b/tools/libxl/xl_cmdimpl.c
 @@ -7867,6 +7867,16 @@ static void psr_cmt_print_domain_l3_info(libxl_dominfo *dominfo,
                                           socketid, &data) )
              printf("%13u KB", data);
              break;
 +        case LIBXL_PSR_CMT_TYPE_TOTAL_MEM_BANDWIDTH:
 +            if ( !libxl_psr_cmt_get_total_mem_bandwidth(ctx, dominfo->domid,

Coding style.

 +                                          socketid, &data) )
 +                printf("%11u KB/s", data);
 +            break;
 +        case LIBXL_PSR_CMT_TYPE_LOCAL_MEM_BANDWIDTH:
 +            if ( !libxl_psr_cmt_get_local_mem_bandwidth(ctx, dominfo->domid,

Ditto.

Wei.
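
For reference, the unit conversion in the bandwidth assignment above works
out as follows: the two samples are 0.01s apart, so the delta is multiplied
by 100 to get a per-second rate; the upscaling factor converts counter units
to bytes; dividing by 1024 gives KB/s. A standalone check with made-up
numbers:

    #include <stdint.h>
    #include <stdio.h>

    int main(void)
    {
        /* made-up sample values, purely for illustration */
        uint64_t sample1 = 1000, sample2 = 1512; /* 512 units in 10ms */
        uint32_t upscaling_factor = 65536;       /* unit -> bytes     */
        uint32_t bandwidth;

        /* x100: 10ms window -> per second; /1024: bytes -> KB */
        bandwidth = (sample2 - sample1) * 100 * upscaling_factor / 1024;
        printf("%u KB/s\n", bandwidth);  /* 3276800 KB/s, ~3.1 GB/s */
        return 0;
    }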

___
Xen-devel mailing list
Xen-devel@lists.xen.org
http://lists.xen.org/xen-devel