[GitHub] cjolivier01 commented on issue #9545: Profiling discussion

2018-02-05 Thread GitBox
cjolivier01 commented on issue #9545: Profiling discussion
URL: 
https://github.com/apache/incubator-mxnet/issues/9545#issuecomment-363178424
 
 
   I added a place for associated performance-related JIRA tasks: 
   
   
https://issues.apache.org/jira/secure/RapidBoard.jspa?rapidView=211=MXNET=planning.nodetail=visible=MXNET-10


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] cjolivier01 commented on issue #9545: Profiling discussion

2018-02-05 Thread GitBox
cjolivier01 commented on issue #9545: Profiling discussion
URL: 
https://github.com/apache/incubator-mxnet/issues/9545#issuecomment-363168613
 
 
   If you have better code to determine the number of "real" cores, I will use 
it.
   Right now, we use Intel OMP (at least when building with cmake on Linux), and 
it reports that it has bound the first N/2 threads to the respective "real" 
cores.
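   One possible approach (a sketch for discussion only, not MXNet code; the
   function name is mine) would be to count unique package/core pairs rather
   than assume a 2-way hyperthreading split. On Linux the pairs could be read
   from /sys/devices/system/cpu/cpuN/topology/{physical_package_id,core_id};
   the counting logic is shown as a pure function over already-read values:
   ```cpp
   #include <set>
   #include <string>
   #include <utility>
   #include <vector>

   // Count physical cores as the number of distinct
   // (physical_package_id, core_id) pairs; with 2-way SMT each pair
   // appears twice (once per logical CPU).
   int CountPhysicalCores(const std::vector<std::pair<int, int>>& pkg_core_ids) {
     std::set<std::pair<int, int>> uniq(pkg_core_ids.begin(), pkg_core_ids.end());
     return static_cast<int>(uniq.size());
   }
   ```
   Unlike the N/2 heuristic, this gives the right answer on machines without
   hyperthreading (each pair appears once) and on asymmetric topologies.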




[GitHub] cjolivier01 commented on issue #9545: Profiling discussion

2018-01-30 Thread GitBox
cjolivier01 commented on issue #9545: Profiling discussion
URL: 
https://github.com/apache/incubator-mxnet/issues/9545#issuecomment-361647216
 
 
   Note the `OpenMP` class in src/engine/openmp.cc:
   ```cpp
   OpenMP::OpenMP()
     : omp_num_threads_set_in_environment(is_env_set("OMP_NUM_THREADS")) {
   #ifdef _OPENMP
     const int max = dmlc::GetEnv("MXNET_OMP_MAX_THREADS", INT_MIN);
     if (max != INT_MIN) {
       omp_thread_max_ = max;
     } else {
       if (!omp_num_threads_set_in_environment) {
         omp_thread_max_ = omp_get_num_procs();
   #ifdef ARCH_IS_INTEL_X86
         omp_thread_max_ >>= 1;  // assume 2-way hyperthreading: use physical cores only
   #endif
         omp_set_num_threads(omp_thread_max_);
       } else {
         omp_thread_max_ = omp_get_max_threads();
       }
     }
   #else
     enabled_ = false;
     omp_thread_max_ = 1;
   #endif
   }
   ```
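   For discussion, that branch logic can be restated as a pure function over the
   values the constructor reads (an illustrative sketch, not part of the MXNet
   source; the function and parameter names are mine):
   ```cpp
   #include <climits>

   // Mirror of the decision tree in OpenMP::OpenMP(). The real constructor
   // additionally calls omp_set_num_threads() in the auto-detect branch.
   int ChooseOmpThreadMax(int mxnet_omp_max_threads,    // MXNET_OMP_MAX_THREADS, or INT_MIN if unset
                          bool omp_num_threads_in_env,  // is OMP_NUM_THREADS set?
                          int num_procs,                // omp_get_num_procs()
                          int omp_max_threads,          // omp_get_max_threads()
                          bool intel_x86) {             // ARCH_IS_INTEL_X86
     if (mxnet_omp_max_threads != INT_MIN)
       return mxnet_omp_max_threads;       // explicit MXNet override wins
     if (!omp_num_threads_in_env)
       return intel_x86 ? num_procs >> 1   // assume 2-way hyperthreading
                        : num_procs;
     return omp_max_threads;               // respect OMP_NUM_THREADS
   }
   ```
   So the N/2 behavior only kicks in when neither environment variable is set
   and the build targets Intel x86.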




[GitHub] cjolivier01 commented on issue #9545: Profiling discussion

2018-01-30 Thread GitBox
cjolivier01 commented on issue #9545: Profiling discussion
URL: 
https://github.com/apache/incubator-mxnet/issues/9545#issuecomment-361646815
 
 
   "I think, for mordern deep learning workloads and stencil computation, it 
would be better to keep a worker thread pool and bind one worker thread to one 
dedicated core. So the number of worker threads may be same with the number of 
physical cores on machine"
   
   What I have noticed with Intel OMP is that for some N number of "virtual 
cores", the first N/2 OMP threads are allocated on on each "real" core and then 
a second pass over the cores for the second N/2 threads.
   Currently, we set OMP thread count to N/2, so that OMP threads run 
one-per-core for the first omp parallelism level (although operator parallelism 
and omp nesting can change this).
   
   Have you seen behaviour inconsistent with this?
   




[GitHub] cjolivier01 commented on issue #9545: Profiling discussion

2018-01-30 Thread GitBox
cjolivier01 commented on issue #9545: Profiling discussion
URL: 
https://github.com/apache/incubator-mxnet/issues/9545#issuecomment-361645985
 
 
   " if there is no operator parallelism, maybe we should give as many threads 
as possible to the running operator, especially for mkldnn operator"
   
   It is my understanding that this is the current behavior, is it not?  
Certainly there is no code that starves MKL operators of threads, no matter 
what is going on with the tuning code.




[GitHub] cjolivier01 commented on issue #9545: Profiling discussion

2018-01-30 Thread GitBox
cjolivier01 commented on issue #9545: Profiling discussion
URL: 
https://github.com/apache/incubator-mxnet/issues/9545#issuecomment-361645446
 
 
   "After set USE_OPERATOR_TUNING=0 and rebuilt mxnet, the first 56 threads 
were not there anymore. Do you think this behaviour is as expectaton?"
   WHat do you mean by "not there anymore"?




[GitHub] cjolivier01 commented on issue #9545: Profiling discussion

2018-01-27 Thread GitBox
cjolivier01 commented on issue #9545: Profiling discussion
URL: 
https://github.com/apache/incubator-mxnet/issues/9545#issuecomment-360995979
 
 
   can you explicitly list the attributes here please?




[GitHub] cjolivier01 commented on issue #9545: Profiling discussion

2018-01-25 Thread GitBox
cjolivier01 commented on issue #9545: Profiling discussion
URL: 
https://github.com/apache/incubator-mxnet/issues/9545#issuecomment-360684117
 
 
   So far I've tried to make the profiling have as close to zero impact as
   possible on runtime performance. I'll need to think about how to get
   those other characteristics in an elegant, abstract way while still
   keeping the performance impact low (i.e. avoiding the Schrödinger's-cat
   problem, where measuring the system perturbs it).
   
   On Thu, Jan 25, 2018 at 8:22 PM PatricZhao wrote:
   
   > @cjolivier01 yes, looks good :)
   > One minor suggestion is to show the different instances of the Class/OP,
   > because their performance will be very different. I think you already have
   > these data :)
   >
   > Attached is the Theano profiling file theano-profile.log.
   >
   > At the class level, the total runtime is aggregated, e.g. for convolution:
   >
   > <% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Class name>
   > 15.8%    80.4%   64.328s      3.06e-02s       C      2100    3        theano.sandbox.mkl.mkl_conv.Conv2D
   >
   > At the OP level, the convolution is separated by different shapes and input
   > parameters:
   >
   > <% time> <sum %> <apply time> <time per call> <type> <#call> <#apply> <Op name>
   > 10.0%    61.0%   40.886s      5.84e-02s       C      700     1        Conv2D{imshp=(None, None, None, None), kshp=(64, 128, 1, 16), border_mode='valid', subsample=(1, 1), filter_flip=True, filter_dilation=(1, 1)}
   >  4.7%    76.5%   19.297s      2.76e-02s       C      700     1        Conv2D{imshp=(None, None, None, None), kshp=(128, 10, 16, 1), border_mode='valid', subsample=(1, 1), filter_flip=True, filter_dilation=(1, 1)}
   >  1.0%    97.5%    4.145s      5.92e-03s       C      700     1        Conv2D{imshp=(None, None, None, None), kshp=(64, 64, 1, 16), border_mode='valid', subsample=(1, 1), filter_flip=True, filter_dilation=(1, 1)}
   >
   > Then, Theano also gives the runtime and inputs for each instance of the
   > convolution OPs:
   >
   > <% time> <sum %> <apply time> <time per call> <#call> <id> <Apply name>
   >  9.3%    56.5%   40.886s      5.84e-02s       700    39   Conv2D{imshp=(None, None, None, None), kshp=(64, 128, 1, 16), border_mode='valid', subsample=(1, 1), filter_flip=True, filter_dilation=(1, 1)}(Relu{slope=0.0}.0, convolution2d_2_W, convolution2d_2_b)
   >
   > Regarding MKL output, we can take a look at how to add it to your profile
   > reports when the new MKL-DNN backend is merged.
   




[GitHub] cjolivier01 commented on issue #9545: Profiling discussion

2018-01-25 Thread GitBox
cjolivier01 commented on issue #9545: Profiling discussion
URL: 
https://github.com/apache/incubator-mxnet/issues/9545#issuecomment-360680487
 
 
   Question:
   
   _"Operator tuning: I saw you have contributed to the operator tuning of 
mxnet. Seems it will create many omp threads before executing graph. How does 
this function affect the performance of mxnet? If I have set cpu affinity in my 
environment, these omp threads will be binded to each core. Then when executing 
computation graph, many other omp threads will be created and binded again. Do 
you think that will impact the performance?"_
   
   @cjolivier01:
   
   So yes, the graph executor itself runs a couple of threads that execute 
the operators, which in turn tend to fork into many OMP threads for some parts 
of the operation (in some cases).  Operator tuning actually suppresses this for 
cases where the overhead of forking and joining the OMP threads takes longer 
than it would have taken to just do the whole operation on a single core.  This 
is much more pronounced on non-Intel OMP libraries (i.e. libgomp).
   
   But anyway, your question is whether binding the execution thread to a core 
causes the bound OMP threads to be slower due to caching concerns?  I'd answer 
"maybe, but without data to the contrary, I assume that since the execution 
thread tends to be stalled while the operator is running its OMP threads, it's 
worse to reserve a whole core that will sit idle while N-1 OMP threads run".
   
   ^^ This assumes multiple operators are not running in parallel, which is a 
whole other discussion as it relates to core binding.
   
   I am by no means steadfast in this opinion, but it is an assumption at this 
point.  Have you known this to be the case with other frameworks or experiments?
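   The fork/join-overhead argument amounts to a simple cost model. A
   hypothetical sketch (illustrative names and thresholds, not MXNet's actual
   tuning code): parallelize only when the estimated serial time exceeds the
   measured OMP fork/join cost.
   ```cpp
   #include <cstddef>

   // Decide whether an elementwise operation over n_elements is worth
   // running under OMP. Both costs would come from calibration runs.
   bool ShouldUseOmp(std::size_t n_elements,
                     double ns_per_element,    // measured serial per-element cost
                     double omp_overhead_ns) { // measured fork/join overhead
     return static_cast<double>(n_elements) * ns_per_element > omp_overhead_ns;
   }
   ```
   With libgomp the measured `omp_overhead_ns` would typically be larger than
   with Intel OMP, so more small operators fall back to serial execution,
   consistent with the observation above.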
   
   
   
   





[GitHub] cjolivier01 commented on issue #9545: Profiling discussion

2018-01-25 Thread GitBox
cjolivier01 commented on issue #9545: Profiling discussion
URL: 
https://github.com/apache/incubator-mxnet/issues/9545#issuecomment-360645754
 
 
   I am not yet sure how to incorporate the other MKL output, or where I would 
get it at runtime.




[GitHub] cjolivier01 commented on issue #9545: Profiling discussion

2018-01-25 Thread GitBox
cjolivier01 commented on issue #9545: Profiling discussion
URL: 
https://github.com/apache/incubator-mxnet/issues/9545#issuecomment-360644822
 
 
   @TaoLv Some aggregate output added today (see screenshot below)
   Not all of that yet...
   




[GitHub] cjolivier01 commented on issue #9545: Profiling discussion

2018-01-25 Thread GitBox
cjolivier01 commented on issue #9545: Profiling discussion
URL: 
https://github.com/apache/incubator-mxnet/issues/9545#issuecomment-360645371
 
 
   Like this?




[GitHub] cjolivier01 commented on issue #9545: Profiling discussion

2018-01-25 Thread GitBox
cjolivier01 commented on issue #9545: Profiling discussion
URL: 
https://github.com/apache/incubator-mxnet/issues/9545#issuecomment-360644943
 
 
   You can dump those stats and reset every iteration, I suppose...




[GitHub] cjolivier01 commented on issue #9545: Profiling discussion

2018-01-25 Thread GitBox
cjolivier01 commented on issue #9545: Profiling discussion
URL: 
https://github.com/apache/incubator-mxnet/issues/9545#issuecomment-360645274
 
 
   ![screenshot from 2018-01-25 
15-55-15](https://user-images.githubusercontent.com/11234557/35419241-b678695e-01eb-11e8-8907-abd3f6ff57f7.png)
   




