[GitHub] [incubator-mxnet] ChaiBapchya edited a comment on issue #17980: When compiled with MKL, fully_connected calls DNNL while dot and batch_dot call MKL

2020-05-15 Thread GitBox


ChaiBapchya edited a comment on issue #17980:
URL: https://github.com/apache/incubator-mxnet/issues/17980#issuecomment-629385677


   > Tested with MXNet [cfb474b](https://github.com/apache/incubator-mxnet/commit/cfb474ba743d5ea85161bf19875488f4cb409d3c). Compiled with mostly-default cmake settings:
   > 
   > ```shell
   > cmake -GNinja -DUSE_CUDA=OFF -DCMAKE_BUILD_TYPE=Release ..
   > ```
   > 
   > Then when I run
   > 
   > ```
   > export MKL_VERBOSE=1
   > export MKLDNN_VERBOSE=1
   > python3
   > Python 3.6.9 (default, Nov  7 2019, 10:44:02) 
   > [GCC 8.3.0] on linux
   > Type "help", "copyright", "credits" or "license" for more information.
   > >>> import mxnet as mx
   > Numpy + Intel(R) MKL: THREADING LAYER: (null)
   > Numpy + Intel(R) MKL: setting Intel(R) MKL to use INTEL OpenMP runtime
   > Numpy + Intel(R) MKL: preloading libiomp5.so runtime
   > ```
   
   @kpuatamazon @kpu
   Running on Ubuntu 18.04 [which doesn't have MKL installed by default] with the default cmake config doesn't use MKL as BLAS, so the `MKL_VERBOSE` output shown above never appears.
   
   Thus, for the Ubuntu 18.04 base AMI, one has to install MKL in /opt/intel and update the cmake command to
   ```
   cmake -GNinja -DUSE_CUDA=OFF -DCMAKE_BUILD_TYPE=Release -DUSE_BLAS=mkl ..
   ```
   I found this build does use MKL as BLAS, and `export MKL_VERBOSE=1` confirms it.
   
   With this addition to both builds [default & workaround], I reran opperf and didn't see much performance difference.
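   For context, opperf reports per-operator latencies. A minimal sketch of how such a measurement works (this is an illustrative stand-in, not the actual opperf code; it uses NumPy's matmul in place of MXNet's `dot`, and `time_op` is a hypothetical helper name):

   ```python
   import time
   import numpy as np

   def time_op(fn, warmup=10, runs=25):
       """Median latency of fn() in milliseconds, opperf-style: warm up first,
       then time several runs and take the median to reduce noise."""
       for _ in range(warmup):
           fn()
       samples = []
       for _ in range(runs):
           t0 = time.perf_counter()
           fn()
           samples.append((time.perf_counter() - t0) * 1e3)
       samples.sort()
       return samples[len(samples) // 2]

   # Stand-in for mx.nd.dot on (512, 512) float32 operands.
   a = np.random.rand(512, 512).astype(np.float32)
   b = np.random.rand(512, 512).astype(np.float32)
   print(f"dot(512x512): {time_op(lambda: a @ b):.4f} ms")
   ```

   Run under `MKL_VERBOSE=1`, each timed call would also emit the corresponding SGEMM log line if MKL is the active BLAS.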
   
   ## Commands
   Default
   ```
   cmake -GNinja -DUSE_CUDA=OFF -DCMAKE_BUILD_TYPE=Release -DUSE_BLAS=mkl ..
   ```
   
   Workaround
   ```
   export CXXFLAGS="${CXXFLAGS} -DUSE_MKL -I/opt/intel/mkl/include"
   cmake -GNinja -DUSE_CUDA=OFF -DCMAKE_BUILD_TYPE=Release -DUSE_BLAS=mkl ..
   ```
   
   ## Logs
   Default
   Batch_dot
   ```
   MKL_VERBOSE SGEMM_BATCH(T,N,0x7fdafe19ecec,0x7fdafe19ecf0,0x7fdafe19ecf4,0x7fdafe19ed04,0x7fda6001dfd0,0x7fdafe19ecf8,0x7fda6001e490,0x7fdafe19ecfc,0x7fdafe19ed08,0x3fd7ec0,0x7fdafe19ed00,0x7fdafe19ec28,0x7fdafe 28.71ms CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:16
   MKL_VERBOSE SGEMM_BATCH(T,N,0x7fdafe19ecec,0x7fdafe19ecf0,0x7fdafe19ecf4,0x7fdafe19ed04,0x7fda6001dfd0,0x7fdafe19ecf8,0x7fda6001e490,0x7fdafe19ecfc,0x7fdafe19ed08,0x3fd7ec0,0x7fdafe19ed00,0x7fdafe19ec28,0x7fdafe 28.53ms CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:16
   ```
   FC
   ```
   dnnl_verbose,exec,cpu,inner_product,gemm:jit,forward_inference,src_f32::blocked:ab:f0 wei_f32::blocked:ab:f0 bia_undef::undef::f0 dst_f32::blocked:ab:f0,,,mb5ic2048oc512,0.0551758
   dnnl_verbose,exec,cpu,inner_product,gemm:jit,forward_inference,src_f32::blocked:ab:f0 wei_f32::blocked:ab:f0 bia_undef::undef::f0 dst_f32::blocked:ab:f0,,,mb5ic2048oc512,0.0559082
   ```
   
   Workaround
   Batch_dot [same as default]
   ```
   MKL_VERBOSE SGEMM_BATCH(T,N,0x7f985b78acec,0x7f985b78acf0,0x7f985b78acf4,0x7f985b78ad04,0x7f97b4016cd0,0x7f985b78acf8,0x7f97b401e550,0x7f985b78acfc,0x7f985b78ad08,0x26f2890,0x7f985b78ad00,0x7f985b78ac28,0x7f985b 28.72ms CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:16
   MKL_VERBOSE SGEMM_BATCH(T,N,0x7f985b78acec,0x7f985b78acf0,0x7f985b78acf4,0x7f985b78ad04,0x7f97b4016cd0,0x7f985b78acf8,0x7f97b401e550,0x7f985b78acfc,0x7f985b78ad08,0x26f2890,0x7f985b78ad00,0x7f985b78ac28,0x7f985b 28.77ms CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:16
   ```
   FC [additional MKL_VERBOSE before DNNL_VERBOSE]
   ```
   MKL_VERBOSE SGEMM(T,N,512,4,512,0x7f985b789c28,0x7f97b5e52e80,512,0x7f97b401e600,512,0x7f985b789c30,0x7f976e89dd80,512) 39.68us CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:16
   dnnl_verbose,exec,cpu,inner_product,gemm:blas,forward_inference,src_f32::blocked:ab:f0 wei_f32::blocked:ab:f0 bia_undef::undef::f0 dst_f32::blocked:ab:f0,,,mb4ic512oc512,0.0769043
   MKL_VERBOSE SGEMM(T,N,512,5,2048,0x7f985b789c28,0x7f976c400100,2048,0x7f976e887100,2048,0x7f985b789c30,0x7f976e8da000,512) 79.41us CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:16
   dnnl_verbose,exec,cpu,inner_product,gemm:blas,forward_inference,src_f32::blocked:ab:f0 wei_f32::blocked:ab:f0 bia_undef::undef::f0 dst_f32::blocked:ab:f0,,,mb5ic2048oc512,0.11377
   ```
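   When comparing backends, it helps to reduce such logs to (backend, kernel, milliseconds) tuples. A rough parser sketch for the two verbose formats above (the regexes are my own approximation of the log lines shown here, not an official format specification):

   ```python
   import re

   # MKL verbose lines look like: "MKL_VERBOSE SGEMM(T,N,512,...) 39.68us CNR:OFF ..."
   MKL_RE = re.compile(r"MKL_VERBOSE\s+(\w+)\(.*?\)\s+([\d.]+)(us|ms)")
   # DNNL verbose lines are comma-separated and end with the elapsed time in ms.
   DNNL_RE = re.compile(r"dnnl_verbose,exec,cpu,([^,]+),([^,]+),.*,([\d.]+)$")

   def parse_line(line):
       """Return (backend, kernel, elapsed_ms), or None if the line doesn't match."""
       m = MKL_RE.search(line)
       if m:
           name, val, unit = m.groups()
           ms = float(val) / 1000.0 if unit == "us" else float(val)
           return ("mkl", name, ms)
       m = DNNL_RE.search(line)
       if m:
           prim, impl, ms = m.groups()
           return ("dnnl", f"{prim}:{impl}", float(ms))
       return None
   ```

   For example, the last FC log line above parses to `("dnnl", "inner_product:gemm:blas", 0.11377)`, which makes the MKL-vs-DNNL dispatch visible at a glance.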
   ## Results
   | Operator   | LHS            | RHS            | MKL Default | MKL Workaround |
   | ---------- | -------------- | -------------- | ----------- | -------------- |
   | Dot        | (4, 512, 512)  | (4, 512, 512)  | 4.1112      | 4.8241         |
   |            | (5, 512, 512)  | (5, 512, 512)  | 6.4421      | 7.607          |
   |            | (5, 512, 1536) | (5, 512, 1536) | 20.3648     | 19.2217        |
   |            | (5, 512, 2048) | (5, 512, 2048) | 23.3236     | 23.2849        |
   |            | (5, 2048, 512) | (5, 2048, 512) | 123.1235    | 123.9806       |
   | Batch_dot  | (4, 512, 512)  | (4, 512, 512)  | 1.4105      | 1.407          |
   |            | (5, 512, 512)  | (5, 512, 512)  | 1.7558      | 1.7511         |
   |            | (5, 512, 1536) | (5, 512, 1536) | 6.5931      | 6.5585         |
   |            | (5, 512, 2048) | (5, 512, 2048) | 9.1452      | 9.1031         |
   |            | (5, 2048, 512) | (5, 2048, 512) | 29.0192     | 28.9236        |
   
   | Operator   | Data      | Weight      | MKL Default | MKL Workaround |
   | ---------- | --------- | ----------- | ----------- | -------------- |
   | FC         | (4, 512)  | (512, 512)  | 0.057       | 0.0685         |
   |            | (5, 512)  | (512, 512)  | 0.0591      | 0.0698         |
   |            | (5, 512)  | (1536, 512) | 0.0823      | 0.0939         |
   |            | (5, 512)  | (2048, 512) | 0.0916      | 0.1026         |
   |            | (5, 2048) | (512, 2048) | 0.1146      | 0.1267         |


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [incubator-mxnet] ChaiBapchya edited a comment on issue #17980: When compiled with MKL, fully_connected calls DNNL while dot and batch_dot call MKL

2020-05-15 Thread GitBox


ChaiBapchya edited a comment on issue #17980:
URL: https://github.com/apache/incubator-mxnet/issues/17980#issuecomment-623803913


   > In case somebody finds this issue and wants their optimized build, here is a different workaround that removes the need for `LD_PRELOAD`. Just do this before running cmake the first time:
   > 
   > ```shell
   > export CXXFLAGS="${CXXFLAGS} -DUSE_MKL -I/opt/intel/mkl/include"
   > ```
   > 
   > Then `cmake` can be run normally:
   > 
   > ```shell
   > cmake -GNinja -DUSE_CUDA=OFF -DCMAKE_BUILD_TYPE=Release ..
   > ```
   > 
   > and the compiled MXNet can be run normally without any special environment variables.
   
   @kpuatamazon Hi, I was trying to benchmark MKL [default] vs the workaround using opperf.
   Despite ensuring MKL is installed and using the `export CXXFLAGS` line followed by the usual cmake command, the build failed with
   ```
   gemm.cpp:(.text+0xb45): undefined reference to `cblas_gemm_s8u8s32'
   ```
   
   I tried the undocumented abominable kludge option you mentioned, and that worked smoothly.
   ```
   export LD_PRELOAD=/opt/intel/mkl/lib/intel64/libmkl_rt.so
   rm -rf build/
   mkdir -p build && cd build
   cmake -GNinja -DUSE_CUDA=OFF -DCMAKE_BUILD_TYPE=Release -D_DNNL_USE_MKL=FULL -DMKLINC=/opt/intel/mkl/include ..
   cmake --build . --parallel 1024
   ```
   
   Script for OpPerf: https://gist.github.com/ChaiBapchya/5f2342f75ddeb1e21f14acac665c76ad
   
   Results
   | Operator   | LHS            | RHS            | MKL Default | MKL Workaround |
   | ---------- | -------------- | -------------- | ----------- | -------------- |
   | Dot        | (4, 512, 512)  | (4, 512, 512)  | 15.1122     | 4.1254         |
   |            | (5, 512, 512)  | (5, 512, 512)  | 38.1678     | 7.5323         |
   |            | (5, 512, 1536) | (5, 512, 1536) | 21.6601     | 19.2503        |
   |            | (5, 512, 2048) | (5, 512, 2048) | 29.0369     | 23.7432        |
   |            | (5, 2048, 512) | (5, 2048, 512) | 167.5528    | 129.9957       |
   | Batch_dot  | (4, 512, 512)  | (4, 512, 512)  | 1.7898      | 1.5445         |
   |            | (5, 512, 512)  | (5, 512, 512)  | 2.2457      | 1.9361         |
   |            | (5, 512, 1536) | (5, 512, 1536) | 6.1453      | 5.4034         |
   |            | (5, 512, 2048) | (5, 512, 2048) | 8.246       | 8.0442         |
   |            | (5, 2048, 512) | (5, 2048, 512) | 160.6243    | 29.0772        |
   |            | **Data**       | **Weight**     |             |                |
   | FC         | (4, 512)       | (512, 512)     | 0.0609      | 0.068          |
   |            | (5, 512)       | (512, 512)     | 0.0633      | 0.0731         |
   |            | (5, 512)       | (1536, 512)    | 0.0916      | 0.0996         |
   |            | (5, 512)       | (2048, 512)    | 0.1081      | 0.1084         |
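   To put the table in perspective, the workaround's gain on the worst cases can be computed directly from the rows above (a quick sketch; the latencies are copied from the table, and the dictionary names are mine):

   ```python
   # Latencies (MKL Default vs. LD_PRELOAD workaround) copied from the table above.
   default_ms = {
       "Dot (4, 512, 512)": 15.1122,
       "Batch_dot (5, 2048, 512)": 160.6243,
   }
   workaround_ms = {
       "Dot (4, 512, 512)": 4.1254,
       "Batch_dot (5, 2048, 512)": 29.0772,
   }

   # Speedup ratio: how many times faster the workaround build is.
   speedups = {op: default_ms[op] / workaround_ms[op] for op in default_ms}
   for op, s in speedups.items():
       print(f"{op}: workaround is {s:.2f}x faster")
   ```

   So dot and batch_dot are roughly 3.7x and 5.5x faster with the workaround on these shapes, while the FC rows are nearly unchanged.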
   
   However @kpuatamazon, when I try to go back to the default [i.e. default -> workaround -> default] by unsetting the LD_PRELOAD environment variable, the default build failed with ``gemm.cpp:(.text+0xe6b): undefined reference to `cblas_gemm_s8u8s32'``


