jinhuang415 commented on issue #10433: [MXNET-290] MKLDNN support for model quantization
URL: https://github.com/apache/incubator-mxnet/pull/10433#issuecomment-397143625

@marcoabreu @zheng-da Thanks for the approval! Regarding how to check for the different backend contexts: if we don't agree on exposing this as an API to users, how about keeping the current change? The context alone is not enough, since we need to differentiate MKLDNN from native CPU, and both belong to the CPU context. Another approach at the Python level to separate MKLDNN from native CPU is to check `/proc/<pid>/maps` for the relevant libraries mapped into the process: if MKLDNN is built in, libmkldnn.so will be mapped into the process's address space, and the same applies to cudnn/cuda. But this approach only works on Linux (it relies on the proc file system), so running nosetests under macOS or Windows would not work; a sketch of the check is below. Anyway, if you think we are fine keeping the current change, how about continuing to merge the PR? Thanks!
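For illustration, a minimal sketch of the `/proc/<pid>/maps` check described above (the function name and return values are assumptions, and as noted it is Linux-only):

```python
# Hedged sketch: infer the active backend from the shared libraries
# mapped into the current process. Relies on /proc, so Linux-only.
def detect_backend():
    try:
        with open('/proc/self/maps') as f:
            maps = f.read()
    except OSError:
        return 'unknown'  # no procfs available (e.g. macOS, Windows)
    if 'libmkldnn' in maps:
        return 'cpu_mkldnn'
    if 'libcudnn' in maps or 'libcuda' in maps:
        return 'gpu'
    return 'cpu_native'
```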
jinhuang415 commented on issue #10433: [MXNET-290] MKLDNN support for model quantization
URL: https://github.com/apache/incubator-mxnet/pull/10433#issuecomment-396109115

@marcoabreu @zheng-da A kind reminder. Would you help to review the latest diff? Thanks.
jinhuang415 commented on issue #10433: [MXNET-290] MKLDNN support for model quantization
URL: https://github.com/apache/incubator-mxnet/pull/10433#issuecomment-395630010

@marcoabreu I used my proposed approach 1 to test all cases, skipping and logging the cases that are expected to be skipped. Please help take a look. Thanks.
jinhuang415 commented on issue #10433: [MXNET-290] MKLDNN support for model quantization
URL: https://github.com/apache/incubator-mxnet/pull/10433#issuecomment-395478506

@marcoabreu I tried the "try ... except ..." pattern below to test native CPU, MKLDNN, and GPU:

```
for dtype in ['int8', 'uint8']:
    try:
        run_test(..., dtype)
    except Exception:
        pass  # ignore unsupported combinations
```

But for the OPs not supported by native CPU, the test hits a segmentation fault that the exception handler cannot catch, so the whole test aborts and no further cases run. I think the MXNet framework doesn't handle unsupported OPs very well when an OP registers neither FCompute nor FComputeEx (native CPU registers neither for quantized conv/pooling/fc). There is also the case of testing an unsupported dtype on a backend (like int8 on MKLDNN or uint8 on GPU): the C++ parameter check fails and throws a runtime error. We can of course catch and ignore that exception, but it is a generic error rather than a specific OperatorMissingError. Changing the MXNet C++ code to throw a specific exception for each unsupported case is not trivial (a proper exception thrower would have to be added for all of them), so how about one of the two approaches below, which make life a little easier?

Approach 1: add a check at the top of run_test(), so each skipped case is logged explicitly and everything else is expected to run and pass:

```
def run_test(..., dtype):
    if context == cpu_mkldnn:
        if dtype == 'int8':
            logging.info('skipping due to unsupported dtype %s', dtype)
            return
    elif context == cpu_native:
        ...
    # test case implementation follows

for dtype in ['int8', 'uint8']:
    run_test(..., dtype)
```

This approach runs all cases and skips the unsupported ones visibly, so users can see them; the cases that are not skipped are expected to pass.

Approach 2: explicitly list the dtypes each context supports and only run the supported scenarios:

```
def run_test(..., dtype):
    # test case implementation; no backend info needed here
    ...

if context == cpu_native:
    dtype_list = []          # empty means this quantized OP is unsupported
elif context == cpu_mkldnn:
    dtype_list = ['uint8']
elif context == gpu:
    dtype_list = ['int8']

for dtype in dtype_list:
    run_test(..., dtype)
```

This approach only tests the supported scenarios, and since the backend info stays out of the test case implementation (run_test()), the implementation stays clean.

Please let us know your comments.
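As a side note, a minimal sketch of how approach 1 could surface skips to the test runner via `unittest.SkipTest` instead of a plain return; the capability table and names below are assumptions, not from the PR:

```python
import unittest

# Assumed dtype support per backend, following the discussion above.
SUPPORTED_DTYPES = {
    'cpu_native': [],        # quantized conv/pooling/fc not supported at all
    'cpu_mkldnn': ['uint8'],
    'gpu':        ['int8'],
}

def check_supported(context, dtype):
    # Raising SkipTest makes nose/unittest report the case as skipped
    # instead of silently passing.
    if dtype not in SUPPORTED_DTYPES.get(context, []):
        raise unittest.SkipTest('dtype %s not supported on %s' % (dtype, context))
```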
jinhuang415 commented on issue #10433: [MXNET-290] MKLDNN support for model quantization
URL: https://github.com/apache/incubator-mxnet/pull/10433#issuecomment-395122305

@marcoabreu Please let me know if I misunderstood your points, and please also let us know if there is a better approach to handle such differences among the native CPU/MKLDNN/GPU backends (the dtype differences and the differences in supported quantized OPs).
jinhuang415 commented on issue #10433: [MXNET-290] MKLDNN support for model quantization
URL: https://github.com/apache/incubator-mxnet/pull/10433#issuecomment-395100685

@marcoabreu Sorry, maybe I didn't explain this very clearly. For the current quantization test, besides the dtype difference between MKLDNN (which uses uint8) and GPU (which uses int8), there is another difference between native CPU and MKLDNN/GPU: native CPU doesn't support the quantized OPs yet (no quantized conv/pooling/fc; it only supports quantize/dequantize/requantize), while MKLDNN supports quantized conv/pooling and GPU supports quantized conv/pooling/fc. The test will fail for an unsupported OP (unless we return directly in the unsupported case, but that requires the backend info). So in order to share the same test code among native CPU/MKLDNN/GPU, it looks like we still need to consider the backend type; the support matrix is sketched below.
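For illustration, the OP support just described, written out as a table that shared test code could key on (a sketch; the dictionary and names are assumptions, not an MXNet API):

```python
# Assumed quantized-OP support per backend, following the description above.
QUANTIZED_OP_SUPPORT = {
    'cpu_native': {'quantize', 'dequantize', 'requantize'},
    'cpu_mkldnn': {'quantize', 'dequantize', 'requantize',
                   'quantized_conv', 'quantized_pooling'},
    'gpu':        {'quantize', 'dequantize', 'requantize',
                   'quantized_conv', 'quantized_pooling',
                   'quantized_fully_connected'},
}

def op_supported(context, op_name):
    return op_name in QUANTIZED_OP_SUPPORT.get(context, set())
```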
jinhuang415 commented on issue #10433: [MXNET-290] MKLDNN support for model quantization
URL: https://github.com/apache/incubator-mxnet/pull/10433#issuecomment-393739496

@marcoabreu This PR has been approved by reminisce, and @zheng-da will finish his review very soon. Would you help take a look at the PR before it's merged (as you suggested about a month ago :))? Thanks!
jinhuang415 commented on issue #10433: [MXNET-290] MKLDNN support for model quantization
URL: https://github.com/apache/incubator-mxnet/pull/10433#issuecomment-393196924

@reminisce @zheng-da We have resolved all the comments; would you check whether you have any further comments on the change?

@marcoabreu @zheng-da We have seen a lot of Jenkins failures recently after submitting new changes, and most of them happen in the CPP:GPU unit test (see http://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/incubator-mxnet/detail/PR-10433/51/pipeline/726/). Everything passes on our local GPU, and re-triggering Jenkins sometimes helps (it is not stable; sometimes several re-triggers are needed to pass). Our change should not affect CPP:GPU testing, so would you help check whether this is a known issue in the Jenkins system or the MXNet base code? Or is there any way to debug the failure on Jenkins? Thanks. I copied the failure log below for your reference:

```
[14:18:36] /work/mxnet/tests/cpp/engine/threaded_engine_test.cc:133: Stopping: NaiveEngine
[14:18:36] /work/mxnet/tests/cpp/engine/threaded_engine_test.cc:135: Stopped: NaiveEngine
Starting...
[14:18:36] /work/mxnet/tests/cpp/engine/threaded_engine_test.cc:137: Started: NaiveEngine
Done...
[14:18:36] /work/mxnet/tests/cpp/engine/threaded_engine_test.cc:133: Stopping: ThreadedEnginePooled
terminate called after throwing an instance of 'std::system_error'
  what():  Operation not permitted
/work/runtime_functions.sh: line 476:     7 Aborted                 (core dumped) build/tests/mxnet_unit_tests
build.py: 2018-05-30 14:18:38,174 Running of command in container failed (134): nvidia-docker run --rm -t --shm-size=500m -v /home/jenkins_slave/workspace/ut-cpp-gpu:/work/mxnet -v /home/jenkins_slave/workspace/ut-cpp-gpu/build:/work/build -u 1001:1001 mxnet/build.ubuntu_gpu /work/runtime_functions.sh unittest_ubuntu_gpu_cpp
build.py: 2018-05-30 14:18:38,175 You can try to get into the container by using the following command:
nvidia-docker run --rm -t --shm-size=500m -v /home/jenkins_slave/workspace/ut-cpp-gpu:/work/mxnet -v /home/jenkins_slave/workspace/ut-cpp-gpu/build:/work/build -u 1001:1001 -ti --entrypoint /bin/bash mxnet/build.ubuntu_gpu /work/runtime_functions.sh unittest_ubuntu_gpu_cpp
into container: False
Traceback (most recent call last):
  File "ci/build.py", line 307, in <module>
    sys.exit(main())
  File "ci/build.py", line 243, in main
    container_run(platform, docker_binary, shared_memory_size, command)
  File "ci/build.py", line 154, in container_run
    raise subprocess.CalledProcessError(ret, cmd)
subprocess.CalledProcessError: Command 'nvidia-docker run --rm -t --shm-size=500m -v /home/jenkins_slave/workspace/ut-cpp-gpu:/work/mxnet -v /home/jenkins_slave/workspace/ut-cpp-gpu/build:/work/build -u 1001:1001 mxnet/build.ubuntu_gpu /work/runtime_functions.sh unittest_ubuntu_gpu_cpp' returned non-zero exit status 134
script returned exit code 1
```
jinhuang415 commented on issue #10433: [MXNET-290] MKLDNN support for model quantization
URL: https://github.com/apache/incubator-mxnet/pull/10433#issuecomment-390582611

@zheng-da Currently the int8 performance is not as good as FP32. We are planning several enhancements to improve it, after which the int8 performance should be 1.2~1.5x that of FP32.
jinhuang415 commented on issue #10433: [MXNET-290] MKLDNN support for model quantization
URL: https://github.com/apache/incubator-mxnet/pull/10433#issuecomment-389866702

@reminisce @zheng-da We have resolved all the comments; would you check whether you have further comments on the current change? Thanks.
jinhuang415 commented on issue #10433: [MXNET-290] MKLDNN support for model quantization
URL: https://github.com/apache/incubator-mxnet/pull/10433#issuecomment-384533399

@zheng-da To correct my earlier statement: the mkldnn_OIhw4i16o4i change in mkldnn_base.cc is needed for int8; without it, int8 hits a runtime error (FP32 doesn't need this format, so it has no problem at runtime).
jinhuang415 commented on issue #10433: [MXNET-290] MKLDNN support for model quantization
URL: https://github.com/apache/incubator-mxnet/pull/10433#issuecomment-383627507

@marcoabreu Sure, will do.