[GitHub] [incubator-mxnet] apeforest commented on issue #15703: Storage manager / memory usage regression in v1.5
apeforest commented on issue #15703: Storage manager / memory usage regression in v1.5
URL: https://github.com/apache/incubator-mxnet/issues/15703#issuecomment-523119231

@TaoLv This is not an issue (bug per se) but a limitation of the int32_t data type we use in MXNet. As I pointed out at https://github.com/apache/incubator-mxnet/blob/master/src/operator/tensor/ordering_op-inl.h#L434, the workspace is created using a 1-D mshadow::Shape object, whose length is bounded by `index_t`, which is int32_t by default. When the required workspace size exceeds 2^31, the value overflows, causing the OOM.

This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org

With regards,
Apache Git Services
apeforest commented on issue #15703: Storage manager / memory usage regression in v1.5
URL: https://github.com/apache/incubator-mxnet/issues/15703#issuecomment-522870482

@leezu Based on the analysis above, this is not really a memory usage regression but a bug due to integer overflow. The memory space required by the topk operator in your script is 2729810175, which exceeds 2^31 - 1 (the maximum of int32_t). It did not overflow in MXNet 1.4 because int64_t was the default type for index_t there. This is therefore another case where large-integer support is needed in MXNet. Given that we plan to turn on the USE_INT64_TENSOR_SIZE flag by default in MXNet 1.6, could you use the workaround of turning on the compiler flag manually and building MXNet from source? Please let me know if this solution is acceptable before the MXNet 1.6 release. Thanks!
apeforest commented on issue #15703: Storage manager / memory usage regression in v1.5
URL: https://github.com/apache/incubator-mxnet/issues/15703#issuecomment-522780492

Root cause found: it is due to this line: https://github.com/apache/incubator-mxnet/blob/master/src/operator/tensor/ordering_op-inl.h#L434

mshadow::Shape is constructed using index_t, which is int32_t by default in MXNet 1.5. In this case, the workspace size is 3184736511, which exceeds 2^31 and hence causes an integer overflow.

Workaround: turn on the USE_INT64_TENSOR_SIZE compiler flag.

Possible fixes: 1) turn on the USE_INT64_TENSOR_SIZE flag by default in 1.6; 2) change the constructor of mshadow::Shape to always use int64_t.

Lin
apeforest commented on issue #15703: Storage manager / memory usage regression in v1.5
URL: https://github.com/apache/incubator-mxnet/issues/15703#issuecomment-522167234

Further narrowed it down to the topk operator. Some implementation of TopKImpl does not allocate the correct amount of GPU memory. Working on a PR now.
apeforest commented on issue #15703: Storage manager / memory usage regression in v1.5
URL: https://github.com/apache/incubator-mxnet/issues/15703#issuecomment-521893896

Interestingly, I found that turning on the USE_INT64_TENSOR_SIZE flag (i.e., using int64_t instead of int32_t as the index_t type) solves the OOM issue. Still root-causing it.
apeforest commented on issue #15703: Storage manager / memory usage regression in v1.5
URL: https://github.com/apache/incubator-mxnet/issues/15703#issuecomment-521361519

Sorry, I just saw this. Looking into it now.