[GitHub] [incubator-mxnet] apeforest commented on issue #15703: Storage manager / memory usage regression in v1.5

2019-08-20 Thread GitBox
apeforest commented on issue #15703: Storage manager / memory usage regression 
in v1.5
URL: 
https://github.com/apache/incubator-mxnet/issues/15703#issuecomment-523119231
 
 
   @TaoLv This is not a bug per se but a limitation of the int32_t data 
type we use in MXNet. As I pointed out, at the line 
https://github.com/apache/incubator-mxnet/blob/master/src/operator/tensor/ordering_op-inl.h#L434
 the workspace is created using a 1D mshadow::Shape object, whose length is 
bounded by `index_t`, which is int32_t by default. When the required workspace 
size exceeds 2^31 - 1, the value overflows, which causes the OOM.


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [incubator-mxnet] apeforest commented on issue #15703: Storage manager / memory usage regression in v1.5

2019-08-20 Thread GitBox
apeforest commented on issue #15703: Storage manager / memory usage regression 
in v1.5
URL: 
https://github.com/apache/incubator-mxnet/issues/15703#issuecomment-522870482
 
 
   @leezu Based on the analysis above, this is not really a memory usage 
regression but a bug due to integer overflow. The memory space required by the 
topk operator in your script is 2729810175, which exceeds 2^31 - 1 (max 
int32_t). It did not overflow in MXNet 1.4 because int64_t was used by default 
as the type for index_t. Therefore, this is another case where large integer 
support is needed in MXNet. Given that we plan to turn on the 
USE_INT64_TENSOR_SIZE flag by default in MXNet 1.6, could you use the 
workaround of turning on the compiler flag manually and building MXNet from 
source? Please let me know if this solution is acceptable until the MXNet 1.6 
release. Thanks!
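
   As a rough illustration of the range check described above (a sketch only, 
not MXNet code; the helper name `fits_signed` is made up for this example), 
the size quoted in the comment can be tested against the signed 32-bit and 
64-bit ranges:

```python
def fits_signed(value: int, bits: int) -> bool:
    """Return True if value is representable as a signed integer of the given width."""
    lower = -(2 ** (bits - 1))
    upper = 2 ** (bits - 1) - 1
    return lower <= value <= upper


# Size quoted in the comment above for the topk workspace.
required_size = 2729810175

# int32_t index_t (the MXNet 1.5 default, per the comment): does not fit.
print(fits_signed(required_size, 32))

# int64_t index_t (the type the comment says was the default in MXNet 1.4,
# or 1.5 built with USE_INT64_TENSOR_SIZE): fits.
print(fits_signed(required_size, 64))
```

   The first check fails and the second passes, which matches the observation 
that the same script runs when index_t is int64_t.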




[GitHub] [incubator-mxnet] apeforest commented on issue #15703: Storage manager / memory usage regression in v1.5

2019-08-19 Thread GitBox
apeforest commented on issue #15703: Storage manager / memory usage regression 
in v1.5
URL: 
https://github.com/apache/incubator-mxnet/issues/15703#issuecomment-522780492
 
 
   Root cause found:
   
   It is due to this line: 
https://github.com/apache/incubator-mxnet/blob/master/src/operator/tensor/ordering_op-inl.h#L434
   
   mshadow::Shape is constructed using index_t, which by default is int32_t in 
MXNet 1.5. In this case, the workspace size is 3184736511, which exceeds 
2^31 - 1 and hence causes an integer overflow.
   
   Workaround: turn on the USE_INT64_TENSOR_SIZE compiler flag
   
   Possible Fix:
   1) turn on USE_INT64_TENSOR_SIZE flag by default in 1.6
   2) change the constructor of mshadow::Shape to use int64_t always.
   
   Lin
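
   To make the overflow concrete, here is a minimal sketch (an illustration 
only, not the MXNet code path; `as_int32` is a hypothetical helper) that uses 
Python's ctypes to show what the workspace size quoted above becomes when it 
is stored in a signed 32-bit integer:

```python
import ctypes


def as_int32(value: int) -> int:
    """Keep the low 32 bits of value and reinterpret them as a signed 32-bit integer."""
    return ctypes.c_int32(value).value


INT32_MAX = 2**31 - 1

# Workspace size quoted in the comment above.
workspace_size = 3184736511

wrapped = as_int32(workspace_size)

print(workspace_size > INT32_MAX)  # True: the size does not fit in int32_t
print(wrapped)                     # -1110230785: the value wraps to a negative number
```

   A negative or otherwise wrong length passed on to an allocation is 
consistent with the failure reported in this issue, although the exact path 
from the wrapped value to the OOM is not shown here.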




[GitHub] [incubator-mxnet] apeforest commented on issue #15703: Storage manager / memory usage regression in v1.5

2019-08-16 Thread GitBox
apeforest commented on issue #15703: Storage manager / memory usage regression 
in v1.5
URL: 
https://github.com/apache/incubator-mxnet/issues/15703#issuecomment-522167234
 
 
   Further narrowed it down to the topk operator. An implementation of 
TopKImpl did not allocate the correct amount of GPU memory. Working on a PR 
now.
   




[GitHub] [incubator-mxnet] apeforest commented on issue #15703: Storage manager / memory usage regression in v1.5

2019-08-15 Thread GitBox
apeforest commented on issue #15703: Storage manager / memory usage regression 
in v1.5
URL: 
https://github.com/apache/incubator-mxnet/issues/15703#issuecomment-521893896
 
 
   Interestingly, I found that turning on the USE_INT64_TENSOR_SIZE flag 
(which makes index_t int64_t instead of int32_t) solves the OOM issue. Still 
root-causing it.




[GitHub] [incubator-mxnet] apeforest commented on issue #15703: Storage manager / memory usage regression in v1.5

2019-08-14 Thread GitBox
apeforest commented on issue #15703: Storage manager / memory usage regression 
in v1.5
URL: 
https://github.com/apache/incubator-mxnet/issues/15703#issuecomment-521361519
 
 
   Sorry, I just saw this. Looking into it now.

