[GitHub] [incubator-mxnet] ann-qin-lu commented on issue #20959: GPU memory leak when using gluon.data.DataLoader with num_workers>0

2022-03-28 Thread GitBox


ann-qin-lu commented on issue #20959:
URL: https://github.com/apache/incubator-mxnet/issues/20959#issuecomment-1081046122


   Hi @ptrendx, thanks a lot for the explanation! Now I have a much clearer picture of what is going wrong. If the actual root cause is that "CUDA does not in fact survive forking", does that mean multiprocessing with the `fork` start method should be avoided from the very beginning?
   
   Just a quick summary of the two approaches we discussed:
   
   * With the workaround that skips the cleanup for all engines, GPU resources held by the engine linger whenever the multiprocessing `fork` method is used. The proposed solution is to use `spawn` in Gluon.DataLoader (a minimal sketch of the `spawn` idea follows below). @waytrue17, could you help with this?
   * If we revert the workaround, we see the non-deterministic segfault at exit. That segfault could be resolved if the open issue for [Better handling of the engine destruction](https://github.com/apache/incubator-mxnet/issues/19379) is resolved first.
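
   For reference, here is a minimal sketch of the `spawn` idea, assuming the Gluon DataLoader picks up Python's default multiprocessing start method. This illustrates the proposal only; it is not a confirmed DataLoader option, and the dataset and sizes are placeholders:

```python
import multiprocessing as mp

import mxnet as mx
from mxnet import gluon

def main():
    # Force 'spawn' so workers start with a fresh interpreter instead of
    # inheriting the parent's already-initialized CUDA state via fork.
    # Note: with 'spawn' the dataset handed to the worker pool must be
    # picklable and may be copied into each worker rather than shared.
    mp.set_start_method('spawn', force=True)

    dataset = gluon.data.ArrayDataset(
        mx.nd.random.uniform(shape=(1000, 32)),
        mx.nd.random.uniform(shape=(1000,)))
    loader = gluon.data.DataLoader(dataset, batch_size=64, num_workers=4)
    for batch in loader:
        pass

if __name__ == '__main__':
    # The __main__ guard is required with 'spawn', since the module is
    # re-imported in each child process.
    main()
```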


[GitHub] [incubator-mxnet] ann-qin-lu commented on issue #20959: GPU memory leak when using gluon.data.DataLoader with num_workers>0

2022-03-27 Thread GitBox


ann-qin-lu commented on issue #20959:
URL: https://github.com/apache/incubator-mxnet/issues/20959#issuecomment-1080014895


   @TristonC I think your error is due to the fact that the DataLoader uses shared memory to hold the dataset. I am not sure whether using `spawn` would require copying that shared memory or not. If it does, I assume this approach would increase the total memory usage?


[GitHub] [incubator-mxnet] ann-qin-lu commented on issue #20959: GPU memory leak when using gluon.data.DataLoader with num_workers>0

2022-03-27 Thread GitBox


ann-qin-lu commented on issue #20959:
URL: https://github.com/apache/incubator-mxnet/issues/20959#issuecomment-1080014222


   Hi @waytrue17, thanks for sharing the info above. Yep, skipping the recreation of the DataLoader for each epoch does prevent the issue, but in my use case I need to shard the big dataset into smaller ones each epoch, so the DataLoader has to be created multiple times (a minimal sketch of this pattern follows below).
   I've seen a few comments (e.g. [issue 1](https://github.com/apache/incubator-mxnet/pull/19378#issuecomment-730078762), [issue 2](https://github.com/apache/incubator-mxnet/issues/19420)) that mention memory errors with this workaround [commit](https://github.com/apache/incubator-mxnet/pull/19378). Reverting this commit does resolve my accumulating GPU memory issue.
   
   Side question: Could you share more insight into how this workaround [commit](https://github.com/apache/incubator-mxnet/pull/19378/files), which skips the GPU memory cleanup in the NaiveEngine, affects the usage pattern of the DataLoader?
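
   For context, here is a minimal sketch of the per-epoch pattern described above. The `make_shard` helper is hypothetical and stands in for the real sharding logic; the dataset shapes are placeholders:

```python
import mxnet as mx
from mxnet import gluon

def make_shard(epoch, num_samples=1000, feature_dim=32):
    # Hypothetical stand-in for sharding a large dataset each epoch.
    data = mx.nd.random.uniform(shape=(num_samples, feature_dim))
    label = mx.nd.random.uniform(shape=(num_samples,))
    return gluon.data.ArrayDataset(data, label)

for epoch in range(10):
    shard = make_shard(epoch)
    # A new DataLoader (and hence a new worker pool) is created every epoch;
    # this is the pattern under which the GPU memory accumulation was observed.
    loader = gluon.data.DataLoader(shard, batch_size=64, num_workers=4)
    for batch in loader:
        pass
    del loader  # the worker pool is expected to be torn down here
```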


[GitHub] [incubator-mxnet] ann-qin-lu commented on issue #20959: GPU memory leak when using gluon.data.DataLoader with num_workers>0

2022-03-18 Thread GitBox


ann-qin-lu commented on issue #20959:
URL: https://github.com/apache/incubator-mxnet/issues/20959#issuecomment-1072926312


   @TristonC False alarm on the Cuda version. Thanks a lot for your help!


[GitHub] [incubator-mxnet] ann-qin-lu commented on issue #20959: GPU memory leak when using gluon.data.DataLoader with num_workers>0

2022-03-18 Thread GitBox


ann-qin-lu commented on issue #20959:
URL: https://github.com/apache/incubator-mxnet/issues/20959#issuecomment-1072926185


   After digging deeper, this issue is actually not caused by the CUDA upgrade from 10 to 11; it was introduced by this specific [commit: Remove cleanup on side threads](https://github.com/apache/incubator-mxnet/pull/19378), which skips CUDA deinitialization when the engine is destructed. I've confirmed that after reverting this commit, the memory leak is gone.
   
   I'll work with the MXNet team to see whether this commit should be reverted on both the MXNet master and 1.9 branches. (Another user reported a similar memory [issue](https://github.com/apache/incubator-mxnet/issues/19420) when using multiprocessing and tried to [revert](https://github.com/apache/incubator-mxnet/pull/19432) this commit.) Here is the open [issue](https://github.com/apache/incubator-mxnet/issues/19379) for better handling of the engine destruction, which needs to be addressed first if the above workaround is to be reverted.
   


[GitHub] [incubator-mxnet] ann-qin-lu commented on issue #20959: GPU memory leak when using gluon.data.DataLoader with num_workers>0 (with Cuda > 10.1)

2022-03-16 Thread GitBox


ann-qin-lu commented on issue #20959:
URL: https://github.com/apache/incubator-mxnet/issues/20959#issuecomment-1069874455


   Hi @TristonC, thanks a ton for looking into the issue. I tried the `thread_pool` option, and it did work without a memory leak. However, since the thread_pool option is slower at preparing the data, I do observe increased E2E latency (mostly during validation). My production use cases are very sensitive to training time, so we'd still like to explore the multiprocessing.Pool option (assuming the memory leak issue can be resolved soon).
   
   Do you have any hunch about which changes in CUDA/cuDNN might lead to this issue?


[GitHub] [incubator-mxnet] ann-qin-lu commented on issue #20959: GPU memory leak when using gluon.data.DataLoader with num_workers>0 (with Cuda > 10.1)

2022-03-16 Thread GitBox


ann-qin-lu commented on issue #20959:
URL: https://github.com/apache/incubator-mxnet/issues/20959#issuecomment-1069675674


   One more data point I've gathered: if I remove the logic that uses the shared memory (a.k.a. the [global _worker_dataset](https://github.com/apache/incubator-mxnet/blame/master/python/mxnet/gluon/data/dataloader.py#L421)), it also resolves the memory leak. Most likely the multiprocessing + shared memory implementation leaves behind some stale references, which keep holding GPU memory with the latest CUDA implementation.
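
   For context, the shared-memory pattern referred to above is roughly the following. This is a simplified sketch of the idea, not the exact code in python/mxnet/gluon/data/dataloader.py, and the function names are illustrative:

```python
import multiprocessing

_worker_dataset = None  # module-level reference shared with forked workers

def _worker_initializer(dataset):
    # Each worker stores a reference to the dataset in a module-level global
    # so later tasks can index into it without re-pickling the dataset.
    global _worker_dataset
    _worker_dataset = dataset

def _batch_worker_fn(samples):
    # Fetch items through the global reference set by the initializer.
    return [_worker_dataset[i] for i in samples]

def make_worker_pool(dataset, num_workers):
    # With the default 'fork' start method, the parent's state (including any
    # already-initialized CUDA context) is inherited by the workers, which is
    # where stale references can keep GPU memory alive.
    return multiprocessing.Pool(num_workers,
                                initializer=_worker_initializer,
                                initargs=[dataset])
```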


[GitHub] [incubator-mxnet] ann-qin-lu commented on issue #20959: GPU memory leak when using gluon.data.DataLoader with num_workers>0 (with Cuda > 10.1)

2022-03-16 Thread GitBox


ann-qin-lu commented on issue #20959:
URL: https://github.com/apache/incubator-mxnet/issues/20959#issuecomment-1069231854


   One additional finding is that the memory leak happens with the default thread_pool option set to False (i.e. it leaks when using [multiprocessing.Pool](https://github.com/apache/incubator-mxnet/blame/master/python/mxnet/gluon/data/dataloader.py#L665)); if I switch to [ThreadPool](https://github.com/apache/incubator-mxnet/blame/master/python/mxnet/gluon/data/dataloader.py#L659), there is no memory leak any more! This could be a good indication that the issue is in the shared memory path (a minimal sketch of the switch is below).
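
   For reference, a minimal sketch of that switch using the DataLoader's `thread_pool` flag; the dataset and sizes are placeholders:

```python
import mxnet as mx
from mxnet import gluon

dataset = gluon.data.ArrayDataset(mx.nd.random.uniform(shape=(1000, 32)),
                                  mx.nd.random.uniform(shape=(1000,)))

# Default: worker processes via multiprocessing.Pool (where the leak shows up).
proc_loader = gluon.data.DataLoader(dataset, batch_size=64, num_workers=4)

# Workaround: worker threads via ThreadPool; no forked processes and no leak
# observed, though data preparation can be slower.
thread_loader = gluon.data.DataLoader(dataset, batch_size=64, num_workers=4,
                                      thread_pool=True)
```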


[GitHub] [incubator-mxnet] ann-qin-lu commented on issue #20959: GPU memory leak when using gluon.data.DataLoader with num_workers>0 (with Cuda > 10.1)

2022-03-14 Thread GitBox


ann-qin-lu commented on issue #20959:
URL: https://github.com/apache/incubator-mxnet/issues/20959#issuecomment-1067165130


   Some additional resources I've found:
   
   * This is a similar [issue](https://github.com/apache/incubator-mxnet/pull/19924) for a CPU memory leak with the multi-worker setup in the DataLoader. The solution there was to add a Python gc call to clean up the memory; however, that solution doesn't work for the GPU (a minimal sketch of the gc-based mitigation is below).
   * The cuDNN release [notes](https://docs.nvidia.com/deeplearning/cudnn/release-notes/rel_8.html#rel_8) mention a new buffer management scheme that might affect CUDA >= 10.2, which seems related. The issue only surfaces after I upgrade the CUDA version (tested with CUDA 10.2, 11.1, and 11.5; all three show the memory leak).
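
   For reference, a minimal sketch of that gc-based mitigation (dataset and sizes are placeholders). Per the linked discussion, this kind of explicit collection helped the CPU-side leak but does not release the accumulated GPU memory:

```python
import gc

import mxnet as mx
from mxnet import gluon

dataset = gluon.data.ArrayDataset(mx.nd.random.uniform(shape=(1000, 32)),
                                  mx.nd.random.uniform(shape=(1000,)))

for epoch in range(10):
    loader = gluon.data.DataLoader(dataset, batch_size=64, num_workers=4)
    for batch in loader:
        pass
    del loader
    gc.collect()  # explicitly collect so lingering pool/worker references are freed
```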

