[GitHub] [incubator-mxnet] ann-qin-lu commented on issue #20959: GPU memory leak when using gluon.data.DataLoader with num_workers>0
ann-qin-lu commented on issue #20959: URL: https://github.com/apache/incubator-mxnet/issues/20959#issuecomment-1081046122

Hi @ptrendx, thanks a lot for the explanation! I now have a much clearer picture of what is going wrong. If the actual root cause is that "CUDA does not in fact survive forking", does that mean multiprocessing with the `fork` start method should have been avoided from the very beginning? A quick summary of the two approaches we discussed:

* With the workaround that skips the cleanup for all engines, GPU resources held by the engine linger whenever the multiprocessing `fork` method is used. The proposed solution is to use `spawn` in Gluon's DataLoader (a minimal sketch of the idea follows below). @waytrue17, could you help with this?
* If we revert the workaround, we see the non-deterministic segfault at exit. That segfault could be resolved if the open issue for [Better handling of the engine destruction](https://github.com/apache/incubator-mxnet/issues/19379#) is resolved first.
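A minimal sketch of the `spawn` idea (not the actual DataLoader patch; the worker function and pool size here are made up): workers started with the `spawn` method begin as fresh interpreters, so they never inherit the parent's CUDA/engine state the way `fork`ed workers do.

```python
import multiprocessing as mp

def _transform(idx):
    # hypothetical CPU-only preprocessing done in a worker process
    return idx * 2

if __name__ == "__main__":
    # use the "spawn" start method instead of the default "fork" on Linux
    ctx = mp.get_context("spawn")
    with ctx.Pool(processes=4) as pool:
        print(pool.map(_transform, range(8)))
```

The trade-off, raised in the next comment, is that spawned workers cannot rely on copy-on-write access to the parent's dataset.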
[GitHub] [incubator-mxnet] ann-qin-lu commented on issue #20959: GPU memory leak when using gluon.data.DataLoader with num_workers>0
ann-qin-lu commented on issue #20959: URL: https://github.com/apache/incubator-mxnet/issues/20959#issuecomment-1080014895

@TristonC I think your error is due to the fact that the DataLoader uses shared memory to hold the dataset. I am not sure whether using `spawn` would require copying that shared memory; if it does, I assume this approach would increase the total memory usage?
[GitHub] [incubator-mxnet] ann-qin-lu commented on issue #20959: GPU memory leak when using gluon.data.DataLoader with num_workers>0
ann-qin-lu commented on issue #20959: URL: https://github.com/apache/incubator-mxnet/issues/20959#issuecomment-1080014222

Hi @waytrue17, thanks for sharing the info above. Yes, not recreating the DataLoader for each epoch does prevent the issue, but in my use case I need to shard the big dataset into smaller pieces each epoch, so the DataLoader has to be created multiple times (a reduced sketch of this pattern follows below). I've seen a few comments (e.g. [issue 1](https://github.com/apache/incubator-mxnet/pull/19378#issuecomment-730078762), [issue 2](https://github.com/apache/incubator-mxnet/issues/19420)) that mention memory errors with this workaround [commit](https://github.com/apache/incubator-mxnet/pull/19378). Reverting this commit does resolve my accumulated GPU memory issue.

Side question: could you share more insight into how this workaround [commit](https://github.com/apache/incubator-mxnet/pull/19378/files), which skips the GPU memory cleanup in the NaiveEngine, affects the DataLoader usage pattern?
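A reduced sketch of the per-epoch sharding pattern described above, assuming a Linux host (fork-based workers) and an available GPU; the dataset shapes, shard count, and batch size are made up. A fresh DataLoader with `num_workers > 0` is created every epoch, which is the pattern under which the GPU memory was seen to accumulate.

```python
import mxnet as mx
from mxnet.gluon.data import ArrayDataset, DataLoader

# toy stand-in for the real sharded dataset
features = mx.nd.random.uniform(shape=(10000, 32))
labels = mx.nd.random.uniform(shape=(10000,))
num_epochs, num_shards = 5, 5
shard_size = features.shape[0] // num_shards

for epoch in range(num_epochs):
    start = (epoch % num_shards) * shard_size
    shard = ArrayDataset(features[start:start + shard_size],
                         labels[start:start + shard_size])
    # a new DataLoader (and a new worker pool) is created every epoch
    loader = DataLoader(shard, batch_size=64, num_workers=4)
    for data, label in loader:
        data = data.as_in_context(mx.gpu(0))   # touch the GPU as training would
    del loader                                  # workers are torn down here
```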
[GitHub] [incubator-mxnet] ann-qin-lu commented on issue #20959: GPU memory leak when using gluon.data.DataLoader with num_workers>0
ann-qin-lu commented on issue #20959: URL: https://github.com/apache/incubator-mxnet/issues/20959#issuecomment-1072926312

@TristonC False alarm on the CUDA version. Thanks a lot for your help!
[GitHub] [incubator-mxnet] ann-qin-lu commented on issue #20959: GPU memory leak when using gluon.data.DataLoader with num_workers>0
ann-qin-lu commented on issue #20959: URL: https://github.com/apache/incubator-mxnet/issues/20959#issuecomment-1072926185

After more deep dives, this issue is actually not caused by the CUDA upgrade from 10 to 11, but introduced by this specific [commit: Remove cleanup on side threads](https://github.com/apache/incubator-mxnet/pull/19378), which skips CUDA deinitialization when the engine is destructed. I've confirmed that after reverting this commit the memory leak is gone (one way to check for it is sketched below). I'll work with the MXNet team to see whether this commit should be reverted in both the MXNet master and 1.9 branches. (Another user reported a similar memory [issue](https://github.com/apache/incubator-mxnet/issues/19420) when using multiprocessing and tried to [revert](https://github.com/apache/incubator-mxnet/pull/19432) this commit.) Here is the open [issue](https://github.com/apache/incubator-mxnet/issues/19379) for better handling of the engine destruction, which needs to be addressed first if the above workaround is to be reverted.
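One way to watch for the accumulation described above, assuming an MXNet GPU build: `mx.context.gpu_memory_info()` reports free and total bytes for a device, so free memory that keeps dropping across epochs and never recovers after the DataLoader is destroyed is the leak signature.

```python
import mxnet as mx

def log_gpu_free(tag, device_id=0):
    # gpu_memory_info returns (free_bytes, total_bytes) for the given device
    free, total = mx.context.gpu_memory_info(device_id)
    print(f"{tag}: {free / 1e6:.1f} MB free of {total / 1e6:.1f} MB")

log_gpu_free("before epoch")
# ... create a DataLoader with num_workers > 0, iterate over it, delete it ...
log_gpu_free("after epoch")
```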
[GitHub] [incubator-mxnet] ann-qin-lu commented on issue #20959: GPU memory leak when using gluon.data.DataLoader with num_workers>0 (with Cuda > 10.1)
ann-qin-lu commented on issue #20959: URL: https://github.com/apache/incubator-mxnet/issues/20959#issuecomment-1069874455

Hi @TristonC, thanks a ton for looking into the issue. I tried the `thread_pool` option and it does work without the memory leak (usage sketched below). However, since the thread_pool option is slower at preparing data, I do observe increased end-to-end latency (mostly during validation). My production use cases are very sensitive to training time, so we'd still like to explore the multiprocessing.Pool option (assuming the memory leak issue can be resolved soon). Do you have any hunch about which changes in CUDA/cuDNN might lead to this issue?
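For reference, a sketch of the `thread_pool` workaround mentioned above, with a toy dataset standing in for the real one: the same DataLoader call, but with `thread_pool=True` so the workers are threads in the parent process rather than forked workers. It avoided the leak here, at the cost of slower data preparation (the thread workers share the Python GIL).

```python
import mxnet as mx
from mxnet.gluon.data import ArrayDataset, DataLoader

dataset = ArrayDataset(mx.nd.random.uniform(shape=(1000, 32)),
                       mx.nd.random.uniform(shape=(1000,)))
# thread_pool=True swaps the forked worker processes for a thread pool
loader = DataLoader(dataset, batch_size=64, num_workers=4, thread_pool=True)
for data, label in loader:
    pass
```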
[GitHub] [incubator-mxnet] ann-qin-lu commented on issue #20959: GPU memory leak when using gluon.data.DataLoader with num_workers>0 (with Cuda > 10.1)
ann-qin-lu commented on issue #20959: URL: https://github.com/apache/incubator-mxnet/issues/20959#issuecomment-1069675674

One more data point I've gathered: if I remove the logic that uses shared memory (a.k.a. the [global _worker_dataset](https://github.com/apache/incubator-mxnet/blame/master/python/mxnet/gluon/data/dataloader.py#L421)), it also resolves the memory leak. Most likely the multiprocessing + shared memory implementation leaves behind some stale references, which hold on to GPU memory with the latest CUDA implementation (a simplified sketch of the pattern is below).
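A much-simplified sketch of the `_worker_dataset` pattern referenced above (not the actual dataloader.py code): the dataset is stashed in a module-level global via a pool initializer, and each worker indexes into that global instead of receiving the dataset with every task. Any worker that keeps such a reference alive also keeps alive whatever the reference transitively holds.

```python
import multiprocessing as mp

_worker_dataset = None  # set in each worker by the pool initializer

def _worker_init(dataset):
    global _worker_dataset
    _worker_dataset = dataset

def _worker_fetch(idx):
    # workers read from the module-level global rather than a per-task argument
    return _worker_dataset[idx]

if __name__ == "__main__":
    data = list(range(100))
    with mp.Pool(4, initializer=_worker_init, initargs=(data,)) as pool:
        print(pool.map(_worker_fetch, [0, 1, 2, 3]))
```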
[GitHub] [incubator-mxnet] ann-qin-lu commented on issue #20959: GPU memory leak when using gluon.data.DataLoader with num_workers>0 (with Cuda > 10.1)
ann-qin-lu commented on issue #20959: URL: https://github.com/apache/incubator-mxnet/issues/20959#issuecomment-1069231854

One additional finding: the memory leak happens with the default `thread_pool` option set to False (i.e. the leak occurs when using [multiprocessing.Pool](https://github.com/apache/incubator-mxnet/blame/master/python/mxnet/gluon/data/dataloader.py#L665)); if I switch to [ThreadPool](https://github.com/apache/incubator-mxnet/blame/master/python/mxnet/gluon/data/dataloader.py#L659), there is no memory leak any more! This could be a good indication that the issue lies in the shared-memory path (a stripped-down illustration of the two pool flavours is below).
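A stripped-down illustration of the two pool flavours the DataLoader chooses between (simplified, not the actual dataloader.py code; the fetch function is a placeholder): `multiprocessing.Pool` forks worker processes, while `multiprocessing.pool.ThreadPool` runs workers as threads in the parent process, so there is no fork and no inherited CUDA state.

```python
from multiprocessing import Pool
from multiprocessing.pool import ThreadPool

def _fetch(idx):
    # placeholder for the real batch-fetch work
    return idx

if __name__ == "__main__":
    use_threads = True  # roughly what thread_pool=True selects
    pool_cls = ThreadPool if use_threads else Pool
    with pool_cls(4) as pool:
        print(pool.map(_fetch, range(8)))
```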
[GitHub] [incubator-mxnet] ann-qin-lu commented on issue #20959: GPU memory leak when using gluon.data.DataLoader with num_workers>0 (with Cuda > 10.1)
ann-qin-lu commented on issue #20959: URL: https://github.com/apache/incubator-mxnet/issues/20959#issuecomment-1067165130

Some additional resources I've found:

* A similar [issue](https://github.com/apache/incubator-mxnet/pull/19924) about a CPU memory leak with the multi-worker setup in the DataLoader. The solution there was to add a Python gc pass to clean up the memory; however, that approach does not work for the GPU (see the sketch after this list).
* The cuDNN release [note](https://docs.nvidia.com/deeplearning/cudnn/release-notes/rel_8.html#rel_8) mentions new buffer management that may affect CUDA >= 10.2, which seems related. The issue only surfaced after I upgraded the CUDA version (tested with CUDA 10.2, 11.1, and 11.5, and all three show the memory leak).
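A sketch of the gc-based cleanup idea from the linked PR, applied here from user code with a toy dataset (the actual fix lives inside the DataLoader); per the comment above, forcing collection reclaims the CPU-side worker state but the GPU memory stays allocated.

```python
import gc
import mxnet as mx
from mxnet.gluon.data import ArrayDataset, DataLoader

dataset = ArrayDataset(mx.nd.random.uniform(shape=(1000, 32)),
                       mx.nd.random.uniform(shape=(1000,)))

for epoch in range(3):
    loader = DataLoader(dataset, batch_size=64, num_workers=4)
    for data, label in loader:
        pass
    del loader
    gc.collect()  # reclaims CPU-side worker objects; the leaked GPU memory is not freed
```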