Re: CUDNN algorithm selection failure
"I ran a similar test(test_slice_batchnorm) for 5K times and I couldn't reproduce the issue." One thing to keep in mind is that the SelectAlgo call will cache results in a registry that is in static scope. To repro you'd likely have to create a new process each time you run the test. (Apologies if this is already how you're reproducing). SelectAlgo call: https://github.com/apache/incubator-mxnet/blob/403831ace46eab4447794df9411351e439e8983e/src/operator/nn/cudnn/cudnn_convolution-inl.h#L609 Static local / singleton registry pattern here: https://github.com/apache/incubator-mxnet/blob/024b5a916dd3a39a39031ce5e6565cd7d9d60fe2/src/operator/nn/cudnn/cudnn_algoreg.cc#L37 On Thu, Oct 4, 2018 at 8:58 PM Marco de Abreu wrote: > For GPU, we don't run any tests in parallel. > > -Marco > > Naveen Swamy schrieb am Do., 4. Okt. 2018, 19:54: > > > Looking at the error raised, you can see that the workspace size(GPU mem > > size) of 1GB isn't sufficient. I am wondering if it is due to tests > running > > in parallel on CI, if this is true(tests running in parallel) is it > > possible to reduce the parallelism ? > > Error: > > "mxnet.base.MXNetError: [05:40:12] > > src/operator/nn/./cudnn/cudnn_convolution-inl.h:870: Failed to find any > > forward convolution algorithm. with workspace size of 1073741824 bytes, > > please consider reducing batch/model size or increasing the workspace > size" > > > > I ran a similar test(test_slice_batchnorm) for 5K times and I couldn't > > reproduce the issue. I will look into it further to see if there are > other > > alternatives. > > > > > > On Thu, Oct 4, 2018 at 10:48 AM Piyush Ghai > wrote: > > > > > Another build where test_slice_batchnorm_reshape_batchnorm fails : > > > > > > > > > http://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/incubator-mxnet/detail/PR-12721/7/pipeline > > > < > > > > > > http://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/incubator-mxnet/detail/PR-12721/7/pipeline > > > > > > > > > > — > > > Piyush > > > > > > > On Oct 3, 2018, at 9:32 AM, Pedro Larroy < > pedro.larroy.li...@gmail.com > > > > > > wrote: > > > > > > > > Seems is not the only test: > > > > > > > > > > http://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/incubator-mxnet/detail/PR-12726/5/pipeline > > > > > > > > test_slice_batchnorm_reshape_batchnorm is also failing and hasn't > been > > > > touched for a while. It doesn't look like a problem with the test to > > me, > > > > (not a flaky test). Looks to me that should find and address the root > > > cause > > > > instead of disabling the test in this case. > > > > > > > > Pedro. > > > > > > > > On Tue, Oct 2, 2018 at 2:39 AM Marco de Abreu > > > > wrote: > > > > > > > >> I have created an issue at > > > >> https://github.com/apache/incubator-mxnet/issues/12715 and a PR to > > > disable > > > >> the test at https://github.com/apache/incubator-mxnet/pull/12716. > > > >> > > > >> This test is pretty new and was submitted with a number of other > > > >> problematic (and disabled) tests: > > > >> https://github.com/apache/incubator-mxnet/issues/11164 It could be > > > >> possible > > > >> that the test is simply not stable enough. The PR that introduced > that > > > test > > > >> is https://github.com/apache/incubator-mxnet/pull/10921 - it was > > merged > > > >> two > > > >> days ago. > > > >> > > > >> Best regards, > > > >> Marco > > > >> > > > >> On Tue, Oct 2, 2018 at 8:43 AM Pedro Larroy < > > > pedro.larroy.li...@gmail.com> > > > >> wrote: > > > >> > > > >>> Thanks for checking Lin. 
If it happens again we will have to dig > > > deeper. > > > >> We > > > >>> have just one executor in GPU so I wonder what could be the root > > cause > > > of > > > >>> this. > > > >>> > > > >>> On Mon, Oct 1, 2018 at 10:57 PM Lin Yuan > > wrote: > > > >>> > > > I could not reproduce the error on an EC2 g3x8 instance making it > > hard > > > >> to > > > debug. I also suspect it was due to resource usage limit on ci > > > >>> Instance. > > > > > > On Mon, Oct 1, 2018 at 10:40 PM Pedro Larroy < > > > >>> pedro.larroy.li...@gmail.com > > > > > > > wrote: > > > > > > > It doesn't look like flakiness to me at first sight. I think it > > might > > > >>> be > > > > related to resource usage / allocation / leak in the worst case. > > > > > > > > Could be that there was not enough memory GPU memory at the time > of > > > >>> test > > > > execution. But I'm just speculating, hence my original question. > > > > > > > > Pedro. > > > > > > > > On Mon, Oct 1, 2018 at 8:16 PM Lin Yuan > > wrote: > > > > > > > >> Hi Pedro, > > > >> > > > >> I also got this failure in my PR > > > >> > > > >> > > > > > > > > > > >>> > > > >> > > > > > > http://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/incubator-mxnet/detail/PR-11742/27/pipeline > > > >> > > > >> I was not able to identify the root cause of it from
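For anyone unfamiliar with the pattern, here is a minimal sketch of what a static-local singleton cache like the one linked above looks like. This is illustrative only, not the actual MXNet code; the AlgoCache, Key and Algo names are hypothetical stand-ins:

    #include <map>
    #include <mutex>
    #include <string>

    // Hypothetical stand-ins for the real key/value types.
    using Key  = std::string;  // e.g. serialized conv shapes + dtype
    using Algo = int;          // e.g. a cuDNN algorithm id

    class AlgoCache {
     public:
      // Static local: constructed once, lives for the whole process.
      static AlgoCache& Get() {
        static AlgoCache instance;
        return instance;
      }

      bool Find(const Key& key, Algo* out) {
        std::lock_guard<std::mutex> lock(mutex_);
        auto it = cache_.find(key);
        if (it == cache_.end()) return false;
        *out = it->second;
        return true;
      }

      void Register(const Key& key, Algo algo) {
        std::lock_guard<std::mutex> lock(mutex_);
        cache_.emplace(key, algo);
      }

     private:
      AlgoCache() = default;
      std::mutex mutex_;
      std::map<Key, Algo> cache_;
    };

Once an entry has been registered for a given key, every later lookup in the same process hits the cache and the selection path is never re-exercised -- which is why a fresh process per run is needed to reproduce the failure.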
Re: CUDNN algorithm selection failure
For GPU, we don't run any tests in parallel.

-Marco
Re: CUDNN algorithm selection failure
Looking at the error raised, you can see that the workspace size (GPU memory size) of 1 GB isn't sufficient. I am wondering if it is due to tests running in parallel on CI; if that is true (tests running in parallel), is it possible to reduce the parallelism?

Error:
"mxnet.base.MXNetError: [05:40:12] src/operator/nn/./cudnn/cudnn_convolution-inl.h:870: Failed to find any forward convolution algorithm. with workspace size of 1073741824 bytes, please consider reducing batch/model size or increasing the workspace size"

I ran a similar test (test_slice_batchnorm) for 5K times and I couldn't reproduce the issue. I will look into it further to see if there are other alternatives.

On Thu, Oct 4, 2018 at 10:48 AM Piyush Ghai wrote:

> Another build where test_slice_batchnorm_reshape_batchnorm fails:
> http://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/incubator-mxnet/detail/PR-12721/7/pipeline
>
> — Piyush
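To make the error above concrete: algorithm selection typically enumerates candidate algorithms together with the workspace each one needs, then keeps the best candidate that fits under the limit. The sketch below is a simplified illustration under assumed types (PerfResult is a hypothetical stand-in for cuDNN's perf-result struct), not the actual cudnn_convolution-inl.h code:

    #include <cstddef>
    #include <stdexcept>
    #include <string>
    #include <vector>

    // Hypothetical stand-in for a cuDNN perf-result entry.
    struct PerfResult {
      int algo;             // candidate algorithm id
      bool succeeded;       // whether the candidate ran at all
      std::size_t memory;   // workspace bytes the candidate needs
    };

    // Pick the first (best-ranked) candidate whose workspace fits.
    // Throws when nothing fits -- the situation in the CI error above.
    int SelectAlgo(const std::vector<PerfResult>& candidates,
                   std::size_t workspace_limit) {
      for (const auto& c : candidates) {
        if (c.succeeded && c.memory <= workspace_limit) return c.algo;
      }
      throw std::runtime_error(
          "Failed to find any forward convolution algorithm with workspace "
          "size of " + std::to_string(workspace_limit) + " bytes");
    }

Under memory pressure every candidate can fail to run or need more workspace than the 1073741824-byte limit, in which case the loop falls through to the error -- consistent with the resource-exhaustion theory in this thread.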
Re: CUDNN algorithm selection failure
It seems it's not the only test:
http://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/incubator-mxnet/detail/PR-12726/5/pipeline

test_slice_batchnorm_reshape_batchnorm is also failing and hasn't been touched for a while. It doesn't look like a problem with the test to me (not a flaky test). It looks to me like we should find and address the root cause instead of disabling the test in this case.

Pedro.
Re: CUDNN algorithm selection failure
I have created an issue at https://github.com/apache/incubator-mxnet/issues/12715 and a PR to disable the test at https://github.com/apache/incubator-mxnet/pull/12716.

This test is pretty new and was submitted with a number of other problematic (and disabled) tests: https://github.com/apache/incubator-mxnet/issues/11164. It could be that the test is simply not stable enough. The PR that introduced the test is https://github.com/apache/incubator-mxnet/pull/10921 - it was merged two days ago.

Best regards,
Marco

On Tue, Oct 2, 2018 at 8:43 AM Pedro Larroy wrote:

> Thanks for checking, Lin. If it happens again we will have to dig deeper. We
> have just one executor on GPU, so I wonder what could be the root cause of
> this.
Re: CUDNN algorithm selection failure
I could not reproduce the error on an EC2 g3x8 instance, making it hard to debug. I also suspect it was due to a resource usage limit on the CI instance.

On Mon, Oct 1, 2018 at 10:40 PM Pedro Larroy wrote:

> It doesn't look like flakiness to me at first sight. I think it might be
> related to resource usage / allocation / a leak in the worst case.
>
> Could be that there was not enough GPU memory at the time of test
> execution. But I'm just speculating, hence my original question.
>
> Pedro.
>
> On Mon, Oct 1, 2018 at 8:16 PM Lin Yuan wrote:
>
> > Hi Pedro,
> >
> > I also got this failure in my PR
> > http://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/incubator-mxnet/detail/PR-11742/27/pipeline
> >
> > I was not able to identify the root cause of it from the changelist. Are
> > you suggesting there is some flakiness in the master branch too?
> >
> > Thanks,
> >
> > Lin
CUDNN algorithm selection failure
Hi,

I saw this failure on CI:
http://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/incubator-mxnet/detail/master/1697/pipeline

Have you seen other cases where we fail to select the best CUDNN algorithm? Under which circumstances could this happen, and do you think it is a good idea to have one selected by default as a last resort?

Pedro.
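On the last-resort question: one possible shape for such a fallback is sketched below. This is a suggestion only, not existing MXNet behavior; the fallback id is a hypothetical stand-in for a known low-workspace algorithm, and PerfResult is the same hypothetical struct used in the sketch earlier in this thread:

    #include <cstddef>
    #include <vector>

    // Same hypothetical perf struct as in the earlier sketch.
    struct PerfResult {
      int algo;
      bool succeeded;
      std::size_t memory;
    };

    // Hypothetical id of a low/zero-workspace algorithm to fall back on.
    constexpr int kFallbackAlgo = 0;

    int SelectAlgoWithFallback(const std::vector<PerfResult>& candidates,
                               std::size_t workspace_limit) {
      for (const auto& c : candidates) {
        if (c.succeeded && c.memory <= workspace_limit) return c.algo;
      }
      // Last resort: trade performance for the ability to run at all.
      return kFallbackAlgo;
    }

The trade-off is that a silent fallback can mask real resource problems like the one being debugged in this thread, so logging a warning whenever the fallback is taken would be essential.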