Re: CUDA / CUDNN support revisited

2019-07-02 Thread Dick Carter
Heads up that I'll soon be submitting a PR to help with cuda/cudnn version 
checking.  My goal is to address two points:

- rnn.cc of mxnet v1.5 does not compile against cudnn v6.  Do we scramble 
to fix it or admit that we no longer support cudnn v6 or earlier?
- how do we handle the process of removing code that assumes these 
no-longer-supported cuda/cudnn versions?

I agree with Kellen's statements that the transition should be tied to 
timeframe more than N/N-1.  Users have had over 1.5 years to move to cuda 9 / 
cudnn 7, so it's time to drop cuda 8 / cudnn 6 in my opinion.

My PR will be supplying the 'mechanism' of dealing with cuda/cudnn versions.  
We can continue the discussion on the final 'policy' settings here and in the 
PR.
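To make the 'mechanism' half concrete, here is a minimal sketch of a compile-time version gate.  It is not the code from the PR: the constants kMinCudaVersion and kMinCudnnMajor and both helper functions are hypothetical names, and the stand-in #defines exist only so the snippet builds without the CUDA toolkit.  In a real build CUDA_VERSION would come from cuda.h and CUDNN_MAJOR from cudnn.h.

```cpp
// Stand-ins so this sketch compiles without the CUDA toolkit installed.
// In a real build these macros come from <cuda.h> and <cudnn.h>.
#ifndef CUDA_VERSION
#define CUDA_VERSION 10000  // encoded as major * 1000 + minor * 10
#endif
#ifndef CUDNN_MAJOR
#define CUDNN_MAJOR 7
#endif

// Hypothetical policy constants: the 'policy' is just these two numbers.
constexpr int kMinCudaVersion = 9000;  // CUDA 9.0
constexpr int kMinCudnnMajor = 7;      // cuDNN v7

// True if the given CUDA_VERSION-style value meets the policy floor.
constexpr bool CudaVersionSupported(int cuda_version) {
  return cuda_version >= kMinCudaVersion;
}

// True if the given cuDNN major version meets the policy floor.
constexpr bool CudnnMajorSupported(int cudnn_major) {
  return cudnn_major >= kMinCudnnMajor;
}

// Fail the build early with a readable message instead of a cryptic
// compile error deep inside an operator source file.
static_assert(CudaVersionSupported(CUDA_VERSION),
              "This build requires CUDA 9.0 or later.");
static_assert(CudnnMajorSupported(CUDNN_MAJOR),
              "This build requires cuDNN v7 or later.");
```

Keeping the floor in one place means a later policy change is a one-line edit, whatever values the discussion here settles on.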



Re: CUDA / CUDNN support revisited

2019-06-19 Thread kellen sunderland
Just double checked CUDA 9, 10 and 10.1 all support SM3, so actually I
don't believe there's any need to drop SMs.


Re: CUDA / CUDNN support revisited

2019-06-19 Thread kellen sunderland
I think where we're all going to have agreement is that we shouldn't have
code targeting CUDA versions earlier than CUDA 9, or cuDNN versions earlier
than 6.  We can go ahead and remove any code that targets those old
versions, and drop any SMs that are not supported by CUDA 9 / cuDNN 6.  I'd
suggest we also add some logging for users with prior versions letting them
know they can still use MXNet 1.4.
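As a rough illustration of the logging idea, the sketch below builds the hint a user on older libraries might see.  The function name and message wording are hypothetical, and the floors (CUDA 9, cuDNN 6) are simply the ones named in the paragraph above.  In a real build the two inputs would presumably come from cudaRuntimeGetVersion() and cudnnGetVersion(); here they are plain integers so the snippet compiles anywhere.

```cpp
#include <string>

// Hypothetical helper: returns an empty string when both versions meet the
// floor, otherwise a hint pointing the user at MXNet 1.4.
// cuda_version uses the CUDA_VERSION encoding (major * 1000 + minor * 10).
std::string UnsupportedVersionHint(int cuda_version, int cudnn_major) {
  const bool cuda_ok = cuda_version >= 9000;  // assumed floor: CUDA 9.0
  const bool cudnn_ok = cudnn_major >= 6;     // assumed floor: cuDNN v6
  if (cuda_ok && cudnn_ok) return "";

  std::string msg = "Detected ";
  if (!cuda_ok) {
    msg += "CUDA " + std::to_string(cuda_version / 1000) + "." +
           std::to_string((cuda_version % 1000) / 10);
  }
  if (!cuda_ok && !cudnn_ok) msg += " and ";
  if (!cudnn_ok) msg += "cuDNN v" + std::to_string(cudnn_major);
  msg += ", which this release no longer targets. ";
  msg += "MXNet 1.4 can still be used with these versions.";
  return msg;
}
```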

Where things get interesting is CUDA 9 / cuDNN 6 support.  I was originally
a proponent of the N and N-1 route for simplicity.  Looking back at the
choice, one complication I see is that there are competing concerns between
semver library compatibility and feature releases on NVIDIA's part.  NVIDIA
is releasing new libraries with a lot of new features on a regular basis,
which is good, but for compatibility reasons they've begun to bump major
versions less often, which is probably also good.  For example, if memory
serves correctly, cuDNN used to get a major version (MV) bump every 6 months
or so, but the N-1 MV (6) was released back in April of 2017.  As a project
maintainer I would certainly like to drop support for library versions that
are 2 years old in my latest release.  Supporting a 2 year wide range of
dependency libraries in the CI for example is going to be a burden.

From the MXNet users' perspective obviously having to update dependencies
is a pain, but updating these libs is likely to give significant
performance increases (occasional perf regressions aside).  I think a
consistent thread I've heard from users is that training takes too long,
inference costs too much, and they want their DL framework to abstract the
complexity of using custom hardware like TCs or AVX without them having to
put in a lot of effort.  Another consideration is that using old versions of
MXNet is actually quite easy and convenient thanks to (IMO) some solid
release practices and naming conventions.

Given how easy it is to use old MXNet versions I think it's reasonable to
target CUDA 10 and cuDNN 7 only in release 1.5 (and drop incompatible sm
versions).


Re: CUDA / CUDNN support revisited

2019-06-19 Thread Marco de Abreu
Good points, Anirudh. Generally I would understand N as being the major
version. That is, we would maintain CUDA 9 and 10.1 in your given example and
drop 10.0 as soon as we have verified that 10.1 is working. CUDA 9 would only
be dropped when 11 is released and tested.
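The rule described above can be sketched as a small function: keep the newest verified minor release of each of the two newest major versions.  The names Release and SupportedReleases are hypothetical, and "verified" here just means "present in the input list".

```cpp
#include <cstddef>
#include <map>
#include <utility>
#include <vector>

using Release = std::pair<int, int>;  // (major, minor), e.g. (10, 1)

// Returns the newest verified minor for each of the two newest majors,
// newest major first.
std::vector<Release> SupportedReleases(const std::vector<Release>& verified) {
  std::map<int, int> newest_minor;  // major -> highest verified minor
  for (const Release& r : verified) {
    auto it = newest_minor.find(r.first);
    if (it == newest_minor.end() || r.second > it->second) {
      newest_minor[r.first] = r.second;
    }
  }

  std::vector<Release> supported;
  const std::size_t kMaxMajors = 2;  // N and N-1
  for (auto it = newest_minor.rbegin();
       it != newest_minor.rend() && supported.size() < kMaxMajors; ++it) {
    supported.emplace_back(it->first, it->second);
  }
  return supported;
}
```

With 9.0, 10.0 and 10.1 verified this yields 10.1 and 9.0; once an 11.0 entry is added, the 9.x entry falls out of the result.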

At the same time, we would only ever support the latest compatible
cudnn version. Or is there any reason somebody would use an old cudnn
version?

Wdyt?

-Marco



Re: CUDA / CUDNN support revisited

2019-06-18 Thread Anirudh Subramanian
+1, Agree this should be done for both CUDA and CUDNN versions. At most CUDA
Version N and CUDA Version N - 1 should be supported in CI.

My question is what happens when we are on CUDA version N and have removed
support for CUDA version N - 1, and within a short time Nvidia comes up with
a CUDA patch version N + 1 in which some perf regressions and some bugs have
been fixed. Should we just move to N + 1, since version N will have all these
issues for users and may also slow us down on CI?

I am facing an issue with CUDA 10 and CUDA 10.1 which also seems to be
causing intermittent CI failures:
https://github.com/apache/incubator-mxnet/issues/15273 . There is already a
PR to bump up Nvidia version to 10.1 (
https://github.com/apache/incubator-mxnet/pull/14986/files).

I think for situations where there is a quick follow up release like 10.1
and MXNet users are impacted by certain issues, we should just bump up the
version and stop support for 10.0.
Would like to hear more from Nvidia folks (on this particular case of CUDA
10.0 vs CUDA 10.1 and what are the recommendations for existing customers).

Anirudh



Re: CUDA / CUDNN support revisited

2019-06-03 Thread Dick Carter
Actually, I tried to say that support *doesn't necessarily* include N-1.  I'm 
proposing that the supported versions are 1) covered by CI and 2) have been 
available in a usable form long enough that a semi-motivated user has been able 
to transition to it.  That might mean only N (e.g. per my proposal, only cuDNN 
v7).

Regarding precedent for N / N-1, when a new CUDA version comes out, users will 
transition to it at their own pace, thereby creating an N / N-1 support 
situation for some period.




Re: CUDA / CUDNN support revisited

2019-06-03 Thread Pedro Larroy
Your proposal of having support for N and N-1 makes a lot of sense to
me. Are there use cases for supporting older CUDA versions?


Thanks.



CUDA / CUDNN support revisited

2019-06-03 Thread Dick Carter
I'd like to revisit the discussion of: 
https://lists.apache.org/thread.html/27b84e4fc0e0728f2e4ad8b6827d7f996635021a5a4d47b5d3f4dbfb@%3Cdev.mxnet.apache.org%3E
 now that a year has passed.

My motivation is:

1.  There's a lot of hard-to-read  '#if CUDNN_MAJOR' code referencing cuDNN 
versions back as far as v4(!?).  We need to clean this out before it hampers 
our ability to nimbly move the codebase forward.

2.  There seems to be a difference of opinion on whether we should be 
supporting version 'N-1' (e.g. cuDNN6).  Our current MXNet 1.5 candidate does 
not compile against cuDNN v6, so this should be either fixed or be up-front 
stated to the user community.  The breaking PR was 
https://github.com/apache/incubator-mxnet/pull/14476.

Having read the prior discussion, my take on it is:

- Users should be given an ample time period (1 year?) to move to a new 
CUDA/cuDNN version once it becomes 'usable.'

- We should not claim to support a given version if it is no longer part of the 
MXNet CI.  Users should be warned of an impending drop of this 'testing 
support.'

So these statements do not necessarily promise 'N-1' support.  I could see a 
transitioning of the CI from CUDA9-only -> CUDA9&10 -> CUDA10 only.  Some 
period before CUDA9 is dropped from CI, the user community is warned.  After 
that time, CUDA10 might be the only version tested by CI, and hence the only 
version supported (until the next CUDA version came around).
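The "supported means tested by CI" idea can be sketched as follows.  The CiPhase enum and both function names are hypothetical; the three phases are the ones listed in the paragraph above.

```cpp
#include <algorithm>
#include <vector>

// The three CI phases described above.
enum class CiPhase { kCuda9Only, kCuda9And10, kCuda10Only };

// CUDA major versions exercised by CI in a given phase.
std::vector<int> TestedCudaMajors(CiPhase phase) {
  switch (phase) {
    case CiPhase::kCuda9Only:  return {9};
    case CiPhase::kCuda9And10: return {9, 10};
    case CiPhase::kCuda10Only: return {10};
  }
  return {};  // unreachable for valid enum values
}

// A version is claimed as supported only while CI still tests it.
bool IsSupported(int cuda_major, CiPhase phase) {
  const std::vector<int> tested = TestedCudaMajors(phase);
  return std::find(tested.begin(), tested.end(), cuda_major) != tested.end();
}
```

The user-facing warning would go out some time before the move from kCuda9And10 to kCuda10Only; how long that period should be is the policy question this thread is discussing.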

Let me propose as a 'strawman' that we claim to support CUDA versions 9 and 10, 
with cuDNN version 7 only.  Those versions have been out for over 1.5 years.  
So no CUDA 8 or cuDNN v6 support: they are over 1.5 years old and have no 
coverage by our CI.

-Dick