Re: CI impaired

2018-11-30 Thread Marco de Abreu
Thanks Naveen and Gavin!

#1 has been completed and every job has finished its processing.

#2 is the ticket with infra:
https://issues.apache.org/jira/browse/INFRA-17346

I'm now waiting for their response.

-Marco

On Fri, Nov 30, 2018 at 8:25 PM Naveen Swamy  wrote:

> Hi Marco/Gavin,
>
> Thanks for the clarification. I was not aware that it had been tested in a
> separate test environment (this is what I was suggesting, so that the
> changes are made in a more controlled manner). The last time the change was
> made, many PRs were left dangling and developers had to go retrigger them; I
> triggered them at least 5 times before it succeeded today.
>
> Appreciate all the hard work to make CI better.
>
> -Naveen
>
> On Fri, Nov 30, 2018 at 8:50 AM Gavin M. Bell 
> wrote:
>
> > Hey Folks,
> >
> > Marco has been running this change in dev, with flying colors, for some
> > time. This is not an experiment but a roll out that was announced.  We
> also
> > decided to make this change post the release cut so as to limit the blast
> radius
> > from any critical obligations to the community.  Marco is accountable for
> > this work and will address any issues that may occur as he has been put
> > on-call.  We have, to our best ability, mitigated as much risk as
> possible
> > and now it is time to pull the trigger.  The community will enjoy a bit
> > more visibility and clarity into the test process which will be
> > advantageous, as well as allowing us to extend our infrastructure in a
> way
> > that affords us more flexibility.
> >
> > No pending PRs will be impacted.
> >
> > Thank you for your support as we evolve this system to better serve the
> > community.
> >
> > -Gavin
> >
> > On Fri, Nov 30, 2018 at 5:23 PM Marco de Abreu
> >  wrote:
> >
> > > Hello Naveen, this is not an experiment. Everything has been tested in
> > our
> > > test system and is considered working 100%. This is not a test but
> > actually
> > > the move into production - the merge into master happened a week ago.
> We
> > > now just have to put all PRs into the catalogue, which means that all
> PRs
> > > have to be analyzed with the new pipelines - the only thing that will
> be
> > > noticeable is that the CI is under higher load.
> > >
> > > The pending PRs will not be impacted. The existing pipeline is still
> > > running in parallel and everything will behave as before.
> > >
> > > -Marco
> > >
> > > On Fri, Nov 30, 2018 at 4:41 PM Naveen Swamy 
> wrote:
> > >
> > > > Marco, run your experiments on a branch - set up, test it well and
> then
> > > > bring it to the master.
> > > >
> > > > > On Nov 30, 2018, at 6:53 AM, Marco de Abreu <
> > > > marco.g.ab...@googlemail.com.INVALID> wrote:
> > > > >
> > > > > Hello,
> > > > >
> > > > > I'm now moving forward with #1. I will try to get to #3 as soon as
> > > > possible
> > > > > to reduce parallel jobs in our CI. You might notice some unfinished
> > > > jobs. I
> > > > > will let you know as soon as this process has been completed. Until
> > > then,
> > > > > > please bear with me since we have hundreds of jobs to run in order
> to
> > > > > validate all PRs.
> > > > >
> > > > > Best regards,
> > > > > Marco
> > > > >
> > > > > On Fri, Nov 30, 2018 at 1:36 AM Marco de Abreu <
> > > > marco.g.ab...@googlemail.com>
> > > > > wrote:
> > > > >
> > > > >> Hello,
> > > > >>
> > > > >> since the release branch has now been cut, I would like to move
> > > forward
> > > > >> with the CI improvements for the master branch. This would include
> > the
> > > > >> following actions:
> > > > >> 1. Re-enable the new Jenkins job
> > > > >> 2. Request Apache Infra to move the protected branch check from
> the
> > > main
> > > > >> pipeline to our new ones
> > > > >> 3. Merge https://github.com/apache/incubator-mxnet/pull/13474 -
> > this
> > > > >> finalizes the deprecation process
> > > > >>
> > > > >> If nobody objects, I would like to start with #1 soon. Mentors,
> > could
> > > > you
> > > > >> please assist to create the Apache Infra ticket? I would then take
> > it
> > > > from
> > > > >> there and talk to Infra.
> > > > >>
> > > > >> Best regards,
> > > > >> Marco
> > > > >>
> > > > >> On Mon, Nov 26, 2018 at 2:47 AM kellen sunderland <
> > > > >> kellen.sunderl...@gmail.com> wrote:
> > > > >>
> > > > >>> Sorry, [1] meant to reference
> > > > >>> https://issues.jenkins-ci.org/browse/JENKINS-37984 .
> > > > >>>
> > > > >>> On Sun, Nov 25, 2018 at 5:41 PM kellen sunderland <
> > > > >>> kellen.sunderl...@gmail.com> wrote:
> > > > >>>
> > > >  Marco and I ran into another urgent issue over the weekend that
> > was
> > > >  causing builds to fail.  This issue was unrelated to any feature
> > > >  development work, or other CI fixes applied recently, but it did
> > > > require
> > > >  quite a bit of work from Marco (and a little from me) to fix.
> > > > 
> > > >  We spent enough time on the problem that it caused us to take a
> > step
> > > > >>> back
> > > >  and consider how we could both fix issues i

Re: v1.4.0 status 11/29

2018-11-30 Thread Alex Zai
PR is here https://github.com/apache/incubator-mxnet/pull/13497.

On Thu, Nov 29, 2018 at 8:56 PM Lv, Tao A  wrote:

> Credit belongs to Alex.
>
> Hi Alex, would you mind porting your fix to the v1.4.x branch?
>
> Thanks,
> -Tao
>
> -Original Message-
> From: Steffen Rochel [mailto:steffenroc...@gmail.com]
> Sent: Friday, November 30, 2018 12:48 PM
> To: dev@mxnet.incubator.apache.org
> Subject: Re: v1.4.0 status 11/29
>
> Hi Tao - thanks for fixing the crash. Please create PR on v1.4.x branch
> with [v1.4.x] in title and add me to the PR.
> Steffen
>
> On Thu, Nov 29, 2018 at 8:44 PM Lv, Tao A  wrote:
>
> > Hi Steffen, I would like to have
> > https://github.com/apache/incubator-mxnet/pull/13433  into the coming
> > 1.4.0 release. It fixed a crash of deconvolution with certain input
> > size for MKL-DNN backend. This PR is well reviewed and already merged
> > into the master branch. New test case is also included there.
> >
> > Please find the corresponding issue here:
> > https://github.com/apache/incubator-mxnet/issues/13421 .
> >
> > Thanks,
> > -Tao
> >
> > -Original Message-
> > From: Steffen Rochel [mailto:steffenroc...@gmail.com]
> > Sent: Friday, November 30, 2018 12:05 PM
> > To: dev@mxnet.incubator.apache.org
> > Subject: v1.4.0 status 11/29
> >
> > Dear MXNet community -
> > I would like to provide an update on v1.4.0 status; details will be
> > tracked here <
> > https://cwiki.apache.org/confluence/display/MXNET/Apache+MXNet+%28incubating%29+1.4.0+Release+Plan+and+Status
> > >.
> >
> > 1. Sergey created v1.4.x branch
> > 2. As expected, additional requests have been made for inclusion in
> > v1.4.0 release. Critical PRs are tracked here <
> > https://cwiki.apache.org/confluence/display/MXNET/Apache+MXNet+%28incubating%29+1.4.0+Release+Plan+and+Status#ApacheMXNet(incubating)1.4.0ReleasePlanandStatus-OpenPRstotrack
> > >.
> > 3. PR to update README.md is blocked by flaky test failures,
> > retriggered check.
> > 4. PR to upgrade version on master to v1.5.0 has been submitted.
> > 5. CI is set up and the first run passed.
> >
> > Note: if you want to add selected fixes or enhancements, please reply
> > to this email. Please provide justification, add me as approver to the
> > v1.4.x PR and make sure your changes have tests included in PR and get
> > properly reviewed.
> >
> > Regards,
> > Steffen
> >
>


Re: CI impaired

2018-11-30 Thread Naveen Swamy
Hi Marco/Gavin,

Thanks for the clarification. I was not aware that it had been tested in a
separate test environment (this is what I was suggesting, so that the
changes are made in a more controlled manner). The last time the change was
made, many PRs were left dangling and developers had to go retrigger them; I
triggered them at least 5 times before it succeeded today.

Appreciate all the hard work to make CI better.

-Naveen

On Fri, Nov 30, 2018 at 8:50 AM Gavin M. Bell 
wrote:

> Hey Folks,
>
> Marco has been running this change in dev, with flying colors, for some
> time. This is not an experiment but a roll out that was announced.  We also
> decided to make this change post the release cut so as to limit the blast radius
> from any critical obligations to the community.  Marco is accountable for
> this work and will address any issues that may occur as he has been put
> on-call.  We have, to our best ability, mitigated as much risk as possible
> and now it is time to pull the trigger.  The community will enjoy a bit
> more visibility and clarity into the test process which will be
> advantageous, as well as allowing us to extend our infrastructure in a way
> that affords us more flexibility.
>
> No pending PRs will be impacted.
>
> Thank you for your support as we evolve this system to better serve the
> community.
>
> -Gavin
>
> On Fri, Nov 30, 2018 at 5:23 PM Marco de Abreu
>  wrote:
>
> > Hello Naveen, this is not an experiment. Everything has been tested in
> our
> > test system and is considered working 100%. This is not a test but
> actually
> > the move into production - the merge into master happened a week ago. We
> > now just have to put all PRs into the catalogue, which means that all PRs
> > have to be analyzed with the new pipelines - the only thing that will be
> > noticeable is that the CI is under higher load.
> >
> > The pending PRs will not be impacted. The existing pipeline is still
> > running in parallel and everything will behave as before.
> >
> > -Marco
> >
> > On Fri, Nov 30, 2018 at 4:41 PM Naveen Swamy  wrote:
> >
> > > Marco, run your experiments on a branch - set up, test it well and then
> > > bring it to the master.
> > >
> > > > On Nov 30, 2018, at 6:53 AM, Marco de Abreu <
> > > marco.g.ab...@googlemail.com.INVALID> wrote:
> > > >
> > > > Hello,
> > > >
> > > > I'm now moving forward with #1. I will try to get to #3 as soon as
> > > possible
> > > > to reduce parallel jobs in our CI. You might notice some unfinished
> > > jobs. I
> > > > will let you know as soon as this process has been completed. Until
> > then,
> > > > > please bear with me since we have hundreds of jobs to run in order to
> > > > validate all PRs.
> > > >
> > > > Best regards,
> > > > Marco
> > > >
> > > > On Fri, Nov 30, 2018 at 1:36 AM Marco de Abreu <
> > > marco.g.ab...@googlemail.com>
> > > > wrote:
> > > >
> > > >> Hello,
> > > >>
> > > >> since the release branch has now been cut, I would like to move
> > forward
> > > >> with the CI improvements for the master branch. This would include
> the
> > > >> following actions:
> > > >> 1. Re-enable the new Jenkins job
> > > >> 2. Request Apache Infra to move the protected branch check from the
> > main
> > > >> pipeline to our new ones
> > > >> 3. Merge https://github.com/apache/incubator-mxnet/pull/13474 -
> this
> > > >> finalizes the deprecation process
> > > >>
> > > >> If nobody objects, I would like to start with #1 soon. Mentors,
> could
> > > you
> > > >> please assist to create the Apache Infra ticket? I would then take
> it
> > > from
> > > >> there and talk to Infra.
> > > >>
> > > >> Best regards,
> > > >> Marco
> > > >>
> > > >> On Mon, Nov 26, 2018 at 2:47 AM kellen sunderland <
> > > >> kellen.sunderl...@gmail.com> wrote:
> > > >>
> > > >>> Sorry, [1] meant to reference
> > > >>> https://issues.jenkins-ci.org/browse/JENKINS-37984 .
> > > >>>
> > > >>> On Sun, Nov 25, 2018 at 5:41 PM kellen sunderland <
> > > >>> kellen.sunderl...@gmail.com> wrote:
> > > >>>
> > >  Marco and I ran into another urgent issue over the weekend that
> was
> > >  causing builds to fail.  This issue was unrelated to any feature
> > >  development work, or other CI fixes applied recently, but it did
> > > require
> > >  quite a bit of work from Marco (and a little from me) to fix.
> > > 
> > >  We spent enough time on the problem that it caused us to take a
> step
> > > >>> back
> > >  and consider how we could both fix issues in CI and support the
> 1.4
> > > >>> release
> > >  with the least impact possible on MXNet devs.  Marco had planned
> to
> > > >>> make a
> > >  significant change to the CI to fix a long-standing Jenkins error
> > [1],
> > > >>> but
> > >  we feel that most developers would prioritize having a stable
> build
> > >  environment for the next few weeks over having this fix in place.
> > > 
> > >  To properly introduce a new CI system the intent was to do a
> gradual
> > >  blue/green roll out of the fix.

Re: Adding AMD CPU to CI

2018-11-30 Thread Pedro Larroy
I think just adding AMD is not the right abstraction level. Testing and
benchmarking with different CPU flags / -march settings, such as AVX2 or SSE2,
brings value in my opinion. Just testing another vendor of a compatible CPU
doesn't.

Pedro
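
A minimal sketch of what testing at the level of CPU flags, rather than CPU
vendor, could look like. It assumes a Linux host where /proc/cpuinfo lists
feature flags such as avx2, sse2 or f16c. The function names
(parse_cpu_flags, missing_features) and the sample text are hypothetical and
not part of MXNet or its CI.

```python
from typing import Iterable, Set

# Hypothetical /proc/cpuinfo excerpt, used so the sketch runs anywhere.
SAMPLE_CPUINFO = """\
processor\t: 0
vendor_id\t: AuthenticAMD
flags\t\t: fpu sse sse2 avx avx2 f16c fma
"""


def parse_cpu_flags(cpuinfo_text: str) -> Set[str]:
    """Return the feature flags found in /proc/cpuinfo-style text."""
    for line in cpuinfo_text.splitlines():
        key, _, value = line.partition(":")
        if key.strip() == "flags":
            return set(value.split())
    return set()


def missing_features(cpuinfo_text: str, required: Iterable[str]) -> Set[str]:
    """Return the required flags that the described CPU does not report."""
    return set(required) - parse_cpu_flags(cpuinfo_text)


if __name__ == "__main__":
    # The vendor is never consulted; only the reported flags matter.
    print(sorted(missing_features(SAMPLE_CPUINFO, ["avx2", "avx512f"])))
```

On a real machine the text would come from reading /proc/cpuinfo; a job
could then be skipped or selected based on the flags alone, whichever vendor
built the chip.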

> On 30. Nov 2018, at 19:32, kellen sunderland  
> wrote:
> 
> Damn, knew I should have double-checked!  Oh well, it's also carbon neutral.
> 
> On Fri, Nov 30, 2018 at 10:27 AM Pedro Larroy 
> wrote:
> 
>> Agree with Tianqi and Hao. Adding AMD brings no value and increases
>> complexity and CI cost. The instruction sets are the same. For
>> benchmarking it might make sense though.
>> 
>> Pedro
>> 
>>> On 30. Nov 2018, at 18:19, Tianqi Chen  wrote:
>>> 
>>> I still think it is overkill to add AMD CPU to the CI, given the
>> additional
>>> cost it could bring and little additional information we can get out from
>>> it.
>>> 
>>> A middle ground is to add AMD CPU to a nightly build or final sweep before
>>> release. If there is a case that we find that AMD CPU really makes a
>>> difference, then we add it to the CI
>>> 
>>> Tianqi
>>> 
 On Thu, Nov 29, 2018 at 6:29 PM Hao Jin  wrote:
 
 For CPUs, the supported instruction sets may also vary between the same
 manufacturer's different product lines of the same generation
>> (Skylake-SP
 versus Skylake).
 For the same instruction set, the two manufacturers should both have a
 working version of the hardware implementation. If any of the
 implementations does not work, then the chip would not even be
>> considered
 functioning properly.
 If some AMD CPUs only support up to AVX2 instruction sets, they would
>> just
 function in the same way as an Intel CPU that supports up to AVX2
 instruction sets. The performance may vary, but the capability and
>> behavior
 of the two chips would be the same when given the same machine code.
 For AMD GPUs it's a totally different story, as AMD GPUs do not share
>> the
 same instruction sets with the NVIDIA ones, thus testing on AMD GPUs(if
>> we
 do have support for them) would definitely add value.
 Hao
 
 On Thu, Nov 29, 2018 at 8:37 PM Anirudh Subramanian <
>> anirudh2...@gmail.com
> 
 wrote:
 
> Instruction set extensions support like AVX2, AVX512 etc. can vary
 between
> AMD and Intel and there can also be a time lag between when Intel
 supports
> it versus when AMD supports it.
> Also, in the future this setup may be useful in case MXNet supports AMD
> GPUs and AWS also happens to have support for it.
> 
> Anirudh
> 
> 
> On Thu, Nov 29, 2018 at 4:29 PM Marco de Abreu
>  wrote:
> 
>> I think it's worth a discussion to do a sanity check. While generally
> these
>> instructions are standardized, we also made the experience with ARM
 that
>> the theory and reality sometimes don't match. Thus, it's always good
>> to
>> check.
>> 
>> In the next months we are going to refactor our slave creation
 processes.
>> Chance Bair has been working on rewriting Windows slaves from scratch
 (we
>> used images that haven't really been updated for 2 years - we still
 don't
>> know what was done on them) and they're ready soon. In the following
>> months, we will also port our Ubuntu slaves to the new method (don't
> have a
>> timeline yet). Ideally, the integration of AMD instances will only be
>> a
>> matter of running the same pipeline on a different instance type. In
 that
>> case, it should not be a big deal.
>> 
>> If there are big differences, that's already a yellow flag for
>> compatibility, but that's unlikely. But in that case, we would have to
> make
>> a more thorough time analysis and whether it's worth the effort.
>> Maybe,
>> somebody else could also lend us a hand and help us with adding AMD
>> support.
>> 
>> -Marco
>> 
>> On Fri, Nov 30, 2018, 01:22 Hao Jin wrote:
>> 
>>> f16c is also an instruction set supported by both brands' recent CPUs
>> just
>>> like x86, AVX, SSE etc., and any difference in behaviors (quite
>> impossible
>>> to happen or it will be a major defect) would most likely be caused
 by
>> the
>>> underlying hardware implementation, so still, adding AMD instances is
> not
>>> adding much value here.
>>> Hao
>>> 
>>> On Thu, Nov 29, 2018 at 7:03 PM kellen sunderland <
>>> kellen.sunderl...@gmail.com> wrote:
>>> 
 Just looked at the mf16c work and wanted to mention Rahul clearly
> _was_
 thinking about AMD users in that PR.
 
 On Thu, Nov 29, 2018 at 3:46 PM kellen sunderland <
 kellen.sunderl...@gmail.com> wrote:
 
> From my perspective we're developing a few features like mf16c
 and
>>> MKLDNN
> integration specifically for Intel CPUs.  It wouldn't hurt to
 mak

Re: Adding AMD CPU to CI

2018-11-30 Thread kellen sunderland
Damn, knew I should have double-checked!  Oh well, it's also carbon neutral.

On Fri, Nov 30, 2018 at 10:27 AM Pedro Larroy 
wrote:

> Agree with Tianqi and Hao. Adding AMD brings no value and increases
> complexity and CI cost. The instruction sets are the same. For
> benchmarking it might make sense though.
>
> Pedro
>
> > On 30. Nov 2018, at 18:19, Tianqi Chen  wrote:
> >
> > I still think it is overkill to add AMD CPU to the CI, given the
> additional
> > cost it could bring and little additional information we can get out from
> > it.
> >
> > A middle ground is to add AMD CPU to a nightly build or final sweep before
> > release. If there is a case that we find that AMD CPU really makes a
> > difference, then we add it to the CI
> >
> > Tianqi
> >
> >> On Thu, Nov 29, 2018 at 6:29 PM Hao Jin  wrote:
> >>
> >> For CPUs, the supported instruction sets may also vary between the same
> >> manufacturer's different product lines of the same generation
> (Skylake-SP
> >> versus Skylake).
> >> For the same instruction set, the two manufacturers should both have a
> >> working version of the hardware implementation. If any of the
> >> implementations does not work, then the chip would not even be
> considered
> >> functioning properly.
> >> If some AMD CPUs only support up to AVX2 instruction sets, they would
> just
> >> function in the same way as an Intel CPU that supports up to AVX2
> >> instruction sets. The performance may vary, but the capability and
> behavior
> >> of the two chips would be the same when given the same machine code.
> >> For AMD GPUs it's a totally different story, as AMD GPUs do not share
> the
> >> same instruction sets with the NVIDIA ones, thus testing on AMD GPUs(if
> we
> >> do have support for them) would definitely add value.
> >> Hao
> >>
> >> On Thu, Nov 29, 2018 at 8:37 PM Anirudh Subramanian <
> anirudh2...@gmail.com
> >>>
> >> wrote:
> >>
> >>> Instruction set extensions support like AVX2, AVX512 etc. can vary
> >> between
> >>> AMD and Intel and there can also be a time lag between when Intel
> >> supports
> >>> it versus when AMD supports it.
> >>> Also, in the future this setup may be useful in case MXNet supports AMD
> >>> GPUs and AWS also happens to have support for it.
> >>>
> >>> Anirudh
> >>>
> >>>
> >>> On Thu, Nov 29, 2018 at 4:29 PM Marco de Abreu
> >>>  wrote:
> >>>
>  I think it's worth a discussion to do a sanity check. While generally
> >>> these
>  instructions are standardized, we also made the experience with ARM
> >> that
>  the theory and reality sometimes don't match. Thus, it's always good
> to
>  check.
> 
>  In the next months we are going to refactor our slave creation
> >> processes.
>  Chance Bair has been working on rewriting Windows slaves from scratch
> >> (we
>  used images that haven't really been updated for 2 years - we still
> >> don't
>  know what was done on them) and they're ready soon. In the following
>  months, we will also port our Ubuntu slaves to the new method (don't
> >>> have a
>  timeline yet). Ideally, the integration of AMD instances will only be
> a
>  matter of running the same pipeline on a different instance type. In
> >> that
>  case, it should not be a big deal.
> 
>  If there are big differences, that's already a yellow flag for
>  compatibility, but that's unlikely. But in that case, we would have to
> >>> make
>  a more thorough time analysis and whether it's worth the effort.
> Maybe,
>  somebody else could also lend us a hand and help us with adding AMD
>  support.
> 
>  -Marco
> 
>  On Fri, Nov 30, 2018, 01:22 Hao Jin wrote:
> 
> > f16c is also an instruction set supported by both brands' recent CPUs
>  just
> > like x86, AVX, SSE etc., and any difference in behaviors (quite
>  impossible
> > to happen or it will be a major defect) would most likely be caused
> >> by
>  the
> > underlying hardware implementation, so still, adding AMD instances is
> >>> not
> > adding much value here.
> > Hao
> >
> > On Thu, Nov 29, 2018 at 7:03 PM kellen sunderland <
> > kellen.sunderl...@gmail.com> wrote:
> >
> >> Just looked at the mf16c work and wanted to mention Rahul clearly
> >>> _was_
> >> thinking about AMD users in that PR.
> >>
> >> On Thu, Nov 29, 2018 at 3:46 PM kellen sunderland <
> >> kellen.sunderl...@gmail.com> wrote:
> >>
> >>> From my perspective we're developing a few features like mf16c
> >> and
> > MKLDNN
> >>> integration specifically for Intel CPUs.  It wouldn't hurt to
> >> make
>  sure
> >>> those changes also run properly on AMD cpus.
> >>>
> > >>> On Thu, Nov 29, 2018, 3:38 PM Hao Jin wrote:
> >>>
>  I'm a bit confused about why we need extra functionality tests
> >>> just
> > for
>  AMD
>  CPUs, aren't AMD CPUs supporting roughly the same instruction
> >>

Re: Adding AMD CPU to CI

2018-11-30 Thread Pedro Larroy
Agree with Tianqi and Hao. Adding AMD brings no value and increases complexity
and CI cost. The instruction sets are the same. For benchmarking it might make
sense though.

Pedro

> On 30. Nov 2018, at 18:19, Tianqi Chen  wrote:
> 
> I still think it is overkill to add AMD CPU to the CI, given the additional
> cost it could bring and little additional information we can get out from
> it.
> 
> A middle ground is to add AMD CPU to a nightly build or final sweep before
> release. If there is a case that we find that AMD CPU really makes a
> difference, then we add it to the CI
> 
> Tianqi
> 
>> On Thu, Nov 29, 2018 at 6:29 PM Hao Jin  wrote:
>> 
>> For CPUs, the supported instruction sets may also vary between the same
>> manufacturer's different product lines of the same generation (Skylake-SP
>> versus Skylake).
>> For the same instruction set, the two manufacturers should both have a
>> working version of the hardware implementation. If any of the
>> implementations does not work, then the chip would not even be considered
>> functioning properly.
>> If some AMD CPUs only support up to AVX2 instruction sets, they would just
>> function in the same way as an Intel CPU that supports up to AVX2
>> instruction sets. The performance may vary, but the capability and behavior
>> of the two chips would be the same when given the same machine code.
>> For AMD GPUs it's a totally different story, as AMD GPUs do not share the
>> same instruction sets with the NVIDIA ones, thus testing on AMD GPUs (if we
>> do have support for them) would definitely add value.
>> Hao
>> 
>> On Thu, Nov 29, 2018 at 8:37 PM Anirudh Subramanian wrote:
>> 
>>> Instruction set extensions support like AVX2, AVX512 etc. can vary
>> between
>>> AMD and Intel and there can also be a time lag between when Intel
>> supports
>>> it versus when AMD supports it.
>>> Also, in the future this setup may be useful in case MXNet supports AMD
>>> GPUs and AWS also happens to have support for it.
>>> 
>>> Anirudh
>>> 
>>> 
>>> On Thu, Nov 29, 2018 at 4:29 PM Marco de Abreu
>>>  wrote:
>>> 
>>>> I think it's worth a discussion to do a sanity check. While generally these
>>>> instructions are standardized, our experience with ARM showed that
>>>> theory and reality sometimes don't match. Thus, it's always good to
>>>> check.
>>>>
>>>> In the next months we are going to refactor our slave creation processes.
>>>> Chance Bair has been working on rewriting Windows slaves from scratch (we
>>>> used images that haven't really been updated for 2 years - we still don't
>>>> know what was done on them) and they will be ready soon. In the following
>>>> months, we will also port our Ubuntu slaves to the new method (we don't
>>>> have a timeline yet). Ideally, the integration of AMD instances will only
>>>> be a matter of running the same pipeline on a different instance type. In
>>>> that case, it should not be a big deal.
>>>>
>>>> If there are big differences, that's already a yellow flag for
>>>> compatibility, but that's unlikely. In that case, we would have to make
>>>> a more thorough analysis of the time needed and whether it's worth the
>>>> effort. Maybe somebody else could also lend us a hand and help us with
>>>> adding AMD support.
>>>>
>>>> -Marco
>>>>
>>>> On Fri, Nov 30, 2018, 01:22 Hao Jin wrote:
>>>>
>>>>> f16c is also an instruction set supported by both brands' recent CPUs just
>>>>> like x86, AVX, SSE etc., and any difference in behaviors (quite impossible
>>>>> to happen or it will be a major defect) would most likely be caused by the
>>>>> underlying hardware implementation, so still, adding AMD instances is not
>>>>> adding much value here.
>>>>> Hao
>>>>>
>>>>> On Thu, Nov 29, 2018 at 7:03 PM kellen sunderland <
>>>>> kellen.sunderl...@gmail.com> wrote:
>>>>>
>>>>>> Just looked at the mf16c work and wanted to mention Rahul clearly _was_
>>>>>> thinking about AMD users in that PR.
>>>>>>
>>>>>> On Thu, Nov 29, 2018 at 3:46 PM kellen sunderland <
>>>>>> kellen.sunderl...@gmail.com> wrote:
>>>>>>
>>>>>>> From my perspective we're developing a few features like mf16c and MKLDNN
>>>>>>> integration specifically for Intel CPUs.  It wouldn't hurt to make sure
>>>>>>> those changes also run properly on AMD CPUs.
>>>>>>>
>>>>>>> On Thu, Nov 29, 2018, 3:38 PM Hao Jin wrote:
>>>>>>>
>>>>>>>> I'm a bit confused about why we need extra functionality tests just for AMD
>>>>>>>> CPUs. Don't AMD CPUs support roughly the same instruction sets as the
>>>>>>>> Intel ones? In the very unlikely case that something working on Intel
>>>>>>>> CPUs does not function on AMD CPUs (or vice versa), it would most
>>>>>>>> likely be related to the underlying hardware implementation of the same
>>>>>>>> ISA, for which we definitely do not have a good solution. So I don't think
>>>>>>>> performin

Re: Adding AMD CPU to CI

2018-11-30 Thread Marco de Abreu
Kellen, we run CI in us-west-2, Oregon :P Sorry, environment :(

-Marco

On Fri, Nov 30, 2018, 18:58 kellen sunderland <
kellen.sunderl...@gmail.com> wrote:

> +1 to nightly.
>
> Given the awesome results shown by Alex for AMD CPUs, I think MKLDNN
> would probably be something I'd actually use, even on my AMD machines.
> Kudos to Intel for releasing this lib, which works great on their hardware
> but still pretty well w/ AMD.  The upshot of MKLDNN supporting AMD, to me, is
> that it makes me much more likely to support it as the default PyPI package
> (discussed in another thread).  This is part of the reason I'd like to have
> a sanity test in CI somewhere for AMD hardware.
>
> Unrelated note: regarding global warming I actually partially chose
> eu-west-1 to host CI because it's carbon neutral.  The cost of the CI is
> significant, and although it's donated by AWS I'm glad the community is
> cognizant of that.
>
> On Fri, Nov 30, 2018 at 9:54 AM Kumar, Vikas 
> wrote:
>
> > I concur. +1 for nightly for pre-release suite.
> >
> > On 11/30/18, 9:49 AM, "Tianqi Chen"  wrote:
> >
> > +1 for nightly for pre-release suite, but not the CI that is triggered in
> > every test.  The best engineering practice is not to add things, but to
> > remove things until there is nothing left that can be removed.
> >
> > In terms of MKLDNN, since it is an Intel product, I doubt optimizing for
> > AMD CPUs is its goal, so adding CI to guard against backward compatibility
> > is even a bit of overkill, since the AMD CPU user would likely disable this
> > feature and use the original CPU version of the project.
> >
> > At least we can contribute to reducing the carbon footprint and slowing
> > down global warming :)
> >
> > Tianqi
> >
> > On Fri, Nov 30, 2018 at 9:38 AM kellen sunderland <
> > kellen.sunderl...@gmail.com> wrote:
> >
> > > Regarding cost, yes we could run this nightly or simply make it run an
> > > existing test suite that would make sense rather than having it
> > > duplicate a suite.
> > >
> > > On Fri, Nov 30, 2018 at 9:26 AM Kumar, Vikas
> > 
> > > wrote:
> > >
> > > > > I don't think there is any downside to this proposal. I think basic
> > > > > sanity CI testing on AMD processors will give an extra boost to our
> > > > > tests. This adds to developer productivity, and developers have one
> > > > > less thing to worry about. Developers have spent time in the past
> > > > > manually testing on AMD processors, MKLDNN being the most recent
> > > > > instance. It's good to have those tests in the CI pipeline.
> > > > > All I see is benefit. If the $ cost is not too high for basic sanity
> > > > > testing, we should do this, unless some strong downside is called
> > > > > out.
> > > > +1
> > > >
> > > >
> > > > On 11/29/18, 5:37 PM, "Anirudh Subramanian" <
> anirudh2...@gmail.com
> > >
> > > > wrote:
> > > >
> > > > Instruction set extensions support like AVX2, AVX512 etc. can
> > vary
> > > > between
> > > > AMD and Intel and there can also be a time lag between when
> > Intel
> > > > supports
> > > > it versus when AMD supports it.
> > > > Also, in the future this setup may be useful in case MXNet
> > supports
> > > AMD
> > > > GPUs and AWS also happens to have support for it.
> > > >
> > > > Anirudh
> > > >
> > > >
> > > > On Thu, Nov 29, 2018 at 4:29 PM Marco de Abreu
> > > >  wrote:
> > > >
> > > > > I think it's worth a discussion to do a sanity check. While
> > > > generally these
> > > > > instructions are standardized, we also made the experience
> > with ARM
> > > > that
> > > > > the theory and reality sometimes don't match. Thus, it's
> > always
> > > good
> > > > to
> > > > > check.
> > > > >
> > > > > In the next months we are going to refactor our slave
> > creation
> > > > processes.
> > > > > Chance Bair has been working on rewriting Windows slaves
> from
> > > > scratch (we
> > > > > used images that haven't really been updated for 2 years -
> > we still
> > > > don't
> > > > > know what was done on them) and they're ready soon. In the
> > > following
> > > > > months, we will also port our Ubuntu slaves to the new
> method
> > > (don't
> > > > have a
> > > > > timeline yet). Ideally, the integration of AMD instances
> > will only
> > > > be a
> > > > > matter of running the same pipeline on a different instance
> > type.
> > > In
> > > > that
> > > > > > case, it should not be a big deal.
> > > > >
> > > > > If there are big differences, that's already a yellow flag
> > for
> > > > > compatibility, but that's unlikely. But in that ca

Re: Adding AMD CPU to CI

2018-11-30 Thread kellen sunderland
+1 to nightly.

Given the awesome results shown by Alex for AMD CPUs, I think MKLDNN
would probably be something I'd actually use, even on my AMD machines.
Kudos to Intel for releasing this lib, which works great on their hardware
but still pretty well w/ AMD.  The upshot of MKLDNN supporting AMD, to me, is
that it makes me much more likely to support it as the default PyPI package
(discussed in another thread).  This is part of the reason I'd like to have
a sanity test in CI somewhere for AMD hardware.

Unrelated note: regarding global warming I actually partially chose
eu-west-1 to host CI because it's carbon neutral.  The cost of the CI is
significant, and although it's donated by AWS I'm glad the community is
cognizant of that.

On Fri, Nov 30, 2018 at 9:54 AM Kumar, Vikas 
wrote:

> I concur. +1 for nightly for pre-release suite.
>
> On 11/30/18, 9:49 AM, "Tianqi Chen"  wrote:
>
> +1 for nightly for pre-release suite, but not for the CI that is triggered
> on every test.  The best engineering practice is not to add things, but to
> remove things until there is nothing left to remove.
>
> In terms of MKLDNN, since it is an Intel product, I doubt optimizing for AMD
> CPUs is its goal; adding CI to guard backward compatibility is even a bit
> overkill, since the AMD CPU user would likely disable this feature and use
> the original CPU version of the project.
>
> At least we can contribute to reducing the carbon footprint and slowing down
> global warming :)
>
> Tianqi
>
> On Fri, Nov 30, 2018 at 9:38 AM kellen sunderland <
> kellen.sunderl...@gmail.com> wrote:
>
> > Regarding cost, yes we could run this nightly or simply make it run
> an
> > existing test suite that would make sense rather than having it
> duplicate a
> > suite.
> >
> > On Fri, Nov 30, 2018 at 9:26 AM Kumar, Vikas
> 
> > wrote:
> >
> > > I don't think there is any downside to this proposal. I think a
> basic
> > > sanity CI testing on AMD processors will give extra boost to our
> tests.
> > > This adds to developer productivity and they have one less thing
> to worry
> > > about. Developers have spent time in past where they had to
> manually test
> > > on AMD  processors, MKLDNN being the recent instance. It's good to
> have
> > > those test in CI pipeline.
> > > All I see is benefit. If the $ cost is not too high for basic
> sanity
> > > testing, we should do this, until and unless some strong downside
> is
> > called
> > > out.
> > >
> > > +1
> > >
> > >
> > > On 11/29/18, 5:37 PM, "Anirudh Subramanian"  >
> > > wrote:
> > >
> > > Instruction set extensions support like AVX2, AVX512 etc. can
> vary
> > > between
> > > AMD and Intel and there can also be a time lag between when
> Intel
> > > supports
> > > it versus when AMD supports it.
> > > Also, in the future this setup may be useful in case MXNet
> supports
> > AMD
> > > GPUs and AWS also happens to have support for it.
> > >
> > > Anirudh
> > >
> > >
> > > On Thu, Nov 29, 2018 at 4:29 PM Marco de Abreu
> > >  wrote:
> > >
> > > > I think it's worth a discussion to do a sanity check. While
> > > generally these
> > > > instructions are standardized, we also made the experience
> with ARM
> > > that
> > > > the theory and reality sometimes don't match. Thus, it's
> always
> > good
> > > to
> > > > check.
> > > >
> > > > In the next months we are going to refactor our slave
> creation
> > > processes.
> > > > Chance Bair has been working on rewriting Windows slaves from
> > > scratch (we
> > > > used images that haven't really been updated for 2 years -
> we still
> > > don't
> > > > know what was done on them) and they're ready soon. In the
> > following
> > > > months, we will also port our Ubuntu slaves to the new method
> > (don't
> > > have a
> > > > timeline yet). Ideally, the integration of AMD instances
> will only
> > > be a
> > > > matter of running the same pipeline on a different instance
> type.
> > In
> > > that
> > > > > case, it should not be a big deal.
> > > >
> > > > If there are big differences, that's already a yellow flag
> for
> > > > compatibility, but that's unlikely. But in that case, we
> would have
> > > to make
> > > > a more thorough time analysis and whether it's worth the
> effort.
> > > Maybe,
> > > > somebody else could also lend us a hand and help us with
> adding AMD
> > > > support.
> > > >
> > > > -Marco
> > > >
> > > > Am Fr., 30. Nov. 2018, 01:22 hat Hao Jin <
> hjjn.a...@gmail.com>
> > > > geschrieben:
> > > >
> > > > > f16c is also an in

Re: Adding AMD CPU to CI

2018-11-30 Thread Kumar, Vikas
I concur. +1 for nightly for pre-release suite.

On 11/30/18, 9:49 AM, "Tianqi Chen"  wrote:

+1 for nightly for pre-release suite, but not for the CI that is triggered
on every test.  The best engineering practice is not to add things, but to
remove things until there is nothing left to remove.

In terms of MKLDNN, since it is an Intel product, I doubt optimizing for AMD
CPUs is its goal; adding CI to guard backward compatibility is even a bit
overkill, since the AMD CPU user would likely disable this feature and use
the original CPU version of the project.

At least we can contribute to reducing the carbon footprint and slowing down
global warming :)

Tianqi

On Fri, Nov 30, 2018 at 9:38 AM kellen sunderland <
kellen.sunderl...@gmail.com> wrote:

> Regarding cost, yes we could run this nightly or simply make it run an
> existing test suite that would make sense rather than having it duplicate 
a
> suite.
>
> On Fri, Nov 30, 2018 at 9:26 AM Kumar, Vikas 
> wrote:
>
> > I don't think there is any downside to this proposal. I think a basic
> > sanity CI testing on AMD processors will give extra boost to our tests.
> > This adds to developer productivity and they have one less thing to 
worry
> > about. Developers have spent time in past where they had to manually 
test
> > on AMD  processors, MKLDNN being the recent instance. It's good to have
> > those test in CI pipeline.
> > All I see is benefit. If the $ cost is not too high for basic sanity
> > testing, we should do this, until and unless some strong downside is
> called
> > out.
> >
> > +1
> >
> >
> > On 11/29/18, 5:37 PM, "Anirudh Subramanian" 
> > wrote:
> >
> > Instruction set extensions support like AVX2, AVX512 etc. can vary
> > between
> > AMD and Intel and there can also be a time lag between when Intel
> > supports
> > it versus when AMD supports it.
> > Also, in the future this setup may be useful in case MXNet supports
> AMD
> > GPUs and AWS also happens to have support for it.
> >
> > Anirudh
> >
> >
> > On Thu, Nov 29, 2018 at 4:29 PM Marco de Abreu
> >  wrote:
> >
> > > I think it's worth a discussion to do a sanity check. While
> > generally these
> > > instructions are standardized, we also made the experience with 
ARM
> > that
> > > the theory and reality sometimes don't match. Thus, it's always
> good
> > to
> > > check.
> > >
> > > In the next months we are going to refactor our slave creation
> > processes.
> > > Chance Bair has been working on rewriting Windows slaves from
> > scratch (we
> > > used images that haven't really been updated for 2 years - we 
still
> > don't
> > > know what was done on them) and they're ready soon. In the
> following
> > > months, we will also port our Ubuntu slaves to the new method
> (don't
> > have a
> > > timeline yet). Ideally, the integration of AMD instances will only
> > be a
> > > matter of running the same pipeline on a different instance type.
> In
> > that
> > > > case, it should not be a big deal.
> > >
> > > If there are big differences, that's already a yellow flag for
> > > compatibility, but that's unlikely. But in that case, we would 
have
> > to make
> > > a more thorough time analysis and whether it's worth the effort.
> > Maybe,
> > > somebody else could also lend us a hand and help us with adding 
AMD
> > > support.
> > >
> > > -Marco
> > >
> > > Am Fr., 30. Nov. 2018, 01:22 hat Hao Jin 
> > > geschrieben:
> > >
> > > > f16c is also an instruction set supported by both brands' recent
> > CPUs
> > > just
> > > > like x86, AVX, SSE etc., and any difference in behaviors (quite
> > > impossible
> > > > to happen or it will be a major defect) would most likely be
> > caused by
> > > the
> > > > underlying hardware implementation, so still, adding AMD
> instances
> > is not
> > > > adding much value here.
> > > > Hao
> > > >
> > > > On Thu, Nov 29, 2018 at 7:03 PM kellen sunderland <
> > > > kellen.sunderl...@gmail.com> wrote:
> > > >
> > > > > Just looked at the mf16c work and wanted to mention Rahul
> > clearly _was_
> > > > > thinking about AMD users in that PR.
> > > > >
> > > > > On Thu, Nov 29, 2018 at 3:46 PM kellen sunderland <
> > > > > kellen.sunderl...@gmail.com> wrote:
> > > > >
> > > > > > From my perspective we're developing a few features like
> mf16c
> > and
> > > > MK

Re: Adding AMD CPU to CI

2018-11-30 Thread Tianqi Chen
+1 for nightly for pre-release suite, but not for the CI that is triggered
on every test.  The best engineering practice is not to add things, but to
remove things until there is nothing left to remove.

In terms of MKLDNN, since it is an Intel product, I doubt optimizing for AMD
CPUs is its goal; adding CI to guard backward compatibility is even a bit
overkill, since the AMD CPU user would likely disable this feature and use
the original CPU version of the project.

At least we can contribute to reducing the carbon footprint and slowing down
global warming :)

Tianqi

On Fri, Nov 30, 2018 at 9:38 AM kellen sunderland <
kellen.sunderl...@gmail.com> wrote:

> Regarding cost, yes we could run this nightly or simply make it run an
> existing test suite that would make sense rather than having it duplicate a
> suite.
>
> On Fri, Nov 30, 2018 at 9:26 AM Kumar, Vikas 
> wrote:
>
> > I don't think there is any downside to this proposal. I think a basic
> > sanity CI testing on AMD processors will give extra boost to our tests.
> > This adds to developer productivity and they have one less thing to worry
> > about. Developers have spent time in past where they had to manually test
> > on AMD  processors, MKLDNN being the recent instance. It's good to have
> > those test in CI pipeline.
> > All I see is benefit. If the $ cost is not too high for basic sanity
> > testing, we should do this, until and unless some strong downside is
> called
> > out.
> >
> > +1
> >
> >
> > On 11/29/18, 5:37 PM, "Anirudh Subramanian" 
> > wrote:
> >
> > Instruction set extensions support like AVX2, AVX512 etc. can vary
> > between
> > AMD and Intel and there can also be a time lag between when Intel
> > supports
> > it versus when AMD supports it.
> > Also, in the future this setup may be useful in case MXNet supports
> AMD
> > GPUs and AWS also happens to have support for it.
> >
> > Anirudh
> >
> >
> > On Thu, Nov 29, 2018 at 4:29 PM Marco de Abreu
> >  wrote:
> >
> > > I think it's worth a discussion to do a sanity check. While
> > generally these
> > > instructions are standardized, we also made the experience with ARM
> > that
> > > the theory and reality sometimes don't match. Thus, it's always
> good
> > to
> > > check.
> > >
> > > In the next months we are going to refactor our slave creation
> > processes.
> > > Chance Bair has been working on rewriting Windows slaves from
> > scratch (we
> > > used images that haven't really been updated for 2 years - we still
> > don't
> > > know what was done on them) and they're ready soon. In the
> following
> > > months, we will also port our Ubuntu slaves to the new method
> (don't
> > have a
> > > timeline yet). Ideally, the integration of AMD instances will only
> > be a
> > > matter of running the same pipeline on a different instance type.
> In
> > that
> > > > case, it should not be a big deal.
> > >
> > > If there are big differences, that's already a yellow flag for
> > > compatibility, but that's unlikely. But in that case, we would have
> > to make
> > > a more thorough time analysis and whether it's worth the effort.
> > Maybe,
> > > somebody else could also lend us a hand and help us with adding AMD
> > > support.
> > >
> > > -Marco
> > >
> > > Am Fr., 30. Nov. 2018, 01:22 hat Hao Jin 
> > > geschrieben:
> > >
> > > > f16c is also an instruction set supported by both brands' recent
> > CPUs
> > > just
> > > > like x86, AVX, SSE etc., and any difference in behaviors (quite
> > > impossible
> > > > to happen or it will be a major defect) would most likely be
> > caused by
> > > the
> > > > underlying hardware implementation, so still, adding AMD
> instances
> > is not
> > > > adding much value here.
> > > > Hao
> > > >
> > > > On Thu, Nov 29, 2018 at 7:03 PM kellen sunderland <
> > > > kellen.sunderl...@gmail.com> wrote:
> > > >
> > > > > Just looked at the mf16c work and wanted to mention Rahul
> > clearly _was_
> > > > > thinking about AMD users in that PR.
> > > > >
> > > > > On Thu, Nov 29, 2018 at 3:46 PM kellen sunderland <
> > > > > kellen.sunderl...@gmail.com> wrote:
> > > > >
> > > > > > From my perspective we're developing a few features like
> mf16c
> > and
> > > > MKLDNN
> > > > > > integration specifically for Intel CPUs.  It wouldn't hurt to
> > make
> > > sure
> > > > > > those changes also run properly on AMD cpus.
> > > > > >
> > > > > > On Thu, Nov 29, 2018, 3:38 PM Hao Jin  > wrote:
> > > > > >
> > > > > >> I'm a bit confused about why we need extra functionality
> > tests just
> > > > for
> > > > > >> AMD
> > > > > >> CPUs, aren't AMD CPUs supporting roughly the same
> instruction
> > sets
> > > as
> > > > > the
> > > > > >> Intel ones? In the very impossible case that something
> > working on

Re: Adding AMD CPU to CI

2018-11-30 Thread kellen sunderland
Regarding cost, yes we could run this nightly or simply make it run an
existing test suite that would make sense rather than having it duplicate a
suite.

On Fri, Nov 30, 2018 at 9:26 AM Kumar, Vikas 
wrote:

> I don't think there is any downside to this proposal. I think basic
> sanity CI testing on AMD processors will give an extra boost to our tests.
> This adds to developer productivity and gives developers one less thing to
> worry about. Developers have spent time in the past manually testing on AMD
> processors, MKLDNN being a recent instance. It's good to have those
> tests in the CI pipeline.
> All I see is benefit. If the $ cost is not too high for basic sanity
> testing, we should do this, unless some strong downside is called out.
>
> +1
>
>
> On 11/29/18, 5:37 PM, "Anirudh Subramanian" 
> wrote:
>
> Instruction set extensions support like AVX2, AVX512 etc. can vary
> between
> AMD and Intel and there can also be a time lag between when Intel
> supports
> it versus when AMD supports it.
> Also, in the future this setup may be useful in case MXNet supports AMD
> GPUs and AWS also happens to have support for it.
>
> Anirudh
>
>
> On Thu, Nov 29, 2018 at 4:29 PM Marco de Abreu
>  wrote:
>
> > I think it's worth a discussion to do a sanity check. While
> generally these
> > instructions are standardized, we also made the experience with ARM
> that
> > the theory and reality sometimes don't match. Thus, it's always good
> to
> > check.
> >
> > In the next months we are going to refactor our slave creation
> processes.
> > Chance Bair has been working on rewriting Windows slaves from
> scratch (we
> > used images that haven't really been updated for 2 years - we still
> don't
> > know what was done on them) and they're ready soon. In the following
> > months, we will also port our Ubuntu slaves to the new method (don't
> have a
> > timeline yet). Ideally, the integration of AMD instances will only
> be a
> > matter of running the same pipeline on a different instance type. In
> that
> > > case, it should not be a big deal.
> >
> > If there are big differences, that's already a yellow flag for
> > compatibility, but that's unlikely. But in that case, we would have
> to make
> > a more thorough time analysis and whether it's worth the effort.
> Maybe,
> > somebody else could also lend us a hand and help us with adding AMD
> > support.
> >
> > -Marco
> >
> > Am Fr., 30. Nov. 2018, 01:22 hat Hao Jin 
> > geschrieben:
> >
> > > f16c is also an instruction set supported by both brands' recent
> CPUs
> > just
> > > like x86, AVX, SSE etc., and any difference in behaviors (quite
> > impossible
> > > to happen or it will be a major defect) would most likely be
> caused by
> > the
> > > underlying hardware implementation, so still, adding AMD instances
> is not
> > > adding much value here.
> > > Hao
> > >
> > > On Thu, Nov 29, 2018 at 7:03 PM kellen sunderland <
> > > kellen.sunderl...@gmail.com> wrote:
> > >
> > > > Just looked at the mf16c work and wanted to mention Rahul
> clearly _was_
> > > > thinking about AMD users in that PR.
> > > >
> > > > On Thu, Nov 29, 2018 at 3:46 PM kellen sunderland <
> > > > kellen.sunderl...@gmail.com> wrote:
> > > >
> > > > > From my perspective we're developing a few features like mf16c
> and
> > > MKLDNN
> > > > > integration specifically for Intel CPUs.  It wouldn't hurt to
> make
> > sure
> > > > > those changes also run properly on AMD cpus.
> > > > >
> > > > > On Thu, Nov 29, 2018, 3:38 PM Hao Jin  wrote:
> > > > >
> > > > >> I'm a bit confused about why we need extra functionality
> tests just
> > > for
> > > > >> AMD
> > > > >> CPUs, aren't AMD CPUs supporting roughly the same instruction
> sets
> > as
> > > > the
> > > > >> Intel ones? In the very impossible case that something
> working on
> > > Intel
> > > > >> CPUs being not functioning on AMD CPUs (or vice versa), it
> would
> > > mostly
> > > > >> likely be related to the underlying hardware implementation
> of the
> > > same
> > > > >> ISA, to which we definitely do not have a good solution. So I
> don't
> > > > think
> > > > >> performing extra tests on functional aspect of the system on
> AMD
> > CPUs
> > > is
> > > > >> adding any values.
> > > > >> Hao
> > > > >>
> > > > >> On Thu, Nov 29, 2018 at 5:50 PM Seth, Manu
> >  > > >
> > > > >> wrote:
> > > > >>
> > > > >> > +1
> > > > >> >
> > > > >> > On 11/29/18, 2:39 PM, "Alex Zai"  wrote:
> > > > >> >
> > > > >> > What are people's thoughts on having AMD machines
> tested on
> > the
> > > > CI?
> > > > >> AMD
> > > > >> > machines are now available on AWS.
> > > > 

Re: Adding AMD CPU to CI

2018-11-30 Thread Kumar, Vikas
I don't think there is any downside to this proposal. I think basic sanity CI
testing on AMD processors will give an extra boost to our tests. This adds to
developer productivity and gives developers one less thing to worry about.
Developers have spent time in the past manually testing on AMD processors,
MKLDNN being a recent instance. It's good to have those tests in the CI pipeline.
All I see is benefit. If the $ cost is not too high for basic sanity testing,
we should do this, unless some strong downside is called out.

+1
 

On 11/29/18, 5:37 PM, "Anirudh Subramanian"  wrote:

Support for instruction set extensions like AVX2, AVX512, etc. can vary
between AMD and Intel, and there can also be a time lag between when Intel
supports an extension and when AMD supports it.
Also, in the future this setup may be useful in case MXNet supports AMD
GPUs and AWS also happens to have support for them.

Anirudh
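
One way to make such differences visible on a CI host is to compare the CPU
feature flags the kernel reports against the ones a build expects. The
following is a minimal sketch, assuming an x86 Linux host that exposes
/proc/cpuinfo with a "flags" line; the helper names and the REQUIRED set are
made up for illustration and are not part of MXNet or its CI.

```python
from pathlib import Path

# Hypothetical set of features a build might rely on (illustrative only).
REQUIRED = {"avx2", "f16c"}


def parse_cpu_flags(cpuinfo_text):
    """Return the feature flags listed on the first 'flags' line."""
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "flags":
            return set(value.split())
    return set()


def missing_features(cpuinfo_text, required=REQUIRED):
    """Return the required flags that the CPU does not report, sorted."""
    return sorted(set(required) - parse_cpu_flags(cpuinfo_text))


if __name__ == "__main__":
    path = Path("/proc/cpuinfo")  # Linux only
    if path.exists():
        print("missing:", missing_features(path.read_text()))
    else:
        print("no /proc/cpuinfo on this platform")
```

Running the same script on an Intel and an AMD instance and comparing the
output would show whether an extension the build depends on is absent on one
of them.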


On Thu, Nov 29, 2018 at 4:29 PM Marco de Abreu
 wrote:

> I think it's worth a discussion to do a sanity check. While generally 
these
> instructions are standardized, we also made the experience with ARM that
> the theory and reality sometimes don't match. Thus, it's always good to
> check.
>
> In the next months we are going to refactor our slave creation processes.
> Chance Bair has been working on rewriting Windows slaves from scratch (we
> used images that haven't really been updated for 2 years - we still don't
> know what was done on them) and they're ready soon. In the following
> months, we will also port our Ubuntu slaves to the new method (don't have 
a
> timeline yet). Ideally, the integration of AMD instances will only be a
> matter of running the same pipeline on a different instance type. In that
> case, it should not be a big deal.
>
> If there are big differences, that's already a yellow flag for
> compatibility, but that's unlikely. But in that case, we would have to 
make
> a more thorough time analysis and whether it's worth the effort. Maybe,
> somebody else could also lend us a hand and help us with adding AMD
> support.
>
> -Marco
>
> Am Fr., 30. Nov. 2018, 01:22 hat Hao Jin 
> geschrieben:
>
> > f16c is also an instruction set supported by both brands' recent CPUs
> just
> > like x86, AVX, SSE etc., and any difference in behaviors (quite
> impossible
> > to happen or it will be a major defect) would most likely be caused by
> the
> > underlying hardware implementation, so still, adding AMD instances is 
not
> > adding much value here.
> > Hao
> >
> > On Thu, Nov 29, 2018 at 7:03 PM kellen sunderland <
> > kellen.sunderl...@gmail.com> wrote:
> >
> > > Just looked at the mf16c work and wanted to mention Rahul clearly 
_was_
> > > thinking about AMD users in that PR.
> > >
> > > On Thu, Nov 29, 2018 at 3:46 PM kellen sunderland <
> > > kellen.sunderl...@gmail.com> wrote:
> > >
> > > > From my perspective we're developing a few features like mf16c and
> > MKLDNN
> > > > integration specifically for Intel CPUs.  It wouldn't hurt to make
> sure
> > > > those changes also run properly on AMD cpus.
> > > >
> > > > On Thu, Nov 29, 2018, 3:38 PM Hao Jin  > > >
> > > >> I'm a bit confused about why we need extra functionality tests just
> > for
> > > >> AMD
> > > >> CPUs, aren't AMD CPUs supporting roughly the same instruction sets
> as
> > > the
> > > >> Intel ones? In the very impossible case that something working on
> > Intel
> > > >> CPUs being not functioning on AMD CPUs (or vice versa), it would
> > mostly
> > > >> likely be related to the underlying hardware implementation of the
> > same
> > > >> ISA, to which we definitely do not have a good solution. So I don't
> > > think
> > > >> performing extra tests on functional aspect of the system on AMD
> CPUs
> > is
> > > >> adding any values.
> > > >> Hao
> > > >>
> > > >> On Thu, Nov 29, 2018 at 5:50 PM Seth, Manu
>  > >
> > > >> wrote:
> > > >>
> > > >> > +1
> > > >> >
> > > >> > On 11/29/18, 2:39 PM, "Alex Zai"  wrote:
> > > >> >
> > > >> > What are people's thoughts on having AMD machines tested on
> the
> > > CI?
> > > >> AMD
> > > >> > machines are now available on AWS.
> > > >> >
> > > >> > Best,
> > > >> > Alex
> > > >> >
> > > >> >
> > > >> >
> > > >>
> > > >
> > >
> >
>




Re: Adding AMD CPU to CI

2018-11-30 Thread Tianqi Chen
I still think it is overkill to add AMD CPU to the CI, given the additional
cost it could bring and how little additional information we would get out
of it.

A middle ground is to add AMD CPU to a nightly build or a final sweep before
release. If we find a case where AMD CPU really makes a difference, then we
add it to the CI.

Tianqi
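
The split described above could be sketched roughly as follows. This is only
an illustration: the trigger names and platform labels are hypothetical and
are not taken from the MXNet Jenkins setup.

```python
# Hypothetical platform labels and trigger names, for illustration only.
PER_PR_PLATFORMS = ["intel-cpu"]
EXTRA_NIGHTLY_PLATFORMS = ["amd-cpu"]


def platforms_for(trigger):
    """Return the platforms to test for a given trigger type."""
    if trigger == "pull_request":
        # Keep the per-PR run small to limit cost.
        return list(PER_PR_PLATFORMS)
    if trigger in ("nightly", "pre_release"):
        # Nightly and pre-release sweeps add the extra hardware.
        return PER_PR_PLATFORMS + EXTRA_NIGHTLY_PLATFORMS
    raise ValueError(f"unknown trigger: {trigger!r}")


for trigger in ("pull_request", "nightly", "pre_release"):
    print(trigger, "->", platforms_for(trigger))
```

Promoting AMD to the per-PR run later would then be a one-line change to the
per-PR list.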

On Thu, Nov 29, 2018 at 6:29 PM Hao Jin  wrote:

> For CPUs, the supported instruction sets may also vary between the same
> manufacturer's different product lines of the same generation (Skylake-SP
> versus Skylake).
> For the same instruction set, the two manufacturers should both have a
> working version of the hardware implementation. If any of the
> implementations does not work, then the chip would not even be considered
> functioning properly.
> If some AMD CPUs only support up to AVX2 instruction sets, they would just
> function in the same way as an Intel CPU that supports up to AVX2
> instruction sets. The performance may vary, but the capability and behavior
> of the two chips would be the same when given the same machine code.
> For AMD GPUs it's a totally different story, as AMD GPUs do not share the
> same instruction sets with the NVIDIA ones, thus testing on AMD GPUs(if we
> do have support for them) would definitely add values.
> Hao
>
> On Thu, Nov 29, 2018 at 8:37 PM Anirudh Subramanian
> wrote:
>
> > Instruction set extensions support like AVX2, AVX512 etc. can vary
> between
> > AMD and Intel and there can also be a time lag between when Intel
> supports
> > it versus when AMD supports it.
> > Also, in the future this setup may be useful in case MXNet supports AMD
> > GPUs and AWS also happens to have support for it.
> >
> > Anirudh
> >
> >
> > On Thu, Nov 29, 2018 at 4:29 PM Marco de Abreu
> >  wrote:
> >
> > > I think it's worth a discussion to do a sanity check. While generally
> > these
> > > instructions are standardized, we also made the experience with ARM
> that
> > > the theory and reality sometimes don't match. Thus, it's always good to
> > > check.
> > >
> > > In the next months we are going to refactor our slave creation
> processes.
> > > Chance Bair has been working on rewriting Windows slaves from scratch
> (we
> > > used images that haven't really been updated for 2 years - we still
> don't
> > > know what was done on them) and they're ready soon. In the following
> > > months, we will also port our Ubuntu slaves to the new method (don't
> > have a
> > > timeline yet). Ideally, the integration of AMD instances will only be a
> > > matter of running the same pipeline on a different instance type. In
> that
> > > > case, it should not be a big deal.
> > >
> > > If there are big differences, that's already a yellow flag for
> > > compatibility, but that's unlikely. But in that case, we would have to
> > make
> > > a more thorough time analysis and whether it's worth the effort. Maybe,
> > > somebody else could also lend us a hand and help us with adding AMD
> > > support.
> > >
> > > -Marco
> > >
> > > Am Fr., 30. Nov. 2018, 01:22 hat Hao Jin 
> > > geschrieben:
> > >
> > > > f16c is also an instruction set supported by both brands' recent CPUs
> > > just
> > > > like x86, AVX, SSE etc., and any difference in behaviors (quite
> > > impossible
> > > > to happen or it will be a major defect) would most likely be caused
> by
> > > the
> > > > underlying hardware implementation, so still, adding AMD instances is
> > not
> > > > adding much value here.
> > > > Hao
> > > >
> > > > On Thu, Nov 29, 2018 at 7:03 PM kellen sunderland <
> > > > kellen.sunderl...@gmail.com> wrote:
> > > >
> > > > > Just looked at the mf16c work and wanted to mention Rahul clearly
> > _was_
> > > > > thinking about AMD users in that PR.
> > > > >
> > > > > On Thu, Nov 29, 2018 at 3:46 PM kellen sunderland <
> > > > > kellen.sunderl...@gmail.com> wrote:
> > > > >
> > > > > > From my perspective we're developing a few features like mf16c
> and
> > > > MKLDNN
> > > > > > integration specifically for Intel CPUs.  It wouldn't hurt to
> make
> > > sure
> > > > > > those changes also run properly on AMD cpus.
> > > > > >
> > > > > > On Thu, Nov 29, 2018, 3:38 PM Hao Jin  wrote:
> > > > > >
> > > > > >> I'm a bit confused about why we need extra functionality tests
> > just
> > > > for
> > > > > >> AMD
> > > > > >> CPUs, aren't AMD CPUs supporting roughly the same instruction
> sets
> > > as
> > > > > the
> > > > > >> Intel ones? In the very impossible case that something working
> on
> > > > Intel
> > > > > >> CPUs being not functioning on AMD CPUs (or vice versa), it would
> > > > mostly
> > > > > >> likely be related to the underlying hardware implementation of
> the
> > > > same
> > > > > >> ISA, to which we definitely do not have a good solution. So I
> > don't
> > > > > think
> > > > > >> performing extra tests on functional aspect of the system on AMD
> > > CPUs
> > > > is
> > > > > >> adding any values.
> > > > > >> Hao
> > > > > >>
> > > > > >> On Thu, Nov 29, 20

Re: CI impaired

2018-11-30 Thread Gavin M. Bell
Hey Folks,

Marco has been running this change in dev, with flying colors, for some
time. This is not an experiment but a rollout that was announced.  We also
decided to make this change after the release cut so as to limit the blast
radius for any critical obligations to the community.  Marco is accountable for
this work and will address any issues that may occur as he has been put
on-call.  We have, to our best ability, mitigated as much risk as possible
and now it is time to pull the trigger.  The community will enjoy a bit
more visibility and clarity into the test process which will be
advantageous, as well as allowing us to extend our infrastructure in a way
that affords us more flexibility.

No pending PRs will be impacted.

Thank you for your support as we evolve this system to better serve the
community.

-Gavin

On Fri, Nov 30, 2018 at 5:23 PM Marco de Abreu
 wrote:

> Hello Naveen, this is not an experiment. Everything has been tested in our
> test system and is considered working 100%. This is not a test but actually
> the move into production - the merge into master happened a week ago. We
> now just have to put all PRs into the catalogue, which means that all PRs
> have to be analyzed with the new pipelines - the only thing that will be
> noticeable is that the CI is under higher load.
>
> The pending PRs will not be impacted. The existing pipeline is still
> running in parallel and everything will behave as before.
>
> -Marco
>
> On Fri, Nov 30, 2018 at 4:41 PM Naveen Swamy  wrote:
>
> > Marco, run your experiments on a branch - set up, test it well and then
> > bring it to the master.
> >
> > > On Nov 30, 2018, at 6:53 AM, Marco de Abreu <
> > marco.g.ab...@googlemail.com.INVALID> wrote:
> > >
> > > Hello,
> > >
> > > I'm now moving forward with #1. I will try to get to #3 as soon as
> > possible
> > > to reduce parallel jobs in our CI. You might notice some unfinished
> > jobs. I
> > > will let you know as soon as this process has been completed. Until
> then,
> > > please bear with me since we have hundreds of jobs to run in order to
> > > validate all PRs.
> > >
> > > Best regards,
> > > Marco
> > >
> > > On Fri, Nov 30, 2018 at 1:36 AM Marco de Abreu <
> > marco.g.ab...@googlemail.com>
> > > wrote:
> > >
> > >> Hello,
> > >>
> > >> since the release branch has now been cut, I would like to move
> forward
> > >> with the CI improvements for the master branch. This would include the
> > >> following actions:
> > >> 1. Re-enable the new Jenkins job
> > >> 2. Request Apache Infra to move the protected branch check from the
> main
> > >> pipeline to our new ones
> > >> 3. Merge https://github.com/apache/incubator-mxnet/pull/13474 - this
> > >> finalizes the deprecation process
> > >>
> > >> If nobody objects, I would like to start with #1 soon. Mentors, could
> > you
> > >> please assist to create the Apache Infra ticket? I would then take it
> > from
> > >> there and talk to Infra.
> > >>
> > >> Best regards,
> > >> Marco
> > >>
> > >> On Mon, Nov 26, 2018 at 2:47 AM kellen sunderland <
> > >> kellen.sunderl...@gmail.com> wrote:
> > >>
> > >>> Sorry, [1] meant to reference
> > >>> https://issues.jenkins-ci.org/browse/JENKINS-37984 .
> > >>>
> > >>> On Sun, Nov 25, 2018 at 5:41 PM kellen sunderland <
> > >>> kellen.sunderl...@gmail.com> wrote:
> > >>>
> > >>>> Marco and I ran into another urgent issue over the weekend that was
> > >>>> causing builds to fail.  This issue was unrelated to any feature
> > >>>> development work, or other CI fixes applied recently, but it did require
> > >>>> quite a bit of work from Marco (and a little from me) to fix.
> > >>>>
> > >>>> We spent enough time on the problem that it caused us to take a step back
> > >>>> and consider how we could both fix issues in CI and support the 1.4 release
> > >>>> with the least impact possible on MXNet devs.  Marco had planned to make a
> > >>>> significant change to the CI to fix a long-standing Jenkins error [1], but
> > >>>> we feel that most developers would prioritize having a stable build
> > >>>> environment for the next few weeks over having this fix in place.
> > >>>>
> > >>>> To properly introduce a new CI system the intent was to do a gradual
> > >>>> blue/green roll out of the fix.  To manage this rollout would have taken
> > >>>> operational effort and double compute load as we run systems in parallel.
> > >>>> This risks outages due to scaling limits, and we’d rather make this change
> > >>>> during a period of low-developer activity, i.e. shortly after the 1.4
> > >>>> release.
> > >>>>
> > >>>> This means that from now until the 1.4 release, in order to reduce
> > >>>> complexity MXNet developers should only see a single Jenkins verification
> > >>>> check, and a single Travis check.
> > >>>>
> > >>>>
> > >>>
> > >>
> >
>


-- 
Sincerely,
Gavin M. Bell

 "Never mistake a clear view for a short distance."
  -Paul Saffo


Re: CI impaired

2018-11-30 Thread Marco de Abreu
Hello Naveen, this is not an experiment. Everything has been tested in our
test system and is considered working 100%. This is not a test but actually
the move into production - the merge into master happened a week ago. We
now just have to put all PRs into the catalogue, which means that all PRs
have to be analyzed with the new pipelines - the only thing that will be
noticeable is that the CI is under higher load.

The pending PRs will not be impacted. The existing pipeline is still
running in parallel and everything will behave as before.

-Marco

On Fri, Nov 30, 2018 at 4:41 PM Naveen Swamy  wrote:

> Marco, run your experiments on a branch - set up, test it well and then
> bring it to the master.
>
> > On Nov 30, 2018, at 6:53 AM, Marco de Abreu <
> marco.g.ab...@googlemail.com.INVALID> wrote:
> >
> > Hello,
> >
> > I'm now moving forward with #1. I will try to get to #3 as soon as possible
> > to reduce parallel jobs in our CI. You might notice some unfinished jobs. I
> > will let you know as soon as this process has been completed. Until then,
> > please bear with me since we have hundreds of jobs to run in order to
> > validate all PRs.
> >
> > Best regards,
> > Marco
> >
> > On Fri, Nov 30, 2018 at 1:36 AM Marco de Abreu <
> marco.g.ab...@googlemail.com>
> > wrote:
> >
> >> Hello,
> >>
> >> since the release branch has now been cut, I would like to move forward
> >> with the CI improvements for the master branch. This would include the
> >> following actions:
> >> 1. Re-enable the new Jenkins job
> >> 2. Request Apache Infra to move the protected branch check from the main
> >> pipeline to our new ones
> >> 3. Merge https://github.com/apache/incubator-mxnet/pull/13474 - this
> >> finalizes the deprecation process
> >>
> >> If nobody objects, I would like to start with #1 soon. Mentors, could you
> >> please assist to create the Apache Infra ticket? I would then take it from
> >> there and talk to Infra.
> >>
> >> Best regards,
> >> Marco
> >>
> >> On Mon, Nov 26, 2018 at 2:47 AM kellen sunderland <
> >> kellen.sunderl...@gmail.com> wrote:
> >>
> >>> Sorry, [1] meant to reference
> >>> https://issues.jenkins-ci.org/browse/JENKINS-37984 .
> >>>
> >>> On Sun, Nov 25, 2018 at 5:41 PM kellen sunderland <
> >>> kellen.sunderl...@gmail.com> wrote:
> >>>
> >>>> Marco and I ran into another urgent issue over the weekend that was
> >>>> causing builds to fail.  This issue was unrelated to any feature
> >>>> development work, or other CI fixes applied recently, but it did require
> >>>> quite a bit of work from Marco (and a little from me) to fix.
> >>>>
> >>>> We spent enough time on the problem that it caused us to take a step back
> >>>> and consider how we could both fix issues in CI and support the 1.4 release
> >>>> with the least impact possible on MXNet devs.  Marco had planned to make a
> >>>> significant change to the CI to fix a long-standing Jenkins error [1], but
> >>>> we feel that most developers would prioritize having a stable build
> >>>> environment for the next few weeks over having this fix in place.
> >>>>
> >>>> To properly introduce a new CI system the intent was to do a gradual
> >>>> blue/green roll out of the fix.  To manage this rollout would have taken
> >>>> operational effort and double compute load as we run systems in parallel.
> >>>> This risks outages due to scaling limits, and we’d rather make this change
> >>>> during a period of low-developer activity, i.e. shortly after the 1.4
> >>>> release.
> >>>>
> >>>> This means that from now until the 1.4 release, in order to reduce
> >>>> complexity MXNet developers should only see a single Jenkins verification
> >>>> check, and a single Travis check.
> >>>>
> >>>>
> >>>
> >>
>


Re: CI impaired

2018-11-30 Thread Naveen Swamy
Marco, run your experiments on a branch - set up, test it well and then bring 
it to the master. 

> On Nov 30, 2018, at 6:53 AM, Marco de Abreu 
>  wrote:
> 
> Hello,
> 
> I'm now moving forward with #1. I will try to get to #3 as soon as possible
> to reduce parallel jobs in our CI. You might notice some unfinished jobs. I
> will let you know as soon as this process has been completed. Until then,
> please bear with me since we have hundreds of jobs to run in order to
> validate all PRs.
> 
> Best regards,
> Marco
> 
> On Fri, Nov 30, 2018 at 1:36 AM Marco de Abreu 
> wrote:
> 
>> Hello,
>> 
>> since the release branch has now been cut, I would like to move forward
>> with the CI improvements for the master branch. This would include the
>> following actions:
>> 1. Re-enable the new Jenkins job
>> 2. Request Apache Infra to move the protected branch check from the main
>> pipeline to our new ones
>> 3. Merge https://github.com/apache/incubator-mxnet/pull/13474 - this
>> finalizes the deprecation process
>> 
>> If nobody objects, I would like to start with #1 soon. Mentors, could you
>> please assist to create the Apache Infra ticket? I would then take it from
>> there and talk to Infra.
>> 
>> Best regards,
>> Marco
>> 
>> On Mon, Nov 26, 2018 at 2:47 AM kellen sunderland <
>> kellen.sunderl...@gmail.com> wrote:
>> 
>>> Sorry, [1] meant to reference
>>> https://issues.jenkins-ci.org/browse/JENKINS-37984 .
>>> 
>>> On Sun, Nov 25, 2018 at 5:41 PM kellen sunderland <
>>> kellen.sunderl...@gmail.com> wrote:
>>> 
>>>> Marco and I ran into another urgent issue over the weekend that was
>>>> causing builds to fail.  This issue was unrelated to any feature
>>>> development work, or other CI fixes applied recently, but it did require
>>>> quite a bit of work from Marco (and a little from me) to fix.
>>>>
>>>> We spent enough time on the problem that it caused us to take a step back
>>>> and consider how we could both fix issues in CI and support the 1.4 release
>>>> with the least impact possible on MXNet devs.  Marco had planned to make a
>>>> significant change to the CI to fix a long-standing Jenkins error [1], but
>>>> we feel that most developers would prioritize having a stable build
>>>> environment for the next few weeks over having this fix in place.
>>>>
>>>> To properly introduce a new CI system the intent was to do a gradual
>>>> blue/green roll out of the fix.  To manage this rollout would have taken
>>>> operational effort and double compute load as we run systems in parallel.
>>>> This risks outages due to scaling limits, and we’d rather make this change
>>>> during a period of low-developer activity, i.e. shortly after the 1.4
>>>> release.
>>>>
>>>> This means that from now until the 1.4 release, in order to reduce
>>>> complexity MXNet developers should only see a single Jenkins verification
>>>> check, and a single Travis check.
>>>>
>>>>
>>> 
>> 


Re: CI impaired

2018-11-30 Thread Naveen Swamy
There are still pending PRs that need to be merged and cherry-picked
to the branch.

> On Nov 30, 2018, at 6:53 AM, Marco de Abreu 
>  wrote:
> 
> Hello,
> 
> I'm now moving forward with #1. I will try to get to #3 as soon as possible
> to reduce parallel jobs in our CI. You might notice some unfinished jobs. I
> will let you know as soon as this process has been completed. Until then,
> please bear with me since we have hundreds of jobs to run in order to
> validate all PRs.
> 
> Best regards,
> Marco
> 
> On Fri, Nov 30, 2018 at 1:36 AM Marco de Abreu 
> wrote:
> 
>> Hello,
>> 
>> since the release branch has now been cut, I would like to move forward
>> with the CI improvements for the master branch. This would include the
>> following actions:
>> 1. Re-enable the new Jenkins job
>> 2. Request Apache Infra to move the protected branch check from the main
>> pipeline to our new ones
>> 3. Merge https://github.com/apache/incubator-mxnet/pull/13474 - this
>> finalizes the deprecation process
>> 
>> If nobody objects, I would like to start with #1 soon. Mentors, could you
>> please assist to create the Apache Infra ticket? I would then take it from
>> there and talk to Infra.
>> 
>> Best regards,
>> Marco
>> 
>> On Mon, Nov 26, 2018 at 2:47 AM kellen sunderland <
>> kellen.sunderl...@gmail.com> wrote:
>> 
>>> Sorry, [1] meant to reference
>>> https://issues.jenkins-ci.org/browse/JENKINS-37984 .
>>> 
>>> On Sun, Nov 25, 2018 at 5:41 PM kellen sunderland <
>>> kellen.sunderl...@gmail.com> wrote:
>>> 
>>>> Marco and I ran into another urgent issue over the weekend that was
>>>> causing builds to fail.  This issue was unrelated to any feature
>>>> development work, or other CI fixes applied recently, but it did require
>>>> quite a bit of work from Marco (and a little from me) to fix.
>>>>
>>>> We spent enough time on the problem that it caused us to take a step back
>>>> and consider how we could both fix issues in CI and support the 1.4 release
>>>> with the least impact possible on MXNet devs.  Marco had planned to make a
>>>> significant change to the CI to fix a long-standing Jenkins error [1], but
>>>> we feel that most developers would prioritize having a stable build
>>>> environment for the next few weeks over having this fix in place.
>>>>
>>>> To properly introduce a new CI system the intent was to do a gradual
>>>> blue/green roll out of the fix.  To manage this rollout would have taken
>>>> operational effort and double compute load as we run systems in parallel.
>>>> This risks outages due to scaling limits, and we’d rather make this change
>>>> during a period of low-developer activity, i.e. shortly after the 1.4
>>>> release.
>>>>
>>>> This means that from now until the 1.4 release, in order to reduce
>>>> complexity MXNet developers should only see a single Jenkins verification
>>>> check, and a single Travis check.
>>>>
>>>>
>>> 
>> 


Re: CI impaired

2018-11-30 Thread Marco de Abreu
Hello,

I'm now moving forward with #1. I will try to get to #3 as soon as possible
to reduce parallel jobs in our CI. You might notice some unfinished jobs. I
will let you know as soon as this process has been completed. Until then,
please bear with me since we have hundreds of jobs to run in order to
validate all PRs.

Best regards,
Marco

On Fri, Nov 30, 2018 at 1:36 AM Marco de Abreu 
wrote:

> Hello,
>
> since the release branch has now been cut, I would like to move forward
> with the CI improvements for the master branch. This would include the
> following actions:
> 1. Re-enable the new Jenkins job
> 2. Request Apache Infra to move the protected branch check from the main
> pipeline to our new ones
> 3. Merge https://github.com/apache/incubator-mxnet/pull/13474 - this
> finalizes the deprecation process
>
> If nobody objects, I would like to start with #1 soon. Mentors, could you
> please assist to create the Apache Infra ticket? I would then take it from
> there and talk to Infra.
>
> Best regards,
> Marco
>
> On Mon, Nov 26, 2018 at 2:47 AM kellen sunderland <
> kellen.sunderl...@gmail.com> wrote:
>
>> Sorry, [1] meant to reference
>> https://issues.jenkins-ci.org/browse/JENKINS-37984 .
>>
>> On Sun, Nov 25, 2018 at 5:41 PM kellen sunderland <
>> kellen.sunderl...@gmail.com> wrote:
>>
>> > Marco and I ran into another urgent issue over the weekend that was
>> > causing builds to fail.  This issue was unrelated to any feature
>> > development work, or other CI fixes applied recently, but it did require
>> > quite a bit of work from Marco (and a little from me) to fix.
>> >
>> > We spent enough time on the problem that it caused us to take a step back
>> > and consider how we could both fix issues in CI and support the 1.4 release
>> > with the least impact possible on MXNet devs.  Marco had planned to make a
>> > significant change to the CI to fix a long-standing Jenkins error [1], but
>> > we feel that most developers would prioritize having a stable build
>> > environment for the next few weeks over having this fix in place.
>> >
>> > To properly introduce a new CI system the intent was to do a gradual
>> > blue/green roll out of the fix.  To manage this rollout would have taken
>> > operational effort and double compute load as we run systems in parallel.
>> > This risks outages due to scaling limits, and we’d rather make this change
>> > during a period of low-developer activity, i.e. shortly after the 1.4
>> > release.
>> >
>> > This means that from now until the 1.4 release, in order to reduce
>> > complexity MXNet developers should only see a single Jenkins verification
>> > check, and a single Travis check.
>> >
>> >
>>
>