Re: CI impaired
Thanks Naveen and Gavin!

#1 has been completed and every job has finished its processing.
#2 is the ticket with infra: https://issues.apache.org/jira/browse/INFRA-17346
I'm now waiting for their response.

-Marco

On Fri, Nov 30, 2018 at 8:25 PM Naveen Swamy wrote:

> Hi Marco/Gavin,
>
> Thanks for the clarification. I was not aware that it has been tested on a separate test environment (this is what I was suggesting, to make the changes in a more controlled manner). Last time the change was made, many PRs were left dangling and developers had to go trigger them; I triggered them at least 5 times before it succeeded today.
>
> Appreciate all the hard work to make CI better.
>
> -Naveen
>
> On Fri, Nov 30, 2018 at 8:50 AM Gavin M. Bell wrote:
>
>> Hey Folks,
>>
>> Marco has been running this change in dev, with flying colors, for some time. This is not an experiment but a roll out that was announced. We also decided to make this change post the release cut to limit the blast radius from any critical obligations to the community. Marco is accountable for this work and will address any issues that may occur as he has been put on-call. We have, to our best ability, mitigated as much risk as possible and now it is time to pull the trigger. The community will enjoy a bit more visibility and clarity into the test process which will be advantageous, as well as allowing us to extend our infrastructure in a way that affords us more flexibility.
>>
>> No pending PRs will be impacted.
>>
>> Thank you for your support as we evolve this system to better serve the community.
>>
>> -Gavin
>>
>> On Fri, Nov 30, 2018 at 5:23 PM Marco de Abreu wrote:
>>
>>> Hello Naveen, this is not an experiment. Everything has been tested in our test system and is considered working 100%. This is not a test but actually the move into production - the merge into master happened a week ago. We now just have to put all PRs into the catalogue, which means that all PRs have to be analyzed with the new pipelines - the only thing that will be noticeable is that the CI is under higher load.
>>>
>>> The pending PRs will not be impacted. The existing pipeline is still running in parallel and everything will behave as before.
>>>
>>> -Marco
>>>
>>> On Fri, Nov 30, 2018 at 4:41 PM Naveen Swamy wrote:
>>>
>>>> Marco, run your experiments on a branch - set up, test it well and then bring it to the master.
>>>>
>>>>> On Nov 30, 2018, at 6:53 AM, Marco de Abreu <marco.g.ab...@googlemail.com.INVALID> wrote:
>>>>>
>>>>> Hello,
>>>>>
>>>>> I'm now moving forward with #1. I will try to get to #3 as soon as possible to reduce parallel jobs in our CI. You might notice some unfinished jobs. I will let you know as soon as this process has been completed. Until then, please bear with me since we have hundreds of jobs to run in order to validate all PRs.
>>>>>
>>>>> Best regards,
>>>>> Marco
>>>>>
>>>>> On Fri, Nov 30, 2018 at 1:36 AM Marco de Abreu <marco.g.ab...@googlemail.com> wrote:
>>>>>
>>>>>> Hello,
>>>>>>
>>>>>> since the release branch has now been cut, I would like to move forward with the CI improvements for the master branch. This would include the following actions:
>>>>>> 1. Re-enable the new Jenkins job
>>>>>> 2. Request Apache Infra to move the protected branch check from the main pipeline to our new ones
>>>>>> 3. Merge https://github.com/apache/incubator-mxnet/pull/13474 - this finalizes the deprecation process
>>>>>>
>>>>>> If nobody objects, I would like to start with #1 soon. Mentors, could you please assist to create the Apache Infra ticket? I would then take it from there and talk to Infra.
>>>>>>
>>>>>> Best regards,
>>>>>> Marco
>>>>>>
>>>>>> On Mon, Nov 26, 2018 at 2:47 AM kellen sunderland <kellen.sunderl...@gmail.com> wrote:
>>>>>>
>>>>>>> Sorry, [1] meant to reference https://issues.jenkins-ci.org/browse/JENKINS-37984 .
>>>>>>>
>>>>>>> On Sun, Nov 25, 2018 at 5:41 PM kellen sunderland <kellen.sunderl...@gmail.com> wrote:
>>>>>>>
>>>>>>>> Marco and I ran into another urgent issue over the weekend that was causing builds to fail. This issue was unrelated to any feature development work, or other CI fixes applied recently, but it did require quite a bit of work from Marco (and a little from me) to fix.
>>>>>>>>
>>>>>>>> We spent enough time on the problem that it caused us to take a step back and consider how we could both fix issues i
Re: v1.4.0 status 11/29
PR is here: https://github.com/apache/incubator-mxnet/pull/13497

On Thu, Nov 29, 2018 at 8:56 PM Lv, Tao A wrote:

> Credit belongs to Alex.
>
> Hi Alex, would you mind porting your fix to the v1.4.x branch?
>
> Thanks,
> -Tao
>
> -Original Message-
> From: Steffen Rochel [mailto:steffenroc...@gmail.com]
> Sent: Friday, November 30, 2018 12:48 PM
> To: dev@mxnet.incubator.apache.org
> Subject: Re: v1.4.0 status 11/29
>
> Hi Tao - thanks for fixing the crash. Please create PR on v1.4.x branch with [v1.4.x] in title and add me to the PR.
> Steffen
>
> On Thu, Nov 29, 2018 at 8:44 PM Lv, Tao A wrote:
>
>> Hi Steffen, I would like to have https://github.com/apache/incubator-mxnet/pull/13433 into the coming 1.4.0 release. It fixed a crash of deconvolution with certain input size for MKL-DNN backend. This PR is well reviewed and already merged into the master branch. New test case is also included there.
>>
>> Please find the corresponding issue here: https://github.com/apache/incubator-mxnet/issues/13421 .
>>
>> Thanks,
>> -Tao
>>
>> -Original Message-
>> From: Steffen Rochel [mailto:steffenroc...@gmail.com]
>> Sent: Friday, November 30, 2018 12:05 PM
>> To: dev@mxnet.incubator.apache.org
>> Subject: v1.4.0 status 11/29
>>
>> Dear MXNet community -
>> I would like to provide update on v1.4.0 status, details will be tracked here <https://cwiki.apache.org/confluence/display/MXNET/Apache+MXNet+%28incubating%29+1.4.0+Release+Plan+and+Status>.
>>
>> 1. Sergey created v1.4.x branch
>> 2. As expected, additional requests have been made for inclusion in v1.4.0 release. Critical PR are tracked here <https://cwiki.apache.org/confluence/display/MXNET/Apache+MXNet+%28incubating%29+1.4.0+Release+Plan+and+Status#ApacheMXNet(incubating)1.4.0ReleasePlanandStatus-OpenPRstotrack>.
>> 3. PR to update README.md is blocked by flaky test failures, retriggered check.
>> 4. PR to upgrade version on master to v1.5.0 has been submitted.
>> 5. CI is setup and first run passed.
>>
>> Note: if you want to add selected fixes or enhancements, please reply to this email. Please provide justification, add me as approver to the v1.4.x PR and make sure your changes have tests included in PR and get properly reviewed.
>>
>> Regards,
>> Steffen
Re: CI impaired
Hi Marco/Gavin,

Thanks for the clarification. I was not aware that it has been tested on a separate test environment (this is what I was suggesting, to make the changes in a more controlled manner). Last time the change was made, many PRs were left dangling and developers had to go trigger them; I triggered them at least 5 times before it succeeded today.

Appreciate all the hard work to make CI better.

-Naveen

On Fri, Nov 30, 2018 at 8:50 AM Gavin M. Bell wrote:

> Hey Folks,
>
> Marco has been running this change in dev, with flying colors, for some time. This is not an experiment but a roll out that was announced. We also decided to make this change post the release cut to limit the blast radius from any critical obligations to the community. Marco is accountable for this work and will address any issues that may occur as he has been put on-call. We have, to our best ability, mitigated as much risk as possible and now it is time to pull the trigger. The community will enjoy a bit more visibility and clarity into the test process which will be advantageous, as well as allowing us to extend our infrastructure in a way that affords us more flexibility.
>
> No pending PRs will be impacted.
>
> Thank you for your support as we evolve this system to better serve the community.
>
> -Gavin
>
> On Fri, Nov 30, 2018 at 5:23 PM Marco de Abreu wrote:
>
>> Hello Naveen, this is not an experiment. Everything has been tested in our test system and is considered working 100%. This is not a test but actually the move into production - the merge into master happened a week ago. We now just have to put all PRs into the catalogue, which means that all PRs have to be analyzed with the new pipelines - the only thing that will be noticeable is that the CI is under higher load.
>>
>> The pending PRs will not be impacted. The existing pipeline is still running in parallel and everything will behave as before.
>>
>> -Marco
>>
>> On Fri, Nov 30, 2018 at 4:41 PM Naveen Swamy wrote:
>>
>>> Marco, run your experiments on a branch - set up, test it well and then bring it to the master.
>>>
>>>> On Nov 30, 2018, at 6:53 AM, Marco de Abreu <marco.g.ab...@googlemail.com.INVALID> wrote:
>>>>
>>>> Hello,
>>>>
>>>> I'm now moving forward with #1. I will try to get to #3 as soon as possible to reduce parallel jobs in our CI. You might notice some unfinished jobs. I will let you know as soon as this process has been completed. Until then, please bear with me since we have hundreds of jobs to run in order to validate all PRs.
>>>>
>>>> Best regards,
>>>> Marco
>>>>
>>>> On Fri, Nov 30, 2018 at 1:36 AM Marco de Abreu <marco.g.ab...@googlemail.com> wrote:
>>>>
>>>>> Hello,
>>>>>
>>>>> since the release branch has now been cut, I would like to move forward with the CI improvements for the master branch. This would include the following actions:
>>>>> 1. Re-enable the new Jenkins job
>>>>> 2. Request Apache Infra to move the protected branch check from the main pipeline to our new ones
>>>>> 3. Merge https://github.com/apache/incubator-mxnet/pull/13474 - this finalizes the deprecation process
>>>>>
>>>>> If nobody objects, I would like to start with #1 soon. Mentors, could you please assist to create the Apache Infra ticket? I would then take it from there and talk to Infra.
>>>>>
>>>>> Best regards,
>>>>> Marco
>>>>>
>>>>> On Mon, Nov 26, 2018 at 2:47 AM kellen sunderland <kellen.sunderl...@gmail.com> wrote:
>>>>>
>>>>>> Sorry, [1] meant to reference https://issues.jenkins-ci.org/browse/JENKINS-37984 .
>>>>>>
>>>>>> On Sun, Nov 25, 2018 at 5:41 PM kellen sunderland <kellen.sunderl...@gmail.com> wrote:
>>>>>>
>>>>>>> Marco and I ran into another urgent issue over the weekend that was causing builds to fail. This issue was unrelated to any feature development work, or other CI fixes applied recently, but it did require quite a bit of work from Marco (and a little from me) to fix.
>>>>>>>
>>>>>>> We spent enough time on the problem that it caused us to take a step back and consider how we could both fix issues in CI and support the 1.4 release with the least impact possible on MXNet devs. Marco had planned to make a significant change to the CI to fix a long-standing Jenkins error [1], but we feel that most developers would prioritize having a stable build environment for the next few weeks over having this fix in place.
>>>>>>>
>>>>>>> To properly introduce a new CI system the intent was to do a gradual blue/green roll out of the fix.
Re: Adding AMD CPU to CI
I think just adding AMD is not the right abstraction level. Testing and benchmarking with different CPU flags / march, i.e. AVX2, SSE2, brings value in my opinion. Just testing another vendor of a compatible CPU doesn't.

Pedro

> On 30. Nov 2018, at 19:32, kellen sunderland wrote:
>
> Damn, knew I should have double-checked! Oh well it's also carbon neutral.
>
> On Fri, Nov 30, 2018 at 10:27 AM Pedro Larroy wrote:
>
>> Agree with Tianqi and Hao. Adding AMD brings no value and increases complexity and CI cost. The instruction sets are the same. For benchmarking it might make sense though.
>>
>> Pedro
>>
>>> On 30. Nov 2018, at 18:19, Tianqi Chen wrote:
>>>
>>> I still think it is overkill to add AMD CPU to the CI, given the additional cost it could bring and little additional information we can get out from it.
>>>
>>> A middle ground is to add AMD CPU to a nightly build or final sweep before release. If there is a case that we find that AMD CPU really makes a difference, then we add it to the CI
>>>
>>> Tianqi
>>>
>>> On Thu, Nov 29, 2018 at 6:29 PM Hao Jin wrote:
>>>
>>>> For CPUs, the supported instruction sets may also vary between the same manufacturer's different product lines of the same generation (Skylake-SP versus Skylake).
>>>> For the same instruction set, the two manufacturers should both have a working version of the hardware implementation. If any of the implementations does not work, then the chip would not even be considered functioning properly.
>>>> If some AMD CPUs only support up to AVX2 instruction sets, they would just function in the same way as an Intel CPU that supports up to AVX2 instruction sets. The performance may vary, but the capability and behavior of the two chips would be the same when given the same machine code.
>>>> For AMD GPUs it's a totally different story, as AMD GPUs do not share the same instruction sets with the NVIDIA ones, thus testing on AMD GPUs (if we do have support for them) would definitely add value.
>>>> Hao
>>>>
>>>> On Thu, Nov 29, 2018 at 8:37 PM Anirudh Subramanian <anirudh2...@gmail.com> wrote:
>>>>
>>>>> Instruction set extensions support like AVX2, AVX512 etc. can vary between AMD and Intel and there can also be a time lag between when Intel supports it versus when AMD supports it.
>>>>> Also, in the future this setup may be useful in case MXNet supports AMD GPUs and AWS also happens to have support for it.
>>>>>
>>>>> Anirudh
>>>>>
>>>>> On Thu, Nov 29, 2018 at 4:29 PM Marco de Abreu wrote:
>>>>>
>>>>>> I think it's worth a discussion to do a sanity check. While generally these instructions are standardized, we also made the experience with ARM that the theory and reality sometimes don't match. Thus, it's always good to check.
>>>>>>
>>>>>> In the next months we are going to refactor our slave creation processes. Chance Bair has been working on rewriting Windows slaves from scratch (we used images that haven't really been updated for 2 years - we still don't know what was done on them) and they're ready soon. In the following months, we will also port our Ubuntu slaves to the new method (don't have a timeline yet). Ideally, the integration of AMD instances will only be a matter of running the same pipeline on a different instance type. In that case, it should not be a big deal.
>>>>>>
>>>>>> If there are big differences, that's already a yellow flag for compatibility, but that's unlikely. But in that case, we would have to make a more thorough time analysis and whether it's worth the effort. Maybe, somebody else could also lend us a hand and help us with adding AMD support.
>>>>>>
>>>>>> -Marco
>>>>>>
>>>>>> On Fri, Nov 30, 2018, 01:22 Hao Jin wrote:
>>>>>>
>>>>>>> f16c is also an instruction set supported by both brands' recent CPUs just like x86, AVX, SSE etc., and any difference in behaviors (quite impossible to happen or it will be a major defect) would most likely be caused by the underlying hardware implementation, so still, adding AMD instances is not adding much value here.
>>>>>>> Hao
>>>>>>>
>>>>>>> On Thu, Nov 29, 2018 at 7:03 PM kellen sunderland <kellen.sunderl...@gmail.com> wrote:
>>>>>>>
>>>>>>>> Just looked at the mf16c work and wanted to mention Rahul clearly _was_ thinking about AMD users in that PR.
>>>>>>>>
>>>>>>>> On Thu, Nov 29, 2018 at 3:46 PM kellen sunderland <kellen.sunderl...@gmail.com> wrote:
>>>>>>>>
>>>>>>>>> From my perspective we're developing a few features like mf16c and MKLDNN integration specifically for Intel CPUs. It wouldn't hurt to mak
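
Pedro's point above, that what matters is which instruction-set flags a machine exposes rather than which vendor built it, can be illustrated with a small sketch. It assumes a Linux x86 host, where /proc/cpuinfo lists features on a "flags" line. The helper names parse_cpu_flags and missing_flags and the sample text are hypothetical; nothing here is part of MXNet or its CI.

```python
from typing import Iterable, Set


def parse_cpu_flags(cpuinfo_text: str) -> Set[str]:
    """Return the feature flags from the first 'flags' line of
    /proc/cpuinfo-style text (Linux x86 layout is assumed)."""
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "flags":
            return set(value.split())
    return set()


def missing_flags(available: Set[str], required: Iterable[str]) -> Set[str]:
    """Return the required flags that the CPU does not advertise."""
    return {flag for flag in required if flag not in available}


# Hypothetical sample; on a real host one would read /proc/cpuinfo instead.
SAMPLE = "vendor_id\t: AuthenticAMD\nflags\t\t: fpu sse2 avx avx2 f16c\n"

flags = parse_cpu_flags(SAMPLE)
missing = missing_flags(flags, ["avx2", "f16c", "avx512f"])
print(sorted(missing))  # prints ['avx512f']
```

A check like this keys a test matrix on capabilities (avx2, f16c, avx512f) instead of on the vendor string, which is the abstraction level argued for in the message above.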
Re: Adding AMD CPU to CI
Damn, knew I should have double-checked! Oh well it's also carbon neutral.

On Fri, Nov 30, 2018 at 10:27 AM Pedro Larroy wrote:

> Agree with Tianqi and Hao. Adding AMD brings no value and increases complexity and CI cost. The instruction sets are the same. For benchmarking it might make sense though.
>
> Pedro
>
>> On 30. Nov 2018, at 18:19, Tianqi Chen wrote:
>>
>> I still think it is overkill to add AMD CPU to the CI, given the additional cost it could bring and little additional information we can get out from it.
>>
>> A middle ground is to add AMD CPU to a nightly build or final sweep before release. If there is a case that we find that AMD CPU really makes a difference, then we add it to the CI
>>
>> Tianqi
>>
>> On Thu, Nov 29, 2018 at 6:29 PM Hao Jin wrote:
>>
>>> For CPUs, the supported instruction sets may also vary between the same manufacturer's different product lines of the same generation (Skylake-SP versus Skylake).
>>> For the same instruction set, the two manufacturers should both have a working version of the hardware implementation. If any of the implementations does not work, then the chip would not even be considered functioning properly.
>>> If some AMD CPUs only support up to AVX2 instruction sets, they would just function in the same way as an Intel CPU that supports up to AVX2 instruction sets. The performance may vary, but the capability and behavior of the two chips would be the same when given the same machine code.
>>> For AMD GPUs it's a totally different story, as AMD GPUs do not share the same instruction sets with the NVIDIA ones, thus testing on AMD GPUs (if we do have support for them) would definitely add value.
>>> Hao
>>>
>>> On Thu, Nov 29, 2018 at 8:37 PM Anirudh Subramanian <anirudh2...@gmail.com> wrote:
>>>
>>>> Instruction set extensions support like AVX2, AVX512 etc. can vary between AMD and Intel and there can also be a time lag between when Intel supports it versus when AMD supports it.
>>>> Also, in the future this setup may be useful in case MXNet supports AMD GPUs and AWS also happens to have support for it.
>>>>
>>>> Anirudh
>>>>
>>>> On Thu, Nov 29, 2018 at 4:29 PM Marco de Abreu wrote:
>>>>
>>>>> I think it's worth a discussion to do a sanity check. While generally these instructions are standardized, we also made the experience with ARM that the theory and reality sometimes don't match. Thus, it's always good to check.
>>>>>
>>>>> In the next months we are going to refactor our slave creation processes. Chance Bair has been working on rewriting Windows slaves from scratch (we used images that haven't really been updated for 2 years - we still don't know what was done on them) and they're ready soon. In the following months, we will also port our Ubuntu slaves to the new method (don't have a timeline yet). Ideally, the integration of AMD instances will only be a matter of running the same pipeline on a different instance type. In that case, it should not be a big deal.
>>>>>
>>>>> If there are big differences, that's already a yellow flag for compatibility, but that's unlikely. But in that case, we would have to make a more thorough time analysis and whether it's worth the effort. Maybe, somebody else could also lend us a hand and help us with adding AMD support.
>>>>>
>>>>> -Marco
>>>>>
>>>>> On Fri, Nov 30, 2018, 01:22 Hao Jin wrote:
>>>>>
>>>>>> f16c is also an instruction set supported by both brands' recent CPUs just like x86, AVX, SSE etc., and any difference in behaviors (quite impossible to happen or it will be a major defect) would most likely be caused by the underlying hardware implementation, so still, adding AMD instances is not adding much value here.
>>>>>> Hao
>>>>>>
>>>>>> On Thu, Nov 29, 2018 at 7:03 PM kellen sunderland <kellen.sunderl...@gmail.com> wrote:
>>>>>>
>>>>>>> Just looked at the mf16c work and wanted to mention Rahul clearly _was_ thinking about AMD users in that PR.
>>>>>>>
>>>>>>> On Thu, Nov 29, 2018 at 3:46 PM kellen sunderland <kellen.sunderl...@gmail.com> wrote:
>>>>>>>
>>>>>>>> From my perspective we're developing a few features like mf16c and MKLDNN integration specifically for Intel CPUs. It wouldn't hurt to make sure those changes also run properly on AMD cpus.
>>>>>>>>
>>>>>>>> On Thu, Nov 29, 2018, 3:38 PM Hao Jin wrote:
>>>>>>>>
>>>>>>>>> I'm a bit confused about why we need extra functionality tests just for AMD CPUs, aren't AMD CPUs supporting roughly the same instruction
Re: Adding AMD CPU to CI
Agree with Tianqi and Hao. Adding AMD brings no value and increases complexity and CI cost. The instruction sets are the same. For benchmarking it might make sense though.

Pedro

> On 30. Nov 2018, at 18:19, Tianqi Chen wrote:
>
> I still think it is overkill to add AMD CPU to the CI, given the additional cost it could bring and little additional information we can get out from it.
>
> A middle ground is to add AMD CPU to a nightly build or final sweep before release. If there is a case that we find that AMD CPU really makes a difference, then we add it to the CI
>
> Tianqi
>
> On Thu, Nov 29, 2018 at 6:29 PM Hao Jin wrote:
>
>> For CPUs, the supported instruction sets may also vary between the same manufacturer's different product lines of the same generation (Skylake-SP versus Skylake).
>> For the same instruction set, the two manufacturers should both have a working version of the hardware implementation. If any of the implementations does not work, then the chip would not even be considered functioning properly.
>> If some AMD CPUs only support up to AVX2 instruction sets, they would just function in the same way as an Intel CPU that supports up to AVX2 instruction sets. The performance may vary, but the capability and behavior of the two chips would be the same when given the same machine code.
>> For AMD GPUs it's a totally different story, as AMD GPUs do not share the same instruction sets with the NVIDIA ones, thus testing on AMD GPUs (if we do have support for them) would definitely add value.
>> Hao
>>
>> On Thu, Nov 29, 2018 at 8:37 PM Anirudh Subramanian wrote:
>>
>>> Instruction set extensions support like AVX2, AVX512 etc. can vary between AMD and Intel and there can also be a time lag between when Intel supports it versus when AMD supports it.
>>> Also, in the future this setup may be useful in case MXNet supports AMD GPUs and AWS also happens to have support for it.
>>>
>>> Anirudh
>>>
>>> On Thu, Nov 29, 2018 at 4:29 PM Marco de Abreu wrote:
>>>
>>>> I think it's worth a discussion to do a sanity check. While generally these instructions are standardized, we also made the experience with ARM that the theory and reality sometimes don't match. Thus, it's always good to check.
>>>>
>>>> In the next months we are going to refactor our slave creation processes. Chance Bair has been working on rewriting Windows slaves from scratch (we used images that haven't really been updated for 2 years - we still don't know what was done on them) and they're ready soon. In the following months, we will also port our Ubuntu slaves to the new method (don't have a timeline yet). Ideally, the integration of AMD instances will only be a matter of running the same pipeline on a different instance type. In that case, it should not be a big deal.
>>>>
>>>> If there are big differences, that's already a yellow flag for compatibility, but that's unlikely. But in that case, we would have to make a more thorough time analysis and whether it's worth the effort. Maybe, somebody else could also lend us a hand and help us with adding AMD support.
>>>>
>>>> -Marco
>>>>
>>>> On Fri, Nov 30, 2018, 01:22 Hao Jin wrote:
>>>>
>>>>> f16c is also an instruction set supported by both brands' recent CPUs just like x86, AVX, SSE etc., and any difference in behaviors (quite impossible to happen or it will be a major defect) would most likely be caused by the underlying hardware implementation, so still, adding AMD instances is not adding much value here.
>>>>> Hao
>>>>>
>>>>> On Thu, Nov 29, 2018 at 7:03 PM kellen sunderland <kellen.sunderl...@gmail.com> wrote:
>>>>>
>>>>>> Just looked at the mf16c work and wanted to mention Rahul clearly _was_ thinking about AMD users in that PR.
>>>>>>
>>>>>> On Thu, Nov 29, 2018 at 3:46 PM kellen sunderland <kellen.sunderl...@gmail.com> wrote:
>>>>>>
>>>>>>> From my perspective we're developing a few features like mf16c and MKLDNN integration specifically for Intel CPUs. It wouldn't hurt to make sure those changes also run properly on AMD cpus.
>>>>>>>
>>>>>>> On Thu, Nov 29, 2018, 3:38 PM Hao Jin wrote:
>>>>>>>
>>>>>>>> I'm a bit confused about why we need extra functionality tests just for AMD CPUs, aren't AMD CPUs supporting roughly the same instruction sets as the Intel ones? In the very impossible case that something working on Intel CPUs being not functioning on AMD CPUs (or vice versa), it would most likely be related to the underlying hardware implementation of the same ISA, to which we definitely do not have a good solution. So I don't think performin
Re: Adding AMD CPU to CI
Kellen we run CI in us-west-2, Oregon :P sorry, Environment :(

-Marco

On Fri, Nov 30, 2018, 18:58 kellen sunderland <kellen.sunderl...@gmail.com> wrote:

> +1 to nightly.
>
> Given the awesome results shown by Alex for AMD cpus I think MKLDNN actually would probably be something I'd use, even on my AMD machines. Kudos to Intel for releasing this lib which works great on their hardware, but still pretty well w/ AMD. The upshot of MKLDNN supporting AMD to me is that it makes me much more likely to support it as the default PyPi package (discussed in another thread). This is part of the reason I'd like to have a sanity test in CI somewhere for AMD hardware.
>
> Unrelated note: regarding global warming I actually partially chose eu-west-1 to host CI because it's carbon neutral. The cost of the CI is significant, and although it's donated by AWS I'm glad the community is cognizant of that.
>
> On Fri, Nov 30, 2018 at 9:54 AM Kumar, Vikas wrote:
>
>> I concur. +1 for nightly for pre-release suite.
>>
>> On 11/30/18, 9:49 AM, "Tianqi Chen" wrote:
>>
>>> +1 for nightly for pre-release suite, but not the CI that triggered in every test. The best engineering practice is not to add things, but to remove things so that there is nothing can be removed.
>>>
>>> In terms of MKLDNN, since it is an Intel product, I doubt optimizing for AMD CPUs is its goal, adding CI to guard against backward compatibility is a bit overkill even. Since the AMD CPU user would likely disable this feature and use the original CPU version of the project.
>>>
>>> At least we can contribute to reducing the carbon footprint and slow down the global warming :)
>>>
>>> Tianqi
>>>
>>> On Fri, Nov 30, 2018 at 9:38 AM kellen sunderland <kellen.sunderl...@gmail.com> wrote:
>>>
>>>> Regarding cost, yes we could run this nightly or simply make it run an existing test suite that would make sense rather than having it duplicate a suite.
>>>>
>>>> On Fri, Nov 30, 2018 at 9:26 AM Kumar, Vikas wrote:
>>>>
>>>>> I don't think there is any downside to this proposal. I think a basic sanity CI testing on AMD processors will give extra boost to our tests. This adds to developer productivity and they have one less thing to worry about. Developers have spent time in past where they had to manually test on AMD processors, MKLDNN being the recent instance. It's good to have those tests in CI pipeline.
>>>>> All I see is benefit. If the $ cost is not too high for basic sanity testing, we should do this, until and unless some strong downside is called out.
>>>>>
>>>>> +1
>>>>>
>>>>> On 11/29/18, 5:37 PM, "Anirudh Subramanian" <anirudh2...@gmail.com> wrote:
>>>>>
>>>>>> Instruction set extensions support like AVX2, AVX512 etc. can vary between AMD and Intel and there can also be a time lag between when Intel supports it versus when AMD supports it.
>>>>>> Also, in the future this setup may be useful in case MXNet supports AMD GPUs and AWS also happens to have support for it.
>>>>>>
>>>>>> Anirudh
>>>>>>
>>>>>> On Thu, Nov 29, 2018 at 4:29 PM Marco de Abreu wrote:
>>>>>>
>>>>>>> I think it's worth a discussion to do a sanity check. While generally these instructions are standardized, we also made the experience with ARM that the theory and reality sometimes don't match. Thus, it's always good to check.
>>>>>>>
>>>>>>> In the next months we are going to refactor our slave creation processes. Chance Bair has been working on rewriting Windows slaves from scratch (we used images that haven't really been updated for 2 years - we still don't know what was done on them) and they're ready soon. In the following months, we will also port our Ubuntu slaves to the new method (don't have a timeline yet). Ideally, the integration of AMD instances will only be a matter of running the same pipeline on a different instance type. In that case, it should not be a big deal.
>>>>>>>
>>>>>>> If there are big differences, that's already a yellow flag for compatibility, but that's unlikely. But in that ca
Re: Adding AMD CPU to CI
+1 to nightly. Given the awesome results Alex showed for AMD CPUs, I think MKLDNN is actually something I'd probably use, even on my AMD machines. Kudos to Intel for releasing this lib, which works great on their hardware but still pretty well with AMD.

The upshot of MKLDNN supporting AMD, to me, is that it makes me much more likely to support it as the default PyPI package (discussed in another thread). This is part of the reason I'd like to have a sanity test in CI somewhere for AMD hardware.

Unrelated note: regarding global warming, I chose eu-west-1 to host CI partly because it's carbon neutral. The cost of the CI is significant, and although it's donated by AWS, I'm glad the community is cognizant of that.

On Fri, Nov 30, 2018 at 9:54 AM Kumar, Vikas wrote:
> I concur. +1 for nightly for the pre-release suite.
Re: Adding AMD CPU to CI
I concur. +1 for nightly for the pre-release suite.

On 11/30/18, 9:49 AM, "Tianqi Chen" wrote:
> +1 for nightly for the pre-release suite, but not for the CI that is triggered on every test. The best engineering practice is not to add things, but to remove things until there is nothing left that can be removed.
>
> In terms of MKLDNN, since it is an Intel product, I doubt optimizing for AMD CPUs is its goal. Adding CI to guard against backward compatibility is even a bit overkill, since AMD CPU users would likely disable this feature and use the original CPU version of the project.
>
> At least we can contribute to reducing the carbon footprint and slowing down global warming :)
>
> Tianqi
Re: Adding AMD CPU to CI
+1 for nightly for the pre-release suite, but not for the CI that is triggered on every test. The best engineering practice is not to add things, but to remove things until there is nothing left that can be removed.

In terms of MKLDNN, since it is an Intel product, I doubt optimizing for AMD CPUs is its goal. Adding CI to guard against backward compatibility is even a bit overkill, since AMD CPU users would likely disable this feature and use the original CPU version of the project.

At least we can contribute to reducing the carbon footprint and slowing down global warming :)

Tianqi

On Fri, Nov 30, 2018 at 9:38 AM kellen sunderland <kellen.sunderl...@gmail.com> wrote:
> Regarding cost, yes, we could run this nightly, or simply have it run an existing test suite where that makes sense, rather than having it duplicate a suite.
Re: Adding AMD CPU to CI
Regarding cost, yes, we could run this nightly, or simply have it run an existing test suite where that makes sense, rather than having it duplicate a suite.

On Fri, Nov 30, 2018 at 9:26 AM Kumar, Vikas wrote:
> I don't think there is any downside to this proposal. I think basic sanity CI testing on AMD processors will give an extra boost to our tests. This adds to developer productivity, and developers have one less thing to worry about. Developers have spent time in the past manually testing on AMD processors, MKLDNN being the most recent instance. It's good to have those tests in the CI pipeline.
>
> All I see is benefit. If the $ cost is not too high for basic sanity testing, we should do this, unless some strong downside is called out.
>
> +1
Re: Adding AMD CPU to CI
I don't think there is any downside to this proposal. I think basic sanity CI testing on AMD processors will give an extra boost to our tests. This adds to developer productivity, and developers have one less thing to worry about. Developers have spent time in the past manually testing on AMD processors, MKLDNN being the most recent instance. It's good to have those tests in the CI pipeline.

All I see is benefit. If the $ cost is not too high for basic sanity testing, we should do this, unless some strong downside is called out.

+1

On 11/29/18, 5:37 PM, "Anirudh Subramanian" wrote:
> Instruction set extension support (AVX2, AVX512, etc.) can vary between AMD and Intel, and there can also be a time lag between when Intel supports an extension and when AMD supports it.
> Also, in the future this setup may be useful in case MXNet supports AMD GPUs and AWS also happens to have support for them.
>
> Anirudh
>
> On Thu, Nov 29, 2018 at 4:29 PM Marco de Abreu wrote:
> > I think it's worth a discussion to do a sanity check. While these instructions are generally standardized, our experience with ARM showed that theory and reality sometimes don't match. Thus, it's always good to check.
> >
> > In the next months we are going to refactor our slave creation processes. Chance Bair has been working on rewriting the Windows slaves from scratch (we used images that haven't really been updated for 2 years - we still don't know what was done on them) and they will be ready soon. In the following months, we will also port our Ubuntu slaves to the new method (we don't have a timeline yet). Ideally, the integration of AMD instances will only be a matter of running the same pipeline on a different instance type. In that case, it should not be a big deal.
> >
> > If there are big differences, that's already a yellow flag for compatibility, but that's unlikely. In that case, we would have to make a more thorough analysis of the time required and of whether it's worth the effort. Maybe somebody else could also lend us a hand and help us with adding AMD support.
> >
> > -Marco
> >
> > On Fri, Nov 30, 2018, 01:22 Hao Jin wrote:
> > > f16c is also an instruction set supported by both brands' recent CPUs, just like x86, AVX, SSE etc., and any difference in behavior (highly unlikely, or it would be a major defect) would most likely be caused by the underlying hardware implementation, so still, adding AMD instances is not adding much value here.
> > > Hao
> > >
> > > On Thu, Nov 29, 2018 at 7:03 PM kellen sunderland <kellen.sunderl...@gmail.com> wrote:
> > > > Just looked at the mf16c work and wanted to mention Rahul clearly _was_ thinking about AMD users in that PR.
> > > >
> > > > On Thu, Nov 29, 2018 at 3:46 PM kellen sunderland <kellen.sunderl...@gmail.com> wrote:
> > > > > From my perspective we're developing a few features like mf16c and MKLDNN integration specifically for Intel CPUs. It wouldn't hurt to make sure those changes also run properly on AMD CPUs.
> > > > >
> > > > > On Thu, Nov 29, 2018, 3:38 PM Hao Jin wrote:
> > > > > > I'm a bit confused about why we need extra functionality tests just for AMD CPUs. Don't AMD CPUs support roughly the same instruction sets as the Intel ones? In the very unlikely case that something working on Intel CPUs does not function on AMD CPUs (or vice versa), it would most likely be related to the underlying hardware implementation of the same ISA, for which we definitely do not have a good solution. So I don't think performing extra tests on the functional aspects of the system on AMD CPUs adds any value.
> > > > > > Hao
> > > > > >
> > > > > > On Thu, Nov 29, 2018 at 5:50 PM Seth, Manu wrote:
> > > > > > > +1
> > > > > > >
> > > > > > > On 11/29/18, 2:39 PM, "Alex Zai" wrote:
> > > > > > > > What are people's thoughts on having AMD machines tested on the CI? AMD machines are now available on AWS.
> > > > > > > >
> > > > > > > > Best,
> > > > > > > > Alex
Re: Adding AMD CPU to CI
I still think it is overkill to add AMD CPUs to the CI, given the additional cost it could bring and the little additional information we would get out of it. A middle ground is to add AMD CPUs to a nightly build or a final sweep before release. If we find a case where an AMD CPU really makes a difference, then we add it to the CI.

Tianqi

On Thu, Nov 29, 2018 at 6:29 PM Hao Jin wrote:
> For CPUs, the supported instruction sets may also vary between the same manufacturer's different product lines of the same generation (Skylake-SP versus Skylake).
> For the same instruction set, the two manufacturers should both have a working version of the hardware implementation. If any of the implementations does not work, then the chip would not even be considered to be functioning properly.
> If some AMD CPUs only support up to the AVX2 instruction set, they would just function in the same way as an Intel CPU that supports up to AVX2. The performance may vary, but the capability and behavior of the two chips would be the same when given the same machine code.
> For AMD GPUs it's a totally different story, as AMD GPUs do not share the same instruction sets with the NVIDIA ones; thus testing on AMD GPUs (if we do have support for them) would definitely add value.
> Hao
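As an aside to the instruction-set discussion above: a minimal sketch, assuming a Linux x86 host where /proc/cpuinfo exposes a `flags` line, of how a sanity step could check that a CI runner reports the extensions mentioned in this thread (AVX2, F16C). The function names and the `REQUIRED_FLAGS` choice are hypothetical and are not part of MXNet or its CI.

```python
from pathlib import Path

# Hypothetical selection for illustration; names follow the Linux
# /proc/cpuinfo "flags" spelling (e.g. "avx2", "avx512f", "f16c").
REQUIRED_FLAGS = ("avx2", "f16c")


def parse_cpu_flags(cpuinfo_text):
    """Return the set of feature flags from the first 'flags' line."""
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "flags":
            return set(value.split())
    return set()


def missing_flags(flags, required=REQUIRED_FLAGS):
    """Return the required flags the CPU does not report, sorted."""
    return sorted(set(required) - set(flags))


if __name__ == "__main__":
    cpuinfo = Path("/proc/cpuinfo")
    if cpuinfo.exists():
        flags = parse_cpu_flags(cpuinfo.read_text())
        print("missing:", missing_flags(flags))
    else:
        print("/proc/cpuinfo not available on this platform")
```

A check like this only confirms which extensions a machine advertises. It says nothing about whether code paths built for them behave identically across vendors, which is the point the thread is debating.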
Re: CI impaired
Hey Folks,

Marco has been running this change in dev, with flying colors, for some time. This is not an experiment but a rollout that was announced. We also decided to make this change after the release cut, to limit the blast radius with respect to any critical obligations to the community. Marco is accountable for this work and will address any issues that may occur, as he has been put on-call. We have, to the best of our ability, mitigated as much risk as possible, and now it is time to pull the trigger. The community will enjoy a bit more visibility and clarity into the test process, which will be advantageous, and the change allows us to extend our infrastructure in a way that affords us more flexibility.

No pending PRs will be impacted.

Thank you for your support as we evolve this system to better serve the community.

-Gavin

On Fri, Nov 30, 2018 at 5:23 PM Marco de Abreu wrote:
> Hello Naveen, this is not an experiment. Everything has been tested in our test system and is considered 100% working. This is not a test but the actual move into production - the merge into master happened a week ago. We now just have to put all PRs into the catalogue, which means that all PRs have to be analyzed with the new pipelines - the only noticeable effect will be that the CI is under higher load.
>
> The pending PRs will not be impacted. The existing pipeline is still running in parallel and everything will behave as before.
>
> -Marco

--
Sincerely,
Gavin M. Bell

"Never mistake a clear view for a short distance." -Paul Saffo
Re: CI impaired
Hello Naveen, this is not an experiment. Everything has been tested in our test system and is considered 100% working. This is not a test but the actual move into production - the merge into master happened a week ago. We now just have to put all PRs into the catalogue, which means that all PRs have to be analyzed with the new pipelines - the only noticeable effect will be that the CI is under higher load.

The pending PRs will not be impacted. The existing pipeline is still running in parallel and everything will behave as before.

-Marco

On Fri, Nov 30, 2018 at 4:41 PM Naveen Swamy wrote:
> Marco, run your experiments on a branch - set up, test it well and then bring it to the master.
>
> > On Nov 30, 2018, at 6:53 AM, Marco de Abreu <marco.g.ab...@googlemail.com.INVALID> wrote:
> >
> > Hello,
> >
> > I'm now moving forward with #1. I will try to get to #3 as soon as possible to reduce parallel jobs in our CI. You might notice some unfinished jobs. I will let you know as soon as this process has been completed. Until then, please bear with me since we have hundreds of jobs to run in order to validate all PRs.
> >
> > Best regards,
> > Marco
> >
> > On Fri, Nov 30, 2018 at 1:36 AM Marco de Abreu <marco.g.ab...@googlemail.com> wrote:
> > > Hello,
> > >
> > > since the release branch has now been cut, I would like to move forward with the CI improvements for the master branch. This would include the following actions:
> > > 1. Re-enable the new Jenkins job
> > > 2. Request Apache Infra to move the protected branch check from the main pipeline to our new ones
> > > 3. Merge https://github.com/apache/incubator-mxnet/pull/13474 - this finalizes the deprecation process
> > >
> > > If nobody objects, I would like to start with #1 soon. Mentors, could you please assist in creating the Apache Infra ticket? I would then take it from there and talk to Infra.
> > >
> > > Best regards,
> > > Marco
> > >
> > > On Mon, Nov 26, 2018 at 2:47 AM kellen sunderland <kellen.sunderl...@gmail.com> wrote:
> > > > Sorry, [1] meant to reference https://issues.jenkins-ci.org/browse/JENKINS-37984 .
> > > >
> > > > On Sun, Nov 25, 2018 at 5:41 PM kellen sunderland <kellen.sunderl...@gmail.com> wrote:
> > > > > Marco and I ran into another urgent issue over the weekend that was causing builds to fail. This issue was unrelated to any feature development work, or other CI fixes applied recently, but it did require quite a bit of work from Marco (and a little from me) to fix.
> > > > >
> > > > > We spent enough time on the problem that it caused us to take a step back and consider how we could both fix issues in CI and support the 1.4 release with the least impact possible on MXNet devs. Marco had planned to make a significant change to the CI to fix a long-standing Jenkins error [1], but we feel that most developers would prioritize having a stable build environment for the next few weeks over having this fix in place.
> > > > >
> > > > > To properly introduce a new CI system, the intent was to do a gradual blue/green rollout of the fix. Managing this rollout would have taken operational effort and double the compute load, as we would run both systems in parallel. This risks outages due to scaling limits, and we’d rather make this change during a period of low developer activity, i.e. shortly after the 1.4 release.
> > > > >
> > > > > This means that from now until the 1.4 release, in order to reduce complexity, MXNet developers should only see a single Jenkins verification check and a single Travis check.
Re: CI impaired
Marco, run your experiments on a branch - set up, test it well and then bring it to the master.

> On Nov 30, 2018, at 6:53 AM, Marco de Abreu wrote:
>
> Hello,
>
> I'm now moving forward with #1. I will try to get to #3 as soon as possible to reduce parallel jobs in our CI. You might notice some unfinished jobs. I will let you know as soon as this process has been completed. Until then, please bear with me since we have hundreds of jobs to run in order to validate all PRs.
>
> Best regards,
> Marco
>
> On Fri, Nov 30, 2018 at 1:36 AM Marco de Abreu wrote:
>
>> Hello,
>>
>> since the release branch has now been cut, I would like to move forward with the CI improvements for the master branch. This would include the following actions:
>> 1. Re-enable the new Jenkins job
>> 2. Request Apache Infra to move the protected branch check from the main pipeline to our new ones
>> 3. Merge https://github.com/apache/incubator-mxnet/pull/13474 - this finalizes the deprecation process
>>
>> If nobody objects, I would like to start with #1 soon. Mentors, could you please assist to create the Apache Infra ticket? I would then take it from there and talk to Infra.
>>
>> Best regards,
>> Marco
>>
>> On Mon, Nov 26, 2018 at 2:47 AM kellen sunderland <kellen.sunderl...@gmail.com> wrote:
>>
>>> Sorry, [1] meant to reference https://issues.jenkins-ci.org/browse/JENKINS-37984 .
>>>
>>> On Sun, Nov 25, 2018 at 5:41 PM kellen sunderland <kellen.sunderl...@gmail.com> wrote:
>>>
>>>> Marco and I ran into another urgent issue over the weekend that was causing builds to fail. This issue was unrelated to any feature development work, or other CI fixes applied recently, but it did require quite a bit of work from Marco (and a little from me) to fix.
>>>>
>>>> We spent enough time on the problem that it caused us to take a step back and consider how we could both fix issues in CI and support the 1.4 release with the least impact possible on MXNet devs. Marco had planned to make a significant change to the CI to fix a long-standing Jenkins error [1], but we feel that most developers would prioritize having a stable build environment for the next few weeks over having this fix in place.
>>>>
>>>> To properly introduce a new CI system the intent was to do a gradual blue/green roll out of the fix. To manage this rollout would have taken operational effort and double compute load as we run systems in parallel. This risks outages due to scaling limits, and we’d rather make this change during a period of low-developer activity, i.e. shortly after the 1.4 release.
>>>>
>>>> This means that from now until the 1.4 release, in order to reduce complexity MXNet developers should only see a single Jenkins verification check, and a single Travis check.
Re: CI impaired
There are still pending PRs that need to be merged and cherry-picked to the branch.

> On Nov 30, 2018, at 6:53 AM, Marco de Abreu wrote:
>
> Hello,
>
> I'm now moving forward with #1. I will try to get to #3 as soon as possible to reduce parallel jobs in our CI. You might notice some unfinished jobs. I will let you know as soon as this process has been completed. Until then, please bear with me since we have hundreds of jobs to run in order to validate all PRs.
>
> Best regards,
> Marco
>
> On Fri, Nov 30, 2018 at 1:36 AM Marco de Abreu wrote:
>
>> Hello,
>>
>> since the release branch has now been cut, I would like to move forward with the CI improvements for the master branch. This would include the following actions:
>> 1. Re-enable the new Jenkins job
>> 2. Request Apache Infra to move the protected branch check from the main pipeline to our new ones
>> 3. Merge https://github.com/apache/incubator-mxnet/pull/13474 - this finalizes the deprecation process
>>
>> If nobody objects, I would like to start with #1 soon. Mentors, could you please assist to create the Apache Infra ticket? I would then take it from there and talk to Infra.
>>
>> Best regards,
>> Marco
>>
>> On Mon, Nov 26, 2018 at 2:47 AM kellen sunderland <kellen.sunderl...@gmail.com> wrote:
>>
>>> Sorry, [1] meant to reference https://issues.jenkins-ci.org/browse/JENKINS-37984 .
>>>
>>> On Sun, Nov 25, 2018 at 5:41 PM kellen sunderland <kellen.sunderl...@gmail.com> wrote:
>>>
>>>> Marco and I ran into another urgent issue over the weekend that was causing builds to fail. This issue was unrelated to any feature development work, or other CI fixes applied recently, but it did require quite a bit of work from Marco (and a little from me) to fix.
>>>>
>>>> We spent enough time on the problem that it caused us to take a step back and consider how we could both fix issues in CI and support the 1.4 release with the least impact possible on MXNet devs. Marco had planned to make a significant change to the CI to fix a long-standing Jenkins error [1], but we feel that most developers would prioritize having a stable build environment for the next few weeks over having this fix in place.
>>>>
>>>> To properly introduce a new CI system the intent was to do a gradual blue/green roll out of the fix. To manage this rollout would have taken operational effort and double compute load as we run systems in parallel. This risks outages due to scaling limits, and we’d rather make this change during a period of low-developer activity, i.e. shortly after the 1.4 release.
>>>>
>>>> This means that from now until the 1.4 release, in order to reduce complexity MXNet developers should only see a single Jenkins verification check, and a single Travis check.
Re: CI impaired
Hello,

I'm now moving forward with #1. I will try to get to #3 as soon as possible to reduce parallel jobs in our CI. You might notice some unfinished jobs. I will let you know as soon as this process has been completed. Until then, please bear with me since we have hundreds of jobs to run in order to validate all PRs.

Best regards,
Marco

On Fri, Nov 30, 2018 at 1:36 AM Marco de Abreu wrote:

> Hello,
>
> since the release branch has now been cut, I would like to move forward with the CI improvements for the master branch. This would include the following actions:
> 1. Re-enable the new Jenkins job
> 2. Request Apache Infra to move the protected branch check from the main pipeline to our new ones
> 3. Merge https://github.com/apache/incubator-mxnet/pull/13474 - this finalizes the deprecation process
>
> If nobody objects, I would like to start with #1 soon. Mentors, could you please assist to create the Apache Infra ticket? I would then take it from there and talk to Infra.
>
> Best regards,
> Marco
>
> On Mon, Nov 26, 2018 at 2:47 AM kellen sunderland <kellen.sunderl...@gmail.com> wrote:
>
>> Sorry, [1] meant to reference https://issues.jenkins-ci.org/browse/JENKINS-37984 .
>>
>> On Sun, Nov 25, 2018 at 5:41 PM kellen sunderland <kellen.sunderl...@gmail.com> wrote:
>>
>> > Marco and I ran into another urgent issue over the weekend that was causing builds to fail. This issue was unrelated to any feature development work, or other CI fixes applied recently, but it did require quite a bit of work from Marco (and a little from me) to fix.
>> >
>> > We spent enough time on the problem that it caused us to take a step back and consider how we could both fix issues in CI and support the 1.4 release with the least impact possible on MXNet devs. Marco had planned to make a significant change to the CI to fix a long-standing Jenkins error [1], but we feel that most developers would prioritize having a stable build environment for the next few weeks over having this fix in place.
>> >
>> > To properly introduce a new CI system the intent was to do a gradual blue/green roll out of the fix. To manage this rollout would have taken operational effort and double compute load as we run systems in parallel. This risks outages due to scaling limits, and we’d rather make this change during a period of low-developer activity, i.e. shortly after the 1.4 release.
>> >
>> > This means that from now until the 1.4 release, in order to reduce complexity MXNet developers should only see a single Jenkins verification check, and a single Travis check.