Re: [Proposal] Stabilizing Apache MXNet CI build system

2017-11-06 Thread Chris Olivier
After a decision is reached, I am willing to add tasks to the Apache MXNet JIRA.

On Mon, Nov 6, 2017 at 6:15 AM, Pedro Larroy 
wrote:

> Thanks for setting up the document guys, looks like a solid basis to
> start to work on!
>
> Marco, Kellen and I have already added some comments.
>
> Pedro
>
>
> On Sun, Nov 5, 2017 at 3:43 AM, Meghna Baijal
>  wrote:
> > Kellen, Thank you for your comments in the doc.
> > Sure Steffen, I will continue to merge everyone’s comments into the doc and
> > work with Pedro to finalize it.
> > And then we can vote on the options.
> >
> > Thanks,
> > Meghna Baijal
> >
> >
> > On Sat, Nov 4, 2017 at 6:34 AM, Steffen Rochel 
> > wrote:
> >
> >> Sandeep and Meghna have been working in the background collecting input and
> >> preparing a doc. I suggest we drive the discussion forward and would like to
> >> ask everybody to contribute to
> >> https://docs.google.com/document/d/17PEasQ2VWrXi2Cf7IGZSWGZMawxDkdlavUDASzUmLjk/edit?usp=sharing
> >>
> >> Let's converge on requirements and architecture, so we can move forward with
> >> implementation.
> >>
> >> I would like to suggest that Pedro and Meghna lead the discussion and
> >> help to resolve suggestions.
> >>
> >> I assume we need a vote once we have converged on a good draft to call it a
> >> plan and move forward with implementation. As we are all unhappy with the
> >> current CI situation I would also suggest a phased approach, so we can get
> >> back to reliable and efficient basic CI quickly and add advanced
> >> capabilities over time.
> >>
> >> Steffen
> >>
> >> On Wed, Nov 1, 2017 at 1:14 PM kellen sunderland <
> >> kellen.sunderl...@gmail.com> wrote:
> >>
> >> > Hey Henri, I think that's what a few of us are advocating.  Running a set
> >> > of quick tests as part of the PR process, and then a more detailed
> >> > regression test suite periodically (say every 4 hours). This fits nicely
> >> > into a tagging or 2 branch development system.  Commits will be tagged (or
> >> > merged into a stable branch) as soon as they pass the detailed regression
> >> > testing.
> >> >
> >> > On Wed, Nov 1, 2017 at 9:07 PM, Hen  wrote:
> >> >
> >> > > Random question - can the CI be split such that the Apache CI is doing a
> >> > > basic set of checks on that hardware, and is hooked to a PR, while there is
> >> > > a larger "Is trunk good for release?" test that is running periodically
> >> > > rather than on every PR?
> >> > >
> >> > > ie: do we need each PR to be run on varied hardware, or can we have this
> >> > > two tier approach?
> >> > >
> >> > > Hen
> >> > >
> >> > > On Fri, Oct 20, 2017 at 1:01 PM, sandeep krishnamurthy <
> >> > > sandeep.krishn...@gmail.com> wrote:
> >> > >
> >> > > > Hello all,
> >> > > >
> >> > > > I am hereby opening up a discussion thread on how we can stabilize the
> >> > > > Apache MXNet CI build system.
> >> > > >
> >> > > > Problems:
> >> > > >
> >> > > > 
> >> > > >
> >> > > > Recently, we have seen the following issues with the Apache MXNet CI build
> >> > > > systems:
> >> > > >
> >> > > >    1. Apache Jenkins master is overloaded and we see issues such as being
> >> > > >       unable to trigger builds, and difficulty loading and viewing the Blue
> >> > > >       Ocean and other Jenkins build status pages.
> >> > > >    2. We are generating too many requests/interactions for the Apache Infra
> >> > > >       team:
> >> > > >       1. Addition/deletion of slaves: caused by scaling activity,
> >> > > >          recycling, troubleshooting or any actions leading to a change of
> >> > > >          slave machines.
> >> > > >       2. Plugins / other Jenkins master configurations.
> >> > > >       3. Experimentation on CI pipelines.
> >> > > >    3. Harder to debug and resolve issues: since access to master and slave
> >> > > >       is not with the same community, it requires Infra and community to
> >> > > >       dive deep together on all action items.
> >> > > >
> >> > > > Possible Solutions:
> >> > > >
> >> > > > ==
> >> > > >
> >> > > >    1. Can we set up a separate Jenkins CI build system for Apache MXNet
> >> > > >       outside Apache Infra?
> >> > > >    2. Can we have a separate Jenkins Master in Apache Infra for MXNet?
> >> > > >    3. Review design of current setup, refine and fill the gaps.
> >> > > >
> >> > > > @ Mentors/Infra team/Community:
> >> > > >
> >> > > > ==
> >> > > >
> >> > > > Please provide your suggestions on how we can proceed further and work on
> >> > > > stabilizing the CI build systems for MXNet.
> >> > > >
> >> > > > Also, if the community decides on a separate Jenkins CI build system, what
> >> > > > important points should be taken care of apart from the below:
> >> > > >
> >> > > >    1. Community being able to access the build page for build statuses.

Re: mxnet ndarray inference in js

2017-11-06 Thread TongKe Xue
Hi Hagay,

  Good point. The high level problem is:

  I want to run mxnet training on GPU, and inference on CPU --
browsers / javascript in particular.

  On the training side, I'm dealing mostly with NDArray / doing my own
gradient calculations / optimization. I would like some library for
the client side, where I can just 'port over my mxnet ndarray graph'
and start running inference.

  I hope this motivates the issue.

--TongKe

On Mon, Nov 6, 2017 at 10:58 AM, Lupesko, Hagay  wrote:
> TongKe,
>
> What’s the use case you are after?
> Answering this question may help us help you.
>
> Hagay
>
> On 11/2/17, 12:10, "TongKe Xue"  wrote:
>
> Hi,
>
>   I'm looking for a js library compatible with mxnet/ndarray.
>
> 1. I am aware of https://github.com/dmlc/mxnet.js/
> However:
> a. that appears to be all of mxnet, not ndarray
> b. that appears to only support Python models, whereas I'm using Java/Scala
>
> 2. I am aware of https://github.com/scijs/ndarray
> However:
> this appears to be a different library; I would have to try to create a
> unifying API over both
>
>   Back to my original question -- is there some JS API that is
> directly compatible with mxnet's ndarray interface?
>
> Thanks,
> --TongKe
>
>
>


[jira] [Updated] (MXNET-1) EPIC: Create independent Jenkins Server for MXNet CI

2017-11-06 Thread Suneel Marthi (JIRA)

 [ 
https://issues.apache.org/jira/browse/MXNET-1?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Suneel Marthi updated MXNET-1:
--
Component/s: CI Build 

> EPIC: Create independent Jenkins Server for MXNet CI
> 
>
> Key: MXNET-1
> URL: https://issues.apache.org/jira/browse/MXNET-1
> Project: Apache MXNet
>  Issue Type: New Feature
>  Components: CI Build 
>Reporter: Chris Olivier
>
> Contains subtasks for creating non-Apache CI Jenkins system





Re: mxnet ndarray inference in js

2017-11-06 Thread Lupesko, Hagay
TongKe,

What’s the use case you are after?
Answering this question may help us help you.

Hagay

On 11/2/17, 12:10, "TongKe Xue"  wrote:

Hi,

  I'm looking for a js library compatible with mxnet/ndarray.

1. I am aware of https://github.com/dmlc/mxnet.js/
However:
a. that appears to be all of mxnet, not ndarray
b. that appears to only support Python models, whereas I'm using Java/Scala

2. I am aware of https://github.com/scijs/ndarray
However:
this appears to be a different library; I would have to try to create a
unifying API over both

  Back to my original question -- is there some JS API that is
directly compatible with mxnet's ndarray interface?

Thanks,
--TongKe





Re: what is NDArrayFuncReturn ?

2017-11-06 Thread YiZhi Liu
This is an internal class for wrapping the NDArray(s) returned by function
calls. It handles functions that return more than one array, and
tracks the array dependencies.

You can call get() or apply() to get the NDArray.
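
To make the wrapper idea concrete, here is a small sketch in Python (not Scala, and not the MXNet implementation). `FuncReturn` and `toy_minus` are hypothetical stand-ins invented for illustration, and the sketch assumes that get() is meant for the single-result case while apply(i) picks one result out of several. Dependency tracking is not modelled.

```python
from typing import List, Sequence


class FuncReturn:
    """Hypothetical stand-in for a wrapper around one or more returned arrays."""

    def __init__(self, arrays: Sequence[List[float]]):
        self._arrays = list(arrays)

    def apply(self, i: int) -> List[float]:
        # Indexed access, for functions that return several arrays.
        return self._arrays[i]

    def head(self) -> List[float]:
        # First returned array.
        return self.apply(0)

    def get(self) -> List[float]:
        # Assumed here to be valid only when exactly one array was returned.
        if len(self._arrays) != 1:
            raise ValueError("get() expects a single result; use apply(i) instead")
        return self.head()


def toy_minus(a: List[float], b: float) -> FuncReturn:
    # Toy operation that returns its single result inside the wrapper.
    return FuncReturn([[x - b for x in a]])


result = toy_minus([3.0, 5.0], 1.0).get()
```

In this sketch the caller unwraps once with get() and then works with the plain array, which is the pattern described above.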

2017-11-05 11:55 GMT+08:00 TongKe Xue :
> Hi,
>
> 1. I'm running into issues with NDArrayFuncReturn vs NDArray. In particular:
>
> 1a. Transpose is defined on NDArray, but does not appear to be defined
> on NDArrayFuncReturn:
> https://github.com/apache/incubator-mxnet/blob/master/scala-package/core/src/main/scala/ml/dmlc/mxnet/NDArray.scala#L665
>
> 2. After calling broadcast_minus, I get an NDArrayFuncReturn
>
> 3. There are functions 'head, get, apply' for doing NDArrayFuncReturn
> -> NDArray:
> https://github.com/apache/incubator-mxnet/blob/master/scala-package/core/src/main/scala/ml/dmlc/mxnet/NDArray.scala#L1099-L1104
>
> So:
>   * What is NDArrayFuncReturn ?
>   * If I'm getting NDArrayFuncReturn (a private class) returned, am I
> calling some internal API I should not be using ?
>   * What is the correct way to do NDArrayFuncReturn -> NDArray ?
>
> Thanks,
> --TongKe



-- 
Yizhi Liu
DMLC member
Technical Manager
Qihoo 360 Inc, Shanghai, China


Re: Running tests in parallel

2017-11-06 Thread Chris Olivier
That’d be great.

On Mon, Nov 6, 2017 at 7:04 AM kellen sunderland <
kellen.sunderl...@gmail.com> wrote:

> Yeah, I think the issue is related to the setup / teardown of a few test
> fixtures. When I have some more time I'll try to narrow down what's wrong
> with specific tests.  Some tests may not be reentrant.
> Some tests work well, for example python3 -m nose --verbose --processes 2
> test_gluon, but test_operator just starts reporting errors after 20 or so
> tests.
>
> On Mon, Nov 6, 2017 at 3:58 PM, Chris Olivier 
> wrote:
>
> > I’ve never tried that but it certainly seems like it would help CI speeds,
> > especially since we don’t always use 100% CPU and almost never 100% GPU for
> > tests
> >
> > On Mon, Nov 6, 2017 at 6:43 AM kellen sunderland <
> > kellen.sunderl...@gmail.com> wrote:
> >
> > > Hey all,
> > >
> > > Just wanted to ask before I dive too deeply on this. Does anyone know why
> > > tests fail when run in multiprocess mode?  For example: python3 -m nose
> > > --verbose --processes 2
> > >
> > > I've verified this isn't an OOM error, there should be plenty of GPU memory
> > > on the instance I'm using.  I've also been watching nvidia-smi closely
> > > during the failures.
> > >
> > > -Kellen
> > >
> >
>


Re: Running tests in parallel

2017-11-06 Thread kellen sunderland
Yeah, I think the issue is related to the setup / teardown of a few test
fixtures. When I have some more time I'll try to narrow down what's wrong
with specific tests.  Some tests may not be reentrant.
Some tests work well, for example python3 -m nose --verbose --processes 2
test_gluon, but test_operator just starts reporting errors after 20 or so
tests.
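
As an illustration of the kind of fixture issue mentioned above: one common reason tests break under parallel workers is setup / teardown code that shares state, such as a fixed scratch path. Whether that is what affects test_operator is not established in this thread. The sketch below uses only the standard library; `IsolatedScratchTest` and `run_suite` are hypothetical names, and the pattern shown is simply one way to give each test its own scratch directory.

```python
import os
import shutil
import tempfile
import unittest


class IsolatedScratchTest(unittest.TestCase):
    """Each test gets its own scratch directory, so parallel workers cannot collide."""

    def setUp(self):
        # mkdtemp creates a uniquely named directory for this test only.
        self.scratch = tempfile.mkdtemp(prefix="scratch_")

    def tearDown(self):
        # Remove only what this test created.
        shutil.rmtree(self.scratch, ignore_errors=True)

    def test_write_then_read(self):
        path = os.path.join(self.scratch, "params.bin")
        with open(path, "wb") as f:
            f.write(b"\x00\x01")
        with open(path, "rb") as f:
            self.assertEqual(f.read(), b"\x00\x01")


def run_suite() -> unittest.TestResult:
    # Load and run the test case above, returning the result object.
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(IsolatedScratchTest)
    return unittest.TextTestRunner(verbosity=0).run(suite)
```

Because setUp and tearDown touch nothing outside the per-test directory, the test behaves the same whether it runs alone or next to other workers.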

On Mon, Nov 6, 2017 at 3:58 PM, Chris Olivier  wrote:

> I’ve never tried that but it certainly seems like it would help CI speeds,
> especially since we don’t always use 100% CPU and almost never 100% GPU for
> tests
>
> On Mon, Nov 6, 2017 at 6:43 AM kellen sunderland <
> kellen.sunderl...@gmail.com> wrote:
>
> > Hey all,
> >
> > Just wanted to ask before I dive too deeply on this. Does anyone know why
> > tests fail when run in multiprocess mode?  For example: python3 -m nose
> > --verbose --processes 2
> >
> > I've verified this isn't an OOM error, there should be plenty of GPU memory
> > on the instance I'm using.  I've also been watching nvidia-smi closely
> > during the failures.
> >
> > -Kellen
> >
>


Re: Running tests in parallel

2017-11-06 Thread Chris Olivier
I’ve never tried that but it certainly seems like it would help CI speeds,
especially since we don’t always use 100% CPU and almost never 100% GPU for
tests

On Mon, Nov 6, 2017 at 6:43 AM kellen sunderland <
kellen.sunderl...@gmail.com> wrote:

> Hey all,
>
> Just wanted to ask before I dive too deeply on this. Does anyone know why
> tests fail when run in multiprocess mode?  For example: python3 -m nose
> --verbose --processes 2
>
> I've verified this isn't an OOM error, there should be plenty of GPU memory
> on the instance I'm using.  I've also been watching nvidia-smi closely
> during the failures.
>
> -Kellen
>