RE: Stochastic gradient descent performance

2015-04-02 Thread Ulanov, Alexander
Hi Shivaram, It sounds really interesting! With this time we can estimate whether it is worth considering running an iterative algorithm on Spark. For example, for SGD on Imagenet (450K samples) we will spend 450K*50ms=6.25 hours to traverse all data one example at a time, not considering the data loading,
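
The estimate in this message is a single multiplication, so it is easy to check. The sketch below is only an illustration: it takes the two figures quoted above (450K samples, 50 ms per example) at face value, and the function name `full_pass_hours` is hypothetical, not something from the thread or from Spark.

```python
def full_pass_hours(num_samples: int, per_sample_ms: float) -> float:
    """Hours needed to visit every sample once at a fixed per-sample cost."""
    total_seconds = num_samples * per_sample_ms / 1000.0  # ms -> s
    return total_seconds / 3600.0                         # s -> h


# Figures quoted in the message: 450K samples, 50 ms per example.
hours = full_pass_hours(450_000, 50.0)
print(f"{hours:.2f} hours per pass")
```

With these inputs the total is 22,500 seconds. Data loading and any aggregation overhead are not included, as the message itself notes.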

Re: Unit test logs in Jenkins?

2015-04-02 Thread Steve Loughran
On 2 Apr 2015, at 06:31, Patrick Wendell pwend...@gmail.com wrote: Hey Marcelo, Great question. Right now, some of the more active developers have an account that allows them to log into this cluster to inspect logs (we copy the logs from each run to a node on that cluster). The

Re: org.spark-project.jetty and guava repo locations

2015-04-02 Thread Ted Yu
Take a look at the maven-shade-plugin in pom.xml. Here is the snippet for org.spark-project.jetty: <relocation> <pattern>org.eclipse.jetty</pattern> <shadedPattern>org.spark-project.jetty</shadedPattern> <includes>

Re: Stochastic gradient descent performance

2015-04-02 Thread Joseph Bradley
When you say "It seems that instead of sample it is better to shuffle data and then access it sequentially by mini-batches," are you sure that holds true for a big dataset in a cluster? As far as implementing it, I haven't looked carefully at GapSamplingIterator (in RandomSampler.scala) myself, but

Re: Unit test logs in Jenkins?

2015-04-02 Thread shane knapp
i agree with all of this. but can we please break up the tests and make them shorter? :) On Thu, Apr 2, 2015 at 8:54 AM, Nicholas Chammas nicholas.cham...@gmail.com wrote: This is secondary to Marcelo’s question, but I wanted to comment on this: Its main limitation is more cultural than

Test all the things (Was: Unit test logs in Jenkins?)

2015-04-02 Thread Nicholas Chammas
(Renaming thread so as to un-hijack Marcelo's request.) Sure, we definitely want tests running faster. Part of testing all the things will be factoring out stuff from the various builds that can be run just once. We've also tried in the past (with little success) to parallelize test execution

RE: Stochastic gradient descent performance

2015-04-02 Thread Ulanov, Alexander
Hi Joseph, Thank you for the suggestion! It seems that instead of sample it is better to shuffle data and then access it sequentially by mini-batches. Could you suggest how to implement it? With regards to aggregate (reduce), I am wondering why it is so slow in local mode? Could you elaborate
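
The "shuffle once, then read mini-batches in order" idea raised in this message can be sketched on a single machine as follows. This is a plain in-memory Python illustration, not Spark code and not the approach used by RandomSampler.scala; the name `shuffled_minibatches` and its parameters are hypothetical. Whether the same pattern pays off for a large dataset spread over a cluster is exactly the open question in the thread.

```python
import random
from typing import Iterator, List, Sequence, TypeVar

T = TypeVar("T")


def shuffled_minibatches(
    data: Sequence[T], batch_size: int, seed: int = 0
) -> Iterator[List[T]]:
    """Shuffle the index order once, then yield consecutive mini-batches."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    order = list(range(len(data)))
    random.Random(seed).shuffle(order)  # one shuffle per pass over the data
    for start in range(0, len(order), batch_size):
        # Each element appears in exactly one batch; the last may be shorter.
        yield [data[i] for i in order[start:start + batch_size]]


for batch in shuffled_minibatches(list(range(10)), batch_size=4):
    print(batch)
```

Compared with drawing a fresh random sample for every step, this visits each example exactly once per pass and reads the shuffled order sequentially; using a different seed for each pass gives a new order.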

Re: Unit test logs in Jenkins?

2015-04-02 Thread Marcelo Vanzin
On Thu, Apr 2, 2015 at 3:01 AM, Steve Loughran ste...@hortonworks.com wrote: That would be really helpful to debug build failures. The scalatest output isn't all that helpful. Potentially an issue with the test runner, rather than the tests themselves. Sorry, that was me over-generalizing.