Re: MLLib - Thoughts about refactoring Updater for LBFGS?

2014-04-07 Thread Debasish Das
By the way... what's the idea? The labeled data set is an RDD which is cached on all nodes. Is the BFGS solver maintained on the master, or is each worker supposed to maintain its own BFGS? On Mon, Apr 7, 2014 at 11:23 PM, Debasish Das wrote: > I got your checkin. I need to run logistic reg

Re: MLLib - Thoughts about refactoring Updater for LBFGS?

2014-04-07 Thread Debasish Das
I got your checkin. I need to run logistic regression SGD vs BFGS for my current use cases, but your next checkin will update the logistic regression with LBFGS, right? Are you adding it to the regression package as well? Thanks. Deb On Mon, Apr 7, 2014 at 7:00 PM, DB Tsai wrote: > Hi guys, > > T

Re: Contributing to Spark

2014-04-07 Thread Matei Zaharia
I’d suggest looking for the issues labeled “Starter” on JIRA. You can find them here: https://issues.apache.org/jira/browse/SPARK-1438?jql=project%20%3D%20SPARK%20AND%20labels%20%3D%20Starter%20AND%20status%20in%20(Open%2C%20%22In%20Progress%22%2C%20Reopened) Matei On Apr 7, 2014, at 9:45 PM, M

Re: Contributing to Spark

2014-04-07 Thread Mukesh G
Hi Sujeet, Thanks. I went through the website and it looks great. Is there a list of items that I can choose from, for contribution? Thanks Mukesh On Mon, Apr 7, 2014 at 10:14 PM, Sujeet Varakhedi wrote: > This is a good place to start: > https://cwiki.apache.org/confluence/display/SPARK/Contri

Re: MLLib - Thoughts about refactoring Updater for LBFGS?

2014-04-07 Thread DB Tsai
Hi guys, The latest PR uses Breeze's L-BFGS implementation, which was introduced by Xiangrui's sparse input format work in SPARK-1212. https://github.com/apache/spark/pull/353 Now, it works with the new sparse framework! Any feedback would be greatly appreciated. Thanks. Sincerely, DB Tsai

Re: Spark Streaming and Flume Avro RPC Servers

2014-04-07 Thread Christophe Clapp
Cool. I'll look at making the code change in FlumeUtils and generating a pull request. As far as the use case, the volume of messages we have is currently about 30 MB per second which may grow to over what a 1 Gbit network adapter can handle. - Christophe On Apr 7, 2014 1:51 PM, "Michael Ernest"

Re: Spark Streaming and Flume Avro RPC Servers

2014-04-07 Thread Michael Ernest
I don't see why not. If one were doing something similar with straight Flume, you'd start an agent on each node where you care to receive Avro/RPC events. In the absence of clearer insight into your use case, I'm puzzling just a little over why it's necessary for each Worker to be its own receiver, but there's

Re: Spark Streaming and Flume Avro RPC Servers

2014-04-07 Thread Christophe Clapp
Could it be as simple as just changing FlumeUtils to accept a list of host/port number pairs to start the RPC servers on? On 4/7/14, 12:58 PM, Christophe Clapp wrote: Based on the source code here: https://github.com/apache/spark/blob/master/external/flume/src/main/scala/org/apache/spark/strea

Re: Spark Streaming and Flume Avro RPC Servers

2014-04-07 Thread Christophe Clapp
Based on the source code here: https://github.com/apache/spark/blob/master/external/flume/src/main/scala/org/apache/spark/streaming/flume/FlumeUtils.scala It looks like in its current version, FlumeUtils does not support starting an Avro RPC server on more than one worker. - Christophe On 4/7

Re: Spark Streaming and Flume Avro RPC Servers

2014-04-07 Thread Christophe Clapp
Right, but at least in my case, no avro RPC server was started on any of the spark worker nodes except for one. I don't know if that's just some configuration issue with my setup or if it's expected behavior. I would need spark to start avro RPC servers on every worker rather than just one. - Chri

Re: Spark Streaming and Flume Avro RPC Servers

2014-04-07 Thread Michael Ernest
You can configure your sinks to write to one or more Avro sources in a load-balanced configuration. https://flume.apache.org/FlumeUserGuide.html#flume-sink-processors mfe On Mon, Apr 7, 2014 at 3:19 PM, Christophe Clapp wrote: > Hi, > > From my testing of Spark Streaming with Flume, it seems t

Spark Streaming and Flume Avro RPC Servers

2014-04-07 Thread Christophe Clapp
Hi, From my testing of Spark Streaming with Flume, it seems that there's only one of the Spark worker nodes that runs a Flume Avro RPC server to receive messages at any given time, as opposed to every Spark worker running an Avro RPC server to receive messages. Is this the case? Our use-case

Re: ALS array index out of bound with 50 factors

2014-04-07 Thread Xiangrui Meng
Hi Deb, It would be helpful if you could attach the logs. It is strange to see that you can make 4 iterations but not 10. Xiangrui On Mon, Apr 7, 2014 at 10:36 AM, Debasish Das wrote: > I am using master... > > No negative indexes... > > If I run with 4 iterations it runs fine and I can generat

Re: Flaky streaming tests

2014-04-07 Thread Michael Armbrust
I agree these should be disabled right away, and the JIRA can be used to track fixing / turning them back on. On Mon, Apr 7, 2014 at 11:33 AM, Michael Armbrust wrote: > There is a JIRA for one of the flaky tests here: > https://issues.apache.org/jira/browse/SPARK-1409 > > > On Mon, Apr 7, 2014

Re: Flaky streaming tests

2014-04-07 Thread Tathagata Das
Yes, I will take a look at those tests ASAP. TD On Mon, Apr 7, 2014 at 11:32 AM, Patrick Wendell wrote: > TD - do you know what is going on here? > > I looked into this a bit and at least a few of these use > Thread.sleep() and assume the sleep will be exact, which is wrong. We > should

Re: Flaky streaming tests

2014-04-07 Thread Michael Armbrust
There is a JIRA for one of the flaky tests here: https://issues.apache.org/jira/browse/SPARK-1409 On Mon, Apr 7, 2014 at 11:32 AM, Patrick Wendell wrote: > TD - do you know what is going on here? > > I looked into this a bit and at least a few of these use > Thread.sleep() and assume the

Re: Flaky streaming tests

2014-04-07 Thread Patrick Wendell
TD - do you know what is going on here? I looked into this a bit and at least a few of these use Thread.sleep() and assume the sleep will be exact, which is wrong. We should disable all the tests that do this, and they should probably be rewritten to virtualize time. - Patrick On Mon, Apr 7, 20
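
As a rough illustration of what "virtualizing time" could mean for a test, here is a minimal Python sketch. The `ManualClock` class and its methods are hypothetical names made up for this example and are not Spark's test utilities; the idea is only that the test advances time explicitly instead of sleeping and hoping the sleep is exact.

```python
import heapq


class ManualClock:
    """Hypothetical test clock: time moves only when the test advances it."""

    def __init__(self):
        self.now = 0
        self._timers = []  # heap of (due_time, sequence, callback)
        self._seq = 0      # tie-breaker so callbacks are never compared

    def call_at(self, due, callback):
        """Schedule `callback` to run when the clock reaches `due`."""
        heapq.heappush(self._timers, (due, self._seq, callback))
        self._seq += 1

    def advance(self, delta):
        """Move time forward and fire every timer that has come due, in order."""
        target = self.now + delta
        while self._timers and self._timers[0][0] <= target:
            due, _, callback = heapq.heappop(self._timers)
            self.now = due
            callback()
        self.now = target


# Pretend a batch is produced every 1000 ms of virtual time.
clock = ManualClock()
batches = []
for t in (1000, 2000, 3000):
    clock.call_at(t, lambda t=t: batches.append(t))

# No real waiting: the test decides exactly how much time has passed.
clock.advance(2500)
```

Because nothing depends on the wall clock, a loaded build machine cannot change which timers have fired when the assertions run.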

Re: Flaky streaming tests

2014-04-07 Thread Nan Zhu
I hit this issue when Jenkins seemed to be very busy On Monday, April 7, 2014, Kay Ousterhout wrote: > Hi all, > > The InputStreamsSuite seems to have some serious flakiness issues -- I've > seen the file input stream fail many times and now I'm seeing some actor > input stream test failures

Flaky streaming tests

2014-04-07 Thread Kay Ousterhout
Hi all, The InputStreamsSuite seems to have some serious flakiness issues -- I've seen the file input stream fail many times and now I'm seeing some actor input stream test failures ( https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/13846/consoleFull) on what I think is an unrela

Re: ALS array index out of bound with 50 factors

2014-04-07 Thread Debasish Das
I am using master... No negative indexes... If I run with 4 iterations it runs fine and I can generate factors... With 10 iterations the run fails with array index out of bounds... 25m users and 3m products are within int limits. Would it help if I pointed you to the logs for both runs? I

Re: ALS array index out of bound with 50 factors

2014-04-07 Thread Xiangrui Meng
Hi Deb, This thread is for the out-of-bound error you described. I don't think the number of iterations has any effect here. My questions were: 1) Are you using the master branch or a particular commit? 2) Do you have negative or out-of-integer-range user or product ids? Try to print out the max
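
For readers who want to run the kind of id check asked about in this message, a minimal plain-Python sketch follows. The function name, the `(user_id, product_id, rating)` tuple layout, and the sample values are assumptions for illustration; this is not MLlib code.

```python
INT_MAX = 2**31 - 1  # largest value of a 32-bit signed JVM Int


def id_range_report(ratings):
    """Report min/max user and product ids and whether all fit in a non-negative Int.

    `ratings` is a non-empty iterable of (user_id, product_id, rating) tuples.
    """
    rows = list(ratings)  # materialize so the data can be scanned twice
    users = [r[0] for r in rows]
    products = [r[1] for r in rows]
    report = {
        "user_min": min(users),
        "user_max": max(users),
        "product_min": min(products),
        "product_max": max(products),
    }
    # Ids are usable as Int array indexes only if they are in [0, INT_MAX].
    keys = ("user_min", "user_max", "product_min", "product_max")
    report["ok"] = all(0 <= report[k] <= INT_MAX for k in keys)
    return report


# Made-up sample: the third rating has a user id beyond the Int range.
sample = [
    (0, 10, 4.0),
    (24_999_999, 2_999_999, 1.0),
    (3_000_000_000, 5, 2.0),
]
report = id_range_report(sample)
```

On a real data set the same min/max scan would be done with a distributed reduce rather than in-memory lists.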

Re: Contributing to Spark

2014-04-07 Thread Sujeet Varakhedi
This is a good place to start: https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark Sujeet On Mon, Apr 7, 2014 at 9:20 AM, Mukesh G wrote: > Hi, > > How can I contribute to Spark and its associated projects? > > Appreciate the help... > > Thanks > > Mukesh >

Contributing to Spark

2014-04-07 Thread Mukesh G
Hi, How can I contribute to Spark and its associated projects? Appreciate the help... Thanks Mukesh

Re: tachyon dependency

2014-04-07 Thread Haoyuan Li
Tachyon is Java 6 compatible from version 0.4. Besides putting input/output data in Tachyon ( http://tachyon-project.org/Running-Spark-on-Tachyon.html ), Spark applications can also persist data into Tachyon ( https://github.com/apache/spark/blob/master/docs/scala-programming-guide.md ). On Mon, A

tachyon dependency

2014-04-07 Thread Koert Kuipers
I noticed there is a dependency on Tachyon in Spark core 1.0.0-SNAPSHOT. How does that work? I believe Tachyon is written in Java 7, yet Spark claims to be Java 6 compatible.

Re: ALS array index out of bound with 50 factors

2014-04-07 Thread Debasish Das
Nick, I already have this code which calls dictionary generation and then maps string etc to ints... I think the core algorithm should stay in ints... if you like I can add this code in MFUtils.scala... that's the convention I followed similar to MLUtils.scala... actually these functions should be ev

Re: ALS array index out of bound with 50 factors

2014-04-07 Thread Nick Pentreath
On the partitioning / id keys. If we would look at hash partitioning, how feasible will it be to just allow the user and item ids to be strings? A lot of the time these ids are strings anyway (UUIDs and so on), and it's really painful to translate between String <-> Int the whole time. Are there a
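
To make the String <-> Int translation discussed in this thread concrete, here is a small Python sketch of a dictionary-generation step. The function name and the sample ids are hypothetical and chosen for illustration; this is not the code referred to in the thread.

```python
def build_id_dictionary(string_ids):
    """Assign each distinct string id a dense int index, in first-seen order."""
    to_int = {}
    for s in string_ids:
        if s not in to_int:
            to_int[s] = len(to_int)
    # Reverse table: position i holds the string that was mapped to int i.
    to_str = [None] * len(to_int)
    for s, i in to_int.items():
        to_str[i] = s
    return to_int, to_str


# Made-up ratings keyed by string ids.
raw = [
    ("u-9f2c", "item-a", 5.0),
    ("u-11b0", "item-b", 3.0),
    ("u-9f2c", "item-b", 1.0),
]
user_to_int, int_to_user = build_id_dictionary(u for u, _, _ in raw)
item_to_int, int_to_item = build_id_dictionary(i for _, i, _ in raw)

# The core algorithm would see only ints; the reverse tables decode results.
encoded = [(user_to_int[u], item_to_int[i], r) for u, i, r in raw]
```

Keeping the mapping outside the factorization itself matches the position taken in the thread that the core algorithm should stay in ints.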