Re: MLLib - Thoughts about refactoring Updater for LBFGS?

2014-04-08 Thread Debasish Das
I got your checkin. I need to run logistic regression SGD vs. BFGS for my current use cases, but your next checkin will update the logistic regression with LBFGS, right? Are you adding it to the regression package as well? Thanks. Deb On Mon, Apr 7, 2014 at 7:00 PM, DB Tsai dbt...@stanford.edu

Re: MLLib - Thoughts about refactoring Updater for LBFGS?

2014-04-08 Thread Debasish Das
By the way, what's the idea? The labeled data set is an RDD which is cached on all nodes. Is the BFGS solver maintained on the master, or is each worker supposed to maintain its own BFGS... On Mon, Apr 7, 2014 at 11:23 PM, Debasish Das debasish.da...@gmail.com wrote: I got your checkinI

Re: Contributing to Spark

2014-04-08 Thread Aaron Davidson
Matei's link seems to point to a specific starter project as part of the starter list, but here is the list itself: https://issues.apache.org/jira/issues/?jql=project%20%3D%20SPARK%20AND%20labels%20%3D%20Starter%20AND%20status%20in%20(Open%2C%20%22In%20Progress%22%2C%20Reopened) On Mon, Apr 7,

Re: Contributing to Spark

2014-04-08 Thread Michael Ernest
Ha ha! Nice try, sheepherder! ;-) On Tue, Apr 8, 2014 at 12:37 PM, Matei Zaharia matei.zaha...@gmail.com wrote: Shh, maybe I really wanted people to fix that one issue. On Apr 8, 2014, at 9:34 AM, Aaron Davidson ilike...@gmail.com wrote: Matei's link seems to point to a specific starter

Apache Spark and Graphx for Real Time Analytics

2014-04-08 Thread love2dishtech
Hi, Is GraphX, on top of Apache Spark, able to process large-scale distributed graph traversal and computation in real time? What is the query execution engine that distributes the query on top of GraphX and Apache Spark? My typical use case is a large scale distributed graph traversal in real

Re: Apache Spark and Graphx for Real Time Analytics

2014-04-08 Thread Nick Pentreath
GraphX, like Spark, will not typically be real-time (where by real-time here I assume you mean of the order of a few 10s-100s ms, up to a few seconds). Spark can in some cases approach the upper boundary of this definition (a second or two, possibly less) when data is cached in memory and the

reading custom input format in Spark

2014-04-08 Thread Anurag
Hi, I am able to read a custom input format in Spark. scala> val inputRead = sc.newAPIHadoopFile("hdfs://127.0.0.1/user/cloudera/date_dataset/", classOf[io.reader.PatternInputFormat], classOf[org.apache.hadoop.io.LongWritable], classOf[org.apache.hadoop.io.Text]) However, doing a inputRead.count()

Re: reading custom input format in Spark

2014-04-08 Thread Andrew Ash
Are you using the PatternInputFormat from this blog post? https://hadoopi.wordpress.com/2013/05/31/custom-recordreader-processing-string-pattern-delimited-records/ If so, you need to set the pattern in the configuration before attempting to read data with that InputFormat: String regex =

Re: Apache Spark and Graphx for Real Time Analytics

2014-04-08 Thread Koert Kuipers
It all depends on what kind of traversing. If it's point traversing, then something based on random access would be great. If it's more scan-like traversal, then Spark will fit. On Tue, Apr 8, 2014 at 4:56 PM, Evan Chan e...@ooyala.com wrote: I doubt Titan would be able to give you traversal of

Re: Apache Spark and Graphx for Real Time Analytics

2014-04-08 Thread Nick Pentreath
Likely neither will give real-time for full-graph traversal, no. And once in memory, GraphX would definitely be faster for breadth-first traversal. But for vertex-centric traversals (starting from a vertex and traversing edges from there, such as friends of friends queries etc) then Titan is

Re: Apache Spark and Graphx for Real Time Analytics

2014-04-08 Thread Reynold Xin
Nick and Koert summarized it pretty well. Just to clarify and give some concrete examples: if you want to start with a specific vertex and follow some path, it is probably easier and faster to use a key-value store, or even MySQL or a graph database. If you want to count the average length

Re: MLLib - Thoughts about refactoring Updater for LBFGS?

2014-04-08 Thread DB Tsai
Hi Debasish, The L-BFGS solver will be in the master, like the GD solver, and the part that is parallelized is computing the gradient of each input row and summing them up. I prefer to make the optimizer pluggable instead of adding a new LogisticRegressionWithLBFGS since 98% of the code will be the
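
The split described above (per-row gradients summed on the workers, the solver step on the master) can be sketched in plain Python. This is only an illustration, not MLlib code: the partitions are ordinary lists standing in for a cached RDD, every function name is hypothetical, and a plain gradient descent step stands in for the pluggable L-BFGS step.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def partition_gradient(rows, weights):
    """Worker side (hypothetical): sum the logistic-loss gradient over one partition."""
    grad = [0.0] * len(weights)
    for features, label in rows:
        margin = sum(w * x for w, x in zip(weights, features))
        error = sigmoid(margin) - label
        for j, x in enumerate(features):
            grad[j] += error * x
    return grad


def full_gradient(partitions, weights):
    """Master side (hypothetical): add up the per-partition sums, then average."""
    total = [0.0] * len(weights)
    count = 0
    for rows in partitions:
        partial = partition_gradient(rows, weights)
        total = [t + p for t, p in zip(total, partial)]
        count += len(rows)
    return [t / count for t in total]


def gradient_descent_step(weights, grad, step_size=0.5):
    """Stand-in optimizer; an L-BFGS step could be plugged in here instead."""
    return [w - step_size * g for w, g in zip(weights, grad)]


def train(partitions, weights, optimizer_step, iterations=200):
    """The optimizer is a parameter, so the gradient code is shared by all solvers."""
    for _ in range(iterations):
        weights = optimizer_step(weights, full_gradient(partitions, weights))
    return weights


# Toy data: each row is ([bias, x], label), split across two "workers".
partitions = [
    [([1.0, 0.0], 0.0), ([1.0, 1.0], 0.0)],
    [([1.0, 3.0], 1.0), ([1.0, 4.0], 1.0)],
]
weights = train(partitions, [0.0, 0.0], gradient_descent_step)
```

Passing the optimizer step in as an argument mirrors the preference for a pluggable optimizer over a separate class per solver.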

Re: MLLib - Thoughts about refactoring Updater for LBFGS?

2014-04-08 Thread Debasish Das
Yup, that's what I expected: the L-BFGS solver is in the master and gradient computation per RDD is done on each of the workers. This miniBatchFraction is also a heuristic which I don't think makes sense for LogisticRegressionWithBFGS, does it? On Tue, Apr 8, 2014 at 3:44 PM, DB Tsai

Re: MLLib - Thoughts about refactoring Updater for LBFGS?

2014-04-08 Thread DB Tsai
I think mini batch is still useful for L-BFGS. One use case is to initialize the weights by training on smaller subsamples of the data using mini batch with L-BFGS. Then we could use the weights trained with mini batch to start another training process with the full data. Sincerely, DB
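
A rough sketch of that warm-start idea, in plain Python rather than MLlib: train on a subsample first, then reuse those weights as the starting point for a run on the full data. All names, the synthetic data, and the 20% sample fraction are assumptions made for illustration, and plain gradient descent again stands in for L-BFGS.

```python
import math
import random


def mean_gradient(rows, weights):
    """Average logistic-loss gradient over the given rows."""
    grad = [0.0] * len(weights)
    for features, label in rows:
        margin = sum(w * x for w, x in zip(weights, features))
        error = 1.0 / (1.0 + math.exp(-margin)) - label
        for j, x in enumerate(features):
            grad[j] += error * x / len(rows)
    return grad


def mean_loss(rows, weights):
    """Average logistic loss, written with log1p for numerical stability."""
    total = 0.0
    for features, label in rows:
        margin = sum(w * x for w, x in zip(weights, features))
        sign = 1.0 if label == 1.0 else -1.0
        total += math.log1p(math.exp(-sign * margin))
    return total / len(rows)


def fit(rows, initial_weights, iterations, step_size=0.5):
    """Stand-in solver (gradient descent) that starts from the supplied weights."""
    weights = list(initial_weights)
    for _ in range(iterations):
        grad = mean_gradient(rows, weights)
        weights = [w - step_size * g for w, g in zip(weights, grad)]
    return weights


# Synthetic, noisy data: rows are ([bias, x], label) with x in [-3, 3].
rng = random.Random(0)
full_data = []
for _ in range(200):
    x = rng.uniform(-3.0, 3.0)
    label = 1.0 if x + rng.gauss(0.0, 0.5) > 0.0 else 0.0
    full_data.append(([1.0, x], label))

# Stage 1: train on a 20% subsample, starting from zero weights.
subsample = rng.sample(full_data, 40)
warm_weights = fit(subsample, [0.0, 0.0], iterations=50)

# Stage 2: continue on the full data, starting from the stage 1 weights.
final_weights = fit(full_data, warm_weights, iterations=50)
```

Whether this saves time in practice depends on the data and the solver; the sketch only shows the mechanics of handing weights from one run to the next.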

Re: MLLib - Thoughts about refactoring Updater for LBFGS?

2014-04-08 Thread Debasish Das
Have you experimented with it? For logistic regression at least, given enough iterations and the tolerance you are giving, BFGS in both ways should converge to the same solution. On Tue, Apr 8, 2014 at 4:19 PM, DB Tsai dbt...@stanford.edu wrote: I think mini batch is still useful for L-BFGS. One

Re: MLLib - Thoughts about refactoring Updater for LBFGS?

2014-04-08 Thread DB Tsai
I haven't experimented with it. That's a use case I could think of in theory. ^^ However, from what I saw, BFGS converges really fast, so I only need 20~30 iterations in general. Sincerely, DB Tsai --- My Blog: https://www.dbtsai.com LinkedIn: