Re: Batch prediciton for ALS

2015-02-17 Thread Xiangrui Meng
It may be too late to merge it into 1.3. I'm going to make another pass on your PR today. -Xiangrui On Tue, Feb 10, 2015 at 8:01 AM, Debasish Das debasish.da...@gmail.com wrote: Hi, Will it be possible to merge this PR to 1.3 ? https://github.com/apache/spark/pull/3098 The batch prediction

Re: Batch prediciton for ALS

2015-02-17 Thread Debasish Das
It will really help us if we merge it, but I guess it has already diverged from the new ALS... I will also take a look at it again and try to update it with the new ALS... On Tue, Feb 17, 2015 at 3:22 PM, Xiangrui Meng men...@gmail.com wrote: It may be too late to merge it into 1.3. I'm going to make

Re: Replacing Jetty with TomCat

2015-02-17 Thread Patrick Wendell
Hey Niranda, It seems to me like a lot of effort to support multiple libraries inside of Spark like this, so I'm not sure that's a great solution. If you are building an application that embeds Spark, is it not possible for you to continue to use Jetty for Spark's internal servers and use Tomcat for

Re: mllib.recommendation Design

2015-02-17 Thread Debasish Das
There is a usability difference... I am not sure if recommendation.ALS would like to add both userConstraint and productConstraint? GraphLab CF, for example, has it, and we are ready to support all the features for modest ranks where gram matrices can be made... For large ranks I am still working on

Re: Replacing Jetty with TomCat

2015-02-17 Thread Niranda Perera
Hi Sean, The main issue we have is running two web servers in a single product. We think it would not be an elegant solution. Could you please point me to the main areas where the Jetty server is tightly coupled, or to extension points where I could plug in Tomcat instead of Jetty? If successful I could

Re: mllib.recommendation Design

2015-02-17 Thread Xiangrui Meng
The current ALS implementation allows pluggable solvers for NormalEquation, where we put CholeskySolver and the NNLS solver. Please check the current implementation and let us know how your constraint solver would fit. For a general matrix factorization package, let's make a JIRA and move our

Re: [ml] Lost persistence for fold in crossvalidation.

2015-02-17 Thread Xiangrui Meng
There are three different regParams defined in the grid and there are three folds. For simplicity, we didn't split the dataset into three and reuse them, but do the split for each fold. Then we need to cache 3*3 times. Note that the pipeline API is not yet optimized for performance. It would be
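
The 3*3 figure above can be checked with a quick count. This is only a minimal sketch: the regParam values and variable names are hypothetical (the message says only that there are three of each), and no Spark API is involved.

```python
from itertools import product

reg_params = [0.01, 0.1, 1.0]  # hypothetical values; the thread only says there are three
num_folds = 3

# One training run (and, per the message, one cached dataset) per (fold, regParam) pair,
# because the split is redone for each fold instead of being made once and reused.
fits = list(product(range(num_folds), reg_params))
print(len(fits))  # 9
```

Splitting once and reusing the three train/validation pairs would leave the number of fits at nine but cut the number of distinct datasets to cache down to the number of folds.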

Re: Replacing Jetty with TomCat

2015-02-17 Thread Corey Nolet
Niranda, I'm not sure if I'd say Spark's use of Jetty to expose its UI monitoring layer constitutes a use of two web servers in a single product. Hadoop uses Jetty as well, as do many other applications today that need embedded http layers for serving up their monitoring UI to users. This is

JavaRDD Aggregate initial value - Closure-serialized zero value reasoning?

2015-02-17 Thread Matt Cheah
Hi everyone, I was using JavaPairRDD's combineByKey() to compute all of my aggregations before, since I assumed that every aggregation required a key. However, I realized I could do my analysis using JavaRDD's aggregate() instead and not use a key. I have set spark.serializer to use Kryo. As a
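
For readers unfamiliar with the pattern, here is a toy model of a keyless aggregate(zeroValue, seqOp, combOp) call. It is a sketch under stated assumptions, not Spark's implementation: the `aggregate` function and the partition lists are hypothetical, and `copy.deepcopy` merely stands in for handing each partition its own serialized copy of the zero value, which is the behaviour the subject line asks about.

```python
import copy
from functools import reduce

def aggregate(partitions, zero, seq_op, comb_op):
    """Toy model of aggregate(zeroValue, seqOp, combOp) over a list of partitions."""
    partials = []
    for part in partitions:
        acc = copy.deepcopy(zero)  # stand-in for a deserialized copy of the zero value
        for x in part:
            acc = seq_op(acc, x)   # fold each element into the partition-local accumulator
        partials.append(acc)
    # Merge the per-partition results, starting from another fresh copy of zero.
    return reduce(comb_op, partials, copy.deepcopy(zero))

# Compute (sum, count) without any key.
partitions = [[1, 2, 3], [4, 5], []]
total, count = aggregate(
    partitions,
    (0, 0),
    lambda acc, x: (acc[0] + x, acc[1] + 1),
    lambda a, b: (a[0] + b[0], a[1] + b[1]),
)
print(total, count)  # 15 5
```

Because every partition starts from its own copy, the zero value has to be something that can be copied (in Spark, serialized) and must be a true identity for combOp, otherwise it is counted once per partition.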

Re: org.apache.spark.sql.sources.DDLException: Unsupported dataType: [1.1] failure: ``varchar'' expected but identifier char found in spark-sql

2015-02-17 Thread Yin Huai
Hi Qiuzhuang, Right now, char is not supported in DDL. Can you try varchar or string? Thanks, Yin On Mon, Feb 16, 2015 at 10:39 PM, Qiuzhuang Lian qiuzhuang.l...@gmail.com wrote: Hi, I am not sure this has been reported already or not, I run into this error under spark-sql shell as build

Re: [MLlib] Performance problem in GeneralizedLinearAlgorithm

2015-02-17 Thread Evan R. Sparks
Josh - thanks for the detailed write up - this seems a little funny to me. I agree that with the current code path there is more work being done than needs to be (e.g. the features are re-scaled at every iteration, but the relatively costly process of fitting the StandardScaler should not be
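
The distinction drawn above between fitting a scaler and applying it can be illustrated with a small sketch. This is plain Python, not MLlib's StandardScaler; the `fit` and `transform` helpers, the data, and the iteration count are all hypothetical.

```python
import statistics

def fit(column):
    # One full pass over the data to compute the statistics: the expensive step.
    return statistics.mean(column), statistics.pstdev(column)

def transform(column, mean, std):
    # Per-element rescaling with already-fitted statistics.
    return [(x - mean) / std for x in column]

feature = [2.0, 4.0, 6.0, 8.0]
mean, std = fit(feature)  # fitted once, before iterating

for _ in range(3):  # hypothetical optimizer iterations
    # Re-scaled on every iteration, as the message describes for the current code path.
    scaled = transform(feature, mean, std)
```

Hoisting the `transform` call out of the loop and reusing `scaled` would remove the repeated work without changing the result.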

Re: [MLlib] Performance problem in GeneralizedLinearAlgorithm

2015-02-17 Thread Peter Rudenko
It's fixed today: https://github.com/apache/spark/pull/4593 Thanks, Peter Rudenko On 2015-02-17 18:25, Evan R. Sparks wrote: Josh - thanks for the detailed write up - this seems a little funny to me. I agree that with the current code path there is more work being done than needs to be (e.g.

Fwd: [MLlib] Performance problem in GeneralizedLinearAlgorithm

2015-02-17 Thread Josh Devins
Cross-posting as I got no response on the users mailing list last week. Any response would be appreciated :) Josh -- Forwarded message -- From: Josh Devins j...@soundcloud.com Date: 9 February 2015 at 15:59 Subject: [MLlib] Performance problem in GeneralizedLinearAlgorithm To: