[GitHub] spark pull request: [WIP][SPARK-3530][MLLIB] pipeline and paramete...

2014-11-09 Thread jkbradley
Github user jkbradley commented on the pull request: https://github.com/apache/spark/pull/3099#issuecomment-62299039 @shivaram IMHO it would be good to have the developer API updates as well and test a couple of more pipelines before we push this out. I'll try to get a

[GitHub] spark pull request: [WIP][SPARK-3530][MLLIB] pipeline and paramete...

2014-11-07 Thread tomerk
Github user tomerk commented on the pull request: https://github.com/apache/spark/pull/3099#issuecomment-62112717 At @shivaram's suggestion, I started porting over a simple text classifier pipeline that was already using an Estimator/Transformer abstraction of RDD[U] to RDD[V]

[GitHub] spark pull request: [WIP][SPARK-3530][MLLIB] pipeline and paramete...

2014-11-07 Thread mengxr
Github user mengxr commented on the pull request: https://github.com/apache/spark/pull/3099#issuecomment-62196531 @mengxr @jkbradley I tried to port one of our simple image processing pipelines to the new interface today and the code for this is at

[GitHub] spark pull request: [WIP][SPARK-3530][MLLIB] pipeline and paramete...

2014-11-07 Thread mengxr
Github user mengxr commented on the pull request: https://github.com/apache/spark/pull/3099#issuecomment-62199983 @tomerk At @shivaram's suggestion, I started porting over a simple text classifier pipeline that was already using an Estimator/Transformer abstraction of

[GitHub] spark pull request: [WIP][SPARK-3530][MLLIB] pipeline and paramete...

2014-11-07 Thread tomerk
Github user tomerk commented on a diff in the pull request: https://github.com/apache/spark/pull/3099#discussion_r20034471 --- Diff: mllib/src/main/scala/org/apache/spark/ml/example/StandardScaler.scala --- @@ -0,0 +1,70 @@ +/* + * Licensed to the Apache Software

[GitHub] spark pull request: [WIP][SPARK-3530][MLLIB] pipeline and paramete...

2014-11-07 Thread jkbradley
Github user jkbradley commented on the pull request: https://github.com/apache/spark/pull/3099#issuecomment-62207734 I'd like to second the thanks to you both for trying out the new API! Some thoughts: @shivaram About your Dataset API comments: As @mengxr said, I am

[GitHub] spark pull request: [WIP][SPARK-3530][MLLIB] pipeline and paramete...

2014-11-07 Thread mateiz
Github user mateiz commented on the pull request: https://github.com/apache/spark/pull/3099#issuecomment-62223666 On the parameters vs constructors, the problem with default args to constructors is that they make it very difficult to add settings to an algorithm. Such changes will

[GitHub] spark pull request: [WIP][SPARK-3530][MLLIB] pipeline and paramete...

2014-11-07 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/3099#issuecomment-62227737 Test FAILed. Refer to this link for build results (access rights to CI server needed):

[GitHub] spark pull request: [WIP][SPARK-3530][MLLIB] pipeline and paramete...

2014-11-07 Thread shivaram
Github user shivaram commented on the pull request: https://github.com/apache/spark/pull/3099#issuecomment-62229656 Thanks for the replies. I'm summarizing the discussion so far and including some replies. I think we are getting close to a good API ! IMHO it would be good to have the

[GitHub] spark pull request: [WIP][SPARK-3530][MLLIB] pipeline and paramete...

2014-11-07 Thread SparkQA
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/3099#issuecomment-62227719 [Test build #23077 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/23077/consoleFull) for PR 3099 at commit

[GitHub] spark pull request: [WIP][SPARK-3530][MLLIB] pipeline and paramete...

2014-11-07 Thread SparkQA
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/3099#issuecomment-62227735 [Test build #23077 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/23077/consoleFull) for PR 3099 at commit

[GitHub] spark pull request: [WIP][SPARK-3530][MLLIB] pipeline and paramete...

2014-11-07 Thread mengxr
Github user mengxr commented on the pull request: https://github.com/apache/spark/pull/3099#issuecomment-62230172 @shivaram @tomerk I updated this PR and removed `modelParams`. The param traits are marked as `private[ml]`. As I mentioned, I have no particular preference here.

[GitHub] spark pull request: [WIP][SPARK-3530][MLLIB] pipeline and paramete...

2014-11-07 Thread tomerk
Github user tomerk commented on the pull request: https://github.com/apache/spark/pull/3099#issuecomment-62234569 Is there anything that can be done about these two lines that appear in every transform and fit function: ```scala import dataset.sqlContext._ val map

[GitHub] spark pull request: [WIP][SPARK-3530][MLLIB] pipeline and paramete...

2014-11-07 Thread mateiz
Github user mateiz commented on the pull request: https://github.com/apache/spark/pull/3099#issuecomment-62241597 This is a good point, maybe we should have an example or a test class that is not in the org.apache.spark.ml package and tries to create a pipeline stage. --- If your

[GitHub] spark pull request: [WIP][SPARK-3530][MLLIB] pipeline and paramete...

2014-11-07 Thread mengxr
Github user mengxr commented on the pull request: https://github.com/apache/spark/pull/3099#issuecomment-62242562 I will make one under `org.apache.spark.examples` (though it is still under `spark`). --- If your project is set up for it, you can reply to this email and have your

[GitHub] spark pull request: [WIP][SPARK-3530][MLLIB] pipeline and paramete...

2014-11-06 Thread vruusmann
Github user vruusmann commented on the pull request: https://github.com/apache/spark/pull/3099#issuecomment-61949888 @jegonzal PMML is essentially a domain-specific language (DSL) for the domain of predictive analytic applications. It is commonly used only for the representation of

[GitHub] spark pull request: [WIP][SPARK-3530][MLLIB] pipeline and paramete...

2014-11-06 Thread etrain
Github user etrain commented on the pull request: https://github.com/apache/spark/pull/3099#issuecomment-61977531 I've got several other comments on this PR - mostly good, and will leave some more detailed comments in a bit. TL;DR - What's wrong with Java Serialization? But

[GitHub] spark pull request: [WIP][SPARK-3530][MLLIB] pipeline and paramete...

2014-11-06 Thread etrain
Github user etrain commented on the pull request: https://github.com/apache/spark/pull/3099#issuecomment-61978539 Oh, and I'm not saying let's not support PMML - I'm saying let's have a sensible way of handing off models to other JVM programs that doesn't involve writing to XML in

[GitHub] spark pull request: [WIP][SPARK-3530][MLLIB] pipeline and paramete...

2014-11-06 Thread SparkQA
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/3099#issuecomment-62034255 [Test build #23011 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/23011/consoleFull) for PR 3099 at commit

[GitHub] spark pull request: [WIP][SPARK-3530][MLLIB] pipeline and paramete...

2014-11-06 Thread SparkQA
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/3099#issuecomment-62034301 [Test build #23011 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/23011/consoleFull) for PR 3099 at commit

[GitHub] spark pull request: [WIP][SPARK-3530][MLLIB] pipeline and paramete...

2014-11-06 Thread shivaram
Github user shivaram commented on the pull request: https://github.com/apache/spark/pull/3099#issuecomment-62105053 @mengxr @jkbradley I tried to port one of our simple image processing pipelines to the new interface today and the code for this is at

[GitHub] spark pull request: [WIP][SPARK-3530][MLLIB] pipeline and paramete...

2014-11-05 Thread mengxr
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/3099#discussion_r19859152 --- Diff: mllib/src/main/scala/org/apache/spark/ml/Model.scala --- @@ -0,0 +1,6 @@ +package org.apache.spark.ml --- End diff -- Hey @srowen,

[GitHub] spark pull request: [WIP][SPARK-3530][MLLIB] pipeline and paramete...

2014-11-05 Thread srowen
Github user srowen commented on a diff in the pull request: https://github.com/apache/spark/pull/3099#discussion_r19859297 --- Diff: mllib/src/main/scala/org/apache/spark/ml/Model.scala --- @@ -0,0 +1,6 @@ +package org.apache.spark.ml --- End diff -- Oh Ok, is it

[GitHub] spark pull request: [WIP][SPARK-3530][MLLIB] pipeline and paramete...

2014-11-05 Thread mengxr
Github user mengxr commented on the pull request: https://github.com/apache/spark/pull/3099#issuecomment-61775481 @srowen Glad to hear that you like it :) Your feedback will be greatly appreciated, but I don't want to waste your time on minor details. Let's run the discussion in the

[GitHub] spark pull request: [WIP][SPARK-3530][MLLIB] pipeline and paramete...

2014-11-05 Thread srowen
Github user srowen commented on the pull request: https://github.com/apache/spark/pull/3099#issuecomment-61785223 @mengxr Yes I agree with your conclusion. With a generic get/set by key method, I suppose you don't strictly need all the particular getters and setters. I suppose that's

[GitHub] spark pull request: [WIP][SPARK-3530][MLLIB] pipeline and paramete...

2014-11-05 Thread sryza
Github user sryza commented on the pull request: https://github.com/apache/spark/pull/3099#issuecomment-61853890 If maxIter is a constant, would it be clearer to use MAX_ITER? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as

[GitHub] spark pull request: [WIP][SPARK-3530][MLLIB] pipeline and paramete...

2014-11-05 Thread mengxr
Github user mengxr commented on the pull request: https://github.com/apache/spark/pull/3099#issuecomment-61854896 @sryza `maxIter` is not a constant. It is a parameter key in the current design. --- If your project is set up for it, you can reply to this email and have your reply

[GitHub] spark pull request: [WIP][SPARK-3530][MLLIB] pipeline and paramete...

2014-11-05 Thread sryza
Github user sryza commented on the pull request: https://github.com/apache/spark/pull/3099#issuecomment-61858097 Both the reference and the class internals are immutable, no? Typical Java conventions would put such a variable in all caps, though maybe in Scala it's different? ---

[GitHub] spark pull request: [WIP][SPARK-3530][MLLIB] pipeline and paramete...

2014-11-05 Thread sryza
Github user sryza commented on a diff in the pull request: https://github.com/apache/spark/pull/3099#discussion_r19897343 --- Diff: mllib/src/main/scala/org/apache/spark/ml/parameters.scala --- @@ -0,0 +1,267 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under

[GitHub] spark pull request: [WIP][SPARK-3530][MLLIB] pipeline and paramete...

2014-11-05 Thread jkbradley
Github user jkbradley commented on the pull request: https://github.com/apache/spark/pull/3099#issuecomment-61864795 I vote for the lr.setRegParam(0.1) setup. I also vote for setting parameters beforehand, and not allowing parameters in fit(). --- If your project is set up for it,

[GitHub] spark pull request: [WIP][SPARK-3530][MLLIB] pipeline and paramete...

2014-11-05 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/3099#issuecomment-61905232 Test FAILed. Refer to this link for build results (access rights to CI server needed):

[GitHub] spark pull request: [WIP][SPARK-3530][MLLIB] pipeline and paramete...

2014-11-05 Thread SparkQA
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/3099#issuecomment-61905223 [Test build #22962 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/22962/consoleFull) for PR 3099 at commit

[GitHub] spark pull request: [WIP][SPARK-3530][MLLIB] pipeline and paramete...

2014-11-05 Thread SparkQA
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/3099#issuecomment-61905231 [Test build #22962 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/22962/consoleFull) for PR 3099 at commit

[GitHub] spark pull request: [WIP][SPARK-3530][MLLIB] pipeline and paramete...

2014-11-05 Thread shivaram
Github user shivaram commented on the pull request: https://github.com/apache/spark/pull/3099#issuecomment-61913282 @mengxr Thanks for putting this together ! I had some high level comments by looking at the code 1. Could we have constructors also with getter, setters ? This

[GitHub] spark pull request: [WIP][SPARK-3530][MLLIB] pipeline and paramete...

2014-11-05 Thread manishamde
Github user manishamde commented on the pull request: https://github.com/apache/spark/pull/3099#issuecomment-61916011 I have a few comments based upon the API: 1. Like @jkbradley, I prefer ```lr.setMaxIter(50)``` over ```lr.set(lr.maxIter, 50)```. Also, prefer to avoid

[GitHub] spark pull request: [WIP][SPARK-3530][MLLIB] pipeline and paramete...

2014-11-05 Thread mengxr
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/3099#discussion_r19923238 --- Diff: mllib/src/main/scala/org/apache/spark/ml/parameters.scala --- @@ -0,0 +1,267 @@ +/* + * Licensed to the Apache Software Foundation (ASF)

[GitHub] spark pull request: [WIP][SPARK-3530][MLLIB] pipeline and paramete...

2014-11-05 Thread mengxr
Github user mengxr commented on the pull request: https://github.com/apache/spark/pull/3099#issuecomment-61922261 @shivaram Thanks for your feedback! Could we have constructors also with getter, setters? It would be hard to maintain binary compatibility in the long

[GitHub] spark pull request: [WIP][SPARK-3530][MLLIB] pipeline and paramete...

2014-11-05 Thread jegonzal
Github user jegonzal commented on the pull request: https://github.com/apache/spark/pull/3099#issuecomment-61922586 The model serving work would really benefit from being able to evaluate models without requiring a Spark context especially since we are shooting for 10s millisecond

[GitHub] spark pull request: [WIP][SPARK-3530][MLLIB] pipeline and paramete...

2014-11-05 Thread mengxr
Github user mengxr commented on the pull request: https://github.com/apache/spark/pull/3099#issuecomment-61923441 @manishamde Thanks for your feedback! Like @jkbradley, I prefer lr.setMaxIter(50) over lr.set(lr.maxIter, 50). Me too and this is implemented in the

[GitHub] spark pull request: [WIP][SPARK-3530][MLLIB] pipeline and paramete...

2014-11-05 Thread SparkQA
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/3099#issuecomment-61929832 [Test build #22986 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/22986/consoleFull) for PR 3099 at commit

[GitHub] spark pull request: [WIP][SPARK-3530][MLLIB] pipeline and paramete...

2014-11-05 Thread SparkQA
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/3099#issuecomment-61929837 [Test build #22986 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/22986/consoleFull) for PR 3099 at commit

[GitHub] spark pull request: [WIP][SPARK-3530][MLLIB] pipeline and paramete...

2014-11-05 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/3099#issuecomment-61929839 Test FAILed. Refer to this link for build results (access rights to CI server needed):

[GitHub] spark pull request: [WIP][SPARK-3530][MLLIB] pipeline and paramete...

2014-11-05 Thread SparkQA
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/3099#issuecomment-61930125 [Test build #22987 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/22987/consoleFull) for PR 3099 at commit

[GitHub] spark pull request: [WIP][SPARK-3530][MLLIB] pipeline and paramete...

2014-11-05 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/3099#issuecomment-61930133 Test FAILed. Refer to this link for build results (access rights to CI server needed):

[GitHub] spark pull request: [WIP][SPARK-3530][MLLIB] pipeline and paramete...

2014-11-05 Thread SparkQA
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/3099#issuecomment-61930130 [Test build #22987 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/22987/consoleFull) for PR 3099 at commit

[GitHub] spark pull request: [WIP][SPARK-3530][MLLIB] pipeline and paramete...

2014-11-05 Thread davies
Github user davies commented on a diff in the pull request: https://github.com/apache/spark/pull/3099#discussion_r19927481 --- Diff: mllib/src/main/scala/org/apache/spark/ml/Estimator.scala --- @@ -0,0 +1,70 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under

[GitHub] spark pull request: [WIP][SPARK-3530][MLLIB] pipeline and paramete...

2014-11-05 Thread SparkQA
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/3099#issuecomment-61931707 [Test build #22991 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/22991/consoleFull) for PR 3099 at commit

[GitHub] spark pull request: [WIP][SPARK-3530][MLLIB] pipeline and paramete...

2014-11-05 Thread SparkQA
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/3099#issuecomment-61931722 [Test build #22991 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/22991/consoleFull) for PR 3099 at commit

[GitHub] spark pull request: [WIP][SPARK-3530][MLLIB] pipeline and paramete...

2014-11-05 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/3099#issuecomment-61931724 Test FAILed. Refer to this link for build results (access rights to CI server needed):

[GitHub] spark pull request: [WIP][SPARK-3530][MLLIB] pipeline and paramete...

2014-11-05 Thread jkbradley
Github user jkbradley commented on the pull request: https://github.com/apache/spark/pull/3099#issuecomment-61932391 @mengxr @shivaram Adding to the conversation... In addition to the SchemaRDD based transformer API, it would be good to have a single Row transformation API too.

[GitHub] spark pull request: [WIP][SPARK-3530][MLLIB] pipeline and paramete...

2014-11-05 Thread jkbradley
Github user jkbradley commented on the pull request: https://github.com/apache/spark/pull/3099#issuecomment-61932628 @jegonzal Do you feel like most model classes should be serve-able by design? Or would you recommend having lighter model classes which can be built from regular

[GitHub] spark pull request: [WIP][SPARK-3530][MLLIB] pipeline and paramete...

2014-11-05 Thread jegonzal
Github user jegonzal commented on the pull request: https://github.com/apache/spark/pull/3099#issuecomment-61934417 @jkbradley Right now we are planning to serve linear combinations of models derived from MLlib (currently latent factor models, naive bayes, and decision trees).

[GitHub] spark pull request: [WIP][SPARK-3530][MLLIB] pipeline and paramete...

2014-11-05 Thread mengxr
Github user mengxr commented on the pull request: https://github.com/apache/spark/pull/3099#issuecomment-61934791 @jegonzal @jkbradley (This may be slightly off-topic.) To serve models, which are usually long pipelines in practice, I'm thinking of the following: 1. serialize

[GitHub] spark pull request: [WIP][SPARK-3530][MLLIB] pipeline and paramete...

2014-11-04 Thread mengxr
GitHub user mengxr opened a pull request: https://github.com/apache/spark/pull/3099 [WIP][SPARK-3530][MLLIB] pipeline and parameters ### This is a WIP, so please do not spend time on individual lines. This PR adds package org.apache.spark.ml with pipeline and parameters, as

[GitHub] spark pull request: [WIP][SPARK-3530][MLLIB] pipeline and paramete...

2014-11-04 Thread SparkQA
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/3099#issuecomment-61748060 [Test build #22907 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/22907/consoleFull) for PR 3099 at commit

[GitHub] spark pull request: [WIP][SPARK-3530][MLLIB] pipeline and paramete...

2014-11-04 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/3099#issuecomment-61748064 Test FAILed. Refer to this link for build results (access rights to CI server needed):

[GitHub] spark pull request: [WIP][SPARK-3530][MLLIB] pipeline and paramete...

2014-11-04 Thread SparkQA
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/3099#issuecomment-61748036 [Test build #22907 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/22907/consoleFull) for PR 3099 at commit

[GitHub] spark pull request: [WIP][SPARK-3530][MLLIB] pipeline and paramete...

2014-11-04 Thread srowen
Github user srowen commented on a diff in the pull request: https://github.com/apache/spark/pull/3099#discussion_r19859052 --- Diff: mllib/src/main/scala/org/apache/spark/ml/Model.scala --- @@ -0,0 +1,6 @@ +package org.apache.spark.ml --- End diff -- (Missing

[GitHub] spark pull request: [WIP][SPARK-3530][MLLIB] pipeline and paramete...

2014-11-04 Thread srowen
Github user srowen commented on a diff in the pull request: https://github.com/apache/spark/pull/3099#discussion_r19859074 --- Diff: mllib/src/main/scala/org/apache/spark/ml/example/CrossValidator.scala --- @@ -0,0 +1,64 @@ +package org.apache.spark.ml.example --- End diff