Github user jkbradley commented on the pull request:
https://github.com/apache/spark/pull/3099#issuecomment-62299039
@shivaram
IMHO it would be good to have the developer API updates as well and test
a couple more pipelines before we push this out.
I'll try to get a
Github user tomerk commented on the pull request:
https://github.com/apache/spark/pull/3099#issuecomment-62112717
At @shivaram's suggestion, I started porting over a simple text classifier
pipeline that was already using an Estimator/Transformer abstraction of RDD[U]
to RDD[V]
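The Estimator/Transformer abstraction @tomerk describes can be sketched minimally as follows. This is a hypothetical illustration, not the PR's actual code: `Seq` stands in for `RDD` so the sketch runs standalone, and the `Indexer` example is invented.

```scala
// Hypothetical sketch of an Estimator/Transformer abstraction from
// collections of U to collections of V; Seq stands in for RDD here.
trait Transformer[U, V] extends Serializable {
  def transform(data: Seq[U]): Seq[V]
}

trait Estimator[U, V] extends Serializable {
  // Fitting on data produces a Transformer applicable to new data.
  def fit(data: Seq[U]): Transformer[U, V]
}

// Toy example: learn a vocabulary, then map tokens to indices.
class Indexer extends Estimator[String, Int] {
  def fit(data: Seq[String]): Transformer[String, Int] = {
    val vocab = data.distinct.zipWithIndex.toMap
    new Transformer[String, Int] {
      def transform(xs: Seq[String]): Seq[Int] =
        xs.map(vocab.getOrElse(_, -1)) // -1 for out-of-vocabulary tokens
    }
  }
}
```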
Github user mengxr commented on the pull request:
https://github.com/apache/spark/pull/3099#issuecomment-62196531
@mengxr @jkbradley I tried to port one of our simple image processing
pipelines to the new interface today and the code for this is at
Github user mengxr commented on the pull request:
https://github.com/apache/spark/pull/3099#issuecomment-62199983
@tomerk
At @shivaram's suggestion, I started porting over a simple text
classifier pipeline that was already using an Estimator/Transformer abstraction
of
Github user tomerk commented on a diff in the pull request:
https://github.com/apache/spark/pull/3099#discussion_r20034471
--- Diff:
mllib/src/main/scala/org/apache/spark/ml/example/StandardScaler.scala ---
@@ -0,0 +1,70 @@
+/*
+ * Licensed to the Apache Software
Github user jkbradley commented on the pull request:
https://github.com/apache/spark/pull/3099#issuecomment-62207734
I'd like to second the thanks to you both for trying out the new API! Some
thoughts:
@shivaram About your Dataset API comments:
As @mengxr said, I am
Github user mateiz commented on the pull request:
https://github.com/apache/spark/pull/3099#issuecomment-62223666
On the parameters vs constructors, the problem with default args to
constructors is that they make it very difficult to add settings to an
algorithm. Such changes will
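The concern @mateiz raises can be made concrete with a hypothetical sketch (class and parameter names invented for illustration). Adding a new setting to a constructor with default arguments changes the constructor's signature, which breaks binary compatibility for already-compiled callers even though their source still compiles; a setter-based style adds a new method instead, leaving existing callers untouched.

```scala
// Constructor style: adding, say, a `tol` argument later changes this
// signature and breaks binary compatibility for compiled callers.
class LogisticRegressionV1(val maxIter: Int = 100, val regParam: Double = 0.0)

// Setter style: a new setting is just a new method; existing compiled
// callers are unaffected. `this.type` enables chaining.
class LogisticRegression {
  private var maxIter: Int = 100
  private var regParam: Double = 0.0
  def setMaxIter(value: Int): this.type = { maxIter = value; this }
  def setRegParam(value: Double): this.type = { regParam = value; this }
  def getMaxIter: Int = maxIter
}
```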
Github user AmplabJenkins commented on the pull request:
https://github.com/apache/spark/pull/3099#issuecomment-62227737
Test FAILed.
Refer to this link for build results (access rights to CI server needed):
Github user shivaram commented on the pull request:
https://github.com/apache/spark/pull/3099#issuecomment-62229656
Thanks for the replies. I'm summarizing the discussion so far and including
some replies. I think we are getting close to a good API! IMHO it would be
good to have the
Github user SparkQA commented on the pull request:
https://github.com/apache/spark/pull/3099#issuecomment-62227719
[Test build #23077 has
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/23077/consoleFull)
for PR 3099 at commit
Github user SparkQA commented on the pull request:
https://github.com/apache/spark/pull/3099#issuecomment-62227735
[Test build #23077 has
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/23077/consoleFull)
for PR 3099 at commit
Github user mengxr commented on the pull request:
https://github.com/apache/spark/pull/3099#issuecomment-62230172
@shivaram @tomerk I updated this PR and removed `modelParams`. The param
traits are marked as `private[ml]`. As I mentioned, I have no particular
preference here.
Github user tomerk commented on the pull request:
https://github.com/apache/spark/pull/3099#issuecomment-62234569
Is there anything that can be done about these two lines that appear in
every transform and fit function:
```scala
import dataset.sqlContext._
val map
```
Github user mateiz commented on the pull request:
https://github.com/apache/spark/pull/3099#issuecomment-62241597
This is a good point, maybe we should have an example or a test class that
is not in the org.apache.spark.ml package and tries to create a pipeline stage.
Github user mengxr commented on the pull request:
https://github.com/apache/spark/pull/3099#issuecomment-62242562
I will make one under `org.apache.spark.examples` (though it is still under
`spark`).
Github user vruusmann commented on the pull request:
https://github.com/apache/spark/pull/3099#issuecomment-61949888
@jegonzal PMML is essentially a domain-specific language (DSL) for the
domain of predictive analytics applications. It is commonly used only for the
representation of
Github user etrain commented on the pull request:
https://github.com/apache/spark/pull/3099#issuecomment-61977531
I've got several other comments on this PR - mostly good, and will leave
some more detailed comments in a bit. TL;DR - What's wrong with Java
Serialization?
But
Github user etrain commented on the pull request:
https://github.com/apache/spark/pull/3099#issuecomment-61978539
Oh, and I'm not saying let's not support PMML - I'm saying let's have a
sensible way of handing off models to other JVM programs that doesn't involve
writing to XML in
Github user SparkQA commented on the pull request:
https://github.com/apache/spark/pull/3099#issuecomment-62034255
[Test build #23011 has
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/23011/consoleFull)
for PR 3099 at commit
Github user SparkQA commented on the pull request:
https://github.com/apache/spark/pull/3099#issuecomment-62034301
[Test build #23011 has
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/23011/consoleFull)
for PR 3099 at commit
Github user shivaram commented on the pull request:
https://github.com/apache/spark/pull/3099#issuecomment-62105053
@mengxr @jkbradley I tried to port one of our simple image processing
pipelines to the new interface today and the code for this is at
Github user mengxr commented on a diff in the pull request:
https://github.com/apache/spark/pull/3099#discussion_r19859152
--- Diff: mllib/src/main/scala/org/apache/spark/ml/Model.scala ---
@@ -0,0 +1,6 @@
+package org.apache.spark.ml
--- End diff --
Hey @srowen,
Github user srowen commented on a diff in the pull request:
https://github.com/apache/spark/pull/3099#discussion_r19859297
--- Diff: mllib/src/main/scala/org/apache/spark/ml/Model.scala ---
@@ -0,0 +1,6 @@
+package org.apache.spark.ml
--- End diff --
Oh Ok, is it
Github user mengxr commented on the pull request:
https://github.com/apache/spark/pull/3099#issuecomment-61775481
@srowen Glad to hear that you like it :) Your feedback will be greatly
appreciated, but I don't want to waste your time on minor details. Let's run
the discussion in the
Github user srowen commented on the pull request:
https://github.com/apache/spark/pull/3099#issuecomment-61785223
@mengxr Yes I agree with your conclusion. With a generic get/set by key
method, I suppose you don't strictly need all the particular getters and
setters. I suppose that's
Github user sryza commented on the pull request:
https://github.com/apache/spark/pull/3099#issuecomment-61853890
If maxIter is a constant, would it be clearer to use MAX_ITER?
Github user mengxr commented on the pull request:
https://github.com/apache/spark/pull/3099#issuecomment-61854896
@sryza `maxIter` is not a constant. It is a parameter key in the current
design.
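A minimal sketch of what "parameter key" means here (names hypothetical, not the PR's actual code): `maxIter` is a typed key owned by an estimator instance, not a global constant, which is why camelCase fits better than `MAX_ITER`. It also shows the generic get/set-by-key access @srowen mentions.

```scala
// A parameter is a typed key belonging to a parent instance.
class Param[T](val parent: AnyRef, val name: String)

// Params are looked up by key identity; values are stored untyped and
// cast back on retrieval.
class ParamMap {
  private val map = scala.collection.mutable.Map.empty[Param[_], Any]
  def put[T](p: Param[T], value: T): this.type = { map(p) = value; this }
  def get[T](p: Param[T]): Option[T] = map.get(p).map(_.asInstanceOf[T])
}

class LogisticRegression {
  val maxIter: Param[Int] = new Param[Int](this, "maxIter")
  val paramMap = new ParamMap
  def set[T](p: Param[T], value: T): this.type = {
    paramMap.put(p, value); this
  }
}
```

Under this scheme `lr.set(lr.maxIter, 50)` stores a value against the instance's own key, so two estimators can hold different settings for the "same" parameter.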
Github user sryza commented on the pull request:
https://github.com/apache/spark/pull/3099#issuecomment-61858097
Both the reference and the class internals are immutable, no? Typical Java
conventions would put such a variable in all caps, though maybe in Scala it's
different?
Github user sryza commented on a diff in the pull request:
https://github.com/apache/spark/pull/3099#discussion_r19897343
--- Diff: mllib/src/main/scala/org/apache/spark/ml/parameters.scala ---
@@ -0,0 +1,267 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under
Github user jkbradley commented on the pull request:
https://github.com/apache/spark/pull/3099#issuecomment-61864795
I vote for the lr.setRegParam(0.1) setup. I also vote for setting
parameters beforehand, and not allowing parameters in fit().
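A hypothetical usage sketch of the convention being voted for (names invented): all parameters are configured on the estimator up front via chained setters, and `fit()` takes only the data.

```scala
// Stand-in model capturing the settings it was fit with.
case class Model(regParam: Double, maxIter: Int)

class LogisticRegression {
  private var regParam = 0.0
  private var maxIter = 100
  def setRegParam(value: Double): this.type = { regParam = value; this }
  def setMaxIter(value: Int): this.type = { maxIter = value; this }
  // fit() accepts data only; no parameter arguments.
  def fit(data: Seq[(Double, Array[Double])]): Model = Model(regParam, maxIter)
}

val lr = new LogisticRegression
lr.setRegParam(0.1).setMaxIter(50)
val model = lr.fit(Seq.empty) // parameters were fixed beforehand
```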
Github user AmplabJenkins commented on the pull request:
https://github.com/apache/spark/pull/3099#issuecomment-61905232
Test FAILed.
Refer to this link for build results (access rights to CI server needed):
Github user SparkQA commented on the pull request:
https://github.com/apache/spark/pull/3099#issuecomment-61905223
[Test build #22962 has
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/22962/consoleFull)
for PR 3099 at commit
Github user SparkQA commented on the pull request:
https://github.com/apache/spark/pull/3099#issuecomment-61905231
[Test build #22962 has
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/22962/consoleFull)
for PR 3099 at commit
Github user shivaram commented on the pull request:
https://github.com/apache/spark/pull/3099#issuecomment-61913282
@mengxr Thanks for putting this together ! I had some high level comments
by looking at the code
1. Could we have constructors also with getter, setters ? This
Github user manishamde commented on the pull request:
https://github.com/apache/spark/pull/3099#issuecomment-61916011
I have a few comments based upon the API:
1. Like @jkbradley, I prefer ```lr.setMaxIter(50)``` over
```lr.set(lr.maxIter, 50)```. Also, prefer to avoid
Github user mengxr commented on a diff in the pull request:
https://github.com/apache/spark/pull/3099#discussion_r19923238
--- Diff: mllib/src/main/scala/org/apache/spark/ml/parameters.scala ---
@@ -0,0 +1,267 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF)
Github user mengxr commented on the pull request:
https://github.com/apache/spark/pull/3099#issuecomment-61922261
@shivaram Thanks for your feedback!
Could we also have constructors in addition to getters and setters?
It would be hard to maintain binary compatibility in the long
Github user jegonzal commented on the pull request:
https://github.com/apache/spark/pull/3099#issuecomment-61922586
The model serving work would really benefit from being able to evaluate
models without requiring a Spark context especially since we are shooting for
10s millisecond
Github user mengxr commented on the pull request:
https://github.com/apache/spark/pull/3099#issuecomment-61923441
@manishamde Thanks for your feedback!
Like @jkbradley, I prefer lr.setMaxIter(50) over lr.set(lr.maxIter, 50).
Me too and this is implemented in the
Github user SparkQA commented on the pull request:
https://github.com/apache/spark/pull/3099#issuecomment-61929832
[Test build #22986 has
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/22986/consoleFull)
for PR 3099 at commit
Github user SparkQA commented on the pull request:
https://github.com/apache/spark/pull/3099#issuecomment-61929837
[Test build #22986 has
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/22986/consoleFull)
for PR 3099 at commit
Github user AmplabJenkins commented on the pull request:
https://github.com/apache/spark/pull/3099#issuecomment-61929839
Test FAILed.
Refer to this link for build results (access rights to CI server needed):
Github user SparkQA commented on the pull request:
https://github.com/apache/spark/pull/3099#issuecomment-61930125
[Test build #22987 has
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/22987/consoleFull)
for PR 3099 at commit
Github user AmplabJenkins commented on the pull request:
https://github.com/apache/spark/pull/3099#issuecomment-61930133
Test FAILed.
Refer to this link for build results (access rights to CI server needed):
Github user SparkQA commented on the pull request:
https://github.com/apache/spark/pull/3099#issuecomment-61930130
[Test build #22987 has
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/22987/consoleFull)
for PR 3099 at commit
Github user davies commented on a diff in the pull request:
https://github.com/apache/spark/pull/3099#discussion_r19927481
--- Diff: mllib/src/main/scala/org/apache/spark/ml/Estimator.scala ---
@@ -0,0 +1,70 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under
Github user SparkQA commented on the pull request:
https://github.com/apache/spark/pull/3099#issuecomment-61931707
[Test build #22991 has
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/22991/consoleFull)
for PR 3099 at commit
Github user SparkQA commented on the pull request:
https://github.com/apache/spark/pull/3099#issuecomment-61931722
[Test build #22991 has
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/22991/consoleFull)
for PR 3099 at commit
Github user AmplabJenkins commented on the pull request:
https://github.com/apache/spark/pull/3099#issuecomment-61931724
Test FAILed.
Refer to this link for build results (access rights to CI server needed):
Github user jkbradley commented on the pull request:
https://github.com/apache/spark/pull/3099#issuecomment-61932391
@mengxr @shivaram Adding to the conversation...
In addition to the SchemaRDD based transformer API, it would be good to
have a single Row transformation API too.
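The idea @jkbradley raises can be sketched as follows (hypothetical names; a `Row` is simplified to a `Map[String, Any]` so this runs standalone): alongside the dataset-level transform, a fitted transformer exposes a per-row method, so a single example can be scored without building a dataset, e.g. for low-latency serving.

```scala
trait RowTransformer extends Serializable {
  // Score a single example.
  def transformRow(row: Map[String, Any]): Map[String, Any]
  // The dataset-level transform derives from the per-row one.
  def transform(dataset: Seq[Map[String, Any]]): Seq[Map[String, Any]] =
    dataset.map(transformRow)
}

// Toy transformer: append a scaled copy of an input column.
class Scaler(factor: Double) extends RowTransformer {
  def transformRow(row: Map[String, Any]): Map[String, Any] =
    row + ("scaled" -> factor * row("feature").asInstanceOf[Double])
}
```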
Github user jkbradley commented on the pull request:
https://github.com/apache/spark/pull/3099#issuecomment-61932628
@jegonzal Do you feel like most model classes should be serve-able by
design? Or would you recommend having lighter model classes which can be built
from regular
Github user jegonzal commented on the pull request:
https://github.com/apache/spark/pull/3099#issuecomment-61934417
@jkbradley Right now we are planning to serve linear combinations of models
derived from MLlib (currently latent factor models, naive bayes, and decision
trees).
Github user mengxr commented on the pull request:
https://github.com/apache/spark/pull/3099#issuecomment-61934791
@jegonzal @jkbradley (This may be slightly off-topic.) To serve models,
which are usually long pipelines in practice, I'm thinking of the following:
1. serialize
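A minimal sketch of the plain-JVM hand-off @etrain alludes to (helper names invented): model state that is ordinary `Serializable` data can be written with Java serialization and reloaded for scoring in another process, with no SparkContext involved.

```scala
import java.io._

// Serialize any object graph to bytes with plain Java serialization.
def serialize(obj: AnyRef): Array[Byte] = {
  val buf = new ByteArrayOutputStream()
  val out = new ObjectOutputStream(buf)
  out.writeObject(obj)
  out.close() // flush buffered block data before reading the bytes
  buf.toByteArray
}

def deserialize[T](bytes: Array[Byte]): T = {
  val in = new ObjectInputStream(new ByteArrayInputStream(bytes))
  in.readObject().asInstanceOf[T]
}

// A linear model reduced to its state: a weight vector.
val bytes = serialize(Array(1.0, 2.0))

// "Serving side": reload the weights and score one example.
val weights = deserialize[Array[Double]](bytes)
def predict(x: Array[Double]): Double =
  weights.zip(x).map { case (w, v) => w * v }.sum
```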
GitHub user mengxr opened a pull request:
https://github.com/apache/spark/pull/3099
[WIP][SPARK-3530][MLLIB] pipeline and parameters
### This is a WIP, so please do not spend time on individual lines.
This PR adds package org.apache.spark.ml with pipeline and parameters, as
Github user SparkQA commented on the pull request:
https://github.com/apache/spark/pull/3099#issuecomment-61748060
[Test build #22907 has
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/22907/consoleFull)
for PR 3099 at commit
Github user AmplabJenkins commented on the pull request:
https://github.com/apache/spark/pull/3099#issuecomment-61748064
Test FAILed.
Refer to this link for build results (access rights to CI server needed):
Github user SparkQA commented on the pull request:
https://github.com/apache/spark/pull/3099#issuecomment-61748036
[Test build #22907 has
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/22907/consoleFull)
for PR 3099 at commit
Github user srowen commented on a diff in the pull request:
https://github.com/apache/spark/pull/3099#discussion_r19859052
--- Diff: mllib/src/main/scala/org/apache/spark/ml/Model.scala ---
@@ -0,0 +1,6 @@
+package org.apache.spark.ml
--- End diff --
(Missing
Github user srowen commented on a diff in the pull request:
https://github.com/apache/spark/pull/3099#discussion_r19859074
--- Diff:
mllib/src/main/scala/org/apache/spark/ml/example/CrossValidator.scala ---
@@ -0,0 +1,64 @@
+package org.apache.spark.ml.example
--- End diff