[GitHub] spark pull request: [WIP] [SPARK-4587] [mllib] ML model import/exp...

2015-02-03 Thread jkbradley
Github user jkbradley commented on a diff in the pull request:

https://github.com/apache/spark/pull/4233#discussion_r24061275
  
--- Diff: 
mllib/src/test/scala/org/apache/spark/mllib/classification/LogisticRegressionSuite.scala
 ---
@@ -459,7 +461,41 @@ class LogisticRegressionSuite extends FunSuite with 
MLlibTestSparkContext with M
 // very steep curve in logistic function so that when we draw samples 
from distribution, it's
 // very easy to assign to other labels. However, this prediction 
result is consistent with R.
 
validatePrediction(model.predict(validationRDD.map(_.features)).collect(), 
validationData, 0.47)
+  }
+
+  test("model export/import") {
--- End diff --

I'm reorganizing the code a bit to make it easier to keep exporters for each 
version.  It should be pretty maintainable and will allow for better testing.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [WIP] [SPARK-4587] [mllib] ML model import/exp...

2015-02-03 Thread jkbradley
Github user jkbradley commented on a diff in the pull request:

https://github.com/apache/spark/pull/4233#discussion_r24060903
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/classification/LogisticRegression.scala
 ---
@@ -17,14 +17,17 @@
 
 package org.apache.spark.mllib.classification
 
+import org.apache.spark.SparkContext
 import org.apache.spark.annotation.Experimental
+import org.apache.spark.mllib.classification.impl.GLMClassificationModel
 import org.apache.spark.mllib.linalg.BLAS.dot
 import org.apache.spark.mllib.linalg.{DenseVector, Vector}
 import org.apache.spark.mllib.optimization._
 import org.apache.spark.mllib.regression._
-import org.apache.spark.mllib.util.{DataValidators, MLUtils}
+import org.apache.spark.mllib.util.{DataValidators, Exportable, Importable}
--- End diff --

Sounds good





[GitHub] spark pull request: [WIP] [SPARK-4587] [mllib] ML model import/exp...

2015-02-03 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/4233#issuecomment-72787496
  
  [Test build #26719 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/26719/consoleFull)
 for   PR 4233 at commit 
[`b4ee064`](https://github.com/apache/spark/commit/b4ee0643b2981517ba02f036d4a053883ca1fe37).
 * This patch merges cleanly.





[GitHub] spark pull request: [WIP] [SPARK-4587] [mllib] ML model import/exp...

2015-02-03 Thread jkbradley
Github user jkbradley commented on the pull request:

https://github.com/apache/spark/pull/4233#issuecomment-72787523
  
That last commit did significant reorganization and cleanup to address 
@mengxr's comments.  Hopefully it's good to go now!  I won't put any more 
models in this PR.





[GitHub] spark pull request: [WIP] [SPARK-4587] [mllib] ML model import/exp...

2015-02-03 Thread mengxr
Github user mengxr commented on a diff in the pull request:

https://github.com/apache/spark/pull/4233#discussion_r24052353
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/classification/LogisticRegression.scala
 ---
@@ -17,14 +17,17 @@
 
 package org.apache.spark.mllib.classification
 
+import org.apache.spark.SparkContext
 import org.apache.spark.annotation.Experimental
+import org.apache.spark.mllib.classification.impl.GLMClassificationModel
 import org.apache.spark.mllib.linalg.BLAS.dot
 import org.apache.spark.mllib.linalg.{DenseVector, Vector}
 import org.apache.spark.mllib.optimization._
 import org.apache.spark.mllib.regression._
-import org.apache.spark.mllib.util.{DataValidators, MLUtils}
+import org.apache.spark.mllib.util.{DataValidators, Exportable, Importable}
--- End diff --

About the names, we have `Exportable` with `save()` and `Importable` with 
`load()`. I have two questions:

1) Would `Saveable` and `save` be a better match? I'm looking at the search 
results from grepcode:

* `Saveable` as an interface: 
http://grepcode.com/search/?start=0&query=Saveable&entity=type&k=i
* `Exportable` as an interface: 
http://grepcode.com/search?query=Exportable&start=0&entity=type&k=i

People use both but I didn't see a combination of `Exportable` and 
`save()`. I hope `savable` or `saveable` is still a valid word. Same applies to 
`Exportable`/`Loadable` and `load()`.

2) If an instance is `Importable`, it means we can import it from 
somewhere. This is not true for `object Model`. We don't import `object Model` 
but we use `object Model` to import `class Model`. Should the interface be 
called `Importer`/`Loader` instead?
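
A minimal, runnable sketch of the naming scheme under discussion: a `Saveable` trait on the model class and a separate `Loader` trait on its companion object, so the companion is the thing that loads rather than the thing that is loaded. All names, signatures, and the in-memory "filesystem" are illustrative assumptions, not Spark's actual API.

```scala
// Hypothetical sketch of the Saveable/Loader split (not Spark's final API).

trait Saveable {
  /** Persist this model under `path`. */
  def save(path: String): Unit
}

trait Loader[M <: Saveable] {
  /** Reconstruct a model previously written by `save`. */
  def load(path: String): M
}

// Toy in-memory "filesystem" so the sketch runs without any I/O.
object FakeFS {
  val files = scala.collection.mutable.Map.empty[String, String]
}

// The model class is Saveable...
class ToyModel(val weight: Double) extends Saveable {
  override def save(path: String): Unit = FakeFS.files(path) = weight.toString
}

// ...while its companion object is the Loader, matching the observation
// that we use `object Model` to load `class Model`, not the reverse.
object ToyModel extends Loader[ToyModel] {
  override def load(path: String): ToyModel =
    new ToyModel(FakeFS.files(path).toDouble)
}

new ToyModel(1.5).save("/tmp/toy")
val restored = ToyModel.load("/tmp/toy")
```

The companion-object placement means callers get the familiar `ToyModel.load(path)` syntax without the model instance itself pretending to be "loadable".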





[GitHub] spark pull request: [WIP] [SPARK-4587] [mllib] ML model import/exp...

2015-02-03 Thread mengxr
Github user mengxr commented on a diff in the pull request:

https://github.com/apache/spark/pull/4233#discussion_r24052532
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/classification/NaiveBayes.scala ---
@@ -65,6 +67,70 @@ class NaiveBayesModel private[mllib] (
   override def predict(testData: Vector): Double = {
 labels(brzArgmax(brzPi + brzTheta * testData.toBreeze))
   }
+
+  override def save(sc: SparkContext, path: String): Unit = {
+    val sqlContext = new SQLContext(sc)
+    import sqlContext._
+
+    // Create JSON metadata.
+    val metadataRDD =
+      sc.parallelize(Seq((this.getClass.getName, formatVersion))).toDataFrame("class", "version")
+    metadataRDD.toJSON.repartition(1).saveAsTextFile(path + "/metadata")
+
+    // Create Parquet data.
+    val data = NaiveBayesModel.Data(labels, pi, theta)
+    val dataRDD: DataFrame = sc.parallelize(Seq(data))
+    dataRDD.repartition(1).saveAsParquetFile(path + "/data")
+  }
+
+  override protected def formatVersion: String = NaiveBayesModel.formatVersion
+
+}
+
+object NaiveBayesModel extends Importable[NaiveBayesModel] {
+
+  /** Model data for model import/export */
+  private case class Data(labels: Array[Double], pi: Array[Double], theta: 
Array[Array[Double]])
--- End diff --

This `Data` class should live inside `ImporterV1`. If `ImporterV2` uses a 
different format, it is hard to update `Data` if it is global to both importers.
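
The scoping suggested above can be sketched as follows: each format version's loader object carries its own `Data` case class, so a future V2 layout can change fields without touching V1. The object and field names here are illustrative assumptions, not the actual Spark implementation.

```scala
// Hypothetical sketch: version-scoped Data classes inside per-version loaders.

object ToyModelPersistence {

  private object SaveLoadV1 {
    // V1 on-disk record: labels only.
    case class Data(labels: Array[Double])
    def load(raw: String): Data = Data(raw.split(",").map(_.toDouble))
  }

  private object SaveLoadV2 {
    // V2 adds a threshold field; V1's Data is untouched.
    case class Data(labels: Array[Double], threshold: Double)
    def load(raw: String): Data = {
      val parts = raw.split(";")
      Data(parts(0).split(",").map(_.toDouble), parts(1).toDouble)
    }
  }

  // Dispatch on the format version read from metadata.
  def loadLabels(version: String, raw: String): Array[Double] = version match {
    case "1.0" => SaveLoadV1.load(raw).labels
    case "2.0" => SaveLoadV2.load(raw).labels
  }
}

val v1Labels = ToyModelPersistence.loadLabels("1.0", "0.0,1.0")
val v2Labels = ToyModelPersistence.loadLabels("2.0", "0.0,1.0;0.5")
```

Because each `Data` is private to its version object, changing the V2 schema cannot silently break V1 deserialization.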





[GitHub] spark pull request: [WIP] [SPARK-4587] [mllib] ML model import/exp...

2015-02-03 Thread mengxr
Github user mengxr commented on a diff in the pull request:

https://github.com/apache/spark/pull/4233#discussion_r24054409
  
--- Diff: 
mllib/src/test/scala/org/apache/spark/mllib/classification/LogisticRegressionSuite.scala
 ---
@@ -425,25 +427,25 @@ class LogisticRegressionSuite extends FunSuite with 
MLlibTestSparkContext with M
  * weights. The mathematical discussion and proof can be found here:
  * http://en.wikipedia.org/wiki/Multinomial_logistic_regression
  *
- *weights1 = weights$`1` - weights$`0`
- *weights2 = weights$`2` - weights$`0`
+ * weights1 = weights$`1` - weights$`0`
+ * weights2 = weights$`2` - weights$`0`
  *
- * weights1
- *5 x 1 sparse Matrix of class dgCMatrix
- *s0
- * 2.6228269
- *data.V2 -0.5837166
- *data.V3  0.9285260
- *data.V4 -0.3783612
- *data.V5 -0.8123411
- * weights2
- *5 x 1 sparse Matrix of class dgCMatrix
- * s0
- * 4.11197445
- *data.V2 -0.16918650
- *data.V3 -0.81104784
- *data.V4 -0.06463799
- *data.V5 -0.29198337
+ *  weights1
+ * 5 x 1 sparse Matrix of class dgCMatrix
+ * s0
+ * 2.6228269
--- End diff --

The original indentation is correct. Was this changed by accident?





[GitHub] spark pull request: [WIP] [SPARK-4587] [mllib] ML model import/exp...

2015-02-03 Thread jkbradley
Github user jkbradley commented on a diff in the pull request:

https://github.com/apache/spark/pull/4233#discussion_r24054690
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/classification/NaiveBayes.scala ---
@@ -65,6 +67,70 @@ class NaiveBayesModel private[mllib] (
   override def predict(testData: Vector): Double = {
 labels(brzArgmax(brzPi + brzTheta * testData.toBreeze))
   }
+
+  override def save(sc: SparkContext, path: String): Unit = {
+    val sqlContext = new SQLContext(sc)
+    import sqlContext._
+
+    // Create JSON metadata.
+    val metadataRDD =
+      sc.parallelize(Seq((this.getClass.getName, formatVersion))).toDataFrame("class", "version")
+    metadataRDD.toJSON.repartition(1).saveAsTextFile(path + "/metadata")
+
+    // Create Parquet data.
+    val data = NaiveBayesModel.Data(labels, pi, theta)
+    val dataRDD: DataFrame = sc.parallelize(Seq(data))
+    dataRDD.repartition(1).saveAsParquetFile(path + "/data")
+  }
+
+  override protected def formatVersion: String = NaiveBayesModel.formatVersion
+
+}
+
+object NaiveBayesModel extends Importable[NaiveBayesModel] {
+
+  /** Model data for model import/export */
+  private case class Data(labels: Array[Double], pi: Array[Double], theta: 
Array[Array[Double]])
--- End diff --

Good point





[GitHub] spark pull request: [WIP] [SPARK-4587] [mllib] ML model import/exp...

2015-02-03 Thread jkbradley
Github user jkbradley commented on a diff in the pull request:

https://github.com/apache/spark/pull/4233#discussion_r24054729
  
--- Diff: 
mllib/src/test/scala/org/apache/spark/mllib/classification/LogisticRegressionSuite.scala
 ---
@@ -425,25 +427,25 @@ class LogisticRegressionSuite extends FunSuite with 
MLlibTestSparkContext with M
  * weights. The mathematical discussion and proof can be found here:
  * http://en.wikipedia.org/wiki/Multinomial_logistic_regression
  *
- *weights1 = weights$`1` - weights$`0`
- *weights2 = weights$`2` - weights$`0`
+ * weights1 = weights$`1` - weights$`0`
+ * weights2 = weights$`2` - weights$`0`
  *
- * weights1
- *5 x 1 sparse Matrix of class dgCMatrix
- *s0
- * 2.6228269
- *data.V2 -0.5837166
- *data.V3  0.9285260
- *data.V4 -0.3783612
- *data.V5 -0.8123411
- * weights2
- *5 x 1 sparse Matrix of class dgCMatrix
- * s0
- * 4.11197445
- *data.V2 -0.16918650
- *data.V3 -0.81104784
- *data.V4 -0.06463799
- *data.V5 -0.29198337
+ *  weights1
+ * 5 x 1 sparse Matrix of class dgCMatrix
+ * s0
+ * 2.6228269
--- End diff --

Weird, that's an accident.  I'll revert it.





[GitHub] spark pull request: [WIP] [SPARK-4587] [mllib] ML model import/exp...

2015-02-03 Thread mengxr
Github user mengxr commented on a diff in the pull request:

https://github.com/apache/spark/pull/4233#discussion_r24055623
  
--- Diff: 
mllib/src/test/scala/org/apache/spark/mllib/classification/LogisticRegressionSuite.scala
 ---
@@ -459,7 +461,41 @@ class LogisticRegressionSuite extends FunSuite with 
MLlibTestSparkContext with M
 // very steep curve in logistic function so that when we draw samples 
from distribution, it's
 // very easy to assign to other labels. However, this prediction 
result is consistent with R.
 
validatePrediction(model.predict(validationRDD.map(_.features)).collect(), 
validationData, 0.47)
+  }
+
+  test("model export/import") {
--- End diff --

That requires us to maintain the exporters for each version. I'm thinking 
of keeping one saved model as a test resource and testing against that.
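
The strategy described here can be sketched as a compatibility test: freeze one artifact saved in the old format as a test resource and assert the current loader still reads it. The inline strings and the naive version extraction below are illustrative assumptions; a real loader would read files and parse the JSON properly.

```scala
// Hypothetical sketch: test the current loader against a frozen V1 artifact.

// Frozen V1 artifact (metadata record + data record), checked in once as a
// test resource and never regenerated by the suite.
val frozenV1Metadata = """{"class":"ToyModel","version":"1.0"}"""
val frozenV1Data = "2.5"

def loadAnyVersion(metadata: String, data: String): Double = {
  // Extract the version field naively; real code would use a JSON parser.
  val version = metadata.split("\"version\":\"")(1).takeWhile(_ != '"')
  version match {
    case "1.0" => data.toDouble // V1 layout: bare weight
    case other => sys.error(s"unsupported format version $other")
  }
}

val weight = loadAnyVersion(frozenV1Metadata, frozenV1Data)
```

With a frozen resource, only the loaders for old versions must be maintained; the old exporters can be deleted once their output is captured.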





[GitHub] spark pull request: [WIP] [SPARK-4587] [mllib] ML model import/exp...

2015-02-03 Thread mengxr
Github user mengxr commented on a diff in the pull request:

https://github.com/apache/spark/pull/4233#discussion_r24054410
  
--- Diff: 
mllib/src/test/scala/org/apache/spark/mllib/classification/LogisticRegressionSuite.scala
 ---
@@ -459,7 +461,41 @@ class LogisticRegressionSuite extends FunSuite with 
MLlibTestSparkContext with M
 // very steep curve in logistic function so that when we draw samples 
from distribution, it's
 // very easy to assign to other labels. However, this prediction 
result is consistent with R.
 
validatePrediction(model.predict(validationRDD.map(_.features)).collect(), 
validationData, 0.47)
+  }
+
+  test("model export/import") {
--- End diff --

I have a question about upgradability. When we have `V2`, how to test 
`ImporterV1`? 





[GitHub] spark pull request: [WIP] [SPARK-4587] [mllib] ML model import/exp...

2015-02-03 Thread jkbradley
Github user jkbradley commented on a diff in the pull request:

https://github.com/apache/spark/pull/4233#discussion_r24054680
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/classification/LogisticRegression.scala
 ---
@@ -17,14 +17,17 @@
 
 package org.apache.spark.mllib.classification
 
+import org.apache.spark.SparkContext
 import org.apache.spark.annotation.Experimental
+import org.apache.spark.mllib.classification.impl.GLMClassificationModel
 import org.apache.spark.mllib.linalg.BLAS.dot
 import org.apache.spark.mllib.linalg.{DenseVector, Vector}
 import org.apache.spark.mllib.optimization._
 import org.apache.spark.mllib.regression._
-import org.apache.spark.mllib.util.{DataValidators, MLUtils}
+import org.apache.spark.mllib.util.{DataValidators, Exportable, Importable}
--- End diff --

Saveable and Loader sound good to me.  What about PMML?  Does it sound 
reasonable to have separate traits for PMML, such as PMMLSaveable and 
PMMLLoadable?





[GitHub] spark pull request: [WIP] [SPARK-4587] [mllib] ML model import/exp...

2015-02-03 Thread jkbradley
Github user jkbradley commented on a diff in the pull request:

https://github.com/apache/spark/pull/4233#discussion_r24054795
  
--- Diff: 
mllib/src/test/scala/org/apache/spark/mllib/classification/LogisticRegressionSuite.scala
 ---
@@ -459,7 +461,41 @@ class LogisticRegressionSuite extends FunSuite with 
MLlibTestSparkContext with M
 // very steep curve in logistic function so that when we draw samples 
from distribution, it's
 // very easy to assign to other labels. However, this prediction 
result is consistent with R.
 
validatePrediction(model.predict(validationRDD.map(_.features)).collect(), 
validationData, 0.47)
+  }
+
+  test("model export/import") {
--- End diff --

True, I guess we should have an analogous ExporterV1 and keep it around for 
testing.





[GitHub] spark pull request: [WIP] [SPARK-4587] [mllib] ML model import/exp...

2015-02-03 Thread mengxr
Github user mengxr commented on a diff in the pull request:

https://github.com/apache/spark/pull/4233#discussion_r24055819
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/classification/LogisticRegression.scala
 ---
@@ -17,14 +17,17 @@
 
 package org.apache.spark.mllib.classification
 
+import org.apache.spark.SparkContext
 import org.apache.spark.annotation.Experimental
+import org.apache.spark.mllib.classification.impl.GLMClassificationModel
 import org.apache.spark.mllib.linalg.BLAS.dot
 import org.apache.spark.mllib.linalg.{DenseVector, Vector}
 import org.apache.spark.mllib.optimization._
 import org.apache.spark.mllib.regression._
-import org.apache.spark.mllib.util.{DataValidators, MLUtils}
+import org.apache.spark.mllib.util.{DataValidators, Exportable, Importable}
--- End diff --

Good question ... I don't have good names. Maybe `PMMLExportable` with 
`toPMML` and `PMMLImporter` with `fromPMML`?





[GitHub] spark pull request: [WIP] [SPARK-4587] [mllib] ML model import/exp...

2015-02-02 Thread mengxr
Github user mengxr commented on a diff in the pull request:

https://github.com/apache/spark/pull/4233#discussion_r23976586
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/classification/LogisticRegression.scala
 ---
@@ -68,6 +79,65 @@ class LogisticRegressionModel (
      case None => score
 }
   }
+
+  override def save(sc: SparkContext, path: String): Unit = {
+    val sqlContext = new SQLContext(sc)
+    import sqlContext._
+
+    // Create JSON metadata.
+    val metadata = LogisticRegressionModel.Metadata(
+      clazz = this.getClass.getName, version = Exportable.latestVersion)
+    val metadataRDD: DataFrame = sc.parallelize(Seq(metadata))
+    metadataRDD.toJSON.saveAsTextFile(path + "/metadata")
+    // Create Parquet data.
+    val data = LogisticRegressionModel.Data(weights, intercept, threshold)
+    val dataRDD: DataFrame = sc.parallelize(Seq(data))
+    dataRDD.saveAsParquetFile(path + "/data")
+  }
+}
+
+object LogisticRegressionModel extends Importable[LogisticRegressionModel] {
+
+  private case class Metadata(clazz: String, version: String)
+
+  private case class Data(weights: Vector, intercept: Double, threshold: Option[Double])
+
+  override def load(sc: SparkContext, path: String): LogisticRegressionModel = {
+    val sqlContext = new SQLContext(sc)
+    import sqlContext._
+
+    // Load JSON metadata.
+    val metadataRDD = sqlContext.jsonFile(path + "/metadata")
--- End diff --

Yes, `toJSON` writes one line per record; otherwise it would be much harder 
to determine record boundaries. People can use a JSON beautifier to reformat 
a one-line JSON record, so this is not an issue given the many tools and 
browser plugins available.
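
The one-record-per-line convention described above (often called JSON Lines) can be illustrated without Spark: each line is a complete JSON object, so record boundaries are just newlines and the file can be split and parsed line by line. The `Record` type and hand-rolled serializer below are illustrative assumptions.

```scala
// Minimal sketch of the one-JSON-object-per-line format used for metadata.

case class Record(clazz: String, version: String)

// Serialize one record as a single line of JSON (no embedded newlines).
def toJsonLine(r: Record): String =
  s"""{"class":"${r.clazz}","version":"${r.version}"}"""

val records = Seq(Record("ToyModel", "1.0"), Record("OtherModel", "2.0"))
val fileContents = records.map(toJsonLine).mkString("\n")

// Splitting on newlines recovers exactly one complete record per line.
val lines = fileContents.split("\n")
```

This is also why the format parallelizes well: a reader can hand each line to a worker independently, with no cross-line state.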





[GitHub] spark pull request: [WIP] [SPARK-4587] [mllib] ML model import/exp...

2015-02-02 Thread jkbradley
Github user jkbradley commented on the pull request:

https://github.com/apache/spark/pull/4233#issuecomment-72588869
  
Rebased after the logreg changes... but I still need to fix loading.





[GitHub] spark pull request: [WIP] [SPARK-4587] [mllib] ML model import/exp...

2015-02-02 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/4233#issuecomment-72580348
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/26579/
Test FAILed.





[GitHub] spark pull request: [WIP] [SPARK-4587] [mllib] ML model import/exp...

2015-02-02 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/4233#issuecomment-72580345
  
  [Test build #26579 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/26579/consoleFull)
 for   PR 4233 at commit 
[`a71e555`](https://github.com/apache/spark/commit/a71e555b2193d5caec76fad088994f64bf638f9d).
 * This patch **fails Scala style tests**.
 * This patch **does not merge cleanly**.
 * This patch adds the following public classes _(experimental)_:
  * `  case class Data(weights: Vector, intercept: Double, threshold: 
Option[Double])`
  * `  case class Data(weights: Vector, intercept: Double)`
  * `trait Exportable `
  * `trait Importable[M <: Exportable] `
  * `  protected abstract class Importer `
  * `   * @return (class name, version, metadata)`






[GitHub] spark pull request: [WIP] [SPARK-4587] [mllib] ML model import/exp...

2015-02-02 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/4233#issuecomment-72580736
  
  [Test build #26580 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/26580/consoleFull)
 for   PR 4233 at commit 
[`444341a`](https://github.com/apache/spark/commit/444341a03ad4ebd4d3493a3dad1f8c0a4d95735b).
 * This patch **fails Scala style tests**.
 * This patch **does not merge cleanly**.
 * This patch adds the following public classes _(experimental)_:
  * `  case class Data(weights: Vector, intercept: Double, threshold: 
Option[Double])`
  * `  case class Data(weights: Vector, intercept: Double)`
  * `trait Exportable `
  * `trait Importable[M <: Exportable] `
  * `  protected abstract class Importer `
  * `   * @return (class name, version, metadata)`






[GitHub] spark pull request: [WIP] [SPARK-4587] [mllib] ML model import/exp...

2015-02-02 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/4233#issuecomment-72580738
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/26580/
Test FAILed.





[GitHub] spark pull request: [WIP] [SPARK-4587] [mllib] ML model import/exp...

2015-02-02 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/4233#issuecomment-72580650
  
  [Test build #26580 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/26580/consoleFull)
 for   PR 4233 at commit 
[`444341a`](https://github.com/apache/spark/commit/444341a03ad4ebd4d3493a3dad1f8c0a4d95735b).
 * This patch **does not merge cleanly**.





[GitHub] spark pull request: [WIP] [SPARK-4587] [mllib] ML model import/exp...

2015-02-02 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/4233#issuecomment-72589061
  
  [Test build #26600 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/26600/consoleFull)
 for   PR 4233 at commit 
[`ee99228`](https://github.com/apache/spark/commit/ee99228ac7904f410210b5e9118f5e88ea280164).
 * This patch merges cleanly.





[GitHub] spark pull request: [WIP] [SPARK-4587] [mllib] ML model import/exp...

2015-02-02 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/4233#issuecomment-72580164
  
  [Test build #26579 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/26579/consoleFull)
 for   PR 4233 at commit 
[`a71e555`](https://github.com/apache/spark/commit/a71e555b2193d5caec76fad088994f64bf638f9d).
 * This patch **does not merge cleanly**.





[GitHub] spark pull request: [WIP] [SPARK-4587] [mllib] ML model import/exp...

2015-02-02 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/4233#issuecomment-72592729
  
  [Test build #26600 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/26600/consoleFull)
 for   PR 4233 at commit 
[`ee99228`](https://github.com/apache/spark/commit/ee99228ac7904f410210b5e9118f5e88ea280164).
 * This patch **fails Spark unit tests**.
 * This patch merges cleanly.
 * This patch adds the following public classes _(experimental)_:
  * `  case class Data(weights: Vector, intercept: Double, threshold: 
Option[Double])`
  * `  case class Data(weights: Vector, intercept: Double)`
  * `trait Exportable `
  * `trait Importable[M <: Exportable] `
  * `  protected abstract class Importer `
  * `   * @return (class name, version, metadata)`






[GitHub] spark pull request: [WIP] [SPARK-4587] [mllib] ML model import/exp...

2015-02-02 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/4233#issuecomment-72592738
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/26600/
Test FAILed.





[GitHub] spark pull request: [WIP] [SPARK-4587] [mllib] ML model import/exp...

2015-02-01 Thread jkbradley
Github user jkbradley commented on a diff in the pull request:

https://github.com/apache/spark/pull/4233#discussion_r23903668
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/classification/LogisticRegression.scala
 ---
@@ -68,6 +79,65 @@ class LogisticRegressionModel (
   case None => score
 }
   }
+
+  override def save(sc: SparkContext, path: String): Unit = {
+val sqlContext = new SQLContext(sc)
+import sqlContext._
+
+// Create JSON metadata.
+val metadata = LogisticRegressionModel.Metadata(
+  clazz = this.getClass.getName, version = Exportable.latestVersion)
--- End diff --

Sounds good, I'll try that.





[GitHub] spark pull request: [WIP] [SPARK-4587] [mllib] ML model import/exp...

2015-02-01 Thread jkbradley
Github user jkbradley commented on a diff in the pull request:

https://github.com/apache/spark/pull/4233#discussion_r23903673
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/classification/LogisticRegression.scala
 ---
@@ -68,6 +79,65 @@ class LogisticRegressionModel (
   case None => score
 }
   }
+
+  override def save(sc: SparkContext, path: String): Unit = {
+val sqlContext = new SQLContext(sc)
+import sqlContext._
+
+// Create JSON metadata.
+val metadata = LogisticRegressionModel.Metadata(
+  clazz = this.getClass.getName, version = Exportable.latestVersion)
+val metadataRDD: DataFrame = sc.parallelize(Seq(metadata))
+metadataRDD.toJSON.saveAsTextFile(path + "/metadata")
+// Create Parquet data.
+val data = LogisticRegressionModel.Data(weights, intercept, threshold)
+val dataRDD: DataFrame = sc.parallelize(Seq(data))
+dataRDD.saveAsParquetFile(path + "/data")
+  }
+}
+
+object LogisticRegressionModel extends Importable[LogisticRegressionModel] 
{
+
+  private case class Metadata(clazz: String, version: String)
+
+  private case class Data(weights: Vector, intercept: Double, threshold: 
Option[Double])
+
+  override def load(sc: SparkContext, path: String): 
LogisticRegressionModel = {
+val sqlContext = new SQLContext(sc)
+import sqlContext._
+
+// Load JSON metadata.
+val metadataRDD = sqlContext.jsonFile(path + "/metadata")
--- End diff --

Do we want to make sure it has 1 row so that it's saved as a single text 
file though?  Or do we not really care if it gets distributed?  (Or I could try 
to enforce partitioning by repartitioning with numPartitions = 1 beforehand.)





[GitHub] spark pull request: [WIP] [SPARK-4587] [mllib] ML model import/exp...

2015-02-01 Thread jkbradley
Github user jkbradley commented on a diff in the pull request:

https://github.com/apache/spark/pull/4233#discussion_r23903677
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/classification/LogisticRegression.scala
 ---
@@ -68,6 +79,65 @@ class LogisticRegressionModel (
   case None => score
 }
   }
+
+  override def save(sc: SparkContext, path: String): Unit = {
+val sqlContext = new SQLContext(sc)
+import sqlContext._
+
+// Create JSON metadata.
+val metadata = LogisticRegressionModel.Metadata(
+  clazz = this.getClass.getName, version = Exportable.latestVersion)
+val metadataRDD: DataFrame = sc.parallelize(Seq(metadata))
+metadataRDD.toJSON.saveAsTextFile(path + "/metadata")
+// Create Parquet data.
+val data = LogisticRegressionModel.Data(weights, intercept, threshold)
+val dataRDD: DataFrame = sc.parallelize(Seq(data))
+dataRDD.saveAsParquetFile(path + "/data")
+  }
+}
+
+object LogisticRegressionModel extends Importable[LogisticRegressionModel] 
{
+
+  private case class Metadata(clazz: String, version: String)
+
+  private case class Data(weights: Vector, intercept: Double, threshold: 
Option[Double])
+
+  override def load(sc: SparkContext, path: String): 
LogisticRegressionModel = {
+val sqlContext = new SQLContext(sc)
+import sqlContext._
+
+// Load JSON metadata.
+val metadataRDD = sqlContext.jsonFile(path + "/metadata")
+val metadataArray = metadataRDD.select("clazz", "version").take(1)
+assert(metadataArray.size == 1,
+  s"Unable to load LogisticRegressionModel metadata from: ${path + "/metadata"}")
+metadataArray(0) match {
--- End diff --

Good idea, I'll go ahead and set up that infrastructure for future versions.





[GitHub] spark pull request: [WIP] [SPARK-4587] [mllib] ML model import/exp...

2015-02-01 Thread mengxr
Github user mengxr commented on a diff in the pull request:

https://github.com/apache/spark/pull/4233#discussion_r23906177
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/classification/LogisticRegression.scala
 ---
@@ -68,6 +79,65 @@ class LogisticRegressionModel (
   case None => score
 }
   }
+
+  override def save(sc: SparkContext, path: String): Unit = {
+val sqlContext = new SQLContext(sc)
+import sqlContext._
+
+// Create JSON metadata.
+val metadata = LogisticRegressionModel.Metadata(
+  clazz = this.getClass.getName, version = Exportable.latestVersion)
+val metadataRDD: DataFrame = sc.parallelize(Seq(metadata))
+metadataRDD.toJSON.saveAsTextFile(path + "/metadata")
+// Create Parquet data.
+val data = LogisticRegressionModel.Data(weights, intercept, threshold)
+val dataRDD: DataFrame = sc.parallelize(Seq(data))
+dataRDD.saveAsParquetFile(path + "/data")
+  }
+}
+
+object LogisticRegressionModel extends Importable[LogisticRegressionModel] 
{
+
+  private case class Metadata(clazz: String, version: String)
+
+  private case class Data(weights: Vector, intercept: Double, threshold: 
Option[Double])
+
+  override def load(sc: SparkContext, path: String): 
LogisticRegressionModel = {
+val sqlContext = new SQLContext(sc)
+import sqlContext._
+
+// Load JSON metadata.
+val metadataRDD = sqlContext.jsonFile(path + "/metadata")
--- End diff --

We want to use RDD to avoid talking to fs directly. If you use json4s, you 
can render single line JSON easily:

~~~
import org.json4s._
import org.json4s.JsonDSL._
import org.json4s.jackson.JsonMethods._

val json = ("a\n" -> "b\n")
println(compact(render(json)))
~~~

outputs

~~~
{"a\n":"b\n"}
~~~

So the metadata won't span multiple lines.
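
For context, a sketch of how that suggestion could look inside `save()` (illustrative only: it assumes `sc`, `path`, and `Exportable.latestVersion` from the diff above, and json4s on the classpath):

```scala
import org.json4s.JsonDSL._
import org.json4s.jackson.JsonMethods._

// Render the metadata as a single compact JSON line and write it through an
// RDD[String], so save/load never touch the filesystem directly.
val metadataJson: String = compact(render(
  ("class" -> this.getClass.getName) ~ ("version" -> Exportable.latestVersion)))
sc.parallelize(Seq(metadataJson), 1).saveAsTextFile(path + "/metadata")
```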





[GitHub] spark pull request: [WIP] [SPARK-4587] [mllib] ML model import/exp...

2015-02-01 Thread jkbradley
Github user jkbradley commented on a diff in the pull request:

https://github.com/apache/spark/pull/4233#discussion_r23904506
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/classification/LogisticRegression.scala
 ---
@@ -68,6 +79,65 @@ class LogisticRegressionModel (
   case None => score
 }
   }
+
+  override def save(sc: SparkContext, path: String): Unit = {
+val sqlContext = new SQLContext(sc)
+import sqlContext._
+
+// Create JSON metadata.
+val metadata = LogisticRegressionModel.Metadata(
+  clazz = this.getClass.getName, version = Exportable.latestVersion)
+val metadataRDD: DataFrame = sc.parallelize(Seq(metadata))
+metadataRDD.toJSON.saveAsTextFile(path + "/metadata")
+// Create Parquet data.
+val data = LogisticRegressionModel.Data(weights, intercept, threshold)
+val dataRDD: DataFrame = sc.parallelize(Seq(data))
+dataRDD.saveAsParquetFile(path + "/data")
+  }
+}
+
+object LogisticRegressionModel extends Importable[LogisticRegressionModel] 
{
+
+  private case class Metadata(clazz: String, version: String)
+
+  private case class Data(weights: Vector, intercept: Double, threshold: 
Option[Double])
+
+  override def load(sc: SparkContext, path: String): 
LogisticRegressionModel = {
+val sqlContext = new SQLContext(sc)
+import sqlContext._
+
+// Load JSON metadata.
+val metadataRDD = sqlContext.jsonFile(path + "/metadata")
--- End diff --

Thinking more about it, how would metadata not be a single row?  A Row can 
have complex nested types.  If it were multiple rows, would it replicate 
columns like class and version for each row?





[GitHub] spark pull request: [WIP] [SPARK-4587] [mllib] ML model import/exp...

2015-02-01 Thread jkbradley
Github user jkbradley commented on a diff in the pull request:

https://github.com/apache/spark/pull/4233#discussion_r23909199
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/classification/LogisticRegression.scala
 ---
@@ -68,6 +79,65 @@ class LogisticRegressionModel (
   case None => score
 }
   }
+
+  override def save(sc: SparkContext, path: String): Unit = {
+val sqlContext = new SQLContext(sc)
+import sqlContext._
+
+// Create JSON metadata.
+val metadata = LogisticRegressionModel.Metadata(
+  clazz = this.getClass.getName, version = Exportable.latestVersion)
+val metadataRDD: DataFrame = sc.parallelize(Seq(metadata))
+metadataRDD.toJSON.saveAsTextFile(path + "/metadata")
+// Create Parquet data.
+val data = LogisticRegressionModel.Data(weights, intercept, threshold)
+val dataRDD: DataFrame = sc.parallelize(Seq(data))
+dataRDD.saveAsParquetFile(path + "/data")
+  }
+}
+
+object LogisticRegressionModel extends Importable[LogisticRegressionModel] 
{
+
+  private case class Metadata(clazz: String, version: String)
+
+  private case class Data(weights: Vector, intercept: Double, threshold: 
Option[Double])
+
+  override def load(sc: SparkContext, path: String): 
LogisticRegressionModel = {
+val sqlContext = new SQLContext(sc)
+import sqlContext._
+
+// Load JSON metadata.
+val metadataRDD = sqlContext.jsonFile(path + "/metadata")
--- End diff --

That's not quite my question: I think the confusion is mixing row (line) 
in a text file vs row (or record) in an RDD.  How about we store the metadata 
in a single record in an RDD, but we print that RDD as multi-line JSON to a 
single text file?  It will be easier for humans to read and will be easy to 
load as a single record as well.
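
One possible sketch of that (illustrative, reusing `metadata`, `sc`, and `path` from the diff above; `repartition(1)` is just one way to force a single part file):

```scala
// Keep the one-record metadata in an RDD/DataFrame, but collapse it to a
// single partition so the JSON text ends up in one part file on disk.
val metadataRDD: DataFrame = sc.parallelize(Seq(metadata), numSlices = 1)
metadataRDD.toJSON.repartition(1).saveAsTextFile(path + "/metadata")
```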





[GitHub] spark pull request: [WIP] [SPARK-4587] [mllib] ML model import/exp...

2015-02-01 Thread jkbradley
Github user jkbradley commented on a diff in the pull request:

https://github.com/apache/spark/pull/4233#discussion_r23909298
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/classification/LogisticRegression.scala
 ---
@@ -68,6 +79,65 @@ class LogisticRegressionModel (
   case None => score
 }
   }
+
+  override def save(sc: SparkContext, path: String): Unit = {
+val sqlContext = new SQLContext(sc)
+import sqlContext._
+
+// Create JSON metadata.
+val metadata = LogisticRegressionModel.Metadata(
+  clazz = this.getClass.getName, version = Exportable.latestVersion)
+val metadataRDD: DataFrame = sc.parallelize(Seq(metadata))
+metadataRDD.toJSON.saveAsTextFile(path + "/metadata")
+// Create Parquet data.
+val data = LogisticRegressionModel.Data(weights, intercept, threshold)
+val dataRDD: DataFrame = sc.parallelize(Seq(data))
+dataRDD.saveAsParquetFile(path + "/data")
+  }
+}
+
+object LogisticRegressionModel extends Importable[LogisticRegressionModel] 
{
+
+  private case class Metadata(clazz: String, version: String)
+
+  private case class Data(weights: Vector, intercept: Double, threshold: 
Option[Double])
+
+  override def load(sc: SparkContext, path: String): 
LogisticRegressionModel = {
+val sqlContext = new SQLContext(sc)
+import sqlContext._
+
+// Load JSON metadata.
+val metadataRDD = sqlContext.jsonFile(path + "/metadata")
--- End diff --

Also, I get the motivation for using json4s directly rather than going 
through DataFrame and DataFrame.toJSON in terms of reducing dependencies.  
However, I like the idea of using DataFrame since it will be helpful when we 
add other types of metadata, such as info about each feature.





[GitHub] spark pull request: [WIP] [SPARK-4587] [mllib] ML model import/exp...

2015-02-01 Thread jkbradley
Github user jkbradley commented on a diff in the pull request:

https://github.com/apache/spark/pull/4233#discussion_r23910101
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/classification/LogisticRegression.scala
 ---
@@ -68,6 +79,65 @@ class LogisticRegressionModel (
   case None => score
 }
   }
+
+  override def save(sc: SparkContext, path: String): Unit = {
+val sqlContext = new SQLContext(sc)
+import sqlContext._
+
+// Create JSON metadata.
+val metadata = LogisticRegressionModel.Metadata(
+  clazz = this.getClass.getName, version = Exportable.latestVersion)
+val metadataRDD: DataFrame = sc.parallelize(Seq(metadata))
+metadataRDD.toJSON.saveAsTextFile(path + "/metadata")
+// Create Parquet data.
+val data = LogisticRegressionModel.Data(weights, intercept, threshold)
+val dataRDD: DataFrame = sc.parallelize(Seq(data))
+dataRDD.saveAsParquetFile(path + "/data")
+  }
+}
+
+object LogisticRegressionModel extends Importable[LogisticRegressionModel] 
{
+
+  private case class Metadata(clazz: String, version: String)
+
+  private case class Data(weights: Vector, intercept: Double, threshold: 
Option[Double])
+
+  override def load(sc: SparkContext, path: String): 
LogisticRegressionModel = {
+val sqlContext = new SQLContext(sc)
+import sqlContext._
+
+// Load JSON metadata.
+val metadataRDD = sqlContext.jsonFile(path + "/metadata")
--- End diff --

(I guess these are conflicting since using DataFrame and toJSON will mean 1 
record per text file line, but that's OK with me.)





[GitHub] spark pull request: [WIP] [SPARK-4587] [mllib] ML model import/exp...

2015-01-31 Thread mengxr
Github user mengxr commented on a diff in the pull request:

https://github.com/apache/spark/pull/4233#discussion_r23887814
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/classification/LogisticRegression.scala
 ---
@@ -68,6 +79,65 @@ class LogisticRegressionModel (
   case None => score
 }
   }
+
+  override def save(sc: SparkContext, path: String): Unit = {
+val sqlContext = new SQLContext(sc)
+import sqlContext._
+
+// Create JSON metadata.
+val metadata = LogisticRegressionModel.Metadata(
+  clazz = this.getClass.getName, version = Exportable.latestVersion)
--- End diff --

Should each model have its own version? For example, we might add some 
statistics to LRModel and save them during model export. If the version is 
global, we might have trouble loading it back.

With the new DataFrame API, this could be done by

~~~
sc.parallelize(Seq((clazz, version))).toDataFrame("class", "version")
~~~

and hence we don't need the case class and it is easier to use class 
instead of clazz.

For the JSON format, we can use json4s directly and save as `RDD[String]`. 
The code will be cleaner and have less dependency.





[GitHub] spark pull request: [WIP] [SPARK-4587] [mllib] ML model import/exp...

2015-01-31 Thread mengxr
Github user mengxr commented on a diff in the pull request:

https://github.com/apache/spark/pull/4233#discussion_r23887815
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/classification/LogisticRegression.scala
 ---
@@ -68,6 +79,65 @@ class LogisticRegressionModel (
   case None => score
 }
   }
+
+  override def save(sc: SparkContext, path: String): Unit = {
+val sqlContext = new SQLContext(sc)
+import sqlContext._
+
+// Create JSON metadata.
+val metadata = LogisticRegressionModel.Metadata(
+  clazz = this.getClass.getName, version = Exportable.latestVersion)
+val metadataRDD: DataFrame = sc.parallelize(Seq(metadata))
+metadataRDD.toJSON.saveAsTextFile(path + "/metadata")
+// Create Parquet data.
+val data = LogisticRegressionModel.Data(weights, intercept, threshold)
+val dataRDD: DataFrame = sc.parallelize(Seq(data))
+dataRDD.saveAsParquetFile(path + "/data")
+  }
+}
+
+object LogisticRegressionModel extends Importable[LogisticRegressionModel] 
{
+
+  private case class Metadata(clazz: String, version: String)
+
+  private case class Data(weights: Vector, intercept: Double, threshold: 
Option[Double])
+
+  override def load(sc: SparkContext, path: String): 
LogisticRegressionModel = {
+val sqlContext = new SQLContext(sc)
+import sqlContext._
+
+// Load JSON metadata.
+val metadataRDD = sqlContext.jsonFile(path + "/metadata")
--- End diff --

Same here. We can load it as RDD[String] and use json4s to parse it. The 
question is whether we want to treat metadata as a single-record table. It 
might have nested fields, which doesn't look like a table to me.





[GitHub] spark pull request: [WIP] [SPARK-4587] [mllib] ML model import/exp...

2015-01-31 Thread mengxr
Github user mengxr commented on a diff in the pull request:

https://github.com/apache/spark/pull/4233#discussion_r23887813
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/classification/LogisticRegression.scala
 ---
@@ -68,6 +79,65 @@ class LogisticRegressionModel (
   case None => score
 }
   }
+
+  override def save(sc: SparkContext, path: String): Unit = {
+val sqlContext = new SQLContext(sc)
+import sqlContext._
--- End diff --

`import sqlContext._` is no longer needed due to recent API change. 
`implicit val sqlContext = new SQLContext(sc)` should work.





[GitHub] spark pull request: [WIP] [SPARK-4587] [mllib] ML model import/exp...

2015-01-31 Thread mengxr
Github user mengxr commented on a diff in the pull request:

https://github.com/apache/spark/pull/4233#discussion_r23887816
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/classification/LogisticRegression.scala
 ---
@@ -68,6 +79,65 @@ class LogisticRegressionModel (
   case None => score
 }
   }
+
+  override def save(sc: SparkContext, path: String): Unit = {
+val sqlContext = new SQLContext(sc)
+import sqlContext._
+
+// Create JSON metadata.
+val metadata = LogisticRegressionModel.Metadata(
+  clazz = this.getClass.getName, version = Exportable.latestVersion)
+val metadataRDD: DataFrame = sc.parallelize(Seq(metadata))
+metadataRDD.toJSON.saveAsTextFile(path + "/metadata")
+// Create Parquet data.
+val data = LogisticRegressionModel.Data(weights, intercept, threshold)
+val dataRDD: DataFrame = sc.parallelize(Seq(data))
+dataRDD.saveAsParquetFile(path + "/data")
+  }
+}
+
+object LogisticRegressionModel extends Importable[LogisticRegressionModel] 
{
+
+  private case class Metadata(clazz: String, version: String)
+
+  private case class Data(weights: Vector, intercept: Double, threshold: 
Option[Double])
+
+  override def load(sc: SparkContext, path: String): 
LogisticRegressionModel = {
+val sqlContext = new SQLContext(sc)
+import sqlContext._
+
+// Load JSON metadata.
+val metadataRDD = sqlContext.jsonFile(path + "/metadata")
+val metadataArray = metadataRDD.select("clazz", "version").take(1)
+assert(metadataArray.size == 1,
+  s"Unable to load LogisticRegressionModel metadata from: ${path + "/metadata"}")
+metadataArray(0) match {
--- End diff --

The loading part could be more generic. For each model and each version, we 
need to maintain a loader that reads saved models to the current version or 
claim incompatible. For example, 

~~~
object LogisticRegressionModel extends Importable[LogisticRegressionModel] {
  def load(sc, path): LogisticRegressionModel = {
    val metadata = ...
    val importer = Importers.get(version)
    importer.load(sc, path)
  }
  object Importers {
    def get(version: String): Importer[LogisticRegressionModel] = {
      version match {
        case "1.0" => new V1()
        case "2.0" => ...
        case _ => throw new Exception(...)
      }
    }
    class V1 extends Importer[LogisticRegressionModel] {
      ...
    }
  }
}
~~~





[GitHub] spark pull request: [WIP] [SPARK-4587] [mllib] ML model import/exp...

2015-01-31 Thread jkbradley
Github user jkbradley commented on a diff in the pull request:

https://github.com/apache/spark/pull/4233#discussion_r23888056
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/classification/LogisticRegression.scala
 ---
@@ -68,6 +79,65 @@ class LogisticRegressionModel (
   case None => score
 }
   }
+
+  override def save(sc: SparkContext, path: String): Unit = {
+val sqlContext = new SQLContext(sc)
+import sqlContext._
--- End diff --

I tried that, and it fails to compile.  (I tried removing sqlContext from 
save(), as well as having it be an implicit val without the import.  Neither 
worked.)  Is there another import I need?
```
[error] 
/Users/josephkb/spark/mllib/src/main/scala/org/apache/spark/mllib/classification/LogisticRegression.scala:89:
 type mismatch;
[error]  found   : 
org.apache.spark.rdd.RDD[org.apache.spark.mllib.classification.LogisticRegressionModel.Metadata]
[error]  required: org.apache.spark.sql.DataFrame
[error] Error occurred in an application involving default arguments.
[error] val metadataRDD: DataFrame = sc.parallelize(Seq(metadata))
[error]^
[error] 
/Users/josephkb/spark/mllib/src/main/scala/org/apache/spark/mllib/classification/LogisticRegression.scala:94:
 type mismatch;
[error]  found   : 
org.apache.spark.rdd.RDD[org.apache.spark.mllib.classification.LogisticRegressionModel.Data]
[error]  required: org.apache.spark.sql.DataFrame
[error] Error occurred in an application involving default arguments.
[error] val dataRDD: DataFrame = sc.parallelize(Seq(data))
[error]^
[error] two errors found
[error] (mllib/compile:compile) Compilation failed
```
It may not be an issue though, with the other change you suggested below.  
I'll see.





[GitHub] spark pull request: [WIP] [SPARK-4587] [mllib] ML model import/exp...

2015-01-31 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/4233#issuecomment-72313229
  
  [Test build #26459 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/26459/consoleFull)
 for   PR 4233 at commit 
[`638fa81`](https://github.com/apache/spark/commit/638fa81e40ef64558a6de70b4ad669fb63c46ada).
 * This patch **fails Spark unit tests**.
 * This patch merges cleanly.
 * This patch adds the following public classes _(experimental)_:
  * `trait Exportable `
  * `trait Importable[Model <: Exportable] `






[GitHub] spark pull request: [WIP] [SPARK-4587] [mllib] ML model import/exp...

2015-01-31 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/4233#issuecomment-72313232
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/26459/
Test FAILed.





[GitHub] spark pull request: [WIP] [SPARK-4587] [mllib] ML model import/exp...

2015-01-31 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/4233#issuecomment-72311863
  
  [Test build #26459 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/26459/consoleFull)
 for   PR 4233 at commit 
[`638fa81`](https://github.com/apache/spark/commit/638fa81e40ef64558a6de70b4ad669fb63c46ada).
 * This patch merges cleanly.





[GitHub] spark pull request: [WIP] [SPARK-4587] [mllib] ML model import/exp...

2015-01-30 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/4233#issuecomment-72278551
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/26427/
Test PASSed.





[GitHub] spark pull request: [WIP] [SPARK-4587] [mllib] ML model import/exp...

2015-01-30 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/4233#issuecomment-72278541
  
  [Test build #26427 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/26427/consoleFull) for PR 4233 at commit [`365314f`](https://github.com/apache/spark/commit/365314f5a90005c2f3f625fb60739d64c029148a).
 * This patch **passes all tests**.
 * This patch merges cleanly.
 * This patch adds the following public classes _(experimental)_:
  * `trait Exportable `
  * `trait Importable[Model : Exportable] `






[GitHub] spark pull request: [WIP] [SPARK-4587] [mllib] ML model import/exp...

2015-01-30 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/4233#issuecomment-72269096
  
  [Test build #26427 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/26427/consoleFull) for PR 4233 at commit [`365314f`](https://github.com/apache/spark/commit/365314f5a90005c2f3f625fb60739d64c029148a).
 * This patch merges cleanly.
 * This patch merges cleanly.





[GitHub] spark pull request: [WIP] [SPARK-4587] [mllib] ML model import/exp...

2015-01-27 Thread jkbradley
GitHub user jkbradley opened a pull request:

https://github.com/apache/spark/pull/4233

[WIP] [SPARK-4587] [mllib] ML model import/export

This is a WIP PR for Parquet-based model import/export. Please see the
design doc on [the JIRA](https://issues.apache.org/jira/browse/SPARK-4587).

Sketch of current contents:
* New traits: Exportable and Importable, with abstract methods save(sc, path)
and load(sc, path), respectively
* Implemented Exportable and Importable for LogisticRegressionModel, and added
a unit test
* Also added a LogisticRegressionModel.getThreshold method (so that the unit
test can check the threshold)

I will remove the WIP tag after I add save/load for other models in the 
spark.mllib package.
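
The save/load contract sketched above can be illustrated in miniature. This is a hypothetical simplification, not the actual Spark API: the real traits take a `SparkContext` and persist model data as Parquet, whereas here a plain `String` path and in-memory placeholder data stand in for that machinery.

```scala
// Minimal sketch of the Exportable/Importable pattern from this PR.
// Signatures are simplified (String path instead of SparkContext + Parquet
// I/O) and the placeholder model data is invented for illustration.
trait Exportable {
  def save(path: String): Unit
}

trait Importable[Model <: Exportable] {
  def load(path: String): Model
}

case class LogisticRegressionModel(weights: Array[Double], intercept: Double)
    extends Exportable {
  // In the real PR this writes model metadata and data via a SparkContext.
  def save(path: String): Unit =
    println(s"saving model with ${weights.length} weights to $path")
}

// Companion object plays the role of the Importable loader.
object LogisticRegressionModel extends Importable[LogisticRegressionModel] {
  def load(path: String): LogisticRegressionModel =
    LogisticRegressionModel(Array(0.5, -0.25), 0.1) // placeholder data
}
```

A unit test along the lines of the one in this PR would then save a model, load it back, and compare fields such as the weights and threshold.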

CC: @mengxr  @selvinsource


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/jkbradley/spark ml-import-export

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/4233.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #4233


commit 4c11844331cb76c6c0d0ea10141aadcd174a348f
Author: Joseph K. Bradley jos...@databricks.com
Date:   2015-01-27T21:44:02Z

Added save, load to mllib.classification.LogisticRegressionModel, plus test 
suite

commit 14711b7d378534d7ff9a8a790af9089ac0c20ad9
Author: Joseph K. Bradley jos...@databricks.com
Date:   2015-01-27T22:08:10Z

small cleanups







[GitHub] spark pull request: [WIP] [SPARK-4587] [mllib] ML model import/exp...

2015-01-27 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/4233#issuecomment-71741610
  
  [Test build #26190 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/26190/consoleFull) for PR 4233 at commit [`14711b7`](https://github.com/apache/spark/commit/14711b7d378534d7ff9a8a790af9089ac0c20ad9).
 * This patch merges cleanly.





[GitHub] spark pull request: [WIP] [SPARK-4587] [mllib] ML model import/exp...

2015-01-27 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/4233#issuecomment-71752235
  
  [Test build #26190 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/26190/consoleFull) for PR 4233 at commit [`14711b7`](https://github.com/apache/spark/commit/14711b7d378534d7ff9a8a790af9089ac0c20ad9).
 * This patch **passes all tests**.
 * This patch merges cleanly.
 * This patch adds the following public classes _(experimental)_:
  * `trait Exportable `
  * `trait Importable[Model : Exportable] `






[GitHub] spark pull request: [WIP] [SPARK-4587] [mllib] ML model import/exp...

2015-01-27 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/4233#issuecomment-71752240
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/26190/
Test PASSed.

