[GitHub] spark pull request: [SPARK-3491] [WIP] [MLlib] [PySpark] use pickl...

2014-09-17 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/2378#issuecomment-55855022
  
  [QA tests have 
finished](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/121/consoleFull)
 for   PR 2378 at commit 
[`1fccf1a`](https://github.com/apache/spark/commit/1fccf1adc91e78a6c9e65f4ae14ba770a7eecd2c).
 * This patch **passes** unit tests.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark pull request: [SPARK-3491] [WIP] [MLlib] [PySpark] use pickl...

2014-09-17 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/2378#issuecomment-55855160
  
  [QA tests have 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/20449/consoleFull)
 for   PR 2378 at commit 
[`19d0967`](https://github.com/apache/spark/commit/19d096783b60e741173f48f2944d91f650616140).
 * This patch **passes** unit tests.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark pull request: [SPARK-3491] [WIP] [MLlib] [PySpark] use pickl...

2014-09-16 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/2378#issuecomment-55699312
  
  [QA tests have 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/20371/consoleFull)
 for   PR 2378 at commit 
[`df19464`](https://github.com/apache/spark/commit/df194640e7dd72d9c6413ec2935889d422a41de2).
 * This patch merges cleanly.





[GitHub] spark pull request: [SPARK-3491] [WIP] [MLlib] [PySpark] use pickl...

2014-09-16 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/2378#issuecomment-55699370
  
  [QA tests have 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/20371/consoleFull)
 for   PR 2378 at commit 
[`df19464`](https://github.com/apache/spark/commit/df194640e7dd72d9c6413ec2935889d422a41de2).
 * This patch **fails** unit tests.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark pull request: [SPARK-3491] [WIP] [MLlib] [PySpark] use pickl...

2014-09-16 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/2378#issuecomment-55706058
  
  [QA tests have 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/20375/consoleFull)
 for   PR 2378 at commit 
[`708dc02`](https://github.com/apache/spark/commit/708dc0288d23385ff3638fd07fdff9efc3ff8272).
 * This patch merges cleanly.





[GitHub] spark pull request: [SPARK-3491] [WIP] [MLlib] [PySpark] use pickl...

2014-09-16 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/2378#issuecomment-55707384
  
  [QA tests have 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/20376/consoleFull)
 for   PR 2378 at commit 
[`e1d1bfc`](https://github.com/apache/spark/commit/e1d1bfce4b464e6b14f649081155faf7c4d28471).
 * This patch merges cleanly.





[GitHub] spark pull request: [SPARK-3491] [WIP] [MLlib] [PySpark] use pickl...

2014-09-16 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/2378#issuecomment-55709985
  
  [QA tests have 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/20377/consoleFull)
 for   PR 2378 at commit 
[`44736d7`](https://github.com/apache/spark/commit/44736d7d849a523419006b565cf51fa732e8854c).
 * This patch merges cleanly.





[GitHub] spark pull request: [SPARK-3491] [WIP] [MLlib] [PySpark] use pickl...

2014-09-16 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/2378#issuecomment-55711136
  
  [QA tests have 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/20375/consoleFull)
 for   PR 2378 at commit 
[`708dc02`](https://github.com/apache/spark/commit/708dc0288d23385ff3638fd07fdff9efc3ff8272).
 * This patch **fails** unit tests.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark pull request: [SPARK-3491] [WIP] [MLlib] [PySpark] use pickl...

2014-09-16 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/2378#issuecomment-55712664
  
  [QA tests have 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/20376/consoleFull)
 for   PR 2378 at commit 
[`e1d1bfc`](https://github.com/apache/spark/commit/e1d1bfce4b464e6b14f649081155faf7c4d28471).
 * This patch **fails** unit tests.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark pull request: [SPARK-3491] [WIP] [MLlib] [PySpark] use pickl...

2014-09-16 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/2378#issuecomment-55716038
  
  [QA tests have 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/20377/consoleFull)
 for   PR 2378 at commit 
[`44736d7`](https://github.com/apache/spark/commit/44736d7d849a523419006b565cf51fa732e8854c).
 * This patch **fails** unit tests.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark pull request: [SPARK-3491] [WIP] [MLlib] [PySpark] use pickl...

2014-09-16 Thread mengxr
Github user mengxr commented on the pull request:

https://github.com/apache/spark/pull/2378#issuecomment-55795901
  
@davies A couple of Python tests failed with this change. Could you fix them?





[GitHub] spark pull request: [SPARK-3491] [WIP] [MLlib] [PySpark] use pickl...

2014-09-16 Thread mengxr
Github user mengxr commented on the pull request:

https://github.com/apache/spark/pull/2378#issuecomment-55795929
  
test this please





[GitHub] spark pull request: [SPARK-3491] [WIP] [MLlib] [PySpark] use pickl...

2014-09-16 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/2378#issuecomment-55807054
  
  [QA tests have 
started](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/117/consoleFull)
 for   PR 2378 at commit 
[`9ceff73`](https://github.com/apache/spark/commit/9ceff7360427e9b36d7151c5f296d0ce199610dc).
 * This patch merges cleanly.





[GitHub] spark pull request: [SPARK-3491] [WIP] [MLlib] [PySpark] use pickl...

2014-09-16 Thread mengxr
Github user mengxr commented on a diff in the pull request:

https://github.com/apache/spark/pull/2378#discussion_r17630886
  
--- Diff: core/src/main/scala/org/apache/spark/api/python/PythonRDD.scala 
---
@@ -775,17 +775,38 @@ private[spark] object PythonRDD extends Logging {
 }.toJavaRDD()
   }
 
+  private class AutoBatchedPickler(iter: Iterator[Any]) extends Iterator[Array[Byte]] {
+    private val pickle = new Pickler()
+    private var batch = 1
+    private val buffer = new mutable.ArrayBuffer[Any]
+
+    override def hasNext(): Boolean = iter.hasNext
+
+    override def next(): Array[Byte] = {
+      while (iter.hasNext && buffer.length < batch) {
+        buffer += iter.next()
+      }
+      val bytes = pickle.dumps(buffer.toArray)
+      val size = bytes.length
+      // let 1M < size < 10M
+      if (size < 1024 * 100) {
+        batch = (1024 * 100) / size  // fast grow
--- End diff --

If the first record is small, e.g., a SparseVector with a single nonzero, 
and the records that follow are large vectors, line 789 may cause memory problems. 
Does it give a significant performance gain? Under what circumstances?
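
To make the concern concrete, here is a rough back-of-the-envelope sketch of the scenario described above, with assumed record sizes (not measurements from this patch):

```scala
// Assumed sizes, only to illustrate the worst case raised in this comment.
val firstRecordBytes = 10                                // a SparseVector with one nonzero pickles to ~10 bytes
val fastGrowBatch    = (1024 * 100) / firstRecordBytes   // the fast-grow branch sets batch to 10240
val laterRecordBytes = 1024 * 1024                       // the following records are ~1 MB dense vectors
val bufferedBytes    = fastGrowBatch.toLong * laterRecordBytes
// bufferedBytes is ~10 GB held in the ArrayBuffer before the next pickle.dumps call
```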





[GitHub] spark pull request: [SPARK-3491] [WIP] [MLlib] [PySpark] use pickl...

2014-09-16 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/2378#issuecomment-55815158
  
  [QA tests have 
finished](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/117/consoleFull)
 for   PR 2378 at commit 
[`9ceff73`](https://github.com/apache/spark/commit/9ceff7360427e9b36d7151c5f296d0ce199610dc).
 * This patch **fails** unit tests.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark pull request: [SPARK-3491] [WIP] [MLlib] [PySpark] use pickl...

2014-09-16 Thread davies
Github user davies commented on a diff in the pull request:

https://github.com/apache/spark/pull/2378#discussion_r17631575
  
--- Diff: core/src/main/scala/org/apache/spark/api/python/PythonRDD.scala 
---
@@ -775,17 +775,38 @@ private[spark] object PythonRDD extends Logging {
 }.toJavaRDD()
   }
 
+  private class AutoBatchedPickler(iter: Iterator[Any]) extends Iterator[Array[Byte]] {
+    private val pickle = new Pickler()
+    private var batch = 1
+    private val buffer = new mutable.ArrayBuffer[Any]
+
+    override def hasNext(): Boolean = iter.hasNext
+
+    override def next(): Array[Byte] = {
+      while (iter.hasNext && buffer.length < batch) {
+        buffer += iter.next()
+      }
+      val bytes = pickle.dumps(buffer.toArray)
+      val size = bytes.length
+      // let 1M < size < 10M
+      if (size < 1024 * 100) {
+        batch = (1024 * 100) / size  // fast grow
--- End diff --

Good question. Without this fast path, `batch` may need to grow about 15 times 
before it becomes stable, which is fine and safer. I will remove this fast path. 
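
For scale, a quick estimate (with an assumed per-record size) of how many doublings the slow path needs before the batch size stabilizes:

```scala
// Assumed: each record pickles to ~10 bytes, so a ~100 KB batch holds ~10k records.
val recordBytes = 10
val targetBatch = (1024 * 100) / recordBytes   // 10240 records per batch
val doublings   = math.ceil(math.log(targetBatch.toDouble) / math.log(2.0)).toInt
// doublings == 14, i.e. roughly the "15 times" mentioned above
```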





[GitHub] spark pull request: [SPARK-3491] [WIP] [MLlib] [PySpark] use pickl...

2014-09-16 Thread mengxr
Github user mengxr commented on a diff in the pull request:

https://github.com/apache/spark/pull/2378#discussion_r17632544
  
--- Diff: core/src/main/scala/org/apache/spark/api/python/PythonRDD.scala 
---
@@ -775,17 +775,38 @@ private[spark] object PythonRDD extends Logging {
 }.toJavaRDD()
   }
 
+  private class AutoBatchedPickler(iter: Iterator[Any]) extends Iterator[Array[Byte]] {
+    private val pickle = new Pickler()
+    private var batch = 1
+    private val buffer = new mutable.ArrayBuffer[Any]
+
+    override def hasNext(): Boolean = iter.hasNext
+
+    override def next(): Array[Byte] = {
+      while (iter.hasNext && buffer.length < batch) {
+        buffer += iter.next()
+      }
+      val bytes = pickle.dumps(buffer.toArray)
+      val size = bytes.length
+      // let 1M < size < 10M
+      if (size < 1024 * 100) {
+        batch = (1024 * 100) / size  // fast grow
+      } else if (size < 1024 * 1024) {
+        batch *= 2
+      } else if (size > 1024 * 1024 * 10) {
+        batch /= 2
--- End diff --

If the first record is very large, `batch` will be 0.
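
A minimal guard for that corner case could look like the following sketch (an illustration, not the code that ended up in the patch):

```scala
// If the very first record already pickles to more than 10 MB, halving batch (= 1)
// yields 0, and `buffer.length < batch` would never admit another record, so the
// iterator would keep emitting empty batches. Clamping keeps it positive:
var batch = 1
val size  = 11 * 1024 * 1024          // assumed: the first record pickled to ~11 MB
if (size > 1024 * 1024 * 10) {
  batch = math.max(batch / 2, 1)      // without max(..., 1) this becomes 0
}
```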





[GitHub] spark pull request: [SPARK-3491] [WIP] [MLlib] [PySpark] use pickl...

2014-09-16 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/2378#issuecomment-5582
  
  [QA tests have 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/20433/consoleFull)
 for   PR 2378 at commit 
[`1fccf1a`](https://github.com/apache/spark/commit/1fccf1adc91e78a6c9e65f4ae14ba770a7eecd2c).
 * This patch merges cleanly.





[GitHub] spark pull request: [SPARK-3491] [WIP] [MLlib] [PySpark] use pickl...

2014-09-16 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/2378#issuecomment-55834191
  
  [QA tests have 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/20433/consoleFull)
 for   PR 2378 at commit 
[`1fccf1a`](https://github.com/apache/spark/commit/1fccf1adc91e78a6c9e65f4ae14ba770a7eecd2c).
 * This patch **fails** unit tests.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark pull request: [SPARK-3491] [WIP] [MLlib] [PySpark] use pickl...

2014-09-16 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/2378#issuecomment-55850933
  
  [QA tests have 
started](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/121/consoleFull)
 for   PR 2378 at commit 
[`1fccf1a`](https://github.com/apache/spark/commit/1fccf1adc91e78a6c9e65f4ae14ba770a7eecd2c).
 * This patch merges cleanly.





[GitHub] spark pull request: [SPARK-3491] [WIP] [MLlib] [PySpark] use pickl...

2014-09-16 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/2378#issuecomment-55851024
  
  [QA tests have 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/20449/consoleFull)
 for   PR 2378 at commit 
[`19d0967`](https://github.com/apache/spark/commit/19d096783b60e741173f48f2944d91f650616140).
 * This patch merges cleanly.





[GitHub] spark pull request: [SPARK-3491] [WIP] [MLlib] [PySpark] use pickl...

2014-09-15 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/2378#issuecomment-55668739
  
  [QA tests have 
started](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/109/consoleFull)
 for   PR 2378 at commit 
[`722dd96`](https://github.com/apache/spark/commit/722dd96976d6a083b0ddb985ac6c518c791bce39).
 * This patch **does not** merge cleanly!





[GitHub] spark pull request: [SPARK-3491] [WIP] [MLlib] [PySpark] use pickl...

2014-09-15 Thread mengxr
Github user mengxr commented on a diff in the pull request:

https://github.com/apache/spark/pull/2378#discussion_r17574390
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/api/python/PythonMLLibAPI.scala ---
@@ -17,16 +17,18 @@
 
 package org.apache.spark.mllib.api.python
 
-import java.nio.{ByteBuffer, ByteOrder}
+import java.io.OutputStream
 
 import scala.collection.JavaConverters._
 
+import net.razorvine.pickle.{Pickler, Unpickler, IObjectConstructor, IObjectPickler, PickleException, Opcodes}
--- End diff --

use `_`
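
That is, presumably collapsing the import to a wildcard along these lines:

```scala
import net.razorvine.pickle._
```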





[GitHub] spark pull request: [SPARK-3491] [WIP] [MLlib] [PySpark] use pickl...

2014-09-15 Thread mengxr
Github user mengxr commented on a diff in the pull request:

https://github.com/apache/spark/pull/2378#discussion_r17574399
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/api/python/PythonMLLibAPI.scala ---
@@ -472,214 +452,140 @@ class PythonMLLibAPI extends Serializable {
   numRows: Long,
   numCols: Int,
   numPartitions: java.lang.Integer,
-  seed: java.lang.Long): JavaRDD[Array[Byte]] = {
+  seed: java.lang.Long): JavaRDD[Vector] = {
 val parts = getNumPartitionsOrDefault(numPartitions, jsc)
 val s = getSeedOrDefault(seed)
-RG.poissonVectorRDD(jsc.sc, mean, numRows, numCols, parts, 
s).map(SerDe.serializeDoubleVector)
+RG.poissonVectorRDD(jsc.sc, mean, numRows, numCols, parts, s)
   }
 
 }
 
 /**
- * :: DeveloperApi ::
- * MultivariateStatisticalSummary with Vector fields serialized.
+ * SerDe utility functions for PythonMLLibAPI.
  */
-@DeveloperApi
-class MultivariateStatisticalSummarySerialized(val summary: 
MultivariateStatisticalSummary)
-  extends Serializable {
+private[spark] object SerDe extends Serializable {
 
-  def mean: Array[Byte] = SerDe.serializeDoubleVector(summary.mean)
+  private[python] def reduce_object(out: OutputStream, pickler: Pickler,
+                                    module: String, name: String, objects: Object*) = {
+    out.write(Opcodes.GLOBAL)
+    out.write((module + "\n" + name + "\n").getBytes)
--- End diff --

Does it increase the storage cost by a lot for small objects?
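
A rough, back-of-the-envelope estimate of the per-object overhead added by that framing (arithmetic only, not a measurement):

```scala
// Bytes added by the GLOBAL/MARK/TUPLE/REDUCE framing in reduce_object, per object:
val nameBytes   = "pyspark.mllib.linalg".length + 1 + "DenseVector".length + 1  // two names plus two '\n'
val opcodeBytes = 4                        // GLOBAL, MARK, TUPLE, REDUCE are one byte each
val overhead    = nameBytes + opcodeBytes  // ~37 bytes
// Negligible for large vectors, but noticeable relative to a vector holding only a few doubles.
```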





[GitHub] spark pull request: [SPARK-3491] [WIP] [MLlib] [PySpark] use pickl...

2014-09-15 Thread mengxr
Github user mengxr commented on a diff in the pull request:

https://github.com/apache/spark/pull/2378#discussion_r17574385
  
--- Diff: core/src/main/scala/org/apache/spark/api/python/PythonRDD.scala 
---
@@ -778,8 +778,8 @@ private[spark] object PythonRDD extends Logging {
   def javaToPython(jRDD: JavaRDD[Any]): JavaRDD[Array[Byte]] = {
     jRDD.rdd.mapPartitions { iter =>
       val pickle = new Pickler
-      iter.map { row =>
-        pickle.dumps(row)
+      iter.grouped(1024).map { rows =>
--- End diff --

Shall we divide groups based on the serialized size?
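
One way to read that suggestion (a sketch under assumed parameters, not code from this patch) is to pickle rows individually and close a group once the accumulated payload crosses a byte budget, rather than after a fixed 1024 rows:

```scala
import scala.collection.mutable.ArrayBuffer
import net.razorvine.pickle.Pickler

// Group pickled rows by accumulated size; the 64 KB budget is an arbitrary assumption.
def groupBySerializedSize(iter: Iterator[Any], targetBytes: Int = 64 * 1024): Iterator[Array[Array[Byte]]] =
  new Iterator[Array[Array[Byte]]] {
    private val pickle = new Pickler()
    def hasNext: Boolean = iter.hasNext
    def next(): Array[Array[Byte]] = {
      val group = ArrayBuffer[Array[Byte]]()
      var total = 0
      while (iter.hasNext && total < targetBytes) {
        val bytes = pickle.dumps(iter.next().asInstanceOf[Object])  // pickle each row on its own
        group += bytes
        total += bytes.length
      }
      group.toArray
    }
  }
```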





[GitHub] spark pull request: [SPARK-3491] [WIP] [MLlib] [PySpark] use pickl...

2014-09-15 Thread mengxr
Github user mengxr commented on a diff in the pull request:

https://github.com/apache/spark/pull/2378#discussion_r17574396
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/api/python/PythonMLLibAPI.scala ---
@@ -472,214 +452,140 @@ class PythonMLLibAPI extends Serializable {
   numRows: Long,
   numCols: Int,
   numPartitions: java.lang.Integer,
-  seed: java.lang.Long): JavaRDD[Array[Byte]] = {
+  seed: java.lang.Long): JavaRDD[Vector] = {
 val parts = getNumPartitionsOrDefault(numPartitions, jsc)
 val s = getSeedOrDefault(seed)
-RG.poissonVectorRDD(jsc.sc, mean, numRows, numCols, parts, 
s).map(SerDe.serializeDoubleVector)
+RG.poissonVectorRDD(jsc.sc, mean, numRows, numCols, parts, s)
   }
 
 }
 
 /**
- * :: DeveloperApi ::
- * MultivariateStatisticalSummary with Vector fields serialized.
+ * SerDe utility functions for PythonMLLibAPI.
  */
-@DeveloperApi
-class MultivariateStatisticalSummarySerialized(val summary: 
MultivariateStatisticalSummary)
-  extends Serializable {
+private[spark] object SerDe extends Serializable {
 
-  def mean: Array[Byte] = SerDe.serializeDoubleVector(summary.mean)
+  private[python] def reduce_object(out: OutputStream, pickler: Pickler,
--- End diff --

use camelCase for method names





[GitHub] spark pull request: [SPARK-3491] [WIP] [MLlib] [PySpark] use pickl...

2014-09-15 Thread mengxr
Github user mengxr commented on a diff in the pull request:

https://github.com/apache/spark/pull/2378#discussion_r17574404
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/api/python/PythonMLLibAPI.scala ---
@@ -472,214 +452,140 @@ class PythonMLLibAPI extends Serializable {
   numRows: Long,
   numCols: Int,
   numPartitions: java.lang.Integer,
-  seed: java.lang.Long): JavaRDD[Array[Byte]] = {
+  seed: java.lang.Long): JavaRDD[Vector] = {
 val parts = getNumPartitionsOrDefault(numPartitions, jsc)
 val s = getSeedOrDefault(seed)
-RG.poissonVectorRDD(jsc.sc, mean, numRows, numCols, parts, 
s).map(SerDe.serializeDoubleVector)
+RG.poissonVectorRDD(jsc.sc, mean, numRows, numCols, parts, s)
   }
 
 }
 
 /**
- * :: DeveloperApi ::
- * MultivariateStatisticalSummary with Vector fields serialized.
+ * SerDe utility functions for PythonMLLibAPI.
  */
-@DeveloperApi
-class MultivariateStatisticalSummarySerialized(val summary: 
MultivariateStatisticalSummary)
-  extends Serializable {
+private[spark] object SerDe extends Serializable {
 
-  def mean: Array[Byte] = SerDe.serializeDoubleVector(summary.mean)
+  private[python] def reduce_object(out: OutputStream, pickler: Pickler,
+                                    module: String, name: String, objects: Object*) = {
+    out.write(Opcodes.GLOBAL)
+    out.write((module + "\n" + name + "\n").getBytes)
+    out.write(Opcodes.MARK)
+    objects.foreach(pickler.save(_))
+    out.write(Opcodes.TUPLE)
+    out.write(Opcodes.REDUCE)
+  }
 
-  def variance: Array[Byte] = SerDe.serializeDoubleVector(summary.variance)
+  private[python] class DenseVectorPickler extends IObjectPickler {
+    def pickle(obj: Object, out: OutputStream, pickler: Pickler) = {
+      val vector: DenseVector = obj.asInstanceOf[DenseVector]
+      reduce_object(out, pickler, "pyspark.mllib.linalg", "DenseVector", vector.toArray)
--- End diff --

ditto: what is the cost of using class names?





[GitHub] spark pull request: [SPARK-3491] [WIP] [MLlib] [PySpark] use pickl...

2014-09-15 Thread mengxr
Github user mengxr commented on a diff in the pull request:

https://github.com/apache/spark/pull/2378#discussion_r17574578
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/api/python/PythonMLLibAPI.scala ---
@@ -60,18 +60,18 @@ class PythonMLLibAPI extends Serializable {
   def loadLabeledPoints(
   jsc: JavaSparkContext,
   path: String,
-  minPartitions: Int): JavaRDD[Array[Byte]] =
-MLUtils.loadLabeledPoints(jsc.sc, path, 
minPartitions).map(SerDe.serializeLabeledPoint)
+  minPartitions: Int): JavaRDD[LabeledPoint] =
+MLUtils.loadLabeledPoints(jsc.sc, path, minPartitions)
 
   private def trainRegressionModel(
       trainFunc: (RDD[LabeledPoint], Vector) => GeneralizedLinearModel,
-      dataBytesJRDD: JavaRDD[Array[Byte]],
+      dataJRDD: JavaRDD[Any],
       initialWeightsBA: Array[Byte]): java.util.LinkedList[java.lang.Object] = {
-    val data = dataBytesJRDD.rdd.map(SerDe.deserializeLabeledPoint)
-    val initialWeights = SerDe.deserializeDoubleVector(initialWeightsBA)
+    val data = dataJRDD.rdd.map(_.asInstanceOf[LabeledPoint])
--- End diff --

maybe we can try `dataJRDD.rdd.asInstanceOf[RDD[LabeledPoint]]` instead of 
`map`
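
The two forms being compared, side by side (a sketch assuming `dataJRDD: JavaRDD[Any]` as in the diff):

```scala
import org.apache.spark.api.java.JavaRDD
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.rdd.RDD

def asLabeledPoints(dataJRDD: JavaRDD[Any]): RDD[LabeledPoint] = {
  val viaMap  = dataJRDD.rdd.map(_.asInstanceOf[LabeledPoint])  // casts every record and adds a map step
  val viaCast = dataJRDD.rdd.asInstanceOf[RDD[LabeledPoint]]    // one unchecked cast of the RDD reference
  viaCast  // erasure makes the cast free; a wrong element would still fail later, at first use
}
```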





[GitHub] spark pull request: [SPARK-3491] [WIP] [MLlib] [PySpark] use pickl...

2014-09-15 Thread mengxr
Github user mengxr commented on a diff in the pull request:

https://github.com/apache/spark/pull/2378#discussion_r17574784
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/api/python/PythonMLLibAPI.scala ---
@@ -472,214 +452,140 @@ class PythonMLLibAPI extends Serializable {
   numRows: Long,
   numCols: Int,
   numPartitions: java.lang.Integer,
-  seed: java.lang.Long): JavaRDD[Array[Byte]] = {
+  seed: java.lang.Long): JavaRDD[Vector] = {
 val parts = getNumPartitionsOrDefault(numPartitions, jsc)
 val s = getSeedOrDefault(seed)
-RG.poissonVectorRDD(jsc.sc, mean, numRows, numCols, parts, 
s).map(SerDe.serializeDoubleVector)
+RG.poissonVectorRDD(jsc.sc, mean, numRows, numCols, parts, s)
   }
 
 }
 
 /**
- * :: DeveloperApi ::
- * MultivariateStatisticalSummary with Vector fields serialized.
+ * SerDe utility functions for PythonMLLibAPI.
  */
-@DeveloperApi
-class MultivariateStatisticalSummarySerialized(val summary: 
MultivariateStatisticalSummary)
-  extends Serializable {
+private[spark] object SerDe extends Serializable {
 
-  def mean: Array[Byte] = SerDe.serializeDoubleVector(summary.mean)
+  private[python] def reduce_object(out: OutputStream, pickler: Pickler,
+                                    module: String, name: String, objects: Object*) = {
+    out.write(Opcodes.GLOBAL)
+    out.write((module + "\n" + name + "\n").getBytes)
+    out.write(Opcodes.MARK)
+    objects.foreach(pickler.save(_))
+    out.write(Opcodes.TUPLE)
+    out.write(Opcodes.REDUCE)
+  }
 
-  def variance: Array[Byte] = SerDe.serializeDoubleVector(summary.variance)
+  private[python] class DenseVectorPickler extends IObjectPickler {
+    def pickle(obj: Object, out: OutputStream, pickler: Pickler) = {
+      val vector: DenseVector = obj.asInstanceOf[DenseVector]
+      reduce_object(out, pickler, "pyspark.mllib.linalg", "DenseVector", vector.toArray)
+    }
+  }
 
-  def count: Long = summary.count
+  private[python] class DenseVectorConstructor extends IObjectConstructor {
+    def construct(args: Array[Object]): Object = {
+      require(args.length == 1)
+      new DenseVector(args(0).asInstanceOf[Array[Double]])
+    }
+  }
+
+  private[python] class DenseMatrixPickler extends IObjectPickler {
+    def pickle(obj: Object, out: OutputStream, pickler: Pickler) = {
+      val m: DenseMatrix = obj.asInstanceOf[DenseMatrix]
+      reduce_object(out, pickler, "pyspark.mllib.linalg", "DenseMatrix",
+        m.numRows.asInstanceOf[Object], m.numCols.asInstanceOf[Object], m.values)
+    }
+  }
 
-  def numNonzeros: Array[Byte] = SerDe.serializeDoubleVector(summary.numNonzeros)
+  private[python] class DenseMatrixConstructor extends IObjectConstructor {
+    def construct(args: Array[Object]): Object = {
+      require(args.length == 3)
+      new DenseMatrix(args(0).asInstanceOf[Int], args(1).asInstanceOf[Int],
+        args(2).asInstanceOf[Array[Double]])
+    }
+  }
 
-  def max: Array[Byte] = SerDe.serializeDoubleVector(summary.max)
+  private[python] class SparseVectorPickler extends IObjectPickler {
+    def pickle(obj: Object, out: OutputStream, pickler: Pickler) = {
+      val v: SparseVector = obj.asInstanceOf[SparseVector]
+      reduce_object(out, pickler, "pyspark.mllib.linalg", "SparseVector",
+        v.size.asInstanceOf[Object], v.indices, v.values)
+    }
+  }
 
-  def min: Array[Byte] = SerDe.serializeDoubleVector(summary.min)
-}
+  private[python] class SparseVectorConstructor extends IObjectConstructor {
+    def construct(args: Array[Object]): Object = {
+      require(args.length == 3)
+      new SparseVector(args(0).asInstanceOf[Int], args(1).asInstanceOf[Array[Int]],
+        args(2).asInstanceOf[Array[Double]])
+    }
+  }
 
-/**
- * SerDe utility functions for PythonMLLibAPI.
- */
-private[spark] object SerDe extends Serializable {
-  private val DENSE_VECTOR_MAGIC: Byte = 1
-  private val SPARSE_VECTOR_MAGIC: Byte = 2
-  private val DENSE_MATRIX_MAGIC: Byte = 3
-  private val LABELED_POINT_MAGIC: Byte = 4
-
-  private[python] def deserializeDoubleVector(bytes: Array[Byte], offset: Int = 0): Vector = {
-    require(bytes.length - offset >= 5, "Byte array too short")
-    val magic = bytes(offset)
-    if (magic == DENSE_VECTOR_MAGIC) {
-      deserializeDenseVector(bytes, offset)
-    } else if (magic == SPARSE_VECTOR_MAGIC) {
-      deserializeSparseVector(bytes, offset)
-    } else {
-      throw new IllegalArgumentException("Magic " + magic + " is wrong.")
+  private[python] class LabeledPointPickler extends IObjectPickler {
+def 

[GitHub] spark pull request: [SPARK-3491] [WIP] [MLlib] [PySpark] use pickl...

2014-09-15 Thread mengxr
Github user mengxr commented on a diff in the pull request:

https://github.com/apache/spark/pull/2378#discussion_r17574827
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/api/python/PythonMLLibAPI.scala ---
@@ -472,214 +452,140 @@ class PythonMLLibAPI extends Serializable {
   numRows: Long,
   numCols: Int,
   numPartitions: java.lang.Integer,
-  seed: java.lang.Long): JavaRDD[Array[Byte]] = {
+  seed: java.lang.Long): JavaRDD[Vector] = {
 val parts = getNumPartitionsOrDefault(numPartitions, jsc)
 val s = getSeedOrDefault(seed)
-RG.poissonVectorRDD(jsc.sc, mean, numRows, numCols, parts, 
s).map(SerDe.serializeDoubleVector)
+RG.poissonVectorRDD(jsc.sc, mean, numRows, numCols, parts, s)
   }
 
 }
 
 /**
- * :: DeveloperApi ::
- * MultivariateStatisticalSummary with Vector fields serialized.
+ * SerDe utility functions for PythonMLLibAPI.
  */
-@DeveloperApi
-class MultivariateStatisticalSummarySerialized(val summary: 
MultivariateStatisticalSummary)
-  extends Serializable {
+private[spark] object SerDe extends Serializable {
 
-  def mean: Array[Byte] = SerDe.serializeDoubleVector(summary.mean)
+  private[python] def reduce_object(out: OutputStream, pickler: Pickler,
+                                    module: String, name: String, objects: Object*) = {
+    out.write(Opcodes.GLOBAL)
+    out.write((module + "\n" + name + "\n").getBytes)
+    out.write(Opcodes.MARK)
+    objects.foreach(pickler.save(_))
+    out.write(Opcodes.TUPLE)
+    out.write(Opcodes.REDUCE)
+  }
 
-  def variance: Array[Byte] = SerDe.serializeDoubleVector(summary.variance)
+  private[python] class DenseVectorPickler extends IObjectPickler {
+    def pickle(obj: Object, out: OutputStream, pickler: Pickler) = {
+      val vector: DenseVector = obj.asInstanceOf[DenseVector]
+      reduce_object(out, pickler, "pyspark.mllib.linalg", "DenseVector", vector.toArray)
+    }
+  }
 
-  def count: Long = summary.count
+  private[python] class DenseVectorConstructor extends IObjectConstructor {
+    def construct(args: Array[Object]): Object = {
+      require(args.length == 1)
+      new DenseVector(args(0).asInstanceOf[Array[Double]])
+    }
+  }
+
+  private[python] class DenseMatrixPickler extends IObjectPickler {
+    def pickle(obj: Object, out: OutputStream, pickler: Pickler) = {
+      val m: DenseMatrix = obj.asInstanceOf[DenseMatrix]
+      reduce_object(out, pickler, "pyspark.mllib.linalg", "DenseMatrix",
+        m.numRows.asInstanceOf[Object], m.numCols.asInstanceOf[Object], m.values)
+    }
+  }
 
-  def numNonzeros: Array[Byte] = SerDe.serializeDoubleVector(summary.numNonzeros)
+  private[python] class DenseMatrixConstructor extends IObjectConstructor {
+    def construct(args: Array[Object]): Object = {
+      require(args.length == 3)
+      new DenseMatrix(args(0).asInstanceOf[Int], args(1).asInstanceOf[Int],
+        args(2).asInstanceOf[Array[Double]])
+    }
+  }
 
-  def max: Array[Byte] = SerDe.serializeDoubleVector(summary.max)
+  private[python] class SparseVectorPickler extends IObjectPickler {
+    def pickle(obj: Object, out: OutputStream, pickler: Pickler) = {
+      val v: SparseVector = obj.asInstanceOf[SparseVector]
+      reduce_object(out, pickler, "pyspark.mllib.linalg", "SparseVector",
+        v.size.asInstanceOf[Object], v.indices, v.values)
+    }
+  }
 
-  def min: Array[Byte] = SerDe.serializeDoubleVector(summary.min)
-}
+  private[python] class SparseVectorConstructor extends IObjectConstructor {
+    def construct(args: Array[Object]): Object = {
+      require(args.length == 3)
+      new SparseVector(args(0).asInstanceOf[Int], args(1).asInstanceOf[Array[Int]],
+        args(2).asInstanceOf[Array[Double]])
+    }
+  }
 
-/**
- * SerDe utility functions for PythonMLLibAPI.
- */
-private[spark] object SerDe extends Serializable {
-  private val DENSE_VECTOR_MAGIC: Byte = 1
-  private val SPARSE_VECTOR_MAGIC: Byte = 2
-  private val DENSE_MATRIX_MAGIC: Byte = 3
-  private val LABELED_POINT_MAGIC: Byte = 4
-
-  private[python] def deserializeDoubleVector(bytes: Array[Byte], offset: Int = 0): Vector = {
-    require(bytes.length - offset >= 5, "Byte array too short")
-    val magic = bytes(offset)
-    if (magic == DENSE_VECTOR_MAGIC) {
-      deserializeDenseVector(bytes, offset)
-    } else if (magic == SPARSE_VECTOR_MAGIC) {
-      deserializeSparseVector(bytes, offset)
-    } else {
-      throw new IllegalArgumentException("Magic " + magic + " is wrong.")
+  private[python] class LabeledPointPickler extends IObjectPickler {
+def 

[GitHub] spark pull request: [SPARK-3491] [WIP] [MLlib] [PySpark] use pickl...

2014-09-15 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/2378#issuecomment-55676382
  
  [QA tests have 
finished](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/109/consoleFull)
 for   PR 2378 at commit 
[`722dd96`](https://github.com/apache/spark/commit/722dd96976d6a083b0ddb985ac6c518c791bce39).
 * This patch **fails** unit tests.
 * This patch **does not** merge cleanly!






[GitHub] spark pull request: [SPARK-3491] [WIP] [MLlib] [PySpark] use pickl...

2014-09-15 Thread JoshRosen
Github user JoshRosen commented on the pull request:

https://github.com/apache/spark/pull/2378#issuecomment-55685928
  
Just merged #2365 in case you want to rebase.





[GitHub] spark pull request: [SPARK-3491] [WIP] [MLlib] [PySpark] use pickl...

2014-09-14 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/2378#issuecomment-55518181
  
  [QA tests have 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/20304/consoleFull)
 for   PR 2378 at commit 
[`b02e34f`](https://github.com/apache/spark/commit/b02e34f53f8e0ba992477b20def58ddf356aa3f1).
 * This patch **fails** unit tests.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark pull request: [SPARK-3491] [WIP] [MLlib] [PySpark] use pickl...

2014-09-14 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/2378#issuecomment-55518966
  
  [QA tests have 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/20302/consoleFull)
 for   PR 2378 at commit 
[`4d7963e`](https://github.com/apache/spark/commit/4d7963ef91851fba280025b0778f0583fe819c55).
 * This patch **fails** unit tests.
 * This patch **does not** merge cleanly!






[GitHub] spark pull request: [SPARK-3491] [WIP] [MLlib] [PySpark] use pickl...

2014-09-14 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/2378#issuecomment-55531127
  
  [QA tests have 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/20307/consoleFull)
 for   PR 2378 at commit 
[`0ee1525`](https://github.com/apache/spark/commit/0ee1525054e6ab75ef4b456fe1de148ef866de4e).
 * This patch merges cleanly.





[GitHub] spark pull request: [SPARK-3491] [WIP] [MLlib] [PySpark] use pickl...

2014-09-14 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/2378#issuecomment-55532844
  
  [QA tests have 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/20307/consoleFull)
 for   PR 2378 at commit 
[`0ee1525`](https://github.com/apache/spark/commit/0ee1525054e6ab75ef4b456fe1de148ef866de4e).
 * This patch **fails** unit tests.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark pull request: [SPARK-3491] [WIP] [MLlib] [PySpark] use pickl...

2014-09-14 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/2378#issuecomment-0409
  
  [QA tests have 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/20320/consoleFull)
 for   PR 2378 at commit 
[`722dd96`](https://github.com/apache/spark/commit/722dd96976d6a083b0ddb985ac6c518c791bce39).
 * This patch merges cleanly.





[GitHub] spark pull request: [SPARK-3491] [WIP] [MLlib] [PySpark] use pickl...

2014-09-14 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/2378#issuecomment-2292
  
  [QA tests have 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/20320/consoleFull)
 for   PR 2378 at commit 
[`722dd96`](https://github.com/apache/spark/commit/722dd96976d6a083b0ddb985ac6c518c791bce39).
 * This patch **fails** unit tests.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark pull request: [SPARK-3491] [WIP] [MLlib] [PySpark] use pickl...

2014-09-13 Thread davies
GitHub user davies opened a pull request:

https://github.com/apache/spark/pull/2378

[SPARK-3491] [WIP] [MLlib] [PySpark] use pickle to serialize data in MLlib

Currently, we serialize the data between the JVM and Python case by case, by hand; this cannot scale to support the many APIs in MLlib.

This patch tries to address the problem by serializing the data with the pickle protocol, using the Pyrolite library to serialize/deserialize in the JVM. The pickle protocol can easily be extended to support customized classes.

In the first step, it supports Double, DenseVector, SparseVector, DenseMatrix, LabeledPoint, Rating, and Tuple2, and the recommendation module has been refactored to use this new protocol.

Later, I will refactor all the others to use this protocol.
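
For readers unfamiliar with Pyrolite, here is a minimal sketch of the registration pattern this approach builds on (the `mymodule`/`Point` names are illustrative, not the patch's code; the patch registers picklers for DenseVector, SparseVector, LabeledPoint, Rating, etc. in the same way):

```scala
import java.io.OutputStream
import net.razorvine.pickle.{IObjectPickler, Opcodes, Pickler}

class Point(val x: Double, val y: Double)

class PointPickler extends IObjectPickler {
  def pickle(obj: Object, out: OutputStream, pickler: Pickler): Unit = {
    val p = obj.asInstanceOf[Point]
    out.write(Opcodes.GLOBAL)                   // refer to the Python-side class
    out.write("mymodule\nPoint\n".getBytes)
    out.write(Opcodes.MARK)                     // push the constructor arguments
    pickler.save(p.x.asInstanceOf[Object])
    pickler.save(p.y.asInstanceOf[Object])
    out.write(Opcodes.TUPLE)
    out.write(Opcodes.REDUCE)                   // unpickles as mymodule.Point(x, y)
  }
}

object PickleSetup {
  // After registration, Pickler.dumps(new Point(1, 2)) emits a stream that Python
  // (or an Unpickler with a matching IObjectConstructor) turns back into a Point.
  def register(): Unit =
    Pickler.registerCustomPickler(classOf[Point], new PointPickler)
}
```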

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/davies/spark pickle_mllib

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/2378.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #2378


commit b30ef35ec7830cee08b4f8d692da26d98cac70e8
Author: Davies Liu <davies@gmail.com>
Date:   2014-09-13T07:18:33Z

use pickle to serialize data for mllib/recommendation







[GitHub] spark pull request: [SPARK-3491] [WIP] [MLlib] [PySpark] use pickl...

2014-09-13 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/2378#issuecomment-55484501
  
  [QA tests have 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/20259/consoleFull)
 for   PR 2378 at commit 
[`b30ef35`](https://github.com/apache/spark/commit/b30ef35ec7830cee08b4f8d692da26d98cac70e8).
 * This patch merges cleanly.





[GitHub] spark pull request: [SPARK-3491] [WIP] [MLlib] [PySpark] use pickl...

2014-09-13 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/2378#issuecomment-55485771
  
  [QA tests have 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/20259/consoleFull)
 for   PR 2378 at commit 
[`b30ef35`](https://github.com/apache/spark/commit/b30ef35ec7830cee08b4f8d692da26d98cac70e8).
 * This patch **passes** unit tests.
 * This patch merges cleanly.
 * This patch adds the following public classes _(experimental)_:
  * `class JavaSparkContext(val sc: SparkContext)`
  * `class Rating(object):`
  * `class JavaStreamingContext(val ssc: StreamingContext) extends 
Closeable `






[GitHub] spark pull request: [SPARK-3491] [WIP] [MLlib] [PySpark] use pickl...

2014-09-13 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/2378#issuecomment-55486079
  
  [QA tests have 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/20265/consoleFull)
 for   PR 2378 at commit 
[`f1544c4`](https://github.com/apache/spark/commit/f1544c47917836d7ef77353c467182cc5cc7addb).
 * This patch merges cleanly.





[GitHub] spark pull request: [SPARK-3491] [WIP] [MLlib] [PySpark] use pickl...

2014-09-13 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/2378#issuecomment-55486352
  
  [QA tests have 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/20267/consoleFull)
 for   PR 2378 at commit 
[`aa2287e`](https://github.com/apache/spark/commit/aa2287ec75998bbc5512a37d5415dc2115615533).
 * This patch merges cleanly.





[GitHub] spark pull request: [SPARK-3491] [WIP] [MLlib] [PySpark] use pickl...

2014-09-13 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/2378#issuecomment-55487217
  
  [QA tests have 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/20265/consoleFull)
 for   PR 2378 at commit 
[`f1544c4`](https://github.com/apache/spark/commit/f1544c47917836d7ef77353c467182cc5cc7addb).
 * This patch **fails** unit tests.
 * This patch merges cleanly.
 * This patch adds the following public classes _(experimental)_:
  * `class Vector(object):`
  * `class DenseVector(Vector):`
  * `class SparseVector(Vector):`
  * `class Matrix(object):`
  * `class DenseMatrix(Matrix):`
  * `class Rating(object):`






[GitHub] spark pull request: [SPARK-3491] [WIP] [MLlib] [PySpark] use pickl...

2014-09-13 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/2378#issuecomment-55487600
  
  [QA tests have 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/20267/consoleFull)
 for   PR 2378 at commit 
[`aa2287e`](https://github.com/apache/spark/commit/aa2287ec75998bbc5512a37d5415dc2115615533).
 * This patch **fails** unit tests.
 * This patch merges cleanly.
 * This patch adds the following public classes _(experimental)_:
  * `class JavaSparkContext(val sc: SparkContext)`
  * `class TaskCompletionListenerException(errorMessages: Seq[String]) 
extends Exception `
  * `class Dummy(object):`
  * `class Vector(object):`
  * `class DenseVector(Vector):`
  * `class SparseVector(Vector):`
  * `class Matrix(object):`
  * `class DenseMatrix(Matrix):`
  * `class Rating(object):`
  * `class JavaStreamingContext(val ssc: StreamingContext) extends 
Closeable `






[GitHub] spark pull request: [SPARK-3491] [WIP] [MLlib] [PySpark] use pickl...

2014-09-13 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/2378#issuecomment-55497580
  
  [QA tests have 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/20273/consoleFull)
 for   PR 2378 at commit 
[`8fe166a`](https://github.com/apache/spark/commit/8fe166a80c5162914a9393b9526bc14150ce2402).
 * This patch merges cleanly.





[GitHub] spark pull request: [SPARK-3491] [WIP] [MLlib] [PySpark] use pickl...

2014-09-13 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/2378#issuecomment-55499182
  
  [QA tests have 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/20273/consoleFull)
 for   PR 2378 at commit 
[`8fe166a`](https://github.com/apache/spark/commit/8fe166a80c5162914a9393b9526bc14150ce2402).
 * This patch **fails** unit tests.
 * This patch merges cleanly.
 * This patch adds the following public classes _(experimental)_:
  * `  class ArrayConstructor extends 
net.razorvine.pickle.objects.ArrayConstructor `
  * `class Vector(object):`
  * `class DenseVector(Vector):`
  * `class SparseVector(Vector):`
  * `class Matrix(object):`
  * `class DenseMatrix(Matrix):`
  * `class Rating(object):`


