[GitHub] spark pull request: [SPARK-3491] [WIP] [MLlib] [PySpark] use pickl...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/2378#issuecomment-55855022

[QA tests have finished](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/121/consoleFull) for PR 2378 at commit [`1fccf1a`](https://github.com/apache/spark/commit/1fccf1adc91e78a6c9e65f4ae14ba770a7eecd2c).

* This patch **passes** unit tests.
* This patch merges cleanly.
* This patch adds no public classes.

---

If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA.

---

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/2378#issuecomment-55855160

[QA tests have finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/20449/consoleFull) for PR 2378 at commit [`19d0967`](https://github.com/apache/spark/commit/19d096783b60e741173f48f2944d91f650616140).

* This patch **passes** unit tests.
* This patch merges cleanly.
* This patch adds no public classes.
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/2378#issuecomment-55699312

[QA tests have started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/20371/consoleFull) for PR 2378 at commit [`df19464`](https://github.com/apache/spark/commit/df194640e7dd72d9c6413ec2935889d422a41de2).

* This patch merges cleanly.
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/2378#issuecomment-55699370

[QA tests have finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/20371/consoleFull) for PR 2378 at commit [`df19464`](https://github.com/apache/spark/commit/df194640e7dd72d9c6413ec2935889d422a41de2).

* This patch **fails** unit tests.
* This patch merges cleanly.
* This patch adds no public classes.
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/2378#issuecomment-55706058

[QA tests have started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/20375/consoleFull) for PR 2378 at commit [`708dc02`](https://github.com/apache/spark/commit/708dc0288d23385ff3638fd07fdff9efc3ff8272).

* This patch merges cleanly.
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/2378#issuecomment-55707384

[QA tests have started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/20376/consoleFull) for PR 2378 at commit [`e1d1bfc`](https://github.com/apache/spark/commit/e1d1bfce4b464e6b14f649081155faf7c4d28471).

* This patch merges cleanly.
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/2378#issuecomment-55709985

[QA tests have started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/20377/consoleFull) for PR 2378 at commit [`44736d7`](https://github.com/apache/spark/commit/44736d7d849a523419006b565cf51fa732e8854c).

* This patch merges cleanly.
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/2378#issuecomment-55711136

[QA tests have finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/20375/consoleFull) for PR 2378 at commit [`708dc02`](https://github.com/apache/spark/commit/708dc0288d23385ff3638fd07fdff9efc3ff8272).

* This patch **fails** unit tests.
* This patch merges cleanly.
* This patch adds no public classes.
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/2378#issuecomment-55712664

[QA tests have finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/20376/consoleFull) for PR 2378 at commit [`e1d1bfc`](https://github.com/apache/spark/commit/e1d1bfce4b464e6b14f649081155faf7c4d28471).

* This patch **fails** unit tests.
* This patch merges cleanly.
* This patch adds no public classes.
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/2378#issuecomment-55716038

[QA tests have finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/20377/consoleFull) for PR 2378 at commit [`44736d7`](https://github.com/apache/spark/commit/44736d7d849a523419006b565cf51fa732e8854c).

* This patch **fails** unit tests.
* This patch merges cleanly.
* This patch adds no public classes.
Github user mengxr commented on the pull request: https://github.com/apache/spark/pull/2378#issuecomment-55795901

@davies A couple of Python tests failed with this change. Could you fix them?
Github user mengxr commented on the pull request: https://github.com/apache/spark/pull/2378#issuecomment-55795929

test this please
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/2378#issuecomment-55807054

[QA tests have started](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/117/consoleFull) for PR 2378 at commit [`9ceff73`](https://github.com/apache/spark/commit/9ceff7360427e9b36d7151c5f296d0ce199610dc).

* This patch merges cleanly.
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/2378#discussion_r17630886

--- Diff: core/src/main/scala/org/apache/spark/api/python/PythonRDD.scala ---

```diff
@@ -775,17 +775,38 @@ private[spark] object PythonRDD extends Logging {
     }.toJavaRDD()
   }
 
+  private class AutoBatchedPickler(iter: Iterator[Any]) extends Iterator[Array[Byte]] {
+    private val pickle = new Pickler()
+    private var batch = 1
+    private val buffer = new mutable.ArrayBuffer[Any]
+
+    override def hasNext(): Boolean = iter.hasNext
+
+    override def next(): Array[Byte] = {
+      while (iter.hasNext && buffer.length < batch) {
+        buffer += iter.next()
+      }
+      val bytes = pickle.dumps(buffer.toArray)
+      val size = bytes.length
+      // let 1M < size < 10M
+      if (size < 1024 * 100) {
+        batch = (1024 * 100) / size  // fast grow
```

--- End diff --

If the first record is small, e.g., a SparseVector with a single nonzero, and the records that follow are large vectors, line 789 may cause memory problems. Does it give a significant performance gain? Under what circumstances?
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/2378#issuecomment-55815158

[QA tests have finished](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/117/consoleFull) for PR 2378 at commit [`9ceff73`](https://github.com/apache/spark/commit/9ceff7360427e9b36d7151c5f296d0ce199610dc).

* This patch **fails** unit tests.
* This patch merges cleanly.
* This patch adds no public classes.
Github user davies commented on a diff in the pull request: https://github.com/apache/spark/pull/2378#discussion_r17631575

--- Diff: core/src/main/scala/org/apache/spark/api/python/PythonRDD.scala ---

```diff
@@ -775,17 +775,38 @@ private[spark] object PythonRDD extends Logging {
     }.toJavaRDD()
   }
 
+  private class AutoBatchedPickler(iter: Iterator[Any]) extends Iterator[Array[Byte]] {
+    private val pickle = new Pickler()
+    private var batch = 1
+    private val buffer = new mutable.ArrayBuffer[Any]
+
+    override def hasNext(): Boolean = iter.hasNext
+
+    override def next(): Array[Byte] = {
+      while (iter.hasNext && buffer.length < batch) {
+        buffer += iter.next()
+      }
+      val bytes = pickle.dumps(buffer.toArray)
+      val size = bytes.length
+      // let 1M < size < 10M
+      if (size < 1024 * 100) {
+        batch = (1024 * 100) / size  // fast grow
```

--- End diff --

Good question. Without this fast path, `batch` may need to grow about 15 times before it becomes stable, but that is good and safer. I will remove this fast path.
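The batch-size adaptation under discussion can be sketched in Python. This is a hypothetical re-implementation for illustration, not Spark's actual code; `auto_batched_dumps` and its `target` parameter are made-up names. It keeps only the doubling/halving behavior that remains after the fast-grow path is removed:

```python
import pickle

def auto_batched_dumps(records, target=1024 * 1024):
    """Yield pickled batches of `records`, doubling the batch size while
    payloads stay under `target` bytes and halving it when a payload
    exceeds 10x that (mirroring the thresholds in the diff)."""
    batch = 1
    buffer = []
    for rec in records:
        buffer.append(rec)
        if len(buffer) < batch:
            continue
        data = pickle.dumps(buffer)
        yield data
        if len(data) < target:
            batch *= 2                    # payload small: grow the batch
        elif len(data) > target * 10:
            batch = max(batch // 2, 1)    # payload huge: shrink, never to 0
        buffer = []
    if buffer:
        yield pickle.dumps(buffer)        # flush the final partial batch
```

Starting from `batch = 1` and doubling each round, reaching a stable batch of tens of thousands of records indeed takes on the order of 15 iterations, which is the cost davies refers to.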
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/2378#discussion_r17632544

--- Diff: core/src/main/scala/org/apache/spark/api/python/PythonRDD.scala ---

```diff
@@ -775,17 +775,38 @@ private[spark] object PythonRDD extends Logging {
     }.toJavaRDD()
   }
 
+  private class AutoBatchedPickler(iter: Iterator[Any]) extends Iterator[Array[Byte]] {
+    private val pickle = new Pickler()
+    private var batch = 1
+    private val buffer = new mutable.ArrayBuffer[Any]
+
+    override def hasNext(): Boolean = iter.hasNext
+
+    override def next(): Array[Byte] = {
+      while (iter.hasNext && buffer.length < batch) {
+        buffer += iter.next()
+      }
+      val bytes = pickle.dumps(buffer.toArray)
+      val size = bytes.length
+      // let 1M < size < 10M
+      if (size < 1024 * 100) {
+        batch = (1024 * 100) / size  // fast grow
+      } else if (size < 1024 * 1024) {
+        batch *= 2
+      } else if (size > 1024 * 1024 * 10) {
+        batch /= 2
```

--- End diff --

If the first record is very large, `batch` will be 0.
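The hazard mengxr points out is plain integer division: if the very first pickled payload already exceeds the 10 MB ceiling while `batch` is still 1, halving truncates it to zero and the iterator would stop consuming records. A minimal sketch of the decision logic (the thresholds mirror the diff; `next_batch_size` and the clamping fix are hypothetical):

```python
def next_batch_size(batch, size):
    """Return the updated batch size given the current batch and the
    byte size of the last pickled payload (thresholds from the diff)."""
    if size < 1024 * 100:
        return (1024 * 100) // size   # "fast grow" path
    elif size < 1024 * 1024:
        return batch * 2              # under 1 MB: double
    elif size > 1024 * 1024 * 10:
        return batch // 2             # over 10 MB: halve -- 1 // 2 == 0!
    return batch                      # between 1 MB and 10 MB: keep

# first record is a 20 MB payload while batch is still 1:
assert next_batch_size(1, 20 * 1024 * 1024) == 0
# a defensive variant would clamp the result, e.g. max(batch // 2, 1)
```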
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/2378#issuecomment-5582

[QA tests have started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/20433/consoleFull) for PR 2378 at commit [`1fccf1a`](https://github.com/apache/spark/commit/1fccf1adc91e78a6c9e65f4ae14ba770a7eecd2c).

* This patch merges cleanly.
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/2378#issuecomment-55834191

[QA tests have finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/20433/consoleFull) for PR 2378 at commit [`1fccf1a`](https://github.com/apache/spark/commit/1fccf1adc91e78a6c9e65f4ae14ba770a7eecd2c).

* This patch **fails** unit tests.
* This patch merges cleanly.
* This patch adds no public classes.
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/2378#issuecomment-55850933

[QA tests have started](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/121/consoleFull) for PR 2378 at commit [`1fccf1a`](https://github.com/apache/spark/commit/1fccf1adc91e78a6c9e65f4ae14ba770a7eecd2c).

* This patch merges cleanly.
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/2378#issuecomment-55851024

[QA tests have started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/20449/consoleFull) for PR 2378 at commit [`19d0967`](https://github.com/apache/spark/commit/19d096783b60e741173f48f2944d91f650616140).

* This patch merges cleanly.
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/2378#issuecomment-55668739

[QA tests have started](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/109/consoleFull) for PR 2378 at commit [`722dd96`](https://github.com/apache/spark/commit/722dd96976d6a083b0ddb985ac6c518c791bce39).

* This patch **does not** merge cleanly!
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/2378#discussion_r17574390

--- Diff: mllib/src/main/scala/org/apache/spark/mllib/api/python/PythonMLLibAPI.scala ---

```diff
@@ -17,16 +17,18 @@
 package org.apache.spark.mllib.api.python
 
-import java.nio.{ByteBuffer, ByteOrder}
+import java.io.OutputStream
 
 import scala.collection.JavaConverters._
 
+import net.razorvine.pickle.{Pickler, Unpickler, IObjectConstructor, IObjectPickler, PickleException, Opcodes}
```

--- End diff --

use `_`
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/2378#discussion_r17574399

--- Diff: mllib/src/main/scala/org/apache/spark/mllib/api/python/PythonMLLibAPI.scala ---

```diff
@@ -472,214 +452,140 @@ class PythonMLLibAPI extends Serializable {
       numRows: Long,
       numCols: Int,
       numPartitions: java.lang.Integer,
-      seed: java.lang.Long): JavaRDD[Array[Byte]] = {
+      seed: java.lang.Long): JavaRDD[Vector] = {
     val parts = getNumPartitionsOrDefault(numPartitions, jsc)
     val s = getSeedOrDefault(seed)
-    RG.poissonVectorRDD(jsc.sc, mean, numRows, numCols, parts, s).map(SerDe.serializeDoubleVector)
+    RG.poissonVectorRDD(jsc.sc, mean, numRows, numCols, parts, s)
   }
 }
 
 /**
- * :: DeveloperApi ::
- * MultivariateStatisticalSummary with Vector fields serialized.
+ * SerDe utility functions for PythonMLLibAPI.
  */
-@DeveloperApi
-class MultivariateStatisticalSummarySerialized(val summary: MultivariateStatisticalSummary)
-  extends Serializable {
+private[spark] object SerDe extends Serializable {
 
-  def mean: Array[Byte] = SerDe.serializeDoubleVector(summary.mean)
+  private[python] def reduce_object(out: OutputStream, pickler: Pickler,
+      module: String, name: String, objects: Object*) = {
+    out.write(Opcodes.GLOBAL)
+    out.write((module + "\n" + name + "\n").getBytes)
```

--- End diff --

Does it increase the storage cost by a lot for small objects?
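The storage-cost question can be measured from the Python side. A pickle produced via the reduce protocol embeds the full module and class path (the GLOBAL opcode that `reduce_object` writes), which dominates the payload for tiny objects; but the pickler memoizes the class reference, so pickling many objects in one batch pays that cost roughly once. A small sketch, using a hypothetical stand-in class rather than the real `pyspark.mllib.linalg.DenseVector`:

```python
import pickle

class DenseVector:
    """Hypothetical stand-in for pyspark.mllib.linalg.DenseVector."""
    def __init__(self, values):
        self.values = values

    def __reduce__(self):
        # serializes as GLOBAL(module, name) + args + REDUCE,
        # analogous to the Scala reduce_object under review
        return (DenseVector, (self.values,))

    def __eq__(self, other):
        return self.values == other.values

v = DenseVector([1.0])
one = len(pickle.dumps(v))
# 100 objects in one pickle: the class path is memoized after the
# first occurrence, so the name overhead is paid only once
hundred = len(pickle.dumps([DenseVector([1.0]) for _ in range(100)]))
assert hundred < 100 * one
```

This is one reason batching records before pickling (as the AutoBatchedPickler discussion elsewhere in this thread proposes) matters: within a batch, the per-object cost of the class name largely disappears.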
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/2378#discussion_r17574385

--- Diff: core/src/main/scala/org/apache/spark/api/python/PythonRDD.scala ---

```diff
@@ -778,8 +778,8 @@ private[spark] object PythonRDD extends Logging {
   def javaToPython(jRDD: JavaRDD[Any]): JavaRDD[Array[Byte]] = {
     jRDD.rdd.mapPartitions { iter =>
       val pickle = new Pickler
-      iter.map { row =>
-        pickle.dumps(row)
+      iter.grouped(1024).map { rows =>
```

--- End diff --

Shall we divide groups based on the serialized size?
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/2378#discussion_r17574396

--- Diff: mllib/src/main/scala/org/apache/spark/mllib/api/python/PythonMLLibAPI.scala ---

```diff
@@ -472,214 +452,140 @@ class PythonMLLibAPI extends Serializable {
       numRows: Long,
       numCols: Int,
       numPartitions: java.lang.Integer,
-      seed: java.lang.Long): JavaRDD[Array[Byte]] = {
+      seed: java.lang.Long): JavaRDD[Vector] = {
     val parts = getNumPartitionsOrDefault(numPartitions, jsc)
     val s = getSeedOrDefault(seed)
-    RG.poissonVectorRDD(jsc.sc, mean, numRows, numCols, parts, s).map(SerDe.serializeDoubleVector)
+    RG.poissonVectorRDD(jsc.sc, mean, numRows, numCols, parts, s)
   }
 }
 
 /**
- * :: DeveloperApi ::
- * MultivariateStatisticalSummary with Vector fields serialized.
+ * SerDe utility functions for PythonMLLibAPI.
  */
-@DeveloperApi
-class MultivariateStatisticalSummarySerialized(val summary: MultivariateStatisticalSummary)
-  extends Serializable {
+private[spark] object SerDe extends Serializable {
 
-  def mean: Array[Byte] = SerDe.serializeDoubleVector(summary.mean)
+  private[python] def reduce_object(out: OutputStream, pickler: Pickler,
```

--- End diff --

use camelCase for method names
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/2378#discussion_r17574404

--- Diff: mllib/src/main/scala/org/apache/spark/mllib/api/python/PythonMLLibAPI.scala ---

```diff
@@ -472,214 +452,140 @@ class PythonMLLibAPI extends Serializable {
       numRows: Long,
       numCols: Int,
       numPartitions: java.lang.Integer,
-      seed: java.lang.Long): JavaRDD[Array[Byte]] = {
+      seed: java.lang.Long): JavaRDD[Vector] = {
     val parts = getNumPartitionsOrDefault(numPartitions, jsc)
     val s = getSeedOrDefault(seed)
-    RG.poissonVectorRDD(jsc.sc, mean, numRows, numCols, parts, s).map(SerDe.serializeDoubleVector)
+    RG.poissonVectorRDD(jsc.sc, mean, numRows, numCols, parts, s)
   }
 }
 
 /**
- * :: DeveloperApi ::
- * MultivariateStatisticalSummary with Vector fields serialized.
+ * SerDe utility functions for PythonMLLibAPI.
  */
-@DeveloperApi
-class MultivariateStatisticalSummarySerialized(val summary: MultivariateStatisticalSummary)
-  extends Serializable {
+private[spark] object SerDe extends Serializable {
 
-  def mean: Array[Byte] = SerDe.serializeDoubleVector(summary.mean)
+  private[python] def reduce_object(out: OutputStream, pickler: Pickler,
+      module: String, name: String, objects: Object*) = {
+    out.write(Opcodes.GLOBAL)
+    out.write((module + "\n" + name + "\n").getBytes)
+    out.write(Opcodes.MARK)
+    objects.foreach(pickler.save(_))
+    out.write(Opcodes.TUPLE)
+    out.write(Opcodes.REDUCE)
+  }
 
-  def variance: Array[Byte] = SerDe.serializeDoubleVector(summary.variance)
+  private[python] class DenseVectorPickler extends IObjectPickler {
+    def pickle(obj: Object, out: OutputStream, pickler: Pickler) = {
+      val vector: DenseVector = obj.asInstanceOf[DenseVector]
+      reduce_object(out, pickler, "pyspark.mllib.linalg", "DenseVector", vector.toArray)
```

--- End diff --

ditto: what is the cost of using class names?
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/2378#discussion_r17574578

--- Diff: mllib/src/main/scala/org/apache/spark/mllib/api/python/PythonMLLibAPI.scala ---

```diff
@@ -60,18 +60,18 @@ class PythonMLLibAPI extends Serializable {
   def loadLabeledPoints(
       jsc: JavaSparkContext,
       path: String,
-      minPartitions: Int): JavaRDD[Array[Byte]] =
-    MLUtils.loadLabeledPoints(jsc.sc, path, minPartitions).map(SerDe.serializeLabeledPoint)
+      minPartitions: Int): JavaRDD[LabeledPoint] =
+    MLUtils.loadLabeledPoints(jsc.sc, path, minPartitions)
 
   private def trainRegressionModel(
       trainFunc: (RDD[LabeledPoint], Vector) => GeneralizedLinearModel,
-      dataBytesJRDD: JavaRDD[Array[Byte]],
+      dataJRDD: JavaRDD[Any],
       initialWeightsBA: Array[Byte]): java.util.LinkedList[java.lang.Object] = {
-    val data = dataBytesJRDD.rdd.map(SerDe.deserializeLabeledPoint)
-    val initialWeights = SerDe.deserializeDoubleVector(initialWeightsBA)
+    val data = dataJRDD.rdd.map(_.asInstanceOf[LabeledPoint])
```

--- End diff --

maybe we can try `dataJRDD.rdd.asInstanceOf[RDD[LabeledPoint]]` instead of `map`
[GitHub] spark pull request: [SPARK-3491] [WIP] [MLlib] [PySpark] use pickl...
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/2378#discussion_r17574784
--- Diff: mllib/src/main/scala/org/apache/spark/mllib/api/python/PythonMLLibAPI.scala ---
@@ -472,214 +452,140 @@ class PythonMLLibAPI extends Serializable {
       numRows: Long,
       numCols: Int,
       numPartitions: java.lang.Integer,
-      seed: java.lang.Long): JavaRDD[Array[Byte]] = {
+      seed: java.lang.Long): JavaRDD[Vector] = {
     val parts = getNumPartitionsOrDefault(numPartitions, jsc)
     val s = getSeedOrDefault(seed)
-    RG.poissonVectorRDD(jsc.sc, mean, numRows, numCols, parts, s).map(SerDe.serializeDoubleVector)
+    RG.poissonVectorRDD(jsc.sc, mean, numRows, numCols, parts, s)
   }
 }
 
 /**
- * :: DeveloperApi ::
- * MultivariateStatisticalSummary with Vector fields serialized.
+ * SerDe utility functions for PythonMLLibAPI.
  */
-@DeveloperApi
-class MultivariateStatisticalSummarySerialized(val summary: MultivariateStatisticalSummary)
-  extends Serializable {
+private[spark] object SerDe extends Serializable {
 
-  def mean: Array[Byte] = SerDe.serializeDoubleVector(summary.mean)
+  private[python] def reduce_object(out: OutputStream, pickler: Pickler,
+      module: String, name: String, objects: Object*) = {
+    out.write(Opcodes.GLOBAL)
+    out.write((module + "\n" + name + "\n").getBytes)
+    out.write(Opcodes.MARK)
+    objects.foreach(pickler.save(_))
+    out.write(Opcodes.TUPLE)
+    out.write(Opcodes.REDUCE)
+  }
 
-  def variance: Array[Byte] = SerDe.serializeDoubleVector(summary.variance)
+  private[python] class DenseVectorPickler extends IObjectPickler {
+    def pickle(obj: Object, out: OutputStream, pickler: Pickler) = {
+      val vector: DenseVector = obj.asInstanceOf[DenseVector]
+      reduce_object(out, pickler, "pyspark.mllib.linalg", "DenseVector", vector.toArray)
+    }
+  }
 
-  def count: Long = summary.count
+  private[python] class DenseVectorConstructor extends IObjectConstructor {
+    def construct(args: Array[Object]): Object = {
+      require(args.length == 1)
+      new DenseVector(args(0).asInstanceOf[Array[Double]])
+    }
+  }
+
+  private[python] class DenseMatrixPickler extends IObjectPickler {
+    def pickle(obj: Object, out: OutputStream, pickler: Pickler) = {
+      val m: DenseMatrix = obj.asInstanceOf[DenseMatrix]
+      reduce_object(out, pickler, "pyspark.mllib.linalg", "DenseMatrix",
+        m.numRows.asInstanceOf[Object], m.numCols.asInstanceOf[Object], m.values)
+    }
+  }
 
-  def numNonzeros: Array[Byte] = SerDe.serializeDoubleVector(summary.numNonzeros)
+  private[python] class DenseMatrixConstructor extends IObjectConstructor {
+    def construct(args: Array[Object]): Object = {
+      require(args.length == 3)
+      new DenseMatrix(args(0).asInstanceOf[Int], args(1).asInstanceOf[Int],
+        args(2).asInstanceOf[Array[Double]])
+    }
+  }
 
-  def max: Array[Byte] = SerDe.serializeDoubleVector(summary.max)
+  private[python] class SparseVectorPickler extends IObjectPickler {
+    def pickle(obj: Object, out: OutputStream, pickler: Pickler) = {
+      val v: SparseVector = obj.asInstanceOf[SparseVector]
+      reduce_object(out, pickler, "pyspark.mllib.linalg", "SparseVector",
+        v.size.asInstanceOf[Object], v.indices, v.values)
+    }
+  }
 
-  def min: Array[Byte] = SerDe.serializeDoubleVector(summary.min)
-}
+  private[python] class SparseVectorConstructor extends IObjectConstructor {
+    def construct(args: Array[Object]): Object = {
+      require(args.length == 3)
+      new SparseVector(args(0).asInstanceOf[Int], args(1).asInstanceOf[Array[Int]],
+        args(2).asInstanceOf[Array[Double]])
+    }
+  }
 
-/**
- * SerDe utility functions for PythonMLLibAPI.
- */
-private[spark] object SerDe extends Serializable {
-  private val DENSE_VECTOR_MAGIC: Byte = 1
-  private val SPARSE_VECTOR_MAGIC: Byte = 2
-  private val DENSE_MATRIX_MAGIC: Byte = 3
-  private val LABELED_POINT_MAGIC: Byte = 4
-
-  private[python] def deserializeDoubleVector(bytes: Array[Byte], offset: Int = 0): Vector = {
-    require(bytes.length - offset >= 5, "Byte array too short")
-    val magic = bytes(offset)
-    if (magic == DENSE_VECTOR_MAGIC) {
-      deserializeDenseVector(bytes, offset)
-    } else if (magic == SPARSE_VECTOR_MAGIC) {
-      deserializeSparseVector(bytes, offset)
-    } else {
-      throw new IllegalArgumentException("Magic " + magic + " is wrong.")
+  private[python] class LabeledPointPickler extends IObjectPickler {
+    def
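The `reduce_object` helper in the diff writes a raw pickle opcode sequence: GLOBAL pushes the Python-side constructor, MARK plus the saved arguments are collapsed by TUPLE, and REDUCE calls the constructor with that tuple. The same stream can be assembled by hand in pure Python to see why it works; the `DenseVector` class below is only a minimal stand-in for `pyspark.mllib.linalg.DenseVector`, not the real implementation:

```python
import pickle

# Stand-in for pyspark.mllib.linalg.DenseVector (illustration only).
class DenseVector:
    def __init__(self, values):
        self.values = list(values)

# Assemble the opcode stream that reduce_object() emits on the JVM side:
#   GLOBAL module\nname\n  -> push the constructor onto the unpickler stack
#   MARK ... TUPLE         -> gather the pickled arguments into a tuple
#   REDUCE                 -> call constructor(*args)
mod = DenseVector.__module__.encode()
payload = (
    b"c" + mod + b"\nDenseVector\n"              # GLOBAL
    + b"("                                        # MARK
    + pickle.dumps([1.0, 2.0], protocol=0)[:-1]   # pickled argument, minus its STOP
    + b"t"                                        # TUPLE
    + b"R"                                        # REDUCE
    + b"."                                        # STOP
)

v = pickle.loads(payload)
print(type(v).__name__, v.values)  # -> DenseVector [1.0, 2.0]
```

On the JVM side, Pyrolite's `pickler.save(...)` plays the role of the inner `pickle.dumps` call here, serializing each argument in the wire format Python's unpickler expects.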
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/2378#discussion_r17574827
--- Diff: mllib/src/main/scala/org/apache/spark/mllib/api/python/PythonMLLibAPI.scala --- (same hunk as quoted above)
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/2378#issuecomment-55676382 [QA tests have finished](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/109/consoleFull) for PR 2378 at commit [`722dd96`](https://github.com/apache/spark/commit/722dd96976d6a083b0ddb985ac6c518c791bce39).
* This patch **fails** unit tests.
* This patch **does not** merge cleanly!
--- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Github user JoshRosen commented on the pull request: https://github.com/apache/spark/pull/2378#issuecomment-55685928 Just merged #2365 in case you want to rebase.
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/2378#issuecomment-55518181 [QA tests have finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/20304/consoleFull) for PR 2378 at commit [`b02e34f`](https://github.com/apache/spark/commit/b02e34f53f8e0ba992477b20def58ddf356aa3f1).
* This patch **fails** unit tests.
* This patch merges cleanly.
* This patch adds no public classes.
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/2378#issuecomment-55518966 [QA tests have finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/20302/consoleFull) for PR 2378 at commit [`4d7963e`](https://github.com/apache/spark/commit/4d7963ef91851fba280025b0778f0583fe819c55).
* This patch **fails** unit tests.
* This patch **does not** merge cleanly!
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/2378#issuecomment-55531127 [QA tests have started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/20307/consoleFull) for PR 2378 at commit [`0ee1525`](https://github.com/apache/spark/commit/0ee1525054e6ab75ef4b456fe1de148ef866de4e).
* This patch merges cleanly.
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/2378#issuecomment-55532844 [QA tests have finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/20307/consoleFull) for PR 2378 at commit [`0ee1525`](https://github.com/apache/spark/commit/0ee1525054e6ab75ef4b456fe1de148ef866de4e).
* This patch **fails** unit tests.
* This patch merges cleanly.
* This patch adds no public classes.
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/2378#issuecomment-0409 [QA tests have started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/20320/consoleFull) for PR 2378 at commit [`722dd96`](https://github.com/apache/spark/commit/722dd96976d6a083b0ddb985ac6c518c791bce39).
* This patch merges cleanly.
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/2378#issuecomment-2292 [QA tests have finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/20320/consoleFull) for PR 2378 at commit [`722dd96`](https://github.com/apache/spark/commit/722dd96976d6a083b0ddb985ac6c518c791bce39).
* This patch **fails** unit tests.
* This patch merges cleanly.
* This patch adds no public classes.
GitHub user davies opened a pull request: https://github.com/apache/spark/pull/2378 [SPARK-3491] [WIP] [MLlib] [PySpark] use pickle to serialize data in MLlib

Currently, we serialize the data passed between the JVM and Python case by case, by hand; this cannot scale to cover the many APIs in MLlib. This patch addresses the problem by serializing the data with the pickle protocol, using the Pyrolite library to serialize/deserialize in the JVM. The pickle protocol can be easily extended to support custom classes. As a first step, it supports Double, DenseVector, SparseVector, DenseMatrix, LabeledPoint, Rating, and Tuple2, and the recommendation module has been refactored to use the new protocol. The other modules will be refactored to use it later.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/davies/spark pickle_mllib

Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/2378.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #2378

commit b30ef35ec7830cee08b4f8d692da26d98cac70e8
Author: Davies Liu davies@gmail.com
Date: 2014-09-13T07:18:33Z
use pickle to serialize data for mllib/recommendation
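On the Python side, extending pickle for a custom class comes down to giving it a (constructor, args) recipe, which is exactly what the REDUCE opcode consumes. A minimal sketch of that mechanism, using a hypothetical `Rating` class standing in for `pyspark.mllib.recommendation.Rating`:

```python
import copyreg
import pickle

# Hypothetical stand-in for pyspark.mllib.recommendation.Rating.
class Rating:
    def __init__(self, user, product, rating):
        self.user, self.product, self.rating = user, product, rating

def _reduce_rating(r):
    # Tell pickle how to rebuild a Rating: constructor plus argument tuple,
    # the same (callable, args) shape the JVM-side picklers emit via REDUCE.
    return Rating, (r.user, r.product, r.rating)

copyreg.pickle(Rating, _reduce_rating)

r2 = pickle.loads(pickle.dumps(Rating(1, 2, 5.0)))
print(r2.user, r2.product, r2.rating)  # -> 1 2 5.0
```

Because both sides agree on this (constructor, args) convention, an object pickled by the JVM (via Pyrolite) and one pickled by Python are interchangeable on the wire.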
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/2378#issuecomment-55484501 [QA tests have started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/20259/consoleFull) for PR 2378 at commit [`b30ef35`](https://github.com/apache/spark/commit/b30ef35ec7830cee08b4f8d692da26d98cac70e8).
* This patch merges cleanly.
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/2378#issuecomment-55485771 [QA tests have finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/20259/consoleFull) for PR 2378 at commit [`b30ef35`](https://github.com/apache/spark/commit/b30ef35ec7830cee08b4f8d692da26d98cac70e8).
* This patch **passes** unit tests.
* This patch merges cleanly.
* This patch adds the following public classes _(experimental)_:
  * `class JavaSparkContext(val sc: SparkContext)`
  * `class Rating(object):`
  * `class JavaStreamingContext(val ssc: StreamingContext) extends Closeable `
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/2378#issuecomment-55486079 [QA tests have started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/20265/consoleFull) for PR 2378 at commit [`f1544c4`](https://github.com/apache/spark/commit/f1544c47917836d7ef77353c467182cc5cc7addb).
* This patch merges cleanly.
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/2378#issuecomment-55486352 [QA tests have started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/20267/consoleFull) for PR 2378 at commit [`aa2287e`](https://github.com/apache/spark/commit/aa2287ec75998bbc5512a37d5415dc2115615533).
* This patch merges cleanly.
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/2378#issuecomment-55487217 [QA tests have finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/20265/consoleFull) for PR 2378 at commit [`f1544c4`](https://github.com/apache/spark/commit/f1544c47917836d7ef77353c467182cc5cc7addb).
* This patch **fails** unit tests.
* This patch merges cleanly.
* This patch adds the following public classes _(experimental)_:
  * `class Vector(object):`
  * `class DenseVector(Vector):`
  * `class SparseVector(Vector):`
  * `class Matrix(object):`
  * `class DenseMatrix(Matrix):`
  * `class Rating(object):`
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/2378#issuecomment-55487600 [QA tests have finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/20267/consoleFull) for PR 2378 at commit [`aa2287e`](https://github.com/apache/spark/commit/aa2287ec75998bbc5512a37d5415dc2115615533).
* This patch **fails** unit tests.
* This patch merges cleanly.
* This patch adds the following public classes _(experimental)_:
  * `class JavaSparkContext(val sc: SparkContext)`
  * `class TaskCompletionListenerException(errorMessages: Seq[String]) extends Exception `
  * `class Dummy(object):`
  * `class Vector(object):`
  * `class DenseVector(Vector):`
  * `class SparseVector(Vector):`
  * `class Matrix(object):`
  * `class DenseMatrix(Matrix):`
  * `class Rating(object):`
  * `class JavaStreamingContext(val ssc: StreamingContext) extends Closeable `
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/2378#issuecomment-55497580 [QA tests have started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/20273/consoleFull) for PR 2378 at commit [`8fe166a`](https://github.com/apache/spark/commit/8fe166a80c5162914a9393b9526bc14150ce2402).
* This patch merges cleanly.
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/2378#issuecomment-55499182 [QA tests have finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/20273/consoleFull) for PR 2378 at commit [`8fe166a`](https://github.com/apache/spark/commit/8fe166a80c5162914a9393b9526bc14150ce2402).
* This patch **fails** unit tests.
* This patch merges cleanly.
* This patch adds the following public classes _(experimental)_:
  * ` class ArrayConstructor extends net.razorvine.pickle.objects.ArrayConstructor `
  * `class Vector(object):`
  * `class DenseVector(Vector):`
  * `class SparseVector(Vector):`
  * `class Matrix(object):`
  * `class DenseMatrix(Matrix):`
  * `class Rating(object):`
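The `ArrayConstructor` in the class list above extends Pyrolite's `net.razorvine.pickle.objects.ArrayConstructor`, which is what unpickles Python `array.array` payloads on the JVM. The point of using `array('d', ...)` rather than a plain list is that the doubles travel as a compact typed buffer. On the Python side the round trip is simply:

```python
import array
import pickle

# A typed array of doubles pickles as one compact buffer of machine values,
# not as a sequence of individually pickled Python floats.
a = array.array("d", [1.0, 2.0, 3.0])
b = pickle.loads(pickle.dumps(a, protocol=2))
print(b.typecode, list(b))  # -> d [1.0, 2.0, 3.0]
```

On the JVM side, a subclassed ArrayConstructor can map such a payload straight onto a primitive `double[]`, which is why the typed-array form is the efficient choice for vector data.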