[GitHub] spark pull request: [SPARK-3488][MLLIB] Cache python RDDs after de...

2014-09-11 Thread davies
Github user davies commented on the pull request: https://github.com/apache/spark/pull/2362#issuecomment-55297787 Can you do some benchmarks to show the difference? I doubt that caching the serialized data will be better than caching the original objects; the former can

[GitHub] spark pull request: [SPARK-1484][MLLIB] Warn when running an itera...

2014-09-11 Thread davies
Github user davies commented on the pull request: https://github.com/apache/spark/pull/2347#issuecomment-55308192 Is it possible to add the cache for the RDD automatically instead of showing a warning, if the cache is always helpful? --- If your project is set up for it, you can reply

[GitHub] spark pull request: [SPARK-3488][MLLIB] Cache python RDDs after de...

2014-09-11 Thread davies
Github user davies commented on the pull request: https://github.com/apache/spark/pull/2362#issuecomment-55308253 I think you could pick any algorithm that you think will show the most difference. For the repeated warning, maybe it's not hard to make it show only once.

[GitHub] spark pull request: [SPARK-2951] support unpickle array.array for ...

2014-09-11 Thread davies
GitHub user davies opened a pull request: https://github.com/apache/spark/pull/2365 [SPARK-2951] support unpickle array.array for Python 2.6 Pyrolite cannot unpickle an array.array that was pickled by Python 2.6; this patch fixes it by extending Pyrolite. There is a bug in Pyrolite
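The payload in question can be sketched in plain Python (this only illustrates the pickle round trip of a double array; Pyrolite is the Java-side unpickler that choked on the opcodes Python 2.6 emitted for array.array):

```python
import array
import pickle

# A double array like the ones MLlib ships between Python and the JVM.
vec = array.array('d', [1.0, 2.0, 3.0])

# Protocol 2 is what PySpark used at the time; on Python 2.6 the opcodes
# emitted for array.array differ from 2.7, which is what tripped Pyrolite.
blob = pickle.dumps(vec, protocol=2)
restored = pickle.loads(blob)
assert restored == vec
```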

[GitHub] spark pull request: [SPARK-3465] fix task metrics aggregation in l...

2014-09-11 Thread davies
Github user davies commented on the pull request: https://github.com/apache/spark/pull/2338#issuecomment-55345239 @andrewor14 @sryza done

[GitHub] spark pull request: [SPARK-3463] [PySpark] aggregate and show spil...

2014-09-11 Thread davies
Github user davies commented on a diff in the pull request: https://github.com/apache/spark/pull/2336#discussion_r17457077 --- Diff: core/src/main/scala/org/apache/spark/ui/jobs/JobProgressListener.scala --- @@ -242,7 +242,8 @@ class JobProgressListener(conf: SparkConf) extends

[GitHub] spark pull request: [SPARK-3463] [PySpark] aggregate and show spil...

2014-09-11 Thread davies
Github user davies commented on a diff in the pull request: https://github.com/apache/spark/pull/2336#discussion_r17457126 --- Diff: python/pyspark/shuffle.py --- @@ -68,6 +68,11 @@ def _get_local_dirs(sub): return [os.path.join(d, python, str(os.getpid()), sub) for d

[GitHub] spark pull request: [SPARK-3500] [SQL] use JavaSchemaRDD as Schema...

2014-09-12 Thread davies
GitHub user davies opened a pull request: https://github.com/apache/spark/pull/2369 [SPARK-3500] [SQL] use JavaSchemaRDD as SchemaRDD._jschema_rdd Currently, SchemaRDD._jschema_rdd is a SchemaRDD, so the Scala API (coalesce(), repartition()) cannot be called from Python easily

[GitHub] spark pull request: [PySpark] Add blank line so that Python RDD.to...

2014-09-12 Thread davies
Github user davies commented on the pull request: https://github.com/apache/spark/pull/2370#issuecomment-55423644 LGTM

[GitHub] spark pull request: [SPARK-3486][MLlib][PySpark] PySpark support f...

2014-09-12 Thread davies
Github user davies commented on the pull request: https://github.com/apache/spark/pull/2356#issuecomment-55425713 @mengxr I'm looking into this; could we block this for a few days until we find a scalable way to do the serialization?

[GitHub] spark pull request: [SPARK-3463] [PySpark] aggregate and show spil...

2014-09-12 Thread davies
Github user davies commented on a diff in the pull request: https://github.com/apache/spark/pull/2336#discussion_r17509949 --- Diff: python/pyspark/shuffle.py --- @@ -68,6 +68,11 @@ def _get_local_dirs(sub): return [os.path.join(d, python, str(os.getpid()), sub) for d

[GitHub] spark pull request: [SPARK-3478] [PySpark] Profile the Python task...

2014-09-12 Thread davies
Github user davies commented on the pull request: https://github.com/apache/spark/pull/2351#issuecomment-55482196 @JoshRosen I have addressed your comment and also added docs for the configs and tests. I realized that the profiling result can also be shown interactively

[GitHub] spark pull request: [SPARK-3430] [PySpark] [Doc] generate PySpark ...

2014-09-12 Thread davies
Github user davies commented on the pull request: https://github.com/apache/spark/pull/2292#issuecomment-55482419 @JoshRosen I have moved the docs to python/docs/ and rebased with master. I think it's ready to merge; please take another look, thanks.

[GitHub] spark pull request: [SPARK-3491] [WIP] [MLlib] [PySpark] use pickl...

2014-09-13 Thread davies
GitHub user davies opened a pull request: https://github.com/apache/spark/pull/2378 [SPARK-3491] [WIP] [MLlib] [PySpark] use pickle to serialize data in MLlib Currently, we serialize the data between the JVM and Python case by case manually; this cannot scale to support so many APIs

[GitHub] spark pull request: [SPARK-3463] [PySpark] aggregate and show spil...

2014-09-13 Thread davies
Github user davies commented on a diff in the pull request: https://github.com/apache/spark/pull/2336#discussion_r17516995 --- Diff: python/pyspark/worker.py --- @@ -27,12 +27,11 @@ # copy_reg module. from pyspark.accumulators import _accumulatorRegistry from

[GitHub] spark pull request: [SPARK-3519] add distinct(n) to PySpark

2014-09-14 Thread davies
Github user davies commented on a diff in the pull request: https://github.com/apache/spark/pull/2383#discussion_r17525914 --- Diff: python/pyspark/tests.py --- @@ -586,6 +586,17 @@ def test_repartitionAndSortWithinPartitions(self): self.assertEquals(partitions[0], [(0

[GitHub] spark pull request: [SPARK-3519] add distinct(n) to PySpark

2014-09-14 Thread davies
Github user davies commented on a diff in the pull request: https://github.com/apache/spark/pull/2383#discussion_r17525918 --- Diff: python/pyspark/rdd.py --- @@ -353,7 +353,7 @@ def func(iterator): return ifilter(f, iterator) return self.mapPartitions

[GitHub] spark pull request: [SPARK-2951] [PySpark] support unpickle array....

2014-09-14 Thread davies
Github user davies commented on the pull request: https://github.com/apache/spark/pull/2365#issuecomment-1408 I have created https://issues.apache.org/jira/browse/SPARK-3524 to track this.

[GitHub] spark pull request: [SPARK-1087] Move python traceback utilities i...

2014-09-14 Thread davies
Github user davies commented on a diff in the pull request: https://github.com/apache/spark/pull/2385#discussion_r17526144 --- Diff: python/pyspark/context.py --- @@ -99,8 +100,8 @@ def __init__(self, master=None, appName=None, sparkHome=None, pyFiles=None

[GitHub] spark pull request: [SPARK-1087] Move python traceback utilities i...

2014-09-14 Thread davies
Github user davies commented on the pull request: https://github.com/apache/spark/pull/2385#issuecomment-1894 LGTM, just one minor comment; it's not a must-have.

[GitHub] spark pull request: [SPARK-2594][SQL] Add CACHE TABLE name AS SE...

2014-09-14 Thread davies
Github user davies commented on the pull request: https://github.com/apache/spark/pull/2381#issuecomment-2011 It seems that there are some unrelated changes in it; could you rebase with master?

[GitHub] spark pull request: [SPARK-3488][MLLIB] Cache python RDDs after de...

2014-09-14 Thread davies
Github user davies commented on the pull request: https://github.com/apache/spark/pull/2362#issuecomment-2191 The benchmark result sounds reasonable, thanks for confirming it. Caching the RDD after serialization will reduce memory usage and GC pressure, but has some CPU overhead
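The trade-off described here can be illustrated with plain pickle (a sketch, not PySpark's actual cache path): serialized data is one compact bytes object instead of thousands of live Python objects, but every access pays a deserialization cost.

```python
import pickle

# 10k small records, similar in spirit to a deserialized RDD partition.
records = [(i, float(i)) for i in range(10000)]

# Cached serialized: one bytes object, so less memory overhead and far
# fewer objects for the garbage collector to trace...
blob = pickle.dumps(records, protocol=2)

# ...but each use of the cached partition pays CPU to deserialize it.
restored = pickle.loads(blob)
assert restored == records
```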

[GitHub] spark pull request: [SPARK-3377] [Metrics] Metrics can be accident...

2014-09-14 Thread davies
Github user davies commented on the pull request: https://github.com/apache/spark/pull/2250#issuecomment-2366 Could you clean up the changes? It's confusing to see that a bunch of debugging changes were left in.

[GitHub] spark pull request: [SPARK-3519] add distinct(n) to PySpark

2014-09-15 Thread davies
Github user davies commented on a diff in the pull request: https://github.com/apache/spark/pull/2383#discussion_r17553519 --- Diff: python/pyspark/rdd.py --- @@ -353,7 +353,7 @@ def func(iterator): return ifilter(f, iterator) return self.mapPartitions

[GitHub] spark pull request: [SPARK-3519] add distinct(n) to PySpark

2014-09-15 Thread davies
Github user davies commented on a diff in the pull request: https://github.com/apache/spark/pull/2383#discussion_r17553641 --- Diff: python/pyspark/tests.py --- @@ -586,6 +586,17 @@ def test_repartitionAndSortWithinPartitions(self): self.assertEquals(partitions[0], [(0

[GitHub] spark pull request: [SPARK-2594][SQL] Add CACHE TABLE name AS SE...

2014-09-15 Thread davies
Github user davies commented on the pull request: https://github.com/apache/spark/pull/2397#issuecomment-55629364 ok to test.

[GitHub] spark pull request: [SPARK-1087] Move python traceback utilities i...

2014-09-15 Thread davies
Github user davies commented on a diff in the pull request: https://github.com/apache/spark/pull/2385#discussion_r17557557 --- Diff: python/pyspark/traceback_utils.py --- @@ -0,0 +1,80 @@ +# +# Licensed to the Apache Software Foundation (ASF) under one or more

[GitHub] spark pull request: [SPARK-927] detect numpy at time of use

2014-09-15 Thread davies
Github user davies commented on the pull request: https://github.com/apache/spark/pull/2313#issuecomment-55634239 @mattf The only benefit we get from numpy is the performance of poisson(), so the only thing we could lose should be performance, not reproducibility, if it falls back

[GitHub] spark pull request: [SPARK-927] detect numpy at time of use

2014-09-15 Thread davies
Github user davies commented on the pull request: https://github.com/apache/spark/pull/2313#issuecomment-55635550 If numpy is not installed on the driver, it should not show a warning, I think.

[GitHub] spark pull request: [SPARK-3488][MLLIB] Cache python RDDs after de...

2014-09-15 Thread davies
Github user davies commented on the pull request: https://github.com/apache/spark/pull/2362#issuecomment-55637953 Would it make sense to preserve the portions of this patch that drop caching for the NaiveBayes, ALS, and DecisionTree learners, which I do not believe require external

[GitHub] spark pull request: [SPARK-927] detect numpy at time of use

2014-09-15 Thread davies
Github user davies commented on the pull request: https://github.com/apache/spark/pull/2313#issuecomment-55650152 Good question; we do not push logging from the workers to the driver, so it's not easy to surface a warning from a worker. Maybe we could leave this (pushing logging to the driver

[GitHub] spark pull request: [SPARK-927] detect numpy at time of use

2014-09-15 Thread davies
Github user davies commented on the pull request: https://github.com/apache/spark/pull/2313#issuecomment-55664261 Yes, it works for me.

[GitHub] spark pull request: [SPARK-3554] use broadcast automatically for l...

2014-09-16 Thread davies
GitHub user davies opened a pull request: https://github.com/apache/spark/pull/2417 [SPARK-3554] use broadcast automatically for large closure Py4j cannot handle large strings efficiently, so we should use broadcast for large closures automatically. (Broadcast uses the local filesystem

[GitHub] spark pull request: [SPARK-3491] [WIP] [MLlib] [PySpark] use pickl...

2014-09-16 Thread davies
Github user davies commented on a diff in the pull request: https://github.com/apache/spark/pull/2378#discussion_r17631575 --- Diff: core/src/main/scala/org/apache/spark/api/python/PythonRDD.scala --- @@ -775,17 +775,38 @@ private[spark] object PythonRDD extends Logging

[GitHub] spark pull request: [SPARK-3491] [MLlib] [PySpark] use pickle to s...

2014-09-17 Thread davies
Github user davies commented on the pull request: https://github.com/apache/spark/pull/2378#issuecomment-55916761 @mengxr it's ready to review now, thanks.

[GitHub] spark pull request: [MLLIB] fix a unresolved reference variable 'n...

2014-09-17 Thread davies
Github user davies commented on the pull request: https://github.com/apache/spark/pull/2423#issuecomment-55929364 @OdinLin good catch! But _common.py will be retired after PR #2378, so maybe it's not needed anymore.

[GitHub] spark pull request: [SPARK-3550][MLLIB] Disable automatic rdd cach...

2014-09-17 Thread davies
Github user davies commented on the pull request: https://github.com/apache/spark/pull/2412#issuecomment-55929756 @staple I also addressed this in #2378; could you help review this part?

[GitHub] spark pull request: [SPARK-3491] [MLlib] [PySpark] use pickle to s...

2014-09-17 Thread davies
Github user davies commented on a diff in the pull request: https://github.com/apache/spark/pull/2378#discussion_r17686849 --- Diff: python/pyspark/mllib/recommendation.py --- @@ -54,34 +64,51 @@ def __del__(self): def predict(self, user, product): return self

[GitHub] spark pull request: [SPARK-3491] [MLlib] [PySpark] use pickle to s...

2014-09-17 Thread davies
Github user davies commented on a diff in the pull request: https://github.com/apache/spark/pull/2378#discussion_r17686887 --- Diff: python/pyspark/mllib/recommendation.py --- @@ -54,34 +64,51 @@ def __del__(self): def predict(self, user, product): return self

[GitHub] spark pull request: [SPARK-3074] [PySpark] support groupByKey() wi...

2014-09-17 Thread davies
Github user davies commented on the pull request: https://github.com/apache/spark/pull/1977#issuecomment-55982725 @mateiz In this patch, the values in SameKey can only be iterated once; I will fix this later.

[GitHub] spark pull request: [SPARK-3074] [PySpark] support groupByKey() wi...

2014-09-18 Thread davies
Github user davies commented on the pull request: https://github.com/apache/spark/pull/1977#issuecomment-56004283 @JoshRosen @mateiz I have addressed all your comments. The IResulterIterator can be iterated multiple times now, and can also be pickled.

[GitHub] spark pull request: [SPARK-3592] [SQL] [PySpark] support applySche...

2014-09-18 Thread davies
GitHub user davies opened a pull request: https://github.com/apache/spark/pull/2448 [SPARK-3592] [SQL] [PySpark] support applySchema to RDD of Row Fix the issue when applying applySchema() to an RDD of Row. Also add a type mapping for BinaryType. You can merge this pull request

[GitHub] spark pull request: [SPARK-3491] [MLlib] [PySpark] use pickle to s...

2014-09-18 Thread davies
Github user davies commented on a diff in the pull request: https://github.com/apache/spark/pull/2378#discussion_r17751963 --- Diff: python/pyspark/mllib/tests.py --- @@ -198,41 +212,36 @@ def test_serialize(self): lil[1, 0] = 1 lil[3, 0] = 2

[GitHub] spark pull request: [SPARK-3491] [MLlib] [PySpark] use pickle to s...

2014-09-18 Thread davies
Github user davies commented on a diff in the pull request: https://github.com/apache/spark/pull/2378#discussion_r17752055 --- Diff: python/pyspark/mllib/linalg.py --- @@ -257,10 +410,34 @@ def stringify(vector): Vectors.stringify(Vectors.dense([0.0, 1.0

[GitHub] spark pull request: [SPARK-3491] [MLlib] [PySpark] use pickle to s...

2014-09-18 Thread davies
Github user davies commented on a diff in the pull request: https://github.com/apache/spark/pull/2378#discussion_r17752465 --- Diff: python/pyspark/mllib/linalg.py --- @@ -23,14 +23,148 @@ SciPy is available in their environment. -import numpy -from numpy

[GitHub] spark pull request: [SPARK-3491] [MLlib] [PySpark] use pickle to s...

2014-09-18 Thread davies
Github user davies commented on a diff in the pull request: https://github.com/apache/spark/pull/2378#discussion_r17752588 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/linalg/Matrices.scala --- @@ -64,6 +64,12 @@ class DenseMatrix(val numRows: Int, val numCols: Int, val

[GitHub] spark pull request: [SPARK-3491] [MLlib] [PySpark] use pickle to s...

2014-09-18 Thread davies
Github user davies commented on a diff in the pull request: https://github.com/apache/spark/pull/2378#discussion_r17752597 --- Diff: python/pyspark/mllib/linalg.py --- @@ -23,14 +23,148 @@ SciPy is available in their environment. -import numpy -from numpy

[GitHub] spark pull request: [SPARK-3592] [SQL] [PySpark] support applySche...

2014-09-18 Thread davies
Github user davies commented on a diff in the pull request: https://github.com/apache/spark/pull/2448#discussion_r17757102 --- Diff: python/pyspark/sql.py --- @@ -1117,6 +1119,11 @@ def applySchema(self, rdd, schema): # take the first few rows to verify schema

[GitHub] spark pull request: [SPARK-3491] [MLlib] [PySpark] use pickle to s...

2014-09-18 Thread davies
Github user davies commented on a diff in the pull request: https://github.com/apache/spark/pull/2378#discussion_r17757207 --- Diff: python/pyspark/mllib/tests.py --- @@ -198,41 +212,36 @@ def test_serialize(self): lil[1, 0] = 1 lil[3, 0] = 2

[GitHub] spark pull request: [SPARK-3491] [MLlib] [PySpark] use pickle to s...

2014-09-18 Thread davies
Github user davies commented on a diff in the pull request: https://github.com/apache/spark/pull/2378#discussion_r17757949 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/api/python/PythonMLLibAPI.scala --- @@ -476,259 +436,167 @@ class PythonMLLibAPI extends Serializable

[GitHub] spark pull request: [SPARK-3491] [MLlib] [PySpark] use pickle to s...

2014-09-18 Thread davies
Github user davies commented on the pull request: https://github.com/apache/spark/pull/2378#issuecomment-56109944 @jkbradley I should have addressed all your comments, or left comments where I have not yet figured out what to do; thanks for reviewing this huge PR.

[GitHub] spark pull request: [SPARK-3491] [MLlib] [PySpark] use pickle to s...

2014-09-19 Thread davies
Github user davies commented on the pull request: https://github.com/apache/spark/pull/2378#issuecomment-56210084 @mengxr PickleSerializer does not compress data; there is a CompressSerializer that can do it using gzip (level 1). Compression can help for a small range of doubles or repeated
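A quick sketch of why compression helps for repeated values, using zlib at level 1 as a stand-in for the gzip-based CompressSerializer mentioned above (distinct random doubles would compress far less well):

```python
import pickle
import zlib

# Repeated values pickle into a highly redundant byte stream.
values = [3.14] * 10000
raw = pickle.dumps(values, protocol=2)

# Level 1 is the cheapest/fastest setting, matching the "gzip(level 1)"
# trade-off described: modest CPU cost for a large size reduction here.
compressed = zlib.compress(raw, 1)
assert len(compressed) < len(raw)
```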

[GitHub] spark pull request: [SPARK-3491] [MLlib] [PySpark] use pickle to s...

2014-09-19 Thread davies
Github user davies commented on the pull request: https://github.com/apache/spark/pull/2378#issuecomment-56211052 @mengxr In this PR, I just tried to avoid any changes other than serialization; we could change the cache behavior or compression later. It will be good to have

[GitHub] spark pull request: [SPARK-3634] [PySpark] User's module should ta...

2014-09-22 Thread davies
GitHub user davies opened a pull request: https://github.com/apache/spark/pull/2492 [SPARK-3634] [PySpark] User's module should take precedence over system modules Python modules added through addPyFile should take precedence over system modules. This patch puts the path
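The mechanism involved is ordinary sys.path precedence: whichever directory appears earlier wins when two modules share a name. A minimal sketch with a hypothetical module name:

```python
import os
import sys
import tempfile

# Create a "user" module in a temporary directory (hypothetical name).
d = tempfile.mkdtemp()
with open(os.path.join(d, "usermod.py"), "w") as f:
    f.write("ORIGIN = 'user'\n")

# Inserting near the front of sys.path (rather than appending) is what
# makes user-supplied files shadow same-named system modules.
sys.path.insert(0, d)

import usermod
assert usermod.ORIGIN == 'user'
```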

[GitHub] spark pull request: [SPARK-3580] add 'partitions' property to PySp...

2014-09-23 Thread davies
Github user davies commented on the pull request: https://github.com/apache/spark/pull/2478#issuecomment-56552452 RDD._jrdd is very heavy for a PipelinedRDD, but getNumPartitions() could be optimized for PipelinedRDD to avoid the creation of _jrdd (it could be rdd.prev.getNumPartitions
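A toy model of the suggested optimization (the class names here are illustrative, not PySpark's real implementation): a pipelined, map-like RDD preserves its parent's partitioning, so it can answer getNumPartitions() by delegating to the parent instead of building the heavy _jrdd.

```python
class ToyRDD:
    def __init__(self, num_partitions):
        self._num_partitions = num_partitions

    def getNumPartitions(self):
        return self._num_partitions


class ToyPipelinedRDD:
    def __init__(self, prev):
        self.prev = prev

    def getNumPartitions(self):
        # Map-like transformations keep the partition count, so delegate
        # to the parent: no JVM object needs to be materialized.
        return self.prev.getNumPartitions()


assert ToyPipelinedRDD(ToyRDD(4)).getNumPartitions() == 4
```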

[GitHub] spark pull request: [SPARK-3634] [PySpark] User's module should ta...

2014-09-23 Thread davies
Github user davies commented on the pull request: https://github.com/apache/spark/pull/2492#issuecomment-56598619 I think it's fine to move on, and to remove the comment about risk in the PR's description.

[GitHub] spark pull request: [SPARK-3550][MLLIB] Disable automatic rdd cach...

2014-09-23 Thread davies
Github user davies commented on the pull request: https://github.com/apache/spark/pull/2412#issuecomment-56610305 @staple could you rebase this PR?

[GitHub] spark pull request: Python SQL Example Code

2014-09-24 Thread davies
Github user davies commented on a diff in the pull request: https://github.com/apache/spark/pull/2521#discussion_r17987196 --- Diff: examples/src/main/python/sql.py --- @@ -0,0 +1,52 @@ +# +# Licensed to the Apache Software Foundation (ASF) under one or more

[GitHub] spark pull request: Python SQL Example Code

2014-09-24 Thread davies
Github user davies commented on the pull request: https://github.com/apache/spark/pull/2521#issuecomment-56710085 This example only demonstrates jsonFile(); it would be more powerful if it included some usage of `inferSchema()` and `applySchema()`.

[GitHub] spark pull request: [SPARK-3679] [PySpark] pickle the exact global...

2014-09-24 Thread davies
GitHub user davies opened a pull request: https://github.com/apache/spark/pull/2522 [SPARK-3679] [PySpark] pickle the exact globals of functions function.func_code.co_names has all the names used in the function, including the names of attributes. It will pickle some unnecessary globals
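The over-capture can be seen directly: co_names mixes genuine globals with attribute names, so filtering a function's globals by co_names would also pick up any global that happens to share a name with an attribute. A minimal sketch (using the Python 3 spelling `__code__` for the `func_code` mentioned above):

```python
import math

def f(x):
    # 'math' is a real global here, but 'sqrt' and 'real' are attribute
    # names; all three end up in co_names.
    return math.sqrt(x) + x.real

print(sorted(f.__code__.co_names))  # ['math', 'real', 'sqrt']
```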

[GitHub] spark pull request: Python SQL Example Code

2014-09-24 Thread davies
Github user davies commented on a diff in the pull request: https://github.com/apache/spark/pull/2521#discussion_r17992775 --- Diff: examples/src/main/python/sql.py --- @@ -0,0 +1,52 @@ +# +# Licensed to the Apache Software Foundation (ASF) under one or more

[GitHub] spark pull request: [SPARK-3634] [PySpark] User's module should ta...

2014-09-24 Thread davies
Github user davies commented on a diff in the pull request: https://github.com/apache/spark/pull/2492#discussion_r17993171 --- Diff: python/pyspark/context.py --- @@ -183,10 +183,9 @@ def _do_init(self, master, appName, sparkHome, pyFiles, environment, batchSize

[GitHub] spark pull request: [SPARK-3679] [PySpark] pickle the exact global...

2014-09-24 Thread davies
Github user davies commented on the pull request: https://github.com/apache/spark/pull/2522#issuecomment-56725936 @JoshRosen fixed the description

[GitHub] spark pull request: [SPARK-3679] [PySpark] pickle the exact global...

2014-09-24 Thread davies
Github user davies commented on a diff in the pull request: https://github.com/apache/spark/pull/2522#discussion_r17995491 --- Diff: python/pyspark/cloudpickle.py --- @@ -304,16 +313,37 @@ def save_function_tuple(self, func, forced_imports): write(pickle.REDUCE

[GitHub] spark pull request: [SPARK-3681] [SQL] [PySpark] fix serialization...

2014-09-24 Thread davies
GitHub user davies opened a pull request: https://github.com/apache/spark/pull/2526 [SPARK-3681] [SQL] [PySpark] fix serialization of List and Map in SchemaRDD Currently, the schema of objects in ArrayType or MapType is attached lazily; this gives better performance but introduces

[GitHub] spark pull request: [SPARK-3478] [PySpark] Profile the Python task...

2014-09-24 Thread davies
Github user davies commented on a diff in the pull request: https://github.com/apache/spark/pull/2351#discussion_r18002615 --- Diff: python/pyspark/rdd.py --- @@ -2081,8 +2085,44 @@ def _jrdd(self): self.ctx.pythonExec

[GitHub] spark pull request: [SPARK-3478] [PySpark] Profile the Python task...

2014-09-24 Thread davies
Github user davies commented on the pull request: https://github.com/apache/spark/pull/2351#issuecomment-56752778 @JoshRosen I have addressed your comments; please take another look, thanks!

[GitHub] spark pull request: [SPARK-3550][MLLIB] Disable automatic rdd cach...

2014-09-25 Thread davies
Github user davies commented on the pull request: https://github.com/apache/spark/pull/2412#issuecomment-56886980 @staple thanks, I'd like to keep it as before for ALS; could you close this PR (and maybe also the issue)?

[GitHub] spark pull request: [SPARK-3486][MLlib][PySpark] PySpark support f...

2014-09-25 Thread davies
Github user davies commented on a diff in the pull request: https://github.com/apache/spark/pull/2356#discussion_r18061848 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/api/python/PythonMLLibAPI.scala --- @@ -284,6 +285,80 @@ class PythonMLLibAPI extends Serializable

[GitHub] spark pull request: [SPARK-3486][MLlib][PySpark] PySpark support f...

2014-09-25 Thread davies
Github user davies commented on a diff in the pull request: https://github.com/apache/spark/pull/2356#discussion_r18061937 --- Diff: python/pyspark/mllib/Word2Vec.py --- @@ -0,0 +1,123 @@ +# +# Licensed to the Apache Software Foundation (ASF) under one or more

[GitHub] spark pull request: [SPARK-3486][MLlib][PySpark] PySpark support f...

2014-09-25 Thread davies
Github user davies commented on the pull request: https://github.com/apache/spark/pull/2356#issuecomment-56888002 Could you add some tests?

[GitHub] spark pull request: [SPARK-3478] [PySpark] Profile the Python task...

2014-09-25 Thread davies
Github user davies commented on the pull request: https://github.com/apache/spark/pull/2351#issuecomment-56889367 @JoshRosen sorry for this mistake, fixed.

[GitHub] spark pull request: [SPARK-3478] [PySpark] Profile the Python task...

2014-09-25 Thread davies
Github user davies commented on a diff in the pull request: https://github.com/apache/spark/pull/2351#discussion_r18067596 --- Diff: docs/configuration.md --- @@ -207,6 +207,25 @@ Apart from these, the following properties are also available, and may be useful /td /tr

[GitHub] spark pull request: [WIP] [SPARK-2377] Python API for Streaming

2014-09-25 Thread davies
GitHub user davies opened a pull request: https://github.com/apache/spark/pull/2538 [WIP] [SPARK-2377] Python API for Streaming This patch brings a Python API for Streaming (WIP). TODO: updateStateByKey(), windowXXX(). This patch is based on work from @giwa

[GitHub] spark pull request: [SPARK-3478] [PySpark] Profile the Python task...

2014-09-25 Thread davies
Github user davies commented on a diff in the pull request: https://github.com/apache/spark/pull/2351#discussion_r18071567 --- Diff: docs/configuration.md --- @@ -207,6 +207,25 @@ Apart from these, the following properties are also available, and may be useful /td /tr

[GitHub] spark pull request: [WIP] [SPARK-2377] Python API for Streaming

2014-09-25 Thread davies
Github user davies commented on a diff in the pull request: https://github.com/apache/spark/pull/2538#discussion_r18073938 --- Diff: python/pyspark/streaming/jtime.py --- @@ -0,0 +1,135 @@ +# +# Licensed to the Apache Software Foundation (ASF) under one or more

[GitHub] spark pull request: [WIP] [SPARK-2377] Python API for Streaming

2014-09-26 Thread davies
Github user davies commented on a diff in the pull request: https://github.com/apache/spark/pull/2538#discussion_r18095580 --- Diff: examples/src/main/python/streaming/wordcount.py --- @@ -0,0 +1,21 @@ +import sys + +from pyspark.streaming.context import

[GitHub] spark pull request: [WIP] [SPARK-2377] Python API for Streaming

2014-09-26 Thread davies
Github user davies commented on a diff in the pull request: https://github.com/apache/spark/pull/2538#discussion_r18095572 --- Diff: examples/src/main/python/streaming/network_wordcount.py --- @@ -0,0 +1,20 @@ +import sys + +from pyspark.streaming.context import

[GitHub] spark pull request: [SPARK-3478] [PySpark] Profile the Python task...

2014-09-26 Thread davies
Github user davies commented on the pull request: https://github.com/apache/spark/pull/2351#issuecomment-56988133 Thanks for reviewing this; your comments made it much better.

[GitHub] spark pull request: [WIP] [SPARK-2377] Python API for Streaming

2014-09-26 Thread davies
Github user davies commented on the pull request: https://github.com/apache/spark/pull/2538#issuecomment-57020408 @giwa @JoshRosen It supports window functions and updateStateByKey now, and should be feature-complete with the Java API; I'll move on to adding more docs and tests. We did

[GitHub] spark pull request: [SPARK-3478] [PySpark] Profile the Python task...

2014-09-26 Thread davies
GitHub user davies opened a pull request: https://github.com/apache/spark/pull/2556 [SPARK-3478] [PySpark] Profile the Python tasks This patch adds profiling support for PySpark; it will show the profiling results before the driver exits. Here is one example

[GitHub] spark pull request: [SPARK-2377] Python API for Streaming

2014-09-29 Thread davies
Github user davies commented on the pull request: https://github.com/apache/spark/pull/2538#issuecomment-57227515 @JoshRosen @tdas I think this PR is ready for review now; please take a look. @giwa I also saw these errors sometimes when I ran all the tests; it's a bit

[GitHub] spark pull request: [SPARK-2377] Python API for Streaming

2014-09-29 Thread davies
Github user davies commented on a diff in the pull request: https://github.com/apache/spark/pull/2538#discussion_r18200031 --- Diff: bin/pyspark --- @@ -87,11 +87,7 @@ export PYSPARK_SUBMIT_ARGS if [[ -n $SPARK_TESTING ]]; then unset YARN_CONF_DIR unset

[GitHub] spark pull request: [SPARK-2377] Python API for Streaming

2014-09-29 Thread davies
Github user davies commented on a diff in the pull request: https://github.com/apache/spark/pull/2538#discussion_r18200124 --- Diff: python/pyspark/accumulators.py --- @@ -256,3 +256,8 @@ def _start_update_server(): thread.daemon = True thread.start

[GitHub] spark pull request: [SPARK-2377] Python API for Streaming

2014-09-30 Thread davies
Github user davies commented on a diff in the pull request: https://github.com/apache/spark/pull/2538#discussion_r18201664 --- Diff: streaming/src/main/scala/org/apache/spark/streaming/StreamingContext.scala --- @@ -413,7 +413,7 @@ class StreamingContext private[streaming

[GitHub] spark pull request: [SPARK-2377] Python API for Streaming

2014-09-30 Thread davies
Github user davies commented on a diff in the pull request: https://github.com/apache/spark/pull/2538#discussion_r18202866 --- Diff: python/pyspark/serializers.py --- @@ -114,6 +114,9 @@ def __ne__(self, other): def __repr__(self): return %s object % self

[GitHub] spark pull request: [SPARK-2377] Python API for Streaming

2014-09-30 Thread davies
Github user davies commented on a diff in the pull request: https://github.com/apache/spark/pull/2538#discussion_r18202884 --- Diff: python/pyspark/streaming/context.py --- @@ -0,0 +1,243 @@ +# +# Licensed to the Apache Software Foundation (ASF) under one or more

[GitHub] spark pull request: [SPARK-2377] Python API for Streaming

2014-09-30 Thread davies
Github user davies commented on a diff in the pull request: https://github.com/apache/spark/pull/2538#discussion_r18202894 --- Diff: python/pyspark/streaming/dstream.py --- @@ -0,0 +1,633 @@ +# +# Licensed to the Apache Software Foundation (ASF) under one or more

[GitHub] spark pull request: [SPARK-2377] Python API for Streaming

2014-09-30 Thread davies
Github user davies commented on a diff in the pull request: https://github.com/apache/spark/pull/2538#discussion_r18202907 --- Diff: python/pyspark/streaming/dstream.py --- @@ -0,0 +1,633 @@ +# +# Licensed to the Apache Software Foundation (ASF) under one or more

[GitHub] spark pull request: [SPARK-2377] Python API for Streaming

2014-09-30 Thread davies
Github user davies commented on a diff in the pull request: https://github.com/apache/spark/pull/2538#discussion_r18203052 --- Diff: streaming/src/main/scala/org/apache/spark/streaming/api/python/PythonDStream.scala --- @@ -0,0 +1,261 @@ +/* + * Licensed to the Apache

[GitHub] spark pull request: [SPARK-2377] Python API for Streaming

2014-09-30 Thread davies
Github user davies commented on a diff in the pull request: https://github.com/apache/spark/pull/2538#discussion_r18203252 --- Diff: streaming/src/main/scala/org/apache/spark/streaming/api/python/PythonDStream.scala --- @@ -0,0 +1,261 @@ +/* + * Licensed to the Apache

[GitHub] spark pull request: [SPARK-2377] Python API for Streaming

2014-09-30 Thread davies
Github user davies commented on the pull request: https://github.com/apache/spark/pull/2538#issuecomment-57279309 @giwa After fixing the problem of increasing partitions (which would cause a performance problem), the tests now run very stably. --- If your project is set up for it, you

[GitHub] spark pull request: [SPARK-2377] Python API for Streaming

2014-09-30 Thread davies
Github user davies commented on the pull request: https://github.com/apache/spark/pull/2538#issuecomment-57279408 @tdas I should have addressed all your comments (or left a comment), please take another look.

[GitHub] spark pull request: [SPARK-2377] Python API for Streaming

2014-09-30 Thread davies
Github user davies commented on a diff in the pull request: https://github.com/apache/spark/pull/2538#discussion_r18240816 --- Diff: examples/src/main/python/streaming/hdfs_wordcount.py --- @@ -0,0 +1,21 @@ +import sys + +from pyspark import SparkContext +from

[GitHub] spark pull request: [SPARK-2377] Python API for Streaming

2014-09-30 Thread davies
Github user davies commented on a diff in the pull request: https://github.com/apache/spark/pull/2538#discussion_r18240810 --- Diff: examples/src/main/python/streaming/network_wordcount.py --- @@ -0,0 +1,20 @@ +import sys + +from pyspark import SparkContext +from

[GitHub] spark pull request: [SPARK-3749] [PySpark] fix bugs in broadcast l...

2014-09-30 Thread davies
GitHub user davies opened a pull request: https://github.com/apache/spark/pull/2603 [SPARK-3749] [PySpark] fix bugs in broadcast large closure of RDD
1. broadcast is triggered unexpectedly
2. a fd is leaked in the JVM (also leaked in parallelize())
3. the broadcast is not unpersisted in the JVM
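The first and third bugs above can be illustrated with a small sketch: broadcast a serialized closure only when it is genuinely large, and remember to unpersist it afterward. Everything here (`THRESHOLD`, `DummyBroadcast`, `ship_closure`) is invented for illustration and does not mirror PySpark's real internals.

```python
import pickle

THRESHOLD = 1 << 20  # hypothetical 1 MiB cutoff; PySpark's real value may differ

class DummyBroadcast:
    """Stand-in for a JVM-backed broadcast variable (illustration only)."""
    def __init__(self, blob):
        self.blob = blob
        self.persisted = True

    def unpersist(self):
        # Bug 3 in the PR: forgetting this call leaks the broadcast in the JVM.
        self.persisted = False

def ship_closure(payload):
    """Serialize a closure payload; broadcast only if it is actually large.

    Bug 1 in the PR was the opposite behavior: the broadcast path was
    triggered even for small closures."""
    blob = pickle.dumps(payload)
    if len(blob) > THRESHOLD:
        return DummyBroadcast(blob)
    return blob

small = ship_closure([1, 2, 3])
big = ship_closure(bytearray(2 * THRESHOLD))
print(isinstance(small, DummyBroadcast), isinstance(big, DummyBroadcast))
# False True
big.unpersist()
```

The design point is that broadcasting has a fixed cost per variable, so small closures should go inline with the task while large ones should be broadcast once and explicitly released.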

[GitHub] spark pull request: [SPARK-3713][SQL] Uses JSON to serialize DataT...

2014-10-01 Thread davies
Github user davies commented on a diff in the pull request: https://github.com/apache/spark/pull/2563#discussion_r18321283 --- Diff: python/pyspark/sql.py --- @@ -62,6 +63,12 @@ def __eq__(self, other): def __ne__(self, other): return not self.__eq__(other

[GitHub] spark pull request: [SPARK-3713][SQL] Uses JSON to serialize DataT...

2014-10-01 Thread davies
Github user davies commented on a diff in the pull request: https://github.com/apache/spark/pull/2563#discussion_r18321352 --- Diff: python/pyspark/sql.py --- @@ -205,6 +234,16 @@ def __str__(self): return ArrayType(%s,%s) % (self.elementType

[GitHub] spark pull request: [SPARK-3713][SQL] Uses JSON to serialize DataT...

2014-10-01 Thread davies
Github user davies commented on a diff in the pull request: https://github.com/apache/spark/pull/2563#discussion_r18321520 --- Diff: python/pyspark/sql.py --- @@ -385,50 +429,32 @@ def _parse_datatype_string(datatype_string): check_datatype(complex_maptype) True
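The motivation for SPARK-3713 is that a JSON representation round-trips unambiguously, unlike parsing the old `ArrayType(%s,%s)`-style string form. The sketch below shows the idea with plain `json`; the field layout is only loosely modeled on Spark SQL's schema format, and the exact keys Spark uses may differ.

```python
import json

def struct_field(name, dtype, nullable=True):
    # Hypothetical JSON shape for one struct field (illustration only).
    return {"name": name, "type": dtype, "nullable": nullable}

schema = {
    "type": "struct",
    "fields": [
        struct_field("age", "integer"),
        struct_field("name", "string"),
    ],
}

# Serializing and parsing back recovers the schema exactly, with no
# hand-written string parser involved.
restored = json.loads(json.dumps(schema))
print(restored == schema)  # True
```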

[GitHub] spark pull request: [SPARK-3762] clear reference of SparkEnv after...

2014-10-02 Thread davies
GitHub user davies opened a pull request: https://github.com/apache/spark/pull/2624 [SPARK-3762] clear reference of SparkEnv after stop SparkEnv is cached in a ThreadLocal object, so after stopping and creating a new SparkContext, the old SparkEnv is still used by some threads; it will trigger
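The real SparkEnv lives on the JVM side in Scala; the Python sketch below only models the stale-thread-local pattern the PR fixes. All names here (`SparkEnvSketch`, `_env_cache`) are invented for illustration.

```python
import threading

_env_cache = threading.local()

class SparkEnvSketch:
    """Toy model of SparkEnv's per-thread caching (illustration only)."""

    def __init__(self, name):
        self.name = name
        self.stopped = False

    @classmethod
    def set(cls, env):
        _env_cache.value = env

    @classmethod
    def get(cls):
        return getattr(_env_cache, "value", None)

    def stop(self):
        self.stopped = True
        # The fix sketched here: drop the cached reference on stop, so a
        # later get() cannot hand a dead environment to a new context.
        if SparkEnvSketch.get() is self:
            _env_cache.value = None

env = SparkEnvSketch("first")
SparkEnvSketch.set(env)
env.stop()
print(SparkEnvSketch.get())  # None
```

Without the clearing step in `stop()`, any thread that cached the old environment would keep using it after a new SparkContext was created, which is exactly the bug described above.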
