Github user davies commented on the pull request:
https://github.com/apache/spark/pull/2362#issuecomment-55297787
Can you do some benchmarks to show the difference?
I doubt that caching the serialized data will be better than caching the
original objects; the former can
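For reference, a minimal sketch of the kind of benchmark being requested, assuming the 1.x-era PySpark API (where StorageLevel.MEMORY_ONLY_SER still exists); the data shape is an illustrative stand-in, not from the PR:

    import time
    from pyspark import SparkContext, StorageLevel

    sc = SparkContext("local[*]", "cache-benchmark")
    rdd = sc.parallelize(range(1000000)).map(lambda i: [float(i)] * 10)

    def timed_count(rdd, level):
        rdd.persist(level)
        rdd.count()                  # first pass materializes the cache
        start = time.time()
        rdd.count()                  # second pass reads back from the cache
        elapsed = time.time() - start
        rdd.unpersist()
        return elapsed

    # MEMORY_ONLY caches deserialized objects; MEMORY_ONLY_SER caches the
    # serialized bytes, trading CPU (deserialization) for less memory use
    # and lower GC pressure.
    for name in ("MEMORY_ONLY", "MEMORY_ONLY_SER"):
        print("%s: %.3fs" % (name, timed_count(rdd, getattr(StorageLevel, name))))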
Github user davies commented on the pull request:
https://github.com/apache/spark/pull/2347#issuecomment-55308192
Is it possible to add the cache for the RDD automatically instead of showing
a warning, if the cache is always helpful?
Github user davies commented on the pull request:
https://github.com/apache/spark/pull/2362#issuecomment-55308253
I think you could pick any algorithm that you think will show the most
difference.
As for the repeated warning, maybe it's not hard to make it show only once.
GitHub user davies opened a pull request:
https://github.com/apache/spark/pull/2365
[SPARK-2951] support unpickle array.array for Python 2.6
Pyrolite can not unpickle an array.array that was pickled by Python 2.6;
this patch fixes it by extending Pyrolite.
There is a bug in Pyrolite
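For context, a small Python sketch of the round trip this patch is about (illustrative only: it shows what gets pickled on the Python side, while the actual fix lives in the extended Pyrolite unpickler on the JVM side):

    import pickle
    import pickletools
    from array import array

    payload = pickle.dumps(array('d', [1.0, 2.0, 3.0]), protocol=2)
    # The opcode stream shows the reduce form an unpickler must support;
    # Python 2.6 emitted a different form than later versions, which is
    # what the extended Pyrolite has to recognize.
    pickletools.dis(payload)
    print(pickle.loads(payload))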
Github user davies commented on the pull request:
https://github.com/apache/spark/pull/2338#issuecomment-55345239
@andrewor14 @sryza done
Github user davies commented on a diff in the pull request:
https://github.com/apache/spark/pull/2336#discussion_r17457077
--- Diff:
core/src/main/scala/org/apache/spark/ui/jobs/JobProgressListener.scala ---
@@ -242,7 +242,8 @@ class JobProgressListener(conf: SparkConf) extends
Github user davies commented on a diff in the pull request:
https://github.com/apache/spark/pull/2336#discussion_r17457126
--- Diff: python/pyspark/shuffle.py ---
@@ -68,6 +68,11 @@ def _get_local_dirs(sub):
return [os.path.join(d, python, str(os.getpid()), sub) for d
GitHub user davies opened a pull request:
https://github.com/apache/spark/pull/2369
[SPARK-3500] [SQL] use JavaSchemaRDD as SchemaRDD._jschema_rdd
Currently, SchemaRDD._jschema_rdd is a SchemaRDD, so the Scala API (coalesce(),
repartition()) can not be called from Python easily
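A hypothetical sketch of one SchemaRDD method, showing why the wrapper type matters (simplified, not the patch's actual code): the Python side can only forward through Py4J to whatever the Java object held in _jschema_rdd exposes, so holding a JavaSchemaRDD makes the Java-friendly RDD methods reachable:

    def coalesce(self, numPartitions, shuffle=False):
        # Forwards through Py4J; this only works if _jschema_rdd is a
        # JavaSchemaRDD whose API exposes coalesce().
        rdd = self._jschema_rdd.coalesce(numPartitions, shuffle)
        return SchemaRDD(rdd, self.sql_ctx)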
Github user davies commented on the pull request:
https://github.com/apache/spark/pull/2370#issuecomment-55423644
LGTM
Github user davies commented on the pull request:
https://github.com/apache/spark/pull/2356#issuecomment-55425713
@mengxr I'm looking into this; could we block this for a few days until we
find a scalable way to do the serialization?
Github user davies commented on a diff in the pull request:
https://github.com/apache/spark/pull/2336#discussion_r17509949
--- Diff: python/pyspark/shuffle.py ---
@@ -68,6 +68,11 @@ def _get_local_dirs(sub):
return [os.path.join(d, python, str(os.getpid()), sub) for d
Github user davies commented on the pull request:
https://github.com/apache/spark/pull/2351#issuecomment-55482196
@JoshRosen I have addressed your comment, and also added docs for the configs
and tests.
I realized that the profiling result can also be shown interactively
Github user davies commented on the pull request:
https://github.com/apache/spark/pull/2292#issuecomment-55482419
@JoshRosen I have moved the docs to python/docs/ and rebased with master. I
think it's ready to merge; please take another look, thanks.
GitHub user davies opened a pull request:
https://github.com/apache/spark/pull/2378
[SPARK-3491] [WIP] [MLlib] [PySpark] use pickle to serialize data in MLlib
Currently, we serialize the data between the JVM and Python case by case,
manually; this cannot scale to support so many APIs
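A minimal sketch of the direction, assuming the 1.x pyspark.serializers API: a single pickle-based serializer can carry any MLlib type between Python and the JVM (which unpickles via Pyrolite), instead of a hand-written binary format per API:

    from pyspark.serializers import PickleSerializer

    ser = PickleSerializer()
    vector = [1.0, 0.0, 3.0]         # stand-in for an MLlib DenseVector
    payload = ser.dumps(vector)      # bytes the JVM side unpickles with Pyrolite
    assert ser.loads(payload) == vector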
Github user davies commented on a diff in the pull request:
https://github.com/apache/spark/pull/2336#discussion_r17516995
--- Diff: python/pyspark/worker.py ---
@@ -27,12 +27,11 @@
# copy_reg module.
from pyspark.accumulators import _accumulatorRegistry
from
Github user davies commented on a diff in the pull request:
https://github.com/apache/spark/pull/2383#discussion_r17525914
--- Diff: python/pyspark/tests.py ---
@@ -586,6 +586,17 @@ def test_repartitionAndSortWithinPartitions(self):
self.assertEquals(partitions[0], [(0
Github user davies commented on a diff in the pull request:
https://github.com/apache/spark/pull/2383#discussion_r17525918
--- Diff: python/pyspark/rdd.py ---
@@ -353,7 +353,7 @@ def func(iterator):
return ifilter(f, iterator)
return self.mapPartitions
Github user davies commented on the pull request:
https://github.com/apache/spark/pull/2365#issuecomment-1408
I have created https://issues.apache.org/jira/browse/SPARK-3524 to track
this.
Github user davies commented on a diff in the pull request:
https://github.com/apache/spark/pull/2385#discussion_r17526144
--- Diff: python/pyspark/context.py ---
@@ -99,8 +100,8 @@ def __init__(self, master=None, appName=None,
sparkHome=None, pyFiles=None
Github user davies commented on the pull request:
https://github.com/apache/spark/pull/2385#issuecomment-1894
LGTM, just one minor comment; it's not a must-have.
Github user davies commented on the pull request:
https://github.com/apache/spark/pull/2381#issuecomment-2011
It seems that there are some unrelated changes in it; could you rebase with
master?
Github user davies commented on the pull request:
https://github.com/apache/spark/pull/2362#issuecomment-2191
The benchmark result sounds reasonable, thanks for confirming it. Caching the
RDD after serialization will reduce the memory usage and GC pressure, but has
some CPU overhead
Github user davies commented on the pull request:
https://github.com/apache/spark/pull/2250#issuecomment-2366
Could you clean up the changes? It's confusing to see that a bunch of debugging
changes were left in.
Github user davies commented on a diff in the pull request:
https://github.com/apache/spark/pull/2383#discussion_r17553519
--- Diff: python/pyspark/rdd.py ---
@@ -353,7 +353,7 @@ def func(iterator):
return ifilter(f, iterator)
return self.mapPartitions
Github user davies commented on a diff in the pull request:
https://github.com/apache/spark/pull/2383#discussion_r17553641
--- Diff: python/pyspark/tests.py ---
@@ -586,6 +586,17 @@ def test_repartitionAndSortWithinPartitions(self):
self.assertEquals(partitions[0], [(0
Github user davies commented on the pull request:
https://github.com/apache/spark/pull/2397#issuecomment-55629364
ok to test.
Github user davies commented on a diff in the pull request:
https://github.com/apache/spark/pull/2385#discussion_r17557557
--- Diff: python/pyspark/traceback_utils.py ---
@@ -0,0 +1,80 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
Github user davies commented on the pull request:
https://github.com/apache/spark/pull/2313#issuecomment-55634239
@mattf The only benefit we get from numpy is performance in poisson(), so
the only thing we could lose is performance, not reproducibility, if it
falls back
Github user davies commented on the pull request:
https://github.com/apache/spark/pull/2313#issuecomment-55635550
If numpy is not installed on the driver, it should not show a warning, I think.
Github user davies commented on the pull request:
https://github.com/apache/spark/pull/2362#issuecomment-55637953
Would it make sense to preserve the portions of this patch that drop
caching for the NaiveBayes, ALS, and DecisionTree learners, which I do not
believe require external
Github user davies commented on the pull request:
https://github.com/apache/spark/pull/2313#issuecomment-55650152
Good question: we do not push logging from the worker to the driver, so it's
not easy to surface a warning from the worker. Maybe we could leave this (pushing
logging to the driver
Github user davies commented on the pull request:
https://github.com/apache/spark/pull/2313#issuecomment-55664261
Yes, it works for me.
GitHub user davies opened a pull request:
https://github.com/apache/spark/pull/2417
[SPARK-3554] use broadcast automatically for large closure
Py4J can not handle large strings efficiently, so we should use broadcast
for large closures automatically. (Broadcast uses the local filesystem
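A hedged sketch of the idea (the threshold and helper name are assumptions, not the patch's actual values): if the pickled command is large, ship it through a Broadcast instead of inline through Py4J:

    def _prepare_command(sc, pickled_command, threshold=1 << 20):
        # Large closures go through a broadcast (backed by the local
        # filesystem) because Py4J copies big strings inefficiently.
        if len(pickled_command) > threshold:
            broadcast = sc.broadcast(pickled_command)
            return b"", broadcast    # worker reads the command from the broadcast
        return pickled_command, None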
Github user davies commented on a diff in the pull request:
https://github.com/apache/spark/pull/2378#discussion_r17631575
--- Diff: core/src/main/scala/org/apache/spark/api/python/PythonRDD.scala
---
@@ -775,17 +775,38 @@ private[spark] object PythonRDD extends Logging
Github user davies commented on the pull request:
https://github.com/apache/spark/pull/2378#issuecomment-55916761
@mengxr it's ready to review now, thanks.
Github user davies commented on the pull request:
https://github.com/apache/spark/pull/2423#issuecomment-55929364
@OdinLin good catch! But _common.py will be retired after PR #2378, so maybe
it's not needed anymore.
Github user davies commented on the pull request:
https://github.com/apache/spark/pull/2412#issuecomment-55929756
@staple I also addressed this in #2378; could you help review this part?
Github user davies commented on a diff in the pull request:
https://github.com/apache/spark/pull/2378#discussion_r17686849
--- Diff: python/pyspark/mllib/recommendation.py ---
@@ -54,34 +64,51 @@ def __del__(self):
def predict(self, user, product):
return self
Github user davies commented on a diff in the pull request:
https://github.com/apache/spark/pull/2378#discussion_r17686887
--- Diff: python/pyspark/mllib/recommendation.py ---
@@ -54,34 +64,51 @@ def __del__(self):
def predict(self, user, product):
return self
Github user davies commented on the pull request:
https://github.com/apache/spark/pull/1977#issuecomment-55982725
@mateiz In this patch, the values in SameKey can only be iterated once; I
will fix this later.
Github user davies commented on the pull request:
https://github.com/apache/spark/pull/1977#issuecomment-56004283
@JoshRosen @mateiz I have addressed all your comments. The
IResulterIterator can be iterated multiple times now, and can also be pickled.
GitHub user davies opened a pull request:
https://github.com/apache/spark/pull/2448
[SPARK-3592] [SQL] [PySpark] support applySchema to RDD of Row
Fix the issue when applying applySchema() to an RDD of Row.
Also add a type mapping for BinaryType.
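A short usage example of the fixed path, assuming the 1.x pyspark.sql API and an existing SparkContext sc (the field names are illustrative; note that Row sorts its fields alphabetically, so the schema is listed in that order):

    from pyspark.sql import (SQLContext, Row, StructType, StructField,
                             StringType, BinaryType)

    sqlCtx = SQLContext(sc)
    rows = sc.parallelize([Row(blob=bytearray(b"\x00\x01"), name="alice")])
    schema = StructType([StructField("blob", BinaryType(), False),
                         StructField("name", StringType(), False)])
    srdd = sqlCtx.applySchema(rows, schema)   # now accepts an RDD of Row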
Github user davies commented on a diff in the pull request:
https://github.com/apache/spark/pull/2378#discussion_r17751963
--- Diff: python/pyspark/mllib/tests.py ---
@@ -198,41 +212,36 @@ def test_serialize(self):
lil[1, 0] = 1
lil[3, 0] = 2
Github user davies commented on a diff in the pull request:
https://github.com/apache/spark/pull/2378#discussion_r17752055
--- Diff: python/pyspark/mllib/linalg.py ---
@@ -257,10 +410,34 @@ def stringify(vector):
Vectors.stringify(Vectors.dense([0.0, 1.0
Github user davies commented on a diff in the pull request:
https://github.com/apache/spark/pull/2378#discussion_r17752465
--- Diff: python/pyspark/mllib/linalg.py ---
@@ -23,14 +23,148 @@
SciPy is available in their environment.
-import numpy
-from numpy
Github user davies commented on a diff in the pull request:
https://github.com/apache/spark/pull/2378#discussion_r17752588
--- Diff: mllib/src/main/scala/org/apache/spark/mllib/linalg/Matrices.scala
---
@@ -64,6 +64,12 @@ class DenseMatrix(val numRows: Int, val numCols: Int,
val
Github user davies commented on a diff in the pull request:
https://github.com/apache/spark/pull/2378#discussion_r17752597
--- Diff: python/pyspark/mllib/linalg.py ---
@@ -23,14 +23,148 @@
SciPy is available in their environment.
-import numpy
-from numpy
Github user davies commented on a diff in the pull request:
https://github.com/apache/spark/pull/2448#discussion_r17757102
--- Diff: python/pyspark/sql.py ---
@@ -1117,6 +1119,11 @@ def applySchema(self, rdd, schema):
# take the first few rows to verify schema
Github user davies commented on a diff in the pull request:
https://github.com/apache/spark/pull/2378#discussion_r17757207
--- Diff: python/pyspark/mllib/tests.py ---
@@ -198,41 +212,36 @@ def test_serialize(self):
lil[1, 0] = 1
lil[3, 0] = 2
Github user davies commented on a diff in the pull request:
https://github.com/apache/spark/pull/2378#discussion_r17757949
--- Diff:
mllib/src/main/scala/org/apache/spark/mllib/api/python/PythonMLLibAPI.scala ---
@@ -476,259 +436,167 @@ class PythonMLLibAPI extends Serializable
Github user davies commented on the pull request:
https://github.com/apache/spark/pull/2378#issuecomment-56109944
@jkbradley I should have addressed all your comments, or left a reply where I
have not figured out what to do yet; thanks for reviewing this huge PR.
Github user davies commented on the pull request:
https://github.com/apache/spark/pull/2378#issuecomment-56210084
@mengxr PickleSerializer does not compress data; there is a CompressedSerializer
that can do it with gzip (level 1). Compression can help for doubles in a small
range or repeated
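A small sketch of the serializer stacking being described, assuming the 1.x pyspark.serializers API (where CompressedSerializer wraps another serializer and compresses its output at level 1):

    from pyspark.serializers import PickleSerializer, CompressedSerializer

    plain = PickleSerializer()
    compressed = CompressedSerializer(plain)   # wraps another serializer

    data = [0.001 * i for i in range(1000)]    # doubles in a small range
    print(len(plain.dumps(data)), len(compressed.dumps(data)))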
Github user davies commented on the pull request:
https://github.com/apache/spark/pull/2378#issuecomment-56211052
@mengxr In this PR, I tried to avoid any changes other than
serialization; we could change the cache behavior or compression later.
It will be good to have
GitHub user davies opened a pull request:
https://github.com/apache/spark/pull/2492
[SPARK-3634] [PySpark] User's module should take precedence over system
modules
Python modules added through addPyFile should take precedence over system
modules.
This patch puts the path
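A minimal illustration of the precedence rule (not the patch itself): paths for user-added modules must go near the front of sys.path so they shadow system-installed modules with the same name:

    import sys

    def add_user_path(path):
        # insert(1, ...) keeps position 0 (the script's own directory)
        # intact while still winning over site-packages entries.
        if path not in sys.path:
            sys.path.insert(1, path)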
Github user davies commented on the pull request:
https://github.com/apache/spark/pull/2478#issuecomment-56552452
RDD._jrdd is very heavy for a PipelinedRDD, but getNumPartitions() could be
optimized for PipelinedRDD to avoid the creation of _jrdd (it could be
rdd.prev.getNumPartitions
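A hedged sketch of that optimization, as a PipelinedRDD method (simplified; the real class carries more state): a pipelined RDD keeps its parent's partitioning, so it can delegate instead of materializing the Java-side pipeline:

    def getNumPartitions(self):
        # Asking the parent avoids building self._jrdd (the wrapped Java
        # RDD), which is expensive to construct just to count partitions.
        return self.prev.getNumPartitions()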
Github user davies commented on the pull request:
https://github.com/apache/spark/pull/2492#issuecomment-56598619
I think it's fine to move on, and remove the comment about risk in the PR's
description.
Github user davies commented on the pull request:
https://github.com/apache/spark/pull/2412#issuecomment-56610305
@staple could you rebase this PR?
Github user davies commented on a diff in the pull request:
https://github.com/apache/spark/pull/2521#discussion_r17987196
--- Diff: examples/src/main/python/sql.py ---
@@ -0,0 +1,52 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
Github user davies commented on the pull request:
https://github.com/apache/spark/pull/2521#issuecomment-56710085
This example only demonstrates jsonFile(); it would be more powerful if it
included some usage of `inferSchema()` and `applySchema()`.
GitHub user davies opened a pull request:
https://github.com/apache/spark/pull/2522
[SPARK-3679] [PySpark] pickle the exact globals of functions
function.func_code.co_names has all the names used in the function,
including names of attributes, so it will pickle some unnecessary globals
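A small demonstration of why co_names over-approximates the globals a function needs (runs on Python 2.6+ via the __code__ alias; the bytecode-scanning fix is my reading of the patch, stated as an assumption):

    import os

    path = "an unrelated global"    # happens to share a name with an attribute

    def f(p):
        return os.path.join(p, "data")

    # co_names holds attribute names too, not just globals:
    print(f.__code__.co_names)      # ('os', 'path', 'join')
    # Picking globals by co_names would wrongly capture `path` above;
    # scanning the bytecode for LOAD_GLOBAL ops yields the exact set.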
Github user davies commented on a diff in the pull request:
https://github.com/apache/spark/pull/2521#discussion_r17992775
--- Diff: examples/src/main/python/sql.py ---
@@ -0,0 +1,52 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
Github user davies commented on a diff in the pull request:
https://github.com/apache/spark/pull/2492#discussion_r17993171
--- Diff: python/pyspark/context.py ---
@@ -183,10 +183,9 @@ def _do_init(self, master, appName, sparkHome,
pyFiles, environment, batchSize
Github user davies commented on the pull request:
https://github.com/apache/spark/pull/2522#issuecomment-56725936
@JoshRosen fixed the description
Github user davies commented on a diff in the pull request:
https://github.com/apache/spark/pull/2522#discussion_r17995491
--- Diff: python/pyspark/cloudpickle.py ---
@@ -304,16 +313,37 @@ def save_function_tuple(self, func, forced_imports):
write(pickle.REDUCE
GitHub user davies opened a pull request:
https://github.com/apache/spark/pull/2526
[SPARK-3681] [SQL] [PySpark] fix serialization of List and Map in SchemaRDD
Currently, the schema of objects in ArrayType or MapType is attached lazily;
this gives better performance but introduces
Github user davies commented on a diff in the pull request:
https://github.com/apache/spark/pull/2351#discussion_r18002615
--- Diff: python/pyspark/rdd.py ---
@@ -2081,8 +2085,44 @@ def _jrdd(self):
self.ctx.pythonExec
Github user davies commented on the pull request:
https://github.com/apache/spark/pull/2351#issuecomment-56752778
@JoshRosen I have addressed your comments; please take another look, thanks!
Github user davies commented on the pull request:
https://github.com/apache/spark/pull/2412#issuecomment-56886980
@staple thanks, I'd like to keep it as before for ALS; could you close this
PR (and maybe also the issue)?
Github user davies commented on a diff in the pull request:
https://github.com/apache/spark/pull/2356#discussion_r18061848
--- Diff:
mllib/src/main/scala/org/apache/spark/mllib/api/python/PythonMLLibAPI.scala ---
@@ -284,6 +285,80 @@ class PythonMLLibAPI extends Serializable
Github user davies commented on a diff in the pull request:
https://github.com/apache/spark/pull/2356#discussion_r18061937
--- Diff: python/pyspark/mllib/Word2Vec.py ---
@@ -0,0 +1,123 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
Github user davies commented on the pull request:
https://github.com/apache/spark/pull/2356#issuecomment-56888002
Could you add some tests?
Github user davies commented on the pull request:
https://github.com/apache/spark/pull/2351#issuecomment-56889367
@JoshRosen sorry for this mistake, fixed.
Github user davies commented on a diff in the pull request:
https://github.com/apache/spark/pull/2351#discussion_r18067596
--- Diff: docs/configuration.md ---
@@ -207,6 +207,25 @@ Apart from these, the following properties are also
available, and may be useful
</td>
</tr>
GitHub user davies opened a pull request:
https://github.com/apache/spark/pull/2538
[WIP] [SPARK-2377] Python API for Streaming
This patch brings a Python API for Streaming; WIP.
TODO:
updateStateByKey()
windowXXX()
This patch is based on work from @giwa
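For a flavor of the API this patch brings, a minimal network wordcount in the shape the example diffs below use (import paths follow the API as it later settled in pyspark.streaming; host and port are placeholders):

    from pyspark import SparkContext
    from pyspark.streaming import StreamingContext

    sc = SparkContext(appName="PythonStreamingWordCount")
    ssc = StreamingContext(sc, 1)                    # 1-second batches

    lines = ssc.socketTextStream("localhost", 9999)
    counts = (lines.flatMap(lambda line: line.split(" "))
                   .map(lambda word: (word, 1))
                   .reduceByKey(lambda a, b: a + b))
    counts.pprint()

    ssc.start()
    ssc.awaitTermination()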
Github user davies commented on a diff in the pull request:
https://github.com/apache/spark/pull/2351#discussion_r18071567
--- Diff: docs/configuration.md ---
@@ -207,6 +207,25 @@ Apart from these, the following properties are also
available, and may be useful
</td>
</tr>
Github user davies commented on a diff in the pull request:
https://github.com/apache/spark/pull/2538#discussion_r18073938
--- Diff: python/pyspark/streaming/jtime.py ---
@@ -0,0 +1,135 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
Github user davies commented on a diff in the pull request:
https://github.com/apache/spark/pull/2538#discussion_r18095580
--- Diff: examples/src/main/python/streaming/wordcount.py ---
@@ -0,0 +1,21 @@
+import sys
+
+from pyspark.streaming.context import
Github user davies commented on a diff in the pull request:
https://github.com/apache/spark/pull/2538#discussion_r18095572
--- Diff: examples/src/main/python/streaming/network_wordcount.py ---
@@ -0,0 +1,20 @@
+import sys
+
+from pyspark.streaming.context import
Github user davies commented on the pull request:
https://github.com/apache/spark/pull/2351#issuecomment-56988133
Thanks for reviewing this; your comments made it much better.
Github user davies commented on the pull request:
https://github.com/apache/spark/pull/2538#issuecomment-57020408
@giwa @JoshRosen It supports window functions and updateStateByKey() now,
so it should be feature-complete relative to the Java API; I'll move on to
adding more docs and tests.
We did
GitHub user davies opened a pull request:
https://github.com/apache/spark/pull/2556
[SPARK-3478] [PySpark] Profile the Python tasks
This patch adds profiling support for PySpark; it will show the profiling
results
before the driver exits. Here is one example
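A hedged usage sketch (the config key and show_profiles() match this patch as far as I can tell; treat them as assumptions):

    from pyspark import SparkConf, SparkContext

    conf = SparkConf().set("spark.python.profile", "true")
    sc = SparkContext(conf=conf)
    sc.parallelize(range(10000)).map(lambda x: x * x).count()
    sc.show_profiles()              # dump the accumulated cProfile stats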
Github user davies commented on the pull request:
https://github.com/apache/spark/pull/2538#issuecomment-57227515
@JoshRosen @tdas I think this PR is ready for review now, please take a
look.
@giwa I also saw these errors sometimes when I ran all the tests; it's
a bit
Github user davies commented on a diff in the pull request:
https://github.com/apache/spark/pull/2538#discussion_r18200031
--- Diff: bin/pyspark ---
@@ -87,11 +87,7 @@ export PYSPARK_SUBMIT_ARGS
if [[ -n $SPARK_TESTING ]]; then
unset YARN_CONF_DIR
unset
Github user davies commented on a diff in the pull request:
https://github.com/apache/spark/pull/2538#discussion_r18200124
--- Diff: python/pyspark/accumulators.py ---
@@ -256,3 +256,8 @@ def _start_update_server():
thread.daemon = True
thread.start
Github user davies commented on a diff in the pull request:
https://github.com/apache/spark/pull/2538#discussion_r18201664
--- Diff:
streaming/src/main/scala/org/apache/spark/streaming/StreamingContext.scala ---
@@ -413,7 +413,7 @@ class StreamingContext private[streaming
Github user davies commented on a diff in the pull request:
https://github.com/apache/spark/pull/2538#discussion_r18202866
--- Diff: python/pyspark/serializers.py ---
@@ -114,6 +114,9 @@ def __ne__(self, other):
def __repr__(self):
return %s object % self
Github user davies commented on a diff in the pull request:
https://github.com/apache/spark/pull/2538#discussion_r18202884
--- Diff: python/pyspark/streaming/context.py ---
@@ -0,0 +1,243 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
Github user davies commented on a diff in the pull request:
https://github.com/apache/spark/pull/2538#discussion_r18202894
--- Diff: python/pyspark/streaming/dstream.py ---
@@ -0,0 +1,633 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
Github user davies commented on a diff in the pull request:
https://github.com/apache/spark/pull/2538#discussion_r18202907
--- Diff: python/pyspark/streaming/dstream.py ---
@@ -0,0 +1,633 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
Github user davies commented on a diff in the pull request:
https://github.com/apache/spark/pull/2538#discussion_r18203052
--- Diff:
streaming/src/main/scala/org/apache/spark/streaming/api/python/PythonDStream.scala
---
@@ -0,0 +1,261 @@
+/*
+ * Licensed to the Apache
Github user davies commented on a diff in the pull request:
https://github.com/apache/spark/pull/2538#discussion_r18203252
--- Diff:
streaming/src/main/scala/org/apache/spark/streaming/api/python/PythonDStream.scala
---
@@ -0,0 +1,261 @@
+/*
+ * Licensed to the Apache
Github user davies commented on the pull request:
https://github.com/apache/spark/pull/2538#issuecomment-57279309
@giwa After fixing the problem of the increasing number of partitions (which
caused a performance problem), the tests run very stably now.
Github user davies commented on the pull request:
https://github.com/apache/spark/pull/2538#issuecomment-57279408
@tdas I should have addressed all your comments (or left a comment); please
take another look.
Github user davies commented on a diff in the pull request:
https://github.com/apache/spark/pull/2538#discussion_r18240816
--- Diff: examples/src/main/python/streaming/hdfs_wordcount.py ---
@@ -0,0 +1,21 @@
+import sys
+
+from pyspark import SparkContext
+from
Github user davies commented on a diff in the pull request:
https://github.com/apache/spark/pull/2538#discussion_r18240810
--- Diff: examples/src/main/python/streaming/network_wordcount.py ---
@@ -0,0 +1,20 @@
+import sys
+
+from pyspark import SparkContext
+from
GitHub user davies opened a pull request:
https://github.com/apache/spark/pull/2603
[SPARK-3749] [PySpark] fix bugs in broadcast large closure of RDD
1. broadcast is triggered unexpectedly
2. an fd is leaked in the JVM (also leaked in parallelize())
3. the broadcast is not unpersisted in the JVM
Github user davies commented on a diff in the pull request:
https://github.com/apache/spark/pull/2563#discussion_r18321283
--- Diff: python/pyspark/sql.py ---
@@ -62,6 +63,12 @@ def __eq__(self, other):
def __ne__(self, other):
return not self.__eq__(other
Github user davies commented on a diff in the pull request:
https://github.com/apache/spark/pull/2563#discussion_r18321352
--- Diff: python/pyspark/sql.py ---
@@ -205,6 +234,16 @@ def __str__(self):
return ArrayType(%s,%s) % (self.elementType
Github user davies commented on a diff in the pull request:
https://github.com/apache/spark/pull/2563#discussion_r18321520
--- Diff: python/pyspark/sql.py ---
@@ -385,50 +429,32 @@ def _parse_datatype_string(datatype_string):
check_datatype(complex_maptype)
True
GitHub user davies opened a pull request:
https://github.com/apache/spark/pull/2624
[SPARK-3762] clear reference of SparkEnv after stop
SparkEnv is cached in a ThreadLocal object, so after stopping and creating a
new SparkContext, the old SparkEnv is still used by some threads, and it will
trigger
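An illustration of the stale-cache pattern in Python terms (the actual fix is on the Scala side; this just mirrors the idea that a per-thread reference must be cleared on stop):

    import threading

    _env = threading.local()

    def set_env(env):
        _env.value = env

    def get_env():
        return getattr(_env, "value", None)

    def stop_env():
        # Without this, a thread that cached the old env before stop()
        # keeps resolving it after a new one is created.
        if hasattr(_env, "value"):
            del _env.value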