Github user mccheah commented on a diff in the pull request:
https://github.com/apache/spark/pull/21366#discussion_r195512430
--- Diff:
resource-managers/kubernetes/core/src/main/scala/org/apache/spark/scheduler/cluster/k8s/ExecutorPodsAllocator.scala
---
@@ -0,0 +1,141
Github user mccheah commented on a diff in the pull request:
https://github.com/apache/spark/pull/21366#discussion_r195512808
--- Diff:
resource-managers/kubernetes/core/src/main/scala/org/apache/spark/scheduler/cluster/k8s/ExecutorPodsPollingSnapshotSource.scala
---
@@ -0,0
Github user mccheah commented on the issue:
https://github.com/apache/spark/pull/21366
> @mccheah could you add a design doc for future reference and so that new
contributors can understand better the rationale behind this. There is some
description in the JIRA ticket but not eno
Github user mccheah commented on a diff in the pull request:
https://github.com/apache/spark/pull/21366#discussion_r195513995
--- Diff:
resource-managers/kubernetes/core/src/main/scala/org/apache/spark/scheduler/cluster/k8s/KubernetesClusterManager.scala
---
@@ -56,17 +58,44
Github user mccheah commented on a diff in the pull request:
https://github.com/apache/spark/pull/21366#discussion_r195542619
--- Diff:
resource-managers/kubernetes/core/src/main/scala/org/apache/spark/scheduler/cluster/k8s/ExecutorPodsPollingSnapshotSource.scala
---
@@ -0,0
Github user mccheah commented on a diff in the pull request:
https://github.com/apache/spark/pull/21366#discussion_r195561927
--- Diff:
resource-managers/kubernetes/core/src/main/scala/org/apache/spark/scheduler/cluster/k8s/ExecutorPodsPollingSnapshotSource.scala
---
@@ -0,0
Github user mccheah commented on a diff in the pull request:
https://github.com/apache/spark/pull/21366#discussion_r195566124
--- Diff:
resource-managers/kubernetes/core/src/main/scala/org/apache/spark/scheduler/cluster/k8s/ExecutorPodsSnapshotsStoreImpl.scala
---
@@ -0,0 +1,88
Github user mccheah commented on a diff in the pull request:
https://github.com/apache/spark/pull/21366#discussion_r195567079
--- Diff:
resource-managers/kubernetes/core/src/main/scala/org/apache/spark/scheduler/cluster/k8s/ExecutorPodsPollingSnapshotSource.scala
---
@@ -0,0
Github user mccheah commented on a diff in the pull request:
https://github.com/apache/spark/pull/21366#discussion_r195567414
--- Diff:
resource-managers/kubernetes/core/src/main/scala/org/apache/spark/scheduler/cluster/k8s/ExecutorPodsPollingSnapshotSource.scala
---
@@ -0,0
Github user mccheah commented on the issue:
https://github.com/apache/spark/pull/21366
Ok, addressed comments. The latest patch also makes it so that the
subscribers run in a thread pool instead of just on a single thread. We have
two subscribers so now they can run concurrently, if
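The thread-pool change described above can be sketched roughly as follows (all names here are hypothetical, not the actual ExecutorPodsSnapshotsStoreImpl code): each subscriber handles an event on its own pool thread, so a slow subscriber no longer delays the other.

```scala
import java.util.concurrent.{CountDownLatch, Executors, TimeUnit}

object SubscriberPoolSketch {
  // Deliver one event to every subscriber, each on its own pool thread,
  // so two subscribers process the update concurrently rather than serially.
  // A CountDownLatch lets the caller wait until all subscribers finish.
  def deliver(subscribers: Seq[String => Unit], event: String): Unit = {
    val pool = Executors.newFixedThreadPool(subscribers.size)
    val done = new CountDownLatch(subscribers.size)
    subscribers.foreach { subscriber =>
      pool.execute(new Runnable {
        override def run(): Unit =
          try subscriber(event) finally done.countDown()
      })
    }
    done.await(5, TimeUnit.SECONDS)
    pool.shutdown()
  }
}
```

With exactly two subscribers, a fixed pool sized to the subscriber count is the simplest way to guarantee they can run at the same time.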
Github user mccheah commented on the issue:
https://github.com/apache/spark/pull/21366
retest this please
---
-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h
Github user mccheah commented on the issue:
https://github.com/apache/spark/pull/21366
Ok, I'm merging to master. Thanks everyone for contributing to review -
@foxish, @liyinan926 , @skonto , @dvogelbacher, @erikerlandson. As discussed
earlier, I will post a design document fo
Github user mccheah commented on the issue:
https://github.com/apache/spark/pull/21551
@rxin, for situational awareness: this is going into branch-2.3. Should be
fine - we can ship this if/when we cut 2.3.2. Merging.
Github user mccheah commented on the issue:
https://github.com/apache/spark/pull/21551
Missed this before merging, but @fabriziocucci for future reference please
put the ticket number along with `[K8S]` in the PR description. Sorry I didn't
catch this before me
Github user mccheah commented on a diff in the pull request:
https://github.com/apache/spark/pull/21584#discussion_r196237071
--- Diff:
resource-managers/kubernetes/docker/src/main/dockerfiles/spark/bindings/R/Dockerfile
---
@@ -0,0 +1,29 @@
+#
+# Licensed to the Apache
Github user mccheah commented on the issue:
https://github.com/apache/spark/pull/21551
The PR is already merged so it's too late now. Next time, please create a
ticket if one does not exist and put it in the PR descri
Github user mccheah commented on the issue:
https://github.com/apache/spark/pull/21551
asfgit is showing the commit as pushed above.
Github user mccheah commented on the issue:
https://github.com/apache/spark/pull/21551
Hm, I thought I did. I don't quite know what's going on.
Github user mccheah commented on a diff in the pull request:
https://github.com/apache/spark/pull/21366#discussion_r196257515
--- Diff:
resource-managers/kubernetes/core/src/main/scala/org/apache/spark/deploy/k8s/Config.scala
---
@@ -154,6 +154,24 @@ private[spark] object Config
Github user mccheah commented on the issue:
https://github.com/apache/spark/pull/21555
This change makes it so that using the tool forces building and pushing
both Python and non-Python, but what if the user wants to build only one to
save time? I can imagine that being the case
Github user mccheah commented on the issue:
https://github.com/apache/spark/pull/21555
Yup feel free to merge and follow up separately.
Github user mccheah commented on the issue:
https://github.com/apache/spark/pull/21508
@gatorsmile @hvanhovell, I'm working with @bkrieger and we need this patch
soon. May we please get a sign off or else any suggested changes
Github user mccheah commented on a diff in the pull request:
https://github.com/apache/spark/pull/21511#discussion_r198659423
--- Diff:
resource-managers/kubernetes/core/src/main/scala/org/apache/spark/deploy/k8s/Config.scala
---
@@ -104,6 +104,20 @@ private[spark] object Config
GitHub user mccheah opened a pull request:
https://github.com/apache/spark/pull/21660
[SPARK-24683][K8S] Fix k8s no resource
## What changes were proposed in this pull request?
Make SparkSubmit pass in the main class even if `SparkLauncher.NO_RESOURCE`
is the primary
Github user mccheah commented on the issue:
https://github.com/apache/spark/pull/21660
@ifilonenko
Github user mccheah commented on a diff in the pull request:
https://github.com/apache/spark/pull/21462#discussion_r199228479
--- Diff:
resource-managers/kubernetes/core/src/main/scala/org/apache/spark/scheduler/cluster/k8s/KubernetesClusterManager.scala
---
@@ -46,8 +46,6
Github user mccheah commented on a diff in the pull request:
https://github.com/apache/spark/pull/21660#discussion_r199562616
--- Diff:
resource-managers/kubernetes/integration-tests/src/test/scala/org/apache/spark/deploy/k8s/integrationtest/KubernetesSuite.scala
---
@@ -21,17
Github user mccheah commented on a diff in the pull request:
https://github.com/apache/spark/pull/21660#discussion_r199596517
--- Diff:
resource-managers/kubernetes/integration-tests/src/test/scala/org/apache/spark/deploy/k8s/integrationtest/KubernetesSuite.scala
---
@@ -21,17
Github user mccheah commented on the issue:
https://github.com/apache/spark/pull/21743
@vanzin would it be possible for you to please take a look at this? Thanks.
Github user mccheah commented on the issue:
https://github.com/apache/spark/pull/21743
test this please
Github user mccheah closed the pull request at:
https://github.com/apache/spark/pull/4481
---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is
Github user mccheah commented on the pull request:
https://github.com/apache/spark/pull/3130#issuecomment-69662311
Hi,
I was wondering if this is still being updated?
Github user mccheah commented on a diff in the pull request:
https://github.com/apache/spark/pull/4066#discussion_r23106734
--- Diff: core/src/main/scala/org/apache/spark/SparkHadoopWriter.scala ---
@@ -105,10 +107,20 @@ class SparkHadoopWriter(@transient jobConf: JobConf
Github user mccheah commented on a diff in the pull request:
https://github.com/apache/spark/pull/4066#discussion_r23108057
--- Diff: core/src/main/scala/org/apache/spark/SparkHadoopWriter.scala ---
@@ -105,10 +107,20 @@ class SparkHadoopWriter(@transient jobConf: JobConf
Github user mccheah commented on a diff in the pull request:
https://github.com/apache/spark/pull/4066#discussion_r23108302
--- Diff: core/src/main/scala/org/apache/spark/SparkHadoopWriter.scala ---
@@ -105,10 +107,20 @@ class SparkHadoopWriter(@transient jobConf: JobConf
Github user mccheah commented on the pull request:
https://github.com/apache/spark/pull/3130#issuecomment-70321240
Looks like we're on the same page. However, I believe this still raises the
question of how best to do the shading itself. It looks like the short-term
solution
GitHub user mccheah opened a pull request:
https://github.com/apache/spark/pull/4106
[SPARK-5158] [core] [security] Spark standalone mode can authenticate
against a Kerberos-secured Hadoop cluster
Previously, Kerberos secured Hadoop clusters could only be accessed by
Spark running
Github user mccheah commented on the pull request:
https://github.com/apache/spark/pull/4106#issuecomment-70553364
Suggestions to unit test are welcome. This should not be merged until it is
unit-tested.
Github user mccheah commented on the pull request:
https://github.com/apache/spark/pull/4106#issuecomment-70553855
One other caveat I forgot to mention, and the commit message should be
updated and this reflected in the docs: User proxying needs to be enabled.
Basically, the user
Github user mccheah commented on the pull request:
https://github.com/apache/spark/pull/4066#issuecomment-70710069
Instead of having every task require a call back to the driver or master,
can the master broadcast to the executor that a task is being speculated and
any executor with
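The alternative floated above can be sketched like this (names and structure are illustrative, not Spark's actual implementation): the master broadcasts which (stage, partition) pairs have a speculative copy running, executors record that in a local set, and a task only needs a round-trip to the driver when its partition is actually contended.

```scala
import java.util.concurrent.ConcurrentHashMap

object SpeculationFlags {
  // Populated by a (hypothetical) broadcast from the driver/master when a
  // speculative copy of a task is launched.
  private val speculated = ConcurrentHashMap.newKeySet[(Int, Int)]()

  def markSpeculated(stage: Int, partition: Int): Unit = {
    speculated.add((stage, partition)); ()
  }

  // Tasks consult the local flag; only contended partitions would need to
  // call back to the driver for commit arbitration.
  def hasSpeculativeCopy(stage: Int, partition: Int): Boolean =
    speculated.contains((stage, partition))
}
```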
Github user mccheah commented on the pull request:
https://github.com/apache/spark/pull/4066#issuecomment-70723749
Did you think of any corner cases that you might have missed? In terms of
correctness, this seems okay (although the Jenkins build indicates there are
some issues). Have
Github user mccheah commented on a diff in the pull request:
https://github.com/apache/spark/pull/4066#discussion_r23255988
--- Diff: core/src/main/scala/org/apache/spark/SparkHadoopWriter.scala ---
@@ -105,10 +107,24 @@ class SparkHadoopWriter(@transient jobConf: JobConf
Github user mccheah commented on a diff in the pull request:
https://github.com/apache/spark/pull/4066#discussion_r23265225
--- Diff: core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala
---
@@ -113,7 +115,7 @@ class DAGScheduler(
private val failedEpoch = new
Github user mccheah commented on a diff in the pull request:
https://github.com/apache/spark/pull/4066#discussion_r23270381
--- Diff: core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala
---
@@ -113,7 +115,7 @@ class DAGScheduler(
private val failedEpoch = new
Github user mccheah commented on a diff in the pull request:
https://github.com/apache/spark/pull/4106#discussion_r23320905
--- Diff:
core/src/main/scala/org/apache/spark/deploy/StandaloneSparkHadoopUtil.scala ---
@@ -0,0 +1,76 @@
+/*
+ * Licensed to the Apache Software
Github user mccheah commented on a diff in the pull request:
https://github.com/apache/spark/pull/4106#discussion_r23317321
--- Diff:
core/src/main/scala/org/apache/spark/deploy/StandaloneSparkHadoopUtil.scala ---
@@ -0,0 +1,76 @@
+/*
+ * Licensed to the Apache Software
Github user mccheah commented on the pull request:
https://github.com/apache/spark/pull/4106#issuecomment-70891705
That's correct. Definitely a work-in-progress, so if there's another
security model you'd recommend, I'm all ears!
-Matt Cheah
From: Tom Graves
Github user mccheah commented on a diff in the pull request:
https://github.com/apache/spark/pull/4106#discussion_r23326963
--- Diff:
core/src/main/scala/org/apache/spark/deploy/StandaloneSparkHadoopUtil.scala ---
@@ -0,0 +1,76 @@
+/*
+ * Licensed to the Apache Software
GitHub user mccheah opened a pull request:
https://github.com/apache/spark/pull/4155
[SPARK-4879] Use the Spark driver to authorize Hadoop commits.
This is a version of https://github.com/apache/spark/pull/4066/ which is up
to date with master and has unit tests
Github user mccheah commented on the pull request:
https://github.com/apache/spark/pull/4066#issuecomment-70959680
The linked pull request takes your ideas, makes them compatible with
master, and adds unit tests. Feel free to take a look.
Github user mccheah commented on the pull request:
https://github.com/apache/spark/pull/4155#issuecomment-71116954
Looks like the tests timed out. This change may be a large performance
bottleneck, since communicating back to the driver on every task commit is
expensive.
Github user mccheah commented on the pull request:
https://github.com/apache/spark/pull/4155#issuecomment-71130249
Actually it just looks like one test is hanging, so likely something not
being shut down properly.
Github user mccheah commented on the pull request:
https://github.com/apache/spark/pull/4155#issuecomment-71155039
Lots of comments to address, thanks for the detailed feedback @JoshRosen!
One hanging cause is that the actor can be stopped multiple times in some of
the YARN tests, so
Github user mccheah commented on the pull request:
https://github.com/apache/spark/pull/4155#issuecomment-71155895
Pushed to try to make Jenkins pass. If Jenkins passes I'll handle the
comments so far.
Github user mccheah commented on the pull request:
https://github.com/apache/spark/pull/4155#issuecomment-71253931
I'm also concerned about the performance ramifications of this. We need to
run performance benchmarks. However, the only critical path that is affected by
this are
Github user mccheah commented on a diff in the pull request:
https://github.com/apache/spark/pull/4155#discussion_r23478673
--- Diff: core/src/main/scala/org/apache/spark/SparkHadoopWriter.scala ---
@@ -106,18 +107,25 @@ class SparkHadoopWriter(@transient jobConf: JobConf
Github user mccheah commented on a diff in the pull request:
https://github.com/apache/spark/pull/4155#discussion_r23507946
--- Diff:
core/src/main/scala/org/apache/spark/scheduler/OutputCommitCoordinator.scala ---
@@ -0,0 +1,172 @@
+/*
+ * Licensed to the Apache Software
Github user mccheah commented on the pull request:
https://github.com/apache/spark/pull/4155#issuecomment-71399368
@vanzin I can attempt to make OutputCommitCoordinator more multithreaded as
you suggest. Do we have an example somewhere of Spark executors calling back to
the driver
Github user mccheah commented on a diff in the pull request:
https://github.com/apache/spark/pull/4155#discussion_r23508902
--- Diff:
core/src/main/scala/org/apache/spark/scheduler/OutputCommitCoordinator.scala ---
@@ -0,0 +1,178 @@
+/*
+ * Licensed to the Apache Software
Github user mccheah commented on a diff in the pull request:
https://github.com/apache/spark/pull/4155#discussion_r23509086
--- Diff: core/src/main/scala/org/apache/spark/SparkHadoopWriter.scala ---
@@ -106,18 +107,25 @@ class SparkHadoopWriter(@transient jobConf: JobConf
Github user mccheah commented on a diff in the pull request:
https://github.com/apache/spark/pull/4155#discussion_r23513157
--- Diff:
core/src/main/scala/org/apache/spark/scheduler/OutputCommitCoordinator.scala ---
@@ -0,0 +1,252 @@
+/*
+ * Licensed to the Apache Software
Github user mccheah commented on the pull request:
https://github.com/apache/spark/pull/3656#issuecomment-71526176
Seeing some problems that this PR could address so reviving this thread.
@lawlerd the configurable count would help because if it is known that the
individual
Github user mccheah commented on the pull request:
https://github.com/apache/spark/pull/4155#issuecomment-71530343
@vanzin that's pretty much what I went with. The actor will receive the
message and for commit permission requests they're farmed off to a thread pool.
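A rough sketch of that pattern (hypothetical names; the real logic lives in OutputCommitCoordinator): commit-permission requests are submitted to a pool so a burst of asks cannot block the message-handling thread, and first-attempt-wins decides the committer for each partition.

```scala
import java.util.concurrent.{Callable, ConcurrentHashMap, Executors, ThreadFactory}

object CommitPermissionSketch {
  // (stage, partition) -> winning task attempt number
  private val committers = new ConcurrentHashMap[(Int, Int), Int]()

  // Daemon threads so the pool never keeps the JVM alive on its own.
  private val pool = Executors.newFixedThreadPool(4, new ThreadFactory {
    override def newThread(r: Runnable): Thread = {
      val t = new Thread(r); t.setDaemon(true); t
    }
  })

  // Farmed off to the pool, mirroring the comment above; computeIfAbsent
  // atomically records the first attempt to ask, which wins the commit.
  def askPermission(stage: Int, partition: Int, attempt: Int): Boolean =
    pool.submit(new Callable[Boolean] {
      override def call(): Boolean =
        committers.computeIfAbsent((stage, partition), _ => attempt) == attempt
    }).get()
}
```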
Github user mccheah commented on a diff in the pull request:
https://github.com/apache/spark/pull/4155#discussion_r23560945
--- Diff:
core/src/main/scala/org/apache/spark/scheduler/OutputCommitCoordinator.scala ---
@@ -0,0 +1,252 @@
+/*
+ * Licensed to the Apache Software
Github user mccheah commented on a diff in the pull request:
https://github.com/apache/spark/pull/4155#discussion_r23567732
--- Diff:
core/src/main/scala/org/apache/spark/scheduler/OutputCommitCoordinator.scala ---
@@ -0,0 +1,252 @@
+/*
+ * Licensed to the Apache Software
Github user mccheah commented on a diff in the pull request:
https://github.com/apache/spark/pull/4155#discussion_r23573888
--- Diff: core/src/main/scala/org/apache/spark/SparkHadoopWriter.scala ---
@@ -106,18 +107,25 @@ class SparkHadoopWriter(@transient jobConf: JobConf
Github user mccheah commented on a diff in the pull request:
https://github.com/apache/spark/pull/4155#discussion_r23581966
--- Diff: core/src/main/scala/org/apache/spark/SparkHadoopWriter.scala ---
@@ -106,18 +107,25 @@ class SparkHadoopWriter(@transient jobConf: JobConf
Github user mccheah commented on the pull request:
https://github.com/apache/spark/pull/4155#issuecomment-71707718
This is a work in progress. In particular, the OutputCommitCoordinatorSuite
isn't quite testing the right thing now. I don't exactly know how to test the
full
Github user mccheah commented on a diff in the pull request:
https://github.com/apache/spark/pull/4155#discussion_r23634059
--- Diff:
core/src/test/scala/org/apache/spark/scheduler/OutputCommitCoordinatorSuite.scala
---
@@ -0,0 +1,177 @@
+/*
+ * Licensed to the Apache
Github user mccheah commented on the pull request:
https://github.com/apache/spark/pull/4155#issuecomment-71743261
Jenkins, retest this please
Github user mccheah commented on the pull request:
https://github.com/apache/spark/pull/4155#issuecomment-71893889
Regarding the tests, I note that this is currently NOT testing the right
thing. It was, until I added extra logic down in the Executors, which is now
bypassed by the way
Github user mccheah commented on a diff in the pull request:
https://github.com/apache/spark/pull/4155#discussion_r23712915
--- Diff: core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala
---
@@ -63,7 +63,7 @@ class DAGScheduler(
mapOutputTracker
Github user mccheah commented on a diff in the pull request:
https://github.com/apache/spark/pull/4155#discussion_r23743012
--- Diff:
core/src/main/scala/org/apache/spark/scheduler/OutputCommitCoordinator.scala ---
@@ -0,0 +1,258 @@
+/*
+ * Licensed to the Apache Software
Github user mccheah commented on the pull request:
https://github.com/apache/spark/pull/4252#issuecomment-72126451
@ash211
Thanks for doing this. My colleagues and I will test this appropriately
when we find the bandwidth!
Github user mccheah commented on the pull request:
https://github.com/apache/spark/pull/4155#issuecomment-72141862
Jenkins, retest this please.
GitHub user mccheah opened a pull request:
https://github.com/apache/spark/pull/2608
[SPARK-1860] Worker better app cleanup
First contribution to the project, so apologies for any significant errors.
This PR addresses [SPARK-1860]. The application directories are now
GitHub user mccheah reopened a pull request:
https://github.com/apache/spark/pull/2608
[SPARK-1860] Worker better app cleanup
First contribution to the project, so apologies for any significant errors.
This PR addresses [SPARK-1860]. The application directories are now
Github user mccheah closed the pull request at:
https://github.com/apache/spark/pull/2608
GitHub user mccheah opened a pull request:
https://github.com/apache/spark/pull/2609
[SPARK-1860] More conservative app directory cleanup.
First contribution to the project, so apologies for any significant errors.
This PR addresses [SPARK-1860]. The application directories
Github user mccheah commented on a diff in the pull request:
https://github.com/apache/spark/pull/2609#discussion_r18316820
--- Diff: core/src/main/scala/org/apache/spark/deploy/worker/Worker.scala
---
@@ -22,6 +22,7 @@ import java.text.SimpleDateFormat
import java.util.Date
Github user mccheah commented on a diff in the pull request:
https://github.com/apache/spark/pull/2609#discussion_r18365112
--- Diff: core/src/main/scala/org/apache/spark/deploy/worker/Worker.scala
---
@@ -233,8 +244,15 @@ private[spark] class Worker(
} else
Github user mccheah commented on a diff in the pull request:
https://github.com/apache/spark/pull/2609#discussion_r18365510
--- Diff:
core/src/main/scala/org/apache/spark/deploy/worker/ExecutorRunner.scala ---
@@ -174,7 +168,7 @@ private[spark] class ExecutorRunner
Github user mccheah commented on a diff in the pull request:
https://github.com/apache/spark/pull/2609#discussion_r18407971
--- Diff: core/src/main/scala/org/apache/spark/deploy/worker/Worker.scala
---
@@ -191,6 +194,8 @@ private[spark] class Worker(
changeMaster
Github user mccheah commented on the pull request:
https://github.com/apache/spark/pull/2662#issuecomment-58056000
Sorry about that. I think Jenkins should be catching these kinds of build
failures though. Jenkins should attempt to build the project against multiple
versions of
Github user mccheah commented on the pull request:
https://github.com/apache/spark/pull/2662#issuecomment-58057718
Fair enough. The bottom line is that we could be more explicit about this.
Perhaps something in the documentation?
GitHub user mccheah opened a pull request:
https://github.com/apache/spark/pull/2828
[SPARK-3736] Workers reconnect when disassociated from the master.
Before, if the master node is killed and restarted, the worker nodes
would not attempt to reconnect to the Master. Therefore
Github user mccheah commented on the pull request:
https://github.com/apache/spark/pull/2828#issuecomment-59408043
One remark is that there are no automated tests in this commit for now.
I was unsuccessful in setting up TestKit to emulate a worker and master
sending messages
Github user mccheah commented on a diff in the pull request:
https://github.com/apache/spark/pull/2828#discussion_r18977288
--- Diff: core/src/main/scala/org/apache/spark/deploy/master/Master.scala
---
@@ -341,7 +341,11 @@ private[spark] class Master(
case Some
Github user mccheah commented on a diff in the pull request:
https://github.com/apache/spark/pull/2828#discussion_r18978981
--- Diff: core/src/main/scala/org/apache/spark/deploy/worker/Worker.scala
---
@@ -362,9 +372,19 @@ private[spark] class Worker
Github user mccheah commented on a diff in the pull request:
https://github.com/apache/spark/pull/2828#discussion_r18986188
--- Diff: core/src/main/scala/org/apache/spark/deploy/worker/Worker.scala
---
@@ -362,9 +372,19 @@ private[spark] class Worker
Github user mccheah commented on a diff in the pull request:
https://github.com/apache/spark/pull/2828#discussion_r18986702
--- Diff: core/src/main/scala/org/apache/spark/deploy/worker/Worker.scala
---
@@ -362,9 +372,19 @@ private[spark] class Worker
Github user mccheah commented on a diff in the pull request:
https://github.com/apache/spark/pull/2828#discussion_r18986941
--- Diff: core/src/main/scala/org/apache/spark/deploy/worker/Worker.scala
---
@@ -362,9 +372,19 @@ private[spark] class Worker
Github user mccheah commented on a diff in the pull request:
https://github.com/apache/spark/pull/2828#discussion_r18988742
--- Diff: core/src/main/scala/org/apache/spark/deploy/worker/Worker.scala
---
@@ -362,9 +372,19 @@ private[spark] class Worker
Github user mccheah commented on the pull request:
https://github.com/apache/spark/pull/4634#issuecomment-74732535
There is a case where map-side-combine is actually not the right thing to
do in some of my workflows. map-side-combine makes sense if the overall amount
of data is
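The trade-off in question can be illustrated in plain Scala (this is not the Spark API, just the idea): map-side combine pre-aggregates within each partition before the shuffle, which shrinks network traffic when many records share a key but is wasted work when keys are mostly unique.

```scala
object MapSideCombineSketch {
  // Combine values by key inside one "map-side" partition before any
  // shuffle, reducing the number of records sent over the network.
  def preAggregate(partition: Seq[(String, Int)]): Map[String, Int] =
    partition.groupBy(_._1).map { case (key, pairs) =>
      key -> pairs.map(_._2).sum
    }
}
```

With many repeats per key the output is far smaller than the input; with all-distinct keys it is the same size, and the grouping pass bought nothing.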
Github user mccheah commented on the pull request:
https://github.com/apache/spark/pull/4634#issuecomment-74752839
We want to take advantage of the distributed reduce functionality of
combineByKey when computing the other aggregation metrics as well. Is this not
lost if we do a map
Github user mccheah commented on the pull request:
https://github.com/apache/spark/pull/4634#issuecomment-74756966
You lose the parallelism that's inherent in computing the reduce as a
parallel operation, as opposed to computing it on a list in a single task.
For more co
Github user mccheah commented on the pull request:
https://github.com/apache/spark/pull/4106#issuecomment-74762686
Able to come back to this now!
Github user mccheah commented on a diff in the pull request:
https://github.com/apache/spark/pull/4106#discussion_r24860380
--- Diff: core/src/main/scala/org/apache/spark/rdd/HadoopRDD.scala ---
@@ -193,17 +193,21 @@ class HadoopRDD[K, V](
override def getPartitions: Array
Github user mccheah commented on the pull request:
https://github.com/apache/spark/pull/4420#issuecomment-74827293
If every single object is large, though, then after we've spilled the 32nd
object there would still be an OOM before we check for spilling again, rig
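The concern above can be modeled with a toy spill check (constants and names are illustrative, assuming the memory test only fires every 32 inserted records): between checks, very large records can push usage far past the limit.

```scala
class SpillCheckSketch(limitBytes: Long) {
  private var usedBytes = 0L
  private var elementsRead = 0L
  var spillCount = 0
  var peakBetweenChecks = 0L

  // The memory check only fires every 32 records, so with large records
  // usage can badly overshoot the limit before the next check runs.
  def insert(recordBytes: Long): Unit = {
    usedBytes += recordBytes
    elementsRead += 1
    peakBetweenChecks = math.max(peakBetweenChecks, usedBytes)
    if (elementsRead % 32 == 0 && usedBytes > limitBytes) {
      spillCount += 1
      usedBytes = 0L
    }
  }
}
```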