Github user tgravescs commented on the issue:
https://github.com/apache/spark/pull/18388
200k+ connections seems to be your problem then. Is this all a single
application? You say 6000 nodes with 64 executors on each host, how many cores
per executor? Or do you mean basically
Github user tgravescs commented on the issue:
https://github.com/apache/spark/pull/18388
ok, sorry, I forgot you had the screenshot there. So as you mention in that
post, if we are just creating too many outbound buffers before they can actually
be sent over the network, then we should try
Github user tgravescs commented on the issue:
https://github.com/apache/spark/pull/18547
There is no reason to print out messages that aren't useful to users.
Many users see warnings and read them and think there is a problem with their
application or configuration. Mo
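A minimal sketch of the kind of change being discussed; the class, method, and
message below are illustrative, not the actual diff in #18547. Spark classes get
their loggers by mixing in the internal Logging trait:

    import org.apache.spark.internal.Logging

    class FetchMonitor extends Logging {
      def onSlowFetch(elapsedMs: Long): Unit = {
        // Before: a WARN that reads like a problem to every user.
        // logWarning(s"Fetch took $elapsedMs ms")
        // After: kept for debugging, hidden from normal runs.
        logDebug(s"Fetch took $elapsedMs ms")
      }
    }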
Github user tgravescs commented on the issue:
https://github.com/apache/spark/pull/18388
So that is an issue. If users are running Spark 1.6 or Spark 2.1 on the
same cluster as the new one with this feature, you can't upgrade the shuffle
service until no one runs those. W
Github user tgravescs commented on the issue:
https://github.com/apache/spark/pull/18388
Making the external shuffle service incompatible is a huge issue for
deployments. For the YARN side you would have to have the NodeManager run two
versions (which as far as I know hasn't
Github user tgravescs commented on a diff in the pull request:
https://github.com/apache/spark/pull/18476#discussion_r125138598
--- Diff: docs/configuration.md ---
@@ -1398,6 +1398,15 @@ Apart from these, the following properties are also
available, and may be useful
Github user tgravescs commented on a diff in the pull request:
https://github.com/apache/spark/pull/18476#discussion_r125138509
--- Diff: docs/configuration.md ---
@@ -1398,6 +1398,15 @@ Apart from these, the following properties are also
available, and may be useful
Github user tgravescs commented on the issue:
https://github.com/apache/spark/pull/18476
Jenkins, test this please
Github user tgravescs commented on a diff in the pull request:
https://github.com/apache/spark/pull/18476#discussion_r125137046
--- Diff: docs/configuration.md ---
@@ -1398,6 +1398,15 @@ Apart from these, the following properties are also
available, and may be useful
Github user tgravescs commented on a diff in the pull request:
https://github.com/apache/spark/pull/18476#discussion_r125038588
--- Diff: docs/configuration.md ---
@@ -1398,6 +1398,15 @@ Apart from these, the following properties are also
available, and may be useful
Github user tgravescs commented on the issue:
https://github.com/apache/spark/pull/18388
I think having both sides would probably be good: limit the reducer
connections and simultaneous block calls, but have a fail-safe on the shuffle
server side where it can also reject connections.
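For context, the reducer-side throttles that already exist in this area of
Spark; the values below are illustrative, not recommendations:

    import org.apache.spark.SparkConf

    val conf = new SparkConf()
      // Cap the total size of in-flight shuffle data per reduce task.
      .set("spark.reducer.maxSizeInFlight", "48m")
      // Cap the number of simultaneous fetch requests a reducer issues.
      .set("spark.reducer.maxReqsInFlight", "256")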
Github user tgravescs commented on the issue:
https://github.com/apache/spark/pull/18388
Haven't looked at the patch in detail yet. High-level questions/thoughts.
So you say the memory usage is from the Netty chunks, so my assumption is
this is during the actual transfer? fa
Github user tgravescs commented on the issue:
https://github.com/apache/spark/pull/17113
+1 finally got a clean build, will merge to master
Github user tgravescs commented on the issue:
https://github.com/apache/spark/pull/18162
thanks for the reviews @cloud-fan
Github user tgravescs commented on the issue:
https://github.com/apache/spark/pull/18162
Jenkins, test this please
Github user tgravescs commented on the issue:
https://github.com/apache/spark/pull/18162
Jenkins, test this please
Github user tgravescs commented on the issue:
https://github.com/apache/spark/pull/18162
failure is from a previous push of code.
Github user tgravescs commented on the issue:
https://github.com/apache/spark/pull/18162
upmerged to master, updated the default, and removed unneeded changes.
Github user tgravescs commented on the issue:
https://github.com/apache/spark/pull/18162
will upmerge shortly, since there are conflicts
Github user tgravescs commented on a diff in the pull request:
https://github.com/apache/spark/pull/18162#discussion_r123527201
--- Diff:
core/src/main/scala/org/apache/spark/internal/config/package.scala ---
@@ -295,4 +295,12 @@ package object config {
"above
Github user tgravescs commented on the issue:
https://github.com/apache/spark/pull/18070
sorry for my delay in getting back to this.
So if we do that, you would have to have TaskKilledReason extend
TaskFailedReason, because things rely on the countTowardsTaskFailures field.
Then
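A rough sketch of the shape being discussed; simplified, since Spark's actual
TaskEndReason hierarchy is sealed and lives in core:

    trait TaskFailedReason {
      def toErrorString: String
      // The scheduler consults this when counting failures toward
      // spark.task.maxFailures.
      def countTowardsTaskFailures: Boolean = true
    }

    // A kill reason extends the failed hierarchy (so existing handling
    // still applies) but opts out of counting toward the failure limit.
    case class TaskKilledReason(reason: String) extends TaskFailedReason {
      override def toErrorString: String = s"Task killed: $reason"
      override def countTowardsTaskFailures: Boolean = false
    }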
Github user tgravescs commented on a diff in the pull request:
https://github.com/apache/spark/pull/18162#discussion_r123273222
--- Diff:
core/src/main/scala/org/apache/spark/ui/jobs/JobProgressListener.scala ---
@@ -528,7 +528,13 @@ class JobProgressListener(conf: SparkConf
Github user tgravescs commented on the issue:
https://github.com/apache/spark/pull/17113
Jenkins, test this please
Github user tgravescs commented on the issue:
https://github.com/apache/spark/pull/17113
these failures definitely look unrelated. I'll kick once more to try to get
a clean run.
Github user tgravescs commented on the issue:
https://github.com/apache/spark/pull/18162
sorry, missed that you had commented. Yes, we can change that
Github user tgravescs commented on a diff in the pull request:
https://github.com/apache/spark/pull/18162#discussion_r123262316
--- Diff:
core/src/main/scala/org/apache/spark/ui/jobs/JobProgressListener.scala ---
@@ -528,7 +528,13 @@ class JobProgressListener(conf: SparkConf
Github user tgravescs commented on the issue:
https://github.com/apache/spark/pull/17113
Jenkins, test this please
Github user tgravescs commented on the issue:
https://github.com/apache/spark/pull/18162
Jenkins, test this please
Github user tgravescs commented on the issue:
https://github.com/apache/spark/pull/18213
I think this is a very hard thing for us to know, too many different failure
types. I agree that setting it to 1 is better than us getting it wrong, although
I question a bit still if that is right
Github user tgravescs commented on the issue:
https://github.com/apache/spark/pull/18162
sorry, been out on vacation and probably won't have time this week to
respond much, but will update early next week. Thanks @rdblue. I am running
this in our production as well and can cl
Github user tgravescs commented on the issue:
https://github.com/apache/spark/pull/18150
Sorry, been out on vacation. I think invalidating all and having a feature flag
makes sense for now. If we get more data on it causing issues, we can revisit.
Sorry, won't have time to review in d
Github user tgravescs commented on the issue:
https://github.com/apache/spark/pull/18162
I'm not sure what you mean by it's not doable. What places are you seeing
update the block statuses that I haven't covered here? Most of it was done by
the BlockManager. Maybe I
Github user tgravescs commented on a diff in the pull request:
https://github.com/apache/spark/pull/18162#discussion_r119943504
--- Diff:
core/src/main/scala/org/apache/spark/ui/jobs/JobProgressListener.scala ---
@@ -528,7 +528,13 @@ class JobProgressListener(conf: SparkConf
Github user tgravescs commented on the issue:
https://github.com/apache/spark/pull/18162
Turned on by default for backwards compatibility, but I don't really agree
with it. We should make it more stable/usable for people by turning it off.
I'm assuming anyone that is using
Github user tgravescs commented on the issue:
https://github.com/apache/spark/pull/18162
ok, I'll update the default.
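In Spark's internal config package, a default like this is declared with
ConfigBuilder; a sketch of the flip being discussed, with the key name and doc
text assumed from this thread rather than taken from the merged diff:

    import org.apache.spark.internal.config.ConfigBuilder

    private[spark] val TRACK_UPDATED_BLOCK_STATUSES =
      ConfigBuilder("spark.taskMetrics.trackUpdatedBlockStatuses")
        .doc("Enable tracking of updatedBlockStatuses in TaskMetrics. Off by " +
          "default since tracking the block statuses can use a lot of memory.")
        .booleanConf
        .createWithDefault(false)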
Github user tgravescs commented on a diff in the pull request:
https://github.com/apache/spark/pull/18150#discussion_r119736751
--- Diff: core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala
---
@@ -1383,19 +1394,43 @@ class DAGScheduler(
*/
private
Github user tgravescs commented on the issue:
https://github.com/apache/spark/pull/18162
@JoshRosen what do you think, should we add the deprecated?
Github user tgravescs commented on the issue:
https://github.com/apache/spark/pull/18070
thanks for the updates. I was testing this out by running a large job with
speculative tasks and I am still seeing the stage summary show failed tasks.
It looks like it's due to this code
Github user tgravescs commented on the issue:
https://github.com/apache/spark/pull/17113
changes LGTM.
@squito did you have any further comments?
Github user tgravescs commented on the issue:
https://github.com/apache/spark/pull/18162
It would be nice to get this into Spark 2.2 if we can
Github user tgravescs commented on the issue:
https://github.com/apache/spark/pull/18162
Yeah, I was figuring I would file another JIRA to remove it later. I can
add the deprecated flag here if you guys agree.
Github user tgravescs commented on a diff in the pull request:
https://github.com/apache/spark/pull/18162#discussion_r119525021
--- Diff:
core/src/main/scala/org/apache/spark/internal/config/package.scala ---
@@ -295,4 +295,12 @@ package object config {
"above
Github user tgravescs commented on the issue:
https://github.com/apache/spark/pull/18162
TaskMetrics doesn't take the SparkConf or anything to get at a config, so we
would have to config out everywhere it's incrementing or adding things. I think
that wouldn't be too hard.
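The call-site gating that implies, with the config read once wherever a
SparkConf is available and passed down; the flag name is assumed for
illustration, not taken from the merged diff:

    import org.apache.spark.SparkConf

    def shouldTrackBlockStatuses(conf: SparkConf): Boolean =
      // Hypothetical flag name; TaskMetrics itself has no conf handle.
      conf.getBoolean("spark.taskMetrics.trackUpdatedBlockStatuses", false)

    // At each call site that currently increments or adds statuses:
    // if (shouldTrackBlockStatuses(conf)) {
    //   taskMetrics.incUpdatedBlockStatuses(status)
    // }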
Github user tgravescs commented on the issue:
https://github.com/apache/spark/pull/18162
Updated, I put the TaskMetrics API back with deprecated marking and just
had it return Nil. @JoshRosen Were you thinking of adding more back?
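The deprecated stub described here would look roughly like this inside
TaskMetrics; the signature is approximated and the version string is
illustrative:

    import org.apache.spark.storage.{BlockId, BlockStatus}

    @deprecated("TaskMetrics no longer tracks updated block statuses", "2.3.0")
    def updatedBlockStatuses: Seq[(BlockId, BlockStatus)] = Nil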
Github user tgravescs commented on a diff in the pull request:
https://github.com/apache/spark/pull/18162#discussion_r119444253
--- Diff: core/src/main/scala/org/apache/spark/util/JsonProtocol.scala ---
@@ -368,8 +356,7 @@ private[spark] object JsonProtocol {
("Sh
Github user tgravescs commented on a diff in the pull request:
https://github.com/apache/spark/pull/18162#discussion_r119422345
--- Diff: core/src/main/scala/org/apache/spark/util/JsonProtocol.scala ---
@@ -368,8 +356,7 @@ private[spark] object JsonProtocol {
("Sh
Github user tgravescs commented on a diff in the pull request:
https://github.com/apache/spark/pull/18162#discussion_r119422001
--- Diff: core/src/main/scala/org/apache/spark/executor/TaskMetrics.scala
---
@@ -110,15 +109,6 @@ class TaskMetrics private[spark] () extends
Github user tgravescs commented on a diff in the pull request:
https://github.com/apache/spark/pull/18162#discussion_r119421473
--- Diff: core/src/main/scala/org/apache/spark/util/JsonProtocol.scala ---
@@ -368,8 +356,7 @@ private[spark] object JsonProtocol {
("Sh
Github user tgravescs commented on the issue:
https://github.com/apache/spark/pull/18162
@JoshRosen
GitHub user tgravescs opened a pull request:
https://github.com/apache/spark/pull/18162
[SPARK-20923] Remove TaskMetrics._updatedBlockStatuses
## What changes were proposed in this pull request?
Remove TaskMetrics._updatedBlockStatuses. As far as I can see it's not used
by
Github user tgravescs commented on a diff in the pull request:
https://github.com/apache/spark/pull/17113#discussion_r119366199
--- Diff:
core/src/main/scala/org/apache/spark/scheduler/TaskSchedulerImpl.scala ---
@@ -51,29 +51,19 @@ import org.apache.spark.util.{AccumulatorV2
Github user tgravescs commented on the issue:
https://github.com/apache/spark/pull/17113
please add JIRA SPARK-20898 to the description since we're fixing that here
Github user tgravescs commented on the issue:
https://github.com/apache/spark/pull/17113
so unfortunately I haven't actually been seeing this. You can see with
external shuffle, if something happens to the NM, it does cause job failure.
NM crashes from OOM, something else kil
Github user tgravescs commented on the issue:
https://github.com/apache/spark/pull/17113
@squito just double checking, are you ok with this change and did you have
any comments?
Github user tgravescs commented on a diff in the pull request:
https://github.com/apache/spark/pull/17113#discussion_r118751271
--- Diff: docs/configuration.md ---
@@ -1449,6 +1449,14 @@ Apart from these, the following properties are also
available, and may be useful
Github user tgravescs commented on a diff in the pull request:
https://github.com/apache/spark/pull/17113#discussion_r118751130
--- Diff:
core/src/main/scala/org/apache/spark/scheduler/BlacklistTracker.scala ---
@@ -145,6 +146,72 @@ private[scheduler] class BlacklistTracker
Github user tgravescs commented on a diff in the pull request:
https://github.com/apache/spark/pull/17113#discussion_r118748969
--- Diff:
core/src/main/scala/org/apache/spark/scheduler/BlacklistTracker.scala ---
@@ -145,6 +146,72 @@ private[scheduler] class BlacklistTracker
Github user tgravescs commented on a diff in the pull request:
https://github.com/apache/spark/pull/17113#discussion_r118747665
--- Diff:
core/src/main/scala/org/apache/spark/scheduler/BlacklistTracker.scala ---
@@ -145,6 +146,72 @@ private[scheduler] class BlacklistTracker
Github user tgravescs commented on the issue:
https://github.com/apache/spark/pull/17113
I'm curious, did you test the killing part on an actual YARN job? I was
trying it on master and I don't think it works at all due to the way it's
passing the allocation client. It's a sepa
Github user tgravescs commented on a diff in the pull request:
https://github.com/apache/spark/pull/18070#discussion_r118711377
--- Diff: core/src/main/scala/org/apache/spark/executor/Executor.scala ---
@@ -459,7 +459,7 @@ private[spark] class Executor(
case CausedBy
Github user tgravescs commented on the issue:
https://github.com/apache/spark/pull/18070
sorry, the case I was talking about is with a fetch failure. The true stage
abort doesn't happen until it retries 4 times. In the meantime you can have
tasks from the same stage (diff
Github user tgravescs commented on a diff in the pull request:
https://github.com/apache/spark/pull/17113#discussion_r118315661
--- Diff:
core/src/main/scala/org/apache/spark/scheduler/BlacklistTracker.scala ---
@@ -145,6 +146,75 @@ private[scheduler] class BlacklistTracker
Github user tgravescs commented on the issue:
https://github.com/apache/spark/pull/17113
@jerryshao sorry for my delay on this. We have a rough design for what we want
to do in future changes, but I think those are going to take a while, and in the
meantime I think this is a useful addition
Github user tgravescs commented on a diff in the pull request:
https://github.com/apache/spark/pull/17113#discussion_r118260480
--- Diff:
core/src/main/scala/org/apache/spark/scheduler/BlacklistTracker.scala ---
@@ -145,6 +146,75 @@ private[scheduler] class BlacklistTracker
Github user tgravescs commented on a diff in the pull request:
https://github.com/apache/spark/pull/17113#discussion_r118261422
--- Diff:
core/src/main/scala/org/apache/spark/scheduler/BlacklistTracker.scala ---
@@ -145,6 +146,75 @@ private[scheduler] class BlacklistTracker
Github user tgravescs commented on a diff in the pull request:
https://github.com/apache/spark/pull/17113#discussion_r118260789
--- Diff:
core/src/main/scala/org/apache/spark/scheduler/BlacklistTracker.scala ---
@@ -145,6 +146,75 @@ private[scheduler] class BlacklistTracker
Github user tgravescs commented on a diff in the pull request:
https://github.com/apache/spark/pull/18070#discussion_r117977100
--- Diff: core/src/main/scala/org/apache/spark/executor/Executor.scala ---
@@ -338,6 +340,9 @@ private[spark] class Executor
Github user tgravescs commented on the issue:
https://github.com/apache/spark/pull/17955
at a high level this definitely makes sense. I need to look at it in more
detail; I'll try to do that in the next day or two.
I am wondering what testing you have done on this?
Github user tgravescs commented on the issue:
https://github.com/apache/spark/pull/16705
@devaraj-kavali thanks, it looks like this is already fixed in Spark 2.2
with SPARK-20217, please close.
Github user tgravescs commented on the issue:
https://github.com/apache/spark/pull/16705
Jenkins, test this please
Github user tgravescs commented on the issue:
https://github.com/apache/spark/pull/17658
looks good, I'll merge. thanks @redsanket
Github user tgravescs commented on the issue:
https://github.com/apache/spark/pull/17854
Also, what is the exact error/stack trace you see when you say "failed to
connect"?
Github user tgravescs commented on the issue:
https://github.com/apache/spark/pull/17854
what is your network timeout (spark.network.timeout) set to?
Github user tgravescs commented on the issue:
https://github.com/apache/spark/pull/17854
> If that's what you mean, there's no need for retrying. No RPC calls retry
anymore. See #16503 (comment) for an explanation.
I see, I guess with the way we have the RPC i
Github user tgravescs commented on the issue:
https://github.com/apache/spark/pull/17854
I took a quick look at the registerExecutor call in
CoarseGrainedExecutorBackend and it's not retrying at all. We should change
that to retry. We retry heartbeats and many other things, so it
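A generic retry wrapper of the kind that could apply here; illustrative only,
not Spark's actual RPC plumbing:

    import scala.annotation.tailrec
    import scala.util.{Failure, Success, Try}

    @tailrec
    def withRetries[T](attempts: Int, delayMs: Long)(op: => T): T =
      Try(op) match {
        case Success(v) => v
        case Failure(_) if attempts > 1 =>
          Thread.sleep(delayMs)
          withRetries(attempts - 1, delayMs)(op)
        case Failure(e) => throw e
      }

    // e.g. withRetries(attempts = 3, delayMs = 1000) { registerWithDriver() }
    // where registerWithDriver() is a hypothetical stand-in for the ask.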
Github user tgravescs commented on the issue:
https://github.com/apache/spark/pull/17854
To slow down launching you could just set
spark.yarn.containerLauncherMaxThreads to be smaller. That isn't guaranteed,
but neither is this really. Just an alternative or something you c
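That knob is set at submit time like any other conf; the value below is
illustrative (the default is 25):

    import org.apache.spark.SparkConf

    val conf = new SparkConf()
      // Shrink the YARN AM's container-launch thread pool so executors
      // start (and register) more slowly.
      .set("spark.yarn.containerLauncherMaxThreads", "10")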
Github user tgravescs commented on the issue:
https://github.com/apache/spark/pull/15009
we should update the tags to 2.3.
Github user tgravescs commented on a diff in the pull request:
https://github.com/apache/spark/pull/17658#discussion_r114997435
--- Diff:
core/src/main/scala/org/apache/spark/scheduler/EventLoggingListener.scala ---
@@ -283,10 +283,15 @@ private[spark] object EventLoggingListener
Github user tgravescs commented on the issue:
https://github.com/apache/spark/pull/17658
@srowen @vanzin do either of you know where the Jenkins stuff is
configured? Wondering why this isn't working for me.
Github user tgravescs commented on the issue:
https://github.com/apache/spark/pull/17658
retest this please
Github user tgravescs commented on the issue:
https://github.com/apache/spark/pull/17658
there are also a few formatting things that look like they were just line
wraps and extra newlines.
Github user tgravescs commented on a diff in the pull request:
https://github.com/apache/spark/pull/17658#discussion_r114647868
--- Diff: core/src/main/scala/org/apache/spark/ui/SparkUI.scala ---
@@ -60,6 +60,8 @@ private[spark] class SparkUI private (
var appId
Github user tgravescs commented on the issue:
https://github.com/apache/spark/pull/17744
Yes, conceptually it could be removed, but as you say it is a bigger change.
Are you still seeing memory issues after this change?
Github user tgravescs commented on a diff in the pull request:
https://github.com/apache/spark/pull/17658#discussion_r114363985
--- Diff:
core/src/main/scala/org/apache/spark/deploy/history/FsHistoryProvider.scala ---
@@ -742,6 +743,7 @@ private[history] object FsHistoryProvider
Github user tgravescs commented on the issue:
https://github.com/apache/spark/pull/17658
retest this please
Github user tgravescs commented on the issue:
https://github.com/apache/spark/pull/17658
test this please
Github user tgravescs commented on the issue:
https://github.com/apache/spark/pull/17658
Jenkins, okay to test
Github user tgravescs commented on the issue:
https://github.com/apache/spark/pull/17744
We see the same issue on some of our clusters. I was planning on doing two
things: something like this to reduce that memory usage, and then on the other
side you could change the shuffle fetcher
Github user tgravescs commented on a diff in the pull request:
https://github.com/apache/spark/pull/17744#discussion_r113697866
--- Diff:
common/network-shuffle/src/main/java/org/apache/spark/network/shuffle/ExternalShuffleBlockHandler.java
---
@@ -93,14 +92,25 @@ protected void
Github user tgravescs commented on the issue:
https://github.com/apache/spark/pull/17658
test this please
GitHub user tgravescs reopened a pull request:
https://github.com/apache/spark/pull/17748
[SPARK-19812] YARN shuffle service fails to relocate recovery DB acro…
…ss NFS directories
## What changes were proposed in this pull request?
Change from using java
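The description is cut off, but going by the title, the issue is that a plain
java.io.File rename fails across filesystem (NFS) boundaries; a sketch of the
java.nio-based pattern, with names of my own choosing:

    import java.io.File
    import java.nio.file.{Files, StandardCopyOption}

    def relocateRecoveryDb(src: File, dst: File): Unit = {
      // File.renameTo just returns false across mount points, whereas
      // Files.move falls back to copy-and-delete when it cannot rename.
      Files.move(src.toPath, dst.toPath, StandardCopyOption.REPLACE_EXISTING)
    }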
Github user tgravescs closed the pull request at:
https://github.com/apache/spark/pull/17748
Github user tgravescs commented on the issue:
https://github.com/apache/spark/pull/17748
Thanks for the review @mridulm, merged to master and branch-2.2
Github user tgravescs commented on the issue:
https://github.com/apache/spark/pull/17658
Jenkins, test this please
Github user tgravescs commented on a diff in the pull request:
https://github.com/apache/spark/pull/17744#discussion_r113323069
--- Diff:
common/network-shuffle/src/main/java/org/apache/spark/network/shuffle/ExternalShuffleBlockHandler.java
---
@@ -93,14 +92,25 @@ protected void
Github user tgravescs commented on a diff in the pull request:
https://github.com/apache/spark/pull/17658#discussion_r113196872
--- Diff: core/src/main/scala/org/apache/spark/ui/SparkUI.scala ---
@@ -139,6 +140,8 @@ private[spark] abstract class SparkUITab(parent:
SparkUI, prefix
Github user tgravescs commented on the issue:
https://github.com/apache/spark/pull/17658
Jenkins, test this please
Github user tgravescs commented on a diff in the pull request:
https://github.com/apache/spark/pull/17748#discussion_r113192444
--- Diff:
common/network-yarn/src/main/java/org/apache/spark/network/yarn/YarnShuffleService.java
---
@@ -363,25 +362,29 @@ protected File
GitHub user tgravescs opened a pull request:
https://github.com/apache/spark/pull/17748
[SPARK-19812] YARN shuffle service fails to relocate recovery DB acro…
…ss NFS directories
## What changes were proposed in this pull request?
Change from using java
Github user tgravescs commented on the issue:
https://github.com/apache/spark/pull/15009
@kishorvpatil please fix the documentation