[GitHub] spark pull request: SPARK-2634: Change MapOutputTrackerWorker.mapS...
Github user JoshRosen commented on the pull request: https://github.com/apache/spark/pull/1541#issuecomment-5603 This seems fine to me, actually, so I'm going to re-run the tests then merge it. We might want to do some other cleanup in this file, but that can wait for a separate PR. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: SPARK-2634: Change MapOutputTrackerWorker.mapS...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/1541#issuecomment-56889065 [QA tests have started](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/155/consoleFull) for PR 1541 at commit [`d450053`](https://github.com/apache/spark/commit/d450053165141c160e8b9a678218641a268c4353). * This patch merges cleanly. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: SPARK-2634: Change MapOutputTrackerWorker.mapS...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/1541#issuecomment-56895776 [QA tests have finished](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/155/consoleFull) for PR 1541 at commit [`d450053`](https://github.com/apache/spark/commit/d450053165141c160e8b9a678218641a268c4353). * This patch **passes** unit tests. * This patch merges cleanly. * This patch adds no public classes. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: SPARK-2634: Change MapOutputTrackerWorker.mapS...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/1541 --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: SPARK-2634: Change MapOutputTrackerWorker.mapS...
Github user zsxwing commented on the pull request: https://github.com/apache/spark/pull/1541#issuecomment-56909022 Thank you @JoshRosen --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: SPARK-2634: Change MapOutputTrackerWorker.mapS...
Github user zsxwing commented on the pull request: https://github.com/apache/spark/pull/1541#issuecomment-55985572 @JoshRosen do you think it's OK? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: SPARK-2634: Change MapOutputTrackerWorker.mapS...
Github user zsxwing commented on the pull request: https://github.com/apache/spark/pull/1541#issuecomment-54067191 ping @JoshRosen, could you help take a look at this one? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: SPARK-2634: Change MapOutputTrackerWorker.mapS...
Github user JoshRosen commented on the pull request: https://github.com/apache/spark/pull/1541#issuecomment-54098544 Thanks for the reminder. @kayousterhout I looked over @zsxwing's example and I agree that there's a thread-safety issue here.We can definitely have multiple concurrent block fetches that could race when accessing mapStatuses. There's a lot of other state in MapOutputTracker that's guarded with `synchronized`, which implies that this instances of MapOutputTracker will be accessed from multiple threads. In fact, there's even a `statuses.synchronized` at the end of `getServerStatuses` that's guarding a `MapOutputTracker.convertMapStatuses` call, but for some reason the other branch of the `if` guards it using `fetchedStatuses.synchronized` (which doesn't even make sense, since `fetchedStatuses` is a local variable defined inside of `getServerStatuses`). Since the synchronization logic here seems kind of messy / confusing and mapStatuses is only accessed from MapOutputTracker, maybe it would be better to just add proper synchronization around reads/writes to mapStatuses rather than converting it to a ConcurrentHashMap. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: SPARK-2634: Change MapOutputTrackerWorker.mapS...
Github user JoshRosen commented on the pull request: https://github.com/apache/spark/pull/1541#issuecomment-54098674 Actually, it looks like the `fetchedStatuses` vs `statuses` synchronization is correct, since it's guarding against modification to that statuses array while reading it in `convertMapStatuses`. This needs a closer look, but I'm not sure whether we need this synchronization, since the output status for a particular map task should be immutable once written. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: SPARK-2634: Change MapOutputTrackerWorker.mapS...
Github user zsxwing commented on the pull request: https://github.com/apache/spark/pull/1541#issuecomment-54103100 the output status for a particular map task should be immutable once written. But `mapStatuses` is mutable. Therefore, some thread is putting an item to `mapStatuses` and the space of map is not enough to store this new item, map will expand its space, rehash the items and update its internal state (https://github.com/scala/scala/blob/2.11.x/src/library/scala/collection/mutable/HashTable.scala#L249). If another thread is calling `get` at the same time, it may get a wrong value, or crash the thread. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: SPARK-2634: Change MapOutputTrackerWorker.mapS...
Github user JoshRosen commented on the pull request: https://github.com/apache/spark/pull/1541#issuecomment-54103317 I agree that `mapStatuses` itself is mutable. I was just observing that the values stored in `mapStatuses` (the `Array[MapStatus]`es) aren't modified after they're stored in the HashMap. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: SPARK-2634: Change MapOutputTrackerWorker.mapS...
Github user zsxwing commented on the pull request: https://github.com/apache/spark/pull/1541#issuecomment-54103752 You remind me that even if `Array[MapStatus]` won't be modified, according to java memory model, fetchedStatuses (https://github.com/zsxwing/spark/blob/SPARK-2634/core/src/main/scala/org/apache/spark/MapOutputTracker.scala#L168) and statuses (https://github.com/zsxwing/spark/blob/SPARK-2634/core/src/main/scala/org/apache/spark/MapOutputTracker.scala#L186) may be inconsistent without proper protection. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: SPARK-2634: Change MapOutputTrackerWorker.mapS...
Github user zsxwing commented on the pull request: https://github.com/apache/spark/pull/1541#issuecomment-54104158 ConcurrentHashMap should fix all these issues I found. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: SPARK-2634: Change MapOutputTrackerWorker.mapS...
Github user kayousterhout commented on the pull request: https://github.com/apache/spark/pull/1541#issuecomment-50073250 When is this accessed concurrently? I looked quickly and can only find updates from the (single-threaded) DAGScheduler event loop. Is the issue that it can be read/written concurrently? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] spark pull request: SPARK-2634: Change MapOutputTrackerWorker.mapS...
Github user zsxwing commented on the pull request: https://github.com/apache/spark/pull/1541#issuecomment-50097007 When is this accessed concurrently? For example, HashShuffleReader.read - object BlockStoreShuffleFetcher.fetch - MapOutputTracker.getServerStatuses. Different `HashShuffleReader` instances can be used in different `Task`s. All `TaskRunner`s share the same `env` of `Executor`. Therefore, all `Task`s uses the same `MapOutputTracker` instance in the `SparkEnv`. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] spark pull request: SPARK-2634: Change MapOutputTrackerWorker.mapS...
Github user mridulm commented on the pull request: https://github.com/apache/spark/pull/1541#issuecomment-49855865 Instead of a ConcurrentHashMap, we should actually move it to a disk backed Map - the cleanup of this datastructure is painful - which it can become extremely large; particularly for iterative algo's. Fortunately, most cases, we just need the last few entries - and so LRU scheme by most disk backed map's work beautifully. We have been using mapdb for this in MapOutputTrackerWorker - and it has worked beautifully. @rxin might be particularly interested since he is looking into reduce memory footprint of spark CC @mateiz - this is what I had mentioned about earlier. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] spark pull request: SPARK-2634: Change MapOutputTrackerWorker.mapS...
Github user zsxwing commented on the pull request: https://github.com/apache/spark/pull/1541#issuecomment-49865627 Instead of a ConcurrentHashMap, we should actually move it to a disk backed Map Agree. Is there a PR ready? I think this is a critical bug and hope it can be fixed soon. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] spark pull request: SPARK-2634: Change MapOutputTrackerWorker.mapS...
GitHub user zsxwing opened a pull request: https://github.com/apache/spark/pull/1541 SPARK-2634: Change MapOutputTrackerWorker.mapStatuses to ConcurrentHashMap MapOutputTrackerWorker.mapStatuses is used concurrently, it should be thread-safe. This bug has already been fixed in #1328. Nevertheless, considering #1328 won't be merged soon, I send this trivial fix and hope this issue can be solved soon. You can merge this pull request into a Git repository by running: $ git pull https://github.com/zsxwing/spark SPARK-2634 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/1541.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #1541 commit e8a912678d22678ba4d5f2f75eeec5109ead28b7 Author: zsxwing zsxw...@gmail.com Date: 2014-07-23T02:38:03Z SPARK-2634: Change MapOutputTrackerWorker.mapStatuses to ConcurrentHashMap --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] spark pull request: SPARK-2634: Change MapOutputTrackerWorker.mapS...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/1541#issuecomment-49829141 QA tests have started for PR 1541. This patch merges cleanly. brView progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17011/consoleFull --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] spark pull request: SPARK-2634: Change MapOutputTrackerWorker.mapS...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/1541#issuecomment-49829180 QA results for PR 1541:br- This patch FAILED unit tests.br- This patch merges cleanlybr- This patch adds no public classesbrbrFor more information see test ouptut:brhttps://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17011/consoleFull --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] spark pull request: SPARK-2634: Change MapOutputTrackerWorker.mapS...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/1541#issuecomment-49829579 QA tests have started for PR 1541. This patch merges cleanly. brView progress: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17012/consoleFull --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] spark pull request: SPARK-2634: Change MapOutputTrackerWorker.mapS...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/1541#issuecomment-49834008 QA results for PR 1541:br- This patch PASSES unit tests.br- This patch merges cleanlybr- This patch adds no public classesbrbrFor more information see test ouptut:brhttps://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17012/consoleFull --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---