[GitHub] spark pull request: SPARK-2634: Change MapOutputTrackerWorker.mapS...

2014-09-25 Thread JoshRosen
Github user JoshRosen commented on the pull request:

https://github.com/apache/spark/pull/1541#issuecomment-5603
  
This seems fine to me, actually, so I'm going to re-run the tests then 
merge it.  We might want to do some other cleanup in this file, but that can 
wait for a separate PR.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: SPARK-2634: Change MapOutputTrackerWorker.mapS...

2014-09-25 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/1541#issuecomment-56889065
  
  [QA tests have 
started](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/155/consoleFull)
 for   PR 1541 at commit 
[`d450053`](https://github.com/apache/spark/commit/d450053165141c160e8b9a678218641a268c4353).
 * This patch merges cleanly.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: SPARK-2634: Change MapOutputTrackerWorker.mapS...

2014-09-25 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/1541#issuecomment-56895776
  
  [QA tests have 
finished](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/155/consoleFull)
 for   PR 1541 at commit 
[`d450053`](https://github.com/apache/spark/commit/d450053165141c160e8b9a678218641a268c4353).
 * This patch **passes** unit tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: SPARK-2634: Change MapOutputTrackerWorker.mapS...

2014-09-25 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/1541


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: SPARK-2634: Change MapOutputTrackerWorker.mapS...

2014-09-25 Thread zsxwing
Github user zsxwing commented on the pull request:

https://github.com/apache/spark/pull/1541#issuecomment-56909022
  
Thank you @JoshRosen 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: SPARK-2634: Change MapOutputTrackerWorker.mapS...

2014-09-17 Thread zsxwing
Github user zsxwing commented on the pull request:

https://github.com/apache/spark/pull/1541#issuecomment-55985572
  
@JoshRosen do you think it's OK?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: SPARK-2634: Change MapOutputTrackerWorker.mapS...

2014-09-01 Thread zsxwing
Github user zsxwing commented on the pull request:

https://github.com/apache/spark/pull/1541#issuecomment-54067191
  
ping @JoshRosen, could you help take a look at this one?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: SPARK-2634: Change MapOutputTrackerWorker.mapS...

2014-09-01 Thread JoshRosen
Github user JoshRosen commented on the pull request:

https://github.com/apache/spark/pull/1541#issuecomment-54098544
  
Thanks for the reminder.

@kayousterhout I looked over @zsxwing's example and I agree that there's a 
thread-safety issue here.We can definitely have multiple concurrent block 
fetches that could race when accessing mapStatuses.

There's a lot of other state in MapOutputTracker that's guarded with 
`synchronized`, which implies that this instances of MapOutputTracker will be 
accessed from multiple threads.  In fact, there's even a 
`statuses.synchronized` at the end of `getServerStatuses` that's guarding a 
`MapOutputTracker.convertMapStatuses` call, but for some reason the other 
branch of the `if` guards it using `fetchedStatuses.synchronized` (which 
doesn't even make sense, since `fetchedStatuses` is a local variable defined 
inside of `getServerStatuses`).

Since the synchronization logic here seems kind of messy / confusing and 
mapStatuses is only accessed from MapOutputTracker, maybe it would be better to 
just add proper synchronization around reads/writes to mapStatuses rather than 
converting it to a ConcurrentHashMap.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: SPARK-2634: Change MapOutputTrackerWorker.mapS...

2014-09-01 Thread JoshRosen
Github user JoshRosen commented on the pull request:

https://github.com/apache/spark/pull/1541#issuecomment-54098674
  
Actually, it looks like the `fetchedStatuses` vs `statuses` synchronization 
is correct, since it's guarding against modification to that statuses array 
while reading it in `convertMapStatuses`.  This needs a closer look, but I'm 
not sure whether we need this synchronization, since the output status for a 
particular map task should be immutable once written.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: SPARK-2634: Change MapOutputTrackerWorker.mapS...

2014-09-01 Thread zsxwing
Github user zsxwing commented on the pull request:

https://github.com/apache/spark/pull/1541#issuecomment-54103100
  
 the output status for a particular map task should be immutable once 
written.

But `mapStatuses` is mutable. Therefore, some thread is putting an item to 
`mapStatuses` and the space of map is not enough to store this new item, map 
will expand its space, rehash the items and update its internal state 
(https://github.com/scala/scala/blob/2.11.x/src/library/scala/collection/mutable/HashTable.scala#L249).
 If another thread is calling `get` at the same time, it may get a wrong value, 
or crash the thread.




---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: SPARK-2634: Change MapOutputTrackerWorker.mapS...

2014-09-01 Thread JoshRosen
Github user JoshRosen commented on the pull request:

https://github.com/apache/spark/pull/1541#issuecomment-54103317
  
I agree that `mapStatuses` itself is mutable.  I was just observing that 
the values stored in `mapStatuses` (the `Array[MapStatus]`es) aren't modified 
after they're stored in the HashMap.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: SPARK-2634: Change MapOutputTrackerWorker.mapS...

2014-09-01 Thread zsxwing
Github user zsxwing commented on the pull request:

https://github.com/apache/spark/pull/1541#issuecomment-54103752
  
You remind me that even if `Array[MapStatus]` won't be modified, according 
to java memory model, fetchedStatuses 
(https://github.com/zsxwing/spark/blob/SPARK-2634/core/src/main/scala/org/apache/spark/MapOutputTracker.scala#L168)
 and statuses 
(https://github.com/zsxwing/spark/blob/SPARK-2634/core/src/main/scala/org/apache/spark/MapOutputTracker.scala#L186)
 may be inconsistent without proper protection.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: SPARK-2634: Change MapOutputTrackerWorker.mapS...

2014-09-01 Thread zsxwing
Github user zsxwing commented on the pull request:

https://github.com/apache/spark/pull/1541#issuecomment-54104158
  
ConcurrentHashMap should fix all these issues I found.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: SPARK-2634: Change MapOutputTrackerWorker.mapS...

2014-07-24 Thread kayousterhout
Github user kayousterhout commented on the pull request:

https://github.com/apache/spark/pull/1541#issuecomment-50073250
  
When is this accessed concurrently?  I looked quickly and can only find 
updates from the (single-threaded) DAGScheduler event loop.  Is the issue that 
it can be read/written concurrently?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: SPARK-2634: Change MapOutputTrackerWorker.mapS...

2014-07-24 Thread zsxwing
Github user zsxwing commented on the pull request:

https://github.com/apache/spark/pull/1541#issuecomment-50097007
  
 When is this accessed concurrently?

For example, HashShuffleReader.read - object 
BlockStoreShuffleFetcher.fetch - MapOutputTracker.getServerStatuses. Different 
`HashShuffleReader` instances can be used in different `Task`s. All 
`TaskRunner`s share the same `env` of `Executor`. Therefore, all `Task`s uses 
the same `MapOutputTracker` instance in the `SparkEnv`.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: SPARK-2634: Change MapOutputTrackerWorker.mapS...

2014-07-23 Thread mridulm
Github user mridulm commented on the pull request:

https://github.com/apache/spark/pull/1541#issuecomment-49855865
  
Instead of a ConcurrentHashMap, we should actually move it to a disk backed 
Map - the cleanup of this datastructure is painful - which it can become 
extremely large; particularly for iterative algo's.
Fortunately, most cases, we just need the last few entries - and so LRU 
scheme by most disk backed map's work beautifully.

We have been using mapdb for this in MapOutputTrackerWorker  - and it has 
worked beautifully.
@rxin might be particularly interested since he is looking into reduce 
memory footprint of spark
CC @mateiz - this is what I had mentioned about earlier.



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: SPARK-2634: Change MapOutputTrackerWorker.mapS...

2014-07-23 Thread zsxwing
Github user zsxwing commented on the pull request:

https://github.com/apache/spark/pull/1541#issuecomment-49865627
  
 Instead of a ConcurrentHashMap, we should actually move it to a disk 
backed Map

Agree. Is there a PR ready? I think this is a critical bug and hope it can 
be fixed soon.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: SPARK-2634: Change MapOutputTrackerWorker.mapS...

2014-07-22 Thread zsxwing
GitHub user zsxwing opened a pull request:

https://github.com/apache/spark/pull/1541

SPARK-2634: Change MapOutputTrackerWorker.mapStatuses to ConcurrentHashMap

MapOutputTrackerWorker.mapStatuses is used concurrently, it should be 
thread-safe. This bug has already been fixed in #1328. Nevertheless, 
considering #1328 won't be merged soon, I send this trivial fix and hope this 
issue can be solved soon.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/zsxwing/spark SPARK-2634

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/1541.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #1541


commit e8a912678d22678ba4d5f2f75eeec5109ead28b7
Author: zsxwing zsxw...@gmail.com
Date:   2014-07-23T02:38:03Z

SPARK-2634: Change MapOutputTrackerWorker.mapStatuses to ConcurrentHashMap




---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: SPARK-2634: Change MapOutputTrackerWorker.mapS...

2014-07-22 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/1541#issuecomment-49829141
  
QA tests have started for PR 1541. This patch merges cleanly. brView 
progress: 
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17011/consoleFull


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: SPARK-2634: Change MapOutputTrackerWorker.mapS...

2014-07-22 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/1541#issuecomment-49829180
  
QA results for PR 1541:br- This patch FAILED unit tests.br- This patch 
merges cleanlybr- This patch adds no public classesbrbrFor more 
information see test 
ouptut:brhttps://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17011/consoleFull


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: SPARK-2634: Change MapOutputTrackerWorker.mapS...

2014-07-22 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/1541#issuecomment-49829579
  
QA tests have started for PR 1541. This patch merges cleanly. brView 
progress: 
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17012/consoleFull


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] spark pull request: SPARK-2634: Change MapOutputTrackerWorker.mapS...

2014-07-22 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/1541#issuecomment-49834008
  
QA results for PR 1541:br- This patch PASSES unit tests.br- This patch 
merges cleanlybr- This patch adds no public classesbrbrFor more 
information see test 
ouptut:brhttps://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/17012/consoleFull


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---