[jira] [Commented] (SPARK-1529) Support setting spark.local.dirs to a hadoop FileSystem
[ https://issues.apache.org/jira/browse/SPARK-1529?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14544563#comment-14544563 ]

Kannan Rajah commented on SPARK-1529:
-------------------------------------

Just wanted to check if folks got a chance to review the changes. If you have any concerns, I will be happy to address them.

> Support setting spark.local.dirs to a hadoop FileSystem
> --------------------------------------------------------
>
>                 Key: SPARK-1529
>                 URL: https://issues.apache.org/jira/browse/SPARK-1529
>             Project: Spark
>          Issue Type: Improvement
>          Components: Spark Core
>            Reporter: Patrick Wendell
>            Assignee: Kannan Rajah
>         Attachments: Spark Shuffle using HDFS.pdf
>
> In some environments, like with MapR, local volumes are accessed through the Hadoop filesystem interface. We should allow setting spark.local.dir to a Hadoop filesystem location.

--
This message was sent by Atlassian JIRA (v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-1529) Support setting spark.local.dirs to a hadoop FileSystem
[ https://issues.apache.org/jira/browse/SPARK-1529?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14493251#comment-14493251 ]

Kannan Rajah commented on SPARK-1529:
-------------------------------------

You can use the Compare functionality to see a single page of diffs across commits. Here is the link:
https://github.com/rkannan82/spark/compare/4aaf48d46d13129f0f9bdafd771dd80fe568a7dc...rkannan82:7195353a31f7cfb087ec804b597b01fb362bc3f6

A few clarifications:

1. There are two reasons for introducing a FileSystem abstraction in Spark instead of directly using Hadoop's FileSystem:
   - There are Spark-shuffle-specific APIs that needed abstraction. Please take a look at this code:
     https://github.com/rkannan82/spark/blob/dfs_shuffle/core/src/main/scala/org/apache/spark/storage/FileSystem.scala
   - For local file system access, we can choose to bypass Hadoop's local file system implementation if it is not efficient. If you look at LocalFileSystem.scala, for most APIs it just delegates to the old code that uses Spark's disk block manager, etc. In fact, we can look at this single class and determine whether we will hit any performance degradation on the default Apache shuffle code path:
     https://github.com/rkannan82/spark/blob/dfs_shuffle/core/src/main/scala/org/apache/spark/storage/LocalFileSystem.scala

2. During the write phase, we shuffle to HDFS instead of the local file system. While reading back, we don't use the Netty-based transport that the Apache shuffle uses. Instead, we have a new implementation called DFSShuffleClient that reads from HDFS. That is the main difference.
   https://github.com/rkannan82/spark/blob/dfs_shuffle/network/shuffle/src/main/java/org/apache/spark/network/shuffle/DFSShuffleClient.java
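The read-path change described here, bypassing the Netty shuffle server and opening shuffle output straight from the distributed file system, can be sketched roughly as follows. The class name and path layout are illustrative rather than taken from DFSShuffleClient, and a plain local directory stands in for the DFS mount.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical sketch: fetch a shuffle block by opening a file on a shared
// filesystem instead of requesting it from a remote shuffle server.
class DfsShuffleBlockReader {
    private final Path shuffleRoot; // a DFS mount in practice; any directory here

    DfsShuffleBlockReader(Path shuffleRoot) {
        this.shuffleRoot = shuffleRoot;
    }

    // Illustrative layout: <root>/shuffle_<shuffleId>_<mapId>.data
    InputStream openBlock(int shuffleId, int mapId) throws IOException {
        Path block = shuffleRoot.resolve("shuffle_" + shuffleId + "_" + mapId + ".data");
        return Files.newInputStream(block); // no network transfer involved
    }
}
```

Because every executor can resolve the same path, the reducer needs no block-transfer service at all; correctness then rests on the DFS providing the visibility guarantees the Netty path gets for free.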
[jira] [Commented] (SPARK-1529) Support setting spark.local.dirs to a hadoop FileSystem
[ https://issues.apache.org/jira/browse/SPARK-1529?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14492215#comment-14492215 ]

Sean Owen commented on SPARK-1529:
----------------------------------

(Sorry if this double-posts.) Is there a good way to see the whole diff at the moment? I know there's a branch with individual commits. Maybe I am missing something basic.

This puts a new abstraction on top of a Hadoop FileSystem, on top of the underlying file system abstraction. That's getting heavy. If it's only abstracting access to an InputStream / OutputStream, why is it needed? That's already directly available from, say, Hadoop's FileSystem. What would be the performance gain if this is the bit being swapped out?

This is my original question: you shuffle to HDFS, then read it back to send it again via the existing shuffle? It kind of made sense when the idea was to swap out the whole shuffle to replace its transport.
[jira] [Commented] (SPARK-1529) Support setting spark.local.dirs to a hadoop FileSystem
[ https://issues.apache.org/jira/browse/SPARK-1529?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14491834#comment-14491834 ]

Kannan Rajah commented on SPARK-1529:
-------------------------------------

Thanks. FYI, I have pushed a few more commits to my repo to handle all the TODOs and bug fixes, so you can track this branch for all the changes:
https://github.com/rkannan82/spark/commits/dfs_shuffle
[jira] [Commented] (SPARK-1529) Support setting spark.local.dirs to a hadoop FileSystem
[ https://issues.apache.org/jira/browse/SPARK-1529?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14491863#comment-14491863 ]

Patrick Wendell commented on SPARK-1529:
----------------------------------------

Hey Kannan,

We originally considered doing something like you are proposing, where we would change all of our filesystem interactions to use a Hadoop FileSystem class and then use Hadoop's LocalFileSystem. However, there were two issues:

1. We use POSIX APIs that are not present in Hadoop. For instance, we use memory mapping in various places, FileChannel in the BlockObjectWriter, etc.
2. Using LocalFileSystem has substantial performance overhead compared with our current code, so we'd have to write our own implementation of a local filesystem.

For these reasons, we decided that our current shuffle machinery was fundamentally not usable for non-POSIX environments. Instead, we let people customize shuffle behavior at a higher level, and we implemented the pluggable shuffle components. So you can create a shuffle manager that is specifically optimized for a particular environment (e.g. MapR). Did you consider implementing a MapR shuffle using that mechanism instead? You'd have to operate at a higher level, where you reason about shuffle records, etc., but you'd have a lot of flexibility to optimize within that.
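The pluggable-shuffle idea, selecting a whole shuffle implementation per environment rather than swapping the filesystem underneath the default one, can be sketched like this. The interface and the config-style selection below are illustrative only; Spark's real ShuffleManager API deals in shuffle handles, writers, and readers over records.

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative seam: the whole shuffle strategy is chosen by name, in the
// spirit of a spark.shuffle.manager-style configuration key.
interface SimpleShuffleManager {
    void writePartition(int mapId, byte[] records);
    byte[] readPartition(int mapId);
}

// A toy implementation; a MapR-optimized manager could instead write
// partitions through its own volume API and read them back from the DFS.
class InMemoryShuffleManager implements SimpleShuffleManager {
    private final Map<Integer, byte[]> store = new HashMap<>();
    public void writePartition(int mapId, byte[] records) { store.put(mapId, records); }
    public byte[] readPartition(int mapId) { return store.get(mapId); }
}

class ShuffleManagers {
    // Resolve an implementation from a (hypothetical) config value.
    static SimpleShuffleManager forConfig(String name) {
        if ("memory".equals(name)) return new InMemoryShuffleManager();
        throw new IllegalArgumentException("unknown shuffle manager: " + name);
    }
}
```

The point of the seam is that the environment-specific manager owns both the write and the read path, so no lower-level filesystem abstraction needs to leak into the rest of Spark.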
[jira] [Commented] (SPARK-1529) Support setting spark.local.dirs to a hadoop FileSystem
[ https://issues.apache.org/jira/browse/SPARK-1529?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14491869#comment-14491869 ]

Kannan Rajah commented on SPARK-1529:
-------------------------------------

[~pwendell] The default code path still uses the FileChannel and memory-mapping techniques. I just provided an abstraction called FileSystem.scala (not Hadoop's FileSystem.java). LocalFileSystem.scala delegates the call to the existing Spark code path that uses FileChannel. I am using Hadoop's RawLocalFileSystem class just to get an InputStream and OutputStream, and this internally also uses FileChannel. Please see RawLocalFileSystem.LocalFSFileInputStream; it is just a wrapper around java.io.FileInputStream.

Going back to why I considered this approach: it allows us to reuse all the logic currently used by the SortShuffle code path. Otherwise, we would have to reimplement pretty much everything Spark already does in order to run the shuffle on HDFS. We are in the process of running some performance tests to understand the impact of the change. One of the main things we will verify is whether the change introduces any performance degradation in the default code path, and we will fix it if there is any. Is this acceptable?
[jira] [Commented] (SPARK-1529) Support setting spark.local.dirs to a hadoop FileSystem
[ https://issues.apache.org/jira/browse/SPARK-1529?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14491004#comment-14491004 ]

Cheng Lian commented on SPARK-1529:
-----------------------------------

[~kannan] I haven't been working on Spark Core for a while, but I'll take a look at this. Thanks for working on this! Also cc [~pwendell] [~andrewor14].
[jira] [Commented] (SPARK-1529) Support setting spark.local.dirs to a hadoop FileSystem
[ https://issues.apache.org/jira/browse/SPARK-1529?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14484022#comment-14484022 ]

Kannan Rajah commented on SPARK-1529:
-------------------------------------

[~liancheng] Will you be able to take a look at the code and provide some feedback?
[jira] [Commented] (SPARK-1529) Support setting spark.local.dirs to a hadoop FileSystem
[ https://issues.apache.org/jira/browse/SPARK-1529?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14385648#comment-14385648 ]

Kannan Rajah commented on SPARK-1529:
-------------------------------------

I have pushed the first round of commits to my repo. I would like to get some early feedback on the overall design.
https://github.com/rkannan82/spark/commits/dfs_shuffle

Commits:
- https://github.com/rkannan82/spark/commit/ce8b430512b31e932ffdab6e0a2c1a6a1768ffbf
- https://github.com/rkannan82/spark/commit/8f5415c248c0a9ca5ad3ec9f48f839b24c259813
- https://github.com/rkannan82/spark/commit/d9d179ba6c685cc8eb181f442e9bd6ad91cc4290
[jira] [Commented] (SPARK-1529) Support setting spark.local.dirs to a hadoop FileSystem
[ https://issues.apache.org/jira/browse/SPARK-1529?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14361978#comment-14361978 ]

Kannan Rajah commented on SPARK-1529:
-------------------------------------

Can someone assign this bug to me? I am working on a patch.
[jira] [Commented] (SPARK-1529) Support setting spark.local.dirs to a hadoop FileSystem
[ https://issues.apache.org/jira/browse/SPARK-1529?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14288341#comment-14288341 ]

Kannan Rajah commented on SPARK-1529:
-------------------------------------

[~pwendell] I would like us to consider reusing the write code path of the existing shuffle implementation instead of implementing it from scratch. This will let us take advantage of all the optimizations that have already been done, and that will be done in the future. Only the read code path needs to be fully reimplemented, since we don't need all the shuffle server logic. A handful of shuffle classes need to use the HDFS abstractions in order to achieve this. I have attached a high-level proposal. Let me know your thoughts.

Write: IndexShuffleBlockManager, SortShuffleWriter, ExternalSorter, BlockObjectWriter
Read: BlockStoreShuffleFetcher, HashShuffleReader
[jira] [Commented] (SPARK-1529) Support setting spark.local.dirs to a hadoop FileSystem
[ https://issues.apache.org/jira/browse/SPARK-1529?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14267419#comment-14267419 ]

Patrick Wendell commented on SPARK-1529:
----------------------------------------

Hey Sean,

From what I remember of this, the issue is that MapR clusters are not typically provisioned with much local disk space available, because MapRFS supports accessing local volumes in its API, unlike the HDFS API. So in general the expectation is that large amounts of local data should be written through MapR's API to its local filesystem. They have an NFS mount you can use as a workaround to provide POSIX APIs, and I think most MapR users set this mount up and then have Spark write shuffle data there.

Option 2, which [~rkannan82] mentions, is not actually feasible in Spark right now. We don't support writing shuffle data through the Hadoop APIs, and I think Cheng's patch was only a prototype of how we might do that.
[jira] [Commented] (SPARK-1529) Support setting spark.local.dirs to a hadoop FileSystem
[ https://issues.apache.org/jira/browse/SPARK-1529?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14267424#comment-14267424 ]

Patrick Wendell commented on SPARK-1529:
----------------------------------------

BTW, I think if MapR wants to have a customized shuffle, the direction proposed in this patch is probably not the best way to do it. It would make more sense to implement a DFS-based shuffle using the new pluggable shuffle API, i.e. a shuffle that communicates through the filesystem rather than doing transfers through Spark.
[jira] [Commented] (SPARK-1529) Support setting spark.local.dirs to a hadoop FileSystem
[ https://issues.apache.org/jira/browse/SPARK-1529?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14267430#comment-14267430 ]

Sean Owen commented on SPARK-1529:
----------------------------------

[~pwendell] Gotcha, that begins to make sense. I assume the cluster can be provisioned with as much local disk as desired, regardless of defaults. The alternative, writing temp files across the network and reading them back in order to then broadcast them back over the network, seems a lot worse than just setting up the right amount of local disk. But if it works well enough and is easier in some situations, it sounds like that's also an option.

I suppose I'm questioning why the project would want to encourage remote shuffle files by not just using the HDFS APIs but even maintaining a specialized version of them, just to make a third workaround for a vendor config issue. Surely MapR should just set up clusters that are provisioned more how Spark needs them.
[jira] [Commented] (SPARK-1529) Support setting spark.local.dirs to a hadoop FileSystem
[ https://issues.apache.org/jira/browse/SPARK-1529?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14264527#comment-14264527 ]

Sean Owen commented on SPARK-1529:
----------------------------------

OK, but why do these files *have* to go on a non-local disk? It sounds like you're saying Spark doesn't work at all on MapR right now, but that can't be the case. They *can* go on a non-local disk, I'm sure; what's the value of that, given that Spark is transporting the files itself?

Still, as you say, this proprietary setup already works through the java.io+NFS and HDFS APIs, with no change. If it's just not as fast, is that a problem that Spark should be solving? Just don't do it, or leave it to the vendor to optimize.
[jira] [Commented] (SPARK-1529) Support setting spark.local.dirs to a hadoop FileSystem
[ https://issues.apache.org/jira/browse/SPARK-1529?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14263823#comment-14263823 ]

Sean Owen commented on SPARK-1529:
----------------------------------

So Spark needs to read and write local files for things like shuffle. It uses java.io, and this continues to work everywhere. I am still missing why something has to go through HDFS or NFS here; these files should not be on NFS either.
[jira] [Commented] (SPARK-1529) Support setting spark.local.dirs to a hadoop FileSystem
[ https://issues.apache.org/jira/browse/SPARK-1529?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14263928#comment-14263928 ]

Kannan Rajah commented on SPARK-1529:
-------------------------------------

In a MapR distribution, these files need to go into a MapR volume, not the local disk. The MapR volume is local to the node, but it is part of the distributed file system. This can be achieved in two ways:

1. Expose the MapR file system as an NFS mount point. You can then use the normal java.io API, and data will still get written to the MapR volume instead of the local disk.
2. Use the HDFS API (with the underlying MapR implementation) instead of java.io, so that the IO goes to the MapR volume.
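The two options amount to two implementations behind one IO seam. Below is a minimal sketch under that assumption: the VolumeIO interface and class names are hypothetical, and only the java.io implementation (option 1, working against any mounted path) is shown. An option-2 implementation would wrap org.apache.hadoop.fs.FileSystem and is omitted to keep the sketch dependency-free.

```java
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;

// Hypothetical seam for block IO against a MapR volume.
interface VolumeIO {
    OutputStream create(String name) throws IOException;
    InputStream open(String name) throws IOException;
}

// Option 1: plain java.io against an NFS mount of the volume. The data still
// lands in the distributed file system because the mount points into it.
class MountedVolumeIO implements VolumeIO {
    private final File root; // e.g. the NFS mount point of the volume

    MountedVolumeIO(File root) { this.root = root; }

    public OutputStream create(String name) throws IOException {
        return new FileOutputStream(new File(root, name));
    }
    public InputStream open(String name) throws IOException {
        return new FileInputStream(new File(root, name));
    }
}
```

Code written against the interface cannot tell which transport is underneath, which is exactly the property the comment is arguing for.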
[jira] [Commented] (SPARK-1529) Support setting spark.local.dirs to a hadoop FileSystem
[ https://issues.apache.org/jira/browse/SPARK-1529?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14263516#comment-14263516 ]

Cheng Lian commented on SPARK-1529:
-----------------------------------

Hi [~srowen], first of all, we are not trying to put shuffle and temp files in HDFS. At the time this ticket was created, the initial motivation was to support MapR, because MapR only exposes the local file system via a MapR volume and the HDFS {{FileSystem}} interface. However, this issue was later worked around with NFS, and the ticket wasn't resolved for lack of capacity.

[~rkannan82] Thanks for looking into this! Several months ago, I implemented a prototype that simply replaced Java NIO file system operations with the corresponding HDFS {{FileSystem}} versions. According to a prior benchmark done with {{spark-perf}}, this introduces a ~15% performance penalty for shuffling. Thus we had planned to write a specialized {{FileSystem}} implementation that simply wraps normal Java NIO operations, to avoid the performance penalty as much as possible, and then replace all local file system access with this specialized {{FileSystem}} implementation.
[jira] [Commented] (SPARK-1529) Support setting spark.local.dirs to a hadoop FileSystem
[ https://issues.apache.org/jira/browse/SPARK-1529?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14263578#comment-14263578 ]

Sean Owen commented on SPARK-1529:
----------------------------------

Hm, how do these APIs preclude the direct use of java.io? Is this actually disabled in MapR? If there is a workaround, what is the remaining motivation?
[jira] [Commented] (SPARK-1529) Support setting spark.local.dirs to a hadoop FileSystem
[ https://issues.apache.org/jira/browse/SPARK-1529?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14263603#comment-14263603 ]

Kannan Rajah commented on SPARK-1529:
-------------------------------------

It cannot preclude the use of java.io completely: if there are java.io APIs that are needed for some use case, then you cannot use the HDFS API. But that is the case normally. The NFS-mount-based workaround is not as efficient as accessing the volume through the HDFS interface, hence the need.
[jira] [Commented] (SPARK-1529) Support setting spark.local.dirs to a hadoop FileSystem
[ https://issues.apache.org/jira/browse/SPARK-1529?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14263600#comment-14263600 ]

Kannan Rajah commented on SPARK-1529:
-------------------------------------

[~liancheng] Can you upload this prototype patch so that I can reuse it? What branch was it based off? When I start making new changes, I suppose I can do it against the master branch, right?
[jira] [Commented] (SPARK-1529) Support setting spark.local.dirs to a hadoop FileSystem
[ https://issues.apache.org/jira/browse/SPARK-1529?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14263287#comment-14263287 ]

Kannan Rajah commented on SPARK-1529:
-------------------------------------

[~liancheng] [~pwendell] I want to work on this JIRA. It's been a while since there has been any update, so can you please share the current status? Has there been a consensus on replacing the file API with an HDFS-like interface and plugging in the right implementation? I will be looking at the code base in the meantime.
[jira] [Commented] (SPARK-1529) Support setting spark.local.dirs to a hadoop FileSystem
[ https://issues.apache.org/jira/browse/SPARK-1529?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14259323#comment-14259323 ]

Sean Owen commented on SPARK-1529:
----------------------------------

This may be a dumb question on an old issue, but you can already access local filesystems through HDFS's FileSystem, and of course you can always access local filesystems directly. The FileSystem abstraction won't provide everything that java.nio does, because it is not a local filesystem. But why would you want to put shuffle and temp files in HDFS?

The MapR comment confuses me more, since its main trick is exposing HDFS as something more like an NFS mount point. If anything, that makes it already usable as-is for this purpose.
[jira] [Commented] (SPARK-1529) Support setting spark.local.dirs to a hadoop FileSystem
[ https://issues.apache.org/jira/browse/SPARK-1529?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13976494#comment-13976494 ] Aaron Davidson commented on SPARK-1529: --- Hey, looking through the code in a little more depth reveals that the rewind and truncate functionality of DiskBlockObjectWriter is actually used during shuffle file consolidation. The issue is that we store the metadata for each consolidated shuffle file as a consecutive set of offsets into the file (an offset table). That is, if we have 3 blocks stored in the same file, rather than storing 3 pairs of (offset, length), we simply store the offsets and use the fact that they're laid out consecutively to reconstruct the lengths. This means we can't tolerate holes in the data structure from partial writes, and thus we rely on the fact that partial writes (which are not included in the data structure right now) are always of size 0. I think getting around this is pretty straightforward, however: we can simply store the offsets of all partial writes in the offset table, and just avoid storing them in the index we build to look up the positions of particular map tasks in the offset table. This means we can reconstruct the lengths properly, and, most importantly, that we will not think our failed map tasks were successful (because index lookups for them will still fail, even though they're in the offset table). This seems like a pretty clean solution that wraps up our usage of FileChannels, save where we mmap files back into memory. We will likely want to special-case those blocks to make sure we can mmap them directly when reading from the local file system. 
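The length reconstruction described above can be sketched as follows. This is a minimal illustration of the idea, not Spark's actual code; the object and method names are invented:

```scala
// Hypothetical sketch: given the consecutive offset table of a consolidated
// shuffle file, each block's length is the distance to the next offset
// (the file length closes the table for the last block).
object OffsetTableSketch {
  def lengths(offsets: Array[Long], fileLength: Long): Array[Long] = {
    val bounds = offsets :+ fileLength
    // Pair each offset with its successor and take the difference.
    bounds.sliding(2).map { case Array(a, b) => b - a }.toArray
  }

  def main(args: Array[String]): Unit = {
    // Three blocks at offsets 0, 100, and 250 in a 400-byte file.
    println(lengths(Array(0L, 100L, 250L), 400L).mkString(","))  // 100,150,150
  }
}
```

This is why the table cannot tolerate holes: dropping an entry would silently merge a block's bytes into its neighbor's reconstructed length.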
Support setting spark.local.dirs to a hadoop FileSystem Key: SPARK-1529 URL: https://issues.apache.org/jira/browse/SPARK-1529 Project: Spark Issue Type: Bug Components: Spark Core Reporter: Patrick Wendell Assignee: Cheng Lian Fix For: 1.1.0 In some environments, like with MapR, local volumes are accessed through the Hadoop filesystem interface. We should allow setting spark.local.dir to a Hadoop filesystem location. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Commented] (SPARK-1529) Support setting spark.local.dirs to a hadoop FileSystem
[ https://issues.apache.org/jira/browse/SPARK-1529?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13975914#comment-13975914 ] Cheng Lian commented on SPARK-1529: --- After some investigation, I came to the conclusion that, unlike adding Tachyon support, to allow setting {{spark.local.dir}} to a Hadoop FS location, instead of adding something like {{HDFSBlockManager}} / {{HDFSStore}}, we have to refactor the related local FS access code to leverage HDFS interfaces. And it seems hard to make this change incremental. Besides writing shuffle map output, at least two places reference {{spark.local.dir}}: # HTTP broadcasting uses {{spark.local.dir}} as its resource root, and accesses the local FS with {{java.io.File}} # {{FileServerHandler}} accesses {{spark.local.dir}} via {{DiskBlockManager}} and reads local files with {{FileSegment}} and {{java.io.File}} Adding a new block manager / store for HDFS can't fix these places. I'm currently working on this issue by: # Refactoring {{FileSegment.file}} from {{java.io.File}} to {{org.apache.hadoop.fs.Path}}, # Refactoring {{DiskBlockManager}}, {{DiskStore}}, {{HttpBroadcast}}, and {{FileServerHandler}} to leverage HDFS interfaces. Please leave comments if I missed anything or there are simpler ways to work around this. (PS: We should definitely refactor the block manager related code to reduce duplicate code and encapsulate more details. Maybe the public interface of the block manager should only communicate with other components via block IDs and storage levels.)
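The gist of the {{FileSegment}} refactoring above is to stop baking {{java.io.File}} into the segment type, so a segment can later be opened through either a local or a Hadoop filesystem. A minimal sketch, assuming invented names ({{PathFileSegment}}, {{readLocal}}) that are not Spark's actual API:

```scala
// Hypothetical sketch: a FileSegment keyed by a path string rather than
// java.io.File. A local reader is shown; an HDFS-backed reader would open
// the same (path, offset, length) triple via org.apache.hadoop.fs.FileSystem.
final case class PathFileSegment(path: String, offset: Long, length: Long)

object PathFileSegment {
  // Read exactly `length` bytes starting at `offset` from a local file.
  def readLocal(seg: PathFileSegment): Array[Byte] = {
    val raf = new java.io.RandomAccessFile(seg.path, "r")
    try {
      raf.seek(seg.offset)
      val buf = new Array[Byte](seg.length.toInt)
      raf.readFully(buf)
      buf
    } finally raf.close()
  }
}
```

Callers such as {{FileServerHandler}} would then depend only on the segment's (path, offset, length), not on the local filesystem.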
[jira] [Commented] (SPARK-1529) Support setting spark.local.dirs to a hadoop FileSystem
[ https://issues.apache.org/jira/browse/SPARK-1529?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13975947#comment-13975947 ] Patrick Wendell commented on SPARK-1529: [~liancheng] Hey Cheng, the tricky thing here is we want to avoid _always_ going through the HDFS filesystem interface when people are actually using local files. We might need to add an intermediate abstraction to deal with this. We already do this elsewhere in the code base; for instance, the JobLogger will load an output stream either directly from a file or from a Hadoop file. One thing to note is that the requirement here is really only for the shuffle files, not for the other uses. But I realize we currently conflate these inside of Spark, so that might not buy us much. I'll look into this a bit more later.
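The intermediate abstraction mentioned above might look like a small trait with a direct local implementation, leaving room for a Hadoop-backed one so local users never pay the HDFS-interface overhead. A minimal sketch under invented names ({{ShuffleFs}}, {{LocalShuffleFs}}); this is not Spark's actual API:

```scala
// Hypothetical sketch: one filesystem trait, two implementations. The local
// one goes straight through java.io/java.nio; a Hadoop-backed one would
// delegate the same operations to org.apache.hadoop.fs.FileSystem.
trait ShuffleFs {
  def write(path: String, data: Array[Byte]): Unit
  def read(path: String): Array[Byte]
}

final class LocalShuffleFs extends ShuffleFs {
  def write(path: String, data: Array[Byte]): Unit = {
    val out = new java.io.FileOutputStream(path)
    try out.write(data) finally out.close()
  }
  def read(path: String): Array[Byte] =
    java.nio.file.Files.readAllBytes(java.nio.file.Paths.get(path))
}
```

The implementation could then be chosen once, from the scheme of {{spark.local.dir}}, and the shuffle code would program against the trait.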
[jira] [Commented] (SPARK-1529) Support setting spark.local.dirs to a hadoop FileSystem
[ https://issues.apache.org/jira/browse/SPARK-1529?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13976038#comment-13976038 ] Patrick Wendell commented on SPARK-1529: One idea proposed by [~adav] was to always use the Hadoop filesystem API, but to potentially implement our own version of the local filesystem if we find the Hadoop version has performance drawbacks. Another issue is that we use FileChannel objects directly in {{DiskBlockObjectWriter}}. After looking through this a bit, the functionality there to commit and rewind writes is not actually used anywhere; we could probably just remove it. [~liancheng] I think it would be worth it to look at a version where we just take all of the File APIs and replace them with Hadoop equivalents, i.e. your proposal.