[jira] [Commented] (SPARK-1529) Support setting spark.local.dirs to a hadoop FileSystem

2015-05-14 Thread Kannan Rajah (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1529?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14544563#comment-14544563
 ] 

Kannan Rajah commented on SPARK-1529:
-

Just wanted to check whether folks have had a chance to review the changes. If 
you have any concerns, I will be happy to address them.

 Support setting spark.local.dirs to a hadoop FileSystem 
 

 Key: SPARK-1529
 URL: https://issues.apache.org/jira/browse/SPARK-1529
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Reporter: Patrick Wendell
Assignee: Kannan Rajah
 Attachments: Spark Shuffle using HDFS.pdf


 In some environments, like with MapR, local volumes are accessed through the 
 Hadoop filesystem interface. We should allow setting spark.local.dir to a 
 Hadoop filesystem location. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-1529) Support setting spark.local.dirs to a hadoop FileSystem

2015-04-13 Thread Kannan Rajah (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1529?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14493251#comment-14493251
 ] 

Kannan Rajah commented on SPARK-1529:
-

You can use the Compare functionality to see a single page of diffs across 
commits. Here is the link: 
https://github.com/rkannan82/spark/compare/4aaf48d46d13129f0f9bdafd771dd80fe568a7dc...rkannan82:7195353a31f7cfb087ec804b597b01fb362bc3f6

A few clarifications.
1. There are two reasons for introducing a FileSystem abstraction in Spark 
instead of directly using the Hadoop FileSystem.
  - There are Spark shuffle-specific APIs that needed abstraction. Please take 
a look at this code:
https://github.com/rkannan82/spark/blob/dfs_shuffle/core/src/main/scala/org/apache/spark/storage/FileSystem.scala

  - For local file system access, we can choose to bypass Hadoop's local file 
system implementation if it's not efficient. If you look at 
LocalFileSystem.scala, for most APIs it just delegates to the old code path 
that uses Spark's disk block manager, etc. In fact, we can look at this single 
class alone to determine whether the default Apache shuffle code path will hit 
any performance degradation.
https://github.com/rkannan82/spark/blob/dfs_shuffle/core/src/main/scala/org/apache/spark/storage/LocalFileSystem.scala

2. During the write phase, we shuffle to HDFS instead of the local file 
system. While reading back, we don't use the Netty-based transport that Apache 
shuffle uses. Instead we have a new implementation called DFSShuffleClient 
that reads from HDFS. That is the main difference.
https://github.com/rkannan82/spark/blob/dfs_shuffle/network/shuffle/src/main/java/org/apache/spark/network/shuffle/DFSShuffleClient.java
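For readers skimming the thread, the shape of this abstraction can be sketched roughly as follows. This is a hypothetical, simplified Java analogue of the FileSystem.scala / LocalFileSystem.scala split described above; the interface and class names are illustrative, not the actual Spark code:

```java
import java.io.*;

// Illustrative sketch: shuffle code talks to a small filesystem interface,
// and the default implementation delegates straight to java.io, while a
// DFS-backed implementation could instead open streams through the Hadoop
// FileSystem API. (Names here are hypothetical.)
interface ShuffleFileSystem {
    OutputStream create(File file) throws IOException;
    InputStream open(File file) throws IOException;
}

// Default implementation: plain local I/O, no Hadoop involvement, so the
// existing code path keeps its performance characteristics.
class LocalShuffleFileSystem implements ShuffleFileSystem {
    @Override public OutputStream create(File file) throws IOException {
        return new FileOutputStream(file);
    }
    @Override public InputStream open(File file) throws IOException {
        return new FileInputStream(file);
    }
}

public class Sketch {
    // Write a block through the abstraction and read it back.
    static String roundTrip() throws IOException {
        ShuffleFileSystem fs = new LocalShuffleFileSystem();
        File f = File.createTempFile("shuffle", ".data");
        try (OutputStream out = fs.create(f)) {
            out.write("block".getBytes());
        }
        byte[] buf = new byte[16];
        int n;
        try (InputStream in = fs.open(f)) {
            n = in.read(buf);
        }
        f.delete();
        return n + ":" + new String(buf, 0, n);
    }

    public static void main(String[] args) throws IOException {
        System.out.println(roundTrip()); // prints 5:block
    }
}
```

Swapping in a DFS-backed implementation would then only require another ShuffleFileSystem whose create/open go through a distributed filesystem client.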




[jira] [Commented] (SPARK-1529) Support setting spark.local.dirs to a hadoop FileSystem

2015-04-13 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1529?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14492215#comment-14492215
 ] 

Sean Owen commented on SPARK-1529:
--

(Sorry if this double-posts.)

Is there a good way to see the whole diff at the moment? I know there's a 
branch with individual commits. Maybe I am missing something basic.

This puts a new abstraction on top of a Hadoop FileSystem, on top of the 
underlying file system abstraction. That's getting heavy. If it's only 
abstracting access to an InputStream / OutputStream, why is it needed? That's 
already directly available from, say, Hadoop's FileSystem.

What would be the performance gain if this is the bit being swapped out? This 
is my original question: you shuffle to HDFS, then read it back to send it 
again via the existing shuffle? It kind of made sense when the idea was to 
swap out the whole shuffle and replace its transport.




[jira] [Commented] (SPARK-1529) Support setting spark.local.dirs to a hadoop FileSystem

2015-04-12 Thread Kannan Rajah (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1529?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14491834#comment-14491834
 ] 

Kannan Rajah commented on SPARK-1529:
-

Thanks. FYI, I have pushed a few more commits to my repo to handle all the 
TODOs and bug fixes, so you can track this branch for all the changes: 
https://github.com/rkannan82/spark/commits/dfs_shuffle




[jira] [Commented] (SPARK-1529) Support setting spark.local.dirs to a hadoop FileSystem

2015-04-12 Thread Patrick Wendell (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1529?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14491863#comment-14491863
 ] 

Patrick Wendell commented on SPARK-1529:


Hey Kannan,

We originally considered doing something like what you are proposing, where we 
would change all of our filesystem interactions to use a Hadoop FileSystem 
class and then use Hadoop's LocalFileSystem. However, there were two issues:

1. We use POSIX APIs that are not present in Hadoop. For instance, we use 
memory mapping in various places, FileChannel in the BlockObjectWriter, etc.
2. Using LocalFileSystem has substantial performance overhead compared with 
our current code, so we'd have to write our own implementation of a local 
filesystem.
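As a concrete illustration of point 1, the kind of POSIX-style access in question (memory mapping through FileChannel) is available for local files via java.nio but has no counterpart in the generic Hadoop FileSystem stream API. A minimal, self-contained example:

```java
import java.io.IOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

public class MmapDemo {
    // Memory-map a local file and read a byte directly from the mapping.
    // A Hadoop FileSystem only hands out InputStream/OutputStream, so this
    // kind of direct mapped access cannot be expressed through it.
    static byte firstByteViaMmap(Path path) throws IOException {
        try (FileChannel ch = FileChannel.open(path, StandardOpenOption.READ)) {
            MappedByteBuffer buf = ch.map(FileChannel.MapMode.READ_ONLY, 0, ch.size());
            return buf.get(0);
        }
    }

    public static void main(String[] args) throws IOException {
        Path p = Files.createTempFile("mmap", ".bin");
        Files.write(p, new byte[] {42, 1, 2});
        System.out.println(firstByteViaMmap(p)); // prints 42
        Files.delete(p);
    }
}
```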

For this reason, we decided that our current shuffle machinery was 
fundamentally not usable in non-POSIX environments. So instead, we let people 
customize shuffle behavior at a higher level, and we implemented the pluggable 
shuffle components. You can create a shuffle manager that is specifically 
optimized for a particular environment (e.g. MapR).

Did you consider implementing a MapR shuffle using that mechanism instead? 
You'd have to operate at a higher level, where you reason about shuffle 
records, etc., but you'd have a lot of flexibility to optimize within that.




[jira] [Commented] (SPARK-1529) Support setting spark.local.dirs to a hadoop FileSystem

2015-04-12 Thread Kannan Rajah (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1529?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14491869#comment-14491869
 ] 

Kannan Rajah commented on SPARK-1529:
-

[~pwendell] The default code path still uses the FileChannel and memory 
mapping techniques. I just provided an abstraction called FileSystem.scala 
(not Hadoop's FileSystem.java). LocalFileSystem.scala delegates calls to the 
existing Spark code path that uses FileChannel. I am using Hadoop's 
RawLocalFileSystem class just to get an InputStream and OutputStream, and this 
internally also uses FileChannel. Please see 
RawLocalFileSystem.LocalFSFileInputStream; it is just a wrapper over 
java.io.FileInputStream.
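The delegation described here, a stream that does little more than forward to java.io.FileInputStream, follows the standard FilterInputStream pattern. A simplified illustration (not the actual Hadoop source, which also tracks statistics and supports seeking):

```java
import java.io.*;

// Simplified stand-in for a wrapper like RawLocalFileSystem's
// LocalFSFileInputStream: it merely delegates reads to an underlying
// java.io.FileInputStream.
class DelegatingLocalInputStream extends FilterInputStream {
    DelegatingLocalInputStream(File file) throws IOException {
        super(new FileInputStream(file));
    }
}

public class WrapperDemo {
    static int readFirstByte(File f) throws IOException {
        try (InputStream in = new DelegatingLocalInputStream(f)) {
            return in.read();
        }
    }

    public static void main(String[] args) throws IOException {
        File f = File.createTempFile("wrap", ".bin");
        try (OutputStream out = new FileOutputStream(f)) {
            out.write(65);
        }
        System.out.println(readFirstByte(f)); // prints 65
        f.delete();
    }
}
```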

Going back to why I considered this approach: it allows us to reuse all the 
logic currently used by the SortShuffle code path. Otherwise we would have to 
reimplement pretty much everything Spark already does in order to shuffle on 
HDFS. We are in the process of running some performance tests to understand 
the impact of the change. One of the main things we will verify is whether the 
change introduces any performance degradation in the default code path, and we 
will fix it if it does. Is this acceptable?




[jira] [Commented] (SPARK-1529) Support setting spark.local.dirs to a hadoop FileSystem

2015-04-11 Thread Cheng Lian (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1529?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14491004#comment-14491004
 ] 

Cheng Lian commented on SPARK-1529:
---

[~kannan] I haven't been working on Spark Core for a while, but I'll take a 
look at this. Thanks for working on this!

Also cc [~pwendell] [~andrewor14].




[jira] [Commented] (SPARK-1529) Support setting spark.local.dirs to a hadoop FileSystem

2015-04-07 Thread Kannan Rajah (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1529?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14484022#comment-14484022
 ] 

Kannan Rajah commented on SPARK-1529:
-

[~liancheng] Will you be able to take a look at the code and provide some 
feedback?




[jira] [Commented] (SPARK-1529) Support setting spark.local.dirs to a hadoop FileSystem

2015-03-29 Thread Kannan Rajah (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1529?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14385648#comment-14385648
 ] 

Kannan Rajah commented on SPARK-1529:
-

I have pushed the first round of commits to my repo. I would like to get some 
early feedback on the overall design.
https://github.com/rkannan82/spark/commits/dfs_shuffle

Commits:
https://github.com/rkannan82/spark/commit/ce8b430512b31e932ffdab6e0a2c1a6a1768ffbf
https://github.com/rkannan82/spark/commit/8f5415c248c0a9ca5ad3ec9f48f839b24c259813
https://github.com/rkannan82/spark/commit/d9d179ba6c685cc8eb181f442e9bd6ad91cc4290




[jira] [Commented] (SPARK-1529) Support setting spark.local.dirs to a hadoop FileSystem

2015-03-14 Thread Kannan Rajah (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1529?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14361978#comment-14361978
 ] 

Kannan Rajah commented on SPARK-1529:
-

Can someone assign this bug to me? I am working on a patch.




[jira] [Commented] (SPARK-1529) Support setting spark.local.dirs to a hadoop FileSystem

2015-01-22 Thread Kannan Rajah (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1529?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14288341#comment-14288341
 ] 

Kannan Rajah commented on SPARK-1529:
-

[~pwendell] I would like us to consider reusing the write code path of the 
existing shuffle implementation instead of implementing it from scratch. This 
will allow us to take advantage of all the optimizations that have already 
been done and that will be done in the future. Only the read code path needs 
to be fully reimplemented, since we don't need the shuffle server logic. A 
handful of shuffle classes need to use the HDFS abstractions to achieve this. 
I have attached a high-level proposal. Let me know your thoughts.

Write:
IndexShuffleBlockManager, SortShuffleWriter, ExternalSorter, BlockObjectWriter

Read:
BlockStoreShuffleFetcher, HashShuffleReader




[jira] [Commented] (SPARK-1529) Support setting spark.local.dirs to a hadoop FileSystem

2015-01-07 Thread Patrick Wendell (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1529?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14267419#comment-14267419
 ] 

Patrick Wendell commented on SPARK-1529:


Hey Sean,

From what I remember of this, the issue is that MapR clusters are not 
typically provisioned with much local disk space, because MapR-FS supports 
accessing local volumes through its API, unlike the HDFS API. So in general 
the expectation is that large amounts of local data are written through MapR's 
API to its local filesystem. They have an NFS mount you can use as a 
workaround to provide POSIX APIs, and I think most MapR users set this mount 
up and then have Spark write shuffle data there.

Option 2, which [~rkannan82] mentions, is not actually feasible in Spark right 
now. We don't support writing shuffle data through the Hadoop APIs, and I 
think Cheng's patch was only a prototype of how we might do that...




[jira] [Commented] (SPARK-1529) Support setting spark.local.dirs to a hadoop FileSystem

2015-01-07 Thread Patrick Wendell (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1529?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14267424#comment-14267424
 ] 

Patrick Wendell commented on SPARK-1529:


BTW - I think that if MapR wants a customized shuffle, the direction proposed 
in this patch is probably not the best way to do it. It would make more sense 
to implement a DFS-based shuffle using the new pluggable shuffle API, i.e. a 
shuffle that communicates through the filesystem rather than doing transfers 
through Spark.




[jira] [Commented] (SPARK-1529) Support setting spark.local.dirs to a hadoop FileSystem

2015-01-07 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1529?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14267430#comment-14267430
 ] 

Sean Owen commented on SPARK-1529:
--

[~pwendell] Gotcha, that begins to make sense. I assume the cluster can be 
provisioned with as much local disk as desired, regardless of defaults. The 
alternative, writing temp files across the network and reading them back in 
order to then broadcast them over the network again, seems a lot worse than 
just setting up the right amount of local disk. But if it works well enough 
and is easier in some situations, it sounds like that's also an option. I 
suppose I'm questioning why the project would want to encourage remote shuffle 
files by not just using the HDFS APIs but even maintaining a specialized 
version of them, just to provide a third workaround for a vendor config issue. 
Surely MapR should just set up clusters provisioned more the way Spark needs 
them.




[jira] [Commented] (SPARK-1529) Support setting spark.local.dirs to a hadoop FileSystem

2015-01-05 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1529?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14264527#comment-14264527
 ] 

Sean Owen commented on SPARK-1529:
--

OK, but why do these files *have* to go on a non-local disk? It sounds like 
you're saying Spark doesn't work at all on MapR right now, but that can't be 
the case. They *can* go on a non-local disk, I'm sure. What's the value of 
that, given that Spark is transporting the files itself?

Still, as you say, this proprietary setup already works through the 
java.io+NFS and HDFS APIs, with no change. If it's just not as fast, is that a 
problem Spark should be solving? Just don't do it. Or it's up to the vendor to 
optimize.




[jira] [Commented] (SPARK-1529) Support setting spark.local.dirs to a hadoop FileSystem

2015-01-04 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1529?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14263823#comment-14263823
 ] 

Sean Owen commented on SPARK-1529:
--

So Spark needs to read and write local files for things like shuffle. It uses 
java.io, and this continues to work everywhere. I am still missing why 
something has to go through HDFS or NFS here. These files should not be on NFS 
either.




[jira] [Commented] (SPARK-1529) Support setting spark.local.dirs to a hadoop FileSystem

2015-01-04 Thread Kannan Rajah (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1529?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14263928#comment-14263928
 ] 

Kannan Rajah commented on SPARK-1529:
-

In a MapR distribution, these files need to go into a MapR volume rather than 
onto local disk. The MapR volume is local to the node, but it is part of the 
distributed file system. This can be achieved in two ways:
1. Expose the MapR file system as an NFS mount point. Then you can use the 
normal java.io API and data will still be written to the MapR volume instead 
of local disk.

2. Use the HDFS API (with the underlying MapR implementation) instead of 
java.io so that the I/O goes to the MapR volume.
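Option 1 requires no Spark changes at all: ordinary java.io calls are simply pointed at a directory that happens to be an NFS mount of the MapR volume. A runnable sketch (the MapR mount path is hypothetical; a temp directory stands in for it here so the example runs anywhere):

```java
import java.io.*;
import java.nio.file.Files;

public class NfsOptionDemo {
    // Under option 1, only the configured directory changes (e.g.
    // spark.local.dir pointing at something like "/mapr/cluster/tmp",
    // a hypothetical NFS mount); the java.io code path is untouched.
    static String writeAndReadBack(File localDir) throws IOException {
        File f = new File(localDir, "shuffle_0_0.data"); // name is illustrative
        try (Writer w = new FileWriter(f)) {
            w.write("payload");
        }
        String back = new String(Files.readAllBytes(f.toPath()));
        f.delete();
        return back;
    }

    public static void main(String[] args) throws IOException {
        File dir = Files.createTempDirectory("local-dir").toFile();
        System.out.println(writeAndReadBack(dir)); // prints payload
    }
}
```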




[jira] [Commented] (SPARK-1529) Support setting spark.local.dirs to a hadoop FileSystem

2015-01-03 Thread Cheng Lian (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1529?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14263516#comment-14263516
 ] 

Cheng Lian commented on SPARK-1529:
---

Hi [~srowen], first of all, we are not trying to put shuffle and temp files in 
HDFS. At the time this ticket was created, the initial motivation was to 
support MapR, because MapR only exposes the local file system via MapR volumes 
and the HDFS {{FileSystem}} interface. However, later on this issue was worked 
around with NFS, and the ticket wasn't resolved for lack of capacity.

[~rkannan82] Thanks for looking into this! Several months ago, I implemented a 
prototype by simply replacing Java NIO file system operations with the 
corresponding HDFS {{FileSystem}} versions. According to a prior benchmark 
done with {{spark-perf}}, this introduces a ~15% performance penalty for 
shuffling. Thus we had planned to write a specialized {{FileSystem}} 
implementation that simply wraps normal Java NIO operations, to avoid the 
performance penalty as much as possible, and then replace all local file 
system access with that specialized {{FileSystem}} implementation.




[jira] [Commented] (SPARK-1529) Support setting spark.local.dirs to a hadoop FileSystem

2015-01-03 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1529?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14263578#comment-14263578
 ] 

Sean Owen commented on SPARK-1529:
--

Hm, how do these APIs preclude the direct use of java.io? Is this actually 
disabled in MapR? If there is a workaround, what is the remaining motivation?




[jira] [Commented] (SPARK-1529) Support setting spark.local.dirs to a hadoop FileSystem

2015-01-03 Thread Kannan Rajah (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1529?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14263603#comment-14263603
 ] 

Kannan Rajah commented on SPARK-1529:
-

It cannot preclude the use of java.io completely. If a use case needs java.io 
APIs, then you cannot use the HDFS API for it, but that is not normally the 
case.

The NFS-mount-based workaround is not as efficient as accessing the volume 
through the HDFS interface, hence the need.




[jira] [Commented] (SPARK-1529) Support setting spark.local.dirs to a hadoop FileSystem

2015-01-03 Thread Kannan Rajah (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1529?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14263600#comment-14263600
 ] 

Kannan Rajah commented on SPARK-1529:
-

[~lian cheng] Can you upload this prototype patch so that I can reuse it? What 
branch was it based off? When I start making new changes, I suppose I should 
make them against the master branch, right?




[jira] [Commented] (SPARK-1529) Support setting spark.local.dirs to a hadoop FileSystem

2015-01-02 Thread Kannan Rajah (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1529?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14263287#comment-14263287
 ] 

Kannan Rajah commented on SPARK-1529:
-

[~lian cheng] [~pwendell] I want to work on this JIRA. It's been a while since 
there was any update, so can you please share the current status? Has there 
been consensus on replacing the file API with an HDFS-like interface and 
plugging in the right implementation? I will be looking at the code base in 
the meantime.




[jira] [Commented] (SPARK-1529) Support setting spark.local.dirs to a hadoop FileSystem

2014-12-27 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1529?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14259323#comment-14259323
 ] 

Sean Owen commented on SPARK-1529:
--

This may be a dumb question on an old issue, but: you can already access local 
filesystems through HDFS's FileSystem. Of course you can always access local 
filesystems directly in general.

The FileSystem abstraction won't provide everything that java.nio does because 
it is not a local filesystem.

But why would you want to put shuffle and temp files in HDFS? The MapR comment 
confuses me more, since MapR's main trick is exposing HDFS as something more 
like an NFS mount point. But if anything, that makes it already usable as-is 
for this purpose. 






[jira] [Commented] (SPARK-1529) Support setting spark.local.dirs to a hadoop FileSystem

2014-04-22 Thread Aaron Davidson (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1529?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13976494#comment-13976494
 ] 

Aaron Davidson commented on SPARK-1529:
---

Hey, looking through the code a little more in depth reveals that the rewind 
and truncate functionality of DiskBlockObjectWriter is actually used during 
shuffle file consolidation. The issue is that we store the metadata for each 
consolidated shuffle file as a consecutive set of offsets into the file (an 
offset table). That is, if we have 3 blocks stored in the same file, rather 
than storing 3 pairs of (offset, length), we simply store the offsets and use 
the fact that they're laid out consecutively to reconstruct the lengths. This 
means the data structure cannot tolerate holes left by partial writes, and we 
thus rely on the fact that partial writes (which are not included in the data 
structure right now) are always of size 0.

I think getting around this is pretty straightforward, however: We can simply 
store the offsets of all partial writes in the offset table, and just avoid 
storing them in the index we build to look up the positions of particular map 
tasks in the offset table. This will mean we can reconstruct the lengths 
properly, but most importantly it means we will not think that our failed map 
tasks were successful (because index lookups for them will still fail, even 
though they're in the offset table).

This seems like a pretty clean solution that wraps up our usage of 
FileChannels, save where we mmap files back into memory. We will likely want to 
special-case the blocks to make sure we can mmap them directly when reading 
from the local file system.
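The offset-table reconstruction described above can be sketched as follows. This is a minimal illustration with assumed names, not Spark's actual code: because blocks are laid out consecutively, each block's length is the difference between adjacent offsets, and a zero-length entry is exactly the hole a failed partial write leaves behind.

```scala
// Sketch (names assumed, not Spark's real classes): blocks sit consecutively
// in one consolidated file, so storing only the start offsets plus the final
// file length recovers every (offset, length) pair.
object OffsetTableSketch {
  // offsets(i) = start of block i; the last entry is the total file length.
  def segments(offsets: Array[Long]): Seq[(Long, Long)] =
    offsets.sliding(2).map { case Array(start, next) => (start, next - start) }.toSeq

  def main(args: Array[String]): Unit = {
    // Three blocks of sizes 100, 0 (a failed partial write), and 250:
    val offsets = Array(0L, 100L, 100L, 350L)
    // The zero-length pair marks the partial write the index must skip.
    println(segments(offsets))
  }
}
```

Under this sketch, keeping the zero-length entry in the table preserves the consecutive-offset invariant, while leaving it out of the map-task index keeps failed tasks invisible to lookups.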

 Support setting spark.local.dirs to a hadoop FileSystem 
 

 Key: SPARK-1529
 URL: https://issues.apache.org/jira/browse/SPARK-1529
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Reporter: Patrick Wendell
Assignee: Cheng Lian
 Fix For: 1.1.0


 In some environments, like with MapR, local volumes are accessed through the 
 Hadoop filesystem interface. We should allow setting spark.local.dir to a 
 Hadoop filesystem location. 



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-1529) Support setting spark.local.dirs to a hadoop FileSystem

2014-04-21 Thread Cheng Lian (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1529?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13975914#comment-13975914
 ] 

Cheng Lian commented on SPARK-1529:
---

After some investigation, I came to the conclusion that, unlike adding Tachyon 
support, to allow setting {{spark.local.dir}} to a Hadoop FS location, instead 
of adding something like {{HDFSBlockManager}} / {{HDFSStore}}, we have to 
refactor related local FS access code to leverage HDFS interfaces. And it seems 
hard to make this change incremental. Besides writing shuffle map output, at 
least two places reference {{spark.local.dir}}:

# HTTP broadcasting uses {{spark.local.dir}} as its resource root, and accesses 
the local FS with {{java.io.File}}
# {{FileServerHandler}} accesses {{spark.local.dir}} via {{DiskBlockManager}} 
and reads local files with {{FileSegment}} and {{java.io.File}}

Adding new block manager / store for HDFS can't fix these places. I'm currently 
working on this issue by:

# Refactoring {{FileSegment.file}} from {{java.io.File}} to 
{{org.apache.hadoop.fs.Path}},
# Refactoring {{DiskBlockManager}}, {{DiskStore}}, {{HttpBroadcast}} and 
{{FileServerHandler}} to leverage HDFS interfaces.

Please leave comments if I missed anything or if there are simpler ways to 
work around this.

(PS: We should definitely refactor block manager related code to reduce 
duplicate code and encapsulate more details. Maybe the public interface of 
block manager should only communicate with other component with block IDs and 
storage levels.)
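One possible shape for the first refactoring step above, sketched with hypothetical names (this is a direction, not the actual patch): a {{FileSegment}} keyed by a Hadoop-style path string rather than a {{java.io.File}}, so a caller can distinguish local from distributed locations by URI scheme.

```scala
import java.net.URI

// Hypothetical sketch: FileSegment carries a Hadoop-style path string
// instead of java.io.File, so the same segment type can point at plain
// local paths, file:/, hdfs://, or maprfs:// locations.
final case class FileSegment(path: String, offset: Long, length: Long) {
  require(offset >= 0 && length >= 0, "offset and length must be non-negative")

  // A null or "file" scheme means the segment lives on the local FS and
  // could still be served with plain java.io / java.nio.
  def isLocal: Boolean = {
    val scheme = new URI(path).getScheme
    scheme == null || scheme == "file"
  }
}

object FileSegmentDemo {
  def main(args: Array[String]): Unit = {
    println(FileSegment("/tmp/spark/shuffle_0_0_0", 0L, 1024L).isLocal)
    println(FileSegment("hdfs://nn:8020/spark/shuffle_0_0_0", 0L, 1024L).isLocal)
  }
}
```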



[jira] [Commented] (SPARK-1529) Support setting spark.local.dirs to a hadoop FileSystem

2014-04-21 Thread Patrick Wendell (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1529?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13975947#comment-13975947
 ] 

Patrick Wendell commented on SPARK-1529:


[~liancheng] Hey Cheng, the tricky thing here is that we want to avoid _always_ 
going through the HDFS filesystem interface when people are actually using 
local files. We might need to add an intermediate abstraction to deal with 
this. We already do this elsewhere in the code base; for instance, the 
JobLogger will open an output stream either directly from a local file or from 
a Hadoop file.

One thing to note is that the requirement here is really only for the shuffle 
files, not for the other uses. But I realize we currently conflate these inside 
of Spark so that might not buy us much. I'll look into this a bit more later.
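The intermediate abstraction described above could look roughly like this (illustrative names only, not an existing Spark API): dispatch on the path's URI scheme, keeping plain {{java.io}} for local paths and reserving the Hadoop {{FileSystem}} API for everything else, so purely local users never pay for the Hadoop layer.

```scala
import java.net.URI

// Sketch with assumed names: pick a storage backend from the URI scheme.
sealed trait Backend
case object LocalJavaIo extends Backend // java.io.File / java.nio
case object HadoopFs    extends Backend // org.apache.hadoop.fs.FileSystem

object BackendChooser {
  def backendFor(path: String): Backend =
    Option(new URI(path).getScheme) match {
      case None | Some("file") => LocalJavaIo // /tmp/..., file:///...
      case _                   => HadoopFs    // hdfs://, maprfs://, ...
    }
}
```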



[jira] [Commented] (SPARK-1529) Support setting spark.local.dirs to a hadoop FileSystem

2014-04-21 Thread Patrick Wendell (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-1529?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13976038#comment-13976038
 ] 

Patrick Wendell commented on SPARK-1529:


One idea proposed by [~adav] was to always use the Hadoop filesystem API, but 
to potentially implement our own version of the local filesystem if we find the 
Hadoop version has performance drawbacks.

Another issue is that we use FileChannel objects directly in 
{{DiskBlockObjectWriter}}. After looking through this a bit, the functionality 
there to commit and rewind writes is not actually used anywhere; we could 
probably just remove it.

[~liancheng] I think it would be worth looking at a version where we just 
take all of the File APIs and replace them with Hadoop equivalents, i.e. your 
proposal.
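If the commit/rewind surface really can be dropped, the remaining writer contract is small enough to back with either a local stream or a Hadoop output stream. A hypothetical sketch (assumed interface, not Spark's actual {{DiskBlockObjectWriter}}), with an in-memory stand-in for the file-backed implementation:

```scala
import java.io.ByteArrayOutputStream

// Hypothetical sketch: a writer narrowed to append + close, with no
// rewind/truncate, so any append-only stream (local or Hadoop) can back it.
trait BlockObjectWriter {
  def write(bytes: Array[Byte]): Unit
  def commitAndClose(): Long // total bytes written
}

// In-memory stand-in for a real file-backed implementation.
final class BufferedBlockWriter extends BlockObjectWriter {
  private val buf = new ByteArrayOutputStream()
  def write(bytes: Array[Byte]): Unit = buf.write(bytes)
  def commitAndClose(): Long = buf.size.toLong
}
```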


