[ 
https://issues.apache.org/jira/browse/SPARK-1529?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15055916#comment-15055916
 ] 

Da Fox edited comment on SPARK-1529 at 12/14/15 12:34 PM:
----------------------------------------------------------

Hi,
are there any updates on this improvement?

We are running Spark on YARN on a MapR-distribution Hadoop cluster with small 
local disks. A small-disk setup is not uncommon. Requiring local disk space 
solely for Spark scratch space seems like a waste of space that could otherwise 
be part of HDFS. How large should the local disks be, anyway? Should their size 
depend on the size of the datasets we are processing? Doesn't a multi-user 
environment make the problem even worse?

Another issue is the configuration of {{spark.local.dir}} for Spark on YARN: 
{quote}In Spark 1.0 and later this will be overridden by SPARK_LOCAL_DIRS 
(Standalone, Mesos) or LOCAL_DIRS (YARN) environment variables set by the 
cluster manager.{quote}
Spark now uses the same local directories as the Node Manager. If we use the 
NFS-mount workaround, we also have to move the Node Manager directory to NFS, 
which seems obscure (should we create a separate ticket for that?).
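
To illustrate the point (a sketch only; the mount point is hypothetical): because YARN exports {{LOCAL_DIRS}} to the executors, relocating Spark scratch space to NFS means changing the Node Manager's local directories in {{yarn-site.xml}}, since setting {{spark.local.dir}} alone is overridden on YARN:

```xml
<!-- yarn-site.xml: /mnt/nfs/yarn-local is a hypothetical NFS mount. -->
<!-- Executors inherit these directories via the LOCAL_DIRS variable set -->
<!-- by YARN, so this is the only place the scratch location can be moved; -->
<!-- spark.local.dir in spark-defaults.conf is ignored in this mode. -->
<property>
  <name>yarn.nodemanager.local-dirs</name>
  <value>/mnt/nfs/yarn-local</value>
</property>
```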

We would really appreciate it if you implemented a solution for this ticket. We 
also see a lot of questions on forums such as 
[StackOverflow|http://stackoverflow.com/q/31303568/878613] about Spark running 
out of space in its scratch directory, so the community would surely appreciate 
it too.

Thanks.



> Support DFS based shuffle in addition to Netty shuffle
> ------------------------------------------------------
>
>                 Key: SPARK-1529
>                 URL: https://issues.apache.org/jira/browse/SPARK-1529
>             Project: Spark
>          Issue Type: Improvement
>          Components: Spark Core
>            Reporter: Patrick Wendell
>            Assignee: Kannan Rajah
>         Attachments: Spark Shuffle using HDFS.pdf
>
>
> In some environments, like with MapR, local volumes are accessed through the 
> Hadoop filesystem interface. Shuffle is implemented by writing intermediate 
> data to local disk and serving it to remote nodes using Netty as a transport 
> mechanism. We want to provide an HDFS-based shuffle such that data can be 
> written to HDFS (instead of local disk) and served using the HDFS API on the 
> remote nodes. This could involve exposing a file system abstraction to Spark 
> shuffle and having two modes of running it: in the default mode it writes to 
> local disk, and in the DFS mode it writes to HDFS.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
