[jira] [Commented] (SPARK-29181) Cache preferred locations of checkpointed RDD
[ https://issues.apache.org/jira/browse/SPARK-29181?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16934575#comment-16934575 ]

Dongjoon Hyun commented on SPARK-29181:
---------------------------------------

Thanks~ :)

> Cache preferred locations of checkpointed RDD
> ---------------------------------------------
>
>                 Key: SPARK-29181
>                 URL: https://issues.apache.org/jira/browse/SPARK-29181
>             Project: Spark
>          Issue Type: Improvement
>          Components: Spark Core
>    Affects Versions: 3.0.0
>            Reporter: Liang-Chi Hsieh
>            Priority: Major
>
> One Spark job in our cluster fits many ALS models in parallel. The fitting
> goes well, but in the next step, when we union all factors, the union
> operation is very slow.
> Looking at the driver stack dump, it appears that the driver spends a lot of
> time computing preferred locations. Because we checkpoint the training data
> before fitting ALS, the time is spent in
> ReliableCheckpointRDD.getPreferredLocations. This method calls the DFS
> interface to query file status and block locations. Because a large number of
> partitions are derived from the checkpointed RDD, the union spends a lot of
> time querying the same information repeatedly.
> This proposes adding a Spark config to control the caching behavior of
> ReliableCheckpointRDD.getPreferredLocations. If it is enabled,
> getPreferredLocations computes the preferred locations only once and caches
> them for later use.

--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
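
The caching behavior proposed in the issue description can be illustrated with a small, self-contained sketch. This is not Spark code and not the patch itself: the class CheckpointLocations, the query_dfs callback, and the count_dfs_queries helper are hypothetical names invented for illustration, and the cache_enabled flag merely stands in for the proposed Spark config, whose actual name is not given in this thread. The sketch only shows why repeated lookups for the same partitions become cheap once results are remembered.

```python
from typing import Callable, Dict, List


class CheckpointLocations:
    """Toy stand-in for a preferred-locations lookup (hypothetical, not Spark)."""

    def __init__(self, query_dfs: Callable[[int], List[str]], cache_enabled: bool):
        # query_dfs is an assumed callback standing in for the DFS
        # file-status and block-location query for one partition.
        self._query_dfs = query_dfs
        # cache_enabled stands in for the proposed config switch.
        self._cache_enabled = cache_enabled
        self._cache: Dict[int, List[str]] = {}

    def preferred_locations(self, partition: int) -> List[str]:
        if not self._cache_enabled:
            # Caching disabled: every call goes to the DFS.
            return self._query_dfs(partition)
        if partition not in self._cache:
            # First request for this partition: query once and remember it.
            self._cache[partition] = self._query_dfs(partition)
        return self._cache[partition]


def count_dfs_queries(cache_enabled: bool, partitions: int, lookups: int) -> int:
    """Return how many DFS queries are issued for repeated lookups."""
    calls = 0

    def fake_dfs(partition: int) -> List[str]:
        nonlocal calls
        calls += 1
        return [f"host-{partition % 3}"]

    locations = CheckpointLocations(fake_dfs, cache_enabled)
    # Simulate many derived RDDs asking about the same checkpoint partitions.
    for _ in range(lookups):
        for p in range(partitions):
            locations.preferred_locations(p)
    return calls
```

With caching disabled, the number of simulated DFS queries grows with the number of lookups; with caching enabled, it is bounded by the number of distinct partitions. A real implementation would also have to decide whether and when cached entries should expire, which this sketch does not address.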
[jira] [Commented] (SPARK-29181) Cache preferred locations of checkpointed RDD
[ https://issues.apache.org/jira/browse/SPARK-29181?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16933955#comment-16933955 ]

Liang-Chi Hsieh commented on SPARK-29181:
-----------------------------------------

[~dongjoon] Thanks. I was not aware that I had created a duplicate.
[jira] [Commented] (SPARK-29181) Cache preferred locations of checkpointed RDD
[ https://issues.apache.org/jira/browse/SPARK-29181?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16933926#comment-16933926 ]

Dongjoon Hyun commented on SPARK-29181:
---------------------------------------

Hi, [~viirya]. Is this a duplicate of SPARK-29182? If so, could you close this one, since you made a PR there?