[jira] [Commented] (SPARK-29189) Add an option to ignore block locations when listing file

2020-01-24 Thread Reynold Xin (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-29189?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17023340#comment-17023340
 ] 

Reynold Xin commented on SPARK-29189:
-------------------------------------

This is great, but how would users know when to set this? Shouldn't we make a 
small incremental improvement and automatically detect the common object 
stores, disabling the locality check for them?
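
The auto-detection suggested above could be sketched roughly as follows. This is 
only an illustration, not what Spark does: the helper name 
`should_ignore_block_locations` is hypothetical, and the list of URI schemes is 
an assumed, non-exhaustive set of common object-store connectors.

```python
from urllib.parse import urlparse

# Assumption: a non-exhaustive set of URI schemes that usually denote object
# stores, where block locations carry no useful locality information.
OBJECT_STORE_SCHEMES = frozenset(
    {"s3", "s3a", "s3n", "gs", "wasb", "wasbs", "abfs", "abfss"}
)


def should_ignore_block_locations(path: str) -> bool:
    """Hypothetical helper: True when the path's scheme looks like an object store."""
    scheme = urlparse(path).scheme.lower()
    return scheme in OBJECT_STORE_SCHEMES


if __name__ == "__main__":
    for p in ("s3a://bucket/warehouse/t1", "hdfs://nn:8020/warehouse/t1", "/local/path"):
        print(p, should_ignore_block_locations(p))
```

A scheme check like this would not catch a cluster whose storage is a remote 
distributed file system, such as the separated compute and storage setup the 
reporter describes, so an explicit option would still be needed there.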

> Add an option to ignore block locations when listing file
> ---------------------------------------------------------
>
>                 Key: SPARK-29189
>                 URL: https://issues.apache.org/jira/browse/SPARK-29189
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 3.0.0
>            Reporter: Wang, Gang
>            Assignee: Wang, Gang
>            Priority: Major
>             Fix For: 3.0.0
>
>
> In our PROD environment we run a pure Spark cluster in which computation is 
> separated from the storage layer; I think this setup is fairly common. In 
> this deployment mode, data locality can never be achieved. 
> The Spark scheduler has some configurations that reduce the time spent 
> waiting for data locality (e.g. "spark.locality.wait"). The problem is that, 
> in the file-listing phase, the location information of every file, including 
> all the blocks inside each file, is still fetched from the distributed file 
> system. In a PROD environment a table can be so large that fetching this 
> location information alone takes tens of seconds.
> To improve this scenario, Spark needs to provide an option under which data 
> locality is ignored entirely: all the file-listing phase then needs is the 
> file locations, without any block location information.
>  
> We ran a benchmark in our PROD environment. After ignoring the block 
> locations, file listing was faster for every table measured.
> ||Table Size||Total File Number||Total Block Number||List File Duration (With Block Location)||List File Duration (Without Block Location)||
> |22.6 T|3|12|16.841s|1.730s|
> |28.8 T|42001|148964|10.099s|2.858s|
> |3.4 T|2|2|5.833s|4.881s|
>  
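
To make the size of the improvement easier to read, the speedup implied by each 
row of the benchmark table can be computed as below. This is a small sketch 
that only reuses the durations reported in the table; the `speedup` helper is 
introduced here for illustration.

```python
# (table size, seconds with block locations, seconds without block locations),
# copied from the benchmark table in the issue description.
BENCHMARK = [
    ("22.6 T", 16.841, 1.730),
    ("28.8 T", 10.099, 2.858),
    ("3.4 T", 5.833, 4.881),
]


def speedup(with_locations: float, without_locations: float) -> float:
    """Ratio of listing time with block locations to listing time without them."""
    return with_locations / without_locations


for size, with_loc, without_loc in BENCHMARK:
    saved = with_loc - without_loc
    print(f"{size}: {speedup(with_loc, without_loc):.2f}x faster, {saved:.3f}s saved")
```

On these numbers the listing is roughly 9.7x, 3.5x, and 1.2x faster for the 
three tables respectively.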



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-29189) Add an option to ignore block locations when listing file

2019-10-07 Thread Imran Rashid (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-29189?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16946176#comment-16946176
 ] 

Imran Rashid commented on SPARK-29189:
--------------------------------------

Fixed by PR https://github.com/apache/spark/pull/25869



