[jira] [Commented] (SPARK-26164) [SQL] Allow FileFormatWriter to write multiple partitions/buckets without sort

2021-05-06 Thread Apache Spark (Jira)


[ https://issues.apache.org/jira/browse/SPARK-26164?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17340296#comment-17340296 ]

Apache Spark commented on SPARK-26164:
--

User 'cloud-fan' has created a pull request for this issue:
https://github.com/apache/spark/pull/32459

> [SQL] Allow FileFormatWriter to write multiple partitions/buckets without sort
> --
>
> Key: SPARK-26164
> URL: https://issues.apache.org/jira/browse/SPARK-26164
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.4.0, 3.0.0, 3.1.0
>Reporter: Cheng Su
>Assignee: Cheng Su
>Priority: Minor
> Fix For: 3.2.0
>
>
> Problem:
> Spark currently always requires a local sort on the partition/bucket 
> columns before writing to an output table [1]. The disadvantage is that the 
> sort may waste reserved CPU time on the executor due to spill. Hive does 
> not require a local sort before writing the output table [2], and we saw a 
> performance regression when migrating Hive workloads to Spark.
>  
> Proposal:
> We can avoid the local sort by keeping a mapping between file paths and 
> output writers. When a row targets a new file path, we create a new output 
> writer; otherwise, we re-use the existing writer for that path (the main 
> change should be in FileFormatDataWriter.scala). This is very similar to 
> what Hive does in [2].
> Because the new behavior (avoiding the sort by keeping multiple output 
> writers open at the same time) consumes more memory on the executor than 
> the current behavior (only one output writer open), we can add a config to 
> switch between the current and new behavior.
>  
> [1]: spark FileFormatWriter.scala - 
> [https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileFormatWriter.scala#L123]
> [2]: hive FileSinkOperator.java - 
> [https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/exec/FileSinkOperator.java#L510]
>  
>  
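A minimal Scala sketch of the proposed path-to-writer mapping. The OutputWriter trait and ConcurrentWriterSketch class below are illustrative stand-ins, not Spark's actual FileFormatDataWriter API; the real change would live in FileFormatDataWriter.scala:

{code:scala}
import scala.collection.mutable

// Illustrative stand-in for Spark's internal per-file writer.
trait OutputWriter {
  def write(row: Seq[Any]): Unit
  def close(): Unit
}

// Keep one open writer per file path, so input rows no longer need to
// arrive pre-sorted by the partition/bucket columns.
class ConcurrentWriterSketch(newWriter: String => OutputWriter) {
  private val writers = mutable.Map.empty[String, OutputWriter]

  def write(path: String, row: Seq[Any]): Unit = {
    // Re-use the writer if the path was seen before; otherwise open one.
    val writer = writers.getOrElseUpdate(path, newWriter(path))
    writer.write(row)
  }

  // All writers stay open until the task finishes, which is the extra
  // executor-memory cost the description above calls out.
  def closeAll(): Unit = {
    writers.values.foreach(_.close())
    writers.clear()
  }
}
{code}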






[jira] [Commented] (SPARK-26164) [SQL] Allow FileFormatWriter to write multiple partitions/buckets without sort

2021-04-15 Thread Apache Spark (Jira)


[ https://issues.apache.org/jira/browse/SPARK-26164?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17322552#comment-17322552 ]

Apache Spark commented on SPARK-26164:
--

User 'c21' has created a pull request for this issue:
https://github.com/apache/spark/pull/32198




[jira] [Commented] (SPARK-26164) [SQL] Allow FileFormatWriter to write multiple partitions/buckets without sort

2020-08-25 Thread Cheng Su (Jira)


[ https://issues.apache.org/jira/browse/SPARK-26164?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17184623#comment-17184623 ]

Cheng Su commented on SPARK-26164:
--

Just an update: I discussed with [~cloud_fan], and this Jira is still valid; I 
am working on a new PR against the latest master. Thanks.




[jira] [Commented] (SPARK-26164) [SQL] Allow FileFormatWriter to write multiple partitions/buckets without sort

2020-08-24 Thread Cheng Su (Jira)


[ https://issues.apache.org/jira/browse/SPARK-26164?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17183643#comment-17183643 ]

Cheng Su commented on SPARK-26164:
--

[~cloud_fan] and [~hyukjin.kwon] - sorry for the repeated pings. I would like 
to get your opinion on this JIRA. Thanks.




[jira] [Commented] (SPARK-26164) [SQL] Allow FileFormatWriter to write multiple partitions/buckets without sort

2020-08-21 Thread Cheng Su (Jira)


[ https://issues.apache.org/jira/browse/SPARK-26164?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17182239#comment-17182239 ]

Cheng Su commented on SPARK-26164:
--

[~cloud_fan], [~hyukjin.kwon] - I am wondering what you think of this JIRA. I 
can raise another PR against the latest trunk. Thanks.

The JIRA is still a valid problem on the latest master trunk. Internally in 
our fork, we are seeing much better CPU/IO performance when writing large 
output tables with dynamic partitions or buckets. The sort before writing the 
output has obvious overhead when it needs to spill.




[jira] [Commented] (SPARK-26164) [SQL] Allow FileFormatWriter to write multiple partitions/buckets without sort

2019-02-07 Thread Thincrs (JIRA)


[ https://issues.apache.org/jira/browse/SPARK-26164?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16762991#comment-16762991 ]

Thincrs commented on SPARK-26164:
-

A user of thincrs has selected this issue. Deadline: Thu, Feb 14, 2019 7:16 PM




[jira] [Commented] (SPARK-26164) [SQL] Allow FileFormatWriter to write multiple partitions/buckets without sort

2018-12-27 Thread Cheng Su (JIRA)


[ https://issues.apache.org/jira/browse/SPARK-26164?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16730029#comment-16730029 ]

Cheng Su commented on SPARK-26164:
--

[~cloud_fan], could you help take a look at 
[https://github.com/apache/spark/pull/23163]? Thanks!




[jira] [Commented] (SPARK-26164) [SQL] Allow FileFormatWriter to write multiple partitions/buckets without sort

2018-11-28 Thread Apache Spark (JIRA)


[ https://issues.apache.org/jira/browse/SPARK-26164?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16701643#comment-16701643 ]

Apache Spark commented on SPARK-26164:
--

User 'c21' has created a pull request for this issue:
https://github.com/apache/spark/pull/23163







[jira] [Commented] (SPARK-26164) [SQL] Allow FileFormatWriter to write multiple partitions/buckets without sort

2018-11-26 Thread Wenchen Fan (JIRA)


[ https://issues.apache.org/jira/browse/SPARK-26164?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16700012#comment-16700012 ]

Wenchen Fan commented on SPARK-26164:
-

The idea LGTM. It would be better to reuse some code between this and aggregate.




[jira] [Commented] (SPARK-26164) [SQL] Allow FileFormatWriter to write multiple partitions/buckets without sort

2018-11-26 Thread Cheng Su (JIRA)


[ https://issues.apache.org/jira/browse/SPARK-26164?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=1673#comment-1673 ]

Cheng Su commented on SPARK-26164:
--

_> I think we can follow aggregate, try hash writer first, and fallback to sort 
when we are short of memory_

This is a good idea. After checking the aggregation implementation, I think we 
can do the following for writing tables (i.e. in FileFormatDataWriter and 
FileFormatWriter):

1. Maintain a mapping between file paths and output writers, and re-use a 
path's writer when writing input rows. If the number of open output writers 
exceeds a threshold (changeable via a config), go to 2.

2. Sort the remaining input rows (using the same sorter as in SortExec), then 
write the sorted rows; since each file path's rows are now contiguous, we can 
close each writer on the fly once there are no more rows for its file path. A 
sketch of this two-phase scheme is below.

* I use the number of open output writers as a proxy for the memory used. Let 
me know if there is a better way to tell whether the executor is short of 
memory.

Do you think we are good to go ahead and implement the above design? cc 
[~cloud_fan], [~tejasp] and [~sameerag], thanks!
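A hypothetical Scala sketch of this two-phase design (concurrent hash-style writers first, sort-based fallback after the threshold is exceeded). The names and types are assumptions for illustration, not Spark's real writer internals, and the in-memory buffer stands in for the spillable sorter SortExec uses:

{code:scala}
import scala.collection.mutable

// Illustrative stand-in; not the real FileFormatDataWriter interface.
trait OutputWriter {
  def write(row: Seq[Any]): Unit
  def close(): Unit
}

class FallbackWriterSketch(
    newWriter: String => OutputWriter,
    maxOpenWriters: Int) {

  private val writers = mutable.Map.empty[String, OutputWriter]
  // Rows buffered once the writer limit is hit; the real proposal would
  // push these through the same (spillable) sorter that SortExec uses.
  private val buffered = mutable.ArrayBuffer.empty[(String, Seq[Any])]

  // Phase 1: hash-style writing, one open writer per path.
  def write(path: String, row: Seq[Any]): Unit =
    writers.get(path) match {
      case Some(w) => w.write(row)
      case None if writers.size < maxOpenWriters =>
        writers.getOrElseUpdate(path, newWriter(path)).write(row)
      case None => buffered += ((path, row)) // fall back to phase 2
    }

  // Phase 2: sort the buffered remainder by path, so each remaining file
  // is written contiguously and its writer closes when the path changes.
  def finish(): Unit = {
    writers.values.foreach(_.close())
    writers.clear()
    var current: Option[(String, OutputWriter)] = None
    for ((path, row) <- buffered.sortBy(_._1)) {
      if (!current.exists(_._1 == path)) {
        current.foreach(_._2.close())
        current = Some((path, newWriter(path)))
      }
      current.get._2.write(row)
    }
    current.foreach(_._2.close())
  }
}
{code}

Here the open-writer count stands in for memory usage, matching the proxy described in the footnote above.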




[jira] [Commented] (SPARK-26164) [SQL] Allow FileFormatWriter to write multiple partitions/buckets without sort

2018-11-25 Thread Wenchen Fan (JIRA)


[ https://issues.apache.org/jira/browse/SPARK-26164?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16698413#comment-16698413 ]

Wenchen Fan commented on SPARK-26164:
-

I think we can follow aggregate: try the hash writer first, and fall back to 
sort when we are short of memory.




[jira] [Commented] (SPARK-26164) [SQL] Allow FileFormatWriter to write multiple partitions/buckets without sort

2018-11-24 Thread Cheng Su (JIRA)


[ https://issues.apache.org/jira/browse/SPARK-26164?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16698070#comment-16698070 ]

Cheng Su commented on SPARK-26164:
--

cc [~cloud_fan], [~tejasp], and [~sameerag], who might have the most context 
on this. I think this is something we need in Spark. Let me know if it is not 
needed; otherwise I will start working on a PR to address the issue.
