[jira] [Resolved] (SPARK-38168) LikeSimplification handles escape character

2022-02-09 Thread Dooyoung Hwang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-38168?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dooyoung Hwang resolved SPARK-38168.

Resolution: Won't Fix

> LikeSimplification handles escape character
> ---
>
> Key: SPARK-38168
> URL: https://issues.apache.org/jira/browse/SPARK-38168
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.2.0, 3.3.0
>Reporter: Dooyoung Hwang
>Priority: Major
>
> Currently, the LikeSimplification rule in Catalyst is skipped whenever the 
> pattern contains an escape character.
> {noformat}
> SELECT * FROM tbl WHERE c_1 LIKE '%100\%'
> ...
> == Optimized Logical Plan ==
> Filter (isnotnull(c_1#0) && c_1#0 LIKE %100\%)
> +- Relation[c_1#0,c_2#1,c_3#2] ...
> {noformat}
> As a result, the filter LIKE '%100\%' in this query is not optimized into an 
> EndsWith on StringType.
> The LikeSimplification rule could treat a special character (a wildcard %, _, 
> or the escape character itself) as a plain character whenever it immediately 
> follows an escape character.
> With that change, the rule can optimize the filter as shown below.
> {noformat}
> SELECT * FROM tbl WHERE c_1 LIKE '%100\%'
> ...
> == Optimized Logical Plan ==
> Filter (isnotnull(c_1#0) && EndsWith(c_1#0, 100%))
> +- Relation[c_1#0,c_2#1,c_3#2] ...
> {noformat}
>  
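The idea can be illustrated with a small, self-contained Scala sketch. This is 
not Spark's actual LikeSimplification implementation, and unescapeIfLiteral is a 
hypothetical helper; it only shows the escape handling: a character that 
immediately follows the escape character is treated as a literal, so a pattern 
such as '%100\%' can still be rewritten into an EndsWith check.

{noformat}
// Standalone sketch of the proposed escape handling (not Spark code).
object LikePatternSketch {
  // Returns Some(literal) when the pattern contains no unescaped wildcard,
  // resolving escape sequences (\%, \_, \\) to plain characters.
  def unescapeIfLiteral(pattern: String, escape: Char = '\\'): Option[String] = {
    val sb = new StringBuilder
    var i = 0
    while (i < pattern.length) {
      val c = pattern.charAt(i)
      if (c == escape && i + 1 < pattern.length) {
        sb.append(pattern.charAt(i + 1)) // the escaped character is plain
        i += 2
      } else if (c == '%' || c == '_') {
        return None                      // unescaped wildcard: not a plain literal
      } else {
        sb.append(c)
        i += 1
      }
    }
    Some(sb.toString)
  }

  def main(args: Array[String]): Unit = {
    val pattern = "%100\\%"              // the SQL pattern '%100\%'
    if (pattern.startsWith("%")) {
      unescapeIfLiteral(pattern.drop(1)) // Some("100%")
        .foreach(suffix => println(s"rewrite to EndsWith(c_1, $suffix)"))
    }
  }
}
{noformat}

A real rule would also have to cover leading-literal ('abc\%...'), infix and 
exact-match patterns, but the escape handling stays the same in spirit.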






[jira] [Created] (SPARK-38168) LikeSimplification handles escape character

2022-02-09 Thread Dooyoung Hwang (Jira)
Dooyoung Hwang created SPARK-38168:
--

 Summary: LikeSimplification handles escape character
 Key: SPARK-38168
 URL: https://issues.apache.org/jira/browse/SPARK-38168
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.2.0, 3.3.0
Reporter: Dooyoung Hwang


Currently, the LikeSimplification rule in Catalyst is skipped whenever the 
pattern contains an escape character.

{noformat}
SELECT * FROM tbl WHERE c_1 LIKE '%100\%'

...
== Optimized Logical Plan ==
Filter (isnotnull(c_1#0) && c_1#0 LIKE %100\%)
+- Relation[c_1#0,c_2#1,c_3#2] ...
{noformat}
As a result, the filter LIKE '%100\%' in this query is not optimized into an 
EndsWith on StringType.

The LikeSimplification rule could treat a special character (a wildcard %, _, 
or the escape character itself) as a plain character whenever it immediately 
follows an escape character.
With that change, the rule can optimize the filter as shown below.

{noformat}
SELECT * FROM tbl WHERE c_1 LIKE '%100\%'

...
== Optimized Logical Plan ==
Filter (isnotnull(c_1#0) && EndsWith(c_1#0, 100%))
+- Relation[c_1#0,c_2#1,c_3#2] ...
{noformat}
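Whether the rewrite is applied can be checked by inspecting the optimized plan 
with explain. The snippet below is a hypothetical local setup (the view name 
tbl and column c_1 mirror the example above and are not part of any Spark API); 
today it prints the plan with the LIKE kept as-is, and with the proposed change 
it would show an EndsWith instead.

{noformat}
import org.apache.spark.sql.SparkSession

object LikeEscapePlanCheck {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("like-escape-plan-check")
      .getOrCreate()
    import spark.implicits._

    // Sample data; the trailing '%' is the character the pattern escapes.
    Seq("ratio 100%", "ratio 100").toDF("c_1").createOrReplaceTempView("tbl")

    // With default string-literal parsing, "\\\\" in this Scala source reaches
    // the SQL parser as '\\' and the LIKE pattern becomes %100\% .
    spark.sql("SELECT * FROM tbl WHERE c_1 LIKE '%100\\\\%'").explain(true)

    spark.stop()
  }
}
{noformat}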
 






[jira] [Created] (SPARK-33655) Thrift server: FETCH_PRIOR should not cause re-iteration from the start position.

2020-12-03 Thread Dooyoung Hwang (Jira)
Dooyoung Hwang created SPARK-33655:
--

 Summary: Thrift server: FETCH_PRIOR should not cause re-iteration from 
the start position.
 Key: SPARK-33655
 URL: https://issues.apache.org/jira/browse/SPARK-33655
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.1.0
Reporter: Dooyoung Hwang


Currently, when a client sends a FETCH_PRIOR request to the Thrift server, the 
server re-iterates the query result from the start position. Because the Thrift 
server caches the query result in an array, FETCH_PRIOR can be implemented 
without re-iterating the result.
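As a minimal sketch of the idea (plain Scala with hypothetical names, not the 
actual Thrift server code), a cursor over the cached array only needs to move 
its offset backwards to serve FETCH_PRIOR, instead of re-running the iteration:

{noformat}
// Hypothetical cursor over an already-cached result.
class CachedResultCursor[T](cached: IndexedSeq[T]) {
  private var offset = 0

  // FETCH_NEXT: return the next batch and advance the offset.
  def fetchNext(maxRows: Int): Seq[T] = {
    val batch = cached.slice(offset, offset + maxRows)
    offset = math.min(offset + maxRows, cached.length)
    batch
  }

  // FETCH_PRIOR: rewind the offset by one batch and return the rows there;
  // the cached array is reused, so nothing is re-iterated.
  def fetchPrior(maxRows: Int): Seq[T] = {
    offset = math.max(offset - maxRows, 0)
    cached.slice(offset, offset + maxRows)
  }
}

object CachedResultCursorDemo extends App {
  val cursor = new CachedResultCursor((1 to 10).toVector)
  println(cursor.fetchNext(4))   // Vector(1, 2, 3, 4)
  println(cursor.fetchNext(4))   // Vector(5, 6, 7, 8)
  println(cursor.fetchPrior(4))  // Vector(5, 6, 7, 8) -- rewound one batch, served from the cache
}
{noformat}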






[jira] [Updated] (SPARK-25353) executeTake in SparkPlan could decode rows more than necessary.

2018-09-06 Thread Dooyoung Hwang (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25353?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dooyoung Hwang updated SPARK-25353:
---
Description: 
In some cases, executeTake in SparkPlan could decode more rows than necessary.

For example, suppose df.limit(1000).collect() is executed:
  +- executeTake in SparkPlan is called with the argument 1000.
  +- If the total row count collected from the partitions is 2000, executeTake 
decodes all of them and creates an array of InternalRow of size 2000.
  +- The first 1000 rows are sliced off and returned; the remaining 1000 rows 
are never used.

 

  was:
In some cases, executeTake in SparkPlan could decode more than necessary.

For example, df.limit(1000).collect() is executed.
  +- executeTake in SparkPlan is called with arg 1000.
  +- If total rows count from partitions is 2000, executeTake decdoe them 
and create array of InternalRow whose size is 2000.
  +- Slice the first 1000 rows, and return them. 1000 rows in the rear are 
not used.

 


> executeTake in SparkPlan could decode rows more than necessary.
> ---
>
> Key: SPARK-25353
> URL: https://issues.apache.org/jira/browse/SPARK-25353
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.3.1
>Reporter: Dooyoung Hwang
>Priority: Major
>
> In some cases, executeTake in SparkPlan could decode more rows than necessary.
> For example, suppose df.limit(1000).collect() is executed:
>   +- executeTake in SparkPlan is called with the argument 1000.
>   +- If the total row count collected from the partitions is 2000, executeTake 
> decodes all of them and creates an array of InternalRow of size 2000.
>   +- The first 1000 rows are sliced off and returned; the remaining 1000 rows 
> are never used.
>  
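A simplified Scala sketch of the inefficiency (not the actual 
SparkPlan.executeTake code; EncodedRow, InternalRowLike and decode are 
illustrative stand-ins for the per-row decompression and deserialization) and 
of the direction this issue aims at:

{noformat}
object ExecuteTakeSketch {
  final case class EncodedRow(bytes: Array[Byte])
  final case class InternalRowLike(value: String)

  // Stand-in for decompressing/deserializing one collected row.
  def decode(row: EncodedRow): InternalRowLike = InternalRowLike(new String(row.bytes))

  // Behaviour described above: decode every collected row, then slice.
  def takeEager(collected: Seq[EncodedRow], n: Int): Seq[InternalRowLike] =
    collected.map(decode).take(n)                 // decodes all 2000 rows for n = 1000

  // The improvement: stop decoding once n rows have been produced.
  def takeLazy(collected: Seq[EncodedRow], n: Int): Seq[InternalRowLike] =
    collected.iterator.map(decode).take(n).toSeq  // decodes only n rows

  def main(args: Array[String]): Unit = {
    val collected = (1 to 2000).map(i => EncodedRow(s"row-$i".getBytes))
    assert(takeEager(collected, 1000) == takeLazy(collected, 1000)) // same result, less work
  }
}
{noformat}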






[jira] [Updated] (SPARK-25353) executeTake in SparkPlan could decode rows more than necessary.

2018-09-06 Thread Dooyoung Hwang (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25353?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dooyoung Hwang updated SPARK-25353:
---
Description: 
In some cases, executeTake in SparkPlan could decode more than necessary.

For example, df.limit(1000).collect() is executed.
  +- executeTake in SparkPlan is called with arg 1000.
  +- If total rows count from partitions is 2000, executeTake decdoe them 
and create array of InternalRow whose size is 2000.
  +- Slice the first 1000 rows, and return them. 1000 rows in the rear are 
not used.

 

  was:
In some cases, executeTake in SparkPlan could deserialize more than necessary.

For example, df.limit(1000).collect() is executed.
  +- executeTake in SparkPlan is called with arg 1000.
  +- If total rows count from partitions is 2000, executeTake deserialize 
them and create array of InternalRow whose size is 2000.
  +- Slice the first 1000 rows, and return them. 1000 rows in the rear are 
not used.

 


> executeTake in SparkPlan could decode rows more than necessary.
> ---
>
> Key: SPARK-25353
> URL: https://issues.apache.org/jira/browse/SPARK-25353
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.3.1
>Reporter: Dooyoung Hwang
>Priority: Major
>
> In some cases, executeTake in SparkPlan could decode more than necessary.
> For example, df.limit(1000).collect() is executed.
>   +- executeTake in SparkPlan is called with arg 1000.
>   +- If total rows count from partitions is 2000, executeTake decdoe them 
> and create array of InternalRow whose size is 2000.
>   +- Slice the first 1000 rows, and return them. 1000 rows in the rear 
> are not used.
>  






[jira] [Updated] (SPARK-25353) executeTake in SparkPlan could decode rows more than necessary.

2018-09-06 Thread Dooyoung Hwang (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25353?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dooyoung Hwang updated SPARK-25353:
---
Summary: executeTake in SparkPlan could decode rows more than necessary.  
(was: executeTake in SparkPlan could deserialize more than necessary.)

> executeTake in SparkPlan could decode rows more than necessary.
> ---
>
> Key: SPARK-25353
> URL: https://issues.apache.org/jira/browse/SPARK-25353
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.3.1
>Reporter: Dooyoung Hwang
>Priority: Major
>
> In some cases, executeTake in SparkPlan could deserialize more than necessary.
> For example, df.limit(1000).collect() is executed.
>   +- executeTake in SparkPlan is called with arg 1000.
>   +- If total rows count from partitions is 2000, executeTake deserialize 
> them and create array of InternalRow whose size is 2000.
>   +- Slice the first 1000 rows, and return them. 1000 rows in the rear 
> are not used.
>  






[jira] [Updated] (SPARK-25353) executeTake in SparkPlan could deserialize more than necessary.

2018-09-06 Thread Dooyoung Hwang (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25353?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dooyoung Hwang updated SPARK-25353:
---
Summary: executeTake in SparkPlan could deserialize more than necessary.  
(was: executeTake has been modified to avoid unnecessary deserialization.)

> executeTake in SparkPlan could deserialize more than necessary.
> ---
>
> Key: SPARK-25353
> URL: https://issues.apache.org/jira/browse/SPARK-25353
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.3.1
>Reporter: Dooyoung Hwang
>Priority: Major
>
> In some cases, executeTake in SparkPlan could deserialize more than necessary.
> For example, df.limit(1000).collect() is executed.
>   +- executeTake in SparkPlan is called with arg 1000.
>   +- If total rows count from partitions is 2000, executeTake deserialize 
> them and create array of InternalRow whose size is 2000.
>   +- Slice the first 1000 rows, and return them. 1000 rows in the rear 
> are not used.
>  






[jira] [Updated] (SPARK-25353) executeTake has been modified to avoid unnecessary deserialization.

2018-09-06 Thread Dooyoung Hwang (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25353?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dooyoung Hwang updated SPARK-25353:
---
Summary: executeTake has been modified to avoid unnecessary 
deserialization.  (was: Refactoring executeTake(n: Int) in SparkPlan)

> executeTake has been modified to avoid unnecessary deserialization.
> ---
>
> Key: SPARK-25353
> URL: https://issues.apache.org/jira/browse/SPARK-25353
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 2.3.1
>Reporter: Dooyoung Hwang
>Priority: Major
>
> In some cases, executeTake in SparkPlan could deserialize more than necessary.
> For example, df.limit(1000).collect() is executed.
>   +- executeTake in SparkPlan is called with arg 1000.
>   +- If total rows count from partitions is 2000, executeTake deserialize 
> them and create array of InternalRow whose size is 2000.
>   +- Slice the first 1000 rows, and return them. 1000 rows in the rear 
> are not used.
>  






[jira] [Created] (SPARK-25353) Refactoring executeTake(n: Int) in SparkPlan

2018-09-06 Thread Dooyoung Hwang (JIRA)
Dooyoung Hwang created SPARK-25353:
--

 Summary: Refactoring executeTake(n: Int) in SparkPlan
 Key: SPARK-25353
 URL: https://issues.apache.org/jira/browse/SPARK-25353
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 2.3.1
Reporter: Dooyoung Hwang


In some cases, executeTake in SparkPlan could deserialize more rows than necessary.

For example, suppose df.limit(1000).collect() is executed:
  +- executeTake in SparkPlan is called with the argument 1000.
  +- If the total row count collected from the partitions is 2000, executeTake 
deserializes all of them and creates an array of InternalRow of size 2000.
  +- The first 1000 rows are sliced off and returned; the remaining 1000 rows 
are never used.

 






[jira] [Updated] (SPARK-25224) Improvement of Spark SQL ThriftServer memory management

2018-08-24 Thread Dooyoung Hwang (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-25224?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dooyoung Hwang updated SPARK-25224:
---
Description: 
Spark SQL has just two options for managing Thrift server memory: enable 
spark.sql.thriftServer.incrementalCollect or leave it disabled.

*1. The case of enabling spark.sql.thriftServer.incrementalCollect*
*1) Pros :* The Thrift server can handle large output without OOM.

*2) Cons*
 * Performance degradation, because tasks are executed partition by partition.
 * Queries with a count limit are handled inefficiently, because all partitions 
are executed. (executeTake stops scanning once the count limit is collected.)
 * The result cannot be cached for FETCH_FIRST.

*2. The case of disabling spark.sql.thriftServer.incrementalCollect*

*1) Pros :* Good performance for small output.

*2) Cons*
 * Peak memory usage is very high, because decompressed & deserialized rows are 
allocated in a "batch" manner, and OOM can occur for large output.
 * It is difficult to measure the peak memory usage of a query, so configuring 
spark.driver.maxResultSize is very difficult.
 * If the decompressed & deserialized rows fill up the eden area of the JVM 
heap, they are moved to the old generation and increase the likelihood of a 
stop-the-world full GC.

 

The improvement idea is as follows:
 # *The Dataset does not decompress & deserialize the result; it just returns 
the total row count & an iterator to the SQL executor.* By doing that, only the 
compressed, serialized data resides in memory, so the memory usage is not only 
much lower than before but is also controllable via spark.driver.maxResultSize.
 # *After the SQL executor gets the total row count & iterator from the Dataset, 
it can decide, based on the returned row count, whether to collect the rows in a 
batch manner (appropriate for a small row count) or to deserialize and send them 
iteratively (appropriate for a large row count).*

  was:
Spark SQL just have two options for managing thriftserver memory - enable 
spark.sql.thriftServer.incrementalCollect or not
 # *The case of enabling spark.sql.thriftServer.incrementalCollects*
*1)Pros*
  - thriftserver can handle large output without OOM.

*2) Cons*
 - Performance degradation because of executing task partition by partition.
 - Handle queries with count-limit inefficiently because of executing all 
partitions. (executeTake stop scanning after collecting count-limit.)
- Cannot cache result for FETCH_FIRST


 # *The case of disabling spark.sql.thriftServer.incrementalCollects*
*1) Pros*
 - Good performance for small output

*2) Cons*
- Memory peak usage is too large because allocating decompressed & deserialized 
rows in "batch" manner, and OOM could occur for large output.
- It is difficult to measure memory peak usage of Query, so configuring 
spark.driver.maxResultSize is very difficult.
- If decompressed & deserialized rows fills up eden area of JVM Heap, they 
moves to old Gen and could increase possibility of "Full GC" that stops the 
world.

 

The improvement idea is below:
 # *DataSet does not decompress & deserialize result, and just return total row 
count & iterator to SQL-Executor.* By doing that, only uncompressed data reside 
in memory, so that the memory usage is not only much lower than before but is 
configurable with using spark.driver.maxResultSize.


 # *After SQL-Executor get total row count & iterator from DataSet, it could 
decide whether collecting them as batch manner(appropriate for small row count) 
or deserializing and sending them iteratively (appropriate for large row count) 
with considering returned row count.*


> Improvement of Spark SQL ThriftServer memory management
> ---
>
> Key: SPARK-25224
> URL: https://issues.apache.org/jira/browse/SPARK-25224
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.3.1
>Reporter: Dooyoung Hwang
>Priority: Major
>
> Spark SQL just have two options for managing thriftserver memory - enable 
> spark.sql.thriftServer.incrementalCollect or not
> *1. The case of enabling spark.sql.thriftServer.incrementalCollects*
> *1) Pros :* thriftserver can handle large output without OOM.
> *2) Cons*
>  * Performance degradation because of executing task partition by partition.
>  * Handle queries with count-limit inefficiently because of executing all 
> partitions. (executeTake stop scanning after collecting count-limit.)
>  * Cannot cache result for FETCH_FIRST
> *2. The case of disabling spark.sql.thriftServer.incrementalCollects*
> *1) Pros :* Good performance for small output
> *2) Cons*
>  * Memory peak usage is too large because allocating decompressed & 
> deserialized rows in "batch" manner, and OOM could occur for large output.
>  * It is difficult to measure memory peak usage of Query, so configuring 
> spark.driver.maxResultSize is very difficult.
>  * If decompressed & 

[jira] [Created] (SPARK-25224) Improvement of Spark SQL ThriftServer memory management

2018-08-24 Thread Dooyoung Hwang (JIRA)
Dooyoung Hwang created SPARK-25224:
--

 Summary: Improvement of Spark SQL ThriftServer memory management
 Key: SPARK-25224
 URL: https://issues.apache.org/jira/browse/SPARK-25224
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 2.3.1
Reporter: Dooyoung Hwang


Spark SQL has just two options for managing Thrift server memory: enable 
spark.sql.thriftServer.incrementalCollect or leave it disabled.
 # *The case of enabling spark.sql.thriftServer.incrementalCollect*
*1) Pros*
  - The Thrift server can handle large output without OOM.

*2) Cons*
 - Performance degradation, because tasks are executed partition by partition.
 - Queries with a count limit are handled inefficiently, because all partitions 
are executed. (executeTake stops scanning once the count limit is collected.)
 - The result cannot be cached for FETCH_FIRST.


 # *The case of disabling spark.sql.thriftServer.incrementalCollect*
*1) Pros*
 - Good performance for small output.

*2) Cons*
 - Peak memory usage is very high, because decompressed & deserialized rows are 
allocated in a "batch" manner, and OOM can occur for large output.
 - It is difficult to measure the peak memory usage of a query, so configuring 
spark.driver.maxResultSize is very difficult.
 - If the decompressed & deserialized rows fill up the eden area of the JVM 
heap, they are moved to the old generation and increase the likelihood of a 
stop-the-world full GC.

 

The improvement idea is as follows (see the sketch after this list):
 # *The Dataset does not decompress & deserialize the result; it just returns 
the total row count & an iterator to the SQL executor.* By doing that, only the 
compressed, serialized data resides in memory, so the memory usage is not only 
much lower than before but is also controllable via spark.driver.maxResultSize.


 # *After the SQL executor gets the total row count & iterator from the Dataset, 
it can decide, based on the returned row count, whether to collect the rows in a 
batch manner (appropriate for a small row count) or to deserialize and send them 
iteratively (appropriate for a large row count).*
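A compact Scala sketch of the decision described above (illustrative only; 
SerializedRow, CollectLimitThreshold, fetchAsBatch and fetchIteratively are 
hypothetical names, not Spark or Thrift server APIs): the executor receives the 
row count plus an iterator over still-serialized rows, and only then chooses 
between batch collection and incremental streaming.

{noformat}
// Hypothetical type: rows stay compressed/serialized until they are sent.
final case class SerializedRow(bytes: Array[Byte])

object ThriftServerCollectSketch {
  // Threshold below which batch collection is considered safe; a made-up knob.
  val CollectLimitThreshold: Long = 10000L

  def fetchAsBatch(rows: Iterator[SerializedRow]): Seq[SerializedRow] = rows.toSeq

  def fetchIteratively(rows: Iterator[SerializedRow])(send: SerializedRow => Unit): Unit =
    rows.foreach(send)  // deserialize-and-send would happen here, one row at a time

  // The Dataset side returns (totalRowCount, iterator) without decompressing
  // or deserializing anything; the executor side decides how to consume it.
  def collectResult(totalRowCount: Long, rows: Iterator[SerializedRow]): Unit = {
    if (totalRowCount <= CollectLimitThreshold) {
      val batch = fetchAsBatch(rows)            // small result: batch manner
      println(s"collected ${batch.size} rows in one batch")
    } else {
      var sent = 0L
      fetchIteratively(rows) { _ => sent += 1 } // large result: stream row by row
      println(s"streamed $sent rows iteratively")
    }
  }

  def main(args: Array[String]): Unit = {
    val rows = (1 to 500).map(i => SerializedRow(Array[Byte](i.toByte)))
    collectResult(rows.size.toLong, rows.iterator)
  }
}
{noformat}

The point of the sketch is only that nothing is decompressed or deserialized 
before the decision is made; the real decision point and threshold would live 
in the Thrift server's result-collection path.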


