[jira] [Resolved] (SPARK-38168) LikeSimplification handles escape character
[ https://issues.apache.org/jira/browse/SPARK-38168?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dooyoung Hwang resolved SPARK-38168.
------------------------------------
    Resolution: Won't Fix

> LikeSimplification handles escape character
> -------------------------------------------
>
> Key: SPARK-38168
> URL: https://issues.apache.org/jira/browse/SPARK-38168
> Project: Spark
> Issue Type: Improvement
> Components: SQL
> Affects Versions: 3.2.0, 3.3.0
> Reporter: Dooyoung Hwang
> Priority: Major
>
> Currently, the LikeSimplification rule of Catalyst is skipped if the pattern contains an escape character.
> {noformat}
> SELECT * FROM tbl WHERE c_1 LIKE '%100\%'
> ...
> == Optimized Logical Plan ==
> Filter (isnotnull(c_1#0) && c_1#0 LIKE %100\%)
> +- Relation[c_1#0,c_2#1,c_3#2] ...
> {noformat}
> The filter LIKE '%100\%' in this query is not optimized into an 'EndsWith' on StringType.
> The LikeSimplification rule could treat a special character (a wildcard % or _, or the escape character itself) as a plain character whenever it follows an escape character.
> By doing that, the LikeSimplification rule could optimize the filter as below:
> {noformat}
> SELECT * FROM tbl WHERE c_1 LIKE '%100\%'
> ...
> == Optimized Logical Plan ==
> Filter (isnotnull(c_1#0) && EndsWith(c_1#0, 100%))
> +- Relation[c_1#0,c_2#1,c_3#2]
> {noformat}

--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-38168) LikeSimplification handles escape character
Dooyoung Hwang created SPARK-38168:
--------------------------------------

    Summary: LikeSimplification handles escape character
        Key: SPARK-38168
        URL: https://issues.apache.org/jira/browse/SPARK-38168
    Project: Spark
    Issue Type: Improvement
    Components: SQL
    Affects Versions: 3.2.0, 3.3.0
    Reporter: Dooyoung Hwang

Currently, the LikeSimplification rule of Catalyst is skipped if the pattern contains an escape character.
{noformat}
SELECT * FROM tbl WHERE c_1 LIKE '%100\%'
...
== Optimized Logical Plan ==
Filter (isnotnull(c_1#0) && c_1#0 LIKE %100\%)
+- Relation[c_1#0,c_2#1,c_3#2] ...
{noformat}
The filter LIKE '%100\%' in this query is not optimized into an 'EndsWith' on StringType.
The LikeSimplification rule could treat a special character (a wildcard % or _, or the escape character itself) as a plain character whenever it follows an escape character. By doing that, the LikeSimplification rule could optimize the filter as below:
{noformat}
SELECT * FROM tbl WHERE c_1 LIKE '%100\%'
...
== Optimized Logical Plan ==
Filter (isnotnull(c_1#0) && EndsWith(c_1#0, 100%))
+- Relation[c_1#0,c_2#1,c_3#2]
{noformat}
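The escape-aware scan proposed above can be sketched in a few lines. The following is a hypothetical Python illustration of the idea, not Spark's actual Scala implementation; `simplify_like` and its return convention are invented for the example.

```python
# Hypothetical sketch of escape-aware LIKE simplification: any character that
# follows the escape character (including % and _) is treated as a plain
# literal, so patterns like '%100\%' can still map to an EndsWith predicate.
def simplify_like(pattern, escape='\\'):
    """Return ('equals'|'startsWith'|'endsWith'|'contains', literal) or None."""
    chars = []               # decoded pattern: literals, or None for a '%'
    wildcard_positions = []  # indices in `chars` holding a '%'
    i = 0
    while i < len(pattern):
        c = pattern[i]
        if c == escape and i + 1 < len(pattern):
            chars.append(pattern[i + 1])  # escaped char is a plain literal
            i += 2
            continue
        if c == '_':
            return None                   # single-char wildcard: give up
        if c == '%':
            wildcard_positions.append(len(chars))
            chars.append(None)
            i += 1
            continue
        chars.append(c)
        i += 1
    literal = ''.join(c for c in chars if c is not None)
    if not wildcard_positions:
        return ('equals', literal)
    if wildcard_positions == [0]:
        return ('endsWith', literal)          # '%foo'  -> EndsWith(foo)
    if wildcard_positions == [len(chars) - 1]:
        return ('startsWith', literal)        # 'foo%'  -> StartsWith(foo)
    if wildcard_positions == [0, len(chars) - 1]:
        return ('contains', literal)          # '%foo%' -> Contains(foo)
    return None  # anything more complex is left to the regular LIKE path
```

With this treatment, the pattern from the ticket, `'%100\%'`, simplifies to `('endsWith', '100%')` instead of being skipped. Invalid trailing escapes and multi-wildcard patterns are deliberately not handled here.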
[jira] [Created] (SPARK-33655) Thrift server: FETCH_PRIOR should not re-iterate from the start position.
Dooyoung Hwang created SPARK-33655:
--------------------------------------

    Summary: Thrift server: FETCH_PRIOR should not re-iterate from the start position.
        Key: SPARK-33655
        URL: https://issues.apache.org/jira/browse/SPARK-33655
    Project: Spark
    Issue Type: Improvement
    Components: SQL
    Affects Versions: 3.1.0
    Reporter: Dooyoung Hwang

Currently, when a client sends a FETCH_PRIOR request to the Thrift server, the server re-iterates the result from the start position. Because the Thrift server caches a query result in an array, FETCH_PRIOR can be implemented without re-iterating the result.
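The observation above can be sketched as a cursor over the cached array. This is a hedged illustration in Python with invented names (`CachedResultCursor`, `fetch_next`, `fetch_prior`), not the Thrift server's actual code; it only shows that moving backwards is index arithmetic once the result is cached.

```python
# Sketch: if the server already holds the full result in an array, FETCH_PRIOR
# is just moving a window backwards over that array -- no re-iteration from
# row 0 is needed.
class CachedResultCursor:
    def __init__(self, rows):
        self._rows = rows
        self._start = 0  # start index of the next batch to fetch

    def fetch_next(self, max_rows):
        """FETCH_NEXT: return the next batch and advance the window."""
        batch = self._rows[self._start:self._start + max_rows]
        self._start += len(batch)
        return batch

    def fetch_prior(self, max_rows):
        """FETCH_PRIOR: return the batch before the most recently fetched one.

        The last returned batch started at _start - max_rows, so the prior
        batch starts one more step back; clamp at the beginning.
        """
        self._start = max(0, self._start - 2 * max_rows)
        return self.fetch_next(batch if False else max_rows)
```

For example, after fetching rows 0-2 and 3-5 in two batches of three, a FETCH_PRIOR returns rows 0-2 again without touching the underlying iterator.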
[jira] [Updated] (SPARK-25353) executeTake in SparkPlan could decode rows more than necessary.
[ https://issues.apache.org/jira/browse/SPARK-25353?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dooyoung Hwang updated SPARK-25353:
-----------------------------------
Description:
In some cases, executeTake in SparkPlan could decode more than necessary.
For example, df.limit(1000).collect() is executed.
+- executeTake in SparkPlan is called with arg 1000.
+- If total rows count from partitions is 2000, executeTake decode them and create array of InternalRow whose size is 2000.
+- Slice the first 1000 rows, and return them. 1000 rows in the rear are not used.

was:
In some cases, executeTake in SparkPlan could decode more than necessary.
For example, df.limit(1000).collect() is executed.
+- executeTake in SparkPlan is called with arg 1000.
+- If total rows count from partitions is 2000, executeTake decdoe them and create array of InternalRow whose size is 2000.
+- Slice the first 1000 rows, and return them. 1000 rows in the rear are not used.

> executeTake in SparkPlan could decode rows more than necessary.
> ---------------------------------------------------------------
>
> Key: SPARK-25353
> URL: https://issues.apache.org/jira/browse/SPARK-25353
> Project: Spark
> Issue Type: Sub-task
> Components: SQL
> Affects Versions: 2.3.1
> Reporter: Dooyoung Hwang
> Priority: Major
>
> In some cases, executeTake in SparkPlan could decode more than necessary.
> For example, df.limit(1000).collect() is executed.
> +- executeTake in SparkPlan is called with arg 1000.
> +- If total rows count from partitions is 2000, executeTake decode them and create array of InternalRow whose size is 2000.
> +- Slice the first 1000 rows, and return them. 1000 rows in the rear are not used.
[jira] [Updated] (SPARK-25353) executeTake in SparkPlan could decode rows more than necessary.
[ https://issues.apache.org/jira/browse/SPARK-25353?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dooyoung Hwang updated SPARK-25353:
-----------------------------------
Description:
In some cases, executeTake in SparkPlan could decode more than necessary.
For example, df.limit(1000).collect() is executed.
+- executeTake in SparkPlan is called with arg 1000.
+- If total rows count from partitions is 2000, executeTake decdoe them and create array of InternalRow whose size is 2000.
+- Slice the first 1000 rows, and return them. 1000 rows in the rear are not used.

was:
In some cases, executeTake in SparkPlan could deserialize more than necessary.
For example, df.limit(1000).collect() is executed.
+- executeTake in SparkPlan is called with arg 1000.
+- If total rows count from partitions is 2000, executeTake deserialize them and create array of InternalRow whose size is 2000.
+- Slice the first 1000 rows, and return them. 1000 rows in the rear are not used.

> executeTake in SparkPlan could decode rows more than necessary.
> ---------------------------------------------------------------
>
> Key: SPARK-25353
> URL: https://issues.apache.org/jira/browse/SPARK-25353
> Project: Spark
> Issue Type: Sub-task
> Components: SQL
> Affects Versions: 2.3.1
> Reporter: Dooyoung Hwang
> Priority: Major
>
> In some cases, executeTake in SparkPlan could decode more than necessary.
> For example, df.limit(1000).collect() is executed.
> +- executeTake in SparkPlan is called with arg 1000.
> +- If total rows count from partitions is 2000, executeTake decdoe them and create array of InternalRow whose size is 2000.
> +- Slice the first 1000 rows, and return them. 1000 rows in the rear are not used.
[jira] [Updated] (SPARK-25353) executeTake in SparkPlan could decode rows more than necessary.
[ https://issues.apache.org/jira/browse/SPARK-25353?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dooyoung Hwang updated SPARK-25353:
-----------------------------------
    Summary: executeTake in SparkPlan could decode rows more than necessary. (was: executeTake in SparkPlan could deserialize more than necessary.)

> executeTake in SparkPlan could decode rows more than necessary.
> ---------------------------------------------------------------
>
> Key: SPARK-25353
> URL: https://issues.apache.org/jira/browse/SPARK-25353
> Project: Spark
> Issue Type: Sub-task
> Components: SQL
> Affects Versions: 2.3.1
> Reporter: Dooyoung Hwang
> Priority: Major
>
> In some cases, executeTake in SparkPlan could deserialize more than necessary.
> For example, df.limit(1000).collect() is executed.
> +- executeTake in SparkPlan is called with arg 1000.
> +- If total rows count from partitions is 2000, executeTake deserialize them and create array of InternalRow whose size is 2000.
> +- Slice the first 1000 rows, and return them. 1000 rows in the rear are not used.
[jira] [Updated] (SPARK-25353) executeTake in SparkPlan could deserialize more than necessary.
[ https://issues.apache.org/jira/browse/SPARK-25353?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dooyoung Hwang updated SPARK-25353:
-----------------------------------
    Summary: executeTake in SparkPlan could deserialize more than necessary. (was: executeTake has been modified to avoid unnecessary deserialization.)

> executeTake in SparkPlan could deserialize more than necessary.
> ---------------------------------------------------------------
>
> Key: SPARK-25353
> URL: https://issues.apache.org/jira/browse/SPARK-25353
> Project: Spark
> Issue Type: Sub-task
> Components: SQL
> Affects Versions: 2.3.1
> Reporter: Dooyoung Hwang
> Priority: Major
>
> In some cases, executeTake in SparkPlan could deserialize more than necessary.
> For example, df.limit(1000).collect() is executed.
> +- executeTake in SparkPlan is called with arg 1000.
> +- If total rows count from partitions is 2000, executeTake deserialize them and create array of InternalRow whose size is 2000.
> +- Slice the first 1000 rows, and return them. 1000 rows in the rear are not used.
[jira] [Updated] (SPARK-25353) executeTake has been modified to avoid unnecessary deserialization.
[ https://issues.apache.org/jira/browse/SPARK-25353?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dooyoung Hwang updated SPARK-25353:
-----------------------------------
    Summary: executeTake has been modified to avoid unnecessary deserialization. (was: Refactoring executeTake(n: Int) in SparkPlan)

> executeTake has been modified to avoid unnecessary deserialization.
> -------------------------------------------------------------------
>
> Key: SPARK-25353
> URL: https://issues.apache.org/jira/browse/SPARK-25353
> Project: Spark
> Issue Type: Sub-task
> Components: SQL
> Affects Versions: 2.3.1
> Reporter: Dooyoung Hwang
> Priority: Major
>
> In some cases, executeTake in SparkPlan could deserialize more than necessary.
> For example, df.limit(1000).collect() is executed.
> +- executeTake in SparkPlan is called with arg 1000.
> +- If total rows count from partitions is 2000, executeTake deserialize them and create array of InternalRow whose size is 2000.
> +- Slice the first 1000 rows, and return them. 1000 rows in the rear are not used.
[jira] [Created] (SPARK-25353) Refactoring executeTake(n: Int) in SparkPlan
Dooyoung Hwang created SPARK-25353:
--------------------------------------

    Summary: Refactoring executeTake(n: Int) in SparkPlan
        Key: SPARK-25353
        URL: https://issues.apache.org/jira/browse/SPARK-25353
    Project: Spark
    Issue Type: Sub-task
    Components: SQL
    Affects Versions: 2.3.1
    Reporter: Dooyoung Hwang

In some cases, executeTake in SparkPlan could deserialize more rows than necessary. For example, when df.limit(1000).collect() is executed:
+- executeTake in SparkPlan is called with arg 1000.
+- If the total row count from the partitions is 2000, executeTake deserializes them all and creates an array of InternalRow whose size is 2000.
+- The first 1000 rows are sliced off and returned; the 1000 rows at the rear are never used.
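The fix suggested by the steps above is to decode lazily and stop at n, rather than decoding everything and slicing afterwards. The following Python sketch is illustrative only; `execute_take` and `decode` are hypothetical stand-ins for Spark's Scala internals.

```python
# Sketch: decode rows lazily across the already-collected partition data and
# stop as soon as n rows have been produced, so trailing rows are never decoded.
import itertools

def execute_take(encoded_partitions, n, decode):
    """Decode at most n rows from the collected partitions."""
    # Generator expression: decode() runs only when a row is actually consumed.
    decoded = (decode(row) for part in encoded_partitions for row in part)
    return list(itertools.islice(decoded, n))
```

With 2000 collected rows and n = 1000, `decode` runs exactly 1000 times instead of 2000, matching the scenario in the description.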
[jira] [Updated] (SPARK-25224) Improvement of Spark SQL ThriftServer memory management
[ https://issues.apache.org/jira/browse/SPARK-25224?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dooyoung Hwang updated SPARK-25224:
-----------------------------------
Description:
Spark SQL has just two options for managing Thrift server memory - enabling spark.sql.thriftServer.incrementalCollect or not.

*1. The case of enabling spark.sql.thriftServer.incrementalCollect*
*1) Pros:* The Thrift server can handle large output without OOM.
*2) Cons*
* Performance degradation, because tasks are executed partition by partition.
* Queries with a count-limit are handled inefficiently, because all partitions are executed (executeTake, by contrast, stops scanning after collecting the count-limit).
* The result cannot be cached for FETCH_FIRST.

*2. The case of disabling spark.sql.thriftServer.incrementalCollect*
*1) Pros:* Good performance for small output.
*2) Cons*
* Peak memory usage is too large, because decompressed & deserialized rows are allocated in a "batch" manner, so OOM can occur for large output.
* It is difficult to measure the peak memory usage of a query, so configuring spark.driver.maxResultSize is very difficult.
* If decompressed & deserialized rows fill up the eden area of the JVM heap, they move to the old generation and increase the likelihood of a stop-the-world "Full GC".

The improvement idea is below:
# *DataSet does not decompress & deserialize the result, and just returns the total row count & an iterator to the SQL executor.* By doing that, only the compressed data resides in memory, so the memory usage is not only much lower than before but also configurable via spark.driver.maxResultSize.
# *After the SQL executor gets the total row count & iterator from the DataSet, it can decide - considering the returned row count - whether to collect the rows in a batch manner (appropriate for a small row count) or to deserialize and send them iteratively (appropriate for a large row count).*

was:
Spark SQL just have two options for managing thriftserver memory - enable spark.sql.thriftServer.incrementalCollect or not
# *The case of enabling spark.sql.thriftServer.incrementalCollects*
*1)Pros* - thriftserver can handle large output without OOM.
*2) Cons*
- Performance degradation because of executing task partition by partition.
- Handle queries with count-limit inefficiently because of executing all partitions. (executeTake stop scanning after collecting count-limit.)
- Cannot cache result for FETCH_FIRST
# *The case of disabling spark.sql.thriftServer.incrementalCollects*
*1) Pros* - Good performance for small output
*2) Cons*
- Memory peak usage is too large because allocating decompressed & deserialized rows in "batch" manner, and OOM could occur for large output.
- It is difficult to measure memory peak usage of Query, so configuring spark.driver.maxResultSize is very difficult.
- If decompressed & deserialized rows fills up eden area of JVM Heap, they moves to old Gen and could increase possibility of "Full GC" that stops the world.
The improvement idea is below:
# *DataSet does not decompress & deserialize result, and just return total row count & iterator to SQL-Executor.* By doing that, only uncompressed data reside in memory, so that the memory usage is not only much lower than before but is configurable with using spark.driver.maxResultSize.
# *After SQL-Executor get total row count & iterator from DataSet, it could decide whether collecting them as batch manner(appropriate for small row count) or deserializing and sending them iteratively (appropriate for large row count) with considering returned row count.*

> Improvement of Spark SQL ThriftServer memory management
> --------------------------------------------------------
>
> Key: SPARK-25224
> URL: https://issues.apache.org/jira/browse/SPARK-25224
> Project: Spark
> Issue Type: Improvement
> Components: SQL
> Affects Versions: 2.3.1
> Reporter: Dooyoung Hwang
> Priority: Major
>
> Spark SQL has just two options for managing Thrift server memory - enabling spark.sql.thriftServer.incrementalCollect or not.
> *1. The case of enabling spark.sql.thriftServer.incrementalCollect*
> *1) Pros:* The Thrift server can handle large output without OOM.
> *2) Cons*
> * Performance degradation, because tasks are executed partition by partition.
> * Queries with a count-limit are handled inefficiently, because all partitions are executed (executeTake, by contrast, stops scanning after collecting the count-limit).
> * The result cannot be cached for FETCH_FIRST.
> *2. The case of disabling spark.sql.thriftServer.incrementalCollect*
> *1) Pros:* Good performance for small output.
> *2) Cons*
> * Peak memory usage is too large, because decompressed & deserialized rows are allocated in a "batch" manner, so OOM can occur for large output.
> * It is difficult to measure the peak memory usage of a query, so configuring spark.driver.maxResultSize is very difficult.
> * If decompressed & deserialized rows fill up the eden area of the JVM heap, they move to the old generation and increase the likelihood of a stop-the-world "Full GC".
[jira] [Created] (SPARK-25224) Improvement of Spark SQL ThriftServer memory management
Dooyoung Hwang created SPARK-25224:
--------------------------------------

    Summary: Improvement of Spark SQL ThriftServer memory management
        Key: SPARK-25224
        URL: https://issues.apache.org/jira/browse/SPARK-25224
    Project: Spark
    Issue Type: Improvement
    Components: SQL
    Affects Versions: 2.3.1
    Reporter: Dooyoung Hwang

Spark SQL has just two options for managing Thrift server memory - enabling spark.sql.thriftServer.incrementalCollect or not.

# *The case of enabling spark.sql.thriftServer.incrementalCollect*
*1) Pros* - The Thrift server can handle large output without OOM.
*2) Cons*
- Performance degradation, because tasks are executed partition by partition.
- Queries with a count-limit are handled inefficiently, because all partitions are executed (executeTake, by contrast, stops scanning after collecting the count-limit).
- The result cannot be cached for FETCH_FIRST.

# *The case of disabling spark.sql.thriftServer.incrementalCollect*
*1) Pros* - Good performance for small output.
*2) Cons*
- Peak memory usage is too large, because decompressed & deserialized rows are allocated in a "batch" manner, so OOM can occur for large output.
- It is difficult to measure the peak memory usage of a query, so configuring spark.driver.maxResultSize is very difficult.
- If decompressed & deserialized rows fill up the eden area of the JVM heap, they move to the old generation and increase the likelihood of a stop-the-world "Full GC".

The improvement idea is below:
# *DataSet does not decompress & deserialize the result, and just returns the total row count & an iterator to the SQL executor.* By doing that, only the compressed data resides in memory, so the memory usage is not only much lower than before but also configurable via spark.driver.maxResultSize.
# *After the SQL executor gets the total row count & iterator from the DataSet, it can decide - considering the returned row count - whether to collect the rows in a batch manner (appropriate for a small row count) or to deserialize and send them iteratively (appropriate for a large row count).*
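The decision step in the second item can be sketched briefly. The Python snippet below is a rough illustration of the proposed flow, not Spark's actual API; `collect_result` and the `batch_threshold` value are assumptions invented for the example.

```python
# Sketch of the proposed driver-side flow: the result stays serialized, the
# driver sees only (row_count, iterator), and the consumer picks a collection
# strategy from the count.
def collect_result(row_count, row_iter, batch_threshold=10_000):
    """Pick a materialization strategy based on the reported row count."""
    if row_count <= batch_threshold:
        # Small result: deserializing everything at once is cheap and fast.
        return list(row_iter)
    # Large result: hand back the iterator so rows are deserialized and sent
    # to the client incrementally, bounding peak memory on the driver.
    return row_iter
```

Because rows are deserialized only as they are consumed in the large-result branch, peak driver memory tracks the batch size rather than the full result size, which is what makes spark.driver.maxResultSize meaningful again.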