[jira] [Commented] (SPARK-33275) ANSI mode: runtime errors instead of returning null
[ https://issues.apache.org/jira/browse/SPARK-33275?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17250261#comment-17250261 ]

Leanken.Lin commented on SPARK-33275:
-------------------------------------

[~Chongguang] Yes, please go ahead.

> ANSI mode: runtime errors instead of returning null
> ---------------------------------------------------
>
>                 Key: SPARK-33275
>                 URL: https://issues.apache.org/jira/browse/SPARK-33275
>             Project: Spark
>          Issue Type: New Feature
>          Components: SQL
>    Affects Versions: 3.1.0
>            Reporter: Wenchen Fan
>            Priority: Major
>
> We should respect the ANSI mode in more places. What we have done so far is
> mostly overflow checking in various operators. This ticket tracks a category
> of ANSI mode behaviors: operators should throw runtime errors instead of
> returning null when the input is invalid.

--
This message was sent by Atlassian Jira
(v8.3.4#803005)

To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
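The contrast the ticket describes — legacy behavior returns null for invalid input, ANSI mode raises a runtime error — can be sketched with a toy, non-Spark version of one-based array access; `elementAt` and its `ansiEnabled` flag are illustrative names, not Spark's implementation:

```java
// Toy model of the behavior change described above; not Spark code.
public class AnsiSketch {
    // One-based lookup, mirroring SQL's element_at. Under ANSI mode an
    // out-of-bound index raises; otherwise the result is null.
    static Integer elementAt(int[] arr, int index, boolean ansiEnabled) {
        if (index < 1 || index > arr.length) {
            if (ansiEnabled) {
                throw new ArrayIndexOutOfBoundsException(
                    "Invalid index: " + index + ", numElements: " + arr.length);
            }
            return null; // legacy behavior: invalid input yields null
        }
        return arr[index - 1];
    }
}
```

The same pattern generalizes to the other operators the umbrella ticket covers: the ANSI flag only decides whether the invalid-input branch throws or yields null.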
[jira] [Updated] (SPARK-33619) GetMapValueUtil code generation error
[ https://issues.apache.org/jira/browse/SPARK-33619?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Leanken.Lin updated SPARK-33619:
--------------------------------
    Description: 
```
In SPARK-33460 there is a bug as follows: GetMapValueUtil generates wrong code when ANSI mode is on.

s"""throw new NoSuchElementException("Key " + $eval2 + " does not exist.");"""

should be

s"""throw new java.util.NoSuchElementException("Key " + $eval2 + " does not exist.");"""

But why can neither checkExceptionInExpression[Exception](expr, errMsg) nor sql/testOnly org.apache.spark.sql.SQLQueryTestSuite -- -z ansi/map.sql detect this bug?

It's because:
1. checkExceptionInExpression is somewhat broken too; it should wrap the check with withSQLConf(SQLConf.CODEGEN_FACTORY_MODE.key, ...) like checkEvaluation does, AND WE SHOULD FIX this later.
2. SQLQueryTestSuite ansi/map.sql fails to detect it because of the ConstantFolding rule; it calls eval instead of the generated code in this case.
```

  was:
```
In SPARK-33460 there is a bug as follows: GetMapValueUtil generates wrong code when ANSI mode is on.

s"""throw new NoSuchElementException("Key " + $eval2 + " does not exist.");"""

should be

s"""throw new java.util.NoSuchElementException("Key " + $eval2 + " does not exist.");"""

But why can neither checkExceptionInExpression[Exception](expr, errMsg) nor sql/testOnly org.apache.spark.sql.SQLQueryTestSuite -- -z ansi/map.sql detect this bug?

It's because:
1. checkExceptionInExpression is somewhat broken too; it should wrap the check with withSQLConf(SQLConf.CODEGEN_FACTORY_MODE.key, ...) like checkEvaluation does.
2. SQLQueryTestSuite ansi/map.sql fails to detect it because of the ConstantFolding rule; it calls eval instead of the generated code in this case.
```

> GetMapValueUtil code generation error
> -------------------------------------
>
>                 Key: SPARK-33619
>                 URL: https://issues.apache.org/jira/browse/SPARK-33619
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 3.1.0
>            Reporter: Leanken.Lin
>            Priority: Major
>
> ```
> In SPARK-33460 there is a bug as follows: GetMapValueUtil generates wrong
> code when ANSI mode is on.
>
> s"""throw new NoSuchElementException("Key " + $eval2 + " does not exist.");"""
> should be
> s"""throw new java.util.NoSuchElementException("Key " + $eval2 + " does not
> exist.");"""
>
> But why can neither checkExceptionInExpression[Exception](expr, errMsg) nor
> sql/testOnly org.apache.spark.sql.SQLQueryTestSuite -- -z ansi/map.sql
> detect this bug?
>
> It's because:
> 1. checkExceptionInExpression is somewhat broken too; it should wrap the
> check with withSQLConf(SQLConf.CODEGEN_FACTORY_MODE.key, ...) like
> checkEvaluation does, AND WE SHOULD FIX this later.
> 2. SQLQueryTestSuite ansi/map.sql fails to detect it because of the
> ConstantFolding rule; it calls eval instead of the generated code in this
> case.
> ```
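Why the fully qualified name matters: NoSuchElementException lives in java.util, not java.lang, so a Java source file compiled without a java.util import (as generated code typically is) cannot reference it unqualified. A minimal hand-written stand-in for the generated lookup — `getMapValue` here is an illustrative name, not Spark's actual codegen output:

```java
// Minimal stand-in for the generated map lookup; not Spark's codegen output.
// This file deliberately has no "import java.util.*;", so the exception
// class must be written fully qualified, just as the fix does.
public class GeneratedSketch {
    static int getMapValue(java.util.Map<String, Integer> map, String key) {
        Integer v = map.get(key);
        if (v == null) {
            // An unqualified "NoSuchElementException" would not compile here:
            // the class is in java.util, which is not imported implicitly.
            throw new java.util.NoSuchElementException("Key " + key + " does not exist.");
        }
        return v;
    }
}
```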
[jira] [Updated] (SPARK-33619) GetMapValueUtil code generation error
[ https://issues.apache.org/jira/browse/SPARK-33619?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Leanken.Lin updated SPARK-33619:
--------------------------------
    Description: 
```
In SPARK-33460 there is a bug as follows: GetMapValueUtil generates wrong code when ANSI mode is on.

s"""throw new NoSuchElementException("Key " + $eval2 + " does not exist.");"""

should be

s"""throw new java.util.NoSuchElementException("Key " + $eval2 + " does not exist.");"""

But why can neither checkExceptionInExpression[Exception](expr, errMsg) nor sql/testOnly org.apache.spark.sql.SQLQueryTestSuite -- -z ansi/map.sql detect this bug?

It's because:
1. checkExceptionInExpression is somewhat broken too; it should wrap the check with withSQLConf(SQLConf.CODEGEN_FACTORY_MODE.key, ...) like checkEvaluation does.
2. SQLQueryTestSuite ansi/map.sql fails to detect it because of the ConstantFolding rule; it calls eval instead of the generated code in this case.
```

  was:
```
```

> GetMapValueUtil code generation error
> -------------------------------------
>
>                 Key: SPARK-33619
>                 URL: https://issues.apache.org/jira/browse/SPARK-33619
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 3.1.0
>            Reporter: Leanken.Lin
>            Priority: Major
>
> ```
> In SPARK-33460 there is a bug as follows: GetMapValueUtil generates wrong
> code when ANSI mode is on.
>
> s"""throw new NoSuchElementException("Key " + $eval2 + " does not exist.");"""
> should be
> s"""throw new java.util.NoSuchElementException("Key " + $eval2 + " does not
> exist.");"""
>
> But why can neither checkExceptionInExpression[Exception](expr, errMsg) nor
> sql/testOnly org.apache.spark.sql.SQLQueryTestSuite -- -z ansi/map.sql
> detect this bug?
>
> It's because:
> 1. checkExceptionInExpression is somewhat broken too; it should wrap the
> check with withSQLConf(SQLConf.CODEGEN_FACTORY_MODE.key, ...) like
> checkEvaluation does.
> 2. SQLQueryTestSuite ansi/map.sql fails to detect it because of the
> ConstantFolding rule; it calls eval instead of the generated code in this
> case.
> ```
[jira] [Created] (SPARK-33619) GetMapValueUtil code generation error
Leanken.Lin created SPARK-33619:
-----------------------------------

             Summary: GetMapValueUtil code generation error
                 Key: SPARK-33619
                 URL: https://issues.apache.org/jira/browse/SPARK-33619
             Project: Spark
          Issue Type: Bug
          Components: SQL
    Affects Versions: 3.1.0
            Reporter: Leanken.Lin


```
```
[jira] [Updated] (SPARK-33498) Datetime parsing should fail if the input string can't be parsed, or the pattern string is invalid
[ https://issues.apache.org/jira/browse/SPARK-33498?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Leanken.Lin updated SPARK-33498:
--------------------------------
    Summary: Datetime parsing should fail if the input string can't be parsed, or the pattern string is invalid  (was: Datetime parsing should fail if the input string can't be parsed, or he pattern string is invalid)

> Datetime parsing should fail if the input string can't be parsed, or the
> pattern string is invalid
> ------------------------------------------------------------------------
>
>                 Key: SPARK-33498
>                 URL: https://issues.apache.org/jira/browse/SPARK-33498
>             Project: Spark
>          Issue Type: Sub-task
>          Components: SQL
>    Affects Versions: 3.1.0
>            Reporter: Leanken.Lin
>            Priority: Major
>
> Datetime parsing should fail if the input string can't be parsed, or he
> pattern string is invalid, when ANSI mode is enable. This patch should update
> GetTimeStamp, UnixTimeStamp, ToUnixTimeStamp and Cast
[jira] [Updated] (SPARK-33498) Datetime parsing should fail if the input string can't be parsed, or the pattern string is invalid
[ https://issues.apache.org/jira/browse/SPARK-33498?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Leanken.Lin updated SPARK-33498:
--------------------------------
    Description: Datetime parsing should fail if the input string can't be parsed, or the pattern string is invalid, when ANSI mode is enabled. This patch should update GetTimeStamp, UnixTimeStamp, ToUnixTimeStamp and Cast.  (was: Datetime parsing should fail if the input string can't be parsed, or he pattern string is invalid, when ANSI mode is enable. This patch should update GetTimeStamp, UnixTimeStamp, ToUnixTimeStamp and Cast)

> Datetime parsing should fail if the input string can't be parsed, or the
> pattern string is invalid
> ------------------------------------------------------------------------
>
>                 Key: SPARK-33498
>                 URL: https://issues.apache.org/jira/browse/SPARK-33498
>             Project: Spark
>          Issue Type: Sub-task
>          Components: SQL
>    Affects Versions: 3.1.0
>            Reporter: Leanken.Lin
>            Priority: Major
>
> Datetime parsing should fail if the input string can't be parsed, or the
> pattern string is invalid, when ANSI mode is enabled. This patch should
> update GetTimeStamp, UnixTimeStamp, ToUnixTimeStamp and Cast.
[jira] [Created] (SPARK-33498) Datetime parsing should fail if the input string can't be parsed, or he pattern string is invalid
Leanken.Lin created SPARK-33498:
-----------------------------------

             Summary: Datetime parsing should fail if the input string can't be parsed, or he pattern string is invalid
                 Key: SPARK-33498
                 URL: https://issues.apache.org/jira/browse/SPARK-33498
             Project: Spark
          Issue Type: Sub-task
          Components: SQL
    Affects Versions: 3.1.0
            Reporter: Leanken.Lin


Datetime parsing should fail if the input string can't be parsed, or he pattern string is invalid, when ANSI mode is enable. This patch should update GetTimeStamp, UnixTimeStamp, ToUnixTimeStamp and Cast
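The intended contrast can be sketched outside Spark with java.time (toy code, not the GetTimeStamp/UnixTimeStamp implementation; `parseDate` and the `ansiEnabled` flag are illustrative names):

```java
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;
import java.time.format.DateTimeParseException;

// Toy illustration of the requested behavior; not Spark's datetime code.
public class DatetimeSketch {
    // Under ANSI mode both an unparsable input and an invalid pattern fail
    // fast; legacy behavior swallows the error and yields null.
    static LocalDate parseDate(String input, String pattern, boolean ansiEnabled) {
        try {
            // ofPattern throws IllegalArgumentException on an invalid pattern,
            // parse throws DateTimeParseException on an unparsable input.
            return LocalDate.parse(input, DateTimeFormatter.ofPattern(pattern));
        } catch (DateTimeParseException | IllegalArgumentException e) {
            if (ansiEnabled) {
                throw e; // runtime error instead of null
            }
            return null;
        }
    }
}
```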
[jira] [Updated] (SPARK-33386) Accessing array elements should fail if index is out of bound.
[ https://issues.apache.org/jira/browse/SPARK-33386?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Leanken.Lin updated SPARK-33386:
--------------------------------
    Description: When ANSI mode is enabled, accessing an array element with an out-of-bound index should fail with an exception, but currently it returns null.  (was: When ANSI mode is enabled, accessing an array element should fail with an exception, but currently it returns null.)

> Accessing array elements should fail if index is out of bound.
> ---------------------------------------------------------------
>
>                 Key: SPARK-33386
>                 URL: https://issues.apache.org/jira/browse/SPARK-33386
>             Project: Spark
>          Issue Type: Sub-task
>          Components: SQL
>    Affects Versions: 3.1.0
>            Reporter: Leanken.Lin
>            Assignee: Leanken.Lin
>            Priority: Major
>             Fix For: 3.1.0
>
> When ANSI mode is enabled, accessing an array element with an out-of-bound
> index should fail with an exception, but currently it returns null.
[jira] [Updated] (SPARK-33460) Accessing map values should fail if key is not found.
[ https://issues.apache.org/jira/browse/SPARK-33460?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Leanken.Lin updated SPARK-33460:
--------------------------------
    Description: When ANSI mode is enabled, accessing map values should fail with an exception if the key does not exist, but currently it returns null.

> Accessing map values should fail if key is not found.
> -----------------------------------------------------
>
>                 Key: SPARK-33460
>                 URL: https://issues.apache.org/jira/browse/SPARK-33460
>             Project: Spark
>          Issue Type: Sub-task
>          Components: SQL
>    Affects Versions: 3.1.0
>            Reporter: Leanken.Lin
>            Priority: Major
>
> When ANSI mode is enabled, accessing map values should fail with an
> exception if the key does not exist, but currently it returns null.
[jira] [Created] (SPARK-33460) Accessing map values should fail if key is not found.
Leanken.Lin created SPARK-33460:
-----------------------------------

             Summary: Accessing map values should fail if key is not found.
                 Key: SPARK-33460
                 URL: https://issues.apache.org/jira/browse/SPARK-33460
             Project: Spark
          Issue Type: Sub-task
          Components: SQL
    Affects Versions: 3.1.0
            Reporter: Leanken.Lin
[jira] [Updated] (SPARK-33391) element_at with CreateArray does not respect one-based index
[ https://issues.apache.org/jira/browse/SPARK-33391?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Leanken.Lin updated SPARK-33391:
--------------------------------
    Description: 
var df = spark.sql("select element_at(array(3, 2, 1), 0)")
df.printSchema()

df = spark.sql("select element_at(array(3, 2, 1), 1)")
df.printSchema()

df = spark.sql("select element_at(array(3, 2, 1), 2)")
df.printSchema()

df = spark.sql("select element_at(array(3, 2, 1), 3)")
df.printSchema()

root
 |-- element_at(array(3, 2, 1), 0): integer (nullable = false)
root
 |-- element_at(array(3, 2, 1), 1): integer (nullable = false)
root
 |-- element_at(array(3, 2, 1), 2): integer (nullable = false)
root
 |-- element_at(array(3, 2, 1), 3): integer (nullable = true)

In this case, the nullable property of element_at over a CreateArray is not correct.

  was: TODO

> element_at with CreateArray does not respect one-based index
> -------------------------------------------------------------
>
>                 Key: SPARK-33391
>                 URL: https://issues.apache.org/jira/browse/SPARK-33391
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 3.1.0
>            Reporter: Leanken.Lin
>            Priority: Major
>
> var df = spark.sql("select element_at(array(3, 2, 1), 0)")
> df.printSchema()
> df = spark.sql("select element_at(array(3, 2, 1), 1)")
> df.printSchema()
> df = spark.sql("select element_at(array(3, 2, 1), 2)")
> df.printSchema()
> df = spark.sql("select element_at(array(3, 2, 1), 3)")
> df.printSchema()
>
> root
>  |-- element_at(array(3, 2, 1), 0): integer (nullable = false)
> root
>  |-- element_at(array(3, 2, 1), 1): integer (nullable = false)
> root
>  |-- element_at(array(3, 2, 1), 2): integer (nullable = false)
> root
>  |-- element_at(array(3, 2, 1), 3): integer (nullable = true)
>
> In this case, the nullable property of element_at over a CreateArray is not
> correct.
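The expected rule — with one-based indexing, a literal index over a known-size array of non-null elements is safe iff it lies in [1, size] (or [-size, -1] for negative indices), so index 3 above should be non-nullable while index 0 should not — can be written as a small predicate (toy code; `resultNullable` is an illustrative name, not Spark's):

```java
// Toy nullability rule for element_at over an array literal of known size.
public class ElementAtNullability {
    // One-based positive indices; negative indices count from the end.
    // The result can only be guaranteed non-null when the index is in
    // bounds and the array itself contains no null elements.
    static boolean resultNullable(int index, int arraySize, boolean containsNull) {
        boolean inBounds = (index >= 1 && index <= arraySize)
                || (index <= -1 && index >= -arraySize);
        return containsNull || !inBounds;
    }
}
```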
[jira] [Updated] (SPARK-33391) element_at with CreateArray does not respect one-based index
[ https://issues.apache.org/jira/browse/SPARK-33391?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Leanken.Lin updated SPARK-33391:
--------------------------------
    Summary: element_at with CreateArray does not respect one-based index  (was: element_at does not respect one-based index)

> element_at with CreateArray does not respect one-based index
> -------------------------------------------------------------
>
>                 Key: SPARK-33391
>                 URL: https://issues.apache.org/jira/browse/SPARK-33391
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 3.1.0
>            Reporter: Leanken.Lin
>            Priority: Major
>
> TODO
[jira] [Created] (SPARK-33391) element_at does not respect one-based index
Leanken.Lin created SPARK-33391:
-----------------------------------

             Summary: element_at does not respect one-based index
                 Key: SPARK-33391
                 URL: https://issues.apache.org/jira/browse/SPARK-33391
             Project: Spark
          Issue Type: Bug
          Components: SQL
    Affects Versions: 3.1.0
            Reporter: Leanken.Lin


TODO
[jira] [Updated] (SPARK-33386) Accessing array elements should failed if index is out of bound.
[ https://issues.apache.org/jira/browse/SPARK-33386?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Leanken.Lin updated SPARK-33386: Description: When ansi mode enabled, accessing array element should failed with exception, but currently it's returning null. (was: TODO) > Accessing array elements should failed if index is out of bound. > > > Key: SPARK-33386 > URL: https://issues.apache.org/jira/browse/SPARK-33386 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.1.0 >Reporter: Leanken.Lin >Priority: Major > > When ansi mode enabled, accessing array element should failed with exception, > but currently it's returning null. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-33386) Accessing array elements should fail if index is out of bound.
Leanken.Lin created SPARK-33386:
-----------------------------------

             Summary: Accessing array elements should fail if index is out of bound.
                 Key: SPARK-33386
                 URL: https://issues.apache.org/jira/browse/SPARK-33386
             Project: Spark
          Issue Type: Sub-task
          Components: SQL
    Affects Versions: 3.1.0
            Reporter: Leanken.Lin


TODO
[jira] [Updated] (SPARK-33140) make Analyzer rules use SQLConf.get
[ https://issues.apache.org/jira/browse/SPARK-33140?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Leanken.Lin updated SPARK-33140:
--------------------------------
    Summary: make Analyzer rules use SQLConf.get  (was: make Analyzer and Optimizer rules use SQLConf.get)

> make Analyzer rules use SQLConf.get
> -----------------------------------
>
>                 Key: SPARK-33140
>                 URL: https://issues.apache.org/jira/browse/SPARK-33140
>             Project: Spark
>          Issue Type: Sub-task
>          Components: SQL
>    Affects Versions: 3.1.0
>            Reporter: Leanken.Lin
>            Priority: Major
>
> TODO
[jira] [Updated] (SPARK-33139) protect setActiveSession and clearActiveSession
[ https://issues.apache.org/jira/browse/SPARK-33139?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Leanken.Lin updated SPARK-33139:
--------------------------------
    Description: 
This PR is a sub-task of [SPARK-33138](https://issues.apache.org/jira/browse/SPARK-33138). In order to make SQLConf.get reliable and stable, we need to make sure users can't pollute the SQLConf and SparkSession context by calling setActiveSession and clearActiveSession.

Changes in the PR:
* add a legacy config spark.sql.legacy.allowModifyActiveSession to fall back to the old behavior if users really need to call these two APIs
* by default, calling these two APIs throws an exception
* add two extra internal and private APIs, setActiveSessionInternal and clearActiveSessionInternal, for current internal usage
* change all internal references to the new internal APIs, except for SQLContext.setActive and SQLContext.clearActive

  was: TODO

> protect setActiveSession and clearActiveSession
> -----------------------------------------------
>
>                 Key: SPARK-33139
>                 URL: https://issues.apache.org/jira/browse/SPARK-33139
>             Project: Spark
>          Issue Type: Sub-task
>          Components: SQL
>    Affects Versions: 3.1.0
>            Reporter: Leanken.Lin
>            Priority: Major
>
> This PR is a sub-task of
> [SPARK-33138](https://issues.apache.org/jira/browse/SPARK-33138). In order
> to make SQLConf.get reliable and stable, we need to make sure users can't
> pollute the SQLConf and SparkSession context by calling setActiveSession and
> clearActiveSession.
>
> Changes in the PR:
> * add a legacy config spark.sql.legacy.allowModifyActiveSession to fall back
> to the old behavior if users really need to call these two APIs
> * by default, calling these two APIs throws an exception
> * add two extra internal and private APIs, setActiveSessionInternal and
> clearActiveSessionInternal, for current internal usage
> * change all internal references to the new internal APIs, except for
> SQLContext.setActive and SQLContext.clearActive
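The guard described in those bullets can be sketched as follows (toy code with assumed names; Spark's actual SparkSession companion object differs):

```java
// Toy sketch of protecting a mutable "active session" slot behind a legacy flag.
public class SessionRegistrySketch {
    private static final ThreadLocal<Object> active = new ThreadLocal<>();
    // Stand-in for the spark.sql.legacy.allowModifyActiveSession config.
    static boolean legacyAllowModify = false;

    // Public API: throws by default so users cannot pollute the context.
    static void setActiveSession(Object session) {
        if (!legacyAllowModify) {
            throw new UnsupportedOperationException(
                "setActiveSession is protected; enable the legacy config to restore the old behavior");
        }
        setActiveSessionInternal(session);
    }

    // Internal-only entry point that the engine itself keeps using.
    static void setActiveSessionInternal(Object session) {
        active.set(session);
    }

    static Object getActiveSession() {
        return active.get();
    }
}
```

The same split (public guarded API plus a private internal API) applies symmetrically to clearActiveSession.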
[jira] [Commented] (SPARK-33138) unify temp view and permanent view behaviors
[ https://issues.apache.org/jira/browse/SPARK-33138?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17213824#comment-17213824 ]

Leanken.Lin commented on SPARK-33138:
-------------------------------------

Thanks for your consideration, but my colleague and I are already working on the sub-tasks.

> unify temp view and permanent view behaviors
> --------------------------------------------
>
>                 Key: SPARK-33138
>                 URL: https://issues.apache.org/jira/browse/SPARK-33138
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 3.1.0
>            Reporter: Leanken.Lin
>            Priority: Major
>
> Currently, a temp view stores a mapping from the view name to its logical
> plan, while a permanent view stored in the HMS keeps its original SQL text.
> So when a permanent view is referenced, its SQL text is
> parsed-analyzed-optimized-planned again with the current SQLConf and
> SparkSession context, so its result may keep changing when the SQLConf and
> context are different each time.
> In order to unify the behaviors of temp and permanent views, we propose to
> keep the original SQL text for both, and to also record the SQLConf at the
> time the view was created. Each time the view is referenced, we use the
> snapshot SQLConf to parse-analyze-optimize-plan the SQL text; this way we
> can make sure the output of the created view stays stable.
[jira] [Created] (SPARK-33142) SQL temp view should store SQL text as well
Leanken.Lin created SPARK-33142:
-----------------------------------

             Summary: SQL temp view should store SQL text as well
                 Key: SPARK-33142
                 URL: https://issues.apache.org/jira/browse/SPARK-33142
             Project: Spark
          Issue Type: Sub-task
          Components: SQL
    Affects Versions: 3.1.0
            Reporter: Leanken.Lin


TODO
[jira] [Created] (SPARK-33141) capture SQL configs when creating permanent views
Leanken.Lin created SPARK-33141:
-----------------------------------

             Summary: capture SQL configs when creating permanent views
                 Key: SPARK-33141
                 URL: https://issues.apache.org/jira/browse/SPARK-33141
             Project: Spark
          Issue Type: Sub-task
          Components: SQL
    Affects Versions: 3.1.0
            Reporter: Leanken.Lin


TODO
[jira] [Created] (SPARK-33140) make Analyzer and Optimizer rules use SQLConf.get
Leanken.Lin created SPARK-33140:
-----------------------------------

             Summary: make Analyzer and Optimizer rules use SQLConf.get
                 Key: SPARK-33140
                 URL: https://issues.apache.org/jira/browse/SPARK-33140
             Project: Spark
          Issue Type: Sub-task
          Components: SQL
    Affects Versions: 3.1.0
            Reporter: Leanken.Lin


TODO
[jira] [Created] (SPARK-33139) protect setActiveSession and clearActiveSession
Leanken.Lin created SPARK-33139:
-----------------------------------

             Summary: protect setActiveSession and clearActiveSession
                 Key: SPARK-33139
                 URL: https://issues.apache.org/jira/browse/SPARK-33139
             Project: Spark
          Issue Type: Sub-task
          Components: SQL
    Affects Versions: 3.1.0
            Reporter: Leanken.Lin


TODO
[jira] [Updated] (SPARK-33138) unify temp view and permanent view behaviors
[ https://issues.apache.org/jira/browse/SPARK-33138?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Leanken.Lin updated SPARK-33138:
--------------------------------
    Description: 
Currently, a temp view stores a mapping from the view name to its logical plan, while a permanent view stored in the HMS keeps its original SQL text.

So when a permanent view is referenced, its SQL text is parsed-analyzed-optimized-planned again with the current SQLConf and SparkSession context, so its result may keep changing when the SQLConf and context are different each time.

In order to unify the behaviors of temp and permanent views, we propose to keep the original SQL text for both, and to also record the SQLConf at the time the view was created. Each time the view is referenced, we use the snapshot SQLConf to parse-analyze-optimize-plan the SQL text; this way we can make sure the output of the created view stays stable.

  was: Currently, temp view store mapping of temp view name and its logicalPlan, and permanent view store in HMS stores its origin SQL text. So for permanent view, when try to refer the permanent view, its SQL text will be parse-analyze-optimize-plan again with current SQLConf and SparkSession context, so it might keep changing the SQLConf and context is different each time. So, in order the unify the behaviors of temp view and permanent view, propose that we keep its origin SQLText for both temp and permanent view, and also keep record of the SQLConf when the view was created. Each time we try to refer the view, we using the Snapshot SQLConf to parse-analyze-optimize-plan the SQLText, in this way, we can make sure the output of the created view to be stable.

> unify temp view and permanent view behaviors
> --------------------------------------------
>
>                 Key: SPARK-33138
>                 URL: https://issues.apache.org/jira/browse/SPARK-33138
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 3.1.0
>            Reporter: Leanken.Lin
>            Priority: Major
>
> Currently, a temp view stores a mapping from the view name to its logical
> plan, while a permanent view stored in the HMS keeps its original SQL text.
> So when a permanent view is referenced, its SQL text is
> parsed-analyzed-optimized-planned again with the current SQLConf and
> SparkSession context, so its result may keep changing when the SQLConf and
> context are different each time.
> In order to unify the behaviors of temp and permanent views, we propose to
> keep the original SQL text for both, and to also record the SQLConf at the
> time the view was created. Each time the view is referenced, we use the
> snapshot SQLConf to parse-analyze-optimize-plan the SQL text; this way we
> can make sure the output of the created view stays stable.
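A toy model of this proposal — store the view's SQL text plus a conf snapshot at creation time, and resolve later references under that snapshot (names like `ViewDef` and `resolveConf` are illustrative, not Spark's API):

```java
import java.util.HashMap;
import java.util.Map;

// Toy sketch: a view keeps its original SQL text and the conf captured when
// it was created, so later resolution does not depend on whatever the
// session conf happens to be at reference time.
public class ViewCatalogSketch {
    static final class ViewDef {
        final String sqlText;
        final Map<String, String> confSnapshot;
        ViewDef(String sqlText, Map<String, String> conf) {
            this.sqlText = sqlText;
            this.confSnapshot = new HashMap<>(conf); // defensive copy
        }
    }

    // The conf a reference should be resolved under: the snapshot wins over
    // the current session conf for every captured key.
    static Map<String, String> resolveConf(ViewDef view, Map<String, String> current) {
        Map<String, String> effective = new HashMap<>(current);
        effective.putAll(view.confSnapshot);
        return effective;
    }
}
```

The defensive copy is what makes the snapshot stable: later mutations of the session conf cannot leak into an already-created view definition.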
[jira] [Created] (SPARK-33138) unify temp view and permanent view behaviors
Leanken.Lin created SPARK-33138:
-----------------------------------

             Summary: unify temp view and permanent view behaviors
                 Key: SPARK-33138
                 URL: https://issues.apache.org/jira/browse/SPARK-33138
             Project: Spark
          Issue Type: Improvement
          Components: SQL
    Affects Versions: 3.1.0
         Environment: Currently, temp view store mapping of temp view name and its logicalPlan, and permanent view store in HMS stores its origin SQL text. So for permanent view, when try to refer the permanent view, its SQL text will be parse-analyze-optimize-plan again with current SQLConf and SparkSession context, so it might keep changing the SQLConf and context is different each time. So, in order the unify the behaviors of temp view and permanent view, propose that we keep its origin SQLText for both temp and permanent view, and also keep record of the SQLConf when the view was created. Each time we try to refer the view, we using the Snapshot SQLConf to parse-analyze-optimize-plan the SQLText, in this way, we can make sure the output of the created view to be stable.
            Reporter: Leanken.Lin
[jira] [Updated] (SPARK-33138) unify temp view and permanent view behaviors
[ https://issues.apache.org/jira/browse/SPARK-33138?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Leanken.Lin updated SPARK-33138:
--------------------------------
    Description: Currently, temp view store mapping of temp view name and its logicalPlan, and permanent view store in HMS stores its origin SQL text. So for permanent view, when try to refer the permanent view, its SQL text will be parse-analyze-optimize-plan again with current SQLConf and SparkSession context, so it might keep changing the SQLConf and context is different each time. So, in order the unify the behaviors of temp view and permanent view, propose that we keep its origin SQLText for both temp and permanent view, and also keep record of the SQLConf when the view was created. Each time we try to refer the view, we using the Snapshot SQLConf to parse-analyze-optimize-plan the SQLText, in this way, we can make sure the output of the created view to be stable.
    Environment:     (was: Currently, temp view store mapping of temp view name and its logicalPlan, and permanent view store in HMS stores its origin SQL text. So for permanent view, when try to refer the permanent view, its SQL text will be parse-analyze-optimize-plan again with current SQLConf and SparkSession context, so it might keep changing the SQLConf and context is different each time. So, in order the unify the behaviors of temp view and permanent view, propose that we keep its origin SQLText for both temp and permanent view, and also keep record of the SQLConf when the view was created. Each time we try to refer the view, we using the Snapshot SQLConf to parse-analyze-optimize-plan the SQLText, in this way, we can make sure the output of the created view to be stable.)

> unify temp view and permanent view behaviors
> --------------------------------------------
>
>                 Key: SPARK-33138
>                 URL: https://issues.apache.org/jira/browse/SPARK-33138
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 3.1.0
>            Reporter: Leanken.Lin
>            Priority: Major
>
> Currently, temp view store mapping of temp view name and its logicalPlan,
> and permanent view store in HMS stores its origin SQL text. So for permanent
> view, when try to refer the permanent view, its SQL text will be
> parse-analyze-optimize-plan again with current SQLConf and SparkSession
> context, so it might keep changing the SQLConf and context is different each
> time. So, in order the unify the behaviors of temp view and permanent view,
> propose that we keep its origin SQLText for both temp and permanent view,
> and also keep record of the SQLConf when the view was created. Each time we
> try to refer the view, we using the Snapshot SQLConf to
> parse-analyze-optimize-plan the SQLText, in this way, we can make sure the
> output of the created view to be stable.
[jira] [Updated] (SPARK-33016) Potential SQLMetrics missed which might cause WEB UI display issue while AQE is on.
[ https://issues.apache.org/jira/browse/SPARK-33016?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Leanken.Lin updated SPARK-33016:
--------------------------------
    Affects Version/s:     (was: 3.0.1)
                       3.0.0

> Potential SQLMetrics missed which might cause WEB UI display issue while AQE
> is on.
> ----------------------------------------------------------------------------
>
>                 Key: SPARK-33016
>                 URL: https://issues.apache.org/jira/browse/SPARK-33016
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 3.0.0
>            Reporter: Leanken.Lin
>            Priority: Minor
>
> In the current AQE execution, there is a scenario which might cause
> SQLMetrics to be incorrectly overridden:
> # Stage A and B are created, and the UI is updated through the
> onAdaptiveExecutionUpdate event.
> # Stage A and B are running. A subquery in stage A keeps updating metrics
> through the onAdaptiveSQLMetricUpdate event.
> # Stage B completes, while stage A's subquery is still running and updating
> metrics.
> # Completion of stage B triggers new stage creation and a UI update through
> onAdaptiveExecutionUpdate again (just like step 1).
>
> This issue is very hard to reproduce, since it only happens under high
> concurrency. For the fix, I suggest that we keep all duplicated metrics
> instead of overriding them every time.
[jira] [Created] (SPARK-33016) Potential SQLMetrics missed which might cause WEB UI display issue while AQE is on.
Leanken.Lin created SPARK-33016: --- Summary: Potential SQLMetrics missed which might cause WEB UI display issue while AQE is on. Key: SPARK-33016 URL: https://issues.apache.org/jira/browse/SPARK-33016 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 3.0.1 Reporter: Leanken.Lin In the current AQE execution, the following scenario might cause SQLMetrics to be incorrectly overridden. # Stage A and B are created, and the UI is updated through the event onAdaptiveExecutionUpdate. # Stage A and B are running. A subquery in stage A keeps updating metrics through the event onAdaptiveSQLMetricUpdate. # Stage B completes while stage A's subquery is still running and updating metrics. # Completion of stage B triggers new stage creation and a UI update through the event onAdaptiveExecutionUpdate again (just like step 1). This issue is very hard to reproduce, since it only happens under high concurrency. For the fix, I suggest keeping all duplicated metrics instead of overwriting them every time. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
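The suggested fix above ("keep all duplicated metrics instead of overwriting") can be sketched like this. It is a hypothetical illustration, not the actual `SQLAppStatusListener` code: every update for an accumulator id is retained in a buffer, so a late subquery update from stage A cannot be lost when stage B's completion re-registers the plan.

```scala
import scala.collection.mutable

// Hypothetical sketch: accumulate every metric update per accumulator id
// instead of overwriting the last value, then aggregate over all retained
// updates (summing, as SQLMetrics typically do).
class MetricsStore {
  private val updates = mutable.Map.empty[Long, mutable.Buffer[Long]]

  def onMetricUpdate(accumId: Long, value: Long): Unit =
    updates.getOrElseUpdate(accumId, mutable.Buffer.empty[Long]) += value

  def aggregated(accumId: Long): Long =
    updates.get(accumId).map(_.sum).getOrElse(0L)
}
```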
[jira] [Resolved] (SPARK-32765) EliminateJoinToEmptyRelation should respect exchange behavior when canChangeNumPartitions == false
[ https://issues.apache.org/jira/browse/SPARK-32765?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Leanken.Lin resolved SPARK-32765. - Resolution: Not A Bug > EliminateJoinToEmptyRelation should respect exchange behavior when > canChangeNumPartitions == false > -- > > Key: SPARK-32765 > URL: https://issues.apache.org/jira/browse/SPARK-32765 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 >Reporter: Leanken.Lin >Priority: Major > > Currently, the EliminateJoinToEmptyRelation rule converts a Join into an > EmptyRelation in some cases with AQE on. But if either sub-plan of the Join > contains a ShuffleQueryStage(canChangeNumPartitions == false), meaning the > Exchange was produced by a repartition or a single-partition requirement, > converting it into an EmptyRelation would lose the user-specified partition > count for downstream operators, which is not right. > So this patch skips the conversion if either sub-plan of the Join contains a > ShuffleQueryStage(canChangeNumPartitions == false) -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-32765) EliminateJoinToEmptyRelation should respect exchange behavior when canChangeNumPartitions == false
[ https://issues.apache.org/jira/browse/SPARK-32765?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Leanken.Lin updated SPARK-32765: Environment: (was: Currently, the EliminateJoinToEmptyRelation rule converts a Join into an EmptyRelation in some cases with AQE on. But if either sub-plan of the Join contains a ShuffleQueryStage(canChangeNumPartitions == false), meaning the Exchange was produced by a repartition or a single-partition requirement, converting it into an EmptyRelation would lose the user-specified partition count for downstream operators, which is not right. So this patch skips the conversion if either sub-plan of the Join contains a ShuffleQueryStage(canChangeNumPartitions == false)) > EliminateJoinToEmptyRelation should respect exchange behavior when > canChangeNumPartitions == false > -- > > Key: SPARK-32765 > URL: https://issues.apache.org/jira/browse/SPARK-32765 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 >Reporter: Leanken.Lin >Priority: Major > > Currently, the EliminateJoinToEmptyRelation rule converts a Join into an > EmptyRelation in some cases with AQE on. But if either sub-plan of the Join > contains a ShuffleQueryStage(canChangeNumPartitions == false), meaning the > Exchange was produced by a repartition or a single-partition requirement, > converting it into an EmptyRelation would lose the user-specified partition > count for downstream operators, which is not right. > So this patch skips the conversion if either sub-plan of the Join contains a > ShuffleQueryStage(canChangeNumPartitions == false) -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-32765) EliminateJoinToEmptyRelation should respect exchange behavior when canChangeNumPartitions == false
[ https://issues.apache.org/jira/browse/SPARK-32765?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Leanken.Lin updated SPARK-32765: Description: Currently, the EliminateJoinToEmptyRelation rule converts a Join into an EmptyRelation in some cases with AQE on. But if either sub-plan of the Join contains a ShuffleQueryStage(canChangeNumPartitions == false), meaning the Exchange was produced by a repartition or a single-partition requirement, converting it into an EmptyRelation would lose the user-specified partition count for downstream operators, which is not right. So this patch skips the conversion if either sub-plan of the Join contains a ShuffleQueryStage(canChangeNumPartitions == false) > EliminateJoinToEmptyRelation should respect exchange behavior when > canChangeNumPartitions == false > -- > > Key: SPARK-32765 > URL: https://issues.apache.org/jira/browse/SPARK-32765 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 > Environment: Currently, the EliminateJoinToEmptyRelation rule > converts a Join into an EmptyRelation in some cases with AQE on. But if > either sub-plan of the Join contains a > ShuffleQueryStage(canChangeNumPartitions == false), meaning the Exchange was > produced by a repartition or a single-partition requirement, converting it > into an EmptyRelation would lose the user-specified partition count for > downstream operators, which is not right. > So this patch skips the conversion if either sub-plan of the Join contains a > ShuffleQueryStage(canChangeNumPartitions == false) >Reporter: Leanken.Lin >Priority: Major > > Currently, the EliminateJoinToEmptyRelation rule converts a Join into an > EmptyRelation in some cases with AQE on. But if either sub-plan of the Join > contains a ShuffleQueryStage(canChangeNumPartitions == false), meaning the > Exchange was produced by a repartition or a single-partition requirement, > converting it into an EmptyRelation would lose the user-specified partition > count for downstream operators, which is not right. > So this patch skips the conversion if either sub-plan of the Join contains a > ShuffleQueryStage(canChangeNumPartitions == false) -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-32765) EliminateJoinToEmptyRelation should respect exchange behavior when canChangeNumPartitions == false
Leanken.Lin created SPARK-32765: --- Summary: EliminateJoinToEmptyRelation should respect exchange behavior when canChangeNumPartitions == false Key: SPARK-32765 URL: https://issues.apache.org/jira/browse/SPARK-32765 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 3.0.0 Environment: Currently, the EliminateJoinToEmptyRelation rule converts a Join into an EmptyRelation in some cases with AQE on. But if either sub-plan of the Join contains a ShuffleQueryStage(canChangeNumPartitions == false), meaning the Exchange was produced by a repartition or a single-partition requirement, converting it into an EmptyRelation would lose the user-specified partition count for downstream operators, which is not right. So this patch skips the conversion if either sub-plan of the Join contains a ShuffleQueryStage(canChangeNumPartitions == false) Reporter: Leanken.Lin -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
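The guard described in this issue can be sketched as follows. The plan node names (`Plan`, `ShuffleStage`, `Scan`, `Join`) are simplified stand-ins for the real AQE plan classes: before rewriting a Join into an EmptyRelation, both sub-plans are scanned for a shuffle stage whose partitioning is user-specified (`canChangeNumPartitions == false`), and the rewrite is skipped if one is found.

```scala
// Hypothetical sketch of the proposed guard, with plan nodes simplified
// from the real AQE classes.
sealed trait Plan { def children: Seq[Plan] }
case class ShuffleStage(canChangeNumPartitions: Boolean,
                        children: Seq[Plan] = Nil) extends Plan
case class Scan(children: Seq[Plan] = Nil) extends Plan
case class Join(left: Plan, right: Plan) extends Plan {
  def children: Seq[Plan] = Seq(left, right)
}

// A shuffle with canChangeNumPartitions == false pins the partition count
// (e.g. the user asked for an explicit repartition).
def hasFixedPartitioning(plan: Plan): Boolean = plan match {
  case ShuffleStage(false, _) => true
  case p => p.children.exists(hasFixedPartitioning)
}

// The rewrite may fire only when neither side pins its partition count.
def canConvertToEmptyRelation(join: Join): Boolean =
  !hasFixedPartitioning(join.left) && !hasFixedPartitioning(join.right)
```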
[jira] [Created] (SPARK-32705) EmptyHashedRelation$; no valid constructor
Leanken.Lin created SPARK-32705: --- Summary: EmptyHashedRelation$; no valid constructor Key: SPARK-32705 URL: https://issues.apache.org/jira/browse/SPARK-32705 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 3.0.0 Reporter: Leanken.Lin Currently EmptyHashedRelation is a plain object, which causes a Java deserialization exception as follows {code:java} // Stack 20/08/26 11:13:30 WARN [task-result-getter-2] TaskSetManager: Lost task 34.0 in stage 57.0 (TID 18076, emr-worker-5.cluster-183257, executor 18): java.io.InvalidClassException: org.apache.spark.sql.execution.joins.EmptyHashedRelation$; no valid constructor at java.io.ObjectStreamClass$ExceptionInfo.newInvalidClassException(ObjectStreamClass.java:169) at java.io.ObjectStreamClass.checkDeserialize(ObjectStreamClass.java:874) at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2042) at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1572) at java.io.ObjectInputStream.readObject(ObjectInputStream.java:430) at org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:76) at org.apache.spark.broadcast.TorrentBroadcast$.$anonfun$unBlockifyObject$4(TorrentBroadcast.scala:328) at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1377) at org.apache.spark.broadcast.TorrentBroadcast$.unBlockifyObject(TorrentBroadcast.scala:330) at org.apache.spark.broadcast.TorrentBroadcast.$anonfun$readBroadcastBlock$4(TorrentBroadcast.scala:249) at scala.Option.getOrElse(Option.scala:189) at org.apache.spark.broadcast.TorrentBroadcast.$anonfun$readBroadcastBlock$2(TorrentBroadcast.scala:223) at org.apache.spark.util.KeyLock.withLock(KeyLock.scala:64) at org.apache.spark.broadcast.TorrentBroadcast.$anonfun$readBroadcastBlock$1(TorrentBroadcast.scala:218) at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1343) {code} Use a case object instead to fix the serialization issue. 
Also change EmptyHashedRelation to no longer extend NullAwareHashedRelation, since it is already used in other non-NAAJ joins. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
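The `case object` part of the fix can be illustrated in isolation. This is a minimal sketch, not the real `EmptyHashedRelation` (whose superclass hierarchy also matters for Java serialization): the Scala compiler generates a `readResolve` for serializable singletons, so a `case object` round-trips through Java serialization back to the same instance instead of requiring a valid constructor chain.

```scala
import java.io.{ByteArrayInputStream, ByteArrayOutputStream,
                ObjectInputStream, ObjectOutputStream}

// Stand-in for EmptyHashedRelation; case objects are Serializable by default
// and deserialize back to the singleton via compiler-generated readResolve.
case object EmptyRelationSingleton

// Serialize and deserialize through Java serialization, as Spark's
// JavaSerializer does for broadcast values.
def roundTrip(obj: AnyRef): AnyRef = {
  val bytes = new ByteArrayOutputStream()
  val out = new ObjectOutputStream(bytes)
  out.writeObject(obj)
  out.close()
  new ObjectInputStream(new ByteArrayInputStream(bytes.toByteArray)).readObject()
}
```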
[jira] [Created] (SPARK-32678) Rename EmptyHashedRelationWithAllNullKeys and simplify NAAJ generated code
Leanken.Lin created SPARK-32678: --- Summary: Rename EmptyHashedRelationWithAllNullKeys and simplify NAAJ generated code Key: SPARK-32678 URL: https://issues.apache.org/jira/browse/SPARK-32678 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.0.0 Reporter: Leanken.Lin EmptyHashedRelationWithAllNullKeys is a confusing name, and this minor change also simplifies the generated code for BHJ NAAJ. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-32649) Optimize BHJ/SHJ inner and semi join with empty hashed relation
[ https://issues.apache.org/jira/browse/SPARK-32649?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17179397#comment-17179397 ] Leanken.Lin commented on SPARK-32649: - Feel free to just send out PR for reviewing,^_^ > Optimize BHJ/SHJ inner and semi join with empty hashed relation > --- > > Key: SPARK-32649 > URL: https://issues.apache.org/jira/browse/SPARK-32649 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.1.0 >Reporter: Cheng Su >Priority: Trivial > > With `EmptyHashedRelation` introduced in > [https://github.com/apache/spark/pull/29389], it inspired me that there's a > minor optimization we can apply to broadcast hash join and shuffled hash join > if build side hashed relation is empty. > If build side hashed relation is empty (i.e. build side is empty) > 1.inner join: we don't need to execute stream side at all, just return an > empty iterator - > [https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/joins/HashJoin.scala#L152] > 2.semi join: we don't need to execute stream side at all, just return an > empty iterator - > [https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/joins/HashJoin.scala#L227] > . > This is not common that build side is empty, but in case it is, we can > leverage it to not execute stream side at all for better query CPU/IO > performance. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-32649) Optimize BHJ/SHJ inner and semi join with empty hashed relation
[ https://issues.apache.org/jira/browse/SPARK-32649?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17179397#comment-17179397 ] Leanken.Lin edited comment on SPARK-32649 at 8/18/20, 6:37 AM: --- Feel free to just send out PR for reviewing was (Author: leanken): Feel free to just send out PR for reviewing,^_^ > Optimize BHJ/SHJ inner and semi join with empty hashed relation > --- > > Key: SPARK-32649 > URL: https://issues.apache.org/jira/browse/SPARK-32649 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.1.0 >Reporter: Cheng Su >Priority: Trivial > > With `EmptyHashedRelation` introduced in > [https://github.com/apache/spark/pull/29389], it inspired me that there's a > minor optimization we can apply to broadcast hash join and shuffled hash join > if build side hashed relation is empty. > If build side hashed relation is empty (i.e. build side is empty) > 1.inner join: we don't need to execute stream side at all, just return an > empty iterator - > [https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/joins/HashJoin.scala#L152] > 2.semi join: we don't need to execute stream side at all, just return an > empty iterator - > [https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/joins/HashJoin.scala#L227] > . > This is not common that build side is empty, but in case it is, we can > leverage it to not execute stream side at all for better query CPU/IO > performance. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
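The empty-build-side fast path discussed in SPARK-32649 can be sketched with a toy hash join. This is a hypothetical simplification (`innerHashJoin` and its shape are illustrative, not Spark's `HashJoin` code): when the build side is empty, an inner join can return an empty iterator without ever pulling a row from the stream side.

```scala
// Hypothetical sketch: an inner hash join with a fast path for an empty
// build side. If the build map is empty, no stream row can match, so the
// stream side is never consumed.
def innerHashJoin[K, V](streamSide: Iterator[V],
                        buildSide: Map[K, V],
                        keyOf: V => K): Iterator[(V, V)] =
  if (buildSide.isEmpty) Iterator.empty // never touches streamSide
  else streamSide.flatMap(row => buildSide.get(keyOf(row)).map(m => (row, m)))
```

The semi-join case is analogous: with an empty build side, no stream row has a match, so the result is empty without executing the stream side.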
[jira] [Updated] (SPARK-32644) NAAJ support for ShuffleHashJoin when AQE is on
[ https://issues.apache.org/jira/browse/SPARK-32644?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Leanken.Lin updated SPARK-32644: Description: In SPARK-32290, we managed to optimize the NAAJ scenario from BNLJ to BHJ, but skipped checking the BuildSide plan against the *spark.sql.autoBroadcastJoinThreshold* parameter, which means that in a very bad case, a BuildSide plan that is too big might cause a Driver OOM. So NAAJ support for ShuffledHashJoin is important as well. SHJ support for NAAJ has some difficulties in the NullKey scenario: for a normal HashedRelation and EmptyHashedRelation, the code logic should be the same for BHJ and SHJ, but if a NullKey exists in the global BuildSide data, only the one partition containing it is built into EmptyHashedRelationWithAllNullKeys, and that single partition cannot *fast stop* the entire RDD. So after offline discussion with committers, we decided to support NAAJ for SHJ when AQE is on, because with AQE the Shuffle is pre-executed, and we can know whether the BuildSide contains a NullKey before the actual JOIN is executed. Basically, in the NAAJ SHJ implementation, we collect information on whether the BuildSide is empty or contains a NullKey, keep this information in the ShuffleExchangeExec metrics, and during AQE rewrite these two cases into a LocalTableScan and the StreamedSidePlan to avoid NAAJ; normal relations are processed in a distributed style. was: In [SPARK-32290|https://issues.apache.org/jira/browse/SPARK-32290], we managed to optimize the NAAJ scenario from BNLJ to BHJ, but skipped checking the BuildSide plan against the *spark.sql.autoBroadcastJoinThreshold* parameter, which means that in a very bad case, a BuildSide plan that is too big might cause a Driver OOM. So NAAJ support for ShuffledHashJoin is important as well. SHJ support for NAAJ has some difficulties in the NullKey scenario: for a normal HashedRelation and EmptyHashedRelation, the code logic should be the same for BHJ and SHJ, but if a NullKey exists in the global BuildSide data, only the one partition containing it is built into EmptyHashedRelationWithAllNullKeys, and that single partition cannot *fast stop* the entire RDD. So after offline discussion with committers, we decided to support NAAJ for SHJ when AQE is on, because with AQE the Shuffle is pre-executed, and we can know whether the BuildSide contains a NullKey before the actual JOIN is executed. Basically, in the NAAJ SHJ implementation, we collect information on whether the BuildSide is empty or contains a NullKey, keep this information in the ShuffleExchangeExec metrics, and during AQE rewrite these two cases into a LocalTableScan and the StreamedSidePlan to avoid NAAJ; normal relations are processed in a distributed style. > NAAJ support for ShuffleHashJoin when AQE is on > --- > > Key: SPARK-32644 > URL: https://issues.apache.org/jira/browse/SPARK-32644 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Leanken.Lin >Priority: Major > > In SPARK-32290, we managed to optimize the NAAJ scenario from BNLJ to BHJ, > but skipped checking the BuildSide plan against the > *spark.sql.autoBroadcastJoinThreshold* parameter, which means that in a very > bad case, a BuildSide plan that is too big might cause a Driver OOM. So NAAJ > support for ShuffledHashJoin is important as well. > > SHJ support for NAAJ has some difficulties in the NullKey scenario: for a > normal HashedRelation and EmptyHashedRelation, the code logic should be the > same for BHJ and SHJ, but if a NullKey exists in the global BuildSide data, > only the one partition containing it is built into > EmptyHashedRelationWithAllNullKeys, and that single partition cannot do a > *fast stop* for the entire RDD. So after offline discussion with committers, > we decided to support NAAJ for SHJ when AQE is on, because with AQE the > Shuffle is pre-executed, and we can know whether the BuildSide contains a > NullKey before the actual JOIN is executed. > > Basically, in the NAAJ SHJ implementation, we collect information on whether > the BuildSide is empty or contains a NullKey, keep this information in the > ShuffleExchangeExec metrics, and during AQE rewrite these two cases into a > LocalTableScan and the StreamedSidePlan to avoid NAAJ; normal relations are > processed in a distributed style. > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail:
[jira] [Created] (SPARK-32644) NAAJ support for ShuffleHashJoin when AQE is on
Leanken.Lin created SPARK-32644: --- Summary: NAAJ support for ShuffleHashJoin when AQE is on Key: SPARK-32644 URL: https://issues.apache.org/jira/browse/SPARK-32644 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.0.0 Reporter: Leanken.Lin In [SPARK-32290|https://issues.apache.org/jira/browse/SPARK-32290], we managed to optimize the NAAJ scenario from BNLJ to BHJ, but skipped checking the BuildSide plan against the *spark.sql.autoBroadcastJoinThreshold* parameter, which means that in a very bad case, a BuildSide plan that is too big might cause a Driver OOM. So NAAJ support for ShuffledHashJoin is important as well. SHJ support for NAAJ has some difficulties in the NullKey scenario: for a normal HashedRelation and EmptyHashedRelation, the code logic should be the same for BHJ and SHJ, but if a NullKey exists in the global BuildSide data, only the one partition containing it is built into EmptyHashedRelationWithAllNullKeys, and that single partition cannot *fast stop* the entire RDD. So after offline discussion with committers, we decided to support NAAJ for SHJ when AQE is on, because with AQE the Shuffle is pre-executed, and we can know whether the BuildSide contains a NullKey before the actual JOIN is executed. Basically, in the NAAJ SHJ implementation, we collect information on whether the BuildSide is empty or contains a NullKey, keep this information in the ShuffleExchangeExec metrics, and during AQE rewrite these two cases into a LocalTableScan and the StreamedSidePlan to avoid NAAJ; normal relations are processed in a distributed style. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
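The AQE rewrite decision described above can be sketched as a small decision function. This is a hypothetical model (the names `Rewrite`, `planNAAJ`, etc. are illustrative): once the build-side shuffle has materialized, its metrics reveal whether it is empty or contains a null key, and the null-aware anti join can be replaced before it runs. For a `NOT IN` anti join, an empty build side keeps every stream row, while a null key in the build side makes the predicate never true, so the result is empty.

```scala
// Hypothetical sketch of the rewrite decision for a null-aware anti join
// under AQE, driven by build-side shuffle metrics.
sealed trait Rewrite
case object KeepStreamedSide extends Rewrite     // build empty: NOT IN keeps all rows
case object ReplaceWithEmpty extends Rewrite     // build has a null key: NOT IN never true
case object RunNullAwareAntiJoin extends Rewrite // normal relation: run distributed NAAJ

def planNAAJ(buildIsEmpty: Boolean, buildHasNullKey: Boolean): Rewrite =
  if (buildIsEmpty) KeepStreamedSide
  else if (buildHasNullKey) ReplaceWithEmpty
  else RunNullAwareAntiJoin
```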
[jira] [Updated] (SPARK-32615) Fix AQE aggregateMetrics java.util.NoSuchElementException
[ https://issues.apache.org/jira/browse/SPARK-32615?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Leanken.Lin updated SPARK-32615: Description: {code:java} // Reproduce Step sql/test-only org.apache.spark.sql.execution.adaptive.AdaptiveQueryExecSuite -- -z "SPARK-32573: Eliminate NAAJ when BuildSide is EmptyHashedRelationWithAllNullKeys" {code} {code:java} // Error Message 14:40:44.089 ERROR org.apache.spark.util.Utils: Uncaught exception in thread element-tracking-store-worker 14:40:44.089 ERROR org.apache.spark.util.Utils: Uncaught exception in thread element-tracking-store-worker java.util.NoSuchElementException: key not found: 12 at scala.collection.immutable.Map$Map1.apply(Map.scala:114) at org.apache.spark.sql.execution.ui.SQLAppStatusListener.$anonfun$aggregateMetrics$11(SQLAppStatusListener.scala:257) at scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:238) at scala.collection.mutable.HashMap.$anonfun$foreach$1(HashMap.scala:149) at scala.collection.mutable.HashTable.foreachEntry(HashTable.scala:237) at scala.collection.mutable.HashTable.foreachEntry$(HashTable.scala:230) at scala.collection.mutable.HashMap.foreachEntry(HashMap.scala:44) at scala.collection.mutable.HashMap.foreach(HashMap.scala:149) at scala.collection.TraversableLike.map(TraversableLike.scala:238) at scala.collection.TraversableLike.map$(TraversableLike.scala:231) at scala.collection.AbstractTraversable.map(Traversable.scala:108) at org.apache.spark.sql.execution.ui.SQLAppStatusListener.aggregateMetrics(SQLAppStatusListener.scala:256) at org.apache.spark.sql.execution.ui.SQLAppStatusListener.$anonfun$onExecutionEnd$2(SQLAppStatusListener.scala:365) at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23) at org.apache.spark.util.Utils$.tryLog(Utils.scala:1971) at org.apache.spark.status.ElementTrackingStore$$anon$1.run(ElementTrackingStore.scala:117) at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) at 
java.util.concurrent.FutureTask.run(FutureTask.java:266) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) at java.lang.Thread.run(Thread.java:748)[info] - SPARK-32573: Eliminate NAAJ when BuildSide is EmptyHashedRelationWithAllNullKeys (2 seconds, 14 milliseconds) {code} This issue occurs mainly because during AQE, when a sub-plan changes, the metrics update is overwritten. For example, in this UT a BroadcastHashJoinExec is changed into a LocalTableScanExec, and in the onExecutionEnd action it tries to aggregate all metrics from the execution, including old ones, which causes a NoSuchElementException, since the metricsType was already updated when the plan was rewritten. So we need to filter out those outdated metrics. was: {code:java} // Reproduce Step sql/test-only org.apache.spark.sql.execution.adaptive.AdaptiveQueryExecSuite -- -z "SPARK-32573: Eliminate NAAJ when BuildSide is EmptyHashedRelationWithAllNullKeys" {code} {code:java} // Error Message 14:40:44.089 ERROR org.apache.spark.util.Utils: Uncaught exception in thread element-tracking-store-worker 14:40:44.089 ERROR org.apache.spark.util.Utils: Uncaught exception in thread element-tracking-store-worker java.util.NoSuchElementException: key not found: 12 at scala.collection.immutable.Map$Map1.apply(Map.scala:114) at org.apache.spark.sql.execution.ui.SQLAppStatusListener.$anonfun$aggregateMetrics$11(SQLAppStatusListener.scala:257) at scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:238) at scala.collection.mutable.HashMap.$anonfun$foreach$1(HashMap.scala:149) at scala.collection.mutable.HashTable.foreachEntry(HashTable.scala:237) at scala.collection.mutable.HashTable.foreachEntry$(HashTable.scala:230) at scala.collection.mutable.HashMap.foreachEntry(HashMap.scala:44) at scala.collection.mutable.HashMap.foreach(HashMap.scala:149) at 
scala.collection.TraversableLike.map(TraversableLike.scala:238) at scala.collection.TraversableLike.map$(TraversableLike.scala:231) at scala.collection.AbstractTraversable.map(Traversable.scala:108) at org.apache.spark.sql.execution.ui.SQLAppStatusListener.aggregateMetrics(SQLAppStatusListener.scala:256) at org.apache.spark.sql.execution.ui.SQLAppStatusListener.$anonfun$onExecutionEnd$2(SQLAppStatusListener.scala:365) at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23) at org.apache.spark.util.Utils$.tryLog(Utils.scala:1971) at org.apache.spark.status.ElementTrackingStore$$anon$1.run(ElementTrackingStore.scala:117) at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) at java.util.concurrent.FutureTask.run(FutureTask.java:266) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) at
[jira] [Updated] (SPARK-32615) Fix AQE aggregateMetrics java.util.NoSuchElementException
[ https://issues.apache.org/jira/browse/SPARK-32615?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Leanken.Lin updated SPARK-32615: Description: {code:java} // Reproduce Step sql/test-only org.apache.spark.sql.execution.adaptive.AdaptiveQueryExecSuite -- -z "SPARK-32573: Eliminate NAAJ when BuildSide is EmptyHashedRelationWithAllNullKeys" {code} {code:java} // Error Message 14:40:44.089 ERROR org.apache.spark.util.Utils: Uncaught exception in thread element-tracking-store-worker 14:40:44.089 ERROR org.apache.spark.util.Utils: Uncaught exception in thread element-tracking-store-worker java.util.NoSuchElementException: key not found: 12 at scala.collection.immutable.Map$Map1.apply(Map.scala:114) at org.apache.spark.sql.execution.ui.SQLAppStatusListener.$anonfun$aggregateMetrics$11(SQLAppStatusListener.scala:257) at scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:238) at scala.collection.mutable.HashMap.$anonfun$foreach$1(HashMap.scala:149) at scala.collection.mutable.HashTable.foreachEntry(HashTable.scala:237) at scala.collection.mutable.HashTable.foreachEntry$(HashTable.scala:230) at scala.collection.mutable.HashMap.foreachEntry(HashMap.scala:44) at scala.collection.mutable.HashMap.foreach(HashMap.scala:149) at scala.collection.TraversableLike.map(TraversableLike.scala:238) at scala.collection.TraversableLike.map$(TraversableLike.scala:231) at scala.collection.AbstractTraversable.map(Traversable.scala:108) at org.apache.spark.sql.execution.ui.SQLAppStatusListener.aggregateMetrics(SQLAppStatusListener.scala:256) at org.apache.spark.sql.execution.ui.SQLAppStatusListener.$anonfun$onExecutionEnd$2(SQLAppStatusListener.scala:365) at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23) at org.apache.spark.util.Utils$.tryLog(Utils.scala:1971) at org.apache.spark.status.ElementTrackingStore$$anon$1.run(ElementTrackingStore.scala:117) at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) at 
java.util.concurrent.FutureTask.run(FutureTask.java:266) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) at java.lang.Thread.run(Thread.java:748)[info] - SPARK-32573: Eliminate NAAJ when BuildSide is EmptyHashedRelationWithAllNullKeys (2 seconds, 14 milliseconds) {code} This issue occurs mainly because during AQE, when a sub-plan changes, the metrics update is overwritten. For example, in this UT a BroadcastHashJoinExec is changed into a LocalTableScanExec, and in the onExecutionEnd action it tries to aggregate all metrics from the execution, which causes a NoSuchElementException was: {code:java} // Error Message 14:40:44.089 ERROR org.apache.spark.util.Utils: Uncaught exception in thread element-tracking-store-worker14:40:44.089 ERROR org.apache.spark.util.Utils: Uncaught exception in thread element-tracking-store-workerjava.util.NoSuchElementException: key not found: 12 at scala.collection.immutable.Map$Map1.apply(Map.scala:114) at org.apache.spark.sql.execution.ui.SQLAppStatusListener.$anonfun$aggregateMetrics$11(SQLAppStatusListener.scala:257) at scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:238) at scala.collection.mutable.HashMap.$anonfun$foreach$1(HashMap.scala:149) at scala.collection.mutable.HashTable.foreachEntry(HashTable.scala:237) at scala.collection.mutable.HashTable.foreachEntry$(HashTable.scala:230) at scala.collection.mutable.HashMap.foreachEntry(HashMap.scala:44) at scala.collection.mutable.HashMap.foreach(HashMap.scala:149) at scala.collection.TraversableLike.map(TraversableLike.scala:238) at scala.collection.TraversableLike.map$(TraversableLike.scala:231) at 
scala.collection.AbstractTraversable.map(Traversable.scala:108) at org.apache.spark.sql.execution.ui.SQLAppStatusListener.aggregateMetrics(SQLAppStatusListener.scala:256) at org.apache.spark.sql.execution.ui.SQLAppStatusListener.$anonfun$onExecutionEnd$2(SQLAppStatusListener.scala:365) at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23) at org.apache.spark.util.Utils$.tryLog(Utils.scala:1971) at org.apache.spark.status.ElementTrackingStore$$anon$1.run(ElementTrackingStore.scala:117) at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) at java.util.concurrent.FutureTask.run(FutureTask.java:266) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) at java.lang.Thread.run(Thread.java:748)[info] - SPARK-32573: Eliminate NAAJ when BuildSide is EmptyHashedRelationWithAllNullKeys
[jira] [Updated] (SPARK-32615) Fix AQE aggregateMetrics java.util.NoSuchElementException
[ https://issues.apache.org/jira/browse/SPARK-32615?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Leanken.Lin updated SPARK-32615: Description: {code:java} // Reproduce Step sql/test-only org.apache.spark.sql.execution.adaptive.AdaptiveQueryExecSuite -- -z "SPARK-32573: Eliminate NAAJ when BuildSide is EmptyHashedRelationWithAllNullKeys" {code} {code:java} // Error Message 14:40:44.089 ERROR org.apache.spark.util.Utils: Uncaught exception in thread element-tracking-store-worker14:40:44.089 ERROR org.apache.spark.util.Utils: Uncaught exception in thread element-tracking-store-workerjava.util.NoSuchElementException: key not found: 12 at scala.collection.immutable.Map$Map1.apply(Map.scala:114) at org.apache.spark.sql.execution.ui.SQLAppStatusListener.$anonfun$aggregateMetrics$11(SQLAppStatusListener.scala:257) at scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:238) at scala.collection.mutable.HashMap.$anonfun$foreach$1(HashMap.scala:149) at scala.collection.mutable.HashTable.foreachEntry(HashTable.scala:237) at scala.collection.mutable.HashTable.foreachEntry$(HashTable.scala:230) at scala.collection.mutable.HashMap.foreachEntry(HashMap.scala:44) at scala.collection.mutable.HashMap.foreach(HashMap.scala:149) at scala.collection.TraversableLike.map(TraversableLike.scala:238) at scala.collection.TraversableLike.map$(TraversableLike.scala:231) at scala.collection.AbstractTraversable.map(Traversable.scala:108) at org.apache.spark.sql.execution.ui.SQLAppStatusListener.aggregateMetrics(SQLAppStatusListener.scala:256) at org.apache.spark.sql.execution.ui.SQLAppStatusListener.$anonfun$onExecutionEnd$2(SQLAppStatusListener.scala:365) at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23) at org.apache.spark.util.Utils$.tryLog(Utils.scala:1971) at org.apache.spark.status.ElementTrackingStore$$anon$1.run(ElementTrackingStore.scala:117) at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) at 
java.util.concurrent.FutureTask.run(FutureTask.java:266) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) at java.lang.Thread.run(Thread.java:748)[info] - SPARK-32573: Eliminate NAAJ when BuildSide is EmptyHashedRelationWithAllNullKeys (2 seconds, 14 milliseconds) {code} This issue occurs mainly because, during AQE, the metrics update is overwritten when a sub-plan changes. For example, in this UT the plan changes from a BroadcastHashJoinExec into a LocalTableScanExec, and the onExecutionEnd action then tries to aggregate all metrics collected during the execution, which causes the NoSuchElementException was: Reproduce Step {code:java} // code placeholder sql/test-only org.apache.spark.sql.execution.adaptive.AdaptiveQueryExecSuite -- -z "SPARK-32573: Eliminate NAAJ when BuildSide is EmptyHashedRelationWithAllNullKeys" {code} {code:java} // code placeholder 14:40:44.089 ERROR org.apache.spark.util.Utils: Uncaught exception in thread element-tracking-store-worker14:40:44.089 ERROR org.apache.spark.util.Utils: Uncaught exception in thread element-tracking-store-workerjava.util.NoSuchElementException: key not found: 12 at scala.collection.immutable.Map$Map1.apply(Map.scala:114) at org.apache.spark.sql.execution.ui.SQLAppStatusListener.$anonfun$aggregateMetrics$11(SQLAppStatusListener.scala:257) at scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:238) at scala.collection.mutable.HashMap.$anonfun$foreach$1(HashMap.scala:149) at scala.collection.mutable.HashTable.foreachEntry(HashTable.scala:237) at scala.collection.mutable.HashTable.foreachEntry$(HashTable.scala:230) at scala.collection.mutable.HashMap.foreachEntry(HashMap.scala:44) at scala.collection.mutable.HashMap.foreach(HashMap.scala:149) at scala.collection.TraversableLike.map(TraversableLike.scala:238) at scala.collection.TraversableLike.map$(TraversableLike.scala:231) at
scala.collection.AbstractTraversable.map(Traversable.scala:108) at org.apache.spark.sql.execution.ui.SQLAppStatusListener.aggregateMetrics(SQLAppStatusListener.scala:256) at org.apache.spark.sql.execution.ui.SQLAppStatusListener.$anonfun$onExecutionEnd$2(SQLAppStatusListener.scala:365) at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23) at org.apache.spark.util.Utils$.tryLog(Utils.scala:1971) at org.apache.spark.status.ElementTrackingStore$$anon$1.run(ElementTrackingStore.scala:117) at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) at java.util.concurrent.FutureTask.run(FutureTask.java:266) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) at java.lang.Thread.run(Thread.java:748)[info] - SPARK-32573: Eliminate NAAJ when BuildSide is EmptyHashedRelationWithAllNullKeys (2 seconds, 14 milliseconds) {code}
[jira] [Updated] (SPARK-32615) Fix AQE aggregateMetrics java.util.NoSuchElementException
[ https://issues.apache.org/jira/browse/SPARK-32615?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Leanken.Lin updated SPARK-32615: Summary: Fix AQE aggregateMetrics java.util.NoSuchElementException (was: AQE aggregateMetrics java.util.NoSuchElementException) > Fix AQE aggregateMetrics java.util.NoSuchElementException > - > > Key: SPARK-32615 > URL: https://issues.apache.org/jira/browse/SPARK-32615 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.0.0 >Reporter: Leanken.Lin >Priority: Minor > > Reproduce Step > {code:java} > // code placeholder > sql/test-only org.apache.spark.sql.execution.adaptive.AdaptiveQueryExecSuite > -- -z "SPARK-32573: Eliminate NAAJ when BuildSide is > EmptyHashedRelationWithAllNullKeys" > {code} > {code:java} > // code placeholder > 14:40:44.089 ERROR org.apache.spark.util.Utils: Uncaught exception in thread > element-tracking-store-worker14:40:44.089 ERROR org.apache.spark.util.Utils: > Uncaught exception in thread > element-tracking-store-workerjava.util.NoSuchElementException: key not found: > 12 at scala.collection.immutable.Map$Map1.apply(Map.scala:114) at > org.apache.spark.sql.execution.ui.SQLAppStatusListener.$anonfun$aggregateMetrics$11(SQLAppStatusListener.scala:257) > at > scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:238) at > scala.collection.mutable.HashMap.$anonfun$foreach$1(HashMap.scala:149) at > scala.collection.mutable.HashTable.foreachEntry(HashTable.scala:237) at > scala.collection.mutable.HashTable.foreachEntry$(HashTable.scala:230) at > scala.collection.mutable.HashMap.foreachEntry(HashMap.scala:44) at > scala.collection.mutable.HashMap.foreach(HashMap.scala:149) at > scala.collection.TraversableLike.map(TraversableLike.scala:238) at > scala.collection.TraversableLike.map$(TraversableLike.scala:231) at > scala.collection.AbstractTraversable.map(Traversable.scala:108) at > 
org.apache.spark.sql.execution.ui.SQLAppStatusListener.aggregateMetrics(SQLAppStatusListener.scala:256) > at > org.apache.spark.sql.execution.ui.SQLAppStatusListener.$anonfun$onExecutionEnd$2(SQLAppStatusListener.scala:365) > at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23) at > org.apache.spark.util.Utils$.tryLog(Utils.scala:1971) at > org.apache.spark.status.ElementTrackingStore$$anon$1.run(ElementTrackingStore.scala:117) > at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) > at java.util.concurrent.FutureTask.run(FutureTask.java:266) at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) > at java.lang.Thread.run(Thread.java:748)[info] - SPARK-32573: Eliminate NAAJ > when BuildSide is EmptyHashedRelationWithAllNullKeys (2 seconds, 14 > milliseconds) > {code} > This issue occurs mainly because, during AQE, the metrics update is > overwritten when a sub-plan changes. For example, in this UT the plan changes > from a BroadcastHashJoinExec into a LocalTableScanExec, and the onExecutionEnd > action then tries to aggregate all metrics collected during the execution, > which causes the NoSuchElementException -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-32615) AQE aggregateMetrics java.util.NoSuchElementException
Leanken.Lin created SPARK-32615: --- Summary: AQE aggregateMetrics java.util.NoSuchElementException Key: SPARK-32615 URL: https://issues.apache.org/jira/browse/SPARK-32615 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 3.0.0 Reporter: Leanken.Lin Reproduce Step {code:java} // code placeholder sql/test-only org.apache.spark.sql.execution.adaptive.AdaptiveQueryExecSuite -- -z "SPARK-32573: Eliminate NAAJ when BuildSide is EmptyHashedRelationWithAllNullKeys" {code} {code:java} // code placeholder 14:40:44.089 ERROR org.apache.spark.util.Utils: Uncaught exception in thread element-tracking-store-worker14:40:44.089 ERROR org.apache.spark.util.Utils: Uncaught exception in thread element-tracking-store-workerjava.util.NoSuchElementException: key not found: 12 at scala.collection.immutable.Map$Map1.apply(Map.scala:114) at org.apache.spark.sql.execution.ui.SQLAppStatusListener.$anonfun$aggregateMetrics$11(SQLAppStatusListener.scala:257) at scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:238) at scala.collection.mutable.HashMap.$anonfun$foreach$1(HashMap.scala:149) at scala.collection.mutable.HashTable.foreachEntry(HashTable.scala:237) at scala.collection.mutable.HashTable.foreachEntry$(HashTable.scala:230) at scala.collection.mutable.HashMap.foreachEntry(HashMap.scala:44) at scala.collection.mutable.HashMap.foreach(HashMap.scala:149) at scala.collection.TraversableLike.map(TraversableLike.scala:238) at scala.collection.TraversableLike.map$(TraversableLike.scala:231) at scala.collection.AbstractTraversable.map(Traversable.scala:108) at org.apache.spark.sql.execution.ui.SQLAppStatusListener.aggregateMetrics(SQLAppStatusListener.scala:256) at org.apache.spark.sql.execution.ui.SQLAppStatusListener.$anonfun$onExecutionEnd$2(SQLAppStatusListener.scala:365) at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23) at org.apache.spark.util.Utils$.tryLog(Utils.scala:1971) at 
org.apache.spark.status.ElementTrackingStore$$anon$1.run(ElementTrackingStore.scala:117) at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) at java.util.concurrent.FutureTask.run(FutureTask.java:266) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) at java.lang.Thread.run(Thread.java:748)[info] - SPARK-32573: Eliminate NAAJ when BuildSide is EmptyHashedRelationWithAllNullKeys (2 seconds, 14 milliseconds) {code} This issue occurs mainly because, during AQE, the metrics update is overwritten when a sub-plan changes. For example, in this UT the plan changes from a BroadcastHashJoinExec into a LocalTableScanExec, and the onExecutionEnd action then tries to aggregate all metrics collected during the execution, which causes the NoSuchElementException -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
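The failure mode in the report above can be reduced to a few lines of Scala. This is a minimal sketch with made-up metric ids and names (not Spark's actual SQLAppStatusListener code), assuming the id -> metric map lost an entry when AQE replaced the sub-plan:

```scala
// Sketch of the bug: task metric values may reference an accumulator id
// (12 here) that the id -> metric map no longer contains after the sub-plan
// was replaced (e.g. BroadcastHashJoinExec -> LocalTableScanExec).
val metricTypes: Map[Long, String] = Map(10L -> "sum", 11L -> "size")
val taskValues: Map[Long, Long]    = Map(10L -> 4L, 12L -> 7L)

// Map.apply on the missing key 12 is what raises the exception in the log.
val failed =
  try { taskValues.map { case (id, v) => metricTypes(id) -> v }; false }
  catch { case _: java.util.NoSuchElementException => true }

// A defensive aggregation drops ids unknown to the live plan instead of throwing.
val aggregated: Map[String, Long] =
  taskValues.flatMap { case (id, v) => metricTypes.get(id).map(_ -> v) }
```

Here `failed` ends up true and `aggregated` keeps only the surviving metric; the actual fix applied in Spark may differ.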
[jira] [Updated] (SPARK-32573) Anti Join Improvement with EmptyHashedRelation and EmptyHashedRelationWithAllNullKeys
[ https://issues.apache.org/jira/browse/SPARK-32573?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Leanken.Lin updated SPARK-32573: Description: In SPARK-32290, we introduced several new types of HashedRelation * EmptyHashedRelation * EmptyHashedRelationWithAllNullKeys They were all limited to use in the NAAJ scenario only. These new HashedRelations could be applied to other scenarios for performance improvements. * EmptyHashedRelation could also be used in normal AntiJoin for a fast stop * While AQE is on and the buildSide is EmptyHashedRelationWithAllNullKeys, NAAJ can be converted to an empty LocalRelation to skip meaningless data iteration, since in Single-Key NAAJ, if a null key exists on the buildSide, all records on the streamedSide will be dropped. This patch includes two changes. * using EmptyHashedRelation to do a fast stop for common anti join as well * In AQE, eliminate BroadcastHashJoin(NAAJ) if the buildSide is an EmptyHashedRelationWithAllNullKeys was: In [SPARK-32290|https://issues.apache.org/jira/browse/SPARK-32290], we introduced several new types of HashedRelation * EmptyHashedRelation * EmptyHashedRelationWithAllNullKeys They were all limited to use in the NAAJ scenario only. But as an improvement, EmptyHashedRelation could also be used in normal AntiJoin for a fast stop, and in AQE, we can even eliminate the anti join when we know that the buildSide is empty. This patch includes two changes.
In Non-AQE, using EmptyHashedRelation to do a fast stop for common anti join as well In AQE, eliminate the anti join if the buildSide is an EmptyHashedRelation, i.e. its ShuffleWriteRecord is 0 Summary: Anti Join Improvement with EmptyHashedRelation and EmptyHashedRelationWithAllNullKeys (was: Eliminate Anti Join when BuildSide is Empty) > Anti Join Improvement with EmptyHashedRelation and > EmptyHashedRelationWithAllNullKeys > - > > Key: SPARK-32573 > URL: https://issues.apache.org/jira/browse/SPARK-32573 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Leanken.Lin >Priority: Minor > > In SPARK-32290, we introduced several new types of HashedRelation > * EmptyHashedRelation > * EmptyHashedRelationWithAllNullKeys > They were all limited to use in the NAAJ scenario only. These new > HashedRelations could be applied to other scenarios for performance improvements. > * EmptyHashedRelation could also be used in normal AntiJoin for a fast stop > * While AQE is on and the buildSide is EmptyHashedRelationWithAllNullKeys, > NAAJ can be converted to an empty LocalRelation to skip meaningless data > iteration, since in Single-Key NAAJ, if a null key exists on the buildSide, > all records on the streamedSide will be dropped. > This patch includes two changes. > * using EmptyHashedRelation to do a fast stop for common anti join as well > * In AQE, eliminate BroadcastHashJoin(NAAJ) if the buildSide is an > EmptyHashedRelationWithAllNullKeys -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-32573) Eliminate Anti Join when BuildSide is Empty
Leanken.Lin created SPARK-32573: --- Summary: Eliminate Anti Join when BuildSide is Empty Key: SPARK-32573 URL: https://issues.apache.org/jira/browse/SPARK-32573 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.0.0 Reporter: Leanken.Lin In [SPARK-32290|https://issues.apache.org/jira/browse/SPARK-32290], we introduced several new types of HashedRelation * EmptyHashedRelation * EmptyHashedRelationWithAllNullKeys They were all limited to use in the NAAJ scenario only. But as an improvement, EmptyHashedRelation could also be used in normal AntiJoin for a fast stop, and in AQE, we can even eliminate the anti join when we know that the buildSide is empty. This patch includes two changes. In Non-AQE, using EmptyHashedRelation to do a fast stop for common anti join as well In AQE, eliminate the anti join if the buildSide is an EmptyHashedRelation, i.e. its ShuffleWriteRecord is 0 -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
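The two behaviours proposed above can be sketched in a few lines of Scala; the relation types here are illustrative stand-ins, not Spark's real HashedRelation classes:

```scala
// Illustrative stand-ins for the build-side relation states.
sealed trait BuildSide
case object EmptyRelation extends BuildSide            // e.g. ShuffleWriteRecord is 0
final case class Hashed(keys: Set[Int]) extends BuildSide

// Anti join: keep streamed rows with no match on the build side.
// With an empty build side no row can ever match, so we can fast-stop
// (or, under AQE, eliminate the join entirely) and return the stream as-is.
def antiJoin(streamed: Seq[Int], build: BuildSide): Seq[Int] = build match {
  case EmptyRelation => streamed                       // fast stop, no per-row probing
  case Hashed(keys)  => streamed.filterNot(keys.contains)
}
```

The `EmptyRelation` branch is what makes the AQE elimination safe: the result is provably the streamed side unchanged.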
[jira] [Resolved] (SPARK-32494) Null Aware Anti Join Optimize Support Multi-Column
[ https://issues.apache.org/jira/browse/SPARK-32494?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Leanken.Lin resolved SPARK-32494. - Resolution: Later > Null Aware Anti Join Optimize Support Multi-Column > -- > > Key: SPARK-32494 > URL: https://issues.apache.org/jira/browse/SPARK-32494 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 3.0.0 >Reporter: Leanken.Lin >Priority: Major > > In Issue SPARK-32290, we managed to optimize BroadcastNestedLoopJoin into > BroadcastHashJoin within the Single-Column NAAJ scenario, by using a hash > lookup instead of a loop join. > It's simple to fulfill "NOT IN" logic when there is a single key, but > multi-column NOT IN is much more complicated because of all the null-aware > comparisons. > FYI, the logic for the single and multi column cases is defined at > ~/sql/core/src/test/resources/sql-tests/inputs/subquery/in-subquery/not-in-unit-tests-single-column.sql > ~/sql/core/src/test/resources/sql-tests/inputs/subquery/in-subquery/not-in-unit-tests-multi-column.sql > > Hence, a new type of HashedRelation, NullAwareHashedRelation, is proposed. > For NullAwareHashedRelation > # it will not skip any-null-column keys as LongHashedRelation and > UnsafeHashedRelation do. > # while building a NullAwareHashedRelation, extra keys are put into the > relation so that null-aware column comparison can be done in hash lookup > style. The duplication factor is 2^numKeys - 1; for example, to support NAAJ > with a 3-column join key, the buildSide would be expanded (2^3 - 1) times, 7X. > For example, if there is a UnsafeRow key (1,2,3) > In NullAware Mode, it should be expanded into 7 keys with extra C(3,1) and > C(3,2) combinations; within the combinations, these records are duplicated > with null padding as follows.
> Original record > (1,2,3) > Extra records to be appended into NullAwareHashedRelation > (null, 2, 3) (1, null, 3) (1, 2, null) > (null, null, 3) (null, 2, null) (1, null, null) > with the expanded data we can extract a common pattern for both single and > multi column. allNull refers to an UnsafeRow whose columns are all null. > * buildSide is empty input => return all rows > * allNullColumnKey exists in buildSide input => reject all rows > * if streamedSideRow.allNull is true => drop the row > * if streamedSideRow.allNull is false & findMatch in NullAwareHashedRelation > => drop the row > * if streamedSideRow.allNull is false & notFindMatch in > NullAwareHashedRelation => return the row > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-32494) Null Aware Anti Join Optimize Support Multi-Column
[ https://issues.apache.org/jira/browse/SPARK-32494?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Leanken.Lin updated SPARK-32494: Description: In Issue SPARK-32290, we managed to optimize BroadcastNestedLoopJoin into BroadcastHashJoin within the Single-Column NAAJ scenario, by using a hash lookup instead of a loop join. It's simple to fulfill "NOT IN" logic when there is a single key, but multi-column NOT IN is much more complicated because of all the null-aware comparisons. FYI, the logic for the single and multi column cases is defined at ~/sql/core/src/test/resources/sql-tests/inputs/subquery/in-subquery/not-in-unit-tests-single-column.sql ~/sql/core/src/test/resources/sql-tests/inputs/subquery/in-subquery/not-in-unit-tests-multi-column.sql Hence, a new type of HashedRelation, NullAwareHashedRelation, is proposed. For NullAwareHashedRelation # it will not skip any-null-column keys as LongHashedRelation and UnsafeHashedRelation do. # while building a NullAwareHashedRelation, extra keys are put into the relation so that null-aware column comparison can be done in hash lookup style. The duplication factor is 2^numKeys - 1; for example, to support NAAJ with a 3-column join key, the buildSide would be expanded (2^3 - 1) times, 7X. For example, if there is a UnsafeRow key (1,2,3) In NullAware Mode, it should be expanded into 7 keys with extra C(3,1) and C(3,2) combinations; within the combinations, these records are duplicated with null padding as follows. Original record (1,2,3) Extra records to be appended into NullAwareHashedRelation (null, 2, 3) (1, null, 3) (1, 2, null) (null, null, 3) (null, 2, null) (1, null, null) with the expanded data we can extract a common pattern for both single and multi column. allNull refers to an UnsafeRow whose columns are all null.
* buildSide is empty input => return all rows * allNullColumnKey exists in buildSide input => reject all rows * if streamedSideRow.allNull is true => drop the row * if streamedSideRow.allNull is false & findMatch in NullAwareHashedRelation => drop the row * if streamedSideRow.allNull is false & notFindMatch in NullAwareHashedRelation => return the row was: In Issue [SPARK-32290|https://issues.apache.org/jira/browse/SPARK-32290], we managed to optimize BroadcastNestedLoopJoin into BroadcastHashJoin within the Single-Column NAAJ scenario, by using a hash lookup instead of a loop join. It's simple to fulfill "NOT IN" logic when there is a single key, but multi-column NOT IN is much more complicated because of all the null-aware comparisons. Hence, a new type of HashedRelation, NullAwareHashedRelation, is proposed. For NullAwareHashedRelation # it will not skip any-null-column keys as LongHashedRelation and UnsafeHashedRelation do # while building a NullAwareHashedRelation, extra keys are put into the relation so that null-aware column comparison can be done in hash lookup style. The duplication factor is 2^numKeys - 1; for example, to support NAAJ with a 3-column join key, the buildSide would be expanded (2^3 - 1) times, 7X. For example, if there is a UnsafeRow key (1,2,3) In NullAware Mode, it should be expanded into 7 keys with extra C(3,1) and C(3,2) combinations; within the combinations, these records are duplicated with null padding as follows. Original record (1,2,3) Extra records to be appended into HashedRelation (null, 2, 3) (1, null, 3) (1, 2, null) (null, null, 3) (null, 2, null) (1, null, null) with the expanded data we can extract a common pattern for both single and multi column. allNull refers to an UnsafeRow whose columns are all null.
* buildSide is empty input => return all rows * allNullColumnKey exists in buildSide input => reject all rows * if streamedSideRow.allNull is true => drop the row * if streamedSideRow.allNull is false & findMatch in NullAwareHashedRelation => drop the row * if streamedSideRow.allNull is false & notFindMatch in NullAwareHashedRelation => return the row > Null Aware Anti Join Optimize Support Multi-Column > -- > > Key: SPARK-32494 > URL: https://issues.apache.org/jira/browse/SPARK-32494 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 3.0.0 >Reporter: Leanken.Lin >Priority: Major > > In Issue SPARK-32290, we managed to optimize BroadcastNestedLoopJoin into > BroadcastHashJoin within the Single-Column NAAJ scenario, by using a hash > lookup instead of a loop join. > It's simple to fulfill "NOT IN" logic when there is a single key, but > multi-column NOT IN is much more complicated because of all the null-aware > comparisons. > FYI, the logic for the single and multi column cases is defined at > ~/sql/core/src/test/resources/sql-tests/inputs/subquery/in-subquery/not-in-unit-tests-single-column.sql >
[jira] [Created] (SPARK-32494) Null Aware Anti Join Optimize Support Multi-Column
Leanken.Lin created SPARK-32494: --- Summary: Null Aware Anti Join Optimize Support Multi-Column Key: SPARK-32494 URL: https://issues.apache.org/jira/browse/SPARK-32494 Project: Spark Issue Type: New Feature Components: SQL Affects Versions: 3.0.0 Reporter: Leanken.Lin In Issue [SPARK-32290|https://issues.apache.org/jira/browse/SPARK-32290], we managed to optimize BroadcastNestedLoopJoin into BroadcastHashJoin within the Single-Column NAAJ scenario, by using a hash lookup instead of a loop join. It's simple to fulfill "NOT IN" logic when there is a single key, but multi-column NOT IN is much more complicated because of all the null-aware comparisons. Hence, a new type of HashedRelation, NullAwareHashedRelation, is proposed. For NullAwareHashedRelation # it will not skip any-null-column keys as LongHashedRelation and UnsafeHashedRelation do # while building a NullAwareHashedRelation, extra keys are put into the relation so that null-aware column comparison can be done in hash lookup style. The duplication factor is 2^numKeys - 1; for example, to support NAAJ with a 3-column join key, the buildSide would be expanded (2^3 - 1) times, 7X. For example, if there is a UnsafeRow key (1,2,3) In NullAware Mode, it should be expanded into 7 keys with extra C(3,1) and C(3,2) combinations; within the combinations, these records are duplicated with null padding as follows. Original record (1,2,3) Extra records to be appended into HashedRelation (null, 2, 3) (1, null, 3) (1, 2, null) (null, null, 3) (null, 2, null) (1, null, null) with the expanded data we can extract a common pattern for both single and multi column. allNull refers to an UnsafeRow whose columns are all null.
* buildSide is empty input => return all rows * allNullColumnKey exists in buildSide input => reject all rows * if streamedSideRow.allNull is true => drop the row * if streamedSideRow.allNull is false & findMatch in NullAwareHashedRelation => drop the row * if streamedSideRow.allNull is false & notFindMatch in NullAwareHashedRelation => return the row -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
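The 2^N - 1 null-padding expansion described in this issue can be sketched directly; `expandNullAware` is a hypothetical helper for illustration, not Spark's actual relation-builder code:

```scala
// For an N-column key, emit one variant per subset of positions nulled out,
// excluding the all-null combination: 2^N - 1 rows including the original.
def expandNullAware(key: Seq[Any]): Seq[Seq[Any]] = {
  val n = key.length
  (0 until (1 << n))
    .filter(_ != (1 << n) - 1)                   // skip the all-null variant
    .map { mask =>
      key.zipWithIndex.map { case (v, i) =>
        if ((mask & (1 << i)) != 0) null else v  // null out position i
      }
    }
}
```

For the key (1, 2, 3) this yields exactly the 7 rows listed in the description, e.g. (null, 2, 3) and (1, null, null).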
[jira] [Resolved] (SPARK-32474) NullAwareAntiJoin multi-column support
[ https://issues.apache.org/jira/browse/SPARK-32474?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Leanken.Lin resolved SPARK-32474. - Resolution: Duplicate > NullAwareAntiJoin multi-column support > -- > > Key: SPARK-32474 > URL: https://issues.apache.org/jira/browse/SPARK-32474 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Leanken.Lin >Priority: Minor > > This is a follow-up improvement of Issue SPARK-32290. > In SPARK-32290, we already optimized NAAJ from BroadcastNestedLoopJoin to > BroadcastHashJoin, which improves the total calculation from O(M*N) to O(M), > but it only targets the Single Column Case, because multi column support is > much more complicated. > See. [http://www.vldb.org/pvldb/vol2/vldb09-423.pdf] Section 6 > > FYI, the logic for the single and multi column cases is defined at > ~/sql/core/src/test/resources/sql-tests/inputs/subquery/in-subquery/not-in-unit-tests-single-column.sql > ~/sql/core/src/test/resources/sql-tests/inputs/subquery/in-subquery/not-in-unit-tests-multi-column.sql > > To support multi column, I propose the following idea to see whether > multi-column support is worth some trade-offs. It needs some data expansion in > HashedRelation, and I would call this new type of HashedRelation > NullAwareHashedRelation. > > In NullAwareHashedRelation, keys with null columns are allowed, which is the > opposite of LongHashedRelation and UnsafeHashedRelation; and a single key > might be expanded into 2^N - 1 records (N refers to the columnNum of the key). > For example, if a record > (1, 2, 3) is about to be inserted into NullAwareHashedRelation, we take the > C(3,1) and C(3,2) combinations to copy the original key row, set null at the > target positions, and then insert them into NullAwareHashedRelation. Including > the original key row, there will be 7 key rows inserted as follows.
> (null, 2, 3) > (1, null, 3) > (1, 2, null) > (null, null, 3) > (null, 2, null) > (1, null, null) > (1, 2, 3) > > with the expanded data we can extract a common pattern for both single and > multi column. allNull refers to an UnsafeRow whose columns are all null. > * buildSide is empty input => return all rows > * allNullColumnKey exists in buildSide input => reject all rows > * if streamedSideRow.allNull is true => drop the row > * if streamedSideRow.allNull is false & findMatch in NullAwareHashedRelation > => drop the row > * if streamedSideRow.allNull is false & notFindMatch in > NullAwareHashedRelation => return the row > > This solution will surely expand buildSide data 2^N - 1 times, but since NAAJ > normally involves up to 2~3 columns in production queries, I suppose it's > acceptable to expand buildSide data to around 7X. I would also add a limit on > the max column count supported for NAAJ, basically no more than 3. > > > > > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
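The common streamed-side pattern listed above can be written down as one small decision function; the names here are illustrative, not Spark's real API:

```scala
// Decide whether a streamed row survives the null-aware anti join,
// following the five rules from the description.
def keepRow(row: Seq[Any],
            buildEmpty: Boolean,
            buildHasAllNullKey: Boolean,
            findMatch: Seq[Any] => Boolean): Boolean =
  if (buildEmpty) true                   // empty build side: return all rows
  else if (buildHasAllNullKey) false     // all-null key on build side: reject all
  else if (row.forall(_ == null)) false  // streamed row is allNull: drop it
  else !findMatch(row)                   // keep only rows with no match
```

Note the probe (`findMatch`) is only consulted in the last branch, which is what lets the two degenerate build-side cases short-circuit.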
[jira] [Updated] (SPARK-32474) NullAwareAntiJoin multi-column support
[ https://issues.apache.org/jira/browse/SPARK-32474?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Leanken.Lin updated SPARK-32474: Description: This is a follow-up improvement of Issue SPARK-32290. In SPARK-32290, we already optimized NAAJ from BroadcastNestedLoopJoin to BroadcastHashJoin, which improves the total calculation from O(M*N) to O(M), but it only targets the Single Column Case, because multi column support is much more complicated. See. [http://www.vldb.org/pvldb/vol2/vldb09-423.pdf] Section 6 FYI, the logic for the single and multi column cases is defined at ~/sql/core/src/test/resources/sql-tests/inputs/subquery/in-subquery/not-in-unit-tests-single-column.sql ~/sql/core/src/test/resources/sql-tests/inputs/subquery/in-subquery/not-in-unit-tests-multi-column.sql To support multi column, I propose the following idea to see whether multi-column support is worth some trade-offs. It needs some data expansion in HashedRelation, and I would call this new type of HashedRelation NullAwareHashedRelation. In NullAwareHashedRelation, keys with null columns are allowed, which is the opposite of LongHashedRelation and UnsafeHashedRelation; and a single key might be expanded into 2^N - 1 records (N refers to the columnNum of the key). For example, if a record (1, 2, 3) is about to be inserted into NullAwareHashedRelation, we take the C(3,1) and C(3,2) combinations to copy the original key row, set null at the target positions, and then insert them into NullAwareHashedRelation. Including the original key row, there will be 7 key rows inserted as follows. (null, 2, 3) (1, null, 3) (1, 2, null) (null, null, 3) (null, 2, null) (1, null, null) (1, 2, 3) with the expanded data we can extract a common pattern for both single and multi column. allNull refers to an UnsafeRow whose columns are all null.
* buildSide is empty input => return all rows * allNullColumnKey exists in buildSide input => reject all rows * if streamedSideRow.allNull is true => drop the row * if streamedSideRow.allNull is false & findMatch in NullAwareHashedRelation => drop the row * if streamedSideRow.allNull is false & notFindMatch in NullAwareHashedRelation => return the row This solution will surely expand buildSide data 2^N - 1 times, but since NAAJ normally involves up to 2~3 columns in production queries, I suppose it's acceptable to expand buildSide data to around 7X. I would also add a limit on the max column count supported for NAAJ, basically no more than 3. was: This is a follow-up improvement of Issue SPARK-32290. In SPARK-32290, we already optimized NAAJ from BroadcastNestedLoopJoin to BroadcastHashJoin, which improves the total calculation from O(M*N) to O(M), but it only targets the Single Column Case, because multi column support is much more complicated. See. [http://www.vldb.org/pvldb/vol2/vldb09-423.pdf] Section 6 NAAJ multi column logical !image-2020-07-29-12-03-22-554.png! NAAJ single column logical !image-2020-07-29-12-03-32-677.png! To support multi column, I propose the following idea to see whether multi-column support is worth some trade-offs. It needs some data expansion in HashedRelation, and I would call this new type of HashedRelation NullAwareHashedRelation. In NullAwareHashedRelation, keys with null columns are allowed, which is the opposite of LongHashedRelation and UnsafeHashedRelation; and a single key might be expanded into 2^N - 1 records (N refers to the columnNum of the key). For example, if a record (1, 2, 3) is about to be inserted into NullAwareHashedRelation, we take the C(3,1) and C(3,2) combinations to copy the original key row, set null at the target positions, and then insert them into NullAwareHashedRelation. Including the origin key row, there will be 7 key rows inserted as follows.
(null, 2, 3) (1, null, 3) (1, 2, null) (null, null, 3) (null, 2, null) (1, null, null) (1, 2, 3) with the expanded data we can extract a common pattern for both single and multi column. allNull refers to an UnsafeRow whose columns are all null. * buildSide is empty input => return all rows * allNullColumnKey exists in buildSide input => reject all rows * if streamedSideRow.allNull is true => drop the row * if streamedSideRow.allNull is false & findMatch in NullAwareHashedRelation => drop the row * if streamedSideRow.allNull is false & notFindMatch in NullAwareHashedRelation => return the row This solution will surely expand buildSide data 2^N - 1 times, but since NAAJ normally involves up to 2~3 columns in production queries, I suppose it's acceptable to expand buildSide data to around 7X. I would also add a limit on the max column count supported for NAAJ, basically no more than 3. > NullAwareAntiJoin multi-column support > -- > > Key:
[jira] [Updated] (SPARK-32474) NullAwareAntiJoin multi-column support
[ https://issues.apache.org/jira/browse/SPARK-32474?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Leanken.Lin updated SPARK-32474: Description: This is a follow up improvement of Issue SPARK-32290. In SPARK-32290, we already optimize NAAJ from BroadcastNestedLoopJoin to BroadcastHashJoin, which improve total calculation from O(M*N) to O(M), but it's only targeting on Single Column Case, because it's much more complicate in multi column support. See. [http://www.vldb.org/pvldb/vol2/vldb09-423.pdf] Section 6 NAAJ multi column logical !image-2020-07-29-12-03-22-554.png! NAAJ single column logical !image-2020-07-29-12-03-32-677.png! For supporting multi column, I throw the following idea and see if is it worth to do multi-column support with some trade off. I would need to do some data expansion in HashedRelation, and i would call this new type of HashedRelation as NullAwareHashedRelation. In NullAwareHashedRelation, key with null column is allowed, which is opposite in LongHashedRelation and UnsafeHashedRelation; And single key might be expanded into 2^N - 1 records, (N refer to columnNum of the key). for example, if there is a record (1 ,2, 3) is about to insert into NullAwareHashedRelation, we take C(1,3), C(2,3) as a combination to copy origin key row, and setNull at target position, and then insert into NullAwareHashedRelation. including the origin key row, there will be 7 key row inserted as follow. (null, 2, 3) (1, null, 3) (1, 2, null) (null, null, 3) (null, 2, null) (1, null, null) (1, 2, 3) with the expanded data we can extract a common pattern for both single and multi column. allNull refer to a unsafeRow which has all null columns. 
* buildSide is empty input => return all rows
* allNullColumnKey exists in buildSide input => reject all rows
* if streamedSideRow.allNull is true => drop the row
* if streamedSideRow.allNull is false & findMatch in NullAwareHashedRelation => drop the row
* if streamedSideRow.allNull is false & notFindMatch in NullAwareHashedRelation => return the row

This solution will certainly expand the buildSide data by up to 2^N - 1 times, but since a NAAJ key normally has only 2~3 columns in production queries, I suppose it is acceptable to expand the buildSide data by around 7X. I would also put a limit on the maximum number of columns supported for NAAJ, basically no more than 3.

was: This is a follow up improvement of Issue [SPARK-32290|https://issues.apache.org/jira/browse/SPARK-32290]. In SPARK-32290, we already optimize NAAJ from BroadcastNestedLoopJoin to BroadcastHashJoin, which improve total calculation from O(M*N) to O(M), but it's only targeting on Single Column Case, because it's much more complicate in multi column support. See. http://www.vldb.org/pvldb/vol2/vldb09-423.pdf Section 6 NAAJ multi column logical !image-2020-07-29-11-41-11-939.png! NAAJ single column logical !image-2020-07-29-11-41-03-757.png! For supporting multi column, I throw the following idea and see if is it worth to do multi-column support with some trade off. I would need to do some data expansion in HashedRelation, and i would call this new type of HashedRelation as NullAwareHashedRelation. In NullAwareHashedRelation, key with null column is allowed, which is opposite in LongHashedRelation and UnsafeHashedRelation; And single key might be expanded into 2^N - 1 records, (N refer to columnNum of the key). for example, if there is a record (1 ,2, 3) is about to insert into NullAwareHashedRelation, we take C(1,3), C(2,3) as a combination to copy origin key row, and setNull at target position, and then insert into NullAwareHashedRelation. 
including the origin key row, there will be 7 key row inserted as follow. (null, 2, 3) (1, null, 3) (1, 2, null) (null, null, 3) (null, 2, null) (1, null, null) (1, 2, 3) with the expanded data we can extract a common pattern for both single and multi column. allNull refer to a unsafeRow which has all null columns. * buildSide is empty input => return all rows * allNullColumnKey Exists In buildSide input => reject all rows * if streamedSideRow.allNull is true => drop the row * if streamedSideRow.allNull is false & findMatch in NullAwareHashedRelation => drop the row * if streamedSideRow.allNull is false & notFindMatch in NullAwareHashedRelation => return the row this solution will sure make buildSide data expand to 2^N-1 times, but since it is normally up to 2~3 column in NAAJ in normal production query, i suppose that it's acceptable to expand buildSide data to around 7X. I would also have a limitation of max column support for NAAJ, basically should not more than 3. > NullAwareAntiJoin multi-column support > -- > > Key: SPARK-32474 > URL: https://issues.apache.org/jira/browse/SPARK-32474 >
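The 2^N - 1 key expansion described in this ticket can be sketched outside of Spark in a few lines. This is a minimal Python illustration, not Spark code; `expand_key` is a hypothetical helper name:

```python
from itertools import combinations

def expand_key(key):
    """Expand one build-side key row into its 2^N - 1 null-masked
    variants (including the original row). For each C(1,N) .. C(N-1,N)
    choice of column positions, copy the row and null those positions."""
    n = len(key)
    variants = [tuple(key)]
    for r in range(1, n):  # note: the all-null variant is never generated
        for positions in combinations(range(n), r):
            row = list(key)
            for p in positions:
                row[p] = None
            variants.append(tuple(row))
    return variants

# A 3-column key (1, 2, 3) expands to 2^3 - 1 = 7 rows,
# matching the seven rows listed in the description.
rows = expand_key((1, 2, 3))
```

For N = 3 this yields exactly the seven rows listed above, and for N = 2 only three, which is why the buildSide blow-up stays modest as long as the key has at most 2~3 columns.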
[jira] [Created] (SPARK-32474) NullAwareAntiJoin multi-column support
Leanken.Lin created SPARK-32474: --- Summary: NullAwareAntiJoin multi-column support Key: SPARK-32474 URL: https://issues.apache.org/jira/browse/SPARK-32474 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.0.0 Reporter: Leanken.Lin Fix For: 3.1.0

This is a follow-up improvement of Issue [SPARK-32290|https://issues.apache.org/jira/browse/SPARK-32290]. In SPARK-32290, we already optimized NAAJ from BroadcastNestedLoopJoin to BroadcastHashJoin, which improves the total computation from O(M*N) to O(M), but it only targets the single-column case, because multi-column support is much more complicated. See http://www.vldb.org/pvldb/vol2/vldb09-423.pdf Section 6.

NAAJ multi column logical !image-2020-07-29-11-41-11-939.png!
NAAJ single column logical !image-2020-07-29-11-41-03-757.png!

To support multiple columns, I propose the following idea to see whether multi-column support is worth the trade-off. I would need to do some data expansion in HashedRelation, and I would call this new type of HashedRelation a NullAwareHashedRelation. In a NullAwareHashedRelation, keys with null columns are allowed, unlike in LongHashedRelation and UnsafeHashedRelation; and a single key may be expanded into 2^N - 1 records (where N is the number of key columns). For example, when a record (1, 2, 3) is about to be inserted into the NullAwareHashedRelation, we take each C(1,3) and C(2,3) combination of column positions, copy the original key row, set the selected positions to null, and insert the copy as well. Including the original key row, 7 key rows will be inserted, as follows:

(null, 2, 3) (1, null, 3) (1, 2, null) (null, null, 3) (None, 2, null) (1, null, null) (1, 2, 3)

With the expanded data we can extract a common pattern for both the single- and multi-column cases. allNull refers to an UnsafeRow whose columns are all null.

* buildSide is empty input => return all rows
* allNullColumnKey exists in buildSide input => reject all rows
* if streamedSideRow.allNull is true => drop the row
* if streamedSideRow.allNull is false & findMatch in NullAwareHashedRelation => drop the row
* if streamedSideRow.allNull is false & notFindMatch in NullAwareHashedRelation => return the row

This solution will certainly expand the buildSide data by up to 2^N - 1 times, but since a NAAJ key normally has only 2~3 columns in production queries, I suppose it is acceptable to expand the buildSide data by around 7X. I would also put a limit on the maximum number of columns supported for NAAJ, basically no more than 3.

-- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
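The five streamed-side rules above can be sketched in plain Python against an already-expanded build side. This is an illustration only; `null_aware_anti_join` and `BUILD_KEYS` are made-up names, and the build set below is the hand-expanded form of the single original key (1, 2, 3) from the description:

```python
# Hand-expanded build side for the single original key (1, 2, 3):
# the 7 null-masked variants listed in the description.
BUILD_KEYS = {
    (1, 2, 3), (None, 2, 3), (1, None, 3), (1, 2, None),
    (None, None, 3), (None, 2, None), (1, None, None),
}

def null_aware_anti_join(streamed, build_keys):
    """Apply the streamed-side rules of the null-aware anti join."""
    if not build_keys:
        return list(streamed)          # buildSide empty => return all rows
    if any(all(c is None for c in k) for k in build_keys):
        return []                      # allNullColumnKey in buildSide => reject all rows
    out = []
    for row in streamed:
        if all(c is None for c in row):
            continue                   # streamed row all null => drop
        if tuple(row) in build_keys:
            continue                   # match found in relation => drop
        out.append(row)                # no match => return the row
    return out
```

With this build side, a streamed row (1, None, 3) is dropped because the expansion already planted (1, None, 3) in the relation, which is exactly the effect the expansion is designed to achieve.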
[jira] [Updated] (SPARK-32290) NotInSubquery SingleColumn Optimize
[ https://issues.apache.org/jira/browse/SPARK-32290?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Leanken.Lin updated SPARK-32290: Fix Version/s: 3.0.1 > NotInSubquery SingleColumn Optimize > --- > > Key: SPARK-32290 > URL: https://issues.apache.org/jira/browse/SPARK-32290 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.0.0 >Reporter: Leanken.Lin >Priority: Minor > Fix For: 3.0.1, 3.1.0 > > Original Estimate: 24h > Remaining Estimate: 24h > > Normally, > A NotInSubquery will plan into BroadcastNestedLoopJoinExec, which is very > very time consuming. For example, I've done TPCH benchmark lately, Query 16 > almost took half of the entire TPCH 22Query execution Time. So i proposed > that to do the following optimize. > Inside BroadcastNestedLoopJoinExec, we can identify not in subquery with only > single column in following pattern. > {code:java} > case _@Or( > _@EqualTo(leftAttr: AttributeReference, rightAttr: > AttributeReference), > _@IsNull( > _@EqualTo(_: AttributeReference, _: AttributeReference) > ) > ) > {code} > if buildSide rows is small enough, we can change build side data into a > HashMap. > so the M*N calculation can be optimized into M*log(N) > I've done a benchmark job in 1TB TPCH, before apply the optimize > Query 16 take around 18 mins to finish, after apply the M*log(N) optimize, it > takes only 30s to finish. > But this optimize only works on single column not in subquery, so i am here > to seek advise whether the community need this update or not. I will do the > pull request first, if the community member thought it's hack, it's fine to > just ignore this request. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-32290) NotInSubquery SingleColumn Optimize
Leanken.Lin created SPARK-32290: --- Summary: NotInSubquery SingleColumn Optimize Key: SPARK-32290 URL: https://issues.apache.org/jira/browse/SPARK-32290 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.0.0 Reporter: Leanken.Lin Fix For: 3.1.0

Normally, a NotInSubquery is planned into BroadcastNestedLoopJoinExec, which is very time-consuming. For example, in a TPCH benchmark I ran recently, Query 16 took almost half of the total execution time of all 22 TPCH queries. So I propose the following optimization.

Inside BroadcastNestedLoopJoinExec, we can identify a single-column not-in subquery by the following pattern:

{code:java}
case _@Or(
  _@EqualTo(leftAttr: AttributeReference, rightAttr: AttributeReference),
  _@IsNull(
    _@EqualTo(_: AttributeReference, _: AttributeReference)
  )
)
{code}

If the buildSide row count is small enough, we can turn the build-side data into a HashMap, so the M*N calculation can be optimized into M*log(N).

I ran a benchmark on 1TB TPCH: before applying the optimization, Query 16 took around 18 minutes to finish; after applying the M*log(N) optimization, it takes only 30s.

But this optimization only works on single-column not-in subqueries, so I am here to seek advice on whether the community needs this change. I will open the pull request first; if the community thinks it is a hack, it is fine to just ignore this request.

-- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
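The semantics that BroadcastNestedLoopJoinExec evaluates pair-by-pair here can be sketched with a hashed build side in a few lines of Python. This is an illustration of the idea, not Spark code; `not_in_filter` is a hypothetical name, and a Python set gives roughly O(1) probes rather than the ticket's M*log(N):

```python
def not_in_filter(streamed, build):
    """Single-column `x NOT IN (subquery)` semantics with a hashed
    build side: one probe per streamed row instead of an M*N loop."""
    build_set = set(build)   # built once from the broadcast side
    if None in build_set:
        # NOT IN against a set containing null is never true
        return []
    # a null streamed value never qualifies either (three-valued logic)
    return [v for v in streamed if v is not None and v not in build_set]
```

The null handling is why the physical plan needs the `Or(EqualTo, IsNull(EqualTo))` pattern above: equality against a null is unknown, and a single null on either side disqualifies the row.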
[jira] [Updated] (SPARK-28050) DataFrameWriter support insertInto a specific table partition
[ https://issues.apache.org/jira/browse/SPARK-28050?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Leanken.Lin updated SPARK-28050: Description:

{code:java}
// create a partitioned table and a DataFrame to insert
val ptTableName = "mc_test_pt_table"
sql(s"CREATE TABLE ${ptTableName} (name STRING, num BIGINT) PARTITIONED BY (pt1 STRING, pt2 STRING)")
val df = spark.sparkContext.parallelize(0 to 99, 2)
  .map(f => (s"name-$f", f))
  .toDF("name", "num")
// if I want to insert df into a specific partition,
// say pt1='2018',pt2='0601', the current API does not support it;
// the only workaround is the following
df.createOrReplaceTempView(s"${ptTableName}_tmp_view")
sql(s"insert into table ${ptTableName} partition (pt1='2018', pt2='0601') select * from ${ptTableName}_tmp_view")
{code}

I propose to add another API to DataFrameWriter that can do something like:

{code:java}
df.write.insertInto(ptTableName, "pt1='2018',pt2='0601'")
{code}

We have a lot of this kind of scenario in our production environment; providing an API like this would make it much less painful.

was: ``` val ptTableName = "mc_test_pt_table" sql(s"CREATE TABLE ${ptTableName} (name STRING, num BIGINT) PARTITIONED BY (pt1 STRING, pt2 STRING)") val df = spark.sparkContext.parallelize(0 to 99, 2) .map(f => { (s"name-$f", f) }) .toDF("name", "num") // if i want to insert df into a specific partition // say pt1='2018',pt2='0601' current api does not supported // only with following work around df.createOrReplaceTempView(s"${ptTableName}_tmp_view") sql(s"insert into table ${ptTableName} partition (pt1='2018', pt2='0601') select * from ${ptTableName}_tmp_view") ``` Propose to have another API in DataframeWriter that can do somethink like: ``` df.write.insertInto(ptTableName, "pt1='2018',pt2='0601'") ``` we have a lot of this kind of scenario in our production env. providing a api like this will make us less painful. 
> DataFrameWriter support insertInto a specific table partition > - > > Key: SPARK-28050 > URL: https://issues.apache.org/jira/browse/SPARK-28050 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 2.3.3, 2.4.3 >Reporter: Leanken.Lin >Priority: Minor > Fix For: 2.3.3, 2.4.3 > > > {code:java} > // Some comments here > val ptTableName = "mc_test_pt_table" > sql(s"CREATE TABLE ${ptTableName} (name STRING, num BIGINT) PARTITIONED BY > (pt1 STRING, pt2 STRING)") > val df = spark.sparkContext.parallelize(0 to 99, 2) > .map(f => > { > (s"name-$f", f) > }) > .toDF("name", "num") > // if i want to insert df into a specific partition > // say pt1='2018',pt2='0601' current api does not supported > // only with following work around > df.createOrReplaceTempView(s"${ptTableName}_tmp_view") > sql(s"insert into table ${ptTableName} partition (pt1='2018', pt2='0601') > select * from ${ptTableName}_tmp_view") > {code} > Propose to have another API in DataframeWriter that can do somethink like: > {code:java} > df.write.insertInto(ptTableName, "pt1='2018',pt2='0601'") > {code} > we have a lot of this kind of scenario in our production env. providing a api > like this will make us less painful. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
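The workaround above boils down to registering a temp view and generating an `INSERT ... PARTITION` statement. A small Python sketch of assembling that statement follows; the function name is hypothetical, and the `"pt1='2018',pt2='0601'"` spec format is the one proposed in this ticket, not an existing Spark API:

```python
def insert_into_partition_sql(table, partition_spec, view):
    """Build the INSERT statement used in the workaround, from a
    partition spec in the proposed "pt1='2018',pt2='0601'" form."""
    # normalize the comma-separated spec into "pt1='2018', pt2='0601'"
    parts = ", ".join(p.strip() for p in partition_spec.split(","))
    return (f"insert into table {table} partition ({parts}) "
            f"select * from {view}")

stmt = insert_into_partition_sql(
    "mc_test_pt_table", "pt1='2018',pt2='0601'", "mc_test_pt_table_tmp_view")
```

The proposed `df.write.insertInto(ptTableName, "pt1='2018',pt2='0601'")` would effectively fold this view-plus-SQL dance into a single writer call.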
[jira] [Updated] (SPARK-28050) DataFrameWriter support insertInto a specific table partition
[ https://issues.apache.org/jira/browse/SPARK-28050?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Leanken.Lin updated SPARK-28050: Description: ``` val ptTableName = "mc_test_pt_table" sql(s"CREATE TABLE ${ptTableName} (name STRING, num BIGINT) PARTITIONED BY (pt1 STRING, pt2 STRING)") val df = spark.sparkContext.parallelize(0 to 99, 2) .map(f => { (s"name-$f", f) }) .toDF("name", "num") // if I want to insert df into a specific partition // say pt1='2018',pt2='0601', the current API does not support it // the only workaround is the following df.createOrReplaceTempView(s"${ptTableName}_tmp_view") sql(s"insert into table ${ptTableName} partition (pt1='2018', pt2='0601') select * from ${ptTableName}_tmp_view") ``` I propose to add another API to DataFrameWriter that can do something like: ``` df.write.insertInto(ptTableName, "pt1='2018',pt2='0601'") ``` We have a lot of this kind of scenario in our production environment; providing an API like this would make it much less painful. was: val ptTableName = "mc_test_pt_table" sql(s"CREATE TABLE ${ptTableName} (name STRING, num BIGINT) PARTITIONED BY (pt1 STRING, pt2 STRING)") val df = spark.sparkContext.parallelize(0 to 99, 2) .map(f => { (s"name-$f", f) }) .toDF("name", "num") // if i want to insert df into a specific partition // say pt1='2018',pt2='0601' current api does not supported // only with following work around df.createOrReplaceTempView(s"${ptTableName}_tmp_view") sql(s"insert into table ${ptTableName} partition (pt1='2018', pt2='0601') select * from ${ptTableName}_tmp_view") Propose to have another API in DataframeWriter that can do somethink like: df.write.insertInto(ptTableName, "pt1='2018',pt2='0601'") we have a lot of this kind of scenario in our production env. providing a api like this will make us less painful. 
> DataFrameWriter support insertInto a specific table partition > - > > Key: SPARK-28050 > URL: https://issues.apache.org/jira/browse/SPARK-28050 > Project: Spark > Issue Type: New Feature > Components: SQL >Affects Versions: 2.3.3, 2.4.3 >Reporter: Leanken.Lin >Priority: Minor > Fix For: 2.3.3, 2.4.3 > > > ``` > val ptTableName = "mc_test_pt_table" > sql(s"CREATE TABLE ${ptTableName} (name STRING, num BIGINT) PARTITIONED BY > (pt1 STRING, pt2 STRING)") > val df = spark.sparkContext.parallelize(0 to 99, 2) > .map(f => > { > (s"name-$f", f) > }) > .toDF("name", "num") > // if i want to insert df into a specific partition > // say pt1='2018',pt2='0601' current api does not supported > // only with following work around > df.createOrReplaceTempView(s"${ptTableName}_tmp_view") > sql(s"insert into table ${ptTableName} partition (pt1='2018', pt2='0601') > select * from ${ptTableName}_tmp_view") > ``` > Propose to have another API in DataframeWriter that can do somethink like: > ``` > df.write.insertInto(ptTableName, "pt1='2018',pt2='0601'") > ``` > we have a lot of this kind of scenario in our production env. providing a api > like this will make us less painful. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-28050) DataFrameWriter support insertInto a specific table partition
Leanken.Lin created SPARK-28050: --- Summary: DataFrameWriter support insertInto a specific table partition Key: SPARK-28050 URL: https://issues.apache.org/jira/browse/SPARK-28050 Project: Spark Issue Type: New Feature Components: SQL Affects Versions: 2.4.3, 2.3.3 Reporter: Leanken.Lin Fix For: 2.4.3, 2.3.3

val ptTableName = "mc_test_pt_table"
sql(s"CREATE TABLE ${ptTableName} (name STRING, num BIGINT) PARTITIONED BY (pt1 STRING, pt2 STRING)")
val df = spark.sparkContext.parallelize(0 to 99, 2)
  .map(f => { (s"name-$f", f) })
  .toDF("name", "num")
// if I want to insert df into a specific partition
// say pt1='2018',pt2='0601', the current API does not support it
// the only workaround is the following
df.createOrReplaceTempView(s"${ptTableName}_tmp_view")
sql(s"insert into table ${ptTableName} partition (pt1='2018', pt2='0601') select * from ${ptTableName}_tmp_view")

I propose to add another API to DataFrameWriter that can do something like:

df.write.insertInto(ptTableName, "pt1='2018',pt2='0601'")

We have a lot of this kind of scenario in our production environment; providing an API like this would make it much less painful. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org