[jira] [Commented] (SPARK-33275) ANSI mode: runtime errors instead of returning null

2020-12-16 Thread Leanken.Lin (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33275?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17250261#comment-17250261
 ] 

Leanken.Lin commented on SPARK-33275:
-

[~Chongguang] Yes, please go ahead.

> ANSI mode: runtime errors instead of returning null
> ---
>
> Key: SPARK-33275
> URL: https://issues.apache.org/jira/browse/SPARK-33275
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Wenchen Fan
>Priority: Major
>
> We should respect the ANSI mode in more places. What we have done so far is 
> mostly overflow checks in various operators. This ticket tracks a category 
> of ANSI mode behaviors: operators should throw runtime errors instead of 
> returning null when the input is invalid.
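
A minimal sketch of the behavior this ticket tracks (illustrative only, assuming Spark 3.x with the spark.sql.ansi.enabled flag):

{code:scala}
// With ANSI mode off (the default), an invalid cast returns NULL.
spark.conf.set("spark.sql.ansi.enabled", "false")
spark.sql("SELECT CAST('abc' AS INT)").show()   // -> null

// With ANSI mode on, the same invalid input should raise a runtime error
// (e.g. a NumberFormatException for this cast).
spark.conf.set("spark.sql.ansi.enabled", "true")
spark.sql("SELECT CAST('abc' AS INT)").show()   // -> throws at runtime
{code}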



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-33619) GetMapValueUtil code generation error

2020-12-01 Thread Leanken.Lin (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33619?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Leanken.Lin updated SPARK-33619:

Description: 
```

In SPARK-33460 there is a bug as follows: GetMapValueUtil generates code that 
does not compile when ANSI mode is on.

s"""throw new NoSuchElementException("Key " + $eval2 + " does not exist.");"""

should be 

s"""throw new java.util.NoSuchElementException("Key " + $eval2 + " does not 
exist.");"""

But why can't

checkExceptionInExpression[Exception](expr, errMsg)

and sql/testOnly org.apache.spark.sql.SQLQueryTestSuite -- -z ansi/map.sql

detect this bug?

It's because:

1. checkExceptionInExpression is somewhat broken, too: it should wrap its body 
with withSQLConf(SQLConf.CODEGEN_FACTORY_MODE.key, ...) the way checkEvaluation 
does, AND WE SHOULD FIX this later.

2. SQLQueryTestSuite ansi/map.sql failed to detect it because of the 
ConstantFolding rule, which calls eval instead of the generated code in this 
case.

```

  was:
```

In -SPARK-33460---

there is a bug as follow.

GetMapValueUtil generated code error when ANSI mode is on

 

s"""throw new NoSuchElementException("Key " + $eval2 + " does not exist.");"""

should be 

s"""throw new java.util.NoSuchElementException("Key " + $eval2 + " does not 
exist.");"""

 

But Why are

checkExceptionInExpression[Exception](expr, errMsg)

and sql/testOnly org.apache.spark.sql.SQLQueryTestSuite -- -z ansi/map.sql

can't detect this Bug

 

it's because 

1. checkExceptionInExpression is some what error, too. it should wrap with 

withSQLConf(SQLConf.CODEGEN_FACTORY_MODE.key like CheckEvalulation

2. SQLQueryTestSuite ansi/map.sql failed to detect because of the 
ConstantFolding rules, it is calling eval instead of code gen  in this case

 

```


> GetMapValueUtil code generation error
> -
>
> Key: SPARK-33619
> URL: https://issues.apache.org/jira/browse/SPARK-33619
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Leanken.Lin
>Priority: Major
>
> ```
> In SPARK-33460 there is a bug as follows: GetMapValueUtil generates code that 
> does not compile when ANSI mode is on.
> s"""throw new NoSuchElementException("Key " + $eval2 + " does not exist.");"""
> should be 
> s"""throw new java.util.NoSuchElementException("Key " + $eval2 + " does not 
> exist.");"""
> But why can't
> checkExceptionInExpression[Exception](expr, errMsg)
> and sql/testOnly org.apache.spark.sql.SQLQueryTestSuite -- -z ansi/map.sql
> detect this bug?
> It's because:
> 1. checkExceptionInExpression is somewhat broken, too: it should wrap its 
> body with withSQLConf(SQLConf.CODEGEN_FACTORY_MODE.key, ...) the way 
> checkEvaluation does, AND WE SHOULD FIX this later.
> 2. SQLQueryTestSuite ansi/map.sql failed to detect it because of the 
> ConstantFolding rule, which calls eval instead of the generated code in this 
> case.
> ```
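
A hypothetical reproduction sketch (assuming Spark 3.1 with ANSI mode on; the non-literal map key keeps ConstantFolding from evaluating the expression, so the codegen path is actually exercised):

{code:scala}
spark.conf.set("spark.sql.ansi.enabled", "true")

// map(id, 'a') is not foldable, so this goes through codegen rather than eval.
val df = spark.range(1).selectExpr("map(id, 'a') AS m")

// Key 5 is absent: ANSI mode should throw java.util.NoSuchElementException,
// but with the unqualified class name the generated Java source does not
// compile, since it has no import for java.util (Spark may silently fall
// back to interpreted mode, which is why the tests missed it).
df.selectExpr("m[5]").collect()
{code}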






[jira] [Updated] (SPARK-33619) GetMapValueUtil code generation error

2020-12-01 Thread Leanken.Lin (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33619?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Leanken.Lin updated SPARK-33619:

Description: 
```

In SPARK-33460 there is a bug as follows: GetMapValueUtil generates code that 
does not compile when ANSI mode is on.

s"""throw new NoSuchElementException("Key " + $eval2 + " does not exist.");"""

should be 

s"""throw new java.util.NoSuchElementException("Key " + $eval2 + " does not 
exist.");"""

But why can't

checkExceptionInExpression[Exception](expr, errMsg)

and sql/testOnly org.apache.spark.sql.SQLQueryTestSuite -- -z ansi/map.sql

detect this bug?

It's because:

1. checkExceptionInExpression is somewhat broken, too: it should wrap its body 
with withSQLConf(SQLConf.CODEGEN_FACTORY_MODE.key, ...) the way checkEvaluation 
does.

2. SQLQueryTestSuite ansi/map.sql failed to detect it because of the 
ConstantFolding rule, which calls eval instead of the generated code in this 
case.

```

  was:
```

 

```


> GetMapValueUtil code generation error
> -
>
> Key: SPARK-33619
> URL: https://issues.apache.org/jira/browse/SPARK-33619
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Leanken.Lin
>Priority: Major
>
> ```
> In SPARK-33460 there is a bug as follows: GetMapValueUtil generates code that 
> does not compile when ANSI mode is on.
> s"""throw new NoSuchElementException("Key " + $eval2 + " does not exist.");"""
> should be 
> s"""throw new java.util.NoSuchElementException("Key " + $eval2 + " does not 
> exist.");"""
> But why can't
> checkExceptionInExpression[Exception](expr, errMsg)
> and sql/testOnly org.apache.spark.sql.SQLQueryTestSuite -- -z ansi/map.sql
> detect this bug?
> It's because:
> 1. checkExceptionInExpression is somewhat broken, too: it should wrap its 
> body with withSQLConf(SQLConf.CODEGEN_FACTORY_MODE.key, ...) the way 
> checkEvaluation does.
> 2. SQLQueryTestSuite ansi/map.sql failed to detect it because of the 
> ConstantFolding rule, which calls eval instead of the generated code in this 
> case.
> ```






[jira] [Created] (SPARK-33619) GetMapValueUtil code generation error

2020-12-01 Thread Leanken.Lin (Jira)
Leanken.Lin created SPARK-33619:
---

 Summary: GetMapValueUtil code generation error
 Key: SPARK-33619
 URL: https://issues.apache.org/jira/browse/SPARK-33619
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.1.0
Reporter: Leanken.Lin


```

 

```






[jira] [Updated] (SPARK-33498) Datetime parsing should fail if the input string can't be parsed, or the pattern string is invalid

2020-11-20 Thread Leanken.Lin (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33498?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Leanken.Lin updated SPARK-33498:

Summary: Datetime parsing should fail if the input string can't be parsed, 
or the pattern string is invalid  (was: Datetime parsing should fail if the 
input string can't be parsed, or he pattern string is invalid)

> Datetime parsing should fail if the input string can't be parsed, or the 
> pattern string is invalid
> --
>
> Key: SPARK-33498
> URL: https://issues.apache.org/jira/browse/SPARK-33498
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Leanken.Lin
>Priority: Major
>
> Datetime parsing should fail if the input string can't be parsed, or the 
> pattern string is invalid, when ANSI mode is enabled. This patch should 
> update GetTimestamp, UnixTimestamp, ToUnixTimestamp and Cast.
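
A sketch of the intended behavior (illustrative; it assumes to_timestamp and unix_timestamp are backed by the expressions listed above):

{code:scala}
spark.conf.set("spark.sql.ansi.enabled", "true")

// Unparsable input: should throw at runtime instead of returning NULL.
spark.sql("SELECT to_timestamp('not-a-date', 'yyyy-MM-dd')").show()

// Invalid pattern ('f' is not a valid pattern letter): should also fail
// instead of returning NULL.
spark.sql("SELECT unix_timestamp('2020-01-01', 'yyyy-ff')").show()
{code}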






[jira] [Updated] (SPARK-33498) Datetime parsing should fail if the input string can't be parsed, or the pattern string is invalid

2020-11-20 Thread Leanken.Lin (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33498?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Leanken.Lin updated SPARK-33498:

Description: Datetime parsing should fail if the input string can't be 
parsed, or the pattern string is invalid, when ANSI mode is enabled. This patch 
should update GetTimestamp, UnixTimestamp, ToUnixTimestamp and Cast.  (was: 
Datetime parsing should fail if the input string can't be parsed, or he pattern 
string is invalid, when ANSI mode is enable. This patch should update 
GetTimeStamp, UnixTimeStamp, ToUnixTimeStamp and Cast)

> Datetime parsing should fail if the input string can't be parsed, or the 
> pattern string is invalid
> --
>
> Key: SPARK-33498
> URL: https://issues.apache.org/jira/browse/SPARK-33498
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Leanken.Lin
>Priority: Major
>
> Datetime parsing should fail if the input string can't be parsed, or the 
> pattern string is invalid, when ANSI mode is enabled. This patch should 
> update GetTimestamp, UnixTimestamp, ToUnixTimestamp and Cast.






[jira] [Created] (SPARK-33498) Datetime parsing should fail if the input string can't be parsed, or the pattern string is invalid

2020-11-19 Thread Leanken.Lin (Jira)
Leanken.Lin created SPARK-33498:
---

 Summary: Datetime parsing should fail if the input string can't be 
parsed, or the pattern string is invalid
 Key: SPARK-33498
 URL: https://issues.apache.org/jira/browse/SPARK-33498
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 3.1.0
Reporter: Leanken.Lin


Datetime parsing should fail if the input string can't be parsed, or the 
pattern string is invalid, when ANSI mode is enabled. This patch should update 
GetTimestamp, UnixTimestamp, ToUnixTimestamp and Cast.






[jira] [Updated] (SPARK-33386) Accessing array elements should fail if the index is out of bounds.

2020-11-15 Thread Leanken.Lin (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33386?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Leanken.Lin updated SPARK-33386:

Description: When ANSI mode is enabled, accessing an array element with an 
out-of-bound index should fail with an exception, but currently it returns 
null.  (was: When ansi mode enabled, accessing array element should failed with 
exception, but currently it's returning null.)

> Accessing array elements should fail if the index is out of bounds.
> 
>
> Key: SPARK-33386
> URL: https://issues.apache.org/jira/browse/SPARK-33386
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Leanken.Lin
>Assignee: Leanken.Lin
>Priority: Major
> Fix For: 3.1.0
>
>
> When ANSI mode is enabled, accessing an array element with an out-of-bound 
> index should fail with an exception, but currently it returns null.
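
A minimal sketch of the expected change (illustrative):

{code:scala}
spark.conf.set("spark.sql.ansi.enabled", "true")

// Index 5 is out of bounds for a three-element array: with ANSI mode on
// this should fail at runtime instead of returning NULL.
spark.sql("SELECT array(1, 2, 3)[5]").show()
{code}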






[jira] [Updated] (SPARK-33460) Accessing map values should fail if key is not found.

2020-11-15 Thread Leanken.Lin (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33460?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Leanken.Lin updated SPARK-33460:

Description: When ANSI mode is enabled, accessing map values should fail with 
an exception if the key does not exist, but currently it returns null.

> Accessing map values should fail if key is not found.
> -
>
> Key: SPARK-33460
> URL: https://issues.apache.org/jira/browse/SPARK-33460
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Leanken.Lin
>Priority: Major
>
> When ANSI mode is enabled, accessing map values should fail with an exception 
> if the key does not exist, but currently it returns null.
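
A minimal sketch of the expected change (illustrative):

{code:scala}
spark.conf.set("spark.sql.ansi.enabled", "true")

// Key 3 is absent from the map: with ANSI mode on this should throw
// (e.g. java.util.NoSuchElementException) instead of returning NULL.
spark.sql("SELECT map(1, 'a', 2, 'b')[3]").show()
{code}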






[jira] [Created] (SPARK-33460) Accessing map values should fail if key is not found.

2020-11-15 Thread Leanken.Lin (Jira)
Leanken.Lin created SPARK-33460:
---

 Summary: Accessing map values should fail if key is not found.
 Key: SPARK-33460
 URL: https://issues.apache.org/jira/browse/SPARK-33460
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 3.1.0
Reporter: Leanken.Lin









[jira] [Updated] (SPARK-33391) element_at with CreateArray does not respect one-based index

2020-11-08 Thread Leanken.Lin (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33391?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Leanken.Lin updated SPARK-33391:

Description: 
var df = spark.sql("select element_at(array(3, 2, 1), 0)")
df.printSchema()

df = spark.sql("select element_at(array(3, 2, 1), 1)")
df.printSchema()

df = spark.sql("select element_at(array(3, 2, 1), 2)")
df.printSchema()

df = spark.sql("select element_at(array(3, 2, 1), 3)")
df.printSchema()

root
 |-- element_at(array(3, 2, 1), 0): integer (nullable = false)

root
 |-- element_at(array(3, 2, 1), 1): integer (nullable = false)

root
 |-- element_at(array(3, 2, 1), 2): integer (nullable = false)

root
 |-- element_at(array(3, 2, 1), 3): integer (nullable = true)

 

In this case, the nullable property of element_at with a CreateArray input is 
not correct: element_at uses one-based indexing, so index 3 is always valid for 
a three-element array while index 0 never is, yet the schemas above mark index 
0 as non-nullable and index 3 as nullable, which suggests the bound check 
treats the index as zero-based.

 

  was:TODO


> element_at with CreateArray does not respect one-based index
> ---
>
> Key: SPARK-33391
> URL: https://issues.apache.org/jira/browse/SPARK-33391
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Leanken.Lin
>Priority: Major
>
> var df = spark.sql("select element_at(array(3, 2, 1), 0)")
> df.printSchema()
> df = spark.sql("select element_at(array(3, 2, 1), 1)")
> df.printSchema()
> df = spark.sql("select element_at(array(3, 2, 1), 2)")
> df.printSchema()
> df = spark.sql("select element_at(array(3, 2, 1), 3)")
> df.printSchema()
> root
>  |-- element_at(array(3, 2, 1), 0): integer (nullable = false)
> root
>  |-- element_at(array(3, 2, 1), 1): integer (nullable = false)
> root
>  |-- element_at(array(3, 2, 1), 2): integer (nullable = false)
> root
>  |-- element_at(array(3, 2, 1), 3): integer (nullable = true)
>  
> In this case, the nullable property of element_at with a CreateArray input 
> is not correct: element_at uses one-based indexing, so index 3 is always 
> valid for a three-element array while index 0 never is, yet the schemas above 
> mark index 0 as non-nullable and index 3 as nullable, which suggests the 
> bound check treats the index as zero-based.
>  
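
A sketch of the expected nullability once fixed (an assumption based on one-based indexing, not the committed behavior):

{code:scala}
// Index 3 always exists in a three-element literal array, so the result
// should be non-nullable; index 0 is never valid under one-based indexing.
val inRange = spark.sql("SELECT element_at(array(3, 2, 1), 3)")
println(inRange.schema.head.nullable)    // expected: false

val outOfRange = spark.sql("SELECT element_at(array(3, 2, 1), 0)")
println(outOfRange.schema.head.nullable) // expected: true (or a runtime error)
{code}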






[jira] [Updated] (SPARK-33391) element_at with CreateArray does not respect one-based index

2020-11-08 Thread Leanken.Lin (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33391?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Leanken.Lin updated SPARK-33391:

Summary: element_at with CreateArray does not respect one-based index  (was: 
element_at not respect one based index)

> element_at with CreateArray does not respect one-based index
> ---
>
> Key: SPARK-33391
> URL: https://issues.apache.org/jira/browse/SPARK-33391
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Leanken.Lin
>Priority: Major
>
> TODO






[jira] [Created] (SPARK-33391) element_at does not respect one-based index

2020-11-08 Thread Leanken.Lin (Jira)
Leanken.Lin created SPARK-33391:
---

 Summary: element_at does not respect one-based index
 Key: SPARK-33391
 URL: https://issues.apache.org/jira/browse/SPARK-33391
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.1.0
Reporter: Leanken.Lin


TODO






[jira] [Updated] (SPARK-33386) Accessing array elements should fail if the index is out of bounds.

2020-11-08 Thread Leanken.Lin (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33386?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Leanken.Lin updated SPARK-33386:

Description: When ANSI mode is enabled, accessing an array element should fail 
with an exception, but currently it returns null.  (was: TODO)

> Accessing array elements should fail if the index is out of bounds.
> 
>
> Key: SPARK-33386
> URL: https://issues.apache.org/jira/browse/SPARK-33386
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Leanken.Lin
>Priority: Major
>
> When ANSI mode is enabled, accessing an array element should fail with an 
> exception, but currently it returns null.






[jira] [Created] (SPARK-33386) Accessing array elements should fail if the index is out of bounds.

2020-11-08 Thread Leanken.Lin (Jira)
Leanken.Lin created SPARK-33386:
---

 Summary: Accessing array elements should fail if the index is out of 
bounds.
 Key: SPARK-33386
 URL: https://issues.apache.org/jira/browse/SPARK-33386
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 3.1.0
Reporter: Leanken.Lin


TODO






[jira] [Updated] (SPARK-33140) make Analyzer rules use SQLConf.get

2020-10-14 Thread Leanken.Lin (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33140?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Leanken.Lin updated SPARK-33140:

Summary: make Analyzer rules use SQLConf.get  (was: make Analyzer and 
Optimizer rules using SQLConf.get)

> make Analyzer rules use SQLConf.get
> -
>
> Key: SPARK-33140
> URL: https://issues.apache.org/jira/browse/SPARK-33140
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Leanken.Lin
>Priority: Major
>
> TODO






[jira] [Updated] (SPARK-33139) protect setActiveSession and clearActiveSession

2020-10-14 Thread Leanken.Lin (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33139?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Leanken.Lin updated SPARK-33139:

Description: 
This PR is a sub-task of 
[SPARK-33138](https://issues.apache.org/jira/browse/SPARK-33138). In order to 
make SQLConf.get reliable and stable, we need to make sure user can't pollute 
the SQLConf and SparkSession Context via calling setActiveSession and 
clearActiveSession.

Change of the PR:

* add legacy config spark.sql.legacy.allowModifyActiveSession to fallback to 
old behavior if user do need to call these two API.
* by default, if user call these two API, it will throw exception
* add extra two internal and private API setActiveSessionInternal and 
clearActiveSessionInternal for current internal usage
* change all internal reference to new internal API exception for 
SQLContext.setActive and SQLContext.clearActive

  was:TODO


> protect setActiveSession and clearActiveSession
> ---
>
> Key: SPARK-33139
> URL: https://issues.apache.org/jira/browse/SPARK-33139
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Leanken.Lin
>Priority: Major
>
> This PR is a sub-task of 
> [SPARK-33138](https://issues.apache.org/jira/browse/SPARK-33138). In order to 
> make SQLConf.get reliable and stable, we need to make sure users can't 
> pollute the SQLConf and SparkSession context by calling setActiveSession and 
> clearActiveSession.
> Changes in the PR:
> * add a legacy config spark.sql.legacy.allowModifyActiveSession to fall back 
> to the old behavior if users really need to call these two APIs
> * by default, calling these two APIs throws an exception
> * add two extra internal and private APIs, setActiveSessionInternal and 
> clearActiveSessionInternal, for current internal usage
> * change all internal references to the new internal APIs, except for 
> SQLContext.setActive and SQLContext.clearActive
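
A sketch of the intended behavior (the config name comes from the ticket; the exact exception type is an assumption):

{code:scala}
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").getOrCreate()

// By default, user code can no longer swap the active session:
SparkSession.setActiveSession(spark)   // expected to throw an exception

// Users who really need the old behavior can fall back with
//   spark.sql.legacy.allowModifyActiveSession=true
// while Spark internals use the new private setActiveSessionInternal.
{code}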






[jira] [Commented] (SPARK-33138) unify temp view and permanent view behaviors

2020-10-14 Thread Leanken.Lin (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-33138?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17213824#comment-17213824
 ] 

Leanken.Lin commented on SPARK-33138:
-

Thanks for your consideration, but my colleague and I are already working on 
the sub-tasks.

> unify temp view and permanent view behaviors
> 
>
> Key: SPARK-33138
> URL: https://issues.apache.org/jira/browse/SPARK-33138
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Leanken.Lin
>Priority: Major
>
> Currently, a temp view stores a mapping from the temp view name to its 
> logical plan, while a permanent view stored in the HMS keeps its original 
> SQL text.
> So for a permanent view, each time it is referenced, its SQL text is parsed, 
> analyzed, optimized and planned again with the current SQLConf and 
> SparkSession context, so its behavior may keep changing when the SQLConf and 
> context differ each time.
> In order to unify the behaviors of temp views and permanent views, we propose 
> to keep the original SQL text for both temp and permanent views, and to also 
> record the SQLConf at view creation time. Each time the view is referenced, 
> we use the snapshot SQLConf to parse/analyze/optimize/plan the SQL text; this 
> way, we can make sure the output of the created view is stable.
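
An illustrative sketch of the stability problem being addressed (not actual API behavior after the change):

{code:scala}
// A permanent view only stores its SQL text today, so its semantics can
// drift with the session's conf:
spark.sql("SET spark.sql.ansi.enabled=true")
spark.sql("CREATE VIEW v AS SELECT CAST('abc' AS INT) AS c")

// A later session with a different conf re-plans the stored text under
// that conf, so the same view behaves differently:
spark.sql("SET spark.sql.ansi.enabled=false")
spark.sql("SELECT * FROM v")  // with the captured-conf proposal, this would
                              // still honor the conf captured at creation
{code}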






[jira] [Created] (SPARK-33142) SQL temp view should store SQL text as well

2020-10-13 Thread Leanken.Lin (Jira)
Leanken.Lin created SPARK-33142:
---

 Summary: SQL temp view should store SQL text as well
 Key: SPARK-33142
 URL: https://issues.apache.org/jira/browse/SPARK-33142
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 3.1.0
Reporter: Leanken.Lin


TODO






[jira] [Created] (SPARK-33141) capture SQL configs when creating permanent views

2020-10-13 Thread Leanken.Lin (Jira)
Leanken.Lin created SPARK-33141:
---

 Summary: capture SQL configs when creating permanent views
 Key: SPARK-33141
 URL: https://issues.apache.org/jira/browse/SPARK-33141
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 3.1.0
Reporter: Leanken.Lin


TODO






[jira] [Created] (SPARK-33140) make Analyzer and Optimizer rules use SQLConf.get

2020-10-13 Thread Leanken.Lin (Jira)
Leanken.Lin created SPARK-33140:
---

 Summary: make Analyzer and Optimizer rules use SQLConf.get
 Key: SPARK-33140
 URL: https://issues.apache.org/jira/browse/SPARK-33140
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 3.1.0
Reporter: Leanken.Lin


TODO






[jira] [Created] (SPARK-33139) protect setActiveSession and clearActiveSession

2020-10-13 Thread Leanken.Lin (Jira)
Leanken.Lin created SPARK-33139:
---

 Summary: protect setActiveSession and clearActiveSession
 Key: SPARK-33139
 URL: https://issues.apache.org/jira/browse/SPARK-33139
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Affects Versions: 3.1.0
Reporter: Leanken.Lin


TODO






[jira] [Updated] (SPARK-33138) unify temp view and permanent view behaviors

2020-10-13 Thread Leanken.Lin (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33138?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Leanken.Lin updated SPARK-33138:

Description: 
Currently, a temp view stores a mapping from the temp view name to its logical 
plan, while a permanent view stored in the HMS keeps its original SQL text.

So for a permanent view, each time it is referenced, its SQL text is parsed, 
analyzed, optimized and planned again with the current SQLConf and SparkSession 
context, so its behavior may keep changing when the SQLConf and context differ 
each time.

In order to unify the behaviors of temp views and permanent views, we propose 
to keep the original SQL text for both temp and permanent views, and to also 
record the SQLConf at view creation time. Each time the view is referenced, we 
use the snapshot SQLConf to parse/analyze/optimize/plan the SQL text; this way, 
we can make sure the output of the created view is stable.

  was:Currently, temp view store mapping of temp view name and its logicalPlan, 
and permanent view store in HMS stores its origin SQL text. So for permanent 
view, when try to refer the permanent view, its SQL text will be 
parse-analyze-optimize-plan again with current SQLConf and SparkSession 
context, so it might keep changing the SQLConf and context is different each 
time. So, in order the unify the behaviors of temp view and permanent view, 
propose that we keep its origin SQLText for both temp and permanent view, and 
also keep record of the SQLConf when the view was created. Each time we try to 
refer the view, we using the Snapshot SQLConf to parse-analyze-optimize-plan 
the SQLText, in this way, we can make sure the output of the created view to be 
stable.


> unify temp view and permanent view behaviors
> 
>
> Key: SPARK-33138
> URL: https://issues.apache.org/jira/browse/SPARK-33138
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Leanken.Lin
>Priority: Major
>
> Currently, a temp view stores a mapping from the temp view name to its 
> logical plan, while a permanent view stored in the HMS keeps its original 
> SQL text.
> So for a permanent view, each time it is referenced, its SQL text is parsed, 
> analyzed, optimized and planned again with the current SQLConf and 
> SparkSession context, so its behavior may keep changing when the SQLConf and 
> context differ each time.
> In order to unify the behaviors of temp views and permanent views, we propose 
> to keep the original SQL text for both temp and permanent views, and to also 
> record the SQLConf at view creation time. Each time the view is referenced, 
> we use the snapshot SQLConf to parse/analyze/optimize/plan the SQL text; this 
> way, we can make sure the output of the created view is stable.






[jira] [Created] (SPARK-33138) unify temp view and permanent view behaviors

2020-10-13 Thread Leanken.Lin (Jira)
Leanken.Lin created SPARK-33138:
---

 Summary: unify temp view and permanent view behaviors
 Key: SPARK-33138
 URL: https://issues.apache.org/jira/browse/SPARK-33138
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.1.0
 Environment: Currently, a temp view stores a mapping from the temp view 
name to its logical plan, while a permanent view stored in the HMS keeps its 
original SQL text. So for a permanent view, each time it is referenced, its 
SQL text is parsed, analyzed, optimized and planned again with the current 
SQLConf and SparkSession context, so its behavior may keep changing when the 
SQLConf and context differ each time. So, in order to unify the behaviors of 
temp views and permanent views, we propose to keep the original SQL text for 
both temp and permanent views, and to also record the SQLConf at view creation 
time. Each time the view is referenced, we use the snapshot SQLConf to 
parse/analyze/optimize/plan the SQL text; this way, we can make sure the 
output of the created view is stable.
Reporter: Leanken.Lin









[jira] [Updated] (SPARK-33138) unify temp view and permanent view behaviors

2020-10-13 Thread Leanken.Lin (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33138?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Leanken.Lin updated SPARK-33138:

Description: Currently, a temp view stores a mapping from the temp view name 
to its logical plan, while a permanent view stored in the HMS keeps its 
original SQL text. So for a permanent view, each time it is referenced, its 
SQL text is parsed, analyzed, optimized and planned again with the current 
SQLConf and SparkSession context, so its behavior may keep changing when the 
SQLConf and context differ each time. So, in order to unify the behaviors of 
temp views and permanent views, we propose to keep the original SQL text for 
both temp and permanent views, and to also record the SQLConf at view creation 
time. Each time the view is referenced, we use the snapshot SQLConf to 
parse/analyze/optimize/plan the SQL text; this way, we can make sure the 
output of the created view is stable.
Environment: (was: Currently, temp view store mapping of temp view name 
and its logicalPlan, and permanent view store in HMS stores its origin SQL 
text. So for permanent view, when try to refer the permanent view, its SQL text 
will be parse-analyze-optimize-plan again with current SQLConf and SparkSession 
context, so it might keep changing the SQLConf and context is different each 
time. So, in order the unify the behaviors of temp view and permanent view, 
propose that we keep its origin SQLText for both temp and permanent view, and 
also keep record of the SQLConf when the view was created. Each time we try to 
refer the view, we using the Snapshot SQLConf to parse-analyze-optimize-plan 
the SQLText, in this way, we can make sure the output of the created view to be 
stable.)

> unify temp view and permanent view behaviors
> 
>
> Key: SPARK-33138
> URL: https://issues.apache.org/jira/browse/SPARK-33138
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Leanken.Lin
>Priority: Major
>
> Currently, a temp view stores a mapping from the temp view name to its 
> logical plan, while a permanent view stored in the HMS keeps its original 
> SQL text. So for a permanent view, each time it is referenced, its SQL text 
> is parsed, analyzed, optimized and planned again with the current SQLConf 
> and SparkSession context, so its behavior may keep changing when the SQLConf 
> and context differ each time. So, in order to unify the behaviors of temp 
> views and permanent views, we propose to keep the original SQL text for both 
> temp and permanent views, and to also record the SQLConf at view creation 
> time. Each time the view is referenced, we use the snapshot SQLConf to 
> parse/analyze/optimize/plan the SQL text; this way, we can make sure the 
> output of the created view is stable.






[jira] [Updated] (SPARK-33016) Potential SQLMetrics missed which might cause WEB UI display issue while AQE is on.

2020-09-28 Thread Leanken.Lin (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33016?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Leanken.Lin updated SPARK-33016:

Affects Version/s: 3.0.0  (was: 3.0.1)

> Potential SQLMetrics missed which might cause WEB UI display issue while AQE 
> is on.
> ---
>
> Key: SPARK-33016
> URL: https://issues.apache.org/jira/browse/SPARK-33016
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Leanken.Lin
>Priority: Minor
>
> In the current AQE execution, the following scenario might cause SQLMetrics 
> to be incorrectly overridden.
>  # Stage A and B are created, and UI updated thru event 
> onAdaptiveExecutionUpdate.
>  # Stage A and B are running. Subquery in stage A keep updating metrics thru 
> event onAdaptiveSQLMetricUpdate.
>  # Stage B completes, while stage A's subquery is still running, updating 
> metrics.
>  # Completion of stage B triggers new stage creation and UI update thru event 
> onAdaptiveExecutionUpdate again (just like step 1).
>  
> But it's very hard to reproduce this issue, since it only happens under high 
> concurrency. For the fix, I suggest that we keep all duplicated metrics 
> instead of overwriting them every time.
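
An illustrative sketch of the suggested fix (hypothetical names; the real logic lives in SQLAppStatusListener):

{code:scala}
import scala.collection.mutable

// Keep every accumulator-id -> metric-type pair ever reported, instead of
// replacing the mapping on each onAdaptiveExecutionUpdate, so metric
// updates arriving from an already-replaced sub-plan still resolve.
val metricTypes = mutable.Map.empty[Long, String]

def onPlanUpdate(planMetrics: Seq[(Long, String)]): Unit =
  planMetrics.foreach { case (accumId, metricType) =>
    metricTypes.getOrElseUpdate(accumId, metricType)
  }
{code}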






[jira] [Created] (SPARK-33016) Potential SQLMetrics missed which might cause WEB UI display issue while AQE is on.

2020-09-28 Thread Leanken.Lin (Jira)
Leanken.Lin created SPARK-33016:
---

 Summary: Potential SQLMetrics missed which might cause WEB UI 
display issue while AQE is on.
 Key: SPARK-33016
 URL: https://issues.apache.org/jira/browse/SPARK-33016
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.0.1
Reporter: Leanken.Lin


In the current AQE execution, the following scenario might cause SQLMetrics to 
be incorrectly overridden.
 # Stage A and B are created, and UI updated thru event 
onAdaptiveExecutionUpdate.
 # Stage A and B are running. Subquery in stage A keep updating metrics thru 
event onAdaptiveSQLMetricUpdate.
 # Stage B completes, while stage A's subquery is still running, updating 
metrics.
 # Completion of stage B triggers new stage creation and UI update thru event 
onAdaptiveExecutionUpdate again (just like step 1).

 

But it's very hard to reproduce this issue, since it only happens under high 
concurrency. For the fix, I suggest that we keep all duplicated metrics instead 
of overwriting them every time.






[jira] [Resolved] (SPARK-32765) EliminateJoinToEmptyRelation should respect exchange behavior when canChangeNumPartitions == false

2020-09-13 Thread Leanken.Lin (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32765?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Leanken.Lin resolved SPARK-32765.
-
Resolution: Not A Bug

> EliminateJoinToEmptyRelation should respect exchange behavior when 
> canChangeNumPartitions == false
> --
>
> Key: SPARK-32765
> URL: https://issues.apache.org/jira/browse/SPARK-32765
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Leanken.Lin
>Priority: Major
>
> Currently, the EliminateJoinToEmptyRelation rule converts a Join into an 
> EmptyRelation in some cases with AQE on. But if either sub-plan of the Join 
> contains a ShuffleQueryStage(canChangeNumPartitions == false), which means 
> the Exchange was produced by repartition or a single-partition requirement, 
> then converting it into an EmptyRelation loses the user-specified 
> partition-number information for downstream operators, which is not right. 
> So in the patch, we try not to do the conversion if either sub-plan of the 
> Join contains a ShuffleQueryStage(canChangeNumPartitions == false).
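
An illustrative sketch of the proposed guard, with stand-in types (names approximate; the real rule operates on AQE query stages):

{code:scala}
// Stand-ins for the relevant AQE concepts:
final case class ShuffleStage(canChangeNumPartitions: Boolean)
final case class SubPlan(shuffleStages: Seq[ShuffleStage])

// Only rewrite the join to an EmptyRelation when no participating shuffle
// was produced by repartition()/single-partition, i.e. every stage may
// still change its partition number.
def canConvertToEmptyRelation(left: SubPlan, right: SubPlan): Boolean =
  (left.shuffleStages ++ right.shuffleStages).forall(_.canChangeNumPartitions)
{code}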






[jira] [Updated] (SPARK-32765) EliminateJoinToEmptyRelation should respect exchange behavior when canChangeNumPartitions == false

2020-09-01 Thread Leanken.Lin (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32765?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Leanken.Lin updated SPARK-32765:

Environment: (was: Currently, EliminateJoinToEmptyRelation Rule will 
convert Join into EmptyRelation in some cases with AQE on. But if either sub 
plan of Join contains a ShuffleQueryStage(canChangeNumPartitions == false), 
which means the Exchange produced by repartition Or singlePartition, in this 
case, if we were to convert it into an EmptyRelation, it will lost user 
specified number partition information for downstream operator, it's not right. 

So in the Patch, try not to do the conversion if either sub plan of Join 
contains ShuffleQueryStage(canChangeNumPartitions == false))

> EliminateJoinToEmptyRelation should respect exchange behavior when 
> canChangeNumPartitions == false
> --
>
> Key: SPARK-32765
> URL: https://issues.apache.org/jira/browse/SPARK-32765
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Leanken.Lin
>Priority: Major
>
> Currently, the EliminateJoinToEmptyRelation rule converts a Join into an 
> EmptyRelation in some cases with AQE on. But if either sub-plan of the Join 
> contains a ShuffleQueryStage(canChangeNumPartitions == false), which means 
> the Exchange was produced by repartition or a single-partition requirement, 
> then converting it into an EmptyRelation loses the user-specified 
> partition-number information for downstream operators, which is not right. 
> So in the patch, we try not to do the conversion if either sub-plan of the 
> Join contains a ShuffleQueryStage(canChangeNumPartitions == false).






[jira] [Updated] (SPARK-32765) EliminateJoinToEmptyRelation should respect exchange behavior when canChangeNumPartitions == false

2020-09-01 Thread Leanken.Lin (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32765?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Leanken.Lin updated SPARK-32765:

Description: 
Currently, the EliminateJoinToEmptyRelation rule converts a Join into an 
EmptyRelation in some cases with AQE on. But if either sub-plan of the Join 
contains a ShuffleQueryStage(canChangeNumPartitions == false), which means the 
Exchange was produced by repartition or a single-partition requirement, then 
converting it into an EmptyRelation loses the user-specified partition-number 
information for downstream operators, which is not right. 

So in the patch, we try not to do the conversion if either sub-plan of the 
Join contains a ShuffleQueryStage(canChangeNumPartitions == false).

> EliminateJoinToEmptyRelation should respect exchange behavior when 
> canChangeNumPartitions == false
> --
>
> Key: SPARK-32765
> URL: https://issues.apache.org/jira/browse/SPARK-32765
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
> Environment: Currently, the EliminateJoinToEmptyRelation rule converts 
> a Join into an EmptyRelation in some cases with AQE on. But if either 
> sub-plan of the Join contains a ShuffleQueryStage(canChangeNumPartitions == 
> false), which means the Exchange was produced by repartition or a 
> single-partition requirement, then converting it into an EmptyRelation loses 
> the user-specified partition-number information for downstream operators, 
> which is not right. 
> So in the patch, we try not to do the conversion if either sub-plan of the 
> Join contains a ShuffleQueryStage(canChangeNumPartitions == false).
>Reporter: Leanken.Lin
>Priority: Major
>
> Currently, the EliminateJoinToEmptyRelation rule converts a Join into an 
> EmptyRelation in some cases with AQE on. But if either sub-plan of the Join 
> contains a ShuffleQueryStage(canChangeNumPartitions == false), which means 
> the Exchange was produced by repartition or a single-partition requirement, 
> then converting it into an EmptyRelation loses the user-specified 
> partition-number information for downstream operators, which is not right. 
> So in the patch, we try not to do the conversion if either sub-plan of the 
> Join contains a ShuffleQueryStage(canChangeNumPartitions == false).






[jira] [Created] (SPARK-32765) EliminateJoinToEmptyRelation should respect exchange behavior when canChangeNumPartitions == false

2020-09-01 Thread Leanken.Lin (Jira)
Leanken.Lin created SPARK-32765:
---

 Summary: EliminateJoinToEmptyRelation should respect exchange 
behavior when canChangeNumPartitions == false
 Key: SPARK-32765
 URL: https://issues.apache.org/jira/browse/SPARK-32765
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.0.0
 Environment: Currently, the EliminateJoinToEmptyRelation rule converts 
a Join into an EmptyRelation in some cases with AQE on. But if either sub-plan 
of the Join contains a ShuffleQueryStage(canChangeNumPartitions == false), 
which means the Exchange was produced by repartition or a single-partition 
requirement, then converting it into an EmptyRelation loses the user-specified 
partition-number information for downstream operators, which is not right. 

So in the patch, we try not to do the conversion if either sub-plan of the 
Join contains a ShuffleQueryStage(canChangeNumPartitions == false).
Reporter: Leanken.Lin









[jira] [Created] (SPARK-32705) EmptyHashedRelation$; no valid constructor

2020-08-26 Thread Leanken.Lin (Jira)
Leanken.Lin created SPARK-32705:
---

 Summary: EmptyHashedRelation$; no valid constructor
 Key: SPARK-32705
 URL: https://issues.apache.org/jira/browse/SPARK-32705
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.0.0
Reporter: Leanken.Lin


Currently, EmptyHashedRelation is an object, and it causes a Java 
deserialization exception as follows:

 
{code:java}
// Stack
20/08/26 11:13:30 WARN [task-result-getter-2] TaskSetManager: Lost task 34.0 in 
stage 57.0 (TID 18076, emr-worker-5.cluster-183257, executor 18): java
.io.InvalidClassException: 
org.apache.spark.sql.execution.joins.EmptyHashedRelation$; no valid constructor
at 
java.io.ObjectStreamClass$ExceptionInfo.newInvalidClassException(ObjectStreamClass.java:169)
at 
java.io.ObjectStreamClass.checkDeserialize(ObjectStreamClass.java:874)
at 
java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2042)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1572)
at java.io.ObjectInputStream.readObject(ObjectInputStream.java:430)
at 
org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:76)
at 
org.apache.spark.broadcast.TorrentBroadcast$.$anonfun$unBlockifyObject$4(TorrentBroadcast.scala:328)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1377)
at 
org.apache.spark.broadcast.TorrentBroadcast$.unBlockifyObject(TorrentBroadcast.scala:330)
at 
org.apache.spark.broadcast.TorrentBroadcast.$anonfun$readBroadcastBlock$4(TorrentBroadcast.scala:249)
at scala.Option.getOrElse(Option.scala:189)
at 
org.apache.spark.broadcast.TorrentBroadcast.$anonfun$readBroadcastBlock$2(TorrentBroadcast.scala:223)
at org.apache.spark.util.KeyLock.withLock(KeyLock.scala:64)
at 
org.apache.spark.broadcast.TorrentBroadcast.$anonfun$readBroadcastBlock$1(TorrentBroadcast.scala:218)
at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1343)

{code}
Use a case object instead to fix the serialization issue, and also change 
EmptyHashedRelation not to extend NullAwareHashedRelation, since it's already 
being used in other non-NAAJ joins.
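
A minimal, self-contained illustration of the underlying Java serialization rule (not the actual Spark class hierarchy): deserialization requires the first non-serializable superclass to have a no-arg constructor, otherwise it fails with "no valid constructor".

{code:scala}
import java.io._

class Base(val n: Int)                            // not Serializable, no no-arg ctor
object Broken extends Base(0) with Serializable   // serializes fine...

def roundTrip(obj: AnyRef): AnyRef = {
  val bos = new ByteArrayOutputStream()
  new ObjectOutputStream(bos).writeObject(obj)
  new ObjectInputStream(new ByteArrayInputStream(bos.toByteArray)).readObject()
}

// roundTrip(Broken) throws
// java.io.InvalidClassException: ... no valid constructor
{code}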

 






[jira] [Created] (SPARK-32678) Rename EmptyHashedRelationWithAllNullKeys and simplify NAAJ generated code

2020-08-21 Thread Leanken.Lin (Jira)
Leanken.Lin created SPARK-32678:
---

 Summary: Rename EmptyHashedRelationWithAllNullKeys and simplify 
NAAJ generated code
 Key: SPARK-32678
 URL: https://issues.apache.org/jira/browse/SPARK-32678
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.0.0
Reporter: Leanken.Lin


The name EmptyHashedRelationWithAllNullKeys is a bit confusing, and this minor 
change also simplifies the generated code for BHJ NAAJ.






[jira] [Commented] (SPARK-32649) Optimize BHJ/SHJ inner and semi join with empty hashed relation

2020-08-18 Thread Leanken.Lin (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32649?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17179397#comment-17179397
 ] 

Leanken.Lin commented on SPARK-32649:
-

Feel free to just send out PR for reviewing,^_^

> Optimize BHJ/SHJ inner and semi join with empty hashed relation
> ---
>
> Key: SPARK-32649
> URL: https://issues.apache.org/jira/browse/SPARK-32649
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Cheng Su
>Priority: Trivial
>
> With `EmptyHashedRelation` introduced in 
> [https://github.com/apache/spark/pull/29389], it occurred to me that there's 
> a minor optimization we can apply to broadcast hash join and shuffled hash 
> join if the build-side hashed relation is empty.
> If the build-side hashed relation is empty (i.e. the build side is empty):
> 1. inner join: we don't need to execute the stream side at all, just return 
> an empty iterator - 
> [https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/joins/HashJoin.scala#L152]
> 2. semi join: we don't need to execute the stream side at all, just return 
> an empty iterator - 
> [https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/joins/HashJoin.scala#L227]
>  .
> It is not common for the build side to be empty, but in case it is, we can 
> leverage it to not execute the stream side at all for better query CPU/IO 
> performance.
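
A sketch of the short-circuit (illustrative only; the real code paths are the HashJoin.scala lines linked above):

{code:scala}
// For inner and left-semi joins, an empty build side means no stream row
// can ever find a match, so the stream side need not be executed at all.
def innerOrSemiJoin[Row](
    buildSideEmpty: Boolean,
    streamIter: => Iterator[Row],  // by-name: untouched when build side is empty
    doJoin: Iterator[Row] => Iterator[Row]): Iterator[Row] =
  if (buildSideEmpty) Iterator.empty else doJoin(streamIter)
{code}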






[jira] [Comment Edited] (SPARK-32649) Optimize BHJ/SHJ inner and semi join with empty hashed relation

2020-08-18 Thread Leanken.Lin (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-32649?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17179397#comment-17179397
 ] 

Leanken.Lin edited comment on SPARK-32649 at 8/18/20, 6:37 AM:
---

Feel free to just send out a PR for review.


was (Author: leanken):
Feel free to just send out PR for reviewing,^_^

> Optimize BHJ/SHJ inner and semi join with empty hashed relation
> ---
>
> Key: SPARK-32649
> URL: https://issues.apache.org/jira/browse/SPARK-32649
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.1.0
>Reporter: Cheng Su
>Priority: Trivial
>
> With `EmptyHashedRelation` introduced in 
> [https://github.com/apache/spark/pull/29389], it inspired me that there's a 
> minor optimization we can apply to broadcast hash join and shuffled hash join 
> if build side hashed relation is empty.
> If build side hashed relation is empty (i.e. build side is empty)
> 1.inner join: we don't need to execute stream side at all, just return an 
> empty iterator - 
> [https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/joins/HashJoin.scala#L152]
> 2.semi join: we don't need to execute stream side at all, just return an 
> empty iterator - 
> [https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/joins/HashJoin.scala#L227]
>  .
> This is not common that build side is empty, but in case it is, we can 
> leverage it to not execute stream side at all for better query CPU/IO 
> performance.






[jira] [Updated] (SPARK-32644) NAAJ support for ShuffleHashJoin when AQE is on

2020-08-17 Thread Leanken.Lin (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32644?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Leanken.Lin updated SPARK-32644:

Description: 
In SPARK-32290, we managed to optimize the NAAJ scenario from BNLJ to BHJ, but 
skipped checking the build-side plan against the 
*spark.sql.autoBroadcastJoinThreshold* parameter, which means that in a very 
bad case, a build-side plan that is too big might cause a driver OOM. So NAAJ 
support for ShuffledHashJoin is important as well.

Supporting NAAJ in SHJ has some difficulties in the null-key scenario: for a 
normal HashedRelation and an EmptyHashedRelation, the code logic should be the 
same for BHJ and SHJ, but if a null key exists in the global build-side data, 
only one partition would be built into EmptyHashedRelationWithAllNullKeys, and 
that partition is not able to *fast stop* the entire RDD. So after offline 
discussion with some committers, we decided to support NAAJ for SHJ only when 
AQE is on, because with AQE on, the shuffle is pre-executed, and we can know 
whether the build side contains a null key before the actual join is executed.

Basically, in the NAAJ SHJ implementation, we collect information on whether 
the build side is empty or contains a null key, keep this information in the 
ShuffleExchangeExec metrics, and during AQE rewrite these two cases into a 
LocalTableScan and the stream-side plan respectively, to avoid NAAJ; the 
normal relation case is processed in the distributed style.

 

  was:
In [SPARK-32290|https://issues.apache.org/jira/browse/SPARK-32290], we managed 
to optimize NAAJ scenario from BNLJ to BHJ, but skipped the checking the 
BuildSide plan with *spark.sql.autoBroadcastJoinThreshold* parameter, which 
means in very bad case, BuildSide Plan being to big might cause Driver OOM. So 
the NAAJ support for  ShuffledHashJoin is important as well.

 

The support of SHJ for NAAJ has some difficulties in NullKey scenario, as for 
normal HashedRelation and EmtpyHashedRelation, the code logical should be the 
same when it comes to BHJ and SHJ, but if NullKey exists in global BuildSide 
data, and only one partition could be built into 
EmptyHashedRelationWithAllNullKeys, and this partition was not able to do *fast 
stop* for the entire RDD. So after offline talked with some committer and 
discussion, decided to support NAAJ for SHJ when AQE is on, because when AQE is 
on, Shuffle will be pre-executed, and we should be able to know that whether 
the BuildSide contains NullKey or not before the actual JOIN executed.

 

Basically, In NAAJ SHJ Implementation, we collected information whether 
BuildSide is Empty or contains NullKey, and keep these information in 
ShuffleExchangeExec Metrics, and during AQE, we rewritten these two case into 
LocalTableScan and StreamedSidePlan to Avoid NAAJ, as for the normal relation, 
we processed it in Distributed style.

 


> NAAJ support for ShuffleHashJoin when AQE is on
> ---
>
> Key: SPARK-32644
> URL: https://issues.apache.org/jira/browse/SPARK-32644
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Leanken.Lin
>Priority: Major
>
> In SPARK-32290, we managed to optimize the NAAJ scenario from BNLJ to BHJ, 
> but skipped checking the build-side plan against the 
> *spark.sql.autoBroadcastJoinThreshold* parameter, which means that in a very 
> bad case, a build-side plan that is too big might cause a driver OOM. So 
> NAAJ support for ShuffledHashJoin is important as well.
> Supporting NAAJ in SHJ has some difficulties in the null-key scenario: for a 
> normal HashedRelation and an EmptyHashedRelation, the code logic should be 
> the same for BHJ and SHJ, but if a null key exists in the global build-side 
> data, only one partition would be built into 
> EmptyHashedRelationWithAllNullKeys, and that partition is not able to *fast 
> stop* the entire RDD. So after offline discussion with some committers, we 
> decided to support NAAJ for SHJ only when AQE is on, because with AQE on, 
> the shuffle is pre-executed, and we can know whether the build side contains 
> a null key before the actual join is executed.
> Basically, in the NAAJ SHJ implementation, we collect information on whether 
> the build side is empty or contains a null key, keep this information in the 
> ShuffleExchangeExec metrics, and during AQE rewrite these two cases into a 
> LocalTableScan and the stream-side plan respectively, to avoid NAAJ; the 
> normal relation case is processed in the distributed style.
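
An illustrative sketch of the AQE-time rewrite described above, with stand-in names (not the actual planner code):

{code:scala}
// Build-side states observable from ShuffleExchangeExec metrics once the
// shuffle has been pre-executed under AQE:
sealed trait BuildSideState
case object EmptyBuild  extends BuildSideState // no rows: anti join keeps every stream row
case object HasNullKey  extends BuildSideState // NAAJ semantics: result is empty
case object NormalBuild extends BuildSideState // fall through to distributed SHJ

def rewriteNaajShuffledHashJoin(state: BuildSideState): String = state match {
  case HasNullKey  => "LocalTableScan (empty result)"
  case EmptyBuild  => "stream-side plan only"
  case NormalBuild => "null-aware ShuffledHashJoin"
}
{code}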




[jira] [Created] (SPARK-32644) NAAJ support for ShuffleHashJoin when AQE is on

2020-08-17 Thread Leanken.Lin (Jira)
Leanken.Lin created SPARK-32644:
---

 Summary: NAAJ support for ShuffleHashJoin when AQE is on
 Key: SPARK-32644
 URL: https://issues.apache.org/jira/browse/SPARK-32644
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.0.0
Reporter: Leanken.Lin


In [SPARK-32290|https://issues.apache.org/jira/browse/SPARK-32290], we managed 
to optimize NAAJ scenario from BNLJ to BHJ, but skipped the checking the 
BuildSide plan with *spark.sql.autoBroadcastJoinThreshold* parameter, which 
means in very bad case, BuildSide Plan being to big might cause Driver OOM. So 
the NAAJ support for  ShuffledHashJoin is important as well.

 

The support of SHJ for NAAJ has some difficulties in NullKey scenario, as for 
normal HashedRelation and EmtpyHashedRelation, the code logical should be the 
same when it comes to BHJ and SHJ, but if NullKey exists in global BuildSide 
data, and only one partition could be built into 
EmptyHashedRelationWithAllNullKeys, and this partition was not able to do *fast 
stop* for the entire RDD. So after offline talked with some committer and 
discussion, decided to support NAAJ for SHJ when AQE is on, because when AQE is 
on, Shuffle will be pre-executed, and we should be able to know that whether 
the BuildSide contains NullKey or not before the actual JOIN executed.

 

Basically, in the NAAJ SHJ implementation, we collect information on whether 
the BuildSide is empty or contains a null key and keep it in the 
ShuffleExchangeExec metrics. During AQE, we rewrite these two cases into a 
LocalTableScan or the streamed-side plan to avoid the NAAJ entirely; normal 
relations are still processed in a distributed fashion.

 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-32615) Fix AQE aggregateMetrics java.util.NoSuchElementException

2020-08-14 Thread Leanken.Lin (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32615?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Leanken.Lin updated SPARK-32615:

Description: 
{code:java}
// Reproduce Step
sql/test-only org.apache.spark.sql.execution.adaptive.AdaptiveQueryExecSuite -- 
-z "SPARK-32573: Eliminate NAAJ when BuildSide is 
EmptyHashedRelationWithAllNullKeys"
{code}
{code:java}
// Error Message
14:40:44.089 ERROR org.apache.spark.util.Utils: Uncaught exception in thread 
element-tracking-store-worker
14:40:44.089 ERROR org.apache.spark.util.Utils: Uncaught exception in thread 
element-tracking-store-worker java.util.NoSuchElementException: key not found: 
12 
at scala.collection.immutable.Map$Map1.apply(Map.scala:114) 
at 
org.apache.spark.sql.execution.ui.SQLAppStatusListener.$anonfun$aggregateMetrics$11(SQLAppStatusListener.scala:257)
 at scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:238) 
at scala.collection.mutable.HashMap.$anonfun$foreach$1(HashMap.scala:149) at 
scala.collection.mutable.HashTable.foreachEntry(HashTable.scala:237) at 
scala.collection.mutable.HashTable.foreachEntry$(HashTable.scala:230) at 
scala.collection.mutable.HashMap.foreachEntry(HashMap.scala:44) at 
scala.collection.mutable.HashMap.foreach(HashMap.scala:149) at 
scala.collection.TraversableLike.map(TraversableLike.scala:238) at 
scala.collection.TraversableLike.map$(TraversableLike.scala:231) at 
scala.collection.AbstractTraversable.map(Traversable.scala:108) at 
org.apache.spark.sql.execution.ui.SQLAppStatusListener.aggregateMetrics(SQLAppStatusListener.scala:256)
 at 
org.apache.spark.sql.execution.ui.SQLAppStatusListener.$anonfun$onExecutionEnd$2(SQLAppStatusListener.scala:365)
 at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23) at 
org.apache.spark.util.Utils$.tryLog(Utils.scala:1971) at 
org.apache.spark.status.ElementTrackingStore$$anon$1.run(ElementTrackingStore.scala:117)
 at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) at 
java.util.concurrent.FutureTask.run(FutureTask.java:266) at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) 
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) 
at java.lang.Thread.run(Thread.java:748)[info] - SPARK-32573: Eliminate NAAJ 
when BuildSide is EmptyHashedRelationWithAllNullKeys (2 seconds, 14 
milliseconds)
{code}
This issue arises mainly because, during AQE, the metrics are overwritten when 
a sub-plan changes. For example, in this UT the plan changes from a 
BroadcastHashJoinExec into a LocalTableScanExec, and the onExecutionEnd action 
then tries to aggregate all metrics collected during execution, including the 
old ones, which causes a NoSuchElementException since the metric types have 
already been updated by the plan rewrite. So we need to filter out those 
outdated metrics.
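
A minimal sketch of that filtering step, assuming live metric types are keyed 
by accumulator id (hypothetical names, not the actual SQLAppStatusListener 
code):

{code:java}
// Hypothetical sketch: before aggregating, drop accumulator updates whose ids
// no longer belong to the final (rewritten) plan, so stale ids from replaced
// sub-plans cannot trigger a NoSuchElementException on lookup.
def aggregateLiveMetrics(
    liveMetricTypes: Map[Long, String],   // accumulator id -> metric type, final plan only
    updates: Seq[(Long, Long)]            // (accumulator id, value) seen during execution
  ): Map[Long, Seq[Long]] = {
  updates
    .filter { case (accumId, _) => liveMetricTypes.contains(accumId) }  // skip outdated ids
    .groupBy { case (accumId, _) => accumId }
    .map { case (accumId, pairs) => accumId -> pairs.map(_._2) }
}
{code}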

  was:
{code:java}
// Reproduce Step
sql/test-only org.apache.spark.sql.execution.adaptive.AdaptiveQueryExecSuite -- 
-z "SPARK-32573: Eliminate NAAJ when BuildSide is 
EmptyHashedRelationWithAllNullKeys"
{code}
{code:java}
// Error Message
14:40:44.089 ERROR org.apache.spark.util.Utils: Uncaught exception in thread 
element-tracking-store-worker
14:40:44.089 ERROR org.apache.spark.util.Utils: Uncaught exception in thread 
element-tracking-store-worker java.util.NoSuchElementException: key not found: 
12 
at scala.collection.immutable.Map$Map1.apply(Map.scala:114) 
at 
org.apache.spark.sql.execution.ui.SQLAppStatusListener.$anonfun$aggregateMetrics$11(SQLAppStatusListener.scala:257)
 at scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:238) 
at scala.collection.mutable.HashMap.$anonfun$foreach$1(HashMap.scala:149) at 
scala.collection.mutable.HashTable.foreachEntry(HashTable.scala:237) at 
scala.collection.mutable.HashTable.foreachEntry$(HashTable.scala:230) at 
scala.collection.mutable.HashMap.foreachEntry(HashMap.scala:44) at 
scala.collection.mutable.HashMap.foreach(HashMap.scala:149) at 
scala.collection.TraversableLike.map(TraversableLike.scala:238) at 
scala.collection.TraversableLike.map$(TraversableLike.scala:231) at 
scala.collection.AbstractTraversable.map(Traversable.scala:108) at 
org.apache.spark.sql.execution.ui.SQLAppStatusListener.aggregateMetrics(SQLAppStatusListener.scala:256)
 at 
org.apache.spark.sql.execution.ui.SQLAppStatusListener.$anonfun$onExecutionEnd$2(SQLAppStatusListener.scala:365)
 at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23) at 
org.apache.spark.util.Utils$.tryLog(Utils.scala:1971) at 
org.apache.spark.status.ElementTrackingStore$$anon$1.run(ElementTrackingStore.scala:117)
 at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) at 
java.util.concurrent.FutureTask.run(FutureTask.java:266) at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) 
at 

[jira] [Updated] (SPARK-32615) Fix AQE aggregateMetrics java.util.NoSuchElementException

2020-08-14 Thread Leanken.Lin (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32615?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Leanken.Lin updated SPARK-32615:

Description: 
{code:java}
// Reproduce Step
sql/test-only org.apache.spark.sql.execution.adaptive.AdaptiveQueryExecSuite -- 
-z "SPARK-32573: Eliminate NAAJ when BuildSide is 
EmptyHashedRelationWithAllNullKeys"
{code}
{code:java}
// Error Message
14:40:44.089 ERROR org.apache.spark.util.Utils: Uncaught exception in thread 
element-tracking-store-worker
14:40:44.089 ERROR org.apache.spark.util.Utils: Uncaught exception in thread 
element-tracking-store-worker java.util.NoSuchElementException: key not found: 
12 
at scala.collection.immutable.Map$Map1.apply(Map.scala:114) 
at 
org.apache.spark.sql.execution.ui.SQLAppStatusListener.$anonfun$aggregateMetrics$11(SQLAppStatusListener.scala:257)
 at scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:238) 
at scala.collection.mutable.HashMap.$anonfun$foreach$1(HashMap.scala:149) at 
scala.collection.mutable.HashTable.foreachEntry(HashTable.scala:237) at 
scala.collection.mutable.HashTable.foreachEntry$(HashTable.scala:230) at 
scala.collection.mutable.HashMap.foreachEntry(HashMap.scala:44) at 
scala.collection.mutable.HashMap.foreach(HashMap.scala:149) at 
scala.collection.TraversableLike.map(TraversableLike.scala:238) at 
scala.collection.TraversableLike.map$(TraversableLike.scala:231) at 
scala.collection.AbstractTraversable.map(Traversable.scala:108) at 
org.apache.spark.sql.execution.ui.SQLAppStatusListener.aggregateMetrics(SQLAppStatusListener.scala:256)
 at 
org.apache.spark.sql.execution.ui.SQLAppStatusListener.$anonfun$onExecutionEnd$2(SQLAppStatusListener.scala:365)
 at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23) at 
org.apache.spark.util.Utils$.tryLog(Utils.scala:1971) at 
org.apache.spark.status.ElementTrackingStore$$anon$1.run(ElementTrackingStore.scala:117)
 at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) at 
java.util.concurrent.FutureTask.run(FutureTask.java:266) at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) 
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) 
at java.lang.Thread.run(Thread.java:748)[info] - SPARK-32573: Eliminate NAAJ 
when BuildSide is EmptyHashedRelationWithAllNullKeys (2 seconds, 14 
milliseconds)
{code}
This issue arises mainly because, during AQE, the metrics are overwritten when 
a sub-plan changes. For example, in this UT the plan changes from a 
BroadcastHashJoinExec into a LocalTableScanExec, and the onExecutionEnd action 
then tries to aggregate all metrics collected during execution, which causes a 
NoSuchElementException.

  was:
{code:java}
// Reproduce Step
sql/test-only org.apache.spark.sql.execution.adaptive.AdaptiveQueryExecSuite -- 
-z "SPARK-32573: Eliminate NAAJ when BuildSide is 
EmptyHashedRelationWithAllNullKeys"
{code}
{code:java}
// Error Message
14:40:44.089 ERROR org.apache.spark.util.Utils: Uncaught exception in thread 
element-tracking-store-worker14:40:44.089 ERROR org.apache.spark.util.Utils: 
Uncaught exception in thread 
element-tracking-store-workerjava.util.NoSuchElementException: key not found: 
12 at scala.collection.immutable.Map$Map1.apply(Map.scala:114) at 
org.apache.spark.sql.execution.ui.SQLAppStatusListener.$anonfun$aggregateMetrics$11(SQLAppStatusListener.scala:257)
 at scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:238) 
at scala.collection.mutable.HashMap.$anonfun$foreach$1(HashMap.scala:149) at 
scala.collection.mutable.HashTable.foreachEntry(HashTable.scala:237) at 
scala.collection.mutable.HashTable.foreachEntry$(HashTable.scala:230) at 
scala.collection.mutable.HashMap.foreachEntry(HashMap.scala:44) at 
scala.collection.mutable.HashMap.foreach(HashMap.scala:149) at 
scala.collection.TraversableLike.map(TraversableLike.scala:238) at 
scala.collection.TraversableLike.map$(TraversableLike.scala:231) at 
scala.collection.AbstractTraversable.map(Traversable.scala:108) at 
org.apache.spark.sql.execution.ui.SQLAppStatusListener.aggregateMetrics(SQLAppStatusListener.scala:256)
 at 
org.apache.spark.sql.execution.ui.SQLAppStatusListener.$anonfun$onExecutionEnd$2(SQLAppStatusListener.scala:365)
 at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23) at 
org.apache.spark.util.Utils$.tryLog(Utils.scala:1971) at 
org.apache.spark.status.ElementTrackingStore$$anon$1.run(ElementTrackingStore.scala:117)
 at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) at 
java.util.concurrent.FutureTask.run(FutureTask.java:266) at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) 
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) 
at java.lang.Thread.run(Thread.java:748)[info] - SPARK-32573: Eliminate NAAJ 
when BuildSide is EmptyHashedRelationWithAllNullKeys 

[jira] [Updated] (SPARK-32615) Fix AQE aggregateMetrics java.util.NoSuchElementException

2020-08-14 Thread Leanken.Lin (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32615?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Leanken.Lin updated SPARK-32615:

Description: 
{code:java}
// Reproduce Step
sql/test-only org.apache.spark.sql.execution.adaptive.AdaptiveQueryExecSuite -- 
-z "SPARK-32573: Eliminate NAAJ when BuildSide is 
EmptyHashedRelationWithAllNullKeys"
{code}
{code:java}
// Error Message
14:40:44.089 ERROR org.apache.spark.util.Utils: Uncaught exception in thread 
element-tracking-store-worker14:40:44.089 ERROR org.apache.spark.util.Utils: 
Uncaught exception in thread 
element-tracking-store-workerjava.util.NoSuchElementException: key not found: 
12 at scala.collection.immutable.Map$Map1.apply(Map.scala:114) at 
org.apache.spark.sql.execution.ui.SQLAppStatusListener.$anonfun$aggregateMetrics$11(SQLAppStatusListener.scala:257)
 at scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:238) 
at scala.collection.mutable.HashMap.$anonfun$foreach$1(HashMap.scala:149) at 
scala.collection.mutable.HashTable.foreachEntry(HashTable.scala:237) at 
scala.collection.mutable.HashTable.foreachEntry$(HashTable.scala:230) at 
scala.collection.mutable.HashMap.foreachEntry(HashMap.scala:44) at 
scala.collection.mutable.HashMap.foreach(HashMap.scala:149) at 
scala.collection.TraversableLike.map(TraversableLike.scala:238) at 
scala.collection.TraversableLike.map$(TraversableLike.scala:231) at 
scala.collection.AbstractTraversable.map(Traversable.scala:108) at 
org.apache.spark.sql.execution.ui.SQLAppStatusListener.aggregateMetrics(SQLAppStatusListener.scala:256)
 at 
org.apache.spark.sql.execution.ui.SQLAppStatusListener.$anonfun$onExecutionEnd$2(SQLAppStatusListener.scala:365)
 at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23) at 
org.apache.spark.util.Utils$.tryLog(Utils.scala:1971) at 
org.apache.spark.status.ElementTrackingStore$$anon$1.run(ElementTrackingStore.scala:117)
 at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) at 
java.util.concurrent.FutureTask.run(FutureTask.java:266) at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) 
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) 
at java.lang.Thread.run(Thread.java:748)[info] - SPARK-32573: Eliminate NAAJ 
when BuildSide is EmptyHashedRelationWithAllNullKeys (2 seconds, 14 
milliseconds)
{code}
This issue arises mainly because, during AQE, the metrics are overwritten when 
a sub-plan changes. For example, in this UT the plan changes from a 
BroadcastHashJoinExec into a LocalTableScanExec, and the onExecutionEnd action 
then tries to aggregate all metrics collected during execution, which causes a 
NoSuchElementException.

  was:
Reproduce Step
{code:java}
// code placeholder
sql/test-only org.apache.spark.sql.execution.adaptive.AdaptiveQueryExecSuite -- 
-z "SPARK-32573: Eliminate NAAJ when BuildSide is 
EmptyHashedRelationWithAllNullKeys"
{code}
{code:java}
// code placeholder
14:40:44.089 ERROR org.apache.spark.util.Utils: Uncaught exception in thread 
element-tracking-store-worker14:40:44.089 ERROR org.apache.spark.util.Utils: 
Uncaught exception in thread 
element-tracking-store-workerjava.util.NoSuchElementException: key not found: 
12 at scala.collection.immutable.Map$Map1.apply(Map.scala:114) at 
org.apache.spark.sql.execution.ui.SQLAppStatusListener.$anonfun$aggregateMetrics$11(SQLAppStatusListener.scala:257)
 at scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:238) 
at scala.collection.mutable.HashMap.$anonfun$foreach$1(HashMap.scala:149) at 
scala.collection.mutable.HashTable.foreachEntry(HashTable.scala:237) at 
scala.collection.mutable.HashTable.foreachEntry$(HashTable.scala:230) at 
scala.collection.mutable.HashMap.foreachEntry(HashMap.scala:44) at 
scala.collection.mutable.HashMap.foreach(HashMap.scala:149) at 
scala.collection.TraversableLike.map(TraversableLike.scala:238) at 
scala.collection.TraversableLike.map$(TraversableLike.scala:231) at 
scala.collection.AbstractTraversable.map(Traversable.scala:108) at 
org.apache.spark.sql.execution.ui.SQLAppStatusListener.aggregateMetrics(SQLAppStatusListener.scala:256)
 at 
org.apache.spark.sql.execution.ui.SQLAppStatusListener.$anonfun$onExecutionEnd$2(SQLAppStatusListener.scala:365)
 at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23) at 
org.apache.spark.util.Utils$.tryLog(Utils.scala:1971) at 
org.apache.spark.status.ElementTrackingStore$$anon$1.run(ElementTrackingStore.scala:117)
 at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) at 
java.util.concurrent.FutureTask.run(FutureTask.java:266) at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) 
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) 
at java.lang.Thread.run(Thread.java:748)[info] - SPARK-32573: Eliminate NAAJ 
when BuildSide is EmptyHashedRelationWithAllNullKeys (2 

[jira] [Updated] (SPARK-32615) Fix AQE aggregateMetrics java.util.NoSuchElementException

2020-08-14 Thread Leanken.Lin (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32615?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Leanken.Lin updated SPARK-32615:

Summary: Fix AQE aggregateMetrics java.util.NoSuchElementException  (was: 
AQE aggregateMetrics java.util.NoSuchElementException)

> Fix AQE aggregateMetrics java.util.NoSuchElementException
> -
>
> Key: SPARK-32615
> URL: https://issues.apache.org/jira/browse/SPARK-32615
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Leanken.Lin
>Priority: Minor
>
> Reproduce Step
> {code:java}
> // code placeholder
> sql/test-only org.apache.spark.sql.execution.adaptive.AdaptiveQueryExecSuite 
> -- -z "SPARK-32573: Eliminate NAAJ when BuildSide is 
> EmptyHashedRelationWithAllNullKeys"
> {code}
> {code:java}
> // code placeholder
> 14:40:44.089 ERROR org.apache.spark.util.Utils: Uncaught exception in thread 
> element-tracking-store-worker14:40:44.089 ERROR org.apache.spark.util.Utils: 
> Uncaught exception in thread 
> element-tracking-store-workerjava.util.NoSuchElementException: key not found: 
> 12 at scala.collection.immutable.Map$Map1.apply(Map.scala:114) at 
> org.apache.spark.sql.execution.ui.SQLAppStatusListener.$anonfun$aggregateMetrics$11(SQLAppStatusListener.scala:257)
>  at 
> scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:238) at 
> scala.collection.mutable.HashMap.$anonfun$foreach$1(HashMap.scala:149) at 
> scala.collection.mutable.HashTable.foreachEntry(HashTable.scala:237) at 
> scala.collection.mutable.HashTable.foreachEntry$(HashTable.scala:230) at 
> scala.collection.mutable.HashMap.foreachEntry(HashMap.scala:44) at 
> scala.collection.mutable.HashMap.foreach(HashMap.scala:149) at 
> scala.collection.TraversableLike.map(TraversableLike.scala:238) at 
> scala.collection.TraversableLike.map$(TraversableLike.scala:231) at 
> scala.collection.AbstractTraversable.map(Traversable.scala:108) at 
> org.apache.spark.sql.execution.ui.SQLAppStatusListener.aggregateMetrics(SQLAppStatusListener.scala:256)
>  at 
> org.apache.spark.sql.execution.ui.SQLAppStatusListener.$anonfun$onExecutionEnd$2(SQLAppStatusListener.scala:365)
>  at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23) at 
> org.apache.spark.util.Utils$.tryLog(Utils.scala:1971) at 
> org.apache.spark.status.ElementTrackingStore$$anon$1.run(ElementTrackingStore.scala:117)
>  at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) 
> at java.util.concurrent.FutureTask.run(FutureTask.java:266) at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>  at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>  at java.lang.Thread.run(Thread.java:748)[info] - SPARK-32573: Eliminate NAAJ 
> when BuildSide is EmptyHashedRelationWithAllNullKeys (2 seconds, 14 
> milliseconds)
> {code}
> This issue arises mainly because, during AQE, the metrics are overwritten 
> when a sub-plan changes. For example, in this UT the plan changes from a 
> BroadcastHashJoinExec into a LocalTableScanExec, and the onExecutionEnd 
> action then tries to aggregate all metrics collected during execution, which 
> causes a NoSuchElementException.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-32615) AQE aggregateMetrics java.util.NoSuchElementException

2020-08-14 Thread Leanken.Lin (Jira)
Leanken.Lin created SPARK-32615:
---

 Summary: AQE aggregateMetrics java.util.NoSuchElementException
 Key: SPARK-32615
 URL: https://issues.apache.org/jira/browse/SPARK-32615
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.0.0
Reporter: Leanken.Lin


Reproduce Step
{code:java}
// code placeholder
sql/test-only org.apache.spark.sql.execution.adaptive.AdaptiveQueryExecSuite -- 
-z "SPARK-32573: Eliminate NAAJ when BuildSide is 
EmptyHashedRelationWithAllNullKeys"
{code}
{code:java}
// code placeholder
14:40:44.089 ERROR org.apache.spark.util.Utils: Uncaught exception in thread 
element-tracking-store-worker14:40:44.089 ERROR org.apache.spark.util.Utils: 
Uncaught exception in thread 
element-tracking-store-workerjava.util.NoSuchElementException: key not found: 
12 at scala.collection.immutable.Map$Map1.apply(Map.scala:114) at 
org.apache.spark.sql.execution.ui.SQLAppStatusListener.$anonfun$aggregateMetrics$11(SQLAppStatusListener.scala:257)
 at scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:238) 
at scala.collection.mutable.HashMap.$anonfun$foreach$1(HashMap.scala:149) at 
scala.collection.mutable.HashTable.foreachEntry(HashTable.scala:237) at 
scala.collection.mutable.HashTable.foreachEntry$(HashTable.scala:230) at 
scala.collection.mutable.HashMap.foreachEntry(HashMap.scala:44) at 
scala.collection.mutable.HashMap.foreach(HashMap.scala:149) at 
scala.collection.TraversableLike.map(TraversableLike.scala:238) at 
scala.collection.TraversableLike.map$(TraversableLike.scala:231) at 
scala.collection.AbstractTraversable.map(Traversable.scala:108) at 
org.apache.spark.sql.execution.ui.SQLAppStatusListener.aggregateMetrics(SQLAppStatusListener.scala:256)
 at 
org.apache.spark.sql.execution.ui.SQLAppStatusListener.$anonfun$onExecutionEnd$2(SQLAppStatusListener.scala:365)
 at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23) at 
org.apache.spark.util.Utils$.tryLog(Utils.scala:1971) at 
org.apache.spark.status.ElementTrackingStore$$anon$1.run(ElementTrackingStore.scala:117)
 at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511) at 
java.util.concurrent.FutureTask.run(FutureTask.java:266) at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) 
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) 
at java.lang.Thread.run(Thread.java:748)[info] - SPARK-32573: Eliminate NAAJ 
when BuildSide is EmptyHashedRelationWithAllNullKeys (2 seconds, 14 
milliseconds)
{code}
This issue arises mainly because, during AQE, the metrics are overwritten when 
a sub-plan changes. For example, in this UT the plan changes from a 
BroadcastHashJoinExec into a LocalTableScanExec, and the onExecutionEnd action 
then tries to aggregate all metrics collected during execution, which causes a 
NoSuchElementException.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-32573) Anti Join Improvement with EmptyHashedRelation and EmptyHashedRelationWithAllNullKeys

2020-08-10 Thread Leanken.Lin (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32573?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Leanken.Lin updated SPARK-32573:

Description: 
In SPARK-32290, we introduced several new types of HashedRelation
 * EmptyHashedRelation
 * EmptyHashedRelationWithAllNullKeys

They were limited to use only in the NAAJ scenario. These new HashedRelation 
types can be applied to other scenarios for performance improvements.
 * EmptyHashedRelation could also be used in a normal AntiJoin for fast stop
 * While AQE is on and the buildSide is EmptyHashedRelationWithAllNullKeys, we 
can convert the NAAJ into an empty LocalRelation to skip meaningless data 
iteration, since in single-key NAAJ a null key in the BuildSide drops all 
records on the streamedSide.

This patch includes two changes (a sketch of the fast-stop logic follows the 
list):
 * using EmptyHashedRelation to do fast stop for the common anti join as well
 * in AQE, eliminating the BroadcastHashJoin (NAAJ) if the buildSide is an 
EmptyHashedRelationWithAllNullKeys
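
A minimal sketch of the fast-stop logic, assuming the build side has already 
been materialized into a key set (illustrative only; the real code matches on 
EmptyHashedRelation):

{code:java}
// Hypothetical sketch: a per-partition anti join can return the streamed rows
// unchanged when the build side is empty, instead of probing every row.
def antiJoinPartition[Row](
    streamed: Iterator[Row],
    buildKeys: Set[Row]): Iterator[Row] = {
  if (buildKeys.isEmpty) streamed                        // fast stop: nothing can match
  else streamed.filter(row => !buildKeys.contains(row))  // keep only non-matching rows
}
{code}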

  was:
In [SPARK-32290|https://issues.apache.org/jira/browse/SPARK-32290], we 
introduced several new types of HashedRelation
 * EmptyHashedRelation
 * EmptyHashedRelationWithAllNullKeys

They were limited to use only in the NAAJ scenario. But as an improvement, 
EmptyHashedRelation could also be used in a normal AntiJoin for fast stop, and 
in AQE we can even eliminate the anti join when we know that the buildSide is 
empty.

 

This Patch including two changes.

In non-AQE mode, use EmptyHashedRelation to do fast stop for the common anti 
join as well.

In AQE, eliminate the anti join if the buildSide is an EmptyHashedRelation or 
the ShuffleWriteRecord count is 0.

 

Summary: Anti Join Improvement with EmptyHashedRelation and 
EmptyHashedRelationWithAllNullKeys  (was: Eliminate Anti Join when BuildSide is 
Empty)

> Anti Join Improvement with EmptyHashedRelation and 
> EmptyHashedRelationWithAllNullKeys
> -
>
> Key: SPARK-32573
> URL: https://issues.apache.org/jira/browse/SPARK-32573
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Leanken.Lin
>Priority: Minor
>
> In SPARK-32290, we introduced several new types of HashedRelation
>  * EmptyHashedRelation
>  * EmptyHashedRelationWithAllNullKeys
> They were limited to use only in the NAAJ scenario. These new HashedRelation 
> types can be applied to other scenarios for performance improvements.
>  * EmptyHashedRelation could also be used in a normal AntiJoin for fast stop
>  * While AQE is on and the buildSide is EmptyHashedRelationWithAllNullKeys, 
> we can convert the NAAJ into an empty LocalRelation to skip meaningless data 
> iteration, since in single-key NAAJ a null key in the BuildSide drops all 
> records on the streamedSide.
> This patch includes two changes.
>  * using EmptyHashedRelation to do fast stop for the common anti join as well
>  * in AQE, eliminating the BroadcastHashJoin (NAAJ) if the buildSide is an 
> EmptyHashedRelationWithAllNullKeys



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-32573) Eliminate Anti Join when BuildSide is Empty

2020-08-08 Thread Leanken.Lin (Jira)
Leanken.Lin created SPARK-32573:
---

 Summary: Eliminate Anti Join when BuildSide is Empty
 Key: SPARK-32573
 URL: https://issues.apache.org/jira/browse/SPARK-32573
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.0.0
Reporter: Leanken.Lin


In [SPARK-32290|https://issues.apache.org/jira/browse/SPARK-32290], we 
introduced several new types of HashedRelation
 * EmptyHashedRelation
 * EmptyHashedRelationWithAllNullKeys

They were limited to use only in the NAAJ scenario. But as an improvement, 
EmptyHashedRelation could also be used in a normal AntiJoin for fast stop, and 
in AQE we can even eliminate the anti join when we know that the buildSide is 
empty.

 

This patch includes two changes.

In non-AQE mode, use EmptyHashedRelation to do fast stop for the common anti 
join as well.

In AQE, eliminate the anti join if the buildSide is an EmptyHashedRelation or 
the ShuffleWriteRecord count is 0.

 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-32494) Null Aware Anti Join Optimize Support Multi-Column

2020-08-05 Thread Leanken.Lin (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32494?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Leanken.Lin resolved SPARK-32494.
-
Resolution: Later

> Null Aware Anti Join Optimize Support Multi-Column
> --
>
> Key: SPARK-32494
> URL: https://issues.apache.org/jira/browse/SPARK-32494
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Leanken.Lin
>Priority: Major
>
> In SPARK-32290, we managed to optimize BroadcastNestedLoopJoin into 
> BroadcastHashJoin for the single-column NAAJ scenario by using a hash lookup 
> instead of a loop join. 
> It's simple to fulfill the "NOT IN" logic when there is a single key, but 
> multi-column NOT IN is much more complicated because of all the null-aware 
> comparisons.
> FYI, the code logic for the single- and multi-column cases is defined at
> ~/sql/core/src/test/resources/sql-tests/inputs/subquery/in-subquery/not-in-unit-tests-single-column.sql
> ~/sql/core/src/test/resources/sql-tests/inputs/subquery/in-subquery/not-in-unit-tests-multi-column.sql
>  
> Hence, we propose a new type of HashedRelation: NullAwareHashedRelation. 
> For NullAwareHashedRelation:
>  # it will not skip any-null-column keys the way LongHashedRelation and 
> UnsafeHashedRelation do;
>  # while building the NullAwareHashedRelation, we put extra keys into the 
> relation, just to make the null-aware column comparison work in hash-lookup 
> style.
> The duplication is up to 2^numKeys - 1 times; for example, to support NAAJ 
> with a 3-column join key, the buildSide is expanded (2^3 - 1) = 7 times, 7X.
> For example, given an UnsafeRow key (1,2,3):
> in null-aware mode it is expanded into 7 keys via the extra C(3,1) and C(3,2) 
> combinations; within each combination, we duplicate the record with null 
> padding as follows.
> Original record
> (1,2,3)
> Extra records to be appended into NullAwareHashedRelation
> (null, 2, 3) (1, null, 3) (1, 2, null)
>  (null, null, 3) (null, 2, null) (1, null, null)
> With the expanded data we can extract a common pattern for both single and 
> multi column; allNull refers to an UnsafeRow whose columns are all null.
>  * buildSide is empty input => return all rows
>  * allNullColumnKey Exists In buildSide input => reject all rows
>  * if streamedSideRow.allNull is true => drop the row
>  * if streamedSideRow.allNull is false & findMatch in NullAwareHashedRelation 
> => drop the row
>  * if streamedSideRow.allNull is false & notFindMatch in 
> NullAwareHashedRelation => return the row
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-32494) Null Aware Anti Join Optimize Support Multi-Column

2020-07-30 Thread Leanken.Lin (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32494?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Leanken.Lin updated SPARK-32494:

Description: 
In SPARK-32290, we managed to optimize BroadcastNestedLoopJoin into 
BroadcastHashJoin for the single-column NAAJ scenario by using a hash lookup 
instead of a loop join. 

It's simple to fulfill the "NOT IN" logic when there is a single key, but 
multi-column NOT IN is much more complicated because of all the null-aware 
comparisons.

FYI, the code logic for the single- and multi-column cases is defined at

~/sql/core/src/test/resources/sql-tests/inputs/subquery/in-subquery/not-in-unit-tests-single-column.sql

~/sql/core/src/test/resources/sql-tests/inputs/subquery/in-subquery/not-in-unit-tests-multi-column.sql

 

Hence, we propose a new type of HashedRelation: NullAwareHashedRelation. 

For NullAwareHashedRelation:
 # it will not skip any-null-column keys the way LongHashedRelation and 
UnsafeHashedRelation do;
 # while building the NullAwareHashedRelation, we put extra keys into the 
relation, just to make the null-aware column comparison work in hash-lookup 
style.

The duplication is up to 2^numKeys - 1 times; for example, to support NAAJ 
with a 3-column join key, the buildSide is expanded (2^3 - 1) = 7 times, 7X.

For example, given an UnsafeRow key (1,2,3):

in null-aware mode it is expanded into 7 keys via the extra C(3,1) and C(3,2) 
combinations; within each combination, we duplicate the record with null 
padding as follows (an expansion sketch follows the record list).

Original record

(1,2,3)

Extra record to be appended into NullAwareHashedRelation

(null, 2, 3) (1, null, 3) (1, 2, null)
 (null, null, 3) (null, 2, null) (1, null, null)
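
A minimal sketch of that expansion, representing keys as plain sequences 
(illustrative only; the real implementation works on UnsafeRow):

{code:java}
// Hypothetical sketch: expand one key row into every null-padded variant,
// excluding the all-null combination, so a key of n columns yields 2^n - 1
// rows in total (the mask == 0 case is the original row).
def expandNullAwareKeys(key: Seq[Any]): Seq[Seq[Any]] = {
  val n = key.length
  val masks = 0 until ((1 << n) - 1)   // every bitmask except the all-null one
  masks.map { mask =>
    key.zipWithIndex.map { case (value, i) =>
      if ((mask & (1 << i)) != 0) null else value
    }
  }
}

// expandNullAwareKeys(Seq(1, 2, 3)) yields exactly the 7 rows listed above.
{code}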

With the expanded data we can extract a common pattern for both single and 
multi column (a decision sketch follows the list); allNull refers to an 
UnsafeRow whose columns are all null.
 * buildSide is empty input => return all rows
 * allNullColumnKey Exists In buildSide input => reject all rows
 * if streamedSideRow.allNull is true => drop the row
 * if streamedSideRow.allNull is false & findMatch in NullAwareHashedRelation 
=> drop the row
 * if streamedSideRow.allNull is false & notFindMatch in 
NullAwareHashedRelation => return the row
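
Putting those rules together, a minimal per-row decision sketch (plain Scala 
in place of the real NullAwareHashedRelation; all names are illustrative):

{code:java}
// Hypothetical sketch of the streamed-side decision for null-aware anti join.
// buildIsEmpty / buildHasAllNullKey describe the built relation; findMatch
// probes the expanded (null-padded) key set for the current streamed row.
def keepStreamedRow(
    buildIsEmpty: Boolean,
    buildHasAllNullKey: Boolean,
    rowAllNull: Boolean,
    findMatch: => Boolean): Boolean = {
  if (buildIsEmpty) true              // empty build side: return all rows
  else if (buildHasAllNullKey) false  // all-null key on build side: reject all rows
  else if (rowAllNull) false          // streamed row is all null: drop it
  else !findMatch                     // keep the row only when no match is found
}
{code}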

 

 

  was:
In [SPARK-32290|https://issues.apache.org/jira/browse/SPARK-32290], we managed 
to optimize BroadcastNestedLoopJoin into BroadcastHashJoin for the 
single-column NAAJ scenario by using a hash lookup instead of a loop join. 

It's simple to fulfill the "NOT IN" logic when there is a single key, but 
multi-column NOT IN is much more complicated because of all the null-aware 
comparisons.

Hence, we propose a new type of HashedRelation: NullAwareHashedRelation. 

For NullAwareHashedRelation:
 # it will not skip any-null-column keys the way LongHashedRelation and 
UnsafeHashedRelation do;
 # while building the NullAwareHashedRelation, we put extra keys into the 
relation, just to make the null-aware column comparison work in hash-lookup 
style.

The duplication is up to 2^numKeys - 1 times; for example, to support NAAJ 
with a 3-column join key, the buildSide is expanded (2^3 - 1) = 7 times, 7X.

For example, given an UnsafeRow key (1,2,3):

in null-aware mode it is expanded into 7 keys via the extra C(3,1) and C(3,2) 
combinations; within each combination, we duplicate the record with null 
padding as follows.

Original record

(1,2,3)

Extra record to be appended into HashedRelation

(null, 2, 3) (1, null, 3) (1, 2, null)
(null, null, 3) (null, 2, null) (1, null, null)

With the expanded data we can extract a common pattern for both single and 
multi column; allNull refers to an UnsafeRow whose columns are all null.
 * buildSide is empty input => return all rows
 * allNullColumnKey Exists In buildSide input => reject all rows
 * if streamedSideRow.allNull is true => drop the row
 * if streamedSideRow.allNull is false & findMatch in NullAwareHashedRelation 
=> drop the row
 * if streamedSideRow.allNull is false & notFindMatch in 
NullAwareHashedRelation => return the row

 

 


> Null Aware Anti Join Optimize Support Multi-Column
> --
>
> Key: SPARK-32494
> URL: https://issues.apache.org/jira/browse/SPARK-32494
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Leanken.Lin
>Priority: Major
>
> In SPARK-32290, we managed to optimize BroadcastNestedLoopJoin into 
> BroadcastHashJoin for the single-column NAAJ scenario by using a hash lookup 
> instead of a loop join. 
> It's simple to fulfill the "NOT IN" logic when there is a single key, but 
> multi-column NOT IN is much more complicated because of all the null-aware 
> comparisons.
> FYI, the code logic for the single- and multi-column cases is defined at
> ~/sql/core/src/test/resources/sql-tests/inputs/subquery/in-subquery/not-in-unit-tests-single-column.sql
> 

[jira] [Created] (SPARK-32494) Null Aware Anti Join Optimize Support Multi-Column

2020-07-30 Thread Leanken.Lin (Jira)
Leanken.Lin created SPARK-32494:
---

 Summary: Null Aware Anti Join Optimize Support Multi-Column
 Key: SPARK-32494
 URL: https://issues.apache.org/jira/browse/SPARK-32494
 Project: Spark
  Issue Type: New Feature
  Components: SQL
Affects Versions: 3.0.0
Reporter: Leanken.Lin


In [SPARK-32290|https://issues.apache.org/jira/browse/SPARK-32290], we managed 
to optimize BroadcastNestedLoopJoin into BroadcastHashJoin for the 
single-column NAAJ scenario by using a hash lookup instead of a loop join. 

It's simple to fulfill the "NOT IN" logic when there is a single key, but 
multi-column NOT IN is much more complicated because of all the null-aware 
comparisons.

Hence, we propose a new type of HashedRelation: NullAwareHashedRelation. 

For NullAwareHashedRelation:
 # it will not skip any-null-column keys the way LongHashedRelation and 
UnsafeHashedRelation do;
 # while building the NullAwareHashedRelation, we put extra keys into the 
relation, just to make the null-aware column comparison work in hash-lookup 
style.

The duplication is up to 2^numKeys - 1 times; for example, to support NAAJ 
with a 3-column join key, the buildSide is expanded (2^3 - 1) = 7 times, 7X.

For example, given an UnsafeRow key (1,2,3):

in null-aware mode it is expanded into 7 keys via the extra C(3,1) and C(3,2) 
combinations; within each combination, we duplicate the record with null 
padding as follows.

Original record

(1,2,3)

Extra record to be appended into HashedRelation

(null, 2, 3) (1, null, 3) (1, 2, null)
(null, null, 3) (null, 2, null) (1, null, null)

With the expanded data we can extract a common pattern for both single and 
multi column; allNull refers to an UnsafeRow whose columns are all null.
 * buildSide is empty input => return all rows
 * allNullColumnKey Exists In buildSide input => reject all rows
 * if streamedSideRow.allNull is true => drop the row
 * if streamedSideRow.allNull is false & findMatch in NullAwareHashedRelation 
=> drop the row
 * if streamedSideRow.allNull is false & notFindMatch in 
NullAwareHashedRelation => return the row

 

 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-32474) NullAwareAntiJoin multi-column support

2020-07-30 Thread Leanken.Lin (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32474?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Leanken.Lin resolved SPARK-32474.
-
Resolution: Duplicate

> NullAwareAntiJoin multi-column support
> --
>
> Key: SPARK-32474
> URL: https://issues.apache.org/jira/browse/SPARK-32474
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Leanken.Lin
>Priority: Minor
>
> This is a follow-up improvement of SPARK-32290.
> In SPARK-32290, we already optimized NAAJ from BroadcastNestedLoopJoin to 
> BroadcastHashJoin, which improves the total calculation from O(M*N) to O(M), 
> but it only targets the single-column case, because multi-column support is 
> much more complicated.
> See [http://www.vldb.org/pvldb/vol2/vldb09-423.pdf], Section 6
>  
> FYI, the code logic for the single- and multi-column cases is defined at
> ~/sql/core/src/test/resources/sql-tests/inputs/subquery/in-subquery/not-in-unit-tests-single-column.sql
> ~/sql/core/src/test/resources/sql-tests/inputs/subquery/in-subquery/not-in-unit-tests-multi-column.sql
>  
> To support multiple columns, I propose the following idea; let's see whether 
> multi-column support is worth the trade-off. I would need to do some data 
> expansion in the HashedRelation, and I would call this new type of 
> HashedRelation a NullAwareHashedRelation.
> In a NullAwareHashedRelation, keys with null columns are allowed, which is 
> the opposite of LongHashedRelation and UnsafeHashedRelation; and a single key 
> might be expanded into 2^N - 1 records (N refers to the column count of the 
> key). For example, if a record
> (1, 2, 3) is about to be inserted into the NullAwareHashedRelation, we take 
> the C(3,1) and C(3,2) combinations to copy the original key row, set null at 
> the target positions, and then insert them into the NullAwareHashedRelation. 
> Including the original key row, there will be 7 key rows inserted, as follows.
> (null, 2, 3)
> (1, null, 3)
> (1, 2, null)
> (null, null, 3)
> (null, 2, null)
> (1, null, null)
> (1, 2, 3)
>  
> With the expanded data we can extract a common pattern for both single and 
> multi column; allNull refers to an UnsafeRow whose columns are all null.
>  * buildSide is empty input => return all rows
>  * allNullColumnKey Exists In buildSide input => reject all rows
>  * if streamedSideRow.allNull is true => drop the row
>  * if streamedSideRow.allNull is false & findMatch in NullAwareHashedRelation 
> => drop the row
>  * if streamedSideRow.allNull is false & notFindMatch in 
> NullAwareHashedRelation => return the row
>  
> This solution will surely expand the buildSide data by up to 2^N - 1 times, 
> but since NAAJ in typical production queries involves at most 2~3 columns, I 
> suppose it's acceptable to expand the buildSide data to around 7X. I would 
> also limit the maximum number of columns supported for NAAJ, basically to no 
> more than 3. 
>  
>  
>  
>  
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-32474) NullAwareAntiJoin multi-column support

2020-07-28 Thread Leanken.Lin (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32474?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Leanken.Lin updated SPARK-32474:

Description: 
This is a follow-up improvement of SPARK-32290.

In SPARK-32290, we already optimized NAAJ from BroadcastNestedLoopJoin to 
BroadcastHashJoin, which improves the total calculation from O(M*N) to O(M), 
but it only targets the single-column case, because multi-column support is 
much more complicated.

See [http://www.vldb.org/pvldb/vol2/vldb09-423.pdf], Section 6

 

FYI, the code logic for the single- and multi-column cases is defined at

~/sql/core/src/test/resources/sql-tests/inputs/subquery/in-subquery/not-in-unit-tests-single-column.sql

~/sql/core/src/test/resources/sql-tests/inputs/subquery/in-subquery/not-in-unit-tests-multi-column.sql

 

To support multiple columns, I propose the following idea; let's see whether 
multi-column support is worth the trade-off. I would need to do some data 
expansion in the HashedRelation, and I would call this new type of 
HashedRelation a NullAwareHashedRelation.

 

In a NullAwareHashedRelation, keys with null columns are allowed, which is the 
opposite of LongHashedRelation and UnsafeHashedRelation; and a single key might 
be expanded into 2^N - 1 records (N refers to the column count of the key). 
For example, if a record

(1, 2, 3) is about to be inserted into the NullAwareHashedRelation, we take 
the C(3,1) and C(3,2) combinations to copy the original key row, set null at 
the target positions, and then insert them into the NullAwareHashedRelation. 
Including the original key row, there will be 7 key rows inserted, as follows.

(null, 2, 3)

(1, null, 3)

(1, 2, null)

(null, null, 3)

(null, 2, null)

(1, null, null)

(1, 2, 3)

 

With the expanded data we can extract a common pattern for both single and 
multi column; allNull refers to an UnsafeRow whose columns are all null.
 * buildSide is empty input => return all rows
 * allNullColumnKey Exists In buildSide input => reject all rows
 * if streamedSideRow.allNull is true => drop the row
 * if streamedSideRow.allNull is false & findMatch in NullAwareHashedRelation 
=> drop the row
 * if streamedSideRow.allNull is false & notFindMatch in 
NullAwareHashedRelation => return the row

 

This solution will surely expand the buildSide data by up to 2^N - 1 times, 
but since NAAJ in typical production queries involves at most 2~3 columns, I 
suppose it's acceptable to expand the buildSide data to around 7X. I would 
also limit the maximum number of columns supported for NAAJ, basically to no 
more than 3. 

 

 

 

 

 

 

  was:
This is a follow-up improvement of SPARK-32290.

In SPARK-32290, we already optimized NAAJ from BroadcastNestedLoopJoin to 
BroadcastHashJoin, which improves the total calculation from O(M*N) to O(M), 
but it only targets the single-column case, because multi-column support is 
much more complicated.

See [http://www.vldb.org/pvldb/vol2/vldb09-423.pdf], Section 6

 

NAAJ multi column logical

!image-2020-07-29-12-03-22-554.png!

NAAJ single column logical

!image-2020-07-29-12-03-32-677.png!

To support multiple columns, I propose the following idea; let's see whether 
multi-column support is worth the trade-off. I would need to do some data 
expansion in the HashedRelation, and I would call this new type of 
HashedRelation a NullAwareHashedRelation.

 

In a NullAwareHashedRelation, keys with null columns are allowed, which is the 
opposite of LongHashedRelation and UnsafeHashedRelation; and a single key might 
be expanded into 2^N - 1 records (N refers to the column count of the key). 
For example, if a record

(1, 2, 3) is about to be inserted into the NullAwareHashedRelation, we take 
the C(3,1) and C(3,2) combinations to copy the original key row, set null at 
the target positions, and then insert them into the NullAwareHashedRelation. 
Including the original key row, there will be 7 key rows inserted, as follows.

(null, 2, 3)

(1, null, 3)

(1, 2, null)

(null, null, 3)

(null, 2, null)

(1, null, null)

(1, 2, 3)

 

With the expanded data we can extract a common pattern for both single and 
multi column; allNull refers to an UnsafeRow whose columns are all null.
 * buildSide is empty input => return all rows
 * allNullColumnKey Exists In buildSide input => reject all rows
 * if streamedSideRow.allNull is true => drop the row
 * if streamedSideRow.allNull is false & findMatch in NullAwareHashedRelation 
=> drop the row
 * if streamedSideRow.allNull is false & notFindMatch in 
NullAwareHashedRelation => return the row

 

This solution will surely expand the buildSide data by up to 2^N - 1 times, 
but since NAAJ in typical production queries involves at most 2~3 columns, I 
suppose it's acceptable to expand the buildSide data to around 7X. I would 
also limit the maximum number of columns supported for NAAJ, basically to no 
more than 3. 

 

 

 

 

 

 


> NullAwareAntiJoin multi-column support
> --
>
> Key: 

[jira] [Updated] (SPARK-32474) NullAwareAntiJoin multi-column support

2020-07-28 Thread Leanken.Lin (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32474?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Leanken.Lin updated SPARK-32474:

Description: 
This is a follow-up improvement of SPARK-32290.

In SPARK-32290, we already optimized NAAJ from BroadcastNestedLoopJoin to 
BroadcastHashJoin, which improves the total calculation from O(M*N) to O(M), 
but it only targets the single-column case, because multi-column support is 
much more complicated.

See [http://www.vldb.org/pvldb/vol2/vldb09-423.pdf], Section 6

 

NAAJ multi column logical

!image-2020-07-29-12-03-22-554.png!

NAAJ single column logical

!image-2020-07-29-12-03-32-677.png!

To support multiple columns, I propose the following idea; let's see whether 
multi-column support is worth the trade-off. I would need to do some data 
expansion in the HashedRelation, and I would call this new type of 
HashedRelation a NullAwareHashedRelation.

 

In a NullAwareHashedRelation, keys with null columns are allowed, which is the 
opposite of LongHashedRelation and UnsafeHashedRelation; and a single key might 
be expanded into 2^N - 1 records (N refers to the column count of the key). 
For example, if a record

(1, 2, 3) is about to be inserted into the NullAwareHashedRelation, we take 
the C(3,1) and C(3,2) combinations to copy the original key row, set null at 
the target positions, and then insert them into the NullAwareHashedRelation. 
Including the original key row, there will be 7 key rows inserted, as follows.

(null, 2, 3)

(1, null, 3)

(1, 2, null)

(null, null, 3)

(null, 2, null)

(1, null, null)

(1, 2, 3)

 

With the expanded data we can extract a common pattern for both single and 
multi column; allNull refers to an UnsafeRow whose columns are all null.
 * buildSide is empty input => return all rows
 * allNullColumnKey Exists In buildSide input => reject all rows
 * if streamedSideRow.allNull is true => drop the row
 * if streamedSideRow.allNull is false & findMatch in NullAwareHashedRelation 
=> drop the row
 * if streamedSideRow.allNull is false & notFindMatch in 
NullAwareHashedRelation => return the row

 

This solution will surely expand the buildSide data by up to 2^N - 1 times, 
but since NAAJ in typical production queries involves at most 2~3 columns, I 
suppose it's acceptable to expand the buildSide data to around 7X. I would 
also limit the maximum number of columns supported for NAAJ, basically to no 
more than 3. 

 

 

 

 

 

 

  was:
This is a follow-up improvement of 
[SPARK-32290|https://issues.apache.org/jira/browse/SPARK-32290].

In SPARK-32290, we already optimized NAAJ from BroadcastNestedLoopJoin to 
BroadcastHashJoin, which improves the total calculation from O(M*N) to O(M), 
but it only targets the single-column case, because multi-column support is 
much more complicated.

See http://www.vldb.org/pvldb/vol2/vldb09-423.pdf, Section 6

 

NAAJ multi column logical

!image-2020-07-29-11-41-11-939.png!

NAAJ single column logical

!image-2020-07-29-11-41-03-757.png!

To support multiple columns, I propose the following idea; let's see whether 
multi-column support is worth the trade-off. I would need to do some data 
expansion in the HashedRelation, and I would call this new type of 
HashedRelation a NullAwareHashedRelation.

 

In a NullAwareHashedRelation, keys with null columns are allowed, which is the 
opposite of LongHashedRelation and UnsafeHashedRelation; and a single key might 
be expanded into 2^N - 1 records (N refers to the column count of the key). 
For example, if a record

(1, 2, 3) is about to be inserted into the NullAwareHashedRelation, we take 
the C(3,1) and C(3,2) combinations to copy the original key row, set null at 
the target positions, and then insert them into the NullAwareHashedRelation. 
Including the original key row, there will be 7 key rows inserted, as follows.

(null, 2, 3)

(1, null, 3)

(1, 2, null)

(null, null, 3)

(null, 2, null)

(1, null, null)

(1, 2, 3)

 

With the expanded data we can extract a common pattern for both single and 
multi column; allNull refers to an UnsafeRow whose columns are all null.
 * buildSide is empty input => return all rows
 * allNullColumnKey Exists In buildSide input => reject all rows
 * if streamedSideRow.allNull is true => drop the row
 * if streamedSideRow.allNull is false & findMatch in NullAwareHashedRelation 
=> drop the row
 * if streamedSideRow.allNull is false & notFindMatch in 
NullAwareHashedRelation => return the row

 

This solution will surely expand the buildSide data by up to 2^N - 1 times, 
but since NAAJ in typical production queries involves at most 2~3 columns, I 
suppose it's acceptable to expand the buildSide data to around 7X. I would 
also limit the maximum number of columns supported for NAAJ, basically to no 
more than 3. 

 

 

 

 

 

 


> NullAwareAntiJoin multi-column support
> --
>
> Key: SPARK-32474
> URL: https://issues.apache.org/jira/browse/SPARK-32474
>   

[jira] [Created] (SPARK-32474) NullAwareAntiJoin multi-column support

2020-07-28 Thread Leanken.Lin (Jira)
Leanken.Lin created SPARK-32474:
---

 Summary: NullAwareAntiJoin multi-column support
 Key: SPARK-32474
 URL: https://issues.apache.org/jira/browse/SPARK-32474
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.0.0
Reporter: Leanken.Lin
 Fix For: 3.1.0


This is a follow-up improvement of 
[SPARK-32290|https://issues.apache.org/jira/browse/SPARK-32290].

In SPARK-32290, we already optimized NAAJ from BroadcastNestedLoopJoin to 
BroadcastHashJoin, which improves the total calculation from O(M*N) to O(M), 
but it only targets the single-column case, because multi-column support is 
much more complicated.

See http://www.vldb.org/pvldb/vol2/vldb09-423.pdf, Section 6

 

NAAJ multi column logical

!image-2020-07-29-11-41-11-939.png!

NAAJ single column logical

!image-2020-07-29-11-41-03-757.png!

To support multiple columns, I propose the following idea; let's see whether 
multi-column support is worth the trade-off. I would need to do some data 
expansion in the HashedRelation, and I would call this new type of 
HashedRelation a NullAwareHashedRelation.

 

In a NullAwareHashedRelation, keys with null columns are allowed, which is the 
opposite of LongHashedRelation and UnsafeHashedRelation; and a single key might 
be expanded into 2^N - 1 records (N refers to the column count of the key). 
For example, if a record

(1, 2, 3) is about to be inserted into the NullAwareHashedRelation, we take 
the C(3,1) and C(3,2) combinations to copy the original key row, set null at 
the target positions, and then insert them into the NullAwareHashedRelation. 
Including the original key row, there will be 7 key rows inserted, as follows.

(null, 2, 3)

(1, null, 3)

(1, 2, null)

(null, null, 3)

(null, 2, null)

(1, null, null)

(1, 2, 3)

 

With the expanded data we can extract a common pattern for both single and 
multi column; allNull refers to an UnsafeRow whose columns are all null.
 * buildSide is empty input => return all rows
 * allNullColumnKey Exists In buildSide input => reject all rows
 * if streamedSideRow.allNull is true => drop the row
 * if streamedSideRow.allNull is false & findMatch in NullAwareHashedRelation 
=> drop the row
 * if streamedSideRow.allNull is false & notFindMatch in 
NullAwareHashedRelation => return the row

 

This solution will surely expand the buildSide data by up to 2^N - 1 times, 
but since NAAJ in typical production queries involves at most 2~3 columns, I 
suppose it's acceptable to expand the buildSide data to around 7X. I would 
also limit the maximum number of columns supported for NAAJ, basically to no 
more than 3. 

 

 

 

 

 

 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-32290) NotInSubquery SingleColumn Optimize

2020-07-13 Thread Leanken.Lin (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32290?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Leanken.Lin updated SPARK-32290:

Fix Version/s: 3.0.1

> NotInSubquery SingleColumn Optimize
> ---
>
> Key: SPARK-32290
> URL: https://issues.apache.org/jira/browse/SPARK-32290
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Leanken.Lin
>Priority: Minor
> Fix For: 3.0.1, 3.1.0
>
>   Original Estimate: 24h
>  Remaining Estimate: 24h
>
> Normally, a NotInSubquery will be planned into a BroadcastNestedLoopJoinExec, 
> which is very time consuming. For example, in a recent TPCH benchmark, Query 
> 16 took almost half of the execution time of the entire 22-query TPCH run. So 
> I propose the following optimization.
> Inside BroadcastNestedLoopJoinExec, we can identify a single-column NOT IN 
> subquery by the following pattern.
> {code:java}
> case _@Or(
> _@EqualTo(leftAttr: AttributeReference, rightAttr: 
> AttributeReference),
> _@IsNull(
>   _@EqualTo(_: AttributeReference, _: AttributeReference)
> )
>   )
> {code}
> If the buildSide row count is small enough, we can convert the build-side 
> data into a hash map, so the M*N calculation can be optimized into M*log(N).
> I ran a benchmark on 1TB TPCH: before applying the optimization, Query 16 
> took around 18 minutes to finish; after applying the M*log(N) optimization, 
> it takes only 30s.
> But this optimization only works for single-column NOT IN subqueries, so I am 
> here to seek advice on whether the community needs this update or not. I will 
> open the pull request first; if community members think it's a hack, it's 
> fine to just ignore this request.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-32290) NotInSubquery SingleColumn Optimize

2020-07-13 Thread Leanken.Lin (Jira)
Leanken.Lin created SPARK-32290:
---

 Summary: NotInSubquery SingleColumn Optimize
 Key: SPARK-32290
 URL: https://issues.apache.org/jira/browse/SPARK-32290
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.0.0
Reporter: Leanken.Lin
 Fix For: 3.1.0


Normally, a NotInSubquery will be planned into a BroadcastNestedLoopJoinExec, 
which is very time consuming. For example, in a recent TPCH benchmark, Query 16 
took almost half of the execution time of the entire 22-query TPCH run. So I 
propose the following optimization.

Inside BroadcastNestedLoopJoinExec, we can identify a single-column NOT IN 
subquery by the following pattern.

{code:java}
case _@Or(
_@EqualTo(leftAttr: AttributeReference, rightAttr: 
AttributeReference),
_@IsNull(
  _@EqualTo(_: AttributeReference, _: AttributeReference)
)
  )
{code}

If the buildSide row count is small enough, we can convert the build-side data 
into a hash map, so the M*N calculation can be optimized into M*log(N); a 
minimal sketch of this evaluation follows.
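
Here is a hedged sketch of that evaluation for the single-column case, using 
Option to model nullable values (illustrative only; Spark's actual 
implementation builds a broadcast hashed relation):

{code:java}
// Hypothetical sketch of single-column null-aware NOT IN:
//  - an empty subquery result keeps every row,
//  - a NULL on the build side rejects every row,
//  - a NULL probe against a non-empty build side is "unknown" and dropped,
//  - otherwise keep the row only when the hash set has no match.
def notInFilter(streamed: Iterator[Option[Int]],
                build: Seq[Option[Int]]): Iterator[Option[Int]] = {
  if (build.isEmpty) streamed
  else if (build.contains(None)) Iterator.empty
  else {
    val keys = build.flatten.toSet   // O(N) build, constant-time probes
    streamed.filter {
      case None        => false
      case Some(value) => !keys.contains(value)
    }
  }
}
{code}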

I ran a benchmark on 1TB TPCH: before applying the optimization, Query 16 took 
around 18 minutes to finish; after applying the M*log(N) optimization, it 
takes only 30s.

But this optimization only works for single-column NOT IN subqueries, so I am 
here to seek advice on whether the community needs this update or not. I will 
open the pull request first; if community members think it's a hack, it's fine 
to just ignore this request.




--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-28050) DataFrameWriter support insertInto a specific table partition

2019-06-14 Thread Leanken.Lin (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28050?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Leanken.Lin updated SPARK-28050:

Description: 
{code:java}
// Create a partitioned test table
val ptTableName = "mc_test_pt_table"
sql(s"CREATE TABLE ${ptTableName} (name STRING, num BIGINT) PARTITIONED BY (pt1 
STRING, pt2 STRING)")

val df = spark.sparkContext.parallelize(0 to 99, 2)
  .map(f => (s"name-$f", f))
  .toDF("name", "num") // toDF requires: import spark.implicits._

// If I want to insert df into a specific partition,
// say pt1='2018', pt2='0601', the current API does not support it;
// the only workaround is the following:
df.createOrReplaceTempView(s"${ptTableName}_tmp_view")
sql(s"insert into table ${ptTableName} partition (pt1='2018', pt2='0601') 
select * from ${ptTableName}_tmp_view")
{code}

I propose adding another API to DataFrameWriter that can do something like:

{code:java}
df.write.insertInto(ptTableName, "pt1='2018',pt2='0601'")
{code}


We have many scenarios like this in our production environment; providing 
an API like this would make things much less painful.

  was:
```
val ptTableName = "mc_test_pt_table"
sql(s"CREATE TABLE ${ptTableName} (name STRING, num BIGINT) PARTITIONED BY (pt1 
STRING, pt2 STRING)")

val df = spark.sparkContext.parallelize(0 to 99, 2)
  .map(f => (s"name-$f", f))
  .toDF("name", "num") // toDF requires: import spark.implicits._

// If I want to insert df into a specific partition,
// say pt1='2018', pt2='0601', the current API does not support it;
// the only workaround is the following:
df.createOrReplaceTempView(s"${ptTableName}_tmp_view")
sql(s"insert into table ${ptTableName} partition (pt1='2018', pt2='0601') 
select * from ${ptTableName}_tmp_view")
```

I propose adding another API to DataFrameWriter that can do something like:

```
df.write.insertInto(ptTableName, "pt1='2018',pt2='0601'")
```

We have many scenarios like this in our production environment; providing 
an API like this would make things much less painful.


> DataFrameWriter support insertInto a specific table partition
> -
>
> Key: SPARK-28050
> URL: https://issues.apache.org/jira/browse/SPARK-28050
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 2.3.3, 2.4.3
>Reporter: Leanken.Lin
>Priority: Minor
> Fix For: 2.3.3, 2.4.3
>
>
> {code:java}
> // Create a partitioned test table
> val ptTableName = "mc_test_pt_table"
> sql(s"CREATE TABLE ${ptTableName} (name STRING, num BIGINT) PARTITIONED BY 
> (pt1 STRING, pt2 STRING)")
> val df = spark.sparkContext.parallelize(0 to 99, 2)
>   .map(f => (s"name-$f", f))
>   .toDF("name", "num") // toDF requires: import spark.implicits._
> // If I want to insert df into a specific partition,
> // say pt1='2018', pt2='0601', the current API does not support it;
> // the only workaround is the following:
> df.createOrReplaceTempView(s"${ptTableName}_tmp_view")
> sql(s"insert into table ${ptTableName} partition (pt1='2018', pt2='0601') 
> select * from ${ptTableName}_tmp_view")
> {code}
> I propose adding another API to DataFrameWriter that can do something like:
> {code:java}
> df.write.insertInto(ptTableName, "pt1='2018',pt2='0601'")
> {code}
> We have many scenarios like this in our production environment; providing 
> an API like this would make things much less painful.
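
As an aside, when dynamic partition insert is enabled, there is a second 
workaround that avoids the temp view. This is a sketch under the assumption 
that dynamic partitioning is allowed (for Hive tables, 
hive.exec.dynamic.partition.mode=nonstrict) and that the partition columns 
come last in the table schema:

{code:java}
// Sketch of an alternative workaround, assuming dynamic partition
// insert is enabled. Append the target partition values as literal
// columns, then use the position-based insertInto; column order must
// match the table schema (name, num, pt1, pt2).
import org.apache.spark.sql.functions.lit

df.withColumn("pt1", lit("2018"))
  .withColumn("pt2", lit("0601"))
  .write
  .insertInto(ptTableName)
{code}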



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-28050) DataFrameWriter support insertInto a specific table partition

2019-06-14 Thread Leanken.Lin (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-28050?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Leanken.Lin updated SPARK-28050:

Description: 
```
val ptTableName = "mc_test_pt_table"
sql(s"CREATE TABLE ${ptTableName} (name STRING, num BIGINT) PARTITIONED BY (pt1 
STRING, pt2 STRING)")

val df = spark.sparkContext.parallelize(0 to 99, 2)
  .map(f => (s"name-$f", f))
  .toDF("name", "num") // toDF requires: import spark.implicits._

// If I want to insert df into a specific partition,
// say pt1='2018', pt2='0601', the current API does not support it;
// the only workaround is the following:
df.createOrReplaceTempView(s"${ptTableName}_tmp_view")
sql(s"insert into table ${ptTableName} partition (pt1='2018', pt2='0601') 
select * from ${ptTableName}_tmp_view")
```

I propose adding another API to DataFrameWriter that can do something like:

```
df.write.insertInto(ptTableName, "pt1='2018',pt2='0601'")
```

We have many scenarios like this in our production environment; providing 
an API like this would make things much less painful.

  was:
val ptTableName = "mc_test_pt_table"
sql(s"CREATE TABLE ${ptTableName} (name STRING, num BIGINT) PARTITIONED BY (pt1 
STRING, pt2 STRING)")

val df = spark.sparkContext.parallelize(0 to 99, 2)
  .map(f => (s"name-$f", f))
  .toDF("name", "num") // toDF requires: import spark.implicits._

// If I want to insert df into a specific partition,
// say pt1='2018', pt2='0601', the current API does not support it;
// the only workaround is the following:
df.createOrReplaceTempView(s"${ptTableName}_tmp_view")
sql(s"insert into table ${ptTableName} partition (pt1='2018', pt2='0601') 
select * from ${ptTableName}_tmp_view")

I propose adding another API to DataFrameWriter that can do something like:
df.write.insertInto(ptTableName, "pt1='2018',pt2='0601'")

We have many scenarios like this in our production environment; providing 
an API like this would make things much less painful.


> DataFrameWriter support insertInto a specific table partition
> -
>
> Key: SPARK-28050
> URL: https://issues.apache.org/jira/browse/SPARK-28050
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 2.3.3, 2.4.3
>Reporter: Leanken.Lin
>Priority: Minor
> Fix For: 2.3.3, 2.4.3
>
>
> ```
> val ptTableName = "mc_test_pt_table"
> sql(s"CREATE TABLE ${ptTableName} (name STRING, num BIGINT) PARTITIONED BY 
> (pt1 STRING, pt2 STRING)")
> val df = spark.sparkContext.parallelize(0 to 99, 2)
>   .map(f => (s"name-$f", f))
>   .toDF("name", "num") // toDF requires: import spark.implicits._
> // If I want to insert df into a specific partition,
> // say pt1='2018', pt2='0601', the current API does not support it;
> // the only workaround is the following:
> df.createOrReplaceTempView(s"${ptTableName}_tmp_view")
> sql(s"insert into table ${ptTableName} partition (pt1='2018', pt2='0601') 
> select * from ${ptTableName}_tmp_view")
> ```
> I propose adding another API to DataFrameWriter that can do something like:
> ```
> df.write.insertInto(ptTableName, "pt1='2018',pt2='0601'")
> ```
> We have many scenarios like this in our production environment; providing 
> an API like this would make things much less painful.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-28050) DataFrameWriter support insertInto a specific table partition

2019-06-14 Thread Leanken.Lin (JIRA)
Leanken.Lin created SPARK-28050:
---

 Summary: DataFrameWriter support insertInto a specific table 
partition
 Key: SPARK-28050
 URL: https://issues.apache.org/jira/browse/SPARK-28050
 Project: Spark
  Issue Type: New Feature
  Components: SQL
Affects Versions: 2.4.3, 2.3.3
Reporter: Leanken.Lin
 Fix For: 2.4.3, 2.3.3


val ptTableName = "mc_test_pt_table"
sql(s"CREATE TABLE ${ptTableName} (name STRING, num BIGINT) PARTITIONED BY (pt1 
STRING, pt2 STRING)")

val df = spark.sparkContext.parallelize(0 to 99, 2)
  .map(f => (s"name-$f", f))
  .toDF("name", "num") // toDF requires: import spark.implicits._

// If I want to insert df into a specific partition,
// say pt1='2018', pt2='0601', the current API does not support it;
// the only workaround is the following:
df.createOrReplaceTempView(s"${ptTableName}_tmp_view")
sql(s"insert into table ${ptTableName} partition (pt1='2018', pt2='0601') 
select * from ${ptTableName}_tmp_view")

I propose adding another API to DataFrameWriter that can do something like:
df.write.insertInto(ptTableName, "pt1='2018',pt2='0601'")
df.write.insertInto(ptTableName, "pt1='2018',pt2='0601'")

We have many scenarios like this in our production environment; providing 
an API like this would make things much less painful.
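
Purely as a sketch, a hypothetical helper (not an existing Spark API) shows 
what the proposed two-argument insertInto could desugar to, building on the 
temp-view workaround above and assuming a spec string of the form 
pt1='2018',pt2='0601':

{code:java}
// Hypothetical helper (NOT an existing Spark API) sketching what the
// proposed df.write.insertInto(table, partitionSpec) might desugar to.
import org.apache.spark.sql.DataFrame

def insertIntoPartition(df: DataFrame, table: String,
                        partitionSpec: String): Unit = {
  val tmpView = s"${table}_tmp_view"
  df.createOrReplaceTempView(tmpView)
  // "pt1='2018',pt2='0601'" -> "pt1='2018', pt2='0601'"
  val spec = partitionSpec.split(",").map(_.trim).mkString(", ")
  df.sparkSession.sql(
    s"INSERT INTO TABLE $table PARTITION ($spec) SELECT * FROM $tmpView")
}

// Usage, matching the proposal:
// insertIntoPartition(df, ptTableName, "pt1='2018',pt2='0601'")
{code}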



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org