[jira] [Created] (SPARK-41200) BytesToBytesMap's longArray size can be up to MAX_CAPACITY

2022-11-18 Thread EdisonWang (Jira)
EdisonWang created SPARK-41200:
--

 Summary: BytesToBytesMap's longArray size can be up to MAX_CAPACITY
 Key: SPARK-41200
 URL: https://issues.apache.org/jira/browse/SPARK-41200
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 3.3.0
Reporter: EdisonWang









[jira] [Created] (SPARK-40035) Avoid applying the filter twice when listing files

2022-08-10 Thread EdisonWang (Jira)
EdisonWang created SPARK-40035:
--

 Summary: Avoid applying the filter twice when listing files
 Key: SPARK-40035
 URL: https://issues.apache.org/jira/browse/SPARK-40035
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.4.0
Reporter: EdisonWang









[jira] [Created] (SPARK-39476) Disable unwrap cast optimization when casting from Long to Float/Double or from Integer to Float

2022-06-14 Thread EdisonWang (Jira)
EdisonWang created SPARK-39476:
--

 Summary: Disable unwrap cast optimization when casting from Long to 
Float/Double or from Integer to Float
 Key: SPARK-39476
 URL: https://issues.apache.org/jira/browse/SPARK-39476
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.2.1, 3.2.0, 3.1.2, 3.1.1, 3.1.0, 3.3.0
Reporter: EdisonWang









[jira] [Updated] (SPARK-39249) Improve subexpression elimination for conditional expressions

2022-05-21 Thread EdisonWang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-39249?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

EdisonWang updated SPARK-39249:
---
Description: Currently we can do subexpression elimination for conditional 
expressions when the subexpression is common across all {{branchGroups}}. 
In fact, we can further improve this when there are common expressions between 
{{alwaysEvaluatedInputs}} and {{branchGroups}}.
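
A minimal sketch of the case this targets (hypothetical query, not taken from 
the ticket): the CASE condition is always evaluated and the THEN branch repeats 
the same expression, so a single common subexpression could serve both.

{code:scala}
import org.apache.spark.sql.SparkSession

object ConditionalSubexprExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("subexpr").master("local[*]").getOrCreate()
    spark.range(10).createOrReplaceTempView("t")
    // The WHEN condition (always evaluated) and the THEN branch share
    // length(concat(cast(id AS string), 'x')), so it only needs to be computed once.
    spark.sql(
      """SELECT CASE WHEN length(concat(cast(id AS string), 'x')) > 1
        |            THEN length(concat(cast(id AS string), 'x'))
        |            ELSE 0 END AS v
        |FROM t""".stripMargin).show()
    spark.stop()
  }
}
{code}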

> Improve subexpression elimination for conditional expressions
> -
>
> Key: SPARK-39249
> URL: https://issues.apache.org/jira/browse/SPARK-39249
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.4.0
>Reporter: EdisonWang
>Priority: Minor
>
> Currently we can do subexpression elimination for conditional expressions 
> when the subexpression is common across all {{branchGroups}}. In fact, we 
> can further improve this when there are common expressions between 
> {{alwaysEvaluatedInputs}} and {{branchGroups}}.






[jira] [Created] (SPARK-39249) Improve subexpression elimination for conditional expressions

2022-05-21 Thread EdisonWang (Jira)
EdisonWang created SPARK-39249:
--

 Summary: Improve subexpression elimination for conditional 
expressions
 Key: SPARK-39249
 URL: https://issues.apache.org/jira/browse/SPARK-39249
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.4.0
Reporter: EdisonWang









[jira] [Created] (SPARK-39002) StringEndsWith/Contains support push down to Parquet so that we can leverage dictionary filter

2022-04-23 Thread EdisonWang (Jira)
EdisonWang created SPARK-39002:
--

 Summary: StringEndsWith/Contains support push down to Parquet so 
that we can leverage dictionary filter
 Key: SPARK-39002
 URL: https://issues.apache.org/jira/browse/SPARK-39002
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.3.0
Reporter: EdisonWang


Support pushing down StringEndsWith/StringContains to Parquet so that we can 
leverage Parquet dictionary filtering.
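
A minimal sketch of queries whose filters translate to StringEndsWith and 
StringContains (the file path and column name below are illustrative 
assumptions, not from the ticket):

{code:scala}
import org.apache.spark.sql.SparkSession

object EndsWithContainsExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("pushdown").master("local[*]").getOrCreate()
    import spark.implicits._

    // Assumed location and schema, purely for illustration.
    Seq("event_click", "event_view", "heartbeat").toDF("name")
      .write.mode("overwrite").parquet("/tmp/t")

    val df = spark.read.parquet("/tmp/t")
    df.filter($"name".endsWith("_view")).show()   // maps to the StringEndsWith source filter
    df.filter($"name".contains("event")).show()   // maps to the StringContains source filter
    // df.explain(true) shows whether the predicate appears under PushedFilters.
    spark.stop()
  }
}
{code}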






[jira] [Created] (SPARK-38160) Shuffle by rand could lead to incorrect answers when ShuffleFetchFailed happened

2022-02-09 Thread EdisonWang (Jira)
EdisonWang created SPARK-38160:
--

 Summary: Shuffle by rand could lead to incorrect answers when 
ShuffleFetchFailed happened
 Key: SPARK-38160
 URL: https://issues.apache.org/jira/browse/SPARK-38160
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.3.0
Reporter: EdisonWang


When we shuffle on indeterminate expressions such as rand and a 
ShuffleFetchFailed occurs, we may get an incorrect result since only the 
failed map tasks are retried.

We propose to fix this by retrying all upstream map tasks in this situation.
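
A hypothetical illustration (not from the ticket) of the problematic pattern, 
a shuffle keyed by a nondeterministic expression:

{code:scala}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.rand

object ShuffleByRandExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("rand-shuffle").master("local[*]").getOrCreate()
    // Roughly the same shape as "SELECT ... FROM t DISTRIBUTE BY rand()".
    // If only some map tasks are re-run after a fetch failure, rand() produces
    // different values on retry, so rows can land on different reducers and
    // the final result may be wrong.
    val count = spark.range(0, 1000000).repartition(200, rand()).count()
    println(count)
    spark.stop()
  }
}
{code}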






[jira] [Updated] (SPARK-37593) Optimize HeapMemoryAllocator to avoid memory waste when using G1GC

2021-12-15 Thread EdisonWang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37593?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

EdisonWang updated SPARK-37593:
---
Description: 
Spark's Tungsten memory model usually allocates memory one `page` at a time, 
backed by a long[pageSizeBytes/8] array in HeapMemoryAllocator.allocate. 

Remember that a Java long array needs an extra object header (usually 16 bytes 
on a 64-bit system), so the actual allocation is pageSize + 16 bytes.

Assume that G1HeapRegionSize is 4M and pageSizeBytes is 4M as well. Since every 
allocation then needs 4M + 16 bytes, two regions are used, and the second 
region holds only 16 bytes, so about 50% of the memory is wasted.
This can happen under various combinations of G1HeapRegionSize (from 1M to 
32M) and pageSizeBytes (from 1M to 64M).

  was:
As we may know, G1GC treats an allocation larger than 50% of the region size 
as a humongous allocation.

Spark's Tungsten memory model usually allocates memory one `page` at a time, 
backed by a long[pageSizeBytes/8] array in HeapMemoryAllocator.allocate. 

Remember that a Java long array needs an extra object header (usually 16 bytes 
on a 64-bit system), so the actual allocation is pageSize + 16 bytes.

Assume that G1HeapRegionSize is 4M and pageSizeBytes is 4M as well. Since every 
allocation then needs 4M + 16 bytes, two regions are used, and the second 
region holds only 16 bytes, so about 50% of the memory is wasted.
This can happen under various combinations of G1HeapRegionSize (from 1M to 
32M) and pageSizeBytes (from 1M to 64M).


> Optimize HeapMemoryAllocator to avoid memory waste when using G1GC
> --
>
> Key: SPARK-37593
> URL: https://issues.apache.org/jira/browse/SPARK-37593
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core, SQL
>Affects Versions: 3.3.0
>Reporter: EdisonWang
>Priority: Minor
> Fix For: 3.3.0
>
>
> Spark's Tungsten memory model usually allocates memory one `page` at a time, 
> backed by a long[pageSizeBytes/8] array in HeapMemoryAllocator.allocate. 
> Remember that a Java long array needs an extra object header (usually 16 bytes 
> on a 64-bit system), so the actual allocation is pageSize + 16 bytes.
> Assume that G1HeapRegionSize is 4M and pageSizeBytes is 4M as well. Since every 
> allocation then needs 4M + 16 bytes, two regions are used, and the second 
> region holds only 16 bytes, so about 50% of the memory is wasted.
> This can happen under various combinations of G1HeapRegionSize (from 1M to 
> 32M) and pageSizeBytes (from 1M to 64M).






[jira] [Updated] (SPARK-37593) Optimize HeapMemoryAllocator to avoid memory waste when using G1GC

2021-12-15 Thread EdisonWang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37593?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

EdisonWang updated SPARK-37593:
---
Summary: Optimize HeapMemoryAllocator to avoid memory waste when using G1GC 
 (was: Optimize HeapMemoryAllocator to avoid memory waste in humongous 
allocation when using G1GC)

> Optimize HeapMemoryAllocator to avoid memory waste when using G1GC
> --
>
> Key: SPARK-37593
> URL: https://issues.apache.org/jira/browse/SPARK-37593
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core, SQL
>Affects Versions: 3.3.0
>Reporter: EdisonWang
>Priority: Minor
> Fix For: 3.3.0
>
>
> As we may know, G1GC treats an allocation larger than 50% of the region size 
> as a humongous allocation.
> Spark's Tungsten memory model usually allocates memory one `page` at a time, 
> backed by a long[pageSizeBytes/8] array in HeapMemoryAllocator.allocate. 
> Remember that a Java long array needs an extra object header (usually 16 bytes 
> on a 64-bit system), so the actual allocation is pageSize + 16 bytes.
> Assume that G1HeapRegionSize is 4M and pageSizeBytes is 4M as well. Since every 
> allocation then needs 4M + 16 bytes, two regions are used, and the second 
> region holds only 16 bytes, so about 50% of the memory is wasted.
> This can happen under various combinations of G1HeapRegionSize (from 1M to 
> 32M) and pageSizeBytes (from 1M to 64M).






[jira] [Created] (SPARK-37593) Optimize HeapMmeoryAllocator to avoid memory waste in humongous allocation when using G1GC

2021-12-09 Thread EdisonWang (Jira)
EdisonWang created SPARK-37593:
--

 Summary: Optimize HeapMmeoryAllocator to avoid memory waste in 
humongous allocation when using G1GC
 Key: SPARK-37593
 URL: https://issues.apache.org/jira/browse/SPARK-37593
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core, SQL
Affects Versions: 3.3.0
Reporter: EdisonWang
 Fix For: 3.3.0


As we may know, G1GC treats an allocation larger than 50% of the region size 
as a humongous allocation.

Spark's Tungsten memory model usually allocates memory one `page` at a time, 
backed by a long[pageSizeBytes/8] array in HeapMemoryAllocator.allocate. 

Remember that a Java long array needs an extra object header (usually 16 bytes 
on a 64-bit system), so the actual allocation is pageSize + 16 bytes.

Assume that G1HeapRegionSize is 4M and pageSizeBytes is 4M as well. Since every 
allocation then needs 4M + 16 bytes, two regions are used, and the second 
region holds only 16 bytes, so about 50% of the memory is wasted.
This can happen under various combinations of G1HeapRegionSize (from 1M to 
32M) and pageSizeBytes (from 1M to 64M).
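
A back-of-the-envelope sketch of the arithmetic above (standalone code; the 
16-byte array header is an assumption for a typical 64-bit JVM):

{code:scala}
object G1WasteEstimate {
  def main(args: Array[String]): Unit = {
    val mb = 1024L * 1024L
    val g1RegionSize  = 4 * mb   // -XX:G1HeapRegionSize=4m
    val pageSizeBytes = 4 * mb   // Spark page size
    val headerBytes   = 16L      // long[] object header on a 64-bit JVM (assumed)

    // A page is backed by long[pageSizeBytes / 8], so the real allocation is
    // pageSizeBytes + headerBytes, which no longer fits in a single region.
    val allocated   = pageSizeBytes + headerBytes
    val regionsUsed = (allocated + g1RegionSize - 1) / g1RegionSize  // ceiling division
    val wasted      = regionsUsed * g1RegionSize - allocated
    println(s"regions used = $regionsUsed, wasted bytes = $wasted")  // 2 regions, ~4MB wasted
  }
}
{code}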






[jira] [Updated] (SPARK-37593) Optimize HeapMemoryAllocator to avoid memory waste in humongous allocation when using G1GC

2021-12-09 Thread EdisonWang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-37593?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

EdisonWang updated SPARK-37593:
---
Summary: Optimize HeapMemoryAllocator to avoid memory waste in humongous 
allocation when using G1GC  (was: Optimize HeapMmeoryAllocator to avoid memory 
waste in humongous allocation when using G1GC)

> Optimize HeapMemoryAllocator to avoid memory waste in humongous allocation 
> when using G1GC
> --
>
> Key: SPARK-37593
> URL: https://issues.apache.org/jira/browse/SPARK-37593
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core, SQL
>Affects Versions: 3.3.0
>Reporter: EdisonWang
>Priority: Minor
> Fix For: 3.3.0
>
>
> As we may know, G1GC treats an allocation larger than 50% of the region size 
> as a humongous allocation.
> Spark's Tungsten memory model usually allocates memory one `page` at a time, 
> backed by a long[pageSizeBytes/8] array in HeapMemoryAllocator.allocate. 
> Remember that a Java long array needs an extra object header (usually 16 bytes 
> on a 64-bit system), so the actual allocation is pageSize + 16 bytes.
> Assume that G1HeapRegionSize is 4M and pageSizeBytes is 4M as well. Since every 
> allocation then needs 4M + 16 bytes, two regions are used, and the second 
> region holds only 16 bytes, so about 50% of the memory is wasted.
> This can happen under various combinations of G1HeapRegionSize (from 1M to 
> 32M) and pageSizeBytes (from 1M to 64M).






[jira] [Updated] (SPARK-34819) MapType supports orderable semantics

2021-03-22 Thread EdisonWang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34819?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

EdisonWang updated SPARK-34819:
---
Issue Type: New Feature  (was: Bug)

> MapType supports orderable semantics
> 
>
> Key: SPARK-34819
> URL: https://issues.apache.org/jira/browse/SPARK-34819
> Project: Spark
>  Issue Type: New Feature
>  Components: SQL
>Affects Versions: 3.1.1
>Reporter: EdisonWang
>Priority: Minor
>
> Comparable/orderable semantics for map types are useful in some scenarios, and 
> they are already implemented in Hive/Presto. 






[jira] [Created] (SPARK-34819) MapType supports orderable semantics

2021-03-22 Thread EdisonWang (Jira)
EdisonWang created SPARK-34819:
--

 Summary: MapType supports orderable semantics
 Key: SPARK-34819
 URL: https://issues.apache.org/jira/browse/SPARK-34819
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.1.1
Reporter: EdisonWang


Comparable/orderable semantics for map types are useful in some scenarios, and 
they are already implemented in Hive/Presto. 






[jira] [Updated] (SPARK-34634) Self-join with script transformation failed to resolve attribute correctly

2021-03-05 Thread EdisonWang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34634?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

EdisonWang updated SPARK-34634:
---
Affects Version/s: 2.4.0
   2.4.1
   2.4.2
   2.4.3
   2.4.4
   2.4.5
   2.4.6
   2.4.7

> Self-join with script transformation failed to resolve attribute correctly
> --
>
> Key: SPARK-34634
> URL: https://issues.apache.org/jira/browse/SPARK-34634
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.4.0, 2.4.1, 2.4.2, 2.4.3, 2.4.4, 2.4.5, 2.4.6, 2.4.7, 
> 3.0.0, 3.0.1, 3.0.2, 3.1.0, 3.1.1
>Reporter: EdisonWang
>Assignee: Apache Spark
>Priority: Minor
>
> To reproduce,
> {code:java}
> // code placeholder
> create temporary view t as select * from values 0, 1, 2 as t(a);
> WITH temp AS (
> SELECT TRANSFORM(a) USING 'cat' AS (b string) FROM t
> )
> SELECT t1.b FROM temp t1 JOIN temp t2 ON t1.b = t2.b
> {code}
>  
> Spark will throw AnalysisException
>  






[jira] [Updated] (SPARK-34634) Self-join with script transformation failed to resolve attribute correctly

2021-03-04 Thread EdisonWang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34634?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

EdisonWang updated SPARK-34634:
---
Description: 
To reproduce,

```

create temporary view t as select * from values 0, 1, 2 as t(a);

WITH temp AS (
 SELECT TRANSFORM(a) USING 'cat' AS (b string) FROM t
)
SELECT t1.b FROM temp t1 JOIN temp t2 ON t1.b = t2.b

```,

 

Spark will throw AnalysisException

 

> Self-join with script transformation failed to resolve attribute correctly
> --
>
> Key: SPARK-34634
> URL: https://issues.apache.org/jira/browse/SPARK-34634
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0, 3.0.1, 3.0.2, 3.1.0, 3.1.1
>Reporter: EdisonWang
>Assignee: Apache Spark
>Priority: Minor
>
> To reproduce,
> ```
> create temporary view t as select * from values 0, 1, 2 as t(a);
> WITH temp AS (
>  SELECT TRANSFORM(a) USING 'cat' AS (b string) FROM t
> )
> SELECT t1.b FROM temp t1 JOIN temp t2 ON t1.b = t2.b
> ```,
>  
> Spark will throw AnalysisException
>  






[jira] [Updated] (SPARK-34634) Self-join with script transformation failed to resolve attribute correctly

2021-03-04 Thread EdisonWang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34634?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

EdisonWang updated SPARK-34634:
---
Description: 
To reproduce,
{code:java}
// code placeholder
create temporary view t as select * from values 0, 1, 2 as t(a);
WITH temp AS (
SELECT TRANSFORM(a) USING 'cat' AS (b string) FROM t
)
SELECT t1.b FROM temp t1 JOIN temp t2 ON t1.b = t2.b

{code}
 

Spark will throw AnalysisException

 

  was:
To reproduce,
{code:java}
// code placeholder
create temporary view t as select * from values 0, 1, 2 as t(a);
WITH temp AS (
 SELECT TRANSFORM(a) USING 'cat' AS (b string) FROM t
 )
 SELECT t1.b FROM temp t1 JOIN temp t2 ON t1.b = t2.b

{code}
```

 

```,

 

Spark will throw AnalysisException

 


> Self-join with script transformation failed to resolve attribute correctly
> --
>
> Key: SPARK-34634
> URL: https://issues.apache.org/jira/browse/SPARK-34634
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0, 3.0.1, 3.0.2, 3.1.0, 3.1.1
>Reporter: EdisonWang
>Assignee: Apache Spark
>Priority: Minor
>
> To reproduce,
> {code:java}
> // code placeholder
> create temporary view t as select * from values 0, 1, 2 as t(a);
> WITH temp AS (
> SELECT TRANSFORM(a) USING 'cat' AS (b string) FROM t
> )
> SELECT t1.b FROM temp t1 JOIN temp t2 ON t1.b = t2.b
> {code}
>  
> Spark will throw AnalysisException
>  






[jira] [Updated] (SPARK-34634) Self-join with script transformation failed to resolve attribute correctly

2021-03-04 Thread EdisonWang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-34634?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

EdisonWang updated SPARK-34634:
---
Description: 
To reproduce,
{code:java}
// code placeholder
create temporary view t as select * from values 0, 1, 2 as t(a);
WITH temp AS (
 SELECT TRANSFORM(a) USING 'cat' AS (b string) FROM t
 )
 SELECT t1.b FROM temp t1 JOIN temp t2 ON t1.b = t2.b

{code}
```

 

```,

 

Spark will throw AnalysisException

 

  was:
To reproduce,

```

create temporary view t as select * from values 0, 1, 2 as t(a);

WITH temp AS (
 SELECT TRANSFORM(a) USING 'cat' AS (b string) FROM t
)
SELECT t1.b FROM temp t1 JOIN temp t2 ON t1.b = t2.b

```,

 

Spark will throw AnalysisException

 


> Self-join with script transformation failed to resolve attribute correctly
> --
>
> Key: SPARK-34634
> URL: https://issues.apache.org/jira/browse/SPARK-34634
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0, 3.0.1, 3.0.2, 3.1.0, 3.1.1
>Reporter: EdisonWang
>Assignee: Apache Spark
>Priority: Minor
>
> To reproduce,
> {code:java}
> // code placeholder
> create temporary view t as select * from values 0, 1, 2 as t(a);
> WITH temp AS (
>  SELECT TRANSFORM(a) USING 'cat' AS (b string) FROM t
>  )
>  SELECT t1.b FROM temp t1 JOIN temp t2 ON t1.b = t2.b
> {code}
> ```
>  
> ```,
>  
> Spark will throw AnalysisException
>  






[jira] [Created] (SPARK-34634) Self-join with script transformation failed to resolve attribute correctly

2021-03-04 Thread EdisonWang (Jira)
EdisonWang created SPARK-34634:
--

 Summary: Self-join with script transformation failed to resolve 
attribute correctly
 Key: SPARK-34634
 URL: https://issues.apache.org/jira/browse/SPARK-34634
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.1.1, 3.1.0, 3.0.2, 3.0.1, 3.0.0
Reporter: EdisonWang









[jira] [Created] (SPARK-34633) Self-join with script transformation failed to resolve attribute correctly

2021-03-04 Thread EdisonWang (Jira)
EdisonWang created SPARK-34633:
--

 Summary: Self-join with script transformation failed to resolve 
attribute correctly
 Key: SPARK-34633
 URL: https://issues.apache.org/jira/browse/SPARK-34633
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.1.1, 3.1.0, 3.0.2, 3.0.1, 3.0.0
Reporter: EdisonWang









[jira] [Updated] (SPARK-33306) TimezoneID is needed when there is a cast from Date to String

2020-10-31 Thread EdisonWang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33306?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

EdisonWang updated SPARK-33306:
---
Description: 
A simple way to reproduce this is 

```

spark-shell --conf spark.sql.legacy.typeCoercion.datetimeToString.enabled

scala> sql("""

select a.d1 from
 (select to_date(concat('2000-01-0', id)) as d1 from range(1, 2)) a
 join
 (select concat('2000-01-0', id) as d2 from range(1, 2)) b
 on a.d1 = b.d2

""").show

```

 

it will throw

```

java.util.NoSuchElementException: None.get
 at scala.None$.get(Option.scala:529)
 at scala.None$.get(Option.scala:527)
 at 
org.apache.spark.sql.catalyst.expressions.TimeZoneAwareExpression.zoneId(datetimeExpressions.scala:56)
 at 
org.apache.spark.sql.catalyst.expressions.TimeZoneAwareExpression.zoneId$(datetimeExpressions.scala:56)
 at 
org.apache.spark.sql.catalyst.expressions.CastBase.zoneId$lzycompute(Cast.scala:253)
 at org.apache.spark.sql.catalyst.expressions.CastBase.zoneId(Cast.scala:253)
 at 
org.apache.spark.sql.catalyst.expressions.CastBase.dateFormatter$lzycompute(Cast.scala:287)
 at 
org.apache.spark.sql.catalyst.expressions.CastBase.dateFormatter(Cast.scala:287)

```

  was:
A simple way to reproduce this is 

```

spark-shell --conf spark.sql.legacy.typeCoercion.datetimeToString.enabled

>> sql("""

select a.d1 from
(select to_date(concat('2000-01-0', id)) as d1 from range(1, 2)) a
join
(select concat('2000-01-0', id) as d2 from range(1, 2)) b
on a.d1 = b.d2

""").show

```

 

it will throw

```

java.util.NoSuchElementException: None.get
 at scala.None$.get(Option.scala:529)
 at scala.None$.get(Option.scala:527)
 at 
org.apache.spark.sql.catalyst.expressions.TimeZoneAwareExpression.zoneId(datetimeExpressions.scala:56)
 at 
org.apache.spark.sql.catalyst.expressions.TimeZoneAwareExpression.zoneId$(datetimeExpressions.scala:56)
 at 
org.apache.spark.sql.catalyst.expressions.CastBase.zoneId$lzycompute(Cast.scala:253)
 at org.apache.spark.sql.catalyst.expressions.CastBase.zoneId(Cast.scala:253)
 at 
org.apache.spark.sql.catalyst.expressions.CastBase.dateFormatter$lzycompute(Cast.scala:287)
 at 
org.apache.spark.sql.catalyst.expressions.CastBase.dateFormatter(Cast.scala:287)

```


> TimezoneID is needed when there is a cast from Date to String
> 
>
> Key: SPARK-33306
> URL: https://issues.apache.org/jira/browse/SPARK-33306
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: EdisonWang
>Priority: Major
>
> A simple way to reproduce this is 
> ```
> spark-shell --conf spark.sql.legacy.typeCoercion.datetimeToString.enabled
> scala> sql("""
> select a.d1 from
>  (select to_date(concat('2000-01-0', id)) as d1 from range(1, 2)) a
>  join
>  (select concat('2000-01-0', id) as d2 from range(1, 2)) b
>  on a.d1 = b.d2
> """).show
> ```
>  
> it will throw
> ```
> java.util.NoSuchElementException: None.get
>  at scala.None$.get(Option.scala:529)
>  at scala.None$.get(Option.scala:527)
>  at 
> org.apache.spark.sql.catalyst.expressions.TimeZoneAwareExpression.zoneId(datetimeExpressions.scala:56)
>  at 
> org.apache.spark.sql.catalyst.expressions.TimeZoneAwareExpression.zoneId$(datetimeExpressions.scala:56)
>  at 
> org.apache.spark.sql.catalyst.expressions.CastBase.zoneId$lzycompute(Cast.scala:253)
>  at org.apache.spark.sql.catalyst.expressions.CastBase.zoneId(Cast.scala:253)
>  at 
> org.apache.spark.sql.catalyst.expressions.CastBase.dateFormatter$lzycompute(Cast.scala:287)
>  at 
> org.apache.spark.sql.catalyst.expressions.CastBase.dateFormatter(Cast.scala:287)
> ```






[jira] [Updated] (SPARK-33306) TimezoneID is needed when there is a cast from Date to String

2020-10-31 Thread EdisonWang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-33306?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

EdisonWang updated SPARK-33306:
---
Description: 
A simple way to reproduce this is 

```

spark-shell --conf spark.sql.legacy.typeCoercion.datetimeToString.enabled

>> sql("""

select a.d1 from
(select to_date(concat('2000-01-0', id)) as d1 from range(1, 2)) a
join
(select concat('2000-01-0', id) as d2 from range(1, 2)) b
on a.d1 = b.d2

""").show

```

 

it will throw

```

java.util.NoSuchElementException: None.get
 at scala.None$.get(Option.scala:529)
 at scala.None$.get(Option.scala:527)
 at 
org.apache.spark.sql.catalyst.expressions.TimeZoneAwareExpression.zoneId(datetimeExpressions.scala:56)
 at 
org.apache.spark.sql.catalyst.expressions.TimeZoneAwareExpression.zoneId$(datetimeExpressions.scala:56)
 at 
org.apache.spark.sql.catalyst.expressions.CastBase.zoneId$lzycompute(Cast.scala:253)
 at org.apache.spark.sql.catalyst.expressions.CastBase.zoneId(Cast.scala:253)
 at 
org.apache.spark.sql.catalyst.expressions.CastBase.dateFormatter$lzycompute(Cast.scala:287)
 at 
org.apache.spark.sql.catalyst.expressions.CastBase.dateFormatter(Cast.scala:287)

```

> TimezoneID is needed when there is a cast from Date to String
> 
>
> Key: SPARK-33306
> URL: https://issues.apache.org/jira/browse/SPARK-33306
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: EdisonWang
>Priority: Major
>
> A simple way to reproduce this is 
> ```
> spark-shell --conf spark.sql.legacy.typeCoercion.datetimeToString.enabled
> >> sql("""
> select a.d1 from
> (select to_date(concat('2000-01-0', id)) as d1 from range(1, 2)) a
> join
> (select concat('2000-01-0', id) as d2 from range(1, 2)) b
> on a.d1 = b.d2
> """).show
> ```
>  
> it will throw
> ```
> java.util.NoSuchElementException: None.get
>  at scala.None$.get(Option.scala:529)
>  at scala.None$.get(Option.scala:527)
>  at 
> org.apache.spark.sql.catalyst.expressions.TimeZoneAwareExpression.zoneId(datetimeExpressions.scala:56)
>  at 
> org.apache.spark.sql.catalyst.expressions.TimeZoneAwareExpression.zoneId$(datetimeExpressions.scala:56)
>  at 
> org.apache.spark.sql.catalyst.expressions.CastBase.zoneId$lzycompute(Cast.scala:253)
>  at org.apache.spark.sql.catalyst.expressions.CastBase.zoneId(Cast.scala:253)
>  at 
> org.apache.spark.sql.catalyst.expressions.CastBase.dateFormatter$lzycompute(Cast.scala:287)
>  at 
> org.apache.spark.sql.catalyst.expressions.CastBase.dateFormatter(Cast.scala:287)
> ```






[jira] [Created] (SPARK-33306) TimezoneID is needed when there is a cast from Date to String

2020-10-30 Thread EdisonWang (Jira)
EdisonWang created SPARK-33306:
--

 Summary: TimezoneID is needed when there is a cast from Date to String
 Key: SPARK-33306
 URL: https://issues.apache.org/jira/browse/SPARK-33306
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.0.0
Reporter: EdisonWang









[jira] [Updated] (SPARK-32559) Fix the trim logic in UTF8String.toInt/toLong didn't handle Chinese characters correctly

2020-08-06 Thread EdisonWang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-32559?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

EdisonWang updated SPARK-32559:
---
Description: 
The trim logic in the Cast expression introduced in 
[https://github.com/apache/spark/pull/26622] will trim Chinese characters 
unexpectedly.

For example, the SQL select cast("1中文" as float) gives 1 instead of null.
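
A minimal reproduction sketch of the statement above (assuming a local 
SparkSession):

{code:scala}
import org.apache.spark.sql.SparkSession

object TrimCastRepro {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("trim-cast").master("local[*]").getOrCreate()
    // Observed: 1.0 is returned; per this ticket the expected result is null.
    spark.sql("""select cast("1中文" as float)""").show()
    spark.stop()
  }
}
{code}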

 

  was:
The trim logic in the Cast expression introduced in 
[https://github.com/apache/spark/pull/26622] will trim Chinese characters 
unexpectedly.

For example, 

!image-2020-08-06-17-01-48-646.png!

 


> Fix the trim logic in UTF8String.toInt/toLong didn't handle Chinese characters 
> correctly
> ---
>
> Key: SPARK-32559
> URL: https://issues.apache.org/jira/browse/SPARK-32559
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: EdisonWang
>Priority: Minor
>
> The trim logic in the Cast expression introduced in 
> [https://github.com/apache/spark/pull/26622] will trim Chinese characters 
> unexpectedly.
> For example, the SQL select cast("1中文" as float) gives 1 instead of null.
>  






[jira] [Created] (SPARK-32559) Fix the trim logic in UTF8String.toInt/toLong didn't handle Chinese characters correctly

2020-08-06 Thread EdisonWang (Jira)
EdisonWang created SPARK-32559:
--

 Summary: Fix the trim logic in UTF8String.toInt/toLong didn't 
handle Chinese characters correctly
 Key: SPARK-32559
 URL: https://issues.apache.org/jira/browse/SPARK-32559
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.0.0
Reporter: EdisonWang


The trim logic in the Cast expression introduced in 
[https://github.com/apache/spark/pull/26622] will trim Chinese characters 
unexpectedly.

For example, 

!image-2020-08-06-17-01-48-646.png!

 






[jira] [Updated] (SPARK-31952) The metric of MemoryBytesSpill is incorrect when doing Aggregate

2020-06-10 Thread EdisonWang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31952?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

EdisonWang updated SPARK-31952:
---
Description: 
When an Aggregate spills, the Spill (memory) metric is zero while the Spill 
(disk) metric is correct, as shown below.

!image-2020-06-10-16-35-58-002.png!

> The metric of MemoryBytesSpill is incorrect when doing Aggregate
> 
>
> Key: SPARK-31952
> URL: https://issues.apache.org/jira/browse/SPARK-31952
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: EdisonWang
>Priority: Minor
> Attachments: image-2020-06-10-16-35-58-002.png
>
>
> When an Aggregate spills, the Spill (memory) metric is zero while the Spill 
> (disk) metric is correct, as shown below.
> !image-2020-06-10-16-35-58-002.png!






[jira] [Updated] (SPARK-31952) The metric of MemoryBytesSpill is incorrect when doing Aggregate

2020-06-10 Thread EdisonWang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31952?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

EdisonWang updated SPARK-31952:
---
Attachment: image-2020-06-10-16-35-58-002.png

> The metric of MemoryBytesSpill is incorrect when doing Aggregate
> 
>
> Key: SPARK-31952
> URL: https://issues.apache.org/jira/browse/SPARK-31952
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: EdisonWang
>Priority: Minor
> Attachments: image-2020-06-10-16-35-58-002.png
>
>







[jira] [Created] (SPARK-31952) Fix incorrect MemoryBytesSpill metric when doing Aggregate

2020-06-10 Thread EdisonWang (Jira)
EdisonWang created SPARK-31952:
--

 Summary: Fix incorrect MemoryBytesSpill metric when doing Aggregate
 Key: SPARK-31952
 URL: https://issues.apache.org/jira/browse/SPARK-31952
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.0.0
Reporter: EdisonWang









[jira] [Updated] (SPARK-31952) The metric of MemoryBytesSpill is incorrect when doing Aggregate

2020-06-10 Thread EdisonWang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-31952?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

EdisonWang updated SPARK-31952:
---
Summary: The metric of MemoryBytesSpill is incorrect when doing Aggregate  
(was: Fix incorrect MemoryBytesSpill metric when doing Aggregate)

> The metric of MemoryBytesSpill is incorrect when doing Aggregate
> 
>
> Key: SPARK-31952
> URL: https://issues.apache.org/jira/browse/SPARK-31952
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: EdisonWang
>Priority: Minor
>







[jira] [Created] (SPARK-30806) Evaluate once per group in UnboundedWindowFunctionFrame

2020-02-12 Thread EdisonWang (Jira)
EdisonWang created SPARK-30806:
--

 Summary: Evaluate once per group in UnboundedWindowFunctionFrame
 Key: SPARK-30806
 URL: https://issues.apache.org/jira/browse/SPARK-30806
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.0.0
Reporter: EdisonWang









[jira] [Commented] (SPARK-28332) SQLMetric wrong initValue

2019-12-23 Thread EdisonWang (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-28332?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17002148#comment-17002148
 ] 

EdisonWang commented on SPARK-28332:


I've taken it [~cloud_fan]

> SQLMetric wrong initValue 
> --
>
> Key: SPARK-28332
> URL: https://issues.apache.org/jira/browse/SPARK-28332
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Song Jun
>Priority: Minor
> Fix For: 3.0.0
>
>
> Currently SQLMetrics.createSizeMetric creates a SQLMetric with initValue set 
> to -1.
> If there is a ShuffleMapStage with lots of tasks that read 0 bytes of data, 
> these tasks will send the metric (still at the initValue of -1) to the 
> Driver, and the Driver then merges the metrics for this Stage in 
> DAGScheduler.updateAccumulators, which causes the merged metric value of 
> this Stage to become negative. 
> This is incorrect; we should set the initValue to 0.
> The same applies to SQLMetrics.createTimingMetric.






[jira] [Created] (SPARK-30088) Adaptive execution should convert SortMergeJoin to BroadcastJoin when plan generates empty result

2019-12-01 Thread EdisonWang (Jira)
EdisonWang created SPARK-30088:
--

 Summary: Adaptive execution should convert SortMergeJoin to 
BroadcastJoin when plan generates empty result
 Key: SPARK-30088
 URL: https://issues.apache.org/jira/browse/SPARK-30088
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.0.0
Reporter: EdisonWang


Adaptive execution tries to convert SortMergeJoin to BroadcastJoin by checking 
the `dataSize` metric of the Spark plan. However, if the plan generates an 
empty result, the `dataSize` metric keeps SQLMetric's initial value of -1, 
which can make the following check return false:

```
private def canBroadcast(plan: LogicalPlan): Boolean = {
  plan.stats.sizeInBytes >= 0 && plan.stats.sizeInBytes <=
    conf.autoBroadcastJoinThreshold
}
```






[jira] [Updated] (SPARK-29918) RecordBinaryComparator should check endianness when compared by long

2019-11-15 Thread EdisonWang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-29918?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

EdisonWang updated SPARK-29918:
---
Labels: correctness  (was: )

> RecordBinaryComparator should check endianness when compared by long
> 
>
> Key: SPARK-29918
> URL: https://issues.apache.org/jira/browse/SPARK-29918
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: EdisonWang
>Priority: Minor
>  Labels: correctness
>
> If the architecture supports unaligned access or the offset is 8-byte aligned, 
> RecordBinaryComparator compares 8 bytes at a time by reading them as a 
> long. Otherwise, it compares byte by byte. 
> However, on a little-endian machine, comparing as a long and comparing 
> byte by byte may give different results. If the architectures in a YARN 
> cluster differ (some are unaligned-access capable while others are not), then 
> the order of two records after sorting is undetermined, which will result 
> in the same problem as in https://issues.apache.org/jira/browse/SPARK-23207
>  






[jira] [Created] (SPARK-29918) RecordBinaryComparator should check endianness when compared by long

2019-11-15 Thread EdisonWang (Jira)
EdisonWang created SPARK-29918:
--

 Summary: RecordBinaryComparator should check endianness when 
compared by long
 Key: SPARK-29918
 URL: https://issues.apache.org/jira/browse/SPARK-29918
 Project: Spark
  Issue Type: Bug
  Components: SQL
Affects Versions: 3.0.0
Reporter: EdisonWang


If the architecture supports unaligned access or the offset is 8-byte aligned, 
RecordBinaryComparator compares 8 bytes at a time by reading them as a long. 
Otherwise, it compares byte by byte. 

However, on a little-endian machine, comparing as a long and comparing byte by 
byte may give different results. If the architectures in a YARN cluster differ 
(some are unaligned-access capable while others are not), then the order of 
two records after sorting is undetermined, which will result in the same 
problem as in https://issues.apache.org/jira/browse/SPARK-23207
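
A standalone sketch (not the Spark comparator itself) of why long-at-a-time and 
byte-by-byte comparison can disagree on a little-endian machine:

{code:scala}
import java.nio.{ByteBuffer, ByteOrder}

object EndiannessCompare {
  // Lexicographic, byte-by-byte comparison (assumes equal lengths here).
  def compareBytes(x: Array[Byte], y: Array[Byte]): Int =
    x.indices.find(i => x(i) != y(i))
      .map(i => java.lang.Byte.compare(x(i), y(i)))
      .getOrElse(0)

  // Reinterpret 8 bytes as a little-endian long, as an unaligned load does on x86.
  def asLongLE(x: Array[Byte]): Long =
    ByteBuffer.wrap(x).order(ByteOrder.LITTLE_ENDIAN).getLong

  def main(args: Array[String]): Unit = {
    val a = Array[Byte](1, 0, 0, 0, 0, 0, 0, 2)
    val b = Array[Byte](2, 0, 0, 0, 0, 0, 0, 1)
    // Byte-by-byte: a < b (first byte 1 < 2).
    // As little-endian longs: a = 0x0200...01 > b = 0x0100...02, so a > b.
    println(s"byte-by-byte: ${compareBytes(a, b)}, little-endian long: " +
      s"${java.lang.Long.compare(asLongLE(a), asLongLE(b))}")
  }
}
{code}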

 






[jira] [Resolved] (SPARK-27789) Use stopEarly in codegen of ColumnarBatchScan

2019-10-09 Thread EdisonWang (Jira)


 [ 
https://issues.apache.org/jira/browse/SPARK-27789?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

EdisonWang resolved SPARK-27789.

Resolution: Not A Problem

> Use stopEarly in codegen of ColumnarBatchScan
> -
>
> Key: SPARK-27789
> URL: https://issues.apache.org/jira/browse/SPARK-27789
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: EdisonWang
>Priority: Minor
>
> Suppose that we have a Hive table created like this:
> ```sql("create table parquet_test (id int) using parquet")```, and our query 
> sql is `select id from parquet_test limit 10`.
> With `spark.sql.hive.convertMetastoreParquet` set to false, the SQL 
> execution will go into `InputAdapter`; in its codegen, it can use `stopEarly` 
> to accelerate the local limit.
> But if we set `spark.sql.hive.convertMetastoreParquet` to true, the SQL 
> execution will go into `ColumnarBatchScan`, which does not optimize the local limit.
> In this patch, we use `stopEarly` in `ColumnarBatchScan` as well.






[jira] [Created] (SPARK-29343) Eliminate sorts without limit in the subquery of Join/Aggregation

2019-10-03 Thread EdisonWang (Jira)
EdisonWang created SPARK-29343:
--

 Summary: Eliminate sorts without limit in the subquery of 
Join/Aggregation
 Key: SPARK-29343
 URL: https://issues.apache.org/jira/browse/SPARK-29343
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.0.0
Reporter: EdisonWang


A {{Sort}} without a {{Limit}} in a {{Join}}/{{GroupBy}} subquery is useless.

 

For example, {{select count(1) from (select a from test1 order by a)}} is equal 
to {{select count(1) from (select a from test1)}}.
{{select * from (select a from test1 order by a) t1 join (select b from test2) 
t2 on t1.a = t2.b}} is equal to {{select * from (select a from test1) t1 join 
(select b from test2) t2 on t1.a = t2.b}}.

Removing the useless {{Sort}} operator can improve performance.






[jira] [Commented] (SPARK-28123) String Functions: Add support btrim

2019-08-02 Thread EdisonWang (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-28123?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16899349#comment-16899349
 ] 

EdisonWang commented on SPARK-28123:


Seems it is the same as trim() on both sides?

> String Functions: Add support btrim
> ---
>
> Key: SPARK-28123
> URL: https://issues.apache.org/jira/browse/SPARK-28123
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 3.0.0
>Reporter: Yuming Wang
>Priority: Major
>
> ||Function||Return Type||Description||Example||Result||
> |{{btrim(string bytea, bytes bytea)}}|{{bytea}}|Remove the longest string containing 
> only bytes appearing in _bytes_ from the start and end of 
> _string_|{{btrim('\000trim\001'::bytea, '\000\001'::bytea)}}|{{trim}}|
> More details: https://www.postgresql.org/docs/11/functions-binarystring.html






[jira] [Created] (SPARK-28257) Use ConfigEntry for hardcoded configs in SQL module

2019-07-05 Thread EdisonWang (JIRA)
EdisonWang created SPARK-28257:
--

 Summary: Use ConfigEntry for hardcoded configs in SQL module
 Key: SPARK-28257
 URL: https://issues.apache.org/jira/browse/SPARK-28257
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 3.0.0
Reporter: EdisonWang


Use ConfigEntry for hardcoded configs in SQL module






[jira] [Created] (SPARK-27789) Use stopEarly in codegen of ColumnarBatchScan

2019-05-21 Thread EdisonWang (JIRA)
EdisonWang created SPARK-27789:
--

 Summary: Use stopEarly in codegen of ColumnarBatchScan
 Key: SPARK-27789
 URL: https://issues.apache.org/jira/browse/SPARK-27789
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 2.4.3, 2.4.2, 2.4.1, 2.4.0
Reporter: EdisonWang


Suppose that we have a Hive table created like this:
```sql("create table parquet_test (id int) using parquet")```, and our query 
sql is `select id from parquet_test limit 10`.
With `spark.sql.hive.convertMetastoreParquet` set to false, the SQL execution 
will go into `InputAdapter`; in its codegen, it can use `stopEarly` to 
accelerate the local limit.
But if we set `spark.sql.hive.convertMetastoreParquet` to true, the SQL 
execution will go into `ColumnarBatchScan`, which does not optimize the local 
limit.

In this patch, we use `stopEarly` in `ColumnarBatchScan` as well.






[jira] [Closed] (SPARK-26500) Add conf to support ignore hdfs data locality

2019-03-22 Thread EdisonWang (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26500?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

EdisonWang closed SPARK-26500.
--

> Add conf to support ignore hdfs data locality
> -
>
> Key: SPARK-26500
> URL: https://issues.apache.org/jira/browse/SPARK-26500
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.4.0
>Reporter: EdisonWang
>Priority: Trivial
>
> When reading a large Hive table/directory with thousands of files, it can 
> cost up to several minutes or even hours to calculate data locality for each 
> split in the driver, while executors sit idle.
> This situation is even worse when running in SparkThriftServer mode, because 
> handleJobSubmitted (which calls getPreferedLocation) is handled in a single 
> thread, so one big SQL query will block all the following ones.
> At the same time, most companies' internal networks use gigabit network 
> cards, so it is acceptable to read data without locality.






[jira] [Updated] (SPARK-27232) Ignore file locality in InMemoryFileIndex if spark.locality.wait is set to 0

2019-03-21 Thread EdisonWang (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27232?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

EdisonWang updated SPARK-27232:
---
Summary: Ignore file locality in InMemoryFileIndex if spark.locality.wait 
is set to 0  (was: Skip to get file block location if locality is ignored)

> Ignore file locality in InMemoryFileIndex if spark.locality.wait is set to 0
> --
>
> Key: SPARK-27232
> URL: https://issues.apache.org/jira/browse/SPARK-27232
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: EdisonWang
>Priority: Minor
>







[jira] [Updated] (SPARK-27232) Ignore file locality in InMemoryFileIndex if spark.locality.wait is set to 0

2019-03-21 Thread EdisonWang (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27232?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

EdisonWang updated SPARK-27232:
---
Description: 
`InMemoryFileIndex` needs to request file block location information in order 
to do locality scheduling in `TaskSetManager`. 

Usually this is a time-consuming task. For example, in our production 
environment, there are 24 partitions with 149925 files totaling 83TB. It costs 
about 10 minutes to request file block locations before submitting a Spark 
job. Even though I set `spark.sql.sources.parallelPartitionDiscovery.threshold` 
to 24 to parallelize it, it still needs 2 minutes. 

Anyway, this is a waste if we don't care about the locality of files (for 
example, when storage and computation are separate).

So there should be a conf to control whether we need to send a 
`getFileBlockLocations` request to the HDFS NameNode. If the user sets 
`spark.locality.wait` to 0, file block location information is meaningless. 

Here in this PR, if `spark.locality.wait` is set to 0, it will not request file 
location information anymore, which saves several seconds to minutes.
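
A rough sketch of the proposed guard; apart from `spark.locality.wait` and 
`getFileBlockLocations`, the names below are illustrative and not the actual 
`InMemoryFileIndex` code:

{code:scala}
import org.apache.hadoop.fs.{BlockLocation, FileStatus, FileSystem}

object ListingSketch {
  def listWithOptionalLocations(
      fs: FileSystem,
      statuses: Seq[FileStatus],
      localityWaitMs: Long): Seq[(FileStatus, Array[BlockLocation])] = {
    statuses.map { status =>
      if (localityWaitMs == 0L) {
        // Locality is ignored, so skip the expensive NameNode round-trips.
        (status, Array.empty[BlockLocation])
      } else {
        (status, fs.getFileBlockLocations(status, 0L, status.getLen))
      }
    }
  }
}
{code}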


> Ignore file locality in InMemoryFileIndex if spark.locality.wait is set to 0
> --
>
> Key: SPARK-27232
> URL: https://issues.apache.org/jira/browse/SPARK-27232
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: EdisonWang
>Priority: Minor
>
> `InMemoryFileIndex` needs to request file block location information in order 
> to do locality scheduling in `TaskSetManager`. 
> Usually this is a time-consuming task. For example, in our production 
> environment, there are 24 partitions with 149925 files totaling 83TB. It costs 
> about 10 minutes to request file block locations before submitting a Spark 
> job. Even though I set `spark.sql.sources.parallelPartitionDiscovery.threshold` 
> to 24 to parallelize it, it still needs 2 minutes. 
> Anyway, this is a waste if we don't care about the locality of files (for 
> example, when storage and computation are separate).
> So there should be a conf to control whether we need to send a 
> `getFileBlockLocations` request to the HDFS NameNode. If the user sets 
> `spark.locality.wait` to 0, file block location information is meaningless. 
> Here in this PR, if `spark.locality.wait` is set to 0, it will not request 
> file location information anymore, which saves several seconds to minutes.






[jira] [Created] (SPARK-27232) Skip to get file block location if locality is ignored

2019-03-21 Thread EdisonWang (JIRA)
EdisonWang created SPARK-27232:
--

 Summary: Skip to get file block location if locality is ignored
 Key: SPARK-27232
 URL: https://issues.apache.org/jira/browse/SPARK-27232
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 2.4.0
Reporter: EdisonWang









[jira] [Created] (SPARK-27202) Update comments to keep them consistent with the code

2019-03-19 Thread EdisonWang (JIRA)
EdisonWang created SPARK-27202:
--

 Summary: Update comments to keep them consistent with the code
 Key: SPARK-27202
 URL: https://issues.apache.org/jira/browse/SPARK-27202
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 2.4.0
Reporter: EdisonWang


Update comments in InMemoryFileIndex.scala to keep them consistent with the code.






[jira] [Created] (SPARK-27079) Fix typo & Remove useless imports

2019-03-06 Thread EdisonWang (JIRA)
EdisonWang created SPARK-27079:
--

 Summary: Fix typo & Remove useless imports
 Key: SPARK-27079
 URL: https://issues.apache.org/jira/browse/SPARK-27079
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 2.4.0
Reporter: EdisonWang









[jira] [Updated] (SPARK-27033) Add rule to optimize binary comparisons to its push down format

2019-03-02 Thread EdisonWang (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27033?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

EdisonWang updated SPARK-27033:
---
Affects Version/s: (was: 3.0.0)
   2.4.0

> Add rule to optimize binary comparisons to its push down format
> ---
>
> Key: SPARK-27033
> URL: https://issues.apache.org/jira/browse/SPARK-27033
> Project: Spark
>  Issue Type: Improvement
>  Components: Optimizer
>Affects Versions: 2.4.0
>Reporter: EdisonWang
>Priority: Minor
>
> Currently, filters like "select * from table where a + 1 >= 3" cannot be 
> pushed down. This optimizer rule can convert it to "select * from table where 
> a >= 3 - 1", and then to "select * from table where a >= 2". 






[jira] [Updated] (SPARK-27033) Add rule to optimize binary comparisons to its push down format

2019-03-02 Thread EdisonWang (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27033?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

EdisonWang updated SPARK-27033:
---
Description: Currently, filters like "select * from table where a + 1 >= 3" 
cannot be pushed down. This optimizer rule can convert it to "select * from 
table where a >= 3 - 1", and then to "select * from table where a >= 2".   
(was: _emphasized text_)
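
For illustration, the manual rewrite the proposed rule would perform 
automatically (the path and data below are assumptions, not from the ticket):

{code:scala}
import org.apache.spark.sql.SparkSession

object ComparisonRewriteExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("rewrite").master("local[*]").getOrCreate()
    import spark.implicits._

    Seq(1, 2, 3, 4).toDF("a").write.mode("overwrite").parquet("/tmp/table")
    val table = spark.read.parquet("/tmp/table")

    table.filter("a + 1 >= 3").explain()  // filter on an arithmetic expression; stays in Spark
    table.filter("a >= 2").explain()      // equivalent rewritten form; eligible for pushdown
    spark.stop()
  }
}
{code}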

> Add rule to optimize binary comparisons to its push down format
> ---
>
> Key: SPARK-27033
> URL: https://issues.apache.org/jira/browse/SPARK-27033
> Project: Spark
>  Issue Type: Improvement
>  Components: Optimizer
>Affects Versions: 3.0.0
>Reporter: EdisonWang
>Priority: Minor
>
> Currently, filters like "select * from table where a + 1 >= 3" cannot be 
> pushed down. This optimizer rule can convert it to "select * from table where 
> a >= 3 - 1", and then to "select * from table where a >= 2". 






[jira] [Updated] (SPARK-27033) Add rule to optimize binary comparisons to its push down format

2019-03-02 Thread EdisonWang (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-27033?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

EdisonWang updated SPARK-27033:
---
Description: _emphasized text_

> Add rule to optimize binary comparisons to its push down format
> ---
>
> Key: SPARK-27033
> URL: https://issues.apache.org/jira/browse/SPARK-27033
> Project: Spark
>  Issue Type: Improvement
>  Components: Optimizer
>Affects Versions: 3.0.0
>Reporter: EdisonWang
>Priority: Minor
>
> _emphasized text_






[jira] [Created] (SPARK-27033) Add rule to optimize binary comparisons to its push down format

2019-03-02 Thread EdisonWang (JIRA)
EdisonWang created SPARK-27033:
--

 Summary: Add rule to optimize binary comparisons to its push down 
format
 Key: SPARK-27033
 URL: https://issues.apache.org/jira/browse/SPARK-27033
 Project: Spark
  Issue Type: Improvement
  Components: Optimizer
Affects Versions: 3.0.0
Reporter: EdisonWang









[jira] [Updated] (SPARK-26544) escape string when serialize map/array make it a valid json (keep alignment with hive)

2019-01-04 Thread EdisonWang (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26544?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

EdisonWang updated SPARK-26544:
---
Summary: escape string when serialize map/array make it a valid json (keep 
alignment with hive)  (was: the string serialized from map/array type is not a 
valid json (while hive is))

> escape string when serialize map/array make it a valid json (keep alignment 
> with hive)
> --
>
> Key: SPARK-26544
> URL: https://issues.apache.org/jira/browse/SPARK-26544
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: EdisonWang
>Priority: Major
>
> When reading a Hive table with a map/array type, the string serialized by 
> the Spark thrift server is not valid JSON, while Hive's is. 
> For example, when selecting a field whose type is map, the Spark 
> thrift server returns 
>  
> {code:java}
> {"author_id":"123","log_pb":"{"impr_id":"20181231"}","request_id":"001"}
> {code}
>  
> while hive thriftserver returns
>  
> {code:java}
> {"author_id":"123", "log_pb":"{\"impr_id\":\"20181231\"}","request_id":"001"}
> {code}
>  






[jira] [Updated] (SPARK-26544) escape string when serialize map/array to make it a valid json (keep alignment with hive)

2019-01-04 Thread EdisonWang (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26544?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

EdisonWang updated SPARK-26544:
---
Summary: escape string when serialize map/array to make it a valid json 
(keep alignment with hive)  (was: escape string when serialize map/array make 
it a valid json (keep alignment with hive))

> escape string when serialize map/array to make it a valid json (keep 
> alignment with hive)
> -
>
> Key: SPARK-26544
> URL: https://issues.apache.org/jira/browse/SPARK-26544
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Affects Versions: 2.4.0
>Reporter: EdisonWang
>Priority: Major
>
> When reading a Hive table with a map/array type, the string serialized by 
> the Spark thrift server is not valid JSON, while Hive's is. 
> For example, when selecting a field whose type is map, the Spark 
> thrift server returns 
>  
> {code:java}
> {"author_id":"123","log_pb":"{"impr_id":"20181231"}","request_id":"001"}
> {code}
>  
> while hive thriftserver returns
>  
> {code:java}
> {"author_id":"123", "log_pb":"{\"impr_id\":\"20181231\"}","request_id":"001"}
> {code}
>  






[jira] [Created] (SPARK-26544) the string serialized from map/array type is not a valid json (while hive is)

2019-01-04 Thread EdisonWang (JIRA)
EdisonWang created SPARK-26544:
--

 Summary: the string serialized from map/array type is not a valid 
json (while hive is)
 Key: SPARK-26544
 URL: https://issues.apache.org/jira/browse/SPARK-26544
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Affects Versions: 2.4.0
Reporter: EdisonWang


When reading a Hive table with a map/array type, the string serialized by the 
Spark thrift server is not valid JSON, while Hive's is. 

For example, when selecting a field whose type is map, the Spark thrift 
server returns 

 
{code:java}
{"author_id":"123","log_pb":"{"impr_id":"20181231"}","request_id":"001"}
{code}
 

while hive thriftserver returns

 
{code:java}
{"author_id":"123", "log_pb":"{\"impr_id\":\"20181231\"}","request_id":"001"}
{code}
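
A toy sketch (plain Scala, not the actual thrift-server serialization code) of 
the escaping this asks for, producing the Hive-style output above:

{code:scala}
object MapToJsonSketch {
  // Escape embedded quotes and backslashes so nested JSON stays parseable.
  def escape(s: String): String = s.flatMap {
    case '"'  => "\\\""
    case '\\' => "\\\\"
    case c    => c.toString
  }

  def toJson(m: Map[String, String]): String =
    m.map { case (k, v) => "\"" + escape(k) + "\":\"" + escape(v) + "\"" }
      .mkString("{", ",", "}")

  def main(args: Array[String]): Unit = {
    val row = Map("log_pb" -> """{"impr_id":"20181231"}""", "request_id" -> "001")
    // Prints {"log_pb":"{\"impr_id\":\"20181231\"}","request_id":"001"}
    println(toJson(row))
  }
}
{code}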
 






[jira] [Resolved] (SPARK-26500) Add conf to support ignore hdfs data locality

2018-12-31 Thread EdisonWang (JIRA)


 [ 
https://issues.apache.org/jira/browse/SPARK-26500?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

EdisonWang resolved SPARK-26500.

Resolution: Not A Problem

> Add conf to support ignore hdfs data locality
> -
>
> Key: SPARK-26500
> URL: https://issues.apache.org/jira/browse/SPARK-26500
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 2.4.0
>Reporter: EdisonWang
>Priority: Trivial
>
> When reading a large Hive table/directory with thousands of files, it can 
> cost up to several minutes or even hours to calculate data locality for each 
> split in the driver, while executors sit idle.
> This situation is even worse when running in SparkThriftServer mode, because 
> handleJobSubmitted (which calls getPreferedLocation) is handled in a single 
> thread, so one big SQL query will block all the following ones.
> At the same time, most companies' internal networks use gigabit network 
> cards, so it is acceptable to read data without locality.






[jira] [Created] (SPARK-26500) Add conf to support ignore hdfs data locality

2018-12-28 Thread EdisonWang (JIRA)
EdisonWang created SPARK-26500:
--

 Summary: Add conf to support ignore hdfs data locality
 Key: SPARK-26500
 URL: https://issues.apache.org/jira/browse/SPARK-26500
 Project: Spark
  Issue Type: Improvement
  Components: Spark Core
Affects Versions: 2.4.0
Reporter: EdisonWang


When reading a large Hive table/directory with thousands of files, it can cost 
up to several minutes or even hours to calculate data locality for each split 
in the driver, while executors sit idle.

This situation is even worse when running in SparkThriftServer mode, because 
handleJobSubmitted (which calls getPreferedLocation) is handled in a single 
thread, so one big SQL query will block all the following ones.

At the same time, most companies' internal networks use gigabit network cards, 
so it is acceptable to read data without locality.


