[jira] [Assigned] (SPARK-36466) Table in unloaded catalog referenced by view should load correctly
[ https://issues.apache.org/jira/browse/SPARK-36466?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-36466: Assignee: (was: Apache Spark) > Table in unloaded catalog referenced by view should load correctly > -- > > Key: SPARK-36466 > URL: https://issues.apache.org/jira/browse/SPARK-36466 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.1.2, 3.2.0 >Reporter: Cheng Pan >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-36466) Table in unloaded catalog referenced by view should load correctly
[ https://issues.apache.org/jira/browse/SPARK-36466?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-36466: Assignee: Apache Spark > Table in unloaded catalog referenced by view should load correctly > -- > > Key: SPARK-36466 > URL: https://issues.apache.org/jira/browse/SPARK-36466 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.1.2, 3.2.0 >Reporter: Cheng Pan >Assignee: Apache Spark >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-36466) Table in unloaded catalog referenced by view should load correctly
Cheng Pan created SPARK-36466: - Summary: Table in unloaded catalog referenced by view should load correctly Key: SPARK-36466 URL: https://issues.apache.org/jira/browse/SPARK-36466 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 3.1.2, 3.2.0 Reporter: Cheng Pan -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-36464) Fix Underlying Size Variable Initialization in ChunkedByteBufferOutputStream for Writing Over 2GB Data
[ https://issues.apache.org/jira/browse/SPARK-36464?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17396462#comment-17396462 ] Apache Spark commented on SPARK-36464: -- User 'kazuyukitanimura' has created a pull request for this issue: https://github.com/apache/spark/pull/33690 > Fix Underlying Size Variable Initialization in ChunkedByteBufferOutputStream > for Writing Over 2GB Data > -- > > Key: SPARK-36464 > URL: https://issues.apache.org/jira/browse/SPARK-36464 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.1.2 >Reporter: Kazuyuki Tanimura >Priority: Major > > The `size` method of `ChunkedByteBufferOutputStream` returns a `Long` value; > however, the underlying `_size` variable is initialized as `Int`. > That causes an overflow and returns a negative size when over 2GB data is > written into `ChunkedByteBufferOutputStream` -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
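The overflow described in this issue can be illustrated outside Spark. Below is a small pure-Python sketch (illustrative only, not the Spark implementation) that simulates a byte counter held in a 32-bit signed Int, the effective type of `_size`, and shows it wrapping negative once more than 2GB has been written:

```python
# Illustrative sketch: simulate a size counter stored in a 32-bit signed
# integer (like a Scala/Java Int), as described for `_size` above.

def to_int32(x: int) -> int:
    """Wrap an arbitrary integer into the 32-bit signed range."""
    return ((x + 2**31) % 2**32) - 2**31

def write_chunks(chunk_size: int, num_chunks: int) -> int:
    """Accumulate written bytes the way a plain Int counter would."""
    size = 0
    for _ in range(num_chunks):
        size = to_int32(size + chunk_size)
    return size

# 31 chunks of 64 MB stay under 2 GB; 33 chunks cross it and wrap negative.
print(write_chunks(64 * 1024 * 1024, 31))  # 2080374784 (still positive)
print(write_chunks(64 * 1024 * 1024, 33))  # negative: the Int overflowed
```

Holding the counter in a 64-bit Long, as the `size` method's return type already implies, removes the wrap-around.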
[jira] [Assigned] (SPARK-36465) Dynamic gap duration in session window
[ https://issues.apache.org/jira/browse/SPARK-36465?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-36465: Assignee: (was: Apache Spark) > Dynamic gap duration in session window > -- > > Key: SPARK-36465 > URL: https://issues.apache.org/jira/browse/SPARK-36465 > Project: Spark > Issue Type: Improvement > Components: Structured Streaming >Affects Versions: 3.3.0 >Reporter: L. C. Hsieh >Priority: Major > > The gap duration used in session window for now is a static value. To support > more complex usage, it is better to support dynamic gap duration which > determines the gap duration by looking at the current data. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-36464) Fix Underlying Size Variable Initialization in ChunkedByteBufferOutputStream for Writing Over 2GB Data
[ https://issues.apache.org/jira/browse/SPARK-36464?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-36464: Assignee: (was: Apache Spark) > Fix Underlying Size Variable Initialization in ChunkedByteBufferOutputStream > for Writing Over 2GB Data > -- > > Key: SPARK-36464 > URL: https://issues.apache.org/jira/browse/SPARK-36464 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.1.2 >Reporter: Kazuyuki Tanimura >Priority: Major > > The `size` method of `ChunkedByteBufferOutputStream` returns a `Long` value; > however, the underlying `_size` variable is initialized as `Int`. > That causes an overflow and returns a negative size when over 2GB data is > written into `ChunkedByteBufferOutputStream` -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-36465) Dynamic gap duration in session window
[ https://issues.apache.org/jira/browse/SPARK-36465?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17396461#comment-17396461 ] Apache Spark commented on SPARK-36465: -- User 'viirya' has created a pull request for this issue: https://github.com/apache/spark/pull/33691 > Dynamic gap duration in session window > -- > > Key: SPARK-36465 > URL: https://issues.apache.org/jira/browse/SPARK-36465 > Project: Spark > Issue Type: Improvement > Components: Structured Streaming >Affects Versions: 3.3.0 >Reporter: L. C. Hsieh >Priority: Major > > The gap duration used in session window for now is a static value. To support > more complex usage, it is better to support dynamic gap duration which > determines the gap duration by looking at the current data. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-36464) Fix Underlying Size Variable Initialization in ChunkedByteBufferOutputStream for Writing Over 2GB Data
[ https://issues.apache.org/jira/browse/SPARK-36464?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-36464: Assignee: Apache Spark > Fix Underlying Size Variable Initialization in ChunkedByteBufferOutputStream > for Writing Over 2GB Data > -- > > Key: SPARK-36464 > URL: https://issues.apache.org/jira/browse/SPARK-36464 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.1.2 >Reporter: Kazuyuki Tanimura >Assignee: Apache Spark >Priority: Major > > The `size` method of `ChunkedByteBufferOutputStream` returns a `Long` value; > however, the underlying `_size` variable is initialized as `Int`. > That causes an overflow and returns a negative size when over 2GB data is > written into `ChunkedByteBufferOutputStream` -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-36465) Dynamic gap duration in session window
[ https://issues.apache.org/jira/browse/SPARK-36465?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-36465: Assignee: Apache Spark > Dynamic gap duration in session window > -- > > Key: SPARK-36465 > URL: https://issues.apache.org/jira/browse/SPARK-36465 > Project: Spark > Issue Type: Improvement > Components: Structured Streaming >Affects Versions: 3.3.0 >Reporter: L. C. Hsieh >Assignee: Apache Spark >Priority: Major > > The gap duration used in session window for now is a static value. To support > more complex usage, it is better to support dynamic gap duration which > determines the gap duration by looking at the current data. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-36464) Fix Underlying Size Variable Initialization in ChunkedByteBufferOutputStream for Writing Over 2GB Data
[ https://issues.apache.org/jira/browse/SPARK-36464?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17396460#comment-17396460 ] Apache Spark commented on SPARK-36464: -- User 'kazuyukitanimura' has created a pull request for this issue: https://github.com/apache/spark/pull/33690 > Fix Underlying Size Variable Initialization in ChunkedByteBufferOutputStream > for Writing Over 2GB Data > -- > > Key: SPARK-36464 > URL: https://issues.apache.org/jira/browse/SPARK-36464 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.1.2 >Reporter: Kazuyuki Tanimura >Priority: Major > > The `size` method of `ChunkedByteBufferOutputStream` returns a `Long` value; > however, the underlying `_size` variable is initialized as `Int`. > That causes an overflow and returns a negative size when over 2GB data is > written into `ChunkedByteBufferOutputStream` -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-36464) Fix Underlying Size Variable Initialization in ChunkedByteBufferOutputStream for Writing Over 2GB Data
[ https://issues.apache.org/jira/browse/SPARK-36464?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kazuyuki Tanimura updated SPARK-36464: -- Description: The `size` method of `ChunkedByteBufferOutputStream` returns a `Long` value; however, the underlying `_size` variable is initialized as `Int`. That causes an overflow and returns a negative size when over 2GB data is written into `ChunkedByteBufferOutputStream` was: The `size` method of `ChunkedByteBufferOutputStream` returns a `Long` value; however, the underlying `_size` variable is initialized as `Int`. That causes an overflow and returns a negative size when over 2GB data is written into `ChunkedByteBufferOutputStream` build/sbt "core/testOnly *ChunkedByteBufferOutputStreamSuite -- -z SPARK-36464" > Fix Underlying Size Variable Initialization in ChunkedByteBufferOutputStream > for Writing Over 2GB Data > -- > > Key: SPARK-36464 > URL: https://issues.apache.org/jira/browse/SPARK-36464 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.1.2 >Reporter: Kazuyuki Tanimura >Priority: Major > > The `size` method of `ChunkedByteBufferOutputStream` returns a `Long` value; > however, the underlying `_size` variable is initialized as `Int`. > That causes an overflow and returns a negative size when over 2GB data is > written into `ChunkedByteBufferOutputStream` -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-36465) Dynamic gap duration in session window
L. C. Hsieh created SPARK-36465: --- Summary: Dynamic gap duration in session window Key: SPARK-36465 URL: https://issues.apache.org/jira/browse/SPARK-36465 Project: Spark Issue Type: Improvement Components: Structured Streaming Affects Versions: 3.3.0 Reporter: L. C. Hsieh The gap duration currently used in session window is a static value. To support more complex usage, it is better to support a dynamic gap duration, determined by looking at the current data. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
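The idea can be sketched without Spark. Below is a minimal pure-Python sessionizer (illustrative only, not the proposed Spark API; event types and gap values are made up) in which the gap is derived from each event's own data, e.g. a longer gap after a "video" event than after a "click":

```python
# Minimal sessionizer with a per-event (dynamic) gap duration.
# Events are (timestamp_seconds, event_type) pairs; names are illustrative.

def gap_for(event_type: str) -> int:
    """Dynamic gap: determined by looking at the current event's data."""
    return 30 if event_type == "video" else 5

def sessionize(events):
    """Group time-ordered events into sessions; a session closes when the
    next event starts after the previous event's dynamic gap has elapsed."""
    sessions, current, session_end = [], [], None
    for ts, etype in sorted(events):
        if session_end is not None and ts > session_end:
            sessions.append(current)
            current = []
        current.append((ts, etype))
        session_end = ts + gap_for(etype)  # gap depends on this event
    if current:
        sessions.append(current)
    return sessions

events = [(0, "click"), (4, "click"), (20, "video"), (45, "click"), (60, "click")]
print(sessionize(events))
```

With a static 5-second gap the event at t=45 would start a new session; the 30-second gap attached to the video event at t=20 keeps it in the same session.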
[jira] [Updated] (SPARK-36464) Fix Underlying Size Variable Initialization in ChunkedByteBufferOutputStream for Writing Over 2GB Data
[ https://issues.apache.org/jira/browse/SPARK-36464?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kazuyuki Tanimura updated SPARK-36464: -- Description: The `size` method of `ChunkedByteBufferOutputStream` returns a `Long` value; however, the underlying `_size` variable is initialized as `Int`. That causes an overflow and returns a negative size when over 2GB data is written into `ChunkedByteBufferOutputStream` build/sbt "core/testOnly *ChunkedByteBufferOutputStreamSuite -- -z SPARK-36464" was: The `size` method of `ChunkedByteBufferOutputStream` returns a `Long` value; however, the underlying `_size` variable is initialized as `Int`. That causes an overflow and returns a negative size when over 2GB data is written into `ChunkedByteBufferOutputStream` > Fix Underlying Size Variable Initialization in ChunkedByteBufferOutputStream > for Writing Over 2GB Data > -- > > Key: SPARK-36464 > URL: https://issues.apache.org/jira/browse/SPARK-36464 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.1.2 >Reporter: Kazuyuki Tanimura >Priority: Major > > The `size` method of `ChunkedByteBufferOutputStream` returns a `Long` value; > however, the underlying `_size` variable is initialized as `Int`. > That causes an overflow and returns a negative size when over 2GB data is > written into `ChunkedByteBufferOutputStream` > > build/sbt "core/testOnly *ChunkedByteBufferOutputStreamSuite -- -z > SPARK-36464" -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-36464) Fix Underlying Size Variable Initialization in ChunkedByteBufferOutputStream for Writing Over 2GB Data
[ https://issues.apache.org/jira/browse/SPARK-36464?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kazuyuki Tanimura updated SPARK-36464: -- Description: The `size` method of `ChunkedByteBufferOutputStream` returns a `Long` value; however, the underlying `_size` variable is initialized as `Int`. That causes an overflow and returns a negative size when over 2GB data is written into `ChunkedByteBufferOutputStream` was: The `size` method of `ChunkedByteBufferOutputStream` returns a `Long` value; however, the underlying `_size` variable is initialized as `int`. That causes an overflow and returns negative size when over 2GB data is written into `ChunkedByteBufferOutputStream` > Fix Underlying Size Variable Initialization in ChunkedByteBufferOutputStream > for Writing Over 2GB Data > -- > > Key: SPARK-36464 > URL: https://issues.apache.org/jira/browse/SPARK-36464 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.1.2 >Reporter: Kazuyuki Tanimura >Priority: Major > > The `size` method of `ChunkedByteBufferOutputStream` returns a `Long` value; > however, the underlying `_size` variable is initialized as `Int`. > That causes an overflow and returns a negative size when over 2GB data is > written into `ChunkedByteBufferOutputStream` -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-34276) Check the unreleased/unresolved JIRAs/PRs of Parquet 1.11
[ https://issues.apache.org/jira/browse/SPARK-34276?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17396450#comment-17396450 ] Yuming Wang commented on SPARK-34276: - I think so. cc [~smilegator] > Check the unreleased/unresolved JIRAs/PRs of Parquet 1.11 > -- > > Key: SPARK-34276 > URL: https://issues.apache.org/jira/browse/SPARK-34276 > Project: Spark > Issue Type: Task > Components: Build, SQL >Affects Versions: 3.2.0 >Reporter: Yuming Wang >Priority: Blocker > > Before the release, we need to double check the unreleased/unresolved > JIRAs/PRs of Parquet 1.11 and then decide whether we should upgrade/revert > Parquet. At the same time, we should encourage the whole community to do the > compatibility and performance tests for their production workloads, including > both read and write code paths. > More details: > https://github.com/apache/spark/pull/26804#issuecomment-768790620 -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-36449) ALTER TABLE REPLACE COLUMNS should check duplicates for the specified columns for v2 command
[ https://issues.apache.org/jira/browse/SPARK-36449?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-36449. - Resolution: Fixed Issue resolved by pull request 33676 [https://github.com/apache/spark/pull/33676] > ALTER TABLE REPLACE COLUMNS should check duplicates for the specified columns > for v2 command > > > Key: SPARK-36449 > URL: https://issues.apache.org/jira/browse/SPARK-36449 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.3.0 >Reporter: Terry Kim >Assignee: Terry Kim >Priority: Major > Fix For: 3.2.0 > > > ALTER TABLE REPLACE COLUMNS currently doesn't check duplicates for the > specified columns for v2 command. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-36464) Fix Underlying Size Variable Initialization in ChunkedByteBufferOutputStream for Writing Over 2GB Data
Kazuyuki Tanimura created SPARK-36464: - Summary: Fix Underlying Size Variable Initialization in ChunkedByteBufferOutputStream for Writing Over 2GB Data Key: SPARK-36464 URL: https://issues.apache.org/jira/browse/SPARK-36464 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 3.1.2 Reporter: Kazuyuki Tanimura The `size` method of `ChunkedByteBufferOutputStream` returns a `Long` value; however, the underlying `_size` variable is initialized as `int`. That causes an overflow and returns negative size when over 2GB data is written into `ChunkedByteBufferOutputStream` -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-34276) Check the unreleased/unresolved JIRAs/PRs of Parquet 1.11
[ https://issues.apache.org/jira/browse/SPARK-34276?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17396441#comment-17396441 ] Gengliang Wang commented on SPARK-34276: ok, then shall we close this one? > Check the unreleased/unresolved JIRAs/PRs of Parquet 1.11 > -- > > Key: SPARK-34276 > URL: https://issues.apache.org/jira/browse/SPARK-34276 > Project: Spark > Issue Type: Task > Components: Build, SQL >Affects Versions: 3.2.0 >Reporter: Yuming Wang >Priority: Blocker > > Before the release, we need to double check the unreleased/unresolved > JIRAs/PRs of Parquet 1.11 and then decide whether we should upgrade/revert > Parquet. At the same time, we should encourage the whole community to do the > compatibility and performance tests for their production workloads, including > both read and write code paths. > More details: > https://github.com/apache/spark/pull/26804#issuecomment-768790620 -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-36458) MinHashLSH.approxSimilarityJoin should not required inputCol if output exist
[ https://issues.apache.org/jira/browse/SPARK-36458?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Thai Thien updated SPARK-36458: --- Description: Refer to the documentation and example code for MinHashLSH [https://spark.apache.org/docs/latest/ml-features#minhash-for-jaccard-distance] The example states that: We could avoid computing hashes by passing in the already-transformed dataset, e.g. `model.approxSimilarityJoin(transformedA, transformedB, 0.6)` However, inputCol is still required in transformedA and transformedB even if they already have outputCol. Code that should work, but doesn't: {code:java} from pyspark.ml.feature import MinHashLSH from pyspark.ml.linalg import Vectors from pyspark.sql.functions import col dataA = [(0, Vectors.sparse(6, [0, 1, 2], [1.0, 1.0, 1.0]),), (1, Vectors.sparse(6, [2, 3, 4], [1.0, 1.0, 1.0]),), (2, Vectors.sparse(6, [0, 2, 4], [1.0, 1.0, 1.0]),)] dfA = spark.createDataFrame(dataA, ["id", "features"]) dataB = [(3, Vectors.sparse(6, [1, 3, 5], [1.0, 1.0, 1.0]),), (4, Vectors.sparse(6, [2, 3, 5], [1.0, 1.0, 1.0]),), (5, Vectors.sparse(6, [1, 2, 4], [1.0, 1.0, 1.0]),)] dfB = spark.createDataFrame(dataB, ["id", "features"]) key = Vectors.sparse(6, [1, 3], [1.0, 1.0]) mh = MinHashLSH(inputCol="features", outputCol="hashes", numHashTables=5) model = mh.fit(dfA) transformedA = model.transform(dfA).select("id", "hashes") transformedB = model.transform(dfB).select("id", "hashes") model.approxSimilarityJoin(transformedA, transformedB, 0.6, distCol="JaccardDistance")\ .select(col("datasetA.id").alias("idA"), col("datasetB.id").alias("idB"), col("JaccardDistance")).show() {code} In the code above, I discard the column `features` but keep the column `hashes`, which is the output data. approxSimilarityJoin should work on `hashes` (the outputCol) alone, which exists, and ignore the missing `features` (the inputCol). 
Being able to transform the data beforehand and drop inputCol can make the input data much smaller and prevents confusion about the tip "_We could avoid computing hashes by passing in the already-transformed dataset_". was: Refer to documents and example code in MinHashLSH [https://spark.apache.org/docs/latest/ml-features#minhash-for-jaccard-distance] The example written that: We could avoid computing hashes by passing in the already-transformed dataset, e.g. `model.approxSimilarityJoin(transformedA, transformedB, 0.6)` However, inputCol still required in transformedA and transformedB even if they already have outputCol. An code that should work but it doesn't {code:java} from pyspark.ml.feature import MinHashLSH from pyspark.ml.linalg import Vectors from pyspark.sql.functions import col dataA = [(0, Vectors.sparse(6, [0, 1, 2], [1.0, 1.0, 1.0]),), (1, Vectors.sparse(6, [2, 3, 4], [1.0, 1.0, 1.0]),), (2, Vectors.sparse(6, [0, 2, 4], [1.0, 1.0, 1.0]),)] dfA = spark.createDataFrame(dataA, ["id", "features"]) dataB = [(3, Vectors.sparse(6, [1, 3, 5], [1.0, 1.0, 1.0]),), (4, Vectors.sparse(6, [2, 3, 5], [1.0, 1.0, 1.0]),), (5, Vectors.sparse(6, [1, 2, 4], [1.0, 1.0, 1.0]),)] dfB = spark.createDataFrame(dataB, ["id", "features"]) key = Vectors.sparse(6, [1, 3], [1.0, 1.0]) mh = MinHashLSH(inputCol="features", outputCol="hashes", numHashTables=5) model = mh.fit(dfA) transformedA = model.transform(dfA).select("id", "hashes") transformedB = model.transform(dfB).select("id", "hashes") model.approxSimilarityJoin(transformedA, transformedB, 0.6, distCol="JaccardDistance")\ .select(col("datasetA.id").alias("idA"), col("datasetB.id").alias("idB"), col("JaccardDistance")).show() {code} As in the code I give, I discard columns `features` but keep column `hashes` which is output data. approxSimilarityJoin should only work on `hashes` (the outputCol), which is exist and ignore the lack of `features` (the inputCol). 
Be able to transform the data beforehand and remove inputCol can make input data much smaller and prevent confusion about the tip "_We could avoid computing hashes by passing in the already-transformed dataset_". > MinHashLSH.approxSimilarityJoin should not required inputCol if output exist > > > Key: SPARK-36458 > URL: https://issues.apache.org/jira/browse/SPARK-36458 > Project: Spark > Issue Type: Improvement > Components: ML >Affects Versions: 3.1.1 >Reporter: Thai Thien >Priority: Minor > > Refer to documents and example code in MinHashLSH > > [https://spark.apache.org/docs/latest/ml-features#minhash-for-jaccard-distance] > The example written that: > We could avoid computing hashes by passing in the already-transformed > dataset, e.g. `model.approxSimilarityJoin(transformedA, transformedB, 0.6)` > However, inputCol still required in transformedA and transformedB even if > they alread
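As background for the PySpark example above: an approximate similarity join only needs the MinHash signatures, not the raw `features` column. A toy pure-Python MinHash (illustrative only, not Spark's implementation; the hash parameters are made up) shows that Jaccard similarity can be estimated from the signatures alone:

```python
import random

def make_hashes(k: int, seed: int = 42, p: int = 2_147_483_647):
    """k random affine hash functions h(x) = (a*x + b) mod p."""
    rng = random.Random(seed)
    return [(rng.randrange(1, p), rng.randrange(0, p), p) for _ in range(k)]

def signature(items, hashes):
    """MinHash signature: for each hash function, the minimum hash over the set."""
    return [min((a * x + b) % p for x in items) for a, b, p in hashes]

def approx_jaccard(sig_a, sig_b):
    """The fraction of matching signature positions estimates Jaccard similarity."""
    return sum(x == y for x, y in zip(sig_a, sig_b)) / len(sig_a)

hashes = make_hashes(100)
a, b = {0, 1, 2}, {1, 3, 5}  # true Jaccard similarity = 1/5
print(approx_jaccard(signature(a, hashes), signature(b, hashes)))
```

Since only the signatures enter the comparison, a join over already-transformed datasets need not recompute (or even see) the original feature vectors, which is the behavior the issue requests.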
[jira] [Commented] (SPARK-36463) Prohibit update mode in native support of session window
[ https://issues.apache.org/jira/browse/SPARK-36463?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17396440#comment-17396440 ] L. C. Hsieh commented on SPARK-36463: - Thanks [~kabhwan], [~Gengliang.Wang]. I will do the review ASAP. > Prohibit update mode in native support of session window > > > Key: SPARK-36463 > URL: https://issues.apache.org/jira/browse/SPARK-36463 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 3.2.0 >Reporter: Jungtaek Lim >Priority: Blocker > > The semantic of update mode for native support of session window seems to be > broken. > Strictly speaking, it doesn't break the semantic based on our explanation of > update mode: > {quote} > Update Mode - Only the rows that were updated in the Result Table since the > last trigger will be written to the external storage (available since Spark > 2.1.1). Note that this is different from the Complete Mode in that this mode > only outputs the rows that have changed since the last trigger. If the query > doesn’t contain aggregations, it will be equivalent to Append mode. > {quote} > But given that the grouping key changes due to the nature of session windows, > there is no way to "upsert" the output into the destination. If end users try to > "upsert" the output based on the grouping key, it is highly likely that a > single session window output will be written into multiple rows across > multiple updates. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
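Why upserting session-window output breaks can be shown with a toy example (pure Python, illustrative names and gap value): a session emitted after one trigger is keyed by its window start, but a later event that merges into the session can move that start, so the previously written row can no longer be matched:

```python
# Illustrative only: why keyed upserts break for session windows.
GAP = 10  # session gap in seconds (made-up value)

def merge(window, ts):
    """Extend a (start, end) session window with a new event time."""
    start, end = window
    return (min(start, ts), max(end, ts + GAP))

# Trigger 1: events at t=20 and t=25 form session (20, 35); the sink
# "upserts" a row under the key ("alice", window_start=20).
w1 = merge((20, 20 + GAP), 25)
key1 = ("alice", w1[0])

# Trigger 2: a late event at t=12 falls within the gap and merges in,
# shifting the window start; the key is now ("alice", 12).
w2 = merge(w1, 12)
key2 = ("alice", w2[0])

print(key1, key2)  # keys differ -> the earlier upserted row is orphaned
```

The same session thus appears under two different keys across triggers, which is exactly the multiple-rows-per-session behavior the issue describes.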
[jira] [Commented] (SPARK-35579) Fix a bug in janino or work around it in Spark.
[ https://issues.apache.org/jira/browse/SPARK-35579?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17396436#comment-17396436 ] Wenchen Fan commented on SPARK-35579: - The janino upgrade was reverted. Let's retarget to 3.3 > Fix a bug in janino or work around it in Spark. > --- > > Key: SPARK-35579 > URL: https://issues.apache.org/jira/browse/SPARK-35579 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.2.0 >Reporter: Wenchen Fan >Priority: Blocker > > See the test in SPARK-35578 -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-35579) Fix a bug in janino or work around it in Spark.
[ https://issues.apache.org/jira/browse/SPARK-35579?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan updated SPARK-35579: Affects Version/s: (was: 3.2.0) 3.3.0 > Fix a bug in janino or work around it in Spark. > --- > > Key: SPARK-35579 > URL: https://issues.apache.org/jira/browse/SPARK-35579 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.3.0 >Reporter: Wenchen Fan >Priority: Critical > > See the test in SPARK-35578 -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-35579) Fix a bug in janino or work around it in Spark.
[ https://issues.apache.org/jira/browse/SPARK-35579?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan updated SPARK-35579: Priority: Critical (was: Blocker) > Fix a bug in janino or work around it in Spark. > --- > > Key: SPARK-35579 > URL: https://issues.apache.org/jira/browse/SPARK-35579 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.2.0 >Reporter: Wenchen Fan >Priority: Critical > > See the test in SPARK-35578 -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-34276) Check the unreleased/unresolved JIRAs/PRs of Parquet 1.11
[ https://issues.apache.org/jira/browse/SPARK-34276?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17396435#comment-17396435 ] Yuming Wang commented on SPARK-34276: - We have used Parquet 1.11/1.12 in the production environment for half a year and have seen no obvious performance issues. > Check the unreleased/unresolved JIRAs/PRs of Parquet 1.11 > -- > > Key: SPARK-34276 > URL: https://issues.apache.org/jira/browse/SPARK-34276 > Project: Spark > Issue Type: Task > Components: Build, SQL >Affects Versions: 3.2.0 >Reporter: Yuming Wang >Priority: Blocker > > Before the release, we need to double check the unreleased/unresolved > JIRAs/PRs of Parquet 1.11 and then decide whether we should upgrade/revert > Parquet. At the same time, we should encourage the whole community to do the > compatibility and performance tests for their production workloads, including > both read and write code paths. > More details: > https://github.com/apache/spark/pull/26804#issuecomment-768790620 -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-36463) Prohibit update mode in native support of session window
[ https://issues.apache.org/jira/browse/SPARK-36463?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17396434#comment-17396434 ] Gengliang Wang commented on SPARK-36463: [~kabhwan] Thanks for the info. I plan to cut RC1 next Monday. Please try to finish the PR this week. Thanks! > Prohibit update mode in native support of session window > > > Key: SPARK-36463 > URL: https://issues.apache.org/jira/browse/SPARK-36463 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 3.2.0 >Reporter: Jungtaek Lim >Priority: Blocker > > The semantic of update mode for native support of session window seems to be > broken. > Strictly speaking, it doesn't break the semantic based on our explanation of > update mode: > {quote} > Update Mode - Only the rows that were updated in the Result Table since the > last trigger will be written to the external storage (available since Spark > 2.1.1). Note that this is different from the Complete Mode in that this mode > only outputs the rows that have changed since the last trigger. If the query > doesn’t contain aggregations, it will be equivalent to Append mode. > {quote} > But given that the grouping key changes due to the nature of session windows, > there is no way to "upsert" the output into the destination. If end users try to > "upsert" the output based on the grouping key, it is highly likely that a > single session window output will be written into multiple rows across > multiple updates. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-34276) Check the unreleased/unresolved JIRAs/PRs of Parquet 1.11
[ https://issues.apache.org/jira/browse/SPARK-34276?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17396433#comment-17396433 ] Yuming Wang commented on SPARK-34276: - It seems that there is no unreleased/unresolved JIRAs/PRs of Parquet 1.11/1.12. https://issues.apache.org/jira/issues/?jql=project%20%3D%20Parquet%20%20AND%20resolution%20%3D%20Unresolved%20AND%20affectedVersion%20in%20(1.11.0%2C%201.11.1%2C%201.11.2%2C%201.12.0%2C%201.12.1)%20%20%20ORDER%20BY%20createdDate%20%20DESC > Check the unreleased/unresolved JIRAs/PRs of Parquet 1.11 > -- > > Key: SPARK-34276 > URL: https://issues.apache.org/jira/browse/SPARK-34276 > Project: Spark > Issue Type: Task > Components: Build, SQL >Affects Versions: 3.2.0 >Reporter: Yuming Wang >Priority: Blocker > > Before the release, we need to double check the unreleased/unresolved > JIRAs/PRs of Parquet 1.11 and then decide whether we should upgrade/revert > Parquet. At the same time, we should encourage the whole community to do the > compatibility and performance tests for their production workloads, including > both read and write code paths. > More details: > https://github.com/apache/spark/pull/26804#issuecomment-768790620 -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-33807) Data Source V2: Remove read specific distributions
[ https://issues.apache.org/jira/browse/SPARK-33807?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17396432#comment-17396432 ] Gengliang Wang commented on SPARK-33807: [~aokolnychyi] I plan to cut 3.2 RC1 next Monday. What is the status of this one? Do we need to have it in Spark 3.2? > Data Source V2: Remove read specific distributions > -- > > Key: SPARK-33807 > URL: https://issues.apache.org/jira/browse/SPARK-33807 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.2.0 >Reporter: Anton Okolnychyi >Priority: Blocker > > We should remove the read-specific distributions for DS V2 as discussed > [here|https://github.com/apache/spark/pull/30706#discussion_r543059827]. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-34183) DataSource V2: Support required distribution and ordering in SS
[ https://issues.apache.org/jira/browse/SPARK-34183?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17396430#comment-17396430 ] Gengliang Wang commented on SPARK-34183: [~aokolnychyi] I plan to cut 3.2 RC1 next Monday. What is the status of this one? Do we need to have it in Spark 3.2? > DataSource V2: Support required distribution and ordering in SS > --- > > Key: SPARK-34183 > URL: https://issues.apache.org/jira/browse/SPARK-34183 > Project: Spark > Issue Type: Sub-task > Components: SQL, Structured Streaming >Affects Versions: 3.2.0 >Reporter: Anton Okolnychyi >Priority: Blocker > > We need to support a required distribution and ordering for SS. See the > discussion > [here|https://github.com/apache/spark/pull/31083#issuecomment-763214597]. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-36463) Prohibit update mode in native support of session window
[ https://issues.apache.org/jira/browse/SPARK-36463?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17396424#comment-17396424 ] Jungtaek Lim commented on SPARK-36463: -- [~Gengliang.Wang] FYI as I marked this as a blocker. I would like to remove some functionality which is unclear about the desired behavior before releasing the feature. Also cc. to [~viirya] to see whether he could help reviewing the proposed change. > Prohibit update mode in native support of session window > > > Key: SPARK-36463 > URL: https://issues.apache.org/jira/browse/SPARK-36463 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 3.2.0 >Reporter: Jungtaek Lim >Priority: Blocker > > The semantic of update mode for native support of session window seems to be > broken. > Strictly saying, it doesn't break the semantic based on our explanation of > update mode: > {quote} > Update Mode - Only the rows that were updated in the Result Table since the > last trigger will be written to the external storage (available since Spark > 2.1.1). Note that this is different from the Complete Mode in that this mode > only outputs the rows that have changed since the last trigger. If the query > doesn’t contain aggregations, it will be equivalent to Append mode. > {quote} > But given the grouping key is changing due to the nature of session window, > there is no way to "upsert" the output into destination. If end users try to > "upsert" the output based on the grouping key, it is high likely that a > single session window output will be written into multiple rows across > multiple updates. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-34276) Check the unreleased/unresolved JIRAs/PRs of Parquet 1.11
[ https://issues.apache.org/jira/browse/SPARK-34276?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17396421#comment-17396421 ] Gengliang Wang commented on SPARK-34276: [~yumwang] I plan to cut 3.2 RC1 next Monday. What is the status of this one? Do we need to have it in Spark 3.2? > Check the unreleased/unresolved JIRAs/PRs of Parquet 1.11 > -- > > Key: SPARK-34276 > URL: https://issues.apache.org/jira/browse/SPARK-34276 > Project: Spark > Issue Type: Task > Components: Build, SQL >Affects Versions: 3.2.0 >Reporter: Yuming Wang >Priority: Blocker > > Before the release, we need to double check the unreleased/unresolved > JIRAs/PRs of Parquet 1.11 and then decide whether we should upgrade/revert > Parquet. At the same time, we should encourage the whole community to do the > compatibility and performance tests for their production workloads, including > both read and write code paths. > More details: > https://github.com/apache/spark/pull/26804#issuecomment-768790620 -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-34827) Support fetching shuffle blocks in batch with i/o encryption
[ https://issues.apache.org/jira/browse/SPARK-34827?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17396420#comment-17396420 ] Gengliang Wang commented on SPARK-34827: [~dongjoon] I plan to cut 3.2 RC1 next Monday. What is the status of this one? Do we need to have it in Spark 3.2? > Support fetching shuffle blocks in batch with i/o encryption > > > Key: SPARK-34827 > URL: https://issues.apache.org/jira/browse/SPARK-34827 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.2.0 >Reporter: Dongjoon Hyun >Priority: Blocker > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-36463) Prohibit update mode in native support of session window
[ https://issues.apache.org/jira/browse/SPARK-36463?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-36463: Assignee: Apache Spark > Prohibit update mode in native support of session window > > > Key: SPARK-36463 > URL: https://issues.apache.org/jira/browse/SPARK-36463 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 3.2.0 >Reporter: Jungtaek Lim >Assignee: Apache Spark >Priority: Blocker > > The semantic of update mode for native support of session window seems to be > broken. > Strictly saying, it doesn't break the semantic based on our explanation of > update mode: > {quote} > Update Mode - Only the rows that were updated in the Result Table since the > last trigger will be written to the external storage (available since Spark > 2.1.1). Note that this is different from the Complete Mode in that this mode > only outputs the rows that have changed since the last trigger. If the query > doesn’t contain aggregations, it will be equivalent to Append mode. > {quote} > But given the grouping key is changing due to the nature of session window, > there is no way to "upsert" the output into destination. If end users try to > "upsert" the output based on the grouping key, it is high likely that a > single session window output will be written into multiple rows across > multiple updates. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-36463) Prohibit update mode in native support of session window
[ https://issues.apache.org/jira/browse/SPARK-36463?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-36463: Assignee: (was: Apache Spark) > Prohibit update mode in native support of session window > > > Key: SPARK-36463 > URL: https://issues.apache.org/jira/browse/SPARK-36463 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 3.2.0 >Reporter: Jungtaek Lim >Priority: Blocker > > The semantic of update mode for native support of session window seems to be > broken. > Strictly saying, it doesn't break the semantic based on our explanation of > update mode: > {quote} > Update Mode - Only the rows that were updated in the Result Table since the > last trigger will be written to the external storage (available since Spark > 2.1.1). Note that this is different from the Complete Mode in that this mode > only outputs the rows that have changed since the last trigger. If the query > doesn’t contain aggregations, it will be equivalent to Append mode. > {quote} > But given the grouping key is changing due to the nature of session window, > there is no way to "upsert" the output into destination. If end users try to > "upsert" the output based on the grouping key, it is high likely that a > single session window output will be written into multiple rows across > multiple updates. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-36463) Prohibit update mode in native support of session window
[ https://issues.apache.org/jira/browse/SPARK-36463?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17396396#comment-17396396 ] Apache Spark commented on SPARK-36463: -- User 'HeartSaVioR' has created a pull request for this issue: https://github.com/apache/spark/pull/33689 > Prohibit update mode in native support of session window > > > Key: SPARK-36463 > URL: https://issues.apache.org/jira/browse/SPARK-36463 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 3.2.0 >Reporter: Jungtaek Lim >Priority: Blocker > > The semantic of update mode for native support of session window seems to be > broken. > Strictly saying, it doesn't break the semantic based on our explanation of > update mode: > {quote} > Update Mode - Only the rows that were updated in the Result Table since the > last trigger will be written to the external storage (available since Spark > 2.1.1). Note that this is different from the Complete Mode in that this mode > only outputs the rows that have changed since the last trigger. If the query > doesn’t contain aggregations, it will be equivalent to Append mode. > {quote} > But given the grouping key is changing due to the nature of session window, > there is no way to "upsert" the output into destination. If end users try to > "upsert" the output based on the grouping key, it is high likely that a > single session window output will be written into multiple rows across > multiple updates. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-36463) Prohibit update mode in native support of session window
[ https://issues.apache.org/jira/browse/SPARK-36463?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17396384#comment-17396384 ] Jungtaek Lim commented on SPARK-36463: -- Will submit a PR shortly. > Prohibit update mode in native support of session window > > > Key: SPARK-36463 > URL: https://issues.apache.org/jira/browse/SPARK-36463 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 3.2.0 >Reporter: Jungtaek Lim >Priority: Blocker > > The semantic of update mode for native support of session window seems to be > broken. > Strictly saying, it doesn't break the semantic based on our explanation of > update mode: > {quote} > Update Mode - Only the rows that were updated in the Result Table since the > last trigger will be written to the external storage (available since Spark > 2.1.1). Note that this is different from the Complete Mode in that this mode > only outputs the rows that have changed since the last trigger. If the query > doesn’t contain aggregations, it will be equivalent to Append mode. > {quote} > But given the grouping key is changing due to the nature of session window, > there is no way to "upsert" the output into destination. If end users try to > "upsert" the output based on the grouping key, it is high likely that a > single session window output will be written into multiple rows across > multiple updates. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-36463) Prohibit update mode in native support of session window
Jungtaek Lim created SPARK-36463: Summary: Prohibit update mode in native support of session window Key: SPARK-36463 URL: https://issues.apache.org/jira/browse/SPARK-36463 Project: Spark Issue Type: Bug Components: Structured Streaming Affects Versions: 3.2.0 Reporter: Jungtaek Lim The semantic of update mode for native support of session window seems to be broken. Strictly speaking, it doesn't break the semantics based on our explanation of update mode: {quote} Update Mode - Only the rows that were updated in the Result Table since the last trigger will be written to the external storage (available since Spark 2.1.1). Note that this is different from the Complete Mode in that this mode only outputs the rows that have changed since the last trigger. If the query doesn’t contain aggregations, it will be equivalent to Append mode. {quote} But given that the grouping key changes due to the nature of session windows, there is no way to "upsert" the output into the destination. If end users try to "upsert" the output based on the grouping key, it is highly likely that a single session window output will be written into multiple rows across multiple updates. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
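The failure mode described in the issue above can be sketched outside Spark. The snippet below is a plain-Python illustration (the 10-second gap, the user id, and the event times are assumptions for the example, not values from the issue): as late events arrive, a session's start time — and therefore its grouping key — changes, so a sink that upserts on (user, session_start) ends up with multiple rows for one logical session.

```python
# Plain-Python sketch (not Spark code) of why "update" output mode breaks
# for session windows: the session's identity changes as events arrive.

GAP = 10  # assumed session gap in seconds


def merge_event(sessions, user, ts):
    """Merge a new event time into the user's session list and return
    the (start, end) of the session the event ended up in."""
    merged = [ts, ts]
    rest = []
    for start, end in sessions.get(user, []):
        # Overlaps (within the gap) get merged into one session.
        if ts - GAP <= end and start <= ts + GAP:
            merged = [min(merged[0], start), max(merged[1], end)]
        else:
            rest.append((start, end))
    sessions[user] = rest + [tuple(merged)]
    return tuple(merged)


sessions = {}
# Trigger 1: event at t=100 emits session (100, 100)
first = merge_event(sessions, "u1", 100)
# Trigger 2: a late event at t=95 extends the SAME session to (95, 100)
second = merge_event(sessions, "u1", 95)

# An upsert keyed on (user, session_start) now sees two distinct keys for
# one logical session -- the earlier row (100, 100) is never overwritten.
print(first, second)  # (100, 100) (95, 100)
```

This is exactly the "single session window output written into multiple rows across multiple updates" scenario from the description.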
[jira] [Updated] (SPARK-36458) MinHashLSH.approxSimilarityJoin should not required inputCol if output exist
[ https://issues.apache.org/jira/browse/SPARK-36458?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Thai Thien updated SPARK-36458: --- Component/s: (was: Spark Core) ML > MinHashLSH.approxSimilarityJoin should not required inputCol if output exist > > > Key: SPARK-36458 > URL: https://issues.apache.org/jira/browse/SPARK-36458 > Project: Spark > Issue Type: Improvement > Components: ML >Affects Versions: 3.1.1 >Reporter: Thai Thien >Priority: Minor > > Refer to the documentation and example code for MinHashLSH > > [https://spark.apache.org/docs/latest/ml-features#minhash-for-jaccard-distance] > The example states that: > We could avoid computing hashes by passing in the already-transformed > dataset, e.g. `model.approxSimilarityJoin(transformedA, transformedB, 0.6)` > However, inputCol is still required in transformedA and transformedB even if > they already have outputCol. > Code that should work but doesn't: > > > {code:java} > from pyspark.ml.feature import MinHashLSH > from pyspark.ml.linalg import Vectors > from pyspark.sql.functions import col > dataA = [(0, Vectors.sparse(6, [0, 1, 2], [1.0, 1.0, 1.0]),), > (1, Vectors.sparse(6, [2, 3, 4], [1.0, 1.0, 1.0]),), > (2, Vectors.sparse(6, [0, 2, 4], [1.0, 1.0, 1.0]),)] > dfA = spark.createDataFrame(dataA, ["id", "features"]) > dataB = [(3, Vectors.sparse(6, [1, 3, 5], [1.0, 1.0, 1.0]),), > (4, Vectors.sparse(6, [2, 3, 5], [1.0, 1.0, 1.0]),), > (5, Vectors.sparse(6, [1, 2, 4], [1.0, 1.0, 1.0]),)] > dfB = spark.createDataFrame(dataB, ["id", "features"]) > key = Vectors.sparse(6, [1, 3], [1.0, 1.0]) > mh = MinHashLSH(inputCol="features", outputCol="hashes", numHashTables=5) > model = mh.fit(dfA) > transformedA = model.transform(dfA).select("id", "hashes") > transformedB = model.transform(dfB).select("id", "hashes") > model.approxSimilarityJoin(transformedA, transformedB, 0.6, > distCol="JaccardDistance")\ > .select(col("datasetA.id").alias("idA"), > col("datasetB.id").alias("idB"), > col("JaccardDistance")).show() > {code} > As in the code I gave, I discard the column `features` but keep the column `hashes`, > which is the output data. > approxSimilarityJoin should work only on `hashes` (the outputCol), which > exists, and ignore the lack of `features` (the inputCol). > Being able to transform the data beforehand and remove inputCol can make the input > data much smaller and prevent confusion about the tip "_We could avoid > computing hashes by passing in the already-transformed dataset_". -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-36459) Date Value '0001-01-01' changes to '0001-12-30' when inserted into a parquet hive table
[ https://issues.apache.org/jira/browse/SPARK-36459?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17396376#comment-17396376 ] Hyukjin Kwon commented on SPARK-36459: -- That should be fixed from Spark 3.0.0. Can you try that out? > Date Value '0001-01-01' changes to '0001-12-30' when inserted into a parquet > hive table > --- > > Key: SPARK-36459 > URL: https://issues.apache.org/jira/browse/SPARK-36459 > Project: Spark > Issue Type: Bug > Components: Spark Core, Spark Shell >Affects Versions: 2.4.4 >Reporter: sindhura alluri >Priority: Major > > Hi All, > we are seeing this issue on spark 2.4.4. Below are the steps to reproduce it. > *Login in to hive terminal on cluster and create below tables.* > create table t_src(dob timestamp); > insert into t_src values('0001-01-01 00:00:00.0'); > create table t_tgt(dob timestamp) stored as parquet; > > *Spark-shell steps :* > > import org.apache.spark.sql.hive.HiveContext > val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc) > val q0 = "TRUNCATE table t_tgt" > val q1 = "SELECT alias.dob as a0 FROM t_src alias" > val q2 = "INSERT INTO TABLE t_tgt SELECT tbl0.a0 as c0 FROM tbl0" > sqlContext.sql(q0) > sqlContext.sql(q1).select("a0").createOrReplaceTempView("tbl0") > sqlContext.sql(q2) > > After this check the contents of target table t_tgt. You will see the date > "0001-01-01 00:00:00" changed to "0001-12-30 00:00:00". > select * from t_tgt; > Is this a known issue? Is it fixed in any subsequent releases? > Thanks & regards, > Sindhura Alluri -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
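As context for the comment above (a hedged aside, not part of the Jira thread): a likely cause of the two-day shift is that Spark 2.4 handled Parquet dates/timestamps using the hybrid Julian calendar, while Spark 3.0 switched to the proleptic Gregorian calendar. Around year 1 the two calendars differ by exactly two days, which matches the two-day shift of '0001-01-01'. A minimal sketch of that offset using the standard Julian-day-number formulas:

```python
# Sketch of the Julian vs. proleptic Gregorian calendar offset around
# year 1, using the standard Julian-day-number (JDN) conversion formulas.

def jdn_julian(y, m, d):
    """Julian day number for a date in the (proleptic) Julian calendar."""
    a = (14 - m) // 12
    y2 = y + 4800 - a
    m2 = m + 12 * a - 3
    return d + (153 * m2 + 2) // 5 + 365 * y2 + y2 // 4 - 32083


def jdn_gregorian(y, m, d):
    """Julian day number for a date in the proleptic Gregorian calendar."""
    a = (14 - m) // 12
    y2 = y + 4800 - a
    m2 = m + 12 * a - 3
    return (d + (153 * m2 + 2) // 5 + 365 * y2
            + y2 // 4 - y2 // 100 + y2 // 400 - 32045)


# The same day written under one calendar and reread under the other
# lands two days off near year 1.
shift = jdn_gregorian(1, 1, 1) - jdn_julian(1, 1, 1)
print(shift)  # 2
```

Spark 3.0's calendar-rebase handling is why the follow-up comment suggests the issue should be fixed there.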
[jira] [Resolved] (SPARK-36460) Pull out NoOpMergedShuffleFileManager inner class outside
[ https://issues.apache.org/jira/browse/SPARK-36460?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gengliang Wang resolved SPARK-36460. Fix Version/s: 3.2.0 Resolution: Fixed Issue resolved by pull request 33688 [https://github.com/apache/spark/pull/33688] > Pull out NoOpMergedShuffleFileManager inner class outside > - > > Key: SPARK-36460 > URL: https://issues.apache.org/jira/browse/SPARK-36460 > Project: Spark > Issue Type: Sub-task > Components: Shuffle, Spark Core >Affects Versions: 3.2.0 >Reporter: Venkata krishnan Sowrirajan >Assignee: Venkata krishnan Sowrirajan >Priority: Major > Fix For: 3.2.0 > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-36460) Pull out NoOpMergedShuffleFileManager inner class outside
[ https://issues.apache.org/jira/browse/SPARK-36460?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gengliang Wang reassigned SPARK-36460: -- Assignee: Venkata krishnan Sowrirajan > Pull out NoOpMergedShuffleFileManager inner class outside > - > > Key: SPARK-36460 > URL: https://issues.apache.org/jira/browse/SPARK-36460 > Project: Spark > Issue Type: Sub-task > Components: Shuffle, Spark Core >Affects Versions: 3.2.0 >Reporter: Venkata krishnan Sowrirajan >Assignee: Venkata krishnan Sowrirajan >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-36377) Fix documentation in spark-env.sh.template
[ https://issues.apache.org/jira/browse/SPARK-36377?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reassigned SPARK-36377: Assignee: Yuto Akutsu > Fix documentation in spark-env.sh.template > -- > > Key: SPARK-36377 > URL: https://issues.apache.org/jira/browse/SPARK-36377 > Project: Spark > Issue Type: Documentation > Components: Documentation, Spark Submit >Affects Versions: 3.1.2 >Reporter: Yuto Akutsu >Assignee: Yuto Akutsu >Priority: Major > > Some options in the "Options read in YARN client/cluster mode" section in > spark-env.sh.template are read by other modes too (e.g. SPARK_CONF_DIR, > SPARK_EXECUTOR_CORES, etc.), so we should re-document them to help users > distinguish what's only read by YARN mode from what's not. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-36377) Fix documentation in spark-env.sh.template
[ https://issues.apache.org/jira/browse/SPARK-36377?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-36377. -- Fix Version/s: 3.3.0 Resolution: Fixed Issue resolved by pull request 33604 [https://github.com/apache/spark/pull/33604] > Fix documentation in spark-env.sh.template > -- > > Key: SPARK-36377 > URL: https://issues.apache.org/jira/browse/SPARK-36377 > Project: Spark > Issue Type: Documentation > Components: Documentation, Spark Submit >Affects Versions: 3.1.2 >Reporter: Yuto Akutsu >Assignee: Yuto Akutsu >Priority: Major > Fix For: 3.3.0 > > > Some options in the "Options read in YARN client/cluster mode" section in > spark-env.sh.template are read by other modes too (e.g. SPARK_CONF_DIR, > SPARK_EXECUTOR_CORES, etc.), so we should re-document them to help users > distinguish what's only read by YARN mode from what's not. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-36332) Cleanup RemoteBlockPushResolver log messages
[ https://issues.apache.org/jira/browse/SPARK-36332?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] wuyi resolved SPARK-36332. -- Fix Version/s: 3.3.0 3.2.0 Assignee: Venkata krishnan Sowrirajan Resolution: Fixed Issue resolved by https://github.com/apache/spark/pull/33561 > Cleanup RemoteBlockPushResolver log messages > > > Key: SPARK-36332 > URL: https://issues.apache.org/jira/browse/SPARK-36332 > Project: Spark > Issue Type: Sub-task > Components: Shuffle, Spark Core >Affects Versions: 3.2.0 >Reporter: Venkata krishnan Sowrirajan >Assignee: Venkata krishnan Sowrirajan >Priority: Minor > Fix For: 3.2.0, 3.3.0 > > > Minor cleanups to RemoteBlockPushResolver to use AppShufflePartitionsInfo > toString() for log messages. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-36386) Fix DataFrame groupby-expanding to follow pandas 1.3
[ https://issues.apache.org/jira/browse/SPARK-36386?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-36386. -- Fix Version/s: 3.3.0 Assignee: Haejoon Lee Resolution: Fixed Fixed in https://github.com/apache/spark/pull/33646 > Fix DataFrame groupby-expanding to follow pandas 1.3 > > > Key: SPARK-36386 > URL: https://issues.apache.org/jira/browse/SPARK-36386 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 3.2.0 >Reporter: Takuya Ueshin >Assignee: Haejoon Lee >Priority: Major > Fix For: 3.3.0 > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-36388) Fix DataFrame groupby-rolling to follow pandas 1.3
[ https://issues.apache.org/jira/browse/SPARK-36388?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-36388. -- Fix Version/s: 3.3.0 Assignee: Haejoon Lee Resolution: Fixed Fixed in https://github.com/apache/spark/pull/33646 > Fix DataFrame groupby-rolling to follow pandas 1.3 > -- > > Key: SPARK-36388 > URL: https://issues.apache.org/jira/browse/SPARK-36388 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 3.2.0 >Reporter: Takuya Ueshin >Assignee: Haejoon Lee >Priority: Major > Fix For: 3.3.0 > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-32709) Write Hive ORC/Parquet bucketed table with hivehash (for Hive 1,2)
[ https://issues.apache.org/jira/browse/SPARK-32709?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17396330#comment-17396330 ] Shashank Pedamallu commented on SPARK-32709: Issue observed at Lyft. When attempting to apply the patch on a production query, the query eventually fails due to S3 throttling caused by too many files generated by the bucketing, unlike Hive. To overcome the too-many-small-files problem, we tried reducing the number of reducers, which caused OOM issues. We enabled adaptive query execution, which reduced the number of reducers to 44. But even with 44 reducers and the number of buckets being 1024, the final number of files, 45057, is a little higher compared to the 1024 end files in Hive. This method did not seem to work effectively on larger tables (even with AQE, we would get hit by S3 throttling). *Query (I anonymized the names for privacy. Please let me know if that's a concern)*: {noformat} -- Default configurations SET hive.exec.compress.output=true; SET hive.input.format=org.apache.hadoop.hive.ql.io.CombineHiveInputFormat; SET mapred.output.compress=true; SET parquet.compression=SNAPPY; SET mapreduce.input.fileinputformat.split.maxsize=25600; SET mapreduce.input.fileinputformat.split.minsize=6400; SET hive.exec.parallel=true; SET hive.exec.parallel.thread.number=32; SET hive.hadoop.supports.splittable.combineinputformat=true; SET hive.exec.dynamic.partition=true; SET hive.exec.dynamic.partition.mode=nonstrict; SET hive.exec.max.dynamic.partitions=1000; SET hive.exec.max.dynamic.partitions.pernode=1000; -- User configurations SET hive.enforce.bucketing = true; SET hive.mapred.mode = nonstrict; SET hive.exec.max.created.files=180; SET hive.execution.engine=tez; -- spark configs SET spark.executor.memory=8g; SET spark.driver.memoryOverhead=4g; SET spark.driver.memory=12g; SET spark.sql.adaptive.advisoryPartitionSizeInBytes=1536MB; SET spark.sql.adaptive.coalescePartitions.minPartitionNum=16; DROP TABLE IF EXISTS anon_table_a; WITH anon_table_a AS ( SELECT col_a, col_b, col_c, RANK () OVER (PARTITION BY col_b ORDER BY col_c) AS alias_a FROM schema_a.src_table_a WHERE col_d IS NOT NULL DISTRIBUTE BY col_b SORT BY col_c ), anon_table_b AS ( SELECT col_e FROM ( SELECT col_e, ROW_NUMBER() OVER (PARTITION BY col_e) AS rn FROM schema_b.src_table_b WHERE 1 = 1 ) v WHERE rn = 1 ) INSERT OVERWRITE TABLE personal_schema.temp_dest_table SELECT data.* FROM ( SELECT col_a, col_b, alias_a FROM anon_table_a ) data LEFT OUTER JOIN anon_table_b ON data.col_b = anon_table_b.col_e WHERE (anon_table_b.col_e IS NULL OR data.col_b IS NULL ); DROP TABLE IF EXISTS personal_schema.dest_table; ALTER TABLE personal_schema.temp_dest_table RENAME TO personal_schema.dest_table {noformat} *Spark shuffle metrics:* !91275701_stage6_metrics.png|width=300,height=160! The average file size in the final path is ~500kb when writing from Spark, compared to 29MB in Hive. *Question:* So, just wanted to raise the question again about whether there is any active / planned effort to support 1 file per reducer for bucketed tables? > Write Hive ORC/Parquet bucketed table with hivehash (for Hive 1,2) > -- > > Key: SPARK-32709 > URL: https://issues.apache.org/jira/browse/SPARK-32709 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.1.0 >Reporter: Cheng Su >Priority: Minor > Attachments: 91275701_stage6_metrics.png > > > The Hive ORC/Parquet write code path is the same as the data source v1 code path > (FileFormatWriter). This JIRA is to add support for writing Hive ORC/Parquet > bucketed tables with hivehash. The change is to customize `bucketIdExpression` to > use hivehash when the table is a Hive bucketed table and the Hive version is > 1.x.y or 2.x.y. > > This will allow us to write Hive/Presto-compatible bucketed tables for Hive 1 and > 2.
-- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-32709) Write Hive ORC/Parquet bucketed table with hivehash (for Hive 1,2)
[ https://issues.apache.org/jira/browse/SPARK-32709?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shashank Pedamallu updated SPARK-32709: --- Attachment: 91275701_stage6_metrics.png > Write Hive ORC/Parquet bucketed table with hivehash (for Hive 1,2) > -- > > Key: SPARK-32709 > URL: https://issues.apache.org/jira/browse/SPARK-32709 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.1.0 >Reporter: Cheng Su >Priority: Minor > Attachments: 91275701_stage6_metrics.png > > > Hive ORC/Parquet write code path is same as data source v1 code path > (FileFormatWriter). This JIRA is to add the support to write Hive ORC/Parquet > bucketed table with hivehash. The change is to custom `bucketIdExpression` to > use hivehash when the table is Hive bucketed table, and the Hive version is > 1.x.y or 2.x.y. > > This will allow us write Hive/Presto-compatible bucketed table for Hive 1 and > 2. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
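For readers following the hivehash-compatibility discussion in the SPARK-32709 thread above, Hive's bucket assignment can be sketched in a few lines. This assumes the commonly documented behavior — the hivehash of an integer column is the value itself, and the bucket id is `(hash & Integer.MAX_VALUE) % numBuckets` — which is stated here as an assumption for illustration, not verified against a specific Hive version.

```python
# Sketch of Hive-style bucket assignment for integer keys.
# Assumption: hivehash(int) == the int value itself (as commonly described
# for Hive's ObjectInspectorUtils hashing of integer types).

INT_MAX = 0x7FFFFFFF  # Java's Integer.MAX_VALUE


def hive_bucket(key: int, num_buckets: int) -> int:
    """Bucket id Hive would assign to an integer key."""
    return (key & INT_MAX) % num_buckets


# With 1024 buckets and one reducer writing each bucket, Hive produces
# exactly 1024 files -- the behavior the comment above contrasts with
# Spark producing one file per reducer per bucket.
print(hive_bucket(5, 1024))   # 5
print(hive_bucket(-3, 1024))  # 1021  (two's-complement masking of -3)
```

Writing bucketed output that Hive 1/2 or Presto can read requires Spark to reproduce this same assignment, which is why the JIRA proposes customizing `bucketIdExpression` to use hivehash.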
[jira] [Created] (SPARK-36462) Allow Spark on Kube to operate without polling or watchers
Holden Karau created SPARK-36462: Summary: Allow Spark on Kube to operate without polling or watchers Key: SPARK-36462 URL: https://issues.apache.org/jira/browse/SPARK-36462 Project: Spark Issue Type: Improvement Components: Kubernetes Affects Versions: 3.2.0 Reporter: Holden Karau Add an option to Spark on Kube to not track the individual executor pods and just assume K8s is doing what it's asked. This would be a developer feature intended for minimizing load on etcd & driver. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-36461) Enable ObjectHashAggregate for more Aggregate functions
[ https://issues.apache.org/jira/browse/SPARK-36461?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-36461: Assignee: L. C. Hsieh (was: Apache Spark) > Enable ObjectHashAggregate for more Aggregate functions > --- > > Key: SPARK-36461 > URL: https://issues.apache.org/jira/browse/SPARK-36461 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.3.0 >Reporter: L. C. Hsieh >Assignee: L. C. Hsieh >Priority: Major > > Enabling {{ObjectHashAggregate}} for more aggregate functions, as it has > better performance compared to {{SortAggregate}} according to current > benchmarks. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-36461) Enable ObjectHashAggregate for more Aggregate functions
[ https://issues.apache.org/jira/browse/SPARK-36461?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17396246#comment-17396246 ] Apache Spark commented on SPARK-36461: -- User 'viirya' has created a pull request for this issue: https://github.com/apache/spark/pull/33068 > Enable ObjectHashAggregate for more Aggregate functions > --- > > Key: SPARK-36461 > URL: https://issues.apache.org/jira/browse/SPARK-36461 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.3.0 >Reporter: L. C. Hsieh >Assignee: L. C. Hsieh >Priority: Major > > Enabling {{ObjectHashAggregate}} for more aggregate functions, as it has > better performance compared to {{SortAggregate}} according to current > benchmarks. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-36461) Enable ObjectHashAggregate for more Aggregate functions
[ https://issues.apache.org/jira/browse/SPARK-36461?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-36461: Assignee: Apache Spark (was: L. C. Hsieh) > Enable ObjectHashAggregate for more Aggregate functions > --- > > Key: SPARK-36461 > URL: https://issues.apache.org/jira/browse/SPARK-36461 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.3.0 >Reporter: L. C. Hsieh >Assignee: Apache Spark >Priority: Major > > Enabling {{ObjectHashAggregate}} for more aggregate functions, as it has > better performance compared to {{SortAggregate}} according to current > benchmarks. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-36461) Enable ObjectHashAggregate for more Aggregate functions
L. C. Hsieh created SPARK-36461: --- Summary: Enable ObjectHashAggregate for more Aggregate functions Key: SPARK-36461 URL: https://issues.apache.org/jira/browse/SPARK-36461 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.3.0 Reporter: L. C. Hsieh Assignee: L. C. Hsieh Enabling {{ObjectHashAggregate}} for more aggregate functions, as it has better performance compared to {{SortAggregate}} according to current benchmarks. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-36370) Avoid using SelectionMixin._builtin_table which is removed in pandas 1.3
[ https://issues.apache.org/jira/browse/SPARK-36370?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17396223#comment-17396223 ] Apache Spark commented on SPARK-36370: -- User 'Cedric-Magnan' has created a pull request for this issue: https://github.com/apache/spark/pull/33687 > Avoid using SelectionMixin._builtin_table which is removed in pandas 1.3 > > > Key: SPARK-36370 > URL: https://issues.apache.org/jira/browse/SPARK-36370 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 3.2.0 >Reporter: Takuya Ueshin >Assignee: Takuya Ueshin >Priority: Major > Fix For: 3.2.0 > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-36460) Pull out NoOpMergedShuffleFileManager inner class outside
[ https://issues.apache.org/jira/browse/SPARK-36460?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-36460: Assignee: (was: Apache Spark) > Pull out NoOpMergedShuffleFileManager inner class outside > - > > Key: SPARK-36460 > URL: https://issues.apache.org/jira/browse/SPARK-36460 > Project: Spark > Issue Type: Sub-task > Components: Shuffle, Spark Core >Affects Versions: 3.2.0 >Reporter: Venkata krishnan Sowrirajan >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-36460) Pull out NoOpMergedShuffleFileManager inner class outside
[ https://issues.apache.org/jira/browse/SPARK-36460?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-36460: Assignee: Apache Spark > Pull out NoOpMergedShuffleFileManager inner class outside > - > > Key: SPARK-36460 > URL: https://issues.apache.org/jira/browse/SPARK-36460 > Project: Spark > Issue Type: Sub-task > Components: Shuffle, Spark Core >Affects Versions: 3.2.0 >Reporter: Venkata krishnan Sowrirajan >Assignee: Apache Spark >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-36460) Pull out NoOpMergedShuffleFileManager inner class outside
[ https://issues.apache.org/jira/browse/SPARK-36460?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17396195#comment-17396195 ] Apache Spark commented on SPARK-36460: -- User 'venkata91' has created a pull request for this issue: https://github.com/apache/spark/pull/33688 > Pull out NoOpMergedShuffleFileManager inner class outside > - > > Key: SPARK-36460 > URL: https://issues.apache.org/jira/browse/SPARK-36460 > Project: Spark > Issue Type: Sub-task > Components: Shuffle, Spark Core >Affects Versions: 3.2.0 >Reporter: Venkata krishnan Sowrirajan >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-36454) Not push down partition filter to ORCScan for DSv2
[ https://issues.apache.org/jira/browse/SPARK-36454?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-36454. --- Fix Version/s: 3.2.0 Resolution: Fixed Issue resolved by pull request 33680 [https://github.com/apache/spark/pull/33680] > Not push down partition filter to ORCScan for DSv2 > -- > > Key: SPARK-36454 > URL: https://issues.apache.org/jira/browse/SPARK-36454 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.3.0 >Reporter: Huaxin Gao >Assignee: Apache Spark >Priority: Minor > Fix For: 3.2.0 > > > Seems to me that partition filter is only used for partition pruning and > shouldn't be pushed down to ORCScan. We don't push down partition filter to > ORCScan in DSv1, and we don't push down partition filter for parquet in both > DSv1 and DSv2. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-36460) Pull out NoOpMergedShuffleFileManager inner class outside
Venkata krishnan Sowrirajan created SPARK-36460: --- Summary: Pull out NoOpMergedShuffleFileManager inner class outside Key: SPARK-36460 URL: https://issues.apache.org/jira/browse/SPARK-36460 Project: Spark Issue Type: Sub-task Components: Shuffle, Spark Core Affects Versions: 3.2.0 Reporter: Venkata krishnan Sowrirajan -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-36459) Date Value '0001-01-01' changes to '0001-12-30' when inserted into a parquet hive table
sindhura alluri created SPARK-36459: --- Summary: Date Value '0001-01-01' changes to '0001-12-30' when inserted into a parquet hive table Key: SPARK-36459 URL: https://issues.apache.org/jira/browse/SPARK-36459 Project: Spark Issue Type: Bug Components: Spark Core, Spark Shell Affects Versions: 2.4.4 Reporter: sindhura alluri Hi All, we are seeing this issue on spark 2.4.4. Below are the steps to reproduce it. *Log in to the Hive terminal on the cluster and create the tables below.* create table t_src(dob timestamp); insert into t_src values('0001-01-01 00:00:00.0'); create table t_tgt(dob timestamp) stored as parquet; *Spark-shell steps :* import org.apache.spark.sql.hive.HiveContext val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc) val q0 = "TRUNCATE table t_tgt" val q1 = "SELECT alias.dob as a0 FROM t_src alias" val q2 = "INSERT INTO TABLE t_tgt SELECT tbl0.a0 as c0 FROM tbl0" sqlContext.sql(q0) sqlContext.sql(q1).select("a0").createOrReplaceTempView("tbl0") sqlContext.sql(q2) After this, check the contents of the target table t_tgt. You will see the date "0001-01-01 00:00:00" changed to "0001-12-30 00:00:00". select * from t_tgt; Is this a known issue? Is it fixed in any subsequent releases? Thanks & regards, Sindhura Alluri -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
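A shift of about two days around year 1 is consistent with the gap between the Julian calendar (used for ancient dates by the legacy hybrid calendar in Hive and Spark 2.x) and the proleptic Gregorian calendar; Spark 3.0 later introduced datetime rebase handling for exactly this class of problem. As a hedged illustration (plain Python, not Spark code), the Julian Day Number of 0001-01-01 differs by exactly two days between the two calendars:

```python
# Sketch (not Spark code): compute the Julian Day Number of 0001-01-01 under
# both calendars using the standard integer conversion formulas, to show the
# two-day gap that matches the reported date shift.

def jdn_gregorian(year: int, month: int, day: int) -> int:
    # proleptic Gregorian calendar (with century leap-year corrections)
    a = (14 - month) // 12
    y = year + 4800 - a
    m = month + 12 * a - 3
    return day + (153 * m + 2) // 5 + 365 * y + y // 4 - y // 100 + y // 400 - 32045

def jdn_julian(year: int, month: int, day: int) -> int:
    # Julian calendar (every fourth year is a leap year, no corrections)
    a = (14 - month) // 12
    y = year + 4800 - a
    m = month + 12 * a - 3
    return day + (153 * m + 2) // 5 + 365 * y + y // 4 - 32083

gap = jdn_gregorian(1, 1, 1) - jdn_julian(1, 1, 1)
print(gap)  # 2: writing under one calendar and reading under the other shifts ancient dates
```

So a day count stored under one calendar and reinterpreted under the other lands two days off, which is the magnitude of the shift reported above.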
[jira] [Commented] (SPARK-26208) Empty dataframe does not roundtrip for csv with header
[ https://issues.apache.org/jira/browse/SPARK-26208?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17396076#comment-17396076 ] Ranga Reddy commented on SPARK-26208: - The above code works only when the dataframe is created manually. The issue still persists when we create a dataframe by reading a Hive table. *Hive Table:* {code:java} CREATE EXTERNAL TABLE `test_empty_csv_table`( `col1` bigint, `col2` bigint) STORED AS ORC LOCATION '/tmp/test_empty_csv_table';{code} *spark-shell* {code:java} val tableName = "test_empty_csv_table" val emptyCSVFilePath = "/tmp/empty_csv_file" val df = spark.sql("select * from "+tableName) df.printSchema() df.write.format("csv").option("header", true).mode("overwrite").save(emptyCSVFilePath) val df2 = spark.read.option("header", true).csv(emptyCSVFilePath) {code} {code:java} org.apache.spark.sql.AnalysisException: Unable to infer schema for CSV. It must be specified manually.; at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$9.apply(DataSource.scala:208) at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$9.apply(DataSource.scala:208) at scala.Option.getOrElse(Option.scala:121) at org.apache.spark.sql.execution.datasources.DataSource.getOrInferFileFormatSchema(DataSource.scala:207) at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:393) at org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:239) at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:227) at org.apache.spark.sql.DataFrameReader.csv(DataFrameReader.scala:596) at org.apache.spark.sql.DataFrameReader.csv(DataFrameReader.scala:473) ... 
49 elided{code} > Empty dataframe does not roundtrip for csv with header > -- > > Key: SPARK-26208 > URL: https://issues.apache.org/jira/browse/SPARK-26208 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.0 > Environment: master branch, > commit 034ae305c33b1990b3c1a284044002874c343b4d, > date: Sun Nov 18 16:02:15 2018 +0800 >Reporter: koert kuipers >Assignee: Koert Kuipers >Priority: Minor > Fix For: 3.0.0 > > > when we write an empty part file for csv with header=true, we fail to write the > header, so the result cannot be read back in. > when header=true, a part file with zero rows should still have a header -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
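The roundtrip invariant the report asks for can be shown outside Spark with Python's stdlib `csv` module (an analogy, not Spark code): if the header is written even when there are zero rows, a reader can always recover the column names.

```python
import csv
import io

# Illustration (plain Python, not Spark): write the header regardless of row
# count, then read the file back and recover the schema from the header line.
header = ["col1", "col2"]
rows = []  # empty dataframe analogue

buf = io.StringIO()
writer = csv.writer(buf)
writer.writerow(header)   # header goes out even with zero data rows
writer.writerows(rows)

buf.seek(0)
reader = csv.reader(buf)
recovered_header = next(reader)   # schema survives the roundtrip
recovered_rows = list(reader)

print(recovered_header)  # ['col1', 'col2']
print(recovered_rows)    # []
```

On the Spark side, the usual workaround when inference fails on empty output is to pass an explicit schema to `spark.read.schema(...).csv(...)` instead of relying on header inference.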
[jira] [Assigned] (SPARK-20384) supporting value classes over primitives in DataSets
[ https://issues.apache.org/jira/browse/SPARK-20384?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean R. Owen reassigned SPARK-20384: Assignee: Mick Jermsurawong > supporting value classes over primitives in DataSets > > > Key: SPARK-20384 > URL: https://issues.apache.org/jira/browse/SPARK-20384 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.0 >Reporter: Daniel Davis >Assignee: Mick Jermsurawong >Priority: Minor > > As a spark user who uses value classes in scala for modelling domain objects, > I also would like to make use of them for datasets. > For example, I would like to use the {{User}} case class which is using a > value-class for its {{id}} as the type for a DataSet: > - the underlying primitive should be mapped to the value-class column > - functions on the column (for example, comparison) should only work if > defined on the value-class and use this implementation > - show() should pick up the toString method of the value-class > {code} > case class Id(value: Long) extends AnyVal { > def toString: String = value.toHexString > } > case class User(id: Id, name: String) > val ds = spark.sparkContext > .parallelize(0L to 12L).map(i => (i, f"name-$i")).toDS() > .withColumnRenamed("_1", "id") > .withColumnRenamed("_2", "name") > // mapping should work > val usrs = ds.as[User] > // show should use toString > usrs.show() > // comparison with long should throw exception, as not defined on Id > usrs.col("id") > 0L > {code} > For example `.show()` should use the toString of the `Id` value class: > {noformat} > +---+---+ > | id| name| > +---+---+ > | 0| name-0| > | 1| name-1| > | 2| name-2| > | 3| name-3| > | 4| name-4| > | 5| name-5| > | 6| name-6| > | 7| name-7| > | 8| name-8| > | 9| name-9| > | A|name-10| > | B|name-11| > | C|name-12| > +---+---+ > {noformat} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, 
e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-20384) supporting value classes over primitives in DataSets
[ https://issues.apache.org/jira/browse/SPARK-20384?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean R. Owen resolved SPARK-20384. -- Fix Version/s: 3.3.0 Resolution: Fixed Issue resolved by pull request 33205 [https://github.com/apache/spark/pull/33205] > supporting value classes over primitives in DataSets > > > Key: SPARK-20384 > URL: https://issues.apache.org/jira/browse/SPARK-20384 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.0 >Reporter: Daniel Davis >Assignee: Mick Jermsurawong >Priority: Minor > Fix For: 3.3.0 > > > As a spark user who uses value classes in scala for modelling domain objects, > I also would like to make use of them for datasets. > For example, I would like to use the {{User}} case class which is using a > value-class for its {{id}} as the type for a DataSet: > - the underlying primitive should be mapped to the value-class column > - functions on the column (for example, comparison) should only work if > defined on the value-class and use this implementation > - show() should pick up the toString method of the value-class > {code} > case class Id(value: Long) extends AnyVal { > def toString: String = value.toHexString > } > case class User(id: Id, name: String) > val ds = spark.sparkContext > .parallelize(0L to 12L).map(i => (i, f"name-$i")).toDS() > .withColumnRenamed("_1", "id") > .withColumnRenamed("_2", "name") > // mapping should work > val usrs = ds.as[User] > // show should use toString > usrs.show() > // comparison with long should throw exception, as not defined on Id > usrs.col("id") > 0L > {code} > For example `.show()` should use the toString of the `Id` value class: > {noformat} > +---+---+ > | id| name| > +---+---+ > | 0| name-0| > | 1| name-1| > | 2| name-2| > | 3| name-3| > | 4| name-4| > | 5| name-5| > | 6| name-6| > | 7| name-7| > | 8| name-8| > | 9| name-9| > | A|name-10| > | B|name-11| > | C|name-12| > +---+---+ > {noformat} -- This message was sent by 
Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-36432) Upgrade Jetty version to 9.4.43
[ https://issues.apache.org/jira/browse/SPARK-36432?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sajith A updated SPARK-36432: - Affects Version/s: 3.2.0 > Upgrade Jetty version to 9.4.43 > --- > > Key: SPARK-36432 > URL: https://issues.apache.org/jira/browse/SPARK-36432 > Project: Spark > Issue Type: Bug > Components: Build >Affects Versions: 3.2.0, 3.3.0 >Reporter: Sajith A >Assignee: Sajith A >Priority: Minor > Fix For: 3.2.0 > > > Upgrade Jetty version to 9.4.43.v20210629 in current Spark master in order to > fix vulnerability https://nvd.nist.gov/vuln/detail/CVE-2021-34429. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-36458) MinHashLSH.approxSimilarityJoin should not required inputCol if output exist
[ https://issues.apache.org/jira/browse/SPARK-36458?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Thai Thien updated SPARK-36458: --- Description: Refer to documents and example code in MinHashLSH [https://spark.apache.org/docs/latest/ml-features#minhash-for-jaccard-distance] The example notes that: We could avoid computing hashes by passing in the already-transformed dataset, e.g. `model.approxSimilarityJoin(transformedA, transformedB, 0.6)` However, inputCol is still required in transformedA and transformedB even if they already have the outputCol. Code that should work, but doesn't: {code:java} from pyspark.ml.feature import MinHashLSH from pyspark.ml.linalg import Vectors from pyspark.sql.functions import col dataA = [(0, Vectors.sparse(6, [0, 1, 2], [1.0, 1.0, 1.0]),), (1, Vectors.sparse(6, [2, 3, 4], [1.0, 1.0, 1.0]),), (2, Vectors.sparse(6, [0, 2, 4], [1.0, 1.0, 1.0]),)] dfA = spark.createDataFrame(dataA, ["id", "features"]) dataB = [(3, Vectors.sparse(6, [1, 3, 5], [1.0, 1.0, 1.0]),), (4, Vectors.sparse(6, [2, 3, 5], [1.0, 1.0, 1.0]),), (5, Vectors.sparse(6, [1, 2, 4], [1.0, 1.0, 1.0]),)] dfB = spark.createDataFrame(dataB, ["id", "features"]) key = Vectors.sparse(6, [1, 3], [1.0, 1.0]) mh = MinHashLSH(inputCol="features", outputCol="hashes", numHashTables=5) model = mh.fit(dfA) transformedA = model.transform(dfA).select("id", "hashes") transformedB = model.transform(dfB).select("id", "hashes") model.approxSimilarityJoin(transformedA, transformedB, 0.6, distCol="JaccardDistance")\ .select(col("datasetA.id").alias("idA"), col("datasetB.id").alias("idB"), col("JaccardDistance")).show() {code} In the code above, I discard the `features` column but keep the `hashes` column, which is the output data. approxSimilarityJoin should work only on `hashes` (the outputCol), which exists, and ignore the missing `features` (the inputCol). 
Be able to transform the data beforehand and remove inputCol can make input data much smaller and prevent confusion about the tip "_We could avoid computing hashes by passing in the already-transformed dataset_". was: Refer to documents and example code in MinHashLSH https://spark.apache.org/docs/latest/ml-features#minhash-for-jaccard-distance The example written that: ``` # We could avoid computing hashes by passing in the already-transformed dataset, e.g. # `model.approxSimilarityJoin(transformedA, transformedB, 0.6)` ``` However, inputCol still required in transformedA and transformedB even if they already have outputCol. An code that should work but it doesn't ``` from pyspark.ml.feature import MinHashLSH from pyspark.ml.linalg import Vectors from pyspark.sql.functions import col dataA = [(0, Vectors.sparse(6, [0, 1, 2], [1.0, 1.0, 1.0]),), (1, Vectors.sparse(6, [2, 3, 4], [1.0, 1.0, 1.0]),), (2, Vectors.sparse(6, [0, 2, 4], [1.0, 1.0, 1.0]),)] dfA = spark.createDataFrame(dataA, ["id", "features"]) dataB = [(3, Vectors.sparse(6, [1, 3, 5], [1.0, 1.0, 1.0]),), (4, Vectors.sparse(6, [2, 3, 5], [1.0, 1.0, 1.0]),), (5, Vectors.sparse(6, [1, 2, 4], [1.0, 1.0, 1.0]),)] dfB = spark.createDataFrame(dataB, ["id", "features"]) key = Vectors.sparse(6, [1, 3], [1.0, 1.0]) mh = MinHashLSH(inputCol="features", outputCol="hashes", numHashTables=5) model = mh.fit(dfA) transformedA = model.transform(dfA).select("id", "hashes") transformedB = model.transform(dfB).select("id", "hashes") model.approxSimilarityJoin(transformedA, transformedB, 0.6, distCol="JaccardDistance")\ .select(col("datasetA.id").alias("idA"), col("datasetB.id").alias("idB"), col("JaccardDistance")).show() ``` As in the code I give, I discard columns `features` but keep column `hashes` which is output data. approxSimilarityJoin should only work on `hashes` (the outputCol), which is exist and ignore the lack of `features` (the inputCol). 
Be able to transform the data beforehand and remove inputCol can make input data much smaller and prevent confusion about "We could avoid computing hashes by passing in the already-transformed dataset". > MinHashLSH.approxSimilarityJoin should not required inputCol if output exist > > > Key: SPARK-36458 > URL: https://issues.apache.org/jira/browse/SPARK-36458 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.1.1 >Reporter: Thai Thien >Priority: Minor > > Refer to documents and example code in MinHashLSH > > [https://spark.apache.org/docs/latest/ml-features#minhash-for-jaccard-distance] > The example written that: > We could avoid computing hashes by passing in the already-transformed > dataset, e.g. `model.approxSimilarityJoin(transformedA, transformedB, 0.6)` > However, inputCol still requir
[jira] [Created] (SPARK-36458) MinHashLSH.approxSimilarityJoin should not required inputCol if output exist
Thai Thien created SPARK-36458: -- Summary: MinHashLSH.approxSimilarityJoin should not required inputCol if output exist Key: SPARK-36458 URL: https://issues.apache.org/jira/browse/SPARK-36458 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 3.1.1 Reporter: Thai Thien Refer to documents and example code in MinHashLSH https://spark.apache.org/docs/latest/ml-features#minhash-for-jaccard-distance The example notes that: ``` # We could avoid computing hashes by passing in the already-transformed dataset, e.g. # `model.approxSimilarityJoin(transformedA, transformedB, 0.6)` ``` However, inputCol is still required in transformedA and transformedB even if they already have the outputCol. Code that should work, but doesn't: ``` from pyspark.ml.feature import MinHashLSH from pyspark.ml.linalg import Vectors from pyspark.sql.functions import col dataA = [(0, Vectors.sparse(6, [0, 1, 2], [1.0, 1.0, 1.0]),), (1, Vectors.sparse(6, [2, 3, 4], [1.0, 1.0, 1.0]),), (2, Vectors.sparse(6, [0, 2, 4], [1.0, 1.0, 1.0]),)] dfA = spark.createDataFrame(dataA, ["id", "features"]) dataB = [(3, Vectors.sparse(6, [1, 3, 5], [1.0, 1.0, 1.0]),), (4, Vectors.sparse(6, [2, 3, 5], [1.0, 1.0, 1.0]),), (5, Vectors.sparse(6, [1, 2, 4], [1.0, 1.0, 1.0]),)] dfB = spark.createDataFrame(dataB, ["id", "features"]) key = Vectors.sparse(6, [1, 3], [1.0, 1.0]) mh = MinHashLSH(inputCol="features", outputCol="hashes", numHashTables=5) model = mh.fit(dfA) transformedA = model.transform(dfA).select("id", "hashes") transformedB = model.transform(dfB).select("id", "hashes") model.approxSimilarityJoin(transformedA, transformedB, 0.6, distCol="JaccardDistance")\ .select(col("datasetA.id").alias("idA"), col("datasetB.id").alias("idB"), col("JaccardDistance")).show() ``` In the code above, I discard the `features` column but keep the `hashes` column, which is the output data. 
approxSimilarityJoin should work only on `hashes` (the outputCol), which exists, and ignore the missing `features` (the inputCol). Being able to transform the data beforehand and drop the inputCol makes the input data much smaller and prevents confusion about "We could avoid computing hashes by passing in the already-transformed dataset". -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
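To make concrete why the join could, in principle, run on the hashes alone: a minimal MinHash sketch in plain Python (an illustration of the idea, not the MLlib implementation). Once per-set signatures are computed, Jaccard similarity can be estimated from the signatures by themselves; the original feature vectors are no longer needed, which is the point the report makes.

```python
import random

# Minimal MinHash sketch (plain Python, not MLlib): the similarity estimate
# uses only the signatures, not the original sets.
random.seed(42)
NUM_HASHES = 200
PRIME = 2_147_483_647  # large prime for the random affine hash family
coeffs = [(random.randrange(1, PRIME), random.randrange(0, PRIME))
          for _ in range(NUM_HASHES)]

def signature(items):
    # one minimum per hash function over the set's elements
    return [min((a * x + b) % PRIME for x in items) for a, b in coeffs]

def estimated_jaccard(sig1, sig2):
    # fraction of hash slots where the minima agree
    return sum(s1 == s2 for s1, s2 in zip(sig1, sig2)) / len(sig1)

a = {0, 1, 2}          # active indices, mirroring the sparse vectors above
b = {1, 3, 5}
sig_a, sig_b = signature(a), signature(b)
true_j = len(a & b) / len(a | b)   # exact Jaccard = 1/5 = 0.2
print(round(estimated_jaccard(sig_a, sig_b), 2), true_j)
```

The estimate concentrates around the true Jaccard similarity as the number of hash functions grows, which is why retaining only the `hashes` column is enough information for the join itself.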
[jira] [Commented] (SPARK-36367) Fix the behavior to follow pandas >= 1.3
[ https://issues.apache.org/jira/browse/SPARK-36367?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17395973#comment-17395973 ] Apache Spark commented on SPARK-36367: -- User 'Cedric-Magnan' has created a pull request for this issue: https://github.com/apache/spark/pull/33687 > Fix the behavior to follow pandas >= 1.3 > > > Key: SPARK-36367 > URL: https://issues.apache.org/jira/browse/SPARK-36367 > Project: Spark > Issue Type: Umbrella > Components: PySpark >Affects Versions: 3.3.0 >Reporter: Takuya Ueshin >Assignee: Haejoon Lee >Priority: Major > > Pandas 1.3 has been released. We should follow the new pandas behavior. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-36367) Fix the behavior to follow pandas >= 1.3
[ https://issues.apache.org/jira/browse/SPARK-36367?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-36367: Assignee: Haejoon Lee (was: Apache Spark) > Fix the behavior to follow pandas >= 1.3 > > > Key: SPARK-36367 > URL: https://issues.apache.org/jira/browse/SPARK-36367 > Project: Spark > Issue Type: Umbrella > Components: PySpark >Affects Versions: 3.3.0 >Reporter: Takuya Ueshin >Assignee: Haejoon Lee >Priority: Major > > Pandas 1.3 has been released. We should follow the new pandas behavior. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-36367) Fix the behavior to follow pandas >= 1.3
[ https://issues.apache.org/jira/browse/SPARK-36367?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-36367: Assignee: Apache Spark (was: Haejoon Lee) > Fix the behavior to follow pandas >= 1.3 > > > Key: SPARK-36367 > URL: https://issues.apache.org/jira/browse/SPARK-36367 > Project: Spark > Issue Type: Umbrella > Components: PySpark >Affects Versions: 3.3.0 >Reporter: Takuya Ueshin >Assignee: Apache Spark >Priority: Major > > Pandas 1.3 has been released. We should follow the new pandas behavior. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-36367) Fix the behavior to follow pandas >= 1.3
[ https://issues.apache.org/jira/browse/SPARK-36367?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17395972#comment-17395972 ] Apache Spark commented on SPARK-36367: -- User 'Cedric-Magnan' has created a pull request for this issue: https://github.com/apache/spark/pull/33687 > Fix the behavior to follow pandas >= 1.3 > > > Key: SPARK-36367 > URL: https://issues.apache.org/jira/browse/SPARK-36367 > Project: Spark > Issue Type: Umbrella > Components: PySpark >Affects Versions: 3.3.0 >Reporter: Takuya Ueshin >Assignee: Haejoon Lee >Priority: Major > > Pandas 1.3 has been released. We should follow the new pandas behavior. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-36455) Provide an example of complex session window via flatMapGroupsWithState
[ https://issues.apache.org/jira/browse/SPARK-36455?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jungtaek Lim reassigned SPARK-36455: Assignee: Jungtaek Lim > Provide an example of complex session window via flatMapGroupsWithState > --- > > Key: SPARK-36455 > URL: https://issues.apache.org/jira/browse/SPARK-36455 > Project: Spark > Issue Type: Improvement > Components: Structured Streaming >Affects Versions: 3.2.0 >Reporter: Jungtaek Lim >Assignee: Jungtaek Lim >Priority: Major > > Now that we replaced the sessionization example with native session window > support, we may want to provide another example of a session window which can > only be handled with flatMapGroupsWithState. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-36455) Provide an example of complex session window via flatMapGroupsWithState
[ https://issues.apache.org/jira/browse/SPARK-36455?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jungtaek Lim resolved SPARK-36455. -- Fix Version/s: 3.2.0 Resolution: Fixed Issue resolved by pull request 33681 [https://github.com/apache/spark/pull/33681] > Provide an example of complex session window via flatMapGroupsWithState > --- > > Key: SPARK-36455 > URL: https://issues.apache.org/jira/browse/SPARK-36455 > Project: Spark > Issue Type: Improvement > Components: Structured Streaming >Affects Versions: 3.2.0 >Reporter: Jungtaek Lim >Assignee: Jungtaek Lim >Priority: Major > Fix For: 3.2.0 > > > Now that we replaced the sessionization example with native session window > support, we may want to provide another example of a session window which can > only be handled with flatMapGroupsWithState. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-18105) LZ4 failed to decompress a stream of shuffled data
[ https://issues.apache.org/jira/browse/SPARK-18105?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17395953#comment-17395953 ] Cameron Todd edited comment on SPARK-18105 at 8/9/21, 10:31 AM: Yep I understand. I have hashed my data keeping the same distribution; the full_name hashed column is weak, but string distance functions still work on it. Do you have any recommendations for where I can upload this data? It's only 2 GB. {code:java} //So this line of code: Dataset relevantPivots = spark.read().parquet(pathToDataDedup) .select("id", "full_name", "last_name","birthdate") .na().drop() .withColumn("pivot_hash", hash(col("last_name"),col("birthdate"))) .drop("last_name","birthdate") .repartition(5000) .cache(); // can be replaced with this Dataset relevantPivots = spark.read().parquet(pathToDataDedup).repartition(5000).cache();{code} I have also run the same code on the same hashed data and am getting the same corrupted stream error. Also, in case it wasn't clear, my data normally sits in an S3 bucket. was (Author: cameron.todd): Yep I understand. I have hashed my data keeping the same distribution and the full_name hashed column is weak but string distance functions still work on it. Do you have any recommendations where I can upload this data? {code:java} //So this line of code: Dataset relevantPivots = spark.read().parquet(pathToDataDedup) .select("id", "full_name", "last_name","birthdate") .na().drop() .withColumn("pivot_hash", hash(col("last_name"),col("birthdate"))) .drop("last_name","birthdate") .repartition(5000) .cache(); // can be replaced with this Dataset relevantPivots = spark.read().parquet(pathToDataDedup).repartition(5000).cache();{code} I have also run the same code on the same hashed data and getting the same corrupted stream error. Also in case it wasn't clear my data normally sits on an s3 bucket. 
> LZ4 failed to decompress a stream of shuffled data > -- > > Key: SPARK-18105 > URL: https://issues.apache.org/jira/browse/SPARK-18105 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.0.1, 3.1.1 >Reporter: Davies Liu >Priority: Major > Attachments: TestWeightedGraph.java > > > When lz4 is used to compress the shuffle files, it may fail to decompress it > as "stream is corrupt" > {code} > Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: > Task 92 in stage 5.0 failed 4 times, most recent failure: Lost task 92.3 in > stage 5.0 (TID 16616, 10.0.27.18): java.io.IOException: Stream is corrupted > at > org.apache.spark.io.LZ4BlockInputStream.refill(LZ4BlockInputStream.java:220) > at > org.apache.spark.io.LZ4BlockInputStream.available(LZ4BlockInputStream.java:109) > at java.io.BufferedInputStream.read(BufferedInputStream.java:353) > at java.io.DataInputStream.read(DataInputStream.java:149) > at com.google.common.io.ByteStreams.read(ByteStreams.java:828) > at com.google.common.io.ByteStreams.readFully(ByteStreams.java:695) > at > org.apache.spark.sql.execution.UnsafeRowSerializerInstance$$anon$3$$anon$1.next(UnsafeRowSerializer.scala:127) > at > org.apache.spark.sql.execution.UnsafeRowSerializerInstance$$anon$3$$anon$1.next(UnsafeRowSerializer.scala:110) > at scala.collection.Iterator$$anon$13.next(Iterator.scala:372) > at scala.collection.Iterator$$anon$11.next(Iterator.scala:328) > at > org.apache.spark.util.CompletionIterator.next(CompletionIterator.scala:30) > at > org.apache.spark.InterruptibleIterator.next(InterruptibleIterator.scala:43) > at scala.collection.Iterator$$anon$11.next(Iterator.scala:328) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.sort_addToSorter$(Unknown > Source) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown > Source) > at > 
org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) > at > org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370) > at > org.apache.spark.sql.execution.datasources.DynamicPartitionWriterContainer.writeRows(WriterContainer.scala:397) > at > org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(InsertIntoHadoopFsRelationCommand.scala:143) > at > org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(InsertIntoHadoopFsRelationCommand.scala:143) > at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70) > at org.apache.spark.scheduler.Task.run(Task.scala:86) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:745) > {code} > https://github.com/jpountz/lz4-java/issues/89
[jira] [Resolved] (SPARK-36410) Replace anonymous classes with lambda expressions
[ https://issues.apache.org/jira/browse/SPARK-36410?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-36410. -- Fix Version/s: 3.3.0 Resolution: Fixed Issue resolved by pull request 33635 [https://github.com/apache/spark/pull/33635] > Replace anonymous classes with lambda expressions > - > > Key: SPARK-36410 > URL: https://issues.apache.org/jira/browse/SPARK-36410 > Project: Spark > Issue Type: Improvement > Components: Examples, Spark Core, SQL >Affects Versions: 3.3.0 >Reporter: Yang Jie >Assignee: Yang Jie >Priority: Minor > Fix For: 3.3.0 > > > Anonymous classes can be replaced with lambda expressions in Java code, for > example > *before* > {code:java} > new Thread(new Runnable() { > @Override > public void run() { > // run thread > } > }); > {code} > *after* > {code:java} > new Thread(() -> { > // run thread > }); > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
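The refactoring in this ticket works because `Runnable` and similar single-method types are functional interfaces, so the anonymous-class and lambda forms are interchangeable. A minimal, self-contained sketch of that equivalence (the `Supplier`-based names are illustrative, not taken from the Spark patch):

```java
import java.util.function.Supplier;

public class LambdaRefactor {
    // Anonymous-class form: several lines of ceremony for one expression.
    static Supplier<String> anonymous() {
        return new Supplier<String>() {
            @Override
            public String get() {
                return "hello";
            }
        };
    }

    // Lambda form: same behavior, because Supplier is a functional interface.
    static Supplier<String> lambda() {
        return () -> "hello";
    }

    public static void main(String[] args) {
        // Both forms produce the same result wherever a Supplier is expected.
        if (!anonymous().get().equals(lambda().get())) throw new AssertionError();
        System.out.println(lambda().get()); // prints "hello"
    }
}
```

One behavioral difference worth remembering when applying this mechanically: inside an anonymous class, `this` refers to the anonymous instance, while inside a lambda it refers to the enclosing instance, so bodies that use `this` need a closer look before converting.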
[jira] [Assigned] (SPARK-36410) Replace anonymous classes with lambda expressions
[ https://issues.apache.org/jira/browse/SPARK-36410?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reassigned SPARK-36410: Assignee: Yang Jie > Replace anonymous classes with lambda expressions > - > > Key: SPARK-36410 > URL: https://issues.apache.org/jira/browse/SPARK-36410 > Project: Spark > Issue Type: Improvement > Components: Examples, Spark Core, SQL >Affects Versions: 3.3.0 >Reporter: Yang Jie >Assignee: Yang Jie >Priority: Minor > > Anonymous classes can be replaced with lambda expressions in Java code, for > example > *before* > {code:java} > new Thread(new Runnable() { > @Override > public void run() { > // run thread > } > }); > {code} > *after* > {code:java} > new Thread(() -> { > // run thread > }); > {code}
[jira] [Comment Edited] (SPARK-18105) LZ4 failed to decompress a stream of shuffled data
[ https://issues.apache.org/jira/browse/SPARK-18105?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17395953#comment-17395953 ] Cameron Todd edited comment on SPARK-18105 at 8/9/21, 10:29 AM: Yep I understand. I have hashed my data keeping the same distribution and the full_name hashed column is weak but string distance functions still work on it. Do you have any recommendations where I can upload this data? {code:java} //So this line of code: Dataset relevantPivots = spark.read().parquet(pathToDataDedup) .select("id", "full_name", "last_name","birthdate") .na().drop() .withColumn("pivot_hash", hash(col("last_name"),col("birthdate"))) .drop("last_name","birthdate") .repartition(5000) .cache(); // can be replaced with this Dataset relevantPivots = spark.read().parquet(pathToDataDedup).repartition(5000).cache();{code} I have also run the same code on the same hashed data and getting the same corrupted stream error. Also in case it wasn't clear my data normally sits on an s3 bucket. was (Author: cameron.todd): Yep I understand. I have attached my hashed data keeping the same distribution and the full_name hashed column is weak but string distance functions still work on it. So this line of code: {code:java} Dataset relevantPivots = spark.read().parquet(pathToDataDedup) .select("id", "full_name", "last_name","birthdate") .na().drop() .withColumn("pivot_hash", hash(col("last_name"),col("birthdate"))) .drop("last_name","birthdate") .repartition(5000) .cache(); // can be replaced with this Dataset relevantPivots = spark.read().parquet(pathToDataDedup).repartition(5000).cache();{code} I have also run the same code on the same hashed data and getting the same corrupted stream error. Also in case it wasn't clear my data normally sits on an s3 bucket. 
[jira] [Issue Comment Deleted] (SPARK-18105) LZ4 failed to decompress a stream of shuffled data
[ https://issues.apache.org/jira/browse/SPARK-18105?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cameron Todd updated SPARK-18105: - Comment: was deleted (was: [^hashed_data.zip]) > LZ4 failed to decompress a stream of shuffled data > -- > > Key: SPARK-18105 > URL: https://issues.apache.org/jira/browse/SPARK-18105 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.0.1, 3.1.1 >Reporter: Davies Liu >Priority: Major > Attachments: TestWeightedGraph.java > > > When lz4 is used to compress the shuffle files, it may fail to decompress it > as "stream is corrupt" > {code} > Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: > Task 92 in stage 5.0 failed 4 times, most recent failure: Lost task 92.3 in > stage 5.0 (TID 16616, 10.0.27.18): java.io.IOException: Stream is corrupted > at > org.apache.spark.io.LZ4BlockInputStream.refill(LZ4BlockInputStream.java:220) > at > org.apache.spark.io.LZ4BlockInputStream.available(LZ4BlockInputStream.java:109) > at java.io.BufferedInputStream.read(BufferedInputStream.java:353) > at java.io.DataInputStream.read(DataInputStream.java:149) > at com.google.common.io.ByteStreams.read(ByteStreams.java:828) > at com.google.common.io.ByteStreams.readFully(ByteStreams.java:695) > at > org.apache.spark.sql.execution.UnsafeRowSerializerInstance$$anon$3$$anon$1.next(UnsafeRowSerializer.scala:127) > at > org.apache.spark.sql.execution.UnsafeRowSerializerInstance$$anon$3$$anon$1.next(UnsafeRowSerializer.scala:110) > at scala.collection.Iterator$$anon$13.next(Iterator.scala:372) > at scala.collection.Iterator$$anon$11.next(Iterator.scala:328) > at > org.apache.spark.util.CompletionIterator.next(CompletionIterator.scala:30) > at > org.apache.spark.InterruptibleIterator.next(InterruptibleIterator.scala:43) > at scala.collection.Iterator$$anon$11.next(Iterator.scala:328) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.sort_addToSorter$(Unknown 
> Source) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown > Source) > at > org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) > at > org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370) > at > org.apache.spark.sql.execution.datasources.DynamicPartitionWriterContainer.writeRows(WriterContainer.scala:397) > at > org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(InsertIntoHadoopFsRelationCommand.scala:143) > at > org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(InsertIntoHadoopFsRelationCommand.scala:143) > at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70) > at org.apache.spark.scheduler.Task.run(Task.scala:86) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:745) > {code} > https://github.com/jpountz/lz4-java/issues/89 -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-18105) LZ4 failed to decompress a stream of shuffled data
[ https://issues.apache.org/jira/browse/SPARK-18105?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17395954#comment-17395954 ] Cameron Todd commented on SPARK-18105: -- [^hashed_data.zip]
[jira] [Commented] (SPARK-18105) LZ4 failed to decompress a stream of shuffled data
[ https://issues.apache.org/jira/browse/SPARK-18105?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17395953#comment-17395953 ] Cameron Todd commented on SPARK-18105: -- Yep I understand. I have attached my hashed data keeping the same distribution and the full_name hashed column is weak but string distance functions still work on it. So this line of code: {code:java} Dataset relevantPivots = spark.read().parquet(pathToDataDedup) .select("id", "full_name", "last_name","birthdate") .na().drop() .withColumn("pivot_hash", hash(col("last_name"),col("birthdate"))) .drop("last_name","birthdate") .repartition(5000) .cache(); // can be replaced with this Dataset relevantPivots = spark.read().parquet(pathToDataDedup).repartition(5000).cache();{code} I have also run the same code on the same hashed data and getting the same corrupted stream error. Also in case it wasn't clear my data normally sits on an s3 bucket.
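The pivot-hash trick in the comment above generalizes: any deterministic hash of the blocking keys preserves group membership, and therefore the group-size distribution, of the original data, which is what makes the pseudonymized dataset usable for reproducing the bug. A hedged sketch using the JDK's `Objects.hash` rather than Spark's Murmur3-based `hash` function (the method and column names are illustrative):

```java
import java.util.Objects;

public class PivotHash {
    // Hash the blocking keys (last_name, birthdate) into a single pivot value.
    // Equal inputs always hash equal, so rows that grouped together before
    // pseudonymization still group together afterwards.
    static int pivotHash(String lastName, String birthdate) {
        return Objects.hash(lastName, birthdate);
    }

    public static void main(String[] args) {
        int a = pivotHash("Smith", "1990-01-01");
        int b = pivotHash("Smith", "1990-01-01");
        // Determinism is the property the commenter relies on.
        if (a != b) throw new AssertionError("equal keys must hash equal");
        // Distinct keys may collide in principle; distinct hashes are only
        // overwhelmingly likely, not guaranteed.
        System.out.println(a == pivotHash("Jones", "1955-07-21"));
    }
}
```

Note that this mirrors only the structure of the Spark snippet; Spark's `hash(col(...), col(...))` uses a different algorithm, so the concrete hash values will not match.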
[jira] [Assigned] (SPARK-36430) Adaptively calculate the target size when coalescing shuffle partitions in AQE
[ https://issues.apache.org/jira/browse/SPARK-36430?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan reassigned SPARK-36430: --- Assignee: Wenchen Fan > Adaptively calculate the target size when coalescing shuffle partitions in AQE > -- > > Key: SPARK-36430 > URL: https://issues.apache.org/jira/browse/SPARK-36430 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.2.0 >Reporter: Wenchen Fan >Assignee: Wenchen Fan >Priority: Major >
[jira] [Resolved] (SPARK-36430) Adaptively calculate the target size when coalescing shuffle partitions in AQE
[ https://issues.apache.org/jira/browse/SPARK-36430?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-36430. - Fix Version/s: 3.2.0 Resolution: Fixed Issue resolved by pull request 33655 [https://github.com/apache/spark/pull/33655] > Adaptively calculate the target size when coalescing shuffle partitions in AQE > -- > > Key: SPARK-36430 > URL: https://issues.apache.org/jira/browse/SPARK-36430 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.2.0 >Reporter: Wenchen Fan >Assignee: Wenchen Fan >Priority: Major > Fix For: 3.2.0 > >
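Spark's real logic lives in the AQE rule that coalesces shuffle partitions; as a rough, simplified sketch of the idea only (not Spark's implementation), the target size can be capped by an advisory size while being shrunk just enough to keep a minimum number of partitions, after which adjacent shuffle partitions are merged greedily:

```java
import java.util.ArrayList;
import java.util.List;

public class CoalesceSketch {
    // Simplified adaptive target sizing: never exceed the advisory size,
    // but shrink the target so at least minNumPartitions remain.
    static long targetSize(long totalBytes, long advisoryBytes, int minNumPartitions) {
        long maxTarget = (totalBytes + minNumPartitions - 1) / minNumPartitions; // ceil
        return Math.max(1, Math.min(advisoryBytes, maxTarget));
    }

    // Greedily merge adjacent shuffle partitions until adding the next one
    // would push the merged size past the target. Returns the merged sizes.
    static List<Long> coalesce(long[] sizes, long target) {
        List<Long> merged = new ArrayList<>();
        long current = 0;
        for (long s : sizes) {
            if (current > 0 && current + s > target) {
                merged.add(current);
                current = 0;
            }
            current += s;
        }
        if (current > 0) merged.add(current);
        return merged;
    }

    public static void main(String[] args) {
        long[] sizes = {10, 10, 10, 50, 10};
        long t = targetSize(90, 64, 2); // ceil(90/2) = 45, capped by 64 -> 45
        System.out.println(t + " " + coalesce(sizes, t)); // prints "45 [30, 50, 10]"
    }
}
```

The point of adapting the target, as opposed to using a fixed advisory size, is that a small total shuffle size would otherwise collapse into a single partition and lose all parallelism.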
[jira] [Resolved] (SPARK-36271) Hive SerDe and V1 Insert data to parquet/orc/avro need to check schema too
[ https://issues.apache.org/jira/browse/SPARK-36271?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-36271. - Fix Version/s: 3.2.0 Resolution: Fixed Issue resolved by pull request 33566 [https://github.com/apache/spark/pull/33566] > Hive SerDe and V1 Insert data to parquet/orc/avro need to check schema too > -- > > Key: SPARK-36271 > URL: https://issues.apache.org/jira/browse/SPARK-36271 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.2.0 >Reporter: angerszhu >Assignee: angerszhu >Priority: Major > Fix For: 3.2.0 > > Attachments: screenshot-1.png > > > {code:java} > withTempPath { path => > spark.sql( > s""" > |SELECT > |NAMED_STRUCT('ID', ID, 'IF(ID=1,ID,0)', IF(ID=1,ID,0), 'B', > ABS(ID)) AS col1 > |FROM v > > """.stripMargin).write.mode(SaveMode.Overwrite).parquet(path.getCanonicalPath) > } > {code} > !screenshot-1.png|width=1302,height=176!
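The failure mode here is that Parquet rejects field names containing certain characters, such as the generated alias IF(ID=1,ID,0) in the reproducer above, so the write path needs to validate the output schema before handing it to the file format. A hedged sketch of the kind of name check involved (the rejected character set mirrors the restriction Spark's Parquet schema converter has historically enforced; treat it as illustrative, not authoritative):

```java
public class ParquetNameCheck {
    // Characters assumed to be rejected in Parquet field names
    // (illustrative; the real check lives in Spark's schema converter).
    private static final String INVALID = " ,;{}()\n\t=";

    public static boolean isValidFieldName(String name) {
        return name.chars().noneMatch(c -> INVALID.indexOf(c) >= 0);
    }

    public static void main(String[] args) {
        // The generated alias from the reproducer fails: '(', ')', '=', ','.
        System.out.println(isValidFieldName("IF(ID=1,ID,0)")); // prints "false"
        System.out.println(isValidFieldName("col1"));          // prints "true"
    }
}
```

Checking the schema up front turns an opaque low-level writer error into an actionable analysis-time message telling the user to alias the offending expression.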
[jira] [Assigned] (SPARK-36271) Hive SerDe and V1 Insert data to parquet/orc/avro need to check schema too
[ https://issues.apache.org/jira/browse/SPARK-36271?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan reassigned SPARK-36271: --- Assignee: angerszhu > Hive SerDe and V1 Insert data to parquet/orc/avro need to check schema too > -- > > Key: SPARK-36271 > URL: https://issues.apache.org/jira/browse/SPARK-36271 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.2.0 >Reporter: angerszhu >Assignee: angerszhu >Priority: Major > Attachments: screenshot-1.png > > > {code:java} > withTempPath { path => > spark.sql( > s""" > |SELECT > |NAMED_STRUCT('ID', ID, 'IF(ID=1,ID,0)', IF(ID=1,ID,0), 'B', > ABS(ID)) AS col1 > |FROM v > > """.stripMargin).write.mode(SaveMode.Overwrite).parquet(path.getCanonicalPath) > } > {code} > !screenshot-1.png|width=1302,height=176!
[jira] [Updated] (SPARK-36339) aggsBuffer should collect AggregateExpression in the map range
[ https://issues.apache.org/jira/browse/SPARK-36339?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan updated SPARK-36339: Fix Version/s: 3.0.4 3.1.3 > aggsBuffer should collect AggregateExpression in the map range > -- > > Key: SPARK-36339 > URL: https://issues.apache.org/jira/browse/SPARK-36339 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.8, 3.0.3, 3.1.2 >Reporter: gaoyajun02 >Assignee: gaoyajun02 >Priority: Major > Labels: grouping > Fix For: 3.2.0, 3.1.3, 3.0.4 > > > show demo for this ISSUE: > {code:java} > // SQL without error > SELECT name, count(name) c > FROM VALUES ('Alice'), ('Bob') people(name) > GROUP BY name GROUPING SETS(name); > // An error is reported after exchanging the order of the query columns: > SELECT count(name) c, name > FROM VALUES ('Alice'), ('Bob') people(name) > GROUP BY name GROUPING SETS(name); > {code} > The error message is: > {code:java} > Error in query: expression 'people.`name`' is neither present in the group > by, nor is it an aggregate function. Add to group by or wrap in first() (or > first_value) if you don't care which value you get.;; > Aggregate [name#5, spark_grouping_id#3], [count(name#1) AS c#0L, name#1] > +- Expand [List(name#1, name#4, 0)], [name#1, name#5, spark_grouping_id#3] >+- Project [name#1, name#1 AS name#4] > +- SubqueryAlias `people` > +- LocalRelation [name#1] > {code} > So far, I have checked that there is no problem before version 2.3. > > During debugging, I found that the behavior of constructAggregateExprs in > ResolveGroupingAnalytics has changed. > {code:java} > /* > * Construct new aggregate expressions by replacing grouping functions. 
> */ > private def constructAggregateExprs( > groupByExprs: Seq[Expression], > aggregations: Seq[NamedExpression], > groupByAliases: Seq[Alias], > groupingAttrs: Seq[Expression], > gid: Attribute): Seq[NamedExpression] = aggregations.map { > // collect all the found AggregateExpression, so we can check an > expression is part of > // any AggregateExpression or not. > val aggsBuffer = ArrayBuffer[Expression]() > // Returns whether the expression belongs to any expressions in > `aggsBuffer` or not. > def isPartOfAggregation(e: Expression): Boolean = { > aggsBuffer.exists(a => a.find(_ eq e).isDefined) > } > replaceGroupingFunc(_, groupByExprs, gid).transformDown { > // AggregateExpression should be computed on the unmodified value of > its argument > // expressions, so we should not replace any references to grouping > expression > // inside it. > case e: AggregateExpression => > aggsBuffer += e > e > case e if isPartOfAggregation(e) => e > case e => > // Replace expression by expand output attribute. > val index = groupByAliases.indexWhere(_.child.semanticEquals(e)) > if (index == -1) { > e > } else { > groupingAttrs(index) > } > }.asInstanceOf[NamedExpression] > } > {code} > When performing aggregations.map, the aggsBuffer here is effectively outside the > scope of the per-element function: it accumulates the AggregateExpressions of all the elements > processed by the map, which was not the behavior before 2.3.
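The scoping issue described above, a buffer created once for the whole `map` so that state recorded while processing one element is still visible when the next element is processed, can be reproduced outside Spark. A minimal Java analogue (all names are illustrative, not from the Spark code):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.stream.Collectors;

public class SharedBufferPitfall {
    // Buggy shape: the buffer lives outside the per-element lambda, so it
    // accumulates across elements -- the cross-element leak in the ticket.
    static List<Integer> mapWithSharedBuffer(List<Integer> xs) {
        List<Integer> buffer = new ArrayList<>(); // created once, shared
        return xs.stream().map(x -> {
            buffer.add(x);
            return buffer.size(); // sees state left over from earlier elements
        }).collect(Collectors.toList());
    }

    // Fixed shape: a fresh buffer per element keeps the bookkeeping local.
    static List<Integer> mapWithLocalBuffer(List<Integer> xs) {
        return xs.stream().map(x -> {
            List<Integer> buffer = new ArrayList<>(); // fresh per element
            buffer.add(x);
            return buffer.size();
        }).collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<Integer> xs = List.of(1, 2, 3);
        System.out.println(mapWithSharedBuffer(xs)); // prints "[1, 2, 3]" -- grows across elements
        System.out.println(mapWithLocalBuffer(xs));  // prints "[1, 1, 1]" -- independent per element
    }
}
```

In the Scala original the same thing happens because the `val aggsBuffer` sits in the block that *produces* the function passed to `map` (via the trailing underscore), so it is evaluated once rather than once per aggregation expression.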
[jira] [Resolved] (SPARK-36424) Support eliminate limits in AQE Optimizer
[ https://issues.apache.org/jira/browse/SPARK-36424?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-36424. - Fix Version/s: 3.3.0 Resolution: Fixed Issue resolved by pull request 33651 [https://github.com/apache/spark/pull/33651] > Support eliminate limits in AQE Optimizer > - > > Key: SPARK-36424 > URL: https://issues.apache.org/jira/browse/SPARK-36424 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.2.0 >Reporter: XiDuo You >Priority: Major > Fix For: 3.3.0 > > > In ad-hoc scenarios, a limit is always added to the query when the user does not specify one, but not every limit is necessary. > With the power of AQE, we can eliminate such limits using runtime statistics.
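The elimination condition itself is simple: once runtime statistics show that the child already produces no more rows than the limit, the Limit node is a no-op and can be dropped. A hedged sketch of that predicate (not Spark's actual optimizer rule):

```java
public class EliminateLimitSketch {
    // A LIMIT is redundant once runtime stats prove the child's row count
    // already fits under it; AQE can then remove the node entirely.
    public static boolean canEliminate(long limit, long observedRowCount) {
        return observedRowCount <= limit;
    }

    public static void main(String[] args) {
        System.out.println(canEliminate(1000, 42)); // prints "true": limit is a no-op
        System.out.println(canEliminate(10, 42));   // prints "false": limit still truncates
    }
}
```

The benefit is not the comparison but what it unlocks: removing the limit lets downstream stages avoid a single-partition shuffle that only existed to enforce it.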
[jira] [Assigned] (SPARK-36352) Spark should check result plan's output schema name
[ https://issues.apache.org/jira/browse/SPARK-36352?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan reassigned SPARK-36352: --- Assignee: angerszhu > Spark should check result plan's output schema name > --- > > Key: SPARK-36352 > URL: https://issues.apache.org/jira/browse/SPARK-36352 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.2.0 >Reporter: angerszhu >Assignee: angerszhu >Priority: Major >
[jira] [Resolved] (SPARK-36352) Spark should check result plan's output schema name
[ https://issues.apache.org/jira/browse/SPARK-36352?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-36352. - Fix Version/s: 3.2.0 Resolution: Fixed Issue resolved by pull request 33583 [https://github.com/apache/spark/pull/33583] > Spark should check result plan's output schema name > --- > > Key: SPARK-36352 > URL: https://issues.apache.org/jira/browse/SPARK-36352 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.2.0 >Reporter: angerszhu >Assignee: angerszhu >Priority: Major > Fix For: 3.2.0 > >
[jira] [Assigned] (SPARK-36450) Remove unused UnresolvedV2Relation
[ https://issues.apache.org/jira/browse/SPARK-36450?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan reassigned SPARK-36450: --- Assignee: Terry Kim > Remove unused UnresolvedV2Relation > -- > > Key: SPARK-36450 > URL: https://issues.apache.org/jira/browse/SPARK-36450 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.3.0 >Reporter: Terry Kim >Assignee: Terry Kim >Priority: Minor > > Now that all the commands that used UnresolvedV2Relation have been migrated to use UnresolvedTable and UnresolvedView, it can be removed.
[jira] [Resolved] (SPARK-36450) Remove unused UnresolvedV2Relation
[ https://issues.apache.org/jira/browse/SPARK-36450?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-36450. - Fix Version/s: 3.3.0 Resolution: Fixed Issue resolved by pull request 33677 [https://github.com/apache/spark/pull/33677] > Remove unused UnresolvedV2Relation > -- > > Key: SPARK-36450 > URL: https://issues.apache.org/jira/browse/SPARK-36450 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.3.0 >Reporter: Terry Kim >Assignee: Terry Kim >Priority: Minor > Fix For: 3.3.0 > > > Now that all the commands that used UnresolvedV2Relation have been migrated to use UnresolvedTable and UnresolvedView, it can be removed.