[jira] [Assigned] (SPARK-36466) Table in unloaded catalog referenced by view should load correctly
[ https://issues.apache.org/jira/browse/SPARK-36466?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-36466: Assignee: (was: Apache Spark) > Table in unloaded catalog referenced by view should load correctly > -- > > Key: SPARK-36466 > URL: https://issues.apache.org/jira/browse/SPARK-36466 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.1.2, 3.2.0 >Reporter: Cheng Pan >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-36466) Table in unloaded catalog referenced by view should load correctly
[ https://issues.apache.org/jira/browse/SPARK-36466?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-36466: Assignee: Apache Spark > Table in unloaded catalog referenced by view should load correctly > -- > > Key: SPARK-36466 > URL: https://issues.apache.org/jira/browse/SPARK-36466 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.1.2, 3.2.0 >Reporter: Cheng Pan >Assignee: Apache Spark >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-36466) Table in unloaded catalog referenced by view should load correctly
Cheng Pan created SPARK-36466: - Summary: Table in unloaded catalog referenced by view should load correctly Key: SPARK-36466 URL: https://issues.apache.org/jira/browse/SPARK-36466 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 3.1.2, 3.2.0 Reporter: Cheng Pan -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-36464) Fix Underlying Size Variable Initialization in ChunkedByteBufferOutputStream for Writing Over 2GB Data
[ https://issues.apache.org/jira/browse/SPARK-36464?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17396462#comment-17396462 ] Apache Spark commented on SPARK-36464: -- User 'kazuyukitanimura' has created a pull request for this issue: https://github.com/apache/spark/pull/33690 > Fix Underlying Size Variable Initialization in ChunkedByteBufferOutputStream > for Writing Over 2GB Data > -- > > Key: SPARK-36464 > URL: https://issues.apache.org/jira/browse/SPARK-36464 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.1.2 >Reporter: Kazuyuki Tanimura >Priority: Major > > The `size` method of `ChunkedByteBufferOutputStream` returns a `Long` value; > however, the underlying `_size` variable is initialized as `Int`. > That causes an overflow and returns a negative size when over 2GB data is > written into `ChunkedByteBufferOutputStream` -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
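The overflow described in this issue can be illustrated outside Spark. Below is a small pure-Python sketch (illustrative only, not the Spark implementation) that simulates a byte counter held in a 32-bit signed Int, the effective type of `_size`, and shows it wrapping negative once more than 2GB has been written:

```python
# Illustrative sketch: simulate a size counter stored in a 32-bit signed
# integer (like a Scala/Java Int), as described for `_size` above.

def to_int32(x: int) -> int:
    """Wrap an arbitrary integer into the 32-bit signed range."""
    return ((x + 2**31) % 2**32) - 2**31

def write_chunks(chunk_size: int, num_chunks: int) -> int:
    """Accumulate written bytes the way a plain Int counter would."""
    size = 0
    for _ in range(num_chunks):
        size = to_int32(size + chunk_size)
    return size

# 31 chunks of 64 MB stay under 2 GB; 33 chunks cross it and wrap negative.
print(write_chunks(64 * 1024 * 1024, 31))  # 2080374784 (still positive)
print(write_chunks(64 * 1024 * 1024, 33))  # negative: the Int overflowed
```

Holding the counter in a 64-bit Long, as the `size` method's return type already implies, removes the wrap-around.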
[jira] [Assigned] (SPARK-36465) Dynamic gap duration in session window
[ https://issues.apache.org/jira/browse/SPARK-36465?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-36465: Assignee: (was: Apache Spark) > Dynamic gap duration in session window > -- > > Key: SPARK-36465 > URL: https://issues.apache.org/jira/browse/SPARK-36465 > Project: Spark > Issue Type: Improvement > Components: Structured Streaming >Affects Versions: 3.3.0 >Reporter: L. C. Hsieh >Priority: Major > > The gap duration used in session window for now is a static value. To support > more complex usage, it is better to support dynamic gap duration which > determines the gap duration by looking at the current data. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-36464) Fix Underlying Size Variable Initialization in ChunkedByteBufferOutputStream for Writing Over 2GB Data
[ https://issues.apache.org/jira/browse/SPARK-36464?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-36464: Assignee: (was: Apache Spark) > Fix Underlying Size Variable Initialization in ChunkedByteBufferOutputStream > for Writing Over 2GB Data > -- > > Key: SPARK-36464 > URL: https://issues.apache.org/jira/browse/SPARK-36464 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.1.2 >Reporter: Kazuyuki Tanimura >Priority: Major > > The `size` method of `ChunkedByteBufferOutputStream` returns a `Long` value; > however, the underlying `_size` variable is initialized as `Int`. > That causes an overflow and returns a negative size when over 2GB data is > written into `ChunkedByteBufferOutputStream` -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-36465) Dynamic gap duration in session window
[ https://issues.apache.org/jira/browse/SPARK-36465?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17396461#comment-17396461 ] Apache Spark commented on SPARK-36465: -- User 'viirya' has created a pull request for this issue: https://github.com/apache/spark/pull/33691 > Dynamic gap duration in session window > -- > > Key: SPARK-36465 > URL: https://issues.apache.org/jira/browse/SPARK-36465 > Project: Spark > Issue Type: Improvement > Components: Structured Streaming >Affects Versions: 3.3.0 >Reporter: L. C. Hsieh >Priority: Major > > The gap duration used in session window for now is a static value. To support > more complex usage, it is better to support dynamic gap duration which > determines the gap duration by looking at the current data. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-36464) Fix Underlying Size Variable Initialization in ChunkedByteBufferOutputStream for Writing Over 2GB Data
[ https://issues.apache.org/jira/browse/SPARK-36464?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-36464: Assignee: Apache Spark > Fix Underlying Size Variable Initialization in ChunkedByteBufferOutputStream > for Writing Over 2GB Data > -- > > Key: SPARK-36464 > URL: https://issues.apache.org/jira/browse/SPARK-36464 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.1.2 >Reporter: Kazuyuki Tanimura >Assignee: Apache Spark >Priority: Major > > The `size` method of `ChunkedByteBufferOutputStream` returns a `Long` value; > however, the underlying `_size` variable is initialized as `Int`. > That causes an overflow and returns a negative size when over 2GB data is > written into `ChunkedByteBufferOutputStream` -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-36465) Dynamic gap duration in session window
[ https://issues.apache.org/jira/browse/SPARK-36465?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-36465: Assignee: Apache Spark > Dynamic gap duration in session window > -- > > Key: SPARK-36465 > URL: https://issues.apache.org/jira/browse/SPARK-36465 > Project: Spark > Issue Type: Improvement > Components: Structured Streaming >Affects Versions: 3.3.0 >Reporter: L. C. Hsieh >Assignee: Apache Spark >Priority: Major > > The gap duration used in session window for now is a static value. To support > more complex usage, it is better to support dynamic gap duration which > determines the gap duration by looking at the current data. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-36464) Fix Underlying Size Variable Initialization in ChunkedByteBufferOutputStream for Writing Over 2GB Data
[ https://issues.apache.org/jira/browse/SPARK-36464?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17396460#comment-17396460 ] Apache Spark commented on SPARK-36464: -- User 'kazuyukitanimura' has created a pull request for this issue: https://github.com/apache/spark/pull/33690 > Fix Underlying Size Variable Initialization in ChunkedByteBufferOutputStream > for Writing Over 2GB Data > -- > > Key: SPARK-36464 > URL: https://issues.apache.org/jira/browse/SPARK-36464 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.1.2 >Reporter: Kazuyuki Tanimura >Priority: Major > > The `size` method of `ChunkedByteBufferOutputStream` returns a `Long` value; > however, the underlying `_size` variable is initialized as `Int`. > That causes an overflow and returns a negative size when over 2GB data is > written into `ChunkedByteBufferOutputStream` -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-36464) Fix Underlying Size Variable Initialization in ChunkedByteBufferOutputStream for Writing Over 2GB Data
[ https://issues.apache.org/jira/browse/SPARK-36464?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kazuyuki Tanimura updated SPARK-36464: -- Description: The `size` method of `ChunkedByteBufferOutputStream` returns a `Long` value; however, the underlying `_size` variable is initialized as `Int`. That causes an overflow and returns a negative size when over 2GB data is written into `ChunkedByteBufferOutputStream` was: The `size` method of `ChunkedByteBufferOutputStream` returns a `Long` value; however, the underlying `_size` variable is initialized as `Int`. That causes an overflow and returns a negative size when over 2GB data is written into `ChunkedByteBufferOutputStream` build/sbt "core/testOnly *ChunkedByteBufferOutputStreamSuite -- -z SPARK-36464" > Fix Underlying Size Variable Initialization in ChunkedByteBufferOutputStream > for Writing Over 2GB Data > -- > > Key: SPARK-36464 > URL: https://issues.apache.org/jira/browse/SPARK-36464 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.1.2 >Reporter: Kazuyuki Tanimura >Priority: Major > > The `size` method of `ChunkedByteBufferOutputStream` returns a `Long` value; > however, the underlying `_size` variable is initialized as `Int`. > That causes an overflow and returns a negative size when over 2GB data is > written into `ChunkedByteBufferOutputStream` -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-36465) Dynamic gap duration in session window
L. C. Hsieh created SPARK-36465: --- Summary: Dynamic gap duration in session window Key: SPARK-36465 URL: https://issues.apache.org/jira/browse/SPARK-36465 Project: Spark Issue Type: Improvement Components: Structured Streaming Affects Versions: 3.3.0 Reporter: L. C. Hsieh The gap duration currently used in session window is a static value. To support more complex usage, it is better to support a dynamic gap duration, determined by looking at the current data. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
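The idea can be sketched without Spark. Below is a minimal pure-Python sessionizer (illustrative only, not the proposed Spark API; event types and gap values are made up) in which the gap is derived from each event's own data, e.g. a longer gap after a "video" event than after a "click":

```python
# Minimal sessionizer with a per-event (dynamic) gap duration.
# Events are (timestamp_seconds, event_type) pairs; names are illustrative.

def gap_for(event_type: str) -> int:
    """Dynamic gap: determined by looking at the current event's data."""
    return 30 if event_type == "video" else 5

def sessionize(events):
    """Group time-ordered events into sessions; a session closes when the
    next event starts after the previous event's dynamic gap has elapsed."""
    sessions, current, session_end = [], [], None
    for ts, etype in sorted(events):
        if session_end is not None and ts > session_end:
            sessions.append(current)
            current = []
        current.append((ts, etype))
        session_end = ts + gap_for(etype)  # gap depends on this event
    if current:
        sessions.append(current)
    return sessions

events = [(0, "click"), (4, "click"), (20, "video"), (45, "click"), (60, "click")]
print(sessionize(events))
```

With a static 5-second gap the event at t=45 would start a new session; the 30-second gap attached to the video event at t=20 keeps it in the same session.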
[jira] [Updated] (SPARK-36464) Fix Underlying Size Variable Initialization in ChunkedByteBufferOutputStream for Writing Over 2GB Data
[ https://issues.apache.org/jira/browse/SPARK-36464?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kazuyuki Tanimura updated SPARK-36464: -- Description: The `size` method of `ChunkedByteBufferOutputStream` returns a `Long` value; however, the underlying `_size` variable is initialized as `Int`. That causes an overflow and returns a negative size when over 2GB data is written into `ChunkedByteBufferOutputStream` build/sbt "core/testOnly *ChunkedByteBufferOutputStreamSuite -- -z SPARK-36464" was: The `size` method of `ChunkedByteBufferOutputStream` returns a `Long` value; however, the underlying `_size` variable is initialized as `Int`. That causes an overflow and returns a negative size when over 2GB data is written into `ChunkedByteBufferOutputStream` > Fix Underlying Size Variable Initialization in ChunkedByteBufferOutputStream > for Writing Over 2GB Data > -- > > Key: SPARK-36464 > URL: https://issues.apache.org/jira/browse/SPARK-36464 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.1.2 >Reporter: Kazuyuki Tanimura >Priority: Major > > The `size` method of `ChunkedByteBufferOutputStream` returns a `Long` value; > however, the underlying `_size` variable is initialized as `Int`. > That causes an overflow and returns a negative size when over 2GB data is > written into `ChunkedByteBufferOutputStream` > > build/sbt "core/testOnly *ChunkedByteBufferOutputStreamSuite -- -z > SPARK-36464" -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-36464) Fix Underlying Size Variable Initialization in ChunkedByteBufferOutputStream for Writing Over 2GB Data
[ https://issues.apache.org/jira/browse/SPARK-36464?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kazuyuki Tanimura updated SPARK-36464: -- Description: The `size` method of `ChunkedByteBufferOutputStream` returns a `Long` value; however, the underlying `_size` variable is initialized as `Int`. That causes an overflow and returns a negative size when over 2GB data is written into `ChunkedByteBufferOutputStream` was: The `size` method of `ChunkedByteBufferOutputStream` returns a `Long` value; however, the underlying `_size` variable is initialized as `int`. That causes an overflow and returns negative size when over 2GB data is written into `ChunkedByteBufferOutputStream` > Fix Underlying Size Variable Initialization in ChunkedByteBufferOutputStream > for Writing Over 2GB Data > -- > > Key: SPARK-36464 > URL: https://issues.apache.org/jira/browse/SPARK-36464 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.1.2 >Reporter: Kazuyuki Tanimura >Priority: Major > > The `size` method of `ChunkedByteBufferOutputStream` returns a `Long` value; > however, the underlying `_size` variable is initialized as `Int`. > That causes an overflow and returns a negative size when over 2GB data is > written into `ChunkedByteBufferOutputStream` -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-34276) Check the unreleased/unresolved JIRAs/PRs of Parquet 1.11
[ https://issues.apache.org/jira/browse/SPARK-34276?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17396450#comment-17396450 ] Yuming Wang commented on SPARK-34276: - I think so. cc [~smilegator] > Check the unreleased/unresolved JIRAs/PRs of Parquet 1.11 > -- > > Key: SPARK-34276 > URL: https://issues.apache.org/jira/browse/SPARK-34276 > Project: Spark > Issue Type: Task > Components: Build, SQL >Affects Versions: 3.2.0 >Reporter: Yuming Wang >Priority: Blocker > > Before the release, we need to double check the unreleased/unresolved > JIRAs/PRs of Parquet 1.11 and then decide whether we should upgrade/revert > Parquet. At the same time, we should encourage the whole community to do the > compatibility and performance tests for their production workloads, including > both read and write code paths. > More details: > https://github.com/apache/spark/pull/26804#issuecomment-768790620 -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-36449) ALTER TABLE REPLACE COLUMNS should check duplicates for the specified columns for v2 command
[ https://issues.apache.org/jira/browse/SPARK-36449?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-36449. - Resolution: Fixed Issue resolved by pull request 33676 [https://github.com/apache/spark/pull/33676] > ALTER TABLE REPLACE COLUMNS should check duplicates for the specified columns > for v2 command > > > Key: SPARK-36449 > URL: https://issues.apache.org/jira/browse/SPARK-36449 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.3.0 >Reporter: Terry Kim >Assignee: Terry Kim >Priority: Major > Fix For: 3.2.0 > > > ALTER TABLE REPLACE COLUMNS currently doesn't check duplicates for the > specified columns for v2 command. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-36464) Fix Underlying Size Variable Initialization in ChunkedByteBufferOutputStream for Writing Over 2GB Data
Kazuyuki Tanimura created SPARK-36464: - Summary: Fix Underlying Size Variable Initialization in ChunkedByteBufferOutputStream for Writing Over 2GB Data Key: SPARK-36464 URL: https://issues.apache.org/jira/browse/SPARK-36464 Project: Spark Issue Type: Bug Components: Spark Core Affects Versions: 3.1.2 Reporter: Kazuyuki Tanimura The `size` method of `ChunkedByteBufferOutputStream` returns a `Long` value; however, the underlying `_size` variable is initialized as `int`. That causes an overflow and returns negative size when over 2GB data is written into `ChunkedByteBufferOutputStream` -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-34276) Check the unreleased/unresolved JIRAs/PRs of Parquet 1.11
[ https://issues.apache.org/jira/browse/SPARK-34276?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17396441#comment-17396441 ] Gengliang Wang commented on SPARK-34276: ok, then shall we close this one? > Check the unreleased/unresolved JIRAs/PRs of Parquet 1.11 > -- > > Key: SPARK-34276 > URL: https://issues.apache.org/jira/browse/SPARK-34276 > Project: Spark > Issue Type: Task > Components: Build, SQL >Affects Versions: 3.2.0 >Reporter: Yuming Wang >Priority: Blocker > > Before the release, we need to double check the unreleased/unresolved > JIRAs/PRs of Parquet 1.11 and then decide whether we should upgrade/revert > Parquet. At the same time, we should encourage the whole community to do the > compatibility and performance tests for their production workloads, including > both read and write code paths. > More details: > https://github.com/apache/spark/pull/26804#issuecomment-768790620 -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-36458) MinHashLSH.approxSimilarityJoin should not required inputCol if output exist
[ https://issues.apache.org/jira/browse/SPARK-36458?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Thai Thien updated SPARK-36458: --- Description: Refer to the documentation and example code for MinHashLSH [https://spark.apache.org/docs/latest/ml-features#minhash-for-jaccard-distance] The example states that: We could avoid computing hashes by passing in the already-transformed dataset, e.g. `model.approxSimilarityJoin(transformedA, transformedB, 0.6)` However, inputCol is still required in transformedA and transformedB even if they already have outputCol. Code that should work, but doesn't: {code:java} from pyspark.ml.feature import MinHashLSH from pyspark.ml.linalg import Vectors from pyspark.sql.functions import col dataA = [(0, Vectors.sparse(6, [0, 1, 2], [1.0, 1.0, 1.0]),), (1, Vectors.sparse(6, [2, 3, 4], [1.0, 1.0, 1.0]),), (2, Vectors.sparse(6, [0, 2, 4], [1.0, 1.0, 1.0]),)] dfA = spark.createDataFrame(dataA, ["id", "features"]) dataB = [(3, Vectors.sparse(6, [1, 3, 5], [1.0, 1.0, 1.0]),), (4, Vectors.sparse(6, [2, 3, 5], [1.0, 1.0, 1.0]),), (5, Vectors.sparse(6, [1, 2, 4], [1.0, 1.0, 1.0]),)] dfB = spark.createDataFrame(dataB, ["id", "features"]) key = Vectors.sparse(6, [1, 3], [1.0, 1.0]) mh = MinHashLSH(inputCol="features", outputCol="hashes", numHashTables=5) model = mh.fit(dfA) transformedA = model.transform(dfA).select("id", "hashes") transformedB = model.transform(dfB).select("id", "hashes") model.approxSimilarityJoin(transformedA, transformedB, 0.6, distCol="JaccardDistance")\ .select(col("datasetA.id").alias("idA"), col("datasetB.id").alias("idB"), col("JaccardDistance")).show() {code} In the code above, I discard the column `features` but keep the column `hashes`, which is the output data. approxSimilarityJoin should work on `hashes` (the outputCol) alone, which exists, and ignore the missing `features` (the inputCol). 
Being able to transform the data beforehand and drop inputCol can make the input data much smaller and prevents confusion about the tip "_We could avoid computing hashes by passing in the already-transformed dataset_". was: Refer to documents and example code in MinHashLSH [https://spark.apache.org/docs/latest/ml-features#minhash-for-jaccard-distance] The example written that: We could avoid computing hashes by passing in the already-transformed dataset, e.g. `model.approxSimilarityJoin(transformedA, transformedB, 0.6)` However, inputCol still required in transformedA and transformedB even if they already have outputCol. An code that should work but it doesn't {code:java} from pyspark.ml.feature import MinHashLSH from pyspark.ml.linalg import Vectors from pyspark.sql.functions import col dataA = [(0, Vectors.sparse(6, [0, 1, 2], [1.0, 1.0, 1.0]),), (1, Vectors.sparse(6, [2, 3, 4], [1.0, 1.0, 1.0]),), (2, Vectors.sparse(6, [0, 2, 4], [1.0, 1.0, 1.0]),)] dfA = spark.createDataFrame(dataA, ["id", "features"]) dataB = [(3, Vectors.sparse(6, [1, 3, 5], [1.0, 1.0, 1.0]),), (4, Vectors.sparse(6, [2, 3, 5], [1.0, 1.0, 1.0]),), (5, Vectors.sparse(6, [1, 2, 4], [1.0, 1.0, 1.0]),)] dfB = spark.createDataFrame(dataB, ["id", "features"]) key = Vectors.sparse(6, [1, 3], [1.0, 1.0]) mh = MinHashLSH(inputCol="features", outputCol="hashes", numHashTables=5) model = mh.fit(dfA) transformedA = model.transform(dfA).select("id", "hashes") transformedB = model.transform(dfB).select("id", "hashes") model.approxSimilarityJoin(transformedA, transformedB, 0.6, distCol="JaccardDistance")\ .select(col("datasetA.id").alias("idA"), col("datasetB.id").alias("idB"), col("JaccardDistance")).show() {code} As in the code I give, I discard columns `features` but keep column `hashes` which is output data. approxSimilarityJoin should only work on `hashes` (the outputCol), which is exist and ignore the lack of `features` (the inputCol). 
Be able to transform the data beforehand and remove inputCol can make input data much smaller and prevent confusion about the tip "_We could avoid computing hashes by passing in the already-transformed dataset_". > MinHashLSH.approxSimilarityJoin should not required inputCol if output exist > > > Key: SPARK-36458 > URL: https://issues.apache.org/jira/browse/SPARK-36458 > Project: Spark > Issue Type: Improvement > Components: ML >Affects Versions: 3.1.1 >Reporter: Thai Thien >Priority: Minor > > Refer to documents and example code in MinHashLSH > > [https://spark.apache.org/docs/latest/ml-features#minhash-for-jaccard-distance] > The example written that: > We could avoid computing hashes by passing in the already-transformed > dataset, e.g. `model.approxSimilarityJoin(transformedA, transformedB, 0.6)` > However, inputCol still required in transformedA and transformedB even if > they alread
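As background for the PySpark example above: an approximate similarity join only needs the MinHash signatures, not the raw `features` column. A toy pure-Python MinHash (illustrative only, not Spark's implementation; the hash parameters are made up) shows that Jaccard similarity can be estimated from the signatures alone:

```python
import random

def make_hashes(k: int, seed: int = 42, p: int = 2_147_483_647):
    """k random affine hash functions h(x) = (a*x + b) mod p."""
    rng = random.Random(seed)
    return [(rng.randrange(1, p), rng.randrange(0, p), p) for _ in range(k)]

def signature(items, hashes):
    """MinHash signature: for each hash function, the minimum hash over the set."""
    return [min((a * x + b) % p for x in items) for a, b, p in hashes]

def approx_jaccard(sig_a, sig_b):
    """The fraction of matching signature positions estimates Jaccard similarity."""
    return sum(x == y for x, y in zip(sig_a, sig_b)) / len(sig_a)

hashes = make_hashes(100)
a, b = {0, 1, 2}, {1, 3, 5}  # true Jaccard similarity = 1/5
print(approx_jaccard(signature(a, hashes), signature(b, hashes)))
```

Since only the signatures enter the comparison, a join over already-transformed datasets need not recompute (or even see) the original feature vectors, which is the behavior the issue requests.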
[jira] [Commented] (SPARK-36463) Prohibit update mode in native support of session window
[ https://issues.apache.org/jira/browse/SPARK-36463?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17396440#comment-17396440 ] L. C. Hsieh commented on SPARK-36463: - Thanks [~kabhwan], [~Gengliang.Wang]. I will do the review ASAP. > Prohibit update mode in native support of session window > > > Key: SPARK-36463 > URL: https://issues.apache.org/jira/browse/SPARK-36463 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 3.2.0 >Reporter: Jungtaek Lim >Priority: Blocker > > The semantic of update mode for native support of session window seems to be > broken. > Strictly speaking, it doesn't break the semantic based on our explanation of > update mode: > {quote} > Update Mode - Only the rows that were updated in the Result Table since the > last trigger will be written to the external storage (available since Spark > 2.1.1). Note that this is different from the Complete Mode in that this mode > only outputs the rows that have changed since the last trigger. If the query > doesn’t contain aggregations, it will be equivalent to Append mode. > {quote} > But given that the grouping key changes due to the nature of session windows, > there is no way to "upsert" the output into the destination. If end users try to > "upsert" the output based on the grouping key, it is highly likely that a > single session window output will be written into multiple rows across > multiple updates. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
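Why upserting session-window output breaks can be shown with a toy example (pure Python, illustrative names and gap value): a session emitted after one trigger is keyed by its window start, but a later event that merges into the session can move that start, so the previously written row can no longer be matched:

```python
# Illustrative only: why keyed upserts break for session windows.
GAP = 10  # session gap in seconds (made-up value)

def merge(window, ts):
    """Extend a (start, end) session window with a new event time."""
    start, end = window
    return (min(start, ts), max(end, ts + GAP))

# Trigger 1: events at t=20 and t=25 form session (20, 35); the sink
# "upserts" a row under the key ("alice", window_start=20).
w1 = merge((20, 20 + GAP), 25)
key1 = ("alice", w1[0])

# Trigger 2: a late event at t=12 falls within the gap and merges in,
# shifting the window start; the key is now ("alice", 12).
w2 = merge(w1, 12)
key2 = ("alice", w2[0])

print(key1, key2)  # keys differ -> the earlier upserted row is orphaned
```

The same session thus appears under two different keys across triggers, which is exactly the multiple-rows-per-session behavior the issue describes.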
[jira] [Commented] (SPARK-35579) Fix a bug in janino or work around it in Spark.
[ https://issues.apache.org/jira/browse/SPARK-35579?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17396436#comment-17396436 ] Wenchen Fan commented on SPARK-35579: - The janino upgrade was reverted. Let's retarget to 3.3 > Fix a bug in janino or work around it in Spark. > --- > > Key: SPARK-35579 > URL: https://issues.apache.org/jira/browse/SPARK-35579 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.2.0 >Reporter: Wenchen Fan >Priority: Blocker > > See the test in SPARK-35578 -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-35579) Fix a bug in janino or work around it in Spark.
[ https://issues.apache.org/jira/browse/SPARK-35579?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan updated SPARK-35579: Affects Version/s: (was: 3.2.0) 3.3.0 > Fix a bug in janino or work around it in Spark. > --- > > Key: SPARK-35579 > URL: https://issues.apache.org/jira/browse/SPARK-35579 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.3.0 >Reporter: Wenchen Fan >Priority: Critical > > See the test in SPARK-35578 -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-35579) Fix a bug in janino or work around it in Spark.
[ https://issues.apache.org/jira/browse/SPARK-35579?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan updated SPARK-35579: Priority: Critical (was: Blocker) > Fix a bug in janino or work around it in Spark. > --- > > Key: SPARK-35579 > URL: https://issues.apache.org/jira/browse/SPARK-35579 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.2.0 >Reporter: Wenchen Fan >Priority: Critical > > See the test in SPARK-35578 -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-34276) Check the unreleased/unresolved JIRAs/PRs of Parquet 1.11
[ https://issues.apache.org/jira/browse/SPARK-34276?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17396435#comment-17396435 ] Yuming Wang commented on SPARK-34276: - We have used Parquet 1.11/1.12 in the production environment for half a year and have seen no obvious performance issues. > Check the unreleased/unresolved JIRAs/PRs of Parquet 1.11 > -- > > Key: SPARK-34276 > URL: https://issues.apache.org/jira/browse/SPARK-34276 > Project: Spark > Issue Type: Task > Components: Build, SQL >Affects Versions: 3.2.0 >Reporter: Yuming Wang >Priority: Blocker > > Before the release, we need to double check the unreleased/unresolved > JIRAs/PRs of Parquet 1.11 and then decide whether we should upgrade/revert > Parquet. At the same time, we should encourage the whole community to do the > compatibility and performance tests for their production workloads, including > both read and write code paths. > More details: > https://github.com/apache/spark/pull/26804#issuecomment-768790620 -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-36463) Prohibit update mode in native support of session window
[ https://issues.apache.org/jira/browse/SPARK-36463?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17396434#comment-17396434 ] Gengliang Wang commented on SPARK-36463: [~kabhwan] Thanks for the info. I plan to cut RC1 next Monday. Please try to finish the PR this week. Thanks! > Prohibit update mode in native support of session window > > > Key: SPARK-36463 > URL: https://issues.apache.org/jira/browse/SPARK-36463 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 3.2.0 >Reporter: Jungtaek Lim >Priority: Blocker > > The semantic of update mode for native support of session window seems to be > broken. > Strictly speaking, it doesn't break the semantic based on our explanation of > update mode: > {quote} > Update Mode - Only the rows that were updated in the Result Table since the > last trigger will be written to the external storage (available since Spark > 2.1.1). Note that this is different from the Complete Mode in that this mode > only outputs the rows that have changed since the last trigger. If the query > doesn’t contain aggregations, it will be equivalent to Append mode. > {quote} > But given that the grouping key changes due to the nature of session windows, > there is no way to "upsert" the output into the destination. If end users try to > "upsert" the output based on the grouping key, it is highly likely that a > single session window output will be written into multiple rows across > multiple updates. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-34276) Check the unreleased/unresolved JIRAs/PRs of Parquet 1.11
[ https://issues.apache.org/jira/browse/SPARK-34276?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17396433#comment-17396433 ] Yuming Wang commented on SPARK-34276: - It seems that there is no unreleased/unresolved JIRAs/PRs of Parquet 1.11/1.12. https://issues.apache.org/jira/issues/?jql=project%20%3D%20Parquet%20%20AND%20resolution%20%3D%20Unresolved%20AND%20affectedVersion%20in%20(1.11.0%2C%201.11.1%2C%201.11.2%2C%201.12.0%2C%201.12.1)%20%20%20ORDER%20BY%20createdDate%20%20DESC > Check the unreleased/unresolved JIRAs/PRs of Parquet 1.11 > -- > > Key: SPARK-34276 > URL: https://issues.apache.org/jira/browse/SPARK-34276 > Project: Spark > Issue Type: Task > Components: Build, SQL >Affects Versions: 3.2.0 >Reporter: Yuming Wang >Priority: Blocker > > Before the release, we need to double check the unreleased/unresolved > JIRAs/PRs of Parquet 1.11 and then decide whether we should upgrade/revert > Parquet. At the same time, we should encourage the whole community to do the > compatibility and performance tests for their production workloads, including > both read and write code paths. > More details: > https://github.com/apache/spark/pull/26804#issuecomment-768790620 -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-33807) Data Source V2: Remove read specific distributions
[ https://issues.apache.org/jira/browse/SPARK-33807?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17396432#comment-17396432 ] Gengliang Wang commented on SPARK-33807: [~aokolnychyi] I plan to cut 3.2 RC1 next Monday. What is the status of this one? Do we need to have it in Spark 3.2? > Data Source V2: Remove read specific distributions > -- > > Key: SPARK-33807 > URL: https://issues.apache.org/jira/browse/SPARK-33807 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.2.0 >Reporter: Anton Okolnychyi >Priority: Blocker > > We should remove the read-specific distributions for DS V2 as discussed > [here|https://github.com/apache/spark/pull/30706#discussion_r543059827]. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-34183) DataSource V2: Support required distribution and ordering in SS
[ https://issues.apache.org/jira/browse/SPARK-34183?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17396430#comment-17396430 ] Gengliang Wang commented on SPARK-34183: [~aokolnychyi] I plan to cut 3.2 RC1 next Monday. What is the status of this one? Do we need to have it in Spark 3.2? > DataSource V2: Support required distribution and ordering in SS > --- > > Key: SPARK-34183 > URL: https://issues.apache.org/jira/browse/SPARK-34183 > Project: Spark > Issue Type: Sub-task > Components: SQL, Structured Streaming >Affects Versions: 3.2.0 >Reporter: Anton Okolnychyi >Priority: Blocker > > We need to support a required distribution and ordering for SS. See the > discussion > [here|https://github.com/apache/spark/pull/31083#issuecomment-763214597]. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-36463) Prohibit update mode in native support of session window
[ https://issues.apache.org/jira/browse/SPARK-36463?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17396424#comment-17396424 ] Jungtaek Lim commented on SPARK-36463: -- [~Gengliang.Wang] FYI as I marked this as a blocker. I would like to remove some functionality which is unclear about the desired behavior before releasing the feature. Also cc. to [~viirya] to see whether he could help reviewing the proposed change. > Prohibit update mode in native support of session window > > > Key: SPARK-36463 > URL: https://issues.apache.org/jira/browse/SPARK-36463 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 3.2.0 >Reporter: Jungtaek Lim >Priority: Blocker > > The semantic of update mode for native support of session window seems to be > broken. > Strictly saying, it doesn't break the semantic based on our explanation of > update mode: > {quote} > Update Mode - Only the rows that were updated in the Result Table since the > last trigger will be written to the external storage (available since Spark > 2.1.1). Note that this is different from the Complete Mode in that this mode > only outputs the rows that have changed since the last trigger. If the query > doesn’t contain aggregations, it will be equivalent to Append mode. > {quote} > But given the grouping key is changing due to the nature of session window, > there is no way to "upsert" the output into destination. If end users try to > "upsert" the output based on the grouping key, it is high likely that a > single session window output will be written into multiple rows across > multiple updates. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-34276) Check the unreleased/unresolved JIRAs/PRs of Parquet 1.11
[ https://issues.apache.org/jira/browse/SPARK-34276?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17396421#comment-17396421 ] Gengliang Wang commented on SPARK-34276: [~yumwang] I plan to cut 3.2 RC1 next Monday. What is the status of this one? Do we need to have it in Spark 3.2? > Check the unreleased/unresolved JIRAs/PRs of Parquet 1.11 > -- > > Key: SPARK-34276 > URL: https://issues.apache.org/jira/browse/SPARK-34276 > Project: Spark > Issue Type: Task > Components: Build, SQL >Affects Versions: 3.2.0 >Reporter: Yuming Wang >Priority: Blocker > > Before the release, we need to double check the unreleased/unresolved > JIRAs/PRs of Parquet 1.11 and then decide whether we should upgrade/revert > Parquet. At the same time, we should encourage the whole community to do the > compatibility and performance tests for their production workloads, including > both read and write code paths. > More details: > https://github.com/apache/spark/pull/26804#issuecomment-768790620 -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-34827) Support fetching shuffle blocks in batch with i/o encryption
[ https://issues.apache.org/jira/browse/SPARK-34827?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17396420#comment-17396420 ] Gengliang Wang commented on SPARK-34827: [~dongjoon] I plan to cut 3.2 RC1 next Monday. What is the status of this one? Do we need to have it in Spark 3.2? > Support fetching shuffle blocks in batch with i/o encryption > > > Key: SPARK-34827 > URL: https://issues.apache.org/jira/browse/SPARK-34827 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.2.0 >Reporter: Dongjoon Hyun >Priority: Blocker > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-36463) Prohibit update mode in native support of session window
[ https://issues.apache.org/jira/browse/SPARK-36463?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-36463: Assignee: Apache Spark > Prohibit update mode in native support of session window > > > Key: SPARK-36463 > URL: https://issues.apache.org/jira/browse/SPARK-36463 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 3.2.0 >Reporter: Jungtaek Lim >Assignee: Apache Spark >Priority: Blocker > > The semantic of update mode for native support of session window seems to be > broken. > Strictly saying, it doesn't break the semantic based on our explanation of > update mode: > {quote} > Update Mode - Only the rows that were updated in the Result Table since the > last trigger will be written to the external storage (available since Spark > 2.1.1). Note that this is different from the Complete Mode in that this mode > only outputs the rows that have changed since the last trigger. If the query > doesn’t contain aggregations, it will be equivalent to Append mode. > {quote} > But given the grouping key is changing due to the nature of session window, > there is no way to "upsert" the output into destination. If end users try to > "upsert" the output based on the grouping key, it is high likely that a > single session window output will be written into multiple rows across > multiple updates. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-36463) Prohibit update mode in native support of session window
[ https://issues.apache.org/jira/browse/SPARK-36463?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-36463: Assignee: (was: Apache Spark) > Prohibit update mode in native support of session window > > > Key: SPARK-36463 > URL: https://issues.apache.org/jira/browse/SPARK-36463 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 3.2.0 >Reporter: Jungtaek Lim >Priority: Blocker > > The semantic of update mode for native support of session window seems to be > broken. > Strictly saying, it doesn't break the semantic based on our explanation of > update mode: > {quote} > Update Mode - Only the rows that were updated in the Result Table since the > last trigger will be written to the external storage (available since Spark > 2.1.1). Note that this is different from the Complete Mode in that this mode > only outputs the rows that have changed since the last trigger. If the query > doesn’t contain aggregations, it will be equivalent to Append mode. > {quote} > But given the grouping key is changing due to the nature of session window, > there is no way to "upsert" the output into destination. If end users try to > "upsert" the output based on the grouping key, it is high likely that a > single session window output will be written into multiple rows across > multiple updates. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-36463) Prohibit update mode in native support of session window
[ https://issues.apache.org/jira/browse/SPARK-36463?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17396396#comment-17396396 ] Apache Spark commented on SPARK-36463: -- User 'HeartSaVioR' has created a pull request for this issue: https://github.com/apache/spark/pull/33689 > Prohibit update mode in native support of session window > > > Key: SPARK-36463 > URL: https://issues.apache.org/jira/browse/SPARK-36463 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 3.2.0 >Reporter: Jungtaek Lim >Priority: Blocker > > The semantic of update mode for native support of session window seems to be > broken. > Strictly saying, it doesn't break the semantic based on our explanation of > update mode: > {quote} > Update Mode - Only the rows that were updated in the Result Table since the > last trigger will be written to the external storage (available since Spark > 2.1.1). Note that this is different from the Complete Mode in that this mode > only outputs the rows that have changed since the last trigger. If the query > doesn’t contain aggregations, it will be equivalent to Append mode. > {quote} > But given the grouping key is changing due to the nature of session window, > there is no way to "upsert" the output into destination. If end users try to > "upsert" the output based on the grouping key, it is high likely that a > single session window output will be written into multiple rows across > multiple updates. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-36463) Prohibit update mode in native support of session window
[ https://issues.apache.org/jira/browse/SPARK-36463?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17396384#comment-17396384 ] Jungtaek Lim commented on SPARK-36463: -- Will submit a PR shortly. > Prohibit update mode in native support of session window > > > Key: SPARK-36463 > URL: https://issues.apache.org/jira/browse/SPARK-36463 > Project: Spark > Issue Type: Bug > Components: Structured Streaming >Affects Versions: 3.2.0 >Reporter: Jungtaek Lim >Priority: Blocker > > The semantic of update mode for native support of session window seems to be > broken. > Strictly saying, it doesn't break the semantic based on our explanation of > update mode: > {quote} > Update Mode - Only the rows that were updated in the Result Table since the > last trigger will be written to the external storage (available since Spark > 2.1.1). Note that this is different from the Complete Mode in that this mode > only outputs the rows that have changed since the last trigger. If the query > doesn’t contain aggregations, it will be equivalent to Append mode. > {quote} > But given the grouping key is changing due to the nature of session window, > there is no way to "upsert" the output into destination. If end users try to > "upsert" the output based on the grouping key, it is high likely that a > single session window output will be written into multiple rows across > multiple updates. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-36463) Prohibit update mode in native support of session window
Jungtaek Lim created SPARK-36463: Summary: Prohibit update mode in native support of session window Key: SPARK-36463 URL: https://issues.apache.org/jira/browse/SPARK-36463 Project: Spark Issue Type: Bug Components: Structured Streaming Affects Versions: 3.2.0 Reporter: Jungtaek Lim The semantic of update mode for native support of session window seems to be broken. Strictly speaking, it doesn't break the semantics based on our explanation of update mode: {quote} Update Mode - Only the rows that were updated in the Result Table since the last trigger will be written to the external storage (available since Spark 2.1.1). Note that this is different from the Complete Mode in that this mode only outputs the rows that have changed since the last trigger. If the query doesn’t contain aggregations, it will be equivalent to Append mode. {quote} But given that the grouping key changes due to the nature of session windows, there is no way to "upsert" the output into the destination. If end users try to "upsert" the output based on the grouping key, it is highly likely that a single session window output will be written into multiple rows across multiple updates. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
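The failure mode described in the issue above can be sketched outside Spark. The snippet below is a plain-Python illustration (the 10-second gap, the user id, and the event times are assumptions for the example, not values from the issue): as late events arrive, a session's start time — and therefore its grouping key — changes, so a sink that upserts on (user, session_start) ends up with multiple rows for one logical session.

```python
# Plain-Python sketch (not Spark code) of why "update" output mode breaks
# for session windows: the session's identity changes as events arrive.

GAP = 10  # assumed session gap in seconds


def merge_event(sessions, user, ts):
    """Merge a new event time into the user's session list and return
    the (start, end) of the session the event ended up in."""
    merged = [ts, ts]
    rest = []
    for start, end in sessions.get(user, []):
        # Overlaps (within the gap) get merged into one session.
        if ts - GAP <= end and start <= ts + GAP:
            merged = [min(merged[0], start), max(merged[1], end)]
        else:
            rest.append((start, end))
    sessions[user] = rest + [tuple(merged)]
    return tuple(merged)


sessions = {}
# Trigger 1: event at t=100 emits session (100, 100)
first = merge_event(sessions, "u1", 100)
# Trigger 2: a late event at t=95 extends the SAME session to (95, 100)
second = merge_event(sessions, "u1", 95)

# An upsert keyed on (user, session_start) now sees two distinct keys for
# one logical session -- the earlier row (100, 100) is never overwritten.
print(first, second)  # (100, 100) (95, 100)
```

This is exactly the "single session window output written into multiple rows across multiple updates" scenario from the description.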
[jira] [Updated] (SPARK-36458) MinHashLSH.approxSimilarityJoin should not required inputCol if output exist
[ https://issues.apache.org/jira/browse/SPARK-36458?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Thai Thien updated SPARK-36458: --- Component/s: (was: Spark Core) ML > MinHashLSH.approxSimilarityJoin should not required inputCol if output exist > > > Key: SPARK-36458 > URL: https://issues.apache.org/jira/browse/SPARK-36458 > Project: Spark > Issue Type: Improvement > Components: ML >Affects Versions: 3.1.1 >Reporter: Thai Thien >Priority: Minor > > Refer to the documentation and example code for MinHashLSH > > [https://spark.apache.org/docs/latest/ml-features#minhash-for-jaccard-distance] > The example states that: > We could avoid computing hashes by passing in the already-transformed > dataset, e.g. `model.approxSimilarityJoin(transformedA, transformedB, 0.6)` > However, inputCol is still required in transformedA and transformedB even if > they already have outputCol. > Code that should work but doesn't: > > > {code:java} > from pyspark.ml.feature import MinHashLSH > from pyspark.ml.linalg import Vectors > from pyspark.sql.functions import col > dataA = [(0, Vectors.sparse(6, [0, 1, 2], [1.0, 1.0, 1.0]),), > (1, Vectors.sparse(6, [2, 3, 4], [1.0, 1.0, 1.0]),), > (2, Vectors.sparse(6, [0, 2, 4], [1.0, 1.0, 1.0]),)] > dfA = spark.createDataFrame(dataA, ["id", "features"]) > dataB = [(3, Vectors.sparse(6, [1, 3, 5], [1.0, 1.0, 1.0]),), > (4, Vectors.sparse(6, [2, 3, 5], [1.0, 1.0, 1.0]),), > (5, Vectors.sparse(6, [1, 2, 4], [1.0, 1.0, 1.0]),)] > dfB = spark.createDataFrame(dataB, ["id", "features"]) > key = Vectors.sparse(6, [1, 3], [1.0, 1.0]) > mh = MinHashLSH(inputCol="features", outputCol="hashes", numHashTables=5) > model = mh.fit(dfA) > transformedA = model.transform(dfA).select("id", "hashes") > transformedB = model.transform(dfB).select("id", "hashes") > model.approxSimilarityJoin(transformedA, transformedB, 0.6, > distCol="JaccardDistance")\ > .select(col("datasetA.id").alias("idA"), > col("datasetB.id").alias("idB"), > col("JaccardDistance")).show() > {code} > As in the code I gave, I discard the column `features` but keep the column `hashes`, > which is the output data. > approxSimilarityJoin should work only on `hashes` (the outputCol), which > exists, and ignore the lack of `features` (the inputCol). > Being able to transform the data beforehand and remove inputCol can make the input > data much smaller and prevent confusion about the tip "_We could avoid > computing hashes by passing in the already-transformed dataset_". -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-36459) Date Value '0001-01-01' changes to '0001-12-30' when inserted into a parquet hive table
[ https://issues.apache.org/jira/browse/SPARK-36459?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17396376#comment-17396376 ] Hyukjin Kwon commented on SPARK-36459: -- That should be fixed from Spark 3.0.0. Can you try that out? > Date Value '0001-01-01' changes to '0001-12-30' when inserted into a parquet > hive table > --- > > Key: SPARK-36459 > URL: https://issues.apache.org/jira/browse/SPARK-36459 > Project: Spark > Issue Type: Bug > Components: Spark Core, Spark Shell >Affects Versions: 2.4.4 >Reporter: sindhura alluri >Priority: Major > > Hi All, > we are seeing this issue on spark 2.4.4. Below are the steps to reproduce it. > *Login in to hive terminal on cluster and create below tables.* > create table t_src(dob timestamp); > insert into t_src values('0001-01-01 00:00:00.0'); > create table t_tgt(dob timestamp) stored as parquet; > > *Spark-shell steps :* > > import org.apache.spark.sql.hive.HiveContext > val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc) > val q0 = "TRUNCATE table t_tgt" > val q1 = "SELECT alias.dob as a0 FROM t_src alias" > val q2 = "INSERT INTO TABLE t_tgt SELECT tbl0.a0 as c0 FROM tbl0" > sqlContext.sql(q0) > sqlContext.sql(q1).select("a0").createOrReplaceTempView("tbl0") > sqlContext.sql(q2) > > After this check the contents of target table t_tgt. You will see the date > "0001-01-01 00:00:00" changed to "0001-12-30 00:00:00". > select * from t_tgt; > Is this a known issue? Is it fixed in any subsequent releases? > Thanks & regards, > Sindhura Alluri -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
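As context for the comment above (a hedged aside, not part of the Jira thread): a likely cause of the two-day shift is that Spark 2.4 handled Parquet dates/timestamps using the hybrid Julian calendar, while Spark 3.0 switched to the proleptic Gregorian calendar. Around year 1 the two calendars differ by exactly two days, which matches the two-day shift of '0001-01-01'. A minimal sketch of that offset using the standard Julian-day-number formulas:

```python
# Sketch of the Julian vs. proleptic Gregorian calendar offset around
# year 1, using the standard Julian-day-number (JDN) conversion formulas.

def jdn_julian(y, m, d):
    """Julian day number for a date in the (proleptic) Julian calendar."""
    a = (14 - m) // 12
    y2 = y + 4800 - a
    m2 = m + 12 * a - 3
    return d + (153 * m2 + 2) // 5 + 365 * y2 + y2 // 4 - 32083


def jdn_gregorian(y, m, d):
    """Julian day number for a date in the proleptic Gregorian calendar."""
    a = (14 - m) // 12
    y2 = y + 4800 - a
    m2 = m + 12 * a - 3
    return (d + (153 * m2 + 2) // 5 + 365 * y2
            + y2 // 4 - y2 // 100 + y2 // 400 - 32045)


# The same day written under one calendar and reread under the other
# lands two days off near year 1.
shift = jdn_gregorian(1, 1, 1) - jdn_julian(1, 1, 1)
print(shift)  # 2
```

Spark 3.0's calendar-rebase handling is why the follow-up comment suggests the issue should be fixed there.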
[jira] [Resolved] (SPARK-36460) Pull out NoOpMergedShuffleFileManager inner class outside
[ https://issues.apache.org/jira/browse/SPARK-36460?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gengliang Wang resolved SPARK-36460. Fix Version/s: 3.2.0 Resolution: Fixed Issue resolved by pull request 33688 [https://github.com/apache/spark/pull/33688] > Pull out NoOpMergedShuffleFileManager inner class outside > - > > Key: SPARK-36460 > URL: https://issues.apache.org/jira/browse/SPARK-36460 > Project: Spark > Issue Type: Sub-task > Components: Shuffle, Spark Core >Affects Versions: 3.2.0 >Reporter: Venkata krishnan Sowrirajan >Assignee: Venkata krishnan Sowrirajan >Priority: Major > Fix For: 3.2.0 > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-36460) Pull out NoOpMergedShuffleFileManager inner class outside
[ https://issues.apache.org/jira/browse/SPARK-36460?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gengliang Wang reassigned SPARK-36460: -- Assignee: Venkata krishnan Sowrirajan > Pull out NoOpMergedShuffleFileManager inner class outside > - > > Key: SPARK-36460 > URL: https://issues.apache.org/jira/browse/SPARK-36460 > Project: Spark > Issue Type: Sub-task > Components: Shuffle, Spark Core >Affects Versions: 3.2.0 >Reporter: Venkata krishnan Sowrirajan >Assignee: Venkata krishnan Sowrirajan >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-36377) Fix documentation in spark-env.sh.template
[ https://issues.apache.org/jira/browse/SPARK-36377?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reassigned SPARK-36377: Assignee: Yuto Akutsu > Fix documentation in spark-env.sh.template > -- > > Key: SPARK-36377 > URL: https://issues.apache.org/jira/browse/SPARK-36377 > Project: Spark > Issue Type: Documentation > Components: Documentation, Spark Submit >Affects Versions: 3.1.2 >Reporter: Yuto Akutsu >Assignee: Yuto Akutsu >Priority: Major > > Some options in the "Options read in YARN client/cluster mode" section in > spark-env.sh.template are read by other modes too (e.g. SPARK_CONF_DIR, > SPARK_EXECUTOR_CORES, etc.), so we should re-document them to help users > distinguish what's only read by YARN mode from what's not. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-36377) Fix documentation in spark-env.sh.template
[ https://issues.apache.org/jira/browse/SPARK-36377?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-36377. -- Fix Version/s: 3.3.0 Resolution: Fixed Issue resolved by pull request 33604 [https://github.com/apache/spark/pull/33604] > Fix documentation in spark-env.sh.template > -- > > Key: SPARK-36377 > URL: https://issues.apache.org/jira/browse/SPARK-36377 > Project: Spark > Issue Type: Documentation > Components: Documentation, Spark Submit >Affects Versions: 3.1.2 >Reporter: Yuto Akutsu >Assignee: Yuto Akutsu >Priority: Major > Fix For: 3.3.0 > > > Some options in the "Options read in YARN client/cluster mode" section in > spark-env.sh.template are read by other modes too (e.g. SPARK_CONF_DIR, > SPARK_EXECUTOR_CORES, etc.), so we should re-document them to help users > distinguish what's only read by YARN mode from what's not. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-36332) Cleanup RemoteBlockPushResolver log messages
[ https://issues.apache.org/jira/browse/SPARK-36332?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] wuyi resolved SPARK-36332. -- Fix Version/s: 3.3.0 3.2.0 Assignee: Venkata krishnan Sowrirajan Resolution: Fixed Issue resolved by https://github.com/apache/spark/pull/33561 > Cleanup RemoteBlockPushResolver log messages > > > Key: SPARK-36332 > URL: https://issues.apache.org/jira/browse/SPARK-36332 > Project: Spark > Issue Type: Sub-task > Components: Shuffle, Spark Core >Affects Versions: 3.2.0 >Reporter: Venkata krishnan Sowrirajan >Assignee: Venkata krishnan Sowrirajan >Priority: Minor > Fix For: 3.2.0, 3.3.0 > > > Minor cleanups to RemoteBlockPushResolver to use AppShufflePartitionsInfo > toString() for log messages. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-36386) Fix DataFrame groupby-expanding to follow pandas 1.3
[ https://issues.apache.org/jira/browse/SPARK-36386?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-36386. -- Fix Version/s: 3.3.0 Assignee: Haejoon Lee Resolution: Fixed Fixed in https://github.com/apache/spark/pull/33646 > Fix DataFrame groupby-expanding to follow pandas 1.3 > > > Key: SPARK-36386 > URL: https://issues.apache.org/jira/browse/SPARK-36386 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 3.2.0 >Reporter: Takuya Ueshin >Assignee: Haejoon Lee >Priority: Major > Fix For: 3.3.0 > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-36388) Fix DataFrame groupby-rolling to follow pandas 1.3
[ https://issues.apache.org/jira/browse/SPARK-36388?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-36388. -- Fix Version/s: 3.3.0 Assignee: Haejoon Lee Resolution: Fixed Fixed in https://github.com/apache/spark/pull/33646 > Fix DataFrame groupby-rolling to follow pandas 1.3 > -- > > Key: SPARK-36388 > URL: https://issues.apache.org/jira/browse/SPARK-36388 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 3.2.0 >Reporter: Takuya Ueshin >Assignee: Haejoon Lee >Priority: Major > Fix For: 3.3.0 > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-32709) Write Hive ORC/Parquet bucketed table with hivehash (for Hive 1,2)
[ https://issues.apache.org/jira/browse/SPARK-32709?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17396330#comment-17396330 ] Shashank Pedamallu commented on SPARK-32709: Issue observed at Lyft. When attempting to apply the patch on a production query, the query eventually fails due to S3 throttling caused by too many files generated by the bucketing, unlike Hive. To overcome the too-many-small-files problem, we tried reducing the number of reducers, which caused OOM issues. We enabled adaptive query execution, which reduced the number of reducers to 44. But even with 44 reducers and the number of buckets being 1024, the final number of files, 45057, is a little higher compared to the 1024 end files in Hive. This method did not seem to work effectively on larger tables (even with AQE, we would get hit by S3 throttling). *Query (I anonymized the names for privacy. Please let me know if that's a concern)*: {noformat} -- Default configurations SET hive.exec.compress.output=true; SET hive.input.format=org.apache.hadoop.hive.ql.io.CombineHiveInputFormat; SET mapred.output.compress=true; SET parquet.compression=SNAPPY; SET mapreduce.input.fileinputformat.split.maxsize=25600; SET mapreduce.input.fileinputformat.split.minsize=6400; SET hive.exec.parallel=true; SET hive.exec.parallel.thread.number=32; SET hive.hadoop.supports.splittable.combineinputformat=true; SET hive.exec.dynamic.partition=true; SET hive.exec.dynamic.partition.mode=nonstrict; SET hive.exec.max.dynamic.partitions=1000; SET hive.exec.max.dynamic.partitions.pernode=1000; -- User configurations SET hive.enforce.bucketing = true; SET hive.mapred.mode = nonstrict; SET hive.exec.max.created.files=180; SET hive.execution.engine=tez; -- spark configs SET spark.executor.memory=8g; SET spark.driver.memoryOverhead=4g; SET spark.driver.memory=12g; SET spark.sql.adaptive.advisoryPartitionSizeInBytes=1536MB; SET spark.sql.adaptive.coalescePartitions.minPartitionNum=16; DROP TABLE IF EXISTS anon_table_a; WITH anon_table_a AS ( SELECT col_a, col_b, col_c, RANK () OVER (PARTITION BY col_b ORDER BY col_c) AS alias_a FROM schema_a.src_table_a WHERE col_d IS NOT NULL DISTRIBUTE BY col_b SORT BY col_c ), anon_table_b AS ( SELECT col_e FROM ( SELECT col_e, ROW_NUMBER() OVER (PARTITION BY col_e) AS rn FROM schema_b.src_table_b WHERE 1 = 1 ) v WHERE rn = 1 ) INSERT OVERWRITE TABLE personal_schema.temp_dest_table SELECT data.* FROM ( SELECT col_a, col_b, alias_a FROM anon_table_a ) data LEFT OUTER JOIN anon_table_b ON data.col_b = anon_table_b.col_e WHERE (anon_table_b.col_e IS NULL OR data.col_b IS NULL ); DROP TABLE IF EXISTS personal_schema.dest_table; ALTER TABLE personal_schema.temp_dest_table RENAME TO personal_schema.dest_table {noformat} *Spark shuffle metrics:* !91275701_stage6_metrics.png|width=300,height=160! The average file size in the final path is ~500kb when writing from Spark, compared to 29MB in Hive. *Question:* So, just wanted to raise the question again about whether there is any active / planned effort to support 1 file per reducer for bucketed tables? > Write Hive ORC/Parquet bucketed table with hivehash (for Hive 1,2) > -- > > Key: SPARK-32709 > URL: https://issues.apache.org/jira/browse/SPARK-32709 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.1.0 >Reporter: Cheng Su >Priority: Minor > Attachments: 91275701_stage6_metrics.png > > > The Hive ORC/Parquet write code path is the same as the data source v1 code path > (FileFormatWriter). This JIRA is to add support for writing Hive ORC/Parquet > bucketed tables with hivehash. The change is to customize `bucketIdExpression` to > use hivehash when the table is a Hive bucketed table and the Hive version is > 1.x.y or 2.x.y. > > This will allow us to write Hive/Presto-compatible bucketed tables for Hive 1 and > 2.
-- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-32709) Write Hive ORC/Parquet bucketed table with hivehash (for Hive 1,2)
[ https://issues.apache.org/jira/browse/SPARK-32709?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shashank Pedamallu updated SPARK-32709: --- Attachment: 91275701_stage6_metrics.png > Write Hive ORC/Parquet bucketed table with hivehash (for Hive 1,2) > -- > > Key: SPARK-32709 > URL: https://issues.apache.org/jira/browse/SPARK-32709 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.1.0 >Reporter: Cheng Su >Priority: Minor > Attachments: 91275701_stage6_metrics.png > > > Hive ORC/Parquet write code path is same as data source v1 code path > (FileFormatWriter). This JIRA is to add the support to write Hive ORC/Parquet > bucketed table with hivehash. The change is to custom `bucketIdExpression` to > use hivehash when the table is Hive bucketed table, and the Hive version is > 1.x.y or 2.x.y. > > This will allow us write Hive/Presto-compatible bucketed table for Hive 1 and > 2. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
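For readers following the hivehash-compatibility discussion in the SPARK-32709 thread above, Hive's bucket assignment can be sketched in a few lines. This assumes the commonly documented behavior — the hivehash of an integer column is the value itself, and the bucket id is `(hash & Integer.MAX_VALUE) % numBuckets` — which is stated here as an assumption for illustration, not verified against a specific Hive version.

```python
# Sketch of Hive-style bucket assignment for integer keys.
# Assumption: hivehash(int) == the int value itself (as commonly described
# for Hive's ObjectInspectorUtils hashing of integer types).

INT_MAX = 0x7FFFFFFF  # Java's Integer.MAX_VALUE


def hive_bucket(key: int, num_buckets: int) -> int:
    """Bucket id Hive would assign to an integer key."""
    return (key & INT_MAX) % num_buckets


# With 1024 buckets and one reducer writing each bucket, Hive produces
# exactly 1024 files -- the behavior the comment above contrasts with
# Spark producing one file per reducer per bucket.
print(hive_bucket(5, 1024))   # 5
print(hive_bucket(-3, 1024))  # 1021  (two's-complement masking of -3)
```

Writing bucketed output that Hive 1/2 or Presto can read requires Spark to reproduce this same assignment, which is why the JIRA proposes customizing `bucketIdExpression` to use hivehash.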
[jira] [Created] (SPARK-36462) Allow Spark on Kube to operate without polling or watchers
Holden Karau created SPARK-36462: Summary: Allow Spark on Kube to operate without polling or watchers Key: SPARK-36462 URL: https://issues.apache.org/jira/browse/SPARK-36462 Project: Spark Issue Type: Improvement Components: Kubernetes Affects Versions: 3.2.0 Reporter: Holden Karau Add an option to Spark on Kube to not track the individual executor pods and just assume K8s is doing what it's asked. This would be a developer feature intended for minimizing load on etcd & driver. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-36461) Enable ObjectHashAggregate for more Aggregate functions
[ https://issues.apache.org/jira/browse/SPARK-36461?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-36461: Assignee: L. C. Hsieh (was: Apache Spark) > Enable ObjectHashAggregate for more Aggregate functions > --- > > Key: SPARK-36461 > URL: https://issues.apache.org/jira/browse/SPARK-36461 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.3.0 >Reporter: L. C. Hsieh >Assignee: L. C. Hsieh >Priority: Major > > Enabling {{ObjectHashAggregate}} for more aggregate functions, as it has > better performance compared to {{SortAggregate}} according to current > benchmarks. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-36461) Enable ObjectHashAggregate for more Aggregate functions
[ https://issues.apache.org/jira/browse/SPARK-36461?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17396246#comment-17396246 ] Apache Spark commented on SPARK-36461: -- User 'viirya' has created a pull request for this issue: https://github.com/apache/spark/pull/33068 > Enable ObjectHashAggregate for more Aggregate functions > --- > > Key: SPARK-36461 > URL: https://issues.apache.org/jira/browse/SPARK-36461 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.3.0 >Reporter: L. C. Hsieh >Assignee: L. C. Hsieh >Priority: Major > > Enabling {{ObjectHashAggregate}} for more aggregate functions, as it has > better performance compared to {{SortAggregate}} according to current > benchmarks. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-36461) Enable ObjectHashAggregate for more Aggregate functions
[ https://issues.apache.org/jira/browse/SPARK-36461?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-36461: Assignee: Apache Spark (was: L. C. Hsieh) > Enable ObjectHashAggregate for more Aggregate functions > --- > > Key: SPARK-36461 > URL: https://issues.apache.org/jira/browse/SPARK-36461 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.3.0 >Reporter: L. C. Hsieh >Assignee: Apache Spark >Priority: Major > > Enabling {{ObjectHashAggregate}} for more aggregate functions, as it has > better performance compared to {{SortAggregate}} according to current > benchmarks. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-36461) Enable ObjectHashAggregate for more Aggregate functions
L. C. Hsieh created SPARK-36461: --- Summary: Enable ObjectHashAggregate for more Aggregate functions Key: SPARK-36461 URL: https://issues.apache.org/jira/browse/SPARK-36461 Project: Spark Issue Type: Improvement Components: SQL Affects Versions: 3.3.0 Reporter: L. C. Hsieh Assignee: L. C. Hsieh Enabling {{ObjectHashAggregate}} for more aggregate functions, as it has better performance compared to {{SortAggregate}} according to current benchmarks. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-36370) Avoid using SelectionMixin._builtin_table which is removed in pandas 1.3
[ https://issues.apache.org/jira/browse/SPARK-36370?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17396223#comment-17396223 ] Apache Spark commented on SPARK-36370: -- User 'Cedric-Magnan' has created a pull request for this issue: https://github.com/apache/spark/pull/33687 > Avoid using SelectionMixin._builtin_table which is removed in pandas 1.3 > > > Key: SPARK-36370 > URL: https://issues.apache.org/jira/browse/SPARK-36370 > Project: Spark > Issue Type: Sub-task > Components: PySpark >Affects Versions: 3.2.0 >Reporter: Takuya Ueshin >Assignee: Takuya Ueshin >Priority: Major > Fix For: 3.2.0 > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-36460) Pull out NoOpMergedShuffleFileManager inner class outside
[ https://issues.apache.org/jira/browse/SPARK-36460?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-36460: Assignee: (was: Apache Spark) > Pull out NoOpMergedShuffleFileManager inner class outside > - > > Key: SPARK-36460 > URL: https://issues.apache.org/jira/browse/SPARK-36460 > Project: Spark > Issue Type: Sub-task > Components: Shuffle, Spark Core >Affects Versions: 3.2.0 >Reporter: Venkata krishnan Sowrirajan >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-36460) Pull out NoOpMergedShuffleFileManager inner class outside
[ https://issues.apache.org/jira/browse/SPARK-36460?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-36460: Assignee: Apache Spark > Pull out NoOpMergedShuffleFileManager inner class outside > - > > Key: SPARK-36460 > URL: https://issues.apache.org/jira/browse/SPARK-36460 > Project: Spark > Issue Type: Sub-task > Components: Shuffle, Spark Core >Affects Versions: 3.2.0 >Reporter: Venkata krishnan Sowrirajan >Assignee: Apache Spark >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-36460) Pull out NoOpMergedShuffleFileManager inner class outside
[ https://issues.apache.org/jira/browse/SPARK-36460?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17396195#comment-17396195 ] Apache Spark commented on SPARK-36460: -- User 'venkata91' has created a pull request for this issue: https://github.com/apache/spark/pull/33688 > Pull out NoOpMergedShuffleFileManager inner class outside > - > > Key: SPARK-36460 > URL: https://issues.apache.org/jira/browse/SPARK-36460 > Project: Spark > Issue Type: Sub-task > Components: Shuffle, Spark Core >Affects Versions: 3.2.0 >Reporter: Venkata krishnan Sowrirajan >Priority: Major > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-36454) Not push down partition filter to ORCScan for DSv2
[ https://issues.apache.org/jira/browse/SPARK-36454?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dongjoon Hyun resolved SPARK-36454. --- Fix Version/s: 3.2.0 Resolution: Fixed Issue resolved by pull request 33680 [https://github.com/apache/spark/pull/33680] > Not push down partition filter to ORCScan for DSv2 > -- > > Key: SPARK-36454 > URL: https://issues.apache.org/jira/browse/SPARK-36454 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.3.0 >Reporter: Huaxin Gao >Assignee: Apache Spark >Priority: Minor > Fix For: 3.2.0 > > > Seems to me that partition filter is only used for partition pruning and > shouldn't be pushed down to ORCScan. We don't push down partition filter to > ORCScan in DSv1, and we don't push down partition filter for parquet in both > DSv1 and DSv2. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-36460) Pull out NoOpMergedShuffleFileManager inner class outside
Venkata krishnan Sowrirajan created SPARK-36460: --- Summary: Pull out NoOpMergedShuffleFileManager inner class outside Key: SPARK-36460 URL: https://issues.apache.org/jira/browse/SPARK-36460 Project: Spark Issue Type: Sub-task Components: Shuffle, Spark Core Affects Versions: 3.2.0 Reporter: Venkata krishnan Sowrirajan -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-36459) Date Value '0001-01-01' changes to '0001-12-30' when inserted into a parquet hive table
sindhura alluri created SPARK-36459: --- Summary: Date Value '0001-01-01' changes to '0001-12-30' when inserted into a parquet hive table Key: SPARK-36459 URL: https://issues.apache.org/jira/browse/SPARK-36459 Project: Spark Issue Type: Bug Components: Spark Core, Spark Shell Affects Versions: 2.4.4 Reporter: sindhura alluri Hi All, we are seeing this issue on spark 2.4.4. Below are the steps to reproduce it. *Log in to the Hive terminal on the cluster and create the tables below.* create table t_src(dob timestamp); insert into t_src values('0001-01-01 00:00:00.0'); create table t_tgt(dob timestamp) stored as parquet; *Spark-shell steps :* import org.apache.spark.sql.hive.HiveContext val sqlContext = new org.apache.spark.sql.hive.HiveContext(sc) val q0 = "TRUNCATE table t_tgt" val q1 = "SELECT alias.dob as a0 FROM t_src alias" val q2 = "INSERT INTO TABLE t_tgt SELECT tbl0.a0 as c0 FROM tbl0" sqlContext.sql(q0) sqlContext.sql(q1).select("a0").createOrReplaceTempView("tbl0") sqlContext.sql(q2) After this, check the contents of the target table t_tgt. You will see the date "0001-01-01 00:00:00" changed to "0001-12-30 00:00:00". select * from t_tgt; Is this a known issue? Is it fixed in any subsequent releases? Thanks & regards, Sindhura Alluri -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
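A shift of about two days around year 1 is consistent with the gap between the Julian calendar (used for ancient dates by the legacy hybrid calendar in Hive and Spark 2.x) and the proleptic Gregorian calendar; Spark 3.0 later introduced datetime rebase handling for exactly this class of problem. As a hedged illustration (plain Python, not Spark code), the Julian Day Number of 0001-01-01 differs by exactly two days between the two calendars:

```python
# Sketch (not Spark code): compute the Julian Day Number of 0001-01-01 under
# both calendars using the standard integer conversion formulas, to show the
# two-day gap that matches the reported date shift.

def jdn_gregorian(year: int, month: int, day: int) -> int:
    # proleptic Gregorian calendar (with century leap-year corrections)
    a = (14 - month) // 12
    y = year + 4800 - a
    m = month + 12 * a - 3
    return day + (153 * m + 2) // 5 + 365 * y + y // 4 - y // 100 + y // 400 - 32045

def jdn_julian(year: int, month: int, day: int) -> int:
    # Julian calendar (every fourth year is a leap year, no corrections)
    a = (14 - month) // 12
    y = year + 4800 - a
    m = month + 12 * a - 3
    return day + (153 * m + 2) // 5 + 365 * y + y // 4 - 32083

gap = jdn_gregorian(1, 1, 1) - jdn_julian(1, 1, 1)
print(gap)  # 2: writing under one calendar and reading under the other shifts ancient dates
```

So a day count stored under one calendar and reinterpreted under the other lands two days off, which is the magnitude of the shift reported above.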
[jira] [Commented] (SPARK-26208) Empty dataframe does not roundtrip for csv with header
[ https://issues.apache.org/jira/browse/SPARK-26208?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17396076#comment-17396076 ] Ranga Reddy commented on SPARK-26208: - The above code works only when the dataframe is created manually. The issue still persists when we create a dataframe by reading a Hive table. *Hive Table:* {code:java} CREATE EXTERNAL TABLE `test_empty_csv_table`( `col1` bigint, `col2` bigint) STORED AS ORC LOCATION '/tmp/test_empty_csv_table';{code} *spark-shell* {code:java} val tableName = "test_empty_csv_table" val emptyCSVFilePath = "/tmp/empty_csv_file" val df = spark.sql("select * from "+tableName) df.printSchema() df.write.format("csv").option("header", true).mode("overwrite").save(emptyCSVFilePath) val df2 = spark.read.option("header", true).csv(emptyCSVFilePath) {code} {code:java} org.apache.spark.sql.AnalysisException: Unable to infer schema for CSV. It must be specified manually.; at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$9.apply(DataSource.scala:208) at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$9.apply(DataSource.scala:208) at scala.Option.getOrElse(Option.scala:121) at org.apache.spark.sql.execution.datasources.DataSource.getOrInferFileFormatSchema(DataSource.scala:207) at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:393) at org.apache.spark.sql.DataFrameReader.loadV1Source(DataFrameReader.scala:239) at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:227) at org.apache.spark.sql.DataFrameReader.csv(DataFrameReader.scala:596) at org.apache.spark.sql.DataFrameReader.csv(DataFrameReader.scala:473) ... 
49 elided{code} > Empty dataframe does not roundtrip for csv with header > -- > > Key: SPARK-26208 > URL: https://issues.apache.org/jira/browse/SPARK-26208 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.0 > Environment: master branch, > commit 034ae305c33b1990b3c1a284044002874c343b4d, > date: Sun Nov 18 16:02:15 2018 +0800 >Reporter: koert kuipers >Assignee: Koert Kuipers >Priority: Minor > Fix For: 3.0.0 > > > when we write an empty part file for csv with header=true, we fail to write the > header, so the result cannot be read back in. > when header=true, a part file with zero rows should still have a header -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
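The roundtrip invariant the report asks for can be shown outside Spark with Python's stdlib `csv` module (an analogy, not Spark code): if the header is written even when there are zero rows, a reader can always recover the column names.

```python
import csv
import io

# Illustration (plain Python, not Spark): write the header regardless of row
# count, then read the file back and recover the schema from the header line.
header = ["col1", "col2"]
rows = []  # empty dataframe analogue

buf = io.StringIO()
writer = csv.writer(buf)
writer.writerow(header)   # header goes out even with zero data rows
writer.writerows(rows)

buf.seek(0)
reader = csv.reader(buf)
recovered_header = next(reader)   # schema survives the roundtrip
recovered_rows = list(reader)

print(recovered_header)  # ['col1', 'col2']
print(recovered_rows)    # []
```

On the Spark side, the usual workaround when inference fails on empty output is to pass an explicit schema to `spark.read.schema(...).csv(...)` instead of relying on header inference.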
[jira] [Assigned] (SPARK-20384) supporting value classes over primitives in DataSets
[ https://issues.apache.org/jira/browse/SPARK-20384?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean R. Owen reassigned SPARK-20384: Assignee: Mick Jermsurawong > supporting value classes over primitives in DataSets > > > Key: SPARK-20384 > URL: https://issues.apache.org/jira/browse/SPARK-20384 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.0 >Reporter: Daniel Davis >Assignee: Mick Jermsurawong >Priority: Minor > > As a spark user who uses value classes in scala for modelling domain objects, > I also would like to make use of them for datasets. > For example, I would like to use the {{User}} case class which is using a > value-class for its {{id}} as the type for a DataSet: > - the underlying primitive should be mapped to the value-class column > - functions on the column (for example, comparison) should only work if > defined on the value-class and use this implementation > - show() should pick up the toString method of the value-class > {code} > case class Id(value: Long) extends AnyVal { > def toString: String = value.toHexString > } > case class User(id: Id, name: String) > val ds = spark.sparkContext > .parallelize(0L to 12L).map(i => (i, f"name-$i")).toDS() > .withColumnRenamed("_1", "id") > .withColumnRenamed("_2", "name") > // mapping should work > val usrs = ds.as[User] > // show should use toString > usrs.show() > // comparison with long should throw exception, as not defined on Id > usrs.col("id") > 0L > {code} > For example `.show()` should use the toString of the `Id` value class: > {noformat} > +---+---+ > | id| name| > +---+---+ > | 0| name-0| > | 1| name-1| > | 2| name-2| > | 3| name-3| > | 4| name-4| > | 5| name-5| > | 6| name-6| > | 7| name-7| > | 8| name-8| > | 9| name-9| > | A|name-10| > | B|name-11| > | C|name-12| > +---+---+ > {noformat} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, 
e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-20384) supporting value classes over primitives in DataSets
[ https://issues.apache.org/jira/browse/SPARK-20384?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean R. Owen resolved SPARK-20384. -- Fix Version/s: 3.3.0 Resolution: Fixed Issue resolved by pull request 33205 [https://github.com/apache/spark/pull/33205] > supporting value classes over primitives in DataSets > > > Key: SPARK-20384 > URL: https://issues.apache.org/jira/browse/SPARK-20384 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.1.0 >Reporter: Daniel Davis >Assignee: Mick Jermsurawong >Priority: Minor > Fix For: 3.3.0 > > > As a spark user who uses value classes in scala for modelling domain objects, > I also would like to make use of them for datasets. > For example, I would like to use the {{User}} case class which is using a > value-class for its {{id}} as the type for a DataSet: > - the underlying primitive should be mapped to the value-class column > - functions on the column (for example, comparison) should only work if > defined on the value-class and use this implementation > - show() should pick up the toString method of the value-class > {code} > case class Id(value: Long) extends AnyVal { > def toString: String = value.toHexString > } > case class User(id: Id, name: String) > val ds = spark.sparkContext > .parallelize(0L to 12L).map(i => (i, f"name-$i")).toDS() > .withColumnRenamed("_1", "id") > .withColumnRenamed("_2", "name") > // mapping should work > val usrs = ds.as[User] > // show should use toString > usrs.show() > // comparison with long should throw exception, as not defined on Id > usrs.col("id") > 0L > {code} > For example `.show()` should use the toString of the `Id` value class: > {noformat} > +---+---+ > | id| name| > +---+---+ > | 0| name-0| > | 1| name-1| > | 2| name-2| > | 3| name-3| > | 4| name-4| > | 5| name-5| > | 6| name-6| > | 7| name-7| > | 8| name-8| > | 9| name-9| > | A|name-10| > | B|name-11| > | C|name-12| > +---+---+ > {noformat} -- This message was sent by 
Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-36432) Upgrade Jetty version to 9.4.43
[ https://issues.apache.org/jira/browse/SPARK-36432?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sajith A updated SPARK-36432: - Affects Version/s: 3.2.0 > Upgrade Jetty version to 9.4.43 > --- > > Key: SPARK-36432 > URL: https://issues.apache.org/jira/browse/SPARK-36432 > Project: Spark > Issue Type: Bug > Components: Build >Affects Versions: 3.2.0, 3.3.0 >Reporter: Sajith A >Assignee: Sajith A >Priority: Minor > Fix For: 3.2.0 > > > Upgrade Jetty version to 9.4.43.v20210629 in current Spark master in order to > fix vulnerability https://nvd.nist.gov/vuln/detail/CVE-2021-34429. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-36458) MinHashLSH.approxSimilarityJoin should not required inputCol if output exist
[ https://issues.apache.org/jira/browse/SPARK-36458?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Thai Thien updated SPARK-36458: --- Description: Refer to documents and example code in MinHashLSH [https://spark.apache.org/docs/latest/ml-features#minhash-for-jaccard-distance] The example notes that: We could avoid computing hashes by passing in the already-transformed dataset, e.g. `model.approxSimilarityJoin(transformedA, transformedB, 0.6)` However, inputCol is still required in transformedA and transformedB even if they already have the outputCol. Code that should work, but doesn't: {code:java} from pyspark.ml.feature import MinHashLSH from pyspark.ml.linalg import Vectors from pyspark.sql.functions import col dataA = [(0, Vectors.sparse(6, [0, 1, 2], [1.0, 1.0, 1.0]),), (1, Vectors.sparse(6, [2, 3, 4], [1.0, 1.0, 1.0]),), (2, Vectors.sparse(6, [0, 2, 4], [1.0, 1.0, 1.0]),)] dfA = spark.createDataFrame(dataA, ["id", "features"]) dataB = [(3, Vectors.sparse(6, [1, 3, 5], [1.0, 1.0, 1.0]),), (4, Vectors.sparse(6, [2, 3, 5], [1.0, 1.0, 1.0]),), (5, Vectors.sparse(6, [1, 2, 4], [1.0, 1.0, 1.0]),)] dfB = spark.createDataFrame(dataB, ["id", "features"]) key = Vectors.sparse(6, [1, 3], [1.0, 1.0]) mh = MinHashLSH(inputCol="features", outputCol="hashes", numHashTables=5) model = mh.fit(dfA) transformedA = model.transform(dfA).select("id", "hashes") transformedB = model.transform(dfB).select("id", "hashes") model.approxSimilarityJoin(transformedA, transformedB, 0.6, distCol="JaccardDistance")\ .select(col("datasetA.id").alias("idA"), col("datasetB.id").alias("idB"), col("JaccardDistance")).show() {code} In the code above, I discard the `features` column but keep the `hashes` column, which is the output data. approxSimilarityJoin should work only on `hashes` (the outputCol), which exists, and ignore the missing `features` (the inputCol). 
Be able to transform the data beforehand and remove inputCol can make input data much smaller and prevent confusion about the tip "_We could avoid computing hashes by passing in the already-transformed dataset_". was: Refer to documents and example code in MinHashLSH https://spark.apache.org/docs/latest/ml-features#minhash-for-jaccard-distance The example written that: ``` # We could avoid computing hashes by passing in the already-transformed dataset, e.g. # `model.approxSimilarityJoin(transformedA, transformedB, 0.6)` ``` However, inputCol still required in transformedA and transformedB even if they already have outputCol. An code that should work but it doesn't ``` from pyspark.ml.feature import MinHashLSH from pyspark.ml.linalg import Vectors from pyspark.sql.functions import col dataA = [(0, Vectors.sparse(6, [0, 1, 2], [1.0, 1.0, 1.0]),), (1, Vectors.sparse(6, [2, 3, 4], [1.0, 1.0, 1.0]),), (2, Vectors.sparse(6, [0, 2, 4], [1.0, 1.0, 1.0]),)] dfA = spark.createDataFrame(dataA, ["id", "features"]) dataB = [(3, Vectors.sparse(6, [1, 3, 5], [1.0, 1.0, 1.0]),), (4, Vectors.sparse(6, [2, 3, 5], [1.0, 1.0, 1.0]),), (5, Vectors.sparse(6, [1, 2, 4], [1.0, 1.0, 1.0]),)] dfB = spark.createDataFrame(dataB, ["id", "features"]) key = Vectors.sparse(6, [1, 3], [1.0, 1.0]) mh = MinHashLSH(inputCol="features", outputCol="hashes", numHashTables=5) model = mh.fit(dfA) transformedA = model.transform(dfA).select("id", "hashes") transformedB = model.transform(dfB).select("id", "hashes") model.approxSimilarityJoin(transformedA, transformedB, 0.6, distCol="JaccardDistance")\ .select(col("datasetA.id").alias("idA"), col("datasetB.id").alias("idB"), col("JaccardDistance")).show() ``` As in the code I give, I discard columns `features` but keep column `hashes` which is output data. approxSimilarityJoin should only work on `hashes` (the outputCol), which is exist and ignore the lack of `features` (the inputCol). 
Be able to transform the data beforehand and remove inputCol can make input data much smaller and prevent confusion about "We could avoid computing hashes by passing in the already-transformed dataset". > MinHashLSH.approxSimilarityJoin should not required inputCol if output exist > > > Key: SPARK-36458 > URL: https://issues.apache.org/jira/browse/SPARK-36458 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 3.1.1 >Reporter: Thai Thien >Priority: Minor > > Refer to documents and example code in MinHashLSH > > [https://spark.apache.org/docs/latest/ml-features#minhash-for-jaccard-distance] > The example written that: > We could avoid computing hashes by passing in the already-transformed > dataset, e.g. `model.approxSimilarityJoin(transformedA, transformedB, 0.6)` > However, inputCol still requir
[jira] [Created] (SPARK-36458) MinHashLSH.approxSimilarityJoin should not required inputCol if output exist
Thai Thien created SPARK-36458: -- Summary: MinHashLSH.approxSimilarityJoin should not required inputCol if output exist Key: SPARK-36458 URL: https://issues.apache.org/jira/browse/SPARK-36458 Project: Spark Issue Type: Improvement Components: Spark Core Affects Versions: 3.1.1 Reporter: Thai Thien Refer to documents and example code in MinHashLSH https://spark.apache.org/docs/latest/ml-features#minhash-for-jaccard-distance The example notes that: ``` # We could avoid computing hashes by passing in the already-transformed dataset, e.g. # `model.approxSimilarityJoin(transformedA, transformedB, 0.6)` ``` However, inputCol is still required in transformedA and transformedB even if they already have the outputCol. Code that should work, but doesn't: ``` from pyspark.ml.feature import MinHashLSH from pyspark.ml.linalg import Vectors from pyspark.sql.functions import col dataA = [(0, Vectors.sparse(6, [0, 1, 2], [1.0, 1.0, 1.0]),), (1, Vectors.sparse(6, [2, 3, 4], [1.0, 1.0, 1.0]),), (2, Vectors.sparse(6, [0, 2, 4], [1.0, 1.0, 1.0]),)] dfA = spark.createDataFrame(dataA, ["id", "features"]) dataB = [(3, Vectors.sparse(6, [1, 3, 5], [1.0, 1.0, 1.0]),), (4, Vectors.sparse(6, [2, 3, 5], [1.0, 1.0, 1.0]),), (5, Vectors.sparse(6, [1, 2, 4], [1.0, 1.0, 1.0]),)] dfB = spark.createDataFrame(dataB, ["id", "features"]) key = Vectors.sparse(6, [1, 3], [1.0, 1.0]) mh = MinHashLSH(inputCol="features", outputCol="hashes", numHashTables=5) model = mh.fit(dfA) transformedA = model.transform(dfA).select("id", "hashes") transformedB = model.transform(dfB).select("id", "hashes") model.approxSimilarityJoin(transformedA, transformedB, 0.6, distCol="JaccardDistance")\ .select(col("datasetA.id").alias("idA"), col("datasetB.id").alias("idB"), col("JaccardDistance")).show() ``` In the code above, I discard the `features` column but keep the `hashes` column, which is the output data. 
approxSimilarityJoin should work only on `hashes` (the outputCol), which exists, and ignore the missing `features` (the inputCol). Being able to transform the data beforehand and drop the inputCol makes the input data much smaller and prevents confusion about "We could avoid computing hashes by passing in the already-transformed dataset". -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
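To make concrete why the join could, in principle, run on the hashes alone: a minimal MinHash sketch in plain Python (an illustration of the idea, not the MLlib implementation). Once per-set signatures are computed, Jaccard similarity can be estimated from the signatures by themselves; the original feature vectors are no longer needed, which is the point the report makes.

```python
import random

# Minimal MinHash sketch (plain Python, not MLlib): the similarity estimate
# uses only the signatures, not the original sets.
random.seed(42)
NUM_HASHES = 200
PRIME = 2_147_483_647  # large prime for the random affine hash family
coeffs = [(random.randrange(1, PRIME), random.randrange(0, PRIME))
          for _ in range(NUM_HASHES)]

def signature(items):
    # one minimum per hash function over the set's elements
    return [min((a * x + b) % PRIME for x in items) for a, b in coeffs]

def estimated_jaccard(sig1, sig2):
    # fraction of hash slots where the minima agree
    return sum(s1 == s2 for s1, s2 in zip(sig1, sig2)) / len(sig1)

a = {0, 1, 2}          # active indices, mirroring the sparse vectors above
b = {1, 3, 5}
sig_a, sig_b = signature(a), signature(b)
true_j = len(a & b) / len(a | b)   # exact Jaccard = 1/5 = 0.2
print(round(estimated_jaccard(sig_a, sig_b), 2), true_j)
```

The estimate concentrates around the true Jaccard similarity as the number of hash functions grows, which is why retaining only the `hashes` column is enough information for the join itself.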
[jira] [Commented] (SPARK-36367) Fix the behavior to follow pandas >= 1.3
[ https://issues.apache.org/jira/browse/SPARK-36367?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17395973#comment-17395973 ] Apache Spark commented on SPARK-36367: -- User 'Cedric-Magnan' has created a pull request for this issue: https://github.com/apache/spark/pull/33687 > Fix the behavior to follow pandas >= 1.3 > > > Key: SPARK-36367 > URL: https://issues.apache.org/jira/browse/SPARK-36367 > Project: Spark > Issue Type: Umbrella > Components: PySpark >Affects Versions: 3.3.0 >Reporter: Takuya Ueshin >Assignee: Haejoon Lee >Priority: Major > > Pandas 1.3 has been released. We should follow the new pandas behavior. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-36367) Fix the behavior to follow pandas >= 1.3
[ https://issues.apache.org/jira/browse/SPARK-36367?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-36367: Assignee: Haejoon Lee (was: Apache Spark) > Fix the behavior to follow pandas >= 1.3 > > > Key: SPARK-36367 > URL: https://issues.apache.org/jira/browse/SPARK-36367 > Project: Spark > Issue Type: Umbrella > Components: PySpark >Affects Versions: 3.3.0 >Reporter: Takuya Ueshin >Assignee: Haejoon Lee >Priority: Major > > Pandas 1.3 has been released. We should follow the new pandas behavior. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-36367) Fix the behavior to follow pandas >= 1.3
[ https://issues.apache.org/jira/browse/SPARK-36367?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-36367: Assignee: Apache Spark (was: Haejoon Lee) > Fix the behavior to follow pandas >= 1.3 > > > Key: SPARK-36367 > URL: https://issues.apache.org/jira/browse/SPARK-36367 > Project: Spark > Issue Type: Umbrella > Components: PySpark >Affects Versions: 3.3.0 >Reporter: Takuya Ueshin >Assignee: Apache Spark >Priority: Major > > Pandas 1.3 has been released. We should follow the new pandas behavior. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-36367) Fix the behavior to follow pandas >= 1.3
[ https://issues.apache.org/jira/browse/SPARK-36367?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17395972#comment-17395972 ] Apache Spark commented on SPARK-36367: -- User 'Cedric-Magnan' has created a pull request for this issue: https://github.com/apache/spark/pull/33687 > Fix the behavior to follow pandas >= 1.3 > > > Key: SPARK-36367 > URL: https://issues.apache.org/jira/browse/SPARK-36367 > Project: Spark > Issue Type: Umbrella > Components: PySpark >Affects Versions: 3.3.0 >Reporter: Takuya Ueshin >Assignee: Haejoon Lee >Priority: Major > > Pandas 1.3 has been released. We should follow the new pandas behavior. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-36455) Provide an example of complex session window via flatMapGroupsWithState
[ https://issues.apache.org/jira/browse/SPARK-36455?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jungtaek Lim reassigned SPARK-36455: Assignee: Jungtaek Lim > Provide an example of complex session window via flatMapGroupsWithState > --- > > Key: SPARK-36455 > URL: https://issues.apache.org/jira/browse/SPARK-36455 > Project: Spark > Issue Type: Improvement > Components: Structured Streaming >Affects Versions: 3.2.0 >Reporter: Jungtaek Lim >Assignee: Jungtaek Lim >Priority: Major > > Now that we replaced the sessionization example with native session window > support, we may want to provide another example of a session window which can > only be handled with flatMapGroupsWithState. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-36455) Provide an example of complex session window via flatMapGroupsWithState
[ https://issues.apache.org/jira/browse/SPARK-36455?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jungtaek Lim resolved SPARK-36455. -- Fix Version/s: 3.2.0 Resolution: Fixed Issue resolved by pull request 33681 [https://github.com/apache/spark/pull/33681] > Provide an example of complex session window via flatMapGroupsWithState > --- > > Key: SPARK-36455 > URL: https://issues.apache.org/jira/browse/SPARK-36455 > Project: Spark > Issue Type: Improvement > Components: Structured Streaming >Affects Versions: 3.2.0 >Reporter: Jungtaek Lim >Assignee: Jungtaek Lim >Priority: Major > Fix For: 3.2.0 > > > Now that we replaced the sessionization example with native session window > support, we may want to provide another example of a session window which can > only be handled with flatMapGroupsWithState. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-18105) LZ4 failed to decompress a stream of shuffled data
[ https://issues.apache.org/jira/browse/SPARK-18105?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17395953#comment-17395953 ] Cameron Todd edited comment on SPARK-18105 at 8/9/21, 10:31 AM: Yep I understand. I have hashed my data keeping the same distribution; the full_name hashed column is weak, but string distance functions still work on it. Do you have any recommendations for where I can upload this data? It's only 2 GB. {code:java} //So this line of code: Dataset relevantPivots = spark.read().parquet(pathToDataDedup) .select("id", "full_name", "last_name","birthdate") .na().drop() .withColumn("pivot_hash", hash(col("last_name"),col("birthdate"))) .drop("last_name","birthdate") .repartition(5000) .cache(); // can be replaced with this Dataset relevantPivots = spark.read().parquet(pathToDataDedup).repartition(5000).cache();{code} I have also run the same code on the same hashed data and am getting the same corrupted stream error. Also, in case it wasn't clear, my data normally sits in an S3 bucket. was (Author: cameron.todd): Yep I understand. I have hashed my data keeping the same distribution and the full_name hashed column is weak but string distance functions still work on it. Do you have any recommendations where I can upload this data? {code:java} //So this line of code: Dataset relevantPivots = spark.read().parquet(pathToDataDedup) .select("id", "full_name", "last_name","birthdate") .na().drop() .withColumn("pivot_hash", hash(col("last_name"),col("birthdate"))) .drop("last_name","birthdate") .repartition(5000) .cache(); // can be replaced with this Dataset relevantPivots = spark.read().parquet(pathToDataDedup).repartition(5000).cache();{code} I have also run the same code on the same hashed data and getting the same corrupted stream error. Also in case it wasn't clear my data normally sits on an s3 bucket. 
> LZ4 failed to decompress a stream of shuffled data > -- > > Key: SPARK-18105 > URL: https://issues.apache.org/jira/browse/SPARK-18105 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.0.1, 3.1.1 >Reporter: Davies Liu >Priority: Major > Attachments: TestWeightedGraph.java > > > When lz4 is used to compress the shuffle files, it may fail to decompress it > as "stream is corrupt" > {code} > Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: > Task 92 in stage 5.0 failed 4 times, most recent failure: Lost task 92.3 in > stage 5.0 (TID 16616, 10.0.27.18): java.io.IOException: Stream is corrupted > at > org.apache.spark.io.LZ4BlockInputStream.refill(LZ4BlockInputStream.java:220) > at > org.apache.spark.io.LZ4BlockInputStream.available(LZ4BlockInputStream.java:109) > at java.io.BufferedInputStream.read(BufferedInputStream.java:353) > at java.io.DataInputStream.read(DataInputStream.java:149) > at com.google.common.io.ByteStreams.read(ByteStreams.java:828) > at com.google.common.io.ByteStreams.readFully(ByteStreams.java:695) > at > org.apache.spark.sql.execution.UnsafeRowSerializerInstance$$anon$3$$anon$1.next(UnsafeRowSerializer.scala:127) > at > org.apache.spark.sql.execution.UnsafeRowSerializerInstance$$anon$3$$anon$1.next(UnsafeRowSerializer.scala:110) > at scala.collection.Iterator$$anon$13.next(Iterator.scala:372) > at scala.collection.Iterator$$anon$11.next(Iterator.scala:328) > at > org.apache.spark.util.CompletionIterator.next(CompletionIterator.scala:30) > at > org.apache.spark.InterruptibleIterator.next(InterruptibleIterator.scala:43) > at scala.collection.Iterator$$anon$11.next(Iterator.scala:328) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.sort_addToSorter$(Unknown > Source) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown > Source) > at > 
org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) > at > org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370) > at > org.apache.spark.sql.execution.datasources.DynamicPartitionWriterContainer.writeRows(WriterContainer.scala:397) > at > org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(InsertIntoHadoopFsRelationCommand.scala:143) > at > org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(InsertIntoHadoopFsRelationCommand.scala:143) > at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70) > at org.apache.spark.scheduler.Task.run(Task.scala:86) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:745) > {code} > https://github.com/jpountz/lz4-java/issues/89
[jira] [Resolved] (SPARK-36410) Replace anonymous classes with lambda expressions
[ https://issues.apache.org/jira/browse/SPARK-36410?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon resolved SPARK-36410. -- Fix Version/s: 3.3.0 Resolution: Fixed Issue resolved by pull request 33635 [https://github.com/apache/spark/pull/33635] > Replace anonymous classes with lambda expressions > - > > Key: SPARK-36410 > URL: https://issues.apache.org/jira/browse/SPARK-36410 > Project: Spark > Issue Type: Improvement > Components: Examples, Spark Core, SQL >Affects Versions: 3.3.0 >Reporter: Yang Jie >Assignee: Yang Jie >Priority: Minor > Fix For: 3.3.0 > > > Anonymous classes can be replaced with lambda expressions in Java code, for > example > *before* > {code:java} > new Thread(new Runnable() { > @Override > public void run() { > // run thread > } > }); > {code} > *after* > {code:java} > new Thread(() -> { > // run thread > }); > {code} -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
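The refactoring in this ticket works because `Runnable` and similar single-method types are functional interfaces, so the anonymous-class and lambda forms are interchangeable. A minimal, self-contained sketch of that equivalence (the `Supplier`-based names are illustrative, not taken from the Spark patch):

```java
import java.util.function.Supplier;

public class LambdaRefactor {
    // Anonymous-class form: several lines of ceremony for one expression.
    static Supplier<String> anonymous() {
        return new Supplier<String>() {
            @Override
            public String get() {
                return "hello";
            }
        };
    }

    // Lambda form: same behavior, because Supplier is a functional interface.
    static Supplier<String> lambda() {
        return () -> "hello";
    }

    public static void main(String[] args) {
        // Both forms produce the same result wherever a Supplier is expected.
        if (!anonymous().get().equals(lambda().get())) throw new AssertionError();
        System.out.println(lambda().get()); // prints "hello"
    }
}
```

One behavioral difference worth remembering when applying this mechanically: inside an anonymous class, `this` refers to the anonymous instance, while inside a lambda it refers to the enclosing instance, so bodies that use `this` need a closer look before converting.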
[jira] [Assigned] (SPARK-36410) Replace anonymous classes with lambda expressions
[ https://issues.apache.org/jira/browse/SPARK-36410?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hyukjin Kwon reassigned SPARK-36410: Assignee: Yang Jie > Replace anonymous classes with lambda expressions > - > > Key: SPARK-36410 > URL: https://issues.apache.org/jira/browse/SPARK-36410 > Project: Spark > Issue Type: Improvement > Components: Examples, Spark Core, SQL >Affects Versions: 3.3.0 >Reporter: Yang Jie >Assignee: Yang Jie >Priority: Minor > > Anonymous classes can be replaced with lambda expressions in Java code, for > example > *before* > {code:java} > new Thread(new Runnable() { > @Override > public void run() { > // run thread > } > }); > {code} > *after* > {code:java} > new Thread(() -> { > // run thread > }); > {code}
[jira] [Comment Edited] (SPARK-18105) LZ4 failed to decompress a stream of shuffled data
[ https://issues.apache.org/jira/browse/SPARK-18105?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17395953#comment-17395953 ] Cameron Todd edited comment on SPARK-18105 at 8/9/21, 10:29 AM: Yep I understand. I have hashed my data keeping the same distribution and the full_name hashed column is weak but string distance functions still work on it. Do you have any recommendations where I can upload this data? {code:java} //So this line of code: Dataset relevantPivots = spark.read().parquet(pathToDataDedup) .select("id", "full_name", "last_name","birthdate") .na().drop() .withColumn("pivot_hash", hash(col("last_name"),col("birthdate"))) .drop("last_name","birthdate") .repartition(5000) .cache(); // can be replaced with this Dataset relevantPivots = spark.read().parquet(pathToDataDedup).repartition(5000).cache();{code} I have also run the same code on the same hashed data and getting the same corrupted stream error. Also in case it wasn't clear my data normally sits on an s3 bucket. was (Author: cameron.todd): Yep I understand. I have attached my hashed data keeping the same distribution and the full_name hashed column is weak but string distance functions still work on it. So this line of code: {code:java} Dataset relevantPivots = spark.read().parquet(pathToDataDedup) .select("id", "full_name", "last_name","birthdate") .na().drop() .withColumn("pivot_hash", hash(col("last_name"),col("birthdate"))) .drop("last_name","birthdate") .repartition(5000) .cache(); // can be replaced with this Dataset relevantPivots = spark.read().parquet(pathToDataDedup).repartition(5000).cache();{code} I have also run the same code on the same hashed data and getting the same corrupted stream error. Also in case it wasn't clear my data normally sits on an s3 bucket. 
[jira] [Issue Comment Deleted] (SPARK-18105) LZ4 failed to decompress a stream of shuffled data
[ https://issues.apache.org/jira/browse/SPARK-18105?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Cameron Todd updated SPARK-18105: - Comment: was deleted (was: [^hashed_data.zip]) > LZ4 failed to decompress a stream of shuffled data > -- > > Key: SPARK-18105 > URL: https://issues.apache.org/jira/browse/SPARK-18105 > Project: Spark > Issue Type: Bug > Components: Spark Core >Affects Versions: 3.0.1, 3.1.1 >Reporter: Davies Liu >Priority: Major > Attachments: TestWeightedGraph.java > > > When lz4 is used to compress the shuffle files, it may fail to decompress it > as "stream is corrupt" > {code} > Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: > Task 92 in stage 5.0 failed 4 times, most recent failure: Lost task 92.3 in > stage 5.0 (TID 16616, 10.0.27.18): java.io.IOException: Stream is corrupted > at > org.apache.spark.io.LZ4BlockInputStream.refill(LZ4BlockInputStream.java:220) > at > org.apache.spark.io.LZ4BlockInputStream.available(LZ4BlockInputStream.java:109) > at java.io.BufferedInputStream.read(BufferedInputStream.java:353) > at java.io.DataInputStream.read(DataInputStream.java:149) > at com.google.common.io.ByteStreams.read(ByteStreams.java:828) > at com.google.common.io.ByteStreams.readFully(ByteStreams.java:695) > at > org.apache.spark.sql.execution.UnsafeRowSerializerInstance$$anon$3$$anon$1.next(UnsafeRowSerializer.scala:127) > at > org.apache.spark.sql.execution.UnsafeRowSerializerInstance$$anon$3$$anon$1.next(UnsafeRowSerializer.scala:110) > at scala.collection.Iterator$$anon$13.next(Iterator.scala:372) > at scala.collection.Iterator$$anon$11.next(Iterator.scala:328) > at > org.apache.spark.util.CompletionIterator.next(CompletionIterator.scala:30) > at > org.apache.spark.InterruptibleIterator.next(InterruptibleIterator.scala:43) > at scala.collection.Iterator$$anon$11.next(Iterator.scala:328) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.sort_addToSorter$(Unknown 
> Source) > at > org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown > Source) > at > org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43) > at > org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370) > at > org.apache.spark.sql.execution.datasources.DynamicPartitionWriterContainer.writeRows(WriterContainer.scala:397) > at > org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(InsertIntoHadoopFsRelationCommand.scala:143) > at > org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand$$anonfun$run$1$$anonfun$apply$mcV$sp$1.apply(InsertIntoHadoopFsRelationCommand.scala:143) > at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70) > at org.apache.spark.scheduler.Task.run(Task.scala:86) > at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274) > at > java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) > at > java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) > at java.lang.Thread.run(Thread.java:745) > {code} > https://github.com/jpountz/lz4-java/issues/89 -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-18105) LZ4 failed to decompress a stream of shuffled data
[ https://issues.apache.org/jira/browse/SPARK-18105?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17395954#comment-17395954 ] Cameron Todd commented on SPARK-18105: -- [^hashed_data.zip]
[jira] [Commented] (SPARK-18105) LZ4 failed to decompress a stream of shuffled data
[ https://issues.apache.org/jira/browse/SPARK-18105?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17395953#comment-17395953 ] Cameron Todd commented on SPARK-18105: -- Yep I understand. I have attached my hashed data keeping the same distribution and the full_name hashed column is weak but string distance functions still work on it. So this line of code: {code:java} Dataset relevantPivots = spark.read().parquet(pathToDataDedup) .select("id", "full_name", "last_name","birthdate") .na().drop() .withColumn("pivot_hash", hash(col("last_name"),col("birthdate"))) .drop("last_name","birthdate") .repartition(5000) .cache(); // can be replaced with this Dataset relevantPivots = spark.read().parquet(pathToDataDedup).repartition(5000).cache();{code} I have also run the same code on the same hashed data and getting the same corrupted stream error. Also in case it wasn't clear my data normally sits on an s3 bucket.
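The pivot-hash trick in the comment above generalizes: any deterministic hash of the blocking keys preserves group membership, and therefore the group-size distribution, of the original data, which is what makes the pseudonymized dataset usable for reproducing the bug. A hedged sketch using the JDK's `Objects.hash` rather than Spark's Murmur3-based `hash` function (the method and column names are illustrative):

```java
import java.util.Objects;

public class PivotHash {
    // Hash the blocking keys (last_name, birthdate) into a single pivot value.
    // Equal inputs always hash equal, so rows that grouped together before
    // pseudonymization still group together afterwards.
    static int pivotHash(String lastName, String birthdate) {
        return Objects.hash(lastName, birthdate);
    }

    public static void main(String[] args) {
        int a = pivotHash("Smith", "1990-01-01");
        int b = pivotHash("Smith", "1990-01-01");
        // Determinism is the property the commenter relies on.
        if (a != b) throw new AssertionError("equal keys must hash equal");
        // Distinct keys may collide in principle; distinct hashes are only
        // overwhelmingly likely, not guaranteed.
        System.out.println(a == pivotHash("Jones", "1955-07-21"));
    }
}
```

Note that this mirrors only the structure of the Spark snippet; Spark's `hash(col(...), col(...))` uses a different algorithm, so the concrete hash values will not match.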
[jira] [Assigned] (SPARK-36430) Adaptively calculate the target size when coalescing shuffle partitions in AQE
[ https://issues.apache.org/jira/browse/SPARK-36430?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan reassigned SPARK-36430: --- Assignee: Wenchen Fan > Adaptively calculate the target size when coalescing shuffle partitions in AQE > -- > > Key: SPARK-36430 > URL: https://issues.apache.org/jira/browse/SPARK-36430 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.2.0 >Reporter: Wenchen Fan >Assignee: Wenchen Fan >Priority: Major >
[jira] [Resolved] (SPARK-36430) Adaptively calculate the target size when coalescing shuffle partitions in AQE
[ https://issues.apache.org/jira/browse/SPARK-36430?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-36430. - Fix Version/s: 3.2.0 Resolution: Fixed Issue resolved by pull request 33655 [https://github.com/apache/spark/pull/33655] > Adaptively calculate the target size when coalescing shuffle partitions in AQE > -- > > Key: SPARK-36430 > URL: https://issues.apache.org/jira/browse/SPARK-36430 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.2.0 >Reporter: Wenchen Fan >Assignee: Wenchen Fan >Priority: Major > Fix For: 3.2.0 > >
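Spark's real logic lives in the AQE rule that coalesces shuffle partitions; as a rough, simplified sketch of the idea only (not Spark's implementation), the target size can be capped by an advisory size while being shrunk just enough to keep a minimum number of partitions, after which adjacent shuffle partitions are merged greedily:

```java
import java.util.ArrayList;
import java.util.List;

public class CoalesceSketch {
    // Simplified adaptive target sizing: never exceed the advisory size,
    // but shrink the target so at least minNumPartitions remain.
    static long targetSize(long totalBytes, long advisoryBytes, int minNumPartitions) {
        long maxTarget = (totalBytes + minNumPartitions - 1) / minNumPartitions; // ceil
        return Math.max(1, Math.min(advisoryBytes, maxTarget));
    }

    // Greedily merge adjacent shuffle partitions until adding the next one
    // would push the merged size past the target. Returns the merged sizes.
    static List<Long> coalesce(long[] sizes, long target) {
        List<Long> merged = new ArrayList<>();
        long current = 0;
        for (long s : sizes) {
            if (current > 0 && current + s > target) {
                merged.add(current);
                current = 0;
            }
            current += s;
        }
        if (current > 0) merged.add(current);
        return merged;
    }

    public static void main(String[] args) {
        long[] sizes = {10, 10, 10, 50, 10};
        long t = targetSize(90, 64, 2); // ceil(90/2) = 45, capped by 64 -> 45
        System.out.println(t + " " + coalesce(sizes, t)); // prints "45 [30, 50, 10]"
    }
}
```

The point of adapting the target, as opposed to using a fixed advisory size, is that a small total shuffle size would otherwise collapse into a single partition and lose all parallelism.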
[jira] [Resolved] (SPARK-36271) Hive SerDe and V1 Insert data to parquet/orc/avro need to check schema too
[ https://issues.apache.org/jira/browse/SPARK-36271?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-36271. - Fix Version/s: 3.2.0 Resolution: Fixed Issue resolved by pull request 33566 [https://github.com/apache/spark/pull/33566] > Hive SerDe and V1 Insert data to parquet/orc/avro need to check schema too > -- > > Key: SPARK-36271 > URL: https://issues.apache.org/jira/browse/SPARK-36271 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.2.0 >Reporter: angerszhu >Assignee: angerszhu >Priority: Major > Fix For: 3.2.0 > > Attachments: screenshot-1.png > > > {code:java} > withTempPath { path => > spark.sql( > s""" > |SELECT > |NAMED_STRUCT('ID', ID, 'IF(ID=1,ID,0)', IF(ID=1,ID,0), 'B', > ABS(ID)) AS col1 > |FROM v > > """.stripMargin).write.mode(SaveMode.Overwrite).parquet(path.getCanonicalPath) > } > {code} > !screenshot-1.png|width=1302,height=176!
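The failure mode here is that Parquet rejects field names containing certain characters, such as the generated alias IF(ID=1,ID,0) in the reproducer above, so the write path needs to validate the output schema before handing it to the file format. A hedged sketch of the kind of name check involved (the rejected character set mirrors the restriction Spark's Parquet schema converter has historically enforced; treat it as illustrative, not authoritative):

```java
public class ParquetNameCheck {
    // Characters assumed to be rejected in Parquet field names
    // (illustrative; the real check lives in Spark's schema converter).
    private static final String INVALID = " ,;{}()\n\t=";

    public static boolean isValidFieldName(String name) {
        return name.chars().noneMatch(c -> INVALID.indexOf(c) >= 0);
    }

    public static void main(String[] args) {
        // The generated alias from the reproducer fails: '(', ')', '=', ','.
        System.out.println(isValidFieldName("IF(ID=1,ID,0)")); // prints "false"
        System.out.println(isValidFieldName("col1"));          // prints "true"
    }
}
```

Checking the schema up front turns an opaque low-level writer error into an actionable analysis-time message telling the user to alias the offending expression.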
[jira] [Assigned] (SPARK-36271) Hive SerDe and V1 Insert data to parquet/orc/avro need to check schema too
[ https://issues.apache.org/jira/browse/SPARK-36271?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan reassigned SPARK-36271: --- Assignee: angerszhu > Hive SerDe and V1 Insert data to parquet/orc/avro need to check schema too > -- > > Key: SPARK-36271 > URL: https://issues.apache.org/jira/browse/SPARK-36271 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 3.2.0 >Reporter: angerszhu >Assignee: angerszhu >Priority: Major > Attachments: screenshot-1.png > > > {code:java} > withTempPath { path => > spark.sql( > s""" > |SELECT > |NAMED_STRUCT('ID', ID, 'IF(ID=1,ID,0)', IF(ID=1,ID,0), 'B', > ABS(ID)) AS col1 > |FROM v > > """.stripMargin).write.mode(SaveMode.Overwrite).parquet(path.getCanonicalPath) > } > {code} > !screenshot-1.png|width=1302,height=176!
[jira] [Updated] (SPARK-36339) aggsBuffer should collect AggregateExpression in the map range
[ https://issues.apache.org/jira/browse/SPARK-36339?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan updated SPARK-36339: Fix Version/s: 3.0.4 3.1.3 > aggsBuffer should collect AggregateExpression in the map range > -- > > Key: SPARK-36339 > URL: https://issues.apache.org/jira/browse/SPARK-36339 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 2.4.8, 3.0.3, 3.1.2 >Reporter: gaoyajun02 >Assignee: gaoyajun02 >Priority: Major > Labels: grouping > Fix For: 3.2.0, 3.1.3, 3.0.4 > > > show demo for this ISSUE: > {code:java} > // SQL without error > SELECT name, count(name) c > FROM VALUES ('Alice'), ('Bob') people(name) > GROUP BY name GROUPING SETS(name); > // An error is reported after exchanging the order of the query columns: > SELECT count(name) c, name > FROM VALUES ('Alice'), ('Bob') people(name) > GROUP BY name GROUPING SETS(name); > {code} > The error message is: > {code:java} > Error in query: expression 'people.`name`' is neither present in the group > by, nor is it an aggregate function. Add to group by or wrap in first() (or > first_value) if you don't care which value you get.;; > Aggregate [name#5, spark_grouping_id#3], [count(name#1) AS c#0L, name#1] > +- Expand [List(name#1, name#4, 0)], [name#1, name#5, spark_grouping_id#3] >+- Project [name#1, name#1 AS name#4] > +- SubqueryAlias `people` > +- LocalRelation [name#1] > {code} > So far, I have checked that there is no problem before version 2.3. > > During debugging, I found that the behavior of constructAggregateExprs in > ResolveGroupingAnalytics has changed. > {code:java} > /* > * Construct new aggregate expressions by replacing grouping functions. 
> */ > private def constructAggregateExprs( > groupByExprs: Seq[Expression], > aggregations: Seq[NamedExpression], > groupByAliases: Seq[Alias], > groupingAttrs: Seq[Expression], > gid: Attribute): Seq[NamedExpression] = aggregations.map { > // collect all the found AggregateExpression, so we can check an > expression is part of > // any AggregateExpression or not. > val aggsBuffer = ArrayBuffer[Expression]() > // Returns whether the expression belongs to any expressions in > `aggsBuffer` or not. > def isPartOfAggregation(e: Expression): Boolean = { > aggsBuffer.exists(a => a.find(_ eq e).isDefined) > } > replaceGroupingFunc(_, groupByExprs, gid).transformDown { > // AggregateExpression should be computed on the unmodified value of > its argument > // expressions, so we should not replace any references to grouping > expression > // inside it. > case e: AggregateExpression => > aggsBuffer += e > e > case e if isPartOfAggregation(e) => e > case e => > // Replace expression by expand output attribute. > val index = groupByAliases.indexWhere(_.child.semanticEquals(e)) > if (index == -1) { > e > } else { > groupingAttrs(index) > } > }.asInstanceOf[NamedExpression] > } > {code} > When performing aggregations.map, the aggsBuffer here is effectively outside the > scope of the per-element function: it accumulates the AggregateExpressions of all the elements > processed by the map, which was not the behavior before 2.3.
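The scoping issue described above, a buffer created once for the whole `map` so that state recorded while processing one element is still visible when the next element is processed, can be reproduced outside Spark. A minimal Java analogue (all names are illustrative, not from the Spark code):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.stream.Collectors;

public class SharedBufferPitfall {
    // Buggy shape: the buffer lives outside the per-element lambda, so it
    // accumulates across elements -- the cross-element leak in the ticket.
    static List<Integer> mapWithSharedBuffer(List<Integer> xs) {
        List<Integer> buffer = new ArrayList<>(); // created once, shared
        return xs.stream().map(x -> {
            buffer.add(x);
            return buffer.size(); // sees state left over from earlier elements
        }).collect(Collectors.toList());
    }

    // Fixed shape: a fresh buffer per element keeps the bookkeeping local.
    static List<Integer> mapWithLocalBuffer(List<Integer> xs) {
        return xs.stream().map(x -> {
            List<Integer> buffer = new ArrayList<>(); // fresh per element
            buffer.add(x);
            return buffer.size();
        }).collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<Integer> xs = List.of(1, 2, 3);
        System.out.println(mapWithSharedBuffer(xs)); // prints "[1, 2, 3]" -- grows across elements
        System.out.println(mapWithLocalBuffer(xs));  // prints "[1, 1, 1]" -- independent per element
    }
}
```

In the Scala original the same thing happens because the `val aggsBuffer` sits in the block that *produces* the function passed to `map` (via the trailing underscore), so it is evaluated once rather than once per aggregation expression.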
[jira] [Resolved] (SPARK-36424) Support eliminate limits in AQE Optimizer
[ https://issues.apache.org/jira/browse/SPARK-36424?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-36424. - Fix Version/s: 3.3.0 Resolution: Fixed Issue resolved by pull request 33651 [https://github.com/apache/spark/pull/33651] > Support eliminate limits in AQE Optimizer > - > > Key: SPARK-36424 > URL: https://issues.apache.org/jira/browse/SPARK-36424 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.2.0 >Reporter: XiDuo You >Priority: Major > Fix For: 3.3.0 > > > In ad-hoc scenarios, a limit is always added to the query when the user does not specify one, but not every limit is necessary. > With the power of AQE, we can eliminate such limits using runtime statistics.
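The elimination condition itself is simple: once runtime statistics show that the child already produces no more rows than the limit, the Limit node is a no-op and can be dropped. A hedged sketch of that predicate (not Spark's actual optimizer rule):

```java
public class EliminateLimitSketch {
    // A LIMIT is redundant once runtime stats prove the child's row count
    // already fits under it; AQE can then remove the node entirely.
    public static boolean canEliminate(long limit, long observedRowCount) {
        return observedRowCount <= limit;
    }

    public static void main(String[] args) {
        System.out.println(canEliminate(1000, 42)); // prints "true": limit is a no-op
        System.out.println(canEliminate(10, 42));   // prints "false": limit still truncates
    }
}
```

The benefit is not the comparison but what it unlocks: removing the limit lets downstream stages avoid a single-partition shuffle that only existed to enforce it.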
[jira] [Assigned] (SPARK-36352) Spark should check result plan's output schema name
[ https://issues.apache.org/jira/browse/SPARK-36352?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan reassigned SPARK-36352: --- Assignee: angerszhu > Spark should check result plan's output schema name > --- > > Key: SPARK-36352 > URL: https://issues.apache.org/jira/browse/SPARK-36352 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.2.0 >Reporter: angerszhu >Assignee: angerszhu >Priority: Major >
[jira] [Resolved] (SPARK-36352) Spark should check result plan's output schema name
[ https://issues.apache.org/jira/browse/SPARK-36352?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-36352. - Fix Version/s: 3.2.0 Resolution: Fixed Issue resolved by pull request 33583 [https://github.com/apache/spark/pull/33583] > Spark should check result plan's output schema name > --- > > Key: SPARK-36352 > URL: https://issues.apache.org/jira/browse/SPARK-36352 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 3.2.0 >Reporter: angerszhu >Assignee: angerszhu >Priority: Major > Fix For: 3.2.0 > >
[jira] [Assigned] (SPARK-36450) Remove unused UnresolvedV2Relation
[ https://issues.apache.org/jira/browse/SPARK-36450?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan reassigned SPARK-36450: --- Assignee: Terry Kim > Remove unused UnresolvedV2Relation > -- > > Key: SPARK-36450 > URL: https://issues.apache.org/jira/browse/SPARK-36450 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.3.0 >Reporter: Terry Kim >Assignee: Terry Kim >Priority: Minor > > Now that all the commands that used UnresolvedV2Relation have been migrated to use UnresolvedTable and UnresolvedView, it can be removed.
[jira] [Resolved] (SPARK-36450) Remove unused UnresolvedV2Relation
[ https://issues.apache.org/jira/browse/SPARK-36450?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Wenchen Fan resolved SPARK-36450. - Fix Version/s: 3.3.0 Resolution: Fixed Issue resolved by pull request 33677 [https://github.com/apache/spark/pull/33677] > Remove unused UnresolvedV2Relation > -- > > Key: SPARK-36450 > URL: https://issues.apache.org/jira/browse/SPARK-36450 > Project: Spark > Issue Type: Improvement > Components: SQL >Affects Versions: 3.3.0 >Reporter: Terry Kim >Assignee: Terry Kim >Priority: Minor > Fix For: 3.3.0 > > > Now that all the commands that used UnresolvedV2Relation have been migrated to use UnresolvedTable and UnresolvedView, it can be removed.