[GitHub] [spark] itholic opened a new pull request, #38420: [SPARK-40947][PS][INFRA] Upgrade pandas to 1.5.1
itholic opened a new pull request, #38420: URL: https://github.com/apache/spark/pull/38420

### What changes were proposed in this pull request?
This PR proposes upgrading pandas to 1.5.1, for pandas API on Spark. The new version of pandas (1.5.1) was released last week (Oct 19, 2022). See [What's new in 1.5.1](https://pandas.pydata.org/pandas-docs/dev/whatsnew/v1.5.1.html) for more detail.

### Why are the changes needed?
We should follow the behavior of the latest pandas, and support it.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
The existing tests should all pass.

--
This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org
For queries about this service, please contact Infrastructure at: us...@infra.apache.org
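As an illustrative aside (not part of the PR itself), version gates like this upgrade come down to comparing dotted version strings numerically rather than lexically; a minimal sketch:

```python
def parse_version(v: str) -> tuple:
    """Turn a 'major.minor.micro' string into a tuple of ints so comparison is numeric."""
    return tuple(int(part) for part in v.split("."))

# 1.5.1 is newer than 1.5.0, and numeric comparison avoids the
# classic lexical trap where "1.10.0" sorts before "1.9.0" as strings.
assert parse_version("1.5.1") > parse_version("1.5.0")
assert parse_version("1.10.0") > parse_version("1.9.0")
assert not ("1.10.0" > "1.9.0")  # plain string comparison gets this wrong
```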
[GitHub] [spark] zhengruifeng commented on a diff in pull request #38395: [SPARK-40917][SQL] Add a dedicated logical plan for `Summary`
zhengruifeng commented on code in PR #38395: URL: https://github.com/apache/spark/pull/38395#discussion_r100764

## sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/basicLogicalOperators.scala:

@@ -2100,3 +2100,53 @@ object AsOfJoin {
} } }
+
+
+/**
+ * A logical plan for summary.
+ */
+case class Summary(

Review Comment:
> But we should probably name it UnresolvedSummary

maybe we can create a `UnresolvedDataFrameFunction` instead? we can put a group of functions (`summary`, `sampleBy`, `freqItem`, `crosstab`) in it.
[GitHub] [spark] MaxGekk commented on pull request #38413: [SPARK-40936][SQL][TESTS] Refactor `AnalysisTest#assertAnalysisErrorClass` by reusing the `SparkFunSuite#checkError`
MaxGekk commented on PR #38413: URL: https://github.com/apache/spark/pull/38413#issuecomment-1294484128

Waiting for CI.
[GitHub] [spark] itholic commented on pull request #37219: [WIP][SPARK-39794][PYTHON] Introduce parametric singleton for DataType
itholic commented on PR #37219: URL: https://github.com/apache/spark/pull/37219#issuecomment-1294481459

Is this still in progress? Just confirming.
[GitHub] [spark] dongjoon-hyun commented on a diff in pull request #38397: [SPARK-40918][SQL] Mismatch between FileSourceScanExec and Orc and ParquetFileFormat on producing columnar output
dongjoon-hyun commented on code in PR #38397: URL: https://github.com/apache/spark/pull/38397#discussion_r1007642185

## sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFileFormat.scala:

@@ -173,6 +181,22 @@ class ParquetFileFormat
val datetimeRebaseModeInRead = parquetOptions.datetimeRebaseModeInRead
val int96RebaseModeInRead = parquetOptions.int96RebaseModeInRead
+// Should always be set by FileSourceScanExec creating this.
+// Check conf before checking option, to allow working around an issue by changing conf.
+val returningBatch = sparkSession.sessionState.conf.parquetVectorizedReaderEnabled &&
+ options.get(FileFormat.OPTION_RETURNING_BATCH)
+.getOrElse {
+ throw new IllegalArgumentException(
+"OPTION_RETURNING_BATCH should always be set for ParquetFileFormat." +

Review Comment:
Ditto. nit. Add one more space at the end of the message.
[GitHub] [spark] dongjoon-hyun commented on a diff in pull request #38397: [SPARK-40918][SQL] Mismatch between FileSourceScanExec and Orc and ParquetFileFormat on producing columnar output
dongjoon-hyun commented on code in PR #38397: URL: https://github.com/apache/spark/pull/38397#discussion_r1007641451

## sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/orc/OrcFileFormat.scala:

@@ -126,9 +136,24 @@ class OrcFileFormat
val resultSchema = StructType(requiredSchema.fields ++ partitionSchema.fields)
val sqlConf = sparkSession.sessionState.conf
-val enableVectorizedReader = supportBatch(sparkSession, resultSchema)
val capacity = sqlConf.orcVectorizedReaderBatchSize
+// Should always be set by FileSourceScanExec creating this.
+// Check conf before checking option, to allow working around an issue by changing conf.
+val enableVectorizedReader = sqlConf.orcVectorizedReaderEnabled &&
+ options.get(FileFormat.OPTION_RETURNING_BATCH)
+.getOrElse {
+ throw new IllegalArgumentException(
+"OPTION_RETURNING_BATCH should always be set for OrcFileFormat." +
+ "To workaround this issue, set spark.sql.orc.enableVectorizedReader=false.")

Review Comment:
Is this a correct recommendation? Why not recommend to set `OPTION_RETURNING_BATCH`?
[GitHub] [spark] dongjoon-hyun commented on a diff in pull request #38397: [SPARK-40918][SQL] Mismatch between FileSourceScanExec and Orc and ParquetFileFormat on producing columnar output
dongjoon-hyun commented on code in PR #38397: URL: https://github.com/apache/spark/pull/38397#discussion_r1007640461

## sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/orc/OrcFileFormat.scala:

@@ -126,9 +136,24 @@ class OrcFileFormat
val resultSchema = StructType(requiredSchema.fields ++ partitionSchema.fields)
val sqlConf = sparkSession.sessionState.conf
-val enableVectorizedReader = supportBatch(sparkSession, resultSchema)
val capacity = sqlConf.orcVectorizedReaderBatchSize
+// Should always be set by FileSourceScanExec creating this.
+// Check conf before checking option, to allow working around an issue by changing conf.
+val enableVectorizedReader = sqlConf.orcVectorizedReaderEnabled &&
+ options.get(FileFormat.OPTION_RETURNING_BATCH)
+.getOrElse {
+ throw new IllegalArgumentException(
+"OPTION_RETURNING_BATCH should always be set for OrcFileFormat." +

Review Comment:
nit. Add one space at the end?
```
"OPTION_RETURNING_BATCH should always be set for OrcFileFormat." +
"OPTION_RETURNING_BATCH should always be set for OrcFileFormat. " +
```
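The nit above is about adjacent string fragments fusing when concatenated without a trailing space; a small Python illustration of the same pitfall (Scala's `+` on strings behaves identically):

```python
# Without a trailing space on the first fragment, the sentences run together:
broken = ("OPTION_RETURNING_BATCH should always be set for OrcFileFormat."
          + "To workaround this issue, set spark.sql.orc.enableVectorizedReader=false.")
assert "OrcFileFormat.To workaround" in broken  # words fused

# With the suggested trailing space, the message reads correctly:
fixed = ("OPTION_RETURNING_BATCH should always be set for OrcFileFormat. "
         + "To workaround this issue, set spark.sql.orc.enableVectorizedReader=false.")
assert "OrcFileFormat. To workaround" in fixed
```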
[GitHub] [spark] dongjoon-hyun commented on pull request #37955: [SPARK-40512][SPARK-40896][PS][INFRA] Upgrade pandas to 1.5.0
dongjoon-hyun commented on PR #37955: URL: https://github.com/apache/spark/pull/37955#issuecomment-1294469606

Thank you all!
[GitHub] [spark] cloud-fan closed pull request #38406: [SPARK-40926][CONNECT] Refactor server side tests to only use DataFrame API
cloud-fan closed pull request #38406: [SPARK-40926][CONNECT] Refactor server side tests to only use DataFrame API URL: https://github.com/apache/spark/pull/38406
[GitHub] [spark] cloud-fan commented on pull request #38406: [SPARK-40926][CONNECT] Refactor server side tests to only use DataFrame API
cloud-fan commented on PR #38406: URL: https://github.com/apache/spark/pull/38406#issuecomment-1294467993

thanks, merging to master!
[GitHub] [spark] dongjoon-hyun commented on pull request #38417: [SPARK-40941][K8S] Use Java 17 in K8s Dockerfile by default and remove `Dockerfile.java17`
dongjoon-hyun commented on PR #38417: URL: https://github.com/apache/spark/pull/38417#issuecomment-1294464841

Thank you, @Yikun.
[GitHub] [spark] MaxGekk closed pull request #38416: [SPARK-40924][SQL][3.3] Fix for Unhex when input has odd number of symbols
MaxGekk closed pull request #38416: [SPARK-40924][SQL][3.3] Fix for Unhex when input has odd number of symbols URL: https://github.com/apache/spark/pull/38416
[GitHub] [spark] MaxGekk closed pull request #38398: [SPARK-39778][SQL] Improve error classes and messages
MaxGekk closed pull request #38398: [SPARK-39778][SQL] Improve error classes and messages URL: https://github.com/apache/spark/pull/38398
[GitHub] [spark] MaxGekk commented on pull request #38398: [SPARK-39778][SQL] Improve error classes and messages
MaxGekk commented on PR #38398: URL: https://github.com/apache/spark/pull/38398#issuecomment-1294464168

Merging to master. Thank you, @itholic @cloud-fan for review.
[GitHub] [spark] chenminghua8 closed pull request #38381: [SPARK-40793][SQL] Fix the LogicalRelation computeStats for Row-level Runtime Filtering cannot be applied
chenminghua8 closed pull request #38381: [SPARK-40793][SQL] Fix the LogicalRelation computeStats for Row-level Runtime Filtering cannot be applied URL: https://github.com/apache/spark/pull/38381
[GitHub] [spark] MaxGekk commented on pull request #38416: [SPARK-40924][SQL][3.3] Fix for Unhex when input has odd number of symbols
MaxGekk commented on PR #38416: URL: https://github.com/apache/spark/pull/38416#issuecomment-1294459349

All GAs passed: https://user-images.githubusercontent.com/1580697/198506375-f36f3e73-b0a8-4d2e-a3f8-9fb49ba453c3.png

Merging to 3.3. Thank you, @vitaliili-db.
[GitHub] [spark] cloud-fan commented on pull request #38336: [SPARK-40862][SQL] Support non-aggregated subqueries in RewriteCorrelatedScalarSubquery
cloud-fan commented on PR #38336: URL: https://github.com/apache/spark/pull/38336#issuecomment-1294436600

thanks, merging to master!
[GitHub] [spark] cloud-fan closed pull request #38336: [SPARK-40862][SQL] Support non-aggregated subqueries in RewriteCorrelatedScalarSubquery
cloud-fan closed pull request #38336: [SPARK-40862][SQL] Support non-aggregated subqueries in RewriteCorrelatedScalarSubquery URL: https://github.com/apache/spark/pull/38336
[GitHub] [spark] LuciferYang commented on pull request #38413: [SPARK-40936][SQL][TESTS] Remove outer conditions to simplify `AnalysisTest#assertAnalysisErrorClass` method
LuciferYang commented on PR #38413: URL: https://github.com/apache/spark/pull/38413#issuecomment-1294431154

@MaxGekk Refactor `assertAnalysisErrorClass` method:
- Reuse `checkError` in `assertAnalysisErrorClass`
- Use `queryContext` instead of `line + pos` in `assertAnalysisErrorClass` method signature
- Fixed related tests

Please review this if you have time, thanks ~
[GitHub] [spark] LuciferYang commented on a diff in pull request #38413: [WIP][SPARK-40936][SQL][TESTS] Remove outer conditions to simplify `AnalysisTest#assertAnalysisErrorClass` method
LuciferYang commented on code in PR #38413: URL: https://github.com/apache/spark/pull/38413#discussion_r1007582019

## sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/analysis/AnalysisSuite.scala:

@@ -716,9 +714,8 @@ class AnalysisSuite extends AnalysisTest with Matchers {
 assertAnalysisErrorClass(parsePlan("WITH t(x) AS (SELECT 1) SELECT * FROM t WHERE y = 1"),
 "UNRESOLVED_COLUMN.WITH_SUGGESTION",
 Map("objectName" -> "`y`", "proposal" -> "`t`.`x`"),
- caseSensitive = true,
- line = -1,
- pos = -1)
+ Array(ExpectedContext("y", 46, 46))

Review Comment:
https://github.com/apache/spark/blob/cf086b10de784fc92ae8b4d16065823ace520a7a/core/src/test/scala/org/apache/spark/SparkFunSuite.scala#L319-L332

This change is because the `checkError` method performs a forced check when `actualQueryContext` is not empty. If we want to relax some check conditions, we can add a precondition `queryContext.nonEmpty` for the `queryContext` check.
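The relaxation suggested above (only enforce the query-context comparison when a context is expected) can be sketched with a hypothetical helper; this is an illustration of the idea, not the actual `SparkFunSuite#checkError` implementation:

```python
def check_query_context(actual: list, expected: list) -> None:
    """Compare query contexts, but only when an expected context is supplied."""
    if expected:  # the suggested precondition: queryContext.nonEmpty
        assert actual == expected, f"{actual!r} != {expected!r}"

# An empty expected context skips the check entirely:
check_query_context(["some context"], [])
# A non-empty one is enforced:
check_query_context(["ctx"], ["ctx"])
```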
[GitHub] [spark] zhengruifeng commented on pull request #37955: [SPARK-40512][SPARK-40896][PS][INFRA] Upgrade pandas to 1.5.0
zhengruifeng commented on PR #37955: URL: https://github.com/apache/spark/pull/37955#issuecomment-1294400941

Merged into master, thank you @itholic for doing this!
[GitHub] [spark] zhengruifeng closed pull request #37955: [SPARK-40512][SPARK-40896][PS][INFRA] Upgrade pandas to 1.5.0
zhengruifeng closed pull request #37955: [SPARK-40512][SPARK-40896][PS][INFRA] Upgrade pandas to 1.5.0 URL: https://github.com/apache/spark/pull/37955
[GitHub] [spark-docker] Yikun commented on pull request #15: [SPARK-40569] Expose SPARK_MASTER_PORT 7077 for spark standalone cluster
Yikun commented on PR #15: URL: https://github.com/apache/spark-docker/pull/15#issuecomment-1294391023

also cc @holdenk @dongjoon-hyun
[GitHub] [spark] AngersZhuuuu commented on pull request #35594: [SPARK-38270][SQL] Spark SQL CLI's AM should keep same exit code with client side
AngersZhuuuu commented on PR #35594: URL: https://github.com/apache/spark/pull/35594#issuecomment-1294389484

ping @cloud-fan @yaooqinn @LuciferYang
[GitHub] [spark] HeartSaVioR commented on a diff in pull request #38405: [SPARK-40925][SQL][SS] Fix stateful operator late record filtering
HeartSaVioR commented on code in PR #38405: URL: https://github.com/apache/spark/pull/38405#discussion_r1007564839

## sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/SessionWindow.scala:

@@ -68,11 +68,29 @@ case class SessionWindow(timeColumn: Expression, gapDuration: Expression) extend
 with Unevaluable
 with NonSQLExpression {
+ private def inputTypeOnTimeColumn: AbstractDataType = {

Review Comment:
Nope. Just for future visibility.
[GitHub] [spark] zhengruifeng commented on pull request #38411: [SPARK-40933][SQL] Make df.stat.{cov, corr} consistent with sql functions
zhengruifeng commented on PR #38411: URL: https://github.com/apache/spark/pull/38411#issuecomment-1294384281

@HyukjinKwon as far as I know, they are the same now. The original tests didn't cover null handling or empty datasets, so I added a new UT to make sure there is no behavior change.
[GitHub] [spark] amaliujia commented on a diff in pull request #38406: [SPARK-40926][CONNECT] Refactor server side tests to only use DataFrame API
amaliujia commented on code in PR #38406: URL: https://github.com/apache/spark/pull/38406#discussion_r1007558507

## connector/connect/src/test/scala/org/apache/spark/sql/connect/planner/SparkConnectProtoSuite.scala:

@@ -215,4 +180,16 @@ class SparkConnectProtoSuite extends PlanTest with SparkConnectPlanTest {
 }
 proto.Relation.newBuilder().setLocalRelation(localRelationBuilder.build()).build()
}
+
+ private def analyzePlan(plan: LogicalPlan): LogicalPlan = {
+val connectAnalyzed = analysis.SimpleAnalyzer.execute(plan)
+analysis.SimpleAnalyzer.checkAnalysis(connectAnalyzed)
+EliminateSubqueryAliases(connectAnalyzed)

Review Comment:
There is no issue after removing it. I pushed a commit to remove it anyway.
[GitHub] [spark] amaliujia commented on a diff in pull request #38415: [SPARK-40938][CONNECT] Support Alias for every type of Relation
amaliujia commented on code in PR #38415: URL: https://github.com/apache/spark/pull/38415#discussion_r1007556667

## connector/connect/src/main/protobuf/spark/connect/relations.proto:

@@ -47,6 +47,13 @@ message Relation {
 Unknown unknown = 999;
 }
+
+ // Optional. Every relation might have an alias.
+ Alias alias = 200;
+
+ // Relation alias.
+ message Alias {

Review Comment:
The default is an empty string... I really don't like proto's way of handling this.
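For context on the complaint above: proto3 scalar fields carry no presence bit, so an unset `string` field simply reads back as `""`. A hypothetical normalization helper (illustrative only, not from the PR) shows the usual workaround:

```python
def alias_or_none(alias: str):
    """Treat proto3's empty-string default as 'alias not set'."""
    return alias or None

assert alias_or_none("") is None          # unset proto3 string reads as ""
assert alias_or_none("my_table") == "my_table"
```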
[GitHub] [spark] alex-balikov commented on a diff in pull request #38405: [SPARK-40925][SQL][SS] Fix stateful operator late record filtering
alex-balikov commented on code in PR #38405: URL: https://github.com/apache/spark/pull/38405#discussion_r1007479005

## sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/SessionWindow.scala:

@@ -68,11 +68,29 @@ case class SessionWindow(timeColumn: Expression, gapDuration: Expression) extend
 with Unevaluable
 with NonSQLExpression {
+ private def inputTypeOnTimeColumn: AbstractDataType = {

Review Comment:
Sorry, are you asking for anything actionable?

## sql/core/src/test/scala/org/apache/spark/sql/streaming/MultiStatefulOperatorsSuite.scala:

@@ -0,0 +1,400 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.streaming
+
+import org.scalatest.BeforeAndAfter
+
+import org.apache.spark.sql.SparkSession
+import org.apache.spark.sql.execution.streaming.MemoryStream
+import org.apache.spark.sql.execution.streaming.state.StateStore
+import org.apache.spark.sql.functions._
+
+// Tests for the multiple stateful operators support.
+class MultiStatefulOperatorsSuite
+ extends StreamTest with StateStoreMetricsTest with BeforeAndAfter {
+
+ import testImplicits._
+
+ before {
+SparkSession.setActiveSession(spark) // set this before force initializing 'joinExec'
+spark.streams.stateStoreCoordinator // initialize the lazy coordinator
+ }
+
+ after {
+StateStore.stop()
+ }
+
+ test("window agg -> window agg, append mode") {
+withSQLConf("spark.sql.streaming.unsupportedOperationCheck" -> "false") {
+ val inputData = MemoryStream[Int]
+
+ val stream = inputData.toDF()
+.withColumn("eventTime", timestamp_seconds($"value"))
+.withWatermark("eventTime", "0 seconds")
+.groupBy(window($"eventTime", "5 seconds") as 'window)
+.agg(count("*") as 'count)
+.groupBy(window($"window", "10 seconds"))
+.agg(count("*") as 'count, sum("count") as 'sum)
+.select($"window".getField("start").cast("long").as[Long],
+ $"count".as[Long], $"sum".as[Long])
+
+ testStream(stream)(
+AddData(inputData, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21),
+// op1 W (0, 0)
+// agg: [10, 15) 5, [15, 20) 5, [20, 25) 2
+// output: None
+// state: [10, 15) 5, [15, 20) 5, [20, 25) 2
+// op2 W (0, 0)
+// agg: None
+// output: None
+// state: None
+
+// no-data batch triggered
+
+// op1 W (0, 21)
+// agg: None
+// output: [10, 15) 5, [15, 20) 5
+// state: [20, 25) 2
+// op2 W (0, 21)
+// agg: [10, 20) (2, 10)
+// output: [10, 20) (2, 10)
+// state: None
+CheckNewAnswer((10, 2, 10)), // W (21, 21) [20, 25) 2 ... W (0, 21) None // [10, 20) 2
+assertNumStateRows(Seq(0, 1)),
+assertNumRowsDroppedByWatermark(Seq(0, 0)),
+
+AddData(inputData, 22, 23, 24, 25, 26, 27, 28, 29),
+// op1 W (21, 21)
+// agg: [20, 25) 5, [25, 30) 4
+// output: None
+// state: [20, 25) 5, [25, 30) 4
+// op2 W (21, 21)
+// agg: None
+// output: None
+// state: None
+
+// no-data batch triggered
+
+// op1 W (21, 29)
+// agg: None
+// output: [20, 25) 5
+// state: [25, 30) 4
+// op2 W (20, 25)
+// agg: [20, 30) (1, 5)
+// output: None
+// state: [20, 30) (1, 5)
+CheckNewAnswer(),
+assertNumStateRows(Seq(1, 1)),
+assertNumRowsDroppedByWatermark(Seq(0, 0)),
+
+// Move the watermark.

Review Comment:
Good point. Added late records
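The per-window counts traced in the review comments above ([10, 15) → 5, [15, 20) → 5, [20, 25) → 2 for events 10..21) follow from plain tumbling-window assignment; a minimal Python sketch of that bucketing (illustrative only, not Spark code):

```python
from collections import Counter

def window_start(ts: int, size: int) -> int:
    """Map an event timestamp to the start of its tumbling window."""
    return ts - ts % size

# Events 10..21 bucketed into 5-second windows, as in the first test batch:
counts = Counter(window_start(t, 5) for t in range(10, 22))
assert counts == {10: 5, 15: 5, 20: 2}

# The second operator re-windows the 5s windows into 10s windows; the two
# windows emitted below the watermark (21), [10, 15) and [15, 20), both
# land in the 10-second window starting at 10 -- matching CheckNewAnswer((10, 2, 10)).
assert {window_start(w, 10) for w in (10, 15)} == {10}
```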
[GitHub] [spark] cloud-fan commented on a diff in pull request #38406: [SPARK-40926][CONNECT] Refactor server side tests to only use DataFrame API
cloud-fan commented on code in PR #38406: URL: https://github.com/apache/spark/pull/38406#discussion_r100731
## connector/connect/src/test/scala/org/apache/spark/sql/connect/planner/SparkConnectProtoSuite.scala:
## @@ -215,4 +180,16 @@ class SparkConnectProtoSuite extends PlanTest with SparkConnectPlanTest {
    }
    proto.Relation.newBuilder().setLocalRelation(localRelationBuilder.build()).build()
  }
+
+  private def analyzePlan(plan: LogicalPlan): LogicalPlan = {
+    val connectAnalyzed = analysis.SimpleAnalyzer.execute(plan)
+    analysis.SimpleAnalyzer.checkAnalysis(connectAnalyzed)
+    EliminateSubqueryAliases(connectAnalyzed)
Review Comment: Did we hit any issues in this test suite without doing it?
[GitHub] [spark] cloud-fan commented on a diff in pull request #38415: [SPARK-40938][CONNECT] Support Alias for every type of Relation
cloud-fan commented on code in PR #38415: URL: https://github.com/apache/spark/pull/38415#discussion_r1007555285
## connector/connect/src/main/protobuf/spark/connect/relations.proto:
## @@ -47,6 +47,13 @@ message Relation {
    Unknown unknown = 999;
  }
+
+  // Optional. Every relation might have an alias.
+  Alias alias = 200;
+
+  // Relation alias.
+  message Alias {
Review Comment: oh, I thought the default value of string is null...
[GitHub] [spark] amaliujia commented on a diff in pull request #38415: [SPARK-40938][CONNECT] Support Alias for every type of Relation
amaliujia commented on code in PR #38415: URL: https://github.com/apache/spark/pull/38415#discussion_r1007553459
## connector/connect/src/main/protobuf/spark/connect/relations.proto:
## @@ -47,6 +47,13 @@ message Relation {
    Unknown unknown = 999;
  }
+
+  // Optional. Every relation might have an alias.
+  Alias alias = 200;
+
+  // Relation alias.
+  message Alias {
Review Comment: Do you think we need to care about the client calling `xx.as("")`? Should the client side reject this? If so, we can just use a plain string. It always comes down to whether we need to distinguish a field that is set from one that is unset or left at its default value.
[GitHub] [spark] cloud-fan commented on a diff in pull request #38415: [SPARK-40938][CONNECT] Support Alias for every type of Relation
cloud-fan commented on code in PR #38415: URL: https://github.com/apache/spark/pull/38415#discussion_r1007552036
## connector/connect/src/main/protobuf/spark/connect/relations.proto:
## @@ -47,6 +47,13 @@ message Relation {
    Unknown unknown = 999;
  }
+
+  // Optional. Every relation might have an alias.
+  Alias alias = 200;
+
+  // Relation alias.
+  message Alias {
Review Comment: why can't we use string directly?
[GitHub] [spark] cloud-fan commented on a diff in pull request #38415: [SPARK-40938][CONNECT] Support Alias for every type of Relation
cloud-fan commented on code in PR #38415: URL: https://github.com/apache/spark/pull/38415#discussion_r1007551962
## connector/connect/src/main/protobuf/spark/connect/relations.proto:
## @@ -47,6 +47,13 @@ message Relation {
    Unknown unknown = 999;
  }
+
+  // Optional. Every relation might have an alias.
+  Alias alias = 200;
+
+  // Relation alias.
+  message Alias {
Review Comment: I don't have a strong opinion, but not sure which one is better. We can also follow catalyst, and add a new plan `SubqueryAlias(child, alias)`
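The question in this thread hinges on proto3 semantics: a scalar `string` field defaults to the empty string and carries no presence information, so a bare string cannot distinguish "no alias" from an explicit `xx.as("")`, while a wrapper message can. A hypothetical Python sketch of the distinction, with dataclasses standing in for generated proto classes:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class RelationWithBareString:
    # Mirrors a proto3 `string alias = 200;` field: the default is the
    # empty string, so "never set" and `xx.as("")` look identical.
    alias: str = ""

@dataclass
class RelationWithAliasMessage:
    # Mirrors wrapping the alias in a sub-message: None models the
    # "field not present" state proto3 exposes via hasAlias().
    alias: Optional[str] = None

unset = RelationWithBareString()
explicitly_empty = RelationWithBareString(alias="")
assert unset == explicitly_empty           # indistinguishable

unset_msg = RelationWithAliasMessage()
explicitly_empty_msg = RelationWithAliasMessage(alias="")
assert unset_msg != explicitly_empty_msg   # presence is observable
```

Whether that presence bit matters is exactly the trade-off amaliujia raises: if the client rejects empty aliases up front, the bare string suffices.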
[GitHub] [spark] cloud-fan commented on a diff in pull request #38395: [SPARK-40917][SQL] Add a dedicated logical plan for `Summary`
cloud-fan commented on code in PR #38395: URL: https://github.com/apache/spark/pull/38395#discussion_r1007548886
## sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/basicLogicalOperators.scala:
## @@ -2100,3 +2100,53 @@ object AsOfJoin {
    }
  }
}
+
+
+/**
+ * A logical plan for summary.
+ */
+case class Summary(
Review Comment: I see, by using a logical plan, we delay the analysis until the spark connect planner constructs the full plan.
[GitHub] [spark] shardulm94 commented on pull request #37479: [SPARK-40045][SQL]Optimize the order of filtering predicates
shardulm94 commented on PR #37479: URL: https://github.com/apache/spark/pull/37479#issuecomment-1294365611 @caican00 Do you think this PR is ready for another round of review? In our organization, we have seen a number of users impacted by this after migration to DSv2, so it would be nice to get this merged.
[GitHub] [spark] HeartSaVioR closed pull request #38361: [SPARK-40892][SQL][SS] Loosen the requirement of window_time rule - allow multiple window_time calls
HeartSaVioR closed pull request #38361: [SPARK-40892][SQL][SS] Loosen the requirement of window_time rule - allow multiple window_time calls URL: https://github.com/apache/spark/pull/38361
[GitHub] [spark] HeartSaVioR commented on pull request #38361: [SPARK-40892][SQL][SS] Loosen the requirement of window_time rule - allow multiple window_time calls
HeartSaVioR commented on PR #38361: URL: https://github.com/apache/spark/pull/38361#issuecomment-1294361918 Thanks @cloud-fan ! Given this PR stayed for 4 days and no feedback so far, I'm merging this.
[GitHub] [spark] LuciferYang commented on pull request #38369: [SPARK-40895][BUILD] Upgrade arrow to 10.0.0
LuciferYang commented on PR #38369: URL: https://github.com/apache/spark/pull/38369#issuecomment-1294357109 Thanks @dongjoon-hyun @itholic @bjornjorgensen
[GitHub] [spark] zhengruifeng commented on a diff in pull request #38395: [SPARK-40917][SQL] Add a dedicated logical plan for `Summary`
zhengruifeng commented on code in PR #38395: URL: https://github.com/apache/spark/pull/38395#discussion_r1007535733
## sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/basicLogicalOperators.scala:
## @@ -2100,3 +2100,53 @@ object AsOfJoin {
    }
  }
}
+
+
+/**
+ * A logical plan for summary.
+ */
+case class Summary(
Review Comment: if we use `df.summary` here, an analysis exception will be thrown immediately; if we use a new unresolved plan, I guess the exception will not be thrown before job execution.
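The trade-off zhengruifeng describes — eager validation when the plan is constructed versus deferred validation when the plan is finally resolved — can be sketched generically. A hypothetical Python sketch with invented names, not the Catalyst API:

```python
class AnalysisException(Exception):
    pass

VALID_STATS = {"count", "mean", "stddev", "min", "max"}

def eager_summary(columns, statistics):
    # df.summary-style: validates immediately while building the plan.
    for s in statistics:
        if s not in VALID_STATS:
            raise AnalysisException(f"unknown statistic: {s}")
    return ("Summary", tuple(columns), tuple(statistics))

class UnresolvedSummary:
    # Logical-plan-style: construction always succeeds; the check runs
    # only when the planner resolves/executes the full plan.
    def __init__(self, columns, statistics):
        self.columns, self.statistics = columns, statistics
    def resolve(self):
        return eager_summary(self.columns, self.statistics)

plan = UnresolvedSummary(["x"], ["median?"])   # no error at construction time
try:
    plan.resolve()                             # the error surfaces here instead
    err = ""
except AnalysisException as e:
    err = str(e)
```

With the dedicated logical plan, Spark Connect can build the whole tree first and let one analysis pass report errors, at the cost of later error surfacing.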
[GitHub] [spark] cloud-fan commented on a diff in pull request #38418: [SPARK-40944][SQL] Relax ordering constraint for CREATE TABLE column options
cloud-fan commented on code in PR #38418: URL: https://github.com/apache/spark/pull/38418#discussion_r1007527891
## sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/parser/AstBuilder.scala:
## @@ -2813,13 +2813,41 @@ class AstBuilder extends SqlBaseParserBaseVisitor[AnyRef] with SQLConfHelper wit
      ctx: CreateOrReplaceTableColTypeContext): StructField = withOrigin(ctx) {
    import ctx._
+    // Check that no duplicates exist among any CREATE TABLE column options specified.
+    var isNotNull: Option[Boolean] = None
Review Comment: seems this can be a boolean, starting with false, and set to true if NOT NULL is present.
[GitHub] [spark] itholic commented on pull request #37955: [SPARK-40512][SPARK-40896][PS][INFRA] Upgrade pandas to 1.5.0
itholic commented on PR #37955: URL: https://github.com/apache/spark/pull/37955#issuecomment-1294313246 CI passed!
[GitHub] [spark] Yikun commented on pull request #38417: [SPARK-40941][K8S] Use Java 17 in K8s Dockerfile by default and remove `Dockerfile.java17`
Yikun commented on PR #38417: URL: https://github.com/apache/spark/pull/38417#issuecomment-1294284723 Thanks, late LGTM.
[GitHub] [spark] HyukjinKwon commented on pull request #38419: [SPARK-40945][SQL] Support built-in function to truncate numbers
HyukjinKwon commented on PR #38419: URL: https://github.com/apache/spark/pull/38419#issuecomment-1294277396 cc @wangyum FYI
[GitHub] [spark] dongjoon-hyun closed pull request #38417: [SPARK-40941][K8S] Use Java 17 in K8s Dockerfile by default and remove `Dockerfile.java17`
dongjoon-hyun closed pull request #38417: [SPARK-40941][K8S] Use Java 17 in K8s Dockerfile by default and remove `Dockerfile.java17` URL: https://github.com/apache/spark/pull/38417
[GitHub] [spark] dongjoon-hyun commented on pull request #38417: [SPARK-40941][K8S] Use Java 17 in K8s Dockerfile by default and remove `Dockerfile.java17`
dongjoon-hyun commented on PR #38417: URL: https://github.com/apache/spark/pull/38417#issuecomment-1294270084 Thank you so much, @viirya . Merged to master for Apache Spark 3.4.
[GitHub] [spark] dongjoon-hyun commented on pull request #38417: [SPARK-40941][K8S] Use Java 17 in K8s Dockerfile by default and remove `Dockerfile.java17`
dongjoon-hyun commented on PR #38417: URL: https://github.com/apache/spark/pull/38417#issuecomment-1294267607 Could you review this, @viirya ?
[GitHub] [spark] github-actions[bot] closed pull request #37048: [SPARK-39655][CORE] Add a config to limit the number of RDD partitions
github-actions[bot] closed pull request #37048: [SPARK-39655][CORE] Add a config to limit the number of RDD partitions URL: https://github.com/apache/spark/pull/37048
[GitHub] [spark] github-actions[bot] commented on pull request #37219: [WIP][SPARK-39794][PYTHON] Introduce parametric singleton for DataType
github-actions[bot] commented on PR #37219: URL: https://github.com/apache/spark/pull/37219#issuecomment-1294261496 We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable. If you'd like to revive this PR, please reopen it and ask a committer to remove the Stale tag!
[GitHub] [spark] vinodkc commented on a diff in pull request #38263: [SPARK-40692][SQL] Support data masking built-in function 'mask_hash'
vinodkc commented on code in PR #38263: URL: https://github.com/apache/spark/pull/38263#discussion_r1007474647
## sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala:
## @@ -3950,6 +3950,14 @@ object SQLConf {
      .checkValues(ErrorMessageFormat.values.map(_.toString))
      .createWithDefault(ErrorMessageFormat.PRETTY.toString)
+
+  val SPARK_MASKING_ALGO = buildConf("spark.sql.masking.algo")
Review Comment: I was trying to follow a similar property-value approach as in [HiveConf](https://github.com/apache/hive/blob/master/common/src/java/org/apache/hadoop/hive/conf/HiveConf.java#L4487), where the SHA algorithm type is used to decide whether FIPS mode is enabled. Do you still recommend changing it to a boolean property `spark.sql.masking.algo.fipsModeEnabled`?
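For context, hash-based masking of this kind never exposes the raw value, only a digest of it, and the config under discussion selects which digest algorithm is used. A hypothetical Python sketch (hashlib's algorithm name stands in for whatever value the final Spark conf would carry):

```python
import hashlib

def mask_hash(value: str, algo: str = "sha256") -> str:
    # Replace a sensitive value with a hex digest; `algo` plays the role
    # of a config like the spark.sql.masking.algo property discussed above.
    h = hashlib.new(algo)
    h.update(value.encode("utf-8"))
    return h.hexdigest()

masked = mask_hash("alice@example.com")
assert len(masked) == 64                                  # sha256: 32 bytes, 64 hex chars
assert len(mask_hash("alice@example.com", "sha512")) == 128
assert masked != "alice@example.com"
```

Because the digest is deterministic, masked columns still join and group correctly, which is the usual reason for hashing rather than redacting.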
[GitHub] [spark] vinodkc commented on pull request #38146: [SPARK-40687][SQL] Support data masking built-in function 'mask'
vinodkc commented on PR #38146: URL: https://github.com/apache/spark/pull/38146#issuecomment-1294245157 @dtenedor , yes, please close yours as a dup. I appreciate your help in reviewing this PR and on top of this change, I'm planning to add additional built-in mask functions supported in Hive
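The Hive-style `mask` semantics being ported here replace character classes rather than hashing: uppercase letters become 'X', lowercase become 'x', digits become 'n', and everything else passes through. A minimal Python sketch of that behavior (not the Spark expression itself):

```python
def mask(s: str, upper: str = "X", lower: str = "x", digit: str = "n") -> str:
    # Character-class masking in the style of Hive's mask():
    # uppercase -> 'X', lowercase -> 'x', digits -> 'n', rest unchanged.
    out = []
    for ch in s:
        if ch.isupper():
            out.append(upper)
        elif ch.islower():
            out.append(lower)
        elif ch.isdigit():
            out.append(digit)
        else:
            out.append(ch)
    return "".join(out)

assert mask("AbCD123-@$%") == "XxXXnnn-@$%"
```

Unlike `mask_hash`, this preserves the shape of the value (length, punctuation) while hiding its content.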
[GitHub] [spark] vinodkc commented on pull request #38419: [SPARK-40945][SQL] Support built-in function to truncate numbers
vinodkc commented on PR #38419: URL: https://github.com/apache/spark/pull/38419#issuecomment-1294221419 @HyukjinKwon , @dongjoon-hyun , Can you please review this PR?
[GitHub] [spark] vinodkc opened a new pull request, #38419: [SPARK-40945][SQL] Support built-in function to truncate numbers
vinodkc opened a new pull request, #38419: URL: https://github.com/apache/spark/pull/38419
### What changes were proposed in this pull request?
This PR implements the built-in function `TRUNC` to truncate numbers to the previous integer or decimal. It optionally accepts a second integer argument to indicate the number of decimal places for rounding, in either direction. When the second argument is absent, the function rounds to the nearest whole number. When the second argument is specified, the function rounds to the nearest number with n decimal places of precision.
### Why are the changes needed?
To support truncating numbers, e.g.:
trunc(123.4382) -> 123
trunc(-123.4382) -> -123
trunc(123.4382, 2) -> 123.43
trunc(123.4382, 1) -> 123.4
trunc(123.4382, 0) -> 123
trunc(123.4382, -1) -> 120
trunc(123.4382, -2) -> 100
trunc(123.4382, -3) -> 0
The same feature is available in Hive, PostgreSQL, MySQL, Oracle, and Redshift.
**Some references**:
PostgreSQL: https://www.postgresql.org/docs/15/functions-math.html
AWS Redshift: https://docs.aws.amazon.com/redshift/latest/dg/r_TRUNC.html
Presto: https://prestodb.io/docs/current/functions/math.html?highlight=trunc#id3
### Does this PR introduce _any_ user-facing change?
Yes, a new built-in function 'trunc' is added. The same function has an overloaded version to truncate Date values.
### How was this patch tested?
Added test cases
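The truncation table above can be reproduced with Python's decimal module; ROUND_DOWN rounds toward zero, which matches trunc's behavior for both signs. A sketch under that assumption, not the PR's Scala implementation:

```python
from decimal import Decimal, ROUND_DOWN

def trunc(x, scale=0):
    # Truncate toward zero at `scale` decimal places; a negative scale
    # truncates to the left of the decimal point, mirroring SQL TRUNC.
    q = Decimal(1).scaleb(-scale)          # e.g. scale=2 -> Decimal('0.01')
    return Decimal(str(x)).quantize(q, rounding=ROUND_DOWN)

assert str(trunc(123.4382, 2)) == "123.43"
assert str(trunc(123.4382)) == "123"
assert str(trunc(-123.4382)) == "-123"    # toward zero, not floor
assert int(trunc(123.4382, -1)) == 120
assert int(trunc(123.4382, -3)) == 0
```

Going through `Decimal(str(x))` avoids binary-float surprises such as 0.1 not being exactly representable.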
[GitHub] [spark] rahulsmahadev commented on a diff in pull request #38404: [WIP] Replace Where
rahulsmahadev commented on code in PR #38404: URL: https://github.com/apache/spark/pull/38404#discussion_r1007450737
## sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala:
## @@ -1276,6 +1276,24 @@ class Analyzer(override val catalogManager: CatalogManager)
            AppendData.byPosition(r, query)
          } else if (conf.partitionOverwriteMode == PartitionOverwriteMode.DYNAMIC) {
            OverwritePartitionsDynamic.byPosition(r, query)
+          } else if (i.replacePredicates.nonEmpty) {
+            def findAttrInRelation(name: String): Attribute = {
Review Comment: we should check with the OSS reviewers (once this PR is ready) whether there is an existing method for this, so that we don't need to duplicate code.
[GitHub] [spark] dtenedor opened a new pull request, #38418: [SPARK-40944][SQL] Relax ordering constraint for CREATE TABLE column options
dtenedor opened a new pull request, #38418: URL: https://github.com/apache/spark/pull/38418
### What changes were proposed in this pull request?
Relax ordering constraint for CREATE TABLE column options. Before this PR, the grammar for each CREATE TABLE column was:
```
createOrReplaceTableColType
  : colName=errorCapturingIdentifier dataType (NOT NULL)? defaultExpression? commentSpec?
  ;
```
### Why are the changes needed?
There was a constraint on the order of: `(NOT NULL, DEFAULT value, COMMENT value)`. This PR updates the grammar to allow these options in any order instead, to improve usability.
### Does this PR introduce _any_ user-facing change?
Yes, the SQL syntax updates slightly.
### How was this patch tested?
Existing parser tests plus a new unit test.
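The effect of the relaxed grammar, plus the duplicate check discussed in the review above, can be sketched as an order-insensitive option collector. A hypothetical Python sketch (names invented; Spark's AstBuilder works on ANTLR parse contexts instead):

```python
def parse_column_options(tokens):
    # Accept (NOT NULL | DEFAULT expr | COMMENT str) in any order,
    # rejecting duplicates of the same option kind.
    seen = set()
    opts = {"not_null": False, "default": None, "comment": None}
    for kind, value in tokens:             # tokens may arrive in any order
        if kind in seen:
            raise ValueError(f"duplicate column option: {kind}")
        seen.add(kind)
        if kind == "NOT NULL":
            opts["not_null"] = True        # plain boolean, per the review suggestion
        elif kind == "DEFAULT":
            opts["default"] = value
        elif kind == "COMMENT":
            opts["comment"] = value
        else:
            raise ValueError(f"unknown column option: {kind}")
    return opts

# COMMENT before NOT NULL is now accepted:
opts = parse_column_options([("COMMENT", "id col"), ("NOT NULL", None)])
assert opts == {"not_null": True, "default": None, "comment": "id col"}
```

Once ordering is no longer enforced by the grammar, the duplicate check is what keeps `NOT NULL NOT NULL` from silently parsing.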
[GitHub] [spark] carlfu-db commented on pull request #38404: [WIP] Replace Where
carlfu-db commented on PR #38404: URL: https://github.com/apache/spark/pull/38404#issuecomment-1294192968 > Mind adding a test, filing a JIRA, etc? See also https://spark.apache.org/contributing.html Will do. Still in progress :)
[GitHub] [spark] HeartSaVioR commented on a diff in pull request #38405: [SPARK-40925][SQL][SS] Fix stateful operator late record filtering
HeartSaVioR commented on code in PR #38405: URL: https://github.com/apache/spark/pull/38405#discussion_r1007411825
## sql/core/src/test/scala/org/apache/spark/sql/streaming/MultiStatefulOperatorsSuite.scala:
## @@ -0,0 +1,400 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.streaming
+
+import org.scalatest.BeforeAndAfter
+
+import org.apache.spark.sql.SparkSession
+import org.apache.spark.sql.execution.streaming.MemoryStream
+import org.apache.spark.sql.execution.streaming.state.StateStore
+import org.apache.spark.sql.functions._
+
+// Tests for the multiple stateful operators support.
Review Comment: Although it's not mandatory, as another test case contains the verification. Just something "good to have".
[GitHub] [spark] sunchao commented on pull request #38277: [SPARK-40815][SQL] Introduce DelegateSymlinkTextInputFormat to handle empty splits when "spark.hadoopRDD.ignoreEmptySplits" is enabled in ord
sunchao commented on PR #38277: URL: https://github.com/apache/spark/pull/38277#issuecomment-129460

@sadikovi sorry for the delay, will take a look.
[GitHub] [spark] sadikovi commented on pull request #38277: [SPARK-40815][SQL] Introduce DelegateSymlinkTextInputFormat to handle empty splits when "spark.hadoopRDD.ignoreEmptySplits" is enabled in or
sadikovi commented on PR #38277: URL: https://github.com/apache/spark/pull/38277#issuecomment-1294089945

@sunchao @dongjoon-hyun Could you take another look? Thanks. I have addressed your comments.
[GitHub] [spark] alex-balikov commented on a diff in pull request #38405: [SPARK-40925][SQL][SS] Fix stateful operator late record filtering
alex-balikov commented on code in PR #38405: URL: https://github.com/apache/spark/pull/38405#discussion_r1007394123

## sql/core/src/test/scala/org/apache/spark/sql/streaming/MultiStatefulOperatorsSuite.scala:
## @@ -0,0 +1,400 @@
(quoted diff elided: verbatim repeat of the MultiStatefulOperatorsSuite context quoted above; the review comment text is truncated in the archive)
[GitHub] [spark] HeartSaVioR commented on a diff in pull request #38405: [SPARK-40925][SQL][SS] Fix stateful operator late record filtering
HeartSaVioR commented on code in PR #38405: URL: https://github.com/apache/spark/pull/38405#discussion_r1007386332

## sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala:
## @@ -1941,6 +1941,22 @@ object SQLConf {
       .booleanConf
       .createWithDefault(true)

+  val STATEFUL_OPERATOR_ALLOW_MULTIPLE =
+    buildConf("spark.sql.streaming.statefulOperator.allowMultiple")
+      .internal()
+      .doc("When true, multiple stateful operators are allowed to be present in a streaming " +
+        "pipeline. The support for multiple stateful operators introduces a minor (semantically " +
+        "correct) change in respect to late record filtering - late records are detected and " +
+        "filtered in respect to the watermark from the previous microbatch instead of the " +
+        "current one. This is a behavior change for Spark streaming pipelines and we allow " +

Review Comment:
  Yes, we will have a bunch of errors in test suites when we disable no-data batch. Most test cases are assuming that no-data batch always happens.
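A rough model of the late-record filtering that the config doc above describes (illustrative names only, not Spark internals): a stateful operator drops input rows whose event time is below its watermark, and the semantic change is only about which watermark is used for the comparison, the one carried over from the previous microbatch or the one recomputed in the current microbatch.

```scala
// Minimal sketch of watermark-based late-record filtering.
object LateFilterSketch {
  // Drop events strictly older than the given watermark; also report how many
  // were dropped (cf. the numRowsDroppedByWatermark metric).
  def filterLate(events: Seq[Long], watermark: Long): (Seq[Long], Int) = {
    val (kept, dropped) = events.partition(_ >= watermark)
    (kept, dropped.size)
  }

  def main(args: Array[String]): Unit = {
    val batch = Seq(8L, 12L, 30L)
    // Filtering against the previous microbatch's watermark (say 10) keeps 12;
    // filtering against the watermark recomputed in the current microbatch
    // (say 25) would also drop it. That difference is the (semantically
    // correct) behavior change the config doc calls out.
    println(filterLate(batch, 10L)) // (List(12, 30),1)
    println(filterLate(batch, 25L)) // (List(30),2)
  }
}
```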
[GitHub] [spark] HeartSaVioR commented on a diff in pull request #38405: [SPARK-40925][SQL][SS] Fix stateful operator late record filtering
HeartSaVioR commented on code in PR #38405: URL: https://github.com/apache/spark/pull/38405#discussion_r1007383463

## sql/core/src/test/scala/org/apache/spark/sql/streaming/MultiStatefulOperatorsSuite.scala:
## @@ -0,0 +1,400 @@
(quoted diff elided: verbatim repeat of the MultiStatefulOperatorsSuite context quoted above)

Review Comment:
  I meant adding a new batch (or existing batch) for testing late events.
[GitHub] [spark] alex-balikov commented on a diff in pull request #38405: [SPARK-40925][SQL][SS] Fix stateful operator late record filtering
alex-balikov commented on code in PR #38405: URL: https://github.com/apache/spark/pull/38405#discussion_r1007369668

## sql/core/src/test/scala/org/apache/spark/sql/streaming/MultiStatefulOperatorsSuite.scala:
## @@ -0,0 +1,400 @@
(quoted diff elided: verbatim repeat of the MultiStatefulOperatorsSuite context quoted above; the review comment text is truncated in the archive)
[GitHub] [spark] dongjoon-hyun closed pull request #38369: [SPARK-40895][BUILD] Upgrade arrow to 10.0.0
dongjoon-hyun closed pull request #38369: [SPARK-40895][BUILD] Upgrade arrow to 10.0.0 URL: https://github.com/apache/spark/pull/38369
[GitHub] [spark] dongjoon-hyun commented on pull request #38417: [SPARK-40941][K8S] Use Java 17 in K8s Dockerfile by default and remove `Dockerfile.java17`
dongjoon-hyun commented on PR #38417: URL: https://github.com/apache/spark/pull/38417#issuecomment-1294047475

cc @Yikun
[GitHub] [spark] alex-balikov commented on a diff in pull request #38405: [SPARK-40925][SQL][SS] Fix stateful operator late record filtering
alex-balikov commented on code in PR #38405: URL: https://github.com/apache/spark/pull/38405#discussion_r1007348475

## sql/core/src/test/scala/org/apache/spark/sql/streaming/MultiStatefulOperatorsSuite.scala:
## @@ -0,0 +1,400 @@
(quoted diff elided: verbatim repeat of the MultiStatefulOperatorsSuite context quoted above; the review comment text is truncated in the archive)
[GitHub] [spark] dongjoon-hyun opened a new pull request, #38417: [SPARK-40941][K8S] Use Java 17 in K8s Dockerfile by default and remove `Dockerfile.java17`
dongjoon-hyun opened a new pull request, #38417: URL: https://github.com/apache/spark/pull/38417 …

### What changes were proposed in this pull request?

### Why are the changes needed?

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?
[GitHub] [spark] alex-balikov commented on a diff in pull request #38405: [SPARK-40925][SQL][SS] Fix stateful operator late record filtering
alex-balikov commented on code in PR #38405: URL: https://github.com/apache/spark/pull/38405#discussion_r1007325252

## sql/core/src/test/scala/org/apache/spark/sql/streaming/MultiStatefulOperatorsSuite.scala:
## @@ -0,0 +1,400 @@
(quoted diff elided: verbatim repeat of the MultiStatefulOperatorsSuite context quoted above)

Review Comment:
  Good catch, thanks
[GitHub] [spark] amaliujia commented on a diff in pull request #38406: [SPARK-40926][CONNECT] Refactor server side tests to only use DataFrame API
amaliujia commented on code in PR #38406: URL: https://github.com/apache/spark/pull/38406#discussion_r1007318294

## connector/connect/src/test/scala/org/apache/spark/sql/connect/planner/SparkConnectProtoSuite.scala:
## @@ -215,4 +180,16 @@ class SparkConnectProtoSuite extends PlanTest with SparkConnectPlanTest {
     }
     proto.Relation.newBuilder().setLocalRelation(localRelationBuilder.build()).build()
   }
+
+  private def analyzePlan(plan: LogicalPlan): LogicalPlan = {
+    val connectAnalyzed = analysis.SimpleAnalyzer.execute(plan)
+    analysis.SimpleAnalyzer.checkAnalysis(connectAnalyzed)
+    EliminateSubqueryAliases(connectAnalyzed)

Review Comment:
  hmmm this is what I borrowed from https://github.com/apache/spark/blob/c50d865fa9eb5207bc8c9992e37843412ec0cbc3/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/dsl/package.scala#L513

  We are using this Catalyst DSL analyze call already before this refactoring.
[GitHub] [spark] alex-balikov commented on a diff in pull request #38405: [SPARK-40925][SQL][SS] Fix stateful operator late record filtering
alex-balikov commented on code in PR #38405: URL: https://github.com/apache/spark/pull/38405#discussion_r1007234930

## sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala:
(quoted diff elided: verbatim repeat of the STATEFUL_OPERATOR_ALLOW_MULTIPLE context quoted above)

Review Comment:
  Currently it is the same watermark passed to all operators. The issue is that if anyone has unit tests which check exactly which records are filtered, with carefully constructed batches and Trigger.Once, such tests can detect the change in behavior and fail.
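The "same watermark passed to all operators" point rests on how the global watermark advances between microbatches. A tiny model of that rule (illustrative only; Spark's real bookkeeping is more involved): the watermark is the maximum event time observed so far minus the allowed delay, and it never regresses, which is why a no-data batch can run with an advanced watermark even though no new rows arrived.

```scala
// Minimal sketch of per-microbatch watermark advancement.
object WatermarkSketch {
  // Watermark after a microbatch: max event time seen in the batch minus the
  // allowed delay, never moving backwards; unchanged if the batch was empty.
  def advance(current: Long, batchEventTimes: Seq[Long], delaySeconds: Long): Long =
    if (batchEventTimes.isEmpty) current
    else math.max(current, batchEventTimes.max - delaySeconds)

  def main(args: Array[String]): Unit = {
    // Mirrors the first test above: with a "0 seconds" delay, the batch of
    // events 10..21 moves the watermark from 0 to 21, and the subsequent
    // no-data batch keeps it at 21.
    val afterBatch1 = advance(0L, 10L to 21L, 0L)
    val afterNoData = advance(afterBatch1, Seq.empty, 0L)
    println((afterBatch1, afterNoData)) // (21,21)
  }
}
```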
[GitHub] [spark] AmplabJenkins commented on pull request #38404: Replace Where
AmplabJenkins commented on PR #38404: URL: https://github.com/apache/spark/pull/38404#issuecomment-1293998786

Can one of the admins verify this patch?
[GitHub] [spark] AmplabJenkins commented on pull request #38405: [SPARK-40925][SQL][SS] Fix stateful operator late record filtering
AmplabJenkins commented on PR #38405: URL: https://github.com/apache/spark/pull/38405#issuecomment-1293998716

Can one of the admins verify this patch?
[GitHub] [spark] AmplabJenkins commented on pull request #38406: [SPARK-40926][CONNECT] Refactor server side tests to only use DataFrame API
AmplabJenkins commented on PR #38406: URL: https://github.com/apache/spark/pull/38406#issuecomment-1293998669

Can one of the admins verify this patch?
[GitHub] [spark] AmplabJenkins commented on pull request #38409: [SPARK-40930][CONNECT] Support Collect() in Python client
AmplabJenkins commented on PR #38409: URL: https://github.com/apache/spark/pull/38409#issuecomment-1293998625 Can one of the admins verify this patch?
[GitHub] [spark] AmplabJenkins commented on pull request #38410: [SPARK-40932][CORE] Fix issue messages for allGather are overridden
AmplabJenkins commented on PR #38410: URL: https://github.com/apache/spark/pull/38410#issuecomment-1293998577 Can one of the admins verify this patch?
[GitHub] [spark] vitaliili-db commented on pull request #38416: [SPARK-40924][SQL][3.3] Fix for Unhex when input has odd number of symbols
vitaliili-db commented on PR #38416: URL: https://github.com/apache/spark/pull/38416#issuecomment-1293889541 @MaxGekk backported, please take a look
[GitHub] [spark] vitaliili-db opened a new pull request, #38416: [SPARK-40924][SQL][3.3] Fix for Unhex when input has odd number of symbols
vitaliili-db opened a new pull request, #38416: URL: https://github.com/apache/spark/pull/38416

### What changes were proposed in this pull request?
Fix for a bug in the Unhex function when there is an odd number of symbols in the input string. This is a backport of #38402.

### Why are the changes needed?
The Unhex function, and other functions depending on it (e.g. ToBinary), produce incorrect output.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Unit tests.
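To illustrate the kind of odd-length handling the PR description refers to, here is a minimal Python sketch of an `unhex`-style helper. This is an illustrative sketch only, not Spark's actual implementation: the choice to left-pad odd-length input with `'0'` and to return `None` (standing in for SQL NULL) on non-hex input is an assumption made for the example.

```python
def unhex(s: str):
    """Illustrative unhex: decode a hex string to bytes.

    Odd-length input is left-padded with '0' so that, e.g.,
    'A1B' is read as '0A1B' (bytes 0x0A, 0x1B).  Non-hex input
    yields None, standing in for SQL NULL.  Both behaviors are
    assumptions for illustration, not Spark's exact semantics.
    """
    if len(s) % 2 == 1:
        s = "0" + s
    try:
        return bytes.fromhex(s)
    except ValueError:
        return None
```

For example, `unhex("4D7953514C")` decodes to `b"MySQL"`, while an odd-length input such as `"A1B"` decodes as if it were `"0A1B"`.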
[GitHub] [spark] amaliujia commented on a diff in pull request #38406: [SPARK-40926][CONNECT] Refactor server side tests to only use DataFrame API
amaliujia commented on code in PR #38406: URL: https://github.com/apache/spark/pull/38406#discussion_r1007207351 ## connector/connect/src/main/scala/org/apache/spark/sql/connect/planner/SparkConnectPlanner.scala: ## @@ -67,7 +67,7 @@ class SparkConnectPlanner(plan: proto.Relation, session: SparkSession) { case proto.Relation.RelTypeCase.AGGREGATE => transformAggregate(rel.getAggregate) case proto.Relation.RelTypeCase.SQL => transformSql(rel.getSql) case proto.Relation.RelTypeCase.LOCAL_RELATION => -transformLocalRelation(rel.getLocalRelation) +transformLocalRelation(rel.getLocalRelation, common) Review Comment: I sent a PR for this topic (to avoid complicating the current refactoring PR too much): https://github.com/apache/spark/pull/38415
[GitHub] [spark] amaliujia opened a new pull request, #38415: [SPARK-40938][CONNECT] Support Alias for every type of Relation
amaliujia opened a new pull request, #38415: URL: https://github.com/apache/spark/pull/38415

### What changes were proposed in this pull request?
In the past, the Connect server could only handle `alias` for `Read` and `Project`. However, in the Spark DataFrame API every DataFrame can be chained with `as(alias: String)`, so every Relation/LogicalPlan can carry an `alias`. This PR refactors the planner to make that work.

### Why are the changes needed?
Improve API coverage.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
UT
[GitHub] [spark] amaliujia commented on pull request #38415: [SPARK-40938][CONNECT] Support Alias for every type of Relation
amaliujia commented on PR #38415: URL: https://github.com/apache/spark/pull/38415#issuecomment-1293887655 R: @cloud-fan
[GitHub] [spark] amaliujia commented on a diff in pull request #38395: [SPARK-40917][SQL] Add a dedicated logical plan for `Summary`
amaliujia commented on code in PR #38395: URL: https://github.com/apache/spark/pull/38395#discussion_r1007192022 ## sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/basicLogicalOperators.scala: ## @@ -2100,3 +2100,53 @@ object AsOfJoin { } } } + + +/** + * A logical plan for summary. + */ +case class Summary( Review Comment: > but if there is something wrong in analysis, the exception will not be thrown. @zhengruifeng can you elaborate on this a bit more? Is this a regression in terms of bad cases (e.g. users expect exceptions, but now they are gone)?
[GitHub] [spark] dongjoon-hyun commented on pull request #38412: [SPARK-40935][BUILD] Upgrade zstd-jni to 1.5.2-5
dongjoon-hyun commented on PR #38412: URL: https://github.com/apache/spark/pull/38412#issuecomment-1293828092 Thank you, @LuciferYang , @HyukjinKwon , @singhpk234 .
[GitHub] [spark] dongjoon-hyun closed pull request #38412: [SPARK-40935][BUILD] Upgrade zstd-jni to 1.5.2-5
dongjoon-hyun closed pull request #38412: [SPARK-40935][BUILD] Upgrade zstd-jni to 1.5.2-5 URL: https://github.com/apache/spark/pull/38412
[GitHub] [spark] dongjoon-hyun commented on pull request #38352: [SPARK-40801][BUILD][3.2] Upgrade `Apache commons-text` to 1.10
dongjoon-hyun commented on PR #38352: URL: https://github.com/apache/spark/pull/38352#issuecomment-1293774559 > Could you make the same patch for 3.1 branch? No, Apache Spark 3.1 reached EOL last month because the first release was March 2, 2021.
[GitHub] [spark] vitas commented on pull request #38352: [SPARK-40801][BUILD][3.2] Upgrade `Apache commons-text` to 1.10
vitas commented on PR #38352: URL: https://github.com/apache/spark/pull/38352#issuecomment-1293772433 Could you make the same patch for 3.1 branch?
[GitHub] [spark] LuciferYang commented on pull request #38413: [SPARK-40936][SQL][TESTS] Remove outer conditions to simplify `AnalysisTest#assertAnalysisErrorClass` method
LuciferYang commented on PR #38413: URL: https://github.com/apache/spark/pull/38413#issuecomment-1293745767 > assertAnalysisErrorClass Sounds good, let me try. I'll set this to draft first and ping you when it is ready for review @MaxGekk
[GitHub] [spark] bjornjorgensen commented on pull request #38352: [SPARK-40801][BUILD][3.2] Upgrade `Apache commons-text` to 1.10
bjornjorgensen commented on PR #38352: URL: https://github.com/apache/spark/pull/38352#issuecomment-1293732028 @xinrong-meng There are two tests that don't work for branch 3.2. Both are Python tests; can you have a look at them?
[GitHub] [spark] MaxGekk commented on pull request #38413: [SPARK-40936][SQL][TESTS] Remove outer conditions to simplify `AnalysisTest#assertAnalysisErrorClass` method
MaxGekk commented on PR #38413: URL: https://github.com/apache/spark/pull/38413#issuecomment-129372 I wonder why we need `assertAnalysisErrorClass()` at all; `checkError` does the same job. It seems `assertAnalysisErrorClass()` additionally checks case sensitivity (which can be done explicitly in a test if it is really needed) plus line and position (we should check the query context instead of that, I guess). Let's consider invoking `checkError` inside `assertAnalysisErrorClass()`, or removing it completely and invoking `checkError()` directly in tests as we do elsewhere.
[GitHub] [spark] ljfgem commented on a diff in pull request #35636: [SPARK-31357][SQL][WIP] Catalog API for view metadata
ljfgem commented on code in PR #35636: URL: https://github.com/apache/spark/pull/35636#discussion_r1007065247 ## sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/V2ViewDescription.scala: ## @@ -0,0 +1,55 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.sql.catalyst.analysis + +import scala.collection.JavaConverters._ + +import org.apache.spark.sql.catalyst.plans.logical.ViewDescription +import org.apache.spark.sql.connector.catalog.{View, ViewCatalog} +import org.apache.spark.sql.types.StructType + +/** + * View description backed by a View in V2 catalog. + * + * @param view a view in V2 catalog + */ +case class V2ViewDescription( +override val identifier: String, +view: View) extends ViewDescription { + + override val schema: StructType = view.schema + + override val viewText: Option[String] = None Review Comment: Isn't it supposed to be `Option(view.sql)`? Or how does it differ from the `val sql` below? I think we just need to keep one for callers?
[GitHub] [spark] bjornjorgensen commented on pull request #38352: [SPARK-40801][BUILD][3.2] Upgrade `Apache commons-text` to 1.10
bjornjorgensen commented on PR #38352: URL: https://github.com/apache/spark/pull/38352#issuecomment-1293716740 @srowen the tests have been re-run, but the result is the same. It's the same with everyone else's PRs, and has been for a long time. Like this one: https://github.com/LuciferYang/spark/actions/runs/3309640149
[GitHub] [spark] SandishKumarHN commented on a diff in pull request #38344: [SPARK-40777][SQL][PROTOBUF] Protobuf import support and move error-classes.
SandishKumarHN commented on code in PR #38344: URL: https://github.com/apache/spark/pull/38344#discussion_r1007058916 ## connector/protobuf/src/main/scala/org/apache/spark/sql/protobuf/ProtobufDataToCatalyst.scala: ## @@ -98,13 +92,9 @@ private[protobuf] case class ProtobufDataToCatalyst( case PermissiveMode => nullResultRow case FailFastMode => -throw new SparkException( - "Malformed records are detected in record parsing. " + -s"Current parse Mode: ${FailFastMode.name}. To process malformed records as null " + -"result, try setting the option 'mode' as 'PERMISSIVE'.", - e) +throw QueryCompilationErrors.malformedRecordsDetectedInRecordParsingError(e) case _ => -throw new AnalysisException(unacceptableModeMessage(parseMode.name)) +throw QueryCompilationErrors.parseModeUnsupportedError("from_protobuf", parseMode) Review Comment: @MaxGekk fixed.
[GitHub] [spark] wangyeweikuer closed pull request #38414: Review from master
wangyeweikuer closed pull request #38414: Review from master URL: https://github.com/apache/spark/pull/38414
[GitHub] [spark] wangyeweikuer opened a new pull request, #38414: Review from master
wangyeweikuer opened a new pull request, #38414: URL: https://github.com/apache/spark/pull/38414

### What changes were proposed in this pull request?

### Why are the changes needed?

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?
[GitHub] [spark] awdavidson commented on pull request #38312: [SPARK-40819][SQL] Timestamp nanos behaviour regression
awdavidson commented on PR #38312: URL: https://github.com/apache/spark/pull/38312#issuecomment-1293507848 @cloud-fan @LuciferYang any update/response regarding this?
[GitHub] [spark] bjornjorgensen commented on pull request #38352: [SPARK-40801][BUILD][3.2] Upgrade `Apache commons-text` to 1.10
bjornjorgensen commented on PR #38352: URL: https://github.com/apache/spark/pull/38352#issuecomment-1293504608 OK, I'm re-running the tests now.
[GitHub] [spark] LuciferYang commented on a diff in pull request #38413: [SPARK-40936][SQL][TESTS] Remove outer conditions to simplify `AnalysisTest#assertAnalysisErrorClass` method
LuciferYang commented on code in PR #38413: URL: https://github.com/apache/spark/pull/38413#discussion_r1006825185 ## sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/analysis/AnalysisTest.scala: ## @@ -183,30 +185,28 @@ trait AnalysisTest extends PlanTest { analyzer.checkAnalysis(analyzer.execute(inputPlan)) } - if (e.getErrorClass != expectedErrorClass || - e.messageParameters != expectedMessageParameters || - (line >= 0 && e.line.getOrElse(-1) != line) || - (pos >= 0) && e.startPosition.getOrElse(-1) != pos) { Review Comment: It seems that the last condition was wrong? I think ``` (pos >= 0) && e.startPosition.getOrElse(-1) != pos ``` Should be ``` (pos >= 0 && e.startPosition.getOrElse(-1) != pos) ```
[GitHub] [spark] LuciferYang commented on pull request #38413: [SPARK-40936][SQL][TESTS] Remove outer conditions to simplify `AnalysisTest#assertAnalysisErrorClass` method
LuciferYang commented on PR #38413: URL: https://github.com/apache/spark/pull/38413#issuecomment-1293477815 cc @MaxGekk @HyukjinKwon @dongjoon-hyun On second thought, does this refactor look simpler?