[GitHub] [spark] aokolnychyi opened a new pull request, #36304: [SPARK-38959][SQL] DS V2: Support runtime group filtering in row-level commands
aokolnychyi opened a new pull request, #36304: URL: https://github.com/apache/spark/pull/36304

### What changes were proposed in this pull request?
This PR adds runtime group filtering for group-based row-level operations.

### Why are the changes needed?
These changes are needed to avoid rewriting unnecessary groups, as the data skipping performed during job planning is limited and can still report false-positive groups to rewrite.

### Does this PR introduce _any_ user-facing change?
This PR extends `RowLevelOperation`, but the changes are backward compatible.

### How was this patch tested?
This PR comes with tests.

-- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [spark] cloud-fan commented on a diff in pull request #37994: [SPARK-40454][CONNECT] Initial DSL framework for protobuf testing
cloud-fan commented on code in PR #37994: URL: https://github.com/apache/spark/pull/37994#discussion_r980610466

connect/src/main/scala/org/apache/spark/sql/connect/package.scala:

```diff
@@ -0,0 +1,98 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.spark.sql
```

Review Comment: The DSL should not be a public API. Can we move it to `org.apache.spark.sql.catalyst`, which is a private package?
[GitHub] [spark] github-actions[bot] commented on pull request #35608: [SPARK-32838][SQL] Static partition overwrite could use staging dir insert
github-actions[bot] commented on PR #35608: URL: https://github.com/apache/spark/pull/35608#issuecomment-1258818178 We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable. If you'd like to revive this PR, please reopen it and ask a committer to remove the Stale tag!
[GitHub] [spark] cloud-fan closed pull request #37679: [SPARK-35242][SQL] Support changing session catalog's default database
cloud-fan closed pull request #37679: [SPARK-35242][SQL] Support changing session catalog's default database URL: https://github.com/apache/spark/pull/37679
[GitHub] [spark] github-actions[bot] commented on pull request #35734: [SPARK-32432][SQL] Add support for reading ORC/Parquet files of SymlinkTextInputFormat table And Fix Analyze for SymlinkTextInput
github-actions[bot] commented on PR #35734: URL: https://github.com/apache/spark/pull/35734#issuecomment-1258818146 We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable. If you'd like to revive this PR, please reopen it and ask a committer to remove the Stale tag!
[GitHub] [spark] github-actions[bot] commented on pull request #35638: [SPARK-38296][SQL] Support error class AnalysisExceptions in FunctionRegistry
github-actions[bot] commented on PR #35638: URL: https://github.com/apache/spark/pull/35638#issuecomment-1258818165 We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable. If you'd like to revive this PR, please reopen it and ask a committer to remove the Stale tag!
[GitHub] [spark] github-actions[bot] commented on pull request #35748: [SPARK-38431][SQL]Support to delete matched rows from jdbc tables
github-actions[bot] commented on PR #35748: URL: https://github.com/apache/spark/pull/35748#issuecomment-1258818118 We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable. If you'd like to revive this PR, please reopen it and ask a committer to remove the Stale tag!
[GitHub] [spark] github-actions[bot] commented on pull request #35744: [SPARK-37383][SQL][WEBUI]Show the parsing time for each phase of a SQL on spark ui
github-actions[bot] commented on PR #35744: URL: https://github.com/apache/spark/pull/35744#issuecomment-1258818132 We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable. If you'd like to revive this PR, please reopen it and ask a committer to remove the Stale tag!
[GitHub] [spark] github-actions[bot] commented on pull request #35594: [SPARK-38270][SQL] Spark SQL CLI's AM should keep same exit code with client side
github-actions[bot] commented on PR #35594: URL: https://github.com/apache/spark/pull/35594#issuecomment-1258818196 We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable. If you'd like to revive this PR, please reopen it and ask a committer to remove the Stale tag!
[GitHub] [spark] zhengruifeng commented on pull request #37998: [SPARK-40561][PS] Implement `min_count` in `GroupBy.min`
zhengruifeng commented on PR #37998: URL: https://github.com/apache/spark/pull/37998#issuecomment-1258863348 Merged into master, thanks @HyukjinKwon for review
[GitHub] [spark] HyukjinKwon commented on a diff in pull request #38008: [SPARK-40571][SS][TESTS] Construct a new test case for applyInPandasWithState to verify fault-tolerance semantic with random py
HyukjinKwon commented on code in PR #38008: URL: https://github.com/apache/spark/pull/38008#discussion_r98058

python/pyspark/sql/tests/test_pandas_grouped_map_with_state.py:

```diff
@@ -90,6 +107,99 @@
         def check_results(batch_df, _):
             self.assertTrue(q.isActive)
             q.processAllAvailable()

+    def test_apply_in_pandas_with_state_python_worker_random_failure(self):
+        output_path = tempfile.mkdtemp()
+        checkpoint_loc = tempfile.mkdtemp()
+        shutil.rmtree(output_path)
+        shutil.rmtree(checkpoint_loc)
+
+        def run_query():
+            df = self.spark.readStream.format("text") \
+                .option("maxFilesPerTrigger", "1") \
+                .load(self.base_path + "/random_failure/input")
+
+            for q in self.spark.streams.active:
+                q.stop()
+            self.assertTrue(df.isStreaming)
+
+            output_type = StructType(
+                [StructField("value", StringType()), StructField("count", LongType())]
+            )
+            state_type = StructType([StructField("cnt", LongType())])
+
+            def func(key, pdf_iter, state):
+                assert isinstance(state, GroupState)
+
+                # should be huge enough to not trigger kill in every batch
+                # but should also be reasonable to trigger kill multiple times across batches
+                if random.randrange(300) == 1:
+                    sys.exit(1)
+
+                count = state.getOption
+                if count is None:
+                    count = 0
+                else:
+                    count = count[0]
+
+                for pdf in pdf_iter:
+                    count += len(pdf)
+
+                state.update((count,))
+                yield pd.DataFrame({"value": [key[0]], "count": [count]})
+
+            q = (
+                df.groupBy(df["value"])
+                .applyInPandasWithState(
+                    func, output_type, state_type, "Append", GroupStateTimeout.NoTimeout
+                )
+                .writeStream.queryName("this_query")
+                .format("json")
+                .outputMode("append")
+                .option("path", output_path)
+                .option("checkpointLocation", checkpoint_loc)
+                .start()
+            )
+
+            return q
+
+        q = run_query()
+
+        self.assertEqual(q.name, "this_query")
+        self.assertTrue(q.isActive)
+
+        # expected_output directory is constructed from below query:
+        # spark.read.format("text").load("./input").groupBy("value").count() \
+        #     .repartition(1).sort("value").write.format("json").save("./output")
+        expected = \
+            self.spark.read.schema("value string, count int").format("json") \
+            .load(self.base_path + "/random_failure/expected_output") \
+            .sort("value").collect()
```

Review Comment: This style is recommended by PEP 8 IIRC

```suggestion
        expected = (
            self.spark.read.schema("value string, count int").format("json")
            .load(self.base_path + "/random_failure/expected_output")
            .sort("value").collect()
        )
```
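The parenthesized style suggested in the review avoids backslash continuations entirely; PEP 8 prefers implicit line continuation inside brackets. A small standalone illustration (plain strings, no Spark involved):

```python
text = "  Hello, World  "

# Backslash continuations (discouraged by PEP 8: fragile, break with trailing spaces):
cleaned_backslash = text.strip() \
    .lower() \
    .replace(",", "")

# Implicit continuation inside parentheses (preferred: whitespace-safe, easy to reorder):
cleaned_parens = (
    text.strip()
    .lower()
    .replace(",", "")
)

assert cleaned_backslash == cleaned_parens
print(cleaned_parens)  # hello world
```

Both forms compile to the same chained calls; the parenthesized version simply lets the parser join the lines.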
[GitHub] [spark] HyukjinKwon commented on a diff in pull request #38006: [SPARK-40536][CONNECT] Make Spark Connect port configurable
HyukjinKwon commented on code in PR #38006: URL: https://github.com/apache/spark/pull/38006#discussion_r980669582

core/src/main/scala/org/apache/spark/internal/config/Connect.scala:

```diff
@@ -0,0 +1,33 @@
+/* ... Apache License header ... */
+
+package org.apache.spark.internal.config
+
+private[spark] object Connect {
```

Review Comment: Yeah, it should better be placed within `connect` if possible. How do we use this plugin? I suspect the jar should be provided anyway (?)
[GitHub] [spark] aokolnychyi opened a new pull request, #38004: [SPARK-40551][SQL] DataSource V2: Add APIs for delta-based row-level operations
aokolnychyi opened a new pull request, #38004: URL: https://github.com/apache/spark/pull/38004

### What changes were proposed in this pull request?
This PR adds DS v2 APIs for handling row-level operations for data sources that support deltas of rows.

### Why are the changes needed?
These changes are part of the approved SPIP in SPARK-35801.

### Does this PR introduce _any_ user-facing change?
Yes, this PR adds new DS v2 APIs per [design doc](https://docs.google.com/document/d/12Ywmc47j3l2WF4anG5vL4qlrhT2OKigb7_EbIKhxg60).

### How was this patch tested?
Tests will be part of the implementation PR.
[GitHub] [spark] amaliujia opened a new pull request, #38006: [SPARK-40536] Make Spark Connect port configurable
amaliujia opened a new pull request, #38006: URL: https://github.com/apache/spark/pull/38006

### What changes were proposed in this pull request?
Add a `Connect` config object and two Spark Connect gRPC config keys:
1. `spark.connect.grpc.debug.enabled` (Boolean)
2. `spark.connect.grpc.binding.port` (Int)

### Why are the changes needed?
Currently:
1. the Spark Connect gRPC port is hardcoded;
2. the config key `spark.connect.grpc.debug.enabled` is not defined.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Existing UT
[GitHub] [spark] cloud-fan commented on a diff in pull request #37994: [SPARK-40454][CONNECT] Initial DSL framework for protobuf testing
cloud-fan commented on code in PR #37994: URL: https://github.com/apache/spark/pull/37994#discussion_r980615123

connect/src/main/scala/org/apache/spark/sql/connect/package.scala:

```diff
@@ -0,0 +1,98 @@
+/* ... Apache License header ... */
+package org.apache.spark.sql
```

Review Comment: BTW, I don't think we have to make it available in spark-shell. The catalyst DSL was added before we had the DataFrame API. At that time, it was the only way for end-users to build a query plan with a DataFrame-like API.
[GitHub] [spark] Kimahriman commented on a diff in pull request #38003: [SPARK-40565][SQL] Don't push non-deterministic filters to V2 file sources
Kimahriman commented on code in PR #38003: URL: https://github.com/apache/spark/pull/38003#discussion_r980631860

sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/PruneFileSourcePartitionsSuite.scala:

```diff
@@ -140,6 +140,24 @@ class PruneFileSourcePartitionsSuite extends PrunePartitionSuiteBase with Shared
     }
   }
 
+  test("SPARK-40565: don't push down non-deterministic filters for V2 file sources") {
```

Review Comment: So it looks like that's because of the way the "number of partitions" is calculated for V2 reads at https://github.com/apache/spark/blob/697574c2d5014626d2c960b222b41a8da7c702bf/sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/PruneFileSourcePartitionsSuite.scala#L168. It uses `BatchScanExec.partitions.size`, which is the number of RDD partitions, _not_ file partitions. I can see if I can fix that, but I don't see an easy way: unlike `FileSourceScanExec.selectedPartitions`, there is no intermediate variable to get this from a batch scan. We might just have to compute the unique partition values from all files to get the actual number of file partitions. If you change the 2 to 3 for the tests, it passes for V1.
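As a hypothetical illustration of the "compute the unique partition values from all files" idea (this helper is not part of Spark's API, and the paths are made up), one could derive the distinct Hive-style partition directories from the scanned file paths:

```python
import re

def count_file_partitions(file_paths):
    """Count distinct partition-value tuples among file paths.

    Illustrative only: extracts Hive-style `key=value` path components
    (e.g. .../p=1/part-000.parquet) and counts the unique tuples.
    """
    partitions = set()
    for path in file_paths:
        parts = tuple(p for p in path.split("/") if re.fullmatch(r"[^=/]+=[^=/]+", p))
        partitions.add(parts)
    return len(partitions)

files = [
    "/data/tbl/p=0/part-000.parquet",
    "/data/tbl/p=0/part-001.parquet",
    "/data/tbl/p=1/part-000.parquet",
    "/data/tbl/p=2/part-000.parquet",
]
print(count_file_partitions(files))  # 3
```

This counts file partitions independently of how the RDD partitions happen to be coalesced, which is the distinction the comment draws against `BatchScanExec.partitions.size`.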
[GitHub] [spark] attilapiros commented on a diff in pull request #37990: [WIP][SPARK-40458][K8S] Bump Kubernetes Client Version to 6.1.1
attilapiros commented on code in PR #37990: URL: https://github.com/apache/spark/pull/37990#discussion_r980633310

resource-managers/kubernetes/core/src/test/scala/org/apache/spark/scheduler/cluster/k8s/ExecutorPodsLifecycleManagerSuite.scala:

```diff
@@ -109,6 +116,12 @@ class ExecutorPodsLifecycleManagerSuite extends SparkFunSuite with BeforeAndAfte
     verify(schedulerBackend).doRemoveExecutor("1", expectedLossReason)
   }
 
+  test("test executor inactivation function") {
```

Review Comment: Ok.
[GitHub] [spark] attilapiros commented on pull request #37990: [WIP][SPARK-40458][K8S] Bump Kubernetes Client Version to 6.1.1
attilapiros commented on PR #37990: URL: https://github.com/apache/spark/pull/37990#issuecomment-1258851049

> Although 6.1.1 is intrusive, this patch looks solid. Is there any other reason why this is still WIP, @attilapiros ?

@dongjoon-hyun I just executed some manual tests regarding spark-submit/application management (and actually found a bug and corrected it). The test steps are in the description.

Started an app in a non-default namespace (`bla`):

```shell
./bin/spark-submit \
  --master k8s://http://127.0.0.1:8001 \
  --deploy-mode cluster \
  --name spark-pi \
  --class org.apache.spark.examples.SparkPi \
  --conf spark.executor.instances=5 \
  --conf spark.kubernetes.namespace=bla \
  --conf spark.kubernetes.container.image=docker.io/kubespark/spark:3.4.0-SNAPSHOT_2EA46AC6-0613-4BA1-BA76-130F5C0DAB15 \
  local:///opt/spark/examples/jars/spark-examples_2.12-3.4.0-SNAPSHOT.jar 20
```

Then checking the status by using the regexp:

```
```
[GitHub] [spark] viirya commented on a diff in pull request #38001: [SPARK-40562][SQL] Add `spark.sql.legacy.groupingIdWithAppendedUserGroupBy`
viirya commented on code in PR #38001: URL: https://github.com/apache/spark/pull/38001#discussion_r980647185

sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala:

```diff
@@ -3574,6 +3574,15 @@ object SQLConf {
       .booleanConf
       .createWithDefault(false)
 
+  val LEGACY_GROUPING_ID_WITH_APPENDED_USER_GROUPBY =
+    buildConf("spark.sql.legacy.groupingIdWithAppendedUserGroupBy")
+      .internal()
+      .doc("When true, grouping_id() returns values based on grouping set columns plus " +
+        "user-given group-by expressions order.")
```

Review Comment: Do we need to mention when the behavior of grouping_id changed?
[GitHub] [spark] HyukjinKwon closed pull request #37993: [SPARK-40557][CONNECT] Update generated proto files for Spark Connect
HyukjinKwon closed pull request #37993: [SPARK-40557][CONNECT] Update generated proto files for Spark Connect URL: https://github.com/apache/spark/pull/37993
[GitHub] [spark] HyukjinKwon commented on pull request #37993: [SPARK-40557][CONNECT] Update generated proto files for Spark Connect
HyukjinKwon commented on PR #37993: URL: https://github.com/apache/spark/pull/37993#issuecomment-1258864096 Merged to master.
[GitHub] [spark] HyukjinKwon commented on a diff in pull request #38006: [SPARK-40536][CONNECT] Make Spark Connect port configurable
HyukjinKwon commented on code in PR #38006: URL: https://github.com/apache/spark/pull/38006#discussion_r980668935

core/src/main/scala/org/apache/spark/internal/config/Connect.scala:

```diff
@@ -0,0 +1,33 @@
+/* ... Apache License header ... */
+
+package org.apache.spark.internal.config
+
+private[spark] object Connect {
+
+  private[spark] val CONNECT_GRPC_DEBUG_MODE =
+    ConfigBuilder("spark.connect.grpc.debug.enabled")
+      .version("3.4.0")
+      .booleanConf
+      .createOptional
```

Review Comment: Why is it optional? Seems like it defaults to `true`.
[GitHub] [spark] aokolnychyi commented on pull request #36304: [SPARK-38959][SQL] DS V2: Support runtime group filtering in row-level commands
aokolnychyi commented on PR #36304: URL: https://github.com/apache/spark/pull/36304#issuecomment-1258618345

I want to resume working on this PR, but I need feedback on one point.

In the original implementation, @cloud-fan and I discussed supporting a separate scan builder for runtime group filtering in row-level operations. That way, we can prune columns and push down filters while looking for groups that have matches. We can't do that in the main row-level scan for group-based data sources, as non-matching records in matching groups have to be copied over. See PR #35395 for context. The only challenge is ensuring the same version of the table is scanned in the main row-level scan and in the scan that searches for matching groups to rewrite. There are multiple solutions to consider.

**Option 1**

The first option is shown in this PR. We can add a new method to `RowLevelOperation` that provides a scan builder for runtime group filtering.

```java
interface RowLevelOperation {
  // existing method
  ScanBuilder newScanBuilder(CaseInsensitiveStringMap options);

  // new method
  default ScanBuilder newAffectedGroupsScanBuilder(CaseInsensitiveStringMap options) {
    return newScanBuilder(options);
  }
  ...
}
```

Under this implementation, it is up to data sources to ensure the same version is scanned in both scans. It is a fairly simple approach, but it complicates the row-level API. On top of that, the new method is useless for data sources that can handle a delta of rows.

**Option 2**

The main row-level `Scan` can report a scanned `tableVersion`, and we can use that information to load the correct table version in the rule that assigns a runtime filter. This can be done via `TableCatalog$load(ident, version)`. The only API change is to extend `Scan` with `tableVersion` so we know which table version is being read in the main scan.

**Option 3**

The rule that assigns a runtime group filter has access to the original `Table` object. We could just call `newScanBuilder` on it. However, I don't see anything in the API implying that reusing the `Table` instance guarantees the same version of the table will be scanned. If we call `newScanBuilder` on the same `Table` instance, do we expect the same version to be scanned? Seems like that is NOT the assumption right now. If we can somehow benefit from reusing the `Table` object, it would be the cleanest option from the API perspective.

Any ideas how to make Option 3 work?

cc @cloud-fan @rdblue @huaxingao @dongjoon-hyun @sunchao @viirya
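To make the version-pinning idea in Option 2 concrete, here is a deliberately simplified sketch in plain Python (the classes and method names are hypothetical toys, not Spark's actual DS v2 API): the main scan records the table version it reads, and the runtime-filter rule reloads exactly that version before building the group-filter scan, so a concurrent commit between the two scans cannot cause them to diverge.

```python
class Table:
    def __init__(self, version, rows):
        self.version = version
        self.rows = rows

    def new_scan(self):
        # Option 2: the scan reports which table version it reads
        return Scan(self.version, self.rows)

class Scan:
    def __init__(self, table_version, rows):
        self.table_version = table_version
        self.rows = rows

class Catalog:
    """Toy versioned catalog: load() without a version sees the latest snapshot."""
    def __init__(self):
        self.versions = {}
        self.latest = 0

    def commit(self, rows):
        self.latest += 1
        self.versions[self.latest] = rows

    def load(self, version=None):
        v = self.latest if version is None else version
        return Table(v, self.versions[v])

catalog = Catalog()
catalog.commit(["a", "b"])
main_scan = catalog.load().new_scan()   # row-level operation plans against version 1
catalog.commit(["a", "b", "c"])         # a concurrent commit creates version 2
# The runtime-filter rule pins the version the main scan reported:
filter_scan = catalog.load(version=main_scan.table_version).new_scan()
assert filter_scan.table_version == main_scan.table_version == 1
```

Without the explicit `version=` pin, the second `load()` would see version 2, which is exactly the inconsistency Option 3 worries about.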
[GitHub] [spark] mridulm commented on pull request #37779: [wip][SPARK-40320][Core] Executor should exit when it failed to initialize for fatal error
mridulm commented on PR #37779: URL: https://github.com/apache/spark/pull/37779#issuecomment-1258716473

Added a few debug statements, and it became clear what the issue is. Essentially, since we are leveraging a `ThreadPoolExecutor`, an exception/error thrown in a task does not kill the thread - rather, `ThreadPoolExecutor.afterExecute` is called with the cause of the failure (see `runWorker` for details). We should override `afterExecute` and invoke our `uncaughtExceptionHandler` when an exception is thrown.

In `receiveLoopRunnable`, when a `Throwable` is thrown:
```
22/09/26 17:17:12 INFO DedicatedMessageLoop: Current exceptionHandler = org.apache.spark.util.SparkUncaughtExceptionHandler@27c71f14
22/09/26 17:17:12 INFO DedicatedMessageLoop: Thread = Thread[dispatcher-Executor,5,main]
22/09/26 17:17:12 INFO DedicatedMessageLoop: Stack ...
java.lang.Exception: For stack
	at org.apache.spark.rpc.netty.MessageLoop$$anon$1.run(MessageLoop.scala:56)
	at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515)
	at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
	at java.base/java.lang.Thread.run(Thread.java:829)
```

In `receiveLoopRunnable`'s `run`, when a `Throwable` is thrown:
```
22/09/26 17:17:12 INFO DedicatedMessageLoop: Thread = Thread[dispatcher-Executor,5,main], stackTrace =
22/09/26 17:17:12 INFO DedicatedMessageLoop: java.base@11.0.16/java.lang.Thread.dumpThreads(Native Method)
22/09/26 17:17:12 INFO DedicatedMessageLoop: java.base@11.0.16/java.lang.Thread.getAllStackTraces(Thread.java:1653)
22/09/26 17:17:12 INFO DedicatedMessageLoop: app//org.apache.spark.rpc.netty.MessageLoop.org$apache$spark$rpc$netty$MessageLoop$$dumpAllStackTraces(MessageLoop.scala:70)
22/09/26 17:17:12 INFO DedicatedMessageLoop: app//org.apache.spark.rpc.netty.MessageLoop$$anon$1.run(MessageLoop.scala:58)
22/09/26 17:17:12 INFO DedicatedMessageLoop: java.base@11.0.16/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515)
22/09/26 17:17:12 INFO DedicatedMessageLoop: java.base@11.0.16/java.util.concurrent.FutureTask.run(FutureTask.java:264)
22/09/26 17:17:12 INFO DedicatedMessageLoop: java.base@11.0.16/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
22/09/26 17:17:12 INFO DedicatedMessageLoop: java.base@11.0.16/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
```

and finally, a few seconds after the Executor `Inbox` failure - dumping all threads in a new thread:
```
22/09/26 17:17:14 INFO DedicatedMessageLoop: Thread = Thread[dispatcher-Executor,5,main], stackTrace =
22/09/26 17:17:14 INFO DedicatedMessageLoop: java.base@11.0.16/jdk.internal.misc.Unsafe.park(Native Method)
22/09/26 17:17:14 INFO DedicatedMessageLoop: java.base@11.0.16/java.util.concurrent.locks.LockSupport.park(LockSupport.java:194)
22/09/26 17:17:14 INFO DedicatedMessageLoop: java.base@11.0.16/java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2081)
22/09/26 17:17:14 INFO DedicatedMessageLoop: java.base@11.0.16/java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:433)
22/09/26 17:17:14 INFO DedicatedMessageLoop: app//org.apache.spark.rpc.netty.MessageLoop.org$apache$spark$rpc$netty$MessageLoop$$receiveLoop(MessageLoop.scala:102)
22/09/26 17:17:14 INFO DedicatedMessageLoop: app//org.apache.spark.rpc.netty.MessageLoop$$anon$1.run(MessageLoop.scala:45)
22/09/26 17:17:14 INFO DedicatedMessageLoop: java.base@11.0.16/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
22/09/26 17:17:14 INFO DedicatedMessageLoop: java.base@11.0.16/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
22/09/26 17:17:14 INFO DedicatedMessageLoop: java.base@11.0.16/java.lang.Thread.run(Thread.java:829)
```
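The fix described above - overriding `ThreadPoolExecutor.afterExecute` and forwarding task failures to an uncaught-exception handler - can be sketched as follows. This is a minimal illustration, not Spark's actual `MessageLoop` code; the class name `FailureReportingExecutor` is made up for the example. One subtlety visible in the stack traces above: tasks go through `Executors$RunnableAdapter` and `FutureTask`, i.e. they were submitted via `submit()`, so the throwable is captured inside the `FutureTask` and `afterExecute` receives `t == null`; it must be unwrapped from the `Future` (the pattern documented on `ThreadPoolExecutor.afterExecute`).

```java
import java.util.concurrent.*;
import java.util.concurrent.atomic.AtomicReference;

public class Main {
    // Sketch: a pool that reports task failures to an uncaught-exception
    // handler instead of letting ThreadPoolExecutor swallow them.
    static class FailureReportingExecutor extends ThreadPoolExecutor {
        private final Thread.UncaughtExceptionHandler handler;

        FailureReportingExecutor(Thread.UncaughtExceptionHandler handler) {
            super(1, 1, 0L, TimeUnit.MILLISECONDS, new LinkedBlockingQueue<Runnable>());
            this.handler = handler;
        }

        @Override
        protected void afterExecute(Runnable r, Throwable t) {
            super.afterExecute(r, t);
            // Tasks submitted via submit() are wrapped in a FutureTask, so the
            // throwable is stored there and t is null; unwrap it.
            if (t == null && r instanceof Future<?> && ((Future<?>) r).isDone()) {
                try {
                    ((Future<?>) r).get();
                } catch (ExecutionException e) {
                    t = e.getCause();
                } catch (InterruptedException | CancellationException e) {
                    t = e;
                }
            }
            if (t != null) {
                handler.uncaughtException(Thread.currentThread(), t);
            }
        }
    }

    public static void main(String[] args) throws Exception {
        AtomicReference<Throwable> seen = new AtomicReference<>();
        FailureReportingExecutor pool =
            new FailureReportingExecutor((thread, e) -> seen.set(e));
        // Simulate a fatal error thrown inside the message loop task.
        Runnable boom = () -> { throw new OutOfMemoryError("simulated fatal error"); };
        pool.submit(boom);
        pool.shutdown();
        pool.awaitTermination(10, TimeUnit.SECONDS);
        System.out.println(seen.get().getMessage());
    }
}
```

With the plain `ThreadPoolExecutor`, the error above would vanish into the unused `Future`; here it reaches the handler, which in Spark's case (`SparkUncaughtExceptionHandler`) would exit the executor on fatal errors.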
[GitHub] [spark] cloud-fan commented on a diff in pull request #37994: [SPARK-40454][CONNECT] Initial DSL framework for protobuf testing
cloud-fan commented on code in PR #37994: URL: https://github.com/apache/spark/pull/37994#discussion_r980614606 ## connect/src/main/scala/org/apache/spark/sql/connect/package.scala: ## @@ -0,0 +1,98 @@ [Apache License 2.0 header omitted] +package org.apache.spark.sql Review Comment: yup
[GitHub] [spark] cloud-fan commented on a diff in pull request #37994: [SPARK-40454][CONNECT] Initial DSL framework for protobuf testing
cloud-fan commented on code in PR #37994: URL: https://github.com/apache/spark/pull/37994#discussion_r980616391 ## connect/src/main/scala/org/apache/spark/sql/connect/package.scala: ## @@ -0,0 +1,98 @@ [Apache License 2.0 header omitted] +package org.apache.spark.sql + +import scala.collection.JavaConverters._ + +import org.apache.spark.connect.proto +import org.apache.spark.sql.catalyst.analysis.UnresolvedAlias +import org.apache.spark.sql.catalyst.expressions.{Expression, NamedExpression} + + +/** + * A collection of implicit conversions that create a DSL for constructing connect protos. + * + * {{{ + * scala> import org.apache.spark.sql.connect.plans.DslLogicalPlan + * + * // Standard way to construct connect proto + * scala> import org.apache.spark.connect.proto + * scala> :paste + * // Entering paste mode (ctrl-D to finish) + * val connectTestRelation = + *proto.Relation.newBuilder() + * .setRead( + * proto.Read.newBuilder() + * .setNamedTable(proto.Read.NamedTable.newBuilder().addParts("student").build()) + * .build()) + * .build() + * // Exiting paste mode, now interpreting. + *connectTestRelation: org.apache.spark.connect.proto.Relation = + * read { + * named_table { + *parts: "student" + * } + *} + * + * // Now we can apply select on the proto relation above + * scala> import org.apache.spark.sql.catalyst.analysis.UnresolvedAttribute + * scala> connectTestRelation.select(UnresolvedAttribute(Seq("id"))) + * res14: org.apache.spark.connect.proto.Relation = + *project { + * input { + *read { + * named_table { + *parts: "student" + * } + *} + * } + * expressions { + *unresolved_attribute { + * parts: "id" + *} + * } + *} + * + * }}} + * + */ +package object connect { + + object plans { // scalastyle:ignore +implicit class DslLogicalPlan(val logicalPlan: proto.Relation) { Review Comment: Does `proto.Relation` mean logical plan in the proto world?
[GitHub] [spark] zhengruifeng closed pull request #37998: [SPARK-40561][PS] Implement `min_count` in `GroupBy.min`
zhengruifeng closed pull request #37998: [SPARK-40561][PS] Implement `min_count` in `GroupBy.min` URL: https://github.com/apache/spark/pull/37998
[GitHub] [spark] HyukjinKwon commented on a diff in pull request #38006: [SPARK-40536][CONNECT] Make Spark Connect port configurable
HyukjinKwon commented on code in PR #38006: URL: https://github.com/apache/spark/pull/38006#discussion_r980668686 ## core/src/main/scala/org/apache/spark/internal/config/Connect.scala: ## @@ -0,0 +1,33 @@ [Apache License 2.0 header omitted] +package org.apache.spark.internal.config + +private[spark] object Connect { + + private[spark] val CONNECT_GRPC_DEBUG_MODE = +ConfigBuilder("spark.connect.grpc.debug.enabled") + .version("3.4.0") + .booleanConf Review Comment: can we add docs?
[GitHub] [spark] amaliujia commented on a diff in pull request #37994: [SPARK-40454][CONNECT] Initial DSL framework for protobuf testing
amaliujia commented on code in PR #37994: URL: https://github.com/apache/spark/pull/37994#discussion_r980480469 ## connect/src/main/scala/org/apache/spark/sql/connect/package.scala: ## @@ -0,0 +1,39 @@ [Apache License 2.0 header omitted] +package org.apache.spark.sql + +import org.apache.spark.connect.proto + Review Comment: Ah, I found that we need to add connect to the deps of the REPL module. I made the change and then also updated the documentation.
[GitHub] [spark] aokolnychyi commented on a diff in pull request #38004: [SPARK-40551][SQL] DataSource V2: Add APIs for delta-based row-level operations
aokolnychyi commented on code in PR #38004: URL: https://github.com/apache/spark/pull/38004#discussion_r980509911 ## sql/catalyst/src/main/java/org/apache/spark/sql/connector/write/DeltaBatchWrite.java: ## @@ -0,0 +1,31 @@ [Apache License 2.0 header omitted] +package org.apache.spark.sql.connector.write; + +import org.apache.spark.annotation.Experimental; + +/** + * An interface that defines how to write a delta of rows during batch processing. + * + * @since 3.4.0 + */ +@Experimental +public interface DeltaBatchWrite extends BatchWrite { + @Override Review Comment: I did not add any doc for inherited methods as it would mostly overlap with the parent doc. I could add a few sentences and reference the parent doc, though.
[GitHub] [spark] amaliujia commented on a diff in pull request #37994: [SPARK-40454][CONNECT] Initial DSL framework for protobuf testing
amaliujia commented on code in PR #37994: URL: https://github.com/apache/spark/pull/37994#discussion_r980613106 ## connect/src/main/scala/org/apache/spark/sql/connect/package.scala: ## @@ -0,0 +1,98 @@ [Apache License 2.0 header omitted] +package org.apache.spark.sql Review Comment: hmmm, won't that be a dependency issue? connect depends on sql, but this DSL needs to depend on the connect proto?
[GitHub] [spark] amaliujia commented on a diff in pull request #37994: [SPARK-40454][CONNECT] Initial DSL framework for protobuf testing
amaliujia commented on code in PR #37994: URL: https://github.com/apache/spark/pull/37994#discussion_r980613293 ## connect/src/main/scala/org/apache/spark/sql/connect/package.scala: ## @@ -0,0 +1,98 @@ [Apache License 2.0 header omitted] +package org.apache.spark.sql Review Comment: Are you actually saying move to `org.apache.spark.sql.catalyst` in the connect module?
[GitHub] [spark] attilapiros commented on a diff in pull request #37990: [WIP][SPARK-40458][K8S] Bump Kubernetes Client Version to 6.1.1
attilapiros commented on code in PR #37990: URL: https://github.com/apache/spark/pull/37990#discussion_r980632958 ## resource-managers/kubernetes/core/src/main/scala/org/apache/spark/scheduler/cluster/k8s/KubernetesClusterSchedulerBackend.scala: ## @@ -85,7 +85,7 @@ private[spark] class KubernetesClusterSchedulerBackend( Map(SPARK_APP_ID_LABEL -> applicationId(), SPARK_ROLE_LABEL -> SPARK_POD_EXECUTOR_ROLE) val configMap = KubernetesClientUtils.buildConfigMap(configMapName, confFilesMap, labels) KubernetesUtils.addOwnerReference(driverPod.orNull, Seq(configMap)) -kubernetesClient.configMaps().create(configMap) +kubernetesClient.configMaps().inAnyNamespace().resource(configMap).create() Review Comment: Double checked (screenshot): https://user-images.githubusercontent.com/2017933/192407277-d476449e-1112-40fb-916e-81393855988c.png
[GitHub] [spark] HeartSaVioR opened a new pull request, #38008: [SPARK-40571][SS][TESTS] Construct a new test case for applyInPandasWithState to verify fault-tolerance semantic with random python worker failure
HeartSaVioR opened a new pull request, #38008: URL: https://github.com/apache/spark/pull/38008 ### What changes were proposed in this pull request? This PR proposes a new test case for applyInPandasWithState to verify that the fault-tolerance semantic is not broken despite random Python worker failures. If the sink provides end-to-end exactly-once, the query should respect that guarantee. Otherwise, the query should still respect stateful exactly-once, but only at-least-once in terms of outputs. The test leverages the file stream sink, which is end-to-end exactly-once, but to keep the verification simple, we just verify whether stateful exactly-once is guaranteed despite Python worker failures. ### Why are the changes needed? This strengthens the test coverage, especially for the fault-tolerance semantic. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? New test added. Manually ran `./python/run-tests --testnames 'pyspark.sql.tests.test_pandas_grouped_map_with_state'` 10 times and all succeeded.
[GitHub] [spark] HeartSaVioR commented on pull request #37993: [SPARK-40557][CONNECT] Update generated proto files for Spark Connect
HeartSaVioR commented on PR #37993: URL: https://github.com/apache/spark/pull/37993#issuecomment-1258866543 post +1, thanks for updating this!
[GitHub] [spark] zhengruifeng commented on pull request #37967: Scalable SkipGram-Word2Vec implementation
zhengruifeng commented on PR #37967: URL: https://github.com/apache/spark/pull/37967#issuecomment-1258873185 So this is a totally new implementation of `SkipGram` W2V in `.mllib`. Is it possible to improve the existing w2v instead of implementing a new one? What about implementing it on the `.ml` side, since `.mllib` is in maintenance mode? I think we should wait for @srowen @WeichenXu123 @huaxingao @mengxr 's opinions since this is a big change.
[GitHub] [spark] HyukjinKwon commented on a diff in pull request #38008: [SPARK-40571][SS][TESTS] Construct a new test case for applyInPandasWithState to verify fault-tolerance semantic with random python worker failure
HyukjinKwon commented on code in PR #38008: URL: https://github.com/apache/spark/pull/38008#discussion_r980667435 ## python/pyspark/sql/tests/test_pandas_grouped_map_with_state.py: ## @@ -90,6 +107,99 @@ def check_results(batch_df, _): self.assertTrue(q.isActive) q.processAllAvailable() +def test_apply_in_pandas_with_state_python_worker_random_failure(self): +output_path = tempfile.mkdtemp() +checkpoint_loc = tempfile.mkdtemp() +shutil.rmtree(output_path) +shutil.rmtree(checkpoint_loc) + +def run_query(): +df = self.spark.readStream.format("text") \ +.option("maxFilesPerTrigger", "1") \ +.load(self.base_path + "/random_failure/input") + +for q in self.spark.streams.active: +q.stop() +self.assertTrue(df.isStreaming) + +output_type = StructType( +[StructField("value", StringType()), StructField("count", LongType())] +) +state_type = StructType([StructField("cnt", LongType())]) + +def func(key, pdf_iter, state): +assert isinstance(state, GroupState) + +# should be huge enough to not trigger kill in every batches +# but should be also reasonable to trigger kill multiple times across batches +if random.randrange(300) == 1: +sys.exit(1) + +count = state.getOption +if count is None: +count = 0 +else: +count = count[0] + +for pdf in pdf_iter: +count += len(pdf) + +state.update((count,)) +yield pd.DataFrame({"value": [key[0]], "count": [count]}) + +q = ( +df.groupBy(df["value"]) +.applyInPandasWithState( +func, output_type, state_type, "Append", GroupStateTimeout.NoTimeout +) +.writeStream.queryName("this_query") +.format("json") +.outputMode("append") +.option("path", output_path) +.option("checkpointLocation", checkpoint_loc) +.start() +) + +return q + +q = run_query() + +self.assertEqual(q.name, "this_query") +self.assertTrue(q.isActive) + +# expected_output directory is constucted from below query: +# spark.read.format("text").load("./input").groupBy("value").count() \ +# .repartition(1).sort("value").write.format("json").save("./output") +expected = self.spark.read.schema("value string, count int").format("json") \ +.load(self.base_path + "/random_failure/expected_output") \ +.sort("value").collect() + +curr_time = time.time() +timeout = curr_time + 120 # 2 minutes Review Comment: There's `eventually` available at `pyspark.testing.utils`. Can we leverage this?
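For readers unfamiliar with the helper the reviewer points to: `pyspark.testing.utils.eventually` retries a condition until it succeeds or a timeout elapses, which replaces hand-rolled `curr_time`/`timeout` polling loops like the one in the diff. A minimal self-contained sketch of the pattern follows; the real helper's exact signature and behavior should be checked against the PySpark source.

```python
import time


def eventually(condition, timeout=30.0, catch_assertions=False):
    """Retry `condition` until it returns a truthy value or `timeout` elapses.

    Modeled on pyspark.testing.utils.eventually (signature assumed here):
    with catch_assertions=True, AssertionErrors raised by `condition` are
    retried instead of propagated immediately.
    """
    deadline = time.time() + timeout
    last_error = None
    while time.time() < deadline:
        try:
            result = condition()
            if result:
                return result
        except AssertionError as e:
            if not catch_assertions:
                raise
            last_error = e
        time.sleep(0.05)  # back off briefly between attempts
    raise AssertionError(
        "condition not met within %ss (last error: %s)" % (timeout, last_error)
    )


# In the test above, the manual loop could then become something like:
#   eventually(lambda: load_sorted_output() == expected, timeout=120.0)
# where load_sorted_output is a hypothetical helper reading the sink output.
```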
[GitHub] [spark] sigmod commented on pull request #37996: [SPARK-40558][SQL] Add Reusable Exchange in Bloom creation side plan
sigmod commented on PR #37996: URL: https://github.com/apache/spark/pull/37996#issuecomment-1258522623 cc @andylam-db @maryannxue
[GitHub] [spark] dongjoon-hyun commented on pull request #38001: [SPARK-40562][SQL] Add `spark.sql.legacy.groupingIdWithAppendedUserGroupBy`
dongjoon-hyun commented on PR #38001: URL: https://github.com/apache/spark/pull/38001#issuecomment-1258524084 Thank you for your feedback, @thiyaga.
[GitHub] [spark] AmplabJenkins commented on pull request #37994: [SPARK-40454][CONNECT] Initial DSL framework for protobuf testing
AmplabJenkins commented on PR #37994: URL: https://github.com/apache/spark/pull/37994#issuecomment-1258554960 Can one of the admins verify this patch?
[GitHub] [spark] AmplabJenkins commented on pull request #37993: [SPARK-40557] [CONNECT] [Cleanup] Update generated proto files for Spark Connect
AmplabJenkins commented on PR #37993: URL: https://github.com/apache/spark/pull/37993#issuecomment-1258555111 Can one of the admins verify this patch?
[GitHub] [spark] aokolnychyi commented on pull request #38004: [SPARK-40551][SQL] DataSource V2: Add APIs for delta-based row-level operations
aokolnychyi commented on PR #38004: URL: https://github.com/apache/spark/pull/38004#issuecomment-1258647018 @cloud-fan @rdblue @huaxingao @dongjoon-hyun @sunchao @viirya, could you take a look? This is the API from the design doc we discussed earlier. I have also created PR #38005 that shows how this API will be consumed.
[GitHub] [spark] bersprockets commented on pull request #37825: [SPARK-40382][SQL] Group distinct aggregate expressions by semantically equivalent children in `RewriteDistinctAggregates`
bersprockets commented on PR #37825: URL: https://github.com/apache/spark/pull/37825#issuecomment-1258849538 @beliefer > Please reference `SimplifyBinaryComparison`. Thanks, I will take a look. This is in reference to the fall-through case, where we discover there is really only a single distinct group, correct?
[GitHub] [spark] huleilei opened a new pull request, #38007: [SPARK-40566][SQL]Add showIndex function
huleilei opened a new pull request, #38007: URL: https://github.com/apache/spark/pull/38007 ### What changes were proposed in this pull request? I created an index for a table and want to know what indexes are on the table, but the SHOW INDEX syntax is not supported. So I think SHOW INDEX syntax should be added to the Spark code. Please let me know if you have any thoughts. ### Why are the changes needed? ### Does this PR introduce _any_ user-facing change? ### How was this patch tested?
[GitHub] [spark] attilapiros commented on pull request #37990: [SPARK-40458][K8S] Bump Kubernetes Client Version to 6.1.1
attilapiros commented on PR #37990: URL: https://github.com/apache/spark/pull/37990#issuecomment-1258859382 I would like to go through one more time to find all the places where we can specify the namespace.
[GitHub] [spark] aokolnychyi commented on a diff in pull request #38004: [SPARK-40551][SQL] DataSource V2: Add APIs for delta-based row-level operations
aokolnychyi commented on code in PR #38004: URL: https://github.com/apache/spark/pull/38004#discussion_r980511958 ## sql/catalyst/src/main/java/org/apache/spark/sql/connector/write/DeltaWriteBuilder.java: ## @@ -0,0 +1,33 @@ [Apache License 2.0 header omitted] +package org.apache.spark.sql.connector.write; + +import org.apache.spark.annotation.Experimental; + +/** + * An interface for building a {@link DeltaWrite}. + * + * @since 3.4.0 + */ +@Experimental +public interface DeltaWriteBuilder extends WriteBuilder { + @Override + default DeltaWrite build() { +throw new UnsupportedOperationException(getClass().getName() + " does not implement build"); Review Comment: I had to override to avoid inheriting the default implementation from the parent interface.
[GitHub] [spark] grundprinzip commented on a diff in pull request #38006: [SPARK-40536][CONNECT] Make Spark Connect port configurable
grundprinzip commented on code in PR #38006: URL: https://github.com/apache/spark/pull/38006#discussion_r980606668 ## core/src/main/scala/org/apache/spark/internal/config/Connect.scala: ## @@ -0,0 +1,33 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.internal.config + +private[spark] object Connect { Review Comment: As Spark Connect is built as a plugin, ideally none of its configuration should be located in core. Is there a way we can move this to the connect module?
[GitHub] [spark] amaliujia commented on a diff in pull request #38006: [SPARK-40536][CONNECT] Make Spark Connect port configurable
amaliujia commented on code in PR #38006: URL: https://github.com/apache/spark/pull/38006#discussion_r980607213 ## core/src/main/scala/org/apache/spark/internal/config/Connect.scala: ## @@ -0,0 +1,33 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.internal.config + +private[spark] object Connect { Review Comment: I agree. Let's see other reviewers' suggestions on this.
[GitHub] [spark] HyukjinKwon commented on a diff in pull request #38008: [SPARK-40571][SS][TESTS] Construct a new test case for applyInPandasWithState to verify fault-tolerance semantic with random py
HyukjinKwon commented on code in PR #38008: URL: https://github.com/apache/spark/pull/38008#discussion_r980666406 ## python/test_support/sql/streaming/apply_in_pandas_with_state/random_failure/input/test-0.txt: ## @@ -0,0 +1,100 @@ +non Review Comment: Can we avoid adding these files? I try to avoid this in general to make the test case self-contained and more readable.
[GitHub] [spark] amaliujia commented on a diff in pull request #37994: [SPARK-40454][CONNECT] Initial DSL framework for protobuf testing
amaliujia commented on code in PR #37994: URL: https://github.com/apache/spark/pull/37994#discussion_r980404430 ## connect/src/main/scala/org/apache/spark/sql/connect/package.scala: ## @@ -0,0 +1,39 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ +package org.apache.spark.sql + +import org.apache.spark.connect.proto + Review Comment: I double-checked in case I didn't include the right DSL classes in the shell, but importing a different connect class fails with the same error: ``` scala> import org.apache.spark.sql.connect.service.SparkConnectService :25: error: object connect is not a member of package org.apache.spark.sql import org.apache.spark.sql.connect.service.SparkConnectService ```
[GitHub] [spark] AmplabJenkins commented on pull request #37996: [SPARK-40558][SQL] Add Reusable Exchange in Bloom creation side plan
AmplabJenkins commented on PR #37996: URL: https://github.com/apache/spark/pull/37996#issuecomment-1258485849 Can one of the admins verify this patch?
[GitHub] [spark] aokolnychyi commented on a diff in pull request #38004: [SPARK-40551][SQL] DataSource V2: Add APIs for delta-based row-level operations
aokolnychyi commented on code in PR #38004: URL: https://github.com/apache/spark/pull/38004#discussion_r980508846 ## sql/catalyst/src/main/java/org/apache/spark/sql/connector/write/LogicalWriteInfo.java: ## @@ -45,4 +45,14 @@ public interface LogicalWriteInfo { * the schema of the input data from Spark to data source. */ StructType schema(); + + /** + * the schema of the input metadata from Spark to data source. + */ + StructType metadataSchema(); Review Comment: I could add default implementations and return null but I think instances of this interface are only created by Spark.
[GitHub] [spark] cloud-fan commented on a diff in pull request #37994: [SPARK-40454][CONNECT] Initial DSL framework for protobuf testing
cloud-fan commented on code in PR #37994: URL: https://github.com/apache/spark/pull/37994#discussion_r980615123 ## connect/src/main/scala/org/apache/spark/sql/connect/package.scala: ## @@ -0,0 +1,98 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ +package org.apache.spark.sql Review Comment: BTW, I don't think we have to make it available in spark-shell. The catalyst dsl was added before we had the DataFrame API. At that time, it was the only way for end-users to build a query plan with a dataframe-like API. We don't have such a need for the spark connect dsl.
[GitHub] [spark] cloud-fan commented on a diff in pull request #37994: [SPARK-40454][CONNECT] Initial DSL framework for protobuf testing
cloud-fan commented on code in PR #37994: URL: https://github.com/apache/spark/pull/37994#discussion_r980615483 ## connect/src/main/scala/org/apache/spark/sql/connect/package.scala: ## @@ -0,0 +1,98 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ +package org.apache.spark.sql + +import scala.collection.JavaConverters._ + +import org.apache.spark.connect.proto +import org.apache.spark.sql.catalyst.analysis.UnresolvedAlias +import org.apache.spark.sql.catalyst.expressions.{Expression, NamedExpression} + + +/** + * A collection of implicit conversions that create a DSL for constructing connect protos. + * + * {{{ + * scala> import org.apache.spark.sql.connect.plans.DslLogicalPlan + * + * // Standard way to construct connect proto + * scala> import org.apache.spark.connect.proto + * scala> :paste + * // Entering paste mode (ctrl-D to finish) + * val connectTestRelation = + *proto.Relation.newBuilder() + * .setRead( + * proto.Read.newBuilder() + * .setNamedTable(proto.Read.NamedTable.newBuilder().addParts("student").build()) + * .build()) + * .build() + * // Exiting paste mode, now interpreting. 
+ *connectTestRelation: org.apache.spark.connect.proto.Relation = + * read { + * named_table { + *parts: "student" + * } + *} + * + * // Now we can apply select on the proto relation above + * scala> import org.apache.spark.sql.catalyst.analysis.UnresolvedAttribute + * scala> connectTestRelation.select(UnresolvedAttribute(Seq("id"))) + * res14: org.apache.spark.connect.proto.Relation = + *project { + * input { + *read { + * named_table { + *parts: "student" + * } + *} + * } + * expressions { + *unresolved_attribute { + * parts: "id" + *} + * } + *} + * + * }}} + * + */ +package object connect { + + object plans { // scalastyle:ignore +implicit class DslLogicalPlan(val logicalPlan: proto.Relation) { + def select(exprs: Expression*): proto.Relation = { Review Comment: should this be `proto.Expression`? Will we ever add `proto.Expression`?
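The DSL in this thread wraps an immutable protobuf relation in a class that adds fluent operators, each returning a new relation. The same shape can be sketched in plain Java, where the Scala implicit conversion becomes an explicit `wrap()` call. `Relation`, `Dsl`, and the string-based plan encoding below are hypothetical stand-ins for the generated protobuf classes, not the real Spark Connect API:

```java
// Sketch of the implicit-class DSL idea: wrap an immutable relation and
// expose fluent operators that build new relations. All names are
// illustrative stand-ins for org.apache.spark.connect.proto classes.
public class RelationDsl {
    static final class Relation {
        final String describe;
        Relation(String describe) { this.describe = describe; }
    }

    // Analogue of proto.Read.NamedTable: a leaf relation reading a table.
    static Relation namedTable(String name) {
        return new Relation("read(named_table: " + name + ")");
    }

    // The "implicit class": holds the wrapped relation and adds DSL methods.
    static final class Dsl {
        private final Relation plan;
        Dsl(Relation plan) { this.plan = plan; }

        // Analogue of the DSL's select(): wraps the input in a projection.
        Dsl select(String... columns) {
            return new Dsl(new Relation(
                "project(" + String.join(", ", columns)
                    + ", input: " + plan.describe + ")"));
        }

        Relation plan() { return plan; }
    }

    // In Scala this conversion is implicit; in Java it is an explicit call.
    static Dsl wrap(Relation r) { return new Dsl(r); }

    public static void main(String[] args) {
        Relation plan = wrap(namedTable("student")).select("id").plan();
        System.out.println(plan.describe);
        // project(id, input: read(named_table: student))
    }
}
```

The point of the pattern is that each operator is a pure function from relation to relation, so test code can compose plans the way a dataframe API would.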
[GitHub] [spark] cloud-fan commented on a diff in pull request #37994: [SPARK-40454][CONNECT] Initial DSL framework for protobuf testing
cloud-fan commented on code in PR #37994: URL: https://github.com/apache/spark/pull/37994#discussion_r980615672 ## connect/src/test/scala/org/apache/spark/sql/connect/planner/SparkConnectProtoSuite.scala: ## @@ -0,0 +1,74 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ +package org.apache.spark.sql.connect.planner + +import org.scalatest.BeforeAndAfter + +import org.apache.spark.SparkFunSuite +import org.apache.spark.connect.proto +import org.apache.spark.sql.catalyst.analysis.UnresolvedAttribute +import org.apache.spark.sql.catalyst.plans.PlanTest +import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan +import org.apache.spark.sql.connect.plans.DslLogicalPlan +import org.apache.spark.sql.test.SharedSparkSession + +/** + * This suite is based on the connect DSL and tests that, given the same dataframe operations, + * connect can construct a proto plan that can be translated back and, once analyzed, is the + * same as the plan generated by the Spark dataframe API.
+ */ +class SparkConnectProtoSuite extends SparkFunSuite + with SharedSparkSession with SparkConnectPlanTest with PlanTest with BeforeAndAfter { + + lazy val connectTestRelation = +proto.Relation.newBuilder() + .setRead( +proto.Read.newBuilder() + .setNamedTable(proto.Read.NamedTable.newBuilder().addParts("student").build()) +.build()) + .build() + + lazy val sparkTestRelation = spark.table("student") + + before { +setupTestData() + } + + test("Basic select") { +val connectPlan = analyze(connectTestRelation.select(UnresolvedAttribute(Seq("id" +val sparkPlan = sparkTestRelation.select("id").queryExecution.analyzed +comparePlans(connectPlan, sparkPlan) + } + + private def analyze(plan: proto.Relation): LogicalPlan = { +spark.sessionState.executePlan(transform(plan)).analyzed + } + + protected override def comparePlans( +plan1: LogicalPlan, Review Comment: nit: 4 spaces indentation
[GitHub] [spark] HyukjinKwon commented on a diff in pull request #37989: [SPARK-40096][CORE][TESTS][FOLLOW-UP] Explicitly check the element and length
HyukjinKwon commented on code in PR #37989: URL: https://github.com/apache/spark/pull/37989#discussion_r980652463 ## core/src/test/scala/org/apache/spark/scheduler/DAGSchedulerSuite.scala: ## @@ -4495,7 +4495,8 @@ class DAGSchedulerSuite extends SparkFunSuite with TempLocalSparkContext with Ti sendRequestsLatch.await() verify(blockStoreClient, times(2)) .finalizeShuffleMerge(any(), any(), any(), any(), any()) - assert(sentHosts === Seq("hostB")) + assert(sentHosts.nonEmpty) + assert(sentHosts.head === "hostB" && sentHosts.length == 1) Review Comment: Ah, okay. Seems like this is flaky because of the timeout: ``` [info] - SPARK-40096: Send finalize events even if shuffle merger blocks indefinitely with registerMergeResults is false *** FAILED *** (106 milliseconds) [info] ArrayBuffer("hostB") was empty (DAGSchedulerSuite.scala:4498) [info] org.scalatest.exceptions.TestFailedException: [info] at org.scalatest.Assertions.newAssertionFailedException(Assertions.scala:472) [info] at org.scalatest.Assertions.newAssertionFailedException$(Assertions.scala:471) [info] at org.scalatest.Assertions$.newAssertionFailedException(Assertions.scala:1231) [info] at org.scalatest.Assertions$AssertionsHelper.macroAssert(Assertions.scala:1295) [info] at org.apache.spark.scheduler.DAGSchedulerSuite.$anonfun$new$286(DAGSchedulerSuite.scala:4498) [info] at org.scalatest.OutcomeOf.outcomeOf(OutcomeOf.scala:85) [info] at org.scalatest.OutcomeOf.outcomeOf$(OutcomeOf.scala:83) [info] at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104) [info] at org.scalatest.Transformer.apply(Transformer.scala:22) [info] at org.scalatest.Transformer.apply(Transformer.scala:20) ``` https://github.com/apache/spark/actions/runs/3129263557/jobs/5078150518 I think it was empty when the condition is checked. Later, I think the array is filled at the time when the exception is actually printed out. Should we increase the timeout, @mridulm and @wankunde? 
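The race described above — a buffer filled by a background thread can look empty at the instant the assertion runs, yet be populated by the time the failure is printed — is typically fixed by waiting on a completion signal (with a generous timeout) before asserting, rather than asserting immediately. A minimal sketch of that fix; `sentHosts` and the thread body are illustrative stand-ins, not the DAGSchedulerSuite code:

```java
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

// Sketch of removing the race: the asynchronous work signals a latch when it
// has recorded its result, and the test waits on the latch before asserting.
public class AwaitBeforeAssert {
    public static List<String> sendWithLatch(long timeoutMs) {
        List<String> sentHosts = new CopyOnWriteArrayList<>();
        CountDownLatch done = new CountDownLatch(1);

        Thread sender = new Thread(() -> {
            sentHosts.add("hostB"); // the asynchronous work under test
            done.countDown();       // signal completion
        });
        sender.start();

        try {
            // Wait for completion instead of asserting immediately; a longer
            // timeout here trades latency for stability on slow CI machines.
            if (!done.await(timeoutMs, TimeUnit.MILLISECONDS)) {
                throw new AssertionError("timed out waiting for sender");
            }
        } catch (InterruptedException e) {
            throw new RuntimeException(e);
        }
        return sentHosts;
    }

    public static void main(String[] args) {
        List<String> hosts = sendWithLatch(5000);
        System.out.println(hosts); // [hostB]
    }
}
```

With the latch in place, increasing the timeout only affects how long a genuinely broken run takes to fail; a healthy run never waits longer than the work itself.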
[GitHub] [spark] amaliujia commented on pull request #38006: [SPARK-40536] Make Spark Connect port configurable
amaliujia commented on PR #38006: URL: https://github.com/apache/spark/pull/38006#issuecomment-1258739872 @HyukjinKwon @cloud-fan @grundprinzip
[GitHub] [spark] Kimahriman commented on pull request #38003: [SPARK-40565][SQL] Don't push non-deterministic filters to V2 file sources
Kimahriman commented on PR #38003: URL: https://github.com/apache/spark/pull/38003#issuecomment-1258752350 > Thank you for making a PR with the test coverage, @Kimahriman . Previously, it fails, right? Yeah these tests actually fail with an exception without the change
[GitHub] [spark] amaliujia commented on a diff in pull request #38006: [SPARK-40536][CONNECT] Make Spark Connect port configurable
amaliujia commented on code in PR #38006: URL: https://github.com/apache/spark/pull/38006#discussion_r980607213 ## core/src/main/scala/org/apache/spark/internal/config/Connect.scala: ## @@ -0,0 +1,33 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.internal.config + +private[spark] object Connect { Review Comment: I agree. I am not sure if there is a way to have config only for a plugin. Let's see other reviewers' suggestions on this.
[GitHub] [spark] cloud-fan commented on pull request #38001: [SPARK-40562][SQL] Add `spark.sql.legacy.groupingIdWithAppendedUserGroupBy`
cloud-fan commented on PR #38001: URL: https://github.com/apache/spark/pull/38001#issuecomment-1258815644 > it's not in the SQL standard Yea, but since we copied it from Hive, I think the result should match Hive as well. Sorry I didn't realize there is a result change when doing the bug fix PR. Can we try with Hive to make sure we return the correct result today?
[GitHub] [spark] aokolnychyi commented on a diff in pull request #38004: [SPARK-40551][SQL] DataSource V2: Add APIs for delta-based row-level operations
aokolnychyi commented on code in PR #38004: URL: https://github.com/apache/spark/pull/38004#discussion_r980626015 ## sql/catalyst/src/main/java/org/apache/spark/sql/connector/write/LogicalWriteInfo.java: ## @@ -45,4 +45,18 @@ public interface LogicalWriteInfo { * the schema of the input data from Spark to data source. */ StructType schema(); + + /** + * the schema of the input metadata from Spark to data source. + */ + default StructType metadataSchema() { +return null; Review Comment: The default implementation is purely for compatibility.
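The compatibility trick in this thread is standard interface evolution: a method added to an existing interface gets a default body (here returning null) so implementations compiled against the old interface keep working. A minimal sketch with stand-in names; the real `LogicalWriteInfo` returns `StructType`, the `String` here is only for illustration:

```java
// Sketch of evolving an interface without breaking existing implementations.
// LogicalWriteInfo / metadataSchema are stand-in names, not Spark's classes.
interface LogicalWriteInfo {
    String schema(); // original method, implemented everywhere

    // Added later: the default keeps old implementations source- and
    // binary-compatible; callers treat null as "no metadata schema".
    default String metadataSchema() {
        return null;
    }
}

public class InterfaceEvolutionDemo {
    // An "old" implementation written before metadataSchema() existed.
    static final class OldInfo implements LogicalWriteInfo {
        public String schema() { return "id INT, data STRING"; }
    }

    public static LogicalWriteInfo oldInfo() { return new OldInfo(); }

    public static void main(String[] args) {
        LogicalWriteInfo info = oldInfo();
        System.out.println(info.schema());         // id INT, data STRING
        System.out.println(info.metadataSchema()); // null (compat default)
    }
}
```

The trade-off discussed in the review: the default is safe when only one party (Spark) creates instances, but a null-returning default pushes a null check onto every caller.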
[GitHub] [spark] bersprockets commented on a diff in pull request #37825: [SPARK-40382][SQL] Group distinct aggregate expressions by semantically equivalent children in `RewriteDistinctAggregates`
bersprockets commented on code in PR #37825: URL: https://github.com/apache/spark/pull/37825#discussion_r980641159 ## sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/RewriteDistinctAggregates.scala: ## @@ -218,9 +218,16 @@ object RewriteDistinctAggregates extends Rule[LogicalPlan] { val aggExpressions = collectAggregateExprs(a) val distinctAggs = aggExpressions.filter(_.isDistinct) +val funcChildren = distinctAggs.flatMap { e => + e.aggregateFunction.children.filter(!_.foldable) +} +val funcChildrenLookup = funcChildren.map { e => + (e, funcChildren.find(fc => e.semanticEquals(fc)).getOrElse(e)) +}.toMap + // Extract distinct aggregate expressions. val distinctAggGroups = aggExpressions.filter(_.isDistinct).groupBy { e => Review Comment: Not sure if this is what you were hinting at, but for all maps related to distinct aggregation children, the code now uses `ExpressionSet` as a key. That way look-ups shouldn't care about superficial differences: the code never makes a lookup using an original child (...for the distinct aggregations. It still uses original children for regular aggregations). >Then it's pretty easy to get back the original expressions, by ExpressionSet.toSeq. By using `ExpressionSet` as the key to `distinctAggChildAttrLookup`, hopefully I don't need the originals at all. Which is a good thing, since `ExpressionSet` is lossy when it comes to the originals, for example: ``` select count(distinct 1 + c1, c1 + 1), count(distinct c2 + 1, c2 + 2) from df; ``` This creates the following grouping keys for `distinctAggGroups`: ``` Set((1 + c1#106)) Set((c2#107 + 1), (c2#107 + 2)) ``` `c1#106 + 1` is lost because of the way `ExpressionSet#add` works (it just ignores a new expression that is semantically equivalent to anything in `baseSet`).
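The lossiness described above can be sketched outside Catalyst: deduplicate expressions by a canonical form, keep only the first-seen original per form, and semantically equivalent variants disappear. The canonicalization below (sorting the operands of `+`) is a toy stand-in for Catalyst's real `semanticEquals`/`canonicalized` machinery, and the string expressions are illustrative:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Sketch of ExpressionSet-style grouping: "1 + c1" and "c1 + 1" share a
// canonical form, so only the first-seen original survives, which is the
// lossiness noted in the review comment.
public class SemanticSetDemo {
    // Canonicalize "a + b" by ordering the operands lexicographically,
    // mimicking semantic equivalence of commutative addition.
    static String canonicalize(String expr) {
        String[] parts = expr.split("\\s*\\+\\s*");
        Arrays.sort(parts);
        return String.join(" + ", parts);
    }

    // Keep the first original seen for each canonical form (like
    // ExpressionSet#add ignoring semantically equivalent newcomers).
    static List<String> dedup(List<String> exprs) {
        Map<String, String> byCanonical = new LinkedHashMap<>();
        for (String e : exprs) {
            byCanonical.putIfAbsent(canonicalize(e), e);
        }
        return new ArrayList<>(byCanonical.values());
    }

    public static void main(String[] args) {
        List<String> kept =
            dedup(List.of("1 + c1", "c1 + 1", "c2 + 1", "c2 + 2"));
        // "c1 + 1" is dropped; "c2 + 1" and "c2 + 2" are genuinely distinct.
        System.out.println(kept); // [1 + c1, c2 + 1, c2 + 2]
    }
}
```

This mirrors the review's example: of `count(distinct 1 + c1, c1 + 1)`, only `1 + c1` survives as a grouping key, which is harmless as long as lookups also go through the canonical form rather than the lost original.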
[GitHub] [spark] HeartSaVioR commented on a diff in pull request #38008: [SPARK-40571][SS][TESTS] Construct a new test case for applyInPandasWithState to verify fault-tolerance semantic with random py
HeartSaVioR commented on code in PR #38008: URL: https://github.com/apache/spark/pull/38008#discussion_r980664896 ## python/pyspark/sql/tests/test_pandas_grouped_map_with_state.py: ## @@ -90,6 +107,99 @@ def check_results(batch_df, _): self.assertTrue(q.isActive) q.processAllAvailable() +def test_apply_in_pandas_with_state_python_worker_random_failure(self): +output_path = tempfile.mkdtemp() +checkpoint_loc = tempfile.mkdtemp() +shutil.rmtree(output_path) +shutil.rmtree(checkpoint_loc) + +def run_query(): +df = self.spark.readStream.format("text") \ +.option("maxFilesPerTrigger", "1") \ Review Comment: This query runs 10 batches from 10 files, each of which has 100 words, so 1000 words overall.
[GitHub] [spark] HeartSaVioR commented on pull request #38008: [SPARK-40571][SS][TESTS] Construct a new test case for applyInPandasWithState to verify fault-tolerance semantic with random python worke
HeartSaVioR commented on PR #38008: URL: https://github.com/apache/spark/pull/38008#issuecomment-1258874132 cc. @HyukjinKwon @alex-balikov
[GitHub] [spark] dongjoon-hyun commented on a diff in pull request #38003: [SPARK-40565][SQL] Don't push non-deterministic filters to V2 file sources
dongjoon-hyun commented on code in PR #38003: URL: https://github.com/apache/spark/pull/38003#discussion_r980419475 ## sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/PruneFileSourcePartitionsSuite.scala: ## @@ -140,6 +140,24 @@ class PruneFileSourcePartitionsSuite extends PrunePartitionSuiteBase with Shared } } + test("SPARK-40565: don't push down non-deterministic filters for V2 file sources") { Review Comment: BTW, this query seems to fail in V1 source. Could you confirm this? ```scala -withSQLConf((SQLConf.USE_V1_SOURCE_LIST.key, "")) { +withSQLConf((SQLConf.USE_V1_SOURCE_LIST.key, "parquet")) { ```
[GitHub] [spark] amaliujia commented on pull request #37750: [SPARK-40296] Error class for DISTINCT function not found
amaliujia commented on PR #37750: URL: https://github.com/apache/spark/pull/37750#issuecomment-1258518169 Because Spark supports `SELECT distinct(col1, col2)` (and the return is a struct of col1 and col2), this error message proposal becomes complicated. We cannot simply say that `SELECT SUM(a1), distinct(c1, c2)` is invalid usage, even though it is not supported by the parser, because users may have been led to believe that `distinct(col list)` is a valid aggregate-like usage. We at least need a better way to distinguish 1) `select distinct col_list` 2) `select distinct(col_list)` 3) `select agg1, agg2, distinct(col_list)`. Given the reason above, I will close this PR and defer the handling of the distinct error described here.
[GitHub] [spark] amaliujia closed pull request #37750: [SPARK-40296] Error class for DISTINCT function not found
amaliujia closed pull request #37750: [SPARK-40296] Error class for DISTINCT function not found URL: https://github.com/apache/spark/pull/37750
[GitHub] [spark] thiyaga commented on pull request #38001: [SPARK-40562][SQL] Add `spark.sql.legacy.groupingIdWithAppendedUserGroupBy`
thiyaga commented on PR #38001: URL: https://github.com/apache/spark/pull/38001#issuecomment-1258517989 We use grouping sets in our queries and rely on `grouping__id` as an identifier to query the data for the respective group. If we use `grouping__id` directly, it is prone to change whenever the grouping sets change (e.g. adding a new grouping set, or adding a new column to an existing grouping set). Any grouping id change makes things even more complex when consuming this data directly from reporting tools like Tableau. We need to do one of the following to mitigate the changing `grouping__id`: 1. Transform the `grouping__id` into something deterministic that is not impacted when the grouping sets change (e.g. convert `grouping__id` to a `group_name`) 2. Have some sort of logical DB view that handles the transformation at runtime (e.g. using CASE WHEN) In essence, we always have a dependency on `grouping__id` when grouping sets are used in our queries, and any change in grouping id generation has an immediate impact. This new parameter will let us keep using the legacy logic.
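The instability described above can be illustrated with a small Java sketch, assuming the common convention where a grouping id is a bitmask over the GROUP BY columns (Spark's exact bit layout differs in detail, which is what the SPARK-40562 legacy flag is about): adding one column to GROUP BY renumbers every existing id.

```java
import java.util.List;
import java.util.Set;

class GroupingIdDemo {
    // One bit per GROUP BY column; a set bit means the column is absent
    // from the grouping set. This is a textbook GROUPING__ID convention,
    // used here only to show why ids shift when grouping columns change.
    static long groupingId(List<String> groupByCols, Set<String> groupingSet) {
        long gid = 0;
        for (String col : groupByCols) {
            gid = (gid << 1) | (groupingSet.contains(col) ? 0 : 1);
        }
        return gid;
    }

    public static void main(String[] args) {
        List<String> cols = List.of("region", "product");
        System.out.println(groupingId(cols, Set.of("region")));    // 1

        // Adding one more column to GROUP BY renumbers every existing id,
        // which breaks reports keyed on the old values.
        List<String> cols2 = List.of("region", "product", "channel");
        System.out.println(groupingId(cols2, Set.of("region")));   // 3
    }
}
```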
[GitHub] [spark] aokolnychyi commented on a diff in pull request #38004: [SPARK-40551][SQL] DataSource V2: Add APIs for delta-based row-level operations
aokolnychyi commented on code in PR #38004: URL: https://github.com/apache/spark/pull/38004#discussion_r980510778 ## sql/catalyst/src/main/java/org/apache/spark/sql/connector/write/DeltaWrite.java: ## @@ -0,0 +1,33 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.sql.connector.write; + +import org.apache.spark.annotation.Experimental; + +/** + * A logical representation of a data source write that handles a delta of rows. + * + * @since 3.4.0 + */ +@Experimental +public interface DeltaWrite extends Write { + @Override + default DeltaBatchWrite toBatch() { Review Comment: I added a default implementation to match the parent interface. In the future, we may also override `toStreaming`.
[GitHub] [spark] aokolnychyi commented on a diff in pull request #38004: [SPARK-40551][SQL] DataSource V2: Add APIs for delta-based row-level operations
aokolnychyi commented on code in PR #38004: URL: https://github.com/apache/spark/pull/38004#discussion_r980511079 ## sql/catalyst/src/main/java/org/apache/spark/sql/connector/write/DeltaWrite.java: ## @@ -0,0 +1,33 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.sql.connector.write; + +import org.apache.spark.annotation.Experimental; + +/** + * A logical representation of a data source write that handles a delta of rows. + * + * @since 3.4.0 + */ +@Experimental +public interface DeltaWrite extends Write { + @Override + default DeltaBatchWrite toBatch() { Review Comment: It is also required to avoid inheriting the base implementation.
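The covariant-default pattern being discussed can be shown with a compilable sketch. The interface names mirror the PR, but the bodies below are hypothetical stand-ins, not Spark's actual code: the override narrows the return type to `DeltaBatchWrite` and replaces the base default, so implementations inherit delta semantics without defining `toBatch()` themselves.

```java
// Minimal model of the interfaces discussed in the PR; method bodies are
// illustrative only and not Spark's implementation.
interface BatchWrite {}

interface Write {
    default BatchWrite toBatch() {
        throw new UnsupportedOperationException("batch writes not supported");
    }
}

interface DeltaBatchWrite extends BatchWrite {}

interface DeltaWrite extends Write {
    // Covariant override: narrows the return type and replaces the base
    // default, so implementors are not forced to define it themselves.
    @Override
    default DeltaBatchWrite toBatch() {
        return new DeltaBatchWrite() {};
    }
}

class CovariantDefaultDemo {
    public static void main(String[] args) {
        DeltaWrite write = new DeltaWrite() {};
        // The narrowed default is inherited automatically.
        BatchWrite batch = write.toBatch();
        System.out.println(batch instanceof DeltaBatchWrite);
    }
}
```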
[GitHub] [spark] aokolnychyi opened a new pull request, #38005: [SPARK-40550][SQL] Handle DELETE commands for delta-based sources
aokolnychyi opened a new pull request, #38005: URL: https://github.com/apache/spark/pull/38005 ### What changes were proposed in this pull request? This WIP PR shows how the API added in PR #38004 can be implemented. ### Why are the changes needed? These changes are needed as per SPIP SPARK-35801. ### Does this PR introduce _any_ user-facing change? Yes, this PR adds new DS v2 APIs. ### How was this patch tested? This PR comes with tests.
[GitHub] [spark] github-actions[bot] closed pull request #36030: Draft: [SPARK-38715] Configurable client ID for Kafka Spark SQL producer
github-actions[bot] closed pull request #36030: Draft: [SPARK-38715] Configurable client ID for Kafka Spark SQL producer URL: https://github.com/apache/spark/pull/36030
[GitHub] [spark] github-actions[bot] closed pull request #36829: [SPARK-39438][SQL] Add a threshold to not in line CTE
github-actions[bot] closed pull request #36829: [SPARK-39438][SQL] Add a threshold to not in line CTE URL: https://github.com/apache/spark/pull/36829
[GitHub] [spark] github-actions[bot] closed pull request #36005: [SPARK-38506][SQL] Push partial aggregation through join
github-actions[bot] closed pull request #36005: [SPARK-38506][SQL] Push partial aggregation through join URL: https://github.com/apache/spark/pull/36005
[GitHub] [spark] github-actions[bot] closed pull request #36046: [SPARK-38771][SQL] Adaptive Bloom filter Join
github-actions[bot] closed pull request #36046: [SPARK-38771][SQL] Adaptive Bloom filter Join URL: https://github.com/apache/spark/pull/36046
[GitHub] [spark] github-actions[bot] closed pull request #35799: [SPARK-38498][STREAM] Support customized StreamingListener by configuration
github-actions[bot] closed pull request #35799: [SPARK-38498][STREAM] Support customized StreamingListener by configuration URL: https://github.com/apache/spark/pull/35799
[GitHub] [spark] github-actions[bot] closed pull request #35858: [SPARK-38448] [YARN] [CORE] Sending Available Resources in Yarn Cluster Information to Spark Driver
github-actions[bot] closed pull request #35858: [SPARK-38448] [YARN] [CORE] Sending Available Resources in Yarn Cluster Information to Spark Driver URL: https://github.com/apache/spark/pull/35858
[GitHub] [spark] github-actions[bot] closed pull request #35806: [SPARK-38505][SQL] Make partial aggregation adaptive
github-actions[bot] closed pull request #35806: [SPARK-38505][SQL] Make partial aggregation adaptive URL: https://github.com/apache/spark/pull/35806
[GitHub] [spark] github-actions[bot] closed pull request #35763: [SPARK-38433][BUILD] change the shell code style with shellcheck
github-actions[bot] closed pull request #35763: [SPARK-38433][BUILD] change the shell code style with shellcheck URL: https://github.com/apache/spark/pull/35763
[GitHub] [spark] github-actions[bot] commented on pull request #35751: [SPARK-38433][BUILD] Add shell code style check Actions
github-actions[bot] commented on PR #35751: URL: https://github.com/apache/spark/pull/35751#issuecomment-1258818104 We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable. If you'd like to revive this PR, please reopen it and ask a committer to remove the Stale tag!
[GitHub] [spark] github-actions[bot] closed pull request #35845: [SPARK-38520][SQL] ANSI interval overflow when reading CSV
github-actions[bot] closed pull request #35845: [SPARK-38520][SQL] ANSI interval overflow when reading CSV URL: https://github.com/apache/spark/pull/35845
[GitHub] [spark] github-actions[bot] commented on pull request #35569: [SPARK-38250][CORE] Check existence before deleting stagingDir in HadoopMapReduceCommitProtocol
github-actions[bot] commented on PR #35569: URL: https://github.com/apache/spark/pull/35569#issuecomment-1258818209 We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable. If you'd like to revive this PR, please reopen it and ask a committer to remove the Stale tag!
[GitHub] [spark] github-actions[bot] closed pull request #35808: [WIP][SPARK-38512] Rebased traversal order from "pre-order" to "post-order" for `ResolveFunctions` Rule
github-actions[bot] closed pull request #35808: [WIP][SPARK-38512] Rebased traversal order from "pre-order" to "post-order" for `ResolveFunctions` Rule URL: https://github.com/apache/spark/pull/35808
[GitHub] [spark] cloud-fan commented on pull request #37679: [SPARK-35242][SQL] Support changing session catalog's default database
cloud-fan commented on PR #37679: URL: https://github.com/apache/spark/pull/37679#issuecomment-1258818080 thanks, merging to master!
[GitHub] [spark] github-actions[bot] commented on pull request #36889: [SPARK-21195][CORE] Dynamically register metrics from sources as they are reported
github-actions[bot] commented on PR #36889: URL: https://github.com/apache/spark/pull/36889#issuecomment-1258818008 We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable. If you'd like to revive this PR, please reopen it and ask a committer to remove the Stale tag!
[GitHub] [spark] sadikovi commented on a diff in pull request #35764: [SPARK-38444][SQL]Automatically calculate the upper and lower bounds of partitions when no specified partition related params
sadikovi commented on code in PR #35764: URL: https://github.com/apache/spark/pull/35764#discussion_r980780872 ## sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JDBCOptions.scala: ## @@ -111,6 +111,9 @@ class JDBCOptions( // the number of partitions val numPartitions = parameters.get(JDBC_NUM_PARTITIONS).map(_.toInt) + // the default number of partitions + val defaultNumPartitions = parameters.getOrElse(DEFAULT_NUM_PARTITIONS, "10").toInt Review Comment: I replied below. I think if this is about the default number of partitions, we can just set the default on the `JDBC_NUM_PARTITIONS` option.
[GitHub] [spark] mskapilks commented on pull request #37996: [SPARK-40558][SQL] Add Reusable Exchange in Bloom creation side plan
mskapilks commented on PR #37996: URL: https://github.com/apache/spark/pull/37996#issuecomment-1259015446 > 2. We can use `shuffle records written` instead of `spark.sql.optimizer.runtime.bloomFilter.expectedNumItems` to build bloom filter. Good point. It would be better than the current stats-based row count we use to build the bloom filter.
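For context on why an accurate item-count estimate matters here: the standard Bloom filter sizing formulas tie the bit array size and hash count to the expected number of items and the target false-positive rate. The sketch below is textbook math, not Spark's implementation; underestimating the item count (e.g. stats-based row counts versus actual shuffle records written) undersizes the filter and inflates the real false-positive rate.

```java
class BloomSizing {
    // Textbook Bloom filter sizing:
    //   m = -n * ln(p) / (ln 2)^2  bits
    //   k = (m / n) * ln 2         hash functions
    static long bits(long expectedItems, double fpp) {
        return (long) Math.ceil(-expectedItems * Math.log(fpp) / (Math.log(2) * Math.log(2)));
    }

    static int hashes(long expectedItems, long bits) {
        return Math.max(1, (int) Math.round((double) bits / expectedItems * Math.log(2)));
    }

    public static void main(String[] args) {
        long n = 1_000_000L;
        long m = bits(n, 0.03);
        // Roughly 7.3 million bits and 5 hash functions for a 3% target rate.
        System.out.println(m + " bits, " + hashes(n, m) + " hashes");
    }
}
```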
[GitHub] [spark] sadikovi commented on a diff in pull request #35764: [SPARK-38444][SQL]Automatically calculate the upper and lower bounds of partitions when no specified partition related params
sadikovi commented on code in PR #35764: URL: https://github.com/apache/spark/pull/35764#discussion_r980779629 ## sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JDBCOptions.scala: ## @@ -111,6 +111,9 @@ class JDBCOptions( // the number of partitions val numPartitions = parameters.get(JDBC_NUM_PARTITIONS).map(_.toInt) + // the default number of partitions + val defaultNumPartitions = parameters.getOrElse(DEFAULT_NUM_PARTITIONS, "10").toInt Review Comment: I think the name of the config is misleading; this is essentially the default value of the `numPartitions` config: ```scala val numPartitions = parameters.get(JDBC_NUM_PARTITIONS).map(_.toInt).getOrElse(10) ```
[GitHub] [spark] mridulm commented on pull request #37779: [wip][SPARK-40320][Core] Executor should exit when it failed to initialize for fatal error
mridulm commented on PR #37779: URL: https://github.com/apache/spark/pull/37779#issuecomment-1259007734 Thanks for the query @Ngone51 - I missed one aspect of my analysis, which ends up completely changing the solution - my bad :-( The answer to your query has the reason for the lack of failure - this is due to the two types of APIs we are using ... For `DedicatedMessageLoop`, for the initial submission of `receiveLoopRunnable` we use the `ExecutorService.submit` API, while for other cases we use the `Executor.execute` API - and this is the cause of the behavior. The `submit` API returns a `Future` - and if we look at the implementation of `FutureTask.run`, we see that it catches `Throwable` and preserves it as the outcome (`setException`), which is why the thread itself does not die. So this specific case might be mitigated by using the `execute` API instead of `submit`.
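The submit-vs-execute difference described above can be reproduced with a small standalone program (a sketch, unrelated to Spark's `DedicatedMessageLoop` code): `submit` wraps the task in a `FutureTask` that captures any `Throwable`, while `execute` lets it reach the worker thread's uncaught-exception handler.

```java
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;

class SubmitVsExecute {
    public static void main(String[] args) throws Exception {
        ExecutorService pool = Executors.newSingleThreadExecutor();
        Runnable failing = () -> { throw new OutOfMemoryError("simulated fatal error"); };

        // submit() wraps the task in a FutureTask: FutureTask.run catches
        // Throwable and stores it via setException, so nothing escapes
        // unless someone calls get() on the returned Future.
        Future<?> future = pool.submit(failing);
        try {
            future.get();
        } catch (ExecutionException e) {
            System.out.println("submit captured: " + e.getCause().getClass().getSimpleName());
        }

        // execute() runs the Runnable directly: the same Throwable escapes
        // run() and reaches the worker thread's uncaught-exception handler
        // (stack trace on stderr), killing that thread.
        pool.execute(failing);

        pool.shutdown();
        pool.awaitTermination(5, TimeUnit.SECONDS);
    }
}
```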
[GitHub] [spark] sadikovi commented on a diff in pull request #35764: [SPARK-38444][SQL]Automatically calculate the upper and lower bounds of partitions when no specified partition related params
sadikovi commented on code in PR #35764: URL: https://github.com/apache/spark/pull/35764#discussion_r980766168 ## sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JDBCRelation.scala: ## @@ -168,6 +177,71 @@ private[sql] object JDBCRelation extends Logging { partitions } + /** + * get the min and max value by the column + * @param schema resolved schema of a JDBC table + * @param resolver function used to determine if two identifiers are equal + * @param timeZoneId timezone ID to be used if a partition column type is date or timestamp + * @param jdbcOptions JDBC options that contains url + * @param filters filters in Where clause + * @return JDBCPartitioningInfo + */ + def getPartitionBound( + schema: StructType, + resolver: Resolver, + timeZoneId: String, + jdbcOptions: JDBCOptions, + filters: Array[Filter] = Array.empty): JDBCPartitioningInfo = { +// columns in filters +val filterColumns = new util.ArrayList[String]() +filters.map(filter => filter.references.distinct.map(r => filterColumns.add(r))) +// primary keys used for partitioning +val prks = schema.fields.filter( + f => f.metadata.getBoolean("isIndexKey") && +!filterColumns.contains(f.name) && +(f.dataType.isInstanceOf[NumericType] || + f.dataType.isInstanceOf[DateType] || + f.dataType.isInstanceOf[TimestampType])) + +if (prks.length > 0) { + val prk = prks.head + val dataType = prk.dataType + var lBound: String = null + var uBound: String = null + val sql = s"select min(${prk.name}) as lBound, max(${prk.name}) as uBound " + Review Comment: Can you explain this logic in the javadoc for this method? Also, what happens if the table is empty? ## sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JDBCRelation.scala: ## @@ -168,6 +177,71 @@ private[sql] object JDBCRelation extends Logging { partitions } + /** + * get the min and max value by the column Review Comment: nit: Get the min and max values for the column. 
## sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JDBCRelation.scala: ## @@ -168,6 +177,71 @@ private[sql] object JDBCRelation extends Logging { partitions } + /** + * get the min and max value by the column + * @param schema resolved schema of a JDBC table + * @param resolver function used to determine if two identifiers are equal + * @param timeZoneId timezone ID to be used if a partition column type is date or timestamp + * @param jdbcOptions JDBC options that contains url + * @param filters filters in Where clause + * @return JDBCPartitioningInfo + */ + def getPartitionBound( + schema: StructType, + resolver: Resolver, + timeZoneId: String, + jdbcOptions: JDBCOptions, + filters: Array[Filter] = Array.empty): JDBCPartitioningInfo = { +// columns in filters +val filterColumns = new util.ArrayList[String]() +filters.map(filter => filter.references.distinct.map(r => filterColumns.add(r))) +// primary keys used for partitioning +val prks = schema.fields.filter( + f => f.metadata.getBoolean("isIndexKey") && +!filterColumns.contains(f.name) && +(f.dataType.isInstanceOf[NumericType] || + f.dataType.isInstanceOf[DateType] || + f.dataType.isInstanceOf[TimestampType])) + +if (prks.length > 0) { + val prk = prks.head + val dataType = prk.dataType + var lBound: String = null + var uBound: String = null + val sql = s"select min(${prk.name}) as lBound, max(${prk.name}) as uBound " + +s"from ${jdbcOptions.tableOrQuery} limit 1" + val conn = JdbcDialects.get(jdbcOptions.url).createConnectionFactory(jdbcOptions)(-1) + try { +val statement = conn.prepareStatement(sql) +try { + statement.setQueryTimeout(jdbcOptions.queryTimeout) + val resultSet = statement.executeQuery() + while (resultSet.next()) { +lBound = resultSet.getString("lBound") Review Comment: Would it work for primary keys that are integers or timestamps? 
## sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JDBCRelation.scala: ## @@ -168,6 +177,71 @@ private[sql] object JDBCRelation extends Logging { partitions } + /** + * get the min and max value by the column + * @param schema resolved schema of a JDBC table + * @param resolver function used to determine if two identifiers are equal + * @param timeZoneId timezone ID to be used if a partition column type is date or timestamp + * @param jdbcOptions JDBC options that contains url + * @param filters filters in Where clause + * @return JDBCPartitioningInfo + */ + def getPartitionBound( + schema: StructType, + resolver: Resolver, + timeZoneId: String, +
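On the empty-table question raised above: a `select min(col), max(col)` bounds query over an empty table yields NULL for both aggregates, so the partitioning logic needs an explicit empty-result branch. A small Java sketch of that handling (modeling the column as an in-memory list; names are invented, not the PR's code):

```java
import java.util.List;
import java.util.Optional;

class EmptyTableBounds {
    // Models the bounds query "select min(col), max(col) from table":
    // over an empty table both aggregates come back NULL/absent, so the
    // caller must not partition on them blindly.
    static Optional<long[]> bounds(List<Long> column) {
        Optional<Long> lo = column.stream().min(Long::compare);
        Optional<Long> hi = column.stream().max(Long::compare);
        if (lo.isEmpty() || hi.isEmpty()) {
            return Optional.empty();   // empty table: no usable bounds
        }
        return Optional.of(new long[] {lo.get(), hi.get()});
    }

    public static void main(String[] args) {
        System.out.println(bounds(List.of()).isPresent());   // false: fall back to a single partition
        long[] b = bounds(List.of(3L, 7L, 42L)).orElseThrow();
        System.out.println(b[0] + ".." + b[1]);
    }
}
```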
[GitHub] [spark] LuciferYang commented on pull request #37999: [SPARK-39146][CORE][SQL][K8S] Introduce `JacksonUtils` to use singleton Jackson ObjectMapper
LuciferYang commented on PR #37999: URL: https://github.com/apache/spark/pull/37999#issuecomment-1258914643 In the serial read/write scenario, the benefits are obvious:
- Read scenario: using a singleton is 1800+% faster than creating an `ObjectMapper` every time
- Write scenario: using a singleton is 500+% faster than creating an `ObjectMapper` every time
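The pattern being benchmarked (construct the expensive mapper once, reuse it everywhere) is not specific to Jackson. The actual `JacksonUtils` is Scala and not shown in this message; a minimal Python analogue of the same caching idea, with hypothetical helper names, looks like this:

```python
# Illustrative analogue of a singleton mapper: build one encoder instance
# and reuse it, instead of constructing a new one per call.
import json
from functools import lru_cache


@lru_cache(maxsize=1)
def get_encoder() -> json.JSONEncoder:
    # Constructed once; every caller gets the same cached instance.
    return json.JSONEncoder(sort_keys=True)


def to_json(obj) -> str:
    return get_encoder().encode(obj)
```

The speedups quoted above come from skipping the per-call construction cost; the trade-off is that the shared instance must be safe for concurrent use, which is why the cited PR highlights the serial read/write scenario.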
[GitHub] [spark] zhengruifeng opened a new pull request, #38009: [SPARK-40573][PS] Make `ddof` in `GroupBy.std`, `GroupBy.var` and `GroupBy.sem` accept arbitrary integers
zhengruifeng opened a new pull request, #38009: URL: https://github.com/apache/spark/pull/38009 ### What changes were proposed in this pull request? Make `ddof` in `GroupBy.std`, `GroupBy.var` and `GroupBy.sem` accept arbitrary integers ### Why are the changes needed? For API coverage. ### Does this PR introduce _any_ user-facing change? Yes, `ddof` values other than 0 and 1 are now accepted. ### How was this patch tested? Added test suites.
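What a generalized `ddof` (delta degrees of freedom) means can be sketched in plain Python. These are hypothetical helpers, not the pandas-on-Spark implementation; the divisor is `n - ddof`, so `ddof=0` gives the population variance and `ddof=1` the usual sample variance:

```python
# Sketch of variance / standard error with an arbitrary integer ddof.
import math


def group_var(values, ddof=1):
    """Variance with divisor n - ddof; nan when n - ddof <= 0."""
    n = len(values)
    if n - ddof <= 0:
        return float("nan")
    mean = sum(values) / n
    return sum((x - mean) ** 2 for x in values) / (n - ddof)


def group_std(values, ddof=1):
    return math.sqrt(group_var(values, ddof))


def group_sem(values, ddof=1):
    # Standard error of the mean: std / sqrt(n)
    return group_std(values, ddof) / math.sqrt(len(values))
```

For `[1, 2, 3, 4]` the squared deviations sum to 5, so `ddof=0` yields 5/4, `ddof=1` yields 5/3, and `ddof=2` yields 5/2, which is exactly the family of divisors the PR opens up.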
[GitHub] [spark] srowen commented on a diff in pull request #38010: [MINOR] Clarify that xxhash64 seed is 42
srowen commented on code in PR #38010: URL: https://github.com/apache/spark/pull/38010#discussion_r980690447

## sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/hash.scala:
## @@ -643,7 +643,8 @@ object Murmur3HashFunction extends InterpretedHashFunction {
  * A xxHash64 64-bit hash expression.
  */
 @ExpressionDescription(
-  usage = "_FUNC_(expr1, expr2, ...) - Returns a 64-bit hash value of the arguments.",
+  usage = "_FUNC_(expr1, expr2, ...) - Returns a 64-bit hash value of the arguments. " +

Review Comment:
   We could arguably document the same about hash() above, but it doesn't even state it's murmur3, and maybe that's on purpose, so it didn't seem necessary.
[GitHub] [spark] srowen opened a new pull request, #38010: [MINOR] Clarify that xxhash64 seed is 42
srowen opened a new pull request, #38010: URL: https://github.com/apache/spark/pull/38010 ### What changes were proposed in this pull request? State in the docs that the hash seed used for xxhash64 is 42. ### Why are the changes needed? It's somewhat non-standard not to seed with 0, and users would have to know this seed to reproduce the hash value. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? N/A
[GitHub] [spark] itholic opened a new pull request, #38012: [DO-NOT-MERGE][TEST] Pandas 1.5 Test
itholic opened a new pull request, #38012: URL: https://github.com/apache/spark/pull/38012 ### What changes were proposed in this pull request? ### Why are the changes needed? ### Does this PR introduce _any_ user-facing change? ### How was this patch tested?
[GitHub] [spark] HeartSaVioR commented on a diff in pull request #38008: [SPARK-40571][SS][TESTS] Construct a new test case for applyInPandasWithState to verify fault-tolerance semantic with random py
HeartSaVioR commented on code in PR #38008: URL: https://github.com/apache/spark/pull/38008#discussion_r980722405

## python/pyspark/sql/tests/test_pandas_grouped_map_with_state.py:
## @@ -46,8 +55,27 @@
     cast(str, pandas_requirement_message or pyarrow_requirement_message),
 )
 class GroupedMapInPandasWithStateTests(ReusedSQLTestCase):
+    @classmethod
+    def conf(cls):
+        cfg = SparkConf()
+        cfg.set("spark.sql.shuffle.partitions", "5")
+        return cfg
+
+    def __init__(self, methodName="runTest"):
+        super(GroupedMapInPandasWithStateTests, self).__init__(methodName)
+        self.base_path = "python/test_support/sql/streaming/apply_in_pandas_with_state"

Review Comment:
   Nice finding!
[GitHub] [spark] wangyum opened a new pull request, #38011: [SPARK-40574][DOCS] Enhance DROP TABLE documentation
wangyum opened a new pull request, #38011: URL: https://github.com/apache/spark/pull/38011 ### What changes were proposed in this pull request? This PR adds `PURGE` in `DROP TABLE` documentation. Related documentation and code: 1. Hive `DROP TABLE` documentation: https://cwiki.apache.org/confluence/display/hive/languagemanual+ddl (screenshot: https://user-images.githubusercontent.com/5399861/192425153-63ac5373-dd34-48b3-864c-324cf5ba5db9.png) 2. Hive code: https://github.com/apache/hive/blob/rel/release-2.3.9/ql/src/java/org/apache/hadoop/hive/ql/metadata/Hive.java#L1185-L1209 3. Spark code: https://github.com/apache/spark/blob/v3.3.0/sql/hive/src/main/scala/org/apache/spark/sql/hive/client/HiveShim.scala#L1317-L1327 ### Why are the changes needed? Enhance documentation. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Manual test.
[GitHub] [spark] HeartSaVioR commented on a diff in pull request #38008: [SPARK-40571][SS][TESTS] Construct a new test case for applyInPandasWithState to verify fault-tolerance semantic with random py
HeartSaVioR commented on code in PR #38008: URL: https://github.com/apache/spark/pull/38008#discussion_r980721102

## python/test_support/sql/streaming/apply_in_pandas_with_state/random_failure/input/test-0.txt:
## @@ -0,0 +1,100 @@
+non

Review Comment:
   I just changed both tests to create the dataset files before running a test; neither input files nor a golden file is needed.
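Generating the input dataset at test time, instead of checking a 100-line fixture file into the repository, can be sketched like this. The helper name, word length, and seed here are hypothetical, not taken from the PR; the key point is a fixed seed so every rerun produces identical "random" input:

```python
# Sketch: create the test dataset files in setUp rather than committing them.
import os
import random
import string


def write_dataset(base_path, num_rows=100, seed=42):
    """Write num_rows three-letter lowercase words, one per line."""
    # Deterministic RNG so reruns regenerate byte-identical input.
    rng = random.Random(seed)
    os.makedirs(base_path, exist_ok=True)
    path = os.path.join(base_path, "test-0.txt")
    with open(path, "w") as f:
        for _ in range(num_rows):
            word = "".join(rng.choice(string.ascii_lowercase) for _ in range(3))
            f.write(word + "\n")
    return path
```

With this approach the expected output can also be recomputed from the generated input, so no golden file needs to live in `python/test_support`.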