[GitHub] [spark] aokolnychyi opened a new pull request, #36304: [SPARK-38959][SQL] DS V2: Support runtime group filtering in row-level commands
aokolnychyi opened a new pull request, #36304: URL: https://github.com/apache/spark/pull/36304

### What changes were proposed in this pull request?
This PR adds runtime group filtering for group-based row-level operations.

### Why are the changes needed?
These changes are needed to avoid rewriting unnecessary groups, as the data skipping performed during job planning is limited and can still report false-positive groups to rewrite.

### Does this PR introduce _any_ user-facing change?
This PR extends `RowLevelOperation`, but the changes are backward compatible.

### How was this patch tested?
This PR comes with tests.

-- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[GitHub] [spark] cloud-fan commented on a diff in pull request #37994: [SPARK-40454][CONNECT] Initial DSL framework for protobuf testing
cloud-fan commented on code in PR #37994: URL: https://github.com/apache/spark/pull/37994#discussion_r980610466

connect/src/main/scala/org/apache/spark/sql/connect/package.scala:

```diff
@@ -0,0 +1,98 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.spark.sql
```

Review Comment: The DSL should not be a public API. Can we move it to `org.apache.spark.sql.catalyst`, which is a private package?
[GitHub] [spark] github-actions[bot] commented on pull request #35608: [SPARK-32838][SQL] Static partition overwrite could use staging dir insert
github-actions[bot] commented on PR #35608: URL: https://github.com/apache/spark/pull/35608#issuecomment-1258818178 We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable. If you'd like to revive this PR, please reopen it and ask a committer to remove the Stale tag!
[GitHub] [spark] cloud-fan closed pull request #37679: [SPARK-35242][SQL] Support changing session catalog's default database
cloud-fan closed pull request #37679: [SPARK-35242][SQL] Support changing session catalog's default database URL: https://github.com/apache/spark/pull/37679
[GitHub] [spark] github-actions[bot] commented on pull request #35734: [SPARK-32432][SQL] Add support for reading ORC/Parquet files of SymlinkTextInputFormat table And Fix Analyze for SymlinkTextInput
github-actions[bot] commented on PR #35734: URL: https://github.com/apache/spark/pull/35734#issuecomment-1258818146 We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable. If you'd like to revive this PR, please reopen it and ask a committer to remove the Stale tag!
[GitHub] [spark] github-actions[bot] commented on pull request #35638: [SPARK-38296][SQL] Support error class AnalysisExceptions in FunctionRegistry
github-actions[bot] commented on PR #35638: URL: https://github.com/apache/spark/pull/35638#issuecomment-1258818165 We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable. If you'd like to revive this PR, please reopen it and ask a committer to remove the Stale tag!
[GitHub] [spark] github-actions[bot] commented on pull request #35748: [SPARK-38431][SQL]Support to delete matched rows from jdbc tables
github-actions[bot] commented on PR #35748: URL: https://github.com/apache/spark/pull/35748#issuecomment-1258818118 We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable. If you'd like to revive this PR, please reopen it and ask a committer to remove the Stale tag!
[GitHub] [spark] github-actions[bot] commented on pull request #35744: [SPARK-37383][SQL][WEBUI]Show the parsing time for each phase of a SQL on spark ui
github-actions[bot] commented on PR #35744: URL: https://github.com/apache/spark/pull/35744#issuecomment-1258818132 We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable. If you'd like to revive this PR, please reopen it and ask a committer to remove the Stale tag!
[GitHub] [spark] github-actions[bot] commented on pull request #35594: [SPARK-38270][SQL] Spark SQL CLI's AM should keep same exit code with client side
github-actions[bot] commented on PR #35594: URL: https://github.com/apache/spark/pull/35594#issuecomment-1258818196 We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable. If you'd like to revive this PR, please reopen it and ask a committer to remove the Stale tag!
[GitHub] [spark] zhengruifeng commented on pull request #37998: [SPARK-40561][PS] Implement `min_count` in `GroupBy.min`
zhengruifeng commented on PR #37998: URL: https://github.com/apache/spark/pull/37998#issuecomment-1258863348 Merged into master, thanks @HyukjinKwon for review
[GitHub] [spark] HyukjinKwon commented on a diff in pull request #38008: [SPARK-40571][SS][TESTS] Construct a new test case for applyInPandasWithState to verify fault-tolerance semantic with random py
HyukjinKwon commented on code in PR #38008: URL: https://github.com/apache/spark/pull/38008#discussion_r98058

python/pyspark/sql/tests/test_pandas_grouped_map_with_state.py:

```diff
@@ -90,6 +107,99 @@
         def check_results(batch_df, _):
             self.assertTrue(q.isActive)
             q.processAllAvailable()

+    def test_apply_in_pandas_with_state_python_worker_random_failure(self):
+        output_path = tempfile.mkdtemp()
+        checkpoint_loc = tempfile.mkdtemp()
+        shutil.rmtree(output_path)
+        shutil.rmtree(checkpoint_loc)
+
+        def run_query():
+            df = self.spark.readStream.format("text") \
+                .option("maxFilesPerTrigger", "1") \
+                .load(self.base_path + "/random_failure/input")
+
+            for q in self.spark.streams.active:
+                q.stop()
+            self.assertTrue(df.isStreaming)
+
+            output_type = StructType(
+                [StructField("value", StringType()), StructField("count", LongType())]
+            )
+            state_type = StructType([StructField("cnt", LongType())])
+
+            def func(key, pdf_iter, state):
+                assert isinstance(state, GroupState)
+
+                # should be huge enough to not trigger kill in every batch
+                # but should also be reasonable to trigger kill multiple times across batches
+                if random.randrange(300) == 1:
+                    sys.exit(1)
+
+                count = state.getOption
+                if count is None:
+                    count = 0
+                else:
+                    count = count[0]
+
+                for pdf in pdf_iter:
+                    count += len(pdf)
+
+                state.update((count,))
+                yield pd.DataFrame({"value": [key[0]], "count": [count]})
+
+            q = (
+                df.groupBy(df["value"])
+                .applyInPandasWithState(
+                    func, output_type, state_type, "Append", GroupStateTimeout.NoTimeout
+                )
+                .writeStream.queryName("this_query")
+                .format("json")
+                .outputMode("append")
+                .option("path", output_path)
+                .option("checkpointLocation", checkpoint_loc)
+                .start()
+            )
+
+            return q
+
+        q = run_query()
+
+        self.assertEqual(q.name, "this_query")
+        self.assertTrue(q.isActive)
+
+        # expected_output directory is constructed from below query:
+        # spark.read.format("text").load("./input").groupBy("value").count() \
+        #     .repartition(1).sort("value").write.format("json").save("./output")
+        expected = \
+            self.spark.read.schema("value string, count int").format("json") \
+            .load(self.base_path + "/random_failure/expected_output") \
+            .sort("value").collect()
```

Review Comment: This style is recommended by PEP 8 IIRC

```suggestion
        expected = (
            self.spark.read.schema("value string, count int").format("json")
            .load(self.base_path + "/random_failure/expected_output")
            .sort("value").collect()
        )
```
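The parenthesized style suggested in the review avoids backslash continuations entirely; PEP 8 prefers implicit line continuation inside brackets. A small standalone illustration (plain strings, no Spark involved):

```python
text = "  Hello, World  "

# Backslash continuations (discouraged by PEP 8: fragile, break with trailing spaces):
cleaned_backslash = text.strip() \
    .lower() \
    .replace(",", "")

# Implicit continuation inside parentheses (preferred: whitespace-safe, easy to reorder):
cleaned_parens = (
    text.strip()
    .lower()
    .replace(",", "")
)

assert cleaned_backslash == cleaned_parens
print(cleaned_parens)  # hello world
```

Both forms compile to the same chained calls; the parenthesized version simply lets the parser join the lines.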
[GitHub] [spark] HyukjinKwon commented on a diff in pull request #38006: [SPARK-40536][CONNECT] Make Spark Connect port configurable
HyukjinKwon commented on code in PR #38006: URL: https://github.com/apache/spark/pull/38006#discussion_r980669582

core/src/main/scala/org/apache/spark/internal/config/Connect.scala:

```diff
@@ -0,0 +1,33 @@
+/* ... Apache License header ... */
+
+package org.apache.spark.internal.config
+
+private[spark] object Connect {
```

Review Comment: Yeah, it should better be placed within `connect` if possible. How do we use this plugin? I suspect the jar should be provided anyway (?)
[GitHub] [spark] aokolnychyi opened a new pull request, #38004: [SPARK-40551][SQL] DataSource V2: Add APIs for delta-based row-level operations
aokolnychyi opened a new pull request, #38004: URL: https://github.com/apache/spark/pull/38004

### What changes were proposed in this pull request?
This PR adds DS v2 APIs for handling row-level operations for data sources that support deltas of rows.

### Why are the changes needed?
These changes are part of the approved SPIP in SPARK-35801.

### Does this PR introduce _any_ user-facing change?
Yes, this PR adds new DS v2 APIs per [design doc](https://docs.google.com/document/d/12Ywmc47j3l2WF4anG5vL4qlrhT2OKigb7_EbIKhxg60).

### How was this patch tested?
Tests will be part of the implementation PR.
[GitHub] [spark] amaliujia opened a new pull request, #38006: [SPARK-40536] Make Spark Connect port configurable
amaliujia opened a new pull request, #38006: URL: https://github.com/apache/spark/pull/38006

### What changes were proposed in this pull request?
Add a `Connect` config object and two Spark Connect gRPC config keys:
1. `spark.connect.grpc.debug.enabled` (Boolean)
2. `spark.connect.grpc.binding.port` (Int)

### Why are the changes needed?
Currently:
1. the Spark Connect gRPC port is hardcoded;
2. the config key `spark.connect.grpc.debug.enabled` is not defined.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Existing UT
[GitHub] [spark] cloud-fan commented on a diff in pull request #37994: [SPARK-40454][CONNECT] Initial DSL framework for protobuf testing
cloud-fan commented on code in PR #37994: URL: https://github.com/apache/spark/pull/37994#discussion_r980615123

connect/src/main/scala/org/apache/spark/sql/connect/package.scala:

```diff
@@ -0,0 +1,98 @@
+/* ... Apache License header ... */
+package org.apache.spark.sql
```

Review Comment: BTW, I don't think we have to make it available in spark-shell. The catalyst DSL was added before we had the DataFrame API. At that time, it was the only way for end-users to build a query plan with a DataFrame-like API.
[GitHub] [spark] Kimahriman commented on a diff in pull request #38003: [SPARK-40565][SQL] Don't push non-deterministic filters to V2 file sources
Kimahriman commented on code in PR #38003: URL: https://github.com/apache/spark/pull/38003#discussion_r980631860

sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/PruneFileSourcePartitionsSuite.scala:

```diff
@@ -140,6 +140,24 @@ class PruneFileSourcePartitionsSuite extends PrunePartitionSuiteBase with Shared
     }
   }
 
+  test("SPARK-40565: don't push down non-deterministic filters for V2 file sources") {
```

Review Comment: So it looks like that's because of the way the "number of partitions" is calculated for V2 reads at https://github.com/apache/spark/blob/697574c2d5014626d2c960b222b41a8da7c702bf/sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/PruneFileSourcePartitionsSuite.scala#L168. It uses `BatchScanExec.partitions.size`, which is the number of RDD partitions, _not_ file partitions. I can see if I can fix that, but I don't see an easy way: unlike `FileSourceScanExec.selectedPartitions`, there is no intermediate variable to get this from a batch scan. We might just have to compute the unique partition values from all files to get the actual number of file partitions. If you change the 2 to 3 for the tests, it passes for V1.
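As a hypothetical illustration of the "compute the unique partition values from all files" idea (this helper is not part of Spark's API, and the paths are made up), one could derive the distinct Hive-style partition directories from the scanned file paths:

```python
import re

def count_file_partitions(file_paths):
    """Count distinct partition-value tuples among file paths.

    Illustrative only: extracts Hive-style `key=value` path components
    (e.g. .../p=1/part-000.parquet) and counts the unique tuples.
    """
    partitions = set()
    for path in file_paths:
        parts = tuple(p for p in path.split("/") if re.fullmatch(r"[^=/]+=[^=/]+", p))
        partitions.add(parts)
    return len(partitions)

files = [
    "/data/tbl/p=0/part-000.parquet",
    "/data/tbl/p=0/part-001.parquet",
    "/data/tbl/p=1/part-000.parquet",
    "/data/tbl/p=2/part-000.parquet",
]
print(count_file_partitions(files))  # 3
```

This counts file partitions independently of how the RDD partitions happen to be coalesced, which is the distinction the comment draws against `BatchScanExec.partitions.size`.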
[GitHub] [spark] attilapiros commented on a diff in pull request #37990: [WIP][SPARK-40458][K8S] Bump Kubernetes Client Version to 6.1.1
attilapiros commented on code in PR #37990: URL: https://github.com/apache/spark/pull/37990#discussion_r980633310

resource-managers/kubernetes/core/src/test/scala/org/apache/spark/scheduler/cluster/k8s/ExecutorPodsLifecycleManagerSuite.scala:

```diff
@@ -109,6 +116,12 @@ class ExecutorPodsLifecycleManagerSuite extends SparkFunSuite with BeforeAndAfte
     verify(schedulerBackend).doRemoveExecutor("1", expectedLossReason)
   }
 
+  test("test executor inactivation function") {
```

Review Comment: Ok.
[GitHub] [spark] attilapiros commented on pull request #37990: [WIP][SPARK-40458][K8S] Bump Kubernetes Client Version to 6.1.1
attilapiros commented on PR #37990: URL: https://github.com/apache/spark/pull/37990#issuecomment-1258851049

> Although 6.1.1 is intrusive, this patch looks solid. Is there any other reason why this is still WIP, @attilapiros ?

@dongjoon-hyun I just executed some manual tests regarding spark-submit/application management (and actually found a bug and corrected it). The test steps are in the description.

Started an app in a non-default namespace (`bla`):

```shell
./bin/spark-submit \
  --master k8s://http://127.0.0.1:8001 \
  --deploy-mode cluster \
  --name spark-pi \
  --class org.apache.spark.examples.SparkPi \
  --conf spark.executor.instances=5 \
  --conf spark.kubernetes.namespace=bla \
  --conf spark.kubernetes.container.image=docker.io/kubespark/spark:3.4.0-SNAPSHOT_2EA46AC6-0613-4BA1-BA76-130F5C0DAB15 \
  local:///opt/spark/examples/jars/spark-examples_2.12-3.4.0-SNAPSHOT.jar 20
```

Then checking the status by using the regexp:

```
```
[GitHub] [spark] viirya commented on a diff in pull request #38001: [SPARK-40562][SQL] Add `spark.sql.legacy.groupingIdWithAppendedUserGroupBy`
viirya commented on code in PR #38001: URL: https://github.com/apache/spark/pull/38001#discussion_r980647185

sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala:

```diff
@@ -3574,6 +3574,15 @@ object SQLConf {
       .booleanConf
       .createWithDefault(false)
 
+  val LEGACY_GROUPING_ID_WITH_APPENDED_USER_GROUPBY =
+    buildConf("spark.sql.legacy.groupingIdWithAppendedUserGroupBy")
+      .internal()
+      .doc("When true, grouping_id() returns values based on grouping set columns plus " +
+        "user-given group-by expressions order.")
```

Review Comment: Do we need to mention when the behavior of grouping_id changed?
[GitHub] [spark] HyukjinKwon closed pull request #37993: [SPARK-40557][CONNECT] Update generated proto files for Spark Connect
HyukjinKwon closed pull request #37993: [SPARK-40557][CONNECT] Update generated proto files for Spark Connect URL: https://github.com/apache/spark/pull/37993
[GitHub] [spark] HyukjinKwon commented on pull request #37993: [SPARK-40557][CONNECT] Update generated proto files for Spark Connect
HyukjinKwon commented on PR #37993: URL: https://github.com/apache/spark/pull/37993#issuecomment-1258864096 Merged to master.
[GitHub] [spark] HyukjinKwon commented on a diff in pull request #38006: [SPARK-40536][CONNECT] Make Spark Connect port configurable
HyukjinKwon commented on code in PR #38006: URL: https://github.com/apache/spark/pull/38006#discussion_r980668935

core/src/main/scala/org/apache/spark/internal/config/Connect.scala:

```diff
@@ -0,0 +1,33 @@
+/* ... Apache License header ... */
+
+package org.apache.spark.internal.config
+
+private[spark] object Connect {
+
+  private[spark] val CONNECT_GRPC_DEBUG_MODE =
+    ConfigBuilder("spark.connect.grpc.debug.enabled")
+      .version("3.4.0")
+      .booleanConf
+      .createOptional
```

Review Comment: Why is it optional? Seems like it defaults to `true`.
[GitHub] [spark] aokolnychyi commented on pull request #36304: [SPARK-38959][SQL] DS V2: Support runtime group filtering in row-level commands
aokolnychyi commented on PR #36304: URL: https://github.com/apache/spark/pull/36304#issuecomment-1258618345

I want to resume working on this PR, but I need feedback on one point.

In the original implementation, @cloud-fan and I discussed supporting a separate scan builder for runtime group filtering in row-level operations. That way, we can prune columns and push down filters while looking for groups that have matches. We can't do that in the main row-level scan for group-based data sources, as non-matching records in matching groups have to be copied over. See PR #35395 for context. The only challenge is ensuring the same version of the table is scanned in the main row-level scan and in the scan that searches for matching groups to rewrite. There are multiple solutions to consider.

**Option 1**

The first option is shown in this PR. We can add a new method to `RowLevelOperation` that provides a scan builder for runtime group filtering.

```java
interface RowLevelOperation {
  // existing method
  ScanBuilder newScanBuilder(CaseInsensitiveStringMap options);

  // new method
  default ScanBuilder newAffectedGroupsScanBuilder(CaseInsensitiveStringMap options) {
    return newScanBuilder(options);
  }
  ...
}
```

Under this implementation, it is up to data sources to ensure the same version is scanned in both scans. It is a fairly simple approach, but it complicates the row-level API. On top of that, the new method is useless for data sources that can handle a delta of rows.

**Option 2**

The main row-level `Scan` can report a scanned `tableVersion`, and we can use that information to load the correct table version in the rule that assigns a runtime filter. This can be done via `TableCatalog$load(ident, version)`. The only API change is to extend `Scan` with `tableVersion` so we know which table version is being read in the main scan.

**Option 3**

The rule that assigns a runtime group filter has access to the original `Table` object. We could just call `newScanBuilder` on it. However, I don't see anything in the API implying that reusing the `Table` instance guarantees the same version of the table will be scanned. If we call `newScanBuilder` on the same `Table` instance, do we expect the same version to be scanned? Seems like that is NOT the assumption right now. If we can somehow benefit from reusing the `Table` object, it would be the cleanest option from the API perspective.

Any ideas how to make Option 3 work?

cc @cloud-fan @rdblue @huaxingao @dongjoon-hyun @sunchao @viirya
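To make the version-pinning idea in Option 2 concrete, here is a deliberately simplified sketch in plain Python (the classes and method names are hypothetical toys, not Spark's actual DS v2 API): the main scan records the table version it reads, and the runtime-filter rule reloads exactly that version before building the group-filter scan, so a concurrent commit between the two scans cannot cause them to diverge.

```python
class Table:
    def __init__(self, version, rows):
        self.version = version
        self.rows = rows

    def new_scan(self):
        # Option 2: the scan reports which table version it reads
        return Scan(self.version, self.rows)

class Scan:
    def __init__(self, table_version, rows):
        self.table_version = table_version
        self.rows = rows

class Catalog:
    """Toy versioned catalog: load() without a version sees the latest snapshot."""
    def __init__(self):
        self.versions = {}
        self.latest = 0

    def commit(self, rows):
        self.latest += 1
        self.versions[self.latest] = rows

    def load(self, version=None):
        v = self.latest if version is None else version
        return Table(v, self.versions[v])

catalog = Catalog()
catalog.commit(["a", "b"])
main_scan = catalog.load().new_scan()   # row-level operation plans against version 1
catalog.commit(["a", "b", "c"])         # a concurrent commit creates version 2
# The runtime-filter rule pins the version the main scan reported:
filter_scan = catalog.load(version=main_scan.table_version).new_scan()
assert filter_scan.table_version == main_scan.table_version == 1
```

Without the explicit `version=` pin, the second `load()` would see version 2, which is exactly the inconsistency Option 3 worries about.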
[GitHub] [spark] mridulm commented on pull request #37779: [wip][SPARK-40320][Core] Executor should exit when it failed to initialize for fatal error
mridulm commented on PR #37779: URL: https://github.com/apache/spark/pull/37779#issuecomment-1258716473

Added a few debug statements, and it became clear what the issue is. Essentially, since we are leveraging a `ThreadPoolExecutor`, an exception/error thrown in a task does not kill the thread - rather, `ThreadPoolExecutor.afterExecute` is called with the cause of the failure (see `runWorker` for details). We should override `afterExecute` and invoke our `uncaughtExceptionHandler` when an exception is thrown.

In `receiveLoopRunnable`, when a `Throwable` is thrown:
```
22/09/26 17:17:12 INFO DedicatedMessageLoop: Current exceptionHandler = org.apache.spark.util.SparkUncaughtExceptionHandler@27c71f14
22/09/26 17:17:12 INFO DedicatedMessageLoop: Thread = Thread[dispatcher-Executor,5,main]
22/09/26 17:17:12 INFO DedicatedMessageLoop: Stack ...
java.lang.Exception: For stack
	at org.apache.spark.rpc.netty.MessageLoop$$anon$1.run(MessageLoop.scala:56)
	at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515)
	at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
	at java.base/java.lang.Thread.run(Thread.java:829)
```

In `receiveLoopRunnable`'s `run`, when a `Throwable` is thrown:
```
22/09/26 17:17:12 INFO DedicatedMessageLoop: Thread = Thread[dispatcher-Executor,5,main], stackTrace =
22/09/26 17:17:12 INFO DedicatedMessageLoop: java.base@11.0.16/java.lang.Thread.dumpThreads(Native Method)
22/09/26 17:17:12 INFO DedicatedMessageLoop: java.base@11.0.16/java.lang.Thread.getAllStackTraces(Thread.java:1653)
22/09/26 17:17:12 INFO DedicatedMessageLoop: app//org.apache.spark.rpc.netty.MessageLoop.org$apache$spark$rpc$netty$MessageLoop$$dumpAllStackTraces(MessageLoop.scala:70)
22/09/26 17:17:12 INFO DedicatedMessageLoop: app//org.apache.spark.rpc.netty.MessageLoop$$anon$1.run(MessageLoop.scala:58)
22/09/26 17:17:12 INFO DedicatedMessageLoop: java.base@11.0.16/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515)
22/09/26 17:17:12 INFO DedicatedMessageLoop: java.base@11.0.16/java.util.concurrent.FutureTask.run(FutureTask.java:264)
22/09/26 17:17:12 INFO DedicatedMessageLoop: java.base@11.0.16/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
22/09/26 17:17:12 INFO DedicatedMessageLoop: java.base@11.0.16/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
```

and finally, a few seconds after the Executor `Inbox` failure - dumping all threads in a new thread:
```
22/09/26 17:17:14 INFO DedicatedMessageLoop: Thread = Thread[dispatcher-Executor,5,main], stackTrace =
22/09/26 17:17:14 INFO DedicatedMessageLoop: java.base@11.0.16/jdk.internal.misc.Unsafe.park(Native Method)
22/09/26 17:17:14 INFO DedicatedMessageLoop: java.base@11.0.16/java.util.concurrent.locks.LockSupport.park(LockSupport.java:194)
22/09/26 17:17:14 INFO DedicatedMessageLoop: java.base@11.0.16/java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:2081)
22/09/26 17:17:14 INFO DedicatedMessageLoop: java.base@11.0.16/java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:433)
22/09/26 17:17:14 INFO DedicatedMessageLoop: app//org.apache.spark.rpc.netty.MessageLoop.org$apache$spark$rpc$netty$MessageLoop$$receiveLoop(MessageLoop.scala:102)
22/09/26 17:17:14 INFO DedicatedMessageLoop: app//org.apache.spark.rpc.netty.MessageLoop$$anon$1.run(MessageLoop.scala:45)
22/09/26 17:17:14 INFO DedicatedMessageLoop: java.base@11.0.16/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
22/09/26 17:17:14 INFO DedicatedMessageLoop: java.base@11.0.16/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
22/09/26 17:17:14 INFO DedicatedMessageLoop: java.base@11.0.16/java.lang.Thread.run(Thread.java:829)
```
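The fix described above - overriding `ThreadPoolExecutor.afterExecute` and forwarding task failures to an uncaught-exception handler - can be sketched as follows. This is a minimal illustration, not Spark's actual `MessageLoop` code; the class name `FailureReportingExecutor` is made up for the example. One subtlety visible in the stack traces above: tasks go through `Executors$RunnableAdapter` and `FutureTask`, i.e. they were submitted via `submit()`, so the throwable is captured inside the `FutureTask` and `afterExecute` receives `t == null`; it must be unwrapped from the `Future` (the pattern documented on `ThreadPoolExecutor.afterExecute`).

```java
import java.util.concurrent.*;
import java.util.concurrent.atomic.AtomicReference;

public class Main {
    // Sketch: a pool that reports task failures to an uncaught-exception
    // handler instead of letting ThreadPoolExecutor swallow them.
    static class FailureReportingExecutor extends ThreadPoolExecutor {
        private final Thread.UncaughtExceptionHandler handler;

        FailureReportingExecutor(Thread.UncaughtExceptionHandler handler) {
            super(1, 1, 0L, TimeUnit.MILLISECONDS, new LinkedBlockingQueue<Runnable>());
            this.handler = handler;
        }

        @Override
        protected void afterExecute(Runnable r, Throwable t) {
            super.afterExecute(r, t);
            // Tasks submitted via submit() are wrapped in a FutureTask, so the
            // throwable is stored there and t is null; unwrap it.
            if (t == null && r instanceof Future<?> && ((Future<?>) r).isDone()) {
                try {
                    ((Future<?>) r).get();
                } catch (ExecutionException e) {
                    t = e.getCause();
                } catch (InterruptedException | CancellationException e) {
                    t = e;
                }
            }
            if (t != null) {
                handler.uncaughtException(Thread.currentThread(), t);
            }
        }
    }

    public static void main(String[] args) throws Exception {
        AtomicReference<Throwable> seen = new AtomicReference<>();
        FailureReportingExecutor pool =
            new FailureReportingExecutor((thread, e) -> seen.set(e));
        // Simulate a fatal error thrown inside the message loop task.
        Runnable boom = () -> { throw new OutOfMemoryError("simulated fatal error"); };
        pool.submit(boom);
        pool.shutdown();
        pool.awaitTermination(10, TimeUnit.SECONDS);
        System.out.println(seen.get().getMessage());
    }
}
```

With the plain `ThreadPoolExecutor`, the error above would vanish into the unused `Future`; here it reaches the handler, which in Spark's case (`SparkUncaughtExceptionHandler`) would exit the executor on fatal errors.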
[GitHub] [spark] cloud-fan commented on a diff in pull request #37994: [SPARK-40454][CONNECT] Initial DSL framework for protobuf testing
cloud-fan commented on code in PR #37994: URL: https://github.com/apache/spark/pull/37994#discussion_r980614606 ## connect/src/main/scala/org/apache/spark/sql/connect/package.scala: ## @@ -0,0 +1,98 @@ [Apache License 2.0 header omitted] +package org.apache.spark.sql Review Comment: yup
[GitHub] [spark] cloud-fan commented on a diff in pull request #37994: [SPARK-40454][CONNECT] Initial DSL framework for protobuf testing
cloud-fan commented on code in PR #37994: URL: https://github.com/apache/spark/pull/37994#discussion_r980616391 ## connect/src/main/scala/org/apache/spark/sql/connect/package.scala: ## @@ -0,0 +1,98 @@ [Apache License 2.0 header omitted] +package org.apache.spark.sql + +import scala.collection.JavaConverters._ + +import org.apache.spark.connect.proto +import org.apache.spark.sql.catalyst.analysis.UnresolvedAlias +import org.apache.spark.sql.catalyst.expressions.{Expression, NamedExpression} + + +/** + * A collection of implicit conversions that create a DSL for constructing connect protos. + * + * {{{ + * scala> import org.apache.spark.sql.connect.plans.DslLogicalPlan + * + * // Standard way to construct connect proto + * scala> import org.apache.spark.connect.proto + * scala> :paste + * // Entering paste mode (ctrl-D to finish) + * val connectTestRelation = + *proto.Relation.newBuilder() + * .setRead( + * proto.Read.newBuilder() + * .setNamedTable(proto.Read.NamedTable.newBuilder().addParts("student").build()) + * .build()) + * .build() + * // Exiting paste mode, now interpreting. + *connectTestRelation: org.apache.spark.connect.proto.Relation = + * read { + * named_table { + *parts: "student" + * } + *} + * + * // Now we can apply select on the proto relation above + * scala> import org.apache.spark.sql.catalyst.analysis.UnresolvedAttribute + * scala> connectTestRelation.select(UnresolvedAttribute(Seq("id"))) + * res14: org.apache.spark.connect.proto.Relation = + *project { + * input { + *read { + * named_table { + *parts: "student" + * } + *} + * } + * expressions { + *unresolved_attribute { + * parts: "id" + *} + * } + *} + * + * }}} + * + */ +package object connect { + + object plans { // scalastyle:ignore +implicit class DslLogicalPlan(val logicalPlan: proto.Relation) { Review Comment: Does `proto.Relation` mean logical plan in the proto world?
[GitHub] [spark] zhengruifeng closed pull request #37998: [SPARK-40561][PS] Implement `min_count` in `GroupBy.min`
zhengruifeng closed pull request #37998: [SPARK-40561][PS] Implement `min_count` in `GroupBy.min` URL: https://github.com/apache/spark/pull/37998
[GitHub] [spark] HyukjinKwon commented on a diff in pull request #38006: [SPARK-40536][CONNECT] Make Spark Connect port configurable
HyukjinKwon commented on code in PR #38006: URL: https://github.com/apache/spark/pull/38006#discussion_r980668686 ## core/src/main/scala/org/apache/spark/internal/config/Connect.scala: ## @@ -0,0 +1,33 @@ [Apache License 2.0 header omitted] +package org.apache.spark.internal.config + +private[spark] object Connect { + + private[spark] val CONNECT_GRPC_DEBUG_MODE = +ConfigBuilder("spark.connect.grpc.debug.enabled") + .version("3.4.0") + .booleanConf Review Comment: can we add docs?
[GitHub] [spark] amaliujia commented on a diff in pull request #37994: [SPARK-40454][CONNECT] Initial DSL framework for protobuf testing
amaliujia commented on code in PR #37994: URL: https://github.com/apache/spark/pull/37994#discussion_r980480469 ## connect/src/main/scala/org/apache/spark/sql/connect/package.scala: ## @@ -0,0 +1,39 @@ [Apache License 2.0 header omitted] +package org.apache.spark.sql + +import org.apache.spark.connect.proto + Review Comment: Ah, I found that we need to add connect to the deps of the REPL module. I made the change and then also updated the documentation.
[GitHub] [spark] aokolnychyi commented on a diff in pull request #38004: [SPARK-40551][SQL] DataSource V2: Add APIs for delta-based row-level operations
aokolnychyi commented on code in PR #38004: URL: https://github.com/apache/spark/pull/38004#discussion_r980509911 ## sql/catalyst/src/main/java/org/apache/spark/sql/connector/write/DeltaBatchWrite.java: ## @@ -0,0 +1,31 @@ [Apache License 2.0 header omitted] +package org.apache.spark.sql.connector.write; + +import org.apache.spark.annotation.Experimental; + +/** + * An interface that defines how to write a delta of rows during batch processing. + * + * @since 3.4.0 + */ +@Experimental +public interface DeltaBatchWrite extends BatchWrite { + @Override Review Comment: I did not add any doc for inherited methods as it would mostly overlap with the parent doc. I could add a few sentences and reference the parent doc, though.
[GitHub] [spark] amaliujia commented on a diff in pull request #37994: [SPARK-40454][CONNECT] Initial DSL framework for protobuf testing
amaliujia commented on code in PR #37994: URL: https://github.com/apache/spark/pull/37994#discussion_r980613106 ## connect/src/main/scala/org/apache/spark/sql/connect/package.scala: ## @@ -0,0 +1,98 @@ [Apache License 2.0 header omitted] +package org.apache.spark.sql Review Comment: hmmm, won't that be a dependency issue? connect depends on sql, but this DSL needs to depend on the connect proto?
[GitHub] [spark] amaliujia commented on a diff in pull request #37994: [SPARK-40454][CONNECT] Initial DSL framework for protobuf testing
amaliujia commented on code in PR #37994: URL: https://github.com/apache/spark/pull/37994#discussion_r980613293 ## connect/src/main/scala/org/apache/spark/sql/connect/package.scala: ## @@ -0,0 +1,98 @@ [Apache License 2.0 header omitted] +package org.apache.spark.sql Review Comment: Are you actually saying move to `org.apache.spark.sql.catalyst` in the connect module?
[GitHub] [spark] attilapiros commented on a diff in pull request #37990: [WIP][SPARK-40458][K8S] Bump Kubernetes Client Version to 6.1.1
attilapiros commented on code in PR #37990: URL: https://github.com/apache/spark/pull/37990#discussion_r980632958 ## resource-managers/kubernetes/core/src/main/scala/org/apache/spark/scheduler/cluster/k8s/KubernetesClusterSchedulerBackend.scala: ## @@ -85,7 +85,7 @@ private[spark] class KubernetesClusterSchedulerBackend( Map(SPARK_APP_ID_LABEL -> applicationId(), SPARK_ROLE_LABEL -> SPARK_POD_EXECUTOR_ROLE) val configMap = KubernetesClientUtils.buildConfigMap(configMapName, confFilesMap, labels) KubernetesUtils.addOwnerReference(driverPod.orNull, Seq(configMap)) -kubernetesClient.configMaps().create(configMap) +kubernetesClient.configMaps().inAnyNamespace().resource(configMap).create() Review Comment: Double checked (screenshot): https://user-images.githubusercontent.com/2017933/192407277-d476449e-1112-40fb-916e-81393855988c.png
[GitHub] [spark] HeartSaVioR opened a new pull request, #38008: [SPARK-40571][SS][TESTS] Construct a new test case for applyInPandasWithState to verify fault-tolerance semantic with random python worker failure
HeartSaVioR opened a new pull request, #38008: URL: https://github.com/apache/spark/pull/38008 ### What changes were proposed in this pull request? This PR proposes a new test case for applyInPandasWithState to verify that the fault-tolerance semantic is not broken despite random Python worker failures. If the sink provides end-to-end exactly-once, the query should respect that guarantee. Otherwise, the query should still respect stateful exactly-once, but only at-least-once in terms of outputs. The test leverages the file stream sink, which is end-to-end exactly-once, but to keep the verification simple, we just verify whether stateful exactly-once is guaranteed despite Python worker failures. ### Why are the changes needed? This strengthens the test coverage, especially for the fault-tolerance semantic. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? New test added. Manually ran `./python/run-tests --testnames 'pyspark.sql.tests.test_pandas_grouped_map_with_state'` 10 times and all succeeded.
[GitHub] [spark] HeartSaVioR commented on pull request #37993: [SPARK-40557][CONNECT] Update generated proto files for Spark Connect
HeartSaVioR commented on PR #37993: URL: https://github.com/apache/spark/pull/37993#issuecomment-1258866543 post +1, thanks for updating this!
[GitHub] [spark] zhengruifeng commented on pull request #37967: Scalable SkipGram-Word2Vec implementation
zhengruifeng commented on PR #37967: URL: https://github.com/apache/spark/pull/37967#issuecomment-1258873185 So this is a totally new implementation of `SkipGram` W2V in `.mllib`. Is it possible to improve the existing w2v instead of implementing a new one? What about implementing it on the `.ml` side, since `.mllib` is in maintenance mode? I think we should wait for @srowen @WeichenXu123 @huaxingao @mengxr 's opinions since this is a big change.
[GitHub] [spark] HyukjinKwon commented on a diff in pull request #38008: [SPARK-40571][SS][TESTS] Construct a new test case for applyInPandasWithState to verify fault-tolerance semantic with random python worker failure
HyukjinKwon commented on code in PR #38008: URL: https://github.com/apache/spark/pull/38008#discussion_r980667435 ## python/pyspark/sql/tests/test_pandas_grouped_map_with_state.py: ## @@ -90,6 +107,99 @@ def check_results(batch_df, _): self.assertTrue(q.isActive) q.processAllAvailable() +def test_apply_in_pandas_with_state_python_worker_random_failure(self): +output_path = tempfile.mkdtemp() +checkpoint_loc = tempfile.mkdtemp() +shutil.rmtree(output_path) +shutil.rmtree(checkpoint_loc) + +def run_query(): +df = self.spark.readStream.format("text") \ +.option("maxFilesPerTrigger", "1") \ +.load(self.base_path + "/random_failure/input") + +for q in self.spark.streams.active: +q.stop() +self.assertTrue(df.isStreaming) + +output_type = StructType( +[StructField("value", StringType()), StructField("count", LongType())] +) +state_type = StructType([StructField("cnt", LongType())]) + +def func(key, pdf_iter, state): +assert isinstance(state, GroupState) + +# should be huge enough to not trigger kill in every batches +# but should be also reasonable to trigger kill multiple times across batches +if random.randrange(300) == 1: +sys.exit(1) + +count = state.getOption +if count is None: +count = 0 +else: +count = count[0] + +for pdf in pdf_iter: +count += len(pdf) + +state.update((count,)) +yield pd.DataFrame({"value": [key[0]], "count": [count]}) + +q = ( +df.groupBy(df["value"]) +.applyInPandasWithState( +func, output_type, state_type, "Append", GroupStateTimeout.NoTimeout +) +.writeStream.queryName("this_query") +.format("json") +.outputMode("append") +.option("path", output_path) +.option("checkpointLocation", checkpoint_loc) +.start() +) + +return q + +q = run_query() + +self.assertEqual(q.name, "this_query") +self.assertTrue(q.isActive) + +# expected_output directory is constucted from below query: +# spark.read.format("text").load("./input").groupBy("value").count() \ +# .repartition(1).sort("value").write.format("json").save("./output") +expected = self.spark.read.schema("value string, count int").format("json") \ +.load(self.base_path + "/random_failure/expected_output") \ +.sort("value").collect() + +curr_time = time.time() +timeout = curr_time + 120 # 2 minutes Review Comment: There's `eventually` available at `pyspark.testing.utils`. Can we leverage this?
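For readers unfamiliar with the helper the reviewer points to: `pyspark.testing.utils.eventually` retries a condition until it succeeds or a timeout elapses, which replaces hand-rolled `curr_time`/`timeout` polling loops like the one in the diff. A minimal self-contained sketch of the pattern follows; the real helper's exact signature and behavior should be checked against the PySpark source.

```python
import time


def eventually(condition, timeout=30.0, catch_assertions=False):
    """Retry `condition` until it returns a truthy value or `timeout` elapses.

    Modeled on pyspark.testing.utils.eventually (signature assumed here):
    with catch_assertions=True, AssertionErrors raised by `condition` are
    retried instead of propagated immediately.
    """
    deadline = time.time() + timeout
    last_error = None
    while time.time() < deadline:
        try:
            result = condition()
            if result:
                return result
        except AssertionError as e:
            if not catch_assertions:
                raise
            last_error = e
        time.sleep(0.05)  # back off briefly between attempts
    raise AssertionError(
        "condition not met within %ss (last error: %s)" % (timeout, last_error)
    )


# In the test above, the manual loop could then become something like:
#   eventually(lambda: load_sorted_output() == expected, timeout=120.0)
# where load_sorted_output is a hypothetical helper reading the sink output.
```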
[GitHub] [spark] sigmod commented on pull request #37996: [SPARK-40558][SQL] Add Reusable Exchange in Bloom creation side plan
sigmod commented on PR #37996: URL: https://github.com/apache/spark/pull/37996#issuecomment-1258522623 cc @andylam-db @maryannxue
[GitHub] [spark] dongjoon-hyun commented on pull request #38001: [SPARK-40562][SQL] Add `spark.sql.legacy.groupingIdWithAppendedUserGroupBy`
dongjoon-hyun commented on PR #38001: URL: https://github.com/apache/spark/pull/38001#issuecomment-1258524084 Thank you for your feedback, @thiyaga.
[GitHub] [spark] AmplabJenkins commented on pull request #37994: [SPARK-40454][CONNECT] Initial DSL framework for protobuf testing
AmplabJenkins commented on PR #37994: URL: https://github.com/apache/spark/pull/37994#issuecomment-1258554960 Can one of the admins verify this patch?
[GitHub] [spark] AmplabJenkins commented on pull request #37993: [SPARK-40557] [CONNECT] [Cleanup] Update generated proto files for Spark Connect
AmplabJenkins commented on PR #37993: URL: https://github.com/apache/spark/pull/37993#issuecomment-1258555111 Can one of the admins verify this patch?
[GitHub] [spark] aokolnychyi commented on pull request #38004: [SPARK-40551][SQL] DataSource V2: Add APIs for delta-based row-level operations
aokolnychyi commented on PR #38004: URL: https://github.com/apache/spark/pull/38004#issuecomment-1258647018 @cloud-fan @rdblue @huaxingao @dongjoon-hyun @sunchao @viirya, could you take a look? This is the API from the design doc we discussed earlier. I have also created PR #38005 that shows how this API will be consumed.
[GitHub] [spark] bersprockets commented on pull request #37825: [SPARK-40382][SQL] Group distinct aggregate expressions by semantically equivalent children in `RewriteDistinctAggregates`
bersprockets commented on PR #37825: URL: https://github.com/apache/spark/pull/37825#issuecomment-1258849538 @beliefer > Please reference `SimplifyBinaryComparison`. Thanks, I will take a look. This is in reference to the fall-through case, where we discover there is really only a single distinct group, correct?
[GitHub] [spark] huleilei opened a new pull request, #38007: [SPARK-40566][SQL]Add showIndex function
huleilei opened a new pull request, #38007: URL: https://github.com/apache/spark/pull/38007 ### What changes were proposed in this pull request? I created an index for a table and want to know what indexes are on the table, but the SHOW INDEX syntax is not supported. So I think SHOW INDEX syntax should be added to the Spark code. Please let me know if you have any thoughts. ### Why are the changes needed? ### Does this PR introduce _any_ user-facing change? ### How was this patch tested?
[GitHub] [spark] attilapiros commented on pull request #37990: [SPARK-40458][K8S] Bump Kubernetes Client Version to 6.1.1
attilapiros commented on PR #37990: URL: https://github.com/apache/spark/pull/37990#issuecomment-1258859382 I would like to go through one more time to find all the places where we can specify the namespace.
[GitHub] [spark] aokolnychyi commented on a diff in pull request #38004: [SPARK-40551][SQL] DataSource V2: Add APIs for delta-based row-level operations
aokolnychyi commented on code in PR #38004: URL: https://github.com/apache/spark/pull/38004#discussion_r980511958 ## sql/catalyst/src/main/java/org/apache/spark/sql/connector/write/DeltaWriteBuilder.java: ## @@ -0,0 +1,33 @@ [Apache License 2.0 header omitted] +package org.apache.spark.sql.connector.write; + +import org.apache.spark.annotation.Experimental; + +/** + * An interface for building a {@link DeltaWrite}. + * + * @since 3.4.0 + */ +@Experimental +public interface DeltaWriteBuilder extends WriteBuilder { + @Override + default DeltaWrite build() { +throw new UnsupportedOperationException(getClass().getName() + " does not implement build"); Review Comment: I had to override to avoid inheriting the default implementation from the parent interface.
[GitHub] [spark] grundprinzip commented on a diff in pull request #38006: [SPARK-40536][CONNECT] Make Spark Connect port configurable
grundprinzip commented on code in PR #38006: URL: https://github.com/apache/spark/pull/38006#discussion_r980606668 ## core/src/main/scala/org/apache/spark/internal/config/Connect.scala: ## @@ -0,0 +1,33 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.internal.config + +private[spark] object Connect { Review Comment: As Spark Connect is built as a plugin, ideally none of its configuration should be located in core. Is there a way we can move this to the connect module?
[GitHub] [spark] amaliujia commented on a diff in pull request #38006: [SPARK-40536][CONNECT] Make Spark Connect port configurable
amaliujia commented on code in PR #38006: URL: https://github.com/apache/spark/pull/38006#discussion_r980607213 ## core/src/main/scala/org/apache/spark/internal/config/Connect.scala: ## @@ -0,0 +1,33 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.internal.config + +private[spark] object Connect { Review Comment: I agree. Let's see other reviewers' suggestions on this.
[GitHub] [spark] HyukjinKwon commented on a diff in pull request #38008: [SPARK-40571][SS][TESTS] Construct a new test case for applyInPandasWithState to verify fault-tolerance semantic with random py
HyukjinKwon commented on code in PR #38008: URL: https://github.com/apache/spark/pull/38008#discussion_r980666406 ## python/test_support/sql/streaming/apply_in_pandas_with_state/random_failure/input/test-0.txt: ## @@ -0,0 +1,100 @@ +non Review Comment: Can we avoid adding these files? I try to avoid this in general to make the test case self-contained and more readable.
[GitHub] [spark] amaliujia commented on a diff in pull request #37994: [SPARK-40454][CONNECT] Initial DSL framework for protobuf testing
amaliujia commented on code in PR #37994: URL: https://github.com/apache/spark/pull/37994#discussion_r980404430 ## connect/src/main/scala/org/apache/spark/sql/connect/package.scala: ## @@ -0,0 +1,39 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ +package org.apache.spark.sql + +import org.apache.spark.connect.proto + Review Comment: I double-checked in case I didn't include the right DSL classes in the shell, but importing a different connect class fails with the same error: ``` scala> import org.apache.spark.sql.connect.service.SparkConnectService :25: error: object connect is not a member of package org.apache.spark.sql import org.apache.spark.sql.connect.service.SparkConnectService ```
[GitHub] [spark] AmplabJenkins commented on pull request #37996: [SPARK-40558][SQL] Add Reusable Exchange in Bloom creation side plan
AmplabJenkins commented on PR #37996: URL: https://github.com/apache/spark/pull/37996#issuecomment-1258485849 Can one of the admins verify this patch?
[GitHub] [spark] aokolnychyi commented on a diff in pull request #38004: [SPARK-40551][SQL] DataSource V2: Add APIs for delta-based row-level operations
aokolnychyi commented on code in PR #38004: URL: https://github.com/apache/spark/pull/38004#discussion_r980508846 ## sql/catalyst/src/main/java/org/apache/spark/sql/connector/write/LogicalWriteInfo.java: ## @@ -45,4 +45,14 @@ public interface LogicalWriteInfo { * the schema of the input data from Spark to data source. */ StructType schema(); + + /** + * the schema of the input metadata from Spark to data source. + */ + StructType metadataSchema(); Review Comment: I could add default implementations and return null but I think instances of this interface are only created by Spark.
[GitHub] [spark] cloud-fan commented on a diff in pull request #37994: [SPARK-40454][CONNECT] Initial DSL framework for protobuf testing
cloud-fan commented on code in PR #37994: URL: https://github.com/apache/spark/pull/37994#discussion_r980615123 ## connect/src/main/scala/org/apache/spark/sql/connect/package.scala: ## @@ -0,0 +1,98 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ +package org.apache.spark.sql Review Comment: BTW, I don't think we have to make it available in spark-shell. The catalyst dsl was added before we had the DataFrame API. At that time, it was the only way for end-users to build a query plan with a dataframe-like API. We don't have such a need for the spark connect dsl.
[GitHub] [spark] cloud-fan commented on a diff in pull request #37994: [SPARK-40454][CONNECT] Initial DSL framework for protobuf testing
cloud-fan commented on code in PR #37994: URL: https://github.com/apache/spark/pull/37994#discussion_r980615483 ## connect/src/main/scala/org/apache/spark/sql/connect/package.scala: ## @@ -0,0 +1,98 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ +package org.apache.spark.sql + +import scala.collection.JavaConverters._ + +import org.apache.spark.connect.proto +import org.apache.spark.sql.catalyst.analysis.UnresolvedAlias +import org.apache.spark.sql.catalyst.expressions.{Expression, NamedExpression} + + +/** + * A collection of implicit conversions that create a DSL for constructing connect protos. + * + * {{{ + * scala> import org.apache.spark.sql.connect.plans.DslLogicalPlan + * + * // Standard way to construct connect proto + * scala> import org.apache.spark.connect.proto + * scala> :paste + * // Entering paste mode (ctrl-D to finish) + * val connectTestRelation = + *proto.Relation.newBuilder() + * .setRead( + * proto.Read.newBuilder() + * .setNamedTable(proto.Read.NamedTable.newBuilder().addParts("student").build()) + * .build()) + * .build() + * // Exiting paste mode, now interpreting. 
+ *connectTestRelation: org.apache.spark.connect.proto.Relation = + * read { + * named_table { + *parts: "student" + * } + *} + * + * // Now we can apply select on the proto relation above + * scala> import org.apache.spark.sql.catalyst.analysis.UnresolvedAttribute + * scala> connectTestRelation.select(UnresolvedAttribute(Seq("id"))) + * res14: org.apache.spark.connect.proto.Relation = + *project { + * input { + *read { + * named_table { + *parts: "student" + * } + *} + * } + * expressions { + *unresolved_attribute { + * parts: "id" + *} + * } + *} + * + * }}} + * + */ +package object connect { + + object plans { // scalastyle:ignore +implicit class DslLogicalPlan(val logicalPlan: proto.Relation) { + def select(exprs: Expression*): proto.Relation = { Review Comment: should this be `proto.Expression`? Will we ever add `proto.Expression`?
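The DSL in this thread wraps an immutable protobuf relation in a class that adds fluent operators, each returning a new relation. The same shape can be sketched in plain Java, where the Scala implicit conversion becomes an explicit `wrap()` call. `Relation`, `Dsl`, and the string-based plan encoding below are hypothetical stand-ins for the generated protobuf classes, not the real Spark Connect API:

```java
// Sketch of the implicit-class DSL idea: wrap an immutable relation and
// expose fluent operators that build new relations. All names are
// illustrative stand-ins for org.apache.spark.connect.proto classes.
public class RelationDsl {
    static final class Relation {
        final String describe;
        Relation(String describe) { this.describe = describe; }
    }

    // Analogue of proto.Read.NamedTable: a leaf relation reading a table.
    static Relation namedTable(String name) {
        return new Relation("read(named_table: " + name + ")");
    }

    // The "implicit class": holds the wrapped relation and adds DSL methods.
    static final class Dsl {
        private final Relation plan;
        Dsl(Relation plan) { this.plan = plan; }

        // Analogue of the DSL's select(): wraps the input in a projection.
        Dsl select(String... columns) {
            return new Dsl(new Relation(
                "project(" + String.join(", ", columns)
                    + ", input: " + plan.describe + ")"));
        }

        Relation plan() { return plan; }
    }

    // In Scala this conversion is implicit; in Java it is an explicit call.
    static Dsl wrap(Relation r) { return new Dsl(r); }

    public static void main(String[] args) {
        Relation plan = wrap(namedTable("student")).select("id").plan();
        System.out.println(plan.describe);
        // project(id, input: read(named_table: student))
    }
}
```

The point of the pattern is that each operator is a pure function from relation to relation, so test code can compose plans the way a dataframe API would.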
[GitHub] [spark] cloud-fan commented on a diff in pull request #37994: [SPARK-40454][CONNECT] Initial DSL framework for protobuf testing
cloud-fan commented on code in PR #37994: URL: https://github.com/apache/spark/pull/37994#discussion_r980615672 ## connect/src/test/scala/org/apache/spark/sql/connect/planner/SparkConnectProtoSuite.scala: ## @@ -0,0 +1,74 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ +package org.apache.spark.sql.connect.planner + +import org.scalatest.BeforeAndAfter + +import org.apache.spark.SparkFunSuite +import org.apache.spark.connect.proto +import org.apache.spark.sql.catalyst.analysis.UnresolvedAttribute +import org.apache.spark.sql.catalyst.plans.PlanTest +import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan +import org.apache.spark.sql.connect.plans.DslLogicalPlan +import org.apache.spark.sql.test.SharedSparkSession + +/** + * This suite is based on the connect DSL and tests that, given the same dataframe operations, + * connect can construct a proto plan that can be translated back and, once analyzed, is the + * same as the plan generated by the Spark dataframe API.
+ */ +class SparkConnectProtoSuite extends SparkFunSuite + with SharedSparkSession with SparkConnectPlanTest with PlanTest with BeforeAndAfter { + + lazy val connectTestRelation = +proto.Relation.newBuilder() + .setRead( +proto.Read.newBuilder() + .setNamedTable(proto.Read.NamedTable.newBuilder().addParts("student").build()) +.build()) + .build() + + lazy val sparkTestRelation = spark.table("student") + + before { +setupTestData() + } + + test("Basic select") { +val connectPlan = analyze(connectTestRelation.select(UnresolvedAttribute(Seq("id" +val sparkPlan = sparkTestRelation.select("id").queryExecution.analyzed +comparePlans(connectPlan, sparkPlan) + } + + private def analyze(plan: proto.Relation): LogicalPlan = { +spark.sessionState.executePlan(transform(plan)).analyzed + } + + protected override def comparePlans( +plan1: LogicalPlan, Review Comment: nit: 4 spaces indentation
[GitHub] [spark] HyukjinKwon commented on a diff in pull request #37989: [SPARK-40096][CORE][TESTS][FOLLOW-UP] Explicitly check the element and length
HyukjinKwon commented on code in PR #37989: URL: https://github.com/apache/spark/pull/37989#discussion_r980652463 ## core/src/test/scala/org/apache/spark/scheduler/DAGSchedulerSuite.scala: ## @@ -4495,7 +4495,8 @@ class DAGSchedulerSuite extends SparkFunSuite with TempLocalSparkContext with Ti sendRequestsLatch.await() verify(blockStoreClient, times(2)) .finalizeShuffleMerge(any(), any(), any(), any(), any()) - assert(sentHosts === Seq("hostB")) + assert(sentHosts.nonEmpty) + assert(sentHosts.head === "hostB" && sentHosts.length == 1) Review Comment: Ah, okay. Seems like this is flaky because of the timeout: ``` [info] - SPARK-40096: Send finalize events even if shuffle merger blocks indefinitely with registerMergeResults is false *** FAILED *** (106 milliseconds) [info] ArrayBuffer("hostB") was empty (DAGSchedulerSuite.scala:4498) [info] org.scalatest.exceptions.TestFailedException: [info] at org.scalatest.Assertions.newAssertionFailedException(Assertions.scala:472) [info] at org.scalatest.Assertions.newAssertionFailedException$(Assertions.scala:471) [info] at org.scalatest.Assertions$.newAssertionFailedException(Assertions.scala:1231) [info] at org.scalatest.Assertions$AssertionsHelper.macroAssert(Assertions.scala:1295) [info] at org.apache.spark.scheduler.DAGSchedulerSuite.$anonfun$new$286(DAGSchedulerSuite.scala:4498) [info] at org.scalatest.OutcomeOf.outcomeOf(OutcomeOf.scala:85) [info] at org.scalatest.OutcomeOf.outcomeOf$(OutcomeOf.scala:83) [info] at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104) [info] at org.scalatest.Transformer.apply(Transformer.scala:22) [info] at org.scalatest.Transformer.apply(Transformer.scala:20) ``` https://github.com/apache/spark/actions/runs/3129263557/jobs/5078150518 I think it was empty when the condition is checked. Later, I think the array is filled at the time when the exception is actually printed out. Should we increase the timeout, @mridulm and @wankunde? 
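The race described above — a buffer filled by a background thread can look empty at the instant the assertion runs, yet be populated by the time the failure is printed — is typically fixed by waiting on a completion signal (with a generous timeout) before asserting, rather than asserting immediately. A minimal sketch of that fix; `sentHosts` and the thread body are illustrative stand-ins, not the DAGSchedulerSuite code:

```java
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

// Sketch of removing the race: the asynchronous work signals a latch when it
// has recorded its result, and the test waits on the latch before asserting.
public class AwaitBeforeAssert {
    public static List<String> sendWithLatch(long timeoutMs) {
        List<String> sentHosts = new CopyOnWriteArrayList<>();
        CountDownLatch done = new CountDownLatch(1);

        Thread sender = new Thread(() -> {
            sentHosts.add("hostB"); // the asynchronous work under test
            done.countDown();       // signal completion
        });
        sender.start();

        try {
            // Wait for completion instead of asserting immediately; a longer
            // timeout here trades latency for stability on slow CI machines.
            if (!done.await(timeoutMs, TimeUnit.MILLISECONDS)) {
                throw new AssertionError("timed out waiting for sender");
            }
        } catch (InterruptedException e) {
            throw new RuntimeException(e);
        }
        return sentHosts;
    }

    public static void main(String[] args) {
        List<String> hosts = sendWithLatch(5000);
        System.out.println(hosts); // [hostB]
    }
}
```

With the latch in place, increasing the timeout only affects how long a genuinely broken run takes to fail; a healthy run never waits longer than the work itself.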
[GitHub] [spark] amaliujia commented on pull request #38006: [SPARK-40536] Make Spark Connect port configurable
amaliujia commented on PR #38006: URL: https://github.com/apache/spark/pull/38006#issuecomment-1258739872 @HyukjinKwon @cloud-fan @grundprinzip
[GitHub] [spark] Kimahriman commented on pull request #38003: [SPARK-40565][SQL] Don't push non-deterministic filters to V2 file sources
Kimahriman commented on PR #38003: URL: https://github.com/apache/spark/pull/38003#issuecomment-1258752350 > Thank you for making a PR with the test coverage, @Kimahriman . Previously, it fails, right? Yeah these tests actually fail with an exception without the change
[GitHub] [spark] amaliujia commented on a diff in pull request #38006: [SPARK-40536][CONNECT] Make Spark Connect port configurable
amaliujia commented on code in PR #38006: URL: https://github.com/apache/spark/pull/38006#discussion_r980607213 ## core/src/main/scala/org/apache/spark/internal/config/Connect.scala: ## @@ -0,0 +1,33 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.internal.config + +private[spark] object Connect { Review Comment: I agree. I am not sure if there is a way to have config only for a plugin. Let's see other reviewers' suggestions on this.
[GitHub] [spark] cloud-fan commented on pull request #38001: [SPARK-40562][SQL] Add `spark.sql.legacy.groupingIdWithAppendedUserGroupBy`
cloud-fan commented on PR #38001: URL: https://github.com/apache/spark/pull/38001#issuecomment-1258815644 > it's not in the SQL standard Yea, but since we copied it from Hive, I think the result should match Hive as well. Sorry I didn't realize there is a result change when doing the bug fix PR. Can we try with Hive to make sure we return the correct result today?
[GitHub] [spark] aokolnychyi commented on a diff in pull request #38004: [SPARK-40551][SQL] DataSource V2: Add APIs for delta-based row-level operations
aokolnychyi commented on code in PR #38004: URL: https://github.com/apache/spark/pull/38004#discussion_r980626015 ## sql/catalyst/src/main/java/org/apache/spark/sql/connector/write/LogicalWriteInfo.java: ## @@ -45,4 +45,18 @@ public interface LogicalWriteInfo { * the schema of the input data from Spark to data source. */ StructType schema(); + + /** + * the schema of the input metadata from Spark to data source. + */ + default StructType metadataSchema() { +return null; Review Comment: The default implementation is purely for compatibility.
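The compatibility trick in this thread is standard interface evolution: a method added to an existing interface gets a default body (here returning null) so implementations compiled against the old interface keep working. A minimal sketch with stand-in names; the real `LogicalWriteInfo` returns `StructType`, the `String` here is only for illustration:

```java
// Sketch of evolving an interface without breaking existing implementations.
// LogicalWriteInfo / metadataSchema are stand-in names, not Spark's classes.
interface LogicalWriteInfo {
    String schema(); // original method, implemented everywhere

    // Added later: the default keeps old implementations source- and
    // binary-compatible; callers treat null as "no metadata schema".
    default String metadataSchema() {
        return null;
    }
}

public class InterfaceEvolutionDemo {
    // An "old" implementation written before metadataSchema() existed.
    static final class OldInfo implements LogicalWriteInfo {
        public String schema() { return "id INT, data STRING"; }
    }

    public static LogicalWriteInfo oldInfo() { return new OldInfo(); }

    public static void main(String[] args) {
        LogicalWriteInfo info = oldInfo();
        System.out.println(info.schema());         // id INT, data STRING
        System.out.println(info.metadataSchema()); // null (compat default)
    }
}
```

The trade-off discussed in the review: the default is safe when only one party (Spark) creates instances, but a null-returning default pushes a null check onto every caller.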
[GitHub] [spark] bersprockets commented on a diff in pull request #37825: [SPARK-40382][SQL] Group distinct aggregate expressions by semantically equivalent children in `RewriteDistinctAggregates`
bersprockets commented on code in PR #37825: URL: https://github.com/apache/spark/pull/37825#discussion_r980641159 ## sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/RewriteDistinctAggregates.scala: ## @@ -218,9 +218,16 @@ object RewriteDistinctAggregates extends Rule[LogicalPlan] { val aggExpressions = collectAggregateExprs(a) val distinctAggs = aggExpressions.filter(_.isDistinct) +val funcChildren = distinctAggs.flatMap { e => + e.aggregateFunction.children.filter(!_.foldable) +} +val funcChildrenLookup = funcChildren.map { e => + (e, funcChildren.find(fc => e.semanticEquals(fc)).getOrElse(e)) +}.toMap + // Extract distinct aggregate expressions. val distinctAggGroups = aggExpressions.filter(_.isDistinct).groupBy { e => Review Comment: Not sure if this is what you were hinting at, but for all maps related to distinct aggregation children, the code now uses `ExpressionSet` as a key. That way look-ups shouldn't care about superficial differences: the code never makes a lookup using an original child (...for the distinct aggregations. It still uses original children for regular aggregations). >Then it's pretty easy to get back the original expressions, by ExpressionSet.toSeq. By using `ExpressionSet` as the key to `distinctAggChildAttrLookup`, hopefully I don't need the originals at all. Which is a good thing, since `ExpressionSet` is lossy when it comes to the originals, for example: ``` select count(distinct 1 + c1, c1 + 1), count(distinct c2 + 1, c2 + 2) from df; ``` This creates the following grouping keys for `distinctAggGroups`: ``` Set((1 + c1#106)) Set((c2#107 + 1), (c2#107 + 2)) ``` `c1#106 + 1` is lost because of the way `ExpressionSet#add` works (it just ignores a new expression that is semantically equivalent to anything in `baseSet`).
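The lossiness described above can be sketched outside Catalyst: deduplicate expressions by a canonical form, keep only the first-seen original per form, and semantically equivalent variants disappear. The canonicalization below (sorting the operands of `+`) is a toy stand-in for Catalyst's real `semanticEquals`/`canonicalized` machinery, and the string expressions are illustrative:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Sketch of ExpressionSet-style grouping: "1 + c1" and "c1 + 1" share a
// canonical form, so only the first-seen original survives, which is the
// lossiness noted in the review comment.
public class SemanticSetDemo {
    // Canonicalize "a + b" by ordering the operands lexicographically,
    // mimicking semantic equivalence of commutative addition.
    static String canonicalize(String expr) {
        String[] parts = expr.split("\\s*\\+\\s*");
        Arrays.sort(parts);
        return String.join(" + ", parts);
    }

    // Keep the first original seen for each canonical form (like
    // ExpressionSet#add ignoring semantically equivalent newcomers).
    static List<String> dedup(List<String> exprs) {
        Map<String, String> byCanonical = new LinkedHashMap<>();
        for (String e : exprs) {
            byCanonical.putIfAbsent(canonicalize(e), e);
        }
        return new ArrayList<>(byCanonical.values());
    }

    public static void main(String[] args) {
        List<String> kept =
            dedup(List.of("1 + c1", "c1 + 1", "c2 + 1", "c2 + 2"));
        // "c1 + 1" is dropped; "c2 + 1" and "c2 + 2" are genuinely distinct.
        System.out.println(kept); // [1 + c1, c2 + 1, c2 + 2]
    }
}
```

This mirrors the review's example: of `count(distinct 1 + c1, c1 + 1)`, only `1 + c1` survives as a grouping key, which is harmless as long as lookups also go through the canonical form rather than the lost original.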
[GitHub] [spark] HeartSaVioR commented on a diff in pull request #38008: [SPARK-40571][SS][TESTS] Construct a new test case for applyInPandasWithState to verify fault-tolerance semantic with random py
HeartSaVioR commented on code in PR #38008: URL: https://github.com/apache/spark/pull/38008#discussion_r980664896 ## python/pyspark/sql/tests/test_pandas_grouped_map_with_state.py: ## @@ -90,6 +107,99 @@ def check_results(batch_df, _): self.assertTrue(q.isActive) q.processAllAvailable() +def test_apply_in_pandas_with_state_python_worker_random_failure(self): +output_path = tempfile.mkdtemp() +checkpoint_loc = tempfile.mkdtemp() +shutil.rmtree(output_path) +shutil.rmtree(checkpoint_loc) + +def run_query(): +df = self.spark.readStream.format("text") \ +.option("maxFilesPerTrigger", "1") \ Review Comment: This query runs 10 batches from 10 files, each of which has 100 words, so 1000 words overall.
[GitHub] [spark] HeartSaVioR commented on pull request #38008: [SPARK-40571][SS][TESTS] Construct a new test case for applyInPandasWithState to verify fault-tolerance semantic with random python worke
HeartSaVioR commented on PR #38008: URL: https://github.com/apache/spark/pull/38008#issuecomment-1258874132 cc. @HyukjinKwon @alex-balikov
[GitHub] [spark] dongjoon-hyun commented on a diff in pull request #38003: [SPARK-40565][SQL] Don't push non-deterministic filters to V2 file sources
dongjoon-hyun commented on code in PR #38003: URL: https://github.com/apache/spark/pull/38003#discussion_r980419475 ## sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/PruneFileSourcePartitionsSuite.scala: ## @@ -140,6 +140,24 @@ class PruneFileSourcePartitionsSuite extends PrunePartitionSuiteBase with Shared } } + test("SPARK-40565: don't push down non-deterministic filters for V2 file sources") { Review Comment: BTW, this query seems to fail in V1 source. Could you confirm this? ```scala -withSQLConf((SQLConf.USE_V1_SOURCE_LIST.key, "")) { +withSQLConf((SQLConf.USE_V1_SOURCE_LIST.key, "parquet")) { ```
[GitHub] [spark] amaliujia commented on pull request #37750: [SPARK-40296] Error class for DISTINCT function not found
amaliujia commented on PR #37750: URL: https://github.com/apache/spark/pull/37750#issuecomment-1258518169 Because Spark supports `SELECT distinct(col1, col2)` (and the return is a struct of col1 and col2), this error message proposal becomes complicated. We cannot simply say that `SELECT SUM(a1), distinct(c1, c2)` is invalid usage, even though it is not supported by the parser, because users may have been led to believe that `distinct(col list)` is a valid aggregate-like usage. We at least need a better way to distinguish 1) `select distinct col_list` 2) `select distinct(col_list)` 3) `select agg1, agg2, distinct(col_list)`. Given the reason above, I will close this PR and defer the handling of the distinct error described here.
[GitHub] [spark] amaliujia closed pull request #37750: [SPARK-40296] Error class for DISTINCT function not found
amaliujia closed pull request #37750: [SPARK-40296] Error class for DISTINCT function not found URL: https://github.com/apache/spark/pull/37750
[GitHub] [spark] thiyaga commented on pull request #38001: [SPARK-40562][SQL] Add `spark.sql.legacy.groupingIdWithAppendedUserGroupBy`
thiyaga commented on PR #38001: URL: https://github.com/apache/spark/pull/38001#issuecomment-1258517989 We use grouping sets in our queries and rely on `grouping__id` as an identifier to query the data for the respective group. If we use `grouping__id` directly, it is prone to change whenever the grouping sets change (e.g. adding a new grouping set, or adding a new column to an existing grouping set). Any grouping id change makes things even more complex when consuming this data directly from reporting tools like Tableau. We need to do one of the following to mitigate the changing `grouping__id`: 1. Transform the `grouping__id` into something deterministic that is not impacted when the grouping sets change (e.g. convert `grouping__id` to a `group_name`) 2. Have some sort of logical DB view that handles the transformation at runtime (e.g. using CASE WHEN) In essence, we always have a dependency on `grouping__id` when grouping sets are used in our queries, and any change in grouping id generation has an immediate impact. This new parameter will let us keep using the legacy logic.
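The instability described above can be illustrated with a small Java sketch, assuming the common convention where a grouping id is a bitmask over the GROUP BY columns (Spark's exact bit layout differs in detail, which is what the SPARK-40562 legacy flag is about): adding one column to GROUP BY renumbers every existing id.

```java
import java.util.List;
import java.util.Set;

class GroupingIdDemo {
    // One bit per GROUP BY column; a set bit means the column is absent
    // from the grouping set. This is a textbook GROUPING__ID convention,
    // used here only to show why ids shift when grouping columns change.
    static long groupingId(List<String> groupByCols, Set<String> groupingSet) {
        long gid = 0;
        for (String col : groupByCols) {
            gid = (gid << 1) | (groupingSet.contains(col) ? 0 : 1);
        }
        return gid;
    }

    public static void main(String[] args) {
        List<String> cols = List.of("region", "product");
        System.out.println(groupingId(cols, Set.of("region")));    // 1

        // Adding one more column to GROUP BY renumbers every existing id,
        // which breaks reports keyed on the old values.
        List<String> cols2 = List.of("region", "product", "channel");
        System.out.println(groupingId(cols2, Set.of("region")));   // 3
    }
}
```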
[GitHub] [spark] aokolnychyi commented on a diff in pull request #38004: [SPARK-40551][SQL] DataSource V2: Add APIs for delta-based row-level operations
aokolnychyi commented on code in PR #38004: URL: https://github.com/apache/spark/pull/38004#discussion_r980510778 ## sql/catalyst/src/main/java/org/apache/spark/sql/connector/write/DeltaWrite.java: ## @@ -0,0 +1,33 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.sql.connector.write; + +import org.apache.spark.annotation.Experimental; + +/** + * A logical representation of a data source write that handles a delta of rows. + * + * @since 3.4.0 + */ +@Experimental +public interface DeltaWrite extends Write { + @Override + default DeltaBatchWrite toBatch() { Review Comment: I added a default implementation to match the parent interface. In the future, we may also override `toStreaming`.
[GitHub] [spark] aokolnychyi commented on a diff in pull request #38004: [SPARK-40551][SQL] DataSource V2: Add APIs for delta-based row-level operations
aokolnychyi commented on code in PR #38004: URL: https://github.com/apache/spark/pull/38004#discussion_r980511079 ## sql/catalyst/src/main/java/org/apache/spark/sql/connector/write/DeltaWrite.java: ## @@ -0,0 +1,33 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.sql.connector.write; + +import org.apache.spark.annotation.Experimental; + +/** + * A logical representation of a data source write that handles a delta of rows. + * + * @since 3.4.0 + */ +@Experimental +public interface DeltaWrite extends Write { + @Override + default DeltaBatchWrite toBatch() { Review Comment: It is also required to avoid inheriting the base implementation.
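The covariant-default pattern being discussed can be shown with a compilable sketch. The interface names mirror the PR, but the bodies below are hypothetical stand-ins, not Spark's actual code: the override narrows the return type to `DeltaBatchWrite` and replaces the base default, so implementations inherit delta semantics without defining `toBatch()` themselves.

```java
// Minimal model of the interfaces discussed in the PR; method bodies are
// illustrative only and not Spark's implementation.
interface BatchWrite {}

interface Write {
    default BatchWrite toBatch() {
        throw new UnsupportedOperationException("batch writes not supported");
    }
}

interface DeltaBatchWrite extends BatchWrite {}

interface DeltaWrite extends Write {
    // Covariant override: narrows the return type and replaces the base
    // default, so implementors are not forced to define it themselves.
    @Override
    default DeltaBatchWrite toBatch() {
        return new DeltaBatchWrite() {};
    }
}

class CovariantDefaultDemo {
    public static void main(String[] args) {
        DeltaWrite write = new DeltaWrite() {};
        // The narrowed default is inherited automatically.
        BatchWrite batch = write.toBatch();
        System.out.println(batch instanceof DeltaBatchWrite);
    }
}
```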
[GitHub] [spark] aokolnychyi opened a new pull request, #38005: [SPARK-40550][SQL] Handle DELETE commands for delta-based sources
aokolnychyi opened a new pull request, #38005: URL: https://github.com/apache/spark/pull/38005 ### What changes were proposed in this pull request? This WIP PR shows how the API added in PR #38004 can be implemented. ### Why are the changes needed? These changes are needed as per SPIP SPARK-35801. ### Does this PR introduce _any_ user-facing change? Yes, this PR adds new DS v2 APIs. ### How was this patch tested? This PR comes with tests.
[GitHub] [spark] github-actions[bot] closed pull request #36030: Draft: [SPARK-38715] Configurable client ID for Kafka Spark SQL producer
github-actions[bot] closed pull request #36030: Draft: [SPARK-38715] Configurable client ID for Kafka Spark SQL producer URL: https://github.com/apache/spark/pull/36030
[GitHub] [spark] github-actions[bot] closed pull request #36829: [SPARK-39438][SQL] Add a threshold to not in line CTE
github-actions[bot] closed pull request #36829: [SPARK-39438][SQL] Add a threshold to not in line CTE URL: https://github.com/apache/spark/pull/36829
[GitHub] [spark] github-actions[bot] closed pull request #36005: [SPARK-38506][SQL] Push partial aggregation through join
github-actions[bot] closed pull request #36005: [SPARK-38506][SQL] Push partial aggregation through join URL: https://github.com/apache/spark/pull/36005
[GitHub] [spark] github-actions[bot] closed pull request #36046: [SPARK-38771][SQL] Adaptive Bloom filter Join
github-actions[bot] closed pull request #36046: [SPARK-38771][SQL] Adaptive Bloom filter Join URL: https://github.com/apache/spark/pull/36046
[GitHub] [spark] github-actions[bot] closed pull request #35799: [SPARK-38498][STREAM] Support customized StreamingListener by configuration
github-actions[bot] closed pull request #35799: [SPARK-38498][STREAM] Support customized StreamingListener by configuration URL: https://github.com/apache/spark/pull/35799
[GitHub] [spark] github-actions[bot] closed pull request #35858: [SPARK-38448] [YARN] [CORE] Sending Available Resources in Yarn Cluster Information to Spark Driver
github-actions[bot] closed pull request #35858: [SPARK-38448] [YARN] [CORE] Sending Available Resources in Yarn Cluster Information to Spark Driver URL: https://github.com/apache/spark/pull/35858
[GitHub] [spark] github-actions[bot] closed pull request #35806: [SPARK-38505][SQL] Make partial aggregation adaptive
github-actions[bot] closed pull request #35806: [SPARK-38505][SQL] Make partial aggregation adaptive URL: https://github.com/apache/spark/pull/35806
[GitHub] [spark] github-actions[bot] closed pull request #35763: [SPARK-38433][BUILD] change the shell code style with shellcheck
github-actions[bot] closed pull request #35763: [SPARK-38433][BUILD] change the shell code style with shellcheck URL: https://github.com/apache/spark/pull/35763
[GitHub] [spark] github-actions[bot] commented on pull request #35751: [SPARK-38433][BUILD] Add shell code style check Actions
github-actions[bot] commented on PR #35751: URL: https://github.com/apache/spark/pull/35751#issuecomment-1258818104 We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable. If you'd like to revive this PR, please reopen it and ask a committer to remove the Stale tag!
[GitHub] [spark] github-actions[bot] closed pull request #35845: [SPARK-38520][SQL] ANSI interval overflow when reading CSV
github-actions[bot] closed pull request #35845: [SPARK-38520][SQL] ANSI interval overflow when reading CSV URL: https://github.com/apache/spark/pull/35845
[GitHub] [spark] github-actions[bot] commented on pull request #35569: [SPARK-38250][CORE] Check existence before deleting stagingDir in HadoopMapReduceCommitProtocol
github-actions[bot] commented on PR #35569: URL: https://github.com/apache/spark/pull/35569#issuecomment-1258818209 We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable. If you'd like to revive this PR, please reopen it and ask a committer to remove the Stale tag!
[GitHub] [spark] github-actions[bot] closed pull request #35808: [WIP][SPARK-38512] Rebased traversal order from "pre-order" to "post-order" for `ResolveFunctions` Rule
github-actions[bot] closed pull request #35808: [WIP][SPARK-38512] Rebased traversal order from "pre-order" to "post-order" for `ResolveFunctions` Rule URL: https://github.com/apache/spark/pull/35808
[GitHub] [spark] cloud-fan commented on pull request #37679: [SPARK-35242][SQL] Support changing session catalog's default database
cloud-fan commented on PR #37679: URL: https://github.com/apache/spark/pull/37679#issuecomment-1258818080 thanks, merging to master!
[GitHub] [spark] github-actions[bot] commented on pull request #36889: [SPARK-21195][CORE] Dynamically register metrics from sources as they are reported
github-actions[bot] commented on PR #36889: URL: https://github.com/apache/spark/pull/36889#issuecomment-1258818008 We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable. If you'd like to revive this PR, please reopen it and ask a committer to remove the Stale tag!
[GitHub] [spark] sadikovi commented on a diff in pull request #35764: [SPARK-38444][SQL]Automatically calculate the upper and lower bounds of partitions when no specified partition related params
sadikovi commented on code in PR #35764: URL: https://github.com/apache/spark/pull/35764#discussion_r980780872 ## sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JDBCOptions.scala: ## @@ -111,6 +111,9 @@ class JDBCOptions( // the number of partitions val numPartitions = parameters.get(JDBC_NUM_PARTITIONS).map(_.toInt) + // the default number of partitions + val defaultNumPartitions = parameters.getOrElse(DEFAULT_NUM_PARTITIONS, "10").toInt Review Comment: I replied below. I think if this is about the default number of partitions, we can just set the default on the `JDBC_NUM_PARTITIONS` option.
[GitHub] [spark] mskapilks commented on pull request #37996: [SPARK-40558][SQL] Add Reusable Exchange in Bloom creation side plan
mskapilks commented on PR #37996: URL: https://github.com/apache/spark/pull/37996#issuecomment-1259015446 > 2. We can use `shuffle records written` instead of `spark.sql.optimizer.runtime.bloomFilter.expectedNumItems` to build bloom filter. Good point. It would be better than the current stats-based row count we use to build the bloom filter.
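For context on why an accurate item-count estimate matters here: the standard Bloom filter sizing formulas tie the bit array size and hash count to the expected number of items and the target false-positive rate. The sketch below is textbook math, not Spark's implementation; underestimating the item count (e.g. stats-based row counts versus actual shuffle records written) undersizes the filter and inflates the real false-positive rate.

```java
class BloomSizing {
    // Textbook Bloom filter sizing:
    //   m = -n * ln(p) / (ln 2)^2  bits
    //   k = (m / n) * ln 2         hash functions
    static long bits(long expectedItems, double fpp) {
        return (long) Math.ceil(-expectedItems * Math.log(fpp) / (Math.log(2) * Math.log(2)));
    }

    static int hashes(long expectedItems, long bits) {
        return Math.max(1, (int) Math.round((double) bits / expectedItems * Math.log(2)));
    }

    public static void main(String[] args) {
        long n = 1_000_000L;
        long m = bits(n, 0.03);
        // Roughly 7.3 million bits and 5 hash functions for a 3% target rate.
        System.out.println(m + " bits, " + hashes(n, m) + " hashes");
    }
}
```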
[GitHub] [spark] sadikovi commented on a diff in pull request #35764: [SPARK-38444][SQL]Automatically calculate the upper and lower bounds of partitions when no specified partition related params
sadikovi commented on code in PR #35764: URL: https://github.com/apache/spark/pull/35764#discussion_r980779629 ## sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JDBCOptions.scala: ## @@ -111,6 +111,9 @@ class JDBCOptions( // the number of partitions val numPartitions = parameters.get(JDBC_NUM_PARTITIONS).map(_.toInt) + // the default number of partitions + val defaultNumPartitions = parameters.getOrElse(DEFAULT_NUM_PARTITIONS, "10").toInt Review Comment: I think the name of the config is misleading; this is essentially the default value of the `numPartitions` config: ```scala val numPartitions = parameters.get(JDBC_NUM_PARTITIONS).map(_.toInt).getOrElse(10) ```
[GitHub] [spark] mridulm commented on pull request #37779: [wip][SPARK-40320][Core] Executor should exit when it failed to initialize for fatal error
mridulm commented on PR #37779: URL: https://github.com/apache/spark/pull/37779#issuecomment-1259007734 Thanks for the query @Ngone51 - I missed one aspect of my analysis, which ends up completely changing the solution - my bad :-( The answer to your query has the reason for the lack of failure - this is due to the two types of APIs we are using ... For `DedicatedMessageLoop`, for the initial submission of `receiveLoopRunnable` we use the `ExecutorService.submit` API, while for other cases we use the `Executor.execute` API - and this is the cause of the behavior. The `submit` API returns a `Future` - and if we look at the implementation of `FutureTask.run`, we see that it catches `Throwable` and preserves it as the outcome (`setException`), which is why the thread itself does not die. So this specific case might be mitigated by using the `execute` API instead of `submit`.
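The submit-vs-execute difference described above can be reproduced with a small standalone program (a sketch, unrelated to Spark's `DedicatedMessageLoop` code): `submit` wraps the task in a `FutureTask` that captures any `Throwable`, while `execute` lets it reach the worker thread's uncaught-exception handler.

```java
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;

class SubmitVsExecute {
    public static void main(String[] args) throws Exception {
        ExecutorService pool = Executors.newSingleThreadExecutor();
        Runnable failing = () -> { throw new OutOfMemoryError("simulated fatal error"); };

        // submit() wraps the task in a FutureTask: FutureTask.run catches
        // Throwable and stores it via setException, so nothing escapes
        // unless someone calls get() on the returned Future.
        Future<?> future = pool.submit(failing);
        try {
            future.get();
        } catch (ExecutionException e) {
            System.out.println("submit captured: " + e.getCause().getClass().getSimpleName());
        }

        // execute() runs the Runnable directly: the same Throwable escapes
        // run() and reaches the worker thread's uncaught-exception handler
        // (stack trace on stderr), killing that thread.
        pool.execute(failing);

        pool.shutdown();
        pool.awaitTermination(5, TimeUnit.SECONDS);
    }
}
```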
[GitHub] [spark] sadikovi commented on a diff in pull request #35764: [SPARK-38444][SQL]Automatically calculate the upper and lower bounds of partitions when no specified partition related params
sadikovi commented on code in PR #35764: URL: https://github.com/apache/spark/pull/35764#discussion_r980766168 ## sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JDBCRelation.scala: ## @@ -168,6 +177,71 @@ private[sql] object JDBCRelation extends Logging { partitions } + /** + * get the min and max value by the column + * @param schema resolved schema of a JDBC table + * @param resolver function used to determine if two identifiers are equal + * @param timeZoneId timezone ID to be used if a partition column type is date or timestamp + * @param jdbcOptions JDBC options that contains url + * @param filters filters in Where clause + * @return JDBCPartitioningInfo + */ + def getPartitionBound( + schema: StructType, + resolver: Resolver, + timeZoneId: String, + jdbcOptions: JDBCOptions, + filters: Array[Filter] = Array.empty): JDBCPartitioningInfo = { +// columns in filters +val filterColumns = new util.ArrayList[String]() +filters.map(filter => filter.references.distinct.map(r => filterColumns.add(r))) +// primary keys used for partitioning +val prks = schema.fields.filter( + f => f.metadata.getBoolean("isIndexKey") && +!filterColumns.contains(f.name) && +(f.dataType.isInstanceOf[NumericType] || + f.dataType.isInstanceOf[DateType] || + f.dataType.isInstanceOf[TimestampType])) + +if (prks.length > 0) { + val prk = prks.head + val dataType = prk.dataType + var lBound: String = null + var uBound: String = null + val sql = s"select min(${prk.name}) as lBound, max(${prk.name}) as uBound " + Review Comment: Can you explain this logic in the javadoc for this method? Also, what happens if the table is empty? ## sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JDBCRelation.scala: ## @@ -168,6 +177,71 @@ private[sql] object JDBCRelation extends Logging { partitions } + /** + * get the min and max value by the column Review Comment: nit: Get the min and max values for the column. 
## sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JDBCRelation.scala: ## @@ -168,6 +177,71 @@ private[sql] object JDBCRelation extends Logging { partitions } + /** + * get the min and max value by the column + * @param schema resolved schema of a JDBC table + * @param resolver function used to determine if two identifiers are equal + * @param timeZoneId timezone ID to be used if a partition column type is date or timestamp + * @param jdbcOptions JDBC options that contains url + * @param filters filters in Where clause + * @return JDBCPartitioningInfo + */ + def getPartitionBound( + schema: StructType, + resolver: Resolver, + timeZoneId: String, + jdbcOptions: JDBCOptions, + filters: Array[Filter] = Array.empty): JDBCPartitioningInfo = { +// columns in filters +val filterColumns = new util.ArrayList[String]() +filters.map(filter => filter.references.distinct.map(r => filterColumns.add(r))) +// primary keys used for partitioning +val prks = schema.fields.filter( + f => f.metadata.getBoolean("isIndexKey") && +!filterColumns.contains(f.name) && +(f.dataType.isInstanceOf[NumericType] || + f.dataType.isInstanceOf[DateType] || + f.dataType.isInstanceOf[TimestampType])) + +if (prks.length > 0) { + val prk = prks.head + val dataType = prk.dataType + var lBound: String = null + var uBound: String = null + val sql = s"select min(${prk.name}) as lBound, max(${prk.name}) as uBound " + +s"from ${jdbcOptions.tableOrQuery} limit 1" + val conn = JdbcDialects.get(jdbcOptions.url).createConnectionFactory(jdbcOptions)(-1) + try { +val statement = conn.prepareStatement(sql) +try { + statement.setQueryTimeout(jdbcOptions.queryTimeout) + val resultSet = statement.executeQuery() + while (resultSet.next()) { +lBound = resultSet.getString("lBound") Review Comment: Would it work for primary keys that are integers or timestamps? 
## sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JDBCRelation.scala: ## @@ -168,6 +177,71 @@ private[sql] object JDBCRelation extends Logging { partitions } + /** + * get the min and max value by the column + * @param schema resolved schema of a JDBC table + * @param resolver function used to determine if two identifiers are equal + * @param timeZoneId timezone ID to be used if a partition column type is date or timestamp + * @param jdbcOptions JDBC options that contains url + * @param filters filters in Where clause + * @return JDBCPartitioningInfo + */ + def getPartitionBound( + schema: StructType, + resolver: Resolver, + timeZoneId: String, +
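On the empty-table question raised above: a `select min(col), max(col)` bounds query over an empty table yields NULL for both aggregates, so the partitioning logic needs an explicit empty-result branch. A small Java sketch of that handling (modeling the column as an in-memory list; names are invented, not the PR's code):

```java
import java.util.List;
import java.util.Optional;

class EmptyTableBounds {
    // Models the bounds query "select min(col), max(col) from table":
    // over an empty table both aggregates come back NULL/absent, so the
    // caller must not partition on them blindly.
    static Optional<long[]> bounds(List<Long> column) {
        Optional<Long> lo = column.stream().min(Long::compare);
        Optional<Long> hi = column.stream().max(Long::compare);
        if (lo.isEmpty() || hi.isEmpty()) {
            return Optional.empty();   // empty table: no usable bounds
        }
        return Optional.of(new long[] {lo.get(), hi.get()});
    }

    public static void main(String[] args) {
        System.out.println(bounds(List.of()).isPresent());   // false: fall back to a single partition
        long[] b = bounds(List.of(3L, 7L, 42L)).orElseThrow();
        System.out.println(b[0] + ".." + b[1]);
    }
}
```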
[GitHub] [spark] LuciferYang commented on pull request #37999: [SPARK-39146][CORE][SQL][K8S] Introduce `JacksonUtils` to use singleton Jackson ObjectMapper
LuciferYang commented on PR #37999: URL: https://github.com/apache/spark/pull/37999#issuecomment-1258914643 In the serial read/write scenario, the benefits are obvious:
- Read scenario: using a singleton is 1800+% faster than creating an `ObjectMapper` every time
- Write scenario: using a singleton is 500+% faster than creating an `ObjectMapper` every time
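The pattern being benchmarked (construct the expensive mapper once, reuse it everywhere) is not specific to Jackson. The actual `JacksonUtils` is Scala and not shown in this message; a minimal Python analogue of the same caching idea, with hypothetical helper names, looks like this:

```python
# Illustrative analogue of a singleton mapper: build one encoder instance
# and reuse it, instead of constructing a new one per call.
import json
from functools import lru_cache


@lru_cache(maxsize=1)
def get_encoder() -> json.JSONEncoder:
    # Constructed once; every caller gets the same cached instance.
    return json.JSONEncoder(sort_keys=True)


def to_json(obj) -> str:
    return get_encoder().encode(obj)
```

The speedups quoted above come from skipping the per-call construction cost; the trade-off is that the shared instance must be safe for concurrent use, which is why the cited PR highlights the serial read/write scenario.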
[GitHub] [spark] zhengruifeng opened a new pull request, #38009: [SPARK-40573][PS] Make `ddof` in `GroupBy.std`, `GroupBy.var` and `GroupBy.sem` accept arbitrary integers
zhengruifeng opened a new pull request, #38009: URL: https://github.com/apache/spark/pull/38009 ### What changes were proposed in this pull request? Make `ddof` in `GroupBy.std`, `GroupBy.var` and `GroupBy.sem` accept arbitrary integers ### Why are the changes needed? For API coverage. ### Does this PR introduce _any_ user-facing change? Yes, `ddof` values other than 0 and 1 are now accepted. ### How was this patch tested? Added test suites.
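What a generalized `ddof` (delta degrees of freedom) means can be sketched in plain Python. These are hypothetical helpers, not the pandas-on-Spark implementation; the divisor is `n - ddof`, so `ddof=0` gives the population variance and `ddof=1` the usual sample variance:

```python
# Sketch of variance / standard error with an arbitrary integer ddof.
import math


def group_var(values, ddof=1):
    """Variance with divisor n - ddof; nan when n - ddof <= 0."""
    n = len(values)
    if n - ddof <= 0:
        return float("nan")
    mean = sum(values) / n
    return sum((x - mean) ** 2 for x in values) / (n - ddof)


def group_std(values, ddof=1):
    return math.sqrt(group_var(values, ddof))


def group_sem(values, ddof=1):
    # Standard error of the mean: std / sqrt(n)
    return group_std(values, ddof) / math.sqrt(len(values))
```

For `[1, 2, 3, 4]` the squared deviations sum to 5, so `ddof=0` yields 5/4, `ddof=1` yields 5/3, and `ddof=2` yields 5/2, which is exactly the family of divisors the PR opens up.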
[GitHub] [spark] srowen commented on a diff in pull request #38010: [MINOR] Clarify that xxhash64 seed is 42
srowen commented on code in PR #38010: URL: https://github.com/apache/spark/pull/38010#discussion_r980690447

## sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/hash.scala:
## @@ -643,7 +643,8 @@ object Murmur3HashFunction extends InterpretedHashFunction {
  * A xxHash64 64-bit hash expression.
  */
 @ExpressionDescription(
-  usage = "_FUNC_(expr1, expr2, ...) - Returns a 64-bit hash value of the arguments.",
+  usage = "_FUNC_(expr1, expr2, ...) - Returns a 64-bit hash value of the arguments. " +

Review Comment:
   We could arguably document the same about hash() above, but it doesn't even state it's murmur3, and maybe that's on purpose, so it didn't seem necessary.
[GitHub] [spark] srowen opened a new pull request, #38010: [MINOR] Clarify that xxhash64 seed is 42
srowen opened a new pull request, #38010: URL: https://github.com/apache/spark/pull/38010 ### What changes were proposed in this pull request? State in the docs that the hash seed used for xxhash64 is 42. ### Why are the changes needed? It's somewhat non-standard not to seed with 0, and users would have to know this seed to reproduce the hash value. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? N/A
[GitHub] [spark] itholic opened a new pull request, #38012: [DO-NOT-MERGE][TEST] Pandas 1.5 Test
itholic opened a new pull request, #38012: URL: https://github.com/apache/spark/pull/38012 ### What changes were proposed in this pull request? ### Why are the changes needed? ### Does this PR introduce _any_ user-facing change? ### How was this patch tested?
[GitHub] [spark] HeartSaVioR commented on a diff in pull request #38008: [SPARK-40571][SS][TESTS] Construct a new test case for applyInPandasWithState to verify fault-tolerance semantic with random py
HeartSaVioR commented on code in PR #38008: URL: https://github.com/apache/spark/pull/38008#discussion_r980722405

## python/pyspark/sql/tests/test_pandas_grouped_map_with_state.py:
## @@ -46,8 +55,27 @@
     cast(str, pandas_requirement_message or pyarrow_requirement_message),
 )
 class GroupedMapInPandasWithStateTests(ReusedSQLTestCase):
+    @classmethod
+    def conf(cls):
+        cfg = SparkConf()
+        cfg.set("spark.sql.shuffle.partitions", "5")
+        return cfg
+
+    def __init__(self, methodName="runTest"):
+        super(GroupedMapInPandasWithStateTests, self).__init__(methodName)
+        self.base_path = "python/test_support/sql/streaming/apply_in_pandas_with_state"

Review Comment:
   Nice finding!
[GitHub] [spark] wangyum opened a new pull request, #38011: [SPARK-40574][DOCS] Enhance DROP TABLE documentation
wangyum opened a new pull request, #38011: URL: https://github.com/apache/spark/pull/38011 ### What changes were proposed in this pull request? This PR adds `PURGE` in `DROP TABLE` documentation. Related documentation and code: 1. Hive `DROP TABLE` documentation: https://cwiki.apache.org/confluence/display/hive/languagemanual+ddl (screenshot: https://user-images.githubusercontent.com/5399861/192425153-63ac5373-dd34-48b3-864c-324cf5ba5db9.png) 2. Hive code: https://github.com/apache/hive/blob/rel/release-2.3.9/ql/src/java/org/apache/hadoop/hive/ql/metadata/Hive.java#L1185-L1209 3. Spark code: https://github.com/apache/spark/blob/v3.3.0/sql/hive/src/main/scala/org/apache/spark/sql/hive/client/HiveShim.scala#L1317-L1327 ### Why are the changes needed? Enhance documentation. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Manual test.
[GitHub] [spark] HeartSaVioR commented on a diff in pull request #38008: [SPARK-40571][SS][TESTS] Construct a new test case for applyInPandasWithState to verify fault-tolerance semantic with random py
HeartSaVioR commented on code in PR #38008: URL: https://github.com/apache/spark/pull/38008#discussion_r980721102

## python/test_support/sql/streaming/apply_in_pandas_with_state/random_failure/input/test-0.txt:
## @@ -0,0 +1,100 @@
+non

Review Comment:
   I just changed both tests to create the dataset files before running a test; neither input files nor a golden file is needed.
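Generating the input dataset at test time, instead of checking a 100-line fixture file into the repository, can be sketched like this. The helper name, word length, and seed here are hypothetical, not taken from the PR; the key point is a fixed seed so every rerun produces identical "random" input:

```python
# Sketch: create the test dataset files in setUp rather than committing them.
import os
import random
import string


def write_dataset(base_path, num_rows=100, seed=42):
    """Write num_rows three-letter lowercase words, one per line."""
    # Deterministic RNG so reruns regenerate byte-identical input.
    rng = random.Random(seed)
    os.makedirs(base_path, exist_ok=True)
    path = os.path.join(base_path, "test-0.txt")
    with open(path, "w") as f:
        for _ in range(num_rows):
            word = "".join(rng.choice(string.ascii_lowercase) for _ in range(3))
            f.write(word + "\n")
    return path
```

With this approach the expected output can also be recomputed from the generated input, so no golden file needs to live in `python/test_support`.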