[GitHub] [spark] LuciferYang commented on a diff in pull request #40342: [SPARK-42721][CONNECT] RPC logging interceptor

2023-03-08 Thread via GitHub


LuciferYang commented on code in PR #40342:
URL: https://github.com/apache/spark/pull/40342#discussion_r1130607164


##
connector/connect/server/pom.xml:
##
@@ -155,6 +155,12 @@
   <version>${protobuf.version}</version>
   <scope>compile</scope>
 </dependency>
+<dependency>
+  <groupId>com.google.protobuf</groupId>
+  <artifactId>protobuf-java-util</artifactId>

Review Comment:
   Without this PR, the server module depended on `protobuf-java-util:3.19.2`. Should `protobuf-java-util` and `protobuf-java` always use the same version?
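
For context, the alignment being asked about amounts to pinning both artifacts to a single version property. A minimal sbt-style sketch (the version value below is a placeholder, not Spark's actual pin, which lives in the parent pom):

    val protobufVersion = "3.22.0"  // placeholder value
    libraryDependencies ++= Seq(
      "com.google.protobuf" % "protobuf-java"      % protobufVersion,
      "com.google.protobuf" % "protobuf-java-util" % protobufVersion)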




[GitHub] [spark] sigmod commented on a diff in pull request #40300: [SPARK-42683] Automatically rename conflicting metadata columns

2023-03-08 Thread via GitHub


sigmod commented on code in PR #40300:
URL: https://github.com/apache/spark/pull/40300#discussion_r1130602165


##
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/LogicalPlan.scala:
##
@@ -340,13 +358,26 @@ trait ExposesMetadataColumns extends LogicalPlan {
   val resolve = conf.resolver
   val outputNames = outputSet.map(_.name)
 
-  def isOutputColumn(col: AttributeReference): Boolean = {
-    outputNames.exists(name => resolve(col.name, name))
+  // Generate a unique name by prepending underscores.
+  @scala.annotation.tailrec
+  def makeUnique(name: String): String = name match {
+    case name if outputNames.exists(resolve(_, name)) => makeUnique(s"_$name")
+    case name => name
+  }
+
+  // Rename metadata struct columns whose names conflict with output columns.

Review Comment:
   > If there are duplicate columns, it possibly picks incorrect attribute.
   
   Is it possible to keep both in the requested schema?  
   
   cc @cashmand 
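
For context, the renaming rule quoted in the diff behaves like this standalone sketch (a case-insensitive resolver is assumed here in place of `conf.resolver`):

    val outputNames = Set("index")
    val resolve: (String, String) => Boolean = _.equalsIgnoreCase(_)

    @scala.annotation.tailrec
    def makeUnique(name: String): String =
      if (outputNames.exists(resolve(_, name))) makeUnique(s"_$name") else name

    makeUnique("index")  // "_index": underscores are prepended until no output column matches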




[GitHub] [spark] zhengruifeng commented on pull request #40348: [SPARK-42724][CONNECT][BUILD] Upgrade buf to v1.15.1

2023-03-08 Thread via GitHub


zhengruifeng commented on PR #40348:
URL: https://github.com/apache/spark/pull/40348#issuecomment-1461495703

   merged to master/branch-3.4



[GitHub] [spark] zhengruifeng closed pull request #40348: [SPARK-42724][CONNECT][BUILD] Upgrade buf to v1.15.1

2023-03-08 Thread via GitHub


zhengruifeng closed pull request #40348: [SPARK-42724][CONNECT][BUILD] Upgrade 
buf to v1.15.1
URL: https://github.com/apache/spark/pull/40348



[GitHub] [spark] jerqi commented on a diff in pull request #40339: [SPARK-42719][CORE] `MapOutputTracker#getMapLocation` should respect `spark.shuffle.reduceLocality.enabled`

2023-03-08 Thread via GitHub


jerqi commented on code in PR #40339:
URL: https://github.com/apache/spark/pull/40339#discussion_r1130592715


##
core/src/test/scala/org/apache/spark/MapOutputTrackerSuite.scala:
##
@@ -1030,4 +1030,21 @@ class MapOutputTrackerSuite extends SparkFunSuite with LocalSparkContext {
     rpcEnv.shutdown()
     assert(npeCounter.intValue() == 0)
   }
+
+  test("SPARK-42719: `MapOutputTracker#getMapLocation` should respect the config option") {
+    val rpcEnv = createRpcEnv("test")
+    val newConf = new SparkConf
+    newConf.set(SHUFFLE_REDUCE_LOCALITY_ENABLE, false)
+    val tracker = newTrackerMaster(newConf)
+    tracker.trackerEndpoint = rpcEnv.setupEndpoint(MapOutputTracker.ENDPOINT_NAME,
+      new MapOutputTrackerMasterEndpoint(rpcEnv, tracker, newConf))
+    tracker.registerShuffle(10, 6, 1)
+    tracker.registerMapOutput(10, 0, MapStatus(BlockManagerId("a", "hostA", 1000),
+      Array(2L), 5))
+    val mockShuffleDep = mock(classOf[ShuffleDependency[Int, Int, _]])
+    when(mockShuffleDep.shuffleId).thenReturn(10)
+    assert(tracker.getMapLocation(mockShuffleDep, 0, 1) === Nil)
+    tracker.stop()

Review Comment:
   Added.




[GitHub] [spark] allanf-db commented on a diff in pull request #40324: [WIP][SPARK-42496][CONNECT][DOCS] Adding Spark Connect to the Spark 3.4 documentation

2023-03-08 Thread via GitHub


allanf-db commented on code in PR #40324:
URL: https://github.com/apache/spark/pull/40324#discussion_r1130590417


##
docs/spark-connect-overview.md:
##
@@ -0,0 +1,108 @@
+---
+layout: global
+title: Spark Connect Overview - Building client-side Spark applications
+license: |
+  Licensed to the Apache Software Foundation (ASF) under one or more
+  contributor license agreements.  See the NOTICE file distributed with
+  this work for additional information regarding copyright ownership.
+  The ASF licenses this file to You under the Apache License, Version 2.0
+  (the "License"); you may not use this file except in compliance with
+  the License.  You may obtain a copy of the License at
+ 
+ http://www.apache.org/licenses/LICENSE-2.0
+ 
+  Unless required by applicable law or agreed to in writing, software
+  distributed under the License is distributed on an "AS IS" BASIS,
+  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+  See the License for the specific language governing permissions and
+  limitations under the License.
+---
+
+In Apache Spark 3.4, Spark Connect introduced a decoupled client-server 
architecture that allows remote connectivity to Spark clusters using the 
DataFrame API and unresolved logical plans as the protocol. The separation 
between client and server allows Spark and its open ecosystem to be leveraged 
from everywhere. It can be embedded in modern data applications, in IDEs, 
Notebooks and programming languages.
+
+
+(architecture diagram omitted from this extract)
+
+
+# How Spark Connect Works
+
+The Spark Connect client library is designed to simplify Spark application 
development. It is a thin API that can be embedded everywhere: in application 
servers, IDEs, notebooks, and programming languages. The Spark Connect API 
builds on Spark's DataFrame API using unresolved logical plans as a 
language-agnostic protocol between the client and the Spark driver.
+
+The Spark Connect client translates DataFrame operations into unresolved 
logical query plans which are encoded using protocol buffers. These are sent to 
the server using the gRPC framework.
+
+The Spark Connect endpoint embedded on the Spark Server, receives and 
translates unresolved logical plans into Spark's logical plan operators. This 
is similar to parsing a SQL query, where attributes and relations are parsed 
and an initial parse plan is built. From there, the standard Spark execution 
process kicks in, ensuring that Spark Connect leverages all of Spark's 
optimizations and enhancements. Results are streamed back to the client via 
gRPC as Apache Arrow-encoded row batches.
+
+
+(architecture diagram omitted from this extract)
+
+
+# Operational Benefits of Spark Connect
+
+With this new architecture, Spark Connect mitigates several operational issues:
+
+**Stability**: Applications that use too much memory will now only impact 
their own environment as they can run in their own processes. Users can define 
their own dependencies on the client and don't need to worry about potential 
conflicts with the Spark driver.
+
+**Upgradability**: The Spark driver can now seamlessly be upgraded 
independently of applications, e.g. to benefit from performance improvements 
and security fixes. This means applications can be forward-compatible, as long 
as the server-side RPC definitions are designed to be backwards compatible.
+
+**Debuggability and Observability**: Spark Connect enables interactive 
debugging during development directly from your favorite IDE. Similarly, 
applications can be monitored using the application's framework native metrics 
and logging libraries.
+
+# How to use Spark Connect
+
+Starting with Spark 3.4, Spark Connect is available and supports PySpark 
applications. When creating a Spark session, you can specify that you want to 
use Spark Connect and there are a few ways to do that as outlined below.
+
+If you do not use one of the mechanisms outlined below, your Spark session 
will work just like before, without leveraging Spark Connect, and your 
application code will run on the Spark driver node.
+
+## Set SPARK_REMOTE environment variable
+
+If you set the SPARK_REMOTE environment variable on the client machine where 
your Spark client application is running and create a new Spark Session as 
illustrated below, the session will be a Spark Connect session. With this 
approach, there is no code change needed to start using Spark Connect.
+
+Set SPARK_REMOTE environment variable:
+
+{% highlight bash %}
+export SPARK_REMOTE="sc://localhost/"
+{% endhighlight %}

Review Comment:
   Fixed
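
For comparison with the SPARK_REMOTE approach shown in the diff, the connection can also be made explicitly. A sketch against the 3.4-era Spark Connect Scala client (the builder methods below are that client's, an assumption worth verifying against the final docs):

    import org.apache.spark.sql.SparkSession

    // "sc://localhost" uses the same endpoint format as SPARK_REMOTE
    val spark = SparkSession.builder().remote("sc://localhost").build()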




[GitHub] [spark] cloud-fan commented on a diff in pull request #40339: [SPARK-42719][CORE] `MapOutputTracker#getMapLocation` should respect `spark.shuffle.reduceLocality.enabled`

2023-03-08 Thread via GitHub


cloud-fan commented on code in PR #40339:
URL: https://github.com/apache/spark/pull/40339#discussion_r1130589059


##
core/src/test/scala/org/apache/spark/MapOutputTrackerSuite.scala:
##
@@ -1030,4 +1030,21 @@ class MapOutputTrackerSuite extends SparkFunSuite with LocalSparkContext {
     rpcEnv.shutdown()
     assert(npeCounter.intValue() == 0)
   }
+
+  test("SPARK-42719: `MapOutputTracker#getMapLocation` should respect the config option") {
+    val rpcEnv = createRpcEnv("test")
+    val newConf = new SparkConf
+    newConf.set(SHUFFLE_REDUCE_LOCALITY_ENABLE, false)
+    val tracker = newTrackerMaster(newConf)
+    tracker.trackerEndpoint = rpcEnv.setupEndpoint(MapOutputTracker.ENDPOINT_NAME,
+      new MapOutputTrackerMasterEndpoint(rpcEnv, tracker, newConf))
+    tracker.registerShuffle(10, 6, 1)
+    tracker.registerMapOutput(10, 0, MapStatus(BlockManagerId("a", "hostA", 1000),
+      Array(2L), 5))
+    val mockShuffleDep = mock(classOf[ShuffleDependency[Int, Int, _]])
+    when(mockShuffleDep.shuffleId).thenReturn(10)
+    assert(tracker.getMapLocation(mockShuffleDep, 0, 1) === Nil)
+    tracker.stop()

Review Comment:
   +1




[GitHub] [spark] cloud-fan commented on a diff in pull request #40308: [SPARK-42151][SQL] Align UPDATE assignments with table attributes

2023-03-08 Thread via GitHub


cloud-fan commented on code in PR #40308:
URL: https://github.com/apache/spark/pull/40308#discussion_r1130585454


##
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala:
##
@@ -3344,43 +3345,6 @@ class Analyzer(override val catalogManager: 
CatalogManager) extends RuleExecutor
 } else {
   v2Write
 }
-
-  case u: UpdateTable if !u.skipSchemaResolution && u.resolved =>

Review Comment:
   I think a runtime null check is the right direction if we want to unify, but we can collect feedback from more people.
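
As a rough illustration of the runtime-null-check direction (an assumption about the idea under discussion, not the PR's code), Catalyst can defer nullability errors to execution by wrapping the assigned value:

    import org.apache.spark.sql.catalyst.expressions.Literal
    import org.apache.spark.sql.catalyst.expressions.objects.AssertNotNull

    val assigned = Literal(null)           // value being written to a NOT NULL column
    val checked  = AssertNotNull(assigned) // evaluating this at runtime raises a null error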




[GitHub] [spark] cloud-fan commented on a diff in pull request #40300: [SPARK-42683] Automatically rename conflicting metadata columns

2023-03-08 Thread via GitHub


cloud-fan commented on code in PR #40300:
URL: https://github.com/apache/spark/pull/40300#discussion_r1130583100


##
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/LogicalPlan.scala:
##
@@ -340,13 +358,26 @@ trait ExposesMetadataColumns extends LogicalPlan {
   val resolve = conf.resolver
   val outputNames = outputSet.map(_.name)
 
-  def isOutputColumn(col: AttributeReference): Boolean = {
-    outputNames.exists(name => resolve(col.name, name))
+  // Generate a unique name by prepending underscores.
+  @scala.annotation.tailrec
+  def makeUnique(name: String): String = name match {
+    case name if outputNames.exists(resolve(_, name)) => makeUnique(s"_$name")
+    case name => name
+  }
+
+  // Rename metadata struct columns whose names conflict with output columns.

Review Comment:
   is it easy to fix?




[GitHub] [spark] LuciferYang commented on a diff in pull request #40341: [SPARK-42715][SQL] Tips for Optimizing NegativeArraySizeException

2023-03-08 Thread via GitHub


LuciferYang commented on code in PR #40341:
URL: https://github.com/apache/spark/pull/40341#discussion_r1130576248


##
sql/core/src/main/java/org/apache/spark/sql/execution/datasources/orc/OrcColumnarBatchReader.java:
##
@@ -204,7 +204,12 @@ public void initBatch(
* by copying from ORC VectorizedRowBatch columns to Spark ColumnarBatch 
columns.
*/
   private boolean nextBatch() throws IOException {
-recordReader.nextBatch(wrap.batch());
+try {

Review Comment:
   Will Parquet have the same issue?
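
The truncated hunk above wraps the ORC batch read in a try block. An illustrative sketch of that shape (the actual catch body and error message are the PR's and are not shown here; the config name is real, the wording is hypothetical):

    try {
      recordReader.nextBatch(wrap.batch())
    } catch {
      case e: NegativeArraySizeException =>
        throw new java.io.IOException(
          "Batch allocation overflowed; consider lowering spark.sql.orc.columnarReaderBatchSize", e)
    }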
   
   




[GitHub] [spark] LuciferYang commented on a diff in pull request #40326: [SPARK-42708] [Docs] Improve doc about protobuf java file can't be indexed.

2023-03-08 Thread via GitHub


LuciferYang commented on code in PR #40326:
URL: https://github.com/apache/spark/pull/40326#discussion_r1130573900


##
connector/protobuf/README.md:
##
@@ -34,3 +34,17 @@ export SPARK_PROTOC_EXEC_PATH=/path-to-protoc-exe
 
 The user-defined `protoc` binary files can be produced in the user's 
compilation environment by source code compilation, 
 for compilation steps, please refer to 
[protobuf](https://github.com/protocolbuffers/protobuf).
+
+### Work with IDE

Review Comment:
   1. I'm not sure whether this item should be added.
   2. On the other hand, many modules use protobuf, such as `core` (in fact, the `StoreTypes` classes generated in the `core` module already exceed IntelliJ's default limit) and `connect-common`, so if we do add this, it should go in a more general place instead of the `protobuf` module.
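
For reference, the workaround usually documented for this (an assumption about what the doc suggests, not a quote from it) is raising IntelliJ's code-insight file-size cap via Help > Edit Custom Properties:

    idea.max.intellisense.filesize=8192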




[GitHub] [spark] rangadi commented on pull request #40342: [SPARK-42721][CONNECT] RPC logging interceptor

2023-03-08 Thread via GitHub


rangadi commented on PR #40342:
URL: https://github.com/apache/spark/pull/40342#issuecomment-1461455759

   @LuciferYang thanks. Fixed.
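
For readers skimming the thread: a server-side gRPC logging interceptor, the kind of component this PR adds, can be sketched as follows (illustrative only, not the PR's implementation):

    import io.grpc.{Metadata, ServerCall, ServerCallHandler, ServerInterceptor}

    class LoggingInterceptor extends ServerInterceptor {
      override def interceptCall[ReqT, RespT](
          call: ServerCall[ReqT, RespT],
          headers: Metadata,
          next: ServerCallHandler[ReqT, RespT]): ServerCall.Listener[ReqT] = {
        // log the RPC method name before handing off to the real handler
        println(s"RPC: ${call.getMethodDescriptor.getFullMethodName}")
        next.startCall(call, headers)
      }
    }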



[GitHub] [spark] mridulm commented on pull request #40307: [SPARK-42689][CORE][SHUFFLE] Allow ShuffleDriverComponent to declare if shuffle data is reliably stored

2023-03-08 Thread via GitHub


mridulm commented on PR #40307:
URL: https://github.com/apache/spark/pull/40307#issuecomment-1461449991

   Thanks for all the reviews @dongjoon-hyun, @otterc, @jerqi, @pan3793 and 
@waitinfuture :-)



[GitHub] [spark] LuciferYang commented on a diff in pull request #40339: [SPARK-42719][CORE] `MapOutputTracker#getMapLocation` should respect `spark.shuffle.reduceLocality.enabled`

2023-03-08 Thread via GitHub


LuciferYang commented on code in PR #40339:
URL: https://github.com/apache/spark/pull/40339#discussion_r1130569496


##
core/src/test/scala/org/apache/spark/MapOutputTrackerSuite.scala:
##
@@ -1030,4 +1030,21 @@ class MapOutputTrackerSuite extends SparkFunSuite with 
LocalSparkContext {
 rpcEnv.shutdown()
 assert(npeCounter.intValue() == 0)
   }
+
+  test("SPARK-42719: `MapOutputTracker#getMapLocation` should respect the 
config option") {
+val rpcEnv = createRpcEnv("test")
+val newConf = new SparkConf
+newConf.set(SHUFFLE_REDUCE_LOCALITY_ENABLE, false)
+val tracker = newTrackerMaster(newConf)
+tracker.trackerEndpoint = 
rpcEnv.setupEndpoint(MapOutputTracker.ENDPOINT_NAME,
+  new MapOutputTrackerMasterEndpoint(rpcEnv, tracker, newConf))
+tracker.registerShuffle(10, 6, 1)
+tracker.registerMapOutput(10, 0, MapStatus(BlockManagerId("a", "hostA", 
1000),
+  Array(2L), 5))
+val mockShuffleDep = mock(classOf[ShuffleDependency[Int, Int, _]])
+when(mockShuffleDep.shuffleId).thenReturn(10)
+assert(tracker.getMapLocation(mockShuffleDep, 0, 1) === Nil)
+tracker.stop()

Review Comment:
   I think we should always release the resource in a `finally` block, although other cases in this suite do not use `try`/`finally`.
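
A sketch of the suggested shape, reusing the names from the quoted test, so resources are released even when an assertion throws:

    val tracker = newTrackerMaster(newConf)
    try {
      // register shuffle and map output, then assert on getMapLocation ...
      assert(tracker.getMapLocation(mockShuffleDep, 0, 1) === Nil)
    } finally {
      tracker.stop()
      rpcEnv.shutdown()
    }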




[GitHub] [spark] mridulm closed pull request #40307: [SPARK-42689][CORE][SHUFFLE] Allow ShuffleDriverComponent to declare if shuffle data is reliably stored

2023-03-08 Thread via GitHub


mridulm closed pull request #40307: [SPARK-42689][CORE][SHUFFLE] Allow 
ShuffleDriverComponent to declare if shuffle data is reliably stored
URL: https://github.com/apache/spark/pull/40307



[GitHub] [spark] mridulm commented on pull request #40307: [SPARK-42689][CORE][SHUFFLE] Allow ShuffleDriverComponent to declare if shuffle data is reliably stored

2023-03-08 Thread via GitHub


mridulm commented on PR #40307:
URL: https://github.com/apache/spark/pull/40307#issuecomment-1461445933

   All tests passed, merging to master.



[GitHub] [spark] LuciferYang commented on pull request #40342: [SPARK-42721][CONNECT] RPC logging interceptor

2023-03-08 Thread via GitHub


LuciferYang commented on PR #40342:
URL: https://github.com/apache/spark/pull/40342#issuecomment-1461439265

   Need to run `./build/mvn -Pscala-2.12 scalafmt:format -Dscalafmt.skip=false -Dscalafmt.validateOnly=false -Dscalafmt.changedOnly=false -pl connector/connect/common -pl connector/connect/server -pl connector/connect/client/jvm` to format the code in the connect modules.



[GitHub] [spark] LuciferYang commented on pull request #40332: [SPARK-42690][CONNECT] Implement CSV/JSON parsing functions for Scala client

2023-03-08 Thread via GitHub


LuciferYang commented on PR #40332:
URL: https://github.com/apache/spark/pull/40332#issuecomment-1461435665

   Thanks @zhengruifeng @hvanhovell @HyukjinKwon  ~



[GitHub] [spark] zhengruifeng commented on pull request #40332: [SPARK-42690][CONNECT] Implement CSV/JSON parsing functions for Scala client

2023-03-08 Thread via GitHub


zhengruifeng commented on PR #40332:
URL: https://github.com/apache/spark/pull/40332#issuecomment-1461430958

   @LuciferYang thank you for working on this.
   
   merged into master/branch-3.4



[GitHub] [spark] zhengruifeng closed pull request #40332: [SPARK-42690][CONNECT] Implement CSV/JSON parsing functions for Scala client

2023-03-08 Thread via GitHub


zhengruifeng closed pull request #40332: [SPARK-42690][CONNECT] Implement 
CSV/JSON parsing functions for Scala client
URL: https://github.com/apache/spark/pull/40332



[GitHub] [spark] zhengruifeng opened a new pull request, #40349: [SPARK-42725][CONNECT][PYTHON] Make LiteralExpression support array

2023-03-08 Thread via GitHub


zhengruifeng opened a new pull request, #40349:
URL: https://github.com/apache/spark/pull/40349

   ### What changes were proposed in this pull request?
   Make LiteralExpression support array
   
   
   ### Why are the changes needed?
   MLlib requires literal arrays to carry params such as `IntArrayParam` and `DoubleArrayArrayParam`.
   
   Note that this PR doesn't affect the existing `functions.lit` method, which applies the `CreateArray` expression to support array input.
   
   ### Does this PR introduce _any_ user-facing change?
   No, dev-only
   
   
   ### How was this patch tested?
   added UT
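
A sketch of the motivating ML usage (Bucketizer is chosen here purely as an illustration; its splits are a DoubleArrayParam whose array value must travel to the server):

    import org.apache.spark.ml.feature.Bucketizer

    val bucketizer = new Bucketizer()
      .setSplits(Array(Double.NegativeInfinity, 0.0, Double.PositiveInfinity))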



[GitHub] [spark] liang3zy22 commented on pull request #40347: [SPARK-42711][BUILD] Update usage info for sbt tool

2023-03-08 Thread via GitHub


liang3zy22 commented on PR #40347:
URL: https://github.com/apache/spark/pull/40347#issuecomment-1461364371

   Thanks for the comment, @yaooqinn. The build/sbt script seems to be a simplified sbt script. I think the hint info for the current java argument is not correct; to be honest, I think it is a copy-paste error from the hint info for the sbt_version argument. I can create a PR in the sbt/sbt project to see if it can be updated.



[GitHub] [spark] jerqi commented on pull request #40339: [SPARK-42719][CORE] `MapOutputTracker#getMapLocation` should respect `spark.shuffle.reduceLocality.enabled`

2023-03-08 Thread via GitHub


jerqi commented on PR #40339:
URL: https://github.com/apache/spark/pull/40339#issuecomment-1461353825

   The test failure is unrelated; I retriggered the test.



[GitHub] [spark] MaxGekk closed pull request #40340: [SPARK-42701][SQL] Add the `try_aes_decrypt()` function

2023-03-08 Thread via GitHub


MaxGekk closed pull request #40340: [SPARK-42701][SQL] Add the 
`try_aes_decrypt()` function
URL: https://github.com/apache/spark/pull/40340



[GitHub] [spark] MaxGekk commented on pull request #40340: [SPARK-42701][SQL] Add the `try_aes_decrypt()` function

2023-03-08 Thread via GitHub


MaxGekk commented on PR #40340:
URL: https://github.com/apache/spark/pull/40340#issuecomment-1461349007

   Merging to master. Thank you, @srielau and @HyukjinKwon for review.



[GitHub] [spark] viirya commented on a diff in pull request #40300: [SPARK-42683] Automatically rename conflicting metadata columns

2023-03-08 Thread via GitHub


viirya commented on code in PR #40300:
URL: https://github.com/apache/spark/pull/40300#discussion_r1130515459


##
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/LogicalPlan.scala:
##
@@ -340,13 +358,26 @@ trait ExposesMetadataColumns extends LogicalPlan {
   val resolve = conf.resolver
   val outputNames = outputSet.map(_.name)
 
-  def isOutputColumn(col: AttributeReference): Boolean = {
-    outputNames.exists(name => resolve(col.name, name))
+  // Generate a unique name by prepending underscores.
+  @scala.annotation.tailrec
+  def makeUnique(name: String): String = name match {
+    case name if outputNames.exists(resolve(_, name)) => makeUnique(s"_$name")
+    case name => name
+  }
+
+  // Rename metadata struct columns whose names conflict with output columns.

Review Comment:
   Looks like `getProjection` wrongly matches attributes. It tries to get the pruned datatype from `schema` by looking it up by field name. `StructType` maintains a mapping from field name to `StructField`, not from expr id to `StructField`. If there are duplicate columns, it possibly picks the incorrect attribute.
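
A minimal sketch of the name-based lookup problem described above (`StructType` resolves fields by name, so duplicate names cannot be told apart):

    import org.apache.spark.sql.types._

    val schema = StructType(Seq(
      StructField("dupe", IntegerType),
      StructField("dupe", LongType)))
    schema("dupe")  // resolves by name; only one of the two fields can ever come back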




[GitHub] [spark] allanf-db commented on a diff in pull request #40324: [WIP][SPARK-42496][CONNECT][DOCS] Adding Spark Connect to the Spark 3.4 documentation

2023-03-08 Thread via GitHub


allanf-db commented on code in PR #40324:
URL: https://github.com/apache/spark/pull/40324#discussion_r1130508482


##
docs/index.md:
##
@@ -49,8 +49,19 @@ For Java 11, `-Dio.netty.tryReflectionSetAccessible=true` is required additional
 
 # Running the Examples and Shell
 
-Spark comes with several sample programs.  Scala, Java, Python and R examples 
are in the
-`examples/src/main` directory. To run one of the Java or Scala sample 
programs, use
+Spark comes with several sample programs. Python, Scala, Java and R examples 
are in the
+`examples/src/main` directory.
+
+To run Spark interactively in a Python interpreter, use
+`bin/pyspark`:
+
+./bin/pyspark --master local[2]

Review Comment:
   I tried on my machine and it fails for me as well, so I removed "--master local[2]" from these examples; now they work for me.




[GitHub] [spark] yaooqinn commented on pull request #40347: [SPARK-42711][BUILD] Update usage info for sbt tool

2023-03-08 Thread via GitHub


yaooqinn commented on PR #40347:
URL: https://github.com/apache/spark/pull/40347#issuecomment-1461324040

   This script is just a fork from the sbt repo, and the hint info for the current java seems correct to me. FYI, https://github.com/sbt/sbt/blob/1.9.x/sbt#L576



[GitHub] [spark] yaooqinn commented on pull request #40335: [SPARK-42717][BUILD] Upgrade mysql-connector-java from 8.0.31 to 8.0.32

2023-03-08 Thread via GitHub


yaooqinn commented on PR #40335:
URL: https://github.com/apache/spark/pull/40335#issuecomment-1461318716

   thanks, merged to master



[GitHub] [spark] yaooqinn closed pull request #40335: [SPARK-42717][BUILD] Upgrade mysql-connector-java from 8.0.31 to 8.0.32

2023-03-08 Thread via GitHub


yaooqinn closed pull request #40335: [SPARK-42717][BUILD] Upgrade 
mysql-connector-java from 8.0.31 to 8.0.32
URL: https://github.com/apache/spark/pull/40335



[GitHub] [spark] yaooqinn commented on pull request #40313: [SPARK-42697][WEBUI] Fix /api/v1/applications to return total uptime instead of 0 for the duration field

2023-03-08 Thread via GitHub


yaooqinn commented on PR #40313:
URL: https://github.com/apache/spark/pull/40313#issuecomment-1461316501

   thanks @mridulm @srowen, merged to master/3.4/3.3/3.2



[GitHub] [spark] srowen commented on pull request #40258: [SPARK-42655][SQL] Incorrect ambiguous column reference error

2023-03-08 Thread via GitHub


srowen commented on PR #40258:
URL: https://github.com/apache/spark/pull/40258#issuecomment-1461314871

   Hm, I just don't see the logic in that. It isn't how SQL works either, as far as I understand. Here's maybe another example: imagine a DataFrame defined by `SELECT 3 as id, 3 as ID`. Would you also say selecting "id" is unambiguous? And does it make sense to you that, if I change a 3 to a 4, this query is no longer semantically valid?
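
srowen's example as a runnable snippet (assuming an active `spark` session and case-insensitive analysis):

    val df = spark.sql("SELECT 3 AS id, 3 AS ID")
    df.select("id")  // which column is meant? Changing one literal to 4 makes the ambiguity visible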



[GitHub] [spark] yaooqinn closed pull request #40313: [SPARK-42697][WEBUI] Fix /api/v1/applications to return total uptime instead of 0 for the duration field

2023-03-08 Thread via GitHub


yaooqinn closed pull request #40313: [SPARK-42697][WEBUI] Fix 
/api/v1/applications to return total uptime instead of 0 for the duration field
URL: https://github.com/apache/spark/pull/40313



[GitHub] [spark] shrprasa commented on pull request #40258: [SPARK-42655][SQL] Incorrect ambiguous column reference error

2023-03-08 Thread via GitHub


shrprasa commented on PR #40258:
URL: https://github.com/apache/spark/pull/40258#issuecomment-1461303721

   It's very much relevant, as this is the only case which requires the fix. If they do not come from the same source, the plan will reflect that and it will throw the ambiguous error even after this fix.



[GitHub] [spark] srowen commented on pull request #40258: [SPARK-42655][SQL] Incorrect ambiguous column reference error

2023-03-08 Thread via GitHub


srowen commented on PR #40258:
URL: https://github.com/apache/spark/pull/40258#issuecomment-1461291351

   That isn't relevant. You are selecting from a DataFrame with columns id and ID. Imagine, for instance, that they do not come from the same source: it's clearly ambiguous. It wouldn't make sense if it were different in this case.



[GitHub] [spark] shrprasa commented on pull request #40258: [SPARK-42655][SQL] Incorrect ambiguous column reference error

2023-03-08 Thread via GitHub


shrprasa commented on PR #40258:
URL: https://github.com/apache/spark/pull/40258#issuecomment-1461288036

   > Hm, how is it not ambiguous? When case insensitive, 'id' could mean one of 
two different columns
   
   It's not ambiguous because, when we are selecting using a list of column names, both id and ID get their value from the same column 'id' in the source dataframe:
   
   val df1 = sc.parallelize(List((1,2,3,4,5),(1,2,3,4,5))).toDF("id","col2","col3","col4","col5")
   val op_cols_mixed_case = List("id","col2","col3","col4","col5","ID")
   val df3 = df1.select(op_cols_mixed_case.head, op_cols_mixed_case.tail: _*)
   df3.select("id").show()
   
   df3.explain()
   == Physical Plan ==
   *(1) Project [_1#6 AS id#17, _2#7 AS col2#18, _3#8 AS col3#19, _4#9 AS col4#20, _5#10 AS col5#21, _1#6 AS ID#17]



[GitHub] [spark] allanf-db commented on a diff in pull request #40324: [WIP][SPARK-42496][CONNECT][DOCS] Adding Spark Connect to the Spark 3.4 documentation

2023-03-08 Thread via GitHub


allanf-db commented on code in PR #40324:
URL: https://github.com/apache/spark/pull/40324#discussion_r1130437989


##
docs/index.md:
##
@@ -86,6 +88,15 @@ Example applications are also provided in R. For example,
 
 ./bin/spark-submit examples/src/main/r/dataframe.R
 
+## Running Spark Client Applications Anywhere with Spark Connect

Review Comment:
   Fixed




[GitHub] [spark] srowen commented on pull request #40258: [SPARK-42655][SQL] Incorrect ambiguous column reference error

2023-03-08 Thread via GitHub


srowen commented on PR #40258:
URL: https://github.com/apache/spark/pull/40258#issuecomment-1461279768

   Hm, how is it not ambiguous? When case-insensitive, 'id' could mean one of two different columns.



[GitHub] [spark] mridulm commented on a diff in pull request #40313: [SPARK-42697][WEBUI] Fix /api/v1/applications to return total uptime instead of 0 for the duration field

2023-03-08 Thread via GitHub


mridulm commented on code in PR #40313:
URL: https://github.com/apache/spark/pull/40313#discussion_r1130428282


##
core/src/main/scala/org/apache/spark/ui/SparkUI.scala:
##
@@ -167,7 +167,7 @@ private[spark] class SparkUI private (
 attemptId = None,
 startTime = new Date(startTime),
 endTime = new Date(-1),
-duration = 0,
+duration = System.currentTimeMillis() - startTime,

Review Comment:
   Sounds good, thanks for confirming!




[GitHub] [spark] itholic commented on a diff in pull request #40324: [WIP][SPARK-42496][CONNECT][DOCS] Adding Spark Connect to the Spark 3.4 documentation

2023-03-08 Thread via GitHub


itholic commented on code in PR #40324:
URL: https://github.com/apache/spark/pull/40324#discussion_r1130421548


##
docs/index.md:
##
@@ -49,8 +49,19 @@ For Java 11, `-Dio.netty.tryReflectionSetAccessible=true` is required additional
 
 # Running the Examples and Shell
 
-Spark comes with several sample programs.  Scala, Java, Python and R examples 
are in the
-`examples/src/main` directory. To run one of the Java or Scala sample 
programs, use
+Spark comes with several sample programs. Python, Scala, Java and R examples 
are in the
+`examples/src/main` directory.
+
+To run Spark interactively in a Python interpreter, use
+`bin/pyspark`:
+
+./bin/pyspark --master local[2]

Review Comment:
   Yeah, I think we can just update this to `./bin/pyspark --master local` or simply `./bin/pyspark` while we're here :-)
   
   Either way looks fine to me if it works properly (I checked that both work fine in my local env).




[GitHub] [spark] mskapilks commented on pull request #40266: [SPARK-42660][SQL] Infer filters for Join produced by IN and EXISTS clause (RewritePredicateSubquery rule)

2023-03-08 Thread via GitHub


mskapilks commented on PR #40266:
URL: https://github.com/apache/spark/pull/40266#issuecomment-1461267270

   cc: @wangyum @peter-toth 



[GitHub] [spark] itholic commented on a diff in pull request #40324: [WIP][SPARK-42496][CONNECT][DOCS] Adding Spark Connect to the Spark 3.4 documentation

2023-03-08 Thread via GitHub


itholic commented on code in PR #40324:
URL: https://github.com/apache/spark/pull/40324#discussion_r1130418343


##
docs/index.md:
##
@@ -86,6 +88,16 @@ Example applications are also provided in R. For example,
 
 ./bin/spark-submit examples/src/main/r/dataframe.R
 
+## Running Spark Client Applications Anywhere with Spark Connect
+
+Spark Connect is a new client-server architecture introduced in Spark 3.4 that 
decouples Spark
+client applications and allows remote connectivity to Spark clusters. The 
separation between
+client and server allows Spark and its open ecosystem to be leveraged from 
anywhere, embedded
+in any application. In Spark 3.4, Spark Connect provides DataFrame API 
coverage for PySpark.
+Spark Connect clients for other languages are planned for the future.
+
+To learn more about Spark Connect and how to use it, visit [Spark Connect 
Overview](spark-connect-overview.html).

Review Comment:
   qq: It seems that the only way to access the Spark Connect Overview is through the link provided at this point.
   Do you think that's enough, or should we maybe also place the link in a more prominent location to expose it more actively to users?




[GitHub] [spark] cloud-fan commented on a diff in pull request #40300: [SPARK-42683] Automatically rename conflicting metadata columns

2023-03-08 Thread via GitHub


cloud-fan commented on code in PR #40300:
URL: https://github.com/apache/spark/pull/40300#discussion_r1130417773


##
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/LogicalPlan.scala:
##
@@ -340,13 +358,26 @@ trait ExposesMetadataColumns extends LogicalPlan {
   val resolve = conf.resolver
   val outputNames = outputSet.map(_.name)
 
-  def isOutputColumn(col: AttributeReference): Boolean = {
-    outputNames.exists(name => resolve(col.name, name))
+  // Generate a unique name by prepending underscores.
+  @scala.annotation.tailrec
+  def makeUnique(name: String): String = name match {
+    case name if outputNames.exists(resolve(_, name)) => makeUnique(s"_$name")
+    case name => name
+  }
+
+  // Rename metadata struct columns whose names conflict with output columns.

Review Comment:
   @viirya this seems like a bug in nested column pruning




[GitHub] [spark] LuciferYang commented on a diff in pull request #40332: [SPARK-42690][CONNECT] Implement CSV/JSON parsing functions for Scala client

2023-03-08 Thread via GitHub


LuciferYang commented on code in PR #40332:
URL: https://github.com/apache/spark/pull/40332#discussion_r1130414814


##
connector/connect/client/jvm/src/main/scala/org/apache/spark/sql/DataFrameReader.scala:
##
@@ -504,6 +542,18 @@ class DataFrameReader private[sql] (sparkSession: SparkSession) extends Logging
     }
   }
 
+  private def parse(ds: Dataset[String], format: ParseFormat): DataFrame = {
+    sparkSession.newDataFrame { builder =>
+      val parseBuilder = builder.getParseBuilder
+        .setInput(ds.plan.getRoot)
+        .setFormat(format)
+      userSpecifiedSchema.foreach(schema => parseBuilder.setSchema(schema.toDDL))

Review Comment:
   In this scenario, I think the `DataType` schema is used directly and `schema.toDDL` is no longer needed.
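
A sketch of that direction: set the proto schema from the `DataType` directly instead of round-tripping through a DDL string (the converter named below exists elsewhere in the Connect code; its use here is an assumption):

    import org.apache.spark.sql.connect.common.DataTypeProtoConverter

    userSpecifiedSchema.foreach { schema =>
      parseBuilder.setSchema(DataTypeProtoConverter.toConnectProtoType(schema))
    }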
   
   




[GitHub] [spark] shrprasa commented on pull request #40258: [SPARK-42655][SQL] Incorrect ambiguous column reference error

2023-03-08 Thread via GitHub


shrprasa commented on PR #40258:
URL: https://github.com/apache/spark/pull/40258#issuecomment-1461262849

   > I don't get it, it is due to case sensitivity; that's why it becomes 
ambiguous and that's what you see. The issue is that the error isn't super 
helpful because it shows the lower-cased column right? that's what I was 
saying. Or: does your change still result in an error without case sensitivity? 
it should
   
   The issue is not with the error message. The problem is that in this case the error should not be thrown; the select query should return a result. After this change, the ambiguous error will not be thrown, as we are fixing the duplicate attribute match.



[GitHub] [spark] LuciferYang commented on a diff in pull request #40332: [SPARK-42690][CONNECT] Implement CSV/JSON parsing functions for Scala client

2023-03-08 Thread via GitHub


LuciferYang commented on code in PR #40332:
URL: https://github.com/apache/spark/pull/40332#discussion_r1130412260


##
connector/connect/client/jvm/src/main/scala/org/apache/spark/sql/DataFrameReader.scala:
##
@@ -504,6 +542,18 @@ class DataFrameReader private[sql] (sparkSession: SparkSession) extends Logging
     }
   }
 
+  private def parse(ds: Dataset[String], format: ParseFormat): DataFrame = {
+    sparkSession.newDataFrame { builder =>
+      val parseBuilder = builder.getParseBuilder
+        .setInput(ds.plan.getRoot)
+        .setFormat(format)
+      userSpecifiedSchema.foreach(schema => parseBuilder.setSchema(schema.toDDL))

Review Comment:
   
[ca6ec7b](https://github.com/apache/spark/pull/40332/commits/ca6ec7b4651b3e9c69bc96240687840179b49ff0) renames `data_type` to `schema` in the proto message.




[GitHub] [spark] AngersZhuuuu commented on pull request #40314: [SPARK-42698][CORE] SparkSubmit should pass exitCode to AM side for yarn mode

2023-03-08 Thread via GitHub


AngersZhuuuu commented on PR #40314:
URL: https://github.com/apache/spark/pull/40314#issuecomment-1461250566

   > Does YARN still have this issue with Spark 3.4?
   
   I didn't see such a fix in the current code.



[GitHub] [spark] yaooqinn commented on a diff in pull request #40313: [SPARK-42697][WEBUI] Fix /api/v1/applications to return total uptime instead of 0 for the duration field

2023-03-08 Thread via GitHub


yaooqinn commented on code in PR #40313:
URL: https://github.com/apache/spark/pull/40313#discussion_r1130397872


##
core/src/main/scala/org/apache/spark/ui/SparkUI.scala:
##
@@ -167,7 +167,7 @@ private[spark] class SparkUI private (
 attemptId = None,
 startTime = new Date(startTime),
 endTime = new Date(-1),
-duration = 0,
+duration = System.currentTimeMillis() - startTime,

Review Comment:
   > `SparkUI` is used from history server
   
   Only the method `create` from the companion object of `SparkUI` is referenced by the history server. The classes `SparkUI` and `HistoryServer` both inherit from `WebUI`; they don't share functionality with each other.




[GitHub] [spark] amaliujia commented on pull request #40346: [SPARK-42667][CONNECT][FOLLOW-UP] SparkSession created by newSession should not share the channel

2023-03-08 Thread via GitHub


amaliujia commented on PR #40346:
URL: https://github.com/apache/spark/pull/40346#issuecomment-1461241503

   In fact, the complete error message says how to fix it


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] LuciferYang commented on a diff in pull request #40332: [SPARK-42690][CONNECT] Implement CSV/JSON parsing functions for Scala client

2023-03-08 Thread via GitHub


LuciferYang commented on code in PR #40332:
URL: https://github.com/apache/spark/pull/40332#discussion_r1130391567


##
connector/connect/client/jvm/src/main/scala/org/apache/spark/sql/DataFrameReader.scala:
##
@@ -504,6 +542,18 @@ class DataFrameReader private[sql] (sparkSession: 
SparkSession) extends Logging
 }
   }
 
+  private def parse(ds: Dataset[String], format: ParseFormat): DataFrame = {
+sparkSession.newDataFrame { builder =>
+  val parseBuilder = builder.getParseBuilder
+.setInput(ds.plan.getRoot)
+.setFormat(format)
+  userSpecifiedSchema.foreach(schema => 
parseBuilder.setSchema(schema.toDDL))

Review Comment:
   
[20f1722](https://github.com/apache/spark/pull/40332/commits/20f1722c4cc3331e0c8247f4d3cae419a3e81870)
 change to pass a DataType @zhengruifeng 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] jerqi commented on pull request #40339: [SPARK-42719][CORE] `MapOutputTracker#getMapLocation` should respect `spark.shuffle.reduceLocality.enabled`

2023-03-08 Thread via GitHub


jerqi commented on PR #40339:
URL: https://github.com/apache/spark/pull/40339#issuecomment-1461236436

   thanks @mridulm 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] ueshin commented on a diff in pull request #40238: [SPARK-42633][CONNECT] Make LocalRelation take an actual schema

2023-03-08 Thread via GitHub


ueshin commented on code in PR #40238:
URL: https://github.com/apache/spark/pull/40238#discussion_r1130385693


##
connector/connect/common/src/main/protobuf/spark/connect/relations.proto:
##
@@ -353,11 +353,16 @@ message LocalRelation {
   optional bytes data = 1;
 
   // (Optional) The schema of local data.
-  // It should be either a DDL-formatted type string or a JSON string.
   //
   // The server side will update the column names and data types according to 
this schema.
   // If the 'data' is not provided, then this schema will be required.
-  optional string schema = 2;
+  oneof schema {

Review Comment:
   On second thought, if `UnparsedDataType` at #40260 is good to go, we might 
not need the DDL string anymore and can just use `DataType`.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] mridulm commented on a diff in pull request #40313: [SPARK-42697][WEBUI] Fix /api/v1/applications to return total uptime instead of 0 for the duration field

2023-03-08 Thread via GitHub


mridulm commented on code in PR #40313:
URL: https://github.com/apache/spark/pull/40313#discussion_r1130384125


##
core/src/main/scala/org/apache/spark/ui/SparkUI.scala:
##
@@ -167,7 +167,7 @@ private[spark] class SparkUI private (
 attemptId = None,
 startTime = new Date(startTime),
 endTime = new Date(-1),
-duration = 0,
+duration = System.currentTimeMillis() - startTime,

Review Comment:
   `SparkUI` is used from history server - not sure if this specific method is 
used though.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] mridulm commented on pull request #40339: [SPARK-42719][CORE] `MapOutputTracker#getMapLocation` should respect `spark.shuffle.reduceLocality.enabled`

2023-03-08 Thread via GitHub


mridulm commented on PR #40339:
URL: https://github.com/apache/spark/pull/40339#issuecomment-1461226978

   Will wait for the CI to succeed. Thanks for fixing this @jerqi !


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] yaooqinn commented on a diff in pull request #40313: [SPARK-42697][WEBUI] Fix /api/v1/applications to return total uptime instead of 0 for the duration field

2023-03-08 Thread via GitHub


yaooqinn commented on code in PR #40313:
URL: https://github.com/apache/spark/pull/40313#discussion_r1130373337


##
core/src/main/scala/org/apache/spark/ui/SparkUI.scala:
##
@@ -167,7 +167,7 @@ private[spark] class SparkUI private (
 attemptId = None,
 startTime = new Date(startTime),
 endTime = new Date(-1),
-duration = 0,
+duration = System.currentTimeMillis() - startTime,

Review Comment:
   
https://github.com/apache/spark/pull/10648/files#diff-32f7bbe4f1ae694bd35f180c2ff918ed7ce5da42f37560bbd4d5ade6c7c2783eR116
   
   It was initially merged as part of a giant PR for the HistoryServer, so I 
guess this line didn't get much attention.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] LuciferYang commented on a diff in pull request #40332: [SPARK-42690][CONNECT] Implement CSV/JSON parsing functions for Scala client

2023-03-08 Thread via GitHub


LuciferYang commented on code in PR #40332:
URL: https://github.com/apache/spark/pull/40332#discussion_r1130368098


##
connector/connect/client/jvm/src/main/scala/org/apache/spark/sql/DataFrameReader.scala:
##
@@ -504,6 +542,18 @@ class DataFrameReader private[sql] (sparkSession: 
SparkSession) extends Logging
 }
   }
 
+  private def parse(ds: Dataset[String], format: ParseFormat): DataFrame = {
+sparkSession.newDataFrame { builder =>
+  val parseBuilder = builder.getParseBuilder
+.setInput(ds.plan.getRoot)
+.setFormat(format)
+  userSpecifiedSchema.foreach(schema => 
parseBuilder.setSchema(schema.toDDL))

Review Comment:
   What do you think about this? @hvanhovell 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] gengliangwang closed pull request #40345: [SPARK-42723][SQL] Support parser data type json "timestamp_ltz" as TimestampType

2023-03-08 Thread via GitHub


gengliangwang closed pull request #40345: [SPARK-42723][SQL] Support parser 
data type json "timestamp_ltz" as TimestampType
URL: https://github.com/apache/spark/pull/40345


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] gengliangwang commented on pull request #40345: [SPARK-42723][SQL] Support parser data type json "timestamp_ltz" as TimestampType

2023-03-08 Thread via GitHub


gengliangwang commented on PR #40345:
URL: https://github.com/apache/spark/pull/40345#issuecomment-1461214457

   @HyukjinKwon Thanks for the review.
   Merging to master/3.4


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] jerqi commented on a diff in pull request #40339: [SPARK-42719][CORE] `MapOutputTracker#getMapLocation` should respect `spark.shuffle.reduceLocality.enabled`

2023-03-08 Thread via GitHub


jerqi commented on code in PR #40339:
URL: https://github.com/apache/spark/pull/40339#discussion_r1130364436


##
core/src/main/scala/org/apache/spark/MapOutputTracker.scala:
##
@@ -1113,7 +1113,7 @@ private[spark] class MapOutputTrackerMaster(
   endMapIndex: Int): Seq[String] =
   {
 val shuffleStatus = shuffleStatuses.get(dep.shuffleId).orNull
-if (shuffleStatus != null) {
+if (shuffleLocalityEnabled && shuffleStatus != null) {
   shuffleStatus.withMapStatuses { statuses =>

Review Comment:
   I got it.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] allanf-db commented on a diff in pull request #40324: [WIP][SPARK-42496][CONNECT][DOCS] Adding Spark Connect to the Spark 3.4 documentation

2023-03-08 Thread via GitHub


allanf-db commented on code in PR #40324:
URL: https://github.com/apache/spark/pull/40324#discussion_r1130361771


##
docs/spark-connect-overview.md:
##
@@ -0,0 +1,108 @@
+---
+layout: global
+title: Spark Connect Overview - Building client-side Spark applications
+license: |
+  Licensed to the Apache Software Foundation (ASF) under one or more
+  contributor license agreements.  See the NOTICE file distributed with
+  this work for additional information regarding copyright ownership.
+  The ASF licenses this file to You under the Apache License, Version 2.0
+  (the "License"); you may not use this file except in compliance with
+  the License.  You may obtain a copy of the License at
+ 
+ http://www.apache.org/licenses/LICENSE-2.0
+ 
+  Unless required by applicable law or agreed to in writing, software
+  distributed under the License is distributed on an "AS IS" BASIS,
+  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+  See the License for the specific language governing permissions and
+  limitations under the License.
+---
+
+In Apache Spark 3.4, Spark Connect introduced a decoupled client-server 
architecture that allows remote connectivity to Spark clusters using the 
DataFrame API and unresolved logical plans as the protocol. The separation 
between client and server allows Spark and its open ecosystem to be leveraged 
from everywhere. It can be embedded in modern data applications, in IDEs, 
Notebooks and programming languages.
+
+
+  
+
+
+# How Spark Connect Works
+
+The Spark Connect client library is designed to simplify Spark application 
development. It is a thin API that can be embedded everywhere: in application 
servers, IDEs, notebooks, and programming languages. The Spark Connect API 
builds on Spark's DataFrame API using unresolved logical plans as a 
language-agnostic protocol between the client and the Spark driver.
+
+The Spark Connect client translates DataFrame operations into unresolved 
logical query plans which are encoded using protocol buffers. These are sent to 
the server using the gRPC framework.
+
+The Spark Connect endpoint, embedded on the Spark Server, receives and 
translates unresolved logical plans into Spark's logical plan operators. This 
is similar to parsing a SQL query, where attributes and relations are parsed 
and an initial parse plan is built. From there, the standard Spark execution 
process kicks in, ensuring that Spark Connect leverages all of Spark's 
optimizations and enhancements. Results are streamed back to the client via 
gRPC as Apache Arrow-encoded row batches.
+
+
+  
+
+
+# Operational Benefits of Spark Connect
+
+With this new architecture, Spark Connect mitigates several operational issues:
+
+**Stability**: Applications that use too much memory will now only impact 
their own environment as they can run in their own processes. Users can define 
their own dependencies on the client and don't need to worry about potential 
conflicts with the Spark driver.
+
+**Upgradability**: The Spark driver can now seamlessly be upgraded 
independently of applications, e.g. to benefit from performance improvements 
and security fixes. This means applications can be forward-compatible, as long 
as the server-side RPC definitions are designed to be backwards compatible.
+
+**Debuggability and Observability**: Spark Connect enables interactive 
debugging during development directly from your favorite IDE. Similarly, 
applications can be monitored using the application's framework native metrics 
and logging libraries.
+
+# How to use Spark Connect
+
+Starting with Spark 3.4, Spark Connect is available and supports PySpark 
applications. When creating a Spark session, you can specify that you want to 
use Spark Connect and there are a few ways to do that as outlined below.
+
+If you do not use one of the mechanisms outlined below, your Spark session 
will work just like before, without leveraging Spark Connect, and your 
application code will run on the Spark driver node.
+
+## Set SPARK_REMOTE environment variable
+
+If you set the SPARK_REMOTE environment variable on the client machine where 
your Spark client application is running and create a new Spark Session as 
illustrated below, the session will be a Spark Connect session. With this 
approach, there is no code change needed to start using Spark Connect.
+
+Set SPARK_REMOTE environment variable:
+
+{% highlight bash %}
+export SPARK_REMOTE="sc://localhost/"
+{% endhighlight %}
+
+Start the PySpark CLI, for example:
+
+{% highlight bash %}
+./bin/pyspark
+{% endhighlight %}
+
+And notice that the PySpark CLI is now connected to Spark using Spark Connect 
as indicated in the welcome message: “Client connected to the Spark Connect 
server at...”.
+
+And if you write your own Python program, create a Spark Session as shown in 
this example:
+
+{% highlight python %}
+from pyspark.sql import SparkSession
+spark = SparkSession.builder.getOrCreate()
+{% endhighlight %}

[GitHub] [spark] allanf-db commented on a diff in pull request #40324: [WIP][SPARK-42496][CONNECT][DOCS] Adding Spark Connect to the Spark 3.4 documentation

2023-03-08 Thread via GitHub


allanf-db commented on code in PR #40324:
URL: https://github.com/apache/spark/pull/40324#discussion_r1130361509


##
docs/index.md:
##
@@ -49,8 +49,19 @@ For Java 11, `-Dio.netty.tryReflectionSetAccessible=true` is 
required additional
 
 # Running the Examples and Shell
 
-Spark comes with several sample programs.  Scala, Java, Python and R examples 
are in the
-`examples/src/main` directory. To run one of the Java or Scala sample 
programs, use
+Spark comes with several sample programs. Python, Scala, Java and R examples 
are in the
+`examples/src/main` directory.
+
+To run Spark interactively in a Python interpreter, use
+`bin/pyspark`:
+
+./bin/pyspark --master local[2]

Review Comment:
   Yeah, I did not update the existing code samples on this page (yet)



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] LuciferYang commented on pull request #40274: [SPARK-42215][CONNECT] Simplify Scala Client IT tests

2023-03-08 Thread via GitHub


LuciferYang commented on PR #40274:
URL: https://github.com/apache/spark/pull/40274#issuecomment-1461210301

   It seems the `SimpleSparkConnectService` startup failed; the error message is:
   
   ```
   Error: Missing application resource.
   
   Usage: spark-submit [options] <app jar | python file | R file> [app arguments]
   Usage: spark-submit --kill [submission ID] --master [spark://...]
   Usage: spark-submit --status [submission ID] --master [spark://...]
   Usage: spark-submit run-example [options] example-class [example args]
   
   Options:
 --master MASTER_URL spark://host:port, mesos://host:port, yarn,
 k8s://https://host:port, or local (Default: 
local[*]).
 --deploy-mode DEPLOY_MODE   Whether to launch the driver program locally 
("client") or
 on one of the worker machines inside the 
cluster ("cluster")
 (Default: client).
 --class CLASS_NAME  Your application's main class (for Java / 
Scala apps).
 --name NAME A name of your application.
 --jars JARS Comma-separated list of jars to include on the 
driver
   ...
   ```


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] mridulm commented on a diff in pull request #40339: [SPARK-42719][CORE] `MapOutputTracker#getMapLocation` should respect `spark.shuffle.reduceLocality.enabled`

2023-03-08 Thread via GitHub


mridulm commented on code in PR #40339:
URL: https://github.com/apache/spark/pull/40339#discussion_r1130360051


##
core/src/main/scala/org/apache/spark/MapOutputTracker.scala:
##
@@ -1113,7 +1113,7 @@ private[spark] class MapOutputTrackerMaster(
   endMapIndex: Int): Seq[String] =
   {
 val shuffleStatus = shuffleStatuses.get(dep.shuffleId).orNull
-if (shuffleStatus != null) {
+if (shuffleLocalityEnabled && shuffleStatus != null) {
   shuffleStatus.withMapStatuses { statuses =>

Review Comment:
   I actually meant this:
   
   ```
   if (!shuffleLocalityEnabled) return Nil
   
   val shuffleStatus = shuffleStatuses.get(dep.shuffleId).orNull
   if (shuffleStatus != null) {
   
   ```
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] LuciferYang commented on a diff in pull request #40238: [SPARK-42633][CONNECT] Make LocalRelation take an actual schema

2023-03-08 Thread via GitHub


LuciferYang commented on code in PR #40238:
URL: https://github.com/apache/spark/pull/40238#discussion_r1130353503


##
connector/connect/common/src/main/protobuf/spark/connect/relations.proto:
##
@@ -353,11 +353,16 @@ message LocalRelation {
   optional bytes data = 1;
 
   // (Optional) The schema of local data.
-  // It should be either a DDL-formatted type string or a JSON string.
   //
   // The server side will update the column names and data types according to 
this schema.
   // If the 'data' is not provided, then this schema will be required.
-  optional string schema = 2;
+  oneof schema {

Review Comment:
   Thanks @zhengruifeng 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] panbingkun opened a new pull request, #40348: [SPARK-42724][CONNECT][BUILD] Upgrade buf to v1.15.1

2023-03-08 Thread via GitHub


panbingkun opened a new pull request, #40348:
URL: https://github.com/apache/spark/pull/40348

   ### What changes were proposed in this pull request?
   This PR aims to upgrade buf from 1.15.0 to 1.15.1.
   
   ### Why are the changes needed?
   Manually tested with dev/connect-gen-protos.sh; this upgrade does not change 
the generated files.
   
   ### Does this PR introduce _any_ user-facing change?
   No.
   
   ### How was this patch tested?
   Manually test and Pass GA.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] liang3zy22 opened a new pull request, #40347: [SPARK-42711][BUILD]Update usage info for sbt tool

2023-03-08 Thread via GitHub


liang3zy22 opened a new pull request, #40347:
URL: https://github.com/apache/spark/pull/40347

   
   
   
   ### What changes were proposed in this pull request?
   
   The build/sbt script's usage information has several errors. See the info below:
   ```
   (base) spark% ./build/sbt -help
   Usage:  [options]
   
 -h | -help print this message
 -v | -verbose  this runner is chattier
   
 # java version (default: java from PATH, currently openjdk version 
"11.0.17" 2022-10-18 LTS)
 -java-home <path>  alternate JAVA_HOME
   ```
   There is no script name after "Usage:", and the "-java-home" argument's 
description is wrong. With this change, the info becomes:
   ```
   (base) spark % ./build/sbt -help
   Usage: sbt [options]
   
 -h | -help print this message
 -v | -verbose  this runner is chattier
   .
 # JAVA_HOME (this will override default JAVA_HOME)
 -java-home <path>  alternate JAVA_HOME
   
   ```
   This change also fixes several shellcheck errors.

   ### Why are the changes needed?
   
   Bug fix.
   
   ### Does this PR introduce _any_ user-facing change?
   
   Yes, users of the build/sbt tool will see the updated usage information.
   
   ### How was this patch tested?
   
   Manually checked the script code.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] zhengruifeng commented on a diff in pull request #40238: [SPARK-42633][CONNECT] Make LocalRelation take an actual schema

2023-03-08 Thread via GitHub


zhengruifeng commented on code in PR #40238:
URL: https://github.com/apache/spark/pull/40238#discussion_r1130349889


##
connector/connect/common/src/main/protobuf/spark/connect/relations.proto:
##
@@ -353,11 +353,16 @@ message LocalRelation {
   optional bytes data = 1;
 
   // (Optional) The schema of local data.
-  // It should be either a DDL-formatted type string or a JSON string.
   //
   // The server side will update the column names and data types according to 
this schema.
   // If the 'data' is not provided, then this schema will be required.
-  optional string schema = 2;
+  oneof schema {

Review Comment:
   > I'd rather like to change all places to accept both string and DataType 
and the choice is up to the client developers.
   
   I think it is a good idea. We may add a dedicated message for it and then 
use it in all places.
   
   Are we going to do this?
   
   also cc @LuciferYang 



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] zhengruifeng commented on a diff in pull request #40332: [SPARK-42690][CONNECT] Implement CSV/JSON parsing functions for Scala client

2023-03-08 Thread via GitHub


zhengruifeng commented on code in PR #40332:
URL: https://github.com/apache/spark/pull/40332#discussion_r1130347402


##
connector/connect/client/jvm/src/main/scala/org/apache/spark/sql/DataFrameReader.scala:
##
@@ -504,6 +542,18 @@ class DataFrameReader private[sql] (sparkSession: 
SparkSession) extends Logging
 }
   }
 
+  private def parse(ds: Dataset[String], format: ParseFormat): DataFrame = {
+sparkSession.newDataFrame { builder =>
+  val parseBuilder = builder.getParseBuilder
+.setInput(ds.plan.getRoot)
+.setFormat(format)
+  userSpecifiedSchema.foreach(schema => 
parseBuilder.setSchema(schema.toDDL))

Review Comment:
   As to this PR itself, I think we should probably use `DataType schema` in 
the proto message; `schema.toDDL` always discards the metadata.
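   
   For reference, a minimal illustration of that metadata loss (assuming a plain 
Spark shell; the field name and metadata key below are made up):
   
   ```
   import org.apache.spark.sql.types._
   
   // Attach a custom (non-comment) metadata entry to a field.
   val meta = new MetadataBuilder().putString("encryption", "aes").build()
   val schema = StructType(Seq(StructField("name", StringType, nullable = true, meta)))
   
   schema.toDDL                      // "name STRING" -- the custom metadata is gone
   StructType.fromDDL(schema.toDDL)  // a DDL round-trip drops it entirely
   DataType.fromJson(schema.json)    // the JSON/DataType form preserves it
   ```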



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] srowen commented on a diff in pull request #40313: [SPARK-42697][WEBUI] Fix /api/v1/applications to return total uptime instead of 0 for the duration field

2023-03-08 Thread via GitHub


srowen commented on code in PR #40313:
URL: https://github.com/apache/spark/pull/40313#discussion_r1130333597


##
core/src/main/scala/org/apache/spark/ui/SparkUI.scala:
##
@@ -167,7 +167,7 @@ private[spark] class SparkUI private (
 attemptId = None,
 startTime = new Date(startTime),
 endTime = new Date(-1),
-duration = 0,
+duration = System.currentTimeMillis() - startTime,

Review Comment:
   Seems OK, I just wonder why it was ever always 0?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] LuciferYang commented on pull request #40332: [SPARK-42690][CONNECT] Implement CSV/JSON parsing functions for Scala client

2023-03-08 Thread via GitHub


LuciferYang commented on PR #40332:
URL: https://github.com/apache/spark/pull/40332#issuecomment-1461183365

   > probably not related to this PR:
   > 
   > 
https://github.com/apache/spark/blob/39a55121888d2543a6056be65e0c74126a9d3bdf/connector/connect/client/jvm/src/main/scala/org/apache/spark/sql/DataFrameReader.scala#L63-L76
   > 
   > ```
   >   def schema(schemaString: String): DataFrameReader = {
   > schema(StructType.fromDDL(schemaString))
   >   }
   > ```
   > 
   > when the user provides a DDL string, it invokes the parser. Here I think we 
should keep both StructType and the DDL string, and pass them to the server side.
   
   The `Read` message also seems to need this consideration? I think we can 
discuss this further in a separate JIRA.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] zhengruifeng commented on pull request #40332: [SPARK-42690][CONNECT] Implement CSV/JSON parsing functions for Scala client

2023-03-08 Thread via GitHub


zhengruifeng commented on PR #40332:
URL: https://github.com/apache/spark/pull/40332#issuecomment-1461175796

   probably not related to this PR: 
   
   
https://github.com/apache/spark/blob/39a55121888d2543a6056be65e0c74126a9d3bdf/connector/connect/client/jvm/src/main/scala/org/apache/spark/sql/DataFrameReader.scala#L63-L76
   
   ```
 def schema(schemaString: String): DataFrameReader = {
   schema(StructType.fromDDL(schemaString))
 }
   ```
   
   when the user provides a DDL string, it invokes the parser. Here I think we 
should keep both StructType and the DDL string, and pass them to the server side.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] jerqi commented on a diff in pull request #40339: [SPARK-42719][CORE] `MapOutputTracker#getMapLocation` should respect `spark.shuffle.reduceLocality.enabled`

2023-03-08 Thread via GitHub


jerqi commented on code in PR #40339:
URL: https://github.com/apache/spark/pull/40339#discussion_r1130315958


##
core/src/main/scala/org/apache/spark/MapOutputTracker.scala:
##
@@ -1113,7 +1113,7 @@ private[spark] class MapOutputTrackerMaster(
   endMapIndex: Int): Seq[String] =
   {
 val shuffleStatus = shuffleStatuses.get(dep.shuffleId).orNull
-if (shuffleStatus != null) {
+if (shuffleStatus != null && shuffleLocalityEnabled) {

Review Comment:
   Ok, I will change it.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] mridulm commented on a diff in pull request #40339: [SPARK-42719][CORE] `MapOutputTracker#getMapLocation` should respect `spark.shuffle.reduceLocality.enabled`

2023-03-08 Thread via GitHub


mridulm commented on code in PR #40339:
URL: https://github.com/apache/spark/pull/40339#discussion_r1130313793


##
core/src/main/scala/org/apache/spark/MapOutputTracker.scala:
##
@@ -1113,7 +1113,7 @@ private[spark] class MapOutputTrackerMaster(
   endMapIndex: Int): Seq[String] =
   {
 val shuffleStatus = shuffleStatuses.get(dep.shuffleId).orNull
-if (shuffleStatus != null) {
+if (shuffleStatus != null && shuffleLocalityEnabled) {

Review Comment:
   nit: Why not avoid the `shuffleStatuses` query if `shuffleLocalityEnabled` 
is `false` and return `Nil` immediately?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] yaooqinn commented on pull request #40313: [SPARK-42697][WEBUI] Fix /api/v1/applications to return total uptime instead of 0 for the duration field

2023-03-08 Thread via GitHub


yaooqinn commented on PR #40313:
URL: https://github.com/apache/spark/pull/40313#issuecomment-1461171648

   cc @HyukjinKwon @srowen @dongjoon-hyun, thanks


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] mridulm commented on pull request #40302: [SPARK-42686][CORE] Defer formatting for debug messages in TaskMemoryManager

2023-03-08 Thread via GitHub


mridulm commented on PR #40302:
URL: https://github.com/apache/spark/pull/40302#issuecomment-1461170777

   @pan3793's suggestion is simpler and a smaller change.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] amaliujia commented on pull request #40346: [SPARK-42667][CONNECT][FOLLOW-UP] SparkSession created by newSession should not share the channel

2023-03-08 Thread via GitHub


amaliujia commented on PR #40346:
URL: https://github.com/apache/spark/pull/40346#issuecomment-1461163592

   @hvanhovell do you know how to fix this
   
   ```
   Error: ] 
/home/runner/work/spark/spark/connector/connect/client/jvm/src/main/scala/org/apache/spark/sql/connect/client/SparkConnectClient.scala:395:
 inferred existential type io.grpc.ManagedChannelBuilder[?0]( forSome { type ?0 
<: io.grpc.ManagedChannelBuilder[?0] }), which cannot be expressed by 
wildcards,  should be enabled
   by making the implicit value scala.language.existentials visible.
   ```
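   
   (For reference, a sketch of the two usual fixes for this scalac error; the 
`builder` val below is illustrative, not the actual code in 
`SparkConnectClient.scala`:)
   
   ```
   // Option 1: acknowledge the inferred existential type explicitly,
   // as the error message itself suggests.
   import scala.language.existentials
   
   // Option 2: avoid the existential by upcasting to a wildcard type.
   val builder: io.grpc.ManagedChannelBuilder[_] =
     io.grpc.ManagedChannelBuilder.forAddress("localhost", 15002)
   ```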


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] mridulm commented on pull request #40307: [SPARK-42689][CORE][SHUFFLE] Allow ShuffleDriverComponent to declare if shuffle data is reliably stored

2023-03-08 Thread via GitHub


mridulm commented on PR #40307:
URL: https://github.com/apache/spark/pull/40307#issuecomment-1461162897

   Thanks for the review @dongjoon-hyun ! Really helpful


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] mridulm commented on a diff in pull request #40307: [SPARK-42689][CORE][SHUFFLE] Allow ShuffleDriverComponent to declare if shuffle data is reliably stored

2023-03-08 Thread via GitHub


mridulm commented on code in PR #40307:
URL: https://github.com/apache/spark/pull/40307#discussion_r1130215354


##
core/src/main/java/org/apache/spark/shuffle/api/ShuffleDriverComponents.java:
##
@@ -61,4 +61,13 @@ default void registerShuffle(int shuffleId) {}
* @param blocking Whether this call should block on the deletion of the 
data.
*/
   default void removeShuffle(int shuffleId, boolean blocking) {}
+
+  /**
+   * Does this shuffle component support reliable storage - external to the 
lifecycle of the
+   * executor host ? For example, writing shuffle data to a distributed 
filesystem or
+   * persisting it in a remote shuffle service.
+   */
+  default boolean supportsReliableStorage() {

Review Comment:
   In this context, we are looking to say the shuffle driver component 
`supports` reliable storage as a feature.
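   
   A hypothetical component backed by reliable external storage might then look 
like the sketch below (the class name and storage choice are made up; only 
`supportsReliableStorage()` itself comes from this PR):
   
   ```
   import java.util.Collections
   import org.apache.spark.shuffle.api.ShuffleDriverComponents
   
   class DfsShuffleDriverComponents extends ShuffleDriverComponents {
     override def initializeApplication(): java.util.Map[String, String] =
       Collections.emptyMap[String, String]()
     override def cleanupApplication(): Unit = {}
     // Shuffle data lives outside the executor hosts, so losing a worker
     // does not mean the shuffle output is lost.
     override def supportsReliableStorage(): Boolean = true
   }
   ```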



##
core/src/main/scala/org/apache/spark/ExecutorAllocationManager.scala:
##
@@ -103,7 +104,8 @@ private[spark] class ExecutorAllocationManager(
 conf: SparkConf,
 cleaner: Option[ContextCleaner] = None,
 clock: Clock = new SystemClock(),
-resourceProfileManager: ResourceProfileManager)
+resourceProfileManager: ResourceProfileManager,
+shuffleDriverComponents: ShuffleDriverComponents)

Review Comment:
   Since `ExecutorAllocationManager` is `private[spark]`, if we need to pull 
more details in the future, we can do that.
   Given this, I am making the change @dongjoon-hyun recommended: pass the 
`boolean` instead of the object; we can revisit it as this functionality evolves.



##
core/src/test/scala/org/apache/spark/SparkContextSuite.scala:
##
@@ -1399,6 +1399,14 @@ class SparkContextSuite extends SparkFunSuite with 
LocalSparkContext with Eventu
 }
 sc.stop()
   }
+
+  test("ShuffleDataIO initialized after application id has been configured") {

Review Comment:
   Added jira id assuming that is what you meant.



##
core/src/test/scala/org/apache/spark/scheduler/DAGSchedulerSuite.scala:
##
@@ -985,6 +985,28 @@ class DAGSchedulerSuite extends SparkFunSuite with 
TempLocalSparkContext with Ti
 }
   }
 
+  test("worker lost with remote shuffle service enabled should not treat 
shuffle files lost") {

Review Comment:
   Added jira.



##
core/src/main/scala/org/apache/spark/SparkContext.scala:
##
@@ -596,6 +591,13 @@ class SparkContext(config: SparkConf) extends Logging {
   _conf.set(APP_ATTEMPT_ID, attemptId)
   _env.blockManager.blockStoreClient.setAppAttemptId(attemptId)
 }
+
+// initialize after application id and attempt id has been initialized
+_shuffleDriverComponents = 
ShuffleDataIOUtils.loadShuffleDataIO(_conf).driver()
+_shuffleDriverComponents.initializeApplication().asScala.foreach { case 
(k, v) =>
+  _conf.set(ShuffleDataIOUtils.SHUFFLE_SPARK_CONF_PREFIX + k, v)
+}

Review Comment:
   Good point, moved the cleanup so that `cleanupApplication` happens before the 
`_heartbeater` stop.



##
core/src/main/scala/org/apache/spark/ExecutorAllocationManager.scala:
##
@@ -203,7 +205,8 @@ private[spark] class ExecutorAllocationManager(
   throw new SparkException(
 s"s${DYN_ALLOCATION_SUSTAINED_SCHEDULER_BACKLOG_TIMEOUT.key} must be > 
0!")
 }
-if (!conf.get(config.SHUFFLE_SERVICE_ENABLED)) {
+if (!conf.get(config.SHUFFLE_SERVICE_ENABLED) &&
+  !shuffleDriverComponents.supportsReliableStorage()) {

Review Comment:
   We can change that after SPARK-25299 support is complete.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] jerqi commented on pull request #40339: [SPARK-42719][CORE] `MapOutputTracker#getMapLocation` should respect `spark.shuffle.reduceLocality.enabled`

2023-03-08 Thread via GitHub


jerqi commented on PR #40339:
URL: https://github.com/apache/spark/pull/40339#issuecomment-1461148705

   > Is this still in draft @jerqi ? Also, +CC @cloud-fan who reviewed this 
earlier.
   
   Thanks @mridulm, it's ready for review.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] allanf-db commented on a diff in pull request #40324: [WIP][SPARK-42496][CONNECT][DOCS] Adding Spark Connect to the Spark 3.4 documentation

2023-03-08 Thread via GitHub


allanf-db commented on code in PR #40324:
URL: https://github.com/apache/spark/pull/40324#discussion_r1130279011


##
docs/index.md:
##
@@ -86,6 +88,15 @@ Example applications are also provided in R. For example,
 
 ./bin/spark-submit examples/src/main/r/dataframe.R
 
+## Running Spark Client Applications Anywhere with Spark Connect

Review Comment:
   Fixing, thanks



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] allanf-db commented on a diff in pull request #40324: [WIP][SPARK-42496][CONNECT][DOCS] Adding Spark Connect to the Spark 3.4 documentation

2023-03-08 Thread via GitHub


allanf-db commented on code in PR #40324:
URL: https://github.com/apache/spark/pull/40324#discussion_r1130275977


##
docs/spark-connect-overview.md:
##
@@ -0,0 +1,108 @@
+---
+layout: global
+title: Spark Connect Overview - Building client-side Spark applications
+license: |
+  Licensed to the Apache Software Foundation (ASF) under one or more
+  contributor license agreements.  See the NOTICE file distributed with
+  this work for additional information regarding copyright ownership.
+  The ASF licenses this file to You under the Apache License, Version 2.0
+  (the "License"); you may not use this file except in compliance with
+  the License.  You may obtain a copy of the License at
+ 
+ http://www.apache.org/licenses/LICENSE-2.0
+ 
+  Unless required by applicable law or agreed to in writing, software
+  distributed under the License is distributed on an "AS IS" BASIS,
+  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+  See the License for the specific language governing permissions and
+  limitations under the License.
+---
+
+In Apache Spark 3.4, Spark Connect introduced a decoupled client-server 
architecture that allows remote connectivity to Spark clusters using the 
DataFrame API and unresolved logical plans as the protocol. The separation 
between client and server allows Spark and its open ecosystem to be leveraged 
from everywhere. It can be embedded in modern data applications, in IDEs, 
Notebooks and programming languages.
+
+
+  
+
+
+# How Spark Connect Works
+
+The Spark Connect client library is designed to simplify Spark application 
development. It is a thin API that can be embedded everywhere: in application 
servers, IDEs, notebooks, and programming languages. The Spark Connect API 
builds on Spark's DataFrame API using unresolved logical plans as a 
language-agnostic protocol between the client and the Spark driver.
+
+The Spark Connect client translates DataFrame operations into unresolved 
logical query plans which are encoded using protocol buffers. These are sent to 
the server using the gRPC framework.
+
+The Spark Connect endpoint, embedded on the Spark Server, receives and 
translates unresolved logical plans into Spark's logical plan operators. This 
is similar to parsing a SQL query, where attributes and relations are parsed 
and an initial parse plan is built. From there, the standard Spark execution 
process kicks in, ensuring that Spark Connect leverages all of Spark's 
optimizations and enhancements. Results are streamed back to the client via 
gRPC as Apache Arrow-encoded row batches.
+
+
+  
+
+
+# Operational Benefits of Spark Connect
+
+With this new architecture, Spark Connect mitigates several operational issues:
+
+**Stability**: Applications that use too much memory will now only impact 
their own environment as they can run in their own processes. Users can define 
their own dependencies on the client and don't need to worry about potential 
conflicts with the Spark driver.
+
+**Upgradability**: The Spark driver can now seamlessly be upgraded 
independently of applications, e.g. to benefit from performance improvements 
and security fixes. This means applications can be forward-compatible, as long 
as the server-side RPC definitions are designed to be backwards compatible.
+
+**Debuggability and Observability**: Spark Connect enables interactive 
debugging during development directly from your favorite IDE. Similarly, 
applications can be monitored using the application's framework native metrics 
and logging libraries.
+
+# How to use Spark Connect
+
+Starting with Spark 3.4, Spark Connect is available and supports PySpark 
applications. When creating a Spark session, you can specify that you want to 
use Spark Connect and there are a few ways to do that as outlined below.
+
+If you do not use one of the mechanisms outlined below, your Spark session 
will work just like before, without leveraging Spark Connect, and your 
application code will run on the Spark driver node.
+
+## Set SPARK_REMOTE environment variable
+
+If you set the SPARK_REMOTE environment variable on the client machine where 
your Spark client application is running and create a new Spark Session as 
illustrated below, the session will be a Spark Connect session. With this 
approach, there is no code change needed to start using Spark Connect.
+
+Set SPARK_REMOTE environment variable:
+
+{% highlight bash %}
+export SPARK_REMOTE="sc://localhost/"
+{% endhighlight %}

Review Comment:
   Good point, fixing this



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

[GitHub] [spark] panbingkun commented on a diff in pull request #40316: [SPARK-42679][CONNECT][PYTHON] createDataFrame doesn't work with non-nullable schema

2023-03-08 Thread via GitHub


panbingkun commented on code in PR #40316:
URL: https://github.com/apache/spark/pull/40316#discussion_r1130275274


##
python/pyspark/sql/types.py:
##
@@ -1599,6 +1599,18 @@ def _has_nulltype(dt: DataType) -> bool:
 return isinstance(dt, NullType)
 
 
+def _has_non_nullable(dt: Optional[Union[AtomicType, StructType]]) -> bool:

Review Comment:
   OK, let me fix it.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] HyukjinKwon closed pull request #40344: [SPARK-42656][CONNECT][Followup] Fix the spark-connect script

2023-03-08 Thread via GitHub


HyukjinKwon closed pull request #40344: [SPARK-42656][CONNECT][Followup] Fix 
the spark-connect script
URL: https://github.com/apache/spark/pull/40344


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] HyukjinKwon commented on pull request #40344: [SPARK-42656][CONNECT][Followup] Fix the spark-connect script

2023-03-08 Thread via GitHub


HyukjinKwon commented on PR #40344:
URL: https://github.com/apache/spark/pull/40344#issuecomment-1461128696

   Merged to master and branch-3.4.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] itholic commented on a diff in pull request #40271: SPARK-42258][PYTHON] pyspark.sql.functions should not expose typing.cast

2023-03-08 Thread via GitHub


itholic commented on code in PR #40271:
URL: https://github.com/apache/spark/pull/40271#discussion_r1130262880


##
python/pyspark/sql/tests/test_functions.py:
##
@@ -1268,6 +1268,12 @@ def test_bucket(self):
 message_parameters={"arg_name": "numBuckets", "arg_type": "str"},
 )
 
+def test_no_cast(self):

Review Comment:
   I don't think there's a concern for the name collision with `typing.cast` 
because `functions.cast` is not a function specified in the [PySpark API 
Reference](https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/functions.html).
   
   However, I also sort of agree that users might get confused and expect a 
`functions.cast` to exist since it's a function name that is generally expected 
to be supported (and anyway it is also supported in the Column API...) based on 
the example you provided in the PR description.
   
   At the moment, I am not 100% sure which direction is right based on my own 
thoughts alone, so let's gather some more opinions from others.
   
   WDYT, @HyukjinKwon @ueshin @zhenglaizhang ??



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] [spark] beliefer commented on a diff in pull request #40291: [WIP][SPARK-42578][CONNECT] Add JDBC to DataFrameWriter

2023-03-08 Thread via GitHub


beliefer commented on code in PR #40291:
URL: https://github.com/apache/spark/pull/40291#discussion_r1130258611


##
connector/connect/server/pom.xml:
##
@@ -199,6 +199,12 @@
   ${tomcat.annotations.api.version}
   provided
 
+

Review Comment:
   It seems the server can't find the jar on the classpath.
   
![image](https://user-images.githubusercontent.com/8486025/223889202-4a56911e-336b-4695-ba40-8d052e2e52a1.png)
   






[GitHub] [spark] beliefer commented on pull request #40277: [SPARK-42555][CONNECT][FOLLOWUP] Add the new proto msg to support the remaining jdbc API

2023-03-08 Thread via GitHub


beliefer commented on PR #40277:
URL: https://github.com/apache/spark/pull/40277#issuecomment-1461105352

   @hvanhovell @zhengruifeng Thank you.





[GitHub] [spark] hvanhovell commented on a diff in pull request #40291: [WIP][SPARK-42578][CONNECT] Add JDBC to DataFrameWriter

2023-03-08 Thread via GitHub


hvanhovell commented on code in PR #40291:
URL: https://github.com/apache/spark/pull/40291#discussion_r1130251210


##
connector/connect/server/pom.xml:
##
@@ -199,6 +199,12 @@
   ${tomcat.annotations.api.version}
   provided
 
+

Review Comment:
   Why the move?






[GitHub] [spark] amaliujia commented on a diff in pull request #40346: [SPARK-42667][CONNECT][FOLLOW-UP] SparkSession created by newSession should not share the channel

2023-03-08 Thread via GitHub


amaliujia commented on code in PR #40346:
URL: https://github.com/apache/spark/pull/40346#discussion_r1130250085


##
connector/connect/client/jvm/src/main/scala/org/apache/spark/sql/connect/client/SparkConnectClient.scala:
##
@@ -164,7 +165,7 @@ private[sql] class SparkConnectClient(
   }
 
   def copy(): SparkConnectClient = {
-new SparkConnectClient(userContext, channel, userAgent)
+SparkConnectClient.builder().connectionString(connectString).build()

Review Comment:
   Updated to use the builder, so that we can copy the entire channel setup to 
the new client.






[GitHub] [spark] HyukjinKwon commented on pull request #40329: [SPARK-42710][CONNECT][PYTHON] Rename FrameMap proto to MapPartitions

2023-03-08 Thread via GitHub


HyukjinKwon commented on PR #40329:
URL: https://github.com/apache/spark/pull/40329#issuecomment-1461095654

   @xinrong-meng mind rebasing this? Otherwise it should be good to go.





[GitHub] [spark] hvanhovell commented on a diff in pull request #40346: [SPARK-42667][CONNECT][FOLLOW-UP] SparkSession created by newSession should not share the channel

2023-03-08 Thread via GitHub


hvanhovell commented on code in PR #40346:
URL: https://github.com/apache/spark/pull/40346#discussion_r1130241752


##
connector/connect/client/jvm/src/main/scala/org/apache/spark/sql/connect/client/SparkConnectClient.scala:
##
@@ -164,7 +165,7 @@ private[sql] class SparkConnectClient(
   }
 
   def copy(): SparkConnectClient = {
-new SparkConnectClient(userContext, channel, userAgent)
+SparkConnectClient.builder().connectionString(connectString).build()

Review Comment:
   I don't buy that argument. The problem is that the client builder supports 
all these things for good reason; this change assumes the client will only ever 
be used to create a SparkSession, while there are also other use cases.
   
   Please keep the builder.






[GitHub] [spark] itholic commented on a diff in pull request #40316: [SPARK-42679][CONNECT][PYTHON] createDataFrame doesn't work with non-nullable schema

2023-03-08 Thread via GitHub


itholic commented on code in PR #40316:
URL: https://github.com/apache/spark/pull/40316#discussion_r1130239606


##
python/pyspark/sql/types.py:
##
@@ -1599,6 +1599,18 @@ def _has_nulltype(dt: DataType) -> bool:
 return isinstance(dt, NullType)
 
 
+def _has_non_nullable(dt: Optional[Union[AtomicType, StructType]]) -> bool:

Review Comment:
   How about simply:
   
   ```python
   def _has_non_nullable(dt: DataType) -> bool:
   ```
   
   I believe that would fix the linter error.
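   
   For illustration, with that simplified signature the helper could recurse 
over the schema. The body below is only a sketch -- the recursion logic is an 
assumption for illustration, not the PR's actual implementation; only the 
suggested signature above comes from this review:
   
   ```python
   from pyspark.sql.types import ArrayType, DataType, MapType, StructType
   
   def _has_non_nullable(dt: DataType) -> bool:
       """Return True if any part of the type is declared non-nullable."""
       if isinstance(dt, StructType):
           return any(
               not field.nullable or _has_non_nullable(field.dataType)
               for field in dt.fields
           )
       if isinstance(dt, ArrayType):
           return not dt.containsNull or _has_non_nullable(dt.elementType)
       if isinstance(dt, MapType):
           return not dt.valueContainsNull or _has_non_nullable(dt.valueType)
       return False
   ```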






[GitHub] [spark] amaliujia commented on a diff in pull request #40346: [SPARK-42667][CONNECT][FOLLOW-UP] SparkSession created by newSession should not share the channel

2023-03-08 Thread via GitHub


amaliujia commented on code in PR #40346:
URL: https://github.com/apache/spark/pull/40346#discussion_r1130237717


##
connector/connect/client/jvm/src/main/scala/org/apache/spark/sql/connect/client/SparkConnectClient.scala:
##
@@ -164,7 +165,7 @@ private[sql] class SparkConnectClient(
   }
 
   def copy(): SparkConnectClient = {
-new SparkConnectClient(userContext, channel, userAgent)
+SparkConnectClient.builder().connectionString(connectString).build()

Review Comment:
   
https://github.com/apache/spark/blob/a77bb37b4112543fcd77a7f6091e465daeb3f8ae/connector/connect/client/jvm/src/main/scala/org/apache/spark/sql/SparkSession.scala#L497






[GitHub] [spark] HyukjinKwon closed pull request #40343: [SPARK-42722][CONNECT][PYTHON] Python Connect def schema() should not cache the schema

2023-03-08 Thread via GitHub


HyukjinKwon closed pull request #40343: [SPARK-42722][CONNECT][PYTHON] Python 
Connect def schema() should not cache the schema
URL: https://github.com/apache/spark/pull/40343





[GitHub] [spark] amaliujia commented on a diff in pull request #40346: [SPARK-42667][CONNECT][FOLLOW-UP] SparkSession created by newSession should not share the channel

2023-03-08 Thread via GitHub


amaliujia commented on code in PR #40346:
URL: https://github.com/apache/spark/pull/40346#discussion_r1130236992


##
connector/connect/client/jvm/src/main/scala/org/apache/spark/sql/connect/client/SparkConnectClient.scala:
##
@@ -164,7 +165,7 @@ private[sql] class SparkConnectClient(
   }
 
   def copy(): SparkConnectClient = {
-new SparkConnectClient(userContext, channel, userAgent)
+SparkConnectClient.builder().connectionString(connectString).build()

Review Comment:
   If you look at the current SparkSession builder, I think it only has two 
ways to build the SparkSession:
   
   1. with a connection string.
   2. the default build, with nothing set.
   
   That is why I only kept the connection string. But if you think keeping the 
builder to preserve everything is better, I can update.






[GitHub] [spark] HyukjinKwon commented on pull request #40343: [SPARK-42722][CONNECT][PYTHON] Python Connect def schema() should not cache the schema

2023-03-08 Thread via GitHub


HyukjinKwon commented on PR #40343:
URL: https://github.com/apache/spark/pull/40343#issuecomment-1461088027

   Merged to master and branch-3.4.





[GitHub] [spark] dongjoon-hyun commented on a diff in pull request #40069: [SPARK-42480][SQL] Improve the performance of drop partitions

2023-03-08 Thread via GitHub


dongjoon-hyun commented on code in PR #40069:
URL: https://github.com/apache/spark/pull/40069#discussion_r1130236038


##
sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala:
##
@@ -1227,6 +1227,14 @@ object SQLConf {
 .booleanConf
 .createWithDefault(false)
 
+  val HIVE_METASTORE_DROP_PARTITION_BY_NAME =
+buildConf("spark.sql.hive.dropPartitionByName.enabled")
+  .doc("When true, Spark will get partition name rather than partition 
object " +
+   "to drop partition, which can improve the performance of drop 
partition.")
+  .version("3.4.0")

Review Comment:
   Thank you for the decision; I support it as well. Here is my +1.






[GitHub] [spark] sunchao commented on a diff in pull request #40069: [SPARK-42480][SQL] Improve the performance of drop partitions

2023-03-08 Thread via GitHub


sunchao commented on code in PR #40069:
URL: https://github.com/apache/spark/pull/40069#discussion_r1130235798


##
sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala:
##
@@ -1227,6 +1227,14 @@ object SQLConf {
 .booleanConf
 .createWithDefault(false)
 
+  val HIVE_METASTORE_DROP_PARTITION_BY_NAME =
+buildConf("spark.sql.hive.dropPartitionByName.enabled")
+  .doc("When true, Spark will get partition name rather than partition 
object " +
+   "to drop partition, which can improve the performance of drop 
partition.")
+  .version("3.4.0")

Review Comment:
   I think it's pretty safe to backport to branch-3.4 since the feature is 
turned off by default.
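   
   For illustration, since the flag defaults to off, a user would opt in at 
runtime before dropping partitions. A minimal PySpark sketch (the `logs` table 
and the partition spec are hypothetical):
   
   ```python
   from pyspark.sql import SparkSession
   
   spark = SparkSession.builder.enableHiveSupport().getOrCreate()
   
   # Enable the name-based drop path introduced by this PR (off by default).
   spark.conf.set("spark.sql.hive.dropPartitionByName.enabled", "true")
   
   spark.sql("ALTER TABLE logs DROP IF EXISTS PARTITION (dt='2023-03-08')")
   ```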


