Re: [PR] [SPARK-50130][SQL][FOLLOWUP] Make Encoder generation lazy [spark]

2024-11-20 Thread via GitHub


ueshin closed pull request #48829: [SPARK-50130][SQL][FOLLOWUP] Make Encoder 
generation lazy
URL: https://github.com/apache/spark/pull/48829


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-50130][SQL][FOLLOWUP] Make Encoder generation lazy [spark]

2024-11-20 Thread via GitHub


ueshin commented on PR #48829:
URL: https://github.com/apache/spark/pull/48829#issuecomment-2489557706

   Thanks! merging to master.





Re: [PR] [SPARK-50130][SQL][FOLLOWUP] Make Encoder generation lazy [spark]

2024-11-20 Thread via GitHub


ueshin commented on PR #48829:
URL: https://github.com/apache/spark/pull/48829#issuecomment-2489557250

   The remaining test failures are not related to this PR.





Re: [PR] [SPARK-50130][SQL][FOLLOWUP] Make Encoder generation lazy [spark]

2024-11-20 Thread via GitHub


ueshin commented on code in PR #48829:
URL: https://github.com/apache/spark/pull/48829#discussion_r1850870243


##
sql/core/src/test/scala/org/apache/spark/sql/UDFSuite.scala:
##
@@ -1205,7 +1205,7 @@ class UDFSuite extends QueryTest with SharedSparkSession {
         dt
       )
       checkError(
-        intercept[AnalysisException](spark.range(1).select(f())),
+        intercept[AnalysisException](spark.range(1).select(f()).encoder),

Review Comment:
   Sure, thanks for confirming.






Re: [PR] [SPARK-50130][SQL][FOLLOWUP] Make Encoder generation lazy [spark]

2024-11-19 Thread via GitHub


cloud-fan commented on code in PR #48829:
URL: https://github.com/apache/spark/pull/48829#discussion_r1849559800


##
sql/core/src/test/scala/org/apache/spark/sql/UDFSuite.scala:
##
@@ -1205,7 +1205,7 @@ class UDFSuite extends QueryTest with SharedSparkSession {
         dt
       )
       checkError(
-        intercept[AnalysisException](spark.range(1).select(f())),
+        intercept[AnalysisException](spark.range(1).select(f()).encoder),

Review Comment:
   I think this is fine; we will remove this error eventually once char/varchar is fully supported (that work is in progress).






Re: [PR] [SPARK-50130][SQL][FOLLOWUP] Make Encoder generation lazy [spark]

2024-11-18 Thread via GitHub


hvanhovell commented on code in PR #48829:
URL: https://github.com/apache/spark/pull/48829#discussion_r1847039697


##
sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala:
##
@@ -95,13 +95,12 @@ private[sql] object Dataset {
   def ofRows(sparkSession: SparkSession, logicalPlan: LogicalPlan): DataFrame =
     sparkSession.withActive {
       val qe = sparkSession.sessionState.executePlan(logicalPlan)
-      val encoder = if (qe.isLazyAnalysis) {
-        RowEncoder.encoderFor(new StructType())
+      if (qe.isLazyAnalysis) {

Review Comment:
   You don't have to create a RowEncoder at this point, as it should work with 
any schema. For normal encoders we do have to bind them because the underlying 
schema can be incompatible with the encoder. So, long story short, @cloud-fan's 
suggestion is absolutely fine.
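   A minimal sketch of that point, using only the `RowEncoder.encoderFor` call already quoted in the diff above (the surrounding object name is hypothetical, and the import path assumed here is the catalyst-internal one the diff itself relies on):
   ```scala
   import org.apache.spark.sql.catalyst.encoders.RowEncoder
   import org.apache.spark.sql.types.{LongType, StringType, StructField, StructType}

   object RowEncoderSketch {
     def main(args: Array[String]): Unit = {
       // A Row encoder is derived purely from the schema handed to it, so it can
       // be created whenever the analyzed schema becomes known; there is no
       // separate binding step against a pre-declared target type, unlike a
       // typed (case-class) encoder.
       val schema = StructType(Seq(
         StructField("id", LongType),
         StructField("name", StringType)))
       val encoder = RowEncoder.encoderFor(schema)
       println(encoder.schema.treeString)
     }
   }
   ```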






Re: [PR] [SPARK-50130][SQL][FOLLOWUP] Make Encoder generation lazy [spark]

2024-11-18 Thread via GitHub


ueshin commented on code in PR #48829:
URL: https://github.com/apache/spark/pull/48829#discussion_r1847155110


##
sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala:
##
@@ -95,13 +95,12 @@ private[sql] object Dataset {
   def ofRows(sparkSession: SparkSession, logicalPlan: LogicalPlan): DataFrame =
     sparkSession.withActive {
       val qe = sparkSession.sessionState.executePlan(logicalPlan)
-      val encoder = if (qe.isLazyAnalysis) {
-        RowEncoder.encoderFor(new StructType())
+      if (qe.isLazyAnalysis) {

Review Comment:
   Sure, I updated.
   
   I'm not sure [this change](https://github.com/apache/spark/pull/48829/files#diff-0d94d471b64499a7962de4e9c3335ff5227b80cd85e50cfbd4cab67e41651fb8R1208) is allowed. It seems there is still a case where the encoder creation can throw an exception.
   I guess we still want to create it eagerly when it's not lazy analysis?






Re: [PR] [SPARK-50130][SQL][FOLLOWUP] Make Encoder generation lazy [spark]

2024-11-18 Thread via GitHub


ueshin commented on code in PR #48829:
URL: https://github.com/apache/spark/pull/48829#discussion_r1847155110


##
sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala:
##
@@ -95,13 +95,12 @@ private[sql] object Dataset {
   def ofRows(sparkSession: SparkSession, logicalPlan: LogicalPlan): DataFrame =
     sparkSession.withActive {
      val qe = sparkSession.sessionState.executePlan(logicalPlan)
-      val encoder = if (qe.isLazyAnalysis) {
-        RowEncoder.encoderFor(new StructType())
+      if (qe.isLazyAnalysis) {

Review Comment:
   Sure, I updated.
   
   I'm not sure [this change](https://github.com/apache/spark/pull/48829/files#diff-0d94d471b64499a7962de4e9c3335ff5227b80cd85e50cfbd4cab67e41651fb8R1208) is allowed. I guess we still want to create it eagerly when it's not lazy analysis?






Re: [PR] [SPARK-50130][SQL][FOLLOWUP] Make Encoder generation lazy [spark]

2024-11-16 Thread via GitHub


cloud-fan commented on code in PR #48829:
URL: https://github.com/apache/spark/pull/48829#discussion_r1845085038


##
sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala:
##
@@ -95,13 +95,12 @@ private[sql] object Dataset {
   def ofRows(sparkSession: SparkSession, logicalPlan: LogicalPlan): DataFrame =
     sparkSession.withActive {
       val qe = sparkSession.sessionState.executePlan(logicalPlan)
-      val encoder = if (qe.isLazyAnalysis) {
-        RowEncoder.encoderFor(new StructType())
+      if (qe.isLazyAnalysis) {

Review Comment:
   I get that we want to create the RowEncoder eagerly to fail earlier in the common case, but `if (!qe.isLazyAnalysis) qe.assertAnalyzed()` should do the same thing, right?






Re: [PR] [SPARK-50130][SQL][FOLLOWUP] Make Encoder generation lazy [spark]

2024-11-15 Thread via GitHub


ueshin commented on code in PR #48829:
URL: https://github.com/apache/spark/pull/48829#discussion_r1844320464


##
sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala:
##
@@ -95,13 +95,12 @@ private[sql] object Dataset {
   def ofRows(sparkSession: SparkSession, logicalPlan: LogicalPlan): DataFrame =
     sparkSession.withActive {
       val qe = sparkSession.sessionState.executePlan(logicalPlan)
-      val encoder = if (qe.isLazyAnalysis) {
-        RowEncoder.encoderFor(new StructType())
+      if (qe.isLazyAnalysis) {

Review Comment:
   The lazy encoder creation only happens when `qe.isLazyAnalysis`; otherwise the encoder is created right here, the same as before, to avoid changing the behavior. `Dataset` now has a new constructor that takes a function for the lazy case, and the original constructor is still there to preserve the existing behavior.
   The suggested cleanup doesn't preserve the original behavior for the non-lazy case.
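   Roughly, that constructor layout looks like the following sketch (hypothetical names and `String` stand-ins, not the actual `org.apache.spark.sql.Dataset`):
   ```scala
   // Hedged sketch: one constructor takes a function producing the encoder
   // (the lazy-analysis path), and an auxiliary constructor keeps the original
   // eager shape by wrapping an already-built encoder in a constant function.
   class MyDataset[T](val queryPlan: String, encoderFn: () => String) {
     // Materialized only on first access.
     lazy val encoder: String = encoderFn()

     // Original-style constructor, kept for the eager (non-lazy) path.
     def this(queryPlan: String, encoder: String) = this(queryPlan, () => encoder)
   }

   object MyDatasetDemo {
     def main(args: Array[String]): Unit = {
       val eager    = new MyDataset[String]("plan", "encoder built eagerly")
       val deferred = new MyDataset[String]("plan", () => "encoder built on first access")
       println(eager.encoder)
       println(deferred.encoder)
     }
   }
   ```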






Re: [PR] [SPARK-50130][SQL][FOLLOWUP] Make Encoder generation lazy [spark]

2024-11-15 Thread via GitHub


ueshin commented on code in PR #48829:
URL: https://github.com/apache/spark/pull/48829#discussion_r1844320464


##
sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala:
##
@@ -95,13 +95,12 @@ private[sql] object Dataset {
   def ofRows(sparkSession: SparkSession, logicalPlan: LogicalPlan): DataFrame =
     sparkSession.withActive {
       val qe = sparkSession.sessionState.executePlan(logicalPlan)
-      val encoder = if (qe.isLazyAnalysis) {
-        RowEncoder.encoderFor(new StructType())
+      if (qe.isLazyAnalysis) {

Review Comment:
   The lazy encoder creation only happens when `qe.isLazyAnalysis`; otherwise the encoder is created right here, the same as before, to avoid changing the behavior. `Dataset` now has a new constructor that takes a function for the lazy case, and the original constructor is still there to preserve the existing behavior.






Re: [PR] [SPARK-50130][SQL][FOLLOWUP] Make Encoder generation lazy [spark]

2024-11-15 Thread via GitHub


cloud-fan commented on code in PR #48829:
URL: https://github.com/apache/spark/pull/48829#discussion_r1843600376


##
sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala:
##
@@ -95,13 +95,12 @@ private[sql] object Dataset {
   def ofRows(sparkSession: SparkSession, logicalPlan: LogicalPlan): DataFrame =
     sparkSession.withActive {
       val qe = sparkSession.sessionState.executePlan(logicalPlan)
-      val encoder = if (qe.isLazyAnalysis) {
-        RowEncoder.encoderFor(new StructType())
+      if (qe.isLazyAnalysis) {

Review Comment:
   I don't get it. My proposal is a pure code cleanup that doesn't change any 
actual logic.






Re: [PR] [SPARK-50130][SQL][FOLLOWUP] Make Encoder generation lazy [spark]

2024-11-14 Thread via GitHub


ueshin commented on code in PR #48829:
URL: https://github.com/apache/spark/pull/48829#discussion_r1842774971


##
sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala:
##
@@ -95,13 +95,12 @@ private[sql] object Dataset {
   def ofRows(sparkSession: SparkSession, logicalPlan: LogicalPlan): DataFrame =
     sparkSession.withActive {
       val qe = sparkSession.sessionState.executePlan(logicalPlan)
-      val encoder = if (qe.isLazyAnalysis) {
-        RowEncoder.encoderFor(new StructType())
+      if (qe.isLazyAnalysis) {

Review Comment:
   Now, with this PR, the encoder creation is lazy only when `qe.isLazyAnalysis`, to stay consistent with the current behavior. The previously failing test also suggested keeping it consistent.






Re: [PR] [SPARK-50130][SQL][FOLLOWUP] Make Encoder generation lazy [spark]

2024-11-14 Thread via GitHub


cloud-fan commented on code in PR #48829:
URL: https://github.com/apache/spark/pull/48829#discussion_r1841773144


##
sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala:
##
@@ -95,13 +95,12 @@ private[sql] object Dataset {
   def ofRows(sparkSession: SparkSession, logicalPlan: LogicalPlan): DataFrame =
     sparkSession.withActive {
       val qe = sparkSession.sessionState.executePlan(logicalPlan)
-      val encoder = if (qe.isLazyAnalysis) {
-        RowEncoder.encoderFor(new StructType())
+      if (qe.isLazyAnalysis) {

Review Comment:
   can we clean up the code a bit more?
   ```
   if (!qe.isLazyAnalysis) qe.assertAnalyzed()
   new Dataset[Row](qe, () => RowEncoder.encoderFor(qe.analyzed.schema))
   ```






Re: [PR] [SPARK-50130][SQL][FOLLOWUP] Make Encoder generation lazy [spark]

2024-11-13 Thread via GitHub


ueshin commented on code in PR #48829:
URL: https://github.com/apache/spark/pull/48829#discussion_r1841253803


##
sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala:
##
@@ -95,13 +95,11 @@ private[sql] object Dataset {
   def ofRows(sparkSession: SparkSession, logicalPlan: LogicalPlan): DataFrame =
     sparkSession.withActive {
       val qe = sparkSession.sessionState.executePlan(logicalPlan)
-      val encoder = if (qe.isLazyAnalysis) {
-        RowEncoder.encoderFor(new StructType())
-      } else {
+      if (!qe.isLazyAnalysis) {
         qe.assertAnalyzed()
-        RowEncoder.encoderFor(qe.analyzed.schema)
       }
-      new Dataset[Row](qe, encoder)
+      val encoder = () => RowEncoder.encoderFor(qe.analyzed.schema)
+      new Dataset[Row](qe, encoder())

Review Comment:
   No, the constructor signature changed to a by-name parameter, `lazyEncoder: => Encoder[T]`, so it is only evaluated later.
   I changed it to make this clearer and to fix the test failures, in 584c650.
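   For reference, a minimal standalone sketch of by-name (`: =>`) parameter semantics, with hypothetical names unrelated to the actual Dataset code:
   ```scala
   object ByNameSketch {
     // `value: => String` is a by-name parameter: the argument expression is not
     // evaluated at the call site, only when it is actually used.
     class Holder(value: => String) {
       // Evaluated on first access, then cached.
       lazy val materialized: String = value
     }

     def build(): String = {
       println("building value")
       "result"
     }

     def main(args: Array[String]): Unit = {
       val h = new Holder(build())   // nothing is printed yet
       println("holder created")
       println(h.materialized)       // "building value" is printed here, once
       println(h.materialized)       // cached; build() is not called again
     }
   }
   ```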






Re: [PR] [SPARK-50130][SQL][FOLLOWUP] Make Encoder generation lazy [spark]

2024-11-13 Thread via GitHub


cloud-fan commented on code in PR #48829:
URL: https://github.com/apache/spark/pull/48829#discussion_r1839698559


##
sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala:
##
@@ -95,13 +95,11 @@ private[sql] object Dataset {
   def ofRows(sparkSession: SparkSession, logicalPlan: LogicalPlan): DataFrame =
     sparkSession.withActive {
       val qe = sparkSession.sessionState.executePlan(logicalPlan)
-      val encoder = if (qe.isLazyAnalysis) {
-        RowEncoder.encoderFor(new StructType())
-      } else {
+      if (!qe.isLazyAnalysis) {
         qe.assertAnalyzed()
-        RowEncoder.encoderFor(qe.analyzed.schema)
       }
-      new Dataset[Row](qe, encoder)
+      val encoder = () => RowEncoder.encoderFor(qe.analyzed.schema)
+      new Dataset[Row](qe, encoder())

Review Comment:
   Does it mean we define a function and then invoke it immediately?






Re: [PR] [SPARK-50130][SQL][FOLLOWUP] Make Encoder generation lazy [spark]

2024-11-12 Thread via GitHub


zhengruifeng commented on PR #48829:
URL: https://github.com/apache/spark/pull/48829#issuecomment-2472189357

   shall we add a test for it?

