Re: [PR] [SPARK-47307] Replace RFC 2045 base64 encoder with RFC 4648 encoder [spark]

2024-03-08 Thread via GitHub


dongjoon-hyun commented on code in PR #45408:
URL: https://github.com/apache/spark/pull/45408#discussion_r1518435444


##
sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala:
##
@@ -3318,6 +3318,13 @@ object SQLConf {
 .booleanConf
 .createWithDefault(false)
 
+  val BASE_64_CHUNKING = buildConf("spark.sql.base64.chunking")
+    .doc("When true, base64 strings generated by the base64 function are chunked into lines of " +
+      "at most 76 characters. When false, the base64 strings are not chunked.")
+    .version("3.5.2")
+    .booleanConf
+    .createWithDefault(true)

Review Comment:
   Since we decided to keep the existing behavior as the default, could you
revise the PR title and description, @ted-jenks?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



Re: [PR] [SPARK-47307] Replace RFC 2045 base64 encoder with RFC 4648 encoder [spark]

2024-03-08 Thread via GitHub


dongjoon-hyun commented on code in PR #45408:
URL: https://github.com/apache/spark/pull/45408#discussion_r1518435595


##
sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala:
##
@@ -3318,6 +3318,13 @@ object SQLConf {
 .booleanConf
 .createWithDefault(false)
 
+  val BASE_64_CHUNKING = buildConf("spark.sql.base64.chunking")

Review Comment:
   This part, `val BASE_64_CHUNKING`, should also be revised accordingly.






Re: [PR] [SPARK-47307] Replace RFC 2045 base64 encoder with RFC 4648 encoder [spark]

2024-03-08 Thread via GitHub


dongjoon-hyun commented on code in PR #45408:
URL: https://github.com/apache/spark/pull/45408#discussion_r1518435187


##
sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala:
##
@@ -3318,6 +3318,13 @@ object SQLConf {
 .booleanConf
 .createWithDefault(false)
 
+  val BASE_64_CHUNKING = buildConf("spark.sql.base64.chunking")

Review Comment:
   We are very careful about the configuration namespace. In this case, it's a
little too much to add a new namespace, `spark.sql.base64.*`, for this single
configuration. Maybe (1) `spark.sql.chunkBase64String` or (2)
`spark.sql.chunkBase64String.enabled`? I prefer (1).






Re: [PR] [SPARK-47307] Replace RFC 2045 base64 encoder with RFC 4648 encoder [spark]

2024-03-08 Thread via GitHub


dongjoon-hyun commented on code in PR #45408:
URL: https://github.com/apache/spark/pull/45408#discussion_r1518433859


##
sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala:
##
@@ -3318,6 +3318,13 @@ object SQLConf {
 .booleanConf
 .createWithDefault(false)
 
+  val BASE_64_CHUNKING = buildConf("spark.sql.base64.chunking")
+    .doc("When true, base64 strings generated by the base64 function are chunked into lines of " +
+      "at most 76 characters. When false, the base64 strings are not chunked.")
+    .version("3.5.2")

Review Comment:
   This should be 4.0.0.






Re: [PR] [SPARK-47307] Replace RFC 2045 base64 encoder with RFC 4648 encoder [spark]

2024-03-08 Thread via GitHub


ted-jenks commented on code in PR #45408:
URL: https://github.com/apache/spark/pull/45408#discussion_r1517931178


##
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/stringExpressions.scala:
##
@@ -2426,21 +2426,34 @@ case class Chr(child: Expression)
   """,
   since = "1.5.0",
   group = "string_funcs")
-case class Base64(child: Expression)
+case class Base64(child: Expression, chunkBase64: Boolean = 
SQLConf.get.base64Chunking)
   extends UnaryExpression with ImplicitCastInputTypes with NullIntolerant {
 
+  lazy val encoder: JBase64.Encoder = if (chunkBase64) {
+    JBase64.getMimeEncoder
+  } else {
+    JBase64.getMimeEncoder(-1, Array())
+  }
+
   override def dataType: DataType = StringType
   override def inputTypes: Seq[DataType] = Seq(BinaryType)
 
   protected override def nullSafeEval(bytes: Any): Any = {
-    UTF8String.fromBytes(JBase64.getMimeEncoder.encode(bytes.asInstanceOf[Array[Byte]]))
+    UTF8String.fromBytes(encoder.encode(bytes.asInstanceOf[Array[Byte]]))
   }
 
   override def doGenCode(ctx: CodegenContext, ev: ExprCode): ExprCode = {
 nullSafeCodeGen(ctx, ev, (child) => {
-  s"""${ev.value} = UTF8String.fromBytes(
-${classOf[JBase64].getName}.getMimeEncoder().encode($child));
-   """})
+  s"""
+if ($chunkBase64) {

Review Comment:
   nice






Re: [PR] [SPARK-47307] Replace RFC 2045 base64 encoder with RFC 4648 encoder [spark]

2024-03-08 Thread via GitHub


cloud-fan commented on code in PR #45408:
URL: https://github.com/apache/spark/pull/45408#discussion_r1517826436


##
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/stringExpressions.scala:
##
@@ -2426,21 +2426,34 @@ case class Chr(child: Expression)
   """,
   since = "1.5.0",
   group = "string_funcs")
-case class Base64(child: Expression)
+case class Base64(child: Expression, chunkBase64: Boolean = SQLConf.get.base64Chunking)
   extends UnaryExpression with ImplicitCastInputTypes with NullIntolerant {
 
+  lazy val encoder: JBase64.Encoder = if (chunkBase64) {
+    JBase64.getMimeEncoder
+  } else {
+    JBase64.getMimeEncoder(-1, Array())
+  }
+
   override def dataType: DataType = StringType
   override def inputTypes: Seq[DataType] = Seq(BinaryType)
 
   protected override def nullSafeEval(bytes: Any): Any = {
-    UTF8String.fromBytes(JBase64.getMimeEncoder.encode(bytes.asInstanceOf[Array[Byte]]))
+    UTF8String.fromBytes(encoder.encode(bytes.asInstanceOf[Array[Byte]]))
   }
 
   override def doGenCode(ctx: CodegenContext, ev: ExprCode): ExprCode = {
 nullSafeCodeGen(ctx, ev, (child) => {
-  s"""${ev.value} = UTF8String.fromBytes(
-${classOf[JBase64].getName}.getMimeEncoder().encode($child));
-   """})
+  s"""
+if ($chunkBase64) {

Review Comment:
   We know the value of `chunkBase64` before generating the Java code, so we
can do better:
   ```
   if (chunkBase64) {
 s""" ...
   } else {
 s""" ...
   }
   ```






Re: [PR] [SPARK-47307] Replace RFC 2045 base64 encoder with RFC 4648 encoder [spark]

2024-03-08 Thread via GitHub


cloud-fan commented on code in PR #45408:
URL: https://github.com/apache/spark/pull/45408#discussion_r1517664214


##
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/stringExpressions.scala:
##
@@ -2426,21 +2426,27 @@ case class Chr(child: Expression)
   """,
   since = "1.5.0",
   group = "string_funcs")
-case class Base64(child: Expression)
+case class Base64(child: Expression, chunkBase64: Boolean)
   extends UnaryExpression with ImplicitCastInputTypes with NullIntolerant {
 
+  def this(child: Expression) = this(child, SQLConf.get.base64Chunking)
+
+  lazy val encoder: JBase64.Encoder = if (chunkBase64) {
+    JBase64.getMimeEncoder
+  } else {
+    JBase64.getMimeEncoder(-1, Array())
+  }
   override def dataType: DataType = StringType
   override def inputTypes: Seq[DataType] = Seq(BinaryType)
 
   protected override def nullSafeEval(bytes: Any): Any = {
-    UTF8String.fromBytes(JBase64.getMimeEncoder.encode(bytes.asInstanceOf[Array[Byte]]))
+    UTF8String.fromBytes(encoder.encode(bytes.asInstanceOf[Array[Byte]]))
   }
 
   override def doGenCode(ctx: CodegenContext, ev: ExprCode): ExprCode = {
 nullSafeCodeGen(ctx, ev, (child) => {
-  s"""${ev.value} = UTF8String.fromBytes(
-${classOf[JBase64].getName}.getMimeEncoder().encode($child));
-   """})
+  s"${ev.value} = UTF8String.fromBytes($encoder}.encode($child));"

Review Comment:
   Does this really work? `$encoder` interpolates `JBase64.Encoder#toString`,
not Java code that returns a `JBase64.Encoder`.






Re: [PR] [SPARK-47307] Replace RFC 2045 base64 encoder with RFC 4648 encoder [spark]

2024-03-08 Thread via GitHub


ted-jenks commented on PR #45408:
URL: https://github.com/apache/spark/pull/45408#issuecomment-1985424230

   I think making this configurable makes the most sense. People processing
data for external systems with Spark can then choose whether or not to chunk
the output based on their use case.





Re: [PR] [SPARK-47307] Replace RFC 2045 base64 encoder with RFC 4648 encoder [spark]

2024-03-07 Thread via GitHub


yaooqinn commented on PR #45408:
URL: https://github.com/apache/spark/pull/45408#issuecomment-1984930529

   Since the Spark community didn't receive any issue reports during the
v3.3.0 - v3.5.1 releases, I think this is a corner case. Maybe we can make the
config internal.





Re: [PR] [SPARK-47307] Replace RFC 2045 base64 encoder with RFC 4648 encoder [spark]

2024-03-07 Thread via GitHub


dongjoon-hyun commented on PR #45408:
URL: https://github.com/apache/spark/pull/45408#issuecomment-1984929745

   +1 for the direction if we need to support both.





Re: [PR] [SPARK-47307] Replace RFC 2045 base64 encoder with RFC 4648 encoder [spark]

2024-03-07 Thread via GitHub


yaooqinn commented on PR #45408:
URL: https://github.com/apache/spark/pull/45408#issuecomment-1984926315

   Thank you @dongjoon-hyun. 
   
   In such circumstances, I guess we can add a configuration for the base64
classes to avoid breaking things again. AFAIK, Apache Hive also uses the JDK
version, and I think the majority of Spark users interact heavily with Hive
through Spark SQL.





Re: [PR] [SPARK-47307] Replace RFC 2045 base64 encoder with RFC 4648 encoder [spark]

2024-03-07 Thread via GitHub


dongjoon-hyun commented on PR #45408:
URL: https://github.com/apache/spark/pull/45408#issuecomment-1984433848

   Thank you for the confirmation, @ted-jenks . Well, in this case, it's too
late to change the behavior again. Apache Spark 3.3 has been EOL since last
year, and I don't think we need to change the behavior for Apache Spark 3.4.3
and 3.5.2 because the Apache Spark community didn't have such an official
contract before. It would have been great if you had participated in the
Apache Spark 3.3.0 RC votes at that time.
   
   > > It sounds like you have other systems to read Spark's data.
   >
   > Correct. The issue was that from 3.2 to 3.3 there was a behavior change in
   > the base64 encoding used in Spark. Previously, it did not chunk. Now, it
   > does. Chunked base64 cannot be read by base64 decoders that are not MIME
   > compatible, so the data output by Spark appears corrupt to systems that
   > follow the plain base64 standard.
   >
   > I think the best path forward is to use MIME encoding/decoding without
   > chunking, as this is the most fault tolerant: existing use cases will not
   > break, and the pre-3.3 base64 behavior is upheld.
   
   However, I understand and agree with @ted-jenks 's point as a nice-to-have
for Apache Spark 4+ officially. In other words, if we want to merge this PR, we
need to make it official from Apache Spark 4.0.0 onward and protect it as a
kind of developer interface for all future releases. Do you think that's okay,
@ted-jenks?
   
   BTW, what do you think about this proposal, @yaooqinn (the original author
of #35110), @cloud-fan, and @HyukjinKwon?





Re: [PR] [SPARK-47307] Replace RFC 2045 base64 encoder with RFC 4648 encoder [spark]

2024-03-07 Thread via GitHub


ted-jenks commented on PR #45408:
URL: https://github.com/apache/spark/pull/45408#issuecomment-1982836579

   @dongjoon-hyun
   > It sounds like you have other systems to read Spark's data.
   
   Correct. The issue was that from 3.2 to 3.3 there was a behavior change in
the base64 encoding used in Spark. Previously, it did not chunk. Now, it does.
Chunked base64 cannot be read by base64 decoders that are not MIME compatible,
so the data output by Spark appears corrupt to systems that follow the plain
base64 standard.
   
   I think the best path forward is to use MIME encoding/decoding without
chunking, as this is the most fault tolerant: existing use cases will not
break, and the pre-3.3 base64 behavior is upheld.
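The incompatibility described above can be reproduced with the JDK's own decoders. A minimal sketch (the class and method names are illustrative, not from the PR):

```java
import java.util.Base64;

public class StrictDecodeDemo {
    // Returns true when the strict RFC 4648 "basic" decoder rejects the input.
    public static boolean strictRejects(String encoded) {
        try {
            Base64.getDecoder().decode(encoded); // basic (RFC 4648) decoder
            return false;
        } catch (IllegalArgumentException e) {
            return true; // CR/LF line separators are illegal characters here
        }
    }

    public static void main(String[] args) {
        byte[] input = new byte[100];
        // Spark 3.3+ produces MIME-chunked output with a CRLF every 76 chars.
        String chunked = Base64.getMimeEncoder().encodeToString(input);

        System.out.println(strictRejects(chunked)); // true: strict decoders fail
        // The MIME decoder tolerates both chunked and unchunked input, which is
        // why unchunked MIME output stays readable by either kind of decoder.
        System.out.println(Base64.getMimeDecoder().decode(chunked).length == 100); // true
    }
}
```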





Re: [PR] [SPARK-47307] Replace RFC 2045 base64 encoder with RFC 4648 encoder [spark]

2024-03-06 Thread via GitHub


dongjoon-hyun commented on PR #45408:
URL: https://github.com/apache/spark/pull/45408#issuecomment-1981829489

   Hi, @ted-jenks . Could you elaborate on your correctness situation a little
more? It sounds like you have other systems that read Spark's data.





Re: [PR] [SPARK-47307] Replace RFC 2045 base64 encoder with RFC 4648 encoder [spark]

2024-03-06 Thread via GitHub


ted-jenks commented on PR #45408:
URL: https://github.com/apache/spark/pull/45408#issuecomment-1981542843

   @dongjoon-hyun could you please take a look? This caused a big data
correctness issue for us.

