[GitHub] spark pull request #22309: [SPARK-20384][SQL] Support value class in schema ...
Github user mt40 commented on a diff in the pull request: https://github.com/apache/spark/pull/22309#discussion_r234085471

Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/ScalaReflection.scala

```diff
@@ -373,6 +383,32 @@ object ScalaReflection extends ScalaReflection {
           dataType = ObjectType(udt.getClass))
         Invoke(obj, "deserialize", ObjectType(udt.userClass), path :: Nil)

+      case t if isValueClass(t) =>
+        val (_, underlyingType) = getUnderlyingParameterOf(t)
+        val underlyingClsName = getClassNameFromType(underlyingType)
+        val clsName = getUnerasedClassNameFromType(t)
+        val newTypePath = s"""- Scala value class: $clsName($underlyingClsName)""" +:
+          walkedTypePath
+
+        // Nested value class is treated as its underlying type
+        // because the compiler will convert value class in the schema to
+        // its underlying type.
+        // However, for value class that is top-level or array element,
+        // if it is used as another type (e.g. as its parent trait or generic),
+        // the compiler keeps the class so we must provide an instance of the
+        // class too. In other cases, the compiler will handle wrapping/unwrapping
+        // for us automatically.
+        val arg = deserializerFor(underlyingType, path, newTypePath, Some(t))
+        val isCollectionElement = lastType.exists { lt =>
+          lt <:< localTypeOf[Array[_]] || lt <:< localTypeOf[Seq[_]]
```

I added support for `Map`.
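For readers following the thread: a minimal standalone sketch of why `Array`/`Seq`/`Map` elements need a boxed instance. The `Id` class is illustrative (borrowed from the PR description), not code from the patch:

```scala
case class Id(get: Int) extends AnyVal

object CollectionBoxing extends App {
  val direct: Int = Id(1).get                    // no allocation: the compiler unwraps Id
  val asSeq: Seq[Id] = Seq(Id(2), Id(3))         // generic container: elements are boxed Id instances
  val asArray: Array[Id] = Array(Id(4))          // arrays of value classes are boxed as well
  val asMap: Map[String, Id] = Map("k" -> Id(5)) // same story for Map values
  println(direct + asSeq.head.get + asArray.head.get + asMap("k").get) // 12
}
```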
[GitHub] spark pull request #22309: [SPARK-20384][SQL] Support value class in schema ...
Github user mt40 commented on a diff in the pull request: https://github.com/apache/spark/pull/22309#discussion_r229156708

Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/ScalaReflection.scala

```diff
@@ -128,6 +128,15 @@ object ScalaReflection extends ScalaReflection {
     case _ => false
   }

+  def isValueClass(tpe: `Type`): Boolean = {
+    tpe.typeSymbol.asClass.isDerivedValueClass
+  }
+
+  /** Returns the name and type of the underlying parameter of value class `tpe`. */
+  def getUnderlyingParameterOf(tpe: `Type`): (String, Type) = {
+    getConstructorParameters(tpe).head
```

not sure, I can't find any
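As a standalone sketch (plain Scala 2 runtime reflection, outside Spark) of what these helpers rest on — `isDerivedValueClass` is the standard-library check, and the single constructor parameter carries the underlying name and type:

```scala
import scala.reflect.runtime.universe._

case class Id(get: Int) extends AnyVal

object ValueClassReflection extends App {
  val tpe = typeOf[Id]
  // The built-in check used by isValueClass above:
  println(tpe.typeSymbol.asClass.isDerivedValueClass) // true
  // The single constructor parameter gives the underlying name and type:
  val param = tpe.decl(termNames.CONSTRUCTOR).asMethod.paramLists.head.head
  println(s"${param.name}: ${param.typeSignature}")   // get: Int
}
```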
[GitHub] spark issue #22309: [SPARK-20384][SQL] Support value class in schema of Data...
Github user mt40 commented on the issue: https://github.com/apache/spark/pull/22309

@cloud-fan while what you said is true, I think it's not that bad. Overall, the only major special logic is inside the case-class `case`, and that is due to the special nature of value classes. On the other hand, with value class support, unit testing Spark code with [ScalaCheck](https://github.com/rickynils/scalacheck) becomes much easier, which is also the main reason I'm trying to implement this patch.
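To make the motivation concrete, a sketch of the kind of ScalaCheck generator this unblocks. Class names are illustrative (echoing the PR description) and the actual Dataset round-trip is elided:

```scala
import org.scalacheck.{Arbitrary, Gen, Prop}

case class Id(get: Int) extends AnyVal
case class User(id: Id, name: String)

object UserGenerators {
  implicit val arbId: Arbitrary[Id] = Arbitrary(Gen.posNum[Int].map(Id(_)))
  implicit val arbUser: Arbitrary[User] = Arbitrary(
    for {
      id   <- Arbitrary.arbitrary[Id]
      name <- Gen.alphaStr
    } yield User(id, name)
  )

  // With nested value-class support, generated Users can feed straight
  // into a Dataset[User] without unwrapping Id by hand.
  val idsArePositive: Prop = Prop.forAll { (u: User) => u.id.get > 0 }
}
```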
[GitHub] spark pull request #22309: [SPARK-20384][SQL] Support value class in schema ...
Github user mt40 commented on a diff in the pull request: https://github.com/apache/spark/pull/22309#discussion_r228879109

Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/ScalaReflection.scala

```diff
@@ -180,11 +189,13 @@ object ScalaReflection extends ScalaReflection {
    * @param tpe The `Type` of deserialized object.
    * @param path The expression which can be used to extract serialized value.
    * @param walkedTypePath The paths from top to bottom to access current field when deserializing.
+   * @param instantiateValueClass If `true`, create an instance for Scala value class
```

ok, I updated the comment
[GitHub] spark pull request #22309: [SPARK-20384][SQL] Support value class in schema ...
Github user mt40 commented on a diff in the pull request: https://github.com/apache/spark/pull/22309#discussion_r228863565

Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/ScalaReflection.scala

```diff
@@ -373,6 +383,32 @@ object ScalaReflection extends ScalaReflection {
           dataType = ObjectType(udt.getClass))
         Invoke(obj, "deserialize", ObjectType(udt.userClass), path :: Nil)

+      case t if isValueClass(t) =>
+        val (_, underlyingType) = getUnderlyingParameterOf(t)
+        val underlyingClsName = getClassNameFromType(underlyingType)
+        val clsName = getUnerasedClassNameFromType(t)
+        val newTypePath = s"""- Scala value class: $clsName($underlyingClsName)""" +:
+          walkedTypePath
+
+        // Nested value class is treated as its underlying type
+        // because the compiler will convert value class in the schema to
+        // its underlying type.
+        // However, for value class that is top-level or array element,
+        // if it is used as another type (e.g. as its parent trait or generic),
+        // the compiler keeps the class so we must provide an instance of the
+        // class too. In other cases, the compiler will handle wrapping/unwrapping
+        // for us automatically.
+        val arg = deserializerFor(underlyingType, path, newTypePath, Some(t))
+        val isCollectionElement = lastType.exists { lt =>
+          lt <:< localTypeOf[Array[_]] || lt <:< localTypeOf[Seq[_]]
+        }
+        if (lastType.isEmpty || isCollectionElement) {
```

Yeah, I chose the `lastType` approach because it may be useful for other use cases too. But what you said is also true, so I changed it to a boolean parameter.
[GitHub] spark pull request #22309: [SPARK-20384][SQL] Support value class in schema ...
Github user mt40 commented on a diff in the pull request: https://github.com/apache/spark/pull/22309#discussion_r228741586

Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/ScalaReflection.scala

```diff
@@ -374,6 +383,29 @@ object ScalaReflection extends ScalaReflection {
           dataType = ObjectType(udt.getClass))
         Invoke(obj, "deserialize", ObjectType(udt.userClass), path :: Nil)

+      case t if isValueClass(t) =>
+        val (_, underlyingType) = getUnderlyingParameterOf(t)
+        val underlyingClsName = getClassNameFromType(underlyingType)
+        val clsName = getUnerasedClassNameFromType(t)
+        val newTypePath = s"""- Scala value class: $clsName($underlyingClsName)""" +:
+          walkedTypePath
+
+        // Nested value class is treated as its underlying type
+        // because the compiler will convert value class in the schema to
+        // its underlying type.
+        // However, for value class that is top-level or array element,
+        // if it is used as another type (e.g. as its parent trait or generic),
+        // the compiler keeps the class so we must provide an instance of the
+        // class too. In other cases, the compiler will handle wrapping/unwrapping
+        // for us automatically.
+        val arg = deserializerFor(underlyingType, path, newTypePath)
+        if (walkedTypePath.length == 1 || walkedTypePath.head.contains("array element")) {
```

Never mind, I think going with option 2 is better for the future.
[GitHub] spark pull request #22309: [SPARK-20384][SQL] Support value class in schema ...
Github user mt40 commented on a diff in the pull request: https://github.com/apache/spark/pull/22309#discussion_r228737728

Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/ScalaReflection.scala

```diff
@@ -374,6 +383,29 @@ object ScalaReflection extends ScalaReflection {
           dataType = ObjectType(udt.getClass))
         Invoke(obj, "deserialize", ObjectType(udt.userClass), path :: Nil)

+      case t if isValueClass(t) =>
+        val (_, underlyingType) = getUnderlyingParameterOf(t)
+        val underlyingClsName = getClassNameFromType(underlyingType)
+        val clsName = getUnerasedClassNameFromType(t)
+        val newTypePath = s"""- Scala value class: $clsName($underlyingClsName)""" +:
+          walkedTypePath
+
+        // Nested value class is treated as its underlying type
+        // because the compiler will convert value class in the schema to
+        // its underlying type.
+        // However, for value class that is top-level or array element,
+        // if it is used as another type (e.g. as its parent trait or generic),
+        // the compiler keeps the class so we must provide an instance of the
+        // class too. In other cases, the compiler will handle wrapping/unwrapping
+        // for us automatically.
+        val arg = deserializerFor(underlyingType, path, newTypePath)
+        if (walkedTypePath.length == 1 || walkedTypePath.head.contains("array element")) {
```

@viirya nice catch! I didn't check the array case. As you said, basically we just need to instantiate the value class if it is the type of an array element. However, to do that, in the `case` handling value classes I need to know that I'm checking the type of an array element. I think there are two ways:
- check the previous piece of `walkedTypePath`, as I'm doing here
- introduce a `previousType: Type` parameter for `deserializerFor`

Personally, I think what I did here (option 1) is a little hacky, but it is minor and works well. Do you have any suggestions?
[GitHub] spark pull request #22309: [SPARK-20384][SQL] Support value class in schema ...
Github user mt40 commented on a diff in the pull request: https://github.com/apache/spark/pull/22309#discussion_r228734075

Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/ScalaReflection.scala

```diff
@@ -376,6 +387,23 @@ object ScalaReflection extends ScalaReflection {
           dataType = ObjectType(udt.getClass))
         Invoke(obj, "deserialize", ObjectType(udt.userClass), getPath :: Nil)

+      case t if isValueClass(t) =>
+        // nested value class is treated as its underlying type
+        // top level value class must be treated as a product
+        val underlyingType = getUnderlyingTypeOf(t)
+        val underlyingClsName = getClassNameFromType(underlyingType)
+        val clsName = t.typeSymbol.asClass.fullName
+        val newTypePath = s"""- Scala value class: $clsName($underlyingClsName)""" +:
+          walkedTypePath
+
+        val arg = deserializerFor(underlyingType, path, newTypePath)
+        if (path.isDefined) {
+          arg
```

yes @viirya
[GitHub] spark pull request #22309: [SPARK-20384][SQL] Support value class in schema ...
Github user mt40 commented on a diff in the pull request: https://github.com/apache/spark/pull/22309#discussion_r228733771

Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/ScalaReflection.scala

```diff
@@ -376,6 +387,23 @@ object ScalaReflection extends ScalaReflection {
           dataType = ObjectType(udt.getClass))
         Invoke(obj, "deserialize", ObjectType(udt.userClass), getPath :: Nil)

+      case t if isValueClass(t) =>
+        // nested value class is treated as its underlying type
+        // top level value class must be treated as a product
+        val underlyingType = getUnderlyingTypeOf(t)
+        val underlyingClsName = getClassNameFromType(underlyingType)
+        val clsName = t.typeSymbol.asClass.fullName
+        val newTypePath = s"""- Scala value class: $clsName($underlyingClsName)""" +:
+          walkedTypePath
+
+        val arg = deserializerFor(underlyingType, path, newTypePath)
+        if (path.isDefined) {
+          arg
```

No, in that case `Id` will be `Int`.
[GitHub] spark issue #22309: [SPARK-20384][SQL] Support value class in schema of Data...
Github user mt40 commented on the issue: https://github.com/apache/spark/pull/22309

@cloud-fan @viirya I was silly @_@, I forgot to push the change for that part and missed that error message from Jenkins as well.
[GitHub] spark pull request #22309: [SPARK-20384][SQL] Support value class in schema ...
Github user mt40 commented on a diff in the pull request: https://github.com/apache/spark/pull/22309#discussion_r228733409

Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/ScalaReflection.scala

```diff
@@ -376,6 +387,23 @@ object ScalaReflection extends ScalaReflection {
           dataType = ObjectType(udt.getClass))
         Invoke(obj, "deserialize", ObjectType(udt.userClass), getPath :: Nil)

+      case t if isValueClass(t) =>
+        // nested value class is treated as its underlying type
+        // top level value class must be treated as a product
+        val underlyingType = getUnderlyingTypeOf(t)
+        val underlyingClsName = getClassNameFromType(underlyingType)
+        val clsName = t.typeSymbol.asClass.fullName
+        val newTypePath = s"""- Scala value class: $clsName($underlyingClsName)""" +:
+          walkedTypePath
+
+        val arg = deserializerFor(underlyingType, path, newTypePath)
+        if (path.isDefined) {
+          arg
```

It is the 1st rule, where `Id` is treated as a type parameter; see the `def identity[T]` example.
[GitHub] spark issue #22309: [SPARK-20384][SQL] Support value class in schema of Data...
Github user mt40 commented on the issue: https://github.com/apache/spark/pull/22309

Hi @viirya, may I ask whether Jenkins is having issues? It reports:
```
[error] (catalyst/compile:compileIncremental) Compilation failed
[error] Total time: 175 s, completed Oct 27, 2018 8:33:41 PM
[error] running /home/jenkins/workspace/SparkPullRequestBuilder/build/sbt -Phadoop-2.7 -Pkubernetes -Phive-thriftserver -Pkinesis-asl -Pyarn -Pspark-ganglia-lgpl -Phive -Pmesos test:package streaming-kinesis-asl-assembly/assembly ; received return code 1
```
but I can build the `catalyst` project fine on my laptop.
[GitHub] spark pull request #22309: [SPARK-20384][SQL] Support value class in schema ...
Github user mt40 commented on a diff in the pull request: https://github.com/apache/spark/pull/22309#discussion_r228733161

Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/ScalaReflection.scala

```diff
@@ -379,6 +388,28 @@ object ScalaReflection extends ScalaReflection {
           dataType = ObjectType(udt.getClass))
         Invoke(obj, "deserialize", ObjectType(udt.userClass), getPath :: Nil)

+      case t if isValueClass(t) =>
+        val (_, underlyingType) = getUnderlyingParameterOf(t)
+        val underlyingClsName = getClassNameFromType(underlyingType)
+        val clsName = getUnerasedClassNameFromType(t)
+        val newTypePath = s"""- Scala value class: $clsName($underlyingClsName)""" +:
+          walkedTypePath
+
+        // Nested value class is treated as its underlying type
+        // because the compiler will convert value class in the schema to
+        // its underlying type.
+        // However, for top-level value class, if it is used as another type
+        // (e.g. as its parent trait or generic), the compiler keeps the class
+        // so we must provide an instance of the class too. In other cases,
+        // the compiler will handle wrapping/unwrapping for us automatically.
+        val arg = deserializerFor(underlyingType, path, newTypePath)
+        if (path.isDefined) {
```

No, I haven't rebased since last week. Hmm, since `path` is not an `Option` anymore, I think I have to use `if (walkedTypePath.length > 1)` to keep the same logic, but that seems a little hacky. Do you have any suggestions?
[GitHub] spark pull request #22309: [SPARK-20384][SQL] Support value class in schema ...
Github user mt40 commented on a diff in the pull request: https://github.com/apache/spark/pull/22309#discussion_r228733107

Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/ScalaReflection.scala

```diff
@@ -376,6 +387,23 @@ object ScalaReflection extends ScalaReflection {
           dataType = ObjectType(udt.getClass))
         Invoke(obj, "deserialize", ObjectType(udt.userClass), getPath :: Nil)

+      case t if isValueClass(t) =>
+        // nested value class is treated as its underlying type
+        // top level value class must be treated as a product
+        val underlyingType = getUnderlyingTypeOf(t)
+        val underlyingClsName = getClassNameFromType(underlyingType)
+        val clsName = t.typeSymbol.asClass.fullName
+        val newTypePath = s"""- Scala value class: $clsName($underlyingClsName)""" +:
+          walkedTypePath
+
+        val arg = deserializerFor(underlyingType, path, newTypePath)
+        if (path.isDefined) {
+          arg
```

Yes, but there are some cases where an instance of the value class must be created; see the [allocation rules](https://docs.scala-lang.org/overviews/core/value-classes.html#allocation-details).
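A compact sketch of those allocation rules, mirroring the `def identity[T]` example in the linked Scala docs (the `Id` class is illustrative):

```scala
case class Id(get: Int) extends AnyVal

object AllocationDemo extends App {
  def identity[T](t: T): T = t

  val a: Any = Id(1)           // allocated: Id used as a supertype
  val b: Id  = identity(Id(2)) // allocated: Id used as a type argument
  val c: Int = Id(3).get       // no allocation: compiles down to a raw Int
  println(s"$a $b $c")
}
```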
[GitHub] spark pull request #22309: [SPARK-20384][SQL] Support value class in schema ...
Github user mt40 commented on a diff in the pull request: https://github.com/apache/spark/pull/22309#discussion_r228733087

Diff: sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/encoders/ExpressionEncoderSuite.scala

```diff
@@ -297,11 +307,16 @@ class ExpressionEncoderSuite extends CodegenInterpretedPlanTest with AnalysisTes
     ExpressionEncoder.tuple(intEnc, ExpressionEncoder.tuple(intEnc, longEnc))
   }

+  // test for Scala value class
   encodeDecodeTest(
     PrimitiveValueClass(42),
     "primitive value class")
-
   encodeDecodeTest(
     ReferenceValueClass(ReferenceValueClass.Container(1)),
     "reference value class")
+  encodeDecodeTest(StringWrapper("a"), "value class string")
+  encodeDecodeTest(ValueContainer(1, StringWrapper("b")), "value class nested")
+  encodeDecodeTest(
```

sure, I added a test with `StringWrapper(null)`
[GitHub] spark pull request #22309: [SPARK-20384][SQL] Support value class in schema ...
Github user mt40 commented on a diff in the pull request: https://github.com/apache/spark/pull/22309#discussion_r228732805

Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/ScalaReflection.scala

```diff
@@ -376,6 +387,23 @@ object ScalaReflection extends ScalaReflection {
           dataType = ObjectType(udt.getClass))
         Invoke(obj, "deserialize", ObjectType(udt.userClass), getPath :: Nil)

+      case t if isValueClass(t) =>
+        // nested value class is treated as its underlying type
+        // top level value class must be treated as a product
+        val underlyingType = getUnderlyingTypeOf(t)
+        val underlyingClsName = getClassNameFromType(underlyingType)
+        val clsName = t.typeSymbol.asClass.fullName
+        val newTypePath = s"""- Scala value class: $clsName($underlyingClsName)""" +:
+          walkedTypePath
+
+        val arg = deserializerFor(underlyingType, path, newTypePath)
+        if (path.isDefined) {
+          arg
```

@viirya we cannot, because after compilation the field `id: Id` becomes `id: Int`.
[GitHub] spark pull request #22309: [SPARK-20384][SQL] Support value class in schema ...
Github user mt40 commented on a diff in the pull request: https://github.com/apache/spark/pull/22309#discussion_r228731280

Diff: sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/ScalaReflectionSuite.scala

```diff
@@ -358,4 +368,20 @@ class ScalaReflectionSuite extends SparkFunSuite {
     assert(numberOfCheckedArguments(deserializerFor[(java.lang.Double, Int)]) == 1)
     assert(numberOfCheckedArguments(deserializerFor[(java.lang.Integer, java.lang.Integer)]) == 0)
   }

+  test("schema for case class that is a value class") {
+    val schema = schemaFor[TestingValueClass.IntWrapper]
+    assert(schema === Schema(IntegerType, nullable = false))
+  }
+
+  test("schema for case class that contains value class fields") {
+    val schema = schemaFor[TestingValueClass.ValueClassData]
+    assert(schema === Schema(
+      StructType(Seq(
+        StructField("intField", IntegerType, nullable = false),
+        StructField("wrappedInt", IntegerType, nullable = false),
```

yup, value class in general cannot be null since it is a subtype of `AnyVal`
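A short illustration of that rule. The `nullable = false` expectation restates the test above; the `StrWrapper(null)` case is inferred from the `StringWrapper(null)` test mentioned elsewhere in this thread, not from this diff:

```scala
case class IntWrapper(i: Int) extends AnyVal    // primitive underlying type
case class StrWrapper(s: String) extends AnyVal // reference underlying type

// The wrapper instance itself can never be null, so a primitive-backed
// field flattens to a non-nullable column:
//   schemaFor[IntWrapper] == Schema(IntegerType, nullable = false)

// A reference-backed wrapper can still hold a null inside:
val w = StrWrapper(null) // legal Scala; the wrapped String is null
```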
[GitHub] spark issue #22309: [SPARK-20384][SQL] Support value class in schema of Data...
Github user mt40 commented on the issue: https://github.com/apache/spark/pull/22309

@cloud-fan It works now. Actually, top-level value classes have been supported since [SPARK-17368](https://issues.apache.org/jira/browse/SPARK-17368). I maintain that and add support for nested value classes in this patch.
[GitHub] spark pull request #22309: [SPARK-20384][SQL] Support value class in schema ...
Github user mt40 commented on a diff in the pull request: https://github.com/apache/spark/pull/22309#discussion_r228716352

Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/ScalaReflection.scala

```diff
@@ -635,13 +675,17 @@ object ScalaReflection extends ScalaReflection {
             "cannot be used as field name\n" + walkedTypePath.mkString("\n"))
         }

+        // as a field, value class is represented by its underlying type
+        val trueFieldType =
+          if (isValueClass(fieldType)) getUnderlyingTypeOf(fieldType) else fieldType
+
         val fieldValue = Invoke(
-          AssertNotNull(inputObject, walkedTypePath), fieldName, dataTypeFor(fieldType),
-          returnNullable = !fieldType.typeSymbol.asClass.isPrimitive)
-        val clsName = getClassNameFromType(fieldType)
+          AssertNotNull(inputObject, walkedTypePath), fieldName, dataTypeFor(trueFieldType),
+          returnNullable = !trueFieldType.typeSymbol.asClass.isPrimitive)
+        val clsName = getClassNameFromType(trueFieldType)
```

I tried moving the special logic to the value class case, but I have a concern I don't know how to resolve yet. I need to change `dataTypeFor` to return `ObjectType` for a top-level value class and `dataTypeFor(underlyingType)` otherwise (see my [comment](https://github.com/apache/spark/pull/22309#discussion_r226142827)). I'm going with something like this:

```scala
private def dataTypeFor(tpe: `Type`, isTopLevelValueClass: Boolean = true)
```

but this isn't right because:
- the default value `true` doesn't make sense for other types
- if the default is `false`, or there is no default value, many call sites of this method need to be changed
- it also feels clunky because `dataTypeFor` now has to be aware of the context of its parameter

Do you have any suggestions?
[GitHub] spark pull request #22309: [SPARK-20384][SQL] Support value class in schema ...
Github user mt40 commented on a diff in the pull request: https://github.com/apache/spark/pull/22309#discussion_r228713038

Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/ScalaReflection.scala

```diff
@@ -622,6 +654,14 @@ object ScalaReflection extends ScalaReflection {
           dataType = ObjectType(udt.getClass))
         Invoke(obj, "serialize", udt, inputObject :: Nil)

+      case t if isValueClass(t) =>
+        val (name, underlyingType) = getConstructorParameters(t).head
```

you're right, I should return both name and type from `getUnderlyingTypeOf`
[GitHub] spark pull request #22309: [SPARK-20384][SQL] Support value class in schema ...
Github user mt40 commented on a diff in the pull request: https://github.com/apache/spark/pull/22309#discussion_r228713022

Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/ScalaReflection.scala

```diff
@@ -128,6 +128,16 @@ object ScalaReflection extends ScalaReflection {
     case _ => false
   }

+  def isValueClass(tpe: `Type`): Boolean = {
+    tpe.typeSymbol.asClass.isDerivedValueClass
+  }
+
+  /** Returns the underlying type of value class `cls`. */
+  def getUnderlyingTypeOf(cls: `Type`): `Type` = {
```

agree, it's better to use `tpe`
[GitHub] spark pull request #22309: [SPARK-20384][SQL] Support value class in schema ...
Github user mt40 commented on a diff in the pull request: https://github.com/apache/spark/pull/22309#discussion_r226510002

Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/ScalaReflection.scala

```diff
@@ -632,13 +667,17 @@ object ScalaReflection extends ScalaReflection {
             "cannot be used as field name\n" + walkedTypePath.mkString("\n"))
         }

+        // as a field, value class is represented by its underlying type
+        val trueFieldType =
+          if (isValueClass(fieldType)) getUnderlyingTypeOf(fieldType) else fieldType
```

Updating `dataTypeFor` is not enough, since `trueFieldType` is used in the code below as well. We could move this into `getConstructorParameters`, but the special handling would still be there, so it would not be cleaner.
[GitHub] spark pull request #22309: [SPARK-20384][SQL] Support value class in schema ...
Github user mt40 commented on a diff in the pull request: https://github.com/apache/spark/pull/22309#discussion_r226509492

Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/ScalaReflection.scala

```diff
@@ -376,6 +386,23 @@ object ScalaReflection extends ScalaReflection {
           dataType = ObjectType(udt.getClass))
         Invoke(obj, "deserialize", ObjectType(udt.userClass), getPath :: Nil)

+      case t if isValueClass(t) =>
+        // nested value class is treated as its underlying type
+        // top level value class must be treated as a product
```

Here, what I mean by "treated as a product" is that we must create a new instance for it, in case we have `Dataset[Id]` for example (as in my [comment](https://github.com/apache/spark/pull/22309#discussion_r226142827) above). Does this read as confusing? Should I reword it?
[GitHub] spark pull request #22309: [SPARK-20384][SQL] Support value class in schema ...
Github user mt40 commented on a diff in the pull request: https://github.com/apache/spark/pull/22309#discussion_r226508827

Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/ScalaReflection.scala

```diff
@@ -376,6 +386,23 @@ object ScalaReflection extends ScalaReflection {
           dataType = ObjectType(udt.getClass))
         Invoke(obj, "deserialize", ObjectType(udt.userClass), getPath :: Nil)

+      case t if isValueClass(t) =>
+        // nested value class is treated as its underlying type
+        // top level value class must be treated as a product
+        val underlyingType = getUnderlyingTypeOf(t)
+        val underlyingClsName = getClassNameFromType(underlyingType)
+        val clsName = getUnerasedClassNameFromType(t)
+        val newTypePath = s"""- Scala value class: $clsName($underlyingClsName)""" +:
```

Sure, here is the message when running with the value class `Id` above:

`- Scala value class: org.apache.spark.sql.catalyst.encoders.Id(scala.Int)`
[GitHub] spark pull request #22309: [SPARK-20384][SQL] Support value class in schema ...
Github user mt40 commented on a diff in the pull request: https://github.com/apache/spark/pull/22309#discussion_r226144119

Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/ScalaReflection.scala

```diff
@@ -125,6 +125,17 @@ object ScalaReflection extends ScalaReflection {
     case _ => false
   }

+  def isValueClass(tpe: `Type`): Boolean = {
+    val notNull = !(tpe <:< localTypeOf[Null])
+    notNull && definedByConstructorParams(tpe) && tpe <:< localTypeOf[AnyVal]
```

No. Now that you ask, I realize the `Product` constraint can be completely removed. Also, after scanning through the type API, I found a built-in way to check for value classes: `isDerivedValueClass`... Don't know why I never thought of this 🤦‍♂️
[GitHub] spark pull request #22309: [SPARK-20384][SQL] Support value class in schema ...
Github user mt40 commented on a diff in the pull request: https://github.com/apache/spark/pull/22309#discussion_r226142977

Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/ScalaReflection.scala

```diff
@@ -376,6 +387,23 @@ object ScalaReflection extends ScalaReflection {
           dataType = ObjectType(udt.getClass))
         Invoke(obj, "deserialize", ObjectType(udt.userClass), getPath :: Nil)

+      case t if isValueClass(t) =>
+        // nested value class is treated as its underlying type
+        // top level value class must be treated as a product
+        val underlyingType = getUnderlyingTypeOf(t)
+        val underlyingClsName = getClassNameFromType(underlyingType)
+        val clsName = t.typeSymbol.asClass.fullName
```

I cannot use that because it returns the class name after erasure (e.g. it returns `Int` for `IntWrapper`). I'll create a separate method to make this clearer.
[GitHub] spark pull request #22309: [SPARK-20384][SQL] Support value class in schema ...
Github user mt40 commented on a diff in the pull request: https://github.com/apache/spark/pull/22309#discussion_r226142827

Diff: sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/ScalaReflection.scala

```diff
@@ -376,6 +387,23 @@ object ScalaReflection extends ScalaReflection {
           dataType = ObjectType(udt.getClass))
         Invoke(obj, "deserialize", ObjectType(udt.userClass), getPath :: Nil)

+      case t if isValueClass(t) =>
+        // nested value class is treated as its underlying type
+        // top level value class must be treated as a product
+        val underlyingType = getUnderlyingTypeOf(t)
+        val underlyingClsName = getClassNameFromType(underlyingType)
+        val clsName = t.typeSymbol.asClass.fullName
+        val newTypePath = s"""- Scala value class: $clsName($underlyingClsName)""" +:
+          walkedTypePath
+
+        val arg = deserializerFor(underlyingType, path, newTypePath)
+        if (path.isDefined) {
+          arg
```

Take the class `User` above, for example. After compilation, the field `id` of type `Id` becomes `Int`, so when constructing `User` we need `id` to be an `Int`. Also, why do we need `NewInstance` in case `Id` is itself the schema? Because `Id` may remain as `Id` if it is treated as another type (following the [allocation rules](https://docs.scala-lang.org/overviews/core/value-classes.html#allocation-details)). For example, in the method [encodeDecodeTest](https://github.com/apache/spark/blob/a40806d2bd84e9a0308165f0d6c97e9cf00aa4a3/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/encoders/ExpressionEncoderSuite.scala#L373), if we pass an instance of `Id` as input, it will not be converted to `Int`. In the other case, when the required type is explicitly `Id`, both the input and the result returned from deserialization will become `Int`.
[GitHub] spark issue #22309: [SPARK-20384][SQL] Support value class in schema of Data...
Github user mt40 commented on the issue: https://github.com/apache/spark/pull/22309

@cloud-fan may I ask what `error code -9` is? I cannot find the error in the Jenkins log.
[GitHub] spark pull request #22309: [SPARK-20384][SQL] Support value class in schema ...
Github user mt40 commented on a diff in the pull request: https://github.com/apache/spark/pull/22309#discussion_r224951179

Diff: sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/ScalaReflectionSuite.scala

```diff
@@ -108,6 +108,16 @@ object TestingUDT {
   }
 }

+object TestingValueClass {
+  case class IntWrapper(i: Int) extends AnyVal
+  case class StrWrapper(s: String) extends AnyVal
+
+  case class ValueClassData(
+      intField: Int,
+      wrappedInt: IntWrapper,
+      strField: String,
+      wrappedStr: StrWrapper)
```

Right, I added some comments for this.
[GitHub] spark pull request #22309: [SPARK-20384][SQL] Support value class in schema ...
Github user mt40 commented on a diff in the pull request: https://github.com/apache/spark/pull/22309#discussion_r224944923

Diff: sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/ScalaReflectionSuite.scala

```diff
@@ -108,6 +108,16 @@ object TestingUDT {
   }
 }

+object TestingValueClass {
+  case class IntWrapper(i: Int) extends AnyVal
```

It doesn't, but Spark only supports `case class` (not plain `class`) for the schema type, so I keep it that way. Child columns can be plain `class`, though. I think adding that in the future on top of this is not difficult.
[GitHub] spark issue #22309: [SPARK-20384][CORE] Support value class in schema of Dat...
Github user mt40 commented on the issue: https://github.com/apache/spark/pull/22309

@cloud-fan @liancheng @marmbrus could you please take a look at this and start the tests?
[GitHub] spark pull request #22309: [SPARK-20384][core] Support value class in schema...
GitHub user mt40 opened a pull request: https://github.com/apache/spark/pull/22309

[SPARK-20384][core] Support value class in schema of Dataset

## What changes were proposed in this pull request?

This PR adds support for [Scala value classes][1] in the schema of a Dataset (as both a top-level class and a nested field). The idea is to treat a value class as its underlying type at run time. For example:

```scala
case class Id(get: Int) extends AnyVal
case class User(id: Id) // field `id` will be treated as Int
```

However, if the value class is top-level (e.g. `Dataset[Id]`), then it must be treated like a boxed type and must be instantiated. I'm not sure why it behaves this way, but I suspect it is related to the [expansion of value classes][2] when we do casting (e.g. `asInstanceOf[T]`).

Actually, this feature was addressed before in [SPARK-17368][3], but that patch only supports the top-level case. Hence we see the error when a value class is nested, as in [SPARK-19741][4] and [SPARK-20384][5].

[1]: https://docs.scala-lang.org/sips/value-classes.html
[2]: https://docs.scala-lang.org/sips/value-classes.html#example-1
[3]: https://issues.apache.org/jira/browse/SPARK-17368
[4]: https://issues.apache.org/jira/browse/SPARK-19741
[5]: https://issues.apache.org/jira/browse/SPARK-20384

## How was this patch tested?

I added unit tests for the top-level and nested cases.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/mt40/spark dataset_value_class

Alternatively you can review and apply these changes as the patch at: https://github.com/apache/spark/pull/22309.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #22309

commit 5613217771b1929b9f66106468fd2da2c3ea7dec
Author: minhthai
Date: 2018-08-31T13:49:21Z

    [SPARK-20384] Support value class in schema of Dataset
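A sketch of the end-to-end behavior this PR targets. It assumes a spark-shell session (`spark.implicits._` in scope); the printed schema is the expected flattening described above, not verified output:

```scala
import spark.implicits._ // spark: SparkSession provided by the shell

case class Id(get: Int) extends AnyVal
case class User(id: Id)

// Nested value class: the schema flattens Id to its underlying Int.
val users = Seq(User(Id(1)), User(Id(2))).toDS()
users.printSchema()
// root
//  |-- id: integer (nullable = false)

// Top-level value class (supported since SPARK-17368): elements come
// back as boxed Id instances through the generic collect().
val ids: Array[Id] = Seq(Id(3)).toDS().collect()
```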