[GitHub] spark pull request #22309: [SPARK-20384][SQL] Support value class in schema ...

2018-11-15 Thread mt40
Github user mt40 commented on a diff in the pull request:

https://github.com/apache/spark/pull/22309#discussion_r234085471
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/ScalaReflection.scala 
---
@@ -373,6 +383,32 @@ object ScalaReflection extends ScalaReflection {
   dataType = ObjectType(udt.getClass))
 Invoke(obj, "deserialize", ObjectType(udt.userClass), path :: Nil)
 
+      case t if isValueClass(t) =>
+        val (_, underlyingType) = getUnderlyingParameterOf(t)
+        val underlyingClsName = getClassNameFromType(underlyingType)
+        val clsName = getUnerasedClassNameFromType(t)
+        val newTypePath = s"""- Scala value class: $clsName($underlyingClsName)""" +:
+          walkedTypePath
+
+        // Nested value class is treated as its underlying type
+        // because the compiler will convert value class in the schema to
+        // its underlying type.
+        // However, for value class that is top-level or array element,
+        // if it is used as another type (e.g. as its parent trait or generic),
+        // the compiler keeps the class so we must provide an instance of the
+        // class too. In other cases, the compiler will handle wrapping/unwrapping
+        // for us automatically.
+        val arg = deserializerFor(underlyingType, path, newTypePath, Some(t))
+        val isCollectionElement = lastType.exists { lt =>
+          lt <:< localTypeOf[Array[_]] || lt <:< localTypeOf[Seq[_]]
--- End diff --

I added the support for Map
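
For reference, a standalone sketch of what the extended element check amounts to (illustration only; the real check lives in `ScalaReflection` and uses `lastType`/`localTypeOf`, and treating `Map` like `Array` and `Seq` here is my reading of the change): a value-class element of any of these collection types keeps its wrapper at runtime, so the deserializer must create instances for it.
```scala
import scala.reflect.runtime.universe._

object CollectionCheckDemo extends App {
  // Illustration only: value-class elements of these collection types keep
  // their wrapper at runtime, so the deserializer must instantiate them.
  def isCollectionType(tpe: Type): Boolean =
    tpe <:< typeOf[Array[_]] || tpe <:< typeOf[Seq[_]] || tpe <:< typeOf[Map[_, _]]

  assert(isCollectionType(typeOf[Map[String, Int]]))
  assert(isCollectionType(typeOf[Seq[Int]]))
  assert(!isCollectionType(typeOf[Option[Int]]))
}
```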


---




[GitHub] spark pull request #22309: [SPARK-20384][SQL] Support value class in schema ...

2018-10-29 Thread mt40
Github user mt40 commented on a diff in the pull request:

https://github.com/apache/spark/pull/22309#discussion_r229156708
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/ScalaReflection.scala 
---
@@ -128,6 +128,15 @@ object ScalaReflection extends ScalaReflection {
 case _ => false
   }
 
+  def isValueClass(tpe: `Type`): Boolean = {
+    tpe.typeSymbol.asClass.isDerivedValueClass
+  }
+
+  /** Returns the name and type of the underlying parameter of value class `tpe`. */
+  def getUnderlyingParameterOf(tpe: `Type`): (String, Type) = {
+    getConstructorParameters(tpe).head
--- End diff --

not sure, I can't find any


---




[GitHub] spark issue #22309: [SPARK-20384][SQL] Support value class in schema of Data...

2018-10-29 Thread mt40
Github user mt40 commented on the issue:

https://github.com/apache/spark/pull/22309
  
@cloud-fan while what you said is true, I think it's not that bad. Overall, the only major special logic is inside the case class `case`, and this is due to the special nature of value classes.
On the other hand, with value class support, unit testing Spark code with [ScalaCheck](https://github.com/rickynils/scalacheck) becomes much easier, which is also the main reason I'm trying to implement this patch.
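
For context, here is a minimal sketch of the kind of ScalaCheck setup this enables (the domain types and generators below are made up for illustration, not part of the PR): once value classes work in schemas, property-generated models that wrap primitives in value classes can be fed into a `Dataset` without manual unwrapping.
```scala
import org.scalacheck.{Arbitrary, Gen}

// Hypothetical domain model: a value class nested inside a case class.
final case class UserId(value: Int) extends AnyVal
final case class User(id: UserId, name: String)

object UserGenerators {
  // Generators compose over the value class just like over any other type.
  val userIdGen: Gen[UserId] = Gen.posNum[Int].map(UserId(_))

  val userGen: Gen[User] = for {
    id   <- userIdGen
    name <- Gen.alphaStr
  } yield User(id, name)

  implicit val arbUser: Arbitrary[User] = Arbitrary(userGen)
}
```
With this patch, generated `User` values like these should be usable directly as `Dataset[User]` rows, with `id` encoded as its underlying `Int`.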


---




[GitHub] spark pull request #22309: [SPARK-20384][SQL] Support value class in schema ...

2018-10-29 Thread mt40
Github user mt40 commented on a diff in the pull request:

https://github.com/apache/spark/pull/22309#discussion_r228879109
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/ScalaReflection.scala 
---
@@ -180,11 +189,13 @@ object ScalaReflection extends ScalaReflection {
    * @param tpe The `Type` of deserialized object.
    * @param path The expression which can be used to extract serialized value.
    * @param walkedTypePath The paths from top to bottom to access current field when deserializing.
+   * @param instantiateValueClass If `true`, create an instance for Scala value class
--- End diff --

ok, I updated the comment


---




[GitHub] spark pull request #22309: [SPARK-20384][SQL] Support value class in schema ...

2018-10-29 Thread mt40
Github user mt40 commented on a diff in the pull request:

https://github.com/apache/spark/pull/22309#discussion_r228863565
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/ScalaReflection.scala 
---
@@ -373,6 +383,32 @@ object ScalaReflection extends ScalaReflection {
   dataType = ObjectType(udt.getClass))
 Invoke(obj, "deserialize", ObjectType(udt.userClass), path :: Nil)
 
+      case t if isValueClass(t) =>
+        val (_, underlyingType) = getUnderlyingParameterOf(t)
+        val underlyingClsName = getClassNameFromType(underlyingType)
+        val clsName = getUnerasedClassNameFromType(t)
+        val newTypePath = s"""- Scala value class: $clsName($underlyingClsName)""" +:
+          walkedTypePath
+
+        // Nested value class is treated as its underlying type
+        // because the compiler will convert value class in the schema to
+        // its underlying type.
+        // However, for value class that is top-level or array element,
+        // if it is used as another type (e.g. as its parent trait or generic),
+        // the compiler keeps the class so we must provide an instance of the
+        // class too. In other cases, the compiler will handle wrapping/unwrapping
+        // for us automatically.
+        val arg = deserializerFor(underlyingType, path, newTypePath, Some(t))
+        val isCollectionElement = lastType.exists { lt =>
+          lt <:< localTypeOf[Array[_]] || lt <:< localTypeOf[Seq[_]]
+        }
+        if (lastType.isEmpty || isCollectionElement) {
--- End diff --

Yea, I chose the `lastType` approach because it may be useful for other use cases as well. But what you said is also true, so I changed it to a boolean parameter.


---




[GitHub] spark pull request #22309: [SPARK-20384][SQL] Support value class in schema ...

2018-10-28 Thread mt40
Github user mt40 commented on a diff in the pull request:

https://github.com/apache/spark/pull/22309#discussion_r228741586
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/ScalaReflection.scala 
---
@@ -374,6 +383,29 @@ object ScalaReflection extends ScalaReflection {
   dataType = ObjectType(udt.getClass))
 Invoke(obj, "deserialize", ObjectType(udt.userClass), path :: Nil)
 
+      case t if isValueClass(t) =>
+        val (_, underlyingType) = getUnderlyingParameterOf(t)
+        val underlyingClsName = getClassNameFromType(underlyingType)
+        val clsName = getUnerasedClassNameFromType(t)
+        val newTypePath = s"""- Scala value class: $clsName($underlyingClsName)""" +:
+          walkedTypePath
+
+        // Nested value class is treated as its underlying type
+        // because the compiler will convert value class in the schema to
+        // its underlying type.
+        // However, for value class that is top-level or array element,
+        // if it is used as another type (e.g. as its parent trait or generic),
+        // the compiler keeps the class so we must provide an instance of the
+        // class too. In other cases, the compiler will handle wrapping/unwrapping
+        // for us automatically.
+        val arg = deserializerFor(underlyingType, path, newTypePath)
+        if (walkedTypePath.length == 1 || walkedTypePath.head.contains("array element")) {
--- End diff --

never mind, I think going with option 2 is better in the future


---




[GitHub] spark pull request #22309: [SPARK-20384][SQL] Support value class in schema ...

2018-10-28 Thread mt40
Github user mt40 commented on a diff in the pull request:

https://github.com/apache/spark/pull/22309#discussion_r228737728
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/ScalaReflection.scala 
---
@@ -374,6 +383,29 @@ object ScalaReflection extends ScalaReflection {
   dataType = ObjectType(udt.getClass))
 Invoke(obj, "deserialize", ObjectType(udt.userClass), path :: Nil)
 
+      case t if isValueClass(t) =>
+        val (_, underlyingType) = getUnderlyingParameterOf(t)
+        val underlyingClsName = getClassNameFromType(underlyingType)
+        val clsName = getUnerasedClassNameFromType(t)
+        val newTypePath = s"""- Scala value class: $clsName($underlyingClsName)""" +:
+          walkedTypePath
+
+        // Nested value class is treated as its underlying type
+        // because the compiler will convert value class in the schema to
+        // its underlying type.
+        // However, for value class that is top-level or array element,
+        // if it is used as another type (e.g. as its parent trait or generic),
+        // the compiler keeps the class so we must provide an instance of the
+        // class too. In other cases, the compiler will handle wrapping/unwrapping
+        // for us automatically.
+        val arg = deserializerFor(underlyingType, path, newTypePath)
+        if (walkedTypePath.length == 1 || walkedTypePath.head.contains("array element")) {
--- End diff --

@viirya nice catch! I didn't check for the array case.
As you said, basically we just need to instantiate the value class if it is the type of an array element. However, to do that, the `case` handling value classes needs to know that it is checking the type of an array element. I think there are 2 ways:
- check the previous piece of `walkedTypePath`, like I'm doing here
- introduce a parameter called `previousType: Type` for `deserializerFor`

Personally, I think what I did here (option 1) is a little bit hacky, but it is minor and works well. Do you have any suggestions on this?
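
For illustration, a standalone sketch of option 2 (simplified names, not the PR's `deserializerFor`): thread the enclosing type through the recursion so the value-class case can tell whether it is describing a collection element.
```scala
import scala.reflect.runtime.universe._

object PreviousTypeSketch extends App {
  final case class Id(get: Int) extends AnyVal

  // Simplified walk over a type: the previous (enclosing) type is passed down,
  // so the value-class case knows whether it sits directly under a collection.
  def describe(tpe: Type, previousType: Option[Type] = None): String = tpe.dealias match {
    case t if t.typeSymbol.isClass && t.typeSymbol.asClass.isDerivedValueClass =>
      val isCollectionElement = previousType.exists { pt =>
        pt <:< typeOf[Array[_]] || pt <:< typeOf[Seq[_]]
      }
      if (previousType.isEmpty || isCollectionElement) s"instantiate ${t.typeSymbol.name}"
      else s"use underlying type of ${t.typeSymbol.name}"
    case t if t <:< typeOf[Seq[_]] || t <:< typeOf[Array[_]] =>
      describe(t.typeArgs.head, previousType = Some(t))
    case t =>
      t.typeSymbol.name.toString
  }

  println(describe(typeOf[Id]))      // instantiate Id (top-level)
  println(describe(typeOf[Seq[Id]])) // instantiate Id (collection element)
}
```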


---




[GitHub] spark pull request #22309: [SPARK-20384][SQL] Support value class in schema ...

2018-10-27 Thread mt40
Github user mt40 commented on a diff in the pull request:

https://github.com/apache/spark/pull/22309#discussion_r228734075
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/ScalaReflection.scala 
---
@@ -376,6 +387,23 @@ object ScalaReflection extends ScalaReflection {
   dataType = ObjectType(udt.getClass))
         Invoke(obj, "deserialize", ObjectType(udt.userClass), getPath :: Nil)
 
+      case t if isValueClass(t) =>
+        // nested value class is treated as its underlying type
+        // top level value class must be treated as a product
+        val underlyingType = getUnderlyingTypeOf(t)
+        val underlyingClsName = getClassNameFromType(underlyingType)
+        val clsName = t.typeSymbol.asClass.fullName
+        val newTypePath = s"""- Scala value class: $clsName($underlyingClsName)""" +:
+          walkedTypePath
+
+        val arg = deserializerFor(underlyingType, path, newTypePath)
+        if (path.isDefined) {
+          arg
--- End diff --

yes @viirya 


---




[GitHub] spark pull request #22309: [SPARK-20384][SQL] Support value class in schema ...

2018-10-27 Thread mt40
Github user mt40 commented on a diff in the pull request:

https://github.com/apache/spark/pull/22309#discussion_r228733771
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/ScalaReflection.scala 
---
@@ -376,6 +387,23 @@ object ScalaReflection extends ScalaReflection {
   dataType = ObjectType(udt.getClass))
         Invoke(obj, "deserialize", ObjectType(udt.userClass), getPath :: Nil)
 
+      case t if isValueClass(t) =>
+        // nested value class is treated as its underlying type
+        // top level value class must be treated as a product
+        val underlyingType = getUnderlyingTypeOf(t)
+        val underlyingClsName = getClassNameFromType(underlyingType)
+        val clsName = t.typeSymbol.asClass.fullName
+        val newTypePath = s"""- Scala value class: $clsName($underlyingClsName)""" +:
+          walkedTypePath
+
+        val arg = deserializerFor(underlyingType, path, newTypePath)
+        if (path.isDefined) {
+          arg
--- End diff --

No, in that case `Id` will be `Int`.


---




[GitHub] spark issue #22309: [SPARK-20384][SQL] Support value class in schema of Data...

2018-10-27 Thread mt40
Github user mt40 commented on the issue:

https://github.com/apache/spark/pull/22309
  
@cloud-fan @viirya I was silly @_@, I forgot to push the change for that 
part and missed that error message from Jenkins as well


---




[GitHub] spark pull request #22309: [SPARK-20384][SQL] Support value class in schema ...

2018-10-27 Thread mt40
Github user mt40 commented on a diff in the pull request:

https://github.com/apache/spark/pull/22309#discussion_r228733409
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/ScalaReflection.scala 
---
@@ -376,6 +387,23 @@ object ScalaReflection extends ScalaReflection {
   dataType = ObjectType(udt.getClass))
         Invoke(obj, "deserialize", ObjectType(udt.userClass), getPath :: Nil)
 
+      case t if isValueClass(t) =>
+        // nested value class is treated as its underlying type
+        // top level value class must be treated as a product
+        val underlyingType = getUnderlyingTypeOf(t)
+        val underlyingClsName = getClassNameFromType(underlyingType)
+        val clsName = t.typeSymbol.asClass.fullName
+        val newTypePath = s"""- Scala value class: $clsName($underlyingClsName)""" +:
+          walkedTypePath
+
+        val arg = deserializerFor(underlyingType, path, newTypePath)
+        if (path.isDefined) {
+          arg
--- End diff --

It is the 1st rule, where `Id` is treated as a type parameter; you can see the `def identity[T]` example.


---




[GitHub] spark issue #22309: [SPARK-20384][SQL] Support value class in schema of Data...

2018-10-27 Thread mt40
Github user mt40 commented on the issue:

https://github.com/apache/spark/pull/22309
  
Hi @viirya, may I ask whether Jenkins is having an issue? It reports
```
[error] (catalyst/compile:compileIncremental) Compilation failed
[error] Total time: 175 s, completed Oct 27, 2018 8:33:41 PM
[error] running /home/jenkins/workspace/SparkPullRequestBuilder/build/sbt 
-Phadoop-2.7 -Pkubernetes -Phive-thriftserver -Pkinesis-asl -Pyarn 
-Pspark-ganglia-lgpl -Phive -Pmesos test:package 
streaming-kinesis-asl-assembly/assembly ; received return code 1
```
but I can build the `catalyst` project fine on my laptop.


---




[GitHub] spark pull request #22309: [SPARK-20384][SQL] Support value class in schema ...

2018-10-27 Thread mt40
Github user mt40 commented on a diff in the pull request:

https://github.com/apache/spark/pull/22309#discussion_r228733161
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/ScalaReflection.scala 
---
@@ -379,6 +388,28 @@ object ScalaReflection extends ScalaReflection {
   dataType = ObjectType(udt.getClass))
         Invoke(obj, "deserialize", ObjectType(udt.userClass), getPath :: Nil)
 
+      case t if isValueClass(t) =>
+        val (_, underlyingType) = getUnderlyingParameterOf(t)
+        val underlyingClsName = getClassNameFromType(underlyingType)
+        val clsName = getUnerasedClassNameFromType(t)
+        val newTypePath = s"""- Scala value class: $clsName($underlyingClsName)""" +:
+          walkedTypePath
+
+        // Nested value class is treated as its underlying type
+        // because the compiler will convert value class in the schema to
+        // its underlying type.
+        // However, for top-level value class, if it is used as another type
+        // (e.g. as its parent trait or generic), the compiler keeps the class
+        // so we must provide an instance of the class too. In other cases,
+        // the compiler will handle wrapping/unwrapping for us automatically.
+        val arg = deserializerFor(underlyingType, path, newTypePath)
+        if (path.isDefined) {
--- End diff --

No, I haven't rebased since last week.
Hmm, now that `path` is not an `Option` anymore, I think I have to use `if (walkedTypePath.length > 1)` to get the same logic. But this seems a little bit hacky. Do you have any suggestions on this?


---




[GitHub] spark pull request #22309: [SPARK-20384][SQL] Support value class in schema ...

2018-10-27 Thread mt40
Github user mt40 commented on a diff in the pull request:

https://github.com/apache/spark/pull/22309#discussion_r228733107
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/ScalaReflection.scala 
---
@@ -376,6 +387,23 @@ object ScalaReflection extends ScalaReflection {
   dataType = ObjectType(udt.getClass))
         Invoke(obj, "deserialize", ObjectType(udt.userClass), getPath :: Nil)
 
+      case t if isValueClass(t) =>
+        // nested value class is treated as its underlying type
+        // top level value class must be treated as a product
+        val underlyingType = getUnderlyingTypeOf(t)
+        val underlyingClsName = getClassNameFromType(underlyingType)
+        val clsName = t.typeSymbol.asClass.fullName
+        val newTypePath = s"""- Scala value class: $clsName($underlyingClsName)""" +:
+          walkedTypePath
+
+        val arg = deserializerFor(underlyingType, path, newTypePath)
+        if (path.isDefined) {
+          arg
--- End diff --

Yes, but there are some cases where an instance of the value class must be created. You can see the [allocation rules](https://docs.scala-lang.org/overviews/core/value-classes.html#allocation-details).
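
A small self-contained illustration of those allocation rules (the names are mine, not from the PR): the compiler keeps the underlying type in ordinary field and parameter positions, but boxes the value class when it is used through a type parameter or stored in an array.
```scala
object AllocationDemo extends App {
  final case class Id(get: Int) extends AnyVal

  // No allocation: the compiler passes the underlying Int around.
  def plain(id: Id): Int = id.get

  // Allocation: using the value class through a type parameter (or through its
  // parent trait) forces boxing, so a real Id object exists at runtime.
  def passThrough[T](t: T): T = t
  val boxed: Id = passThrough(Id(1))

  // Allocation: arrays of value classes also box each element.
  val ids: Array[Id] = Array(Id(1), Id(2))

  println((plain(Id(7)), boxed, ids.length))
}
```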


---




[GitHub] spark pull request #22309: [SPARK-20384][SQL] Support value class in schema ...

2018-10-27 Thread mt40
Github user mt40 commented on a diff in the pull request:

https://github.com/apache/spark/pull/22309#discussion_r228733087
  
--- Diff: 
sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/encoders/ExpressionEncoderSuite.scala
 ---
@@ -297,11 +307,16 @@ class ExpressionEncoderSuite extends CodegenInterpretedPlanTest with AnalysisTes
     ExpressionEncoder.tuple(intEnc, ExpressionEncoder.tuple(intEnc, longEnc))
   }
 
+  // test for Scala value class
   encodeDecodeTest(
     PrimitiveValueClass(42), "primitive value class")
-
   encodeDecodeTest(
     ReferenceValueClass(ReferenceValueClass.Container(1)), "reference value class")
+  encodeDecodeTest(StringWrapper("a"), "value class string")
+  encodeDecodeTest(ValueContainer(1, StringWrapper("b")), "value class nested")
+  encodeDecodeTest(
--- End diff --

sure, I added a test with `StringWrapper(null)`


---




[GitHub] spark pull request #22309: [SPARK-20384][SQL] Support value class in schema ...

2018-10-27 Thread mt40
Github user mt40 commented on a diff in the pull request:

https://github.com/apache/spark/pull/22309#discussion_r228732805
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/ScalaReflection.scala 
---
@@ -376,6 +387,23 @@ object ScalaReflection extends ScalaReflection {
   dataType = ObjectType(udt.getClass))
         Invoke(obj, "deserialize", ObjectType(udt.userClass), getPath :: Nil)
 
+      case t if isValueClass(t) =>
+        // nested value class is treated as its underlying type
+        // top level value class must be treated as a product
+        val underlyingType = getUnderlyingTypeOf(t)
+        val underlyingClsName = getClassNameFromType(underlyingType)
+        val clsName = t.typeSymbol.asClass.fullName
+        val newTypePath = s"""- Scala value class: $clsName($underlyingClsName)""" +:
+          walkedTypePath
+
+        val arg = deserializerFor(underlyingType, path, newTypePath)
+        if (path.isDefined) {
+          arg
--- End diff --

@viirya we cannot, because after compilation the field `id: Id` becomes `id: Int`.


---




[GitHub] spark pull request #22309: [SPARK-20384][SQL] Support value class in schema ...

2018-10-27 Thread mt40
Github user mt40 commented on a diff in the pull request:

https://github.com/apache/spark/pull/22309#discussion_r228731280
  
--- Diff: 
sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/ScalaReflectionSuite.scala
 ---
@@ -358,4 +368,20 @@ class ScalaReflectionSuite extends SparkFunSuite {
     assert(numberOfCheckedArguments(deserializerFor[(java.lang.Double, Int)]) == 1)
     assert(numberOfCheckedArguments(deserializerFor[(java.lang.Integer, java.lang.Integer)]) == 0)
   }
 
+  test("schema for case class that is a value class") {
+    val schema = schemaFor[TestingValueClass.IntWrapper]
+    assert(schema === Schema(IntegerType, nullable = false))
+  }
+
+  test("schema for case class that contains value class fields") {
+    val schema = schemaFor[TestingValueClass.ValueClassData]
+    assert(schema === Schema(
+      StructType(Seq(
+        StructField("intField", IntegerType, nullable = false),
+        StructField("wrappedInt", IntegerType, nullable = false),
--- End diff --

Yup, a value class in general cannot be null since it is a subtype of `AnyVal`.


---




[GitHub] spark issue #22309: [SPARK-20384][SQL] Support value class in schema of Data...

2018-10-27 Thread mt40
Github user mt40 commented on the issue:

https://github.com/apache/spark/pull/22309
  
@cloud-fan It works now. Actually, top-level value classes have been supported since [SPARK-17368](https://issues.apache.org/jira/browse/SPARK-17368). I'm trying to maintain that and add support for nested value classes in this patch.


---




[GitHub] spark pull request #22309: [SPARK-20384][SQL] Support value class in schema ...

2018-10-27 Thread mt40
Github user mt40 commented on a diff in the pull request:

https://github.com/apache/spark/pull/22309#discussion_r228716352
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/ScalaReflection.scala 
---
@@ -635,13 +675,17 @@ object ScalaReflection extends ScalaReflection {
   "cannot be used as field name\n" + 
walkedTypePath.mkString("\n"))
   }
 
+  // as a field, value class is represented by its underlying type
+  val trueFieldType =
+if (isValueClass(fieldType)) getUnderlyingTypeOf(fieldType) 
else fieldType
+
   val fieldValue = Invoke(
-AssertNotNull(inputObject, walkedTypePath), fieldName, 
dataTypeFor(fieldType),
-returnNullable = !fieldType.typeSymbol.asClass.isPrimitive)
-  val clsName = getClassNameFromType(fieldType)
+AssertNotNull(inputObject, walkedTypePath), fieldName, 
dataTypeFor(trueFieldType),
+returnNullable = !trueFieldType.typeSymbol.asClass.isPrimitive)
+  val clsName = getClassNameFromType(trueFieldType)
--- End diff --

I tried moving the special logic to the value class case, but I have a concern I don't know how to resolve yet. I need to change `dataTypeFor` to return `ObjectType` for a top-level value class and `dataTypeFor(underlyingType)` otherwise (see my [comment](https://github.com/apache/spark/pull/22309#discussion_r226142827)). I'm going with something like this:
```scala
private def dataTypeFor(tpe: `Type`, isTopLevelValueClass: Boolean = true)
```
but this isn't right because:
- the default value `true` doesn't make sense for other types
- if the default is `false`, or there is no default value, many places that call this method need to be changed
- it also feels clunky because `dataTypeFor` now has to be aware of the context of its parameter

Do you have any suggestions on this?


---




[GitHub] spark pull request #22309: [SPARK-20384][SQL] Support value class in schema ...

2018-10-27 Thread mt40
Github user mt40 commented on a diff in the pull request:

https://github.com/apache/spark/pull/22309#discussion_r228713038
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/ScalaReflection.scala 
---
@@ -622,6 +654,14 @@ object ScalaReflection extends ScalaReflection {
   dataType = ObjectType(udt.getClass))
 Invoke(obj, "serialize", udt, inputObject :: Nil)
 
+      case t if isValueClass(t) =>
+        val (name, underlyingType) = getConstructorParameters(t).head
--- End diff --

You're right, I should return both the name and the type from `getUnderlyingTypeOf`.


---




[GitHub] spark pull request #22309: [SPARK-20384][SQL] Support value class in schema ...

2018-10-27 Thread mt40
Github user mt40 commented on a diff in the pull request:

https://github.com/apache/spark/pull/22309#discussion_r228713022
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/ScalaReflection.scala 
---
@@ -128,6 +128,16 @@ object ScalaReflection extends ScalaReflection {
 case _ => false
   }
 
+  def isValueClass(tpe: `Type`): Boolean = {
+    tpe.typeSymbol.asClass.isDerivedValueClass
+  }
+
+  /** Returns the underlying type of value class `cls`. */
+  def getUnderlyingTypeOf(cls: `Type`): `Type` = {
--- End diff --

agree, it's better to use `tpe`


---




[GitHub] spark pull request #22309: [SPARK-20384][SQL] Support value class in schema ...

2018-10-18 Thread mt40
Github user mt40 commented on a diff in the pull request:

https://github.com/apache/spark/pull/22309#discussion_r226510002
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/ScalaReflection.scala 
---
@@ -632,13 +667,17 @@ object ScalaReflection extends ScalaReflection {
   "cannot be used as field name\n" + 
walkedTypePath.mkString("\n"))
   }
 
+  // as a field, value class is represented by its underlying type
+  val trueFieldType =
+if (isValueClass(fieldType)) getUnderlyingTypeOf(fieldType) 
else fieldType
--- End diff --

Updating `dataTypeFor` is not enough, since `trueFieldType` is used in the code below as well. We could move this into `getConstructorParameters`, but then this special handling would still be there, so it would not be cleaner.


---




[GitHub] spark pull request #22309: [SPARK-20384][SQL] Support value class in schema ...

2018-10-18 Thread mt40
Github user mt40 commented on a diff in the pull request:

https://github.com/apache/spark/pull/22309#discussion_r226509492
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/ScalaReflection.scala 
---
@@ -376,6 +386,23 @@ object ScalaReflection extends ScalaReflection {
   dataType = ObjectType(udt.getClass))
         Invoke(obj, "deserialize", ObjectType(udt.userClass), getPath :: Nil)
 
+      case t if isValueClass(t) =>
+        // nested value class is treated as its underlying type
+        // top level value class must be treated as a product
--- End diff --

Here, what I mean by "treated as a product" is that we must create a new instance for it when we have, for example, `Dataset[Id]` (as in my [comment](https://github.com/apache/spark/pull/22309#discussion_r226142827) above). Does this seem confusing? Should I reword it?


---




[GitHub] spark pull request #22309: [SPARK-20384][SQL] Support value class in schema ...

2018-10-18 Thread mt40
Github user mt40 commented on a diff in the pull request:

https://github.com/apache/spark/pull/22309#discussion_r226508827
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/ScalaReflection.scala 
---
@@ -376,6 +386,23 @@ object ScalaReflection extends ScalaReflection {
   dataType = ObjectType(udt.getClass))
         Invoke(obj, "deserialize", ObjectType(udt.userClass), getPath :: Nil)
 
+      case t if isValueClass(t) =>
+        // nested value class is treated as its underlying type
+        // top level value class must be treated as a product
+        val underlyingType = getUnderlyingTypeOf(t)
+        val underlyingClsName = getClassNameFromType(underlyingType)
+        val clsName = getUnerasedClassNameFromType(t)
+        val newTypePath = s"""- Scala value class: $clsName($underlyingClsName)""" +:
--- End diff --

Sure, here is the message when running with value class `Id` above
`- Scala value class: org.apache.spark.sql.catalyst.encoders.Id(scala.Int)`


---




[GitHub] spark pull request #22309: [SPARK-20384][SQL] Support value class in schema ...

2018-10-17 Thread mt40
Github user mt40 commented on a diff in the pull request:

https://github.com/apache/spark/pull/22309#discussion_r226144119
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/ScalaReflection.scala 
---
@@ -125,6 +125,17 @@ object ScalaReflection extends ScalaReflection {
 case _ => false
   }
 
+  def isValueClass(tpe: `Type`): Boolean = {
+    val notNull = !(tpe <:< localTypeOf[Null])
+    notNull && definedByConstructorParams(tpe) && tpe <:< localTypeOf[AnyVal]
--- End diff --

No. Now that you ask this, I realize that the `Product` constraint can be completely removed.

Also, after scanning through the type API, I found a [built-in way](isDerivedValueClass) to check for value classes... Don't know why I never thought about this 🤦‍♂️
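
For the record, a tiny standalone sketch of that built-in check (`isDerivedValueClass` on `ClassSymbol`), outside of Spark:
```scala
import scala.reflect.runtime.universe._

object ValueClassCheckDemo extends App {
  final case class IntWrapper(i: Int) extends AnyVal
  final case class Plain(i: Int)

  // The compiler already knows which classes are derived value classes.
  assert(typeOf[IntWrapper].typeSymbol.asClass.isDerivedValueClass)
  assert(!typeOf[Plain].typeSymbol.asClass.isDerivedValueClass)
}
```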


---




[GitHub] spark pull request #22309: [SPARK-20384][SQL] Support value class in schema ...

2018-10-17 Thread mt40
Github user mt40 commented on a diff in the pull request:

https://github.com/apache/spark/pull/22309#discussion_r226142977
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/ScalaReflection.scala 
---
@@ -376,6 +387,23 @@ object ScalaReflection extends ScalaReflection {
   dataType = ObjectType(udt.getClass))
         Invoke(obj, "deserialize", ObjectType(udt.userClass), getPath :: Nil)
 
+      case t if isValueClass(t) =>
+        // nested value class is treated as its underlying type
+        // top level value class must be treated as a product
+        val underlyingType = getUnderlyingTypeOf(t)
+        val underlyingClsName = getClassNameFromType(underlyingType)
+        val clsName = t.typeSymbol.asClass.fullName
--- End diff --

I cannot use that because it returns the class name after erasure (e.g. it 
returns `Int` for `IntWrapper`). I'll create a separate method to make this 
clearer.


---




[GitHub] spark pull request #22309: [SPARK-20384][SQL] Support value class in schema ...

2018-10-17 Thread mt40
Github user mt40 commented on a diff in the pull request:

https://github.com/apache/spark/pull/22309#discussion_r226142827
  
--- Diff: 
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/ScalaReflection.scala 
---
@@ -376,6 +387,23 @@ object ScalaReflection extends ScalaReflection {
   dataType = ObjectType(udt.getClass))
         Invoke(obj, "deserialize", ObjectType(udt.userClass), getPath :: Nil)
 
+      case t if isValueClass(t) =>
+        // nested value class is treated as its underlying type
+        // top level value class must be treated as a product
+        val underlyingType = getUnderlyingTypeOf(t)
+        val underlyingClsName = getClassNameFromType(underlyingType)
+        val clsName = t.typeSymbol.asClass.fullName
+        val newTypePath = s"""- Scala value class: $clsName($underlyingClsName)""" +:
+          walkedTypePath
+
+        val arg = deserializerFor(underlyingType, path, newTypePath)
+        if (path.isDefined) {
+          arg
--- End diff --

Take the class `User` above for example. After compilation, the field `id` of type `Id` becomes `Int`, so when constructing `User` we need `id` to be an `Int`.

Also, why do we need `NewInstance` when `Id` is itself the schema? Because `Id` may remain as `Id` if it is treated as another type (following the [allocation rules](https://docs.scala-lang.org/overviews/core/value-classes.html#allocation-details)). For example, in the method [encodeDecodeTest](https://github.com/apache/spark/blob/a40806d2bd84e9a0308165f0d6c97e9cf00aa4a3/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/encoders/ExpressionEncoderSuite.scala#L373), if we pass an instance of `Id` as input, it will not be converted to `Int`. In the other case, when the required type is explicitly `Id`, both the input and the result returned from deserialization become `Int`.
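
A small self-contained illustration of that behavior (reusing the `Id`/`User` names from the PR description; the output comments are roughly what reflection prints, not exact):
```scala
object ErasureDemo extends App {
  final case class Id(get: Int) extends AnyVal
  final case class User(id: Id)

  // After erasure the constructor and the accessor work on Int, so a
  // deserializer building a User must supply an Int, not an Id.
  println(classOf[User].getDeclaredConstructors.toList) // e.g. public ErasureDemo$User(int)
  println(classOf[User].getDeclaredMethod("id"))        // e.g. public int ErasureDemo$User.id()

  // But when Id flows through a generic method it stays boxed, so a
  // Dataset[Id]-style use site still sees real Id instances.
  def passThrough[T](t: T): T = t
  val kept: Id = passThrough(Id(1))
  println(kept)
}
```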


---




[GitHub] spark issue #22309: [SPARK-20384][SQL] Support value class in schema of Data...

2018-10-13 Thread mt40
Github user mt40 commented on the issue:

https://github.com/apache/spark/pull/22309
  
@cloud-fan may I ask what `error code -9` is? I cannot find the error in the Jenkins log.


---




[GitHub] spark pull request #22309: [SPARK-20384][SQL] Support value class in schema ...

2018-10-13 Thread mt40
Github user mt40 commented on a diff in the pull request:

https://github.com/apache/spark/pull/22309#discussion_r224951179
  
--- Diff: 
sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/ScalaReflectionSuite.scala
 ---
@@ -108,6 +108,16 @@ object TestingUDT {
   }
 }
 
+object TestingValueClass {
+  case class IntWrapper(i: Int) extends AnyVal
+  case class StrWrapper(s: String) extends AnyVal
+
+  case class ValueClassData(
+    intField: Int,
+    wrappedInt: IntWrapper,
+    strField: String,
+    wrappedStr: StrWrapper)
--- End diff --

Right, I added some comments for this


---




[GitHub] spark pull request #22309: [SPARK-20384][SQL] Support value class in schema ...

2018-10-12 Thread mt40
Github user mt40 commented on a diff in the pull request:

https://github.com/apache/spark/pull/22309#discussion_r224944923
  
--- Diff: 
sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/ScalaReflectionSuite.scala
 ---
@@ -108,6 +108,16 @@ object TestingUDT {
   }
 }
 
+object TestingValueClass {
+  case class IntWrapper(i: Int) extends AnyVal
--- End diff --

It doesn't, but Spark only supports `case class` (not plain `class`) for the schema type, so I keep it that way.
Child columns can be a plain `class`, though. I think adding that in the future on top of this is not difficult.


---




[GitHub] spark issue #22309: [SPARK-20384][CORE] Support value class in schema of Dat...

2018-09-04 Thread mt40
Github user mt40 commented on the issue:

https://github.com/apache/spark/pull/22309
  
@cloud-fan @liancheng @marmbrus could you please take a look at this and 
start the tests?


---




[GitHub] spark pull request #22309: [SPARK-20384][core] Support value class in schema...

2018-09-01 Thread mt40
GitHub user mt40 opened a pull request:

https://github.com/apache/spark/pull/22309

[SPARK-20384][core] Support value class in schema of Dataset

## What changes were proposed in this pull request?
This PR adds support for [Scala value class][1] in the schema of Datasets (both as a top-level class and as a nested field).
The idea is to treat a value class as its underlying type at run time. For example:
```scala
case class Id(get: Int) extends AnyVal
case class User(id: Id) // field `id` will be treated as Int
```
However, if the value class is top-level (e.g. `Dataset[Id]`), then it must be treated like a boxed type and must be instantiated. I'm not sure why it behaves this way, but I suspect it is related to the [expansion of value classes][2] when we do casting (e.g. `asInstanceOf[T]`).
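
To make the intended behavior concrete, a usage sketch under this patch (a `SparkSession` named `spark` is assumed in scope, and the case classes are assumed to be defined at top level or in an object; the printed schema is the expected result, with `id` stored as its underlying `Int`):
```scala
case class Id(get: Int) extends AnyVal
case class User(id: Id)

// With a SparkSession `spark` in scope:
import spark.implicits._

// Nested: the value class field is encoded as its underlying Int column.
val users = Seq(User(Id(1)), User(Id(2))).toDS()
users.printSchema()
// root
//  |-- id: integer (nullable = false)

// Top-level: Dataset[Id] works too, and real Id instances are created on read.
val ids = Seq(Id(1), Id(2)).toDS()
val firstId: Id = ids.head()
```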

Actually, this feature was addressed before in [SPARK-17368][3], but that patch only supports the top-level case. Hence we see the error when a value class is nested, as in [SPARK-19741][4] and [SPARK-20384][5].

[1]: https://docs.scala-lang.org/sips/value-classes.html
[2]: https://docs.scala-lang.org/sips/value-classes.html#example-1
[3]: https://issues.apache.org/jira/browse/SPARK-17368
[4]: https://issues.apache.org/jira/browse/SPARK-19741
[5]: https://issues.apache.org/jira/browse/SPARK-20384

## How was this patch tested?
I added unit tests for the top-level and nested cases.


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/mt40/spark dataset_value_class

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/22309.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #22309


commit 5613217771b1929b9f66106468fd2da2c3ea7dec
Author: minhthai 
Date:   2018-08-31T13:49:21Z

[SPARK-20384] Support value class in schema of Dataset




---
