Re: Change nullable property in Dataset schema
Thank you for your comments.

> You should just Seq(...).toDS

I tried this; however, the result did not change.

> > val ds2 = ds1.map(e => e)
> Why are you e => e (since it's identity) and does nothing?

Yes, e => e does nothing. For simplicity of the example, I used the simplest expression in map(). In current Spark, an expression in map() does not change the schema of its output.

> > .as(RowEncoder(new StructType()
> >   .add("value", ArrayType(IntegerType, false), nullable = false)))

Sorry, this was my mistake. It did not work for my purpose; it actually does nothing.

Kazuaki Ishizaki

From: Jacek Laskowski
To: Kazuaki Ishizaki/Japan/IBM@IBMJP
Cc: user
Date: 2016/08/15 04:56
Subject: Re: Change nullable property in Dataset schema

On Wed, Aug 10, 2016 at 12:04 AM, Kazuaki Ishizaki wrote:
> import testImplicits._
> test("test") {
>   val ds1 = sparkContext.parallelize(Seq(Array(1, 1), Array(2, 2),
>     Array(3, 3)), 1).toDS

You should just Seq(...).toDS

> val ds2 = ds1.map(e => e)

Why are you e => e (since it's identity) and does nothing?

> .as(RowEncoder(new StructType()
>   .add("value", ArrayType(IntegerType, false), nullable = false)))

I didn't know it's possible, but it looks like it's toDF where you could replace the schema too (in a less involved way).

I learnt quite a lot from just a single email. Thanks!

Pozdrawiam,
Jacek

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org
Re: Change nullable property in Dataset schema
My motivation is to simplify the Java code generated by Tungsten's compiler. Here is a dump of the code generated from the program:

https://gist.github.com/kiszk/402bd8bc45a14be29acb3674ebc4df24

If we can tell Catalyst that the result of map() is never null, we can eliminate conditional branches. For example, in the above URL, the condition at line 45 is always false because our schema says the result of map() is never null. As a result, we can eliminate the assignments at lines 52 and 56, and the conditional branches at lines 55 and 61.

Kazuaki Ishizaki

From: Koert Kuipers
To: Kazuaki Ishizaki/Japan/IBM@IBMJP
Cc: "user@spark.apache.org"
Date: 2016/08/16 04:35
Subject: Re: Change nullable property in Dataset schema

why do you want the array to have nullable = false? what is the benefit?

On Wed, Aug 3, 2016 at 10:45 AM, Kazuaki Ishizaki wrote:

Dear all,

Would it be possible to let me know how to change the nullable property in a Dataset?

When I looked for how to change the nullable property in a DataFrame schema, I found the following approaches:

http://stackoverflow.com/questions/33193958/change-nullable-property-of-column-in-spark-dataframe
https://github.com/apache/spark/pull/13873 (not merged yet)

However, I cannot find how to change the nullable property in a Dataset schema. Even when I wrote the following program, the nullable property for "value: array" in ds2.schema is not changed.
If my understanding is correct, current Spark 2.0 uses an ExpressionEncoder that is generated based on Dataset[T] at
https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/encoders/ExpressionEncoder.scala#L46

class Test extends QueryTest with SharedSQLContext {
  import testImplicits._
  test("test") {
    val ds1 = sparkContext.parallelize(Seq(Array(1, 1), Array(2, 2),
      Array(3, 3)), 1).toDS
    val schema = new StructType().add("array", ArrayType(IntegerType, false), false)
    val inputObject = BoundReference(0, ScalaReflection.dataTypeFor[Array[Int]], false)
    val encoder = new ExpressionEncoder[Array[Int]](schema, true,
      ScalaReflection.serializerFor[Array[Int]](inputObject).flatten,
      ScalaReflection.deserializerFor[Array[Int]],
      ClassTag[Array[Int]](classOf[Array[Int]]))
    val ds2 = ds1.map(e => e)(encoder)
    ds1.printSchema
    ds2.printSchema
  }
}

root
 |-- value: array (nullable = true)
 |    |-- element: integer (containsNull = false)

root
 |-- value: array (nullable = true)    // Expected (nullable = false)
 |    |-- element: integer (containsNull = false)

Kazuaki Ishizaki
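The branch elimination described at the top of this message can be sketched in plain Scala. This is a hypothetical illustration, not the actual Tungsten-generated Java in the gist; the object and method names are invented for the example:

```scala
// Hypothetical sketch of the branch elimination described above: the same
// computation written twice, once as generated for a nullable schema and
// once as it could be generated when the schema guarantees non-null.
object NullCheckSketch {
  // Schema says nullable = true: generated code must carry an isNull flag
  // and branch on it before touching the value (cf. line 45 in the gist).
  def sumNullable(value: Array[Int]): Int = {
    val isNull = (value == null)
    if (isNull) 0 else value.sum   // branch kept because value may be null
  }

  // Schema says nullable = false: the flag, the extra assignments, and the
  // conditional branches all disappear.
  def sumNonNull(value: Array[Int]): Int =
    value.sum
}
```

The second variant is both shorter and easier for the JIT to optimize, which is the benefit being asked about.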
Re: Change nullable property in Dataset schema
why do you want the array to have nullable = false? what is the benefit?

On Wed, Aug 3, 2016 at 10:45 AM, Kazuaki Ishizaki wrote:
> Dear all,
> Would it be possible to let me know how to change nullable property in
> Dataset?
>
> When I looked for how to change nullable property in Dataframe schema, I
> found the following approaches.
> http://stackoverflow.com/questions/33193958/change-nullable-property-of-column-in-spark-dataframe
> https://github.com/apache/spark/pull/13873 (not merged yet)
>
> However, I cannot find how to change nullable property in Dataset schema.
> Even when I wrote the following program, nullable property for "value:
> array" in ds2.schema is not changed.
> If my understanding is correct, current Spark 2.0 uses an
> ExpressionEncoder that is generated based on Dataset[T] at
> https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/encoders/ExpressionEncoder.scala#L46
>
> class Test extends QueryTest with SharedSQLContext {
>   import testImplicits._
>   test("test") {
>     val ds1 = sparkContext.parallelize(Seq(Array(1, 1), Array(2, 2),
>       Array(3, 3)), 1).toDS
>     val schema = new StructType().add("array", ArrayType(IntegerType,
>       false), false)
>     val inputObject = BoundReference(0,
>       ScalaReflection.dataTypeFor[Array[Int]], false)
>     val encoder = new ExpressionEncoder[Array[Int]](schema, true,
>       ScalaReflection.serializerFor[Array[Int]](inputObject).flatten,
>       ScalaReflection.deserializerFor[Array[Int]],
>       ClassTag[Array[Int]](classOf[Array[Int]]))
>     val ds2 = ds1.map(e => e)(encoder)
>     ds1.printSchema
>     ds2.printSchema
>   }
> }
>
> root
>  |-- value: array (nullable = true)
>  |    |-- element: integer (containsNull = false)
>
> root
>  |-- value: array (nullable = true)    // Expected (nullable = false)
>  |    |-- element: integer (containsNull = false)
>
> Kazuaki Ishizaki
Re: Change nullable property in Dataset schema
On Wed, Aug 10, 2016 at 12:04 AM, Kazuaki Ishizaki wrote:
> import testImplicits._
> test("test") {
>   val ds1 = sparkContext.parallelize(Seq(Array(1, 1), Array(2, 2),
>     Array(3, 3)), 1).toDS

You should just Seq(...).toDS

> val ds2 = ds1.map(e => e)

Why are you e => e (since it's identity) and does nothing?

> .as(RowEncoder(new StructType()
>   .add("value", ArrayType(IntegerType, false), nullable = false)))

I didn't know it's possible, but it looks like it's toDF where you could replace the schema too (in a less involved way).

I learnt quite a lot from just a single email. Thanks!

Pozdrawiam,
Jacek
Re: Change nullable property in Dataset schema
After some investigation, I was able to change the nullable property in Dataset[Array[Int]] in the following way. Is this the right way?

(1) Apply https://github.com/apache/spark/pull/13873
(2) Use two Encoders. One is RowEncoder; the other is a predefined ExpressionEncoder.

class Test extends QueryTest with SharedSQLContext {
  import testImplicits._
  test("test") {
    val ds1 = sparkContext.parallelize(Seq(Array(1, 1), Array(2, 2),
      Array(3, 3)), 1).toDS
    val ds2 = ds1.map(e => e)
      .as(RowEncoder(new StructType()
        .add("value", ArrayType(IntegerType, false), nullable = false)))
      .as(newIntArrayEncoder)
    ds1.printSchema
    ds2.printSchema
  }
}

root
 |-- value: array (nullable = true)
 |    |-- element: integer (containsNull = false)

root
 |-- value: array (nullable = false)
 |    |-- element: integer (containsNull = false)

Kazuaki Ishizaki

From: Kazuaki Ishizaki/Japan/IBM@IBMJP
To: user@spark.apache.org
Date: 2016/08/03 23:46
Subject: Change nullable property in Dataset schema

Dear all,

Would it be possible to let me know how to change the nullable property in a Dataset?

When I looked for how to change the nullable property in a DataFrame schema, I found the following approaches:

http://stackoverflow.com/questions/33193958/change-nullable-property-of-column-in-spark-dataframe
https://github.com/apache/spark/pull/13873 (not merged yet)

However, I cannot find how to change the nullable property in a Dataset schema. Even when I wrote the following program, the nullable property for "value: array" in ds2.schema is not changed.
If my understanding is correct, current Spark 2.0 uses an ExpressionEncoder that is generated based on Dataset[T] at
https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/encoders/ExpressionEncoder.scala#L46

class Test extends QueryTest with SharedSQLContext {
  import testImplicits._
  test("test") {
    val ds1 = sparkContext.parallelize(Seq(Array(1, 1), Array(2, 2),
      Array(3, 3)), 1).toDS
    val schema = new StructType().add("array", ArrayType(IntegerType, false), false)
    val inputObject = BoundReference(0, ScalaReflection.dataTypeFor[Array[Int]], false)
    val encoder = new ExpressionEncoder[Array[Int]](schema, true,
      ScalaReflection.serializerFor[Array[Int]](inputObject).flatten,
      ScalaReflection.deserializerFor[Array[Int]],
      ClassTag[Array[Int]](classOf[Array[Int]]))
    val ds2 = ds1.map(e => e)(encoder)
    ds1.printSchema
    ds2.printSchema
  }
}

root
 |-- value: array (nullable = true)
 |    |-- element: integer (containsNull = false)

root
 |-- value: array (nullable = true)    // Expected (nullable = false)
 |    |-- element: integer (containsNull = false)

Kazuaki Ishizaki
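For DataFrames, the StackOverflow link above works by rewriting the schema field by field and rebuilding the DataFrame from the same rows. A minimal, Spark-free model of that rewrite is sketched below; StructField and StructType here are simplified stand-ins defined locally, not Spark's real classes, and setNullable is an invented helper name:

```scala
// Simplified stand-ins for Spark's StructField/StructType, to model the idea
// that nullability lives in schema metadata, not in the data itself.
object SchemaSketch {
  case class StructField(name: String, dataType: String, nullable: Boolean)
  case class StructType(fields: Seq[StructField])

  // Rewrite the nullable flag of one column, leaving all other fields intact.
  // With the real Spark classes, the rewritten schema would then be passed to
  // createDataFrame together with the original rows, as the linked answer does.
  def setNullable(schema: StructType, column: String, nullable: Boolean): StructType =
    StructType(schema.fields.map { f =>
      if (f.name == column) f.copy(nullable = nullable) else f
    })
}
```

This schema-level rewrite is what makes the DataFrame approaches possible; the thread's problem is that a typed Dataset regenerates its schema from the Encoder, which is why the two-Encoder trick above was needed.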