Re: to_avro and from_avro not working with struct type in spark 2.4

2019-03-01 Thread Gabor Somogyi
> I am thinking of writing out the dfKV dataframe to disk and then using the
> Avro APIs to read the data.
Ping me if you find something; I'm planning something similar.




Re: to_avro and from_avro not working with struct type in spark 2.4

2019-02-28 Thread Hien Luu
Thanks for the answer.

As far as the next step goes, I am thinking of writing out the dfKV
dataframe to disk and then using the Avro APIs to read the data.

This smells like a bug somewhere.

Cheers,

Hien
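
For anyone who wants to try this check without going through disk first, here
is a minimal sketch that decodes the value bytes with the plain Avro
GenericDatumReader instead of from_avro. It assumes dfStruct and dfKV from the
original post are still defined in the same spark-shell session, and that the
writer schema can be derived from the value column of dfStruct. It collects
the rows to the driver rather than writing them to disk, which is only
sensible for a tiny example like this one.

```scala
import org.apache.avro.Schema
import org.apache.avro.generic.{GenericDatumReader, GenericRecord}
import org.apache.avro.io.DecoderFactory
import org.apache.spark.sql.avro.SchemaConverters

// Assumption: dfStruct and dfKV are the DataFrames from the original post.
// Derive the Avro schema from the column that was passed to to_avro.
val writerSchema: Schema =
  SchemaConverters.toAvroType(dfStruct.schema("value").dataType)

val reader = new GenericDatumReader[GenericRecord](writerSchema)

// Decode every binary value on the driver with the Avro API only.
dfKV.select("value").collect().foreach { row =>
  val bytes = row.getAs[Array[Byte]]("value")
  val decoder = DecoderFactory.get().binaryDecoder(bytes, null)
  println(reader.read(null, decoder))
}
```

If the records print correctly here, the bytes produced by to_avro are fine
and the problem is on the reading side.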


-- 
Regards,


Re: to_avro and from_avro not working with struct type in spark 2.4

2019-02-28 Thread Gabor Somogyi
No, just take a look at the schema of dfStruct since you've converted its
value column with to_avro:

scala> dfStruct.printSchema
root
 |-- id: integer (nullable = false)
 |-- name: string (nullable = true)
 |-- age: integer (nullable = false)
 |-- value: struct (nullable = false)
 |    |-- name: string (nullable = true)
 |    |-- age: integer (nullable = false)
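
Since to_avro derives the Avro type from the column it is given, one way to
avoid a hand-written schema that disagrees on nullability is to build the
reader schema from that same column. A minimal sketch, assuming dfStruct and
dfKV from the original post are defined in the session (valueField and
valueAvroSchema are names made up for this example):

```scala
import org.apache.spark.sql.avro._
import org.apache.spark.sql.functions.col

// Assumption: dfStruct and dfKV are the DataFrames from the original post.
// Look up the struct column that was serialized with to_avro.
val valueField = dfStruct.schema("value")

// Convert its Spark type, including nullability, to an Avro schema string.
val valueAvroSchema =
  SchemaConverters.toAvroType(valueField.dataType, valueField.nullable).toString

// Read the bytes back with exactly the schema they were written with.
dfKV.select(from_avro(col("value"), valueAvroSchema).as("value")).show(false)
```

This only removes the nullability mismatch. As noted in this thread, the
output was still not correct after the name field was made nullable, so it
may not be sufficient on its own.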




Re: to_avro and from_avro not working with struct type in spark 2.4

2019-02-27 Thread Hien Luu
Thanks for looking into this.  Does this mean string fields should always be
nullable?

You are right that the result is not yet correct and further digging is
needed :(


-- 
Regards,


Re: to_avro and from_avro not working with struct type in spark 2.4

2019-02-27 Thread Gabor Somogyi
Hi,

I was dealing with avro stuff lately and most of the time it has something
to do with the schema.
One thing I've pinpointed quickly (where I was also struggling) is that the
name field should be nullable, but the result is still not correct, so further
digging is needed...

scala> val expectedSchema = StructType(Seq(StructField("name",
StringType,true),StructField("age", IntegerType, false)))
expectedSchema: org.apache.spark.sql.types.StructType =
StructType(StructField(name,StringType,true),
StructField(age,IntegerType,false))

scala> val avroTypeStruct =
SchemaConverters.toAvroType(expectedSchema).toString
avroTypeStruct: String =
{"type":"record","name":"topLevelRecord","fields":[{"name":"name","type":["string","null"]},{"name":"age","type":"int"}]}

scala> dfKV.select(from_avro('value, avroTypeStruct)).show
+------------------------+
|from_avro(value, struct)|
+------------------------+
|         [Mary Jane, 25]|
|         [Mary Jane, 25]|
+------------------------+

BR,
G
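
To illustrate why the nullability matters: Avro's binary encoding writes a
nullable field as a union, i.e. a branch index followed by the value, so
reading those bytes with a non-nullable schema shifts every field after it.
The sketch below uses only the plain Avro library (no Spark) and can be pasted
into the same spark-shell. The writer schema is the one printed above; the
reader schema with "type":"string" is my assumption of what the non-nullable
expectedSchema from the original post converts to.

```scala
import java.io.ByteArrayOutputStream
import org.apache.avro.Schema
import org.apache.avro.generic.{GenericData, GenericDatumReader, GenericDatumWriter, GenericRecord}
import org.apache.avro.io.{DecoderFactory, EncoderFactory}

// Writer schema: name is a ["string","null"] union, as printed above.
val writerSchema = new Schema.Parser().parse(
  """{"type":"record","name":"topLevelRecord","fields":[
    |{"name":"name","type":["string","null"]},{"name":"age","type":"int"}]}""".stripMargin)

// Reader schema (assumed): name is a plain, non-nullable string.
val readerSchema = new Schema.Parser().parse(
  """{"type":"record","name":"topLevelRecord","fields":[
    |{"name":"name","type":"string"},{"name":"age","type":"int"}]}""".stripMargin)

// Encode one record with the writer schema.
val record = new GenericData.Record(writerSchema)
record.put("name", "Mary Jane")
record.put("age", Int.box(25))

val out = new ByteArrayOutputStream()
val encoder = EncoderFactory.get().binaryEncoder(out, null)
new GenericDatumWriter[GenericRecord](writerSchema).write(record, encoder)
encoder.flush()

// Decode the same bytes with the mismatched schema and no schema resolution.
val decoder = DecoderFactory.get().binaryDecoder(out.toByteArray, null)
val misread = new GenericDatumReader[GenericRecord](readerSchema).read(null, decoder)

println(misread) // name comes back empty and age comes back as 9
```

The union index byte (0) is read as a string length of zero, and the real
string length (9 for "Mary Jane") is then read as the age, which matches the
[, 9] shown in the original post.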




to_avro and from_avro not working with struct type in spark 2.4

2019-02-26 Thread Hien Luu
Hi,

I ran into a pretty weird issue with to_avro and from_avro where from_avro
was not able to parse the data in a struct correctly.  Please see the simple
and self-contained example below. I am using Spark 2.4.  I am not sure if I
missed something.

This is how I start the spark-shell on my Mac:

./bin/spark-shell --packages org.apache.spark:spark-avro_2.11:2.4.0

import org.apache.spark.sql.types._
import org.apache.spark.sql.avro._
import org.apache.spark.sql.functions._


spark.version

val df = Seq((1, "John Doe",  30), (2, "Mary Jane", 25)).toDF("id", "name",
"age")

val dfStruct = df.withColumn("value", struct("name","age"))

dfStruct.show
dfStruct.printSchema

val dfKV = dfStruct.select(to_avro('id).as("key"),
to_avro('value).as("value"))

val expectedSchema = StructType(Seq(StructField("name", StringType,
false),StructField("age", IntegerType, false)))

val avroTypeStruct = SchemaConverters.toAvroType(expectedSchema).toString

val avroTypeStr = s"""
  |{
  |  "type": "int",
  |  "name": "key"
  |}
""".stripMargin


dfKV.select(from_avro('key, avroTypeStr)).show

// output
+-------------------+
|from_avro(key, int)|
+-------------------+
|                  1|
|                  2|
+-------------------+

dfKV.select(from_avro('value, avroTypeStruct)).show

// output
+------------------------+
|from_avro(value, struct)|
+------------------------+
|                   [, 9]|
|                   [, 9]|
+------------------------+

Please help and thanks in advance.




--
Sent from: http://apache-spark-user-list.1001560.n3.nabble.com/

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org