Re: How to iterate the element of an array in DataFrame?

2016-10-24 Thread Yan Facai
scala> mblog_tags.dtypes
res13: Array[(String, String)] =
Array((tags,ArrayType(StructType(StructField(category,StringType,true),
StructField(weight,StringType,true)),true)))

scala> val testUDF = udf{ s: Seq[Tags] => s(0).weight }
testUDF: org.apache.spark.sql.expressions.UserDefinedFunction =
UserDefinedFunction(<function1>,StringType,Some(List(ArrayType(StructType(StructField(category,StringType,true),
StructField(weight,StringType,true)),true))))

What is wrong with the UDF `testUDF`?
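
For this particular test (the weight of the first array element), a UDF may not be needed at all: `Column.getItem` and `Column.getField` can reach into an array of structs directly. A minimal sketch for spark-shell, assuming a local session and made-up sample rows with the same schema as `mblog_tags`; the names `mblogTags`, `Tag` and `withFirstWeight` are illustrative and do not come from the thread.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

// Assumption: a local session; in spark-shell this returns the existing one.
val spark = SparkSession.builder().master("local[1]").appName("first-weight-sketch").getOrCreate()
import spark.implicits._

// Hypothetical sample data shaped like the column in the thread:
// array<struct<category:string,weight:string>>
case class Tag(category: String, weight: String)
val mblogTags = Seq(
  Tuple1(Seq(Tag("tagCategory_060", "0.8"), Tag("tagCategory_029", "0.7"))),
  Tuple1(Seq(Tag("tagCategory_029", "0.9")))
).toDF("tags")

// getItem(0) selects the first array element,
// getField("weight") reads that struct's weight field.
val withFirstWeight =
  mblogTags.withColumn("test", col("tags").getItem(0).getField("weight"))

withFirstWeight.show(false)
```

Column expressions like these stay inside Spark's optimizer, so they are usually preferable to a UDF when the logic is this simple.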







Re: How to iterate the element of an array in DataFrame?

2016-10-24 Thread Yan Facai
Thanks, Cheng Lian.

I tried to use a case class:

scala> case class Tags (category: String, weight: String)

scala> val testUDF = udf{ s: Seq[Tags] => s(0).weight }

testUDF: org.apache.spark.sql.expressions.UserDefinedFunction =
UserDefinedFunction(<function1>,StringType,Some(List(ArrayType(StructType(StructField(category,StringType,true),
StructField(weight,StringType,true)),true))))


but it raises a ClassCastException when run:

scala> mblog_tags.withColumn("test", testUDF(col("tags"))).show(false)

16/10/25 10:39:54 ERROR Executor: Exception in task 0.0 in stage 4.0 (TID 4)
java.lang.ClassCastException: org.apache.spark.sql.catalyst.expressions.GenericRowWithSchema cannot be cast to $line58.$read$$iw$$iw$Tags
    at $line59.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$anonfun$1.apply(<console>:27)
    at $line59.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$anonfun$1.apply(<console>:27)
    ...


What did I do wrong?
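
The ClassCastException is consistent with how Spark hands struct values to a Scala UDF: each element of an array-of-struct column arrives as a `Row` (here a `GenericRowWithSchema`), not as an instance of the case class. Declaring the parameter as `Seq[Tags]` compiles because of type erasure, but fails on the first element access. Below is a sketch of the same UDF declared over `Seq[Row]`, assuming a local session and made-up sample rows; `mblogTags` and `result` are illustrative names.

```scala
import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.functions.{col, udf}

// Assumption: a local session; in spark-shell this returns the existing one.
val spark = SparkSession.builder().master("local[1]").appName("row-udf-sketch").getOrCreate()
import spark.implicits._

// The case class is used only to build sample data with the thread's schema.
case class Tags(category: String, weight: String)
val mblogTags = Seq(
  Tuple1(Seq(Tags("tagCategory_060", "0.8"), Tags("tagCategory_029", "0.7"))),
  Tuple1(Seq(Tags("tagCategory_029", "0.9")))
).toDF("tags")

// Struct elements reach the UDF as Row, so the parameter is Seq[Row]
// and fields are read by name. Null or empty arrays yield null.
val testUDF = udf { tags: Seq[Row] =>
  Option(tags).flatMap(_.headOption).map(_.getAs[String]("weight")).orNull
}

val result = mblogTags.withColumn("test", testUDF(col("tags")))
result.show(false)
```

The case class remains useful for creating data with the right field names; it just cannot serve as the UDF's input type here.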






Re: How to iterate the element of an array in DataFrame?

2016-10-21 Thread Cheng Lian
You may either use the SQL functions "array" and "named_struct" or define a
case class with the expected field names.


Cheng
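
Both options from the reply above can be sketched as follows. This is an illustration under assumptions, not code from the thread: the local session, the sample values, and the names `viaSql`, `viaCaseClass` and `Tag` are made up.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.expr

// Assumption: a local session; in spark-shell this returns the existing one.
val spark = SparkSession.builder().master("local[1]").appName("array-of-struct-sketch").getOrCreate()
import spark.implicits._

// Option 1: build the column with the SQL functions array and named_struct.
val viaSql = spark.range(1).select(expr(
  """array(
    |  named_struct('category', 'tagCategory_060', 'weight', '0.8'),
    |  named_struct('category', 'tagCategory_029', 'weight', '0.7'))""".stripMargin
).as("tags"))

// Option 2: a case class whose field names match the expected struct fields.
case class Tag(category: String, weight: String)
val viaCaseClass = Seq(
  Tuple1(Seq(Tag("tagCategory_060", "0.8"), Tag("tagCategory_029", "0.7")))
).toDF("tags")

// Both columns have the type array<struct<category:string,weight:string>>.
viaSql.printSchema()
viaCaseClass.printSchema()
```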











Re: Re: How to iterate the element of an array in DataFrame?

2016-10-21 Thread Yan Facai
My expectation is:
root
|-- tag: vector

Namely, I want to convert from:
[[tagCategory_060, 0.8], [tagCategory_029, 0.7]]
to:
Vectors.sparse(100, Array(60, 29),  Array(0.8, 0.7))

I believe it needs two steps:
1. val tag2vec = {tag: Array[Structure] => Vector}
2. mblog_tags.withColumn("vec", tag2vec(col("tag")))

But I have no idea how to describe the Array[Structure] in the
DataFrame.
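
The two steps above can be sketched with a UDF over `Seq[Row]`, since struct elements reach a Scala UDF as `Row` objects. This is a hedged sketch, not a confirmed solution from the thread. It assumes that the vector index is the number after the last underscore in `category`, that the vector size is 100 as in the expected output, and that the `ml` (not `mllib`) vector type is wanted. The sample data and the names `mblogTags`, `Tag`, `vectorSize` and `withVec` are illustrative.

```scala
import org.apache.spark.ml.linalg.{Vector, Vectors}
import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.functions.{col, udf}

// Assumption: a local session; in spark-shell this returns the existing one.
val spark = SparkSession.builder().master("local[1]").appName("tag2vec-sketch").getOrCreate()
import spark.implicits._

// Hypothetical sample data with the thread's schema.
case class Tag(category: String, weight: String)
val mblogTags = Seq(
  Tuple1(Seq(Tag("tagCategory_060", "0.8"), Tag("tagCategory_029", "0.7"))),
  Tuple1(Seq(Tag("tagCategory_029", "0.9")))
).toDF("tag")

val vectorSize = 100 // assumption, taken from the expected output

// Step 1: Seq[Row] => Vector.
val tag2vec = udf { tags: Seq[Row] =>
  val pairs = Option(tags).getOrElse(Seq.empty[Row]).map { r =>
    // "tagCategory_060" -> 60 (assumed naming convention)
    val index = r.getAs[String]("category").split("_").last.toInt
    (index, r.getAs[String]("weight").toDouble)
  }
  // The Seq overload sorts the entries by index before building the vector.
  Vectors.sparse(vectorSize, pairs)
}

// Step 2: apply the UDF to the array column.
val withVec = mblogTags.withColumn("vec", tag2vec(col("tag")))
withVec.show(false)
```

Because the `Seq[(Int, Double)]` overload of `Vectors.sparse` sorts by index, the first row comes out with indices (29, 60) rather than (60, 29). The same overload rejects duplicate indices, so repeated categories within one row would need to be merged first.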







Re: Re: How to iterate the element of an array in DataFrame?

2016-10-21 Thread lk_spark
How about changing the schema from
root
 |-- category.firstCategory: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- category: string (nullable = true)
 |    |    |-- weight: string (nullable = true)

to:

root
 |-- category: string (nullable = true)
 |-- weight: string (nullable = true)

2016-10-21 

lk_spark 
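
One way to get from the nested schema to the flat one is `explode`, which emits one output row per array element. A minimal sketch, assuming a local session and made-up sample rows; `mblogTags`, `Tag` and `flat` are illustrative names. The column name contains a dot, so it is wrapped in backticks to keep Spark from reading it as a struct field access.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, explode}

// Assumption: a local session; in spark-shell this returns the existing one.
val spark = SparkSession.builder().master("local[1]").appName("explode-sketch").getOrCreate()
import spark.implicits._

// Hypothetical sample data using the column name from the thread.
case class Tag(category: String, weight: String)
val mblogTags = Seq(
  Tuple1(Seq(Tag("tagCategory_060", "0.8"), Tag("tagCategory_029", "0.7"))),
  Tuple1(Seq(Tag("tagCategory_029", "0.9")))
).toDF("category.firstCategory")

// explode turns each array element into its own row;
// the struct fields are then selected as top-level columns.
val flat = mblogTags
  .select(explode(col("`category.firstCategory`")).as("tag"))
  .select(col("tag.category"), col("tag.weight"))

flat.printSchema()
flat.show(false)
```

Flattening loses the grouping of tags per original row unless an id column is carried along, which matters if the goal is still one sparse vector per row.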




Re: How to iterate the element of an array in DataFrame?

2016-10-21 Thread Yan Facai
I don't know how to construct
`array<struct<category:string, weight:string>>`.
Could anyone help me?

I tried to get the array with:
scala> mblog_tags.map(_.getSeq[(String, String)](0))

but the result is:
res40: org.apache.spark.sql.Dataset[Seq[(String, String)]] = [value:
array<struct<_1:string,_2:string>>]


How do I express `struct`?
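
On the Scala side, a struct value read from a `Row` is itself a `Row`, so the array can be fetched with `getSeq[Row]` and converted to tuples explicitly. A minimal sketch, assuming a local session and made-up sample rows; `mblogTags`, `Tag` and `pairs` are illustrative names.

```scala
import org.apache.spark.sql.{Row, SparkSession}

// Assumption: a local session; in spark-shell this returns the existing one.
val spark = SparkSession.builder().master("local[1]").appName("getseq-row-sketch").getOrCreate()
import spark.implicits._

// Hypothetical sample data with the thread's schema.
case class Tag(category: String, weight: String)
val mblogTags = Seq(
  Tuple1(Seq(Tag("tagCategory_060", "0.8"), Tag("tagCategory_029", "0.7"))),
  Tuple1(Seq(Tag("tagCategory_029", "0.9")))
).toDF("tags")

// Each array element is a Row; read its fields by position
// and build the (category, weight) tuple by hand.
val pairs = mblogTags.map { row =>
  row.getSeq[Row](0).map(t => (t.getString(0), t.getString(1)))
}

pairs.show(false)
```

With `getSeq[(String, String)]` the type parameter is erased at runtime, so the elements are still `Row` objects even though the static type claims they are tuples.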



On Thu, Oct 20, 2016 at 4:34 PM, 颜发才(Yan Facai)  wrote:

> Hi, I want to extract the attribute `weight` of an array, and combine them
> to construct a sparse vector.
>
> ### My data is like this:
>
> scala> mblog_tags.printSchema
> root
>  |-- category.firstCategory: array (nullable = true)
>  |    |-- element: struct (containsNull = true)
>  |    |    |-- category: string (nullable = true)
>  |    |    |-- weight: string (nullable = true)
>
>
> scala> mblog_tags.show(false)
> +------------------------------------------------+
> |category.firstCategory                          |
> +------------------------------------------------+
> |[[tagCategory_060, 0.8], [tagCategory_029, 0.7]]|
> |[[tagCategory_029, 0.9]]                        |
> |[[tagCategory_029, 0.8]]                        |
> +------------------------------------------------+
>
>
> ### And expected:
> Vectors.sparse(100, Array(60, 29),  Array(0.8, 0.7))
> Vectors.sparse(100, Array(29),  Array(0.9))
> Vectors.sparse(100, Array(29),  Array(0.8))
>
> How do I iterate over an array in a DataFrame?
> Thanks.
>
>
>
>