Re: How to iterate the element of an array in DataFrame?

Yan Facai Mon, 24 Oct 2016 19:49:36 -0700

scala> mblog_tags.dtypes
res13: Array[(String, String)] =
Array((tags,ArrayType(StructType(StructField(category,StringType,true),
StructField(weight,StringType,true)),true)))


scala> val testUDF = udf{ s: Seq[Tags] => s(0).weight }
testUDF: org.apache.spark.sql.expressions.UserDefinedFunction =
UserDefinedFunction(<function1>,StringType,Some(List(ArrayType(StructType(StructField(category,StringType,true),
StructField(weight,StringType,true)),true))))

What is wrong with the UDF `testUDF`?
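
A likely explanation, assuming Spark 2.x behaviour: the UDF is accepted because its declared input type matches the column schema, but at run time Spark hands each struct to a Scala UDF as a Row (here a GenericRowWithSchema), not as an instance of the case class, so treating an element as Tags triggers the cast. A minimal sketch that takes Seq[Row] instead; the sample DataFrame and the name firstWeight are made up for illustration:

```scala
import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.functions.{col, udf}
import org.apache.spark.sql.types.{ArrayType, StringType, StructField, StructType}

// In spark-shell a session already exists and getOrCreate() returns it.
val spark = SparkSession.builder().master("local[*]").appName("udf-row-sketch").getOrCreate()

// Hypothetical stand-in for mblog_tags, with the same column type as in the thread.
val tagType = StructType(Seq(
  StructField("category", StringType),
  StructField("weight", StringType)))
val schema = StructType(Seq(StructField("tags", ArrayType(tagType))))
val rows = Seq(
  Row(Seq(Row("tagCategory_060", "0.8"), Row("tagCategory_029", "0.7"))),
  Row(Seq(Row("tagCategory_029", "0.9"))))
val mblog_tags = spark.createDataFrame(spark.sparkContext.parallelize(rows), schema)

// Array elements arrive as Row, so read the field by name.
// Option(...) guards a null array; headOption guards an empty one.
val firstWeight = udf { tags: Seq[Row] =>
  Option(tags).flatMap(_.headOption).map(_.getAs[String]("weight")).orNull
}

mblog_tags.withColumn("test", firstWeight(col("tags"))).show(false)
```

Reading the field with getAs by name relies on the rows carrying their schema, which the GenericRowWithSchema in the stack trace suggests they do.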





On Tue, Oct 25, 2016 at 10:41 AM, 颜发才(Yan Facai) <yaf...@gmail.com> wrote:

> Thanks, Cheng Lian.
>
> I tried using a case class:
>
> scala> case class Tags (category: String, weight: String)
>
> scala> val testUDF = udf{ s: Seq[Tags] => s(0).weight }
>
> testUDF: org.apache.spark.sql.expressions.UserDefinedFunction =
> UserDefinedFunction(<function1>,StringType,Some(List(ArrayType(StructType(
> StructField(category,StringType,true), StructField(weight,StringType,
> true)),true))))
>
>
> but it raises a ClassCastException when run:
>
> scala> mblog_tags.withColumn("test", testUDF(col("tags"))).show(false)
>
> 16/10/25 10:39:54 ERROR Executor: Exception in task 0.0 in stage 4.0 (TID
> 4)
> java.lang.ClassCastException: 
> org.apache.spark.sql.catalyst.expressions.GenericRowWithSchema
> cannot be cast to $line58.$read$$iw$$iw$Tags
>         at $line59.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$anonfun$1.
> apply(<console>:27)
>         at $line59.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$anonfun$1.
> apply(<console>:27)
> ...
>
>
> What did I do wrong?
>
>
>
>
> On Sat, Oct 22, 2016 at 6:37 AM, Cheng Lian <l...@databricks.com> wrote:
>
>> You may either use SQL function "array" and "named_struct" or define a
>> case class with expected field names.
>>
>> Cheng
>>
>> On 10/21/16 2:45 AM, 颜发才(Yan Facai) wrote:
>>
>> My expectation is:
>> root
>> |-- tag: vector
>>
>> namely, I want to extract from:
>> [[tagCategory_060, 0.8], [tagCategory_029, 0.7]]
>> to:
>> Vectors.sparse(100, Array(60, 29),  Array(0.8, 0.7))
>>
>> I believe it needs two steps:
>> 1. val tag2vec = {tag: Array[Structure] => Vector}
>> 2. mblog_tags.withColumn("vec", tag2vec(col("tag")))
>>
>> But I have no idea how to describe the Array[Structure] in the
>> DataFrame.
>>
>>
>>
>>
>>
>> On Fri, Oct 21, 2016 at 4:51 PM, lk_spark <lk_sp...@163.com> wrote:
>>
>>> How about changing the schema from
>>> root
>>>  |-- category.firstCategory: array (nullable = true)
>>>  |    |-- element: struct (containsNull = true)
>>>  |    |    |-- category: string (nullable = true)
>>>  |    |    |-- weight: string (nullable = true)
>>> to:
>>>
>>> root
>>>  |-- category: string (nullable = true)
>>>  |-- weight: string (nullable = true)
>>>
>>> 2016-10-21
>>> ------------------------------
>>> lk_spark
>>> ------------------------------
>>>
>>> *From:* 颜发才(Yan Facai) <yaf...@gmail.com>
>>> *Sent:* 2016-10-21 15:35
>>> *Subject:* Re: How to iterate the element of an array in DataFrame?
>>> *To:* "user.spark"<user@spark.apache.org>
>>> *Cc:*
>>>
>>> I don't know how to construct `array<struct<category:string,
>>> weight:string>>`.
>>> Could anyone help me?
>>>
>>> I tried to get the array with:
>>> scala> mblog_tags.map(_.getSeq[(String, String)](0))
>>>
>>> while the result is:
>>> res40: org.apache.spark.sql.Dataset[Seq[(String, String)]] = [value:
>>> array<struct<_1:string,_2:string>>]
>>>
>>>
>>> How do I express `struct<string, string>`?
>>>
>>>
>>>
>>> On Thu, Oct 20, 2016 at 4:34 PM, 颜发才(Yan Facai) <yaf...@gmail.com>
>>> wrote:
>>>
>>>> Hi, I want to extract the `weight` attribute from each element of an
>>>> array and combine the values into a sparse vector.
>>>>
>>>> ### My data is like this:
>>>>
>>>> scala> mblog_tags.printSchema
>>>> root
>>>>  |-- category.firstCategory: array (nullable = true)
>>>>  |    |-- element: struct (containsNull = true)
>>>>  |    |    |-- category: string (nullable = true)
>>>>  |    |    |-- weight: string (nullable = true)
>>>>
>>>>
>>>> scala> mblog_tags.show(false)
>>>> +--------------------------------------------------------------+
>>>> |category.firstCategory                                        |
>>>> +--------------------------------------------------------------+
>>>> |[[tagCategory_060, 0.8], [tagCategory_029, 0.7]]              |
>>>> |[[tagCategory_029, 0.9]]                                      |
>>>> |[[tagCategory_029, 0.8]]                                      |
>>>> +--------------------------------------------------------------+
>>>>
>>>>
>>>> ### And expected:
>>>> Vectors.sparse(100, Array(60, 29),  Array(0.8, 0.7))
>>>> Vectors.sparse(100, Array(29),  Array(0.9))
>>>> Vectors.sparse(100, Array(29),  Array(0.8))
>>>>
>>>> How do I iterate over an array in a DataFrame?
>>>> Thanks.
>>>>
>>>>
>>>>
>>>>
>>>
>>
>>
>
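
For the original goal of turning each array into a sparse vector, one possible sketch follows the same Seq[Row] approach. It assumes that every category has the form tagCategory_NNN, that NNN is the vector index, and that the vector size is 100 as in the expected output; toIndexWeight is a hypothetical helper, not part of Spark:

```scala
import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.sql.Row
import org.apache.spark.sql.functions.udf

// Assumption: "tagCategory_060" maps to index 60, and the weight string parses as a Double.
def toIndexWeight(category: String, weight: String): (Int, Double) =
  (category.stripPrefix("tagCategory_").toInt, weight.toDouble)

val tag2vec = udf { tags: Seq[Row] =>
  // Treat a null array as empty.
  val pairs = Option(tags).getOrElse(Seq.empty[Row]).map { r =>
    toIndexWeight(r.getAs[String]("category"), r.getAs[String]("weight"))
  }
  // This overload of Vectors.sparse sorts the pairs by index;
  // a category that appears twice in one array would fail its validation.
  Vectors.sparse(100, pairs)
}
```

It could then be applied as mblog_tags.withColumn("vec", tag2vec(col("tags"))), with col imported from org.apache.spark.sql.functions. Because the pairs are sorted, the first row would come out with indices (29, 60) rather than (60, 29); the three-argument Vectors.sparse requires indices in strictly increasing order.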
