scala> mblog_tags.dtypes
res13: Array[(String, String)] = Array((tags,ArrayType(StructType(StructField(category,StringType,true), StructField(weight,StringType,true)),true)))
scala> val testUDF = udf{ s: Seq[Tags] => s(0).weight }
testUDF: org.apache.spark.sql.expressions.UserDefinedFunction = UserDefinedFunction(<function1>,StringType,Some(List(ArrayType(StructType(StructField(category,StringType,true), StructField(weight,StringType,true)),true))))

What is wrong with the UDF `testUDF`?

On Tue, Oct 25, 2016 at 10:41 AM, 颜发才(Yan Facai) <yaf...@gmail.com> wrote:

> Thanks, Cheng Lian.
>
> I tried to use a case class:
>
> scala> case class Tags (category: String, weight: String)
>
> scala> val testUDF = udf{ s: Seq[Tags] => s(0).weight }
>
> testUDF: org.apache.spark.sql.expressions.UserDefinedFunction = UserDefinedFunction(<function1>,StringType,Some(List(ArrayType(StructType(StructField(category,StringType,true), StructField(weight,StringType,true)),true))))
>
> but it raises a ClassCastException when run:
>
> scala> mblog_tags.withColumn("test", testUDF(col("tags"))).show(false)
>
> 16/10/25 10:39:54 ERROR Executor: Exception in task 0.0 in stage 4.0 (TID 4)
> java.lang.ClassCastException: org.apache.spark.sql.catalyst.expressions.GenericRowWithSchema cannot be cast to $line58.$read$$iw$$iw$Tags
>     at $line59.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$anonfun$1.apply(<console>:27)
>     at $line59.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$anonfun$1.apply(<console>:27)
>     ...
>
> What did I do wrong?
>
> On Sat, Oct 22, 2016 at 6:37 AM, Cheng Lian <l...@databricks.com> wrote:
>
>> You may either use SQL function "array" and "named_struct" or define a
>> case class with expected field names.
>>
>> Cheng
>>
>> On 10/21/16 2:45 AM, 颜发才(Yan Facai) wrote:
>>
>> My expectation is:
>> root
>>  |-- tag: vector
>>
>> namely, I want to extract from:
>> [[tagCategory_060, 0.8], [tagCategory_029, 0.7]]
>> to:
>> Vectors.sparse(100, Array(60, 29), Array(0.8, 0.7))
>>
>> I believe it needs two steps:
>> 1. val tag2vec = {tag: Array[Structure] => Vector}
>> 2. mblog_tags.withColumn("vec", tag2vec(col("tag")))
>>
>> But I have no idea how to describe the Array[Structure] in the
>> DataFrame.
>>
>> On Fri, Oct 21, 2016 at 4:51 PM, lk_spark <lk_sp...@163.com> wrote:
>>
>>> How about changing the schema from
>>> root
>>>  |-- category.firstCategory: array (nullable = true)
>>>  |    |-- element: struct (containsNull = true)
>>>  |    |    |-- category: string (nullable = true)
>>>  |    |    |-- weight: string (nullable = true)
>>> to:
>>>
>>> root
>>>  |-- category: string (nullable = true)
>>>  |-- weight: string (nullable = true)
>>>
>>> 2016-10-21
>>> ------------------------------
>>> lk_spark
>>> ------------------------------
>>>
>>> *From:* 颜发才(Yan Facai) <yaf...@gmail.com>
>>> *Sent:* 2016-10-21 15:35
>>> *Subject:* Re: How to iterate the element of an array in DataFrame?
>>> *To:* "user.spark"<user@spark.apache.org>
>>> *Cc:*
>>>
>>> I don't know how to construct `array<struct<category:string, weight:string>>`.
>>> Could anyone help me?
>>>
>>> I tried to get the array with:
>>> scala> mblog_tags.map(_.getSeq[(String, String)](0))
>>>
>>> while the result is:
>>> res40: org.apache.spark.sql.Dataset[Seq[(String, String)]] = [value: array<struct<_1:string,_2:string>>]
>>>
>>> How to express `struct<string, string>` ?
>>>
>>> On Thu, Oct 20, 2016 at 4:34 PM, 颜发才(Yan Facai) <yaf...@gmail.com> wrote:
>>>
>>>> Hi, I want to extract the attribute `weight` of an array, and combine
>>>> them to construct a sparse vector.
>>>>
>>>> ### My data is like this:
>>>>
>>>> scala> mblog_tags.printSchema
>>>> root
>>>>  |-- category.firstCategory: array (nullable = true)
>>>>  |    |-- element: struct (containsNull = true)
>>>>  |    |    |-- category: string (nullable = true)
>>>>  |    |    |-- weight: string (nullable = true)
>>>>
>>>> scala> mblog_tags.show(false)
>>>> +--------------------------------------------------------------+
>>>> |category.firstCategory                                        |
>>>> +--------------------------------------------------------------+
>>>> |[[tagCategory_060, 0.8], [tagCategory_029, 0.7]]              |
>>>> |[[tagCategory_029, 0.9]]                                      |
>>>> |[[tagCategory_029, 0.8]]                                      |
>>>> +--------------------------------------------------------------+
>>>>
>>>> ### And expected:
>>>> Vectors.sparse(100, Array(60, 29), Array(0.8, 0.7))
>>>> Vectors.sparse(100, Array(29), Array(0.9))
>>>> Vectors.sparse(100, Array(29), Array(0.8))
>>>>
>>>> How to iterate an array in DataFrame?
>>>> Thanks.
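
A possible way around the ClassCastException reported at the top of this thread, offered as a sketch rather than a confirmed fix: Spark passes the elements of an `array<struct<...>>` column to a Scala UDF as `Row` objects, which is why the cast to the `Tags` case class fails at run time. Declaring the argument as `Seq[Row]` and reading fields by name avoids the cast. The helper names `tagIndex` and `toSparse`, the `tagCategory_` prefix rule, and the vector size of 100 are assumptions taken from the sample data in the thread, not from any Spark API.

```scala
import org.apache.spark.ml.linalg.{Vector, Vectors}
import org.apache.spark.sql.Row
import org.apache.spark.sql.functions.{col, udf}

// Hypothetical helper: "tagCategory_060" -> 60.
// Assumes every category has this prefix followed by digits.
def tagIndex(category: String): Int =
  category.stripPrefix("tagCategory_").toInt

// Hypothetical helper: build a sparse vector from (category, weight) string pairs.
// The Seq[(Int, Double)] overload of Vectors.sparse sorts the pairs by index.
def toSparse(tags: Seq[(String, String)], size: Int = 100): Vector =
  Vectors.sparse(size, tags.map { case (c, w) => (tagIndex(c), w.toDouble) })

// The UDF takes Seq[Row], not Seq[Tags]: struct elements arrive as Row.
// Null elements or null fields are not handled in this sketch.
val tag2vec = udf { tags: Seq[Row] =>
  toSparse(tags.map(r => (r.getAs[String]("category"), r.getAs[String]("weight"))))
}

// Usage, assuming the column is named "tags" as in the dtypes output above:
// mblog_tags.withColumn("vec", tag2vec(col("tags"))).show(false)
```

If the column is still named `category.firstCategory`, the dot would need backtick quoting, for example col("`category.firstCategory`").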