Thanks, Cheng Lian. I tried to use a case class:
scala> case class Tags (category: String, weight: String)

scala> val testUDF = udf{ s: Seq[Tags] => s(0).weight }
testUDF: org.apache.spark.sql.expressions.UserDefinedFunction = UserDefinedFunction(<function1>,StringType,Some(List(ArrayType(StructType(StructField(category,StringType,true), StructField(weight,StringType,true)),true))))

But it raises a ClassCastException when run:

scala> mblog_tags.withColumn("test", testUDF(col("tags"))).show(false)
16/10/25 10:39:54 ERROR Executor: Exception in task 0.0 in stage 4.0 (TID 4)
java.lang.ClassCastException: org.apache.spark.sql.catalyst.expressions.GenericRowWithSchema cannot be cast to $line58.$read$$iw$$iw$Tags
at $line59.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$anonfun$1.apply(<console>:27)
at $line59.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$anonfun$1.apply(<console>:27)
...

Where did I go wrong?

On Sat, Oct 22, 2016 at 6:37 AM, Cheng Lian <l...@databricks.com> wrote:

> You may either use the SQL functions "array" and "named_struct" or define a
> case class with the expected field names.
>
> Cheng
>
> On 10/21/16 2:45 AM, 颜发才(Yan Facai) wrote:
>
> My expectation is:
> root
>  |-- tag: vector
>
> namely, I want to extract from:
> [[tagCategory_060, 0.8], [tagCategory_029, 0.7]]
> to:
> Vectors.sparse(100, Array(60, 29), Array(0.8, 0.7))
>
> I believe it needs two steps:
> 1. val tag2vec = {tag: Array[Structure] => Vector}
> 2. mblog_tags.withColumn("vec", tag2vec(col("tag")))
>
> But I have no idea how to describe the Array[Structure] in the
> DataFrame.
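[Editor's note] The ClassCastException above is Spark's documented behavior for Scala UDFs: each element of an `array<struct<...>>` column is passed to the UDF as an `org.apache.spark.sql.Row`, not as a user-defined case class, so declaring the parameter as `Seq[Tags]` makes the generated code attempt an invalid cast. Declaring it as `Seq[Row]` and reading fields positionally or by name (e.g. `udf { s: Seq[Row] => s(0).getString(1) }` or `s(0).getAs[String]("weight")`) avoids the cast. A minimal pure-Scala sketch of the failure mode, where `GenericRow` is a hypothetical stand-in for Spark's `GenericRowWithSchema`:

```scala
// Hypothetical stand-in for Spark's GenericRowWithSchema: a struct value
// travels as a generic sequence of field values, not as the case class.
case class GenericRow(values: Seq[Any]) {
  def getString(i: Int): String = values(i).asInstanceOf[String]
}

case class Tags(category: String, weight: String)

val row: Any = GenericRow(Seq("tagCategory_060", "0.8"))

// What `s: Seq[Tags]` makes the UDF do under the hood: cast the raw row.
val castFailed =
  try { row.asInstanceOf[Tags]; false }
  catch { case _: ClassCastException => true }

// What `s: Seq[Row]` allows instead: positional (or named) field access.
val weight = row.asInstanceOf[GenericRow].getString(1)

println(castFailed) // true
println(weight)     // 0.8
```

The same positional access works in the real Spark shell because `Row.getString` reads the underlying value array directly, with no dependence on the REPL-generated class of `Tags`.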
>
> On Fri, Oct 21, 2016 at 4:51 PM, lk_spark <lk_sp...@163.com> wrote:
>
>> how about changing the schema from
>> root
>>  |-- category.firstCategory: array (nullable = true)
>>  |    |-- element: struct (containsNull = true)
>>  |    |    |-- category: string (nullable = true)
>>  |    |    |-- weight: string (nullable = true)
>> to:
>>
>> root
>>  |-- category: string (nullable = true)
>>  |-- weight: string (nullable = true)
>>
>> 2016-10-21
>> ------------------------------
>> lk_spark
>> ------------------------------
>>
>> *From:* 颜发才(Yan Facai) <yaf...@gmail.com>
>> *Date:* 2016-10-21 15:35
>> *Subject:* Re: How to iterate the element of an array in DataFrame?
>> *To:* "user.spark" <user@spark.apache.org>
>> *Cc:*
>>
>> I don't know how to construct `array<struct<category:string,
>> weight:string>>`.
>> Could anyone help me?
>>
>> I tried to get the array by:
>> scala> mblog_tags.map(_.getSeq[(String, String)](0))
>>
>> while the result is:
>> res40: org.apache.spark.sql.Dataset[Seq[(String, String)]] = [value:
>> array<struct<_1:string,_2:string>>]
>>
>> How to express `struct<string, string>`?
>>
>> On Thu, Oct 20, 2016 at 4:34 PM, 颜发才(Yan Facai) <yaf...@gmail.com> wrote:
>>
>>> Hi, I want to extract the attribute `weight` of an array, and combine
>>> them to construct a sparse vector.
>>>
>>> ### My data is like this:
>>>
>>> scala> mblog_tags.printSchema
>>> root
>>>  |-- category.firstCategory: array (nullable = true)
>>>  |    |-- element: struct (containsNull = true)
>>>  |    |    |-- category: string (nullable = true)
>>>  |    |    |-- weight: string (nullable = true)
>>>
>>> scala> mblog_tags.show(false)
>>> +------------------------------------------------+
>>> |category.firstCategory                          |
>>> +------------------------------------------------+
>>> |[[tagCategory_060, 0.8], [tagCategory_029, 0.7]]|
>>> |[[tagCategory_029, 0.9]]                        |
>>> |[[tagCategory_029, 0.8]]                        |
>>> +------------------------------------------------+
>>>
>>> ### And expected:
>>> Vectors.sparse(100, Array(60, 29), Array(0.8, 0.7))
>>> Vectors.sparse(100, Array(29), Array(0.9))
>>> Vectors.sparse(100, Array(29), Array(0.8))
>>>
>>> How to iterate an array in DataFrame?
>>> Thanks.
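[Editor's note] The transformation asked for in the original message, from `[[tagCategory_060, 0.8], ...]` to `Vectors.sparse(100, Array(60, 29), Array(0.8, 0.7))`, can be prototyped in plain Scala once the (category, weight) pairs have been pulled out of the rows. `tag2vec` below is a hypothetical helper, and the sparse vector is modeled as a `(size, indices, values)` triple rather than MLlib's `Vectors.sparse`, so the sketch runs without Spark:

```scala
// Hypothetical helper: turn (category, weight) string pairs like
// ("tagCategory_060", "0.8") into a sparse-vector triple
// (size, indices, values), mirroring Vectors.sparse(size, indices, values).
def tag2vec(tags: Seq[(String, String)],
            size: Int = 100): (Int, Array[Int], Array[Double]) = {
  // "tagCategory_060" -> 60: parse the digits after the last underscore.
  val indices = tags.map(_._1.split("_").last.toInt).toArray
  val values  = tags.map(_._2.toDouble).toArray
  (size, indices, values)
}

val (size, indices, values) =
  tag2vec(Seq(("tagCategory_060", "0.8"), ("tagCategory_029", "0.7")))

println(size)                  // 100
println(indices.mkString(",")) // 60,29
println(values.mkString(","))  // 0.8,0.7
```

In a real Spark job this logic would sit inside a UDF whose parameter is `Seq[Row]`, with each pair read via `getString(0)` / `getString(1)`, and the final triple replaced by `Vectors.sparse(size, indices, values)`.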