Re: How to iterate the element of an array in DataFrame?
scala> mblog_tags.dtypes
res13: Array[(String, String)] = Array((tags,ArrayType(StructType(StructField(category,StringType,true), StructField(weight,StringType,true)),true)))

scala> val testUDF = udf{ s: Seq[Tags] => s(0).weight }
testUDF: org.apache.spark.sql.expressions.UserDefinedFunction = UserDefinedFunction(<function1>,StringType,Some(List(ArrayType(StructType(StructField(category,StringType,true), StructField(weight,StringType,true)),true

What is wrong with the UDF `testUDF`?

On Tue, Oct 25, 2016 at 10:41 AM, 颜发才(Yan Facai) wrote:

> Thanks, Cheng Lian.
>
> I tried to use a case class:
>
> scala> case class Tags (category: String, weight: String)
>
> scala> val testUDF = udf{ s: Seq[Tags] => s(0).weight }
> testUDF: org.apache.spark.sql.expressions.UserDefinedFunction = UserDefinedFunction(<function1>,StringType,Some(List(ArrayType(StructType(StructField(category,StringType,true), StructField(weight,StringType,true)),true
>
> but it raises a ClassCastException when run:
>
> scala> mblog_tags.withColumn("test", testUDF(col("tags"))).show(false)
>
> 16/10/25 10:39:54 ERROR Executor: Exception in task 0.0 in stage 4.0 (TID 4)
> java.lang.ClassCastException: org.apache.spark.sql.catalyst.expressions.GenericRowWithSchema cannot be cast to $line58.$read$$iw$$iw$Tags
>     at $line59.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$anonfun$1.apply(<console>:27)
>     at $line59.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$anonfun$1.apply(<console>:27)
>     ...
>
> What did I do wrong?
Re: How to iterate the element of an array in DataFrame?
Thanks, Cheng Lian.

I tried to use a case class:

scala> case class Tags (category: String, weight: String)

scala> val testUDF = udf{ s: Seq[Tags] => s(0).weight }
testUDF: org.apache.spark.sql.expressions.UserDefinedFunction = UserDefinedFunction(<function1>,StringType,Some(List(ArrayType(StructType(StructField(category,StringType,true), StructField(weight,StringType,true)),true

but it raises a ClassCastException when run:

scala> mblog_tags.withColumn("test", testUDF(col("tags"))).show(false)

16/10/25 10:39:54 ERROR Executor: Exception in task 0.0 in stage 4.0 (TID 4)
java.lang.ClassCastException: org.apache.spark.sql.catalyst.expressions.GenericRowWithSchema cannot be cast to $line58.$read$$iw$$iw$Tags
    at $line59.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$anonfun$1.apply(<console>:27)
    at $line59.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$anonfun$1.apply(<console>:27)
    ...

What did I do wrong?

On Sat, Oct 22, 2016 at 6:37 AM, Cheng Lian wrote:

> You may either use the SQL functions "array" and "named_struct" or define a
> case class with the expected field names.
>
> Cheng
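The ClassCastException in this message is consistent with how Spark hands struct values to a Scala UDF: each element of an array-of-struct column arrives as a Row (here a GenericRowWithSchema), not as the user's case class, so a parameter declared as Seq[Tags] fails at the cast. Below is a minimal sketch of one workaround, declaring the parameter as Seq[Row]. The local session, the sample rows and the name `firstWeight` are illustrative assumptions, not part of the thread; the block is meant to be pasted into spark-shell.

```scala
import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.functions.{col, udf}

// Assumption: a local session for illustration; inside spark-shell this returns the existing one.
val spark = SparkSession.builder().master("local[1]").appName("seq-row-udf").getOrCreate()
import spark.implicits._

// Hypothetical sample data. The cast renames the tuple fields (_1, _2) to category and weight.
val mblogTags = Seq(
  Seq(("tagCategory_060", "0.8"), ("tagCategory_029", "0.7")),
  Seq(("tagCategory_029", "0.9"))
).toDF("tags")
  .withColumn("tags", col("tags").cast("array<struct<category:string,weight:string>>"))

// Each array element reaches the function as a Row, so the field is read by name.
// Note: head throws on an empty array; real data may need a guard.
val firstWeight = udf { tags: Seq[Row] => tags.head.getAs[String]("weight") }

val firstWeights = mblogTags
  .withColumn("test", firstWeight(col("tags")))
  .select("test").as[String].collect().toSeq
```

The case class is still useful for building or reading a typed Dataset; it is only the UDF parameter that has to be Seq[Row].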
Re: How to iterate the element of an array in DataFrame?
You may either use the SQL functions "array" and "named_struct" or define a case class with the expected field names.

Cheng

On 10/21/16 2:45 AM, 颜发才(Yan Facai) wrote:

> My expectation is:
>
> root
>  |-- tag: vector
>
> Namely, I want to go from:
>
> [[tagCategory_060, 0.8], [tagCategory_029, 0.7]]
>
> to:
>
> Vectors.sparse(100, Array(60, 29), Array(0.8, 0.7))
>
> I believe it needs two steps:
> 1. val tag2vec = {tag: Array[Structure] => Vector}
> 2. mblog_tags.withColumn("vec", tag2vec(col("tag")))
>
> But I have no idea how to describe the Array[Structure] in the DataFrame.
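One possible reading of the "array" and "named_struct" suggestion is to build the nested column directly with SQL expressions. The sketch below does that for a single hypothetical row; the session setup, the literal values and the names `tagsDf` and `tagsType` are assumptions made for illustration.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.expr

// Assumption: a local session for illustration; inside spark-shell this returns the existing one.
val spark = SparkSession.builder().master("local[1]").appName("named-struct").getOrCreate()

// One row whose "tags" column is assembled entirely from SQL functions:
// named_struct builds each struct, array wraps them into an array column.
val tagsDf = spark.range(1).select(
  expr("""array(
    named_struct('category', 'tagCategory_060', 'weight', '0.8'),
    named_struct('category', 'tagCategory_029', 'weight', '0.7')
  )""").as("tags")
)

tagsDf.printSchema()

// Compact type string of the new column, handy for comparing against the expected schema.
val tagsType = tagsDf.schema("tags").dataType.simpleString
```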
Re: Re: How to iterate the element of an array in DataFrame?
My expectation is:

root
 |-- tag: vector

Namely, I want to go from:

[[tagCategory_060, 0.8], [tagCategory_029, 0.7]]

to:

Vectors.sparse(100, Array(60, 29), Array(0.8, 0.7))

I believe it needs two steps:
1. val tag2vec = {tag: Array[Structure] => Vector}
2. mblog_tags.withColumn("vec", tag2vec(col("tag")))

But I have no idea how to describe the Array[Structure] in the DataFrame.

On Fri, Oct 21, 2016 at 4:51 PM, lk_spark wrote:

> How about changing the schema from
>
> root
>  |-- category.firstCategory: array (nullable = true)
>  |    |-- element: struct (containsNull = true)
>  |    |    |-- category: string (nullable = true)
>  |    |    |-- weight: string (nullable = true)
>
> to:
>
> root
>  |-- category: string (nullable = true)
>  |-- weight: string (nullable = true)
>
> 2016-10-21
> lk_spark
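The two steps described in this message could be sketched as follows. This is only one way to do it and rests on assumptions that the thread does not confirm: that the digits after "tagCategory_" are the vector index, that the vector size is 100, and that every weight parses as a Double. The sample rows and the session setup are illustrative as well.

```scala
import org.apache.spark.ml.linalg.{Vector, Vectors}
import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.functions.{col, udf}

// Assumption: a local session for illustration; inside spark-shell this returns the existing one.
val spark = SparkSession.builder().master("local[1]").appName("tag2vec").getOrCreate()
import spark.implicits._

// Hypothetical sample data with the same shape as the thread's column.
val mblogTags = Seq(
  Seq(("tagCategory_060", "0.8"), ("tagCategory_029", "0.7")),
  Seq(("tagCategory_029", "0.9"))
).toDF("tags")
  .withColumn("tags", col("tags").cast("array<struct<category:string,weight:string>>"))

// Step 1: a UDF from the array of structs (seen as Seq[Row]) to a sparse vector.
// Assumption: "tagCategory_060" maps to index 60 and the vector has 100 slots.
val tag2vec = udf { tags: Seq[Row] =>
  val entries = tags.map { t =>
    val index = t.getAs[String]("category").stripPrefix("tagCategory_").toInt
    (index, t.getAs[String]("weight").toDouble)
  }
  // This overload sorts the (index, value) pairs by index before building the vector.
  Vectors.sparse(100, entries)
}

// Step 2: apply it with withColumn.
val withVec = mblogTags.withColumn("vec", tag2vec(col("tags")))
withVec.show(false)
```

Because the pairs are sorted by index, the first row comes out with indices (29, 60) and values (0.7, 0.8) rather than in the order shown in the message; the resulting vector holds the same entries.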
Re: Re: How to iterate the element of an array in DataFrame?
How about changing the schema from

root
 |-- category.firstCategory: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- category: string (nullable = true)
 |    |    |-- weight: string (nullable = true)

to:

root
 |-- category: string (nullable = true)
 |-- weight: string (nullable = true)

2016-10-21
lk_spark

From: 颜发才(Yan Facai)
Sent: 2016-10-21 15:35
Subject: Re: How to iterate the element of an array in DataFrame?
To: "user.spark"
Cc:

> I don't know how to construct `array<struct<category:string,weight:string>>`.
> Could anyone help me?
>
> I tried to get the array with:
>
> scala> mblog_tags.map(_.getSeq[(String, String)](0))
>
> but the result is:
>
> res40: org.apache.spark.sql.Dataset[Seq[(String, String)]] = [value: array<struct<_1:string,_2:string>>]
>
> How do I express the `struct` element type?
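The schema change suggested here can be approximated with `explode`, which emits one row per array element. A minimal sketch follows, using a hypothetical column named `tags` and made-up sample rows; with the thread's real column name, which contains a dot, the reference would need backticks, as in col("`category.firstCategory`").

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, explode}

// Assumption: a local session for illustration; inside spark-shell this returns the existing one.
val spark = SparkSession.builder().master("local[1]").appName("explode-tags").getOrCreate()
import spark.implicits._

// Hypothetical sample data with the same shape as the thread's column.
val mblogTags = Seq(
  Seq(("tagCategory_060", "0.8"), ("tagCategory_029", "0.7")),
  Seq(("tagCategory_029", "0.9"))
).toDF("tags")
  .withColumn("tags", col("tags").cast("array<struct<category:string,weight:string>>"))

// explode yields one row per array element; "tag.*" then promotes the
// struct fields to top-level columns named category and weight.
val flat = mblogTags
  .select(explode(col("tags")).as("tag"))
  .select("tag.*")

flat.printSchema()
```

Flattening drops the information about which tags belonged to the same original row unless an identifier column is carried along in the first select.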
Re: How to iterate the element of an array in DataFrame?
I don't know how to construct `array<struct<category:string,weight:string>>`.
Could anyone help me?

I tried to get the array with:

scala> mblog_tags.map(_.getSeq[(String, String)](0))

but the result is:

res40: org.apache.spark.sql.Dataset[Seq[(String, String)]] = [value: array<struct<_1:string,_2:string>>]

How do I express the `struct` element type?

On Thu, Oct 20, 2016 at 4:34 PM, 颜发才(Yan Facai) wrote:

> Hi, I want to extract the `weight` attribute from each element of an array
> and combine the values into a sparse vector.
>
> ### My data is like this:
>
> scala> mblog_tags.printSchema
> root
>  |-- category.firstCategory: array (nullable = true)
>  |    |-- element: struct (containsNull = true)
>  |    |    |-- category: string (nullable = true)
>  |    |    |-- weight: string (nullable = true)
>
> scala> mblog_tags.show(false)
> +------------------------------------------------+
> |category.firstCategory                          |
> +------------------------------------------------+
> |[[tagCategory_060, 0.8], [tagCategory_029, 0.7]]|
> |[[tagCategory_029, 0.9]]                        |
> |[[tagCategory_029, 0.8]]                        |
> +------------------------------------------------+
>
> ### And expected:
>
> Vectors.sparse(100, Array(60, 29), Array(0.8, 0.7))
> Vectors.sparse(100, Array(29), Array(0.9))
> Vectors.sparse(100, Array(29), Array(0.8))
>
> How do I iterate over an array in a DataFrame?
> Thanks.
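On the question of how to express the struct type: in the Scala API an array of structs can be described with `ArrayType` and `StructType`, and on the driver side its elements are read back as `Row` values rather than tuples. The sketch below shows both under illustrative assumptions (a local session, a column named `tags`, made-up rows).

```scala
import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.types.{ArrayType, StringType, StructField, StructType}

// Assumption: a local session for illustration; inside spark-shell this returns the existing one.
val spark = SparkSession.builder().master("local[1]").appName("array-of-struct").getOrCreate()

// array<struct<category:string,weight:string>> expressed with the types API.
val tagType = ArrayType(StructType(Seq(
  StructField("category", StringType),
  StructField("weight", StringType)
)))
val schema = StructType(Seq(StructField("tags", tagType)))

// Hypothetical rows: each struct is a nested Row inside a Seq.
val rows = Seq(
  Row(Seq(Row("tagCategory_060", "0.8"), Row("tagCategory_029", "0.7"))),
  Row(Seq(Row("tagCategory_029", "0.9")))
)
val df = spark.createDataFrame(spark.sparkContext.parallelize(rows), schema)

// After collect(), each array element is a Row that can be read by field name.
val weights = df.collect().map(_.getSeq[Row](0).map(_.getAs[String]("weight")).toList).toList
```

Collecting is only reasonable for small data; inside a distributed job the same Seq[Row] view is what a Scala UDF receives for such a column.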