Re: SQL UserDefinedType can't be saved in parquet file when using assembly jar

2015-05-19 Thread Xiangrui Meng
Hey Jaonary,

I saw this line in the error message:

org.apache.spark.sql.types.DataType$CaseClassStringParser$.apply(dataTypes.scala:163)

CaseClassStringParser is only used in older versions of Spark to parse the
schema string stored in the Parquet metadata (newer versions serialize the
schema as JSON). So I suspect that the cluster was running an old version
of Spark when you used spark-submit to run your assembly jar.
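
If it helps, a quick way to confirm which version the cluster is actually
running is to print sc.version from inside the submitted job. A minimal
sketch (the app name is just a placeholder; it only relies on
SparkContext.version):

  import org.apache.spark.{SparkConf, SparkContext}

  object VersionCheck {
    def main(args: Array[String]): Unit = {
      val sc = new SparkContext(new SparkConf().setAppName("VersionCheck"))
      // Print the Spark version the job is actually running against at runtime.
      println(s"Spark version at runtime: ${sc.version}")
      sc.stop()
    }
  }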

Best,
Xiangrui


Re: SQL UserDefinedType can't be saved in parquet file when using assembly jar

2015-05-11 Thread Jaonary Rabarisoa
In this example, everything works except saving to the parquet file.


Re: SQL UserDefinedType can't be saved in parquet file when using assembly jar

2015-05-11 Thread Jaonary Rabarisoa
MyDenseVectorUDT does exist in the assembly jar, and in this example all the
code is in a single file to make sure everything is included.
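
For what it's worth, one way to double-check that from the jar itself is to
scan the assembly's entries for the class. A rough sketch; the jar path below
is only a placeholder for the real assembly path:

  import java.util.jar.JarFile
  import scala.collection.JavaConverters._

  object FindClassInJar {
    def main(args: Array[String]): Unit = {
      // Placeholder path; pass the real assembly jar as the first argument.
      val jarPath = args.headOption.getOrElse("target/scala-2.10/myapp-assembly.jar")
      val jar = new JarFile(jarPath)
      // Print every .class entry whose name mentions the UDT.
      jar.entries().asScala
        .map(_.getName)
        .filter(n => n.endsWith(".class") && n.contains("MyDenseVectorUDT"))
        .foreach(println)
      jar.close()
    }
  }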


Re: SQL UserDefinedType can't be saved in parquet file when using assembly jar

2015-04-20 Thread Xiangrui Meng
You should check where MyDenseVectorUDT is defined and whether it was
on the classpath (or in the assembly jar) at runtime. Make sure the
full class name (with package name) is used. Btw, UDTs are not public
yet, so please use them with caution. -Xiangrui
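
A rough way to check that at runtime is to load the UDT reflectively from
inside the job (or from a spark-shell launched the same way) and see where it
comes from. The name below assumes the class sits in the default package, as
in the code further down; adjust it to the real package-qualified name:

  // Sketch: verify the UDT class is visible on the runtime classpath.
  val udtClass = Class.forName("MyDenseVectorUDT")
  val source = Option(udtClass.getProtectionDomain.getCodeSource)
  println(s"Loaded ${udtClass.getName} from ${source.map(_.getLocation).getOrElse("unknown source")}")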

On Fri, Apr 17, 2015 at 12:45 AM, Jaonary Rabarisoa jaon...@gmail.com wrote:
 Dear all,

 Here is an example of code to reproduce the issue I mentioned in a previous
 mail about saving a UserDefinedType into a parquet file. The problem here
 is that the code works when I run it inside IntelliJ IDEA but fails when I
 create the assembly jar and run it with spark-submit. I use the master
 version of Spark.

 import org.apache.spark.{SparkConf, SparkContext}
 import org.apache.spark.sql.SQLContext
 import org.apache.spark.sql.types._

 @SQLUserDefinedType(udt = classOf[MyDenseVectorUDT])
 class MyDenseVector(val data: Array[Double]) extends Serializable {
   override def equals(other: Any): Boolean = other match {
     case v: MyDenseVector =>
       java.util.Arrays.equals(this.data, v.data)
     case _ => false
   }
 }

 class MyDenseVectorUDT extends UserDefinedType[MyDenseVector] {
   override def sqlType: DataType = ArrayType(DoubleType, containsNull = false)

   override def serialize(obj: Any): Seq[Double] = {
     obj match {
       case features: MyDenseVector =>
         features.data.toSeq
     }
   }

   override def deserialize(datum: Any): MyDenseVector = {
     datum match {
       case data: Seq[_] =>
         new MyDenseVector(data.asInstanceOf[Seq[Double]].toArray)
     }
   }

   override def userClass: Class[MyDenseVector] = classOf[MyDenseVector]
 }

 case class Toto(imageAnnotation: MyDenseVector)

 object TestUserDefinedType {

   case class Params(input: String = null,
                     partitions: Int = 12,
                     outputDir: String = "images.parquet")

   def main(args: Array[String]): Unit = {

     val conf = new SparkConf().setAppName("ImportImageFolder").setMaster("local[4]")

     val sc = new SparkContext(conf)
     val sqlContext = new SQLContext(sc)

     import sqlContext.implicits._

     val rawImages = sc.parallelize((1 to 5).map(x => Toto(new MyDenseVector(Array[Double](x.toDouble))))).toDF()

     rawImages.printSchema()

     rawImages.show()

     rawImages.save("toto.parquet") // This fails with assembly jar
     sc.stop()
   }
 }


 My build.sbt is as follows:

 libraryDependencies ++= Seq(
   "org.apache.spark" %% "spark-core" % sparkVersion % "provided",
   "org.apache.spark" %% "spark-sql" % sparkVersion,
   "org.apache.spark" %% "spark-mllib" % sparkVersion
 )

 assemblyMergeStrategy in assembly := {
   case PathList("javax", "servlet", xs @ _*) => MergeStrategy.first
   case PathList("org", "apache", xs @ _*) => MergeStrategy.first
   case PathList("org", "jboss", xs @ _*) => MergeStrategy.first
 //  case PathList(ps @ _*) if ps.last endsWith ".html" => MergeStrategy.first
 //  case "application.conf" => MergeStrategy.concat
   case m if m.startsWith("META-INF") => MergeStrategy.discard
   //case x =>
   //  val oldStrategy = (assemblyMergeStrategy in assembly).value
   //  oldStrategy(x)
   case _ => MergeStrategy.first
 }
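
 One note on the dependencies above: spark-sql and spark-mllib end up inside
 the assembly while spark-core is provided, so the fat jar can carry Spark SQL
 classes from a different version than the one the cluster runs. A sketch of
 the alternative, marking every Spark module as provided so that only the
 cluster's own Spark is used at runtime:

   libraryDependencies ++= Seq(
     "org.apache.spark" %% "spark-core"  % sparkVersion % "provided",
     "org.apache.spark" %% "spark-sql"   % sparkVersion % "provided",
     "org.apache.spark" %% "spark-mllib" % sparkVersion % "provided"
   )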


 As I said, this code works without problem when I execute it inside IntelliJ
 IDEA. But when I generate the assembly jar with sbt-assembly and
 use spark-submit, I get the following error:

 15/04/17 09:34:01 INFO ParquetOutputFormat: Writer version is: PARQUET_1_0
 15/04/17 09:34:01 ERROR Executor: Exception in task 3.0 in stage 2.0 (TID 7)
 java.lang.IllegalArgumentException: Unsupported dataType:
 {"type":"struct","fields":[{"name":"imageAnnotation","type":{"type":"udt","class":"MyDenseVectorUDT","pyClass":null,"sqlType":{"type":"array","elementType":"double","containsNull":false}},"nullable":true,"metadata":{}}]},
 [1.1] failure: `TimestampType' expected but `{' found

 {"type":"struct","fields":[{"name":"imageAnnotation","type":{"type":"udt","class":"MyDenseVectorUDT","pyClass":null,"sqlType":{"type":"array","elementType":"double","containsNull":false}},"nullable":true,"metadata":{}}]}
 ^
    at org.apache.spark.sql.types.DataType$CaseClassStringParser$.apply(dataTypes.scala:163)
    at org.apache.spark.sql.types.DataType$.fromCaseClassString(dataTypes.scala:98)
    at org.apache.spark.sql.parquet.ParquetTypesConverter$$anonfun$6.apply(ParquetTypes.scala:402)
    at org.apache.spark.sql.parquet.ParquetTypesConverter$$anonfun$6.apply(ParquetTypes.scala:402)
    at scala.util.Try.getOrElse(Try.scala:77)
    at org.apache.spark.sql.parquet.ParquetTypesConverter$.convertFromString(ParquetTypes.scala:402)
    at org.apache.spark.sql.parquet.RowWriteSupport.init(ParquetTableSupport.scala:145)
    at parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:278)
    at parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:252)
    at org.apache.spark.sql.parquet.ParquetRelation2.org$apache$spark$sql$parquet$ParquetRelation2$$writeShard$1(newParquet.scala:694)
    at org.apache.spark.sql.parquet.ParquetRelation2$$anonfun$insert$2.apply(newParquet.scala:716)
    at