Yes, we are going to expose the developer API. There was a long discussion in the PR: https://github.com/apache/spark/pull/3637. So we marked them package private and are looking for feedback on how to improve them. Please implement your classes under `spark.ml` for now and let us know your feedback. Thanks!

-Xiangrui
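For anyone following that workaround, here is a rough, untested sketch of a custom transformer declared under the org.apache.spark.ml package so that it can extend the package-private UnaryTransformer. The method signatures follow the Spark 1.2/1.3 sources and may differ in other releases (later versions drop the ParamMap argument from createTransformFunc); the import paths are the 1.3 layout; ImageDecoder and the decode helper are placeholder names, and ByteImage / ByteImageUDT refer to the types from the thread below.

    package org.apache.spark.ml.feature

    import org.apache.spark.ml.UnaryTransformer
    import org.apache.spark.ml.param.ParamMap
    import org.apache.spark.sql.types.DataType

    // Decodes a column of raw image bytes (e.g. JPEG) into ByteImage values.
    // Living under org.apache.spark.ml.* gives access to the package-private
    // UnaryTransformer until a public developer API is exposed.
    class ImageDecoder extends UnaryTransformer[Array[Byte], ByteImage, ImageDecoder] {

      // Per-row function applied to the input column.
      override protected def createTransformFunc(paramMap: ParamMap): Array[Byte] => ByteImage =
        bytes => ByteImage.decode(bytes)  // decode(...) is a placeholder for your JPEG decoding

      // SQL type of the output column: the UDT defined in the thread below.
      override protected def outputDataType: DataType = new ByteImageUDT
    }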
On Mon, Feb 23, 2015 at 8:10 AM, Jaonary Rabarisoa <jaon...@gmail.com> wrote:
> Hi Joseph,
>
> Thank you for your feedback. I've managed to define an image type by
> following the VectorUDT implementation.
>
> I have another question about the definition of a user-defined transformer.
> The unary transformer is private to spark.ml. Do you plan
> to give a developer API for transformers?
>
>
> On Sun, Jan 25, 2015 at 2:26 AM, Joseph Bradley <jos...@databricks.com> wrote:
>>
>> Hi Jao,
>>
>> You're right that defining serialize and deserialize is the main task in
>> implementing a UDT. They are basically translating between your native
>> representation (ByteImage) and SQL DataTypes. The sqlType you defined looks
>> correct, and you're correct to use a row of length 4. Other than that, it
>> should just require copying data to and from SQL Rows. There are quite a
>> few examples of that in the codebase; I'd recommend searching based on the
>> particular DataTypes you're using.
>>
>> Are there particular issues you're running into?
>>
>> Joseph
>>
>> On Mon, Jan 19, 2015 at 12:59 AM, Jaonary Rabarisoa <jaon...@gmail.com> wrote:
>>>
>>> Hi all,
>>>
>>> I'm trying to implement a pipeline for computer vision based on the
>>> latest ML package in Spark. The first step of my pipeline is to decode images
>>> (jpeg for instance) stored in a parquet file.
>>> For this, I started by creating a UserDefinedType that represents a decoded
>>> image stored in an array of bytes. Here is my first attempt:
>>>
>>> @SQLUserDefinedType(udt = classOf[ByteImageUDT])
>>> class ByteImage(channels: Int, width: Int, height: Int, data: Array[Byte])
>>>
>>> private[spark] class ByteImageUDT extends UserDefinedType[ByteImage] {
>>>
>>>   override def sqlType: StructType = {
>>>     // type: 0 = sparse, 1 = dense
>>>     // We only use "values" for dense vectors, and "size", "indices", and "values" for sparse
>>>     // vectors. The "values" field is nullable because we might want to add binary vectors later,
>>>     // which uses "size" and "indices", but not "values".
>>>     StructType(Seq(
>>>       StructField("channels", IntegerType, nullable = false),
>>>       StructField("width", IntegerType, nullable = false),
>>>       StructField("height", IntegerType, nullable = false),
>>>       StructField("data", BinaryType, nullable = false)))
>>>   }
>>>
>>>   override def serialize(obj: Any): Row = {
>>>     val row = new GenericMutableRow(4)
>>>     val img = obj.asInstanceOf[ByteImage]
>>>     ...
>>>   }
>>>
>>>   override def deserialize(datum: Any): Vector = {
>>>     ....
>>>   }
>>>
>>>   override def pyUDT: String = "pyspark.mllib.linalg.VectorUDT"
>>>
>>>   override def userClass: Class[Vector] = classOf[Vector]
>>> }
>>>
>>> I took the VectorUDT as a starting point, but there are a lot of things that
>>> I don't really understand. So any help on defining the serialize and deserialize
>>> methods would be appreciated.
>>>
>>> Best Regards,
>>>
>>> Jao
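To make Joseph's description concrete, below is a rough, untested sketch of the full UDT with the two elided methods filled in: serialize copies each ByteImage field into a slot of a 4-field Row, and deserialize reads them back in the same order. Import paths follow the Spark 1.3 layout (they differ slightly in 1.2), the constructor parameters are made vals so the fields are readable from serialize, userClass is pointed at ByteImage rather than Vector, and pyUDT is omitted since there is no Python counterpart here.

    import org.apache.spark.sql.Row
    import org.apache.spark.sql.catalyst.expressions.GenericMutableRow
    import org.apache.spark.sql.types._

    @SQLUserDefinedType(udt = classOf[ByteImageUDT])
    class ByteImage(val channels: Int, val width: Int, val height: Int, val data: Array[Byte])

    class ByteImageUDT extends UserDefinedType[ByteImage] {

      // One struct field per ByteImage constructor argument.
      override def sqlType: StructType = StructType(Seq(
        StructField("channels", IntegerType, nullable = false),
        StructField("width", IntegerType, nullable = false),
        StructField("height", IntegerType, nullable = false),
        StructField("data", BinaryType, nullable = false)))

      // Native object -> SQL Row: copy each field into its slot.
      override def serialize(obj: Any): Row = {
        val img = obj.asInstanceOf[ByteImage]
        val row = new GenericMutableRow(4)
        row.setInt(0, img.channels)
        row.setInt(1, img.width)
        row.setInt(2, img.height)
        row.update(3, img.data)
        row
      }

      // SQL Row -> native object: read the fields back in the same order.
      override def deserialize(datum: Any): ByteImage = datum match {
        case row: Row =>
          require(row.length == 4, s"ByteImageUDT expects a row of length 4, got ${row.length}")
          new ByteImage(
            row.getInt(0),
            row.getInt(1),
            row.getInt(2),
            row(3).asInstanceOf[Array[Byte]])
      }

      override def userClass: Class[ByteImage] = classOf[ByteImage]
    }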