Yes, we are going to expose the developer API. There was a long
discussion in the PR: https://github.com/apache/spark/pull/3637. For now
we have marked those APIs package private while we gather feedback on how
to improve them. Please implement your classes under `spark.ml` for now
and let us know your feedback. Thanks! -Xiangrui
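
For illustration, something along these lines might work once your code sits
under the spark.ml package. This is only a rough sketch: ImageDecoder and the
decodeJpeg helper are made-up names, ByteImage / ByteImageUDT are the classes
from the messages quoted below, and the UnaryTransformer signatures follow the
current 1.2/1.3-era API, which is still evolving, so exact signatures may differ.

package org.apache.spark.ml  // placed in spark.ml so it can see the package-private API

import org.apache.spark.ml.param.ParamMap          // imports assume the Spark 1.3 layout
import org.apache.spark.sql.types.{BinaryType, DataType}

// Hypothetical transformer that turns a column of raw JPEG bytes into ByteImage values.
class ImageDecoder extends UnaryTransformer[Array[Byte], ByteImage, ImageDecoder] {

  // The function applied to every value of the input column.
  override protected def createTransformFunc(paramMap: ParamMap): Array[Byte] => ByteImage =
    bytes => decodeJpeg(bytes)

  // The input column must hold raw bytes.
  override protected def validateInputType(inputType: DataType): Unit =
    require(inputType == BinaryType, s"Input type must be BinaryType but got $inputType.")

  // The output column uses the image UDT discussed below.
  override protected def outputDataType: DataType = new ByteImageUDT

  // Hypothetical helper: decode the JPEG with javax.imageio and keep the raw pixel
  // buffer (assumes a byte-backed raster such as TYPE_3BYTE_BGR).
  private def decodeJpeg(bytes: Array[Byte]): ByteImage = {
    val img = javax.imageio.ImageIO.read(new java.io.ByteArrayInputStream(bytes))
    val buffer = img.getRaster.getDataBuffer.asInstanceOf[java.awt.image.DataBufferByte]
    new ByteImage(img.getRaster.getNumBands, img.getWidth, img.getHeight, buffer.getData)
  }
}

Because UnaryTransformer is still package private, declaring your file in
org.apache.spark.ml is what makes the extends clause compile; once the developer
API is exposed this workaround should no longer be needed.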

On Mon, Feb 23, 2015 at 8:10 AM, Jaonary Rabarisoa <jaon...@gmail.com> wrote:
> Hi Joseph,
>
> Thank you for your feedback. I've managed to define an image type by
> following the VectorUDT implementation.
>
> I have another question about the definition of a user-defined transformer.
> The UnaryTransformer is private to spark.ml. Do you plan
> to provide a developer API for transformers?
>
>
>
> On Sun, Jan 25, 2015 at 2:26 AM, Joseph Bradley <jos...@databricks.com>
> wrote:
>>
>> Hi Jao,
>>
>> You're right that defining serialize and deserialize is the main task in
>> implementing a UDT.  They are basically translating between your native
>> representation (ByteImage) and SQL DataTypes.  The sqlType you defined looks
>> correct, and you're correct to use a row of length 4.  Other than that, it
>> should just require copying data to and from SQL Rows.  There are quite a
>> few examples of that in the codebase; I'd recommend searching based on the
>> particular DataTypes you're using.
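>>
>> Something roughly like this might work for those two methods (just a sketch,
>> assuming the ByteImage fields channels, width, height, and data are accessible
>> as vals, and using the current 1.2/1.3-era catalyst GenericMutableRow / Row API):
>>
>>   override def serialize(obj: Any): Row = {
>>     val img = obj.asInstanceOf[ByteImage]
>>     val row = new GenericMutableRow(4)
>>     // Copy each field into the slot matching its position in sqlType.
>>     row.setInt(0, img.channels)
>>     row.setInt(1, img.width)
>>     row.setInt(2, img.height)
>>     row.update(3, img.data)
>>     row
>>   }
>>
>>   override def deserialize(datum: Any): ByteImage = {
>>     datum match {
>>       case row: Row =>
>>         require(row.length == 4, s"Expected a row of length 4 but got ${row.length}")
>>         // Read the fields back in the same order they were written.
>>         new ByteImage(row.getInt(0), row.getInt(1), row.getInt(2),
>>           row(3).asInstanceOf[Array[Byte]])
>>     }
>>   }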
>>
>> Are there particular issues you're running into?
>>
>> Joseph
>>
>> On Mon, Jan 19, 2015 at 12:59 AM, Jaonary Rabarisoa <jaon...@gmail.com>
>> wrote:
>>>
>>> Hi all,
>>>
>>> I'm trying to implement a pipeline for computer vision based on the
>>> latest ML package in Spark. The first step of my pipeline is to decode images
>>> (JPEG, for instance) stored in a Parquet file.
>>> To do this, I started by creating a UserDefinedType that represents a decoded
>>> image stored in an array of bytes. Here is my first attempt:
>>>
>>>
>>> @SQLUserDefinedType(udt = classOf[ByteImageUDT])
>>> // Fields are vals so that the UDT's serialize method can read them.
>>> class ByteImage(val channels: Int, val width: Int, val height: Int,
>>>                 val data: Array[Byte])
>>>
>>>
>>> private[spark] class ByteImageUDT extends UserDefinedType[ByteImage] {
>>>
>>>   override def sqlType: StructType = {
>>>     // A decoded image is stored as a struct holding its dimensions plus
>>>     // the raw pixel bytes.
>>>     StructType(Seq(
>>>       StructField("channels", IntegerType, nullable = false),
>>>       StructField("width", IntegerType, nullable = false),
>>>       StructField("height", IntegerType, nullable = false),
>>>       StructField("data", BinaryType, nullable = false)))
>>>   }
>>>
>>>   override def serialize(obj: Any): Row = {
>>>
>>>     val row = new GenericMutableRow(4)
>>>     val img = obj.asInstanceOf[ByteImage]
>>>
>>>
>>> ...
>>>   }
>>>
>>>   override def deserialize(datum: Any): ByteImage = {
>>>
>>>
>>> ....
>>>
>>>
>>>     }
>>>   }
>>>
>>>   // No pyUDT override here: it is only needed if a matching Python UDT exists.
>>>
>>>   override def userClass: Class[ByteImage] = classOf[ByteImage]
>>> }
>>>
>>>
>>> I took VectorUDT as a starting point, but there are a lot of things that
>>> I don't really understand, so any help on defining the serialize and
>>> deserialize methods would be appreciated.
>>>
>>> Best Regards,
>>>
>>> Jao
>>
>>
>

---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org
