It’s pretty simple, really:
import com.fasterxml.jackson.databind.ObjectMapper
import org.apache.spark.ml.UnaryTransformer
import org.apache.spark.ml.util.Identifiable
import org.apache.spark.sql.types.{DataType, StringType}

/**
 * A SparkML Transformer that will transform an
 * entity of type T into a JSON-formatted string.
 * Created by Tristan Nixon <[email protected]> on 3/11/16.
 */
class JsonSerializationTransformer[T](override val uid: String)
  extends UnaryTransformer[T, String, JsonSerializationTransformer[T]]
{
  def this() = this(Identifiable.randomUID("JsonSerializationTransformer"))

  val mapper = new ObjectMapper
  // add additional mapper configuration code here, like this:
  // mapper.setAnnotationIntrospector(new JaxbAnnotationIntrospector)
  // or this:
  // mapper.getSerializationConfig.withFeatures(
  //   SerializationFeature.WRITE_DATES_AS_TIMESTAMPS )

  override protected def createTransformFunc: (T) => String =
    mapper.writeValueAsString

  override protected def outputDataType: DataType = StringType
}
and you would use it like any other transformer:
val jsontrans = new JsonSerializationTransformer[Document]
  .setInputCol("myEntityColumn")
  .setOutputCol("myOutputColumn")

val dfWithJson = jsontrans.transform( entityDF )
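One caveat: if your entity type is a Scala case class rather than a Java bean, the bare ObjectMapper may not know how to serialize it. One option (my assumption here, and it requires the jackson-module-scala dependency on the classpath) is to register the Scala module in the mapper configuration block above:

import com.fasterxml.jackson.module.scala.DefaultScalaModule

// assumes jackson-module-scala is available; teaches Jackson about
// Scala case classes, Options, collections, etc.
mapper.registerModule(DefaultScalaModule)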
Note that this implementation is for Jackson 2.x. If you want to use Jackson
1.x, it’s a bit trickier because the ObjectMapper class is not Serializable,
and so you need to initialize it per-partition rather than having it just be a
standard property.
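If you do end up in that situation, one possible workaround (just a sketch of my assumption, not something I have run against Jackson 1.x) is to declare the mapper as a @transient lazy val, so it is never shipped with the closure and each executor lazily builds its own instance:

// hypothetical variant; for Jackson 1.x the mapper import would be
// org.codehaus.jackson.map.ObjectMapper instead of the databind one
class JsonSerializationTransformerV1[T](override val uid: String)
  extends UnaryTransformer[T, String, JsonSerializationTransformerV1[T]]
{
  def this() = this(Identifiable.randomUID("JsonSerializationTransformerV1"))

  // not serialized with the transformer; rebuilt lazily on each executor
  @transient private lazy val mapper = new ObjectMapper

  override protected def createTransformFunc: (T) => String =
    (entity: T) => mapper.writeValueAsString(entity)

  override protected def outputDataType: DataType = StringType
}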
> On Mar 11, 2016, at 12:49 PM, Jacek Laskowski <[email protected]> wrote:
>
> Hi Tristan,
>
> Mind sharing the relevant code? I'd like to learn the way you use Transformer
> to do so. Thanks!
>
> Jacek
>
> On Mar 11, 2016, at 7:07 PM, Tristan Nixon <[email protected]> wrote:
> I have a similar situation in an app of mine. I implemented a custom ML
> Transformer that wraps the Jackson ObjectMapper - this gives you full control
> over how your custom entities / structs are serialized.
>
>> On Mar 11, 2016, at 11:53 AM, Caires Vinicius <[email protected]> wrote:
>>
>> Hmm. I think my problem is a little more complex. I'm using
>> https://github.com/databricks/spark-redshift and when I read from a JSON
>> file I get this schema:
>>
>> root
>>  |-- app: string (nullable = true)
>>  |-- ct: long (nullable = true)
>>  |-- event: struct (nullable = true)
>>  |    |-- attributes: struct (nullable = true)
>>  |    |    |-- account: string (nullable = true)
>>  |    |    |-- accountEmail: string (nullable = true)
>>  |    |    |-- accountId: string (nullable = true)
>>
>> I want to transform the event column into a String (formatted as JSON).
>>
>> I was trying to use a UDF, but without success.
>>
>> On Fri, Mar 11, 2016 at 1:53 PM Tristan Nixon <[email protected]> wrote:
>> Have you looked at DataFrame.write.json( path )?
>> https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.DataFrameWriter
>>
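>> A minimal usage sketch (the DataFrame name and output path below are just
>> placeholders), if you simply want the whole DataFrame written out as JSON files:
>>
>> df.write.json("/some/output/path")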
>>
>> > On Mar 11, 2016, at 7:15 AM, Caires Vinicius <[email protected]> wrote:
>> >
>> > I have a DataFrame with a nested StructField and I want to convert it to a
>> > JSON String. Is there any way to accomplish this?
>>
>