It’s pretty simple, really:
import com.fasterxml.jackson.databind.ObjectMapper
import org.apache.spark.ml.UnaryTransformer
import org.apache.spark.ml.util.Identifiable
import org.apache.spark.sql.types.{DataType, StringType}

/**
 * A SparkML Transformer that will transform an
 * entity of type T into a JSON-formatted string.
 * Created by Tristan Nixon <[email protected]> on 3/11/16.
 */
class JsonSerializationTransformer[T](override val uid: String)
  extends UnaryTransformer[T, String, JsonSerializationTransformer[T]]
{
  def this() = this(Identifiable.randomUID("JsonSerializationTransformer"))

  val mapper = new ObjectMapper
  // add additional mapper configuration code here, like this:
  // mapper.setAnnotationIntrospector(new JaxbAnnotationIntrospector)
  // or this:
  // mapper.getSerializationConfig.withFeatures(
  //   SerializationFeature.WRITE_DATES_AS_TIMESTAMPS )

  override protected def createTransformFunc: (T) => String =
    mapper.writeValueAsString

  override protected def outputDataType: DataType = StringType
}
and you would use it like any other transformer:
val jsontrans = new JsonSerializationTransformer[Document]
  .setInputCol("myEntityColumn")
  .setOutputCol("myOutputColumn")

val dfWithJson = jsontrans.transform( entityDF )
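One caveat: if your entity type is a Scala case class rather than a Java bean, the bare ObjectMapper may not know how to serialize it. One option (my assumption here, and it requires the jackson-module-scala dependency on the classpath) is to register the Scala module in the mapper configuration block above:

import com.fasterxml.jackson.module.scala.DefaultScalaModule

// assumes jackson-module-scala is available; teaches Jackson about
// Scala case classes, Options, collections, etc.
mapper.registerModule(DefaultScalaModule)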
Note that this implementation is for Jackson 2.x. If you want to use Jackson
1.x, it’s a bit trickier because the ObjectMapper class is not Serializable,
and so you need to initialize it per-partition rather than having it just be a
standard property.
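If you do end up in that situation, one possible workaround (just a sketch of my assumption, not something I have run against Jackson 1.x) is to declare the mapper as a @transient lazy val, so it is never shipped with the closure and each executor lazily builds its own instance:

// hypothetical variant; for Jackson 1.x the mapper import would be
// org.codehaus.jackson.map.ObjectMapper instead of the databind one
class JsonSerializationTransformerV1[T](override val uid: String)
  extends UnaryTransformer[T, String, JsonSerializationTransformerV1[T]]
{
  def this() = this(Identifiable.randomUID("JsonSerializationTransformerV1"))

  // not serialized with the transformer; rebuilt lazily on each executor
  @transient private lazy val mapper = new ObjectMapper

  override protected def createTransformFunc: (T) => String =
    (entity: T) => mapper.writeValueAsString(entity)

  override protected def outputDataType: DataType = StringType
}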
> On Mar 11, 2016, at 12:49 PM, Jacek Laskowski <[email protected]> wrote:
>
> Hi Tristan,
>
> Mind sharing the relevant code? I'd like to learn the way you use Transformer
> to do so. Thanks!
>
> Jacek
>
> On Mar 11, 2016, at 7:07 PM, Tristan Nixon <[email protected]> wrote:
> I have a similar situation in an app of mine. I implemented a custom ML
> Transformer that wraps the Jackson ObjectMapper - this gives you full control
> over how your custom entities / structs are serialized.
>
>> On Mar 11, 2016, at 11:53 AM, Caires Vinicius <[email protected]> wrote:
>>
>> Hmm. I think my problem is a little more complex. I'm using
>> https://github.com/databricks/spark-redshift and when I read from a JSON
>> file I get this schema:
>>
>> root
>>  |-- app: string (nullable = true)
>>  |-- ct: long (nullable = true)
>>  |-- event: struct (nullable = true)
>>  |    |-- attributes: struct (nullable = true)
>>  |    |    |-- account: string (nullable = true)
>>  |    |    |-- accountEmail: string (nullable = true)
>>  |    |    |-- accountId: string (nullable = true)
>>
>> I want to transform the event column into a String (formatted as JSON).
>>
>> I was trying to use a UDF, but without success.
>>
>> On Fri, Mar 11, 2016 at 1:53 PM Tristan Nixon <[email protected]> wrote:
>> Have you looked at DataFrame.write.json( path )?
>> https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.DataFrameWriter
>>
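>> A minimal usage sketch (the DataFrame name and output path below are just
>> placeholders), if you simply want the whole DataFrame written out as JSON files:
>>
>> df.write.json("/some/output/path")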
>>
>> > On Mar 11, 2016, at 7:15 AM, Caires Vinicius <[email protected]> wrote:
>> >
>> > I have a DataFrame with a nested StructField and I want to convert it to a
>> > JSON String. Is there any way to accomplish this?
>>
>