A simple workaround that seems to work (at least in local mode) is to mark the top-level pipeline object inside my simple interface as transient and add an initialize method. In the method that calls the pipeline and returns the results, I simply call the initialize method if needed (i.e., if the pipeline object is null). This seems reasonable to me. I will try it on an actual cluster next.
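A minimal sketch of what I mean, assuming a hypothetical SimplePipeline class and AnalysisResult type standing in for my actual pipeline interface (which I haven't shown here):

```scala
class PipelineWrapper extends Serializable {

  // @transient keeps Spark from trying to serialize the pipeline;
  // after deserialization on a worker, this field comes back as null.
  @transient private var pipeline: SimplePipeline = _

  private def initialize(): Unit = {
    // Rebuild the pipeline from scratch on whatever JVM we landed in.
    pipeline = new SimplePipeline()
  }

  def process(line: String): AnalysisResult = {
    if (pipeline == null) initialize() // lazily re-init on first use
    pipeline.run(line)
  }
}
```

The same effect can be had with `@transient lazy val pipeline = new SimplePipeline()`, which folds the null check into Scala's lazy initialization.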

Thanks,
Philip

On 10/22/2013 11:50 AM, Philip Ogren wrote:

I have a text analytics pipeline that performs a sequence of steps (e.g. tokenization, part-of-speech tagging, etc.) on a line of text. I have wrapped the whole pipeline up into a simple interface that allows me to call it from Scala as a POJO - i.e. I instantiate the pipeline, pass it a string, and get back some objects. Now, I would like to do the same thing for items in a Spark RDD via a map transformation. Unfortunately, my pipeline is not serializable, so I get a NotSerializableException when I try this. I played around with Kryo just now to see if that could help, and I ended up with a "missing no-arg constructor" exception on a class I have no control over. It seems the Spark framework expects that I should be able to serialize my pipeline when I can't (or at least don't think I can at first glance).

Is there a workaround for this scenario? I am imagining a few possible solutions that seem a bit dubious to me, so I thought I would ask for direction before wandering about. Perhaps a better understanding of serialization strategies might help me get the pipeline to serialize. Or perhaps there is a way to instantiate my pipeline on demand on the nodes through a factory call.
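To make the second idea concrete, here is a sketch of on-demand instantiation using mapPartitions, assuming a hypothetical Pipeline class with a no-arg constructor. Because the pipeline is constructed inside the closure, only the (trivially serializable) construction logic ships to the workers, never the pipeline itself:

```scala
val results = lines.mapPartitions { iter =>
  // Built once per partition, on the worker node - not on the driver.
  val pipeline = new Pipeline()
  iter.map(line => pipeline.run(line))
}
```

A per-partition instance also amortizes any expensive model loading across all the lines in that partition, rather than paying it per record.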

Any advice is appreciated.

Thanks,
Philip
