A simple workaround that seems to work (at least in local mode) is
to mark my top-level pipeline object (inside my simple interface) as
transient and add an initialize method. In the method that calls the
pipeline and returns the results, I simply call the initialize method
if needed (i.e., if the pipeline object is null). This seems reasonable
to me. I will try it on an actual cluster next.
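A minimal sketch of the workaround, assuming a hypothetical non-serializable Pipeline class wrapped in a serializable interface (the class names and the token-counting body are stand-ins, not my actual pipeline):

```scala
// Stand-in for the real, non-serializable analytics pipeline.
class Pipeline {
  def run(line: String): Int = line.split("\\s+").length // e.g. token count
}

class PipelineWrapper extends Serializable {
  // @transient: the field is skipped during serialization, so after the
  // wrapper is deserialized on a worker it comes back as null.
  @transient private var pipeline: Pipeline = _

  private def initialize(): Unit = { pipeline = new Pipeline }

  def process(line: String): Int = {
    if (pipeline == null) initialize() // lazily rebuild in each JVM
    pipeline.run(line)
  }
}
```

Because the pipeline field is transient, serializing the wrapper never touches the non-serializable Pipeline; each worker JVM just rebuilds its own copy on first use.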
Thanks,
Philip
On 10/22/2013 11:50 AM, Philip Ogren wrote:
I have a text analytics pipeline that performs a sequence of steps
(e.g. tokenization, part-of-speech tagging, etc.) on a line of text.
I have wrapped the whole pipeline up into a simple interface that
allows me to call it from Scala as a POJO - i.e. I instantiate the
pipeline, I pass it a string, and get back some objects. Now, I would
like to do the same thing for items in a Spark RDD via a map
transformation. Unfortunately, my pipeline is not serializable, so I
get a NotSerializableException when I try this. I played around with
Kryo just now to see if that could help, and I ended up with a
"missing no-arg constructor" exception on a class I have no control
over. It seems the Spark framework expects me to serialize my
pipeline, which I can't (or at least don't think I can at first
glance).
Is there a workaround for this scenario? I am imagining a few
possible solutions that seem a bit dubious to me, so I thought I would
ask for direction before wandering about. Perhaps a better
understanding of serialization strategies might help me get the
pipeline to serialize. Or perhaps there is a way to instantiate my
pipeline on demand on the nodes through a factory call.
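One shape the factory idea might take (a sketch only; PipelineFactory and the token-counting Pipeline are hypothetical, and a plain Iterator stands in for one RDD partition):

```scala
// Stand-in for the real, non-serializable pipeline.
class Pipeline {
  def run(line: String): Int = line.split("\\s+").length
}

// Hypothetical factory: a Scala object is initialized lazily and
// independently in each JVM, so it never needs to be serialized.
object PipelineFactory {
  lazy val pipeline: Pipeline = new Pipeline // built on first use per JVM
}

// With Spark this would sit inside a transformation, e.g.:
//   rdd.mapPartitions(lines => lines.map(PipelineFactory.pipeline.run))
// Here a plain iterator plays the role of one partition:
def processPartition(lines: Iterator[String]): Iterator[Int] =
  lines.map(PipelineFactory.pipeline.run)
```

The closure only captures a reference to the factory object, not the pipeline itself, so each node constructs the pipeline on demand.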
Any advice is appreciated.
Thanks,
Philip