A simple workaround that seems to work (at least in local mode) is
to mark my top-level pipeline object (inside my simple interface) as
transient and add an initialize method. In the method that calls the
pipeline and returns the results, I simply call the initialize method
if needed (i.e., if the pipeline object is null). This seems reasonable
to me. I will try it on an actual cluster next.
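A minimal sketch of the workaround, assuming a hypothetical non-serializable Pipeline class wrapped in a serializable interface (the class names and the token-counting body are stand-ins, not my actual pipeline):

```scala
// Stand-in for the real, non-serializable analytics pipeline.
class Pipeline {
  def run(line: String): Int = line.split("\\s+").length // e.g. token count
}

class PipelineWrapper extends Serializable {
  // @transient: the field is skipped during serialization, so after the
  // wrapper is deserialized on a worker it comes back as null.
  @transient private var pipeline: Pipeline = _

  private def initialize(): Unit = { pipeline = new Pipeline }

  def process(line: String): Int = {
    if (pipeline == null) initialize() // lazily rebuild in each JVM
    pipeline.run(line)
  }
}
```

Because the pipeline field is transient, serializing the wrapper never touches the non-serializable Pipeline; each worker JVM just rebuilds its own copy on first use.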
Thanks,
Philip
On 10/22/2013 11:50 AM, Philip Ogren wrote:
I have a text analytics pipeline that performs a sequence of steps
(e.g. tokenization, part-of-speech tagging, etc.) on a line of text.
I have wrapped the whole pipeline up into a simple interface that
allows me to call it from Scala as a POJO - i.e. I instantiate the
pipeline, I pass it a string, and get back some objects. Now, I would
like to do the same thing for items in a Spark RDD via a map
transformation. Unfortunately, my pipeline is not serializable, so I
get a NotSerializableException when I try this. I played around with
Kryo just now to see if that could help, and I ended up with a
"missing no-arg constructor" exception on a class I have no control
over. It seems the Spark framework expects me to serialize my
pipeline, which I can't (or at least don't think I can at first
glance).
Is there a workaround for this scenario? I am imagining a few
possible solutions that seem a bit dubious to me, so I thought I would
ask for direction before wandering about. Perhaps a better
understanding of serialization strategies might help me get the
pipeline to serialize. Or perhaps there is a way to instantiate my
pipeline on demand on the nodes through a factory call.
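One shape the factory idea might take (a sketch only; PipelineFactory and the token-counting Pipeline are hypothetical, and a plain Iterator stands in for one RDD partition):

```scala
// Stand-in for the real, non-serializable pipeline.
class Pipeline {
  def run(line: String): Int = line.split("\\s+").length
}

// Hypothetical factory: a Scala object is initialized lazily and
// independently in each JVM, so it never needs to be serialized.
object PipelineFactory {
  lazy val pipeline: Pipeline = new Pipeline // built on first use per JVM
}

// With Spark this would sit inside a transformation, e.g.:
//   rdd.mapPartitions(lines => lines.map(PipelineFactory.pipeline.run))
// Here a plain iterator plays the role of one partition:
def processPartition(lines: Iterator[String]): Iterator[Int] =
  lines.map(PipelineFactory.pipeline.run)
```

The closure only captures a reference to the factory object, not the pipeline itself, so each node constructs the pipeline on demand.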
Any advice is appreciated.
Thanks,
Philip