Re: unable to serialize analytics pipeline

2013-10-22 Thread Mark Hamstra
If you distribute the needed jar(s) to your Workers, you may well be able
to instantiate what you need using mapPartitions, mapPartitionsWithIndex,
mapWith, flatMapWith, etc.  Be careful, though, to tear down any resources
you allocate within each partition.
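
A minimal sketch of that pattern, assuming a hypothetical TextPipeline
with annotate() and close() methods standing in for the real pipeline:

    import org.apache.spark.SparkContext

    // Hypothetical non-serializable pipeline; a stand-in for the real wrapper.
    class TextPipeline {
      def annotate(line: String): String = line.toLowerCase  // placeholder work
      def close(): Unit = ()                                 // placeholder teardown
    }

    object PipelineDemo {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext("local", "pipeline-demo")
        val lines = sc.textFile("input.txt")

        val annotated = lines.mapPartitions { iter =>
          // Constructed on the worker, inside the task, so it is never serialized.
          val pipeline = new TextPipeline()
          try {
            // Iterator.map is lazy, so force it before closing the pipeline.
            iter.map(pipeline.annotate).toList.iterator
          } finally {
            pipeline.close()  // the per-partition teardown to be careful about
          }
        }

        annotated.saveAsTextFile("output")
        sc.stop()
      }
    }

Note that the toList is needed because Iterator.map is lazy -- without it
the pipeline would be closed before any lines were processed.  The tradeoff
is holding one partition's results in memory at a time.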



On Tue, Oct 22, 2013 at 10:50 AM, Philip Ogren philip.og...@oracle.com wrote:


 I have a text analytics pipeline that performs a sequence of steps (e.g.
 tokenization, part-of-speech tagging, etc.) on a line of text.  I have
 wrapped the whole pipeline up into a simple interface that allows me to
 call it from Scala as a POJO - i.e. I instantiate the pipeline, I pass it a
 string, and get back some objects.  Now, I would like to do the same thing
 for items in a Spark RDD via a map transformation.  Unfortunately, my
 pipeline is not serializable and so I get a NotSerializableException when I
 try this.  I played around with Kryo just now to see if that could help and
 I ended up with a missing no-arg constructor exception on a class I have
 no control over.  It seems the Spark framework expects that I should be
 able to serialize my pipeline when I can't (or at least don't think I can
 at first glance).
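
 Roughly what I am attempting (Pipeline and process here are stand-ins for
 my actual wrapper, not its real names):

     val pipeline = new Pipeline()                 // built on the driver
     val annotations = lines.map(pipeline.process) // closure captures pipeline
     // fails with java.io.NotSerializableException, because the captured
     // pipeline must be serialized and shipped to the workers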

 Is there a workaround for this scenario?  I am imagining a few possible
 solutions that seem a bit dubious to me, so I thought I would ask for
 direction before wandering about.  Perhaps a better understanding of
 serialization strategies might help me get the pipeline to serialize.  Or
 perhaps there is a way to instantiate my pipeline on demand on the nodes
 through a factory call.

 Any advice is appreciated.

 Thanks,
 Philip



Re: unable to serialize analytics pipeline

2013-10-22 Thread Philip Ogren
A simple workaround that seems to work (at least in local mode) is 
to mark my top-level pipeline object (inside my simple interface) as 
transient and add an initialize method.  In the method that calls the 
pipeline and returns the results, I simply call the initialize method if 
needed (i.e. if the pipeline object is null).  This seems reasonable to 
me.  I will try it on an actual cluster next.
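
In code, the shape of it is roughly this (Pipeline and annotate are
stand-ins for my actual pipeline class and its entry point):

    class PipelineWrapper extends Serializable {
      // The heavyweight, non-serializable pipeline is excluded from
      // serialization and rebuilt lazily on each worker.
      @transient private var pipeline: Pipeline = null

      private def initialize(): Unit = {
        pipeline = new Pipeline()
      }

      def process(line: String): String = {
        if (pipeline == null) initialize()  // null after deserialization
        pipeline.annotate(line)
      }
    }

    val wrapper = new PipelineWrapper()
    val annotated = lines.map(wrapper.process)  // wrapper serializes fine

One caveat: if multiple tasks run in the same JVM, the null check is not
synchronized, so a worker may build more than one pipeline instance.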


Thanks,
Philip
