Re: unable to serialize analytics pipeline
If you distribute the needed jar(s) to your workers, you may well be able to instantiate what you need using mapPartitions, mapPartitionsWithIndex, mapWith, flatMapWith, etc. Be careful, though, about teardown of any resources you allocate within each partition.

On Tue, Oct 22, 2013 at 10:50 AM, Philip Ogren philip.og...@oracle.com wrote:

> I have a text analytics pipeline that performs a sequence of steps (e.g. tokenization, part-of-speech tagging, etc.) on a line of text. I have wrapped the whole pipeline in a simple interface that lets me call it from Scala as a POJO: I instantiate the pipeline, pass it a string, and get back some objects. Now I would like to do the same thing for items in a Spark RDD via a map transformation. Unfortunately, my pipeline is not serializable, so I get a NotSerializableException when I try this. I played around with Kryo just now to see if that could help, and I ended up with a missing no-arg constructor exception on a class I have no control over. It seems the Spark framework expects that I should be able to serialize my pipeline when I can't (or at least don't think I can at first glance). Is there a workaround for this scenario? I am imagining a few possible solutions that seem a bit dubious to me, so I thought I would ask for direction before wandering about. Perhaps a better understanding of serialization strategies might help me get the pipeline to serialize. Or perhaps there is a way to instantiate my pipeline on demand on the nodes through a factory call. Any advice is appreciated.
>
> Thanks,
> Philip
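A minimal sketch of that per-partition approach, assuming a hypothetical non-serializable Pipeline class (a stand-in for the real analytics pipeline, with a whitespace tokenizer as a dummy processing step):

```scala
// Hypothetical stand-in for the real, non-serializable analytics pipeline.
class Pipeline {
  def process(line: String): Seq[String] = line.split("\\s+").toSeq
  def close(): Unit = ()  // stand-in for releasing models, handles, etc.
}

// Construct the pipeline inside the partition function, so it is created on
// the worker JVM and never serialized from the driver.
def processPartition(lines: Iterator[String]): Iterator[Seq[String]] = {
  val pipeline = new Pipeline()
  // Materialize the results before teardown: iterators are lazy, and closing
  // the pipeline before the iterator is consumed would break the computation.
  try lines.map(pipeline.process).toList.iterator
  finally pipeline.close()
}

// On a cluster this would be applied as: rdd.mapPartitions(processPartition)
```

Materializing with toList trades streaming for safe teardown; if partitions are large, another option is to skip close() here and rely on the pipeline's own lifecycle.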
Re: unable to serialize analytics pipeline
A simple workaround that seems to work (at least in local mode) is to mark my top-level pipeline object (inside my simple interface) as transient and add an initialize method. In the method that calls the pipeline and returns the results, I simply call the initialize method if needed (i.e. if the pipeline object is null). This seems reasonable to me. I will try it on an actual cluster next.

Thanks,
Philip

On 10/22/2013 11:50 AM, Philip Ogren wrote:

> I have a text analytics pipeline that performs a sequence of steps (e.g. tokenization, part-of-speech tagging, etc.) on a line of text. [snip]
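The workaround described above might be sketched like this, again with a hypothetical Pipeline class standing in for the real one and a hypothetical PipelineWrapper as the "simple interface":

```scala
// Hypothetical stand-in for the real, non-serializable pipeline class.
class Pipeline {
  def process(line: String): Seq[String] = line.split("\\s+").toSeq
}

class PipelineWrapper extends Serializable {
  // @transient: Java serialization skips this field, so the wrapper can be
  // shipped to workers without a NotSerializableException; the field arrives
  // on each worker as null.
  @transient private var pipeline: Pipeline = _

  private def initialize(): Unit = {
    if (pipeline == null) pipeline = new Pipeline()
  }

  def process(line: String): Seq[String] = {
    initialize()  // re-create the pipeline on first use in each JVM
    pipeline.process(line)
  }
}
```

In Scala the same effect can often be had more tersely with `@transient lazy val pipeline = new Pipeline()`, which re-initializes the field on first access after deserialization without an explicit null check.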