I have a text analytics pipeline that performs a sequence of steps (e.g. tokenization, part-of-speech tagging, etc.) on a line of text. I have wrapped the whole pipeline up in a simple interface that lets me call it from Scala as a POJO - i.e. I instantiate the pipeline, pass it a string, and get back some objects. Now I would like to do the same thing for items in a Spark RDD via a map transformation. Unfortunately, my pipeline is not serializable, so I get a NotSerializableException when I try this. I also experimented with Kryo to see if that could help, and ended up with a "missing no-arg constructor" exception on a class I have no control over. It seems the Spark framework expects me to be able to serialize my pipeline, which I can't (or at least don't think I can at first glance).
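Roughly, the failing pattern looks like this (`TextPipeline` is an illustrative stand-in for my real class, which comes from a library I don't control):

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Stand-in for my pipeline; the real class is NOT serializable.
class TextPipeline {
  def process(line: String): Seq[String] = line.split("\\s+").toSeq // placeholder
}

val sc = new SparkContext(new SparkConf().setAppName("pipeline-demo"))
val pipeline = new TextPipeline() // instantiated on the driver

val lines = sc.textFile("hdfs:///path/to/corpus.txt")

// Spark pulls `pipeline` into the task closure and tries to serialize it,
// which is where the NotSerializableException is thrown.
val results = lines.map(line => pipeline.process(line))
```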
Is there a workaround for this scenario? I can imagine a few possible solutions, but they all seem a bit dubious to me, so I thought I would ask for direction before wandering off on my own. Perhaps a better understanding of serialization strategies would help me get the pipeline to serialize after all. Or perhaps there is a way to instantiate the pipeline on demand on the worker nodes through some kind of factory call.
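Concretely, the factory idea I have in mind would look something like the sketch below - construct the pipeline on each worker instead of shipping it from the driver (again, `TextPipeline` is a stand-in, and I'm not sure this is the idiomatic way to do it in Spark):

```scala
// Build one pipeline per partition on the worker itself, so nothing
// non-serializable ever crosses the wire. Only the closure's code is shipped.
val results = lines.mapPartitions { iter =>
  val pipeline = new TextPipeline() // constructed locally on the executor
  iter.map(line => pipeline.process(line))
}
```

Is something like this the recommended approach, or is there a cleaner pattern (e.g. a lazily initialized singleton per JVM) for non-serializable resources?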
Any advice is appreciated. Thanks, Philip
