Hi Steve!

I have several pipelines that successfully use both NumPy and scikit models without any problems. I don't think I use Pandas at the moment, but I'm sure that is fine too.
However, you might have to do some extra work if you run into serializability problems. I also have TensorFlow models in use, which were a bit trickier to get working because of the serialization problems you mention. For those I needed to load one model instance per thread using threading.local, as is done here: https://github.com/tensorflow/transform/blob/master/tensorflow_transform/beam/impl.py (I realize that this file has evolved a bit since I last looked at it. It might be worth looking at an older version of the file, as it's quite advanced now.)

So, when serializability is not possible, you can still initialize objects locally in each thread and let bundles that are executed in the same thread use the locally instantiated objects, instead of sharing one instantiation across all bundles and threads.

Br,
Vilhelm

On 29 Sep 2017 17:17, "Steven DeLaurentis" <[email protected]> wrote:

Hi everyone,

Came across this interesting project recently. Read through some of the docs and still had a question: is it possible to use NumPy/Pandas in the DoFn of a Beam? Or does the requirement of a serializable function preclude this possibility?

Thanks,
Steve
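
For reference, the per-thread loading idea described in the reply can be sketched roughly as follows. This is only an illustration under stated assumptions, not the code from tensorflow_transform: FakeModel, PredictFn and the "model-dir" path are hypothetical names, and PredictFn only mimics the shape of a Beam DoFn (a process method) so that the sketch runs without Beam installed. The point is that the function object itself carries only a plain string, while the unserializable model is created lazily inside each worker thread.

```python
import threading


class FakeModel:
    """Hypothetical stand-in for a model object that cannot be serialized."""

    def __init__(self, path):
        self.path = path
        # A lock is a typical example of state that cannot be pickled.
        self._lock = threading.Lock()

    def predict(self, x):
        return x * 2


class PredictFn:
    """Shaped like a Beam DoFn, but deliberately independent of Beam.

    Only `model_path` lives on the instance, so the instance state stays
    trivially serializable. The model itself is kept in thread-local storage.
    """

    # Class-level thread-local storage: each thread sees its own attributes,
    # and class attributes are not part of the pickled instance state.
    _thread_state = threading.local()

    def __init__(self, model_path):
        self.model_path = model_path

    def _model(self):
        state = PredictFn._thread_state
        models = getattr(state, "models", None)
        if models is None:
            models = state.models = {}
        # Load at most one model per thread and per path.
        if self.model_path not in models:
            models[self.model_path] = FakeModel(self.model_path)
        return models[self.model_path]

    def process(self, element):
        # Every call on the same thread reuses that thread's model.
        yield self._model().predict(element)


if __name__ == "__main__":
    fn = PredictFn("model-dir")  # hypothetical path
    print(list(fn.process(3)))
```

Bundles that run on the same thread share one FakeModel, while a different thread builds its own copy the first time it calls process.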
