Hi Steve!
I have several pipelines that successfully use both NumPy and scikit-learn
models without any problems. I don't think I use Pandas at the moment, but
I'm sure that is fine too.

However, you might have to do some extra work if you run into
serializability problems. I also have TensorFlow models in use, which were
a bit trickier to get working because of the serialization problems you
mention. For those I needed to load one model instance per thread using
threading.local, as is done here:

https://github.com/tensorflow/transform/blob/master/tensorflow_transform/beam/impl.py

(I realize that this file has evolved a bit since I last looked at it.
It might be worth looking at an older version of the file, as it's quite
advanced now.)

So, when serializability is not possible, you can still initialize objects
locally in each thread and let bundles that are executed in the same thread
use the locally instantiated objects, instead of sharing one instantiation
across all bundles and threads.
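
As a rough illustration of that per-thread pattern, here is a minimal sketch.
It is not the code from the linked file. FakeModel, get_model and PredictFn
are hypothetical names made up for this example, and PredictFn is a plain
class so the snippet runs without Beam installed; in a real pipeline it
would subclass beam.DoFn and FakeModel would be replaced by whatever loads
your actual model.

```python
import threading

# Module-level thread-local storage: each worker thread sees its own
# attributes on this object, so nothing here is shared across threads.
_thread_state = threading.local()


class FakeModel(object):
    """Hypothetical stand-in for a model that cannot be serialized."""

    def __init__(self):
        # Remember which thread built this instance (for illustration only).
        self.owner_thread = threading.current_thread().name

    def predict(self, x):
        return x * 2


def get_model():
    """Return the calling thread's model, creating it on first use."""
    model = getattr(_thread_state, "model", None)
    if model is None:
        model = FakeModel()
        _thread_state.model = model
    return model


class PredictFn(object):  # assumption: would be beam.DoFn in a real pipeline
    """Holds no model itself, so only this small class has to be pickled."""

    def process(self, element):
        # The model is looked up per thread at call time, never stored on self.
        yield get_model().predict(element)
```

Because the model lives in module-level thread-local state rather than on
the function object, it never has to be serialized; each thread simply pays
the loading cost once and reuses the instance for later bundles.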

Br,
Vilhelm

On 29 Sep 2017 17:17, "Steven DeLaurentis" <[email protected]> wrote:

Hi everyone,

Came across this interesting project recently. Read through some of the
docs and still had a question: is it possible to use NumPy/Pandas in the
DoFn of a Beam? Or does the requirement of a serializable function preclude
this possibility?

Thanks,
Steve
