Hi!
Is there any way to do stateful processing in the Python Beam SDK?

I am trying to train an LSHForest for approximate nearest neighbor search.
Using the scikit-learn implementation it is possible to do partial fits, so
I can gather up mini-batches and fit the model on those in sequence using a
ParDo. However, to my understanding, there is no way for me to control how
many bundles the ParDo will execute over, so the training makes little
sense: I will end up with a lot of different models rather than one.
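To make the problem concrete, here is a minimal sketch of the DoFn I have in
mind (the batch size of 1000 and the class name are my own illustrative
choices, and it assumes a scikit-learn version that still ships LSHForest):

    import apache_beam as beam
    from apache_beam.transforms.window import GlobalWindow
    from apache_beam.utils.windowed_value import WindowedValue
    from sklearn.neighbors import LSHForest

    class PartialFitDoFn(beam.DoFn):
        """Partial-fits a model on mini-batches, but only within one bundle."""

        def start_bundle(self):
            # Fresh state for every bundle -- this is exactly the problem:
            # each bundle trains its own independent model.
            self._model = LSHForest()
            self._batch = []

        def process(self, element):
            self._batch.append(element)
            if len(self._batch) >= 1000:  # illustrative batch size
                self._model.partial_fit(self._batch)
                self._batch = []

        def finish_bundle(self):
            if self._batch:
                self._model.partial_fit(self._batch)
            # One trained model per bundle is emitted, so a run with many
            # bundles yields many models instead of one.
            yield WindowedValue(
                self._model, GlobalWindow().max_timestamp(), [GlobalWindow()])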

Another approach would be to create a CombineFn that accumulates values by
training the model on them, but there is no intuitive way to combine models
in `merge_accumulators`, so I don't think that will fit either.
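For reference, the shape of that CombineFn would be something like the
sketch below; the NotImplementedError marks the part I cannot see how to
write:

    import apache_beam as beam
    from sklearn.neighbors import LSHForest

    class LSHForestCombineFn(beam.CombineFn):
        def create_accumulator(self):
            return LSHForest()

        def add_input(self, model, element):
            # Incrementally extend the accumulator model with one element.
            model.partial_fit([element])
            return model

        def merge_accumulators(self, models):
            # Two independently trained LSHForests use different hash
            # functions, so I see no principled way to merge them.
            raise NotImplementedError('cannot merge LSHForest models')

        def extract_output(self, model):
            return model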

Does it make sense to pass the whole PCollection as a list in a side input
and train the model that way? In that case, how should I chop the
PCollection into batches that I can loop over in a nice way? If I read the
whole set at once I'll most likely run out of memory.
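The closest workaround I can think of is to group everything onto a single
key so that one worker sees the full iterable and can stream through it in
mini-batches. A sketch (the helper name and batch size are mine, and it
assumes the runner hands the GroupByKey result over as a lazy iterable
rather than materializing it):

    import apache_beam as beam
    from sklearn.neighbors import LSHForest

    def train_on_iterable(keyed_elements, batch_size=1000):
        # Stream through the grouped values in mini-batches so the whole
        # dataset never has to sit in memory at once.
        _, elements = keyed_elements
        model, batch = LSHForest(), []
        for element in elements:
            batch.append(element)
            if len(batch) >= batch_size:
                model.partial_fit(batch)
                batch = []
        if batch:
            model.partial_fit(batch)
        return model

    with beam.Pipeline() as p:
        model = (
            p
            | beam.Create([[0.1, 0.2], [0.3, 0.4]])  # toy feature vectors
            | beam.Map(lambda x: (None, x))          # single key
            | beam.GroupByKey()
            | beam.Map(train_on_iterable))

But that gives up all parallelism for the training step, so I am not sure
it is much better.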

I've found that stateful processing exists in the Java SDK, but it seems to
be missing in Python still.

Any help/ideas are greatly appreciated.

Thanks,
Vilhelm von Ehrenheim
