Note that if you access your side input as an Iterable or a Map, the entire PCollection does not have to fit in memory on runners that lazily load and cache side inputs; performance can still be poor, though, depending on your access patterns.
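
For example, something along these lines (untested sketch; the pipeline shape, element values, and the 'Lookup' data are made up purely for illustration):

    import apache_beam as beam
    from apache_beam import pvalue

    with beam.Pipeline() as p:
        main = p | 'Main' >> beam.Create(['a', 'b', 'a'])
        lookup = p | 'Lookup' >> beam.Create([('a', 1), ('b', 2)])

        # pvalue.AsDict / pvalue.AsIter let a runner that lazily loads and
        # caches side inputs pull the data on demand instead of
        # materializing the whole PCollection as one in-memory list.
        joined = main | 'UseSide' >> beam.Map(
            lambda k, d: (k, d.get(k)), d=pvalue.AsDict(lookup))

Random lookups into an AsDict side input are usually fine; repeatedly re-iterating a large AsIter side input for every element is where the access-pattern caveat really bites.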
-Vikas

On 26 July 2017 at 21:55, Ahmet Altay <[email protected]> wrote:

> Hi Vilhelm,
>
> The Python SDK currently does not support stateful processing. We should
> update the capability matrix to show this. I filed
> https://issues.apache.org/jira/browse/BEAM-2687 to track this feature.
> Feel free to follow it there, or better, make it happen. As far as I know,
> nobody is actively working on it, so it is unlikely to be supported in the
> short term.
>
> Thank you,
> Ahmet
>
> On Tue, Jul 25, 2017 at 3:49 AM, Vilhelm von Ehrenheim
> <[email protected]> wrote:
>
>> Hi!
>> Is there any way to do stateful processing in the Python Beam SDK?
>>
>> I am trying to train an LSHForest for approximate nearest neighbor
>> search. Using the scikit-learn implementation it is possible to do
>> partial fits, so I can gather up mini-batches and fit the model on those
>> in sequence using a ParDo. However, to my understanding, there is no way
>> for me to control how many bundles the ParDo will execute over, so the
>> training makes little sense and I will end up with a lot of different
>> models rather than one.
>>
>> Another approach would be to create a CombineFn that accumulates values
>> by training the model on them, but there is no intuitive way to combine
>> models in `merge_accumulators`, so I don't think that will fit either.
>>
>> Does it make sense to pass the whole PCollection as a list in a side
>> input and train the model that way? In that case, how should I chop the
>> PCollection into batches that I can loop over in a nice way? If I read
>> the whole set at once, I'll most likely run out of memory.
>>
>> I've found that stateful processing exists in the Java SDK, but it seems
>> to be missing in Python still.
>>
>> Any help/ideas are greatly appreciated.
>>
>> Thanks,
>> Vilhelm von Ehrenheim
>>
>
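
P.S. On the batching question in the quoted thread above: one workaround, until stateful processing lands in Python, is to trigger a single DoFn from a one-element PCollection and stream the data in as an Iterable side input, chopping it into fixed-size chunks for partial_fit. Rough, untested sketch; the batch size, placeholder data, and scikit-learn usage are my assumptions:

    import itertools

    import apache_beam as beam
    from apache_beam import pvalue

    def train(_, rows, batch_size=1000):
        # Stream the Iterable side input in fixed-size chunks and call
        # partial_fit once per chunk. Training is serialized on one worker,
        # but the full PCollection is never held in memory as a single list.
        from sklearn.neighbors import LSHForest  # assuming it is installed on workers
        model = LSHForest()
        rows = iter(rows)
        while True:
            batch = list(itertools.islice(rows, batch_size))
            if not batch:
                break
            model.partial_fit(batch)
        yield model

    with beam.Pipeline() as p:
        # Placeholder feature vectors; replace with your real source.
        data = p | 'Features' >> beam.Create([[0.1, 0.2], [0.3, 0.4]])
        model = (p
                 | 'Seed' >> beam.Create([None])
                 | 'Train' >> beam.FlatMap(train, rows=pvalue.AsIter(data)))

Not a substitute for real stateful processing, but it should at least leave you with a single trained model until BEAM-2687 is resolved.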
