Note that if you can access your side input as an Iterable or a Map, then
the entire PCollection does not have to fit in memory on runners that
lazily load and cache side inputs. However, performance could be bad
depending on your access patterns.
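
A minimal sketch of that access pattern, assuming the side input arrives as
a plain Python iterable (in the Python SDK, for example, by wrapping the
PCollection in `beam.pvalue.AsIter`) and that the model exposes a
scikit-learn style `partial_fit`. The helper names `iter_batches` and
`fit_incrementally` are hypothetical, and whether the iterable is really
loaded lazily depends on the runner:

```python
from itertools import islice


def iter_batches(elements, batch_size):
    """Lazily yield lists of at most batch_size items from any iterable."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    iterator = iter(elements)
    while True:
        # islice pulls only the next batch_size items, so the whole input
        # is never materialized at once.
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


def fit_incrementally(model, elements, batch_size=1000):
    """Feed mini-batches to model.partial_fit in sequence.

    `model` is assumed to be any object with a partial_fit(batch) method.
    """
    for batch in iter_batches(elements, batch_size):
        model.partial_fit(batch)
    return model
```

Inside a DoFn that receives the iterable side input, something like
`fit_incrementally(model, side_input_elements)` would then train a single
model in one place, at the cost of doing all the training on one worker.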

-Vikas

On 26 July 2017 at 21:55, Ahmet Altay <[email protected]> wrote:

> Hi Vilhelm,
>
> The Python SDK currently does not support stateful processing. We should
> update the capability matrix to show this. I filed
> https://issues.apache.org/jira/browse/BEAM-2687 to track this feature.
> Feel free to follow it there or, better, make it happen. As far as I know,
> nobody is actively working on it, and it is unlikely to be supported in the
> short term.
>
> Thank you,
> Ahmet
>
> On Tue, Jul 25, 2017 at 3:49 AM, Vilhelm von Ehrenheim <
> [email protected]> wrote:
>
>> Hi!
>> Is there any way to do stateful processing in Python Beam SDK?
>>
>> I am trying to train an LSHForest for approximate nearest neighbor search.
>> Using the scikit-learn implementation it is possible to do partial fits, so
>> I can gather up mini-batches and fit the model on those in sequence using
>> ParDo. However, to my understanding, there is no way for me to control how
>> many bundles the ParDo will execute over, so the training makes little
>> sense and I will end up with a lot of different models rather than one.
>>
>> Another approach would be to create a CombineFn that accumulates values
>> by training the model on them, but there is no intuitive way to combine
>> models in `merge_accumulators`, so I don't think that will fit either.
>>
>> Does it make sense to pass the whole PCollection as a list in a side
>> input and train the model that way? In that case, how should I chop the
>> PCollection into batches that I can loop over in a nice way? If I read the
>> whole set I'll most likely run out of memory.
>>
>> I've found that stateful processing exists in the Java SDK, but it
>> seems to be missing in Python still.
>>
>> Any help/ideas are greatly appreciated.
>>
>> Thanks,
>> Vilhelm von Ehrenheim
>>
>
>
