What about a StatefulDoFn where you append the value(s) in a bag state as
you see them?
If you need to seed the state information, you could do a one time lookup
in processElement for each key to HBase if the key hasn't yet been seen
(storing the fact that you loaded the data in a boolean) but afterwards you
would rely on reading the value(s) from the bag state.
processElement(...) {
Value newValue = ...
Iterable<Value> values;
if (!hasSeenKeyBooleanValueState.read()) {
values = ... load from HBase ...
valuesBagState.append(values);
hasSeenKeyBooleanValueState.set(true);
} else {
values = valuesBagState.read();
}
... perform processing using values ...
valuesBagState.append(newValue);
}
This blog post[1] has a good example.
1: https://beam.apache.org/blog/2017/02/13/stateful-processing.html
On Mon, Dec 3, 2018 at 12:48 PM Chandan Biswas <[email protected]>
wrote:
> Hello All,
> I have a use case where I have PCollection<KV<Key,Value>> data coming from
> Kafka source. When processing each record (KV<Key,Value>) I need all old
> values for that Key stored in a hbase table. The naive approach is to do
> HBase lookup in the DoFn.processElement. I considered sideinput but it' not
> going to work because of large dataset. Any suggestion?
>
> Thanks,
> Chandan
>