Hi Krzysztof,

The use case you raised is something we are also very interested in supporting in Flink.
Unfortunately, there is no out-of-the-box solution in the Table/SQL API yet, but you have
two options which both come close, though each needs some modifications.

1. The first option is to use a temporal table function [1]. It requires you to implement
the logic of reading the Hive tables and doing the daily update inside the table function
yourself (a rough sketch follows below).
2. The second option is to use a temporal table join [2], which currently only works with
processing time (just like the simple solution you mentioned) and requires the table
source to have lookup capability (like HBase). The Hive connector doesn't support lookup
yet, so to make this work you would need to sync the content to another storage system
that does support lookup, such as HBase (see the second sketch below).
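
To make option 1 a bit more concrete, here is a minimal sketch of a processing-time
temporal table function join, modelled on the pattern in [1]. All table and field names
("Orders", "RatesHistory", "r_currency", ...) are placeholders, and the fromElements
sources only stand in for your Kafka stream and for the Hive-reading / daily-refresh
logic you would have to write yourself:

    import org.apache.flink.api.java.tuple.Tuple2;
    import org.apache.flink.streaming.api.datastream.DataStream;
    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
    import org.apache.flink.table.api.Table;
    import org.apache.flink.table.api.java.StreamTableEnvironment;
    import org.apache.flink.table.functions.TemporalTableFunction;
    import org.apache.flink.types.Row;

    public class TemporalFunctionJoinSketch {
        public static void main(String[] args) throws Exception {
            StreamExecutionEnvironment env =
                StreamExecutionEnvironment.getExecutionEnvironment();
            StreamTableEnvironment tEnv = StreamTableEnvironment.create(env);

            // Stand-in for the Kafka event stream; o_proctime is a
            // processing-time attribute.
            DataStream<Tuple2<String, Long>> orders =
                env.fromElements(Tuple2.of("EUR", 2L), Tuple2.of("USD", 1L));
            tEnv.registerTable("Orders",
                tEnv.fromDataStream(orders, "o_currency, o_amount, o_proctime.proctime"));

            // Stand-in for the dimension data; in your case this is where the
            // logic of reading the Hive table and applying the daily update
            // would live.
            DataStream<Tuple2<String, Long>> rates =
                env.fromElements(Tuple2.of("EUR", 114L), Tuple2.of("USD", 102L));
            Table ratesHistory =
                tEnv.fromDataStream(rates, "r_currency, r_rate, r_proctime.proctime");

            // Turn the history into a temporal table function keyed by
            // currency and versioned by processing time.
            TemporalTableFunction ratesFn =
                ratesHistory.createTemporalTableFunction("r_proctime", "r_currency");
            tEnv.registerFunction("Rates", ratesFn);

            // Join each event against the dimension version valid at its
            // processing time.
            Table joined = tEnv.sqlQuery(
                "SELECT o.o_currency, o.o_amount, r.r_rate "
                + "FROM Orders AS o, LATERAL TABLE (Rates(o.o_proctime)) AS r "
                + "WHERE o.o_currency = r.r_currency");

            tEnv.toAppendStream(joined, Row.class).print();
            env.execute("temporal-function-join-sketch");
        }
    }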
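
Option 2, with the same caveats, would then boil down to a single lookup join query,
assuming you run the Blink planner and have registered a lookup-capable table
("RatesHBase" below is a placeholder, e.g. backed by the HBase connector) that you keep
in sync with Hive:

    // Processing-time temporal table (lookup) join: every incoming event
    // probes the current row in the external store by key, so you always
    // join against the latest synced version of the dimension.
    Table joined = tEnv.sqlQuery(
        "SELECT o.o_currency, o.o_amount, r.r_rate "
        + "FROM Orders AS o "
        + "JOIN RatesHBase FOR SYSTEM_TIME AS OF o.o_proctime AS r "
        + "ON o.o_currency = r.r_currency");

Note that this gives you your case 1 semantics (always the latest version,
nondeterministic on retry); the event-time variant that would cover your case 2 is
exactly what is missing today.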

Neither solution is ideal right now, and we also aim to improve this, maybe in the
following release.

Best,
Kurt


On Fri, Dec 13, 2019 at 1:44 AM Krzysztof Zarzycki <k.zarzy...@gmail.com>
wrote:

> Hello dear Flinkers,
> If this kind of question was asked on the groups, I'm sorry for a
> duplicate. Feel free to just point me to the thread.
> I have to solve a probably pretty common case of joining a datastream to a
> dataset.
> Let's say I have the following setup:
> * I have a high-pace stream of events coming in from Kafka.
> * I have some dimension tables stored in Hive. These tables change
> daily. I can keep a snapshot for each day.
>
> Now conceptually, I would like to join the stream of incoming events to
> the dimension tables (simple hash join). We can consider two cases:
> 1) simpler, where I join the stream with the most recent version of the
> dictionaries (so the result is accepted to be nondeterministic if the job
> is retried);
> 2) more advanced, where I would like to do a temporal join of the stream
> with the dictionary snapshots that were valid at the time of the event
> (this result should be deterministic).
>
> The end goal is to do an aggregation of that joined stream and store the
> results in Hive or a more real-time analytical store (Druid).
>
> Now, could you please help me understand: are any of these cases
> implementable with the declarative Table/SQL API? With the use of temporal
> joins, catalogs, Hive integration, JDBC connectors, or whatever beta
> features there are now? (I've read quite a lot of Flink docs about each of
> those, but I have trouble compiling this information into a final design.)
> Could you please help me understand how these components should cooperate?
> If that is impossible with the Table API, can we come up with the easiest
> implementation using the DataStream API?
>
> Thanks a lot for any help!
> Krzysztof
>
