If you store the data that you're going to broadcast as a Delta table (see
delta.io) and perform a stream-batch join (where your Delta table is the
batch side), the join will pick up new data automatically once the table
receives any updates.
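
In case it helps, here is a minimal PySpark sketch of that kind of
stream-batch join. It is only an illustration under some assumptions: the
helper name enrich_with_lookup, the join key "id", and the path
/tmp/lookup_delta are all made up for the example, the rate source stands in
for your real stream, and the Delta Lake package has to be available to
Spark.

```python
def enrich_with_lookup(spark, events, lookup_path, key="id"):
    """Left-join a streaming DataFrame against a Delta table read as a batch source.

    `lookup_path` and `key` are hypothetical; substitute your own table
    location and join column.
    """
    # Batch (static) read of the Delta table. Needs the Delta Lake package.
    lookup = spark.read.format("delta").load(lookup_path)
    # Stream-static left outer join: events with no match keep nulls
    # in the columns coming from the lookup table.
    return events.join(lookup, on=key, how="left")


def main():
    # Imported here so the helper above can be read without a Spark install.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("lookup-join-sketch").getOrCreate()

    # Built-in "rate" source as a stand-in stream; it emits (timestamp, value).
    events = (
        spark.readStream.format("rate")
        .option("rowsPerSecond", 10)
        .load()
        .withColumnRenamed("value", "id")
    )

    enriched = enrich_with_lookup(spark, events, "/tmp/lookup_delta")

    # Print each micro-batch to the console; replace with your real sink.
    query = enriched.writeStream.format("console").start()
    query.awaitTermination()
```

The lookup table is read with spark.read rather than spark.readStream, so
the stream stays on the left side of the join and no state is kept for the
lookup data. Calling main() starts the query; updating the table is then
just a normal batch write to the same Delta path.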

Best,
Burak

On Mon, Nov 18, 2019, 6:21 AM Bryan Jeffrey <[email protected]> wrote:

> Hello.
>
> We're running applications using Spark Streaming.  We're going to begin
> work to move to using Structured Streaming.  One of our key scenarios is to
> look up values from an external data source for each record in an incoming
> stream.  In Spark Streaming we currently read the external data, broadcast
> it and then look up the value from the broadcast.  The broadcast value is
> refreshed on a periodic basis - with the need to refresh evaluated on each
> batch (in a foreachRDD).  The broadcasts are somewhat large (~1M records).
> Each stream we're doing the lookup(s) for is ~6M records / second.
>
> While we could conceivably continue this pattern in Structured Streaming
> with Spark 2.4.x and 'foreachBatch', based on my reading of the
> documentation this seems like a bit of an anti-pattern in Structured
> Streaming.
>
> So I am looking for advice: What mechanism would you suggest for
> periodically reading an external data source and doing a fast lookup for a
> streaming input?  One option appears to be a broadcast left outer join.
> In the past this mechanism has been harder to tune for performance than
> doing an explicit broadcast and lookup.
>
> Regards,
>
> Bryan Jeffrey
>
