If you store the data that you're going to broadcast as a Delta table (see delta.io) and perform a stream-batch (where your Delta table is the batch) join, it will auto-update once the table receives any updates.
Best, Burak On Mon, Nov 18, 2019, 6:21 AM Bryan Jeffrey <[email protected]> wrote: > Hello. > > We're running applications using Spark Streaming. We're going to begin > work to move to using Structured Streaming. One of our key scenarios is to > lookup values from an external data source for each record in an incoming > stream. In Spark Streaming we currently read the external data, broadcast > it and then lookup the value from the broadcast. The broadcast value is > refreshed on a periodic basis - with the need to refresh evaluated on each > batch (in a foreachRDD). The broadcasts are somewhat large (~1M records). > Each stream we're doing the lookup(s) for is ~6M records / second. > > While we could conceivably continue this pattern in Structured Streaming > with Spark 2.4.x and the 'foreachBatch', based on my read of documentation > this seems like a bit of an anti-pattern in Structured Streaming. > > So I am looking for advice: What mechanism would you suggest to on a > periodic basis read an external data source and do a fast lookup for a > streaming input. One option appears to be to do a broadcast left outer > join? In the past this mechanism has been less easy to performance tune > than doing an explicit broadcast and lookup. > > Regards, > > Bryan Jeffrey >
