I have a scenario similar to yours, but we are using udf to do exactly that. But you need to get the value of a broadcast variable from the udf. But it's not clear how to achieve it, does anyone know?
Burak Yavuz <[email protected]> 于2019年11月19日周二 下午12:23写道: > If you store the data that you're going to broadcast as a Delta table (see > delta.io) and perform a stream-batch (where your Delta table is the > batch) join, it will auto-update once the table receives any updates. > > Best, > Burak > > On Mon, Nov 18, 2019, 6:21 AM Bryan Jeffrey <[email protected]> > wrote: > >> Hello. >> >> We're running applications using Spark Streaming. We're going to begin >> work to move to using Structured Streaming. One of our key scenarios is to >> lookup values from an external data source for each record in an incoming >> stream. In Spark Streaming we currently read the external data, broadcast >> it and then lookup the value from the broadcast. The broadcast value is >> refreshed on a periodic basis - with the need to refresh evaluated on each >> batch (in a foreachRDD). The broadcasts are somewhat large (~1M records). >> Each stream we're doing the lookup(s) for is ~6M records / second. >> >> While we could conceivably continue this pattern in Structured Streaming >> with Spark 2.4.x and the 'foreachBatch', based on my read of documentation >> this seems like a bit of an anti-pattern in Structured Streaming. >> >> So I am looking for advice: What mechanism would you suggest to on a >> periodic basis read an external data source and do a fast lookup for a >> streaming input. One option appears to be to do a broadcast left outer >> join? In the past this mechanism has been less easy to performance tune >> than doing an explicit broadcast and lookup. >> >> Regards, >> >> Bryan Jeffrey >> >
