Hi Experts, I have a structured streaming query running on Spark 2.3 on a YARN cluster, with the below features:
- Reading JSON messages from a Kafka topic, with:
  - maxOffsetsPerTrigger set to 5000
  - a trigger interval of 500ms on my writeStream task
  - the streaming dataset defined as "events", with fields: id, name, refid, createdTime
- A cached dataset of a CSV file read from HDFS; the CSV file contains a list of prohibited event refids.
- An intermediate dataset defined with the following query, which filters the prohibited events out of the streaming data:

  select * from events where event.refid NOT IN (select refid from CSVData)

The query progress reported by the StreamingQuery object shows the metrics numInputRows, inputRowsPerSecond and processedRowsPerSecond as 0, even though the query is executing with an execution time of ~400ms, and I can see that it does consume records from Kafka and write the processed data to the output database. If I remove the event-filtering step, all the metrics are displayed properly.

Can anyone please point out why this behaviour is observed, and how to gather metrics like numInputRows etc. while also filtering out the events listed in the CSV file? I am also open to suggestions if there is a better way of filtering out the prohibited events in structured streaming.

Thanks in advance.

Akshay Bhardwaj
+91-97111-33849
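For reference, a minimal sketch of the pipeline described above, using a stream-static left anti join instead of a NOT IN subquery (a commonly used way to express "exclude rows whose key appears in a static dataset" in Spark). The HDFS path, Kafka broker address, topic name, and the assumption that the CSV has a "refid" header column are all hypothetical placeholders, not taken from the original post:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.from_json
import org.apache.spark.sql.types._

val spark = SparkSession.builder.appName("FilterProhibitedEvents").getOrCreate()
import spark.implicits._

// Static, cached list of prohibited refids
// (hypothetical path; CSV assumed to have a "refid" header column)
val prohibited = spark.read
  .option("header", "true")
  .csv("hdfs:///path/to/prohibited.csv")
  .select("refid")
  .cache()

// Schema of the JSON events, as described in the question
val eventSchema = new StructType()
  .add("id", StringType)
  .add("name", StringType)
  .add("refid", StringType)
  .add("createdTime", TimestampType)

// Streaming source (broker address and topic name are hypothetical)
val events = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker:9092")
  .option("subscribe", "events")
  .option("maxOffsetsPerTrigger", "5000")
  .load()
  .select(from_json($"value".cast("string"), eventSchema).as("e"))
  .select("e.*")

// Stream-static left anti join: keeps only events whose refid does NOT
// appear in the cached prohibited list, avoiding the NOT IN subquery.
val allowed = events.join(prohibited, Seq("refid"), "left_anti")
```

Whether the anti-join form also restores the numInputRows/processedRowsPerSecond metrics in your setup is something you would need to verify against your job; this is only a sketch of the alternative formulation, not a confirmed fix.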