Hi everyone,

I have the following pipeline:

Ingest 2 streams from Kafka -> parse JSON -> join both streams -> aggregate on a key over the last second -> output to Kafka

Join: inner join on an interval of one second, with a 50 ms watermark
Aggregation: tumbling window of one second, with a 50 ms watermark
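For clarity, here is a rough sketch of how the pipeline is wired up (the topic names, schema and column names below are placeholders, not my actual job):

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._
import org.apache.spark.sql.streaming.Trigger
import org.apache.spark.sql.types._

val spark = SparkSession.builder().appName("join-aggregate-sketch").getOrCreate()
import spark.implicits._

// Placeholder schema for the JSON payload
val schema = new StructType()
  .add("key", StringType)
  .add("value", DoubleType)
  .add("eventTime", TimestampType)

// Read one Kafka topic and parse the JSON value column
def readTopic(topic: String) =
  spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "broker:9092")
    .option("subscribe", topic)
    .load()
    .select(from_json($"value".cast("string"), schema).as("j"))
    .select("j.*")

val left = readTopic("topicA")
  .withWatermark("eventTime", "50 milliseconds")

val right = readTopic("topicB")
  .withColumnRenamed("key", "rightKey")
  .withColumnRenamed("value", "rightValue")
  .withColumnRenamed("eventTime", "rightEventTime")
  .withWatermark("rightEventTime", "50 milliseconds")

// Inner join constrained to a one-second event-time interval
val joined = left.join(right, expr("""
  key = rightKey AND
  rightEventTime >= eventTime - interval 1 second AND
  rightEventTime <= eventTime + interval 1 second"""))

// Tumbling one-second window aggregation on the join result
val aggregated = joined
  .groupBy(window($"eventTime", "1 second"), $"key")
  .agg(sum($"value").as("total"))

aggregated.selectExpr("to_json(struct(*)) AS value")
  .writeStream.format("kafka")
  .option("kafka.bootstrap.servers", "broker:9092")
  .option("topic", "output")
  .option("checkpointLocation", "/tmp/checkpoint")
  .outputMode("append")
  .trigger(Trigger.ProcessingTime(0))  // trigger "0": start the next batch as soon as the previous one finishes
  .start()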
Data is published at a constant rate of 400 events/second, and all joins and aggregations are done at second granularity. The median end-to-end latency of an event (<time output appeared on Kafka> - <time input appeared on Kafka>) is 5 seconds and the p99 latency is 7500 ms. I was wondering why this is so high and investigated where it comes from:

- My output trigger is set to 0, so it should process as fast as possible, yet it still produces batches at an average interval of 1-2 seconds.
- If processing a batch takes 1-2 seconds, does this mean that all events are published in one burst at the end of those two seconds?
- Garbage collection is under control since I switched to G1GC and set spark.sql.streaming.minBatchesToRetain to 2. Practically all of the time goes to executor computing time.
- The watermark: I put the watermark at 50 ms on both the join and the aggregation inputs. If I look at the microbatch execution progress (attachment progress.txt <http://apache-spark-user-list.1001560.n3.nabble.com/file/t8011/progress.txt>), I see that the event-time watermark is less than the minimum event time of the batch and 2 seconds less than the maximum event time of the batch. Does this mean that events of batch t1 will only be sent out after batch t2 has been processed, i.e. two seconds later? Or when does the watermark update?
- If I look at the number of input records in the query progress, the number is exactly twice the expected amount. I read somewhere that this can happen if you use two sinks, but that is not what I am doing. Are there other reasons this behavior might occur?

Could the median 5 sec latency come from the following factors:

<avg 1 sec delay due to microbatching> + <2 sec processing time> + <2 sec delay before the watermark advances far enough> = 5 seconds

If this is a plausible cause, what can I do to improve performance? As a reference frame, the exact same pipeline in Spark Streaming has a median latency of 760 ms and a p99 of 2300 ms.

Thank you in advance!
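P.S. For reference, the watermark and input-record numbers above were read from the query progress, roughly like this ("query" stands for the StreamingQuery handle returned by writeStream.start()):

val p = query.lastProgress
println(p.numInputRows)  // input rows for the batch (the doubled figure mentioned above)
println(p.eventTime)     // min / max / avg event time and the current watermark
println(p.durationMs)    // breakdown of where the batch time goes (addBatch, getBatch, walCommit, ...)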