Hi everyone, 
I have the following pipeline:
Ingest 2 streams from Kafka -> parse JSON -> join both streams -> aggregate
on a key over the last second -> output to Kafka
with the following settings:
Join: inner join over an interval of one second, with a watermark of 50 ms
Aggregation: tumbling window of one second, with a watermark of 50 ms

Data is published at a constant rate of 400 events/second, and all joins and
aggregations are done at one-second granularity.

The median end-to-end latency of an event (<time output appeared on
Kafka> - <time input appeared on Kafka>) is 5 seconds, and the p99 latency is
7500 ms. I was wondering why this was so high and investigated where it came
from:

   - My output trigger is set to 0, so it should process as fast as
possible, yet it still makes batches at an average interval of 1-2 seconds.
   - If processing a batch takes 1-2 seconds, does this mean that all events
are published in one burst at the end of the two seconds?
   - Garbage collection is under control since I switched to G1GC and set
spark.sql.streaming.minBatchesToRetain to 2. Practically all of the time is
spent in executor computing time.
   - The watermark: I set the watermark to 50 ms for the join and the
aggregation inputs. If I look at the microbatch execution progress
(attachment progress.txt,
<http://apache-spark-user-list.1001560.n3.nabble.com/file/t8011/progress.txt>),
I see that the event time watermark is less than the minimum event time
of the batch and 2 seconds less than the maximum event time of the batch.
Does this mean that events of batch t1 will only be sent out after
processing batch t2, so two seconds later? Or when would the watermark
update?
   - If I look at the number of input records in the query progress, the
number is exactly twice the expected amount. I read somewhere that this can
happen if you use two sinks, but that is not what I am doing. Are there
other reasons this behavior might occur?
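
To make the watermark question concrete, here is a toy model in plain Python
(not Spark code) of how I understand the rule from the programming guide: the
watermark is recomputed at the end of each micro-batch as the maximum event
time seen so far minus the delay, and only takes effect in the next batch.
The function simulate, the sample event times, and the "emit once window end
<= watermark" condition are my own assumptions for illustration. It models a
single windowed aggregation in append mode and ignores the join.

```python
def simulate(batches, delay_ms, window_ms):
    """Toy model: return {window_end_ms: index of the batch that emits it}."""
    watermark = 0         # watermark in effect while the current batch runs
    max_event_time = 0    # largest event time seen so far
    open_windows = set()  # end times of windows that have not been emitted yet
    emitted = {}
    for i, batch in enumerate(batches):
        for t in batch:
            # End of the tumbling window that contains event time t.
            end = (t // window_ms + 1) * window_ms
            # Rows whose window is already behind the watermark are dropped.
            if end > watermark:
                open_windows.add(end)
        # Append mode: a window is emitted once the watermark reaches its end.
        for end in sorted(w for w in open_windows if w <= watermark):
            emitted[end] = i
            open_windows.discard(end)
        # The watermark only moves after the batch, so it affects the next one.
        max_event_time = max([max_event_time] + list(batch))
        watermark = max(watermark, max_event_time - delay_ms)
    return emitted


# Hypothetical event times (ms); each batch covers about 2 seconds of data.
batches = [
    [100, 900, 1500, 1900],
    [2100, 2900, 3500, 3900],
    [4100, 5900],
]
emitted = simulate(batches, delay_ms=50, window_ms=1000)
print(emitted)
```

In this model the window ending at 1000 ms is emitted in batch 1, and the
windows ending at 2000 ms and 3000 ms are both emitted in batch 2, so a window
whose data arrived entirely in batch 0 can wait two more batches. If that is
what really happens, it would fit the delay I am seeing.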

Could it mean that the median 5 sec latency comes from the following
factors:

<avg 1 sec delay due to microbatching> 
+ <2 sec processing time>
+ <2 sec delay before watermark advances far enough>
= 5 seconds
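
Spelled out as arithmetic, under the assumption that batches take about 2
seconds and that an event waits on average half a batch interval before its
batch starts (the variable names are only illustrative):

```python
# Rough latency budget in seconds; all inputs are estimates from my observations.
batch_interval_s = 2.0                       # observed upper end of the 1-2 s batch interval
avg_batching_wait_s = batch_interval_s / 2   # average wait until the next batch starts
processing_s = 2.0                           # time to process one batch
watermark_lag_s = 2.0                        # wait until the watermark passes the window end

estimated_median_s = avg_batching_wait_s + processing_s + watermark_lag_s
print(estimated_median_s)
```

This adds up to the 5 seconds I measure as the median, which is why I
suspect these three factors.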

If this is a plausible cause, what can I do to reduce the latency?
As a point of reference, the equivalent pipeline in Spark Streaming has a
median latency of 760 ms and a p99 of 2300 ms.

Thank you in advance!





