I've seen the feature work very well. For tuning, you've got:
spark.streaming.backpressure.pid.proportional (defaults to 1, non-negative) -
weight of the response to the "error" (the change between the last batch and
this batch).
spark.streaming.backpressure.pid.integral (defaults to 0.2, non-negative) -
weight of the response to the accumulation of error. This has a dampening
effect.
spark.streaming.backpressure.pid.derived (defaults to 0, non-negative) -
weight of the response to the trend in the error. This can cause
arbitrary/noise-induced fluctuations in batch size, but can also help react
quickly to increased/reduced capacity.
spark.streaming.backpressure.pid.minRate (defaults to 100, must be positive) -
the batch size won't go below this.
spark.streaming.receiver.maxRate - the batch size won't go above this.
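To see how the three weights interact, here is a minimal Python sketch of the
rate update, modelled loosely on Spark's internal PIDRateEstimator. It is a
simplification for illustration, not the exact implementation; the function
name and argument names are my own.

```python
def new_rate(latest_rate, processing_rate, scheduling_delay_ms,
             batch_interval_ms, last_error,
             proportional=1.0, integral=0.2, derived=0.0,
             min_rate=100.0):
    """Sketch of a PID rate update (records/sec), not Spark's exact code."""
    # proportional term: how far the last rate overshot what was processed
    error = latest_rate - processing_rate
    # integral term: the accumulated backlog (scheduling delay) expressed
    # as a rate -- this is what dampens sustained overload
    historical_error = scheduling_delay_ms * processing_rate / batch_interval_ms
    # derivative term: the trend in the error since the previous batch
    d_error = (error - last_error) / (batch_interval_ms / 1000.0)
    rate = (latest_rate
            - proportional * error
            - integral * historical_error
            - derived * d_error)
    # minRate puts a floor under the computed rate
    return max(rate, min_rate)

# Last batch ingested at 1000 rec/s but only 800 rec/s were processed,
# with 2 s of scheduling delay on a 1 s batch interval:
print(new_rate(1000.0, 800.0, 2000.0, 1000.0, 0.0))  # 480.0
```

With the defaults, the proportional term knocks off the 200 rec/s overshoot
and the integral term a further 20% of the backlog-as-rate, which is why the
rate drops sharply under sustained delay.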
Cheers,
Richard
https://richardstartin.com/
From: Liren Ding
Sent: 05 December 2016 22:18
To: d...@spark.apache.org; user@spark.apache.org
Subject: Back-pressure to Spark Kafka Streaming?
Hey all,
Does backpressure actually work with Spark Kafka streaming? According to the
latest Spark Streaming documentation:
http://spark.apache.org/docs/latest/streaming-programming-guide.html
"In Spark 1.5, we have introduced a feature called backpressure that eliminates
the need to set this rate limit, as Spark Streaming automatically figures out
the rate limits and dynamically adjusts them if the processing conditions
change. This backpressure can be enabled by setting the configuration parameter
spark.streaming.backpressure.enabled to true."
But I also see a few open spark jira tickets on this option:
https://issues.apache.org/jira/browse/SPARK-7398
https://issues.apache.org/jira/browse/SPARK-18371
The case in the second ticket describes an issue similar to ours. We use
Kafka to send large batches (10~100M) to Spark Streaming, and the streaming
interval is set to 1~4 minutes. With backpressure set to true, queued active
batches still pile up whenever the average batch processing time exceeds the
batch interval. After the Spark driver is restarted, all queued batches merge
into one giant batch, which blocks subsequent batches and also has a good
chance of failing eventually. The only config we found that might help is
"spark.streaming.kafka.maxRatePerPartition". It does limit the incoming batch
size, but it is not a perfect solution, since the effective cap depends on the
number of partitions as well as the length of the batch interval. In our case,
hundreds of partitions times minutes of interval still produce a batch size
that is too large. So we still want to figure out how to make backpressure
work with Spark Kafka streaming, if it is supposed to work there. Thanks.
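To make the partition/interval interaction above concrete: the cap that
maxRatePerPartition imposes is just the product of the per-partition rate, the
partition count, and the batch interval. A quick sketch (the numbers are
hypothetical, chosen to match the "hundreds of partitions, minutes of
interval" scenario):

```python
def max_batch_records(max_rate_per_partition, num_partitions, batch_interval_s):
    """Upper bound on records per batch under
    spark.streaming.kafka.maxRatePerPartition."""
    return max_rate_per_partition * num_partitions * batch_interval_s

# 1000 records/sec/partition, 300 partitions, 2-minute batch interval:
print(max_batch_records(1000, 300, 120))  # 36000000
```

So even a modest per-partition rate multiplies out to tens of millions of
records per batch, which is why the setting alone doesn't bound batch size in
any useful way for this topology.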
Liren