You can use

spark.streaming.receiver.maxRate (default: not set)
Maximum rate (number of records per second) at which each receiver will receive data. Effectively, each stream will consume at most this number of records per second. Setting this configuration to 0 or a negative number will put no limit on the rate. See the deployment guide <https://spark.apache.org/docs/latest/streaming-programming-guide.html#deploying-applications> in the Spark Streaming programming guide for more details.

Another way is to implement a feedback loop in your receivers that monitors the performance metrics of your application/job and automatically adjusts the receiving rate based on them. BUT none of this has anything to do with "reducing the latency". These measures simply prevent your application/job from clogging up. The nastier effect of clogging up is that Spark Streaming starts removing in-memory RDDs from RAM before the job has processed them. That works fine in Spark Batch (i.e. removing RDDs from RAM based on LRU), but in Spark Streaming, when done in this "unceremonious way", it simply crashes the application.

From: Dmitry Goldenberg [mailto:dgoldenberg...@gmail.com]
Sent: Monday, May 18, 2015 11:46 AM
To: Akhil Das
Cc: user@spark.apache.org
Subject: Re: Spark Streaming and reducing latency

Thanks, Akhil. So what do folks typically do to increase/contract the capacity? Do you plug in some cluster auto-scaling solution to make this elastic? Does Spark have any hooks for instrumenting auto-scaling? In other words, how do you avoid overwhelming the receivers in a scenario where your system's input can be unpredictable, based on users' activity?

On May 17, 2015, at 11:04 AM, Akhil Das <ak...@sigmoidanalytics.com> wrote:

With receiver-based streaming, you can actually specify spark.streaming.blockInterval, which is the interval at which the receiver chunks incoming data into blocks. The default value is 200ms, and hence if your batch duration is 1 second, each receiver will produce 5 blocks of data per batch.
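
As a rough check of the block arithmetic above, here is a minimal Python sketch of how many blocks a single receiver produces per batch. The function name blocks_per_batch is made up for illustration and is not part of any Spark API; the 200 ms default simply mirrors the default quoted above.

```python
def blocks_per_batch(batch_interval_ms: int, block_interval_ms: int = 200) -> int:
    """Number of blocks one receiver produces in one batch.

    Hypothetical helper for illustration only: it just divides the batch
    interval by spark.streaming.blockInterval (both in milliseconds).
    """
    if batch_interval_ms <= 0 or block_interval_ms <= 0:
        raise ValueError("intervals must be positive")
    return batch_interval_ms // block_interval_ms


# A 1 second batch with the default 200 ms block interval.
print(blocks_per_batch(1000))       # 5
# A 2 second batch with a 50 ms block interval.
print(blocks_per_batch(2000, 50))   # 40
```

Since the blocks of a batch determine how the received data is split up for processing, a smaller block interval gives more (and smaller) pieces of work per batch for the same batch duration.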
And yes, with Spark Streaming, when your processing time goes beyond your batch duration and data keeps arriving at a higher rate than you can process it, you will overwhelm the receiver's memory, and the job will then throw "block not found" exceptions.

Thanks
Best Regards

On Sun, May 17, 2015 at 7:21 PM, dgoldenberg <dgoldenberg...@gmail.com> wrote:

I keep hearing the argument that the way Discretized Streams work with Spark Streaming is a lot more of a batch processing algorithm than true streaming. For streaming, one would expect a new item, e.g. in a Kafka topic, to be available to the streaming consumer immediately.

With discretized streams, streaming is done in batch intervals, i.e. the consumer has to wait out the interval before it can get at the new items. If one wants to reduce latency, it seems the only way to do this would be to reduce the batch interval. However, that may lead to a great deal of churn, with many requests going from the consumers to Kafka, potentially with no results whatsoever because there is nothing new in the topic at the moment.

Is there a counter-argument to this reasoning? What are some of the general approaches to reducing latency that folks might recommend? Or, perhaps there are ways of dealing with this at the streaming API level?

If latency is of great concern, is it better to look into streaming from something like Flume, where data is pushed to consumers rather than pulled by them? Are there techniques, in that case, to ensure the consumers don't get overwhelmed with new data?

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Spark-Streaming-and-reducing-latency-tp22922.html
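
To make the rate-limiting discussion in this thread a bit more concrete, here is a small Python sketch. It is not Spark code: max_records_per_batch and next_rate are hypothetical helpers, and the proportional adjustment rule is only one assumed way such a feedback loop could work. The first function shows what a given spark.streaming.receiver.maxRate implies per batch; the second scales the receiving rate by how far the processing time is from the batch interval.

```python
from typing import Optional


def max_records_per_batch(max_rate: float, batch_interval_s: float,
                          num_receivers: int = 1) -> Optional[int]:
    """Upper bound on records per batch implied by a per-receiver maxRate.

    Returns None when max_rate is 0 or negative, which means "no limit".
    Hypothetical helper, not part of Spark.
    """
    if max_rate <= 0:
        return None
    return int(max_rate * batch_interval_s * num_receivers)


def next_rate(current_rate: float, processing_time_s: float,
              batch_interval_s: float, min_rate: float = 1.0) -> float:
    """Assumed feedback rule: scale the rate so that processing time
    moves toward the batch interval. Hypothetical, for illustration."""
    if processing_time_s <= 0:
        return current_rate  # no measurement yet, keep the current rate
    return max(min_rate, current_rate * batch_interval_s / processing_time_s)


# 3 receivers capped at 1000 records/s each, with 2 second batches.
print(max_records_per_batch(1000, 2, 3))   # 6000
# Batches take 2 s to process but arrive every 1 s: halve the rate.
print(next_rate(1000.0, 2.0, 1.0))         # 500.0
```

The static cap itself can be passed at submit time, for example with --conf spark.streaming.receiver.maxRate=1000 on spark-submit. As noted in the thread, either approach only keeps the job from clogging up; it does not reduce latency.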