There isn't a magic Spark configuration setting that would account for multi-second fixed overheads; you should be looking at roughly 200ms as a minimum for a streaming batch. 1024 Kafka topicpartitions is not reasonable for the volume you're talking about. Unless you have a really extreme workload, 32 or 64 is a better starting guess.
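For what it's worth, the batch interval isn't a configuration property at all - it's just the Duration you pass when you create the StreamingContext, while the partition count is a property of the Kafka topic itself (set when the topic is created), not something you tune in Spark. A minimal sketch (the master, app name and the 500ms figure are placeholders, not a recommendation for your exact workload):

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Milliseconds, StreamingContext}

// The batch interval is fixed at context creation; a few hundred ms to a
// few seconds is a sane starting range for this kind of volume.
val conf = new SparkConf().setMaster("local[*]").setAppName("kafka-baseline")
val ssc = new StreamingContext(conf, Milliseconds(500))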
Rather than jumping to conclusions about SQL operations being the problem, start from the very beginning. Read a stream of messages from Kafka and just do .foreach(println), at a reasonable batch size (say 500ms or a second), and see how that keeps up in your environment (there's a bare-bones sketch of what I mean at the bottom of this message). Don't use take(), don't use count(), don't use print(), since they may have non-obvious performance implications. If that works, add further operations one step at a time and see when issues arise.

On Mon, May 30, 2016 at 8:45 PM, Malcolm Lockyer <malcolm.lock...@hapara.com> wrote:
> Hopefully this is not off topic for this list, but I am hoping to
> reach some people who have used Kafka + Spark before.
>
> We are new to Spark and are setting up our first production
> environment, and we are hitting a speed issue that may be
> configuration related - and we have little experience in configuring
> Spark environments.
>
> So we've got a Spark streaming job that seems to take an inordinate
> amount of time to process. I realize that without specifics it is
> difficult to trace - however the most basic primitives in Spark are
> performing horribly. The lazy nature of Spark is making it difficult
> for me to understand what is happening - any suggestions are very
> much appreciated.
>
> Environment is MBP 2.2 i7. Spark master is "local[*]". We are using
> Kafka and PostgreSQL, both local. The job is designed to:
>
> a) grab some data from Kafka
> b) correlate with existing data in PostgreSQL
> c) output data to Kafka
>
> I am isolating timings by calling System.nanoTime() before and after
> something that forces calculation, for example .count() on a
> DataFrame. It seems like every operation has a MASSIVE fixed
> overhead, and that is stacking up, making each iteration on the RDD
> extremely slow. Slow operations include pulling a single item from
> the Kafka queue, running a simple query against PostgreSQL, and
> running a Spark aggregation on an RDD with a handful of rows.
>
> The machine is not maxing out on memory, disk or CPU. The machine
> seems to be doing nothing for a high percentage of the execution
> time. We have reproduced this behavior on two other machines, so
> we're suspecting a configuration issue.
>
> As a concrete example, we have a DataFrame produced by running a
> JDBC query by mapping over an RDD from Kafka. Calling count() (I
> guess forcing execution) on this DataFrame when there is *1*
> item/row (note: the SQL database is EMPTY at this point, so it is
> not a factor) takes 4.5 seconds; calling count() when there are
> 10,000 items takes 7 seconds.
>
> Can anybody offer experience of something like this happening for
> them? Any suggestions on how to understand what is going wrong?
>
> I have tried tuning the number of Kafka partitions - increasing this
> seems to increase the concurrency and ultimately the number of
> things processed per minute, but to get something half decent I'm
> going to need to run with 1024 or more partitions. Is 1024
> partitions a reasonable number? What do you use in your
> environments?
>
> I've tried different options for batchDuration. The calculation
> seems to be batchDuration * Kafka partitions for the number of items
> per iteration, but this is always still extremely slow (many per
> iteration vs. very few doesn't seem to really improve things). Can
> you suggest a list of the Spark configuration parameters related to
> speed that you think are key - preferably with the values you use
> for those parameters?
>
> I'd really, really appreciate any help or suggestions, as I've been
> working on this speed issue for 3 days without success and my head
> is starting to hurt. Thanks in advance.
>
> Thanks,
>
> --
>
> Malcolm Lockyer
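Roughly what I have in mind for that bare-bones test is below. This is only a sketch, not something tested against your setup - it assumes the Kafka 0.8 direct stream API from the spark-streaming-kafka artifact, a local broker on localhost:9092 and a topic named "test", so adjust names and versions to whatever you're actually running:

import kafka.serializer.StringDecoder
import org.apache.spark.SparkConf
import org.apache.spark.streaming.kafka.KafkaUtils
import org.apache.spark.streaming.{Milliseconds, StreamingContext}

object KafkaBaseline {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[*]").setAppName("kafka-baseline")
    val ssc = new StreamingContext(conf, Milliseconds(500))

    // Direct stream: one Spark partition per Kafka topicpartition, no receivers.
    val kafkaParams = Map("metadata.broker.list" -> "localhost:9092")
    val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
      ssc, kafkaParams, Set("test"))

    // Simplest possible output: print each message value, nothing else.
    stream.foreachRDD { rdd =>
      rdd.foreach { case (_, value) => println(value) }
    }

    ssc.start()
    ssc.awaitTermination()
  }
}

If that keeps up with your input rate (watch processing time vs. batch interval on the Streaming tab of the UI rather than timing count() calls), then add the JDBC lookup, then the write back to Kafka, one step at a time, and see which step introduces the multi-second delay.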