Thanks, Akhil. Using reduceByKeyAndWindow with an inverse function requires checkpointing to be enabled, so yes, we do have it enabled. We do not use the Write Ahead Log, though, because we have no need for recovery and do not restore the process state on restart.
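To illustrate why the inverse-function form carries state from one batch to the next, here is a minimal pure-Python sketch of an incrementally updated windowed reduce. It is not Spark's implementation: `windowed_reduce`, `is_empty`, and the toy batches are hypothetical names and data chosen for illustration, and the sketch assumes each batch is a plain dict of key to value.

```python
from collections import deque
from operator import add, sub


def windowed_reduce(batches, window_len, reduce_fn, inv_reduce_fn, is_empty):
    """Yield the per-key aggregate over the last `window_len` batches.

    The aggregate is updated incrementally: new batches are folded in with
    reduce_fn and expired batches are backed out with inv_reduce_fn.
    """
    state = {}           # running aggregate per key; must survive between batches
    in_window = deque()  # batches currently inside the window
    for batch in batches:
        # Fold the newly arrived batch into the running state.
        for key, value in batch.items():
            state[key] = reduce_fn(state[key], value) if key in state else value
        in_window.append(batch)
        # Back out the batch that just slid out of the window.
        if len(in_window) > window_len:
            expired = in_window.popleft()
            for key, value in expired.items():
                if key not in state:
                    continue  # already dropped as empty
                state[key] = inv_reduce_fn(state[key], value)
                if is_empty(state[key]):
                    # Without this step, expired keys would stay in the state.
                    del state[key]
        yield dict(state)


# Hypothetical toy input: four batches of per-key counts, window of two batches.
toy_batches = [{"a": 1}, {"a": 2, "b": 5}, {"b": 1}, {"c": 4}]
results = list(windowed_reduce(toy_batches, 2, add, sub, lambda v: v == 0))
```

Each result depends on the state left behind by every earlier batch, which is the kind of dependency that checkpointing is meant to cut. In this sketch, keys are dropped only because `is_empty` is checked explicitly after the inverse step.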
I don't know if IO wait fully explains the increasing processing time. Below is a full minute of 'sar' output, sampled every 2 seconds. The iowait values don't seem too bad to me except for a brief small spike in the middle. Also, how does one explain the continued degradation of processing time even beyond the largest window interval?

Thanks
Nikunj

$ sar 2 30
Linux 3.13.0-48-generic (ip-X-X-X-X)    07/16/2015    _x86_64_    (16 CPU)

01:11:14 AM     CPU     %user     %nice   %system   %iowait    %steal     %idle
01:11:16 AM     all     66.70      0.03     11.10      0.03      0.00     22.13
01:11:18 AM     all     79.99      0.00     10.81      0.00      0.03      9.17
01:11:20 AM     all     62.66      0.03     10.84      0.00      0.03     26.43
01:11:22 AM     all     68.59      0.00     10.83      0.00      0.10     20.49
01:11:24 AM     all     77.74      0.00     10.83      0.00      0.03     11.40
01:11:26 AM     all     65.01      0.00     10.83      0.03      0.07     24.06
01:11:28 AM     all     66.33      0.00     10.87      0.00      0.03     22.77
01:11:30 AM     all     72.38      0.03     12.48      0.54      0.06     14.50
01:11:32 AM     all     68.35      0.00     12.98      7.46      0.03     11.18
01:11:34 AM     all     75.94      0.03     14.02      3.27      0.03      6.71
01:11:36 AM     all     68.60      0.00     14.34      2.76      0.03     14.27
01:11:38 AM     all     61.99      0.03     13.34      0.07      0.07     24.51
01:11:40 AM     all     52.21      0.03     12.79      1.04      0.13     33.79
01:11:42 AM     all     37.91      0.03     12.43      0.03      0.10     49.48
01:11:44 AM     all     26.92      0.00     11.68      0.14      0.10     61.16
01:11:46 AM     all     24.86      0.00     12.07      0.00      0.10     62.97
01:11:48 AM     all     25.49      0.00     11.96      0.00      0.10     62.45
01:11:50 AM     all     21.16      0.00     12.35      0.03      0.14     66.32
01:11:52 AM     all     29.89      0.00     12.06      0.03      0.10     57.91
01:11:54 AM     all     26.77      0.00     11.81      0.00      0.10     61.32
01:11:56 AM     all     25.34      0.03     11.81      0.03      0.14     62.65
01:11:58 AM     all     22.42      0.00     12.60      0.00      0.10     64.88
01:12:00 AM     all     30.27      0.00     12.10      0.03      0.14     57.46
01:12:02 AM     all     80.59      0.00     10.58      0.35      0.03      8.44
01:12:04 AM     all     49.05      0.00     12.89      0.66      0.07     37.32
01:12:06 AM     all     31.21      0.03     13.54      6.54      0.17     48.50
01:12:08 AM     all     31.66      0.00     13.26      6.30      0.10     48.67
01:12:10 AM     all     36.19      0.00     12.87      3.04      0.14     47.76
01:12:12 AM     all     82.63      0.03     10.60      0.00      0.03      6.70
01:12:14 AM     all     77.72      0.00     10.66      0.00      0.03     11.59
Average:        all     52.22      0.01     12.04      1.08      0.08     34.58

On Thu,
Jul 16, 2015 at 12:44 AM, Akhil Das <ak...@sigmoidanalytics.com> wrote:
> What is your data volume? Do you have checkpointing/WAL enabled? In
> that case, make sure you are using SSD disks, as this behavior is mainly
> due to IO wait.
>
> Thanks
> Best Regards
>
> On Thu, Jul 16, 2015 at 8:43 AM, N B <nb.nos...@gmail.com> wrote:
>
>> Hello,
>>
>> We have a Spark streaming application, and the problem we are
>> encountering is that the batch processing time keeps increasing and
>> eventually causes the application to start lagging. I am hoping that
>> someone here can point me to an underlying cause of why this might happen.
>>
>> The batch interval is 1 minute as of now, and the app does some maps,
>> filters, joins and reduceByKeyAndWindow operations. All the reduces are
>> invertible functions, so we do provide the inverse-reduce functions in
>> all of them. The largest window size we have is 1 hour right now. When the
>> app is started, we see that the batch processing time is between 20 and 30
>> seconds. It keeps creeping up slowly, and by the time it hits the 1 hour
>> mark, it is somewhere around 35-40 seconds. Somewhat expected and still
>> not bad!
>>
>> I would expect that, since the largest window we have is 1 hour long, the
>> application should stabilize around the 1 hour mark and start processing
>> subsequent batches within that 35-40 second zone. However, that is not what
>> is happening. The processing time keeps increasing, and in a few hours it
>> exceeds the 1 minute mark and the app starts lagging. Eventually the lag
>> builds up to several minutes, at which point we have to restart the
>> system.
>>
>> Any pointers on why this could be happening and what we can do to
>> troubleshoot further?
>>
>> Thanks
>> Nikunj
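
For anyone who wants to check the iowait figures rather than eyeball them, the following is a rough Python sketch that pulls the %iowait column out of sar output in the layout shown earlier in this message. The helper names `parse_iowait` and `summarize`, and the 1.0% threshold, are assumptions made for illustration; the sample rows are copied from the sar output above.

```python
SAR_SAMPLE = """\
01:11:14 AM     CPU     %user     %nice   %system   %iowait    %steal     %idle
01:11:28 AM     all     66.33      0.00     10.87      0.00      0.03     22.77
01:11:30 AM     all     72.38      0.03     12.48      0.54      0.06     14.50
01:11:32 AM     all     68.35      0.00     12.98      7.46      0.03     11.18
01:11:34 AM     all     75.94      0.03     14.02      3.27      0.03      6.71
01:11:36 AM     all     68.60      0.00     14.34      2.76      0.03     14.27
01:11:38 AM     all     61.99      0.03     13.34      0.07      0.07     24.51
"""


def parse_iowait(sar_text):
    """Return (timestamp, %iowait) pairs for the all-CPU data rows."""
    samples = []
    for line in sar_text.splitlines():
        fields = line.split()
        # Data rows: HH:MM:SS AM|PM all %user %nice %system %iowait %steal %idle
        # The header row has "CPU" in the third field, and the "Average:" row
        # has only eight fields, so both are skipped here.
        if len(fields) != 9 or fields[2] != "all":
            continue
        samples.append((f"{fields[0]} {fields[1]}", float(fields[6])))
    return samples


def summarize(samples, threshold=1.0):
    """Summarize iowait; `threshold` (percent) is an arbitrary cut-off."""
    values = [value for _, value in samples]
    return {
        "samples": len(values),
        "max": max(values),
        "mean": sum(values) / len(values),
        "above_threshold": [ts for ts, value in samples if value > threshold],
    }


summary = summarize(parse_iowait(SAR_SAMPLE))
```

Running the same two functions over the full 30-sample capture would list every interval in which iowait exceeded the chosen threshold, which makes it easier to compare the spikes against batch boundaries.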