Thanks Akhil. Using reduceByKeyAndWindow requires checkpointing, so yes, we
do have it enabled. We have not enabled the Write Ahead Log, though, because
we have no need for recovery: we do not restore the process state on
restart.
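
To make the discussion concrete, here is a minimal pure-Python sketch (not Spark code) of what a windowed reduce with an inverse function does conceptually: on each slide, the new batch is folded in with the reduce function and the batch that left the window is backed out with the inverse function. The names `slide_window`, `reduce_fn` and `inv_fn` are hypothetical and exist only for this illustration; it is not how Spark implements the operation internally.

```python
from collections import deque


def slide_window(state, window, new_batch, window_len, reduce_fn, inv_fn):
    """Advance a keyed sliding window by one batch, incrementally."""
    # Fold the newest batch into the running per-key totals.
    for key, value in new_batch.items():
        state[key] = reduce_fn(state[key], value) if key in state else value
    window.append(new_batch)
    # Back out the batch that has just fallen out of the window.
    if len(window) > window_len:
        old_batch = window.popleft()
        for key, value in old_batch.items():
            state[key] = inv_fn(state[key], value)
    return state


# Hypothetical per-batch counts; the window covers the last 2 batches.
state, window = {}, deque()
batches = [{"a": 1}, {"a": 2, "b": 5}, {"a": 3}, {"c": 4}]
for batch in batches:
    slide_window(state, window, batch, 2, lambda x, y: x + y, lambda x, y: x - y)

print(state)
```

In this sketch the key "b" stays in `state` with a value of 0 after all of its data has left the window, because nothing ever removes it. Spark's reduceByKeyAndWindow accepts an optional filter function alongside the inverse function for dropping such entries; whether lingering keys have anything to do with the slowdown described in this thread is not established here.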

I don't know if IO Wait fully explains the increasing processing time.
Below is a full minute of 'sar' output every 2 seconds. The iowait values
don't seem too bad to me except for a brief small spike in the middle.
Also, how does one explain the continued degradation of processing time
even beyond the largest window interval?

Thanks
Nikunj


$ sar 2 30
Linux 3.13.0-48-generic (ip-X-X-X-X)        07/16/2015      _x86_64_        (16 CPU)

01:11:14 AM     CPU     %user     %nice   %system   %iowait    %steal     %idle
01:11:16 AM     all     66.70      0.03     11.10      0.03      0.00     22.13
01:11:18 AM     all     79.99      0.00     10.81      0.00      0.03      9.17
01:11:20 AM     all     62.66      0.03     10.84      0.00      0.03     26.43
01:11:22 AM     all     68.59      0.00     10.83      0.00      0.10     20.49
01:11:24 AM     all     77.74      0.00     10.83      0.00      0.03     11.40
01:11:26 AM     all     65.01      0.00     10.83      0.03      0.07     24.06
01:11:28 AM     all     66.33      0.00     10.87      0.00      0.03     22.77
01:11:30 AM     all     72.38      0.03     12.48      0.54      0.06     14.50
01:11:32 AM     all     68.35      0.00     12.98      7.46      0.03     11.18
01:11:34 AM     all     75.94      0.03     14.02      3.27      0.03      6.71
01:11:36 AM     all     68.60      0.00     14.34      2.76      0.03     14.27
01:11:38 AM     all     61.99      0.03     13.34      0.07      0.07     24.51
01:11:40 AM     all     52.21      0.03     12.79      1.04      0.13     33.79
01:11:42 AM     all     37.91      0.03     12.43      0.03      0.10     49.48
01:11:44 AM     all     26.92      0.00     11.68      0.14      0.10     61.16
01:11:46 AM     all     24.86      0.00     12.07      0.00      0.10     62.97
01:11:48 AM     all     25.49      0.00     11.96      0.00      0.10     62.45
01:11:50 AM     all     21.16      0.00     12.35      0.03      0.14     66.32
01:11:52 AM     all     29.89      0.00     12.06      0.03      0.10     57.91
01:11:54 AM     all     26.77      0.00     11.81      0.00      0.10     61.32
01:11:56 AM     all     25.34      0.03     11.81      0.03      0.14     62.65
01:11:58 AM     all     22.42      0.00     12.60      0.00      0.10     64.88
01:12:00 AM     all     30.27      0.00     12.10      0.03      0.14     57.46
01:12:02 AM     all     80.59      0.00     10.58      0.35      0.03      8.44
01:12:04 AM     all     49.05      0.00     12.89      0.66      0.07     37.32
01:12:06 AM     all     31.21      0.03     13.54      6.54      0.17     48.50
01:12:08 AM     all     31.66      0.00     13.26      6.30      0.10     48.67
01:12:10 AM     all     36.19      0.00     12.87      3.04      0.14     47.76
01:12:12 AM     all     82.63      0.03     10.60      0.00      0.03      6.70
01:12:14 AM     all     77.72      0.00     10.66      0.00      0.03     11.59
Average:        all     52.22      0.01     12.04      1.08      0.08     34.58
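
For anyone who wants to check the iowait column rather than eyeball it, the following is a small sketch that pulls the peak and mean %iowait out of sar rows. `SAR_SAMPLE` holds five rows copied from the output above with each record on a single line; `parse_sar` is a hypothetical helper written for this illustration, and it assumes the nine-field 12-hour-clock layout shown here.

```python
# Five rows copied from the sar output above, one record per line.
SAR_SAMPLE = """\
01:11:30 AM     all     72.38      0.03     12.48      0.54      0.06     14.50
01:11:32 AM     all     68.35      0.00     12.98      7.46      0.03     11.18
01:11:34 AM     all     75.94      0.03     14.02      3.27      0.03      6.71
01:11:36 AM     all     68.60      0.00     14.34      2.76      0.03     14.27
01:11:38 AM     all     61.99      0.03     13.34      0.07      0.07     24.51
"""


def parse_sar(text):
    """Return (timestamp, %iowait) pairs for the per-interval 'all' CPU rows."""
    rows = []
    for line in text.splitlines():
        fields = line.split()
        # Expected fields: time, AM/PM, CPU, %user, %nice, %system,
        # %iowait, %steal, %idle. Header and "Average:" lines do not match.
        if len(fields) == 9 and fields[2] == "all":
            rows.append((fields[0] + " " + fields[1], float(fields[6])))
    return rows


rows = parse_sar(SAR_SAMPLE)
peak_time, peak_iowait = max(rows, key=lambda r: r[1])
mean_iowait = sum(value for _, value in rows) / len(rows)
print(f"peak %iowait {peak_iowait:.2f} at {peak_time}, mean {mean_iowait:.2f}")
```

Feeding the full capture through the same function would show every interval where %iowait rises, not just the one in this sample.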


On Thu, Jul 16, 2015 at 12:44 AM, Akhil Das <ak...@sigmoidanalytics.com>
wrote:

> What is your data volume? Do you have checkpointing/WAL enabled? If so,
> make sure you are using SSD disks, as this behavior is mainly due to IO
> wait.
>
> Thanks
> Best Regards
>
> On Thu, Jul 16, 2015 at 8:43 AM, N B <nb.nos...@gmail.com> wrote:
>
>> Hello,
>>
>> We have a Spark streaming application and the problem that we are
>> encountering is that the batch processing time keeps on increasing and
>> eventually causes the application to start lagging. I am hoping that
>> someone here can point me to any underlying cause of why this might happen.
>>
>> The batch interval is 1 minute as of now and the app does some maps,
>> filters, joins and reduceByKeyAndWindow operations. All the reduces are
>> invertible functions and so we do provide the inverse-reduce functions in
>> all those. The largest window size we have is 1 hour right now. When the
>> app is started, we see that the batch processing time is between 20 and 30
>> seconds. It keeps creeping up slowly and by the time it hits the 1 hour
>> mark, it is somewhere around 35-40 seconds. Somewhat expected and still not
>> bad!
>>
>> I would expect that since the largest window we have is 1 hour long, the
>> application should stabilize around the 1 hour mark and start processing
>> subsequent batches within that 35-40 second zone. However, that is not what
>> is happening. The processing time keeps increasing, and within a few
>> hours it exceeds the 1 minute mark and the app starts lagging. Eventually
>> the lag builds up to several minutes, at which point we have to restart
>> the system.
>>
>> Any pointers on why this could be happening and what we can do to
>> troubleshoot further?
>>
>> Thanks
>> Nikunj
>>
>>
>
