Hi Gordon, Thanks for the fast reply! You're right about the expired iterator exception occurring just before each spike. I can't see any signs of long GC on the task managers... CPU has been <15% the whole time when the spikes were taking place and I can't see anything unusual in the task manager logs.
But actually I just noticed that the Flink UI showed no successful checkpoints during the time of the problem even though my checkpoint interval is 15 minutes. So I guess this is probably some kind of Flink problem rather than a problem with the Kinesis consumer. Unfortunately I can't find anything useful in the logs so not sure what happened! Josh On Thu, Nov 3, 2016 at 12:44 PM, Tzu-Li (Gordon) Tai <tzuli...@apache.org> wrote: > Hi Josh, > > That warning message was added as part of FLINK-4514. It pops out whenever > a shard iterator was used after 5 minutes it was returned from Kinesis. > The only time spent between after a shard iterator was returned and before > it was used to fetch the next batch of records, is on deserializing and > emitting of the records of the last fetched batch. > So unless processing of the last fetched batch took over 5 minutes, this > normally shouldn’t happen. > > Have you noticed any sign of long, constant full GC for your Flink task > managers? From your description and check in code, the only possible guess > I can come up with now is that > the source tasks completely seized to be running for a period of time, and > when it came back, the shard iterator was unexpectedly found to be expired. > According to the graph you attached, > when the iterator was refreshed and tasks successfully fetched a few more > batches, the source tasks again halted, and so on. > So you should see that same warning message right before every small peak > within the graph. > > Best Regards, > Gordon > > > On November 3, 2016 at 7:46:42 PM, Josh (jof...@gmail.com) wrote: > > Hey Gordon, > > I've been using Flink 1.2-SNAPSHOT for the past week (with FLINK-4514) > with no problems, but yesterday the Kinesis consumer started behaving > strangely... My Kinesis data stream is fairly constant at around 1.5MB/sec, > however the Flink Kinesis consumer started to stop consuming for periods of > time (see the spikes in graph attached which shows data consumed by the > Flink Kinesis consumer) > > Looking in the task manager logs, there are no exceptions however there is > this log message which I believe is related to the problem: > > 2016-11-03 09:27:53,782 WARN org.apache.flink.streaming. > connectors.kinesis.internals.ShardConsumer - Encountered an unexpected > expired iterator AAAAAAAAAAF8OJyh+X3yBnbtzUgIfXv+phS7PKppd7q09/ > tduXG3lOhCmBGPUOlZul24tzSSM6KjHsQ+AbZY8MThKcSvGax/ > EoOIYoTELYbZmuwY4hgeqUsndxLIM0HL55iejroBV8YFmUmGwHsW8qkHsz// > Ci4cxcLrGArHex3n+4E+aoZ9AtgTPEZOBjXY49g+VGsDb0bQN5FJUoUVEfnbupk96ore for > shard KinesisStreamShard{streamName='stream001', shard='{ShardId: > shardId-000000000000,HashKeyRange: {StartingHashKey: 0,EndingHashKey: > 85070511730234615865841151857942042863},SequenceNumberRange: > {StartingSequenceNumber: 495665429169236488921642479266 > 79091159472198219567464450,}}'}; refreshing the iterator ... > > Having restarted the job from my last savepoint, it's consuming the stream > fine again with no problems. > > Do you have any idea what might be causing this, or anything I should do > to investigate further? > > Cheers, > > Josh > > On Wed, Oct 5, 2016 at 4:55 AM, Tzu-Li (Gordon) Tai <tzuli...@apache.org> > wrote: > >> Hi Steffen, >> >> Turns out that FLINK-4514 just missed Flink 1.1.2 and wasn’t included in >> the release (I’ll update the resolve version in JIRA to 1.1.3, thanks for >> noticing this!). >> The Flink community is going to release 1.1.3 asap, which will include >> the fix. >> If you don’t want to wait for the release and want to try the fix now, >> you can also build on the current “release-1.1” branch, which already has >> FLINK-4514 merged. >> Sorry for the inconvenience. Let me know if you bump into any other >> problems afterwards. >> >> Best Regards, >> Gordon >> >> >> On October 5, 2016 at 2:56:21 AM, Steffen Hausmann ( >> stef...@hausmann-family.de) wrote: >> >> Hi there, >> >> I'm running a Flink 1.1.2 job on EMR and Yarn that is reading events >> from a Kinesis stream. However, after a while (the exact duration varies >> and is in the order of minutes) the Kinesis source doesn't emit any >> further events and hence Flink doesn't produce any further output. >> Eventually, an ExpiredIteratorException occurs in one of the task, >> causing the entire job to fail: >> >> > com.amazonaws.services.kinesis.model.ExpiredIteratorException: >> Iterator expired. The iterator was created at time Mon Oct 03 18:40:30 UTC >> 2016 while right now it is Mon Oct 03 18:45:33 UTC 2016 which is further in >> the future than the tolerated delay of 300000 milliseconds. (Service: >> AmazonKinesis; Status Code: 400; Error Code: ExpiredIteratorException; >> Request ID: dace9532-9031-54bc-8aa2-3cbfb136d590) >> >> This seems to be related to FLINK-4514, which is marked as resovled for >> Flink 1.1.2. In contrast to what is describe in the ticket, the job I'm >> running isn't suspended but hangs just a few minutes after the job has >> been started. >> >> I've attached a log file showing the described behavior. >> >> Any idea what may be wrong? >> >> Thanks, >> Steffen >> >> >