I searched through the logs but did not find a BadCheckpointException. I extrapolated the 6G from the size of the log-* files being read. Below is a snippet of the logs. As you can see from the timestamp of the log lines, the channel ch3 was started at time 3:32 and channel startup activity completed only at 4:40
07 May 2013 03:32:40,598 INFO [lifecycleSupervisor-1-1] (org.apache.flume.channel.file.LogFileV3$MetaDataWriter.markCheckpoint:85) - Updating log-297.meta currentPosition = 1455187648, logWriteOrderID = 1385582495370 07 May 2013 03:32:40,599 INFO [lifecycleSupervisor-1-1] (org.apache.flume.channel.file.FileChannel.start:309) - Queue Size after replay: 58497343 [channel=ch3] 07 May 2013 03:32:40,628 INFO [lifecycleSupervisor-1-1] (org.apache.flume.instrumentation.MonitoredCounterGroup.register:89) - Monitoried counter group for type: CHANNEL, name: ch3, registered successfully. 07 May 2013 03:32:40,628 INFO [lifecycleSupervisor-1-1] (org.apache.flume.instrumentation.MonitoredCounterGroup.start:73) - Component type: CHANNEL, name: ch3 started 07 May 2013 03:32:40,887 INFO [lifecycleSupervisor-1-3] (org.apache.flume.channel.file.LogFile$SequentialReader.next:491) - Encountered EOF at 1623195640 in /flume2/data/log-861 07 May 2013 03:32:45,972 INFO [lifecycleSupervisor-1-3] (org.apache.flume.channel.file.LogFile$SequentialReader.next:491) - Encountered EOF at 1623195632 in /flume2/data/log-862 07 May 2013 03:32:47,777 INFO [lifecycleSupervisor-1-3] (org.apache.flume.channel.file.LogFile$SequentialReader.next:491) - Encountered EOF at 1623195646 in /flume2/data/log-863 07 May 2013 03:32:49,349 INFO [lifecycleSupervisor-1-3] (org.apache.flume.channel.file.LogFile$SequentialReader.next:491) - Encountered EOF at 1623195620 in /flume2/data/log-864 07 May 2013 04:17:37,858 INFO [lifecycleSupervisor-1-3] (org.apache.flume.channel.file.LogFile$SequentialReader.next:491) - Encountered EOF at 1442343132 in /flume2/data/log-865 07 May 2013 04:17:37,859 INFO [lifecycleSupervisor-1-3] (org.apache.flume.channel.file.ReplayHandler.replayLog:320) - read: 1829658, put: 0, take: 45223, rollback: 1, commit: 11, skip: 1784423, eventCount:44000 07 May 2013 04:17:37,860 ERROR [lifecycleSupervisor-1-3] (org.apache.flume.channel.file.ReplayHandler.replayLog:354) - Pending takes 1473 exist after the end of replay. Duplicate messages will exist in destination. 07 May 2013 04:17:37,860 INFO [lifecycleSupervisor-1-3] (org.apache.flume.channel.file.Log.replay:399) - Rolling /flume2/data 07 May 2013 04:17:37,860 INFO [lifecycleSupervisor-1-3] (org.apache.flume.channel.file.Log.roll:811) - Roll start /flume2/data 07 May 2013 04:17:37,861 INFO [lifecycleSupervisor-1-3] (org.apache.flume.channel.file.LogFile$Writer.<init>:171) - Opened /flume2/data/log-866 07 May 2013 04:17:37,862 INFO [lifecycleSupervisor-1-3] (org.apache.flume.channel.file.Log.roll:826) - Roll end 07 May 2013 04:17:37,863 INFO [lifecycleSupervisor-1-3] (org.apache.flume.channel.file.EventQueueBackingStoreFile.checkpoint:109) - Start checkpoint for /flume2/checkpoint/checkpoint, elements to sync = 42527 07 May 2013 04:17:37,865 INFO [lifecycleSupervisor-1-3] (org.apache.flume.channel.file.EventQueueBackingStoreFile.checkpoint:117) - Updating checkpoint metadata: logWriteOrderID: 1385582495371, queueSize: 46200121, queueHead: 65890565 07 May 2013 04:17:37,906 INFO [lifecycleSupervisor-1-3] (org.apache.flume.channel.file.LogFileV3$MetaDataWriter.markCheckpoint:85) - Updating log-866.meta currentPosition = 0, logWriteOrderID = 1385582495371 07 May 2013 04:17:37,907 INFO [lifecycleSupervisor-1-3] (org.apache.flume.channel.file.Log.writeCheckpoint:886) - Updated checkpoint for file: /flume2/data/log-866 position: 0 logWriteOrderID: 1385582495371 07 May 2013 04:17:37,908 INFO [lifecycleSupervisor-1-3] (org.apache.flume.channel.file.LogFile$RandomReader.close:356) - Closing RandomReader /flume2/data/log-855 07 May 2013 04:17:37,913 INFO [lifecycleSupervisor-1-3] (org.apache.flume.channel.file.LogFileV3$MetaDataWriter.markCheckpoint:85) - Updating log-855.meta currentPosition = 1592727635, logWriteOrderID = 1385582495371 07 May 2013 04:17:37,915 INFO [lifecycleSupervisor-1-3] (org.apache.flume.channel.file.LogFile$RandomReader.close:356) - Closing RandomReader /flume2/data/log-856 07 May 2013 04:17:37,921 INFO [lifecycleSupervisor-1-3] (org.apache.flume.channel.file.LogFileV3$MetaDataWriter.markCheckpoint:85) - Updating log-856.meta currentPosition = 1601406592, logWriteOrderID = 1385582495371 07 May 2013 04:17:37,922 INFO [lifecycleSupervisor-1-3] (org.apache.flume.channel.file.LogFile$RandomReader.close:356) - Closing RandomReader /flume2/data/log-857 07 May 2013 04:17:37,928 INFO [lifecycleSupervisor-1-3] (org.apache.flume.channel.file.LogFileV3$MetaDataWriter.markCheckpoint:85) - Updating log-857.meta currentPosition = 1608758564, logWriteOrderID = 1385582495371 07 May 2013 04:17:37,929 INFO [lifecycleSupervisor-1-3] (org.apache.flume.channel.file.LogFile$RandomReader.close:356) - Closing RandomReader /flume2/data/log-858 07 May 2013 04:17:37,935 INFO [lifecycleSupervisor-1-3] (org.apache.flume.channel.file.LogFileV3$MetaDataWriter.markCheckpoint:85) - Updating log-858.meta currentPosition = 1558785242, logWriteOrderID = 1385582495371 07 May 2013 04:17:37,936 INFO [lifecycleSupervisor-1-3] (org.apache.flume.channel.file.LogFile$RandomReader.close:356) - Closing RandomReader /flume2/data/log-859 07 May 2013 04:17:37,941 INFO [lifecycleSupervisor-1-3] (org.apache.flume.channel.file.LogFileV3$MetaDataWriter.markCheckpoint:85) - Updating log-859.meta currentPosition = 586542373, logWriteOrderID = 1385582495371 07 May 2013 04:17:37,943 INFO [lifecycleSupervisor-1-3] (org.apache.flume.channel.file.LogFile$RandomReader.close:356) - Closing RandomReader /flume2/data/log-861 07 May 2013 04:17:37,949 INFO [lifecycleSupervisor-1-3] (org.apache.flume.channel.file.LogFileV3$MetaDataWriter.markCheckpoint:85) - Updating log-861.meta currentPosition = 1556011472, logWriteOrderID = 1385582495371 07 May 2013 04:17:37,951 INFO [lifecycleSupervisor-1-3] (org.apache.flume.channel.file.LogFile$RandomReader.close:356) - Closing RandomReader /flume2/data/log-862 07 May 2013 04:17:37,956 INFO [lifecycleSupervisor-1-3] (org.apache.flume.channel.file.LogFileV3$MetaDataWriter.markCheckpoint:85) - Updating log-862.meta currentPosition = 1563024320, logWriteOrderID = 1385582495371 07 May 2013 04:17:37,958 INFO [lifecycleSupervisor-1-3] (org.apache.flume.channel.file.LogFile$RandomReader.close:356) - Closing RandomReader /flume2/data/log-863 07 May 2013 04:17:37,963 INFO [lifecycleSupervisor-1-3] (org.apache.flume.channel.file.LogFileV3$MetaDataWriter.markCheckpoint:85) - Updating log-863.meta currentPosition = 1612213007, logWriteOrderID = 1385582495371 07 May 2013 04:17:37,965 INFO [lifecycleSupervisor-1-3] (org.apache.flume.channel.file.LogFile$RandomReader.close:356) - Closing RandomReader /flume2/data/log-864 07 May 2013 04:17:37,971 INFO [lifecycleSupervisor-1-3] (org.apache.flume.channel.file.LogFileV3$MetaDataWriter.markCheckpoint:85) - Updating log-864.meta currentPosition = 1613618738, logWriteOrderID = 1385582495371 07 May 2013 04:17:37,972 INFO [lifecycleSupervisor-1-3] (org.apache.flume.channel.file.LogFile$RandomReader.close:356) - Closing RandomReader /flume2/data/log-865 07 May 2013 04:17:37,977 INFO [lifecycleSupervisor-1-3] (org.apache.flume.channel.file.LogFileV3$MetaDataWriter.markCheckpoint:85) - Updating log-865.meta currentPosition = 1440488562, logWriteOrderID = 1385582495371 07 May 2013 04:17:37,979 INFO [lifecycleSupervisor-1-3] (org.apache.flume.channel.file.FileChannel.start:309) - Queue Size after replay: 46200121 [channel=ch2] 07 May 2013 04:17:37,979 INFO [lifecycleSupervisor-1-3] (org.apache.flume.instrumentation.MonitoredCounterGroup.register:89) - Monitoried counter group for type: CHANNEL, name: ch2, registered successfully. 07 May 2013 04:17:37,980 INFO [lifecycleSupervisor-1-3] (org.apache.flume.instrumentation.MonitoredCounterGroup.start:73) - Component type: CHANNEL, name: ch2 started 07 May 2013 04:40:37,198 INFO [lifecycleSupervisor-1-2] (org.apache.flume.channel.file.LogFile$SequentialReader.next:491) - Encountered EOF at 1522974283 in /flume1/data/log-861 07 May 2013 04:40:37,199 INFO [lifecycleSupervisor-1-2] (org.apache.flume.channel.file.ReplayHandler.replayLog:320) - read: 1249010, put: 0, take: 121197, rollback: 0, commit: 31, skip: 1127782, eventCount:124000 07 May 2013 04:40:37,199 ERROR [lifecycleSupervisor-1-2] (org.apache.flume.channel.file.ReplayHandler.replayLog:354) - Pending takes 3504 exist after the end of replay. Duplicate messages will exist in destination. 07 May 2013 04:40:37,199 INFO [lifecycleSupervisor-1-2] (org.apache.flume.channel.file.Log.replay:399) - Rolling /flume1/data 07 May 2013 04:40:37,200 INFO [lifecycleSupervisor-1-2] (org.apache.flume.channel.file.Log.roll:811) - Roll start /flume1/data 07 May 2013 04:40:37,200 INFO [lifecycleSupervisor-1-2] (org.apache.flume.channel.file.LogFile$Writer.<init>:171) - Opened /flume1/data/log-862 07 May 2013 04:40:37,202 INFO [lifecycleSupervisor-1-2] (org.apache.flume.channel.file.Log.roll:826) - Roll end 07 May 2013 04:40:37,203 INFO [lifecycleSupervisor-1-2] (org.apache.flume.channel.file.EventQueueBackingStoreFile.checkpoint:109) - Start checkpoint for /flume1/checkpoint/checkpoint, elements to sync = 120496 07 May 2013 04:40:37,204 INFO [lifecycleSupervisor-1-2] (org.apache.flume.channel.file.EventQueueBackingStoreFile.checkpoint:117) - Updating checkpoint metadata: logWriteOrderID: 1385582495372, queueSize: 29456099, queueHead: 22675305 07 May 2013 04:40:37,238 INFO [lifecycleSupervisor-1-2] (org.apache.flume.channel.file.LogFileV3$MetaDataWriter.markCheckpoint:85) - Updating log-862.meta currentPosition = 0, logWriteOrderID = 1385582495372 07 May 2013 04:40:37,240 INFO [lifecycleSupervisor-1-2] (org.apache.flume.channel.file.Log.writeCheckpoint:886) - Updated checkpoint for file: /flume1/data/log-862 position: 0 logWriteOrderID: 1385582495372 07 May 2013 04:40:37,240 INFO [lifecycleSupervisor-1-2] (org.apache.flume.channel.file.LogFile$RandomReader.close:356) - Closing RandomReader /flume1/data/log-855 07 May 2013 04:40:37,246 INFO [lifecycleSupervisor-1-2] (org.apache.flume.channel.file.LogFileV3$MetaDataWriter.markCheckpoint:85) - Updating log-855.meta currentPosition = 1547225990, logWriteOrderID = 1385582495372 07 May 2013 04:40:37,247 INFO [lifecycleSupervisor-1-2] (org.apache.flume.channel.file.LogFile$RandomReader.close:356) - Closing RandomReader /flume1/data/log-856 07 May 2013 04:40:37,253 INFO [lifecycleSupervisor-1-2] (org.apache.flume.channel.file.LogFileV3$MetaDataWriter.markCheckpoint:85) - Updating log-856.meta currentPosition = 892596719, logWriteOrderID = 1385582495372 07 May 2013 04:40:37,255 INFO [lifecycleSupervisor-1-2] (org.apache.flume.channel.file.LogFile$RandomReader.close:356) - Closing RandomReader /flume1/data/log-857 07 May 2013 04:40:37,260 INFO [lifecycleSupervisor-1-2] (org.apache.flume.channel.file.LogFileV3$MetaDataWriter.markCheckpoint:85) - Updating log-857.meta currentPosition = 1559591451, logWriteOrderID = 1385582495372 07 May 2013 04:40:37,262 INFO [lifecycleSupervisor-1-2] (org.apache.flume.channel.file.LogFile$RandomReader.close:356) - Closing RandomReader /flume1/data/log-858 07 May 2013 04:40:37,267 INFO [lifecycleSupervisor-1-2] (org.apache.flume.channel.file.LogFileV3$MetaDataWriter.markCheckpoint:85) - Updating log-858.meta currentPosition = 1550429668, logWriteOrderID = 1385582495372 07 May 2013 04:40:37,269 INFO [lifecycleSupervisor-1-2] (org.apache.flume.channel.file.LogFile$RandomReader.close:356) - Closing RandomReader /flume1/data/log-859 07 May 2013 04:40:37,274 INFO [lifecycleSupervisor-1-2] (org.apache.flume.channel.file.LogFileV3$MetaDataWriter.markCheckpoint:85) - Updating log-859.meta currentPosition = 1616923189, logWriteOrderID = 1385582495372 ________________________________ From: Hari Shreedharan <[email protected]> To: "[email protected]" <[email protected]>; Rahul Ravindran <[email protected]> Sent: Monday, May 6, 2013 9:57 PM Subject: Re: Usage of use-fast-replay for FileChannel Did you have an issue with the checkpoint that the entire 6G of data was replayed (look for BadCheckpointException in the logs to figure out if the channel was stopped in middle of a checkpoint)? With the next version of Flume, you should be able to recover even if the channel stopped while the checkpoint was being written. Fast Replay will try to maintain order, but it will require a massive amount of memory to run if you have a large number of events. Also, fast replay will only run if the checkpoint is corrupt/does not exist. Hari On Mon, May 6, 2013 at 9:40 PM, Rahul Ravindran <[email protected]> wrote: Hi, > For FileChannel, how much of a performance improvement in replay times were >observed with use-fast-replay? We currently have use-fast-replay set to false >and were replaying about 6 G of data. We noticed replay times of about one >hour. I looked at the code and it appears that fast-replay does not guarantee >the same ordering of events during replay. Is this accurate? Are there any >other downsides of using fast-replay? Any stability concerns? >Thanks, >~Rahul.
