My team uses Flume 1.4.0 packaged with CDH5.0.2 via an embedded agent to
write to a file channel.  From a previous thread started by my colleague,
"FileChannel Replays consistently take a long time" and the associated issue,
https://issues.apache.org/jira/browse/FLUME-2450, it was suggested to use a
backup checkpoint directory to avoid lengthy replays.  When I enabled the
backup checkpoint directory, iotop showed near 100% IO usage from my
application, which hosts the embedded agent.  This level of IO persists for
about 30 seconds, rendering the application unusable during that time.
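
For reference, here is a minimal sketch of how the backup checkpoint might
be enabled for an embedded agent.  It only builds the property map and does
not start Flume.  The "channel." prefix and the useDualCheckpoints and
backupCheckpointDir keys follow my reading of the Flume file channel and
embedded agent documentation; the class name, method name, and directory
layout are hypothetical placeholders, not our actual configuration.

```java
import java.util.HashMap;
import java.util.Map;

public class BackupCheckpointConfig {

    // Builds file channel properties for an embedded agent.
    // baseDir and the sub-directory names are placeholders (assumptions).
    static Map<String, String> fileChannelProperties(String baseDir) {
        Map<String, String> props = new HashMap<>();
        props.put("channel.type", "file");
        props.put("channel.checkpointDir", baseDir + "/checkpoint");
        props.put("channel.dataDirs", baseDir + "/data");
        // Turn on the backup checkpoint and point it at its own directory,
        // which must differ from checkpointDir and the data directories.
        props.put("channel.useDualCheckpoints", "true");
        props.put("channel.backupCheckpointDir", baseDir + "/checkpoint-backup");
        return props;
    }

    public static void main(String[] args) {
        // Print the resulting properties so they can be compared with a
        // real configuration.
        fileChannelProperties("/var/lib/flume")
            .forEach((key, value) -> System.out.println(key + " = " + value));
    }
}
```

Source and sink properties are omitted; in a real setup this map would be
merged with them before being passed to the embedded agent.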

For comparison, I also monitored with iotop while the backup checkpoint was
disabled.  In that case, IO activity lasts for at most several seconds.
That is, there is a qualitative difference when the backup checkpoint
directory is enabled.  I also tried deleting the existing checkpoint and
data directories to start with a clean slate.  The results of those
experiments are in line with the observations above.

Is this expected behavior when using a backup checkpoint directory?  Is
there any way to reduce the amount of IO?  I would appreciate any
feedback and insights, because the current behavior is untenable for a
production environment.

Thank you,
Michael
