This use case sounds like a perfect fit for the Spooling Directory source, which will be in the upcoming 1.3 release.
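
Roughly, the agent config would look something like this (the agent/component names, the spool directory path, and the downstream host/port are just placeholders -- double-check the property names against the 1.3 docs once they are out):

  agent.sources = spool
  agent.channels = mem
  agent.sinks = avro1

  # read completed log files dropped into this directory
  agent.sources.spool.type = spooldir
  agent.sources.spool.spoolDir = /var/log/legacy-spool
  agent.sources.spool.channels = mem

  agent.channels.mem.type = memory
  agent.channels.mem.capacity = 10000

  # forward to the Avro source on the downstream machine
  agent.sinks.avro1.type = avro
  agent.sinks.avro1.hostname = collector.example.com
  agent.sinks.avro1.port = 4141
  agent.sinks.avro1.channel = mem

The spooling directory source reads completed, immutable files that your legacy system (or a rotation script) drops into spoolDir, so it avoids tailing the live log file at all.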
Brock

On Tue, Nov 6, 2012 at 4:53 PM, Rahul Ravindran <[email protected]> wrote:
> We will update the checkpoint each time (we may tune this to be periodic),
> but the contents of the memory channel will be in the legacy logs which are
> currently being generated.
>
> Additionally, the sink for the memory channel will be an Avro source on
> another machine.
>
> Does that clear things up?
>
> ________________________________
> From: Brock Noland <[email protected]>
> To: [email protected]; Rahul Ravindran <[email protected]>
> Sent: Tuesday, November 6, 2012 1:44 PM
>
> Subject: Re: Guarantees of the memory channel for delivering to sink
>
> But in your architecture you are going to write the contents of the
> memory channel out? Or did I miss something?
>
> "The checkpoint will be updated each time we perform a successive
> insertion into the memory channel."
>
> On Tue, Nov 6, 2012 at 3:43 PM, Rahul Ravindran <[email protected]> wrote:
>> We have a legacy system which writes events to a file (existing log file).
>> This will continue. If I used a file channel, I would double the number of
>> IO operations (writes to the legacy log file, and writes to the WAL).
>>
>> ________________________________
>> From: Brock Noland <[email protected]>
>> To: [email protected]; Rahul Ravindran <[email protected]>
>> Sent: Tuesday, November 6, 2012 1:38 PM
>> Subject: Re: Guarantees of the memory channel for delivering to sink
>>
>> You're still going to be writing out all events, no? So how would the file
>> channel do more IO than that?
>>
>> On Tue, Nov 6, 2012 at 3:32 PM, Rahul Ravindran <[email protected]> wrote:
>>> Hi,
>>> I am very new to Flume and we are hoping to use it for our log
>>> aggregation into HDFS. I have a few questions below:
>>>
>>> FileChannel will double our disk IO, which will affect IO performance on
>>> certain performance-sensitive machines. Hence, I was hoping to write a
>>> custom Flume source which will use a memory channel and which will perform
>>> checkpointing. The checkpoint will be updated each time we perform a
>>> successive insertion into the memory channel. (I realize that this results
>>> in a risk of data loss, the maximum size of which is the capacity of the
>>> memory channel.)
>>>
>>> As long as there is capacity in the memory channel buffers, does the
>>> memory channel guarantee delivery to a sink (does it wait for
>>> acknowledgements, and retry failed packets)? This would mean that we need
>>> to ensure that we do not exceed the channel capacity.
>>>
>>> I am writing a custom source which will use the memory channel, and which
>>> will catch a ChannelException to identify any channel capacity issues (so,
>>> the buffer used in the memory channel is full because of lagging
>>> sinks/network issues, etc.). Is that a reasonable assumption to make?
>>>
>>> Thanks,
>>> ~Rahul.
>>
>>
>> --
>> Apache MRUnit - Unit testing MapReduce -
>> http://incubator.apache.org/mrunit/
>>
>
>
> --
> Apache MRUnit - Unit testing MapReduce - http://incubator.apache.org/mrunit/
>

--
Apache MRUnit - Unit testing MapReduce - http://incubator.apache.org/mrunit/
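
P.S. If you do end up going the custom source + memory channel route, catching ChannelException in a pollable source's process() loop is a reasonable way to detect a full channel. A bare-bones sketch, just to illustrate the shape (the class name and the readNextLine()/updateCheckpoint() helpers are placeholders for your own tailing and checkpoint logic, not a tested implementation):

import org.apache.flume.ChannelException;
import org.apache.flume.Context;
import org.apache.flume.Event;
import org.apache.flume.EventDeliveryException;
import org.apache.flume.PollableSource;
import org.apache.flume.conf.Configurable;
import org.apache.flume.event.EventBuilder;
import org.apache.flume.source.AbstractSource;

public class CheckpointingTailSource extends AbstractSource
    implements Configurable, PollableSource {

  @Override
  public void configure(Context context) {
    // read the legacy log path / checkpoint file location from the agent config
  }

  @Override
  public Status process() throws EventDeliveryException {
    byte[] line = readNextLine();      // placeholder for your tailing logic
    if (line == null) {
      return Status.BACKOFF;           // nothing new in the legacy log yet
    }
    Event event = EventBuilder.withBody(line);
    try {
      getChannelProcessor().processEvent(event);
      updateCheckpoint();              // placeholder: advance only after the put succeeds
      return Status.READY;
    } catch (ChannelException e) {
      // memory channel is full (lagging sink, network trouble, ...):
      // back off and retry the same record without moving the checkpoint
      return Status.BACKOFF;
    }
  }

  // Placeholder helpers -- substitute your own file-tailing and checkpoint code.
  private byte[] readNextLine() { return null; }
  private void updateCheckpoint() { }
}

Returning Status.BACKOFF on ChannelException makes the framework sleep briefly before calling process() again, so the source naturally throttles itself when the sink lags.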
