Bhaskar, I have created the following jira for this: https://issues.apache.org/jira/browse/FLUME-1829
-Jeff

On Fri, Jan 11, 2013 at 6:48 AM, Bhaskar V. Karambelkar <[email protected]> wrote:

> Thanks Jeff,
> Clear and detailed explanations. These deserve to be on the wiki, as these
> parameters have direct implications on the performance of flume nodes.
>
> thanks
> Bhaskar
>
>
> On Tue, Jan 8, 2013 at 9:40 PM, Jeff Lord <[email protected]> wrote:
>
>> Hi Bhaskar,
>>
>> 1) Batch Size
>> 1.a) When configured by client code using the flume-core-sdk, to send
>> events to flume avro source.
>> The flume client sdk has an appendBatch method. This will take a list of
>> events and send them to the source as a batch. The batch size is the
>> number of events passed to the source at one time.
>>
>> 1.b) When set as a parameter on HDFS sink (or other sinks which support
>> BatchSize parameter)
>> This is the number of events written to file before it is flushed to HDFS.
>>
>> 2)
>> 2.a) Channel Capacity
>> This is the maximum number of events the channel can hold.
>>
>> 2.b) Channel Transaction Capacity.
>> This is the max number of events stored in the channel per transaction.
>>
>> How will setting these parameters to different values, affect throughput,
>> latency in event flow?
>>
>> In general you will see better throughput by using memory channel as
>> opposed to using file channel, at the loss of durability.
>>
>> The channel capacity needs to be sized such that it is large enough to
>> hold as many events as will be added to it by upstream agents.
>> Ideal flow would see the sink draining events from the channel faster than
>> the source is adding them.
>>
>> The channel transaction capacity will need to be smaller than the channel
>> capacity.
>> e.g. If your Channel capacity is set to 10000 then Channel Transaction
>> Capacity should be set to something like 100.
>>
>> Specifically if we have clients with varying frequency of event
>> generation, i.e. some clients generating thousands of events/sec, while
>> others at a much slower rate, what effect will different values of these
>> params have on these clients ?
>>
>> Transaction Capacity is going to be what throttles or limits how many
>> events the source can put into the channel. This is going to vary
>> depending on how many tiers of agents/collectors you have set up.
>> In general though this should probably be equal to whatever you have the
>> batch size set to in your client.
>>
>> With regards to the hdfs batch size, the larger your batch size the
>> better performance will be. However, keep in mind that if a transaction
>> fails the entire transaction will be replayed, which could have the
>> implication of duplicate events downstream.
>>
>> -Jeff
>>
>>
>> On Tue, Jan 8, 2013 at 10:46 AM, Bhaskar V. Karambelkar <
>> [email protected]> wrote:
>>
>>> Can someone explain the importance of the following
>>> 1) Batch Size
>>> 1.a) When configured by client code using the flume-core-sdk, to send
>>> events to flume avro source.
>>> 1.b) When set as a parameter on HDFS sink (or other sinks which
>>> support BatchSize parameter)
>>> 2)
>>> 2.a) Channel Capacity
>>> 2.b) Channel Transaction Capacity.
>>>
>>>
>>> Under which conditions should these params be set to high values, and
>>> under which conditions should they be set to low values?
>>>
>>>
>>> How will setting these parameters to different values affect
>>> throughput and latency in event flow?
>>> Specifically if we have clients with varying frequency of event
>>> generation, i.e. some clients generating thousands of events/sec, while
>>> others at a much slower rate, what effect will different values of these
>>> params have on these clients ?
>>>
>>> thanks
>>> Bhaskar
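To make the relationship between batch size, channel capacity and transaction capacity discussed in this thread more concrete, here is a rough sketch in plain Java. It is a toy model only, not Flume's channel implementation: the class ToyChannel and its methods putBatch, takeBatch and size are hypothetical names made up for illustration, and the all-or-nothing put is a simplifying assumption. It only encodes the rules of thumb above: a batch has to fit within one transaction, and the channel never holds more events than its capacity.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Collections;
import java.util.Deque;
import java.util.List;

/** Toy model of a bounded channel. Hypothetical; NOT Flume's own code. */
class ToyChannel {
    private final int capacity;            // max events the channel can hold
    private final int transactionCapacity; // max events per put/take transaction
    private final Deque<String> queue = new ArrayDeque<>();

    ToyChannel(int capacity, int transactionCapacity) {
        // Rule of thumb from the thread: transaction capacity stays
        // below (here: does not exceed) the channel capacity.
        if (transactionCapacity > capacity) {
            throw new IllegalArgumentException(
                "transactionCapacity must not exceed capacity");
        }
        this.capacity = capacity;
        this.transactionCapacity = transactionCapacity;
    }

    /** Source side: the whole batch is committed, or none of it. */
    boolean putBatch(List<String> batch) {
        if (batch.size() > transactionCapacity) {
            return false; // batch does not fit into a single transaction
        }
        if (queue.size() + batch.size() > capacity) {
            return false; // channel is full; the sink is not draining fast enough
        }
        queue.addAll(batch);
        return true;
    }

    /** Sink side: take up to batchSize events, capped by transactionCapacity. */
    List<String> takeBatch(int batchSize) {
        int n = Math.min(Math.min(batchSize, transactionCapacity), queue.size());
        List<String> out = new ArrayList<>(n);
        for (int i = 0; i < n; i++) {
            out.add(queue.poll());
        }
        return out;
    }

    int size() {
        return queue.size();
    }

    public static void main(String[] args) {
        // Values taken from the example in the thread: 10000 and 100.
        ToyChannel channel = new ToyChannel(10000, 100);
        boolean fits = channel.putBatch(Collections.nCopies(100, "event"));
        boolean tooBig = channel.putBatch(Collections.nCopies(101, "event"));
        System.out.println("batch of 100 accepted: " + fits);
        System.out.println("batch of 101 accepted: " + tooBig);
        System.out.println("events drained by sink: " + channel.takeBatch(100).size());
    }
}
```

In this model a fast client is throttled in two places: its batches cannot be larger than the transaction capacity, and once the channel is full every put is refused until the sink has drained some events. A slow client with small batches is never limited by the transaction capacity, but it still shares the same channel capacity.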
