This isn't exactly an answer to your question, but try increasing the batch sizes on both your Avro sink and your HDFS sink. This will increase the throughput of your file channel significantly.
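
As a rough sketch of what that could look like in the agent properties files (the agent, sink and channel names, the host, port and HDFS path, and the value 1000 are all made up for illustration; the property names are the ones I understand the Avro and HDFS sinks to use, so check them against the user guide for your version):

```properties
# Agent 1: Avro sink (all component names here are hypothetical)
agent1.sinks.avroSink.type = avro
agent1.sinks.avroSink.channel = memChannel
agent1.sinks.avroSink.hostname = agent2-host
agent1.sinks.avroSink.port = 4545
# Number of events sent per batch (illustrative value)
agent1.sinks.avroSink.batch-size = 1000

# Agent 2: HDFS sink (all component names here are hypothetical)
agent2.sinks.hdfsSink.type = hdfs
agent2.sinks.hdfsSink.channel = fileChannel
agent2.sinks.hdfsSink.hdfs.path = hdfs://namenode/flume/events
# Number of events written per batch before flushing to HDFS (illustrative value)
agent2.sinks.hdfsSink.hdfs.batchSize = 1000
```

Keep each batch size at or below the transaction capacity of the channel the sink reads from (10,000 in your setup), since a sink takes its whole batch inside a single channel transaction.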

If the memory channel on agent 1 fills up, agent 1 will be limited to whatever throughput agent 2 can sustain, and the logs will indicate this. If that happens, the only thing to do is to increase throughput at agent 2, either by increasing transaction sizes or by using separate disks for the checkpoint and data directories. You may also want to try a snapshot of 1.3, which adds Ganglia integration and is very useful for watching throughput, channel capacities, and take/put attempt/success counts.
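
For the separate-disks suggestion, a minimal sketch of the file channel section on agent 2 might look like this (the agent and channel names and both paths are hypothetical; the point is only that the checkpoint directory and the data directory sit on different physical disks):

```properties
# Agent 2: file channel (names and paths are hypothetical)
agent2.channels.fileChannel.type = file
# Same capacities as the memory channel in the original setup
agent2.channels.fileChannel.capacity = 1000000
agent2.channels.fileChannel.transactionCapacity = 10000
# Checkpoint on one disk ...
agent2.channels.fileChannel.checkpointDir = /disk1/flume/checkpoint
# ... and event data on another, so the two do not compete for the same spindle
agent2.channels.fileChannel.dataDirs = /disk2/flume/data
```

As far as I know, dataDirs also accepts a comma-separated list, so the data files can be spread over several disks if you have them.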

On 08/01/2012 12:19 AM, Raymond Ng wrote:
good day all, sorry for the long email
I'd like to know how to gauge where the performance bottleneck is when different types of channels are used.
I have a demo environment which looks a bit like this:
Setup 1
Agent 1 ( Exec Source, Memory Channel and Avro Sink with 1 GB JVM) streaming data to
Agent 2 ( Avro Source, Memory Channel and HDFS Sink with 1.5 GB JVM)
Both memory channels have a capacity of 1,000,000 and a transaction capacity of 10,000. I managed to achieve ~8000 records/sec in the Exec Source of Agent 1, and I'm not too concerned with how long it takes for Agent 2 to insert into HDFS.

and when I changed Agent 2 to use FileChannel
Setup 2
Agent 1 ( Exec Source, Memory Channel and Avro Sink with 2 GB JVM) streaming data to
Agent 2 ( Avro Source, File Channel and HDFS Sink with 1.0 GB JVM)
The File Channel has the same capacity and transaction capacity as the memory channel stated above. I've doubled the JVM for Agent 1, knowing that it needs a bigger buffer to handle the same throughput from the Exec source, as Agent 2 will be slower buffering records to disk before writing to HDFS.

Now I achieve ~4000 records per second in the Exec source of Agent 1. However, I wasn't expecting the Exec source's throughput to slow down, as it's getting the same input from tailing the same file.

Is the decrease in source throughput on Agent 1 caused by Agent 2 taking much longer to commit events into the file channel, which has a knock-on effect on how quickly Agent 1 can release records from its memory channel? I thought the performance of the source was determined by how quickly it can commit events to the channel, so the fact that the sink can't consume events as quickly as the source puts them in should not affect the speed at which the source commits to the channel.

I say this because I have come across a ChannelException suggesting that the sinks are not keeping up with the sources, which suggests to me that the sink will not slow down the source in terms of channel commits.
hope it makes sense
thanks for any advice
--
Rgds
Ray
