Hi,

On Wed, Oct 10, 2012 at 11:22 AM, Jagadish Bihani <[email protected]> wrote:
> Hi Brock
>
> I will surely look into 'fsync lies'.
>
> But as per my experiments I think "file channel" is causing the issue.
> Because on those 2 machines (one with higher throughput and other with
> lower) I did following experiment:
>
> cat Source -memory channel - file sink.
>
> Now with this setup I got same throughput on both the machines. (around 3
> MB/sec)
> Now as I have used "File sink" it should also do "fsync" at some point of
> time.
> 'File Sink' and 'File Channel' both do disk writes.
> So if there is differences in disk behaviour then even in the 'File Sink' it
> should be visible.
>
> Am I missing something here?
File sink does not call fsync.

>
> Regards,
> Jagadish
>
> On 10/10/2012 09:35 PM, Brock Noland wrote:
>>
>> OK your disk that is giving you 40KB/second is telling you the truth
>> and the faster disk is lying to you. Look up "fsync lies" to see what
>> I am referring to.
>>
>> A spinning disk can do 100 fsync operations per second (this is done
>> at the end of every batch). That is how I estimated your event size,
>> 40KB/second is doing 40KB / 100 = 409 bytes.
>>
>> Once again, if you want increased performance, you should increase the
>> batch size.
>>
>> Brock
>>
>> On Wed, Oct 10, 2012 at 11:00 AM, Jagadish Bihani
>> <[email protected]> wrote:
>>>
>>> Hi
>>>
>>> Yes. It is around 480 - 500 bytes.
>>>
>>>
>>> On 10/10/2012 09:24 PM, Brock Noland wrote:
>>>>
>>>> How big are your events? Average about 400 bytes?
>>>>
>>>> Brock
>>>>
>>>> On Wed, Oct 10, 2012 at 5:11 AM, Jagadish Bihani
>>>> <[email protected]> wrote:
>>>>>
>>>>> Hi
>>>>>
>>>>> Thanks for the inputs Brock. After doing several experiments
>>>>> eventually problem boiled down to disks.
>>>>>
>>>>> -- But I had used the same configuration (so all software components
>>>>> are same in all 3 machines) on all 3 machines.
>>>>> -- In User guide it is written that if multiple file channel instances
>>>>> are active on the same agent then different disks are preferable.
>>>>> But in my case only one file channel is active per agent.
>>>>> -- Only one pattern I observed that on the machines where I got better
>>>>> performance have multiple disks.
>>>>> But I don't understand how that will help if I have only 1 active file
>>>>> channel.
>>>>> -- What is the impact of the type of disk/disk device driver on
>>>>> performance?
>>>>> I mean I don't understand
>>>>> with 1 disk I am getting 40 KB/sec and with other 2 MB/sec.
>>>>>
>>>>> Could you please elaborate on File channel and disks correlation.
>>>>>
>>>>> Regards,
>>>>> Jagadish
>>>>>
>>>>>
>>>>> On 10/09/2012 08:01 PM, Brock Noland wrote:
>>>>>
>>>>> Hi,
>>>>>
>>>>> Using file channel, in terms of performance, the number and type of
>>>>> disks is going to be much more predictive of performance than CPU or
>>>>> RAM. Note that consumer level drives/controllers will give you much
>>>>> "better" performance because they lie to you about when your data is
>>>>> actually written to the drive. If you search for "fsync lies" you'll
>>>>> find more information on this.
>>>>>
>>>>> You probably want to increase the batch size to get better performance.
>>>>>
>>>>> Brock
>>>>>
>>>>> On Tue, Oct 9, 2012 at 2:46 AM, Jagadish Bihani
>>>>> <[email protected]> wrote:
>>>>>
>>>>> Hi
>>>>>
>>>>> My flume setup is:
>>>>>
>>>>> Source Agent : cat source - File Channel - Avro Sink
>>>>> Dest Agent : avro source - File Channel - HDFS Sink.
>>>>>
>>>>> There is only 1 source agent and 1 destination agent.
>>>>>
>>>>> I measure throughput as amount of data written to HDFS per second.
>>>>> ( I have rolling interval 30 sec; so If 60 MB file is generated in 30
>>>>> sec the throughput is : -- 2 MB/sec ).
>>>>>
>>>>> I have run source agent on various machines with different hardware
>>>>> configurations :
>>>>> (In all cases I run flume agent with JAVA OPTIONS as
>>>>> "-DJAVA_OPTS="-Xms500m -Xmx1g -Dcom.sun.management.jmxremote
>>>>> -XX:MaxDirectMemorySize=2g")
>>>>>
>>>>> JDK is 32 bit.
>>>>>
>>>>> Experiment 1:
>>>>> =====
>>>>> RAM : 16 GB
>>>>> Processor: Intel Xeon E5620 @ 2.40 GHz (16 cores).
>>>>> 64 bit Processor with 64 bit Kernel.
>>>>> Throughput: 2 MB/sec
>>>>>
>>>>> Experiment 2:
>>>>> ======
>>>>> RAM : 4 GB
>>>>> Processor: Intel Xeon E5504 @ 2.00GHz (4 cores). 32 bit Processor
>>>>> 64 bit Processor with 32 bit Kernel.
>>>>> Throughput : 30 KB/sec
>>>>>
>>>>> Experiment 3:
>>>>> ======
>>>>> RAM : 8 GB
>>>>> Processor: Intel Xeon E5520 @ 2.27 GHz (16 cores). 32 bit Processor
>>>>> 64 bit Processor with 32 bit Kernel.
>>>>> Throughput : 80 KB/sec
>>>>>
>>>>> -- So as can be seen there is huge difference in the throughput with
>>>>> same configuration but different hardware.
>>>>> -- In the first case where throughput is more RES is around 160 MB in
>>>>> other cases it is in the range of 40 MB - 50 MB.
>>>>>
>>>>> Can anybody please give insights that why there is this huge difference
>>>>> in the throughput?
>>>>> What is the correlation between RAM and filechannel/HDFS sink
>>>>> performance and also with 32-bit/64 bit kernel?
>>>>>
>>>>> Regards,
>>>>> Jagadish
>>>>>
>>>>
>>
>
--
Apache MRUnit - Unit testing MapReduce - http://incubator.apache.org/mrunit/
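
For reference, the fsync arithmetic used in this thread (about 100 fsyncs per second on a spinning disk, one fsync at the end of every batch) can be sketched in a few lines of Python. This is only a rough back-of-the-envelope model, not anything taken from Flume itself: the function names are hypothetical, and it assumes the disk honours fsync, that fsync latency dominates everything else, and that a batch size of 1 is in effect when the 40 KB/sec figure was measured.

```python
KB = 1024  # bytes per kilobyte, matching the 40KB / 100 = 409 estimate


def implied_event_bytes(throughput_bytes_per_sec, fsyncs_per_sec, batch_size=1):
    """Bytes written per event if every batch ends with exactly one fsync."""
    return throughput_bytes_per_sec / (fsyncs_per_sec * batch_size)


def estimated_throughput(fsyncs_per_sec, batch_size, event_bytes):
    """Upper bound on bytes/sec when the fsync rate is the only bottleneck."""
    return fsyncs_per_sec * batch_size * event_bytes


if __name__ == "__main__":
    # 40 KB/sec at an assumed 100 fsyncs/sec and batch size 1
    print("implied event size:", implied_event_bytes(40 * KB, 100), "bytes")

    # Effect of a larger batch size for ~500 byte events (assumed values)
    for batch in (1, 10, 100):
        rate = estimated_throughput(100, batch, 500)
        print("batch size", batch, "->", rate / KB, "KB/sec")
```

Under these assumptions the estimate scales linearly with batch size, which is why increasing the batch size is the first thing to try. Real numbers will flatten out once something other than fsync becomes the bottleneck.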
