1. The HDD was a single regular 7200 rpm disk. The most significant aspects of the configuration are noted in the measurements, namely: number of dataDirs, batch size, and number of sinks or sources.

2. With the Exec source it is a problem. With transactional sources like Avro or SpoolingDir, you won't have the data loss issue. In the case of Avro, the upstream application (Flume) which is sending the events will get an error and will have to resend. In the case of SpoolingDir, again, the data is still on disk.
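For illustration, here is a minimal sketch of replacing an Exec source with a SpoolingDir source so that unconsumed data survives an agent crash (the agent name, channel name, and spool directory are illustrative assumptions, not taken from the measurements below):

    # SpoolingDir source: files remain on disk until fully ingested,
    # so an agent crash before the channel commit does not lose data
    agent1.sources = src1
    agent1.channels = ch1
    agent1.sources.src1.type = spooldir
    agent1.sources.src1.spoolDir = /var/log/flume/spool
    agent1.sources.src1.batchSize = 1000
    agent1.sources.src1.channels = ch1

Completed files in the spool directory are renamed (or deleted, depending on deletePolicy) only after their events are committed to the channel, which is what makes replay after a crash possible.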
-roshan

From: 김동경 <[email protected]>
Reply-To: "[email protected]" <[email protected]>
Date: Friday, April 10, 2015 11:18 PM
To: "[email protected]" <[email protected]>
Subject: Re: Flume perf measurements

Thank you for sharing, Roshan. I have a few questions.

1. What kind of HDDs, and how many, did you use for the file channel benchmark? I also benchmarked the file channel and only got 2K~3K TPS. Did you use a separate HDD for each data dir? Could you share the most influential parts of the configuration for high performance?

2. Regarding the 100K Exec source batch size: if the agent falls down before all the events are committed to the channel, aren't those messages lost? Do you have any measures to handle that message loss?

Thanks in advance.

Regards,
Dongkyoung.

2015-04-11 3:44 GMT+09:00 Roshan Naik <[email protected]>:

Will have this info on the wiki soon, but thought of sending it out right away to the users list also, since there seem to be some threads on performance there.

Sample Flume v1.4 measurements for reference. These were taken with a single agent and 500-byte events.

Cluster config: 20-node Hadoop cluster (1 name node and 19 data nodes).
Machine config: 24 cores, Xeon E5-2640 v2 @ 2.00GHz, 164 GB RAM.

1. File channel with HDFS sink (Sequence File):

Source: 4 x Exec source, 100k batchSize
HDFS sink batch size: 500,000
Channel: File, number of data dirs varied from 1 to 10 (see table)

Events/sec:

Sinks |  1 dir  | 2 dirs  | 4 dirs  | 6 dirs  | 8 dirs  | 10 dirs
  1   | 14.3 k  |         |         |         |         |
  2   | 21.9 k  |         |         |         |         |
  4   | 35.8 k  |         |         |         |         |
  8   | 24.8 k  | 43.8 k  | 72.5 k  |  77 k   | 78.6 k  | 76.6 k
 10   |         |         |         |         |  58 k   |
 12   |         |         |         |         | 49.3 k  |  49 k

I was looking for the sweet spot in perf, so I did not take measurements for all data points on the grid, only for the ones that made sense. For example, when perf dropped after adding more sinks, I did not take further measurements for those rows.

2. HDFS sink:

Channel: Memory

Events/sec:

# of HDFS sinks | Snappy, batch 1.2M | Snappy, batch 1.4M | Sequence File, batch 1.2M
       1        |       34.3 k       |        33 k        |          33 k
       2        |        71 k        |        75 k        |          69 k
       4        |       141 k        |       145 k        |         141 k
       8        |       271 k        |       273 k        |         251 k
      12        |       382 k        |       380 k        |         370 k
      16        |       478 k        |       538 k        |         486 k

Some simple observations:
* Increasing the number of dataDirs helps file channel perf, even on single-disk systems.
* Increasing the number of sinks helps.
* Max throughput observed was about 538k events/sec for the HDFS sink, which works out to roughly 270 MB/s (538k x 500-byte events).
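For reference, here is a minimal sketch of the channel and sink side of a configuration along the lines of measurement 1 above (the agent and component names, directories, and HDFS path are illustrative assumptions; only the file type, sink batch size, and use of multiple dataDirs mirror the setup described above):

    agent1.channels = ch1
    agent1.sinks = sink1 sink2

    # File channel spreading its log across several dataDirs;
    # per the observations above this helps even on a single disk
    agent1.channels.ch1.type = file
    agent1.channels.ch1.checkpointDir = /flume/checkpoint
    agent1.channels.ch1.dataDirs = /flume/data1,/flume/data2,/flume/data3,/flume/data4

    # Multiple HDFS sinks draining the same channel; the tables above
    # show throughput rising with the sink count
    agent1.sinks.sink1.type = hdfs
    agent1.sinks.sink1.channel = ch1
    agent1.sinks.sink1.hdfs.path = hdfs://namenode/flume/events/sink1
    agent1.sinks.sink1.hdfs.fileType = SequenceFile
    agent1.sinks.sink1.hdfs.batchSize = 500000

    agent1.sinks.sink2.type = hdfs
    agent1.sinks.sink2.channel = ch1
    agent1.sinks.sink2.hdfs.path = hdfs://namenode/flume/events/sink2
    agent1.sinks.sink2.hdfs.fileType = SequenceFile
    agent1.sinks.sink2.hdfs.batchSize = 500000

Giving each sink its own hdfs.path (or a distinct hdfs.filePrefix) avoids file name collisions when several sinks write concurrently.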
