We co-locate our Flume agents on our data nodes in order to have access to many
'spindles' for the file channels. We have a small cluster (10 nodes), so these
are also our task tracker nodes, and we haven't seen any major performance issues.

For reference, our typical event ingestion rate is between 2k and 5k events per
second under 'normal' production load. I recently had to backfill a couple of
weeks of web logs, though, and took the opportunity to examine maximum
throughput and how heavy MR load affected things. During that 'test' we
stabilized at about 12k events per second written to HDFS; that was a single
agent using two HDFS sinks taking from one file channel. As far as I could
tell, my bottleneck was the Avro hop between my collector and writer agents,
not the HDFS sinks. When all MR slots were occupied by large batch jobs for
extended periods, event throughput degraded to about 3,500 events/sec.
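
The sink side of that setup is just two HDFS sinks attached to the same
channel, so they drain it in parallel. A rough sketch, not our actual config
(agent/component names and paths are made up; the file channel itself is
sketched further down):

    # two HDFS sinks draining the same file channel in parallel
    agent.sinks = hdfs1 hdfs2
    agent.sinks.hdfs1.type = hdfs
    agent.sinks.hdfs1.channel = fc1
    agent.sinks.hdfs1.hdfs.path = hdfs://namenode/flume/weblogs/%Y%m%d
    agent.sinks.hdfs1.hdfs.fileType = DataStream
    agent.sinks.hdfs1.hdfs.batchSize = 1000
    # use the agent's clock for %Y%m%d if events lack a timestamp header
    agent.sinks.hdfs1.hdfs.useLocalTimeStamp = true
    # distinct file prefixes so the two sinks never collide on file names
    agent.sinks.hdfs1.hdfs.filePrefix = weblogs-1
    agent.sinks.hdfs2.type = hdfs
    agent.sinks.hdfs2.channel = fc1
    agent.sinks.hdfs2.hdfs.path = hdfs://namenode/flume/weblogs/%Y%m%d
    agent.sinks.hdfs2.hdfs.fileType = DataStream
    agent.sinks.hdfs2.hdfs.batchSize = 1000
    agent.sinks.hdfs2.hdfs.useLocalTimeStamp = true
    agent.sinks.hdfs2.hdfs.filePrefix = weblogs-2

The Avro hop is tuned separately on the upstream agent; the Avro sink's
batch-size setting is the main knob there.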

I know these are just anecdotal data points, but I wanted to share my experience
with Flume agents located on the actual data/task nodes themselves. I have done
very little optimization aside from separating the file channel's checkpoint and
data directories onto separate drives.
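
In case it's useful, that separation is only a couple of lines of channel
configuration. A minimal sketch, with hypothetical mount points standing in
for our drives:

    agent.channels = fc1
    agent.channels.fc1.type = file
    # checkpoint directory on its own drive
    agent.channels.fc1.checkpointDir = /disk1/flume/checkpoint
    # write-ahead logs spread across separate spindles
    agent.channels.fc1.dataDirs = /disk2/flume/data,/disk3/flume/data

Each entry in dataDirs should sit on its own physical disk; otherwise you gain
nothing from listing more than one.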

-Paul Chavez

From: Devin Suiter RDX [mailto:[email protected]]
Sent: Tuesday, December 17, 2013 8:30 AM
To: [email protected]
Subject: File Channel Best Practice

Hi,

There has been a lot of discussion about file channel speed today, and I have a
dilemma I was hoping to get some feedback on while the topic is hot.

Regarding this:
"Hi,

1) You are only using a single disk for file channel and it looks like a single 
disk for both checkpoint and data directories therefore throughput is going to 
be extremely slow."

How do you solve, in a practical sense, the file channel's requirement for a
range of disks for the best read/write speed while still having network
visibility to both the data sources and the Hadoop cluster?

It seems like, for a production file channel implementation, the best solution
is to give Flume a dedicated server somewhere near the edge with a properly
mounted and partitioned JBOD pile. But that adds to implementation cost.

The alternative seems to be to run Flume on a physical Cloudera Manager (SCM)
server that has some extra disks, or to run Flume agents alongside the datanode
processes on worker nodes. Neither seems like a good idea, especially
piggybacking on worker nodes, where a file channel feeding HDFS will compound
the disk contention...

I know the namenode should definitely not be involved.

I suppose you could virtualize a few servers on a properly networked host with
a fast SAN/NAS connection and get by OK, but the shared storage will merge your
parallel I/O at some point...

Any ideas on the subject?

Devin Suiter
Jr. Data Solutions Software Engineer
100 Sandusky Street | 2nd Floor | Pittsburgh, PA 15212
Google Voice: 412-256-8556 | www.rdx.com
