Thanks Paul, that's good to know. My cluster is sort of a combination of test and production, so we don't tinker with the real cluster config often, my dev pseudo-cluster on a VM doesn't really lend itself to file-channel testing, and my source is only producing about 1.1 events per minute anyway, so this has been tough for me to examine closely.
I appreciate the time you took to share.

*Devin Suiter*
Jr. Data Solutions Software Engineer
100 Sandusky Street | 2nd Floor | Pittsburgh, PA 15212
Google Voice: 412-256-8556 | www.rdx.com


On Tue, Dec 17, 2013 at 12:43 PM, Paul Chavez <[email protected]> wrote:

> We co-locate our flume agents on our data nodes in order to have access to
> many ‘spindles’ for the file channels. We have a small cluster (10 nodes)
> so these are also our task tracker nodes and we haven’t seen any huge
> performance issues.
>
> For reference, our typical event ingestion rate is between 2k and 5k
> events per second under ‘normal’ production load. I recently had to
> backfill a couple of weeks of web logs though, and took the opportunity to
> examine max throughput rates and how heavy MR load affected things. During
> that ‘test’ we stabilized at about 12K events per second written to HDFS;
> that was a single agent using 2 HDFS sinks taking from one file channel. As
> far as I could tell my bottleneck was in the Avro hop between my collector
> and writer agents, not in the HDFS sinks. When we had all MR slots used by
> large batch jobs for extended amounts of time, the event throughput degraded
> to about 3500 events/sec.
>
> I know these are just anecdotal data points, but wanted to share my
> experience with flume agents located on the actual data/task nodes
> themselves. I have done very little optimization aside from separating the
> file channel data/log directories onto separate drives.
>
> -Paul Chavez
>
> *From:* Devin Suiter RDX [mailto:[email protected]]
> *Sent:* Tuesday, December 17, 2013 8:30 AM
> *To:* [email protected]
> *Subject:* File Channel Best Practice
>
> Hi,
>
> There has been a lot of discussion about file channel speed today, and I
> have had a dilemma I was hoping for some feedback on, since the topic is
> hot.
>
> Regarding this:
>
> "Hi,
>
> 1) You are only using a single disk for file channel and it looks like a
> single disk for both checkpoint and data directories, therefore throughput
> is going to be extremely slow."
>
> How do you solve, in a practical sense, the requirement for the file channel
> to have a range of disks for best R/W speed, yet still have network visibility
> to the source data sources and the Hadoop cluster at the same time?
>
> It seems like for a production file channel implementation, the best
> solution is to give Flume a dedicated server somewhere near the edge with a
> JBOD pile properly mounted and partitioned. But that adds to implementation
> cost.
>
> The alternative seems to be to run Flume on a physical Cloudera Manager
> SCM server that has some extra disks, or run Flume agents concurrent with
> datanode processes on worker nodes, but those don't seem good to do,
> especially piggybacking on worker nodes, and file channel > HDFS will
> compound the issue...
>
> I know the namenode should definitely not be involved.
>
> I suppose you could virtualize a few servers on a properly networked host
> and a fast SAN/NAS connection and get by OK, but that will merge your
> parallelization at some point...
>
> Any ideas on the subject?
>
> *Devin Suiter*
> Jr. Data Solutions Software Engineer
> 100 Sandusky Street | 2nd Floor | Pittsburgh, PA 15212
> Google Voice: 412-256-8556 | www.rdx.com
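For anyone else who finds this thread later, here is a rough sketch of the sort of agent layout Paul describes: one file channel with its checkpoint and data directories on separate physical drives, feeding two HDFS sinks. The agent name, directory paths, port, and HDFS path below are placeholders of my own, not anything from his actual config:

# Hypothetical agent 'agent1' -- all names and paths are placeholders
agent1.sources  = avroSrc
agent1.channels = fileCh
agent1.sinks    = hdfsSink1 hdfsSink2

# Avro source receiving events from the upstream (collector) tier
agent1.sources.avroSrc.type = avro
agent1.sources.avroSrc.bind = 0.0.0.0
agent1.sources.avroSrc.port = 4141
agent1.sources.avroSrc.channels = fileCh

# File channel, with checkpoint and data dirs spread over separate drives
agent1.channels.fileCh.type = file
agent1.channels.fileCh.checkpointDir = /disk1/flume/checkpoint
agent1.channels.fileCh.dataDirs = /disk2/flume/data,/disk3/flume/data

# Two HDFS sinks draining the same channel in parallel; different
# filePrefix values keep their output files from colliding
agent1.sinks.hdfsSink1.type = hdfs
agent1.sinks.hdfsSink1.channel = fileCh
agent1.sinks.hdfsSink1.hdfs.path = hdfs://namenode/flume/events/%Y-%m-%d
agent1.sinks.hdfsSink1.hdfs.filePrefix = sink1
agent1.sinks.hdfsSink1.hdfs.fileType = DataStream
agent1.sinks.hdfsSink1.hdfs.batchSize = 1000
agent1.sinks.hdfsSink1.hdfs.useLocalTimeStamp = true

agent1.sinks.hdfsSink2.type = hdfs
agent1.sinks.hdfsSink2.channel = fileCh
agent1.sinks.hdfsSink2.hdfs.path = hdfs://namenode/flume/events/%Y-%m-%d
agent1.sinks.hdfsSink2.hdfs.filePrefix = sink2
agent1.sinks.hdfsSink2.hdfs.fileType = DataStream
agent1.sinks.hdfsSink2.hdfs.batchSize = 1000
agent1.sinks.hdfsSink2.hdfs.useLocalTimeStamp = true

An agent like that would be launched with something along the lines of:

flume-ng agent --conf conf --conf-file flume.conf --name agent1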
