Hi Steve,
Thanks for the rundown, lots of detail is better than not enough!
Response inline
On 09/06/2012 01:37 AM, Steve Johnson wrote:
Hi all, I wanted to share some of my test findings/concerns, etc..
First off, I apologize for this being so verbose, but I feel I need to
give a little bit of background on our setup and needs to show the
big picture. Please ignore this if you're not interested, you've been
warned...
But if you are, great, because I do have some valid questions to follow
and I'm really looking forward to any constructive comments.
Prior to a few weeks back, I had zero experience with Flume. I'd been
familiar with its existence for some time (about a year) but nothing
more than that.
My company generates about 8 billion log records per day, spread
across 5 datacenters, with about 200 servers in each location. So
about 1.6 billion per day in each cage. We're growing and shooting to
increase that to about 30 billion per day based on holiday traffic
growth and our company's growth. These log records are currently
hourly-rotated logback (slf4j) generated logs from our java
applications, containing tab-delimited ascii data of various widths.
There are probably 25 different log types we collect, but generally all
the same format; some average record lengths of 50-60 bytes, while
some others average 1k in width.
Right now, we collect them using a custom-built java scheduling
application. We have a machine dedicated to this at each DC. This
box fires off some hourly jobs (within minutes after log rotations)
that pull all the logs from the 200+ servers (some servers generate
up to 10 different log types per hour), uncompressed. We used to pull
directly to our central location, and would initiate compression on
the servers themselves, but this generated CPU/IO spikes every hour
that were causing performance issues. So we put a remote machine in
each node to handle local collection. They pull all the log files
locally first, then compress, then move into a queue. This happens
across all 5 DCs in parallel. We have another set of schedulers in
our central location that each collect from those remote nodes and
pull them locally; then we do some ETL work and load the raw log data
into our Greenplum warehouse for nightly aggregations and analysis.
This is obviously becoming very cumbersome to maintain, as we have,
right now, 10 different schedulers running over 6 locations. Also, to
guarantee we've fetched every log file, and also to guarantee we
haven't double-loaded any raw data (this data has only a logrec that's
maintained globally to guarantee uniqueness, so removing dupes is a
nightmare and we like to avoid that), we have to track every file
pickup, for each hop (currently tracked in a postgresql db), and then
use that for validation and also to make sure we don't pull a rotated
log again (logs stay archived on their original servers for 7 days).
A couple years back, when we had 1 or even 2 DCs with only about 30
servers in each, this wasn't so bad. But as you can imagine, we're
looking at over 80k files generated per day to track and manage. When
things run smoothly, it's great; when we have issues, it's a pain to
dig into.
So what are the requirements I'm looking at for a replacement of said
system?
1. Little or no custom configuration; it must be a drop-in-and-go
environment. Right now, as we add/remove servers, I have to edit a lot
of db records to tell the schedulers which servers have which types of
logs, and I also need to replicate it out, reload the configs and make
sure log sources have ssh keys installed, etc.
2. Must be able to compress data going between flume agents in remote
DCs and the flume agents in our central location. (Bandwidth for this
kind of data is not cheap; right now, by gzipping the hourly logs
locally before we transfer, we get between about a 6:1 and a 10:1
compression ratio depending on the log type.)
3. Must be able to handle the throughput.
4. Must be transactional and recoverable. Many of these logs
correlate directly to revenue; we must not lose data.
To have zero data loss you must use a reliable ingest system and a
lossless channel. The netcat source can't guarantee delivery (if a
channel can't fit the sent messages, for example, they will just get
dropped). The memory channel will lose data on a crash. There's a
sketch of what a reliable hop could look like just after your
requirements list below.
5. Scalable.
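To make that concrete, here's a minimal sketch of one reliable hop
(a remote DC agent forwarding to the central agent over Avro RPC, with
file channels on both ends so failed sends are retried rather than
dropped). The agent names, host, port and directory are invented for
illustration; treat it as a starting point, not a drop-in config:

# remote DC agent: collects events locally, forwards to central over Avro RPC
dc1agent.sources = localIn
dc1agent.channels = fileCh
dc1agent.sinks = toCentral
dc1agent.channels.fileCh.type = file
dc1agent.sinks.toCentral.type = avro
dc1agent.sinks.toCentral.channel = fileCh
dc1agent.sinks.toCentral.hostname = central.example.com
dc1agent.sinks.toCentral.port = 4545
dc1agent.sinks.toCentral.batch-size = 1000

# central agent: accepts the Avro RPC stream coming in from the remote DCs
central.sources = fromDCs
central.channels = fileCh
central.sinks = roll
central.sources.fromDCs.type = avro
central.sources.fromDCs.bind = 0.0.0.0
central.sources.fromDCs.port = 4545
central.sources.fromDCs.channels = fileCh
central.channels.fileCh.type = file
central.sinks.roll.type = FILE_ROLL
central.sinks.roll.channel = fileCh
central.sinks.roll.sink.directory = /data/flume/roll

On requirement 2: I don't believe the 1.2.0 Avro sink compresses on the
wire, so check what your version (or a newer one) offers before counting
on anything close to your current gzip ratios.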
From reading the docs I believe Flume is a possible solution.
Forward to today...
Flume Agent Config:
Version flume-ng 1.2.0:
JAVA_OPTS="-Xms1g -Xmx5g -Dcom.sun.management.jmxremote
-XX:MaxDirectMemorySize=5g"
This is running on a 16-core Intel Xeon 2.4GHz with 48GB RAM, and
local drives running raid5/xfs (not sure of the RPMs, but they're
pretty fast).
Testing/Flume Setup:
testagent.sources = netcatSource
testagent.channels = memChannel
testagent.sinks = fileSink
testagent.sources.netcatSource.type = netcat
testagent.sources.netcatSource.channels = memChannel
testagent.sinks.fileSink.type = FILE_ROLL
testagent.sinks.fileSink.channel = memChannel
testagent.channels.memChannel.type = memory
testagent.sources.netcatSource.bind = 0.0.0.0
testagent.sources.netcatSource.port = 6172
testagent.sources.netcatSource.max-line-length = 65536
testagent.channels.memChannel.capacity = 4294967296
This is huge. The memory channel uses a blocking queue of events, and I'm
pretty sure that it will misbehave beyond the limits of the integer
range. Seeing as it's signed, that would be around 2 billion (and with an
average event length of say 50 bytes, that would consume at least 100GB of
RAM)? FileChannel may or may not deal with huge capacities better. The
capacity designation is for event count, not bytes of data. Someone did
however recently post an issue about making physical size a setting in
some form; maybe you want to add your feedback to that (
https://issues.apache.org/jira/browse/FLUME-1535 )
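For comparison, here's a sketch of the same memory channel sized in
events rather than bytes (the numbers are made up; tune them to your heap
and average event size):

testagent.channels.memChannel.type = memory
# capacity is an event count, not bytes; 1M events of ~1KB is roughly 1GB of heap
testagent.channels.memChannel.capacity = 1000000
# max events moved per put/take transaction (keep >= your source/sink batch sizes)
testagent.channels.memChannel.transactionCapacity = 10000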
testagent.sinks.fileSink.sink.directory =
/opt/dotomi/flume-data/sink/file_roll
testagent.sinks.fileSink.sink.rollInterval = 0
I took one of our production servers' hourly logs, one of the largest
we produce (this one has about 1.2 million rows in it for that hour,
average record length about 700 bytes, some creeping up to 4k; keep
in mind, this is one server in one cage out of 50 total).
I wrote a Perl script that opens a socket to the netcat source port on
the agent, buffers about 10 log recs, and then sends them in
batches of 10. I originally tried line-by-line; this
was obviously super inefficient. I also attempted to buffer more (more
on that below) but started dropping too many events; I think it was
causing buffering issues on the agent. 10 seemed to be the magic
number for my setup. I also started with a FileChannel (recoverable)
and a simple file_roll sink so I could verify the output files.
As mentioned above, the netcat source is not a reliable ingest system as
it doesn't know about events that weren't committed. In the long run, for
lossless delivery, you will want to deliver data via either Avro or the
Scribe (Thrift) data format. However, if you just want to test
performance, try using ExecSource to tail the log files and fiddle with
the batching settings.
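If you go that route, something along these lines (the command and names
are placeholders, and exec/tail itself won't survive restarts, so this is
only for throughput testing):

testagent.sources = tailSource
testagent.sources.tailSource.type = exec
testagent.sources.tailSource.command = tail -F /var/log/app/events.log
testagent.sources.tailSource.channels = memChannel
# batchSize = lines put into the channel per transaction; check whether
# your Flume version exposes this setting before relying on it
testagent.sources.tailSource.batchSize = 100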
I ended up having some trouble getting the FileChannel started. I
eventually got it to start with some pretty narrow parameters, which
caused my flow to be very slow. When I tried to set higher numbers in
capacity, it would either not start up, or start but nothing would
flow. I moved to a memory channel just to get my proof of
concept moving, and to get a test of the framework first. Also, since
we're a java shop, we're not opposed to the idea of writing custom
sources/sinks/channels where need be, assuming the framework is sound.
In its current implementation FileChannel is lossless and thus causes a
disk flush (which generally writes two separate files) for every
commit (one for every batch of events). This is going to mean very slow
throughput if you have small batches. You can however improve this a lot
by putting the channel's data directories and checkpoint directory on
separate disks (not always feasible), or you can just make sure you're
batching more events at a time.
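Roughly like this (the directories are invented for the example;
double-check the property names against the file channel docs for your
version):

testagent.channels.fileCh.type = file
# keep the checkpoint and the data logs on separate spindles if you can
testagent.channels.fileCh.checkpointDir = /disk1/flume/checkpoint
testagent.channels.fileCh.dataDirs = /disk2/flume/data
testagent.channels.fileCh.capacity = 1000000
testagent.channels.fileCh.transactionCapacity = 10000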
After some heavy tuning, I was able to get something that worked well,
and performed very well. I was eventually able to get 200k per second
of these log events through.
To cut to the chase, here are the issues I had:
1. Data loss (this was brought up in another thread). A little under
half the time (about 40%) I would run the exact same test and it
would drop a very small number of events, 10 or fewer (out of about
500k events). Other times it would pass every event through without
issue.
I'm going to guess this is the (netcat) source not being able to send
messages to the channel. Since it doesn't inform the ingest system about
the failed delivery, the ingest also can't resend. I assume you're not
getting any exceptions like the other person who recently asked about
the netcat source?
2. With the tuning mentioned above, I was able to get about 60k per
second through a single flume instance. I decided to crank everything
up (double it, and even tried doubling that once more). This machine
has 48GB and is doing nothing but this. So logically, I figured I
could bump my OPTS to 10g instead of 5, and up my channel capacity to
8g, allowing me to buffer more and in theory double my throughput.
This wasn't at all the case. By attempting to throw more at it (either
by lowering my sleep times between batches, or by using the same sleep
times but doubling my batch size from 10 lines to 20), things started
flaking out. Basically, after about 200k lines went through, it just
stopped processing; no warnings, no errors, nothing.
Here's where it gets interesting though. I then set up four flume
agents on the same machine with all the same configurations and
startup params back at the 4g range, all listening on different ports.
I started all 4, and then in parallel (on another machine), ran my
test script to hit all four agents. That's when I was able to get
200k through. So by running four of them with lower tunables, I was
able to get the throughput I couldn't get running one with 4x the
tunables and startup options.
The channel capacity really shouldn't matter so long as it is large
enough to hold events in the interim until the sink drains them.
If you want to use one big agent, you may need to tune the sink
runners (which are single-threaded). E.g. if you have a lot of data
coming in and one avro sink that just can't keep up, you can set up
multiple avro sinks. The channel size should be set large enough to
hold whatever amount of data you expect to build up if a downstream
server/write location becomes unavailable.
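As a sketch (the hostnames/ports are invented), three avro sinks each get
their own single-threaded runner but drain the same channel in parallel:

agent.sinks = avroSink1 avroSink2 avroSink3
agent.sinks.avroSink1.type = avro
agent.sinks.avroSink1.channel = fileCh
agent.sinks.avroSink1.hostname = central.example.com
agent.sinks.avroSink1.port = 4545
# avroSink2 and avroSink3 are configured the same way, pointing at the
# same or different downstream hosts/ports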
Number 2 is something I can easily live with, but I would like to hear
some insight on maybe what's causing it. Obviously the disks can keep
up, because all of the file_roll paths for all 4 agents are using the
same drive. And obviously I have the ram to buffer accordingly. But
for some reason, one agent with 2x or even 4x the juice starts getting
flaky.
Number 1 is more concerning, this obviously will need to be solved.
The 2 rules I stick to for a lossless flow (your current system is
unfortunately breaking both):
1) An ingest system using an RPC delivery that is aware of failed sends
and responds by resending data (we have a python program that tails
files, sending to the Scribe source, keeping a position pointer, and
rewinding that pointer when the thrift RPC responds with failure).
2) A lossless channel (currently file or jdbc). This is generally only
an issue for restarts/failures.
In summary, I'm willing and ready to spend more time on this. But I
wanted to get some insight from the pros and developers here, and also
make sure I'm not crazy and maybe just trying to use this for more
than it was designed for.
Many thanks to anyone who stuck around to read this! :)
Thanks for your feedback!
Best of luck,
Juhani
Cheers
--
Steve Johnson
[email protected] <mailto:[email protected]>