Hi Asim,

I have a similar use case that has been in production for about a year. We have 
6 web servers sending about 15GB a day of web server logs to an 11-node Hadoop 
cluster. Additionally, those same hosts send another few GB a day of data from 
the web applications themselves to HDFS using flume. I would say in total we 
send about 120GB a day to HDFS from those 6 hosts. The web servers each run a 
local flume agent with an identical config. I will just describe our 
configuration, as I think it answers most of your questions.

The web logs are IIS log files that roll once a day and are written to 
continuously. We wanted relatively low latency from log file to HDFS, so we 
use a scheduled task to create an incremental 'diff' file every minute and 
place it in a spool directory. From there a spoolDir source on the local agent 
processes the file. The same task that creates the diff files also cleans up 
ones that have been processed by flume. This spoolDir source is just for 
weblogs and has no special routing; it uses a file channel and Avro sinks to 
connect to HDFS. There are two sinks in a failover configuration, using a sink 
group, that send to flume agents co-located on HDFS data nodes.
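
To make that concrete, the weblog path of the local agent looks roughly like 
the sketch below. The agent, channel and sink names, paths, hostnames and 
ports are placeholders, not our actual values:

  # weblog path: spoolDir source -> file channel -> failover pair of Avro sinks
  # (the full agent also has the http source and channels described below)
  web.sources = spool
  web.channels = fc_weblogs
  web.sinks = avro1 avro2
  web.sinkgroups = sg1

  web.sources.spool.type = spooldir
  web.sources.spool.spoolDir = /var/flume/spool/weblogs
  web.sources.spool.channels = fc_weblogs

  web.channels.fc_weblogs.type = file
  web.channels.fc_weblogs.checkpointDir = /var/flume/fc_weblogs/checkpoint
  web.channels.fc_weblogs.dataDirs = /var/flume/fc_weblogs/data

  # two Avro sinks pointing at collector agents on the data nodes
  web.sinks.avro1.type = avro
  web.sinks.avro1.channel = fc_weblogs
  web.sinks.avro1.hostname = collector01
  web.sinks.avro1.port = 4545

  web.sinks.avro2.type = avro
  web.sinks.avro2.channel = fc_weblogs
  web.sinks.avro2.hostname = collector02
  web.sinks.avro2.port = 4545

  # failover sink group: avro2 only takes over when avro1 is down
  web.sinkgroups.sg1.sinks = avro1 avro2
  web.sinkgroups.sg1.processor.type = failover
  web.sinkgroups.sg1.processor.priority.avro1 = 10
  web.sinkgroups.sg1.processor.priority.avro2 = 5

(By default the spoolDir source renames files with a .COMPLETED suffix once it 
has fully processed them, which makes it easy for a cleanup task to spot them.)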

The application logs are JSON data that is sent directly from the application 
to flume via an httpSource, again on the same local agent. We have about a 
dozen separate log streams in total from our application, but we send all of 
these events to the one httpSource. Based on a header value we split them into 
'high' and 'low' priority streams with a multiplexing channel selector. We 
don't really do any special handling of the high priority events; we just 
watch one of them a LOT closer than the other. ;) These channels are also 
drained by Avro sinks and are sent to a different pair of flume agents, again 
co-located with data nodes.
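
The relevant part of the config looks something like the following. The header 
name ('priority'), the port and the channel names are again just illustrative:

  # HTTP source accepting the JSON events posted by the applications
  # (these names get appended to the web.sources/channels/sinks lists above)
  web.sources.apphttp.type = http
  web.sources.apphttp.port = 8081
  web.sources.apphttp.channels = fc_high fc_low

  # multiplexing selector routes on a header set by the application
  web.sources.apphttp.selector.type = multiplexing
  web.sources.apphttp.selector.header = priority
  web.sources.apphttp.selector.mapping.high = fc_high
  web.sources.apphttp.selector.mapping.low = fc_low
  web.sources.apphttp.selector.default = fc_low

  web.channels.fc_high.type = file
  web.channels.fc_high.checkpointDir = /var/flume/fc_high/checkpoint
  web.channels.fc_high.dataDirs = /var/flume/fc_high/data
  # fc_low is configured the same way

  # each channel gets its own Avro sink towards a second pair of collectors
  web.sinks.avro_high.type = avro
  web.sinks.avro_high.channel = fc_high
  web.sinks.avro_high.hostname = collector03
  web.sinks.avro_high.port = 4545
  # avro_low is identical apart from the channel and the hostnames

The default handler for the httpSource accepts a JSON array of events, each 
with 'headers' and 'body' fields, so the routing header can be set by the 
application when it posts the events.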

The flume agents that run on the data nodes just have Avro sources, file 
channels and HDFS sinks. We have found that two HDFS sinks (without any sink 
groups) can be necessary to keep the channels from backing up in some cases. 
The number of file channels on these agents varies: some log streams are split 
into their own channels first, and many of them share a 'catch all' channel 
whose sinks use tokenized paths to write the data to different locations in 
HDFS according to header values. The HDFS sinks all write DataStream files in 
Text format and bucket into Hive-friendly partitioned directories by date and 
hour. The JSON events are one line each and we use a custom Hive SerDe for the 
JSON data. We use Hive external table definitions to read the data, and use 
Oozie to process every log stream hourly into Snappy-compressed internal Hive 
tables, then drop the raw data after 8 days. We don't use Impala all that much, 
as most of our workflows just crunch the data and push it back into a SQL DW 
for the data people.
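
For reference, a trimmed-down collector agent looks something like this. The 
'logtype' header and the paths are illustrative; also note the time escapes in 
hdfs.path need a timestamp header on the events (or hdfs.useLocalTimeStamp = 
true on the sink):

  # collector agent co-located on an HDFS data node
  coll.sources = avroin
  coll.channels = fc_catchall
  coll.sinks = hdfs1 hdfs2

  coll.sources.avroin.type = avro
  coll.sources.avroin.bind = 0.0.0.0
  coll.sources.avroin.port = 4545
  coll.sources.avroin.channels = fc_catchall

  coll.channels.fc_catchall.type = file
  coll.channels.fc_catchall.checkpointDir = /disk1/flume/catchall/checkpoint
  coll.channels.fc_catchall.dataDirs = /disk2/flume/catchall/data

  # two HDFS sinks draining the same channel; the tokenized path buckets the
  # data by a header value plus date and hour
  coll.sinks.hdfs1.type = hdfs
  coll.sinks.hdfs1.channel = fc_catchall
  coll.sinks.hdfs1.hdfs.path = /flume/%{logtype}/dt=%Y-%m-%d/hr=%H
  coll.sinks.hdfs1.hdfs.filePrefix = events-1
  coll.sinks.hdfs1.hdfs.fileType = DataStream
  coll.sinks.hdfs1.hdfs.writeFormat = Text
  coll.sinks.hdfs1.hdfs.rollInterval = 300
  coll.sinks.hdfs1.hdfs.rollSize = 0
  coll.sinks.hdfs1.hdfs.rollCount = 0
  # hdfs2 is the same apart from the name and a different hdfs.filePrefix,
  # so the two sinks never collide on file names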

Here is a good start for capacity planning: 
https://cwiki.apache.org/confluence/display/FLUME/Flume%27s+Memory+Consumption.

We have gotten away with the default channel capacity (1 million events) so 
far without issue. We do try to separate the file channels onto different 
physical disks as much as we can to get the most out of our hardware.
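
For example, the file channel tuning ends up looking something like this 
(paths again made up); the main thing is spreading the checkpoint and data 
directories across physical disks:

  coll.channels.fc_catchall.type = file
  # 1,000,000 events is the file channel default capacity
  coll.channels.fc_catchall.capacity = 1000000
  coll.channels.fc_catchall.transactionCapacity = 10000
  # keep checkpoint and data directories on separate disks; dataDirs can be a
  # comma-separated list to spread the I/O further
  coll.channels.fc_catchall.checkpointDir = /disk1/flume/catchall/checkpoint
  coll.channels.fc_catchall.dataDirs = /disk2/flume/catchall/data,/disk3/flume/catchall/data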

Hope that helps,
Paul Chavez


From: Asim Zafir [mailto:[email protected]]
Sent: Wednesday, February 05, 2014 1:23 PM
To: [email protected]
Subject: distributed weblogs ingestion on HDFS via flume

Flume Users,

Here is the problem statement; I would be very much interested to have your 
valuable input and feedback on the following:

Assume that we generate 200GB of logs PER DAY from 50 webservers.

The goal is to sync that to an HDFS repository.


1) do all the webservers in our case need to run a flume agent?
2) will all the webservers be acting as sources in our setup?
3) can we sync webserver logs directly to the HDFS store, bypassing channels?
4) do we have the choice of syncing the weblogs directly to the HDFS store and 
not letting the webserver write locally? what is the best practice?
5) what setup would that be, where I let flume watch a local data directory on 
the webservers and sync it as soon as data arrives in that directory?
6) do I need a dedicated flume server for this setup?
7) if I use a memory-based channel and then sync to HDFS, do I need a dedicated 
server, or can I run those agents on the webservers themselves, provided there 
is enough memory? Or would it be recommended to move my config to a centralized 
flume server and then establish the sync?
8) how should we do the capacity planning for a memory-based channel?
9) how should we do the capacity planning for a file-based channel?

sincerely,
AZ
