Hi Asim,

Here's some information that might be helpful based on my relatively new
experience with Flume:


*1) Do all the webservers in our case need to run a Flume agent?*


They can, but they don't all have to.  For example, if you don't want to
put a Flume agent on every webserver, you could forward the logs via
syslog to another server running a Flume agent that listens with the
syslog source.  If you do want to put a Flume agent on your webservers,
that agent could receive the logs through a local syslog source and use
the Avro sink to pass them to the Flume collection server, which does the
actual writing to HDFS.  Alternatively, you could use the spooling
directory source to read the logs from disk and forward them to the
collector (again over an Avro source/sink pair).


*Not Using Flume on the Webservers:*


[webserver1: apache -> syslogd] ==>
[webserver2: apache -> syslogd] ==> [flume collection server: flume syslog source -> flume hdfs sink]
[webserver3: apache -> syslogd] ==>
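
To make that concrete, here's a minimal sketch of the collection server's
config for this layout (the agent name, port, and HDFS path are
placeholders I made up; check the property names against the Flume User
Guide for your version):

    collector.sources = r1
    collector.channels = c1
    collector.sinks = k1

    # listen for the syslog traffic forwarded by the webservers
    collector.sources.r1.type = syslogtcp
    collector.sources.r1.host = 0.0.0.0
    collector.sources.r1.port = 5140
    collector.sources.r1.channels = c1

    collector.channels.c1.type = memory
    collector.channels.c1.capacity = 100000

    # write the events to HDFS
    collector.sinks.k1.type = hdfs
    collector.sinks.k1.channel = c1
    collector.sinks.k1.hdfs.path = hdfs://namenode:8020/flume/weblogs/%Y-%m-%d
    collector.sinks.k1.hdfs.fileType = DataStream
    # use local time to resolve the %Y-%m-%d escapes in the path
    collector.sinks.k1.hdfs.useLocalTimeStamp = true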


*Using Flume on the Webservers, Option 1:*


[webserver1: apache -> syslogd -> flume syslog source -> flume avro sink] ==>
[webserver2: apache -> syslogd -> flume syslog source -> flume avro sink] ==> [flume collection server: flume avro source -> flume hdfs sink]
[webserver3: apache -> syslogd -> flume syslog source -> flume avro sink] ==>
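
A sketch of the per-webserver agent for Option 1, assuming syslogd on each
box is configured to forward to 127.0.0.1:5140 (the collector hostname and
ports are again made up):

    a1.sources = r1
    a1.channels = c1
    a1.sinks = k1

    # receive the logs syslogd forwards locally
    a1.sources.r1.type = syslogtcp
    a1.sources.r1.host = 127.0.0.1
    a1.sources.r1.port = 5140
    a1.sources.r1.channels = c1

    a1.channels.c1.type = memory
    a1.channels.c1.capacity = 10000

    # forward everything to the collection server
    a1.sinks.k1.type = avro
    a1.sinks.k1.channel = c1
    a1.sinks.k1.hostname = collector.example.com
    a1.sinks.k1.port = 4545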


*Using Flume on the Webservers, Option 2:*


[webserver1: apache -> filesystem -> flume spooldir source -> flume avro sink] ==>
[webserver2: apache -> filesystem -> flume spooldir source -> flume avro sink] ==> [flume collection server: flume avro source -> flume hdfs sink]
[webserver3: apache -> filesystem -> flume spooldir source -> flume avro sink] ==>


(By the way, there are probably other ways to do this; you could even
split the collection tier out from the storage tier, which in the layouts
above is handled by the same final agent.)
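
For Options 1 and 2 the collection server's config is the same as the
syslog sketch above except for the source stanza, which becomes an Avro
source (the port is again a placeholder and must match the webserver
agents' Avro sinks):

    # receive events from the webserver agents' avro sinks
    collector.sources.r1.type = avro
    collector.sources.r1.bind = 0.0.0.0
    collector.sources.r1.port = 4545
    collector.sources.r1.channels = c1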


*2) Will all the webservers be acting as sources in our setup?*


They will act as sources in the general sense that you want to ingest
their logs.  However, they don't necessarily have to run a Flume agent
themselves if you have some other way to ship the logs to a listening
Flume agent (most likely syslog, though we've also had success receiving
logs via the netcat source).


*3) Can we sync webserver logs directly to the HDFS store, bypassing channels?*


If you mean bypassing the channel entirely: no.  Inside an agent, events
always flow from a source through a channel to a sink (in this case an
HDFS sink), so you need all three.  You can't get the logs into HDFS with
only a source and sink, and the channel can't be skipped.


*4) Do we have the choice of syncing the weblogs directly to the HDFS store
without letting the webserver write locally? What is the best practice?*


If, for example, you're using Apache, you could configure it to log
directly to syslog, have syslog forward the logs to a Flume syslog source
listening on a remote server, and have that agent write the logs to HDFS
through a memory channel and the HDFS sink.  In that case the logs are
never written to local disk, but if any part of the data flow goes down
(e.g., the Flume agent crashes) you will lose log data.  Switching to a
file channel, which is durable, would help minimize the risk of data loss.
If you can tolerate potential data loss, the memory channel is much faster
and a bit easier to set up.
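
Switching to the file channel is mostly a one-stanza config change; a
sketch, with made-up paths (ideally each on its own dedicated disk):

    # durable channel: events survive an agent restart or crash
    a1.channels.c1.type = file
    a1.channels.c1.checkpointDir = /flume/file-channel/checkpoint
    a1.channels.c1.dataDirs = /flume/file-channel/data
    a1.channels.c1.capacity = 1000000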


*5) What setup would let Flume watch a local data directory of weblogs and
sync the data to HDFS as soon as it arrives in that directory?*


You would want to use the spooling directory source (type spooldir) to
read the log directory and then send the events to a collector over an
Avro source/sink pair, as in Option 2 above; a sketch follows.
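
Here's a sketch of that agent, assuming Apache's rotated logs land in
/var/log/apache/spool (a made-up path).  One caveat: the spooling
directory source requires files to be complete and immutable once they
appear in the directory, so point it at rotated logs, not at a file
Apache is still writing.

    a1.sources = r1
    a1.channels = c1
    a1.sinks = k1

    # pick up rotated log files as they land in the spool directory
    a1.sources.r1.type = spooldir
    a1.sources.r1.spoolDir = /var/log/apache/spool
    a1.sources.r1.channels = c1

    a1.channels.c1.type = memory
    a1.channels.c1.capacity = 10000

    # forward to the collection server
    a1.sinks.k1.type = avro
    a1.sinks.k1.channel = c1
    a1.sinks.k1.hostname = collector.example.com
    a1.sinks.k1.port = 4545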


*6) Do I need a dedicated Flume server for this setup?*


It depends on what else the Flume server would be doing.  Personally, I
think it's much easier to dedicate a box to the task: you don't have to
worry about resource contention, and monitoring becomes easier.  In
addition, if you use the file channel you will want dedicated disks for
it.  Note that I'm referring to your collector/storage tier; if you run a
Flume agent on each webserver it obviously won't be on a dedicated box,
but that shouldn't be an issue since the agent is only collecting logs
from a single machine and forwarding them on.  (This blog post has some
good info on tuning and topology design:
https://blogs.apache.org/flume/entry/flume_performance_tuning_part_1)


*7) If I use a memory-based channel and then sync to HDFS, do I need a
dedicated server, or can I run those agents on the webservers themselves,
provided there is enough memory? Or would it be recommended to centralize
the config on a Flume server and establish the sync there?*


I would not recommend running a Flume agent with an HDFS sink on all the
webservers.  It's much better to funnel the logs through one or a few
collector agents that write to HDFS than to have all 50 webservers writing
themselves: 50 independent writers means 50 sets of open files, which
tends to produce lots of small files in HDFS and extra load on the
NameNode.


*8) How should we do the capacity planning for a memory-based channel?*


You have to decide how long you want the memory channel to be able to hold
data if a downstream agent goes down (or the HDFS sink gets backed up).
Once you have that value, figure out your average event size and the rate
at which you collect events; multiplying those together gives you a rough
idea.  There is some per-event memory overhead as well, but I don't know
the exact value.  If you're using Cloudera Manager, you can monitor memory
channel usage directly from its interface, which is very useful.
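
As a rough worked example with your numbers (200GB/day from 50 webservers)
and an assumed average event size of 500 bytes (you'd want to measure your
real average):

    200 GB/day / 86,400 s           ~= 2.3 MB/s arriving at the collector
    2.3 MB/s / 500 bytes per event  ~= 4,600 events/s
    1 hour of buffer: 4,600 * 3,600 ~= 17M events, or ~8.3 GB of memory
                                       (before per-event overhead)

Each per-webserver agent only sees about 1/50th of that, so their channels
can be much smaller.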


*9) How should we do the capacity planning for a file-based channel?*


Assuming you're referring to heap memory: I think I saw in a different
thread that you need 32 bytes of heap per event of channel capacity, plus
whatever Flume core will use.  So if your channel capacity is 1 million
events, you will need ~32MB of heap space plus 100-500MB for Flume core.
You will of course also need enough disk space to store the actual logs
themselves.


Best,


Ed





On Thu, Feb 6, 2014 at 6:22 AM, Asim Zafir <[email protected]> wrote:

> Flume Users,
>
>
> Here is the problem statement, will be very much interested to have your
> valuable input and feedback on the following:
>
>
> *Assuming that fact that we generate  200GB of logs PER DAY from 50
> webservers *
>
>
>
> Goal is to sync that to HDFS repository
>
>
>
>
>
> 1) do all the webserver in our case needs to run a flume agent?
>
> 2) do all the webserver will be acting as source in our setup ?
>
> 3) can we sync webservers logs directly to HDFS store by passing channels?
>
> 4) do we have a choice of directly synching the weblogs to HDFS store and
> not let the webserver right locally? what is the best practice?
>
> 5) what setup will that be where i would let the flume, sync a local
> datadire on weblogs, and sync it as soon as the date arrives to this
> directory?
>
> 6) do i need a dedicated flume server for this setup?
>
> 7) if i do use  memory based channel and then do HDFS sync do I need a
> dedicated server, or can run those agents on the webserver itself, provided
> there is enough memory OR would it be recommended to position my config to
> a centralize flume server and the establish the sync.
>
> 8) how should we do the capacity planning for a memory based channel?
>
> 9) how should we do the capacity planning for a file based channel ?
>
>
>
> sincerely,
>
> AZ
>
