Flume Users,
Here is the problem statement; I would very much appreciate your input and feedback on the following:

* Assume we generate 200 GB of logs PER DAY across 50 webservers.
* The goal is to sync those logs to an HDFS repository.

1) Does every webserver in our case need to run a Flume agent?
2) Will every webserver act as a source in this setup?
3) Can we sync the webserver logs directly to HDFS, bypassing channels?
4) Do we have the option of syncing the weblogs directly to HDFS rather than having the webserver write them locally first? What is the best practice?
5) What would the setup be where Flume watches a local data directory of weblogs and syncs files to HDFS as soon as data arrives in that directory? (A rough sketch of what I am imagining is in the P.S. below.)
6) Do I need a dedicated Flume server for this setup?
7) If I use a memory channel and then sync to HDFS, do I need a dedicated server, or can I run those agents on the webservers themselves, provided there is enough memory? Or would it be recommended to centralize the config on a dedicated Flume server and establish the sync from there?
8) How should we do capacity planning for a memory channel? (My back-of-envelope numbers are in the P.P.S.)
9) How should we do capacity planning for a file channel?

sincerely,
AZ
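
P.S. To make question 5 concrete, here is a rough sketch of the per-webserver agent config I have in mind: a spooling-directory source feeding a file channel and an HDFS sink. The agent name, directory paths, HDFS URL, and sizing numbers below are placeholders I made up, not tested values:

    # Hypothetical per-webserver agent: spooldir source -> file channel -> HDFS sink
    agent1.sources  = weblog-src
    agent1.channels = weblog-ch
    agent1.sinks    = hdfs-sink

    # Watch a local spool directory; the webserver (or logrotate) moves
    # completed log files here, and Flume ingests them as they arrive.
    agent1.sources.weblog-src.type     = spooldir
    agent1.sources.weblog-src.spoolDir = /var/log/web/spool
    agent1.sources.weblog-src.channels = weblog-ch

    # File channel: events are staged on local disk, so an agent restart
    # does not lose data (unlike a memory channel).
    agent1.channels.weblog-ch.type          = file
    agent1.channels.weblog-ch.checkpointDir = /var/lib/flume/checkpoint
    agent1.channels.weblog-ch.dataDirs      = /var/lib/flume/data
    agent1.channels.weblog-ch.capacity      = 1000000

    # HDFS sink: write into date-partitioned directories, rolling by size
    # to avoid producing lots of small HDFS files.
    agent1.sinks.hdfs-sink.type                   = hdfs
    agent1.sinks.hdfs-sink.channel                = weblog-ch
    agent1.sinks.hdfs-sink.hdfs.path              = hdfs://namenode:8020/data/weblogs/%Y/%m/%d
    agent1.sinks.hdfs-sink.hdfs.fileType          = DataStream
    agent1.sinks.hdfs-sink.hdfs.useLocalTimeStamp = true
    agent1.sinks.hdfs-sink.hdfs.rollSize          = 134217728
    agent1.sinks.hdfs-sink.hdfs.rollCount         = 0
    agent1.sinks.hdfs-sink.hdfs.rollInterval      = 0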

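P.P.S. For question 8, my back-of-envelope math: 200 GB/day in aggregate is roughly 2.3 MB/s averaged over the day, or under 50 KB/s per webserver across 50 machines, though peak rates will be much higher. If I swapped the file channel above for a memory channel, I assume the sizing knobs would look something like this (the values are guesses, not recommendations):

    # Hypothetical memory channel sizing for one webserver agent
    agent1.channels.weblog-ch.type                = memory
    # Max number of events held in the channel at once.
    agent1.channels.weblog-ch.capacity            = 100000
    # Max events per source put / sink take transaction.
    agent1.channels.weblog-ch.transactionCapacity = 1000
    # Cap the channel at ~512 MB of event body data...
    agent1.channels.weblog-ch.byteCapacity        = 536870912
    # ...keeping 20% headroom for event headers.
    agent1.channels.weblog-ch.byteCapacityBufferPercentage = 20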