Flume Users,
Here is the problem statement; I would very much appreciate your input and feedback on the following:

* Assume we generate 200 GB of logs PER DAY across 50 webservers.
* The goal is to sync those logs to an HDFS repository.

1) Does every webserver in our case need to run a Flume agent?
2) Will every webserver act as a source in this setup?
3) Can we sync the webserver logs directly to HDFS, bypassing channels?
4) Do we have the option of syncing the weblogs directly to HDFS rather than having the webserver write them locally first? What is the best practice?
5) What would the setup be where Flume watches a local data directory of weblogs and syncs files to HDFS as soon as data arrives in that directory? (A rough sketch of what I am imagining is in the P.S. below.)
6) Do I need a dedicated Flume server for this setup?
7) If I use a memory channel and then sync to HDFS, do I need a dedicated server, or can I run those agents on the webservers themselves, provided there is enough memory? Or would it be recommended to centralize the config on a dedicated Flume server and establish the sync from there?
8) How should we do capacity planning for a memory channel? (My back-of-envelope numbers are in the P.P.S.)
9) How should we do capacity planning for a file channel?

sincerely,
AZ
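
P.S. To make question 5 concrete, here is a rough sketch of the per-webserver agent config I have in mind: a spooling-directory source feeding a file channel and an HDFS sink. The agent name, directory paths, HDFS URL, and sizing numbers below are placeholders I made up, not tested values:

    # Hypothetical per-webserver agent: spooldir source -> file channel -> HDFS sink
    agent1.sources  = weblog-src
    agent1.channels = weblog-ch
    agent1.sinks    = hdfs-sink

    # Watch a local spool directory; the webserver (or logrotate) moves
    # completed log files here, and Flume ingests them as they arrive.
    agent1.sources.weblog-src.type     = spooldir
    agent1.sources.weblog-src.spoolDir = /var/log/web/spool
    agent1.sources.weblog-src.channels = weblog-ch

    # File channel: events are staged on local disk, so an agent restart
    # does not lose data (unlike a memory channel).
    agent1.channels.weblog-ch.type          = file
    agent1.channels.weblog-ch.checkpointDir = /var/lib/flume/checkpoint
    agent1.channels.weblog-ch.dataDirs      = /var/lib/flume/data
    agent1.channels.weblog-ch.capacity      = 1000000

    # HDFS sink: write into date-partitioned directories, rolling by size
    # to avoid producing lots of small HDFS files.
    agent1.sinks.hdfs-sink.type                   = hdfs
    agent1.sinks.hdfs-sink.channel                = weblog-ch
    agent1.sinks.hdfs-sink.hdfs.path              = hdfs://namenode:8020/data/weblogs/%Y/%m/%d
    agent1.sinks.hdfs-sink.hdfs.fileType          = DataStream
    agent1.sinks.hdfs-sink.hdfs.useLocalTimeStamp = true
    agent1.sinks.hdfs-sink.hdfs.rollSize          = 134217728
    agent1.sinks.hdfs-sink.hdfs.rollCount         = 0
    agent1.sinks.hdfs-sink.hdfs.rollInterval      = 0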

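P.P.S. For question 8, my back-of-envelope math: 200 GB/day in aggregate is roughly 2.3 MB/s averaged over the day, or under 50 KB/s per webserver across 50 machines, though peak rates will be much higher. If I swapped the file channel above for a memory channel, I assume the sizing knobs would look something like this (the values are guesses, not recommendations):

    # Hypothetical memory channel sizing for one webserver agent
    agent1.channels.weblog-ch.type                = memory
    # Max number of events held in the channel at once.
    agent1.channels.weblog-ch.capacity            = 100000
    # Max events per source put / sink take transaction.
    agent1.channels.weblog-ch.transactionCapacity = 1000
    # Cap the channel at ~512 MB of event body data...
    agent1.channels.weblog-ch.byteCapacity        = 536870912
    # ...keeping 20% headroom for event headers.
    agent1.channels.weblog-ch.byteCapacityBufferPercentage = 20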