I like the first option (Kafka + Flume cluster to HDFS cluster). Flume doesn't actually benefit much from being local to HDFS, and as you noticed, it may take resources away from Spark and Impala. Flume can live on the same nodes as Kafka, especially if you are using it with the Kafka channel; just keep in mind that Kafka can be a bit sensitive to serious memory or disk utilization. A rough sketch of such an agent is below.
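Just for illustration, a sourceless agent with the Kafka channel feeding an HDFS sink might look roughly like this (untested sketch; broker/ZooKeeper hostnames, the topic name, and the HDFS path are all placeholders):

# Flume agent on the Kafka cluster: Kafka channel -> HDFS sink, no source needed
a1.channels = kc
a1.sinks = hs

# The channel reads messages straight off the Kafka topic
a1.channels.kc.type = org.apache.flume.channel.kafka.KafkaChannel
a1.channels.kc.brokerList = kafka1:9092,kafka2:9092
a1.channels.kc.zookeeperConnect = zk1:2181
a1.channels.kc.topic = events
a1.channels.kc.parseAsFlumeEvent = false

# The HDFS sink writes compressed files across the network to cluster2
a1.sinks.hs.type = hdfs
a1.sinks.hs.channel = kc
a1.sinks.hs.hdfs.path = hdfs://namenode:8020/flume/events/%Y-%m-%d
a1.sinks.hs.hdfs.useLocalTimeStamp = true
a1.sinks.hs.hdfs.fileType = CompressedStream
a1.sinks.hs.hdfs.codeC = snappy
# Roll by time/size so you don't end up with lots of small files
a1.sinks.hs.hdfs.rollInterval = 300
a1.sinks.hs.hdfs.rollSize = 134217728
a1.sinks.hs.hdfs.rollCount = 0

With parseAsFlumeEvent = false the channel consumes whatever your producers write to the topic, so no Kafka source is needed at all, and if you run one such agent per Kafka node they should divide the topic's partitions among themselves since they share a consumer group.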
Hope this helps.

Gwen

On Tue, Feb 17, 2015 at 2:13 AM, Guillermo Ortiz <[email protected]> wrote:
> Hi,
>
> I have Kafka and DataNodes on different machines. I want to get the
> data from Kafka with Flume and store it in HDFS. What's the best
> architecture? I assume that all the machines have access to the others.
>
> Cluster1 (Kafka + Flume) ---> Cluster2 (HDFS)
> There is an agent on each machine where Kafka is installed, and the
> sink writes to HDFS directly; compression options, etc. can be
> configured on the sink.
>
> Cluster1 (Kafka + Flume + Avro) --> Cluster2 (Flume + Avro + HDFS)
> There is an agent on each machine where Kafka is installed. Flume
> sends data to another Flume agent through Avro, and the Flume agent
> installed on the DataNodes writes the data to HDFS.
>
> Cluster1 (Kafka) --> Cluster2 (Flume + HDFS)
> Flume is installed only on the DataNodes.
>
> I don't like installing Flume on the DataNodes because these machines
> run processes such as Spark, Hive, Impala, and MapReduce, and those
> tasks consume a lot of resources. On the other hand, that is where the
> data has to be sent.
> I could configure more than one source to get data from Kafka and run
> more than one Flume agent to have more than one JVM.
> Could someone comment on the advantages and disadvantages they see in
> each scenario?
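For comparison, the Avro hop in the second option would split the config into two tiers, roughly like this (again an untested sketch; hostnames and ports are just examples):

# Tier 1, on the Kafka nodes: Kafka channel -> Avro sink
t1.channels = kc
t1.sinks = av
t1.channels.kc.type = org.apache.flume.channel.kafka.KafkaChannel
t1.channels.kc.brokerList = kafka1:9092
t1.channels.kc.zookeeperConnect = zk1:2181
t1.channels.kc.topic = events
t1.channels.kc.parseAsFlumeEvent = false
t1.sinks.av.type = avro
t1.sinks.av.channel = kc
t1.sinks.av.hostname = collector1
t1.sinks.av.port = 4545

# Tier 2, on cluster2: Avro source -> memory channel -> HDFS sink
t2.sources = av
t2.channels = mc
t2.sinks = hs
t2.sources.av.type = avro
t2.sources.av.bind = 0.0.0.0
t2.sources.av.port = 4545
t2.sources.av.channels = mc
t2.channels.mc.type = memory
t2.channels.mc.capacity = 10000
t2.sinks.hs.type = hdfs
t2.sinks.hs.channel = mc
t2.sinks.hs.hdfs.path = hdfs://namenode:8020/flume/events

The extra hop buys you very little here, since the HDFS sink can already write across the network to cluster2, which is why the first option is simpler.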
