Not sure what you mean by Kafka + Flume to HDFS, but in our experience we have seen significant data loss with Flume used as the transport mechanism to sync data to HDFS. Some things that haven't worked for us:
1) Flume appenders at the source - installing appenders and a Flume agent on the application-server side caused serious performance issues. The appenders appear to reach a deadlock state due to thread locking.
2) log4j v1 with appenders is a bad option with Flume.
3) log4j v2 + the embedded agent solves the thread-locking problem and relieves the stress on the application servers - since you now have one less JVM to manage, there are no performance issues there. It really works for any high-traffic server generating data.
4) Flume has issues with some meta characters (certain specific UTF code points): if the read on such a character falls outside the limit of the read buffer, it will truncate the commit to the data pipeline. Since there is no logging, it is painful to even troubleshoot.

thanks,
Asim Zafir

On Tue, Feb 17, 2015 at 12:29 PM, Gwen Shapira <[email protected]> wrote:

> I like the first option (Kafka + Flume cluster to HDFS cluster).
>
> Flume doesn't actually benefit much from being local to HDFS, and as you
> noticed, it may take resources from Spark and Impala.
>
> Flume can live on the same nodes as Kafka, especially if you are using it
> with the Kafka channel - Kafka can be a bit sensitive to serious memory or
> disk utilization.
>
> Hope this helps.
>
> Gwen
>
> On Tue, Feb 17, 2015 at 2:13 AM, Guillermo Ortiz <[email protected]>
> wrote:
>
>> Hi,
>>
>> I have some machines with Kafka and DataNodes in different machines. I
>> want to use Flume to get the data from Kafka and store it in HDFS.
>> What's the best architecture? I assume that all the machines have
>> access to the others.
>>
>> Cluster1 (Kafka + Flume) ---> Cluster2 (HDFS)
>> There is an agent on each machine where Kafka is installed, and the
>> sink writes to HDFS directly; some compression option could be
>> configured in the sink, etc.
>>
>> Cluster1 (Kafka + Flume + Avro) --> Cluster2 (Flume + Avro + HDFS)
>> There is an agent on each machine where Kafka is installed. Flume
>> sends data to another Flume agent through Avro, and the Flume agent
>> installed on the DataNode writes the data to HDFS.
>>
>> Cluster1 (Kafka) --> Cluster2 (Flume + HDFS)
>> Flume is only installed on the DataNodes.
>>
>> I don't like installing Flume on the DataNodes because these machines
>> run processes such as Spark, Hive, Impala and MapReduce, and they
>> spend many resources on their tasks. On the other hand, it is where
>> the data has to be sent.
>> I could configure more than one source to get data from Kafka and
>> more than one Flume agent to have more than one VM.
>> Could someone comment on the advantages and disadvantages they see in
>> each scenario?
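For the first option Gwen recommends (Kafka + Flume cluster writing to the HDFS cluster, using the Kafka channel), a Flume agent config would look roughly like the sketch below. This is a minimal illustration, not a tested config: the agent name, topic, broker/ZooKeeper addresses, and HDFS path are all placeholders, and property names follow the Flume 1.6-era Kafka channel.

```properties
# One Flume agent per Kafka node: Kafka channel -> HDFS sink (no source needed,
# the channel consumes directly from the Kafka topic).
a1.channels = kc
a1.sinks = k1

a1.channels.kc.type = org.apache.flume.channel.kafka.KafkaChannel
a1.channels.kc.brokerList = broker1:9092,broker2:9092
a1.channels.kc.zookeeperConnect = zk1:2181
a1.channels.kc.topic = events
# Events were not written by a Flume source, so don't expect Flume's wire format
a1.channels.kc.parseAsFlumeEvent = false

a1.sinks.k1.type = hdfs
a1.sinks.k1.channel = kc
a1.sinks.k1.hdfs.path = hdfs://namenode:8020/data/events/%Y-%m-%d
# Compression is configured on the sink, as mentioned in the thread
a1.sinks.k1.hdfs.fileType = CompressedStream
a1.sinks.k1.hdfs.codeC = snappy
a1.sinks.k1.hdfs.useLocalTimeStamp = true
```

With the Kafka channel there is no in-memory or file channel on the Flume host to lose data from, which is part of why this layout tolerates Flume living on the Kafka nodes rather than the DataNodes.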

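For point 3 above (log4j v2 with the embedded Flume agent instead of a separate agent JVM), the appender is configured in log4j2's XML config along these lines. This is a rough sketch based on the log4j2 FlumeAppender's "Embedded" type; the channel directories, sink hostname/port, and appender name are placeholder values.

```xml
<Configuration>
  <Appenders>
    <!-- type="Embedded" runs the Flume agent inside the application JVM,
         avoiding the separate agent process and the appender deadlocks
         described above. -->
    <Flume name="flumeLogger" compress="true" type="Embedded">
      <!-- Durable file channel local to the app host -->
      <Property name="channels">file</Property>
      <Property name="channels.file.type">file</Property>
      <Property name="channels.file.checkpointDir">/var/flume/checkpoint</Property>
      <Property name="channels.file.dataDirs">/var/flume/data</Property>
      <!-- Avro sink forwarding to the downstream Flume/HDFS tier -->
      <Property name="sinks">agent1</Property>
      <Property name="sinks.agent1.channel">file</Property>
      <Property name="sinks.agent1.type">avro</Property>
      <Property name="sinks.agent1.hostname">collector1.example.com</Property>
      <Property name="sinks.agent1.port">8800</Property>
    </Flume>
  </Appenders>
  <Loggers>
    <Root level="info">
      <AppenderRef ref="flumeLogger"/>
    </Root>
  </Loggers>
</Configuration>
```

The application would also need the flume-ng embedded-agent jars on its classpath; check the log4j2 Flume appender documentation for the exact dependency set for your versions.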