I mean: Machine TypeA (Kafka + Flume agent with a Kafka source and an HDFS sink) --> Machine TypeB (DataNode)
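
For concreteness, a minimal sketch of such a TypeA agent, assuming a Flume build that ships the Kafka source (Flume 1.6); the agent name, hosts, topic, and paths below are placeholders:

  # Agent "a1" on a TypeA machine: Kafka source -> file channel -> HDFS sink
  a1.sources = r1
  a1.channels = c1
  a1.sinks = k1

  # Kafka source; zookeeperConnect and topic are placeholder values
  a1.sources.r1.type = org.apache.flume.source.kafka.KafkaSource
  a1.sources.r1.zookeeperConnect = zk1:2181
  a1.sources.r1.topic = events
  a1.sources.r1.channels = c1

  # Durable file channel, so an agent restart does not lose buffered events
  a1.channels.c1.type = file

  # HDFS sink writing compressed files to the remote cluster (TypeB side)
  a1.sinks.k1.channel = c1
  a1.sinks.k1.type = hdfs
  a1.sinks.k1.hdfs.path = hdfs://namenode:8020/flume/events/%Y-%m-%d
  a1.sinks.k1.hdfs.fileType = CompressedStream
  a1.sinks.k1.hdfs.codeC = snappy
  a1.sinks.k1.hdfs.useLocalTimeStamp = true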
2015-02-17 22:40 GMT+01:00 Asim Zafir <[email protected]>:
> Not sure what you mean by Kafka+Flume to HDFS, but in our experience we
> have seen significant data loss with Flume being used as a transport
> mechanism to sync data to HDFS. Some things that haven't worked for us:
>
> 1) Flume appenders on the source - installing appenders and a Flume agent
> on the application server side caused serious performance issues. The
> appenders appear to reach a deadlock state due to thread locking.
> 2) log4j v1 and its appenders are a bad option with Flume.
> 3) log4j v2 + the embedded agent solves the thread-locking problem and
> relieves the stress on the application servers - since you now have one
> less JVM to manage, there are no performance issues there. For any
> high-traffic server generating data, it really works.
> 4) Flume has issues with some meta characters (some specific UTF codes)
> and will truncate the commit to the data pipeline if it hits one of those
> while the read on that character falls outside the limit of the read
> buffer - since there is no logging, it's painful to even troubleshoot.
>
> thanks,
> Asim Zafir
>
> On Tue, Feb 17, 2015 at 12:29 PM, Gwen Shapira <[email protected]> wrote:
>>
>> I like the first option (Kafka + Flume cluster to HDFS cluster).
>>
>> Flume doesn't actually benefit much from being local to HDFS, and as you
>> noticed - it may take resources from Spark and Impala.
>>
>> Flume can live on the same nodes as Kafka, especially if you are using
>> it with the Kafka channel. Kafka can be a bit sensitive to serious
>> memory or disk utilization, though.
>>
>> Hope this helps.
>>
>> Gwen
>>
>> On Tue, Feb 17, 2015 at 2:13 AM, Guillermo Ortiz <[email protected]> wrote:
>>>
>>> Hi,
>>>
>>> I have some machines with Kafka, and the DataNodes are on different
>>> machines. I want to use Flume to get the data from Kafka and store it
>>> in HDFS. What's the best architecture? I assume that all the machines
>>> have access to the others.
>>>
>>> Cluster1 (Kafka + Flume) ---> Cluster2 (HDFS)
>>> There is an agent on each machine where Kafka is installed, and the
>>> sink writes to HDFS directly; compression options, etc. could be
>>> configured on the sink.
>>>
>>> Cluster1 (Kafka + Flume + Avro) --> Cluster2 (Flume + Avro + HDFS)
>>> There is an agent on each machine where Kafka is installed. Flume
>>> sends the data to another Flume agent through Avro, and the agent
>>> installed on the DataNode writes the data to HDFS.
>>>
>>> Cluster1 (Kafka) --> Cluster2 (Flume + HDFS)
>>> Flume is just installed on the DataNodes.
>>>
>>> I don't like installing Flume on the DataNodes because those machines
>>> run processes such as Spark, Hive, Impala, and MapReduce, and they
>>> spend many resources on their tasks. On the other hand, that is where
>>> the data has to be sent.
>>> I could configure more than one source to get data from Kafka, and
>>> more than one Flume agent to have more than one JVM.
>>> Could someone comment on the advantages and disadvantages they find in
>>> each scenario?
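
A side note on Gwen's Kafka channel suggestion: with the Kafka channel the source layer can be dropped entirely, since the channel consumes the topic itself and Kafka provides the durable buffer. A sketch, with broker/topic names again hypothetical:

  # Agent "a2": no source; the Kafka channel reads the topic directly
  a2.channels = kc
  a2.sinks = k1

  a2.channels.kc.type = org.apache.flume.channel.kafka.KafkaChannel
  a2.channels.kc.brokerList = kafka1:9092
  a2.channels.kc.zookeeperConnect = zk1:2181
  a2.channels.kc.topic = events
  # false when the topic is fed by plain Kafka producers rather than Flume
  a2.channels.kc.parseAsFlumeEvent = false

  a2.sinks.k1.channel = kc
  a2.sinks.k1.type = hdfs
  a2.sinks.k1.hdfs.path = hdfs://namenode:8020/flume/events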

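And for completeness, the Avro hop in option 2 would pair an Avro sink on the Cluster1 agent with an Avro source on the Cluster2 agent; hostnames and ports here are hypothetical:

  # Cluster1 agent: Avro sink instead of the HDFS sink above
  a1.sinks.k1.type = avro
  a1.sinks.k1.hostname = flume-dn1
  a1.sinks.k1.port = 4141
  a1.sinks.k1.channel = c1

  # Cluster2 agent on the DataNode side: Avro source feeding an HDFS sink
  a3.sources.av.type = avro
  a3.sources.av.bind = 0.0.0.0
  a3.sources.av.port = 4141
  a3.sources.av.channels = c1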