Hi Andy,

Since you set the startFromEnd option to true, the resend is likely caused by the DFO mechanism (agentDFOSink): when you restart a Flume node in DFO mode, all events in intermediate stages (logged, writing, sending, and so on) roll back to the logged stage, which means resending and duplication.

Also, for better performance, you may want to use agentBESink instead of agentDFOSink. If you have multiple collectors, I recommend agentBEChain for failover in case a node in the collector tier fails.
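As a sketch, ag1 with that change might look like the following (the hostnames and ports come from your config, except hadoop49, which is a placeholder for a second collector host):

```
config [ag1, tail("/home/zhouhh/game.log",startFromEnd=true),
        agentBEChain("hadoop48:35853","hadoop49:35853")]
```

agentBEChain sends best-effort to the first reachable collector in the list, so a collector restart costs you availability of that hop only, not a replay of the whole log.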

-JS

On 1/29/13 4:29 PM, Alexander Alten-Lorenz wrote:
Hi,

you could use tail -F, but that depends on the external source, which Flume has no
control over. You can also write your own script and plug it in.
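For instance, such a command can be wrapped with one of Flume's exec sources; a sketch (treat the exact source name and syntax as an assumption and check the source catalog for your Flume version):

```
config [ag1, execStream("tail -F /home/zhouhh/game.log"),
        agentBESink("hadoop48",35853)]
```

tail -F (as opposed to -f) re-opens the file by name, so it keeps following the log across rotations.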

What's the content of the /tmp/flume/agent/agent*.*/ directories? Are the sent and
sending subdirectories clean?

- Alex

On Jan 29, 2013, at 8:24 AM, 周梦想 <[email protected]> wrote:

hello,
1. I want to tail a log source and write it to HDFS. Below is my configuration:
config [ag1, tail("/home/zhouhh/game.log",startFromEnd=true),
agentDFOSink("hadoop48",35853) ;]
config [ag2, tail("/home/zhouhh/game.log",startFromEnd=true),
agentDFOSink("hadoop48",35853) ;]
config [co1, collectorSource( 35853 ),  [collectorSink(
"hdfs://hadoop48:54310/user/flume/%y%m/%d","%{host}-",5000,raw),collectorSink(
"hdfs://hadoop48:54310/user/flume/%y%m","%{host}-",10000,raw)]]


I found that if I restart the agent node, it resends the content of game.log
to the collector. Is there a way to send only the logs that have not been
sent before? Or do I have to make a mark myself, or remove the logs manually,
when restarting the agent node?
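One way to "make a mark" yourself is to checkpoint the byte offset already shipped and resume from it after a restart; a minimal Python sketch, outside Flume entirely (the state-file path and function name are hypothetical, not a Flume API):

```python
import os

def read_new_lines(path, state):
    """Return only lines appended since the last call, persisting the
    byte offset in a small state file so restarts do not resend."""
    offset = 0
    if os.path.exists(state):
        with open(state) as f:
            offset = int(f.read().strip() or 0)
    with open(path) as f:
        f.seek(offset)          # skip everything already shipped
        lines = f.readlines()
        new_offset = f.tell()   # remember where we stopped
    with open(state, "w") as f:
        f.write(str(new_offset))
    return lines
```

Each invocation returns only the delta, so re-running it after a crash or restart picks up where the previous run left off instead of replaying the whole file.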

2. I tested the performance of Flume and found it a bit slow.
With the configuration above, I get only about 50MB/minute.
I changed the configuration to the following:
ag1:tail("/home/zhouhh/game.log",startFromEnd=true)|batch(1000) gzip
agentDFOSink("hadoop48",35853);

config [co1, collectorSource( 35853 ), [collectorSink(
"hdfs://hadoop48:54310/user/flume/%y%m/%d","%{host}-",5000,raw),collectorSink(
"hdfs://hadoop48:54310/user/flume/%y%m","%{host}-",10000,raw)]]

I sent 300MB of logs and it took about 3 minutes, so roughly 100MB/minute.

By comparison, sending the log from ag1 to co1 via scp runs at about 30MB/second.

Can anyone give me some ideas?

thanks!

Andy
--
Alexander Alten-Lorenz
http://mapredit.blogspot.com
German Hadoop LinkedIn Group: http://goo.gl/N8pCF





--
Jeong-shik Jang / [email protected]
Gruter, Inc., R&D Team Leader
www.gruter.com
Enjoy Connecting
