Hi JS,

Thank you for your reply. So there are significant shortcomings in collecting logs with Flume's tail source. Can I write my own agent that sends logs directly to the collector server via the Thrift protocol?
Best Regards,
Andy Zhou

2013/2/4 Jeong-shik Jang <[email protected]>

> Hi Andy,
>
> 1. "startFromEnd=true" in your source configuration means data loss can
> happen on the tail side at restart, because Flume will ignore any events
> generated during the restart and always start from the end of the file.
> 2. With agentSink, data duplication can happen due to ack delay from the
> master or at agent restart.
>
> I think this is why Flume-NG no longer supports tail and instead lets
> the user handle it with a script or program; tailing is a tricky job.
>
> My suggestion is to use agentBEChain in the agent tier and DFO in the
> collector tier; you can still lose some data during failover when a
> failure occurs.
> To minimize loss and duplication, implementing a checkpoint function in
> tail can also help.
>
> Having a monitoring system to detect failures is very important as
> well, so that you can notice a failure and react quickly to recover.
>
> -JS
>
> On 2/4/13 4:27 PM, 周梦想 wrote:
>
> Hi JS,
> We can't accept agentBESink, because these logs are important for data
> analysis and we can't tolerate any errors in the data. Losing data and
> duplicating it are both unacceptable.
> One agent's configuration is:
>
>   tail("H:/game.log", startFromEnd=true) agentSink("hadoop48", 35853)
>
> Every time this Windows agent restarts, it resends all the data to the
> collector server.
> If for some reason we restart the agent node, we can't recover the mark
> of how far into the log the agent has already sent.
>
> 2013/1/29 Jeong-shik Jang <[email protected]>
>
>> Hi Andy,
>>
>> As you set the startFromEnd option to true, the resend might be caused
>> by the DFO mechanism (agentDFOSink); when you restart a Flume node in
>> DFO mode, all events in the various stages (logged, writing, sending
>> and so on) roll back to the logged stage, which means resending and
>> duplication.
>>
>> And, for better performance, you may want to use agentBESink instead
>> of agentDFOSink.
>> I recommend using agentBEChain for failover in case of a failure in
>> the collector tier, if you have multiple collectors.
>>
>> -JS
>>
>> On 1/29/13 4:29 PM, Alexander Alten-Lorenz wrote:
>>
>>> Hi,
>>>
>>> You could use tail -F, but that depends on the external source, which
>>> Flume has no control over. You can write your own script and include
>>> it.
>>>
>>> What is the content of the /tmp/flume/agent/agent*.*/ directories?
>>> Are "sent" and "sending" clean?
>>>
>>> - Alex
>>>
>>> On Jan 29, 2013, at 8:24 AM, 周梦想 <[email protected]> wrote:
>>>
>>>> Hello,
>>>> 1. I want to tail a log source and write it to HDFS. Below is the
>>>> configuration:
>>>>
>>>>   config [ag1, tail("/home/zhouhh/game.log", startFromEnd=true),
>>>>     agentDFOSink("hadoop48", 35853);]
>>>>   config [ag2, tail("/home/zhouhh/game.log", startFromEnd=true),
>>>>     agentDFOSink("hadoop48", 35853);]
>>>>   config [co1, collectorSource(35853), [
>>>>     collectorSink("hdfs://hadoop48:54310/user/flume/%y%m/%d",
>>>>       "%{host}-", 5000, raw),
>>>>     collectorSink("hdfs://hadoop48:54310/user/flume/%y%m",
>>>>       "%{host}-", 10000, raw)]]
>>>>
>>>> I found that if I restart the agent node, it resends the entire
>>>> content of game.log to the collector. Is there a solution that sends
>>>> only the logs that haven't been sent before? Or do I have to mark
>>>> the position myself, or remove the logs manually, when I restart the
>>>> agent node?
>>>>
>>>> 2. I tested the performance of Flume and found it a bit slow.
>>>> With the configuration above, I get only 50MB/minute.
>>>> I changed the configuration to the following:
>>>>
>>>>   ag1: tail("/home/zhouhh/game.log", startFromEnd=true)
>>>>     | batch(1000) gzip agentDFOSink("hadoop48", 35853);
>>>>   config [co1, collectorSource(35853), [
>>>>     collectorSink("hdfs://hadoop48:54310/user/flume/%y%m/%d",
>>>>       "%{host}-", 5000, raw),
>>>>     collectorSink("hdfs://hadoop48:54310/user/flume/%y%m",
>>>>       "%{host}-", 10000, raw)]]
>>>>
>>>> I sent 300MB of logs and it took about 3 minutes, so that is about
>>>> 100MB/minute.
>>>>
>>>> Meanwhile, when I send the log from ag1 to co1 via scp, it runs at
>>>> about 30MB/second.
>>>>
>>>> Can anyone give me any ideas?
>>>>
>>>> Thanks!
>>>>
>>>> Andy
>>>
>>> --
>>> Alexander Alten-Lorenz
>>> http://mapredit.blogspot.com
>>> German Hadoop LinkedIn Group: http://goo.gl/N8pCF
>>
>> --
>> Jeong-shik Jang / [email protected]
>> Gruter, Inc., R&D Team Leader
>> www.gruter.com
>> Enjoy Connecting
>
> --
> Jeong-shik Jang / [email protected]
> Gruter, Inc., R&D Team Leader
> www.gruter.com
> Enjoy Connecting
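The "checkpoint function in tail" that JS suggests above could be sketched roughly as follows. This is a minimal illustration in Python, not actual Flume code; the `ship` callback and the file paths are hypothetical placeholders. The idea is to persist the byte offset after each successfully shipped line, so a restart resumes exactly where the previous run stopped, instead of resending everything (the agentDFOSink rollback problem) or skipping events written during downtime (the startFromEnd=true problem).

```python
import os

def tail_with_checkpoint(log_path, ckpt_path, ship):
    """Read new lines from log_path, shipping each via ship(line),
    and persist the committed byte offset in ckpt_path."""
    # Load the last committed offset (0 on first run).
    offset = 0
    if os.path.exists(ckpt_path):
        with open(ckpt_path) as f:
            offset = int(f.read().strip() or 0)
    with open(log_path, "rb") as log:
        log.seek(offset)
        for line in log:
            ship(line)              # e.g. send the event to the collector
            offset += len(line)
            # Commit the offset only after a successful send, so a crash
            # re-sends at most the one in-flight line.
            with open(ckpt_path, "w") as f:
                f.write(str(offset))
    return offset
```

A real implementation would also have to handle log rotation (e.g. detect a shrinking file or a changed inode and reset the checkpoint), which is part of why tailing is, as JS puts it, a tricky job.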

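For scale, the throughput figures quoted in the thread work out as follows; this is only a back-of-the-envelope check, assuming "MB" means the same unit in both measurements:

```python
# Flume (batched + gzipped) vs. raw scp, from the numbers in the thread.
flume_mb_per_min = 300 / 3      # 300MB shipped in about 3 minutes
scp_mb_per_min = 30 * 60        # scp at about 30MB/second
ratio = scp_mb_per_min / flume_mb_per_min
print(flume_mb_per_min, scp_mb_per_min, ratio)  # 100.0 1800 18.0
```

So the tuned Flume pipeline is roughly 18x slower than raw scp here, though the comparison is not apples-to-apples: scp does no per-event processing, acking, or HDFS writes.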