On Jan 29, 2013, at 8:24 AM, 周梦想 <[email protected]> wrote:

hello,

1. I want to tail a log source and write it to HDFS. Below is the configuration:

config [ag1, tail("/home/zhouhh/game.log", startFromEnd=true), agentDFOSink("hadoop48", 35853)]
config [ag2, tail("/home/zhouhh/game.log", startFromEnd=true), agentDFOSink("hadoop48", 35853)]
config [co1, collectorSource(35853), [collectorSink("hdfs://hadoop48:54310/user/flume/%y%m/%d", "%{host}-", 5000, raw), collectorSink("hdfs://hadoop48:54310/user/flume/%y%m", "%{host}-", 10000, raw)]]

I found that if I restart the agent node, it resends the whole content of game.log to the collector. Is there a way to send only the logs that have not been sent before? Or do I have to make a mark myself, or remove the logs manually, whenever I restart the agent node?

2. I tested the performance of Flume and found it a bit slow. With the configuration above I get only about 50MB/minute. I changed the configuration to:

ag1: tail("/home/zhouhh/game.log", startFromEnd=true) | batch(1000) gzip agentDFOSink("hadoop48", 35853);

with the same collector configuration as above. I sent a 300MB log and it took about 3 minutes, so roughly 100MB/minute. When I copy the same log from ag1 to co1 via scp, I get about 30MB/second. Can anyone give me some ideas?

thanks!

Andy
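On the "make a mark myself" idea in question 1: one way is to tail the file while persisting the last-read byte offset, so that a restart resumes where the previous run stopped. A minimal sketch in plain Java follows; the checkpoint path, its format, and the ship() stub are invented for illustration, and this is not Flume code.

    import java.io.File;
    import java.io.RandomAccessFile;
    import java.nio.charset.StandardCharsets;
    import java.nio.file.Files;
    import java.nio.file.Paths;

    // Tails a log file and persists the last-read byte offset, so a restart
    // resumes where the previous run stopped instead of resending everything.
    public class CheckpointedTail {
        public static void main(String[] args) throws Exception {
            File log = new File("/home/zhouhh/game.log");        // file to tail
            File ckpt = new File("/home/zhouhh/game.log.ckpt");  // hypothetical checkpoint file

            long offset = 0;
            if (ckpt.exists()) {  // resume from the saved offset, if any
                offset = Long.parseLong(new String(
                    Files.readAllBytes(ckpt.toPath()), StandardCharsets.UTF_8).trim());
            }

            try (RandomAccessFile raf = new RandomAccessFile(log, "r")) {
                raf.seek(Math.min(offset, raf.length()));  // guard against truncation
                while (true) {
                    String line = raf.readLine();
                    if (line == null) {        // at EOF: wait for new data
                        Thread.sleep(1000);
                        continue;
                    }
                    ship(line);                // hand the event downstream (stub)
                    // Record how far we have shipped; written after each event for
                    // clarity, a real version would batch this.
                    Files.write(Paths.get(ckpt.getPath()),
                        Long.toString(raf.getFilePointer()).getBytes(StandardCharsets.UTF_8));
                }
            }
        }

        // Stub: a real agent would send the line to the collector,
        // e.g. over Thrift as discussed later in this thread.
        static void ship(String line) {
            System.out.println(line);
        }
    }

This does not remove duplication entirely (an event can be shipped but not yet checkpointed when the process dies), but it roughly bounds the resend window to one event per restart rather than the whole file.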
On 1/29/13 4:29 PM, Alexander Alten-Lorenz wrote:

Hi,

You could use tail -F, but that depends on the external source; Flume has no control over it. You can also write your own script and include that.

What is the content of the /tmp/flume/agent/agent*.*/ directories? Are sent and sending clean?

- Alex

--
Alexander Alten-Lorenz
http://mapredit.blogspot.com
German Hadoop LinkedIn Group: http://goo.gl/N8pCF

2013/1/29 Jeong-shik Jang <[email protected]>:

Hi Andy,

Since you set the startFromEnd option to true, the resend is probably caused by the DFO mechanism (agentDFOSink): when you restart a Flume node in DFO mode, all events in the various stages (logged, writing, sending and so on) roll back to the logged stage, which means resending and duplication.

Also, for better performance, you may want to use agentBESink instead of agentDFOSink. If you have multiple collectors, I recommend agentBEChain for failover in case of a failure in the collector tier.

-JS

--
Jeong-shik Jang / [email protected]
Gruter, Inc., R&D Team Leader
www.gruter.com
Enjoy Connecting

On 2/4/13 4:27 PM, 周梦想 wrote:

Hi JS,

We can't accept agentBESink. These logs are important for data analysis, and we can't afford any errors in the data; losing data and duplicating it are both unacceptable.

One agent's configuration is: tail("H:/game.log", startFromEnd=true) agentSink("hadoop48", 35853)

Every time this Windows agent restarts, it resends all the data to the collector server, and if we restart the agent node for any reason, we have no mark of how far into the log the agent has already sent.

2013/2/4 Jeong-shik Jang <[email protected]>:

Hi Andy,

1. "startFromEnd=true" in your source configuration means data can go missing at restart on the tail side, because Flume ignores any event generated while the node is down and always starts again at the end of the file.
2. With agentSink, data duplication can happen due to ack delay from the master, or at agent restart.

I think this is why Flume-NG no longer supports tail and instead lets the user handle it with a script or program; tailing is a tricky job.

My suggestion is to use agentBEChain in the agent tier and DFO in the collector tier; you can still lose some data during failover when a failure occurs. To minimize loss and duplication, implementing a checkpoint function in tail can also help.

Having a monitoring system that detects failures is very important as well, so that you notice a failure and can react quickly to recover.

-JS
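A sketch of what that suggestion could look like, in the same configuration syntax used earlier in this thread. The second collector host (hadoop49) is hypothetical, and the exact agentBEChain argument format should be checked against the Flume 0.9.x user guide:

    config [ag1, tail("/home/zhouhh/game.log", startFromEnd=true), agentBEChain("hadoop48:35853", "hadoop49:35853")]
    config [co1, collectorSource(35853), collectorSink("hdfs://hadoop48:54310/user/flume/%y%m/%d", "%{host}-", 5000, raw)]
    config [co2, collectorSource(35853), collectorSink("hdfs://hadoop48:54310/user/flume/%y%m/%d", "%{host}-", 5000, raw)]

The agent sends best-effort to hadoop48 and fails over to hadoop49 if it is unreachable; per JS's suggestion, the disk-failover behavior then lives in the collector tier rather than in the agents.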
On 2/4/13 5:07 PM, 周梦想 wrote:

Hi JS,

Thank you for your reply. So collecting logs by tailing them with Flume has serious shortcomings. Can I write my own agent that sends logs via the Thrift protocol directly to the collector server?

Best Regards,
Andy Zhou

On 2013-2-4, at 4:13 PM, Jeong-shik Jang <[email protected]> wrote:

Yes, you can; the Flume plugin framework provides an easy way to implement and apply your own source, decorator, and sink.

-JS

2013/2/4 GuoWei <[email protected]>:

Exactly, it's very easy to write a custom source or sink in Flume. You can just follow one of the default sources or sinks as an example, export it as a jar file into Flume's lib folder, and then configure your own source or sink in the Flume conf file.

Best Regards,
Guo Wei

On 2/4/13, 周梦想 wrote:

Thanks to JS and GuoWei, I'll try this.

Best Regards,
Andy Zhou
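For reference, a rough skeleton of the kind of plugin GuoWei describes, modeled from memory on the HelloWorldSink example in the Flume 0.9.x user guide; the class and method names should be double-checked against the Flume source, and "custom"/"MySink"/"mySink" are invented names.

    package custom;

    import java.io.IOException;
    import java.util.ArrayList;
    import java.util.List;

    import com.cloudera.flume.conf.Context;
    import com.cloudera.flume.conf.SinkFactory.SinkBuilder;
    import com.cloudera.flume.core.Event;
    import com.cloudera.flume.core.EventSink;
    import com.cloudera.util.Pair;

    // Skeleton of a custom sink, following the shape of the default sinks.
    public class MySink extends EventSink.Base {
        @Override
        public void open() throws IOException {
            // acquire resources (connections, file handles, ...)
        }

        @Override
        public void append(Event e) throws IOException {
            // handle one event; e.getBody() is the raw payload
            System.out.println(new String(e.getBody()));
        }

        @Override
        public void close() throws IOException {
            // release resources
        }

        public static SinkBuilder builder() {
            return new SinkBuilder() {
                @Override
                public EventSink build(Context ctx, String... argv) {
                    // argv holds the arguments given in the conf file
                    return new MySink();
                }
            };
        }

        // Flume calls this to discover the sinks this plugin provides;
        // "mySink" becomes the name usable in the flume conf file.
        public static List<Pair<String, SinkBuilder>> getSinkBuilders() {
            List<Pair<String, SinkBuilder>> builders =
                new ArrayList<Pair<String, SinkBuilder>>();
            builders.add(new Pair<String, SinkBuilder>("mySink", builder()));
            return builders;
        }
    }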

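If memory serves, the jar then goes into Flume's lib directory, as GuoWei says, and the plugin class also has to be registered in flume-site.xml before the new name can be used:

    <property>
      <name>flume.plugin.classes</name>
      <value>custom.MySink</value>
    </property>

After that, a configuration such as config [ag1, tail("/home/zhouhh/game.log", startFromEnd=true), mySink] should resolve the new sink name.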