You can set all of the time/size-based rolling policies to zero and set an idle timeout on the sink. The config below uses a 15-minute timeout:
agent.sinks.sink.hdfs.fileSuffix = FlumeData.%Y-%m-%d
agent.sinks.sink.hdfs.fileType = DataStream
agent.sinks.sink.hdfs.rollInterval = 0
agent.sinks.sink.hdfs.rollSize = 0
agent.sinks.sink.hdfs.batchSize = 0
agent.sinks.sink.hdfs.rollCount = 0
agent.sinks.sink.hdfs.idleTimeout = 900

On Tue, Oct 22, 2013 at 10:17 PM, Martinus m <[email protected]> wrote:

> Hi David,
>
> The requirement is actually only to roll per day.
>
> Hi Devin,
>
> Thanks for sharing your experience. I also tried setting the config as follows:
>
> agent.sinks.sink.hdfs.fileSuffix = FlumeData.%Y-%m-%d
> agent.sinks.sink.hdfs.fileType = DataStream
> agent.sinks.sink.hdfs.rollInterval = 0
> agent.sinks.sink.hdfs.rollSize = 0
> agent.sinks.sink.hdfs.batchSize = 15000
> agent.sinks.sink.hdfs.rollCount = 0
>
> But I didn't see anything in the S3 bucket, so I guess I need to change rollInterval to 86400. In my understanding, rollInterval 86400 will roll the file after 24 hours like you said, but it will not start a new file when the day changes before the 24-hour interval has elapsed (unless we put the date into fileSuffix as above).
>
> Thanks to both of you.
>
> Best regards,
>
> Martinus
>
>
> On Tue, Oct 22, 2013 at 11:16 PM, DSuiter RDX <[email protected]> wrote:
>
>> Martinus, you have to set all the other roll options to 0 explicitly in the configuration if you want it to roll on only one parameter; otherwise it will roll on whichever trigger it can satisfy first. If you want it to roll once a day, you have to specifically disable all the other roll triggers - they all take default settings unless told otherwise. When I was experimenting, for example, it kept rolling every 30 seconds even though I had hdfs.rollSize set to 64 MB (our test data is generated slowly). So I ended up with a pile of small (0.2 KB to ~19 KB) files in a bunch of directories sorted by timestamp in ten-minute intervals.
>>
>> So, maybe a conf like this:
>>
>> agent.sinks.sink.type = hdfs
>> agent.sinks.sink.channel = channel
>> agent.sinks.sink.hdfs.path = (desired path string, yours looks fine)
>> agent.sinks.sink.hdfs.fileSuffix = .avro
>> agent.sinks.sink.serializer = avro_event
>> agent.sinks.sink.hdfs.fileType = DataStream
>> agent.sinks.sink.hdfs.rollInterval = 86400
>> agent.sinks.sink.hdfs.rollSize = 134217728
>> agent.sinks.sink.hdfs.batchSize = 15000
>> agent.sinks.sink.hdfs.rollCount = 0
>>
>> This one will roll in HDFS at 24-hour intervals or at a 128 MB file size, and will flush to HDFS every 15000 events. But if the hdfs.rollCount line were not set to "0" or some higher value (I probably could have set it to 15000 to match hdfs.batchSize for the same result), the file would roll as soon as the default of only 10 events had been written to it.
>>
>> Are you using a 1-tier or 2-tier design for this? For syslogTCP, we collect from syslogTCP, which comes from the remote hosts. It then goes to an Avro sink to aggregate the small event entries into larger Avro files. Then a second tier collects that with an Avro source and an HDFS sink. So we get them all as individual events streamed into an Avro container, and the Avro container is put into HDFS every 24 hours or when it hits 128 MB. We were getting many small files because of the low velocity of our sample set, and we did not want to clutter up FSImage. The avro_event serializer and DataStream file type are also necessary, because the default behavior of the HDFS sink is to write SequenceFile format.
>>
>> Hope this helps you out.
>>
>> Sincerely,
>> Devin Suiter
>> Jr. Data Solutions Software Engineer
>> 100 Sandusky Street | 2nd Floor | Pittsburgh, PA 15212
>> Google Voice: 412-256-8556 | www.rdx.com
>>
>>
>> On Tue, Oct 22, 2013 at 10:07 AM, David Sinclair <[email protected]> wrote:
>>
>>> Do you need to roll based on size as well? Can you tell me the requirements?
>>>
>>>
>>> On Tue, Oct 22, 2013 at 2:15 AM, Martinus m <[email protected]> wrote:
>>>
>>>> Hi David,
>>>>
>>>> Thanks for your answer. I already did that, but using %Y-%m-%d. However, since there were still rolls based on size, it kept generating two or more FlumeData.%Y-%m-%d files with different suffixes.
>>>>
>>>> Thanks.
>>>>
>>>> Martinus
>>>>
>>>>
>>>> On Fri, Oct 18, 2013 at 10:35 PM, David Sinclair <[email protected]> wrote:
>>>>
>>>>> The SyslogTcpSource will put a header on the Flume event named 'timestamp'. This timestamp will be taken from the syslog entry. You could then set the filePrefix in the sink to pull this out. For example:
>>>>>
>>>>> tier1.sinks.hdfsSink.hdfs.filePrefix = FlumeData.%{timestamp}
>>>>>
>>>>> dave
>>>>>
>>>>>
>>>>> On Thu, Oct 17, 2013 at 10:23 PM, Martinus m <[email protected]> wrote:
>>>>>
>>>>>> Hi David,
>>>>>>
>>>>>> It's syslogtcp.
>>>>>>
>>>>>> Thanks.
>>>>>>
>>>>>> Martinus
>>>>>>
>>>>>>
>>>>>> On Thu, Oct 17, 2013 at 9:17 PM, David Sinclair <[email protected]> wrote:
>>>>>>
>>>>>>> What type of source are you using?
>>>>>>>
>>>>>>>
>>>>>>> On Wed, Oct 16, 2013 at 9:56 PM, Martinus m <[email protected]> wrote:
>>>>>>>
>>>>>>>> Hi,
>>>>>>>>
>>>>>>>> Is there any option in the HDFS sink so that I can start rolling a new file whenever the date in the log changes? For example, I have the logs below:
>>>>>>>>
>>>>>>>> Oct 16 23:58:56 test-host : just test
>>>>>>>> Oct 16 23:59:51 test-host : test again
>>>>>>>> Oct 17 00:00:56 test-host : just test
>>>>>>>> Oct 17 00:00:56 test-host : test again
>>>>>>>>
>>>>>>>> Then I want it to produce files in the S3 bucket like this:
>>>>>>>>
>>>>>>>> FlumeData.2013-10-16.1381916293017 <-- all the logs from Oct 16, 2013 go here, and when it reaches Oct 17, 2013, it starts to sink into a new file:
>>>>>>>>
>>>>>>>> FlumeData.2013-10-17.1381940047117
>>>>>>>>
>>>>>>>> Thanks.
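Pulling the suggestions in this thread together, a minimal single-agent sketch for one file per calendar day might look like the config below. The agent, channel, and sink names, the listen port, and the bucket path are illustrative placeholders rather than values from the thread; the syslogtcp source supplies the 'timestamp' header that the %Y-%m-%d escapes need.

agent.sources = syslog
agent.channels = mem
agent.sinks = s3

# syslogtcp source: each event carries a 'timestamp' header taken from the syslog entry
agent.sources.syslog.type = syslogtcp
agent.sources.syslog.host = 0.0.0.0
agent.sources.syslog.port = 5140
agent.sources.syslog.channels = mem

agent.channels.mem.type = memory
agent.channels.mem.capacity = 100000

# One file per day: the date is encoded in the file name, size/count/interval rolling
# is disabled, and idleTimeout closes the previous day's file once it goes quiet
agent.sinks.s3.type = hdfs
agent.sinks.s3.channel = mem
# hypothetical bucket path
agent.sinks.s3.hdfs.path = s3n://my-bucket/flume
agent.sinks.s3.hdfs.filePrefix = FlumeData.%Y-%m-%d
agent.sinks.s3.hdfs.fileType = DataStream
agent.sinks.s3.hdfs.rollInterval = 0
agent.sinks.s3.hdfs.rollSize = 0
agent.sinks.s3.hdfs.rollCount = 0
agent.sinks.s3.hdfs.idleTimeout = 900

With this, events stamped Oct 16 land in FlumeData.2013-10-16.<epoch-counter>, the first event stamped Oct 17 opens FlumeData.2013-10-17.<epoch-counter>, and 900 seconds after the last Oct 16 event the old file is closed.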
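For the 2-tier layout Devin describes (syslogTCP collection forwarded over an Avro hop, then an aggregator writing Avro containers to HDFS), a rough sketch is below. The agent names, host names, and ports are assumptions for illustration; the tier-2 sink simply reuses the settings from his sample conf.

# Tier 1: runs near the syslog senders and forwards events to the aggregator over Avro RPC
collector1.sources = syslog
collector1.channels = ch
collector1.sinks = toTier2

collector1.sources.syslog.type = syslogtcp
collector1.sources.syslog.host = 0.0.0.0
collector1.sources.syslog.port = 5140
collector1.sources.syslog.channels = ch

collector1.channels.ch.type = memory

collector1.sinks.toTier2.type = avro
collector1.sinks.toTier2.channel = ch
# hypothetical aggregator host and port
collector1.sinks.toTier2.hostname = tier2.example.com
collector1.sinks.toTier2.port = 4141

# Tier 2: receives the Avro stream and writes Avro containers into HDFS,
# rolling every 24 hours or at 128 MB
tier2.sources = fromTier1
tier2.channels = ch
tier2.sinks = hdfsSink

tier2.sources.fromTier1.type = avro
tier2.sources.fromTier1.bind = 0.0.0.0
tier2.sources.fromTier1.port = 4141
tier2.sources.fromTier1.channels = ch

tier2.channels.ch.type = memory

tier2.sinks.hdfsSink.type = hdfs
tier2.sinks.hdfsSink.channel = ch
# hypothetical destination path
tier2.sinks.hdfsSink.hdfs.path = hdfs://namenode/flume/events
tier2.sinks.hdfsSink.hdfs.fileSuffix = .avro
tier2.sinks.hdfsSink.serializer = avro_event
tier2.sinks.hdfsSink.hdfs.fileType = DataStream
tier2.sinks.hdfsSink.hdfs.rollInterval = 86400
tier2.sinks.hdfsSink.hdfs.rollSize = 134217728
tier2.sinks.hdfsSink.hdfs.batchSize = 15000
tier2.sinks.hdfsSink.hdfs.rollCount = 0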
