Martinus, if you want the sink to roll on only one parameter, you have to explicitly set all the other roll options to 0 in the configuration; otherwise it will roll on whichever trigger it hits first. If you want it to roll once a day, you have to specifically disable all the other roll triggers, because they all keep their default settings unless told otherwise. When I was experimenting, for example, it kept rolling every 30 seconds (the hdfs.rollInterval default) even though I had hdfs.rollSize set to 64 MB, since our test data is generated slowly. I ended up with a pile of small (0.2 KB - 19 KB) files in a bunch of directories bucketed by timestamp in ten-minute intervals.
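For instance, a minimal sketch for a daily roll only, using the same generic agent and sink names as the conf below; the zero values disable the size and count triggers, which otherwise default to 1024 bytes and 10 events:

agent.sinks.sink.hdfs.rollInterval = 86400
agent.sinks.sink.hdfs.rollSize = 0
agent.sinks.sink.hdfs.rollCount = 0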
So, maybe a conf like this:

agent.sinks.sink.type = hdfs
agent.sinks.sink.channel = channel
agent.sinks.sink.hdfs.path = (desired path string, yours looks fine)
agent.sinks.sink.hdfs.fileSuffix = .avro
agent.sinks.sink.serializer = avro_event
agent.sinks.sink.hdfs.fileType = DataStream
agent.sinks.sink.hdfs.rollInterval = 86400
agent.sinks.sink.hdfs.rollSize = 134217728
agent.sinks.sink.hdfs.batchSize = 15000
agent.sinks.sink.hdfs.rollCount = 0

This one rolls the file in HDFS every 24 hours or when it reaches 128 MB, whichever comes first, and flushes events to HDFS in batches of 15000 (hdfs.batchSize controls flushing, not rolling). The key line is hdfs.rollCount = 0 (I probably could have set it to 15000 to match hdfs.batchSize for much the same result); if it were left at its default, the file would roll as soon as only 10 events had been written to it.

Are you using a 1-tier or 2-tier design for this? For syslog, we collect with a syslogtcp source from the remote hosts and send the events to an Avro sink, which aggregates the small event entries. A second tier then collects that with an Avro source and writes it out through an HDFS sink. So all the individual events are streamed into an Avro container, and the container is put into HDFS every 24 hours or when it hits 128 MB. We were getting many small files because of the low velocity of our sample data, and we did not want to clutter up the FSImage. The avro_event serializer and the DataStream fileType are also necessary, because the default behavior of the HDFS sink is to write SequenceFile format. A rough sketch of that two-tier layout is below.
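Roughly, the two tiers look like this (the agent and channel names, ports, hostname, and path here are just illustrative placeholders, not our exact settings):

# Tier 1: syslogtcp source on the edge, forwarding over Avro
tier1.sources = syslog
tier1.channels = ch1
tier1.sinks = avroOut
tier1.sources.syslog.type = syslogtcp
tier1.sources.syslog.host = 0.0.0.0
tier1.sources.syslog.port = 5140
tier1.sources.syslog.channels = ch1
tier1.channels.ch1.type = memory
tier1.sinks.avroOut.type = avro
tier1.sinks.avroOut.hostname = collector-host
tier1.sinks.avroOut.port = 4545
tier1.sinks.avroOut.channel = ch1

# Tier 2: Avro source feeding the HDFS sink that writes the Avro container
tier2.sources = avroIn
tier2.channels = ch1
tier2.sinks = hdfsOut
tier2.sources.avroIn.type = avro
tier2.sources.avroIn.bind = 0.0.0.0
tier2.sources.avroIn.port = 4545
tier2.sources.avroIn.channels = ch1
tier2.channels.ch1.type = memory
tier2.sinks.hdfsOut.type = hdfs
tier2.sinks.hdfsOut.channel = ch1
tier2.sinks.hdfsOut.hdfs.path = (desired path string)
tier2.sinks.hdfsOut.hdfs.fileSuffix = .avro
tier2.sinks.hdfsOut.hdfs.fileType = DataStream
tier2.sinks.hdfsOut.serializer = avro_event
tier2.sinks.hdfsOut.hdfs.rollInterval = 86400
tier2.sinks.hdfsOut.hdfs.rollSize = 134217728
tier2.sinks.hdfsOut.hdfs.rollCount = 0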
Hope this helps you out.

Sincerely,

Devin Suiter
Jr. Data Solutions Software Engineer
100 Sandusky Street | 2nd Floor | Pittsburgh, PA 15212
Google Voice: 412-256-8556 | www.rdx.com


On Tue, Oct 22, 2013 at 10:07 AM, David Sinclair <[email protected]> wrote:

> Do you need to roll based on size as well? Can you tell me the
> requirements?
>
>
> On Tue, Oct 22, 2013 at 2:15 AM, Martinus m <[email protected]> wrote:
>
>> Hi David,
>>
>> Thanks for your answer. I already did that, but using %Y-%m-%d. But,
>> since there is still a roll based on size, it will keep generating two or
>> more FlumeData.%Y-%m-%d files with different postfixes.
>>
>> Thanks.
>>
>> Martinus
>>
>>
>> On Fri, Oct 18, 2013 at 10:35 PM, David Sinclair <[email protected]> wrote:
>>
>>> The SyslogTcpSource will put a header on the flume event named
>>> 'timestamp'. This timestamp will be from the syslog entry. You could then
>>> set the filePrefix in the sink to grab this out.
>>> For example
>>>
>>> tier1.sinks.hdfsSink.hdfs.filePrefix = FlumeData.%{timestamp}
>>>
>>> dave
>>>
>>>
>>> On Thu, Oct 17, 2013 at 10:23 PM, Martinus m <[email protected]> wrote:
>>>
>>>> Hi David,
>>>>
>>>> It's syslogtcp.
>>>>
>>>> Thanks.
>>>>
>>>> Martinus
>>>>
>>>>
>>>> On Thu, Oct 17, 2013 at 9:17 PM, David Sinclair <[email protected]> wrote:
>>>>
>>>>> What type of source are you using?
>>>>>
>>>>>
>>>>> On Wed, Oct 16, 2013 at 9:56 PM, Martinus m <[email protected]> wrote:
>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> Is there any option in the HDFS sink that I can use to start rolling a new file
>>>>>> whenever the date in the log changes? For example, I got the logs below:
>>>>>>
>>>>>> Oct 16 23:58:56 test-host : just test
>>>>>> Oct 16 23:59:51 test-host : test again
>>>>>> Oct 17 00:00:56 test-host : just test
>>>>>> Oct 17 00:00:56 test-host : test again
>>>>>>
>>>>>> Then I want it to make a file on an S3 bucket with a result like this:
>>>>>>
>>>>>> FlumeData.2013-10-16.1381916293017 <-- all the logs with Oct 16 of
>>>>>> this year 2013 will go here, and when it reaches Oct 17 2013,
>>>>>> it will start to sink into the new file below:
>>>>>>
>>>>>> FlumeData.2013-10-17.1381940047117
>>>>>>
>>>>>> Thanks.
