Does the metrics endpoint show that events are still coming into this sink?

http://<hostname of agent>:41414/metrics (e.g. http://falcon:41414/metrics)

Also, can you post the rest of the config?
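For reference, the /metrics endpoint is only served when the agent was started with HTTP/JSON monitoring enabled. A minimal sketch of the launch command, assuming the standard flume-ng launcher and placeholder config-file and agent names:

bin/flume-ng agent \
  --conf conf --conf-file conf/agent.conf --name agent \
  -Dflume.monitoring.type=http \
  -Dflume.monitoring.port=41414

The endpoint then returns JSON counters per source, channel, and sink, which is what the question above is asking Martinus to check.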
On Thu, Oct 24, 2013 at 10:09 PM, Martinus m <[email protected]> wrote:

Hi David,

Almost every few seconds.

Thanks.

Martinus

On Thu, Oct 24, 2013 at 9:49 PM, David Sinclair <[email protected]> wrote:

How often are your events coming in?

On Thu, Oct 24, 2013 at 2:21 AM, Martinus m <[email protected]> wrote:

Hi David,

Thanks for the example. I have set it just like above, but it only generates files for the first 15 minutes. After waiting for more than one hour, there is no update at all in the S3 bucket.

Thanks.

Martinus

On Wed, Oct 23, 2013 at 8:48 PM, David Sinclair <[email protected]> wrote:

You can set all of the time/size-based rolling policies to zero and set an idle timeout on the sink. The config below uses a 15-minute timeout:

agent.sinks.sink.hdfs.fileSuffix = FlumeData.%Y-%m-%d
agent.sinks.sink.hdfs.fileType = DataStream
agent.sinks.sink.hdfs.rollInterval = 0
agent.sinks.sink.hdfs.rollSize = 0
agent.sinks.sink.hdfs.batchSize = 0
agent.sinks.sink.hdfs.rollCount = 0
agent.sinks.sink.hdfs.idleTimeout = 900
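One detail worth noting about the %Y-%m-%d escape used above: the HDFS sink resolves escape sequences from a timestamp header on each event, so the events must carry one (the syslog source adds it, as mentioned later in this thread). A minimal sketch of the usual fallbacks, using a hypothetical source name src1 rather than anything from this thread, is either a timestamp interceptor on the source or the sink's local-clock option:

agent.sources.src1.interceptors = ts
agent.sources.src1.interceptors.ts.type = timestamp
# or, alternatively, let the sink use the agent's own clock:
agent.sinks.sink.hdfs.useLocalTimeStamp = true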
On Tue, Oct 22, 2013 at 10:17 PM, Martinus m <[email protected]> wrote:

Hi David,

The requirement is actually only to roll per day.

Hi Devin,

Thanks for sharing your experience. I also tried to set the config as follows:

agent.sinks.sink.hdfs.fileSuffix = FlumeData.%Y-%m-%d
agent.sinks.sink.hdfs.fileType = DataStream
agent.sinks.sink.hdfs.rollInterval = 0
agent.sinks.sink.hdfs.rollSize = 0
agent.sinks.sink.hdfs.batchSize = 15000
agent.sinks.sink.hdfs.rollCount = 0

But I didn't see anything in the S3 bucket. So I guess I need to change the rollInterval to 86400. In my understanding, rollInterval 86400 will roll the file after 24 hours like you said, but it will not start a new file when the date changes before the 24-hour interval has elapsed (unless we put the date into the fileSuffix as above).

Thanks to both of you.

Best regards,

Martinus

On Tue, Oct 22, 2013 at 11:16 PM, DSuiter RDX <[email protected]> wrote:

Martinus, you have to set all the other roll options to 0 explicitly in the configuration if you want the sink to roll on only one parameter; it will roll on the shortest trigger it can meet. If you want it to roll once a day, you have to specifically disable all the other roll triggers - they all take default settings unless told not to. When I was experimenting, for example, it kept rolling every 30 seconds even though I had hdfs.rollSize set to 64 MB (our test data is generated slowly). So I ended up with a pile of small (0.2 KB - ~19 KB) files in a bunch of directories sorted by timestamp in ten-minute intervals.

So, maybe a conf like this:

agent.sinks.sink.type = hdfs
agent.sinks.sink.channel = channel
agent.sinks.sink.hdfs.path = (desired path string, yours looks fine)
agent.sinks.sink.hdfs.fileSuffix = .avro
agent.sinks.sink.serializer = avro_event
agent.sinks.sink.hdfs.fileType = DataStream
agent.sinks.sink.hdfs.rollInterval = 86400
agent.sinks.sink.hdfs.rollSize = 134217728
agent.sinks.sink.hdfs.batchSize = 15000
agent.sinks.sink.hdfs.rollCount = 0

This one will roll in HDFS at 24-hour intervals, or at a 128 MB file size, and will write events to HDFS in batches of 15000; but if the hdfs.rollCount line were not set to "0" or some higher value (I probably could have set it to 15000 to match hdfs.batchSize for the same result), the file would roll as soon as the default of only 10 events had been written to it.

Are you using a 1-tier or 2-tier design for this? For syslogTCP, we collect with a syslogTCP source fed from remote hosts. It then goes to an Avro sink to aggregate the small event entries into larger Avro files. A second tier then collects that with an Avro source and an HDFS sink. So we get the events streamed individually into an Avro container, and the Avro container is put into HDFS every 24 hours or when it hits 128 MB. We were getting many small files because of the low velocity of our sample set, and we did not want to clutter up the FSImage. The avro_event serializer and DataStream file type are also necessary, because the default behavior of the HDFS sink is to write SequenceFile format.

Hope this helps you out.

Sincerely,
Devin Suiter
Jr. Data Solutions Software Engineer
100 Sandusky Street | 2nd Floor | Pittsburgh, PA 15212
Google Voice: 412-256-8556 | www.rdx.com
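A rough properties sketch of the two-tier layout Devin describes - a syslogTCP source feeding an Avro hop, and a second agent with an Avro source in front of the HDFS sink. The agent names (tier1, tier2), hostnames, ports, bucket path, and channel settings below are placeholders, not values from this thread:

# Tier 1: receive syslog, forward over Avro
tier1.sources = syslog
tier1.channels = ch1
tier1.sinks = avroOut
tier1.sources.syslog.type = syslogtcp
tier1.sources.syslog.host = 0.0.0.0
tier1.sources.syslog.port = 5140
tier1.sources.syslog.channels = ch1
tier1.channels.ch1.type = memory
tier1.sinks.avroOut.type = avro
tier1.sinks.avroOut.hostname = collector.example.com
tier1.sinks.avroOut.port = 4141
tier1.sinks.avroOut.channel = ch1

# Tier 2: receive Avro, write aggregated files to HDFS/S3
tier2.sources = avroIn
tier2.channels = ch1
tier2.sinks = hdfsSink
tier2.sources.avroIn.type = avro
tier2.sources.avroIn.bind = 0.0.0.0
tier2.sources.avroIn.port = 4141
tier2.sources.avroIn.channels = ch1
tier2.channels.ch1.type = memory
tier2.sinks.hdfsSink.type = hdfs
tier2.sinks.hdfsSink.channel = ch1
tier2.sinks.hdfsSink.hdfs.path = s3n://bucket/logs/%Y-%m-%d
tier2.sinks.hdfsSink.hdfs.fileType = DataStream
tier2.sinks.hdfsSink.serializer = avro_event
tier2.sinks.hdfsSink.hdfs.rollInterval = 86400
tier2.sinks.hdfsSink.hdfs.rollSize = 134217728
tier2.sinks.hdfsSink.hdfs.rollCount = 0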
On Tue, Oct 22, 2013 at 10:07 AM, David Sinclair <[email protected]> wrote:

Do you need to roll based on size as well? Can you tell me the requirements?

On Tue, Oct 22, 2013 at 2:15 AM, Martinus m <[email protected]> wrote:

Hi David,

Thanks for your answer. I already did that, but using %Y-%m-%d. However, since it still rolls based on size, it keeps generating two or more FlumeData.%Y-%m-%d files with different suffixes.

Thanks.

Martinus

On Fri, Oct 18, 2013 at 10:35 PM, David Sinclair <[email protected]> wrote:

The SyslogTcpSource will put a header on the Flume event named 'timestamp'. This timestamp will be taken from the syslog entry. You could then set the filePrefix in the sink to pull it out. For example:

tier1.sinks.hdfsSink.hdfs.filePrefix = FlumeData.%{timestamp}

dave

On Thu, Oct 17, 2013 at 10:23 PM, Martinus m <[email protected]> wrote:

Hi David,

It's syslogtcp.

Thanks.

Martinus

On Thu, Oct 17, 2013 at 9:17 PM, David Sinclair <[email protected]> wrote:

What type of source are you using?

On Wed, Oct 16, 2013 at 9:56 PM, Martinus m <[email protected]> wrote:

Hi,

Is there any option in the HDFS sink so that I can start rolling a new file whenever the date in the log changes? For example, I have the logs below:

Oct 16 23:58:56 test-host : just test
Oct 16 23:59:51 test-host : test again
Oct 17 00:00:56 test-host : just test
Oct 17 00:00:56 test-host : test again

I then want it to produce files in the S3 bucket like this:

FlumeData.2013-10-16.1381916293017  <-- all the Oct 16, 2013 logs go here, and once Oct 17, 2013 is reached, it starts sinking into a new file:

FlumeData.2013-10-17.1381940047117

Thanks.
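Pulling the pieces of this thread together, one hedged sketch of a per-day setup: put the date escape in hdfs.filePrefix (or hdfs.path) so events are bucketed by day, zero out the size/count triggers, and rely on rollInterval and/or idleTimeout to close yesterday's file shortly after the day changes. The values below are illustrative, not taken from the thread:

agent.sinks.sink.type = hdfs
agent.sinks.sink.hdfs.path = s3n://bucket/logs
agent.sinks.sink.hdfs.filePrefix = FlumeData.%Y-%m-%d
agent.sinks.sink.hdfs.fileType = DataStream
agent.sinks.sink.hdfs.rollSize = 0
agent.sinks.sink.hdfs.rollCount = 0
agent.sinks.sink.hdfs.rollInterval = 86400
agent.sinks.sink.hdfs.idleTimeout = 600

With the prefix keyed to the day, events arriving after midnight resolve to a new file name on their own; the previous day's open file still needs rollInterval or idleTimeout to be closed before it becomes visible in the S3 bucket.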
