Hi David,

Here is my configuration file:

agent.sources = seqGenSrc
agent.channels = fileChannel
agent.sinks = s3Sink

# For each one of the sources, the type is defined
agent.sources.seqGenSrc.type = syslogtcp
agent.sources.seqGenSrc.port = 5140
agent.sources.seqGenSrc.host = localhost
agent.sources.seqGenSrc.keepFields = true

# The channel can be defined as follows.
agent.sources.seqGenSrc.channels = fileChannel

# Each sink's type must be defined
agent.sinks.s3Sink.type = hdfs

# Specify the channel the sink should use
agent.sinks.s3Sink.channel = fileChannel
agent.sinks.s3Sink.hdfs.path = s3n://awskeyid:awssecretkey@bucket_name/%{host}
agent.sinks.s3Sink.hdfs.filePrefix = FlumeData.%Y-%m-%d
agent.sinks.s3Sink.hdfs.rollInterval = 0
agent.sinks.s3Sink.hdfs.rollSize = 0
agent.sinks.s3Sink.hdfs.rollCount = 0
agent.sinks.s3Sink.hdfs.batchSize = 0
agent.sinks.s3Sink.hdfs.idleTimeout = 600
agent.sinks.s3Sink.hdfs.fileType = DataStream

# Each channel's type is defined.
agent.channels.fileChannel.type = file

# Other config values specific to each type of channel (sink or source)
# can be defined as well.
# In this case, it specifies the capacity of the file channel
agent.channels.fileChannel.capacity = 1000000
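
In case it helps, the agent is launched with the stock flume-ng script along
these lines (the file name flume.conf and the conf directory here are
illustrative; the agent name matches the "agent" prefix used above, and the
two -D flags are the standard options for exposing the JSON metrics endpoint
you mentioned):

flume-ng agent --conf ./conf --conf-file flume.conf --name agent \
  -Dflume.monitoring.type=http -Dflume.monitoring.port=41414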

Thanks.

Martinus


On Fri, Oct 25, 2013 at 10:20 PM, David Sinclair <
[email protected]> wrote:

> Does the metrics endpoint show that events are still coming into this sink?
>
> http://<hostname of agent>:41414/metrics (e.g. http://falcon:41414/metrics)
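>
> For example, something like this should show whether the sink is still
> draining events (the counter names below are the standard sink counters;
> substitute your agent's host and sink name):
>
> curl http://<hostname of agent>:41414/metrics
>
> Then compare EventDrainAttemptCount with EventDrainSuccessCount under
> SINK.<sink name>.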
>
> Also, can you post the rest of the config?
>
>
> On Thu, Oct 24, 2013 at 10:09 PM, Martinus m <[email protected]> wrote:
>
>> Hi David,
>>
>> Almost every few seconds.
>>
>> Thanks.
>>
>> Martinus
>>
>>
>> On Thu, Oct 24, 2013 at 9:49 PM, David Sinclair <
>> [email protected]> wrote:
>>
>>> How often are your events coming in?
>>>
>>>
>>> On Thu, Oct 24, 2013 at 2:21 AM, Martinus m <[email protected]> wrote:
>>>
>>>> Hi David,
>>>>
>>>> Thanks for the example. I set it up just like the above, but it only
>>>> generated files for the first 15 minutes. After waiting for more than one
>>>> hour, there is no update at all in the S3 bucket.
>>>>
>>>> Thanks.
>>>>
>>>> Martinus
>>>>
>>>>
>>>> On Wed, Oct 23, 2013 at 8:48 PM, David Sinclair <
>>>> [email protected]> wrote:
>>>>
>>>>> You can set all of the time/size-based rolling policies to zero and
>>>>> set an idle timeout on the sink. The config below has a 15-minute timeout:
>>>>>
>>>>> agent.sinks.sink.hdfs.fileSuffix = FlumeData.%Y-%m-%d
>>>>> agent.sinks.sink.hdfs.fileType = DataStream
>>>>> agent.sinks.sink.hdfs.rollInterval = 0
>>>>> agent.sinks.sink.hdfs.rollSize = 0
>>>>> agent.sinks.sink.hdfs.batchSize = 0
>>>>> agent.sinks.sink.hdfs.rollCount = 0
>>>>> agent.sinks.sink.hdfs.idleTimeout = 900
>>>>>
>>>>>
>>>>>
>>>>> On Tue, Oct 22, 2013 at 10:17 PM, Martinus m <[email protected]> wrote:
>>>>>
>>>>>> Hi David,
>>>>>>
>>>>>> The requirement is actually only to roll once per day.
>>>>>>
>>>>>> Hi Devin,
>>>>>>
>>>>>> Thanks for sharing your experience. I also tried to set the config
>>>>>> as follows:
>>>>>>
>>>>>> agent.sinks.sink.hdfs.fileSuffix = FlumeData.%Y-%m-%d
>>>>>> agent.sinks.sink.hdfs.fileType = DataStream
>>>>>> agent.sinks.sink.hdfs.rollInterval = 0
>>>>>> agent.sinks.sink.hdfs.rollSize = 0
>>>>>> agent.sinks.sink.hdfs.batchSize = 15000
>>>>>> agent.sinks.sink.hdfs.rollCount = 0
>>>>>>
>>>>>> But I didn't see anything in the S3 bucket. So I guess I need to
>>>>>> change the rollInterval to 86400. In my understanding, rollInterval
>>>>>> 86400 will roll the file after 24 hours like you said, but it will
>>>>>> not start a new file when the day changes before a full 24-hour
>>>>>> interval has elapsed (unless we put a date pattern in the fileSuffix
>>>>>> as above).
>>>>>>
>>>>>> Thanks to both of you.
>>>>>>
>>>>>> Best regards,
>>>>>>
>>>>>> Martinus
>>>>>>
>>>>>>
>>>>>> On Tue, Oct 22, 2013 at 11:16 PM, DSuiter RDX <[email protected]> wrote:
>>>>>>
>>>>>>> Martinus, you have to set all of the other roll options to 0
>>>>>>> explicitly in the configuration if you want files to roll on only one
>>>>>>> parameter; otherwise the sink rolls on whichever trigger it hits
>>>>>>> first. If you want it to roll once a day, you have to specifically
>>>>>>> disable all of the other roll triggers - they all keep their default
>>>>>>> settings unless told otherwise. When I was experimenting, for
>>>>>>> example, it kept rolling every 30 seconds even though I had
>>>>>>> hdfs.rollSize set to 64 MB (our test data is generated slowly). So I
>>>>>>> ended up with a pile of small (0.2 KB - ~19 KB) files in a bunch of
>>>>>>> directories sorted by timestamp in ten-minute intervals.
>>>>>>>
>>>>>>> So, maybe a conf like this:
>>>>>>>
>>>>>>> agent.sinks.sink.type = hdfs
>>>>>>> agent.sinks.sink.channel = channel
>>>>>>> agent.sinks.sink.hdfs.path = (desired path string, yours looks fine)
>>>>>>> agent.sinks.sink.hdfs.fileSuffix = .avro
>>>>>>> agent.sinks.sink.serializer = avro_event
>>>>>>> agent.sinks.sink.hdfs.fileType = DataStream
>>>>>>> agent.sinks.sink.hdfs.rollInterval = 86400
>>>>>>> agent.sinks.sink.hdfs.rollSize = 134217728
>>>>>>> agent.sinks.sink.hdfs.batchSize = 15000
>>>>>>> agent.sinks.sink.hdfs.rollCount = 0
>>>>>>>
>>>>>>> This one will roll the file in HDFS at 24-hour intervals or at 128 MB
>>>>>>> of file size, and will flush to HDFS in batches of 15000 events; but
>>>>>>> if the hdfs.rollCount line were not set to "0" or some higher value
>>>>>>> (I probably could have set it to 15000 to match hdfs.batchSize for
>>>>>>> the same result), the file would roll as soon as the default of only
>>>>>>> 10 events had been written to it.
>>>>>>>
>>>>>>> Are you using a 1-tier or 2-tier design for this? For syslog, we
>>>>>>> collect with a syslogTCP source from the remote host. It then goes to
>>>>>>> an avro sink to aggregate the small event entries into larger avro
>>>>>>> files. Then a second tier collects that with an avro source and
>>>>>>> writes it out with an hdfs sink. So we get all the individual events
>>>>>>> streamed into an avro container, and the avro container is put into
>>>>>>> HDFS every 24 hours or when it hits 128 MB. We were getting many
>>>>>>> small files because of the low velocity of our sample set, and we did
>>>>>>> not want to clutter up the FSImage. The avro serializer and
>>>>>>> DataStream file type are also necessary, because the default behavior
>>>>>>> of the HDFS sink is to write SequenceFile format.
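>>>>>>>
>>>>>>> For illustration only, a stripped-down sketch of that two-tier layout
>>>>>>> might look something like this (agent names, channels, host, and port
>>>>>>> are placeholders, and the channel definitions are omitted):
>>>>>>>
>>>>>>> # tier 1: syslog in, avro out to the collector
>>>>>>> tier1.sources.syslog.type = syslogtcp
>>>>>>> tier1.sources.syslog.host = 0.0.0.0
>>>>>>> tier1.sources.syslog.port = 5140
>>>>>>> tier1.sources.syslog.channels = ch1
>>>>>>> tier1.sinks.avroOut.type = avro
>>>>>>> tier1.sinks.avroOut.hostname = collector-host
>>>>>>> tier1.sinks.avroOut.port = 4545
>>>>>>> tier1.sinks.avroOut.channel = ch1
>>>>>>>
>>>>>>> # tier 2: avro in, roll into HDFS/S3 daily or at 128 MB
>>>>>>> tier2.sources.avroIn.type = avro
>>>>>>> tier2.sources.avroIn.bind = 0.0.0.0
>>>>>>> tier2.sources.avroIn.port = 4545
>>>>>>> tier2.sources.avroIn.channels = ch2
>>>>>>> tier2.sinks.hdfsOut.type = hdfs
>>>>>>> tier2.sinks.hdfsOut.channel = ch2
>>>>>>> tier2.sinks.hdfsOut.hdfs.path = (desired path string)
>>>>>>> tier2.sinks.hdfsOut.hdfs.fileSuffix = .avro
>>>>>>> tier2.sinks.hdfsOut.serializer = avro_event
>>>>>>> tier2.sinks.hdfsOut.hdfs.fileType = DataStream
>>>>>>> tier2.sinks.hdfsOut.hdfs.rollInterval = 86400
>>>>>>> tier2.sinks.hdfsOut.hdfs.rollSize = 134217728
>>>>>>> tier2.sinks.hdfsOut.hdfs.rollCount = 0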
>>>>>>>
>>>>>>> Hope this helps you out.
>>>>>>>
>>>>>>> Sincerely,
>>>>>>> *Devin Suiter*
>>>>>>> Jr. Data Solutions Software Engineer
>>>>>>> 100 Sandusky Street | 2nd Floor | Pittsburgh, PA 15212
>>>>>>> Google Voice: 412-256-8556 | www.rdx.com
>>>>>>>
>>>>>>>
>>>>>>> On Tue, Oct 22, 2013 at 10:07 AM, David Sinclair <
>>>>>>> [email protected]> wrote:
>>>>>>>
>>>>>>>> Do you need to roll based on size as well? Can you tell me the
>>>>>>>> requirements?
>>>>>>>>
>>>>>>>>
>>>>>>>> On Tue, Oct 22, 2013 at 2:15 AM, Martinus m 
>>>>>>>> <[email protected]> wrote:
>>>>>>>>
>>>>>>>>> Hi David,
>>>>>>>>>
>>>>>>>>> Thanks for your answer. I already did that, but using %Y-%m-%d.
>>>>>>>>> But since the sink still rolls based on size, it keeps generating
>>>>>>>>> two or more FlumeData.%Y-%m-%d files with different suffixes.
>>>>>>>>>
>>>>>>>>> Thanks.
>>>>>>>>>
>>>>>>>>> Martinus
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Fri, Oct 18, 2013 at 10:35 PM, David Sinclair <
>>>>>>>>> [email protected]> wrote:
>>>>>>>>>
>>>>>>>>>> The SyslogTcpSource will put a header on the Flume event named
>>>>>>>>>> 'timestamp'. This timestamp will be taken from the syslog entry.
>>>>>>>>>> You could then set the filePrefix in the sink to pick this up.
>>>>>>>>>> For example:
>>>>>>>>>>
>>>>>>>>>> tier1.sinks.hdfsSink.hdfs.filePrefix = FlumeData.%{timestamp}
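>>>>>>>>>>
>>>>>>>>>> Or, if you would rather have a formatted date than the raw epoch
>>>>>>>>>> value, the date escape sequences use that same timestamp header,
>>>>>>>>>> along these lines (the prefix shown is just an example):
>>>>>>>>>>
>>>>>>>>>> tier1.sinks.hdfsSink.hdfs.filePrefix = FlumeData.%Y-%m-%d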
>>>>>>>>>>
>>>>>>>>>> dave
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On Thu, Oct 17, 2013 at 10:23 PM, Martinus m <
>>>>>>>>>> [email protected]> wrote:
>>>>>>>>>>
>>>>>>>>>>> Hi David,
>>>>>>>>>>>
>>>>>>>>>>> It's syslogtcp.
>>>>>>>>>>>
>>>>>>>>>>> Thanks.
>>>>>>>>>>>
>>>>>>>>>>> Martinus
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> On Thu, Oct 17, 2013 at 9:17 PM, David Sinclair <
>>>>>>>>>>> [email protected]> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> What type of source are you using?
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> On Wed, Oct 16, 2013 at 9:56 PM, Martinus m <
>>>>>>>>>>>> [email protected]> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> Hi,
>>>>>>>>>>>>>
>>>>>>>>>>>>> Is there any option in the HDFS sink so that I can start rolling
>>>>>>>>>>>>> a new file whenever the date in the log changes? For example, I
>>>>>>>>>>>>> have the logs below:
>>>>>>>>>>>>>
>>>>>>>>>>>>> Oct 16 23:58:56 test-host : just test
>>>>>>>>>>>>> Oct 16 23:59:51 test-host : test again
>>>>>>>>>>>>> Oct 17 00:00:56 test-host : just test
>>>>>>>>>>>>> Oct 17 00:00:56 test-host : test again
>>>>>>>>>>>>>
>>>>>>>>>>>>> Then I want it to write a file to the S3 bucket with a result
>>>>>>>>>>>>> like this:
>>>>>>>>>>>>>
>>>>>>>>>>>>> FlumeData.2013-10-16.1381916293017 <-- all the logs from Oct 16,
>>>>>>>>>>>>> 2013 go here, and once Oct 17, 2013 is reached, it will start to
>>>>>>>>>>>>> sink into the new file below:
>>>>>>>>>>>>>
>>>>>>>>>>>>> FlumeData.2013-10-17.1381940047117
>>>>>>>>>>>>>
>>>>>>>>>>>>> Thanks.
>>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>
