I get that sometimes, but usually when the HDFS sink header interpreters make a new directory or the file rolls. I have some .tmp files from two weeks ago - the agent that wrote them will never point to that filepath again, but they are still there. I usually don't sweat them in my pseudo-cluster testbed, but in our development/quasi-production cluster I applied the hdfs.idleTimeout parameter. Our test data grows slowly - lots of small events at high frequency, but only about 20 MB of data per day, since they are log entries from a single server. I have it set to make a new directory for the day based on the timestamp applied by the syslogTCP source, so the first event to hit the source after midnight creates a new directory and a new file, but does not close the previous file. I'm not sure why; I think that is "just how it works," so I have a 30-minute idleTimeout in place. This morning, the roll created a new FlumeData.$TIMESTAMP.avro.tmp, left the previous day's FlumeData.$TIMESTAMP.avro.tmp open, and then the idleTimeout swooped in at the 30-minute mark and closed the previous file for me. Setting the idleTimeout too short will cause problems if it is shorter than the average interval between events. It seems the idleTimeout tells the HDFS BucketWriter to close the file but does not tell the AvroSink to write to a new file, so the sink processor heap fills up and crashes with an OOME.
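For reference, the arrangement described above looks roughly like this in the sink config. This is only a sketch - the agent/sink names and the path are placeholders, not copied from our real config:

```properties
# Sketch only: agent name (a1), sink name (k1), and path are hypothetical
a1.sinks.k1.type = hdfs
# Daily directory built from the event timestamp set by the syslogTCP source
a1.sinks.k1.hdfs.path = /flume/logs/%Y-%m-%d
a1.sinks.k1.hdfs.fileType = DataStream
# Close an idle bucket's .tmp file after 30 minutes with no events (seconds)
a1.sinks.k1.hdfs.idleTimeout = 1800
```

With this in place, the previous day's file gets closed 30 minutes after the last event touches it, instead of lingering as an open .tmp.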
Hope that helps.

*Devin Suiter*
Jr. Data Solutions Software Engineer
100 Sandusky Street | 2nd Floor | Pittsburgh, PA 15212
Google Voice: 412-256-8556 | www.rdx.com


On Thu, Oct 24, 2013 at 12:46 PM, Christopher Surage <[email protected]> wrote:

> David,
>
> Did you ever have a problem with the hdfs getting stuck on a write? I am
> noticing that it just stops writing files after a certain amount of time,
> but it doesn't seem to be finished - it just stops at a certain .tmp file.
>
> regards,
>
> Chris
>
>
> On Thu, Oct 24, 2013 at 11:09 AM, DSuiter RDX <[email protected]> wrote:
>
>> No problem! Glad I was able to help!
>>
>> *Devin Suiter*
>> Jr. Data Solutions Software Engineer
>> 100 Sandusky Street | 2nd Floor | Pittsburgh, PA 15212
>> Google Voice: 412-256-8556 | www.rdx.com
>>
>>
>> On Thu, Oct 24, 2013 at 11:05 AM, Christopher Surage <[email protected]> wrote:
>>
>>> David,
>>>
>>> First of all, thank you for your help - the typo was the problem. Second,
>>> the reason I was using DataStream as the file type for my hdfs sink was
>>> that when I had it as a SequenceFile, the sink was adding a lot of garbage
>>> data to the file when it copied to the hdfs, which was causing undesired
>>> behavior with my created hive table. When I changed to DataStream, it just
>>> put the plain text in the file. With regard to the channels, that is
>>> something I will definitely look at in order to fine-tune performance now
>>> that I have solved this problem. I have fumbled around with the memory
>>> channel, playing with the capacity and transactionCapacity attributes, and
>>> I have run into choking of the channel - I just have to read more about
>>> it. I don't know if you have seen these before, but I've been looking at
>>> https://blog.cloudera.com/blog/2013/01/how-to-do-apache-flume-performance-tuning-part-1/.
>>>
>>> Thanks for your help,
>>>
>>> Chris
>>>
>>>
>>> On Thu, Oct 24, 2013 at 10:17 AM, DSuiter RDX <[email protected]> wrote:
>>>
>>>> Christopher,
>>>>
>>>> I use a very similar setup, and I had a similar problem for a while.
>>>> The HDFS sink defaults are the tricky part - they are all pretty small,
>>>> since they assume a high data velocity, and unless they are all
>>>> explicitly turned OFF, they are on.
>>>>
>>>> So your HDFS batch size parameter might be the problem. Also, I notice
>>>> you need to capitalize the "S" in the hdfs.roll*S*ize parameter -
>>>> camelcase got me on transactionCapacity once :-) Not sure if this is
>>>> copypasta from your config, but that will keep the param from being
>>>> respected, so in your case it would roll at the default 1024 bytes, or
>>>> probably about 10 lines of text.
>>>>
>>>> One question about your config, though - I notice you have
>>>> hdfs.fileType set to DataStream for Avro, but you do not have a
>>>> serializer of avro_event declared. In what format are your files being
>>>> put into HDFS - as Avro-contained streams, or as aggregated text bodies
>>>> with newline delimiters? I ask because this setup has led to us needing
>>>> to unwrap Avro event files in MapReduce, which is tricky - if you are
>>>> getting aggregated text, I have some reconfiguring to do.
>>>>
>>>> Other things to look out for: make sure the HDFS file being written to
>>>> doesn't close mid-stream - I have not seen that recover gracefully, and
>>>> I am getting OOMEs in my testbed right now from something like that -
>>>> and make sure the transaction capacity in your channels is high enough
>>>> through the flow; my original setup kept choking on a small transaction
>>>> capacity from the first channel to the Avro sink.
>>>>
>>>> Good luck!
>>>>
>>>> *Devin Suiter*
>>>> Jr. Data Solutions Software Engineer
>>>> 100 Sandusky Street | 2nd Floor | Pittsburgh, PA 15212
>>>> Google Voice: 412-256-8556 | www.rdx.com
>>>>
>>>>
>>>> On Thu, Oct 24, 2013 at 9:44 AM, Christopher Surage <[email protected]> wrote:
>>>>
>>>>> Hello, I am having an issue increasing the size of the files that get
>>>>> written into my hdfs. I have tried playing with the rollCount attribute
>>>>> for an hdfs sink, but it seems to cap at 10 lines of text per file,
>>>>> with many files written to the hdfs directory. One can see why I need
>>>>> to change this.
>>>>>
>>>>> I have 2 boxes running:
>>>>> 1) uses a spooldir source to check for new log files copied to a
>>>>> specific dir. It then sends the events through a mem channel to an avro
>>>>> sink pointed at the other box with the hdfs on it.
>>>>>
>>>>> 2) uses an avro source and sends events to the hdfs sink.
>>>>>
>>>>> configurations:
>>>>>
>>>>> 1.
>>>>> # Name the components of the agent
>>>>> a1.sources = r1
>>>>> a1.sinks = k1
>>>>> a1.channels = c1
>>>>>
>>>>> ###############Describe/configure the source#################
>>>>> a1.sources.r1.type = spooldir
>>>>> a1.sources.r1.spoolDir = /u1/csurage/flume_test
>>>>> a1.sources.r1.channels = c1
>>>>> #a1.sources.r1.fileHeader = true
>>>>>
>>>>> ##############describe the sink#######################
>>>>> # file roll sink
>>>>> #a1.sinks.k1.type = file_roll
>>>>> #a1.sinks.k1.sink.directory = /u1/csurage/target_flume
>>>>>
>>>>> # Avro sink
>>>>> a1.sinks.k1.type = avro
>>>>> a1.sinks.k1.hostname = 45.32.96.136
>>>>> a1.sinks.k1.port = 9311
>>>>>
>>>>> # Channel the sink connects to
>>>>> a1.sinks.k1.channel = c1
>>>>>
>>>>> ################describe the channel##################
>>>>> # use a channel which buffers events in memory
>>>>> a1.channels.c1.type = memory
>>>>> a1.channels.c1.byteCapacity = 0
>>>>>
>>>>> 2. (note: when I change any of the attributes in bold, the rollCount
>>>>> stays at 10-line files written to the hdfs)
>>>>>
>>>>> # Name the components of the agent
>>>>> a1.sources = r1
>>>>> a1.sinks = k1
>>>>> a1.channels = c1
>>>>>
>>>>> ###############Describe/configure the source#################
>>>>> a1.sources.r1.type = avro
>>>>> a1.sources.r1.bind = 45.32.96.136
>>>>> a1.sources.r1.port = 9311
>>>>> a1.sources.r1.channels = c1
>>>>> #a1.sources.r1.fileHeader = true
>>>>>
>>>>> ##############describe the sink#######################
>>>>> # HDFS sink
>>>>> a1.sinks.k1.type = hdfs
>>>>> a1.sinks.k1.hdfs.path = /user/csurage/hive
>>>>> a1.sinks.k1.hdfs.fileType = DataStream
>>>>> *a1.sinks.k1.hdfs.rollsize = 0*
>>>>> *a1.sinks.k1.hdfs.rollCount = 20*
>>>>> *a1.sinks.k1.hdfs.rollInterval = 0*
>>>>>
>>>>> # Channel the sink connects to
>>>>> a1.sinks.k1.channel = c1
>>>>>
>>>>> ################describe the channel##################
>>>>> # use a channel which buffers events in memory
>>>>> a1.channels.c1.type = memory
>>>>> a1.channels.c1.byteCapacity = 0
>>>>>
>>>>> Please, any help would be greatly appreciated - I have been stuck on
>>>>> this for 2 days.
>>>>>
>>>>> regards,
>>>>>
>>>>> Chris
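For anyone skimming this thread later: the roll parameters are case-sensitive, and the misspelled "rollsize" in the bolded lines above is silently ignored, leaving the 1024-byte default in effect. A corrected sketch of just that sink section would read:

```properties
a1.sinks.k1.type = hdfs
a1.sinks.k1.hdfs.path = /user/csurage/hive
a1.sinks.k1.hdfs.fileType = DataStream
# Capital "S" matters: "rollsize" is ignored, so the 1024-byte
# default rollSize applies and files roll at roughly 10 lines of text
a1.sinks.k1.hdfs.rollSize = 0
a1.sinks.k1.hdfs.rollCount = 20
a1.sinks.k1.hdfs.rollInterval = 0
```

With rollSize and rollInterval set to 0 (disabled) and the name spelled correctly, only rollCount governs the roll, giving 20-event files as intended.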
