David,

Did you ever have a problem with HDFS getting stuck on a write? I am noticing that it just stops writing files after a certain amount of time, but it doesn't seem to be finished; it just stops at a certain .tmp file.
regards,
Chris

On Thu, Oct 24, 2013 at 11:09 AM, DSuiter RDX <[email protected]> wrote:

> No problem! Glad I was able to help!
>
> *Devin Suiter*
> Jr. Data Solutions Software Engineer
> 100 Sandusky Street | 2nd Floor | Pittsburgh, PA 15212
> Google Voice: 412-256-8556 | www.rdx.com
>
>
> On Thu, Oct 24, 2013 at 11:05 AM, Christopher Surage <[email protected]> wrote:
>
>> David,
>>
>> First of all, thank you for your help; the typo was the problem. Second,
>> the reason I was using DataStream as my file type for my HDFS sink is
>> that when I had it as a SequenceFile, the sink was adding a lot of
>> garbage data to the file when it copied to HDFS, which was causing
>> undesired behavior with my created Hive table. When I changed to
>> DataStream, it just put the plain text in the file. With regard to the
>> channels, that is something I will definitely look at in order to
>> fine-tune performance, now that I have solved this problem. I have
>> fumbled around with the memory channel, playing with the capacity and
>> transactionCapacity attributes, and I have run into choking of the
>> channel; I just have to read more about it. I don't know if you have
>> seen these before, but I've been looking at
>> https://blog.cloudera.com/blog/2013/01/how-to-do-apache-flume-performance-tuning-part-1/
>>
>> Thanks for your help,
>>
>> Chris
>>
>>
>> On Thu, Oct 24, 2013 at 10:17 AM, DSuiter RDX <[email protected]> wrote:
>>
>>> Christopher,
>>>
>>> I use a very similar setup, and I had a similar problem for a while.
>>> The HDFS sink defaults are the tricky part - they are all pretty
>>> small, since they assume a high data velocity - and unless they are
>>> all explicitly declared as off, they are on.
>>>
>>> So, your HDFS batch size parameter might be the problem.
>>> Also, I notice you need to capitalize the "S" in the hdfs.roll*S*ize
>>> parameter - camelCase got me on transactionCapacity once :-) I'm not
>>> sure if this is copy-pasted from your config, but it will stop the
>>> parameter from being respected, so in your case the sink would roll at
>>> the default 1024 bytes, or roughly 10 lines of text.
>>>
>>> One question about your config, though - I notice you have
>>> hdfs.fileType set to DataStream for Avro, but you do not have a
>>> serializer of avro_event declared. In what format are your files being
>>> put into HDFS: as Avro-contained streams, or as aggregated text bodies
>>> with newline delimiters? I ask because this setup has led to us
>>> needing to unwrap Avro event files in MapReduce, which is tricky - if
>>> you are getting aggregate text, I have some reconfiguring to do.
>>>
>>> Other things to look out for: make sure the HDFS file being written to
>>> doesn't close mid-stream - I have not seen that recover gracefully,
>>> and I am getting an OOME in my testbed right now due to something like
>>> that - and make sure the transaction capacity in your channels is high
>>> enough through the flow; my original setup kept choking with a small
>>> transaction capacity from the first channel to the Avro sink.
>>>
>>> Good luck!
>>>
>>> *Devin Suiter*
>>> Jr. Data Solutions Software Engineer
>>> 100 Sandusky Street | 2nd Floor | Pittsburgh, PA 15212
>>> Google Voice: 412-256-8556 | www.rdx.com
>>>
>>>
>>> On Thu, Oct 24, 2013 at 9:44 AM, Christopher Surage <[email protected]> wrote:
>>>
>>>> Hello, I am having an issue increasing the size of the files which
>>>> get written to my HDFS. I have tried playing with the rollCount
>>>> attribute of the HDFS sink, but it seems to cap at 10 lines of text
>>>> per file, with many files written to the HDFS directory. One can see
>>>> why I need to change this.
>>>>
>>>> I have 2 boxes running:
>>>>
>>>> 1) uses a spooldir source to check for new log files copied to a
>>>> specific dir. It then sends the events through a memory channel to an
>>>> Avro sink pointing at the other box with HDFS on it.
>>>>
>>>> 2) uses an Avro source and sends events to the HDFS sink.
>>>>
>>>> Configurations:
>>>>
>>>> 1.
>>>> # Name the components of the agent
>>>> a1.sources = r1
>>>> a1.sinks = k1
>>>> a1.channels = c1
>>>>
>>>> ############### Describe/configure the source #################
>>>> a1.sources.r1.type = spooldir
>>>> a1.sources.r1.spoolDir = /u1/csurage/flume_test
>>>> a1.sources.r1.channels = c1
>>>> #a1.sources.r1.fileHeader = true
>>>>
>>>> ############## Describe the sink #######################
>>>> # file roll sink
>>>> #a1.sinks.k1.type = file_roll
>>>> #a1.sinks.k1.sink.directory = /u1/csurage/target_flume
>>>>
>>>> # Avro sink
>>>> a1.sinks.k1.type = avro
>>>> a1.sinks.k1.hostname = 45.32.96.136
>>>> a1.sinks.k1.port = 9311
>>>>
>>>> # Channel the sink connects to
>>>> a1.sinks.k1.channel = c1
>>>>
>>>> ################ Describe the channel ##################
>>>> # use a channel which buffers events in memory
>>>> a1.channels.c1.type = memory
>>>> a1.channels.c1.byteCapacity = 0
>>>>
>>>> 2.
>>>> Note: when I change any of the attributes in bold, the rollCount
>>>> stays at 10-line files written to HDFS.
>>>>
>>>> # Name the components of the agent
>>>> a1.sources = r1
>>>> a1.sinks = k1
>>>> a1.channels = c1
>>>>
>>>> ############### Describe/configure the source #################
>>>> a1.sources.r1.type = avro
>>>> a1.sources.r1.bind = 45.32.96.136
>>>> a1.sources.r1.port = 9311
>>>> a1.sources.r1.channels = c1
>>>> #a1.sources.r1.fileHeader = true
>>>>
>>>> ############## Describe the sink #######################
>>>> # HDFS sink
>>>> a1.sinks.k1.type = hdfs
>>>> a1.sinks.k1.hdfs.path = /user/csurage/hive
>>>> a1.sinks.k1.hdfs.fileType = DataStream
>>>> *a1.sinks.k1.hdfs.rollsize = 0*
>>>> *a1.sinks.k1.hdfs.rollCount = 20*
>>>> *a1.sinks.k1.hdfs.rollInterval = 0*
>>>>
>>>> # Channel the sink connects to
>>>> a1.sinks.k1.channel = c1
>>>>
>>>> ################ Describe the channel ##################
>>>> # use a channel which buffers events in memory
>>>> a1.channels.c1.type = memory
>>>> a1.channels.c1.byteCapacity = 0
>>>>
>>>> Please, any help would be greatly appreciated; I have been stuck on
>>>> this for 2 days.
>>>>
>>>> regards,
>>>>
>>>> Chris
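[Editor's note for readers finding this thread later: a minimal sketch of the second agent's sink section with the capitalization fix applied, reusing the a1/k1 names from the quoted configs. The hdfs.idleTimeout line is an addition not present in the original configs, aimed at the stuck-.tmp question at the top of the thread; treat the value as illustrative.]

```properties
# Corrected HDFS sink for agent 2 - the fix is the capital "S" in rollSize.
# A misspelled property is silently ignored, so Flume falls back to the
# defaults (rollSize=1024 bytes, rollCount=10, rollInterval=30 s), which is
# why files kept rolling at ~10 lines.
a1.sinks.k1.type = hdfs
a1.sinks.k1.hdfs.path = /user/csurage/hive
a1.sinks.k1.hdfs.fileType = DataStream
a1.sinks.k1.hdfs.rollSize = 0
a1.sinks.k1.hdfs.rollCount = 20
a1.sinks.k1.hdfs.rollInterval = 0

# Optional: close a bucket file that has received no events for N seconds,
# so it gets renamed from its in-progress .tmp name instead of sitting open
# indefinitely when the incoming stream pauses.
a1.sinks.k1.hdfs.idleTimeout = 60
```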

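[Editor's note: on the channel-tuning point Devin raises, a minimal memory-channel sketch showing the knobs mentioned in the thread (capacity, transactionCapacity, byteCapacity). The numbers are illustrative assumptions, not measured recommendations.]

```properties
# Memory channel sized so sink-side batch takes don't choke the channel.
a1.channels.c1.type = memory
# total number of events the channel can buffer
a1.channels.c1.capacity = 10000
# events per put/take transaction; keep this at least as large as the
# sink's batch size (hdfs.batchSize defaults to 100) or takes will fail
a1.channels.c1.transactionCapacity = 1000
# byte-based cap on buffered event bodies; left unlimited here, matching
# the configs quoted above
a1.channels.c1.byteCapacity = 0
```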