No problem! Glad I was able to help!

*Devin Suiter*
Jr. Data Solutions Software Engineer
100 Sandusky Street | 2nd Floor | Pittsburgh, PA 15212
Google Voice: 412-256-8556 | www.rdx.com
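A quick sketch of the memory-channel sizing discussed in the thread below; the values here are illustrative, not recommendations:

```properties
# Memory channel sizing (illustrative values only).
# capacity: maximum number of events the channel can hold.
# transactionCapacity: maximum events per source put / sink take
# transaction. It must be <= capacity, and the sinks' batch sizes
# should not exceed it, or the channel "chokes" as described below.
a1.channels.c1.type = memory
a1.channels.c1.capacity = 10000
a1.channels.c1.transactionCapacity = 1000
```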
On Thu, Oct 24, 2013 at 11:05 AM, Christopher Surage <[email protected]> wrote:

> David,
>
> First of all, thank you for your help; the typo was the problem. Second,
> the reason I was using DataStream as the file type for my HDFS sink is
> that when it was a SequenceFile, the sink added a lot of garbage data to
> the file as it copied into HDFS, which caused undesired behavior with the
> Hive table I had created. When I changed it to DataStream, it wrote plain
> text into the file. As for the channels, that is something I will
> definitely look at to fine-tune performance now that this problem is
> solved. I have fumbled around with the memory channel, playing with the
> capacity and transactionCapacity attributes, and have run into choking of
> the channel; I just have to read more about it. I don't know if you have
> seen this before, but I've been looking at
> https://blog.cloudera.com/blog/2013/01/how-to-do-apache-flume-performance-tuning-part-1/
>
> Thanks for your help,
>
> Chris
>
>
> On Thu, Oct 24, 2013 at 10:17 AM, DSuiter RDX <[email protected]> wrote:
>
>> Christopher,
>>
>> I use a very similar setup, and I had a similar problem for a while. The
>> HDFS sink defaults are the tricky part: they are all pretty small, since
>> they assume a high data velocity, and unless they are all explicitly
>> declared as OFF, they are on.
>>
>> So your HDFS batch size parameter might be the problem. Also, I notice
>> you need to capitalize the "S" in the hdfs.roll*S*ize parameter;
>> camelCase got me on transactionCapacity once :-) Not sure if this is
>> copypasta from your config, but that typo will keep the parameter from
>> being respected, so in your case it would roll the file at the default
>> 1024 bytes, or probably about 10 lines of text.
>>
>> One question about your config, though: I notice you have hdfs.fileType
>> as DataStream for Avro, but you do not have a serializer of avro_event
>> declared. In what format are your files being put into HDFS? As
>> Avro-contained streams, or as aggregated text bodies with newline
>> delimiters? I ask because this setup has led to us needing to unwrap
>> Avro event files in MapReduce, which is tricky; if you are getting
>> aggregate text, I have some reconfiguring to do.
>>
>> Other things to look out for: make sure the HDFS file being written to
>> doesn't close mid-stream, as I have not seen that recover gracefully (I
>> am getting OOMEs in my testbed right now due to something like that),
>> and make sure the transaction capacity in your channels is high enough
>> through the flow; my original setup kept choking on a small transaction
>> capacity from the first channel to the Avro sink.
>>
>> Good luck!
>>
>> *Devin Suiter*
>> Jr. Data Solutions Software Engineer
>> 100 Sandusky Street | 2nd Floor | Pittsburgh, PA 15212
>> Google Voice: 412-256-8556 | www.rdx.com
>>
>>
>> On Thu, Oct 24, 2013 at 9:44 AM, Christopher Surage <[email protected]> wrote:
>>
>>> Hello, I am having an issue increasing the size of the files that get
>>> written into my HDFS. I have tried playing with the rollCount attribute
>>> for an HDFS sink, but it seems to cap at 10 lines of text per file,
>>> with many files written to the HDFS directory. One can see why I need
>>> to change this.
>>>
>>> I have 2 boxes running:
>>>
>>> 1) uses a spooldir source to check for new log files copied to a
>>> specific dir, then sends the events through a memory channel to an Avro
>>> sink pointing at the other box, which hosts HDFS.
>>>
>>> 2) uses an Avro source and sends events to the HDFS sink.
>>>
>>> Configurations:
>>>
>>> 1.
>>> # Name the components of the agent
>>> a1.sources = r1
>>> a1.sinks = k1
>>> a1.channels = c1
>>>
>>> ###############Describe/configure the source#################
>>> a1.sources.r1.type = spooldir
>>> a1.sources.r1.spoolDir = /u1/csurage/flume_test
>>> a1.sources.r1.channels = c1
>>> #a1.sources.r1.fileHeader = true
>>>
>>> ##############Describe the sink#######################
>>> # file roll sink
>>> #a1.sinks.k1.type = file_roll
>>> #a1.sinks.k1.sink.directory = /u1/csurage/target_flume
>>>
>>> # Avro sink
>>> a1.sinks.k1.type = avro
>>> a1.sinks.k1.hostname = 45.32.96.136
>>> a1.sinks.k1.port = 9311
>>>
>>> # Channel the sink connects to
>>> a1.sinks.k1.channel = c1
>>>
>>> ################Describe the channel##################
>>> # use a channel which buffers events in memory
>>> a1.channels.c1.type = memory
>>> a1.channels.c1.byteCapacity = 0
>>>
>>>
>>> 2. Note: when I change any of the attributes in bold, the files written
>>> to HDFS still roll at 10 lines.
>>>
>>> # Name the components of the agent
>>> a1.sources = r1
>>> a1.sinks = k1
>>> a1.channels = c1
>>>
>>> ###############Describe/configure the source#################
>>> a1.sources.r1.type = avro
>>> a1.sources.r1.bind = 45.32.96.136
>>> a1.sources.r1.port = 9311
>>> a1.sources.r1.channels = c1
>>> #a1.sources.r1.fileHeader = true
>>>
>>> ##############Describe the sink#######################
>>> # HDFS sink
>>> a1.sinks.k1.type = hdfs
>>> a1.sinks.k1.hdfs.path = /user/csurage/hive
>>> a1.sinks.k1.hdfs.fileType = DataStream
>>> *a1.sinks.k1.hdfs.rollsize = 0*
>>> *a1.sinks.k1.hdfs.rollCount = 20*
>>> *a1.sinks.k1.hdfs.rollInterval = 0*
>>>
>>> # Channel the sink connects to
>>> a1.sinks.k1.channel = c1
>>>
>>> ################Describe the channel##################
>>> # use a channel which buffers events in memory
>>> a1.channels.c1.type = memory
>>> a1.channels.c1.byteCapacity = 0
>>>
>>> Please, any help would be greatly appreciated; I have been stuck on
>>> this for 2 days.
>>>
>>> Regards,
>>>
>>> Chris
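For anyone landing on this thread with the same symptom: the fix was the capital "S" in hdfs.rollSize. Flume silently ignores unrecognized parameter names, so the misspelled "rollsize" left the 1024-byte default size-based roll in effect. A sketch of the corrected sink block, reusing the names from the config above (the rollCount of 20 and the batchSize line are illustrative):

```properties
# HDFS sink: roll a new file after 20 events; disable size- and
# time-based rolling by setting them to 0. Note the capital "S" in
# rollSize; the misspelled "rollsize" is ignored and the 1024-byte
# default stays active.
a1.sinks.k1.type = hdfs
a1.sinks.k1.hdfs.path = /user/csurage/hive
a1.sinks.k1.hdfs.fileType = DataStream
a1.sinks.k1.hdfs.rollSize = 0
a1.sinks.k1.hdfs.rollCount = 20
a1.sinks.k1.hdfs.rollInterval = 0
# Events flushed to HDFS per batch (100 is the default).
a1.sinks.k1.hdfs.batchSize = 100
a1.sinks.k1.channel = c1
```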
