David,

Did you ever have a problem with HDFS getting stuck on a write? I am noticing that it just stops writing files after a certain amount of time, but it doesn't seem to be finished; it just stops at a certain .tmp file.
regards,
Chris

On Thu, Oct 24, 2013 at 11:09 AM, DSuiter RDX <[email protected]> wrote:

> No problem! Glad I was able to help!
>
> *Devin Suiter*
> Jr. Data Solutions Software Engineer
> 100 Sandusky Street | 2nd Floor | Pittsburgh, PA 15212
> Google Voice: 412-256-8556 | www.rdx.com
>
>
> On Thu, Oct 24, 2013 at 11:05 AM, Christopher Surage <[email protected]> wrote:
>
>> David,
>>
>> First of all, thank you for your help; the typo was the problem. Second,
>> the reason I was using DataStream as my file type for my HDFS sink is
>> that when I had it as a SequenceFile, the sink was adding a lot of
>> garbage data to the file when it copied to HDFS, which was causing
>> undesired behavior with my created Hive table. When I changed to
>> DataStream, it just put the plain text in the file. With regard to the
>> channels, that is something I will definitely look at in order to
>> fine-tune performance, now that I have solved this problem. I have
>> fumbled around with the memory channel, playing with the capacity and
>> transactionCapacity attributes, and I have run into choking of the
>> channel; I just have to read more about it. I don't know if you have
>> seen these before, but I've been looking at
>> https://blog.cloudera.com/blog/2013/01/how-to-do-apache-flume-performance-tuning-part-1/
>>
>> Thanks for your help,
>>
>> Chris
>>
>>
>> On Thu, Oct 24, 2013 at 10:17 AM, DSuiter RDX <[email protected]> wrote:
>>
>>> Christopher,
>>>
>>> I use a very similar setup, and I had a similar problem for a while.
>>> The HDFS sink defaults are the tricky part - they are all pretty
>>> small, since they assume a high data velocity - and unless they are
>>> all explicitly declared as off, they are on.
>>>
>>> So, your HDFS batch size parameter might be the problem.
>>> Also, I notice you need to capitalize the "S" in the hdfs.roll*S*ize
>>> parameter - camelCase got me on transactionCapacity once :-) I'm not
>>> sure if this is copy-pasted from your config, but it will stop the
>>> parameter from being respected, so in your case the sink would roll at
>>> the default 1024 bytes, or roughly 10 lines of text.
>>>
>>> One question about your config, though - I notice you have
>>> hdfs.fileType set to DataStream for Avro, but you do not have a
>>> serializer of avro_event declared. In what format are your files being
>>> put into HDFS: as Avro-contained streams, or as aggregated text bodies
>>> with newline delimiters? I ask because this setup has led to us
>>> needing to unwrap Avro event files in MapReduce, which is tricky - if
>>> you are getting aggregate text, I have some reconfiguring to do.
>>>
>>> Other things to look out for: make sure the HDFS file being written to
>>> doesn't close mid-stream - I have not seen that recover gracefully,
>>> and I am getting an OOME in my testbed right now due to something like
>>> that - and make sure the transaction capacity in your channels is high
>>> enough through the flow; my original setup kept choking with a small
>>> transaction capacity from the first channel to the Avro sink.
>>>
>>> Good luck!
>>>
>>> *Devin Suiter*
>>> Jr. Data Solutions Software Engineer
>>> 100 Sandusky Street | 2nd Floor | Pittsburgh, PA 15212
>>> Google Voice: 412-256-8556 | www.rdx.com
>>>
>>>
>>> On Thu, Oct 24, 2013 at 9:44 AM, Christopher Surage <[email protected]> wrote:
>>>
>>>> Hello, I am having an issue increasing the size of the files which
>>>> get written to my HDFS. I have tried playing with the rollCount
>>>> attribute of the HDFS sink, but it seems to cap at 10 lines of text
>>>> per file, with many files written to the HDFS directory. One can see
>>>> why I need to change this.
>>>>
>>>> I have 2 boxes running:
>>>>
>>>> 1) uses a spooldir source to check for new log files copied to a
>>>> specific dir. It then sends the events through a memory channel to an
>>>> Avro sink pointing at the other box with HDFS on it.
>>>>
>>>> 2) uses an Avro source and sends events to the HDFS sink.
>>>>
>>>> Configurations:
>>>>
>>>> 1.
>>>> # Name the components of the agent
>>>> a1.sources = r1
>>>> a1.sinks = k1
>>>> a1.channels = c1
>>>>
>>>> ############### Describe/configure the source #################
>>>> a1.sources.r1.type = spooldir
>>>> a1.sources.r1.spoolDir = /u1/csurage/flume_test
>>>> a1.sources.r1.channels = c1
>>>> #a1.sources.r1.fileHeader = true
>>>>
>>>> ############## Describe the sink #######################
>>>> # file roll sink
>>>> #a1.sinks.k1.type = file_roll
>>>> #a1.sinks.k1.sink.directory = /u1/csurage/target_flume
>>>>
>>>> # Avro sink
>>>> a1.sinks.k1.type = avro
>>>> a1.sinks.k1.hostname = 45.32.96.136
>>>> a1.sinks.k1.port = 9311
>>>>
>>>> # Channel the sink connects to
>>>> a1.sinks.k1.channel = c1
>>>>
>>>> ################ Describe the channel ##################
>>>> # use a channel which buffers events in memory
>>>> a1.channels.c1.type = memory
>>>> a1.channels.c1.byteCapacity = 0
>>>>
>>>> 2.
>>>> Note: when I change any of the attributes in bold, the rollCount
>>>> stays at 10-line files written to HDFS.
>>>>
>>>> # Name the components of the agent
>>>> a1.sources = r1
>>>> a1.sinks = k1
>>>> a1.channels = c1
>>>>
>>>> ############### Describe/configure the source #################
>>>> a1.sources.r1.type = avro
>>>> a1.sources.r1.bind = 45.32.96.136
>>>> a1.sources.r1.port = 9311
>>>> a1.sources.r1.channels = c1
>>>> #a1.sources.r1.fileHeader = true
>>>>
>>>> ############## Describe the sink #######################
>>>> # HDFS sink
>>>> a1.sinks.k1.type = hdfs
>>>> a1.sinks.k1.hdfs.path = /user/csurage/hive
>>>> a1.sinks.k1.hdfs.fileType = DataStream
>>>> *a1.sinks.k1.hdfs.rollsize = 0*
>>>> *a1.sinks.k1.hdfs.rollCount = 20*
>>>> *a1.sinks.k1.hdfs.rollInterval = 0*
>>>>
>>>> # Channel the sink connects to
>>>> a1.sinks.k1.channel = c1
>>>>
>>>> ################ Describe the channel ##################
>>>> # use a channel which buffers events in memory
>>>> a1.channels.c1.type = memory
>>>> a1.channels.c1.byteCapacity = 0
>>>>
>>>> Please, any help would be greatly appreciated; I have been stuck on
>>>> this for 2 days.
>>>>
>>>> regards,
>>>>
>>>> Chris
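[Editor's note for readers finding this thread later: a minimal sketch of the second agent's sink section with the capitalization fix applied, reusing the a1/k1 names from the quoted configs. The hdfs.idleTimeout line is an addition not present in the original configs, aimed at the stuck-.tmp question at the top of the thread; treat the value as illustrative.]

```properties
# Corrected HDFS sink for agent 2 - the fix is the capital "S" in rollSize.
# A misspelled property is silently ignored, so Flume falls back to the
# defaults (rollSize=1024 bytes, rollCount=10, rollInterval=30 s), which is
# why files kept rolling at ~10 lines.
a1.sinks.k1.type = hdfs
a1.sinks.k1.hdfs.path = /user/csurage/hive
a1.sinks.k1.hdfs.fileType = DataStream
a1.sinks.k1.hdfs.rollSize = 0
a1.sinks.k1.hdfs.rollCount = 20
a1.sinks.k1.hdfs.rollInterval = 0

# Optional: close a bucket file that has received no events for N seconds,
# so it gets renamed from its in-progress .tmp name instead of sitting open
# indefinitely when the incoming stream pauses.
a1.sinks.k1.hdfs.idleTimeout = 60
```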

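[Editor's note: on the channel-tuning point Devin raises, a minimal memory-channel sketch showing the knobs mentioned in the thread (capacity, transactionCapacity, byteCapacity). The numbers are illustrative assumptions, not measured recommendations.]

```properties
# Memory channel sized so sink-side batch takes don't choke the channel.
a1.channels.c1.type = memory
# total number of events the channel can buffer
a1.channels.c1.capacity = 10000
# events per put/take transaction; keep this at least as large as the
# sink's batch size (hdfs.batchSize defaults to 100) or takes will fail
a1.channels.c1.transactionCapacity = 1000
# byte-based cap on buffered event bodies; left unlimited here, matching
# the configs quoted above
a1.channels.c1.byteCapacity = 0
```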