Ok, I just realized that I got the units wrong on the rollSize: the value is in bytes, so it is probably doing exactly what it is supposed to, since I told it to close the file at 3 KB, not 3 MB...
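For the record, here is a minimal sketch of what the corrected setting would presumably look like. It assumes hdfs.rollSize is interpreted in bytes, as the Flume HDFS sink documentation describes, and that the other roll settings stay as in the quoted config below. I have not re-run the agent with this value.

```properties
# Sketch only: hdfs.rollSize is given in bytes, so 3 MB = 3 * 1024 * 1024 = 3145728.
# The original value of 3072 rolls the file at roughly 3 KB.
rtlv2.sinks.snklv2.hdfs.rollSize=3145728

# Unchanged from the original config: 0 disables time-based and count-based rolling.
rtlv2.sinks.snklv2.hdfs.rollInterval=0
rtlv2.sinks.snklv2.hdfs.rollCount=0
```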
Sorry everyone!

Thanks!
*Devin Suiter*
Jr. Data Solutions Software Engineer
100 Sandusky Street | 2nd Floor | Pittsburgh, PA 15212
Google Voice: 412-256-8556 | www.rdx.com

On Mon, Oct 7, 2013 at 12:00 PM, DSuiter RDX <[email protected]> wrote:

> Hi, this may be a problem with our understanding, or my configuration.
>
> I am trying to take data from rsyslog via remote forwarding over TCP into
> a syslogTCP source, collect it as an avro sink, connect the avro sink to an
> avro source, and then into an HDFS sink.
>
> Everything is connected and the data is flowing from the remote source
> into HDFS in an avro container, so that is not the problem.
>
> The problem is that it is closing files when they are very small, only KBs
> in size, even though I have the hdfs.rollInterval and hdfs.rollCount
> properties set to 0. I set the hdfs.rollSize property to 3072 for 3MB. I
> expected it to aggregate the files into larger blocks before closing them.
> Is this happening because of the HDFS directory-building escape sequences
> forcing new directory writes and making new files prematurely?
>
> Here are my agent configs:
>
> syslogTCP Source > Avro Sink (first tier, pretty sure everything is ok
> here but maybe not)
>
> ####RT Listener Agent####
> rtlv1.sources=srclv1
> rtlv1.sinks=snklv1
> rtlv1.channels=chnlv1
>
> #sources
> rtlv1.sources.srclv1.type=syslogtcp
> rtlv1.sources.srclv1.host=192.168.1.2
> rtlv1.sources.srclv1.port=5140
> rtlv1.sources.srclv1.channels=chnlv1
>
> #channels
> rtlv1.channels.chnlv1.type=memory
> rtlv1.channels.chnlv1.capacity=1500
> rtlv1.channels.chnlv1.transactionCapacity=1500
>
> #sinks
> rtlv1.sinks.snklv1.type=avro
> rtlv1.sinks.snklv1.hostname=192.168.1.2
> rtlv1.sinks.snklv1.port=5141
> rtlv1.sinks.snklv1.batch-size=1500
> rtlv1.sinks.snklv1.channel=chnlv1
>
> Avro Source > HDFS (second tier)
>
> ####RT Aggregate Writer Agent####
> rtlv2.sources=srclv2
> rtlv2.sinks=snklv2
> rtlv2.channels=chnlv2
>
> #sources
> rtlv2.sources.srclv2.type=avro
> rtlv2.sources.srclv2.bind=192.168.1.2
> rtlv2.sources.srclv2.port=5141
> rtlv2.sources.srclv2.channels=chnlv2
>
> #channels
> rtlv2.channels.chnlv2.type=memory
> rtlv2.channels.chnlv2.capacity=1500
> rtlv2.channels.chnlv2.transactionCapacity=1500
>
> #sinks
> rtlv2.sinks.snklv2.type=hdfs
> rtlv2.sinks.snklv2.channel=chnlv2
> rtlv2.sinks.snklv2.hdfs.path=/user/flume/avro/%y-%m-%d/%H%M
> rtlv2.sinks.snklv2.hdfs.fileSuffix=.avro
> rtlv2.sinks.snklv2.serializer=avro_event
> rtlv2.sinks.snklv2.hdfs.fileType=DataStream
> rtlv2.sinks.snklv2.hdfs.rollInterval=0
> rtlv2.sinks.snklv2.hdfs.rollSize=3072
> rtlv2.sinks.snklv2.hdfs.batchSize=1500
> rtlv2.sinks.snklv2.hdfs.rollCount=0
> rtlv2.sinks.snklv2.hdfs.round=true
> rtlv2.sinks.snklv2.hdfs.roundValue=10
> rtlv2.sinks.snklv2.hdfs.roundUnit=minute
>
> Thanks!
> *Devin Suiter*
> Jr. Data Solutions Software Engineer
> 100 Sandusky Street | 2nd Floor | Pittsburgh, PA 15212
> Google Voice: 412-256-8556 | www.rdx.com
