Hi, this may be a problem with my understanding or with my configuration. I am trying to take data forwarded from rsyslog over TCP into a syslogTCP source, pass it through an Avro sink to an Avro source, and then write it out with an HDFS sink.
Everything is connected and the data is flowing from the remote source into HDFS in an Avro container, so that is not the problem. The problem is that the sink is closing files when they are very small, only KBs in size, even though I have the hdfs.rollInterval and hdfs.rollCount properties set to 0. I set the hdfs.rollSize property to 3072, intending 3 MB. I expected it to aggregate events into larger files before closing them. Is this happening because the escape sequences in the HDFS path force new directories to be created, so that new files are started prematurely?

Here are my agent configs:

syslogTCP Source > Avro Sink (first tier, pretty sure everything is ok here but maybe not)

####RT Listener Agent####
rtlv1.sources=srclv1
rtlv1.sinks=snklv1
rtlv1.channels=chnlv1

#sources
rtlv1.sources.srclv1.type=syslogtcp
rtlv1.sources.srclv1.host=192.168.1.2
rtlv1.sources.srclv1.port=5140
rtlv1.sources.srclv1.channels=chnlv1

#channels
rtlv1.channels.chnlv1.type=memory
rtlv1.channels.chnlv1.capacity=1500
rtlv1.channels.chnlv1.transactionCapacity=1500

#sinks
rtlv1.sinks.snklv1.type=avro
rtlv1.sinks.snklv1.hostname=192.168.1.2
rtlv1.sinks.snklv1.port=5141
rtlv1.sinks.snklv1.batch-size=1500
rtlv1.sinks.snklv1.channel=chnlv1

Avro Source > HDFS (second tier)

####RT Aggregate Writer Agent####
rtlv2.sources=srclv2
rtlv2.sinks=snklv2
rtlv2.channels=chnlv2

#sources
rtlv2.sources.srclv2.type=avro
rtlv2.sources.srclv2.bind=192.168.1.2
rtlv2.sources.srclv2.port=5141
rtlv2.sources.srclv2.channels=chnlv2

#channels
rtlv2.channels.chnlv2.type=memory
rtlv2.channels.chnlv2.capacity=1500
rtlv2.channels.chnlv2.transactioncapacity=1500

#sinks
rtlv2.sinks.snklv2.type=hdfs
rtlv2.sinks.snklv2.channel=chnlv2
rtlv2.sinks.snklv2.hdfs.path=/user/flume/avro/%y-%m-%d/%H%M
rtlv2.sinks.snklv2.hdfs.fileSuffix=.avro
rtlv2.sinks.snklv2.serializer=avro_event
rtlv2.sinks.snklv2.hdfs.fileType=DataStream
rtlv2.sinks.snklv2.hdfs.rollInterval=0
rtlv2.sinks.snklv2.hdfs.rollSize=3072
rtlv2.sinks.snklv2.hdfs.batchSize=1500
rtlv2.sinks.snklv2.hdfs.rollCount=0
rtlv2.sinks.snklv2.hdfs.round=true
rtlv2.sinks.snklv2.hdfs.roundValue=10
rtlv2.sinks.snklv2.hdfs.roundUnit=minute

Thanks!

*Devin Suiter*
Jr. Data Solutions Software Engineer
100 Sandusky Street | 2nd Floor | Pittsburgh, PA 15212
Google Voice: 412-256-8556 | www.rdx.com
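
A note on the size setting, offered as something to check and not as a confirmed diagnosis: the Flume user guide describes hdfs.rollSize as a file size in bytes, so a value of 3072 would trigger a roll at roughly 3 KB, which matches the KB-sized files described in this message. Assuming that is the cause, a 3 MB threshold for the second-tier sink might look like the sketch below. Only the rollSize value differs from the posted settings.

```properties
# Sketch only: assumes hdfs.rollSize is interpreted in bytes, per the Flume user guide.
# 3 MB = 3 * 1024 * 1024 = 3145728 bytes
rtlv2.sinks.snklv2.hdfs.rollInterval=0
rtlv2.sinks.snklv2.hdfs.rollCount=0
rtlv2.sinks.snklv2.hdfs.rollSize=3145728
```

Separately, because hdfs.path contains %H%M with rounding to 10 minutes, events are likely written to a new directory, and therefore a new file, every 10 minutes regardless of the roll settings. If less than 3 MB arrives in a 10-minute window, files would still stay below that size.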
