Check out the latest trunk code... We just committed FLUME-1666 courtesy of Jeff Lord this week.
Mike

Sent from my iPhone

> On Oct 10, 2013, at 11:56 AM, DSuiter RDX <[email protected]> wrote:
>
> Hi all,
>
> We set up a pipeline to get rsyslog input from a remote server via TCP using
> rsyslog remote TCP forwarding functionality. The data gets sent from the
> server to a syslogTCP source, delivered to an Avro sink via memory channel,
> which then delivers it to an Avro source channeled to an HDFS sink. It is
> moving from source to destination fine, but the output is messy in HDFS. I
> realize some of it is Avro schema being defined, but there are Severity and
> Facility markers, and extra timestamps that do not appear in
> /var/log/messages in the original server.
>
> I am wondering if anyone can help us eliminate them? The extra information is
> not useful, so if we could get the information down to what is showing up in
> the /var/log/messages, that would simplify the next task of sorting the data
> in MapReduce.
>
> Here is the agent recipe, and a scrubbed sample of the data we are getting.
>
> Recipe:
> RT_syslog.sources = syslogTCP_RT_Tier1_Source avro_RT_Tier2_Source
> RT_syslog.sinks = avro_RT_Tier1_Sink HDFS_RT_Tier2_Sink
> RT_syslog.channels = memory_RT_Tier1_Channel memory_RT_Tier2_Channel
>
> # sources
> RT_syslog.sources.syslogTCP_RT_Tier1_Source.type = syslogtcp
> RT_syslog.sources.syslogTCP_RT_Tier1_Source.host = 12.34.56.78
> RT_syslog.sources.syslogTCP_RT_Tier1_Source.port = 5140
> RT_syslog.sources.syslogTCP_RT_Tier1_Source.channels = memory_RT_Tier1_Channel
>
> # channels
> RT_syslog.channels.memory_RT_Tier1_Channel.type = memory
> RT_syslog.channels.memory_RT_Tier1_Channel.capacity = 1500
> RT_syslog.channels.memory_RT_Tier1_Channel.transactionCapacity = 1500
>
> # sinks
> RT_syslog.sinks.avro_RT_Tier1_Sink.type = avro
> RT_syslog.sinks.avro_RT_Tier1_Sink.hostname = 12.34.56.78
> RT_syslog.sinks.avro_RT_Tier1_Sink.port = 5141
> RT_syslog.sinks.avro_RT_Tier1_Sink.batch-size = 1500
> RT_syslog.sinks.avro_RT_Tier1_Sink.channel = memory_RT_Tier1_Channel
>
> # sources
> RT_syslog.sources.avro_RT_Tier2_Source.type = avro
> RT_syslog.sources.avro_RT_Tier2_Source.bind = 12.34.56.78
> RT_syslog.sources.avro_RT_Tier2_Source.port = 5141
> RT_syslog.sources.avro_RT_Tier2_Source.channels = memory_RT_Tier2_Channel
>
> # channels
> RT_syslog.channels.memory_RT_Tier2_Channel.type = memory
> RT_syslog.channels.memory_RT_Tier2_Channel.capacity = 15000
> RT_syslog.channels.memory_RT_Tier2_Channel.transactionCapacity = 15000
>
> # sinks
> RT_syslog.sinks.HDFS_RT_Tier2_Sink.type = hdfs
> RT_syslog.sinks.HDFS_RT_Tier2_Sink.channel = memory_RT_Tier2_Channel
> RT_syslog.sinks.HDFS_RT_Tier2_Sink.hdfs.path = /user/flume/RT_syslog
> RT_syslog.sinks.HDFS_RT_Tier2_Sink.hdfs.fileSuffix = .avro
> RT_syslog.sinks.HDFS_RT_Tier2_Sink.serializer = avro_event
> RT_syslog.sinks.HDFS_RT_Tier2_Sink.hdfs.fileType = DataStream
> RT_syslog.sinks.HDFS_RT_Tier2_Sink.hdfs.rollInterval = 86400
> RT_syslog.sinks.HDFS_RT_Tier2_Sink.hdfs.rollSize = 134217728
> RT_syslog.sinks.HDFS_RT_Tier2_Sink.hdfs.batchSize = 15000
> RT_syslog.sinks.HDFS_RT_Tier2_Sink.hdfs.rollCount = 0
>
> Data we are getting in HDFS:
>
> {u'body': "RT: Ticket XXXXXX created in queue 'General' by info
> (/opt/rt4/sbin/../lib/RT/Ticket.pm:694)",
> u'headers': {u'timestamp': u'1381256530000', u'host': u'server001',
> u'Severity': u'6', u'Facility': u'1'}}
>
> What that looks like in original form:
>
> Oct 10 11:33:42 server001 RT: Ticket XXXXXX created in queue 'General' by
> info (/opt/rt4/sbin/../lib/RT/Ticket.pm:694)
>
> Thanks!
> Devin Suiter
> Jr. Data Solutions Software Engineer
>
> 100 Sandusky Street | 2nd Floor | Pittsburgh, PA 15212
> Google Voice: 412-256-8556 | www.rdx.com
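
For data that has already landed in HDFS in the shape shown in the sample, one possible workaround is to rebuild the /var/log/messages-style line from the record's headers and body before sorting. The following is only a rough sketch, not something from the Flume code base: the function name `to_syslog_line` is made up for illustration, the record is assumed to have been decoded into a dict with the `body`, `timestamp` and `host` fields seen in the sample, the `timestamp` header is assumed to be epoch milliseconds, and UTC is assumed because the original server's time zone is not stated.

```python
from datetime import datetime, timezone

# Month abbreviations as classic syslog prints them (avoids locale dependence).
MONTHS = ("Jan", "Feb", "Mar", "Apr", "May", "Jun",
          "Jul", "Aug", "Sep", "Oct", "Nov", "Dec")


def to_syslog_line(event, tz=timezone.utc):
    """Rebuild a /var/log/messages-style line from one decoded event dict.

    Hypothetical helper: assumes the dict layout shown in the sample above.
    """
    headers = event["headers"]
    # Assumption: the 'timestamp' header holds epoch milliseconds as a string.
    ts = datetime.fromtimestamp(int(headers["timestamp"]) / 1000, tz=tz)
    # Classic syslog pads a single-digit day with a space, not a zero.
    stamp = "{} {:2d} {:02d}:{:02d}:{:02d}".format(
        MONTHS[ts.month - 1], ts.day, ts.hour, ts.minute, ts.second)
    body = event["body"]
    # An Avro reader may hand the body back as bytes rather than text.
    if isinstance(body, bytes):
        body = body.decode("utf-8", errors="replace")
    # Severity and Facility headers are simply left out of the output.
    return "{} {} {}".format(stamp, headers["host"], body)


# The scrubbed sample record from the message above.
sample = {
    "body": "RT: Ticket XXXXXX created in queue 'General' by info "
            "(/opt/rt4/sbin/../lib/RT/Ticket.pm:694)",
    "headers": {"timestamp": "1381256530000", "host": "server001",
                "Severity": "6", "Facility": "1"},
}

if __name__ == "__main__":
    print(to_syslog_line(sample))
```

Pass a different `tz` if the source server logs in local time. The header timestamp has only the resolution Flume recorded, so the rebuilt line may not match the original byte for byte.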
