Check out the latest trunk code... We just committed FLUME-1666 courtesy of 
Jeff Lord this week.
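
A minimal sketch of how that change might be applied, assuming your build 
includes the FLUME-1666 patch and that it exposes a keepFields option on the 
syslog sources (check the user guide for your version). The agent, source, and 
sink names come from the recipe quoted below. Switching the HDFS sink to the 
text serializer is a separate suggestion, not part of FLUME-1666, and none of 
this has been tested against your setup:

```properties
# Assumption: keepFields is available in this build (added by FLUME-1666).
# It keeps the syslog timestamp and hostname in the event body instead of
# moving them only into the event headers.
RT_syslog.sources.syslogTCP_RT_Tier1_Source.keepFields = true

# Assumption: plain-text output is acceptable for the MapReduce job.
# The text serializer writes only the event body, one event per line,
# so the Severity, Facility, and timestamp headers are not written to HDFS.
RT_syslog.sinks.HDFS_RT_Tier2_Sink.serializer = text
RT_syslog.sinks.HDFS_RT_Tier2_Sink.hdfs.fileSuffix = .log
```

With both settings the files should hold lines close to what /var/log/messages 
shows. Depending on the Flume version, the body may still carry other syslog 
fields such as the priority prefix, so verify on a small sample first.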

Mike

> On Oct 10, 2013, at 11:56 AM, DSuiter RDX <[email protected]> wrote:
> 
> Hi all,
> 
> We set up a pipeline to get rsyslog input from a remote server via TCP using 
> rsyslog remote TCP forwarding functionality. The data gets sent from the 
> server to a syslogTCP source, delivered to an Avro sink via memory channel, 
> which then delivers it to an Avro source channeled to an HDFS sink. It is 
> moving from source to destination fine, but the output is messy in HDFS. I 
> realize some of it is Avro schema being defined, but there are Severity and 
> Facility markers, and extra timestamps that do not appear in 
> /var/log/messages in the original server.
> 
> I am wondering if anyone can help us eliminate them? The extra information is 
> not useful, so if we could get the information down to what is showing up in 
> the /var/log/messages, that would simplify the next task of sorting the data 
> in MapReduce.
> 
> Here is the agent recipe, and a scrubbed sample of the data we are getting.
> 
> Recipe:
> RT_syslog.sources = syslogTCP_RT_Tier1_Source avro_RT_Tier2_Source
> RT_syslog.sinks = avro_RT_Tier1_Sink HDFS_RT_Tier2_Sink
> RT_syslog.channels = memory_RT_Tier1_Channel memory_RT_Tier2_Channel
> 
> # sources
> RT_syslog.sources.syslogTCP_RT_Tier1_Source.type = syslogtcp
> RT_syslog.sources.syslogTCP_RT_Tier1_Source.host = 12.34.56.78
> RT_syslog.sources.syslogTCP_RT_Tier1_Source.port = 5140
> RT_syslog.sources.syslogTCP_RT_Tier1_Source.channels = memory_RT_Tier1_Channel
> 
> # channels
> RT_syslog.channels.memory_RT_Tier1_Channel.type = memory
> RT_syslog.channels.memory_RT_Tier1_Channel.capacity = 1500
> RT_syslog.channels.memory_RT_Tier1_Channel.transactionCapacity = 1500
> 
> # sinks
> RT_syslog.sinks.avro_RT_Tier1_Sink.type = avro
> RT_syslog.sinks.avro_RT_Tier1_Sink.hostname = 12.34.56.78
> RT_syslog.sinks.avro_RT_Tier1_Sink.port = 5141
> RT_syslog.sinks.avro_RT_Tier1_Sink.batch-size = 1500
> RT_syslog.sinks.avro_RT_Tier1_Sink.channel = memory_RT_Tier1_Channel
> 
> # sources
> RT_syslog.sources.avro_RT_Tier2_Source.type = avro
> RT_syslog.sources.avro_RT_Tier2_Source.bind = 12.34.56.78
> RT_syslog.sources.avro_RT_Tier2_Source.port = 5141
> RT_syslog.sources.avro_RT_Tier2_Source.channels = memory_RT_Tier2_Channel
> 
> # channels
> RT_syslog.channels.memory_RT_Tier2_Channel.type = memory
> RT_syslog.channels.memory_RT_Tier2_Channel.capacity = 15000
> RT_syslog.channels.memory_RT_Tier2_Channel.transactionCapacity = 15000
> 
> # sinks
> RT_syslog.sinks.HDFS_RT_Tier2_Sink.type = hdfs
> RT_syslog.sinks.HDFS_RT_Tier2_Sink.channel = memory_RT_Tier2_Channel
> RT_syslog.sinks.HDFS_RT_Tier2_Sink.hdfs.path = /user/flume/RT_syslog
> RT_syslog.sinks.HDFS_RT_Tier2_Sink.hdfs.fileSuffix = .avro
> RT_syslog.sinks.HDFS_RT_Tier2_Sink.serializer = avro_event
> RT_syslog.sinks.HDFS_RT_Tier2_Sink.hdfs.fileType = DataStream
> RT_syslog.sinks.HDFS_RT_Tier2_Sink.hdfs.rollInterval = 86400
> RT_syslog.sinks.HDFS_RT_Tier2_Sink.hdfs.rollSize = 134217728
> RT_syslog.sinks.HDFS_RT_Tier2_Sink.hdfs.batchSize = 15000
> RT_syslog.sinks.HDFS_RT_Tier2_Sink.hdfs.rollCount = 0
> 
> Data we are getting in HDFS:
> 
> {u'body': "RT: Ticket XXXXXX created in queue 'General' by info 
> (/opt/rt4/sbin/../lib/RT/Ticket.pm:694)",
> u'headers': {u'timestamp': u'1381256530000', u'host': u'server001', 
> u'Severity': u'6', u'Facility': u'1'}}
> 
> What that looks like in original form:
> 
> Oct 10 11:33:42 server001 RT: Ticket XXXXXX created in queue 'General' by 
> info (/opt/rt4/sbin/../lib/RT/Ticket.pm:694)
> 
> Thanks!
> Devin Suiter
> Jr. Data Solutions Software Engineer
> 
> 100 Sandusky Street | 2nd Floor | Pittsburgh, PA 15212
> Google Voice: 412-256-8556 | www.rdx.com
