I am exactly where you are with this, except for the problem of
my not having had time to write a serializer to address the
Hostname Timestamp issue.Questions about the use of Flume in this
manner seem to recur on a regular basis, so it seems a common use
case.
Sorry I cannot offer a solution since I am in your shoes at the
moment, unfortunately looking at storing logs twice.
Ron Thielen
<image001.jpg>
*From:*Josh West [mailto:[email protected] <mailto:[email protected]>]
*Sent:*Friday, October 26, 2012 9:05 AM
*To:*[email protected] <mailto:[email protected]>
*Subject:*Syslog Infrastructure with Flume
Hey folks,
I've been experimenting with Flume for a few weeks now, trying to
determine an approach to designing a reliable, highly available,
scalable system to store logs from various sources, including
syslog. Ideally, this system will meet the following requirements:
1. Logs from syslog across all servers make their way into HDFS.
2. Logs are stored in HDFS in a manner that is available for
post-processing:
* Example: HIVE partitions - with HDFS Flume Sink, can set
hdfs.path
tohdfs://namenode/flume/syslog/server=%{host}/facility=%{Facility}
* Example: Custom map reduce jobs...
3. Logs are stored in HDFS in a manner that is available for
"reading" by sysadmins:
* During troubleshooting/firefighting, it is quite helpful
to be able to login to a central logging system and tail
-f / grep logs.
* We need to be able to see the logs "live".
Some folks may be wondering why are we choosing Flume for syslog,
instead of something like Graylog2 or Logstash? The answer is we
will be using Flume + Hadoop for the transport and processing of
other types of data in addition to syslog. For example,
webserver access logs for post processing and statistical
analysis. So, we would like to make the most use of the Hadoop
cluster, keeping all logs of all types in one redundant/scalable
solution. Additionally, by keeping both syslog and webserver
access logs in Hadoop/HDFS, we can begin to correlate events.
I've run into some snags while attempting to implement Flume in a
manner that satisfies the requirements listed in the top of this
message:
1. Logs to HDFS:
* I can indeed use the Flume HDFS Sink to reliably write
logs into HDFS.
* Needed to write custom serializer to add Hostname and
Timestamp fields back to syslog messages.
* See: https://issues.apache.org/jira/browse/FLUME-1666
2. Logs to HDFS in manner available for
reading/firefighting/troubleshooting by sysadmins:
* Flume HDFS Sink uses the BucketWriter for recording flume
events to HDFS.
* Creates data files like:
/flume/syslog/server=%{host}/facility=%{Facility}/FlumeData.1350997160213
* Each file is format of FlumeData (or custom prefix)
followed by . followed by unix timestamp of when the file
was created.
o This is somewhat necessary... As you could have
multiple Flume writers, writing to the same HDFS, the
files cannot be opened by more than one writer. So
each writer should write to its own file.
* Latest file, currently being written to, is suffixed with
".tmp".
* This approach is not very sysadmin-friendly....
o You have to find the latest (ie. the .tmp files) and
hadoop fs -tail -f /path/to/file.tmp
o Hadoop's fs -tail -f command first prints the entire
file's contents, then begins tailing.
So the sum of it all is Flume is awesome for getting syslog (and
other) data into HDFS for post processing, but not the best at
getting it into HDFS in a sysadmin troubleshooting/firefighting
format. In an ideal world, I have syslog data coming into Flume
via one transport (i.e. SyslogTcp Source or SyslogUDP Source) and
being written into HDFS in a manner that is both post-processable
and sysadmin-friendly, but it looks like this isn't going to happen.
I've thus investigated some alternative approaches to meet the
requirements. One of these approaches is to have all of my
servers send their syslog messages to a central box running
rsyslog. Then, rsyslog would perform one of the following actions:
1. Write logs to HDFS directly using 'omhdfs' module, in a
format that is both post-processable and sysadmin-friendly :-)
2. Write logs to HDFS directly using 'hadoop-fuse-dfs' utility,
which has HDFS mounted as a filesystem.
3. Write logs to a local filesystem and also replicate logs into
a flume agent, configured with a SyslogSource and HDFS sink.
Option #1 sounds great. But unfortunately the 'omhdfs' module
for rsyslog isn't working very well. I've gotten it to login to
Hadoop/HDFS but it has issues creating/appending files.
Additionally, templating is somewhat suspect (ie. making
directories /syslog/someserver/somefacility dynamically).
Option #2 sounds reasonable, but either the HDFS FUSE module
doesn't support append mode (yet) or rsyslog is trying to
create/open the files in a manner not compliant with HDFS. No
surprise, as we all know HDFS can be somewhat "special" at times
;-) It's actually no matter anyways... Trying to "tail -f" a file
mounted via HDFS FUSE is rather useless. The data is only and
finally fed to the tail command once a full 64MB (or whatever you
use) block size of data has been written to the file. One would
only be able to use "hadoop fs -tail -f /path/to/log" which has
its own issues mentioned previously.
Option #3 would definitely work. However, now I'm storing my
logs twice. Once on some local filesystem and another time in
HDFS. It works but its not ideal as it's a waste of space. And
you've probably noticed from this email so far, I'd prefer
the*ideal*solution :-)
*Note*: Astute flumers would probably look at option #3 and
recommend making use of the RollingFileSink in addition to the
HDFSSink. Unfortunately, the RollingFileSink doesn't support
templated/dynamic directory creation like the HDFSSink with its
hdfs.path setting of
"hdfs://namenode/flume/syslog/server=%{host}/facility=%{Facility}".
So what exactly am I asking here? Well, I'd like to know first
how others are doing this. A hybrid of rsyslog and Flume? All
and only Flume? With custom serializers/interceptors/sinks? Or
perhaps... how would you recommend I handle this?
Thanks for any and all thoughts you can provide.
--
Josh West
Lead Systems Administrator
One.com <http://One.com>,[email protected] <mailto:[email protected]>
<Ronald J Thielen.vcf>