Very cool. HCatalog seems like a nice idea, as otherwise a lot of thought and planning must go into how one stores data, ensuring it can be read from all the different Apache Hadoop related projects. E.g., if you store Flume data in an HDFS path with /something=partition1/foo=bar, you'll need to use special Pig libraries to load the partitions.
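
To make the partition layout concrete, here is a minimal Python sketch of what any reader has to do with such a path: pull the key=value directory components back out. The function name partition_values is hypothetical and is not part of Pig, Hive or HCatalog; it only illustrates the convention.

```python
from pathlib import PurePosixPath


def partition_values(path):
    """Collect key=value directory components from a partitioned path."""
    values = {}
    for part in PurePosixPath(path).parts:
        key, sep, value = part.partition("=")
        # Only components that actually contain "key=value" are partitions;
        # the root "/" and plain file names are skipped.
        if sep and key:
            values[key] = value
    return values


print(partition_values("/something=partition1/foo=bar/FlumeData.1350997160213"))
# {'something': 'partition1', 'foo': 'bar'}
```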

Keep us updated please!

On 10/30/2012 12:37 AM, Roshan Naik wrote:
I am in the process of investigating the possibility of creating an HCatalog sink for Flume, which should be able to handle such use cases. For your use case it could be thought of as a Hive sink. The goal is basically as follows: it would allow multiple Flume agents to pump logs into Hive tables. That would make the data queryable without additional manual steps. Data will get added periodically in the form of new partitions to Hive. You would not have to deal with temporary files or manual addition of data into Hive.

-roshan



On Sun, Oct 28, 2012 at 5:45 PM, Ralph Goers wrote:

    Since you ask...

    In our environment our primary concern is audit logs - we have
    to audit banking transactions as well as changes administrators
    make. We have a legacy system that needed to be integrated that
    had records in a form different than what we want stored. We also
    need to allow administrators to view events as close to real time
    as possible. Plus we have to aggregate data across 2 data centers.
    Although we are currently not including web server access logs we
    plan to integrate them in over time.  We also have requirements
    from our security team to pass events for their use to ArcSight.

    1. We have a "log extractor" that receives legacy events as they
    occur and converts them into our new format and passes them to
    Flume. All new applications use the Log4j 2 Flume Appender to get
    data to Flume.
    2. Flume passes the data to ArcSight for our security team's use.
    3. We wrote a Flume to Cassandra Sink.
    4. We wrote our own REST query services to retrieve the data from
    Cassandra.
    5. Since we are using DataStax Enterprise version of Cassandra we
    have also set up "Analytic" nodes that run Hadoop on top of
    Cassandra. This allows the data to be accessed via normal Hadoop
    tools for data analytics.
    6. We have written our own reporting UI component in our
    Administrative Platform to allow administrators to view activities
    in real time or to schedule background data collection so users
    can post process the data on their own.

    We do not have anything to allow an admin to "tail" the log but it
    wouldn't be hard at all to write an application to accept Flume
    events via Avro and display the last "n" events as they arrive.
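
A minimal Python sketch of the "last n events" part of such an application, assuming the events have already been received and decoded. Receiving them from Flume via Avro is not shown, and the class name LastEvents is hypothetical.

```python
from collections import deque


class LastEvents:
    """Keep only the most recent n events, in arrival order."""

    def __init__(self, n):
        # A deque with maxlen drops the oldest entry automatically
        # once it is full.
        self._events = deque(maxlen=n)

    def add(self, event):
        self._events.append(event)

    def snapshot(self):
        return list(self._events)


recent = LastEvents(3)
for i in range(1, 6):  # stand-in for events arriving from Flume
    recent.add(f"event {i}")
print(recent.snapshot())
# ['event 3', 'event 4', 'event 5']
```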

    One thing I should point out. We format our events in accordance
    with RFC 5424 and store that in the Flume event body. We then
    store all our individual pieces of audit event data in Flume
    header fields.  The RFC 5424 message is what we send to ArcSight.
    The event fields and the compressed body are all stored in
    individual columns in Cassandra.
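
As a rough illustration of that layout, the Python sketch below builds an RFC 5424 line for the event body and keeps the individual fields alongside it as headers. All field names, the "audit@32473" SD-ID (32473 is the enterprise number reserved for documentation) and the dict-based event shape are assumptions made for the example, not the format actually used in the setup described above.

```python
def sd_escape(value):
    """Escape backslash, double quote and ']' as RFC 5424 requires in PARAM-VALUE."""
    return value.replace("\\", "\\\\").replace('"', '\\"').replace("]", "\\]")


def format_rfc5424(facility, severity, timestamp, hostname, app_name,
                   procid, msgid, sd_id, fields, message):
    """Build '<PRI>1 TIMESTAMP HOSTNAME APP-NAME PROCID MSGID SD MSG'."""
    pri = facility * 8 + severity  # PRI combines facility and severity
    params = "".join(f' {k}="{sd_escape(v)}"' for k, v in fields.items())
    structured = f"[{sd_id}{params}]" if fields else "-"  # "-" is the nil value
    return (f"<{pri}>1 {timestamp} {hostname} {app_name} "
            f"{procid} {msgid} {structured} {message}")


def to_event(fields, line):
    """Hypothetical event shape: audit fields as headers, RFC 5424 line as body."""
    return {"headers": dict(fields), "body": line.encode("utf-8")}


fields = {"user": "jdoe", "amount": "100.00"}  # illustrative audit fields
line = format_rfc5424(13, 6, "2012-10-28T17:45:00Z", "app01.example.com",
                      "banking", "-", "Transfer", "audit@32473",
                      fields, "Transfer completed")
event = to_event(fields, line)
print(line)
```

Facility 13 (log audit) and severity 6 (informational) give a PRI of 110 here.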

    Ralph


    On Oct 26, 2012, at 2:06 PM, Ron Thielen wrote:

    I am exactly where you are with this, except for the problem of
    my not having had time to write a serializer to address the
    Hostname Timestamp issue. Questions about the use of Flume in this
    manner seem to recur on a regular basis, so it seems a common use
    case.
    Sorry I cannot offer a solution since I am in your shoes at the
    moment, unfortunately looking at storing logs twice.
    Ron Thielen
    *From:* Josh West
    *Sent:* Friday, October 26, 2012 9:05 AM
    *To:* [email protected]
    *Subject:* Syslog Infrastructure with Flume

    Hey folks,

    I've been experimenting with Flume for a few weeks now, trying to
    determine an approach to designing a reliable, highly available,
    scalable system to store logs from various sources, including
    syslog.  Ideally, this system will meet the following requirements:

     1. Logs from syslog across all servers make their way into HDFS.
     2. Logs are stored in HDFS in a manner that is available for
        post-processing:
          * Example: Hive partitions - with HDFS Flume Sink, can set
            hdfs.path to
            hdfs://namenode/flume/syslog/server=%{host}/facility=%{Facility}
          * Example: Custom map reduce jobs...
     3. Logs are stored in HDFS in a manner that is available for
        "reading" by sysadmins:
          * During troubleshooting/firefighting, it is quite helpful
            to be able to login to a central logging system and tail
            -f / grep logs.
          * We need to be able to see the logs "live".

    Some folks may be wondering why are we choosing Flume for syslog,
    instead of something like Graylog2 or Logstash?  The answer is we
    will be using Flume + Hadoop for the transport and processing of
    other types of data in addition to syslog.  For example,
    webserver access logs for post processing and statistical
    analysis.  So, we would like to make the most use of the Hadoop
    cluster, keeping all logs of all types in one redundant/scalable
    solution. Additionally, by keeping both syslog and webserver
    access logs in Hadoop/HDFS, we can begin to correlate events.

    I've run into some snags while attempting to implement Flume in a
    manner that satisfies the requirements listed at the top of this
    message:

     1. Logs to HDFS:
          * I can indeed use the Flume HDFS Sink to reliably write
            logs into HDFS.
          * Needed to write custom serializer to add Hostname and
            Timestamp fields back to syslog messages.
          * See: https://issues.apache.org/jira/browse/FLUME-1666
     2. Logs to HDFS in manner available for
        reading/firefighting/troubleshooting by sysadmins:
          * Flume HDFS Sink uses the BucketWriter for recording flume
            events to HDFS.
          * Creates data files like:
            /flume/syslog/server=%{host}/facility=%{Facility}/FlumeData.1350997160213
          * Each file name has the format FlumeData (or custom prefix)
            followed by . followed by the Unix timestamp of when the
            file was created.
              o This is somewhat necessary... As you could have
                multiple Flume writers, writing to the same HDFS, the
                files cannot be opened by more than one writer.  So
                each writer should write to its own file.
          * Latest file, currently being written to, is suffixed with
            ".tmp".
          * This approach is not very sysadmin-friendly....
              o You have to find the latest file (i.e. the .tmp file)
                and run hadoop fs -tail -f /path/to/file.tmp
              o Hadoop's fs -tail -f command first prints the entire
                file's contents, then begins tailing.
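
The "find the latest .tmp file" step above can be sketched in Python as follows, assuming you already have the file names from a directory listing (for example from hadoop fs -ls). The helper names are hypothetical, and the naming rule is only the one described in this message: prefix, a dot, a numeric suffix, and ".tmp" while the file is still being written.

```python
def parse_name(name, prefix="FlumeData"):
    """Return (timestamp, in_progress) for a matching file name, else None."""
    in_progress = name.endswith(".tmp")
    core = name[:-len(".tmp")] if in_progress else name
    head, sep, stamp = core.rpartition(".")
    if not sep or head != prefix or not stamp.isdigit():
        return None  # not a file following the naming rule
    return int(stamp), in_progress


def newest_open_file(names, prefix="FlumeData"):
    """Pick the in-progress (.tmp) file with the highest numeric suffix."""
    candidates = []
    for name in names:
        parsed = parse_name(name, prefix)
        if parsed and parsed[1]:
            candidates.append((parsed[0], name))
    return max(candidates)[1] if candidates else None


listing = [
    "FlumeData.1350997160213",
    "FlumeData.1350997170000.tmp",
    "FlumeData.1350997190000.tmp",
    "_SUCCESS",
]
target = newest_open_file(listing)
print(f"hadoop fs -tail -f /path/to/{target}")
```

With several writers there can be more than one .tmp file per directory, so this only narrows the search; it does not remove the problem described above.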

    So the sum of it all is Flume is awesome for getting syslog (and
    other) data into HDFS for post processing, but not the best at
    getting it into HDFS in a sysadmin troubleshooting/firefighting
    format.  In an ideal world, I have syslog data coming into Flume
    via one transport (i.e. SyslogTcp Source or SyslogUDP Source) and
    being written into HDFS in a manner that is both post-processable
    and sysadmin-friendly, but it looks like this isn't going to happen.

    I've thus investigated some alternative approaches to meet the
    requirements.  One of these approaches is to have all of my
    servers send their syslog messages to a central box running
    rsyslog. Then, rsyslog would perform one of the following actions:

     1. Write logs to HDFS directly using 'omhdfs' module, in a
        format that is both post-processable and sysadmin-friendly :-)
     2. Write logs to HDFS directly using 'hadoop-fuse-dfs' utility,
        which has HDFS mounted as a filesystem.
     3. Write logs to a local filesystem and also replicate logs into
        a flume agent, configured with a SyslogSource and HDFS sink.

    Option #1 sounds great.  But unfortunately the 'omhdfs' module
    for rsyslog isn't working very well.  I've gotten it to login to
    Hadoop/HDFS but it has issues creating/appending files.
    Additionally, templating is somewhat suspect (ie. making
    directories /syslog/someserver/somefacility dynamically).

    Option #2 sounds reasonable, but either the HDFS FUSE module
    doesn't support append mode (yet) or rsyslog is trying to
    create/open the files in a manner not compliant with HDFS.  No
    surprise, as we all know HDFS can be somewhat "special" at times
    ;-) It doesn't actually matter anyway... Trying to "tail -f" a file
    mounted via HDFS FUSE is rather useless. The data is only fed to
    the tail command once a full block (64MB, or whatever block size
    you use) of data has been written to the file. One would
    only be able to use "hadoop fs -tail -f /path/to/log" which has
    its own issues mentioned previously.

    Option #3 would definitely work.  However, now I'm storing my
    logs twice.  Once on some local filesystem and another time in
    HDFS.  It works but it's not ideal as it's a waste of space.  And
    as you've probably noticed from this email so far, I'd prefer
    the *ideal* solution :-)

    *Note*: Astute flumers would probably look at option #3 and
    recommend making use of the RollingFileSink in addition to the
    HDFSSink. Unfortunately, the RollingFileSink doesn't support
    templated/dynamic directory creation like the HDFSSink with its
    hdfs.path setting of
    "hdfs://namenode/flume/syslog/server=%{host}/facility=%{Facility}".

    So what exactly am I asking here?  Well, I'd like to know first
    how others are doing this.  A hybrid of rsyslog and Flume?  All
    and only Flume?  With custom serializers/interceptors/sinks?  Or
    perhaps... how would you recommend I handle this?

    Thanks for any and all thoughts you can provide.

-- Josh West
    Lead Systems Administrator
    One.com, [email protected]



--
Josh West
Lead Systems Administrator
One.com, [email protected]
