Roshan,

I am not a Hive/HCatalog pro, but I am wondering: why an HCatalog sink rather than a Hive Sink? Hive is definitely very popular, and a Hive Sink would be well appreciated. Since HCatalog is supposed to be compatible with the Hive metastore (right?), why not just implement a Hive Sink and make it available to a larger community? I'd definitely like to see a Hive Sink and would prioritize that, then add HCatalog support to it if explicitly required. That way it is useful to people who use Hive as well. In fact, there is already a Hive Sink JIRA here: https://issues.apache.org/jira/browse/FLUME-1008. I am a +1 for a Hive Sink, so please take a look.

Thanks,
Hari

--
Hari Shreedharan

On Monday, October 29, 2012 at 4:37 PM, Roshan Naik wrote:

> I am in the process of investigating the possibility of creating an HCatalog
> sink for Flume which should be able to handle such use cases. For your use
> case it could be thought of as a Hive sink. The goal is basically as follows:
> it would allow multiple Flume agents to pump logs into Hive tables. That
> would make the data queryable without additional manual steps. Data will get
> added periodically in the form of new partitions to Hive. You would not have
> to deal with temporary files or manual addition of data into Hive.
>
> -roshan
>
>
> On Sun, Oct 28, 2012 at 5:45 PM, Ralph Goers <[email protected]> wrote:
> > Since you ask...
> >
> > In our environment our primary concern is audit logs - we have to audit
> > banking transactions as well as changes administrators make. We have a
> > legacy system that needed to be integrated that had records in a form
> > different than what we want stored. We also need to allow administrators to
> > view events as close to real time as possible. Plus we have to aggregate
> > data across 2 data centers. Although we are currently not including web
> > server access logs we plan to integrate them in over time. We also have
> > requirements from our security team to pass events for their use to
> > ArcSight.
> >
> > 1. We have a "log extractor" that receives legacy events as they occur and
> > converts them into our new format and passes them to Flume. All new
> > applications use the Log4j 2 Flume Appender to get data to Flume.
> > 2. Flume passes the data to ArcSight for our security team's use.
> > 3. We wrote a Flume to Cassandra Sink.
> > 4. We wrote our own REST query services to retrieve the data from Cassandra.
> > 5. Since we are using the DataStax Enterprise version of Cassandra we have also
> > set up "Analytic" nodes that run Hadoop on top of Cassandra. This allows
> > the data to be accessed via normal Hadoop tools for data analytics.
> > 6. We have written our own reporting UI component in our Administrative
> > Platform to allow administrators to view activities in real time or to
> > schedule background data collection so users can post process the data on
> > their own.
> >
> > We do not have anything to allow an admin to "tail" the log but it wouldn't
> > be hard at all to write an application to accept Flume events via Avro and
> > display the last "n" events as they arrive.
> >
> > One thing I should point out. We format our events in accordance with RFC
> > 5424 and store that in the Flume event body. We then store all our
> > individual pieces of audit event data in Flume header fields. The RFC
> > 5424 message is what we send to ArcSight. The event fields and the
> > compressed body are all stored in individual columns in Cassandra.
> >
> > Ralph
> >
> >
> > On Oct 26, 2012, at 2:06 PM, Ron Thielen wrote:
> > > I am exactly where you are with this, except for the problem of my not
> > > having had time to write a serializer to address the Hostname Timestamp
> > > issue. Questions about the use of Flume in this manner seem to recur on
> > > a regular basis, so it seems a common use case.
> > >
> > > Sorry I cannot offer a solution since I am in your shoes at the moment,
> > > unfortunately looking at storing logs twice.
> > >
> > > Ron Thielen
> > >
> > > From: Josh West [mailto:[email protected]]
> > > Sent: Friday, October 26, 2012 9:05 AM
> > > To: [email protected]
> > > Subject: Syslog Infrastructure with Flume
> > >
> > > Hey folks,
> > >
> > > I've been experimenting with Flume for a few weeks now, trying to
> > > determine an approach to designing a reliable, highly available, scalable
> > > system to store logs from various sources, including syslog. Ideally,
> > > this system will meet the following requirements:
> > >
> > > - Logs from syslog across all servers make their way into HDFS.
> > > - Logs are stored in HDFS in a manner that is available for post-processing:
> > >   - Example: Hive partitions - with the Flume HDFS Sink, can set hdfs.path to
> > >     hdfs://namenode/flume/syslog/server=%{host}/facility=%{Facility}
> > >   - Example: Custom MapReduce jobs...
> > > - Logs are stored in HDFS in a manner that is available for "reading" by
> > >   sysadmins:
> > >   - During troubleshooting/firefighting, it is quite helpful to be able to
> > >     log in to a central logging system and tail -f / grep logs.
> > >   - We need to be able to see the logs "live".
> > >
> > > Some folks may be wondering why we are choosing Flume for syslog, instead
> > > of something like Graylog2 or Logstash. The answer is we will be using
> > > Flume + Hadoop for the transport and processing of other types of data in
> > > addition to syslog. For example, webserver access logs for post
> > > processing and statistical analysis. So, we would like to make the most
> > > use of the Hadoop cluster, keeping all logs of all types in one
> > > redundant/scalable solution. Additionally, by keeping both syslog and
> > > webserver access logs in Hadoop/HDFS, we can begin to correlate events.
> > >
> > > I've run into some snags while attempting to implement Flume in a manner
> > > that satisfies the requirements listed at the top of this message:
> > >
> > > - Logs to HDFS:
> > >   - I can indeed use the Flume HDFS Sink to reliably write logs into HDFS.
> > >   - Needed to write a custom serializer to add Hostname and Timestamp fields
> > >     back to syslog messages.
> > >     See: https://issues.apache.org/jira/browse/FLUME-1666
> > > - Logs to HDFS in a manner available for reading/firefighting/troubleshooting
> > >   by sysadmins:
> > >   - Flume HDFS Sink uses the BucketWriter for recording Flume events to HDFS.
> > >   - Creates data files like:
> > >     /flume/syslog/server=%{host}/facility=%{Facility}/FlumeData.1350997160213
> > >   - Each file name has the format FlumeData (or a custom prefix), followed
> > >     by ".", followed by the Unix timestamp of when the file was created.
> > >     - This is somewhat necessary... As you could have multiple Flume writers
> > >       writing to the same HDFS, the files cannot be opened by more than one
> > >       writer. So each writer should write to its own file.
> > >   - Latest file, currently being written to, is suffixed with ".tmp".
> > >   - This approach is not very sysadmin-friendly....
> > >     - You have to find the latest (i.e. the .tmp files) and hadoop fs -tail -f
> > >       /path/to/file.tmp
> > >     - Hadoop's fs -tail -f command first prints the entire file's contents,
> > >       then begins tailing.
> > >
> > > So the sum of it all is Flume is awesome for getting syslog (and other)
> > > data into HDFS for post processing, but not the best at getting it into
> > > HDFS in a sysadmin troubleshooting/firefighting format. In an ideal
> > > world, I have syslog data coming into Flume via one transport (i.e.
> > > SyslogTcp Source or SyslogUDP Source) and being written into HDFS in a
> > > manner that is both post-processable and sysadmin-friendly, but it looks
> > > like this isn't going to happen.
> > >
> > > I've thus investigated some alternative approaches to meet the
> > > requirements. One of these approaches is to have all of my servers send
> > > their syslog messages to a central box running rsyslog. Then, rsyslog
> > > would perform one of the following actions:
> > >
> > > 1. Write logs to HDFS directly using the 'omhdfs' module, in a format that is
> > >    both post-processable and sysadmin-friendly :-)
> > > 2. Write logs to HDFS directly using the 'hadoop-fuse-dfs' utility, which has
> > >    HDFS mounted as a filesystem.
> > > 3. Write logs to a local filesystem and also replicate logs into a Flume
> > >    agent, configured with a SyslogSource and HDFS sink.
> > >
> > > Option #1 sounds great. But unfortunately the 'omhdfs' module for
> > > rsyslog isn't working very well. I've gotten it to log in to Hadoop/HDFS
> > > but it has issues creating/appending files. Additionally, templating is
> > > somewhat suspect (i.e. making directories /syslog/someserver/somefacility
> > > dynamically).
> > >
> > > Option #2 sounds reasonable, but either the HDFS FUSE module doesn't
> > > support append mode (yet) or rsyslog is trying to create/open the files
> > > in a manner not compliant with HDFS. No surprise, as we all know HDFS
> > > can be somewhat "special" at times ;-) It actually doesn't matter
> > > anyway... Trying to "tail -f" a file mounted via HDFS FUSE is rather
> > > useless. The data is only fed to the tail command once a
> > > full 64MB (or whatever you use) block of data has been written to
> > > the file. One would only be able to use "hadoop fs -tail -f
> > > /path/to/log" which has its own issues mentioned previously.
> > >
> > > Option #3 would definitely work. However, now I'm storing my logs twice.
> > > Once on some local filesystem and another time in HDFS. It works but
> > > it's not ideal as it's a waste of space. And you've probably noticed from
> > > this email so far, I'd prefer the ideal solution :-)
> > >
> > > Note: Astute flumers would probably look at option #3 and recommend
> > > making use of the RollingFileSink in addition to the HDFSSink.
> > > Unfortunately, the RollingFileSink doesn't support templated/dynamic
> > > directory creation like the HDFSSink with its hdfs.path setting of
> > > "hdfs://namenode/flume/syslog/server=%{host}/facility=%{Facility}".
> > >
> > > So what exactly am I asking here? Well, I'd like to know first how
> > > others are doing this. A hybrid of rsyslog and Flume? All and only
> > > Flume? With custom serializers/interceptors/sinks? Or perhaps... how
> > > would you recommend I handle this?
> > >
> > > Thanks for any and all thoughts you can provide.
> > >
> > > --
> > > Josh West
> > > Lead Systems Administrator
> > > One.com, [email protected]
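
For readers following the hdfs.path discussion in this thread: the sketch below shows, in Python, roughly how a template such as server=%{host}/facility=%{Facility} turns event headers into a Hive-style partition directory. It is only an illustration and not Flume's own code. The function name expand_path and the sample header values are hypothetical, and replacing a missing header with an empty string is an assumption made for this sketch.

```python
import re

# Matches %{name} header escapes as used in the hdfs.path examples above.
_HEADER_ESCAPE = re.compile(r"%\{([^}]+)\}")


def expand_path(template, headers):
    """Replace each %{name} in template with headers[name].

    Hypothetical helper for illustration only. A header that is not
    present is replaced with an empty string (an assumption here).
    """
    return _HEADER_ESCAPE.sub(lambda m: headers.get(m.group(1), ""), template)


if __name__ == "__main__":
    template = "hdfs://namenode/flume/syslog/server=%{host}/facility=%{Facility}"
    # Sample header values are made up for this example.
    print(expand_path(template, {"host": "web01", "Facility": "3"}))
```

Each distinct host/Facility pair maps to its own key=value directory, which is the layout Hive expects for partitions.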
