Roshan,

I am not a Hive/HCatalog pro, but I am just wondering why an HCatalog sink, 
rather than a Hive Sink? Hive is definitely very popular and would be well 
appreciated if we could get a Hive Sink written. Since HCatalog is supposed to 
be compatible with the Hive metastore (right?), why not just implement a Hive 
sink and make it available to a larger community? I'd definitely like to see a 
Hive Sink, and would definitely prioritize that and then if explicitly 
required, add HCatalog support to that - this way it is useful to people who 
use Hive as well. In fact, there already is a Hive Sink jira 
here:https://issues.apache.org/jira/browse/FLUME-1008. 

I am a +1 for a Hive Sink, so please take a look.


Thanks,
Hari


-- 
Hari Shreedharan


On Monday, October 29, 2012 at 4:37 PM, Roshan Naik wrote:

> I am in the process of investigating the possibility of creating  a HCatalog 
> sink for Flume which should be able to handle such use cases. For your use 
> case it could be thought of as a Hive sink. Goal is basically as follows... 
> it would allow multiple flume agents to pump logs into a hive tables. That 
> would make the data query-able without additional manual steps. Data will get 
> added periodically in the form of new partitions to Hive. You would not have 
> to deal with temporary files or manual addition of data into hive.  
> 
> -roshan
> 
> 
> 
> On Sun, Oct 28, 2012 at 5:45 PM, Ralph Goers <[email protected] 
> (mailto:[email protected])> wrote:
> > Since you ask...
> > 
> > In our environment our primary concern is audit logs - have have to audit 
> > banking transactions as well as changes administrators make. We have a 
> > legacy system that needed to be integrated that had records in a form 
> > different than what we want stored. We also need to allow administrators to 
> > view events as close to real time as possible. Plus we have to aggregate 
> > data across 2 data centers. Although we are currently not including web 
> > server access logs we plan to integrate them in over time.  We also have 
> > requirements from our security team to pass events for their use to 
> > ArcSight. 
> > 
> > 1. We have a "log extractor" that receives legacy events as they occur and 
> > converts them into our new format and passes them to Flume. All new 
> > applications use the Log4j 2 Flume Appender to get data to Flume. 
> > 2. Flume passes the data to ArcSight for our security team's use.
> > 3. We wrote a Flume to Cassandra Sink.
> > 4. We wrote our own REST query services to retrieve the data from Cassandra.
> > 5. Since we are using DataStax Enterprise version of Cassandra we have also 
> > set up "Analytic" nodes that run Hadoop on top of Cassandra. This allows 
> > the data to be accessed via normal Hadoop tools for data analytics.
> > 6. We have written our own reporting UI component in our Administrative 
> > Platform to allow administrators to view activities in real time or to 
> > schedule background data collection so users can post process the data on 
> > their own.
> > 
> > We do not have anything to allow an admin to "tail" the log but it wouldn't 
> > be hard at all to write an application to accept Flume events via Avro and 
> > display the last "n" events as they arrive. 
> > 
> > One thing I should point out. We format our events in accordance with RFC 
> > 5424 and store that in the Flume event body. We then store all our 
> > individual pieces of audit event data in Flume headers fields.  The RFC 
> > 5424 message is what we send to ArcSight. The event fields and the 
> > compressed body are all stored in individual columns in Cassandra. 
> > 
> > Ralph
> > 
> > 
> > On Oct 26, 2012, at 2:06 PM, Ron Thielen wrote:
> > > I am exactly where you are with this, except for the problem of my not 
> > > having had time to write a serializer to address the Hostname Timestamp 
> > > issue.  Questions about the use of Flume in this manner seem to recur on 
> > > a regular basis, so it seems a common use case. 
> > >  
> > > Sorry I cannot offer a solution since I am in your shoes at the moment, 
> > > unfortunately looking at storing logs twice.
> > >  
> > > Ron Thielen
> > >  
> > > 
> > > <image001.jpg>
> > >  
> > > From: Josh West [mailto:[email protected]] 
> > > Sent: Friday, October 26, 2012 9:05 AM
> > > To: [email protected] (mailto:[email protected])
> > > Subject: Syslog Infrastructure with Flume 
> > >  
> > > Hey folks,
> > > 
> > > I've been experimenting with Flume for a few weeks now, trying to 
> > > determine an approach to designing a reliable, highly available, scalable 
> > > system to store logs from various sources, including syslog.  Ideally, 
> > > this system will meet the following requirements: 
> > > Logs from syslog across all servers make their way into HDFS.
> > > Logs are stored in HDFS in a manner that is available for post-processing:
> > > Example:  HIVE partitions - with HDFS Flume Sink, can set hdfs.path to 
> > > hdfs://namenode/flume/syslog/server=%{host}/facility=%{Facility}
> > > Example:  Custom map reduce jobs...
> > > 
> > > 
> > > Logs are stored in HDFS in a manner that is available for "reading" by 
> > > sysadmins:
> > > During troubleshooting/firefighting, it is quite helpful to be able to 
> > > login to a central logging system and tail -f / grep logs.
> > > We need to be able to see the logs "live".
> > > 
> > > 
> > > 
> > > 
> > > Some folks may be wondering why are we choosing Flume for syslog, instead 
> > > of something like Graylog2 or Logstash?  The answer is we will be using 
> > > Flume + Hadoop for the transport and processing of other types of data in 
> > > addition to syslog.  For example, webserver access logs for post 
> > > processing and statistical analysis.  So, we would like to make the most 
> > > use of the Hadoop cluster, keeping all logs of all types in one 
> > > redundant/scalable solution.  Additionally, by keeping both syslog and 
> > > webserver access logs in Hadoop/HDFS, we can begin to correlate events.
> > > 
> > > 
> > > I've run into some snags while attempting to implement Flume in a manner 
> > > that satisfies the requirements listed in the top of this message:
> > > 
> > > Logs to HDFS:
> > > I can indeed use the Flume HDFS Sink to reliably write logs into HDFS.
> > > Needed to write custom serializer to add Hostname and Timestamp fields 
> > > back to syslog messages.
> > > See:  https://issues.apache.org/jira/browse/FLUME-1666
> > > 
> > > 
> > > Logs to HDFS in manner available for reading/firefighting/troubleshooting 
> > > by sysadmins:
> > > Flume HDFS Sink uses the BucketWriter for recording flume events to HDFS.
> > > Creates data files like:  
> > > /flume/syslog/server=%{host}/facility=%{Facility}/FlumeData.1350997160213
> > > Each file is format of FlumeData (or custom prefix) followed by . 
> > > followed by unix timestamp of when the file was created.
> > > This is somewhat necessary... As you could have multiple Flume writers, 
> > > writing to the same HDFS, the files cannot be opened by more than one 
> > > writer.  So each writer should write to its own file.
> > > 
> > > 
> > > Latest file, currently being written to, is suffixed with ".tmp".
> > > This approach is not very sysadmin-friendly....
> > > You have to find the latest (ie. the .tmp files) and hadoop fs -tail -f 
> > > /path/to/file.tmp
> > > Hadoop's fs -tail -f command first prints the entire file's contents, 
> > > then begins tailing.
> > > 
> > > 
> > > 
> > > 
> > > 
> > > 
> > > So the sum of it all is Flume is awesome for getting syslog (and other) 
> > > data into HDFS for post processing, but not the best at getting it into 
> > > HDFS in a sysadmin troubleshooting/firefighting format.  In an ideal 
> > > world, I have syslog data coming into Flume via one transport (i.e. 
> > > SyslogTcp Source or SyslogUDP Source) and being written into HDFS in a 
> > > manner that is both post-processable and sysadmin-friendly, but it looks 
> > > like this isn't going to happen.
> > > 
> > > 
> > > I've thus investigated some alternative approaches to meet the 
> > > requirements.  One of these approaches is to have all of my servers send 
> > > their syslog messages to a central box running rsyslog.  Then, rsyslog 
> > > would perform one of the following actions:
> > > 
> > > Write logs to HDFS directly using 'omhdfs' module, in a format that is 
> > > both post-processable and sysadmin-friendly :-)
> > > Write logs to HDFS directly using 'hadoop-fuse-dfs' utility, which has 
> > > HDFS mounted as a filesystem.
> > > Write logs to a local filesystem and also replicate logs into a flume 
> > > agent, configured with a SyslogSource and HDFS sink.
> > > 
> > > 
> > > Option #1 sounds great.  But unfortunately the 'omhdfs' module for 
> > > rsyslog isn't working very well.  I've gotten it to login to Hadoop/HDFS 
> > > but it has issues creating/appending files.  Additionally, templating is 
> > > somewhat suspect (ie. making directories /syslog/someserver/somefacility 
> > > dynamically).
> > > 
> > > 
> > > Option #2 sounds reasonable, but either the HDFS FUSE module doesn't 
> > > support append mode (yet) or rsyslog is trying to create/open the files 
> > > in a manner not compliant with HDFS.  No surprise, as we all know HDFS 
> > > can be somewhat "special" at times ;-)  It's actually no matter 
> > > anyways... Trying to "tail -f" a file mounted via HDFS FUSE is rather 
> > > useless.  The data is only and finally fed to the tail command once a 
> > > full 64MB (or whatever you use) block size of data has been written to 
> > > the file.  One would only be able to use "hadoop fs -tail -f 
> > > /path/to/log" which has its own issues mentioned previously.
> > > 
> > > 
> > > Option #3 would definitely work.  However, now I'm storing my logs twice. 
> > >  Once on some local filesystem and another time in HDFS.  It works but 
> > > its not ideal as it's a waste of space.  And you've probably noticed from 
> > > this email so far, I'd prefer the ideal solution :-)
> > > 
> > > 
> > > Note:  Astute flumers would probably look at option #3 and recommend 
> > > making use of the RollingFileSink in addition to the HDFSSink.  
> > > Unfortunately, the RollingFileSink doesn't support templated/dynamic 
> > > directory creation like the HDFSSink with its hdfs.path setting of 
> > > "hdfs://namenode/flume/syslog/server=%{host}/facility=%{Facility}".
> > > 
> > > 
> > > So what exactly am I asking here?  Well, I'd like to know first how 
> > > others are doing this.  A hybrid of rsyslog and Flume?  All and only 
> > > Flume?  With custom serializers/interceptors/sinks?  Or perhaps... how 
> > > would you recommend I handle this?
> > > 
> > > 
> > > Thanks for any and all thoughts you can provide.
> > > 
> > > 
> > >  
> > > 
> > > -- 
> > > Josh West
> > > Lead Systems Administrator
> > > One.com (http://One.com), [email protected] (mailto:[email protected])
> > > 
> > > 
> > > 
> > > <Ronald J  Thielen.vcf>
> > > 
> > 
> > 
> > 
> 

Reply via email to