Thanks Roshan. I understand that it makes it easier for us to use HCatalog - 
but I am not sure what percentage of Hive users actually use HCat. If we simply 
use Hive directly, we would be able to address a larger community - which I 
would definitely like (though I don't know how feasible it is). I think it 
might be better to use Hive directly at least to make it more useful to a 
larger community.  


Hari 

-- 
Hari Shreedharan


On Wednesday, October 31, 2012 at 1:31 PM, Roshan Naik wrote:

> Hari,
>   Indeed, from the end user's point of view both would accomplish roughly the 
> same. From the implementation standpoint, however, the Hive sink would have 
> to deal with HDFS (for data transfer) and the Hive metastore separately to get 
> the job done. The HCat sink implementation, on the other hand, would 
> accomplish the same using just the HCat APIs (for both aspects). The HCat 
> sink implementation would be much simpler and cleaner, as it won't have to 
> reinvent things (even if we reuse code from the HDFS sink). The grunt work of 
> moving data and "transactionally committing" it into partitions is handled 
> by the HCat APIs.
> -roshan
> 
> 
> 
> On Wed, Oct 31, 2012 at 12:22 PM, Hari Shreedharan <[email protected] 
> (mailto:[email protected])> wrote:
> > Roshan,
> > 
> > I am not a Hive/HCatalog pro, but I am just wondering why an HCatalog sink, 
> > rather than a Hive Sink? Hive is definitely very popular and would be well 
> > appreciated if we could get a Hive Sink written. Since HCatalog is supposed 
> > to be compatible with the Hive metastore (right?), why not just implement a 
> > Hive sink and make it available to a larger community? I'd definitely like 
> > to see a Hive Sink, and would definitely prioritize that and then if 
> > explicitly required, add HCatalog support to that - this way it is useful 
> > to people who use Hive as well. In fact, there already is a Hive Sink jira 
> > > here: https://issues.apache.org/jira/browse/FLUME-1008.
> > 
> > I am a +1 for a Hive Sink, so please take a look.
> > 
> > 
> > Thanks,
> > Hari
> > 
> > 
> > -- 
> > Hari Shreedharan
> > 
> > 
> > On Monday, October 29, 2012 at 4:37 PM, Roshan Naik wrote:
> > 
> > > > I am in the process of investigating the possibility of creating an 
> > > > HCatalog sink for Flume which should be able to handle such use cases. 
> > > > For your use case it could be thought of as a Hive sink. The goal is 
> > > > basically as follows: it would allow multiple Flume agents to pump logs 
> > > > into Hive tables. That would make the data queryable without 
> > > > additional manual steps. Data will get added periodically in the form of 
> > > > new partitions to Hive. You would not have to deal with temporary files 
> > > > or manual addition of data into Hive.
> > > 
> > > -roshan
> > > 
> > > 
> > > 
> > > On Sun, Oct 28, 2012 at 5:45 PM, Ralph Goers <[email protected] 
> > > (mailto:[email protected])> wrote:
> > > > Since you ask...
> > > > 
> > > > > In our environment our primary concern is audit logs - we have to 
> > > > audit banking transactions as well as changes administrators make. We 
> > > > have a legacy system that needed to be integrated that had records in a 
> > > > form different than what we want stored. We also need to allow 
> > > > administrators to view events as close to real time as possible. Plus 
> > > > we have to aggregate data across 2 data centers. Although we are 
> > > > currently not including web server access logs we plan to integrate 
> > > > them in over time.  We also have requirements from our security team to 
> > > > pass events for their use to ArcSight. 
> > > > 
> > > > 1. We have a "log extractor" that receives legacy events as they occur 
> > > > and converts them into our new format and passes them to Flume. All new 
> > > > applications use the Log4j 2 Flume Appender to get data to Flume. 
> > > > 2. Flume passes the data to ArcSight for our security team's use.
> > > > 3. We wrote a Flume to Cassandra Sink.
> > > > 4. We wrote our own REST query services to retrieve the data from 
> > > > Cassandra.
> > > > 5. Since we are using DataStax Enterprise version of Cassandra we have 
> > > > also set up "Analytic" nodes that run Hadoop on top of Cassandra. This 
> > > > allows the data to be accessed via normal Hadoop tools for data 
> > > > analytics.
> > > > 6. We have written our own reporting UI component in our Administrative 
> > > > Platform to allow administrators to view activities in real time or to 
> > > > schedule background data collection so users can post process the data 
> > > > on their own.
> > > > 
> > > > We do not have anything to allow an admin to "tail" the log but it 
> > > > wouldn't be hard at all to write an application to accept Flume events 
> > > > via Avro and display the last "n" events as they arrive. 
> > > > 
> > > > One thing I should point out. We format our events in accordance with 
> > > > RFC 5424 and store that in the Flume event body. We then store all our 
> > > > > individual pieces of audit event data in Flume header fields.  The RFC 
> > > > 5424 message is what we send to ArcSight. The event fields and the 
> > > > compressed body are all stored in individual columns in Cassandra. 
> > > > 
> > > > Ralph
> > > > 
> > > > 
> > > > On Oct 26, 2012, at 2:06 PM, Ron Thielen wrote:
> > > > > I am exactly where you are with this, except for the problem of my 
> > > > > not having had time to write a serializer to address the Hostname 
> > > > > Timestamp issue.  Questions about the use of Flume in this manner 
> > > > > seem to recur on a regular basis, so it seems a common use case. 
> > > > >  
> > > > > Sorry I cannot offer a solution since I am in your shoes at the 
> > > > > moment, unfortunately looking at storing logs twice.
> > > > >  
> > > > > Ron Thielen
> > > > >  
> > > > > 
> > > > >  
> > > > > From: Josh West [mailto:[email protected]] 
> > > > > Sent: Friday, October 26, 2012 9:05 AM
> > > > > To: [email protected] (mailto:[email protected])
> > > > > Subject: Syslog Infrastructure with Flume 
> > > > >  
> > > > > 
> > > > > Hey folks,
> > > > > 
> > > > > I've been experimenting with Flume for a few weeks now, trying to 
> > > > > determine an approach to designing a reliable, highly available, 
> > > > > scalable system to store logs from various sources, including syslog. 
> > > > >  Ideally, this system will meet the following requirements:
> > > > > 
> > > > > > - Logs from syslog across all servers make their way into HDFS.
> > > > > > - Logs are stored in HDFS in a manner that is available for 
> > > > > >   post-processing:
> > > > > >   - Example:  HIVE partitions - with HDFS Flume Sink, can set hdfs.path 
> > > > > >     to hdfs://namenode/flume/syslog/server=%{host}/facility=%{Facility}
> > > > > >   - Example:  Custom map reduce jobs...
> > > > > > - Logs are stored in HDFS in a manner that is available for "reading" 
> > > > > >   by sysadmins:
> > > > > >   - During troubleshooting/firefighting, it is quite helpful to be able 
> > > > > >     to login to a central logging system and tail -f / grep logs.
> > > > > >   - We need to be able to see the logs "live".
> > > > > > 
> > > > > Some folks may be wondering why are we choosing Flume for syslog, 
> > > > > instead of something like Graylog2 or Logstash?  The answer is we 
> > > > > will be using Flume + Hadoop for the transport and processing of 
> > > > > other types of data in addition to syslog.  For example, webserver 
> > > > > access logs for post processing and statistical analysis.  So, we 
> > > > > would like to make the most use of the Hadoop cluster, keeping all 
> > > > > logs of all types in one redundant/scalable solution.  Additionally, 
> > > > > by keeping both syslog and webserver access logs in Hadoop/HDFS, we 
> > > > > can begin to correlate events.
> > > > > 
> > > > > 
> > > > > I've run into some snags while attempting to implement Flume in a 
> > > > > > manner that satisfies the requirements listed at the top of this 
> > > > > message:
> > > > > 
> > > > > > - Logs to HDFS:
> > > > > >   - I can indeed use the Flume HDFS Sink to reliably write logs into 
> > > > > >     HDFS.
> > > > > >   - Needed to write custom serializer to add Hostname and Timestamp 
> > > > > >     fields back to syslog messages.
> > > > > >   - See:  https://issues.apache.org/jira/browse/FLUME-1666
> > > > > > - Logs to HDFS in manner available for 
> > > > > >   reading/firefighting/troubleshooting by sysadmins:
> > > > > >   - Flume HDFS Sink uses the BucketWriter for recording flume events to 
> > > > > >     HDFS.
> > > > > >   - Creates data files like:  
> > > > > >     /flume/syslog/server=%{host}/facility=%{Facility}/FlumeData.1350997160213
> > > > > >   - Each file is format of FlumeData (or custom prefix) followed by . 
> > > > > >     followed by unix timestamp of when the file was created.
> > > > > >     - This is somewhat necessary... As you could have multiple Flume 
> > > > > >       writers, writing to the same HDFS, the files cannot be opened by 
> > > > > >       more than one writer.  So each writer should write to its own file.
> > > > > >   - Latest file, currently being written to, is suffixed with ".tmp".
> > > > > >   - This approach is not very sysadmin-friendly....
> > > > > >     - You have to find the latest (ie. the .tmp files) and hadoop fs 
> > > > > >       -tail -f /path/to/file.tmp
> > > > > >     - Hadoop's fs -tail -f command first prints the entire file's 
> > > > > >       contents, then begins tailing.
> > > > > > 
> > > > > So the sum of it all is Flume is awesome for getting syslog (and 
> > > > > other) data into HDFS for post processing, but not the best at 
> > > > > getting it into HDFS in a sysadmin troubleshooting/firefighting 
> > > > > format.  In an ideal world, I have syslog data coming into Flume via 
> > > > > one transport (i.e. SyslogTcp Source or SyslogUDP Source) and being 
> > > > > written into HDFS in a manner that is both post-processable and 
> > > > > sysadmin-friendly, but it looks like this isn't going to happen.
> > > > > 
> > > > > 
> > > > > I've thus investigated some alternative approaches to meet the 
> > > > > requirements.  One of these approaches is to have all of my servers 
> > > > > send their syslog messages to a central box running rsyslog.  Then, 
> > > > > rsyslog would perform one of the following actions:
> > > > > 
> > > > > > 1. Write logs to HDFS directly using 'omhdfs' module, in a format that 
> > > > > >    is both post-processable and sysadmin-friendly :-)
> > > > > > 2. Write logs to HDFS directly using 'hadoop-fuse-dfs' utility, which 
> > > > > >    has HDFS mounted as a filesystem.
> > > > > > 3. Write logs to a local filesystem and also replicate logs into a flume 
> > > > > >    agent, configured with a SyslogSource and HDFS sink.
> > > > > 
> > > > > 
> > > > > Option #1 sounds great.  But unfortunately the 'omhdfs' module for 
> > > > > rsyslog isn't working very well.  I've gotten it to login to 
> > > > > Hadoop/HDFS but it has issues creating/appending files.  
> > > > > Additionally, templating is somewhat suspect (ie. making directories 
> > > > > /syslog/someserver/somefacility dynamically).
> > > > > 
> > > > > 
> > > > > Option #2 sounds reasonable, but either the HDFS FUSE module doesn't 
> > > > > support append mode (yet) or rsyslog is trying to create/open the 
> > > > > files in a manner not compliant with HDFS.  No surprise, as we all 
> > > > > know HDFS can be somewhat "special" at times ;-)  It's actually no 
> > > > > matter anyways... Trying to "tail -f" a file mounted via HDFS FUSE is 
> > > > > rather useless.  The data is only and finally fed to the tail command 
> > > > > once a full 64MB (or whatever you use) block size of data has been 
> > > > > written to the file.  One would only be able to use "hadoop fs -tail 
> > > > > -f /path/to/log" which has its own issues mentioned previously.
> > > > > 
> > > > > 
> > > > > Option #3 would definitely work.  However, now I'm storing my logs 
> > > > > twice.  Once on some local filesystem and another time in HDFS.  It 
> > > > > > works but it's not ideal as it's a waste of space.  And you've 
> > > > > probably noticed from this email so far, I'd prefer the ideal 
> > > > > solution :-)
> > > > > 
> > > > > 
> > > > > Note:  Astute flumers would probably look at option #3 and recommend 
> > > > > making use of the RollingFileSink in addition to the HDFSSink.  
> > > > > Unfortunately, the RollingFileSink doesn't support templated/dynamic 
> > > > > directory creation like the HDFSSink with its hdfs.path setting of 
> > > > > "hdfs://namenode/flume/syslog/server=%{host}/facility=%{Facility}".
> > > > > 
> > > > > 
> > > > > So what exactly am I asking here?  Well, I'd like to know first how 
> > > > > others are doing this.  A hybrid of rsyslog and Flume?  All and only 
> > > > > Flume?  With custom serializers/interceptors/sinks?  Or perhaps... 
> > > > > how would you recommend I handle this?
> > > > > 
> > > > > 
> > > > > Thanks for any and all thoughts you can provide.
> > > > > 
> > > > > 
> > > > >  
> > > > > 
> > > > > -- 
> > > > > Josh West
> > > > > Lead Systems Administrator
> > > > > One.com (http://One.com), [email protected] (mailto:[email protected])
> > > > > 
> > > > > 
> > > > > 
> > > > > 
> > > > 
> > > > 
> > > > 
> > > 
> > 
> 
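
For readers following the thread: Josh's hdfs.path value relies on the HDFS sink substituting event header values for the %{host} and %{Facility} escapes. The Python sketch below only illustrates that substitution and is not Flume's code. The helper name expand_header_escapes, the sample header values, and the choice to expand a missing header to an empty string are all assumptions made for illustration.

```python
import re

# Matches %{name} escapes such as %{host} or %{Facility}.
_HEADER_ESCAPE = re.compile(r"%\{([^}]+)\}")


def expand_header_escapes(template, headers):
    """Replace each %{name} in template with the value of headers[name].

    Hypothetical helper for illustration only. A missing header expands to
    an empty string here, which is an assumption of this sketch.
    """
    return _HEADER_ESCAPE.sub(lambda m: headers.get(m.group(1), ""), template)


template = "hdfs://namenode/flume/syslog/server=%{host}/facility=%{Facility}"
headers = {"host": "web01", "Facility": "3"}  # sample values, not from the thread
print(expand_header_escapes(template, headers))
# prints hdfs://namenode/flume/syslog/server=web01/facility=3
```

Each distinct combination of header values yields its own directory, which is what makes the layout usable as Hive partitions.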

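
Ralph mentions that an application showing the last "n" events as they arrive would not be hard to write. The sketch below covers only the buffering half of that idea, in Python, using a bounded deque. Receiving events from Flume over Avro is deliberately left out, and the LastEvents class and the sample event strings are hypothetical.

```python
from collections import deque


class LastEvents:
    """Keep only the most recent n event bodies (hypothetical helper)."""

    def __init__(self, n):
        # A deque with maxlen drops the oldest entry once it is full.
        self._events = deque(maxlen=n)

    def add(self, body):
        self._events.append(body)

    def snapshot(self):
        # Oldest first, newest last.
        return list(self._events)


tail = LastEvents(3)
for i in range(1, 6):
    tail.add("event %d" % i)  # stand-in for an event body received from Flume
print(tail.snapshot())
# prints ['event 3', 'event 4', 'event 5']
```

A real viewer would call add() from whatever receives the events and redraw from snapshot().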