Thanks Roshan. I understand that it makes it easier for us to use HCatalog - but I am not sure what percentage of Hive users actually use HCat. If we simply use Hive directly, we would be able to address a larger community - which I would definitely like (thought I don't know how feasible it is). I think it might be better to use Hive directly at least to make it more useful to a larger community.
Hari -- Hari Shreedharan On Wednesday, October 31, 2012 at 1:31 PM, Roshan Naik wrote: > Hari, > Indeed from the end user point of view.. both would accomplish roughly the > same. From the implementation standpoint, however, the Hive sink would have > to deal with HDFS (for data transfer) and Hive metastore separately to get > the job done. The HCat sink implementation, on the other hand, would > accomplish the same using just the HCat apis (for both aspects). The HCat > sink implementation would be much simpler and cleaner as it wont have to > reinvent things (even if we reuse code form HDFS sink). The grunt work of of > moving data and "transactionally committing" them into partitions is handled > by HCat apis. > -roshan > > > > On Wed, Oct 31, 2012 at 12:22 PM, Hari Shreedharan <[email protected] > (mailto:[email protected])> wrote: > > Roshan, > > > > I am not a Hive/HCatalog pro, but I am just wondering why an HCatalog sink, > > rather than a Hive Sink? Hive is definitely very popular and would be well > > appreciated if we could get a Hive Sink written. Since HCatalog is supposed > > to be compatible with the Hive metastore (right?), why not just implement a > > Hive sink and make it available to a larger community? I'd definitely like > > to see a Hive Sink, and would definitely prioritize that and then if > > explicitly required, add HCatalog support to that - this way it is useful > > to people who use Hive as well. In fact, there already is a Hive Sink jira > > here:https://issues.apache.org/jira/browse/FLUME-1008. > > > > I am a +1 for a Hive Sink, so please take a look. > > > > > > Thanks, > > Hari > > > > > > -- > > Hari Shreedharan > > > > > > On Monday, October 29, 2012 at 4:37 PM, Roshan Naik wrote: > > > > > I am in the process of investigating the possibility of creating a > > > HCatalog sink for Flume which should be able to handle such use cases. > > > For your use case it could be thought of as a Hive sink. Goal is > > > basically as follows... it would allow multiple flume agents to pump logs > > > into a hive tables. That would make the data query-able without > > > additional manual steps. Data will get added periodically in the form of > > > new partitions to Hive. You would not have to deal with temporary files > > > or manual addition of data into hive. > > > > > > -roshan > > > > > > > > > > > > On Sun, Oct 28, 2012 at 5:45 PM, Ralph Goers <[email protected] > > > (mailto:[email protected])> wrote: > > > > Since you ask... > > > > > > > > In our environment our primary concern is audit logs - have have to > > > > audit banking transactions as well as changes administrators make. We > > > > have a legacy system that needed to be integrated that had records in a > > > > form different than what we want stored. We also need to allow > > > > administrators to view events as close to real time as possible. Plus > > > > we have to aggregate data across 2 data centers. Although we are > > > > currently not including web server access logs we plan to integrate > > > > them in over time. We also have requirements from our security team to > > > > pass events for their use to ArcSight. > > > > > > > > 1. We have a "log extractor" that receives legacy events as they occur > > > > and converts them into our new format and passes them to Flume. All new > > > > applications use the Log4j 2 Flume Appender to get data to Flume. > > > > 2. Flume passes the data to ArcSight for our security team's use. > > > > 3. We wrote a Flume to Cassandra Sink. > > > > 4. We wrote our own REST query services to retrieve the data from > > > > Cassandra. > > > > 5. Since we are using DataStax Enterprise version of Cassandra we have > > > > also set up "Analytic" nodes that run Hadoop on top of Cassandra. This > > > > allows the data to be accessed via normal Hadoop tools for data > > > > analytics. > > > > 6. We have written our own reporting UI component in our Administrative > > > > Platform to allow administrators to view activities in real time or to > > > > schedule background data collection so users can post process the data > > > > on their own. > > > > > > > > We do not have anything to allow an admin to "tail" the log but it > > > > wouldn't be hard at all to write an application to accept Flume events > > > > via Avro and display the last "n" events as they arrive. > > > > > > > > One thing I should point out. We format our events in accordance with > > > > RFC 5424 and store that in the Flume event body. We then store all our > > > > individual pieces of audit event data in Flume headers fields. The RFC > > > > 5424 message is what we send to ArcSight. The event fields and the > > > > compressed body are all stored in individual columns in Cassandra. > > > > > > > > Ralph > > > > > > > > > > > > On Oct 26, 2012, at 2:06 PM, Ron Thielen wrote: > > > > > I am exactly where you are with this, except for the problem of my > > > > > not having had time to write a serializer to address the Hostname > > > > > Timestamp issue. Questions about the use of Flume in this manner > > > > > seem to recur on a regular basis, so it seems a common use case. > > > > > > > > > > Sorry I cannot offer a solution since I am in your shoes at the > > > > > moment, unfortunately looking at storing logs twice. > > > > > > > > > > Ron Thielen > > > > > > > > > > > > > > > <image001.jpg> > > > > > > > > > > From: Josh West [mailto:[email protected]] > > > > > Sent: Friday, October 26, 2012 9:05 AM > > > > > To: [email protected] (mailto:[email protected]) > > > > > Subject: Syslog Infrastructure with Flume > > > > > > > > > > > > > > > Hey folks, > > > > > > > > > > I've been experimenting with Flume for a few weeks now, trying to > > > > > determine an approach to designing a reliable, highly available, > > > > > scalable system to store logs from various sources, including syslog. > > > > > Ideally, this system will meet the following requirements: > > > > > > > > > > Logs from syslog across all servers make their way into HDFS. > > > > > Logs are stored in HDFS in a manner that is available for > > > > > post-processing: > > > > > Example: HIVE partitions - with HDFS Flume Sink, can set hdfs.path > > > > > to hdfs://namenode/flume/syslog/server=%{host}/facility=%{Facility} > > > > > Example: Custom map reduce jobs... > > > > > > > > > > > > > > > Logs are stored in HDFS in a manner that is available for "reading" > > > > > by sysadmins: > > > > > During troubleshooting/firefighting, it is quite helpful to be able > > > > > to login to a central logging system and tail -f / grep logs. > > > > > We need to be able to see the logs "live". > > > > > > > > > > > > > > > > > > > > > > > > > Some folks may be wondering why are we choosing Flume for syslog, > > > > > instead of something like Graylog2 or Logstash? The answer is we > > > > > will be using Flume + Hadoop for the transport and processing of > > > > > other types of data in addition to syslog. For example, webserver > > > > > access logs for post processing and statistical analysis. So, we > > > > > would like to make the most use of the Hadoop cluster, keeping all > > > > > logs of all types in one redundant/scalable solution. Additionally, > > > > > by keeping both syslog and webserver access logs in Hadoop/HDFS, we > > > > > can begin to correlate events. > > > > > > > > > > > > > > > I've run into some snags while attempting to implement Flume in a > > > > > manner that satisfies the requirements listed in the top of this > > > > > message: > > > > > > > > > > Logs to HDFS: > > > > > I can indeed use the Flume HDFS Sink to reliably write logs into HDFS. > > > > > Needed to write custom serializer to add Hostname and Timestamp > > > > > fields back to syslog messages. > > > > > See: https://issues.apache.org/jira/browse/FLUME-1666 > > > > > > > > > > > > > > > Logs to HDFS in manner available for > > > > > reading/firefighting/troubleshooting by sysadmins: > > > > > Flume HDFS Sink uses the BucketWriter for recording flume events to > > > > > HDFS. > > > > > Creates data files like: > > > > > /flume/syslog/server=%{host}/facility=%{Facility}/FlumeData.1350997160213 > > > > > Each file is format of FlumeData (or custom prefix) followed by . > > > > > followed by unix timestamp of when the file was created. > > > > > This is somewhat necessary... As you could have multiple Flume > > > > > writers, writing to the same HDFS, the files cannot be opened by more > > > > > than one writer. So each writer should write to its own file. > > > > > > > > > > > > > > > Latest file, currently being written to, is suffixed with ".tmp". > > > > > This approach is not very sysadmin-friendly.... > > > > > You have to find the latest (ie. the .tmp files) and hadoop fs -tail > > > > > -f /path/to/file.tmp > > > > > Hadoop's fs -tail -f command first prints the entire file's contents, > > > > > then begins tailing. > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > So the sum of it all is Flume is awesome for getting syslog (and > > > > > other) data into HDFS for post processing, but not the best at > > > > > getting it into HDFS in a sysadmin troubleshooting/firefighting > > > > > format. In an ideal world, I have syslog data coming into Flume via > > > > > one transport (i.e. SyslogTcp Source or SyslogUDP Source) and being > > > > > written into HDFS in a manner that is both post-processable and > > > > > sysadmin-friendly, but it looks like this isn't going to happen. > > > > > > > > > > > > > > > I've thus investigated some alternative approaches to meet the > > > > > requirements. One of these approaches is to have all of my servers > > > > > send their syslog messages to a central box running rsyslog. Then, > > > > > rsyslog would perform one of the following actions: > > > > > > > > > > Write logs to HDFS directly using 'omhdfs' module, in a format that > > > > > is both post-processable and sysadmin-friendly :-) > > > > > Write logs to HDFS directly using 'hadoop-fuse-dfs' utility, which > > > > > has HDFS mounted as a filesystem. > > > > > Write logs to a local filesystem and also replicate logs into a flume > > > > > agent, configured with a SyslogSource and HDFS sink. > > > > > > > > > > > > > > > Option #1 sounds great. But unfortunately the 'omhdfs' module for > > > > > rsyslog isn't working very well. I've gotten it to login to > > > > > Hadoop/HDFS but it has issues creating/appending files. > > > > > Additionally, templating is somewhat suspect (ie. making directories > > > > > /syslog/someserver/somefacility dynamically). > > > > > > > > > > > > > > > Option #2 sounds reasonable, but either the HDFS FUSE module doesn't > > > > > support append mode (yet) or rsyslog is trying to create/open the > > > > > files in a manner not compliant with HDFS. No surprise, as we all > > > > > know HDFS can be somewhat "special" at times ;-) It's actually no > > > > > matter anyways... Trying to "tail -f" a file mounted via HDFS FUSE is > > > > > rather useless. The data is only and finally fed to the tail command > > > > > once a full 64MB (or whatever you use) block size of data has been > > > > > written to the file. One would only be able to use "hadoop fs -tail > > > > > -f /path/to/log" which has its own issues mentioned previously. > > > > > > > > > > > > > > > Option #3 would definitely work. However, now I'm storing my logs > > > > > twice. Once on some local filesystem and another time in HDFS. It > > > > > works but its not ideal as it's a waste of space. And you've > > > > > probably noticed from this email so far, I'd prefer the ideal > > > > > solution :-) > > > > > > > > > > > > > > > Note: Astute flumers would probably look at option #3 and recommend > > > > > making use of the RollingFileSink in addition to the HDFSSink. > > > > > Unfortunately, the RollingFileSink doesn't support templated/dynamic > > > > > directory creation like the HDFSSink with its hdfs.path setting of > > > > > "hdfs://namenode/flume/syslog/server=%{host}/facility=%{Facility}". > > > > > > > > > > > > > > > So what exactly am I asking here? Well, I'd like to know first how > > > > > others are doing this. A hybrid of rsyslog and Flume? All and only > > > > > Flume? With custom serializers/interceptors/sinks? Or perhaps... > > > > > how would you recommend I handle this? > > > > > > > > > > > > > > > Thanks for any and all thoughts you can provide. > > > > > > > > > > > > > > > > > > > > > > > > > -- > > > > > Josh West > > > > > Lead Systems Administrator > > > > > One.com (http://One.com), [email protected] (mailto:[email protected]) > > > > > > > > > > > > > > > > > > > > <Ronald J Thielen.vcf> > > > > > > > > > > > > > > > > > > > > > > >
