Bosco – Side topic: in the case of batches, what happens to the audits if Ranger Admin or HDFS is restarted before a batch is written? Are they lost? A customer raised this as a concern after we found some corrupt/incomplete JSON audits that corresponded with service restarts.
-- Sean Roberts @seano

From: Don Bosco Durai <bo...@apache.org>
Reply-To: "user@ranger.apache.org" <user@ranger.apache.org>
Date: Wednesday, October 11, 2017 at 10:04 AM
To: "user@ranger.apache.org" <user@ranger.apache.org>
Subject: Re: Ranger - HDFS Audit - Compression Support?

We could, and I think that is what Kevin is also suggesting. If we write ORC or another file format directly, then we have to see how to batch the audits. In the Audit V3 implementation, we did some optimization to avoid store (local write) and forward; instead, we build the batch in memory itself and do a bulk write (each Destination has different policies). But in the previous release, we did re-introduce an option to store and forward to HDFS due to an HDFS file closure issue. I personally don't know what would be a good batch size, but we can build on top of that code to write in the format we want, and make the output writing configurable to support different types.

Bosco

From: Sean Roberts <srobe...@hortonworks.com>
Reply-To: <user@ranger.apache.org>
Date: Wednesday, October 11, 2017 at 1:50 AM
To: "user@ranger.apache.org" <user@ranger.apache.org>
Subject: Re: Ranger - HDFS Audit - Compression Support?

I've been looking at the same. Even in small clusters, the size of Ranger audits is considerable. The files compress well, but compressed JSON will be difficult to query. Would engineering Ranger to write directly to ORC be reasonable?

-- Sean Roberts @seano

From: Don Bosco Durai <bo...@apache.org>
Reply-To: "user@ranger.apache.org" <user@ranger.apache.org>
Date: Wednesday, October 11, 2017 at 8:13 AM
To: "user@ranger.apache.org" <user@ranger.apache.org>
Subject: Re: Ranger - HDFS Audit - Compression Support?

Kevin, thanks for your interest. You are right: currently, one of the options is saving the audits in HDFS itself as JSON files in one folder per day. I have loaded these JSON files from the folder into Hive as compressed ORC format. The compressed ORC files were less than 10% of the original size, so it was a significant decrease in size. Also, it is easier to run analytics on the Hive tables.

So, there are a couple of ways of doing it:

1. Write an Oozie job which runs every night and loads the previous day's worth of audit logs into ORC or another format.
2. Write an AuditDestination which can write in the format you want.

Regardless of which approach you take, this would be a good feature for Ranger.

Thanks

Bosco

From: Kevin Risden <kris...@apache.org>
Reply-To: <user@ranger.apache.org>
Date: Tuesday, October 10, 2017 at 2:50 PM
To: <user@ranger.apache.org>
Subject: Ranger - HDFS Audit - Compression Support?

We have one cluster that is on track to generate 10TB of HDFS audit data in one year. For compliance reasons, we are required to keep this audit information and not delete it. We use Ranger and Solr to search the last N days for simple use cases (seeing why a user is denied access).

I've done some research and found that Ranger HDFS audits are:

* Stored as JSON objects (one per line)
* Not compressed

This is currently very verbose and would benefit from compression, since this data is not frequently accessed. I didn't find any compression references in the mailing list archive or in JIRA. Is there a reason for this? If there isn't any objection, what would be a good way to work towards adding compression support for Ranger HDFS audits?
Some ideas include sequence files (with built-in compression support) or even compressed Avro (which would make it easy to add a Hive table over the audit information).

Kevin Risden
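
For reference, a rough sketch of option 1 from Bosco's reply: a nightly job (which Oozie could schedule) that maps a day's JSON audit folder into Hive and rewrites it as compressed ORC. The connection URL, user, table and column names, SerDe class, and audit directory layout below are assumptions for illustration, not the exact Ranger audit schema; the JsonSerDe also requires the hive-hcatalog-core jar on the Hive classpath.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class AuditJsonToOrcJob {
    public static void main(String[] args) throws Exception {
        // Previous day's audit folder, e.g. 20171010 (folder layout is an assumption).
        String day = args.length > 0 ? args[0] : "20171010";

        // Hive JDBC driver; auto-registered if the jar is on the classpath.
        Class.forName("org.apache.hive.jdbc.HiveDriver");

        try (Connection conn = DriverManager.getConnection(
                 "jdbc:hive2://hiveserver2:10000/default", "ranger_etl", "");
             Statement stmt = conn.createStatement()) {

            // External table over the raw audit files (one JSON object per line).
            // Column list is a simplified subset of the audit fields.
            stmt.execute(
                "CREATE EXTERNAL TABLE IF NOT EXISTS ranger_audit_json ("
                + " evtTime STRING, reqUser STRING, access STRING,"
                + " resource STRING, result INT, policy BIGINT)"
                + " ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe'"
                + " LOCATION '/ranger/audit/hdfs/" + day + "'");

            // ORC table with built-in compression; this is where the roughly
            // 10x size reduction mentioned in the thread comes from.
            stmt.execute(
                "CREATE TABLE IF NOT EXISTS ranger_audit_orc ("
                + " evtTime STRING, reqUser STRING, access STRING,"
                + " resource STRING, result INT, policy BIGINT)"
                + " STORED AS ORC TBLPROPERTIES ('orc.compress'='ZLIB')");

            // Copy the day's JSON audits into the ORC table, then drop the mapping.
            stmt.execute("INSERT INTO TABLE ranger_audit_orc SELECT * FROM ranger_audit_json");
            stmt.execute("DROP TABLE ranger_audit_json");
        }
    }
}
```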
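
And a sketch of option 2: a custom destination plugged into the audit framework that writes batches in a compressed format directly, building on the in-memory batching described above. The package, class name, property name, and override signatures here are assumptions; the AuditDestination/BaseAuditHandler contract varies across Ranger versions, so the existing HDFSAuditDestination in the agents-audit module is the reference to follow.

```java
package org.example.ranger.audit; // hypothetical package, for illustration only

import java.util.Collection;
import java.util.Properties;

import org.apache.ranger.audit.destination.AuditDestination;
import org.apache.ranger.audit.model.AuditEventBase;

/**
 * Sketch of an audit destination that writes batched events in a columnar,
 * compressed format (ORC/Avro/SequenceFile) instead of line-per-JSON files.
 * Method signatures follow the general shape of the existing destinations but
 * are assumptions -- verify against the Ranger version in use.
 */
public class CompressedFormatAuditDestination extends AuditDestination {

    private String outputDir;

    @Override
    public void init(Properties props, String propPrefix) {
        super.init(props, propPrefix);
        // Illustrative property, e.g. xasecure.audit.destination.<name>.dir
        outputDir = props.getProperty(propPrefix + ".dir", "/ranger/audit/orc");
    }

    @Override
    public boolean log(Collection<AuditEventBase> events) {
        // The audit framework batches events in memory (Audit V3) and hands the
        // whole batch here, so this is the natural place to append it to a
        // compressed columnar writer rather than writing one JSON line per event.
        for (AuditEventBase event : events) {
            // map the event fields to columns and buffer the row for the writer
        }
        // roll/flush the output file based on batch size or time, then acknowledge
        return true;
    }

    @Override
    public void flush() {
        // sync or close the current file so a restart does not lose the batch
    }
}
```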