Please create the JIRA. I have added you as a contributor. You can assign it to yourself if you are planning to work on it.
Thanks
Bosco

From: Kevin Risden <compuwizard...@gmail.com>
Reply-To: <user@ranger.apache.org>
Date: Wednesday, October 11, 2017 at 6:50 AM
To: <user@ranger.apache.org>
Subject: Re: Ranger - HDFS Audit - Compression Support?

Any concern if I open a Ranger JIRA to track adding compression support for HDFS audits?

Kevin Risden

On Wed, Oct 11, 2017 at 4:35 AM, Don Bosco Durai <bo...@apache.org> wrote:

The audit framework has local file spooling. If the destination is not reachable or the write is slow, then it will start spooling to a local file, and a different thread reads from the spool file and starts writing to the destination. So, there is enough redundancy built in.

The audit is written by the Ranger plugin from within the component itself, so Ranger Admin is not involved. The Ranger audit framework has a process exit hook and tries to flush out the buffer before exiting. HDFS is tricky, particularly for the HDFS Ranger plugin. The same goes for the Ranger Solr plugin writing to Solr as the audit destination: the process itself is trying to shut down while the Ranger plugin is trying to flush to that same component. We also found another issue with the HDFS stream writer, where if the file is not closed before HDFS shuts down, then the audits are lost ☹ There is no real solution other than closing the HDFS files more regularly, or writing to a local file and copying it to HDFS. This is the reason the HDFS Audit Destination code was updated with another option to take care of this situation.

Each Audit Destination has its own in-memory buffer and also a separate spool file. If we have to do one more refactoring (e.g. V4), then I would suggest that we always write the JSON to the local file system and each Audit Destination instance keeps a marker/pointer to the file and line it is currently processing. This will give us reliability and also simplify the implementation. The writers can then push audits to the destination in real time (e.g. to a Kafka topic) or batch them and send them to the destination (e.g. Solr, HDFS, etc.).

Bosco

From: Sean Roberts <srobe...@hortonworks.com>
Reply-To: <user@ranger.apache.org>
Date: Wednesday, October 11, 2017 at 2:13 AM
To: "user@ranger.apache.org" <user@ranger.apache.org>
Subject: Re: Ranger - HDFS Audit - Compression Support?

Bosco – Side topic: in the case of batches, what happens to the audits if Ranger Admin or HDFS is restarted before a batch is written? Are they lost? A customer raised this as a concern after we found some corrupt/incomplete JSON audits which corresponded with service restarts.

--
Sean Roberts
@seano

From: Don Bosco Durai <bo...@apache.org>
Reply-To: "user@ranger.apache.org" <user@ranger.apache.org>
Date: Wednesday, October 11, 2017 at 10:04 AM
To: "user@ranger.apache.org" <user@ranger.apache.org>
Subject: Re: Ranger - HDFS Audit - Compression Support?

We could, and I think that is what Kevin is also suggesting. If we write as ORC or another file format directly, then we have to see how to batch the audits. In the Audit V3 implementation, we did some optimization to avoid store (local write) and forward; instead we build the batch in memory itself and do a bulk write (each destination has different policies). But in the previous release we did re-introduce an option to store and forward to HDFS due to the HDFS file closure issue. I personally don't know what would be a good batch size, but we can build on top of that code to write in the format we want, and make the output writer configurable to support different formats.

Bosco
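To make the marker/pointer idea from the 4:35 AM message above a little more concrete, here is a minimal, plain-Java sketch: a reader that consumes JSON audit lines from a local spool file, hands them out in batches, and only advances its line marker after the destination confirms the write. The class and method names (SpoolMarkerReader, nextBatch, commit) are illustrative only and are not part of the Ranger audit framework; a full implementation would also track which spool file the marker points to, not just the line.

import java.io.BufferedReader;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

/**
 * Illustrative sketch only (not Ranger code): reads JSON audit lines from a
 * local spool file and remembers how far it has gotten, so a destination can
 * resume after a restart without losing or duplicating events.
 */
public class SpoolMarkerReader {
    private final Path spoolFile;
    private long lastLineProcessed;   // marker: last line already delivered

    public SpoolMarkerReader(Path spoolFile, long lastLineProcessed) {
        this.spoolFile = spoolFile;
        this.lastLineProcessed = lastLineProcessed;
    }

    /** Read up to batchSize new JSON lines past the marker. */
    public List<String> nextBatch(int batchSize) throws IOException {
        List<String> batch = new ArrayList<>();
        try (BufferedReader reader = Files.newBufferedReader(spoolFile, StandardCharsets.UTF_8)) {
            long lineNo = 0;
            String line;
            while ((line = reader.readLine()) != null && batch.size() < batchSize) {
                lineNo++;
                if (lineNo <= lastLineProcessed) {
                    continue;   // already delivered to this destination
                }
                batch.add(line);
            }
        }
        return batch;
    }

    /** Advance the marker only after the destination confirms the write. */
    public void commit(int linesDelivered) {
        lastLineProcessed += linesDelivered;
    }
}

Keeping the commit separate from the read is what gives the reliability mentioned above: if the process dies between reading and writing, the marker has not moved, so the batch is simply re-read on restart.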
From: Sean Roberts <srobe...@hortonworks.com>
Reply-To: <user@ranger.apache.org>
Date: Wednesday, October 11, 2017 at 1:50 AM
To: "user@ranger.apache.org" <user@ranger.apache.org>
Subject: Re: Ranger - HDFS Audit - Compression Support?

I've been looking at the same thing. Even in small clusters the size of Ranger audits is considerable. The files compress well, but compressed JSON will be difficult to query. Would engineering Ranger to write directly to ORC be reasonable?

--
Sean Roberts
@seano

From: Don Bosco Durai <bo...@apache.org>
Reply-To: "user@ranger.apache.org" <user@ranger.apache.org>
Date: Wednesday, October 11, 2017 at 8:13 AM
To: "user@ranger.apache.org" <user@ranger.apache.org>
Subject: Re: Ranger - HDFS Audit - Compression Support?

Kevin, thanks for your interest. You are right, currently one of the options is saving the audits in HDFS itself as JSON files, in one folder per day. I have loaded these JSON files from the folder into Hive as compressed ORC format. The compressed ORC files were less than 10% of the original size, so it was a significant decrease in size. Also, it is easier to run analytics on the Hive tables.

So, there are a couple of ways of doing it:
* Write an Oozie job which runs every night and loads the previous day's worth of audit logs into ORC or another format.
* Write an AuditDestination which can write in the format you want.

Regardless of which approach you take, this would be a good feature for Ranger.

Thanks

Bosco

From: Kevin Risden <kris...@apache.org>
Reply-To: <user@ranger.apache.org>
Date: Tuesday, October 10, 2017 at 2:50 PM
To: <user@ranger.apache.org>
Subject: Ranger - HDFS Audit - Compression Support?

We have one cluster that is on track to generate 10TB of HDFS audit data in one year. For compliance reasons, we are required to keep this audit information and not delete it. We use Ranger and Solr to search the last N days of audits for simple use cases (seeing why a user is denied access).

I've done some research and found that Ranger HDFS audits are:
* Stored as JSON objects (one per line)
* Not compressed

This is currently very verbose and would benefit from compression, since this data is not frequently accessed. I didn't find any compression references on the mailing list archive or in JIRA. Is there a reason for this? If there isn't any objection, what would be a good way to work towards adding compression support for Ranger HDFS audits? Some ideas include sequence files (with built-in compression support) or even compressed Avro (which would make it easy to add a Hive table over the audit information).

Kevin Risden
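As a rough illustration of the sequence-file idea mentioned above (built-in compression while keeping one JSON event per record), here is a minimal sketch that repacks a day's worth of JSON audit lines into a block-compressed Hadoop SequenceFile using GzipCodec. The input and output paths and the class name are placeholders; a real integration would live inside an audit destination or a scheduled job rather than a standalone main().

import java.io.BufferedReader;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.compress.GzipCodec;

/**
 * Illustrative sketch only: repack one day's worth of Ranger HDFS audit
 * JSON lines into a block-compressed SequenceFile. Paths are placeholders.
 */
public class AuditJsonToSequenceFile {
    public static void main(String[] args) throws IOException {
        Configuration conf = new Configuration();
        Path output = new Path("/ranger/audit/hdfs/20171010.seq");

        try (SequenceFile.Writer writer = SequenceFile.createWriter(conf,
                SequenceFile.Writer.file(output),
                SequenceFile.Writer.keyClass(LongWritable.class),
                SequenceFile.Writer.valueClass(Text.class),
                SequenceFile.Writer.compression(SequenceFile.CompressionType.BLOCK, new GzipCodec()));
             BufferedReader reader = Files.newBufferedReader(
                Paths.get("/tmp/ranger-audit-20171010.log"), StandardCharsets.UTF_8)) {

            LongWritable key = new LongWritable();
            Text value = new Text();
            String line;
            long lineNo = 0;
            while ((line = reader.readLine()) != null) {
                key.set(++lineNo);        // key: line number, value: raw JSON event
                value.set(line);
                writer.append(key, value);
            }
        }
    }
}

Block compression keeps the output splittable for downstream MapReduce or Hive jobs, and the values remain the original JSON events, so existing tooling that parses the audit JSON still works after decompression.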