Bosco – Side topic: In the case of batches, what happens to the audits if 
Ranger Admin or HDFS is restarted before a batch is written? Are they lost? A 
customer raised this as a concern after we found some corrupt/incomplete JSON 
audits that corresponded with service restarts.

--
Sean Roberts
@seano


From: Don Bosco Durai <bo...@apache.org>
Reply-To: "user@ranger.apache.org" <user@ranger.apache.org>
Date: Wednesday, October 11, 2017 at 10:04 AM
To: "user@ranger.apache.org" <user@ranger.apache.org>
Subject: Re: Ranger - HDFS Audit - Compression Support?

We could, and I think that is what Kevin is also suggesting.

If we write directly in ORC or another file format, then we have to work out 
how to batch the audits. In the Audit V3 implementation, we did some 
optimization to avoid store (local write) and forward: instead, we build the 
batch in memory itself and do a bulk write (each destination has different 
policies). But in the previous release, we re-introduced an option to store 
and forward to HDFS because of an HDFS file closure issue.

I personally don't know what would be a good batch size, but we can build on 
top of that code to write in the format we want, and make the output writer 
configurable to support different types.
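
To make the idea concrete, here is a very rough sketch of what in-memory 
batching with a pluggable output writer could look like. The class names 
(AuditBatcher, AuditFormatWriter) and the batch-size threshold are purely 
illustrative and are not the actual Ranger audit framework API:

    // Hypothetical sketch: buffer audit events in memory and bulk-write them
    // through a writer chosen by configuration (JSON, ORC, Avro, ...).
    import java.io.IOException;
    import java.util.ArrayList;
    import java.util.List;

    interface AuditFormatWriter {
        void writeBatch(List<String> events) throws IOException;
    }

    class AuditBatcher {
        private final List<String> buffer = new ArrayList<>();
        private final int maxBatchSize;          // flush threshold; the right value is an open question
        private final AuditFormatWriter writer;  // selected from configuration at startup

        AuditBatcher(int maxBatchSize, AuditFormatWriter writer) {
            this.maxBatchSize = maxBatchSize;
            this.writer = writer;
        }

        // Buffer in memory instead of store (local write) and forward.
        synchronized void log(String auditEvent) throws IOException {
            buffer.add(auditEvent);
            if (buffer.size() >= maxBatchSize) {
                flush();
            }
        }

        // Bulk write: one call to the destination per batch. Note that anything
        // still buffered here is lost if the process dies before flush().
        synchronized void flush() throws IOException {
            if (!buffer.isEmpty()) {
                writer.writeBatch(new ArrayList<>(buffer));
                buffer.clear();
            }
        }
    }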

Bosco





From: Sean Roberts <srobe...@hortonworks.com>
Reply-To: <user@ranger.apache.org>
Date: Wednesday, October 11, 2017 at 1:50 AM
To: "user@ranger.apache.org" <user@ranger.apache.org>
Subject: Re: Ranger - HDFS Audit - Compression Support?

I've been looking at the same. Even in small clusters, the size of Ranger 
audits is considerable. The files compress well, but compressed JSON will be 
difficult to query.

Would engineering Ranger to write directly to ORC be reasonable?

--
Sean Roberts
@seano


From: Don Bosco Durai <bo...@apache.org>
Reply-To: "user@ranger.apache.org" <user@ranger.apache.org>
Date: Wednesday, October 11, 2017 at 8:13 AM
To: "user@ranger.apache.org" <user@ranger.apache.org>
Subject: Re: Ranger - HDFS Audit - Compression Support?

Kevin, thanks for your interest.

You are right: currently one of the options is saving the audits in HDFS 
itself as JSON files, in one folder per day. I have loaded these JSON files 
from the folder into Hive in compressed ORC format. The compressed ORC files 
were less than 10% of the original size, so it was a significant decrease in 
size. It is also easier to run analytics on the Hive tables.
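
For illustration, a conversion along these lines can be scripted over Hive 
JDBC. The table names, columns, paths, and the JSON SerDe here are 
placeholders rather than the exact statements used, and the 
hive-hcatalog-core SerDe jar has to be available to Hive:

    // Sketch: load one day of JSON audit files into a compressed ORC table.
    // Adjust table names, columns, and the folder path to the real audit schema.
    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.Statement;

    public class AuditJsonToOrc {
        public static void main(String[] args) throws Exception {
            Class.forName("org.apache.hive.jdbc.HiveDriver");
            try (Connection conn = DriverManager.getConnection(
                     "jdbc:hive2://hiveserver2:10000/default", "ranger", "");
                 Statement st = conn.createStatement()) {

                // External table over the existing JSON audit folder for one day.
                st.execute("CREATE EXTERNAL TABLE IF NOT EXISTS ranger_audit_json ("
                    + "evttime STRING, requser STRING, access STRING, resource STRING, result INT) "
                    + "ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe' "
                    + "LOCATION '/ranger/audit/hdfs/20171010'");

                // Compressed ORC copy; this is where the ~10x size reduction shows up.
                st.execute("CREATE TABLE IF NOT EXISTS ranger_audit_orc ("
                    + "evttime STRING, requser STRING, access STRING, resource STRING, result INT) "
                    + "STORED AS ORC TBLPROPERTIES ('orc.compress'='ZLIB')");

                st.execute("INSERT INTO TABLE ranger_audit_orc SELECT * FROM ranger_audit_json");
            }
        }
    }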

So, there are a couple of ways of doing it.


  1.  Write an Oozie job which runs every night and loads the previous day's 
worth of audit logs into ORC or another format.
  2.  Write an AuditDestination which can write in the format you want (a 
rough sketch of the ORC-writing piece follows below).

Regardless of which approach you take, this would be a good feature for Ranger.
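
For the second option, the format-specific piece is fairly small. Here is an 
illustrative sketch of writing audit rows with the ORC core writer API; the 
two-column schema and the file path are placeholders, and the Ranger 
AuditDestination plugin wiring itself is left out:

    // Illustrative only: the ORC-writing part a custom destination might use.
    // Each event is written as (event time, raw JSON); a real schema would
    // break the audit fields out into proper columns.
    import java.nio.charset.StandardCharsets;
    import java.util.List;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.hive.ql.exec.vector.BytesColumnVector;
    import org.apache.hadoop.hive.ql.exec.vector.VectorizedRowBatch;
    import org.apache.orc.CompressionKind;
    import org.apache.orc.OrcFile;
    import org.apache.orc.TypeDescription;
    import org.apache.orc.Writer;

    public class OrcAuditWriter {
        private static final TypeDescription SCHEMA =
            TypeDescription.fromString("struct<evttime:string,event:string>");

        // events: [0] = event time, [1] = raw JSON audit line
        public static void write(List<String[]> events, String file) throws Exception {
            Configuration conf = new Configuration();
            Writer writer = OrcFile.createWriter(new Path(file),
                OrcFile.writerOptions(conf).setSchema(SCHEMA).compress(CompressionKind.ZLIB));
            VectorizedRowBatch batch = SCHEMA.createRowBatch();
            BytesColumnVector timeCol = (BytesColumnVector) batch.cols[0];
            BytesColumnVector eventCol = (BytesColumnVector) batch.cols[1];
            for (String[] ev : events) {
                int row = batch.size++;
                timeCol.setVal(row, ev[0].getBytes(StandardCharsets.UTF_8));
                eventCol.setVal(row, ev[1].getBytes(StandardCharsets.UTF_8));
                if (batch.size == batch.getMaxSize()) {
                    writer.addRowBatch(batch);
                    batch.reset();
                }
            }
            if (batch.size != 0) {
                writer.addRowBatch(batch);
            }
            writer.close();
        }
    }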

Thanks

Bosco


From: Kevin Risden <kris...@apache.org>
Reply-To: <user@ranger.apache.org>
Date: Tuesday, October 10, 2017 at 2:50 PM
To: <user@ranger.apache.org>
Subject: Ranger - HDFS Audit - Compression Support?

We have one cluster that is on track to generate 10TB of HDFS audit data in 
one year. For compliance reasons, we are required to keep this audit 
information and not delete it. We use Ranger and Solr to search the last N 
days for simple use cases (e.g., seeing why a user was denied access).

I've done some research and found that Ranger HDFS audits are:
* Stored as JSON objects (one per line)
* Not compressed

This is currently very verbose and would benefit from compression since this 
data is not frequently accessed.

I didn't find any compression references on the mailing list archive or in 
JIRA. Is there a reason for this?

If there isn't any objection, what would be a good way to work towards adding 
compression support for Ranger HDFS audits?

Some ideas include sequence files (with built-in compression support) or even 
compressed Avro (which would make it easy to add a Hive table over the audit 
information).
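
For the sequence file idea, a minimal sketch (the output path and the sample 
records are placeholders) keeps one JSON audit line per record and uses block 
compression:

    // Sketch: block-compressed SequenceFile with one JSON audit event per value.
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.SequenceFile;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.io.compress.DefaultCodec;

    public class AuditSeqFileWriter {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            Path out = new Path("/ranger/audit/hdfs/20171010/audit.seq");
            try (SequenceFile.Writer writer = SequenceFile.createWriter(conf,
                    SequenceFile.Writer.file(out),
                    SequenceFile.Writer.keyClass(LongWritable.class),
                    SequenceFile.Writer.valueClass(Text.class),
                    SequenceFile.Writer.compression(
                        SequenceFile.CompressionType.BLOCK, new DefaultCodec()))) {
                long i = 0;
                // Placeholder records; in practice these would be the audit JSON lines.
                for (String jsonLine : new String[] {"{\"access\":\"read\"}", "{\"access\":\"write\"}"}) {
                    writer.append(new LongWritable(i++), new Text(jsonLine));
                }
            }
        }
    }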

Kevin Risden
