Please create the JIRA. I have added you as a contributor. You can assign it to yourself if you are planning to work on it.
Thanks
Bosco

From: Kevin Risden <compuwizard...@gmail.com>
Reply-To: <user@ranger.apache.org>
Date: Wednesday, October 11, 2017 at 6:50 AM
To: <user@ranger.apache.org>
Subject: Re: Ranger - HDFS Audit - Compression Support?

Any concern if I open a Ranger JIRA to track adding compression support for HDFS audits?

Kevin Risden

On Wed, Oct 11, 2017 at 4:35 AM, Don Bosco Durai <bo...@apache.org> wrote:

The audit framework has local file spooling. If the destination is not reachable or the write is slow, then it will start spooling to a local file, and a different thread reads from the spool file and starts writing to the destination. So, there is enough redundancy built in.

The audit is written by the Ranger plugin from within the component itself, so Ranger Admin is not involved. The Ranger audit framework has a process exit hook and tries to flush out the buffer before exiting. HDFS is tricky, particularly for the HDFS Ranger plugin. The same goes for the Ranger Solr plugin writing to Solr as the audit destination: the process itself is trying to shut down while the Ranger plugin is trying to flush to that same component. We also found another issue with the HDFS stream writer, where if the file is not closed before HDFS shuts down, then the audits are lost ☹ There is no real solution other than closing the HDFS files more regularly, or writing to a local file and copying it to HDFS. This is the reason the HDFS Audit Destination code was updated with another option to take care of this situation.

Each Audit Destination has its own in-memory buffer and also a separate spool file. If we have to do one more refactoring (e.g. V4), then I would suggest that we always write the JSON to the local file system and each Audit Destination instance keeps a marker/pointer to the file and line it is currently processing. This will give us reliability and also simplify the implementation. The writers can then push audits to the destination in real time (e.g. to a Kafka topic) or batch them and send them to the destination (e.g. Solr, HDFS, etc.).

Bosco

From: Sean Roberts <srobe...@hortonworks.com>
Reply-To: <user@ranger.apache.org>
Date: Wednesday, October 11, 2017 at 2:13 AM
To: "user@ranger.apache.org" <user@ranger.apache.org>
Subject: Re: Ranger - HDFS Audit - Compression Support?

Bosco – Side topic: in the case of batches, what happens to the audits if Ranger Admin or HDFS is restarted before a batch is written? Are they lost? A customer raised this as a concern after we found some corrupt/incomplete JSON audits which corresponded with service restarts.

--
Sean Roberts
@seano

From: Don Bosco Durai <bo...@apache.org>
Reply-To: "user@ranger.apache.org" <user@ranger.apache.org>
Date: Wednesday, October 11, 2017 at 10:04 AM
To: "user@ranger.apache.org" <user@ranger.apache.org>
Subject: Re: Ranger - HDFS Audit - Compression Support?

We could, and I think that is what Kevin is also suggesting. If we write as ORC or another file format directly, then we have to see how to batch the audits. In the Audit V3 implementation, we did some optimization to avoid store (local write) and forward; instead we build the batch in memory itself and do a bulk write (each destination has different policies). But in the previous release we did re-introduce an option to store and forward to HDFS due to the HDFS file closure issue. I personally don't know what would be a good batch size, but we can build on top of that code to write in the format we want, and make the output writer configurable to support different formats.

Bosco
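To make the marker/pointer idea from the 4:35 AM message above a little more concrete, here is a minimal, plain-Java sketch: a reader that consumes JSON audit lines from a local spool file, hands them out in batches, and only advances its line marker after the destination confirms the write. The class and method names (SpoolMarkerReader, nextBatch, commit) are illustrative only and are not part of the Ranger audit framework; a full implementation would also track which spool file the marker points to, not just the line.

import java.io.BufferedReader;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

/**
 * Illustrative sketch only (not Ranger code): reads JSON audit lines from a
 * local spool file and remembers how far it has gotten, so a destination can
 * resume after a restart without losing or duplicating events.
 */
public class SpoolMarkerReader {
    private final Path spoolFile;
    private long lastLineProcessed;   // marker: last line already delivered

    public SpoolMarkerReader(Path spoolFile, long lastLineProcessed) {
        this.spoolFile = spoolFile;
        this.lastLineProcessed = lastLineProcessed;
    }

    /** Read up to batchSize new JSON lines past the marker. */
    public List<String> nextBatch(int batchSize) throws IOException {
        List<String> batch = new ArrayList<>();
        try (BufferedReader reader = Files.newBufferedReader(spoolFile, StandardCharsets.UTF_8)) {
            long lineNo = 0;
            String line;
            while ((line = reader.readLine()) != null && batch.size() < batchSize) {
                lineNo++;
                if (lineNo <= lastLineProcessed) {
                    continue;   // already delivered to this destination
                }
                batch.add(line);
            }
        }
        return batch;
    }

    /** Advance the marker only after the destination confirms the write. */
    public void commit(int linesDelivered) {
        lastLineProcessed += linesDelivered;
    }
}

Keeping the commit separate from the read is what gives the reliability mentioned above: if the process dies between reading and writing, the marker has not moved, so the batch is simply re-read on restart.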
From: Sean Roberts <srobe...@hortonworks.com>
Reply-To: <user@ranger.apache.org>
Date: Wednesday, October 11, 2017 at 1:50 AM
To: "user@ranger.apache.org" <user@ranger.apache.org>
Subject: Re: Ranger - HDFS Audit - Compression Support?

I've been looking at the same thing. Even in small clusters the size of Ranger audits is considerable. The files compress well, but compressed JSON will be difficult to query. Would engineering Ranger to write directly to ORC be reasonable?

--
Sean Roberts
@seano

From: Don Bosco Durai <bo...@apache.org>
Reply-To: "user@ranger.apache.org" <user@ranger.apache.org>
Date: Wednesday, October 11, 2017 at 8:13 AM
To: "user@ranger.apache.org" <user@ranger.apache.org>
Subject: Re: Ranger - HDFS Audit - Compression Support?

Kevin, thanks for your interest. You are right, currently one of the options is saving the audits in HDFS itself as JSON files, in one folder per day. I have loaded these JSON files from the folder into Hive as compressed ORC format. The compressed ORC files were less than 10% of the original size, so it was a significant decrease in size. Also, it is easier to run analytics on the Hive tables.

So, there are a couple of ways of doing it:
* Write an Oozie job which runs every night and loads the previous day's worth of audit logs into ORC or another format.
* Write an AuditDestination which can write in the format you want.

Regardless of which approach you take, this would be a good feature for Ranger.

Thanks

Bosco

From: Kevin Risden <kris...@apache.org>
Reply-To: <user@ranger.apache.org>
Date: Tuesday, October 10, 2017 at 2:50 PM
To: <user@ranger.apache.org>
Subject: Ranger - HDFS Audit - Compression Support?

We have one cluster that is on track to generate 10TB of HDFS audit data in one year. For compliance reasons, we are required to keep this audit information and not delete it. We use Ranger and Solr to search the last N days of audits for simple use cases (seeing why a user is denied access).

I've done some research and found that Ranger HDFS audits are:
* Stored as JSON objects (one per line)
* Not compressed

This is currently very verbose and would benefit from compression, since this data is not frequently accessed. I didn't find any compression references on the mailing list archive or in JIRA. Is there a reason for this? If there isn't any objection, what would be a good way to work towards adding compression support for Ranger HDFS audits? Some ideas include sequence files (with built-in compression support) or even compressed Avro (which would make it easy to add a Hive table over the audit information).

Kevin Risden
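As a rough illustration of the sequence-file idea mentioned above (built-in compression while keeping one JSON event per record), here is a minimal sketch that repacks a day's worth of JSON audit lines into a block-compressed Hadoop SequenceFile using GzipCodec. The input and output paths and the class name are placeholders; a real integration would live inside an audit destination or a scheduled job rather than a standalone main().

import java.io.BufferedReader;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.compress.GzipCodec;

/**
 * Illustrative sketch only: repack one day's worth of Ranger HDFS audit
 * JSON lines into a block-compressed SequenceFile. Paths are placeholders.
 */
public class AuditJsonToSequenceFile {
    public static void main(String[] args) throws IOException {
        Configuration conf = new Configuration();
        Path output = new Path("/ranger/audit/hdfs/20171010.seq");

        try (SequenceFile.Writer writer = SequenceFile.createWriter(conf,
                SequenceFile.Writer.file(output),
                SequenceFile.Writer.keyClass(LongWritable.class),
                SequenceFile.Writer.valueClass(Text.class),
                SequenceFile.Writer.compression(SequenceFile.CompressionType.BLOCK, new GzipCodec()));
             BufferedReader reader = Files.newBufferedReader(
                Paths.get("/tmp/ranger-audit-20171010.log"), StandardCharsets.UTF_8)) {

            LongWritable key = new LongWritable();
            Text value = new Text();
            String line;
            long lineNo = 0;
            while ((line = reader.readLine()) != null) {
                key.set(++lineNo);        // key: line number, value: raw JSON event
                value.set(line);
                writer.append(key, value);
            }
        }
    }
}

Block compression keeps the output splittable for downstream MapReduce or Hive jobs, and the values remain the original JSON events, so existing tooling that parses the audit JSON still works after decompression.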