[jira] [Commented] (RANGER-1429) Ranger Audit solrconfig.xml optimizations

2019-11-09 Thread Kevin Risden (Jira)


[ 
https://issues.apache.org/jira/browse/RANGER-1429?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16970869#comment-16970869
 ] 

Kevin Risden commented on RANGER-1429:
--

It looks like this patch is still applicable to the current Solr configs for 
Ranger. Spellcheck running in a separate thread used to be an issue (not sure 
if it was fixed in Solr itself). Either way, if Ranger doesn't need spellcheck, 
removing it seems like an easy win.

[~aarrieta] have you run into this as well, specifically around infra-solr? 
This seems like a reasonable suggestion to disable spell check for the 
collection. 

[~vel] not sure if spell check is used in Ranger anywhere. I don't have any 
examples off the top of my head. 
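
If removing them, here is a minimal sketch of what the change to the 
ranger_audits solrconfig.xml might look like, assuming the components carry 
the stock Solr example names (searchComponent "spellcheck"/"suggest" and the 
/spell and /suggest handlers); the names in the shipped config may differ, so 
treat this as an illustration rather than the exact patch:

{code:xml}
<!-- Candidate elements to delete from solrconfig.xml: the spellcheck and
     suggest components plus the request handlers that reference them.
     Neither appears to be used by Ranger Admin queries. -->
<searchComponent name="spellcheck" class="solr.SpellCheckComponent">
  <!-- spellcheck dictionary definitions elided -->
</searchComponent>
<requestHandler name="/spell" class="solr.SearchHandler" startup="lazy">
  <arr name="last-components">
    <str>spellcheck</str>
  </arr>
</requestHandler>
<searchComponent name="suggest" class="solr.SuggestComponent">
  <!-- suggester definition elided -->
</searchComponent>
<requestHandler name="/suggest" class="solr.SearchHandler" startup="lazy">
  <arr name="last-components">
    <str>suggest</str>
  </arr>
</requestHandler>
{code}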

> Ranger Audit solrconfig.xml optimizations
> -
>
> Key: RANGER-1429
> URL: https://issues.apache.org/jira/browse/RANGER-1429
> Project: Ranger
>  Issue Type: Bug
>  Components: audit
>Affects Versions: 0.7.0
>Reporter: Greg Senia
>Priority: Major
> Attachments: RANGER-1429.patch
>
>
> When using SolrCloud with a single shard and single replica and a large 
> number of audit records (20GB for 2 weeks), it takes Solr a very long time to 
> open the replica for work. After doing some research, I determined via 
> thread dumps that on the first request to the Solr collection it kicks off a 
> Suggestion thread and a SpellCheck thread. I propose removing these, as they 
> do not appear to impact Ranger audits, and doing so solves the issue of the 
> Solr ranger_audits collection being unavailable after Solr startup.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (RANGER-2454) Remove the trailing slash in Ranger URL in RangerAdminJersey2RESTClient

2019-05-31 Thread Kevin Risden (JIRA)


[ 
https://issues.apache.org/jira/browse/RANGER-2454?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16852982#comment-16852982
 ] 

Kevin Risden commented on RANGER-2454:
--

RANGER-1415 fixed this for RangerAdminRESTClient but didn't change the 
Knox-specific RangerAdminJersey2RESTClient.

> Remove the trailing slash in Ranger URL in RangerAdminJersey2RESTClient
> ---
>
> Key: RANGER-2454
> URL: https://issues.apache.org/jira/browse/RANGER-2454
> Project: Ranger
>  Issue Type: Improvement
>  Components: Ranger
>Affects Versions: master
>Reporter: Nikhil Purbhe
>Assignee: Nikhil Purbhe
>Priority: Major
> Fix For: master
>
>
> Remove the trailing slash in Ranger URL in RangerAdminJersey2RESTClient.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (RANGER-1837) Enhance Ranger Audit to HDFS to support ORC file format

2018-11-13 Thread Kevin Risden (JIRA)


[ 
https://issues.apache.org/jira/browse/RANGER-1837?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16685676#comment-16685676
 ] 

Kevin Risden commented on RANGER-1837:
--

I took a look but missed the submit review button. The changes look reasonable, 
but I don't have a cluster to test this on anymore. 

> Enhance Ranger Audit to HDFS to support ORC file format
> ---
>
> Key: RANGER-1837
> URL: https://issues.apache.org/jira/browse/RANGER-1837
> Project: Ranger
>  Issue Type: Improvement
>  Components: audit
>Reporter: Kevin Risden
>Assignee: Ramesh Mani
>Priority: Major
> Attachments: 
> 0001-RANGER-1837-Enhance-Ranger-Audit-to-HDFS-to-support-.patch, 
> 0001-RANGER-1837-Enhance-Ranger-Audit-to-HDFS-to-support-002.patch, 
> 0001-RANGER-1837-Enhance-Ranger-Audit-to-HDFS-to-support_001.patch, 
> AuditDataFlow.png
>
>
> My team has done some research and found that Ranger HDFS audits are:
> * Stored as JSON objects (one per line)
> * Not compressed
> This is currently very verbose and would benefit from compression since this 
> data is not frequently accessed. 
> From Bosco on the mailing list:
> {quote}You are right, currently one of the options is saving the audits in 
> HDFS itself as JSON files in one folder per day. I have loaded these JSON 
> files from the folder into Hive as compressed ORC format. The compressed 
> files in ORC were less than 10% of the original size. So, it was significant 
> decrease in size. Also, it is easier to run analytics on the Hive tables.
>  
> So, there are couple of ways of doing it.
>  
> Write an Oozie job which runs every night and loads the previous day worth 
> audit logs into ORC or other format
> Write a AuditDestination which can write into the format you want to.
>  
> Regardless which approach you take, this would be a good feature for 
> Ranger.{quote}
> http://mail-archives.apache.org/mod_mbox/ranger-user/201710.mbox/%3CCAJU9nmiYzzUUX1uDEysLAcMti4iLmX7RE%3DmN2%3DdoLaaQf87njQ%40mail.gmail.com%3E



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (RANGER-1837) Enhance Ranger Audit to HDFS to support ORC file format

2018-10-31 Thread Kevin Risden (JIRA)


[ 
https://issues.apache.org/jira/browse/RANGER-1837?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16670611#comment-16670611
 ] 

Kevin Risden commented on RANGER-1837:
--

Sorry, I changed jobs :) I can take a look in the next week or so.

Maybe [~quirogadf] or [~Khalid Diriye] or [~toftedahl] have some more 
thoughts...

> Enhance Ranger Audit to HDFS to support ORC file format
> ---
>
> Key: RANGER-1837
> URL: https://issues.apache.org/jira/browse/RANGER-1837
> Project: Ranger
>  Issue Type: Improvement
>  Components: audit
>Reporter: Kevin Risden
>Assignee: Ramesh Mani
>Priority: Major
> Attachments: 
> 0001-RANGER-1837-Enhance-Ranger-Audit-to-HDFS-to-support-.patch, 
> 0001-RANGER-1837-Enhance-Ranger-Audit-to-HDFS-to-support-002.patch, 
> 0001-RANGER-1837-Enhance-Ranger-Audit-to-HDFS-to-support_001.patch, 
> AuditDataFlow.png
>
>
> My team has done some research and found that Ranger HDFS audits are:
> * Stored as JSON objects (one per line)
> * Not compressed
> This is currently very verbose and would benefit from compression since this 
> data is not frequently accessed. 
> From Bosco on the mailing list:
> {quote}You are right, currently one of the options is saving the audits in 
> HDFS itself as JSON files in one folder per day. I have loaded these JSON 
> files from the folder into Hive as compressed ORC format. The compressed 
> files in ORC were less than 10% of the original size. So, it was significant 
> decrease in size. Also, it is easier to run analytics on the Hive tables.
>  
> So, there are couple of ways of doing it.
>  
> Write an Oozie job which runs every night and loads the previous day worth 
> audit logs into ORC or other format
> Write a AuditDestination which can write into the format you want to.
>  
> Regardless which approach you take, this would be a good feature for 
> Ranger.{quote}
> http://mail-archives.apache.org/mod_mbox/ranger-user/201710.mbox/%3CCAJU9nmiYzzUUX1uDEysLAcMti4iLmX7RE%3DmN2%3DdoLaaQf87njQ%40mail.gmail.com%3E



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (RANGER-1938) Solr for Audit setup doesn't use DocValues effectively

2018-01-02 Thread Kevin Risden (JIRA)

[ 
https://issues.apache.org/jira/browse/RANGER-1938?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16309021#comment-16309021
 ] 

Kevin Risden commented on RANGER-1938:
--

Awesome, thanks [~vperiasamy]

> Solr for Audit setup doesn't use DocValues effectively
> --
>
> Key: RANGER-1938
> URL: https://issues.apache.org/jira/browse/RANGER-1938
> Project: Ranger
>  Issue Type: Improvement
>  Components: audit
>Affects Versions: 0.6.0, 0.7.0, 0.6.1, 0.6.2, 0.6.3, 0.7.1
>Reporter: Kevin Risden
>Assignee: Kevin Risden
>  Labels: performance
> Fix For: 1.0.0, 0.7.2
>
> Attachments: 
> 0001-RANGER-1938-Enable-DocValues-for-more-fields-in-Solr.patch
>
>
> Ranger uses Ambari Infra Solr (or another Apache Solr install) for storing 
> Ranger Audit events for displaying in Ranger Admin. In our case, we have 
> noticed quite a few Ambari Infra Solr OOM due to Ranger. I've talked with a 
> few other people who are having very similar problems with OOM errors.
> I've typed up some details about how the way Ranger is using Solr requires a 
> lot of heap. I've also outlined the fix for this which significantly reduced 
> the amount of heap memory required. I'm an Apache Lucene/Solr committer so 
> this optimization/usage might not be immediately obvious to those using Solr 
> especially version 5.x.
> https://risdenk.github.io/2017/12/18/ambari-infra-solr-ranger.html



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (RANGER-1942) Disable xmlparser and configEdit API in Solr for Audit setup

2017-12-21 Thread Kevin Risden (JIRA)

 [ 
https://issues.apache.org/jira/browse/RANGER-1942?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kevin Risden updated RANGER-1942:
-
Fix Version/s: 0.7.2

> Disable xmlparser and configEdit API in Solr for Audit setup
> 
>
> Key: RANGER-1942
> URL: https://issues.apache.org/jira/browse/RANGER-1942
> Project: Ranger
>  Issue Type: Bug
>  Components: audit
>Reporter: Kevin Risden
> Fix For: 0.7.2
>
>
> AMBARI-22273 addresses this for Ambari Infra Solr. Ranger should do its best 
> to protect users from using a config that could be an issue. Solr 5.5.5, 
> 6.6.2, and 7.1.0 all fix the below issues.
> A fix for Ranger would be to set the following in solrconfig.xml. Another 
> could be to make sure that the documentation for Ranger -> Solr ensures that 
> recommended versions are used.
> {code:xml}
> 
> {code}
> From https://lucene.apache.org/solr/news.html
> * Fix for a 0-day exploit (CVE-2017-12629), details: 
> https://s.apache.org/FJDl. RunExecutableListener has been disabled by default 
> (can be enabled by -Dsolr.enableRunExecutableListener=true) and resolving 
> external entities in the XML query parser (defType=xmlparser or {!xmlparser 
> ... }) is disabled by default.
> * Fix for CVE-2017-7660: Security Vulnerability in secure inter-node 
> communication in Apache Solr, details: https://s.apache.org/APTY



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (RANGER-1942) Disable xmlparser and configEdit API in Solr for Audit setup

2017-12-21 Thread Kevin Risden (JIRA)

 [ 
https://issues.apache.org/jira/browse/RANGER-1942?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kevin Risden updated RANGER-1942:
-
Description: 
AMBARI-22273 addresses this for Ambari Infra Solr. Ranger should do its best to 
protect users from using a config that could be an issue. Solr 5.5.5, 6.6.2, 
and 7.1.0 all fix the below issues.

A fix for Ranger would be to set the following in solrconfig.xml. Another could 
be to make sure that the documentation for Ranger -> Solr ensures that 
recommended versions are used.

{code:xml}

{code}

From https://lucene.apache.org/solr/news.html
* Fix for a 0-day exploit (CVE-2017-12629), details: https://s.apache.org/FJDl. 
RunExecutableListener has been disabled by default (can be enabled by 
-Dsolr.enableRunExecutableListener=true) and resolving external entities in the 
XML query parser (defType=xmlparser or {!xmlparser ... }) is disabled by 
default.
* Fix for CVE-2017-7660: Security Vulnerability in secure inter-node 
communication in Apache Solr, details: https://s.apache.org/APTY

  was:
AMBARI-22273 addresses this for Ambari Infra Solr. Ranger should do its best to 
protect users from using a config that could be an issue. Solr 5.5.5, 6.6.2, 
and 7.1.0 all fix the below issues. The fix for Ranger would be to set the 
following in solrconfig.xml.

{code:xml}

{code}

From https://lucene.apache.org/solr/news.html
* Fix for a 0-day exploit (CVE-2017-12629), details: https://s.apache.org/FJDl. 
RunExecutableListener has been disabled by default (can be enabled by 
-Dsolr.enableRunExecutableListener=true) and resolving external entities in the 
XML query parser (defType=xmlparser or {!xmlparser ... }) is disabled by 
default.
* Fix for CVE-2017-7660: Security Vulnerability in secure inter-node 
communication in Apache Solr, details: https://s.apache.org/APTY


> Disable xmlparser and configEdit API in Solr for Audit setup
> 
>
> Key: RANGER-1942
> URL: https://issues.apache.org/jira/browse/RANGER-1942
> Project: Ranger
>  Issue Type: Bug
>  Components: audit
>Reporter: Kevin Risden
>
> AMBARI-22273 addresses this for Ambari Infra Solr. Ranger should do its best 
> to protect users from using a config that could be an issue. Solr 5.5.5, 
> 6.6.2, and 7.1.0 all fix the below issues.
> A fix for Ranger would be to set the following in solrconfig.xml. Another 
> could be to make sure that the documentation for Ranger -> Solr ensures that 
> recommended versions are used.
> {code:xml}
> 
> {code}
> From https://lucene.apache.org/solr/news.html
> * Fix for a 0-day exploit (CVE-2017-12629), details: 
> https://s.apache.org/FJDl. RunExecutableListener has been disabled by default 
> (can be enabled by -Dsolr.enableRunExecutableListener=true) and resolving 
> external entities in the XML query parser (defType=xmlparser or {!xmlparser 
> ... }) is disabled by default.
> * Fix for CVE-2017-7660: Security Vulnerability in secure inter-node 
> communication in Apache Solr, details: https://s.apache.org/APTY



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Created] (RANGER-1942) Disable xmlparser and configEdit API in Solr for Audit setup

2017-12-21 Thread Kevin Risden (JIRA)
Kevin Risden created RANGER-1942:


 Summary: Disable xmlparser and configEdit API in Solr for Audit 
setup
 Key: RANGER-1942
 URL: https://issues.apache.org/jira/browse/RANGER-1942
 Project: Ranger
  Issue Type: Bug
  Components: audit
Reporter: Kevin Risden


AMBARI-22273 addresses this for Ambari Infra Solr. Ranger should do its best to 
protect users from using a config that could be an issue. Solr 5.5.5, 6.6.2, 
and 7.1.0 all fix the below issues. The fix for Ranger would be to set the 
following in solrconfig.xml.

{code:xml}

{code}

From https://lucene.apache.org/solr/news.html
* Fix for a 0-day exploit (CVE-2017-12629), details: https://s.apache.org/FJDl. 
RunExecutableListener has been disabled by default (can be enabled by 
-Dsolr.enableRunExecutableListener=true) and resolving external entities in the 
XML query parser (defType=xmlparser or {!xmlparser ... }) is disabled by 
default.
* Fix for CVE-2017-7660: Security Vulnerability in secure inter-node 
communication in Apache Solr, details: https://s.apache.org/APTY



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (RANGER-1938) Solr for Audit setup doesn't use DocValues effectively

2017-12-21 Thread Kevin Risden (JIRA)

[ 
https://issues.apache.org/jira/browse/RANGER-1938?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16300295#comment-16300295
 ] 

Kevin Risden commented on RANGER-1938:
--

Created AMBARI-22684 to address this in Ambari.

> Solr for Audit setup doesn't use DocValues effectively
> --
>
> Key: RANGER-1938
> URL: https://issues.apache.org/jira/browse/RANGER-1938
> Project: Ranger
>  Issue Type: Improvement
>  Components: audit
>Affects Versions: 0.6.0, 0.7.0, 0.6.1, 0.6.2, 0.6.3, 0.7.1
>Reporter: Kevin Risden
>Assignee: Kevin Risden
>  Labels: performance
> Fix For: 1.0.0, 0.7.2
>
> Attachments: 
> 0001-RANGER-1938-Enable-DocValues-for-more-fields-in-Solr.patch
>
>
> Ranger uses Ambari Infra Solr (or another Apache Solr install) for storing 
> Ranger Audit events for displaying in Ranger Admin. In our case, we have 
> noticed quite a few Ambari Infra Solr OOM due to Ranger. I've talked with a 
> few other people who are having very similar problems with OOM errors.
> I've typed up some details about how the way Ranger is using Solr requires a 
> lot of heap. I've also outlined the fix for this which significantly reduced 
> the amount of heap memory required. I'm an Apache Lucene/Solr committer so 
> this optimization/usage might not be immediately obvious to those using Solr 
> especially version 5.x.
> https://risdenk.github.io/2017/12/18/ambari-infra-solr-ranger.html



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (RANGER-1938) Solr for Audit setup doesn't use DocValues effectively

2017-12-21 Thread Kevin Risden (JIRA)

[ 
https://issues.apache.org/jira/browse/RANGER-1938?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16300289#comment-16300289
 ] 

Kevin Risden commented on RANGER-1938:
--

The reason I went with changing everything to DocValues is that Solr made this 
the default in Solr 6.0. See SOLR-8740 for details. Solr 5.5 has support for 
DocValues; Solr 6.0 was just the first major version where the default was 
changed. Moving to docValues=true by default more closely matches later 
versions of Solr.

Do you still want a comment in there?
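
For reference, a minimal sketch of the fieldType-level change being described, 
using the common Solr example type names (the exact fieldTypes in the shipped 
ranger_audits schema may differ):

{code:xml}
<!-- Setting docValues="true" on the fieldType makes it the default for every
     field of that type, mirroring the Solr 6.0+ defaults from SOLR-8740. -->
<fieldType name="string" class="solr.StrField" sortMissingLast="true" docValues="true"/>
<fieldType name="tdate" class="solr.TrieDateField" precisionStep="6" positionIncrementGap="0" docValues="true"/>
<fieldType name="tlong" class="solr.TrieLongField" precisionStep="8" positionIncrementGap="0" docValues="true"/>
{code}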

> Solr for Audit setup doesn't use DocValues effectively
> --
>
> Key: RANGER-1938
> URL: https://issues.apache.org/jira/browse/RANGER-1938
> Project: Ranger
>  Issue Type: Improvement
>  Components: audit
>Affects Versions: 0.6.0, 0.7.0, 0.6.1, 0.6.2, 0.6.3, 0.7.1
>Reporter: Kevin Risden
>Assignee: Kevin Risden
>  Labels: performance
> Fix For: 1.0.0, 0.7.2
>
> Attachments: 
> 0001-RANGER-1938-Enable-DocValues-for-more-fields-in-Solr.patch
>
>
> Ranger uses Ambari Infra Solr (or another Apache Solr install) for storing 
> Ranger Audit events for displaying in Ranger Admin. In our case, we have 
> noticed quite a few Ambari Infra Solr OOM due to Ranger. I've talked with a 
> few other people who are having very similar problems with OOM errors.
> I've typed up some details about how the way Ranger is using Solr requires a 
> lot of heap. I've also outlined the fix for this which significantly reduced 
> the amount of heap memory required. I'm an Apache Lucene/Solr committer so 
> this optimization/usage might not be immediately obvious to those using Solr 
> especially version 5.x.
> https://risdenk.github.io/2017/12/18/ambari-infra-solr-ranger.html



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (RANGER-1938) Solr for Audit setup doesn't use DocValues effectively

2017-12-20 Thread Kevin Risden (JIRA)

[ 
https://issues.apache.org/jira/browse/RANGER-1938?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16298628#comment-16298628
 ] 

Kevin Risden commented on RANGER-1938:
--

bq. If Apache Ambari is used to manage Ranger, then Ambari has its own template 
for solr-config

I think this only takes effect during the initial install when using SolrCloud. 
The template isn't pushed back up to ZooKeeper. I know my changes haven't been 
overwritten by Ambari (which I was a bit surprised by).

Does it make sense for me to open a JIRA against the Ambari project as well to 
fix this schema? I can link to this JIRA since they are related.

bq. If you are using Ambari-Infra Solr, then it is shared by Ambari LogSearch 
and Apache Atlas

It is true that Ambari Infra Solr is shared by Ranger, LogSearch, and Atlas. 
The modified Solr config is only for the single collection which is 
ranger_audit in this case. This could also be a problem with LogSearch and 
Atlas and would have to be fixed separately. We don't have LogSearch enabled, 
and Atlas has too small a dataset for us to notice a problem right now.

bq. We have to emphasize that changing the schema requires rebuilding of Solr 
Collection. Which means all existing audits from Solr will be deleted. There is 
an option to rebuild it from the audits stored in HDFS, but currently, there 
are no known documentation or scripts for that

Since the TTL is small, deleting is what I recommend since it avoids the 
complexity of reindexing from HDFS. If necessary, we definitely could reindex 
from HDFS, but as you said this would require more effort.
 
bq. I have given review comments at review board. Essentially, I would prefer 
to set the docValues at individual field level, rather than at global/default 
fieldType level.

I don't think there is a great reason not to enable it at the "global" fieldType 
level. We can disable it at the individual field level if necessary. There are 
very few cases where turning off DocValues makes sense, especially for Ranger. 
DocValues take up more disk space during indexing, but the tradeoff is that 
Solr stays stable during any sorting or faceting operation. Most of the queries 
I've seen for Ranger use sorting in the Ranger Admin UI. The thing to remember 
here is that the DocValues setting is "global" only at the collection level, 
not across all of Solr; each collection can have a different configuration.
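
A minimal sketch of the field-level override mentioned above, using a 
hypothetical field name (the real ranger_audits schema fields may be named 
differently):

{code:xml}
<!-- With docValues="true" as the fieldType default, an individual field can
     still opt out where sorting/faceting on it is never needed. -->
<field name="some_rarely_sorted_field" type="string" indexed="true" stored="true" docValues="false"/>
{code}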

bq. I have a request. I know it is very difficult to recommend configuration 
settings for Solr. With your current setup, can you share the configuration you 
currently have?

1. Memory setting for Solr = *4GB (after change for this JIRA. Previously had 
up to 24GB and Solr wasn't stable over time.)*
2. Number of shards and replication = *5 shards - no replication right now 
(single Solr node)*
3. Number of days for TTL = *21 days*
4. Max number of documents (based on TTL) = *~1.33 billion live docs and ~110 
million deleted docs for 21 days. Split over 5 shards - each shard ~266 million 
live docs and ~22 million deleted docs right now.*
5. Are Solr instances running on dedicated servers or one of the master 
servers? = *1 Solr instance running on a data node with spinning disk (not 
using HDFS right now)*

Some other stats that could be helpful:
* very few queries against Ambari Infra Solr - Ranger Admin UI and Atlas UI 
(very rarely)
* default GC tuning from Solr scripts are used
* current uptime 2 weeks - no full GC pauses with 4GB heap
* average pause time is .00118s
* average pause interval is 15.41s
* average heap usage 2GB


> Solr for Audit setup doesn't use DocValues effectively
> --
>
> Key: RANGER-1938
> URL: https://issues.apache.org/jira/browse/RANGER-1938
> Project: Ranger
>  Issue Type: Bug
>  Components: audit
>Affects Versions: 0.6.0, 0.7.0, 0.6.1, 0.6.2, 0.6.3, 0.7.1
>Reporter: Kevin Risden
>Assignee: Kevin Risden
>  Labels: performance
> Fix For: 1.0.0, 0.7.2
>
> Attachments: 
> 0001-RANGER-1938-Enable-DocValues-for-more-fields-in-Solr.patch
>
>
> Ranger uses Ambari Infra Solr (or another Apache Solr install) for storing 
> Ranger Audit events for displaying in Ranger Admin. In our case, we have 
> noticed quite a few Ambari Infra Solr OOM due to Ranger. I've talked with a 
> few other people who are having very similar problems with OOM errors.
> I've typed up some details about how the way Ranger is using Solr requires a 
> lot of heap. I've also outlined the fix for this which significantly reduced 
> the amount of heap memory required. I'm an Apache Lucene/Solr committer so 
> this optimization/usage might not be immediately obvious to those using Solr 
> especially version 5.x.
> https://risdenk.github.io/2017/12/18/ambari-infra-solr-ranger.html



--
This message was sent by Atlassian JIRA

[jira] [Commented] (RANGER-1938) Solr for Audit setup doesn't use DocValues effectively

2017-12-20 Thread Kevin Risden (JIRA)

[ 
https://issues.apache.org/jira/browse/RANGER-1938?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16298612#comment-16298612
 ] 

Kevin Risden commented on RANGER-1938:
--

[~bosco] - Thanks for the feedback. I don't see your comments on review board, 
though. I'll get the answers to your questions as well.

> Solr for Audit setup doesn't use DocValues effectively
> --
>
> Key: RANGER-1938
> URL: https://issues.apache.org/jira/browse/RANGER-1938
> Project: Ranger
>  Issue Type: Bug
>  Components: audit
>Affects Versions: 0.6.0, 0.7.0, 0.6.1, 0.6.2, 0.6.3, 0.7.1
>Reporter: Kevin Risden
>Assignee: Kevin Risden
>  Labels: performance
> Fix For: 1.0.0, 0.7.2
>
> Attachments: 
> 0001-RANGER-1938-Enable-DocValues-for-more-fields-in-Solr.patch
>
>
> Ranger uses Ambari Infra Solr (or another Apache Solr install) for storing 
> Ranger Audit events for displaying in Ranger Admin. In our case, we have 
> noticed quite a few Ambari Infra Solr OOM due to Ranger. I've talked with a 
> few other people who are having very similar problems with OOM errors.
> I've typed up some details about how the way Ranger is using Solr requires a 
> lot of heap. I've also outlined the fix for this which significantly reduced 
> the amount of heap memory required. I'm an Apache Lucene/Solr committer so 
> this optimization/usage might not be immediately obvious to those using Solr 
> especially version 5.x.
> https://risdenk.github.io/2017/12/18/ambari-infra-solr-ranger.html



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (RANGER-1938) Solr for Audit setup doesn't use DocValues effectively

2017-12-19 Thread Kevin Risden (JIRA)

 [ 
https://issues.apache.org/jira/browse/RANGER-1938?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kevin Risden updated RANGER-1938:
-
Labels: performance  (was: )

> Solr for Audit setup doesn't use DocValues effectively
> --
>
> Key: RANGER-1938
> URL: https://issues.apache.org/jira/browse/RANGER-1938
> Project: Ranger
>  Issue Type: Bug
>  Components: audit
>Affects Versions: 0.6.0, 0.7.0, 0.6.1, 0.6.2, 0.6.3, 0.7.1
>Reporter: Kevin Risden
>Assignee: Kevin Risden
>  Labels: performance
> Attachments: 
> 0001-RANGER-1938-Enable-DocValues-for-more-fields-in-Solr.patch
>
>
> Ranger uses Ambari Infra Solr (or another Apache Solr install) for storing 
> Ranger Audit events for displaying in Ranger Admin. In our case, we have 
> noticed quite a few Ambari Infra Solr OOM due to Ranger. I've talked with a 
> few other people who are having very similar problems with OOM errors.
> I've typed up some details about how the way Ranger is using Solr requires a 
> lot of heap. I've also outlined the fix for this which significantly reduced 
> the amount of heap memory required. I'm an Apache Lucene/Solr committer so 
> this optimization/usage might not be immediately obvious to those using Solr 
> especially version 5.x.
> https://risdenk.github.io/2017/12/18/ambari-infra-solr-ranger.html



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (RANGER-1938) Solr for Audit setup doesn't use DocValues effectively

2017-12-19 Thread Kevin Risden (JIRA)

[ 
https://issues.apache.org/jira/browse/RANGER-1938?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16297750#comment-16297750
 ] 

Kevin Risden commented on RANGER-1938:
--

This will require Solr 5.5 or later. Migrating from non-DocValues to DocValues 
requires a reindex if you plan to keep existing documents. I have deleted the 
ranger_audit collection and recreated it after making this schema change. For 
new installations this isn't a concern.

> Solr for Audit setup doesn't use DocValues effectively
> --
>
> Key: RANGER-1938
> URL: https://issues.apache.org/jira/browse/RANGER-1938
> Project: Ranger
>  Issue Type: Bug
>  Components: audit
>Affects Versions: 0.6.0, 0.7.0, 0.6.1, 0.6.2, 0.6.3, 0.7.1
>Reporter: Kevin Risden
>Assignee: Kevin Risden
> Attachments: 
> 0001-RANGER-1938-Enable-DocValues-for-more-fields-in-Solr.patch
>
>
> Ranger uses Ambari Infra Solr (or another Apache Solr install) for storing 
> Ranger Audit events for displaying in Ranger Admin. In our case, we have 
> noticed quite a few Ambari Infra Solr OOM due to Ranger. I've talked with a 
> few other people who are having very similar problems with OOM errors.
> I've typed up some details about how the way Ranger is using Solr requires a 
> lot of heap. I've also outlined the fix for this which significantly reduced 
> the amount of heap memory required. I'm an Apache Lucene/Solr committer so 
> this optimization/usage might not be immediately obvious to those using Solr 
> especially version 5.x.
> https://risdenk.github.io/2017/12/18/ambari-infra-solr-ranger.html



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (RANGER-1938) Solr for Audit setup doesn't use DocValues effectively

2017-12-19 Thread Kevin Risden (JIRA)

[ 
https://issues.apache.org/jira/browse/RANGER-1938?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16297749#comment-16297749
 ] 

Kevin Risden commented on RANGER-1938:
--

https://reviews.apache.org/r/64740/

> Solr for Audit setup doesn't use DocValues effectively
> --
>
> Key: RANGER-1938
> URL: https://issues.apache.org/jira/browse/RANGER-1938
> Project: Ranger
>  Issue Type: Bug
>  Components: audit
>Affects Versions: 0.6.0, 0.7.0, 0.6.1, 0.6.2, 0.6.3, 0.7.1
>Reporter: Kevin Risden
>Assignee: Kevin Risden
> Attachments: 
> 0001-RANGER-1938-Enable-DocValues-for-more-fields-in-Solr.patch
>
>
> Ranger uses Ambari Infra Solr (or another Apache Solr install) for storing 
> Ranger Audit events for displaying in Ranger Admin. In our case, we have 
> noticed quite a few Ambari Infra Solr OOM due to Ranger. I've talked with a 
> few other people who are having very similar problems with OOM errors.
> I've typed up some details about how the way Ranger is using Solr requires a 
> lot of heap. I've also outlined the fix for this which significantly reduced 
> the amount of heap memory required. I'm an Apache Lucene/Solr committer so 
> this optimization/usage might not be immediately obvious to those using Solr 
> especially version 5.x.
> https://risdenk.github.io/2017/12/18/ambari-infra-solr-ranger.html



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (RANGER-1938) Solr for Audit setup doesn't use DocValues effectively

2017-12-19 Thread Kevin Risden (JIRA)

 [ 
https://issues.apache.org/jira/browse/RANGER-1938?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kevin Risden updated RANGER-1938:
-
Attachment: 0001-RANGER-1938-Enable-DocValues-for-more-fields-in-Solr.patch

> Solr for Audit setup doesn't use DocValues effectively
> --
>
> Key: RANGER-1938
> URL: https://issues.apache.org/jira/browse/RANGER-1938
> Project: Ranger
>  Issue Type: Bug
>  Components: audit
>Affects Versions: 0.6.0, 0.7.0, 0.6.1, 0.6.2, 0.6.3, 0.7.1
>Reporter: Kevin Risden
>Assignee: Kevin Risden
> Attachments: 
> 0001-RANGER-1938-Enable-DocValues-for-more-fields-in-Solr.patch
>
>
> Ranger uses Ambari Infra Solr (or another Apache Solr install) for storing 
> Ranger Audit events for displaying in Ranger Admin. In our case, we have 
> noticed quite a few Ambari Infra Solr OOM due to Ranger. I've talked with a 
> few other people who are having very similar problems with OOM errors.
> I've typed up some details about how the way Ranger is using Solr requires a 
> lot of heap. I've also outlined the fix for this which significantly reduced 
> the amount of heap memory required. I'm an Apache Lucene/Solr committer so 
> this optimization/usage might not be immediately obvious to those using Solr 
> especially version 5.x.
> https://risdenk.github.io/2017/12/18/ambari-infra-solr-ranger.html



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Assigned] (RANGER-1938) Solr for Audit setup doesn't use DocValues effectively

2017-12-19 Thread Kevin Risden (JIRA)

 [ 
https://issues.apache.org/jira/browse/RANGER-1938?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kevin Risden reassigned RANGER-1938:


Assignee: Kevin Risden

> Solr for Audit setup doesn't use DocValues effectively
> --
>
> Key: RANGER-1938
> URL: https://issues.apache.org/jira/browse/RANGER-1938
> Project: Ranger
>  Issue Type: Bug
>  Components: audit
>Affects Versions: 0.6.0, 0.7.0, 0.6.1, 0.6.2, 0.6.3, 0.7.1
>Reporter: Kevin Risden
>Assignee: Kevin Risden
>
> Ranger uses Ambari Infra Solr (or another Apache Solr install) for storing 
> Ranger Audit events for displaying in Ranger Admin. In our case, we have 
> noticed quite a few Ambari Infra Solr OOM due to Ranger. I've talked with a 
> few other people who are having very similar problems with OOM errors.
> I've typed up some details about how the way Ranger is using Solr requires a 
> lot of heap. I've also outlined the fix for this which significantly reduced 
> the amount of heap memory required. I'm an Apache Lucene/Solr committer so 
> this optimization/usage might not be immediately obvious to those using Solr 
> especially version 5.x.
> https://risdenk.github.io/2017/12/18/ambari-infra-solr-ranger.html



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (RANGER-1938) Solr for Audit setup doesn't use DocValues effectively

2017-12-19 Thread Kevin Risden (JIRA)

[ 
https://issues.apache.org/jira/browse/RANGER-1938?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16297553#comment-16297553
 ] 

Kevin Risden commented on RANGER-1938:
--

From this mailing list thread:

https://mail-archives.apache.org/mod_mbox/ranger-user/201712.mbox/%3CCAJU9nmjAZSuHdujNtOUbsAgtf4qG7YiJ46CnCceFbcUAyZmJWw%40mail.gmail.com%3E

> Solr for Audit setup doesn't use DocValues effectively
> --
>
> Key: RANGER-1938
> URL: https://issues.apache.org/jira/browse/RANGER-1938
> Project: Ranger
>  Issue Type: Bug
>  Components: audit
>Affects Versions: 0.6.0, 0.7.0, 0.6.1, 0.6.2, 0.6.3, 0.7.1
>Reporter: Kevin Risden
>
> Ranger uses Ambari Infra Solr (or another Apache Solr install) for storing 
> Ranger Audit events for displaying in Ranger Admin. In our case, we have 
> noticed quite a few Ambari Infra Solr OOM due to Ranger. I've talked with a 
> few other people who are having very similar problems with OOM errors.
> I've typed up some details about how the way Ranger is using Solr requires a 
> lot of heap. I've also outlined the fix for this which significantly reduced 
> the amount of heap memory required. I'm an Apache Lucene/Solr committer so 
> this optimization/usage might not be immediately obvious to those using Solr 
> especially version 5.x.
> https://risdenk.github.io/2017/12/18/ambari-infra-solr-ranger.html



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (RANGER-1938) Solr for Audit setup doesn't use DocValues effectively

2017-12-19 Thread Kevin Risden (JIRA)

 [ 
https://issues.apache.org/jira/browse/RANGER-1938?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kevin Risden updated RANGER-1938:
-
Summary: Solr for Audit setup doesn't use DocValues effectively  (was: Solr 
for Audit setup doesn't require DocValues)

> Solr for Audit setup doesn't use DocValues effectively
> --
>
> Key: RANGER-1938
> URL: https://issues.apache.org/jira/browse/RANGER-1938
> Project: Ranger
>  Issue Type: Bug
>  Components: audit
>Affects Versions: 0.6.0, 0.7.0, 0.6.1, 0.6.2, 0.6.3, 0.7.1
>Reporter: Kevin Risden
>
> Ranger uses Ambari Infra Solr (or another Apache Solr install) for storing 
> Ranger Audit events for displaying in Ranger Admin. In our case, we have 
> noticed quite a few Ambari Infra Solr OOM due to Ranger. I've talked with a 
> few other people who are having very similar problems with OOM errors.
> I've typed up some details about how the way Ranger is using Solr requires a 
> lot of heap. I've also outlined the fix for this which significantly reduced 
> the amount of heap memory required. I'm an Apache Lucene/Solr committer so 
> this optimization/usage might not be immediately obvious to those using Solr 
> especially version 5.x.
> https://risdenk.github.io/2017/12/18/ambari-infra-solr-ranger.html



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Created] (RANGER-1938) Solr for Audit setup doesn't require DocValues

2017-12-19 Thread Kevin Risden (JIRA)
Kevin Risden created RANGER-1938:


 Summary: Solr for Audit setup doesn't require DocValues
 Key: RANGER-1938
 URL: https://issues.apache.org/jira/browse/RANGER-1938
 Project: Ranger
  Issue Type: Bug
  Components: audit
Affects Versions: 0.7.1, 0.6.3, 0.6.2, 0.6.1, 0.7.0, 0.6.0
Reporter: Kevin Risden


Ranger uses Ambari Infra Solr (or another Apache Solr install) for storing 
Ranger Audit events for displaying in Ranger Admin. In our case, we have 
noticed quite a few Ambari Infra Solr OOM due to Ranger. I've talked with a few 
other people who are having very similar problems with OOM errors.

I've typed up some details about how the way Ranger is using Solr requires a 
lot of heap. I've also outlined the fix for this which significantly reduced 
the amount of heap memory required. I'm an Apache Lucene/Solr committer so this 
optimization/usage might not be immediately obvious to those using Solr 
especially version 5.x.

https://risdenk.github.io/2017/12/18/ambari-infra-solr-ranger.html



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (RANGER-1837) Enhance Ranger Audit to HDFS to support ORC file format

2017-11-11 Thread Kevin Risden (JIRA)

[ 
https://issues.apache.org/jira/browse/RANGER-1837?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16248614#comment-16248614
 ] 

Kevin Risden commented on RANGER-1837:
--

[~bosco] - Thanks, I left some feedback on review board. It looks like great 
progress so far.

> Enhance Ranger Audit to HDFS to support ORC file format
> ---
>
> Key: RANGER-1837
> URL: https://issues.apache.org/jira/browse/RANGER-1837
> Project: Ranger
>  Issue Type: Improvement
>  Components: audit
>Reporter: Kevin Risden
> Attachments: 
> 0001-RANGER-1837-Enhance-Ranger-Audit-to-HDFS-to-support-.patch
>
>
> My team has done some research and found that Ranger HDFS audits are:
> * Stored as JSON objects (one per line)
> * Not compressed
> This is currently very verbose and would benefit from compression since this 
> data is not frequently accessed. 
> From Bosco on the mailing list:
> {quote}You are right, currently one of the options is saving the audits in 
> HDFS itself as JSON files in one folder per day. I have loaded these JSON 
> files from the folder into Hive as compressed ORC format. The compressed 
> files in ORC were less than 10% of the original size. So, it was significant 
> decrease in size. Also, it is easier to run analytics on the Hive tables.
>  
> So, there are couple of ways of doing it.
>  
> Write an Oozie job which runs every night and loads the previous day worth 
> audit logs into ORC or other format
> Write a AuditDestination which can write into the format you want to.
>  
> Regardless which approach you take, this would be a good feature for 
> Ranger.{quote}
> http://mail-archives.apache.org/mod_mbox/ranger-user/201710.mbox/%3CCAJU9nmiYzzUUX1uDEysLAcMti4iLmX7RE%3DmN2%3DdoLaaQf87njQ%40mail.gmail.com%3E



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Comment Edited] (RANGER-1837) HDFS Audit Compression

2017-10-13 Thread Kevin Risden (JIRA)

[ 
https://issues.apache.org/jira/browse/RANGER-1837?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16203539#comment-16203539
 ] 

Kevin Risden edited comment on RANGER-1837 at 10/13/17 1:16 PM:


From Sean on the mailing list:
{quote}
I’ve been looking at the same. Even in small clusters the size of Ranger Audits 
is considerable. The files compress well. But compressed JSON will be difficult 
to query.
 
Would engineering Ranger to write directly to ORC be reasonable?
{quote}

Also from Bosco on the mailing list:
{quote}
If we write as ORC or other file format directly, then we have to see how to 
batch the audits. In the Audit V3 implementation, we did some optimization to 
avoid store (local write) and forward, instead build the batch in the memory 
itself and do bulk write (each Destination has different policies). But in the 
previous release, we did re-introduce an option to store and forward to HDFS 
due to HDFS file closure issue.
 
I personally don’t know what would be a good batch size. But we can build on 
top that code to write in the format we want to. And make the output write 
configurable to support different types.
{quote}

From Ramesh on the mailing list:
{quote}
+1 for your suggestion on having a Audit FileFormat as a feature in the Ranger 
Audit Framework.  

In that case HDFSAuditDestination should have the provision to use a FileFormat 
before writing, where as SolrDestination might not require this.  

Each configured AuditDestination can have a Format conversion before writing, 
we don’t need to have this format all the way from Audit generation point.
{quote}


was (Author: risdenk):
Also from Bosco on the mailing list:
{quote}
If we write as ORC or other file format directly, then we have to see how to 
batch the audits. In the Audit V3 implementation, we did some optimization to 
avoid store (local write) and forward, instead build the batch in the memory 
itself and do bulk write (each Destination has different policies). But in the 
previous release, we did re-introduce an option to store and forward to HDFS 
due to HDFS file closure issue.
 
I personally don’t know what would be a good batch size. But we can build on 
top that code to write in the format we want to. And make the output write 
configurable to support different types.
{quote}

From Ramesh on the mailing list:
{quote}
+1 for your suggestion on having a Audit FileFormat as a feature in the Ranger 
Audit Framework.  

In that case HDFSAuditDestination should have the provision to use a FileFormat 
before writing, where as SolrDestination might not require this.  

Each configured AuditDestination can have a Format conversion before writing, 
we don’t need to have this format all the way from Audit generation point.
{quote}

> HDFS Audit Compression
> --
>
> Key: RANGER-1837
> URL: https://issues.apache.org/jira/browse/RANGER-1837
> Project: Ranger
>  Issue Type: Improvement
>  Components: audit
>Reporter: Kevin Risden
>
> My team has done some research and found that Ranger HDFS audits are:
> * Stored as JSON objects (one per line)
> * Not compressed
> This is currently very verbose and would benefit from compression since this 
> data is not frequently accessed. 
> From Bosco on the mailing list:
> {quote}You are right, currently one of the options is saving the audits in 
> HDFS itself as JSON files in one folder per day. I have loaded these JSON 
> files from the folder into Hive as compressed ORC format. The compressed 
> files in ORC were less than 10% of the original size. So, it was significant 
> decrease in size. Also, it is easier to run analytics on the Hive tables.
>  
> So, there are couple of ways of doing it.
>  
> Write an Oozie job which runs every night and loads the previous day worth 
> audit logs into ORC or other format
> Write a AuditDestination which can write into the format you want to.
>  
> Regardless which approach you take, this would be a good feature for 
> Ranger.{quote}
> http://mail-archives.apache.org/mod_mbox/ranger-user/201710.mbox/%3CCAJU9nmiYzzUUX1uDEysLAcMti4iLmX7RE%3DmN2%3DdoLaaQf87njQ%40mail.gmail.com%3E



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (RANGER-1837) HDFS Audit Compression

2017-10-13 Thread Kevin Risden (JIRA)

[ 
https://issues.apache.org/jira/browse/RANGER-1837?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16203539#comment-16203539
 ] 

Kevin Risden commented on RANGER-1837:
--

Also from Bosco on the mailing list:
{quote}
f we write as ORC or other file format directly, then we have to see how to 
batch the audits. In the Audit V3 implementation, we did some optimization to 
avoid store (local write) and forward, instead build the batch in the memory 
itself and do bulk write (each Destination has different policies). But in the 
previous release, we did re-introduce an option to store and forward to HDFS 
due to HDFS file closure issue.
 
I personally don’t know what would be a good batch size. But we can build on 
top that code to write in the format we want to. And make the output write 
configurable to support different types.
{quote}

From Ramesh on the mailing list:
{quote}
+1 for your suggestion on having a Audit FileFormat as a feature in the Ranger 
Audit Framework.  

In that case HDFSAuditDestination should have the provision to use a FileFormat 
before writing, where as SolrDestination might not require this.  

Each configured AuditDestination can have a Format conversion before writing, 
we don’t need to have this format all the way from Audit generation point.
{quote}

> HDFS Audit Compression
> --
>
> Key: RANGER-1837
> URL: https://issues.apache.org/jira/browse/RANGER-1837
> Project: Ranger
>  Issue Type: Improvement
>  Components: audit
>Reporter: Kevin Risden
>
> My team has done some research and found that Ranger HDFS audits are:
> * Stored as JSON objects (one per line)
> * Not compressed
> This is currently very verbose and would benefit from compression since this 
> data is not frequently accessed. 
> From Bosco on the mailing list:
> {quote}You are right, currently one of the options is saving the audits in 
> HDFS itself as JSON files in one folder per day. I have loaded these JSON 
> files from the folder into Hive as compressed ORC format. The compressed 
> files in ORC were less than 10% of the original size. So, it was significant 
> decrease in size. Also, it is easier to run analytics on the Hive tables.
>  
> So, there are couple of ways of doing it.
>  
> Write an Oozie job which runs every night and loads the previous day worth 
> audit logs into ORC or other format
> Write a AuditDestination which can write into the format you want to.
>  
> Regardless which approach you take, this would be a good feature for 
> Ranger.{quote}
> http://mail-archives.apache.org/mod_mbox/ranger-user/201710.mbox/%3CCAJU9nmiYzzUUX1uDEysLAcMti4iLmX7RE%3DmN2%3DdoLaaQf87njQ%40mail.gmail.com%3E



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Comment Edited] (RANGER-1837) HDFS Audit Compression

2017-10-13 Thread Kevin Risden (JIRA)

[ 
https://issues.apache.org/jira/browse/RANGER-1837?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16203539#comment-16203539
 ] 

Kevin Risden edited comment on RANGER-1837 at 10/13/17 1:09 PM:


Also from Bosco on the mailing list:
{quote}
If we write as ORC or other file format directly, then we have to see how to 
batch the audits. In the Audit V3 implementation, we did some optimization to 
avoid store (local write) and forward, instead build the batch in the memory 
itself and do bulk write (each Destination has different policies). But in the 
previous release, we did re-introduce an option to store and forward to HDFS 
due to HDFS file closure issue.
 
I personally don’t know what would be a good batch size. But we can build on 
top that code to write in the format we want to. And make the output write 
configurable to support different types.
{quote}

From Ramesh on the mailing list:
{quote}
+1 for your suggestion on having a Audit FileFormat as a feature in the Ranger 
Audit Framework.  

In that case HDFSAuditDestination should have the provision to use a FileFormat 
before writing, where as SolrDestination might not require this.  

Each configured AuditDestination can have a Format conversion before writing, 
we don’t need to have this format all the way from Audit generation point.
{quote}


was (Author: risdenk):
Also from Bosco on the mailing list:
{quote}
f we write as ORC or other file format directly, then we have to see how to 
batch the audits. In the Audit V3 implementation, we did some optimization to 
avoid store (local write) and forward, instead build the batch in the memory 
itself and do bulk write (each Destination has different policies). But in the 
previous release, we did re-introduce an option to store and forward to HDFS 
due to HDFS file closure issue.
 
I personally don’t know what would be a good batch size. But we can build on 
top that code to write in the format we want to. And make the output write 
configurable to support different types.
{quote}

From Ramesh on the mailing list:
{quote}
+1 for your suggestion on having a Audit FileFormat as a feature in the Ranger 
Audit Framework.  

In that case HDFSAuditDestination should have the provision to use a FileFormat 
before writing, where as SolrDestination might not require this.  

Each configured AuditDestination can have a Format conversion before writing, 
we don’t need to have this format all the way from Audit generation point.
{quote}

> HDFS Audit Compression
> --
>
> Key: RANGER-1837
> URL: https://issues.apache.org/jira/browse/RANGER-1837
> Project: Ranger
>  Issue Type: Improvement
>  Components: audit
>Reporter: Kevin Risden
>
> My team has done some research and found that Ranger HDFS audits are:
> * Stored as JSON objects (one per line)
> * Not compressed
> This is currently very verbose and would benefit from compression since this 
> data is not frequently accessed. 
> From Bosco on the mailing list:
> {quote}You are right, currently one of the options is saving the audits in 
> HDFS itself as JSON files in one folder per day. I have loaded these JSON 
> files from the folder into Hive as compressed ORC format. The compressed 
> files in ORC were less than 10% of the original size. So, it was significant 
> decrease in size. Also, it is easier to run analytics on the Hive tables.
>  
> So, there are couple of ways of doing it.
>  
> Write an Oozie job which runs every night and loads the previous day worth 
> audit logs into ORC or other format
> Write a AuditDestination which can write into the format you want to.
>  
> Regardless which approach you take, this would be a good feature for 
> Ranger.{quote}
> http://mail-archives.apache.org/mod_mbox/ranger-user/201710.mbox/%3CCAJU9nmiYzzUUX1uDEysLAcMti4iLmX7RE%3DmN2%3DdoLaaQf87njQ%40mail.gmail.com%3E



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Created] (RANGER-1837) HDFS Audit Compression

2017-10-13 Thread Kevin Risden (JIRA)
Kevin Risden created RANGER-1837:


 Summary: HDFS Audit Compression
 Key: RANGER-1837
 URL: https://issues.apache.org/jira/browse/RANGER-1837
 Project: Ranger
  Issue Type: Improvement
  Components: audit
Reporter: Kevin Risden


My team has done some research and found that Ranger HDFS audits are:
* Stored as JSON objects (one per line)
* Not compressed

This is currently very verbose and would benefit from compression since this 
data is not frequently accessed. 

From Bosco on the mailing list:
{quote}You are right, currently one of the options is saving the audits in HDFS 
itself as JSON files in one folder per day. I have loaded these JSON files from 
the folder into Hive as compressed ORC format. The compressed files in ORC were 
less than 10% of the original size. So, it was significant decrease in size. 
Also, it is easier to run analytics on the Hive tables.
 
So, there are couple of ways of doing it.
 
Write an Oozie job which runs every night and loads the previous day worth 
audit logs into ORC or other format
Write a AuditDestination which can write into the format you want to.
 
Regardless which approach you take, this would be a good feature for 
Ranger.{quote}

http://mail-archives.apache.org/mod_mbox/ranger-user/201710.mbox/%3CCAJU9nmiYzzUUX1uDEysLAcMti4iLmX7RE%3DmN2%3DdoLaaQf87njQ%40mail.gmail.com%3E



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)