Julien,

Interesting. Thanks for sharing. I was under the impression Spark would not be aware of hive.security.authorization.manager.

Regards,
Dilli
On Tue, Jan 19, 2016 at 7:10 AM, Julien Carme <[email protected]> wrote:

> Hello,
>
> I am replying to myself, as I am happy to say that I have solved my
> problem and have been able to access Hive tables from SparkSQL with
> Ranger enabled. Policies defined in Ranger are properly enforced in
> Spark.
>
> So here is how to do it (assuming you have been able to make it work
> without Ranger):
> - Check that you have set hive.security.authorization.manager=
>   org.apache.ranger.authorization.hive.authorizer.RangerHiveAuthorizerFactory
> - Get ranger-hive-security.xml and ranger-hive-audit.xml from your
>   Ranger Hive plugin folder and copy them into your Spark conf
>   directory.
> - Add these jars from your Ranger distribution to your classpath (or
>   use the --driver-class-path argument for Spark): ranger-hive-plugin,
>   ranger-plugins-common, ranger-plugins-audit, guava.
>
> That's all. It should work.
>
> The only thing that still bothers me a little is that SparkSQL does not
> handle doAs=false. That is not surprising, considering Spark is run by
> the user and not by a server process owned by a system user. So I am
> afraid it will be an issue with Ranger: all tables written through Hive
> will be owned by hive, but all tables written with Spark will be owned
> by the user who wrote them. We have to find a solution for that.
>
> Regards,
>
> Julien
>
>
> 2016-01-19 13:38 GMT+01:00 Julien Carme <[email protected]>:
>
>> Hello,
>>
>> Thanks Madhan and Bosco for your answers.
>>
>> I am using HDP 2.3 and installed Ranger from Ambari. I suppose Ambari
>> does run enable-hive-plugin, as Ranger works correctly with Hive when
>> I use it through HiveServer2. It is only when I try to use it from
>> Spark (using SparkSQL) that it does not work.
>>
>> SparkSQL does not use HiveServer2, but it does not use the Hive CLI
>> either (at least not directly); the Hive engine is not used at all.
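[Editor's note: the three setup steps above can be sketched as a single spark-shell invocation. This is a sketch only; the plugin paths and jar versions below are assumptions and will vary by Ranger/HDP installation.]

```shell
# Sketch: run SparkSQL with the Ranger Hive plugin on the driver classpath.
# RANGER_LIB, file locations and jar versions are assumptions -- adjust
# them to your own Ranger distribution.
RANGER_LIB=/usr/hdp/current/ranger-hive-plugin/lib

# Step 2: copy the Ranger plugin config files into Spark's conf directory.
cp /etc/hive/conf/ranger-hive-security.xml /etc/spark/conf/
cp /etc/hive/conf/ranger-hive-audit.xml    /etc/spark/conf/

# Steps 1 and 3: hive-site.xml (visible to Spark) already registers the
# RangerHiveAuthorizerFactory; add the Ranger jars to the driver classpath.
spark-shell \
  --driver-class-path "$RANGER_LIB/ranger-hive-plugin-0.5.0.jar:\
$RANGER_LIB/ranger-plugins-common-0.5.0.jar:\
$RANGER_LIB/ranger-plugins-audit-0.5.0.jar:\
$RANGER_LIB/guava-11.0.2.jar"
```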
>> SparkSQL is a standalone SQL engine which is part of Spark: it reads
>> Hive tables directly from where they are stored, using metadata it
>> gets from HCatalog. At least, that is my understanding.
>>
>> Until recently, SparkSQL ignored Ranger, just like the Hive CLI does,
>> and it worked (I could access Hive data from Spark on a cluster with
>> Ranger up, but of course the Ranger rules were ignored). But since a
>> recent update, SparkSQL clearly does interact with Ranger, as I get
>> Ranger exceptions when I use SparkSQL. I think it reads the value of
>> hive.security.authorization.manager (which on my system is a Ranger
>> class) and instantiates that class in order to enforce the security
>> rules it defines. I am no expert in Spark internals or Ranger; these
>> are just assumptions.
>>
>> I have solved multiple classpath (Ranger jar not found) and
>> configuration file (xa-secure.xml?) issues to reach the point where I
>> am now. I no longer get missing-class or missing-file exceptions, but
>> it still does not work, and I get the issue described in my previous
>> mail (see below).
>>
>> I will continue my investigation. If I make progress I will post it
>> here, but any additional help would be appreciated.
>>
>> Best regards,
>>
>> Julien
>>
>>
>> 2016-01-18 22:24 GMT+01:00 Don Bosco Durai <[email protected]>:
>>
>>> Ideally, Ranger shouldn't be in play when the Hive CLI is used. If I
>>> am not wrong, Spark uses the Hive CLI API.
>>>
>>> To avoid this issue, I thought we only update hiveserver2.properties.
>>> Julien, I assume you are using the standard enable-plugin scripts.
>>>
>>> Thanks,
>>>
>>> Bosco
>>>
>>>
>>> From: Madhan Neethiraj <[email protected]> on behalf of Madhan
>>> Neethiraj <[email protected]>
>>> Reply-To: <[email protected]>
>>> Date: Monday, January 18, 2016 at 9:54 AM
>>> To: "[email protected]" <[email protected]>
>>> Subject: Re: Spark + Hive + Ranger
>>>
>>> Julien,
>>>
>>> The Ranger Hive plugin requires additional configuration, such as the
>>> location of Ranger Admin, the name of the service containing the
>>> policies for Hive, etc. This configuration (in files named
>>> ranger-*.xml) is created when the enable-hive-plugin.sh script is run
>>> with appropriate values in install.properties. The script also
>>> updates hive-site.xml with the necessary changes, such as registering
>>> Ranger as the authorizer in hive.security.authorization.manager. If
>>> you haven't installed the plugin using enable-hive-plugin.sh, please
>>> do so and let us know the result.
>>>
>>> Hope this helps.
>>>
>>> Madhan
>>>
>>>
>>> From: Julien Carme <[email protected]>
>>> Reply-To: "[email protected]" <[email protected]>
>>> Date: Monday, January 18, 2016 at 9:27 AM
>>> To: "[email protected]" <[email protected]>
>>> Subject: Spark + Hive + Ranger
>>>
>>> Hello,
>>>
>>> I am trying to access Hive from Spark in a Hadoop cluster where I use
>>> Ranger to control access to Hive.
>>>
>>> As Ranger is installed, I have set up Hive accordingly:
>>>
>>> hive.security.authorization.manager=
>>>     org.apache.ranger.authorization.hive.authorizer.RangerHiveAuthorizerFactory
>>>
>>> When I run Spark and ask it to access a Hive table, it uses this
>>> class but I get several errors:
>>>
>>> 16/01/18 17:51:50 INFO provider.AuditProviderFactory: No v3 audit
>>> configuration found. Trying v2 audit configurations
>>> 16/01/18 17:51:50 ERROR util.PolicyRefresher:
>>> PolicyRefresher(serviceName=null): failed to refresh policies. Will
>>> continue to use last known version of policies (-1)
>>> com.sun.jersey.api.client.ClientHandlerException:
>>> java.lang.IllegalArgumentException: URI is not absolute
>>>     at com.sun.jersey.client.urlconnection.URLConnectionClientHandler.handle(URLConnectionClientHandler.java:149)
>>>     at com.sun.jersey.api.client.Client.handle(Client.java:648)
>>>     at com.sun.jersey.api.client.WebResource.handle(WebResource.java:670)
>>>     at com.sun.jersey.api.client.WebResource.access$200(WebResource.java:74)
>>>     at com.sun.jersey.api.client.WebResource$Builder.get(WebResource.java:503)
>>>     at org.apache.ranger.admin.client.RangerAdminRESTClient.getServicePoliciesIfUpdated(RangerAdminRESTClient.java:71)
>>>     at org.apache.ranger.plugin.util.PolicyRefresher.loadPolicyfromPolicyAdmin(PolicyRefresher.java:205)
>>>
>>> And then (though it is not at all clear that the two errors are
>>> connected):
>>>
>>> 16/01/18 17:51:50 INFO ql.Driver: Starting task [Stage-0:DDL] in
>>> serial mode
>>> 16/01/18 17:51:50 ERROR authorizer.RangerHiveAuthorizer:
>>> filterListCmdObjects: Internal error: null RangerAccessResult object
>>> received back from isAccessAllowed()!
>>> (the same error is logged three times)
>>>
>>> And then the access to Hive tables fails.
>>>
>>> I am not sure where to go from there. Any help would be appreciated.
>>>
>>> Best Regards,
>>>
>>> Julien
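[Editor's note: the "URI is not absolute" failure, together with PolicyRefresher(serviceName=null), suggests the plugin could not read its Ranger configuration, i.e. the ranger-hive-security.xml that gets copied into Spark's conf directory. A minimal sketch of the relevant properties follows; the host, port and service name are assumptions and must match your own Ranger Admin setup.]

```xml
<!-- Sketch of ranger-hive-security.xml. Values below are placeholders:
     point policy.rest.url at your Ranger Admin and service.name at the
     Hive service defined there. If policy.rest.url is empty, the plugin
     issues a REST call against an empty URI, which matches the
     "URI is not absolute" exception quoted above. -->
<configuration>
  <property>
    <name>ranger.plugin.hive.policy.rest.url</name>
    <value>http://ranger-admin.example.com:6080</value>
  </property>
  <property>
    <name>ranger.plugin.hive.service.name</name>
    <value>hivedev</value>
  </property>
  <property>
    <name>ranger.plugin.hive.policy.source.impl</name>
    <value>org.apache.ranger.admin.client.RangerAdminRESTClient</value>
  </property>
  <property>
    <name>ranger.plugin.hive.policy.cache.dir</name>
    <value>/etc/ranger/hivedev/policycache</value>
  </property>
</configuration>
```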
