Julien,

Interesting. Thanks for sharing. I was under the impression Spark would not be aware of hive.security.authorization.manager.

Regards,
Dilli
On Tue, Jan 19, 2016 at 7:10 AM, Julien Carme <[email protected]> wrote:

> Hello,
>
> I am replying to myself, as I am happy to say that I have solved my
> problem and have been able to access Hive tables from SparkSQL with
> Ranger enabled. Policies defined in Ranger are properly enforced in
> Spark.
>
> So here is how to do it (assuming you have been able to make it work
> without Ranger):
> - Check that you have set hive.security.authorization.manager=
>   org.apache.ranger.authorization.hive.authorizer.RangerHiveAuthorizerFactory
> - Get ranger-hive-security.xml and ranger-hive-audit.xml from your
>   Ranger Hive plugin folder and copy them into your Spark conf
>   directory.
> - Add these jars from your Ranger distribution to your classpath (or
>   use the --driver-class-path argument for Spark): ranger-hive-plugin,
>   ranger-plugins-common, ranger-plugins-audit, guava.
>
> That's all. It should work.
>
> The only thing that still bothers me a little is that SparkSQL does not
> handle doAs=false. That is not surprising, considering Spark is run by
> the user and not by a server process owned by a system user. So I am
> afraid it will be an issue with Ranger: all tables written through Hive
> will be owned by hive, but all tables written with Spark will be owned
> by the user who wrote them. We have to find a solution for that.
>
> Regards,
>
> Julien
>
>
> 2016-01-19 13:38 GMT+01:00 Julien Carme <[email protected]>:
>
>> Hello,
>>
>> Thanks Madhan and Bosco for your answers.
>>
>> I am using HDP 2.3 and installed Ranger from Ambari. I suppose Ambari
>> does run enable-hive-plugin, as Ranger works correctly with Hive when
>> I use it through HiveServer2. It is only when I try to use it from
>> Spark (using SparkSQL) that it does not work.
>>
>> SparkSQL does not use HiveServer2, but it does not use the Hive CLI
>> either (at least not directly); the Hive engine is not used at all.
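[Editor's note: the three setup steps above can be sketched as a single spark-shell invocation. This is a sketch only; the plugin paths and jar versions below are assumptions and will vary by Ranger/HDP installation.]

```shell
# Sketch: run SparkSQL with the Ranger Hive plugin on the driver classpath.
# RANGER_LIB, file locations and jar versions are assumptions -- adjust
# them to your own Ranger distribution.
RANGER_LIB=/usr/hdp/current/ranger-hive-plugin/lib

# Step 2: copy the Ranger plugin config files into Spark's conf directory.
cp /etc/hive/conf/ranger-hive-security.xml /etc/spark/conf/
cp /etc/hive/conf/ranger-hive-audit.xml    /etc/spark/conf/

# Steps 1 and 3: hive-site.xml (visible to Spark) already registers the
# RangerHiveAuthorizerFactory; add the Ranger jars to the driver classpath.
spark-shell \
  --driver-class-path "$RANGER_LIB/ranger-hive-plugin-0.5.0.jar:\
$RANGER_LIB/ranger-plugins-common-0.5.0.jar:\
$RANGER_LIB/ranger-plugins-audit-0.5.0.jar:\
$RANGER_LIB/guava-11.0.2.jar"
```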
>> SparkSQL is a standalone SQL engine which is part of Spark: it reads
>> Hive tables directly from where they are stored, using metadata it
>> gets from HCatalog. At least, that is my understanding.
>>
>> Until recently, SparkSQL ignored Ranger, just like the Hive CLI does,
>> and it worked (I could access Hive data from Spark on a cluster with
>> Ranger up, but of course the Ranger rules were ignored). But since a
>> recent update, SparkSQL clearly does interact with Ranger, as I get
>> Ranger exceptions when I use SparkSQL. I think it reads the value of
>> hive.security.authorization.manager (which on my system is a Ranger
>> class) and instantiates that class in order to enforce the security
>> rules it defines. I am no expert in Spark internals or Ranger; these
>> are just assumptions.
>>
>> I have solved multiple classpath (Ranger jar not found) and
>> configuration file (xa-secure.xml?) issues to reach the point where I
>> am now. I no longer get missing-class or missing-file exceptions, but
>> it still does not work, and I get the issue described in my previous
>> mail (see below).
>>
>> I will continue my investigation. If I make progress I will post it
>> here, but any additional help would be appreciated.
>>
>> Best regards,
>>
>> Julien
>>
>>
>> 2016-01-18 22:24 GMT+01:00 Don Bosco Durai <[email protected]>:
>>
>>> Ideally, Ranger shouldn't be in play when the Hive CLI is used. If I
>>> am not wrong, Spark uses the Hive CLI API.
>>>
>>> To avoid this issue, I thought we only update hiveserver2.properties.
>>> Julien, I assume you are using the standard enable-plugin scripts.
>>>
>>> Thanks,
>>>
>>> Bosco
>>>
>>>
>>> From: Madhan Neethiraj <[email protected]> on behalf of Madhan
>>> Neethiraj <[email protected]>
>>> Reply-To: <[email protected]>
>>> Date: Monday, January 18, 2016 at 9:54 AM
>>> To: "[email protected]" <[email protected]>
>>> Subject: Re: Spark + Hive + Ranger
>>>
>>> Julien,
>>>
>>> The Ranger Hive plugin requires additional configuration, such as the
>>> location of Ranger Admin, the name of the service containing the
>>> policies for Hive, etc. This configuration (in files named
>>> ranger-*.xml) is created when the enable-hive-plugin.sh script is run
>>> with appropriate values in install.properties. The script also
>>> updates hive-site.xml with the necessary changes, such as registering
>>> Ranger as the authorizer in hive.security.authorization.manager. If
>>> you haven't installed the plugin using enable-hive-plugin.sh, please
>>> do so and let us know the result.
>>>
>>> Hope this helps.
>>>
>>> Madhan
>>>
>>>
>>> From: Julien Carme <[email protected]>
>>> Reply-To: "[email protected]" <[email protected]>
>>> Date: Monday, January 18, 2016 at 9:27 AM
>>> To: "[email protected]" <[email protected]>
>>> Subject: Spark + Hive + Ranger
>>>
>>> Hello,
>>>
>>> I am trying to access Hive from Spark in a Hadoop cluster where I use
>>> Ranger to control access to Hive.
>>>
>>> As Ranger is installed, I have set up Hive accordingly:
>>>
>>> hive.security.authorization.manager=
>>>     org.apache.ranger.authorization.hive.authorizer.RangerHiveAuthorizerFactory
>>>
>>> When I run Spark and ask it to access a Hive table, it uses this
>>> class but I get several errors:
>>>
>>> 16/01/18 17:51:50 INFO provider.AuditProviderFactory: No v3 audit
>>> configuration found. Trying v2 audit configurations
>>> 16/01/18 17:51:50 ERROR util.PolicyRefresher:
>>> PolicyRefresher(serviceName=null): failed to refresh policies. Will
>>> continue to use last known version of policies (-1)
>>> com.sun.jersey.api.client.ClientHandlerException:
>>> java.lang.IllegalArgumentException: URI is not absolute
>>>     at com.sun.jersey.client.urlconnection.URLConnectionClientHandler.handle(URLConnectionClientHandler.java:149)
>>>     at com.sun.jersey.api.client.Client.handle(Client.java:648)
>>>     at com.sun.jersey.api.client.WebResource.handle(WebResource.java:670)
>>>     at com.sun.jersey.api.client.WebResource.access$200(WebResource.java:74)
>>>     at com.sun.jersey.api.client.WebResource$Builder.get(WebResource.java:503)
>>>     at org.apache.ranger.admin.client.RangerAdminRESTClient.getServicePoliciesIfUpdated(RangerAdminRESTClient.java:71)
>>>     at org.apache.ranger.plugin.util.PolicyRefresher.loadPolicyfromPolicyAdmin(PolicyRefresher.java:205)
>>>
>>> And then (though it is not at all clear that the two errors are
>>> connected):
>>>
>>> 16/01/18 17:51:50 INFO ql.Driver: Starting task [Stage-0:DDL] in
>>> serial mode
>>> 16/01/18 17:51:50 ERROR authorizer.RangerHiveAuthorizer:
>>> filterListCmdObjects: Internal error: null RangerAccessResult object
>>> received back from isAccessAllowed()!
>>> (the same error is logged three times)
>>>
>>> And then the access to Hive tables fails.
>>>
>>> I am not sure where to go from there. Any help would be appreciated.
>>>
>>> Best Regards,
>>>
>>> Julien
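[Editor's note: the "URI is not absolute" failure, together with PolicyRefresher(serviceName=null), suggests the plugin could not read its Ranger configuration, i.e. the ranger-hive-security.xml that gets copied into Spark's conf directory. A minimal sketch of the relevant properties follows; the host, port and service name are assumptions and must match your own Ranger Admin setup.]

```xml
<!-- Sketch of ranger-hive-security.xml. Values below are placeholders:
     point policy.rest.url at your Ranger Admin and service.name at the
     Hive service defined there. If policy.rest.url is empty, the plugin
     issues a REST call against an empty URI, which matches the
     "URI is not absolute" exception quoted above. -->
<configuration>
  <property>
    <name>ranger.plugin.hive.policy.rest.url</name>
    <value>http://ranger-admin.example.com:6080</value>
  </property>
  <property>
    <name>ranger.plugin.hive.service.name</name>
    <value>hivedev</value>
  </property>
  <property>
    <name>ranger.plugin.hive.policy.source.impl</name>
    <value>org.apache.ranger.admin.client.RangerAdminRESTClient</value>
  </property>
  <property>
    <name>ranger.plugin.hive.policy.cache.dir</name>
    <value>/etc/ranger/hivedev/policycache</value>
  </property>
</configuration>
```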
