+1 for Tim's opinion

We hit a similar issue when we enabled storage-based authorization
(without Sentry) for Hive: catalogd failed to load file metadata from
HDFS because user `impala` did not have x (execute) permission. We solved
this by adding `impala` to the `supergroup`.

Then we hit another issue: impalad could read all of Hive's files. I filed
IMPALA-7052 at the time. It can't be fixed until HADOOP-12953 is resolved,
because the HDFS C library (libhdfs) doesn't support impersonation.

As you mentioned, you've implemented your own Hadoop FileSystem. Maybe you
can build your own libhdfs with the patch from HADOOP-12953, then rebuild
impalad against that libhdfs and give it a try.
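
If you only need to see which user a session is running as, the
/sessions?json page Bharath mentions further down can be polled. A minimal
sketch in Python, assuming the payload has a top-level "sessions" list whose
entries carry "session_id", "user" and "delegated_user" (these field names
and both helper names are my assumptions, not confirmed in this thread), and
that the web UI is reachable over plain HTTP without authentication:

```python
import json
from urllib.request import urlopen


def effective_users(payload):
    """Map session id -> effective user from a parsed /sessions?json payload.

    Assumed layout (not confirmed by this thread): a top-level "sessions"
    list whose entries carry "session_id", "user" and "delegated_user".
    """
    result = {}
    for session in payload.get("sessions", []):
        connected = session.get("user", "")
        delegated = session.get("delegated_user", "")
        # With delegation/doas enabled, the delegated user is the effective one;
        # otherwise fall back to the connected user.
        result[session.get("session_id", "")] = delegated or connected
    return result


def fetch_sessions(coordinator_host, port=25000, timeout=10):
    """Fetch and parse the coordinator's sessions page (hypothetical helper)."""
    url = f"http://{coordinator_host}:{port}/sessions?json"
    with urlopen(url, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

Calling effective_users(fetch_sessions("<coordinator>")) would then give one
entry per open session. This only tells you who is connected on the impalad
side; it does not change which user catalogd uses when it talks to HDFS.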

On Thu, Jan 3, 2019 at 8:56 AM Tim Armstrong <tarmstr...@cloudera.com>
wrote:

> Stepping back for a second, doesn't what you're trying to do assume that
> each user will load metadata for each table separately? The whole point of
> the catalog server is that we load the metadata once and then share it
> between queries and users.
>
> I don't think we want to have the catalog server load different versions
> of a table depending on which user initially loaded the table? That would
> cause all sorts of issues.
>
> On Wed, Jan 2, 2019 at 12:36 PM mhd wrk <mhdwrkoff...@gmail.com> wrote:
>
>> I see. I was wondering how it works inside the Hive server. Basically this
>> is an HDFS C API issue. Thanks for the elaborate explanation.
>>
>> On Wed, Jan 2, 2019 at 12:27 PM Shant Hovsepian <sh...@arcadiadata.com>
>> wrote:
>>
>>> The problem is mostly with libhdfs, as documented in HADOOP-12953.
>>>
>>> On a kerberized setup the service principal gets picked up. There are
>>> workarounds in the Java HDFS API, but the C-based one in libhdfs has this
>>> issue. Of course caching HDFS will be trickier in Impala as well, but first
>>> this one API in libhdfs needs to be enhanced.
>>>
>>> Also, in general, doing database authorization at the file level may not
>>> be a good idea or a clean design; using Sentry and extending its
>>> authorization mechanisms would be cleaner.
>>>
>>> -Shant
>>>
>>> On Wed, Jan 2, 2019, 12:21 PM mhd wrk <mhdwrkoff...@gmail.com> wrote:
>>>
>>>> Thanks for further info. Not sure if our Product Management is OK, at
>>>> this point, with us patching Impala server to get our solution working. Our
>>>> product is supposed to work with already installed servers.
>>>>
>>>> Any plans to address the gap (making requesting_user visible inside
>>>> the catalog server) in a future release?
>>>>
>>>>
>>>>
>>>> On Wed, Jan 2, 2019 at 11:50 AM Bharath Vissapragada <
>>>> bhara...@cloudera.com> wrote:
>>>>
>>>>> I was poking around in the code and it looks like we have most of the code
>>>>> in place
>>>>> <https://github.com/apache/impala/blob/27577dd652554dda5a03016e2d1e3ab66fe6b1f5/common/thrift/CatalogService.thrift#L47>
>>>>>
>>>>> // Common header included in all CatalogService requests.
>>>>> // TODO: The CatalogServiceVersion/protocol version should be part of the header.
>>>>> // This would require changes in BDR and break their compatibility story. We should
>>>>> // coordinate a joint change somewhere down the line.
>>>>> struct TCatalogServiceRequestHeader {
>>>>>   // The effective user who submitted this request.
>>>>>   1: optional string requesting_user
>>>>> }
>>>>>
>>>>> That header is included in all the RPCs. However, it is an optional
>>>>> field and may not be set in a few places (since we don't actually rely on
>>>>> it currently). So you could start by making it a "required" field and see
>>>>> what breaks. HTH.
>>>>>
>>>>> On Wed, Jan 2, 2019 at 11:35 AM Bharath Vissapragada <
>>>>> bhara...@cloudera.com> wrote:
>>>>>
>>>>>> I think we expose it via the UDF effective_user() (the effective user
>>>>>> could be different from the connected user if delegation/doas is
>>>>>> enabled). You can run a query like "select effective_user()" in a
>>>>>> session.
>>>>>>
>>>>>> You can also look it up on the /sessions page of the coordinator web
>>>>>> UI (<coordinator>:25000/sessions?json), which returns a JSON-formatted
>>>>>> string containing the connected and delegated user for each session.
>>>>>>
>>>>>> If you want it on the Catalog side, you probably have to plumb it
>>>>>> through the RPC calls (change the thrift spec and pass it along from the
>>>>>> coordinator session handling code to the Catalog RPC code).
>>>>>>
>>>>>> On Wed, Jan 2, 2019 at 11:19 AM mhd wrk <mhdwrkoff...@gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>>> Is there any Impala/Sentry-specific API we can use inside our code
>>>>>>> to figure out who the current user is?
>>>>>>>
>>>>>>> On Wed, Jan 2, 2019 at 11:12 AM Bharath Vissapragada <
>>>>>>> bhara...@cloudera.com> wrote:
>>>>>>>
>>>>>>>> Yes. I think Jeszy is right. Per my understanding too, we don't
>>>>>>>> impersonate the client user on the Catalog server. Instead, we enforce
>>>>>>>> the authorization via Sentry during query planning.
>>>>>>>>
>>>>>>>> On Wed, Jan 2, 2019 at 7:06 AM mhd wrk <mhdwrkoff...@gmail.com>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> IMPALA-2177 sounds like the correct issue.
>>>>>>>>> Here are log messages from authentication.cc for impalad and
>>>>>>>>> catalogd respectively:
>>>>>>>>>
>>>>>>>>> I0102 14:15:06.722666 28195 authentication.cc:478] Successfully
>>>>>>>>>> authenticated client user *"ad...@example.com
>>>>>>>>>> <ad...@example.com>"*
>>>>>>>>>> I0102 03:40:07.972348 27948 authentication.cc:445] Successfully
>>>>>>>>>> authenticated principal *"impala/cdh-...@example.com
>>>>>>>>>> <cdh-...@example.com>"* on an internal connection
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> As you can see from the messages above, impalad is able to
>>>>>>>>> identify the currently connected user correctly. However, catalogd
>>>>>>>>> always authenticates as impala, which causes the problem.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Wed, Jan 2, 2019 at 4:19 AM Jeszy <jes...@gmail.com> wrote:
>>>>>>>>>
>>>>>>>>>> Hey,
>>>>>>>>>>
>>>>>>>>>> If I understand your question correctly, this is a limitation.
>>>>>>>>>> IMPALA-2177 looks to be the appropriate JIRA.
>>>>>>>>>> Most users use Impala together with Sentry, where the recommended
>>>>>>>>>> approach is to disable impersonation (even in services that allow
>>>>>>>>>> it, like Hive).
>>>>>>>>>>
>>>>>>>>>> HTH
>>>>>>>>>>
>>>>>>>>>> On Wed, 2 Jan 2019 at 05:55, Bharath Vissapragada <
>>>>>>>>>> bhara...@cloudera.com> wrote:
>>>>>>>>>> >
>>>>>>>>>> > Hi,
>>>>>>>>>> >
>>>>>>>>>> > Can you add the stack trace here if possible? It is not super
>>>>>>>>>> > clear where exactly the problem is.
>>>>>>>>>> >
>>>>>>>>>> > Thanks,
>>>>>>>>>> > Bharath
>>>>>>>>>> >
>>>>>>>>>> > On Tue, Jan 1, 2019 at 6:34 PM mhd wrk <mhdwrkoff...@gmail.com>
>>>>>>>>>> > wrote:
>>>>>>>>>> >>
>>>>>>>>>> >> We have our own implementation of Hadoop FileSystem, which
>>>>>>>>>> >> relies on the current user in a kerberized environment to locate
>>>>>>>>>> >> user-specific files in HDFS. This custom file system works fine
>>>>>>>>>> >> inside Hive to create external tables and query them. However,
>>>>>>>>>> >> trying to access the same tables via Impala (JDBC driver) fails.
>>>>>>>>>> >> From the log messages, it seems that when impalad sends requests
>>>>>>>>>> >> to catalogd to get the metadata of a given table, the current
>>>>>>>>>> >> user returned by UserGroupInformation is the service account
>>>>>>>>>> >> running the server (impala/hostn...@example.com) instead of the
>>>>>>>>>> >> currently connected user.
>>>>>>>>>> >>
>>>>>>>>>> >> Is this a known issue or limitation of Impala?
>>>>>>>>>>
>>>>>>>>>
