Re: catalogd and UserGroupInformation.getCurrentUser();

Bharath Vissapragada Thu, 03 Jan 2019 10:22:29 -0800

Agree with Tim's points. My opinion is also the same, given the current
Catalog architecture.
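
To make the point concrete, here is a toy Python sketch of a metadata cache
that is loaded once and then shared between users. It is not Impala code:
SharedCatalog, list_files, the impersonate flag, and the alice/bob users are
all made up for illustration, and the per-user file listing only stands in
for a custom FileSystem that filters by the current user.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class SharedCatalog:
    """Toy model of a metadata cache shared by every user (not Impala code)."""
    list_files: Callable[[str, str], List[str]]  # (table, user) -> visible files
    service_user: str = "impala"
    impersonate: bool = False
    _cache: Dict[str, List[str]] = field(default_factory=dict)

    def get_table(self, table: str, requesting_user: str) -> List[str]:
        # Metadata is loaded once per table and then reused for all users.
        if table not in self._cache:
            load_as = requesting_user if self.impersonate else self.service_user
            self._cache[table] = self.list_files(table, load_as)
        return self._cache[table]


# Hypothetical per-user visibility, standing in for a user-aware FileSystem.
VISIBLE = {
    "impala": ["a.parq", "b.parq"],
    "alice": ["a.parq"],
    "bob": ["b.parq"],
}


def list_files(table: str, user: str) -> List[str]:
    return list(VISIBLE[user])


by_design = SharedCatalog(list_files)
impersonating = SharedCatalog(list_files, impersonate=True)

# Loading as the service user: every user gets the same cached view.
same_view = by_design.get_table("t", "alice") == by_design.get_table("t", "bob")

# Loading as the requesting user: whoever touches the table first decides
# what everyone else sees until the table is reloaded.
alice_first = impersonating.get_table("t", "alice")
bob_sees = impersonating.get_table("t", "bob")
```

In this sketch bob_sees ends up being alice's file list, which is the kind
of issue Tim describes: with a shared cache, the cached table would depend
on which user happened to load it first.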


On Thu, Jan 3, 2019 at 10:17 AM Tim Armstrong <tarmstr...@cloudera.com>
wrote:

> Right, we could use requesting_user for logging, statistics, etc, but it
> would be problematic to impersonate that user when loading metadata.
>
> It's of course possible that I'm missing something here.
>
> On Thu, Jan 3, 2019 at 10:05 AM mhd wrk <mhdwrkoff...@gmail.com> wrote:
>
>> Thanks for the link. So the final answer is that even if the libhdfs bug
>> gets fixed there won't be any changes to Impala to expose requesting_user
>> in Catalog Service, right?
>>
>> On Thu, Jan 3, 2019 at 9:46 AM Tim Armstrong <tarmstr...@cloudera.com>
>> wrote:
>>
>>> >  catalog server ignores file system authorization model
>>> The catalog daemon does this by design - the idea is that the catalog
>>> server can load and cache metadata on behalf of multiple users. It requires
>>> that the catalogd user (usually "impala") has permissions to read
>>> filesystem metadata.
>>>
>>> The "user account requirements" section in our docs explains this:
>>> https://impala.apache.org/docs/build/html/topics/impala_prereqs.html#prereqs
>>> and
>>> https://impala.apache.org/docs/build/html/topics/impala_security_files.html
>>>
>>> On Wed, Jan 2, 2019 at 5:52 PM mhd wrk <mhdwrkoff...@gmail.com> wrote:
>>>
>>>> It's more about enforcing Hadoop file system authorization. All we have
>>>> done is implement a custom Hadoop File System
>>>> (org.apache.hadoop.fs.FileSystem), and we are now trying to use Impala to
>>>> query files hosted on that file system. It fails because catalog server
>>>> ignores file system authorization model. The same file system works
>>>> nicely with HDFS commands (e.g. hdfs dfs -ls ...) as well as HiveServer.
>>>>
>>>> Our clients expect us to enforce authorization at all levels (HDFS,
>>>> Accumulo, Hive, Impala, etc.).
>>>>
>>>> On Wed, Jan 2, 2019 at 4:56 PM Tim Armstrong <tarmstr...@cloudera.com>
>>>> wrote:
>>>>
>>>>> Stepping back for a second, doesn't what you're trying to do assume
>>>>> that each user will load metadata for each table separately? The whole
>>>>> point of the catalog server is that we load the metadata once and then
>>>>> share it between queries and users.
>>>>>
>>>>> I don't think we want to have the catalog server load different
>>>>> versions of a table depending on which user initially loaded the table?
>>>>> That would cause all sorts of issues.
>>>>>
>>>>> On Wed, Jan 2, 2019 at 12:36 PM mhd wrk <mhdwrkoff...@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> I see. I was wondering how it works inside HiveServer. Basically
>>>>>> this is an HDFS C API issue. Thanks for the elaborate explanation.
>>>>>>
>>>>>> On Wed, Jan 2, 2019 at 12:27 PM Shant Hovsepian <
>>>>>> sh...@arcadiadata.com> wrote:
>>>>>>
>>>>>>> The problem is mostly with libhdfs, as documented in HADOOP-12953.
>>>>>>>
>>>>>>> On a kerberized setup the service principal gets picked up. There
>>>>>>> are workarounds in the Java HDFS API, but the C-based one in libhdfs
>>>>>>> has this issue. Of course, caching HDFS will be trickier in Impala as
>>>>>>> well, but first this one API in libhdfs needs to be enhanced.
>>>>>>>
>>>>>>> Also, in general, having database authorization at the file level may
>>>>>>> not be a good idea or a clean design; using Sentry and extending its
>>>>>>> authorization mechanisms would be cleaner.
>>>>>>>
>>>>>>> -Shant
>>>>>>>
>>>>>>> On Wed, Jan 2, 2019, 12:21 PM mhd wrk <mhdwrkoff...@gmail.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Thanks for the further info. Not sure if our Product Management is OK,
>>>>>>>> at this point, with us patching Impala server to get our solution
>>>>>>>> working. Our product is supposed to work with already installed
>>>>>>>> servers.
>>>>>>>>
>>>>>>>> Any plans to address the gap (making requesting_user visible inside
>>>>>>>> catalog server) in a future release?
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> On Wed, Jan 2, 2019 at 11:50 AM Bharath Vissapragada <
>>>>>>>> bhara...@cloudera.com> wrote:
>>>>>>>>
>>>>>>>>> I was poking around in the code and it looks like we have most of
>>>>>>>>> the code in place
>>>>>>>>> <https://github.com/apache/impala/blob/27577dd652554dda5a03016e2d1e3ab66fe6b1f5/common/thrift/CatalogService.thrift#L47>
>>>>>>>>>
>>>>>>>>> // Common header included in all CatalogService requests.
>>>>>>>>> // TODO: The CatalogServiceVersion/protocol version should be part of the header.
>>>>>>>>> // This would require changes in BDR and break their compatibility story. We should
>>>>>>>>> // coordinate a joint change somewhere down the line.
>>>>>>>>> struct TCatalogServiceRequestHeader {
>>>>>>>>>   // The effective user who submitted this request.
>>>>>>>>>   1: optional string requesting_user
>>>>>>>>> }
>>>>>>>>>
>>>>>>>>> That header is included in all the RPCs. However, it is an optional
>>>>>>>>> field and may not be set in a few places (since we don't actually
>>>>>>>>> rely on it currently). So you could start by making it a "required"
>>>>>>>>> field and see what breaks. HTH.
>>>>>>>>>
>>>>>>>>> On Wed, Jan 2, 2019 at 11:35 AM Bharath Vissapragada <
>>>>>>>>> bhara...@cloudera.com> wrote:
>>>>>>>>>
>>>>>>>>>> I think we expose it via the UDF effective_user() (the effective
>>>>>>>>>> user could be different from the connected user if delegation/doas
>>>>>>>>>> is enabled). You can run a query like "select effective_user()" in
>>>>>>>>>> a session.
>>>>>>>>>>
>>>>>>>>>> You can also look it up in the /sessions page on the coordinator
>>>>>>>>>> web UI (<coordinator>:25000/sessions?json), where you can get a
>>>>>>>>>> JSON-formatted string containing the connected and delegate user
>>>>>>>>>> for each session.
>>>>>>>>>>
>>>>>>>>>> If you want it on the Catalog side, you probably have to plumb it
>>>>>>>>>> through the RPC calls (change the Thrift spec and pass it along
>>>>>>>>>> from the coordinator session handling code to the Catalog RPC code).
>>>>>>>>>>
>>>>>>>>>> On Wed, Jan 2, 2019 at 11:19 AM mhd wrk <mhdwrkoff...@gmail.com>
>>>>>>>>>> wrote:
>>>>>>>>>>
>>>>>>>>>>> Is there any Impala/Sentry specific API we can use inside our
>>>>>>>>>>> code to figure out who the current user is?
>>>>>>>>>>>
>>>>>>>>>>> On Wed, Jan 2, 2019 at 11:12 AM Bharath Vissapragada <
>>>>>>>>>>> bhara...@cloudera.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Yes. I think Jeszy is right. Per my understanding too, we don't
>>>>>>>>>>>> impersonate the client user on the Catalog server. Instead, we
>>>>>>>>>>>> enforce the authorization via Sentry during query planning.
>>>>>>>>>>>>
>>>>>>>>>>>> On Wed, Jan 2, 2019 at 7:06 AM mhd wrk <mhdwrkoff...@gmail.com>
>>>>>>>>>>>> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> IMPALA-2177 sounds like the correct issue.
>>>>>>>>>>>>> Here are log messages from authentication.cc for impalad and
>>>>>>>>>>>>> catalogd respectively:
>>>>>>>>>>>>>
>>>>>>>>>>>>> I0102 14:15:06.722666 28195 authentication.cc:478] Successfully authenticated client user "ad...@example.com"
>>>>>>>>>>>>> I0102 03:40:07.972348 27948 authentication.cc:445] Successfully authenticated principal "impala/cdh-...@example.com" on an internal connection
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> As you can see from the messages above, impalad is able to
>>>>>>>>>>>>> identify the currently connected user correctly. However,
>>>>>>>>>>>>> catalogd always authenticates as impala, which causes the
>>>>>>>>>>>>> problem.
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Wed, Jan 2, 2019 at 4:19 AM Jeszy <jes...@gmail.com> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> Hey,
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> If I understand your question correctly, this is a limitation.
>>>>>>>>>>>>>> IMPALA-2177 looks to be the appropriate jira.
>>>>>>>>>>>>>> Most users use Impala together with Sentry, where the
>>>>>>>>>>>>>> recommended approach is to disable impersonation (even in
>>>>>>>>>>>>>> services that allow it, like Hive).
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> HTH
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Wed, 2 Jan 2019 at 05:55, Bharath Vissapragada <
>>>>>>>>>>>>>> bhara...@cloudera.com> wrote:
>>>>>>>>>>>>>> >
>>>>>>>>>>>>>> > Hi,
>>>>>>>>>>>>>> >
>>>>>>>>>>>>>> > Can you add the stack trace here if possible? It is not
>>>>>>>>>>>>>> > super clear where exactly the problem is.
>>>>>>>>>>>>>> >
>>>>>>>>>>>>>> > Thanks,
>>>>>>>>>>>>>> > Bharath
>>>>>>>>>>>>>> >
>>>>>>>>>>>>>> > On Tue, Jan 1, 2019 at 6:34 PM mhd wrk <mhdwrkoff...@gmail.com> wrote:
>>>>>>>>>>>>>> >>
>>>>>>>>>>>>>> >> We have our own implementation of Hadoop FileSystem which
>>>>>>>>>>>>>> >> relies on the current user in a kerberized environment to
>>>>>>>>>>>>>> >> locate user-specific files in HDFS. This custom file system
>>>>>>>>>>>>>> >> works fine inside Hive to create external tables and query
>>>>>>>>>>>>>> >> them. However, trying to access the same tables via Impala
>>>>>>>>>>>>>> >> (JDBC driver) fails. From the log messages, it seems that
>>>>>>>>>>>>>> >> when impalad sends requests to catalogd to get the metadata
>>>>>>>>>>>>>> >> of a given table, the current user returned by
>>>>>>>>>>>>>> >> UserGroupInformation is the service account running the
>>>>>>>>>>>>>> >> server (impala/hostn...@example.com) instead of the
>>>>>>>>>>>>>> >> currently connected user.
>>>>>>>>>>>>>> >>
>>>>>>>>>>>>>> >> Is this a known issue or limitation of Impala?
>>>>>>>>>>>>>>
>>>>>>>>>>>>>
