Yeah, I can see the use case. The challenge is always finding people to do
the hard work of building and maintaining features.

On Thu, Jan 3, 2019 at 10:48 AM mhd wrk <mhdwrkoff...@gmail.com> wrote:

> I understand the reasoning behind the design decision, which requires making
> files available to a certain user. However, there are clients in certain
> industries who are OK with an acceptable performance hit (which might be
> caused by loading/caching metadata per user) as long as they can have
> user-specific permissions at all storage levels (HDFS, Accumulo and ....).
>
> IMO, Impala should make this possible as a configuration option.
>
>
>
> On Thu, Jan 3, 2019 at 10:22 AM Bharath Vissapragada <
> bhara...@cloudera.com> wrote:
>
>> Agree with Tim's points. My opinion is also the same, given the current
>> Catalog architecture.
>>
>> On Thu, Jan 3, 2019 at 10:17 AM Tim Armstrong <tarmstr...@cloudera.com>
>> wrote:
>>
>>> Right, we could use requesting_user for logging, statistics, etc, but it
>>> would be problematic to impersonate that user when loading metadata.
>>>
>>> It's of course possible that I'm missing something here.
>>>
>>> On Thu, Jan 3, 2019 at 10:05 AM mhd wrk <mhdwrkoff...@gmail.com> wrote:
>>>
>>>> Thanks for the link. So the final answer is that even if the libhdfs
>>>> bug gets fixed, there won't be any changes to Impala to expose
>>>> requesting_user in the Catalog Service, right?
>>>>
>>>> On Thu, Jan 3, 2019 at 9:46 AM Tim Armstrong <tarmstr...@cloudera.com>
>>>> wrote:
>>>>
>>>>> >  catalog server ignores file system authorization model
>>>>> The catalog daemon does this by design - the idea is that the catalog
>>>>> server can load and cache metadata on behalf of multiple users. It
>>>>> requires that the catalogd user (usually "impala") has permissions to
>>>>> read filesystem metadata.
>>>>>
>>>>> The "user account requirements" section in our docs explains this:
>>>>> https://impala.apache.org/docs/build/html/topics/impala_prereqs.html#prereqs
>>>>> and
>>>>> https://impala.apache.org/docs/build/html/topics/impala_security_files.html
>>>>>
>>>>> On Wed, Jan 2, 2019 at 5:52 PM mhd wrk <mhdwrkoff...@gmail.com> wrote:
>>>>>
>>>>>> It's more about enforcing Hadoop file system authorisation. All we
>>>>>> have done is implement a custom Hadoop file system (
>>>>>> org.apache.hadoop.fs.FileSystem), and we are now trying to use Impala
>>>>>> to query files hosted on that file system. It fails because the
>>>>>> catalog server ignores the file system authorization model. The same
>>>>>> file system works nicely with HDFS commands (e.g. hdfs dfs -ls ...) as
>>>>>> well as HiveServer.
>>>>>>
>>>>>> Our clients expect us to enforce authorization at all levels (HDFS,
>>>>>> Accumulo, Hive, Impala and ....)
>>>>>>
>>>>>> On Wed, Jan 2, 2019 at 4:56 PM Tim Armstrong <tarmstr...@cloudera.com>
>>>>>> wrote:
>>>>>>
>>>>>>> Stepping back for a second, doesn't what you're trying to do assume
>>>>>>> that each user will load metadata for each table separately? The whole
>>>>>>> point of the catalog server is that we load the metadata once and then
>>>>>>> share it between queries and users.
>>>>>>>
>>>>>>> I don't think we want to have the catalog server load different
>>>>>>> versions of a table depending on which user initially loaded the table?
>>>>>>> That would cause all sorts of issues.
>>>>>>>
>>>>>>> On Wed, Jan 2, 2019 at 12:36 PM mhd wrk <mhdwrkoff...@gmail.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> I see. I was wondering how it works inside HiveServer. Basically
>>>>>>>> this is an HDFS C API issue. Thanks for the elaborate explanation.
>>>>>>>>
>>>>>>>> On Wed, Jan 2, 2019 at 12:27 PM Shant Hovsepian <
>>>>>>>> sh...@arcadiadata.com> wrote:
>>>>>>>>
>>>>>>>>> The problem is mostly with libhdfs, as documented in HADOOP-12953.
>>>>>>>>>
>>>>>>>>> On a Kerberized setup the service principal gets picked up. There
>>>>>>>>> are workarounds in the Java HDFS API, but the C-based one in libhdfs
>>>>>>>>> has this issue. Of course caching HDFS will be trickier in Impala as
>>>>>>>>> well, but first this one API in libhdfs needs to be enhanced.
>>>>>>>>>
>>>>>>>>> Also, in general, having database authorization at the file level
>>>>>>>>> may not be a good idea or clean design, and using Sentry and
>>>>>>>>> extending its authorization mechanisms would be cleaner.
>>>>>>>>>
>>>>>>>>> -Shant
>>>>>>>>>
>>>>>>>>> On Wed, Jan 2, 2019, 12:21 PM mhd wrk <mhdwrkoff...@gmail.com>
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>>> Thanks for the further info. I'm not sure our Product Management is
>>>>>>>>>> OK, at this point, with us patching the Impala server to get our
>>>>>>>>>> solution working. Our product is supposed to work with already
>>>>>>>>>> installed servers.
>>>>>>>>>>
>>>>>>>>>> Are there any plans to address the gap (making requesting_user
>>>>>>>>>> visible inside the catalog server) in a future release?
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On Wed, Jan 2, 2019 at 11:50 AM Bharath Vissapragada <
>>>>>>>>>> bhara...@cloudera.com> wrote:
>>>>>>>>>>
>>>>>>>>>>> I was poking around in the code and it looks like we have most
>>>>>>>>>>> of the code in place
>>>>>>>>>>> <https://github.com/apache/impala/blob/27577dd652554dda5a03016e2d1e3ab66fe6b1f5/common/thrift/CatalogService.thrift#L47>
>>>>>>>>>>>
>>>>>>>>>>> // Common header included in all CatalogService requests.
>>>>>>>>>>> // TODO: The CatalogServiceVersion/protocol version should be part of the header.
>>>>>>>>>>> // This would require changes in BDR and break their compatibility story. We should
>>>>>>>>>>> // coordinate a joint change somewhere down the line.
>>>>>>>>>>> struct TCatalogServiceRequestHeader {
>>>>>>>>>>>   // The effective user who submitted this request.
>>>>>>>>>>>   1: optional string requesting_user
>>>>>>>>>>> }
>>>>>>>>>>>
>>>>>>>>>>> That header is included in all the RPCs. However, that is an
>>>>>>>>>>> optional field and may not be set in a few places (since we don't
>>>>>>>>>>> actually rely on it currently). So you could start by making it a
>>>>>>>>>>> "required" field and see what breaks. HTH.
>>>>>>>>>>>
>>>>>>>>>>> On Wed, Jan 2, 2019 at 11:35 AM Bharath Vissapragada <
>>>>>>>>>>> bhara...@cloudera.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> I think we expose it via the UDF effective_user() (the effective
>>>>>>>>>>>> user could be different from the connected user if delegation/doas
>>>>>>>>>>>> is enabled). You can run a query like "select effective_user()" in
>>>>>>>>>>>> a session.
>>>>>>>>>>>>
>>>>>>>>>>>> You can also look it up on the /sessions page of the coordinator
>>>>>>>>>>>> web UI (<coordinator>:25000/sessions?json), which returns a
>>>>>>>>>>>> JSON-formatted string containing the connected and delegated user
>>>>>>>>>>>> for each session.
>>>>>>>>>>>>
>>>>>>>>>>>> If you want it on the Catalog side, you probably have to plumb
>>>>>>>>>>>> it through the RPC calls (change the thrift spec and pass it along
>>>>>>>>>>>> from the coordinator session handling code to the Catalog RPC code).
>>>>>>>>>>>>
>>>>>>>>>>>> On Wed, Jan 2, 2019 at 11:19 AM mhd wrk <mhdwrkoff...@gmail.com>
>>>>>>>>>>>> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> Is there any Impala/Sentry specific API we can use inside our
>>>>>>>>>>>>> code to figure out who the current user is?
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Wed, Jan 2, 2019 at 11:12 AM Bharath Vissapragada <
>>>>>>>>>>>>> bhara...@cloudera.com> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> Yes. I think Jeszy is right. Per my understanding too, we
>>>>>>>>>>>>>> don't impersonate the client user on the Catalog server.
>>>>>>>>>>>>>> Instead, we enforce the authorization via Sentry during query
>>>>>>>>>>>>>> planning.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Wed, Jan 2, 2019 at 7:06 AM mhd wrk <
>>>>>>>>>>>>>> mhdwrkoff...@gmail.com> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> IMPALA-2177 sounds like the correct issue.
>>>>>>>>>>>>>>> Here are log messages from authentication.cc for impalad and
>>>>>>>>>>>>>>> catalogd respectively:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> I0102 14:15:06.722666 28195 authentication.cc:478]
>>>>>>>>>>>>>>>> Successfully authenticated client user *"ad...@example.com
>>>>>>>>>>>>>>>> <ad...@example.com>"*
>>>>>>>>>>>>>>>> I0102 03:40:07.972348 27948 authentication.cc:445]
>>>>>>>>>>>>>>>> Successfully authenticated principal 
>>>>>>>>>>>>>>>> *"impala/cdh-...@example.com
>>>>>>>>>>>>>>>> <cdh-...@example.com>"* on an internal connection
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> As you can see from the messages above, impalad is able to
>>>>>>>>>>>>>>> identify the currently connected user correctly. However,
>>>>>>>>>>>>>>> catalogd always authenticates as impala, which causes the
>>>>>>>>>>>>>>> problem.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On Wed, Jan 2, 2019 at 4:19 AM Jeszy <jes...@gmail.com>
>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Hey,
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> If I understand your question correctly, this is a limitation.
>>>>>>>>>>>>>>>> IMPALA-2177 looks to be the appropriate jira.
>>>>>>>>>>>>>>>> Most users use Impala together with Sentry, where the
>>>>>>>>>>>>>>>> recommended approach is to disable impersonation (even in
>>>>>>>>>>>>>>>> services that allow it, like Hive).
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> HTH
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> On Wed, 2 Jan 2019 at 05:55, Bharath Vissapragada <
>>>>>>>>>>>>>>>> bhara...@cloudera.com> wrote:
>>>>>>>>>>>>>>>> >
>>>>>>>>>>>>>>>> > Hi,
>>>>>>>>>>>>>>>> >
>>>>>>>>>>>>>>>> > Can you add the stack trace here if possible? It is not
>>>>>>>>>>>>>>>> super clear where exactly the problem is.
>>>>>>>>>>>>>>>> >
>>>>>>>>>>>>>>>> > Thanks,
>>>>>>>>>>>>>>>> > Bharath
>>>>>>>>>>>>>>>> >
>>>>>>>>>>>>>>>> > On Tue, Jan 1, 2019 at 6:34 PM mhd wrk <
>>>>>>>>>>>>>>>> mhdwrkoff...@gmail.com> wrote:
>>>>>>>>>>>>>>>> >>
>>>>>>>>>>>>>>>> >> We have our own implementation of Hadoop FileSystem which
>>>>>>>>>>>>>>>> >> relies on the current user in a Kerberized environment to
>>>>>>>>>>>>>>>> >> locate user-specific files in HDFS. This custom file system
>>>>>>>>>>>>>>>> >> works fine inside Hive to create external tables and query
>>>>>>>>>>>>>>>> >> them. However, trying to access the same tables via Impala
>>>>>>>>>>>>>>>> >> (JDBC driver) fails. Judging from the log messages, it seems
>>>>>>>>>>>>>>>> >> that when impalad sends requests to catalogd to get the
>>>>>>>>>>>>>>>> >> metadata of a given table, the current user returned by
>>>>>>>>>>>>>>>> >> UserGroupInformation is the service account running the
>>>>>>>>>>>>>>>> >> server (impala/hostn...@example.com) instead of the currently
>>>>>>>>>>>>>>>> >> connected user.
>>>>>>>>>>>>>>>> >>
>>>>>>>>>>>>>>>> >> Is this a known issue or limitation of Impala?
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>
