Right, we could use requesting_user for logging, statistics, etc, but it would be problematic to impersonate that user when loading metadata.
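
To make the concern concrete, here is a toy Python sketch of the model described in this thread: metadata is loaded once under the service identity and shared, while requesting_user is only recorded. It is not Impala code. CatalogCache, fake_listing and SERVICE_USER are hypothetical names made up for illustration, and the per-user file listing is an assumed stand-in for a custom FileSystem.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional

SERVICE_USER = "impala"  # assumption: the account the catalog daemon runs as


@dataclass
class CatalogCache:
    """Toy model of a shared metadata cache; not Impala's actual code."""
    # load_fn(table, fs_user) returns the files visible to fs_user.
    load_fn: Callable[[str, str], List[str]]
    tables: Dict[str, List[str]] = field(default_factory=dict)
    audit_log: List[str] = field(default_factory=list)

    def get_table(self, table: str,
                  requesting_user: Optional[str] = None) -> List[str]:
        # requesting_user is optional (as in the Thrift header) and is
        # used for logging only, never for file system access.
        self.audit_log.append(f"{requesting_user or '<unknown>'} requested {table}")
        if table not in self.tables:
            # Loaded once, as the service user, then shared by everyone.
            self.tables[table] = self.load_fn(table, SERVICE_USER)
        return self.tables[table]


def fake_listing(table: str, fs_user: str) -> List[str]:
    # Hypothetical file system whose listing depends on the caller.
    if fs_user == SERVICE_USER:
        return [f"{table}/part-0"]
    return [f"{table}/{fs_user}-only"]


cache = CatalogCache(load_fn=fake_listing)
first = cache.get_table("sales", requesting_user="alice")
second = cache.get_table("sales", requesting_user="bob")
```

Both calls return the same cached listing. If the load had impersonated the first caller instead, every later user would have been served that caller's view of the table, which is the problem with impersonation in a shared cache.
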
It's of course possible that I'm missing something here.

On Thu, Jan 3, 2019 at 10:05 AM mhd wrk <mhdwrkoff...@gmail.com> wrote:

> Thanks for the link. So the final answer is that even if the libhdfs bug
> gets fixed there won't be any changes to Impala to expose requesting_user
> in Catalog Service, right?
>
> On Thu, Jan 3, 2019 at 9:46 AM Tim Armstrong <tarmstr...@cloudera.com>
> wrote:
>
>> > catalog server ignores file system authorization model
>> The catalog daemon does this by design - the idea is that the catalog
>> server can load and cache metadata on behalf of multiple users. It requires
>> that the catalogd user (usually "impala") has permissions to read
>> filesystem metadata.
>>
>> The "user account requirements" section in our docs explains this:
>> https://impala.apache.org/docs/build/html/topics/impala_prereqs.html#prereqs
>> and
>> https://impala.apache.org/docs/build/html/topics/impala_security_files.html
>>
>> On Wed, Jan 2, 2019 at 5:52 PM mhd wrk <mhdwrkoff...@gmail.com> wrote:
>>
>>> it's more about enforcing Hadoop file system authorisation. All we have
>>> done is implementing a custom Hadoop File System
>>> (org.apache.hadoop.fs.FileSystem) and now trying to use Impala to query
>>> files hosted on that file system and it fails because catalog server
>>> ignores file system authorization model. The same file system works
>>> nicely with HDFS commands (e.g. hdfs dfs -ls ...) as well as HiveServer.
>>>
>>> Our clients expect us to enforce authorization at all levels (HDFS,
>>> Accumulo, Hive, Impala and ....)
>>>
>>> On Wed, Jan 2, 2019 at 4:56 PM Tim Armstrong <tarmstr...@cloudera.com>
>>> wrote:
>>>
>>>> Stepping back for a second, doesn't what you're trying to do assume
>>>> that each user will load metadata for each table separately? The whole
>>>> point of the catalog server is that we load the metadata once and then
>>>> share it between queries and users.
>>>>
>>>> I don't think we want to have the catalog server load different
>>>> versions of a table depending on which user initially loaded the table?
>>>> That would cause all sorts of issues.
>>>>
>>>> On Wed, Jan 2, 2019 at 12:36 PM mhd wrk <mhdwrkoff...@gmail.com> wrote:
>>>>
>>>>> I see. I was wondering how it works inside hive server. Basically this
>>>>> is an HDFS C API issue. Thanks for the elaborate explanation.
>>>>>
>>>>> On Wed, Jan 2, 2019 at 12:27 PM Shant Hovsepian <sh...@arcadiadata.com>
>>>>> wrote:
>>>>>
>>>>>> Problem is mostly with libhdfs as documented here HADOOP-12953
>>>>>>
>>>>>> On a kerberized setup the service principal gets picked up. There are
>>>>>> workarounds in the Java HDFS API but the C-based one in libhdfs has
>>>>>> this issue. Of course caching HDFS will be trickier in Impala as well
>>>>>> but first this one API in libhdfs needs to be enhanced.
>>>>>>
>>>>>> Also in general having database authorization at the file level may
>>>>>> not be a good idea or clean design and using Sentry and extending its
>>>>>> authorization mechanisms would be cleaner.
>>>>>>
>>>>>> -Shant
>>>>>>
>>>>>> On Wed, Jan 2, 2019, 12:21 PM mhd wrk <mhdwrkoff...@gmail.com> wrote:
>>>>>>
>>>>>>> Thanks for further info. Not sure if our Product Management is OK,
>>>>>>> at this point, with us patching Impala server to get our solution
>>>>>>> working.
>>>>>>> Our product is supposed to work with already installed servers.
>>>>>>>
>>>>>>> Any plans to address the gap (making requesting_user visible inside
>>>>>>> catalog server) in a future release?
>>>>>>>
>>>>>>> On Wed, Jan 2, 2019 at 11:50 AM Bharath Vissapragada <bhara...@cloudera.com> wrote:
>>>>>>>
>>>>>>>> I was poking around in the code and it looks like we have most of
>>>>>>>> the code in place
>>>>>>>> <https://github.com/apache/impala/blob/27577dd652554dda5a03016e2d1e3ab66fe6b1f5/common/thrift/CatalogService.thrift#L47>
>>>>>>>>
>>>>>>>> // Common header included in all CatalogService requests.
>>>>>>>> // TODO: The CatalogServiceVersion/protocol version should be part of the header.
>>>>>>>> // This would require changes in BDR and break their compatibility story. We should
>>>>>>>> // coordinate a joint change somewhere down the line.
>>>>>>>> struct TCatalogServiceRequestHeader {
>>>>>>>>   // The effective user who submitted this request.
>>>>>>>>   1: optional string requesting_user
>>>>>>>> }
>>>>>>>>
>>>>>>>> That header is included in all the RPCs. However, that is an
>>>>>>>> optional field and may not be set in a few places (since we don't
>>>>>>>> actually rely on that currently). So you could start with making it
>>>>>>>> a "required" field and see what all breaks. HTH.
>>>>>>>>
>>>>>>>> On Wed, Jan 2, 2019 at 11:35 AM Bharath Vissapragada <bhara...@cloudera.com> wrote:
>>>>>>>>
>>>>>>>>> I think we expose it via UDF effective_user() (effective user
>>>>>>>>> could be different from the connected user if delegation/doas is
>>>>>>>>> enabled). You can run a query like "select effective_user()" in a
>>>>>>>>> session.
>>>>>>>>>
>>>>>>>>> You can also look it up in the /sessions page on the coordinator
>>>>>>>>> web UI (<coordinator>:25000/sessions?json) and you can get a
>>>>>>>>> JSON-formatted string containing the connected and delegate user
>>>>>>>>> for each session.
>>>>>>>>>
>>>>>>>>> If you want it on the Catalog side, you probably have to plumb it
>>>>>>>>> through the RPC calls (change the thrift spec and pass it along
>>>>>>>>> from the coordinator session handling code to the Catalog RPC code).
>>>>>>>>>
>>>>>>>>> On Wed, Jan 2, 2019 at 11:19 AM mhd wrk <mhdwrkoff...@gmail.com>
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>>> Is there any Impala/Sentry specific API we can use inside our
>>>>>>>>>> code to figure out who the current user is?
>>>>>>>>>>
>>>>>>>>>> On Wed, Jan 2, 2019 at 11:12 AM Bharath Vissapragada <bhara...@cloudera.com> wrote:
>>>>>>>>>>
>>>>>>>>>>> Yes. I think Jeszy is right. Per my understanding too, we don't
>>>>>>>>>>> impersonate the client user on the Catalog server. Instead, we
>>>>>>>>>>> enforce the authorization via Sentry during query planning.
>>>>>>>>>>>
>>>>>>>>>>> On Wed, Jan 2, 2019 at 7:06 AM mhd wrk <mhdwrkoff...@gmail.com>
>>>>>>>>>>> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> IMPALA-2177 sounds like the correct issue.
>>>>>>>>>>>> Here are log messages from authentication.cc for impalad and
>>>>>>>>>>>> catalogd respectively:
>>>>>>>>>>>>
>>>>>>>>>>>>> I0102 14:15:06.722666 28195 authentication.cc:478] Successfully authenticated client user "ad...@example.com"
>>>>>>>>>>>>> I0102 03:40:07.972348 27948 authentication.cc:445] Successfully authenticated principal "impala/cdh-...@example.com" on an internal connection
>>>>>>>>>>>>
>>>>>>>>>>>> As you can see from the messages above, impalad is able to
>>>>>>>>>>>> identify the currently connected user correctly. However,
>>>>>>>>>>>> catalogd always authenticates as impala, which causes the problem.
>>>>>>>>>>>>
>>>>>>>>>>>> On Wed, Jan 2, 2019 at 4:19 AM Jeszy <jes...@gmail.com> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> Hey,
>>>>>>>>>>>>>
>>>>>>>>>>>>> If I understand your question correctly, this is a limitation.
>>>>>>>>>>>>> IMPALA-2177 looks to be the appropriate jira.
>>>>>>>>>>>>> Most users use Impala together with Sentry, where the
>>>>>>>>>>>>> recommended approach is to disable impersonation (even in
>>>>>>>>>>>>> services that allow it, like Hive).
>>>>>>>>>>>>>
>>>>>>>>>>>>> HTH
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Wed, 2 Jan 2019 at 05:55, Bharath Vissapragada <bhara...@cloudera.com> wrote:
>>>>>>>>>>>>> >
>>>>>>>>>>>>> > Hi,
>>>>>>>>>>>>> >
>>>>>>>>>>>>> > Can you add the stack trace here if possible? It is not
>>>>>>>>>>>>> > super clear where exactly the problem is.
>>>>>>>>>>>>> >
>>>>>>>>>>>>> > Thanks,
>>>>>>>>>>>>> > Bharath
>>>>>>>>>>>>> >
>>>>>>>>>>>>> > On Tue, Jan 1, 2019 at 6:34 PM mhd wrk <mhdwrkoff...@gmail.com> wrote:
>>>>>>>>>>>>> >>
>>>>>>>>>>>>> >> We have our own implementation of Hadoop FileSystem which
>>>>>>>>>>>>> >> relies on the current user in a kerberized environment to
>>>>>>>>>>>>> >> locate user-specific files in HDFS. This custom file system
>>>>>>>>>>>>> >> works fine inside Hive to create external tables and query
>>>>>>>>>>>>> >> them. However, trying to access the same tables via Impala
>>>>>>>>>>>>> >> (JDBC driver) fails. From the log messages, it seems that
>>>>>>>>>>>>> >> when impalad sends requests to catalogd to get the metadata
>>>>>>>>>>>>> >> of a given table, the current user returned by
>>>>>>>>>>>>> >> UserGroupInformation is the service account running the
>>>>>>>>>>>>> >> server (impala/hostn...@example.com) instead of the
>>>>>>>>>>>>> >> currently connected user.
>>>>>>>>>>>>> >>
>>>>>>>>>>>>> >> Is this a known issue or limitation of Impala?