I agree with Tim's points. Given the current Catalog architecture, I have reached the same conclusion.
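To make the reasoning concrete, here is a toy Python sketch of the behaviour described in the thread below: metadata is loaded once by the service user and shared, and requesting_user is only recorded for logging. This is not Impala code. SharedCatalogCache, fake_loader, and the table and user names are all hypothetical and exist only for illustration.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class SharedCatalogCache:
    """Toy model: metadata is loaded once by the service user and shared."""

    service_user: str
    # Hypothetical loader signature: (table, acting_user) -> metadata dict.
    loader: Callable[[str, str], dict]
    tables: Dict[str, dict] = field(default_factory=dict)
    audit_log: List[str] = field(default_factory=list)

    def get_table(self, table: str, requesting_user: str) -> dict:
        # requesting_user is recorded for logging only; it never affects loading.
        self.audit_log.append(f"{requesting_user} requested {table}")
        if table not in self.tables:
            # Always load as the service user so every caller sees one version.
            self.tables[table] = self.loader(table, self.service_user)
        return self.tables[table]


def fake_loader(table: str, acting_user: str) -> dict:
    # Stand-in for a real metadata load; it just remembers who loaded the table.
    return {"table": table, "loaded_as": acting_user}


cache = SharedCatalogCache(service_user="impala", loader=fake_loader)
first = cache.get_table("db.t1", requesting_user="alice")
second = cache.get_table("db.t1", requesting_user="bob")
```

If the loader were instead called with requesting_user, whichever user happened to touch the table first would decide what everyone else sees, which is the problem Tim describes.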
On Thu, Jan 3, 2019 at 10:17 AM Tim Armstrong <tarmstr...@cloudera.com> wrote:
> Right, we could use requesting_user for logging, statistics, etc., but it
> would be problematic to impersonate that user when loading metadata.
>
> It's of course possible that I'm missing something here.
>
> On Thu, Jan 3, 2019 at 10:05 AM mhd wrk <mhdwrkoff...@gmail.com> wrote:
>
>> Thanks for the link. So the final answer is that even if the libhdfs bug
>> gets fixed there won't be any changes to Impala to expose requesting_user
>> in Catalog Service, right?
>>
>> On Thu, Jan 3, 2019 at 9:46 AM Tim Armstrong <tarmstr...@cloudera.com>
>> wrote:
>>
>>> > catalog server ignores file system authorization model
>>> The catalog daemon does this by design - the idea is that the catalog
>>> server can load and cache metadata on behalf of multiple users. It requires
>>> that the catalogd user (usually "impala") has permissions to read
>>> filesystem metadata.
>>>
>>> The "user account requirements" section in our docs explains this:
>>> https://impala.apache.org/docs/build/html/topics/impala_prereqs.html#prereqs
>>> and
>>> https://impala.apache.org/docs/build/html/topics/impala_security_files.html
>>>
>>> On Wed, Jan 2, 2019 at 5:52 PM mhd wrk <mhdwrkoff...@gmail.com> wrote:
>>>
>>>> It's more about enforcing Hadoop file system authorisation. All we have
>>>> done is implement a custom Hadoop File System
>>>> (org.apache.hadoop.fs.FileSystem), and we are now trying to use Impala to
>>>> query files hosted on that file system. It fails because the catalog
>>>> server ignores the file system authorization model. The same file system
>>>> works nicely with HDFS commands (e.g. hdfs dfs -ls ...) as well as
>>>> HiveServer.
>>>>
>>>> Our clients expect us to enforce authorization at all levels (HDFS,
>>>> Accumulo, Hive, Impala and ....)
>>>>
>>>> On Wed, Jan 2, 2019 at 4:56 PM Tim Armstrong <tarmstr...@cloudera.com>
>>>> wrote:
>>>>
>>>>> Stepping back for a second, doesn't what you're trying to do assume
>>>>> that each user will load metadata for each table separately? The whole
>>>>> point of the catalog server is that we load the metadata once and then
>>>>> share it between queries and users.
>>>>>
>>>>> I don't think we want to have the catalog server load different
>>>>> versions of a table depending on which user initially loaded the table.
>>>>> That would cause all sorts of issues.
>>>>>
>>>>> On Wed, Jan 2, 2019 at 12:36 PM mhd wrk <mhdwrkoff...@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> I see. I was wondering how it works inside Hive server. Basically
>>>>>> this is an HDFS C API issue. Thanks for the elaborate explanation.
>>>>>>
>>>>>> On Wed, Jan 2, 2019 at 12:27 PM Shant Hovsepian <sh...@arcadiadata.com>
>>>>>> wrote:
>>>>>>
>>>>>>> The problem is mostly with libhdfs, as documented in HADOOP-12953.
>>>>>>>
>>>>>>> On a kerberized setup the service principal gets picked up. There
>>>>>>> are workarounds in the Java HDFS API, but the C-based one in libhdfs
>>>>>>> has this issue. Of course caching HDFS will be trickier in Impala as
>>>>>>> well, but first this one API in libhdfs needs to be enhanced.
>>>>>>>
>>>>>>> Also, in general, having database authorization at the file level may
>>>>>>> not be a good idea or clean design; using Sentry and extending its
>>>>>>> authorization mechanisms would be cleaner.
>>>>>>>
>>>>>>> -Shant
>>>>>>>
>>>>>>> On Wed, Jan 2, 2019, 12:21 PM mhd wrk <mhdwrkoff...@gmail.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Thanks for further info. Not sure if our Product Management is OK,
>>>>>>>> at this point, with us patching Impala server to get our solution
>>>>>>>> working.
>>>>>>>> Our product is supposed to work with already installed servers.
>>>>>>>>
>>>>>>>> Any plans to address the gap (making requesting_user visible inside
>>>>>>>> catalog server) in a future release?
>>>>>>>>
>>>>>>>> On Wed, Jan 2, 2019 at 11:50 AM Bharath Vissapragada <bhara...@cloudera.com> wrote:
>>>>>>>>
>>>>>>>>> I was poking around in the code and it looks like we have most of
>>>>>>>>> the code in place
>>>>>>>>> <https://github.com/apache/impala/blob/27577dd652554dda5a03016e2d1e3ab66fe6b1f5/common/thrift/CatalogService.thrift#L47>
>>>>>>>>>
>>>>>>>>> // Common header included in all CatalogService requests.
>>>>>>>>> // TODO: The CatalogServiceVersion/protocol version should be part of the header.
>>>>>>>>> // This would require changes in BDR and break their compatibility story. We should
>>>>>>>>> // coordinate a joint change somewhere down the line.
>>>>>>>>> struct TCatalogServiceRequestHeader {
>>>>>>>>>   // The effective user who submitted this request.
>>>>>>>>>   1: optional string requesting_user
>>>>>>>>> }
>>>>>>>>>
>>>>>>>>> That header is included in all the RPCs. However, that is an
>>>>>>>>> optional field and may not be set in a few places (since we don't
>>>>>>>>> actually rely on that currently). So you could start with making it
>>>>>>>>> a "required" field and see what all breaks. HTH.
>>>>>>>>>
>>>>>>>>> On Wed, Jan 2, 2019 at 11:35 AM Bharath Vissapragada <bhara...@cloudera.com> wrote:
>>>>>>>>>
>>>>>>>>>> I think we expose it via UDF effective_user() (effective user
>>>>>>>>>> could be different from the connected user if delegation/doas is
>>>>>>>>>> enabled). You can run a query like "select effective_user()" in a
>>>>>>>>>> session.
>>>>>>>>>>
>>>>>>>>>> You can also look it up in the /sessions page on the coordinator
>>>>>>>>>> web UI (<coordinator>:25000/sessions?json) and you can get a
>>>>>>>>>> JSON-formatted string containing the connected and delegate user
>>>>>>>>>> for each session.
>>>>>>>>>>
>>>>>>>>>> If you want it on the Catalog side, you probably have to plumb it
>>>>>>>>>> through the RPC calls (change the thrift spec and pass it along
>>>>>>>>>> from the coordinator session handling code to the Catalog RPC code).
>>>>>>>>>>
>>>>>>>>>> On Wed, Jan 2, 2019 at 11:19 AM mhd wrk <mhdwrkoff...@gmail.com>
>>>>>>>>>> wrote:
>>>>>>>>>>
>>>>>>>>>>> Is there any Impala/Sentry specific API we can use inside our
>>>>>>>>>>> code to figure out who the current user is?
>>>>>>>>>>>
>>>>>>>>>>> On Wed, Jan 2, 2019 at 11:12 AM Bharath Vissapragada <bhara...@cloudera.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Yes. I think Jeszy is right. Per my understanding too, we don't
>>>>>>>>>>>> impersonate the client user on the Catalog server. Instead, we
>>>>>>>>>>>> enforce the authorization via Sentry during query planning.
>>>>>>>>>>>>
>>>>>>>>>>>> On Wed, Jan 2, 2019 at 7:06 AM mhd wrk <mhdwrkoff...@gmail.com>
>>>>>>>>>>>> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> IMPALA-2177 sounds like the correct issue.
>>>>>>>>>>>>> Here are log messages from authentication.cc for impalad and
>>>>>>>>>>>>> catalogd respectively:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> I0102 14:15:06.722666 28195 authentication.cc:478] Successfully authenticated client user *"ad...@example.com"*
>>>>>>>>>>>>>> I0102 03:40:07.972348 27948 authentication.cc:445] Successfully authenticated principal *"impala/cdh-...@example.com"* on an internal connection
>>>>>>>>>>>>>
>>>>>>>>>>>>> As you can see from the messages above, impalad is able to
>>>>>>>>>>>>> identify the currently connected user correctly. However,
>>>>>>>>>>>>> catalogd always authenticates as impala, which causes the problem.
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Wed, Jan 2, 2019 at 4:19 AM Jeszy <jes...@gmail.com> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> Hey,
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> If I understand your question correctly, this is a limitation.
>>>>>>>>>>>>>> IMPALA-2177 looks to be the appropriate jira.
>>>>>>>>>>>>>> Most users use Impala together with Sentry, where the
>>>>>>>>>>>>>> recommended approach is to disable impersonation (even in
>>>>>>>>>>>>>> services that allow it, like Hive).
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> HTH
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Wed, 2 Jan 2019 at 05:55, Bharath Vissapragada <bhara...@cloudera.com> wrote:
>>>>>>>>>>>>>> >
>>>>>>>>>>>>>> > Hi,
>>>>>>>>>>>>>> >
>>>>>>>>>>>>>> > Can you add the stack trace here if possible? It is not
>>>>>>>>>>>>>> > super clear where exactly the problem is.
>>>>>>>>>>>>>> >
>>>>>>>>>>>>>> > Thanks,
>>>>>>>>>>>>>> > Bharath
>>>>>>>>>>>>>> >
>>>>>>>>>>>>>> > On Tue, Jan 1, 2019 at 6:34 PM mhd wrk <mhdwrkoff...@gmail.com> wrote:
>>>>>>>>>>>>>> >>
>>>>>>>>>>>>>> >> We have our own implementation of Hadoop FileSystem which
>>>>>>>>>>>>>> >> relies on the current user in a kerberized environment to
>>>>>>>>>>>>>> >> locate user-specific files in HDFS. This custom file system
>>>>>>>>>>>>>> >> works fine inside Hive to create external tables and query
>>>>>>>>>>>>>> >> them. However, trying to access the same tables via Impala
>>>>>>>>>>>>>> >> (JDBC driver) fails. Watching the log messages, it seems
>>>>>>>>>>>>>> >> that when impalad sends requests to catalogd to get metadata
>>>>>>>>>>>>>> >> of a given table, the current user returned by
>>>>>>>>>>>>>> >> UserGroupInformation is the service account running the
>>>>>>>>>>>>>> >> server (impala/hostn...@example.com) instead of the
>>>>>>>>>>>>>> >> currently connected user.
>>>>>>>>>>>>>> >>
>>>>>>>>>>>>>> >> Is this a known issue or limitation of Impala?
>>>>>>>>>>>>>>
>>>>>>>>>>>>>