Thanks for the link. So the final answer is that even if the libhdfs bug gets fixed, there won't be any changes to Impala to expose requesting_user in the Catalog Service, right?
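A minimal Python sketch of the design described in the quoted replies below, under stated assumptions: metadata is loaded once by the service account and shared, while the per-user check happens separately. `SharedCatalog`, `CatalogRequestHeader`, `is_authorized` and the `grants` set are hypothetical names for illustration only; this is not Impala code, and the only thing taken from the thread is that `requesting_user` is an optional header field.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Optional, Set, Tuple


@dataclass
class CatalogRequestHeader:
    """Stand-in for TCatalogServiceRequestHeader; the user may be unset."""
    requesting_user: Optional[str] = None


@dataclass
class SharedCatalog:
    """Hypothetical catalog that loads each table once and shares it."""
    loader: Callable[[str], dict]  # assumed to run as the service account
    cache: Dict[str, dict] = field(default_factory=dict)
    load_count: int = 0

    def get_table(self, table: str, header: CatalogRequestHeader) -> dict:
        # The cache key is the table name only, so requesting_user cannot
        # change what gets loaded, whoever triggers the first load.
        if table not in self.cache:
            self.cache[table] = self.loader(table)
            self.load_count += 1
        return self.cache[table]


def is_authorized(grants: Set[Tuple[str, str]],
                  header: CatalogRequestHeader,
                  table: str) -> bool:
    """Per-user check kept outside the cache; an unset user is denied."""
    user = header.requesting_user
    return user is not None and (user, table) in grants
```

In this sketch, two requests for the same table from different users trigger a single load and receive the same cached object, which is why a file system that resolves files per user does not fit the model without further changes.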
On Thu, Jan 3, 2019 at 9:46 AM Tim Armstrong <tarmstr...@cloudera.com> wrote:

> > catalog server ignores file system authorization model
> The catalog daemon does this by design - the idea is that the catalog
> server can load and cache metadata on behalf of multiple users. It requires
> that the catalogd user (usually "impala") has permissions to read
> filesystem metadata.
>
> The "user account requirements" section in our docs explains this:
> https://impala.apache.org/docs/build/html/topics/impala_prereqs.html#prereqs
> and
> https://impala.apache.org/docs/build/html/topics/impala_security_files.html
>
> On Wed, Jan 2, 2019 at 5:52 PM mhd wrk <mhdwrkoff...@gmail.com> wrote:
>
>> It's more about enforcing Hadoop file system authorisation. All we have
>> done is implement a custom Hadoop File System
>> (org.apache.hadoop.fs.FileSystem), and we are now trying to use Impala to
>> query files hosted on that file system. It fails because catalog server
>> ignores file system authorization model. The same file system works
>> nicely with HDFS commands (e.g. hdfs dfs -ls ...) as well as HiveServer.
>>
>> Our clients expect us to enforce authorization at all levels (HDFS,
>> Accumulo, Hive, Impala and ....)
>>
>> On Wed, Jan 2, 2019 at 4:56 PM Tim Armstrong <tarmstr...@cloudera.com>
>> wrote:
>>
>>> Stepping back for a second, doesn't what you're trying to do assume that
>>> each user will load metadata for each table separately? The whole point of
>>> the catalog server is that we load the metadata once and then share it
>>> between queries and users.
>>>
>>> I don't think we want to have the catalog server load different versions
>>> of a table depending on which user initially loaded the table? That would
>>> cause all sorts of issues.
>>>
>>> On Wed, Jan 2, 2019 at 12:36 PM mhd wrk <mhdwrkoff...@gmail.com> wrote:
>>>
>>>> I see. I was wondering how it works inside hive server. Basically this
>>>> is an HDFS C API issue. Thanks for the elaborate explanation.
>>>>
>>>> On Wed, Jan 2, 2019 at 12:27 PM Shant Hovsepian <sh...@arcadiadata.com>
>>>> wrote:
>>>>
>>>>> Problem is mostly with libhdfs, as documented here: HADOOP-12953
>>>>>
>>>>> On a kerberized setup the service principal gets picked up. There are
>>>>> workarounds in the Java HDFS API but the C-based one in libhdfs has this
>>>>> issue. Of course caching HDFS will be trickier in impala as well, but
>>>>> first this one API in libhdfs needs to be enhanced.
>>>>>
>>>>> Also, in general, having database authorization at the file level may
>>>>> not be a good idea or clean design, and using sentry and extending its
>>>>> authorization mechanisms would be cleaner.
>>>>>
>>>>> -Shant
>>>>>
>>>>> On Wed, Jan 2, 2019, 12:21 PM mhd wrk <mhdwrkoff...@gmail.com> wrote:
>>>>>
>>>>>> Thanks for further info. Not sure if our Product Management is OK, at
>>>>>> this point, with us patching Impala server to get our solution working.
>>>>>> Our product is supposed to work with already installed servers.
>>>>>>
>>>>>> Any plans to address the gap (making requesting_user visible inside
>>>>>> catalog server) in a future release?
>>>>>>
>>>>>> On Wed, Jan 2, 2019 at 11:50 AM Bharath Vissapragada <
>>>>>> bhara...@cloudera.com> wrote:
>>>>>>
>>>>>>> I was poking around in the code and it looks like we have most of
>>>>>>> the code in place
>>>>>>> <https://github.com/apache/impala/blob/27577dd652554dda5a03016e2d1e3ab66fe6b1f5/common/thrift/CatalogService.thrift#L47>
>>>>>>>
>>>>>>> // Common header included in all CatalogService requests.
>>>>>>> // TODO: The CatalogServiceVersion/protocol version should be part
>>>>>>> // of the header. This would require changes in BDR and break their
>>>>>>> // compatibility story. We should coordinate a joint change
>>>>>>> // somewhere down the line.
>>>>>>> struct TCatalogServiceRequestHeader {
>>>>>>>   // The effective user who submitted this request.
>>>>>>>   1: optional string requesting_user
>>>>>>> }
>>>>>>>
>>>>>>> That header is included in all the RPCs. However, that is an
>>>>>>> optional field and may not be set in a few places (since we don't
>>>>>>> actually rely on that currently). So you could start with making it a
>>>>>>> "required" field and see what all breaks. HTH.
>>>>>>>
>>>>>>> On Wed, Jan 2, 2019 at 11:35 AM Bharath Vissapragada <
>>>>>>> bhara...@cloudera.com> wrote:
>>>>>>>
>>>>>>>> I think we expose it via UDF effective_user() (effective user could
>>>>>>>> be different from the connected user if delegation/doas is enabled).
>>>>>>>> You can run a query like "select effective_user()" in a session.
>>>>>>>>
>>>>>>>> You can also look it up in the /sessions page on the coordinator
>>>>>>>> web UI (<coordinator>:25000/sessions?json) and you can get a json
>>>>>>>> formatted string containing the connected and delegate user for each
>>>>>>>> session.
>>>>>>>>
>>>>>>>> If you want it on the Catalog side, you probably have to plumb it
>>>>>>>> through the RPC calls (change the thrift spec and pass it along from
>>>>>>>> the coordinator session handling code to the Catalog RPC code).
>>>>>>>>
>>>>>>>> On Wed, Jan 2, 2019 at 11:19 AM mhd wrk <mhdwrkoff...@gmail.com>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> Is there any Impala/Sentry specific API we can use inside our code
>>>>>>>>> to figure out who the current user is?
>>>>>>>>>
>>>>>>>>> On Wed, Jan 2, 2019 at 11:12 AM Bharath Vissapragada <
>>>>>>>>> bhara...@cloudera.com> wrote:
>>>>>>>>>
>>>>>>>>>> Yes. I think Jeszy is right. Per my understanding too, we don't
>>>>>>>>>> impersonate the client user on the Catalog server. Instead, we
>>>>>>>>>> enforce the authorization via Sentry during query planning.
>>>>>>>>>>
>>>>>>>>>> On Wed, Jan 2, 2019 at 7:06 AM mhd wrk <mhdwrkoff...@gmail.com>
>>>>>>>>>> wrote:
>>>>>>>>>>
>>>>>>>>>>> IMPALA-2177 sounds like the correct issue.
>>>>>>>>>>> Here are log messages from authentication.cc for impalad and
>>>>>>>>>>> catalogd respectively:
>>>>>>>>>>>
>>>>>>>>>>>> I0102 14:15:06.722666 28195 authentication.cc:478] Successfully
>>>>>>>>>>>> authenticated client user *"ad...@example.com"*
>>>>>>>>>>>> I0102 03:40:07.972348 27948 authentication.cc:445] Successfully
>>>>>>>>>>>> authenticated principal *"impala/cdh-...@example.com"* on an
>>>>>>>>>>>> internal connection
>>>>>>>>>>>
>>>>>>>>>>> As you can see from the messages above, impalad is able to
>>>>>>>>>>> identify the currently connected user correctly. However, catalogd
>>>>>>>>>>> always authenticates as impala, which causes the problem.
>>>>>>>>>>>
>>>>>>>>>>> On Wed, Jan 2, 2019 at 4:19 AM Jeszy <jes...@gmail.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Hey,
>>>>>>>>>>>>
>>>>>>>>>>>> If I understand your question correctly, this is a limitation.
>>>>>>>>>>>> IMPALA-2177 looks to be the appropriate jira.
>>>>>>>>>>>> Most users use Impala together with Sentry, where the
>>>>>>>>>>>> recommended approach is to disable impersonation (even in
>>>>>>>>>>>> services that allow it, like Hive).
>>>>>>>>>>>>
>>>>>>>>>>>> HTH
>>>>>>>>>>>>
>>>>>>>>>>>> On Wed, 2 Jan 2019 at 05:55, Bharath Vissapragada <
>>>>>>>>>>>> bhara...@cloudera.com> wrote:
>>>>>>>>>>>> >
>>>>>>>>>>>> > Hi,
>>>>>>>>>>>> >
>>>>>>>>>>>> > Can you add the stack trace here if possible? It is not super
>>>>>>>>>>>> > clear where exactly the problem is.
>>>>>>>>>>>> >
>>>>>>>>>>>> > Thanks,
>>>>>>>>>>>> > Bharath
>>>>>>>>>>>> >
>>>>>>>>>>>> > On Tue, Jan 1, 2019 at 6:34 PM mhd wrk <
>>>>>>>>>>>> > mhdwrkoff...@gmail.com> wrote:
>>>>>>>>>>>> >>
>>>>>>>>>>>> >> We have our own implementation of Hadoop FileSystem which
>>>>>>>>>>>> >> relies on the current user in a kerberized environment to
>>>>>>>>>>>> >> locate user-specific files in HDFS. This custom file system
>>>>>>>>>>>> >> works fine inside hive to create external tables and query
>>>>>>>>>>>> >> them. However, trying to access the same tables via Impala
>>>>>>>>>>>> >> (jdbc driver) fails. From the log messages, it seems that
>>>>>>>>>>>> >> when impalad sends requests to catalogd to get metadata of a
>>>>>>>>>>>> >> given table, the current user returned by UserGroupInformation
>>>>>>>>>>>> >> is the service account running the server
>>>>>>>>>>>> >> (impala/hostn...@example.com) instead of the currently
>>>>>>>>>>>> >> connected user.
>>>>>>>>>>>> >>
>>>>>>>>>>>> >> Is this a known issue or limitation of Impala?