+1 for Tim's opinion.

We encountered a similar issue when we enabled storage-based authorization (without Sentry) for Hive: catalogd failed to load file metadata from HDFS because the user `impala` did not have x (execute) permission. We solved this by adding `impala` to the `supergroup`.
Then we encountered another issue: impalad can read all of Hive's files. I created IMPALA-7052 at that time. This can't be solved, because the HDFS C library (libhdfs) doesn't support impersonation until HADOOP-12953 is resolved.

As you mentioned, you've implemented your own Hadoop FileSystem. Maybe you can build your own libhdfs with the patch from HADOOP-12953, then rebuild impalad against that libhdfs and give it a try.

On Thu, Jan 3, 2019 at 8:56 AM Tim Armstrong <tarmstr...@cloudera.com> wrote:

> Stepping back for a second, doesn't what you're trying to do assume that
> each user will load metadata for each table separately? The whole point of
> the catalog server is that we load the metadata once and then share it
> between queries and users.
>
> I don't think we want to have the catalog server load different versions
> of a table depending on which user initially loaded the table. That would
> cause all sorts of issues.
>
> On Wed, Jan 2, 2019 at 12:36 PM mhd wrk <mhdwrkoff...@gmail.com> wrote:
>
>> I see. I was wondering how it works inside Hive server. Basically this is
>> an HDFS C API issue. Thanks for the elaborate explanation.
>>
>> On Wed, Jan 2, 2019 at 12:27 PM Shant Hovsepian <sh...@arcadiadata.com> wrote:
>>
>>> The problem is mostly with libhdfs, as documented in HADOOP-12953.
>>>
>>> On a Kerberized setup the service principal gets picked up. There are
>>> workarounds in the Java HDFS API, but the C-based one in libhdfs has this
>>> issue. Of course caching HDFS will be trickier in Impala as well, but first
>>> this one API in libhdfs needs to be enhanced.
>>>
>>> Also, in general, having database authorization at the file level may not
>>> be a good idea or a clean design; using Sentry and extending its
>>> authorization mechanisms would be cleaner.
>>>
>>> -Shant
>>>
>>> On Wed, Jan 2, 2019, 12:21 PM mhd wrk <mhdwrkoff...@gmail.com> wrote:
>>>
>>>> Thanks for further info.
>>>> Not sure if our Product Management is OK, at
>>>> this point, with us patching the Impala server to get our solution working. Our
>>>> product is supposed to work with already installed servers.
>>>>
>>>> Are there any plans to address the gap (making requesting_user visible inside
>>>> the catalog server) in a future release?
>>>>
>>>> On Wed, Jan 2, 2019 at 11:50 AM Bharath Vissapragada <bhara...@cloudera.com> wrote:
>>>>
>>>>> I was poking around in the code and it looks like we have most of the code
>>>>> in place:
>>>>> <https://github.com/apache/impala/blob/27577dd652554dda5a03016e2d1e3ab66fe6b1f5/common/thrift/CatalogService.thrift#L47>
>>>>>
>>>>> // Common header included in all CatalogService requests.
>>>>> // TODO: The CatalogServiceVersion/protocol version should be part of the header.
>>>>> // This would require changes in BDR and break their compatibility story. We should
>>>>> // coordinate a joint change somewhere down the line.
>>>>> struct TCatalogServiceRequestHeader {
>>>>>   // The effective user who submitted this request.
>>>>>   1: optional string requesting_user
>>>>> }
>>>>>
>>>>> That header is included in all the RPCs. However, it is an optional
>>>>> field and may not be set in a few places (since we don't actually rely on it
>>>>> currently). So you could start by making it a "required" field and see
>>>>> what all breaks. HTH.
>>>>>
>>>>> On Wed, Jan 2, 2019 at 11:35 AM Bharath Vissapragada <bhara...@cloudera.com> wrote:
>>>>>
>>>>>> I think we expose it via the UDF effective_user() (the effective user could
>>>>>> be different from the connected user if delegation/doas is enabled). You can
>>>>>> run a query like "select effective_user()" in a session.
>>>>>>
>>>>>> You can also look it up on the /sessions page of the coordinator web
>>>>>> UI (<coordinator>:25000/sessions?json), which returns a JSON-formatted
>>>>>> string containing the connected and delegate user for each session.
>>>>>>
>>>>>> If you want it on the Catalog side, you probably have to plumb it
>>>>>> through the RPC calls (change the Thrift spec and pass it along from the
>>>>>> coordinator session handling code to the Catalog RPC code).
>>>>>>
>>>>>> On Wed, Jan 2, 2019 at 11:19 AM mhd wrk <mhdwrkoff...@gmail.com> wrote:
>>>>>>
>>>>>>> Is there any Impala/Sentry-specific API we can use inside our code
>>>>>>> to figure out who the current user is?
>>>>>>>
>>>>>>> On Wed, Jan 2, 2019 at 11:12 AM Bharath Vissapragada <bhara...@cloudera.com> wrote:
>>>>>>>
>>>>>>>> Yes. I think Jeszy is right. Per my understanding too, we don't
>>>>>>>> impersonate the client user on the Catalog server. Instead, we enforce
>>>>>>>> authorization via Sentry during query planning.
>>>>>>>>
>>>>>>>> On Wed, Jan 2, 2019 at 7:06 AM mhd wrk <mhdwrkoff...@gmail.com> wrote:
>>>>>>>>
>>>>>>>>> IMPALA-2177 sounds like the correct issue.
>>>>>>>>> Here are log messages from authentication.cc for impalad and
>>>>>>>>> catalogd respectively:
>>>>>>>>>
>>>>>>>>>> I0102 14:15:06.722666 28195 authentication.cc:478] Successfully authenticated client user "ad...@example.com"
>>>>>>>>>> I0102 03:40:07.972348 27948 authentication.cc:445] Successfully authenticated principal "impala/cdh-...@example.com" on an internal connection
>>>>>>>>>
>>>>>>>>> As you can see from the messages above, impalad is able to
>>>>>>>>> identify the currently connected user correctly. However, catalogd
>>>>>>>>> always authenticates as impala, which causes the problem.
>>>>>>>>>
>>>>>>>>> On Wed, Jan 2, 2019 at 4:19 AM Jeszy <jes...@gmail.com> wrote:
>>>>>>>>>
>>>>>>>>>> Hey,
>>>>>>>>>>
>>>>>>>>>> If I understand your question correctly, this is a limitation. IMPALA-2177
>>>>>>>>>> looks to be the appropriate JIRA.
>>>>>>>>>> Most users use Impala together with Sentry, where the recommended
>>>>>>>>>> approach is to disable impersonation (even in services that allow
>>>>>>>>>> it, like Hive).
>>>>>>>>>>
>>>>>>>>>> HTH
>>>>>>>>>>
>>>>>>>>>> On Wed, 2 Jan 2019 at 05:55, Bharath Vissapragada <bhara...@cloudera.com> wrote:
>>>>>>>>>> >
>>>>>>>>>> > Hi,
>>>>>>>>>> >
>>>>>>>>>> > Can you add the stack trace here if possible? It is not super clear where exactly the problem is.
>>>>>>>>>> >
>>>>>>>>>> > Thanks,
>>>>>>>>>> > Bharath
>>>>>>>>>> >
>>>>>>>>>> > On Tue, Jan 1, 2019 at 6:34 PM mhd wrk <mhdwrkoff...@gmail.com> wrote:
>>>>>>>>>> >>
>>>>>>>>>> >> We have our own implementation of Hadoop FileSystem which
>>>>>>>>>> >> relies on the current user in a Kerberized environment to locate
>>>>>>>>>> >> user-specific files in HDFS. This custom file system works fine
>>>>>>>>>> >> inside Hive to create external tables and query them. However,
>>>>>>>>>> >> trying to access the same tables via Impala (JDBC driver) fails.
>>>>>>>>>> >> From the log messages, it seems that when impalad sends requests
>>>>>>>>>> >> to catalogd to get the metadata of a given table, the current user
>>>>>>>>>> >> returned by UserGroupInformation is the service account running
>>>>>>>>>> >> the server (impala/hostn...@example.com) instead of the
>>>>>>>>>> >> currently connected user.
>>>>>>>>>> >>
>>>>>>>>>> >> Is this a known issue or limitation of Impala?
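
For readers following the thread, here is a minimal toy model in Python of the behaviour discussed above: the catalog loads a table's metadata once, as the service user, and then shares that single copy with every requesting user. It is only a sketch and not Impala code. `ToyCatalog`, `get_table`, and the `loads` list are hypothetical names made up for illustration; only the idea of a `requesting_user` that is passed along but not used for loading comes from the thread.

```python
from dataclasses import dataclass, field


@dataclass
class ToyCatalog:
    """Hypothetical stand-in for a shared metadata cache (not Impala's API)."""
    service_user: str
    cache: dict = field(default_factory=dict)
    loads: list = field(default_factory=list)  # records (table, user) for each real load

    def get_table(self, table: str, requesting_user: str) -> dict:
        # requesting_user is accepted but deliberately ignored when loading:
        # the load always runs as the service user, as described in the thread.
        if table not in self.cache:
            self.loads.append((table, self.service_user))
            self.cache[table] = {"table": table, "loaded_as": self.service_user}
        return self.cache[table]


catalog = ToyCatalog(service_user="impala")
first = catalog.get_table("sales", requesting_user="alice")
second = catalog.get_table("sales", requesting_user="bob")

print(first["loaded_as"])   # user the metadata was loaded as
print(first is second)      # both users share the same cached object
print(catalog.loads)        # one load, performed as the service user
```

In this model a file system that resolves paths from the current user would only ever see the service user during a load, and making the cache per-user would mean keying it on the user as well as the table, which is the design concern Tim raises.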