Yeah, I can see the use case. The challenge is always finding people to do the hard work of building and maintaining features.

On Thu, Jan 3, 2019 at 10:48 AM mhd wrk <mhdwrkoff...@gmail.com> wrote:

> I understand the reasoning behind the design decision that requires making
> files available to a certain user. However, there are clients in certain
> industries who are OK with an acceptable performance hit (possibly caused
> by loading/caching metadata per user) as long as they can have
> user-specific permissions at all storage levels (HDFS, Accumulo and ....).
>
> IMO, Impala should make this possible as a configuration option.
>
> On Thu, Jan 3, 2019 at 10:22 AM Bharath Vissapragada
> <bhara...@cloudera.com> wrote:
>
>> Agree with Tim's points. My opinion is also the same, given the current
>> Catalog architecture.
>>
>> On Thu, Jan 3, 2019 at 10:17 AM Tim Armstrong <tarmstr...@cloudera.com>
>> wrote:
>>
>>> Right, we could use requesting_user for logging, statistics, etc, but it
>>> would be problematic to impersonate that user when loading metadata.
>>>
>>> It's of course possible that I'm missing something here.
>>>
>>> On Thu, Jan 3, 2019 at 10:05 AM mhd wrk <mhdwrkoff...@gmail.com> wrote:
>>>
>>>> Thanks for the link. So the final answer is that even if the libhdfs
>>>> bug gets fixed there won't be any changes to Impala to expose
>>>> requesting_user in Catalog Service, right?
>>>>
>>>> On Thu, Jan 3, 2019 at 9:46 AM Tim Armstrong <tarmstr...@cloudera.com>
>>>> wrote:
>>>>
>>>>> > catalog server ignores file system authorization model
>>>>> The catalog daemon does this by design - the idea is that the catalog
>>>>> server can load and cache metadata on behalf of multiple users. It
>>>>> requires that the catalogd user (usually "impala") has permissions to
>>>>> read filesystem metadata.
>>>>>
>>>>> The "user account requirements" section in our docs explains this:
>>>>> https://impala.apache.org/docs/build/html/topics/impala_prereqs.html#prereqs
>>>>> and
>>>>> https://impala.apache.org/docs/build/html/topics/impala_security_files.html
>>>>>
>>>>> On Wed, Jan 2, 2019 at 5:52 PM mhd wrk <mhdwrkoff...@gmail.com> wrote:
>>>>>
>>>>>> It's more about enforcing Hadoop file system authorisation. All we
>>>>>> have done is implement a custom Hadoop File System
>>>>>> (org.apache.hadoop.fs.FileSystem), and we are now trying to use Impala
>>>>>> to query files hosted on that file system. It fails because the catalog
>>>>>> server ignores the file system authorization model. The same file system
>>>>>> works nicely with HDFS commands (e.g. hdfs dfs -ls ...) as well as
>>>>>> HiveServer.
>>>>>>
>>>>>> Our clients expect us to enforce authorization at all levels (HDFS,
>>>>>> Accumulo, Hive, Impala and ....)
>>>>>>
>>>>>> On Wed, Jan 2, 2019 at 4:56 PM Tim Armstrong <tarmstr...@cloudera.com>
>>>>>> wrote:
>>>>>>
>>>>>>> Stepping back for a second, doesn't what you're trying to do assume
>>>>>>> that each user will load metadata for each table separately? The whole
>>>>>>> point of the catalog server is that we load the metadata once and then
>>>>>>> share it between queries and users.
>>>>>>>
>>>>>>> I don't think we want to have the catalog server load different
>>>>>>> versions of a table depending on which user initially loaded the table.
>>>>>>> That would cause all sorts of issues.
>>>>>>>
>>>>>>> On Wed, Jan 2, 2019 at 12:36 PM mhd wrk <mhdwrkoff...@gmail.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> I see. I was wondering how it works inside hive server. Basically
>>>>>>>> this is a HDFS C API issue. Thanks for the elaborate explanation.
>>>>>>>>
>>>>>>>> On Wed, Jan 2, 2019 at 12:27 PM Shant Hovsepian
>>>>>>>> <sh...@arcadiadata.com> wrote:
>>>>>>>>
>>>>>>>>> Problem is mostly with libhdfs, as documented in HADOOP-12953.
>>>>>>>>>
>>>>>>>>> On a kerberized setup the service principal gets picked up. There
>>>>>>>>> are workarounds in the Java HDFS API, but the C-based one in libhdfs
>>>>>>>>> has this issue. Of course caching HDFS will be trickier in Impala as
>>>>>>>>> well, but first this one API in libhdfs needs to be enhanced.
>>>>>>>>>
>>>>>>>>> Also, in general, having database authorization at the file level
>>>>>>>>> may not be a good idea or clean design, and using Sentry and extending
>>>>>>>>> its authorization mechanisms would be cleaner.
>>>>>>>>>
>>>>>>>>> -Shant
>>>>>>>>>
>>>>>>>>> On Wed, Jan 2, 2019, 12:21 PM mhd wrk <mhdwrkoff...@gmail.com>
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>>> Thanks for further info. Not sure if our Product Management is
>>>>>>>>>> OK, at this point, with us patching Impala server to get our solution
>>>>>>>>>> working. Our product is supposed to work with already installed
>>>>>>>>>> servers.
>>>>>>>>>>
>>>>>>>>>> Any plans to address the gap (making requesting_user visible
>>>>>>>>>> inside catalog server) in a future release?
>>>>>>>>>>
>>>>>>>>>> On Wed, Jan 2, 2019 at 11:50 AM Bharath Vissapragada
>>>>>>>>>> <bhara...@cloudera.com> wrote:
>>>>>>>>>>
>>>>>>>>>>> I was poking around in the code and it looks like we have most
>>>>>>>>>>> of the code in place
>>>>>>>>>>> <https://github.com/apache/impala/blob/27577dd652554dda5a03016e2d1e3ab66fe6b1f5/common/thrift/CatalogService.thrift#L47>
>>>>>>>>>>>
>>>>>>>>>>> // Common header included in all CatalogService requests.
>>>>>>>>>>> // TODO: The CatalogServiceVersion/protocol version should be part of the header.
>>>>>>>>>>> // This would require changes in BDR and break their compatibility story. We should
>>>>>>>>>>> // coordinate a joint change somewhere down the line.
>>>>>>>>>>> struct TCatalogServiceRequestHeader {
>>>>>>>>>>>   // The effective user who submitted this request.
>>>>>>>>>>>   1: optional string requesting_user
>>>>>>>>>>> }
>>>>>>>>>>>
>>>>>>>>>>> That header is included in all the RPCs. However, that is an
>>>>>>>>>>> optional field and may not be set in a few places (since we don't
>>>>>>>>>>> actually rely on that currently). So you could start with making it a
>>>>>>>>>>> "required" field and see what all breaks. HTH.
>>>>>>>>>>>
>>>>>>>>>>> On Wed, Jan 2, 2019 at 11:35 AM Bharath Vissapragada
>>>>>>>>>>> <bhara...@cloudera.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> I think we expose it via UDF effective_user() (effective user
>>>>>>>>>>>> could be different from the connected user if delegation/doas is
>>>>>>>>>>>> enabled). You can run a query like "select effective_user()" in a
>>>>>>>>>>>> session.
>>>>>>>>>>>>
>>>>>>>>>>>> You can also look it up in the /sessions page on the
>>>>>>>>>>>> coordinator web UI (<coordinator>:25000/sessions?json) and you can
>>>>>>>>>>>> get a JSON-formatted string containing the connected and delegate
>>>>>>>>>>>> user for each session.
>>>>>>>>>>>>
>>>>>>>>>>>> If you want it on the Catalog side, you probably have to plumb
>>>>>>>>>>>> it through the RPC calls (change the thrift spec and pass it along
>>>>>>>>>>>> from the coordinator session handling code to the Catalog RPC code).
>>>>>>>>>>>>
>>>>>>>>>>>> On Wed, Jan 2, 2019 at 11:19 AM mhd wrk <mhdwrkoff...@gmail.com>
>>>>>>>>>>>> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> Is there any Impala/Sentry specific API we can use inside our
>>>>>>>>>>>>> code to figure out who the current user is?
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Wed, Jan 2, 2019 at 11:12 AM Bharath Vissapragada
>>>>>>>>>>>>> <bhara...@cloudera.com> wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> Yes. I think Jeszy is right.
>>>>>>>>>>>>>> Per my understanding too, we
>>>>>>>>>>>>>> don't impersonate the client user on the Catalog server. Instead, we
>>>>>>>>>>>>>> enforce the authorization via Sentry during query planning.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Wed, Jan 2, 2019 at 7:06 AM mhd wrk <mhdwrkoff...@gmail.com>
>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> IMPALA-2177 sounds like the correct issue.
>>>>>>>>>>>>>>> Here are log messages from authentication.cc for impalad and
>>>>>>>>>>>>>>> catalogd respectively:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> I0102 14:15:06.722666 28195 authentication.cc:478] Successfully authenticated client user *"ad...@example.com"*
>>>>>>>>>>>>>>>> I0102 03:40:07.972348 27948 authentication.cc:445] Successfully authenticated principal *"impala/cdh-...@example.com"* on an internal connection
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> As you can see from the messages above, impalad is able to
>>>>>>>>>>>>>>> identify the currently connected user correctly. However, catalogd
>>>>>>>>>>>>>>> always authenticates as impala, which causes the problem.
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On Wed, Jan 2, 2019 at 4:19 AM Jeszy <jes...@gmail.com>
>>>>>>>>>>>>>>> wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> Hey,
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> IIUC your question correctly, this is a limitation.
>>>>>>>>>>>>>>>> IMPALA-2177 looks to be the appropriate jira.
>>>>>>>>>>>>>>>> Most users use Impala together with Sentry, where the recommended
>>>>>>>>>>>>>>>> approach is to disable impersonation (even in services that allow
>>>>>>>>>>>>>>>> it, like Hive).
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> HTH
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> On Wed, 2 Jan 2019 at 05:55, Bharath Vissapragada
>>>>>>>>>>>>>>>> <bhara...@cloudera.com> wrote:
>>>>>>>>>>>>>>>> >
>>>>>>>>>>>>>>>> > Hi,
>>>>>>>>>>>>>>>> >
>>>>>>>>>>>>>>>> > Can you add the stack trace here if possible? It is not
>>>>>>>>>>>>>>>> > super clear where exactly the problem is.
>>>>>>>>>>>>>>>> >
>>>>>>>>>>>>>>>> > Thanks,
>>>>>>>>>>>>>>>> > Bharath
>>>>>>>>>>>>>>>> >
>>>>>>>>>>>>>>>> > On Tue, Jan 1, 2019 at 6:34 PM mhd wrk <mhdwrkoff...@gmail.com> wrote:
>>>>>>>>>>>>>>>> >>
>>>>>>>>>>>>>>>> >> We have our own implementation of Hadoop FileSystem
>>>>>>>>>>>>>>>> >> which relies on the current user in a kerberized environment to
>>>>>>>>>>>>>>>> >> locate user-specific files in HDFS. This custom file system works
>>>>>>>>>>>>>>>> >> fine inside Hive to create external tables and query them. However,
>>>>>>>>>>>>>>>> >> trying to access the same tables via Impala (JDBC driver) fails.
>>>>>>>>>>>>>>>> >> Watching the log messages, it seems that when impalad sends
>>>>>>>>>>>>>>>> >> requests to catalogd to get metadata of a given table, the current
>>>>>>>>>>>>>>>> >> user returned by UserGroupInformation is the service account
>>>>>>>>>>>>>>>> >> running the server (impala/hostn...@example.com) instead of the
>>>>>>>>>>>>>>>> >> currently connected user.
>>>>>>>>>>>>>>>> >>
>>>>>>>>>>>>>>>> >> Is this a known issue or limitation of Impala?
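
The trade-off debated in this thread (catalogd loading table metadata once as the service user and sharing it, versus loading it per requesting user so that file system permissions apply) can be illustrated with a small toy model. This is only a sketch under stated assumptions, not Impala code: `MetadataCache`, `fake_loader`, and the `per_user` switch are hypothetical names, and the switch stands in for the configuration option proposed above, which does not exist in Impala as far as this thread says.

```python
from typing import Callable, Dict, Tuple


class MetadataCache:
    """Toy model of a metadata cache; NOT Impala's catalogd implementation."""

    def __init__(self, load_fn: Callable[[str, str], dict],
                 per_user: bool = False, service_user: str = "impala") -> None:
        self._load_fn = load_fn            # hypothetical loader: (user, table) -> metadata
        self._per_user = per_user          # hypothetical switch, not a real Impala flag
        self._service_user = service_user  # account the daemon runs as
        self._cache: Dict[Tuple[str, str], dict] = {}
        self.loads = 0                     # count of expensive metadata loads

    def get(self, requesting_user: str, table: str) -> dict:
        # Shared mode: always load as the service user, one entry per table.
        # Per-user mode: load as the requesting user, one entry per (user, table).
        load_as = requesting_user if self._per_user else self._service_user
        key = (load_as, table)
        if key not in self._cache:
            self._cache[key] = self._load_fn(load_as, table)
            self.loads += 1
        return self._cache[key]


def fake_loader(user: str, table: str) -> dict:
    # Stand-in for listing files with the given user's file system permissions.
    return {"table": table, "loaded_as": user}


if __name__ == "__main__":
    shared = MetadataCache(fake_loader)
    per_user = MetadataCache(fake_loader, per_user=True)
    for user in ("alice", "bob", "alice"):
        shared.get(user, "db.t1")
        per_user.get(user, "db.t1")
    # Shared mode loads the table once; per-user mode loads it once per distinct user.
    print(shared.loads, per_user.loads)
```

In this model the per-user mode multiplies both load count and cache size by the number of distinct users, which is the performance hit mentioned above, while the shared mode never sees the requesting user's permissions at all.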