Thanks for the pointers towards the work you are doing here. I'll put up a patch for the jars and such in the next few days. https://issues.apache.org/jira/browse/FLINK-4287
Niels Basjes

On Mon, Aug 1, 2016 at 11:46 AM, Stephan Ewen <se...@apache.org> wrote:

> Thank you for the breakdown of the problem.
>
> Option (1) or (2) would be the way to go, currently.
>
> The problem that (3) does not support HBase is simply solvable by adding
> the HBase jars to the lib directory. In the future, this should be solved
> by the YARN re-architecturing:
> https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=65147077
>
> For the renewal of Kerberos tokens for streaming jobs: there is work in
> progress and a pull request to attach keytabs to a Flink job:
> https://github.com/apache/flink/pull/2275
>
> The problem that the YARN session is accessible by everyone is a bit more
> tricky. In the future, this should be solved by these two parts:
> - With the YARN re-architecturing, sessions are bound to individual
>   users. It should be possible to launch the session out of a single
>   YarnExecutionEnvironment and then submit multiple jobs against it.
> - The over-the-wire encryption and authentication should make sure that
>   no other user can send jobs to that session.
>
> Greetings,
> Stephan
>
>
> On Mon, Aug 1, 2016 at 9:47 AM, Niels Basjes <ni...@basjes.nl> wrote:
>
>> Hi,
>>
>> I have the situation that I have a Kerberos-secured YARN/HBase
>> installation and I want to export data from a lot (~200) of HBase tables
>> to files on HDFS.
>> I wrote a Flink job that does this exactly the way I want it for a
>> single table.
>>
>> Now in general I have a few possible approaches to do this for the 200
>> tables I am facing:
>>
>> 1) Create a single job that reads the data from all of those tables and
>>    writes them to the correct files.
>>    I expect that to be a monster that will hog the entire cluster
>>    because of the large number of HBase regions.
>>
>> 2) Run a job that does this for a single table and simply run that in a
>>    loop.
>>    Essentially I would have a shell script or 'main' that loops over all
>>    table names and runs a Flink job for each of those.
>>    The downside of this is that it starts a new Flink topology on YARN
>>    for each table, with a startup overhead of something like 30 seconds
>>    per table that I would like to avoid.
>>
>> 3) Start a single yarn-session and submit my job in there 200 times.
>>    That would avoid most of the startup overhead, yet it doesn't work.
>>
>> If I start yarn-session then I see these two relevant lines in the
>> output:
>>
>> 2016-07-29 14:58:30,575 INFO org.apache.flink.yarn.Utils
>>   - Attempting to obtain Kerberos security token for HBase
>> 2016-07-29 14:58:30,576 INFO org.apache.flink.yarn.Utils
>>   - HBase is not available (not packaged with this application):
>>   ClassNotFoundException : "org.apache.hadoop.hbase.HBaseConfiguration".
>>
>> As a consequence, any Flink job I submit cannot access HBase at all.
>>
>> As an experiment I changed my yarn-session.sh script to include HBase on
>> the classpath. (If you want I can submit a Jira issue and a pull
>> request.) Now the yarn-session does have HBase available and the job
>> runs as expected.
>>
>> There are, however, two problems that remain:
>> 1) This yarn-session is accessible by everyone on the cluster and, as a
>>    consequence, they can run jobs in there that can access all data I
>>    have access to.
>> 2) The Kerberos token will expire after a while and (just like with all
>>    long-running jobs) I would really like this to be a 'long lived'
>>    thing.
>>
>> As far as I know this is just the tip of the security iceberg and I
>> would like to know what the correct approach is to solve this.
>>
>> Thanks.
>>
>> --
>> Best regards / Met vriendelijke groeten,
>>
>> Niels Basjes
>>

--
Best regards / Met vriendelijke groeten,

Niels Basjes
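As an illustration of option (1) discussed above (building one job graph that covers several tables and submitting it once), here is a minimal sketch using the Flink DataSet API. The table list, the HDFS output paths, and the stand-in source are assumptions made for the example; a real job would read each table through an HBase input format such as the one provided by flink-hbase.

import java.util.Arrays;
import java.util.List;

import org.apache.flink.api.java.DataSet;
import org.apache.flink.api.java.ExecutionEnvironment;

public class MultiTableExport {

    public static void main(String[] args) throws Exception {
        // The table names would normally come from configuration or an HBase
        // admin lookup; a hard-coded list keeps the sketch self-contained.
        List<String> tableNames = Arrays.asList("table_001", "table_002");

        ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment();

        // Add one source -> sink pipeline per table to the same job graph.
        for (String table : tableNames) {
            // Stand-in source for the sketch; the actual job would use an
            // HBase input format scanning this table.
            DataSet<String> rows = env.fromElements("row-from-" + table);

            rows.writeAsText("hdfs:///exports/" + table);
        }

        // A single execute() submits everything as one topology, so the
        // cluster is allocated and the job is started only once.
        env.execute("Export HBase tables to HDFS");
    }
}

Whether a single topology of this size is practical depends on the number of HBase regions involved, which is the concern raised against option (1) in the thread.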