Thank you for looking into the problem, Niels. Let us know if you need anything. We would be happy to merge a pull request once you have verified the fix.
On Thu, Nov 5, 2015 at 1:38 PM, Niels Basjes <ni...@basjes.nl> wrote:

> I created https://issues.apache.org/jira/browse/FLINK-2977
>
> On Thu, Nov 5, 2015 at 12:25 PM, Robert Metzger <rmetz...@apache.org> wrote:
>
>> Hi Niels,
>> thank you for analyzing the issue so thoroughly. I agree with you. It seems
>> that HDFS and HBase are using their own tokens, which we need to transfer
>> from the client to the YARN containers. We should be able to port the fix
>> from Spark (which they got from Storm) into our YARN client.
>> I think we would add this in org.apache.flink.yarn.Utils#setTokensFor().
>>
>> Do you want to implement and verify the fix yourself? If you are too busy
>> at the moment, we can also discuss how we share the work (I implement it,
>> you test the fix).
>>
>> Robert
>>
>> On Tue, Nov 3, 2015 at 5:26 PM, Niels Basjes <ni...@basjes.nl> wrote:
>>
>>> An update on the status so far: I suspect I have found a problem in a
>>> secure setup.
>>>
>>> I have created a very simple Flink topology consisting of a streaming
>>> source (that outputs the timestamp a few times per second) and a sink
>>> (that writes that timestamp into a single record in HBase).
>>> Running this on a non-secure YARN cluster works fine.
>>>
>>> To run it on a secured YARN cluster, my main routine now looks like this:
>>>
>>> public static void main(String[] args) throws Exception {
>>>     System.setProperty("java.security.krb5.conf", "/etc/krb5.conf");
>>>     UserGroupInformation.loginUserFromKeytab("nbas...@xxxxxx.net",
>>>         "/home/nbasjes/.krb/nbasjes.keytab");
>>>
>>>     final StreamExecutionEnvironment env =
>>>         StreamExecutionEnvironment.getExecutionEnvironment();
>>>     env.setParallelism(1);
>>>
>>>     DataStream<String> stream = env.addSource(new TimerTicksSource());
>>>     stream.addSink(new SetHBaseRowSink());
>>>     env.execute("Long running Flink application");
>>> }
>>>
>>> When I run this:
>>>
>>> flink run -m yarn-cluster -yn 1 -yjm 1024 -ytm 4096 ./kerberos-1.0-SNAPSHOT.jar
>>>
>>> I see after the startup messages:
>>>
>>> 17:13:24,466 INFO  org.apache.hadoop.security.UserGroupInformation
>>>     - Login successful for user nbas...@xxxxxx.net using keytab
>>>       file /home/nbasjes/.krb/nbasjes.keytab
>>> 11/03/2015 17:13:25  Job execution switched to status RUNNING.
>>> 11/03/2015 17:13:25  Custom Source -> Stream Sink(1/1) switched to SCHEDULED
>>> 11/03/2015 17:13:25  Custom Source -> Stream Sink(1/1) switched to DEPLOYING
>>> 11/03/2015 17:13:25  Custom Source -> Stream Sink(1/1) switched to RUNNING
>>>
>>> Which looks good.
>>>
>>> However ... no data goes into HBase.
>>> After some digging I found this error in the task manager's log:
>>>
>>> 17:13:42,677 WARN  org.apache.hadoop.hbase.ipc.RpcClient
>>>     - Exception encountered while connecting to the server :
>>>       javax.security.sasl.SaslException: GSS initiate failed [Caused by
>>>       GSSException: No valid credentials provided (Mechanism level: Failed to
>>>       find any Kerberos tgt)]
>>> 17:13:42,677 FATAL org.apache.hadoop.hbase.ipc.RpcClient
>>>     - SASL authentication failed. The most likely cause is missing or
>>>       invalid credentials. Consider 'kinit'.
>>> javax.security.sasl.SaslException: GSS initiate failed [Caused by
>>> GSSException: No valid credentials provided (Mechanism level: Failed to
>>> find any Kerberos tgt)]
>>>     at com.sun.security.sasl.gsskerb.GssKrb5Client.evaluateChallenge(GssKrb5Client.java:212)
>>>     at org.apache.hadoop.hbase.security.HBaseSaslRpcClient.saslConnect(HBaseSaslRpcClient.java:177)
>>>     at org.apache.hadoop.hbase.ipc.RpcClient$Connection.setupSaslConnection(RpcClient.java:815)
>>>     at org.apache.hadoop.hbase.ipc.RpcClient$Connection.access$800(RpcClient.java:349)
>>>
>>> First starting a yarn-session and then loading my job gives the same error.
>>>
>>> My best guess at this point is that Flink needs the same fix as described here:
>>> https://issues.apache.org/jira/browse/SPARK-6918
>>> (https://github.com/apache/spark/pull/5586)
>>>
>>> What do you guys think?
>>>
>>> Niels Basjes
>>>
>>> On Tue, Oct 27, 2015 at 6:12 PM, Maximilian Michels <m...@apache.org> wrote:
>>>
>>>> Hi Niels,
>>>>
>>>> You're welcome. Some more information on how this would be configured:
>>>>
>>>> In kdc.conf, there are two variables:
>>>>
>>>> max_life = 2h 0m 0s
>>>> max_renewable_life = 7d 0h 0m 0s
>>>>
>>>> max_life is the maximum lifetime of the current ticket. However, it may be
>>>> renewed up to a time span of max_renewable_life from the first ticket
>>>> issue on. This means that from the first ticket issue, new tickets may be
>>>> requested for one week. Each renewed ticket has a lifetime of max_life
>>>> (2 hours in this case).
>>>>
>>>> Please let us know about any difficulties with long-running streaming
>>>> applications and Kerberos.
>>>>
>>>> Best regards,
>>>> Max
>>>>
>>>> On Tue, Oct 27, 2015 at 2:46 PM, Niels Basjes <ni...@basjes.nl> wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>> Thanks for your feedback.
>>>>> So I guess I'll have to talk to the security guys about having special
>>>>> Kerberos ticket expiry times for these types of jobs.
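The max_life and max_renewable_life values Max quotes put a hard ceiling on how long a job can keep authenticating. A minimal, self-contained Java sketch of that arithmetic (purely illustrative, using the example values from the kdc.conf excerpt; this is not Flink or Hadoop code):

```java
import java.time.Duration;

public class TicketLifetime {
    public static void main(String[] args) {
        // Example values from the kdc.conf excerpt above.
        Duration maxLife = Duration.ofHours(2);      // max_life = 2h 0m 0s
        Duration maxRenewable = Duration.ofDays(7);  // max_renewable_life = 7d 0h 0m 0s

        // Each renewal grants a fresh ticket of length max_life, but no ticket
        // may extend past (first issue + max_renewable_life). So the job must
        // renew roughly every max_life, and stops being renewable entirely
        // after max_renewable_life.
        long renewals = maxRenewable.toHours() / maxLife.toHours();
        System.out.println("Renewals before hard expiry: " + renewals);            // prints 84
        System.out.println("Absolute ceiling in hours:   " + maxRenewable.toHours()); // prints 168
    }
}
```

With these defaults a "never stop it" streaming job is guaranteed to lose its credentials after one week, which is exactly why special expiry times (or delegation tokens) are needed.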
>>>>>
>>>>> Niels Basjes
>>>>>
>>>>> On Fri, Oct 23, 2015 at 11:45 AM, Maximilian Michels <m...@apache.org> wrote:
>>>>>
>>>>>> Hi Niels,
>>>>>>
>>>>>> Thank you for your question. Flink relies entirely on the Kerberos
>>>>>> support of Hadoop. So your question could also be rephrased as "Does
>>>>>> Hadoop support long-term authentication using Kerberos?". And the
>>>>>> answer is: yes!
>>>>>>
>>>>>> While Hadoop uses Kerberos tickets to authenticate users with services
>>>>>> initially, the authentication process continues differently
>>>>>> afterwards. Instead of saving the ticket to authenticate on a later
>>>>>> access, Hadoop creates its own security tokens (DelegationToken) that
>>>>>> it passes around. These are authenticated against Kerberos
>>>>>> periodically. To my knowledge, the tokens have a life span identical
>>>>>> to the Kerberos ticket's maximum life span. So be sure to set the
>>>>>> maximum life span very high for long streaming jobs. The renewal time,
>>>>>> on the other hand, is not important, because Hadoop abstracts this
>>>>>> away using its own security tokens.
>>>>>>
>>>>>> I'm afraid there is no Kerberos how-to yet. If you are on YARN, then
>>>>>> it is sufficient to authenticate the client with Kerberos. On a Flink
>>>>>> standalone cluster you need to ensure that, initially, all nodes are
>>>>>> authenticated with Kerberos using the kinit tool.
>>>>>>
>>>>>> Feel free to ask if you have more questions, and let us know about any
>>>>>> difficulties.
>>>>>>
>>>>>> Best regards,
>>>>>> Max
>>>>>>
>>>>>> On Thu, Oct 22, 2015 at 2:06 PM, Niels Basjes <ni...@basjes.nl> wrote:
>>>>>> > Hi,
>>>>>> >
>>>>>> > I want to write a long-running (i.e. never stop it) streaming Flink
>>>>>> > application on a Kerberos-secured Hadoop/YARN cluster. My application
>>>>>> > needs to do things with files on HDFS and HBase tables on that
>>>>>> > cluster, so having the correct Kerberos tickets is very important.
>>>>>> > The stream is to be ingested from Kafka.
>>>>>> >
>>>>>> > One of the things with Kerberos is that the tickets expire after a
>>>>>> > predetermined time. My knowledge about Kerberos is very limited, so
>>>>>> > I hope you guys can help me.
>>>>>> >
>>>>>> > My question is actually quite simple: Is there a how-to somewhere on
>>>>>> > how to correctly run a long-running Flink application with Kerberos
>>>>>> > that includes a solution for the Kerberos ticket timeout?
>>>>>> >
>>>>>> > Thanks,
>>>>>> >
>>>>>> > Niels Basjes
>>>>>
>>>>> --
>>>>> Best regards / Met vriendelijke groeten,
>>>>>
>>>>> Niels Basjes
>>>
>>> --
>>> Best regards / Met vriendelijke groeten,
>>>
>>> Niels Basjes
>
> --
> Best regards / Met vriendelijke groeten,
>
> Niels Basjes
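For context on the fix discussed in this thread (porting Spark's approach into org.apache.flink.yarn.Utils#setTokensFor()): the idea is that the client, while it still holds a valid Kerberos TGT, obtains HDFS and HBase delegation tokens and ships them to the YARN containers, which then authenticate with those tokens instead of needing their own TGT. The sketch below models only the shape of that hand-off using plain Java collections; the class and method names are invented stand-ins, and a real implementation would use Hadoop's Credentials and Token classes as in the Spark pull request linked above.

```java
import java.util.HashMap;
import java.util.Map;

// Invented stand-in for Hadoop's Credentials: a named bag of opaque token bytes.
class CredentialsSketch {
    private final Map<String, byte[]> tokens = new HashMap<>();

    void addToken(String service, byte[] token) { tokens.put(service, token); }
    byte[] getToken(String service) { return tokens.get(service); }
    int numberOfTokens() { return tokens.size(); }
}

public class SetTokensForSketch {

    // Client side: while the Kerberos TGT is still valid, collect one
    // delegation token per secured service the job will touch, and attach
    // the bag to the container launch context (here: simply return it).
    static CredentialsSketch collectTokensOnClient() {
        CredentialsSketch creds = new CredentialsSketch();
        // In the real fix these bytes would come from the HDFS NameNode and
        // the HBase master, requested by the Kerberos-authenticated client.
        creds.addToken("hdfs", "hdfs-delegation-token".getBytes());
        creds.addToken("hbase", "hbase-delegation-token".getBytes());
        return creds;
    }

    public static void main(String[] args) {
        // Container side: the TaskManager reads the shipped tokens and uses
        // them for RPC authentication, so it never needs a Kerberos TGT of
        // its own -- which is exactly what the failing HBase sink lacked.
        CredentialsSketch shipped = collectTokensOnClient();
        System.out.println("Tokens shipped to container: " + shipped.numberOfTokens()); // prints 2
    }
}
```

This also explains the symptom above: HDFS access worked because Flink's YARN client already handled HDFS tokens for its own file staging, while HBase RPC failed in the container with "Failed to find any Kerberos tgt" because no HBase token was shipped.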