Re: Flink job on secure Yarn fails after many hours

2017-04-12 Thread Robert Metzger
karound. > >> Thanks > >> > >> Thomas > >> > >> > >> From: Maximilian Michels [m...@apache.org] > >> Sent: Tuesday, March 15, 2016 16:51 > >> To: user@flink.apache.org > >> Cc:

Re: Flink job on secure Yarn fails after many hours

2016-03-19 Thread Niels Basjes
; Sent: Tuesday, March 15, 2016 16:51 > To: user@flink.apache.org > Cc: Niels Basjes > Subject: Re: Flink job on secure Yarn fails after many hours > > Hi Thomas, > > Niels (CC) and I found out that you need at least Hadoop version 2.6.1 > to properly run Kerberos applications on Hadoop clust

Re: Flink job on secure Yarn fails after many hours

2016-03-19 Thread Maximilian Michels
0 AM, Thomas Lamirault > <thomas.lamira...@ericsson.com> wrote: >> >> Hi Max, >> >> I will try these workarounds. >> Thanks >> >> Thomas >> >> >> From: Maximilian Michels [m...@apache.org] >

Re: Flink job on secure Yarn fails after many hours

2016-03-15 Thread Maximilian Michels
homas > > > > > > From: ni...@basj.es [ni...@basj.es] on behalf of Niels Basjes > [ni...@basjes.nl] > Sent: Friday, December 4, 2015 10:40 > To: user@flink.apache.org > Subject: Re: Flink job on secure Yarn fails after many hours > > Hi

Re: Flink job on secure Yarn fails after many hours

2015-12-03 Thread Maximilian Michels
Hi Niels, Just got back from our CI. The build above would fail with a Checkstyle error. I corrected that. Also I have built the binaries for your Hadoop version 2.6.0. Binaries: https://drive.google.com/file/d/0BziY9U_qva1sZ1FVR3RWeVNrNzA/view?usp=sharing Source:

Re: Flink job on secure Yarn fails after many hours

2015-12-02 Thread Maximilian Michels
I mentioned that the exception gets thrown when requesting container status information. We need this to send a heartbeat to YARN but it is not very crucial if this fails once for the running job. Possibly, we could work around this problem by retrying N times in case of an exception. Would it be
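The "retry N times in case of an exception" idea mentioned above can be illustrated with a minimal sketch. This is not Flink's or Hadoop's actual code: `call_with_retries` and `FlakyStatusRequest` are hypothetical names invented for this example, and the simulated failure only stands in for a container status request that fails transiently.

```python
import time


def call_with_retries(operation, max_attempts=3, delay_seconds=0.0):
    """Call operation() and retry on any Exception, up to max_attempts calls in total.

    Hypothetical helper for illustration; re-raises the last error if all attempts fail.
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    last_error = None
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except Exception as error:
            last_error = error
            # Only wait when another attempt will follow.
            if attempt < max_attempts:
                time.sleep(delay_seconds)
    raise last_error


class FlakyStatusRequest:
    """Hypothetical stand-in for a status request that fails a fixed number of times."""

    def __init__(self, failures_before_success):
        self.remaining_failures = failures_before_success
        self.calls = 0

    def __call__(self):
        self.calls += 1
        if self.remaining_failures > 0:
            self.remaining_failures -= 1
            raise RuntimeError("simulated transient failure")
        return "RUNNING"
```

With this sketch, a request that fails twice still succeeds within three attempts, while a request that keeps failing surfaces its last exception to the caller instead of being silently swallowed.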

Re: Flink job on secure Yarn fails after many hours

2015-12-02 Thread Maximilian Michels
Hi Niels, You mentioned you have the option to update Hadoop and redeploy the job. Would be great if you could do that and let us know how it turns out. Cheers, Max On Wed, Dec 2, 2015 at 3:45 PM, Niels Basjes wrote: > Hi, > > I posted the entire log from the first log line at

Re: Flink job on secure Yarn fails after many hours

2015-12-02 Thread Niels Basjes
Hi, I posted the entire log from the first log line at the moment of failure to the very end of the logfile. This is all I have. As far as I understand it, the Kerberos and keytab mechanism in Hadoop YARN catches the "Invalid Token" and then (if keytab) gets a new Kerberos ticket (or

Re: Flink job on secure Yarn fails after many hours

2015-12-02 Thread Maximilian Michels
Great. Here is the commit to try out: https://github.com/mxm/flink/commit/f49b9635bec703541f19cb8c615f302a07ea88b3 If you already have the Flink repository, check it out using git fetch https://github.com/mxm/flink/ f49b9635bec703541f19cb8c615f302a07ea88b3 && git checkout FETCH_HEAD

Re: Flink job on secure Yarn fails after many hours

2015-12-02 Thread Niels Basjes
Sure, just give me the git repo url to build and I'll give it a try. Niels On Wed, Dec 2, 2015 at 4:28 PM, Maximilian Michels wrote: > I mentioned that the exception gets thrown when requesting container > status information. We need this to send a heartbeat to YARN but it is

Re: Flink job on secure Yarn fails after many hours

2015-12-02 Thread Maximilian Michels
I forgot you're using Flink 0.10.1. The above was for the master. So here's the commit for Flink 0.10.1: https://github.com/mxm/flink/commit/a41f3866f4097586a7b2262093088861b62930cd git fetch https://github.com/mxm/flink/ a41f3866f4097586a7b2262093088861b62930cd && git checkout FETCH_HEAD

Re: Flink job on secure Yarn fails after many hours

2015-12-02 Thread Maximilian Michels
Hi Niels, Sorry to hear you experienced this exception. At first glance, it looks like a bug in Hadoop to me. > "Not retrying because the invoked method is not idempotent, and unable to > determine whether it was invoked" That is nothing to worry about. This is Hadoop's internal retry