Hi Vishal,
you should not need to configure anything else.
Cheers,
Till
On Sat, Jun 30, 2018 at 7:23 PM Vishal Santoshi wrote:
A clarification: in 1.5, with the custom heartbeats, are there additional
configurations we should be concerned about?
On Fri, May 25, 2018 at 10:17 AM, Steven Wu wrote:
Till, thanks for the follow-up. looking forward to 1.5 :)
On Fri, May 25, 2018 at 2:11 AM, Till Rohrmann wrote:
Hi Steven,
we don't have `jobmanager.exit-on-fatal-akka-error` because then the JM
would also be killed if a single TM gets quarantined. This is also not a
desired behaviour.
With Flink 1.5 the problem with quarantining should be gone, since we no
longer rely on Akka's death watch and instead use Flink's own heartbeat
mechanism between the JobManager and the TaskManagers.
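As a pointer for anyone tuning this on 1.5: the new heartbeat mechanism is controlled by `heartbeat.interval` and `heartbeat.timeout` in flink-conf.yaml. A minimal sketch (values shown are the 1.5 defaults, in milliseconds; adjust for your environment):

```yaml
# Flink 1.5+ heartbeat settings (flink-conf.yaml); values shown are the defaults.
# Interval at which heartbeats are requested between JobManager and TaskManagers.
heartbeat.interval: 10000
# Time after which a component is considered dead if no heartbeat was received.
heartbeat.timeout: 50000
```

Raising `heartbeat.timeout` is the usual lever if long GC pauses are causing false positives.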
Till,
thanks for the clarification. Yes, that situation is not desirable either.
In our case, restarting the jobmanager could also recover the job from the
akka association lock-out. It was actually an issue (high GC pause) on the
jobmanager side that caused the akka failure.
Do we have something like `jobmanager.exit-on-fatal-akka-error`?
Hi Steven,
the reason why we did not turn this feature on by default was that in case
of a true JM failure, all of the TMs will think that they got quarantined,
which triggers their shutdown. Depending on how many container restarts
you have left on Yarn, for example, this can then lead to the whole cluster
shutting down.
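For reference, the feature being discussed is an opt-in flag in flink-conf.yaml; a sketch of enabling it, with the caveat Till describes above:

```yaml
# Opt-in: make a TaskManager process exit when it hits a fatal Akka error
# (e.g. it got quarantined), so the resource manager can restart it.
# Default is false: on a true JM failure, every TM would think it was
# quarantined and shut down at once, burning through Yarn container restarts.
taskmanager.exit-on-fatal-akka-error: true
```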
Till,
We ran into the same issue. It started with a high GC pause that caused the
jobmanager to lose its ZooKeeper connection and leadership, and caused the
jobmanager to quarantine the taskmanagers in akka. Once quarantined, the
akka association between jobmanager and taskmanager is locked forever.
Your suggestion of "`taskmanager.exit-on-fatal-akka-error`" sounds like it
should help here.
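On the 1.4 line, where quarantining can still happen, the Akka death-watch that triggers it can at least be made more tolerant of long GC pauses. A hedged sketch of the relevant flink-conf.yaml knobs (option names per the Flink 1.4 configuration docs; the values here are illustrative, not recommendations):

```yaml
# Flink <= 1.4: death-watch settings that decide when a remote actor is
# considered dead (and may end up quarantined). A larger acceptable pause
# tolerates longer GC stalls at the cost of slower failure detection.
akka.watch.heartbeat.interval: 10 s
akka.watch.heartbeat.pause: 120 s
akka.watch.threshold: 12
```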
@Jelmer, this is Till's last response on the issue; his full message is
quoted further below.
-- Ashish
Hi Till,
Thanks for the detailed response. I will try to gather some of this
information during the week and follow up.
-- Ashish
On Feb 5, 2018, at 5:55 AM, Till Rohrmann wrote:
Hi,
this sounds like a serious regression wrt Flink 1.3.2 and we should
definitely find out what's causing this problem. From what I see in the
logs, the following happens:
For some time the JobManager seems to no longer receive heartbeats from the
TaskManager. This could be, for example, due to long GC pauses on the
TaskManager side.
I've seen a similar issue while running successive Flink SQL batches on
1.4. In my case, the JobManager would fail with log output about
unreachability (with an additional statement about something going
"horribly wrong"). Under workload pressure, I reverted to 1.3.2, where
everything works.
I haven't gotten much further with this. It doesn't look GC related; at
least the GC counters were not that atrocious. However, my main concern was:
once the load subsides, why aren't the TM and JM connecting again? That
doesn't look normal. I could definitely tell the JM was listening on the
port.
Hi.
Did you find a reason for the detaching?
I sometimes see the same on our system running Flink 1.4 on DC/OS. I have
enabled `taskmanager.debug.memory.startLogThread` for debugging.
Best regards
Lasse Nedergaard
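For anyone else wanting the memory log thread Lasse mentions, the flag (plus its logging interval) goes into flink-conf.yaml; a sketch, where the interval value is just an illustration:

```yaml
# Periodically log TaskManager memory usage (heap, off-heap, GC stats).
taskmanager.debug.memory.startLogThread: true
# How often to log, in milliseconds (5000 chosen for illustration).
taskmanager.debug.memory.logIntervalMs: 5000
```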
On Jan 20, 2018, at 12:57, Kien Truong wrote:
Hi,
You should enable and check your garbage collection log.
We've encountered cases where the Task Manager disassociated due to long GC
pauses.
Regards,
Kien
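One way to get that GC log, assuming a HotSpot JVM with Java 8 style flags (the log path and exact flag selection are illustrative):

```yaml
# flink-conf.yaml: pass GC logging flags to the Flink JVMs (Java 8 syntax).
# The log path is an example; the directory must exist on each host.
env.java.opts: "-XX:+PrintGCDetails -XX:+PrintGCDateStamps -XX:+PrintGCApplicationStoppedTime -Xloggc:/tmp/flink-gc.log"
```

`-XX:+PrintGCApplicationStoppedTime` is what makes long stop-the-world pauses easy to spot.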
On 1/20/2018 1:27 AM, ashish pok wrote:
Hi All,
We have hit some load related issues and were wondering if anyone has some
pointers.