Hi Timur,

I had a look at the plan you shared.
I could not find any flow that branches and merges again, a pattern that
is prone to cause deadlocks.
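
For reference, the shape I was looking for is roughly the following (a
minimal sketch in the Scala DataSet API with made-up data): the same
DataSet feeds both inputs of a later join.

import org.apache.flink.api.scala._

val env = ExecutionEnvironment.getExecutionEnvironment
val source: DataSet[(String, Int)] = env.fromElements(("a", 1), ("b", 2))

// The flow branches here ...
val branchA = source.map { case (k, v) => (k, v + 1) }
val branchB = source.filter { case (_, v) => v > 0 }

// ... and merges again here; this branch-and-merge shape is the one that
// can deadlock.
val merged = branchA.join(branchB).where(0).equalTo(0)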

However, I noticed that the plan performs a lot of partitioning steps.
You might want to have a look at forwarded field annotations, which can help
to reduce the number of partitioning and sorting steps [1].
This might help with complex jobs such as yours.
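
For example, if a map function leaves the join or grouping key untouched, a
forwarded-field hint lets the optimizer reuse an existing partitioning. A
minimal sketch in the Scala DataSet API, with made-up field names and data:

import org.apache.flink.api.scala._

val env = ExecutionEnvironment.getExecutionEnvironment
val input: DataSet[(String, Long)] = env.fromElements(("a", 1L), ("b", 2L))

// Field _1 passes through unchanged, so it can be declared as forwarded.
val enriched = input
  .map { case (key, value) => (key, value, value * 2) }
  .withForwardedFields("_1")

// A later groupBy or join on field 0 can then reuse an existing
// partitioning on _1 instead of repartitioning and re-sorting the data.
enriched.groupBy(0).sum(1).print()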

Best, Fabian

[1]
https://ci.apache.org/projects/flink/flink-docs-release-1.0/apis/batch/index.html#semantic-annotations


2016-04-27 10:57 GMT+02:00 Vasiliki Kalavri <vasilikikala...@gmail.com>:

> Hi Timur,
>
> I've previously seen large batch jobs hang because of join deadlocks. We
> should have fixed those problems, but we might have missed some corner
> case. Did you check whether there was any CPU activity when the job hangs?
> Can you try running htop on the taskmanager machines and see if they're
> idle?
>
> Cheers,
> -Vasia.
>
> On 27 April 2016 at 02:48, Timur Fayruzov <timur.fairu...@gmail.com>
> wrote:
>
>> Robert, Ufuk: the logs, execution plan, and a screenshot of the console are
>> in the archive:
>> https://www.dropbox.com/s/68gyl6f3rdzn7o1/debug-stuck.tar.gz?dl=0
>>
>> Note that when I looked at the backpressure view, I saw back pressure
>> 'high' on the following paths:
>>
>> Input->code_line:123,124->map->join
>> Input->code_line:134,135->map->join
>> Input->code_line:121->map->join
>>
>> Unfortunately, I was not able to take thread dumps or heap dumps
>> (neither kill -3, jstack, nor jmap worked; some Amazon AMI problem, I assume).
>>
>> Hope that helps.
>>
>> Please let me know if I can assist you in any way. Otherwise, I probably
>> will not be actively looking at this problem.
>>
>> Thanks,
>> Timur
>>
>>
>> On Tue, Apr 26, 2016 at 8:11 AM, Ufuk Celebi <u...@apache.org> wrote:
>>
>>> Can you please also provide the execution plan via
>>>
>>> env.getExecutionPlan()
>>>
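>>> A rough sketch of what I mean (Scala DataSet API; getExecutionPlan()
>>> returns the execution plan as a JSON string and can be called in place
>>> of execute() for this purpose):
>>>
>>> import org.apache.flink.api.scala._
>>>
>>> val env = ExecutionEnvironment.getExecutionEnvironment
>>> // ... build the job here: sources, joins, co-groups, sinks ...
>>> // Print the plan as a JSON string instead of calling env.execute().
>>> println(env.getExecutionPlan())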
>>>
>>>
>>> On Tue, Apr 26, 2016 at 4:23 PM, Timur Fayruzov
>>> <timur.fairu...@gmail.com> wrote:
>>> > Hello Robert,
>>> >
>>> > I observed progress for 2 hours (meaning the numbers changed on the
>>> > dashboard), and then I waited for 2 more hours. I'm sure it had to
>>> > spill at some point, but I figured 2 hours is enough time.
>>> >
>>> > Thanks,
>>> > Timur
>>> >
>>> > On Apr 26, 2016 1:35 AM, "Robert Metzger" <rmetz...@apache.org> wrote:
>>> >>
>>> >> Hi Timur,
>>> >>
>>> >> thank you for sharing the source code of your job. That is helpful!
>>> >> It's a large pipeline with 7 joins and 2 co-groups. Maybe your job is
>>> >> much more I/O-heavy with the larger input data because all the joins
>>> >> start spilling?
>>> >> Our monitoring, in particular for batch jobs, is really not very
>>> >> advanced. If we had some monitoring showing the spill status, we might
>>> >> see that the job is still running.
>>> >>
>>> >> How long did you wait before declaring the job hung?
>>> >>
>>> >> Regards,
>>> >> Robert
>>> >>
>>> >>
>>> >> On Tue, Apr 26, 2016 at 10:11 AM, Ufuk Celebi <u...@apache.org> wrote:
>>> >>>
>>> >>> No.
>>> >>>
>>> >>> If you run on YARN, the YARN logs are the relevant ones for the
>>> >>> JobManager and TaskManager. The log of the client that submitted the
>>> >>> job should be found in /log.
>>> >>>
>>> >>> – Ufuk
>>> >>>
>>> >>> On Tue, Apr 26, 2016 at 10:06 AM, Timur Fayruzov
>>> >>> <timur.fairu...@gmail.com> wrote:
>>> >>> > I will do it by tomorrow. The logs don't show anything unusual. Are
>>> >>> > there any logs besides what's in flink/log and the YARN container
>>> >>> > logs?
>>> >>> >
>>> >>> > On Apr 26, 2016 1:03 AM, "Ufuk Celebi" <u...@apache.org> wrote:
>>> >>> >
>>> >>> > Hey Timur,
>>> >>> >
>>> >>> > is it possible to connect to the VMs and get stack traces of the
>>> >>> > Flink processes as well?
>>> >>> >
>>> >>> > We can first have a look at the logs, but the stack traces will be
>>> >>> > helpful if we can't figure out what the issue is.
>>> >>> >
>>> >>> > – Ufuk
>>> >>> >
>>> >>> > On Tue, Apr 26, 2016 at 9:42 AM, Till Rohrmann <trohrm...@apache.org>
>>> >>> > wrote:
>>> >>> >> Could you share the logs with us, Timur? That would be very
>>> >>> >> helpful.
>>> >>> >>
>>> >>> >> Cheers,
>>> >>> >> Till
>>> >>> >>
>>> >>> >> On Apr 26, 2016 3:24 AM, "Timur Fayruzov" <timur.fairu...@gmail.com>
>>> >>> >> wrote:
>>> >>> >>>
>>> >>> >>> Hello,
>>> >>> >>>
>>> >>> >>> Now I'm at the stage where my job seems to completely hang. The
>>> >>> >>> source code is attached (it won't compile, but I think it gives a
>>> >>> >>> very good idea of what happens). Unfortunately, I can't provide the
>>> >>> >>> datasets. Most of them are about 100-500MM records, which I try to
>>> >>> >>> process on an EMR cluster with 40 tasks and 6GB of memory for each.
>>> >>> >>>
>>> >>> >>> It was working for smaller input sizes. Any ideas on what I could
>>> >>> >>> do differently are appreciated.
>>> >>> >>>
>>> >>> >>> Thanks,
>>> >>> >>> Timur
>>> >>
>>> >>
>>> >
>>>
>>
>>
>
