Re: Flink 1.7.1 job is stuck in running state

2019-01-18 Thread Gary Yao
Hi Piotr,

Ideally on DEBUG level.
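For Flink 1.7 that is usually done by raising the root level in
conf/log4j.properties and restarting the cluster or YARN session; a
minimal sketch, assuming the default log4j setup shipped with the
distribution (the loggers kept at INFO are only examples to limit noise):

    # conf/log4j.properties
    log4j.rootLogger=DEBUG, file

    # optionally keep particularly chatty dependencies at INFO
    log4j.logger.org.apache.hadoop=INFO
    log4j.logger.akka=INFO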

Best,
Gary

On Fri, Jan 18, 2019 at 3:41 PM Piotr Szczepanek 
wrote:

> Hey Gary,
> thanks for your reply.
> Before, we were using Flink version 1.5.2.
> With both versions we're using Flink deployed on YARN.
>
> Regarding the logs, would you like the entries with DEBUG enabled, or
> would INFO be enough?
>
> Thanks,
> Piotr
>
> On Fri, Jan 18, 2019 at 3:14 PM Gary Yao wrote:
>
>> Hi Piotr,
>>
>> What was the version you were using before 1.7.1?
>> How do you deploy your cluster, e.g., YARN, standalone?
>> Can you attach full TM and JM logs?
>>
>> Best,
>> Gary
>>
>> On Fri, Jan 18, 2019 at 3:05 PM Piotr Szczepanek <
>> piotr.szczepa...@gmail.com> wrote:
>>
>>> Hello,
>>> we have a scenario running data processing jobs that generate export
>>> files on demand. Our first approach was using ClusterClient, but recently
>>> we switched to the REST API for job submission. In the meantime we
>>> switched to Flink 1.7.1, and that started to cause problems.
>>> Some of our jobs are stuck, not processing any data. The Task Managers
>>> log that the chain is switching to RUNNING, and then nothing happens.
>>> In the TMs' stdout logs we can see that the log is cut off for some
>>> reason, e.g.:
>>>
>>> Jan 10, 2019 4:28:33 PM INFO:
>>> org.apache.parquet.hadoop.InternalParquetRecordReader: RecordReader
>>> initialized will read a total of 615 records.
>>> Jan 10, 2019 4:28:33 PM INFO:
>>> org.apache.parquet.hadoop.InternalParquetRecordReader: at row 0. reading
>>> next block
>>> Jan 10, 2019 4:28:33 PM INFO:
>>> org.apache.parquet.hadoop.InternalParquetRecordReader: block read in memory
>>> in 63 ms. row count = 615
>>> Jan 10, 2019 4:28:33 PM WARNING:
>>> org.apache.parquet.hadoop.ParquetRecordReader: Can not initialize counter
>>> due to context is not a instance of TaskInputOutputContext, but is
>>> org.apache.hadoop.mapreduce.task.TaskAttemptContextImpl
>>> Jan 10, 2019 4:28:33 PM INFO:
>>> org.apache.parquet.hadoop.InternalParquetRecordReader: RecordReader
>>> initialized will read a total of 140 records.
>>> Jan 10, 2019 4:28:33 PM INFO:
>>> org.apache.parquet.hadoop.InternalParquetRecordReader: at row 0. reading
>>> next block
>>> Jan 10, 2019 4:28:33 PM INFO:
>>> org.apache.parquet.hadoop.InternalParquetRecordReader: block read in memory
>>> in 2 ms. row count = 140
>>> Jan 10, 2019 4:28:33 PM WARNING:
>>> org.apache.parquet.hadoop.ParquetRecordReader: Can not initialize counter
>>> due to context is not a instance of TaskInputOutputContext, but is
>>> org.apache.hadoop.mapreduce.task.TaskAttemptContextImpl
>>> Jan 10, 2019 4:28:33 PM INFO: or
>>>
>>> As you can see, the last line is cut off in the middle, and nothing
>>> happens afterwards.
>>> None of the counters (records/bytes sent/read) are increasing.
>>> We switched DEBUG logging on for both the TMs and the JM, but the only
>>> thing they show is the heartbeats they send to each other.
>>> Do you have any idea what the problem could be, and how we could deal
>>> with it or at least investigate further? Is there any timeout/config
>>> that we could try to enable?
>>>
>>


Re: Flink 1.7.1 job is stuck in running state

2019-01-18 Thread Piotr Szczepanek
Hey Gary,
thanks for your reply.
Before, we were using Flink version 1.5.2.
With both versions we're using Flink deployed on YARN.
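A long-running YARN session cluster is assumed here; as a rough sketch,
such a session is started in Flink 1.7 along these lines (container
count, slots and memory sizes are placeholders):

    ./bin/yarn-session.sh -n 4 -s 2 -jm 1024 -tm 4096 -d

Jobs are then submitted against that session's JobManager, e.g. through
its REST endpoint.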

Regarding the logs, would you like the entries with DEBUG enabled, or
would INFO be enough?

Thanks,
Piotr

On Fri, Jan 18, 2019 at 3:14 PM Gary Yao wrote:

> Hi Piotr,
>
> What was the version you were using before 1.7.1?
> How do you deploy your cluster, e.g., YARN, standalone?
> Can you attach full TM and JM logs?
>
> Best,
> Gary
>
> On Fri, Jan 18, 2019 at 3:05 PM Piotr Szczepanek <
> piotr.szczepa...@gmail.com> wrote:
>
>> Hello,
>> we have a scenario running data processing jobs that generate export
>> files on demand. Our first approach was using ClusterClient, but recently
>> we switched to the REST API for job submission. In the meantime we
>> switched to Flink 1.7.1, and that started to cause problems.
>> Some of our jobs are stuck, not processing any data. The Task Managers
>> log that the chain is switching to RUNNING, and then nothing happens.
>> In the TMs' stdout logs we can see that the log is cut off for some
>> reason, e.g.:
>>
>> Jan 10, 2019 4:28:33 PM INFO:
>> org.apache.parquet.hadoop.InternalParquetRecordReader: RecordReader
>> initialized will read a total of 615 records.
>> Jan 10, 2019 4:28:33 PM INFO:
>> org.apache.parquet.hadoop.InternalParquetRecordReader: at row 0. reading
>> next block
>> Jan 10, 2019 4:28:33 PM INFO:
>> org.apache.parquet.hadoop.InternalParquetRecordReader: block read in memory
>> in 63 ms. row count = 615
>> Jan 10, 2019 4:28:33 PM WARNING:
>> org.apache.parquet.hadoop.ParquetRecordReader: Can not initialize counter
>> due to context is not a instance of TaskInputOutputContext, but is
>> org.apache.hadoop.mapreduce.task.TaskAttemptContextImpl
>> Jan 10, 2019 4:28:33 PM INFO:
>> org.apache.parquet.hadoop.InternalParquetRecordReader: RecordReader
>> initialized will read a total of 140 records.
>> Jan 10, 2019 4:28:33 PM INFO:
>> org.apache.parquet.hadoop.InternalParquetRecordReader: at row 0. reading
>> next block
>> Jan 10, 2019 4:28:33 PM INFO:
>> org.apache.parquet.hadoop.InternalParquetRecordReader: block read in memory
>> in 2 ms. row count = 140
>> Jan 10, 2019 4:28:33 PM WARNING:
>> org.apache.parquet.hadoop.ParquetRecordReader: Can not initialize counter
>> due to context is not a instance of TaskInputOutputContext, but is
>> org.apache.hadoop.mapreduce.task.TaskAttemptContextImpl
>> Jan 10, 2019 4:28:33 PM INFO: or
>>
>> As you can see, the last line is cut off in the middle, and nothing
>> happens afterwards.
>> None of the counters (records/bytes sent/read) are increasing.
>> We switched DEBUG logging on for both the TMs and the JM, but the only
>> thing they show is the heartbeats they send to each other.
>> Do you have any idea what the problem could be, and how we could deal
>> with it or at least investigate further? Is there any timeout/config
>> that we could try to enable?
>>
>


Re: Flink 1.7.1 job is stuck in running state

2019-01-18 Thread Gary Yao
Hi Piotr,

What was the version you were using before 1.7.1?
How do you deploy your cluster, e.g., YARN, standalone?
Can you attach full TM and JM logs?
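> For a YARN deployment (which the reply above confirms this is), the logs
> of all containers can usually be collected in one go once log aggregation
> is enabled; a sketch, with the application id as a placeholder:
>
>     yarn logs -applicationId <application-id> > flink-full.log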

Best,
Gary

On Fri, Jan 18, 2019 at 3:05 PM Piotr Szczepanek 
wrote:

> Hello,
> we have a scenario running data processing jobs that generate export
> files on demand. Our first approach was using ClusterClient, but recently
> we switched to the REST API for job submission. In the meantime we
> switched to Flink 1.7.1, and that started to cause problems.
> Some of our jobs are stuck, not processing any data. The Task Managers
> log that the chain is switching to RUNNING, and then nothing happens.
> In the TMs' stdout logs we can see that the log is cut off for some
> reason, e.g.:
>
> Jan 10, 2019 4:28:33 PM INFO:
> org.apache.parquet.hadoop.InternalParquetRecordReader: RecordReader
> initialized will read a total of 615 records.
> Jan 10, 2019 4:28:33 PM INFO:
> org.apache.parquet.hadoop.InternalParquetRecordReader: at row 0. reading
> next block
> Jan 10, 2019 4:28:33 PM INFO:
> org.apache.parquet.hadoop.InternalParquetRecordReader: block read in memory
> in 63 ms. row count = 615
> Jan 10, 2019 4:28:33 PM WARNING:
> org.apache.parquet.hadoop.ParquetRecordReader: Can not initialize counter
> due to context is not a instance of TaskInputOutputContext, but is
> org.apache.hadoop.mapreduce.task.TaskAttemptContextImpl
> Jan 10, 2019 4:28:33 PM INFO:
> org.apache.parquet.hadoop.InternalParquetRecordReader: RecordReader
> initialized will read a total of 140 records.
> Jan 10, 2019 4:28:33 PM INFO:
> org.apache.parquet.hadoop.InternalParquetRecordReader: at row 0. reading
> next block
> Jan 10, 2019 4:28:33 PM INFO:
> org.apache.parquet.hadoop.InternalParquetRecordReader: block read in memory
> in 2 ms. row count = 140
> Jan 10, 2019 4:28:33 PM WARNING:
> org.apache.parquet.hadoop.ParquetRecordReader: Can not initialize counter
> due to context is not a instance of TaskInputOutputContext, but is
> org.apache.hadoop.mapreduce.task.TaskAttemptContextImpl
> Jan 10, 2019 4:28:33 PM INFO: or
>
> As you can see, the last line is cut off in the middle, and nothing
> happens afterwards.
> None of the counters (records/bytes sent/read) are increasing.
> We switched DEBUG logging on for both the TMs and the JM, but the only
> thing they show is the heartbeats they send to each other.
> Do you have any idea what the problem could be, and how we could deal
> with it or at least investigate further? Is there any timeout/config
> that we could try to enable?
>


Flink 1.7.1 job is stuck in running state

2019-01-18 Thread Piotr Szczepanek
Hello,
we have a scenario running data processing jobs that generate export
files on demand. Our first approach was using ClusterClient, but recently
we switched to the REST API for job submission. In the meantime we
switched to Flink 1.7.1, and that started to cause problems.
Some of our jobs are stuck, not processing any data. The Task Managers
log that the chain is switching to RUNNING, and then nothing happens.
In the TMs' stdout logs we can see that the log is cut off for some
reason, e.g.:

Jan 10, 2019 4:28:33 PM INFO:
org.apache.parquet.hadoop.InternalParquetRecordReader: RecordReader
initialized will read a total of 615 records.
Jan 10, 2019 4:28:33 PM INFO:
org.apache.parquet.hadoop.InternalParquetRecordReader: at row 0. reading
next block
Jan 10, 2019 4:28:33 PM INFO:
org.apache.parquet.hadoop.InternalParquetRecordReader: block read in memory
in 63 ms. row count = 615
Jan 10, 2019 4:28:33 PM WARNING:
org.apache.parquet.hadoop.ParquetRecordReader: Can not initialize counter
due to context is not a instance of TaskInputOutputContext, but is
org.apache.hadoop.mapreduce.task.TaskAttemptContextImpl
Jan 10, 2019 4:28:33 PM INFO:
org.apache.parquet.hadoop.InternalParquetRecordReader: RecordReader
initialized will read a total of 140 records.
Jan 10, 2019 4:28:33 PM INFO:
org.apache.parquet.hadoop.InternalParquetRecordReader: at row 0. reading
next block
Jan 10, 2019 4:28:33 PM INFO:
org.apache.parquet.hadoop.InternalParquetRecordReader: block read in memory
in 2 ms. row count = 140
Jan 10, 2019 4:28:33 PM WARNING:
org.apache.parquet.hadoop.ParquetRecordReader: Can not initialize counter
due to context is not a instance of TaskInputOutputContext, but is
org.apache.hadoop.mapreduce.task.TaskAttemptContextImpl
Jan 10, 2019 4:28:33 PM INFO: or

As you can see, the last line is cut off in the middle, and nothing
happens afterwards.
None of the counters (records/bytes sent/read) are increasing.
We switched DEBUG logging on for both the TMs and the JM, but the only
thing they show is the heartbeats they send to each other.
Do you have any idea what the problem could be, and how we could deal
with it or at least investigate further? Is there any timeout/config
that we could try to enable?
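
For reference, the switch from ClusterClient to REST submission mentioned
above typically boils down to two calls against the JobManager's REST
endpoint; a minimal sketch using curl, where host, port, jar name, entry
class and program arguments are placeholder assumptions:

    # upload the job jar; the JSON response contains the generated jar id
    curl -X POST -F "jarfile=@./export-job.jar" \
         http://jobmanager-host:8081/jars/upload

    # run the uploaded jar (substitute the jar id from the upload response)
    curl -X POST -H "Content-Type: application/json" \
         -d '{"entryClass": "com.example.ExportJob", "programArgs": "--output /tmp/export"}' \
         http://jobmanager-host:8081/jars/<jar-id>/run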