Re: Diagnosing bottlenecks in Flink jobs

2021-06-17 Thread Dan Hill
Thanks Jing!

On Wed, Jun 16, 2021 at 11:30 PM JING ZHANG  wrote:

> Hi Dan,
> It's better to split the Kafka partition into multiple partitions.
> Here is a way to try without splitting the Kafka partition. Add a
> rebalance shuffle between source and the downstream operators, set multiple
> parallelism for the downstream operators. But this way would introduce
> extra cpu cost for serialize/deserialize and extra network cost for shuffle
> data. I'm not sure the benefits of this method can offset the additional
> costs.
>
> Best,
> JING ZHANG
>
> Dan Hill  于2021年6月17日周四 下午1:49写道:
>
>> Thanks, JING ZHANG!
>>
>> I have one subtask for one Kafka source that is getting backpressure.  Is
>> there an easy way to split a single Kafka partition into multiple
>> subtasks?  Or do I need to split the Kafka partition?
>>
>> On Wed, Jun 16, 2021 at 10:29 PM JING ZHANG  wrote:
>>
>>> Hi Dan,
>>> Would you please describe what's the problem about your job? High
>>> latency or low throughput?
>>> Please first check the job throughput and latency .
>>> If the job throughput matches the speed of sources producing data and
>>> the latency metric is good, maybe the job works well without bottlenecks.
>>> If you find unnormal throughput or latency, please try the following
>>> points:
>>> 1. check the back pressure
>>> 2. check whether checkpoint duration is long and whether the checkpoint
>>> size is expected
>>>
>>> Please share the details for deeper analysis in this email if you find
>>> something abnormal about  the job.
>>>
>>> Best,
>>> JING ZHANG
>>>
>>> Dan Hill  于2021年6月17日周四 下午12:44写道:
>>>
 We have a job that has been running but none of the AWS resource
 metrics for the EKS, EC2, MSK and EBS show any bottlenecks.  I have
 multiple 8 cores allocated but only ~2 cores are used.  Most of the memory
 is not consumed.  MSK does not show much use.  EBS metrics look mostly
 idle.  I assumed I'd be able to see whichever resources is a bottleneck.

 Is there a good way to diagnose where the bottleneck is for a Flink job?

>>>


Re: Diagnosing bottlenecks in Flink jobs

2021-06-17 Thread JING ZHANG
Hi Dan,
It's better to split the Kafka partition into multiple partitions.
Here is a way to try without splitting the Kafka partition. Add a rebalance
shuffle between source and the downstream operators, set multiple
parallelism for the downstream operators. But this way would introduce
extra cpu cost for serialize/deserialize and extra network cost for shuffle
data. I'm not sure the benefits of this method can offset the additional
costs.

Best,
JING ZHANG

Dan Hill  于2021年6月17日周四 下午1:49写道:

> Thanks, JING ZHANG!
>
> I have one subtask for one Kafka source that is getting backpressure.  Is
> there an easy way to split a single Kafka partition into multiple
> subtasks?  Or do I need to split the Kafka partition?
>
> On Wed, Jun 16, 2021 at 10:29 PM JING ZHANG  wrote:
>
>> Hi Dan,
>> Would you please describe what's the problem about your job? High latency
>> or low throughput?
>> Please first check the job throughput and latency .
>> If the job throughput matches the speed of sources producing data and the
>> latency metric is good, maybe the job works well without bottlenecks.
>> If you find unnormal throughput or latency, please try the following
>> points:
>> 1. check the back pressure
>> 2. check whether checkpoint duration is long and whether the checkpoint
>> size is expected
>>
>> Please share the details for deeper analysis in this email if you find
>> something abnormal about  the job.
>>
>> Best,
>> JING ZHANG
>>
>> Dan Hill  于2021年6月17日周四 下午12:44写道:
>>
>>> We have a job that has been running but none of the AWS resource metrics
>>> for the EKS, EC2, MSK and EBS show any bottlenecks.  I have multiple 8
>>> cores allocated but only ~2 cores are used.  Most of the memory is not
>>> consumed.  MSK does not show much use.  EBS metrics look mostly idle.  I
>>> assumed I'd be able to see whichever resources is a bottleneck.
>>>
>>> Is there a good way to diagnose where the bottleneck is for a Flink job?
>>>
>>


Re: Diagnosing bottlenecks in Flink jobs

2021-06-16 Thread Dan Hill
Thanks, JING ZHANG!

I have one subtask for one Kafka source that is getting backpressure.  Is
there an easy way to split a single Kafka partition into multiple
subtasks?  Or do I need to split the Kafka partition?

On Wed, Jun 16, 2021 at 10:29 PM JING ZHANG  wrote:

> Hi Dan,
> Would you please describe what's the problem about your job? High latency
> or low throughput?
> Please first check the job throughput and latency .
> If the job throughput matches the speed of sources producing data and the
> latency metric is good, maybe the job works well without bottlenecks.
> If you find unnormal throughput or latency, please try the following
> points:
> 1. check the back pressure
> 2. check whether checkpoint duration is long and whether the checkpoint
> size is expected
>
> Please share the details for deeper analysis in this email if you find
> something abnormal about  the job.
>
> Best,
> JING ZHANG
>
> Dan Hill  于2021年6月17日周四 下午12:44写道:
>
>> We have a job that has been running but none of the AWS resource metrics
>> for the EKS, EC2, MSK and EBS show any bottlenecks.  I have multiple 8
>> cores allocated but only ~2 cores are used.  Most of the memory is not
>> consumed.  MSK does not show much use.  EBS metrics look mostly idle.  I
>> assumed I'd be able to see whichever resources is a bottleneck.
>>
>> Is there a good way to diagnose where the bottleneck is for a Flink job?
>>
>


Re: Diagnosing bottlenecks in Flink jobs

2021-06-16 Thread JING ZHANG
Hi Dan,
Would you please describe what's the problem about your job? High latency
or low throughput?
Please first check the job throughput and latency .
If the job throughput matches the speed of sources producing data and the
latency metric is good, maybe the job works well without bottlenecks.
If you find unnormal throughput or latency, please try the following points:
1. check the back pressure
2. check whether checkpoint duration is long and whether the checkpoint
size is expected

Please share the details for deeper analysis in this email if you find
something abnormal about  the job.

Best,
JING ZHANG

Dan Hill  于2021年6月17日周四 下午12:44写道:

> We have a job that has been running but none of the AWS resource metrics
> for the EKS, EC2, MSK and EBS show any bottlenecks.  I have multiple 8
> cores allocated but only ~2 cores are used.  Most of the memory is not
> consumed.  MSK does not show much use.  EBS metrics look mostly idle.  I
> assumed I'd be able to see whichever resources is a bottleneck.
>
> Is there a good way to diagnose where the bottleneck is for a Flink job?
>


Diagnosing bottlenecks in Flink jobs

2021-06-16 Thread Dan Hill
We have a job that has been running but none of the AWS resource metrics
for the EKS, EC2, MSK and EBS show any bottlenecks.  I have multiple 8
cores allocated but only ~2 cores are used.  Most of the memory is not
consumed.  MSK does not show much use.  EBS metrics look mostly idle.  I
assumed I'd be able to see whichever resources is a bottleneck.

Is there a good way to diagnose where the bottleneck is for a Flink job?