Hi Navneeth,

If everything worked before and the issues only appeared later, it would
be interesting to know whether your state size grew over time. An
application usually needs gradually more resources as time goes on. If the
user base of your company grows, so does the number of messages (be it
click stream, page impressions, or any kind of transactions). Often, the
operator state grows as well. Sometimes, it's just that the events
themselves become more complex and thus you need more overall bandwidth.
This means that from time to time, you need to increase the memory of Flink
(for state) or the number of compute nodes (to handle more events). In the
same way, you need to make sure that your sink scales as well.
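
One rough way to check for state growth is to note the checkpoint size
reported in the Web UI at regular intervals and fit a trend line through
the samples. The following is only a sketch under assumptions: the sample
numbers are made up, and growth_per_day is a hypothetical helper, not part
of Flink.

```python
from typing import Sequence


def growth_per_day(days: Sequence[float], sizes_mb: Sequence[float]) -> float:
    """Least-squares slope of checkpoint size (MB) against time (days)."""
    if len(days) != len(sizes_mb) or len(days) < 2:
        raise ValueError("need at least two (day, size) samples")
    n = len(days)
    mean_x = sum(days) / n
    mean_y = sum(sizes_mb) / n
    var_x = sum((x - mean_x) ** 2 for x in days)
    if var_x == 0:
        raise ValueError("samples must span more than one point in time")
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(days, sizes_mb))
    return cov_xy / var_x


# Made-up samples: checkpoint size noted once a week from the Web UI.
sample_days = [0, 7, 14, 21]
sample_sizes_mb = [1000, 1140, 1280, 1420]
slope = growth_per_day(sample_days, sample_sizes_mb)
print(f"state grows by roughly {slope:.1f} MB per day")
```

A clearly positive slope over several weeks would support the theory that
the job has outgrown its current memory or parallelism.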

If you fail to keep up with the demand, the application gradually becomes
more unstable (for example by running out of memory repeatedly). I suspect
that this is what is happening in your case.

First, it's important to understand what the bottleneck is. The Web UI
should help to narrow it down quickly. You can also share your findings and
we can discuss further strategies.

If nothing works out, I also recommend an upgrade. Your best migration path
would be to use Flink 1.7, which should allow a smoother transition for
state [1]. I'd guess that afterwards, you should be able to migrate to 1.11
with almost no code changes.

[1]
https://ci.apache.org/projects/flink/flink-docs-release-1.11/dev/stream/state/custom_serialization.html#migrating-from-deprecated-serializer-snapshot-apis-before-flink-17

On Fri, Aug 28, 2020 at 1:43 PM Yun Tang <[email protected]> wrote:

> Hi Navneeth
>
> First of all, I suggest upgrading Flink to the latest version.
> You can refer to [1] for savepoint compatibility when upgrading Flink.
>
> For the problem of not being able to connect to the address, you could
> log in to your pod and run 'nslookup jobmanager' to see whether the host
> can be resolved. You can also check whether the 'jobmanager' service
> works as expected via 'kubectl get svc'.
>
> [1]
> https://ci.apache.org/projects/flink/flink-docs-stable/ops/upgrading.html#compatibility-table
>
> Best
> Yun Tang
>
> ------------------------------
> *From:* Navneeth Krishnan <[email protected]>
> *Sent:* Friday, August 28, 2020 17:00
> *To:* user <[email protected]>
> *Subject:* Flink Migration
>
> Hi All,
>
> We are currently on a very old version of Flink, 1.4.0, and it has worked
> pretty well. But lately we have been facing checkpoint timeout issues. We
> would like to minimize any changes to the current pipelines and go ahead
> with the migration. With that said, our first pick was to migrate to 1.5.6
> and later migrate to a newer version.
>
> Do you guys think a more recent version like 1.6 or 1.7 might work? We did
> try 1.8 but it requires some changes in the pipelines.
>
> When we tried 1.5.6 with docker compose we were unable to get the task
> manager attached to jobmanager. Are there some specific configurations
> required for newer versions?
>
> Logs:
>
> 8-28 07:36:30.834 [main] INFO
> org.apache.flink.runtime.util.LeaderRetrievalUtils  - TaskManager will
> try to connect for 10000 milliseconds before falling back to heuristics
>
> 2020-08-28 07:36:30.853 [main] INFO
> org.apache.flink.runtime.net.ConnectionUtils  - Retrieved new target
> address jobmanager/172.21.0.8:6123.
>
> 2020-08-28 07:36:31.279 [main] INFO
> org.apache.flink.runtime.net.ConnectionUtils  - Trying to connect to
> address jobmanager/172.21.0.8:6123
>
> 2020-08-28 07:36:31.280 [main] INFO
> org.apache.flink.runtime.net.ConnectionUtils  - Failed to connect from
> address 'e6f9104cdc61/172.21.0.9': Connection refused (Connection refused)
>
> 2020-08-28 07:36:31.281 [main] INFO
> org.apache.flink.runtime.net.ConnectionUtils  - Failed to connect from
> address '/172.21.0.9': Connection refused (Connection refused)
>
> 2020-08-28 07:36:31.281 [main] INFO
> org.apache.flink.runtime.net.ConnectionUtils  - Failed to connect from
> address '/172.21.0.9': Connection refused (Connection refused)
>
> 2020-08-28 07:36:31.282 [main] INFO
> org.apache.flink.runtime.net.ConnectionUtils  - Failed to connect from
> address '/127.0.0.1': Invalid argument (connect failed)
>
> 2020-08-28 07:36:31.283 [main] INFO
> org.apache.flink.runtime.net.ConnectionUtils  - Failed to connect from
> address '/172.21.0.9': Connection refused (Connection refused)
>
> 2020-08-28 07:36:31.284 [main] INFO
> org.apache.flink.runtime.net.ConnectionUtils  - Failed to connect from
> address '/127.0.0.1': Invalid argument (connect failed)
>
> 2020-08-28 07:36:31.684 [main] INFO
> org.apache.flink.runtime.net.ConnectionUtils  - Trying to connect to
> address jobmanager/172.21.0.8:6123
>
> 2020-08-28 07:36:31.686 [main] INFO
> org.apache.flink.runtime.net.ConnectionUtils  - Failed to connect from
> address 'e6f9104cdc61/172.21.0.9': Connection refused (Connection refused)
>
> 2020-08-28 07:36:31.687 [main] INFO
> org.apache.flink.runtime.net.ConnectionUtils  - Failed to connect from
> address '/172.21.0.9': Connection refused (Connection refused)
>
> 2020-08-28 07:36:31.688 [main] INFO
> org.apache.flink.runtime.net.ConnectionUtils  - Failed to connect from
> address '/172.21.0.9': Connection refused (Connection refused)
>
> 2020-08-28 07:36:31.688 [main] INFO
> org.apache.flink.runtime.net.ConnectionUtils  - Failed to connect from
> address '/127.0.0.1': Invalid argument (connect failed)
>
> 2020-08-28 07:36:31.689 [main] INFO
> org.apache.flink.runtime.net.ConnectionUtils  - Failed to connect from
> address '/172.21.0.9': Connection refused (Connection refused)
>
> 2020-08-28 07:36:31.690 [main] INFO
> org.apache.flink.runtime.net.ConnectionUtils  - Failed to connect from
> address '/127.0.0.1': Invalid argument (connect failed)
>
> 2020-08-28 07:36:32.490 [main] INFO
> org.apache.flink.runtime.net.ConnectionUtils  - Trying to connect to
> address jobmanager/172.21.0.8:6123
>
> 2020-08-28 07:36:32.491 [main] INFO
> org.apache.flink.runtime.net.ConnectionUtils  - Failed to connect from
> address 'e6f9104cdc61/172.21.0.9': Connection refused (Connection refused)
>
> 2020-08-28 07:36:32.493 [main] INFO
> org.apache.flink.runtime.net.ConnectionUtils  - Failed to connect from
> address '/172.21.0.9': Connection refused (Connection refused)
>
> 2020-08-28 07:36:32.494 [main] INFO
> org.apache.flink.runtime.net.ConnectionUtils  - Failed to connect from
> address '/172.21.0.9': Connection refused (Connection refused)
>
> 2020-08-28 07:36:32.495 [main] INFO
> org.apache.flink.runtime.net.ConnectionUtils  - Failed to connect from
> address '/127.0.0.1': Invalid argument (connect failed)
>
> 2020-08-28 07:36:32.496 [main] INFO
> org.apache.flink.runtime.net.ConnectionUtils  - Failed to connect from
> address '/172.21.0.9': Connection refused (Connection refused)
>
> 2020-08-28 07:36:32.497 [main] INFO
> org.apache.flink.runtime.net.ConnectionUtils  - Failed to connect from
> address '/127.0.0.1': Invalid argument (connect failed)
>
> 2020-08-28 07:36:34.099 [main] INFO
> org.apache.flink.runtime.net.ConnectionUtils  - Trying to connect to
> address jobmanager/172.21.0.8:6123
>
> 2020-08-28 07:36:34.100 [main] INFO
> org.apache.flink.runtime.taskexecutor.TaskManagerRunner  - TaskManager
> will use hostname/address 'e6f9104cdc61' (172.21.0.9) for communication.
>
>
> Flink Conf
>
> jobmanager.rpc.address: jobmanager
>
> rest.address: jobmanager
>
>
> Thanks
>
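
The connectivity checks suggested in the quoted reply can also be run from
inside the TaskManager container with a few lines of Python, in case
nslookup is not installed there. This is only a sketch: the host name
"jobmanager" and port 6123 are taken from the logs and the Flink conf
above, while can_resolve and can_connect are hypothetical helper names.

```python
import socket


def can_resolve(host: str) -> bool:
    """Return True if the host name resolves (roughly what nslookup checks)."""
    try:
        socket.getaddrinfo(host, None)
        return True
    except socket.gaierror:
        return False


def can_connect(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and resolution failures.
        return False


if __name__ == "__main__":
    # Host and port as they appear in the logs and configuration above.
    host, port = "jobmanager", 6123
    print("resolves:", can_resolve(host))
    print("port open:", can_connect(host, port))
```

In the logs above the name already resolves to 172.21.0.8, so the more
interesting result is whether anything accepts connections on port 6123.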


-- 

Arvid Heise | Senior Java Developer
