If the archiving fails there should be some log message, like "Failed to archive job" or "Could not archive completed job..." . If nothing of the sort is logged my first instinct would be that the operation is being slowed down, _a lot_.

Where are you archiving them to? Could it be the write operation is being throttled heavily?

On 21/11/2019 13:48, Pavel Potseluev wrote:
Hi Vino,
Usually Flink archives jobs correctly and the problem is rarely reproduced. So I think it isn't a problem with configuration.
Job Manager log when job 5ec264a20bb5005cdbd8e23a5e59f136 was canceled:

    771a4992-d694-d2a4-b49a-d4eb382086e5 2019-11-18 18:52:13.294
    [Checkpoint Timer] INFO
    org.apache.flink.runtime.checkpoint.CheckpointCoordinator  -
    Triggering checkpoint 1872 @ 1574092333218 for job
    5ec264a20bb5005cdbd8e23a5e59f136.

    771a4992-d694-d2a4-b49a-d4eb382086e5 2019-11-18 18:52:37.260
    [flink-akka.actor.default-dispatcher-30] INFO
    org.apache.flink.runtime.checkpoint.CheckpointCoordinator  -
    Completed checkpoint 1872 for job 5ec264a20bb5005cdbd8e23a5e59f136
    (568048140 bytes in 23541 ms).

    771a4992-d694-d2a4-b49a-d4eb382086e5 2019-11-18 18:53:13.314
    [Checkpoint Timer] INFO
    org.apache.flink.runtime.checkpoint.CheckpointCoordinator  -
    Triggering checkpoint 1873 @ 1574092393218 for job
    5ec264a20bb5005cdbd8e23a5e59f136.

    771a4992-d694-d2a4-b49a-d4eb382086e5 2019-11-18 18:53:19.279
    [flink-akka.actor.default-dispatcher-40] INFO
    org.apache.flink.runtime.executiongraph.ExecutionGraph  - Job
    bureau-user-offers-statistics-AUTORU-USERS_AUTORU
    (5ec264a20bb5005cdbd8e23a5e59f136) switched from state RUNNING to
    CANCELLING.

    771a4992-d694-d2a4-b49a-d4eb382086e5 2019-11-18 18:53:19.279
    [flink-akka.actor.default-dispatcher-40] INFO
    org.apache.flink.runtime.executiongraph.ExecutionGraph  - Source:
    Custom File Source (1/1) (934d89cf3d7999b40225dd8009b5493c)
    switched from RUNNING to CANCELING.

    771a4992-d694-d2a4-b49a-d4eb382086e5 2019-11-18 18:53:19.280
    [flink-akka.actor.default-dispatcher-40] INFO
    org.apache.flink.runtime.executiongraph.ExecutionGraph  - Source:
    kafka-source-moderation-update-journal-autoru -> Filter -> Flat
    Map (1/2) (47656a3c4fc70e19622acca31267e41f) switched from RUNNING
    to CANCELING.

    771a4992-d694-d2a4-b49a-d4eb382086e5 2019-11-18 18:53:19.280
    [flink-akka.actor.default-dispatcher-40] INFO
    org.apache.flink.runtime.executiongraph.ExecutionGraph  - Source:
    kafka-source-moderation-update-journal-autoru -> Filter -> Flat
    Map (2/2) (be3c4562e65d3d6bdfda4f1632017c6c) switched from RUNNING
    to CANCELING.

    771a4992-d694-d2a4-b49a-d4eb382086e5 2019-11-18 18:53:19.280
    [flink-akka.actor.default-dispatcher-40] INFO
    org.apache.flink.runtime.executiongraph.ExecutionGraph  -
    user-offers-statistics-init-from-file -> Map (1/2)
    (4a45ed43b05e4d444e190a44b33514ac) switched from RUNNING to CANCELING.

    771a4992-d694-d2a4-b49a-d4eb382086e5 2019-11-18 18:53:19.280
    [flink-akka.actor.default-dispatcher-40] INFO
    org.apache.flink.runtime.executiongraph.ExecutionGraph  -
    user-offers-statistics-init-from-file -> Map (2/2)
    (bb3be311c5e53abaedb06b4d0148c23f) switched from RUNNING to CANCELING.

    771a4992-d694-d2a4-b49a-d4eb382086e5 2019-11-18 18:53:19.280
    [flink-akka.actor.default-dispatcher-40] INFO
    org.apache.flink.runtime.executiongraph.ExecutionGraph  - Keyed
    Reduce -> Map -> Sink: user-offers-statistics-autoru (1/2)
    (cfb291033df3f19c9745a6f2fd25e037) switched from RUNNING to CANCELING.

    771a4992-d694-d2a4-b49a-d4eb382086e5 2019-11-18 18:53:19.280
    [flink-akka.actor.default-dispatcher-40] INFO
    org.apache.flink.runtime.executiongraph.ExecutionGraph  - Keyed
    Reduce -> Map -> Sink: user-offers-statistics-autoru (2/2)
    (9ce7cd66199513fa97ac44d7617f0c83) switched from RUNNING to CANCELING.

    771a4992-d694-d2a4-b49a-d4eb382086e5 2019-11-18 18:53:19.299
    [flink-akka.actor.default-dispatcher-2] INFO
    org.apache.flink.runtime.executiongraph.ExecutionGraph  - Source:
    Custom File Source (1/1) (934d89cf3d7999b40225dd8009b5493c)
    switched from CANCELING to CANCELED.

    771a4992-d694-d2a4-b49a-d4eb382086e5 2019-11-18 18:53:19.300
    [flink-akka.actor.default-dispatcher-2] INFO
    org.apache.flink.runtime.executiongraph.ExecutionGraph  - Source:
    kafka-source-moderation-update-journal-autoru -> Filter -> Flat
    Map (1/2) (47656a3c4fc70e19622acca31267e41f) switched from
    CANCELING to CANCELED.

    771a4992-d694-d2a4-b49a-d4eb382086e5 2019-11-18 18:53:19.300
    [flink-akka.actor.default-dispatcher-2] INFO
    org.apache.flink.runtime.executiongraph.ExecutionGraph  - Source:
    kafka-source-moderation-update-journal-autoru -> Filter -> Flat
    Map (2/2) (be3c4562e65d3d6bdfda4f1632017c6c) switched from
    CANCELING to CANCELED.

    771a4992-d694-d2a4-b49a-d4eb382086e5 2019-11-18 18:53:19.344
    [flink-akka.actor.default-dispatcher-2] INFO
    org.apache.flink.runtime.executiongraph.ExecutionGraph  -
    user-offers-statistics-init-from-file -> Map (2/2)
    (bb3be311c5e53abaedb06b4d0148c23f) switched from CANCELING to
    CANCELED.

    771a4992-d694-d2a4-b49a-d4eb382086e5 2019-11-18 18:53:19.345
    [flink-akka.actor.default-dispatcher-2] INFO
    org.apache.flink.runtime.executiongraph.ExecutionGraph  -
    user-offers-statistics-init-from-file -> Map (1/2)
    (4a45ed43b05e4d444e190a44b33514ac) switched from CANCELING to
    CANCELED.

    771a4992-d694-d2a4-b49a-d4eb382086e5 2019-11-18 18:53:19.706
    [flink-akka.actor.default-dispatcher-2] INFO
    org.apache.flink.runtime.executiongraph.ExecutionGraph  - Keyed
    Reduce -> Map -> Sink: user-offers-statistics-autoru (1/2)
    (cfb291033df3f19c9745a6f2fd25e037) switched from CANCELING to
    CANCELED.

    771a4992-d694-d2a4-b49a-d4eb382086e5 2019-11-18 18:53:19.714
    [flink-akka.actor.default-dispatcher-2] INFO
    org.apache.flink.runtime.executiongraph.ExecutionGraph  - Keyed
    Reduce -> Map -> Sink: user-offers-statistics-autoru (2/2)
    (9ce7cd66199513fa97ac44d7617f0c83) switched from CANCELING to
    CANCELED.

    771a4992-d694-d2a4-b49a-d4eb382086e5 2019-11-18 18:53:19.714
    [flink-akka.actor.default-dispatcher-2] INFO
    org.apache.flink.runtime.executiongraph.ExecutionGraph  - Job
    bureau-user-offers-statistics-AUTORU-USERS_AUTORU
    (5ec264a20bb5005cdbd8e23a5e59f136) switched from state CANCELLING
    to CANCELED.

    771a4992-d694-d2a4-b49a-d4eb382086e5 2019-11-18 18:53:19.714
    [flink-akka.actor.default-dispatcher-2] INFO
    org.apache.flink.runtime.checkpoint.CheckpointCoordinator  -
    Stopping checkpoint coordinator for job
    5ec264a20bb5005cdbd8e23a5e59f136.

    771a4992-d694-d2a4-b49a-d4eb382086e5 2019-11-18 18:53:19.714
    [flink-akka.actor.default-dispatcher-2] INFO
    o.a.f.runtime.checkpoint.ZooKeeperCompletedCheckpointStore -
    Shutting down

    771a4992-d694-d2a4-b49a-d4eb382086e5 2019-11-18 18:53:19.966
    [flink-akka.actor.default-dispatcher-2] INFO
    org.apache.flink.runtime.zookeeper.ZooKeeperStateHandleStore -
    Removing
    /moderation-flink/testing/checkpoints/5ec264a20bb5005cdbd8e23a5e59f136
    from ZooKeeper

    771a4992-d694-d2a4-b49a-d4eb382086e5 2019-11-18 18:53:19.966
    [cluster-io-thread-6] INFO
    org.apache.flink.runtime.checkpoint.CompletedCheckpoint  -
    Checkpoint with ID 1872 at
    
's3://misc/moderation-flink/flink-checkpoints/5ec264a20bb5005cdbd8e23a5e59f136/chk-1872'
    not discarded.

    771a4992-d694-d2a4-b49a-d4eb382086e5 2019-11-18 18:53:20.044
    [flink-akka.actor.default-dispatcher-2] INFO
    o.a.flink.runtime.checkpoint.ZooKeeperCheckpointIDCounter  -
    Shutting down.

    771a4992-d694-d2a4-b49a-d4eb382086e5 2019-11-18 18:53:20.045
    [flink-akka.actor.default-dispatcher-2] INFO
    o.a.flink.runtime.checkpoint.ZooKeeperCheckpointIDCounter  -
    Removing /checkpoint-counter/5ec264a20bb5005cdbd8e23a5e59f136 from
    ZooKeeper

    771a4992-d694-d2a4-b49a-d4eb382086e5 2019-11-18 18:53:20.259
    [flink-akka.actor.default-dispatcher-2] INFO
    org.apache.flink.runtime.dispatcher.StandaloneDispatcher  - Job
    5ec264a20bb5005cdbd8e23a5e59f136 reached globally terminal state
    CANCELED.

    771a4992-d694-d2a4-b49a-d4eb382086e5 2019-11-18 18:54:24.085
    [flink-akka.actor.default-dispatcher-31] INFO
    org.apache.flink.runtime.jobmaster.slotpool.SlotPoolImpl  -
    Releasing idle slot [0553df66161f5d78f4b41d8c8c32c21f].

    771a4992-d694-d2a4-b49a-d4eb382086e5 2019-11-18 18:54:24.085
    [flink-akka.actor.default-dispatcher-31] INFO
    org.apache.flink.runtime.jobmaster.slotpool.SlotPoolImpl  -
    Releasing idle slot [498b9bf0c0f2188ff739d72e6df288dc].

21.11.2019, 06:07, "vino yang" <yanghua1...@gmail.com>:

    If everything is OK(your config options about archive dir and
    history server is correct), Flink should archive the completed job.
    You said you did not find any exceptions in the log about failing
    to archive. But any other exceptions? Can you share the logs about
    your scene?
    Best,
    Vino
    Pavel Potseluev <potsel...@yandex-team.ru
    <mailto:potsel...@yandex-team.ru>> 于2019年11月21日周四 上午2:25写道:

        Hi all,
        We see occasionally that flink doesn't save information about
        canceled job to archive directory (configured by
        jobmanager.archive.fs.dir property). And there are no
        exceptions in the log about failing archiving. It's a problem
        in our use case because our script for deploying jobs relies
        on flink history server to find latest checkpoint for some
        job. Does flink guarantee saving data to archive? If so, any
        ideas why it doesn't work sometimes? Flink version is 1.8.0.
-- Best regards,
        Pavel Potseluev
        Software developer, Yandex.Classifieds LLC

--
Best regards,
Pavel Potseluev
Software developer, Yandex.Classifieds LLC


Reply via email to