[jira] [Commented] (FLINK-25581) Jobmanager does not archive completed jobs

2022-01-18 Thread Chesnay Schepler (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-25581?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17477914#comment-17477914
 ] 

Chesnay Schepler commented on FLINK-25581:
--

??Are there any additional parameters I can set to increase the debug detail 
level in jobmanager???

Logging on debug level is the only way to get more information.

> Jobmanager does not archive completed jobs
> --
>
> Key: FLINK-25581
> URL: https://issues.apache.org/jira/browse/FLINK-25581
> Project: Flink
>  Issue Type: Bug
>  Components: Runtime / Coordination
>Affects Versions: 1.13.5
> Environment: RHEL 7
>Reporter: Leonid Ilyevsky
>Priority: Major
> Attachments: error.log
>
>
> Jobmanager does not archive completed jobs.
> I configured the upload directory like this:
> jobmanager.archive.fs.dir: file:///liquidnet/shared/flink/completed-jobs
>  
> After the job was completed, nothing appeared in that directory.
> The job info was visible in the jobmanager console for one hour, then it 
> disappeared, and there were still no files in the configured directory.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)


[jira] [Commented] (FLINK-25581) Jobmanager does not archive completed jobs

2022-01-18 Thread Leonid Ilyevsky (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-25581?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17477903#comment-17477903
 ] 

Leonid Ilyevsky commented on FLINK-25581:
-

[~chesnay] So maybe the jobs that were not archived at all are a different issue.

In the beginning, when I just configured archiving, it did not archive 
anything. Now I see at least some jobs are archived.

For now, we have a workaround by keeping our own job start/stop history. So we 
can spend some time investigating this strange issue.

Maybe you can try to reproduce this in your environment?

Are there any additional parameters I can set to increase the debug detail 
level in jobmanager?



[jira] [Commented] (FLINK-25581) Jobmanager does not archive completed jobs

2022-01-18 Thread Chesnay Schepler (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-25581?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17477879#comment-17477879
 ] 

Chesnay Schepler commented on FLINK-25581:
--

The FileAlreadyExistsException might be caused by FLINK-24232; in short, if a 
job was suspended (e.g., because Zookeeper went down), it is also archived, 
and subsequent attempts to archive it will fail.

I can't think of a reason why this should affect other jobs though.
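
To illustrate the failure mode described above, here is a minimal Python sketch, not Flink code. It assumes the archive is written in a "do not overwrite" mode, which would be consistent with the FileAlreadyExistsException; the function name, job ID, and payload are hypothetical.

```python
import os
import tempfile


def archive_once(directory, job_id, payload):
    """Write payload to directory/job_id, refusing to overwrite an existing file."""
    path = os.path.join(directory, job_id)
    # Mode "x" is exclusive creation: it raises FileExistsError if the file exists.
    with open(path, "x", encoding="utf-8") as handle:
        handle.write(payload)
    return path


def demo():
    """Archive the same (hypothetical) job twice and report what happens."""
    with tempfile.TemporaryDirectory() as directory:
        archive_once(directory, "job-1", "{}")  # first attempt, e.g. on suspension
        try:
            archive_once(directory, "job-1", "{}")  # later attempt for the same job
        except FileExistsError:
            return "second attempt rejected"
    return "second attempt succeeded"
```

Under that assumption, only the second write for the same job fails; writes for other job IDs go to different files, which matches the observation that other jobs should not be affected.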



[jira] [Commented] (FLINK-25581) Jobmanager does not archive completed jobs

2022-01-18 Thread Leonid Ilyevsky (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-25581?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17477868#comment-17477868
 ] 

Leonid Ilyevsky commented on FLINK-25581:
-

[~chesnay] Yes, I found something interesting.

This morning I checked the storage directory again, and I saw files there, so 
it looks like it does archive jobs, but not all of them. I could tell because I 
did increase the jobstore timeout to 24 hours, and I saw that some completed 
jobs were certainly not archived.

Then I checked the jobmanager logs (I have 5 instances). In one of them I found 
exceptions related to the archiving. See the attached fragment in error.log.

This is just one example; there are more errors like that in the log, 
complaining that the file already exists. The file does indeed exist; I checked. 
But other jobs completed around that time were not archived.
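
A check like the one described above can be scripted. The following is a rough Python sketch, assuming each archive is a file in the configured directory named after the job ID; that naming is an assumption here, and the function name and the list of completed job IDs are hypothetical inputs.

```python
import os


def missing_archives(archive_dir, completed_job_ids):
    """Return the completed job IDs that have no file of the same name in archive_dir."""
    # Assumption: one archive file per job, named exactly after the job ID.
    present = set(os.listdir(archive_dir))
    return sorted(job_id for job_id in completed_job_ids if job_id not in present)
```

The list of completed job IDs could come from a separately kept job start/stop history, and the directory would be the local path behind jobmanager.archive.fs.dir.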



[jira] [Commented] (FLINK-25581) Jobmanager does not archive completed jobs

2022-01-17 Thread Chesnay Schepler (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-25581?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17477107#comment-17477107
 ] 

Chesnay Schepler commented on FLINK-25581:
--

[~lilyevsky] any news?



[jira] [Commented] (FLINK-25581) Jobmanager does not archive completed jobs

2022-01-11 Thread Chesnay Schepler (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-25581?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17472834#comment-17472834
 ] 

Chesnay Schepler commented on FLINK-25581:
--

Archiving happens right after the job is finished.

The [jobstore 
options|https://nightlies.apache.org/flink/flink-docs-release-1.14/docs/deployment/config/#jobstore-expiration-time]
 control how long jobs remain accessible.



[jira] [Commented] (FLINK-25581) Jobmanager does not archive completed jobs

2022-01-11 Thread Leonid Ilyevsky (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-25581?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17472825#comment-17472825
 ] 

Leonid Ilyevsky commented on FLINK-25581:
-

[~chesnay] One more question: when is the archiving supposed to occur? Right 
after the job is finished, or one hour later, when the job info is pushed out of 
the console page?
Also, where is it configured that it is exactly one hour? Can this be changed?



[jira] [Commented] (FLINK-25581) Jobmanager does not archive completed jobs

2022-01-11 Thread Chesnay Schepler (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-25581?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17472774#comment-17472774
 ] 

Chesnay Schepler commented on FLINK-25581:
--

??[...] the archiving happens in jobmanager, correct? It has nothing to do with 
the History Manager (which I did not bring up yet).??

That is correct.

??I assumed that when it comes to archiving, my jobmanagers decide between 
themselves which one is going to do it, and so there should be no conflict.??

The JM that actually ran the job will archive it, so there shouldn't be a 
potential for conflicts.



[jira] [Commented] (FLINK-25581) Jobmanager does not archive completed jobs

2022-01-11 Thread Leonid Ilyevsky (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-25581?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17472761#comment-17472761
 ] 

Leonid Ilyevsky commented on FLINK-25581:
-

I would just like to clarify. According to the documentation, the archiving happens 
in jobmanager, correct? It has nothing to do with the History Manager (which I 
did not bring up yet).

Also, I am running a cluster, and the place to save the history is on a shared 
drive. I assumed that when it comes to archiving, my jobmanagers decide between 
themselves which one is going to do it, and so there should be no conflict.
I am going to check the logs of all jobmanagers in the cluster and will let you 
know if I find anything.



[jira] [Commented] (FLINK-25581) Jobmanager does not archive completed jobs

2022-01-11 Thread Zhilong Hong (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-25581?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17472662#comment-17472662
 ] 

Zhilong Hong commented on FLINK-25581:
--

Thank you for pointing this out, [~chesnay]. You are right. I missed this 
detail. Since the job is still visible in the WebUI, it shouldn't be 
application mode or per-job mode.



[jira] [Commented] (FLINK-25581) Jobmanager does not archive completed jobs

2022-01-11 Thread Chesnay Schepler (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-25581?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17472629#comment-17472629
 ] 

Chesnay Schepler commented on FLINK-25581:
--

Can you check the logs for "Job XXX has been archived at YYY."? This should be 
logged on INFO by the {{FsJobArchivist}}.
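
One way to run this check across several jobmanager logs is a small script. The Python sketch below assumes the message has exactly the shape quoted above ("Job XXX has been archived at YYY.") and that neither the job ID nor the path contains whitespace; the function name is hypothetical.

```python
import re

# Assumed message shape, taken from the comment above:
# "Job XXX has been archived at YYY."
ARCHIVED = re.compile(r"Job (?P<job_id>\S+) has been archived at (?P<path>\S+)\.\s*$")


def find_archived_jobs(log_lines):
    """Return {job_id: archive_path} for every log line that reports an archived job."""
    archived = {}
    for line in log_lines:
        match = ARCHIVED.search(line)
        if match:
            archived[match.group("job_id")] = match.group("path")
    return archived
```

It can be fed an open log file directly, e.g. `find_archived_jobs(open(path, encoding="utf-8"))`, once per jobmanager instance; job IDs that never show up in any result were not reported as archived.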



[jira] [Commented] (FLINK-25581) Jobmanager does not archive completed jobs

2022-01-11 Thread Chesnay Schepler (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-25581?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17472618#comment-17472618
 ] 

Chesnay Schepler commented on FLINK-25581:
--

[~Thesharing] That seems unlikely given that the job was still visible in the 
UI (I assume this was being referred to as "jobmanager console") for an hour, 
while FLINK-24491 only applies to application mode where the JM shuts down 
immediately.



[jira] [Commented] (FLINK-25581) Jobmanager does not archive completed jobs

2022-01-11 Thread Zhilong Hong (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-25581?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17472568#comment-17472568
 ] 

Zhilong Hong commented on FLINK-25581:
--

Maybe it's related to FLINK-24491. If the job finishes before 
HistoryServerArchivist finishes archiving the ExecutionGraphInfo, the archiving 
will be aborted. But it seems weird because writing files to the local disk 
should be finished instantly. Is there anything like "Could not archive 
completed job ... to the history server." in your log?



[jira] [Commented] (FLINK-25581) Jobmanager does not archive completed jobs

2022-01-10 Thread Leonid Ilyevsky (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-25581?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17472187#comment-17472187
 ] 

Leonid Ilyevsky commented on FLINK-25581:
-

Yes, I did. There were no errors. Also I saw it properly read the 
configuration. I also see the proper configuration on the console.



[jira] [Commented] (FLINK-25581) Jobmanager does not archive completed jobs

2022-01-10 Thread Chesnay Schepler (Jira)


[ 
https://issues.apache.org/jira/browse/FLINK-25581?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17472092#comment-17472092
 ] 

Chesnay Schepler commented on FLINK-25581:
--

Did you check the JobManager logs for any errors?
