Re: Failed but metascheduler did not resubmit job

Eroma Abeysinghe Fri, 25 Oct 2024 08:46:07 -0700

Hi Emre, et al,

I looked at the failed experiment Aaron shared. It was executed on ls6.
I also looked at the ls6 experiments from the past year: 9 months ago the
cluster started requiring a project allocation. Before that, UltraScan jobs
were submitted successfully without one.
So right now, to submit to ls6, you need to contact them, provide your
details, and ask for a project allocation number.


The messages we see from ls6 are in [1].
Aaron, do you have issues submitting to other clusters, or does the research
need to be done on ls6?

Thanks,
Eroma

----
[1]
A1394806797 STDOUT

-----------------------------------------------------------------
           Welcome to the Lonestar6 Supercomputer
-----------------------------------------------------------------

No reservation for this job
--> Verifying valid submit host (login2)...OK
--> Verifying valid jobname...OK
--> Verifying valid ssh keys...OK
--> Verifying access to desired queue (normal)...OK
--> Checking available allocation FAILED

A1394806797 STDERR

ERROR: You have no project in the projectuser.map file (in
accounting_check_prod.pl).

Please report this problem:
U. of TX users contact (https://portal.tacc.utexas.edu/consulting)
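
As a rough illustration only (this is not Airavata code): the check output
above has a regular shape, so a failed allocation check like this one could be
recognized from the job's STDOUT/STDERR before deciding whether to try another
cluster. The sketch below is in Python. The function names and the cluster
list are hypothetical, and the matched strings come only from the ls6 output
above, so other clusters may word these checks differently.

```python
import re
from typing import List, Optional

# Matches pre-submission check lines such as
#   "--> Verifying valid jobname...OK"
#   "--> Checking available allocation FAILED"
CHECK_LINE = re.compile(r"^-->\s*(.+?)(?:\.\.\.|\s+)(OK|FAILED)\s*$")


def failed_checks(stdout: str) -> List[str]:
    """Return the names of the checks that the cluster reported as FAILED."""
    failed = []
    for line in stdout.splitlines():
        match = CHECK_LINE.match(line.strip())
        if match and match.group(2) == "FAILED":
            failed.append(match.group(1).strip())
    return failed


def is_allocation_failure(stdout: str, stderr: str) -> bool:
    """Heuristic: True when the output looks like a missing project allocation."""
    if any("allocation" in name.lower() for name in failed_checks(stdout)):
        return True
    # String taken from the ls6 STDERR above; an assumption for other clusters.
    return "projectuser.map" in stderr


def pick_fallback(clusters: List[str], failed_on: str) -> Optional[str]:
    """Hypothetical helper: first configured cluster other than the failed one."""
    for name in clusters:
        if name != failed_on:
            return name
    return None


if __name__ == "__main__":
    stdout = (
        "No reservation for this job\n"
        "--> Verifying valid submit host (login2)...OK\n"
        "--> Verifying access to desired queue (normal)...OK\n"
        "--> Checking available allocation FAILED\n"
    )
    stderr = "ERROR: You have no project in the projectuser.map file"
    print(failed_checks(stdout))
    print(is_allocation_failure(stdout, stderr))
    # "other-cluster" is a placeholder name, not a real resource.
    print(pick_fallback(["ls6", "other-cluster"], failed_on="ls6"))
```

A missing allocation will not go away on retry, so a check like this would
mainly help tell "try a different cluster" apart from "try the same cluster
again".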

------

On Fri, Oct 25, 2024 at 11:05 AM Lahiru Jayathilake <
[email protected]> wrote:

>
>
> ---------- Forwarded message ---------
> From: Emre Brookes <[email protected]>
> Date: Tue, Oct 22, 2024 at 4:10 PM
> Subject: Re: Failed but metascheduler did not resubmit job
> To: <[email protected]>, Lahiru Jayathilake <
> [email protected]>
> Cc: <[email protected]>
>
>
> Hi Lahiru,
>
> I really appreciate your work on this.
> It's important that we have a working system that we can include in our
> progress report to the NIH & I'd like to see Aaron able to finish his
> paper this year.
> If there's anything we can do from our end to expedite this, please let
> me know.
>
> Thanks,
> Emre
>
> Lahiru Jayathilake wrote:
> > Hi Aaron,
> >
> > Thank you for your patience, and I apologize for the delay in getting
> > back to you regarding the issue.
> >
> > After further investigation, I noticed that the current version of the
> > Airavata Metascheduler does not support automatically resubmitting
> > jobs to different clusters when a job fails after successful
> > submission (e.g., due to resource allocation issues). I've now created
> > a task [1] to add this feature, which will enable the expected
> > functionality. This enhancement will take some time to implement, but
> > we’ll keep you updated on the progress.
> >
> > Please feel free to reach out if you have any further questions or
> > need additional information.
> >
> > [1] - https://issues.apache.org/jira/browse/AIRAVATA-3893
> >
> > Thanks,
> > Lahiru
> >
> > On Wed, Oct 16, 2024 at 7:36 PM Aaron Householder
> > <[email protected]> wrote:
> >
> >     Hi,
> >
> >     Is there any update?
> >
> >     Regards,
> >
> >     Aaron
> >
> >     *From: *Aaron Householder <[email protected]>
> >     *Date: *Saturday, October 12, 2024 at 8:12 PM
> >     *To: *[email protected]
> >     *Subject: *Re: Failed but metascheduler did not resubmit job
> >
> >     Hi Lahiru,
> >
> >     Any update on this issue? This is an impediment to getting this
> >     rolled out to UltraScan users. My understanding is that if the job
> >     fails while verifying and making checks, the metascheduler
> >     should try another resource.
> >
> >     Is there anything I can do to help?
> >
> >     Regards,
> >
> >     Aaron
> >
> >     *From: *Lahiru Jayathilake <[email protected]>
> >     *Date: *Thursday, September 12, 2024 at 1:42 PM
> >     *To: *[email protected]
> >     *Subject: *Re: Failed but metascheduler did not resubmit job
> >
> >     Hi Aaron,
> >
> >     Thanks for contacting us. We will look into this issue and get
> >     back to you.
> >
> >     Best,
> >     Lahiru
> >
> >     On 2024/09/11 18:45:33 Aaron Householder wrote:
> >     > Hi Airavata,
> >     >
> >     > I’m working on connecting Ultrascan3 to Airavata. As the message
> >     > below shows, if the job fails the metascheduler might not retry
> >     > the job. Is there a resource available to take a look at this issue?
> >     >
> >     > Regards,
> >     > Aaron
> >     >
> >     > From: Aaron Householder <[email protected]>
> >     > Date: Tuesday, September 3, 2024 at 4:42 PM
> >     > To: Airavata Users <[email protected]>
> >     > Subject: Failed but metascheduler did not resubmit job
> >     > Hi Airavata Users,
> >     >
> >     > I had an UltraScan job that seemed to fail without the
> >     > metascheduler resubmitting the job for completion by another
> >     > cluster. I received the following in an email:
> >     >
> >     >    Your UltraScan job is complete:
> >     >
> >     >    Submission Time : 2024-08-26 00:50:05
> >     >    Job End Time    :
> >     >    Mail Time       : 2024-08-25 19:54:41
> >     >    LIMS Host       :
> >     >    Analysis ID     : US3-AIRA_ea2b4a32-27a8-4df4-827c-5fd9367c5e1c
> >     >    Request ID      : 182  ( uslims3_Demo )
> >     >    RunID           : demo1_veloc1
> >     >    EditID          : 21030600161
> >     >    Data Type       : RA
> >     >    Cell/Channel/Wl : 2 / A / 259
> >     >    Status          : failed
> >     >    Cluster         : metascheduler
> >     >    Job Type        : 2DSA-MC
> >     >    GFAC Status     : FAILED
> >     >    GFAC Message    :
> >     >    org.apache.airavata.helix.impl.task.TaskOnFailException: Error
> >     >    Code : 23857cb5-5431-43e7-a927-fedd7a929e34, Task
> >     >    TASK_5b1ea99b-750c-49f6-a05b-df7175f141ed failed due to Couldn't
> >     >    find job id in both submitted and verified steps.
> >     >    expId:US3-AIRA_ea2b4a32-27a8-4df4-827c-5fd9367c5e1c Couldn't find
> >     >    remote jobId for JobName:A1394806797, both submit and verify steps
> >     >    doesn't return a valid JobId. Hence changing experiment state to
> >     >    Failed
> >     >         at org.apache.airavata.helix.impl.task.AiravataTask.onFail(AiravataTask.java:146)
> >     >         at org.apache.airavata.helix.impl.task.submission.DefaultJobSubmissionTask.onRun(DefaultJobSubmissionTask.java:192)
> >     >         at org.apache.airavata.helix.impl.task.AiravataTask.onRun(AiravataTask.java:437)
> >     >         at org.apache.airavata.helix.core.AbstractTask.run(AbstractTask.java:102)
> >     >         at org.apache.helix.task.TaskRunner.run(TaskRunner.java:71)
> >     >         at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515)
> >     >         at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
> >     >         at java.base/java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:304)
> >     >         at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
> >     >         at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
> >     >         at java.base/java.lang.Thread.run(Thread.java:829)
> >     >
> >     >
> >     > No reservation for this job
> >     > --> Verifying valid submit host (login2)...OK
> >     > --> Verifying valid jobname...OK
> >     > --> Verifying valid ssh keys...OK
> >     > --> Verifying access to desired queue (normal)...OK
> >     > --> Checking available allocation FAILED
> >     >    Airavata stderr : ERROR: You have no project in the
> >     >    projectuser.map file (in accounting_check_prod.pl).
> >     >
> >
> >
>
>

-- 
Thank You,
Best Regards,
Eroma
