Hi Emre et al.,
I looked at the failed experiment Aaron shared. It was executed on ls6.
I also went through the ls6 experiments from the past year: about 9 months
ago the cluster started requiring a project allocation. Before that,
UltraScan jobs were submitted successfully without one.
So right now, to submit to ls6, you need to contact the cluster's support
team, provide your details, and ask for a project allocation number.
The messages we see from ls6 are included below [1].
Aaron, do you have issues submitting to other clusters, or does the research
need to be done on ls6?
Thanks,
Eroma
----
[1]
A1394806797 STDOUT
-----------------------------------------------------------------
Welcome to the Lonestar6 Supercomputer
-----------------------------------------------------------------
No reservation for this job
--> Verifying valid submit host (login2)...OK
--> Verifying valid jobname...OK
--> Verifying valid ssh keys...OK
--> Verifying access to desired queue (normal)...OK
--> Checking available allocation FAILED
A1394806797 STDERR
ERROR: You have no project in the projectuser.map file (in
accounting_check_prod.pl).
Please report this problem:
U. of TX users contact (https://portal.tacc.utexas.edu/consulting)
------
On Fri, Oct 25, 2024 at 11:05 AM Lahiru Jayathilake <
[email protected]> wrote:
>
>
> ---------- Forwarded message ---------
> From: Emre Brookes <[email protected]>
> Date: Tue, Oct 22, 2024 at 4:10 PM
> Subject: Re: Failed but metascheduler did not resubmit job
> To: <[email protected]>, Lahiru Jayathilake <
> [email protected]>
> Cc: <[email protected]>
>
>
> Hi Lahiru,
>
> I really appreciate your work on this.
> It's important that we have a working system that we can include in our
> progress report to the NIH & I'd like to see Aaron able to finish his
> paper this year.
> If there's anything we can do from our end to expedite this, please let
> me know.
>
> Thanks,
> Emre
>
> Lahiru Jayathilake wrote:
> >
> > Hi Aaron,
> >
> > Thank you for your patience, and I apologize for the delay in getting
> > back to you regarding the issue.
> >
> > After further investigation, I noticed that the current version of the
> > Airavata Metascheduler does not support automatically resubmitting
> > jobs to different clusters when a job fails after successful
> > submission (e.g., due to resource allocation issues). I've now created
> > a task [1] to add this feature, which will enable the expected
> > functionality. This enhancement will take some time to implement, but
> > we’ll keep you updated on the progress.
> >
> > Please feel free to reach out if you have any further questions or
> > need additional information.
> >
> > [1] - https://issues.apache.org/jira/browse/AIRAVATA-3893
> >
> > Thanks,
> > Lahiru
> >
> > On Wed, Oct 16, 2024 at 7:36 PM Aaron Householder
> > <[email protected] <mailto:[email protected]>> wrote:
> >
> > Hi,
> >
> > Is there any update?
> >
> > Regards,
> >
> > Aaron
> >
> > *From: *Aaron Householder <[email protected]
> > <mailto:[email protected]>>
> > *Date: *Saturday, October 12, 2024 at 8:12 PM
> > *To: *[email protected] <mailto:[email protected]>
> > <[email protected] <mailto:[email protected]>>
> > *Subject: *Re: Failed but metascheduler did not resubmit job
> >
> > Hi Lahiru,
> >
> > Any update on this issue? This is an impediment to getting this
> > rolled out to UltraScan users. My understanding is that if the job
> > fails while verifying and making checks that the metascheduler
> > should try another resource.
> >
> > Is there anything I can do to help?
> >
> > Regards,
> >
> > Aaron
> >
> > *From: *Lahiru Jayathilake <[email protected]
> > <mailto:[email protected]>>
> > *Date: *Thursday, September 12, 2024 at 1:42 PM
> > *To: *[email protected] <mailto:[email protected]>
> > <[email protected] <mailto:[email protected]>>
> > *Subject: *Re: Failed but metascheduler did not resubmit job
> >
> > Hi Aaron,
> >
> > Thanks for contacting us. We will look into this issue and get
> > back to you.
> >
> > Best,
> > Lahiru
> >
> > On 2024/09/11 18:45:33 Aaron Householder wrote:
> > > Hi Airavata,
> > >
> > > I’m working on connecting Ultrascan3 to Airavata. As the message
> > below shows, if the job fails the metascheduler might not retry
> > the job. Is there a resource available to take a look at this issue?
> > >
> > > Regards,
> > > Aaron
> > >
> > > From: Aaron Householder <[email protected]
> > <mailto:[email protected]>>
> > > Date: Tuesday, September 3, 2024 at 4:42 PM
> > > To: Airavata Users <[email protected]
> > <mailto:[email protected]>>
> > > Subject: Failed but metascheduler did not resubmit job
> > > Hi Airavata Users,
> > >
> > > I had an UltraScan job that seemed to fail without the
> > metascheduler resubmitting the job for completion by another
> > cluster. I received the following in an email:
> > >
> > > Your UltraScan job is complete:
> > >
> > > Submission Time : 2024-08-26 00:50:05
> > > Job End Time :
> > > Mail Time : 2024-08-25 19:54:41
> > > LIMS Host :
> > > Analysis ID : US3-AIRA_ea2b4a32-27a8-4df4-827c-5fd9367c5e1c
> > > Request ID : 182 ( uslims3_Demo )
> > > RunID : demo1_veloc1
> > > EditID : 21030600161
> > > Data Type : RA
> > > Cell/Channel/Wl : 2 / A / 259
> > > Status : failed
> > > Cluster : metascheduler
> > > Job Type : 2DSA-MC
> > > GFAC Status : FAILED
> > > GFAC Message : org.apache.airavata.helix.impl.task.TaskOnFailException: Error Code : 23857cb5-5431-43e7-a927-fedd7a929e34, Task TASK_5b1ea99b-750c-49f6-a05b-df7175f141ed failed due to Couldn't find job id in both submitted and verified steps. expId:US3-AIRA_ea2b4a32-27a8-4df4-827c-5fd9367c5e1c Couldn't find remote jobId for JobName:A1394806797, both submit and verify steps doesn't return a valid JobId. Hence changing experiment state to Failed
> > > at org.apache.airavata.helix.impl.task.AiravataTask.onFail(AiravataTask.java:146)
> > > at org.apache.airavata.helix.impl.task.submission.DefaultJobSubmissionTask.onRun(DefaultJobSubmissionTask.java:192)
> > > at org.apache.airavata.helix.impl.task.AiravataTask.onRun(AiravataTask.java:437)
> > > at org.apache.airavata.helix.core.AbstractTask.run(AbstractTask.java:102)
> > > at org.apache.helix.task.TaskRunner.run(TaskRunner.java:71)
> > > at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515)
> > > at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
> > > at java.base/java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:304)
> > > at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
> > > at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
> > > at java.base/java.lang.Thread.run(Thread.java:829)
> > >
> > >
> > > No reservation for this job
> > > --> Verifying valid submit host (login2)...OK
> > > --> Verifying valid jobname...OK
> > > --> Verifying valid ssh keys...OK
> > > --> Verifying access to desired queue (normal)...OK
> > > --> Checking available allocation FAILED
> > > Airavata stderr : ERROR: You have no project in the
> > > projectuser.map file (in accounting_check_prod.pl).
> > >
> >
> >
>
>
--
Thank You,
Best Regards,
Eroma