Hi Eroma,
OK. Is there any timeline to get these capabilities implemented?
Given the current state, how can we formally test the ability of the
metascheduler to reschedule?
My thought was to use LS6 since its allocation expired, but that doesn't
work.
If we supply a fake queue name for that resource, will that cause it to
first try and then reschedule?
Any other ideas for testing initiated from the UltraScan LIMS via
SciGap/ultrascan-airavata-bridge?
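For reference, here is a rough sketch of my mental model of the current
flow (illustrative only; the names are made up and are not the actual
Airavata API):

```python
# Illustrative sketch of the metascheduler behavior described in this
# thread: pre-submission checks (availability, ssh, queue, queue limit)
# can move a job to another resource, but a failure *after* submission
# (e.g. an allocation error) is not rescheduled yet.
def select_and_submit(resources, job):
    for res in resources:
        ok = (res["available"]
              and res["ssh_ok"]
              and job["queue"] in res["queues"]
              and job["cores"] <= res["queue_limit"])
        if not ok:
            continue  # pre-submission check failed: try the next resource
        # Post-submission failures (e.g. "no project allocation") are
        # currently terminal -- no resubmission to another cluster.
        return res["submit"](job)
    raise RuntimeError("no suitable resource in the pool")
```

Under this model, a fake queue name would be caught by the
pre-submission queue check rather than after submission, which is why
I am unsure it would exercise the missing rescheduling path.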
Thanks,
Emre
Eroma Abeysinghe wrote:
Hi Emre,
Currently, what the metascheduler does is find a resource from the
pool, confirm it is available and reachable via ssh, verify the queue,
check the queue limit, and submit the job (Lahiru also explained this
previously in this thread). After this point, if the job fails, e.g.,
due to an allocation not being available, we don't send it back to a
new resource. This is yet to be implemented.
Thanks,
Eroma
On Fri, Oct 25, 2024 at 12:32 PM Emre Brookes
<[email protected]> wrote:
Hi Eroma,
We understand we do not have a current allocation on LS6.
We expect the metascheduler to reschedule to another cluster for this
error; will this case not be supported?
Our goal for the metascheduler was for any issue, regardless of cause,
to be as transparent as possible for the user.
Allocations could run out or expire, or there could be problems with
the allocation mechanism on the target resource; all of these (and
others, e.g., the communications issues we have seen particularly with
Expanse) should be recoverable by the metascheduler. This is the point
of it. The simple case of a job running and returning a failed status
is likely the *least* recoverable, since if the job ran correctly,
there is likely a data issue that will cause it to fail elsewhere
(exceptions could be different memory limits, or the rare case where a
node failed during the computation).
Thanks,
Emre
Eroma Abeysinghe wrote:
> Hi Emre, et al,
>
> I looked at the failed experiment Aaron shared. It was executed in ls6.
> I looked at ls6 experiments for the past year; 9 months ago the
> cluster started asking for a project allocation. Before, UltraScan had
> submitted successfully without a project allocation.
> So right now, to submit to ls6, you need to contact them, provide your
> details, and ask for the project allocation number.
>
> The messages we see from ls6 are in [1].
> Aaron, do you have issues submitting to other clusters, or does the
> research need to be done in ls6?
>
> Thanks,
> Eroma
>
> ----
> [1]
> A1394806797 STDOUT
> -----------------------------------------------------------------
> Welcome to the Lonestar6 Supercomputer
> -----------------------------------------------------------------
>
> No reservation for this job
> --> Verifying valid submit host (login2)...OK
> --> Verifying valid jobname...OK
> --> Verifying valid ssh keys...OK
> --> Verifying access to desired queue (normal)...OK
> --> Checking available allocation FAILED
> A1394806797 STDERR
> ERROR: You have no project in the projectuser.map file
> (in accounting_check_prod.pl).
>
> Please report this problem:
> U. of TX users contact https://portal.tacc.utexas.edu/consulting
> ------
>
> On Fri, Oct 25, 2024 at 11:05 AM Lahiru Jayathilake
> <[email protected]> wrote:
>
>
>
> ---------- Forwarded message ---------
> From: *Emre Brookes* <[email protected]>
> Date: Tue, Oct 22, 2024 at 4:10 PM
> Subject: Re: Failed but metascheduler did not resubmit job
> To: <[email protected]>, Lahiru Jayathilake
> <[email protected]>
> Cc: <[email protected]>
>
>
> Hi Lahiru,
>
> I really appreciate your work on this.
> It's important that we have a working system that we can include in
> our progress report to the NIH, and I'd like to see Aaron able to
> finish his paper this year.
> If there's anything we can do from our end to expedite this, please
> let me know.
>
> Thanks,
> Emre
>
> Lahiru Jayathilake wrote:
> >
> > Hi Aaron,
> >
> > Thank you for your patience, and I apologize for the delay in
> > getting back to you regarding the issue.
> >
> > After further investigation, I noticed that the current version of
> > the Airavata Metascheduler does not support automatically
> > resubmitting jobs to different clusters when a job fails after
> > successful submission (e.g., due to resource allocation issues).
> > I've now created a task [1] to add this feature, which will enable
> > the expected functionality. This enhancement will take some time to
> > implement, but we'll keep you updated on the progress.
> >
> > Please feel free to reach out if you have any further
questions or
> > need additional information.
> >
> > [1] - https://issues.apache.org/jira/browse/AIRAVATA-3893
> >
> > Thanks,
> > Lahiru
> >
> > On Wed, Oct 16, 2024 at 7:36 PM Aaron Householder
> > <[email protected]> wrote:
> >
> > Hi,
> >
> > Is there any update?
> >
> > Regards,
> >
> > Aaron
> >
> > *From: *Aaron Householder <[email protected]>
> > *Date: *Saturday, October 12, 2024 at 8:12 PM
> > *To: *[email protected]
> > *Subject: *Re: Failed but metascheduler did not resubmit job
> >
> > Hi Lahiru,
> >
> > Any update on this issue? This is an impediment to getting this
> > rolled out to UltraScan users. My understanding is that if the job
> > fails while verifying and making checks, the metascheduler should
> > try another resource.
> >
> > Is there anything I can do to help?
> >
> > Regards,
> >
> > Aaron
> >
> > *From: *Lahiru Jayathilake <[email protected]>
> > *Date: *Thursday, September 12, 2024 at 1:42 PM
> > *To: *[email protected]
> > *Subject: *Re: Failed but metascheduler did not resubmit job
> > *Subject: *Re: Failed but metascheduler did not
resubmit job
> >
> > Hi Aaron,
> >
> > Thanks for contacting us. We will look into this issue and get
> > back to you.
> >
> > Best,
> > Lahiru
> >
> > On 2024/09/11 18:45:33 Aaron Householder wrote:
> > > Hi Airavata,
> > >
> > > I'm working on connecting Ultrascan3 to Airavata. As the
> > > message below shows, if the job fails the metascheduler might
> > > not retry the job. Is there a resource available to take a look
> > > at this issue?
> > >
> > > Regards,
> > > Aaron
> > >
> > > From: Aaron Householder <[email protected]>
> > > Date: Tuesday, September 3, 2024 at 4:42 PM
> > > To: Airavata Users <[email protected]>
> > > Subject: Failed but metascheduler did not resubmit job
> > > Hi Airavata Users,
> > >
> > > I had an UltraScan job that seemed to fail without the
> > > metascheduler resubmitting the job for completion by another
> > > cluster. I received the following in an email:
> > >
> > > Your UltraScan job is complete:
> > >
> > > Submission Time : 2024-08-26 00:50:05
> > > Job End Time :
> > > Mail Time : 2024-08-25 19:54:41
> > > LIMS Host :
> > > Analysis ID : US3-AIRA_ea2b4a32-27a8-4df4-827c-5fd9367c5e1c
> > > Request ID : 182 ( uslims3_Demo )
> > > RunID : demo1_veloc1
> > > EditID : 21030600161
> > > Data Type : RA
> > > Cell/Channel/Wl : 2 / A / 259
> > > Status : failed
> > > Cluster : metascheduler
> > > Job Type : 2DSA-MC
> > > GFAC Status : FAILED
> > > GFAC Message :
> > > org.apache.airavata.helix.impl.task.TaskOnFailException: Error
> > > Code : 23857cb5-5431-43e7-a927-fedd7a929e34, Task
> > > TASK_5b1ea99b-750c-49f6-a05b-df7175f141ed failed due to Couldn't
> > > find job id in both submitted and verified steps.
> > > expId:US3-AIRA_ea2b4a32-27a8-4df4-827c-5fd9367c5e1c Couldn't find
> > > remote jobId for JobName:A1394806797, both submit and verify
> > > steps doesn't return a valid JobId. Hence changing experiment
> > > state to Failed
> > > at org.apache.airavata.helix.impl.task.AiravataTask.onFail(AiravataTask.java:146)
> > > at org.apache.airavata.helix.impl.task.submission.DefaultJobSubmissionTask.onRun(DefaultJobSubmissionTask.java:192)
> > > at org.apache.airavata.helix.impl.task.AiravataTask.onRun(AiravataTask.java:437)
> > > at org.apache.airavata.helix.core.AbstractTask.run(AbstractTask.java:102)
> > > at org.apache.helix.task.TaskRunner.run(TaskRunner.java:71)
> > > at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515)
> > > at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
> > > at java.base/java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:304)
> > > at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
> > > at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
> > > at java.base/java.lang.Thread.run(Thread.java:829)
> > >
> > >
> > > No reservation for this job
> > > --> Verifying valid submit host (login2)...OK
> > > --> Verifying valid jobname...OK
> > > --> Verifying valid ssh keys...OK
> > > --> Verifying access to desired queue (normal)...OK
> > > --> Checking available allocation FAILED
> > > Airavata stderr : ERROR: You have no project in the
> > > projectuser.map file (in accounting_check_prod.pl).
> > >
> >
> >
>
>
>
> --
> Thank You,
> Best Regards,
> Eroma
--
Thank You,
Best Regards,
Eroma