Reuti wrote:
> Am 28.02.2011 um 19:44 schrieb Prentice Bisbal:
> 
>> Reuti wrote:
>>> Am 28.02.2011 um 17:58 schrieb Prentice Bisbal:
>>>
>>>> Hello Everyone. I'm using SGE 6.2u3. One of my users suddenly
>>>> reported getting this error when running Mathematica jobs (which use 
>>>> DRMAA):
>>>>
>>>>
>>>> Java::excptn: A Java exception occurred:
>>>>   org.ggf.drmaa.DeniedByDrmException: error: no suitable queues  .
>>>>       at com.sun.grid.drmaa.SessionImpl.nativeRunJob(Native Method)
>>>>       at com.sun.grid.drmaa.SessionImpl.runJob(SessionImpl.java:349)
>>>>
>>>>
>>>> Any ideas what could be causing this? Not sure exactly how to debug
>>>> DRMAA problems, and I can't think of anything I've changed in the past
>>>> few months that would affect this at all. I have a lot of users who used
>>>> gridMathematica heavily in the past, so this definitely worked before...
>>> This doesn't look like a DRMAA error per se - the job requests something 
>>> which can't be satisfied. 
>>>
>> That's what I thought, too, but I can't find anything that it could be
>> requesting that can't be satisfied. It's only running on 4 processors,
>> and I can run other jobs; it's only this one that uses DRMAA that hangs.
>>
>> I should point out that the job starts to run, then fails when the DRMAA
>> is invoked. Here's the jobscript I'm submitting:
>>
>> #!/bin/bash
>> #$ -V
>> #$ -cwd
>> #$ -m abe
>> #$ -l h_rt=00:10:00
>> #$ -R y
>>
>> export CLASSPATH=/usr/local/share/sge/lib/drmaa.jar:/usr/local/share/sge/lib/jgdi.jar:/usr/local/share/sge/lib/juti.jar
>> export LD_LIBRARY_PATH=/usr/local/share/sge/lib/lx24-amd64
>>
>> /usr/local/share/bin/math < factor.m
>>
>> And here's the Mathematica file:
>>
>> Needs["ClusterIntegration`"]
>> kernels = LaunchKernels[SGE["aurora.sns.ias.edu"], 4]
>> ParallelMap[FactorInteger, (10^Range[20, 30] - 1)/9]
>> CloseKernels[]
>>
>> And here's more of the output:
>>
>> Mathematica 7.0 for Linux x86 (64-bit)
>> Copyright 1988-2009 Wolfram Research, Inc.
>>
>> In[1]:=
>> In[2]:=
>> Java::excptn: A Java exception occurred:
>>    org.ggf.drmaa.DeniedByDrmException: error: no suitable queues  .
>>        at com.sun.grid.drmaa.SessionImpl.nativeRunJob(Native Method)
> 
> What native command is given here?

I don't know. Mathematica is a commercial, closed-source application.
It's quitting time for me today, but do you think a stack trace might
be helpful? I've never tried debugging Java before...
> 
> 
>>        at com.sun.grid.drmaa.SessionImpl.runJob(SessionImpl.java:349)
>>
>> Out[2]= $Failed
>>
>> In[3]:=
>> ParallelCombine::nopar:
>>   No parallel kernels available; proceeding with sequential evaluation.
>>
>> Out[3]= {{{11, 1}, {41, 1}, {101, 1}, {271, 1}, {3541, 1}, {9091, 1},
>> ... the rest is just the answer...
>>
>> So the problem seems to lie somewhere in the DRMAA invocation. How can I
>> debug/troubleshoot this?  The job doesn't run long enough to do a qstat
>> -j <jobid>, and I wonder if that would give me anything useful in this
>> case.
> 
> When there are no suitable queues, it won't start as an SGE job at all - it's 
> not even entering SGE.

Actually it is entering SGE. The initial Mathematica job is, at least.
I added some pause statements in the Mathematica program, and you can
see the job running in the qstat output. Even after it fails at
launching additional processes with DRMAA, the job continues to run, but
as a serial process. STDOUT and STDERR are captured and written to
the normal output files.

If you're referring to the child processes that Mathematica is trying
to start with DRMAA, then you are correct, they aren't making it into SGE.

> 
> 
>> Is there any special configuration you need to enable DRMAA in SGE? I
>> don't remember ever setting anything, and this exact program has been
>> working without a single hiccup for over a year, but it could be that I
>> accidentally changed a setting.
> 
> User is on a blocked access list - can he run other jobs with `qsub`?

No. This error happens for me, too. In addition to running this job
serially, I can run multi-slot MPI jobs with OpenMPI, and the standard
'echo $HOSTNAME' for debugging.

I can't imagine what could have possibly changed to cause this. I put
the sleep statements in hoping that the output of qstat -j <jobid> would
show something after the DRMAA calls, but since those jobs never enter SGE
(as you pointed out), no errors are shown for them.
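
Since qstat -j has nothing to report for jobs that were never accepted,
one possible way to get more detail is to ask qsub to validate an
equivalent request by hand. The following is only a sketch under
assumptions: the resource request in NATIVE_SPEC is a placeholder copied
from the job script above, not what Mathematica's DRMAA call actually
passes (that is still unknown), and verify_cmd is a hypothetical helper
name. The -w v option asks qsub to check whether the job could be
scheduled without actually submitting it.

```shell
#!/bin/bash
# Placeholder request; replace with whatever the DRMAA call really passes
# once that is known. This value is an assumption, not taken from Mathematica.
NATIVE_SPEC="${NATIVE_SPEC:--l h_rt=00:10:00}"

# Hypothetical helper: build a qsub command line that only validates the
# request (-w v) for a trivial binary job (-b y /bin/true).
verify_cmd() {
    echo "qsub -w v -b y $1 /bin/true"
}

cmd=$(verify_cmd "$NATIVE_SPEC")
echo "$cmd"

# Only run the check where qsub is actually available.
if command -v qsub >/dev/null 2>&1; then
    $cmd   # intentionally unquoted so the options are split into words
else
    echo "qsub not found; run the command above on a submit host"
fi
```

Running this from inside a job on an execution host, rather than from the
login node, would mimic where the DRMAA call is made from.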

> 
> -- Reuti
> 
> 
> 
>> -- 
>> Prentice
>> _______________________________________________
>> users mailing list
>> [email protected]
>> https://gridengine.org/mailman/listinfo/users
> 
> 

-- 
Prentice