Re: Sun Grid Engine 6.2 on SL 6.1

2012-01-16 Thread Wil Irwin
Hi Steve-

I have been unable to find external or internal documentation about glibc
compatibility for lx24.

I have since upgraded to SGE 6.2u7, and the glibc installed (SL 6.1) is as
follows:

Name        : glibc
Arch        : x86_64
Version     : 2.12
Release     : 1.47.el6
Size        : 12 M
Repo        : installed
From repo   : sl-securit

Name        : glib
Arch        : x86_64
Epoch       : 1
Version     : 1.2.10
Release     : 33.el6
Size        : 354 k
Repo        : installed
From repo   : base

Would you happen to have any suggestions on where I can look to rule out
glibc-based problems?

Thanks much,
Wil
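
One way to narrow this down, offered only as a sketch: the dynamic symbol table of each SGE binary lists the versioned glibc symbols it needs (for example GLIBC_2.2.5), and the highest of those can be compared against the installed 2.12. The Python below assumes the text comes from something like `objdump -T` run on a binary from the lx24-amd64 tarball. The function names and the sample dump are made up for illustration and are not real output from this cluster.

```python
import re


def max_glibc_required(symbol_dump):
    """Return the highest GLIBC_x.y[.z] version named in the dump, or ()."""
    versions = [
        tuple(int(part) for part in match.split("."))
        for match in re.findall(r"GLIBC_(\d+(?:\.\d+)*)", symbol_dump)
    ]
    return max(versions, default=())


def glibc_satisfies(required, installed):
    """True if the installed glibc version string is at least `required`."""
    installed_tuple = tuple(int(part) for part in installed.split("."))
    return required <= installed_tuple


# Fabricated sample in the general shape of `objdump -T` output.
SAMPLE_DUMP = """
0000000000000000      DF *UND*  0000000000000000  GLIBC_2.2.5 getenv
0000000000000000      DF *UND*  0000000000000000  GLIBC_2.3.4 __sprintf_chk
0000000000000000      DF *UND*  0000000000000000  GLIBC_2.3   __ctype_b_loc
"""

required = max_glibc_required(SAMPLE_DUMP)
print(required, glibc_satisfies(required, "2.12"))
```

If every binary reports a requirement at or below 2.12, a glibc version mismatch becomes a less likely explanation, though this does not rule out other library differences.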

On Wed, Jan 11, 2012 at 1:56 PM, Steven Timm t...@fnal.gov wrote:

 This smells like there could be problems with the glibc version. The
 lx24 build presumes either a kernel version or a glibc version or both.
 Do you have the appropriate compatibility glibc libraries installed?

 Steve Timm




 On Wed, 11 Jan 2012, Wil Irwin wrote:

  Hi-

 It is 64-bit on 64-bit. The exact version is from
 'ge-6.2-bin-lx24-amd64.tar.gz' and 'ge-6.2-common.tar.gz'. So I can rule
 out that issue.

 As for the problems, I can provide more detail, but in brief (sort of):

 1. The installation is w/o incident and I have used all the suggested
 defaults. Out of frustration, I've also installed it a couple of dozen
 times, changing some of the more flexible defaults one at a time.

 2. The simple job runs as it should.

 3. There are 3 nodes (with the master also serving as an executor). All
 are talking to each other in terms of the SGE ports and NFS.

 4. My inquiry was intended to be general in terms of some possible
 incompatibility between SGE and SL 6.1. The comments which follow have,
 unfortunately, the added factor of submitting jobs using an analysis
 application. The script which this application uses is a bit convoluted,
 but I studied it pretty well and, if there is some problem, I don't see it.
 I have not received any negative feedback from other users of this
 application. Unfortunately, it really isn't possible to submit the job from
 this application w/o using the accompanying script. So, of course, there is
 a bit of a black-box factor.

 5. One particular job is very large (~20K commands). After the commands
 are generated and submitted, SGE returns the rather confusing error message
 "Unable to run job: job rejected: You try to submit a job with more than
 75000 tasks. Exiting." 75000 is the configured limit, but I can readily
 see the command lines being generated and there are exactly 16900. I would
 say in general, this is the most perplexing problem.

 6. #5 is accompanied by failure email messages, but not 16900 of them (I
 would say many hundred). I can't explain this behavior either. It could
 actually be an email server issue and not related to SGE, per se.

 7. Another example is, or will appear to be, very specific to the analysis
 application I am using as opposed to a general SGE issue. For this
 application, there is an explicit user variable to set the queue, and I
 have set it to 'verylong.q'. When I submit a much smaller job (~200
 commands) to try to figure out what is going wrong, the 'verylong.q' is
 ignored and 'short.q' is selected. But more curious and more SGE-related
 is that the job will run, but it runs the commands in series and only uses
 1 processor on the master node (each node has 6 x 2 cores).

 That's a flavor of what is causing my sanity to slowly drift away.

 Regards,
 Wil

 On Wed, Jan 11, 2012 at 1:00 PM, Keith Chadwick chadw...@fnal.gov
 wrote:

  Are you trying to run either:

   1. A 32 bit version of SGE 6.2 on a 64 bit SL 6.1 system?

 or

   2. A 64 bit version of SGE 6.2 on a 32 bit SL 6.1 system?

 In case #1, you should be able to get SGE to run once you install
 the necessary 32 bit compatibility libraries, or (recommended) switch
 to a 64 bit version of SGE 6.2.

 In case #2, you are going to be out of luck...

 -Keith.


 At 12:43 PM -0800 1/11/12, Wil Irwin wrote:

  Hello-

 I am having unparalleled (no pun intended) problems getting SGE 6.2 to
 run under SL 6.1. I have consulted with others who have quite a bit of
 experience using SGE on an earlier version of SL, and we cannot
 determine why it won't run.

 Before I list the nature of the problems, I thought I would start by
 asking if anyone has had a successful experience with SGE 6.2 on SL 6.1.

 I'm running kernel:  2.6.32-220.2.1.el6.x86_64 #1 SMP Thu Dec 22
 11:15:52 CST 2011 x86_64

 Thanks for any help.

 -Wil





 --
 Steven C. Timm, Ph.D  (630) 840-8525
 t...@fnal.gov  http://home.fnal.gov/~timm/
 Fermilab Computing Division, Scientific Computing Facilities,
 Grid Facilities Department, FermiGrid Services Group, Group Leader.
 Lead of FermiCloud project.



Re: Sun Grid Engine 6.2 on SL 6.1

2012-01-11 Thread Joshua Baker-LePain

I currently have SGE 6.*1* running on SL6.1, and will be testing 6.2 soon.
I'd be interested in hearing what issues you're having.  Also, it's worth
asking exactly what version of SGE you're using.


--
Joshua Baker-LePain
QB3 Shared Cluster Sysadmin
UCSF


Re: Sun Grid Engine 6.2 on SL 6.1

2012-01-11 Thread Keith Chadwick

Are you trying to run either:

1. A 32 bit version of SGE 6.2 on a 64 bit SL 6.1 system?

or

2. A 64 bit version of SGE 6.2 on a 32 bit SL 6.1 system?

In case #1, you should be able to get SGE to run once you install
the necessary 32 bit compatibility libraries, or (recommended) switch
to a 64 bit version of SGE 6.2.

In case #2, you are going to be out of luck...

-Keith.
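
The 32-bit versus 64-bit question can also be checked directly from a binary, since byte 4 of an ELF header (EI_CLASS) is 1 for 32-bit objects and 2 for 64-bit objects. A minimal Python sketch, where `elf_class` and `elf_class_of_file` are hypothetical helpers and the sample header is fabricated purely for the demonstration:

```python
ELF_MAGIC = b"\x7fELF"
EI_CLASS = 4  # index of the class byte within e_ident


def elf_class(header):
    """Classify an ELF header (at least 5 bytes) as 32-bit or 64-bit."""
    if len(header) < 5 or header[:4] != ELF_MAGIC:
        raise ValueError("not an ELF file")
    return {1: "32-bit", 2: "64-bit"}.get(header[EI_CLASS], "unknown")


def elf_class_of_file(path):
    """Read the first bytes of a file on disk and classify it."""
    with open(path, "rb") as handle:
        return elf_class(handle.read(5))


# Fabricated header bytes standing in for a 64-bit binary.
print(elf_class(b"\x7fELF\x02\x01\x01\x00"))
```

On a real system, `elf_class_of_file` could be pointed at a binary from the unpacked tarball and the result compared with the architecture reported by `uname -m`.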


Re: Sun Grid Engine 6.2 on SL 6.1

2012-01-11 Thread Wil Irwin
Hi-

It is 64-bit on 64-bit. The exact version is from
'ge-6.2-bin-lx24-amd64.tar.gz' and 'ge-6.2-common.tar.gz'. So I can rule
out that issue.

As for the problems, I can provide more detail, but in brief (sort of):

1. The installation is w/o incident and I have used all the suggested
defaults. Out of frustration, I've also installed it a couple of dozen times,
changing some of the more flexible defaults one at a time.

2. The simple job runs as it should.

3. There are 3 nodes (with the master also serving as an executor). All are
talking to each other in terms of the SGE ports and NFS.

4. My inquiry was intended to be general in terms of some possible
incompatibility between SGE and SL 6.1. The comments which follow have,
unfortunately, the added factor of submitting jobs using an analysis
application. The script which this application uses is a bit convoluted,
but I studied it pretty well and, if there is some problem, I don't see it.
I have not received any negative feedback from other users of this
application. Unfortunately, it really isn't possible to submit the job from
this application w/o using the accompanying script. So, of course, there is
a bit of a black-box factor.

5. One particular job is very large (~20K commands). After the commands are
generated and submitted, SGE returns the rather confusing error message
"Unable to run job: job rejected: You try to submit a job with more than
75000 tasks. Exiting." 75000 is the configured limit, but I can readily see
the command lines being generated and there are exactly 16900. I would say
in general, this is the most perplexing problem.

6. #5 is accompanied by failure email messages, but not 16900 of them (I
would say many hundred). I can't explain this behavior either. It could
actually be an email server issue and not related to SGE, per se.

7. Another example is, or will appear to be, very specific to the analysis
application I am using as opposed to a general SGE issue. For this
application, there is an explicit user variable to set the queue, and I
have set it to 'verylong.q'. When I submit a much smaller job (~200
commands) to try to figure out what is going wrong, the 'verylong.q' is
ignored and 'short.q' is selected. But more curious and more SGE-related is
that the job will run, but it runs the commands in series and only uses 1
processor on the master node (each node has 6 x 2 cores).

That's a flavor of what is causing my sanity to slowly drift away.

Regards,
Wil

Re: Sun Grid Engine 6.2 on SL 6.1

2012-01-11 Thread Keith Chadwick

It appears that we can likely eliminate 32/64 bit issues, then.

Some more questions:

Is this 20K command job:
- a sequence of trivially parallel commands,
- an MPI job,
- a job array,
- or is it a complicated DAG?

Can you capture the qsub command(s) associated with this job?

Are you sure that the number of systems and number of streams are 
correctly specified?


-Keith.
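
For the job-array case, the task count implied by a `qsub -t` range can be worked out before submission and compared with the 75000 limit quoted in the error. A rough Python sketch, assuming the range has the `n-m` or `n-m:s` form and that the limit in question is the `max_aj_tasks` setting in the global configuration. The helper names are hypothetical, and nothing here is confirmed to be what the application's script actually does:

```python
def array_task_count(spec):
    """Count the tasks implied by a range of the form 'n-m' or 'n-m:s'."""
    first, dash, rest = spec.partition("-")
    if not dash:
        raise ValueError("expected a range of the form n-m or n-m:s")
    last, _, step = rest.partition(":")
    n, m, s = int(first), int(last), int(step or 1)
    if n < 1 or m < n or s < 1:
        raise ValueError("invalid task range: " + spec)
    return (m - n) // s + 1


def exceeds_limit(spec, max_aj_tasks=75000):
    """True if the range asks for more tasks than the assumed limit."""
    return array_task_count(spec) > max_aj_tasks


for candidate in ("1-16900", "1-169000", "1-75001"):
    print(candidate, array_task_count(candidate), exceeds_limit(candidate))
```

A range matching the 16900 generated commands stays well under the limit, so if the rejection persists it may be worth checking whether the submitted range or step differs from what the generated command lines suggest.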


Re: Sun Grid Engine 6.2 on SL 6.1

2012-01-11 Thread Steven Timm

This smells like there could be problems with the glibc version. The
lx24 build presumes either a kernel version or a glibc version or both.
Do you have the appropriate compatibility glibc libraries installed?

Steve Timm



--
Steven C. Timm, Ph.D  (630) 840-8525
t...@fnal.gov  http://home.fnal.gov/~timm/
Fermilab Computing Division, Scientific Computing Facilities,
Grid Facilities Department, FermiGrid Services Group, Group Leader.
Lead of FermiCloud project.


Re: Sun Grid Engine 6.2 on SL 6.1

2012-01-11 Thread Wil Irwin
Hi Keith-

I would characterize the 20K commands as mostly trivial (~5 minutes of CPU
time per command).

Unfortunately, because of the way the application generates and submits
the commands, it's sort of a closed loop. I have not ruled this out, but,
while the suggestion is greatly appreciated, it shouldn't be necessary.
And trying to capture the command is almost certainly going to produce a
new set of issues.

I am effectively 100% certain the systems are correct. I'm not an expert
using SGE (independent of this particular analysis application), so I'm not
100% certain what settings should be checked to verify streaming is
correctly configured.

Thanks much for your comments.

-Wil
