Re: very imbalanced host allocation when using sshloginfile

2018-07-12 Thread Ole Tange
On Wed, Jul 11, 2018 at 3:45 AM, Daniel LaFlamme wrote:
> Hit send prematurely.
>
> Later in the job run, I did start seeing messages like the following:
>
> "mux_client_request_session: session request failed: Session open refused by
> peer"
>
> All machines had the same low load during parallel's run. I have spot
> checked sshd_config on a couple of the servers and they are the same. They
> are also managed by puppet so they should all have the same configuration.
>
> My question is: what can I do to make what is going on under the hood more
> transparent so I can debug the problem further?

-t will show what GNU Parallel actually runs the moment it is being run.

So you will be getting lines like:

ssh -S /tmp/control_path_dir-dTad/ssh-%r@%h:%p server -- exec perl -e
@GNU_Parallel\\\=split/_/,\\\"use_IPC::Open3\\\;_use_MIME::Base64\\\"\\\;eval\\\"@GNU_Parallel\\\"\\\;\\\$chld\\\=\\\$SIG\\\{CHLD\\\}\\\;\\\$SIG\\\{CHLD\\\}\\\=\\\"IGNORE\\\"\\\;my\\\$zip\\\=\\\(grep\\\{-x\\\$_\\\}\\\"/usr/local/bin/bzip2\\\"\\\)\\\[0\\\]\\\|\\\|\\\"bzip2\\\"\\\;open3\\\(\\\$in,\\\$out,\\\"\\\>\\\&STDERR\\\",\\\$zip,\\\"-dc\\\"\\\)\\\;if\\\(my\\\$perlpid\\\=fork\\\)\\\{close\\\$in\\\;\\\$eval\\\=join\\\"\\\",\\\<\\\$out\\\>\\\;close\\\$out\\\;\\\}else\\\{close\\\$out\\\;print\\\$in\\\(decode_base64\\\(join\\\"\\\",@ARGV\\\)\\\)\\\;close\\\$in\\\;exit\\\;\\\}wait\\\;\\\$SIG\\\{CHLD\\\}\\\=\\\$chld\\\;eval\\\$eval\\\;
QlpoOTFBWSZTWVeMEdsAABWfgHVu538ev//f/jABa1BsRJonknlDIaaNqNNA9RoAABoDQANCKnkwmmmkaABo0ABoNAAADIAJRETyIZTEwJk0aGmhkDQGIyGhobREc4dRYSEyZfhTy06+rZhU0MqRGZKf8Ax8DK579pAO0n455KoEgINRBiGuhRpZlhO3wPCiBlr6zK6GXeeLInJeWvFYUY4ISvgw6W5WsaU2Wr1q6QFB4rRqOcWIJy707Ay5C7FiIrLuy/IkSVHTPaON8Rm8dc4hA4KINl8gOkchptIKBW6VBgMJRomCqAtu2esksMlwL6BeDvOXJlyqESHrA8VmTNG0e8PAMyDM4xDRKSGnQQvEbBzUmk+Q9kI72hZQbNMcWvPqJCb2z32LrFGMRU4XjkZfhHdHa9V2DoDnTDMUYBsy77issbk6YjbTiCxK8gCTG/H+2oSpqb5rSr1kWpSlGtHmiVIJrZCqSpbGJT200f8MKHBMpRjrrcaSVVnmx24XapGhXnjCIHmAekOqOTFwIJY7QOBdssV1wGMYm2NoNJUbuLCMzUBJ34szrbLEUC+yxP4u5IpwoSCvGCO2;

Given that the problem appears to be related to ssh, you can try
running a few commands like:

  ssh -S /tmp/control_path_dir-dTad/ssh-%r@%h:%p server sleep 10

by hand.
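
To repeat that check for every host, a loop along the following lines may help. It is only a sketch. It assumes the sshloginfile holds one plain hostname per line, and the control path is the one from the -t output above, which will differ on each run. The function name print_mux_checks is made up for illustration. `ssh -O check` asks an existing master connection whether it is still alive.

```shell
# Sketch: print one "ssh -O check" and one test command per host, so they can
# be reviewed and then run by hand.
# Assumptions: one plain hostname per line in the sshloginfile (no
# "8/user@host" style entries); the control path is copied from the -t output
# and changes on every run.
print_mux_checks() {
  hostfile=$1
  control_path=$2
  # Skip blank lines and comment lines.
  grep -v -e '^[[:space:]]*$' -e '^[[:space:]]*#' "$hostfile" |
  while read -r host; do
    # -O check asks the running master process whether it is alive.
    printf 'ssh -S %s -O check %s\n' "$control_path" "$host"
    # Same test as above: open one more session through the master.
    printf 'ssh -S %s %s sleep 10\n' "$control_path" "$host"
  done
}
```

For example, `print_mux_checks ~/etc/parallel/hosts '/tmp/control_path_dir-dTad/ssh-%r@%h:%p'` prints two commands per host. If sessions opened this way start being refused once several are open on the same master connection, sshd's MaxSessions setting (default 10 in OpenSSH) may be worth comparing across the servers.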

Since there _are_ jobs on all your hosts, I reckon it is a
problem in your local setup. If you find a solution, please post it.
Others may have the same issue.


/Ole



very imbalanced host allocation when using sshloginfile

2018-07-11 Thread Daniel LaFlamme
I have an sshloginfile with 12 hostnames in it. All machines are identical. I
used parallel like this:

$ parallel --controlmaster --sshdelay 0.3 --joblog joblog.txt \
    --sshloginfile ~/etc/parallel/hosts \
    'echo {} | /shared/nfs/process-file' filelist.txt | tee -a result.txt

/shared/nfs/process-file is a shell script and is accessible on all hosts.
filelist.txt is a file with 6247 paths to .gz files in it. These .gz files are
processed by the process-file script.

While the parallel was running I noticed that all of the jobs were being
allocated to one of the hosts in the hostfile, host05. In fact, the first time
a host other than host05 was allocated a job was on sequence number 622.

The final distribution of how many jobs ran on each of the 12 hosts is:

$ cat joblog.txt | awk '{ print $2 }' | sort | uniq -c
      2 host01
    538 host02
      1 host03
    323 host04
   4696 host05
      1 host07
      2 host08
      2 host09
    493 host10
      3 host11
      3 host12
      1 Host

As you can see, the distribution is very imbalanced. 
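
The same counts can be produced without the header row, and with each host's share of the jobs, using a short awk script. This is a sketch that assumes the default tab-separated --joblog layout (Seq, Host, Starttime, JobRuntime, Send, Receive, Exitval, Signal, Command). The function name host_share is only illustrative.

```shell
# Sketch: per-host job count, share of all jobs, and number of failed jobs.
# Assumes the default tab-separated --joblog columns:
# Seq Host Starttime JobRuntime Send Receive Exitval Signal Command
host_share() {
  awk -F '\t' '
    NR == 1 { next }                 # skip the header row ("Seq Host ...")
    { jobs[$2]++; total++ }          # column 2 is Host
    $7 != 0 { failed[$2]++ }         # column 7 is Exitval
    END {
      # Output columns: jobs, share of all jobs, failed jobs, host
      for (h in jobs)
        printf "%6d %5.1f%% %6d %s\n", jobs[h], 100 * jobs[h] / total, failed[h] + 0, h
    }
  ' "$1" | sort -rn
}
```

Running `host_share joblog.txt` lists the busiest host first. A non-zero value in the third column shows where jobs exited with an error, which may or may not line up with the hosts that received few jobs.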
Later in the job run, I did start seeing messages like the following: