Hi Ido,

Ido Tamir wrote:
Hi,
We use qmake to parallelize the Illumina/Solexa pipeline. It's a make-based
system that operates on many files to generate some output.

However, often under load we get errors like:

error: commlib error: got select error (Connection reset by peer)
error: executing task of job 7980306 failed: failed sending task to 
[email protected]: can't find connection

Then we have to restart the pipeline.

I tried the make options -k (keep going) and -i (ignore), and it keeps working, 
but the result is broken.
-r is not available for qmake.

Is it possible to retry a certain number of times if this error comes up,
and only this error? Sometimes files are missing, etc., and then it should fail.
But this is simply a node not answering within a specified amount of time.
Is it possible to extend the timeout?

Setting gdi_timeout=<timeout> in the qmaster_params attribute of the global configuration (qconf -mconf) increases the receive timeout for requests made by qmake (via qrsh -inherit).
See also the sge_conf(5) man page, section on qmaster_params.
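
As a rough sketch, the line in the editor opened by qconf -mconf might then look like the following. The value 120 is only an illustrative assumption, not a recommendation, and any qmaster_params entries already present would have to be kept alongside it.

```
qmaster_params               gdi_timeout=120
```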

You can try whether it helps, but I have doubts.
From the error message "Connection reset by peer", I would guess that it really requires a retry.

You can set gdi_retries=n, where n > 0, in the qmaster_params attribute of the global configuration to make clients retry a failed request. Unfortunately, this affects all clients (qsub, qstat, ...) except the qrsh -inherit used by qmake.
I'll file an issue to make sure this gets fixed.
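
Until qmake itself retries, one possible stopgap is to wrap the qmake call in a small script that re-runs it only when the output contains the commlib messages quoted above, and fails immediately on anything else (missing files, etc.). The following is a minimal sketch under that assumption, not a tested Grid Engine recipe: run_with_retry, is_transient, TRANSIENT_MARKERS, max_tries and delay are hypothetical names, and the qmake arguments in the usage comment are placeholders.

```python
import subprocess
import time

# Substrings taken from the error output quoted in this thread.
# Which messages count as "transient" is an assumption; adjust as needed.
TRANSIENT_MARKERS = (
    "commlib error: got select error",
    "can't find connection",
)


def is_transient(output_text):
    """Return True if the output contains one of the known commlib messages."""
    return any(marker in output_text for marker in TRANSIENT_MARKERS)


def run_with_retry(cmd, max_tries=3, delay=30.0,
                   runner=subprocess.run, sleep=time.sleep):
    """Run cmd; retry only if it fails with a transient commlib error.

    Returns the result of the last attempt. Output is captured rather
    than streamed, so nothing is printed while the command is running.
    """
    if max_tries < 1:
        raise ValueError("max_tries must be at least 1")
    for attempt in range(1, max_tries + 1):
        result = runner(cmd, capture_output=True, text=True)
        if result.returncode == 0:
            return result
        combined = (result.stdout or "") + (result.stderr or "")
        # Any other failure, or the last allowed attempt, is returned as-is.
        if not is_transient(combined) or attempt == max_tries:
            return result
        sleep(delay)


# Hypothetical usage; the qmake arguments are placeholders:
# result = run_with_retry(["qmake", "-cwd", "-v", "PATH", "--", "all"])
# raise SystemExit(result.returncode)
```

Re-running the whole command relies on make skipping targets that are already up to date, which in turn assumes the failed rule did not leave a half-written target behind.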

Best regards,

    Joachim
Thank you very much for your answers,
ido

_______________________________________________
users mailing list
[email protected]
https://gridengine.org/mailman/listinfo/users
