We run into this problem continuously. The solution we came up with is to cache the data that is going to be used. Typically there are X cores per host, so the same data is read X times (more if the process is repeated). Each process therefore checks whether the file has already been cached; if not, it tries to lock the file (so multiple copies are avoided). Once it holds the lock, it copies the file to the host.
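Roughly, the pattern looks like this (a simplified, untested sketch: the paths are examples and the lock here is an atomic mkdir, which is not necessarily what the real script uses):

#!/bin/bash
# Sketch of the check-lock-copy pattern described above; paths and
# the locking mechanism are examples only.
SRC="$1"                               # file on the shared NFS export
CACHE_DIR="/scratch/cache"             # local disk on the execution host
CACHED="$CACHE_DIR/$(basename "$SRC")"
LOCK="$CACHED.lock"

mkdir -p "$CACHE_DIR"

if [ ! -f "$CACHED" ]; then
    # mkdir either creates the directory or fails, atomically, so
    # exactly one process wins the lock and performs the copy; the
    # others just wait until it is done.
    if mkdir "$LOCK" 2>/dev/null; then
        cp "$SRC" "$CACHED.tmp" && mv "$CACHED.tmp" "$CACHED"
        rmdir "$LOCK"
    else
        while [ -d "$LOCK" ]; do sleep 1; done
    fi
fi

# Every process on this host now reads the local copy.
echo "$CACHED"

Copying to a temporary name and renaming at the end means that any process which finds the cached file present can trust that it is complete.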
If you want to take a look at the script:
http://bazaar.launchpad.net/~translectures/tlk/trunk/view/head:/scripts/tLtask-train/scripts/get_cache.sh

The problem is that this script is quite system-dependent and that the first processes are going to take more time, as they have to cache the data.

Excerpts from Arnau Bria's message of 2014-03-26 09:30:25 +0100:
> On Tue, 25 Mar 2014 16:53:43 +0100 Reuti wrote:
>
> > Hi,
>
> Hi Reuti,
>
> > On 25.03.2014 at 15:37, Arnau Bria wrote:
> >
> > > I've been looking for a parameter that limits the amount of jobs to
> > > be started in each schedule interval, but I did not find it (man
> > > sge_sched_conf).
> > >
> > > Is there any way to limit that?
> >
> > IIRC there was a similar question on the list before. The solution
> > was to put a random sleep in the queue prolog to avoid overloading
> > of any NFS server from where the jobs will read data (in case that's
> > your goal).
>
> Yes, that's my goal.
> I'll have to study that solution. I don't know if we can add extra
> walltime to users' jobs, as they have to pay per walltime used...
>
> Thanks for your answer,
>
> > -- Reuti
>
> Cheers,
> Arnau

--
NiCo
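PS: the random-sleep prolog Reuti mentions above could be as simple as this (an untested sketch; MAX_DELAY and the idea of a standalone prolog script are just examples, not a tested SGE configuration):

#!/bin/bash
# Hypothetical SGE queue prolog: delay each job by a random number
# of seconds so that simultaneously scheduled jobs do not all hit
# the NFS server at the same moment. MAX_DELAY is a made-up knob.
MAX_DELAY=60
sleep $((RANDOM % MAX_DELAY))
exit 0   # so SGE goes on to start the job normally

Note Arnau's caveat above, though: the sleep may end up being billed as part of the job's walltime.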
