Hi S(o)GE users,

I need some advice :-)

During my Ph.D., I discovered Sun Grid Engine and used it to run 
distributed machine learning jobs on a (then) medium-sized cluster (96 CPUs). I 
liked it. Now, a couple of years later, I am again looking for a scheduling and 
resource allocation system like SGE for a similar purpose. Unfortunately, SGE 
seems to be pretty dead. In addition, I have similar but not identical needs 
stemming from continuous integration and from running (micro-)web services. 
Ideally, I would like a simple, integrated solution and not a complex monster 
built from many large parts.

Here's what I'm trying to accomplish:

- Run custom jobs for machine learning / data analysis. When I have an idea, I 
write a job and run it. Usually, the same job is run only a few times. Jobs 
will span multiple hosts and might require OpenMP + MPI. This is where SGE was 
really good in the past. The crowd seems to have shifted to running everything 
on Hadoop, although that setup would be really inefficient for my purposes. I 
usually need only a small number of CPUs (< 100).

- Run frequent identical jobs for continuous integration. We have Jenkins 
running, but it is lacking in some regards. Resource allocation and scheduling 
are more or less non-existent. For example, I cannot define resources for things 
like attached mobile devices, which can be used by only one job at a time on a 
multi-core Mac. These are things already solved with SGE, but SGE itself 
does not cover the main aspects of CI, i.e. the collection and analysis of the 
build data.

- Run (micro-)services. We have a couple of services that need to run 
continuously. Some need to be scaled up and down in the number of 
parallel instances. This is where people are now using Docker and (also quite 
complex) resource allocation and scheduling systems like Kubernetes.

All three sorts of tasks compete for the same resources and suffer from the 
same problem of provisioning/configuring the workers to fulfill a job's 
requirements. We're using Vagrant + Ansible to provision VMs for our machine 
learning tasks and I would like to extend this to the other problems as well. 
The resource allocation is still somewhat manual in our case. I would really 
like to cut down the complexity of our setup.

It would be great if you could point me to any helpful information, ideas, or 
projects that could help me solve this.

Best,
Mark
