Hi S(o)GE users, I need some advice :-)
During my Ph.D., I discovered Sun Grid Engine and used it to run distributed machine learning jobs on a (then) medium-sized cluster (96 CPUs). I liked it. Now, a couple of years later, I am again looking for a scheduling and resource allocation system like SGE for a similar purpose. Unfortunately, SGE seems to be pretty dead. In addition, I have similar but not identical needs stemming from continuous integration and from running (micro-)web services. Ideally, I would like a simple, integrated solution and not a complex monster built from many large parts.

Here's what I'm trying to accomplish:

- Run custom jobs for machine learning / data analysis. When I have an idea, I write a job and run it. Usually, the same job is only run a few times. Jobs will span multiple hosts and might require OpenMP + MPI. This is where SGE was really good in the past. The crowd seems to have shifted to running everything on Hadoop, although that setup would be really ineffective for my purposes. I usually just need a couple of CPUs (< 100).

- Run frequent identical jobs for continuous integration. We have a Jenkins instance running, but it is lacking in some regards. Resource allocation and scheduling are more or less non-existent. For example, I cannot define resources for things like mobile devices attached to a multi-core Mac that can be used by only one job at a time. These things are already solved in SGE, but SGE itself does not cover the main aspects of CI, i.e. the collection and analysis of the build data.

- Run (micro-)services. We have a couple of services that need to run continuously. Some need to be scaled up and down in the number of parallel instances. This is where people are now using Docker and (also quite complex) resource allocation and scheduling systems like Kubernetes.

All three sorts of tasks compete for the same resources and share the same problem of provisioning/configuring the workers to fulfill a job's requirements. We're using Vagrant + Ansible to provision VMs for our machine learning tasks, and I would like to extend this to the other problems as well. The resource allocation is still somewhat manual in our case.

I would really like to cut down the complexity of our setup. It would be great if you could point me to any helpful information, ideas, or projects that could help me solve this.

Best,
Mark
_______________________________________________
users mailing list
users@gridengine.org
https://gridengine.org/mailman/listinfo/users