I'd like to hear what direction folks want to go in, and from there look at the options. For some of these options (like running our own public registry), Apache infra may be able to help, and it's something we should look at, but I don't assume they have time to work on this type of issue.
S

On Tue, Apr 4, 2017 at 10:00 AM Lukasz Cwik <lc...@google.com.invalid> wrote:

> Is this something that Apache infra could help us with?
>
> On Mon, Apr 3, 2017 at 7:22 PM, Stephen Sisk <s...@google.com.invalid>
> wrote:
>
> > Summary:
> >
> > For IO ITs that use data stores that need custom docker images in order
> > to run, we can't currently use them in a kubernetes cluster (which is
> > where we host our data stores.) I have a couple options for how to solve
> > this and am looking for feedback from folks involved in creating IO
> > ITs/opinions on kubernetes.
> >
> >
> > Details:
> >
> > We've discussed in the past that we'll want to allow developers to submit
> > just a dockerfile, and then we'll use that when creating the data store
> > on kubernetes. This is the case for ElasticsearchIO and I assume more
> > data stores in the future will want to do this. It's also looking like
> > it'll be necessary to use custom docker images for the
> > HadoopInputFormatIO's cassandra ITs - to run a cassandra cluster, there
> > doesn't seem to be a good image you can use out of the box.
> >
> > In either case, in order to retrieve a docker image, kubernetes needs a
> > container registry - it will read the docker images from there. A simple
> > private container registry doesn't work because kubernetes config files
> > are static - this means that if local devs try to use the kubernetes
> > files, they point at the private container registry and they wouldn't be
> > able to retrieve the images since they don't have access. They'd have to
> > manually edit the files, which in theory is an option, but I don't
> > consider that to be acceptable since it feels pretty unfriendly (it is
> > simple, so if we really don't like the below options we can revisit it.)
> >
> > Quick summary of the options
> >
> > =======================
> >
> > We can:
> >
> > * Start using something like k8 helm - this adds more dependencies, adds
> > a small amount of complexity (this is my recommendation, but only by a
> > little)
> >
> > * Start pushing images to docker hub - this means they'll be publicly
> > visible and raises the bar for maintenance of those images
> >
> > * Host our own public container registry - this means running our own
> > public service with costs, etc..
> >
> > Below are detailed discussions of these options. You can skip to the "My
> > thoughts on this" section if you're not interested in the details.
> >
> >
> > 1. Templated kubernetes images
> >
> > =========================
> >
> > Kubernetes (k8) does not currently have built in support for
> > parameterizing scripts - there's an issue open for this[1], but it
> > doesn't seem to be very active.
> >
> > There are tools like Kubernetes helm that allow users to specify
> > parameters when running their kubernetes scripts. They also enable a lot
> > more (they're probably closer to a package manager like apt-get) - see
> > this description[3] for an overview.
> >
> > I'm open to other options besides helm, but it seems to be the officially
> > supported one.
> >
> > How the world would look using helm:
> >
> > * When developing an IO IT, someone (either the developer or one of us),
> > would need to create a chart (the name for the helm script) - it's
> > basically another set of config files but in theory is as simple as a
> > couple metadata files plus a templatized version of a regular k8 script.
> > This should be trivial compared to the task of creating a k8 script.
> >
> > * When creating an instance of a data store, the developer (or the beam
> > CI server) would first build the docker image for the data store and push
> > to their container registry, then run a command like `helm install -f
> > mydb.yaml --set imageRepo=1.2.3.4`
> >
> > * When done running tests/developing/etc… the developer/beam CI server
> > would run `helm delete -f mydb.yaml`
> >
> > Upsides:
> >
> > * Something like helm is pretty interesting - we talked about it as an
> > upside and something we wanted to do when we talked about using
> > kubernetes
> >
> > * We pick up a set of working kubernetes scripts this way. The full list
> > is at [2], but some ones that stood out: mongodb, memcached, mysql,
> > postgres, redis, elasticsearch (incubating), kafka (incubating),
> > zookeeper (incubating) - this could speed development
> >
> > Downsides:
> >
> > * Adds an additional dependency to run our ITs (helm or another k8
> > templating tool)
> >
> > * Requires people to build their own images and run a container registry
> > if they don't already have one (it will not surprise you that there's a
> > docker image for running the registry [0] - so it's not crazy. :) I
> > *think* this will probably just be a simple one/two line command once we
> > have it scripted.
> >
> > * Helm in particular is kind of heavyweight for what we really need - it
> > requires running a service in the k8 cluster and adds additional
> > complexity.
> >
> > * Adds to the complexity of creating a new kubernetes script. Until I've
> > tried it, I can't really speak to the complexity, but taking a look at
> > the instructions [4], it doesn't seem too bad.
> >
> >
> > 2. Push images to docker hub
> >
> > =======================
> >
> > This requires that users push images that we want to use to docker hub,
> > and then our IO ITs will rely on that.
> > I think the developer of the dockerfile should be responsible for the
> > image - having the beam project responsible for a publicly available
> > artifact (like the docker images) outside of our core deliverables
> > doesn't seem like the right move.
> >
> > We would still retain a copy of the source dockerfiles and could
> > regenerate the images at any time, so I'm not concerned about a scenario
> > where docker hub went away (it would be pretty simple to switch to
> > another repo - just change some config files.)
> >
> > For someone running the k8 scripts (ie, running the IO ITs), this is
> > pretty easy - they just run the k8 script like they do today.
> >
> > For someone creating the k8 scripts (ie, creating the IO ITs), this is
> > more complex - either they or we have to push this to docker hub and make
> > sure it's up to date, etc..
> >
> >
> > Upsides:
> >
> > * No additional complexity for IO IT runners.
> >
> > Downsides:
> >
> > * Higher bar for creating the image in the first place - someone has to
> > maintain the publicly available docker hub image.
> >
> > * It seems weird to have a custom docker image up on docker hub - maybe
> > that's common, but if we need specific changes to images for our needs,
> > I'd prefer it be private.
> >
> >
> > 3. Run our own *public* container registry
> >
> > ==============================================
> >
> > We would run a beam-specific container registry service - it would be
> > used by the apache beam CI servers, but it would also be available for
> > use by anyone running beam IO ITs on their local dev setup.
> >
> > From an IO IT creator's perspective, this would look pretty similar to
> > how things are now - they just check in a dockerfile. For someone running
> > the k8 scripts, they similarly don't need to think about it.
> >
> > Upsides:
> >
> > * We're not adding any additional complexity for the end developer
> >
> > Downsides:
> >
> > * Have to keep docker registry software up to date
> >
> > * The service is a single point of failure for any beam devs running IO
> > ITs
> >
> > * It can incur costs, etc… As an open source project, it doesn't seem
> > great for us to be running a public service.
> >
> >
> > My thoughts on this
> >
> > ===============
> >
> > In spite of the additional complexity, I think using k8 helm is probably
> > the best option. The general goal behind the IO ITs has been to keep
> > ourselves self-contained: avoid having centralized infrastructure for
> > those running the ITs. Helm is a good match for those criteria. I will
> > admit that I find the additional dependencies/complexity to be worrisome.
> > However, I really like the idea of picking up additional data store
> > configs for free - if we were doing this in 5 years, we'd say "we should
> > just use the ecosystem of helm charts" and go from there.
> >
> > I do think that pushing images to docker hub is a viable option, and if
> > the community is more excited to do that/wants to push the images there,
> > I'd support it. I can see how folks would be hesitant. I would like for
> > the developer of the docker file to do
> >
> > Of the 3 options, I would strongly push back against running a public
> > container registry - I would not want to administer it, and I don't think
> > we as a project want to be paying for the costs associated with it.
> >
> > Next steps
> >
> > =========
> >
> > Let me know what you think! This is definitely a topic where
> > understanding what the community of IO devs wants is helpful. As we
> > discuss, I'll probably spend a little time exploring helm since I want to
> > play around with it and understand if there are other drawbacks. I ran
> > into this question while working on getting the HIFIO cassandra cluster
> > running, so I might prototype with that.
> >
> > I'll create a JIRA for this in the next day or so.
> >
> > Stephen
> >
> >
> > [0] docker registry container - https://hub.docker.com/_/registry/
> >
> > [1] kubernetes issue open for supporting templates -
> > https://github.com/kubernetes/kubernetes/issues/23896
> >
> > [2] set of available charts - https://github.com/kubernetes/charts
> >
> > [3] kubernetes helm introduction -
> > https://deis.com/blog/2015/introducing-helm-for-kubernetes/
> >
> > [4] kubernetes charts instructions -
> > https://github.com/kubernetes/helm/blob/master/docs/charts.md
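
For anyone who wants a concrete picture of the templating idea in option 1, here is a rough sketch in plain Python. It is not helm and does not use helm's chart format; it only shows why a parameterized config file avoids hard-coding one registry address. The manifest contents, the `mydb` pod name, and the `render_manifest` helper are made up for illustration, and `imageRepo` is simply borrowed from the example command quoted in the thread.

```python
from string import Template

# Hypothetical pod manifest. The "mydb" name and the "imageRepo"
# placeholder are illustrative assumptions, not taken from a real chart.
MANIFEST_TEMPLATE = Template("""\
apiVersion: v1
kind: Pod
metadata:
  name: mydb
spec:
  containers:
  - name: mydb
    image: ${imageRepo}/mydb:latest
""")


def render_manifest(image_repo: str) -> str:
    """Return the manifest text with the registry address filled in."""
    return MANIFEST_TEMPLATE.substitute(imageRepo=image_repo)


if __name__ == "__main__":
    # Each developer (or the CI server) passes their own registry address.
    print(render_manifest("1.2.3.4"))
```

With something like this, the checked-in file stays the same for everyone and only the registry address changes per environment, which is the property a static kubernetes config file lacks.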