Inside Infra: Greg Stein --Part III

Sally Khudairi Fri, 17 Jul 2020 07:46:25 -0700

[this interview is available online at https://s.apache.org/InsideInfra-Greg3 ]


The close of the "Inside Infra" interview with ASF Infrastructure Administrator 
Greg Stein, who shares his experience with Sally Khudairi, ASF VP Marketing & 
Publicity. 

["Apache is growing: we're just seeing the demand explode and it's a hard 
problem for us to solve."]

PART THREE.

- We were talking about ensuring that the team is up to speed with everything 
required of them...

So there certainly are skill gaps; this is one of the things I want to help 
motivate the team with, where if somebody says, "Hey, I want to go and 
investigate Ansible as a potential Puppet replacement," I say, "Go forward." 

This would be similar to Google having their 20% projects. I'm sure you've 
heard of that.

- Oh, yeah.

It's almost the same where it's not 20%, maybe 5%, but it's the same as Google, 
no matter what they want to tell you, because everybody's got their job and you 
have to be really rigorous to carve out 20% of your time. And strictly 
speaking, it does actually make your Google manager a little upset if you carve 
out the entire 20%. But anyways, the concept is similar.

So for us it's like, "Well, go in and investigate Ansible, see if it'll work 
for us and put your notes into the Wiki." That's how we make forward progress, 
up our game, and learn new skills. If someone says, "I want to go and figure 
this out," the response is almost always, "Okay. You go do it." There's 
certainly an allowance for people to learn new skills. But most of the time we 
simply rely on, say, Gavin (ASF Infrastructure team member Gavin McDonald), 
knowing more about JIRA configuration than the other guys.

- That added component of sharing what you know, and adding it to the JIRA or 
to the Wiki actually is great because then everyone's learning. This is like 
the rising tide: everybody's learning about this, whether they're doing it 
perfectly or not. I think this is a very interesting process.

Yes, and that's also where Andrew (technical writer Andrew Wetmore) is helping 
us out. He’s organizing that information that we have learned, that we have 
documented, that we memorialized into the Wiki.

- Because our (ASF's) legacy is quite Medusa-like over all these years, it's 
interesting to see how everyone can get caught up and also contribute...you 
have to go back and deal with the legacy, but you also have to be able to move 
forward. To be able to bring others with you is brilliant. That's really cool.

The infrastructure has grown organically over 25 years from when Brian 
Behlendorf first said, "Hey, I have this server called hyperreal.org: you can 
run a CVS repository on it for the Web server."

- That computer was under his desk at the Wired offices way back when, wasn’t 
it...

Yes it was. And it's just grown organically over those 25 years. Then we had 
Minotaur and it did six different things ... now it only does half of one and 
we've moved the stuff out onto newer machines and newer processes and this and 
that. But the organic growth means that we've got some really hairy stuff. Our 
move to Puppet --first Puppet 3, and now to Puppet 6-- at each step we're 
improving it and making it less hairy and more manageable and something that 
somebody can come along, look at, pick up and run with it from there. That 
makes it a lot easier, so that we don't have to spend 100% of our time cross 
training.

- What are your thoughts on products, the hype cycle, where everyone's 
demanding Kubernetes, to use that as an example. Do you decide which products 
to provide support for, or is that up to Apache projects in the communities? 
You mentioned Ansible, just not too long ago, that was your internal decision 
to move. But I remember not long ago, GitHub entered into the landscape. How 
did that happen? How did you decide to make a move like that? That's a 
significant thing. Can you tell me a little bit about that?

It's a lot based on community input. So if we see a lot of people asking for a 
particular tool, we'll be like, "Oh, hey, David, can you go and take a look at 
that and see if that's something…" Not David (ASF VP Infrastructure David 
Nalley), but Chris (Infrastructure team member Chris Lambertus) or somebody 
else. "Can you go take a look. Is that something that we can support? Because 
we're getting some queries about it."

And there's a little chicken and egg problem there that if the communities 
don't know to ask for the egg, we don't know whether to prep the chicken. It's 
like, “okay, wait, they don't even know to ask for a tool because we haven't 
said we will make this tool available, because we're not going to make the tool 
available until somebody asks”. But sometimes people file tickets like, "Can I 
get this set up?" and we'll go, "No."

Then six months later, somebody else will file a ticket: "Can I get this set 
up?" and we'll go, "No." But after enough of those, we're like, "Maybe that's 
something that we really want to do." For GitHub, specifically that’s what 
happened there. Well, even before that Git, where we ran our own Git server, 
that was a volunteer that made that happen. That was six years ago or so.

Well...the volunteer came along and said, "Well, I'll do this. I'm not going to 
take any time from Infra." There's been a couple things for the past few years 
where I've told people, "No, Infra will not work on that. But if you want to 
volunteer or find a volunteer, then we'll stand it up for testing." You know 
what I mean? Why not? So there's a couple things where people have stood up for 
test examples and there hasn't really been a lot of usage.

So, we're not going to support that. But something like Ansible is our own 
internal workflow and the tool we’ll experiment with, then to see if it'll 
improve our stuff. But from the community, they pretty much have to ask and it 
has to be a sustained ask. That's how we ended up with Travis CI: we actually 
pay for capacity in Travis CI, and that's based on community input.

So many people wanted to do their continuous integration through Travis that 
eventually we decided to pay for it. But it's tricky because some of these 
systems like Travis CI and others require certain permissions that we don't 
want to provide to the community. So we will want to hold those only within 
Infra. And so it gets hard to integrate certain tools. We've had to say no, but 
then again, we've found other ways to improve that so that we can lock down the 
permissions or use a proxy or other ways that we can route around some of these 
issues and then integrate the requested tool.

- So further to that, have you been in a situation where a project or a 
community has made unreasonable demands of Infra or have expectations, where 
it's like, so over the top or so out of scope, it totally surprised you? Have 
you had something like this?

Nothing surprises me.

- Nothing surprises you? Okay. Have you been in this situation? Like “was never 
going to happen”...

Yes, yes. There's been several times where one of the guys on the team is like, 
"Oh man, I got this ticket. I don't know what we want to do with this. Greg, go 
take a look." And I go and look at it and that's where I make that call: "Okay, 
is the Infra team going to take this on, or do I just say ‘no’ right now?"

So, yeah, there's been a number of times where I've said no and probably two or 
three times where I've gotten a little bit of pushback on that no. I say, "My 
answer is no, but here's how you escalate." I've had escalation a few times and 
I'm actually mid-process --I'm dealing with one right now. So, I've said, "no, 
if you don't like my no, you can go to VP Infra and VP Infra is probably going 
to tell you the same thing. And then you can go to the President. Right now 
those are actually the same person."

- The same person is a double "no".

That really is the true escalation path. I have to describe that to people and 
say, "I don't think you're going to get what you want." If I'm the one that 
says, no, you probably are not going to get it because VP Infra and President, 
and after that is the Board. They're probably not going to say, "Greg is wrong. 
Yes, we'll give that to you." But it's there. There's been a couple of times 
where I said "No, you have to ask the Board for the budget for those additional 
virtual machines." They went to the board and said, "Can we have budget for 
three machines?" and the Board said, "Yes."

So Infra went ahead and gave them the three VMs that they had initially 
requested. Strictly speaking, we would track those machines against their 
budget, but that detail is more than what the actual budget was. So we don't 
spend that time doing that, but I have had to say, no. I have had to... There 
was Apache Maven: they were keeping a copy of Maven Central, and Maven Central 
is run by Sonatype...

- Which is a commercial product...

Yes. They're using the trademark “Maven”, essentially a licensing agreement 
from us, a MOU. So with Maven Central, you could imagine if someone decides to 
just turn it off one day ...we wanted a copy. Apache Maven was making a copy of 
it, and it just started consuming so much disk space. We were like, "We can't 
support that growth rate. We can't support that even for the next six months. 
If you want to keep doing it, go ask the Board for money to keep doing it." 
They never did. We turned it off.

I wouldn't call that a ridiculous request --it was something where we didn't 
have to just say, "No, not going to do it. Bye." A lot of the requests are 
mostly just, "We aren't going to run that extra software. If you want: ask for 
a VM and you can run it, but we're not going to take responsibility for it."

- Over the years, obviously ASF Infra has changed. Was this all reactive or was 
it also proactive? Do you plan for those changes as you go or has it all been 
in response to Project X or in response to X emergency?

The growth of Infrastructure and its movement from volunteer-only to paid staff 
was part of just the growth of Apache. The volunteers could no longer keep up, 
and things like account creation used to take sometimes four weeks. You’d put 
in a request for an account and, four weeks later, it would 
finally get created.

- My gosh, that queue was crazy, huh?

Well, it wasn't even a long queue, it was simply that we didn't have volunteers 
making sure the queue stayed empty. Today it's down to one, two, maybe three 
days, and the account is created, because every day a staff member goes and 
creates the accounts first thing in the morning.

It was how I said that my day starts with looking at messages on Slack and then 
reading emails to see if there's stuff to handle. Well, one of the guys on 
staff, first thing he does in the morning is go and look at account creation. 
So he's been off and on pondering on a tool to make that easier for himself; he 
hasn't finished the tool, so he still has to do it manually. That's his 
incentive.

- “Work quickly”...

This is Chris Thistlethwaite. I say, "Chris, we can do something about that." 
And he says, "No, no, this is still my project. And every day when I run the 
script, it just makes me remember, I need to finish this."

So when the volunteers could not keep up with the amount of work, that's when 
we hired Joe Schaefer, then we hired another person, and hired another person. 
And so it was just trying to keep up with the rate of requests. 

That's how we ended up with hiring six people. And then I'm half a person, like 
I said, I'm part-time. So, it's just the growth of Apache. I think we're in 
much better shape than when I started. We're ahead of the curve. We can stay 
ahead of the curve because one of the things that I can do because I don't 
fight the fires every day ... that's for all the guys who know their stuff. 
They fight the fires and I can look at if I need to go and ask for another head 
count. And that's how we ended up with Andrew (technical writer Andrew 
Wetmore): “Well, you know, what we really need is somebody to manage all this 
documentation.” This was part of Sam's (former ASF President Sam Ruby), “If you 
had some money, what would you do with it?” That's how the technical 
writer/editor came around, because we've got 20 years of organic growth. We 
had...let's just call it “organic documentation”. That revamping project is 
going really well, I think.

- So, in what areas are you guys experiencing your biggest growth? As I was 
asking Chris and Drew, is there like a geographic influence on the demand? 
We’ve had a huge influx of users in China. Does any of that change the way or 
what you guys are doing? Or is it just more of everything?

Our biggest pain point, I would say, is continuous integration/continuous 
delivery: CI/CD. Jenkins, Travis, CircleCI, and things like this, where 
people make a change and they want that change built and tested. The more 
projects we get and the larger the communities get, the more changes and the 
more testing and the more building and the more this, more, more, more. It's 
kind of one of those things where it's “expand-to-fit”. So if we gave people 
100 machines, they'd use 100 machines. If we doubled it to 200, they'd use all 
200. It's just this rapacious need for CI machines. It's very hard to figure 
out how to plan around that other than just telling the communities, “No: we 
just don't have that much capacity: if you want to build it, do it on your own 
machine. You just can't use Apache hardware to do it.”

That's an unsatisfactory answer. That's been one of our hard problems and it's 
also kind of a newer problem: the development workflow that uses CI probably is 
just maybe five years old. Before that, certainly, automated building and 
testing was a thing, but it's really kind of grown into community workflow 
much, much more over the past five years, and more and more people are wanting 
to do it. The communities are growing. Apache is growing: we're just seeing the 
demand explode and it's a hard problem for us to solve.

China is the one case where we see regional issues, and that's because of the 
Great Firewall of China. Not because we're getting more Chinese developers, but 
because they have problems accessing our servers because they're located 
outside of China, and so we're looking at CDNs, content distribution networks, 
to essentially make our content available closer to China. We've found that 
even with one of those CDN drop points in Hong Kong, they still have problems 
just reaching it there in Hong Kong, and so ... and we don't want to buy or 
lease or rent a server in China because doing business in China is too high of 
a hurdle for the Foundation. 

- Oh? 

You know, Microsoft and Google have to do business in China and they've got a 
pack of lawyers and a giant vault of money to deal with all the barriers. The 
Foundation does not, so it's also a hard problem to solve. We think we might be 
able to do it through Microsoft Azure, that they have a CDN that resides in 
China that Microsoft has done all that paperwork, so we're looking at that, but 
as far as regional things, it's not so much that we run into issues. We see 
Open Source communities in Europe and Brazil and Australia and Sri Lanka: none 
of them really have any problems because they don't have that firewall. It's 
not really about the Chinese people, but about the China firewall. 

- That's bigger than us. And that’s not something we can fire hose.
 
We do see little engagement from Japan and Brazil, and that is partly for 
language reasons and partly because the Brazil community is more about Free 
Software than Open Source software. 

- Yeah. They're very pro-FOSS.

Not OSS. But pro-free. And so, they're going to deal with the Free Software 
Foundation rather than the Apache Software Foundation.

- I see. That’s an important distinction. 

And then you also have the Portuguese language barrier. People contributing 
from Europe and India, Sri Lanka, etc., they pretty much know English and 
that's fine. A lot of the Brazilian developers do not know English...this is 
the same with the Japanese Open Source developers. Japanese and Brazilian, they 
tend to not know English, and so that kind of isolates them from the larger 
Open Source world, or Free Software world, in the case of Brazil.

- Would we consider localizing anything that we do, or are we going to continue 
as-is, as the ASF is all English?

The Infrastructure team will not translate our documents to serve those other 
languages. That's just too high of a bar.

There are a couple groups that have user mailing lists that are not English and 
that's totally fine, and Infrastructure will... well, you don't have to file a 
ticket anymore. It's, again, back to selfserve.apache.org: “self-serve” on 
Apache will create a mailing list for users communicating in Brazilian 
Portuguese, for example, or communicating in Japanese. But Infra doesn't do 
anything about that, that's just the self-serve tools. We certainly can't 
support non-English, and I don't think that the Foundation itself is going to 
make any moves towards that.

- Fair enough. So a lot of companies are really struggling to accommodate their 
teams working from home in response to the Coronavirus and all that. These 
stay-at-home orders are kind of shaking companies, but from day one, the ASF 
has always been a virtual organization. Has anything changed with your 
operation on that front? Has anything impacted the ASF's day-to-day, from this 
pandemic?

(chuckling) Not at all. I shouldn't laugh, but no. It really hasn't changed. 
We've been on our team channel for all three years, three and a half years that 
I've been here, and the world is burning down around us, but we still sit on 
the team channel.

Now, that said, (Infra team member) Daniel Gruno got stranded in Canada.

- Right! He’s still there?

He's still doing work from Canada. This is why when he travels to Canada for 
two months at a time, I don't care, you know? Because if his butt is in a chair 
in Denmark or in a chair in Canada, it's the same butt, so, you know...

- As long as you have connectivity and a computer, you can do it. 

Right. But if he has to be offline for two months, I'd say no. Or if you want 
unpaid time off, well, I'm not going to pay you, of course. Certainly the 
discussions have changed, you know? I mean, going shopping. You know, some 
members are immuno-compromised and that had an effect on our team meeting that 
we were planning in Nashville: they were the first to say, "No way. I'm not 
going," so, there’s that, but our day to day hasn't changed.

- That's more of a social thing versus an operational thing. Safety first.

So the notion of, "Oh, I got to run out to the grocery store. I need to strap 
on a mask," changes, but not the operation.

- Right. Right. So...what do you think people would be surprised to know about 
ASF Infra?

I don't know if it'd be surprising, but we are global. We've got four people in 
the United States, one in Canada, one in Denmark, one used to be in Australia, 
but is now in the UK, which actually kind of hurt a little bit, because in 
Australia, that meant that we always had somebody in that time zone, but now we 
have kind of this gap of Australia/Asia time zones when...

- A “Gavin” gap.

Yeah, well, I might be awake at that time, but I can't go and fix a MySQL 
server, so it does mean that we don't have that straight-up 24-hour coverage.

The notion that we are worldwide is kind of a neat thing about our team, and is 
what makes us pretty unique relative to other IT departments. I don't like 
being called an IT department, but that is essentially what we are. 

- Surprise.

What's the name of that TV show? The one that's about IT...

- "The IT Crowd", is that what you're referring to? The British show?

Yeah. So, you know, that's a funny show, but mostly when you think “IT 
department”, you think of some corporate people with button-up shirts, but 
...most of us, we're in our pajamas.

- Good one. What's your favorite part of the job?

I definitely like the team and that's why, nominally I'm part-time, but I'm 
pretty much constantly on the team channel and interacting, and so I think I 
just put that down as volunteer hours, where before I might work on Apache 
Subversion, but now I hang out with the team or I write some little tool or 
something like that. That's definitely been one of the more rewarding changes. 
Up until I started with this, I'd been a director for 15-and-a-half years, and 
that was kind of how I contributed to Apache. Now my work for Infrastructure is 
a new way to contribute to the Foundation. I'm also part of a new community, 
where before I would hang out with the httpd community, APR community, the 
Subversion people ...now it's the Infra people and my hobby time is kind of 
blended in with my work time, and vice versa. I mean, when your work time can 
also be seen as a hobby time, that's pretty cool.

I do think it's the team that makes it interesting. That's what I like the 
most, and that I'm working with a new, interesting community to contribute to 
the Foundation. 

- Not only did you switch roles, you switched communities. What was your 
biggest challenge going into this new role?

I would say probably trying to delineate what I was going to handle for the guys 
and that I wasn't going to tell them what to do or how to do it. It's like, 
“OK, I'm here to assist, to unblock things, to enable you guys, rather than to 
block you or micromanage you.”

To earn that trust, that I wasn't going to be some pointy-haired boss telling 
them how to do their work. Now, I don't know if that was ever a problem for 
them, but that was certainly one of my initial concerns: how to properly create 
my role. This was the first time Apache's even had somebody fill in this role, 
so I also had to find the role, which is, again, why I came up with 
“Infrastructure Administrator”, because I wanted to define it as an enabler 
role, as an administrator, so they could get their work done but I would not be 
their manager. I would not be their boss: I was simply there to enable them.

- So, what are you most proud of in your infra career to date?

Ooh. I don't know. I would say by being hands-on, being the “hands” of Infra, 
it means that VP Infra didn't run away screaming.

David said in January 2016, maybe earlier, he was like, “No way. I'm out.” And 
after I was on the job for about two months, he said, “Huh. All right.”

- “I'm in!”

And so I get that feedback from him, “You know, you make the VP Infra hat quite 
easy for me.” I think that's probably what I really like about taking on the 
role, is that one of our volunteers got to stay rather than drop it because it 
was just causing so much anxiety and pain and time and frustration. Otherwise, 
most of the stuff I do is really boring. Not to me, but I don't have 
“accomplishments”. I push paperwork, basically, so the other guys can do 
accomplishments.

- Speaking of the other guys, how would your co-workers describe you?

I have no idea. I don't know. I really don't know. (laughing)

Where I just got done talking about what I saw as an issue, trying to frame 
what my role would be, it might have been fine with them and I was overly 
worried about it, but it’s hard for me to know. We don't do 360 reviews in 
Infra, so I don't get any feedback, really, from the team on what they think 
about myself or how I'm doing my job, so you'd have to ask them. 

- I have. Just kidding. So...what are the biggest “threats” that infrastructure 
managers or infrastructure administrators need to watch out for? What do you 
think is a “big thing” that people should be aware of, or is ASF so unique that 
you don’t feel like anyone really experiences what you experience?

There's our capacity issue with things like Travis, but I think you're asking a 
different question.

- I am, but that's fine. What's your greatest piece of advice? What would you 
tell aspiring infra administrators?

Actually, one of my greatest fears is really, as a small charitable foundation, 
it's hard for us to compete with well-funded corporations and some well-funded 
start-ups.

Related to that, I touched on it earlier, is career development ...you go into 
Google or Microsoft and there's a career ladder; we simply don't have a career 
ladder. There's salary growth. There's bonuses. If you want to have a resume or 
a LinkedIn profile that shows changes in growth and titles and career ladder, 
we can't offer that, and that's going to cut out some people. It's a very hard 
problem for me to solve. You know, there's things I can maybe do, but I also 
want to keep the team egalitarian and sort of level, rather than, “Oh, well, 
this guy is now the team lead.”

Given what I talked about, our social aspects, because we are all equal peers, 
keeping everybody with the same title, same position on the ladder means that 
we are peers and it's a little easier to interact that way. It's a real, real 
difficult problem. You ask what's scary: that's scary.

- But there's a counterpoint to that. You may not have a traditional career 
ladder path, but to say that you've worked in Infra for Apache carries weight. 
That's significant. 

I believe it does, especially when you can demonstrate the hundred different 
types of tasks...

- Well, that's exactly it. The breadth of work and the scale of what you guys 
do and the skill sets that you have to have and the fact that you have to play 
nice in the sandbox, all of it. The demand is immense, so to be able to be 
there and thrive and develop something from yourself in terms of a career is 
tremendous. Our team is exceptional. I mean, they're not expecting a linear 
ladder or something that others have.

You know, in other jobs, somebody might say, “I was a MySQL administrator.” 
Here, you're a MySQL administrator, PostgreSQL administrator… They had one 
role; here you've got dozens. 

- If you had a magic wand, what would you see happen with ASF infra?

I'd like to solve that CI problem. The other magic wand would be upgrading our 
mail server from 10-year-old technology to modern technology.

- Is that happening or is that literally a wish list issue?

It's happening, but it's been happening for three years. The thing is that 
email is so central to the Foundation that we can't really experiment with 
that. There are certain things we can do, but most of it, not so much, and so 
it means that we're being super-careful. There's about 10-12 different moving 
parts to it, and we're upgrading each of those a little bit by a little bit, 
until we can finally pull that big, scary, Young Frankenstein lever to hit the 
lightning bolt, you know?

- Yeah: I see the visual of that.

The magic wand would be to just make that all happen and make it work. Without 
the wand, it's going to take another 6-12 months.

- Right. What else do we need to know that I haven't asked? What should I be 
aware of or what should I be sharing?

Oh, I don't know. This is where my creativity ends. Ask me a coding question.

- Oh no coding questions. All right. Our time has also ended. Before we go, who 
should I be interviewing next? 

I would say Daniel (Gruno), because his role ... he's 20-30% system 
administration. The rest is tool development, so that makes his role rather 
unique in the team.

- Perfect. Thanks so much, Greg. I really appreciate it. 

= = =
Greg is based in Austin on UTC -5. His favorite thing to drink during the 
workday is a big 32oz cup of Diet Mountain Dew.

= = =

NOTE: you are receiving this message because you are subscribed to the 
announce@apache.org distribution list. To unsubscribe, send email from the 
recipient account to announce-unsubscr...@apache.org with the word 
"Unsubscribe" in the subject line.
