[Wikitech-l] TechOps update

Victoria Coleman Wed, 10 Jan 2018 20:55:31 -0800


Hi everyone.


We are making some exciting changes in TechOps!

The Technical Operations team in the Technology department is possibly the 
oldest team in the organization. Originating from a group of volunteers (Mark 
being one of them) that enjoyed building and maintaining this up-and-coming, 
soon to become global top-10 web site as a hobby, the team has always focused 
on the challenge of keeping Wikimedia’s sites, services, and infrastructure 
working as well as possible. They did this at first on what can only be 
described as a shoestring budget, and still with modest resources today (more 
on this later).

Over time the team has grown to a professional staff of currently 18, with a 
pretty flat structure. Apart from two sub teams (Traffic and Data Center Ops) 
that do have a clearly defined scope, most of the team’s members, as well as 
the majority of TechOps’s responsibilities, still reside in the “Core Ops” 
sub team.

To strengthen the team as it continues to grow in responsibilities and 
membership we’ve decided to make some changes to the team’s structure, its 
leadership and its public profile.

Starting with the latter, we've decided to rename the team from Technical 
Operations to Site Reliability Engineering (SRE). SRE is a relatively modern 
term that more accurately describes the type of work the Technical Operations 
team has been doing, to some extent, for the past few years, as well as the 
direction in which it needs to grow. Coined by Ben Treynor of Google, it’s now widely 
used across the industry. SRE describes a discipline where the emphasis is on 
the software engineering aspects of the work, with a focus on tools development 
and automation rather than human labor. Our hope is that this name change will 
more accurately represent the work and will help with recruiting into the team.

Second, we will increase the team’s management capacity. As the 
responsibilities and management/coordination/planning needs of the team kept 
growing, Faidon has stepped up and increased his involvement significantly. For 
example he covered for Mark during his paternity leave, and he has played a key 
leadership role in our efforts in the lawsuit against the NSA. In my time at 
the Foundation, I have come to rely on Faidon’s judgement, his ability to 
execute, and most of all on his leadership. So in recognition of Faidon’s 
important leadership role and responsibilities in the team, he is promoted to 
Director of Site Reliability Engineering. Well done Faidon!

Mark and Faidon both will now be “Director of Site Reliability Engineering”, 
reporting to me. They will share some of the responsibilities of the team, such 
as its roadmap, and CapEx and OpEx planning and execution, as they have been 
doing for some time now. Each will lead one of two new sub-groups, “Service 
Operations” led by Mark, and “Infrastructure Foundations”, led by Faidon. The 
team will continue to operate as a single group responsible for the 
organization’s broader Site Reliability Engineering function, with both Mark 
and Faidon as leaders of the respective groups.

I also want to offer a few words about Mark. Mark exemplifies our values and we 
wouldn’t be the same without him. From driving servers around in the trunk 
of his car in the earliest days of the projects to building and running an 
exemplary team that has consistently delivered 99.98% uptime for the world’s 
fifth-most popular website, his work has been nothing short of heroic. He has 
done this with a team of 18 people, which many in our industry find 
incomprehensible. 

Both Katherine and our Board have recognized that delivering this level of 
performance with our radically efficient team is not sustainable as we continue 
to grow and make steps towards our strategic directions of knowledge equity and 
knowledge as a service. Katherine has asked, and the Board has unanimously 
recommended, that we step up our investment in the team. I am thrilled at 
their support, which will enable our SRE team to have access to additional 
resources within the current fiscal year. 

Last but not least, and in an effort to return to his earlier days in the 
projects (and, in his words, an attempt to gain back some respect from his 
technical colleagues :-), Mark will dedicate two days a week to individual 
technical contributions in addition to his managerial work. Mark, thank you for 
your remarkable contributions!

Finally, I wanted to share more detail on our new sub team structure and scope. 

Data Center Operations
The existing Data Center Operations sub team continues as-is but will now be 
managed by Faidon. The team, consisting of Rob, Chris, and Papaul, is 
responsible for all of Wikimedia’s data center deployments and logistics as 
well as maintaining our presence in 8 locations across the world. They perform 
on-site work and maintain the full 5-year life cycle (specs, purchasing, 
physical install, break/fix and decommissioning) for all hardware.

Infrastructure Foundations
This new sub team will focus on building and maintaining our base platform 
(“metal cloud”) that forms the foundation upon which nearly everything else in 
our infrastructure builds. On top of our bare metal deployments, their 
responsibilities include (but are not limited to) configuration management 
systems, infrastructure automation, orchestration tooling, logging, metrics and 
monitoring as well as infrastructure security. This team consists of Riccardo, 
Filippo, Keith and Moritz, who will report to Faidon.

Traffic
The current Traffic sub team remains unchanged in membership, scope, and 
management. They are responsible for the critical first layer of high-traffic 
infrastructure which now spans much of the globe, including our TLS termination 
and caching layers, load balancing, DNS and our own network. The members of 
this team are Brandon, Emanuele and Arzhel as well as Valentin Gutierrez, our 
newly hired Traffic Security Engineer who will be starting on February 12th. 
They report to the team’s technical lead and manager, Brandon, who in turn will 
continue to report to Mark.

Data Persistence
The new Data Persistence sub team will focus on Wikimedia’s persistent data 
storage and retrieval systems, including (No)SQL databases, (distributed) 
object storage, file storage and backup systems. Today, this team will start 
with just our two database administrators, Jaime and Manuel, but the 
expectation is that this team will be built out in the near future with 
additional hands and expertise. They will report to Mark.

Service Operations
Finally, the Service Operations sub team will take care of public and 
“user-visible” services alongside Technology and Audiences teams. This 
includes, for example, our big MediaWiki platform, but also the newer 
(micro)services that comprise our stack. It also includes miscellaneous 
services and components that we rely upon (think Phabricator, mail systems, 
OTRS, etc…). The team will continue building our new SOA service infrastructure 
based on Kubernetes. Its membership will consist of Alexandros, Giuseppe, Ariel 
and Daniel, reporting to Mark. 

Please welcome our new SRE team!



Victoria (with a lot of help from Mark, Faidon and the SRE team)

_______________________________________________
Wikitech-l mailing list
[email protected]
https://lists.wikimedia.org/mailman/listinfo/wikitech-l