Site reliability engineering

Site reliability engineering (SRE) is a discipline that incorporates aspects of software engineering and applies them to IT operations problems. Its main goal is to create ultra-scalable and highly reliable software systems. According to Ben Treynor, founder of Google's Site Reliability Team, SRE is "what happens when a software engineer is tasked with what used to be called operations."

History

Site reliability engineering was created at Google around 2003, when Ben Treynor was hired to lead a team of seven software engineers running a production environment. The team was tasked with making Google's sites run smoothly, efficiently, and more reliably. Early on, the scale of Google's systems forced the company to develop new paradigms for managing them while continuously introducing new features and maintaining a very high-quality end-user experience. Google's SRE organization now numbers more than 1,500 engineers. Many products are supported by small to medium-sized SRE teams, though far from all products have SREs. The SRE processes honed over the years are being adopted by other, mainly large-scale, companies. Microsoft, Apple, Twitter, Facebook, Dropbox, Amazon, IBM, Xero, Oracle, Zalando and GitHub have all put together SRE teams.

Roles

A site reliability engineer (SRE) spends up to 50% of their time on "ops"-related work such as handling issues, on-call duty, and manual intervention. Because the software system an SRE oversees is expected to be highly automated and self-healing, the SRE should spend the other 50% of their time on development tasks such as new features, scaling, or automation. The ideal SRE candidate is a programmer who also has operational, systems, or networking knowledge and likes to whittle down complex tasks.
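
The 50% split described above can be checked with simple arithmetic. The following is a minimal sketch, assuming time is tracked as weekly hours in two buckets; the `WeekLog` record, the function names, and the sample hours are hypothetical and chosen only for illustration, not taken from any Google tooling.

```python
from dataclasses import dataclass

OPS_CAP = 0.5  # the "up to 50%" ceiling on ops work described above


@dataclass
class WeekLog:
    """Hours one engineer logged in a week (hypothetical record type)."""
    ops_hours: float  # issues, on-call, manual intervention
    dev_hours: float  # new features, scaling, automation


def ops_fraction(log: WeekLog) -> float:
    """Return the share of logged time that went to ops work."""
    total = log.ops_hours + log.dev_hours
    if total <= 0:
        raise ValueError("no hours logged")
    return log.ops_hours / total


def over_cap(log: WeekLog, cap: float = OPS_CAP) -> bool:
    """True when ops work exceeds the cap (exactly 50% is still allowed)."""
    return ops_fraction(log) > cap


if __name__ == "__main__":
    week = WeekLog(ops_hours=26.0, dev_hours=14.0)  # made-up sample data
    print(f"ops share: {ops_fraction(week):.0%}, over cap: {over_cap(week)}")
```

A team that sees `over_cap` return true week after week has a signal that more effort should go into automation, which is the kind of development work the other half of the role is meant for.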

DevOps vs SRE

Coined around 2008, DevOps is a philosophy of cross-team empathy and business alignment. It has also been associated with a practice that encompasses automation of manual tasks, continuous integration, and continuous delivery. Site reliability engineering is a superset of processes that inherently includes DevOps. Developing tools to improve and automate operations naturally leads to a more automated, self-service DevOps environment. SREs, being developers themselves, naturally bring solutions that help remove the barriers between development and operations teams.

References

General

Betsy Beyer, Chris Jones, Jennifer Petoff, Niall Richard Murphy, Site Reliability Engineering, O'Reilly Media, April 2016, ISBN 978-1-4919-2909-4
Thomas Limoncelli, The Practice of Cloud System Administration: Designing and Operating Large Distributed Systems, Volume 2, ISBN 032194318X

External links

Google - Site Reliability Engineering interview with Ben Treynor

Source of the article: Wikipedia

Tuesday, 12 December 2017