GOTO - Today, Tomorrow and the Future

Building Green Software Part 4: Operational Efficiency • Anne Currie

January 12, 2024 Anne Currie & GOTO Season 4 Episode 2

This interview was recorded for the GOTO Book Club.
gotopia.tech/bookclub

Read the full transcription of the interview here

Anne Currie - Co-Author of "Building Green Software", Leadership Team at Green Software Foundation & Veteran Software Engineer

RESOURCES
oreilly.com/library/view/building-green-software/9781098150617
greensoftware.foundation

Anne
annecurrie.com
twitter.com/anne_e_currie

DESCRIPTION
How will software development and operations have to change to meet the sustainability and green needs of the planet? And what does that imply for development organizations? In this eye-opening book, sustainable software advocates Anne Currie, Sarah Hsu, and Sara Bergman provide a unique overview of this topic—discussing everything from the likely evolution of national grids to the effect those changes will have on the day-to-day lives of developers.

Ideal for everyone from new developers to CTOs, Building Green Software tackles the challenges involved and shows you how to build, host, and operate code in a way that's not only better for the planet, but also cheaper and relatively low-risk for your business. Most hyperscale public cloud providers have already committed to net-zero IT operations by 2030. This book shows you how to get on board.

You'll explore:
• How the energy transition is likely to change hosting on prem and in the cloud—and how your company can prepare
• The fundamental architectural principles of sustainable software development and how to apply them
• How to determine which parts of your system need to change
• The concept of extending hardware longevity and the part that software plays

* Book description: © O'Reilly

RECOMMENDED BOOKS
Anne Currie, Sarah Hsu, & Sara Bergman • Building Green Software
Ioannis Kolaxis • 101 Green Software
Mehdi Khosrow-Pour • Green Computing Strategies for Competitive Advantage and Business Sustainability
Lässig, Kersting & Morik • Computational Sustainability
Zbigniew H. Gontar • Smart Grid Analytics for Sustainability and Urbanization
Katsoni & Segarra-Oña • Smart Tourism as a Driver for Culture and Sustainability


Looking for a unique learning experience?
Attend the next GOTO conference near you! Get your ticket: gotopia.tech


Intro

Hi, my name is Anne Currie, and I'm one of the co-authors of the new O'Reilly book, "Building Green Software," which is all about what we in the software industry need to do to handle the energy transition. Today I'm gonna talk to you about one of the latest chapters to go live with the book, which is "Operational Efficiency." Now, I'm gonna have to force myself to not talk about this for hours, just 10 to 15 minutes, because it is the most important chapter in the book, it's the longest chapter in the book, and it has the most in it.

The Significance of Operational Efficiency

The reason why I say it's the most important chapter is...well, it's controversial when I say that, because most people think the most important chapter will be "Code Efficiency." And in some ways it is, but in terms of urgency and priority, it isn't.

Code efficiency: most people have seen the charts somewhere that rank different languages in terms of how efficient they are. You've probably seen the one that says that C is 100 times more efficient than Python. So you'll be thinking, "Well, hang on a minute, I should be rewriting my stuff in C, not Python. I'm never gonna get a payoff as big as that anywhere else." But today I'm going to argue that operational efficiency is, at this stage, much, much more effective, even though the best you're probably gonna get from it is a 10x improvement, cutting 90% of your carbon emissions by using good operational techniques. You'd be thinking, "Well, that's literally 10 times less good than code efficiency."

The reason why operational efficiency is so effective is that we are much closer to being able to do it. Much more of operational efficiency has been commoditized already than is the case with code efficiency. Now, there's some really good stuff going on with the commoditization of code efficiency: if you start looking into things like Python being compiled down to C-level performance, or new languages like Mojo, there is work to improve the efficiency of code whilst keeping developer productivity. But that's still at a very early stage. Operational efficiency is much further down the line: it is much closer to being commoditized, and there's a lot more you can buy off the shelf. So that's the main reason...well, is it the main reason? It is one of the reasons why operational efficiency is where we need to be looking next.

Operational Efficiency: A Fundamental Shift

But the other reason why we need to be looking at operational efficiency is that it's more fundamental than code efficiency. For example, say I had a monolith written in Python and I went to all the effort, and it would be a lot of effort, to rewrite it completely and re-architect it for good carbon awareness, maybe put in some microservices, all that kind of stuff, to make it 100 times more efficient. Say I did that. It would take me ages, and it would make the code very hard to maintain, at the moment at least; things will improve, but right now it would be extremely custom work. Say I went, "Hooray, I've done it," and I've reduced the CPU, memory, bandwidth, and all the other resources it requires by 99%. If I then run it on the same machine, I get little or none of that benefit. If you reduce how much a machine is used by 99%, you don't save that much: you save no embodied carbon, and you don't save much of the electricity used to power it either, because most of the energy goes into keeping the machine turned on rather than doing anything.

So fundamentally, if you don't right-size your application, if you don't move it to a VM or a machine that is the same size as you've just shrunk it to, then you don't get much benefit from the enormous amount of work you did to shrink it. What I'm saying is that the operational move, moving the application to a machine or a VM that's the right size for it, is more fundamental than tuning it. You have to learn how to do that before you get any benefit from all the work that code efficiency takes. The three of us co-authoring the book, and the Green Software Foundation as well, want everybody to start moving on the bit that's already commoditized and more fundamental, which is the operational efficiency side, and get shit hot at that while we put pressure on the code efficiency folk, those platforms, to become green and make that easier for us.

So in terms of ordering, it's operational efficiency first, because if you did it the other way around, as I said, you don't get that much benefit. Fortunately, that's also the way that commoditization has gone: we're more commoditized on operational efficiency than we are on code efficiency.

Operational Efficiency Breakdown

So, operational efficiency. Now, this is the longest chapter in the book, so I've got an awful lot of stuff, and I'm gonna have to whizz through it here. I would say it falls into three areas.

You've got turning off machines that aren't used or are underused, and there's a new kind of ops that entirely revolves around that problem (which is not as easy to solve as you might think) called LightSwitchOps. Then after that, you've got to look at things like increasing your multi-tenancy, using auto-scaling, right-sizing, and removing over-provisioning, which more or less falls into the remit of DevOps. And then finally, when you get really, fantastically good at this, you're looking at SRE, currently the peak of ops performance: Site Reliability Engineering, initially developed by Google.

Because what I'm telling you here about green and efficient ops is just efficient ops. There's nothing new, nothing magic about it being green; it's just really good ops. Once you are at the top of your game at doing ops, you will naturally be green, which is another good reason why ops is a good place to start: you can sell it for other reasons than being green. You can sell it for cost-saving reasons, security reasons, and, oddly enough, productivity reasons, because a lot of these techniques that allow you to deploy more quickly will improve the productivity of your developers. You'll find that that is often the easiest sell to make. If you can go faster, deploy faster, deploy more securely, deploy more confidently, then that's the kind of story your company will like to hear, because it means you'll be able to get features out faster and try them out. It's the reverse of the old waterfall days, which were my days, and we don't wanna go back to them.

1. LightSwitchOps

So in terms of operations, what are we looking at? LightSwitchOps. LightSwitchOps is an idea being championed by Holly Cummins, who is an IBM engineer. The idea is that we often don't turn off machines that we should be turning off, either because they're not very well used or because they're not used at all, and we fear that if we turn them off, we won't be able to turn them back on again. There are lots of reasons why we keep machines around that aren't effectively used, and that's a waste of power and a real waste of servers that could be used for something more effective. We don't necessarily always know which machines they are, so some work is required to find that out. But we also fear turning them off in case we can't turn them back on again.

Resolving that is probably your ops 101. It's the first thing we need to be working on: making sure that we can safely turn off any machine in our systems. It's also a security hole if you've got machines and you don't understand what they're doing anymore; it's a sign that you're losing control of what's going on in your data centers. And it's incredibly common. As an example of the level of savings you can get from this, the best one I heard about recently was VMware moving a data center in Singapore. They were moving the data center, pretty standard stuff, and they wanted to not move more than they had to. So they decided to do a full audit of what was running on all of the machines they were moving. And what they discovered was that two-thirds of the machines in that data center were not doing anything that mattered to them anymore. They were old, had maybe a couple of users, and it was not worth their while moving them. Two-thirds.

So, LightSwitchOps... It can be very hard to work out which machines you can turn off, but there are a couple of really good techniques for this. There's the scream test: just turn the machine off and find out if anybody screams. That works quite well, but again, only if you're not worried about being unable to turn the machine back on again. Another technique is that all resources get provisioned for six months, and if no one says, "No, I want this again," within six months, then it's not popular enough to warrant being kept on. But again, you've got to tie this in with LightSwitchOps. The idea of LightSwitchOps is that you go through and take the pain of working out which machines you need to turn off, and you take the risk that you might turn some of them off and not be able to turn them back on again. But from then on, everything is automated so that you can turn machines off and on again automatically. You test that, and then you use it to turn off machines that are no longer in use, but also machines that are on at the moment but don't need to be. The obvious example is test systems at the weekend, or development environments at night and over the weekend.

LightSwitchOps is the idea that you can turn machines off and on again as confidently as your lights. It all comes from the observation that you don't avoid turning your lights off at night out of fear that they won't turn back on again in the morning. If you were afraid they wouldn't come back on, you'd never turn them off, but you always do. Well, with LED light bulbs it doesn't necessarily save you that much anymore, but in the olden days it used to save you a lot of money to turn your lights off when you weren't in the room or you were in bed. And it only works because you're quite confident you can turn them back on again. The aim is to be as confident about your systems as you are about your lights, so that you can turn them off without fear. So that's LightSwitchOps.
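The nightly and weekend switch-off described above boils down to a very small piece of scheduling policy. Here's a minimal sketch of that logic; the environment names, hours, and weekday set are illustrative assumptions, not anything prescribed by the book:

```python
from datetime import datetime

# Hours during which non-production environments stay on.
# These values are made up for illustration; use whatever fits your team.
WORK_START_HOUR = 7          # 07:00
WORK_END_HOUR = 19           # 19:00
WORKDAYS = {0, 1, 2, 3, 4}   # Monday..Friday (datetime.weekday numbering)

def should_be_running(env: str, now: datetime) -> bool:
    """LightSwitchOps policy: production is always on; dev and test
    environments only run during working hours on weekdays."""
    if env == "production":
        return True
    return now.weekday() in WORKDAYS and WORK_START_HOUR <= now.hour < WORK_END_HOUR
```

A scheduler (cron, a cloud function, whatever you already run) would evaluate this every hour and issue your provider's start/stop calls accordingly; the confidence to do that automatically is the whole point of LightSwitchOps.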

2. DevOps and Right-Sizing

The next thing is DevOps. Once you've done your LightSwitchOps and turned off the stuff you don't need anymore, the next thing to look at is not over-provisioning. I can talk forever about why over-provisioning happens, and it's perfectly sensible, but in the long run we've got to cut back on it, and there are various ways to do that. You can use auto-scaling, and you can use orchestrators to make sure that things are moved around from place to place and scaled up and down. But fundamentally, right-sizing is the next step in the process of operational efficiency, and it's the kind of thing that's covered by DevOps.

It's a very difficult thing to do. I'm saying this as if it's obvious that you do DevOps and make all this stuff work, but it's really not. It's hard and it requires investment, but it's part of something bigger: if you look at companies that are doing well in operations these days, a lot of them are doing it so they can release code really fast, on an hourly, 10-minutely, even minutely basis. This all tends to go hand-in-hand with that. Getting good at DevOps, getting good at how you control your systems, getting orchestrators in place, starting to wrap workloads in things like containers (which is part of being able to move workloads around, scale them up, and move them from machine to machine depending on their current resource requirements): it's hand-in-glove with that whole CI/CD story of moving faster and making sure that applications go live faster.

So it is aligned with the stuff you already want; it's not only good for machine productivity. And this is all about machine utilization. Operational efficiency is all about machine utilization, and improving machine utilization cuts your costs, improves your security, and massively reduces your carbon effects. Again, this is why operational efficiency is the most important thing: it's aligned with all the other stuff we want to do. So we've got DevOps there, which includes auto-scaling and, if you're in the cloud, choosing the right instance types. This is a deceptively powerful concept. It doesn't apply if you're not in the cloud, but if you are, make sure you choose flexible instance types that are right for your workload.
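The right-sizing idea can be sketched as a tiny selection routine: measure real utilization, then pick the smallest instance that covers your realistic peak (say the 95th percentile plus headroom) rather than the absolute worst case. The instance names and sizes below are hypothetical, purely for illustration:

```python
def percentile(samples, p):
    """Nearest-rank percentile of a list of utilization samples."""
    ordered = sorted(samples)
    k = max(0, min(len(ordered) - 1, round(p / 100 * len(ordered)) - 1))
    return ordered[k]

# Hypothetical catalogue: instance name -> vCPU capacity, smallest first.
CATALOGUE = [("small", 2), ("medium", 4), ("large", 8), ("xlarge", 16)]

def right_size(cpu_samples, headroom=1.2, p=95):
    """Pick the smallest instance whose vCPUs cover the p-th percentile
    of observed CPU demand plus some headroom, instead of the peak."""
    needed = percentile(cpu_samples, p) * headroom
    for name, vcpus in CATALOGUE:
        if vcpus >= needed:
            return name
    return CATALOGUE[-1][0]  # nothing fits: take the biggest available
```

For a workload that mostly needs ~3 vCPUs with a rare spike to 14, sizing to the 95th percentile picks "medium", while sizing to the peak (p=100) would force "xlarge": that gap is exactly the over-provisioning being removed.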

So, for example, one of the most interesting types of auto-scaling out there, I would say, is the burstable instance type, which is available in all clouds. With a burstable instance, you pay for a low level, or what you think will be your average level, of resource requirements on a machine, so it's not crazily expensive. The key reason we all over-provision is that you think, "Well, mostly I need that much, but occasionally I'm gonna need enormous amounts of resources, and if I don't get them, then I'm gonna fall over and it's gonna be really bad. So to meet my SLAs, I'm gonna have to over-provision to the maximum, rather than run on the average level of resources I need and know that every couple of weeks I'm gonna fall over."

The idea of burstable instances is that you pay for a moderate level of resource, but occasionally, for a limited amount of time when it's needed, your hosting provider will allow your machine to leap up to the large amount of resources you need to meet the peak and then decline again. So burstable instances are a really interesting way of doing auto-scaling; I'm quite happy with them. So right-sizing is one way of doing it.

Another way of doing it is to steer your developers towards platforms that do an awful lot of this stuff for you. Serverless does auto-scaling for you. Then spot instances: perfect, I love spot instances for many, many reasons. They're fantastically good for demand shifting and shaping, which you'll hear about in another podcast. Spot instances, again, will jump in and do stuff for you, and you don't have to worry about them so much. I mean, you do have to architect for spot instances, and that is not easy because they've got no SLAs and you'll have to redo everything. But operationally, if you can do that, you've won; that's the perfect green operational approach. Where you can use it, use it.

3. Site Reliability Engineering (SRE)

And then finally, we'll talk a little bit about the perfect solution here, which is really for you to fully take on SRE: Google's Site Reliability Engineering principles of CI/CD, full automation, and massive monitoring, acting on what you see. Because all of this stuff is hard. It's not trivial to do these things, or we would all have done them already. It is best practice, and there are commodity tools available to help you, but it is hard. It's well worth looking at Google's SRE principles for how they moved in this direction, because they moved in this direction literally 20 years ago, and they've written up and talked about what they did and what they learned.

So, unlike with code efficiency, where we're still in the very early days of aligning efficiency with developer productivity (which we need to do), with operational efficiency somebody has already beaten this track. It's not new; you can find out what you need to do. Almost the best thing you can do about being green is to start improving your operational performance and operational skills, and the acme of operational skills at the moment is in those Google SRE books. They're scary. You should be following people like Charity Majors on Twitter and seeing what they're doing, because they're not doing it for green reasons, but it is green. It is the foundation of how we will produce very, very efficient systems in the future. So it's absolutely worth looking at.

Takeaways

So, I had so much to talk about today that I've only been able to cover the very, very basics; I've not gone into depth on any of these things. But remember, the book is available, and there's more detail in it. This chapter is live, and it is a long chapter with a lot in it, but it's well worth reading through. Even the chapter only gives you an introduction to what you need to do, but it is a good introduction. So go and have a read of it. And O'Reilly keeps reminding me: you do not need an O'Reilly subscription to read these chapters, because you can get a free trial and read everything very quickly, just like, you know, a Netflix or a Disney low-price trial where you watch all the stuff for a while. We're all very used to using a free trial to blast our way through content. If you have anything to do with ops, use your O'Reilly free trial to read this chapter, because it's what we need to do next. It's utterly, utterly key, and it's the bit that's aligned with what the rest of your business needs to do, so it'll be the easiest sell of any part of the green story.

Outro

But anyway, I promised GOTO that I wouldn't overrun this one to the horrible extent that I did the last one on code efficiency. So, there we go: LightSwitchOps, DevOps, SRE. Operational efficiency is the most important thing; get good at it and you will win. This is the next step for you to take in being green. So thank you very much for listening, and I look forward to speaking to you again, probably about the "Networking" chapter.
