Guy Warren has sat in a range of hot seats at financial organisations, from running production at a global investment bank (SBC Warburg) through to heading up operations at the FTSE Group. Now CEO at ITRS Group, a company that specialises in estate monitoring, capacity planning, IT analytics and load testing, he joined The Stack to talk bank infrastructure resilience, systems monitoring, and more.
We’ve seen a lot of significant outages recently across exchanges and indeed banks. Just how hard is it building resilience into infrastructure and applications in this space?
I’ve run production for financial organizations three times in my career, and there are challenges to it, without a doubt. I was the COO of FTSE Index company. I was the UK CEO and Global COO for FNZ. And earlier in my career, I ran production for a global investment bank, SBC, or UBS as it became. If you’ve got a large, complex, messy estate, it is difficult. I think there are four things that you have to have in place.
The first is that you need a resilient architecture. You have to assume things are going to fail. For everything that can fail, there needs to be at least one standby and a clear plan for how you get from the production to the standby. There’s hot-hot, where the live system and the standby are both working flat out all the time. There’s hot-warm, where one is waiting to pick up the workload if something dies. And there is “cold”, which you cut over to in the event that the “hot” dies; that’s going to take you longer.
So you need good architecture, and you need to know what your target availability is. If you’re shooting for four nines (99.99% uptime) and you’re running 24/7, that allows about 52 minutes of downtime a year. If you’re running working hours only, it’s a bit less than that, 25 minutes a year. So you really don’t have long from detecting a problem to cutting over to whatever your standby is, literally minutes, and preferably the cutover is done automatically by the software itself. You want no single points of failure.
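The downtime arithmetic above can be checked with a short sketch. The function name and the service windows are assumptions made for illustration: the interview does not say which working-hours window lies behind the 25-minute figure. A 16-hour weekday window happens to give roughly that number, while an 8-hour weekday window would give about half of it.

```python
def downtime_budget_minutes(availability: float, service_hours_per_year: float) -> float:
    """Minutes of downtime per year permitted by a target availability."""
    return service_hours_per_year * 60 * (1 - availability)


FOUR_NINES = 0.9999

# Service running 24 hours a day, 365 days a year.
always_on = downtime_budget_minutes(FOUR_NINES, 24 * 365)

# Assumed window: 16 hours a day, 5 days a week, 52 weeks a year.
extended_weekdays = downtime_budget_minutes(FOUR_NINES, 16 * 5 * 52)

print(f"24/7 budget:             {always_on:.2f} minutes/year")
print(f"16x5 (assumed) budget:   {extended_weekdays:.2f} minutes/year")
```

Either way, the budget is small enough that a manual cutover taking tens of minutes could use up the whole year's allowance in a single incident.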
The second thing is thorough testing. What tends to happen is that people do functional testing: they make sure a new piece of functionality works, but they don’t do non-functional testing, such as putting the system under load and leaving it running to see whether its behaviour changes over multiple hours. That’s a key factor. Have you done thorough testing of this particular architecture, and is your test environment really close to your prod?
The third thing is change management. Changes are often done at the weekend, but not exclusively. Do you know how long this change is going to take to make, and do you know the point at which you need to start rolling back and say this change is not going to happen on time?
Take the big TSB outage a while ago: that started with a migration. They decided they couldn’t go backwards, decided to roll forward and left themselves in an unclean state. So managing change is really important.
The fourth one is knowing what’s actually happening in production, because not everything is related to change. Probably 70% of failures are related to change, but things that have been working for months on end do fail in production, and therefore having really good monitoring… We [ITRS] typically play in that fourth area: monitoring, automated monitoring, and automatic failover.
What is your sense of the resilience of much architecture out there and outage causes?
Hardware failures are fairly rare. I saw the Tokyo Exchange did blame hardware, which tells you two things. One, the hardware was old enough that it had failed, and that normally takes multiple years. Two, there was no automatic bypassing of the failed hardware. If it’s a server, there should be a second server waiting to go; if it’s a router, there should be at least two ways to every network, etc. The other thing that is a problem is how convoluted processes can be.
We were working with a client recently: from their mobile banking app through to a payment being made, there are ten hops, ten significant pieces of processing going on. Each of those had its own technology underneath it, modern and old, etc. So if you’ve got that kind of a situation, you’ve got a relatively fragile service delivery model. Everything has to be working all the time for that not to be down.
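A rough way to see why a ten-hop chain is fragile: if every hop must be up for the service to be up, the availabilities multiply. The sketch below is illustrative only. It assumes each hop fails independently and that each one individually achieves four nines; neither figure comes from the client described above, and the function name is hypothetical.

```python
import math


def serial_availability(hop_availabilities):
    """Availability of a chain in which every hop must be working.

    Assumes hops fail independently of one another.
    """
    return math.prod(hop_availabilities)


# Assumed for illustration: ten hops, each at 99.99% availability.
hops = [0.9999] * 10
end_to_end = serial_availability(hops)

minutes_per_year = 365 * 24 * 60
yearly_downtime_minutes = (1 - end_to_end) * minutes_per_year

print(f"End-to-end availability: {end_to_end:.4%}")
print(f"Expected downtime:       {yearly_downtime_minutes:.0f} minutes/year")
```

Under these assumptions the chain as a whole lands at roughly three nines, even though every component is at four, so the end-to-end downtime is about ten times that of any single hop.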
It does depend on how recently all of your infrastructure has been upgraded and renovated, and not many people can afford to keep on going back and renovating the original early stuff they put in. The very oldest technology, mainframes, is very robust and goes multiple years between failures. But nevertheless, there will be tech in there that is a lot more fragile. The number of delivery channels has gone up. From mobile banking to interbank processes like open banking, there are so many ways in which we touch those backend systems that it’s getting more and more complex for banks to manage.
As a monitoring service provider yourself, that must also be a challenge, given the pace of change…
What clients tend to end up with is a lot of monitoring tools; clients say ‘I have 30 or even 40 monitoring tools’. That’s like trying to drive your car when there are individual bits telling you about one wheel, something else telling you about your petrol gauge, and no central view as to what the hell’s going on. How can you watch forty consoles on the screen and work out what’s going on across your bank account and things? It’s a very siloed view of what’s happening. We can monitor everything from a flat file through a batch job, through old messaging technology, right the way up to the most modern containers, dynamic environments, cloud, whatever. For those who don’t want to come down from 30 tools to 3, we also integrate well with the other tools you’ve got: we would ask what are maybe 5 other tools that we need to integrate with so that we give you a single pane of glass.
Monitoring and resilience are inarguably pretty critical, but banks are also intent on driving down cost. What does the investment climate look like for IT across financial services as you see it?
Obviously this has been an extraordinary year, both with Covid and also with Brexit. But there’s new regulation coming through called Operational Resilience. It’s a step up from the normal operational controls that a financial organisation would have, and it focuses on availability, performance and security. What they also did is bring through individual liability, personal responsibility. It’s called an SMF (senior management function) implementation, which means an individual in the bank has to sign a contract with the regulator stating that they are personally liable if their organisation is not doing what it needs to do.
When the regulator makes it personal like that, it’s not just a fine for the organisation. It’s penalties for the individual. By making it that onerous, the organisations will take it more seriously, and cost issues have to come second to having a resilient operation. This legislation is at consultation stage and will probably come in sometime next year. It’s also gone into the Basel Banking Committee, and multiple jurisdictions are bringing in similar regulations. That will strengthen a focus on resilience.
What kind of software issues are typical causes of outages?
I called out the Tokyo outage as being unusual. I don’t see hardware failing very often. If it does, it should fail silently and another piece of hardware pick up the workload. So it’s typically either an application they bought or have written; sometimes it’s incompatibility with versions of software which they haven’t tested properly. They promote something through to prod and it breaks something else…
Do you think that that’s happening more and more because there is a pressure to push things through to prod faster?
Yes. The concept of “fail fast”, where you cut your testing down and try to automate it if you possibly can, and say “let’s get a new piece of functionality through every four weeks, two weeks, six weeks, then if something’s not right, let’s get it fixed up and push that through”, works fine at Facebook or Netflix. It doesn’t work well in banking, because any outage is a real problem. There is a real tension between trying to bring new versions through very quickly and getting high stability and predictable performance in your resulting service.
Have you seen attitudes towards that change over the past year or two?
I still see them trying to introduce more DevOps and still wincing at the number of outages they have. Something called SRE (Site Reliability Engineering), which Google developed, is a much more methodical approach to doing the job, and not enough organizations are following it. Google follow it, and they slow down all changes while they’ve got a period of instability or poor performance. It’s a good approach, which we would espouse.
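The idea of slowing changes during a period of instability is often expressed in SRE practice as an error budget: a service is allowed a certain amount of unavailability over a window, and releases are held back once that allowance is spent. The sketch below is a simplified illustration of that idea, not Google’s implementation or an ITRS product feature; the class, the function, the 30-day window and the downtime figures are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class ErrorBudget:
    slo: float               # target availability, e.g. 0.9999
    window_minutes: float    # length of the measurement window
    downtime_minutes: float  # downtime observed so far in the window

    @property
    def allowed_minutes(self) -> float:
        """Total downtime the SLO permits over the window."""
        return self.window_minutes * (1 - self.slo)

    @property
    def remaining_minutes(self) -> float:
        """Budget left; negative once the SLO has been breached."""
        return self.allowed_minutes - self.downtime_minutes


def release_allowed(budget: ErrorBudget) -> bool:
    """Hold back non-essential changes once the budget is spent."""
    return budget.remaining_minutes > 0


THIRTY_DAYS = 30 * 24 * 60  # assumed window, in minutes

stable = ErrorBudget(slo=0.9999, window_minutes=THIRTY_DAYS, downtime_minutes=1.0)
unstable = ErrorBudget(slo=0.9999, window_minutes=THIRTY_DAYS, downtime_minutes=6.0)

print("stable service may release:  ", release_allowed(stable))
print("unstable service may release:", release_allowed(unstable))
```

A gate like this makes the trade-off explicit: teams can ship as fast as they like while the service is meeting its target, and the pace drops automatically when it is not.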
Outside of your immediate products, anywhere in financial services technology investment that you are seeing a lot of money being spent?
Cloud is absolutely roaring. It’s about $220 billion total spend per year. It’s just nuts. So we’ve had to improve our tools, bring through new monitoring for the major cloud providers, and add new capability around cloud cost management, because cloud use is often very wasteful. People tend to spin it up without really knowing exactly what size instances they need, and waste a lot of money. We actually just launched a new product called Cloud Cost Optimization, which you point at your cloud estate and it’ll tell you where you’re wasting your money.
Are you seeing largely cloud native applications being spun up or is it migration processes?
Historically, it’s been new. But increasingly it’s migration. And [too often] people think that whatever was running on a physical machine in their data center, they’ve got to spin up the same size on a physical or virtual machine in the cloud and just move the workload over. That’s very wasteful, because although it’s bad to have wasted capacity in your data center, it’s disastrous to have wasted OpEx in the cloud.