Modern digital and cloud technology underpins the shift that enables businesses to implement new processes, scale quickly and serve customers in a whole new way. Historically, organisations would invest in their own IT infrastructure to support their business objectives, and the IT department's role would be focused on keeping the "lights on." To minimize the chance of failure of the equipment, engineers traditionally introduced an element of redundancy in the architecture. That redundancy could manifest itself on many levels. For example, it could be a redundant data centre, which is kept as a ‘hot’ or ‘warm’ site with a complete set of hardware and software ready to take the workload in case of the failure of a primary data centre. Components of the data centre, like power and cooling, can also be redundant to increase the resiliency. On a lesser scale, within a single data centre, networking infrastructure elements can be redundant. It is not uncommon to procure two firewalls instead of just one to configure them to balance the load or just to have a second one as a backup. Power and utility companies still stock up on critical industrial control equipment to be able to quickly react to a failed component.
Traditional Data Protection
The majority of effort, however, went into protecting the data storage. Magnetic disks were assembled in RAIDs to reduce the chances of data loss in case of failure, and backups were relegated to magnetic tapes to preserve less time-sensitive data and stored in separate physical locations. Depending on specific business objectives or compliance requirements, organizations had to heavily invest in these architectures. One-off investments were, however, only one side of the story. On-going maintenance, regular tests and periodic upgrades were also required to keep these components operational. Labor, electricity, insurance and other costs were adding to the final bill. Moreover, if a company was operating in a regulated space – for example, if they processed payments and cardholder data – then external audits, certification and attestation were also required.
Ensuring Resilience with the Advent of Cloud Computing
With the advent of cloud computing, companies were able to abstract away a lot of this complexity and let someone else handle the building and operation of data centres, as well as compliance issues relating to physical security. The need for business resilience, however, did not go away. Cloud providers can offer options that far exceed traditional infrastructure at comparable cost, but only if configured appropriately. One example is the use of availability zones, where your resources can be deployed across physically separate data centres. In this scenario, your service can be balanced across these zones and can remain running even if one of them goes down. The capital investment required to achieve the same on your own infrastructure is much greater: in essence, you would have to build two or more data centres, and you had better have a solid business case for that. Additional resiliency in the cloud, however, is only achieved if you architect your solutions well: running your service in a single zone or, worse still, on a single virtual server can prove less resilient than running it on a physical machine. It is important to keep this in mind when deciding to move to the cloud from traditional infrastructure. Simply lifting and shifting your applications to the cloud may, in fact, reduce resiliency. These applications are unlikely to have been developed to work in the cloud and take advantage of these additional resiliency options. Therefore, I advise against such migration in favour of re-architecting. Cloud service provider SLAs should also be considered. Compensation might be offered for failure to meet these, but it’s your job to check how the committed availability compares to the traditional “five nines” (99.999%) of a conventional data centre, and how service credits offered as recompense compare to the business losses from lack of availability.
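To make the availability comparison concrete, here is a minimal Python sketch of the arithmetic. It converts an availability target into the downtime it permits per year, and estimates the combined availability of a service that stays up as long as at least one zone is up. The function names and the 99% per-zone figure are illustrative assumptions, not taken from any provider's SLA, and the calculation assumes zone failures are independent, which is optimistic in practice.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes, ignoring leap years


def allowed_downtime_minutes(availability: float) -> float:
    """Minutes of downtime per year permitted by an availability target (0..1)."""
    if not 0.0 <= availability <= 1.0:
        raise ValueError("availability must be between 0 and 1")
    return (1.0 - availability) * MINUTES_PER_YEAR


def parallel_availability(zone_availability: float, zones: int) -> float:
    """Availability of a service that survives while at least one zone is up.

    Assumption: zones fail independently of each other. Shared dependencies
    (control plane, DNS, deployment pipeline) make real figures lower.
    """
    if zones < 1:
        raise ValueError("at least one zone is required")
    if not 0.0 <= zone_availability <= 1.0:
        raise ValueError("zone_availability must be between 0 and 1")
    return 1.0 - (1.0 - zone_availability) ** zones


if __name__ == "__main__":
    # "Five nines" leaves roughly 5.26 minutes of downtime per year.
    print(f"99.999%: {allowed_downtime_minutes(0.99999):.2f} min/year")
    # Illustrative figure: a single zone at 99% versus two such zones.
    for n in (1, 2):
        combined = parallel_availability(0.99, n)
        print(f"{n} zone(s): {combined:.4%}, "
              f"{allowed_downtime_minutes(combined):.0f} min/year")
```

The same downtime figure can be multiplied by your own estimate of revenue lost per minute and set against the service credit on offer, which is usually the more sobering comparison.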
Cloud Service Models
You should also be aware of the many differences between cloud service models. When procuring SaaS, for example, your ability to manage resilience is significantly reduced. In this case, you are relying completely on your provider to keep the service up and running, which makes a provider outage a real concern. In this scenario, archiving and regular data extraction might be your only options apart from reviewing the SLAs and accepting the residual risk. Even with the data, however, your options are limited without a second application on hand to process that data, which may also require data transformation. Study the historical performance and pick your SaaS provider carefully. IaaS gives you more options to design an architecture for your application, but with this great freedom comes great responsibility. The provider is responsible for fewer layers of the overall stack when it comes to IaaS, so you must design and maintain a lot of it yourself. When doing so, assume failure rather than thinking of it as a (remote) possibility. Availability zones are helpful but not always sufficient. What scenarios require consideration of the use of a separate geographical region? Do any scenarios or requirements justify a need for a second cloud services provider? The European Banking Authority recommendations on Exit and Continuity can be an interesting example to look at from a testing and deliverability perspective. Finally, PaaS, as always, is somewhere in between SaaS and IaaS. I find that a lot of the time it depends on the particular platform; some will give you options you can play with when it comes to resiliency, and others will retain full control. Be mindful of the characteristics of SaaS that also affect PaaS from a redundancy perspective. For example, if you’re using a proprietary PaaS, then you can’t just lift and shift your data and code.
Final Thoughts
Above all, when designing for resiliency, take a risk-based approach. Not all your assets have the same criticality. Understand the priorities, and know your recovery point objective (RPO) and recovery time objective (RTO). Remember that a SaaS offering can itself be built on top of AWS or Azure, exposing you to supply chain risks. Even when assuming the worst, you may not have to keep every single service running should the worst actually happen. For one thing, it's too expensive; just ask your business stakeholders. The very worst time to be defining your approach to resilience is in the middle of an incident, closely followed by shortly after an incident. As with other elements of security in the cloud, resilience should “shift left” and be addressed as early in the delivery cycle as possible. As the Scout movement is fond of saying, “be prepared.”
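One way to turn "know your RPO and RTO" into something checkable is sketched below in Python. RPO is the maximum amount of data loss you can tolerate, measured in time, and RTO is the maximum time you can take to restore service. The `Service` type, its field names and the example figures are all hypothetical, and the sketch assumes that worst-case data loss equals the interval between backups and that the restore time comes from an actual recovery test.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Service:
    """Hypothetical record of a service's objectives and measured capability."""
    name: str
    rpo_minutes: float               # tolerated data loss, expressed as time
    rto_minutes: float               # tolerated time to restore service
    backup_interval_minutes: float   # assumed worst-case data loss
    tested_restore_minutes: float    # measured in the last recovery test


def resilience_gaps(service: Service) -> list[str]:
    """Return which objectives ("RPO", "RTO") the current setup fails to meet."""
    gaps = []
    if service.backup_interval_minutes > service.rpo_minutes:
        gaps.append("RPO")
    if service.tested_restore_minutes > service.rto_minutes:
        gaps.append("RTO")
    return gaps


if __name__ == "__main__":
    # Illustrative figures only; real objectives come from business stakeholders.
    services = [
        Service("payments", rpo_minutes=15, rto_minutes=60,
                backup_interval_minutes=60, tested_restore_minutes=45),
        Service("internal wiki", rpo_minutes=1440, rto_minutes=2880,
                backup_interval_minutes=1440, tested_restore_minutes=240),
    ]
    for svc in services:
        gaps = resilience_gaps(svc)
        print(f"{svc.name}: {'misses ' + ', '.join(gaps) if gaps else 'meets objectives'}")
```

Keeping objectives per service rather than globally reflects the point that not every asset has the same criticality: the less critical service can tolerate a far looser backup schedule than the critical one.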
About the Author: Leron Zinatullin (@le_rond) is an experienced risk consultant, specialising in cybersecurity strategy, management and delivery. He has led large scale, global, high value security transformation projects with a view to improving cost performance and supporting business strategy. He has extensive knowledge and practical experience in solving information security, privacy and architectural issues across multiple industry sectors. Visit Leron’s blog here: https://zinatullin.com/. To find out more about the psychology behind information security, read Leron’s book, The Psychology of Information Security. Editor’s Note: The opinions expressed in this guest author article are solely those of the contributor, and do not necessarily reflect those of Tripwire, Inc.