IT Resilience: Challenges, Pitfalls and Tips [Podcast]

We put David Davies, Business Continuity and IT Resilience Consultant, under the spotlight to answer questions on IT resilience.

IT resilience is not just about achieving ‘always-on’ systems; it’s also about being able to recover quickly and effectively when things go wrong – and it’s a central element of organisational resilience. In this recorded interview, David shares his insight into achieving IT resilience: what challenges need to be overcome, what pitfalls need to be avoided, and lots of useful tips to help you get it right.

Listen to the podcast here, or read on for the article

Can you give us a brief overview of IT resilience?

If you think of an organisation’s IT systems, such as email, databases and the website, IT resilience is all about keeping those systems up and running, ideally without failures or interruptions. The ideal state is to never fail, but you also need tried and tested technology in place to recover from an IT failure if it does happen.

As an example, let’s say you have a primary data centre which runs all of your email and other business systems. If it is paired with a mirror image at data centre B, then data centre B should carry on running your systems seamlessly if the primary data centre fails.

If both your primary data centre and your mirror image data centre B fail, your IT systems can be recovered from backups at data centre C – a more traditional recovery service.
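The three-tier arrangement described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `Site` class, the role names and the `choose_recovery_site` function are hypothetical and do not come from any particular product.

```python
from dataclasses import dataclass


@dataclass
class Site:
    name: str
    role: str      # hypothetical roles: "primary", "mirror" or "backup"
    healthy: bool


def choose_recovery_site(sites):
    """Return the first healthy site in order of preference, or None."""
    preference = {"primary": 0, "mirror": 1, "backup": 2}
    for site in sorted(sites, key=lambda s: preference[s.role]):
        if site.healthy:
            return site
    return None


# Both the primary and its mirror have failed, so only the
# traditional backup-based recovery at data centre C remains.
sites = [
    Site("data centre A", "primary", healthy=False),
    Site("data centre B", "mirror", healthy=False),
    Site("data centre C", "backup", healthy=True),
]
chosen = choose_recovery_site(sites)
print(chosen.name)  # data centre C
```

In practice the move to data centre B would be close to seamless, while recovery at data centre C means restoring from backups and takes longer; the sketch only shows the order of preference.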

An IT system that never fails is the ideal scenario but would come with a significant price tag that makes such a solution prohibitive for the majority of organisations. This is why it is important to do a business impact analysis (BIA) to understand exactly what level of downtime your organisation can tolerate, and then look to invest in a solution that delivers that level of resilience.
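To make the link between the BIA and the investment decision concrete, here is a minimal sketch. It assumes the BIA produces a tolerable downtime figure in hours and that each candidate solution has an estimated recovery time and cost; the `Option` class, the function name and all figures are made up for illustration.

```python
from dataclasses import dataclass


@dataclass
class Option:
    name: str
    recovery_hours: float  # estimated time to restore service (illustrative)
    annual_cost: int       # illustrative figure, not a real price


def cheapest_adequate(options, tolerable_hours):
    """Return the cheapest option that recovers within the tolerable downtime."""
    adequate = [o for o in options if o.recovery_hours <= tolerable_hours]
    return min(adequate, key=lambda o: o.annual_cost) if adequate else None


options = [
    Option("mirrored data centre", recovery_hours=0.0, annual_cost=500_000),
    Option("restore from backups", recovery_hours=24.0, annual_cost=50_000),
]

# A business that can tolerate two days of downtime does not need the mirror.
print(cheapest_adequate(options, tolerable_hours=48).name)  # restore from backups
# A business that can tolerate only an hour does.
print(cheapest_adequate(options, tolerable_hours=1).name)   # mirrored data centre
```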

And it’s not just about buying and installing new technology. It begins with a willingness to understand the organisation and invest in improvement, so this needs support at board level, as part of an overall organisational resilience strategy.

Key Resilience Challenges:

What are the key challenges in IT resilience and how can businesses address them?

The IT resilience capability of many organisations has vastly improved over the last 20 years due to many factors. Disk storage and networking are comparatively much cheaper, which enables movement and storage of large amounts of data and makes it more affordable to design for duplication of components and networking. Virtualisation technology has made IT systems and data more fluid across the IT estate they are housed in, rather than being stuck on single servers, and therefore much more resilient to equipment failure. Replication and recovery software is also much more sophisticated now.

This is all really good news but it presents some key challenges:

IT departments can trust the technology so much they stop planning for failure

This means they stop investing the time and effort into arrangements and knowledge for what to do if there’s a serious IT failure and it needs to be recovered from backups.

IT departments can get overly focused on the threat of physical failure

Cyberattack presents a different kind of threat. Going back to our earlier example, if a data centre has a second data centre with a mirror copy of the data, a virus or data corruption is mirrored as well, so the data in both data centres is compromised. The organisation needs to rely on backups stored at the third data centre, and crucially, these need to go back in time far enough, to before the virus or corruption occurred.
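The point about backups needing to go back far enough can be illustrated with a short sketch. It assumes you know (or can estimate) when the corruption started and have a list of backup timestamps; the function name and the dates are hypothetical.

```python
from datetime import datetime


def last_clean_backup(backup_times, corruption_time):
    """Return the most recent backup taken before the corruption, or None."""
    clean = [t for t in backup_times if t < corruption_time]
    return max(clean) if clean else None


# Hypothetical nightly backups taken at midnight on 1-5 January.
backups = [datetime(2024, 1, day) for day in (1, 2, 3, 4, 5)]
corrupted_at = datetime(2024, 1, 3, 14, 30)

print(last_clean_backup(backups, corrupted_at))  # 2024-01-03 00:00:00

# If only the last two backups had been retained, no clean copy would exist.
print(last_clean_backup(backups[-2:], corrupted_at))  # None
```

The second call shows why retention matters: a short retention window can leave you with nothing but backups that already contain the virus or corruption.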

Does the IT department fully understand their IT environment?

They will need that understanding during an IT failure in order to know how to recover it.

Does the IT department fully understand the resilience and recovery of IT systems provided by suppliers?

You need to understand what your suppliers are taking responsibility for and where the responsibility lies with you. Cloud service providers are a good example.

What about resilience in the cloud?

Cloud’s a fantastic thing for performance, agility, and to improve the delivery of IT systems and reduce costs and so on, but ultimately it’s not a standard, or a rubber stamp – it’s a marketing term for remote data centres. I’ve witnessed a worrying complacency among organisations moving to the cloud, that, “it’s the cloud, it will work!” The reality is that you need to investigate what you are buying and know what’s in the contract with your cloud provider. What would they do in a recovery situation for example, what resilience do they have in place? How would they back up data and recover it – and have they tested it? It’s important to observe tests if you can; at least ask them for test results, policy information and to see their incident management plan. Cloud providers may focus entirely on day-to-day projects, technology uptime and incidents, and not think about “bigger picture” technology outages, such as a complete data centre or site failure – it’s important to identify this mind-set (if it exists at the provider).

Managing Change:

How can businesses better understand their current IT environment, considering constant changes in the sector?

Continual technology improvements mean that IT environments are in a constant state of change to try and keep up. For the IT leadership team, it can feel like they’re forever pushing a piano up the stairs while being expected to play a tune! Each time you make it to the next floor, you realise there are more steps to climb.

Imagine you plan to upgrade to a brand new IT environment and network, but by the time you implement it, it’s not brand new anymore, and there are better options out there. This is frustrating for the IT leadership team, but it also means there’s a whole world of work to be done by the IT department to keep pace with change. Hardware upgrades, software upgrades, security patches, new IT servers and services coming online, old ones being retired. While you have the strategic view of where IT needs to go to take the business forward, there’s also so much maintenance work to be done to keep it running.

It’s a bit like living in a house from a horror film where the rooms and hallways and doors keep rearranging themselves. You can draw a map, but you have to keep redrawing it over and over again. It’s really difficult for IT departments to keep a detailed view of the whole IT estate and how it integrates, but it’s also really important to understand this and keep this up to date.

If you’re responsible for IT in some way in your organisation, whether an analyst, manager or CIO, you should ask yourself, “If it failed now, do I know what I need to do to recover it all and restart it?” If you think you’ll need to start with a whiteboard and sticky notes trying to figure it out at the time, that’s bad news. Instead, be aware that a lot of preparation can be done in advance:

Be prepared for resilience:

What are the IT systems and the services they deliver?
What are the servers and hardware?
What are the recovery interdependencies?
What involvement is needed from various IT and end-user teams to recover and validate IT systems?

Answering these questions will help you see where investment is needed to improve resilience.
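One way to prepare the recovery interdependencies in advance is to record which systems must be running before each other system can be recovered, and derive a recovery order from that. The sketch below uses Python's standard `graphlib` module (Python 3.9 or later); the system names and dependencies are hypothetical examples, not a recommended design.

```python
from graphlib import TopologicalSorter

# Hypothetical map: each system lists the systems that must be recovered first.
depends_on = {
    "network": set(),
    "storage": {"network"},
    "database": {"storage"},
    "email": {"network", "storage"},
    "website": {"database"},
}


def recovery_order(dependencies):
    """Return an order in which every system comes after its prerequisites."""
    # static_order() raises graphlib.CycleError if the dependencies are circular.
    return list(TopologicalSorter(dependencies).static_order())


order = recovery_order(depends_on)
print(order)
```

Keeping a map like this up to date is the hard part; the ordering itself is cheap to recompute whenever the IT estate changes.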

Resilience and IT project management

Any significant IT change in your organisation, such as a major IT system upgrade or a new IT service, will most likely be delivered through an IT project. But there are key IT resilience pitfalls that can arise in IT project management, and it’s important to look out for them: once the project is completed, it’s unlikely that the operational budget will have the capacity to fix any gaps.

Avoid these pitfalls:

Has IT resilience or ITDR testing been allocated in the budget?
If not, this needs to be escalated to C-level
Is ITDR testing limited to an isolated test of the IT service only, not an integrated test?
If yes, this needs to be escalated to C-level
Is the project team asking for your sign-off (i.e. it is not self-certifying)?
You should give the team a process to self-certify – your involvement is needed to make sure the proper process is being followed, but don’t let them sidestep responsibility
Are there promises to fix things in a “phase 2” that hasn’t been planned yet?
Phase 2 may not happen! – This needs to be escalated to C-level
Are business continuity and IT continuity staff involved in strategic decision making?
Don’t just involve them as an afterthought

Can you give us an overview of the shifting culture of IT usage and how it applies to a business’ expectations of IT resilience?

I’m old enough to remember that in the 1970s and 80s, when computers first made their way into our homes, there was still some sense of wonder and respect attached to them and what they could do.

However, it seems that over time, the better IT gets and the closer it is to our daily lives, the less impressed we are with it, and the more we expect it to do everything for us with minimal effort.

In our personal lives, we’re now all end users, whether it’s of smartphones, gaming consoles, or tablets. I think that as end-users we’ve become a bit spoilt, and expect IT to just work with little thought or effort on our part.

The problem comes when IT professionals take that mentality into work and apply it to the cloud computing IT services that they use, which may be absolutely core to the organisation.

It’s really important for organisations to not just expect cloud computing to work, and to keep questioning and keep challenging.

For example:

Read the contract to check exactly what the cloud provider is delivering
Make sure you understand the interconnectivity between cloud and all of your other IT systems
Make sure you know how your cloud provider manages backup and recovery of the IT systems
Find out if failover and recovery processes have been thoroughly tested

If no one in your organisation understands the detail and substance of the resilience of your cloud IT services, what’s going to happen if that goes wrong? Are you blindly trusting your cloud provider?

Remember that cloud is a marketing term; it isn’t itself a quality standard. A supplier might be doing an element of cloud badly, or not be doing enough for IT resilience in their cloud environment – so don’t take the cloud for granted!

Top takeaways

Take a step back and think about resilience, not just from a technology perspective but also a wider perspective as part of your organisational resilience
Involve continuity professionals in strategic decisions, for example when considering new platforms and technologies
Consider: what if something serious happened right now, how would the business recover from it?
Be open and transparent about resilience across IT environments, projects and the business
Don’t trust “reliable” technology to the extent you don’t plan for backup and recovery (including physical, virtual and cloud solutions)!
To achieve resilience you need to manage change effectively – keep your “map” updated

About David Davies

David Davies is an award-winning Business Resilience and IT Resilience Consultant at Daisy Corporate Services. He has worked in IT resilience and recovery for more than 20 years, starting in a technical role at IBM looking after data backups and testing disaster recovery on very large enterprise systems. David moved on to project management of disaster recovery testing, then left IBM and has worked in business continuity consultancy for the last 14 years. In that time, David has worked with more than 150 organisations as a resilience consultant, some medium-sized but the vast majority enterprise-sized.