Is the age of cloud computing the end of online backups, or is it just the beginning? (Part 3)

In this 3-part series, we discuss how cloud computing is transforming the IT landscape, whether online backup services will become obsolete as computing moves to the cloud, and what this means for managed service providers.

In part 1, we discussed the benefits and drawbacks of cloud IaaS, and how in the cloud it is critical to expect and plan for downtime of instances, availability zones, and even multi-zone outages. In part 2, we reviewed the different types of cloud storage and risks involved in deploying persistent block storage. Although cloud providers offer volume snapshot services, relying on cloud snapshots alone carries some fairly large risks.

Here are some of the key reasons why snapshots are not a suitable replacement for backups:

Human error: Human error is estimated to be the cause of data loss in 3 out of 4 cases. If an administrator accidentally deletes a cloud volume, how are you going to get the data back? What are the risks of keeping your data all in one ecosystem vs having secondary copies that are authenticated separately? What are the unknown risks of human error in the organization running the cloud? (read more on hidden risks here [PDF])
Snapshot automation, management, and monitoring: Often snapshot services are provided via low-level cloud APIs — what tools will be used to automate and monitor the snapshot process? How will you coordinate taking the snapshots with running applications to ensure application data is in a consistent state when the snapshot is taken? If you’ve striped multiple volumes together into one logical volume, is it possible to have the cloud provider take snapshots of all of the relevant volumes atomically (all at once)? If you have hundreds or thousands of clients, how will you be sure that snapshots are working (or not) for all of your customers? Is there a centralized management and monitoring interface?
Verification of the integrity of snapshots: Snapshots allow you to go backwards in time on the volume, but do not guarantee by themselves that they contain a good copy of your data. How do you know that the snapshots are indeed accessible and the filesystems contained within them are not damaged? How do you know that application data (e.g., SQL) within the filesystem is intact and ready for use?
Long-term data retention: Taking snapshots is only half the battle — old snapshots need to be pruned automatically according to business requirements. How will you automatically enforce data retention policies? Is a tiered retention policy supported? (e.g., retain hourly snapshots for X days, daily snapshots for Y days, weekly snapshots for Z days, etc.) What functions are provided to efficiently export one or more snapshots? Can retention policies easily be customized on a per-volume basis?
Frequency of snapshots: In order to meet your customers’ recovery point objectives (RPOs), how often will snapshots need to be taken? Can your snapshot tooling take snapshots as frequently as every 5 minutes? If so, will it be able to efficiently implement your desired data retention policies?
Time to restore snapshots: In order to meet your customers’ recovery time objectives (RTOs), what guarantees (or even estimates) does your cloud provider make on the time that it takes to restore a snapshot into a new volume? (Note: I haven’t seen any cloud providers make guarantees here. If you have, please let me know!)
Replication of snapshots: Some cloud providers will automatically replicate volume snapshots across regions to provide additional geographical redundancy. However, what visibility do you have into this replication process and how it relates to your RPOs? If an availability zone goes down, and you have to restore from a replicated snapshot in another region, what guarantees do you have on how old that replicated snapshot is? Perhaps you’ll get lucky and your last snapshot replicated before the failure occurred, or you might get unlucky and your cloud provider’s tech support will inform you that they discovered (after the fact) that replication was back-logged and your last replicated snapshot is over 1 week (or 1 month!) old… Will you leave it to luck? If not, how will you monitor replication for all of your customers to ensure that you are meeting your customers’ required RPOs?
Restoring individual files: Volume snapshots are effective for restoring entire volumes, but what tools are provided to mount and browse individual files in snapshots? If your customer says they want a file that got deleted 60 days ago, how much labor will it cost you to get the data back? Hopefully it does not involve using low-level cloud APIs to re-populate a new volume from a snapshot, attach it to a new temporary instance, log in and mount the volume, find the desired file(s), and attempt to copy them back to the production system. This becomes even more complex when multiple volumes are being combined by an instance into a larger logical volume through software RAID.
Software bugs: Bugs have the potential to cause data loss at many different layers in the storage stack (filesystem, device driver, firmware, etc.). The cloud is no different, and it introduces yet another layer. Bugs in cloud providers’ infrastructure have already publicly caused data loss (e.g., 2011 incident). How will you mitigate the risk of volume or snapshot loss caused by buggy cloud code?
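
To make the tiered retention question above concrete (hourly snapshots for X days, daily for Y days, weekly for Z days), here is a minimal Python sketch. It is an illustration only: the function name, its parameters, and the rule of keeping the newest snapshot per calendar day or ISO week are assumptions made for this example, not any cloud provider's API.

```python
from datetime import datetime, timedelta


def snapshots_to_keep(timestamps, now, hourly_days, daily_days, weekly_days):
    """Return the set of snapshot timestamps a tiered policy would retain.

    Hypothetical policy, assumed for illustration:
      - keep every snapshot newer than `hourly_days`
      - keep the newest snapshot of each calendar day up to `daily_days`
      - keep the newest snapshot of each ISO week up to `weekly_days`
      - everything else is eligible for pruning
    """
    keep = set()
    days_covered = set()   # calendar days that already have a snapshot
    weeks_covered = set()  # (ISO year, ISO week) pairs already seen

    # Walk newest-first so the first snapshot seen for a day or week
    # is the newest one in that period.
    for ts in sorted(timestamps, reverse=True):
        age = now - ts
        day = ts.date()
        week = tuple(ts.isocalendar()[:2])

        if age <= timedelta(days=hourly_days):
            keep.add(ts)
        elif age <= timedelta(days=daily_days) and day not in days_covered:
            keep.add(ts)
        elif age <= timedelta(days=weekly_days) and week not in weeks_covered:
            keep.add(ts)

        days_covered.add(day)
        weeks_covered.add(week)

    return keep
```

Anything not in the returned set is a candidate for deletion. A real implementation would still need to call the provider's snapshot APIs, handle failures, and report what it pruned for every customer, which is exactly where the management and monitoring questions above come in.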

An old adage says that “RAID isn’t backup,” and snapshots aren’t either. Cloud snapshots may be suitable as the only backup solution in some special cases (especially for apps built from scratch for the cloud), but they are not suitable for most IaaS customer scenarios. Make sure you have a good answer and a prepared plan when (not if) Murphy’s Law hits your customers in the cloud.
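
As one small piece of such a plan, the RPO questions raised earlier reduce to comparing the age of each volume's newest usable (or replicated) snapshot against that customer's RPO. The Python sketch below shows the idea; the function name, the dictionaries it takes, and the volume names in the usage note are hypothetical, and how you obtain the snapshot times depends entirely on your provider and tooling.

```python
from datetime import datetime, timedelta


def rpo_violations(latest_snapshot_by_volume, rpo_by_volume, now):
    """Return {volume: snapshot_age} for volumes that miss their RPO.

    `latest_snapshot_by_volume` maps a volume name to the datetime of its
    newest known-good snapshot; `rpo_by_volume` maps a volume name to the
    maximum tolerated age as a timedelta. A volume with no snapshot at all
    is reported with an age of None.
    """
    violations = {}
    for volume, rpo in rpo_by_volume.items():
        latest = latest_snapshot_by_volume.get(volume)
        if latest is None:
            violations[volume] = None  # no snapshot recorded for this volume
        elif now - latest > rpo:
            violations[volume] = now - latest
    return violations
```

For example, a volume named `vol-b` with a one-hour RPO whose newest replicated snapshot is a week old would appear in the result with an age of seven days, which is the kind of backlog you want to discover before an outage rather than after.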

Don’t get us wrong — snapshots are very powerful for cloning data volumes and having another layer of protection on your data. We recommend (and so does Amazon) doing both volume snapshots and volume backups (using cloud-aware backup and replication technology), but if you have to choose, our assertion is that cloud-aware cross-cloud backups will provide much better protection against the real risks to your data, and will also drive down your overall operational costs.

When it comes to evaluating data loss risks, there is more to consider than just technology risks:

Vendor lock-in: Once critical data gets stored with one cloud provider, how easily will you be able to switch cloud providers in the future if business requirements and the competitive landscape change?
Security incidents: If a large public cloud gets hacked, what is the risk of data loss if all of your data (including volume snapshots) lives under the same technical umbrella?
Billing disputes: This goes back to human error — what if someone in accounting makes a mistake or a check gets lost in the mail, and they think your account is delinquent, and subsequently delete all of your data stored on their cloud? Sadly, there are stories of this already happening (e.g., one unconfirmed story here [see end of thread]). It’s important to mitigate this risk by having your data live across multiple organizations.
Service shutdowns: It’s not unheard of for services once offered to suddenly be shut down without much notice (e.g., HP discontinued Upline after acquiring it). As unlikely as it seems now, what would you do if you were given 30 days’ notice to get all of your data out of a discontinued cloud (or even no notice at all)? If you’re relying only on snapshot-style backups, how would you efficiently export all of your snapshots out of the cloud?

At eFolder we’ve engineered solutions to protect cloud applications (as well as traditional IT) to effectively address all of the above challenges and risks — we make protecting and recovering apps in the cloud easier, safer, faster, more reliable, and more profitable for MSPs. I’ll write more on how we address the challenges I’ve raised above in a future post.

The cloud is a huge opportunity for MSPs in the IT channel. The cloud sounds easy and carefree to use, but the devil is in the details: it solves some historically thorny challenges and is full of promise for allowing companies to be more agile and cash efficient, but with it comes a host of new risk factors and operational complexities. The opportunity for MSPs is to effectively educate their customers, provide tailored advice, and build managed services packages that leverage the cloud to solve customer pain points while lowering MSP cap-ex and labor costs. The underlying complexity of the cloud and the infinite possibilities of what you can do with it mean that MSPs will be needed more than ever before.
