Budgeting for SQL Server downtime risk

“We need seven nines of availability. By the way, our budget is $14.”

Customers with lofty High Availability (HA) goals and no real budget to support them are, unfortunately, a common scenario for IT consultants who help organizations with business continuity and application protection. High expectations without appropriate funding can leave an organization unprepared when a disaster occurs.

This article describes how your organization can mitigate interruptions that affect mission-critical SQL Server deployments and provide the necessary resources (whether on-premises, cloud, or hybrid) within a realistic budget. We will describe the importance of assessing your business continuity needs and understanding what downtime could mean for your organization. We will also look at the key capital and operating expenses affecting your IT budget and how you can manage them with available technology resources.

Assess your business continuity needs

If SQL Server is an important application for your company, you should try answering these questions:

  • How important is each instance of SQL Server to the business?
  • What is the impact to the business if a SQL Server instance becomes unavailable?
  • In the event of a complete sitewide or regional disaster, how much data loss is tolerable for each instance of SQL Server?

The right answers depend on your understanding of downtime, which in turn drives your decisions about funding a realistic business continuity plan with appropriate technology resources.

Understand what downtime really means

Look at every SQL Server environment and application you have, as well as the associated dependencies and risks. Some SQL Server instances may require High Availability, whereas others might not. Determine the costs for every minute of downtime. If your SQL Server application is unavailable for an hour, a day, or even a week, what impact will that have on your customers and your company?
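To make the per-minute question concrete, here is a minimal sketch of the arithmetic. The function name and the dollar rates are hypothetical placeholders, not figures from any real deployment, and the estimate covers only direct costs, not harder-to-quantify effects such as reputational damage.

```python
def downtime_cost(minutes_down, revenue_per_minute, staff_cost_per_minute=0.0):
    """Estimate the direct cost of an outage lasting `minutes_down` minutes.

    Both rates are assumptions you supply from your own business data.
    """
    return minutes_down * (revenue_per_minute + staff_cost_per_minute)


# Hypothetical rates: $500/min in lost sales, $100/min in idle staff time.
outages = [("1 hour", 60), ("1 day", 24 * 60), ("1 week", 7 * 24 * 60)]
for label, minutes in outages:
    print(f"{label}: ${downtime_cost(minutes, 500, 100):,.0f}")
```

With these made-up rates, a one-hour outage costs $36,000 and a full day costs $864,000, which is the kind of number that puts an HA budget request in perspective.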

Have you experienced an outage? If not, you may not really understand how deep the potential risks can be, such as:

  • Lost sales revenue
  • Lost employee productivity
  • Corrupted mission-critical data
  • Damaged equipment
  • Damaged relationships with customers and stakeholders
  • Degraded employee morale
  • Regulatory compliance and legal penalties
  • Litigation fees
  • Lost insurance discounts
  • Contract penalties
  • A disrupted supply chain

Separate your applications into different RTO/RPO groups

Your applications may have different availability requirements. Put them into groups that correspond to their required recovery time objective (RTO) and recovery point objective (RPO). The more stringent your RTO and RPO, the more costly the HA solution will be. For each SQL Server instance, determine what level of availability meets your needs: high, medium, or low, corresponding roughly to recovery measured in minutes, hours, or days.
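The grouping can be sketched as a simple classification. The `Instance` type, the `availability_tier` function, and the one-hour and one-day thresholds are all assumptions made for illustration; your own cutoffs should come from the business impact analysis.

```python
from dataclasses import dataclass


@dataclass
class Instance:
    name: str
    rto_minutes: float  # maximum tolerable time to recover
    rpo_minutes: float  # maximum tolerable window of data loss


def availability_tier(inst):
    """Map an instance to a tier using the stricter of its RTO and RPO.

    Thresholds are hypothetical: under an hour is "high",
    under a day is "medium", anything longer is "low".
    """
    strictest = min(inst.rto_minutes, inst.rpo_minutes)
    if strictest < 60:
        return "high"
    if strictest < 24 * 60:
        return "medium"
    return "low"


inventory = [
    Instance("orders", rto_minutes=5, rpo_minutes=1),
    Instance("reporting", rto_minutes=240, rpo_minutes=120),
    Instance("archive", rto_minutes=3 * 24 * 60, rpo_minutes=24 * 60),
]
for inst in inventory:
    print(inst.name, "->", availability_tier(inst))
```

Using the stricter of the two objectives is one possible design choice; it errs on the side of more protection when RTO and RPO point to different tiers.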

For SQL Server instances requiring a high level of availability, there are several options. The traditional choice in a Windows environment is to deploy a SQL Server failover cluster instance (FCI). In a SQL Server FCI, the cluster data is stored on a SAN or other shared storage device. Another option is to use Always On Availability Groups, which eliminate the need for a SAN and allow both HA and DR configurations that can span multiple data centers. A third option is to use a replication product that lets you build a SQL Server FCI without a SAN by replicating local disks for use in the cluster.

Applications requiring a medium level of availability may be sufficiently supported by running in VMware with VMware HA, or with another hypervisor-based solution. This approach will protect them from hardware failures, but it will not guard against unplanned downtime caused by software-level failures (e.g., VM or system update failures) or planned downtime for SQL Server instance maintenance. If this does not meet your availability needs, or you are not using VMware HA, you will need to consider budgeting for the high-availability solutions, which are more expensive.

Applications requiring a low level of availability can be managed with a simple but effective backup solution. Ensure that good backups are scheduled, and occasionally test the restore function to make sure it works. Also, store a copy of your backups offsite for disaster recovery. Use either SAN replication or software-based replication to replicate backups offsite.

In each of these availability scenarios, your business continuity requirements should help map out the resources you will need to reduce the risks and costs of downtime.

Create your downtime budget

Base your budget planning on a good understanding of the costs of downtime. Site Reliability Engineering (SRE), a discipline that originated at Google, approaches this analysis with the philosophy that every application should have a downtime budget that also covers interdependent applications. Your budget should balance the required level of protection against cost. This includes understanding the limitations of cloud service SLAs and where you can reduce costs without sacrificing your availability requirements.
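A downtime budget follows directly from an availability target. The sketch below converts a target percentage into allowed downtime per year; the function and constant names are illustrative, and the calculation assumes a 365-day year.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes, ignoring leap years


def allowed_downtime_minutes(availability_percent):
    """Return the yearly downtime budget, in minutes, for a given target."""
    return MINUTES_PER_YEAR * (1 - availability_percent / 100)


for target in (99.0, 99.9, 99.99, 99.999, 99.99999):
    minutes = allowed_downtime_minutes(target)
    print(f"{target}% -> {minutes:.3f} minutes per year")
```

Three nines (99.9 percent) allows about 525.6 minutes, or roughly 8.8 hours, of downtime per year. The "seven nines" in the opening quote allows only about 3 seconds per year.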

Cloud service SLAs

A good practice is to combine the SLAs of all the cloud services you are using and do the math to get a better perspective on what your true end-to-end service level is for high availability.

For example, if you are using Active Directory managed services, there are SLAs associated with that. If you are using cloud storage, its SLA may differ from the SLA of the cloud compute instances themselves. If you are connecting from on-premises to the cloud, there is an SLA associated with the network that connects you to the cloud. If any one of these services is down, your application may be unavailable even though every other service is meeting its SLA.
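The math can be sketched as follows, assuming the application needs every service at once and that the services fail independently (a simplification). The SLA percentages are hypothetical and are not quoted from any provider.

```python
from math import prod


def composite_availability(sla_percents):
    """Multiply the availabilities of services that must all be up."""
    return 100 * prod(p / 100 for p in sla_percents)


# Hypothetical SLAs for the services an application depends on.
slas = {
    "compute": 99.99,
    "storage": 99.9,
    "managed directory": 99.9,
    "network link": 99.95,
}
combined = composite_availability(slas.values())
print(f"Composite availability: {combined:.4f}%")
```

With these example figures the composite comes to roughly 99.74 percent, noticeably lower than any single service's SLA. Each additional dependency can only lower the result.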

SQL Server licenses

One of the biggest costs in your budget can be the SQL Server licenses you use. For example, the Enterprise Edition can cost almost $7,000 a core. However, many of the features previously available only in the Enterprise Edition are now available in the lower-priced Standard Edition at a cost that is as much as 80 percent less. Although some situations still require the Enterprise Edition, many use cases can be satisfied by the Standard Edition.

Also, for all supported versions of SQL Server, if you buy Software Assurance, you get a free HA copy and a free DR copy of the license. You can have three VMs running in two different data centers or cloud regions, and you need to license only one. Using the Standard Edition instead of the Enterprise Edition significantly reduces license expenses, and Software Assurance allows HA and DR configurations with no additional SQL Server licensing costs.
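A rough comparison of these licensing scenarios might look like the sketch below. It uses the article's approximate figures ($7,000 per core for Enterprise, Standard at 80 percent less) as illustrative constants, assumes an 8-core server, and deliberately leaves out the cost of Software Assurance itself, which you would need to add from your own quote.

```python
ENTERPRISE_PER_CORE = 7000  # approximate figure, for illustration only
STANDARD_PER_CORE = ENTERPRISE_PER_CORE * (100 - 80) // 100  # "80 percent less"


def license_cost(per_core, cores, licensed_servers=1):
    """License cost when `licensed_servers` servers each need a full license."""
    return per_core * cores * licensed_servers


cores = 8  # hypothetical server size
scenarios = {
    "Enterprise, 3 servers licensed": license_cost(ENTERPRISE_PER_CORE, cores, 3),
    "Enterprise, 1 licensed (HA/DR copies free)": license_cost(ENTERPRISE_PER_CORE, cores),
    "Standard, 1 licensed (HA/DR copies free)": license_cost(STANDARD_PER_CORE, cores),
}
for name, cost in scenarios.items():
    print(f"{name}: ${cost:,}")
```

Under these assumptions, the spread runs from $168,000 for three fully licensed Enterprise servers down to $11,200 for a single Standard license covering the primary.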

Also consider your secondary systems when running in a cloud environment. If you can tolerate a slightly higher RTO, perhaps you can downsize your HA and DR secondary systems. If you need to resize them later, that is easier to do in the cloud than on-premises.

Capital versus operating costs

Budgeting for downtime is different for on-premises versus cloud SQL Server instances. On-premises expenses are capital expenses (CAPEX), where you need to consider the lifecycle of the hardware. Cloud expenses are operating expenses (OPEX): you pay as you go and have greater flexibility to resize those resources and costs.

For example, suppose you are purchasing a server and plan for 30 percent growth year over year. You would typically size that server for three years. If you are running in the cloud, however, you might size it for the next six months. If growth goes up or down, you can resize at any time, with a corresponding adjustment in expense.
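The sizing difference is easy to quantify with compound growth. In this sketch the function name and the baseline of 100 capacity units are hypothetical; only the 30 percent growth rate and the two planning horizons come from the example above.

```python
def projected_capacity(current, annual_growth, years):
    """Capacity needed after `years` of compound growth at `annual_growth`."""
    return current * (1 + annual_growth) ** years


baseline = 100  # hypothetical capacity units needed today
on_prem = projected_capacity(baseline, 0.30, 3)    # three-year purchase horizon
cloud = projected_capacity(baseline, 0.30, 0.5)    # six-month cloud horizon
print(f"On-premises sizing: {on_prem:.1f} units")
print(f"Cloud sizing:       {cloud:.1f} units")
```

At 30 percent annual growth, a three-year horizon means buying about 2.2 times today's capacity up front, whereas a six-month horizon needs only about 14 percent of headroom.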

This is also true for software. You may think about bringing your own license to the cloud, but consider the benefit of simply paying as you go.

Another cost to consider is refactoring your business processes to take advantage of cloud-native features. There will be an up-front cost. However, converting legacy business processes into smaller components supported by new-generation, cloud-native applications can greatly speed IT deployment of new functionality, with potentially better price and performance.

Visibility into IT expenses

With the cloud, you gain a lot of flexibility and tools to resize resources up or down to support changing business continuity requirements and to control costs. You also have greater visibility into your IT spending. For example, your organization gets a single bill with the cost of each resource clearly identified. If tags are used appropriately, those costs can be charged back to the exact cost centers that use those resources. It can also be easier to home in on applications that are expensive and underused.
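As a minimal sketch of tag-based chargeback, the code below totals billing line items by a tag. The record layout, the `cost_center` tag key, and the sample amounts are all hypothetical; real cloud billing exports have their own formats.

```python
from collections import defaultdict


def chargeback(line_items, tag_key="cost_center"):
    """Sum costs per tag value; items missing the tag land in 'untagged'."""
    totals = defaultdict(float)
    for item in line_items:
        owner = item["tags"].get(tag_key, "untagged")
        totals[owner] += item["cost"]
    return dict(totals)


# Hypothetical monthly bill, already parsed into dictionaries.
bill = [
    {"resource": "sql-primary", "cost": 1200.0, "tags": {"cost_center": "sales"}},
    {"resource": "sql-dr", "cost": 300.0, "tags": {"cost_center": "sales"}},
    {"resource": "sql-reporting", "cost": 450.0, "tags": {"cost_center": "finance"}},
    {"resource": "test-vm", "cost": 80.0, "tags": {}},
]
for owner, total in chargeback(bill).items():
    print(f"{owner}: ${total:,.2f}")
```

The "untagged" bucket is worth watching: spending that nobody claims is often where expensive, underused resources hide.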

This can be a challenge with on-premises systems because a lot of those costs are blended or sewn together and not easily identified. Why not optimize this picture? Pay for the availability resources that you absolutely need for your business continuity and save costs on unproductive resources.

Dave Bermingham, Senior Technical Evangelist, SIOS Technology

Dave Bermingham is the Senior Technical Evangelist at SIOS Technology. He holds numerous technical certifications and has been elected a Microsoft MVP for both Clusters and Cloud & Datacenter Management.