Business Time !!?

availability • Mar 31, 2023

Hot on the heels of talking about platform updates, let's talk some business time!! Now, business time can mean a whole lot of different things to different people in different situations, so let's get specific and talk context...

As much as I love Flight of the Conchords, we're not talking their type of business time. We're talking about upgrades, patching, expansions, replacements, and all that good stuff that happens behind the scenes!

Typically this type of work happens out of "business time", or business hours if you prefer. We do this because we want to minimise any risk of disruption to workloads running on the platform, and that seems like a perfectly good approach... or does it? 🤔 Now don't get me wrong, we absolutely want to, and actively DO, manage risk in our environment and that's not about to change. As a service provider we're absolutely risk averse when it comes to managing our customer workloads, given they are our lifeblood!

Currently we schedule maintenance windows (broadly speaking) into two specific times a week: Wednesday and Sunday evenings. From time to time we'll schedule emergency windows for more pressing issues as the need arises. Again, it all comes back to availability, performance expectations and managing risk for our customers. ⚖️
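For the curious, here is a rough sketch of how that twice-weekly schedule could be worked out in code. It is only an illustration: the 22:00 start time and the function name are my assumptions (the schedule above only says "evenings"), and it is not how our tooling actually does it.

```python
from datetime import datetime, time, timedelta

# Assumption: windows open at 22:00 local time. The post only says "evenings".
WINDOW_START = time(22, 0)
# datetime.weekday() counts Monday as 0, so Wednesday is 2 and Sunday is 6.
WINDOW_DAYS = {2, 6}


def next_maintenance_window(now: datetime) -> datetime:
    """Return the start of the next Wednesday or Sunday evening window after `now`."""
    # Looking 8 days ahead always covers at least one future window.
    for offset in range(8):
        day = (now + timedelta(days=offset)).date()
        candidate = datetime.combine(day, WINDOW_START)
        if day.weekday() in WINDOW_DAYS and candidate > now:
            return candidate
    raise RuntimeError("no window found within 8 days")


# Friday morning, 31 March 2023: the next window is Sunday evening.
print(next_maintenance_window(datetime(2023, 3, 31, 9, 0)))  # 2023-04-02 22:00:00
```

Anything that can't wait for the next slot is what the emergency windows are for.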

What I'm wondering here is whether this type of thinking is perhaps a little antiquated. Not the availability, performance and risk parts... but the scheduling piece. "Back in the day" working hours were typically an 8am to 5pm type thing, where this all makes perfect sense. But the businesses of today are a more complicated beast, with many demanding access to resources for considerably wider portions of the day, and some even 24/7! If we consider these types of changes, when is a good time to schedule that maintenance window and do the work?

So how about we run through a couple of scenarios and how they might be handled:

a failed memory module in a host
a failed component in a storage array
the upgrade of a storage array (software)

A failed memory module in a host

Occasionally (not that frequently, fortunately!) hardware failures like this can occur. Depending on the severity of the failure, it could result in an HA event or possibly just an alert. An HA event results in all the virtual machines on that host being restarted on another host in the cluster 😠 - not ideal, but the expected behaviour (along with automated notifications to all impacted customers courtesy of MyCloudSpace). An alert, on the other hand, means we have the opportunity to evacuate the VMs cleanly (with no disruption) and then remove the host from production. A support case will be automatically logged with the hardware vendor in question. Once the host is removed from production, failed parts can be replaced as and when needed without impact or risk. ✅

In this case, once the host has been removed from production there is no risk, so it can be worked on at any time. No need to schedule a maintenance window (out of hours or even in business hours)!

A failed component in a storage array

Again, not something that happens all that often fortunately! In this case our storage arrays are highly redundant and the failure of a single component (like a DirectFlash module for example) won't impact the underlying workloads running on the platform. A critical fault would be logged automatically with the storage vendor and a case created to replace the failed part. Redundancy in the platform allows normal operation to continue during this process. However, until such time as the failed part is replaced we do have additional risk (however small it is) because the storage platform is still serving production workloads. Therefore, any failed parts are replaced ASAP and as a priority.

In this case, an emergency maintenance window would be scheduled for the work and it would be done as soon as the parts were available in order to minimise our exposure. This replacement takes place without workload disruption of course!

The upgrade of a storage array

Now this third scenario is a slightly different one, and probably the one I most want to consider. This process is non-disruptive, remotely performed by our vendor, and is something that they perform hundreds of times a day. Now traditionally, and in actual fact still currently, this is something we schedule out of hours. In all the times we've been through this process we've never had an issue, and there is no performance impact during the process. 🙏🏼

The key difference between this and, say, upgrading a vSphere host is that we can't take the entire array out of production in order to perform the upgrade.

Could we perform this during business hours? Well, I think we probably could, given our confidence in the process. The key here is whether we think performing this during business hours introduces more risk or not. If the answer is yes, then that's a non-starter, obviously! Additionally, if we thought there was any potential impact to performance, again that would be a no-go!

So, are there any benefits then to doing it during business hours? Probably the biggest benefit I could see would be an increased ability to respond should an unexpected issue occur. Obviously during business hours we have a far larger team available to react should the need arise. Ultimately, we need to carefully evaluate any maintenance and its potential to impact the services we deliver before deciding on the best time to perform it.
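Pulling the three scenarios together, the thinking could be sketched roughly like this. Treat it as a thought experiment rather than a real policy engine: the class, field and enum names are all made up for illustration, and real decisions involve far more judgement than four booleans.

```python
from dataclasses import dataclass
from enum import Enum


class Timing(Enum):
    ANYTIME = "no maintenance window needed"
    EMERGENCY = "emergency window as soon as parts are available"
    OUT_OF_HOURS = "scheduled out-of-hours window"
    BUSINESS_HOURS_CANDIDATE = "could be considered during business hours"


@dataclass
class MaintenanceJob:
    # All fields are hypothetical, chosen to mirror the scenarios in this post.
    name: str
    removed_from_production: bool       # e.g. a host that has been evacuated
    running_degraded: bool              # failed part, still serving production
    more_risk_in_business_hours: bool   # would daytime work add risk?
    performance_impact: bool            # any expected impact on performance?


def choose_timing(job: MaintenanceJob) -> Timing:
    """Map a job onto the timing options discussed in the three scenarios."""
    if job.removed_from_production:
        return Timing.ANYTIME
    if job.running_degraded:
        return Timing.EMERGENCY
    if job.more_risk_in_business_hours or job.performance_impact:
        return Timing.OUT_OF_HOURS
    return Timing.BUSINESS_HOURS_CANDIDATE


host = MaintenanceJob("failed memory module", True, False, False, False)
array_part = MaintenanceJob("failed array component", False, True, False, False)
upgrade = MaintenanceJob("array software upgrade", False, False, False, False)

for job in (host, array_part, upgrade):
    print(f"{job.name}: {choose_timing(job).value}")
```

The interesting case is the last one: the upgrade only lands in the business hours bucket if we're honestly confident that both of those last two flags are false.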

As a service provider, the availability, reliability, and performance of our platform are of utmost importance and dictate how we operate. We recognise the constant need for patching, updating, and expanding our platform, but we understand that these actions must be balanced with maintaining the platform's reliability and dependability. Additionally, risk mitigation remains a top priority for us. Our customers have high expectations for a stable and robust platform, and we strive to deliver precisely that.