So, what does “Production Ready” mean here at vBridge? Well, it means a whole lot of process, design and component testing, documentation, monitoring and peer review. Obviously, we need to be completely confident that we’re not only adding to our capacity/capability but also not going to disrupt existing services.
With platform growth alongside general equipment lifecycles, we’re pretty much constantly adding equipment to and removing it from the platform. Commissioning new equipment starts at the requirements/design phase.
Are we simply adding capacity (compute, storage, network, etc.) or are we introducing new capability? Adding capacity is generally straightforward, as is removing equipment that has reached the end of its lifecycle. Introducing a new capability, on the other hand, means ensuring we can do this in a highly redundant, non-disruptive fashion, without introducing unquantified risk. Hardware does fail, unfortunately, so steps have to be taken at both the hardware and software layers to mitigate the impact of any such failure. These mitigations could be as simple as having redundant components (like dual power supplies in a host), or multiple paths to storage or networks, but all are important. At the hypervisor layer, features like VMware’s vMotion, Distributed Resource Scheduler (DRS), and High Availability (HA) all play an important part and are table stakes for any IaaS provider. It is the combination of these hardware and software protections that results in a rock-solid platform and a consistently great experience for our customers.
From a design perspective it’s important to step back and take a high-level view - not only do we need to ensure the functionality, performance and availability we want are being provided, but also that we’re adhering to industry best practice. It’s worth noting that best practice is highly dependent on context. Ultimately there is no one-size-fits-all answer (or design) here, but best practices absolutely can be used as a framework on which to build.
This part of the process is actually almost fun! There is something quite satisfying about physically pulling components from hosts, storage arrays, switches and the like just to see if they continue to function as designed! When the proverbial hits the fan you really want to know your workload will continue working and will simply move to another host, use a different path, or “self-heal” in some similar way. That’s what this testing phase is for. We do things like:
· pull I/O modules
· pull Management modules
· disconnect ethernet/fibre cables
· pull ethernet modules
· shut down Fibre Channel switches
· pull Power Supplies
· power off entire hosts
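The idea behind these tests can be sketched in a few lines of Python: for every group of redundant components, check whether losing one still leaves enough to keep the workload running. This is a minimal illustration only. The `RedundancyGroup` class, the function names and the component counts are all assumptions made for the example, not a description of any actual tooling.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RedundancyGroup:
    """A set of interchangeable components, e.g. the power supplies in one host."""
    name: str
    installed: int  # how many components are fitted
    required: int   # how many must be healthy for the workload to keep running


def survives_single_failure(group: RedundancyGroup) -> bool:
    """True if the group still meets its requirement after losing one component."""
    return group.installed - 1 >= group.required


def single_points_of_failure(groups: list[RedundancyGroup]) -> list[str]:
    """Names of groups where pulling one component would take the workload down."""
    return [g.name for g in groups if not survives_single_failure(g)]


# Hypothetical inventory for one host; the counts are illustrative only.
host = [
    RedundancyGroup("power supplies", installed=2, required=1),
    RedundancyGroup("storage paths", installed=2, required=1),
    RedundancyGroup("management modules", installed=1, required=1),
]

print(single_points_of_failure(host))  # ['management modules']
```

A check like this only tells you what should happen on paper. Physically pulling the component is what confirms the design behaves that way in practice.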
A very important component of all this is Change Control. The change control process ensures that when we introduce change to the platform, we do it in a very controlled manner. Not only does the change have to be well documented and thought through, but we also require pre- and post-checks along with a rollback plan should things not follow the script. Peer review and approval of changes is of course part of the process.
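As a rough sketch of what those requirements look like when written down, the Python below models a change record and lists whatever is still missing before it could go ahead. The `ChangeRequest` fields and the `blocking_issues` function are hypothetical names chosen for the example, and the rule that the approver must differ from the author is an assumption about what peer review implies.

```python
from dataclasses import dataclass, field


@dataclass
class ChangeRequest:
    """Hypothetical change record; field names are illustrative."""
    summary: str
    author: str
    pre_checks: list[str] = field(default_factory=list)
    post_checks: list[str] = field(default_factory=list)
    rollback_plan: str = ""
    approved_by: str = ""


def blocking_issues(change: ChangeRequest) -> list[str]:
    """Return the reasons this change is not yet ready to be scheduled."""
    issues = []
    if not change.summary.strip():
        issues.append("missing description")
    if not change.pre_checks:
        issues.append("missing pre-checks")
    if not change.post_checks:
        issues.append("missing post-checks")
    if not change.rollback_plan.strip():
        issues.append("missing rollback plan")
    if not change.approved_by.strip():
        issues.append("missing peer approval")
    elif change.approved_by == change.author:
        # Assumption: a peer review needs a second person.
        issues.append("approver must not be the author")
    return issues


draft = ChangeRequest(
    summary="Replace failed PSU in host 12",
    author="alice",
    pre_checks=["confirm second PSU is healthy"],
    post_checks=["confirm both PSUs report OK"],
)
print(blocking_issues(draft))  # ['missing rollback plan', 'missing peer approval']
```

An empty list means every box has been ticked. Anything else is a reason to stop and finish the paperwork first.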
For consistency’s sake we have an ever-growing library of Standard Operating Processes, or SOPs. Because many of the changes we make are repeated often, this helps to maintain consistent and predictable results, which is essential. Despite this, there are times when it’s difficult to completely eliminate all risk, so we address these changes in two ways: typically, we schedule them outside (standard) business hours, and we notify our customers of the upcoming maintenance window. We use Statuspage for these notifications, which allows our customers to subscribe to alerts via SMS and/or email. Statuspage is also integrated and accessible directly from our MyCloudSpace (MCS) portal.
We do have two windows for standard non-notified changes: Wednesday and Sunday evenings. While these are non-notified, they are still visible on Statuspage or via MCS.
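Checking whether a proposed time falls inside one of these standard windows is simple to express in code. The sketch below is an illustration under stated assumptions: the days come from the text above, but the start and end hours are made up for the example, as is the function name.

```python
from datetime import datetime, time

# datetime.weekday() counts from Monday = 0, so 2 is Wednesday and 6 is Sunday.
WINDOW_DAYS = {2, 6}

# Assumed hours: "evening" is not defined in the text, so these are placeholders.
WINDOW_START = time(18, 0)
WINDOW_END = time(23, 0)


def in_standard_window(when: datetime) -> bool:
    """True if `when` (local time) falls inside a standard change window."""
    return when.weekday() in WINDOW_DAYS and WINDOW_START <= when.time() < WINDOW_END


print(in_standard_window(datetime(2024, 1, 3, 19, 30)))  # Wednesday evening: True
print(in_standard_window(datetime(2024, 1, 4, 19, 30)))  # Thursday evening: False
```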
Documentation plays a vital part in being able to support any platform. There really is no point having a world-class platform if nobody can find out how it’s all put together! For this reason, we use digital systems for documenting our platforms. With them we can quickly search and find details like how things are connected, where they’re located, how we access them, etc.
Monitoring and Alerting
All these things mean very little without effective monitoring and alerting. It’s all well and good for a host to lose a PSU or network/storage path and keep functioning, but if we don’t know about it, we can’t resolve the underlying failure, which in itself is another risk. Devices are generally configured with both email and SNMP alerting to ensure we know about failures. This is great for maintaining platform stability but not so great for sleep if you happen to be the on-call engineer when something needs attention!
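The key point is that “still working” and “healthy” are not the same thing. A minimal Python sketch of that distinction follows. The `Status` values, function names and counts are assumptions for illustration and do not reflect how any particular monitoring product models it.

```python
from enum import Enum


class Status(Enum):
    HEALTHY = "healthy"    # every installed component is working
    DEGRADED = "degraded"  # still serving, but redundancy has been lost
    DOWN = "down"          # not enough components left to serve


def classify(healthy: int, installed: int, required: int) -> Status:
    """Classify a redundant component group from its healthy count."""
    if healthy < required:
        return Status.DOWN
    if healthy < installed:
        return Status.DEGRADED
    return Status.HEALTHY


def should_alert(status: Status) -> bool:
    """Alert on anything short of fully healthy, including 'degraded but up'."""
    return status is not Status.HEALTHY


# A host with two PSUs that needs one: losing a PSU keeps it up but degraded.
psu_status = classify(healthy=1, installed=2, required=1)
print(psu_status.value, should_alert(psu_status))  # degraded True
```

Alerting on the degraded state is what gives the on-call engineer the chance to restore redundancy before a second failure turns it into an outage.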
We use a number of different products for monitoring, including (but not limited to) the following:
· PagerDuty
We are currently working on a SIEM-type solution to integrate these all together, as they can generate quite a bit of noise, as I’m sure you can imagine!
So, as you can see, being production ready involves a whole lot of work for us, but that’s probably why you choose to work with us: we take care of all that niggly stuff so you don’t have to.