Infrastructure Testing
Most critical infrastructure is built with some form of redundancy. This usually takes the form of physical hardware, where two or more devices form a cluster/pool, or geographic diversity with datacenter failover. In a failure event, the system should seamlessly fail over to the other devices in the cluster. This failover should be tested periodically (ideally biannually or annually); without periodic testing you should assume that your redundancy may not work as intended, as systems change over time and failover mechanisms may behave differently.
When the infrastructure is first installed and not yet in production, there should be extensive failover testing. You may want to create an Excel sheet of failure scenarios, both physical and logical, and document the impact/behavior of each: did the system fail over as expected, was there packet loss, did it fail over within the expected time, and so on.
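As a rough sketch of how such a test matrix could be started, the Python below writes a skeleton plan to a CSV file that can be opened in Excel and filled in during the test window. The scenario names, column headings and file name are only examples, not a definitive list.

    # Illustrative only: a minimal failover test matrix written to CSV so it can
    # be opened in Excel. The scenarios and fields here are example assumptions.
    import csv

    FIELDS = ["scenario", "type", "expected behavior", "observed behavior",
              "packet loss", "failover time", "pass/fail"]

    SCENARIOS = [
        {"scenario": "Pull primary power feed", "type": "physical",
         "expected behavior": "Second feed/PSU carries load, no impact"},
        {"scenario": "Remove one LAG member link", "type": "physical",
         "expected behavior": "Traffic rehashes to remaining links, minimal loss"},
        {"scenario": "Fail active supervisor", "type": "physical/logical",
         "expected behavior": "Standby supervisor takes over within expected time"},
    ]

    with open("failover_test_plan.csv", "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=FIELDS)
        writer.writeheader()
        for row in SCENARIOS:
            writer.writerow(row)  # remaining columns are filled in during testing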
Some test examples:
· Removing the primary power feed and power supplies to verify power feed and supply redundancy.
· Removing supervisor cards to verify supervisor failover in dual-supervisor chassis-based systems.
· Removing line cards if you have a chassis-based switch with line cards.
· Removing network links in a LAG/Aggregated Ethernet bundle.
· Removing a hard drive to ensure that a storage array/RAID recovers and rebuilds.
· If you have redundant racks, a complete shutdown of the primary/active rack.
· Geographic site testing: failing over to another datacenter, though some configuration is sometimes required to bring up the secondary site.
· Checking for packet loss or errors on any network interface under network load; this may be due to dirty fibre ends at installation (a simple end-to-end measurement sketch follows this list).
· Checking that VMware evacuates guests when maintenance mode is triggered on a host.
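To give an idea of how impact can be measured while one of these tests is triggered, the rough Python sketch below pings a target once a second and reports lost replies and the longest continuous outage, which approximates failover time. The target address, duration and probe interval are placeholders; point it at a service VIP or cluster address before you pull the cable, and it assumes a Unix-like ping that accepts -c and -W.

    # Illustrative sketch only: measure end-to-end packet loss and the longest
    # continuous outage while a failover test is performed.
    import subprocess, time

    TARGET = "192.0.2.10"     # placeholder: cluster/service VIP under test
    DURATION_S = 120          # how long to probe while the failover is triggered

    sent = lost = 0
    longest_outage = 0.0
    outage_start = None

    end = time.monotonic() + DURATION_S
    while time.monotonic() < end:
        sent += 1
        ok = subprocess.run(
            ["ping", "-c", "1", "-W", "1", TARGET],
            stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL,
        ).returncode == 0
        now = time.monotonic()
        if ok:
            if outage_start is not None:
                longest_outage = max(longest_outage, now - outage_start)
                outage_start = None
        else:
            lost += 1
            if outage_start is None:
                outage_start = now
        time.sleep(1)

    if outage_start is not None:  # still down when the probe window ended
        longest_outage = max(longest_outage, time.monotonic() - outage_start)
    print(f"sent={sent} lost={lost} longest outage ~= {longest_outage:.1f}s")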
Once a failover test has occurred, failback also needs to be tested. Finally, it is worth testing that removal of the passive/secondary device has no impact on the primary; this may seem trivial, but it can expose a bug or design caveat.
Failover timers can be adjusted to suit the environment, but the lower/more aggressive the timers, the greater the chance of flapping causing a fault, especially in the network space. Finding the right balance between faster failover and not creating a fault can take some trial and error.
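As a rough illustration of that trade-off (a generic BFD-style model, not any vendor's implementation), the Python sketch below treats detection time as the probe interval multiplied by a detect multiplier, and shows how a short transient blip either trips the timer or rides through it depending on how aggressive the interval is.

    # Illustrative only: BFD-style failure detection, where
    # detection time = receive interval x detect multiplier.
    # Lower intervals mean faster failover, but short transient losses are
    # more likely to trigger an unnecessary failover (flapping).

    def detection_time_ms(rx_interval_ms: int, detect_multiplier: int) -> int:
        """Time with no received hellos/keepalives before the peer is declared down."""
        return rx_interval_ms * detect_multiplier

    def would_fail_over(loss_ms: int, rx_interval_ms: int, detect_multiplier: int) -> bool:
        """True if a loss of loss_ms is long enough to trip the detection timer."""
        return loss_ms >= detection_time_ms(rx_interval_ms, detect_multiplier)

    # Example: a 500 ms transient blip against three timer profiles.
    for interval in (50, 300, 1000):  # milliseconds
        print(interval, would_fail_over(500, interval, 3))
    # 50 ms x 3   = 150 ms  -> fails over (may flap on a noisy link)
    # 300 ms x 3  = 900 ms  -> rides through the blip
    # 1000 ms x 3 = 3000 ms -> rides through, but a real failure takes 3 s to detect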
Failover testing may also happen through external forces: datacenters may periodically test, or drop a power feed for maintenance. Patching/upgrading of software/firmware often requires the system to go through a failover event to perform the upgrade; whether you count these events as "periodic testing" is up to you.
Once in production, failover testing obviously becomes a lot harder (and more stressful) because you are now testing a production environment, so planning and communications are required. With production testing you have production workloads active, so a test failover gives you a true result of the testing and of the impact to the end user.
Though periodic failover testing is "uncomfortable", you need to ask the question: would you rather have an unplanned outage at 9am on a Monday because a failover was unsuccessful after a hardware fault, or a planned test at 2am with few to no users on the system?