Mark Gaydos, Chief Marketing Officer, Nlyte Software
In 2017, data center failures around the world became big news. The British Airways outage in May, which caused the cancellation of over 400 flights and stranded 75,000 passengers, cost the company an estimated $112 million in refunds and compensation. That figure does not include the cost of reputational damage or the productivity lost during the downtime.
It later came to light that this outage was caused by a simple mistake made by one person – an engineer working at Heathrow, who disconnected and reconnected a power supply. This restarting action caused a power surge which took down not only the primary data server site, but the backup site as well.
The British Airways incident is just one example of how fragile our IT and computing infrastructure can be. Depending on the source of the statistics, human error is the culprit in 22% to 38% of data center outages. Other top causes of downtime include UPS failure, overheating or CRAC (computer room air conditioning) failure, weather and, in some cases, generator failure.
The costs associated with data center downtime can rapidly accumulate to hundreds of thousands of dollars per incident, and more in the case of financial market outages. As data centers increase in complexity and start to include more remote processing locations, assuring uptime becomes more challenging because there is more to monitor in more places.
The good news is that most data center outages are preventable, especially if data center managers have better insight into operations, which improves reaction time.
A Data Center Infrastructure Management (DCIM) solution gives these managers that insight by providing visibility into all operations, which significantly reduces the risk of downtime.
Here are some examples of risks that can be easily reduced with a DCIM solution:
A DCIM solution provides real-time temperature monitoring throughout a facility. This makes spotting hot spots in the computing infrastructure as simple as looking at a dashboard showing a real-time heat map. With this knowledge, a data center manager can rearrange equipment, shift load or simply adjust the speed of a fan to remediate hot spots. In addition, DCIM solutions can identify opportunities for safe ambient temperature adjustments, so the facility’s temperature can be raised without causing damage to IT equipment.
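As a rough illustration of the hot-spot check described above, the following Python sketch flags racks whose average recent temperature exceeds a limit. The rack names, readings and the 27 °C threshold are all hypothetical values chosen for the example; they are not taken from any DCIM product.

```python
from statistics import mean

# Hypothetical inlet temperature readings in degrees Celsius, keyed by rack ID.
readings = {
    "rack-A1": [22.5, 23.0, 22.8],
    "rack-A2": [27.9, 28.4, 29.1],
    "rack-B1": [24.1, 24.0, 24.3],
}

# Assumed threshold for illustration; a real site would set its own limit.
HOT_SPOT_THRESHOLD_C = 27.0


def find_hot_spots(readings, threshold):
    """Return the rack IDs whose average recent reading exceeds the threshold."""
    return sorted(
        rack for rack, temps in readings.items() if mean(temps) > threshold
    )


hot_spots = find_hot_spots(readings, HOT_SPOT_THRESHOLD_C)
print(hot_spots)  # ['rack-A2']
```

Averaging several readings rather than reacting to a single sample is a design choice made here to avoid flagging momentary spikes.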
The first step in protecting against power overload is knowing not only where power is being used, but also how it might be used more safely and efficiently. DCIM’s real-time power monitoring and tracking can help prevent power overload. With alert features, the right people are notified when a pre-set power limit is close to being reached, giving data center personnel ample time to react, make changes and shift the load before a major disaster strikes. And if, despite this foreknowledge, catastrophe does occur, a DCIM system can simplify disaster recovery.
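The alerting idea above can be sketched in a few lines of Python. The function name, the 80% warning fraction and the sample loads are assumptions made for illustration only; an actual DCIM system would define its own limits and notification channels.

```python
def power_alert(load_kw, limit_kw, warn_fraction=0.8):
    """Classify a circuit's load against its pre-set limit.

    warn_fraction is a hypothetical setting: the share of the limit at which
    staff should be warned that the limit is close to being reached.
    """
    if load_kw >= limit_kw:
        return "critical"
    if load_kw >= warn_fraction * limit_kw:
        return "warning"
    return "ok"


# Hypothetical circuit loads in kW, each measured against a 5.0 kW limit.
for load in (3.0, 4.2, 5.0):
    print(load, power_alert(load, limit_kw=5.0))
```

The "warning" state is the one that matters most for prevention, because it is raised while there is still time to shift load.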
Flawed redundancy relates to power failure. The ability to test the resiliency of the power chain is essential to good data center stewardship. A DCIM solution provides the ability to perform “what if” tests of the power chain in a virtual environment, with no risk to the actual infrastructure. With this ability, a data center manager can test failure scenarios and answer questions such as:
What if this piece of equipment were to suddenly fail?
Where would the load go?
What else might fail as a result?
Are my A and B sides safe?
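A very simplified version of such a “what if” test can be written in Python. The model below is an assumption made for illustration: every rack draws its power evenly from the feeds that are still running, and the rack loads and feed capacities are invented numbers. Real power chains involve more components than this sketch covers.

```python
# Hypothetical rack loads and feed capacities, all in kW.
rack_loads_kw = {"rack-1": 4.0, "rack-2": 6.0, "rack-3": 5.0}
feed_capacity_kw = {"A": 10.0, "B": 10.0}


def simulate_feed_failure(rack_loads_kw, feed_capacity_kw, failed_feed):
    """Report the load on each surviving feed if failed_feed goes down.

    Assumes the total load is shared evenly across the surviving feeds.
    """
    survivors = [feed for feed in feed_capacity_kw if feed != failed_feed]
    total_load = sum(rack_loads_kw.values())
    share = total_load / len(survivors)
    return {
        feed: {"load_kw": share, "overloaded": share > feed_capacity_kw[feed]}
        for feed in survivors
    }


result = simulate_feed_failure(rack_loads_kw, feed_capacity_kw, failed_feed="A")
print(result)  # {'B': {'load_kw': 15.0, 'overloaded': True}}
```

In this example each feed carries 7.5 kW in normal operation, which looks safe, yet losing side A would push 15 kW onto a 10 kW feed. That is the kind of flawed redundancy a virtual test is meant to expose before a real failure does.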
The biggest problem with capacity planning in a data center is not knowing how much of the capacity is actually being used and how much is left. A DCIM solution supplies not just power capacity intelligence, but also physical space information. Moreover, it can provide information about how the physical capacity is being used and how it might be used more efficiently, enabling consolidation of resources. With a DCIM solution deployed, the risk of unexpectedly running out of space or power is greatly reduced. In addition, DCIM users have consolidated IT equipment to postpone or eliminate the need for multi-million dollar expansion projects.
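The underlying arithmetic is simple once usage figures are available. The Python sketch below shows it for one rack, using hypothetical numbers (a 42U rack with 34U occupied, and a 5.0 kW power budget with 3.0 kW in use).

```python
def remaining_capacity(used, total):
    """Return (units remaining, percent used) for a single resource."""
    return total - used, 100.0 * used / total


# Hypothetical figures for one rack: space in rack units (U), power in kW.
space_left, space_pct = remaining_capacity(used=34, total=42)
power_left, power_pct = remaining_capacity(used=3.0, total=5.0)

print(f"Space: {space_left} U left, {space_pct:.1f}% used")
print(f"Power: {power_left} kW left, {power_pct:.1f}% used")
```

Tracking both figures together matters because a rack can have free space but no spare power, or the reverse.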
Another data center risk has to do with asset management. The challenge is knowing what equipment is where. A DCIM solution not only keeps track of equipment throughout its useful life, providing information on where the asset is, what it is connected to and when it is moved, but also alerts the user when an asset has reached the end of its life and should be retired and replaced. This type of monitoring keeps the data center from having to support older equipment, which has a higher risk of failure and becomes difficult and expensive to maintain.
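An end-of-life check of this kind could look like the following Python sketch. The asset records, field names and five-year service life are hypothetical, and the check date is fixed so the example is repeatable.

```python
from datetime import date

# Hypothetical asset records; field names are illustrative only.
assets = [
    {"id": "srv-001", "location": "rack-A1",
     "installed": date(2012, 3, 1), "life_years": 5},
    {"id": "srv-002", "location": "rack-B4",
     "installed": date(2016, 9, 15), "life_years": 5},
]


def past_end_of_life(asset, today):
    """Return True if the asset's planned service life has elapsed.

    Note: date.replace would raise for a February 29 install date in a
    non-leap target year; this sketch does not handle that case.
    """
    installed = asset["installed"]
    end_of_life = installed.replace(year=installed.year + asset["life_years"])
    return today >= end_of_life


check_date = date(2018, 1, 1)  # fixed date used for the example
due_for_retirement = [a["id"] for a in assets if past_end_of_life(a, check_date)]
print(due_for_retirement)  # ['srv-001']
```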
Here’s one data center risk that’s related to human error. A built-in workflow engine in a DCIM solution helps data center staff avoid errors by giving them a central record of what work has been performed and by whom, as well as what still needs to be accomplished.
If we agree that people aren’t perfect and that they make mistakes, then we can agree that people might be the weakest link in the data center chain. But with a DCIM solution in place, data center teams have access to valuable information to prevent errors. A DCIM solution is a data repository that all data center staff can use to make more intelligent, informed decisions.
These are just a few examples of how a DCIM solution can help reduce risks and cut costs in a data center environment.
To find out more about reducing data center risk and how a DCIM solution can help, access this pre-recorded webinar. Hear 451 Research’s Rhonda Ascierto and Nlyte Software’s Mark Gaydos provide valuable examples of how to lower data center risks, OPEX and CAPEX.
Bio: Mark Gaydos is Chief Marketing Officer for Nlyte Software, the leading data center infrastructure management (DCIM) solution provider for seamlessly automating data center operations and infrastructure into an enterprise’s IT ecosystem.