The dream is 100% availability; how to avoid a nightmarish shutdown on the System i

Written by Francois Desjardins | Sep 30, 2015 4:15:00 PM

Business leaders often perceive high availability solutions as nothing more than an insurance policy. In fact, they are much more than that. Yes, it is comforting to know that you are ready in the event of a system failure, a disaster or something unexpected. But this environment can also be used to enable business continuity 24/7, for example by running backups on the second server. Resource-intensive queries can be sent to the second server without affecting production. You can run tests prior to implementation with no impact on production. And updates, or even the replacement of a server, can now be done during the day rather than on weekends, when overtime accumulates.

Below you will find excerpts from a blog article written by our partner Traders, which provides high availability and continuity solutions for the IBM System i. The article covers failures, from the most common to the least probable, and ways to remedy them without a nightmarish stoppage of production.

The IBM Power server, designed according to the RAS concept (Reliability, Availability, Serviceability), is a production machine built on intelligent "equipment" (Hypervisor, PowerVM component) and "software" (IBM i firmware) that allow it to choose, in real time, the most appropriate course of action to mitigate problems. However, it is not immune to major hardware or software failures, or to failures linked to its environment.

Hardware failures, the most common and easiest to control

Because they are subject to wear and tear by their very function, disk subsystems are the leading cause of failure, usually following the malfunction of a physical drive (HDD). Several protection options, available directly through the operating system, prevent data loss and keep the system running, even in degraded mode: RAID disk protection with a "Hot Spare," and so-called mirrored protection. These physical protection levels are required today during the initial configuration of an IBM i partition.

Power supply modules can also fail. The server design anticipates this with redundant, hot-swappable power supplies that can be replaced without stopping production. The same applies to the cooling units (or fans), each of which has a failure sensor and a temperature sensor to avoid overheating. Electronic failures on Power servers are extremely rare and are quickly contained: most circuits are duplicated, or can at least be isolated by the machine intelligence, which monitors itself and triggers alerts to the IBM support center. Most failures can be detected easily, in a preventive or proactive manner, and solved by adequate maintenance!

Software failures, rare and synonymous with a slowdown and not a stoppage

No one can dispute that the IBM i operating system is celebrated for its reliability and robustness. Each new version or update is tested by the manufacturer... When problems with the IBM code do occur, often because of non-standard use of system resources (CPU, memory), the system is designed to isolate the "failure" without crashing the partition. In addition, the administrator has all the integrated tools needed to react efficiently. These failures can disrupt the system or partition but rarely stop it. Again, proactive maintenance of IBM software preserves the health of the server.

Failures related to the environment, often underestimated and yet most devastating

The failures that affect a server most severely, to the point of stopping it, are very often linked to external causes such as a failed air conditioning unit, inverter or circuit breaker. The most common is an air conditioning failure that overheats the equipment and jeopardizes the stability and reliability of internal and external drives. In this case, cooling the room abruptly is not the solution; the Power machine, like all computer equipment, does not tolerate sudden temperature changes. In case of overheating or power supply failure, the intelligence of the machine stops production in 100% of cases!

Nor should one underestimate the risks associated with failures in the broader environment: weather events and natural disasters, which are unfortunately on the rise and spare no region. While regular monitoring of infrastructure equipment is necessary, only a comprehensive high availability solution that guarantees real-time replication of data can avoid a shutdown, or at least limit its impact!

Conclusion

Hardware and software failures can be resolved quickly through maintenance contracts designed and implemented according to specific system availability needs. They affect the system but are rarely the cause of its shutdown. This is not the case for failures related to the environment, which result in uncontrollable cascading problems in the short, medium and long term, all the way to the paralysis of the information systems and the enterprise. The only effective and proven solution for Power i is a high availability system located at a remote site, combined with a real-time software replication product such as Trader's QuickEDD. At the same time, this solution also plays a legitimate role in remedying hardware and software failures.

Given what we know about high availability and business continuity, and the critical dependence of today's companies on their servers, the proverb about not putting all your eggs in one basket has never been more pertinent.
