Operational Health

Many organizations find it helpful to compare the monitoring and control of Service Operation to health monitoring and control.

In this sense, the IT Infrastructure is like an organism with vital life signs that can be monitored to check whether it is functioning normally. This means that it is not necessary to continuously monitor every component of every IT system to ensure that it is functioning.

Operational Health can be determined by isolating a few important ‘vital signs’ on devices or services that are defined as critical for the successful execution of a Vital Business Function. Examples include the bandwidth utilization on a network segment or the memory utilization on a major server. If these signs are within normal ranges, the system is considered healthy and does not require additional attention. Reducing the need for extensive monitoring lowers cost and lets operational teams and departments focus on the areas that matter most for service success.
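
The vital-signs check described above can be sketched in a few lines of Python. This is only an illustration: the sign names, the normal ranges and the `assess_health` helper are assumptions made for the example, not values or interfaces prescribed by the guidance. In practice the ranges would come from Service Design.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class VitalSign:
    """A monitored metric with its normal range (hypothetical values)."""
    name: str
    low: float
    high: float

    def is_normal(self, value: float) -> bool:
        return self.low <= value <= self.high


def assess_health(signs, readings):
    """Return the names of vital signs that are missing or out of range.

    An empty list means the system is considered healthy.
    """
    abnormal = []
    for sign in signs:
        value = readings.get(sign.name)
        # A missing reading is treated as abnormal: we cannot confirm health.
        if value is None or not sign.is_normal(value):
            abnormal.append(sign.name)
    return abnormal


# Assumed thresholds, for illustration only.
SIGNS = [
    VitalSign("network_bandwidth_utilization_pct", 0.0, 80.0),
    VitalSign("server_memory_utilization_pct", 0.0, 90.0),
]

readings = {
    "network_bandwidth_utilization_pct": 45.0,
    "server_memory_utilization_pct": 95.0,
}
print(assess_health(SIGNS, readings))
```

Only the signs returned by `assess_health` need attention. Everything else is left alone, which is the point of monitoring a small set of vital signs rather than every component.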

However, as with organisms, it is important to examine systems more thoroughly from time to time, to find problems that do not immediately affect vital signs. For example, a disk may be functioning perfectly but nearing its Mean Time Between Failures (MTBF) threshold. In this case the system should be taken out of service and given a thorough examination or ‘health check’. At the same time, the end result should be the healthy functioning of the service as a whole, so health checks on components should be balanced against checks of the ‘end-to-end’ service. What needs to be monitored, and what counts as healthy versus unhealthy, is defined during Service Design, especially in Availability Management and Service Level Management (SLM).
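
As a rough sketch of the MTBF example, the function below flags a component for a health check once its operating hours pass a chosen fraction of its rated MTBF. The function name, the 80% default and the sample figures are assumptions for illustration. MTBF is a statistical average rather than a guaranteed lifetime, so a real policy would combine it with other indicators.

```python
def due_for_health_check(power_on_hours: float,
                         mtbf_hours: float,
                         threshold_fraction: float = 0.8) -> bool:
    """Return True when operating hours reach the chosen share of rated MTBF.

    threshold_fraction is a hypothetical policy setting, not a standard value.
    """
    if mtbf_hours <= 0:
        raise ValueError("mtbf_hours must be positive")
    return power_on_hours >= threshold_fraction * mtbf_hours


# Assumed figures: a disk rated at 50,000 hours MTBF.
print(due_for_health_check(45_000, 50_000))  # past 80% of MTBF
print(due_for_health_check(10_000, 50_000))  # well below the threshold
```

A check like this complements, rather than replaces, end-to-end service checks: a component can be within its threshold while the service as a whole is still unhealthy.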

Operational Health is dependent on the ability to prevent incidents and problems by investing in reliable and maintainable infrastructure. This is achieved through good availability design and proactive Problem Management. At the same time, Operational Health is also dependent on the ability to identify faults and localize them effectively so that they have minimal impact on the service. This requires strong (preferably automated) Incident and Problem Management.

The idea of Operational Health has also led to a specialized area called ‘Self Healing Systems’. This is an application of Availability, Capacity, Knowledge, Incident and Problem Management and refers to a system that has been designed to withstand the most severe operating conditions and to detect, diagnose and recover from most incidents and Known Errors. Self Healing Systems are known by different names, for example Autonomic Systems, Adaptive Systems and Dynamic Systems. Characteristics of Self Healing Systems include:

  • Resilience is designed and built into the system, for example multiple redundant disks or multiple processors. This protects the system against hardware failure since it is able to continue operating using the duplicated hardware component.
  • Software, data and operating system resilience is also designed into the system, for example mirrored databases (where a database is duplicated on a backup device) and disk striping with parity (where data and parity information are distributed across a disk array, so that a single disk failure affects only part of the data, which can be reconstructed from the remaining disks).
  • The ability to shift processing from one physical device to another without any disruption to the service. This could be a response to a failure or because the device is reaching high utilization levels (some systems are designed to distribute processing workloads continuously, to make optimum use of available capacity; this is commonly known as load balancing and is often supported by virtualization).
  • Built-in monitoring utilities which enable the system to detect events and to determine whether these represent normal operations or not.
  • A correlation engine (see paragraph 4.1.5.6 on Event Management). This will enable the system to determine the significance of each event and also to determine whether there is any predefined response to that event.
  • A set of diagnostic tools, such as diagnostic scripts, fault trees and a database of Known Errors and common workarounds. These are used as soon as an error is detected, to determine the appropriate response.
  • The ability to generate a call for human intervention by raising an alert or generating an incident.
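
The last four characteristics above (monitoring, correlation, diagnosis against Known Errors, and escalation to a human) can be read as a simple decision flow. The sketch below shows one possible shape of that flow. The class, the event codes and the workaround texts are all hypothetical and are not taken from any particular self-healing product.

```python
from dataclasses import dataclass, field


@dataclass
class Event:
    component: str
    code: str


@dataclass
class SelfHealingHandler:
    # Codes that represent normal operation (assumed examples).
    informational_codes: set
    # Known Error database: event code -> documented workaround (assumed).
    known_errors: dict
    # Incidents raised for human intervention.
    incidents: list = field(default_factory=list)

    def handle(self, event: Event) -> str:
        # Monitoring and correlation: is this event normal operation?
        if event.code in self.informational_codes:
            return "ignored"
        # Diagnosis: is there a predefined response for this event?
        workaround = self.known_errors.get(event.code)
        if workaround is not None:
            return f"auto-recovered: {workaround}"
        # No known response: call for human intervention.
        self.incidents.append(event)
        return "incident raised"


handler = SelfHealingHandler(
    informational_codes={"BACKUP_COMPLETED"},
    known_errors={"DISK_FAILED": "continue on mirrored disk"},
)
print(handler.handle(Event("db01", "BACKUP_COMPLETED")))
print(handler.handle(Event("db01", "DISK_FAILED")))
print(handler.handle(Event("db01", "UNKNOWN_FAULT")))
```

In this sketch an unknown event always becomes an incident. A real correlation engine would also weigh the significance of the event before deciding whether to escalate.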

While the concept of Operational Health is not a core concept of Service Operation, it is often a helpful metaphor to assist in determining what needs to be monitored and how frequently to perform preventive maintenance.



What and when to monitor for operational health should be determined in Service Design, tested and refined during Service Transition and optimized in Continual Service Improvement, as necessary.



Date: 2014-12-29