Capacity and Performance Monitoring

All components of the IT Infrastructure should be monitored continually (in conjunction with Event Management) so that potential problems or trends can be identified before failures or performance degradation occur. Ideally, such monitoring should be automated, with thresholds set so that exception alerts are raised in good time for appropriate preventive or recovery action to be taken before any adverse impact occurs.
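As an illustration of automated threshold checking, the following minimal sketch classifies a single reading against a warning level and a critical level. The `Threshold` class, the `evaluate` function and the 75/90 percent limits are assumptions made for this example; they are not prescribed values.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Threshold:
    """Hypothetical warning/critical limits for one monitored metric."""
    metric: str
    warning: float   # raise an early exception alert at or above this value
    critical: float  # adverse impact is considered imminent at or above this value


def evaluate(threshold: Threshold, value: float) -> Optional[str]:
    """Return the alert severity for a reading, or None if it is within limits."""
    if value >= threshold.critical:
        return "critical"
    if value >= threshold.warning:
        return "warning"
    return None


# Illustrative limits only; real values depend on the infrastructure in use.
cpu = Threshold(metric="cpu_utilization_percent", warning=75.0, critical=90.0)

for reading in (42.0, 80.0, 95.0):
    print(reading, evaluate(cpu, reading))
```

Keeping the warning level well below the critical level is what gives support staff time to act before users are affected.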

The components and elements to be monitored will vary depending upon the infrastructure in use, but will typically include:

CPU utilization (overall and broken down by system/service usage)
Memory utilization
IO rates (physical and buffer) and device utilization
Queue length (maximum and average)
File store utilization (disks, partitions, segments)
Applications (throughput rates, failure rates)
Databases (utilization, record locks, indexing, contention)
Network transaction rates, error and retry rates
Transaction response time
Batch duration profiles
Internet/intranet site/page hit rates
Internet response times (external and internal to firewalls)
Number of system/application log-ons and concurrent users
Number of network nodes in use, and utilization levels.
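
To make one of the items above concrete, the sketch below samples file store utilization using only the Python standard library. The `MetricSample` record and its field names are a hypothetical schema chosen for this example; a real monitoring tool will have its own data model and will cover many more of the listed metrics.

```python
import shutil
import time
from dataclasses import dataclass


@dataclass(frozen=True)
class MetricSample:
    """One timestamped reading for a monitored component (hypothetical schema)."""
    component: str
    metric: str
    value: float
    unit: str
    timestamp: float


def file_store_utilization(path: str = ".") -> MetricSample:
    """Sample the percentage of space used on the file system holding `path`."""
    usage = shutil.disk_usage(path)  # named tuple: total, used, free (in bytes)
    # Guard against a file system that reports zero total size.
    percent_used = 100.0 * usage.used / usage.total if usage.total else 0.0
    return MetricSample(
        component=path,
        metric="file_store_utilization",
        value=round(percent_used, 1),
        unit="percent",
        timestamp=time.time(),
    )


sample = file_store_utilization(".")
print(f"{sample.metric} on {sample.component}: {sample.value} {sample.unit}")
```

Recording the component, unit and timestamp with every reading makes it possible to analyse trends later rather than only the current value.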

Different kinds of monitoring tools are needed to collect and interpret data at each level. For example, some tools allow the performance of business transactions to be monitored, while others monitor the behaviour of individual Configuration Items (CIs).

Capacity Management must set up and calibrate alarm thresholds (where necessary in conjunction with Event Management, since Event Monitoring tools are often used for this) so that the correct alert levels are set and any necessary filtering is established, ensuring that only meaningful events are raised. Without such filtering, ‘information only’ alerts can obscure more significant alerts that require immediate attention. In addition, serious failures can cause ‘alert storms’ of very high volumes of repeat alerts, which must also be filtered so that the most meaningful messages are not obscured.
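
The filtering described above might look roughly like the following sketch, which drops ‘information only’ alerts and suppresses repeats of the same alert inside a time window. The `Alert` record, the severity scale and the 300-second suppression window are assumptions for illustration; Event Monitoring tools usually provide their own correlation and suppression features.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List, Tuple


@dataclass(frozen=True)
class Alert:
    source: str       # component that raised the alert
    message: str
    severity: str     # "info", "warning" or "critical" (assumed scale)
    timestamp: float  # seconds since some reference point


def filter_alerts(alerts: Iterable[Alert], suppress_seconds: float = 300.0) -> List[Alert]:
    """Drop 'information only' alerts, and repeats of the same alert seen
    within `suppress_seconds` of the last copy that was forwarded."""
    last_forwarded: Dict[Tuple[str, str], float] = {}
    forwarded: List[Alert] = []
    for alert in sorted(alerts, key=lambda a: a.timestamp):
        if alert.severity == "info":
            continue
        key = (alert.source, alert.message)
        previous = last_forwarded.get(key)
        if previous is not None and alert.timestamp - previous < suppress_seconds:
            continue  # repeat inside the suppression window
        last_forwarded[key] = alert.timestamp
        forwarded.append(alert)
    return forwarded


# A simulated alert storm: the same critical alert every 10 seconds for 10 minutes.
storm = [Alert("db01", "disk full", "critical", float(t)) for t in range(0, 600, 10)]
storm.append(Alert("web01", "nightly backup finished", "info", 5.0))
storm.append(Alert("web01", "response time above target", "warning", 20.0))

for alert in filter_alerts(storm):
    print(alert.timestamp, alert.source, alert.message)
```

In this example the 62 raw alerts are reduced to three forwarded alerts, so the single warning from `web01` is not buried under the repeated `db01` messages.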

It may be appropriate to use external, third-party, monitoring capabilities for some CIs or components of the IT Infrastructure (e.g. key internet sites/pages). Capacity Management should be involved in helping specify and select any such monitoring capabilities and in integrating the results or any alerts with other monitoring and handling systems.

Capacity Management must work with all appropriate support groups to decide where alarms are routed and to agree escalation paths and timescales. Alerts should be logged to the Service Desk as well as to the appropriate support staff, so that Incident Records can be raised and a permanent record of the event exists. This also gives Service Desk staff a view of how well the support group(s) are dealing with the fault, so that they can intervene if necessary.
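
One way to record such routing decisions is a simple lookup table, as in the sketch below. The group names, alert categories and timescales are invented for this example, and the rule that one further escalation step is triggered per elapsed interval is an assumption rather than a stated policy.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Route:
    support_group: str           # first-line recipient of the alert
    escalation_path: List[str]   # who is notified next, in order
    escalate_after_minutes: int  # agreed timescale per escalation step


# Hypothetical routing table agreed with the support groups.
ROUTES: Dict[str, Route] = {
    "database": Route("dba-team", ["dba-lead", "infrastructure-manager"], 30),
    "network": Route("network-ops", ["network-lead"], 15),
}
DEFAULT_ROUTE = Route("service-desk", ["duty-manager"], 60)


def recipients(category: str, minutes_unacknowledged: int) -> List[str]:
    """Return everyone who should have been notified by now.

    The Service Desk is always included so that an Incident Record can be raised.
    """
    route = ROUTES.get(category, DEFAULT_ROUTE)
    steps = minutes_unacknowledged // route.escalate_after_minutes
    notified = ["service-desk", route.support_group]
    notified += route.escalation_path[:steps]
    # Preserve order while removing duplicates (e.g. the default route).
    return list(dict.fromkeys(notified))


print(recipients("database", 0))
print(recipients("database", 65))
print(recipients("storage", 10))
```

Keeping the table as data rather than code makes it easier for the support groups to review and amend the agreed paths and timescales.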

Manufacturers’ claimed performance capabilities and agreed service level targets, together with actual historical monitored performance and capacity data, should be used to set alert levels. Initially this may need to be an iterative process, with some trial-and-error adjustment until the correct levels are reached.
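
A first-pass alert level could be derived from these three inputs along the lines of the sketch below, for a metric where higher values are worse (for example, utilization). The function name, the 10 percent margin above the historical 95th percentile and the 90 percent headroom factor are all assumptions for illustration; they are starting points to be adjusted by trial and error, not recommended values.

```python
import statistics
from typing import Sequence


def suggest_alert_level(
    history: Sequence[float],
    rated_capacity: float,
    service_level_target: float,
    margin: float = 1.1,
    headroom: float = 0.9,
) -> float:
    """Suggest an initial alert level for a 'higher is worse' metric.

    history              -- historical monitored readings (at least two)
    rated_capacity       -- manufacturer's claimed maximum for the component
    service_level_target -- agreed limit for the same metric
    margin               -- assumed factor above normal behaviour before alerting
    headroom             -- assumed fraction of the hard limit left as a ceiling
    """
    # 95th percentile of past readings: the last of the 19 cut points for n=20.
    p95 = statistics.quantiles(history, n=20)[-1]
    candidate = p95 * margin
    # Never alert later than a safe fraction of the tighter of the two limits.
    ceiling = headroom * min(rated_capacity, service_level_target)
    return min(candidate, ceiling)


# Illustrative utilization history (percent); not real measurements.
history = [48.0, 52.0, 50.0, 55.0, 61.0, 47.0, 53.0, 58.0, 49.0, 51.0]
level = suggest_alert_level(history, rated_capacity=100.0, service_level_target=80.0)
print(f"suggested initial alert level: {level:.1f}")
```

If the suggested level produces too many or too few alerts in practice, the `margin` and `headroom` values are the obvious parameters to adjust during the iterative tuning described above.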

Note: Capacity Management may also have to become involved in the capacity requirements and capabilities of IT Service Management itself. Whether the organization has enough Service Desk staff to handle the rate of incidents, whether the CAB structure can handle the number of changes it is asked to review and approve, and whether support tools can handle the volume of data being gathered are all Capacity Management issues that the Capacity Management team may be asked to help investigate and answer.

Date: 2014-12-29