Additional Hardware Reordering on the Intel Itanium Architecture
In addition to the reordering scenario described in the preceding section, shared memory access is subject to the following rules on Itanium-based systems:
· Multiple write operations can be combined so that they appear as a single operation, thus preventing a processor from reading an interim value.
· The order of reads and writes to different locations is not preserved when seen from the perspectives of different processors. For example, Processor 1 might write a new value to location x and then read a new value from location y, but Processor 2 might see the result of the read operation before it sees the result of the write operation.
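The write-then-read case in the second rule can be sketched in portable user-mode C. This is only an illustration under stated assumptions: it uses C11 <stdatomic.h> rather than any Windows kernel routine, and the names x, y, and write_then_read are hypothetical. A full fence between the write and the read plays roughly the role that a full memory barrier plays in a driver: if two processors each run this sequence on opposite locations, at least one of them is guaranteed to observe the other's write.

```c
#include <assert.h>
#include <stdatomic.h>

/* Two independent shared locations (hypothetical names). */
atomic_int x = 0;   /* written by "Processor 1" */
atomic_int y = 0;   /* written by "Processor 2" */

/*
 * Write 1 to *mine, then read *other.
 * Without the fence, the write could become visible to other
 * processors only after the read has completed. The sequentially
 * consistent fence keeps the write ordered before the read.
 */
int write_then_read(atomic_int *mine, atomic_int *other)
{
    assert(mine != other);  /* the rule concerns two different locations */

    atomic_store_explicit(mine, 1, memory_order_relaxed);
    atomic_thread_fence(memory_order_seq_cst);
    return atomic_load_explicit(other, memory_order_relaxed);
}
```

With one thread calling write_then_read(&x, &y) and another calling write_then_read(&y, &x), the fences rule out the outcome in which both calls return 0. Removing the fence from either side makes that outcome possible again.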
The following example shows a situation in which read and write operations to shared memory might be reordered. The example includes a driver-created lock and illustrates one of the problems such locks might encounter. The standard Windows locking mechanisms are not subject to this problem.
In the example, the AcquireLock routine acquires a lock on an object. The ReleaseLock routine, in turn, releases the lock.
static LONG Lock = 0;
static LONG Total = 0;

void AcquireLock (PLONG pLock)
{
    while (1) {
        if (InterlockedCompareExchange (pLock, 1, 0) == 0) {
            break;
        }
    }
    //
    // Lock is acquired.
    //
}

void ReleaseLock (PLONG pLock)
{
    *pLock = 0;
}
Consider the following code sequence, which uses these locking routines:
AcquireLock (&Lock);
Total++;
ReleaseLock (&Lock);
AcquireLock correctly uses InterlockedCompareExchange to lock the object. However, ReleaseLock does not use an interlocked exchange or a memory barrier. Consequently, either the compiler or the hardware could reorder the instruction that increments Total so that it occurs outside the locked code region, thus causing errors on multiprocessor systems.
The following code corrects this problem:
void ReleaseLock (LONG volatile *pLock)
{
    KeMemoryBarrierWithoutFence ();
    *pLock = 0;
}
The corrected code declares pLock as a pointer to a volatile LONG, which ensures that the compiler generates code for the assignment to *pLock. The memory barrier prevents the compiler from reordering the statement that increments Total to occur after the assignment to *pLock. Using a standard Windows locking mechanism, such as an InterlockedXxx or ExInterlockedXxx routine, would also prevent this problem.
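The same acquire and release pattern can be sketched in portable user-mode C. This is a minimal illustration under stated assumptions, not the driver code above: it uses C11 <stdatomic.h> in place of InterlockedCompareExchange and KeMemoryBarrierWithoutFence, and the names lock_word, total, acquire_lock, release_lock, and locked_increment are hypothetical. The key point is that the release uses a store with release ordering, so the increment cannot be moved past it by either the compiler or the hardware.

```c
#include <assert.h>
#include <stdatomic.h>

atomic_long lock_word = 0;  /* 0 = free, 1 = held */
long total = 0;             /* protected by lock_word */

void acquire_lock(atomic_long *p)
{
    long expected = 0;

    /* Spin until we change the lock word from 0 to 1. Acquire
       ordering keeps later accesses from moving above this point. */
    while (!atomic_compare_exchange_weak_explicit(
               p, &expected, 1,
               memory_order_acquire, memory_order_relaxed)) {
        expected = 0;  /* a failed exchange overwrites expected */
    }
}

void release_lock(atomic_long *p)
{
    /* The caller must hold the lock. */
    assert(atomic_load_explicit(p, memory_order_relaxed) == 1);

    /* Release ordering keeps earlier accesses, such as the update
       of total, from moving below this store. */
    atomic_store_explicit(p, 0, memory_order_release);
}

long locked_increment(void)
{
    long result;

    acquire_lock(&lock_word);
    result = ++total;
    release_lock(&lock_word);
    return result;
}
```

A plain assignment in release_lock, as in the uncorrected ReleaseLock, would carry no such ordering guarantee.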
Performance and Scalability
A driver’s performance and scalability on multiprocessor hardware depend to a great extent on its use of locks and cache. Addressing performance and scalability can be difficult, particularly if you are developing a single driver that must perform well on a wide variety of hardware configurations. In some cases, optimal tuning for single-processor or dual-processor machines conflicts with that for high-end hardware with many processors. You should consider your primary market and the life cycle of your device and driver in determining the best design.
Locking Issues
Some performance problems related to locks are more likely to appear on multiprocessor machines than on single-processor machines. The material here summarizes the major issues facing driver writers. For more detailed information, see “Locks, Deadlocks, and Synchronization,” listed in the Resources section at the end of this article.