Frequently Used Locks

The types of locks your driver uses are an important and easily controllable factor in ensuring good performance. The InterlockedXxx and ExInterlockedXxx routines are designed for speed and should be used whenever possible.

On a multiprocessor system, heavy use of system-wide locks, such as the kernel dispatcher lock or the cancel spin lock, can slow system performance. If many threads that use such locks are running simultaneously, performance slows while threads spin, waiting for the lock. A driver that creates a single lock and uses it often or for several purposes can have similar performance problems, especially if the device controlled by that driver is used heavily.

Waiting threads acquire in-stack queued spin locks in first-come, first-served order, making acquisition of these locks fairer than acquisition of traditional spin locks. In-stack queued spin locks are often faster, as well. However, they are held at a higher IRQL than traditional spin locks. For this reason, they can actually degrade performance if held for a long time. Queued spin locks are most appropriate for use in high-contention situations in which they are held briefly. Traditional spin locks are preferable in low-contention situations.
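The first-come, first-served behavior described above can be illustrated with a ticket lock. This is a deliberately simpler scheme than the kernel's in-stack queued spin lock and is not how Windows implements it. The sketch uses portable C11 atomics so that it can run in user mode, and every name in it (TICKET_LOCK, TicketTake, and so on) is hypothetical.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical ticket lock: a FIFO illustration only, NOT the kernel's
   queued spin lock implementation. */
typedef struct {
    atomic_uint NextTicket;   /* next ticket to hand out */
    atomic_uint NowServing;   /* ticket that currently owns the lock */
} TICKET_LOCK;

/* Each caller atomically draws a unique, increasing ticket number. */
unsigned TicketTake(TICKET_LOCK *lock)
{
    return atomic_fetch_add(&lock->NextTicket, 1);
}

/* A caller owns the lock only when its ticket is being served. */
bool TicketIsMyTurn(TICKET_LOCK *lock, unsigned ticket)
{
    return atomic_load(&lock->NowServing) == ticket;
}

/* Draw a ticket, then spin until that ticket comes up. */
void TicketAcquire(TICKET_LOCK *lock)
{
    unsigned ticket = TicketTake(lock);
    while (!TicketIsMyTurn(lock, ticket)) {
        /* spin; waiters are admitted strictly in ticket order */
    }
}

/* Pass ownership to the holder of the next ticket. */
void TicketRelease(TICKET_LOCK *lock)
{
    atomic_fetch_add(&lock->NowServing, 1);
}
```

Because ownership passes in ticket order, a late arrival cannot overtake a thread that has been waiting longer, which is the fairness property that a traditional spin lock does not provide.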

Follow these guidelines when choosing and using locks:

· Keep in mind that single references to naturally aligned variables of the native machine size are always atomic. That is, on 32-bit hardware, aligned 32-bit variables are accessed in a single machine instruction, and similarly for aligned 64-bit variables on 64-bit hardware. You don’t need to lock such references unless they are part of an operation that requires strong ordering or must be performed atomically.

· Use cancel-safe IRP queues to avoid use of the system-wide cancel spin lock.

· Use InterlockedXxx and ExInterlockedXxx functions to perform simple logical, arithmetical, and list operations atomically.

· Use spin locks only when required. Use in-stack queued spin locks when lock contention is high and the hold time is very brief. Use traditional spin locks when lock contention is low.

· Minimize lock hold times by eliminating all unnecessary code from locked regions.
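As a rough illustration of the interlocked guideline above, the following sketch updates two statistics without taking any lock. It uses portable C11 atomics as a user-mode stand-in; a driver would presumably call InterlockedIncrement and InterlockedCompareExchange instead. The STATS structure and both function names are hypothetical.

```c
#include <stdatomic.h>

/* Hypothetical per-device statistics; the names are illustrative only. */
typedef struct {
    atomic_long RequestCount;   /* total requests seen */
    atomic_long MaxLength;      /* largest transfer length seen */
} STATS;

/* Stand-in for InterlockedIncrement: one atomic read-modify-write.
   Returns the incremented value. */
long StatsCountRequest(STATS *s)
{
    return atomic_fetch_add(&s->RequestCount, 1) + 1;
}

/* Stand-in for an InterlockedCompareExchange loop: raise MaxLength to
   length if length is larger than the value currently stored. */
void StatsUpdateMax(STATS *s, long length)
{
    long seen = atomic_load(&s->MaxLength);
    while (length > seen &&
           !atomic_compare_exchange_weak(&s->MaxLength, &seen, length)) {
        /* A failed exchange refreshes seen with the current value,
           so the loop condition re-tests against up-to-date data. */
    }
}
```

Neither function can block another processor for longer than a single atomic instruction, so there is no lock hold time to minimize.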

Deadlocks

A deadlock occurs when code running in Thread A holds a lock that code running in Thread B is trying to acquire while the code in Thread B holds a lock that code in Thread A is trying to acquire. Neither thread can progress until the other releases its lock.

To prevent deadlocks in your driver, define a locking hierarchy that specifies the order in which locks will be acquired. Code that conforms to a locking hierarchy always acquires locks in hierarchical order. For example, a driver that requires two locks, A and B, would always acquire lock A before acquiring lock B. If every code path in your driver follows the hierarchy, no two threads can each hold a lock that the other is waiting for, so deadlocks among these locks cannot occur.
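One way to make a locking hierarchy hard to violate is to give each lock a rank and to route every two-lock acquisition through a helper that sorts by rank. The sketch below assumes unique ranks and two distinct locks, and it uses a C11 atomic_flag as a user-mode stand-in for a spin lock. RANKED_LOCK, AcquirePair, and ReleasePair are hypothetical names, not DDK routines.

```c
#include <stdatomic.h>

/* Hypothetical ranked lock. Rank fixes the lock's place in the
   hierarchy; ranks are assumed to be unique. */
typedef struct {
    atomic_flag Busy;
    int Rank;               /* the lower rank is always acquired first */
} RANKED_LOCK;

static void RankedAcquire(RANKED_LOCK *lock)
{
    while (atomic_flag_test_and_set_explicit(&lock->Busy,
                                             memory_order_acquire)) {
        /* spin until the current holder clears the flag */
    }
}

static void RankedRelease(RANKED_LOCK *lock)
{
    atomic_flag_clear_explicit(&lock->Busy, memory_order_release);
}

/* Acquire two distinct locks in hierarchy order, regardless of the
   order of the arguments. Returns the lock that was taken first. */
RANKED_LOCK *AcquirePair(RANKED_LOCK *x, RANKED_LOCK *y)
{
    RANKED_LOCK *first = (x->Rank <= y->Rank) ? x : y;
    RANKED_LOCK *second = (first == x) ? y : x;
    RankedAcquire(first);
    RankedAcquire(second);
    return first;
}

/* Release order does not affect deadlock freedom. */
void ReleasePair(RANKED_LOCK *x, RANKED_LOCK *y)
{
    RankedRelease(x);
    RankedRelease(y);
}
```

With this helper, a caller that passes (B, A) still acquires A first, so two threads can never each hold the lock that the other is waiting for.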

In addition, drivers can cause system deadlocks, and eventual crashes, by calling system routines that use locks from too high an IRQL. For example, driver code that runs at DISPATCH_LEVEL or higher can cause a deadlock by calling a system routine that waits for a mutex. The mutex is a kernel-dispatcher object, and code that waits for such objects must run at PASSIVE_LEVEL or APC_LEVEL. (For details, see “Locks, Deadlocks, and Synchronization,” which is listed in the Resources section.) For similar reasons, a driver that tries to acquire an ordinary spin lock from its InterruptService or SynchCritSection routine can cause a deadlock, because these routines run at DIRQL, and ordinary spin locks operate at the lower DISPATCH_LEVEL. Before attempting to call a system routine from driver code that runs at IRQL > PASSIVE_LEVEL, check the Windows DDK to determine the IRQLs at which the system routine can be called.

Live Locks

Live locks are another problem that appears more often on multiprocessor systems than on single-processor systems. In a live lock situation, code running in two or more threads tries to acquire the same lock at the same time, but the threads keep blocking each other. This problem can occur when two driver routines try to acquire a lock in the same kind of loop. For example:

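void AcquireLock (PLONG pLock)
{
    /* Flawed on purpose: the increment and the test are separate steps. */
    while (1) {
        InterlockedIncrement (pLock);
        if (*pLock == 1) {
            break;              /* sole holder: lock acquired */
        }
        InterlockedDecrement (pLock);   /* back off and retry */
    }
}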

This example shows a flawed lock acquisition routine. If this routine executes in two threads simultaneously, a live lock can occur. Each thread increments *pLock, determines that *pLock equals 2 instead of 1, then decrements the value and repeats. Although both threads are “live” (not blocked), neither can acquire the lock.
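One possible way to avoid this live lock is to claim the lock with a single atomic compare-and-exchange, so that a thread that loses the race never modifies the lock variable. The sketch below uses portable C11 atomics as a user-mode stand-in for InterlockedCompareExchange. TryAcquireLock and ReleaseLock are hypothetical helpers, and a real driver would normally use a system-supplied spin lock rather than a hand-built one.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Lock values: 0 = free, 1 = held. */

/* Try once to change the lock from 0 to 1. Exactly one contender can
   succeed; the others leave the lock variable untouched. */
bool TryAcquireLock(atomic_long *pLock)
{
    long expected = 0;
    return atomic_compare_exchange_strong(pLock, &expected, 1);
}

/* Spin until the single-step claim succeeds. */
void AcquireLock(atomic_long *pLock)
{
    while (!TryAcquireLock(pLock)) {
        /* spin; a failed attempt changes nothing, so it cannot
           prevent another thread's attempt from succeeding */
    }
}

void ReleaseLock(atomic_long *pLock)
{
    atomic_store(pLock, 0);
}
```

Because the test and the update happen in one atomic step, two simultaneous callers cannot repeatedly undo each other's progress: one of them always wins.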

Caching Issues

Optimizing drivers for caching can be difficult and time-consuming. Consider such optimizations only after you have thoroughly debugged and tested your driver and after you have resolved any locking problems or other performance bottlenecks.

Drivers typically allocate nonpaged, cached memory to hold frequently accessed driver data, such as the device extension. When it updates the cache, the hardware always reads an entire cache line, rather than individual data items. If you think of the cache as an array, a cache line is simply a row in that array: a consecutive block of memory that is read and cached in a single operation. The size of a cache line is generally from 16 to 128 bytes, depending on the hardware; KeGetRecommendedSharedDataAlignment returns the size of the largest cache line in the system.

Each cache line has one of the following states:

· Exclusive, meaning that this data does not appear in any other processor’s cache. When a cache line enters the Exclusive state, the data is purged from any other processor’s cache.

· Shared, meaning that one or more other processors’ caches also hold the same data.

· Invalid, meaning that another processor has changed the data in the line.

· Modified, meaning that the current processor has changed the data in this line.

All architectures on which Windows runs guarantee that every processor in a multiprocessor configuration will return the same value for any given memory location. This guarantee, which is called cache coherency between processors, ensures that whenever data in one processor’s cache changes, all other caches that contain the same data will be updated. On a single-processor system, whenever the required memory location is not in the cache, the hardware must reload it from memory. On a multiprocessor system, if the data is not in the current processor’s cache, the hardware can read it from main memory or request it from other processors’ caches. If the processor then writes a new value to that location, all other processors must update their caches to get the latest data.

Some data structures have a high locality of reference. This means that the structure often appears in a sequence of instructions that reference adjacent fields. If a structure has a high locality of reference and is protected by a lock, it should typically be in its own cache line.

For example, consider a large data structure that is protected by a lock and that contains both a pointer to a data item and a flag indicating the status of that data item. If the structure is laid out so that both fields are in the same cache line, any time the driver updates one variable, the other variable is already present in the cache and can be updated immediately.

In contrast, consider another scenario. What happens if two data structures in the same cache line are protected by two different locks and are accessed simultaneously from two different processors? Processor 0 updates the first structure, causing the cache line in Processor 0 to be marked Exclusive and the data in that line to be purged from other processors’ caches. Processor 1 must request the data from Processor 0 and wait until its own cache is updated before it can update the second structure. If Processor 0 again tries to write the first structure, it must request the data from Processor 1, wait until the cache is updated, and so on. However, if the structures are not on the same cache line, neither processor must wait for these cache updates. Therefore, two data structures that can be accessed simultaneously on two different processors (because they are not protected by the same lock) should be on different cache lines.
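The layout advice above can be sketched in C11 by aligning each independently locked structure to a cache-line boundary. The 64-byte line size is an assumption made for this illustration; a driver would obtain the real value at run time from KeGetRecommendedSharedDataAlignment. QUEUE_STATE and DEVICE_STATE are hypothetical types.

```c
#include <stdalign.h>
#include <stddef.h>

/* Assumed cache-line size for this sketch only. A driver would query
   KeGetRecommendedSharedDataAlignment instead of hard-coding it. */
#define ASSUMED_CACHE_LINE 64

/* Hypothetical structure: a lock word plus the data it protects. */
typedef struct {
    long Lock;
    long Count;
} QUEUE_STATE;

/* Read and Write are protected by different locks, so each starts on
   its own cache line and they never share one. */
typedef struct {
    alignas(ASSUMED_CACHE_LINE) QUEUE_STATE Read;
    alignas(ASSUMED_CACHE_LINE) QUEUE_STATE Write;
} DEVICE_STATE;
```

The alignment costs some padding, and it holds only if the containing object itself starts on a cache-line boundary, so a dynamically allocated DEVICE_STATE needs an allocator that honors the requested alignment.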

To test for cache issues, you should use tools that return information about your specific processor. A logic analyzer can help you determine which cache lines are contending. Some processor vendors make available software packages that can read performance data from their processors. Check your vendor’s Web site to find out if such a package is available.

Testing

You should always test every driver on both multiprocessor and single-processor machines. Testing on both increases the chances that you will discover problems related to timing and synchronization. In particular, testing on multiprocessor systems often reveals latent driver bugs that would eventually appear on a single-processor system, but that might not become apparent until after the driver has shipped.

As the number of processors increases, you are likely to find more bugs and more types of bugs. Unfortunately, multiprocessor hardware—especially machines with four or more processors—can be expensive. A practical solution is to use a two-processor hyper-threaded machine in testing. Such a configuration is relatively cheap, but it presents four processors to the operating system.

The Windows DDK includes numerous tools that can help you find problems in your driver. The following are especially useful in analyzing locking and performance issues:

· Driver Verifier

· Call Usage Verifier

· Kernrate and KrView

· DevCon

Driver Verifier

You can find many common driver bugs by using the Driver Verifier. The Driver Verifier is available on Windows 2000 and later versions and works with drivers for these versions. All of the Driver Verifier’s features are available to drivers for most types of devices. However, some features are not supported for graphics drivers, such as display and kernel-mode printer drivers.

By default, Driver Verifier always performs certain checks related to the use of locks. It checks to ensure that drivers acquire and release spin locks at the correct IRQL and that the driver releases each spin lock exactly once per acquisition.

In Windows XP and later systems, the Driver Verifier includes the Deadlock Detection option. Used together with the !deadlock extension to the debugger, this option can help you find potential deadlocks in your code. (This option does not work for display drivers or kernel-mode printer drivers.)

When you enable the Deadlock Detection option, Driver Verifier looks for lock hierarchy violations involving spin locks, mutexes, and fast mutexes. Most of the time, these violations identify code paths that will eventually deadlock.

Even if you believe that the conflicting code paths can never run simultaneously, you should nevertheless rewrite them. Any lock hierarchy violation can eventually cause a deadlock, especially if the code is revised, even slightly, in the future.

In addition, Driver Verifier can monitor global counters related to spin locks. The counters tell you how many times all verified drivers on the system acquired spin locks. This statistic can be useful in fine-tuning a driver to improve performance.

Date: 2015-12-24