Hardware Reordering on x86, x64, and Itanium Architectures

The x86-based, x64-based (AMD64 and Intel Extended Memory 64 Technology), and Itanium-based architectures all reorder instructions in some situations. Reordering is a feature of the chip itself; therefore, NUMA architectures are subject to the same reordering scenarios as the underlying processor.

Important: All driver code should be platform-independent. Instead of including platform-specific code paths to handle processor reordering, code your driver to operate on all possible platforms. By using the standard Windows synchronization mechanisms and, if necessary, the kernel-mode memory barrier routines, you can avoid the need for any special-case code.

On x86-based, x64-based, and Itanium-based hardware, reordering might take place when a write operation for one location precedes a read operation for a different location. Processor reordering might move the read operation ahead of the write operation on the same CPU, effectively reversing the order in which the code issued them. These architectures do not reorder read operations followed by read operations or write operations followed by write operations.

The following code sequence shows a situation where this sort of processor reordering might cause problems in a driver. The code is oversimplified to demonstrate the issue and ignores the possibility of compiler rearrangement.

The source code declares a and b and initializes both to 0, as follows:

LONG a = 0;
LONG b = 0;

The following code sequence executes in Thread 1:

{
    a = 1;
    b = 2;
}

The following code sequence executes in Thread 2:

{
    LONG c;
    b = 0;
    c = a;
}

If the instructions are executed in program order, you could expect the final results to be any of the following:

Results	How Obtained
a = 1, b = 0, c = 1	All code in Thread 1 executes before any code in Thread 2.
a = 1, b = 2, c = 0	All code in Thread 2 executes before any code in Thread 1.
a = 1, b = 2, c = 1	Code in Threads 1 and 2 executes in an interleaved order.
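The table can be checked by brute force. The following sketch, written in portable C rather than driver code, enumerates every way the two statements of Thread 1 and the two statements of Thread 2 can interleave while each thread stays in program order. The names Outcome, run_interleaving, and count_outcome are hypothetical helpers introduced only for this illustration.

```c
#include <assert.h>

typedef struct { long a, b, c; } Outcome;

/* Execute one interleaving in program order. If bit i of mask is set,
   slot i runs the next Thread 1 statement; otherwise it runs the next
   Thread 2 statement. */
Outcome run_interleaving(unsigned mask)
{
    Outcome s = {0, 0, 0};
    int t1 = 0, t2 = 0;                 /* next statement index per thread */

    for (int slot = 0; slot < 4; ++slot) {
        if (mask & (1u << slot)) {
            if (t1++ == 0) s.a = 1;     /* Thread 1: a = 1; */
            else           s.b = 2;     /* Thread 1: b = 2; */
        } else {
            if (t2++ == 0) s.b = 0;     /* Thread 2: b = 0; */
            else           s.c = s.a;   /* Thread 2: c = a; */
        }
    }
    assert(t1 == 2 && t2 == 2);         /* each thread ran both statements */
    return s;
}

/* Count the program-order interleavings that end in the given state. */
int count_outcome(long a, long b, long c)
{
    int hits = 0;

    for (unsigned mask = 0; mask < 16; ++mask) {
        int bits = (mask & 1) + ((mask >> 1) & 1) +
                   ((mask >> 2) & 1) + ((mask >> 3) & 1);
        if (bits != 2) continue;        /* need exactly two Thread 1 slots */

        Outcome s = run_interleaving(mask);
        if (s.a == a && s.b == b && s.c == c) ++hits;
    }
    return hits;
}
```

Of the six possible interleavings, one ends in a = 1, b = 0, c = 1, one ends in a = 1, b = 2, c = 0, and four end in a = 1, b = 2, c = 1. A call such as count_outcome(1, 0, 0) returns 0, because no program-order execution leaves b = 0 together with c = 0.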

However, processor reordering could result in the following sequence of operations:

Thread	Operation	Resulting Value
2	Read a	a = 0
1	Write a	a = 1
1	Write b	b = 2
2	Write b	b = 0
2	Write c	c = 0

If instructions are executed in this order, the final result in memory would be a = 1, b = 0, and c = 0, a combination that no program-order execution of the two threads can produce.
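One way to picture this sequence is a per-processor store buffer: the write to b in Thread 2 is issued first but becomes visible to other processors only later, after the read of a has already completed. The following single-threaded sketch replays the sequence under that assumption. The store buffer model is a simplification introduced for illustration, and Memory, StoreBuffer, Result, and replay are hypothetical names; none of this is driver code.

```c
#include <assert.h>

typedef struct { long a, b; } Memory;          /* shared memory          */
typedef struct { int pending; long value; } StoreBuffer;
typedef struct { long a, b, c; } Result;

/* Make a buffered write to b visible in shared memory. */
void drain(StoreBuffer *sb, Memory *m)
{
    if (sb->pending) {
        m->b = sb->value;
        sb->pending = 0;
    }
}

/* Replay the sequence of operations. If barrier_before_read is nonzero,
   the buffered write is forced out before Thread 2 reads a. */
Result replay(int barrier_before_read)
{
    Memory m = {0, 0};
    StoreBuffer sb = {0, 0};

    sb.pending = 1;                  /* Thread 2: b = 0 issued, buffered  */
    sb.value = 0;
    if (barrier_before_read)
        drain(&sb, &m);              /* write to b becomes visible now    */

    long c = m.a;                    /* Thread 2: Read a                  */
    m.a = 1;                         /* Thread 1: Write a                 */
    m.b = 2;                         /* Thread 1: Write b                 */
    drain(&sb, &m);                  /* Thread 2: Write b (if still held) */
    assert(!sb.pending);

    Result r = { m.a, m.b, c };      /* Thread 2: Write c                 */
    return r;
}
```

In this model, replay(0) ends with a = 1, b = 0, c = 0, the result that program order cannot produce. Calling replay(1), which makes the write to b visible before the read of a, ends with a = 1, b = 2, c = 0, which is one of the expected results.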

To prevent this problem, the code should either assign 0 to b in an interlocked sequence or call KeMemoryBarrier immediately before assigning the value of a to c. The following example uses an interlocked sequence:

{
    LONG c;
    InterlockedExchange (&b, 0);
    c = a;
}

The call to InterlockedExchange is an implicit memory barrier. It ensures that the result of the assignment to b is visible before the processor reads the value of a.
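InterlockedExchange is a Windows routine, so the example above cannot be built outside that environment. As a rough user-mode analogue, the following sketch uses the C11 atomic_exchange function from <stdatomic.h>, which by default is sequentially consistent. It is a sketch under that assumption and not a replacement for the driver code; the names xa, xb, xchg_thread1, and xchg_thread2 are hypothetical.

```c
#include <assert.h>
#include <stdatomic.h>

atomic_long xa = 0;   /* plays the role of a */
atomic_long xb = 0;   /* plays the role of b */

/* Body of Thread 1: a = 1; b = 2; */
void xchg_thread1(void)
{
    atomic_store(&xa, 1);
    atomic_store(&xb, 2);
}

/* Body of Thread 2: exchange 0 into b, then read a. Returns c. */
long xchg_thread2(void)
{
    /* The exchange is a read-modify-write with sequentially consistent
       ordering; the load of xa below cannot be moved ahead of it. */
    long old = atomic_exchange(&xb, 0);
    assert(old == 0 || old == 2);   /* b only ever holds 0 or 2 */

    return atomic_load(&xa);        /* c = a; */
}
```

If xchg_thread1 and xchg_thread2 run on two threads, the C11 memory model rules out the combination of a final value of 0 in xb and a return value of 0 from xchg_thread2. Running them one after the other on a single thread simply yields c = 1 and leaves xb at 0.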

The following example shows how to use KeMemoryBarrier to solve the same problem:

{
    LONG c;
    b = 0;
    KeMemoryBarrier();
    c = a;
}

KeMemoryBarrier inserts a memory barrier instruction in the generated code. The memory barrier ensures that the result of assigning 0 to b is visible before the processor reads a.
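KeMemoryBarrier is available only in kernel mode. A comparable pattern in portable C11 places atomic_thread_fence(memory_order_seq_cst) between the write and the read, as in the following sketch. Treating the C11 fence as a stand-in for KeMemoryBarrier is an assumption made for illustration, and the names fa, fb, fence_thread1, and fence_thread2 are hypothetical.

```c
#include <assert.h>
#include <stdatomic.h>

atomic_long fa = 0;   /* plays the role of a */
atomic_long fb = 0;   /* plays the role of b */

/* Body of Thread 1: a = 1; b = 2; */
void fence_thread1(void)
{
    atomic_store(&fa, 1);
    atomic_store(&fb, 2);
}

/* Body of Thread 2: b = 0; barrier; c = a. Returns c. */
long fence_thread2(void)
{
    /* Plain (relaxed) write, like the ordinary assignment b = 0. */
    atomic_store_explicit(&fb, 0, memory_order_relaxed);

    /* Full fence: the write above is ordered before the read below. */
    atomic_thread_fence(memory_order_seq_cst);

    long c = atomic_load_explicit(&fa, memory_order_relaxed);
    assert(c == 0 || c == 1);       /* a only ever holds 0 or 1 */
    return c;
}
```

The write and the read themselves use relaxed ordering, so the fence alone carries the ordering guarantee in Thread 2, which mirrors the role of the barrier call in the driver example.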

Date: 2015-12-24