SSE technology introduction

xiaoxiao, 2021-04-11

Intel's Streaming SIMD Extensions (SSE) technology can effectively improve the CPU's floating-point performance. Visual Studio .NET 2003 provides programming support for the SSE instruction set, so SSE instructions can be used directly from C++ code without writing assembly. The MSDN topic on SSE [1] may confuse beginners who are not familiar with the SSE assembly instructions, but reading the Intel Software Manuals [2] alongside the MSDN documentation makes the key points of using SSE instructions much clearer.

SIMD (single-instruction, multiple-data) is a CPU execution mode in which a single instruction operates on several data elements at once. Consider the following task: calculate the square root of each element in a long floating-point array. The algorithm for this task can be written like this:

for each f in array        // each element in the array
    f = sqrt(f)            // calculate its square root

To see the implementation details, we can expand the code above:

for each f in array
{
    load f from memory into a floating-point register
    calculate the square root
    store the result from the register back to memory
}

Processors with the Intel SSE instruction set have eight 128-bit registers, each of which can hold four 32-bit single-precision floating-point numbers. SSE provides instructions that load floating-point numbers into these 128-bit registers, perform arithmetic and logical operations on them there, and store the results back to memory. With SSE, the algorithm can be written as follows:

for each 4 members in array    // each group of 4 elements in the array
{
    load the 4 members into a 128-bit SSE register
    calculate the 4 square roots with a single instruction
    store the 4 results from the register back to memory
}
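The four-at-a-time loop above can be sketched in C++ with SSE intrinsics roughly as follows. This is a minimal illustration, not code from the sample projects: the function name `sqrt_in_place_sse` is hypothetical, the array length is assumed to be a multiple of 4, and the unaligned load/store intrinsics are used so that the sketch does not depend on how the array was allocated.

```cpp
#include <cassert>
#include <cstddef>
#include <xmmintrin.h>  // SSE intrinsics and the __m128 type

// Replaces every element of data[0..n) with its square root, four floats per step.
// Assumption: n is a multiple of 4. The data need not be 16-byte aligned,
// because the unaligned load/store intrinsics are used.
void sqrt_in_place_sse(float* data, std::size_t n)
{
    assert(n % 4 == 0);
    for (std::size_t i = 0; i < n; i += 4) {
        __m128 v = _mm_loadu_ps(data + i);  // load four floats into one SSE register
        v = _mm_sqrt_ps(v);                 // four square roots with one instruction
        _mm_storeu_ps(data + i, v);         // write the four results back to memory
    }
}
```

With aligned data, `_mm_load_ps` and `_mm_store_ps` could be used instead of the unaligned variants.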

C++ programmers do not have to manage these 128-bit registers themselves when using the SSE instruction functions. You work with the 128-bit data type __m128 and a set of C++ functions that implement the arithmetic and logical operations; deciding which SSE registers to use and optimizing the code is the job of the C++ compiler. SSE is a very efficient approach when the elements of a long floating-point array need to be processed.

SSE programming details

Header files

All SSE instruction functions and the __m128 data type are defined in the xmmintrin.h file:

#include <xmmintrin.h>

Because the SSE instructions used in the program are generated by the compiler, no .lib library file is involved.

Data alignment

Every float array processed by SSE instructions must be aligned on a 16-byte (128-bit) boundary. A static array can be declared with the __declspec(align(16)) keyword:

__declspec(align(16)) float m_fArray[ARRAY_SIZE];

A dynamic array can be allocated with the _aligned_malloc function:

m_fArray = (float*) _aligned_malloc(ARRAY_SIZE * sizeof(float), 16);

A dynamic array allocated with _aligned_malloc is released with the _aligned_free function:

_aligned_free(m_fArray);
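As a rough illustration of the alignment requirement, the sketch below allocates a 16-byte aligned array and reads it with aligned loads. It makes some assumptions that are not in the article: it uses `_mm_malloc`/`_mm_free` (declared through xmmintrin.h by common compilers) and the C++11 `alignas` specifier as stand-ins for the Visual C++ specific `_aligned_malloc`, `_aligned_free` and `__declspec(align(16))`, and all function names are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <xmmintrin.h>  // SSE intrinsics; also provides _mm_malloc / _mm_free

// True if p sits on a 16-byte boundary, as aligned SSE loads and stores require.
bool is_aligned16(const void* p)
{
    return reinterpret_cast<std::uintptr_t>(p) % 16 == 0;
}

// Allocates n floats on a 16-byte boundary; release with free_floats16.
float* alloc_floats16(std::size_t n)
{
    return static_cast<float*>(_mm_malloc(n * sizeof(float), 16));
}

void free_floats16(float* p)
{
    _mm_free(p);
}

// Sums an aligned array whose length is a multiple of 4, using aligned loads.
float sum_aligned(const float* data, std::size_t n)
{
    assert(is_aligned16(data) && n % 4 == 0);
    __m128 acc = _mm_setzero_ps();
    for (std::size_t i = 0; i < n; i += 4)
        acc = _mm_add_ps(acc, _mm_load_ps(data + i));  // aligned load of four floats
    alignas(16) float lanes[4];                        // aligned static storage
    _mm_store_ps(lanes, acc);
    return lanes[0] + lanes[1] + lanes[2] + lanes[3];
}
```

Passing a pointer that is not 16-byte aligned to `_mm_load_ps` typically faults at run time, which is why the sketch checks the address first.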


__m128 data type

Variables of this type are used as operands of the SSE instructions. User code should not access their contents directly. Variables of type __m128 are automatically aligned on a 16-byte boundary.

Detecting CPU support for SSE

If your CPU has the SSE instruction set, you can use the C++ functions that Visual Studio provides for it. The Visual C++ CPUID sample in MSDN [4] can help you detect whether your CPU supports SSE, MMX, or other CPU features.

Programming example

The following sections describe applications of SSE under Visual Studio .NET 2003. The sample programs are available as a compressed package that contains two Visual C++ .NET projects built on the Microsoft Foundation Classes (MFC). You can also create these two projects yourself as described below.

The SSETest sample project

The SSETest project is a dialog-based application that performs the following calculation with three floating-point arrays:

fResult[i] = sqrt(fSource1[i] * fSource1[i] + fSource2[i] * fSource2[i]) + 0.5

where i = 0, 1, 2 ... ARRAY_SIZE - 1, and ARRAY_SIZE is defined as 30000. The source arrays are filled using the sin and cos functions. The Waterfall Chart Control developed by Kris Jearakul displays the source arrays and the result array. The time required for the calculation (in milliseconds) is shown in the dialog box. The calculation is done in three different ways: pure C++ code; C++ code using the SSE instruction functions; and code containing SSE assembly instructions.

Pure C++ code:

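void CSSETestDlg::ComputeArrayCPlusPlus(
    float* pArray1,     // [input] source array 1
    float* pArray2,     // [input] source array 2
    float* pResult,     // [output] array for storing the results
    int nSize)          // [input] array size
{
    int i;
    float* pSource1 = pArray1;
    float* pSource2 = pArray2;
    float* pDest = pResult;

    for (i = 0; i < nSize; i++)
    {
        // The loop body is truncated in the source text; it is reconstructed
        // here from the formula fResult[i] = sqrt(s1*s1 + s2*s2) + 0.5.
        *pDest = (float)sqrt((*pSource1) * (*pSource1) + (*pSource2) * (*pSource2)) + 0.5f;
        pSource1++;
        pSource2++;
        pDest++;
    }
}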

The following table lists each operation together with the SSE assembly instruction and the Visual C++ .NET SSE function that implement it:

Implemented functionality | SSE assembly instruction | Visual C++ .NET SSE function
Put four 32-bit floating-point numbers into a 128-bit storage unit | movss and shufps | _mm_set_ps1
Multiply 4 pairs of 32-bit floating-point numbers simultaneously; the pairs come from two 128-bit storage units and the products are written to a 128-bit storage unit | mulps | _mm_mul_ps
Add 4 pairs of 32-bit floating-point numbers simultaneously; the pairs come from two 128-bit storage units and the sums are written to a 128-bit storage unit | addps | _mm_add_ps
Take the square roots of the four 32-bit floating-point numbers in a 128-bit storage unit simultaneously | sqrtps | _mm_sqrt_ps

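Code using the Visual C++ .NET SSE instruction functions:

void CSSETestDlg::ComputeArrayCPlusPlusSSE(
    float* pArray1,     // [input] source array 1
    float* pArray2,     // [input] source array 2
    float* pResult,     // [output] array for storing the results
    int nSize)          // [input] array size
{
    int nLoop = nSize / 4;

    __m128* pSrc1 = (__m128*) pArray1;
    __m128* pSrc2 = (__m128*) pArray2;
    __m128* pDest = (__m128*) pResult;

    __m128 m0_5 = _mm_set_ps1(0.5f);    // m0_5[0, 1, 2, 3] = 0.5

    for (int i = 0; i < nLoop; i++)
    {
        // The loop body is truncated in the source text; it is reconstructed
        // here from the SSE functions listed above.
        __m128 m1 = _mm_mul_ps(*pSrc1, *pSrc1);     // m1 = *pSrc1 * *pSrc1
        __m128 m2 = _mm_mul_ps(*pSrc2, *pSrc2);     // m2 = *pSrc2 * *pSrc2
        __m128 m3 = _mm_add_ps(m1, m2);             // m3 = m1 + m2
        __m128 m4 = _mm_sqrt_ps(m3);                // m4 = sqrt(m3)
        *pDest = _mm_add_ps(m4, m0_5);              // *pDest = m4 + 0.5

        pSrc1++;
        pSrc2++;
        pDest++;
    }
}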

Code containing SSE assembly instructions (the beginning of this listing, up to the instruction that loads 0.5 into xmm2[0], is missing from the source text):

        ...                             // xmm2[0] = 0.5
        shufps xmm2, xmm2, 0            // xmm2[1, 2, 3] = xmm2[0]

        mov esi, pArray1                // address of source array 1 to esi
        mov edx, pArray2                // address of source array 2 to edx
        mov edi, pResult                // address of the output array to edi
        mov ecx, nLoop                  // loop counter to ecx

start_loop:
        movaps xmm0, [esi]              // xmm0 = [esi]
        mulps xmm0, xmm0                // xmm0 = xmm0 * xmm0

        movaps xmm1, [edx]              // xmm1 = [edx]
        mulps xmm1, xmm1                // xmm1 = xmm1 * xmm1

        addps xmm0, xmm1                // xmm0 = xmm0 + xmm1
        sqrtps xmm0, xmm0               // xmm0 = sqrt(xmm0)
        addps xmm0, xmm2                // xmm0 = xmm0 + xmm2

        movaps [edi], xmm0              // [edi] = xmm0

        add esi, 16                     // esi += 16
        add edx, 16                     // edx += 16
        add edi, 16                     // edi += 16

        dec ecx                         // ecx--
        jnz start_loop                  // if not zero, go to start_loop
    }
}

Finally, the results of running the calculation test on my computer:

Pure C++ code: 26 ms
C++ code using the SSE functions: 9 ms
C++ code containing SSE assembly instructions: 9 ms

These times were measured with the Release build, with compiler optimizations enabled.

The SSESample sample project

The SSESample project is a dialog-based application that performs the following calculation with a floating-point array:

fResult[i] = sqrt(fSource[i] * 2.8)

where i = 0, 1, 2 ... ARRAY_SIZE - 1. The program also finds the maximum and minimum values in the result array. ARRAY_SIZE is defined as 100000, and the results are displayed in a list box. The time required by the three methods on my machine is:

Pure C++ code: 6 ms
C++ code using the SSE functions: 3 ms
Code using SSE assembly instructions: 2 ms

Here the assembly version performs best because it makes intensive use of the SSE (XMM) registers. In general, however, C++ code using the SSE functions is as fast as or faster than hand-written assembly, because the code produced by the optimizing C++ compiler is already highly efficient, and writing assembly that beats it is usually very difficult.
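Timings like the ones above could be collected with a small harness along the following lines. This is only a sketch under stated assumptions, not the measurement code of the sample projects: the function names are hypothetical, `std::chrono::steady_clock` is an assumed choice of timer, the array length is assumed to be a multiple of 4, and unaligned loads are used so that the example works with any float array.

```cpp
#include <cassert>
#include <chrono>
#include <cmath>
#include <cstddef>
#include <xmmintrin.h>  // SSE intrinsics

// Scalar reference: result[i] = sqrt(a[i]^2 + b[i]^2) + 0.5
void compute_scalar(const float* a, const float* b, float* result, std::size_t n)
{
    for (std::size_t i = 0; i < n; ++i)
        result[i] = std::sqrt(a[i] * a[i] + b[i] * b[i]) + 0.5f;
}

// Same calculation, four elements per step. Assumes n is a multiple of 4.
void compute_sse(const float* a, const float* b, float* result, std::size_t n)
{
    assert(n % 4 == 0);
    const __m128 half = _mm_set_ps1(0.5f);           // half[0, 1, 2, 3] = 0.5
    for (std::size_t i = 0; i < n; i += 4) {
        __m128 va = _mm_loadu_ps(a + i);
        __m128 vb = _mm_loadu_ps(b + i);
        __m128 sum = _mm_add_ps(_mm_mul_ps(va, va), _mm_mul_ps(vb, vb));
        _mm_storeu_ps(result + i, _mm_add_ps(_mm_sqrt_ps(sum), half));
    }
}

// Runs f() `repeats` times and returns the average time per run in milliseconds.
template <typename F>
double average_ms(F f, int repeats)
{
    const auto start = std::chrono::steady_clock::now();
    for (int r = 0; r < repeats; ++r)
        f();
    const std::chrono::duration<double, std::milli> elapsed =
        std::chrono::steady_clock::now() - start;
    return elapsed.count() / repeats;
}
```

A call such as `average_ms([&] { compute_sse(a, b, r, n); }, 100)` gives a per-run figure for the SSE version. As with the numbers quoted above, a comparison is only meaningful in an optimized build, and an optimizer may remove work whose results are never read.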

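SSESample, pure C++ code:

// Input: m_fInitialArray
// Output: m_fResultArray, m_fMin, m_fMax
void CSSESampleDlg::OnBnClickedButtonCplusplus()
{
    m_fMin = FLT_MAX;
    // FLT_MIN is the smallest positive float, not the most negative one. It works
    // as a starting maximum here only because every result is a non-negative
    // square root; use -FLT_MAX for general data.
    m_fMax = FLT_MIN;

    int i;

    for (i = 0; i < ARRAY_SIZE; i++)
    {
        // The loop body up to the m_fMax test is truncated in the source text;
        // it is reconstructed here from the formula fResult[i] = sqrt(fSource[i] * 2.8).
        m_fResultArray[i] = sqrt(m_fInitialArray[i] * 2.8f);

        if (m_fResultArray[i] < m_fMin)
            m_fMin = m_fResultArray[i];

        if (m_fResultArray[i] > m_fMax)
            m_fMax = m_fResultArray[i];
    }
}

Code using the Visual C++ .NET SSE instruction functions:

// Input: m_fInitialArray
// Output: m_fResultArray, m_fMin, m_fMax
void CSSESampleDlg::OnBnClickedButtonSseC()
{
    __m128 coeff = _mm_set_ps1(2.8f);       // coeff[0, 1, 2, 3] = 2.8
    __m128 tmp;

    __m128 min128 = _mm_set_ps1(FLT_MAX);   // min128[0, 1, 2, 3] = FLT_MAX
    __m128 max128 = _mm_set_ps1(FLT_MIN);   // max128[0, 1, 2, 3] = FLT_MIN

    __m128* pSource = (__m128*) m_fInitialArray;
    __m128* pDest = (__m128*) m_fResultArray;

    for (int i = 0; i < ARRAY_SIZE / 4; i++)
    {
        // The loop body and the extraction of the minimum are truncated in the
        // source text; they are reconstructed here to mirror the assembly version.
        tmp = _mm_mul_ps(*pSource, coeff);      // tmp = *pSource * coeff
        *pDest = _mm_sqrt_ps(tmp);              // *pDest = sqrt(tmp)

        min128 = _mm_min_ps(*pDest, min128);
        max128 = _mm_max_ps(*pDest, max128);

        pSource++;
        pDest++;
    }

    // Extract the minimum and maximum values from min128 and max128.
    union u
    {
        __m128 m;
        float f[4];
    } x;

    x.m = min128;
    m_fMin = min(x.f[0], min(x.f[1], min(x.f[2], x.f[3])));

    x.m = max128;
    m_fMax = max(x.f[0], max(x.f[1], max(x.f[2], x.f[3])));
}

Code containing SSE assembly instructions:

// Input: m_fInitialArray
// Output: m_fResultArray, m_fMin, m_fMax
void CSSESampleDlg::OnBnClickedButtonSseAssembly()
{
    float* pIn = m_fInitialArray;
    float* pOut = m_fResultArray;

    float f = 2.8f;
    float flt_min = FLT_MIN;
    float flt_max = FLT_MAX;

    __m128 min128;
    __m128 max128;

    // Additional registers used: xmm2, xmm3, xmm4
    // xmm2 - multiplication coefficient
    // xmm3 - minimum value
    // xmm4 - maximum value

    _asm
    {
        movss xmm2, f                   // xmm2[0] = 2.8
        shufps xmm2, xmm2, 0            // xmm2[1, 2, 3] = xmm2[0]

        movss xmm3, flt_max             // xmm3[0] = FLT_MAX
        shufps xmm3, xmm3, 0            // xmm3[1, 2, 3] = xmm3[0]

        movss xmm4, flt_min             // xmm4[0] = FLT_MIN
        shufps xmm4, xmm4, 0            // xmm4[1, 2, 3] = xmm4[0]

        mov esi, pIn                    // address of the input array to esi
        mov edi, pOut                   // address of the output array to edi
        mov ecx, ARRAY_SIZE / 4         // initialize the loop counter

start_loop:
        movaps xmm1, [esi]              // xmm1 = [esi]
        mulps xmm1, xmm2                // xmm1 = xmm1 * xmm2
        sqrtps xmm1, xmm1               // xmm1 = sqrt(xmm1)
        movaps [edi], xmm1              // [edi] = xmm1

        minps xmm3, xmm1                // xmm3 = min(xmm3, xmm1)
        maxps xmm4, xmm1                // xmm4 = max(xmm4, xmm1)

        add esi, 16
        add edi, 16

        dec ecx
        jnz start_loop

        movaps min128, xmm3
        movaps max128, xmm4
    }

    union u
    {
        __m128 m;
        float f[4];
    } x;

    x.m = min128;
    m_fMin = min(x.f[0], min(x.f[1], min(x.f[2], x.f[3])));

    x.m = max128;
    m_fMax = max(x.f[0], max(x.f[1], max(x.f[2], x.f[3])));
}

References: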

