MMX and SSE Optimization - SSE

xiaoxiao2021-04-11  361

The previous part covered MMX optimization techniques for integer computation. The bulk of graphics and sound processing, however, is floating-point work, and the demands on floating-point performance keep rising. Intel therefore added the SSE instructions for floating-point operations to the Pentium III, so any program that uses SSE instructions must run on a Pentium III, an Athlon XP, or a later processor.

SSE defines eight new 128-bit registers, XMM0-XMM7, twice as wide as the 64-bit MMX registers. Each register holds four 32-bit floating-point numbers. Because these are new registers, there is none of the switching between the MMX registers and the original floating-point registers that MMX requires, so execution is more efficient. Note that SSE can also operate on 16-bit and 8-bit integers, but that is not its main use.

A quick aside on Intel Compiler 8.0: this compiler really is strong. In my experience its output is about 10-20% faster than that of Visual C++ 6.0 SP6, and it can optimize for specific CPUs. If you have a P4-series CPU, add the options /fast /QxW /Qip /Qunroll40 when compiling and the result may surprise you. If you read the user manual and change your program according to the methods it describes, you will gain even more. I recommend this compiler to everyone who worships ultimate optimization. Enough digression; back to SSE, starting with a simple example:

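Using inline assembly in VC:

// MOVAPS requires its memory operands to be 16-byte aligned
__declspec(align(16)) float a[] = {1.0f, 2.0f, 3.0f, 4.0f};
__declspec(align(16)) float b[] = {5.0f, 6.0f, 7.0f, 8.0f};
__asm {
    LEA ECX, a            ; address of a (MOV would load the value of a[0])
    LEA EDX, b            ; address of b
    MOVAPS XMM0, [ECX]    ; XMM0 = a[0..3]
    MOVAPS XMM1, [EDX]    ; XMM1 = b[0..3]
    ADDPS XMM0, XMM1      ; four additions at once
    MOVAPS [ECX], XMM0    ; store the sums back into a
}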

As with MMX, you can also use SSE without writing assembly: include a single header file and use it directly in C.

#include <xmmintrin.h>
__m128 a = _mm_set_ps(1, 2, 3, 4);
__m128 b = _mm_set_ps(5, 6, 7, 8);
a = _mm_add_ps(a, b);

Here I find intrinsics the more convenient choice, because they also provide many composite operations that expand into several instructions.
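For reference, a compilable version of the intrinsics example might look like the sketch below. The function names add4 and set_ps_order are made up for illustration, and the unaligned load/store intrinsics are used so that the caller's arrays need no special alignment. One detail worth knowing: _mm_set_ps takes its arguments from the highest lane down to the lowest, so _mm_set_ps(1, 2, 3, 4) places 4 in the lowest lane.

```c
#include <xmmintrin.h>

/* Hypothetical helper: out[i] = x[i] + y[i] for i = 0..3, using one ADDPS. */
static void add4(const float *x, const float *y, float *out)
{
    __m128 a = _mm_loadu_ps(x);            /* unaligned load (MOVUPS) */
    __m128 b = _mm_loadu_ps(y);
    _mm_storeu_ps(out, _mm_add_ps(a, b));  /* four sums in one instruction */
}

/* Hypothetical helper: shows the lane order produced by _mm_set_ps. */
static void set_ps_order(float *out)
{
    /* Arguments run from the highest lane to the lowest. */
    _mm_storeu_ps(out, _mm_set_ps(1.0f, 2.0f, 3.0f, 4.0f));
}
```

Calling add4 on {1, 2, 3, 4} and {5, 6, 7, 8} yields {6, 8, 10, 12}; set_ps_order stores 4 at out[0] and 1 at out[3].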

The SSE instructions are described below. For a fuller reference, download the instruction manual from Intel. The following part is taken from: http://dwbclz.myetang.com/articles/piii/sse-ins-ref.html

ADDPS

Format: ADDPS XMM1, XMM2/M128

Function: Adds two sets of four single-precision numbers

Algorithm:

DEST[31-0] = DEST[31-0] + SRC/M128[31-0];
DEST[63-32] = DEST[63-32] + SRC/M128[63-32];
DEST[95-64] = DEST[95-64] + SRC/M128[95-64];
DEST[127-96] = DEST[127-96] + SRC/M128[127-96];

ADDSS

Format: ADDSS XMM1, XMM2/M32

Function: Adds the low single-precision numbers

Algorithm:

DEST[31-0] = DEST[31-0] + SRC/M32[31-0];
DEST[63-32] = DEST[63-32];
DEST[95-64] = DEST[95-64];
DEST[127-96] = DEST[127-96];

ANDNPS

Format: ANDNPS XMM1, XMM2/M128

Function: Bitwise AND of the complement (NOT) of XMM1 with XMM2/M128

Algorithm:

DEST[127-0] = NOT(DEST[127-0]) AND SRC/M128[127-0];
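A common use of ANDNPS is clearing the sign bit of every lane to get absolute values. The sketch below assumes the intrinsic form _mm_andnot_ps(a, b), which computes (NOT a) AND b; the helper name abs4 is hypothetical.

```c
#include <xmmintrin.h>

/* Hypothetical helper: per-lane absolute value using ANDNPS. */
static __m128 abs4(__m128 x)
{
    /* -0.0f has only the sign bit set (0x80000000) in each lane. */
    const __m128 sign_bit = _mm_set1_ps(-0.0f);
    /* (NOT sign_bit) AND x keeps every bit except the sign. */
    return _mm_andnot_ps(sign_bit, x);
}
```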

ANDPS

Format: ANDPS XMM1, XMM2/M128

Function: Bitwise logical AND of the two operands

Algorithm:

DEST[127-0] = DEST[127-0] AND SRC/M128[127-0];

CMPPS

Format: CMPPS XMM1, XMM2/M128, IMM8

Function: Compares the values of the two operands; the value of IMM8 selects the comparison

IMM8 = 0: equal (==)
IMM8 = 1: less than (<)
IMM8 = 2: less than or equal (<=)
IMM8 = 3: unordered (at least one operand is NaN)
IMM8 = 4: not equal (!=)
IMM8 = 5: not less than (!<)
IMM8 = 6: not less than or equal (!<=)
IMM8 = 7: ordered (neither operand is NaN)

Algorithm:

IF (IMM8 = 0) THEN OP = "EQ";
ELSEIF (IMM8 = 1) THEN OP = "LT";
ELSEIF (IMM8 = 2) THEN OP = "LE";
ELSEIF (IMM8 = 3) THEN OP = "UNORD";
ELSEIF (IMM8 = 4) THEN OP = "NE";
ELSEIF (IMM8 = 5) THEN OP = "NLT";
ELSEIF (IMM8 = 6) THEN OP = "NLE";
ELSEIF (IMM8 = 7) THEN OP = "ORD";
FI

CMP0 = DEST[31-0] OP SRC/M128[31-0];
CMP1 = DEST[63-32] OP SRC/M128[63-32];
CMP2 = DEST[95-64] OP SRC/M128[95-64];
CMP3 = DEST[127-96] OP SRC/M128[127-96];

IF (CMP0 = true) THEN DEST[31-0] = 0xFFFFFFFF; ELSE DEST[31-0] = 0x00000000; FI
IF (CMP1 = true) THEN DEST[63-32] = 0xFFFFFFFF; ELSE DEST[63-32] = 0x00000000; FI
IF (CMP2 = true) THEN DEST[95-64] = 0xFFFFFFFF; ELSE DEST[95-64] = 0x00000000; FI
IF (CMP3 = true) THEN DEST[127-96] = 0xFFFFFFFF; ELSE DEST[127-96] = 0x00000000; FI

Others: The following more readable pseudo-instructions can be used instead

Instruction               Implemented as
CMPEQPS XMM1, XMM2        CMPPS XMM1, XMM2, 0
CMPLTPS XMM1, XMM2        CMPPS XMM1, XMM2, 1
CMPLEPS XMM1, XMM2        CMPPS XMM1, XMM2, 2
CMPUNORDPS XMM1, XMM2     CMPPS XMM1, XMM2, 3
CMPNEQPS XMM1, XMM2       CMPPS XMM1, XMM2, 4
CMPNLTPS XMM1, XMM2       CMPPS XMM1, XMM2, 5
CMPNLEPS XMM1, XMM2       CMPPS XMM1, XMM2, 6
CMPORDPS XMM1, XMM2       CMPPS XMM1, XMM2, 7

CMPSS

Format: CMPSS XMM1, XMM2/M32, IMM8

Function: Compares the low single-precision numbers

Algorithm: Similar to CMPPS, but applied only to DEST[31-0].

More readable pseudo-instructions are available here as well.

Instruction               Implemented as
CMPEQSS XMM1, XMM2        CMPSS XMM1, XMM2, 0
CMPLTSS XMM1, XMM2        CMPSS XMM1, XMM2, 1
CMPLESS XMM1, XMM2        CMPSS XMM1, XMM2, 2
CMPUNORDSS XMM1, XMM2     CMPSS XMM1, XMM2, 3
CMPNEQSS XMM1, XMM2       CMPSS XMM1, XMM2, 4
CMPNLTSS XMM1, XMM2       CMPSS XMM1, XMM2, 5
CMPNLESS XMM1, XMM2       CMPSS XMM1, XMM2, 6
CMPORDSS XMM1, XMM2       CMPSS XMM1, XMM2, 7
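Because the packed compare instructions write an all-ones or all-zeros mask into each lane rather than setting flags, the result is typically combined with the logical instructions to choose between two values without branching. A minimal sketch, assuming the intrinsic _mm_cmplt_ps corresponds to CMPLTPS; the helper name select_lt is hypothetical.

```c
#include <xmmintrin.h>

/* Hypothetical helper: per lane, returns (a < b) ? x : y without branches. */
static __m128 select_lt(__m128 a, __m128 b, __m128 x, __m128 y)
{
    __m128 mask = _mm_cmplt_ps(a, b);          /* 0xFFFFFFFF where a < b */
    return _mm_or_ps(_mm_and_ps(mask, x),      /* keep x where mask is set */
                     _mm_andnot_ps(mask, y));  /* keep y elsewhere */
}
```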

COMISS

Format: COMISS XMM1, XMM2/M32

Function: Compares the low single-precision numbers and sets the flags

Algorithm:

OF = 0; SF = 0; AF = 0;
IF ((DEST[31-0] UNORD SRC/M32[31-0]) = true) THEN ZF = 1; PF = 1; CF = 1;
ELSEIF ((DEST[31-0] GTRTHAN SRC/M32[31-0]) = true) THEN ZF = 0; PF = 0; CF = 0;
ELSEIF ((DEST[31-0] LESSTHAN SRC/M32[31-0]) = true) THEN ZF = 0; PF = 0; CF = 1;
ELSE ZF = 1; PF = 0; CF = 0;
FI

CVTPI2PS

Format: CVTPI2PS XMM, MM/M64

Function: Converts two 32-bit integers to floating-point numbers

Algorithm:

DEST[31-0] = (float)(SRC/M64[31-0]);
DEST[63-32] = (float)(SRC/M64[63-32]);
DEST[95-64] = DEST[95-64];
DEST[127-96] = DEST[127-96];

CVTPS2PI

Format: CVTPS2PI MM, XMM/M64

Function: Converts the low two floating-point numbers to 32-bit integers

Algorithm:

DEST[31-0] = (int)(SRC/M64[31-0]);
DEST[63-32] = (int)(SRC/M64[63-32]);

CVTSI2SS

Format: CVTSI2SS XMM, R/M32

Function: Converts a 32-bit integer to a floating-point number and stores it in the low position

Algorithm:

DEST[31-0] = (float)(R/M32);
DEST[63-32] = DEST[63-32];
DEST[95-64] = DEST[95-64];
DEST[127-96] = DEST[127-96];

CVTSS2SI

Format: CVTSS2SI R32, XMM/M32

Function: Converts the low floating-point number to a 32-bit integer

Algorithm:

R32 = (int)(SRC/M32[31-0]);

CVTTPS2PI

Format: CVTTPS2PI MM, XMM/M64

Function: Converts the low two floating-point numbers to 32-bit integers, with truncation

Algorithm:

DEST[31-0] = (int)(SRC/M64[31-0]);
DEST[63-32] = (int)(SRC/M64[63-32]);

CVTTSS2SI

Format: CVTTSS2SI R32, XMM/M32

Function: Converts the low floating-point number to a 32-bit integer, with truncation

Algorithm:

R32 = (int)(SRC/M32[31-0]);
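The pseudocode for CVTSS2SI and CVTTSS2SI looks identical, so the difference is easy to miss: CVTSS2SI rounds according to the rounding mode currently set in MXCSR (round to nearest even by default), while CVTTSS2SI always truncates toward zero, like a C cast. A small sketch under the assumption that MXCSR still holds its default rounding mode; the two helper names are hypothetical.

```c
#include <xmmintrin.h>

/* Hypothetical helper: CVTSS2SI, rounds using the current MXCSR mode. */
static int to_int_rounded(float f)
{
    return _mm_cvtss_si32(_mm_set_ss(f));
}

/* Hypothetical helper: CVTTSS2SI, always truncates toward zero. */
static int to_int_truncated(float f)
{
    return _mm_cvttss_si32(_mm_set_ss(f));
}
```

With the default rounding mode, to_int_rounded(2.7f) gives 3 while to_int_truncated(2.7f) gives 2.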

DIVPS

Format: DIVPS XMM1, XMM2/M128

Function: Divides packed single-precision numbers

Algorithm:

DEST[31-0] = DEST[31-0] / (SRC/M128[31-0]);
DEST[63-32] = DEST[63-32] / (SRC/M128[63-32]);
DEST[95-64] = DEST[95-64] / (SRC/M128[95-64]);
DEST[127-96] = DEST[127-96] / (SRC/M128[127-96]);

DIVSS

Format: DIVSS XMM1, XMM2/M32

Function: Divides the low single-precision numbers

Algorithm:

DEST[31-0] = DEST[31-0] / (SRC/M32[31-0]);
DEST[63-32] = DEST[63-32];
DEST[95-64] = DEST[95-64];
DEST[127-96] = DEST[127-96];

EMMS

Format: EMMS

Function: Empties the MMX state by marking the floating-point tag word as empty

Algorithm:

FPUTagWord <- FFFF

FXRSTOR

Format: FXRSTOR M512BYTE

Function: Restores the FP, MMX, and SSE state from M512BYTE

Algorithm:

FP and MMX state and Streaming SIMD Extension state = M512BYTE;

FXSAVE

Format: FXSAVE M512BYTE

Function: Saves the FP, MMX, and SSE state to M512BYTE

Algorithm:

M512BYTE = FP and MMX state and Streaming SIMD Extension state;

LDMXCSR

Format: LDMXCSR M32

Function: Loads the SSE control and status word

Algorithm:

MXCSR = M32;

MAXPS

Format: MAXPS XMM1, XMM2/M128

Function: Returns the maximum values

Algorithm:

IF (DEST[31-0] = NaN) THEN DEST[31-0] = SRC[31-0];
ELSEIF (SRC[31-0] = NaN) THEN DEST[31-0] = SRC[31-0];
ELSEIF (DEST[31-0] > SRC/M128[31-0]) THEN DEST[31-0] = DEST[31-0];
ELSE DEST[31-0] = SRC/M128[31-0];
FI
IF (DEST[63-32] = NaN) THEN DEST[63-32] = SRC[63-32];
ELSEIF (SRC[63-32] = NaN) THEN DEST[63-32] = SRC[63-32];
ELSEIF (DEST[63-32] > SRC/M128[63-32]) THEN DEST[63-32] = DEST[63-32];
ELSE DEST[63-32] = SRC/M128[63-32];
FI
IF (DEST[95-64] = NaN) THEN DEST[95-64] = SRC[95-64];
ELSEIF (SRC[95-64] = NaN) THEN DEST[95-64] = SRC[95-64];
ELSEIF (DEST[95-64] > SRC/M128[95-64]) THEN DEST[95-64] = DEST[95-64];
ELSE DEST[95-64] = SRC/M128[95-64];
FI
IF (DEST[127-96] = NaN) THEN DEST[127-96] = SRC[127-96];
ELSEIF (SRC[127-96] = NaN) THEN DEST[127-96] = SRC[127-96];
ELSEIF (DEST[127-96] > SRC/M128[127-96]) THEN DEST[127-96] = DEST[127-96];
ELSE DEST[127-96] = SRC/M128[127-96];
FI

MAXSS

Format: MAXSS XMM1, XMM2/M32

Function: Returns the maximum of the low values

Algorithm: Similar to MAXPS, except that it operates only on DEST[31-0]

MINPS

Format: MINPS XMM1, XMM2/M128

Function: Returns the minimum values

Algorithm: Similar to MAXPS, except that the smaller value is selected in each position

MINSS

Format: MINSS XMM1, XMM2/M32

Function: Returns the minimum of the low values

Algorithm: Similar to MINPS, except that it operates only on DEST[31-0]
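MAXPS and MINPS are often paired to clamp four values to a range in two instructions, with no branches. A minimal sketch; the helper name clamp4 is hypothetical, and it assumes lo <= hi in every lane and no NaN inputs (see the NaN handling in the MAXPS algorithm).

```c
#include <xmmintrin.h>

/* Hypothetical helper: clamps each lane of x to the range [lo, hi]. */
static __m128 clamp4(__m128 x, __m128 lo, __m128 hi)
{
    /* MAXPS raises values below lo, then MINPS lowers values above hi. */
    return _mm_min_ps(_mm_max_ps(x, lo), hi);
}
```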

MOVAPS

Format: MOVAPS XMM1, XMM2/M128 or MOVAPS XMM2/M128, XMM1

Function: Transfers aligned data (memory operands must be 16-byte aligned)

Algorithm:

IF (Destination = DEST) THEN
  IF (SRC = M128) THEN (* load instruction *) DEST[127-0] = M128;
  ELSE (* move instruction *) DEST[127-0] = SRC[127-0];
  FI;
ELSE
  IF (Destination = M128) THEN (* store instruction *) M128 = SRC[127-0];
  ELSE (* move instruction *) DEST[127-0] = SRC[127-0];
  FI;
FI;

MOVHLPS

Format: MOVHLPS XMM1, XMM2

Function: Moves the high two numbers of the source to the low two positions of the destination

Algorithm:

DEST[127-64] = DEST[127-64];
DEST[63-0] = SRC[127-64];

MOVHPS

Format: MOVHPS XMM, M64 or MOVHPS M64, XMM

Function: Transfers the high 64 bits of data

Algorithm:

IF (Destination = DEST) THEN (* load instruction *)
  DEST[127-64] = M64;
  DEST[31-0] = DEST[31-0];
  DEST[63-32] = DEST[63-32];
ELSE (* store instruction *)
  M64 = SRC[127-64];
FI;

MOVLPS

Format: MOVLPS XMM, M64 or MOVLPS M64, XMM

Function: Transfers the low 64 bits of data

Algorithm:

IF (Destination = DEST) THEN (* load instruction *)
  DEST[63-0] = M64;
  DEST[95-64] = DEST[95-64];
  DEST[127-96] = DEST[127-96];
ELSE (* store instruction *)
  M64 = SRC[63-0];
FI;

MOVLHPS

Format: MOVLHPS XMM1, XMM2

Function: Moves the low two numbers of the source to the high two positions of the destination

Algorithm:

DEST[127-64] = SRC[63-0];
DEST[63-0] = DEST[63-0];

MOVMSKPS

Format: MOVMSKPS R32, XMM

Function: Moves the mask formed by the four sign bits into a 32-bit register

Algorithm:

R32[0] = SRC[31];
R32[1] = SRC[63];
R32[2] = SRC[95];
R32[3] = SRC[127];
R32[7-4] = 0x0;
R32[15-8] = 0x00;
R32[31-16] = 0x0000;
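MOVMSKPS is the usual bridge from SIMD results back to ordinary integer code: it packs the top bit of each lane into the low four bits of a general-purpose register, which can then be tested with a normal comparison. A minimal sketch; the helper names sign_mask and any_negative are hypothetical.

```c
#include <xmmintrin.h>

/* Hypothetical helper: bit i of the result is the sign bit of lane i. */
static int sign_mask(__m128 v)
{
    return _mm_movemask_ps(v);
}

/* Hypothetical helper: nonzero if at least one lane has its sign bit set. */
static int any_negative(__m128 v)
{
    return _mm_movemask_ps(v) != 0;
}
```

The same pattern works on the masks produced by the CMPPS family, for example to test whether a comparison held in all four lanes (mask equal to 0xF).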

MOVNTPS

Format: MOVNTPS M128, XMM

Function: Stores data directly to memory, reducing pressure on the cache

Algorithm:

M128 = SRC;
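Non-temporal stores pay off when writing a large buffer that will not be read again soon. The sketch below fills a buffer with _mm_stream_ps, the intrinsic form of MOVNTPS. The helper name fill_stream is hypothetical, and the code assumes dst is 16-byte aligned and n is a multiple of 4; the closing SFENCE makes the streamed data visible in order with later stores.

```c
#include <stddef.h>
#include <xmmintrin.h>

/* Hypothetical helper: fills dst[0..n-1] with value using streaming stores.
   Assumes dst is 16-byte aligned and n is a multiple of 4. */
static void fill_stream(float *dst, size_t n, float value)
{
    __m128 v = _mm_set1_ps(value);
    for (size_t i = 0; i < n; i += 4) {
        _mm_stream_ps(dst + i, v);   /* MOVNTPS: bypasses the cache */
    }
    _mm_sfence();                    /* order the streamed stores */
}
```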

MOVSS

Format: MOVSS XMM1, XMM2/M32 or MOVSS XMM2/M32, XMM1

Function: Transfers the lowest single-precision number

Algorithm:

IF (Destination = DEST) THEN
  IF (SRC = M32) THEN (* load instruction *)
    DEST[31-0] = M32;
    DEST[63-32] = 0x00000000;
    DEST[95-64] = 0x00000000;
    DEST[127-96] = 0x00000000;
  ELSE (* move instruction *)
    DEST[31-0] = SRC[31-0];
    DEST[63-32] = DEST[63-32];
    DEST[95-64] = DEST[95-64];
    DEST[127-96] = DEST[127-96];
  FI
ELSE
  IF (Destination = M32) THEN (* store instruction *)
    M32 = SRC[31-0];
  ELSE (* move instruction *)
    DEST[31-0] = SRC[31-0];
    DEST[63-32] = DEST[63-32];
    DEST[95-64] = DEST[95-64];
    DEST[127-96] = DEST[127-96];
  FI
FI

MOVUPS

Format: MOVUPS XMM1, XMM2/M128 or MOVUPS XMM2/M128, XMM1

Function: Transfers unaligned data

Algorithm:

IF (Destination = XMM) THEN
  IF (SRC = M128) THEN (* load instruction *) DEST[127-0] = M128;
  ELSE (* move instruction *) DEST[127-0] = SRC[127-0];
  FI
ELSE
  IF (Destination = M128) THEN (* store instruction *) M128 = SRC[127-0];
  ELSE (* move instruction *) DEST[127-0] = SRC[127-0];
  FI
FI
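When the alignment of the input is not known, MOVUPS (the intrinsic _mm_loadu_ps) is the safe choice, at some cost in speed on older processors. The sketch below sums an arbitrary float array four elements at a time and handles the leftover elements with a scalar loop; the helper name sum_array is hypothetical. Note that the additions happen in a different order than in a plain scalar loop, so results can differ in the last bits for inexact data.

```c
#include <stddef.h>
#include <xmmintrin.h>

/* Hypothetical helper: sums p[0..n-1]; p needs no particular alignment. */
static float sum_array(const float *p, size_t n)
{
    __m128 acc = _mm_setzero_ps();
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        acc = _mm_add_ps(acc, _mm_loadu_ps(p + i));  /* MOVUPS + ADDPS */
    }
    float lanes[4];
    _mm_storeu_ps(lanes, acc);                       /* four partial sums */
    float total = lanes[0] + lanes[1] + lanes[2] + lanes[3];
    for (; i < n; ++i) {
        total += p[i];                               /* scalar tail */
    }
    return total;
}
```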

MULPS

Format: MULPS XMM1, XMM2/M128

Function: Multiplies packed single-precision numbers

Algorithm:

DEST[31-0] = DEST[31-0] * SRC/M128[31-0];
DEST[63-32] = DEST[63-32] * SRC/M128[63-32];
DEST[95-64] = DEST[95-64] * SRC/M128[95-64];
DEST[127-96] = DEST[127-96] * SRC/M128[127-96];

MULSS

Format: MULSS XMM1, XMM2/M32

Function: Multiplies the lowest single-precision numbers

Algorithm:

DEST[31-0] = DEST[31-0] * SRC/M32[31-0];
DEST[63-32] = DEST[63-32];
DEST[95-64] = DEST[95-64];
DEST[127-96] = DEST[127-96];

ORPS

Format: ORPS XMM1, XMM2/M128

Function: Bitwise logical OR

Algorithm:

DEST[127-0] = DEST[127-0] OR SRC/M128[127-0];

RCPPS

Format: RCPPS XMM1, XMM2/M128

Function: Computes approximate reciprocals

Algorithm:

DEST[31-0] = APPROX(1.0 / (SRC/M128[31-0]));
DEST[63-32] = APPROX(1.0 / (SRC/M128[63-32]));
DEST[95-64] = APPROX(1.0 / (SRC/M128[95-64]));
DEST[127-96] = APPROX(1.0 / (SRC/M128[127-96]));

RCPSS

Format: RCPSS XMM1, XMM2/M32

Function: Computes the approximate reciprocal of the lowest number

Algorithm:

DEST[31-0] = APPROX(1.0 / (SRC/M32[31-0]));
DEST[63-32] = DEST[63-32];
DEST[95-64] = DEST[95-64];
DEST[127-96] = DEST[127-96];
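The reciprocal approximation is fast but only accurate to roughly 12 bits. When more precision is needed, a common technique (not part of the instruction itself) is one Newton-Raphson step, x1 = x0 * (2 - d * x0), which roughly doubles the number of correct bits. A scalar sketch using RCPSS; the helper name reciprocal_refined is hypothetical, and d is assumed to be finite and nonzero.

```c
#include <xmmintrin.h>

/* Hypothetical helper: approximates 1/d with RCPSS plus one
   Newton-Raphson refinement step. Assumes d is finite and nonzero. */
static float reciprocal_refined(float d)
{
    __m128 vd  = _mm_set_ss(d);
    __m128 x0  = _mm_rcp_ss(vd);                  /* about 12-bit estimate */
    __m128 two = _mm_set_ss(2.0f);
    /* x1 = x0 * (2 - d * x0) */
    __m128 x1  = _mm_mul_ss(x0, _mm_sub_ss(two, _mm_mul_ss(vd, x0)));
    return _mm_cvtss_f32(x1);
}
```

The packed form is the same with the _ps variants of each intrinsic; whether this beats a plain DIVPS depends on the processor.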

RSQRTPS

Format: RSQRTPS XMM1, XMM2/M128

Function: Computes approximate reciprocals of the square roots

Algorithm:

DEST[31-0] = APPROX(1.0 / SQRT(SRC/M128[31-0]));
DEST[63-32] = APPROX(1.0 / SQRT(SRC/M128[63-32]));
DEST[95-64] = APPROX(1.0 / SQRT(SRC/M128[95-64]));
DEST[127-96] = APPROX(1.0 / SQRT(SRC/M128[127-96]));

RSQRTSS

Format: RSQRTSS XMM1, XMM2/M32

Function: Computes the approximate reciprocal of the square root of the lowest number

Algorithm:

DEST[31-0] = APPROX(1.0 / SQRT(SRC/M32[31-0]));
DEST[63-32] = DEST[63-32];
DEST[95-64] = DEST[95-64];
DEST[127-96] = DEST[127-96];

SHUFPS

Format: SHUFPS XMM1, XMM2/M128, IMM8

Function: Shuffles (rearranges) the numbers; IMM8 selects the two low results from DEST and the two high results from SRC

Algorithm:

FP_SELECT = (IMM8 >> 0) AND 0x3;
IF (FP_SELECT = 0) THEN DEST[31-0] = DEST[31-0];
ELSEIF (FP_SELECT = 1) THEN DEST[31-0] = DEST[63-32];
ELSEIF (FP_SELECT = 2) THEN DEST[31-0] = DEST[95-64];
ELSE DEST[31-0] = DEST[127-96];
FI

FP_SELECT = (IMM8 >> 2) AND 0x3;
IF (FP_SELECT = 0) THEN DEST[63-32] = DEST[31-0];
ELSEIF (FP_SELECT = 1) THEN DEST[63-32] = DEST[63-32];
ELSEIF (FP_SELECT = 2) THEN DEST[63-32] = DEST[95-64];
ELSE DEST[63-32] = DEST[127-96];
FI

FP_SELECT = (IMM8 >> 4) AND 0x3;
IF (FP_SELECT = 0) THEN DEST[95-64] = SRC/M128[31-0];
ELSEIF (FP_SELECT = 1) THEN DEST[95-64] = SRC/M128[63-32];
ELSEIF (FP_SELECT = 2) THEN DEST[95-64] = SRC/M128[95-64];
ELSE DEST[95-64] = SRC/M128[127-96];
FI

FP_SELECT = (IMM8 >> 6) AND 0x3;
IF (FP_SELECT = 0) THEN DEST[127-96] = SRC/M128[31-0];
ELSEIF (FP_SELECT = 1) THEN DEST[127-96] = SRC/M128[63-32];
ELSEIF (FP_SELECT = 2) THEN DEST[127-96] = SRC/M128[95-64];
ELSE DEST[127-96] = SRC/M128[127-96];
FI
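In C, the immediate for SHUFPS is usually built with the _MM_SHUFFLE(z, y, x, w) macro from xmmintrin.h, which packs four 2-bit selectors as (z << 6) | (y << 4) | (x << 2) | w. Passing the same register as both operands turns the instruction into a general permutation of one vector. A minimal sketch; the helper names reverse4 and broadcast0 are hypothetical.

```c
#include <xmmintrin.h>

/* Hypothetical helper: reverses the four lanes, {a,b,c,d} -> {d,c,b,a}. */
static __m128 reverse4(__m128 v)
{
    /* Lowest result lane takes source lane 3, the next takes lane 2, ... */
    return _mm_shuffle_ps(v, v, _MM_SHUFFLE(0, 1, 2, 3));
}

/* Hypothetical helper: copies lane 0 into all four lanes. */
static __m128 broadcast0(__m128 v)
{
    return _mm_shuffle_ps(v, v, _MM_SHUFFLE(0, 0, 0, 0));
}
```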

SQRTPS

Format: SQRTPS XMM1, XMM2/M128

Function: Computes square roots

Algorithm:

DEST[31-0] = SQRT(SRC/M128[31-0]);
DEST[63-32] = SQRT(SRC/M128[63-32]);
DEST[95-64] = SQRT(SRC/M128[95-64]);
DEST[127-96] = SQRT(SRC/M128[127-96]);

SQRTSS

Format: SQRTSS XMM1, XMM2/M32

Function: Computes the square root of the lowest number

Algorithm:

DEST[31-0] = SQRT(SRC/M32[31-0]);
DEST[63-32] = DEST[63-32];
DEST[95-64] = DEST[95-64];
DEST[127-96] = DEST[127-96];

STMXCSR

Format: STMXCSR M32

Function: Stores the SSE control and status word

Algorithm:

M32 = MXCSR;
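From C, STMXCSR and LDMXCSR are reached through _mm_getcsr and _mm_setcsr. A typical use is switching the rounding mode, which occupies two bits of MXCSR selected by the _MM_ROUND_MASK constant from xmmintrin.h. The sketch below changes only those bits and returns the previous mode so the caller can restore it; the helper name set_rounding is hypothetical.

```c
#include <xmmintrin.h>

/* Hypothetical helper: sets the MXCSR rounding mode (one of the
   _MM_ROUND_* constants) and returns the previous mode. */
static unsigned int set_rounding(unsigned int mode)
{
    unsigned int old = _mm_getcsr();                /* STMXCSR */
    _mm_setcsr((old & ~_MM_ROUND_MASK) | mode);     /* LDMXCSR */
    return old & _MM_ROUND_MASK;
}
```

A caller would save the return value, do its work, and pass that value back to set_rounding afterwards, since the rounding mode affects all later SSE arithmetic on the thread.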

SUBPS

Format: SUBPS XMM1, XMM2/M128

Function: Subtracts packed single-precision numbers

Algorithm:

DEST[31-0] = DEST[31-0] - SRC/M128[31-0];
DEST[63-32] = DEST[63-32] - SRC/M128[63-32];
DEST[95-64] = DEST[95-64] - SRC/M128[95-64];
DEST[127-96] = DEST[127-96] - SRC/M128[127-96];

SUBSS

Format: SUBSS XMM1, XMM2/M32

Function: Subtracts the lowest single-precision numbers

Algorithm:

DEST[31-0] = DEST[31-0] - SRC/M32[31-0];
DEST[63-32] = DEST[63-32];
DEST[95-64] = DEST[95-64];
DEST[127-96] = DEST[127-96];

UCOMISS

Format: UCOMISS XMM1, XMM2/M32

Function: Compares the low single-precision numbers and sets the flags

Algorithm:

OF = 0; SF = 0; AF = 0;
IF ((DEST[31-0] UNORD SRC/M32[31-0]) = true) THEN ZF = 1; PF = 1; CF = 1;
ELSEIF ((DEST[31-0] GTRTHAN SRC/M32[31-0]) = true) THEN ZF = 0; PF = 0; CF = 0;
ELSEIF ((DEST[31-0] LESSTHAN SRC/M32[31-0]) = true) THEN ZF = 0; PF = 0; CF = 1;
ELSE ZF = 1; PF = 0; CF = 0;
FI

UNPCKHPS

Format: UNPCKHPS XMM1, XMM2/M128

Function: Interleaves the high two numbers of the two operands

Algorithm:

DEST[31-0] = DEST[95-64];
DEST[63-32] = SRC/M128[95-64];
DEST[95-64] = DEST[127-96];
DEST[127-96] = SRC/M128[127-96];

UNPCKLPS

Format: UNPCKLPS XMM1, XMM2/M128

Function: Interleaves the low two numbers of the two operands

Algorithm:

DEST[31-0] = DEST[31-0];
DEST[63-32] = SRC/M128[31-0];
DEST[95-64] = DEST[63-32];
DEST[127-96] = SRC/M128[63-32];
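Used together, UNPCKLPS and UNPCKHPS interleave two vectors completely, for example to turn separate x and y arrays into (x, y) pairs. A minimal sketch; the helper name interleave4 is hypothetical, and out must have room for eight floats.

```c
#include <xmmintrin.h>

/* Hypothetical helper: writes x0,y0,x1,y1,x2,y2,x3,y3 to out[0..7]. */
static void interleave4(const float *x, const float *y, float *out)
{
    __m128 a = _mm_loadu_ps(x);
    __m128 b = _mm_loadu_ps(y);
    _mm_storeu_ps(out,     _mm_unpacklo_ps(a, b));  /* x0 y0 x1 y1 */
    _mm_storeu_ps(out + 4, _mm_unpackhi_ps(a, b));  /* x2 y2 x3 y3 */
}
```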

XORPS

Format: XORPS XMM1, XMM2/M128

Function: Bitwise logical exclusive OR

Algorithm:

DEST[127-0] = DEST[127-0] XOR SRC/M128[127-0];
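Since SSE has no dedicated negate instruction, XORPS with a sign-bit mask is a common way to flip the sign of all four lanes at once. A minimal sketch; the helper name negate4 is hypothetical.

```c
#include <xmmintrin.h>

/* Hypothetical helper: flips the sign of every lane using XORPS. */
static __m128 negate4(__m128 x)
{
    /* -0.0f has only the sign bit set, so XOR toggles just that bit. */
    return _mm_xor_ps(x, _mm_set1_ps(-0.0f));
}
```

XORPS of a register with itself is also the usual idiom for zeroing an XMM register, which is what _mm_setzero_ps typically compiles to.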
