Notes on MMX optimization

xiaoxiao2021-04-11  256

Over the past few days I have been helping others with MMX optimization, including YUV420 <-> RGB, YUV420 <-> YUV422, and UYVY -> YUV420 conversion, and I have a few thoughts to share. MMX instructions really are powerful: with inline MMX instructions, the conversions run 3 to 4 times faster than before optimization. As an introduction to MMX optimization, this post explains how the MMX instructions are used.

To begin with, using MMX instructions in VC inline assembly is quite convenient, and the resulting code is also very efficient. Recommended.

There are generally two ways to pass data to the MMX code: through an array or through a pointer. The array form is more convenient, but the pointer form is more common. The elements below are 16-bit (short), so that one MOVQ covers four of them.

Array form:

    short a[6];
    ...
    _asm {
        movq mm0, [a]        // MOVQ transfers 64 bits; mm0 now holds a[3] a[2] a[1] a[0]
        movd mm1, [a + 8]    // MOVD transfers the low 32 bits; note the offset is 8 (bytes), not 4 (elements)
    }

Pointer form:

    short *p;
    ...
    _asm {
        mov  eax, [p]
        movq mm0, [eax]
        movq mm1, [eax + 8]
    }
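The load behavior described above can be modeled in portable C. This is only a sketch, not MMX code: the helper names `movq_load`, `movd_load`, and `lane16` are made up for illustration, a `uint64_t` stands in for an MMX register, and it assumes 16-bit array elements and a little-endian machine such as x86.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical model of MOVQ mm, [mem]: copy 8 bytes into a 64-bit "register". */
static uint64_t movq_load(const void *src) {
    uint64_t reg;
    memcpy(&reg, src, sizeof reg);
    return reg;
}

/* Hypothetical model of MOVD mm, [mem]: copy 4 bytes into the low half;
   the upper 32 bits of the register are zeroed. */
static uint64_t movd_load(const void *src) {
    uint32_t low;
    memcpy(&low, src, sizeof low);
    return (uint64_t)low;
}

/* Read 16-bit lane i of the modeled register (lane 0 is the lowest).
   Assumes little-endian byte order, as on x86. */
static int16_t lane16(uint64_t reg, int i) {
    return (int16_t)(uint16_t)(reg >> (16 * i));
}
```

With `short a[6] = {1, 2, 3, 4, 5, 6}`, `movq_load(a)` places a[0] through a[3] in lanes 0 through 3, and `movd_load((const unsigned char *)a + 8)` places a[4] and a[5] in lanes 0 and 1. The byte offset 8 corresponds to element index 4, which is why the assembly adds 8 rather than 4.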

Addition and subtraction in MMX are relatively simple. The only thing worth watching is which variants are signed or unsigned, and which ones saturate instead of wrapping around. PMADDWD multiplies the four pairs of signed 16-bit values and then adds adjacent products, so each of the two 32-bit results is the sum of two products.
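To make the wraparound, signed-saturating, and unsigned-saturating variants concrete, here is a per-lane model in plain C. It is a sketch of the instruction semantics (PADDW, PADDSW, PADDUSW, PMADDWD) as I understand them, not the MMX code itself, and the function names are invented for this example.

```c
#include <stdint.h>

/* One lane of PADDW: the sum wraps around on overflow. */
static int16_t add_wrap16(int16_t a, int16_t b) {
    return (int16_t)(uint16_t)((uint16_t)a + (uint16_t)b);
}

/* One lane of PADDSW: the sum is clamped to the signed 16-bit range. */
static int16_t add_sat_s16(int16_t a, int16_t b) {
    int32_t s = (int32_t)a + (int32_t)b;
    if (s > INT16_MAX) return INT16_MAX;
    if (s < INT16_MIN) return INT16_MIN;
    return (int16_t)s;
}

/* One lane of PADDUSW: the sum is clamped to the unsigned 16-bit range. */
static uint16_t add_sat_u16(uint16_t a, uint16_t b) {
    uint32_t s = (uint32_t)a + (uint32_t)b;
    return s > UINT16_MAX ? UINT16_MAX : (uint16_t)s;
}

/* Model of PMADDWD: four signed 16-bit products, with adjacent pairs
   summed into two 32-bit results. The sum is formed in 64 bits and then
   truncated so that the one overflowing case wraps like the hardware. */
static void pmaddwd_model(const int16_t a[4], const int16_t b[4], int32_t out[2]) {
    int64_t lo = (int64_t)a[0] * b[0] + (int64_t)a[1] * b[1];
    int64_t hi = (int64_t)a[2] * b[2] + (int64_t)a[3] * b[3];
    out[0] = (int32_t)(uint32_t)lo;
    out[1] = (int32_t)(uint32_t)hi;
}
```

For example, adding 1 to 32767 gives -32768 with the wrapping add but stays at 32767 with the signed saturating add. For pixel arithmetic the saturating forms are usually the ones you want, since a wrapped result shows up as a visibly wrong color.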

MMX multiplication is split into a high form and a low form (PMULHW and PMULLW), because the result has the same width as the operands, so one instruction can return only half of the 32-bit product. If the high half of the product is known to be 0, the low multiplication alone is enough. If you want the complete result, you have to multiply twice, once for each half.

For the shift instructions, note that PSLLW, PSRLD, and the like are logical shifts: the sign bit is shifted along with the other bits, and the vacated bits are filled with 0. Only PSRAW and PSRAD are arithmetic shifts, which preserve the sign by filling the vacated high bits with copies of the sign bit. MMX has no division instruction, so find a way to turn the divisor into a power of two, 2^N, and then use PSRAW MM0, N to perform the division.
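The high/low multiplication and the two kinds of right shift can be sketched per lane in plain C. This models the semantics of PMULLW, PMULHW, PSRLW, and PSRAW as I understand them; the function names are hypothetical and chosen only for this example.

```c
#include <stdint.h>

/* One lane of PMULLW: low 16 bits of the signed 32-bit product. */
static uint16_t mul_lo16(int16_t a, int16_t b) {
    return (uint16_t)((int32_t)a * (int32_t)b);
}

/* One lane of PMULHW: high 16 bits of the signed 32-bit product. */
static int16_t mul_hi16(int16_t a, int16_t b) {
    int32_t p = (int32_t)a * (int32_t)b;
    return (int16_t)(uint16_t)((uint32_t)p >> 16);
}

/* The complete product needs both halves, i.e. two multiplications. */
static int32_t mul_full(int16_t a, int16_t b) {
    uint32_t hi = (uint16_t)mul_hi16(a, b);
    uint32_t lo = mul_lo16(a, b);
    return (int32_t)((hi << 16) | lo);
}

/* One lane of PSRLW: logical right shift, vacated bits filled with 0. */
static uint16_t shr_logical16(uint16_t v, unsigned n) {
    return n > 15 ? 0 : (uint16_t)(v >> n);
}

/* One lane of PSRAW: arithmetic right shift, vacated bits filled with
   the sign bit. Written without shifting a negative value, because that
   is implementation-defined in C. */
static int16_t shr_arith16(int16_t v, unsigned n) {
    int32_t x = v;
    if (n > 15) n = 15;
    return (int16_t)(x >= 0 ? x >> n : ~((~x) >> n));
}
```

One detail to keep in mind when replacing a division with PSRAW: the arithmetic shift rounds toward negative infinity, while C integer division rounds toward zero. In this model `shr_arith16(-7, 1)` is -4, whereas `-7 / 2` in C is -3, so negative inputs can differ by one from a reference implementation written with `/`.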

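The most complicated MMX instructions are the pack and unpack instructions, which rearrange the data held in the MMX registers.

PACKSSWB MM0, MM1 packs 16-bit values into 8-bit values with signed saturation. With the values listed from low to high:

    // MM0 holds four 16-bit values: A B C D
    // MM1 holds four 16-bit values: E F G H
    PACKSSWB MM0, MM1    // result MM0 holds eight 8-bit values: A B C D E F G H

The only difference between PACKUSWB and PACKSSWB is that PACKUSWB saturates to the unsigned 8-bit range.

PUNPCKHBW unpacks (interleaves) the high-order packed data, and PUNPCKLBW unpacks the low-order packed data. For example, suppose MM0 and MM1 each hold eight 8-bit values, listed here from high to low:

    // MM0: A7, A6, A5, A4, A3, A2, A1, A0
    // MM1: B7, B6, B5, B4, B3, B2, B1, B0
    PUNPCKHBW MM0, MM1   // result MM0: B7, A7, B6, A6, B5, A5, B4, A4
    PUNPCKLBW MM0, MM1   // result MM0: B3, A3, B2, A2, B1, A1, B0, A0

Each of the two results above is computed from the original contents of MM0, not from the output of the other instruction.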
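The lane rearrangement done by the pack and unpack instructions is easier to check against a small model. The following plain C sketch reproduces the behavior described above, with array index 0 standing for the lowest lane of a register. The function names are made up for this example, and `dst` and `src` correspond to the first and second operand of the instruction.

```c
#include <stdint.h>

/* Clamp a signed 16-bit value to the signed or unsigned 8-bit range. */
static int8_t sat_s8(int16_t v) {
    return v > 127 ? 127 : v < -128 ? -128 : (int8_t)v;
}
static uint8_t sat_u8(int16_t v) {
    return v > 255 ? 255 : v < 0 ? 0 : (uint8_t)v;
}

/* Model of PACKSSWB dst, src: low four bytes come from dst, high four
   from src, each saturated to the signed 8-bit range. */
static void packsswb_model(const int16_t dst[4], const int16_t src[4], int8_t out[8]) {
    for (int i = 0; i < 4; i++) {
        out[i] = sat_s8(dst[i]);
        out[i + 4] = sat_s8(src[i]);
    }
}

/* Model of PACKUSWB dst, src: same layout, unsigned saturation. */
static void packuswb_model(const int16_t dst[4], const int16_t src[4], uint8_t out[8]) {
    for (int i = 0; i < 4; i++) {
        out[i] = sat_u8(dst[i]);
        out[i + 4] = sat_u8(src[i]);
    }
}

/* Model of PUNPCKLBW dst, src: interleave the low four bytes of each
   operand; the dst byte lands in the lower slot of each pair. */
static void punpcklbw_model(const uint8_t dst[8], const uint8_t src[8], uint8_t out[8]) {
    for (int i = 0; i < 4; i++) {
        out[2 * i] = dst[i];
        out[2 * i + 1] = src[i];
    }
}

/* Model of PUNPCKHBW dst, src: the same interleave on the high four bytes. */
static void punpckhbw_model(const uint8_t dst[8], const uint8_t src[8], uint8_t out[8]) {
    for (int i = 0; i < 4; i++) {
        out[2 * i] = dst[i + 4];
        out[2 * i + 1] = src[i + 4];
    }
}
```

A common use in pixel conversion code is to unpack against a register that holds all zeros: because the second operand supplies the upper byte of each pair, PUNPCKLBW and PUNPCKHBW then widen unsigned 8-bit samples to 16-bit values that can be multiplied and added without overflow, and PACKUSWB narrows the results back to bytes at the end.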

