Optimizing memory operations in multithreaded VC++ programs

zhaozj2021-02-08 385

Author / Li Hongya

Many programmers have found that programs written in VC++ run slowly on multiprocessor computers, mostly because multiple threads compete for the same resource. For programs written in VC++, the problem lies in the way VC++ implements memory management. This article explains the problem and provides a simple solution that lets such programs avoid the bottleneck on multiprocessors. The method can be used even when you do not have the source code of the VC++ program.

Problem

The C and C++ runtime libraries provide functions for working with heap memory: C provides malloc() and free(), and C++ provides new and delete. Whether you call malloc() or new, these functions look for an unused block in heap memory that is at least as large as the size requested. If no unused block is large enough, the runtime library requests new pages from the operating system. A page is the unit the virtual memory manager operates on; on NT platforms with Intel-based processors it is generally 4,096 bytes. When you release memory with free() or delete, the blocks return to the heap so they can be reused.
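The search described above can be sketched in a few lines of C. This is only an illustration under simplifying assumptions: the free list is a plain array of block sizes, and the names find_first_fit and pages_needed are hypothetical. The VC++ runtime uses its own internal data structures.

```c
#include <assert.h>
#include <stddef.h>

#define PAGE_SIZE 4096u /* typical page size on Intel-based NT, per the text */

/* Return the index of the first free block that can hold `request` bytes,
   or -1 if no block is large enough (hypothetical helper). */
int find_first_fit(const size_t *free_sizes, int block_count, size_t request)
{
    int i;

    assert(free_sizes != NULL || block_count == 0);
    for (i = 0; i < block_count; i++) {
        if (free_sizes[i] >= request)
            return i;
    }
    return -1;
}

/* Number of whole pages to request from the operating system when
   no free block fits; the division rounds up. */
size_t pages_needed(size_t request)
{
    return (request + PAGE_SIZE - 1) / PAGE_SIZE;
}
```

With free blocks of 128, 512 and 2,048 bytes, a 256-byte request is served from the second block, while a 4,097-byte request fits nowhere and would need two new pages.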

These operations look unremarkable, but they are the key to the problem. The problem occurs when several threads request memory at the same time, which usually happens on multiprocessor systems. Even on a single-processor system, however, it can happen if a thread is preempted in the middle of a memory operation.

Consider two threads in the same process: thread 1 requests 1,024 bytes of memory while thread 2, running on another processor, requests 256 bytes. The memory manager finds an unused block for thread 1, and the same function finds the same block for thread 2. If both threads then update the internal data structures that record the allocated blocks and their sizes, the heap is corrupted. Even if both requests appear to succeed and both threads believe they own that memory, the program will eventually fail; it is only a matter of time.

This is called contention, and it is the biggest problem in writing multithreaded programs. The key to solving it is to protect these memory manager functions with a lock. The lock guarantees that threads run the same code in mutual exclusion: if one thread is running the protected code, the other threads must wait. This solution is also known as serialization.
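A minimal sketch of serialization, assuming a toy bookkeeping structure in place of the real heap: every update of the shared fields happens between acquiring and releasing one lock. The toy_heap type and its functions are hypothetical names. On Windows the sketch uses a critical section; elsewhere it falls back to a pthread mutex so that it compiles on other platforms too.

```c
#include <assert.h>
#include <stddef.h>

#ifdef _WIN32
#include <windows.h>
typedef CRITICAL_SECTION heap_lock_t;
#define heap_lock_init(l)    InitializeCriticalSection(l)
#define heap_lock_acquire(l) EnterCriticalSection(l)
#define heap_lock_release(l) LeaveCriticalSection(l)
#else
#include <pthread.h>
typedef pthread_mutex_t heap_lock_t;
#define heap_lock_init(l)    pthread_mutex_init((l), NULL)
#define heap_lock_acquire(l) pthread_mutex_lock(l)
#define heap_lock_release(l) pthread_mutex_unlock(l)
#endif

/* Toy bookkeeping standing in for the heap's internal data structures. */
typedef struct {
    heap_lock_t lock;
    size_t bytes_in_use;
    size_t block_count;
} toy_heap;

void toy_heap_init(toy_heap *h)
{
    assert(h != NULL);
    heap_lock_init(&h->lock);
    h->bytes_in_use = 0;
    h->block_count = 0;
}

/* Both fields change together while the lock is held, so no other
   thread can observe or modify a half-updated state. */
void toy_heap_record_alloc(toy_heap *h, size_t size)
{
    heap_lock_acquire(&h->lock);
    h->bytes_in_use += size;
    h->block_count += 1;
    heap_lock_release(&h->lock);
}

void toy_heap_record_free(toy_heap *h, size_t size)
{
    heap_lock_acquire(&h->lock);
    assert(h->bytes_in_use >= size && h->block_count > 0);
    h->bytes_in_use -= size;
    h->block_count -= 1;
    heap_lock_release(&h->lock);
}
```

The price of this safety is that a thread calling toy_heap_record_alloc() while another thread holds the lock has to wait, which is exactly the waiting the rest of the article is concerned with.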

NT provides several ways to implement a lock. CreateMutex() creates a system-wide lock object, but this is the least efficient method. A critical section created with InitializeCriticalSection() is much more efficient. For even better performance, you can use the spin lock available from NT 4 with Service Pack 3; for details, see the description of the InitializeCriticalSectionAndSpinCount() function in the VC++ Help. Interestingly, although the help file says that a spin lock is used by NT's heap manager (the HeapAlloc() family of functions), the heap management functions of the VC++ runtime do not use a spin lock to synchronize access to the heap. If you look at the source code of the heap functions in the VC++ runtime, you will find that one critical section is used for all memory operations. If the VC++ runtime could be made to use HeapAlloc() instead of its own heap functions, it would gain speed, because a spin lock would be used in place of the plain critical section.

By using a critical section, the VC++ runtime lets multiple threads allocate and release memory safely. However, this approach degrades performance because of contention for the heap. If one thread tries to access the heap while another thread is using it, the first thread has to wait, gives up its time slice, and the system switches to another thread. Thread switching is quite expensive in NT because each switch uses up a small percentage of the thread's time slice. If many threads access the same heap, the number of thread switches rises enough to cause a severe loss of performance.

Symptoms

How do you detect this performance loss on a multiprocessor system? There is a simple way: open Performance Monitor in Administrative Tools and add the Context Switches/sec counter from the System object. Then run the multithreaded program you want to test and add the % Processor Time counter for that process from the Process object. This shows how many context switches occur while the processor is under high load.

Context switching under high load is normal, but when the count exceeds 80,000 or 100,000 per second, too much time is being wasted on thread switching. A quick calculation shows why: at 100,000 switches per second, each thread runs for only 10 microseconds before the next switch, whereas the normal time slice on NT is about 12 milliseconds, more than a thousand times longer.
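The calculation above can be written out as a small C sketch. The function names are hypothetical, and the 12 ms time slice is the approximate figure quoted in the text.

```c
#include <assert.h>

/* Average run time between two context switches, in microseconds. */
double microseconds_per_switch(double switches_per_second)
{
    assert(switches_per_second > 0.0);
    return 1000000.0 / switches_per_second;
}

/* How many times longer a full time slice is than the average run time
   a thread actually gets at the given switch rate. */
double slice_ratio(double time_slice_ms, double switches_per_second)
{
    return (time_slice_ms * 1000.0) / microseconds_per_switch(switches_per_second);
}
```

For 100,000 switches per second, microseconds_per_switch() gives 10 microseconds, and slice_ratio(12.0, 100000.0) gives 1,200: a thread gets less than a thousandth of its normal time slice.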

The performance chart in Figure 1 shows excessive thread switching, and Figure 2 shows the same process in the same environment after applying the solution described below. In Figure 1 the system performs about 120,000 thread switches per second; after the change, the number drops below 1,000 per second. Both charts were captured while running the same test program, in which three threads simultaneously perform heap operations on blocks of up to 2,048 bytes. The hardware was a dual Pentium II 450 machine with 256 MB of memory.

Solution

This method requires the multithreaded program to be written in VC++ and dynamically linked to the C runtime library. The VC++ runtime library file MSVCRT.DLL installed on the NT system must be version 6, and the installed Service Pack must be 5 or higher. If the program is compiled with VC++ V6.0, the method also works for multithreaded programs that are statically linked with LIBCMT.LIB.

When a VC++ program starts, the C runtime library is initialized. One initialization step determines which heap manager to use: the VC++ V6.0 runtime can use its own internal heap management functions or call the operating system's heap management functions (the HeapAlloc() family) directly. The __heap_select() function makes this decision in the following three steps:

1. Check the version of the operating system. If the program runs on NT and the major version is 5 or higher (Windows 2000 and later), HeapAlloc() is used.

2. Look for the environment variable __MSVCRT_HEAP_SELECT. If it is set, it determines which heap functions are used. If its value is __GLOBAL_HEAP_SELECTED, the setting applies to all programs. If it contains the full path of an executable file, GetModuleFileName() is called to check whether the running program is that file. Which heap functions are used is given by the value after the comma: 1 means HeapAlloc(), 2 means the VC++ V5 heap functions, and 3 means the VC++ V6 heap functions.

3. Check the linker version flag in the executable file. If the file was created by VC++ V6 or higher, the version 6 heap functions are used; otherwise the version 5 heap functions are used.

So how do we improve the performance of the program? If the program is dynamically linked to MSVCRT.DLL, make sure that the DLL dates from February 1999 or later and that the installed Service Pack is 5 or higher. If it is statically linked, make sure that the linker version number is 6 or higher; you can check this version number with the QuickView.exe program. To change the heap functions used by the program you want to run, type the following command at the command line:

set __MSVCRT_HEAP_SELECT=__GLOBAL_HEAP_SELECTED,1

From then on, every program started from this command line inherits the environment variable, so HeapAlloc() is used for its heap operations. If you want all programs to use these faster heap functions, run the System applet in Control Panel, select the Environment tab, click "System Variables", enter the variable name and value, press the "Apply" button, close the dialog box, and restart the machine.

According to Microsoft, programs built with versions earlier than VC++ V6 may have problems when they use the VC++ V6 heap manager. If you run into such a problem after making the setting above, you can use a batch file to override the setting for individual programs, for example:

set __MSVCRT_HEAP_SELECT=c:\program files\myapp\myapp.exe,1 c:\bin\buggyapp.exe,2
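How a value in this format could be interpreted is sketched below in portable C. The sketch is based only on the description in this article, not on the runtime's source: the function name heap_code_from_setting is hypothetical, and the path comparison is assumed to be an exact, case-sensitive match.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define GLOBAL_SELECTOR "__GLOBAL_HEAP_SELECTED"

/* Return the heap code (1 = HeapAlloc(), 2 = VC++ V5 heap, 3 = VC++ V6 heap)
   that `setting` assigns to the program at `exe_path`, or 0 if the setting
   says nothing usable about that program. */
int heap_code_from_setting(const char *setting, const char *exe_path)
{
    const char *entry;
    const char *comma;
    long code;

    if (setting == NULL || exe_path == NULL)
        return 0;

    if (strncmp(setting, GLOBAL_SELECTOR, strlen(GLOBAL_SELECTOR)) == 0)
        entry = setting;                   /* applies to every program */
    else
        entry = strstr(setting, exe_path); /* per-program entry */

    if (entry == NULL)
        return 0;

    /* The heap code is the number after the next comma. */
    comma = strchr(entry, ',');
    if (comma == NULL)
        return 0;

    code = strtol(comma + 1, NULL, 10);
    return (code >= 1 && code <= 3) ? (int)code : 0;
}
```

In a real program the setting string would come from getenv("__MSVCRT_HEAP_SELECT"). With the batch file value shown above, c:\bin\buggyapp.exe maps to code 2 and c:\program files\myapp\myapp.exe to code 1; a program not named in the value gets 0, meaning the variable does not decide and the next step applies.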

Test

To verify the effect on a multiprocessor machine, a test program, heaptest.c, was written. The program takes three parameters: the first is the number of threads, the second is the maximum size of the memory blocks requested, and the third is the number of allocations each thread performs.

#define WIN32_LEAN_AND_MEAN
#include <windows.h>
#include <process.h>
#include <stdio.h>
#include <stdlib.h>

/* compile with cl /MT heaptest.c */
/* to switch to the system heap issue the following command
   before starting heaptest from the same command line
   set __MSVCRT_HEAP_SELECT=__GLOBAL_HEAP_SELECTED,1 */

/* structure transfers variables to the worker threads */
typedef struct tData
{
    int maximumLength;
    int allocCount;
} threadData;

void printUsage(char **argv)
{
    fprintf(stderr, "Wrong number of parameters.\nUsage:\n");
    fprintf(stderr, "%s threadCount maxAllocLength allocCount\n\n",
            argv[0]);
    exit(1);
}

unsigned __stdcall workerThread(void *myThreadData)
{
    int count;
    threadData *myData;
    char *dummy;

    /* seed the random number generator differently in each thread */
    srand(GetTickCount() * GetCurrentThreadId());
    myData = (threadData *)myThreadData;
    /* now let us do the real thing */
    for (count = 0; count < myData->allocCount; count++)
    {
        dummy = (char *)malloc((rand() % myData->maximumLength) + 1);
        free(dummy);
    }
    _endthreadex(0);
    /* to satisfy compiler */
    return 0;
}

int main(int argc, char **argv)
{
    int threadCount;
    int count;
    threadData actData;
    HANDLE *threadHandles;
    DWORD startTime;
    DWORD stopTime;
    DWORD retValue;
    unsigned dummy;

    /* check parameters */
    if (argc != 4)
        printUsage(argv);
    /* get parameters for this run */
    threadCount = atoi(argv[1]);
    if (threadCount > 64) /* WaitForMultipleObjects() accepts at most 64 handles */
        threadCount = 64;
    actData.maximumLength = atoi(argv[2]) - 1;
    actData.allocCount = atoi(argv[3]);
    /* reject values that would make the modulo in workerThread() invalid */
    if (threadCount < 1 || actData.maximumLength < 1 || actData.allocCount < 1)
        printUsage(argv);
    threadHandles = (HANDLE *)malloc(threadCount * sizeof(HANDLE));
    if (threadHandles == NULL)
    {
        fprintf(stderr, "Out of memory.\n");
        exit(2);
    }
    printf("Test run with %d simultaneous threads:\n", threadCount);
    startTime = GetTickCount();
    for (count = 0; count < threadCount; count++)
    {
        threadHandles[count] = (HANDLE)_beginthreadex(0, 0,
            &workerThread, (void *)&actData, 0, &dummy);
        /* _beginthreadex() returns 0 on failure */
        if (threadHandles[count] == NULL)
        {
            fprintf(stderr, "Error starting worker threads.\n");
            exit(2);
        }
    }
    /* wait until all threads are done */
    retValue = WaitForMultipleObjects(threadCount, threadHandles,
                                      TRUE, INFINITE);
    if (retValue == WAIT_FAILED)
        fprintf(stderr, "Error waiting for worker threads.\n");
    stopTime = GetTickCount();
    printf("Total time elapsed was: %lu milliseconds",
           (unsigned long)(stopTime - startTime));
    printf(" for %d alloc operations.\n",
           actData.allocCount * threadCount);
    /* cleanup */
    for (count = 0; count < threadCount; count++)
        CloseHandle(threadHandles[count]);
    free(threadHandles);
    return 0;
}

After checking its parameters, the test program creates the number of threads given by the first parameter; the threadData structure passes the maximum block size and the allocation count to the threads. The memory operations take place in workerThread(), which first initializes the random number generator and then performs the specified number of malloc() and free() operations. The main thread calls WaitForMultipleObjects() to wait for the worker threads to finish and then prints the elapsed time. The timing is not very precise, but that hardly affects the result. To compile the program, you need VC++ V6.0 installed. Open a command-line window and type the following command:

cl /MT heaptest.c

/MT links the program with the multithreaded version of the C runtime library. If you want to link dynamically, use /MD. If you have VC++ V5.0 and a newer version of MSVCRT.DLL, you should link dynamically. Now run the program and use Performance Monitor to check the number of thread switches. Then set the environment variable as described above, run the program again, and check the number of thread switches once more.

When the two charts were captured, the test program took 60,953 ms to perform 3,000,000 allocations using the VC++ V6 heap functions. After switching to HeapAlloc(), the same operations took only 5,291 ms. In this particular case, using HeapAlloc() improved performance by more than 10 times! Similar gains can also be seen in real programs.
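The figures above can be checked with a short C sketch. The function names are hypothetical; the inputs are the measurements quoted in the text.

```c
#include <assert.h>

/* Ratio of the run time before the change to the run time after it. */
double speedup(double before_ms, double after_ms)
{
    assert(after_ms > 0.0);
    return before_ms / after_ms;
}

/* Average cost of one malloc()/free() pair, in microseconds. */
double microseconds_per_operation(double total_ms, double operations)
{
    assert(operations > 0.0);
    return (total_ms * 1000.0) / operations;
}
```

speedup(60953.0, 5291.0) is about 11.5, which matches the claim of more than 10 times. Per allocation, the cost falls from roughly 20.3 microseconds to roughly 1.8 microseconds.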

Conclusion

Multiprocessor systems can naturally improve the performance of a program, but if the processors contend for the same resource, a multiprocessor system may perform worse than a single-processor system. For C/C++ programs, the problem typically occurs when multiple threads perform memory operations frequently. As described above, a few settings may greatly improve the performance of multithreaded programs on multiprocessors. The method needs no source code and no recompilation of the executable, and its biggest benefit is that the performance gain costs nothing.
