Optimization of memory allocations


STL introduced a mechanism called the "allocator": an additional level of abstraction for STL containers that encapsulates information about the memory model. This abstraction is infrequently used by developers, but it can provide many opportunities for performance optimization. There are several memory-related headaches developers have been struggling with for ages. One of them is memory fragmentation; another is the bottleneck created in multithreaded environments by one shared heap. To address these issues the BitMagic library makes use of allocators.

Heap Fragmentation

When a program starts running, the runtime library forms a region of memory called the "heap". In the simplest case the heap looks like one solid unused block of memory.

As the program works, it allocates blocks of memory.


Fig. 1

What we see here is allocated and free memory placed together, with new portions allocated from one big pool of free memory.

Over time our program will free some memory, and the overall heap picture changes.


Fig. 2

As we see, our heap quickly becomes fragmented: free and allocated regions become mixed. Technically the runtime heap manager is designed to handle situations like this, but such universality comes at a price, and in many cases the price is low performance. The heap manager needs to maintain lists of free blocks, and every time we try to allocate it must scan the available blocks looking for a suitable candidate. Over time, for long-running server processes, this can result in performance degradation.


Fig. 3

Now we cannot place the candidate block even though we have sufficient free space in total: our heap is too fragmented. In this case the heap manager can either reject the allocation request by returning NULL or extend the heap by allocating more memory pages from the operating system. Allocation failure has serious consequences for the application, because we do not always have an escape route, and many programs in this situation will crash or gracefully terminate, which for any practical purpose can be seen as a special form of crash. Heap growth means the overall application image grows, eating up resources from other running programs.

In a typical C++ program, where objects are created and destroyed all the time, the problem looks unsolvable, but careful resource planning can be an improvement. If we go back to Fig. 1 we notice that our exemplary program often allocates blocks of a certain uniform size (in our example, blocks A, C and E). This regularity can be used to reduce the runtime heap manager's overhead.


Fig. 4

If an application knows that during its lifetime it is supposed to allocate a number of equally sized blocks, it can allocate them in bulk as one big block and create a simple custom heap which manages allocation more efficiently. Such a private heap will also eventually become fragmented, but the algorithm for finding a free candidate is much simpler: since all blocks are the same size, any free block is suitable, and the allocator simply takes the first block from the head of the free list.
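Such a private heap can be sketched as a minimal fixed-size block pool (the class and all names here are illustrative, not part of any library; it assumes one uniform block size of at least `sizeof(void*)` bytes). One big block is allocated up front, and free blocks are threaded onto an intrusive free list, so both allocation and deallocation are constant-time:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Illustrative fixed-size block pool: one bulk allocation, O(1) alloc/free.
class FixedPool
{
public:
    FixedPool(std::size_t block_size, std::size_t block_count)
        : storage_(block_size * block_count), free_head_(0)
    {
        // Thread every block onto an intrusive free list: the first bytes
        // of each free block store the pointer to the next free block.
        for (std::size_t i = 0; i < block_count; ++i)
        {
            void* block = &storage_[i * block_size];
            *static_cast<void**>(block) = free_head_;
            free_head_ = block;
        }
    }

    void* allocate()
    {
        if (!free_head_)
            return 0;  // pool exhausted: a real design falls back to the main heap
        void* block = free_head_;
        free_head_ = *static_cast<void**>(block);
        return block;
    }

    void deallocate(void* block)
    {
        *static_cast<void**>(block) = free_head_;  // push back, no searching
        free_head_ = block;
    }

private:
    std::vector<unsigned char> storage_;  // one big block allocated up front
    void* free_head_;                     // head of the intrusive free list
};
```

Note how deallocation never scans anything: a freed block is pushed onto the list head, and the next allocation pops it straight back.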

The drawback of this method is that we need to predict the peak number of blocks we will need and allocate them all at once. This is not always possible or affordable, so in real life the private heap can take the form of several big blocks allocated and deallocated on demand. Another ad-hoc alternative is for the program to allocate one private heap block sized for average consumption and, when it is full, redirect further allocations to the main heap. In this case we optimize for the most frequent allocation scenario.

Many C++ programs also attack this problem by maintaining per-class free lists. This is a step in the right direction, but the solution is incomplete and does not really fight the fragmentation problem, since every such class maintains its own free list and inter-class optimization becomes problematic. If free lists are shared between classes, it becomes much more difficult to create thread-safe containers, and classes become dependent on synchronization primitives.
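A per-class free list is usually implemented by overloading the class-level operator new/delete, roughly as in this sketch (the class and names are hypothetical, not from any library; the payload keeps the object at least pointer-sized so a freed node can hold the list link):

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>

// Illustrative per-class free list: freed instances are kept on a static
// list and reused before asking the runtime heap again.
class Node
{
public:
    long payload[2];  // keeps sizeof(Node) >= sizeof(void*) for the list link

    static void* operator new(std::size_t size)
    {
        if (free_list_)
        {
            void* p = free_list_;
            free_list_ = *static_cast<void**>(p);  // pop a recycled node
            return p;
        }
        return ::malloc(size);  // free list empty: fall back to the heap
    }

    static void operator delete(void* p)
    {
        *static_cast<void**>(p) = free_list_;  // push onto the free list
        free_list_ = p;
    }

private:
    static void* free_list_;
};

void* Node::free_list_ = 0;
```

The limitation described above is visible here: `free_list_` belongs to `Node` alone, so memory recycled by one class can never serve another.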

The best design was proposed by STL, where memory allocation logic is separated into a special interface named "allocator". STL containers can be externally parameterized with allocators without any change to their source code. The BM library shares the same approach, not trying to optimize memory management by itself; just like STL, it delegates all memory-related needs to an external interface.

Take a look at the Boost Pool library, which addresses the same problem of memory allocation.

Memory allocation in multithreaded environments

Another pile of performance-related questions comes from the multithreaded (MT), concurrent nature of modern software. To achieve good scalability, different threads of execution should not have to wait for a shared resource to become available. The heap management infrastructure is one of the most essential shared resources.


Fig. 5

Many multithreaded applications that use the standard memory allocation routines pay a significant performance penalty when running on a multiprocessor machine. This is due to the serialization used by the default heap. On a multiprocessor machine, more than one thread may try to allocate memory simultaneously: one thread blocks on the critical section guarding the heap, and the other thread must signal the critical section when it is finished in order to release the waiting thread. Moreover, a single-threaded allocator design can exacerbate false sharing, where different processors end up sharing cache lines.

On machines with two or more processors the heap serialization procedure can become a serious problem. In this article you can find test results showing a dual-processor machine working slower than a single-processor contender because of the heap serialization burden.

All this makes a single concurrent heap (e.g. a B-tree based one) too expensive.


Fig. 6

The pure private heaps design suggests using one heap per thread, avoiding heap contention altogether. Both Unix and Windows provide APIs to create and manipulate thread-local storage.
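The per-thread heap idea can be sketched with a trivial thread-local bump arena (illustrative names only; it uses C++11 `thread_local` rather than the raw OS TLS APIs). Each thread carves allocations out of its own buffer, so no lock or critical section guards the fast path and threads never contend:

```cpp
#include <cassert>
#include <cstddef>

// Illustrative per-thread bump arena: allocation is a pointer increment.
struct ThreadArena
{
    enum { kSize = 1024 };
    unsigned char buf[kSize];
    std::size_t used = 0;

    void* allocate(std::size_t n)
    {
        if (used + n > kSize)
            return 0;  // arena full: a real design falls back to a shared heap
        void* p = buf + used;
        used += n;
        return p;
    }
};

inline ThreadArena& this_thread_arena()
{
    thread_local ThreadArena arena;  // one private arena per thread
    return arena;
}
```

Because every thread sees its own `arena` instance, the `used` counter needs no atomic operations; the cost is that memory freed by one thread cannot be reused by another without extra machinery.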

Interestingly, the MT model with separate heaps corresponds very well to multitasking, where programs are completely separated by the OS and heap contention is impossible by definition. It means that multitasking architectures are not obsolete and should not be automatically abandoned in favor of more fashionable multithreading. Multitasking applications can easily be converted to work in cluster environments, workstation networks, or any other environment where it is profitable to employ what is called 'embarrassingly parallel' computation: the program consists of fragments which are too 'embarrassed' to talk to each other and hence can run independently.

Allocators for BM library

The standard STL allocator is an interface that manages allocation and freeing of arrays of objects of a certain type. It goes along with the concept of a typical STL container, which is specialized to manipulate only one type. The BM library internally employs several different data structures to represent bitsets, which makes a single allocator insufficient.

BM defines a new template class mem_alloc working as an adapter between STL-style allocators and all other BM types. Currently BM uses two allocators: one for pointers and another for arrays of 32-bit unsigned integers used as bit blocks. (Actually BM uses both 32-bit integers and 16-bit short integers, but shorts can be allocated as 32-bit integers and then cast without risking alignment problems.)

It was decided to use a minimalist version of the STL allocator, requiring only two static functions: allocate and deallocate. BM defines a default implementation that does not use STL but rather the plain malloc/free functions.

class block_allocator
{
public:
    static bm::word_t* allocate(size_t n, const void *)
    {
        return (bm::word_t*) ::malloc(n * sizeof(bm::word_t));
    }

    static void deallocate(bm::word_t* p, size_t)
    {
        ::free(p);
    }
};

class ptr_allocator
{
public:
    static void* allocate(size_t n, const void *)
    {
        return ::malloc(n * sizeof(void*));
    }

    static void deallocate(void* p, size_t)
    {
        ::free(p);
    }
};

By default BM is independent of STL and does not even use the new/delete operators, so it can easily be integrated into any project, whether or not it uses STL or redefines the new/delete operators (Microsoft MFC, for example).

If you choose to employ any custom memory allocation technique you need to:

  • implement a bm::word_t allocator and a void* allocator.
  • define a new BM-compatible allocator adapter, parameterized by your two allocators:
            typedef mem_alloc<my_block_allocator, my_ptr_allocator> myBM_alloc;
  • use the allocator adapter as the template argument for bvector:
            typedef bm::bvector<myBM_alloc> bvect;
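The wiring of the first two steps can be sketched end to end with a toy stand-in for BM's mem_alloc adapter (all names here, my_block_allocator, mem_alloc_sketch and so on, are hypothetical; the real adapter and bm::word_t live in the BM headers). The adapter routes bit-block requests to one allocator and pointer-array requests to the other:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>

typedef unsigned word_t;  // stand-in for bm::word_t

static std::size_t g_block_allocs = 0;  // instrumentation for this example

// Custom allocators following the two-static-function convention above.
class my_block_allocator
{
public:
    static word_t* allocate(std::size_t n, const void*)
    {
        ++g_block_allocs;
        return static_cast<word_t*>(::malloc(n * sizeof(word_t)));
    }
    static void deallocate(word_t* p, std::size_t) { ::free(p); }
};

class my_ptr_allocator
{
public:
    static void* allocate(std::size_t n, const void*)
    {
        return ::malloc(n * sizeof(void*));
    }
    static void deallocate(void* p, std::size_t) { ::free(p); }
};

// Toy stand-in for BM's mem_alloc adapter: one template, two allocators.
template<class BA, class PA>
class mem_alloc_sketch
{
public:
    static word_t* alloc_bit_block(std::size_t n) { return BA::allocate(n, 0); }
    static void    free_bit_block(word_t* p)      { BA::deallocate(p, 0); }
    static void**  alloc_ptr(std::size_t n)
    {
        return static_cast<void**>(PA::allocate(n, 0));
    }
    static void    free_ptr(void** p)             { PA::deallocate(p, 0); }
};

typedef mem_alloc_sketch<my_block_allocator, my_ptr_allocator> myBM_alloc;
```

Swapping in a pool-based my_block_allocator (like the fixed-size pool shown earlier) changes nothing else: the adapter and the container stay untouched, which is the whole point of the allocator abstraction.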

The BM library's vision of bitsets is very different from how STL implements them. BM does not allocate one big block of memory for the whole bitset at once: it uses postponed allocation and frees bitset blocks when possible. BM also uses an alternative GAP-based bitset representation to reduce the memory footprint. The expected result is that it allocates and deallocates memory blocks quite often. To alleviate consequences like heap fragmentation, the BM library uses fixed-size blocks of memory, so you can easily apply any of the heap management techniques described above. For the block allocator the sizes can be 2048 (the size of a bit block, SET_BLOCK_SIZE) and the values defined in gap_len_table divided by 2 (64, 128, 256, 512). The pointer allocator uses the single size of 256 (SET_ARRAY_SIZE). All sizes are given not in bytes but in units of allocation, as defined by the STL allocator requirements.

The current version of the BM library provides only the default allocator, which redirects all calls to the standard malloc/free. The classes above are a very simple example of how to create allocators for BM.
