Date of Original: March 21, 2004
I did some research into memory allocation methods for a particle system. A lot of people believe that a list of pointers is better than a list of objects, and some believe the opposite. Others use a static array. I decided to research into the different methods, using a particle system as the situtation.
Updated March 29 with a lot more results, refactored code, and two new algorithms, VectorAlgor and FlatArray.
I decided to research several algorithms for managing memory in particle systems. There is some debate as to which method is best, and I wanted to discover how the algorithms actually compared by benchmarking them. There seem to be three major schools of thought:
When I discovered that my compiler's implementation of std::allocator was nothing more than a wrapper around new/delete, I began to question my rationale for storing objects directly, and I decided to run these benchmarks.
In addition to the three algorithms above, I decided to try out two new algorithms:
I decided to research particle systems rather than general allocation because usually in game programming particles are very often created and destroyed and are a case that requires special consideration and many are created and destroyed every frame. Normal game objects usually are not created or destroyed anywhere near the rate of objects like particles.
After my first submission of this article, I got a lot of feedback. Two implementations of a packet, flat array algorithm. One is dynamically-sized and is based off of std::vector. The other is statically sized and uses an array.
The compiler environment is MSVC.NET 2003 (MSVC 7.1). I am using boost::timer and boost::format from the Boost library, version 1.31.0.
My system statistics:
Results from other systems are presented at the end, so those specifications will be given at that time. Those results have been contributed by other readers.
The particle class itself I decided would not offer any rendering capability, but would provide a reasonable update function that a particle class might have. I provided two update functions, one that checks if the particle is dead (for the static array), and one that does not (for the lists and packed arrays).
The particle object itself is 36 bytes and contains the following variables:
The particle as given is a typical particle definition for a modern 3D game engine. The update method provides Euler integration given the vectors but it does not do anything complicated like collection detection or drawing. This is relevant because it is important to know how much time you can save by using a custom memory system -- if only 1% of the time is spent managing the particle list, it is more worth our time to optimize the collision detection (if any) or the particle rendering code.
The ParticleManager is a class coded specifically for the Particle class, and is a fixed size. The algorithm is as follows:
The ParticleManager enforces a maximum particle limit. Sometimes this is desireable. Allocation is decently quick when the array is not full, but an update iteration will have to check all of the dead particles.
Some helpful person suggested that I could construct a packed array so that on object death I would copy the last live particle to the dead particle's position, and therefore never iterate over dead elements. Allocation would be constant time, but when particles die we would have to perform a copy.
I'd like to try to implement this algorithm in the future, but not for this initial test.
This is the standard std::list<Particle*> approach. Particles are allocated using new manually and the pointers are stored in the list. When objects die, the particles are deleted manually.
This algorithm has the advantage of not requiring any copying, but it is harder to implement due to explicit memory management. Also, two allocation steps are needed: one to allocate the list node and another to allocate the particle itself.
Like all the list algorithms, this algorithm benefits by never iterating over any dead particles.
I wanted to examine if eliminating the particle allocation step (avoid calling new/delete) was worthwhile, so I tried an algorithm similar to list of pointers except that I maintain two lists, a live list and a dead list.
Unlike the standard method, the dual list method will never shrink in memory size. It will use the amount of memory required to store the maximum number of particles that ever existed.
This is the standard std::list<Particle> approach. All memory management is handled the STL. Chances for memory leaks is greatly reduced due to no explicit memory management, and is easy to implement. It requires an extra copy though, through the copy constructor. It may be possible for some compilers to optimize out the copy, I've heard, but I did not examine if this is the case with MSVC.NET 2003 for this article.
I decided to try to write my own replacement to the std::allocator. This algorithm works similarly to the ParticleManager, except that when the array is filled, a new array is made. The BinAllocator works with constant-sized baskets, and maintains a list. Each basket has a specific size of bins. When all baskets are filled, a new basket is added to the the basket list.
The last element allocated is stored, so that iterations begin at that location. When the iteration reaches the end of a basket, the iteration moves to the next basket. If it returns back to the start, a new basket is made. Else it returns the first empty bin it finds.
For this article I attempted two basket sizes: 10 and 500. The 500 size performed much better, but used more memory. Like the dual list approach, the BinAllocator can only grow, never shrink. When a particle dies, that bin is marked as empty.
I added a small optimization afterwards, to cache the last element deleted. When the allocation occurs, if the last operation was a delete, it reuses that node. This skips the iteration process to find a new node, and almost guarantees that the memory is in the cache, while iterating to the next bin is likely to cause a cache miss as the next bin is likely to have been deallocated longest ago.
I still believe this algorithm can be substantially improved, perhaps by expanding the cache idea, but this this method starts to look like the dual list method, which I discovered to be very slow.
I did only the work I needed for BinAllocator to get it to run with Particles (that can't throw exceptions) in a std::list. Namely I did not examine exception-safety, nor did I allow allocations of multiple elements as would be required with std::vector.
These work similarly to the ParticleManager, except they fill in the holes when particles die. Both of these algorithms were contributed from the readers at Allegro.cc. VectorAlgor was contributed by Julien Cugniere. FlatArray was contributed by an Allegro.cc member who goes by the name "Orz."
The theory behind these is to use copy-on-death semantics to keep from having to maintain a linked list (as with BinAllocator), and to never iterate over dead particles. The only state needed for the algorithm is the end of objects in the array, and the array itself. If there are 4 particles, taking up four slots A B C D, if particle B dies, the particle in slot D is copied to overwrite B, then the size is decremented by one.
The assumption is that the objects are small and copy quickly because they have a trivial copy operation. The disadvantage to this algorithm is that the addresses of the particle particles will change because of copy-on-death semantics, and due to resizing of the vector in VectorAlgor.
The only significant difference between VectorAlgor and FlatArray is that VectorAlgor uses std::vector, and is dynamically sized. FlatArray uses a straight array, and refuses new particles if it is filled (rather than searching for the particle that is closest to dying). For these performance, the FlatArray size was set to 50,000 particles, so no particles were dropped (our maximum average is 2500).
I wanted to try out an algorithm someone suggested to me, and that I've seen in my quick research of other algorithms. It is similar to the BinAllocator in that it gives out single storage units (although there are variations that can allocate arrays). It is dissimilar to the BinAllocator in that it maintains a linked list as part of the nodes. Initially all of the "bins" are in the same linked list -- a free node linked list. Allocation simply takes the top node off the free list, and deallocation simply adds the node to the front of the free list. This has the advantage of not having to copy objects or move them in memory, meaning it should work with larger objects of if you need to keep a pointer to an object while it is alive.
I did not have enough time to implement this algorithm, but I would like to in the future. I believe that it would improve on the performance of the BinAllocator while not having the performance penalty I saw in the Dual List algorithm by having each node contain the actual pointers rather than having a node as a separate object (the Dual List still needed to call new/delete when you moved a pointer between lists, because it had to allocate the actual node of the std::list).
It may also be possible to implement this algorithm as a singly linked list rather than a doubly linked as is std::list. All of the implementations I saw of this algorithm used a doubly linked list, and I'm not sure why, so further study is needed. I've heard this algorithm called "simple segregated storage" for those interested in following up on this.
I ran each algorithm in a mock particle engine update loop. I ran the loop a certain number of times, using a certain dt value. The dt value is in seconds and is the frame length. One particle is created during each frame, and the particles are created to have a maximum life between 0 and 5 seconds.
This is by no means an exhausive test. Firstly, there are no memory allocations but particles. The other game code might call new with widly varying size requests that may reduce the efficency of new moreso than seen here. Secondly, the peformance may change significantly if the particle creation was not as constant. Thirdly, changes in particle life span might affect our choice in algorithm.
Here is the source code for this test. The source code itself is in the public domain. I would appreciate if the source code is used (probably only the BinAllocator is useful) that I would be mentioned in the program's credits, but it is not required.
I've refactored the code quite a bit from the original article's release when others started contributing code, so I wanted to clean it up. I rewrote the test framework to output results in a form that did not require hand post-processing, as well as split up all of the algorithms into different headers.
The single CPP file approach I took was for compiling simplicity (as I do not supply a makefile), and to make sure that the proper inlining is done equally in all compilers (MSVC.NET can inline across object file boundaries, GCC cannot still as far as I know).
The source code also contains the spreadsheet in which I placed the results from the original article. The file's format is Excel 2000. The newer results are only on this page.
The download file includes the source code and the executable compiled by MSVC.NET 7.1. Default options for a console application, except that whole program optimizations are turned on. MSVC.NET workspace files are provided.
Download source code: 78k zip, MS-DOS format
It appears that the VectorAlgor is the best solution for our specific particle system and is memory efficent. Of the algorithms that do not copy or move objects, the best is the BinAllocator. The best solution using a standard STL algorithm is the list of objects. The list of pointers and dual list approach are not worth considering because of their slow performance. The ParticleManager and FlatArray algorithms are statically sized, and dynamic solutions that are equally fast or faster exist for these. A translation of FlatArray should be considered for implementation in other languages, such as C.
My guess is that VectorAlgor's performance will get worse and eventually get worse than the BinAllocator as object size increases, but more research is needed. I think that the BinAllocator can still be greatly improved through the algorithm discussed earlier. So for other memory management tasks that use large objects or need pointers that will not be invalidated, the BinAllocator is likely the best solution, but more research is needed to see if it can be improved by having a growable linked list.
The BinAllocator complies with the std::allocator interface, so it can be used directly with std::list, and with some work it might be appropriate for the other containers such as std::map. It is likely never appropriate (or even usable) for std::vector or std::string as they are probably implemented to allocate arrays of objects.
The theoritical average number of particles for each column is 25, 250, 500, and 2500.
The columns are dt values (frame length). One particle is created each frame, and particle life is 0 to 5 seconds (random).
The format of the results is a CSV format in aligned columns (thanks to boost::format for making this easy!)
Here are the results from my machine.
All times are in seconds. 10000 Loops , 0.1, 0.01, 0.005, 0.001 ParticleManager 500, 0.04, 0.08, 0.11, 0.14 ParticleManager 2000, 0.15, 0.191, 0.23, 0.39 List of Pointers, 0.01, 0.04, 0.061, 0.64 List of Objects, 0.011, 0.03, 0.07, 0.47 Dual List, 0.01, 0.05, 0.111, 1.221 Bin List w/cached, 0, 0.04, 0.06, 0.441 VectorAlgor, 0.01, 0.03, 0.05, 0.331 FlatArray 50000, 0.01, 0.03, 0.06, 0.34 Objects At End, 27, 257, 505, 2467 20000 Loops , 0.1, 0.01, 0.005, 0.001 ParticleManager 500, 0.08, 0.151, 0.16, 0.2 ParticleManager 2000, 0.311, 0.37, 0.481, 0.861 List of Pointers, 0.02, 0.09, 0.231, 4.005 List of Objects, 0.02, 0.05, 0.111, 1.081 Dual List, 0.02, 0.1, 0.231, 3.635 Bin List w/cached, 0.01, 0.06, 0.1, 1.052 VectorAlgor, 0.01, 0.05, 0.1, 0.721 FlatArray 50000, 0.01, 0.06, 0.11, 0.811 Objects At End, 21, 262, 501, 2475 50000 Loops , 0.1, 0.01, 0.005, 0.001 ParticleManager 500, 0.211, 0.37, 0.401, 0.521 ParticleManager 2000, 0.781, 0.951, 1.192, 2.173 List of Pointers, 0.06, 0.23, 0.551, 9.654 List of Objects, 0.04, 0.15, 0.271, 2.964 Dual List, 0.06, 0.23, 0.581, 8.202 Bin List w/cached, 0.03, 0.13, 0.261, 2.834 VectorAlgor, 0.03, 0.14, 0.25, 1.963 FlatArray 50000, 0.03, 0.15, 0.271, 2.143 Objects At End, 25, 248, 500, 2510 100000 Loops , 0.1, 0.01, 0.005, 0.001 ParticleManager 500, 0.41, 0.771, 0.822, 1.041 ParticleManager 2000, 1.532, 1.943, 2.323, 4.377 List of Pointers, 0.11, 0.471, 1.121, 22.503 List of Objects, 0.07, 0.3, 0.551, 5.979 Dual List, 0.11, 0.5, 1.182, 20.68 Bin List w/cached, 0.06, 0.27, 0.521, 5.728 VectorAlgor, 0.05, 0.271, 0.51, 4.026 FlatArray 50000, 0.05, 0.281, 0.541, 4.366 Objects At End, 20, 251, 485, 2481
Mika Halttunen from Allegro.cc contributed a partial result set for his machine.
All times are in seconds. 10000 Loops , 0.1, 0.01, 0.005, 0.001 ParticleManager 500, 0.49, 0.681, 1.242, 1.412 ParticleManager 2000, 3.515, 3.705, 3.956, 7.811 List of Pointers, 0.08, 0.441, 2.814, 14.962 List of Objects, 0.06, 0.41, 1.102, 11.486 Dual List, 0.091, 1.011, 3.265, 16.874 Bin List w/cached, 0.05, 0.381, 1.091, 9.934 VectorAlgor, 0.041, 0.34, 0.661, 5.448 FlatArray 50000, 0.04, 0.35, 0.681, 5.858 Objects At End, 22, 255, 500, 2453 20000 Loops , 0.1, 0.01, 0.005, 0.001 ParticleManager 500, 0.921, 1.372, 2.264, 2.713 ParticleManager 2000, 5.588, 6.109, 7.05, 14.471 List of Pointers, 0.15, 1.973, 5.148, 34.149 List of Objects, 0.13, 0.781, 2.013, 24.806 Dual List, 0.16, 1.692, 7.291, 36.632 Bin List w/cached, 0.09, 0.781, 2.113, 19.057 VectorAlgor, 0.09, 0.701, 1.422, 13.229 FlatArray 50000, 0.091, 0.781, 1.502, 10.725 Objects At End, 25, 253, 520, 2485 50000 Loops , 0.1, 0.01, 0.005, 0.001 ParticleManager 500, 2.334, 3.585, 5.828, 7.171 ParticleManager 2000, 13.639, 15.402, 17.636, 38.836 List of Pointers, 0.47, 6.159, 18.166, 95.788 List of Objects, 0.29, 1.923, 4.957, 62.27 Dual List, 0.38, 5.298, 17.235, 109.537 Bin List w/cached, 0.471, 2.534, 7.671, 87.986 VectorAlgor, 0.291, 3.084, 5.969, 50.182 FlatArray 50000, 0.23, 2.113, 4.096, 41.77 Objects At End, 24, 255, 495, 2479 100000 Loops , 0.1, 0.01, 0.005, 0.001 ParticleManager 500, 5.909, 9.153, 16.213, 17.686 ParticleManager 2000, 55.51, 49.681, 64.253, 125.841 List of Pointers, 1.091, 10.275, 51.714, 253.345 List of Objects, 1.041, 6.63, 25.096, 230.741
Julien Cugniere was kind to provide a comparison between MSVC and GCC. First the GCC results:
All times are in seconds. 10000 Loops , 0.1, 0.01, 0.005, 0.001 ParticleManager 500, 0.08, 0.1, 0.13, 0.14 ParticleManager 2000, 0.281, 0.32, 0.351, 0.55 List of Pointers, 0.02, 0.04, 0.081, 0.651 List of Objects, 0.01, 0.03, 0.07, 0.56 Dual List, 0.01, 0.051, 0.09, 0.691 Bin List w/cached, 0.01, 0.03, 0.07, 0.591 VectorAlgor, 0, 0.03, 0.04, 0.28 FlatArray 50000, 0, 0.02, 0.05, 0.281 Objects At End, 24, 251, 510, 2473 20000 Loops , 0.1, 0.01, 0.005, 0.001 ParticleManager 500, 0.15, 0.22, 0.231, 0.3 ParticleManager 2000, 0.561, 0.631, 0.721, 1.171 List of Pointers, 0.02, 0.111, 0.21, 1.542 List of Objects, 0.02, 0.09, 0.211, 1.342 Dual List, 0.02, 0.1, 0.2, 1.552 Bin List w/cached, 0.02, 0.07, 0.151, 1.311 VectorAlgor, 0.011, 0.05, 0.09, 0.641 FlatArray 50000, 0.01, 0.05, 0.09, 0.641 Objects At End, 32, 253, 471, 2453 50000 Loops , 0.1, 0.01, 0.005, 0.001 ParticleManager 500, 0.37, 0.541, 0.611, 0.741 ParticleManager 2000, 1.402, 1.592, 1.793, 3.024 List of Pointers, 0.06, 0.291, 0.561, 4.105 List of Objects, 0.041, 0.24, 0.561, 3.555 Dual List, 0.05, 0.27, 0.501, 4.166 Bin List w/cached, 0.04, 0.19, 0.361, 3.555 VectorAlgor, 0.03, 0.13, 0.22, 1.733 FlatArray 50000, 0.03, 0.13, 0.231, 1.732 Objects At End, 26, 252, 497, 2487 100000 Loops , 0.1, 0.01, 0.005, 0.001 ParticleManager 500, 0.741, 1.092, 1.252, 1.492 ParticleManager 2000, 2.804, 3.194, 3.585, 6.129 List of Pointers, 0.12, 0.541, 1.102, 8.402 List of Objects, 0.08, 0.471, 1.101, 7.241 Dual List, 0.1, 0.501, 1.001, 8.532 Bin List w/cached, 0.09, 0.381, 0.721, 7.31 VectorAlgor, 0.061, 0.25, 0.451, 3.515 FlatArray 50000, 0.06, 0.25, 0.471, 3.515 Objects At End, 24, 264, 483, 2493
He then compared for the 100,000 loops case, GCC to MSVC. We looked a bit into GCC's allocation for STL, and it appears to have optimizations for objects below 128 bytes. We suspect MSVC's algorithm is more generic. This causes GCC to compare in very interesting ways.
MSVC vs. GCC Smaller number means GCC faster. 100000 Loops ParticleManager 500, 205,26%, 151,46%, 154,38%, 156,89% ParticleManager 2000, 193,11%, 175,21%, 164,22%, 147,47% List of Pointers, 120,00%, 114,86%, 103,77%, 50,30% List of Objects, 114,29%, 142,73%, 189,50%, 127,08% Dual List, 100,00%, 94,35%, 89,22%, 43,23% Bin List w/cached, 180,00%, 146,54%, 138,39%, 131,52% VectorAlgor, 122,00%, 95,79%, 92,04%, 92,60% FlatArray 50000, 120,00%, 92,59%, 90,40%, 84,78% Objects At End, 80,00%, 104,35%, 95,27%, 98,73%
GCC really optimizes the pointer versions of the lists! It is slower than MSVC for the list of objects. The end result is that the difference between a list of pointers and objects is no where near as large as when in MSVC. GCC does speed up the VectorAlgor and FlatArray, but the GCC results show that as in MSVC, they are both the same speed, so the VectorAlgor is better as it is dynamically sized.
I saved the Excel spreadsheet as an HTML file, and since it generates lots of nasty code I can't paste the table here or clean it up. I apologize for this. If anyone has problems viewing this, let me know. It worked for me with Mozilla 1.6 and IE 6, and viewed properly in my HTML editor, so it should be fine.
The release with optimized checks should be what you look at. I split the update method into update1, which does an isDead check, and update2, which does not.
ParticleManager 500 and 2000 refer to the static array size.
Explaination: The debug was when I ran the system in debug mode. It's more or less useless so I stopped taking the measurements. All of them except the Bin Test was with the old-unified update method.
Bin Test 10 and 500 refer to the basket size. "w/cached" is the version when I added the cached variable.