The library is intentionally pretty minimalist to focus on SIMD wrappers and instructions. So things like logging are just done with the standard library, and aligned memory functions are achieved with existing implementations (e.g. _aligned_malloc). On consoles, home rolled implementations would almost certainly be used to avoid the context switch.
A few things to take from this experiment:
- SSE2 instructions are considerably faster on large sets of data than comparable FPU instructions.
- Alignment can be a pain, and has considerable consequences to the design of code elsewhere. If you want to develop for consoles, its best to get used to this sooner rather than later.
- The same applies for an interface which supports SSE2, FPU, Altivec, etc. They are a pain to implement.
- Reordering instructions to prevent stalls can make a huge difference
- Restricted Pointers and an eye for Load Hit Stores can make a huge difference
- Removing Branches Can Make a huge difference
- A struct of arrays approach can be more performant than an array of structs for large structs that exceed the size of a cache line
- More experience with SFMT, static assertions, and Intel's Vtune
Preventing Stalls
Preventing stalls, in terms of performance, is very important in a speed critical library such as this. One thing I found myself doing was taking advantage of the SimdFloat wrapper to minimize them as much as possible. Since you have to use them to prevent pipeline flushes anyways, you might as well reorder them to minimize stalls as well. This is one of the easiest optimizations to make, since it only involves moving the same instructions around and spreading dependent data out so the processor stays busy.
bool32 testRayTriangle(const jlVector4& origin, const jlVector4& dir, const jlVector4& a, const jlVector4& b, const jlVector4& c) { jlVector4 pn = jlVector4::Cross(b - a, c - a); pn.normalize3(); jlSimdFloat pd = jlVector4::Dot3(pn, a); // stalls waiting on the normalize to finish jlSimdFloat t = (pd - jlVector4::Dot3(origin, pn)) / jlVector4::Dot3(dir, pn); jlSimdFloat zero = jlSimdFloat(0.0f); if (t <= zero) return 0; jlVector4 pt = origin + dir * t; jlSimdFloat u, v, w; barycentric(a, b, c, pt, u, v, w); jlSimdFloat one = jlSimdFloat(1.0f); return (v >= zero && w >= zero && (v + w) <= one); }
Since the normalize3() function takes the longest, we can construct the SIMD float types after as an easy way to minimize stalls.
bool32 testRayTriangle(const jlVector4& origin, const jlVector4& dir, const jlVector4& a, const jlVector4& b, const jlVector4& c) { jlVector4 pn = jlVector4::Cross(b - a, c - a); pn.normalize3(); jlSimdFloat zero = jlSimdFloat(0.0f); // these simple constructions keep the processor busy jlSimdFloat one = jlSimdFloat(1.0f); // while waiting on the normalization operation to finish jlSimdFloat pd = jlVector4::Dot3(pn, a); jlSimdFloat t = (pd - jlVector4::Dot3(origin, pn)) / jlVector4::Dot3(dir, pn); if (t <= zero) return 0; jlVector4 pt = origin + dir * t; jlSimdFloat u, v, w; barycentric(a, b, c, pt, u, v, w); return (v >= zero && w >= zero && (v + w) <= one); }
Removing Branches
Here's a typical AABB intersection test:
bool32 testAABBWithBranches(const jlAABB& boxA, const jlAABB& boxB) { if (boxA.max(0) < boxB.min(0) || boxA.min(0) > boxB.max(0)) return 0; if (boxA.max(1) < boxB.min(1) || boxA.min(1) > boxB.max(1)) return 0; if (boxA.max(2) < boxB.min(2) || boxA.min(2) > boxB.max(2)) return 0; return 1; }
On a modern CPU, branch misprediction and short circuiting is generally a bad thing. By removing the branches, instructions can be queued and executed in a more predictable manner. These two algorithms are logically the same, where we still check for a seperation on all three cartesian axis. However, the version without branches does all of these tests in parallel and uses bitwise operations to merge the comparisons into one boolean result.
bool32 testAABBWithoutBranches(const jlAABB& boxA, const jlAABB& boxB) { jlComp comp; jlComp aMinCompBMax = boxA.min.compLess(b.max); jlComp bMinCompAMax = boxB.min.compLess(a.max); comp.setAnd(aMinCompBMax, bMinCompAMax); return comp.allAreSet(jlComp::MASK_XYZ); }
SFMT
Random Number Generators are interesting to me, but can be notoriously hard to understand due to excessive use of magic constants and seemingly arbitrary bit shifts (frequently seen in hash functions too). In my old GXT project, I used an existing implementation of XORShift I found here. This experimental project seemed like a good excuse to explore Mersenne Twister and its SIMD optimized variant: SFMT. I adapted the code I found to work with my existing classes and support SSE and FPU seamlessly. The result is a considerable speed boost, but at the cost of extra memory, as it relies on a previously allocated buffer.
The restrict keyword
I'm still a bit new to the concept of the __restrict keyword. I did, however, notice a lot of engine devs talking about its importance on consoles and elsewhere in addressing the pointer aliasing problem (which also extends to references in C++).
restrict is a qualifier designed to address the pointer aliasing problem. When used as qualifier on a member function itself, it will apply to the "this" pointer. A number of articles have been written about restrict and the way it addresses the Load-Hit-Store problem. The way I like to think about restrict is that it is a promise the programmer makes to the compiler, that the pointer will not be referenced by any other pointers in that scope.
Three things to note with the restrict keyword absent of the way it works:
1) restrict is nonstandard and in vc++ it is __restrict and in g++ it is __restrict__. So this is an appropriate time to use the preprocessor.
2) Restrict in Visual C++ only applies to pointers, not references. Thankfully g++ supports restricted references, but this disparity is a pain for those seeking portable code. In Visual C++ restricted references are simply ignored, but this will accumulate numerous compiler warnings in a project of any decent size. This limits portable restricted addresses to restricted pointers.
3) Restrict matters to the calling code. Seriously, put restrict qualifiers in the header/prototype, even if the compiler will let you get away without doing it. Let's take an example where restricted pointers make sense as a design/code clarity hint as much as it does a performance centered measure.
Take barycentric coordinates, where we want to calculate a weighted average and put them into u, v, and w:
void barycentric(const jlVector4& a, const jlVector4& b, const jlVector4& c, const jlVector4& pt, jlSimdFloat* u, jlSimdFloat* v, jlSimdFloat* w);
It effectively defeats the purpose if u, v, and w reference the same variable. What we know about the weighted average is effectively lost and all variables will hold the value of the last stored value if they all reference the same variable. As such, it is entirely appropriate in such instances to change the function signature to:
void barycentric(const jlVector4& a, const jlVector4& b, const jlVector4& c, const jlVector4& pt, jlSimdFloat* JL_RESTRICT u, jlSimdFloat* JL_RESTRICT v, jlSimdFloat* JL_RESTRICT w);
Test | SSE | SSE With Restrict |
Barycentric Test | 0.171 | 0.167 |
For this project, I made much more extensive use of Intel's VTune than I have in the past. Here are some of the results I found after running numerous benchmarks on my machine. In each test, I would generally produce upwards of a million randomly generated matrices, vectors, rays, etc. and see how the two compared.
Struct of Arrays vs. Array of Structs
I also tried a few new things this time around, like the difference between Struct of Arrays and (the more frequently used) Array of Structs. This is admittedly an element that truly excels in multithreaded environments. An effective comparison is made in Game Engine Architecture:
// Array of Structs, easier to understand, // especially in the OOP world. Logically // related data is placed contiguously in memory. struct GameObject { uint32 id; Vector3 pos; Quaternion rot; float health; };
// In the Struct of Arrays approach, typed data is placed contiguously struct GameObjects { uint32 ids[MAX_GAME_OBJECTS]; Vector3 positions[MAX_GAME_OBJECTS]; Quaternion rotations[MAX_GAME_OBJECTS]; float healthPoints[MAX_GAME_OBJECTS]; };
A 64-byte line size means most loops will be unrolled 4 Vector4's at a time |
You can download and use CPU-Z to find properties of your CPU's cache and/or use a function like these in code to get the size of a d-cache line.
AoS Particles | SoA Particles |
5.264 | 5.063 |
More Tests
For good measure, a few more tests profiled using VTune. Feel free to run some yourself with hot spot analysis to try things out yourself.
VTune's "Compare" functionality between SSE2 and FPU |
Test | SSE | FPU |
Vector/Matrix Multiply Test | 0.5 | 2.0 |
Normalization Test | 0.138 | 0.4 |
Ray Triangle Test | 0.017 | 0.039 |