A Floating Point: The Challenges In Building A SIMD Optimized Math Library

What is SIMD?
SIMD stands for Single Instruction, Multiple Data, and does exactly what it says. By packing several integer or floating point values into one wide register, a single instruction can perform the same arithmetic or logical operation on all of them at once.
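
To make that concrete, here is a minimal sketch of four additions done by one SSE instruction. It assumes an x86 or x86-64 compiler that provides xmmintrin.h, and add4 is a hypothetical helper name rather than part of any library.

```cpp
#include <cassert>
#include <xmmintrin.h> // SSE intrinsics (x86 / x86-64 only)

// Adds two arrays of four floats with a single SIMD addition.
// add4 is a hypothetical helper, used here only for illustration.
inline void add4(const float* a, const float* b, float* out) {
    assert(a && b && out);
    __m128 va = _mm_loadu_ps(a);      // load 4 floats (the unaligned load works for any address)
    __m128 vb = _mm_loadu_ps(b);
    __m128 sum = _mm_add_ps(va, vb);  // one instruction, four additions
    _mm_storeu_ps(out, sum);          // write the 4 results back to memory
}
```

The unaligned load and store are used so the sketch works with ordinary stack arrays; the aligned variants (_mm_load_ps and _mm_store_ps) require 16 byte aligned addresses.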

Why Should You Care?
These days, everything is designed for parallelism. The beauty of intrinsic functions is that they exploit such parallelism at the instruction level. With dedicated registers and instructions, a CPU can perform the same operation on 3 more values at little additional cost. Game engines have been using intrinsics for a while now. I noticed while browsing the source code of Doom 3, a game released in 2004, that it had full support for SIMD intrinsics. Even if you don't care to write high performance code, you're only kidding yourself if you think this technology is going away. In all likelihood, in the industry you're going to have to deal with intrinsics, even if they've been wrapped in classes that are provided for you. This isn't just an exercise in reinventing the wheel for the sake of doing it. It's a reality of modern day game architecture we all have (or want) to use.

But I Don't Know Assembly
You don't have to. Although a background in MIPS/x86 ASM definitely helps with understanding how SIMD math really ticks, intrinsic functions are provided that do most of it for you. This is a more convenient and more maintainable way to do things than resorting to assembly instructions you have to roll yourself.

Specific Uses In Games
SIMD types can be used for a number of things, but since the __m128 data type holds 4 floating point values (and must be aligned on a 16 byte boundary), it is particularly good for data whose components are all treated the same way. Things like vectors (x, y, z, w), matrices (4 vectors), colors (r, g, b, a) and more perform very well and are easily adapted to SIMD types. You can also use a SIMD type for 4 float values that are not all treated homogeneously. For example, a bounding sphere may use the last value to represent the radius of the sphere, but then you have to be careful not to treat that scalar the way you treat the position vector. The same holds true for quaternions. You can go further with crazier functionality, such as a random number generator that internally uses intrinsics, but the downside is that the code becomes much more complex and difficult to understand.
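
As a rough illustration of the bounding sphere case, the sketch below keeps the center in lanes 0 to 2 and the radius in lane 3. The BoundingSphere layout and the helper names are assumptions made up for this example, and it needs an x86 compiler with SSE.

```cpp
#include <cassert>
#include <xmmintrin.h>

// Hypothetical layout: lanes 0-2 hold the center (x, y, z), lane 3 holds the radius.
struct BoundingSphere {
    __m128 quad;
};

inline BoundingSphere makeSphere(float x, float y, float z, float radius) {
    assert(radius >= 0.0f);
    BoundingSphere s;
    s.quad = _mm_setr_ps(x, y, z, radius); // setr: arguments are given in lane order 0..3
    return s;
}

// Moves the center but leaves the radius alone by forcing the offset's last lane to zero.
inline BoundingSphere translate(const BoundingSphere& s, float dx, float dy, float dz) {
    BoundingSphere out;
    out.quad = _mm_add_ps(s.quad, _mm_setr_ps(dx, dy, dz, 0.0f));
    return out;
}

inline float centerX(const BoundingSphere& s) {
    return _mm_cvtss_f32(s.quad); // lane 0 as a scalar
}

inline float radiusOf(const BoundingSphere& s) {
    // Broadcast lane 3 into every lane, then read lane 0 as a scalar.
    return _mm_cvtss_f32(_mm_shuffle_ps(s.quad, s.quad, _MM_SHUFFLE(3, 3, 3, 3)));
}
```

The point is the zero in the offset's last lane: a plain four-wide add with a nonzero w would silently change the radius.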

Challenges/Advice
The Alignment Problem: SIMD types have to be aligned on a 16 byte boundary. A number of game developers know that the default allocators in C++ make no such guarantee, and as such, are used to overriding the new/delete operators for the types that need it. Console developers, who may very well use custom memory allocators for everything already, should have no problem making the adjustment (if one is needed at all). Students and hobbyists on the PC might find their program crashing due to unaligned access and be left scratching their heads. To start, I recommend just using _aligned_malloc (on Windows) or posix_memalign to get things going. Consider making a macro like DECLARE_ALIGNED_ALLOCATED_OBJ(ALIGNMENT) and using it everywhere you need the overloaded operators, so the allocation strategy can later be changed in one place.
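
One possible shape for such a macro is sketched below. The macro name comes from the paragraph above, but its body, and the alignedAlloc, alignedFree and isAligned helpers, are assumptions for illustration only. It covers the scalar new/delete pair, not the array forms.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <new>
#include <stdlib.h>    // posix_memalign / free on POSIX systems
#if defined(_MSC_VER)
#include <malloc.h>    // _aligned_malloc / _aligned_free on Windows
#endif
#include <xmmintrin.h>

// Hypothetical helpers that hide the platform-specific aligned allocator.
inline void* alignedAlloc(std::size_t size, std::size_t alignment) {
    assert((alignment & (alignment - 1)) == 0); // must be a power of two
#if defined(_MSC_VER)
    return _aligned_malloc(size, alignment);
#else
    void* p = nullptr;
    // posix_memalign also requires the alignment to be a multiple of sizeof(void*).
    return posix_memalign(&p, alignment, size) == 0 ? p : nullptr;
#endif
}

inline void alignedFree(void* p) {
#if defined(_MSC_VER)
    _aligned_free(p);
#else
    free(p);
#endif
}

inline bool isAligned(const void* p, std::size_t alignment) {
    return reinterpret_cast<std::uintptr_t>(p) % alignment == 0;
}

// Drop into a class body to route `new T` / `delete p` through the aligned allocator.
#define DECLARE_ALIGNED_ALLOCATED_OBJ(ALIGNMENT)                  \
    static void* operator new(std::size_t size) {                \
        void* p = alignedAlloc(size, (ALIGNMENT));               \
        if (!p) throw std::bad_alloc();                          \
        return p;                                                \
    }                                                            \
    static void operator delete(void* p) { alignedFree(p); }

struct Vector4 {
    DECLARE_ALIGNED_ALLOCATED_OBJ(16)
    __m128 quad;
};
```

With this in place, `Vector4* v = new Vector4;` returns a 16 byte aligned pointer, and swapping in a custom allocator later only means editing the two helper functions.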

Pipeline Flushes: This one is tricky, because your code will run fine, but might actually be slower than if you were still using standard floating point types! You have to try very hard to avoid mixing floating point math and SIMD math, or the pipeline will experience significant stalls. If this sounds hard, it's because it is. A lot of game developers are used to using floats everywhere for everything. As a result, a large codebase is a huge, huge pain to refactor to accommodate SIMD math in as many places as possible.

I recommend wrapping SIMD types into a class called something like SimdFloat, which, for all intents and purposes, acts just like a float. However, internally it actually holds four floats to avoid those costly pipeline flushes. The implications of this are significant: now things like dot products, length squared functions, and others are actually returning quad data types. This will take some getting used to. You can help alleviate it by writing a conversion operator that converts to a regular float and back, but it is easy to abuse, since every conversion mixes float and SIMD math again. If the additional memory for storing four floats per value is significant, store a plain float instead, then create a SIMD float on the stack as early as possible and use it for the rest of the function.
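
A minimal sketch of what such a wrapper might look like follows. Only the SimdFloat name and its quad member come from this post; the constructors, toFloat, and the example function are assumptions, and the code targets x86 with SSE.

```cpp
#include <cassert>
#include <xmmintrin.h>

// Sketch of a float-like wrapper that keeps its value in a SIMD register.
class SimdFloat {
public:
    SimdFloat() {}                                           // left uninitialized, like a plain float
    explicit SimdFloat(float f) : quad(_mm_set1_ps(f)) {}    // broadcast the scalar to all four lanes
    explicit SimdFloat(__m128 q) : quad(q) {}

    // A named, explicit conversion makes every trip back to scalar code visible at the call site.
    float toFloat() const { return _mm_cvtss_f32(quad); }    // reads lane 0

    SimdFloat operator+(const SimdFloat& o) const { return SimdFloat(_mm_add_ps(quad, o.quad)); }
    SimdFloat operator*(const SimdFloat& o) const { return SimdFloat(_mm_mul_ps(quad, o.quad)); }

    __m128 quad;
};

// Usage: convert once on the way in, do the math in SIMD, convert once on the way out.
inline void simdFloatExample() {
    SimdFloat a(1.5f);
    SimdFloat b(2.0f);
    SimdFloat r = a * b + a;       // stays in SIMD registers
    assert(r.toFloat() == 4.5f);   // single conversion at the very end
}
```

Here a named toFloat function stands in for the implicit conversion operator mentioned above; that is a design choice for this sketch, made so that accidental conversions do not compile silently.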

Code Clarity: This can be alleviated, but is ultimately going to take a hit somewhere. SIMD math typically involves a lot more temporary values on the stack, so at times the code will seem to avoid going directly "at the problem". For example, consider a typical, trivial computation of the dot product:

inline float32 dot3(const Vector4& v) const {
    return x*v.x + y*v.y + z*v.z; 
}

Now becomes something like:

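inline SimdFloat dot3(const Vector4& v) const {
    SimdFloat sf;
    quad128 p = _mm_mul_ps(quad, v.quad); // p[0] = (x * v.x), p[1] = (y * v.y), p[2] = (z * v.z)
    // _mm_add_ss adds lane 0 only and copies lanes 1-3 from its first argument.
    quad128 sumxy = _mm_add_ss(_mm_shuffle_ps(p, p, _MM_SHUFFLE(1, 1, 1, 1)), p); // lane 0 = p[1] + p[0]
    // Broadcast p[2] from p itself; the upper lanes of sumxy all hold p[1].
    quad128 sumxyz = _mm_add_ss(_mm_shuffle_ps(p, p, _MM_SHUFFLE(2, 2, 2, 2)), sumxy); // lane 0 = p[2] + p[0] + p[1]
    sf.quad = sumxyz; // only lane 0 holds the dot product
    return sf;
}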

Don't say I didn't warn you :P. Since writing this by hand every time can be quite error prone, most implementations write it once and use the Vector4 SIMD type, or its equivalent, as frequently as possible. If the functions are inline anyway you will usually be just fine.
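
For comparison, here is a self-contained variant that broadcasts the result to every lane instead of leaving it in lane 0 only, which is handy when the returned SimdFloat is later multiplied with a whole vector. It is a different technique from the _mm_add_ss version above, not a replacement for it. The stripped-down SimdFloat and Vector4 types and the quad128 typedef are stand-ins assumed for this example.

```cpp
#include <cassert>
#include <xmmintrin.h>

typedef __m128 quad128; // assumed alias, matching the name used in the listing above

struct SimdFloat {
    quad128 quad;
};

struct Vector4 {
    quad128 quad;

    Vector4(float x, float y, float z, float w) : quad(_mm_setr_ps(x, y, z, w)) {}

    SimdFloat dot3(const Vector4& v) const {
        quad128 p  = _mm_mul_ps(quad, v.quad);                       // (x*v.x, y*v.y, z*v.z, w*v.w)
        quad128 px = _mm_shuffle_ps(p, p, _MM_SHUFFLE(0, 0, 0, 0));  // x*v.x in every lane
        quad128 py = _mm_shuffle_ps(p, p, _MM_SHUFFLE(1, 1, 1, 1));  // y*v.y in every lane
        quad128 pz = _mm_shuffle_ps(p, p, _MM_SHUFFLE(2, 2, 2, 2));  // z*v.z in every lane
        SimdFloat sf;
        sf.quad = _mm_add_ps(_mm_add_ps(px, py), pz);                // every lane = x*v.x + y*v.y + z*v.z
        return sf;
    }
};

inline void dot3Example() {
    Vector4 a(1.0f, 2.0f, 3.0f, 9.0f);
    Vector4 b(4.0f, 5.0f, 6.0f, 9.0f);
    float lanes[4];
    _mm_storeu_ps(lanes, a.dot3(b).quad);
    // 1*4 + 2*5 + 3*6 = 32; the w lane (9*9) never enters the sum.
    assert(lanes[0] == 32.0f && lanes[3] == 32.0f);
}
```

The broadcast version costs one extra shuffle, so which form is preferable depends on how the result is consumed.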

On the topic of code clarity, some implementations actually forbid operator overloads. Before doing this, I recommend making sure you really need to. It's true that overloaded operators have a cost: binary operators such as + and * return a temporary by value, and assignment-style operators return a reference. Successful SIMD libraries, like the one used in Havok, do not provide them, and justify that as being speed critical. In the same spirit, default construction doesn't set a zero vector, and the assignment operator, while provided, doesn't return a reference either. This does indeed avoid computational costs, but it comes at the cost of clarity. I highly recommend providing both the operators and the equivalent functions; that way, code that really needs the speed can avoid the operators and everything else can stay as clear as possible. Consider, for example, a calculation of a LERP value:

SimdFloat t; // assume its value is between 0 and 1 so we don't need to clamp it
Vector4 lerp = a + ((b - a) * t);

And again without operator overloading:

SimdFloat t;
Vector4 lerp;
lerp.setSub(b, a);
lerp.mul(t);
lerp.add(a);

Again, it's all about finding the right balance with your team. Expect resistance if you want to forbid operator overloading. And remember, you can always write a comment right above the computations showing the equivalent expression as it would read if operator overloading were available.
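
Offering both styles can be as simple as implementing the operators on top of the functions. The sketch below assumes pared-down SimdFloat and Vector4 types; setSub, mul and add mirror the calls in the snippet above, while the constructors and the two lerp helpers are hypothetical.

```cpp
#include <cassert>
#include <xmmintrin.h>

struct SimdFloat {
    __m128 quad;
    explicit SimdFloat(float f) : quad(_mm_set1_ps(f)) {} // same value in all four lanes
};

struct Vector4 {
    __m128 quad;

    Vector4() {} // left uninitialized on purpose, no zeroing cost
    Vector4(float x, float y, float z, float w) : quad(_mm_setr_ps(x, y, z, w)) {}

    // Function style: results are written into *this, no temporaries are returned.
    void setSub(const Vector4& a, const Vector4& b) { quad = _mm_sub_ps(a.quad, b.quad); }
    void mul(const SimdFloat& s) { quad = _mm_mul_ps(quad, s.quad); }
    void add(const Vector4& v) { quad = _mm_add_ps(quad, v.quad); }

    // Operator style: thin wrappers over the functions above.
    Vector4 operator-(const Vector4& v) const { Vector4 r; r.setSub(*this, v); return r; }
    Vector4 operator*(const SimdFloat& s) const { Vector4 r(*this); r.mul(s); return r; }
    Vector4 operator+(const Vector4& v) const { Vector4 r(*this); r.add(v); return r; }
};

inline Vector4 lerpWithOperators(const Vector4& a, const Vector4& b, const SimdFloat& t) {
    return a + ((b - a) * t);
}

inline Vector4 lerpWithFunctions(const Vector4& a, const Vector4& b, const SimdFloat& t) {
    Vector4 r;
    r.setSub(b, a); // r = b - a
    r.mul(t);       // r = (b - a) * t
    r.add(a);       // r = a + (b - a) * t
    return r;
}

inline void lerpExample() {
    Vector4 a(0.0f, 0.0f, 0.0f, 0.0f);
    Vector4 b(2.0f, 4.0f, 6.0f, 8.0f);
    float viaOps[4];
    float viaFns[4];
    _mm_storeu_ps(viaOps, lerpWithOperators(a, b, SimdFloat(0.5f)).quad);
    _mm_storeu_ps(viaFns, lerpWithFunctions(a, b, SimdFloat(0.5f)).quad);
    for (int i = 0; i < 4; ++i) {
        assert(viaOps[i] == viaFns[i]); // both styles compute the same lanes
    }
}
```

With inlining enabled, a compiler will often generate similar code for both versions, but that is worth measuring on your own target rather than assuming.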

Portability: Math libraries have the wonderful combination of being incredibly speed critical, pervasive, and in need of portability. Some platforms don't support these SIMD types at all; others differ in their implementation (e.g. AltiVec and SSE2). This makes developing a common interface considerably challenging. Read up on the subject, and consider referring to the interface used in Havok by downloading their free trial (even if you have no interest in using their Physics/Animation libraries).
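
One common way to approach a shared interface is to pick the backend at compile time and hide it behind a small set of functions. The sketch below shows an SSE branch and a plain scalar fallback only; the mathlib namespace, the MATHLIB_USE_SSE macro and the quad* function names are all made up for this example, and a real library would add further branches (AltiVec, for instance) behind the same names.

```cpp
#include <cassert>

// GCC/Clang define __SSE__ when SSE is enabled; MSVC defines _M_X64 or _M_IX86_FP.
#if defined(__SSE__) || defined(_M_X64) || (defined(_M_IX86_FP) && _M_IX86_FP >= 1)
#define MATHLIB_USE_SSE 1 // hypothetical configuration macro
#include <xmmintrin.h>
#else
#define MATHLIB_USE_SSE 0
#endif

namespace mathlib { // hypothetical namespace

#if MATHLIB_USE_SSE

typedef __m128 Quad;

inline Quad quadSet(float x, float y, float z, float w) { return _mm_setr_ps(x, y, z, w); }
inline Quad quadAdd(Quad a, Quad b) { return _mm_add_ps(a, b); }
inline Quad quadMul(Quad a, Quad b) { return _mm_mul_ps(a, b); }
inline void quadStore(float* out, Quad q) { assert(out); _mm_storeu_ps(out, q); }

#else

// Scalar fallback with the same interface for platforms without SSE.
struct Quad { float v[4]; };

inline Quad quadSet(float x, float y, float z, float w) { Quad q = {{x, y, z, w}}; return q; }
inline Quad quadAdd(Quad a, Quad b) {
    Quad r;
    for (int i = 0; i < 4; ++i) r.v[i] = a.v[i] + b.v[i];
    return r;
}
inline Quad quadMul(Quad a, Quad b) {
    Quad r;
    for (int i = 0; i < 4; ++i) r.v[i] = a.v[i] * b.v[i];
    return r;
}
inline void quadStore(float* out, Quad q) {
    assert(out);
    for (int i = 0; i < 4; ++i) out[i] = q.v[i];
}

#endif

} // namespace mathlib
```

Higher-level types such as a vector or matrix class can then be written once against quadSet, quadAdd and friends, so only this thin layer changes per platform.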


Friday, March 2, 2012
