
Showing posts with label SIMD. Show all posts

Monday, March 26, 2012

Currently Under Construction

Things have been a bit quiet here, but rest assured I've been occupied with a number of things, some of which will have source code up here in less than a week.

Card Kingdom
The Card Kingdom project is a lot of fun and has a lot of potential to get better.  I've been working on gameplay code, camera systems, and helping with frustum culling.  I may very well be working on A.I. systems as well (which I look forward to, especially for the final boss).

JL Math
I wrote a post on SIMD processing, but I want to prove I've truly dived into it.  Making an interface that works well with both SSE2 and the FPU is taking a bit longer than I originally expected, but you should see something of an alpha release soon.  The math library, JL Math, is named after my initials because it's an individual project and I didn't have a witty name for it.  Ideally the interface could be expanded to work with AltiVec sometime down the road too.  Right now it supports Vector2, Vector4, Matrix4, and Quaternion types.

FMOD and Component Entity Models
I've been experimenting with the excellent FMOD library and trying to find ways to cleanly integrate the system into a component based engine.  One of the excellent things about FMOD is that you can often get most of the functionality you want with just a few function calls.  Under the hood this industry standard library does an awful lot of things for you (which are of course configurable).

Concurrency
In my ongoing efforts to increase my experience with multithreaded architectures I'm hoping to fit in one more non-school side project before I graduate.  So I bought Anthony Williams' excellent book, C++ Concurrency in Action, and am auditing an Operating Systems course to solidify the concepts.  It'll take some careful planning to decide how I want to apply this knowledge, but I'm currently hoping to do something with particles that builds upon the SIMD math library in an effort to do everything in parallel.

Expect to hear more on all of these developments in the near future.

Friday, March 2, 2012

The Challenges In Building A SIMD Optimized Math Library

What is SIMD?
SIMD stands for Single Instruction, Multiple Data, and it does exactly what the name says.  By packing several integer or floating-point values into a single wide register, one instruction can perform the same arithmetic or logical operation on all of them at once.
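As a concrete illustration, here's a minimal SSE sketch (the helper name `add4` is my own, not from any particular library) where a single `_mm_add_ps` performs four additions at once:

```cpp
#include <xmmintrin.h> // SSE intrinsics

// Illustrative helper: sums four pairs of floats with one SIMD add.
inline void add4(const float* a, const float* b, float* out) {
    __m128 va  = _mm_loadu_ps(a);     // load 4 floats (unaligned load for simplicity)
    __m128 vb  = _mm_loadu_ps(b);
    __m128 sum = _mm_add_ps(va, vb);  // one instruction, four additions
    _mm_storeu_ps(out, sum);
}
```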


Why Should You Care?
These days, everything is designed for parallelism.  The beauty of intrinsic functions is that they exploit parallelism at the instruction level.  With dedicated registers and instructions, the CPU can apply the same operation to four values at little additional cost over applying it to one.  Game engines have been using intrinsics for a while now.  I noticed while browsing the Doom 3 source code that it had full support for SIMD intrinsics, in a game that shipped back in 2004.  Even if you don't care to write high performance code, you're only kidding yourself if you think this technology is going away.  In all likelihood, in the industry you're going to have to deal with intrinsics, even if they've been wrapped in classes that are provided for you.  This isn't just an exercise in reinventing the wheel for the sake of doing it - it's a reality of modern day game architecture we all have (or want) to use.

But I Don't Know Assembly
You don't have to.  Although a background in MIPS/x86 assembly definitely helps with understanding how SIMD math really ticks, intrinsic functions are provided that do most of the work for you.  This is a far more convenient and maintainable way to do things than resorting to assembly instructions you have to roll yourself.

Specific Uses In Games
SIMD types can be used for a number of things, but since the __m128 data type holds 4 floating point values (which must be aligned on a 16 byte boundary), it is particularly good for data sets whose elements can all be treated the same.  Things like vectors (x, y, z, w), matrices (4 vectors), and colors (r, g, b, a) perform very well and are easily adapted to SIMD types.  You can also use intrinsics to represent 4 float values that are not all treated homogeneously.  For example, a bounding sphere may use the last value to represent the radius of the sphere, but then you have to be careful not to treat that scalar the way you treat the position vector.  The same holds true for quaternions.  And with more exotic functionality, such as a random number generator that internally uses intrinsics, the downside is that the code becomes much more complex and difficult to understand.
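To make the bounding-sphere caveat concrete, here is one possible sketch (the struct layout and mask trick are my own illustration, not a prescribed design): the radius rides in the w lane, so lane-wise operations meant for the center have to mask it out.

```cpp
#include <xmmintrin.h>

// Illustrative: a bounding sphere packed as (center.x, center.y, center.z, radius).
// The w lane is a scalar, so operations on the center must leave it untouched.
struct BoundingSphere {
    __m128 centerAndRadius;

    void translate(__m128 offset) {
        // Zero the offset's w lane so the radius isn't shifted along with the center.
        // _mm_set_ps takes lanes in (w, z, y, x) order.
        const __m128 mask = _mm_set_ps(0.0f, 1.0f, 1.0f, 1.0f);
        centerAndRadius = _mm_add_ps(centerAndRadius,
                                     _mm_mul_ps(offset, mask));
    }
};
```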

Challenges/Advice
The Alignment Problem: SIMD types have to be aligned on a 16 byte boundary.  A number of game developers know that the default allocators in C++ make no such guarantee, and as such are used to overriding the new/delete operators for the types that need it.  Console developers, who may very well use custom memory allocators for everything already, should have no problem making the adjustment (if one is needed at all).  Students and hobbyists on the PC might find their program crashing due to unaligned access and be left scratching their heads.  To start, I recommend just using _aligned_malloc (on Windows) or posix_memalign to get things going.  Consider making a macro like DECLARE_ALIGNED_ALLOCATED_OBJ(ALIGNMENT) and using it everywhere you need a static overload, so the policy can eventually be changed from one place.
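A minimal sketch of that macro idea, using `_mm_malloc`/`_mm_free` as the underlying aligned allocator (one portable option among several; the macro body and `Vector4Holder` are my own assumptions, not a canonical implementation):

```cpp
#include <xmmintrin.h> // __m128, _mm_malloc, _mm_free
#include <cstddef>

// One aligned-allocation policy, declared in one place, so it can be
// swapped later (e.g. for a custom engine allocator) without touching callers.
#define DECLARE_ALIGNED_ALLOCATED_OBJ(ALIGNMENT)          \
    void* operator new(std::size_t size) {                \
        return _mm_malloc(size, (ALIGNMENT));             \
    }                                                     \
    void operator delete(void* p) { _mm_free(p); }

// Hypothetical example type holding a SIMD member.
struct Vector4Holder {
    DECLARE_ALIGNED_ALLOCATED_OBJ(16)
    __m128 quad; // requires 16-byte alignment
};
```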

Pipeline Flushes: This one is tricky, because your code will run fine, but might actually be slower than if you were still using standard floating point types!  You have to try very hard to avoid mixing floating point math and SIMD math, or the pipeline will experience significant stalls as values move between register sets.  If this sounds hard, it's because it is.  A lot of game developers are used to using floats everywhere for everything.  As a result, a large codebase is a huge, huge pain to refactor to accommodate SIMD math in as many places as possible.

I recommend wrapping the SIMD type in a class called something like SimdFloat, which, for all intents and purposes, acts just like a float.  Internally, however, it holds a full four-float quad to avoid those costly pipeline flushes.  The implications of this are significant: things like dot products and length-squared functions are now actually returning quad data types, which takes some getting used to.  You can ease the transition by writing a conversion operator that converts to a regular float and back, but overusing it reintroduces the very FPU/SIMD mixing you were trying to avoid.  If the round trips matter, consider converting to a SIMD float on the stack as early as possible and using it for the rest of the function.
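A bare-bones sketch of what such a SimdFloat might look like (illustrative only, under the assumption of SSE; a real implementation would cover far more operators):

```cpp
#include <xmmintrin.h>

// Illustrative "scalar that lives in a SIMD register": results of dot
// products etc. can stay in SSE registers instead of crossing to the FPU.
class SimdFloat {
public:
    SimdFloat() {}  // deliberately uninitialized, like many SIMD libraries
    explicit SimdFloat(float f) : quad(_mm_set_ps1(f)) {} // broadcast to all lanes

    // Conversion back to a plain float. Convenient, but every use is a
    // potential SIMD-to-FPU transfer, so it should be rare.
    operator float() const {
        float f;
        _mm_store_ss(&f, quad);
        return f;
    }

    SimdFloat operator+(const SimdFloat& o) const {
        SimdFloat r;
        r.quad = _mm_add_ps(quad, o.quad);
        return r;
    }

    __m128 quad;
};
```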


Code Clarity: This can be alleviated, but clarity is ultimately going to take a hit somewhere.  SIMD math typically involves a lot more temporary values on the stack, and at times the code will seem to dance around the problem rather than go directly at it.  For example, consider a typical, trivial computation of the dot product:

inline float32 dot3(const Vector4& v) const {
    return x*v.x + y*v.y + z*v.z; 
}


Now becomes something like:

inline SimdFloat dot3(const Vector4& v) const {
    SimdFloat sf;
    quad128 p = _mm_mul_ps(quad, v.quad); // p[0] = (x * v.x), p[1] = (y * v.y), p[2] = (z * v.z)
    quad128 sumxy = _mm_add_ss(_mm_shuffle_ps(p, p, _MM_SHUFFLE(1, 1, 1, 1)), p); // sumxy[0] = p[0] + p[1] (check reference on shuffling)
    quad128 sumxyz = _mm_add_ss(_mm_shuffle_ps(sumxy, sumxy, _MM_SHUFFLE(2, 2, 2, 2)), sumxy); // sumxyz[0] = p[0] + p[1] + p[2]
    sf.quad = sumxyz;
    return sf;
}


Don't say I didn't warn you :P.  Since writing this by hand every time can be quite error prone, most implementations use the Vector4 SIMD type, or its equivalent, as frequently as possible.  If the functions are inlined anyway, you will usually be just fine.

On the topic of code clarity, some implementations actually forbid operator overloads.  Before doing this, I recommend making sure you really need to.  It's true that the operators must return temporaries by value, and this has a cost.  Some successful SIMD libraries, like the one used in Havok, forbid overloaded operators for exactly this reason; it is justified as being speed critical.  Default construction doesn't set a zero vector, and the assignment operator, while provided, doesn't return a reference either.  This does indeed avoid computational costs, but it comes at the cost of clarity.  I highly recommend providing both the operators and equivalent named functions, so the code that really needs it can avoid the operators while everything else stays as clear as possible.  Consider, for example, the calculation of a LERP value:

SimdFloat t; // assume its value is between 0 and 1 so we don't need to clamp it
Vector4 lerp = a + ((b - a) * t);


And again without operator overloading:

SimdFloat t;
Vector4 lerp;
lerp.setSub(b, a);
lerp.mul(t);
lerp.add(a);


Again, it's all about finding the right balance with your team.  Expect resistance if you want to forbid operator overloading.  And remember, you can always write a comment right above the computation showing what the equivalent expression would look like with operator overloading.


Portability: Math libraries have the wonderful combination of being incredibly speed critical, pervasive, and in need of portability.  Some platforms don't support SIMD types at all, and others differ in their implementations (e.g. AltiVec vs. SSE2).  This makes developing a common interface considerably challenging.  Read up on existing designs, and consider studying the one used in Havok by downloading their free trial (even if you have no interest in using their Physics/Animation libraries).
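One common approach, sketched here under the assumption of an SSE2 target (the AltiVec branch is shown only for shape and is untested; the `quad128`/`quadAdd`/`quadMul` names are my own), is to hide the platform behind a typedef and a thin wrapper layer:

```cpp
// Platform-selection sketch: callers only ever see quad128 and quadAdd/quadMul.
#if defined(__SSE2__) || defined(_M_X64) || defined(_M_IX86)
    #include <emmintrin.h>
    typedef __m128 quad128;
    inline quad128 quadAdd(quad128 a, quad128 b) { return _mm_add_ps(a, b); }
    inline quad128 quadMul(quad128 a, quad128 b) { return _mm_mul_ps(a, b); }
#elif defined(__ALTIVEC__)
    #include <altivec.h>
    typedef __vector float quad128;
    inline quad128 quadAdd(quad128 a, quad128 b) { return vec_add(a, b); }
    // AltiVec has no plain multiply; fuse with an add of zero.
    inline quad128 quadMul(quad128 a, quad128 b) { return vec_madd(a, b, vec_splats(0.0f)); }
#endif
```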

Friday, January 13, 2012

Class Is Always In Session

I recently released a minor update to the CDS Library.  I fixed some bugs and provided more explicit casting.  The major incentive for these changes was compatibility with C++.  CDS is still very much grounded in ANSI C, which is deliberate for compatibility reasons, so it doesn't take advantage of C++ specific features (such as templates).  However, in terms of compatibility, I'm proud to have made a data structures library that almost any C/C++ compiler can build. 

Unfortunately I'm spread very thin so my time for side projects such as these is extremely limited.  I was hoping to get more experience writing cache conscious code, and perhaps experiment with some cache oblivious algorithms too.  

Consistently exploiting cache optimized code is something I'm mindful of, but admittedly need to improve on.  It would be nice to explore this more some other time, perhaps as an extension to this project.

I'm also working on a game project for school where I'm deliberately trying to tackle Component-Entity models and get more experience with SIMD math and intrinsics.  I'm admittedly sticking relatively close to the framework used in Unity for now.  In another project I might look into more data driven representations that use component aggregation a bit more purely.  I started a thread on GameDev which discusses implementing lighter RTTI and ended up exploring ways to avoid the "everything is a game object" system we see so frequently in these types of frameworks.  While this model admittedly feels excessive for some simple things, it does have its benefits.  When certain operations are best applied to logical game objects, iterating over every component or deleting them is much easier if you have one central game object.  Communication between connected components also becomes trivial.
 
Currently the big question at hand is message passing.  A hash table of function names and their addresses sounds reasonable.  But is passing a void pointer array the best way to pass arguments?  Do "returns" from these functions need to be written back into that same array?  What about identically named methods in different components?  Whatever we choose will have significant pros and cons that weren't very clear when we decided to go with components in the first place.  In practice, we're finding the component model actually hides a significant amount of complexity from the user.  I'm not saying inheritance trees are better, merely that Component-Entity models, like all game object models, are going to feel a bit quirky from time to time.
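For what it's worth, here is a minimal sketch of the hash-table-of-names idea with void-pointer arguments (all names here are hypothetical; this is one possible shape, not our final design). The type-unsafety is on full display: both sides must agree on the argument layout purely by convention.

```cpp
#include <string>
#include <unordered_map>

// Illustrative message-passing base: handlers are looked up by name.
class Component {
public:
    typedef void (*MessageFn)(Component* self, void** args, int argCount);

    void registerMessage(const std::string& name, MessageFn fn) {
        handlers[name] = fn;
    }

    // Returns false when no handler is registered; unknown messages are ignored.
    bool sendMessage(const std::string& name, void** args, int argCount) {
        std::unordered_map<std::string, MessageFn>::iterator it = handlers.find(name);
        if (it == handlers.end()) return false;
        it->second(this, args, argCount);
        return true;
    }

private:
    std::unordered_map<std::string, MessageFn> handlers;
};

// Hypothetical example component. "Returns" would be written back
// through the args array under the same kind of convention.
struct Health : Component {
    int hp;
    Health() : hp(100) { registerMessage("Damage", &Health::onDamage); }

    static void onDamage(Component* self, void** args, int argCount) {
        if (argCount < 1) return;
        Health* h = static_cast<Health*>(self);
        h->hp -= *static_cast<int*>(args[0]); // arg 0: damage amount, by convention
    }
};
```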

Speaking of quirky, intrinsics will definitely feel that way the first time you use them.  Forced alignment, unclear macros, pipeline flushes, and bit masks?  Your standard vector/matrix types are no longer so routine.  And don't even get me started on how much of a pain it is to write this in a cross platform manner; AltiVec's SIMD functions are just different enough to warrant typedefs everywhere.  However, as an engine developer, intrinsics are definitely pretty cool.  They are blazing fast (if used appropriately) and make old problems new again.  That quaternion, bounding sphere, AABB, and float-packed color structure are all "new" again.  Things like cosine functions and sorting algorithms also present exciting new ways to conquer some old problems.  It takes a specific kind of developer to get truly excited about rewriting these sorts of facilities, but those who do should definitely look into intrinsics if they haven't already.

My third recent side project involved diving into Lua.  I work with PHP at my job, so interpreted languages are not completely foreign to me.  While Lua is considered easy to learn, its syntax does not resemble C, and certain ways in which Lua is used (table lookups, metatables, etc.) can actually get rather involved.  Nonetheless, it's a fun language to use, it was made by some very smart people, and it never feels like it's trying to be something it's not.  Lua knows it's a scripting language, and as such, is more readily adopted as one compared to languages that see this as some sort of insult.


I'd like to build an event system or somehow provide script level access to my component entity model.  However, it is still in development and not exactly a rudimentary task.  So I kept the scope small and just made a simple math library.  If nothing else, it provided a nice introduction to the language.







Other current events:
I learn a lot from reading other people's code.  So while a lot of people in the games industry despise Boost and template metaprogramming in general, it doesn't hurt to familiarize myself with the subject matter.  I did the same with design patterns a couple of years ago.  The possible misuse of a technique is not a good excuse to avoid learning it altogether.  Is a singleton a monstrosity?  Does overuse of templates bloat and complicate code?  Yes and yes, but every implementation has its share of pros and cons.

I got Jesse Schell's excellent book from the library over the break.  I may not see myself as a designer, but it never hurts to have some sensible design ideas floating around in your head.  And even in the absence of great ideas, it improves communication skills with talented game designers.

I also got Volume 1 of Knuth's book for Christmas.  I simply don't have the time to muscle through something like this right now, but it's nice knowing that if I want to challenge myself, I merely need to reach over to my bookshelf.  It's gonna take a while to get through this one.