Sunday, September 2, 2012

Gameplay Architecture Part 2: Message Passing

Rationale
Games are very event-driven applications that sometimes (inevitably) end up resembling a web of complex states. In a project of any considerable size it simply isn't feasible to understand every component of the codebase.  This is especially true at higher levels of gameplay code built on top of numerous parallel subsystems written by many different people over an extended period of time.

Messages which either inform pieces of code what has happened (e.g. DOOR_OPENED) or what to do (e.g. DISABLE_RENDERING) are part of the solution towards an architecturally sound and flexible codebase.  I had a few goals when implementing this system:

1) Integrates into the component/entity model system I wrote previously
2) Keeps class hierarchies flat, or avoids them altogether
3) Can queue messages efficiently and dispatch them later (e.g. send this message 10 seconds from now)
4) Components/Entities can listen in on these queues by their message type

Based on previous experience and these goals, this is what I came up with.

Simple Example
Message passing might be relatively simple if we could do everything on the stack.  As we will explore later, it is highly desirable to register components to listen in on messages by type, store messages in a priority queue, and dispatch them later.  This adds considerable difficulty to the problem, which will be addressed here in time.  First, let's explore the simplest case of a message sent on the stack to a known component.  Consistent with the component-entity model explained previously, here are the main interfaces for components/entities to send/receive messages.

const U32 DID_NOT_RECEIVE_MESSAGE = 0;
const U32 RECEIVED_MESSAGE = 1;

class jlMessage {
public:
      jlMessageType m_type;  // hashed message type
      void * m_data;         // untyped message data pointer
      U32 m_dataAndEntitySendMode;  // sizeof(*m_data) and flags
      F32 m_timeUntilDispatch; // when <= zero, dispatched from the queue
};

class jlComponent {
public:
    // Defers to the virtual receiveMessage function; the non-virtual base behavior has its benefits
    // (e.g. tracing in certain debug builds that prints every message being sent)
    U32 sendMessage(jlMessage& msg);
protected:
     // Returns 1 if the message is received, 0 if not
     virtual U32 receiveMessage(jlMessage& msg);
}; 

class jlEntity {
public:
    // Iterates over attached components and calls sendMessage on each one
    // Uses the message's entity send mode to determine which is called below
    U32 sendMessage(jlMessage& message, U32 minReceivers = DO_NOT_REQUIRE_RECEIVER);
    // attached components
    U32 sendMessageToComponents(jlMessage& message, U32 minReceivers = DO_NOT_REQUIRE_RECEIVER);
    // attached components + all children
    U32 sendMessageToChildren(jlMessage& message, U32 minReceivers = DO_NOT_REQUIRE_RECEIVER);
    // attached components + all ancestors
    U32 sendMessageToAncestors(jlMessage& message, U32 minReceivers = DO_NOT_REQUIRE_RECEIVER);
};



The non-virtual base behavior has numerous benefits for debugging
Let's take a look at a practical example: inside some script that has detected a collision, we might wish to deal damage to an entity.


// inside some script which deals damage to an entity
void jlPhysicalCharacter::processCollisionWithFirePhantom() {
    jlDamageData dmgData;
    dmgData.m_amount = 5.0f;
    dmgData.m_damageType = FIRE;
    jlMessage damageMessage(JL_DAMAGE_MSG);
    damageMessage.setData(&dmgData, sizeof(dmgData));
    getHealthComponent()->sendMessage(damageMessage);
}

// inside the health component receiveMessage() function
U32 jlHealthComponent::receiveMessage(jlMessage& msg) {
    if (msg.getType() == JL_DAMAGE_MSG) {
         jlDamageData *data = static_cast<jlDamageData *>(msg.getData());
         JL_ASSERT(data);
         m_health -= data->m_amount;
         return RECEIVED_MESSAGE;
    } else {
        return DID_NOT_RECEIVE_MESSAGE;
    }
}

Message Data
The message is structured like so for a few reasons.  A unique type id makes message handling convenient in a switch statement, and it lets us know what to cast to and how to interpret the data.  In my implementation I am using an untyped void pointer, so this is not something we can really ignore.  One alternative is to use a giant union of message data for all of your needs, like so:

struct AlternativeMessageData {
 union {
  struct {
   Matrix4 m_matrix;
  };
  struct {
   char m_string[64];
  };
  struct {
   Point3 m_point;
   Vector3 m_vector;
   Quaternion4 m_quat;
   void* m_ptr;
  };
  // etc etc...
 };
};

The advantage of this approach is that you never have a dangling reference, and queuing messages becomes as simple as queuing the message itself.  Messages have a consistent size and built-in copy semantics.  Defining new data types adds to compile times, while the union works in all cases.


However, the unioned data approach has a number of downsides.  A giant union adds considerably to the size of the message struct.  This makes messages less efficient to update and move around in the queue itself.  In certain instances, the data itself is not needed; if a component receives a DISABLE_RENDERING message it likely doesn't need any more data to know what to do.  In these instances the unioned data is excessive.  Such a system is also less readable in my opinion.  I found myself constantly going back and forth between code trying to figure out what was really encoded where in the union.


With a pointer and knowledge of its size, you can define your own types and never have to wonder whether the damage value was really encoded in m_floats[2] or m_floats[3] ever again.  Admittedly this requires jumping through a number of additional hoops, but it is a better solution in my opinion.

Entity Processing
Entities send the same messages, just with different kinds of iteration
The way the entity processes messages is pretty predictable given the signature.  We iterate through all attached components and call sendMessage on each one.  We accumulate the number of receivers and compare it to the minimum expected receivers.

In our messages we specify an EntitySendMode.  This is useful, since it allows us to queue messages without forcing the user to conform to one type of call.  If they want to send the message up or down the tree they merely need to specify the desired mode before queuing the message.
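
To make the dispatch concrete, here is a minimal sketch of these two functions.  The mode constants and the getEntitySendMode() accessor are my own naming assumptions (the mode is packed into m_dataAndEntitySendMode alongside the data size):

U32 jlEntity::sendMessage(jlMessage& message, U32 minReceivers) {
    // Assumed accessor: unpacks the send mode from m_dataAndEntitySendMode
    switch (message.getEntitySendMode()) {
    case JL_SEND_TO_CHILDREN:  return sendMessageToChildren(message, minReceivers);
    case JL_SEND_TO_ANCESTORS: return sendMessageToAncestors(message, minReceivers);
    default:                   return sendMessageToComponents(message, minReceivers);
    }
}

U32 jlEntity::sendMessageToComponents(jlMessage& message, U32 minReceivers) {
    U32 numReceived = 0;
    for (jlComponent *cur = getComponents(); cur != NULL; cur = cur->getNext()) {
        numReceived += cur->sendMessage(message); // RECEIVED_MESSAGE (1) or 0
    }
    return (numReceived >= minReceivers) ? RECEIVED_MESSAGE : DID_NOT_RECEIVE_MESSAGE;
}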

Queuing Messages
Don't Overcomplicate Things, It Is Just A Heap
So far things have been kept simple.  All we've really done is take a simple struct with a void pointer to some data, passed it to a consistent interface, casted it out, and processed it appropriately.  All of this still happens when messages are queued, but we need to do considerably more behind the scenes to provide a safe and efficient priority queue of messages.

The easy part is providing the priority queue.  You can simply use an array heap and sort the messages by their remaining time.  You can keep popping messages off the heap until you find one that hasn't run out of time; at that point, the remaining messages have not exceeded their delay and should not be dispatched yet.  Any simple resource on heaps, with the data sorted by the time until dispatch, should be sufficient.
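
As a sketch of that idea using the standard library's heap helpers (jlMessageQueue, its m_heap vector, and the dispatch() hook are hypothetical names, not the exact implementation):

#include <algorithm> // std::push_heap / std::pop_heap
#include <vector>

// Min-heap on time: with this comparator the message with the smallest
// remaining time sits at m_heap.front()
struct jlMessageSooner {
    bool operator()(const jlMessage& a, const jlMessage& b) const {
        return a.m_timeUntilDispatch > b.m_timeUntilDispatch;
    }
};

void jlMessageQueue::push(const jlMessage& msg) {
    m_heap.push_back(msg); // m_heap is a std::vector<jlMessage>
    std::push_heap(m_heap.begin(), m_heap.end(), jlMessageSooner());
}

void jlMessageQueue::update(F32 dt) {
    // A uniform shift doesn't change the heap order
    for (size_t i = 0; i < m_heap.size(); ++i) {
        m_heap[i].m_timeUntilDispatch -= dt;
    }
    // Keep popping until the front message still has time remaining
    while (!m_heap.empty() && m_heap.front().m_timeUntilDispatch <= 0.0f) {
        std::pop_heap(m_heap.begin(), m_heap.end(), jlMessageSooner());
        jlMessage msg = m_heap.back();
        m_heap.pop_back();
        dispatch(msg); // hand off to the entity/listeners
    }
}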

Except It Is Pool Allocated

In addition to your standard heap, we need to ensure our message data is safe.  To do this, when pushing a message we need to make a deep copy of its data.  My implementation achieves this with a pool allocator.  The pool allocates space, the message data on the stack is memcpy'd to the newly allocated space, and the message struct that is copied into the queue has its data pointer readjusted to point at the data in the pool.  Popping the message does the usual heap manipulation and deallocates the message data from the pool.
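
The push side might look something like the following.  getDataSize() (unpacked from m_dataAndEntitySendMode) and the pool's allocate()/deallocate() interface are assumed names:

#include <cstring> // memcpy

void jlMessageQueue::queueMessage(const jlMessage& msg) {
    jlMessage copy = msg;
    if (msg.getData() != NULL) {
        // Blocks are a fixed size, assumed large enough for the biggest payload
        void *block = m_dataPool.allocate();
        memcpy(block, msg.getData(), msg.getDataSize());
        copy.setData(block, msg.getDataSize()); // repoint the copy at the pooled data
    }
    push(copy); // the heap push from the previous sketch
}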

Writing A Pool Allocator
Pool allocation might scare some people, but it is one of the easiest custom memory allocators to roll yourself.  The idea is simple: we have blocks of the same size kept in a buffer.  On init, every block goes into a "free list", which is just a linked list of places in the pool that are currently unused.  When making an allocation, we take the address (or offset) stored at the head of the free list and remove it from the list.  When making a deallocation, we take the address/offset of the freed memory and add it to the free list.  This makes allocation and deallocation as fast as a linked list add/remove.  Of course, if we malloc these free list nodes it defeats the whole point of a custom memory allocator.  Therefore, we keep the nodes in the buffer itself.  This is not dangerous, since the free blocks are unused anyways.
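
Here is a bare-bones sketch of such a pool.  It hands out fixed-size blocks from a buffer supplied by the caller, and the free list nodes live inside the unused blocks themselves:

class jlPoolAllocator {
public:
    void init(void *buffer, size_t blockSize, size_t blockCount) {
        JL_ASSERT(blockSize >= sizeof(FreeNode)); // a free block must be able to hold a node
        m_freeHead = NULL;
        char *block = static_cast<char *>(buffer);
        for (size_t i = 0; i < blockCount; ++i, block += blockSize) {
            FreeNode *node = reinterpret_cast<FreeNode *>(block);
            node->m_next = m_freeHead; // push every block onto the free list
            m_freeHead = node;
        }
    }
    void *allocate() { // O(1): pop the head of the free list
        if (m_freeHead == NULL) return NULL; // pool exhausted
        FreeNode *node = m_freeHead;
        m_freeHead = node->m_next;
        return node;
    }
    void deallocate(void *ptr) { // O(1): push the block back onto the free list
        FreeNode *node = static_cast<FreeNode *>(ptr);
        node->m_next = m_freeHead;
        m_freeHead = node;
    }
private:
    struct FreeNode { FreeNode *m_next; };
    FreeNode *m_freeHead;
};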

Listening For Queued Messages By Type
With the message queuing solved, we can somewhat trivially provide an interface that lets us register components/entities to listen for messages of a given type.  In addition to our existing queue, we store a map that lets us efficiently look up all the listeners for a given message type.  Then, when popping a message off the queue, we simply look up the linked list of listeners by its type, walk said list, and send the message to every item in it.

Since I wanted to avoid virtual functions, I stored the type I needed to cast to in the nodes.  And yes, the nodes in the map should be pool allocated too.  
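
A sketch of what dispatch-to-listeners could look like.  The node layout, the receiver-kind constant, and the map's find() returning a list head are all assumptions on my part:

struct jlListenerNode {
    void *m_receiver;        // jlComponent* or jlEntity*
    U32 m_receiverKind;      // tells us which cast below is safe
    jlListenerNode *m_next;  // intrusive, pool allocated
};

void jlMessageQueue::dispatchToListeners(jlMessage& msg) {
    // m_listenerMap: hash map from message type -> head of an intrusive list
    for (jlListenerNode *node = m_listenerMap.find(msg.getType());
         node != NULL; node = node->m_next) {
        if (node->m_receiverKind == JL_COMPONENT_RECEIVER) {
            static_cast<jlComponent *>(node->m_receiver)->sendMessage(msg);
        } else {
            static_cast<jlEntity *>(node->m_receiver)->sendMessage(msg);
        }
    }
}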

Other Implementations/Alternatives
- If you want to make things easier on yourself, you can use the unioned message data struct.  This allows an implementation that doesn't have to worry about any custom memory allocation.
- If you are less concerned about performance and more concerned about flexibility, you can have both components and entities derive from a "message receiver" class that has a virtual sendMessage function.  That way you don't need to store the receiver type yourself and worry about your type casts.
- You can forego the usage of an interface entirely and just use delegates/function pointers.
- You don't really need the message data size and time delay outside of the message queue itself.  These could be passed directly when queuing is needed and stored separately.  An SoA layout would greatly improve data cache utilization when updating the remaining time on each message.

Saturday, August 18, 2012

Gameplay Architecture Part 1: Component Entity Models in C++


Why Component Entity Models?
Rationale
In the interest of re-usability, separation of concerns, and data driven needs, developers are starting to move away from maintenance prone class hierarchies and towards Component Entity Models.  Conceptually, such models are simple to understand.  Functionality is split into components, and game entities are defined by a composition of these components.  So at its core, an entity is just a container of its components, or an identifier that allows its components to relate and communicate with each other.

There are a number of different ways to go about implementing such a model, and a number of things about C++ can make certain aspects of the implementation rather tricky.  Be forewarned that the implementation I'm presenting does use templates and does stick rather closely to the model used in Unity.  While a bit tricky, the implementation here is not overly complicated, and it has considerable benefits once it is set up.  Users familiar with the script interface in Unity can speak to how intuitive it is to use.

Establishing Relationships Between Components
In more radical models, the idea of an entity is removed completely in favor of pure aggregation.  Although not without its benefits, in my opinion this makes things unnecessarily complicated.  In addition, forcing a component to belong to one and only one Entity solves a lot of problems.  So in the interest of keeping things simple, that is what I will be presenting here.  A one-to-many relationship exists between an entity and its components, and each component keeps a pointer back to its entity.  This makes sending messages and modifying sibling components significantly easier for a new component.  For example, a camera tracking behavior will know to search for and operate on camera components attached to the same entity.  Likewise, a health component will know which components to send a message to on something like a TAKE_DAMAGE event.

Accessing Components
At its core an Entity is a container of components.  We don't want any game logic to seep into the entities themselves; instead we want functionality to stay decoupled and separated into individual components.  Therefore, we need some way to access components.  The implementation I am showing here presents a few options, including:

- By Type (using templates for type safety)
- By A Unique Instance Id (given by a static instance counter)
- By Name (using a hashed string id for faster comparisons, but still provides a system where designers can name parts of their level)

Why Names/Name Id's?
It is very handy for designers to identify things in the world by name, save such things to a file, and have them work the same in each build.  Hashed string id's minimize the cost of comparisons, and can still be accessed in debug builds that make use of a giant string table.  This is a godsend for debugging compared to some random enum type id that could easily be reordered when merged!
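
As a sketch of the idea (g_stringTable and JL_DEBUG are hypothetical names), hashing on registration while remembering the original string in debug builds gives you both cheap comparisons and readable names:

U32 jlHashName(const char *name) {
    U32 id = crc32(name); // same hash used for component type ids later on
#if defined(JL_DEBUG)
    g_stringTable.insert(id, name); // debug-only id -> string lookup for tools
#endif
    return id;
}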

Why Instance Id's?
A GUID can be quite handy if you need a unique handle and your scripting language doesn't have pointers.  In certain cases, name ids may not cut it, since designers will keep giving objects the same name.

Providing Your Own RTTI
Identifying objects by their runtime type is a bit trickier.  We could punt and just use dynamic_cast while iterating, relying on a non-NULL result to indicate a valid cast.  This will work, and it is more reliable than typeid, which won't respect class hierarchies.  However, enabling RTTI is generally considered undesirable because it must be enabled for your whole program.

Observe: given a type id we have rolled ourselves and a virtual function like the one below, we can determine if an object is derived from a given type:
virtual bool isDerivedFromType(U32 typeId) {
    return (TYPE_ID == typeId) ? true : PARENT::isDerivedFromType(typeId);
}


This will continue until it finds a matching type id or reaches the root class in the hierarchy.  In the root of the hierarchy, the logic simplifies to comparing its own type id without deferring to a parent class (since one doesn't exist).

virtual bool isDerivedFromType(U32 typeId) {
    return (TYPE_ID == typeId); // no parent, so just compare
}


Hashed String Id's > Error Prone Enums
But how do we provide a unique type id?  We could roll an enum of all possible types, but enums do not scale well at all.  A better implementation, one that puts less of a burden on its users, is to use the hashed string id of the class name.  Since we need a unique isDerivedFrom() function anyways, this is a perfectly justifiable use of the preprocessor.  A good hash function like MD5 or CRC32 should let you use this simple pattern without any conflicts.

// goes in the public declaration of every component
#define DECLARE_COMPONENT(COMPONENT_CLASS, PARENT_COMPONENT_CLASS) \
static const U32 COMPONENT_TYPE_ID; \
static const char * COMPONENT_TYPE_NAME; \
virtual bool isDerivedFromComponent(U32 componentTypeId) const { \
    return (COMPONENT_TYPE_ID == componentTypeId) ? true : PARENT_COMPONENT_CLASS::isDerivedFromComponent(componentTypeId); \
}


// Goes in the cpp of every component
#define DEFINE_COMPONENT(COMPONENT_CLASS) \
const U32 COMPONENT_CLASS::COMPONENT_TYPE_ID = crc32(#COMPONENT_CLASS); \
const char * COMPONENT_CLASS::COMPONENT_TYPE_NAME = #COMPONENT_CLASS

So at this point, a minimalist implementation could have its logic work like so when searching for the first component of a given type:
 
template <typename COMPONENT_TYPE> COMPONENT_TYPE * getComponent() {
    foreach (component in componentList) { // pseudocode for walking the component list
        if (component->isDerivedFromComponent(COMPONENT_TYPE::GetComponentTypeId())) {
            return static_cast<COMPONENT_TYPE *>(component);
        }
    }
    return NULL; // no component of this type attached
}

Storing Components
Data Structures
Part of the entity interface
There are a number of ways we could potentially store components.  The first solution you might reach for is a dynamic array (e.g. std::vector) of component pointers.  After some thought, I have created a very specialized structure that uses handles and takes advantage of template specialization.  To keep things simple I will instead present a system that just uses an intrusive linked list.  Since components should only belong to one list anyways this isn't a terrible choice, but in general, if you want something that is concurrent and cache friendly, you will probably want to use something else.

class Component {
    // the RTTI type id and name are stored statically
    Entity *m_entity;         // component container entity
    Component *m_next;        // intrusive next pointer
    Component *m_prev;        // intrusive prev pointer
    U32 m_nameId;             // hashed name id
    U32 m_instanceId;         // unique instance id
    ComponentEntityFlags m_componentEntityFlags;  // flags like enabled/disabled, etc etc
};

class Entity {
    Transform m_transform;     // head of the component linked list
    Component *m_tail;         // tail of intrusive component linked list
    U32 m_layers;              // bitwise layers
    U32 m_nameId;              // hashed name id
    U32 m_instanceId;          // unique object instance id
    ComponentEntityFlags m_componentEntityFlags; // entity flags
};

Therefore, we can get a component by type using the following code:
template <typename COMPONENT>
const COMPONENT * jlEntity::getComponent() const {
    const COMPONENT *ptr = NULL;
    for (const jlComponent * cur = getComponents(); cur != NULL; cur = cur->getNext()) {
        if (cur->isDerivedFromComponent(COMPONENT::GetComponentTypeId())) {
            ptr = static_cast<const COMPONENT *>(cur);
            break;
        }
    }
    return ptr;
}

Getting multiple components by type also becomes trivial:

template <typename COMPONENT>
U32 jlEntity::getComponents(const COMPONENT *componentList[], U32 maxToAdd) const {
    U32 n = 0;
    for (const jlComponent *cur = getComponents(); cur && n < maxToAdd; cur = cur->getNext()) {
        if (cur->isDerivedFromComponent(COMPONENT::GetComponentTypeId())) {
            componentList[n++] = static_cast<const COMPONENT *>(cur);
        }
    }
    return n;
}

It could also be expanded using the bitwise filtering I posted about previously here.  That way you can search for all enabled or disabled components of a given type.
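
For example, a hypothetical getComponentsFiltered() (the name and the getFlags() accessor are my own) could combine the type check with the branchless filter() from that post:

template <typename COMPONENT>
U32 jlEntity::getComponentsFiltered(const COMPONENT *componentList[], U32 maxToAdd,
                                    U32 searchFlags, U32 mask) const {
    U32 n = 0;
    for (const jlComponent *cur = getComponents(); cur && n < maxToAdd; cur = cur->getNext()) {
        if (cur->isDerivedFromComponent(COMPONENT::GetComponentTypeId()) &&
            filter(cur->getFlags(), searchFlags, mask)) { // inclusive or exclusive flag match
            componentList[n++] = static_cast<const COMPONENT *>(cur);
        }
    }
    return n;
}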

Scenegraph Integration
Operations can be expanded to trees of entities/components
Parent-Child Relationships
One of the great things about Unity is that its parent-child relationships carry over to the object model itself.  This lets designers and programmers define objects in beneficial ways while still giving high level code an intuitive way to communicate with related data.  For example, if the wheel of a car needs to inherit rotation and translation, why not take this relationship a bit further?  If we have links to the scene nodes, and they have links back to their entities, we can perform operations on components and entities while traversing down the tree.  This not only means we can search for components in a subtree, but also send messages down (or up) it.  Therefore, if the car gets an important event, it can quickly and easily send it down to the wheels.

A Consistent Message Passing System
If functionality is split into components, then a virtual function for sending and receiving messages is easy to implement.  With scene graph integration we can not only send messages to every component attached to the given entity, but to every component in the entity's subtree.  This can be rather slow, but it is extremely useful when you need it.  It is also very flexible for expansion (e.g. a component isn't doing anything with a message now, but could later without changing any other code).

As I will explore in another post, the nitty gritty details of implementing such a system is actually rather involved, especially if you want to queue messages with different data efficiently.  Hopefully this will get posted within a few weeks.

Performance Considerations/Concerns
Speeding Up Component Retrieval
At first glance the above might look a bit scary.  Sure, it uses little memory, but a linear search that isn't cache coherent is worrisome.  Even with a dynamic array, we are still ultimately dealing with pointers and indirection anyways.  Also, since components do many different things, they will be allocated in many different ways.  This is one of the downsides of component entity models in general, but all is not lost.  Here are some of your options:

Exploiting Coupling
If every entity must have a transform this can serve as the head in your linked list.  If the Transform is really just an id into a procedural interface, you may be able to get away with storing the component itself inside the entity.  

Cache Frequently Used Components
In Unity, frequently used components are stored as members with separate references.  When combined with template specialization this can make operations on a type extremely fast.  For example, if we store a pointer to the transform our getComponent overload could resemble the following:


template <>
FORCE_INLINE const Transform * Entity::getComponent<Transform>() const {
    return m_transform;
} 

However, this comes at the price of extra space in every entity.  In addition, C++ templates do not respect class hierarchies, so if I make a subclass of Transform, I need to make specializations for every operation on that type (addComponent, removeComponent, etc etc).  This is a considerable pain and maintenance overhead.

Minimize Components To A Light Interface
Try to avoid providing Update, Render, and a bunch of other virtual functions for everything unless you really hate I-CACHE efficiency.  Keep things light, make components nothing more than a way to access existing information if that is all you need.  Avoid storing the actual data for components inside the interface if it presents too much of an overhead.  If you have an existing scene graph that uses some crazy struct of arrays scheme, make the Transform interface store nothing more than a handle into those arrays.  Components do not have to present crippling changes to every piece of code.  You can still keep your existing procedural implementations if you want.  Components need only be an interface.

Allocate Entities Into A Pool
Since entities should not have subclasses and all share the same size, they are a perfect candidate for pool allocation.  We cannot make the same assumptions with components, since they will obviously have subclasses of varying sizes.  If the range isn't significant, you can allocate blocks capable of storing the largest component, or use separate pools for components of different sizes.

Questions?

Thursday, July 26, 2012

Simple Inclusive/Exclusive Bitwise Filtering


It's common in games to use bitfields to store numerous flags efficiently.  A single boolean can't hold more than one flag, and it takes up at least as much space as an unsigned char, sometimes as much as a word (which has implications for space and alignment).  It is much more efficient to pack flags into one variable and use simple bitwise operations to check whether a flag is set.  As a simple test case, let's say we have some flags for a generic entity in our game:

typedef unsigned int EntityFlags;
const EntityFlags EF_ENABLED = (1 << 0);
const EntityFlags EF_IS_DEAD = (1 << 1);
const EntityFlags EF_IS_PLAYER = (1 << 2);
const EntityFlags EF_IS_IN_VEHICLE = (1 << 3);
const EntityFlags EF_IS_STUNNED = (1 << 4);


So now we have a number of flags we can efficiently pack into one variable and pass to functions as one argument.  This is how OpenGL handles almost all of its options, which is a good thing, particularly for compatibility reasons.  For example, instead of some function with an inflexible signature:

int processState(bool isDead, bool isPlayer, bool isInVehicle);
// called later
int ret = processState(true, true, true);

We instead have something more flexible to change, and with clearer calling code (in my opinion):

int processState(EntityFlags entityFlags);
// called later
int ret = processState(EF_IS_DEAD | EF_IS_PLAYER | EF_IS_IN_VEHICLE);

Moving on from our simple review, let's take a look at a generic way to expand upon this knowledge.  Let's say we have an input list that we need to filter into another list based on flags.  Our function looks something like this, with Entity being some type that at least has something resembling a getFlags() function returning its EntityFlags from above.

int filterList(Entity * RESTRICT output[], const Entity * RESTRICT input[], int N, EntityFlags searchFlags) {
    int outIdx = 0;
    for (int inputIdx = 0; inputIdx < N; ++inputIdx) {
        if (filter(input[inputIdx]->getFlags(), searchFlags)) {
            output[outIdx++] = input[inputIdx];
        }
    }
    return outIdx;
}


So what goes into our filter function that differentiates filterList() from a standard list copy?  If we are doing an inclusive search, where any flags shared in common count as a match, we just need to AND the bits together and see if the result is non-zero:
int filterAnyBitsAreSet(EntityFlags flagsA, EntityFlags flagsB) {
    return (flagsA & flagsB); // search filters and object share at least one flag in common
}

The algorithm for checking that all the bits in flagsB are set in flagsA is also very simple; we take an additional step and check for equality:
int filterAllBitsAreSet(EntityFlags flagsA, EntityFlags flagsB) {
    return ((flagsA & flagsB) == flagsB); // all flags in flagsB must be set in flagsA
}

All of this is pretty trivial.  However, wouldn't it be nice if we could pass an additional parameter to filterList that let us perform both inclusive and exclusive searches?  For example, at one point I may want all entities in the list that are both a player and in a vehicle, and at another time I may want all the entities that are either stunned and/or dead.  Our filter function could take an additional flag which determines which processing to use:


int filter(EntityFlags flagsA, EntityFlags flagsB, bool searchInclusive) {
    if (searchInclusive) {
        return (flagsA & flagsB);
    } else { // exclusive all bits must be set
        return ((flagsA & flagsB) == flagsB);
    }
}

This will work.  The searchInclusive flag could percolate up to our filterList() function and we could search all flags in either manner.  However, we are left with these annoying branches, which add up when processing a large list.  How can we make this branchless?

Let's take a look at our previous conditions a little more closely.  In both instances, some bits need to be shared in common, so (flagsA & flagsB) will have to appear in our test in some way.  Next, we are going to reverse the logic of our all-bits-are-set condition a bit.  The XOR operation returns the difference in bits.  If all bits in flagsB must be set in flagsA, the XOR difference should be 0000 for all of the relevant bits (since they should be the same).  Therefore, if all bits must be set, the following condition must be met: (((flagsA ^ flagsB) & flagsB) == 0).  Use a pencil and paper if it helps you reason it out.  So now we have two valid conditions, but they won't work together in one bitwise test.  Notice the second half of our condition must equal zero to be valid.  If we use another mask to zero out the expression in the case of an inclusive test we will be golden.  Using one of the two constants below as our mask, we can perform the test without branches like so:

const int ANY_ARE_SET = 0;
const int ALL_ARE_SET  = 0xFFFFFFFF;

int filter(EntityFlags flagsA, EntityFlags flagsB, int mask) {
   return ((flagsA & flagsB) && (((flagsA ^ flagsB) & (flagsB & mask)) == 0));
}

// Or alternatively like...

int filter2(EntityFlags flagsA, EntityFlags flagsB, int mask) {
   return ((flagsA & flagsB) && (((flagsA & flagsB) & (flagsB & mask)) == (flagsB & mask)));
}


Now our list copy can filter results without forcing the calling code to perform a second pass and without excessive branches (filterList() just gains a mask parameter that it forwards to filter()).  This simple concept is easily applicable to anything using bitfields.  As explained above, I can now filter lists of objects quite easily and efficiently using the same function.

Entity *downedEntities[MAX_ENTITY_LIST_SIZE];
int sz = filterList(downedEntities, g_EntityList, g_numEntities, EF_IS_DEAD | EF_IS_STUNNED, ANY_ARE_SET);
for (int i = 0; i < sz; ++i) {
    // processing for enemies that are stunned and/or dead
}

Entity *playersInVehicles[MAX_ENTITY_LIST_SIZE];
sz = filterList(playersInVehicles, g_EntityList, g_numEntities, EF_IS_PLAYER | EF_IS_IN_VEHICLE, ALL_ARE_SET);
for (int i = 0; i < sz; ++i) {
    // processing for entities which are players and currently in vehicles
}


Simple but powerful.

Tuesday, May 29, 2012

JL Math

I've recently tuned and completed the SIMD math library I'm tentatively calling JL Math (named after my initials).  It is still a bit rough around the edges (in terms of compatibility), but it will work with SSE2 or the FPU seamlessly after one change in the preprocessor (JL_SIMD_ENABLED=1 or 0) using Microsoft Visual Studio.  True g++ support is still pending and untested, but most of the building blocks are there.

The library is intentionally pretty minimalist to focus on SIMD wrappers and instructions.  So things like logging are just done with the standard library, and aligned memory functions are achieved with existing implementations (e.g. _aligned_malloc).  On consoles, home rolled implementations would almost certainly be used to avoid the context switch. 

A few things to take from this experiment:
- SSE2 instructions are considerably faster on large sets of data than comparable FPU instructions.
- Alignment can be a pain, and has considerable consequences for the design of code elsewhere.  If you want to develop for consoles, it's best to get used to this sooner rather than later.
- The same applies to an interface which supports SSE2, FPU, AltiVec, etc.  They are a pain to implement.
- Reordering instructions to prevent stalls can make a huge difference.
- Restricted pointers and an eye for load-hit-stores can make a huge difference.
- Removing branches can make a huge difference.
- A struct-of-arrays approach can be more performant than an array-of-structs for large structs that exceed the size of a cache line.
- More experience with SFMT, static assertions, and Intel's VTune.

Preventing Stalls
Preventing stalls, in terms of performance, is very important in a speed critical library such as this.  One thing I found myself doing was taking advantage of the SimdFloat wrapper to minimize them as much as possible.  Since you have to use them to prevent pipeline flushes anyways, you might as well reorder them to minimize stalls as well.  This is one of the easiest optimizations to make, since it only involves moving the same instructions around and spreading dependent data out so the processor stays busy.

bool32 testRayTriangle(const jlVector4& origin, const jlVector4& dir, const jlVector4& a, const jlVector4& b, const jlVector4& c) {
    jlVector4 pn = jlVector4::Cross(b - a, c - a);
    pn.normalize3();
    jlSimdFloat pd = jlVector4::Dot3(pn, a); // stalls waiting on the normalize to finish
    jlSimdFloat t = (pd - jlVector4::Dot3(origin, pn)) / jlVector4::Dot3(dir, pn);
    jlSimdFloat zero = jlSimdFloat(0.0f); 
    if (t <= zero) return 0;
    jlVector4 pt = origin + dir * t;
    jlSimdFloat u, v, w;
    barycentric(a, b, c, pt, u, v, w);
    jlSimdFloat one = jlSimdFloat(1.0f);
    return (v >= zero && w >= zero && (v + w) <= one);
}

Since the normalize3() function takes the longest, we can construct the SIMD float constants afterwards as an easy way to minimize stalls.
bool32 testRayTriangle(const jlVector4& origin, const jlVector4& dir, const jlVector4& a, const jlVector4& b, const jlVector4& c) {
    jlVector4 pn = jlVector4::Cross(b - a, c - a);
    pn.normalize3();
    jlSimdFloat zero = jlSimdFloat(0.0f); // these simple constructions keep the processor busy 
    jlSimdFloat one = jlSimdFloat(1.0f);  // while waiting on the normalization operation to finish
    jlSimdFloat pd = jlVector4::Dot3(pn, a);
    jlSimdFloat t = (pd - jlVector4::Dot3(origin, pn)) / jlVector4::Dot3(dir, pn);
    if (t <= zero) return 0;
    jlVector4 pt = origin + dir * t;
    jlSimdFloat u, v, w;
    barycentric(a, b, c, pt, u, v, w);
    return (v >= zero && w >= zero && (v + w) <= one);
}

Removing Branches
Here's a typical AABB intersection test:
bool32 testAABBWithBranches(const jlAABB& boxA, const jlAABB& boxB) {
    if (boxA.max(0) < boxB.min(0) || boxA.min(0) > boxB.max(0)) return 0;
    if (boxA.max(1) < boxB.min(1) || boxA.min(1) > boxB.max(1)) return 0;
    if (boxA.max(2) < boxB.min(2) || boxA.min(2) > boxB.max(2)) return 0;
    return 1;
} 

 
On a modern CPU, branch misprediction and short circuiting are generally bad things.  By removing the branches, instructions can be queued and executed in a more predictable manner.  These two algorithms are logically the same; we still check for a separation on all three Cartesian axes.  However, the version without branches does all of these tests in parallel and uses bitwise operations to merge the comparisons into one boolean result.

bool32 testAABBWithoutBranches(const jlAABB& boxA, const jlAABB& boxB) {
    jlComp comp;
    jlComp aMinCompBMax = boxA.min.compLess(boxB.max);
    jlComp bMinCompAMax = boxB.min.compLess(boxA.max);
    comp.setAnd(aMinCompBMax, bMinCompAMax);
    return comp.allAreSet(jlComp::MASK_XYZ);
}

SFMT
Random number generators are interesting to me, but they can be notoriously hard to understand due to excessive use of magic constants and seemingly arbitrary bit shifts (frequently seen in hash functions too).  In my old GXT project, I used an existing implementation of XORShift I found here.  This experimental project seemed like a good excuse to explore Mersenne Twister and its SIMD-optimized variant, SFMT.  I adapted the code I found to work with my existing classes and support SSE and FPU seamlessly.  The result is a considerable speed boost, but at the cost of extra memory, as it relies on a previously allocated buffer.
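
For reference, here is Marsaglia's xor128 generator, one of the simplest members of the XORShift family.  I show it only to illustrate the "magic shifts" flavor of these generators; SFMT applies the same spirit to 128-bit SIMD registers:

#include <stdint.h>

// Seed values from Marsaglia's paper; any nonzero state works
static uint32_t s_x = 123456789, s_y = 362436069, s_z = 521288629, s_w = 88675123;

uint32_t xor128() {
    uint32_t t = s_x ^ (s_x << 11);
    s_x = s_y; s_y = s_z; s_z = s_w; // rotate the state words
    return s_w = s_w ^ (s_w >> 19) ^ (t ^ (t >> 8));
}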

The restrict keyword
I'm still a bit new to the concept of the __restrict keyword.  I did, however, notice a lot of engine devs talking about its importance on consoles and elsewhere in addressing the pointer aliasing problem (which also extends to references in C++).

restrict is a qualifier designed to address the pointer aliasing problem.  When used as a qualifier on a member function itself, it applies to the "this" pointer.  A number of articles have been written about restrict and the way it addresses the load-hit-store problem.  The way I like to think about restrict is that it is a promise the programmer makes to the compiler that the address will not be referenced by any other pointer in that scope.

Three things to note about the restrict keyword, apart from the way it works:
1) restrict is nonstandard: in VC++ it is __restrict and in g++ it is __restrict__.  So this is an appropriate time to use the preprocessor.
2) Restrict in Visual C++ only applies to pointers, not references.  Thankfully g++ supports restricted references, but this disparity is a pain for those seeking portable code.  In Visual C++ restricted references are simply ignored, but this will accumulate numerous compiler warnings in a project of any decent size.  This limits portable restricted addresses to restricted pointers.
3) Restrict matters to the calling code.  Seriously, put restrict qualifiers in the header/prototype, even if the compiler will let you get away without doing it.  Let's take an example where restricted pointers make sense as a design/code clarity hint as much as a performance-centered measure.
Take barycentric coordinates, where we want to calculate a weighted average and put them into u, v, and w:

void barycentric(const jlVector4& a, const jlVector4& b, const jlVector4& c, const jlVector4& pt, jlSimdFloat* u, jlSimdFloat* v, jlSimdFloat* w);

It effectively defeats the purpose if u, v, and w reference the same variable.  What we know about the weighted average is effectively lost, and all three variables will hold the last stored value if they all reference the same variable.  As such, it is entirely appropriate in such instances to change the function signature to:

void barycentric(const jlVector4& a, const jlVector4& b, const jlVector4& c, const jlVector4& pt, jlSimdFloat* JL_RESTRICT u, jlSimdFloat* JL_RESTRICT v, jlSimdFloat* JL_RESTRICT w);

Test               SSE      SSE With Restrict
Barycentric Test   0.171    0.167

For this project, I made much more extensive use of Intel's VTune than I have in the past.  Here are some of the results I found after running numerous benchmarks on my machine.  In each test, I would generally produce upwards of a million randomly generated matrices, vectors, rays, etc. and see how the two compared. 


Struct of Arrays vs. Array of Structs
I also tried a few new things this time around, like the difference between Struct of Arrays and (the more frequently used) Array of Structs.  This is admittedly an element that truly excels in multithreaded environments.  An effective comparison is made in Game Engine Architecture:

// Array of Structs, easier to understand,
// especially in the OOP world.  Logically
// related data is placed contiguously in memory.
struct GameObject {
    uint32 id;
    Vector3 pos;
    Quaternion rot;
    float health;
};


// In the Struct of Arrays approach, typed data is placed contiguously
struct GameObjects {
    uint32 ids[MAX_GAME_OBJECTS];
    Vector3 positions[MAX_GAME_OBJECTS];
    Quaternion rotations[MAX_GAME_OBJECTS];
    float healthPoints[MAX_GAME_OBJECTS];
};

A 64-byte line size means most loops will be unrolled 4 Vector4's at a time
The idea is to keep homogeneous data together in memory to maximize data cache hits and minimize misses.  This is particularly concerning on consoles, where the cost of these misses adds up to thousands of instructions.  A good article, which explores this (and more) as it applies to scene-graphs, can be found here.  Unrolling the loops that process these large arrays to match the size of a data cache line can yield further benefits.

You can download and use CPU-Z to find the properties of your CPU's cache, and/or use a function like the one sketched below to get the size of a d-cache line in code.
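
One x86-specific way to do this (an assumption on my part, shown for MSVC; g++ has <cpuid.h>) is to read the CLFLUSH line size from CPUID, which is reported in EBX bits 15:8 of leaf 1 in 8-byte units:

#include <intrin.h>

unsigned getCacheLineSize() {
    int regs[4]; // EAX, EBX, ECX, EDX
    __cpuid(regs, 1);
    return ((regs[1] >> 8) & 0xFF) * 8; // CLFLUSH line size, usually the d-cache line size
}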


AoS Particles   SoA Particles
5.264           5.063




More Tests
For good measure, a few more tests profiled using VTune.  Feel free to run some yourself with hotspot analysis.

VTune's "Compare" functionality between SSE2 and FPU


Test                          SSE     FPU
Vector/Matrix Multiply Test   0.5     2.0
Normalization Test            0.138   0.4
Ray Triangle Test             0.017   0.039

Monday, March 26, 2012

Currently Under Construction

Things have been a bit quiet here but rest assured I've been occupied with a number of things, some of which will have source code up here in less than a week.

Card Kingdom
The Card Kingdom project is a lot of fun and has a lot of potential to get better.  I've been working on gameplay code, camera systems, and helping with frustum culling.  I may very well be working on A.I. systems as well (which I look forward to, especially for the final boss).

JL Math
I wrote a post on SIMD processing, but I want to prove I've truly dived into it.  Making an interface that works well with SSE2 and the FPU is taking a bit longer than I originally expected, but an alpha release should be out soon.  The math library, JL Math, is named after my initials because it's an individual project and I didn't have a witty name for it.  Ideally the interface could be expanded to work with AltiVec sometime down the road too.  Right now it supports Vector2, Vector4, Matrix4, and Quaternions.

FMOD and Component Entity Models
I've been experimenting with the excellent FMOD library and trying to find ways to cleanly integrate the system into a component based engine.  One of the excellent things about FMOD is that you can often get most of the functionality you want with just a few function calls.  Under the hood this industry standard library does an awful lot of things for you (which are of course configurable).

Concurrency
In my ongoing efforts to increase my experience with multithreaded architectures, I'm hoping to fit in one more non-school side project before I graduate.  So I bought Anthony Williams' excellent book, C++ Concurrency in Action, and am auditing an Operating Systems course to get the concepts down.  It'll take some careful planning to decide how I want to apply this knowledge, but I'm currently hoping to do something with particles that builds upon the SIMD math library in an effort to do everything in parallel.

Expect to hear more on all of these developments in the near future.

Sunday, March 11, 2012

Some Assembly Required

I recently took a class in Computer Organization. A number of topics pertinent to game programmers were discussed in the course including:

- Assembly Instructions
- The Memory Hierarchy
- Pipelining/Parallelism

Like most classes that explore assembly code, the instruction set was MIPS, which is generally considered the best assembly language to learn first, at least before x86.  As someone who enjoys tackling low level problems, I found getting a better understanding of how C/C++ code is actually translated into machine instructions very interesting.

One of the projects for the class was writing Connect 4 entirely in assembly.

This was surprisingly informative and fun to write.  Since I hope to explore asm { } blocks inside C++ and x86 in the near future, this was a good first step in that direction.  You can download the source code here.  If you have access to RSIM on your machine, you will almost certainly need to adjust the paths in the makefile to the appropriate directories.

Friday, March 2, 2012

The Challenges In Building A SIMD Optimized Math Library

What is SIMD?
SIMD stands for Single Instruction Multiple Data, and it does exactly what it says.  By packing several integral/floating point values together, a considerable amount of arithmetic and logical computation can be done at once.


Why Should You Care?
These days, everything is designed for parallelism.  The beauty of intrinsic functions is that they exploit such parallelism at the instruction level.  With entirely separate registers and instructions, CPUs can compute the same operation three more times at little additional cost.  Game engines have been using intrinsics for a while now; I noticed while browsing the Doom 3 source code (the game shipped in 2004) that it had full support for SIMD intrinsics.  Even if you don't care to write high performance code, you're only kidding yourself if you think this technology is going away.  In all likelihood, in the industry you're going to have to deal with intrinsics, even if they've been wrapped in classes that are provided for you.  This isn't just an exercise in reinventing the wheel for the sake of doing it; it's a reality of modern day game architecture we all have (or want) to use.

But I Don't Know Assembly
You don't have to.  Although a background in MIPS/x86 assembly definitely helps with understanding how SIMD math really ticks, functions are provided that do most of it for you.  This is a very convenient and more maintainable way to do things, as opposed to resorting to assembly instructions you have to roll yourself.

Specific Uses In Games
SIMD types can be used for a number of things, but since the __m128 data type holds 4 floating point values (which must be aligned on a 16 byte boundary), they are particularly good for data sets whose elements can be treated the same.  Things like vectors (x, y, z, w), matrices (4 vectors), colors (r, g, b, a), and more perform very well and are easily adapted to SIMD types.  You can also use intrinsics to represent 4 float values which are not all treated homogeneously.  For example, a bounding sphere may use the last value to represent the radius of the sphere, but then you have to be careful not to treat that scalar the way you treat the position vector.  The same holds true for quaternions.  And with crazier functionality, such as a random number generator that internally uses intrinsics, the downside is that code becomes much more complex and difficult to understand.

Challenges/Advice
The Alignment Problem: SIMD types have to be aligned on a 16 byte boundary.  A number of game developers know that the default allocators in C++ make no such guarantee, and as such, are used to overriding the new/delete operators for certain things that need them.  Console developers, who may very well use custom memory allocators for everything already, should have no problem making the adjustment (if one is needed at all).  Students and hobbyists on the PC might find their program crashing due to unaligned access and be left scratching their heads.  To start, I recommend just using _aligned_malloc or posix_memalign to get things going.  Consider making a macro like DECLARE_ALIGNED_ALLOCATED_OBJ(ALIGNMENT) and using it everywhere you need a static overload, so it can eventually be changed later from one place.
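
One possible shape for that macro (the names are mine, not from any particular library) simply overloads the class's new/delete with the aligned allocation functions; the array forms are omitted for brevity:

#include <malloc.h> // _aligned_malloc / _aligned_free (MSVC)

#define DECLARE_ALIGNED_ALLOCATED_OBJ(ALIGNMENT) \
    void *operator new(size_t size) { return _aligned_malloc(size, (ALIGNMENT)); } \
    void operator delete(void *ptr) { _aligned_free(ptr); }

class AlignedVector4 {
public:
    DECLARE_ALIGNED_ALLOCATED_OBJ(16) // heap instances now respect the 16 byte boundary
    // ... a __m128 member would live here ...
};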

Pipeline Flushes: This one is tricky, because your code will run fine, but might actually be slower than if you were still using standard floating point types!  You have to try very hard to avoid mixing floating point math and SIMD math or the pipeline will experience significant stalls.  If this sounds hard, it's because it is.  A lot of game developers are used to using floats everywhere for everything.  As a result, a large codebase is a huge, huge pain to refactor to accommodate for SIMD math in as many places as possible.

I recommend wrapping SIMD types in a class called something like SimdFloat, which, for all intents and purposes, acts just like a float.  Internally, however, it holds a four-float value to avoid those costly pipeline flushes.  The implications of this are significant: now things like dot products, length squared functions, and others are actually returning quad data types.  This will take some getting used to.  You can help alleviate it by writing a conversion operator that converts to a regular float and back, but this carries the potential for abuse.  If the additional memory space is significant, consider creating a SIMD float on the stack as soon as possible and using it for the rest of the function.
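
A minimal sketch of the SimdFloat idea, assuming SSE (the class and member names are illustrative, not a specific library's API):

#include <xmmintrin.h>

class SimdFloat {
public:
    SimdFloat() {}
    explicit SimdFloat(float f) : quad(_mm_set_ps1(f)) {} // replicate into all four lanes
    SimdFloat operator+(const SimdFloat& rhs) const {
        SimdFloat result;
        result.quad = _mm_add_ps(quad, rhs.quad); // stays in the XMM registers
        return result;
    }
    // Convenient, but every use is a potential SIMD->FPU round trip, so use sparingly
    operator float() const {
        float f;
        _mm_store_ss(&f, quad);
        return f;
    }
    __m128 quad; // the scalar value replicated across a quad
};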


Code Clarity: This can be alleviated, but is ultimately going to take a hit somewhere.  SIMD math typically involves a lot more temporary values on the stack.  This will seem to avoid going directly "at the problem" at times.  For example, consider a typical, trivial computation of the dot product:

inline float32 dot3(const Vector4& v) const {
    return x*v.x + y*v.y + z*v.z; 
}


Now becomes something like:

inline SimdFloat dot3(const Vector4& v) const {
    SimdFloat sf;
    quad128 p = _mm_mul_ps(quad, v.quad); // p[0] = (x * v.x), p[1] = (y * v.y), p[2] = (z * v.z)
    quad128 sumxy = _mm_add_ss(_mm_shuffle_ps(p, p, _MM_SHUFFLE(1, 1, 1, 1)), p); // check reference on shuffling
    quad128 sumxyz = _mm_add_ss(_mm_shuffle_ps(sumxy, sumxy, _MM_SHUFFLE(2, 2, 2, 2)), sumxy);
    sf.quad = sumxyz;
    return sf;
}


Don't say I didn't warn you :P.  Since writing this by hand every time can be quite error prone, most implementations use the vector4 SIMD type, or its equivalent as frequently as possible.  If the functions are inline anyways you will usually be just fine.

On the topic of code clarity, some implementations actually forbid operator overloads.  Before doing this, I recommend making sure you really need to.  It's true that the operators need to return a result, and this has a cost.  Successful SIMD libraries, like the ones used in Havok, do forbid them and justify it as being speed critical: default construction doesn't set a zero vector, and the assignment operator, while provided, doesn't return a reference either.  This does indeed avoid computational costs, but it comes at the cost of clarity.  I highly recommend providing both operators and the equivalent functions; that way code that really needs it can avoid the operators, and everything else can stay as clear as possible.  Consider, for example, a calculation of a LERP value:

SimdFloat t; // assume its value is between 0 and 1 so we don't need to clamp it
Vector4 lerp = a + ((b - a) * t);


And again without operator overloading:

SimdFloat t;
Vector4 lerp;
lerp.setSub(b, a);
lerp.mul(t);
lerp.add(a);


Again, it's all about finding the right balance with your team.  Expect resistance if you want to forbid operator overloading.  And remember, you can always write a comment directly above the computations showing the equivalent expression had operator overloading been available.


Portability: Math libraries have the wonderful combination of being incredibly speed critical, pervasive, and in need of portability.  Some platforms don't support these SIMD types at all; others differ in their implementation (e.g. AltiVec and SSE2).  This makes the development of a common interface considerably challenging.  Read up on certain articles and consider referring to the interface used in Havok by downloading their free trial (even if you have no interest in using their Physics/Animation libraries).

Wednesday, February 29, 2012

Joining Card Kingdom

I'm pleased to be joining the Card Kingdom team in a few short weeks.  Card Kingdom is a graduate level capstone project which has been in development by some very talented designers, developers, and artists for nearly a year.  I will be focusing on gameplay code, using an existing code base, and doing what I can to help get a polished game out the door.  For financial reasons, I won't be attending GDC, but there should be a booth with the latest build on display.  Here's a video of (an admittedly older) build that should give you a good feel for the game.

 

Expect updates with my contributions to be detailed here after they are completed.  You can always follow major team updates on the project's main development blog.  The work these guys have done is certainly deserving of your attention!

Friday, January 13, 2012

Class Is Always In Session

I recently released a minor update to the CDS Library.  I fixed some bugs and provided more explicit casting.  The major incentive for these changes was compatibility with C++.  CDS is still very much grounded in ANSI C, which is deliberate for compatibility reasons, so it doesn't take advantage of C++ specific features (such as templates).  However, in terms of compatibility, I'm proud to have made a data structures library that almost any C/C++ compiler can build. 

Unfortunately I'm spread very thin so my time for side projects such as these is extremely limited.  I was hoping to get more experience writing cache conscious code, and perhaps experiment with some cache oblivious algorithms too.  

Consistently exploiting cache optimized code is something I'm more mindful of, but admittedly need to improve on.  It would be nice to explore this more some other time, perhaps as an extension to this project.

I'm also working on a game project for school where I'm deliberately trying to tackle Component-Entity Models and get more experience with SIMD math and intrinsics.  I'm admittedly sticking relatively close to the framework used in Unity for now.  In another project I might look into more data driven representations that use component aggregation a bit more purely.  I started a thread on GameDev which discusses implementing lighter RTTI and ended up exploring ways to avoid the "everything is a game object" system we see so frequently in these types of frameworks.  While this model admittedly feels excessive for some simple things, it does have its benefits.  When certain operations are best applied to logical game objects, iterating over every component or deleting them is much easier if you have one central game object.  Communication between connected components also becomes trivial.
 
Currently the big question at hand is message passing.  A hash table of function names and their addresses sounds reasonable.  But is passing a void pointer array the best way to pass arguments?  Do "returns" from these functions need to be put back into arguments of this array?  What about like-named methods in different components?  Whatever is chosen, it will have significant pros and cons that weren't very clear when we decided to go with components in the first place.  In practice, we're finding the component model actually hides a significant amount of complexity from the user.  I'm not saying inheritance trees are better, merely that Component-Entity models, like all game object models, are going to feel a bit quirky from time to time.

Speaking of quirky, intrinsics will definitely feel that way the first time you implement them.  Forced alignment, unclear macros, pipeline flushes, and bit masks?  Your standard vector/matrix types are no longer so routine.  And don't even get me started on how much of a pain it is to write this in a cross-platform manner; AltiVec's SIMD functions are just different enough to warrant typedefs everywhere.  However, as an engine developer, intrinsics are definitely pretty cool.  They are blazing fast (if used appropriately) and make old problems new again.  That quaternion, bounding sphere, AABB, and float packed color structure are all "new" again.  Things like cosine functions and sorting algorithms also present exciting new ways to conquer some old problems.  It takes a specific kind of developer to get truly excited about rewriting these sorts of facilities.  However, those that do should definitely look into intrinsics if they haven't already.

My third recent side project involved diving into Lua.  I work with PHP in my job, so interpreted languages are not completely foreign to me.  While Lua is considered easy to learn, its syntax does not resemble C, and certain ways in which Lua is used (table lookups, metatables, etc.) can actually get rather involved.  Nonetheless, it's a fun language to use, it was made by some very smart people, and it never feels like it's trying to be something it's not.  Lua knows it's a scripting language, and as such, is more readily adopted as one compared to languages that see this as some sort of insult.


I'd like to build an event system or somehow provide script level access to my component entity model.  However, it is still in development and not exactly a rudimentary task.  So I kept the scope small and just made a simple math library.  If nothing else, it provided a nice introduction to the language.

Other current events:
I learn a lot from reading other people's code.  So while a lot of people in the games industry despise Boost and Template Meta-programming in general, it doesn't hurt to familiarize myself with the subject matter.  I did the same with design patterns a couple years ago.  The possible misuse of a technique is not a good excuse to evade learning it altogether.  Is a singleton a monstrosity?  Does overuse of templates bloat and complicate code?  Yes and yes, but every implementation has its share of pros and cons.

I got Jesse Schell's excellent book from the library over the break.  I may not see myself as a designer, but it never hurts to have some sensible design ideas floating around in your head.  And even in the absence of great ideas, it improves communication skills with talented game designers.

I also got Volume 1 of Knuth's book for Christmas.  I simply don't have the time to muscle through something like this right now, but it's nice knowing that if I want to challenge myself, I merely need to reach over to my bookshelf.  It's gonna take a while to get through this one.