1. 2. 3. 4. 5. 6. 7. 8. 9. 10. 11. 12. 13. 14.

Pages

Saturday, August 18, 2012

Gameplay Architecture Part 1: Component Entity Models in C++


Why Component Entity Models?
Rationale
In the interest of re-usability, separation of concerns, and data driven needs, developers are starting to move away from maintenance prone class hierarchies and towards Component Entity Models.  Conceptually, such models are simple to understand.  Functionality is split into components and game entities are defined by a composition of these components.  So at its core, entities themselves are just a container of their components or an identifier that allows its components to relate and communicate with eachother.

There are a number of different ways to go about implementing such a model, and a number of things about C++ can make certain aspects of the implementation rather tricky.  Be forewarned, that the implementation I'm presenting does use templates and does stick rather closely to the model used in Unity.  While a bit tricky, the implementation here is not overly complicated, and has considerable benefits once it is set up.  Users familiar with the script interface in Unity can speak of how intuitive it is to use.

Establishing Relationships Between Components
In more radical models, the idea of an entity is removed completely in favor of pure aggregation.  Although not without its benefits, in my opinion this makes things unnecessarily complicated.  In addition, forcing a component to belong to one and only Entity solves a lot of problems.  So in the interest of keeping things simple that is what I will be presenting here.  A one to many relationship between entities exist, and each component keeps a pointer to the entity.  This makes sending messages and modifying components significantly easier for a new component.  For example, a camera tracking behavior will know to search for and operate on camera components attached to the same entity.  Likewise, a health component will know which components to send a message on something like a TAKE_DAMAGE event.

Accessing Components
At its core an Entity is a container of components.  We don't want any game logic to seep into the entities themselves, but instead want funcionality to stay decoupled and seperated into individual components.  Therefore, we need some way to access components.  The implementation I am showing here presents a few including:

- By Type (using templates for type safety)
- By A Unique Instance Id (given by a static instance counter)
- By Name (using a hashed string id for faster comparisons, but still provides a system where designers can name parts of their level)

Why Names/Name Id's?
It is very handy for designers to identify things in the world by name, save such things to a file, and have them work the same in each build.  Hashed string id's minimize the cost of comparisons, and can still be accessed in debug builds that make use of a giant string table.  This is a godsend for debugging compared to some random enum type id that could easily be reordered when merged!

Why Instance Id's?
A GUID can be quite handy if you need a unique handle and your scriping language doesn't have pointers.  In certain cases, name ids may not cut it since designers will keep giving objects the same name.

Providing Your Own RTTI
Identifying objects by their runtime type is a bit trickier.  We could punt and just use dynamic_cast while iterating and rely on the non NULL result to indicate a valid cast.  This will work and is more reliable than a typeid that won't respect class hierarchies.  However, enabling RTTI is generally considered undesirable because it must be implemented everywhere in your program.

Observe, given a type id we have rolled ourselves and the use of a virtual function like the one below we can determine if an object is derived from a given type:
bool virtual isDerivedFromType(u32 typeId) {
    return (TYPE_ID == typeId) ? true : PARENT::isDerivedFromType(typeId); 
}


This will continue until it finds a matching type id or reaches the root class in the hierarchy.  In the root of the hierarchy our logic would have to be simplified to compare its object id and not go to any parent class (since it doesn't exist).

bool virtual isDerivedFromType(u32 typeId) {
    return (TYPE_ID == typId); // no parent  
}  


Hashed String Id's > Error Prone Enums
But how do we provide a unique type id?  We could roll an enum of all possible types, but enums do not scale well at all.  A better implementation, that would put less of a burden on its users, would use the hashed string id of the class name.  Since we need a unique isDerivedFrom() function anyways, this is a perfectly justifiable use of the preprocessor.  A good hash function like MD5 or CRC32 should let you use this simple pattern without any conflicts.

// goes in the public declaration of every component


#define DECLARE_COMPONENT(COMPONENT_CLASS, PARENT_COMPONENT_CLASS) \
static const U32 COMPONENT_TYPE_ID; \
static const char * COMPONENT_TYPE_NAME; \
virtual bool isDerivedFromComponent(U32 componentTypeId) const { \
    return (COMPONENT_TYPE_ID == componentTypeId) ? true : PARENT_COMPONENT_CLASS::isDerivedFromComponent(componentTypeId); \
}


// Goes in the cpp of every component
#define DEFINE_COMPONENT(COMPONENT_CLASS) \
const U32 COMPONENT_CLASS::COMPONENT_TYPE_ID = crc32(#COMPONENT_CLASS); \
const char * COMPONENT_CLASS::COMPONENT_TYPE_NAME = #COMPONENT_CLASS
So at this point, a minimalist implementation could have its logic work like so when searching for the first component of a given type:
 
template <typename COMPONENT_TYPE> COMPONENT_TYPE * getComponent() {
    foreach (component in componentList) {
        if (component->isDerivedFrom(COMPONENT_TYPE::GetComponentTypeId())) {
            return static_cast<COMPONENT_TYPE *>(component);
        }
    }
}

Storing Components
Data Structures
Part of the entity interface
There are a number of ways we could potentially store components.  The first solution you might go to would be a dynamic array (e.g. std::vector) of component pointers.  After some thought, I have created a very specialized structure that uses handles and takes advantage of  template specialization.  To keep things simple I will instead present a system that just uses an intrusive linked list.  Since components should only belong to one list anyways this isn't a terrible choice, but in general if you want something that is concurrent and cache friendly you will probably want to use something else.

class Component {
    // the RTTI type id and name are stored statically
    Entity *m_entity;         // component container entity
    Component *m_next;        // intrusive next pointer
    Component *m_prev;        // intrusive prev pointer
    U32 m_nameId;             // hashed name id
    U32 m_instanceId;         // unique instance id
    ComponentEntityFlags m_componentEntityFlags;  // flags like enabled/disabled, etc etc
};

class Entity {
    Transform m_transform;     // head of the component linked list
    Component *m_tail;         // tail of intrusive component linked list
    U32 m_layers;              // bitwise layers
    U32 m_nameId;              // hashed name id
    U32 m_instanceId;          // unique object instance id
    ComponentEntityFlags m_componentEntityFlags; // entity flags
};
Therefore, we can get a component by type using the following code:
template <typename COMPONENT>
const COMPONENT * jlEntity::getComponent() const {
    const COMPONENT *ptr = NULL;
    for (const COMPONENT * cur = getComponents(); cur != NULL; cur = cur->getNext()) {
        if (cur->isDerivedFromComponent(COMPONENT::GetComponentTypeId())) {
            ptr = static_cast<const COMPONENT *>(cur);
            break;
        }
    }
    return ptr;
}




Getting multiple components by type also becomes trivial:

template <typename COMPONENT>
U32 jlEntity::getComponents(const COMPONENT *componentList[], U32 maxToAdd) const {
    U32 n = 0;
    for (const jlComponent *cur = getComponents(); cur && n < maxToAdd; cur = cur->getNext()) {
       if ( cur->isDerivedFromComponent(COMPONENT::GetComponentTypeId())) {
           componentList[n++] = static_cast<const COMPONENT *>(cur);
        }
    }    
    return n;
}

It could also be expanded using the bitwise filtering I posted about previously here.  That way you can search for all enabled or disabled components of a given type.

Scenegraph Integration
Operations can be expanded to trees of entities/components
Parent-Child Relationships
One of the great things about Unity is that its parent child relationships carry over to the object model itself.  This lets designers and programmers define objects in ways that are very beneficial and still have ways for high level code to intuitively communicate with related data.  For example, if the wheel of a car needs to inherit rotation and translation, why not take this relationship a bit further?  If we have links to the scene nodes, and they have links back to their entities, we can perform operations on components and entities while traversing down the tree.  This not only means we can search for components in a subtree, but also send messages down them (or up them).  Therefore, if the car gets an important event, it can quickly and easily send it down to the wheels

A Consistent Message Passing System
If functionality is split into components, than a virtual function for sending and receiving messages is easy to implement.  If we have scene graph integration we can not only send messages to every component attached the given entity, but to every component in the entity subtree.  This can be rather slow, but extremely useful when you need it.  It is also very flexible for expansion (e.g. a component isn't doing anything with a message now, but could later without changing any other code),

As I will explore in another post, the nitty gritty details of implementing such a system is actually rather involved, especially if you want to queue messages with different data efficiently.  Hopefully this will get posted within a few weeks.

Performance Considerations/Concerns
Speeding Up Component Retrieval
At first glance the above might look a bit scary.  Sure it uses little memory, but a linear search that isn't cache coherent is scary.  Even with a dynamic array, we are still ultimately dealing with pointers and indirection anyways.  Also, since components do many different things, they will be allocated in many different ways.  This is one of the downsides of component entity models in general, but all is not lost.  Here are some of your options:

Exploiting Coupling
If every entity must have a transform this can serve as the head in your linked list.  If the Transform is really just an id into a procedural interface, you may be able to get away with storing the component itself inside the entity.  

Cache Frequently Used Components
In Unity, frequently used components are stored as members with separate references.  When combined with template specialization this can make operations on a type extremely fast.  For example, if we store a pointer to the transform our getComponent overload could resemble the following:


template <>
FORCE_INLINE const Transform * Entity::getComponent<Transform>() const {
    return m_transform;
} 

However, this comes at the price of extra space in every entity.  In addition, C++ templates do not respect class hierarchies, so if I make a subclass of Transform, I need to make specializations for every operation on that type (addComponent, removeComponent, etc etc).  This is a considerable pain and maintenance overhead.

Minimize Components To A Light Interface
Try to avoid providing Update, Render, and a bunch of other virtual functions for everything unless you really hate I-CACHE efficiency.  Keep things light, make components nothing more than a way to access existing information if that is all you need.  Avoid storing the actual data for components inside the interface if it presents too much of an overhead.  If you have an existing scene graph that uses some crazy struct of arrays scheme, make the Transform interface store nothing more than a handle into those arrays.  Components do not have to present crippling changes to every piece of code.  You can still keep your existing procedural implementations if you want.  Components need only be an interface.

Allocate Entities Into A Pool
Since entities should not have subclasses and all share the same size, they are a perfect candidate for pool allocation.  We cannot make the same assumptions with components, since they will obviously have subclasses of varying sizes.  If the range isn't significant, you can allocate blocks capable of storing the largest component, or use separate pools for components of different sizes.

Questions?

Thursday, July 26, 2012

Simple Inclusive/Exclusive Bitwise Filtering


It's common in games to use bitfields to store numerous flags efficiently.  Not only are booleans incapable of holding more than one flag, but they do so while taking up at least as much space as an unsigned char and sometimes as much space as a word (which has implications for space and alignment).  It is much more efficient to store them into one variable and do simple bitwise operations to check if a flag is set.  Just as a simple test case, lets say we have some flags for some generic entity in our game:

typedef unsigned int EntityFlags;
const EntityFlags EF_ENABLED = (1 << 0);
const EntityFlags EF_IS_DEAD = (1 << 1);
const EntityFlags EF_IS_PLAYER = (1 << 2);
const EntityFlags EF_IS_IN_VEHICLE = (1 << 3);
const EntityFlags EF_IS_STUNNED = (1 << 4);


So now we have a number of flags we can efficiently pack into one variable and pass to functions as one variable.  This is how OpenGL handles almost all of its options, which is a good thing, particularly for compatibility reasons.  For example, instead of some function with an inflexible signature:

int processState(bool isDead, bool isPlayer, bool isInVehicle);
// called later
int ret = processState(true, true, true);
We instead have something more flexible to change, and with clearer calling code (in my opinion):

int processState(EntityFlags entityFlags);
// called later
int ret = processState(EF_IS_DEAD | EF_IS_PLAYER | EF_IS_IN_VEHICLE);

Moving on from our simple review, lets take a look at a generic way to expand upon this knowledge.  Lets say we have an input list and we need to filter it into another list given its flags.  So our function looks something like this, with Entity being some type that at least has something resembling a getFlags() function that returns its EntityFlags from above.

int filterList(Entity * RESTRICT output[], const Entity * RESTRICT input[], int N, EntityFlags searchFlags) {
    int outIdx = 0;
    for (int inputIdx = 0; inputIdx < N; ++inputIdx) {
        if (filter(input[inputIdx]->getFlags(), searchFlags)) {
            output[outIdx++] = input[inputIdx];
        }
    }
    return outIdx;
}


So what goes into out filter function that differentiates filterList() from your standard list copy?  If we are doing an inclusive search, whereby any flags shared in common we just need to AND the bits together and see if the result is non-zero:
int filterAnyBitsAreSet(EntityFlags flagsA, EntityFlags flagsB) {
    return (flagsA & flagsB); // search filters and object share at least one flag in common
}

The algorithm for checking if all the bits in flagsB are set in flagsA is also very simple, whereby we take an additional step and check for equality:
int filterAllBitsAreSet(EntityFlags flagsA, EntityFlags flagsB) {
    return ((flagsA & flagsB) == flagsB); // all flags in flagsB must be set in flagsA
}
All of this is pretty trivial.  However, wouldn't it be nice if we could pass an additional parameter to filterList which let us perform inclusive and exclusive searches.  E.G. at one point I want all entities in the list that are both a player and in a vehicle.  And at another time I want all the entities that are either stunned and/or dead.  Our filter function could take an additional flag which determines which processing to use:


int filter(EntityFlags flagsA, EntityFlags flagsB, bool searchInclusive) {
    if (searchInclusive) {
        return (flagsA & flagsB);
    } else { // exclusive all bits must be set
        return ((flagsA & flagsB) == flagsB);
    }
}

This will work.  The searchInclusive flag could percolate up to our filterList() function and we could optionally search all flags in either manner.  However, we are left with these annoying branches, which add up when processing a large list.  How can we make this branchless?

Lets take a look at our previous conditions a little more closely.  In both instances, some bits need to be shared in common.  As such, (flagsA & flagsB) will have to be in our test in some way.  Next, we are going to reverse the logic a bit of our all bits are set conditions.  The XOR operation returns the difference in bits.  If all bits in flagsB must be set in flagsA, the XOR difference should be 0000 for all of the relevant bits (since they should be the same).  Therefore, if all bits must be set the following condition must be met ((flagsA ^ flagsB) & flagsB == 0).  Use a pencil and paper if it helps you reason it out.  So now we have two valid conditions, but they won't work together in one bitwise test.  Notice the second half of our condition must equal zero to be valid.  If we use another mask to zero out the expression in the case of an inclusive test we will be golden.  Using one of the two constants below as our mask, we can perform the test without branches like so:

const int ANY_ARE_SET = 0;
const int ALL_ARE_SET  = 0xFFFFFFFF;

int filter(EntityFlags flagsA, EntityFlags flagsB, int mask) {
   return ((flagsA & flagsB) && (((flagsA ^ flagsB) & (flagsB & mask)) == 0));
}

// Or alternatively like...

int filter2(EntityFlags flagsA, EntityFlags flagsB, int mask) {
   return ((flagsA & flagsB) && (((flagsA & flagsB) & (flagsB & mask)) == (flagsB & mask)));
}


Now our list copy can filter results without forcing the calling code to perform a second pass and without excessive branches.  And this simple concept is easily applicable to anything using bitfields.  Now, as explained above I can filter lists of objects with bitfields quite easily and efficiently using the same function.

Entity *downedEntities[MAX_ENTITY_LIST_SIZE];
int sz = filterList(downedEntities, g_EntityList, g_numEntities, EF_IS_DEAD | EF_IS_STUNNED, ANY_IS_SET);
for (int i = 0; i < sz; ++i) {
    // processing for enemies that are stunned and/or dead
}

Entity *playersInVehicles[MAX_ENTITY_LIST_SIZE];
int sz = filterList(playersInVehicles, g_EntityList, g_numEntities, EF_IS_PLAYER | EF_IS_IN_VEHICLE, ALL_ARE_SET);
for (int i = 0; i < sz; ++i) {
    // processing for entities which are players and currently in vehicles
}


Simple but powerful.

Tuesday, May 29, 2012

JL Math

I've recently tuned and completed the SIMD math library I'm tentatively calling JL Math (named after my initials).  It is still a bit rough around the edges (in terms of compatibility), but will work with SSE2 and FPU seamlessly after one change in the preprocessor (JL_SIMD_ENABLED=1 or 0) using Microsoft Visual Studio.  True G++ support is still pending and untested, but most of the building blocks are there.

The library is intentionally pretty minimalist to focus on SIMD wrappers and instructions.  So things like logging are just done with the standard library, and aligned memory functions are achieved with existing implementations (e.g. _aligned_malloc).  On consoles, home rolled implementations would almost certainly be used to avoid the context switch. 

A few things to take from this experiment:
- SSE2 instructions are considerably faster on large sets of data than comparable FPU instructions.
- Alignment can be a pain, and has considerable consequences to the design of code elsewhere.  If you want to develop for consoles, its best to get used to this sooner rather than later.
- The same applies for an interface which supports SSE2, FPU, Altivec, etc.  They are a pain to implement.
- Reordering instructions to prevent stalls can make a huge difference
- Restricted Pointers and an eye for Load Hit Stores can make a huge difference
- Removing Branches Can Make a huge difference
- A struct of arrays approach can be more performant than an array of structs for large structs that exceed the size of a cache line
- More experience with SFMT, static assertions, and Intel's Vtune

Preventing Stalls
Preventing stalls, in terms of performance, is very important in a speed critical library such as this.  One thing I found myself doing was taking advantage of the SimdFloat wrapper to minimize them as much as possible.  Since you have to use them to prevent pipeline flushes anyways, you might as well reorder them to minimize stalls as well.  This is one of the easiest optimizations to make, since it only involves moving the same instructions around and spreading dependent data out so the processor stays busy.

bool32 testRayTriangle(const jlVector4& origin, const jlVector4& dir, const jlVector4& a, const jlVector4& b, const jlVector4& c) {
    jlVector4 pn = jlVector4::Cross(b - a, c - a);
    pn.normalize3();
    jlSimdFloat pd = jlVector4::Dot3(pn, a); // stalls waiting on the normalize to finish
    jlSimdFloat t = (pd - jlVector4::Dot3(origin, pn)) / jlVector4::Dot3(dir, pn);
    jlSimdFloat zero = jlSimdFloat(0.0f); 
    if (t <= zero) return 0;
    jlVector4 pt = origin + dir * t;
    jlSimdFloat u, v, w;
    barycentric(a, b, c, pt, u, v, w);
    jlSimdFloat one = jlSimdFloat(1.0f);
    return (v >= zero && w >= zero && (v + w) <= one);
}

Since the normalize3() function takes the longest, we can construct the SIMD float types after as an easy way to minimize stalls.
bool32 testRayTriangle(const jlVector4& origin, const jlVector4& dir, const jlVector4& a, const jlVector4& b, const jlVector4& c) {
    jlVector4 pn = jlVector4::Cross(b - a, c - a);
    pn.normalize3();
    jlSimdFloat zero = jlSimdFloat(0.0f); // these simple constructions keep the processor busy 
    jlSimdFloat one = jlSimdFloat(1.0f);  // while waiting on the normalization operation to finish
    jlSimdFloat pd = jlVector4::Dot3(pn, a);
    jlSimdFloat t = (pd - jlVector4::Dot3(origin, pn)) / jlVector4::Dot3(dir, pn);
    if (t <= zero) return 0;
    jlVector4 pt = origin + dir * t;
    jlSimdFloat u, v, w;
    barycentric(a, b, c, pt, u, v, w);
    return (v >= zero && w >= zero && (v + w) <= one);
}

Removing Branches
Here's a typical AABB intersection test:
bool32 testAABBWithBranches(const jlAABB& boxA, const jlAABB& boxB) {
    if (boxA.max(0) < boxB.min(0) || boxA.min(0) > boxB.max(0)) return 0;
    if (boxA.max(1) < boxB.min(1) || boxA.min(1) > boxB.max(1)) return 0;
    if (boxA.max(2) < boxB.min(2) || boxA.min(2) > boxB.max(2)) return 0;
    return 1;
} 

 
On a modern CPU, branch misprediction and short circuiting is generally a bad thing. By removing the branches, instructions can be queued and executed in a more predictable manner.  These two algorithms are logically the same, where we still check for a seperation on all three cartesian axis.  However, the version without branches does all of these tests in parallel and uses bitwise operations to merge the comparisons into one boolean result.

bool32 testAABBWithoutBranches(const jlAABB& boxA, const jlAABB& boxB) {
    jlComp comp;
    jlComp aMinCompBMax = boxA.min.compLess(b.max);
    jlComp bMinCompAMax = boxB.min.compLess(a.max);
    comp.setAnd(aMinCompBMax, bMinCompAMax);
    return comp.allAreSet(jlComp::MASK_XYZ);
}

SFMT
Random Number Generators are interesting to me, but can be notoriously hard to understand due to excessive use of magic constants and seemingly arbitrary bit shifts (frequently seen in hash functions too).  In my old GXT project, I used an existing implementation of XORShift I found here.  This experimental project seemed like a good excuse to explore Mersenne Twister and its SIMD optimized variant: SFMT.  I adapted the code I found to work with my existing classes and support SSE and FPU seamlessly.  The result is a considerable speed boost, but at the cost of extra memory, as it relies on a previously allocated buffer.

The restrict keyword
I'm still a bit new to the concept of the __restrict keyword.  I did, however, notice a lot of engine devs talking about its importance on consoles and elsewhere in addressing the pointer aliasing problem (which also extends to references in C++).

restrict is a qualifier designed to address the pointer aliasing problem.  When used as qualifier on a member function itself, it will apply to the "this" pointer.  A number of articles have been written about restrict and the way it addresses the Load-Hit-Store problem.  The way I like to think about restrict is that it is a promise the programmer makes to the compiler, that the pointer will not be referenced by any other pointers in that scope.

Three things to note with the restrict keyword absent of the way it works:
1) restrict is nonstandard and in vc++ it is __restrict and in g++ it is __restrict__.  So this is an appropriate time to use the preprocessor.
2) Restrict in Visual C++ only applies to pointers, not references.  Thankfully g++ supports restricted references, but this disparity is a pain for those seeking portable code.  In Visual C++ restricted references are simply ignored, but this will accumulate numerous compiler warnings in a project of any decent size.  This limits portable restricted addresses to restricted pointers.
3) Restrict matters to the calling code.  Seriously, put restrict qualifiers in the header/prototype, even if the compiler will let you get away without doing it.  Let's take an example where restricted pointers make sense as a design/code clarity hint as much as it does a performance centered measure.
Take barycentric coordinates, where we want to calculate a weighted average and put them into u, v, and w:

void barycentric(const jlVector4& a, const jlVector4& b, const jlVector4& c, const jlVector4& pt, jlSimdFloat* u, jlSimdFloat* v, jlSimdFloat* w);

It effectively defeats the purpose if u, v, and w reference the same variable.  What we know about the weighted average is effectively lost and all variables will hold the value of the last stored value if they all reference the same variable.  As such, it is entirely appropriate in such instances to change the function signature to:

void barycentric(const jlVector4& a, const jlVector4& b, const jlVector4& c, const jlVector4& pt, jlSimdFloat* JL_RESTRICT u, jlSimdFloat* JL_RESTRICT v, jlSimdFloat* JL_RESTRICT w);

TestSSESSE With Restrict
Barycentric Test 0.1710.167

For this project, I made much more extensive use of Intel's VTune than I have in the past.  Here are some of the results I found after running numerous benchmarks on my machine.  In each test, I would generally produce upwards of a million randomly generated matrices, vectors, rays, etc. and see how the two compared. 


Struct of Arrays vs. Array of Structs
I also tried a few new things this time around, like the difference between Struct of Arrays and (the more frequently used) Array of Structs.  This is admittedly an element that truly excels in multithreaded environments.  An effective comparison is made in Game Engine Architecture:

// Array of Structs, easier to understand,
// especially in the OOP world.  Logically
// related data is placed contiguously in memory.
struct GameObject {
    uint32 id;
    Vector3 pos;
    Quaternion rot;
    float health;
};


// In the Struct of Arrays approach, typed data is placed contiguously
struct GameObjects {
    uint32 ids[MAX_GAME_OBJECTS];
    Vector3 positions[MAX_GAME_OBJECTS];
    Quaternion rotations[MAX_GAME_OBJECTS];
    float healthPoints[MAX_GAME_OBJECTS];
};

A 64-byte line size means most loops will be unrolled
4 Vector4's at a time
The idea is to have homogenous data together in memory to maximize data cache hits and minimize misses.  This is particularly concerning on consoles where the cost of these misses is thousands of instructions.  A good article, which explores this (and more) as they apply to scene-graphs can be found here.  Unrolling these loops that process these large arrays to the size of a data cache line can result in more benefits.

You can download and use CPU-Z to find properties of your CPU's cache and/or use a function like these in code to get the size of a d-cache line.


AoS ParticlesSoA Particles
5.2645.063




More Tests
For good measure, a few more tests profiled using VTune.  Feel free to run some yourself with hot spot analysis to try things out yourself.

VTune's "Compare" functionality between SSE2 and FPU


TestSSEFPU
Vector/Matrix Multiply Test0.52.0
Normalization Test0.1380.4
Ray Triangle Test0.0170.039








Monday, March 26, 2012

Currently Under Construction

Things have been a bit quiet here but rest assured I've been occupied with a number of things, some of which will have source code up here in less than a week.

Card Kingdom
The Card Kingdom project is a lot of fun and has a lot of potential to get better.  I've been working on gameplay code, camera systems, and helping with frustum culling.  I may very well be working on A.I. systems as well (which I look forward to, especially for the final boss).

JL Math
I wrote a post on SIMD processing but I want to prove I've truly dived into it.  Making an interface that works well with SSE2 and FPU is taking a bit longer than I originally expected, but should see something of an alpha release soon.  The math library, JL Math, is named after my initials because it's an individual project and I didn't have a witty name for it.  Ideally the interface could be expanded to work with Altivec sometime down the road too.  Right now it supports Vector2, Vector4, Matrix4, and Quaternions.

FMOD and Component Entity Models
I've been experimenting with the excellent FMOD library and trying to find ways to cleanly integrate the system into a component based engine.  One of the excellent things about FMOD is that you can often get most of the functionality you want with just a few function calls.  Under the hood this industry standard library does an awful lot of things for you (which are of course configurable).

Concurrency
In my ongoing efforts to increase my experience with multithreaded architectures I'm hoping to fit in one more non-school side project before I graduate.  So I bought Anthony Williams' excellent book, C++ Concurrency in Action and am auditing a Operating Systems course to get the concepts.  It'll take some careful planning to decide how I want to implement this knowledge, but I'm currently hoping to do something with particles that builds upon the SIMD math library in an effort to do everything in parallel.

Expect to hear more on all of these developments in the near future.

Sunday, March 11, 2012

Some Assembly Required

I recently took a class in Computer Organization. A number of topics pertinent to game programmers were discussed in the course including:

- Assembly Instructions
- The Memory Hierarchy
- Pipelining/Parallelism

Like most classes which explore assembly code the instruction set was MIPS. MIPS is generally considered the best assembly language to learn first, at least before x86. As someone who enjoys tackling low level problems, getting a better understanding of how C/C++ code is actually translated into machine instructions was very interesting to me.

One of the projects for the class was writing Connect 4 entirely in assembly.

This was surprisingly informative and fun to write.  Since I hope to explore asm { } inside C++ and x86 in the near future this was a good first step in that direction.  You can download the source code here.  If you have access to a RSIM on your machine, you will almost certainly need to adjust the paths in the makefile to the appropriate directories.

Friday, March 2, 2012

The Challenges In Building A SIMD Optimized Math Library

What is SIMD?
SIMD stands for Single Instruction Multiple Data, and does exactly what it says it does.  By packing a couple of integral/floating point types together, a considerable amount of arithmetic and logical computations can be done at once.


Why Should You Care?
These days, everything is designed for parallelism.  The beauty of intrinsic functions is that they exploit such parallelism at the instruction level.  With entirely separate registers and instructions, CPUs can additionally compute the same operation 3 more times at little additional cost.  Game Engines have been using intrinsics for a while now.  I noticed while browsing the Doom 3 source code, which was released in 2004, that they had full support for SIMD intrinsics.  Even if you don't care to write high performance code you're only kidding yourself if you think this technology is going away.  In all likelihood, in the industry you're going to have to deal with intrinsics, even if they've been wrapped in classes that are provided for you.  This isn't just an exercise in reinventing the wheel for the sake of doing it - its a reality of modern day game architecture we all have (or want) to use.

But I Don't Know Assembly
You don't have too.  Although a background in MIPS/x86 ASM definitely helps with understanding how SIMD math really ticks, functions are provided that do most of it for you.  This is a very convenient and more maintainable way to do things as opposed to resorting to assembly instructions you have to roll yourself. 

Specific Uses In Games
SIMD types can be used for a number of things, but since the __m128 data type holds 4 floating point values (which must be aligned on a 16 byte boundary), they are particularly good for data sets which can be treated the same.  Things like vectors (x, y, z, w), matrices (4 vectors), colors (r, g, b, a) and more perform very well and are easily adapted to SIMD types.  You can use intrinsics to represent 4 float values which may not all be treated homogeneously as well.  For example, a bounding sphere may use the last value to represent the radius of the sphere, but then you have to be careful not to treat that scalar the way you treat the position vector.  The same holds true for quaternions.  And with crazy functionality, such as a random number generator that internally uses intrinsics, the downside is that code becomes much more complex and difficult to understand.

Challenges/Advice
The Alignment Problem: SIMD types have to be aligned on a 16 byte boundary.  A number of game developers know that the default allocators in C++ make no guarantee, and as such, are used to overriding the new/delete operators for certain things that need them.  Console developers, which may very well use custom memory allocators for everything already, should have no problem making the adjustment (if one is needed at all).  Students and hobbyists on the PC might find their program crashing due to unaligned access and be left scratching their head.  To start, I recommend just using _align_malloc or posix_memalign to get things going.  Consider making a macro like DECLARE_ALIGNED_ALLOCATED_OBJ(ALIGNMENT) and using that everywhere you need a static overload, so it can eventually be changed later from one place.

Pipeline Flushes: This one is tricky, because your code will run fine, but might actually be slower than if you were still using standard floating point types!  You have to try very hard to avoid mixing floating point math and SIMD math or the pipeline will experience significant stalls.  If this sounds hard, it's because it is.  A lot of game developers are used to using floats everywhere for everything.  As a result, a large codebase is a huge, huge pain to refactor to accommodate for SIMD math in as many places as possible.

I recommend wrapping SIMD types into a class called something like SimdFloat, which, for all intents and purposes, acts just like a float.  However, internally it actually holds a four float value to avoid those costly pipeline flushes.  The implications of this are significant: now things like dot products, length squared functions, and others are actually returning quad data types.  This will take some getting used to.  You can help alleviate it by writing a conversion operator that converts to a regular float and back, but overuse carries the potential for abuse.  If the additional memory space is significant, consider creating a SIMD float on the stack as soon as possible and using it for the rest of the function.  


Code Clarity: This can be alleviated, but is ultimately going to take a hit somewhere.  SIMD math typically involves a lot more temporary values on the stack.  This will seem to avoid going directly "at the problem" at times.  For example, consider a typical, trivial computation of the dot product:

inline float32 dot3(const Vector4& v) const {
    return x*v.x + y*v.y + z*v.z; 
}


Now becomes something like:

inline SimdFloat dot3(const Vector4& v) const {
   SimdFloat sf;
   quad128 p = _mm_mul_ps(quad, v.quad); // p[0] = (x * v.x), p[1] = (y * v.y), p[2] = (z * v.z) 
   quad128 sumxy = _mm_add_ss(_mm_shuffle_ps(p, p, _MM_SHUFFLE(1,1,1,1)), p); // check reference on shuffling
    quad128 sumxyz = _mm_add_ss(_mm_shuffle_ps(sumxy, sumxy, _MM_SHUFFLE(2, 2, 2, 2)), sumxy);
   sf.quad = sumxyz;
   return sf;
}


Don't say I didn't warn you :P.  Since writing this by hand every time can be quite error prone, most implementations use the vector4 SIMD type, or its equivalent as frequently as possible.  If the functions are inline anyways you will usually be just fine.

On the topic of code clarity, some implementations actually forbid operator overloads.  Before doing this, I recommend making sure you really need to do it.  It's true, the operators need to return a reference, and this has a cost.  Successful SIMD libraries, like the ones used in Havok, do not allow this.  It is justified as being speed critical.  Default construction doesn't set a zero vector, and the assignment operator, while provided, doesn't return a reference either.  This does indeed avoid computational costs, but it comes at the cost of clarity.  I highly recommend providing both operators and the equivalent functions, that way things that really need it can avoid the operators and those that don't can stay as clear as possible.  Consider for example, a calculation of a LERP value:

SimdFloat t; // assume its value is between 0 and 1 so we don't need to clamp it
Vector4 lerp = a + ((b - a) * t);


And again without operator overloading:

SimdFloat t;
Vector4 lerp;
lerp.setSub(b, a);
lerp.mul(t);
lerp.add(a);


Again, it's all about finding the right balance with your team.  Expect resistance if you want to forbid operator overloading.  And remember, you can always write a comment specifying the equivalent if operator overloading was available right above the computations themselves.


Portability: Math libraries have the wonderful combination of being incredibly speed critical, pervasive, and in need of portability.  Some platforms don't support these SIMD types at all, others differ in their implementation (e.g. AltiVec and SSE2).  This makes the development of a common interface considerably challenging.  Read up on certain articles and consider referring to the one used in Havok by downloading their free trial (even if you have no interest in using their Physics/Animation) libraries.

Wednesday, February 29, 2012

Joining Card Kingdom

I'm pleased to be joining the Card Kingdom team in a few short weeks.  Card Kingdom is a graduate level capstone project which has been in development by some very talented designers, developers, and artists for nearly a year.  I will be focusing on gameplay code, using an existing code base, and doing what I can to help get a polished game out the door.  For financial reasons, I won't be attending GDC, but there should be a booth with the latest build on display.  Here's a video of (an admittedly older) build that should give you a good feel for the game.

 

Expect updates with my contributions to be detailed here after they are completed.  You can always follow major team updates on the project's main development blog.  The work these guys have done is certainly deserving of your attention!

Friday, January 13, 2012

Class Is Always In Session

I recently released a minor update to the CDS Library.  I fixed some bugs and provided more explicit casting.  The major incentive for these changes was compatibility with C++.  CDS is still very much grounded in ANSI C, which is deliberate for compatibility reasons, so it doesn't take advantage of C++ specific features (such as templates).  However, in terms of compatibility, I'm proud to have made a data structures library that almost any C/C++ compiler can build. 

Unfortunately I'm spread very thin so my time for side projects such as these is extremely limited.  I was hoping to get more experience writing cache conscious code, and perhaps experiment with some cache oblivious algorithms too.  

Consistently exploiting cache optimized code is  something I'm more mindful of, but admittedly need to improve on.  It would be nice to explore this more some other time, perhaps as an extension to this project.

I'm also working on a game project for school where I'm deliberately trying to tackle Component - Entity Models and get more experience with SIMD math and intrinsics.  I'm admittedly sticking relatively close to the framework used in Unity for now.  In another project I might look into more data driven representations that use component aggregation a bit more purely.  I started a thread on GameDev which discusses implementing lighter RTTI and ended up exploring ways to avoid the "everything is a game object" system we see so frequently in these types of frameworks.  While this model admittedly feels excessive for some simple things it does have its benefits.   When certain operations are best applied to logical game objects, iterating over every component or deleting them is much easier if you have one central game object.  Communication between connected components also becomes trivial.
 
Currently the big question at hand is message passing.  A hash table of function names and their addresses sounds reasonable.  But is passing a void pointer array the best way to pass arguments?  Do "returns" from these functions need to be put back into arguments of this array?  What about like method names in different components?  Whatever is chosen, it will have significant pros and cons that weren't very clear when we decided to go with components in the first place.  In practice, we're finding the component model actually hides a significant amount of abstraction from the user.  I'm not saying inheritance trees are better, merely that Component-Entity models, like all game object models, is going to feel a bit quirky from time to time. 

Speaking of quirky, intrinsics will definitely feel that way the first time you implement them.  Forced alignment, unclear macros, pipeline flushes, and bit masks?  Your standard vector/matrix types are no longer so routine.  And don't even get me started on how much of a pain it is to write this in a cross platform manner.  Altivec's SIMD functions are just different enough to warrant typedefs everywhere.  However, as an engine developer, intrinsics are definetly pretty cool.  They are blazing fast (if used appropriately) and make old problems new again.  That quaternion, bounding sphere, AABB, and float packed color structure are all "new" again.  Things like cosine functions and sorting algorithms also present exciting new ways to conquer some old problems.  It takes a specific kind of developer to get truly excited about rewriting these sorts of facilities.  However, for those that do, they should definitely look into intrinsics if they haven't already.

My third recent side project involved diving into Lua.  I work with PHP in my job so interpreted languages are not completely foreign to me.  While Lua is considered easy to learn its syntax does not resemble C and certain ways in which Lua is used (table lookup, metatables, etc) can actually get rather involved.  Nonetheless, it's a fun language to use, it was made by some very smart people, and it never feels like it's trying to be something its not.  Lua knows it's a scripting language, and as such, is more frequently adopted as one compared to languages that see this as some sort of insult.


I'd like to build an event system or somehow provide script level access to my component entity model.  However, it is still in development and not exactly a rudimentary task.  So I kept the scope small and just made a simple math library.  If nothing else, it provided a nice introduction to the language.







Other current events:
I learn a lot from reading other people's code.  So while a lot of people in the games industry despise Boost and Template Meta-programming in general, it doesn't hurt to familiarize myself with the subject matter.  I did the same with design patterns a couple years ago.  The possible misuse of a technique is not a good excuse to evade learning it altogether.  Is a singleton a monstrosity?  Does overuse of templates bloat and complicate code?  Yes and yes, but every implementation has its share of pros and cons.

I got Jesse Schell's excellent book from the library over the break.  I may not see myself as a designer, but it never hurts to have some sensible design ideas floating around in your head.  And even in the absence of great ideas, it improves communication skills with talented game designers.

I also got Volume 1 of Knuth's book for Christmas.  I simply don't have the time to muscle through something like this right now, but it's nice knowing that if I want to challenge myself, I merely need to reach over to my bookshelf.  It's gonna take a while to get through this one.

Sunday, December 25, 2011

Prim's Maze Algorithm in Minecraft

Overview
I previously gave a sneak peak of this mini project here.  For an A.I. class we were given an assignment that challenged us to generate random, procedurally generated worlds in MineCraft.  There was more to later versions of the project, but I'm stripping it down to just the terrain generation in the given source code for now.  I believe this is what Minecraft modders will find the most useful (compared to the other, more application specific code I wrote in the project).  Hopefully the source code is understandable enough that other modders will be able to plug this functionality into their own projects.  The handful of java files are well commented and is complemented by a description in this post.  Here's a video demo showing off the algorithm, created via a chat command in the game:



Randomized Prim's Algorithm
A random variant of Prim's Algorithm, is best viewed in this instance not as a graph or tree, but a grid of walls.  We start with a full grid of walls, open one cell in the maze, and then proceed to carve our way out.  We do this by adding all of its "out" neighbors to a "wall list".  While there are still walls in this list, we pick a random one.  If the cell opposite of the wall is not in the maze we make a passage and add its "out" neighbors from the wall list.  Otherwise, we remove it from the list, and repeat this process until the wall list is empty.

Using the Plugin
I'm assuming the users already know how to compile Bukkit plugins themselves.  If not, I've provided a working jar file which can be directly placed into the plugins/ directory of an existing CraftBukkit Server.  The given archive also provides the plugin.yml file, which you need when building for CraftBukkit to load it properly.  The /gen command has the following usage:


/gen [name] [mazeSize] [mazeWallHeight] [mazeSpawnZoneSize] [mazeBorderHeight] [autoTeleport] [iterations]
Default Name: OptimusPrim
Default Maze Size: Random (32, 64, 96, 128, 256, 512).  If you input one yourself it must be a non zero multiple of 16.
Default Maze Wall Height: 3
Default Maze Spawn Zone Size: 16
Default Maze Border Height: 16
Default Auto Teleport: true (the way java parses this is "true"/"false" not 0/1)
Default Iterations: -1 (-1 iterates until done, custom positive values iterate until a specified cutoff.  Used for dev testing only to show "stepwise" increments of the algorithm.)


If you want to reload an existing world you can use the /visit command.

/visit worldname

This is very handy if you don't want to autoteleport to a new world.  Terrain generation can be cumbersome for very large mazes and there may be instances where you want to create a world with the intention of exploring it later (e.g. for comparative purposes).  AutoTeleport is on by default for the /gen command, but if you type the command to the length of that argument and set it to false, you'll need to use the visit command afterwards to spawn in your newly generated world.

The Plugin/Source
The plugin is compatible with the latest version of Minecraft at the time of writing (v1.0).  If you want to rebuild the source you'll need to reference the proper version of Bukkit and setup your own server.  Please refer to their wiki if you need help setting this up.




Monday, December 19, 2011

Learning A Functional Language

Overview
Paradigms Matter -- a lot.  If all you've used is an imperative language I recommend giving a functional language a spin.  You probably won't use it on your next major project and have no issues continuing to use the languages you already know and love (C/C++/C#).  I'm a member of this camp myself and am by no means an evangelist for any particular paradigm.  However, I sincerely believe you'll be a better programmer if you force yourself to think outside the box of the imperative/OOP paradigms once in a while.  In general, the more paradigms you know, the more ways you'll have at your disposal to solve various problems.  Since languages like C++ and Python are multiparadigm, in many languages there's nothing stopping you from adapting a more functional style than the one you already use.

If nothing else, spending a week or two using Haskell or Lisp will make you more mindful of the inherent problems with mutable states. I've seen some severe hyperbole thrown around that this predictability removes the need for debugging altogether.  This is completely false, but in many instances functional languages involve more predictable routines which can decrease the time spent scratching your head in the debugger.  Allow me to explain:

Referential Transparency
Referential Transparency is a property of functions/expressions that could be replaced by the known return value and have no change in the program's execution. In laymans terms, functions have no side effects, and there is no mutable state or variables. The benefit of this style is a significant increase in predictability. Given a function, if I pass the same arguments to that function I will always get the same result. It won't matter at which point the function is called and will never be dependent on potentially corrupted data.

How many times have you had to debug a function in C++ that caused a crash when the problem was really propagated elsewhere?  If the answer is: way too many, perhaps you should consider writing more self encapsulating and fault tolerant functions.

Easily Implemented Parallelism
When there are no global variables or states which could be simultaneously accessed by multiple threads, a lot of the pains inherent from multithreaded programming can be avoided from the get-go.  Since everything is headed in this direction, it doesn't hurt to brush up on it for a change.

Caveats
A lot of comparable programs in functional languages might feel handicapped by this purity. For example, certain applications which need to represent state (e.g. a atm machine which lets users withdraw/deposit funds) will likely be more  complex if processed as a concatenation of self encapsulating functions.  In addition, responsible imperative programming may have very good reasons to cache global variables, which, in many cases, greatly increase maintainability and speed.  The lazy evaluation frequently implemented in functional languages also makes predicting time and space costs very difficult.

Potential
What's interesting about functional languages is that they could make for a decent scripting language.  In my experience using Haskell, there are numerous cases where you merely specify what you want or are looking for, rather than how to get it.  Unfortunately, the extensive unfamiliarity with this style means most of the 'rapid prototyping' benefits sought by these languages will be lost learning a new one.  In addition, virtually every game script is heavily reliant on object state.  That's why it goes into script in the first place.

I'm a few days into my adventures with Haskell and hope to produce another post with code and examples explaining how curried functions, lambdas, and folds can be very useful for any kind of programming.

Additional Reading
Real World Haskell (Online Book)
Learn You a Haskell (Online Tutorial)
Functional Programming In Games (PDF Thesis Paper)
Why Functional Programming Matters (PDF Explanation)
Why learning Haskell/Python makes you a worse programmer (counter argument)