One very subtle way to get a rather large performance increase would be to improve cache locality. It's unclear from the post, but I would guess that you're using an array of particles. At its most basic level, improving cache locality here means switching from that array of structures to a structure of arrays.
Given that each particle takes up 64 bytes, that is an entire cache line. I also find it unlikely that the particles are aligned on a 64 byte boundary, so each one actually straddles 2 cache lines. That's terrible for performance, especially since you're almost certainly manipulating them in a streamed fashion.
Along with reworking the layout of the data, you'll also want to rework how you iterate over it. For instance, if you currently have a single loop that walks all of the particles and touches every field, you'll instead want multiple loops, each working on only one piece of data at a time.
Example:
Code: Select all
#include <cstddef>

struct Particle final
{
    float x, y, vx, vy;
};

Particle* particles;
::std::size_t particleCount;

for(::std::size_t i = 0; i < particleCount; ++i)
{
    // One particle per iteration; every field is touched in a single pass.
    Particle& p = particles[i];
    p.x += p.vx;
    p.y += p.vy;
}
Would be converted to:
Code: Select all
#include <cstddef>

struct ParticleStreamed final
{
    ::std::size_t count;
    // Each array is assumed to be allocated on a 64 byte boundary.
    // Portable C++ can't express that on a pointer; it's a property
    // of the allocation itself.
    float* x;
    float* y;
    float* vx;
    float* vy;
};

ParticleStreamed particles;

for(::std::size_t i = 0; i < particles.count; ++i)
{
    particles.x[i] += particles.vx[i];
}
for(::std::size_t i = 0; i < particles.count; ++i)
{
    particles.y[i] += particles.vy[i];
}
This may seem odd, and at a casual glance it may even look slower, but it actually performs drastically better. Understanding why is easier with a graphic.
The first example's cache lines look a little bit like this (assuming good alignment, and 32 byte cache lines to keep the diagrams narrow; real x86 lines are 64 bytes).
Code: Select all
---------------------------------------------------------------------------------
|x|x|x|x|y|y|y|y|vx|vx|vx|vx|vy|vy|vy|vy|x|x|x|x|y|y|y|y|vx|vx|vx|vx|vy|vy|vy|vy|
Here we can see that a single cache line stores just 2 particles, so a new line has to be fetched from memory every 2 iterations. Remember that retrieving memory is a rather slow process.
The second example's cache lines look more like this:
Code: Select all
-----------------------------------------------------------------
|x|x|x|x|x|x|x|x|x|x|x|x|x|x|x|x|x|x|x|x|x|x|x|x|x|x|x|x|x|x|x|x|
-------------------------------------------------------------------------------------------------
|vx|vx|vx|vx|vx|vx|vx|vx|vx|vx|vx|vx|vx|vx|vx|vx|vx|vx|vx|vx|vx|vx|vx|vx|vx|vx|vx|vx|vx|vx|vx|vx|
-----------------------------------------------------------------
|y|y|y|y|y|y|y|y|y|y|y|y|y|y|y|y|y|y|y|y|y|y|y|y|y|y|y|y|y|y|y|y|
-------------------------------------------------------------------------------------------------
|vy|vy|vy|vy|vy|vy|vy|vy|vy|vy|vy|vy|vy|vy|vy|vy|vy|vy|vy|vy|vy|vy|vy|vy|vy|vy|vy|vy|vy|vy|vy|vy|
Here we can see that each cache line stores 8 floats (each group of characters represents a single byte), so we only have to fetch 2 new lines of memory every 8 iterations (one for the position, one for the velocity).
Now, we're still only performing one addition per loop iteration, so we can actually improve performance even more with SSE.
Code: Select all
#include <cstddef>
#include <xmmintrin.h> // SSE: __m128, _mm_add_ps

union ParticleStreamed final
{
    struct
    {
        ::std::size_t count;
        // As before, each array is assumed to live on a 64 byte boundary.
        float* x;
        float* y;
        float* vx;
        float* vy;
    };
    struct
    {
        ::std::size_t _;
        // The same arrays, viewed 4 floats at a time. (Anonymous structs
        // and this kind of union punning are common compiler extensions
        // rather than strict standard C++; _mm_load_ps / _mm_store_ps on
        // the float pointers is the fully blessed route.)
        __m128* mx;
        __m128* my;
        __m128* mvx;
        __m128* mvy;
    };
};

ParticleStreamed particles;

// count / 4 rounds down; the 0-3 leftover particles need a scalar tail.
::std::size_t vectorCount = particles.count / 4;

for(::std::size_t i = 0; i < vectorCount; ++i)
{
    particles.mx[i] = _mm_add_ps(particles.mx[i], particles.mvx[i]);
}
for(::std::size_t i = 0; i < vectorCount; ++i)
{
    particles.my[i] = _mm_add_ps(particles.my[i], particles.mvy[i]);
}
In this example the memory layout is exactly the same, but we now perform 4 additions at a time instead of just one.
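Since the code above glosses over it: the count / 4 division drops any remainder, and the arrays have to come from somewhere. Here's a minimal sketch of both details; allocStream is a hypothetical helper of mine, and _mm_malloc is just one readily available way to get 64 byte aligned memory.
Code: Select all
#include <cstddef>
#include <xmmintrin.h>

// Hypothetical helper: one 64 byte aligned float array.
// (_mm_malloc comes via xmmintrin.h on GCC/Clang and malloc.h on MSVC;
// release with _mm_free, not free.)
float* allocStream(::std::size_t count)
{
    return static_cast<float*>(_mm_malloc(count * sizeof(float), 64));
}

void integrateX(ParticleStreamed& particles)
{
    ::std::size_t vectorCount = particles.count / 4;
    for(::std::size_t i = 0; i < vectorCount; ++i)
    {
        particles.mx[i] = _mm_add_ps(particles.mx[i], particles.mvx[i]);
    }
    // Scalar tail for the 0-3 particles that don't fill a full vector.
    for(::std::size_t i = vectorCount * 4; i < particles.count; ++i)
    {
        particles.x[i] += particles.vx[i];
    }
}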
Now, I have seen a lot of people discourage the use of SIMD intrinsics unless you're already familiar with them. That's circular advice: following it guarantees you never learn a tool that offers great performance gains. I recommend bookmarking the Intel Intrinsics Guide.
A couple of notes:
- Don't use MMX, it's very outdated. This includes any intrinsics that operate on __m64.
- Avoid mixing AVX and SSE. They aren't different processors, but transitioning between VEX-encoded (AVX) and legacy (SSE) instructions incurs a large penalty on many CPUs.
- According to the Steam Hardware Survey, 100% of computers support up to SSE3 (not to be confused with SSSE3, which is still high at 98%), and 96% of computers support up to SSE4.2, so it's pretty safe to use any of the SSE intrinsics.
- Don't look at SVML, that is a library only found with the Intel compiler.
- SS ("scalar single") intrinsics only operate on the lowest element of the vector, while PS ("packed single") intrinsics operate on all 4 elements; see the short example after this list.
- Integer, float, and double operations technically operate on different types (__m128i, __m128, __m128d), but storage-wise there is no difference between them: it is still 16 bytes of memory loaded into an XMM register.
- Certain operations are missing, notably ones for EPI8 and EPU8. It sucks, but it is what it is. There are emulations in SVML, but again, you'd need the quite expensive Intel compiler.
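To make the SS versus PS distinction concrete, here's a tiny sketch (the values are arbitrary):
Code: Select all
#include <xmmintrin.h>

__m128 a = _mm_set_ps(4.0f, 3.0f, 2.0f, 1.0f);     // lanes {1, 2, 3, 4}
__m128 b = _mm_set_ps(40.0f, 30.0f, 20.0f, 10.0f); // lanes {10, 20, 30, 40}

__m128 packed = _mm_add_ps(a, b); // {11, 22, 33, 44} - all 4 lanes added
__m128 scalar = _mm_add_ss(a, b); // {11, 2, 3, 4}   - only the lowest lane
                                  // added, the rest copied from a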
As some final notes: everything in this post is very abstracted, the code sketches are not drop-in ready, aligned allocation relies on compiler or platform specific facilities, and all of it is based purely on speculation about your code (though the philosophy of improving cache locality is a well understood concept). Compilers and CPUs (yes, the CPU performs optimizations at runtime) are quite good at optimizing things, so some of these steps may turn out to be unnecessary, but I would still recommend checking the assembly output. There are also ways to manually prefetch data with special instructions. I personally think this is overkill, but you do you. Side note on that last point: the Steam Hardware Survey says that only 0.05% of computers support PrefetchW. I don't think they're measuring it correctly, given that Windows 10 x64 actually requires that instruction and 74% of surveyed computers are running Windows 10 x64.
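For completeness, a manual prefetch looks something like this, reusing the names from the SSE example above (the distance of 4 vectors ahead is a made-up number; real distances need profiling):
Code: Select all
for(::std::size_t i = 0; i < vectorCount; ++i)
{
    // Ask the CPU to start pulling in memory we'll want shortly;
    // _MM_HINT_T0 requests it in all cache levels. This is purely a
    // hint - prefetching past the end of the array is harmless, and
    // the hardware prefetcher usually makes this redundant for simple
    // linear scans like ours.
    _mm_prefetch(reinterpret_cast<const char*>(particles.mx + i + 4), _MM_HINT_T0);
    particles.mx[i] = _mm_add_ps(particles.mx[i], particles.mvx[i]);
}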