One very subtle way to get a rather large performance increase would be to improve cache locality. It's unclear from the post, but I would guess that you're using an array of particles. At its most basic level, improving cache locality here means switching from that array of structures to a structure of arrays.
Given that each particle takes up 64 bytes, that is an entire cache line. I also find it unlikely that the particles are aligned on a 64 byte boundary, so each one actually straddles 2 cache lines. That's terrible for performance, especially since you're almost certainly manipulating them in a streamed fashion.
Along with reworking the layout of the data, you'll also want to rework how you iterate over it. For instance, if you currently have a single loop that walks all of the particles and touches every field, you'll instead want multiple loops, each working on only one piece of data at a time.
Example:
Code: Select all
#include <cstddef>

struct Particle final
{
    float x, y, vx, vy;
};

Particle* particles;
::std::size_t particleCount;

for(::std::size_t i = 0; i < particleCount; ++i)
{
    // One particle per iteration; every field is touched in a single pass.
    Particle& p = particles[i];
    p.x += p.vx;
    p.y += p.vy;
}
Would be converted to:
Code: Select all
#include <cstddef>

struct ParticleStreamed final
{
    ::std::size_t count;
    // Each array is assumed to be allocated on a 64 byte boundary.
    // Portable C++ can't express that on a pointer; it's a property
    // of the allocation itself.
    float* x;
    float* y;
    float* vx;
    float* vy;
};

ParticleStreamed particles;

for(::std::size_t i = 0; i < particles.count; ++i)
{
    particles.x[i] += particles.vx[i];
}
for(::std::size_t i = 0; i < particles.count; ++i)
{
    particles.y[i] += particles.vy[i];
}
This may seem odd, and at a casual glance it may even look slower, but it actually performs drastically better. Understanding why is easier with a graphic.
The first example's cache lines look a little bit like this (assuming good alignment, and 32 byte cache lines to keep the diagrams narrow; real x86 lines are 64 bytes).
Code: Select all
---------------------------------------------------------------------------------
|x|x|x|x|y|y|y|y|vx|vx|vx|vx|vy|vy|vy|vy|x|x|x|x|y|y|y|y|vx|vx|vx|vx|vy|vy|vy|vy|
Here we can see that a single cache line stores just 2 particles, so a new line has to be fetched from memory every 2 iterations. Remember that retrieving memory is a rather slow process.
The second example's cache lines look more like this:
Code: Select all
-----------------------------------------------------------------
|x|x|x|x|x|x|x|x|x|x|x|x|x|x|x|x|x|x|x|x|x|x|x|x|x|x|x|x|x|x|x|x|
-------------------------------------------------------------------------------------------------
|vx|vx|vx|vx|vx|vx|vx|vx|vx|vx|vx|vx|vx|vx|vx|vx|vx|vx|vx|vx|vx|vx|vx|vx|vx|vx|vx|vx|vx|vx|vx|vx|
-----------------------------------------------------------------
|y|y|y|y|y|y|y|y|y|y|y|y|y|y|y|y|y|y|y|y|y|y|y|y|y|y|y|y|y|y|y|y|
-------------------------------------------------------------------------------------------------
|vy|vy|vy|vy|vy|vy|vy|vy|vy|vy|vy|vy|vy|vy|vy|vy|vy|vy|vy|vy|vy|vy|vy|vy|vy|vy|vy|vy|vy|vy|vy|vy|
Here we can see that each cache line stores 8 floats (each group of characters represents a single byte), so we only have to fetch 2 new lines of memory every 8 iterations (one for the position, one for the velocity).
Now, we're still only performing one addition per loop iteration, so we can actually improve performance even more with SSE.
Code: Select all
#include <cstddef>
#include <xmmintrin.h> // SSE: __m128, _mm_add_ps

union ParticleStreamed final
{
    struct
    {
        ::std::size_t count;
        // As before, each array is assumed to live on a 64 byte boundary.
        float* x;
        float* y;
        float* vx;
        float* vy;
    };
    struct
    {
        ::std::size_t _;
        // The same arrays, viewed 4 floats at a time. (Anonymous structs
        // and this kind of union punning are common compiler extensions
        // rather than strict standard C++; _mm_load_ps / _mm_store_ps on
        // the float pointers is the fully blessed route.)
        __m128* mx;
        __m128* my;
        __m128* mvx;
        __m128* mvy;
    };
};

ParticleStreamed particles;

// count / 4 rounds down; the 0-3 leftover particles need a scalar tail.
::std::size_t vectorCount = particles.count / 4;

for(::std::size_t i = 0; i < vectorCount; ++i)
{
    particles.mx[i] = _mm_add_ps(particles.mx[i], particles.mvx[i]);
}
for(::std::size_t i = 0; i < vectorCount; ++i)
{
    particles.my[i] = _mm_add_ps(particles.my[i], particles.mvy[i]);
}
In this example the memory layout is exactly the same, but we now perform 4 additions at a time instead of just one.
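Since the code above glosses over it: the count / 4 division drops any remainder, and the arrays have to come from somewhere. Here's a minimal sketch of both details; allocStream is a hypothetical helper of mine, and _mm_malloc is just one readily available way to get 64 byte aligned memory.
Code: Select all
#include <cstddef>
#include <xmmintrin.h>

// Hypothetical helper: one 64 byte aligned float array.
// (_mm_malloc comes via xmmintrin.h on GCC/Clang and malloc.h on MSVC;
// release with _mm_free, not free.)
float* allocStream(::std::size_t count)
{
    return static_cast<float*>(_mm_malloc(count * sizeof(float), 64));
}

void integrateX(ParticleStreamed& particles)
{
    ::std::size_t vectorCount = particles.count / 4;
    for(::std::size_t i = 0; i < vectorCount; ++i)
    {
        particles.mx[i] = _mm_add_ps(particles.mx[i], particles.mvx[i]);
    }
    // Scalar tail for the 0-3 particles that don't fill a full vector.
    for(::std::size_t i = vectorCount * 4; i < particles.count; ++i)
    {
        particles.x[i] += particles.vx[i];
    }
}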
Now, I have seen a lot of people discourage the use of SIMD intrinsics unless you're already familiar with them. That's circular advice: following it guarantees you never learn a tool that offers great performance gains. I recommend bookmarking the Intel Intrinsics Guide.
A couple of notes:
- Don't use MMX, it's very outdated. This includes any intrinsics that operate on __m64.
- Avoid mixing AVX and SSE. They aren't different processors, but transitioning between VEX-encoded (AVX) and legacy (SSE) instructions incurs a large penalty on many CPUs.
- According to the Steam Hardware Survey, 100% of computers support up to SSE3 (not to be confused with SSSE3, which is still high at 98%), and 96% of computers support up to SSE4.2, so it's pretty safe to use any of the SSE intrinsics.
- Don't look at SVML, that is a library only found with the Intel compiler.
- SS ("scalar single") intrinsics only operate on the lowest element of the vector, while PS ("packed single") intrinsics operate on all 4 elements; see the short example after this list.
- Integer, float, and double operations technically operate on different types (__m128i, __m128, __m128d), but storage-wise there is no difference between them: it is still 16 bytes of memory loaded into an XMM register.
- Certain operations are missing, notably ones for EPI8 and EPU8. It sucks, but it is what it is. There are emulations in SVML, but again, you'd need the quite expensive Intel compiler.
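To make the SS versus PS distinction concrete, here's a tiny sketch (the values are arbitrary):
Code: Select all
#include <xmmintrin.h>

__m128 a = _mm_set_ps(4.0f, 3.0f, 2.0f, 1.0f);     // lanes {1, 2, 3, 4}
__m128 b = _mm_set_ps(40.0f, 30.0f, 20.0f, 10.0f); // lanes {10, 20, 30, 40}

__m128 packed = _mm_add_ps(a, b); // {11, 22, 33, 44} - all 4 lanes added
__m128 scalar = _mm_add_ss(a, b); // {11, 2, 3, 4}   - only the lowest lane
                                  // added, the rest copied from a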
As some final notes: everything in this post is very abstracted, the code sketches are not drop-in ready, aligned allocation relies on compiler or platform specific facilities, and all of it is based purely on speculation about your code (though the philosophy of improving cache locality is a well understood concept). Compilers and CPUs (yes, the CPU performs optimizations at runtime) are quite good at optimizing things, so some of these steps may turn out to be unnecessary, but I would still recommend checking the assembly output. There are also ways to manually prefetch data with special instructions. I personally think this is overkill, but you do you. Side note on that last point: the Steam Hardware Survey says that only 0.05% of computers support PrefetchW. I don't think they're measuring it correctly, given that Windows 10 x64 actually requires that instruction and 74% of surveyed computers are running Windows 10 x64.
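For completeness, a manual prefetch looks something like this, reusing the names from the SSE example above (the distance of 4 vectors ahead is a made-up number; real distances need profiling):
Code: Select all
for(::std::size_t i = 0; i < vectorCount; ++i)
{
    // Ask the CPU to start pulling in memory we'll want shortly;
    // _MM_HINT_T0 requests it in all cache levels. This is purely a
    // hint - prefetching past the end of the array is harmless, and
    // the hardware prefetcher usually makes this redundant for simple
    // linear scans like ours.
    _mm_prefetch(reinterpret_cast<const char*>(particles.mx + i + 4), _MM_HINT_T0);
    particles.mx[i] = _mm_add_ps(particles.mx[i], particles.mvx[i]);
}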