
[TFS 1.2+] Performance tuning

Damc
Well-Known Member · TFS Developer
Joined: Nov 30, 2013 · Location: Wrocław
  1. Introduction.
    A few questions asked on GitHub recently have convinced me that it might be necessary to write a tutorial about some quirks and new performance characteristics of the upcoming 1.2 release of TFS. I won't go very deep into the internals as I know that most people reading this probably aren't CS/Programming experts. Initially this tutorial will contain tips for tuning the networking stack, however I intend to keep this tutorial updated in the future, should there be other things that can affect performance of your server. It is assumed that the reader knows how to compile the server and has basic C++ knowledge.
    WARNING: There is no guarantee of results. It might take a lot of trial and error and measurements to find the settings that best fit your server!
  2. Useful terms.
    Compile time constant - a number that is known (or can be calculated) at compile time. The advantage of using it over a dynamic config value (in other words, things that can be changed through config.lua) is that the compiler can use optimizations to increase the efficiency of code. Changing this constant requires recompilation of the translation unit it was defined in (this is C++-speak, if you don't understand that, don't worry, all you need to know is that recompilation is required, the tools will hopefully do the rest for you).
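    As a rough illustration of why compile time constants help, here is a minimal sketch (the names POOL_CAPACITY and poolBytes are made up for this example, not taken from TFS): the compiler can evaluate everything at build time and fold the results straight into the generated code.

```cpp
#include <cstdint>

// A compile-time constant: the compiler knows this value and can fold it
// into the generated code. Changing it requires recompiling this file.
constexpr uint16_t POOL_CAPACITY = 2048;

// Compile-time arithmetic is also possible: the compiler evaluates this
// expression once, at build time, instead of at runtime.
constexpr uint32_t poolBytes(uint16_t slots, uint32_t slotSize) {
    return static_cast<uint32_t>(slots) * slotSize;
}

// Proof that the value is known at compile time: static_assert is checked
// by the compiler, before the program ever runs.
static_assert(poolBytes(POOL_CAPACITY, 24 * 1024) == 2048u * 24u * 1024u,
              "evaluated entirely at compile time");
```

    A value read from config.lua, by contrast, is only known at runtime, so none of these optimizations are available.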

    Latency - usually used in networking, means the length of time between a request/command and a response/execution of a command, people sometimes refer to high latency (and low responsiveness) as 'lag'.


    Throughput - usually defined as the amount of work that is done per unit of time. In Computer Science there is usually a tradeoff between throughput and latency.

    Object pooling - a technique used in programming high-performance systems that amortizes the overhead of memory management by reusing objects, exploiting knowledge about object lifetimes that the programmer has but cannot convey to the typical memory management tools.
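    A minimal, single-threaded sketch of the idea (illustrative only, not TFS code): freed buffers are kept on a free list and handed back out on the next request, so the general-purpose allocator is only hit when the pool is empty.

```cpp
#include <cstddef>
#include <memory>
#include <vector>

// Minimal object pool sketch: released buffers are kept for reuse instead
// of being returned to the general-purpose allocator.
class BufferPool {
public:
    explicit BufferPool(std::size_t bufferSize) : bufferSize(bufferSize) {}

    // Reuse a pooled buffer if one is available, otherwise allocate lazily.
    std::unique_ptr<char[]> acquire() {
        if (!freeList.empty()) {
            auto buf = std::move(freeList.back());
            freeList.pop_back();
            return buf;
        }
        return std::make_unique<char[]>(bufferSize);
    }

    // Return a buffer to the pool instead of freeing it.
    void release(std::unique_ptr<char[]> buf) {
        freeList.push_back(std::move(buf));
    }

    std::size_t pooled() const { return freeList.size(); }

private:
    std::size_t bufferSize;
    std::vector<std::unique_ptr<char[]>> freeList;
};
```

    The real pool in TFS is lock-free and thread-safe; this sketch only shows the reuse-instead-of-free policy.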
  3. Network stack tuning.
    TFS 1.2 boasts a new networking stack (around half of the base networking code has been rewritten). Initially these changes were intended to provide thread safety guarantees for the casting system, but some parts had to be radically redesigned. Even though performance has decreased in some areas of the code (to provide correctness guarantees in all sane situations), the overall effect has been a performance increase. There is a price, however. The old system was mostly fool-proof: the server admin didn't have to worry about using the right settings for the network stack to work well. The new system has one big drawback in this area - if configured improperly, it will either kill your throughput (and increase latency) or eat A LOT of your RAM.

    (If you don't want to know how stuff works internally you can skip this paragraph)
    I will now briefly describe how the system works. Almost all network packets related to the game protocol sent by the server to the client are so-called buffered messages: new information to be delivered to the game client is copied into a single buffer and sent once every "auto-send cycle". This greatly increases throughput at the cost of latency. These buffers are internally called OutputMessages and, because they're non-local and large objects, they cannot simply be allocated and deallocated every time they're used - allocating large amounts of memory is slow (24 KiB might not seem like a large amount of memory, but for a C++ object it's fairly large). Therefore a high-performance pool is used (if you're interested, read about lock-free stacks). This pool has a fixed maximum capacity defined by a compile time constant; however, it is lazily populated, meaning that new OutputMessages will be created if the pool is empty and an OutputMessage is required. Another important feature of the new networking stack is auto-send scheduling. Before TFS 1.2, the auto-send queue would be checked every single dispatcher cycle (which was a waste of dispatcher time). Now a more efficient approach is used: all protocols eligible for auto-send are checked every time a certain time quantum elapses. This quantum is configurable through a compile time constant.
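    The auto-send scheduling described above can be modeled with a tiny sketch (illustrative, not the actual TFS code; the names AUTOSEND_DELAY and shouldFlush are made up for the example): buffered data is flushed only once per time quantum rather than on every dispatcher cycle.

```cpp
#include <chrono>

using Clock = std::chrono::steady_clock;

// One auto-send quantum; in TFS this corresponds to a compile time constant.
constexpr std::chrono::milliseconds AUTOSEND_DELAY{10};

// Flush only when a full quantum has elapsed since the last flush; in
// between, outgoing data just accumulates in the buffered message.
bool shouldFlush(Clock::time_point lastFlush, Clock::time_point now) {
    return now - lastFlush >= AUTOSEND_DELAY;
}
```

    Batching sends this way trades a few milliseconds of latency for far fewer (and larger) writes to each socket, which is where the throughput gain comes from.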

    Both mentioned constants are located in the src/outputmessage.cpp file:
    Code:
    const uint16_t OUTPUTMESSAGE_FREE_LIST_CAPACITY = 2048;
    const std::chrono::milliseconds OUTPUTMESSAGE_AUTOSEND_DELAY {10};

    The default capacity of the output message pool is, as you can see, 2048. This default should be enough for relatively small servers (up to perhaps 150 clients). If you regularly have more clients, you should consider increasing the pool capacity. What value should you choose? It can't be too low (because your server will have noticeable lag) and it can't be too high (no one likes to waste RAM). The best way to determine this value is to add code that prints a warning message every time the pool has been exhausted. You will need to modify the deallocate() member function in src/lockfree.h. Here's how this function should look:
    Code:
    void deallocate(T* p, size_t) const {
        if (!getFreeList().bounded_push(p)) {
            std::cout << "Warning: OutputMessage pool capacity exhausted!" << std::endl;
            // Release memory without calling the destructor of T
            // (it has already been called at this point)
            operator delete(p);
        }
    }
    (You will also need to add #include <iostream> in the include section)

    You should run your server with this modification under normal load. If you see the warning printed regularly, that's a sign you need to increase the capacity of the pool. I suggest incrementing it by 50% every time you see these warnings printed regularly. Once you tune the pool size, I recommend removing these changes, because this piece of code is fairly performance critical - you don't want unused garbage there!
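    The suggested 50% growth step can be sketched as a tiny helper (the function name nextCapacity is made up for this example; the cap mirrors the 65534 maximum that fits the uint16_t constant):

```cpp
#include <cstdint>

// Grow the pool capacity by 50% each time exhaustion warnings keep
// appearing, capped at the maximum the uint16_t constant can hold.
uint16_t nextCapacity(uint16_t current) {
    uint32_t grown = static_cast<uint32_t>(current) + current / 2;
    return grown > 65534 ? 65534 : static_cast<uint16_t>(grown);
}
```

    Starting from the default, the sequence would be 2048 → 3072 → 4608 → 6912, and so on, until the warnings stop appearing under normal load.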

    Note: If you're OK with potentially wasting at most 1.8 GiB of memory, go ahead and set the pool capacity to max (65534).

    The second constant discussed here is the auto-send delay. The default should be fine in most cases; however, if your players experience lag spikes, you should consider decreasing it (setting it below 1 ms is NOT recommended). In some cases increasing it might allow you to run more demanding scripts on your server; however, any value above 20 or 30 ms will probably be noticeable to players, so be careful!
 
Really nice that you did a refactor/redesign of such a critical system, hope it turns stable in no time :)

Edit: I have a small question for you:

The ProtocolGame::releaseProtocol() function will clear/reset the player->client variable. It's called by the dispatcher.

Now check this function:

void sendMagicEffect(const Position& pos, uint8_t type) const {
    if (client) {
        client->sendMagicEffect(pos, type);
    }
}

It can be called by the scheduler thread, so if the dispatcher clears the variable at the same time the scheduler calls it, it will crash, right?

Is there some reason why this variable (client) is reset by the dispatcher?
 
Last edited:
As far as I know, TFS properly implements the scheduler-dispatcher design pattern, which means that the scheduler does not do work related to game logic. The only thing it does is put work items into the dispatcher's work queue at the right moment in time, so there is no data race here. There is, however, a data race when parsing packets, because the Asio I/O thread(s) can dispatch parsing methods at the same time the dispatcher clears the player pointer (https://github.com/otland/forgottenserver/issues/1466), but that is rare. I haven't yet found a way to fix it that doesn't impact dispatcher efficiency too much, and I'll probably wait until the casting system is merged before I try to do that.
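A minimal model of that scheduler-dispatcher split (illustrative, not the actual TFS classes): the scheduler thread only enqueues tasks into the dispatcher's queue, and all game logic runs on the single dispatcher thread, which is why there is no data race on game state.

```cpp
#include <chrono>
#include <condition_variable>
#include <functional>
#include <mutex>
#include <queue>
#include <thread>

// The dispatcher owns the work queue; tasks execute one at a time on the
// dispatcher thread, so game state is only ever touched single-threaded.
class Dispatcher {
public:
    void addTask(std::function<void()> task) {
        std::lock_guard<std::mutex> lock(mutex);
        tasks.push(std::move(task));
        signal.notify_one();
    }

    // Blocks until a task is available, then runs it on the calling
    // (dispatcher) thread.
    void runOne() {
        std::unique_lock<std::mutex> lock(mutex);
        signal.wait(lock, [this] { return !tasks.empty(); });
        auto task = std::move(tasks.front());
        tasks.pop();
        lock.unlock();
        task(); // game logic executes here, never on the scheduler thread
    }

private:
    std::mutex mutex;
    std::condition_variable signal;
    std::queue<std::function<void()>> tasks;
};

// The "scheduler": waits for the deadline, then merely hands the task to
// the dispatcher instead of running it itself.
void scheduleEvent(Dispatcher& dispatcher, std::chrono::milliseconds delay,
                   std::function<void()> task) {
    std::thread([&dispatcher, delay, task = std::move(task)]() mutable {
        std::this_thread::sleep_for(delay);
        dispatcher.addTask(std::move(task)); // no game logic on this thread
    }).detach();
}
```

So an addEvent callback fired by the scheduler still ends up executing on the dispatcher thread, serialized with everything else that touches the player.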
 
I don't believe it implements the scheduler-dispatcher design correctly. Look at the example I gave. If I call Position:sendMagicEffect from a timer event (one I added using the addEvent Lua function, which uses g_scheduler.addEvent), it will use the client (ProtocolGame) variable and the player variable (inside ProtocolGame::canSee). Both the client and player variables are erased by the dispatcher, which may lead to a crash.

Edit - You are correct :) Looking at the scheduler code, I see now that it's just putting the task in the dispatcher. Database tasks do that too, so it is thread safe.

Thanks for the answers, I'm feeling stupid right now lol
 
Last edited:
TFS 1.0 has the old networking system, so unless you backport it from TFS 1.2 to 1.0, the answer is no.
 