
Optimizing TFS Pathfinding

If you really want to optimize it, I would recommend starting by learning some theory. The problem you are facing is the shortest path problem (fastest path in this case, but that's almost the same) - Shortest path problem - Wikipedia
The original solution uses something similar to Dijkstra's algorithm (complexity n log(n)). With your modification you changed it into a heuristic algorithm (also a wiki page - A* search algorithm - Wikipedia). It's going to help, but in this case only a bit.
The problem with the TFS implementation is that it doesn't use the processor cache correctly (it loads a lot of data from memory in random order).
If you don't know what I am talking about, I prepared a small example:
Code:
#include <chrono>
#include <cstdint>
#include <iostream>

std::chrono::high_resolution_clock::time_point time_point;
uint64_t micros;

int main() {
    // static: 1000x1000 ints (~4 MB) would overflow the default stack on some platforms
    static int T[1000][1000] = {0};
    time_point = std::chrono::high_resolution_clock::now();
    for(int x = 0; x < 1000; ++x) {
        for(int y = 0; y < 1000; ++y) {
            T[x][y] = x + y;
        }
    }
    micros = std::chrono::duration_cast<std::chrono::microseconds>(std::chrono::high_resolution_clock::now() - time_point).count();
    if(T[11][33] == 0)  // read T so gcc can't remove the loop as dead code (T would otherwise be unused)
        return 0;
    std::cout << "Took " << micros << " microseconds" << std::endl;
    time_point = std::chrono::high_resolution_clock::now();
    for(int x = 0; x < 1000; ++x) {
        for(int y = 0; y < 1000; ++y) {
            T[y][x] = x + y;  // same work, but the inner loop now strides by 1000 ints
        }
    }
    micros = std::chrono::duration_cast<std::chrono::microseconds>(std::chrono::high_resolution_clock::now() - time_point).count();
    if(T[11][33] == 0)
        return 0;
    std::cout << "Took " << micros << " microseconds" << std::endl;
    return 0;
}

In this code you have two loops doing exactly the same work, but their execution times are very different.
Code:
Took 409 microseconds
Took 3677 microseconds
The first loop is almost 9x faster than the second, even though, as I said, both do exactly the same work. It's good to know why if you want to start optimizing code in a really efficient way.
If you want to make that function really fast, you must keep all processing data in a memory block smaller than 4 KB (it's possible to do it in a block smaller than 1 KB).
It's possible to speed this function up at least 4 times. In a normal case it uses ~10% of the CPU time used by TFS (for comparison, walking uses up to 30%).
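To make the "working set smaller than 4 KB" idea concrete, here is a minimal sketch (not TFS code; the 32x32 window and the field names are assumptions) of how the per-search bookkeeping for a bounded A*/Dijkstra search could be packed into one small contiguous block:
Code:
#include <cstdint>
#include <cstring>

// Sketch only: all per-search state for a bounded search window lives in one
// contiguous ~2 KB block, so it fits comfortably in the L1 data cache.
constexpr int WINDOW = 32;  // e.g. a 32x32 tile window around the start position

struct SearchGrid {
    uint8_t cost[WINDOW][WINDOW];      // best cost found so far, 255 = not visited (1 KB)
    uint8_t cameFrom[WINDOW][WINDOW];  // packed predecessor direction 0-7          (1 KB)

    void reset() {
        std::memset(cost, 0xFF, sizeof(cost));      // one linear, cache-friendly pass
        std::memset(cameFrom, 0, sizeof(cameFrom));
    }
};

The point is the same as in the loop example above: keep the data the inner loop touches small and contiguous instead of spread across heap-allocated node objects.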


This occurs because T is stored in row-major order, so it is laid out in memory like this:
T[0][0], T[0][1], T[0][2], ..., T[999][998], T[999][999].

So when you are accessing its memory in the first loop, the region you are using is already in cache: when the processor pulls T[0][0] into cache, T[0][1], T[0][2], ... come along with it.

In the second loop you write T[0][0], then T[1][0], and those two memory locations have a thousand ints between them (the whole T[0] row), so the processor has to go out to memory to fetch it, which takes more time (more cache misses).

I could be wrong.

Edit--

I've recently changed a couple of std::lists to std::vector in TFS and that alone was like magic regarding CPU usage, but unfortunately when the vector gets reallocated we get some crashes because the pointers are invalidated.
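If anyone wants to try the same std::list to std::vector change: the reallocation crashes usually go away if you stop keeping raw pointers or iterators into the vector and keep indices instead (or reserve() enough capacity up front). A minimal sketch with made-up names, not actual TFS code:
Code:
#include <cstddef>
#include <vector>

struct Creature { /* ... */ };

std::vector<Creature> creatures;

// Dangerous: a pointer like this dangles as soon as push_back() forces a reallocation.
// Creature* watched = &creatures[5];

// Safer: keep the index; it survives reallocation as long as elements in front
// of it are not erased or reordered.
std::size_t watchedIndex = 5;

void useWatched() {
    if (watchedIndex < creatures.size()) {
        Creature& watched = creatures[watchedIndex];  // resolve the index at each use
        (void)watched;                                // ... do something with it
    }
}

Calling reserve() up front, or switching to std::deque (whose elements are not moved when it grows at the ends), are the other common ways around the invalidation.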
 
This post is interesting, so in my Tibia 7.7 recreation project, which has been slowly building up over the last weeks, I wrote an A* from scratch, and I wanted to benchmark it by spam-clicking the map and running the search loop 1000 times.

Screenshot_3.png


Now, after adding code that sends a teleport effect on every node checked while searching for the best predecessor among the neighbouring points:

Screenshot_4.png

(CPU usage increased by a small margin due to sending multiple teleport effects; without them the results are the same as in the first image. The red area is the start area, the yellow area is the spot clicked on the map.)

The reported interval is generated with:
Code:
std::cout << "> Map loading time: " << (g_game.serverMilliseconds() - micros) / (1000.) << " seconds." << std::endl;

This post might be useless here, but I found the roughly 0.010 s I'm seeing, compared with other tests, interesting.
The view range of the A* is 15 SQMs in both the X and Y directions.

Are my results okay, and what should I do to benchmark my own A* - or am I getting good results because of what Kondrah said a few posts earlier?
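One way to get a steadier number than on-click timings is to time only the search itself with std::chrono and average over many runs. A minimal, self-contained sketch; findPath() here is just a stand-in for whatever your A* entry point actually is:
Code:
#include <chrono>
#include <iostream>

// Stand-in for the real search; replace the body with a call to your own A*.
bool findPath() {
    return true;
}

int main() {
    constexpr int runs = 1000;
    volatile bool sink = false;          // keeps the compiler from optimizing the calls away
    auto start = std::chrono::steady_clock::now();
    for (int i = 0; i < runs; ++i) {
        sink = findPath();               // same start/goal positions every iteration
    }
    auto elapsed = std::chrono::steady_clock::now() - start;
    auto micros = std::chrono::duration_cast<std::chrono::microseconds>(elapsed).count();
    std::cout << "total: " << micros << " us, average per search: "
              << micros / static_cast<double>(runs) << " us" << std::endl;
    return sink ? 0 : 1;
}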
 
I've recently changed a couple of std::lists to std::vector in TFS and that alone was like magic regarding CPU usage, but unfortunately when the vector gets reallocated we get some crashes because the pointers are invalidated.

Where did you change std::list to std::vector?
 
For anyone curious, this is an extract from the following link, What is a "cache-friendly" code?, provided by @kondra (thanks a lot yo).

Exploiting the ordering (e.g. changing the column index first in C++):

M[0][0] (memory) + M[0][1] (cached) + M[1][0] (memory) + M[1][1] (cached)
= 1 + 2 + 3 + 4
--> 2 cache hits, 2 memory accesses
Not exploiting the ordering (e.g. changing the row index first in C++):

M[0][0] (memory) + M[1][0] (memory) + M[0][1] (memory) + M[1][1] (memory)
= 1 + 3 + 2 + 4
--> 0 cache hits, 4 memory accesses
In this simple example, exploiting the ordering approximately doubles execution speed (since memory access requires many more cycles than computing the sums). In practice the performance difference can be much larger.
 
How I made my own pathfinder for a Unity top-down game with north, east, west and south move directions:

Learn the theory and write your own pathfinder function. Mine is called "GetShortestPath" and you can call it every time the target moves, or every time your character hits a moving obstacle such as a monster or a player; see the sketch below.
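For a grid restricted to north/east/south/west moves, a plain breadth-first search already returns a shortest path and is easy to write from scratch. A minimal C++ sketch (the grid size, names and blocked-tile encoding are made up for illustration; this is not code from any engine):
Code:
#include <algorithm>
#include <array>
#include <cstdint>
#include <queue>
#include <utility>
#include <vector>

// Sketch only: 0 = walkable, 1 = blocked; the caller fills the grid.
constexpr int W = 16, H = 16;
std::array<std::array<uint8_t, W>, H> grid{};

// Returns the cells from start to goal (inclusive), or an empty vector if unreachable.
std::vector<std::pair<int, int>> getShortestPath(int sx, int sy, int gx, int gy) {
    std::array<std::array<int, W>, H> prev;
    for (auto& row : prev) row.fill(-1);               // -1 = not visited yet

    std::queue<std::pair<int, int>> open;
    open.push({sx, sy});
    prev[sy][sx] = sy * W + sx;                        // start points at itself

    const int dx[4] = {1, -1, 0, 0};
    const int dy[4] = {0, 0, 1, -1};
    while (!open.empty()) {
        auto [x, y] = open.front();
        open.pop();
        if (x == gx && y == gy) break;                 // goal reached
        for (int d = 0; d < 4; ++d) {
            int nx = x + dx[d], ny = y + dy[d];
            if (nx < 0 || ny < 0 || nx >= W || ny >= H) continue;
            if (grid[ny][nx] != 0 || prev[ny][nx] != -1) continue;
            prev[ny][nx] = y * W + x;                  // remember where we came from
            open.push({nx, ny});
        }
    }

    std::vector<std::pair<int, int>> path;
    if (prev[gy][gx] == -1) return path;               // goal never reached
    for (int x = gx, y = gy; !(x == sx && y == sy); ) {
        path.emplace_back(x, y);
        int p = prev[y][x];
        x = p % W;
        y = p / W;
    }
    path.emplace_back(sx, sy);
    std::reverse(path.begin(), path.end());
    return path;
}

A* only becomes necessary when the grid is large enough that expanding everything is too slow; the heuristic just prunes the same search.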

The TFS 1.3 pathfinder system is bugged for fast monsters and people are offering $100 on GitHub for a revamp.
 
The TFS 1.3 pathfinder system is bugged for fast monsters and people are offering $100 on GitHub for a revamp.

That's an easy fix - you change 2 lines of code and the monsters work as they should, lol.
And it's also available to the public.
 
@kondra Did you rewrite the whole map structure so that columns are accessed sequentially in memory? In other words, to cache columns instead of rows? (Because when sending to the client it expects one column at a time, not a row.) And because all servers have it the "wrong" way, when they send information to the client the tile info is not accessed sequentially? Or did you find other, more important caching problems?

Example:

Servers store their tiles in memory like this:
offset: 0x00        0x01        0x02        0x03        0x04        0x05        0x06        0x07
tile:   (100,100,7) (101,100,7) (102,100,7) (103,100,7) (104,100,7) (105,100,7) (106,100,7) (107,100,7)

Client expects:
offset: 0x00        0x01        0x02        0x03        0x04        0x05        0x06        0x07
tile:   (100,100,7) (100,101,7) (100,102,7) (100,103,7) (100,104,7) (100,105,7) (100,106,7) (100,107,7)

so when sending you don't get sequential memory access. Instead, should you store it the same way the client expects it?

Other notes:
I have only looked at how the 7.4 client expects the map.
I have not looked at how RME stores the map (you probably have to rewrite how the map is read), and initial map loading time will probably increase.
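A minimal sketch of the storage-order idea (the sizes and names are made up; this is not how any particular server does it): if the index that changes fastest when building the client packet is also the fastest-changing index in memory, sending a column becomes one linear scan.
Code:
#include <cstdint>
#include <vector>

// Sketch only: one floor stored as a flat array.
constexpr int WIDTH = 512, HEIGHT = 512;

struct Tile { uint16_t groundId; };

std::vector<Tile> floorTiles(WIDTH * HEIGHT);

// Row-major: tiles that share a y are adjacent; stepping through y jumps WIDTH tiles each time.
inline Tile& tileRowMajor(int x, int y)    { return floorTiles[y * WIDTH + x]; }

// "Client order": tiles that share an x are adjacent, so sending one column
// (x fixed, y = top..bottom) reads consecutive memory.
inline Tile& tileClientOrder(int x, int y) { return floorTiles[x * HEIGHT + y]; }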
 
@kondra Did you rewrite the whole map structure so that columns are accessed sequentially in memory? Or did you find other, more important caching problems?

Yes, I use a custom structure for storing the map. Instead of O(log n) complexity to get a tile, it does it in O(1) (in exactly 3 steps/RAM queries), so getTile is 6-8x faster in the end.
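Roughly what an O(1), few-access lookup can look like: split the coordinates into a chunk part and an in-chunk part, so getTile is one top-level index, one in-chunk index, and then the tile itself. This is only a guess at the general technique, with made-up names and sizes, not kondra's actual structure:
Code:
#include <array>
#include <cstdint>
#include <memory>
#include <vector>

struct Tile { /* tile contents */ };

// Sketch only: 64x64-tile chunks, z (floor) dimension left out for brevity.
constexpr int CHUNK_BITS = 6;
constexpr int CHUNK_SIZE = 1 << CHUNK_BITS;               // 64
constexpr int CHUNKS_PER_AXIS = 1 << (16 - CHUNK_BITS);   // 16-bit coordinates

struct Chunk {
    std::array<Tile*, CHUNK_SIZE * CHUNK_SIZE> tiles{};   // nullptr = no tile here
};

struct FlatMap {
    // Top level: one (possibly null) chunk pointer per 64x64 area, allocated lazily.
    std::vector<std::unique_ptr<Chunk>> chunks =
        std::vector<std::unique_ptr<Chunk>>(CHUNKS_PER_AXIS * CHUNKS_PER_AXIS);

    Tile* getTile(uint16_t x, uint16_t y) const {
        // Access 1: the chunk pointer. Access 2: the tile slot.
        // Access 3 happens when the caller dereferences the returned tile.
        const Chunk* c = chunks[(y >> CHUNK_BITS) * CHUNKS_PER_AXIS + (x >> CHUNK_BITS)].get();
        if (!c) {
            return nullptr;
        }
        return c->tiles[(y & (CHUNK_SIZE - 1)) * CHUNK_SIZE + (x & (CHUNK_SIZE - 1))];
    }
};

The trade-off is a bigger top-level table in exchange for a lookup depth that no longer depends on map size, unlike walking a tree with O(log n) levels.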
 
If I might ask: I don't know much about programming, but I would like to improve the creature behaviour. I see the pathfinding from this post is much more effective at far distances if the client has an extended window. How exactly does this affect the performance of the server? Will it raise memory usage a little, or so much that players won't be able to play?

Also, I saw that the old code has some differences from Nekiro's 8.6, for example:

creature.cpp
Code:
std::forward_list<Direction>
and the latest sources have:
Code:
std::vector<Direction>

error.png

How do I properly merge this? I get errors when merging because the syntax changed in newer TFS and I don't know what I have to change to make it work. Thanks in advance!
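Without seeing the exact errors this is only a guess, but the usual difference when a function moves from std::forward_list<Direction> to std::vector<Direction> is the insertion API: forward_list code builds the path with push_front while walking backwards, whereas vector code normally does push_back and reverses once at the end (vector has no push_front). A minimal sketch of the two shapes (the Direction values and loops are placeholders):
Code:
#include <algorithm>
#include <forward_list>
#include <vector>

enum Direction { DIRECTION_NORTH, DIRECTION_EAST, DIRECTION_SOUTH, DIRECTION_WEST };

// Old style: while backtracking from the goal, push_front leaves the list in start->goal order.
void buildPathOld(std::forward_list<Direction>& dirList) {
    // for each step while walking backwards from the goal:
    dirList.push_front(DIRECTION_NORTH);
}

// New style: std::vector has no push_front, so push_back every step
// and reverse once at the end to get the same start->goal order.
void buildPathNew(std::vector<Direction>& dirList) {
    // for each step while walking backwards from the goal:
    dirList.push_back(DIRECTION_NORTH);
    std::reverse(dirList.begin(), dirList.end());
}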
 
Do you guys think it's worth rewriting it for smaller servers/maps? From what I understand, most of you host RL-map servers, right?
It will take some effort to rewrite the distro together with RME, as mentioned by rwxsu. On the other hand, maybe it's worth including it in the official TFS core (both the pathfinding and the map changes)?
 