Optimizing TFS Pathfinding

gudan garam

Intermediate OT User
Joined
Feb 13, 2011
Messages
276
Reaction score
85
Location
Tibialand
If you really want to optimize it, I would recommend starting with some theory. The problem you are facing is the shortest path problem (the fastest path in this case, but that's almost the same) - Shortest path problem - Wikipedia
The original solution uses something similar to Dijkstra's algorithm (complexity O(n log n)). With your modification you changed it into a heuristic algorithm (also a wiki page - A* search algorithm - Wikipedia). It's going to help, but in this case only a bit.
The problem with the TFS implementation is that it doesn't use the processor cache correctly (it loads a lot of data from memory in random order).
If you don't know what I am talking about, I prepared a small example:
Code:
#include <chrono>
#include <iostream>

std::chrono::high_resolution_clock::time_point time_point;
uint64_t micros;

int main() {
    int T[1000][1000] = {0};
    time_point = std::chrono::high_resolution_clock::now();
    for(int x = 0; x < 1000; ++x) {
        for(int y = 0; y < 1000; ++y) {
            T[x][y] = x + y;
        }
    }
    micros = std::chrono::duration_cast<std::chrono::microseconds>(std::chrono::high_resolution_clock::now() - time_point).count();
    if(T[11][33] == 0)  // without this check gcc would optimize the loop away, since T would otherwise be unused
        return 0;
    std::cout << "Took " << micros << " microseconds" << std::endl;
    time_point = std::chrono::high_resolution_clock::now();
    for(int x = 0; x < 1000; ++x) {
        for(int y = 0; y < 1000; ++y) {
            T[y][x] = x + y;
        }
    }
    micros = std::chrono::duration_cast<std::chrono::microseconds>(std::chrono::high_resolution_clock::now() - time_point).count();
    if(T[11][33] == 0)
        return 0;
    std::cout << "Took " << micros << " microseconds" << std::endl;
    return 0;
}
This code has two loops doing exactly the same thing, but their execution times differ greatly.
Code:
Took 409 microseconds
Took 3677 microseconds
The first loop is almost 9x faster than the second! And as I said, both do exactly the same thing. It's good to know why if you want to start optimizing code in a really efficient way.
If you want to make that function really fast, you must keep all processing data in a memory block smaller than 4 KB (it's possible to do it in a block smaller than 1 KB).
It's possible to speed this function up at least 4 times; in the normal case it uses ~10% of the CPU time used by TFS (for comparison, walking uses up to 30%).

This occurs because T is stored row by row, so it is laid out in memory like this:
T[0][0], T[0][1], T[0][2], ..., T[999][998], T[999][999].

So when you access its memory in the first loop, the region you are using is already in cache, because when the processor pulls T[0][0] into cache, T[0][1], T[0][2], ... come along with it.

In the second loop you write T[0][0], then T[1][0], and these two memory locations have a thousand ints between them (the whole T[0] row), so the processor has to go out to memory to get it, which takes more time (more cache misses).

I could be wrong

Edit--

I've recently changed a couple of std::lists to std::vector in TFS and that alone worked like magic for CPU usage, but unfortunately when the vector gets reallocated we get some crashes because the pointers are invalidated.
 

Ezzz

Developer
Joined
Feb 26, 2010
Messages
1,629
Reaction score
353
Location
Quito, Ecuador
This post is interesting, so in my Tibia 7.7 recreation project, which has been slowly building up over these last weeks, I wrote an A* from scratch, and I wanted to benchmark it by spam-clicking on the map and running the search loop 1000 times.



Now, after adding code to send a teleport effect on every node that is checked while searching for the best predecessor among the neighbor nodes:


(CPU usage increased by a small margin due to sending the extra teleport effects; without them the results match the first image. The red area is the start area, the yellow area is the spot clicked on the map.)

The reported interval is computed with:
Code:
std::cout << "> Map loading time: " << (g_game.serverMilliseconds() - micros) / (1000.) << " seconds." << std::endl;
This post might be useless here, but I found the roughly 0.010 s I'm seeing from these tests interesting.
The view range of the A* is 15 SQMs in both the X and Y directions.

Are my results okay, or what should I do to benchmark my own A*? Or am I getting good results because of what Kondrah said a few posts earlier?
 
Last edited:

Yamaken

Pro OpenTibia Developer
Premium User
Joined
Jul 27, 2013
Messages
454
Reaction score
320
This occurs because T is stored row by row, so it is laid out in memory like this:
T[0][0], T[0][1], T[0][2], ..., T[999][998], T[999][999].

So when you access its memory in the first loop, the region you are using is already in cache, because when the processor pulls T[0][0] into cache, T[0][1], T[0][2], ... come along with it.

In the second loop you write T[0][0], then T[1][0], and these two memory locations have a thousand ints between them (the whole T[0] row), so the processor has to go out to memory to get it, which takes more time (more cache misses).

I could be wrong

Edit--

I've recently changed a couple of std::lists to std::vector in TFS and that alone worked like magic for CPU usage, but unfortunately when the vector gets reallocated we get some crashes because the pointers are invalidated.
Where did you change std::lists to std::vector?
 

DemShion

Member
Joined
Apr 8, 2013
Messages
82
Reaction score
1
Location
Mexico
For anyone curious, this is an extract from the following link, What is a "cache-friendly" code?, provided by @kondra (thanks a lot).

Exploiting the ordering (e.g. changing the column index first in C++):

M[0][0] (memory) + M[0][1] (cached) + M[1][0] (memory) + M[1][1] (cached)
= 1 + 2 + 3 + 4
--> 2 cache hits, 2 memory accesses
Not exploiting the ordering (e.g. changing the row index first in C++):

M[0][0] (memory) + M[1][0] (memory) + M[0][1] (memory) + M[1][1] (memory)
= 1 + 3 + 2 + 4
--> 0 cache hits, 4 memory accesses
In this simple example, exploiting the ordering approximately doubles execution speed (since memory access requires many more cycles than computing the sums). In practice the performance difference can be much larger.
 

Togu

Active Member
Joined
Jun 22, 2018
Messages
180
Reaction score
66
Location
Brazil
How I made my own pathfinder for a Unity top-down game with north, east, west and south move directions:

Learn the theory and write your own pathfinder function. Mine is called "GetShortestPath" and you can call it every time the target moves or every time your character hits a moving obstacle like a monster or player.

The TFS 1.3 pathfinder is bugged for fast monsters, and people are offering $100 on GitHub for a revamp.
 

Ahilphino

Veteran OT User
Joined
Jun 5, 2013
Messages
1,291
Reaction score
359
The TFS 1.3 pathfinder is bugged for fast monsters, and people are offering $100 on GitHub for a revamp.
That's an easy fix, you change 2 lines of code and the monsters work as they should lol
and it's also available to the public
 