gudan garam
Advanced OT User
If you really want to optimize it, I would recommend starting by learning some theory. The problem you are facing is the shortest path problem (fastest path in this case, but that's almost the same) - Shortest path problem - Wikipedia
The original solution uses something similar to Dijkstra's algorithm (complexity n log(n)). With your modification you changed it into a heuristic algorithm (also a wiki page - A* search algorithm - Wikipedia). It's going to help, but in this case only a bit.
The problem with the tfs implementation is that it doesn't use the processor cache well (it loads a lot of data from memory in random order).
If you don't know what I am talking about, I prepared a small example:
Code:
#include <chrono>
#include <iostream>

std::chrono::high_resolution_clock::time_point time_point;
uint64_t micros;

int main()
{
    static int T[1000][1000] = {0}; // static, so the 4 MB array doesn't overflow the stack

    time_point = std::chrono::high_resolution_clock::now();
    for(int x = 0; x < 1000; ++x)
    {
        for(int y = 0; y < 1000; ++y)
        {
            T[x][y] = x + y;
        }
    }
    micros = std::chrono::duration_cast<std::chrono::microseconds>(std::chrono::high_resolution_clock::now() - time_point).count();
    if(T[11][33] == 0) // without this check gcc would remove the loops during optimization, since T would otherwise be unused
        return 0;
    std::cout << "Took " << micros << " microseconds" << std::endl;

    time_point = std::chrono::high_resolution_clock::now();
    for(int x = 0; x < 1000; ++x)
    {
        for(int y = 0; y < 1000; ++y)
        {
            T[y][x] = x + y;
        }
    }
    micros = std::chrono::duration_cast<std::chrono::microseconds>(std::chrono::high_resolution_clock::now() - time_point).count();
    if(T[11][33] == 0)
        return 0;
    std::cout << "Took " << micros << " microseconds" << std::endl;
    return 0;
}
In this code you have 2 loops doing exactly the same thing, but their execution times are very different.
The first loop is almost 9x faster than the second! And as I said, both do exactly the same thing. It's good to understand why if you want to optimize code really efficiently.
Code:
Took 409 microseconds
Took 3677 microseconds
If you want to make that function really fast, you must keep all processing data in a memory block smaller than 4 KB (it's possible to fit it in less than 1 KB).
It's possible to speed this function up by at least 4x; in a normal case it uses ~10% of the CPU time used by tfs (for comparison, walking uses up to 30%).
This occurs because T is stored row by row, not column by column, so it is laid out in memory like this:
T[0][0], T[0][1], T[0][2], ..., T[999][998], T[999][999].
So when you access its memory in the first loop, the region you need is already in cache: when the processor pulls T[0][0] into a cache line, T[0][1], T[0][2], ... come along with it.
In the second loop you write T[0][0], then T[1][0], and these two locations have a thousand ints between them (the whole T[0] row), so the processor has to go out to main memory to fetch them, which takes more time (more cache misses).
I could be wrong
Edit--
I've recently changed a couple of std::lists to std::vector in tfs and that alone worked like magic for CPU usage, but unfortunately when a vector gets reallocated we get crashes because pointers into it are invalidated.