if you really want to optimize it, i would recommend to start by learning some theory. The problem you are facing is the Shortest path problem (fastest path in this case but that's almost the same) -
Shortest path problem - Wikipedia
Original solution is using something similar to Dijkstra's algorithm (complexity nlog(n)). With your modification you changes it to heuristic algorithm (also wiki page -
A* search algorithm - Wikipedia ). It's gonna help, but in this case, only a bit.
The problem with tfs implementation is that it doesn't use processor cache correctly (it loads much data from memory in random order).
If you don't know what I am talking about, i prepared small example
Code:
#include <chrono>
#include <iostream>
std::chrono::high_resolution_clock::time_point time_point;
uint64_t micros;
int main() {
int T[1000][1000] = {0};
time_point = std::chrono::high_resolution_clock::now();
for(int x = 0; x < 1000; ++x) {
for(int y = 0; y < 1000; ++y) {
T[x][y] = x + y;
}
}
micros = std::chrono::duration_cast<std::chrono::microseconds>(std::chrono::high_resolution_clock::now() - time_point).count();
if(T[11][33] == 0) // without that gcc will remove for as part of optimization coz T won't be used
return 0;
std::cout << "Took " << micros << " microseconds" << std::endl;
time_point = std::chrono::high_resolution_clock::now();
for(int x = 0; x < 1000; ++x) {
for(int y = 0; y < 1000; ++y) {
T[y][x] = x + y;
}
}
micros = std::chrono::duration_cast<std::chrono::microseconds>(std::chrono::high_resolution_clock::now() - time_point).count();
if(T[11][33] == 0)
return 0;
std::cout << "Took " << micros << " microseconds" << std::endl;
return 0;
}
In this code you have 2 loops, doing exactly the same, but execution time is much different.
Code:
Took 409 microseconds
Took 3677 microseconds
First loop is almost 9x faster than second! And as I said, both are doing exactly the same. It's good to know why if you want to start optimizing code in really efficient way.
If you want to make that function really fast, you must keep all processing data in memory block smaller than 4kb (it's possible to do it in smaller than 1kb block).
It's possible to speed up this function at least by
4 times, in normal case it's using ~10% of cpu time used by tfs (for eg. walking is using up to 30%).