Optimizing TFS Pathfinding

Discussion in 'Programming & Scripting' started by Flatlander, Mar 20, 2018.

  1. Flatlander

    Flatlander Species Developer

    Joined:
    Feb 17, 2009
    Messages:
    2,457
    Likes Received:
    1,296
    Best Answers:
    2
    I am making this tutorial to go along with my Monster View-Range Tutorial.

    When adding more view-range to a monster, you quickly find your server is constantly freezing due to path-finding. (Especially if you are like me and want wolves to stalk your poor players from 50 tiles away)

    To fix this, I added a bit more optimization to the Pathfinding in TFS.
    Below is how you can also make these changes.
    (I am self-taught; what I do works, but it might not be the most efficient way to do it. If someone knows how to make my code even more optimized, please criticize and teach me.)

    First we need to be able to pull the Target Position while in BestNode to help calculate the true "best node".

    In Creature.cpp edit the following:
    Code (Text):
    bool Creature::getPathTo(const Position& targetPos, std::forward_list<Direction>& dirList, const FindPathParams& fpp) const
    {
        return g_game.map.getPathMatching(*this, dirList, FrozenPathingConditionCall(targetPos), fpp);
    }
    Add targetPos into getPathMatching after *this,
    Code (Text):
    bool Creature::getPathTo(const Position& targetPos, std::forward_list<Direction>& dirList, const FindPathParams& fpp) const
    {
        return g_game.map.getPathMatching(*this, targetPos, dirList, FrozenPathingConditionCall(targetPos), fpp);
    }
    In map.h edit the following:
    Code (Text):
    bool getPathMatching(const Creature& creature, std::forward_list<Direction>& dirList,
                         const FrozenPathingConditionCall& pathCondition, const FindPathParams& fpp) const;
    Add targetPos here too:
    Code (Text):
    bool getPathMatching(const Creature& creature, const Position& targetPos, std::forward_list<Direction>& dirList,
                         const FrozenPathingConditionCall& pathCondition, const FindPathParams& fpp) const;
    And also in map.h edit the following:
    Code (Text):
    AStarNode* getBestNode();
    Add targetPos here too:
    Code (Text):
    AStarNode* getBestNode(const Position& targetPos);

    In map.cpp edit the following:
    Code (Text):
    bool Map::getPathMatching(const Creature& creature, std::forward_list<Direction>& dirList, const FrozenPathingConditionCall& pathCondition, const FindPathParams& fpp) const
    Add targetPos here too:
    Code (Text):
    bool Map::getPathMatching(const Creature& creature, const Position& targetPos, std::forward_list<Direction>& dirList, const FrozenPathingConditionCall& pathCondition, const FindPathParams& fpp) const
    Then edit this getBestNode call in map.cpp:
    Code (Text):
    AStarNode* found = nullptr;
    while (fpp.maxSearchDist != 0 || nodes.getClosedNodes() < 100) {
        AStarNode* n = nodes.getBestNode();
    To include targetPos:
    Code (Text):
    AStarNode* found = nullptr;
    while (fpp.maxSearchDist != 0 || nodes.getClosedNodes() < 100) {
        AStarNode* n = nodes.getBestNode(targetPos);


    And now for the big edit, the getBestNode() function itself:
    Code (Text):
    AStarNode* AStarNodes::getBestNode()
    {
        if (curNode == 0) {
            return nullptr;
        }

        int32_t best_node_f = std::numeric_limits<int32_t>::max();
        int32_t best_node = -1;
        for (size_t i = 0; i < curNode; i++) {
            if (openNodes[i] && nodes[i].f < best_node_f) {
                best_node_f = nodes[i].f;
                best_node = i;
            }
        }

        if (best_node >= 0) {
            return nodes + best_node;
        }
        return nullptr;
    }
    Change it to the below (I have comments on why each line exists):
    Code (Text):
    AStarNode* AStarNodes::getBestNode(const Position& targetPos)
    {
        if (curNode == 0) {
            return nullptr;
        }

        int32_t diffNode = 0; //Difference between nodes

        int32_t best_node_f = std::numeric_limits<int32_t>::max();
        int32_t best_node = -1;
        for (size_t i = 0; i < curNode; i++) {
            //I am not familiar with 1.X so there is probably a better way to get all these variables; the below is untested in TFS 1.X, hope it works
            const int_fast32_t Sx = nodes[0].x; //Starting X position
            const int_fast32_t Sy = nodes[0].y; //Starting Y position
            const int_fast32_t Cx = nodes[i].x; //node[i] X position
            const int_fast32_t Cy = nodes[i].y; //node[i] Y position
            int32_t SdiffX = std::abs(targetPos.x - Sx); //X distance from the starting location
            int32_t SdiffY = std::abs(targetPos.y - Sy); //Y distance from the starting location
            int32_t NdiffX = std::abs(targetPos.x - Cx); //X distance from node[i] position
            int32_t NdiffY = std::abs(targetPos.y - Cy); //Y distance from node[i] position
            diffNode = ((nodes[i].f + (((NdiffX - SdiffX) * 5) + ((NdiffY - SdiffY) * 5))) + (std::max(NdiffX, NdiffY) * 5)); //I messed around with this formula a lot to try to get the best performance; for me this works the best
            if (openNodes[i] && diffNode < best_node_f) { //Compares the formula above with the current best node
                best_node_f = diffNode; //Sets the new value to beat
                best_node = i;
            }
        }

        if (best_node >= 0) {
            return nodes + best_node;
        }
        return nullptr;
    }
    The above SHOULD work.
    The only thing that may be done incorrectly is getting pos.x and pos.y from nodes[] (I do this differently in TFS 0.X, but tried to use the same way TFS 1.X does it).
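    To make the scoring easier to reason about, the same formula can be pulled out into a standalone function. This is only an illustrative sketch (the Node and Pos structs here are stand-ins, not TFS's types): it computes f, plus 5 per tile of change in axis distance versus the start, plus 5 per tile of Chebyshev distance left to the target.

    ```cpp
    #include <algorithm>
    #include <cassert>
    #include <cstdint>
    #include <cstdlib>

    // Minimal stand-ins for illustration only (not the real TFS types).
    struct Pos { int32_t x, y; };
    struct Node { int32_t x, y, f; };

    // Score used to pick the "best" open node: the plain A* cost f,
    // plus 5 per tile the node is farther from the target than the
    // start was, plus 5 per tile of Chebyshev distance still remaining.
    int32_t nodeScore(const Node& start, const Node& n, const Pos& target)
    {
        const int32_t SdiffX = std::abs(target.x - start.x);
        const int32_t SdiffY = std::abs(target.y - start.y);
        const int32_t NdiffX = std::abs(target.x - n.x);
        const int32_t NdiffY = std::abs(target.y - n.y);
        return n.f + ((NdiffX - SdiffX) * 5) + ((NdiffY - SdiffY) * 5)
                   + (std::max(NdiffX, NdiffY) * 5);
    }
    ```

    At equal f, a node one step closer to the target scores lower than a node one step away, so the search expands toward the target first; that is where the node-count savings come from.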

    AFTER you do this, you should be able to go to my next tutorial, adding monster ViewRange to monsters.xml so you can have monsters that see farther than screen size.
     
    Last edited: Apr 30, 2018
    Apollos, kingsley666, kito2 and 3 others like this.
  2. kito2

    kito2 https://mtibia.online

    Joined:
    Mar 9, 2009
    Messages:
    3,680
    Likes Received:
    197
    Best Answers:
    1
    Nice!
     
  3. Flatlander

    Flatlander Species Developer

    Joined:
    Feb 17, 2009
    Messages:
    2,457
    Likes Received:
    1,296
    Best Answers:
    2
    I fixed all the code issues on this thread to match the Pull Request I made to TFS.

    This is now tested (Thanks to Xagul) and it is 100% working.
    Mark has expressed some concerns that it may not choose the "best" path; I asked him to give an example of what could go wrong.

    Shouldn't be too hard to fix any issue, I don't think. I double-checked the code and can't find a way we would get a wrong path, so I'll have to wait for more information from him.
     
    Last edited: Apr 30, 2018
    kito2, BahamutxD and Nekiro like this.
  4. kito2

    kito2 https://mtibia.online

    Joined:
    Mar 9, 2009
    Messages:
    3,680
    Likes Received:
    197
    Best Answers:
    1
    Excellent news, I am going to integrate this optimization too.
     
  5. 4drik

    4drik Active Member

    Joined:
    Jun 30, 2014
    Messages:
    153
    Likes Received:
    72
    Best Answers:
    0
    I just tested it on YurOTS.
    First time I made a test for the same path:
    [benchmark screenshot]
    I saw that in the case of shorter distances, the old code is doing better.
    However, it was only one first test.
    So I decided to do the measurement a few more times - adding the result this time and comparing the final number.
    It looked like this:
    [benchmark screenshot]
    Final results:
    New: 0.212 s, 0.143 s, 0.157 s
    Old: 0.112 s, 0.12 s, 0.076 s

    Each time the old bestNode function won.

    Note that I just wanted to check how it works for me, on an old server, which is YurOTS.
     
    kito2 likes this.
  6. Flatlander

    Flatlander Species Developer

    Joined:
    Feb 17, 2009
    Messages:
    2,457
    Likes Received:
    1,296
    Best Answers:
    2
    Interesting, I guess you are going to make me do a testing video huh?

    Also, can you let me know how you tested it?

    And what path(s) did you run?
     
    Last edited: May 28, 2018
  7. Flatlander

    Flatlander Species Developer

    Joined:
    Feb 17, 2009
    Messages:
    2,457
    Likes Received:
    1,296
    Best Answers:
    2
    Below is testing on the Newest TFS Version.
    It is on a fast computer, therefore we needed to find the path a thousand times to get even milliseconds to show up.

    This is using the "Use" function to go to a tile, so this is running a lower number of max-nodes (I think 100 is the default for onUse movement)
    When you see "No Way" it means it could not find a path to that location by the time it reached Max Nodes.


    Maze:
    [benchmark screenshot]

    Wall:
    [benchmark screenshot]

    Open:
    [benchmark screenshot]

    From this, I have determined that my version of "getBestNode" takes around 20-25% more time per call.
    BUT it usually reduces the number of nodes required so greatly that my new optimized system is still faster overall.
     
    kito2 and pbl like this.
  8. kito2

    kito2 https://mtibia.online

    Joined:
    Mar 9, 2009
    Messages:
    3,680
    Likes Received:
    197
    Best Answers:
    1
    Excellent!
     
  9. kondra

    kondra New Member

    Joined:
    Apr 9, 2018
    Messages:
    12
    Likes Received:
    23
    Best Answers:
    0
    If you really want to optimize it, I would recommend starting by learning some theory. The problem you are facing is the shortest path problem (fastest path in this case, but that's almost the same) - Shortest path problem - Wikipedia
    The original solution uses something similar to Dijkstra's algorithm (complexity n log(n)). With your modification you changed it to a heuristic algorithm (also a wiki page - A* search algorithm - Wikipedia). It's going to help, but in this case only a bit.
    The problem with the TFS implementation is that it doesn't use the processor cache correctly (it loads a lot of data from memory in random order).
    If you don't know what I am talking about, I prepared a small example:
    Code (Text):
    #include <chrono>
    #include <iostream>

    std::chrono::high_resolution_clock::time_point time_point;
    uint64_t micros;

    int main() {
        int T[1000][1000] = {0};
        time_point = std::chrono::high_resolution_clock::now();
        for (int x = 0; x < 1000; ++x) {
            for (int y = 0; y < 1000; ++y) {
                T[x][y] = x + y;
            }
        }
        micros = std::chrono::duration_cast<std::chrono::microseconds>(std::chrono::high_resolution_clock::now() - time_point).count();
        if (T[11][33] == 0) // without this, gcc would optimize the loop away since T wouldn't be used
            return 0;
        std::cout << "Took " << micros << " microseconds" << std::endl;
        time_point = std::chrono::high_resolution_clock::now();
        for (int x = 0; x < 1000; ++x) {
            for (int y = 0; y < 1000; ++y) {
                T[y][x] = x + y;
            }
        }
        micros = std::chrono::duration_cast<std::chrono::microseconds>(std::chrono::high_resolution_clock::now() - time_point).count();
        if (T[11][33] == 0)
            return 0;
        std::cout << "Took " << micros << " microseconds" << std::endl;
        return 0;
    }
    In this code you have 2 loops doing exactly the same thing, but their execution times are very different.
    Code (Text):
    Took 409 microseconds
    Took 3677 microseconds
    The first loop is almost 9x faster than the second! And as I said, both do exactly the same thing. It's good to know why if you want to start optimizing code in a really efficient way.
    If you want to make that function really fast, you must keep all processing data in a memory block smaller than 4 KB (it's possible to do it in a block smaller than 1 KB).
    It's possible to speed up this function at least 4 times; in the normal case it uses ~10% of the CPU time used by TFS (e.g. walking uses up to 30%).
     
    Last edited: May 29, 2018
    pepsiman, StreamSide and kito2 like this.
  10. Flatlander

    Flatlander Species Developer

    Joined:
    Feb 17, 2009
    Messages:
    2,457
    Likes Received:
    1,296
    Best Answers:
    2
    First, I want to thank you for trying to help and teach me something I am completely ignorant on.
    I have no idea how to optimize code by using different methods, or how to use hardware (like processor cache) to my advantage.

    Things like this are basically impossible to "find" online.
    If you google "Best way to ~~ in C++" you get thousands of competing theories of how you should do things, so I just end up doing what looks easiest to understand or what I know.


    Now, onto your above post.
    I'm trying to "understand" why the first one would be faster, so I'm guessing:

    In my "Non-Programmer Mind" I visualize it like this.
    T[1000][1000] creates a table inside a table you can put information in.

    Below is why I think the first one runs faster. (Let me know if I am correct)

    Lets say it's just: T[2][2]
    Basically I visualize it as:
    Code (Text):
    t = {
      [1] = {
        [1] = {},
        [2] = {},
      },
      [2] = {
        [1] = {},
        [2] = {},
      },
    }
    Your first code basically grabs one sub-table, then loops through its content.
    But your 2nd run alternates between tables.

    I'm guessing what you mean by processor cache is that when you run:
    T[1][1] = 1 + 1
    which is (T[x][y] = x + y),
    it pulls the sub-table into the processor cache,
    so it now has
    T[1] in cache, and if you reference it again it can pull it super-fast.

    I think the problem is that the nodes function is basically this:
    Code (Text):
    nodes = {
      [0] = {x = 0, y = 0, z = 0, f = 0, prev = 0},
      [1] = {x = 1, y = 1, z = 0, f = 2, prev = 0},
      [2] = {x = 2, y = 1, z = 0, f = 4, prev = 1},
    }
    There isn't a sub-table to loop through, and I'm guessing you are saying to somehow change the table structure and the way it is searched so it can utilize sub-tables, be cached, and be faster.

    I'm not sure how I would even achieve that at first thought, but I can brain-storm it I guess.

    Also, I may be COMPLETELY incorrect on what I think you are trying to teach me, which is why I explained it in as much detail as possible so you can correct me if i'm wrong.

    Let me know if I am right about how processor cache works using my above example.
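    For what it's worth, TFS's nodes are not a Lua-style table of tables: they are one flat C array of structs, so consecutive nodes already sit next to each other in memory, which is exactly the layout kondra's fast loop exploits. A tiny sketch (the Node struct is a stand-in, not the real AStarNode):

    ```cpp
    #include <cassert>
    #include <cstddef>
    #include <cstdint>

    // Stand-in for TFS's AStarNode: a plain struct in a plain C array.
    struct Node { uint16_t x, y; int32_t f; };

    // Byte distance between two consecutive elements of a Node array.
    // For a flat array this is always exactly sizeof(Node), so a linear
    // scan walks memory in order -- the cache-friendly access pattern.
    std::ptrdiff_t strideBytes(const Node* a, const Node* b)
    {
        return reinterpret_cast<const char*>(b) - reinterpret_cast<const char*>(a);
    }
    ```

    So the scan in getBestNode is already sequential; the remaining cache question is how much of each node the scan has to touch, not where the nodes live.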
     
  11. Evil Hero

    Evil Hero Legacy Member

    Joined:
    Dec 12, 2007
    Messages:
    1,099
    Likes Received:
    467
    Best Answers:
    2
    As far as I understand, it's a different case.
    In the faster example you already create the multidimensional array, which means that 1000 x 1000 indices equal 0.
    As you already have a pre-created array and just change values, it's a lot faster than creating the index and assigning its value, which is 2 steps instead of 1 (in C++, if I recall right).

    In Lua you have the same problem:
    table.insert(t, 1) is always slower than t[#t+1] = 1
     
  12. kondra

    kondra New Member

    Joined:
    Apr 9, 2018
    Messages:
    12
    Likes Received:
    23
    Best Answers:
    0
    I am bad at teaching and explaining, but Google is better - What is a "cache-friendly" code?
    Evil Hero, maybe try to change the order first and test it before posting - spoiler - it's not connected. And I'm talking about C++, not Lua.
     
    DemShion likes this.
  13. Evil Hero

    Evil Hero Legacy Member

    Joined:
    Dec 12, 2007
    Messages:
    1,099
    Likes Received:
    467
    Best Answers:
    2
    nvm, I was skimming your post with my main focus on his, so I missed the main point :p
    I just noticed that the indexing is swapped between the two loops; it makes sense now why it's tons slower.
    In that case, jumping between arrays and altering them is of course a lot slower than loading one array, looping through it, and moving on to the next.
    I should read more carefully next time, my fault :)
     
  14. Flatlander

    Flatlander Species Developer

    Joined:
    Feb 17, 2009
    Messages:
    2,457
    Likes Received:
    1,296
    Best Answers:
    2
    So it sounds like I was right.
    When you load an array and loop through it, it is faster than changing which array you are dealing with on each access.

    But I still can't think of how I can apply this type of optimization to getBestNode or nodes.
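    One common way to apply it to a scan like getBestNode is a structure-of-arrays layout: keep the f values, the only field the scan compares, in their own packed array, so the loop walks a small contiguous block instead of striding through whole node structs. This is a hedged sketch with made-up names, not TFS's actual layout:

    ```cpp
    #include <cassert>
    #include <cstddef>
    #include <cstdint>
    #include <limits>

    // Sketch only: a fixed-size node pool where the values the best-node
    // scan reads live in their own packed arrays (structure-of-arrays).
    constexpr size_t MAX_NODES = 512;

    struct NodePool {
        int32_t f[MAX_NODES];  // packed: the scan reads only this array
        bool open[MAX_NODES];
        size_t count = 0;

        size_t add(int32_t cost) {
            f[count] = cost;
            open[count] = true;
            return count++;
        }

        // Cache-friendly scan: 512 4-byte f values are only 2 KB, so the
        // whole array typically stays in L1 cache after the first pass.
        int32_t bestNode() const {
            int32_t best = -1;
            int32_t bestF = std::numeric_limits<int32_t>::max();
            for (size_t i = 0; i < count; ++i) {
                if (open[i] && f[i] < bestF) {
                    bestF = f[i];
                    best = static_cast<int32_t>(i);
                }
            }
            return best;
        }
    };
    ```

    2 KB of f values comfortably fits the "under 4 KB" budget kondra mentioned, whereas scanning whole node structs (parent pointer, coordinates, cost) drags several times more memory through the cache per pass.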
     
  15. kingsley666

    kingsley666 Member

    Joined:
    Jul 24, 2011
    Messages:
    146
    Likes Received:
    23
    Best Answers:
    0
    Hello, I really don't know how the TFS core works, but I was thinking about what you said about optimizing getBestNode(). This is what I thought (I don't know the original code):

    - Depending on the creature's view range, you can draw a square around the creature based on that range. For each sqm you analyze whether it's walkable, to help the algorithm determine the fastest way over walkable sqm; all that information goes into a table. Then you locate the creature and target positions and apply shortest-path theory on graphs.

    - In theory this should reduce the calculations needed to find the best route to the target.
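    The idea above, precompute which tiles around the creature are walkable and then search only inside that square, can be sketched roughly like this (the grid size, names, and the plain BFS are illustrative assumptions, not TFS code):

    ```cpp
    #include <array>
    #include <cassert>
    #include <queue>
    #include <utility>
    #include <vector>

    // Hypothetical grid: (2 * viewRange) tiles per side, filled once with
    // walkability before searching, as suggested above.
    constexpr int N = 8;

    // Plain BFS over the precomputed walkability grid; returns the number
    // of steps from (sx,sy) to (tx,ty), or -1 if unreachable ("No Way").
    int shortestSteps(const std::array<std::array<bool, N>, N>& walkable,
                      int sx, int sy, int tx, int ty)
    {
        std::vector<std::vector<int>> dist(N, std::vector<int>(N, -1));
        std::queue<std::pair<int, int>> q;
        dist[sx][sy] = 0;
        q.push({sx, sy});
        const int dx[] = {1, -1, 0, 0}, dy[] = {0, 0, 1, -1};
        while (!q.empty()) {
            auto [x, y] = q.front();
            q.pop();
            if (x == tx && y == ty) return dist[x][y];
            for (int d = 0; d < 4; ++d) {
                int nx = x + dx[d], ny = y + dy[d];
                if (nx >= 0 && nx < N && ny >= 0 && ny < N
                        && walkable[nx][ny] && dist[nx][ny] == -1) {
                    dist[nx][ny] = dist[x][y] + 1;
                    q.push({nx, ny});
                }
            }
        }
        return -1; // the "No Way" case: unreachable inside the square
    }
    ```

    The appeal is cache behavior: the walkability grid is one small contiguous block, so the search never chases tile pointers. The cost is filling the grid up front even when the path turns out to be short.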
     
  16. darkshin

    darkshin New old member

    Joined:
    Dec 14, 2010
    Messages:
    224
    Likes Received:
    19
    Best Answers:
    0
    Not sure if this will help, but I've tested it and it's working.
    If someone wants to run a benchmark on it,
    somehow my tries only printed 0.0000 ms :(

    I've avoided declaring new variables and copying values on each loop iteration, keeping only pointers to the values.
    I'm not sure of its relevance since I couldn't benchmark it.
    Here it goes:
    Code (Text):
    uint16_t* Sx = nullptr;
    uint16_t* Sy = nullptr;
    uint16_t* Cx = nullptr;
    uint16_t* Cy = nullptr;

    for (size_t i = 0; i < curNode; i++) {
        Sx = &nodes[0].x; //Starting X position
        Sy = &nodes[0].y; //Starting Y position
        Cx = &nodes[i].x; //node[i] X position
        Cy = &nodes[i].y; //node[i] Y position

        diffNode = ((nodes[i].f +
            (((std::abs(targetPos.x - *Cx) - std::abs(targetPos.x - *Sx)) * 5) + //X distance from node[i] minus X distance from the start
            ((std::abs(targetPos.y - *Cy) - std::abs(targetPos.y - *Sy)) * 5))) + //Y distance from node[i] minus Y distance from the start
                (std::max<int32_t>(std::abs(targetPos.x - *Cx), std::abs(targetPos.y - *Cy)) * 5)); //max(X distance from node[i], Y distance from node[i])

        if (openNodes[i] && diffNode < best_node_f) { //Compares the formula above with the current best node
            best_node_f = diffNode; //Sets the new value to beat
            best_node = i;
        }
    }
    PS: Creatures are running away from me lol, must be the position set at getPathMatching xD

    Edit: For some reason, calls from Creature::goToFollowCreature() result in an inverted path. Tested with the current thread code and the one I posted above. My TFS is edited, so if anyone can confirm it would be nice :)
     
    Last edited: May 30, 2018
    StreamSide likes this.
  17. Flatlander

    Flatlander Species Developer

    Joined:
    Feb 17, 2009
    Messages:
    2,457
    Likes Received:
    1,296
    Best Answers:
    2
    Mark made a good point on my pull request:
    If we calculate the "distance cost" when we create the node, we avoid recalculating it for every node each time getBestNode() is run.

    I'll work on this when I get home from work today.
    After this I have no idea how I could optimize it further, any suggestions are welcome.

    I created this optimization because I personally have monsters that can see 50 sqm away and path up/down stairs, so the old TFS pathing was obsolete for this task.
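    Mark's suggestion might look roughly like this (the names and structs are assumptions for illustration, not the actual pull-request code): compute the distance cost once when the node is created, store it on the node, and let getBestNode compare a precomputed sum instead of redoing four abs() calls per node on every scan. Note that the SdiffX/SdiffY terms of the original formula are the same for every node in one search, so they never change which node wins the comparison; they are dropped here.

    ```cpp
    #include <algorithm>
    #include <cassert>
    #include <cstdint>
    #include <cstdlib>

    // Illustration only: 'h' (the distance cost) is filled in once at
    // creation time rather than recomputed on every getBestNode() scan.
    struct Node {
        int32_t x, y;
        int32_t f; // path cost so far
        int32_t h; // precomputed distance cost toward the target
    };

    // Same shape as the thread's formula, minus the constant start terms:
    // 5 per tile of X distance, Y distance, and Chebyshev distance.
    int32_t distanceCost(int32_t x, int32_t y, int32_t tx, int32_t ty)
    {
        const int32_t dx = std::abs(tx - x);
        const int32_t dy = std::abs(ty - y);
        return (dx + dy + std::max(dx, dy)) * 5;
    }

    // Hypothetical creation helper: pay for the abs() calls once here.
    Node createOpenNode(int32_t x, int32_t y, int32_t f, int32_t tx, int32_t ty)
    {
        return Node{x, y, f, distanceCost(x, y, tx, ty)};
    }

    // The best-node scan now only adds two cached ints per node.
    int32_t score(const Node& n) { return n.f + n.h; }
    ```

    The trade-off darkshin describes below is visible here: a few extra bytes stored per node in exchange for a much cheaper comparison inside the hot scan loop.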
     
  18. darkshin

    darkshin New old member

    Joined:
    Dec 14, 2010
    Messages:
    224
    Likes Received:
    19
    Best Answers:
    0
    You sure have some gold hidden, don't you hahaha

    I don't know the getPathMatching function very well, so here's what I understood about where the getBestNode loop fits in:

    The 'node list' is created at this code line

    Code (Text):
    AStarNodes nodes(pos.x, pos.y);
    Then we go to the loop

    Code (Text):
    AStarNode* found = nullptr;
    while (fpp.maxSearchDist != 0 || nodes.getClosedNodes() < 100) {
        AStarNode* n = nodes.getBestNode();
        if (!n) {
            if (found) {
                break;
            }
            return false;
        }

        const int_fast32_t x = n->x;
        const int_fast32_t y = n->y;
        pos.x = x;
        pos.y = y;
        if (pathCondition(startPos, pos, fpp, bestMatch)) {
            found = n;
            endPos = pos;
            if (bestMatch == 0) {
                break;
            }
        }

        uint_fast32_t dirCount;
        int_fast32_t* neighbors;
        if (n->parent) {
            const int_fast32_t offset_x = n->parent->x - x;
            const int_fast32_t offset_y = n->parent->y - y;
            if (offset_y == 0) {
                if (offset_x == -1) {
                    neighbors = *dirNeighbors[DIRECTION_WEST];
                } else {
                    neighbors = *dirNeighbors[DIRECTION_EAST];
                }
            } else if (!fpp.allowDiagonal || offset_x == 0) {
                if (offset_y == -1) {
                    neighbors = *dirNeighbors[DIRECTION_NORTH];
                } else {
                    neighbors = *dirNeighbors[DIRECTION_SOUTH];
                }
            } else if (offset_y == -1) {
                if (offset_x == -1) {
                    neighbors = *dirNeighbors[DIRECTION_NORTHWEST];
                } else {
                    neighbors = *dirNeighbors[DIRECTION_NORTHEAST];
                }
            } else if (offset_x == -1) {
                neighbors = *dirNeighbors[DIRECTION_SOUTHWEST];
            } else {
                neighbors = *dirNeighbors[DIRECTION_SOUTHEAST];
            }
            dirCount = fpp.allowDiagonal ? 5 : 3;
        } else {
            dirCount = 8;
            neighbors = *allNeighbors;
        }

        const int_fast32_t f = n->f;
        for (uint_fast32_t i = 0; i < dirCount; ++i) {
            pos.x = x + *neighbors++;
            pos.y = y + *neighbors++;
            if (fpp.maxSearchDist != 0 && (Position::getDistanceX(startPos, pos) > fpp.maxSearchDist || Position::getDistanceY(startPos, pos) > fpp.maxSearchDist)) {
                continue;
            }
            if (fpp.keepDistance && !pathCondition.isInRange(startPos, pos, fpp)) {
                continue;
            }
            const Tile* tile;
            AStarNode* neighborNode = nodes.getNodeByPosition(pos.x, pos.y);
            if (neighborNode) {
                tile = getTile(pos.x, pos.y, pos.z);
            } else {
                tile = canWalkTo(creature, pos);
                if (!tile) {
                    continue;
                }
            }
            //The cost (g) for this neighbor
            const int_fast32_t cost = AStarNodes::getMapWalkCost(n, pos);
            const int_fast32_t extraCost = AStarNodes::getTileWalkCost(creature, tile);
            const int_fast32_t newf = f + cost + extraCost;
            if (neighborNode) {
                if (neighborNode->f <= newf) {
                    //The node on the closed/open list is cheaper than this one
                    continue;
                }
                neighborNode->f = newf;
                neighborNode->parent = n;
                nodes.openNode(neighborNode);
            } else {
                //Does not exist in the open/closed list, create a new node
                neighborNode = nodes.createOpenNode(n, pos.x, pos.y, newf);
                if (!neighborNode) {
                    if (found) {
                        break;
                    }
                    return false;
                }
            }
        }
        nodes.closeNode(n);
    }
    At the start, the only open node is the start node, which owns the pos.xy coordinates.
    So the first getBestNode() will return the starting node, which is already open, and as the loop runs, other nodes are opened by createOpenNode(...).
    Basically, the distance-cost calculation would be done by createOpenNode, and getBestNode would just compare the already-calculated values. Doesn't seem very hard to do.
    The trade-off would be a uint16 variable for each newly opened node vs. less calculation processing.
    Which means... (I'm not sure of the impact)

    I found this about computation vs memory lookup; it's at least interesting:
    Optimizing Code - CPU vs Memory
     
  19. Gesior.pl

    Gesior.pl Mega Noob&LOL 2012 Premium User

    Joined:
    Sep 18, 2007
    Messages:
    1,952
    Likes Received:
    855
    Best Answers:
    14
    Everything you should know about the processor cache (all I know after 'systems programming' at university):
    PHP:
    Latency Comparison Numbers (~2012)
    ----------------------------------
    Processor register - inside processor computing unit, ZERO ( https://en.wikipedia.org/wiki/Processor_register )
    L1 cache reference                  0.5 ns
    Branch mispredict                     5 ns
    L2 cache reference                    7 ns    14x L1 cache
    Mutex lock/unlock                    25 ns
    Main memory reference               100 ns    20x L2 cache, 200x L1 cache
    Read/write to L1 cache is 200 times faster than to RAM.
    Read/write to L2 cache is 14 times faster than to RAM.
    You can't decide what is stored in L1/L2 - it's the processor's decision!
    The processor cache synchronizes data with RAM in blocks. I heard that it uses 4 KB blocks, because loading an X KB block takes exactly the same time as loading 1 byte from RAM.
    In the case of the image below, take a simple program:
    PHP:
    int x = sum + age;
    The processor must load the values of 'sum' and 'age', sum them, and put the result in a processor register temporarily named 'x'.
    1. You tell the processor to load variable 'sum' (RAM address 9000,0000-9000,0003) to use its value. The processor loads it, plus the 4 KB around it (aligned to a block), into cache, and the 'sum' value into a processor register. It takes 100 ns.
    2. You tell the processor to load variable 'age' (RAM address 9000,0004-9000,0005) to use its value. This address is in cache! The processor loads it from L1 cache into a processor register. It takes 0.5 ns!
    3. Summing 2 variables from processor registers and assigning to another processor register should take 1 processor cycle (0.3 ns on a 3 GHz CPU).

    By using variables that will probably be close to each other in memory (table values next to each other, 2 variables initialized at the same time), or by keeping to a few variables (which can be kept in cache; there is room for a few 4 KB blocks), your program can become super-fast!
    So doing a few operations on one variable (0.3 ns per CPU cycle) can be faster than reading another variable in some loop (up to 100 ns).
    In the case of object-oriented programming: all object attributes are allocated at once, so they are next to each other in RAM!
    [memory layout diagram]
    THAT'S ALL I KNOW ABOUT OPTIMIZING CACHE.
     
    Last edited: Jun 6, 2018
    Okke, rickgv, Erexo and 1 other person like this.
  20. Erexo

    Erexo Kage

    Joined:
    Mar 27, 2010
    Messages:
    667
    Likes Received:
    105
    Best Answers:
    5
    While we're on the topic of CPU caching, the instruction cache is also worth mentioning.
    With a little bit of engineering magic, the CPU can determine what the next set of instructions will be and where they are located, so whenever you perform a jump, your CPU already knew about it and loaded the instructions from the destination before you even jumped there! But when the jump destination is determined at runtime (i.e. "read this memory and jump there", a.k.a. delegates), the CPU will have a hard time pre-loading the part of memory you want to jump to before you actually go there (and then it might stall for a tremendous 100 ns to load the instructions from RAM). In "modern" and more "user-friendly" languages like C# this might be detected and correctly redesigned (in mild cases; L1/L2/L3 data caches as well), but I still consider this useful knowledge.
    And the caching thing that Gesior mentioned can be used excellently with a vector of structs instead of objects.
     
    Last edited: Jun 5, 2018
