This occurs because T is stored in row-major order (row by row, not column by column), so it is laid out in memory like this:
T[0][0], T[0][1], T[0][2], ..., T[1000][999], T[1000][1000].
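
If you want to check that layout yourself, here is a minimal C sketch. It assumes T is a plain two-dimensional array of int with indices 0..1000 in each dimension; the names N, flat_offset and is_row_major are made up for this illustration and are not from the original code.

```c
#include <assert.h>
#include <stddef.h>

#define N 1001  /* assumption: indices run 0..1000, as in the listing above */

static int T[N][N];  /* stand-in for the T discussed above (element type assumed) */

/* Distance, in elements, from the start of T to T[i][j]. */
static size_t flat_offset(size_t i, size_t j)
{
    assert(i < N && j < N);  /* stay inside the array */
    const char *base = (const char *)T;
    const char *elem = (const char *)&T[i][j];
    return (size_t)(elem - base) / sizeof T[0][0];
}

/* Returns 1 if every T[i][j] sits at offset i*N + j (row-major order), else 0. */
static int is_row_major(void)
{
    for (size_t i = 0; i < N; i++)
        for (size_t j = 0; j < N; j++)
            if (flat_offset(i, j) != i * N + j)
                return 0;
    return 1;
}
```

Calling flat_offset(0, 1) gives 1 and flat_offset(1, 0) gives N: moving to the next column steps one element forward, while moving to the next row jumps a whole row ahead.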
So when you access its memory in the first loop, the region you are using is already in cache because when the processor...