With the Cursor IDE (https://cursor.com/) you can use all popular AI models and compare results. You pay 20$ per month and get 20-40$ to spend on AI chats. There is also their own model 'composer-2.5-fast', on which you can spend much more (300-400kk tokens, worth ~150$) 'for free' on that 20$ plan.
There is also free access to Claude and Gemini sponsored by Google:
Google Antigravity (https://antigravity.google/), but a free account often hits its weekly limit after 1-3 prompts to Claude (a few times more with Gemini).
How bad can AI go on OTS? One popular OTS implemented a 'prey system' on the server using AI (it used the existing OTC 'prey' module). The day after implementation, CPU usage by Lua jumped from 23% to 50%! Over the next 2 days the owner implemented optimizations with AI, bringing it down to 40% CPU the next day and 30% CPU after 2 days. That's still +7 percentage points of CPU usage for something as basic as a 'prey system'.
Then I spent a few hours and 9$ on Claude 4.6 to rewrite most of the code to C++, but I had to prompt it like 100 times and tell it how to implement every part of the logic in an optimized way. It wasn't "AI, rewrite this Lua crap to make it work fast in C++".
The worst AI code I've seen was 'extra item attributes' code that AI implemented in Container::getItems, which sent an SQL query for every item it iterated over in a backpack, but it lagged the OTS with 1 player online, so it was obvious that it had to be fixed before going to the production server.
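To show why a query inside the item loop hurts, here is a minimal C++ sketch. It is not the actual server code: `FakeDatabase`, `selectOne`, `selectMany` and both `countAttributed...` functions are hypothetical names made up for this example, and the "database" only counts round trips instead of running SQL.

```cpp
#include <cstdint>
#include <string>
#include <unordered_map>
#include <vector>

// Hypothetical stand-in for the SQL database; it only counts round trips.
struct FakeDatabase {
    std::unordered_map<uint32_t, std::string> rows; // item GUID -> extra attribute
    int queryCount = 0;

    // Models one "SELECT ... WHERE item_guid = ?" per call.
    std::string selectOne(uint32_t guid) {
        ++queryCount;
        auto it = rows.find(guid);
        return it != rows.end() ? it->second : std::string();
    }

    // Models one "SELECT ... WHERE item_guid IN (...)" for the whole container.
    std::unordered_map<uint32_t, std::string> selectMany(const std::vector<uint32_t>& guids) {
        ++queryCount;
        std::unordered_map<uint32_t, std::string> result;
        for (uint32_t guid : guids) {
            auto it = rows.find(guid);
            if (it != rows.end()) {
                result.emplace(guid, it->second);
            }
        }
        return result;
    }
};

// Slow pattern: a query for every item while iterating over the backpack.
int countAttributedSlow(FakeDatabase& db, const std::vector<uint32_t>& backpack) {
    int found = 0;
    for (uint32_t guid : backpack) {
        if (!db.selectOne(guid).empty()) {
            ++found;
        }
    }
    return found;
}

// Faster pattern: load everything once, then iterate over memory only.
int countAttributedBatched(FakeDatabase& db, const std::vector<uint32_t>& backpack) {
    const auto attributes = db.selectMany(backpack);
    int found = 0;
    for (uint32_t guid : backpack) {
        if (attributes.count(guid) != 0) {
            ++found;
        }
    }
    return found;
}
```

With a backpack of 4 items, the slow version does 4 round trips and the batched one does 1. On a real server the usual fix is to go further and keep the attributes in memory with the item, so iterating a container never touches the database at all.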
Sadly, most AI bugs in Lua result in random server crashes (like you test a script with 100 players online for 3 days and then you get a random crash at 4 AM). Maybe you have to add to every prompt a link to
[TFS 1.x+] How to NOT write Lua scripts or how to crash server by Lua script — Gesior's blog (https://skalski.pro/2020/06/05/tfs-1-x-how-to-not-write-lua-scripts-or-how-to-crash-server-by-lua-script)
(I replaced WordPress on my blog with my own blog app made with AI)
The best AI OTS code I've seen this year is OTCv8 updated by AI to make it compile into an iPhone app. It took the AI "Qwen 3.6" 3 days running on an Apple Mac Studio with 256 GB RAM (shared with the GPU on Apple). IDK how much you would spend on tokens if you had to pay for GPT/Claude.
I just compared models to check how GPT 5.5 codes vs Claude 4.6/4.8.
TASK: I told the AI to edit a simple Lua script (reward box with random reward) so that the same player won't get the same item 2 times in a row.
Only GPT 5.5 wrote a foolproof script. All other models analysed the script config above, where I have a lot of items configured and each has a unique ID, and relied on that.
GPT 5.5 expected that someone could write a config with repeated item IDs. In that case, checking that there is more than 1 item in the table is not enough to guarantee that anything is left to pick from after we remove items from the table by ID and take a random element.
With the code from all the other AIs, if I edit the config and put in a table with 2 elements that both have the same id, the server would go into an infinite loop in Lua (freezing the whole OTS until a manual restart by the owner).
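The duplicate-ID trap can be avoided by filtering candidates by item ID first, instead of re-rolling until the result differs. Below is a minimal C++ sketch of that idea. It is not the GPT 5.5 script and not TFS API: `Reward` and `pickReward` are hypothetical names, and falling back to a repeated item when the config leaves no other choice is my assumption about the desired behaviour.

```cpp
#include <cstddef>
#include <cstdint>
#include <optional>
#include <random>
#include <vector>

// Hypothetical config entry; a real config may hold more fields.
struct Reward {
    uint16_t itemId;
    uint16_t count;
};

// Returns an index into `rewards`, or std::nullopt when the config is empty.
// Always terminates: there is no "roll again until different" loop.
std::optional<std::size_t> pickReward(const std::vector<Reward>& rewards,
                                      std::optional<uint16_t> lastItemId,
                                      std::mt19937& rng) {
    if (rewards.empty()) {
        return std::nullopt;
    }

    // Compare by item ID, not by table position, so duplicated IDs are handled.
    std::vector<std::size_t> candidates;
    for (std::size_t i = 0; i < rewards.size(); ++i) {
        if (!lastItemId || rewards[i].itemId != *lastItemId) {
            candidates.push_back(i);
        }
    }

    // Assumption: if every entry has the last ID, a repeat is unavoidable,
    // so allow it instead of freezing the server.
    if (candidates.empty()) {
        for (std::size_t i = 0; i < rewards.size(); ++i) {
            candidates.push_back(i);
        }
    }

    std::uniform_int_distribution<std::size_t> dist(0, candidates.size() - 1);
    return candidates[dist(rng)];
}
```

With a config of 2 entries that share one ID, this returns one of them immediately, where a re-roll loop would spin forever.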
All other models - except GPT 5.5 - noticed that there are a few items (different reward chests) a player can use and implemented code that tracks the last random item for each type of chest. I think that's the better interpretation, though a real programmer would ask 'how do you want it to work' instead of implementing code the way he 'thinks' it should work.
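The "last item per chest type" interpretation comes down to keying the remembered item by both player and chest. A minimal C++ sketch, assuming in-memory storage only; `LastRewardTracker` and its methods are hypothetical names, not code from any of the tested models:

```cpp
#include <cstdint>
#include <map>
#include <optional>
#include <utility>

// Remembers the last rewarded item ID separately for each (player, chest) pair.
class LastRewardTracker {
public:
    // Returns the last item this player got from this chest type, if any.
    std::optional<uint16_t> last(uint32_t playerId, uint16_t chestId) const {
        auto it = lastItem_.find({playerId, chestId});
        if (it == lastItem_.end()) {
            return std::nullopt;
        }
        return it->second;
    }

    // Stores (or overwrites) the last item for this player and chest type.
    void remember(uint32_t playerId, uint16_t chestId, uint16_t itemId) {
        lastItem_[{playerId, chestId}] = itemId;
    }

private:
    std::map<std::pair<uint32_t, uint16_t>, uint16_t> lastItem_;
};
```

Tracking a single last item per player instead (one key, not two) gives the other interpretation: opening chest A would then also block that item ID in chest B.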
Also, I ran the same task 2 times with "Auto" and got 2 different results. That's common with AI. It's like a slot machine: you run it X times and pick the best script generated.
The biggest problem of all AIs is generating too many lines of code. You have to describe every part of the logic and how to implement it to get short, optimized code.
My test results:
Script before changes:
GPT-5.5 high:
Claude 4.6 high:
Claude 4.8 high:
Composer-2.5-fast (available only in Cursor):
"Auto" model by Cursor (it is Composer-2.0-fast or Composer-2.5-fast):
"Auto" model by Cursor (it is Composer-2.0-fast or Composer-2.5-fast):
Final "Auto" code, after I wrote a 4x longer prompt describing all the problems it has to take care of. It generated safe code which uses less CPU than any other AI script above:
How much each test cost:
claude-4.6-opus-high-thinking - 82.5k tokens - 0.16$ by Cursor, official AI price 0.38$
claude-opus-4-8-thinking-high - 112.1k tokens - 0.14$ by Cursor, official AI price 0.31$
gpt-5.5-high - 151.6k tokens - 0.1$ by Cursor, official AI price 0.21$
composer-2.5-fast - 81.1k tokens - ~0.027% of monthly limit by Cursor, official AI price 0.13$
auto by composer - 103.2k tokens - ~0.01% of monthly limit by Cursor, official AI price 0.05$
auto by composer - 106k tokens - ~0.01% of monthly limit by Cursor, official AI price 0.05$