TFS 0.X Crashes I'm facing with

kor

Hello.

For the first time on my 5-year-old server I'm hitting a crash (5 times just today) that I can't fix on my own, or even find the cause of. It's probably deliberate: the crashes happened in quick succession, with no connection to any globalevent (such as server save or quest scripts) or to any logged-in player's action (I have a system which records their incoming and outgoing packets). It looks like some already prepared data sent to the gameserver port.

OS: 16.04.6 LTS (Xenial Xerus)
boost: 1.58.0.1ubuntu1
TFS: Fir3element/3777 (https://github.com/Fir3element/3777), built with debugBuild=yes (https://github.com/Fir3element/3777/blob/master/src/configure.ac#L74)

Here's what the console recorded and what gdb printed for each crash:
Crash no. 1 at 2021-04-18 16:24:13
console: just Segmentation fault (core dumped)
gdb: gdb tfs0 core-2021-04-18-16-24-13-356567362 (https://pastebin.com/Jg4cJqTE)

Crash no. 2 at 2021-04-18 16:31:40
console: *** Error in `./tfs0': corrupted size vs. prev_size: 0x0000000001239c80 *** (https://pastebin.com/s7F2cN10)
gdb: gdb tfs0 core-2021-04-18-16-31-40-910861589 (https://pastebin.com/HaLWMbjC)

Crash no. 3 at 2021-04-18 16:32:59
console: *** Error in `./tfs0': corrupted size vs. prev_size: 0x00000000027463a0 *** (https://pastebin.com/U3xHkG1K)
gdb: gdb tfs0 core-2021-04-18-16-32-59-695611728 (https://pastebin.com/at2z8sSF)

Crash no. 4 at 2021-04-18 17:28:58
console: *** Error in `./tfs0': corrupted size vs. prev_size: 0x0000000001e7ff70 *** (https://pastebin.com/EZHGRjvE)
gdb: gdb tfs0 core-2021-04-18-17-28-58-369723171 (https://pastebin.com/SmnSP8qd)

Crash no. 5 at 2021-04-18 17:45:56
console: *** Error in `./tfs0': corrupted size vs. prev_size: 0x0000000001742ad0 *** (https://pastebin.com/Hdqhmyaw)
gdb: gdb tfs0 core-2021-04-18-17-45-56-866334916 (https://pastebin.com/pvXv7Ska)

Can anyone help me find out what happened? Or at least tell me where to add logging of incoming IPs or something similar?

Thanks,
Michal "Gubihe"
 
To get the most information, instead of running bt only on the currently selected thread, you could paste the output of this command: thread apply all bt full
This gives a backtrace of all threads; adding full after bt also prints the values of the local variables.
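
A minimal sketch of collecting that backtrace without an interactive gdb session, assuming gdb is installed on the server. The helper names `gdb_backtrace_command` and `dump_backtrace` are hypothetical; the binary and core file names in the usage note come from the first post.

```python
import subprocess
from typing import List


def gdb_backtrace_command(binary: str, core: str) -> List[str]:
    """Build a gdb call that prints a full backtrace of every thread, then exits."""
    return [
        "gdb",
        "-batch",                           # run the -ex commands, then quit
        "-ex", "thread apply all bt full",  # backtrace plus locals for all threads
        binary,
        core,
    ]


def dump_backtrace(binary: str, core: str, out_path: str) -> None:
    """Run gdb and save everything it prints to a text file."""
    result = subprocess.run(
        gdb_backtrace_command(binary, core),
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,   # keep gdb warnings next to the backtrace
        universal_newlines=True,
        check=False,                # gdb may exit non-zero; still keep the output
    )
    with open(out_path, "w") as fh:
        fh.write(result.stdout)
```

For example, `dump_backtrace("./tfs0", "core-2021-04-18-16-24-13-356567362", "backtrace.txt")` would write a file that can be pasted in full.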
 
@kor
1. Are you sure you are not running out of RAM? Is it possible that someone is attacking you by making XXX.XXX connections to make you run out of RAM?
Do you use a firewall to limit connections per minute per IP?

EDIT:
2. Did you compile the engine on the machine you are running it on, or did you copy the binary from another machine?
3. Did you switch machines, reinstall Linux or update any Linux packages?
 
Hello.

1. My VPS has 2 GB RAM. TFS uses 524 MB, MySQL 421 MB, and PHP together with nginx and node scripts 202 MB. The rest (~850 MB) is free, and some of it sits in buff/cache, which I clear at every server save at 06:00 with sync; echo 1 > /proc/sys/vm/drop_caches. But the graphs show something different at the time of the crashes:
0xM4tfO.png

TCM8wFW.png

So that might be the case. As for the connection limit, I'm using only this iptables entry on that port: iptables -A INPUT -p tcp --syn --dport XXX -m connlimit --connlimit-above 3 -j REJECT

2. Yes, the engine was compiled on the same machine and configuration it runs on.
3. In 2018 I moved to the machine I'm currently on and, according to the apt logs, the last time I updated packages was in Feb 2020.
 
Running close to the RAM limit is the easiest way to get random crashes.
For some reason 4 crashes were in the 'ConnectionManager::createConnection' function and the 5th in another network-related function. The easiest explanation would be an attack with mass connections that makes the server run out of RAM within 1 second. Another possible reason is a bug in some C++ library (boost?), but finding it would be super hard.

The first step would be an upgrade to 4 GB RAM. You can also log new connections (TCPDUMP capture new connections only (https://serverfault.com/questions/798745/tcpdump-capture-new-connections-only)). After a crash you can check whether there was a spike in the number of connections just before it.
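
A rough sketch of that post-crash check, assuming the log holds tcpdump's default one-line text output (not the -X -vv -e format). It counts SYN packets per second and per source host; `count_syns` and the regex are illustrative, not part of any existing tool.

```python
import re
from collections import Counter
from typing import Iterable, Tuple

# Default tcpdump text output for a TCP SYN looks like:
# "18:30:31.803718 IP 1.2.3.4.6694 > server_ip.server_port: Flags [S], seq ..."
# Only pure SYNs ([S]) match; SYN-ACK replies ([S.]) are skipped.
SYN_LINE = re.compile(r"^(\d{2}:\d{2}:\d{2})\.\d+ IP6? (\S+) > \S+ Flags \[S\]")


def count_syns(lines: Iterable[str]) -> Tuple[Counter, Counter]:
    """Return (SYNs per second, SYNs per source host) for tcpdump text lines."""
    per_second = Counter()
    per_host = Counter()
    for line in lines:
        match = SYN_LINE.match(line.strip())
        if not match:
            continue  # not a SYN line in the expected format
        second, source = match.groups()
        host = source.rsplit(".", 1)[0]  # drop the trailing ".port"
        per_second[second] += 1
        per_host[host] += 1
    return per_second, per_host
```

Calling `most_common(5)` on either counter lists the busiest seconds or the most active sources, which is enough to see whether a burst preceded the crash timestamp.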
 
That's true, I had to expand the RAM as well.
 
@Gesior.pl someone attacked me again. He was prepared this time: during the investigation (I have a system which records players' packets to and from the server) I found that someone cloned some items (but that's just a side effect I'm able to revert). It happened 3 times today.

As you asked, tcpdump was running with the command tcpdump -i ens3 "port port_here and tcp[tcpflags] & (tcp-syn) != 0" and here is its output: https://pastebin.com/KCJ9kU2y

I'm still on 2 GB RAM, but this time memory was at a good enough level when the crash happened (the drops are caused by the sync; echo 1 > /proc/sys/vm/drop_caches cron command) and there weren't any spikes while I monitored it live:
sMUxWtM.png


Because the cloning happened, I can rule out a random crash. Also, I've extended tcpdump to tcpdump -i ens3 -X -vv -e "portrange 7000-8000 and tcp[tcpflags] & (tcp-syn) != 0" to check whether it's also fired at other ports.
 
Monitoring checks RAM every X seconds. It cannot detect a 1-second spike.
From that dump I can only read that AFTER the crash people tried to connect to the OTS again, and that the last connection that reached the server and (probably) crashed it was:
Code:
18:30:31.803718 IP 89-64-118-68.dynamic.chello.pl.6694 > server_ip.server_port: Flags [S], seq 3009279448, win 64240, options [mss 1420,nop,wscale 8,nop,nop,sackOK], length 0
but it looks absolutely normal.

How much RAM was allocated by the OTS at the moment of the crash? Check the size of the 'core' file.

About the "Fir3element/3777" engine: I updated it yesterday to make it compilable on Ubuntu 20.04. I have updated many engines, but this one was the worst. It has more errors than 0.4 3777 and OTX2 together. If the compiler reports errors like 'return false from function that should return std::string', there must be 100 other logic errors that the compiler cannot detect.
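
On the point that periodic monitoring misses short spikes: a minimal sketch of a tighter sampler, assuming a Linux kernel that exposes MemAvailable in /proc/meminfo. The function names and the 0.1 s interval are illustrative choices, and even this can miss a spike shorter than the interval.

```python
import time
from typing import Optional


def mem_available_kib(meminfo_text: str) -> Optional[int]:
    """Return the MemAvailable value (in kiB) from /proc/meminfo contents."""
    for line in meminfo_text.splitlines():
        if line.startswith("MemAvailable:"):
            return int(line.split()[1])
    return None  # field not present (very old kernel or not Linux)


def watch_minimum(duration_s: float, interval_s: float = 0.1) -> Optional[int]:
    """Sample /proc/meminfo every interval_s seconds; return the lowest value seen."""
    lowest = None
    deadline = time.monotonic() + duration_s
    while time.monotonic() < deadline:
        with open("/proc/meminfo") as fh:
            value = mem_available_kib(fh.read())
        if value is not None and (lowest is None or value < lowest):
            lowest = value
        time.sleep(interval_s)
    return lowest
```

Running `watch_minimum(60)` in a loop and logging the result with a timestamp would show whether available memory dipped right before a crash, even when a per-minute graph looks flat.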
 
How did you catch the cloners?


I also use this engine...

"I updated it yesterday to make it compilable on Ubuntu 20.04"
What do you mean by this? Do you mean that it can now be compiled on Ubuntu 20.04?
I ask because I'm using it on Debian 9.
"It has more errors than 0.4 3777": what do you mean?

Where are your changes?
 
I made the changes on someone's VPS. I don't have a copy of those sources. If you need a version that compiles on Debian 10 / Ubuntu 20.04, you can create it yourself using this tutorial (error no. 15 occurred for the first time in these sources): [C++/Linux] Compiling old engine (sources) on Debian 10 / Ubuntu 20.04 (https://otland.net/threads/c-linux-compiling-old-engine-sources-on-debian-10-ubuntu-20-04.274654/)
The VPS owner had already made all the changes; he just made a mistake fixing 'error no 1', so it did not compile.

"It has more errors than 0.4 3777" what do you mean?
Compilation fails with new errors that don't occur on 0.4 3777. So the author of that engine added his own new bugs, bugs so bad that the code cannot even be compiled. If he added bugs like that, he probably added 100 others that the compiler does not detect automatically but that may crash the server.
 
@Gesior.pl Indeed, every connection on that game port looks normal, as players were trying to rejoin during server startup between crashes and once it was open. My core files vary from 542 027 776 to 542 183 424 bytes. So when I normally have 400-500 MB free, could it really be an issue? I thought Fir3element was actually the 0.4 3777 version. Should I use the "clean" version of 3777 from Backup of some old sources (https://otland.net/threads/backup-of-some-old-sources.199436/) or migrate to 1.2 as soon as possible?

P.S. Is it possible that the attack targets another port and the engine just got hit by a ricochet?

@eyez As I mentioned, I implemented a "cam" system a long time ago, so I can filter who was online just before a crash and watch their recordings later, as I do, for example, when sending proof to AFK botters.
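
For reference, the core file sizes quoted above can be converted to MiB with a couple of lines. This is only arithmetic on the two reported numbers, taking the core size as a rough indication of the process's memory at crash time, as suggested earlier in the thread.

```python
def bytes_to_mib(size_bytes: int) -> float:
    """Convert a size in bytes to mebibytes (1 MiB = 1024 * 1024 bytes)."""
    return size_bytes / (1024 * 1024)


# Smallest and largest core file sizes reported in the post above.
core_sizes = [542027776, 542183424]

for size in core_sizes:
    print(f"{size} bytes = {bytes_to_mib(size):.1f} MiB")
# prints:
# 542027776 bytes = 516.9 MiB
# 542183424 bytes = 517.1 MiB
```

Both values are close to the 524 MB that TFS was reported to use in normal operation, so the core files by themselves do not show a large allocation at the moment of the crash.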
 
"P.S. Is it possible that the attack targets another port and the engine just got hit by a ricochet?"
Possible? Yes. Again, I would suspect an attack tied to your server's RAM limit: spamming the website, or overloading MySQL through the website, to make them use more RAM. Then some app (the OTS) tries to allocate RAM and crashes.
It keeps crashing on 'new player connection', though. There must be some problem with that code.

First I would move to a newer operating system and a 4 GB machine. If it's somehow system/library related, you will get new versions of the libraries and maybe the bug will fix itself.

Pure 3777 (official otland repo):
Pure 3777 with changes that make it compilable on Debian 10 and Ubuntu 20.04:

Of course converting to 1.2 would be good, but if you want to use the 8.6 protocol, it will require a lot of work. There is no ready-to-run TFS 1.2 for the 8.6 client. There are a lot of versions, but everyone who uses them reports some problems with 8.6 features.
 
Nice, I was about to try moving my 8.6 server to 1.2.
But looking at the forum I found a lot of people asking for help because of bugs (my 0.4 is totally stable, I've got 400 hours of uptime).


Lol, your server made this video?
 
@Gesior.pl Thank you so, so much. For now I will try to run the server on top of your changes with more RAM. I asked about 1.2 because my own migration began 1.5 years ago, but I never managed to finish and test it. Looks like the time has come :D

@eyez Every server is stable until someone takes it down ;) About the video: my server recorded it and "displayed" it for me in the client. I still have to screen-record it and put it on YT, but I have already automated most of it.
 