Some time ago I was dealing with brute-force attacks, and during that time, I thought it was fun to outwit my attackers. Admittedly, for a while, it was. Down the road I went, with CloudFlare to handle most bots, as well as some general security measures of my own, such as moving my login, disabling XML-RPC, installing WangGuard, and a few other scripts I wrote myself.
Lately, I’ve been getting more into actual system administration and learning the ins and outs of a Linux server environment. I started out with Apache (XAMPP) and evolved into a full-blown dedicated system in Canada. This server holds two Minecraft servers, my remote development environment, my personal website, and a few random databases I use for various side projects.
On Wednesday March 30th, 2016, my MySQL database filled up, thanks to a sizable database file (1-2 GB) from one of our clients. Historically, during development, I try to mimic the live site of a client as closely as possible. This ensures there aren’t any data integrity issues and guarantees I’m not missing anything.
Well, I realize 1-2 GB isn’t that large when it comes to a database. But I had been working on multiple other projects at the time, plus my personal data, and on top of that, a properly configured Minecraft server with the right logging software can generate a significant amount of data in the database all on its own. The disk simply ran out of room.
Well. Whoops. Server apocalypse.
How did this happen?
It was a culmination of multiple things. The first was that the MySQL database was mounted in the WRONG location. The second, less controllable, external factors were the Minecraft server writing to the database, a few cron jobs I had running, and the new client database that was imported.
On to my next mistake! After a lengthy call with Parbs (in my opinion, he is THE guy to go to for server problems), we came up with a solution: move the /var folder over to /home/var, which is where the bulk of my free space existed. After re-initializing MariaDB, I saw my main site come up. Whoa, it worked!
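In the abstract, a move like that amounts to copying the data onto the roomier partition, verifying the copy, and only then touching the original. Here is a sandboxed sketch of that order of operations (the temp directories stand in for /var and /home; in reality you would stop mysqld before copying its data directory):

```shell
# Sandboxed sketch: temp dirs stand in for /var and /home.
# In reality, stop mysqld before copying its data directory.
old_var="$(mktemp -d)"      # stands in for /var
new_parent="$(mktemp -d)"   # stands in for /home

mkdir -p "$old_var/lib/mysql"
echo "ibdata" > "$old_var/lib/mysql/ibdata1"

# 1. Copy, preserving permissions, ownership, and timestamps.
cp -a "$old_var" "$new_parent/var"

# 2. Verify the copy BEFORE removing or renaming the original.
diff -r "$old_var" "$new_parent/var" && echo "copy verified"
```

The key point is the ordering: nothing deletes or renames the source until the copy has been verified byte for byte.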
Of course, I checked a few other sites and still saw a database error, but I wrote it off as nothing, because it was the end of the day and I wanted to get some R&R in.
Due to my never-ending quest for knowledge, later that night I returned to try to figure out why the other sites were offline (database issues) while my main site was fine. It NEVER occurred to me that the tables may have crashed.
My process went like so:
- Google the shiz outta my problem
- Proceed to run random commands from StackOverflow…that were from 2006
- Kill the server
So how did I kill the server?
Well, first off, if you ever want to break something, take the advice of the internet at face value and do no investigation on your own! That’s pretty much the easiest way to destroy something. I don’t do it when it comes to code, so why I did it for the server issue, I’ll never know.
I ended up mounting a folder ONTO itself (which, up until this point, I didn’t even know was possible) with symlinks. In the end, I thought, “Oh, I don’t need this symlink,” and simply ran:

rm -fR /home/var

That deleted my MAIN /var folder, which by default is where MySQL stores its data.
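For the record, the difference between removing a symlink and destroying what it points to is easy to demo safely. A sandboxed sketch (temp paths only, nothing real at stake):

```shell
# Sandboxed demo: deleting a symlink vs. destroying its target.
tmp="$(mktemp -d)"
mkdir "$tmp/var"
echo "precious" > "$tmp/var/data"
ln -s "$tmp/var" "$tmp/var_link"

# Safe: a plain rm on the link name removes only the link itself;
# no -R flag, no recursion into the target.
rm "$tmp/var_link"

# The target directory and its contents survive.
cat "$tmp/var/data"
```

Had I aimed a plain `rm` at the link instead of a recursive `rm -fR` at the path holding the real data, the /var contents would have survived.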
Well, you can imagine what happens when you delete 20GB worth of data with ZERO backups.
My main site to this day is still offline. I ended up having to flex some time so I could, at minimum, get my development environment online, and of course, my gaming servers (with 100+ players) were offline for two days. As I’m sure you could predict, I had some very unhappy people.
It was during this time I realized I knew nothing about Nginx vs. Apache. I’m an Apache guy, but making the move to Nginx after the server wipe seemed, at least at the time, the logical thing to do. I ended up staying up until 3:30 am the day the apocalypse hit, then spent a chunk of the following day bringing my development environment back online. That consisted of me trying and failing to set up Nginx server directives, restarting the server hundreds of times, and finally uploading a 20 GB database and about 130 GB worth of files, which ate my bandwidth all day.
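For anyone in the same boat, this is roughly the kind of Nginx server block I was fumbling with: a minimal PHP site config (the domain, root path, and php-fpm socket path are placeholders, not my actual setup):

```nginx
# Hypothetical minimal server block for a PHP site.
# Domain, root, and socket path are placeholders.
server {
    listen 80;
    server_name example.com;
    root /var/www/example.com;
    index index.php index.html;

    location / {
        # Fall back to index.php for pretty permalinks.
        try_files $uri $uri/ /index.php?$args;
    }

    location ~ \.php$ {
        include fastcgi_params;
        fastcgi_pass unix:/run/php/php-fpm.sock;
        fastcgi_param SCRIPT_FILENAME $document_root$fastcgi_script_name;
    }
}
```

Coming from Apache, the mental shift is that there is no .htaccess: everything lives in these server blocks, and every change needs a config reload.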
What I Learned
Don’t touch anything without talking to Brad Parbs first!
In all seriousness: We say it again and again, and we hear it again and again, but even those of us who are professional devs can forget the importance of one very crucial thing: BACKUPS!!! Get a backup system that works. If you’re on a managed server, more than likely you already have this. I, however, was not, and felt the agony of a server-pocalypse.
One thing I’ve had trouble finding is a good system backup/restore system for remote servers. Anyone out there have a recommendation?
2 thoughts on “How to Avoid a Server Apocalypse”
Many people tend to say “backup/restore” when talking about mitigating risk and preparing for “disaster recovery.” But this should be considered a strategy rather than a solution, and therefore it is always a set of actions addressing the risks most relevant to you and/or meeting compliance and governance needs (think of hardware failure, a corrupted file system, “infected” applications, DBs, or whatever comes to mind).
You recently made your very own experience. 😉
Keep it simple, be smart.
Invest 3-4 days in learning the basics of IaC and start to utilize Ansible, Chef, Puppet, SaltStack, … (personally, I prefer Ansible, as it uses “push” instead of “pull” = agentless = more control). Then you have a defined (and thereby fully documented) environment, allowing you to spin up disposable development or test environments (good for troubleshooting too) or to (re)create a reliable system that went down for whatever reason.
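As a rough illustration of what the commenter is describing, a minimal Ansible playbook to rebuild a web box might look like this (the host group, package names, and template path are placeholders, not a production setup):

```yaml
# Hypothetical minimal Ansible playbook: rebuild a web server from scratch.
- hosts: webservers
  become: true
  tasks:
    - name: Install base packages
      apt:
        name: [nginx, mariadb-server, php-fpm]
        state: present
        update_cache: true

    - name: Deploy the nginx site config from a template
      template:
        src: templates/site.conf.j2
        dest: /etc/nginx/sites-enabled/site.conf
      notify: reload nginx

  handlers:
    - name: reload nginx
      service:
        name: nginx
        state: reloaded
```

The point is that the playbook itself is the documentation: running it against a fresh machine reproduces the environment, which is exactly what a post-wipe rebuild needs.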
Sure, depending on the reason that led to the outage, you will want to change some config (SSH keys, passwords, banned IPs, open ports, …) prior to rebuilding the system, but compared to any other solution you will be up and running again in no time, with all required changes already documented in your IaC code and pushed out to a fresh droplet. Obviously, backing up the DB(s) using cron jobs (you can also utilize WP-CLI, for those who want to rely on another tool in the workflow) should be one of the things configured via IaC. Another benefit is that your customers will love you for it too, not to mention it opens new ways to serve and support your clients by selling added value.
As a starting point on how a complete WP install could look utilizing IaC, check https://github.com/roots/trellis. Should you want to go with Puppet, check https://puphpet.com/ to start “playing” (I wouldn’t use the generated scripts in production).
If utilizing IaC seems to be too much overhead, please trust my roughly 20 years of experience as a consultant in enterprise IT: there is no single “backup/restore” solution covering each and every aspect without heavy investment in infrastructure (an NTP server, a fail-over cluster so you can take snapshots, massive space for storing snapshots, logs, system files, applications, and related configuration, all kept separate, as you will never know whether they are “infected” in case you get hacked) and time (adding complexity requires a lot of well-defined processes).
By the way: even those who go down that path will learn that every manufacturer recommends performing a clean install, which brings us back to IaC as the way to accomplish that task reproducibly.
Would you care to elaborate on the meaning of IaC? This scenario, which is really an event given that it did happen, was purely due to my ignorance and my assumption that the internet had all the answers. I’m not saying it didn’t, but I read a post which (at the time) seemed pretty convincing, and I followed through.
I completely agree that a backup is not a solution but a strategy. You could back up your system to hundreds of servers, all at the same time, and there’s still that ONE chance that something can go wrong. In my case, it would have been great to have a backup, even a slightly older one. However, none of my personal code was stored in a repo, the theme was lost, and the DB had zero backups, which led to the obvious downfall.
For now, I’ve resorted to weekly backups of the databases and game servers. While this isn’t the optimal solution, as you outlined, it’s more than sufficient for my little ol’ setup.
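For what it’s worth, the rotation part of a setup like that fits in a few lines of shell. A sandboxed sketch with dummy dump files (the commented-out mysqldump line is where the real work would go; names and the keep count are arbitrary):

```shell
# Sandboxed sketch of backup rotation: keep only the 7 newest dumps.
backup_dir="$(mktemp -d)"

# Create nine dummy dump files standing in for weekly backups.
for i in 1 2 3 4 5 6 7 8 9; do
  echo "dump $i" > "$backup_dir/backup-$i.sql"
done

# In a real cron job, a fresh dump would land here first, e.g.:
#   mysqldump --all-databases | gzip > "$backup_dir/backup-$(date +%F).sql.gz"

# Delete all but the 7 most recently modified dumps.
ls -t "$backup_dir"/backup-*.sql | tail -n +8 | xargs -r rm --
```

Paired with a crontab entry pointing at the script, that gives a bounded amount of history without the backup directory growing forever.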