Building a server maintenance plan
A good starting point is to classify your maintenance activities according to what each one is meant to achieve, and to move on from there. In this article, we will split them into three areas.
First, we’ll look at the actions you need to take when there is an emergency: call it an emergency response plan. These include getting alerts when something goes wrong and being able to restore service rapidly.
Next, we will consider steps you can take to prevent emergencies from occurring in the first place. For example, you can proactively run security checks, analyse performance numbers and check the usage of your server resources.
Finally, we will look at some actions that act as a type of insurance in case you experience a server problem. These activities, including auditing your backups and doing fail-over checks, will make sure you can rapidly restore your server if the need arises.
Responding to emerging problems: what you need to look out for
Different vehicles have different points of failure: a rocket has likely failure points that are very different from those on a racing bike. In the same way, different servers have different root causes for failure: the reasons why a mail server could fail are very different from the reasons why a web server will fall over.
For this reason we can’t suggest a single plan that tells you exactly what you need to monitor to make sure you respond quickly in an emergency. Instead, we’ll guide you in the right direction by outlining what you should consider. We will use a web server as a typical example.
Problems with server capacity and user demand
Your server is not built to manage unlimited demand: it has a capacity limit. Sometimes demand can rise unexpectedly: perhaps someone sent out a wildly popular email to a million people, or something on social media triggers demand. This can cause memory overload, disks that can’t respond and a server which does not serve pages.
Similarly, in environments where hosting is shared, some users can run applications which draw an enormous amount of resources. In fact, some users abuse server resources, whether intentionally or simply by not watching the amount of server load they generate.
Finally, sometimes server overload is caused by coding errors. Scripts that are not well written can leak memory and cause other problems with resources. As part of your server maintenance plan you must watch out for both scripts and users that consume more than their fair share of server resources, while simultaneously keeping an eye on overall server utilisation.
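As a rough illustration of spotting users who take more than their fair share, the sketch below totals CPU time per user and flags anyone above a chosen share of the total. The function name, the input format and the 50% threshold are all assumptions made for this example; real figures would come from whatever process accounting your server provides.

```python
from collections import defaultdict


def heavy_users(samples, fair_share=0.5):
    """Return {user: share} for users whose share of total CPU time
    exceeds fair_share.

    samples is an iterable of (user, cpu_seconds) pairs. Both the input
    format and the fair_share threshold are hypothetical.
    """
    totals = defaultdict(float)
    for user, cpu_seconds in samples:
        totals[user] += cpu_seconds

    grand_total = sum(totals.values())
    if grand_total == 0:
        return {}

    return {
        user: used / grand_total
        for user, used in totals.items()
        if used / grand_total > fair_share
    }


if __name__ == "__main__":
    demo = [("alice", 120.0), ("bob", 30.0), ("alice", 60.0), ("carol", 30.0)]
    print(heavy_users(demo, fair_share=0.5))
```

The same idea applies to memory or disk I/O: collect per-user totals, compare each against the overall figure, and investigate the outliers.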
Server attacks and malware
We live in an age where server attacks are incredibly common. These can come in several different shapes. For example, bots can try to brute-force their way into your machine, and the thousands of simultaneous queries this involves will cause capacity issues. A successful attack can lead to unauthorised access to your machine.
Malware is another big threat. Software injections via undisclosed and unpatched vulnerabilities can allow hackers to gain entry to your machine, again giving unauthorised access and potentially leading to your server being used as a staging site for attacks on other machines.
Aside from the risks of unauthorised access, including data loss and capacity issues, these attacks can lead to a loss of reputation: your server can be excluded from search engine results and you will find that your traffic drops precipitously. Watch out for attacks as part of your server maintenance plan.
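One simple way to notice a brute-force attempt is to count failed logins per source address. The sketch below assumes log lines in the style OpenSSH writes ("Failed password for ... from ADDRESS"); the function name and the threshold of five attempts are illustrative choices, and your own log format may differ.

```python
import re
from collections import Counter

# Assumed log shape: "... Failed password for [invalid user ]NAME from ADDRESS ..."
FAILED_LOGIN = re.compile(r"Failed password for (?:invalid user )?\S+ from (\S+)")


def suspicious_sources(log_lines, threshold=5):
    """Return {address: failures} for sources with at least `threshold`
    failed logins. The threshold is a hypothetical starting point."""
    counts = Counter()
    for line in log_lines:
        match = FAILED_LOGIN.search(line)
        if match:
            counts[match.group(1)] += 1
    return {address: n for address, n in counts.items() if n >= threshold}
```

A result that is not empty is a prompt to investigate or to block the address, not proof of an attack: a user who has forgotten a password can produce the same pattern.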
Errors and failures
Servers are highly connected devices: both internally on a hardware and software basis and externally. Watch out for network problems, including broken connections to database backends or other apps that your server relies on.
Hardware is another point you need to watch. Ensure that your RAID volume stays healthy, for example, and watch key indicators such as CPU and chassis temperature. Finally, if a redundant power supply fails, replace it immediately, and do the same with RAID volume issues.
In essence you need to monitor server statistics on all levels: network traffic, utilisation, loads and more, so that you can notice when something is unusual. Only then can you investigate further. However, it helps to have a plan that you can put into place when you notice an emergency situation developing.
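Noticing that something is "unusual" needs a baseline to compare against. A minimal sketch, assuming you keep a short history of readings for a metric such as load average: flag the latest reading when it sits more than a few standard deviations from the historical mean. The function name and the three-sigma cut-off are assumptions, and real monitoring tools use more robust methods.

```python
from statistics import mean, stdev


def is_unusual(history, latest, max_sigma=3.0):
    """Return True when `latest` is more than max_sigma standard
    deviations away from the mean of `history`.

    With fewer than two historical readings there is no baseline,
    so nothing is flagged.
    """
    if len(history) < 2:
        return False
    baseline = mean(history)
    spread = stdev(history)
    if spread == 0:
        # A perfectly flat history: any change at all is a deviation.
        return latest != baseline
    return abs(latest - baseline) / spread > max_sigma
```

For example, with a load history of 1.0, 1.2, 0.9, 1.1 and 1.0, a new reading of 1.1 passes quietly while a reading of 4.0 is flagged for investigation.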
Preventative maintenance: the key to avoiding problems
We’ve outlined what you need to be on the lookout for when it comes to monitoring emerging problems, but prevention is better than cure. Again, it depends slightly on what server you are running, but let’s look at some of the preventative maintenance you can add to your server maintenance plan where the server in question is a database server.
Defragment and check indexes and integrity
Databases involve an enormous volume of read and write operations which need to be handled quickly, and as a result a database can become fragmented. Delete queries in particular can lead to fragmentation, which is why it is important to regularly optimize the tables in your database: this reduces the fragmentation that causes performance problems and wastes free space.
Likewise, your preventative server maintenance plan should regularly do an index analysis, since MySQL relies heavily on its indexes. MySQL has an ANALYZE TABLE statement which you should run on a monthly basis to ensure that MySQL can always find data fast. It refreshes the index statistics that the query optimizer uses, which helps queries execute quickly.
Database integrity can be an issue: MySQL sometimes loses track of data sets as a result of database crashes and other app errors. Weekly checks of database integrity can prevent queries from failing, as they provide an opportunity to find and fix errors.
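The three tasks above map onto MySQL's CHECK TABLE, ANALYZE TABLE and OPTIMIZE TABLE statements. The sketch below only builds the statement text for a list of tables so that it can be reviewed before anything runs. The function name and the cadence mapping are assumptions: the weekly check and monthly analysis follow the text, while placing OPTIMIZE TABLE in the monthly slot is a guess for illustration.

```python
# Hypothetical schedule: which statements to run at which cadence.
SCHEDULE = {
    "weekly": ["CHECK TABLE"],
    "monthly": ["ANALYZE TABLE", "OPTIMIZE TABLE"],
}


def maintenance_statements(tables, cadence):
    """Build the maintenance statements for `tables` at a given cadence.

    Table names are wrapped in backticks, with embedded backticks
    doubled, so unusual names do not break the statement.
    """
    statements = []
    for table in tables:
        quoted = "`" + table.replace("`", "``") + "`"
        for verb in SCHEDULE[cadence]:
            statements.append(f"{verb} {quoted};")
    return statements


if __name__ == "__main__":
    for statement in maintenance_statements(["orders", "customers"], "monthly"):
        print(statement)
```

Bear in mind that OPTIMIZE TABLE can rebuild a table and take a while on large data sets, so it is usually run in a quiet period.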
Check disk health and space
Just like database integrity, you can’t take disk health for granted. Always make sure you check your server logs because this is where you will find notices of HDD and RAID errors. These errors offer an indication of looming hard drive or RAID volume failure, giving you the opportunity to replace a drive before it brings down your server.
It’s not unknown for a server to fall over because it has run out of drive space. You must leave room for your database to increase in size, for backups to take place and for large database transactions to get processed. Free up space by removing temporary files, backups which are no longer relevant and other stale data.
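A free-space check is easy to automate. The sketch below uses Python's standard shutil.disk_usage to read the free fraction of a volume and then classifies it; the 20% and 10% thresholds and the function names are assumptions that you should tune to how fast your data grows.

```python
import shutil


def free_fraction(path="/"):
    """Return the fraction (0.0 to 1.0) of the volume holding `path`
    that is still free."""
    usage = shutil.disk_usage(path)
    return usage.free / usage.total


def space_status(fraction_free, warn_below=0.20, critical_below=0.10):
    """Classify a free-space fraction. Thresholds are hypothetical."""
    if fraction_free < critical_below:
        return "critical"
    if fraction_free < warn_below:
        return "warning"
    return "ok"


if __name__ == "__main__":
    fraction = free_fraction("/")
    print(f"{fraction:.0%} free: {space_status(fraction)}")
```

Keeping the measurement separate from the classification makes it simple to apply different thresholds to the data volume and the backup volume.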
Cluster efficiency is important: database clusters should sync efficiently if you want to prevent slow-running queries and database errors. Again, early detection is key as it can prevent a costly database crash.
Scrutinise SQL logs
Your MySQL server will log errors when it finds table corruption or problems with indexes. Auditing your logs will ensure that you get an early warning of possible database failure: an error-filled log is a sure warning sign.
Slow queries are another point to watch out for. Aside from highlighting overall performance issues, the slow query log also indicates which specific queries are causing performance problems, allowing you to tweak these to improve server performance.
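To find the worst offenders, you can pull the slowest statements out of the slow query log. The sketch below assumes the usual text layout, where each entry has a comment line beginning "# Query_time:" followed by the statement itself; the function name is hypothetical and the parsing is deliberately simplified, so lines such as "use dbname;" would be kept as part of the statement.

```python
import re

# Assumed header shape: "# Query_time: 12.500000  Lock_time: ..."
QUERY_TIME = re.compile(r"^# Query_time: ([0-9.]+)")


def slowest_queries(log_text, limit=3):
    """Return up to `limit` (seconds, statement) pairs, slowest first."""
    entries = []
    current_time = None
    sql_lines = []

    for line in log_text.splitlines():
        match = QUERY_TIME.match(line)
        if match:
            # A new entry starts: store the previous one, if any.
            if current_time is not None and sql_lines:
                entries.append((current_time, " ".join(sql_lines)))
            current_time = float(match.group(1))
            sql_lines = []
        elif line.startswith("#") or line.startswith("SET timestamp"):
            continue  # other header lines carry no statement text
        elif current_time is not None and line.strip():
            sql_lines.append(line.strip())

    if current_time is not None and sql_lines:
        entries.append((current_time, " ".join(sql_lines)))

    entries.sort(key=lambda entry: entry[0], reverse=True)
    return entries[:limit]
```

For regular use, a dedicated tool that groups similar statements together will give a clearer picture than a raw top-N list.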
Finally, a monthly health check on your server speeds will give you a record to go back on so that you can detect when your server is starting to experience bottlenecks. You can then fix these bottlenecks more easily before more serious issues emerge.
Overall you will need a degree of server management experience to really understand what it is about server performance that can throw up a red flag, indicating that a potential problem is approaching. Whether you run a web server, a DB server or something else, preventative maintenance is key.
Disaster recovery: building a plan to get up and running
Preventative plans are key to avoiding disaster, but even the best-run server environments occasionally face disasters. How do you respond? Clearly, the most important objective is getting things running again.
With a thoroughly thought-out disaster recovery plan you can be up and running in a minute or less. A turnaround this quick is not necessary for every use case: some website owners will see no great harm if their site is down for an hour or two. For others, every minute of downtime is lost revenue.
There is a wide range of options that can minimize downtime. These include high availability clusters, which are great at ensuring business continuity. Hardware with fault tolerance, including redundant power supplies, can work alongside fail-over mirrors to ensure that hardware failure never results in long downtime.
Crucial to disaster recovery: your backups
Some of the points we mentioned in the previous paragraph are expensive to implement, and outside the reach of many website operators. But one point is crucial to a sane server maintenance plan. It’s to do with your backups.
First, make sure your backups are in fact completing every day. Check for errors and ensure your backup tool reports the right status. Next, you need to check that your backups can be restored: can you retrieve the data, and is there any corruption? Always monitor your available disk space, as a lack of space is a prime reason for backups to fail. Finally, do a test run of the recovery process to verify how long it takes and whether it succeeds at the first attempt. Watch out for unexpected glitches, such as problems with connectivity, that could make a recovery difficult.
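Some of these checks can be scripted. The sketch below tests that a backup file exists, is not empty, is recent, and matches a checksum recorded when the backup was taken. The function names, the 24-hour limit and the idea of storing a SHA-256 checksum alongside the backup are assumptions for illustration. It does not replace a real restore test, which is the only way to confirm that the data can actually be recovered.

```python
import hashlib
import os
import time


def sha256_of(path, chunk_size=1 << 20):
    """Hash a file in chunks so large backups need little memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def backup_problems(path, expected_sha256, max_age_hours=24, now=None):
    """Return a list of problems found with a backup file.

    An empty list means every check passed. `now` can be supplied
    (seconds since the epoch) to make the age check testable.
    """
    if not os.path.isfile(path):
        return ["backup file is missing"]

    problems = []
    if os.path.getsize(path) == 0:
        problems.append("backup file is empty")

    now = time.time() if now is None else now
    age_hours = (now - os.path.getmtime(path)) / 3600
    if age_hours > max_age_hours:
        problems.append("backup is older than expected")

    if sha256_of(path) != expected_sha256:
        problems.append("checksum mismatch")
    return problems
```

Run something like this after every backup job and raise an alert whenever the returned list is not empty.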
Settling on your recovery plan
Finally, in deciding how to set up your recovery plan and how much to invest, you should think carefully through your application’s requirements. Start by thinking about how much downtime you can tolerate: how quickly do you need to restore services before the damage becomes intolerable?
Next, figure out what plans, software and finally what hardware you need to get your disaster recovery plan in place. In doing so you can weigh the trade-offs you can accept against those you cannot. But whatever you do, always ensure you check and verify your backup strategy.
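Putting numbers on tolerable downtime can make this decision easier. The sketch below converts an availability target into a downtime allowance and estimates what an outage costs. The function names, the 30-day period and the revenue figure in the demo are illustrative assumptions, and real losses often include less tangible costs such as reputation.

```python
def allowed_downtime_minutes(availability_percent, period_days=30):
    """Minutes of downtime permitted by an availability target
    over a period of `period_days` days."""
    total_minutes = period_days * 24 * 60
    return total_minutes * (1 - availability_percent / 100)


def downtime_cost(minutes_down, revenue_per_hour):
    """Rough revenue lost during an outage, assuming revenue is
    spread evenly across the hour."""
    return minutes_down / 60 * revenue_per_hour


if __name__ == "__main__":
    budget = allowed_downtime_minutes(99.9)
    print(f"99.9% over 30 days allows about {budget:.1f} minutes of downtime")
    print(f"A 90-minute outage at 200 per hour costs {downtime_cost(90, 200):.2f}")
```

Comparing the estimated cost of an outage with the cost of the hardware or service that would prevent it gives you a concrete basis for the trade-off.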