Eight Steps to Troubleshoot an IT Problem (Without Losing Your Mind)

March 1, 2023

When your monitoring systems start sending a deluge of alerts or your servers suddenly stop responding, it’s easy to go into crisis mode. That’s why Step One of this guide to troubleshooting is to remain calm. Let common sense prevail, be sure to maintain your documentation, and get down to the art of troubleshooting your IT systems. Just follow these eight general guidelines to pinpoint the issue and take steps towards remediation.

1) Calm down and start communicating

Don’t panic. Keep breathing. Grab some snacks and settle in. It’s vital to keep your head, especially during a serious service outage. While you’re radiating calm vibes, start sending out communications about the issue: namely, that you are aware of it, and that you intend to follow up with stakeholders, team members, and users/customers at set intervals, such as within 30 minutes and then every hour until resolution.

Keep your promises when it comes to follow-up notifications. There’s no better way to alienate your users than keeping them in the dark during a service interruption. By offering regular updates, you can also manage expectations and get out ahead of conversations like, “You said this would be fixed already,” as you alert customers to shifting timelines and ongoing efforts to remedy the problem.

2) Document and describe the problem in detail

If you don’t already have a troubleshooting documentation process in place, now is the time to start. Use a simple spreadsheet to describe the issue symptoms (what is happening), when it is happening, what components appear to be affected, which users are encountering the problem, and of course the date and time. If it is obvious, go ahead and note why the problem is occurring (that makes for pretty easy troubleshooting). Note any personnel involved in the troubleshooting process as well.
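If you would rather keep that spreadsheet as a CSV file you can append to from a script, here is a minimal sketch. The column names, the `IncidentEntry` class, and the `append_entry` helper are all made up for illustration, so rename them to match whatever your team already tracks.

```python
import csv
from dataclasses import asdict, dataclass, fields
from pathlib import Path


@dataclass
class IncidentEntry:
    """One row of the troubleshooting log (hypothetical column names)."""
    timestamp: str        # date and time of the observation
    symptoms: str         # what is happening
    components: str       # what appears to be affected
    affected_users: str   # who is encountering the problem
    suspected_cause: str  # leave empty if it is not obvious yet
    personnel: str        # who is working on it


def append_entry(path, entry):
    """Append one entry to the CSV log, writing the header for a new file."""
    path = Path(path)
    is_new = not path.exists() or path.stat().st_size == 0
    columns = [f.name for f in fields(IncidentEntry)]
    with path.open("a", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=columns)
        if is_new:
            writer.writeheader()
        writer.writerow(asdict(entry))
```

Calling `append_entry("incident_log.csv", IncidentEntry(...))` each time you learn something new gives you a timeline you can open in any spreadsheet program afterwards.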

One vital thing to look for is whether any changes were made recently. A software update or component change often leads to problems, and is relatively easy to roll back, assuming you’re using backups. You’re using backups or disaster recovery, aren’t you?

Double-check that everything is plugged in and attempt the classic “turn it off and back on” by power cycling/restarting the system. Check your performance monitoring tools to be certain that it isn’t just a heavy system load causing problems. Be as specific as possible and investigate the symptoms for yourself. Which brings us to…

3) Replicate the error conditions

While some errors may prove difficult to recreate, the majority are simple to reproduce (plug in a device and the system doesn’t recognize it, for instance). If you learned of the error from a third party, follow the same steps they took to trigger the error.

What devices are involved? Does the error persist with different end clients? Internal network or public internet? Did any significant events happen before/during the problem?

If you are troubleshooting remotely, try to get as specific a description as you can from the person reporting the problem. Walk them through the steps they took to reach the problem again, if you can, or use a remote desktop control app to gain access to their system.

4) Gather supporting information and logs

Get screenshots or copies of any error messages that appear or originally appeared, including any reference codes or links to knowledge base articles that they may include. Event logs from Windows Event Viewer, VMware logs, or other relevant sources can help you out by providing a timestamp and clues as to what process may have caused the breakdown.
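Once you have exported a log to plain text, narrowing it to the minutes around the incident saves a lot of scrolling. The sketch below assumes each line starts with an ISO 8601 timestamp without a time zone (for example `2023-03-01T09:05:00`), which is an assumption about your export and not a property of any particular log source. The `lines_near` function is a hypothetical helper.

```python
from datetime import datetime, timedelta


def lines_near(lines, incident_time, window_minutes=10):
    """Return the log lines stamped within window_minutes of incident_time."""
    window = timedelta(minutes=window_minutes)
    matches = []
    for line in lines:
        # Assumed format: "<ISO timestamp> <rest of the message>"
        stamp, _, _ = line.partition(" ")
        try:
            when = datetime.fromisoformat(stamp)
        except ValueError:
            continue  # no parseable timestamp at the start of the line
        if abs(when - incident_time) <= window:
            matches.append(line)
    return matches
```

If your logs use a different timestamp layout, swap `datetime.fromisoformat` for `datetime.strptime` with the matching format string.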

Use any tools you have on hand, even those as simple as ping for network problems. This is the time to leverage performance monitors/system diagnostics, and network/system inventories.
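In the same spirit as ping, a quick scripted check can tell you whether a service is accepting connections at all. Note that this sketch is not ping: it attempts a TCP connection to a specific port instead of sending ICMP echo requests, so it answers a slightly different question (is the service listening?) and needs no special privileges. The `tcp_reachable` function name is hypothetical.

```python
import socket


def tcp_reachable(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        # create_connection resolves the host name and connects in one call
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name resolution failures
        return False
```

For example, `tcp_reachable("intranet.example.com", 443)` (a placeholder host name) returning False while the host still answers ping suggests the machine is up but the web service is not.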

