11.30.2017

Eight Steps to Troubleshoot an IT Problem (Without Losing Your Mind)

Last updated: 9.16.2020

When your monitoring systems start sending a deluge of alerts or your servers suddenly stop responding, it’s easy to go into crisis mode. That’s why Step One of this guide to troubleshooting is to remain calm. Let common sense prevail, be sure to maintain your documentation, and get down to the art of troubleshooting your IT systems. Just follow these eight general guidelines to pinpoint the issue and take steps towards remediation.

 

1) Calm down and start communicating

Don’t panic. Keep breathing. Grab some snacks and settle in. It’s vital to keep your head, especially during a serious service outage. While you’re radiating calm vibes, start sending out communications about the issue: namely, that you are aware of it, and that you will follow up with stakeholders, team members, and users/customers at set intervals, such as within 30 minutes, or every hour until resolution.

Keep your promises when it comes to follow-up notifications. There’s no better way to alienate your users than keeping them in the dark during a service interruption. By offering regular updates, you can also manage expectations and get out ahead of conversations like, “You said this would be fixed already,” as you alert customers to shifting timelines and ongoing efforts to remedy the problem.
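If it helps to make those promises concrete, the update times can be worked out the moment the outage is detected. This is only a minimal sketch: the function name `update_schedule`, the 30-minute first update, and the hourly follow-ups are assumptions for illustration, so adjust them to whatever intervals you actually commit to.

```python
from datetime import datetime, timedelta


def update_schedule(start, first_after_minutes=30, every_minutes=60, count=4):
    """Return the times at which status updates are due.

    The first update is due `first_after_minutes` after `start`;
    each later update follows `every_minutes` after the previous one.
    """
    first = start + timedelta(minutes=first_after_minutes)
    return [first + timedelta(minutes=every_minutes * i) for i in range(count)]


if __name__ == "__main__":
    detected = datetime(2017, 11, 30, 9, 0)  # example detection time
    for due in update_schedule(detected):
        print(due.strftime("%H:%M"), "send status update")
```

With the defaults, an outage detected at 09:00 yields updates due at 09:30, 10:30, 11:30, and 12:30.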

 

2) Document and describe the problem in detail

If you don’t already have a troubleshooting documentation process in place, now is the time to start. Use a simple spreadsheet to describe the issue symptoms (what is happening), when it is happening, what components appear to be affected, which users are encountering the problem, and of course the date and time. If it is obvious, go ahead and note why the problem is occurring (that makes for pretty easy troubleshooting). Note any personnel involved in the troubleshooting process as well.
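If a spreadsheet feels too loose, the same fields can be captured in a small script that writes CSV, which any spreadsheet tool can open. Treat this as a rough sketch: the `IncidentRecord` class, its field names, and the sample values are made up for illustration and simply mirror the items listed above.

```python
import csv
import io
from dataclasses import asdict, dataclass, fields


@dataclass
class IncidentRecord:
    """One row of troubleshooting documentation (hypothetical field names)."""
    observed_at: str           # date and time the problem was seen
    symptoms: str              # what is happening
    affected_components: str   # what appears to be affected
    affected_users: str        # who is encountering the problem
    suspected_cause: str = ""  # fill in only if it is obvious
    personnel: str = ""        # who is involved in troubleshooting


def write_records(records, stream):
    """Write the records as CSV with a header row."""
    names = [f.name for f in fields(IncidentRecord)]
    writer = csv.DictWriter(stream, fieldnames=names)
    writer.writeheader()
    for record in records:
        writer.writerow(asdict(record))


# Example values are invented for illustration.
record = IncidentRecord(
    observed_at="2017-11-30 09:02",
    symptoms="File server stops responding",
    affected_components="fileserver01, core switch",
    affected_users="Accounting team",
    personnel="J. Smith",
)
buffer = io.StringIO()  # swap for open("incidents.csv", "w", newline="") to save a file
write_records([record], buffer)
print(buffer.getvalue())
```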

One vital thing to look for is whether any changes were made recently. A software update or component change often leads to problems, and rolling it back is a relatively easy fix, assuming you’re using backups. You’re using backups or disaster recovery, aren’t you?

Double check that everything is plugged in and attempt the classic “turn it off and back on” by power cycling/restarting the system. Check your performance monitoring tools to be certain that it isn’t just a heavy system load causing problems. Be sure to be as specific as possible and to investigate the symptoms for yourself. Which brings us to…

 

3) Replicate the error conditions

While some errors may prove difficult to recreate, the majority are simple to reproduce (plug in a device and the system doesn’t recognize it, for instance). If you learned of the error from a third party, repeat the same steps they took to trigger the error.

What devices are involved? Does the error persist with different end clients? Internal network or public internet? Did any significant events happen before/during the problem?

If you are troubleshooting remotely, try to get as specific a description as you can from the person reporting the problem. Walk them through the steps they took to reach the problem again, if you can, or use a remote desktop control app to gain access to their system.
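One way to work through the questions above (which devices, which clients, which network) is to list every combination up front and tick them off as you test. The sketch below does that with `itertools.product`. The function name `reproduction_matrix` and the sample devices, clients, and networks are placeholders, not a prescribed list.

```python
from itertools import product


def reproduction_matrix(devices, clients, networks):
    """Build one test case per device/client/network combination.

    `error_reproduced` starts as None (not yet tested) and should be set
    to True or False once the combination has been tried.
    """
    return [
        {"device": d, "client": c, "network": n, "error_reproduced": None}
        for d, c, n in product(devices, clients, networks)
    ]


# Placeholder values for illustration only.
cases = reproduction_matrix(
    devices=["laptop", "phone"],
    clients=["Chrome", "Firefox"],
    networks=["internal", "public"],
)
for case in cases:
    print(case["device"], case["client"], case["network"])
```

If the error shows up in only some of the rows, the shared factor among them is a good place to look first.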

 

4) Gather supporting information and logs

Get screenshots or copies of any error messages that appear or originally appeared, including any reference codes or links to knowledge base articles that they may include. Event logs from Windows Event Viewer, VMware logs, or other relevant sources can help you out by providing a timestamp and clues as to what process may have caused the breakdown.

Use any tools you have on hand, even those as simple as ping for network problems. This is the time to leverage performance monitors/system diagnostics, and network/system inventories.
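Once you have a timestamp for the failure, it is often useful to pull out just the log lines around that moment. Here is a minimal sketch of that idea. It assumes each log line starts with a `YYYY-MM-DD HH:MM:SS` timestamp, which will not match every log format, and the function name `lines_near` and the sample log lines are invented for illustration.

```python
from datetime import datetime, timedelta

# Assumed timestamp layout at the start of each log line.
TIMESTAMP_FORMAT = "%Y-%m-%d %H:%M:%S"


def lines_near(log_lines, incident_time, window_minutes=5):
    """Return log lines stamped within `window_minutes` of `incident_time`."""
    window = timedelta(minutes=window_minutes)
    selected = []
    for line in log_lines:
        try:
            stamp = datetime.strptime(line[:19], TIMESTAMP_FORMAT)
        except ValueError:
            continue  # skip lines that do not begin with a timestamp
        if abs(stamp - incident_time) <= window:
            selected.append(line)
    return selected


# Invented sample log for illustration.
sample_log = [
    "2017-11-30 08:40:00 INFO backup job finished",
    "2017-11-30 09:01:45 WARN disk latency high",
    "2017-11-30 09:02:11 ERROR service stopped responding",
    "stack trace continuation line",
    "2017-11-30 09:30:00 INFO service restarted",
]
for line in lines_near(sample_log, datetime(2017, 11, 30, 9, 3)):
    print(line)
```

Real logs, such as exported Windows events or VMware logs, use their own timestamp layouts, so the format string would need to change to match.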

5) Lay out the entire system in a clear manner

Use a whiteboard, talk through the problem and associated components with a coworker, or just chicken scratch on a napkin: however you do it, you need a structured and organized method of looking at the big picture. Sometimes this step is all you need to have a Eureka moment.

 

6) Start digging

On to the research portion. If the problem hasn’t become apparent yet, you’ll need to turn to the web and potentially engage vendor support to help pinpoint it. Knowledge bases, web forums, archived help desk tickets, search engine queries, and your fellow engineers can all be of great use in this step.

The supporting information you collected above will come in handy here. If you have specific error messages or system logs, you are likely to find someone else out there who has encountered the same issue.

 

7) Trial and error

After doing some research you should have a few clues as to where to begin to remedy the problem. Your first attempt at fixing it may or may not be successful, so be sure to create a system backup before any major change so that you are ready to roll back. Even if it is a minor fix, you should still document exactly what you’re changing.

This will help you systematically narrow down your troubleshooting and avoid repeating fixes, or combinations of fixes, that you have already tried. Try changing settings, removing or rolling back new software, repairing corrupted system files, defragmenting hard drives, updating system software like drivers or operating systems, or replacing any faulty hardware. Check the DNS and DHCP settings and make sure the firewall or proxies are configured correctly.
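Keeping a running record of each attempted fix can be as simple as the sketch below, which refuses to log the same change twice so you notice when you are about to repeat yourself. The `FixLog` class and the example changes are hypothetical, and a shared spreadsheet or ticket history would serve the same purpose.

```python
from dataclasses import dataclass, field


@dataclass
class FixLog:
    """A running list of attempted fixes and their outcomes."""
    attempts: list = field(default_factory=list)

    def already_tried(self, change):
        """Return True if this exact change was recorded before."""
        return any(a["change"] == change for a in self.attempts)

    def record(self, change, resolved, rollback=""):
        """Record an attempt, or raise ValueError if it is a repeat."""
        if self.already_tried(change):
            raise ValueError(f"already tried: {change}")
        self.attempts.append(
            {"change": change, "resolved": resolved, "rollback": rollback}
        )


# Invented example entries for illustration.
log = FixLog()
log.record("roll back NIC driver", resolved=False, rollback="reinstall current driver")
log.record("restart DHCP service", resolved=True)
for attempt in log.attempts:
    print(attempt["change"], "->", "resolved" if attempt["resolved"] else "not resolved")
```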

 

8) Talk to a vendor or third party

Hopefully, by now you’ve pinpointed the issue and have figured out a way to resolve it. Still stumped? It’s time to call in the big guns. Open a ticket with the relevant software or hardware vendor, your systems integrator, or your data center provider and see if they can help get things back to normal.

 

Chances are you’ll eventually run into a unique issue, but the majority of IT problems are relatively common and easy to reproduce. Standard troubleshooting steps like a system reboot, reinstalling/updating drivers, or examining network settings are all great first stops. The most important thing is to keep your head and maintain clear communication with those who are affected by the problem. You’ll be back up and running in no time.
