At some point in our lives, we have all had something break, quit working, or at the very least not work the way we expected. After exhausting our own skills trying to fix the problem, we end up calling the repairman – the expert who will help us get back on track. In the IT Ops and IT Security worlds, the repairman is our internal support team, help desk, or manufacturer/vendor support representative. The typical first question: “What changed?” Our typical response: “Nothing” or “I don’t know.” Either response usually results in a frustrated repairman, who now has to investigate everything that may be contributing to the problem. Where to begin? This, my friends, can be a very daunting task in today’s complex IT infrastructures.

For the repairman, or in this case the IT Ops and IT Security teams, this can be a very unpleasant situation. Not only do they have to determine what the problem is and what actions will correct it, they also have to explain it to management or, on the extreme end, to some outside third party, while everyone expects answers. Not having those answers can be a very painful experience.

Understanding and managing change is critical, it is challenging, and it is (or should be) a basic pillar of our IT and Security processes. Yet the process of validating change often lacks the controls and emphasis we apply to other availability requirements for mission-critical, money-making platforms. All organizations (well, most organizations) employ some version of Change Management, whether that is a notepad on someone’s desk, a spreadsheet, or an ITSM platform where changes are documented, tracked, and reviewed by Change Advisory Boards (CABs). We all attempt to prevent “bad” or “unknown” changes, yet unknown and bad changes still occur every day.

The importance of change was impressed upon me very early in my career and firmly set my view on understanding and managing it. I was a junior systems programmer back in the 80s, on a team responsible for ALL platform and application support. The company I worked for provided a 24x7 telephone support service for a wide variety of customers. One of those customers operated a fleet of over-the-road trucks, and our service gave their operators the locations of the nearest semi-truck repair shops. For this particular customer, our service was mission critical: a broken-down truck meant they weren’t delivering goods to their customers, those customers could not supply their own customers, and that business would go elsewhere. It had the potential to be a very impactful and costly situation.

My experience: I was responsible for maintaining our VM environment. This is not the VM environment we all know and love today; this was a mainframe VM platform. Still the same premise, but a little more... shall we say, touchy? Part of my duties was to create VMs for our programmers to test new applications, as well as to run our production applications. I had a new application that I was tasked with evaluating and needed to spin up an environment for testing. Part of that process was finding available disk space where the environment would reside, which back in the day was a complex process. So, I identify some available disk space, format that space, and lay down the required OS and application. Lunchtime rolls around, and I go out for lunch.
Upon my return, I see the Red Light is on in our area, which meant we had a production application down. My manager had the programmer responsible for that application standing at his desk; the programmer’s manager and the owner of the company were standing there too, and none of them were happy. The typical questions were being asked: “Why is the system down?” “What are you doing to fix it?” “When will it be back up?” The typical answers: “I don’t know.” “I’m researching that now.” “I don’t know.” Everyone leaves with an emergency meeting scheduled to determine the problem.

I go to my manager’s desk, where he is frantically reading printouts and digging through systems trying to figure out the problem, and ask what is going on and what he needs me to do. He explains the situation and asks if I have any idea what may have changed to cause the outage. “Well, the only thing I have worked on today is spinning up a new VM to test the new application we are considering.” That was the “aha” moment for my manager. He immediately ran a report detailing the properties of the disk where I had just created the new VM. In formatting the disk space for my VM, my sizing calculation was off by one track. That one-track error caused my new VM to overlap and wipe out the VM where the production application resided.

Yes, we were able to restore the application and get everything back up... after three hours of a money-making, mission-critical application being down. It was a very painful personal experience and a lesson that has stayed with me to this day. Fortunately, we did not miss any customer calls, but the incident called my reliability and knowledge into question and immediately changed the way any changes were made to our systems.

In a more recent experience, an unhappy customer called me about an issue they were having: my platform was reporting false positives for changes made to their environment. Like the repairman, I asked for more specific information. The conversation went something like this:
Repairman: What specifically, Mr. Customer, is the issue?
Customer: We decommissioned 10 servers from our environment last night that were out of compliance with our standards and could not be patched. Your system is reporting that three of those servers are still active, and this is skewing my compliance reporting and SLA.
Repairman: As our platform can only monitor and report on “things,” how exactly have you validated that the platform is providing inaccurate information?
Customer: I checked our Change Management Ticketing Platform, and the change is marked as completed by the server team.
Repairman: Can we talk with the server team and validate the change?
Customer: Yes, but we have a strict change policy and a ticketing system for tracking changes.
The customer then proceeded to conference in the Server Management Team, but the engineer who performed the change was not available at the time, so we decided to reconvene the next morning to review...
Customer/Server Management Team: We reviewed the change with the engineer last night:
- They did not complete the actual decommissioning process, as the change window expired.
- As the SLA was to decommission all servers by “date,” they closed the ticket as complete and planned to complete the actual decommissioning during the next change window.
As you can see, Change Management has certainly evolved since I was an engineer in the 80s. Processes and policies have been put in place, tracking and documentation platforms have come to fruition, Change Advisory Boards (CABs) have been formed... all to prevent an experience like mine. In today’s complex, mission-critical infrastructures, understanding and managing change is even more challenging and impactful than it was in my personal experience.

The human element is, and will always remain, the weak link. Regardless of how strict your change process is, regardless of the ticketing systems employed to document and manage change, these processes and systems depend on the human element. The changes we make with the best intentions do not always produce the desired outcome. As in my personal experience, we make errors; we are not infallible. As in my customer’s case, changes are not always performed as approved, and changes are not always accurately documented.

By employing a platform that continuously monitors, records, assesses, and alerts you to changes within your environment, you take the guesswork out of determining whether a change was good or bad. It removes the human element. It can prevent or, at the very least, minimize costly outages by telling you what changed, when the change was made, and who made the change (a minimal sketch of that idea follows at the end of this post). Yet this basic pillar of IT and Security Management is taken for granted or considered already addressed. I mean, we have CABs, we have ticketing systems, and we have strict policies in place, right?

The most common objection is that such a platform is too expensive and complex. Compared to what? Doing nothing? Relying on the human element, your process, and your policies? Doing the same thing you are doing now? The definition of insanity: “Doing the same thing over and over and expecting a different outcome.” Consider the cost of an outage, which is very difficult, if not impossible, to fully quantify:
- Your money making mission critical systems are down.
- You’re not making money.
- You're out of compliance and missing personal MBOs, your organization is facing fines, and you are losing business opportunities.
- The human FTE effort required to investigate and determine root cause: hours, days, weeks.
- The effort to accurately remediate the cause and prevent future outages.
- The reputation damage to your respective organization.
The benefit of understanding and managing “What changed?” far outweighs the cost.
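To make the monitoring idea concrete, here is a minimal sketch of the core loop such a platform performs: take a baseline snapshot, rescan on an interval, and report what changed, when, and (as far as the filesystem records it) who owns the changed item. This is my own illustrative toy in Python, not any particular product; the watched paths, scan interval, and owner lookup are all assumptions, and a real platform covers far more than files.

```python
#!/usr/bin/env python3
"""Minimal change-detection sketch: baseline, rescan, report what changed.

Illustrative only. Paths and interval below are assumptions, not recommendations.
"""
import hashlib
import time
from datetime import datetime
from pathlib import Path

WATCH_DIRS = [Path("/etc")]   # assumption: directories worth watching
SCAN_INTERVAL_SECONDS = 300   # assumption: rescan every 5 minutes


def snapshot(dirs):
    """Record a content hash, modification time, and owner uid for every file."""
    state = {}
    for root in dirs:
        for path in root.rglob("*"):
            if not path.is_file():
                continue
            try:
                digest = hashlib.sha256(path.read_bytes()).hexdigest()
                stat = path.stat()
            except OSError:
                continue  # unreadable or vanished mid-scan; skip it
            state[str(path)] = {
                "sha256": digest,
                "mtime": datetime.fromtimestamp(stat.st_mtime).isoformat(),
                "owner_uid": stat.st_uid,  # "who" is only as good as file ownership
            }
    return state


def diff(old, new):
    """Return added, removed, and modified paths between two snapshots."""
    added = sorted(set(new) - set(old))
    removed = sorted(set(old) - set(new))
    modified = sorted(
        p for p in set(old) & set(new) if old[p]["sha256"] != new[p]["sha256"]
    )
    return added, removed, modified


def report(added, removed, modified, new):
    """Answer the repairman's question: what changed, when, and under whose ownership."""
    for path in added:
        info = new[path]
        print(f"ADDED    {path} at {info['mtime']} (owner uid {info['owner_uid']})")
    for path in removed:
        print(f"REMOVED  {path}")
    for path in modified:
        info = new[path]
        print(f"MODIFIED {path} at {info['mtime']} (owner uid {info['owner_uid']})")


if __name__ == "__main__":
    baseline = snapshot(WATCH_DIRS)
    while True:
        time.sleep(SCAN_INTERVAL_SECONDS)
        current = snapshot(WATCH_DIRS)
        report(*diff(baseline, current), current)
        baseline = current  # each scan becomes the new baseline
```

A real platform would attribute changes through agents and audit logs rather than file ownership, would watch much more than a filesystem (configurations, packages, cloud resources), and would reconcile detected changes against approved tickets. The point of the sketch is simply that a continuous, recorded baseline answers “What changed?” without relying on anyone’s memory.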