In my previous post, I talked about which events one should manage and which should be discarded; in this post I want to touch on a topic that has caused the most pain to Netcool as an Event Management System. I am sure this has also been an issue with most other products in the SA, FMS space.
Very often I come across clients where ignorant system administrators use the MIB2Rules tool to create *Production ready* rules files, which are then included in the MTTRAPD rules file. Over time, as more MIBs come in through the queue of requests for additional fault/event management, the rules garbage piles up. As a result monitoring gets out of hand and Network Operations starts complaining about the quality of alarming. The complaints increase and the quality of the product is challenged.
To these organizations and system administrators, I want to assert: STOP THE ABUSE!! MIB2RULES is not intended to be used on the assumption that every rules file it generates is the best-fit solution for the facility/infrastructure for which the MIB is compiled. BY NO MEANS SHOULD MIB2RULES BE USED TO BYPASS THE EVENT ENGINEERING PROCESS.
The event engineering process is the only way to improve and maintain the quality of alarming and to meet the expectations of the Operations users. That leads us to the question: what is event engineering?
Event engineering is the engineering process of defining the events, their associated symptoms, their probable organizational impact and their intended audience. Furthermore, the event engineering process defines how each event is presented (for example its severity), the escalation procedures, and any related enrichment information that would help the user respond to the event, if a response is required.
This leads us to what constitutes an event. In my view an event should, at a minimum, consist of the following fields drawn from fault reporting standards: ManagedObject, ProbableCause, SpecificProblem, Severity, AlertGroup, AlertKey, Manager, Agent and Summary.
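As a rough illustration of that minimum field set, here is a small Python sketch that reports which of the fields an event is still missing. Only the nine field names come from the list above; the function name, the dictionary representation and the sample values are hypothetical and not taken from any Netcool API.

```python
# Minimum field set taken from the list above; everything else here
# (function name, dict representation, sample values) is hypothetical.
REQUIRED_FIELDS = (
    "ManagedObject", "ProbableCause", "SpecificProblem", "Severity",
    "AlertGroup", "AlertKey", "Manager", "Agent", "Summary",
)


def missing_fields(event):
    """Return the required fields that are absent or empty in `event`."""
    missing = []
    for name in REQUIRED_FIELDS:
        value = event.get(name)
        # Treat None and blank strings as "not engineered yet";
        # a numeric 0 is accepted as a legitimate value.
        if value is None or (isinstance(value, str) and not value.strip()):
            missing.append(name)
    return missing


# A made-up event of the kind an auto-generated rules file might produce.
raw_trap_event = {
    "Manager": "MTTRAPD",
    "Agent": "example-mib",
    "AlertGroup": "linkDown",
    "AlertKey": "ifIndex.3",
    "Severity": 0,
    "Summary": "",
}

print(missing_fields(raw_trap_event))
# ['ManagedObject', 'ProbableCause', 'SpecificProblem', 'Summary']
```

A check like this only tells you that a field is populated, not that it is populated well; deciding what goes into each field is still the event engineering work.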
At the system level, events can be given various lifecycle or escalation points using powerful tools like Impact; this is something that I highly encourage but do not mandate. What I do strongly recommend is engineering the events better!!
By following the above, within 10 weeks I took the end users’ experience of the Service Assurance solution in place from 2/10 to 7/10. No, it’s not a secret; it’s just plain old network management practice, which was followed even in the early ’90s but was somehow killed off by vendors and sales folks keen to show the value of the box. Check out any new vendor site and you will find “8 hours to Event Management” and set-up-in-a-day sorts of slogans, which have dragged the quality of the entire landscape below par.
Irrespective of the tool, the fundamentals of instrumentation have never changed. D. L. Parnas, in his famous paper “A Rational Design Process: How and Why to Fake It” (written with P. C. Clements), argues for following the ideal procedure as closely as possible, and with this post I recommend the very same principle: manage infrastructure, networks, services and customers with the same rigour and formality.
Bottom line: tools don’t, and shouldn’t, change strategy, organizations or principles. They just align with them to achieve the common purpose. Something to think about!
Robin,
Would it make sense to divide event processing into two parts: the expression of the business logic and the low-level technical handling?
It is true that it does not make sense to expect a tool to perform business functions, but it is understandable that people expect it to handle the low-level technical functions.
Tools such as Mib2Rules can handle the low-level activities and convert the machine language into a human-understandable form. People need to do the rest: specify what to do with that information and express it as the business logic.
The real problem is that (sadly) there is no open collaborative environment in which to express what events mean and what the relevant actions should be. Each organization is left to do this on its own (solving the same problem repeatedly) and there is often not enough time or resources to do it, hence people and organizations look for shortcuts, the magic wand.
There is a great need and potential for collaboration in the event engineering process. Unfortunately, unlike developers, we IT management folks suck at creating collaborative environments.
@Berkay
Thanks for the comment. You have hit the nail on the head in separating the concerns of the Business layer, Service layer and Technical layer of the event engineering process. I do agree that the community can do more to provide meaningful details for event descriptions. I believe the X.733 fault reporting standard is a handy reference for that, though it might be a little outdated.
Poorly defined events are at times a bigger problem than the ‘outage’ itself. Another important dimension here is that the contextual and organizational value of events is not communicated to all stakeholders. I plan to cover that in my upcoming posts.