Jul 18

Managing versus discarding events has been a topic of debate in the network management community for many years. Both sides have merits and drawbacks, and while the reader may want a definitive answer, the honest answer is that it depends! The real question is: what factors does it depend on?

For those of you who are not familiar with this topic, let me give a quick background. Most equipment vendors provide MIBs or off-the-shelf management modules to manage equipment from a fault management/service assurance perspective, using interfaces such as SNMP, TL1, or plain socket communication over TCP or UDP. NMS teams at large telecom and financial companies debate the feasibility of managing a huge volume of events, often millions per day. Correlation and deduplication do reduce events to more actionable alarms, but they do not resolve the actual root causes. This leads us to a bigger question: what are root causes?

Does a NOC or front-office technician really care about authentication failure alarms, or about those annoying informational and warning alarms the vendor provides off the shelf to “effectively manage the network”?

The following organizational factors should be considered for effective event management:

1) The size and skills of the Layer 1 support NOC/front office: OK, so if the front office is 4-5 guys, can they really handle 3000 critical alarms a day? Do they really need those trending alarms indicating that a T1 might be impacted in 4 hours, or would they rather focus on customer-impacting outages? [I know that some would question the org structure itself, but I will not try to influence business decisions that consider multiple dimensions of the picture, technology being only one of them.]

The size of the team responsible for incident management is a key input for the fault management/service assurance team in ensuring that the quality of alarming meets the organization's expectations.

2) The size and complexity of the application platform/network: Both play an important role in defining alarms.

Example: For the layer 1/core network, technicians may want to see all trends so they can prevent incidents from happening, whereas for the layer 2/layer 3 network, technicians may want only the events that indicate service-impacting incidents.

Note: Understanding the network and applications from a usage perspective helps immensely.

3) Customers & Services: Provisioned services and customer associations are important to the overall business objective. Understand them!

Once you understand the factors above, you will know both the organizational perspective and the volume management perspective of events.

Now for the most important dimension of the debate: quality of alarming, which consists of accuracy, completeness, and actionability. Considering this factor, one might argue that we can reach the goal of quality only if we manage a deliberately identified set of alarms rather than whatever is provided off the shelf. Yes, I agree.

On the other hand, some might argue that by discarding unknown alarms we let information that might impact services go unnoticed. Yes, I agree with this too. The challenge is to balance these discards at the right level, showing the events that indicate real impact on the service.
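One way to picture that balance is a triage step that forwards service-impacting or high-severity events, deduplicates repeats, and counts what it discards so the discards can still be reviewed later. The sketch below is only an illustration under stated assumptions: the `Event` fields, the severity names, and the `triage` function are hypothetical and do not come from any particular NMS product.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    # Hypothetical event shape; real tools carry many more fields.
    source: str              # device or application that raised the event
    name: str                # e.g. "linkDown", "authenticationFailure"
    severity: str            # "critical", "major", "minor", "warning", "info"
    service_impacting: bool  # assumed to be set by upstream enrichment


def triage(events, forward_severities=frozenset({"critical", "major"})):
    """Split events into forwarded alarms and a tally of discards.

    An event is forwarded if it is service impacting or its severity is
    in forward_severities. Repeats of the same (source, name) pair are
    deduplicated. Everything else is discarded but counted by name, so
    the discard policy can be audited instead of silently losing data.
    """
    forwarded = []
    discarded = Counter()
    seen = set()
    for ev in events:
        if ev.service_impacting or ev.severity in forward_severities:
            key = (ev.source, ev.name)
            if key not in seen:
                seen.add(key)
                forwarded.append(ev)
        else:
            discarded[ev.name] += 1
    return forwarded, discarded


events = [
    Event("rtr1", "linkDown", "critical", True),
    Event("rtr1", "linkDown", "critical", True),   # duplicate of the first
    Event("rtr1", "authenticationFailure", "warning", False),
    Event("sw2", "authenticationFailure", "warning", False),
    Event("sw2", "highLatency", "minor", True),
]
forwarded, discarded = triage(events)
```

Reviewing the `discarded` tally periodically is what keeps the policy honest: if a discarded event type keeps showing up before customer complaints, it probably belongs on the forwarded side.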

That is why the challenge is not in getting the right tool; it's all about using the tool right!

Apr 29

Associating the solution with how it impacts our bottom line [tactical and strategic (tangible) benefits]:

Measure the total number of customer calls that are not for an issue already identified by the SA solution. This is the highest priority and would serve as a report card for the SA solution.
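As a rough sketch of that report card, the measure can be expressed as the share of customer calls that did not match an issue the SA solution had already identified. The record layout and the `matched_sa_issue` field below are assumptions made for illustration; how a call gets matched to an SA-identified issue depends on your ticketing process.

```python
def unidentified_call_rate(calls):
    """Return the fraction of customer calls not tied to an issue the
    SA solution had already identified.

    calls: iterable of dicts with a boolean "matched_sa_issue" key
    (a hypothetical field name). Returns 0.0 when there are no calls.
    """
    calls = list(calls)
    if not calls:
        return 0.0
    missed = sum(1 for call in calls if not call["matched_sa_issue"])
    return missed / len(calls)


sample_calls = [
    {"matched_sa_issue": True},
    {"matched_sa_issue": False},
    {"matched_sa_issue": False},
    {"matched_sa_issue": True},
]
rate = unidentified_call_rate(sample_calls)
```

A falling rate over time suggests the SA solution is catching more of the issues customers would otherwise report first.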

Inputs on overall business continuity and growth/volume planning [strategic]:
Define the team member accountable for trend analysis.

Encouragement/governance [strategic]:
Encourage team members to leverage the SA solution to ensure availability of services at all times. Tie individual performance to overall application availability and CSI.

Updating our customers on identified issues proactively [tactical]:
Sometimes teams that are experiencing an issue start fixing the problem without notifying the customer right away. There are two options to fix this: document a problem management process whose first step is customer notification and communication, or have the tool send an automatic email about the issue. Starting with the former would be the better approach.
