Jul 26

In my previous post, I talked about which events one should manage and which should be discarded. In this post I want to touch on a topic that has caused the most pain to Netcool as an Event Management System. I am sure this has also been an issue with most other products in the SA/FMS space.

Very often I come across clients where ignorant system administrators use the MIB2Rules tool to create *production-ready* rules files, which are then included in the MTTRAPD rules file. Over time, as MIBs are added to the queue of requests for additional fault/event management, the rules garbage piles up. As a result, monitoring gets out of hand and Network Operations starts complaining about the quality of alarming. The complaints increase and the quality of the product is challenged.

To these organizations and system administrators I want to say: STOP THE ABUSE!! MIB2RULES is not intended to be used on the assumption that every rules file it generates is the best-fit solution for the facility/infrastructure for which the MIB is compiled. BY NO MEANS SHOULD MIB2RULES BE USED TO BYPASS THE EVENT ENGINEERING PROCESS.

The event engineering process is the only way to improve and maintain the quality of alarming and meet the expectations of Operations users. That leads us to the question: what is event engineering?

Event engineering is the process of defining the events, their associated symptoms, probable organizational impacts and intended audience. Furthermore, the event engineering process defines visual attributes such as severity, escalation procedures and any related enrichment information that would help the user respond to the event, if required.

This leads us to what constitutes an event. From my perspective, an event should at a minimum consist of the following fields from fault reporting standards: ManagedObject, ProbableCause, SpecificProblem, Severity, AlertGroup, AlertKey, Manager, Agent and Summary.
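As a minimal sketch of what that minimum could look like as a check, the snippet below tests whether an event carries all nine fields. The field names come from the list above; the `missing_fields` helper and the sample events are hypothetical illustrations in plain Python, not part of Netcool or any probe API.

```python
# Field names are taken from the post; everything else is a hypothetical sketch.
REQUIRED_FIELDS = (
    "ManagedObject", "ProbableCause", "SpecificProblem", "Severity",
    "AlertGroup", "AlertKey", "Manager", "Agent", "Summary",
)


def missing_fields(event):
    """Return the required fields that are absent or empty in an event dict."""
    return [name for name in REQUIRED_FIELDS if event.get(name) in (None, "")]


# A hypothetical engineered event; all values are made up for illustration.
link_down = {
    "ManagedObject": "router-01:ge-0/0/1",
    "ProbableCause": "Loss of signal",
    "SpecificProblem": "Link down",
    "Severity": 5,
    "AlertGroup": "Interface Status",
    "AlertKey": "ifIndex 12",
    "Manager": "trap-probe",
    "Agent": "vendor-mib",
    "Summary": "Link down on router-01 ge-0/0/1",
}

# A hypothetical raw event of the kind an auto-generated rules file might emit.
raw_trap = {"Summary": "enterprise trap 6.42", "Severity": 2}

print(missing_fields(link_down))
print(missing_fields(raw_trap))
```

An event that comes back with a non-empty list has not been through event engineering yet, whatever tool produced it.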

At the system level, events can be depicted with various lifecycle or escalation points using powerful tools like Impact; this is something that I highly encourage but do not mandate. What I do strongly recommend is engineering events better!!

By following the above, within 10 weeks I have succeeded in taking the end users' experience of the Service Assurance solution in place from 2/10 to 7/10. No, it’s not a secret; it’s just plain old network management practice, which was followed even in the early 90s but was somehow killed off by vendors and sales folks to show the value of the box. Check out any new vendor site and you will find “8 hours to Event Management” or “set up in a day” sort of slogans, which have taken the quality of the entire landscape below par.

Irrespective of the tool, the fundamentals of instrumentation have never changed. D. L. Parnas, in his famous paper “A Rational Design Process: How and Why to Fake It”, talks about following the procedure as closely as possible, and with this post I recommend the very same principle for managing infrastructure, networks, services and customers with the same rigour and formality.

Bottom line: tools don’t and shouldn’t change strategy, organizations or principles. They just align with them to achieve the common purpose. Something to think about!

Jul 18

Managing vs. discarding events has been a topic of debate for many years in the network management community. Both sides have merits and demerits to consider, and while the reader may ask for a specific answer, the answer really is that it depends!! The real question, rather, is: what factors does this debate depend on?

For those of you who are not familiar with this topic, let me give a quick background. Most equipment vendors provide MIBs or off-the-shelf management modules to manage equipment from a fault management/service assurance perspective, based on standard TCP- or UDP-based protocols like SNMP, TL1, socket communication etc. NMS teams at various telecom and financial giants debate the feasibility of managing a huge volume of events, often millions per day; correlation/deduplication does reduce events to more actionable alarms, but it does not reduce the actual root causes. So this leads us to a bigger question: what are root causes?
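To make the deduplication point concrete, here is a minimal sketch in plain Python. The choice of `ManagedObject`, `AlertGroup` and `AlertKey` as the deduplication key and the `Tally` counter field are assumptions made for illustration, not a description of how any particular product builds its key.

```python
from collections import Counter


def deduplicate(events, key_fields=("ManagedObject", "AlertGroup", "AlertKey")):
    """Collapse repeated events into one alarm per key, with a repeat count."""
    tally = Counter()
    first_seen = {}
    for event in events:
        key = tuple(event.get(field) for field in key_fields)
        tally[key] += 1
        # Keep the first occurrence as the representative alarm for this key.
        first_seen.setdefault(key, event)
    return [dict(first_seen[key], Tally=count) for key, count in tally.items()]


# Hypothetical feed: one flapping link reported three times, one fan failure.
events = [
    {"ManagedObject": "router-01", "AlertGroup": "Link", "AlertKey": "ge-0/0/1"},
    {"ManagedObject": "router-01", "AlertGroup": "Link", "AlertKey": "ge-0/0/1"},
    {"ManagedObject": "router-01", "AlertGroup": "Link", "AlertKey": "ge-0/0/1"},
    {"ManagedObject": "switch-07", "AlertGroup": "Fan", "AlertKey": "tray 2"},
]

alarms = deduplicate(events)
print(len(events), "events ->", len(alarms), "alarms")
```

The screen gets shorter, but the flapping link is still flapping: deduplication changes how many rows the operator sees, not how many problems exist in the network.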

Does a NOC or front office technician really care about authentication failure alarms, or those annoying informational and warning alarms provided off the shelf by the vendor to “effectively manage the network”?

The following are the organizational factors to consider for effective event management:

1) The size and skills of the Layer 1 support NOC/Front Office: OK, so if the Front Office is 4-5 guys, can they really handle 3000 critical alarms a day? Do they really need those trending alarms indicating that a T1 might be impacted in 4 hours, or would they rather focus on the customer-impacting outages? [I know that some would argue with the very org. structure, but I will not try to influence business decisions which consider multiple dimensions of the picture, technology being only one of them.]

The size of the team responsible for incident management is key for the fault management/service assurance team in ensuring that the quality of alarming meets the expectations of the organization.

2) The size & complexity of the application platform/network: The size and complexity of the application platform/network play an important role in defining alarms.

Example: For the layer 1/core network, technicians may want to know all trends so that they can prevent incidents from happening, whereas for the layer 2/layer 3 network, technicians may want only events indicating incidents impacting services.

Note: Understanding the network/applications from a usage perspective helps immensely.

3) Customers & Services: Provisioned services and customer associations are important to the overall business objective. Understand them!

After understanding the above, you will know the organizational and volume management perspectives of events.

Now for the most important dimension of the debate: the quality of alarming, which consists of accuracy, completeness and actionability. Considering this factor, one might argue that only if we manage the alarms we have identified, rather than whatever is provided off the shelf, can we reach the goal of quality. Yes, I agree.

On the other hand, a few might argue that by discarding unknown alarms we let some information which might impact services go unnoticed. Yes, I agree with this too. But the challenge is to balance these discards at the right level, showing the events which indicate the right impact on the service.
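One way to work towards that balance is to discard with accounting rather than silently. The sketch below is a hypothetical illustration in plain Python: events whose (`AlertGroup`, `AlertKey`) pair is in an engineered catalog are managed, and everything else is discarded but counted, so the discards can be reviewed periodically and promoted into the catalog if they turn out to matter. The `triage` function and the catalog contents are assumptions for this example.

```python
from collections import Counter


def triage(events, engineered):
    """Split events into managed alarms and a per-key count of discards."""
    managed = []
    discarded = Counter()
    for event in events:
        key = (event.get("AlertGroup"), event.get("AlertKey"))
        if key in engineered:
            managed.append(event)
        else:
            # Not engineered yet: drop it from the operator view, keep the count.
            discarded[key] += 1
    return managed, discarded


# Hypothetical catalog of event types that went through event engineering.
engineered = {("Link", "down"), ("Power", "supply failure")}

events = [
    {"AlertGroup": "Link", "AlertKey": "down"},
    {"AlertGroup": "Auth", "AlertKey": "failure"},
    {"AlertGroup": "Auth", "AlertKey": "failure"},
    {"AlertGroup": "Power", "AlertKey": "supply failure"},
]

managed, discarded = triage(events, engineered)
print(len(managed), "managed;", dict(discarded))
```

The discard counts are the input to the next round of event engineering: a key that shows up thousands of times a day deserves a deliberate decision, not a default.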

That is why the challenge is not in getting the Right tool, it’s all in using the tool Right!!

Jul 13
My first observation on TOGAF 9 was that it is more business focused than TOGAF 8, with the addition of business vision, drivers and capabilities (which is also in line with SCN). Before version 9, TOGAF was definitely in the IT realm, and IT was essentially defined as hardware and software. By bringing a business perspective, with lifecycle, into Business, Information, Data and Application architecture, TOGAF has added a new set of offerings to its arsenal.

Note 1: The definition of IT in TOGAF 9 is the lifecycle management of information and related technology within an organization. This is in line with what we have been discussing in class: we need to take a problem and build a business solution, rather than build a solution and then find a business problem.

For the past few years, IT has been looked at as a medium for simply automating tasks to reduce cost and getting tasks accomplished by throwing requirements over the wall. TOGAF provides an approach for integrating IT with the business, using the ADM as a comprehensive, step-by-step guideline for establishing an enterprise architecture program. This is especially useful for organizations considering starting an enterprise architecture program.

Note 2: The single biggest news about TOGAF is a new, mature, well-defined architecture content framework of 170 pages, built on the experience of major systems integrators through hundreds of architecture projects. Its emphasis on organizational learning (also discussed in class) via the meta-model is an important aspect.

Developing a meta-model for enterprise content across the organization, and using that content for strategic and operational planning, is the only way Business, Operations and IT can be aligned. Most large organizations, like the one that I work for, have multiple data repositories fragmented across the organization. Due to the lack of catalog management for this data, organizational learning suffers, and it is hard to access the right information at the right time for the right objective. TOGAF proposes a detailed solution to this problem with the architecture content framework. Its emphasis on the actual information, its access, presentation and quality, so as to provide not only transaction processing support but also analytical processing support for critical business decisions, gives TOGAF a great edge. This change gives a new value proposition to the enterprise architecture framework.

Note 3: TOGAF 9 puts emphasis on Service Oriented Architecture and an integrated system approach – Business Perspective of SOA

The biggest implementation aspect of TOGAF 9 is the emphasis on service-oriented systems and an integrated system approach. This is due to the involvement of system integrators, who are among the most important stakeholders of the enterprise architecture program. To get the best return on investment for SOA, it is important to understand from a business perspective what capabilities your business has and why. One thing that’s really key about TOGAF 9 is that it takes a lot of ideas and practices that exist within individual organizations or proprietary frameworks, builds a consensus around them, and releases them into a public-domain context. Once that happens, the value you can get from that approach increases exponentially. Now you’re not talking about going to one vendor and having to deal with one particular set of concepts, then going to a different vendor and having to deal with another set of concepts, and dealing with the interoperability between those.

Note 4: TOGAF 9 allows enterprise architecture to be molded with operations management, system design, portfolio management, service management, business planning, and the IT Governance Institute’s COBIT guidelines – Compatibility

One critical aspect of adopting a framework is evaluating how compatible the framework is with the organizational ecosystem. TOGAF is very well equipped in this respect, thanks to its compatibility guidelines for managing operations, systems design, portfolio management, service management, and business and strategic planning. In addition, TOGAF provides governance guidelines to ensure the alignment of Business, Operations and IT, providing the equilibrium needed to deliver optimal solutions for priority business problems.

To conclude, Fred Brooks very rightly pointed out that there is no silver bullet for building the best solution and “doing things right!!”, but using TOGAF addresses four of the most critical aspects of reducing accidental complexity: organizational learning via content model frameworks, emphasis on the business perspective of SOA, alignment of value propositions to investments, and finally guidelines for effective governance by the enterprise architecture program.
References

  • TOGAF 9 users see benefit for SOA: http://searchsoa.techtarget.com/news/article/0,289142,sid26_gci1347340,00.html
  • TOGAF 9 advances IT maturity while offering more paths to architecture-level IT improvement: http://briefingsdirectblog.blogspot.com/2009/02/togaf-9-advances-it-maturity-while.html
  • Never Mind the Architecture Frameworks: Here’s TOGAF 9: http://rasmussenreport.wordpress.com/2009/02/06/never-mind-the-architecture-frameworks-heres-togaf-9/
  • TOGAF 9 Architecture Documentation: http://www.opengroup.org/architecture/togaf9-doc/arch/

Jul 01

  • http://publib.boulder.ibm.com/infocenter/tivihelp/v8r1/index.jsp?topic=/com.ibm.netcoolimpact.doc/dsa/imdsa_snmp_c.html
  • http://publib.boulder.ibm.com/infocenter/tivihelp/v8r1/index.jsp?topic=/com.ibm.netcool_OMNIbus.doc/gateways/snmp/snmp/wip/concept/snmp_intro.html
Apr 29

    Association of the solution with how it impacts our bottom line [tactical and strategic (tangible) benefits]:

    Measure the total number of customer calls that are not for an issue already identified by the SA solution. This is the highest priority and would serve as a report card for the SA solution.

    Inputs on the overall business continuity and growth/volume planning [strategic]:
    Define the team member accountable for trend analysis.

    Encouragement/governance [strategic]:
    Encourage team members to leverage the SA solution to ensure availability of services at all times. Associate individual performance with overall application availability and CSI.

    Updating our customers on identified issues proactively [tactical]:
    What sometimes happens is that although teams are experiencing an issue, they do not notify the customer right away and instead start fixing the problem. There are two options to fix this: document a problem management process whose first step is customer notification and communication, or send an automatic email from the tool about the issue. Starting with the former would be the better approach.

    Apr 29

    http://www.iec.org/online/tutorials/tmn/topic01.asp

    The logical model was the most quoted; the rest were terms which never took off…
