Sep 07

For a BSM/SQM/Service Assurance solution, the initial solution architecture document is one of the most crucial artifacts. It not only details the strategic objectives of the solution but also provides a competitive analysis and an alignment to the existing capability of the organization. Furthermore, it provides insight into the driving requirements, architectural background and key organizational context, to ensure that the solution is built for the organization and is not something off the shelf rammed down its throat.

Detailed below is the template:

1 Executive Summary

CONTENTS OF THIS SECTION: This section gives an overview of the content of the rest of the report, with the key facts that management would like to know about its contents. The executive summary should give the most important aspects of the report while omitting details and some supporting information. Generally speaking, the summary should be no longer than 1 page, and preferably as short as possible while still conveying the required information.

1B  [Optional, for mature organizations] Strategic Capability Network

Analysis of how the strategy aligns with the organization's capabilities and resources. You can safely skip this section if you already have a defined BSM strategy and a competitive analysis document detailing the value propositions that drive the business. For details refer to the patent here and my analysis with an example here.

2 Introduction

CONTENTS OF THIS SECTION: This section gives the name of the system and describes its high-level functions.  This is expanded upon by the history and stakeholders sections.

2.1 History

CONTENTS OF THIS SECTION: This section provides the historical context for the system.  It answers how the system was developed and by whom.

2.2 Stakeholders

CONTENTS OF THIS SECTION: This section provides a list of the stakeholder roles important to the system.  For each, the section lists the concerns that the stakeholder has that can be addressed by the system.

3 Architecture & Problem Background

CONTENTS OF THIS SECTION: The sub-parts of Section 3 explain the constraints that significantly influenced the architecture.

3.1 System Overview

CONTENTS OF THIS SECTION: This section describes the general function and purpose for the system or subsystem whose architecture is described in this SAD.  Include a high-level context diagram of the system and summarize major inputs and outputs.

If you don’t know how to build an accurate context diagram, look here.

3.2 Goals and Context

CONTENTS OF THIS SECTION: This section names the system, describes the high-level functions that the BSM solution offers and, more importantly, explains how the solution would fit into the current value chain of the organization.

3.3 Significant Driving Requirements

CONTENTS OF THIS SECTION: This section describes behavioral and quality attribute requirements (original or derived) that shaped the software architecture. Included are any scenarios that express driving behavioral and quality attribute goals.

This section should only list the key driving requirements and not detailed requirements for the solution.

4 Competitive Landscape

CONTENTS OF THIS SECTION: This section lists and briefly describes the major competitors of the system.  Competitors are those systems that do the same thing as the system or those systems that could otherwise be used in place of the system.  It also gives a high level overview of the strengths, weaknesses, opportunities and threats of the system explained in more detail in the following sections.

4.1 Strengths

CONTENTS OF THIS SECTION: This section describes the functions that the system does well either in comparison with its competition or in absolute terms.

4.2 Weaknesses

CONTENTS OF THIS SECTION: This section describes the functions that the system does poorly in relation to its competitors or in absolute terms.  Also included could be features that competitors have but the system does not, or features that the system should have but does not given the stakeholders and high-level requirements described in the previous section.

4.3 Opportunities

CONTENTS OF THIS SECTION: This section describes what the opportunities are for the system.  Opportunities are factors external to the system (e.g., in the overall environment) such as general trends or actions of competitors that enable the system to increase its market share or usefulness to stakeholders.

4.4 Threats

CONTENTS OF THIS SECTION: This section describes the threats that the system is likely to experience.  Threats are factors external to the system such as general trends or actions of competitors that decrease the market share of the system or its usefulness to stakeholders; in the extreme case, threats might render the system obsolete.

5 Referenced Materials

CONTENTS OF THIS SECTION: This section provides citations for each reference document.  Provide enough information so that a reader of the SAD can be reasonably expected to locate the document.

6 Directory

6.1 Glossary

CONTENTS OF THIS SECTION: This section provides a list of definitions of special terms and acronyms used in the SAD. If terms used in the SAD are also used in a parent solution description document with a different definition, this section explains why.

6.2 Acronym List

If you work in the telecom or finance world, you know as I do that TLAs [three-letter acronyms] vary annoyingly from organization to organization. So don’t assume: take 10 minutes and add value to your BSM document.

Acknowledgements:

SEI Architecture documentation

Professor Jeff Thompson

Professor J Vayghan

Aug 22

As a firm believer in using formal techniques for building BSM solutions, I have studied the SEI documentation guidelines over the years and implemented them time and again. This has been one of the biggest reasons for most of the success I have attained in my career. For this reason, I have decided to share a step-by-step documentation guidelines series with my readers, to influence the BSM community towards building solutions BETTER!!

I would like to acknowledge Professor Jamshid Vayghan and Professor Jeff Thompson, whose teaching of Enterprise Architecture and Software Solutions Architecture gave me insight into how to deliver architecture-related artifacts better.

This series shall detail the documentation guidelines for BSM solution strategy, architecture planning, implementation and lifecycle management. It shall not include troubleshooting, production support and maintenance-related documentation, as these aspects depend on the tool suite, hardware, software and organization environment.

Jul 26

In my previous post, I talked about which events one should manage and which should be discarded. In this post I want to touch on a topic that has caused the most pain to Netcool as an Event Management System. I am sure this has also been an issue with most other products in the SA, FMS space.

Very often I come across clients where ignorant system administrators use the MIB2Rules tool to create *Production ready* rules files which are included in the MTTRAPD rules file. Over a period of time, as MIBs are added to the queue of requests for additional Fault/Event management, the rules garbage piles up. As a result monitoring gets out of hand and Network operations starts complaining about the quality of alarming. The complaints increase and the quality of the Product is challenged.

To these organizations/system administrators, I want to assert: STOP THE ABUSE!! MIB2RULES is not intended to be used on the assumption that every rules file it generates is the best-fit solution for the facility/infrastructure for which the MIB is compiled. By NO MEANS SHOULD MIB2RULES BE USED TO BYPASS THE EVENT ENGINEERING PROCESS.

The event engineering process is the only way to improve and maintain the quality of alarming and meet the expectations of the Operations users. So that leads us to the question: what is event engineering?

Event engineering is the engineering process of defining the events, the associated symptoms, the probable organizational impacts and the intended audience. Furthermore, the event engineering process defines the visual association such as severity, the escalation procedures and any related enrichment information that would help the user respond to the event, if required.

So this leads us to what constitutes an event. In my view, an event should at minimum consist of the following fields from fault reporting standards: ManagedObject, ProbableCause, SpecificProblem, Severity, AlertGroup, AlertKey, Manager, Agent and Summary.
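As a rough illustration, the minimum field list above can be turned into a completeness check that runs before an event is accepted. This is only a sketch in Python, not Netcool rules-file syntax; the function name `missing_fields` and the sample values are hypothetical, and only the field names come from the list above.

```python
# Minimum fields an engineered event should carry (taken from the list above).
REQUIRED_FIELDS = (
    "ManagedObject", "ProbableCause", "SpecificProblem", "Severity",
    "AlertGroup", "AlertKey", "Manager", "Agent", "Summary",
)


def missing_fields(event: dict) -> list:
    """Return the required fields that are absent or blank in an event."""
    missing = []
    for field in REQUIRED_FIELDS:
        value = event.get(field)
        # None and whitespace-only strings count as missing; 0 is a valid value.
        if value is None or not str(value).strip():
            missing.append(field)
    return missing


# Hypothetical event straight from a generated rules file: mostly unpopulated.
raw_event = {"ManagedObject": "router-01", "Severity": 5, "Summary": "Link down"}
print(missing_fields(raw_event))
```

An event that comes back with a long list of missing fields is a good sign that it has not been through event engineering yet.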

At the system level events can be depicted with various lifecycle or escalation points using powerful tools like Impact; this is something that I highly encourage but do not mandate. What I do strongly recommend is the process of engineering events better!!

Following the aforementioned, within 10 weeks I have succeeded in taking the experience of the end users from 2/10 to 7/10 when it comes to the Service Assurance solution in place. No, it’s not a secret; it’s just plain old network management practice, which was followed even in the early 90s but was somehow killed by Vendors/Sales folks to show the value of the box. Check out any new vendor site and you will find “8 hours to Event Management” or “set up in a day” sorts of slogans, which have dragged the quality of the entire landscape below par.

Irrespective of the tool, the fundamentals of instrumentation have never changed. D. L. Parnas, in his famous paper “A Rational Design Process: How and Why to Fake It” (written with P. C. Clements), talks about following the ideal procedure as closely as possible, and with this post I recommend the very same principle for managing infrastructure, networks, services and customers with the same rigour and formality.

Bottom line: Tools don’t/shouldn’t change Strategy, Organizations, Principles; they just align with the aforementioned to achieve the common purpose. Something to think about!

Jul 18

Managing vs Discarding events has been a topic of debate for many years in the Network Management community. Both sides have merits and demerits to consider, and while the reader may ask for a specific answer, the answer really is that it depends!! The real question, rather, is: what factors does this debate depend on?

For those of you who are not familiar with this topic, let me give a quick background. Most equipment vendors provide MIBs/off-the-shelf management modules to manage equipment from a fault management/service assurance perspective, based on standard protocols running over TCP or UDP like SNMP, TL1, socket communication etc. NMS teams at various Telecom/Financial giants debate the feasibility of managing a huge volume of events, often millions per day; correlation/deduplication does reduce events to more actionable alarms, but it does not reduce the actual root causes. So this leads us to a bigger question: what are root causes?
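To make the deduplication point concrete, here is a minimal sketch in Python. The `node` and `alarm` keys and the sample events are hypothetical, and this is not how any particular product implements deduplication; it only shows repeated events collapsing into one row per key with a tally.

```python
from collections import Counter


def deduplicate(events):
    """Collapse repeated events into one row per (node, alarm) with a count."""
    return dict(Counter((e["node"], e["alarm"]) for e in events))


# Hypothetical raw feed: five events, but only two distinct alarms.
raw_events = (
    [{"node": "sw-1", "alarm": "linkDown"}] * 3
    + [{"node": "sw-1", "alarm": "authFailure"}] * 2
)
rows = deduplicate(raw_events)
print(len(raw_events), "raw events ->", len(rows), "rows")
```

The operator now sees two rows instead of five, but both underlying conditions are still there. Deduplication shrinks the display, not the set of problems.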

Does a NOC or Front Office technician really care about Authentication failure alarms, or those annoying informational and warning alarms provided off the shelf by the vendor to “effectively manage the network”?

Following are the organizational factors to consider for effective event management:

1) The size and skills of the Layer 1 support NOC/Front Office: Ok, so if the Front Office is 4-5 guys, can they really handle 3000 critical alarms a day? Do they really need those trending alarms indicating that a T1 might be impacted in 4 hours or would they rather focus on the customer impacting outages? [I know that some would argue the very org. structure; but I will not try to influence business decisions which consider multiple dimensions of the picture, technology being one of them.]

The size of the team responsible for incident management is key for the fault management/service assurance team to ensure quality of alarming meets the expectations of the Organization.

2) The size & complexity of Application platform/Network: The size and complexity of the Application platform/Network play an important role in defining alarms.

Example: For the layer 1/core network, technicians may want to know all trends in order to prevent incidents from happening, whereas for the layer 2/layer 3 network, technicians may want only events indicating incidents that impact services.

Note: Understanding the network/applications from a usage perspective helps immensely.

3) Customers & Services: Provisioned services and customer associations are important to the overall business objective. Understand them!

After understanding the aforementioned, you will know the organizational perspective and volume management perspective of events.

Now for the most important dimension of the debate: quality of alarming, which consists of accuracy, completeness and actionable alarms. Considering this factor, one might argue that we can reach the goal of quality only if we manage all identified alarms rather than whatever is provided off the shelf. Yes, I agree.

On the other hand, a few might argue that by discarding unknown alarms we let some information that might impact services go unnoticed. Yes, I agree with this too. But the challenge is to balance these discards at the right level, showing the events that indicate the right impact on the service.
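One possible way to strike that balance, sketched here in Python under stated assumptions, is to discard by default but keep a tally of what was dropped so it can be reviewed later. The `ENGINEERED` catalog, the alarm names and the `triage` function are all hypothetical; this is an illustration of the idea, not a feature of any particular tool.

```python
from collections import Counter

# Hypothetical catalog produced by event engineering: alarm name -> action.
ENGINEERED = {"linkDown": "manage", "authFailure": "discard"}


def triage(alarm_names):
    """Split alarms into managed ones and a tally of everything discarded."""
    managed, discarded = [], Counter()
    for name in alarm_names:
        if ENGINEERED.get(name) == "manage":
            managed.append(name)
        else:
            # Alarms marked "discard" and alarms never engineered both land here.
            discarded[name] += 1
    return managed, discarded


managed, discarded = triage(["linkDown", "authFailure", "fanTrayWarn", "fanTrayWarn"])
print(managed)
print(discarded.most_common())
```

Reviewing the discard tally periodically shows which unknown alarms keep recurring and may deserve to go through event engineering themselves.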

That is why the challenge is not in getting the Right tool; it’s all in using the tool Right!!
