
Identify's 'black box' gets straight to the heart of the software problem

Rethink IT, Nov, 2004

To a senior IT manager, it sometimes seems that blame avoidance is the only skill that technical teams have really mastered.

Whether it is the security group, applications developers or the technical support function, when an application goes wrong, it is rarely their fault, and nine times out of 10, it's not their problem.

One of the reasons for this, says application support company Identify, is that 80% of all problem resolution time is spent reliably replicating just what went wrong when the software system failed.

And in the interim, busy technical grades indulge in redundant information gathering, finger pointing and blame avoidance, rather than getting down to root cause determination. And that can be expensive.

Your better team players are desperate to get to the bottom of an end user problem, but it becomes like a scene from a Hercule Poirot novel, trying to detect just which part of your infrastructure or application suite is the culprit. A lot of companies are going through this right now in their upgrades to Microsoft's Windows XP Service Pack 2, and application complexity is only going up in a web services world.

That's when Oren Modai, VP operations for Identify, says that everyone wishes they had his product, which he describes as a "black box", like a flight recorder for .Net and J2EE applications.

"At the moment finding out what caused a software problem is an unstructured process, which usually involves someone coming out to the end user department and saying, 'Can I have your server,' and taking the entire application, along with its hardware, offline."

"They just want to recreate the fault, and sometimes the attempts to recreate the problem are more intrusive than the problem itself. It is communication-intensive, involves multiple locations, it's error-prone and it usually involves some formal trouble ticket process that goes through the motions."

"What people want is a factual basis for application problem resolution," says Modai.

The problem (and therefore Identify's opportunity) comes from the complexity of today's distributed applications. The more hardware and software that any given application travels across, the greater the number of things that need to be checked when something goes wrong. This draws a large number of separate participants into the problem process. Each group has its own set of priorities, time pressures and skill constraints, and the result can be a huge escalation in costs.

Modai keeps coming back to that one statistic. What if you could just go straight to the root cause of a problem and set to work solving it? That would be a saving of 80% of the time and cost of each problem, he insists.
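To make the arithmetic behind that claim concrete, here is a minimal sketch. The function name and the 10-hour incident are hypothetical, chosen only for illustration; the 80% share is Identify's own figure, and the saving is an upper bound that assumes replication time drops to zero.

```python
def resolution_split(total_hours, replication_share=0.80):
    """Split total resolution time into replication time and fix time.

    replication_share defaults to the 80% figure quoted by Identify.
    """
    replication_hours = total_hours * replication_share
    fix_hours = total_hours - replication_hours
    return replication_hours, fix_hours


# Hypothetical incident that takes 10 hours to resolve end to end.
replication_hours, fix_hours = resolution_split(10)
print(f"Spent replicating: {replication_hours} h, spent fixing: {fix_hours} h")
```

On those assumed numbers, only the remaining fifth of the time goes on root cause determination and the fix itself.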

The Identify AppSight Application Support System monitors and records applications and can be applied in development, in testing or when a program is already in production. It reports at three levels: the user level, offering a replay of all the user screens; the level where the software system interacts with the operating system; and finally the code level.

It can be switched on and off at a moment's notice and can run 24/7 or in reactive mode after an unplanned outage has happened, when the IT department knows that a problem is likely to recur.
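The "flight recorder" idea can be illustrated with a small sketch. This is not AppSight's implementation, which the article does not describe; it is a generic bounded event log, and the class, method and level names below are all hypothetical. It shows only the two properties mentioned above: events tagged with one of three levels, and recording that can be switched on and off.

```python
import time
from collections import deque
from dataclasses import dataclass

# Hypothetical names mirroring the three reporting levels described above.
LEVELS = ("user", "system", "code")


@dataclass
class Event:
    timestamp: float
    level: str
    detail: str


class FlightRecorder:
    """Bounded in-memory event log that can be switched on and off."""

    def __init__(self, capacity=1000):
        # A deque with maxlen silently discards the oldest event when full.
        self._events = deque(maxlen=capacity)
        self.enabled = False

    def record(self, level, detail):
        if level not in LEVELS:
            raise ValueError(f"unknown level: {level}")
        if self.enabled:
            self._events.append(Event(time.time(), level, detail))

    def replay(self, level=None):
        """Return recorded events in order, optionally filtered by level."""
        return [e for e in self._events if level is None or e.level == level]


recorder = FlightRecorder(capacity=3)
recorder.record("user", "ignored while recording is off")
recorder.enabled = True
recorder.record("user", "clicked Submit")
recorder.record("system", "opened connection to database host")
recorder.record("code", "order handler raised a timeout error")
recorder.record("user", "saw error page")

for event in recorder.replay():
    print(event.level, "-", event.detail)
```

Because the buffer is bounded, memory use stays fixed however long recording runs, at the cost of losing the oldest events; a real product would presumably persist or forward them instead.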

Modai says, "AppSight does carry a CPU overhead, but it is about 2% to 3% CPU time. And you can set it so that the overhead is capped on mission-critical systems. It can be issued from a single PC and multiple black boxes can report back to a web-based portal for constant monitoring."

He describes what he sees as the traditional problem-solving process, which begins when an end user gets an error page mid-transaction.

The call to the helpdesk results in a helpdesk technician visiting. They talk about the problem, but it won't recur. The support database has nothing similar in it, and the problem is classified as likely to be server-side.

An operations engineer checks the server infrastructure, cannot recreate the problem, and escalates it to the development team that wrote the module in use at the time of the error.

Development tries to recreate it and fails. There is more discussion between the user and the helpdesk and the development team to determine exact settings at the time the problem occurred. A day goes by.

Finally they recreate the problem, but that alone doesn't show why it occurred, and they have to go in search of the root cause. They check the database and the application server, finally locate the problem in some of their own business logic code, and take just an hour or so to come up with, and test, a patch.

When this kind of problem occurs, argues Modai, "It doesn't just slow down the end user and the support team, but it slows down the developer too, who should be working on future developments."

There are labor costs, downtime, and customer dissatisfaction. And for independent software vendors (ISVs), the problems escalate further because they are so far removed from the user environment: when problems happen in their code, they have even less chance of recreating the circumstances under which those problems occurred.

 


Content provided in partnership with Thompson Gale