Metrics to Reduce Risk in Product Ship Decisions

Traditionally, product shipment decisions were made based on how the software product “felt” to the tester or developer. After running the product for some period of time, the developer or tester would pronounce the product fit or unfit for shipment.

Many organizations now recognize that decisions based on “gut feel” are insufficient. These organizations have come to this realization from a number of different perspectives: they are starting to institute a quality management system, or have undergone a software process audit; they have had problems with their products in the field; or management recognizes ship decisions are made without sufficient data.

It is possible to define specific measurements for product shipment decisions. At the very least, these metrics can be defined for the system test phase. This paper will address:

  • Measurable system test entry criteria
  • Measurable data to assess product shipment risk

These metrics will be discussed in the context of one case study. We will define quality, develop possible metrics, discuss when to define the criteria, review general reliability and stability measures, and discuss how to assess the data.

Define quality

In order to define and assess shipment risks, product quality must be defined first. Grady [Grady92] defines quality as choosing one of these three priorities:

  1. Minimize time to market (decrease engineering cost and schedule)
  2. Maximize customer satisfaction (determine features by working with your customers)
  3. Minimize defects

The key to defining quality is to choose only one of these priorities as your top priority. For any given product, one of these priorities is at the top, and the other two are optimized within that context. Some people are not convinced that they have only one top priority. Here is a check to verify that choice: if it is three weeks before your scheduled ship date, the bug count is higher than you would like, and you didn't quite get the last feature in, what does management choose to do? If management's choice is to ship anyway, then time to market is your definition of quality. If management chooses to ship as soon as that feature is in, then customer satisfaction is your top priority. If management chooses to wait to fix the bugs, then low defects is your definition of quality.

For too long, SQA groups have focused on low defects as a measure of quality. It is no longer adequate to define quality just as defect levels, if it ever was. In the case study below, we will see how the different definitions of quality require different metrics.

Case Study: DataFinder

DataFinder is a new-paradigm product being produced at a small division of a large software company. Current sales for this division are roughly $10 million. The product is sold as a substantial performance enhancement for an RDBMS. We will review the actions and measurements taken as the company's assessment of the product's readiness to ship evolved.

The definition of system test used for the purpose of this discussion is that period of time when the product developers have completed all feature development and most of the code development. There was a “code freeze” milestone in the schedule. The developer activities were limited to bug fixes and reviews of bug fixes before the fix was introduced into the system. The SQA engineers were in high gear, running their tests, finding and reporting bugs, updating tests to account for new bugs or changed behavior.

At the time the case study started, DataFinder management believed they had met code freeze, were in the system test period, and were ready for beta test. In fact, they had already shipped a beta version of the product. However, the customers complained long and loud about the product defects and performance. The Product Development team worked very hard over the next four weeks, addressing the performance and defect issues. DataFinder shipped another beta release. The customers still complained about the product shortcomings.

DataFinder management decided they could not afford to go onto a never-ending four-week release cycle, never really knowing whether the customers would be satisfied with the performance and defect levels of the software. They chose to reassess the current state of the product. They measured bug open and close rates, the percentage of tests passing, and test coverage.
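These measurements are simple to compute once bug reports and test runs are logged. A hypothetical sketch, with invented weekly counts rather than DataFinder's actual data:

```python
# Hypothetical sketch: compute net bug backlog growth and test pass percentage.
# The weekly counts are invented for illustration, not DataFinder's actual data.

def weekly_net_open(opened_by_week, closed_by_week):
    """Net change in open bugs each week; positive means the backlog is growing."""
    return [o - c for o, c in zip(opened_by_week, closed_by_week)]

def pass_percentage(tests_passed, tests_planned):
    """Percent of the planned system tests currently passing."""
    return 100.0 * tests_passed / tests_planned

opened = [30, 35, 40, 38]   # bugs found each week
closed = [10, 15, 20, 25]   # bugs fixed and verified each week
print(weekly_net_open(opened, closed))  # [20, 20, 20, 13] -- still finding faster than fixing
print(pass_percentage(450, 600))        # 75.0
```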

Figure 1: Current bug rate

DataFinder engineering personnel were finding bugs faster than developers could fix bugs, a common problem early in the system test phase (See Figure 1).

Figure 2: Test pass percentage

As shown in Figure 2, they also found that the number of planned system tests kept growing, because more features were still being added to the software. Those late features were logged as bugs, but were being worked on as features.

Raw test coverage data suggested that the current suite of 1000 regression tests covered about 30% of the functions. However, engineering believed approximately half the functions were extraneous. Using this assumption, management decided the effective function coverage figure was approximately 60%.
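That adjustment is simple arithmetic: if some fraction of the functions is believed to be dead code, the raw coverage can be rescaled over only the functions that matter. A minimal sketch (the 50% extraneous figure is DataFinder's assumption, not a measurement):

```python
# Hypothetical sketch of DataFinder's coverage adjustment: rescale raw function
# coverage over the fraction of functions believed to be live (not extraneous).

def adjusted_function_coverage(raw_coverage, extraneous_fraction):
    """Coverage restated over only the functions believed to matter."""
    return raw_coverage / (1.0 - extraneous_fraction)

# 30% raw coverage, with half the functions assumed extraneous -> about 60%.
print(adjusted_function_coverage(0.30, 0.50))  # 0.6
```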

DataFinder management had a number of concerns, once they saw these metrics: when were they going to stop finding bugs faster than they could fix them, and when could they stabilize the system test baseline? Based on this data, management decided the feature freeze milestone had not yet been met.

In fact, this data gathering and analysis clarified their thinking about product quality. Initially, the SQA engineers thought low defects were most important. Software developers thought meeting the schedule was most important. After considering the data, management decided that features and performance were most important to the product's success.

DataFinder management decided to treat this as an opportunity, not a disaster. They selected system test entry criteria, so that they could achieve defined-length system test and beta tests. They would not have to work on four-week chunks to determine if they had a shippable product.

There are critical times to define and review measurements for product shipment:

  1. During product definition (the requirements and design phase)
  2. At the beginning of final test (system test)
  3. At the entry to beta test
  4. At the end of final test, in preparation for shipment

DataFinder had missed the opportunity to define metrics for the product definition phase, but they were able to set system test entry and exit criteria, beta entry criteria, and shipment criteria.

Define System Test Entry Criteria

During the product definition phase, it is important to set the goals and requirements for a product. Goals are the things the product team wants to accomplish. Requirements are the things they must accomplish. In that same spirit, one can define entry criteria for the system test phase. Sample criteria may be:

System test entry criteria (requirements)

  1. Define white box tests for all functionality defined in <xyz> spec.
  2. Define black box tests for all other functionality.
  3. Automate all tests for one specific platform.
  4. No outstanding Critical bugs.
  5. The product must be able to accomplish the basic functionality as described in <some formal document such as program plan, requirements doc>.
  6. All modules must meet code freeze: all features are in and frozen, all code is developed, integrated, and debugged at the unit level.
  7. All code reviews complete.

System test entry goals

  1. Automate all testing into a suite which can be run in a 4 hour period across all platforms.
  2. No more than 20 major bugs and 100 minor bugs.
  3. Code reviews complete.
  4. Unit tests developed for all code in <specific> modules.

DataFinder chose these criteria as system test entry criteria:

  1. All bug fixes must be code reviewed (peer walkthrough) before checkin.
  2. Minimum 95% regression test pass rate.
  3. All regression test failures must be known, bugs defined, and plan for fix in place.
  4. All performance tests must pass.
  5. Performance must be at <rate comparable to competitors>.
  6. Reliability must be 100%: all successful commits must have committed, all rollbacks must rollback successfully, all data recovery mechanisms must be successful.
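Criteria like these are cheap to check mechanically once the underlying data are gathered. A minimal sketch (the thresholds mirror the criteria above; the measured values are invented for illustration):

```python
# Hypothetical sketch: evaluate measurable entry criteria against current data.
# Thresholds mirror DataFinder's criteria; the measured values are invented.

criteria = {
    "pass_rate":   (lambda m: m["pass_rate"] >= 0.95, "Minimum 95% regression test pass rate"),
    "perf":        (lambda m: m["perf_failures"] == 0, "All performance tests must pass"),
    "reliability": (lambda m: m["reliability"] == 1.0, "Reliability must be 100%"),
}

measured = {"pass_rate": 0.96, "perf_failures": 2, "reliability": 1.0}

# List the criteria that are not yet met, for the readiness review.
unmet = [desc for check, desc in criteria.values() if not check(measured)]
print(unmet)  # ['All performance tests must pass']
```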

System test entry criteria must be developed by the SQA and developers, as a joint effort. In DataFinder's case, management suggested some of the criteria. The developers and SQA jointly agreed on all of the criteria and how to measure them.

It is easiest on everyone, and best for the product, to define the system test entry criteria as early in the project as possible. Criteria should be defined during the product definition phase, the functional spec phase, or the preliminary design phase. Initially, there were no system test entry criteria defined for DataFinder. A number of times, the development team claimed the product was ready for testing. However, the test effort was unable to effectively proceed due to numerous bugs. It became clearer during development that the product would never get to system test (successfully running the regression tests) unless these criteria existed.

Once entry criteria exist, it is very clear to the technical staff what is expected of them, and they can successfully work to those expectations. Schedule and development expectations can be met. A side effect of this type of entry criteria was to start discussing the complexity of the software in the framework of the criteria. DataFinder product developers and SQA engineers were able to think proactively about where to put the testing effort. If entry and exit criteria had not been specified for system test, management would not have been able to understand which areas of the product required more resources (people, machines, testing, etc.).

Defined system test entry criteria also give management another measurement point: how close are the technical staff to fulfilling the entry criteria? During a project there are times when technical contributors think “If I just had three more weeks…”. However, if every time you examine the criteria you are still three weeks away, there is a systemic problem. As an example, if four weeks from the start of the system test period you discover a product area with insufficient system tests, unit tests, or code reviews, you can take remedial action. When you have the metrics and track how close you are to meeting them, you can make smart choices. At one point in the project, it became clear that part of the product was never going to pass the performance criterion. Development was able to start a redesign and reimplementation in time to make the system test start date.

Define Beta Test Entry Criteria

In addition to the system test entry criteria, the DataFinder product development team chose these additional criteria for beta test:

  1. The code must be branched.
  2. Minimum 98% regression test pass rate.
  3. No bugs that cause core dumps.
  4. Minimum 90% function coverage in regression tests. (Regression tests must enter at least 90% of the modules in the system.)

The product performance criteria were met by entry to system test, and the beta criteria ensured that the customers would not receive software with more defects than they could handle.

A note for the code coverage purists: DataFinder management and technical staff were new to the notion of code coverage. They assumed that their admittedly extensive regression tests would provide them minimum 90% function coverage, and that 90% was somehow “good enough”.

Beta test entry criteria should also be defined as early in the project as possible. Again, management, development, and quality assurance personnel need to agree on the criteria. In addition to deciding on the criteria, they should also estimate the time from the start of system test to the time beta entry criteria are met. That will give them an historical perspective of this part of the project for the next project.

Define Shipment Criteria

DataFinder developed these criteria for shipment:

  1. At least 6 of the beta sites should be referenceable. That is, the customers should be sufficiently happy with the product that they will provide references for it.
  2. The final code branch must be in place, containing the code, compilers, and tests.
  3. Minimum 98% regression test pass rate.
  4. No bugs that cause core dumps.
  5. Minimum 90% function coverage in regression tests.
  6. All regression test failures must be known, bugs defined, and plan for fix in place.
  7. All performance tests must pass.
  8. Performance must be at <rate comparable to competitors>.

  9. Reliability must be 100%: all successful commits must have committed, all rollbacks must roll back successfully, all data recovery mechanisms must be successful.

Note that there are some business issues here: getting referenceable beta sites is not a traditional product development activity. However, one of the overall product goals was to become the defining product in the marketplace. That would be impossible without references.

The criterion regarding the code branch, with everything required to recreate the product, is not strictly a technical product development issue, but it is a management issue for product development. By fulfilling this criterion, the development team knows they can recreate the entire product at any time.

Criteria summary

Even though most of the metrics dealt with defects, DataFinder product quality was not defined by defects. The referenceable-site and performance criteria were the measurements that dealt strictly with customer satisfaction. Because this is an RDBMS type of product, the defect levels have to be low enough to ensure sufficient reliability. And the reliability criterion was separate.

Table 1 lists some possible metrics based on time to market and low defects:

Table 1: Possible metrics

Time to Market
  • Planned vs. actual dates
  • Task productivity rates (how productive the engineers are, in the aggregate, for design tasks, implementation tasks, etc.)
  • Feature productivity rates (how productive the sub-project teams are at completing work on an entire feature)

Low Defects
  • Defects found per unit time
  • Defects closed per unit time
  • Defects found per activity (all reviews, unit tests, etc.)

Date variations give you two pieces of information: is the project on schedule, and how accurate were the original estimates. If the estimates were not accurate, you can use this data as a starting point to determine why:

  1. Was the schedule generated using technical contributor input, or was it a complete guess to begin with?
  2. Did people assume they had 40-hour work weeks available? Almost no one actually has 40 work-hours available for project work [Abdel-Hamid91].
  3. Were people working on other projects aside from this one, reducing their available project-hours?

Task and Feature productivity rates give you data on how productive the engineering teams are by task and by feature. Note that this data should not be gathered or assessed by specific engineer, unless you really do have teams of only one engineer (no design reviews, no code reviews, no schedule reviews, just one developer who also writes all the documentation, plans and performs all the testing, and plans and monitors the project).

Defects found and closed per unit time can give you an indication when the testing is complete enough to stop. The bugs-found curve decays to close to 0.

Defects found per activity may give you some hints about which activities find the most issues, and which activities may need improvement. For example, if code reviews do not find at least 5-10 times more defects than system tests, your code review process may not be adequate.
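A minimal sketch of that review-effectiveness heuristic, with invented defect counts:

```python
# Hypothetical sketch of the review-effectiveness heuristic above.
# The defect counts are invented for illustration.

def review_to_system_test_ratio(found_in_reviews, found_in_system_test):
    """How many defects code reviews find for each defect found in system test."""
    return found_in_reviews / found_in_system_test

ratio = review_to_system_test_ratio(120, 40)
print(ratio)       # 3.0
print(ratio >= 5)  # False -- the review process may need attention
```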

Assessment of the Metrics

Assessment of the entry criteria is then a matter of gathering the data and seeing whether the data match the criteria. The data gathering is done by the responsible technical leads and presented to the project leader in preparation for a system test entry readiness review. The project leader presents the entry criteria and the data for each criterion. It is then obvious which criteria are met and which are not yet met. The risks are made visible to everyone at the readiness review and can be discussed. The project leader and project staff can then make the decision to enter or delay the system test period.

I have had experience where the entry criteria were used and where they were ignored. When the exit and entry criteria were used, we were able to predict exactly how long system test time would take, and when the product ship date would occur. When the entry criteria were used but the exit criteria ignored, the company shipped on time, but paid for that in support costs. When neither entry nor exit criteria were used, we were unable to predict ship date, and we were besieged by angry customers after shipment. The support costs were extremely high.

To return to the DataFinder case: Figure 3 shows the data for the total number of weeks from the time the company thought they were initially in system test to shipment:


Figure 3: Total DataFinder bug and checkin metrics

DataFinder found that the system test and bug fix/code walkthrough activities found more and more bugs until approximately week 18. The number of checkins tells you how many files are being perturbed by fixes, a possible indication of product stability. There is an interesting phenomenon at week 18: one or more of the bug fixes changed a large number of files, but the overall effect was to reduce the number of bugs found in subsequent weeks. This data certainly does not bear out the persistent folklore that says “It's only a one-line change, it can't be that bad”.

Figure 4: Total DataFinder system test plan, run, pass metrics

The number of system tests planned increased every week until week 22 (see Figure 4). This is an effect of the traditional delay in moving feature knowledge into the SQA area.

No matter what the product quality definition, these basic metrics help an organization understand the state of the software. The software should show increasing stability over time: fewer checkins, fewer new bugs found, more bugs closed, and fewer bugs open. In addition, a running total of the number of tests passed per day of the system test period will indicate product stability in the code base.
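A hypothetical sketch of that stability check: each weekly series should be declining week over week before you trust the trend (the counts below are invented):

```python
# Hypothetical sketch: a week-over-week stability check. A stabilizing product
# should show checkins, new bugs found, and open bugs all declining.
# The weekly counts are invented for illustration.

def is_stabilizing(checkins, new_bugs, open_bugs, weeks=3):
    """True if every series declined over each of the last `weeks` intervals."""
    def declining(series):
        tail = series[-(weeks + 1):]
        return all(a > b for a, b in zip(tail, tail[1:]))
    return all(declining(s) for s in (checkins, new_bugs, open_bugs))

print(is_stabilizing([40, 30, 22, 15], [25, 18, 10, 6], [90, 80, 65, 50]))  # True
```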


It is necessary to decide at the beginning of the project what is important to the product: what product quality is. Especially if ship date is most important to management, it is crucial to decide how good is “good enough” for the product, and how you will know when you have reached “good enough”. Measurements of the specific activities required for product shipment help you decide whether the data have reached the desired curve shapes or target values. The recommended metrics are those that have the most value to the customers, whether that value is in time to market, features, or low defects.

Metric-based decisions have a distinct advantage over gut-feel decisions: one can decide and negotiate the measurements before the project starts, and certainly before the project ends; it is possible to monitor progress and make ship decisions easily and quickly; and the company can then predict the effects of shipping the product.


Abdel-Hamid91: Abdel-Hamid, Tarek and Madnick, Stuart. Software Project Dynamics: An Integrated Approach. Prentice Hall, Englewood Cliffs, NJ. 1991.

Grady92: Grady, Robert. Practical Software Metrics for Project Management and Process Improvement. Prentice Hall, Englewood Cliffs, NJ. 1992.

© 1996 Johanna Rothman.
