Learning What You Don’t Know

©2005, Don Gray

“It ain’t what you don’t know that gets you into trouble. It’s what you know for sure that just ain’t so.” — Mark Twain

Working on hardware projects requires incredible attention to detail. Design time can take weeks, and actual fabrication time from order to delivery can be months. If something isn't correct, you can't just "edit and recompile". You start all over again, hopefully wiser, but just as far from delivery as before.

Recently I started working with a client bringing a hardened platform to market. The project involves custom electronics designed in house, a tailored BIOS, an embedded OS, and their proprietary application. In addition to task monitoring and supply chain details, I also report project risks to the Engineering VP and CEO.

Evaluating Risk

After reviewing the project scope and delivery schedule and meeting with the team and vendors, I started working on the risk assessment. My first pass was a spreadsheet that listed each risk I could think of, the probability it might happen (0 – 1), the impact if it did happen (again 0 – 1), and what we might be able to do if the risk occurred. A partial list looks like:

Risk Event                   | Probability | Impact | Factor | Recovery/Mitigation
Assembly plant burns down    | 0.0001      | 1      | 0.0001 | Find another fabrication source
Problems with board delivery | 0.1         | 0.7    | 0.07   | Stay in touch with the vendor. Try to expedite delivery.
Units fail certification     | 0.4         | 0.8    | 0.32   | Try to modify circuits without board redesign

The risk factors are precise, but I couldn't convince myself the numbers were accurate. The company has no quantitative data on vendor delivery, assembly times, or unit quality. I couldn't justify the quantitative values beyond "Well, the numbers seem right". Another problem with this approach is the side effect of multiplying two decimals together: the product shrinks. The units failing certification presents more of a problem than the "0.32" implies. I decided to switch to a qualitative approach.
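The quantitative pass can be sketched in a few lines. The events and numbers come from the table above; the structure and names are illustrative, not the actual spreadsheet.

```python
# Sketch of the quantitative approach: risk factor = probability * impact.
# Events and values are from the risk table; everything else is illustrative.
risks = [
    # (event, probability 0-1, impact 0-1)
    ("Assembly plant burns down", 0.0001, 1.0),
    ("Problems with board delivery", 0.1, 0.7),
    ("Units fail certification", 0.4, 0.8),
]

# Rank by factor, highest first.
for event, prob, impact in sorted(risks, key=lambda r: r[1] * r[2], reverse=True):
    print(f"{event}: factor = {prob * impact:.4f}")
```

Note how the multiplication understates the serious risks: two numbers below 1 always produce a smaller number, which is exactly the problem with the 0.32 for failed certification.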

I now rank the risks using the following matrix [1].

Potential Loss \ Likelihood | Low | Medium | High
High                        | C   | B      | A
Medium                      | D   | C      | B
Low                         | E   | D      | C

A high potential loss coupled with a low likelihood results in a "C" rating. I spend more time working on the "A" risks than on the "E" risks. My risk list now looks like:

Risk Event                   | Rating | Recovery/Mitigation
Assembly plant burns down    | C      | Find another fabrication source
Problems with board delivery | B      | Stay in touch with the vendor. Try to expedite delivery.
Units fail certification     | A      | Try to modify circuits without board redesign
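The matrix lookup is simple enough to sketch directly. The ratings come from the matrix above; the function name and structure are illustrative.

```python
# Sketch of the qualitative rating matrix. Rows are potential loss,
# columns are likelihood; both use Low/Medium/High. "A" is most urgent.
MATRIX = {
    #  loss      likelihood:  Low   Medium   High
    "High":   {"Low": "C", "Medium": "B", "High": "A"},
    "Medium": {"Low": "D", "Medium": "C", "High": "B"},
    "Low":    {"Low": "E", "Medium": "D", "High": "C"},
}

def rate(potential_loss: str, likelihood: str) -> str:
    """Return the letter rating for a (potential loss, likelihood) pair."""
    return MATRIX[potential_loss][likelihood]

print(rate("High", "Low"))  # the example from the text: "C"
```

The letters sort naturally, so a plain sort on the rating column brings the "A" risks to the top of the list.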

Monitoring Risk

Now I could create a timeline with deliverable items and their associated risks. Given the short time span (about 10 weeks), I used a spreadsheet with columns for deliverable, due date, risk, and last updated. For incremental tasks, I added a column for "to do next". This gave me a document I could quickly review every day and easily update with new information.

Component  | Deliverable   | Next Step                     | Est. Sched | Risk | Need Don To
Mechanical |               |                               |            |      |
Heatsink   | Drawing       | Start after test jig delivery | 5/10       | D    |
Vibration  | Certification | Need boards                   | 5/27       | A    | Review with Bill
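The daily-review list amounts to sorting deliverables so the riskiest, soonest items surface first. A minimal sketch, assuming illustrative field names (the actual document was a spreadsheet):

```python
# Sketch of the daily-review list: one row per deliverable, sorted so
# high-risk, near-term items come first. Rows are from the table above;
# the structure is illustrative.
from datetime import date

deliverables = [
    # (component, deliverable, next step, due date, risk rating)
    ("Heatsink", "Drawing", "Start after test jig delivery",
     date(2005, 5, 10), "D"),
    ("Vibration", "Certification", "Need boards",
     date(2005, 5, 27), "A"),
]

# "A" sorts before "D", so ordering by (rating, due date) puts the
# high-risk certification work at the top of the daily review.
for component, item, next_step, due, rating in sorted(
        deliverables, key=lambda r: (r[4], r[3])):
    print(f"{rating}  {due}  {component} {item}: {next_step}")
```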

Reviewing the deliverables, associated risks, and timetable left me feeling comfortable and confident we were home free. That’s when I started to worry. It’s not what I knew that was going to put the project in jeopardy; it’s what I didn’t know.

Exploring Risk

What I wanted to do was identify which risks were most likely to happen, and deal with them before the next set of units was built. What problems would we encounter? How could we prevent these problems from occurring? How would we deal with them if they did?

I selected three activities to deal with these questions.

  1. Build early – I borrowed this one from my software experience.
  2. Learn from each build – Work on improving the process.
  3. Simulate – If you don’t have the real thing, substitute and act like it’s real.

Build Early

This product had gone through three revisions without seeing the front door. It turned out there were units functionally equivalent to the units we were waiting for. Management agreed to create ten "engineering prototypes". These units were slightly off the final spec, but close enough that resilient users could work with them. This brought less loving, more critical views to the product. Differences between the product, the specification, and expectations were highlighted more quickly than in standard reviews. Users were encouraged to try to find ways to "break" the system. We used OS-specific testing software to verify the hardware functioned properly with the OS.

Practice Building

We chose to build one unit at a time. We’d note the problems with the build, correct the problems, and then build another. Occasionally it took a couple of days to correct the infrastructure issues, but we had 10 weeks. Why hurry? Get it right. Try again. After three build iterations, we had the kinks worked out of the process. This also had the side benefit of reducing the amount of rework when we discovered we needed to go back to the beginning and modify the units already built.

Building early and practicing building helped identify process risks, and allowed us to deal with them in advance. But the nagging question remained: what else don't I know?

Simulation

Jerry Weinberg suggested we simulate the builds: act as if everything were ready to go, and start building the product. This exercise flushed out an almost invisible category of risk. We had all the big parts, but didn't have sufficient small parts for some assemblies: little things like washers and screws. The simulation could have been much more complex. We could have included interfaces, software components, and interactions between components. Since we had working functional prototypes, we chose to use actual equipment for exploring these risk areas.

In The End

Knowing what could go wrong is half the problem. Risk management isn't over once the risks, and how to deal with them, are identified. Risk management requires daily attention: actively anticipating what can go wrong, working to prevent the problems, and reducing the impact of risks that actually occur.

I appreciate the AYE community for sharing ideas on this topic. You can read all the suggestions at

[1] http://www.comp.glam.ac.uk/Teaching/ismanagement/riskman1f.htm
