Why Progressive Estimation Scale Is So Efficient For Teams

By Alex Yakyma.

In this article we will show that a progressive estimation scale, such as the Fibonacci sequence often used by agile teams, is more efficient than a linear scale and provides the team with more information about the size of backlog items. We will use fairly basic principles of Information Theory to arrive at our results. We will also formulate a hypothesis about normalization of the estimation base.

In Agile we quite often see examples of a “progressive” estimation scale used for estimating the size of backlog items. Most often it is the Modified Fibonacci Sequence: 0, ½, 1, 2, 3, 5, 8, 13, 20, 40, 100. Less common is the scale used by XP proponents: 0, 1, 2, 4, split (we will refer to it as the XP Scale further in the text). We call these scales progressive to reflect the fact that their values grow much faster than linearly. We also often call them (not 100% accurately in the general case) exponential. In fact, the XP scale is itself exponential, and the Fibonacci scale can be roughly approximated, for the range of values defined above, by f(x) = 1.6^(x-1), as shown in the figure below:


Figure 1. Modified Fibonacci sequence approximated by exponential function for its range of values.

The point we are making here is that both scales can, with some degree of accuracy, be called “exponential”.
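To make this concrete, here is a minimal sketch (assuming x is simply the 1-based position of a value in the scale, starting from the value 1; this mapping is our assumption) comparing the modified Fibonacci values with 1.6^(x-1). The fit is rough, especially for the larger, rounded values such as 40 and 100:

    # Minimal sketch: compare the modified Fibonacci scale values with the
    # exponential approximation 1.6**(x - 1), where x is assumed to be the
    # 1-based position of the value in the scale (starting from the value 1).
    fibonacci_scale = [1, 2, 3, 5, 8, 13, 20, 40, 100]

    for x, value in enumerate(fibonacci_scale, start=1):
        approx = 1.6 ** (x - 1)
        print(f"position {x:2d}: scale value = {value:5.1f}, 1.6^(x-1) = {approx:6.1f}")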

Having said that, let’s ask ourselves why we use an exponential scale at all. As we well know, thousands of agile teams apply these estimation scales very successfully. What makes an exponential scale so useful? We used to answer this question by pointing out that the bigger the size of a backlog item (say, N story points), the harder it is to tell the difference between N and N – 1. That is true. However, the statement itself is not a fundamental axiom but merely a corollary of more general principles, which we are going to consider.

Information Theory Perspective
Let’s approach the question from a slightly different angle. Assume that we know (very roughly) that backlog item U is no bigger than L units of size (story points or ideal man-days, it does not really matter at this point), and that its size can be any value from 0 to L with equal chances. Let’s also assume that we have an estimation technique that allows us to estimate the size of U with absolute precision P. Now let’s look at how much information we obtain about the size of U by applying our estimation technique.


Figure 2. Illustration of symbols: U is an arbitrary backlog item; L – maximum possible size of any backlog item from the backlog; P – absolute precision of estimation; the point on the horizontal axis linked to item U shows its size.

As we know from the basics of Information Theory, the information (also called the mutual information of two experiments, see http://en.wikipedia.org/wiki/Mutual_information for more detail) contained in experiment A about experiment B can be expressed as follows:

I(A,B) = H(B) - H_A(B)

Here H(B) is the entropy (see http://en.wikipedia.org/wiki/Entropy_(information_theory) for more detail), or in other words the level of uncertainty, of experiment B; H_A(B) is the entropy of B after experiment A has taken place (also called the conditional entropy). So it is easy to interpret the amount of information contained in experiment A as the amount of uncertainty it removes from experiment B.
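As a minimal illustration of this formula (assuming, as in our setup, that the size of U takes one of L equally likely discrete values and that experiment A only reveals which bucket of width P the true size falls into; the particular numbers are illustrative), we can compute both entropies directly:

    import math

    # Illustrative numbers: L possible equally likely sizes, estimate narrows
    # the size down to a bucket of width P.
    L = 64
    P = 4

    def entropy(probabilities):
        """Shannon entropy in bits."""
        return -sum(p * math.log2(p) for p in probabilities if p > 0)

    H_B = entropy([1 / L] * L)            # uncertainty before the estimate
    H_B_given_A = entropy([1 / P] * P)    # still uniform within one bucket of width P
    I = H_B - H_B_given_A                 # mutual information I(A,B) = H(B) - H_A(B)

    print(f"H(B) = {H_B:.2f} bits, H_A(B) = {H_B_given_A:.2f} bits, I(A,B) = {I:.2f} bits")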

Let’s now apply this formula to our case. We have two “experiments”: one of them (B) consists of finding the absolutely accurate size of our backlog item U; the other (A) consists of applying our estimation technique and thus reducing the uncertainty to a certain extent (i.e. obtaining some information). Without going into mathematical depth, we note that since the size of U is equally likely to be anywhere from 0 to L and the estimate narrows it down to an interval of width P, applying Shannon’s formula (actually the definition of mutual information) gives:

I(A,B) = log(L/P)

where the base of the logarithm is not of great importance and can be any number greater than 1, thanks to the change-of-base property of logarithms (see http://en.wikipedia.org/wiki/Logarithm#Change_of_base for instance). Typically base 2 is used, which makes good sense from a computational standpoint, since we eventually store all information in binary form.

This equation gives us a very interesting result: the information that we obtain from estimation grows much more slowly than the precision of estimation. More specifically, it grows only as fast as the logarithm function. From the graph of the logarithm function below we can see why “a little estimation effort helps a lot and a big estimation effort helps little”:


Figure 3. Logarithmic behavior of the information about the size of an item obtained from the estimation process. The horizontal axis represents relative precision (more accurately, the value of L/P) and the vertical axis represents information (in bits).
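The diminishing returns are easy to see numerically; here is a minimal sketch (the values of L/P are purely illustrative):

    import math

    # Every doubling of the relative precision L/P adds only one more bit of
    # information about the size of the item.
    for ratio in [2, 4, 8, 16, 32, 64, 128]:
        print(f"L/P = {ratio:4d}  ->  I(A,B) = log2(L/P) = {math.log2(ratio):.1f} bits")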

And finally, using an exponential (or close to exponential) estimation scale becomes perfectly logical: it adds valuable information faster. Indeed, any function f(x) = a^(log_b x) grows faster than log_b x itself (here a and b are both greater than 1). In the ideal case, which we certainly do not expect in practice, when a = b, this function simply becomes the trivial linear function f(x) = x. But since we are not claiming that this is necessarily the case, we say “adds information faster” and avoid saying “adds information linearly”.
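To see this numerically, here is a minimal sketch (the concrete values of a and b are illustrative, not derived from any real scale); in particular it shows that b^(log_b x) reproduces x exactly, which is the a = b case:

    import math

    # Compare log_b(x), the "corrected" curve a**log_b(x), and the a == b case,
    # which collapses to the identity function.
    a, b = 2.0, 1.6

    for x in [1, 2, 4, 8, 16, 32]:
        log_term = math.log(x, b)      # log_b(x)
        corrected = a ** log_term      # a**log_b(x), grows faster than log_b(x)
        linear = b ** log_term         # with a == b this is exactly x
        print(f"x = {x:3d}: log_b(x) = {log_term:5.2f}, "
              f"a^log_b(x) = {corrected:7.2f}, b^log_b(x) = {linear:6.2f}")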

Normalization Hypothesis 
Another interesting question is, given that most teams use the Fibonacci sequence as their estimation scale, can they somehow ensure that their “corrected information curve” (i.e. the graph of the corresponding function f(x) = a^(log_b x)) grows fast enough to be more useful than a linear scale? It seems safe to assert the logarithmic behavior of the “information curve”, but finding its exact logarithmic base in practice does not seem an easy task. We can hope, though, that teams are able to empirically find the right usage of the Fibonacci scale and thus obtain information efficiently. So what is that “right usage”?..

For further simplicity (and as we showed before) we can use a certain exponential function instead of the Fibonacci sequence. Even if the base is fixed (say, f(x) = a^x), our team can still change the function itself by… yes, “re-scaling” the story points. Indeed, if at some point they adopt a new “meaning” of a story point, saying, for instance, that two old story points now make one new story point, then an item of size x old points has size t = x/2 new points, i.e. x = 2t. Substituting this expression for x into the formula above turns f(x) into another function g(t) = f(2t) = b^t, where b is the square of a. g(t) is again an exponential function, but with a different base. The figure below shows an example of how re-scaling affects the “estimation scale curve” if we take two new story points for every three old story points:


Figure 4. After re-scaling the estimation base by taking 2 “new” story points for every 3 “old” story points, the base of the exponential function (the “estimation function”) changes. The blue graph shows the old estimation scale and the red one shows the new scale.
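Here is a minimal sketch of the re-scaling argument (assuming f(x) = a^x with an illustrative value of a), covering both the two-old-for-one-new example from the text and the three-old-for-two-new example from Figure 4. In both cases the same item keeps the same value; only the base of the exponential function changes:

    # Re-scaling story points changes the base of the "estimation function".
    # The value of a is illustrative.
    a = 1.6

    def f(x):            # estimation function in old story points
        return a ** x

    # Two old story points become one new story point: x = 2*t, new base a**2.
    b_half = a ** 2
    # Two new story points per three old ones (Figure 4): x = 1.5*t, new base a**1.5.
    b_two_thirds = a ** 1.5

    for x_old in [3, 6, 9]:
        t_half = x_old / 2            # same item measured in the first new unit
        t_two_thirds = x_old * 2 / 3  # same item measured in the second new unit
        print(f"f({x_old}) = {f(x_old):7.2f}, "
              f"(a^2)^t = {b_half ** t_half:7.2f}, "
              f"(a^1.5)^t = {b_two_thirds ** t_two_thirds:7.2f}")  # all three match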

This by the way means that:

Changing the estimation base of a Fibonacci scale is a consequential decision, and one that is often underestimated. It may bring the team closer to, or move it further from, efficient estimation. There is only one ideal base for estimation, and it is important to get close to it.

But now we can consider an interesting fact that many agile teams could confirm:

After some time, agile teams often “normalize” (i.e. re-scale story points until they feel really comfortable with them) and arrive at approximately the same estimation base on the Fibonacci scale: a team typically has 30 – 60 story points as its average velocity for a two-week sprint.

This of course depends on the team size and would be smaller for smaller teams and bigger for larger ones. But perhaps this is just the way teams optimize their estimation base over time in order to efficiently manage the potentially available information. Based on this we formulate our...

Normalization Hypothesis: Agile teams that normalize their estimation base over time likely arrive at their optimal estimation capability (from the standpoint of the information they acquire about the backlog items they estimate). In other words, they empirically find an exponential function that makes the “corrected information curve” close to linear.

Although it may look like a bit of a journey to get to a better estimation base, there is a simple but reliable method to get started: assigning 8 story points to each team member per sprint seems a reasonable initial approximation. Indeed, considering the hypothesis above, it gives us N*8 varying from 32 to 64 story points when N (the size of the team) varies from 4 to 8. Learn more about this estimation base in (http://www.amazon.com/Agile-Software-Requirements-Enterprise-Development/dp/0321635841, Chapter 8, “Agile Estimating and Velocity”).
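As a minimal illustration of this starting point (the team sizes below are just the range mentioned above):

    # Rough initial estimation base: about 8 story points of capacity per team
    # member per two-week sprint, for team sizes from 4 to 8.
    POINTS_PER_MEMBER = 8

    for team_size in range(4, 9):
        print(f"team of {team_size}: initial velocity ~ {team_size * POINTS_PER_MEMBER} story points")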

3 comments:

  1. The punchline here is that “[a] little estimation effort helps a lot and [a] big estimation effort helps little”. And that's very true, especially if we understand (as McConnell says) the point of estimation is not to predict the future but to understand if we are even in with a chance of managing our way to success.
    While I admire this semi-quantitative approach, there are some problems. This: “assume that we know […] that backlog item U is no bigger than L units of size […] but can be any value from 0 to L with equal chances” is a strange assumption. The classical PM literature observes that for a given task there is a time (or effort) less than which it cannot physically take, so the probability of an actual time (or effort) less than this minimum is zero. And, the most likely time (or effort) is to the right of that minimum, so the probability must increase to the right up to that modal value. And then the probability must decline to the right of that modal value up to some maximum amount (after which we would give up), and because there are more ways for a task to go wrong than to go right this maximum is far to the right. A suitably parameterised β distribution, or for quick and easy work an asymmetrical triangular distribution, is used.
    Also, this: “assume that we have an estimation technique that allows us to estimate the size of U with absolute precision of P” seems highly optimistic. I'd expect that the precision (or is it really accuracy?) would vary directly with the absolute size of the thing being estimated, and also that the precision would be asymmetrical, as we know that estimates tend to be optimistic, even when we take into account that they tend to be.
