Software bugs

How can we predict how many bugs some software will have? Let's try this:

Terms

B == Number of bugs
L == Amount of code (for now let's say "lines of code")
D == Number of developers

Simple

We might say:

B = L / D

Which would mean that, with a single developer, the number of bugs increases as the number of lines of code increases. It also implies you can decrease the number of bugs by adding new developers (which is generally accepted to be false).
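To make the model concrete, here it is as code (a throwaway sketch; the function name "estimateBugs" and the sample numbers are made up for illustration):

function estimateBugs(lines, developers) {
    // Naive model: bugs grow with code size and shrink with head count.
    return lines / developers;
}

estimateBugs(10000, 1);  // 10000
estimateBugs(10000, 10); // 1000 -- add developers, get fewer bugs? Suspicious.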

Lines of code is a bad metric

Lines of code is not a great metric (but it's also not that bad). We might instead count the number of conditionals (IF statements, etc.) since that would better represent the actual logic. But then it's possible to mess that up and write unnecessary logical constructs that inflate the conditionals count. Ah, but that's ok, because that maps to additional complexity, and even if that complexity is unnecessary it's still going to be an upwards force on our bug count (because some poor soul still has to understand that shit).
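For what it's worth, a crude conditional-counting sketch might look like this (the function name "countConditionals" and the particular constructs it looks for are assumptions for illustration, not a serious complexity metric):

function countConditionals(source) {
    // Crude proxy for complexity: count branching constructs in the source text.
    var matches = source.match(/\bif\b|\bcase\b|\bwhile\b|\?|&&|\|\|/g);
    return matches ? matches.length : 0;
}

countConditionals("if (a) { return b ? c : d; }"); // 2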

Let's just sidestep the issue and call "L" the level of complexity of the code however we choose to define that.

Number of developers is a bad metric

Having "D" mean "number of developers" is a problem. We know (or believe) that adding more developers does not reduce the number of bugs. But there must be some constraint to this rule: a codebase of a million lines and one developer is going to have more bugs than the same codebase with two developers. Surely that is true.

Yet a codebase with a million lines of code and a million developers is also going to have more bugs than the same codebase with two developers, no? But why would this be? If each developer only maintained a single line of code, why wouldn't there be fewer bugs? Perhaps because their one line of code would be completely meaningless without knowledge of the other X lines of code, meaning they would need to communicate with X other developers just to understand their one little line.

Let's redefine "D" to mean something other than "number of developers". It would be a function that takes the number of developers as input, but it would represent some form of productivity factor. Let's imagine "D" follows a pattern:

  ^
  |  *
  | * **
1 |*    ***
  |        ****
  |            *****
  |                 *********
  |                          *************...
0 +-------------------------------------------->
   1
                nr. developers

Meaning: At 0 developers the "D" factor is 0. This would make "B = L / 0" throw a divide by zero, because, well, I guess the number of bugs would be infinite (or zero, or NaN, or whatever divide by zero means philosophically). At 1 developer we would have "B = L / 1", which means the number of bugs would increase linearly with the increase in code complexity (debatable, I guess).

As you increase the number of developers you see an increase in "D", up to a tipping point. After that you see a decrease in "D". At some point the "D" factor drops below 1, for example "B = L / 0.8", meaning dividing by "D" now effectively multiplies the bug count instead of reducing it.

Does "D" ever reach zero? I don't know. Let's just say that it approaches zero as a limit. If we have a million developers the D factor will be effectively zero. The statement is the same as: "when you have a million developers coordinating on a project, the chance of anything happening is effectively zero".
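For the sake of having something to poke at, here is one made-up curve with that shape: D(0) = 0, D(1) = 1, a peak somewhere past 1, then a slow decay towards zero. The formula and the peak position of 3 developers are pure assumptions, not a claim about real teams:

// dFactor(n) = n * e^((1 - n) / peak)
function dFactor(developers, peak) {
    peak = peak || 3; // assumed "ideal" team size
    return developers * Math.exp((1 - developers) / peak);
}

dFactor(0);       // 0
dFactor(1);       // 1
dFactor(3);       // ~1.54 (the peak, for this made-up curve)
dFactor(10);      // ~0.50 (below 1: more developers now inflate the bug count)
dFactor(1000000); // effectively 0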

Premature conclusion

Basically, you identify a value of "D" for your environment that makes sense. Maybe that's 2, maybe that's 10. Whatever it is, you are aiming for that peak in productivity. Once you have "D", you can then reduce "L" to get to your acceptable level of "B".
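Stringing the sketches together (again, every number here is invented):

// Team size pinned at the assumed productivity peak of 3 developers;
// the only remaining lever for lowering B is lowering L.
var B = 5000 / dFactor(3);      // ~3246
var lowerB = 2000 / dFactor(3); // ~1299 -- reduce complexity, reduce bugs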

Not at all the end

If that were all true then "Information Technology" as we know it, as a whole, simply wouldn't exist. Every company or Open Source project writing code would only have "D" number of participants, and these companies or projects would never interact with each other.

The "shape" of the "D" function can be altered. What the "D" function represents is actually a really complex thing that includes a ton of factors. It's even probable that "B" and "L" influence "D" - your developer productivity is influenced by the number of bugs and the code complexity.

Let's take an example of the truth table for a "full-adder" binary circuit. We will ignore the carry-out flag and just look at the logic for generating the bit sum in the output:

A   B   C     F
---------------
0   0   0     0
0   0   1     1
0   1   0     1
0   1   1     0
1   0   0     1
1   0   1     0
1   1   0     0
1   1   1     1

Here, "A" is one input bit, "B" is the second input bit and "C" is the input carry. The "F" represents our output. We know that this truth table can be implemented using two XOR gates:

F = XOR(C, XOR(A, B))

However "XOR" itself is an abstract concept. If we write this bit-sum in "Canonical Sum of Products" notation without XOR we get:

F = (A'B'C) + (A'BC') + (AB'C') + (ABC)

So we could write a JavaScript function for this adder that looks like this:

function bitsum(a, b, c) {
    return (!a && !b && c) || (!a && b && !c) || (a && !b && !c) || (a && b && c);
}

But we know that an XOR looks like this:

XOR = (AB') + (A'B)

So we could write this:

function xor(a, b) {
    return (a && !b) || (!a && b);
}

function bitsum(a, b, c) {
    return xor(c, xor(a, b));
}

Clearly the second option allows us to have more developers on our project. We define an abstraction called "xor" and we use that abstraction to make our "bitsum" function comprehensible.