
Resisting the urge to build system 2.0 (after designing system 2.0)

Working on legacy code can be quite hard. It is often messy and complicated and, perhaps most difficult of all, people are using it right now. It can be very tempting to sweep away all the cobwebs and start again from the beginning. Build system 2.0. Build system NG (Next Generation). There is plenty of advice out there to resist the urge to do a rewrite. It is highly likely that the existing code is complicated because it is trying to model the world, and the world is complicated. In order not to lose features, your brand new code will probably be just as complicated: tidier perhaps, but just as complicated.

I have worked on systems scattered with the bones of version 2.0s. The general pattern seems to be 1) get frustrated with legacy code, 2) redesign the system or sub-system, 3) write a clean implementation of the design and name it version 2.0, 4) integrate version 2.0 into the old system and 5) start migrating over to it.

Steps 2) and 3) tend to happen with a lot of enthusiasm and are often completed fairly quickly. Because people often resist the temptation to rewrite for quite some time, they have a fairly good idea about what is wrong with the old system and have great ideas about what system 2.0 should look like. On the surface, system 2.0 is often clearly better than the original. This improvement is used to sell the new version to stakeholders. Promises start getting made about system 2.0: development is slow because the code is old and messy (valid), and by doing a rewrite we can pay off some of our technical debt and be faster in the future (after we take the time to do the migration).

However, things start to come unstuck after the implementation, during integration and migration. This happens for a variety of reasons, including:

Key features that people didn't know about were not implemented in version 2.0. Sometimes features are also deliberately left out of version 2.0 because stakeholders agree they are not critical, but then that decision gets reversed part-way through for whatever reason.
Running two systems in parallel is not an easy task. Migration from one system to another often means running two systems at the same time, and if the design of system 2.0 is fundamentally better it probably means it is incompatible with the old.
Data collected during the reign of the old system should be available in the new system. Deep knowledge gained about the domain is encoded in that data and it can be unacceptable to throw it away. Therefore a data migration is often desired and this can lead to problems if the systems are incompatible at a data level.
Even if all of the above is successfully executed, deleting the old code or dropping old tables often cannot happen for a long time after the migration, leaving confusing traces of the old system lurking in the new system, sometimes for years.

Probably the most interesting aspect of integration is that a deep understanding of both the old and new systems is required for it to go well, perhaps even a level of understanding that is not present at the time system 2.0 is designed. If integration and migration are so difficult and failure-prone, what can be done?

Stop at design

On the surface the choice presented is between keeping the status quo and muddling on with legacy code, or pushing ahead no matter how painful until the process is completely done. I have seen many very smart people completely underestimate the amount of effort required to see a project like this through to the end. This is not just technical effort either. While all this is happening the company or organization needs to continue to provide value to customers, and there is a substantial amount of stakeholder management involved in this process. Does that mean giving up and living with the pain? I propose a third option, which is to stop at design and turn the design into a collective technical vision.

The pain points associated with legacy code are real. The slow pace of development due to legacy code is also very real. Thinking about the problem from a fresh perspective and doing a redesign of the system is a totally worthwhile and productive exercise. Going through this process can even give fresh insights into exactly how the old system works. My proposition is to stop at that point and hold the new design up as a technical vision: a guiding principle against which all future system changes are measured, as well as a concrete plan to actively pursue. When tough decisions need to be made about the direction the code should take, this vision can help break the deadlock. The course of action can be decided by asking whether it takes us closer to our collective technical vision.

And that is one of the key points in having this technical vision: if you have gathered enough consensus on what system 2.0 should look like, this vision is going to be a collective one. If you have done the design, it is going to be documented somewhere, and that means it can outlive any single developer or single team. If you have done the design, you have also been gathering information from others around you about how it should work, meaning those people also hold the vision clearly. What you end up with is a group of people who have a clear understanding of the direction they should be taking whenever they put fingers to the keyboard, and I believe this is exceptionally valuable for productivity and empowerment.

Set the course

Now the hard work begins. You can choose to be aggressive in how fast you steer towards the vision, but you can also be more passive. Being aggressive means actively assigning chunks of development time to restructuring and refactoring old code. Being passive may mean leaving it to the discretion of individual developers to incrementally improve parts of the code as they work on it. You can distribute this load unevenly depending on capability, giving more skilled developers larger chunks of time to rewrite and refactor and less skilled developers smaller ones. What is critical, though, is not to segregate your developers into those who hold the vision and those who do not. When doing a rewrite as system 2.0 you necessarily segregate developers into two groups: those who get to implement the nice new system and those who don't. This can lead to resentment among talented people who are left working on the old code. By holding a collective technical vision you can empower all developers to act towards that goal.

Do not try to trick stakeholders

It may seem tempting to keep all this quiet when communicating with people outside the technical team. Don't. If development was slow before, it will get even slower for a while. In addition to implementing new features or fixing old bugs, you are now also actively working on improving the old code. The speed at which this happens can be adjusted to the needs of stakeholders. There is a danger that once this design and vision have been created they gather dust while people get on with the business of keeping everything running. It is vital to communicate this technical vision clearly to stakeholders so they understand why things are moving in that direction. The amount of work that needs to be done is similar to the full process of a rewrite and migration to a system 2.0. The workload is similar and the time-span is likely similar too, but there is a greatly reduced risk of failure. Make sure that stakeholders understand that this is not a magic show where the refactoring work gets hidden in the margins.

Why is this better?

If the amount of work required is about the same, what is the advantage? An increased likelihood that the execution will not fail, because:

The technical vision can be adjusted as more information comes to light about the old system. As long as those adjustments are made in agreement with the whole team and stakeholders, the direction can be tweaked while moving.
Everyone knows what is being built and why it is being built in that way. Everyone can contribute to the vision and also contribute in a very real way to moving towards it. Contributions are not limited to a separate team building the replacement system 2.0.
There is a shared design and nomenclature that will result in more productive meetings and discussions.
There is no running two systems in parallel, meaning there is only one system to reason about.
There is no need to back-populate a new system 2.0 with incompatible data from the old system, because that data work happens alongside the incremental changes throughout the process.

How to start

Encourage everyone to contribute their thoughts on what is wrong with the old code. Encourage stakeholders to contribute their thoughts on the big things the system can't do now that they see future value in. Start to design the new system: gather information, have discussions, develop nomenclature, draw pictures, write things down, keep people involved. Resist the urge to start writing a new system. Encourage others to resist the urge to start writing a new system. Talk to people working on the old code and try to gauge how incompatible the emerging design is with the current reality. You might find people saying things like "this table in this database is sort of like that component in the new design" or "this chunk of the old code should really be a separate service" - those observations are valuable because they can help you validate the new design and also make good first targets for the initial work.
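To make the idea of a "good first target" a little more concrete, here is a minimal sketch of what one such incremental step might look like. Everything in it is hypothetical and invented for illustration: the pricing example, the PricingService name (imagined as a component from the new design) and the function names. The point is only that a chunk of old code can be given the name and shape it has in the design while it stays inside the one running system.

```python
from typing import Protocol


# Hypothetical legacy code: pricing rules buried in an order-handling module.
def legacy_calculate_total(order: dict) -> float:
    subtotal = sum(item["price"] * item["qty"] for item in order["items"])
    if order.get("customer_type") == "wholesale":
        subtotal *= 0.9  # existing behaviour is preserved, not redesigned
    return round(subtotal, 2)


class PricingService(Protocol):
    """Hypothetical component name and shape taken from the new design."""

    def total_for(self, order: dict) -> float: ...


class LegacyPricingAdapter:
    """First step: the old code, reached through the design's interface."""

    def total_for(self, order: dict) -> float:
        return legacy_calculate_total(order)


def checkout(order: dict, pricing: PricingService) -> str:
    # Callers now speak the design's vocabulary instead of calling
    # the legacy function directly.
    return f"Total due: {pricing.total_for(order):.2f}"


if __name__ == "__main__":
    order = {"customer_type": "wholesale", "items": [{"price": 10.0, "qty": 3}]}
    print(checkout(order, LegacyPricingAdapter()))
```

Nothing is rewritten here and no second system exists. The legacy function still does the work, but later changes to pricing can happen behind the interface without touching its callers.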

Design system 2.0 but resist the urge to build system 2.0.