Working on legacy code can be quite hard. It's often messy and complicated and, perhaps most difficult of all, people are using it right now. It can be very tempting to sweep away all the cobwebs and start again from the beginning. Build system 2.0. Build system NG (Next Generation). There is plenty of advice out there to resist the urge to do a rewrite. It is highly likely that the existing code is complicated because it is trying to model the world and the world is complicated. In order not to lose features, your brand new code will probably be just as complicated - tidier perhaps, but just as complicated.
I have worked on systems scattered with the bones of version 2.0s. The general pattern seems to be 1) get frustrated with legacy code, 2) redesign the system/sub-system, 3) write a clean implementation of the design and name it version 2.0, 4) integrate version 2.0 into the old system and 5) start migrating over to it.
Steps 2) and 3) tend to happen with a lot of enthusiasm and are often completed fairly quickly. Because people often resist the temptation to rewrite for quite some time, they have a fairly good idea about what is wrong with the old system and have great ideas about what system 2.0 should look like. On the surface, system 2.0 is often clearly better than the original. This improvement is used to sell the new version to stakeholders. Promises start getting made to stakeholders about system 2.0: development is slow because the code is old and messy (valid), and by doing a rewrite we can pay down some of our technical debt and be faster in the future (after we take this time to do the migration).
However, things start to come unstuck after the implementation, during integration and migration. This happens for a variety of reasons including:
Probably the most interesting aspect of integration is that a deep understanding of both the old and new systems is required for it to go well, perhaps even a level of understanding that is not present at the time system 2.0 is designed. If integration and migration are so difficult and failure-prone, what can be done?
On the surface the choice presented is between keeping the status quo and muddling on with legacy code, or pushing ahead no matter how painful until the process is completely done. I have seen many very smart people completely underestimate the amount of effort required to see a project like this through to the end. This is not just technical effort either. While all this is happening the company or organization needs to continue to provide value to customers, and there is a substantial amount of stakeholder management involved in this process. Does that mean giving up and living with the pain? I propose a third option, which is to stop at design and turn the design into a collective technical vision.
The pain points associated with legacy code are real. The slow pace of development due to legacy code is also very real. Thinking about the problem from a fresh perspective and doing a redesign of the system is a totally worthwhile and productive exercise. Going through this process can even give fresh insights into exactly how the old system works. My proposition is to stop at that point and hold the new design up as a technical vision: a guiding principle against which all future system changes are measured, as well as a concrete plan to actively pursue. When tough decisions need to be made about the direction the code should take, this vision can help break the deadlock. The course of action can be decided by asking whether it takes us closer to our collective technical vision.
And that is one of the key points in having this technical vision: If you have gathered enough consensus on what system 2.0 should look like, this vision is going to be a collective one. If you have done the design it is going to be documented somewhere and that means it can live longer than any single developer or single team. If you have done the design you have also been gathering information from others around you about how it should work, meaning those people also have the vision clear. What you end up with is a group of people that have a clear understanding of the direction they should be taking whenever they put fingers to the keyboard, and I believe this is exceptionally valuable for productivity and empowerment.
Now the hard work begins. You can choose to be aggressive in how fast you steer towards the vision, but you can also be more passive. Being aggressive will mean actively assigning chunks of development time to restructuring and refactoring old code. Being passive may mean leaving it to the discretion of individual developers to incrementally improve parts of the code as they work on it. You can distribute this load unevenly depending on capability, giving skilled developers larger chunks of time to rewrite and refactor while giving less skilled developers less time. What is critical though is to not segregate your developers into those that hold the vision and those that do not. When doing a system 2.0 rewrite you necessarily segregate developers into two groups: those that get to implement the nice new system and those that don't. This can lead to resentment among talented people who are left working on the old code. By holding a collective technical vision you can empower all developers to act towards that goal.
It may seem tempting to keep all this quiet when communicating to people outside the technical team. Don't. If development was slow before, it will get even slower for a while. In addition to implementing new features or fixing old bugs you are now also actively working on improving the old code. The speed at which this happens can be tuned to the needs of stakeholders. There is a danger that once this design and vision has been created it gathers dust while people get on with the business of keeping everything running. It is vital to communicate this technical vision clearly to stakeholders so they understand why things are moving in that direction. The amount of work that needs to be done is similar to the full process of a rewrite and migration to a system 2.0. The time-span is also likely similar. However, there is a greatly reduced risk of failure. Make sure that stakeholders understand that this is not a magic show where the refactoring work gets hidden in the margins.
If the amount of work required is about the same, what is the advantage? An increased likelihood that the execution will not fail, because:
Encourage everyone to contribute their thoughts on what is wrong with the old code. Encourage stakeholders to contribute their thoughts on what big things the system can't do now that they see future value in. Start to design the new system, gather information, have discussions, develop nomenclature, draw pictures, write things down, keep people involved. Resist the urge to start writing a new system. Encourage others to resist the urge to start writing a new system. Talk to people working on the old code and try to gauge how incompatible the emerging design is with the current reality. You might find people saying things like "this table in this database is sort of like that component in the new design" or "this chunk of the old code should really be a separate service" - those are valuable because they can help you validate the new design and also be good first targets for the initial work.
Design system 2.0 but resist the urge to build system 2.0.