A while ago, I’ve been working on a very large codebase that consisted of a few million lines of code. Large systems are usually a big mess and this one was no exception. Since this is a rather common problem in software engineering, I thought the internet would be littered with stories about this topic. There is a lot of talk about software carpentry, while software maintenance is rarely debated. Either large programs are being maintained by dark matter developers or nobody thinks that writing stories about large systems are interesting enough.
In the past I’ve encountered a few of those large monsters and they seem to have a lot in common. This article will try to present some of the problems and tricks that I am using when I have to deal with them. Hopefully this will inspire others to write similar posts and share tips from their own bag of tricks.
The main problem of any large codebase is the extreme complexity that stems from the fact that we live in a messy world of details that are very hard to describe and put into words. The programming languages that we are using nowadays are still too primitive for that task, and it takes a lot of lines and various layers of abstractions before we are able to convey the rules of our world to the all mighty computer .
The following sections will present some of the common problems which I’ve discovered during my big system adventures.
A common trait of a large codebases is that at some point they become so large and bloated that one person alone is no longer capable of understanding all its pieces. It seems to me that after 100’000 lines of code, the maintenance related problems start to appear as the complexity of the code simply dwarfs the capabilities of the human brain. Such large systems are commonly maintained by more than one person, but with a large group of people also come large organizational problems.
Within a large group of people the number of possible communication paths between them go bananas and so it often happens that the ass no longer knows what the head is doing. This misunderstanding in turn cause them to build the wrong thing that doesn’t fit into the rest of the system. You might also know this situation under the term of “those people had no idea what they were doing, and we will do it right this time” which is quite often floating around in the latest maintenance team.
That rarely happens though, because it’s likely the Towel of Babel situation all over again.
Large systems are usually maintained by the ones who did not build them. Initial developers often leave the company or move up in the pecking order to work on other projects and are therefore no longer familiar with the system. Sometimes the bright minds outsourced the initial development of the project in the name of lowering the costs, just to pay tenfold in the later stages once they realize the outsourcers developed the wrong thing. Even worse is the fact that the in house developers didn’t gain the internal domain knowledge that is necessary for further maintenance of the system.
This presents a big problem for the new maintainers, as they can’t just go around the company and ask the original developers about the initial design decisions. Learning this tribal knowledge usually takes a lot of time, because the code is harder to read and understand than it is to write. These days most developers seem to switch jobs every 2 to 3 years, therefore the learning process has to be constantly going on, otherwise you might end up with a large and expensive monster that nobody knows anything about . For most of the past large projects on which I’ve been working on, the team has usually changed by the end of the first version.
Rigorously documenting every step is not the cure for this problem, because at some point all that junk will become outdated and nobody will have the time to spend a year just reading the documentation and figuring out how the pieces fit together .
Large systems become large, because they are usually trying to solve every problem under the sun. Often the organization that is embarking on such journey does not have enough experienced employees on board to actually pull it off. Some like to say that pressure makes diamonds, but sometimes it also crushes the things that are under.
It’s fine to have less experienced people working on a large system as long as they have the elders overseeing their work. In the world where senior titles are handed left and right, that is often not the case and it’s how you end up with a very fragile system that is suitable for a replacement as soon as it was built. Most of the larger projects that I was working on and were considered successes, had the core parts of the system written by experienced developers. A significant chunks were also built by greenhorns, but they were usually guided and their blast radius was limited to the less complex parts of the system.
Big projects tend to attract the data modelers and other cultists who like to get in the way of getting shit done. These architecture astronauts will endlessly discuss the finer points of their UML data models and multithreaded layers of abstraction, that will one day allow them to be the heroes of their own story by writing some well encapsulated and “SOLID” code.
Why IBM sales reps don’t have children?
Because all they do is sit on the bed telling their spouses how great it’s going to be.
Meanwhile, the for loopers have to fight this creeping metadata bureaucracy madness on a daily basis. The tools handed down to them from the ivory tower usually don’t stand the heat of the battle, but that doesn’t bother the modelers who will try to fix the problems with more obfuscation patterns. It’s how you end with a homebrewed middleware monstrosity, because the 100 existing ones out there are obviously not up to the task of powering our little CRUD app.
I like to keep documentation separated from the code. Who am I?
A fool, with an out of sync document.
The documentation of any large system is almost always outdated. The code is usually changing faster due to the endless edge cases of the system that were not being thought of early on. The discovered edge case problems are usually fixed by bolting additional functionality right on the spot. The average code change of such patch is usually quite small, but a few tweaks here and there accumulate over time until the original design no longer matches with the reality.
Tweaking the code is usually simple as most people are familiar with the process. You pull the code from the version control, you make your tweaks and then you push it back. On the other hand updating the documentation is way more convoluted and usually involves the whole ceremony, because the term documentation is actually a spaghetti of Word documents, pdfs, spreadsheets, emails, wiki pages and some text files on some dude’s hard drive.
The corporate world still loves to use MS Word for writing technical documents, even though it’s entirely unusable for this use case. The Word doesn’t support syntax highlighting for code snippets and you get to play the game of “moving one image for 5 pixels to the left will mess with your headings and right align all text.” It also makes it very hard to have multiple people collaborating on the same document. The version control still treats Word documents in the same way as binary blobs, which makes merging changes and fixing merge conflicts far harder than it should be. I still remember how people collaborated by working each on their own copy of the document and having a documentation officer merging all the copies together manually to avoid any merge conflicts. Fun times.
If you are lucky, you might be writing documentation in plain text, but then you may have to get familiar with all kinds of weird Lovecraftian toolchains that are relying on all sorts of ancient operating system specifics in order to produce a nicer looking document.
After all these years of progress, writing documentation is still an unpleasant process due to all the pain surrounding the tools that we have to deal with on a daily basis. Large projects ensure that not only is the documentation hard to write, it’s also impossible to find and read due to the sheer number of documents .
In this section I will describe my ways of tackling the problems of an unknown large codebase that I often encounter in the wild. As mentioned before, the main problem of large systems is that nobody can understand them entirely and often you will be left wondering how the damn thing even works.
When you are trying to understand a specific part of a large system, it’s worth taking the time to talk to the current maintainers. They usually know it well enough to guide you through the jungle, so you can avoid the traps and get up to speed faster. Sometimes you will encounter a situation where you will just have to figure it out on your own, because nobody will have the answers to your questions.
Hopefully the following sections might give you some ideas on how to tackle such situations.
The easiest way to get familiar with a large system, is by going through its documentation and actually reading it. Large systems usually contain large swaths of outdated documentation, but even a slightly outdated document is often better than not having it at all. Ask the elders about the current state of documentation, so you don’t completely waste your time with deciphering the irrelevant documents.
Either way, the documentation will only give you an overview of the system. The details behind design decisions are almost never mentioned and you will have to find another way.
When I am trying to decipher how a specific part of the system is supposed to behave, I usually check for tests. If they exist, you might want to scroll through them and hopefully you will get another piece of the puzzle. Sometimes when I am trying to figure out how to use some obscure unknown library, I try to write some simple learning tests that are using some methods from the library.
If the tests are nowhere to be found, you can try to play with the debugger and step through the actual implementation code. Sometimes you will find “interesting” ideas floating around on how how you should write the missing tests before you start modifying the unknown code, but I don’t understand how are you supposed to do that if you don’t know what the software was supposed to be doing. It’s fine to test your understanding of the system that way, but writing random tests and then relying on them is ridiculous; you are writing useless tests at best or testing the wrong thing and then relying on your misunderstanding of the system at worst which will probably do more harm than good.
When you are trying to tweak the existing functionality of the system, you can probably track it down to just a few places in the code where that tweak is necessary. I usually study the code in those places until I figure out exactly which part should be modified and I ignore the rest of the system. Resist the temptation of fixing the parts that you find horrifying, because first you can’t fix it all and second you will get crushed by the complexity of the system. Mark those places down as a horrifying place to be and keep them in mind when it’s time to refactor.
If you don’t know the code well enough, you might also break an otherwise working system. Sometimes obvious bugs in the code become an expected behavior that should stay that way even if it’s wrong. At some point somebody might have started to rely on the broken behavior and if you decide to “fix” the broken part, you might actually break an otherwise working system.
Running the tests is a good way to ensure that your changes did not break anything, but make sure the tests are actually reliable. Far too often you will encounter unit tests with some shady mocks written by the unit test zealots who sleep well at night because they know their mocks are working.
All large systems will have parts where a certain design decisions will not be documented and nobody will now why they were necessary or done that way. Version control usually contains a history of commit messages which may give you some hints for understanding the reasoning behind those decisions. This is why you can find so many blog posts advertising the importance of writing good commit messages.
On smaller projects or when you are working alone, a good commit messages are not going make much difference. One person can only write so much code in one day of work and you can mostly figure out the intentions just by going through the source. If all else fails, you can still rewrite a small project in a reasonable time.
On the other hand, large projects are unwieldy and rewrite is normally not economically viable. Taking the time to immortalize the intents of your changes in the commit logs might save your own ass six months down the road when you won’t remember a thing about the code that you have written.
Sometimes, the reasons behind a certain design decisions are stored in the past bug reports. Large projects will probably have some kind of a bug tracker with various discussions surrounding the reported bug. These bug reports might be accompanied with the hash of the commit that fixes the bug so you can go deeper into the forest in search for the truth.
This is a bit more annoying process than going through the commit logs, as the bug trackers are normally not integrated with your editor of choice, but sometimes it’s the only way to obtain the missing piece of the puzzle.
When I am struggling to understand how the pieces of system fit together, it usually helps me to visualize things. You don’t have to create a detailed UML diagram; in fact I don’t think I have ever seen an UML diagram that wasn’t a glorious cryptographic mess. Simple boxes and arrows will do just fine in most cases. For navigating through the unfamiliar code you may also use the tools that visualize the structure of the code (like SourceTrail).
If necessary, you can write your own tools for drawing such visualizations. For example, if you are trying to visualize a mesh of microservices you can write a script that will automatically generate a graph of service connections by parsing the configuration files of those services. I personally find such connection diagrams much easier to follow and understand than figuring it out through the source code alone.
Commenting the code is one of the hot topics on which everybody will want to comment on. People will claim that a well written code doesn’t need comments, because its structure and naming conventions will tell you the whole story. Afterwards they will come up with a trivial hundred line example which will show you how much better the non commented code is in comparison to the nasty commented one.
It’s a baloney that is perpetuated by the book sellers and consultants that no longer work in the trenches. It’s easy to preach and stick to the principles when you don’t have to shovel the dirt on the large system for years. You can rewrite any trivial code into something that doesn’t need comments. After all, most of these silly examples easily fit into your brain just by reading the source code once.
The problems of non commented code only start to appear at scale, when you have a revolving door of variously skilled developers working on the same code for multiple years. In such case, no amount of cleaning your code and naming variables in this or that way will help you. A project of 10’000 lines behaves completely different in comparison to the project of 100’000 lines or the project of 1 million lines.
Since the internal domain knowledge and the design decisions are getting lost over time, I like to make my life easier by documenting my decisions and other “trivia” that are not obvious from the code alone. A well placed comment right where the action is will save you a lot of time, because you won’t have to search through the mess of design documents which usually won’t contain the detail that you are looking for. You won’t be able to document all your design decisions just by carefully naming variables and neither will your coworkers and other clean code enthusiasts.
When I am trying to add a functionality to the system and I realize that I am in an unfamiliar hard to understand territory, I like to put a trail of comments as I read through the code. I find such marked code much easier to understand and next time I have to go through that part, I can simply rely on the guiding comments as opposed to reading and understanding the entire source again.
I hear you saying: “But the comments might be outdated or misleading, how can you claim to rely on the comments when in my entire career I have never seen one helpful comment?” If that’s the case, you can use the same strategy that you use for dealing with documentation. Finders changers. Revise and update the parts that are wrong, but the real question is: “How did those comments go wrong? You do have code reviews, don’t you?”
I often want to know where a certain variable is used and how it is used. Sometimes the developers were to smart for their own good and will come up with an ingenious solutions that will trick your IDE in believing that the code is not used anywhere. This is particularly common in Java, where you will find ridiculous solutions glued together with a bunch of xml files that are spread throughout the entire project.
Finding such documents manually is pretty much a hopeless task, but with grep this is a trivial thing to do. It’s worth spending some time learning the grep or similar CLI tools that can quickly find the files containing the relevant keywords you are looking for.
Often you will want to look for a certain keyword across the entire documentation. If you are new to the project, you won’t really know which document is relevant for you. This is actually a much harder problem than you might think, as searching through non plain text files is a world of pain (see also The power of text files).
Don’t give up though. Word documents are just zip folders of xml files. If you extract them into plain xml files you can easily grep through that mess of content and layout. You might get fancy and use antiword tool instead. For searching through pdf documents you can use utilities like pdfgrep.
Sometimes you will encounter an old and undocumented codebase with nobody around to ask on how to approach your task. If you want a first hand experience you can try to write a Jenkins plugin. Jenkins is a really flexible continuous integration software that allows you to do everything, but at the same time it also fails to do anything and requires tons of plugins for even the most basic tasks. At some point I’ve realized that I need a specific dependency building plugin that doesn’t exist and I’ve decided to write it. After spending some time reading the provided documentation and poring over the code, I’ve realized it’s one big undocumented mess and the only way to figure out how it works is by trial and error and “reverse engineering” the actual functionality from other plugins.
In a situations like that, an IDE with a decent autocomplete might help you decipher an otherwise impenetrable codebase. Press a dot and let the editor suggest you the possible options. Far too often I see people noodling around with some half assed vim plugins, as if struggling to get the task done makes you a real developer with chest hair and everything.
There are people out there who are really productive with plain Emacs and nothing else (names like Jon Blow or Casey Muratori come to my mind), but unfortunately there are very few that are at that level of skill working in this industry. I’ve spent a lot of time maintaining my dotfiles until I’ve realized I was wasting so much time on the irrelevant nerd turf wars and that espoused productivity never really came around. A modern IDE with some custom key bindings will get you there way faster.
Regardless of how you tackle the problems of an unknown large system, keep in mind that large systems did not appear overnight. A lot of people spent a lot of time building them and there are hundreds of hidden edge cases bolted on top that are only there due to the problems that were discovered in production. If it takes a lot of time to build a large system, it also takes a lot of time to understand it.
 As you move through the ranks in the company, the higher you are, the more powerful language you are able to wield. In the beginning you are stuck working with primitive languages in which you have to specify every single detail. For example, if you are trying to read the contents of a file you have to specify exactly how you want that file to be read; either reading the entire file at once or iteratively line by line or character by character.
As you move up into the higher levels of the foodchain, at some point you gain the access to the power of spoken language. At this level you no longer have to worry about every little detail, as you can simply blurt ambiguous things (like read this file) and it’s up to the grunts below to figure out the necessary details.
 People usually don’t leave the company when everything is fine.
 This loss of knowledge situation is not tied only to the programming world, because it happens everywhere and we as the society haven’t really figured it out how to pass the knowledge through generations. For more on this discussion, you might be interested in Preventing the Collapse of Civilization (Jon Blow) talk.
 In 1980’s Tim Berners-Lee realized that the documents are hard to find at CERN, so he started imagining a system of interconnected documents that would supposedly solve this thorny problem for good. Nowadays we know this invention as the internet.
Despite 40 years of improvements and the internet becoming a part of our daily life, we still face the same problems. You can talk to another person half way across the world while watching a funny cat videos, but somehow we still struggle with finding the important project documents.
[Extra] You might be interested in reading the Out of the Tar Pit article (532kB pdf) that thoroughly tackles problems of large-scale software systems.