The power of text files

I didn’t really notice how much of my daily workflow is wrapped around text-based files until I had to work with some people who insisted on sending Word documents around. I couldn’t help but wonder what made them so set in their ways, and they couldn’t help but wonder why I was still bothering with plain text files. Didn’t we, like, move on years ago, once the word processors took over?

In order to understand this problem, let’s start with a simple quiz. The question for our billion dollar quiz is: “How do you spot a Windows user in the wild?”

The battle of two philosophies

Say we have two different programmers, each working with their favorite operating system. One is wielding a Windows machine, the other one swears by his Linux desktop. We give both of them the same task: “Write a report that describes the work you did in the past month.”

A common approach of a Linux user would be:

  • Open a terminal, navigate to the development directory, open up a text editor and start typing.

  • Hold on, I have to check something, where are my notes located? I don’t know the exact name, but I know what the notes contained.

    $ grep -R "bugfixing" /home/sloth/dev/

  • I found two hits. Opening the first one didn’t bear much fruit, so let’s try the other one.

  • BINGO, found what I was looking for.

  • Time spent: 1 minute.

A typical Windows user would approach this task like so:

  • Open MS Word and start fiddling with the heading styles. Do I need a table of contents? No, let’s leave it for now. Save an empty document into the dev directory, just in case Word crashes.

  • USING A SPACE IN THE FILENAME, WHAT?

  • Hold on, I have to check something in my notes, where are they located? I don’t know the exact filename, but I know what the notes contained. Let me try to find it via a file manager - nope, no documents found.

  • Open another file manager and start clicking around. Where did I put that stuff?

  • Find a Readme file and double click on it.

  • A PLAIN NOTEPAD OPENS UP.

  • SCROLL, SCROLL, SCROLL, CTRL+F search, nope, no matches found.

  • Oh it wasn’t this file I was looking for, let’s close this and try another one.

  • REACHING FOR THE MOUSE AGAIN.

  • Click on the notes.docx file.

  • The lights start dimming while another Word window opens.

  • IT’S BEEN 15 MINUTES AND STILL NOTHING HAS BEEN DONE, AAAAARGH.

A longtime Windows veteran reading this is probably shaking their head right now. I’m sorry to put you into the same basket of cluelessness, but the corporate circus is filled with clowns whose train of thought is driven by the GUI. If the GUI does not solve their problem, they will get things in order manually, with some overly clever copy and paste mechanisms. This problem was caused partly by the lack of useful command line tools and partly by the adoption of file formats that are not plain text.

But why are these two worlds so different?

I have a wild theory about that. Different operating systems solved different problems and attracted different types of primates.

I believe that the main reasons behind the popularity of Microsoft Windows were little Bill having the right connections at the right time and Windows itself being relatively easy to use. The ease of use came from the rather intuitive GUI that early Windows had. If you are using an intuitive GUI, you no longer have to memorize obscure terminal commands and you can muddle through your problem by pointing and clicking around.

Obviously, when you have the luxury of a GUI that does a bunch of stuff behind the scenes, you can afford to invent new file formats that are not just pure text. So you add some metadata that defines the look and feel of the document. Over the years new features and other cruft are piled on top of the previous ones, until nobody can read such a file without a special viewer. Why wouldn’t you do that, though? You don’t put any mental load on the end user, as all the complexity is hidden behind the scenes, and your program is much more capable now.

On the other hand, Linux is a totally different beast. Its development was started by Linus Torvalds, who wanted a free Unix operating system that would run on his Intel 80386 PC and take full advantage of the 386 processor’s capabilities [1]. Bundling his home-brewed kernel with various GNU development tools got him a free Unix-like operating system.

The main idea behind the Unix philosophy is to write small programs that do one thing well. They should work together and handle text streams, because text streams are a universal interface. Everything in Linux is a file (more specifically, everything is a file descriptor), and you have a bunch of small command line utilities that help you operate on and process those files.
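To make the idea less abstract, here is a minimal sketch of such a pipeline. The function name top_words and the sample sentence are made up for illustration; the tools themselves (tr, sort, uniq, head) are the standard ones you would expect on pretty much any Linux box.

```shell
# Print the N most frequent words read from standard input (default: 5).
# Stages, in order:
#   1. tr      - turn every run of non-letters into a single newline (one word per line)
#   2. tr      - lowercase everything so "Text" and "text" count as the same word
#   3. sort    - put identical words next to each other
#   4. uniq -c - collapse each group into "count word"
#   5. sort    - order by count, highest first
#   6. head    - keep only the top N lines
top_words() {
    tr -cs '[:alpha:]' '\n' |
        tr '[:upper:]' '[:lower:]' |
        sort |
        uniq -c |
        sort -rn |
        head -n "${1:-5}"
}

# Example: the two most frequent words in a short piece of text.
printf 'text files are plain text\nplain text lasts\n' | top_words 2
```

None of these tools knows anything about the others. They only agree on lines of text, which is the whole point.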

In its early days Linux was far from user friendly. In 1993 you had to be an expert to use it [2]. Power users did not really care, as they were used to their Unix machines, but for everyone else it was just too hard to use. Can you imagine your grandma compiling the latest kernel and memorizing hundreds of terminal commands just to use a computer? No? But I am sure she can point and click around just fine, which gave the Windows operating system a great advantage in this popularity contest.

Linux is getting easier to use with every passing year, but most of the users I know still prefer to work in a terminal and fiddle with plain text files. If you can handle text files as well as the GNU utilities can, then you don’t have to bother with GUIs. The consequence is that Linux developers are notorious for being terrible GUI developers, while also having a knack for automating boring processes. The environment simply pushes them in that direction.

Using an operating system is like writing in your favorite programming language. They all suck, but once you get used to one you keep using it. Learning new things that are fundamentally different from what you are used to is hard, so people are reluctant to switch sides and try something new. Just ask Apple fans (see also Stockholm syndrome [3]).

Why should we bother with text files?

Why would Joe the programmer, who is working at a huge and equally boring corporation, care about all this nonsense? Isn’t his job to do something useful, like writing code and drinking company-provided coffee, instead of fooling around with file formats? Well yes, but…

During your average day as a developer you will encounter those little problems that will drive you nuts in the long run. Someone will pop up in your office and say: “Do you remember that time when we had to do this or that? I can’t find it anywhere. I don’t remember the title of the document or where it is stored, but I am pretty sure it was about such and such things.”

Unless you have a great memory, chances are you won’t know what happened years ago. Searching for the specific keywords will not return anything useful, as most search tools were not made for digging through documents that are not plain text.

However, if your documentation is written in plain text, you can simply search your entire hard drive for the specific keywords and get your results in a few seconds. Since plain text files are easy to parse and process, they are also easy to use as input for your automation scripts - which saves you a lot of time in the long run.
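A keyword search like that can be sketched in a couple of lines. The helper name find_notes and the choice of .txt and .md as "plain text" extensions are my assumptions for the sake of the example, and the --include option assumes a grep that supports it (GNU grep does).

```shell
# List plain text notes under a directory that mention a keyword.
# Usage: find_notes KEYWORD [DIR]     (DIR defaults to the current directory)
#   -R  search the directory recursively
#   -i  ignore upper/lower case
#   -l  print only the names of matching files, not the matching lines
#   -F  treat the keyword as literal text, not as a regular expression
find_notes() {
    grep -R -i -l -F --include='*.txt' --include='*.md' -- "$1" "${2:-.}" | sort
}

# Example (hypothetical directory):
#   find_notes bugfixing "$HOME/dev"
```

Since the output is just one filename per line, it can be fed straight into the next script, which is exactly what makes plain text so pleasant to automate.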

There is another aspect that is rarely mentioned in the great online flame wars. How long do you think the plain text file format will last? Will you be able to open a text file 50 years from now? Most likely you will.

What about the .doc format? I have a feeling that within the next decade we won’t be able to open such files, because something else will have come along [4]. When that moment arrives, you will have to convert all your documents into the new format or else they will be lost forever - assuming that you have backups and you don’t lose them even before that moment. Maybe that’s okay for things that have a limited lifetime, such as your high school essays or the documentation of your favorite software. You won’t care about those essays, and that software will probably be long gone at that point, so the documentation itself is quite useless and probably not worth storing.

But there are things that are worth preserving even after hundreds of years. Certain books rarely go out of fashion and they are still being read centuries after they were written. Imagine Isaac Newton writing down his thoughts and storing them with one of the modern SaaS storage companies (SaaS - software as a service). That knowledge would be lost forever, and some knowledge is just worth preserving for more than a few years. In the past, books fit this purpose quite well, as they are still readable after all these years. Nowadays, despite all the cloud computing advances, we still don’t know how to store knowledge for the long term without relying on old school paper.

Practice makes perfect

Now that the theory is over, let’s move on to the practical part of this fine piece of writing. How would you parse all links from 10 different PDF documents, fetch the data from the parsed links, extract all the phone numbers from the fetched data and store those numbers into a database - so some marketing bozo is able to call those poor people and nag them about the latest TV discounts at eight in the morning?

You start writing the… just kidding. You delegate the work to the people below your pay grade. Some entry level schmuck or unpaid intern will have to go through all the documents and do this mind numbing task manually, because they won’t know any better.

Work that could be done in the few hours it takes to cobble together a script will suddenly take days because of a poor choice of data format. There may be libraries for parsing non-text files, but in general they do not work that well [5]. For a programmer, dealing with non-text files is a world of pain. Why so many people are banging on the doors of that world is something to ponder.
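For the curious, the script from the thought experiment above could look roughly like the sketch below. Everything in it is an assumption on my part: it presumes that pdftotext (from poppler-utils) and curl are installed, the function names are made up, the two regular expressions are crude guesses at what a link and a phone number look like, and a plain text file stands in for the database.

```shell
# Pull http(s) links out of text on stdin, one per line, without duplicates.
# Trailing punctuation such as "." or "," is stripped from each link.
extract_links() {
    grep -oE 'https?://[^[:space:]<>")]+' | sed 's/[.,;:]*$//' | sort -u
}

# Pull things that look like phone numbers out of text on stdin.
# Crude guess: an optional "+", then digits mixed with spaces, dashes and
# parentheses, at least 8 characters in total, ending in a digit.
extract_phones() {
    grep -oE '\+?[0-9][0-9 ()-]{6,}[0-9]' | sort -u
}

# Glue: PDFs -> text -> links -> fetched pages -> phone numbers.
# Usage: phones_from_pdfs FILE.pdf...
phones_from_pdfs() {
    for pdf in "$@"; do
        pdftotext "$pdf" -          # "-" sends the extracted text to stdout
    done |
        extract_links |
        while IFS= read -r url; do
            curl -fsSL "$url"       # fetch the page behind each link
        done |
        extract_phones
}

# Example (hypothetical file names; numbers.txt stands in for the database):
#   phones_from_pdfs report1.pdf report2.pdf >> numbers.txt
```

The moment the PDFs are turned into text, the rest is ordinary grep work. The hard and unreliable part is that very first conversion step, which is the whole complaint.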

Notes

[1] From the book The Linux Programming Interface. I would say that when something breaks, you still have to be an expert to use Linux.

[2] If you are interested in learning more about the history of the Free Software movement, feel free to read: Free as in Freedom 2.0: Richard Stallman and the Free Software Revolution (2 MB PDF).

[3] Stockholm syndrome - a condition in which hostages develop a psychological alliance with their captors during captivity.

[4] I know, I know, a .docx file is just a zip file full of XML files. I would rather not spend time picking the content out of the metadata, considering how awful the format is.

[5] But they do exist. Some time ago I had to search through hundreds of PDFs that had hundreds of pages - large project problems. I cobbled together a script that converted the PDFs into plain text and then grepped through that mess in order to locate the relevant files.