Code editors

Code editors are a fundamental tool of a programmer. Almost all of them are also text editors! This is because "source code" is almost always text. Compilers and interpreters take text files as input and convert that into some in-memory representation (a parse tree), and either emit some form of machine/byte code, or just execute/interpret the tree directly.

What separates a code editor from something like Notepad is an increasing level of "understanding" of the code being written. The most basic level is syntax highlighting - coloring parts of the text according to some fairly loose parsing of the code in the language being written. This allows the programmer to more quickly identify syntax errors before compilation or execution is attempted. More advanced levels found in IDEs may include more rigorous parsing to understand symbol names, and perhaps provide connections between the use of a symbol and its definition for easier navigation.
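To make "fairly loose parsing" concrete, here is a minimal sketch of a regex-based highlighter. The token classes and the tiny keyword list are my own assumptions for a generic C-like language, not taken from any real editor. It classifies text without ever building a syntax tree, and it never rejects its input.

```python
import re

# Hypothetical token classes for a C-like language. Order matters: the
# first alternative that matches at a given position wins.
TOKEN_RE = re.compile(
    r'(?P<comment>//[^\n]*)'
    r'|(?P<string>"(?:\\.|[^"\\\n])*"?)'   # closing quote is optional
    r'|(?P<keyword>\b(?:if|else|while|for|return)\b)'
    r'|(?P<number>\b\d+\b)'
    r'|(?P<other>\s+|\w+|.)',               # fallback: always consumes something
    re.DOTALL,
)

def highlight(text):
    """Return (kind, lexeme) pairs that together cover every character."""
    return [(m.lastgroup, m.group()) for m in TOKEN_RE.finditer(text)]
```

Because the fallback alternative always matches, joining the lexemes gives back the original text exactly, and an unterminated string is still colored as a string rather than treated as a fatal error.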

This setup poses some quite serious challenges for a programmer wishing to write a code editor, which I will describe below:

The text is canonical

One of the most important aspects of a code editor is that the text stored in the file is the canonical representation of the code. What the programmer sees in the editor should be what exists in the file. If the editor "lies" about the contents of the file, there is a risk that the programmer will believe that the code looks fine, but come time to hand the file to a compiler or interpreter the contents turn out to be different. It also means that if the programmer edits part of the file, the editor should not randomly change other parts of the file.

A simplistic design of a code editor is to parse the file into a syntax tree, and then allow the programmer to modify the syntax tree, re-serializing the tree back into text when saved. There are some fundamental problems with this approach:

Source code contains elements that are ignored by a language parser - comments and "structural" whitespace.
The programmer must be allowed to save an invalid syntax tree to disk.
If new syntactical constructs become available in the language, the editor cannot use these until the editor parser is updated.
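The first two problems are easy to reproduce with Python's own standard library, used here purely as an illustration. The sketch below round-trips some text through `ast.parse` and `ast.unparse` (the latter needs Python 3.9 or newer). The sample source and the `can_build_tree` helper are hypothetical names of mine.

```python
import ast

source = "x = 1  # the answer\n\n\ny   =   2\n"

# Round trip: text -> syntax tree -> text.
round_tripped = ast.unparse(ast.parse(source))
# The comment and the author's own spacing do not survive the trip,
# because the tree never stored them.

def can_build_tree(text):
    """Report whether the language parser accepts the text at all."""
    try:
        ast.parse(text)
        return True
    except SyntaxError:
        return False
```

A half-typed line such as `x = (1 +` produces no tree at all, so an editor built purely on this round trip would have nothing to save.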

The editor parser cannot be the same as the language parser

In order to be able to assist the programmer, the text must be parsed. However, this parser needs to be more flexible than the compiler's parser. If a language parser encounters invalid syntax it can simply throw a useful exception and abort the parsing process until the programmer fixes the problem. An editor parser, however, must still display the text contents of the file accurately so the programmer can fix the problem. Therefore these parsers must be different! So for every language the editor supports, it must implement some kind of parser separate from the language's own parser - that is a large amount of effort!

When the editor parser encounters a syntax error it must somehow visually flag this to the programmer. But it must also be able to "recover" from this and continue to correctly parse the remainder of the file or buffer. Otherwise the first syntax error encountered would flag the rest of the document as being in an error state. How exactly does this recovery work? Imagine you had a misplaced curly brace in your file - at what point does the editor parser decide that the code is correct again after this?
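One very simple recovery strategy, sketched below, is to flag the offending brace and then carry on as if it were not there. This is only an illustration under my own assumptions: `find_brace_errors` is a hypothetical helper, and it ignores the fact that braces can also appear inside strings and comments.

```python
def find_brace_errors(text):
    """Return offsets of braces that have no partner, scanning the whole text."""
    open_stack = []  # offsets of '{' still waiting for a matching '}'
    errors = []
    for offset, ch in enumerate(text):
        if ch == "{":
            open_stack.append(offset)
        elif ch == "}":
            if open_stack:
                open_stack.pop()
            else:
                # Stray closer: flag it, then continue as if it were absent.
                errors.append(offset)
    # Any opener still waiting at the end of the text is also an error.
    return sorted(errors + open_stack)
```

Note that this does not really answer the question above. For an unclosed opening brace it blames the outermost unmatched one, which is not necessarily the brace the programmer actually misplaced.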

In reality, many syntax highlighting editor parsers don't bother with cases like these - they simply colorize everything as if the code were correct, and you find out later when you try to compile or run the program.

Contrast this with the Scratch visual programming language: there is no syntax, because the editor is not a text editor - the programmer is modifying the parse tree of the language directly. There is no danger of loading an invalid file because users do not edit the serialized representation of the code outside the Scratch environment.

Merging of two trees

What we end up with is some form of data structure for editing the text, and a separate data structure for storing the code parse tree. These two structures must be delicately connected to each other. For example, it must be possible to have a "break" in the code parse tree around a syntactically invalid chunk of text - rather than the code parse tree simply stopping at the first sign of a problem. And at any moment the user may press a key that completely rearranges the code parse tree.

The user may also be editing a very large file where it is impractical to generate a code parse tree for the entire document (especially given that a single key press may require a re-parse of the entire file). However, the edges of the viewing window may cut the code at a point that upsets the code parser. So you can't simply take all the visible lines on screen and feed them into the code parser - you have to do some tricks to read back or ahead until you can establish whether or not this chunk is valid.
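One possible "trick" is sketched below. It assumes, as a heuristic of my own rather than an established rule, that a non-blank line with no indentation starts a top-level construct, and it widens the visible range outwards to the nearest such lines. `parse_window` is a hypothetical name.

```python
def parse_window(lines, first_visible, last_visible):
    """Widen a visible line range to places where parsing can likely start.

    Assumption: a non-blank line with no indentation begins a top-level
    construct, so it is a reasonable place to start or stop parsing.
    """
    def is_top_level(line):
        return bool(line.strip()) and not line[0].isspace()

    # Read back from the top of the window.
    start = first_visible
    while start > 0 and not is_top_level(lines[start]):
        start -= 1
    # Read ahead from just below the bottom of the window.
    end = last_visible + 1
    while end < len(lines) and not is_top_level(lines[end]):
        end += 1
    return start, end  # half-open range [start, end)
```

The heuristic is easy to fool: a closing brace in the first column or the inside of a multi-line string both look like top-level lines. So the result is a guess at a safe chunk to parse, not a guarantee.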

All this points to the code parser being very "loose". It may be a tree, but with nodes representing something quite different than what a compiler parser for the same language would produce. And it must be connected to the data structure representing the text in a clever way.

There is no way out in the near/medium future

Maybe we should give up on source code as text? Storing source code as a text file has so many advantages that it is difficult to see it ever changing:

Version control (git) thinks in text files, and is too damn useful to throw away in favor of some new thing.
You can switch editors anytime and not have to worry about it too much.
Regular text processing tools can be used to assist programming tasks (grep, sed, awk, etc).
Unix-like operating systems are built around streams of bytes (text), or lines of input, so the entire OS is "thinking" in the same terms as your source code.

Think of all the other things you can do because source code is text: throw your code at generic text indexing tools for instant search capabilities, ask ChatGPT to complete your code for you, send chunks of code via email, paste code snippets into Slack... the list goes on.

Source code as a document

Literate programming encourages you to reverse the code-comment balance, so the code actually becomes almost secondary to the description of the what/why/how. Some programmers even advocate visually emphasizing comments over code!

Practical ways

I think it could be possible to build a data structure for editing code that is somewhat unified for both cases. The "rope" data structure may be very efficient for editing text in the general case. However, I think it's optimized in a way that makes it very challenging to make connections between the text representation and the code parse tree. For example, a rope structure may break a single word into two parts: how do you connect a parse tree to this without massive complexity when edits happen?
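The split-word problem can be shown with a deliberately flattened model of a rope: just the list of its leaf strings, ignoring the tree above them. Real ropes are balanced trees, so treat this as a simplification. `leaves_touched` is a hypothetical helper that reports which leaves hold a given span of the document.

```python
from bisect import bisect_right
from itertools import accumulate

def leaves_touched(leaves, start, end):
    """Return indices of the leaves holding characters start..end-1."""
    # ends[i] is the document offset just past the last character of leaf i.
    ends = list(accumulate(len(leaf) for leaf in leaves))
    first = bisect_right(ends, start)
    last = bisect_right(ends, end - 1)
    return list(range(first, last + 1))

# A single identifier that happens to straddle a leaf boundary.
leaves = ["int cou", "nter = 0;"]
counter_span = leaves_touched(leaves, 4, 11)  # the word "counter"
```

Here the identifier lives in both leaves, so a parse tree node for it would need to point into two places, and every edit that splits or merges leaves would have to update those pointers.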

Programming languages can be remarkably similar in how they are represented as text. Blocks of code are usually bound by some specific characters that can be identified fairly easily, indentation rules are somewhat unified across languages (e.g. "inner" scopes are usually positioned to the right of their parent), expressions tend to be on a single line and the next expression tends to be on the line below, etc.

If one part of an expression is just outside the field of view, and the other part of the expression is within the field of view, chances are the part outside the field of view is not very "far away".

If we build a data structure for editing text that roughly follows these "rules" of source code, it may be easier to attach metadata to these nodes that assists the programmer.

Practical structure

So what might a practical data structure look like? I think the following elements might be useful:

Structural whitespace (outside strings)
Comment blocks
Code blocks (usually delineated with curly braces)
Statements

The document-as-text would then be a data structure composed of these elements. It should be relatively easy to parse the text document into these components.
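As a rough sketch of how such a split could work, the code below cuts text into the four kinds of element listed above. Everything in it is an assumption of mine: a curly-brace language with `//` comments and double-quoted strings, and a "statement" that is really just a run of code on one line. String literals are matched as part of a statement so that braces and slashes inside them are not mistaken for structure.

```python
import re

# A string literal; the closing quote is optional so half-typed code still matches.
STRING = r'"(?:\\.|[^"\\\n])*"?'
# One non-space piece of code: a string, an ordinary character, or a lone slash.
PIECE = r'(?:' + STRING + r'|[^\s{}"/]|/(?!/))'

ELEMENT_RE = re.compile(
    r'(?P<whitespace>\s+)'
    r'|(?P<comment>//[^\n]*)'
    r'|(?P<block_open>\{)'
    r'|(?P<block_close>\})'
    # Pieces joined by inline spaces; stops at braces, comments and newlines.
    r'|(?P<statement>' + PIECE + r'(?:[ \t]*' + PIECE + r')*)'
)

def split_elements(text):
    """Split text into (kind, text) elements without losing any character."""
    return [(m.lastgroup, m.group()) for m in ELEMENT_RE.finditer(text)]
```

Joining the element texts reproduces the input exactly, which keeps the text canonical, and invalid code still splits into elements instead of raising an error. A real structure would presumably nest the elements between `block_open` and `block_close` rather than keep a flat list.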