š
Yours truly is not an expert in compilers but does enjoy diving into this topic and learning something new from time to time.
Reporting errors encountered in the userās code is a crucial function of a compiler. Spending some time upfront to think though this aspects of compiler design is well worth it. As the decision has a significant impact on the compiler’s mechanics and code. So, it’s great to have a set of explicit, coherent assumptions in mind when working on the implementation.
Stop on the first error
The compiler can display an error message and stop the translation once it runs into the first error. This is arguably the simplest error-handling strategy to implement. It has an obvious downside. If a program contains multiple errors, this strategy makes the user go through an equal number of correct the error / restart compilation iterations to catch them all. The user experience might not necessarily be ideal in this case.
Try to discover the maximum number of errors
At the other end of the spectrum, a compiler can attempt to identify and report as many errors in the code as possible within a single run. When such a compiler detects an error it informs the user but, instead of stopping, it keeps going over the code. With the aim of keeping the number of times the user has to restart compilation to a minimum ā only one restart is required in the best-case scenario. While this strategy sounds a lot nicer it’s not without problems either.
Cascading errors
It leads to a notable increase in the compiler’s code complexity, as the compiler has to employ heuristics reducing cascading errors. Whatās a cascading error? In the Scala code sample below, there’s one error ā the closing parenthesis is missing in line 2.
|
|
But the Scala compiler diagnoses two errors. After seeing the first, real one it gets “confused” and reports the second error in line 3 where there is no actual problem with the code. This is an example of cascading error.
CascadingError.scala:2: error: identifier expected but ':' found.
def speak(: Unit = {
^
CascadingError.scala:3: error: ':' expected but ';' found.
};
^
In some cases, cascading errors make diagnostic messages so noisy, users start looking for a way to make the compiler stop after the first encountered error. The effort necessary to tell real errors from cascading errors can degrade the user experience to such a degree as to defy the entire point of reporting maximum number of error in one go.
A hybrid approach
Some compilers aim to integrate the strong sides of both approaches by following a middle-of-the-road error handling strategy. Translation of a program from the input language into the target one typically happens over a number of stages. For an educational compiler these stages can look like this1.
- Lexical analysis, or simply lexing;
- Syntactic analysis, or parsing;
- Semantic analysis;
- Emitting x86-64 assembly.
The idea is to diagnose as many errors in the input code as possible within one compilation stage. The key difference with the approach of detecting maximum errors per a compilation is that the next compilation stage will only start if no errors are found at the current one.
At the lexical analysis stage, if the lexer encounters a portion of the source that doesn’t correspond to any valid token, an error is reported to the users and lexing continues. But parsing isnāt going to start2, the compiler will stop once lexing is complete.
Similarly, if the parser encounters a syntactic error, it’s reported to the user and parsing continues. Once syntactic analysis is complete, the compiler will stop instead of moving on to the semantic analysis stage.
And so on.
C# compiler as an example
A great example of going this way is the C# compiler. Hereās a short description Eric Lippert gave on StackOverflow:
Briefly, the way the compiler works is it tries to get the program through a series of stages […]
The idea is that if an early stage gets an error, we might not be able to successfully complete a later stage without (1) going into an infinite loop, (2) crashing, or (3) reporting crazy “cascading” errors. So what happens is, you get one error, you fix it, and then suddenly the next stage of compilation can run, and it finds a bunch more errors.
[…]
The up side of this design is that (1) you get the errors that are the most “fundamental” first, without a lot of noisy, crazy cascading errors, and (2) the compiler is more robust because it doesn’t have to try to do analysis on programs where the basic invariants of the language are broken. The down side is of course your scenario: that you have fifty errors, you fix them all, and suddenly fifty more appear.
Finally, letās see how this idea is represented in the C# compilerās source code. The method CompileAndEmit
of the class Microsoft.CodeAnalysis.CommonCompiler
coordinates compilation of C# and VB.NET source code.
/// <summary>
/// Perform all the work associated with actual compilation
/// (parsing, binding, compile, emit), resulting in diagnostics
/// and analyzer output.
/// </summary>
private void CompileAndEmit(/*[...]*/)
{
analyzerCts = null;
reportAnalyzer = false;
analyzerDriver = null;
// Print the diagnostics produced during the parsing stage and exit if there were any errors.
compilation.GetDiagnostics(CompilationStage.Parse, includeEarlierStages: false, diagnostics, cancellationToken);
if (HasUnsuppressableErrors(diagnostics))
{
return;
}
// [...]
}
Right away, we notice the invocation of compilation.GetDiagnostics
with CompilationStage.Parse
. The invocation is followed by a check HasUnsuppressableErrors
to determine whether the compilation stage complete successfully and should the compilation move on to the next stage or stop. If you keep looking through the code of CompileAndEmit
you’ll find more spots that perform a call to HasUnsuppressableErrors(diagnostics)
and based on the result decide to stop or carry on with the compilation.
What about a production-grade compiler? C# compiler has more than 20 compilation stages. ↩︎
In contrast, the C# compiler performs lexing and parsing at the same time. If the lexer doesn’t recognize a token, the parser handles this scenario and keeps going anyway. But semantic analysis will not start if lexical or syntactic errors have been encountered in the input code. In such a case, translation terminates. ↩︎