Custom Visual Studio language services: ManagedMyC meets ANTLR
As some of you know (ok, probably not many of you), I’m the author behind Pixel Mine nFringe, a custom language service framework that we used to provide UnrealScript editing and debugging features in Visual Studio 2005 and 2008. To date, I’ve written two full language services with it (UnrealScript and ANTLR v3) and toyed with several others (INI files, C/C++, StringTemplate, and a scripting language used in another game). Several people have asked how to get started on a language service that uses ANTLR grammars for the backend features.
To get started, I’ve made a near-direct port of the ManagedMyC sample from the Visual Studio SDK, which uses MPLEX/MPPG, to one that uses ANTLR. The most important thing to note at this point: many parts of this sample are inefficient, clumsy, and/or just done the wrong way. None of the features from earlier posts here are implemented in this sample [yet]. Over the next several weeks, I plan to write blog entries covering the individual tasks required to make ManagedMyC a solid example of how someone could build a custom language service.
The source code for this post is linked at the end of this article. I’ve divided this post into the two major items involved in creating an ANTLR-based language service: setting up the grammar compiler with MSBuild, and creating a scanner-friendly lexer.
Setting up ANTLR grammars to compile with your language service
ANTLR generates .cs files that must be included in the project build. Visual Studio’s build system has trouble picking up files generated in the middle of a build, so you normally have to build your project twice before a grammar change takes effect. To stop Visual Studio from caching the timestamps of the generated files (so a single build always works), add the following before the Import elements near the end of the project file:
<PropertyGroup>
  <UseHostCompilerIfAvailable>False</UseHostCompilerIfAvailable>
</PropertyGroup>
You also need to add rules to have the grammars build with MSBuild. Add the following immediately after the Import elements:
<PropertyGroup>
  <Antlr3ToolPath>$(MSBuildProjectDirectory)\..\Antlr</Antlr3ToolPath>
  <CoreCompileDependsOn>$(CoreCompileDependsOn);GenerateAntlrCode</CoreCompileDependsOn>
  <CoreCleanDependsOn>$(CoreCleanDependsOn);CleanAntlrCode</CoreCleanDependsOn>
</PropertyGroup>
<Target Name="GenerateAntlrCode" Inputs="@(Antlr3)" Outputs="%(OutputFiles)">
  <Message Importance="normal" Text="Antlr: Transforming '@(Antlr3)' to '%(Antlr3.OutputFiles)'" />
  <Exec Command="java -cp %22$(Antlr3ToolPath)\antlr3.jar;$(Antlr3ToolPath)\antlr-2.7.7.jar;$(Antlr3ToolPath)\stringtemplate-3.1b1.jar%22 org.antlr.Tool -lib %22%(RootDir)%(Directory).%22 -message-format vs2005 @(Antlr3)" Outputs="%(OutputFiles)" />
</Target>
<Target Name="CleanAntlrCode">
  <ItemGroup>
    <_CleanAntlrFileWrites Include="@(Antlr3->'%(RelativeDir)%(Filename).tokens')" />
  </ItemGroup>
  <Message Importance="normal" Text="Antlr: Deleting output files '@(_CleanAntlrFileWrites)'" />
  <!-- Uncomment the following line if you want the "Rebuild Solution" command to rebuild your grammars -->
  <!--<Delete Files="@(_CleanAntlrFileWrites)" />-->
</Target>
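These targets assume each grammar is declared as an Antlr3 item whose OutputFiles metadata lists the sources it generates, and that those generated files also appear as Compile items so they take part in the normal C# build. A hypothetical item group for a single combined grammar (the file names here are illustrative, not copied from the sample):

<ItemGroup>
  <Antlr3 Include="MyC.g3">
    <!-- Generated files; GenerateAntlrCode compares their timestamps against the grammar -->
    <OutputFiles>MyCLexer.cs;MyCParser.cs</OutputFiles>
  </Antlr3>
  <!-- The generated sources are compiled like any other .cs file -->
  <Compile Include="MyCLexer.cs" />
  <Compile Include="MyCParser.cs" />
</ItemGroup>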
IScanner-friendly lexers
The biggest difference between traditional parsing and the code highlighter in Visual Studio is the amount of text they process at a time. For regular parsing, generally started by a call to LanguageService.ParseSource, you have access to the entire source file and can parse it as a single block of text. The syntax highlighter, exposed through a class that implements the IScanner interface, only has access to one line at a time, so it must be able to resume lexing from any point in the file. ANTLR-generated lexers are not trivial to use one line at a time, but with a bit of thought up front, it’s not bad either.
The primary problem lies in single lexical tokens, such as block comments, that span more than one line. To work in a Visual Studio colorizer, your lexer needs to be able to start in the middle of any multi-line token. ManagedMyC has two tokens that can span multiple lines: white space and block comments.
White space, the not-so-multiline multiline token
First, handling white space. If you have a whitespace rule like the following, simply break it in two so that no single token spans multiple lines.
Before:
WS : (' '|'\t'|'\r'|'\n')+ { $channel=HIDDEN; } ;
After:
NEWLINE : '\r'? '\n' { $channel = HIDDEN; } ;
WS : (' '|'\t')+ { $channel = HIDDEN; } ;
Multi-line tokens: block comments
Edit 4/15/09: I now have a new method for handling this type of token that is more reliable, faster, and cleaner than the one described below. I’ll be making a new blog post soon to revisit the issue.
Next is the big one: handling C-style block comments. The following is the simplest form of C-style comment represented in an ANTLR lexer rule:
COMMENT : '/*' .* '*/' { $channel=HIDDEN; } ;
First, you don’t want your lexer to throw an exception when a block comment doesn’t end before the end of a line (which, for the line-at-a-time scanner, is the end of the input), so we modify the rule to continue to the end of the text or until the comment ends, whichever comes first:
COMMENT : '/*' ( ~('*') | ('*' ~'/') => '*' )* ('*/')? { $channel=HIDDEN; } ;
Making the lexer able to resume lexing in the middle of a comment is a bit more difficult. We do this by forcing everything to be a comment token until we reach the end of the block comment, and then continuing like normal. This takes several steps.
First, we need to be able to detect the end of a block comment. Since */ is only valid in C as the end of a block comment, we simply add it to the tokens{} section of the grammar:
tokens {
    // ...
    END_BLOCK_COMMENT = '*/';
    // ...
}
Next, we need to make sure no stray character bombs the lexer with a NoViableAltException, so we add the following as the last rule in the file:
ANYCHAR : . ;
Next, we need a property on the lexer class to get/set the current state. In MyCLexerHelper.cs (the user-edited file for the partial class), we add the following:
using Antlr.Runtime;

namespace ManagedMyC
{
    partial class MyCLexer
    {
        public bool InBlockComment { get; set; }
    }
}
which we manipulate inside the COMMENT rule:
COMMENT
    : '/*' { InBlockComment = true; $channel = HIDDEN; }
      ( ~('*') | ('*' ~'/') => '*' )*
      ('*/' { InBlockComment = false; })?
    ;
The final step is to override the NextToken() method to handle the case where the lexer started in a comment state:
using Antlr.Runtime;

namespace ManagedMyC
{
    partial class MyCLexer
    {
        public bool InBlockComment { get; set; }

        public override IToken NextToken()
        {
            IToken next = base.NextToken();

            // While inside a block comment, remap whatever the lexer matched to a
            // COMMENT token on the hidden channel, until the END_BLOCK_COMMENT
            // token closes the comment.
            if ( next.Type != EOF && InBlockComment && next.Type != COMMENT )
            {
                if ( next.Type == END_BLOCK_COMMENT )
                    InBlockComment = false;

                next.Type = COMMENT;
                next.Channel = HIDDEN;
            }

            return next;
        }
    }
}
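To see how this plugs into the colorizer, here is a minimal sketch of the IScanner side. It is a sketch under a couple of assumptions: the class name MyCScanner and the 0/1 state encoding are illustrative rather than copied from the sample, and the mapping of token types to colors is elided. Visual Studio hands the scanner one line at a time plus an opaque state int, which we use to carry InBlockComment across lines:

using Antlr.Runtime;
using Microsoft.VisualStudio.Package;

namespace ManagedMyC
{
    // Illustrative line-at-a-time scanner backed by the lexer above.
    class MyCScanner : IScanner
    {
        MyCLexer lexer;

        public void SetSource(string source, int offset)
        {
            // Visual Studio calls this once per line to be colorized.
            lexer = new MyCLexer(new ANTLRStringStream(source.Substring(offset)));
        }

        public bool ScanTokenAndProvideInfoAboutIt(TokenInfo tokenInfo, ref int state)
        {
            // Restore the lexer state saved at the end of the previous line.
            lexer.InBlockComment = state != 0;

            IToken token = lexer.NextToken();
            state = lexer.InBlockComment ? 1 : 0;
            if (token.Type == MyCLexer.EOF)
                return false;

            // The input is a single line, so line-relative offsets work here.
            tokenInfo.StartIndex = token.CharPositionInLine;
            tokenInfo.EndIndex = token.CharPositionInLine + token.Text.Length - 1;
            tokenInfo.Token = token.Type;
            // Map token.Type to tokenInfo.Color / tokenInfo.Type here (elided).
            return true;
        }
    }
}

Because all of the multi-line bookkeeping lives in the NextToken() override, the scanner never has to know how the comment state changes; it only round-trips a single flag through Visual Studio’s state int.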
Source code for this sample
You’ll have to generate your own LanguageService and Package Guids and set them in MyCConstants.cs before building this project.
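If you haven’t done this before, generate two fresh GUIDs (Tools → Create GUID, or guidgen.exe) and drop them in. The constant names below are illustrative rather than the sample’s actual ones, and the values are placeholders:

static class MyCConstants
{
    // Placeholder values; replace with your own freshly generated GUIDs.
    public const string LanguageServiceGuidString = "11111111-2222-3333-4444-555555555555";
    public const string PackageGuidString = "66666666-7777-8888-9999-000000000000";
}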
The file is compressed with 7-zip because it’s awesome.
ManagedMyC-1.7z
Comments

[…] Here’s the source code for the ManagedMyC sample at this point. Since I surely missed things, you can always diff this code versus the original source from my first post on this subject. […]
October 19th, 2008 at 5:34 pm

Hi Sam,
thanks for this post. I do have a few questions, though.
It seems you have introduced “multi-line tokens” and your approach to them more or less in order to stick with the Babel framework’s approach of using only lexer tokens for colorization. But I believe this is not optimal:
(1) Lexer tokens like “Identifier” appear in different scenarios (parser rules) where they are, e.g., class names or object names. Visual Studio already colors these differently, so they cannot all result from the same token.
You therefore must solve this by adding a state machine to the lexer, which essentially turns the regular lexer language into a context-free language. Since this is typically not supported by lexer generators, you have to hand-code the state machine, as done in your sample through the introduction of the state variable InBlockComment. For comments this is acceptable, but what about the class vs. object name example? You would essentially have to build parts of the AST to understand, in the lexer (!), what kind of Identifier you are currently scanning. That would be a lot of effort, wouldn’t it?
(2) You then put the switch/case (an “if” in your sample) that evaluates the current lexer state into the handwritten NextToken method. That is quite hard to maintain, since the state machine is now split into two parts: modifying actions in the lexer grammar, and guards and transition detection in NextToken(). I am wondering whether ANTLR’s semantic predicates would do better here.
Is the approach shown in this post really scalable to support real languages of some size (whatever that means…)?
I’d be happy to hear about your thoughts. Thanks a lot, Mike
January 5th, 2009 at 3:17 pm

Hi Mike,
I’ve done three different things for three different languages. Each one was successful (good performance) with source files of 20,000+ lines / 500+ KB.
UnrealScript:
I’m not compiling UnrealScript, so the grammar is solely used for IntelliSense purposes. The lexer rules in this grammar support the method described in this post, and the colorizer is implemented as a larger version of what’s in this post.
StringTemplate:
I updated the lexer rules in Group.g3 to meet the colorizer requirements described in this post. I don’t like this as much because the implementation of the StringTemplate library, which is completely independent of the language service, must now meet special requirements so the language service works. This type of dependency is unacceptable, so I’ll be changing over to the method I use for the ANTLR v3 Grammar language service.
ANTLR v3:
I reference the C# port of the ANTLR tool to gather IntelliSense information / full source parsing. To implement the colorizer, I copied all of the lexer rules from ANTLR.g3 into a new AntlrColorizerLexer.g3 inside the language service, then updated that lexer to support the colorizer. If the ANTLR lexer spec changes in the future, I will have to update the lexer in the language service to reflect the changes, but I believe this is an acceptable situation.
Finally, regarding the use of manual coding instead of predicates: predicates of this form greatly impact the performance of the lexer. The method for implementing a colorizer described here offers good performance and provides easy access to the original token information from the lexer at any point in the code via the TokenInfo: StartIndex and EndIndex give the location, and the Token member (an int) holds the lexer token type.
January 6th, 2009 at 2:21 pm

Hmm… I just made a project implementing a language service for a simple C-like language, using the ManagedMyC sample as a starting point. Could you elaborate on what this sample does incorrectly? (Though I’ve found some bugs in it already.)
About the manual coding vs. use of predicates issue with block comments: I use MPLex’s start conditions just like the ManagedMyC example does, e.g. you provide a different set of patterns that only apply when you are inside a comment, and prefix them with <COMMENT>. When you encounter “/*”, you put “BEGIN<COMMENT>” in your rule to enter the COMMENT start condition; when you later encounter “*/”, you use “BEGIN<INITIAL>” to end it. I don’t know if ANTLR has anything like start conditions, but they were made to handle cases like this. I also use them to handle string literals and preprocessor statements, and they work like a charm.
January 28th, 2009 at 12:00 pm
Hi guys! The real problem for me is getting the language service running on my custom editor, not on the core editor. I hope somebody can help me. How can I attach a specific language service to, e.g., a text field? Any links or snippets would help a lot! Thanks!
May 14th, 2009 at 9:55 am

[…] is updated every single time! Hope this can ease some frustration out there. Credit is due to Sam’s Blog for supplying the solution for this problem. *Actually it is two numbers behind because Visual […]
September 19th, 2012 at 8:44 am