Sam's Blog

01 Dec

Code Completion filtering, selection, and replacement algorithms

This post is derived from an email I sent to the NetBeans mailing last month. Some of the implementation details refer to specific items in the NetBeans codebase, but the general algorithms apply to any number of applications.

In the past I alluded to spending a great deal of time thinking about ways to improve the performance and usefulness of a code completion feature. This algorithm is complicated but ends up producing consistent, predictable behavior. It truly excels with COMPLETION_AUTO_POPUP_DELAY set to 0 and low-latency implementations of AsyncCompletionTask.query, but still performs better than alternatives when latencies are observable.

To start with, a couple definitions from the subject line:

Filtering algorithm: the algorithm used after semantic context evaluation to restrict the identifiers actually displayed in the completion dropdown.
Selection algorithm: the algorithm used to select an initial “best match” item within the completion dropdown.
Replacement algorithm: the algorithm used to determine what text gets replaced – currently chosen by the user pressing Enter or Ctrl+Enter.

I noticed that Ctrl+Enter deletes the current identifier before calling CompletionItem.defaultAction. I’ll refer to the behavior of Ctrl+Enter as Extend because it extends the completion to the end of the current identifier. I’ll refer to the behavior of Enter as No-Extend.

For this post, I’m examining the following questions:

Why is there a difference between Enter and Ctrl+Enter?
When would you want the behavior of Enter, and when would you want the behavior of Ctrl+Enter?
How should the completion dropdown get filtered?
How should the “best match” be determined?

Part 0: Progress

As part of my continued work on ANTLRWorks 2, I have modified the Editor Code Completion module to support everything described in this email without any breaking API changes relative to the current specification 1.28. In addition to preserving API compatibility, my current implementation exactly follows the existing code completion behavior if a developer does not explicitly override it. The changes are available with patches (unfortunately multiple patches as I tweaked a few things) as an RFE in Bug 204867 in the NetBeans Bugzilla.

Part 1: My answer to #1 (why is there a difference between Enter and Ctrl+Enter)

The difference is present because the current algorithm cannot reliably answer question #2.

Part 2: My answer to #2 (when do you want Extend vs. No-Extend?)

From what I can tell, the code completion algorithm uses this feature to compensate for not keeping track of information available when the completion was invoked. For example, suppose you are trying to complete the identifier getStuff, and the following currently present. For reference, assume this is columns 0 before the ‘g’ through 5 after the ‘t’.

getSt

When code completion (Ctrl+Space) is invoked at positions p=0 or p=5, the user expects the behavior of Enter. When the code completion is invoked at positions 0<p<5, the user expects the behavior of Ctrl+Enter. Also note that at position p=5, the algorithms of Enter and Ctrl+Enter are equivalent.

Part 3: Code completion selection in Extend mode

Using the example from Part 2, it’s clear that the current selection algorithm used in the completion dropdown does not properly handle the Extend behavior, because it only considers the text before the caret when selecting an item. When the completion algorithm in invoked in Extend mode, all of the text of the current identifier should be considered when choosing the default selection.

Part 4: Code completion filtering in both Extend and No-Extend modes

If the code selection algorithm of Part 3 is implemented, then the user will encounter major problems under the current filtering algorithm. In addition, it should be immediately apparent that the current filtering algorithm is crippled because it is fully incapable of handling even the simplest misspellings when completion is invoked at the end of an identifier (position p=5 in the previous example). While I do not believe the following rules are ideal for all situations, I designed them to be easily implemented and feel similar to the current rules while preserving the ability to support the selection mode of Part 3 as well as handling many misspelling cases.

For prefix filtering using the prefix text in span [0,x) with the caret at position c, you should always have x<=c.
If filtering on the prefix [0,x+1), where x>=0, produces an empty result, then filtering should be on [0,x) instead.

Part 4.1

In No-Extend mode with no misspellings before the caret, these rules produce exactly the same result as the current implementation. In No-Extend mode with misspellings present before the caret, these rules prevent having an empty (useless) dropdown appear. Unfortunately, if the user attempting to complete getShell types getSt and presses Ctrl+Enter, the filtering above would result in only showing getStuff. The solution is adding the following rule which has much larger ramifications.

The completion list should not be filtered before the user types a character after the list is initially shown. When the dropdown appears in response to an automatic trigger (see CompletionProvider.getAutoQueryTypes), the list should not be initially filtered even if the trigger is a word character. The latter case occurs when options like “Auto Popup on Typing Any Java Identifier Part” are enabled.

Part 4.2

To allow even more convenient typing, the filtering algorithm can be updated to also allow the following.

Substring match: when filtering on a prefix span [0,x), with x>1 (at least 2 characters), the filter allows items containing the substring [0,x). The match may or may not be case-sensitive.
Word boundary match: when filtering on a prefix span [0,x) which matches the regular expression ([A-Z]\w*){2,}, we call each group ([A-Z]\w*) a word prefix. The filter allows items containing words in the same relative order that match the word prefix. Note that this means the prefix UnsExc will allow UnsupportedOperationException even though the word Operation appears between the words matched by Uns and Exc. When evaluating the match, the “words” of a completion item start when [A-Z] follows ^ or [^A-Z], and when [A-Za-z] follows ^ or [^A-Za-z]. The latter covers multi_word_ids_with_underscores. This match may or may not be case-sensitive, but in case-sensitive mode it only works for UpperPascalCase identifiers.

Part 5: Instant substitution with Part 4.1 in place

Currently instant substitution only operates if the filtered list has a single item in it. It also only works if the caret is located at the end of an identifier, and when the prefix is a case-sensitive match. The current algorithm is in CompletionImpl.requestShowCompletionPane. This algorithm would need to be updated as follows.

The evaluated prefix for instant completion is the span [0,a), where Extend mode chooses the current identifier block and No-Extend mode chooses MIN(caretPosition, endOfIdentifier).
The evaluated prefix must match the prefix of exactly 1 item in the completion list (unique match constraint). The match may or may not be case-sensitive (this is an option in the Java editor, and some languages don’t consider case at all). Note that only prefix matching is used, even if the filtering algorithm implements the advanced filters in Part 4.2.

Part 6: Initial selection with new filtering rules in place

It should be clear that if the filtering rules are relaxed per the advanced rules in Part 4 (especially Part 4.1), the current selection algorithm of first prefix match will do a poor job of choosing items the user is trying to complete. The following selection rules are prioritized for ideal behavior, but an implementation may use variations for efficiency as long as the variations result in predictable behavior (typically restricted to performance related simplifications to rules 1 and 9).

Unless otherwise specified, character matches are case-insensitive. While the user has an option to explicitly disable case-insensitive matching, if all of the rules from this email are in place then that option will negatively impact code completion usefulness.

Evaluated span: Due to the relaxation of filtering rules in Part 4, the completion dropdown may include items even when the current identifier block does not match any item. The largest possible span is the same as the one defined in Part 5.1. A match which includes more characters from this span wins over one that includes fewer.
Exact match (case-sensitive): When possible, choose a completion item whose text exactly matches the evaluated span text.
Prefix match (case-sensitive): When possible, choose a completion item that starts with the evaluated span text.
Exact match (case-insensitive)
Prefix match (case-insensitive)
Substring match: When possible, choose a completion item that contains the evaluated span as a substring (filtering rule 4.4).
Word boundary match: When possible, choose a completion item that matches the evaluated span as described in filtering rule 4.5.
Validity match: When possible, choose a “smart match” completion item (sort priority < 0).
Recently used: When possible, choose a completion item which was successfully inserted by a more recent completion than one inserted by an earlier completion (or not previously inserted).
Case sensitivity: When possible, choose a completion item that matches the case of every matched character from the evaluated span. This decision is Boolean – the match is either all characters or does not match.
Alphabetical order: If evaluation of all of the above still results in multiple potential “best match” items, choose the first one in alphabetical order.

Part 6.1: Additional notes about language semantics in Validity and Recently Used selections

If semantics are tracked as well as the actual inserted text (e.g. for Java an MRU list of ElementHandle instead of String), then a weighting algorithm should be used to balance between text inserted more recently and having a full semantic match. One possible way to handle this is having one MRU that tracks semantic elements and one that tracks strings. In the semantic element list make sure there are never 2 items present which are valid within the same context, then always prefer a semantic match over a plain string match. The net impact of considering semantics is very cool, but hard to explain the nuances (the end user will just see it as the algorithm always knowing what to type).

In theory, a weighting algorithm could also be used to provide a hybrid of the Validity and Recently Used selection steps. At this point I have not considered the specifics of such a feature.

Part 6.2: Commentary about Visual Studio 2010

The C# language service Visual Studio 2010 implements several items described in this post. For users with access to Visual Studio 2010 with a C# project, the following is a list of some of the differences between it and the algorithms above.

Word boundary matching (Part 4.2) is limited to the first character of each word. For example, the user can type NIE to match NotImplementedException, but NImExc will not work. In addition, words start when [A-Z] follows [A-Z], which significantly reduces accuracy when UPPER_CASE_NAMES are used (every letter counts as a new word). This is especially problematic when working with COM interop, including much of the Visual Studio SDK.
When selecting a completion item, Visual Studio only considers Recently Used items when an Exact, Prefix, or Substring match occurs. For Word Boundary matches, it is ignored. When using System;, typing NSE will always match MulticastNotSupportedException no matter how many times you manually select NotSupportedException from the result.
Visual Studio does not implement the Validity match step. My early experience with this feature led me to a conclusion that it was “cool” but generally problematic. After much consideration, I finally realized that including it as step 8 in the selection algorithm allows the user to benefit from its advantages without having it override the various character matching steps.

Visual Studio 11 apparently includes an additional fuzzy logic selection feature, which I’m really hoping they properly inserted between the Word Boundary and Validity steps of the selection algorithm (fingers crossed).

This entry was posted on Thursday, December 1st, 2011 at 11:02 am and is filed under Programming, Tutorials. You can follow any responses to this entry through the RSS 2.0 feed. You can leave a response, or trackback from your own site.

Search

Categories

Archives

Meta