SBCL as an Assembly Breadboard: Building a High-Performance Stack VM

Implementing a virtual machine (VM) often involves a trade-off between simplicity and performance. For stack-based VMs, the overhead of pushing and popping data to and from memory can become a significant bottleneck. However, by leveraging a small, fixed-size stack and mapping it directly to hardware registers, it is possible to eliminate most of this data movement.

In this technical exploration, we examine a method for building a stack-based VM where the stack is restricted to 8 slots, allowing each slot to be mapped to an x86_64 register. By using Steel Bank Common Lisp (SBCL) as a "breadboard" for machine code generation, we can create a highly specialized interpreter that minimizes the cost of stack operations.

The Rotating Stack Concept

Traditional Forth-like engines often cache only the top one or two elements of the stack in registers to maintain performance. A more aggressive approach, inspired by the x87 floating-point stack and the F18 processor, is to treat a set of registers as a circular buffer.

Instead of physically moving data between registers during a push or pop, the VM maintains a modular counter that points to the current Top of Stack (TOS). A push operation simply decrements this counter, and a pop increments it.

While indexing registers dynamically is generally not possible in most Instruction Set Architectures (ISAs), this can be solved through primitive specialization. If the stack is limited to 8 slots, there are only 8 possible values for the stack pointer. By generating 8 distinct versions of every VM primitive—one for each possible stack pointer position—the register mapping becomes static for each version of the instruction.

Implementing the VM with SBCL

SBCL provides an exceptional environment for this kind of experimentation because it allows for the interactive generation and inspection of machine code. Using sb-assem, we can emit raw assembly directly from Lisp.

Register Mapping

The VM is designed with the following register assignments:

r8-r15: Stack slots (32-bit)
rsi: Base address for machine code primitives
rdi: Virtual Instruction Pointer (VIP)
rax, rbx, rcx, rdx: Scratch registers
rsp: Virtual return stack pointer

The Dispatch Mechanism (`NEXT`)

In a direct-threaded VM, each primitive ends with a sequence that jumps to the next instruction. To support the rotating stack, the NEXT sequence must dispatch to the version of the next primitive that corresponds to the current stack pointer.

To achieve this, variants of each primitive are stored at regular intervals (e.g., one page apart). The NEXT sequence calculates the address of the next primitive as follows:

mov eax, [rsi]
add rsi, 4
lea rax, [rax + rdi + variant_offset]
jmp rax

Where variant_offset = 4288 * stack_counter. Because the stack counter is known at the time the primitive variant is compiled, this offset is a constant, making the dispatch efficient.

Control Flow and Fused Operators

A functional VM requires more than just arithmetic; it needs control flow. Implementing jmp, call, and ret involves manipulating the virtual instruction pointer (rdi) and the virtual return stack (rsp).

The Power of Fusion

One of the most effective ways to increase performance in a VM is to "fuse" common sequences of operations into a single specialized primitive. For example, a loop typically consists of a decrement, a comparison, and a conditional jump.

By implementing a djn (decrement and jump if non-zero) operator, the VM can avoid multiple indirect jumps. The author tested several versions of this operator:

Naive Bytecode: 11-15x slower than native code.
Specialized djn: ~8x slower than native code.
Specialized djn2 (using conditional branches instead of cmov): ~6x slower than native code.

Comparing this to a native assembly loop (which runs at roughly 1 cycle per iteration), the specialized VM implementation achieves impressive efficiency for a high-level interpreter.

Integration with Common Lisp

To make the VM useful, a Foreign Function Interface (FFI) is required to move data between the Lisp environment and the VM. This is achieved by creating an enter and leave sequence:

enter: Sets up the stack frame and copies values from a Lisp buffer into the r8-r15 registers.
leave: Copies the final state of the registers back into the Lisp buffer and unwinds the frame.

This allows Lisp to call into the VM, execute a sequence of bytecode, and retrieve the result as a standard Lisp array.

Conclusion

The experiment demonstrates that specializing primitives based on a virtual stack pointer is a viable strategy for reducing the overhead of threaded interpreters, provided the stack size is small and fixed.

Beyond the performance gains, the project highlights the utility of SBCL as a tool for systems programming. By providing deep access to machine code generation and a REPL for immediate testing, SBCL allows developers to iterate on low-level architectural ideas with a speed that is nearly impossible in C or traditional assembly languages.

SBCL as an Assembly Breadboard: Building a High-Performance Stack VM

SBCL as an Assembly Breadboard: Building a High-Performance Stack VM

The Rotating Stack Concept

Implementing the VM with SBCL

Register Mapping

The Dispatch Mechanism (`NEXT`)

Control Flow and Fused Operators

The Power of Fusion

Integration with Common Lisp

Conclusion

References

HN Stories

SBCL as an Assembly Breadboard: Building a High-Performance Stack VM

The Rotating Stack Concept

Implementing the VM with SBCL

Register Mapping

The Dispatch Mechanism (NEXT)

Control Flow and Fused Operators

The Power of Fusion

Integration with Common Lisp

Conclusion

References

HN Stories

The Dispatch Mechanism (`NEXT`)