From Gflop/s to Tflop/s: Optimizing Matrix Multiplication in Swift
Training a Large Language Model (LLM) is essentially a massive exercise in matrix multiplication. At its core, the process is a repetitive loop of z += x * y performed trillions of times. For developers working on Apple Silicon, the challenge is often finding the balance between the high-level safety of Swift and the raw performance required for these workloads.
In a recent exploration, developer zdw set out to rewrite Andrej Karpathy’s llm.c (a plain C implementation of a GPT-2 compatible model) in Swift. The goal was not just to achieve parity with C, but to push Swift to its absolute limits using every tool available on the M-series chips—from the CPU and SIMD instructions to the "secret" AMX coprocessor and the GPU via Metal.
The Starting Point: The Performance Gap
When translating the core matmul_forward function from C to basic Swift, the initial results were stark. Despite running in Release configuration with runtime asserts removed, the basic Swift implementation was 15 to 20 times slower than the plain C version.
| Model | Tokens/s | Training iterations/s | Training vs llm.c |
|---|---|---|---|
| llm.c | 0.926 | 0.175 | 100% |
| Basic Swift | 0.054 | 0.014 | 7.3% |
The basic Swift version reaches roughly 2.8 Gflop/s, a figure that would have been impressive in 1999 but is unacceptable for modern LLM workloads. The primary culprit was identified as _ArrayBuffer.beginCOWMutation(). Swift's Copy-on-Write (COW) uniqueness checks, performed when an array is written to, created massive overhead in the hot loop even though the arrays were already uniquely referenced.
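To make the starting point concrete, here is a minimal sketch of what a direct translation might look like. It is modeled on the shape of llm.c's matmul_forward (row-major buffers, dimensions B, T, C, OC), but the Swift function name and the flop-counting helper are illustrative assumptions, not the author's code.

```swift
// Hypothetical naive translation: plain [Float] arrays, direct subscripting.
// out is (B*T) x OC, inp is (B*T) x C, weight is OC x C, all row-major.
func matmulForwardNaive(out: inout [Float], inp: [Float], weight: [Float],
                        bias: [Float]?, B: Int, T: Int, C: Int, OC: Int) {
    precondition(out.count == B * T * OC)
    precondition(inp.count == B * T * C)
    precondition(weight.count == OC * C)
    precondition(bias == nil || bias!.count == OC)
    for b in 0..<B {
        for t in 0..<T {
            let bt = b * T + t
            for o in 0..<OC {
                let idx = bt * OC + o
                out[idx] = bias?[o] ?? 0
                for i in 0..<C {
                    // Each write goes through Array's subscript setter, which
                    // is where the uniqueness check can show up in a profile.
                    out[idx] += inp[bt * C + i] * weight[o * C + i]
                }
            }
        }
    }
}

// One multiply and one add per inner iteration.
func matmulFlops(B: Int, T: Int, C: Int, OC: Int) -> Int {
    return 2 * B * T * C * OC
}
```

Dividing matmulFlops by the measured wall-clock time of a call gives a flop/s figure comparable to the ones quoted in this article.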
Closing the Gap: Swift-Level Optimizations
To move past the COW bottleneck, the first step was adopting MutableSpan (introduced in Swift 6.2), which provides a safe view over contiguous memory with near-zero access overhead. While this improved training speed, the forward pass remained slow because Swift lacks a direct equivalent to C's -ffast-math flag, which lets the compiler contract a multiply and an add into a single Fused Multiply-Add (FMA) instruction.
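MutableSpan needs a Swift 6.2 toolchain, so the sketch below swaps in the older withUnsafeMutableBufferPointer to illustrate the same underlying idea: obtain mutable access to the storage once, outside the hot loop, so the uniqueness check is not repeated per element. The function name is hypothetical and this is not the author's MutableSpan code.

```swift
// Stand-in for the MutableSpan approach: the array's storage is made unique
// once when the buffer pointer is created, and the loops then write to raw
// memory. Unlike MutableSpan, this pointer is not bounds-checked in release.
func matmulForwardBuffered(out: inout [Float], inp: [Float], weight: [Float],
                           bias: [Float]?, B: Int, T: Int, C: Int, OC: Int) {
    precondition(out.count == B * T * OC)
    precondition(inp.count == B * T * C)
    precondition(weight.count == OC * C)
    precondition(bias == nil || bias!.count == OC)
    out.withUnsafeMutableBufferPointer { outBuf in
        inp.withUnsafeBufferPointer { inpBuf in
            weight.withUnsafeBufferPointer { wBuf in
                for bt in 0..<(B * T) {
                    for o in 0..<OC {
                        // Accumulate in a local and store once per output.
                        var sum: Float = bias?[o] ?? 0
                        for i in 0..<C {
                            sum += inpBuf[bt * C + i] * wBuf[o * C + i]
                        }
                        outBuf[bt * OC + o] = sum
                    }
                }
            }
        }
    }
}
```

The trade-off is the one the article returns to later: the pointer-based version is fast but gives up the safety that MutableSpan is designed to keep.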
Leveraging Relaxed Math and SIMD
By using the Swift-Numerics library and its Relaxed.multiplyAdd function, the implementation could finally utilize the fmla (SIMD vectorized FMA) instructions. This change alone provided a nearly 10x speed increase in tokens per second.
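Relaxed.multiplyAdd lives in the external Swift Numerics package, so the following sketch uses the standard library's addProduct instead. That is a different tool: addProduct always computes a fused multiply-add, whereas the relaxed variant leaves the choice to the compiler. The dot-product function and its name are illustrative assumptions, meant only to show what an FMA-based SIMD accumulation looks like.

```swift
// Dot product using fused multiply-add on 4-wide SIMD vectors.
// Standard-library stand-in for Relaxed.multiplyAdd (Swift Numerics).
func dotFMA(_ x: [Float], _ y: [Float]) -> Float {
    precondition(x.count == y.count)
    let n = x.count
    let vectorEnd = n - n % 4
    var acc = SIMD4<Float>(repeating: 0)
    var i = 0
    while i < vectorEnd {
        let a = SIMD4<Float>(x[i], x[i + 1], x[i + 2], x[i + 3])
        let b = SIMD4<Float>(y[i], y[i + 1], y[i + 2], y[i + 3])
        acc.addProduct(a, b)   // acc += a * b, without intermediate rounding
        i += 4
    }
    var sum = acc.sum()
    // Scalar tail for lengths that are not a multiple of 4.
    while i < n {
        sum.addProduct(x[i], y[i])
        i += 1
    }
    return sum
}
```

Because a fused operation rounds once instead of twice, results can differ in the last bits from a separate multiply and add, which is part of why the compiler will not make this substitution on its own.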
Loop Unrolling and Inline Arrays
To match the optimized C implementation, which steps through the output loop in fixed-size strides to encourage the compiler to unroll it, the author utilized InlineArray (Swift 6.2). This allowed for stack-allocated buffers, avoiding the high cost of heap-allocating arrays within loops. At this stage, "Fast Swift" achieved parity with C, actually slightly exceeding it in training iterations per second (106.6% of llm.c).
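The strided loop with a small fixed-size accumulator can be sketched as follows. InlineArray requires Swift 6.2, so this example swaps in withUnsafeTemporaryAllocation (Swift 5.6 and later) as the scratch buffer, which is typically stack-allocated for small sizes. The function name and the unroll factor of 8 are assumptions for illustration.

```swift
let unrollFactor = 8  // assumed block size for this sketch

// Computes out[o] = sum over i of x[i] * w[o * C + i], eight outputs at a time.
// w is outChannels x C, row-major.
func matmulRowUnrolled(x: [Float], w: [Float], outChannels: Int) -> [Float] {
    let c = x.count
    precondition(w.count == outChannels * c)
    precondition(outChannels % unrollFactor == 0)
    var out = [Float](repeating: 0, count: outChannels)
    // Scratch accumulators that never touch the heap allocator in the loop.
    withUnsafeTemporaryAllocation(of: Float.self, capacity: unrollFactor) { acc in
        for o in stride(from: 0, to: outChannels, by: unrollFactor) {
            acc.initialize(repeating: 0)
            for i in 0..<c {
                let xi = x[i]  // loaded once, reused for all eight outputs
                for u in 0..<unrollFactor {
                    acc[u] += xi * w[(o + u) * c + i]
                }
            }
            for u in 0..<unrollFactor {
                out[o + u] = acc[u]
            }
        }
    }
    return out
}
```

The fixed trip count of the innermost loop is what gives the compiler room to unroll and vectorize it; an Array created inside the outer loop would instead pay for a heap allocation on every iteration.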
Scaling Up: Multi-threading and AMX
With single-threaded performance at parity, the next leap required utilizing all available CPU cores. Using DispatchQueue.concurrentPerform allowed the workload to be split across the M3 Max's 16 cores. However, this introduced significant "visual clutter" in the code, requiring withUnsafeMutableBufferPointer and @unchecked Sendable wrappers to bypass Swift's concurrency safety checks.
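A minimal sketch of that pattern, assuming one output row per work item, might look like this. The wrapper type and function names are hypothetical; the point is to show where the clutter comes from.

```swift
import Dispatch

// Hypothetical wrapper: tells the compiler to trust us that sharing this
// pointer across threads is safe. Safety here relies on each iteration
// writing to a disjoint row of the output.
struct SharedOutput: @unchecked Sendable {
    let base: UnsafeMutablePointer<Float>
}

// c (m x n) = a (m x k) * b (k x n), all row-major, rows computed in parallel.
func parallelMatmul(a: [Float], b: [Float], m: Int, k: Int, n: Int) -> [Float] {
    precondition(a.count == m * k && b.count == k * n)
    var c = [Float](repeating: 0, count: m * n)
    c.withUnsafeMutableBufferPointer { buf in
        guard let base = buf.baseAddress else { return }
        let out = SharedOutput(base: base)
        // Blocks until every iteration has finished.
        DispatchQueue.concurrentPerform(iterations: m) { row in
            for col in 0..<n {
                var sum: Float = 0
                for i in 0..<k {
                    sum += a[row * k + i] * b[i * n + col]
                }
                out.base[row * n + col] = sum
            }
        }
    }
    return c
}
```

Nothing in the type system verifies that rows do not overlap, which is exactly the guarantee @unchecked Sendable asks the programmer to provide by hand.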
The "Secret" Weapon: AMX
Beyond standard SIMD, Apple Silicon contains the AMX (Apple Matrix Coprocessor). While Apple only officially exposes this via the Accelerate framework, reverse-engineered instructions like AMX_MATFP allow for direct manipulation of 16x16 tiles.
Warning: Direct use of AMX instructions is discouraged for production as they are undocumented and subject to binary compatibility breaks. The Accelerate framework remains the recommended path.
Implementing AMX instructions pushed training performance to 958.8% of the original llm.c implementation.
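For comparison, the supported route through Accelerate is a single call. The sketch below uses vDSP_mmul and falls back to a plain loop where Accelerate is unavailable; the wrapper name is an assumption, and whether a given call is actually dispatched to the matrix coprocessor is up to the framework.

```swift
#if canImport(Accelerate)
import Accelerate
#endif

// c (m x n) = a (m x k) * b (k x n), all row-major.
func matmulAccelerated(a: [Float], b: [Float], m: Int, k: Int, n: Int) -> [Float] {
    precondition(a.count == m * k && b.count == k * n)
    var c = [Float](repeating: 0, count: m * n)
    #if canImport(Accelerate)
    // vDSP_mmul takes the dimensions as: rows of A, columns of B, shared dimension.
    vDSP_mmul(a, 1, b, 1, &c, 1, vDSP_Length(m), vDSP_Length(n), vDSP_Length(k))
    #else
    // Portable fallback so the sketch still runs on non-Apple platforms.
    for i in 0..<m {
        for p in 0..<k {
            let aip = a[i * k + p]
            for j in 0..<n {
                c[i * n + j] += aip * b[p * n + j]
            }
        }
    }
    #endif
    return c
}
```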
The Final Frontier: Metal and the GPU
To reach Tflop/s territory, the workload was moved to the GPU using Metal. The transition involved writing a compute kernel in Metal/C++ and an invocation layer in Swift.
- Basic Metal: A naive kernel provided a modest boost over AMX.
- Threaded Metal: Optimizing threadsPerThreadgroup yielded a significant jump (2204.6% of llm.c).
- Tiled Metal: By implementing a tiling kernel to improve memory locality (reducing the need to traverse long rows), the performance finally broke the 1 Tflop/s barrier.
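The tiling idea itself is independent of Metal and can be illustrated on the CPU. The following is a sketch in plain Swift, not the author's kernel: it walks the matrices in small square blocks so that the data touched by the inner loops stays close together in memory. The function name and default tile size are assumptions.

```swift
// c (m x n) = a (m x k) * b (k x n), all row-major, computed tile by tile.
func tiledMatmul(a: [Float], b: [Float], m: Int, k: Int, n: Int,
                 tile: Int = 16) -> [Float] {
    precondition(a.count == m * k && b.count == k * n && tile > 0)
    var c = [Float](repeating: 0, count: m * n)
    for i0 in stride(from: 0, to: m, by: tile) {
        for j0 in stride(from: 0, to: n, by: tile) {
            for p0 in stride(from: 0, to: k, by: tile) {
                // Clamp the tile at the matrix edges.
                let iEnd = min(i0 + tile, m)
                let jEnd = min(j0 + tile, n)
                let pEnd = min(p0 + tile, k)
                // Within a tile, only short segments of each row are touched.
                for i in i0..<iEnd {
                    for p in p0..<pEnd {
                        let aip = a[i * k + p]
                        for j in j0..<jEnd {
                            c[i * n + j] += aip * b[p * n + j]
                        }
                    }
                }
            }
        }
    }
    return c
}
```

A GPU kernel applies the same blocking, with each threadgroup typically staging its tile in fast on-chip memory before computing on it.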
Final Performance Comparison
| Model | Tokens/s | Training iterations/s | Training vs llm.c |
|---|---|---|---|
| llm.c | 0.926 | 0.175 | 100% |
| Multithreaded Swift | 4.356 | 1.014 | 558.5% |
| AMX | 5.884 | 1.678 | 958.8% |
| Tiled Metal | 11.123 | 5.351 | 3057.7% |
Technical Insights and Counterpoints
The FMA Debate
A critical point raised in the community discussion concerns the use of -ffast-math. While the author used it to enable FMA, some experts argue that -ffast-math is too broad and can lead to undesirable numerical inaccuracies. The recommended alternative for those who specifically want FMA without the risks of full fast-math is -ffp-contract=fast.
The GPU Software Moat
The difficulty in moving from "Basic Metal" to "Tiled Metal" highlights why software ecosystems like NVIDIA’s CUDA remain so dominant. Peak GPU performance is not about the hardware alone, but about having a vast library of highly tuned kernels for specific data shapes.
Conclusion
Starting from a naive implementation at 2.8 Gflop/s, the author reached 1.1 Tflop/s, a roughly 382x increase in performance. This journey underscores that while Swift can match or exceed C's speed, the cost is often the language's signature elegance, as the code descends into unsafe pointers and manual memory management. For production applications, the lesson is clear: use the established frameworks (Accelerate, Core ML, MPSGraph), which already embody years of kernel optimization.