From Gflop/s to Tflop/s: Optimizing Matrix Multiplication in Swift

Training a Large Language Model (LLM) is essentially a massive exercise in matrix multiplication. At its core, the process is a repetitive loop of z += x * y performed trillions of times. For developers working on Apple Silicon, the challenge is often finding the balance between the high-level safety of Swift and the raw performance required for these workloads.

In a recent exploration, developer zdw set out to rewrite Andrej Karpathy’s llm.c (a plain C implementation of a GPT-2 compatible model) in Swift. The goal was not just to achieve parity with C, but to push Swift to its limits using every tool available on the M-series chips: the CPU and its SIMD instructions, the "secret" AMX coprocessor, and the GPU via Metal.

The Starting Point: The Performance Gap

When translating the core matmul_forward function from C to basic Swift, the initial results were stark. Despite running in Release configuration with runtime asserts removed, the basic Swift implementation was 15 to 20 times slower than the plain C version.

Model	Tokens/s	Training iterations/s	Training vs llm.c
llm.c	0.926	0.175	100%
Basic Swift	0.054	0.014	7.3%

This works out to roughly 2.8 Gflop/s, a figure that would have been impressive in 1999 but is unacceptable for modern LLM workloads. The primary culprit was identified as _ArrayBuffer.beginCOWMutation(): Swift’s Copy-on-Write (COW) uniqueness checks added massive overhead even though the arrays were already uniquely referenced.
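For orientation, the kind of loop under discussion can be sketched as follows. This is a minimal illustration, not the author's code: the function name, the flattened rows parameter (B*T in llm.c terms), and the choice to accumulate directly into out are assumptions made here to show where a mutating Array access lands in the innermost loop.

```swift
// Naive forward matmul over flat row-major arrays:
//   out[r, o] = bias[o] + sum over i of inp[r, i] * weight[o, i]
// Names and layout are illustrative assumptions, modeled on llm.c's matmul_forward.
func matmulForwardNaive(out: inout [Float], inp: [Float], weight: [Float],
                        bias: [Float]?, rows: Int, c: Int, oc: Int) {
    for r in 0..<rows {
        for o in 0..<oc {
            out[r * oc + o] = bias?[o] ?? 0
            for i in 0..<c {
                // Every += here is a mutating access to an Array, which is
                // where a uniqueness (COW) check can appear if it is not hoisted.
                out[r * oc + o] += inp[r * c + i] * weight[o * c + i]
            }
        }
    }
}
```

Each inner iteration performs one multiply and one add, so a full call costs about 2 * rows * c * oc floating-point operations.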

Closing the Gap: Swift-Level Optimizations

To move past the COW bottleneck, the first step was adopting MutableSpan (introduced in Swift 6.2), which provides a reliable way to access memory with near-zero overhead. While this improved training speed, the forward pass remained slow because Swift lacks a direct equivalent to C’s -ffast-math flag, which enables Fused Multiply-Add (FMA) instructions.
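A minimal sketch of the MutableSpan approach, assuming a Swift 6.2 toolchain (on Apple platforms an availability annotation may also be needed). The function name and parameters are hypothetical; only Array's mutableSpan property and MutableSpan's subscript are taken from the standard library.

```swift
// Sketch only: obtain one exclusive MutableSpan over `out` up front, so the
// inner loops write through the span rather than through Array's subscript.
func matmulForwardSpan(out: inout [Float], inp: [Float], weight: [Float],
                       rows: Int, c: Int, oc: Int) {
    precondition(out.count == rows * oc)
    var dst = out.mutableSpan          // bounds-checked view, no copy
    for r in 0..<rows {
        for o in 0..<oc {
            var acc: Float = 0
            for i in 0..<c {
                acc += inp[r * c + i] * weight[o * c + i]
            }
            dst[r * oc + o] = acc
        }
    }
}
```

The span borrows the array exclusively for its lifetime, so any uniqueness work happens once when the span is created rather than on each store.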

Leveraging Relaxed Math and SIMD

By using the Swift-Numerics library and its Relaxed.multiplyAdd function, the implementation could finally utilize the fmla (SIMD vectorized FMA) instructions. This change alone provided a nearly 10x speed increase in tokens per second.
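To keep the example free of package dependencies, the sketch below uses a stand-in rather than Relaxed.multiplyAdd: the standard library's addingProduct, which always computes a fused multiply-add. The difference matters: Relaxed.multiplyAdd permits the compiler to fuse, while addingProduct requires it. The dotFMA function and its 8-wide vector width are assumptions chosen for illustration.

```swift
// Dot product using explicit fused multiply-add on 8-wide SIMD vectors.
// Stand-in for the article's Relaxed.multiplyAdd; uses only the standard library.
func dotFMA(_ x: [Float], _ y: [Float]) -> Float {
    precondition(x.count == y.count)
    let n = x.count
    var acc = SIMD8<Float>(repeating: 0)
    var i = 0
    while i + 8 <= n {
        let vx = SIMD8<Float>(x[i..<i + 8])
        let vy = SIMD8<Float>(y[i..<i + 8])
        acc = acc.addingProduct(vx, vy)   // acc + vx * vy, rounded once per lane
        i += 8
    }
    var total = acc.sum()                 // horizontal add of the 8 lanes
    while i < n {                         // scalar tail for leftover elements
        total = total.addingProduct(x[i], y[i])
        i += 1
    }
    return total
}
```

On arm64 the vector addingProduct is the kind of operation that can lower to fmla, though the exact instructions depend on the compiler and optimization level.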

Loop Unrolling and Inline Arrays

To match the optimized C implementation, which strides over loops to encourage compiler unrolling, the author utilized InlineArray (Swift 6.2). This allowed for stack-allocated buffers, avoiding the high cost of heap-allocating arrays within loops. At this stage, "Fast Swift" achieved parity with C, actually slightly exceeding it in training iterations per second (106.6% of llm.c).
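The unrolling idea can be sketched like this, assuming Swift 6.2 for InlineArray. The block size of 8, the function name, and the requirement that rows be a multiple of 8 are simplifying assumptions, not details confirmed by the article.

```swift
// Process 8 rows at a time so each weight value is loaded once and reused,
// keeping the 8 partial sums in a fixed-size InlineArray (no heap allocation).
func matmulForwardUnrolled(out: inout [Float], inp: [Float], weight: [Float],
                           rows: Int, c: Int, oc: Int) {
    precondition(rows % 8 == 0, "sketch assumes rows is a multiple of 8")
    for r0 in stride(from: 0, to: rows, by: 8) {
        for o in 0..<oc {
            var acc = InlineArray<8, Float>(repeating: 0)
            for i in 0..<c {
                let w = weight[o * c + i]          // shared by all 8 rows
                for k in 0..<8 {                   // fixed trip count: easy to unroll
                    acc[k] += inp[(r0 + k) * c + i] * w
                }
            }
            for k in 0..<8 {
                out[(r0 + k) * oc + o] = acc[k]
            }
        }
    }
}
```

Because the accumulator has a compile-time size, the optimizer is free to keep it in registers, which is the effect the strided C loop is after.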

Scaling Up: Multi-threading and AMX

While single-threaded performance was solved, the next leap required utilizing all available CPU cores. Using DispatchQueue.concurrentPerform allowed the workload to be split across the M3 Max’s 16 cores. However, this introduced significant "visual clutter" in the code, requiring withUnsafeMutableBufferPointer and @unchecked Sendable wrappers to bypass Swift’s concurrency safety checks.
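The "visual clutter" looks roughly like the sketch below. OutPointer and matmulForwardParallel are hypothetical names, and splitting the work one row per iteration is an assumption made for brevity; a real implementation would likely hand each core a larger chunk.

```swift
import Dispatch

// Hypothetical wrapper: @unchecked Sendable tells the compiler that we, not it,
// guarantee the pointer is used without data races.
struct OutPointer: @unchecked Sendable {
    let base: UnsafeMutablePointer<Float>
}

func matmulForwardParallel(out: inout [Float], inp: [Float], weight: [Float],
                           rows: Int, c: Int, oc: Int) {
    precondition(out.count == rows * oc && rows > 0 && oc > 0)
    out.withUnsafeMutableBufferPointer { buf in
        let dst = OutPointer(base: buf.baseAddress!)
        // Each iteration writes a disjoint row of `out`, so no locking is needed.
        DispatchQueue.concurrentPerform(iterations: rows) { r in
            for o in 0..<oc {
                var acc: Float = 0
                for i in 0..<c {
                    acc += inp[r * c + i] * weight[o * c + i]
                }
                dst.base[r * oc + o] = acc
            }
        }
    }
}
```

The safety argument lives entirely in the comment: nothing in the type system checks that the rows are disjoint, which is the trade-off the paragraph above describes.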

The "Secret" Weapon: AMX

Beyond standard SIMD, Apple Silicon contains the AMX (Apple Matrix Coprocessor). While Apple only officially exposes this via the Accelerate framework, reverse-engineered instructions like AMX_MATFP allow for direct manipulation of 16x16 tiles.

Warning: Direct use of AMX instructions is discouraged for production as they are undocumented and subject to binary compatibility breaks. The Accelerate framework remains the recommended path.

Implementing AMX instructions pushed training performance to 958.8% of the original llm.c implementation.
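For comparison, the supported route mentioned in the warning can be sketched with Accelerate's vDSP_mmul. This is not the author's AMX code, and it runs only on Apple platforms; the wrapper name matmulAccelerate is an assumption, and which hardware units Accelerate uses internally is up to Apple.

```swift
import Accelerate

// C (m x n) = A (m x p) * B (p x n), all row-major with unit stride.
func matmulAccelerate(_ a: [Float], _ b: [Float], m: Int, n: Int, p: Int) -> [Float] {
    precondition(a.count == m * p && b.count == p * n)
    var c = [Float](repeating: 0, count: m * n)
    vDSP_mmul(a, 1, b, 1, &c, 1,
              vDSP_Length(m), vDSP_Length(n), vDSP_Length(p))
    return c
}
```

Note that llm.c multiplies by the transposed weight matrix, so using this in matmul_forward would need a transpose first or a BLAS call that accepts a transpose flag.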

The Final Frontier: Metal and the GPU

To reach Tflop/s territory, the workload was moved to the GPU using Metal. The transition involved writing a compute kernel in Metal/C++ and an invocation layer in Swift.

Basic Metal: A naive kernel provided a modest boost over AMX.
Threaded Metal: Optimizing threadsPerThreadgroup yielded a significant jump (2204.6% of llm.c).
Tiled Metal: By implementing a tiling kernel to improve memory locality (reducing the need to traverse long rows), the performance finally broke the 1 Tflop/s barrier.
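The kernel-plus-invocation structure can be sketched in a single Swift file by compiling the Metal source at runtime. Everything here is illustrative rather than the author's kernel: the square n x n shape, the 16 x 16 tile, the requirement that n be a multiple of 16, and the names matmul_tiled and metalMatmulTiled are all assumptions. It requires a Metal-capable Apple device.

```swift
import Metal

// Metal Shading Language source, compiled at runtime for brevity.
let kernelSource = """
#include <metal_stdlib>
using namespace metal;
constant uint TILE = 16;
kernel void matmul_tiled(device const float *a [[buffer(0)]],
                         device const float *b [[buffer(1)]],
                         device float *c [[buffer(2)]],
                         constant uint &n [[buffer(3)]],
                         uint2 gid [[thread_position_in_grid]],
                         uint2 tid [[thread_position_in_threadgroup]]) {
    threadgroup float ta[TILE][TILE];
    threadgroup float tb[TILE][TILE];
    float acc = 0.0f;
    for (uint t = 0; t < n; t += TILE) {
        // Each thread stages one element of A and one of B into fast memory.
        ta[tid.y][tid.x] = a[gid.y * n + t + tid.x];
        tb[tid.y][tid.x] = b[(t + tid.y) * n + gid.x];
        threadgroup_barrier(mem_flags::mem_threadgroup);
        for (uint k = 0; k < TILE; ++k) { acc += ta[tid.y][k] * tb[k][tid.x]; }
        threadgroup_barrier(mem_flags::mem_threadgroup);
    }
    c[gid.y * n + gid.x] = acc;
}
"""

// Returns A * B for row-major n x n matrices, or nil if Metal setup fails.
func metalMatmulTiled(_ a: [Float], _ b: [Float], n: Int) -> [Float]? {
    let bytes = n * n * MemoryLayout<Float>.stride
    guard n > 0, n % 16 == 0, a.count == n * n, b.count == n * n,
          let device = MTLCreateSystemDefaultDevice(),
          let queue = device.makeCommandQueue(),
          let library = try? device.makeLibrary(source: kernelSource, options: nil),
          let function = library.makeFunction(name: "matmul_tiled"),
          let pipeline = try? device.makeComputePipelineState(function: function),
          let bufA = device.makeBuffer(bytes: a, length: bytes, options: .storageModeShared),
          let bufB = device.makeBuffer(bytes: b, length: bytes, options: .storageModeShared),
          let bufC = device.makeBuffer(length: bytes, options: .storageModeShared),
          let commands = queue.makeCommandBuffer(),
          let encoder = commands.makeComputeCommandEncoder() else { return nil }
    var size = UInt32(n)
    encoder.setComputePipelineState(pipeline)
    encoder.setBuffer(bufA, offset: 0, index: 0)
    encoder.setBuffer(bufB, offset: 0, index: 1)
    encoder.setBuffer(bufC, offset: 0, index: 2)
    encoder.setBytes(&size, length: MemoryLayout<UInt32>.size, index: 3)
    // One 16 x 16 threadgroup per output tile; this is the threadsPerThreadgroup knob.
    encoder.dispatchThreadgroups(MTLSize(width: n / 16, height: n / 16, depth: 1),
                                 threadsPerThreadgroup: MTLSize(width: 16, height: 16, depth: 1))
    encoder.endEncoding()
    commands.commit()
    commands.waitUntilCompleted()
    let result = bufC.contents().bindMemory(to: Float.self, capacity: n * n)
    return Array(UnsafeBufferPointer(start: result, count: n * n))
}
```

With tiling, each value loaded from device memory is reused 16 times from threadgroup memory, which is the locality gain the list above refers to.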

Final Performance Comparison

Model	Tokens/s	Training iterations/s	Training vs llm.c
llm.c	0.926	0.175	100%
Multithreaded Swift	4.356	1.014	558.5%
AMX	5.884	1.678	958.8%
Tiled Metal	11.123	5.351	3057.7%

Technical Insights and Counterpoints

The FMA Debate

A critical point raised in the community discussion concerns -ffast-math. The flag does enable FMA on the C side, but some commenters argue that it is too broad and can introduce undesirable numerical inaccuracies. The recommended alternative for those who want FMA without the risks of full fast-math is -ffp-contract=fast.

The GPU Software Moat

The difficulty in moving from "Basic Metal" to "Tiled Metal" highlights why software ecosystems like NVIDIA’s CUDA remain so dominant. Peak GPU performance is not about the hardware alone, but about having a vast library of highly tuned kernels for specific data shapes.

Conclusion

Starting from a naive implementation at 2.8 Gflop/s, the author reached 1.1 Tflop/s, a 382x increase in training throughput (from 0.014 to 5.351 iterations/s). This journey underscores that while Swift can match or exceed C’s speed, the cost is often the language’s signature elegance, as the code descends into unsafe pointers and manual memory management. For production applications, the lesson is clear: use the established frameworks (Accelerate, CoreML, MPSGraph), whose kernels have already had years of tuning.
