Llama.cpp MTP Support merged - up to 2.5x speed increase
Qwen3.6-27B-MTP-UD-Q5_K_XL on my 7900XTX goes from 32 t/s to 50-72 t/s depending on how predictable the task is. That is roughly a 1.5x increase on creative tasks and up to a 2.2x increase on math.
MTP does not change output quality; the only cost is a few hundred MB of extra VRAM. You will need to download a GGUF model with MTP support to use it.
My parameters:
```
; Context memory usage
ctx-size = 65536
ctk = q8_0
ctv = q8_0
; Prompt processing speed
batch-size = 1024
ubatch-size = 1024
; Speculative decoding
np = 1
spec-type = draft-mtp
spec-draft-n-max = 3
```
Edit: I did some more testing using Unsloth's parameters, and with `spec-draft-n-max = 6` I can get up to 82 t/s, a 2.56x increase, on the same math prompt. But this comes at the cost of the creative writing task, which now falls below 40 t/s.
It seems like this should be tweaked depending on the prompt, similar to the sampling parameters.
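
A rough way to see why a longer draft helps math but hurts creative writing is the standard back-of-the-envelope model for speculative decoding. This is only a sketch, not how llama.cpp computes anything: it assumes every drafted token is accepted independently with the same probability `accept_rate`, and that drafting one token costs a fixed fraction `draft_cost` of a full forward pass. Both values below are made-up illustrations, not measurements from my setup.

```python
def expected_tokens_per_step(accept_rate: float, draft_len: int) -> float:
    """Expected tokens produced per verification pass.

    Assumes each drafted token is accepted independently with
    probability accept_rate (a simplification).
    """
    if accept_rate >= 1.0:
        # Every draft is accepted, plus the one token from the verify pass.
        return float(draft_len + 1)
    # Geometric series: 1 + a + a^2 + ... + a^draft_len
    return (1.0 - accept_rate ** (draft_len + 1)) / (1.0 - accept_rate)


def estimated_speedup(accept_rate: float, draft_len: int,
                      draft_cost: float = 0.1) -> float:
    """Estimated speedup over plain decoding.

    draft_cost is a hypothetical relative cost of drafting one token
    (0.1 = 10% of a full forward pass); it is an assumption, not a
    measured value.
    """
    cost_per_step = 1.0 + draft_len * draft_cost
    return expected_tokens_per_step(accept_rate, draft_len) / cost_per_step


if __name__ == "__main__":
    # Illustrative acceptance rates: high for predictable output (math),
    # low for unpredictable output (creative writing).
    for label, rate in [("predictable", 0.9), ("unpredictable", 0.5)]:
        for n in (3, 6):
            print(f"{label:>13} n={n}: {estimated_speedup(rate, n):.2f}x")
```

Under these assumptions, a draft length of 6 beats 3 when most drafts are accepted, but loses when acceptance is low, because the extra drafted tokens are mostly thrown away while still costing time. That matches the direction of what I saw with `spec-draft-n-max`, though the actual numbers depend on the model and hardware.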