LIVE
Loading live headlines…
Home Trending World Technology Entertainment Gaming Sports Music Science Lifestyle Business About Contact
c/localllama by u/variety4me 1w ago

llama.cpp Multi-Model Server Architecture: ASUS Zenbook UM3504DA

14 upvotes 9 comments
## System & Software Stack
- **Hardware:** ASUS Zenbook 15 UM3504DA | AMD Ryzen 7 7735U (8C/16T) | Radeon 680M iGPU (512 MB BIOS-limited VRAM) | 32 GB LPDDR5 RAM
- **OS:** CachyOS (Arch Linux) | Wayland + Niri compositor
- **Runtime:** `llama.cpp` custom Vulkan build | `llama-server` with preset routing
- **Deployment Scope:** Single-user local inference | 2–3 year static configuration window

## Build Configuration
The binary is compiled with hardware-aware optimizations and server/tooling support. Each flag addresses a specific constraint or capability of the target platform.

```bash
cmake .. \
-DGGML_NATIVE=ON \
-DGGML_OPENMP=ON \
-DGGML_VULKAN=ON \
-DCMAKE_BUILD_TYPE=Release \
-DCMAKE_INTERPROCEDURAL_OPTIMIZATION=ON \
-DLLAMA_BUILD_SERVER=ON \
-DLLAMA_BUILD_TOOLS=ON
```

| Flag | Purpose | Measured Impact |
|------|---------|-----------------|
| `GGML_NATIVE=ON` | Enables CPU-specific ISA extensions (AVX2/AVX512) | +10–15% prompt throughput on Zen 3+ cores |
| `GGML_OPENMP=ON` | Parallelizes prompt processing across available cores | Required for batched CPU inference |
| `GGML_VULKAN=ON` | GPU acceleration backend | Mandatory for Rembrandt iGPU. ROCm unsupported. CUDA inapplicable. |
| `CMAKE_INTERPROCEDURAL_OPTIMIZATION=ON` | Link-time optimization | Reduces binary size, improves instruction cache locality |
| `DLLAMA_BUILD_SERVER=ON` | Compiles HTTP server with OpenAI-compatible API | Enables remote UI and agent routing |
| `DLLAMA_BUILD_TOOLS=ON` | Enables structured function calling | Required for agentic task execution |

## Server Launch & Routing Architecture
The server is invoked with strict resource controls to prevent memory thrashing on constrained hardware:

```bash
llama-server --port 8080 --host 0.0.0.0 \
--models-preset /mnt/data/ai/models.ini \
--models-max 1 \
--tools all
```

- `--models-max 1`: Enforces single-model residency. Prevents concurrent RAM/GTT allocation spikes.
- `--models-preset`: Loads declarative INI configuration for deterministic parameter application.
- `--tools all`: Activates full OpenAI-compatible tool/schema support for agent workflows.
- Port 8080 bound to all interfaces for integration with local UIs (OpenWebUI, Helium) and routing scripts.

## Configuration Architecture (`models.ini`)
The preset system uses a global-defaults + per-model-override structure. This eliminates runtime flag management, ensures baseline stability across all workloads, and allows precise parameter alignment per model architecture.

```ini
version = 1

[*]
; Global defaults - CPU-optimized baseline
seed = -1
top-p = 0.95
top-k = 20
min-p = 0.05
presence-penalty = 0.0
repeat-penalty = 1.1
jinja = true
batch-size = 256
ubatch-size = 256
threads = 8
threads-batch = 8
cpu-range = 0-7
cpu-strict = 1
kv-offload = false
defrag-thold = 0.1
poll = 25
poll-batch = 50
cpu-moe = true
gpu-layers = 0
ctx-size = 16384
```

Global defaults prioritize CPU affinity, strict thread binding, MoE routing on CPU, and conservative KV cache management. Per-model sections override only the parameters required for their specific workload profile.

## Per-Model Profiles & Parameter Rationale

### Quick Reasoning: `gemma-4-e4b`
```ini
[gemma-4-e4b]
model = /mnt/data/models/daily/google_gemma-4-E4B-it-Q4_K_M.gguf
temperature = 0.7
reasoning-budget = 256
gpu-layers = 32
ctx-size = 32768
```
- **Purpose:** Low-latency code completion, rapid drafting, lightweight Q&A.
- **Rationale:** 4B MoE fits entirely within GPU offload limits. Extended context (32K) enables long-file navigation. `reasoning-budget = 256` constrains chain-of-thought to prevent token waste. `temperature = 0.7` maintains creative variance for ideation tasks.

### General Purpose: `gemma-4-26b` *(Daily Driver)*
```ini
[gemma-4-26b]
model = /mnt/data/models/daily/google_gemma-4-26B-A4B-it-IQ4_NL.gguf
temperature = 0.65
repeat-penalty = 1.05
reasoning-budget = 512
batch-size = 512
ubatch-size = 512
defrag-thold = 0.05
gpu-layers = 18
```
- **Purpose:** Primary conversational, analytical, and long-form generation workload.
- **Rationale:** Heavily optimized for sustained throughput and thermal stability. Detailed parameters documented in the following section.

### Agentic Router: `qwen3.5-9b`
```ini
[qwen3.5-9b]
model = /mnt/data/models/daily/Qwen_Qwen3.5-9B-Q4_K_M.gguf
temperature = 0.65
top-k = 25
repeat-penalty = 1.05
```
- **Purpose:** Function calling, tool selection, structured API routing.
- **Rationale:** `top-k = 25` narrows sampling distribution to improve tool-call determinism. Reduced `repeat-penalty` prevents schema repetition loops. Global CPU defaults apply to minimize latency during routing.

### Complex Reasoning: `qwen3.6-35b`
```ini
[qwen3.6-35b]
model = /mnt/data/models/daily/Qwen_Qwen3.6-35B-A3B-Q3_K_M.gguf
temperature = 0.6
presence-penalty = 0.8
reasoning-budget = 256
repeat-penalty = 1.05
ctx-size = 8192
```
- **Purpose:** Deep analysis, multi-step reasoning, constrained exploration.
- **Rationale:** 35B MoE requires memory safety limits. `ctx-size = 8192` prevents GTT saturation. `presence-penalty = 0.8` forces lexical diversity during long-form generation. `reasoning-budget = 256` maintains structured output without unbounded context accumulation.

### Experimental: `lfm2-24b`
```ini
[lfm2-24b]
model = /mnt/data/models/experimental/LFM2-24B-A2B-Q4_K_M.gguf
temperature = 0.6
presence-penalty = 0.8
reasoning-budget = 256
repeat-penalty = 1.05
```
- **Purpose:** Architecture evaluation, quantization testing, parameter isolation.
- **Rationale:** Mirrors 35B safety guardrails. Kept separate from daily workflows to prevent context contamination or parameter bleed during testing.

## Primary Model Optimization: Gemma-4-26B
The 26B MoE profile represents the core optimization target. Parameter selection resulted from systematic empirical testing across offload depth, batch sizing, cache management, and thermal behavior.

| Parameter | Value | Rationale |
|-----------|-------|-----------|
| `gpu-layers` | 18 | Measured efficiency sweet spot. Beyond 18 layers, GTT usage exceeds 9.8 GB with diminishing returns (+0.15 t/s per layer). |
| `batch-size` / `ubatch-size` | 512 | Increased from global 256. Matches prompt throughput requirements without exceeding KV cache limits. |
| `defrag-thold` | 0.05 | Aggressive KV cache defragmentation prevents memory fragmentation during long sessions. |
| `threads` | 6 (override) | Reduced from global 8. Maintains baseline CPU activity to trigger firmware fan curves during GPU-heavy inference. |
| `reasoning-budget` | 512 | Enforces structured chain-of-thought. Improves cache locality and prevents context bloat. |
| `temperature` / `repeat-penalty` | 0.65 / 1.05 | Balances coherence with lexical variation. Lower repeat penalty prevents over-penalization in technical prose. |

**Measured Performance:**
- CPU-only (0 layers): 9.9 t/s generation | 0 GB GTT
- 18-layer offload: 16.9 t/s generation | 9.8 GB GTT | Stable >50 hrs
- 24-layer offload: 18.6 t/s generation | 12.6 GB GTT | Marginal stability
- Real-world 2,090-token response: 116s → 68s (40% reduction)

## Hardware Constraints & Empirical Findings
1. **VRAM Limitation:** BIOS locks dedicated VRAM to 512 MB. GPU offloading immediately utilizes GTT (system RAM mapped as VRAM). `amdgpu_top` confirms usable VRAM caps at ~450 MB.
2. **Offloading Diminishing Returns:** Layers 0–6 yield +0.73 t/s per layer. Layers 6–18 yield +0.15–0.33 t/s per layer. Layers 18–24 yield +0.15–0.28 t/s per layer with >0.5 GB GTT cost per layer. Stability degrades past 20 layers.
3. **Thermal Firmware Constraint:** ASUS fan curves respond exclusively to CPU load. GPU-only inference bypasses thermal regulation. `threads = 6` ensures consistent CPU activity to maintain airflow.
4. **Context Scaling:** Generation throughput drops ~27% at 40% context fill due to O(n) attention scanning. 16K context is the practical ceiling for 32 GB RAM with 20B+ models.
5. **Reasoning Budget as Cache Optimizer:** Enforcing explicit reasoning tokens structures KV cache layout, reduces attention fragmentation, and prevents unbounded context accumulation during long sessions.

## Deployment Parameters
This configuration is locked for sustained single-user deployment. No dynamic context routing, no concurrent model loading, no over-engineered orchestration.

**Locked Baseline:**
- Vulkan + OpenMP + Native ISA compilation
- Global CPU defaults with per-model parameter overrides
- `gemma-4-26b` at 18 layers (16.9 t/s, 9.8 GB GTT)
- `gemma-4-e4b` at 32 layers (full GPU offload)
- `ctx-size = 16384` default, model-specific reductions where required
- `threads = 6` on 26B profile for thermal regulation
- Single-model residency enforced via `--models-max 1`

The stack delivers deterministic throughput, stable memory residency, and predictable thermal behavior within hardware constraints. Configuration changes are restricted to model quantization updates or hardware replacement.
Open discussion