Downloading and running models on vLLM with OpenClaw

After installing vLLM, here is what I did.

First, download the model:

    hf download cyankiwi/gemma-4-26B-A4B-it-AWQ-4bit --local-dir models/gemma-4-26B-A4B-it-AWQ-4bit

Then check that the model files are in the directory you specified. This is the tuned setting for me:

    vllm serve models/gemma-4-26B-A4B-it-AWQ-4bit \
      --served-model-name gemma-4-26B-A4B-it-AWQ-4bit \
      --max-model-len 20480 \
      --gpu-memory-utilization 0.9 \
      --enforce-eager \
      --enable-auto-tool-choice \
      --tool-call-parser gemma4 \
      --default-chat-template-kwargs '{"enable_thinking": true}'

Why this command works:

* `--default-chat-template-kwargs`: This is the global server flag that tells vLLM to pass `enable_thinking=True` to the Gemma 4 chat template every time it prepares a prompt.
* `--enforce-eager`: Critical for your 3090; it disables CUDA graph capture, which can save you up to 2 GB of VRAM.
* `--max-model-len 20480`: Your safe upper limit.

In OpenClaw, I need to make the following changes:

| Setting | Value | Purpose |
| --- | --- | --- |
| `contextWindow` | 20480 | Tells OpenClaw exactly how many tokens the vLLM server will accept. |
| `maxTokens` | 4096 | Reserves enough space for a roughly 10-page AI response. |
| `reserveTokensFloor` | 4096 | Triggers cleanup only when the context is actually close to full. |

    "agents": {
      "defaults": {
        "compaction": {
          "mode": "default",
          "reserveTokensFloor": 4096,
          "keepRecentTokens": 2048
        }
      }
    }

I did not apply this suggested setting: `bootstrapMaxChars` = 8000 (new; prevents OpenClaw from auto-loading huge files).

Notes on these settings:

* Ensure your `contextWindow` is 20480. Don't leave it at 128k, or OpenClaw will send more tokens than the server accepts and the request will fail.
* `reserveTokensFloor: 4096` tells OpenClaw: "Don't panic and reset the chat until I have fewer than about 4,000 tokens left." Since your max output is also about 4,000 tokens, this ensures the AI always has room to answer.
* `keepRecentTokens: 2048` applies when OpenClaw "compacts" (summarizes) your history to save space: it keeps the last 2,000 or so tokens exactly as they were. This prevents the model from getting "amnesia" about what you just said 30 seconds ago.
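The relationship between these settings is simple arithmetic, so it can be sanity-checked before restarting anything. The sketch below is only an illustration: `check_budget` and `prompt_budget` are hypothetical helper names, not part of OpenClaw or vLLM, and the rules they encode are my reading of the descriptions above (compaction starts once fewer than `reserveTokensFloor` tokens remain).

```python
def check_budget(max_model_len, context_window, max_tokens,
                 reserve_tokens_floor, keep_recent_tokens):
    """Return a list of problems; an empty list means the settings agree."""
    problems = []
    # OpenClaw must not assume a larger window than vLLM was started with.
    if context_window > max_model_len:
        problems.append("contextWindow exceeds --max-model-len")
    # The reserved floor should cover a full-length reply.
    if reserve_tokens_floor < max_tokens:
        problems.append("reserveTokensFloor is smaller than maxTokens")
    # Kept history plus the reserve must leave room for anything else.
    if keep_recent_tokens + reserve_tokens_floor >= context_window:
        problems.append("keepRecentTokens + reserveTokensFloor fill the window")
    return problems


def prompt_budget(context_window, reserve_tokens_floor):
    """Tokens available for history and files before compaction starts (assumed rule)."""
    return context_window - reserve_tokens_floor


# Values from this post.
print(check_budget(20480, 20480, 4096, 4096, 2048))
print(prompt_budget(20480, 4096))
```

With the values from this post the problem list is empty and the prompt budget is 16,384 tokens. Leaving `contextWindow` at 128k (131072) would trip the first check.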
Summary:

Getting a **Gemma-4-26B** model running smoothly on a single **RTX 3090** is like fitting a grand piano into a studio apartment: it requires precise organization. Here is a summary of every major hurdle we cleared and the configuration that finally worked.

### 1. The VRAM Battle (Model Size)

* **The Problem:** You originally tried to run the **BF16** (unquantized) version. At 2 bytes per parameter, a 26B model requires **~52 GB of VRAM** just to load the weights, which is impossible for a 24 GB card.
* **The Fix:** We switched to the **AWQ 4-bit** quantized version. This shrunk the model weights to **~15.5 GB**, leaving us about **8 GB** of "breathing room" for the KV cache (memory) and the operating system.

### 2. The Architecture Conflict (FP8 Hardware)

* **The Problem:** vLLM tried to use **FP8 E4M3** (a high-speed memory format) for the KV cache. However, the RTX 3090 (Ampere) does not support this specific format; it's only for the 40-series (Ada) and H100 (Hopper). This caused the `ValueError: type fp8e4nv not supported` crash.
* **The Fix:** We reverted to the standard **BF16** cache. While it uses more memory than FP8, it is natively supported by your 3090 and avoids hardware-level errors.

### 3. The Context "Budget" (Token Math)

* **The Problem:** Gemma-4 has a massive default context window (256k tokens). vLLM tried to reserve over **5.5 GB** for this cache, which pushed the total memory usage over your 23.57 GiB limit.
* **The Fix:** We capped the context window at **20,480 tokens**.
* **The Math:** 15.5 GB (Weights) + ~6 GB (20k Cache) + 1 GB (Overhead) = **~22.5 GB**.
* This leaves about **1 GB free**, which is the perfect "safe zone" for a 3090.

### 4. The Intelligence Tweak (Thinking Mode)

* **The Problem:** Gemma-4 is a "Reasoning" model. Disabling thinking makes it significantly less capable. However, thinking tokens use up your 20k "budget."
* **The Fix:** We used `--default-chat-template-kwargs '{"enable_thinking": true}'` to keep the model smart, but we adjusted **OpenClaw** to manage the history better so the model's thinking tokens don't exhaust the 20k context and crash the session.

### 5. The OpenClaw Sync (The Handshake)

* **The Problem:** OpenClaw was trying to send prompts based on its own 128k default limit, causing vLLM to reject requests for being "too big" (e.g., the 16385 vs 16384 error).
* **The Fix:**
    * **Alignment:** We set OpenClaw's `contextWindow` to **20,480** to match vLLM.
    * **Reservation:** We set `maxTokens` to **4096** to ensure the AI always has room to reply.
    * **Compaction:** We set the `reserveTokensFloor` to **4096** so the "Panic Button" only triggers when the context is actually close to full.

---

### 🚀 Your Final "Stable" Command

```bash
vllm serve models/gemma-4-26B-A4B-it-AWQ-4bit \
  --served-model-name gemma-4-26B-A4B-it-AWQ-4bit \
  --max-model-len 20480 \
  --gpu-memory-utilization 0.9 \
  --enforce-eager \
  --enable-auto-tool-choice \
  --tool-call-parser gemma4 \
  --default-chat-template-kwargs '{"enable_thinking": true}'
```

**The result:** You now have a high-intelligence, reasoning-capable agent that can hold about **40–50 pages of text** in its head at once, running entirely on your local 24 GB hardware.

To keep your "Intelligence" high without crashing:

* Run `/context list` in Discord. It will show you exactly which files are eating your 20k budget.
* Move unnecessary files out of your main OpenClaw workspace folder. If you have a 5 MB log file or a massive PDF in there, OpenClaw is sending it every single time you type "Hi."
* Use `/status` regularly. If you see you are at 90% usage, run `/compact` manually before the bot crashes.
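The memory arithmetic from sections 1 and 3 can be reproduced in a few lines. This is a rough sketch that uses the post's own rounded figures and, like the post, treats GB and GiB loosely. `weight_gb` is a hypothetical helper, and the raw 4-bit size it gives is smaller than the ~15.5 GB quoted above, presumably because quantization metadata and some higher-precision layers are not counted.

```python
def weight_gb(params_billion, bits_per_param):
    """Raw weight size in GB (1 GB = 1e9 bytes), ignoring quantization metadata."""
    return params_billion * bits_per_param / 8


# Section 1: why BF16 cannot fit on a 24 GB card.
bf16_gb = weight_gb(26, 16)      # 2 bytes per parameter
raw_4bit_gb = weight_gb(26, 4)   # lower bound for a 4-bit quantized model

# Section 3: the budget, using the figures quoted in this post.
awq_weights_gb = 15.5   # AWQ 4-bit weights as loaded
kv_cache_gb = 6.0       # cache for a 20,480-token window
overhead_gb = 1.0       # runtime overhead
card_limit_gb = 23.57   # usable memory reported for the 3090

total_gb = awq_weights_gb + kv_cache_gb + overhead_gb
headroom_gb = card_limit_gb - total_gb

print(f"BF16 weights:  {bf16_gb:.1f} GB")
print(f"Raw 4-bit:     {raw_4bit_gb:.1f} GB")
print(f"Total planned: {total_gb:.1f} GB")
print(f"Headroom:      {headroom_gb:.2f} GB")
```

Changing `kv_cache_gb` is a quick way to see how much headroom a different `--max-model-len` might leave, as long as you substitute the cache size vLLM actually reports at startup rather than a guess.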
