Mas AI Vignettes

Posts

Showing posts from April, 2026

Special setting for SearchXNG with OpenClaw

April 18, 2026

In settings.yaml, search for "-html" and add "-json" after it.

Downloading and running models on vLLM with OpenClaw

April 18, 2026

OK after installing vLLM, here are what i did: hf download cyankiwi/gemma-4-26B-A4B-it-AWQ-4bit --local-dir models/gemma-4-26B-A4B-it-AWQ-4bit --> check the model is downloaded in the dir specified --> this is the tuned setting for me: vllm serve models/gemma-4-26B-A4B-it-AWQ-4bit --served-model-name gemma-4-26B-A4B-it-AWQ-4bit --max-model-len 20480 --gpu-memory-utilization 0.9 --enforce-eager --enable-auto-tool-choice --tool-call-parser gemma4 --default-chat-template-kwargs '{"enable_thinking": true}' Why this command works: --default-chat-template-kwargs: This is the global server flag that tells vLLM to pass enable_thinking=True to the Gemma 4 tokenizer every time it prepares a prompt. --enforce-eager: Critical for your 3090; it prevents CUDA graph overhead which can save you up to 2GB of VRAM. --max-model-len 20480: Your safe upper limit. In openclaw, i need to make the following changes: Setting,Value,Purpose contextWindow,20480,T...

vLLM install

April 06, 2026

cd ~/vLLM python3 -m venv venv source venv/bin/activate pip install torch --index-url https://download.pytorch.org/whl/cu118 pip install vllm pip install huggingface_hub huggingface-cli login python -m vllm.entrypoints.openai.api_server \ --model Qwen/Qwen2.5-9B-Instruct-AWQ \ --quantization awq \ --gpu-memory-utilization 0.9 \ --max-model-len 4096 With Comfyui Step 1 — Start vLLM with limit python -m vllm.entrypoints.openai.api_server \ --model Qwen/Qwen2.5-9B-Instruct-AWQ \ --quantization awq \ --gpu-memory-utilization 0.6 \ --max-model-len 4096 This reserves ~60% VRAM (~14GB) ComfyUI will: Use remaining VRAM (~10GB) Work fine for most workflows needed, launch ComfyUI with: export PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:512 start-vllm.sh #!/bin/bash cd ~/ai-stack/vllm source venv/bin/activate python -m vllm.entrypoints.openai.api_server \ --model Qwen/Qwen2.5-9B-Instruct-AWQ \ --quantization awq \ --gpu-memory-utilization 0.6 start-comfyui.s...