vLLM install
cd ~/vLLM
python3 -m venv venv
source venv/bin/activate
pip install torch --index-url https://download.pytorch.org/whl/cu118
pip install vllm
pip install huggingface_hub
huggingface-cli login
python -m vllm.entrypoints.openai.api_server \
--model Qwen/Qwen2.5-9B-Instruct-AWQ \
--quantization awq \
--gpu-memory-utilization 0.9 \
--max-model-len 4096
With Comfyui
Step 1 — Start vLLM with limit
python -m vllm.entrypoints.openai.api_server \
--model Qwen/Qwen2.5-9B-Instruct-AWQ \
--quantization awq \
--gpu-memory-utilization 0.6 \
--max-model-len 4096
This reserves ~60% VRAM (~14GB)
ComfyUI will:
Use remaining VRAM (~10GB)
Work fine for most workflows
needed, launch ComfyUI with:
export PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:512
start-vllm.sh
#!/bin/bash
cd ~/ai-stack/vllm
source venv/bin/activate
python -m vllm.entrypoints.openai.api_server \
--model Qwen/Qwen2.5-9B-Instruct-AWQ \
--quantization awq \
--gpu-memory-utilization 0.6
start-comfyui.sh
#!/bin/bash
cd ~/ai-stack/comfyui/ComfyUI
source ../venv/bin/activate
python main.py
Comments
Post a Comment