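Building a Stable Diffusion Workflow: Pitfalls and Lessons Learned

Disclosure: parts of this post were drafted with AI assistance, then edited, reviewed, and supplemented with personal experience by a human.

Update note: this post was last updated on 2025-08-26.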

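I have been tinkering with Stable Diffusion for more than two years, going from a single local GPU to a multi-GPU cluster and from casual generation to an industrialized workflow, and I have hit more pitfalls than I can count. This post records the whole journey from installation to production deployment.

Environment Setup

Choosing Hardware

I started on a GTX 1060 6G, which struggled even at 512x512. Upgrading to an RTX 3090 24G made a night-and-day difference.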

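| GPU | VRAM | 512x512 speed | 1024x1024 speed | LoRA training |
| --- | --- | --- | --- | --- |
| GTX 1060 6G | 6GB | 30s/it | OOM | Not supported |
| RTX 3060 12G | 12GB | 8s/it | 25s/it | |
| RTX 3090 24G | 24GB | 3s/it | 8s/it | Supported |
| RTX 4090 24G | 24GB | 1.5s/it | 4s/it | Fast |
| A100 40G | 40GB | 2s/it | 5s/it | Very fast |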

Hard-won lesson: VRAM matters more than speed. 12GB is the floor, and 24GB is where things become comfortable.

Installing the WebUI

# Clone the repository
git clone https://github.com/AUTOMATIC1111/stable-diffusion-webui.git
cd stable-diffusion-webui

# Create a conda environment
conda create -n sd python=3.10
conda activate sd

# Install dependencies
pip install -r requirements.txt

# Launch
./webui.sh --xformers --autolaunch

Pitfall 1: xformers fails to install

ERROR: Could not find a version that satisfies the requirement xformers

Fix

# Option 1: use a prebuilt package
pip install xformers --pre

# Option 2: build from source (slow but reliable)
git clone https://github.com/facebookresearch/xformers.git
cd xformers
git submodule update --init --recursive
pip install -r requirements.txt
pip install -e .

# Option 3: skip xformers and use --opt-sdp-attention instead
./webui.sh --opt-sdp-attention

Pitfall 2: CUDA out of memory while loading a model

Loading SDXL threw an error even with 24GB of VRAM. It turned out the system cache was occupying VRAM.

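# Reset the GPU to release held VRAM (only works while no process is using the GPU)
sudo nvidia-smi --gpu-reset

# Or reboot and start SD before anything else

# Tuned launch flags
./webui.sh --xformers --medvram-sdxl --no-half-vae

| Launch flag | Effect | When to use |
| --- | --- | --- |
| --xformers | Faster attention computation | Recommended |
| --medvram | Medium-VRAM optimization | 8-12GB |
| --lowvram | Low-VRAM optimization | 4-8GB |
| --medvram-sdxl | Medium-VRAM optimization for SDXL | 12-16GB |
| --no-half-vae | Runs the VAE in FP32 | Avoids black images |
| --opt-sdp-attention | Alternative to xformers | When xformers will not install |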
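The table above can be turned into a small helper that picks launch flags from the amount of VRAM. This is only a sketch: the function name `suggest_flags` is made up for illustration, and the choices at the boundaries (8GB and 12GB appear in two rows of the table) are my own assumption rather than anything the WebUI prescribes.

```python
def suggest_flags(vram_gb, sdxl=False, xformers_ok=True, no_half_vae=False):
    """Pick WebUI launch flags following the table above (hypothetical helper)."""
    # Attention backend: fall back to SDP attention when xformers is unavailable.
    flags = ["--xformers" if xformers_ok else "--opt-sdp-attention"]

    # VRAM tiers from the table; boundary handling is an assumption.
    if vram_gb < 8:
        flags.append("--lowvram")
    elif vram_gb <= 12:
        flags.append("--medvram")
    elif vram_gb <= 16 and sdxl:
        flags.append("--medvram-sdxl")

    # Opt-in workaround for black images.
    if no_half_vae:
        flags.append("--no-half-vae")
    return flags


if __name__ == "__main__":
    print("./webui.sh " + " ".join(suggest_flags(12)))
```

Treat the result as a starting point and adjust once you see real memory usage on your own card.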

Docker Deployment

Production runs on Docker.

FROM nvidia/cuda:11.8.0-cudnn8-runtime-ubuntu22.04

RUN apt-get update && apt-get install -y \
    git python3 python3-pip wget \
    && rm -rf /var/lib/apt/lists/*

WORKDIR /app

# Clone the WebUI
RUN git clone https://github.com/AUTOMATIC1111/stable-diffusion-webui.git .

# Install dependencies
RUN pip install -r requirements.txt
RUN pip install xformers

# Expose the port
EXPOSE 7860

# Launch (--api enables the /sdapi/v1 endpoints used later in this post)
CMD ["python3", "launch.py", "--xformers", "--listen", "--port", "7860", "--api"]

# Build and run
docker build -t sd-webui .
docker run -d --gpus all -p 7860:7860 \
    -v /data/models:/app/models \
    -v /data/outputs:/app/outputs \
    sd-webui

Pitfall 3: GPU unavailable inside Docker

RuntimeError: No CUDA GPUs are available

Fix

# Install nvidia-docker2
# (NVIDIA has since superseded this package with the NVIDIA Container Toolkit; check the current install guide on newer distros)
distribution=$(. /etc/os-release;echo $ID$VERSION_ID)
curl -s -L https://nvidia.github.io/nvidia-docker/gpgkey | sudo apt-key add -
curl -s -L https://nvidia.github.io/nvidia-docker/$distribution/nvidia-docker.list | sudo tee /etc/apt/sources.list.d/nvidia-docker.list

sudo apt-get update
sudo apt-get install -y nvidia-docker2
sudo systemctl restart docker

# Pass --gpus all at run time
docker run --gpus all ...

Model Management

Downloading and Organizing Models

Model files keep getting bigger, so managing them becomes a real problem.

models/
├── Stable-diffusion/          # Base checkpoints
│   ├── v1-5-pruned-emaonly.safetensors
│   ├── sd_xl_base_1.0.safetensors
│   └── realisticVisionV51_v51VAE.safetensors
├── VAE/                       # VAE models
│   ├── vae-ft-mse-840000-ema-pruned.safetensors
│   └── sdxl_vae.safetensors
├── Lora/                      # LoRA models
│   ├── detail_slider.safetensors
│   └── add_detail.safetensors
├── ControlNet/                # ControlNet models
│   ├── control_v11p_sd15_canny.pth
│   ├── control_v11p_sd15_openpose.pth
│   └── control_v11f1p_sd15_depth.pth
└── embeddings/                # Textual Inversion
    └── badhandv4.pt

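Pitfall 4: confusing model formats

.ckpt vs. .safetensors:
- .ckpt: a pickled PyTorch checkpoint that may contain malicious code
- .safetensors: a safe format that stores only tensor data

Recommendation: use .safetensors everywhere. It is safe and loads quickly.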
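To see how far a models folder is from "all safetensors", a quick scan helps. The sketch below walks a directory and lists files with pickle-style extensions. The helper name `find_pickle_models` and the extension set are my own assumptions, so extend the set to match whatever you actually download.

```python
from pathlib import Path

# Extensions usually backed by Python pickles (assumed list, not exhaustive).
PICKLE_EXTENSIONS = {".ckpt", ".pt", ".pth"}


def find_pickle_models(models_dir):
    """Return sorted relative paths of model files that are not .safetensors."""
    root = Path(models_dir)
    return sorted(
        path.relative_to(root).as_posix()
        for path in root.rglob("*")
        if path.is_file() and path.suffix.lower() in PICKLE_EXTENSIONS
    )


if __name__ == "__main__":
    for name in find_pickle_models("models"):
        print("pickle-based file:", name)
```

With the directory layout shown earlier, the ControlNet `.pth` files and the `.pt` embedding would be reported, so those are the ones to source carefully or replace with safetensors versions where they exist.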

Optimizing Model Switching

Switching models frequently costs tens of seconds of loading every time.

# Switch models through the API
import requests

def switch_model(model_name):
    """Switch the active checkpoint through the API."""
    url = "http://localhost:7860/sdapi/v1/options"
    payload = {
        "sd_model_checkpoint": model_name
    }
    response = requests.post(url, json=payload)
    return response.json()

# Switch
switch_model("realisticVisionV51_v51VAE.safetensors")

Optimization: deploy multiple instances, each pinned to a single model.

# docker-compose.yml
# Note: as far as I know the stock WebUI does not read SD_MODEL itself;
# this assumes the image's entrypoint maps it to the model to load.
version: '3'
services:
  sd-anime:
    image: sd-webui
    ports:
      - "7861:7860"
    volumes:
      - ./models:/app/models
    environment:
      - SD_MODEL=anime_model.safetensors

  sd-realistic:
    image: sd-webui
    ports:
      - "7862:7860"
    volumes:
      - ./models:/app/models
    environment:
      - SD_MODEL=realistic_model.safetensors

LoRA Training

Environment Preparation

I train LoRAs with kohya_ss, which is nicer to use than what the WebUI ships with.

git clone https://github.com/bmaltais/kohya_ss.git
cd kohya_ss

# Windows
setup.bat

# Linux
./setup.sh

# Launch the GUI
./gui.sh --listen 127.0.0.1 --server_port 7860

Pitfall 5: running out of VRAM during training

Training an SDXL LoRA on 12GB of VRAM goes straight to OOM.

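# Option 1: use 8-bit Adam
pip install bitsandbytes

# Then select in the training config:
# Optimizer: AdamW8bit

# Option 2: lower the resolution
# Dropping from 1024 to 768 saves about half the VRAM

# Option 3: gradient checkpointing
# Training runs at about half speed, VRAM usage drops by about 60%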

Preparing the Dataset

dataset/
├── 100_mystyle person/        # 100 is the repeat count, mystyle is the trigger word
│   ├── 001.jpg
│   ├── 001.txt                # Caption file
│   ├── 002.jpg
│   └── 002.txt
└── 20_regularization/         # Regularization images
    ├── reg_001.jpg
    └── reg_001.txt
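Before kicking off a long training run it is worth checking that the folder layout above is actually intact. Here is a minimal sketch that reads the repeat count from each `<repeats>_<name>` folder and reports images that have no caption file. The function name `inspect_dataset` and the list of image extensions are assumptions for illustration.

```python
import re
from pathlib import Path

IMAGE_EXTENSIONS = {".jpg", ".jpeg", ".png", ".webp"}  # assumed set
FOLDER_PATTERN = re.compile(r"^(\d+)_(.+)$")  # "<repeats>_<name>"


def inspect_dataset(dataset_dir):
    """Return (report, missing): per-folder stats and images lacking a .txt caption."""
    report = {}
    missing = []
    for folder in sorted(Path(dataset_dir).iterdir()):
        match = FOLDER_PATTERN.match(folder.name)
        if not folder.is_dir() or not match:
            continue  # ignore anything that does not follow the naming scheme
        images = sorted(
            p for p in folder.iterdir() if p.suffix.lower() in IMAGE_EXTENSIONS
        )
        for image in images:
            if not image.with_suffix(".txt").exists():
                missing.append(f"{folder.name}/{image.name}")
        report[folder.name] = {
            "repeats": int(match.group(1)),
            "images": len(images),
        }
    return report, missing


if __name__ == "__main__":
    stats, uncaptioned = inspect_dataset("dataset")
    print(stats)
    print("images without captions:", uncaptioned)
```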

Caption Format

# 001.txt
1girl, solo, long hair, blue eyes, mystyle, looking at viewer, simple background

Pitfall 6: trigger word mixed in with ordinary tags

# Wrong: trigger word buried in the middle
1girl, mystyle, solo, long hair...

# Right: trigger word first, or handled separately
mystyle, 1girl, solo, long hair...
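Fixing captions by hand gets old quickly, so here is a small sketch that moves the trigger word to the front of a comma-separated caption. The helper name `move_trigger_first` is hypothetical, and it also adds the trigger when the caption lacks it, which you may or may not want.

```python
def move_trigger_first(caption, trigger):
    """Return the caption with the trigger word as the first tag."""
    tags = [tag.strip() for tag in caption.split(",") if tag.strip()]
    # Drop every existing occurrence, then put a single copy in front.
    rest = [tag for tag in tags if tag != trigger]
    return ", ".join([trigger] + rest)


if __name__ == "__main__":
    print(move_trigger_first("1girl, mystyle, solo, long hair", "mystyle"))
```

Looping this over every `.txt` file in the dataset folder is the obvious next step, ideally after backing up the originals.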

Tuning Training Parameters

{
  "pretrained_model_name_or_path": "sd_xl_base_1.0.safetensors",
  "train_data_dir": "./dataset",
  "resolution": "1024,1024",
  "train_batch_size": 2,
  "num_train_epochs": 10,
  "learning_rate": 0.0001,
  "lr_scheduler": "cosine_with_restarts",
  "lr_warmup_steps": 100,
  "optimizer_type": "AdamW8bit",
  "network_dim": 32,
  "network_alpha": 16,
  "save_every_n_epochs": 2,
  "mixed_precision": "fp16",
  "gradient_checkpointing": true,
  "max_grad_norm": 1.0,
  "seed": 42
}

| Parameter | Meaning | Rule of thumb |
| --- | --- | --- |
| network_dim | Network dimension | 16-128, usually 32 |
| network_alpha | Scaling factor | Usually dim/2 |
| learning_rate | Learning rate | 1e-4 to 1e-3 |
| train_batch_size | Batch size | As large as VRAM allows |
| num_train_epochs | Number of epochs | 10-20 |
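Two numbers fall out of these settings and are worth computing before training: how many optimizer steps the run will take, and the effective LoRA scale (alpha divided by dim). The sketch below is back-of-the-envelope arithmetic only. It ignores regularization images and gradient accumulation, and the 20-image dataset in the usage example is a made-up figure.

```python
import math


def steps_per_epoch(folders, batch_size):
    """folders: list of (num_images, repeats) pairs, one per dataset folder."""
    samples = sum(num_images * repeats for num_images, repeats in folders)
    return math.ceil(samples / batch_size)


def total_steps(folders, batch_size, epochs):
    """Total optimizer steps for the whole run."""
    return steps_per_epoch(folders, batch_size) * epochs


def lora_scale(network_dim, network_alpha):
    """Multiplier applied to the LoRA update: alpha / dim."""
    return network_alpha / network_dim


if __name__ == "__main__":
    folders = [(20, 100)]  # hypothetical: 20 images in a 100_ folder
    print(total_steps(folders, batch_size=2, epochs=10))
    print(lora_scale(32, 16))
```

With a repeat count of 100 the step total grows fast, so it is worth running this arithmetic before committing GPU hours.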

Pitfall 7: an oversized network_dim causes overfitting

A LoRA trained with dim=128 generated poorly for any pose that was not in the training set.

Fix

{
  "network_dim": 32,
  "network_alpha": 16,
  "num_train_epochs": 15,
  "lr_scheduler": "cosine_with_restarts",
  "lr_scheduler_num_cycles": 3
}

Lower dim to 32, train for more epochs, and use a cosine scheduler with restarts.
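For intuition about what a cosine schedule with restarts does to the learning rate, here is a plain-Python sketch of the multiplier over time. It follows the hard-restart formula commonly used by training libraries (linear warmup, then several cosine decays from 1 to 0). The function name is made up, and the exact behavior of your trainer may differ in details.

```python
import math


def cosine_restarts_factor(step, total_steps, warmup_steps=100, num_cycles=3):
    """Learning-rate multiplier in [0, 1] at a given step (illustrative sketch)."""
    if step < warmup_steps:
        # Linear warmup from 0 to 1.
        return step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    if progress >= 1.0:
        return 0.0
    # Each cycle decays from 1 to 0, then jumps back to 1 (the "restart").
    cycle_position = (num_cycles * progress) % 1.0
    return max(0.0, 0.5 * (1.0 + math.cos(math.pi * cycle_position)))


if __name__ == "__main__":
    base_lr = 1e-4
    for step in range(0, 1001, 50):
        print(step, base_lr * cosine_restarts_factor(step, total_steps=1000))
```

Each restart pushes the learning rate back up, which gives the run several chances to leave a poor minimum rather than settling into the first one it finds.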

Evaluating Training Results

# Generate test images with the trained LoRA
import base64

import requests

def test_lora(prompt, lora_name, weight=0.8):
    url = "http://localhost:7860/sdapi/v1/txt2img"

    payload = {
        "prompt": f"{prompt} <lora:{lora_name}:{weight}>",
        "negative_prompt": "low quality, blurry, bad anatomy",
        "width": 1024,
        "height": 1024,
        "steps": 30,
        "cfg_scale": 7,
        "sampler_name": "DPM++ 2M Karras",
        "seed": -1
    }

    response = requests.post(url, json=payload)
    result = response.json()

    # Save the image; the weight goes into the filename so runs do not overwrite each other
    img_data = base64.b64decode(result['images'][0])
    with open(f'test_{lora_name}_{weight}.png', 'wb') as f:
        f.write(img_data)

# Try different weights
test_lora("1girl, solo, standing", "mystyle", weight=0.6)
test_lora("1girl, solo, standing", "mystyle", weight=0.8)
test_lora("1girl, solo, standing", "mystyle", weight=1.0)

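Evaluation Dimensions

| Weight | Style strength | Flexibility | Use case |
| --- | --- | --- | --- |
| 0.3-0.5 | Slight influence | | |
| 0.6-0.8 | Medium | Balanced | |
| 0.9-1.2 | Strong style | | |
| 1.0+ | Too strong | Very low | May break down |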

Using ControlNet

Installation and Configuration

# Install from the WebUI Extensions tab
# Extensions -> Install from URL -> https://github.com/Mikubill/sd-webui-controlnet

# Download the models into models/ControlNet/
# Recommended models:
# - control_v11p_sd15_canny.pth       # Edge detection
# - control_v11p_sd15_openpose.pth    # Pose
# - control_v11f1p_sd15_depth.pth     # Depth
# - control_v11p_sd15_lineart.pth     # Line art
# - control_v11p_sd15_softedge.pth    # Soft edges

Calling the API

import requests
import base64

def generate_with_controlnet(prompt, control_image, control_type="canny"):
    """Generate an image guided by ControlNet."""

    # Encode the control image
    with open(control_image, "rb") as f:
        control_b64 = base64.b64encode(f.read()).decode()

    # ControlNet settings per control type
    controlnet_configs = {
        "canny": {
            "module": "canny",
            "model": "control_v11p_sd15_canny [d14c016b]",
            "weight": 1.0,
            "processor_res": 512
        },
        "openpose": {
            "module": "openpose_full",
            "model": "control_v11p_sd15_openpose [cab727d4]",
            "weight": 1.0,
            "processor_res": 512
        },
        "depth": {
            "module": "depth_midas",
            "model": "control_v11f1p_sd15_depth [cfd03158]",
            "weight": 0.8,
            "processor_res": 512
        }
    }

    config = controlnet_configs[control_type]

    url = "http://localhost:7860/sdapi/v1/txt2img"
    payload = {
        "prompt": prompt,
        "negative_prompt": "low quality, blurry",
        "width": 512,
        "height": 512,
        "steps": 25,
        "cfg_scale": 7,
        "alwayson_scripts": {
            "ControlNet": {
                "args": [
                    {
                        "input_image": control_b64,
                        "module": config["module"],
                        "model": config["model"],
                        "weight": config["weight"],
                        "resize_mode": "Crop and Resize",
                        "lowvram": False,
                        "processor_res": config["processor_res"],
                        "threshold_a": 100,
                        "threshold_b": 200,
                        "guidance_start": 0,
                        "guidance_end": 1,
                        "control_mode": "Balanced",
                        "pixel_perfect": True
                    }
                ]
            }
        }
    }

    response = requests.post(url, json=payload)
    return response.json()

# Usage
result = generate_with_controlnet(
    "1girl, anime style, colorful",
    "pose_reference.jpg",
    control_type="openpose"
)

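Pitfall 8: ControlNet weight fights with CFG

When the ControlNet weight and CFG are both set too high, the output comes out stiff.

Fix

| ControlNet type | Recommended weight | Recommended CFG |
| --- | --- | --- |
| Canny | 0.8-1.0 | 5-7 |
| OpenPose | 0.8-1.0 | 5-7 |
| Depth | 0.6-0.8 | 7-9 |
| LineArt | 0.8-1.0 | 5-7 |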

Combining Multiple ControlNets

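# Use OpenPose + Canny together
# pose_b64 and canny_b64 are base64-encoded control images,
# prepared the same way as control_b64 in the API example
payload = {
    "prompt": "1girl, anime style",
    "alwayson_scripts": {
        "ControlNet": {
            "args": [
                {  # First unit: OpenPose controls the pose
                    "input_image": pose_b64,
                    "module": "openpose_full",
                    "model": "control_v11p_sd15_openpose",
                    "weight": 1.0,
                },
                {  # Second unit: Canny controls the outline
                    "input_image": canny_b64,
                    "module": "canny",
                    "model": "control_v11p_sd15_canny",
                    "weight": 0.6,  # Lower weight to avoid conflicts
                }
            ]
        }
    }
}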

Pitfall 9: multiple ControlNets blow up VRAM

Running two ControlNets at the same time doubled VRAM usage.

Fix

# Tuned launch flags
./webui.sh --xformers --medvram --no-half-vae

# Or generate at a lower resolution and upscale afterwards
# Generate at 512x512 first, then upscale to 1024x1024 with Hires. fix
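The "generate small, then upscale" route can also be driven through the API. The sketch below only builds a txt2img payload with Hires. fix enabled. The field names (`enable_hr`, `hr_scale`, `hr_upscaler`, `denoising_strength`) are what I understand recent WebUI versions to expose, so check the `/docs` page of your own instance. The helper name and the default values are assumptions.

```python
def hires_payload(prompt, base=512, target=1024, upscaler="Latent",
                  denoising_strength=0.5):
    """Build a txt2img payload: render at `base`, upscale to `target` (sketch)."""
    return {
        "prompt": prompt,
        "width": base,
        "height": base,
        "steps": 25,
        "cfg_scale": 7,
        # Hires. fix settings: the second pass upscales by hr_scale.
        "enable_hr": True,
        "hr_scale": target / base,
        "hr_upscaler": upscaler,
        "denoising_strength": denoising_strength,
    }


if __name__ == "__main__":
    print(hires_payload("1girl, anime style"))
```

The ControlNet `alwayson_scripts` block from the earlier examples can be merged into this dictionary before posting it to `/sdapi/v1/txt2img`.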

Optimizing Batch Generation

Async Queue

Production needs batch generation, and submitting images one at a time is not an option.

import asyncio
import aiohttp
import json
import base64
from typing import List, Dict
import threading

class SDQueue:
    """Async generation queue for Stable Diffusion."""

    def __init__(self, base_url="http://localhost:7860", max_concurrent=2):
        self.base_url = base_url
        self.max_concurrent = max_concurrent
        self.semaphore = asyncio.Semaphore(max_concurrent)
        self.session = None

    async def __aenter__(self):
        self.session = aiohttp.ClientSession()
        return self

    async def __aexit__(self, exc_type, exc_val, exc_tb):
        await self.session.close()

    async def generate_single(self, prompt: str, **kwargs) -> Dict:
        """Generate a single image."""
        # The semaphore caps how many requests are in flight at once
        async with self.semaphore:
            url = f"{self.base_url}/sdapi/v1/txt2img"
            payload = {
                "prompt": prompt,
                "negative_prompt": kwargs.get("negative", ""),
                "width": kwargs.get("width", 512),
                "height": kwargs.get("height", 512),
                "steps": kwargs.get("steps", 25),
                "cfg_scale": kwargs.get("cfg", 7),
                "sampler_name": kwargs.get("sampler", "DPM++ 2M Karras"),
                "seed": kwargs.get("seed", -1),
                "batch_size": kwargs.get("batch_size", 1),
                "n_iter": kwargs.get("n_iter", 1),
            }

            async with self.session.post(url, json=payload) as resp:
                resp.raise_for_status()
                return await resp.json()

    async def generate_batch(self, prompts: List[str], **kwargs) -> List[Dict]:
        """Generate a batch of images."""
        tasks = [self.generate_single(p, **kwargs) for p in prompts]
        return await asyncio.gather(*tasks)

# Usage
async def main():
    prompts = [
        "1girl, anime style, red hair, blue eyes",
        "1boy, anime style, black hair, green eyes",
        "1girl, anime style, blonde hair, purple eyes",
        "1boy, anime style, white hair, red eyes",
    ]

    async with SDQueue(max_concurrent=2) as sd_queue:
        results = await sd_queue.generate_batch(prompts, width=512, height=512)

    for i, result in enumerate(results):
        img_data = base64.b64decode(result['images'][0])
        with open(f'output_{i}.png', 'wb') as f:
            f.write(img_data)

asyncio.run(main())
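To convince yourself that the semaphore really caps concurrency, you can swap the HTTP call for a short sleep and record the peak number of jobs in flight. This is a standalone sketch using only the standard library. `run_limited` and `fake_generate` are made-up names, and the sleep stands in for a real generation request.

```python
import asyncio


async def run_limited(num_jobs, max_concurrent):
    """Run fake jobs under a semaphore and return (results, peak concurrency)."""
    semaphore = asyncio.Semaphore(max_concurrent)
    in_flight = 0
    peak = 0

    async def fake_generate(job_id):
        nonlocal in_flight, peak
        async with semaphore:
            in_flight += 1
            peak = max(peak, in_flight)
            await asyncio.sleep(0.01)  # stand-in for the txt2img request
            in_flight -= 1
            return job_id

    # gather() returns results in submission order.
    results = await asyncio.gather(*(fake_generate(i) for i in range(num_jobs)))
    return results, peak


if __name__ == "__main__":
    results, peak = asyncio.run(run_limited(num_jobs=6, max_concurrent=2))
    print(results, "peak concurrency:", peak)
```

The same measurement is handy when tuning `max_concurrent` against a real instance: raise it until VRAM or latency becomes the limit, then back off.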

Load Balancing Across Instances

A single GPU is too slow, so run several in parallel.

import random
import threading
from typing import Dict, List

class SDLoadBalancer:
    """Load balancer across multiple SD instances."""

    def __init__(self, instances: List[str]):
        """
        instances: ["http://localhost:7861", "http://localhost:7862", ...]
        """
        self.instances = instances
        self.current = 0
        self.lock = threading.Lock()

    def get_instance(self) -> str:
        """Pick an instance by round robin."""
        with self.lock:
            instance = self.instances[self.current]
            self.current = (self.current + 1) % len(self.instances)
            return instance

    def get_instance_random(self) -> str:
        """Pick an instance at random."""
        return random.choice(self.instances)

    def get_instance_least_busy(self, busy_counts: Dict[str, int]) -> str:
        """Pick the least busy instance."""
        return min(self.instances, key=lambda x: busy_counts.get(x, 0))

# Usage
lb = SDLoadBalancer([
    "http://sd-1:7860",
    "http://sd-2:7860",
    "http://sd-3:7860",
])

instance_url = lb.get_instance()
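The least-busy strategy needs someone to maintain the busy counts. One way to do that, sketched below with standard-library pieces only, is a context manager that increments a counter when a request starts and decrements it when the request finishes, even on errors. The class name `InFlightTracker` is hypothetical and is not part of the balancer above.

```python
import threading
from contextlib import contextmanager


class InFlightTracker:
    """Track in-flight requests per instance and hand out the least busy one."""

    def __init__(self, instances):
        self._counts = {url: 0 for url in instances}
        self._lock = threading.Lock()

    @contextmanager
    def acquire(self):
        # Choose and reserve the least busy instance atomically.
        with self._lock:
            url = min(self._counts, key=self._counts.get)
            self._counts[url] += 1
        try:
            yield url
        finally:
            # Always release, even if the request raised.
            with self._lock:
                self._counts[url] -= 1

    def snapshot(self):
        """Return a copy of the current in-flight counts."""
        with self._lock:
            return dict(self._counts)


if __name__ == "__main__":
    tracker = InFlightTracker(["http://sd-1:7860", "http://sd-2:7860"])
    with tracker.acquire() as url:
        print("sending request to", url, tracker.snapshot())
```

Choosing and incrementing under the same lock matters: otherwise two threads can both see the same instance as idle and pile onto it.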

Generation Parameter Templates

# Predefined templates for commonly used parameters
from typing import Dict

GENERATION_PRESETS = {
    "anime_portrait": {
        "prompt_prefix": "masterpiece, best quality, 1girl, solo, ",
        "prompt_suffix": ", looking at viewer, simple background",
        "negative": "lowres, bad anatomy, bad hands, text, error, missing fingers, extra digit, fewer digits, cropped, worst quality, low quality, normal quality, jpeg artifacts, signature, watermark, username, blurry",
        "width": 512,
        "height": 768,
        "steps": 28,
        "cfg": 7,
        "sampler": "DPM++ 2M Karras",
    },
    "realistic_photo": {
        "prompt_prefix": "masterpiece, best quality, realistic, photo, ",
        "prompt_suffix": ", detailed skin, professional lighting",
        "negative": "lowres, bad anatomy, bad hands, text, error, missing fingers, extra digit, fewer digits, cropped, worst quality, low quality, normal quality, jpeg artifacts, signature, watermark, username, blurry, anime, cartoon, 3d",
        "width": 512,
        "height": 768,
        "steps": 30,
        "cfg": 7,
        "sampler": "DPM++ 2M Karras",
    },
    "icon_design": {
        "prompt_prefix": "app icon, flat design, ",
        "prompt_suffix": ", clean, minimal, vector style",
        "negative": "lowres, blurry, text, watermark, realistic, photo, 3d render",
        "width": 512,
        "height": 512,
        "steps": 25,
        "cfg": 8,
        "sampler": "Euler a",
    }
}

def apply_preset(base_prompt: str, preset_name: str) -> Dict:
    """Apply a preset to a base prompt."""
    preset = GENERATION_PRESETS[preset_name]
    return {
        "prompt": f"{preset['prompt_prefix']}{base_prompt}{preset['prompt_suffix']}",
        "negative_prompt": preset["negative"],
        "width": preset["width"],
        "height": preset["height"],
        "steps": preset["steps"],
        "cfg_scale": preset["cfg"],
        "sampler_name": preset["sampler"],
    }

# Usage
params = apply_preset("cute girl with cat ears", "anime_portrait")

Production Deployment

Wrapping the API as a Service

import json
from typing import Optional

import requests
from fastapi import FastAPI, HTTPException
from pydantic import BaseModel

app = FastAPI(title="SD Generation Service")

class GenerateRequest(BaseModel):
    prompt: str
    negative_prompt: str = ""
    width: int = 512
    height: int = 512
    steps: int = 25
    cfg_scale: float = 7.0
    seed: int = -1
    lora: Optional[str] = None
    lora_weight: float = 0.8
    controlnet_image: Optional[str] = None  # base64
    controlnet_type: Optional[str] = None

class GenerateResponse(BaseModel):
    image: str  # base64
    seed: int
    info: dict

SD_API_URL = "http://localhost:7860/sdapi/v1"

# Plain "def" handlers: FastAPI runs them in a worker thread,
# so the blocking requests calls do not stall the event loop.
@app.post("/generate", response_model=GenerateResponse)
def generate(req: GenerateRequest):
    """Text-to-image endpoint."""
    try:
        # Build the prompt
        prompt = req.prompt
        if req.lora:
            prompt = f"{prompt} <lora:{req.lora}:{req.lora_weight}>"

        payload = {
            "prompt": prompt,
            "negative_prompt": req.negative_prompt,
            "width": req.width,
            "height": req.height,
            "steps": req.steps,
            "cfg_scale": req.cfg_scale,
            "seed": req.seed,
            "sampler_name": "DPM++ 2M Karras",
        }

        # Add ControlNet
        if req.controlnet_image and req.controlnet_type:
            payload["alwayson_scripts"] = {
                "ControlNet": {
                    "args": [{
                        "input_image": req.controlnet_image,
                        "module": req.controlnet_type,
                        # Note: this naming does not fit every model,
                        # e.g. depth is control_v11f1p_sd15_depth
                        "model": f"control_v11p_sd15_{req.controlnet_type}",
                        "weight": 1.0,
                    }]
                }
            }

        # Call the SD API
        resp = requests.post(f"{SD_API_URL}/txt2img", json=payload, timeout=120)
        resp.raise_for_status()
        result = resp.json()

        # The WebUI returns "info" as a JSON-encoded string; the seed lives inside it
        info = json.loads(result.get("info") or "{}")

        return GenerateResponse(
            image=result["images"][0],
            seed=info.get("seed", -1),
            info=info,
        )

    except Exception as e:
        raise HTTPException(status_code=500, detail=str(e))

@app.get("/models")
def list_models():
    """List available models."""
    resp = requests.get(f"{SD_API_URL}/sd-models", timeout=30)
    return resp.json()

@app.get("/loras")
def list_loras():
    """List available LoRAs."""
    resp = requests.get(f"{SD_API_URL}/loras", timeout=30)
    return resp.json()

if __name__ == "__main__":
    import uvicorn
    uvicorn.run(app, host="0.0.0.0", port=8000)
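A service like this accepts whatever width, height, and step count the caller sends, and one oversized request is enough to OOM the GPU. Below is a minimal sketch of a guard that could run before the payload is built. The limits (`max_pixels`, `max_steps`) and the function name are assumptions to adapt. Rounding sizes down to a multiple of 8 reflects the 8x downscaling of the latent space.

```python
def normalize_request(width, height, steps,
                      max_pixels=1024 * 1024, max_steps=50):
    """Return a safe (width, height, steps) tuple or raise ValueError (sketch)."""
    if width < 64 or height < 64 or steps < 1:
        raise ValueError("width/height must be >= 64 and steps >= 1")

    # Round down to a multiple of 8 to match the latent grid.
    width -= width % 8
    height -= height % 8

    # Reject requests that exceed the pixel budget instead of silently shrinking them.
    if width * height > max_pixels:
        raise ValueError(f"{width}x{height} exceeds the pixel budget of {max_pixels}")

    return width, height, min(steps, max_steps)


if __name__ == "__main__":
    print(normalize_request(515, 770, 80))
```

Inside the endpoint, a `ValueError` from this helper would map more naturally to a 400 response than to the blanket 500.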

Cache Optimization

import hashlib
import json
from functools import wraps

import redis

redis_client = redis.Redis(host='localhost', port=6379, db=0)

def cache_result(ttl=3600):
    """Cache generation results."""
    def decorator(func):
        @wraps(func)
        async def wrapper(*args, **kwargs):
            # Build the cache key
            cache_key = hashlib.md5(
                f"{func.__name__}:{str(args)}:{str(kwargs)}".encode()
            ).hexdigest()

            # Look up the cache
            cached = redis_client.get(cache_key)
            if cached:
                return json.loads(cached)

            # Run the generation
            result = await func(*args, **kwargs)

            # Store in the cache
            redis_client.setex(cache_key, ttl, json.dumps(result))

            return result
        return wrapper
    return decorator

# Usage
@cache_result(ttl=86400)  # Cache for one day
async def generate_with_cache(prompt, **kwargs):
    # Actual generation logic goes here
    pass
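One thing to watch with a cache like this: a key built from `str(kwargs)` depends on argument order, and a request with seed -1 asks for a random image, so serving it from cache returns the same picture every time. Here is a sketch of a more deterministic key that also opts out for random seeds. The function name and the `sd:` prefix are made up for illustration.

```python
import hashlib
import json


def cache_key(params):
    """Return a deterministic cache key, or None when the request is not cacheable."""
    # seed == -1 (or a missing seed) means "random": never cache those.
    if params.get("seed", -1) == -1:
        return None
    # sort_keys makes the key independent of dictionary ordering.
    canonical = json.dumps(params, sort_keys=True, separators=(",", ":"))
    return "sd:" + hashlib.sha256(canonical.encode("utf-8")).hexdigest()


if __name__ == "__main__":
    print(cache_key({"prompt": "1girl, anime style", "seed": 42, "steps": 25}))
    print(cache_key({"prompt": "1girl, anime style"}))
```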

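Summary

After two years of tinkering with Stable Diffusion, these are the core lessons:

  1. VRAM is the hard threshold: 12GB is workable, 24GB is comfortable, and 40GB or more is what it takes to run at production scale
  2. Keep model management disciplined: safetensors format, directory structure, version control
  3. LoRA training comes down to tuning: dim, lr, and data quality, and none of the three can be skipped
  4. ControlNet is a fantastic tool, but the weights need tuning and stacking multiple ControlNets means economizing on VRAM
  5. Batch generation has to be asynchronous: queue + multiple instances + caching is what supports production

Where I hit the most pitfalls:

  • xformers would not install, and generation ran 3x slower
  • Wrong model format, so the model either failed to load or was unsafe
  • LoRA overfitting, so nothing beyond the training poses worked
  • ControlNet weight set too high, so the output came out stiff
  • Batch generation without a concurrency limit, which ended in VRAM OOM

Building an SD workflow is a process of continuous optimization, and there is no once-and-for-all solution. Only by adjusting to your business needs over time will you find the workflow that fits you best.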