Linux 服务器流量暴增排查实战：iftop 监控、数据埋点优化与缓存策略

2018-10-21

更新说明：内容已通过 AI 辅助优化，确保符合 2026 年最新 AdSense 内容政策。所有技术信息均经过验证，代码示例可安全使用。

AI 声明：本文部分内容使用人工智能技术辅助创作，经人工审核编辑后发布。

去年遇到个头疼的问题：服务器带宽突然被打满，服务响应慢到没法用。这篇记录一下排查过程和解决方案。

问题场景分析

流量暴增的影响

┌─────────────────────────────────────────────────────────────────────┐
│              流量暴增对系统的影响                                      │
├─────────────────────────────────────────────────────────────────────┤
│
│  正常状态：
│  用户请求 ──► 业务服务器 ──► 数据服务器 ──► 数据库
│     │              │              │              │
│     │              │              │              │
│   带宽占用 10%   处理正常       响应及时       查询正常
│
│
│  异常状态（流量暴增 200M/s）：
│  用户请求 ──► 业务服务器 ───┐
│     │              │        │
│     │              │        ▼
│   访问缓慢      处理阻塞 ◄── 带宽占满（200M/s）
│   响应超时         │        ▲
│   用户体验差       └───────┘
│                      数据查询返回大量无用数据
│

常见流量暴增原因

原因类型	具体表现	排查方法
数据查询异常	返回大量冗余数据	分析 SQL/接口调用
埋点数据上报	全量用户数据未过滤	检查埋点代码逻辑
缓存失效	大量请求穿透到数据库	检查缓存命中率
DDoS 攻击	外部恶意流量	分析来源 IP
定时任务	全表扫描或批量导出	检查定时任务调度
配置错误	日志级别过高或循环请求	检查应用配置

流量监控工具使用

iftop 实时流量监控

iftop 是 Linux 下常用的实时流量监控工具，可以显示每个连接的带宽使用情况。

# 安装 iftop（CentOS/RHEL）
sudo yum install iftop -y

# 安装 iftop（Ubuntu/Debian）
sudo apt-get install iftop -y

# 基本使用
sudo iftop -i eth0

# 常用参数说明
iftop -i eth0      # 监控 eth0 网卡
iftop -n           # 禁用 DNS 解析，显示 IP 地址
iftop -N           # 显示端口号而非服务名称
iftop -B           # 以 Bytes 为单位显示（默认是 bits）
iftop -F 192.168.1.0/24   # 只显示特定网段的流量

iftop 界面解读

┌─────────────────────────────────────────────────────────────────────┐
│                          iftop 界面说明                               │
├─────────────────────────────────────────────────────────────────────┤
│
│  192.168.1.100 => 192.168.1.1    150Mb  120Mb  180Mb
│                <=                20Mb   15Mb   25Mb
│  192.168.1.100 => 10.0.0.5       50Mb   45Mb   55Mb
│                <=                5Mb    4Mb    6Mb
│
│  解释：
│  => 表示发送流量（本机向外发送）
│  <= 表示接收流量（本机接收的数据）
│  三列数字分别是：2秒平均 / 10秒平均 / 40秒平均
│
│  底部统计：
│  TX:  200Mb/s  累计发送流量
│  RX:   30Mb/s  累计接收流量
│  TOTAL: 230Mb/s  总流量
│

其他流量监控工具

# nethogs - 按进程查看流量
sudo nethogs eth0

# vnstat - 历史流量统计
vnstat -i eth0
vnstat -h    # 查看小时统计
vnstat -d    # 查看日统计

# tcpdump - 抓包分析
sudo tcpdump -i eth0 -w capture.pcap

# ss/netstat - 查看连接状态
ss -tan | awk '{print $1}' | sort | uniq -c
netstat -anp | grep ESTABLISHED

# 查看特定端口的流量
tcpdump -i eth0 port 3306

紧急处理方案

方案一：分离网络（快速治标）

当业务受到严重影响时，最快的解决方案是分离业务流量和数据同步流量。

┌─────────────────────────────────────────────────────────────────────┐
│              双网卡分离方案                                          │
├─────────────────────────────────────────────────────────────────────┤
│
│  改造前（单网卡）：
│
│  ┌──────────────┐      eth0 (200M/s)      ┌──────────────┐
│  │   业务服务器  │◄───────────────────────►│   数据服务器  │
│  └──────────────┘   带宽占满，业务受影响    └──────────────┘
│
│
│  改造后（双网卡）：
│
│  ┌──────────────┐                    ┌──────────────┐
│  │   业务服务器  │──eth0(外网)────────►│   数据服务器  │  业务流量
│  │              │    正常访问          │              │  不受影响
│  │              │──eth1(内网)────────►│              │  数据同步
│  └──────────────┘   192.168.x.x       └──────────────┘  走内网
│
│  优点：快速解决问题，业务不受影响
│  缺点：治标不治本，需要后续优化数据查询逻辑
│

配置步骤：

# 1. 添加第二块网卡（通过云平台或物理添加）

# 2. 配置内网 IP
# /etc/sysconfig/network-scripts/ifcfg-eth1 (CentOS)
cat > /etc/sysconfig/network-scripts/ifcfg-eth1 << 'EOF'
DEVICE=eth1
TYPE=Ethernet
ONBOOT=yes
BOOTPROTO=static
IPADDR=192.168.10.10
NETMASK=255.255.255.0
EOF

# 3. 重启网络
systemctl restart network

# 4. 修改应用配置，数据服务器走内网 IP
# 配置示例：
# DATA_SERVER_HOST=192.168.10.20

方案二：限流控制

# 使用 tc (Traffic Control) 进行流量控制

# 在 eth0 上添加根队列
tc qdisc add dev eth0 root handle 1: htb default 12

# 创建一个类，限制带宽为 100Mbit
tc class add dev eth0 parent 1: classid 1:1 htb rate 100mbit burst 15k

# 查看 tc 配置
tc qdisc show dev eth0
tc class show dev eth0

# 删除 tc 规则
tc qdisc del dev eth0 root

根本原因分析与优化

问题定位：数据埋点导致的流量暴增

在我们这个案例中，流量暴增的原因是：

┌─────────────────────────────────────────────────────────────────────┐
│              数据埋点问题分析                                         │
├─────────────────────────────────────────────────────────────────────┤
│
│  问题代码示例：
│
│  // 用户行为埋点
│  function trackUserBehavior(userId) {
│      // 查询用户数据
│      const userData = db.users.find({ id: userId });
│
│      // 问题：返回了完整的用户对象，包含所有字段！
│      // {
│      //   id: 12345,
│      //   name: "张三",
│      //   email: "zhangsan@example.com",
│      //   avatar: "...",  // 可能包含 Base64 图片数据
│      //   settings: {...},  // 大量配置数据
│      //   history: [...],   // 用户完整历史记录
│      //   logs: [...],      // 日志数据
│      //   ... 等等
│      // }
│
│      // 上报数据（传输了大量无用数据）
│      analytics.report(userData);
│  }
│
│  影响：
│  - 用户量大时，每次埋点传输数据量巨大
│  - 网络带宽被占满（200M/s）
│  - 业务服务器访问缓慢
│

优化方案一：查询字段精简

// 优化前：返回所有字段
const userData = await db.users.find({ id: userId });

// 优化后：只返回需要的字段
const userData = await db.users.find(
    { id: userId },
    {
        projection: {
            id: 1,
            name: 1,
            level: 1,
            region: 1
            // 明确排除大字段
            // avatar: 0,
            // history: 0,
            // logs: 0
        }
    }
);

SQL 版本：

-- 优化前：SELECT * 返回所有列
SELECT * FROM users WHERE id = ?;

-- 优化后：只查询需要的列
SELECT id, name, level, region FROM users WHERE id = ?;

优化方案二：业务层缓存

// UserCache.js - 用户数据缓存层
class UserCache {
    constructor() {
        // 使用 LRU 缓存，限制条目数
        this.cache = new Map();
        this.maxSize = 10000;
        this.ttl = 5 * 60 * 1000; // 5 分钟过期
    }

    // 获取用户数据
    async get(userId) {
        const key = `user:${userId}`;
        const cached = this.cache.get(key);

        if (cached && Date.now() - cached.timestamp < this.ttl) {
            return cached.data;
        }

        // 缓存未命中，从数据库查询
        const data = await this.fetchFromDB(userId);

        // 写入缓存
        this.set(key, data);

        return data;
    }

    set(key, data) {
        // LRU 淘汰策略
        if (this.cache.size >= this.maxSize) {
            const firstKey = this.cache.keys().next().value;
            this.cache.delete(firstKey);
        }

        this.cache.set(key, {
            data: data,
            timestamp: Date.now()
        });
    }

    async fetchFromDB(userId) {
        // 只查询必要字段
        return await db.users.find(
            { id: userId },
            { projection: { id: 1, name: 1, level: 1, region: 1 } }
        );
    }

    // 缓存穿透防护：缓存空结果
    async getWithNullCache(userId) {
        const key = `user:${userId}`;
        const cached = this.cache.get(key);

        if (cached) {
            if (cached.isNull) return null;
            if (Date.now() - cached.timestamp < this.ttl) {
                return cached.data;
            }
        }

        const data = await this.fetchFromDB(userId);

        if (data) {
            this.set(key, data);
        } else {
            // 缓存空结果，防止缓存穿透
            this.cache.set(key, {
                isNull: true,
                timestamp: Date.now()
            });
        }

        return data;
    }
}

// 使用 Redis 作为分布式缓存
class RedisUserCache {
    async get(userId) {
        const key = `user:${userId}`;
        let data = await redis.get(key);

        if (data) {
            return JSON.parse(data);
        }

        // 缓存未命中，加锁防止缓存击穿
        const lockKey = `lock:${key}`;
        const lock = await redis.set(lockKey, '1', 'NX', 'EX', 10);

        if (lock) {
            try {
                data = await this.fetchFromDB(userId);
                if (data) {
                    await redis.setex(key, 300, JSON.stringify(data)); // 5分钟过期
                } else {
                    // 空值缓存，防止穿透
                    await redis.setex(key, 60, 'null');
                }
            } finally {
                await redis.del(lockKey);
            }
        } else {
            // 其他进程正在加载，等待后重试
            await new Promise(resolve => setTimeout(resolve, 100));
            return this.get(userId);
        }

        return data;
    }
}

优化方案三：埋点服务分离

┌─────────────────────────────────────────────────────────────────────┐
│              埋点服务分离架构                                         │
├─────────────────────────────────────────────────────────────────────┤
│
│  改造前：
│
│  ┌──────────┐      查询用户数据        ┌──────────┐
│  │ 业务服A  │─────────────────────────►│ 数据服   │
│  └──────────┘                          └──────────┘
│       │                                       ▲
│       │ 上报埋点（大量数据）                    │
│       ▼                                       │
│  ┌──────────┐      查询用户数据        ┌──────────┐
│  │ 业务服B  │─────────────────────────►│ 分析服   │
│  └──────────┘                          └──────────┘
│       │                                       │
│       └───────────────────────────────────────┘
│              争用带宽，影响业务
│
│  改造后：
│
│  ┌──────────┐      查询必要数据        ┌──────────┐
│  │ 业务服A  │─────────────────────────►│ 业务数据服│ 核心业务
│  └──────────┘                          └──────────┘
│       │
│       │ 上报精简埋点
│       ▼
│  ┌──────────┐                        ┌──────────┐
│  │ 埋点服务  │───────────────────────►│ 分析数据服│ 独立部署
│  └──────────┘                        └──────────┘
│       ▲
│       │ 上报精简埋点
│  ┌──────────┐
│  │ 业务服B  │──────► 业务数据服
│  └──────────┘
│
│  优点：
│  - 业务查询和埋点查询分离，互不影响
│  - 埋点服务可独立扩展
│  - 可以单独优化埋点数据查询逻辑
│

优化方案四：数据压缩传输

// 使用压缩减少传输数据量
const zlib = require('zlib');
const util = require('util');
const gzip = util.promisify(zlib.gzip);
const gunzip = util.promisify(zlib.gunzip);

class CompressedDataClient {
    async send(data) {
        // 序列化
        const jsonStr = JSON.stringify(data);

        // 压缩
        const compressed = await gzip(Buffer.from(jsonStr));

        // 发送压缩后的数据
        return await this.httpPost('/api/track', compressed, {
            headers: {
                'Content-Encoding': 'gzip',
                'Content-Type': 'application/json'
            }
        });
    }

    async receive(compressedData) {
        // 解压
        const decompressed = await gunzip(compressedData);

        // 解析
        return JSON.parse(decompressed.toString());
    }
}

// 使用 MessagePack 替代 JSON（更高效的序列化）
const msgpack = require('msgpack-lite');

class MessagePackClient {
    encode(data) {
        return msgpack.encode(data);
    }

    decode(buffer) {
        return msgpack.decode(buffer);
    }
}

监控告警体系建设

流量监控脚本

#!/bin/bash
# network_monitor.sh - 网络流量监控告警脚本

THRESHOLD=100  # 告警阈值 MB/s
INTERFACE="eth0"
WEBHOOK_URL="https://hooks.slack.com/services/xxx"

# 获取当前流量
get_traffic() {
    cat /proc/net/dev | grep $INTERFACE | awk '{print $2,$10}'
}

# 计算速率
monitor() {
    read rx1 tx1 <<< $(get_traffic)
    sleep 1
    read rx2 tx2 <<< $(get_traffic)

    rx_speed=$((($rx2 - $rx1) / 1024 / 1024))
    tx_speed=$((($tx2 - $tx1) / 1024 / 1024))

    echo "$(date): RX: ${rx_speed}MB/s TX: ${tx_speed}MB/s"

    if [ $rx_speed -gt $THRESHOLD ] || [ $tx_speed -gt $THRESHOLD ]; then
        send_alert $rx_speed $tx_speed
    fi
}

# 发送告警
send_alert() {
    rx=$1
    tx=$2

    message=" 服务器流量告警\n主机: $(hostname)\n接口: $INTERFACE\n接收: ${rx}MB/s\n发送: ${tx}MB/s\n时间: $(date)"

    curl -X POST -H 'Content-type: application/json' \
        --data "{\"text\":\"$message\"}" \
        $WEBHOOK_URL
}

# 主循环
while true; do
    monitor
    sleep 60
done

Prometheus + Grafana 监控

# docker-compose.yml
version: '3'
services:
  prometheus:
    image: prom/prometheus
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml
    ports:
      - "9090:9090"

  grafana:
    image: grafana/grafana
    ports:
      - "3000:3000"
    environment:
      - GF_SECURITY_ADMIN_PASSWORD=admin

# prometheus.yml
global:
  scrape_interval: 15s

scrape_configs:
  - job_name: 'node-exporter'
    static_configs:
      - targets: ['localhost:9100']

Node Exporter 网络指标：

# 网络接收速率
rate(node_network_receive_bytes_total{device="eth0"}[5m])

# 网络发送速率
rate(node_network_transmit_bytes_total{device="eth0"}[5m])

# 告警规则
groups:
  - name: network_alerts
    rules:
      - alert: HighNetworkTraffic
        expr: rate(node_network_receive_bytes_total[5m]) > 104857600
        for: 5m
        annotations:
          summary: "服务器流量过高"

预防措施与最佳实践

代码 Review 检查清单

检查项	风险等级	说明
是否使用 SELECT *	高	必须明确指定需要的字段
是否返回完整对象	高	嵌套对象可能包含大量数据
是否有分页限制	高	查询必须带 LIMIT
是否启用压缩	中	大数据传输必须压缩
是否有缓存策略	中	热点数据必须缓存
埋点是否精简	高	只上报必要字段

容量规划建议

┌─────────────────────────────────────────────────────────────────────┐
│              服务器容量规划                                          │
├─────────────────────────────────────────────────────────────────────┤
│
│  带宽计算公式：
│
│  所需带宽 = 峰值用户数 × 每用户数据量 × 请求频率
│
│  示例：
│  - 峰值用户: 10,000 并发
│  - 每用户数据: 100 KB
│  - 请求频率: 每 10 秒 1 次
│
│  所需带宽 = 10000 × 100KB × 0.1/s = 100MB/s
│
│  建议配置：
│  - 实际带宽应为计算值的 1.5-2 倍
│  - 配置 200MB/s 带宽
│  - 监控阈值设为 150MB/s
│

定期巡检脚本

#!/bin/bash
# daily_check.sh - 每日系统巡检

echo "========== $(date) 系统巡检报告 =========="

# 1. 流量检查
echo "【网络流量】"
iftop -t -s 5 -i eth0 2>/dev/null | grep "Total send rate"

# 2. 连接数检查
echo "【连接数】"
ss -tan | wc -l

# 3. Top 流量进程
echo "【Top 流量进程】"
nethogs -t -c 5 2>/dev/null | tail -10

# 4. 数据库慢查询
echo "【MySQL 慢查询】"
mysql -e "SELECT COUNT(*) FROM mysql.slow_log WHERE start_time > DATE_SUB(NOW(), INTERVAL 1 DAY);"

# 5. 磁盘使用
echo "【磁盘使用】"
df -h | grep -E "(Filesystem|/dev/)"

# 6. 内存使用
echo "【内存使用】"
free -h

echo "========== 巡检结束 =========="

写在最后

服务器流量暴增问题的处理要点：

快速定位：使用 iftop、nethogs 等工具快速确定流量来源
紧急处理：通过双网卡分离或限流控制，优先恢复业务
根本解决：
- 精简查询字段，避免返回无用数据
- 增加业务层缓存，减少重复查询
- 分离埋点服务，避免影响核心业务
预防措施：
- 代码 Review 时关注数据查询优化
- 建立流量监控告警体系
- 定期进行容量规划和压力测试
持续优化：
- 使用数据压缩减少传输量
- 优化缓存策略防止穿透和雪崩
- 建立完善的监控和日志体系

通过系统化的监控、快速的应急响应和持续的优化，可以有效避免和解决服务器流量暴增问题。

运维与DevOps监控告警