Real-Time Systems: Common Problems and Solutions
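WebSocket Issues
Problem 1: WebSocket connections drop frequently
Symptoms:
- The connection drops every few minutes
- The client reconnects over and over
- Logs show connection timeout
Root causes:
- A proxy or load balancer enforces an idle timeout (for example, Nginx's proxy_read_timeout defaults to 60 seconds)
- No heartbeat mechanism is implemented
- The client's network is unstable
Solutions: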
# Nginx: raise the proxy timeouts
location /ws {
    proxy_pass http://backend;
    proxy_http_version 1.1;
    proxy_set_header Upgrade $http_upgrade;
    proxy_set_header Connection "upgrade";
    # Key settings
    proxy_read_timeout 3600s;  # 1 hour
    proxy_send_timeout 3600s;
    proxy_connect_timeout 60s;
}
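// Client: send a heartbeat
class WebSocketClient {
  private ws: WebSocket | null = null;
  private heartbeatInterval: ReturnType<typeof setInterval> | null = null;
  private readonly url: string;

  constructor(url: string) {
    this.url = url;
  }

  connect() {
    this.ws = new WebSocket(this.url);
    this.ws.onopen = () => {
      // Send a ping every 30 seconds
      this.heartbeatInterval = setInterval(() => {
        if (this.ws?.readyState === WebSocket.OPEN) {
          this.ws.send(JSON.stringify({ type: 'ping' }));
        }
      }, 30000);
    };
    // Stop the timer when the socket closes so it does not leak across reconnects
    this.ws.onclose = () => this.stopHeartbeat();
  }

  disconnect() {
    this.stopHeartbeat();
    this.ws?.close();
  }

  private stopHeartbeat() {
    if (this.heartbeatInterval) {
      clearInterval(this.heartbeatInterval);
      this.heartbeatInterval = null;
    }
  }
}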
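// Server: detect timed-out clients
// Note: this handler uses Socket.IO, so it expects a Socket.IO client calling socket.emit('ping').
// The raw WebSocket client above sends plain JSON frames, which a Socket.IO server does not understand.
// Socket.IO also ships a built-in heartbeat, tuned with the pingInterval and pingTimeout server options.
io.on('connection', (socket) => {
  let lastPing = Date.now();
  // Record each ping
  socket.on('ping', () => {
    lastPing = Date.now();
    socket.emit('pong');
  });
  // Check every 60 seconds
  const checkInterval = setInterval(() => {
    if (Date.now() - lastPing > 90000) { // no ping for 90 seconds
      console.log('Client timeout, disconnecting');
      socket.disconnect();
      clearInterval(checkInterval);
    }
  }, 60000);
  socket.on('disconnect', () => {
    clearInterval(checkInterval);
  });
});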
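Heartbeats keep an idle connection alive, but an unstable client network will still drop it, so the reconnect policy matters as much as the heartbeat. The following is a minimal sketch of exponential backoff with jitter. The function name `reconnectDelay`, the 1 second base and the 30 second cap are illustrative assumptions, not values taken from the setup above.

```typescript
// Hypothetical helper: delay in milliseconds before reconnect attempt `attempt` (0-based).
function reconnectDelay(
  attempt: number,
  baseMs = 1000,
  maxMs = 30000,
  random: () => number = Math.random,
): number {
  // Exponential growth, capped at maxMs.
  const capped = Math.min(maxMs, baseMs * 2 ** attempt);
  // "Equal jitter": half of the delay is fixed, the other half is random,
  // so that many clients do not all reconnect at the same instant.
  const half = capped / 2;
  return Math.floor(half + random() * half);
}
```

A client would typically call this from its close handler, for example `setTimeout(() => client.connect(), reconnectDelay(attempt++))`, and reset `attempt` to 0 once the connection opens again.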
Problem 2: WebSocket connection cannot be established
Symptoms:
- The browser reports WebSocket connection failed
- The Network tab shows that the 101 Switching Protocols handshake failed
Diagnostic steps:
# 1. Check that the server is listening on the WebSocket port
netstat -tulpn | grep 3000
# 2. Test the WebSocket connection
wscat -c ws://localhost:3000
# 3. Check the firewall
sudo ufw status
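Common causes and fixes:
| Cause | Fix |
|---|---|
| CORS misconfiguration | Set cors.origin to the exact client origins |
| TLS/certificate problem | Pages served over HTTPS must connect with wss://, not ws:// |
| Reverse proxy does not forward the upgrade | Pass the Upgrade and Connection headers through |
| Port already in use | Switch ports or kill the process holding the port |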
// Correct CORS configuration
import { Server } from 'socket.io';

const io = new Server(httpServer, {
  cors: {
    origin: process.env.NODE_ENV === 'production'
      ? ['https://yourdomain.com']
      : ['http://localhost:3000', 'http://127.0.0.1:3000'],
    methods: ['GET', 'POST'],
    credentials: true
  }
});
Problem 3: Messages are lost
Symptoms:
- Some sent messages never arrive
- Messages arrive out of order
Solutions:
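// 1. Message acknowledgements
// Assumes ws is an open WebSocket whose parsed incoming messages are passed to onMessage().
class ReliableWebSocket {
  private static readonly MAX_RETRIES = 3;
  private pendingMessages = new Map<string, object>();

  constructor(private ws: WebSocket) {}

  send(message: object) {
    // Spread the payload first so that the generated id always wins
    const envelope = { ...message, id: crypto.randomUUID(), timestamp: Date.now() };
    this.pendingMessages.set(envelope.id, envelope);
    this.transmit(envelope.id, 0);
  }

  private transmit(id: string, attempt: number) {
    const envelope = this.pendingMessages.get(id);
    if (!envelope) return; // already acknowledged
    if (attempt > ReliableWebSocket.MAX_RETRIES) {
      console.error('Message dropped after retries:', id);
      this.pendingMessages.delete(id);
      return;
    }
    if (attempt > 0) console.warn('Message not acknowledged, resending:', id);
    this.ws.send(JSON.stringify(envelope));
    // Resend the same envelope (same id) if no ack arrives within 3 seconds.
    // Calling send() again would mint a new id and retry forever.
    setTimeout(() => this.transmit(id, attempt + 1), 3000);
  }

  onMessage(data: { type: string; id: string }) {
    if (data.type === 'ack') {
      this.pendingMessages.delete(data.id);
    }
  }
}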
// Server: acknowledge each message
socket.on('message', (data) => {
  // Process the message
  handleMessage(data);
  // Send the acknowledgement
  socket.emit('ack', { id: data.id });
});
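Resending unacknowledged messages gives at-least-once delivery, so the receiver can see the same message twice, for example when the ack itself is lost. A minimal sketch of receiver-side de-duplication by message id follows. The class name `SeenIds` and its capacity are hypothetical, and the sketch assumes every message carries a unique `id` as in the envelope above.

```typescript
// Hypothetical receiver-side helper: remembers the most recently seen message ids.
class SeenIds {
  private readonly ids = new Set<string>();
  private readonly capacity: number;

  constructor(capacity = 1000) {
    this.capacity = capacity;
  }

  // Returns true the first time an id is seen and false for a repeat.
  firstTime(id: string): boolean {
    if (this.ids.has(id)) return false;
    this.ids.add(id);
    if (this.ids.size > this.capacity) {
      // A Set iterates in insertion order, so the first value is the oldest id.
      const oldest = this.ids.values().next().value;
      if (oldest !== undefined) this.ids.delete(oldest);
    }
    return true;
  }
}
```

In the handler, the message would be processed only when `seen.firstTime(data.id)` returns true, while the ack is sent in both cases so that the sender stops retrying.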
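Performance Issues
Problem 4: Server crashes under high concurrency
Symptoms:
- The server slows down beyond 1,000 connections
- Memory usage spikes
- CPU sits at 100%
Diagnosis: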
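# Monitor resource usage (pgrep -d, joins multiple PIDs with commas, as top -p expects)
top -p "$(pgrep -d, node)"
# Inspect the heap: start Node with the inspector enabled
node --expose-gc --inspect server.js
# Then attach Chrome DevTools at
chrome://inspect
Solutions: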
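Option A: Vertical scaling (single-machine optimization)
// 1. Use cluster mode
// Note: with Socket.IO's default HTTP long-polling transport, all requests of one
// session must reach the same worker (sticky sessions).
import cluster from 'cluster';
import os from 'os';

if (cluster.isPrimary) {
  const numCPUs = os.cpus().length;
  for (let i = 0; i < numCPUs; i++) {
    cluster.fork();
  }
  cluster.on('exit', (worker) => {
    console.log(`Worker ${worker.id} died, restarting...`);
    cluster.fork();
  });
} else {
  // Each worker process runs the server
  startServer();
}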
// 2. Use the Redis adapter (relays broadcasts between processes)
import { Redis } from 'ioredis';
import { createAdapter } from '@socket.io/redis-adapter';

const pubClient = new Redis();
const subClient = pubClient.duplicate();
io.adapter(createAdapter(pubClient, subClient));
Option B: Horizontal scaling (multiple servers)
# docker-compose.yml
version: '3.8'
services:
  app1:
    image: myapp
    environment:
      - INSTANCE_ID=1
  app2:
    image: myapp
    environment:
      - INSTANCE_ID=2
  app3:
    image: myapp
    environment:
      - INSTANCE_ID=3
  nginx:
    image: nginx
    ports:
      - "80:80"
    volumes:
      - ./nginx.conf:/etc/nginx/nginx.conf
    depends_on:
      - app1
      - app2
      - app3
  redis:
    image: redis:alpine
# nginx.conf - WebSocket load balancing
# This file replaces /etc/nginx/nginx.conf, so it needs the top-level events and http blocks.
events {}
http {
    upstream websocket_backend {
        ip_hash;  # keep each client on the same backend
        server app1:3000;
        server app2:3000;
        server app3:3000;
    }
    server {
        listen 80;
        location /ws {
            proxy_pass http://websocket_backend;
            proxy_http_version 1.1;
            proxy_set_header Upgrade $http_upgrade;
            proxy_set_header Connection "upgrade";
            proxy_set_header X-Real-IP $remote_addr;
        }
    }
}
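Problem 5: Front-end UI stutters
Symptoms:
- The page freezes under heavy message volume
- Scrolling is not smooth
- Input lags
Causes:
- Every message triggers a DOM update
- No virtual scrolling
- No throttling
Solutions: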
// 1. Batch updates
class MessageBuffer {
  private buffer: Message[] = [];
  private flushTimer: ReturnType<typeof setTimeout> | null = null;

  add(message: Message) {
    this.buffer.push(message);
    if (!this.flushTimer) {
      this.flushTimer = setTimeout(() => {
        this.flush();
      }, 100); // flush at most once every 100 ms
    }
  }

  flush() {
    if (this.buffer.length > 0) {
      // Update the UI in one batch
      this.onFlush(this.buffer);
      this.buffer = [];
    }
    this.flushTimer = null;
  }

  onFlush: (messages: Message[]) => void = () => {};
}

// Usage
const buffer = new MessageBuffer();
buffer.onFlush = (messages) => {
  setMessages(prev => [...prev, ...messages]);
};
ws.on('new_message', (msg) => {
  buffer.add(msg);
});
// 2. Virtual scrolling (render only the visible rows)
import { useRef } from 'react';
import { useVirtualizer } from '@tanstack/react-virtual';

function MessageList({ messages }: { messages: Message[] }) {
  const parentRef = useRef<HTMLDivElement>(null);
  const virtualizer = useVirtualizer({
    count: messages.length,
    getScrollElement: () => parentRef.current,
    estimateSize: () => 60,
    overscan: 10 // render 10 extra rows to smooth out scrolling
  });
  return (
    <div ref={parentRef} style={{ height: '600px', overflow: 'auto' }}>
      {/* position: relative anchors the absolutely positioned rows */}
      <div style={{ height: virtualizer.getTotalSize(), position: 'relative' }}>
        {virtualizer.getVirtualItems().map(item => (
          <div
            key={item.key}
            style={{
              position: 'absolute',
              top: 0,
              left: 0,
              width: '100%',
              height: item.size,
              transform: `translateY(${item.start}px)`
            }}
          >
            <MessageItem message={messages[item.index]} />
          </div>
        ))}
      </div>
    </div>
  );
}
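The cause list for this problem also names missing throttling, which batching and virtual scrolling do not cover. A minimal sketch of a leading-edge throttle follows. The helper name `throttle` and the injectable `now` clock are illustrative assumptions. Because early calls are dropped, this suits events where only the latest value matters (typing indicators, presence, scroll position) and not chat messages themselves.

```typescript
// Hypothetical helper: wraps `fn` so that it runs at most once per `intervalMs`.
function throttle<A extends unknown[]>(
  fn: (...args: A) => void,
  intervalMs: number,
  now: () => number = Date.now,
): (...args: A) => void {
  let last = -Infinity;
  return (...args: A) => {
    const t = now();
    if (t - last >= intervalMs) {
      last = t;
      fn(...args);
    }
    // Calls that arrive before the interval has elapsed are dropped.
  };
}
```

For example, a handler for a high-frequency event could be registered as `throttle(updateTypingIndicator, 200)`, where `updateTypingIndicator` stands for whatever UI update the application performs.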
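Scalability Issues
Problem 6: Single point of failure
Symptoms:
- When one server goes down, the whole service becomes unavailable
Solution: high-availability architecture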
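      ┌────────────┐
      │   Users    │
      └─────┬──────┘
            │
      ┌─────▼─────────────────────┐
      │ Load balancer (HAProxy)   │ ← failover
      │ Primary: 10.0.1.10        │
      │ Standby: 10.0.1.11        │
      └─────┬─────────────────────┘
            │
            ├──► App server 1 (active)
            ├──► App server 2 (active)
            ├──► App server 3 (active)
            │
      ┌─────▼──────────────┐
      │ Redis              │ ← Sentinel mode
      │ (Sentinel cluster) │
      └─────┬──────────────┘
            │
      ┌─────▼──────────────┐
      │ DB                 │ ← primary/replica replication + automatic failover
      │ (Patroni)          │
      └────────────────────┘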
Implementation steps:
# docker-compose-ha.yml
version: '3.8'
services:
  haproxy:
    image: haproxy:latest
    ports:
      - "80:80"
      - "8404:8404"  # stats page
    volumes:
      - ./haproxy.cfg:/usr/local/etc/haproxy/haproxy.cfg
  app1:
    image: myapp
    environment:
      REDIS_SENTINELS: sentinel1:26379,sentinel2:26379,sentinel3:26379
  app2:
    image: myapp
    environment:
      REDIS_SENTINELS: sentinel1:26379,sentinel2:26379,sentinel3:26379
  redis-master:
    image: redis:alpine
    command: redis-server --appendonly yes
  redis-slave1:
    image: redis:alpine
    # --replicaof replaces the deprecated --slaveof (Redis 5 and later)
    command: redis-server --replicaof redis-master 6379
  # sentinel2 and sentinel3, referenced in REDIS_SENTINELS, need equivalent service definitions
  sentinel1:
    image: redis:alpine
    command: redis-sentinel /etc/sentinel.conf
    volumes:
      - ./sentinel.conf:/etc/sentinel.conf
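The application has to turn the comma-separated `REDIS_SENTINELS` value into a list of addresses before it can connect through Sentinel. A minimal parsing sketch follows. The function name `parseSentinels` and the fallback port are assumptions. The `{ host, port }` shape matches what ioredis accepts in its `sentinels` option, which is used together with the master group `name` configured in sentinel.conf.

```typescript
interface SentinelAddress {
  host: string;
  port: number;
}

// Hypothetical helper: parses "host1:26379,host2:26379" into address objects.
function parseSentinels(value: string, defaultPort = 26379): SentinelAddress[] {
  return value
    .split(',')
    .map((entry) => entry.trim())
    .filter((entry) => entry.length > 0)
    .map((entry) => {
      const [host = '', portText] = entry.split(':');
      // An entry without ":port" falls back to the default Sentinel port.
      const port = portText === undefined ? defaultPort : Number(portText);
      if (host === '' || !Number.isInteger(port) || port < 1 || port > 65535) {
        throw new Error(`Invalid sentinel address: ${entry}`);
      }
      return { host, port };
    });
}
```

Failing fast on a malformed entry is a deliberate choice here: a typo in the environment variable surfaces at startup instead of as a silent loss of failover.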
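Data Consistency Issues
Problem 7: Messages arrive out of order
Symptoms:
- A message sent later is displayed first
- Different clients see different orderings
Solutions: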
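// 1. Use a timestamp plus a sequence number
interface Message {
  id: string;
  content: string;
  timestamp: number;
  sequence: number; // monotonically increasing, assigned by the server
  userId: string;
}

class MessageOrdering {
  private buffer: Message[] = [];
  private nextExpectedSeq = 0;
  private requestedUpTo = 0; // gaps below this sequence were already requested

  constructor(private ws: WebSocket) {}

  add(message: Message) {
    // Drop duplicates and already-delivered messages. Otherwise they would sit
    // at the head of the buffer and block delivery forever.
    if (message.sequence < this.nextExpectedSeq ||
        this.buffer.some(m => m.sequence === message.sequence)) {
      return;
    }
    this.buffer.push(message);
    this.buffer.sort((a, b) => a.sequence - b.sequence);
    // Deliver in order
    while (this.buffer.length > 0 &&
           this.buffer[0].sequence === this.nextExpectedSeq) {
      const msg = this.buffer.shift()!;
      this.emit(msg);
      this.nextExpectedSeq++;
    }
    // Detect a gap and request it once
    if (this.buffer.length > 0 &&
        this.buffer[0].sequence > this.requestedUpTo) {
      console.warn('Missing messages detected');
      this.requestMissing(this.nextExpectedSeq, this.buffer[0].sequence);
      this.requestedUpTo = this.buffer[0].sequence;
    }
  }

  emit(message: Message) {
    // Render to the UI
  }

  requestMissing(from: number, to: number) {
    // Ask the server to resend sequences from (inclusive) to to (exclusive).
    // WebSocket.send() takes a string, so the request must be serialized.
    this.ws.send(JSON.stringify({
      type: 'request_missing',
      from,
      to
    }));
  }
}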
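The client-side reordering above relies on the server to assign sequence numbers and to answer `request_missing`. A minimal sketch of that server side follows. The class name `ReplayBuffer` and its capacity are hypothetical, and it assumes one buffer per room or channel, with `to` treated as exclusive to match the client request.

```typescript
interface Sequenced<T> {
  sequence: number;
  payload: T;
}

// Hypothetical server-side helper: stamps messages and keeps the most recent ones for replay.
class ReplayBuffer<T> {
  private nextSeq = 0;
  private recent: Sequenced<T>[] = [];
  private readonly capacity: number;

  constructor(capacity = 1000) {
    this.capacity = capacity;
  }

  // Stamp a payload with the next sequence number and remember it.
  append(payload: T): Sequenced<T> {
    const msg: Sequenced<T> = { sequence: this.nextSeq++, payload };
    this.recent.push(msg);
    if (this.recent.length > this.capacity) this.recent.shift();
    return msg;
  }

  // Messages with from <= sequence < to that are still held in the buffer.
  range(from: number, to: number): Sequenced<T>[] {
    return this.recent.filter((m) => m.sequence >= from && m.sequence < to);
  }
}
```

A request for messages that have already been evicted returns fewer items than asked for, so the client needs a fallback such as reloading history over HTTP.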
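Problem 8: Multi-device sync conflicts
Symptoms:
- The user edits on a phone and a computer at the same time, and the data conflicts
Solution: CRDT (Conflict-free Replicated Data Type)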
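// Conflict-free synchronization with Yjs
import * as Y from 'yjs';
import { WebsocketProvider } from 'y-websocket';
// Create a shared document
const ydoc = new Y.Doc();
const provider = new WebsocketProvider('ws://localhost:1234', 'my-doc', ydoc);
// Shared text
const ytext = ydoc.getText('content');
// Edit made on device A
ytext.insert(0, 'Hello ');
// Edit made on device B
ytext.insert(0, 'Hi ');
// Both edits are merged without a conflict.
// Run one after the other on a single document, as here, the result is "Hi Hello ".
// For truly concurrent edits on two devices, Yjs picks one deterministic order
// and every replica converges to the same text.
console.log(ytext.toString()); // "Hi Hello "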
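To see why a CRDT never needs conflict resolution, it helps to look at the simplest one, a grow-only counter. The sketch below is not how Yjs works internally (Yjs uses a far more elaborate sequence CRDT). It only illustrates the convergence property: merging in any order, any number of times, gives the same result. All names are illustrative.

```typescript
// Grow-only counter: each replica increments only its own slot.
type GCounter = Record<string, number>;

function increment(c: GCounter, replicaId: string, by = 1): GCounter {
  return { ...c, [replicaId]: (c[replicaId] ?? 0) + by };
}

// Merge takes the per-replica maximum, which makes it
// commutative, associative and idempotent.
function merge(a: GCounter, b: GCounter): GCounter {
  const out: GCounter = { ...a };
  for (const [id, n] of Object.entries(b)) {
    out[id] = Math.max(out[id] ?? 0, n);
  }
  return out;
}

// The counter's value is the sum over all replicas.
function value(c: GCounter): number {
  return Object.values(c).reduce((sum, n) => sum + n, 0);
}
```

If a phone replica counts 2 and a laptop replica counts 3 while offline, `merge` yields a value of 5 on both devices regardless of which side merges first.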
Security Issues
Problem 9: WebSocket hijacking
Symptoms:
- Unauthenticated users can open WebSocket connections
- CSRF attacks
Solutions:
// 1. JWT authentication
import jwt from 'jsonwebtoken';

io.use(async (socket, next) => {
  const token = socket.handshake.auth.token ||
    socket.handshake.headers.authorization?.split(' ')[1];
  if (!token) {
    return next(new Error('Authentication required'));
  }
  try {
    const decoded = jwt.verify(token, process.env.JWT_SECRET!);
    socket.data.user = decoded;
    next();
  } catch (error) {
    next(new Error('Invalid token'));
  }
});
// 2. Origin check
io.use((socket, next) => {
  const origin = socket.handshake.headers.origin;
  const allowedOrigins = ['https://yourdomain.com'];
  // A missing Origin header is rejected as well
  if (process.env.NODE_ENV === 'production' &&
      (!origin || !allowedOrigins.includes(origin))) {
    return next(new Error('Origin not allowed'));
  }
  next();
});
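// 3. Rate limiting (per socket: at most 10 events per second)
// express-rate-limit expects a real Express response object, so calling it with {}
// inside io.use() fails once the limit is hit, and it would only cover the handshake.
// A small fixed-window counter on incoming events is used here instead.
io.on('connection', (socket) => {
  let windowStart = Date.now();
  let count = 0;
  socket.use((_packet, next) => {
    const now = Date.now();
    if (now - windowStart >= 1000) { // start a new 1-second window
      windowStart = now;
      count = 0;
    }
    if (++count > 10) {
      return next(new Error('Too many requests'));
    }
    next();
  });
});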
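A token bucket is a common alternative for per-connection limits, because it tolerates short bursts while capping the sustained rate. The sketch below is library-independent. The class name `TokenBucket`, its parameters and the injectable clock are assumptions for illustration.

```typescript
// Hypothetical per-connection limiter: allows bursts up to `capacity`,
// refilled continuously at `refillPerSec` tokens per second.
class TokenBucket {
  private tokens: number;
  private lastRefill: number;
  private readonly capacity: number;
  private readonly refillPerSec: number;
  private readonly now: () => number;

  constructor(capacity: number, refillPerSec: number, now: () => number = Date.now) {
    this.capacity = capacity;
    this.refillPerSec = refillPerSec;
    this.now = now;
    this.tokens = capacity;
    this.lastRefill = now();
  }

  // Returns true and consumes one token if the action is allowed.
  tryRemove(): boolean {
    const t = this.now();
    const elapsedSec = (t - this.lastRefill) / 1000;
    this.tokens = Math.min(this.capacity, this.tokens + elapsedSec * this.refillPerSec);
    this.lastRefill = t;
    if (this.tokens < 1) return false;
    this.tokens -= 1;
    return true;
  }
}
```

One bucket would be created per socket on connection, and each incoming event would be processed only when `tryRemove()` returns true.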
Problem 10: XSS attacks
Symptoms:
- A user sends a message that contains a script
- Other users' browsers execute the malicious script
Solutions:
// 1. Content filtering
import DOMPurify from 'isomorphic-dompurify';

function sanitizeMessage(content: string): string {
  // Strip all HTML tags
  return DOMPurify.sanitize(content, {
    ALLOWED_TAGS: [],
    ALLOWED_ATTR: []
  });
}
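// 2. Escape on display
function MessageBubble({ content }: { content: string }) {
  return (
    <div>
      {/* React escapes interpolated text automatically */}
      <p>{content}</p>
      {/* Only if the message must render as HTML: sanitize it first (this sanitizes, it does not escape) */}
      <p dangerouslySetInnerHTML={{
        __html: DOMPurify.sanitize(content)
      }} />
    </div>
  );
}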
// 3. CSP header
app.use((req, res, next) => {
  // If the WebSocket endpoint lives on a different origin, list it under connect-src
  res.setHeader(
    'Content-Security-Policy',
    "default-src 'self'; script-src 'self'; style-src 'self' 'unsafe-inline'"
  );
  next();
});
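React escapes text for you, but code paths that build HTML strings by hand (server-side templates, email notifications, `innerHTML` assignments) need explicit escaping. A minimal sketch of such a helper follows. The name `escapeHtml` is illustrative, and it covers only the five characters that matter in HTML text and quoted attribute values.

```typescript
const HTML_ESCAPES: Record<string, string> = {
  '&': '&amp;',
  '<': '&lt;',
  '>': '&gt;',
  '"': '&quot;',
  "'": '&#39;',
};

// Hypothetical helper: replaces HTML-significant characters with entities.
function escapeHtml(text: string): string {
  return text.replace(/[&<>"']/g, (ch) => HTML_ESCAPES[ch] ?? ch);
}
```

Escaping happens at output time, so the stored message stays unmodified and can still be rendered safely in other contexts.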
Deployment Issues
Problem 11: WebSocket connection problems on Heroku/Render
Symptoms:
- Works locally, but WebSocket connections fail after deployment
Solutions:
// 1. Use environment variables
const PORT = process.env.PORT || 3000;
const WS_URL = process.env.NODE_ENV === 'production'
  ? 'wss://myapp.herokuapp.com'
  : 'ws://localhost:3000';
// 2. Listen on the port assigned by the platform
httpServer.listen(PORT, '0.0.0.0', () => {
  console.log(`Server running on port ${PORT}`);
});
// 3. Procfile
# Procfile
web: node dist/server.js
// 4. Enable sticky sessions (Heroku)
# Session affinity is a Heroku app feature, not a config var. Enable it with the CLI:
heroku features:enable http-session-affinity
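A hard-coded WebSocket URL per environment is a frequent reason for "works locally, fails after deployment", especially when an HTTPS page ends up connecting with ws://. A minimal sketch that derives the URL from the page address follows. It assumes the WebSocket server is reachable on the same host as the page. The function name `webSocketUrl` and the `/ws` path are illustrative.

```typescript
// Hypothetical helper: maps a page URL to the matching WebSocket URL on the same host.
function webSocketUrl(pageUrl: string, path = '/ws'): string {
  const page = new URL(pageUrl);
  // HTTPS pages must use wss://; browsers block ws:// there as mixed content.
  const scheme = page.protocol === 'https:' ? 'wss:' : 'ws:';
  // page.host keeps a non-default port such as :3000.
  return `${scheme}//${page.host}${path}`;
}
```

In the browser this would be called as `webSocketUrl(window.location.href)`, so the same bundle works locally and in production without an environment switch.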
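Problem 12: Communication between Docker containers
Symptoms:
- The application cannot connect to the database
- The WebSocket service cannot reach Redis
Solutions: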
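# docker-compose.yml
version: '3.8'
services:
  app:
    build: .
    environment:
      # Use the service name as the hostname
      DATABASE_URL: postgresql://user:pass@db:5432/myapp
      REDIS_URL: redis://redis:6379
    depends_on:
      db:
        condition: service_healthy
      redis:
        condition: service_started
  db:
    image: postgres:15
    environment:
      # Must match the credentials in DATABASE_URL; the postgres image will not start without a password
      POSTGRES_USER: user
      POSTGRES_PASSWORD: pass
      POSTGRES_DB: myapp
    healthcheck:
      test: ["CMD-SHELL", "pg_isready -U user -d myapp"]
      interval: 5s
      timeout: 5s
      retries: 5
  redis:
    image: redis:alpine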
Monitoring and Debugging
Problem 13: How do I monitor the number of WebSocket connections?
Solutions:
// 1. Prometheus metrics
import promClient from 'prom-client';

const connectedClients = new promClient.Gauge({
  name: 'websocket_connected_clients',
  help: 'Number of connected WebSocket clients'
});
io.on('connection', (socket) => {
  connectedClients.inc();
  socket.on('disconnect', () => {
    connectedClients.dec();
  });
});
// Expose the metrics endpoint
app.get('/metrics', async (req, res) => {
  res.set('Content-Type', promClient.register.contentType);
  res.end(await promClient.register.metrics());
});
# prometheus.yml
scrape_configs:
  - job_name: 'websocket-app'
    static_configs:
      - targets: ['app:3000']
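2. Custom monitoring dashboard:
// Collect monitoring data
class MonitoringService {
  private stats = {
    totalConnections: 0,
    currentConnections: 0,
    messagesPerSecond: 0,
    avgLatency: 0
  };
  private messagesThisSecond = 0;

  trackConnection() {
    this.stats.totalConnections++;
    this.stats.currentConnections++;
  }

  trackDisconnection() {
    this.stats.currentConnections--;
  }

  trackMessage(latency: number) {
    this.messagesThisSecond++;
    // Exponential moving average with a smoothing factor of 0.1
    this.stats.avgLatency = 0.9 * this.stats.avgLatency + 0.1 * latency;
  }

  // Call once per second: publishes the message count of the past second
  tick() {
    this.stats.messagesPerSecond = this.messagesThisSecond;
    this.messagesThisSecond = 0;
  }

  getStats() {
    return { ...this.stats };
  }
}

const monitoring = new MonitoringService();
// Broadcast the statistics every second
setInterval(() => {
  monitoring.tick();
  io.emit('stats', monitoring.getStats());
}, 1000);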
Problem 14: How do I debug WebSocket traffic?
Tools and methods:
1. Chrome DevTools:
F12 → Network → WS (filters WebSocket traffic) → click the connection → Messages
2. The wscat command-line tool:
# Install
npm install -g wscat
# Test a connection
wscat -c ws://localhost:3000
# Send a message
> {"type": "ping"}
# With a header
wscat -c ws://localhost:3000 -H "Authorization: Bearer token123"
3. Logging:
// Verbose logging
io.on('connection', (socket) => {
  console.log('[CONNECT]', {
    id: socket.id,
    transport: socket.conn.transport.name,
    // Caution: handshake headers can include cookies and Authorization tokens; redact them in production
    headers: socket.handshake.headers
  });
  socket.onAny((event, ...args) => {
    console.log('[EVENT]', {
      socketId: socket.id,
      event,
      args
    });
  });
  socket.on('disconnect', (reason) => {
    console.log('[DISCONNECT]', {
      id: socket.id,
      reason
    });
  });
});
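Performance Tuning Checklist
Pre-deployment checks:
- Enable Gzip compression
- Configure HTTP/2
- Set reasonable connection timeouts
- Implement a heartbeat mechanism
- Configure Redis caching
- Serve static assets through a CDN
- Enable database connection pooling
- Configure log levels (less verbose in production)
- Set memory limits
- Enable CPU cluster mode
- Configure health checks
- Set up alerting rules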
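Emergency Runbook
Server CPU at 100%:
# 1. Find the offending process
top
# 2. Write a heap snapshot (only works if Node was started with --heapsnapshot-signal=SIGUSR2)
kill -USR2 <pid>
# 3. Restart the service (temporary relief)
pm2 restart app
# 4. Scale out (long-term fix)
kubectl scale deployment app --replicas=5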
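Database connections exhausted:
// Enlarge the connection pool (assuming node-postgres)
import { Pool } from 'pg';

const pool = new Pool({
  max: 50, // raised to 50; the total across all app instances must stay below the database's max_connections
  idleTimeoutMillis: 30000,
  connectionTimeoutMillis: 2000
});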
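Redis out of memory:
# Free memory by deleting keys under a disposable prefix ('cache:*' is a placeholder pattern).
# Redis removes expired keys by itself; a '*' pattern here would delete every key in the database.
redis-cli --scan --pattern 'cache:*' | xargs redis-cli del
# Set the maxmemory eviction policy
redis-cli CONFIG SET maxmemory-policy allkeys-lru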
Remember: prevention beats remediation. Run load tests and failure drills regularly!