2026-06-17 04:41:18 · Network security articles · Source: ZONE.CI 全球网

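Article summary: This article works through the design points of retry mechanisms in Golang and notes that blind retries can amplify a failure. The core is to separate retryable errors (network blips, gateway-type 5xx status codes such as 502, 503, and 504) from non-retryable ones (business logic errors), and it stresses that write operations must rely on server-side idempotency. The code examples show a retry framework with backoff, jitter, and Retry-After header parsing. It recommends capping attempts at 2 to 3, logging in detail, and pairing retries with circuit breaking and rate limiting to control the retry budget. Overall score: 87 Categories: secure development, hands-on experience, solutions, security building, code audit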

Retry Mechanisms in Golang: How to Handle Different Failure Scenarios

Original

go go

Go语言教程

June 13, 2026, 13:02 · Shaanxi

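The API timed out, and the first reaction is to add a retry. That is where I usually frown first.

It is not that you can't retry. It is that plenty of systems have been knocked over by retries. The downstream only hiccupped for a moment, then dozens of upstream machines all "help out", every request gets three more tries, and traffic multiplies in an instant. The logs look very diligent: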

pay callback timeout, retry=1
pay callback timeout, retry=2
pay callback timeout, retry=3

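But what the downstream sees is not diligence. It sees someone flooring the accelerator.

Retrying is far too easy to write in Go. The hard part is the judgment: which failures deserve a retry, and which failures only get worse when you retry.

I usually start by sorting failures into a few categories.

Network blips, connection resets, and temporary DNS jitter can be retried.

HTTP 502, 503, and 504 can be retried, but more slowly.

For rate limiting such as 429, don't blindly retry right away. Check Retry-After, and back off if it is missing.

Business errors such as insufficient balance, invalid parameters, or a wrong order state should not be retried. Ten more tries will still produce the same error.

Then there is the nastiest case: the request actually succeeded and only the response timed out. Take a debit API: the server has already taken the money, but the client never received the response. Fire again without idempotency as a backstop on the business side, and you have an incident.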

So when I write a retry, the first thing I write is not the for loop but the decision function.

package retryx

import (
	"context"
	"errors"
	"math/rand"
	"net"
	"net/http"
	"strconv"
	"time"
)

// Result carries the parts of an HTTP response the retry loop needs.
type Result struct {
	StatusCode int
	Header     http.Header
}

// RetryableError lets callers mark their own errors as transient.
type RetryableError interface {
	error
	Temporary() bool
}

// shouldRetry decides whether a failed attempt is worth repeating.
func shouldRetry(res *Result, err error) bool {
	if err != nil {
		// The caller gave up; retrying would ignore that decision.
		if errors.Is(err, context.Canceled) {
			return false
		}
		// A single attempt hit its deadline.
		if errors.Is(err, context.DeadlineExceeded) {
			return true
		}

		var ne net.Error
		if errors.As(err, &ne) {
			// Temporary is deprecated on net.Error since Go 1.18 but still compiles.
			return ne.Timeout() || ne.Temporary()
		}

		var re RetryableError
		if errors.As(err, &re) {
			return re.Temporary()
		}

		// Unknown errors are treated as not retryable.
		return false
	}

	if res == nil {
		return false
	}

	switch res.StatusCode {
	case http.StatusTooManyRequests,
		http.StatusBadGateway,
		http.StatusServiceUnavailable,
		http.StatusGatewayTimeout:
		return true
	default:
		return false
	}
}

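I don't lump 500 in here wholesale. In some systems a 500 is a nil pointer dereference in the code; in others, "out of stock" also gets wrapped as a 500. This depends on your API contract. If the contract is unclear, don't get clever on the client.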

Now for the retry loop itself.

// Config controls how Do retries.
type Config struct {
	MaxAttempts int           // total attempts, including the first call
	BaseDelay   time.Duration // base of the exponential backoff
	MaxDelay    time.Duration // upper bound on the backoff delay
}

// Do calls fn until it succeeds, fails with a non-retryable result,
// or MaxAttempts is reached.
func Do(ctx context.Context, cfg Config, fn func(context.Context) (*Result, error)) (*Result, error) {
	if cfg.MaxAttempts <= 0 {
		cfg.MaxAttempts = 3
	}
	if cfg.BaseDelay <= 0 {
		cfg.BaseDelay = 80 * time.Millisecond
	}
	if cfg.MaxDelay <= 0 {
		cfg.MaxDelay = 2 * time.Second
	}

	var lastErr error
	var lastRes *Result

	for attempt := 1; attempt <= cfg.MaxAttempts; attempt++ {
		// Each attempt gets its own deadline, independent of earlier ones.
		callCtx, cancel := context.WithTimeout(ctx, 900*time.Millisecond)
		res, err := fn(callCtx)
		cancel()

		lastRes, lastErr = res, err

		if !shouldRetry(res, err) {
			return res, err
		}

		if attempt == cfg.MaxAttempts {
			break
		}

		delay := backoff(cfg.BaseDelay, cfg.MaxDelay, attempt)

		// A server-provided Retry-After on 429 overrides the computed backoff.
		if res != nil && res.StatusCode == http.StatusTooManyRequests {
			if v := retryAfter(res.Header); v > 0 {
				delay = v
			}
		}

		// Wait out the delay, but stop early if the caller's context ends.
		timer := time.NewTimer(delay)
		select {
		case <-ctx.Done():
			timer.Stop()
			return lastRes, ctx.Err()
		case <-timer.C:
		}
	}

	return lastRes, lastErr
}

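func backoff(base, maxDelay time.Duration, attempt int) time.Duration {
	// Double once per attempt, stopping at the cap so the value cannot overflow.
	d := base
	for i := 1; i < attempt && d < maxDelay; i++ {
		d *= 2
	}
	if d > maxDelay {
		d = maxDelay
	}
	half := d / 2
	if half <= 0 {
		return d // too small to jitter; rand.Int63n panics on a non-positive bound
	}

	// Add some jitter so every machine does not rush back in the same millisecond.
	return half + time.Duration(rand.Int63n(int64(half)))
}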

func retryAfter(h http.Header) time.Duration {
	v := h.Get("Retry-After")
	if v == "" {
		return 0
	}
	// Retry-After is either a number of seconds or an HTTP date.
	if sec, err := strconv.Atoi(v); err == nil {
		if sec <= 0 {
			return 0
		}
		return time.Duration(sec) * time.Second
	}
	if t, err := http.ParseTime(v); err == nil {
		if d := time.Until(t); d > 0 {
			return d
		}
	}
	return 0
}

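There are a few things in this code I care about.

First, every call has its own timeout. Don't pass down only the outermost context and then let a request inside hang for ages. Retrying is not an unlimited life extension; a single call needs a boundary too.

Second, backoff must carry jitter. What a production cluster fears most is lockstep, because lockstep means dying together. 100 machines fail at the same moment, retry together 100ms later, then retry together again 200ms later. That is not fault tolerance, that is a synchronized charge.

Third, don't be greedy with the maximum number of attempts. For most APIs 2 to 3 is enough. Beyond that you are usually not "recovering" but widening the blast radius.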

An HTTP request can be wrapped like this:

// Needs imports: context, fmt, io, net/http, net/url, time, and the retryx package above.
func callInventory(ctx context.Context, client *http.Client, sku string) error {
	res, err := retryx.Do(ctx, retryx.Config{
		MaxAttempts: 3,
		BaseDelay:   100 * time.Millisecond,
		MaxDelay:    800 * time.Millisecond,
	}, func(c context.Context) (*retryx.Result, error) {
		req, err := http.NewRequestWithContext(c, http.MethodPost,
			"http://inventory-inner/lock?sku="+url.QueryEscape(sku), nil)
		if err != nil {
			return nil, err
		}

		resp, err := client.Do(req)
		if err != nil {
			return nil, err
		}
		defer resp.Body.Close()
		// Drain the body so the underlying connection can be reused.
		_, _ = io.Copy(io.Discard, resp.Body)

		return &retryx.Result{
			StatusCode: resp.StatusCode,
			Header:     resp.Header,
		}, nil
	})
	if err != nil {
		return err
	}
	// Do returns a nil error whenever the last attempt got an HTTP response,
	// so a non-2xx status still has to be turned into an error here.
	if res.StatusCode < 200 || res.StatusCode > 299 {
		return fmt.Errorf("lock stock sku=%s: unexpected status %d", sku, res.StatusCode)
	}
	return nil
}

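But that is still not enough.

Whenever an API involves a "write", I always ask: is there an idempotency key?

Creating orders, deducting stock, issuing coupons: these operations cannot rely on the client "retrying a bit less" to stay safe. Clients time out, gateways replay, MQs redeliver, and people run manual compensation. Without idempotency, something will go wrong sooner or later.

Put a business-unique key in the request, and have the server deduplicate on it: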

func buildIdempotentKey(userID, bizNo string) string {
	return "lock_stock:" + userID + ":" + bizNo
}

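Don't fudge this key with a random UUID generated on each attempt. Such a UUID is different on every retry; that is manufacturing duplicates, not idempotency.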
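On the server side, deduplicating on that key roughly means "run the operation at most once per key and replay the stored result for duplicates". Below is a minimal in-memory sketch of that idea. `idempotentStore` and its methods are hypothetical, and a real service would more likely keep the record in a database or cache with a unique constraint and an expiry.

```go
package main

import "sync"

// idempotentStore remembers the outcome of each idempotency key so that a
// retried request gets the first result back instead of running again.
// The in-memory map is for illustration only.
type idempotentStore struct {
	mu      sync.Mutex
	results map[string]string
}

func newIdempotentStore() *idempotentStore {
	return &idempotentStore{results: make(map[string]string)}
}

// do runs op at most once per key. The lock is held while op runs, which
// keeps the sketch simple but serializes all callers.
func (s *idempotentStore) do(key string, op func() (string, error)) (string, error) {
	s.mu.Lock()
	defer s.mu.Unlock()

	if res, ok := s.results[key]; ok {
		return res, nil // duplicate request: replay the stored result
	}
	res, err := op()
	if err != nil {
		return "", err // failures are not recorded, so a retry runs op again
	}
	s.results[key] = res
	return res, nil
}
```

Called twice with the same key, for example the output of `buildIdempotentKey("u1", "b1")`, the stock-locking operation runs once and the second call returns the remembered result.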

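Database failures are the same: not every error can be retried.

Deadlocks and lock wait timeouts can be retried briefly. A unique key conflict most likely means the request has already been processed, so you should look up the result, not insert again. SQL syntax errors and over-length fields are pointless to retry.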
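As a sketch of that split, here is a classifier keyed on MySQL server error numbers. MySQL is my assumption, since the article does not name a database, and pulling the number out of a driver error is driver-specific and left out. `dbAction` and `classifyMySQL` are hypothetical names.

```go
package main

// dbAction is what the caller should do after a failed statement.
type dbAction int

const (
	dbFail           dbAction = iota // give up and surface the error
	dbRetry                          // retry after a short backoff
	dbLookupExisting                 // the row probably exists: read it back
)

// classifyMySQL maps a MySQL server error number to an action.
// Only the cases named in the text are listed; everything else fails fast.
func classifyMySQL(code int) dbAction {
	switch code {
	case 1213, // ER_LOCK_DEADLOCK
		1205: // ER_LOCK_WAIT_TIMEOUT
		return dbRetry
	case 1062: // ER_DUP_ENTRY
		return dbLookupExisting
	default: // e.g. 1064 ER_PARSE_ERROR, 1406 ER_DATA_TOO_LONG
		return dbFail
	}
}
```

The useful part is the third outcome: a duplicate-key error is neither "retry" nor "fail" but "go read what the earlier attempt wrote".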

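Be careful with MQ send failures too. When a producer's send times out, the message may already be in the broker. If you send it again, the consumer must be able to tolerate duplicates. Don't look at "producer retry" and "consumer idempotency" separately; they are two sides of the same pit.

Don't log just a bare "retry" line either.

I prefer something like this: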

retry call=lock_stock sku=10086 attempt=2 cost=904ms delay=187ms err="context deadline exceeded"

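When troubleshooting in production, this one line is far more useful than a pile of "call failed" messages. At the very least you can see which attempt failed, how long it slept, and whether the per-call timeout is too short.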

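Finally, one thing is easy to overlook: the retry budget.

Say an API normally runs at 1000 QPS. If the downstream starts to wobble and every request ends up making 3 attempts, the theoretical load goes straight to 3000. So the upstream should pair retries with rate limiting or circuit breaking. Retries are there to rescue occasional failures, not to brute-force through a downstream avalanche.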
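One way to enforce a budget is to let first attempts earn credit and make every retry spend it, so retries can never exceed a fixed fraction of normal traffic. This is a minimal sketch under that assumption; `retryBudget` and its methods are hypothetical, and it is not a substitute for a real rate limiter or circuit breaker.

```go
package main

import "sync"

// retryBudget allows roughly one retry per `ratio` first attempts.
type retryBudget struct {
	mu      sync.Mutex
	credit  int // earned by first attempts, spent by retries
	ratio   int // first attempts needed to pay for one retry
	maxBank int // upper bound on banked credit
}

// newRetryBudget builds a budget that banks at most maxRetriesBanked retries.
func newRetryBudget(ratio, maxRetriesBanked int) *retryBudget {
	return &retryBudget{ratio: ratio, maxBank: ratio * maxRetriesBanked}
}

// onRequest records a first attempt and earns one unit of credit.
func (b *retryBudget) onRequest() {
	b.mu.Lock()
	defer b.mu.Unlock()
	if b.credit < b.maxBank {
		b.credit++
	}
}

// allowRetry spends credit for one retry, or reports false when the
// budget is exhausted and the caller should fail fast instead.
func (b *retryBudget) allowRetry() bool {
	b.mu.Lock()
	defer b.mu.Unlock()
	if b.credit < b.ratio {
		return false
	}
	b.credit -= b.ratio
	return true
}
```

With a ratio of 10, the 1000 QPS API from the example can add at most about 100 retries per second once the bank is drained, so the downstream sees roughly 1100 QPS during an outage instead of 3000.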

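I have seen plenty of code that just loops three times on failure, sleeps one second, and then swallows the original error, returning only "system busy". Code like that looks fine in the short term, but when something actually breaks in production it is miserable to debug.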

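In the end, a retry mechanism comes down to a few sentences:

Retry only the failures you can classify.

For writes, settle idempotency first.

Backoff must carry jitter.

Don't be greedy with the attempt count.

Logs must preserve the scene.

Otherwise you think you wrote fault tolerance, and in production it looks more like an amplifier.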

Reposted from: Go语言教程 (go go), "Retry Mechanisms in Golang: How to Handle Different Failure Scenarios"