2026-01-05 18:09:18 · Cybersecurity · Source: ZONE.CI

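Summary: This article explains Protobuf reverse engineering, covering wire-format encoding and field identification. By parsing the binary key-value structure and varint encoding, it shows how to extract field information from a hex stream and reconstruct the .proto file. The walkthrough gives security practitioners a concrete method for analyzing proprietary network protocols. Overall rating: 90. Categories: reverse engineering, binary security, network security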


Protobuf Reverse Engineering in Practice: From Wire Format to a Complete .proto File

Original

二进制磨剑

January 4, 2026, 15:26, United States

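1. What is Protobuf?

Protocol Buffers (Protobuf for short) is an efficient, language-neutral, platform-neutral serialization protocol for structured data, created by Google. Data structures are defined in a .proto description file, and a compiler generates code in the target language to serialize and deserialize them. Compared with XML and JSON, Protobuf produces smaller payloads, parses faster, and enforces stronger typing, which makes it a good fit for network communication, RPC, storage, and high-performance systems. In practice, developers only maintain the .proto file and can then exchange data safely and efficiently between C/C++, Java, Python, Go, and other languages.

1.1 Example .proto file

A .proto file (the Protocol Buffers definition file) is essentially an IDL description of "data structures + communication contract". It is built from the following standard parts:

syntax (syntax version declaration; effectively required, since protoc falls back to proto2 with a warning when it is missing): specifies whether proto2 or proto3 is used. For example: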

  syntax = "proto3";

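proto2: supports required / optional / default
proto3: simpler syntax; for a scalar field not marked optional, the default value cannot be distinguished from "not set"

Reverse-engineering tip: if decoded messages show many 0 / "" / false values and you cannot tell "not sent" from "sent as 0" (proto3 omits default values from the wire), the schema is most likely proto3.

package (optional but recommended; the package name)

Defines a namespace so that message names do not collide. For example: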

  package com.example.api.user;

import (optional; dependency import)

Pulls in other .proto files. For example:

  import "google/protobuf/timestamp.proto";
  import "common/base.proto";

option (optional but common; compiler and language options)

Controls how code is generated for each language. For example:

  option java_package = "com.example.api";
  option java_outer_classname = "UserProto";
  option go_package = "example.com/api/user";
  option optimize_for = SPEED;

message (the core: the message structure)

Defines the data structure itself, similar to a C struct or a Java class. For example:

  syntax = "proto3";

  package user;

  // User account information
  message UserAccount {
    string username = 1; // account name
    string password = 2; // password
    string email    = 3; // email address
    int32  age      = 4; // age
    int32  height   = 5; // height (cm)
    double weight   = 6; // weight (kg)
  }

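Structure of a field:

  <type> <field_name> = <tag>;
  // For example:
  string username = 1;

Tag (field number) rules: field numbers 1 to 15 produce a 1-byte key (the most common case), and 16 to 2047 produce a 2-byte key. Once a schema is published, a field number must not be changed.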
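The byte counts above follow from how the key is built: the field number is shifted left by 3 bits, combined with the wire type, and the result is stored as a varint with 7 data bits per byte. A minimal sketch of that arithmetic (the helper names `encode_varint` and `key_size` are made up for this illustration):

```python
def encode_varint(value: int) -> bytes:
    """Encode a non-negative integer as a Protobuf varint."""
    if value < 0:
        raise ValueError("this sketch only handles non-negative integers")
    out = bytearray()
    while True:
        low7 = value & 0x7F
        value >>= 7
        if value:
            out.append(low7 | 0x80)  # more bytes follow, so set the MSB
        else:
            out.append(low7)         # last byte, MSB stays clear
            return bytes(out)


def key_size(field_number: int, wire_type: int = 0) -> int:
    """Number of bytes the key (field_number << 3 | wire_type) occupies."""
    return len(encode_varint((field_number << 3) | wire_type))


for number in (1, 15, 16, 2047, 2048):
    print(number, key_size(number))
```

Field number 15 still fits because (15 << 3) | 7 = 127 is the largest value a single varint byte can hold; 16 pushes the key to 128 and therefore to a second byte.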

service / rpc (RPC service definition)

Used for gRPC and in-house protocols. For example:

  service UserService {
    rpc GetUser (GetUserRequest) returns (GetUserResponse);
  }

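enum (enumeration)

Defines enumeration constants:

  enum UserType {
    USER_TYPE_UNKNOWN = 0;
    USER_TYPE_NORMAL  = 1;
    USER_TYPE_ADMIN   = 2;
  }

Rules: the first value must be 0 (enforced by proto3), and on the wire an enum is encoded as an int32 varint.

A relatively complete .proto file looks like this: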

syntax = "proto3";

package api.user;

import "google/protobuf/timestamp.proto";

option java_package = "com.example.api.user";
option java_outer_classname = "UserProto";

message User {
  uint32 id = 1;
  string name = 2;
  UserType type = 3;
  repeated string roles = 4;
  google.protobuf.Timestamp create_time = 5;
}

enum UserType {
  USER_TYPE_UNKNOWN = 0;
  USER_TYPE_NORMAL = 1;
  USER_TYPE_ADMIN = 2;
}

// GetUserRequest and GetUserResponse are not defined in this excerpt and must be declared elsewhere.
service UserService {
  rpc GetUser (GetUserRequest) returns (GetUserResponse);
}

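2. How can you tell from the data alone that it is Protobuf-serialized?

A raw chunk of binary can never be proven to be Protobuf with 100% certainty, because Protobuf has no mandatory global magic number or fixed header. Combining strong structural features with statistical and consistency checks still gives very high accuracy.

2.1 General principles of Protobuf serialization

Protobuf serializes a message as a concatenation of binary fields, where each field = key + value.

There are only three core design goals:

Compact (as few bytes as possible)
Fast to parse (unknown fields can be skipped without the schema)
Forward and backward compatible (thanks to field numbers)

2.2 Basic structure of a field: Key + Value

The key encoding rule is critical. The key is a varint that carries two pieces of information: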

key = (field_number << 3) | wire_type

field_number: the field number (>= 1; 0 is illegal)
wire_type: how the value is encoded (the low 3 bits)

wire_type values

| wire_type | Binary | Meaning | Typical types |
| --- | --- | --- | --- |
| 0 | 000 | Varint | int32, int64, uint32, uint64, sint32, sint64, bool, enum |
| 1 | 001 | 64-bit | fixed64, sfixed64, double |
| 2 | 010 | Length-delimited | string, bytes, message, packed repeated |
| 3 | 011 | Start group | deprecated |
| 4 | 100 | End group | deprecated |
| 5 | 101 | 32-bit | fixed32, sfixed32, float |

One of the core clues for recognizing Protobuf: the low 3 bits of every key almost always fall in {0, 1, 2, 5}.
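The key formula can be inverted with a shift and a mask. The sketch below assumes the key has already been decoded from its varint form into an integer; `WIRE_TYPES` and `split_key` are illustrative names, not part of any Protobuf library:

```python
WIRE_TYPES = {
    0: "varint",
    1: "64-bit",
    2: "length-delimited",
    3: "start group (deprecated)",
    4: "end group (deprecated)",
    5: "32-bit",
}


def split_key(key: int) -> tuple[int, str]:
    """Split a decoded key into (field_number, wire type name)."""
    field_number = key >> 3   # everything above the low 3 bits
    wire_type = key & 0x07    # the low 3 bits
    if field_number == 0:
        raise ValueError("field number 0 is not allowed")
    if wire_type not in WIRE_TYPES:
        raise ValueError(f"wire type {wire_type} is not defined")
    return field_number, WIRE_TYPES[wire_type]


for key in (0x0A, 0x12, 0x20, 0x31):
    print(hex(key), split_key(key))
```

When scanning unknown traffic, a key that yields field number 0 or wire type 6 or 7 is a strong hint that the parser is misaligned or that the data is not Protobuf.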

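2.3 Value serialization rules (by wire_type)

wire_type = 0: Varint (variable-length integer). Each byte stores 7 bits of data, least significant group first, and the most significant bit (MSB) is a continuation flag: 1 means more bytes follow, 0 means this is the last byte. Examples:

  12     → 0x12 → 18
  a5 01  → 0b10100101 00000001 → 165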
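The two examples above can be reproduced with a few lines of code. This is a minimal sketch of an unsigned varint decoder; the function name `decode_varint` and the 10-byte cap (enough for a 64-bit value) are choices made for this illustration:

```python
def decode_varint(buf: bytes, pos: int = 0) -> tuple[int, int]:
    """Decode one varint starting at buf[pos]; return (value, next_pos)."""
    value = 0
    shift = 0
    while True:
        if pos >= len(buf):
            raise ValueError("truncated varint")
        byte = buf[pos]
        pos += 1
        value |= (byte & 0x7F) << shift  # low 7 bits are data, least significant group first
        if byte & 0x80 == 0:             # MSB clear: this was the last byte
            return value, pos
        shift += 7
        if shift >= 70:
            raise ValueError("varint longer than 10 bytes")


print(decode_varint(bytes.fromhex("12")))    # (18, 1)
print(decode_varint(bytes.fromhex("a501")))  # (165, 2)
```

For `a5 01`, the first byte contributes 0x25 = 37 and the second contributes 1 << 7 = 128, which adds up to 165.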

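wire_type = 1: 64-bit (fixed 8 bytes). Fixed 8 bytes, little-endian, no length prefix; used by fixed64, sfixed64, and double. When you see wire type 1, read 8 bytes and try both the double and the uint64 interpretation.
wire_type = 5: 32-bit (fixed 4 bytes). Fixed 4 bytes, little-endian; used by fixed32, sfixed32, and float.
wire_type = 2: Length-delimited (the most complex and also the most common).

  [length (varint)] + [length bytes of payload]

Possible meanings of the payload: string (UTF-8), bytes, a nested message, or packed repeated (consecutive primitive values).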
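Because the wire format does not say which of these readings is the right one, a practical approach is to compute every candidate and pick the plausible one by eye. The sketch below uses only the standard `struct` module; the helper names and the "printable UTF-8 means string" rule are assumptions of this example, and the rule can misfire on short binary payloads that happen to be printable:

```python
import struct


def interpret_fixed64(raw: bytes) -> dict:
    """Candidate readings of an 8-byte little-endian value (wire type 1)."""
    return {
        "double": struct.unpack("<d", raw)[0],
        "uint64": struct.unpack("<Q", raw)[0],
        "int64": struct.unpack("<q", raw)[0],
    }


def interpret_fixed32(raw: bytes) -> dict:
    """Candidate readings of a 4-byte little-endian value (wire type 5)."""
    return {
        "float": struct.unpack("<f", raw)[0],
        "uint32": struct.unpack("<I", raw)[0],
        "int32": struct.unpack("<i", raw)[0],
    }


def guess_payload(payload: bytes) -> str:
    """Rough guess for a length-delimited payload (wire type 2)."""
    ambiguous = "bytes / nested message / packed repeated"
    try:
        text = payload.decode("utf-8")
    except UnicodeDecodeError:
        return ambiguous
    # Non-empty, printable UTF-8 is treated as a string; everything else stays ambiguous.
    return "string" if text and text.isprintable() else ambiguous


print(interpret_fixed64(bytes.fromhex("6666666666e64840")))
print(guess_payload(b"admin"))
print(guess_payload(bytes.fromhex("a501")))
```

A double reading that comes out as a round, human-scale number (a weight, a price, a coordinate) is usually the right one, while the same bytes read as uint64 tend to produce an implausibly large integer.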

2.4 Field order, optionality, and default values

Field order does not matter

  message User {
    string name = 1;
    int32 age = 2;
  }

The following two serializations are equivalent:

[1=name][2=age]
[2=age][1=name]

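The parser matches fields by field_number, not by position.

Fields may be absent (default values are not serialized)

A field that does not appear takes its default value, and default values are usually not written to the data at all. This is one of the main reasons Protobuf is smaller than JSON.

| Type | Default value |
| --- | --- |
| int / enum | 0 |
| bool | false |
| string | "" |
| message | null (unset) |

When reading a chunk of binary data, you can follow this procedure: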

while not EOF:
    read key (varint)
    field_number = key >> 3
    wire_type = key & 0x07

    if wire_type == 0:
        read varint
    elif wire_type == 1:
        read 8 bytes
    elif wire_type == 2:
        read len (varint)
        read len bytes of payload
    elif wire_type == 5:
        read 4 bytes
    else:
        reject: illegal or deprecated wire type

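If the walk reaches the end of the data with every step valid, you can be fairly confident that this is Protobuf wire-format data; the longer the buffer, the stronger the evidence.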
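The walk described above can be turned into a simple yes/no check. The following is a sketch rather than a definitive detector: `looks_like_protobuf` is a name invented here, it treats an empty buffer as "no evidence", and short random inputs can still pass by chance.

```python
def looks_like_protobuf(buf: bytes) -> bool:
    """Return True if buf can be walked end to end as a sequence of valid fields."""

    def read_varint(pos: int) -> tuple[int, int]:
        value, shift = 0, 0
        while True:
            if pos >= len(buf) or shift >= 70:
                raise ValueError("truncated or oversized varint")
            byte = buf[pos]
            pos += 1
            value |= (byte & 0x7F) << shift
            if byte & 0x80 == 0:
                return value, pos
            shift += 7

    if not buf:
        return False  # an empty message is valid Protobuf but proves nothing
    pos = 0
    try:
        while pos < len(buf):
            key, pos = read_varint(pos)
            field_number, wire_type = key >> 3, key & 0x07
            if field_number == 0:
                return False
            if wire_type == 0:
                _, pos = read_varint(pos)
            elif wire_type == 1:
                pos += 8
            elif wire_type == 2:
                length, pos = read_varint(pos)
                pos += length
            elif wire_type == 5:
                pos += 4
            else:
                return False  # 3 and 4 are deprecated groups, 6 and 7 are undefined
            if pos > len(buf):
                return False  # a value claimed more bytes than the buffer holds
    except ValueError:
        return False
    return True


print(looks_like_protobuf(bytes.fromhex("0a0561646d696e2012")))
print(looks_like_protobuf(b"hello world"))
```

To reduce false positives, the same check can be applied recursively to length-delimited payloads, or combined with the expectation that field numbers in real schemas are usually small.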

2.5 A .proto walkthrough

Suppose we have the following .proto file:

syntax = "proto3";

package user;

// User account information
message UserAccount {
  string username = 1; // account name
  string password = 2; // password
  string email    = 3; // email address
  int32  age      = 4; // age
  int32  height   = 5; // height (cm)
  double weight   = 6; // weight (kg)
}

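Generate the Python code (the descriptor in the listing below only describes username, password, and email, so it appears to come from an earlier three-field version of this file):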

protoc --python_out=. user.proto

cat user_pb2.py
# -*- coding: utf-8 -*-
# Generated by the protocol buffer compiler.  DO NOT EDIT!
# source: user.proto
"""Generated protocol buffer code."""
from google.protobuf.internal import builder as _builder
from google.protobuf import descriptor as _descriptor
from google.protobuf import descriptor_pool as _descriptor_pool
from google.protobuf import symbol_database as _symbol_database
# @@protoc_insertion_point(imports)

_sym_db = _symbol_database.Default()

DESCRIPTOR = _descriptor_pool.Default().AddSerializedFile(b'\n\nuser.proto\x12\x04user\"@\n\x0bUserAccount\x12\x10\n\x08username\x18\x01 \x01(\t\x12\x10\n\x08password\x18\x02 \x01(\t\x12\r\n\x05\x65mail\x18\x03 \x01(\tb\x06proto3')

_builder.BuildMessageAndEnumDescriptors(DESCRIPTOR, globals())
_builder.BuildTopDescriptorsAndMessages(DESCRIPTOR, 'user_pb2', globals())
if _descriptor._USE_C_DESCRIPTORS == False:

  DESCRIPTOR._options = None
  _USERACCOUNT._serialized_start=20
  _USERACCOUNT._serialized_end=84
# @@protoc_insertion_point(module_scope)

3. Recovering the .proto file from binary data

3.1 Serializing an object

Let's serialize an object and analyze the result:

import user_pb2  # the file generated in the previous section

# Create the message object
user = user_pb2.UserAccount()
user.username = "admin"
user.password = "123456"
user.email = "admin@example.com"
user.age = 18
user.height = 165
user.weight = 49.8
# Serialize to bytes (binary)
data = user.SerializeToString()

print("serialized bytes:", data)
print("hex:", data.hex())

The serialized data:

serialized bytes: b'\n\x05admin\x12\x06123456\x1a\x11admin@example.com \x12(\xa5\x011fffff\xe6H@'
hex: 0a0561646d696e12063132333435361a1161646d696e406578616d706c652e636f6d201228a501316666666666e64840

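3.2 Reconstructing the .proto file

Applying the rules from the previous section:

key = 0x0a

→ field_number = key >> 3

→ field_number = 1

→ wire_type = key & 0x07

→ wire_type = 2

→ length-delimited

→ length = 0x05

→ read 5 bytes: 61646d696e

→ "admin"

So the first field is most likely a string. Repeat the same steps to recover the remaining fields. The wire format carries field numbers and wire types but not field names or exact scalar types, so names and distinctions such as int32 versus int64 have to be inferred from the values and the context. Once every field has been parsed correctly, and because an IDL is not changed after it is published, the Protobuf reverse engineering is essentially complete.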
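Repeating the manual steps for every field is mechanical, so it can be scripted. The sketch below walks the hex dump from section 3.1 and prints a candidate .proto skeleton. Everything about the output is a guess: `field_N` names are placeholders, varints are reported as int64 (the wire cannot distinguish int32, uint32, bool, or enum), wire type 1 is read as double, and repeated fields and nested messages are not handled.

```python
import struct

HEX = ("0a0561646d696e12063132333435361a1161646d696e406578616d706c652e636f6d"
       "201228a501316666666666e64840")


def read_varint(buf: bytes, pos: int) -> tuple[int, int]:
    value, shift = 0, 0
    while True:
        if pos >= len(buf):
            raise ValueError("truncated varint")
        byte = buf[pos]
        pos += 1
        value |= (byte & 0x7F) << shift
        if byte & 0x80 == 0:
            return value, pos
        shift += 7


def parse_fields(buf: bytes) -> list:
    """Return [(field_number, wire_type, value), ...] in wire order."""
    fields, pos = [], 0
    while pos < len(buf):
        key, pos = read_varint(buf, pos)
        number, wire_type = key >> 3, key & 0x07
        if wire_type == 0:
            value, pos = read_varint(buf, pos)
        elif wire_type == 1:
            value, pos = struct.unpack_from("<d", buf, pos)[0], pos + 8
        elif wire_type == 2:
            length, pos = read_varint(buf, pos)
            value, pos = buf[pos:pos + length], pos + length
        elif wire_type == 5:
            value, pos = struct.unpack_from("<f", buf, pos)[0], pos + 4
        else:
            raise ValueError(f"unsupported wire type {wire_type}")
        fields.append((number, wire_type, value))
    return fields


def guess_type(wire_type: int, value) -> str:
    if wire_type == 0:
        return "int64"   # could equally be int32, uint32, bool, or enum
    if wire_type == 1:
        return "double"  # could equally be fixed64 or sfixed64
    if wire_type == 5:
        return "float"   # could equally be fixed32 or sfixed32
    try:
        text = value.decode("utf-8")
    except UnicodeDecodeError:
        return "bytes"
    return "string" if text.isprintable() else "bytes"


def sketch_proto(fields: list, message_name: str = "Recovered") -> str:
    first_seen = {}
    for number, wire_type, value in fields:
        first_seen.setdefault(number, (wire_type, value))  # keep the first sample per field
    lines = ['syntax = "proto3";', "", f"message {message_name} {{"]
    for number in sorted(first_seen):
        wire_type, value = first_seen[number]
        lines.append(f"  {guess_type(wire_type, value)} field_{number} = {number};  // sample: {value!r}")
    lines.append("}")
    return "\n".join(lines)


fields = parse_fields(bytes.fromhex(HEX))
print(sketch_proto(fields))
```

The skeleton is a starting point: comparing several captured messages, and looking at how the client uses each value, is what turns `field_4` into something like `age` and narrows int64 down to the real scalar type.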


Reposted from: 二进制磨剑, "Protobuf Reverse Engineering in Practice: From Wire Format to a Complete .proto File"