2026-01-23 12:08:12 | Category: Cybersecurity articles | Source: ZONE.CI 全球网

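Summary: The article walks through an AI Coding practice that uses Tencent CodeBuddy for malware family classification. It first reviews the bottlenecks of traditional static and dynamic feature extraction, then uses CodeBuddy to automate data preprocessing, random forest and CNN-BiLSTM modeling, and t-SNE visualization, reaching 91.32% accuracy on the test set. It provides runnable code and a link to an open-source dataset, and shows that AI Coding improves analysis efficiency and engineering consistency.
Overall score: 88
Categories: AI security, malware, security tools, vulnerability analysis, security operations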

Can you feel the appeal and power of large models and AI Coding?

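3. AI-Assisted Malware Family Classification with Machine Learning

Next, we run machine-learning-based malware analysis on the preprocessed dataset.

Step 1: write a precise prompt.

Please write Python code that builds a random forest model. Read the training set train_dataset.csv and the test set test_dataset.csv from the processed directory, and build feature vectors from the five feature columns [tactic,technique,tid,api,dynamic_api]. These five columns combine the static and dynamic features of the malware and are concatenated directly to form the dataset. The family label is the [label] column, with 5 families in total. Use sklearn to build the random forest and evaluate its performance, reporting precision, recall, F1 score, and accuracy to 4 significant digits. Provide detailed code and plot visualizations, including a confusion matrix.

Step 2: paste the prompt into the chat box, select a large model, and submit.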
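The article does not reproduce the random forest code that CodeBuddy generated, so the following is only a minimal sketch of what the prompt asks for, not the author's implementation. It assumes per-column TF-IDF encoding (borrowed from the deep learning excerpt later in the article), macro-averaged metrics, and rounding to four decimal places as a reading of the "4 significant digits" requirement. The helper names `build_features`, `evaluate_random_forest`, and `make_toy_frame` are hypothetical, and the toy frame stands in for the processed CSV files.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

FEATURE_COLUMNS = ["tactic", "technique", "tid", "api", "dynamic_api"]


def build_features(train_df, test_df, columns=FEATURE_COLUMNS):
    """Fit one TF-IDF vectorizer per column on the training set, then concatenate."""
    train_parts, test_parts = [], []
    for col in columns:
        # Note: TfidfVectorizer raises ValueError if a column contains no tokens at all.
        vectorizer = TfidfVectorizer()
        train_text = train_df[col].fillna("").astype(str)
        test_text = test_df[col].fillna("").astype(str)
        train_parts.append(vectorizer.fit_transform(train_text).toarray())
        test_parts.append(vectorizer.transform(test_text).toarray())
    return np.hstack(train_parts), np.hstack(test_parts)


def evaluate_random_forest(train_df, test_df, label_col="label", seed=42):
    """Train a random forest and return macro-averaged metrics plus predictions."""
    X_train, X_test = build_features(train_df, test_df)
    clf = RandomForestClassifier(n_estimators=200, random_state=seed)
    clf.fit(X_train, train_df[label_col])
    y_pred = clf.predict(X_test)
    precision, recall, f1, _ = precision_recall_fscore_support(
        test_df[label_col], y_pred, average="macro", zero_division=0
    )
    metrics = {
        "precision": round(float(precision), 4),
        "recall": round(float(recall), 4),
        "f1": round(float(f1), 4),
        "accuracy": round(float(accuracy_score(test_df[label_col], y_pred)), 4),
    }
    return metrics, y_pred


def make_toy_frame(n_per_family=6, n_families=5):
    """Hypothetical stand-in data; a real run would load the processed CSV files."""
    rows = []
    for i in range(n_families):
        for _ in range(n_per_family):
            rows.append({
                "tactic": f"tactic{i} shared",
                "technique": f"technique{i} shared",
                "tid": f"T10{i}0",
                "api": f"StaticApi{i} CreateFileW",
                "dynamic_api": f"DynamicApi{i} ReadFile",
                "label": f"family{i}",
            })
    return pd.DataFrame(rows)


if __name__ == "__main__":
    toy = make_toy_frame()
    # Training and testing on the same toy frame is a smoke test only.
    metrics, _ = evaluate_random_forest(toy, toy)
    print(metrics)
```

For the real dataset, the two frames would come from `pd.read_csv` on train_dataset.csv and test_dataset.csv, and the metrics would of course differ from the toy run.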

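4. AI-Assisted Malware Family Classification with Deep Learning

Next, we use CodeBuddy to generate a deep learning CNN-BiLSTM model for family classification.

Step 1: write the prompt.

Please write Python PyTorch code that builds a CNN-BiLSTM model. Read the training set train_dataset.csv and the test set test_dataset.csv from the processed directory, and build feature vectors from the five feature columns [tactic,technique,tid,api,dynamic_api]. These five columns combine the static and dynamic features of the malware and are concatenated directly to form the dataset. The family label is the [label] column, with 5 families in total. Use the CNN-BiLSTM model to classify malware families and evaluate its performance, reporting precision, recall, F1 score, and accuracy to 4 significant digits. Provide detailed code and plot visualizations, including a confusion matrix, and make sure the program runs without errors.

Step 2: enter the prompt in CodeBuddy, which generates the code after a deep-thinking pass.

Step 3: run the generated deep learning code automatically.

Because the author runs on a CPU, the code takes a long time to execute, so readers are encouraged to try it themselves. Note that if the code raises an error, you should learn to talk it through with CodeBuddy and iterate on the code until it does what is needed.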

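The final generated code is shown in the figure below. At more than 500 lines, it is fairly large.

The key model-building code is shown below. For the complete code, see the author's GitHub.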

https://github.com/eastmountyxz/LLM-for-Malware

# Imports and device setup required by this excerpt
import numpy as np
import pandas as pd
import torch
import torch.nn as nn
import torch.optim as optim
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import LabelEncoder
from torch.utils.data import DataLoader, Dataset

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

# 1. Define the custom dataset class
class MalwareDataset(Dataset):
    def __init__(self, features, labels):
        self.features = torch.FloatTensor(features)
        self.labels = torch.LongTensor(labels)

    def __len__(self):
        return len(self.labels)

    def __getitem__(self, idx):
        return self.features[idx], self.labels[idx]

# 2. Define the CNN-BiLSTM model
class CNNBiLSTM(nn.Module):
    def __init__(self, input_dim, hidden_dim, num_layers, num_classes, dropout=0.5):
        super(CNNBiLSTM, self).__init__()

        # CNN part
        self.conv1 = nn.Conv1d(in_channels=1, out_channels=64, kernel_size=3, padding=1)
        self.conv2 = nn.Conv1d(in_channels=64, out_channels=128, kernel_size=3, padding=1)
        self.pool = nn.MaxPool1d(kernel_size=2)
        self.dropout1 = nn.Dropout(dropout)

        # Compute the CNN output dimension
        # Input: (batch_size, 1, input_dim)
        # Conv1: (batch_size, 64, input_dim)
        # Pool1: (batch_size, 64, input_dim//2)
        # Conv2: (batch_size, 128, input_dim//2)
        # Pool2: (batch_size, 128, input_dim//4)
        self.cnn_output_dim = 128 * (input_dim // 4)

        # BiLSTM part
        self.bilstm = nn.LSTM(
            input_size=self.cnn_output_dim,
            hidden_size=hidden_dim,
            num_layers=num_layers,
            batch_first=True,
            bidirectional=True,
            dropout=dropout if num_layers > 1 else 0
        )

        # Fully connected layer
        self.dropout2 = nn.Dropout(dropout)
        self.fc = nn.Linear(hidden_dim * 2, num_classes)  # a bidirectional LSTM outputs 2 * hidden_dim

        # Activation function
        self.relu = nn.ReLU()

    def forward(self, x):
        # x shape: (batch_size, input_dim)

        # Reshape for CNN: (batch_size, 1, input_dim)
        x = x.unsqueeze(1)

        # CNN part
        x = self.conv1(x)  # (batch_size, 64, input_dim)
        x = self.relu(x)
        x = self.pool(x)  # (batch_size, 64, input_dim//2)
        x = self.dropout1(x)

        x = self.conv2(x)  # (batch_size, 128, input_dim//2)
        x = self.relu(x)
        x = self.pool(x)  # (batch_size, 128, input_dim//4)
        x = self.dropout1(x)

        # Flatten for LSTM: (batch_size, cnn_output_dim)
        x = x.view(x.size(0), -1)

        # Reshape for LSTM: (batch_size, 1, cnn_output_dim), i.e. a sequence of length 1
        x = x.unsqueeze(1)

        # BiLSTM part
        lstm_out, (h_n, c_n) = self.bilstm(x)  # h_n shape: (num_layers*2, batch_size, hidden_dim)

        # Concatenate the final forward and backward hidden states of the last layer
        h_forward = h_n[-2]  # last layer, forward direction
        h_backward = h_n[-1]  # last layer, backward direction
        x = torch.cat([h_forward, h_backward], dim=1)  # (batch_size, hidden_dim*2)

        # Fully connected layer
        x = self.dropout2(x)
        x = self.fc(x)  # (batch_size, num_classes)

        return x

# 3. Read the data
print("\n[Step 1: Read data]")
train_df = pd.read_csv('c:/Users/xiuzhang/Desktop/mal_analysis/processed/train_dataset.csv')
test_df = pd.read_csv('c:/Users/xiuzhang/Desktop/mal_analysis/processed/test_dataset.csv')

print(f"Training samples: {len(train_df)}")
print(f"Test samples: {len(test_df)}")

# 4. Feature engineering
print("\n[Step 2: Feature engineering]")
feature_columns = ['tactic', 'technique', 'tid', 'api', 'dynamic_api']

# Fill missing values
for col in feature_columns:
    train_df[col] = train_df[col].fillna('')
    test_df[col] = test_df[col].fillna('')

# Encode with TF-IDF
print("Encoding text features with TF-IDF...")
tactic_tfidf = TfidfVectorizer(max_features=100, token_pattern=r'(?u)\b\w+\b|;')
technique_tfidf = TfidfVectorizer(max_features=100, token_pattern=r'(?u)\b\w+\b|;')
tid_tfidf = TfidfVectorizer(max_features=50, token_pattern=r'(?u)\b\w+\b|;')
api_tfidf = TfidfVectorizer(max_features=200, token_pattern=r'(?u)\b\w+\b|;')
dynamic_api_tfidf = TfidfVectorizer(max_features=200, token_pattern=r'(?u)\b\w+\b|;')

# Training set: fit the vectorizers here
train_tactic = tactic_tfidf.fit_transform(train_df['tactic'].astype(str)).toarray()
train_technique = technique_tfidf.fit_transform(train_df['technique'].astype(str)).toarray()
train_tid = tid_tfidf.fit_transform(train_df['tid'].astype(str)).toarray()
train_api = api_tfidf.fit_transform(train_df['api'].astype(str)).toarray()
train_dynamic_api = dynamic_api_tfidf.fit_transform(train_df['dynamic_api'].astype(str)).toarray()

X_train = np.hstack([train_tactic, train_technique, train_tid, train_api, train_dynamic_api])

# Test set: reuse the fitted vectorizers (transform only)
test_tactic = tactic_tfidf.transform(test_df['tactic'].astype(str)).toarray()
test_technique = technique_tfidf.transform(test_df['technique'].astype(str)).toarray()
test_tid = tid_tfidf.transform(test_df['tid'].astype(str)).toarray()
test_api = api_tfidf.transform(test_df['api'].astype(str)).toarray()
test_dynamic_api = dynamic_api_tfidf.transform(test_df['dynamic_api'].astype(str)).toarray()

X_test = np.hstack([test_tactic, test_technique, test_tid, test_api, test_dynamic_api])

print(f"Training feature matrix shape: {X_train.shape}")
print(f"Test feature matrix shape: {X_test.shape}")

# 5. Label encoding
print("\n[Step 3: Label encoding]")
label_encoder = LabelEncoder()
y_train = label_encoder.fit_transform(train_df['label'])
y_test = label_encoder.transform(test_df['label'])

print(f"Label classes: {label_encoder.classes_}")
print(f"Number of labels: {len(label_encoder.classes_)}")

# 6. Create the data loaders
print("\n[Step 4: Create data loaders]")
train_dataset = MalwareDataset(X_train, y_train)
test_dataset = MalwareDataset(X_test, y_test)

batch_size = 32
train_loader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True)
test_loader = DataLoader(test_dataset, batch_size=batch_size, shuffle=False)

print(f"Training batches: {len(train_loader)}")
print(f"Test batches: {len(test_loader)}")

# 7. Create the model
print("\n[Step 5: Create the CNN-BiLSTM model]")
input_dim = X_train.shape[1]  # 560 per the original comment; the max_features caps above sum to 650
hidden_dim = 128
num_layers = 2
num_classes = len(label_encoder.classes_)  # 5

model = CNNBiLSTM(input_dim, hidden_dim, num_layers, num_classes, dropout=0.5)
model = model.to(device)

print("Model structure:")
print(model)
print(f"\nNumber of model parameters: {sum(p.numel() for p in model.parameters()):,}")

# 8. Define the loss function and optimizer
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=0.001, weight_decay=1e-4)
# verbose=True from the generated code is omitted: the argument is deprecated in recent PyTorch releases
scheduler = optim.lr_scheduler.ReduceLROnPlateau(optimizer, mode='min', factor=0.5, patience=5)
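
The excerpt stops once the loss function, optimizer, and scheduler are defined; the training loop itself is in the full code on GitHub and is not shown here. As a rough sketch of how those three objects are typically used together (not the author's loop), the code below trains for a few epochs and steps `ReduceLROnPlateau` on a monitored loss. The names `train_one_epoch`, `evaluate`, `fit`, and `make_toy_loaders` are hypothetical, the small `nn.Sequential` network is only a stand-in for `CNNBiLSTM`, and the random tensors replace the TF-IDF features. It also assumes a held-out validation loader drives the scheduler.

```python
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import DataLoader, TensorDataset


def train_one_epoch(model, loader, criterion, optimizer, device):
    """Run one pass over the training data and return the mean loss."""
    model.train()
    total_loss = 0.0
    for features, labels in loader:
        features, labels = features.to(device), labels.to(device)
        optimizer.zero_grad()
        loss = criterion(model(features), labels)
        loss.backward()
        optimizer.step()
        total_loss += loss.item() * labels.size(0)
    return total_loss / len(loader.dataset)


@torch.no_grad()
def evaluate(model, loader, criterion, device):
    """Return (mean loss, accuracy) without updating the model."""
    model.eval()
    total_loss, correct = 0.0, 0
    for features, labels in loader:
        features, labels = features.to(device), labels.to(device)
        logits = model(features)
        total_loss += criterion(logits, labels).item() * labels.size(0)
        correct += (logits.argmax(dim=1) == labels).sum().item()
    n = len(loader.dataset)
    return total_loss / n, correct / n


def fit(model, train_loader, val_loader, epochs=5, lr=1e-3, device="cpu"):
    """Train with Adam and halve the learning rate when the monitored loss plateaus."""
    model = model.to(device)
    criterion = nn.CrossEntropyLoss()
    optimizer = optim.Adam(model.parameters(), lr=lr, weight_decay=1e-4)
    scheduler = optim.lr_scheduler.ReduceLROnPlateau(
        optimizer, mode="min", factor=0.5, patience=5
    )
    history = []
    for _ in range(epochs):
        train_loss = train_one_epoch(model, train_loader, criterion, optimizer, device)
        val_loss, val_acc = evaluate(model, val_loader, criterion, device)
        scheduler.step(val_loss)  # ReduceLROnPlateau needs the monitored metric
        history.append((train_loss, val_loss, val_acc))
    return history


def make_toy_loaders(n=64, dim=16, classes=5, seed=0):
    """Random stand-in data; the same dataset is reused for validation as a smoke test."""
    gen = torch.Generator().manual_seed(seed)
    X = torch.randn(n, dim, generator=gen)
    y = torch.randint(0, classes, (n,), generator=gen)
    dataset = TensorDataset(X, y)
    return DataLoader(dataset, batch_size=16, shuffle=True), DataLoader(dataset, batch_size=16)


if __name__ == "__main__":
    train_loader, val_loader = make_toy_loaders()
    stand_in = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 5))
    for epoch, (tr, va, acc) in enumerate(fit(stand_in, train_loader, val_loader), start=1):
        print(f"epoch {epoch}: train_loss={tr:.4f} val_loss={va:.4f} val_acc={acc:.4f}")
```

With the article's objects, `fit` would receive the `CNNBiLSTM` instance and `train_loader`; whether to carve a validation split out of the training set or monitor the test loader directly is a design choice the excerpt does not settle.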

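5. AI-Assisted Visualization and Cluster Analysis

Finally, we try a dimensionality-reduction visualization. The prompt is as follows:

We now need a dimensionality-reduction visualization. Please read the features [tactic,technique,tid,api,dynamic_api] from test_dataset.csv to build vectors and use t-SNE to visualize them. The family label is the label column, with 5 families in total. The final result should be a good-looking cluster plot. Note that all code must be written in Python, each family must use a different color, and the result should look polished.

The result is shown in the figure below. The representation still needs further refinement based on the experimental features.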
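
The generated t-SNE code is not reproduced in the article, so the following is a minimal sketch of what the prompt describes rather than the author's script. It assumes the same per-column TF-IDF encoding used in the CNN-BiLSTM excerpt, picks the perplexity with a simple heuristic so that it stays below the sample count, and draws one scatter series per family so each gets its own color. The helper names `vectorize`, `tsne_embed`, and `plot_families` are hypothetical, and the demo uses synthetic clusters instead of test_dataset.csv.

```python
import os
import tempfile

import matplotlib
matplotlib.use("Agg")  # headless backend, so the sketch also runs without a display
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.manifold import TSNE

FEATURE_COLUMNS = ["tactic", "technique", "tid", "api", "dynamic_api"]


def vectorize(df, columns=FEATURE_COLUMNS):
    """TF-IDF encode each text column and concatenate the results."""
    parts = [
        TfidfVectorizer().fit_transform(df[col].fillna("").astype(str)).toarray()
        for col in columns
    ]
    return np.hstack(parts)


def tsne_embed(X, seed=42):
    """Project X to 2-D; perplexity must stay below the number of samples."""
    perplexity = min(30, max(2, (len(X) - 1) // 3))  # heuristic, an assumption
    tsne = TSNE(n_components=2, perplexity=perplexity, init="pca", random_state=seed)
    return tsne.fit_transform(X)


def plot_families(embedding, labels, out_path):
    """Scatter the 2-D embedding with one color per family and save the figure."""
    labels = np.asarray(labels)
    fig, ax = plt.subplots(figsize=(8, 6))
    for family in sorted(set(labels)):
        mask = labels == family
        ax.scatter(embedding[mask, 0], embedding[mask, 1], s=20, alpha=0.8, label=str(family))
    ax.set_xlabel("t-SNE dimension 1")
    ax.set_ylabel("t-SNE dimension 2")
    ax.set_title("t-SNE projection of malware family features")
    ax.legend(title="Family")
    fig.tight_layout()
    fig.savefig(out_path, dpi=150)
    plt.close(fig)


if __name__ == "__main__":
    # Synthetic clusters stand in for vectorize(pd.read_csv(...)) on the real test set.
    rng = np.random.default_rng(0)
    centers = rng.normal(scale=5.0, size=(5, 8))
    X = np.vstack([center + rng.normal(size=(10, 8)) for center in centers])
    labels = np.repeat([f"family{i}" for i in range(5)], 10)
    out_path = os.path.join(tempfile.mkdtemp(), "tsne.png")
    plot_families(tsne_embed(X), labels, out_path)
    print("figure saved to", out_path)
```

Because t-SNE preserves local neighborhoods rather than global distances, overlapping clusters in such a plot point to the feature representation, which matches the article's remark that the representation still needs work.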

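IV. Summary

This article focuses on combining AI Coding with security analysis and systematically explores how CodeBuddy can be applied to malware analysis and family classification. Starting from the limitations of traditional static and dynamic feature analysis, it shows how AI Coding driven by large language models markedly improves analysis efficiency and engineering consistency across feature extraction, data preprocessing, classification modeling, and visualization, which demonstrates the practical value of intelligent methods in complex security tasks.

Looking ahead, large language models (LLMs) and agents will play an increasingly central role in malware analysis.

At the feature-modeling level, LLMs are expected to reach a semantic-level understanding of binary code, disassembly output, and runtime logs, reducing the reliance on manual feature engineering and improving robustness against obfuscation, variants, and adversarial samples.
At the workflow level, security agents with planning and execution capabilities can break a malware analysis task into an automated multi-step pipeline, enabling autonomous, coordinated analysis from sample collection and behavior analysis through to family attribution.
At the knowledge level, LLMs can be deeply integrated with knowledge graphs and threat intelligence bases to support cross-sample and cross-family correlation reasoning and attack-chain reconstruction, making the analysis results more explainable and traceable.

In engineering practice, the deep integration of AI Coding platforms with security toolchains will also move malware analysis from "tool-driven" to "intelligent collaboration", gradually freeing security analysts from low-level implementation details so they can focus on threat modeling and decision support. Overall, LLMs and agents will not only reshape the technical path of malware analysis but also point toward efficient, intelligent, and evolvable security analysis systems.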

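Meanwhile, Eastmount has officially launched the 《AI Coding》 column, which will keep publishing series on LLM-assisted programming, evaluations of Chinese-developed AI IDE tools, and hands-on AI-automated development. You are welcome to follow the column, explore the frontier of intelligent development together, and keep learning and improving. This is an introductory article; I hope it helps, and please forgive anything that is poorly written. Here's to 2026!

『网络攻防和AI安全之家』 has received support and likes from many fellow bloggers, friends, and teachers, and keeps up seven updates a week. Some longtime readers who have followed my articles for years even subscribed just to say thank you, which truly moved me. Going forward, I will share more high-quality articles and more practical security content, and I sincerely hope they help everyone. I started late, but persistence is what counts: as with more than a decade of blogging, I will keep my feet on the ground and make the most of every day. Onward, and thank you again!

(By: Eastmount, Thursday 2026-01-22, written in Guiyang)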

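Disclaimer:

The programs and technical methods described in this article are intended only for lawful, compliant security research and teaching. They aim to improve cybersecurity defense capabilities and are clearly technical research in nature.

Any organization or individual who uses this content without authorization for attacks, sabotage, or other illegal purposes bears sole responsibility for all resulting legal liability, civil compensation, and joint liability. This site assumes no joint liability.

All content on this site is published for technical exchange and knowledge sharing. If there is a copyright infringement or any other objection, please get in touch by email. Contact details are available through the "Contact Me" link at the top of the page.

This article is reposted from: 娜璋AI安全之家 (Eastmount), 《[AI Coding+安全] 二.CodeBuddy赋能恶意代码分析与家族分类实践（肝货）》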