Chapter 17 Computer Use + GUI Agent: AI Takes Control of the Computer


Original article · 网络安全民工 · June 20, 2026, 09:02 · Tianjin


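17.1 Why Do We Need Computer Use?

The limit of traditional agents: an agent can only call APIs, yet a large share of the world's software exposes no API the agent can use:

Legacy systems inside enterprises

Desktop software (Photoshop, Excel)

SaaS tools that only offer a graphical interface

The Computer Use breakthrough:

The agent no longer needs the other side to provide an API

It works directly: look at the screen → analyze the image → control the mouse and keyboard

It interacts with any software the way a human does

An analogy:

Traditional agent = someone who can only make phone calls (the other party must have a number)

Computer Use = someone who can walk into the office (and talk to anyone face to face)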

17.2 Screenshot-Action Loop: The Core Loop

This is the underlying logic of every Computer Use system:

📊 Architecture diagram

  1. 📸 Screenshot: capture the current screen
         │
         ▼
  2. 👁️ Analyze: the LLM visually analyzes the image
     - identify windows, buttons, and text boxes
     - read the text shown on screen
     - understand the current state of the interface
         │
         ▼
  3. 🤔 Decide: choose the next operation
     - where to click?
     - what to type?
     - is scrolling needed?
         │
         ▼
  4. 🖱️ Execute: perform the operation
     - mouse_move(x, y)
     - left_click()
     - type("text")
     - scroll(direction)
     - key_press("Enter")
         │
         ▼
     Back to step 1 (until the task is complete)

Key challenge: computing pixel coordinates precisely

The problem: the LLM has to output coordinates such as "click (450, 200)"

But an LLM is at heart a text model and has no built-in notion of pixels

Anthropic's solution (at training time):

Claude was specifically trained to count pixels accurately

"Training Claude to count pixels accurately was critical.

Without this skill, the model finds it difficult to give mouse commands."

The coordinate system in practice:

Screenshots are typically 1280×800 or 1920×1080

Coordinates returned by the LLM must be scaled to the actual screen resolution

Return format: percentages (x_pct, y_pct) are more robust than absolute pixels

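17.3 Anthropic Computer Use vs OpenAI CUA

📊 Comparison table

┌──────────────┬─────────────────────────────────┬──────────────────────────────────────┐
│ Dimension    │ Anthropic Computer Use          │ OpenAI CUA                           │
├──────────────┼─────────────────────────────────┼──────────────────────────────────────┤
│ Released     │ 2024.10 (API)                   │ 2025.01 (Operator)                   │
│              │ 2025.10 (official release)      │                                      │
│ Scope        │ Entire operating system         │ Inside the browser (virtual browser) │
│ Model        │ Claude 3.5 Sonnet and later     │ GPT-4o (CUA fine-tuned variant)      │
│ Environment  │ User's real desktop / Docker    │ Secured virtual browser environment  │
│ Security     │ User must sandbox it themselves │ Isolation built into the platform    │
│ Action types │ Mouse + keyboard + screenshot   │ Browser actions (click/type/scroll)  │
│ Cost         │ Screenshot tokens are expensive │ Browser actions use fewer tokens     │
│ Benchmark    │ OSWorld 14.9%                   │ OSWorld 38.1% (reported by OpenAI)   │
└──────────────┴─────────────────────────────────┴──────────────────────────────────────┘

Anthropic's design philosophy: "Give Claude a real computer and let it work the way a human does"

OpenAI's design philosophy: "Give GPT-4o a safe sandbox and focus on web tasks"

Selection advice:

Need to control desktop software → Anthropic Computer Use

Only need browser operations → OpenAI CUA

Want full control → Anthropic + Docker sandbox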

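17.4 Performance Data and Limitations

OSWorld benchmark results (Claude 3.5 Sonnet figure as reported at its October 2024 launch):

📊 Results table

┌───────────────────┬───────┐
│ System            │ Score │
├───────────────────┼───────┤
│ Human             │ 75.0% │
│ Claude 3.5 Sonnet │ 14.9% │
│ GPT-4V            │  7.8% │
└───────────────────┴───────┘

→ Claude roughly doubled the previous best score (7.8% → 14.9%), but it is still far from human level

Latency:

Each action takes 3-4 seconds (capture → analyze → execute)

A 10-step task = 30-40 seconds

Compared with roughly 0.1 seconds per step for Selenium, the gap is huge

Cost:

Each 1080p screenshot consumes about 1,500 tokens (a rough figure; the exact count depends on how the model tokenizes images)

About $0.10-0.30 per minute

Compared with about $0.001 per minute for plain API calls, that is 100 to 300 times more expensive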
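The latency and cost arithmetic above can be reproduced with a small estimator. This is a minimal sketch under stated assumptions: the class `LoopEstimate`, its field names, and the price of $3 per million input tokens are illustrative and not taken from any provider's price list. It also counts only one fresh screenshot per step, so it is a lower bound for loops that resend earlier screenshots as conversation history.

```python
from dataclasses import dataclass


@dataclass
class LoopEstimate:
    """Back-of-the-envelope estimate for one Screenshot-Action Loop run."""
    steps: int                     # number of loop iterations
    seconds_per_step: float        # capture + analyze + execute
    tokens_per_screenshot: int     # rough image token count
    usd_per_million_tokens: float  # ASSUMED input price, not a real quote

    def latency_seconds(self) -> float:
        """Total wall-clock time if steps run strictly one after another."""
        return self.steps * self.seconds_per_step

    def screenshot_tokens(self) -> int:
        """Image tokens for one new screenshot per step."""
        return self.steps * self.tokens_per_screenshot

    def screenshot_cost_usd(self) -> float:
        """Cost of the screenshot tokens alone at the assumed price."""
        return self.screenshot_tokens() * self.usd_per_million_tokens / 1_000_000


estimate = LoopEstimate(steps=10, seconds_per_step=3.5,
                        tokens_per_screenshot=1500, usd_per_million_tokens=3.0)
print(f"latency: {estimate.latency_seconds():.0f} s")
print(f"tokens:  {estimate.screenshot_tokens()}")
print(f"cost:    ${estimate.screenshot_cost_usd():.3f}")
```

With the midpoint of the 3-4 second range, a 10-step task lands in the middle of the 30-40 second window quoted above.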

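Current positioning (2025-2026):

→ Not a replacement for Selenium

→ Suited to long-tail scenarios that APIs cannot cover

→ Suited to rapid prototyping

→ Production-grade automation still needs traditional approaches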

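17.5 Security Sandbox: Required Reading

Giving an AI control of the mouse and keyboard = extremely high security risk:

✗ Reading passwords shown on screen

✗ Copying sensitive data

✗ Deleting files by mistake

✗ Prompt injection that tricks the AI into running dangerous commands

Security measures (mandatory):

Docker container isolation

docker run -d \
  --security-opt=no-new-privileges \
  --cap-drop=ALL \
  --network=none \
  --read-only \
  computer-use-sandbox

OS-level isolation

Non-administrator user

Read-only mounts for critical directories

Allowlist for network access

Operation confirmation (Human-in-the-Loop)

Dangerous operations require user confirmation

Large transactions / file deletion → second confirmation

Audit log

Record every mouse click and keyboard input

Every operation can be traced back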
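The confirmation and audit-log measures above can be combined in one wrapper around whatever function actually performs the action. This is a minimal sketch, not a reference implementation: `AuditedExecutor`, the `CONFIRM_ACTION_TYPES` set, and the action type names `delete_file` and `submit_payment` are hypothetical, and the log format (one JSON object per line) is just one convenient choice.

```python
import json
import time
from typing import Callable

# ASSUMED action type names that always require human approval.
CONFIRM_ACTION_TYPES = {"delete_file", "submit_payment"}


class AuditedExecutor:
    """Wraps an action executor with confirmation and an append-only log."""

    def __init__(self, execute: Callable[[dict], dict],
                 confirm: Callable[[dict], bool], log_path: str):
        self._execute = execute    # performs the action, returns a result dict
        self._confirm = confirm    # asks a human, returns True to approve
        self._log_path = log_path

    def run(self, action: dict) -> dict:
        approved = True
        if action.get("type") in CONFIRM_ACTION_TYPES:
            approved = self._confirm(action)
        if approved:
            result = self._execute(action)
        else:
            result = {"success": False, "error": "rejected by user"}
        entry = {"ts": time.time(), "action": action,
                 "approved": approved, "result": result}
        # One JSON object per line keeps the log easy to grep and replay.
        with open(self._log_path, "a", encoding="utf-8") as log:
            log.write(json.dumps(entry, ensure_ascii=False) + "\n")
        return result
```

Rejected actions are logged as well, so the audit trail shows what the agent attempted, not only what it was allowed to do.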

Anthropic's official recommendation:

“Always sandbox in Docker containers with limited permissions.

Never run with admin privileges.”

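17.5.1 Coordinate System Engineering: Why Do LLM Clicks Keep Missing?

▍ Coordinate conversion: from LLM output to screen pixels

The coordinates an LLM outputs are usually either normalized coordinates or coordinates based on the screenshot resolution.

The actual screen resolution may differ, so the coordinates must be scaled.

The problem chain:

1280×800 screenshot → LLM analysis → output "click (640, 400)" → but the actual screen is 2560×1600 → sending (640, 400) directly clicks the wrong spot!

Standard practice:

a) Send the LLM screenshots at a fixed resolution (e.g. 1280×800)

b) The LLM outputs coordinates relative to 1280×800

c) Scale before executing: x_real = x_llm × (screen_width / screenshot_width), and likewise y_real = y_llm × (screen_height / screenshot_height)

d) Percentage coordinates (x_pct, y_pct) are more robust than absolute pixels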
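Steps c) and d) can be sketched as two small helpers. The function names `scale_to_screen` and `pct_to_screen` are illustrative, and clamping the result to the screen bounds is an added assumption so that a slightly out-of-range model output never produces an off-screen click.

```python
from typing import Tuple

Size = Tuple[int, int]   # (width, height)
Point = Tuple[int, int]  # (x, y)


def _clamp(value: int, upper: int) -> int:
    """Keep a coordinate inside [0, upper - 1]."""
    return min(max(value, 0), upper - 1)


def scale_to_screen(x_llm: float, y_llm: float,
                    screenshot_size: Size, screen_size: Size) -> Point:
    """Map screenshot-space coordinates to real screen pixels."""
    shot_w, shot_h = screenshot_size
    screen_w, screen_h = screen_size
    x = round(x_llm * screen_w / shot_w)
    y = round(y_llm * screen_h / shot_h)
    return _clamp(x, screen_w), _clamp(y, screen_h)


def pct_to_screen(x_pct: float, y_pct: float, screen_size: Size) -> Point:
    """Map percentage coordinates (0-100) to real screen pixels."""
    screen_w, screen_h = screen_size
    x = round(x_pct / 100 * screen_w)
    y = round(y_pct / 100 * screen_h)
    return _clamp(x, screen_w), _clamp(y, screen_h)


# The example from the problem chain: (640, 400) on a 1280×800 screenshot
# corresponds to the center of a 2560×1600 screen.
print(scale_to_screen(640, 400, (1280, 800), (2560, 1600)))
print(pct_to_screen(50, 50, (2560, 1600)))
```

Both calls resolve to the same physical point, which is why percentage output removes the need to know the screenshot resolution at execution time.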

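▍ Why do LLM coordinates carry an error of about ±5 px?

This is a core difficulty of Computer Use: getting a text model to understand spatial coordinates.

Anthropic specifically trained Claude's pixel-counting ability, but some inherent error remains:

Small buttons (20×20 px) → a 5 px error can land outside the button

Dense lists → the click may hit an adjacent item

Dynamic UI (elements mid-animation) → by the time the click executes, the element has moved away from the detected coordinates

Engineering responses:

a) Coordinate tolerance: take a screenshot after the click to verify; if the result is not as expected, nudge the coordinates and retry

b) Region clicks instead of point clicks: "click the center region of the button" rather than an exact coordinate

c) Element descriptions as the primary navigation: "click the 'Login' button" → locate it with OCR first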
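Response a) can be sketched as a click-verify-retry helper. Everything here is an assumption for illustration: the `click` and `verify` callbacks stand in for a real input driver and a real screenshot check, and the retry pattern (the original point, then one step right, left, down, and up) is only one possible way to nudge the coordinates.

```python
from typing import Callable, List, Optional, Tuple

Point = Tuple[int, int]


def retry_offsets(step: int) -> List[Point]:
    """Offsets to try in order: original point, then four neighbors."""
    return [(0, 0), (step, 0), (-step, 0), (0, step), (0, -step)]


def click_with_verification(target: Point,
                            click: Callable[[int, int], None],
                            verify: Callable[[], bool],
                            step: int = 5) -> Optional[Point]:
    """Click, check the outcome, and nudge the coordinates on failure.

    Returns the point that succeeded, or None if every attempt failed.
    """
    for dx, dy in retry_offsets(step):
        x, y = target[0] + dx, target[1] + dy
        click(x, y)
        if verify():  # e.g. take a new screenshot and compare
            return (x, y)
    return None
```

The default step of 5 px matches the error size discussed above; a real system would also cap total retries so that a genuinely missing element does not trigger a burst of stray clicks.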

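▍ The economics of screenshot resolution

Higher resolution → more precise coordinates from the LLM → but higher token cost

Lower resolution → lower cost → but the LLM may misread buttons

Best practice (per Anthropic's recommendations; token counts are approximate):

Everyday operations: 1024×768 → about 800 tokens per image

When detail is needed: 1280×800 → about 1,200 tokens per image

Text-dense screens (code, tables): 1920×1080 → about 2,000 tokens per image

Dynamic resolution strategy:

First pass → low resolution 1024×768 (quickly judge the page state)

If elements cannot be recognized → step up to 1280×800

If detail is still needed → a 500×500 partial screenshot (zoom in on a specific region)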
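The dynamic resolution strategy above amounts to walking a fixed ladder until the analysis succeeds. The sketch below assumes two hypothetical callbacks: `capture(mode, size)` returns image bytes for a full-screen or cropped capture, and `analyze(image)` returns a result dict or `None` when the model could not identify what it needs.

```python
from typing import Callable, List, Optional, Tuple

Size = Tuple[int, int]

# Cheapest option first, following the three tiers described above.
RESOLUTION_LADDER: List[Tuple[str, Size]] = [
    ("full", (1024, 768)),
    ("full", (1280, 800)),
    ("crop", (500, 500)),
]


def analyze_with_escalation(
        capture: Callable[[str, Size], bytes],
        analyze: Callable[[bytes], Optional[dict]]) -> Optional[dict]:
    """Try each tier in order and stop at the first successful analysis."""
    for mode, size in RESOLUTION_LADDER:
        image = capture(mode, size)
        result = analyze(image)
        if result is not None:
            # Record which tier succeeded so cost can be tracked later.
            annotated = dict(result)
            annotated["mode"] = mode
            annotated["resolution"] = size
            return annotated
    return None
```

Because the loop stops at the first success, most steps pay only for the cheapest screenshot and the higher tiers are used on demand.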

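17.5.2 Sandboxing in Depth: More Than Just "run docker"

▍ Multi-layer security isolation (4 layers, outermost to innermost)

Layer 1: Container isolation (Docker/VM layer)

Disable networking (--network=none), use a read-only root filesystem (--read-only),

limit CPU and memory, drop all Linux capabilities

Layer 2: User isolation (run as a non-root user inside the container)

USER agent (non-root), read-only home directory

Manage granted directories with an allowlist, not a blocklist

Layer 3: Action allowlist (not every key combination is permitted)

Forbidden: Ctrl+Alt+Del, Win+R, rm -rf, disk formatting commands

Allowed: text input, mouse movement, window switching

Layer 4: Result review (the screenshot taken after the agent acts is handed to a second LLM for review)

Checks: was a sensitive page opened? Was sensitive content entered?

This is the last line of defense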
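Layer 3 above can be sketched as a filter that runs before any action reaches the input driver. The action schema (`type`, `keys`, `text`), the names in `ALLOWED_ACTION_TYPES`, and the extra fragments `mkfs` and `format c:` are assumptions made for this example; clicks and ordinary key presses are included in the allowlist because the agent cannot operate without them.

```python
from typing import Tuple

# ASSUMED action type names; anything not listed is rejected by default.
ALLOWED_ACTION_TYPES = {"type_text", "mouse_move", "left_click",
                        "key_press", "switch_window"}
# Typed text containing these fragments is rejected (illustrative list).
BLOCKED_TEXT_FRAGMENTS = ("rm -rf", "mkfs", "format c:")


def normalize_combo(keys: str) -> str:
    """Lowercase and sort key names so 'Alt+Ctrl+Del' equals 'Ctrl+Alt+Del'."""
    return "+".join(sorted(part.strip().lower() for part in keys.split("+")))


BLOCKED_KEY_COMBOS = {normalize_combo(c) for c in ("Ctrl+Alt+Del", "Win+R")}


def check_action(action: dict) -> Tuple[bool, str]:
    """Return (allowed, reason) for a proposed action."""
    action_type = action.get("type")
    if action_type not in ALLOWED_ACTION_TYPES:
        return False, f"action type not on allowlist: {action_type}"
    if action_type == "key_press":
        if normalize_combo(action.get("keys", "")) in BLOCKED_KEY_COMBOS:
            return False, "blocked key combination"
    if action_type == "type_text":
        text = action.get("text", "").lower()
        for fragment in BLOCKED_TEXT_FRAGMENTS:
            if fragment in text:
                return False, f"blocked text fragment: {fragment}"
    return True, "ok"
```

Substring matching on typed text is easy to evade, so this filter complements the container and user isolation layers rather than replacing them.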

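▍ Cost optimization: three ways to cut Computer Use spending

Cache repeated screenshots: when the same page is captured several times, compare hashes and reuse the LLM analysis if they match

Minimize the screenshot area: capture only the application window being operated on, not the full screen

Hybrid automation: use the API for every operation it can handle (roughly 100 times cheaper),

and start Computer Use only where the API cannot reach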
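The first technique, caching by screenshot hash, can be sketched as follows. `ScreenshotAnalysisCache` and its `analyze` callback are hypothetical names; the callback stands in for the expensive LLM vision call. Note the limitation: an exact SHA-256 match only fires on byte-identical screenshots, so a blinking cursor or a clock in the corner defeats it unless those regions are cropped out first.

```python
import hashlib
from typing import Callable, Dict


class ScreenshotAnalysisCache:
    """Reuse the analysis result when a screenshot is byte-identical."""

    def __init__(self, analyze: Callable[[bytes], dict]):
        self._analyze = analyze          # the expensive vision call
        self._cache: Dict[str, dict] = {}
        self.hits = 0
        self.misses = 0

    def analyze(self, image_bytes: bytes) -> dict:
        key = hashlib.sha256(image_bytes).hexdigest()
        if key in self._cache:
            self.hits += 1
            return self._cache[key]
        self.misses += 1
        result = self._analyze(image_bytes)
        self._cache[key] = result
        return result
```

The hit and miss counters make it easy to measure how much of the token spend the cache actually avoids on a given workload.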

Interview talking point: "We are not using Computer Use to replace Selenium; we use Computer Use to cover the roughly 5% of long-tail operations that Selenium cannot reach."

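17.6 Simulating a Computer Use Agent

Below is a ComputerUseAgent simulator that mimics Claude's "screenshot → analyze → act" core loop. A real Computer Use iteration passes a base64-encoded desktop screenshot to Claude's vision model; here a text description stands in for it. The three key stages are preserved (environment perception, action decision, execution feedback), along with hit-testing of click coordinates against element bounds.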

📝 Corresponding code implementation

import time
from typing import Optional


class VirtualScreen:
    """Simulated computer screen: the target environment for Computer Use."""

    def __init__(self, width: int = 1280, height: int = 800):
        self.width = width
        self.height = height
        self.elements = {}  # UI elements on screen: {name: {x, y, w, h, text}}

    def add_element(self, name: str, x: int, y: int,
                    w: int, h: int, text: str = ""):
        """Add a UI element (button / input box / text)."""
        self.elements[name] = {
            "x": x, "y": y, "w": w, "h": h, "text": text,
        }

    def find_element_at(self, click_x: int, click_y: int) -> Optional[str]:
        """Return the name of the element at the given coordinates, if any."""
        for name, elem in self.elements.items():
            if (elem["x"] <= click_x <= elem["x"] + elem["w"] and
                    elem["y"] <= click_y <= elem["y"] + elem["h"]):
                return name
        return None

    def describe(self) -> str:
        """Generate a screen description (stands in for LLM visual analysis)."""
        lines = [f"Screen resolution: {self.width}x{self.height}"]
        for name, elem in self.elements.items():
            lines.append(
                f"  [{name}] position({elem['x']},{elem['y']}) "
                f"size({elem['w']}x{elem['h']}) text: '{elem['text']}'"
            )
        return "\n".join(lines)


class ComputerUseAgent:
    """Computer Use Agent: simulates the full Screenshot-Action Loop.

    Core loop:
      while not done:
          screenshot → analyze → decide → execute → observe
    """

    def __init__(self):
        self.action_log = []
        self.max_actions = 10

    def analyze_screen(self, screen: VirtualScreen) -> dict:
        """Simulate the result of an LLM analyzing a screenshot.

        In a real implementation Claude's vision capability does this step:
          - upload the screenshot (base64) to the Claude API
          - Claude returns an analysis of the UI and a suggested next action

        Args:
            screen: the virtual screen object.
        Returns:
            The analysis result.
        """
        description = screen.describe()
        return {
            "resolution": f"{screen.width}x{screen.height}",
            "elements_found": len(screen.elements),
            "description": description,
        }

    def decide_action(self, task: str,
                      screen: VirtualScreen) -> Optional[dict]:
        """Simulate the next action chosen by the LLM.

        In a real implementation Claude returns a structured tool call,
        roughly of the form:
          tool: "computer"
          input: {"action": "left_click", "coordinate": [450, 200]}
        Here a simple rule stands in: match the task text against each
        element's name or its visible text.

        Args:
            task: the task description.
            screen: the current screen.
        Returns:
            An action dict.
        """
        task_lower = task.lower()
        for name, elem in screen.elements.items():
            label = elem["text"].lower()
            # Matching on the visible label as well as the name lets a task
            # like "Click the Log in button" find "login_button".
            if name.lower() in task_lower or (label and label in task_lower):
                # Compute the element's center coordinates
                cx = elem["x"] + elem["w"] // 2
                cy = elem["y"] + elem["h"] // 2
                return {
                    "type": "left_click",
                    "x": cx,
                    "y": cy,
                    "target": name,
                    "reasoning": f"Found matching element '{name}', "
                                 f"clicking its center ({cx}, {cy})",
                }
        return {
            "type": "type_text",
            "text": task,
            "target": "search_box",
            "reasoning": "No matching button found, trying a search",
        }

    def execute_action(self, action: dict,
                       screen: VirtualScreen) -> dict:
        """Execute an action and return the result.

        Args:
            action: the action dict.
            screen: the current screen.
        Returns:
            The execution result.
        """
        action_type = action["type"]
        result = {"success": True, "action": action}
        if action_type == "left_click":
            target = screen.find_element_at(action["x"], action["y"])
            result["clicked"] = target or "empty area"
            if target is None:
                result["success"] = False
                result["error"] = "No clickable element found"
        elif action_type == "type_text":
            result["input"] = action["text"]
        elif action_type == "scroll":
            result["scroll"] = action.get("direction", "down")
        self.action_log.append(result)
        return result

    def run_task(self, task: str, screen: VirtualScreen) -> dict:
        """Run a complete Computer Use task.

        Args:
            task: the task description.
            screen: the virtual screen environment.
        Returns:
            A result dict containing the execution record.
        """
        print(f"\n  🎯 Task: {task}")
        print(f"  📺 {screen.describe()}\n")
        for step in range(1, self.max_actions + 1):
            print(f"  --- Step {step} ---")
            # 1. Analyze the screen
            analysis = self.analyze_screen(screen)
            print(f"  👁️  Analyze: found {analysis['elements_found']} elements")
            # 2. Decide
            action = self.decide_action(task, screen)
            if action is None:
                print("  ✅ Task complete, no further action needed")
                break
            print(f"  🤔 Decide: {action['reasoning']}")
            # 3. Execute
            result = self.execute_action(action, screen)
            status = "✅" if result["success"] else "❌"
            print(f"  🖱️  Execute: {status} {action['type']} → "
                  f"{result.get('clicked', result.get('input', ''))}")
            # 4. Check for completion (the message no longer claims a click
            #    when the action was actually text input)
            if action.get("target") and result["success"]:
                print(f"  🎉 {action['type']} on '{action['target']}' "
                      "succeeded, task complete!")
                break
            time.sleep(0.3)  # simulate operation latency
        return {
            "task": task,
            "steps": len(self.action_log),
            "actions": self.action_log,
        }


def demo_computer_use():
    """Demonstrate the full Computer Use flow."""
    print("=" * 60)
    print("  Computer Use Agent demo")
    print("=" * 60)
    # Scenario 1: simulate logging in to a web page
    print("\n  ── Scenario 1: web login ──")
    login_screen = VirtualScreen(1280, 800)
    login_screen.add_element("username_input", 500, 300, 200, 30, "Enter username")
    login_screen.add_element("password_input", 500, 350, 200, 30, "Enter password")
    login_screen.add_element("login_button", 550, 400, 100, 40, "Log in")
    agent = ComputerUseAgent()
    result = agent.run_task("Click the Log in button", login_screen)
    # Scenario 2: simulate a search
    print("\n  ── Scenario 2: search ──")
    search_screen = VirtualScreen(1280, 800)
    search_screen.add_element("search_box", 400, 200, 400, 35, "Search...")
    search_screen.add_element("search_button", 810, 200, 80, 35, "Search")
    search_screen.add_element("result1", 400, 300, 500, 50, "Result 1: Python tutorial")
    search_screen.add_element("result2", 400, 360, 500, 50, "Result 2: Intro to AI Agents")
    agent = ComputerUseAgent()
    result = agent.run_task("Click the Search button", search_screen)

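17.7 Chapter Summary

Key points:

Computer Use = AI operating a computer the way a human does

Screenshot-Action Loop: screenshot → analyze → decide → execute

No dependency on APIs; it can operate any software

Core challenges: pixel coordinate calculation + visual understanding

Two camps

Anthropic: controls a real computer (general but dangerous)

OpenAI CUA: virtual browser (safe but limited)

Current limitations (discuss them candidly in interviews)

OSWorld 14.9% (humans 75%): still a long way to go

Latency of 3-4 seconds per step; cost at least 100 times that of API calls

High security risk

Security first

Docker sandbox + non-admin user + read-only mounts

Operation confirmation + audit log

"Always sandbox. Never admin."

Interview quick notes:

"What are the principles and challenges of Computer Use?"

→ Principle: Screenshot-Action Loop (screenshot → visual analysis → coordinates → operation)

→ Challenges: pixel coordinate precision, latency, security risk

→ Positioning: not a replacement for traditional automation, but coverage for the long tail that APIs cannot reach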

📝 Corresponding code implementation

# Entry point: append to the same file as the listing in section 17.6.
if __name__ == "__main__":
    width = 54
    print("╔" + "═" * width + "╗")
    print("║" + "  Chapter 17: Computer Use + GUI Agent".ljust(width) + "║")
    print("║" + "  Screenshot-Action Loop · coordinates · sandboxing".ljust(width) + "║")
    print("╚" + "═" * width + "╝")

    demo_computer_use()

    print("\n▶ OSWorld benchmark results")
    print("-" * 50)
    print("  Human              75.0%")
    print("  Claude 3.5 Sonnet  14.9%  (Computer Use)")
    print("  GPT-4V              7.8%  (conventional vision mode)")
    print()
    print("  Conclusion: Computer Use roughly doubled the previous best score,")
    print("              but a large gap to human level remains.")

    print("\n▶ Anthropic vs OpenAI Computer Use comparison")
    print("-" * 50)
    comparisons = [
        ("Scope", "Anthropic: entire operating system", "OpenAI: inside the browser"),
        ("Security", "Anthropic: sandbox it yourself", "OpenAI: built-in isolation"),
        ("Use cases", "Anthropic: desktop software + web", "OpenAI: web tasks"),
        ("Cost", "Anthropic: screenshot tokens cost more", "OpenAI: fewer action tokens"),
    ]
    for dim, a, o in comparisons:
        print(f"  {dim:10s}  {a:40s}  {o}")
    print("\n✅ Chapter 17 complete!")
