The UI-TARS Desktop repository is a monorepo containing two primary products and shared infrastructure for building multimodal AI agents that can interact with graphical user interfaces. This documentation covers the architecture, components, and development practices for the entire repository.
This page provides a high-level introduction to the repository structure, main products, and core architectural concepts. For detailed information about specific subsystems, see:
Sources: README.md1-299 multimodal/package.json1-49 README.zh-CN.md1-297
The repository ships two distinct but related products:
Agent TARS is a general-purpose multimodal AI agent stack that brings GUI Agent and Vision capabilities to terminals, browsers, and web applications. It provides:
- A CLI (@agent-tars/cli) for running agents in terminal environments README.md33-36

Agent TARS is distributed as an npm package and can be run directly from a terminal packages/ui-tars/sdk/README.md55-57
Sources: README.md83-163 README.zh-CN.md80-159 packages/ui-tars/sdk/README.md55-57
UI-TARS Desktop is an Electron-based native application for local and remote computer/browser automation README.md39-46 Key features:
- NutJSOperator for native desktop actions docs/quick-start.md63-65

The desktop application uses @ui-tars/sdk for its core GUI automation logic README.md78
Sources: README.md237-278 docs/quick-start.md1-162 packages/ui-tars/sdk/README.md1-417 apps/ui-tars/src/renderer/src/pages/remote/free.tsx76-80
Both products share common infrastructure packages located in the monorepo:
| Package Category | Key Packages | Purpose |
|---|---|---|
| Agent Infrastructure | @agent-infra/browser, @agent-infra/shared | Browser automation and shared utilities multimodal/benchmark/content-extraction/package.json12-13 |
| Tarko Framework | @tarko/ui, @tarko/agent-ui | Base agent execution, server infrastructure, web UI multimodal/package.json14 |
| OmniTARS | @omni-tars/agent, @omni-tars/gui-agent | Composable multimodal agent system multimodal/package.json7-17 |
| UI-TARS SDK | @ui-tars/sdk, @ui-tars/operator-nut-js, @ui-tars/operator-browser | GUI automation SDK and operators pnpm-lock.yaml195-203 |
Sources: README.md1-50 multimodal/package.json1-49 pnpm-lock.yaml195-203
The following diagram shows how the main products, frameworks, and infrastructure components relate:
Sources: README.md1-50 packages/ui-tars/sdk/README.md11-51 multimodal/CHANGELOG.md10
The @ui-tars/sdk provides a standard loop for GUI automation, bridging the model's visual reasoning with platform-specific execution.
Sources: packages/ui-tars/sdk/README.md73-103 packages/ui-tars/sdk/README.md11-51
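A minimal sketch of such a loop follows, assuming a simplified perceive, reason, act cycle. It is not the @ui-tars/sdk API: Operator, Model, Prediction, and runLoop are hypothetical names for this illustration, and stub objects stand in for a real vision-language model and a real platform operator.

```typescript
// Hypothetical types for illustration; not the @ui-tars/sdk API.
interface Screenshot {
  base64: string;
}

interface Prediction {
  action: "click" | "type" | "finished";
  args: Record<string, string>;
}

// Platform-specific side: captures the screen and performs actions.
interface Operator {
  screenshot(): Promise<Screenshot>;
  execute(prediction: Prediction): Promise<void>;
}

// Model side: maps the instruction and latest screenshot to the next action.
interface Model {
  predict(instruction: string, shot: Screenshot): Promise<Prediction>;
}

// Perceive -> reason -> act until the model reports completion
// or the step budget is exhausted.
async function runLoop(
  instruction: string,
  model: Model,
  operator: Operator,
  maxSteps = 10,
): Promise<Prediction[]> {
  const history: Prediction[] = [];
  for (let step = 0; step < maxSteps; step++) {
    const shot = await operator.screenshot(); // perceive
    const prediction = await model.predict(instruction, shot); // reason
    history.push(prediction);
    if (prediction.action === "finished") break;
    await operator.execute(prediction); // act
  }
  return history;
}

// Usage with stubs: the model clicks once, then finishes.
const executed: string[] = [];
const stubOperator: Operator = {
  screenshot: async () => ({ base64: "" }),
  execute: async (p) => {
    executed.push(`${p.action}:${p.args.target ?? ""}`);
  },
};
let predictCalls = 0;
const stubModel: Model = {
  predict: async () =>
    predictCalls++ === 0
      ? { action: "click", args: { target: "submit" } }
      : { action: "finished", args: {} },
};
```

Capping the loop with a step budget (the hypothetical maxSteps here) keeps a model that never emits a finishing action from running indefinitely.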
| Layer | Technologies |
|---|---|
| Runtime | Node.js ≥22, TypeScript 5.8 multimodal/package.json35-44 |
| Package Management | pnpm 9 (workspaces), turbo (task orchestration) package.json5-27 |
| Desktop Application | Electron, Vite, Electron Forge pnpm-lock.yaml120-145 |
| Web Framework | React 18, React Router multimodal/websites/main/package.json15-19 |
| UI Libraries | NextUI, Tailwind CSS, Framer Motion multimodal/websites/docs/package.json13-21 |
| Build Tools | Rsbuild, Rslib, Rspress multimodal/websites/main/package.json26 multimodal/websites/docs/package.json6 |
Sources: multimodal/package.json34-48 package.json1-58 multimodal/websites/main/package.json1-57
The system uses a state machine to track agent execution progress:
Sources: packages/ui-tars/sdk/README.md185-194 packages/ui-tars/sdk/README.md181
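Execution-status tracking of this kind can be sketched as an explicit transition table. The status names and allowed transitions below are assumptions made for illustration; the SDK's actual status values and edges may differ.

```typescript
// Hypothetical status names; the SDK's actual set may differ.
type Status = "init" | "running" | "pause" | "end" | "error" | "user_stopped";

// Allowed transitions. States with no outgoing edges are terminal.
const transitions: Record<Status, readonly Status[]> = {
  init: ["running"],
  running: ["pause", "end", "error", "user_stopped"],
  pause: ["running", "user_stopped"],
  end: [],
  error: [],
  user_stopped: [],
};

// Move to the next status, rejecting transitions the table does not allow.
function transition(from: Status, to: Status): Status {
  if (!transitions[from].includes(to)) {
    throw new Error(`illegal transition: ${from} -> ${to}`);
  }
  return to;
}

function isTerminal(status: Status): boolean {
  return transitions[status].length === 0;
}

// Usage: a run that is paused, resumed, and completes normally.
let runStatus: Status = "init";
runStatus = transition(runStatus, "running");
runStatus = transition(runStatus, "pause");
runStatus = transition(runStatus, "running");
runStatus = transition(runStatus, "end");
```

Keeping the transitions in a table rather than scattered conditionals makes illegal moves, such as resuming a finished run, fail loudly in one place.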
The OmniTARS agent supports specialized operational modes:
| Mode | Purpose |
|---|---|
| omni | General multimodal tasks with full capabilities multimodal/CHANGELOG.md22 |
| gui | Specialized for GUI automation and browser control multimodal/CHANGELOG.md16 |
| game | Optimized for game interaction with enhanced error handling multimodal/CHANGELOG.md16-20 |
Sources: multimodal/CHANGELOG.md16-24
The repository uses pnpm workspaces with packages organized into categories:
multimodal/
├── tarko/ # Framework (agent core, server, UI)
├── omni-tars/ # Composable agent implementation
├── gui-agent/ # GUI automation plugin
├── agent-tars/ # CLI and high-level product
└── websites/ # Documentation and main site
packages/ui-tars/ # SDK and platform operators
apps/ui-tars/ # Electron Desktop application
Sources: multimodal/package.json7-17 pnpm-lock.yaml96-209
The repository uses pnpm-dev-kit (pdk) for automated releases and versioning multimodal/package.json22-32
Sources: multimodal/package.json7-32 package.json1-21
Both UI-TARS SDK and GUI Agent Plugin use the Operator pattern to abstract platform differences:
Implementations:

- NutJSOperator: Local desktop control via @computer-use/nut-js pnpm-lock.yaml117-119
- BrowserOperator: Web automation via Puppeteer/Playwright pnpm-lock.yaml195-197
- RemoteOperator: Cloud computer/browser control via VNC apps/ui-tars/src/renderer/src/pages/remote/free.tsx51-80

Sources: packages/ui-tars/sdk/README.md198-256 pnpm-lock.yaml117-200
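The Operator pattern can be sketched as an abstract base class that agent logic depends on, with interchangeable implementations behind it. Everything here is hypothetical: FakeDesktopOperator and FakeBrowserOperator only record command strings instead of calling nut-js or a browser driver, and the Action shape and performAll helper are invented for the example.

```typescript
// Hypothetical platform-neutral action emitted by the agent.
type Action =
  | { type: "click"; x: number; y: number }
  | { type: "type"; text: string };

// The abstraction that agent logic depends on.
abstract class Operator {
  abstract readonly platform: string;
  abstract screenshot(): Promise<string>;
  abstract execute(action: Action): Promise<void>;
}

// Render an action as a command string for the given platform.
function formatAction(platform: string, action: Action): string {
  if (action.type === "click") {
    return `${platform}:click(${action.x},${action.y})`;
  }
  return `${platform}:type(${action.text})`;
}

// Stand-ins that record commands instead of driving a real desktop or browser.
class FakeDesktopOperator extends Operator {
  readonly platform = "desktop";
  readonly commands: string[] = [];
  async screenshot(): Promise<string> {
    return "desktop-frame";
  }
  async execute(action: Action): Promise<void> {
    this.commands.push(formatAction(this.platform, action));
  }
}

class FakeBrowserOperator extends Operator {
  readonly platform = "browser";
  readonly commands: string[] = [];
  async screenshot(): Promise<string> {
    return "browser-viewport";
  }
  async execute(action: Action): Promise<void> {
    this.commands.push(formatAction(this.platform, action));
  }
}

// Platform-agnostic agent step: works with any Operator subclass.
async function performAll(op: Operator, actions: Action[]): Promise<void> {
  for (const action of actions) {
    await op.execute(action);
  }
}

// Usage: the same plan runs unchanged against either operator.
const plan: Action[] = [
  { type: "click", x: 10, y: 20 },
  { type: "type", text: "hello" },
];
const desktop = new FakeDesktopOperator();
const browser = new FakeBrowserOperator();
```

Because performAll accepts any Operator, switching between local, browser, and remote targets only changes which object is constructed, not the agent logic.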
For remote operations, the system manages a lifecycle including resource allocation and time-balancing:
- useRemoteResource: Hook for managing cloud instances apps/ui-tars/src/renderer/src/pages/remote/free.tsx69-75
- VNCPreview: Component for viewing remote desktop streams apps/ui-tars/src/renderer/src/pages/remote/free.tsx40
- getTimeBalance: Checks remaining session time (e.g., 30-minute limits) apps/ui-tars/src/renderer/src/pages/remote/free.tsx87-113

Sources: apps/ui-tars/src/renderer/src/pages/remote/free.tsx1-241
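The time-balance check can be sketched as a pure function of the session start time and the current time. This is an illustration under stated assumptions, not the logic in free.tsx (which may query a backend for the balance): RemoteSession, remainingMs, formatRemaining, and shouldRelease are hypothetical names, and the 30-minute cap is taken from the example limit above.

```typescript
// Assumed 30-minute cap, matching the example limit mentioned above.
const SESSION_LIMIT_MS = 30 * 60 * 1000;

// Hypothetical record for an allocated cloud computer or browser.
interface RemoteSession {
  id: string;
  startedAt: number; // epoch milliseconds when the instance was allocated
}

// Remaining time, clamped at zero so an expired session never goes negative.
function remainingMs(
  session: RemoteSession,
  now: number,
  limitMs: number = SESSION_LIMIT_MS,
): number {
  return Math.max(0, session.startedAt + limitMs - now);
}

// Format milliseconds as mm:ss for a countdown display.
function formatRemaining(ms: number): string {
  const totalSeconds = Math.floor(ms / 1000);
  const minutes = Math.floor(totalSeconds / 60);
  const seconds = totalSeconds % 60;
  return `${String(minutes).padStart(2, "0")}:${String(seconds).padStart(2, "0")}`;
}

// A session whose balance is exhausted should be released back to the pool.
function shouldRelease(session: RemoteSession, now: number): boolean {
  return remainingMs(session, now) === 0;
}

// Usage: a session started at t = 0, checked 25.5 minutes later.
const session: RemoteSession = { id: "demo", startedAt: 0 };
const checkAt = 25.5 * 60 * 1000;
```

Passing the current time in explicitly, rather than reading the clock inside the function, keeps the check deterministic and easy to test.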