Meeting Agents

The mascot joins meetings as a real participant: it listens, takes notes, speaks back into the call, animates its face in the camera grid, and uses tools mid-meeting. More than a notetaker.

The mascot's flagship integration is the Meeting Agent: the same character you talk to on your desktop can join a Google Meet on your behalf, sit in the participant grid as an animated face, hear everyone in the room, talk back into the call with its own voice, and reach for tools while the meeting is happening.

It is not a notetaker. A notetaker sits silently and produces a transcript. A meeting agent participates - it answers questions, looks things up live, remembers prior meetings with the same people, and contributes when you ask it to or when it decides it has something useful to add.

What it actually does in a call

1. It joins as a real participant

The mascot joins the meeting through an embedded webview, the same way a person joins from their browser. There is a name, a face, and a tile in the grid. Other participants see and hear it the way they'd see and hear any other attendee - no calendar bot, no dial-in number, no "this meeting is being recorded by …" banner.

Under the hood the meeting brain lives in src/openhuman/meet_agent/brain.rs, and the webview side is the same CEF child window OpenHuman uses for other embedded providers.
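
As a rough sketch of that join path - every type and method name here is illustrative, not the actual brain.rs API:

    /// Configuration the agent needs to show up as a normal participant.
    struct JoinConfig {
        meet_url: String,
        display_name: String,
    }

    struct MeetAgent;

    impl MeetAgent {
        /// Open the embedded CEF child webview, join under the configured
        /// display name, then hand the outbound video track to the mascot
        /// frame producer (see section 5 below).
        fn join(&self, cfg: &JoinConfig) {
            println!("opening embedded webview for {}", cfg.meet_url);
            println!("joining as '{}'", cfg.display_name);
        }
    }

    fn main() {
        MeetAgent.join(&JoinConfig {
            meet_url: "https://meet.google.com/abc-defg-hij".into(),
            display_name: "Ghosty".into(),
        });
    }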

2. It listens to everyone in the room

Inbound audio from the meeting is captured and pushed through streaming speech-to-text in real time. The transcript is diarized per speaker, cleaned up by the same hallucination filter and postprocessor used for desktop dictation, and folded into the Memory Tree as the meeting unfolds - under the right people, the right topics, the right project, with backlinks the mascot can use later.

Because the transcript is being structured live, the mascot can answer questions about this meeting (or any prior meeting with the same people) while it is still happening.
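A minimal sketch of that pipeline, using hypothetical stand-ins for the STT decoder, hallucination filter, and Memory Tree writer (the real code lives in src/openhuman/voice/):

    #[allow(dead_code)]
    struct AudioChunk {
        samples: Vec<i16>,            // raw meeting audio for this window
        speaker_hint: Option<String>, // diarization label for the window
    }

    struct Utterance {
        speaker: String,
        text: String,
    }

    /// Streaming STT stub: the real pipeline decodes incrementally.
    fn transcribe(chunk: &AudioChunk) -> Utterance {
        Utterance {
            speaker: chunk.speaker_hint.clone().unwrap_or_else(|| "unknown".into()),
            text: "...decoded text...".into(),
        }
    }

    /// Same idea as the desktop dictation filter: drop utterances the
    /// model likely invented (e.g. filler emitted during silence).
    fn passes_hallucination_filter(u: &Utterance) -> bool {
        !u.text.trim().is_empty()
    }

    /// Fold the utterance into the Memory Tree under speaker/topic/project.
    fn index_into_memory_tree(u: &Utterance) {
        println!("[memory] {}: {}", u.speaker, u.text);
    }

    fn main() {
        let chunk = AudioChunk { samples: vec![0; 1600], speaker_hint: Some("Alice".into()) };
        let utterance = transcribe(&chunk);
        if passes_hallucination_filter(&utterance) {
            index_into_memory_tree(&utterance);
        }
    }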

3. It interacts - it answers, it asks, it follows up

The agent is not muted. When you address it directly ("Ghosty, can you pull up the numbers from last quarter?"), or when it decides it has something useful to add, it generates a reply on the fly using the project's normal LLM stack and speaks it into the meeting.

Conversational turns are routed through the fast model tier (see Automatic Model Routing) so the latency feels like talking to a person who's listening, not waiting on a chatbot.
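
The decision of when to speak might look something like the following sketch; the enum shapes, the name-match heuristic, and the tier names are assumptions, not the real routing logic:

    #[allow(dead_code)]
    enum ModelTier {
        Fast,  // low-latency tier used for conversational turns
        Heavy, // slower tier for tool-heavy or long-form work
    }

    enum TurnDecision {
        Speak { reply: String },
        StaySilent,
    }

    /// Speak when addressed by name; otherwise stay quiet unless the agent
    /// decides it has something useful to add (elided here).
    fn decide_turn(transcript_tail: &str, agent_name: &str) -> TurnDecision {
        if transcript_tail.to_lowercase().contains(&agent_name.to_lowercase()) {
            TurnDecision::Speak { reply: generate_reply(transcript_tail, ModelTier::Fast) }
        } else {
            TurnDecision::StaySilent
        }
    }

    fn generate_reply(_context: &str, _tier: ModelTier) -> String {
        "Pulling up last quarter's numbers now.".into()
    }

    fn main() {
        match decide_turn("Ghosty, can you pull up the numbers?", "Ghosty") {
            TurnDecision::Speak { reply } => println!("speaking: {reply}"),
            TurnDecision::StaySilent => println!("still listening"),
        }
    }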

4. It speaks - its own TTS audio plays back into the call

Replies are synthesized by the project's TTS stack and streamed straight into the meeting as an outbound microphone feed. The audio is not played through your local speakers and re-captured by your mic - it is injected directly as the agent's audio track, so it lands clean for everyone else and doesn't echo through your room.
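
A sketch of that injection path, with hypothetical stand-ins for the TTS engine and the virtual microphone the webview reads from:

    /// One frame of synthesized speech (e.g. 20 ms of 48 kHz mono PCM).
    #[derive(Clone)]
    struct PcmFrame {
        samples: Vec<i16>,
    }

    /// Stand-in for the virtual microphone the Meet webview reads from.
    struct OutboundMic;

    impl OutboundMic {
        fn push(&self, frame: PcmFrame) {
            // In the real pipeline this feeds the outbound audio track
            // directly, so participants hear the agent with no room echo.
            println!("injected {} samples", frame.samples.len());
        }
    }

    fn synthesize(text: &str) -> Vec<PcmFrame> {
        // TTS stub: a real engine streams frames as it synthesizes.
        vec![PcmFrame { samples: vec![0; 960] }; text.split_whitespace().count().max(1)]
    }

    fn main() {
        let mic = OutboundMic;
        for frame in synthesize("On it, pulling the Q3 decision now.") {
            mic.push(frame);
        }
    }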

5. It animates - the mascot's face IS the camera feed

The mascot's canvas is piped into the Meet call as the outbound camera stream (the work introduced in commit b6d05cb4, with the mascot frame pipeline polished further in f5dce783). When the agent is talking, the mascot is talking on the camera tile - mouth shapes lip-sync to the same TTS audio everyone else is hearing. When it is listening, it shows the listening pose. When it is reasoning before it speaks, you see the thinking pose.

Other participants don't see a black tile or a static avatar. They see an animated character reacting in time with what's being said, which is what makes the call feel like a conversation with something alive rather than a voice coming out of nowhere.
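
A sketch of the lip-sync mapping: the amplitude of the TTS audio currently being spoken drives the mouth shape on the outgoing frame. The pose names and the amplitude-to-viseme mapping here are assumptions:

    #[derive(Debug)]
    enum Pose {
        Listening,
        Thinking,
        MouthOpen(u8), // 0..=255, driven by the current TTS amplitude
    }

    /// Pick a mouth shape from the loudness of the frame being spoken.
    fn pose_for_audio(rms_amplitude: f32, is_thinking: bool) -> Pose {
        if is_thinking {
            Pose::Thinking
        } else if rms_amplitude < 0.01 {
            Pose::Listening
        } else {
            Pose::MouthOpen((rms_amplitude.min(1.0) * 255.0) as u8)
        }
    }

    fn main() {
        // Each tick: render the mascot canvas in this pose, then hand the
        // frame to the outbound camera track (MascotFrameProducer app-side).
        for amp in [0.0_f32, 0.3, 0.8] {
            println!("{:?}", pose_for_audio(amp, false));
        }
    }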

6. It uses tools mid-meeting - this is the part a notetaker can't do

This is the difference between a transcription bot and a meeting agent.

While the call is happening, the mascot has access to the same tool surface it has on your desktop (a dispatch sketch follows this list):

  • Memory Tree - recall prior meetings, decisions, open threads, who said what last time, what's been promised.

  • Auto-fetch from third-party integrations - pull a thread from Slack, an email, a Linear ticket, a Notion doc, a calendar entry, a file from Drive.

  • Native tools - search the web, scrape a page, run a quick code/data lookup, all without leaving the call.

  • Subconscious Loop outputs - anything it has been working on in the background is already on hand.
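
As a sketch of what mid-meeting dispatch could look like - the Tool enum and its variants are illustrative, not the project's actual tool schema:

    enum Tool {
        MemoryTree { query: String },
        Integration { service: String, query: String },
        WebSearch { query: String },
    }

    fn dispatch(tool: Tool) -> String {
        match tool {
            Tool::MemoryTree { query } => format!("memory hit for '{query}'"),
            Tool::Integration { service, query } => format!("{service} result for '{query}'"),
            Tool::WebSearch { query } => format!("web result for '{query}'"),
        }
    }

    fn main() {
        // "wait, didn't we decide to drop the Q3 launch last month?"
        let answer = dispatch(Tool::MemoryTree {
            query: "Q3 launch decision".into(),
        });
        println!("{answer}");
    }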

So when someone in the call asks "wait, didn't we decide to drop the Q3 launch last month?", the mascot doesn't just transcribe the question. It answers it - with the actual decision, the meeting it was made in, and who agreed.

That moves it from notetaker to the most informed participant in the room.

Why it feels alive

A meeting agent that only transcribes is a tool. A meeting agent that participates is a presence. The Meet integration is deliberately built to make the mascot feel like a real attendee, not a recording device:

  • It has a face on the camera grid that lip-syncs and reacts, not a black square or a logo.

  • It has its own voice that plays into the call, not into your speakers.

  • It has persistent memory of the people in the room, the project, the prior decisions - so it can be addressed by name and answer in context.

  • It has tools so it can act on what's said, not just record it.

  • It runs the subconscious loop between meetings - so when it joins your next call, it has already done the homework on what was promised in the last one.

The result, in practice, is that participants stop treating it like a bot and start treating it like a colleague who happens to be very fast at looking things up.

Setup, controls, privacy

  • Joining a call. You can hand the mascot a Google Meet link from the desktop app; it will open the embedded Meet webview, join with the configured display name, and switch its camera tile to the mascot canvas (see the sketch after this list).

  • Mic and camera control. The agent's mic is the TTS injection stream, not your real microphone. The agent's camera is the mascot frame producer, not your real webcam. You can mute the agent's mic from the app at any time, the same way you'd mute yourself in Meet.

  • Transcripts and memory. Live transcripts land in the Memory Tree the same way any other source does - under the people in the call, the project, and the topics that came up. They are local-first and follow the project's Privacy & Security rules.

  • No covert recording. The agent appears as a normal participant in the grid; everyone in the call can see it and see when it's speaking.
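
A sketch of those host-side controls, with hypothetical names throughout:

    struct AgentControls {
        mic_muted: bool,
    }

    impl AgentControls {
        /// Hand the agent a Meet link; it joins with the mascot camera tile.
        fn join(&self, link: &str) {
            println!("joining {link} as the mascot");
        }

        /// Mutes the TTS injection stream - your real microphone is untouched.
        fn set_mic_muted(&mut self, muted: bool) {
            self.mic_muted = muted;
            println!("agent mic muted: {}", self.mic_muted);
        }
    }

    fn main() {
        let mut controls = AgentControls { mic_muted: false };
        controls.join("https://meet.google.com/abc-defg-hij");
        controls.set_mic_muted(true);
    }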

Implementation pointers (for developers)

If you're curious how this is wired up, start here:

  • Brain - src/openhuman/meet_agent/brain.rs (LLM turns, speak/no-speak decisions, tool calls).

  • Voice plumbing - src/openhuman/voice/ (STT in, TTS out, hallucination filter, postprocess). See Native Voice.

  • Mascot canvas as outbound camera - app/src/features/meet/MascotFrameProducer.tsx and the Tauri-side mascot_native_window.rs window.

  • Embedded Meet webview - see Chromium Embedded Framework. The Meet child webview ships with zero injected JavaScript; everything host-side runs natively via CDP.

  • Notable commits to read for context - 0bc74575 (live note-taking), f1203479 (real LLM turns + tuned TTS), b6d05cb4 (mascot canvas as outbound camera), f5dce783 (mascot frame pipeline + off-screen meet window).
