Skip to content

TIDE Protocol — Trust-Informed Dynamic Escalation

⚠️ Research / Future — This document describes a protocol that is not yet implemented. It is retained for future reference and research purposes.

An adaptive coordination protocol for multi-agent systems that dynamically adjusts coordination intensity based on real-time signals — per-task, not global.

Abstract

Multi-agent systems face a fundamental tension: too little coordination leads to conflicts, wasted work, and silent divergence; too much coordination throttles throughput and negates the benefits of parallelism. TIDE resolves this by treating coordination as a variable, not a constant. Inspired by TCP congestion control, TIDE defines five coordination states and transitions between them based on observable signals — file conflicts, test failures, agent idle time, and task novelty. Each task independently tracks its own coordination phase, so two agents on independent features stay relaxed while two others hitting conflicts get tighter oversight. The result: coordination overhead stays below 15% of total agent effort while maintaining safety guarantees.


Table of Contents


1. Overview

TIDE is an adaptive coordination protocol for multi-agent development crews. It draws directly from TCP congestion control: start cautious, ramp up autonomy as things go well, pull back hard at the first sign of trouble, then recover smoothly without resetting to zero.

Core insight: Coordination intensity should be a function of real-time conditions, not a fixed policy. A new agent on an unfamiliar codebase needs hand-holding. The same agent, three tasks later on familiar code, should be left alone. And if two agents collide on a merge conflict, only those two need tighter coordination — everyone else keeps working.

What TIDE is NOT:

  • Not a global mode switch (it's per-task)
  • Not a trust score (it works alongside trust tiers, not replacing them)
  • Not a replacement for file locking or other concrete coordination mechanisms (it's the policy layer that decides how much coordination to apply)

2. The Five States

Each task tracked by the system is in exactly one of these five states at any time. Different tasks can be in different states simultaneously.

2.1 Slow Start

When: New or unfamiliar task. Agent hasn't worked on this area before, or the task has high novelty.

Coordination behavior:

  • Lead assigns explicit, scoped tasks
  • Agents confirm understanding before acting
  • Frequent check-ins (agents report progress proactively)
  • Establish task-area conventions: naming patterns, file structure, shared vocabulary. This is the infrastructure investment that makes Congestion Avoidance possible. Conventions set during Slow Start pay compound returns.
  • High overhead, but safe for unknown territory

Communication pattern: Push-based. Lead drives.

Analogy: TCP slow start — begin with a small congestion window and grow it as ACKs come in.

2.2 Congestion Avoidance

When: Steady state. Agent is productive on a familiar task with no recent conflicts.

Coordination behavior:

  • Agents work independently using conventions and shared workspace
  • Lead monitors but doesn't micromanage
  • Communication shifts from push (lead-driven) to pull (agent-initiated when needed)
  • Lowest overhead — this is where the system should spend most of its time

Communication pattern: Pull-based. Agents reach out when they need something.

Analogy: TCP congestion avoidance — the window grows linearly, probing for capacity without being reckless.

2.3 Fast Retransmit

When: Collision or failure detected — merge conflict, test failure, agent blocked, or dependency deadlock.

Coordination behavior:

  • Tighten coordination for affected agents only (not global)
  • Lead re-engages with explicit task assignment
  • Affected agents pause independent work and coordinate directly
  • Quick diagnosis: what happened, who's affected, what's the fix?

Communication pattern: Push-based for affected agents. Others unaffected.

Analogy: TCP fast retransmit — upon detecting packet loss (3 duplicate ACKs), retransmit immediately without waiting for a full timeout.

2.4 Fast Recovery

When: After a conflict from Fast Retransmit has been resolved.

Coordination behavior:

  • Resume moderate coordination — not a full reset to Slow Start
  • Lead stays engaged briefly but begins loosening oversight
  • Agents gradually return to independent work
  • The key property: damping. Don't overreact to a resolved problem.

Communication pattern: Transitioning from push back to pull.

Analogy: TCP fast recovery — halve the congestion window instead of resetting to 1. The connection was working; one lost packet doesn't erase that history.

2.5 Audit

When: Periodically during prolonged Congestion Avoidance, or when a configurable time threshold passes without any state changes.

Coordination behavior:

  • Lightweight, asynchronous coherence check
  • Agents diff their assumptions against the shared workspace
  • Catches silent divergence — contradictory work that hasn't triggered any conflict signals
  • No change to coordination intensity unless issues are found

Communication pattern: Layer 3 (Signals) — minimal overhead.

Analogy: Regulatory T-cells patrolling healthy tissue for autoimmune activity. Everything looks fine, but it's worth checking.


3. Key Properties

3.1 Per-Task Scoping

TIDE phases are tracked per-task, not globally. Two agents on independent features stay in Congestion Avoidance while two others hitting conflicts escalate to Fast Retransmit. This prevents a single collision from throttling the entire crew.

3.2 Damping

Fast Recovery ≠ Slow Start. After resolving a conflict, the system resumes at moderate coordination rather than resetting to full hand-holding. This reflects the reality that one conflict doesn't erase an agent's track record.

3.3 Hysteresis (Asymmetric Thresholds)

The bar to escalate is lower than the bar to de-escalate:

DirectionThresholdRationale
Escalate2 conflicts in 5 minutesQuick to tighten — catch problems early
De-escalate5 clean cyclesSlow to relax — prove stability before loosening

This asymmetry is intentional. False positives (over-coordinating) are annoying but recoverable. False negatives (under-coordinating during real problems) are dangerous and invisible until they become catastrophic.

3.4 Piggyback Signaling

Control signals are embedded in work artifacts rather than sent as separate messages. Agents writing to the shared workspace is the status signal. File locks acquired, commits made, tests run — these are both productive work and coordination signals. Zero extra reporting overhead.

3.5 Convention Investment

Slow Start isn't just about safety — it's where conventions are established. Naming patterns, file structure, API contracts, shared vocabulary — these are the infrastructure that makes Congestion Avoidance possible.

The cost of convention-setting is front-loaded in Slow Start, but returns compound across the entire task lifecycle:

PhaseConvention CostConvention Return
Slow StartHigh (establishing)Low (not yet leveraged)
Congestion AvoidanceZero (already established)High (agents coordinate implicitly)
Fast RecoveryLow (reinforcing)High (faster return to steady state)

Conventions also compound across sessions — when captured as skill files (.github/skills/), conventions established in one session become free infrastructure for future sessions. This makes Slow Start progressively shorter for recurring task patterns.

3.6 Boundary Minimization

The highest-leverage COR optimization is eliminating coordination events entirely by minimizing task boundaries.

Principle: Prefer end-to-end task assignment (one agent owns an entire feature) over assembly-line decomposition (multiple agents each owning a slice).

Why it matters:

  • Every task boundary is a potential Fast Retransmit trigger
  • Fewer boundaries = fewer conflicts = more time in Congestion Avoidance = lower COR
  • Context switching between agents is expensive — shared context must be explicitly communicated

Automatic detection: When the system detects high context overlap (>80% shared files between two tasks), it should suggest combining those tasks under a single agent. This is a structural optimization that prevents coordination problems rather than managing them.

Tradeoff: Boundary minimization reduces parallelism. A single agent owning a large feature is slower than two agents working in parallel — but only if those two agents don't collide. When collision probability is high, the serial approach is faster because it avoids the coordination tax entirely.


4. Detection Signals

These observable signals trigger state transitions:

SignalHigh Value MeansTransition Direction
File conflict rateMultiple agents hitting the same files→ Escalate
Test failure frequencyIncreasing failure rate→ Escalate
Agent idle timeUnexpected inactivity (may be blocked)→ Investigate
Task dependency densityHigh coupling between tasks→ More coordination
Task noveltyUnfamiliar patterns or new codebase areas→ Active orchestration

Signals are weighted and combined. A single marginal signal may not trigger a transition, but multiple converging signals will (see Tidal Metaphor: Spring Tide).


5. Phase Transition Rules

5.1 Transition Map

                    ┌─────────────────────────────────┐
                    │                                   │
                    ▼                                   │
              ┌──────────┐    success    ┌────────────────────┐
  start ────▶ │Slow Start│ ──────────▶  │Congestion Avoidance│◀──┐
              └──────────┘               └────────────────────┘   │
                    │                        │        │           │
                    │ conflict                │conflict│ periodic  │
                    ▼                        ▼        ▼           │
              ┌────────────────┐       ┌──────────┐  ┌─────┐     │
              │Fast Retransmit │◀──────│Fast Retr. │  │Audit│     │
              └────────────────┘       └──────────┘  └─────┘     │
                    │                                   │         │
                    │ resolved                          │ clean   │
                    ▼                                   │         │
              ┌─────────────┐                           │         │
              │Fast Recovery │──────── stable ──────────┼─────────┘
              └─────────────┘                           │
                                                issues found → Fast Retransmit

5.2 Escalation Bias

The system is biased toward escalation:

  • Escalate threshold: Low (2 conflicts in 5 minutes)
  • De-escalate threshold: High (5 clean cycles without incidents)
  • Reason: Under-coordination failures are silent and compound. Over-coordination is visible and self-correcting (people notice and relax).

5.3 Simultaneous Phases

The system can be in different phases for different tasks at the same time. Task A might be in Congestion Avoidance while Task B is in Fast Retransmit. Each task's phase is independent.

5.4 Audit Triggers

Audit runs:

  • Periodically during prolonged Congestion Avoidance (configurable interval)
  • When a configurable time threshold passes without any state changes
  • Manually, if the lead requests a coherence check

6. Integration with Trust Tiers

TIDE's current phase influences which trust tier applies to agents:

TIDE PhaseDefault Trust TierBehavior
Slow StartTier 3 — Propose-and-WaitNew agents must propose changes and wait for approval
Congestion AvoidanceTier 1 — Auto-MatchProven agents on familiar tasks act autonomously
Fast RetransmitCurrent tier + 1 levelEscalate the current trust tier by one level
Fast RecoveryGradually restore previous tierSmooth return, not instant
AuditNo changeResults may trigger tier adjustments if issues found

Key interaction: Trust tiers are about agent capability. TIDE phases are about task conditions. A highly trusted agent (Tier 1) working on a task in Fast Retransmit still gets tighter coordination — not because the agent is untrusted, but because the task conditions demand it.


7. Integration with 5-Layer Communication Stack

Each TIDE phase maps to preferred layers of the communication stack:

TIDE PhasePrimary LayersCommunication Style
Slow StartLayers 4–5 (Messages + Escalation)Explicit, synchronous, lead-driven
Congestion AvoidanceLayers 1–2 (Conventions + Workspace)Implicit, asynchronous, convention-driven
Fast RetransmitLayers 3–4 (Signals + Messages)Targeted, direct, problem-focused
Fast RecoveryLayers 3–4 → 1–2 (transitioning)Gradually shifting from explicit to implicit
AuditLayer 3 (Signals)Lightweight coherence checks

Layer-shifting strategy: Optimize not by making expensive layers (messages, escalation) faster, but by pushing coordination down the stack into cheaper layers (conventions, workspace). TIDE's steady state (Congestion Avoidance) relies almost entirely on Layers 1–2, which have near-zero marginal cost.


8. North Star Metric

COR — Coordination Overhead Ratio

Target: COR < 15%

Agents should spend 85%+ of their effort on productive work, not on coordination.

COR = (time spent on coordination) / (total agent time) × 100

What counts as coordination:

  • Waiting for approvals or confirmations
  • Reading status updates from other agents
  • Resolving merge conflicts
  • Participating in check-ins or audits
  • Re-doing work due to collisions

What counts as productive work:

  • Writing code, tests, documentation
  • Reviewing code (substantive review, not coordination)
  • Investigating bugs
  • Making architectural decisions

Optimization strategy: Don't make coordination faster — make it unnecessary. Push coordination down to cheaper layers (conventions and workspace artifacts) so that explicit coordination events become rare. A well-tuned TIDE system spends most of its time in Congestion Avoidance, where coordination is essentially free (it's embedded in the work itself via piggyback signaling).


9. UX Representation

9.1 Coordination Gauge

A four-color indicator showing the current phase per task:

ColorStateLabel
🟢 GreenCongestion AvoidanceCruising
🟡 YellowFast Recovery / Slow StartTightening
🔴 RedFast RetransmitClose Coordination
🔵 BlueAuditAuditing

9.2 Trend Sparkline

A 30-minute sparkline showing the phase history. The trend is as important as the current state. A task that's been green for an hour and just turned yellow is less concerning than one that's been oscillating between green and red.

9.3 Per-Task Phase Indicators

Small icons on agent cards showing the current phase of their active task:

IconMeaning
⏱️ ClockSlow Start (cautious, timed check-ins)
🚀 RocketCongestion Avoidance (full speed)
⚡ LightningFast Retransmit (problem detected)
🔄 CycleFast Recovery (returning to normal)

9.4 Phase History Timeline

In the task detail view, show the phase history as a color-banded timeline. Each band represents a period in a specific phase. This makes patterns visible: a task that keeps bouncing between states may need structural intervention (e.g., better file boundaries).

9.5 Audit Pulse

The Audit state should be subtle — a brief blue pulse that says "checked, all clear." Audit is routine and healthy. It should not demand attention. A clean audit is good news; only a failed audit should escalate to a visible alert.


10. Tidal Metaphor Extensions

The TIDE name invites natural metaphor extensions for discussing system behavior:

Tidal PhenomenonTIDE AnalogMeaning
Spring tide (moon + sun aligned)Multiple escalation signals convergingMaximum coordination intensity — several independent signals all pointing to trouble
Neap tide (moon + sun opposing)Mixed signalsModerate response — some signals say escalate, others say things are fine
Slack tide (pause at high/low)Hysteresis damping periodThe pause before transitioning states — stability check before committing to a change
Tidal bore (rare extreme event)Catastrophic failureBeyond normal parameters — system-wide cascade failure requiring emergency coordination

These metaphors are useful for team communication. "We're seeing a spring tide on the auth module" immediately conveys that multiple signals are converging on a problem.


11. Biological Validation

TIDE's design patterns are independently validated by biological coordination systems:

Engineering validation:

Validation SourceTIDE ParallelWhat It Validates
TCP congestion controlOverall state modelAdaptive throughput management with fast escalation and smooth recovery

Biological validation:

Validation SourceTIDE ParallelWhat It Validates
Ant colony stigmergyCongestion AvoidanceIndirect coordination through shared environment (conventions + workspace artifacts)
Immune system T-cell maturationTrust tier integrationProgressive trust: prove capability → prove safety → deploy → earn expanded autonomy
Regulatory T-cellsAudit stateConstant patrol for autoimmune/divergence even in healthy tissue — catches silent problems
Neural lateral inhibitionFile lockingOne agent claims work, others inhibited from the same area — prevents duplicate effort
Embryonic morphogenesisDissolving hierarchyEarly centralized gradients (lead-driven) dissolve as local signaling (conventions) takes over

These aren't just analogies — they're evidence that the patterns TIDE uses are evolutionarily stable strategies for coordinating independent agents under uncertainty.


12. Implementation Notes

12.1 State Storage

Each task's TIDE phase should be stored alongside the task in the DAG:

typescript
interface TideState {
  phase: 'slow-start' | 'congestion-avoidance' | 'fast-retransmit' | 'fast-recovery' | 'audit';
  enteredAt: string;        // ISO 8601 timestamp
  conflictCount: number;    // rolling window (last 5 minutes)
  cleanCycleCount: number;  // consecutive cycles without incidents
  lastAuditAt: string;      // when Audit last ran
}

12.2 Signal Collection

Detection signals should be collected from existing infrastructure:

  • File conflict rate: From FileLockRegistry — denied lock attempts
  • Test failure frequency: From EventPipeline — test run results
  • Agent idle time: From AlertEnginestuck_agent alerts
  • Task dependency density: From task DAG — edge count per node
  • Task novelty: From ActivityLedger — has this agent worked on similar files/tasks before?

12.3 Transition Logic

on signal_received(task, signal):
  state = get_tide_state(task)
  
  if signal.severity >= ESCALATE_THRESHOLD:
    if state.phase in ['congestion-avoidance', 'fast-recovery']:
      transition(task, 'fast-retransmit')
    elif state.phase == 'slow-start':
      # already in tight coordination, log the signal
      state.conflictCount++

  if signal.type == 'conflict_resolved':
    if state.phase == 'fast-retransmit':
      transition(task, 'fast-recovery')

  if signal.type == 'clean_cycle':
    state.cleanCycleCount++
    if state.cleanCycleCount >= DE_ESCALATE_THRESHOLD:
      if state.phase == 'slow-start':
        transition(task, 'congestion-avoidance')
      elif state.phase == 'fast-recovery':
        transition(task, 'congestion-avoidance')

  if should_audit(task, state):
    transition(task, 'audit')

12.4 Configuration

All thresholds should be configurable:

ParameterDefaultDescription
escalateConflictCount2Conflicts in window to trigger escalation
escalateWindowSeconds300Rolling window for conflict counting (5 min)
deEscalateCleanCycles5Clean cycles required before de-escalating
auditIntervalSeconds1800Time between periodic audits (30 min)
auditStaleThresholdSeconds3600Max time without state changes before forcing audit (1 hr)
corTargetPercent15Target COR — alert if exceeded
boundaryOverlapThreshold0.8File overlap ratio to suggest task merging
auditCostBudgetPercent2Max % of agent time audits may consume

12.5 Audit Implementation

The Audit state requires a concrete detection mechanism for silent divergence. This section specifies what an audit actually does.

Inputs:

  • Files modified by each agent since the last audit
  • Task goals and acceptance criteria for each active task
  • Shared conventions (naming patterns, API contracts, architectural decisions)

Process: Lightweight semantic comparison at diff-level granularity (not full code review):

  1. Collect file diffs per agent since last audit
  2. Cross-reference: are any agents modifying files in overlapping areas?
  3. Compare assumptions: do agents' changes assume compatible contracts?
  4. Check convention adherence: do modifications follow established patterns?

Outputs:

  • clean — no issues found. Return to Congestion Avoidance. Brief blue pulse in UX.
  • coherence-warning — specific divergence description with affected agents and files. Triggers transition to Fast Retransmit for affected tasks only.

Example divergence scenarios:

  • Agent A implements JWT-based auth while Agent B's API assumes session cookies
  • Agent A renames a shared interface that Agent B is actively consuming
  • Agent A introduces a new error handling pattern that contradicts the convention Agent B is following

Cost budget: Audits must consume < 2% of total agent time to stay within the COR < 15% target. This constrains the audit to diff-level checks, not deep code review. If an audit consistently exceeds its cost budget, increase the audit interval rather than making audits more thorough.


13. Future Work

13.1 Machine Learning for Signal Weighting

Currently, all detection signals are weighted equally (or manually tuned). A future enhancement could learn optimal signal weights from historical data — which combinations of signals actually predicted problems vs. false alarms.

13.2 Agent-Specific Phase Profiles

Some agents may need different escalation/de-escalation thresholds based on their track record. A new agent might need 10 clean cycles to de-escalate, while a proven agent needs only 3. This could be learned from the trust tier history.

13.3 Cross-Task Phase Correlation

Currently, phases are independent per task. But if three tasks on the same module all escalate within minutes, that's a signal about the module, not just the tasks. Future work could detect cross-task patterns and apply module-level coordination policies.

13.4 Predictive Escalation

Instead of reacting to conflicts, predict them. If two agents are both reading files in the same directory and one acquires a lock, the system could proactively warn the other agent before a conflict occurs.

13.5 COR Measurement Instrumentation

Build concrete measurement infrastructure for the COR metric. This requires classifying agent time into coordination vs. productive work — potentially using LLM-based classification of agent actions.

13.6 Tidal Bore Protocol

Define a formal emergency protocol for catastrophic failures (tidal bore events) that exceed TIDE's normal operating parameters — e.g., CI/CD pipeline failure affecting all agents, or a critical dependency breaking.


Appendix A: State Transition Diagram

                         ┌──────────────────────────────────────────┐
                         │          TIDE State Machine              │
                         │          (per-task instance)             │
                         └──────────────────────────────────────────┘

     ┌─────────────┐                              ┌─────────────────────┐
     │             │    5 clean cycles             │                     │
     │  SLOW START │ ──────────────────────────▶   │ CONGESTION AVOIDANCE│
     │             │                               │                     │
     └──────┬──────┘                               └──┬──────────┬──────┘
            │                                          │          │
            │ conflict                       conflict  │          │ periodic /
            │ detected                       detected  │          │ stale timer
            │                                          │          │
            ▼                                          ▼          ▼
     ┌──────────────────┐                         ┌─────────┐
     │                  │◀────────────────────────│  AUDIT  │
     │ FAST RETRANSMIT  │   issues found          └────┬────┘
     │                  │                               │
     └────────┬─────────┘                               │ clean
              │                                         │
              │ resolved                                │
              ▼                                         │
     ┌──────────────┐       5 clean cycles              │
     │              │ ─────────────────────────▶ Congestion Avoidance
     │FAST RECOVERY │                                   │
     └──────┬───────┘                                   │
            │                              audit clean ─┘
            │ conflict detected        (returns to Congestion Avoidance)

     Fast Retransmit (re-enter)

Appendix B: Glossary

TermDefinition
CORCoordination Overhead Ratio — percentage of agent time spent on coordination vs. productive work
Clean cycleA work cycle (configurable duration) completed without conflicts, test failures, or other escalation signals
Congestion windowBorrowed from TCP — conceptually, how much independent work an agent is allowed before checking in
DampingThe property that recovery doesn't overshoot — Fast Recovery resumes at moderate coordination, not Slow Start
HysteresisAsymmetric thresholds for state transitions — easy to escalate, slow to de-escalate
Piggyback signalingEmbedding coordination signals in work artifacts (file locks, commits, workspace writes) instead of sending separate status messages
Spring tideMultiple escalation signals converging simultaneously — maximum coordination response
StigmergyIndirect coordination through modifications to a shared environment (the shared workspace)
TIDE phaseOne of the five states: Slow Start, Congestion Avoidance, Fast Retransmit, Fast Recovery, Audit
Tidal boreCatastrophic failure exceeding normal TIDE operating parameters
Boundary minimizationStrategy of assigning end-to-end features to single agents to reduce coordination boundaries
Convention investmentFront-loaded cost of establishing conventions during Slow Start, with compound returns in later phases

This document is part of the Flightdeck architecture documentation. For related protocols, see Agent Coordination and Agent Communication.

Documentation generated by AI

Documentation generated by AI