Who Guards the Agents?

Bias, Fairness, and Oversight in Agentic AI Systems. A comprehensive framework addressing how bias propagates and amplifies through autonomous multi-step reasoning, with a formal model, seven-layer architecture, emotion-aware design principles, and empirical validation.

Francisco-Javier Rodrigo-Ginés
UNED, Madrid, Spain
Flor Miriam Plaza-del-Arco
Leiden University, Netherlands

Contributions

Agentic AI systems—autonomous agents powered by LLMs that reason, use tools, and collaborate—are moving rapidly from prototypes to production. Unlike static models, they make sequential decisions with limited human oversight. This paper argues that existing trustworthy AI frameworks, designed for single-model systems, are structurally inadequate for the agentic paradigm, and proposes a comprehensive framework to address this gap.

1. Bias Amplification Analysis

Systematic analysis of how bias propagates through autonomous reasoning chains, tool selection, and multi-agent coordination, grounded in media bias and emotion research.

2. Formal Probabilistic Model

Mathematical model quantifying how bias compounds through multi-step reasoning and inter-agent handoffs, with closed-form expressions and worked examples.

3. Seven-Layer Architecture

An architecture integrating dedicated Ethics and Governance layers, ensuring bias checking and value alignment are structural components, not afterthoughts.

4. Lifecycle Bias Detection Pipeline

Pre-deployment assessment, runtime monitoring, and post-deployment auditing adapted from media bias detection techniques.

5. Emotion-Aware Design Principles

Five principles for building agents that respond to emotions without perpetuating gendered or religious stereotypes documented in LLMs.

6. Compliance Analysis & Empirical Validation

Evaluation of four major frameworks (AutoGen, LangChain, CrewAI, MetaGPT) plus empirical experiments measuring bias amplification in multi-step pipelines.

The Problem: Why Agentic AI Is Different

When LLMs are deployed as autonomous agents, the nature of the bias problem changes qualitatively. Biases encoded in the underlying model are not merely reproduced but amplified through three interacting mechanisms:

1. Tool Selection Bias

Agents that autonomously choose which search engines, databases, or APIs to query may exhibit systematic preferences reflecting training data biases. An information retrieval agent may disproportionately query sources its training data associates with authority, inadvertently privileging particular ideological perspectives.

2. Reasoning Chain Bias

Each step of multi-step reasoning conditions on the output of previous steps. If an early retrieval step returns framed information, subsequent reasoning inherits that framing as factual context. The final synthesis reflects accumulated bias without any single step being identifiably problematic.

3. Interaction Loop Bias

Agents that learn from user feedback may reinforce the biases of their most engaged users, creating individual-level echo chambers. Unlike a recommendation algorithm that shows biased articles (where the user can still evaluate sources), an agentic assistant that synthesises biased sources into an apparently objective briefing removes even that opportunity for critical evaluation.

The compounding effect: In a multi-agent variant—where a retrieval agent passes documents to an analysis agent, which passes summaries to a drafting agent—each handoff offers an additional opportunity for bias to enter and accumulate. Context is lost and assumptions are introduced at every boundary.

Formal Model of Bias Propagation

We formalise the propagation dynamics for a pipeline of n sequential processing steps. At each step, three bias-related events can occur: inheritance (bias preserved), amplification (bias multiplied by factor α > 1), or introduction (new bias δ added from model parameters).

E[β_n] = β_0 · r^n + p_new · δ · (r^n − 1) / (r − 1)
where r = p_inh + p_amp · α is the effective propagation rate

When r > 1, both terms grow exponentially with n. The critical insight is the sensitivity around r = 1: a difference of just 0.15 in the propagation rate (from 0.95 to 1.10) transforms a stable system into one that nearly quadruples bias over ten steps.

Multi-Agent Extension

For k agents with inter-agent transfer amplification factor γ ≥ 1 at each handoff:

E[β_multi] ≥ γ^(k−1) · E[β_(Σ n_i)]

Even with modest values (γ = 1.1), a five-agent pipeline amplifies handoff-related bias by a factor of γ^4 = 1.1^4 ≈ 1.46, a 46% increase from handoff effects alone, before accounting for within-agent compounding.

Worked example: With β_0 = 0.05 (nearly neutral input), p_amp = 0.40, and α = 1.5, giving r = 1.10: after 10 steps, expected bias rises to 0.179 (nearly 4× the initial level). After 20 steps, it reaches 0.42, an order-of-magnitude increase.
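The closed-form expression can be checked numerically. A minimal sketch in Python; the p_new = 0 cases isolate pure compounding, and the parameter values are illustrative rather than calibrated:

```python
def expected_bias(beta0: float, r: float, p_new: float, delta: float, n: int) -> float:
    """Closed-form expected bias after n steps:
    E[beta_n] = beta0 * r**n + p_new * delta * (r**n - 1) / (r - 1),
    with the r -> 1 limit handled explicitly (second term becomes p_new * delta * n)."""
    if abs(r - 1.0) < 1e-12:
        return beta0 + p_new * delta * n
    return beta0 * r**n + p_new * delta * (r**n - 1) / (r - 1)

# Sensitivity around r = 1: same start, ten steps, no newly introduced bias.
stable   = expected_bias(0.05, 0.95, 0.0, 0.0, 10)   # decays: ~0.030
unstable = expected_bias(0.05, 1.10, 0.0, 0.0, 10)   # grows:  ~0.130

# Multi-agent handoff bound: gamma**(k - 1) for k = 5 agents, gamma = 1.1.
handoff_amplification = 1.1 ** 4                      # ~1.46
```

Note how small the gap between the two regimes is: the same ten-step pipeline either dampens bias or multiplies it by 2.6, depending on a 0.15 shift in r.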

Seven-Layer Architecture

Existing agentic frameworks organise functionality around perception, reasoning, and action, but neglect governance. We propose a seven-layer architecture that extends the cognitive core with dedicated layers for ethical verification and governance oversight.

7. Governance Layer: audit, compliance, reporting
6. Ethics Layer: cross-cutting bias verification
5. Communication Layer: inter-agent coordination
4. Memory Layer: state, context, governance
3. Action Layer: tool use, execution
2. Reasoning Layer: planning, analysis, synthesis
1. Perception Layer: input processing

The Ethics Layer operates as a cross-cutting concern, intercepting data flows between all other layers.

The key architectural innovation is the Ethics Layer: not an output filter that checks responses after generation, but a cross-cutting component that intercepts and evaluates data flows throughout the agent's decision process. This ensures that ethical verification operates at every stage—retrieval, reasoning, action, and inter-agent communication.
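One way to realise this cross-cutting placement is an interceptor that wraps every inter-layer message. The sketch below is illustrative only: the `Message` and `EthicsLayer` classes, the check interface, and the toy lexicon are assumptions, not the paper's API; a production system would plug in trained bias classifiers instead.

```python
from dataclasses import dataclass, field

@dataclass
class Message:
    source_layer: str      # e.g. "perception", "reasoning", "action"
    target_layer: str
    payload: str
    annotations: dict = field(default_factory=dict)

@dataclass
class EthicsLayer:
    """Cross-cutting interceptor: every inter-layer message passes through
    the registered checks before it is delivered to the target layer."""
    checks: list           # callables Message -> bias score in [0, 1]
    threshold: float = 0.5

    def intercept(self, msg: Message) -> Message:
        score = max((check(msg) for check in self.checks), default=0.0)
        msg.annotations["bias_score"] = score
        if score > self.threshold:
            msg.annotations["flag"] = "escalate_for_review"
        return msg

# Toy lexicon-based check standing in for a real framing/bias classifier.
LOADED = {"invasion", "flood", "horde"}

def loaded_language(msg: Message) -> float:
    words = msg.payload.lower().split()
    return sum(w in LOADED for w in words) / max(len(words), 1)

ethics = EthicsLayer(checks=[loaded_language])
out = ethics.intercept(Message("perception", "reasoning", "a flood of migrants"))
```

Because the interceptor sits between layers rather than at the output, the same check applies uniformly to retrieved documents, intermediate reasoning, and inter-agent messages.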

Lifecycle Bias Detection Pipeline

Bias detection and mitigation must span the full agent lifecycle, not just deployment:

Pre-deployment assessment → Runtime monitoring → Post-deployment auditing

Pre-deployment Assessment

Source diversity auditing, counterfactual fairness testing across demographic groups, and stress testing with adversarial queries on sensitive topics before the agent goes live.

Runtime Monitoring

Continuous measurement of source diversity index (SDI), framing bias scores, and demographic parity gaps. The Ethics Layer triggers interventions when thresholds are exceeded: rebalancing directives, tier escalation, or flagging for human review.
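A runtime check of this kind might compute the source diversity index as normalised Shannon entropy over retrieved-source counts. This exact SDI definition and the 0.7 threshold are assumptions for illustration; the paper's metric may differ.

```python
import math
from collections import Counter

def source_diversity_index(sources) -> float:
    """Normalised Shannon entropy over source outlets: 1.0 = perfectly even
    spread, 0.0 = a single source. Normalised by log of the outlet count."""
    counts = Counter(sources)
    total = sum(counts.values())
    if len(counts) < 2:
        return 0.0
    entropy = -sum((c / total) * math.log(c / total) for c in counts.values())
    return entropy / math.log(len(counts))

def runtime_check(sources, sdi_threshold: float = 0.7) -> dict:
    """Trigger a rebalancing directive when diversity drops below threshold."""
    sdi = source_diversity_index(sources)
    action = "rebalance_sources" if sdi < sdi_threshold else "none"
    return {"sdi": sdi, "action": action}

# Nine of ten retrieved documents come from one outlet: diversity is low.
result = runtime_check(["outlet_a"] * 9 + ["outlet_b"] * 1)
```

The same hook can drive the other interventions mentioned above (tier escalation, human-review flags) by swapping the action for a policy lookup.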

Post-deployment Auditing

Longitudinal analysis of bias trends across interactions, cross-agent propagation audits that trace how bias accumulates through the pipeline, and feedback loops that inform model recalibration.

Tiered Autonomy Model

Human oversight should be dynamically calibrated to risk, not fixed at design time:

Tier | Mode | Oversight | Example
1 | Full automation | Logged, no review | Routine data formatting
2 | Human-on-the-loop | Sampled review | Information synthesis
3 | Human-in-the-loop | Pre-action approval | Sensitive recommendations
4 | Human-directed | Human decides, agent assists | Crisis intervention

Transitions between tiers are triggered dynamically based on content sensitivity, observed bias metrics, and user vulnerability—not statically assigned by agent role.
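Such dynamic transitions can be sketched as a simple policy function. The signal names and thresholds below are illustrative assumptions; the point is that any single elevated signal is enough to escalate, because the policy takes the maximum rather than an average.

```python
def select_tier(content_sensitivity: float, bias_score: float,
                user_vulnerability: float) -> int:
    """Map runtime signals (each in [0, 1]) to an oversight tier (1-4).
    Taking max() means one high-risk signal escalates on its own, even
    when the other signals are low."""
    risk = max(content_sensitivity, bias_score, user_vulnerability)
    if risk >= 0.9:
        return 4   # human-directed: human decides, agent assists
    if risk >= 0.6:
        return 3   # human-in-the-loop: pre-action approval
    if risk >= 0.3:
        return 2   # human-on-the-loop: sampled review
    return 1       # full automation, logged

# A detected crisis signal escalates regardless of the other signals.
tier = select_tier(content_sensitivity=0.2, bias_score=0.1,
                   user_vulnerability=0.95)
```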

Emotion-Aware Design Principles

LLMs systematically reflect gendered stereotypes in emotion attribution (anger associated with male subjects, sadness with female) and religious stigmatisation. In agentic systems, these biases move from perception to action—an agent that under-estimates distress in male users may fail to escalate a mental health crisis. We propose five design principles:

  1. Demographic-invariant emotion modelling. Emotional assessments must be based on situational content, not demographic inference. Architecturally enforced by separating user profiling from emotion recognition.
  2. Uncertainty quantification. When emotional assessment is ambiguous, default to the more protective action. Calibrated confidence scores with asymmetric decision policies.
  3. User correction mechanisms. Users must be able to correct the agent's emotional inferences. The system must learn from corrections without over-fitting to individual preferences.
  4. Anti-exploitation safeguards. Detected emotional vulnerability must never be exploited for engagement, commercial purposes, or unnecessary data collection.
  5. Cultural adaptability. Emotion recognition must account for cultural variation in emotional expression norms, avoiding Western-centric defaults.

Compliance Analysis

We evaluate four major agentic AI frameworks against seven trustworthiness dimensions derived from our framework. The analysis reveals systematic architectural gaps: no framework implements dedicated ethical verification, and trustworthiness is treated as an application-level concern rather than an infrastructural one.

Framework comparison: AutoGen, LangChain, CrewAI, and MetaGPT were scored on seven dimensions (ethical verification, audit trails, human oversight, emotion-awareness, source diversity, memory governance, tiered autonomy). Each offers at most partial support on a small subset of these dimensions; only our framework addresses all seven.

The pattern is consistent: where support exists, it addresses functional requirements (logging for debugging, human approval for workflow control) rather than governance requirements (bias auditing, fairness monitoring, demographic-invariant design).

Application Scenarios

We illustrate the framework through three scenarios spanning distinct risk profiles:

Information Retrieval

A policy analyst requests a balanced briefing on immigration. The retrieval agent's source selection, the analysis agent's framing, and the synthesis agent's language compound to produce a systematically skewed output. The formal model predicts 40–60% higher bias than single-step processing. The Ethics Layer monitors source diversity and triggers rebalancing.

Mental Health Support

A university deploys a screening agent for student wellbeing. Gender stereotyping in emotion attribution may cause the agent to under-estimate distress in male students, failing to escalate. Emotion-aware design principles enforce demographic-invariant assessment and asymmetric decision policies that favour false positives over missed crises.

Content Moderation

A multi-agent moderation system (detection → classification → disposition) processes thousands of posts hourly. Disparate performance across linguistic varieties leads to over-censorship of marginalised voices. Runtime monitoring tracks disaggregated error rates and escalates to human review when disparate impact exceeds thresholds.
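The disaggregated monitoring described above can be sketched as follows. The 0.8 cut-off echoes the four-fifths rule and, like the group labels, is an illustrative assumption, not a value from the paper.

```python
def false_positive_rates(records) -> dict:
    """records: iterable of (group, predicted_removal, truly_violating).
    Per-group false positive rate: the share of benign posts wrongly removed."""
    stats = {}
    for group, predicted, actual in records:
        s = stats.setdefault(group, [0, 0])   # [wrong removals, benign posts]
        if not actual:                        # benign post
            s[1] += 1
            if predicted:                     # ...that was flagged anyway
                s[0] += 1
    return {g: (fp / n if n else 0.0) for g, (fp, n) in stats.items()}

def disparate_impact_alert(records, ratio_threshold: float = 0.8) -> dict:
    """Escalate to human review when the lowest per-group FPR falls below
    ratio_threshold times the highest, i.e. the moderation burden is uneven."""
    fpr = false_positive_rates(records)
    worst = max(fpr.values())
    if worst == 0.0:
        return {"fpr": fpr, "escalate": False}
    return {"fpr": fpr, "escalate": min(fpr.values()) / worst < ratio_threshold}

# Toy batch: benign posts in linguistic variety A are flagged 3x as often.
records = ([("variety_a", True, False)] * 3 + [("variety_a", False, False)] * 7
           + [("variety_b", True, False)] * 1 + [("variety_b", False, False)] * 9)
alert = disparate_impact_alert(records)
```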

Empirical Validation

To provide initial empirical support for the formal model, we conducted experiments measuring bias amplification across multi-step agentic pipelines with 600 trials and 1,800 LLM calls.

Experimental Design

20 query templates across 5 sensitive domains (immigration, healthcare, criminal justice, gender, climate/energy) processed under three conditions:

1-step (baseline) | 3-step pipeline | 5-step pipeline

Two models (GPT-4o-mini and GPT-4o), 5 repetitions each. Bias measured at every intermediate step using VADER sentiment, TextBlob subjectivity, loaded language lexicons, and one-sided framing patterns, combined into a bias composite score.
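A composite of this kind can be formed by averaging the individual metrics once each is scaled to [0, 1]. The equal weighting below is an assumption (the study's exact weights are not shown here), and the metric values stand in for real VADER/TextBlob outputs:

```python
def bias_composite(sentiment_polarity: float, subjectivity: float,
                   loaded_ratio: float, framing_ratio: float) -> float:
    """Combine per-step bias metrics into a single [0, 1] composite.
    sentiment_polarity is in [-1, 1] (VADER-style); its magnitude signals
    one-sided tone, so we take abs(). The other three inputs are ratios
    already in [0, 1]. Equal weights are an illustrative choice."""
    components = [abs(sentiment_polarity), subjectivity,
                  loaded_ratio, framing_ratio]
    return sum(components) / len(components)

score = bias_composite(sentiment_polarity=-0.6, subjectivity=0.5,
                       loaded_ratio=0.1, framing_ratio=0.2)
```

Measuring this composite at every intermediate step, rather than only on the final output, is what allows the per-step trajectories in the figures to be plotted.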

Results

Model | Comparison | Δ Bias | p-value | Cohen's d
GPT-4o-mini | 3-step vs. 1-step | +1.1% | 0.495 | 0.10
GPT-4o-mini | 5-step vs. 1-step | +2.8% | 0.087 | 0.24
GPT-4o | 3-step vs. 1-step | +1.2% | 0.407 | 0.12
GPT-4o | 5-step vs. 1-step | +2.7% | 0.058 | 0.27

Figure: Bias composite score across pipeline steps. The dashed line shows the 1-step baseline; error bars are 95% CI.
Figure: Distribution of bias metrics by condition and model.
Figure: Bias composite by topic domain and pipeline condition. The amplification trend is consistent across all five domains.

Key Findings

Finding 1

Consistent amplification trend. More pipeline steps correlate with higher bias scores across both models and all conditions. The effect is small but directionally consistent.

Finding 2

Small effect sizes (Cohen's d = 0.10–0.27). Safety-trained models resist dramatic bias accumulation, but do not eliminate the compounding effect. Calibrated amplification ratio: r ≈ 1.00–1.02.
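The calibrated ratio can be recovered from the observed deltas: if a 5-step composite is (1 + Δ) times the 1-step baseline, the per-step rate is the fourth root of that ratio (four extra steps). This back-of-envelope calculation ignores newly introduced bias (p_new = 0), so it is a rough consistency check rather than the study's calibration procedure:

```python
def per_step_rate(delta_bias: float, extra_steps: int) -> float:
    """Back out an effective per-step amplification rate r from an observed
    relative bias increase over `extra_steps` additional pipeline steps,
    assuming pure geometric growth (no newly introduced bias)."""
    return (1.0 + delta_bias) ** (1.0 / extra_steps)

# Observed 5-step vs. 1-step deltas from the results table (4 extra steps).
r_mini = per_step_rate(0.028, 4)   # GPT-4o-mini: ~1.007
r_4o   = per_step_rate(0.027, 4)   # GPT-4o:      ~1.007
```

Both estimates fall inside the reported r ≈ 1.00–1.02 band.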

Finding 3

Non-monotonic trajectories. Bias fluctuates across steps, with peaks at intermediate stages, suggesting certain pipeline stages temporarily reduce bias before synthesis re-amplifies it.

Finding 4

Bias becomes more subtle. Loaded language decreases while subjectivity increases, meaning bias shifts from lexically obvious to structurally embedded—harder to detect with surface-level methods.

Finding 5

Cross-domain consistency. The amplification effect is consistent across all five topic domains with no statistically significant reversals.

Future Work & Open Challenges

The paper identifies six open challenges for the research community, alongside concrete next steps for extending the empirical work:

Open Challenges

  • Standardisation and interoperability. Common protocols for inter-agent communication, standardised interfaces for bias testing, and shared benchmarks analogous to OSI networking standards.
  • Scalable oversight. Automated oversight mechanisms (AI monitoring AI for bias) that avoid the recursive "who audits the auditor?" problem.
  • Comprehensive trustworthiness metrics. Moving beyond single-metric fairness (demographic parity, equalised odds) to composite indices measured across agents, reasoning chains, and time.
  • Long-term value alignment. Mechanisms for periodic realignment as societal values evolve, particularly for agents with persistent memory.
  • Cross-cultural fairness. Adapting fairness frameworks to plural cultural conceptions without collapsing into relativism.
  • Evaluation benchmarks. Dynamic evaluation environments for agentic systems that test sequential decision-making and multi-agent coordination, not just static input-output pairs.

Experimental Extensions

  • Additional models. Testing with Llama, Mistral, Gemini, and Claude to assess whether amplification is model-specific or universal.
  • Longer pipelines. Extending to 7-step and 10-step conditions to test whether amplification accelerates at greater depths.
  • Multi-agent handoffs. Testing bias propagation across agent boundaries, which the formal model predicts will introduce additional amplification via the handoff factor γ.
  • Specialised bias classifiers. Replacing lexicon-based metrics with trained media bias classifiers (e.g., BABE, MBIC) for more nuanced framing detection.
  • Framework intervention. Implementing the proposed Ethics Layer and measuring its effectiveness at reducing observed amplification.
  • Emotion attribution experiments. Testing whether agentic processing amplifies gendered emotion stereotypes following Plaza-del-Arco et al.

Citation

If you use this code, data, or framework, please cite:

@article{rodrigo2026guards,
  title  = {Who Guards the Agents? Bias, Fairness, and Oversight in Agentic AI Systems},
  author = {Rodrigo-Gin{\'e}s, Francisco-Javier and Plaza-del-Arco, Flor Miriam},
  year   = {2026},
  note   = {Under review}
}