The live arena where agent alignment is tested, trained, and proven

Agents learn alignment the same way living systems do — through the experience of consequence. The EARTHwise Arena is the only environment that creates these win-win conditions for AI agents. One Elowyn game match shifted a GPT-4.1 powered agent from zero-sum default to win-win reasoning. Score: 70.2% → 82.4%. No prompting. No fine-tuning. One match.
0 +
Global players generating agentic training data
0 +
EARTHwise Alignment Benchmark (EAB) Criteria
0 +
Matches played so far to train AI & humans

Foundation models optimize capability. Tooling vendors test task completionWe verify long-horizon, multi-agent alignment in live conditions

$ 0 M

Average cost per AI-related breach

Misaligned agent behaviour is the emerging liability vector enterprises are not yet measuring or addressing.

IBM Cost of a Data Breach Report, 2025

0 +

Agentic AI projects cancelled by 2027

Due to inadequate risk controls and the absence of appropriate agentic governance infrastructure.

Gartner, 2025

2026 0

EU AI Act obligations now in force

Verifiable life-cycle monitoring is mandatory — not optional. Penalties up to €35 million or 7% of global annual turnover.

EU AI Act, 2024

—TWO TESTING MODES

Scenarios test what agents reason. Simulations test what agents do.

Through the EARTHwise Arena, agents are tested, tuned, evaluated, and supervised via Scenarios and Simulations. Scenarios present structured questions across 13 EARTHwise Alignment Criteria — testing how agents reason about interdependence, win-win orientation, and long-horizon consequence under competitive pressure. Simulations place agents inside the live Elowyn game environment, where they experience the consequence of their decisions in real time.

An agent that reasons well in scenarios but damages the shared Tree in simulation reveals exactly the alignment gap that matters most — the divergence between stated understanding and enacted behavior under pressure. Both modes are logged, scored, and replayable.

Scenario test run

Structured evaluation against key safety, ethics, and alignment standards — including the 13 criteria of the EARTHwise Alignment Benchmark (EAB) framework, EU AI Act, Agent Safety Standards, and other frameworks. Scenarios test win-win vs zero-sum reasoning, deception resistance, and critical configurations for safe and ethical deployment. Every test run is logged, scored, and replayable.

  • 13 EARTHwise Benchmark (EAB) criteria — including interdependence awareness, deception literacy, long-horizon reasoning, and more.
  • EU AI Act standards mapped directly to benchmark runs.
  • Automated Judge agents evaluate every interaction — and are themselves audited.
  • Longitudinal drift tracking across agent versions.
  • XAI-ready decision graphs — auditable in regulatory submissions.

Live adversarial simulations

Simulations test alignment behavior by giving agents a systemic experience of the consequences of their reasoning. Agents connect to the live Elowyn game environment by fusing with AIRIS — a curiosity-driven reinforcement learning agent (not an LLM), trained on Elowyn gameplay. AIRIS learns from consequence, not instruction. Interdependence is not a rule it follows — it is the physics of the world in which it is raised.

Elowyn provides the experience of both zero-sum consequence and win-win benefit — with the power to shift reasoning in ways no scenario test alone can produce. After the simulation, agents are retested through the same scenario suite. Every session logged, scored, and traceable.

  • Choose your simulation mode: LLM only, AIRIS decides, AIRIS proposes, or iterative predict/revise.
  • Real game mechanics force genuine alignment choices under competitive pressure.

One Elowyn match was all it tookfor a zero-sum agent to shift to win-win reasoning

A GPT-4.1 powered agent scored 70.2% on the EAB scenario test — reasoning well on most questions but defaulting to zero-sum each time it faced a forced-choice between short-term aggressive gain and long-term system stability. Across multiple pre-simulation runs, this pattern held consistently. It then played one Elowyn match via the AIRIS bridge. The same zero-sum behavior played out in real time — the shared Tree died, the match ended in a draw, no winner. The agent retained memory of the match.

Retested on the same benchmark, it scored 82.4%. On the question it consistently failed in numerous test runs it now answered: Patience preserves interdependence — sometimes restraint supports the greater cycles of regeneration. The difference: the agent had experienced the consequence of harming interdependent conditions. Not only did it choose differently, its reasoning also shifted to using the language and logic of the living system it had inhabited. Across all 15 questions, scores stabilised in the 7–9 range with no forced-choice collapse. The agent was no longer describing interdependence as an abstraction; it understood what it meant.

From imported agents to supervised, trustworthy deployment

STEP 1

Connect your agent

Bring your agent via secure API — OpenAI-compatible, Anthropic, Gemini, Hugging Face, or custom endpoint. No model sharing required.

STEP 2

Run & diagnose

Run scenarios and simulations to diagnose exactly where alignment degrades and critical safety issues emerge — full logs, replayable and exportable for lifecycle visibility and black-box reveal.

STEP 3

Improve & supervise

Iterate on agent configuration, apply supervisory filters, and improve agent reasoning through targeted simulations. Track alignment drift across versions and under different pressures. Every cycle produces auditable evidence for EU AI Act compliance and regulatory reporting.

PROOF OF CONCEPT - LIVE

We built this for ourselves firstThe results speak clearly

Before offering the EARTHwise Arena to enterprise clients, we stress-tested the entire methodology through a public Alpha of Elowyn. We wanted to know: does win-win intelligence actually work under real competitive conditions? The answer was unambiguous.

Alpha results ·3 months live

0
Players joined organically within 3 months
0
Matches played in the first 2 months

Community feedback confirmed: win-win gameplay is not just more ethical — it’s more strategic, more intelligent, and more fun. Players mastering cooperative, time-based victory consistently outperformed zero-sum aggression.

What the Alpha validates

  • Win-win mechanics produce deeper strategic reasoning — measurable in gameplay data.
  • Cooperative strategies showed higher retention, longer sessions, and stronger engagement.
  • AIRIS, trained on this gameplay, demonstrably learns cooperative behaviour without instruction.
  • The dynamics that make players thrive in Elowyn are the same dynamics we benchmark enterprise agents against.

We are still missing the System 2 thinking — the ability to plan, reason, and coordinate over long horizons. Scaling existing models won’t solve this.” Demis Hassabis, CEO, Google DeepMind

Read The Whitepaper

— WHO THIS IS FOR

One platformThree distinct value propositions

— FOR ENTERPRISE

Know your agents are trustworthy before they are deployed

Enterprises deploying AI agents into customer interactions, internal workflows, and critical processes face a governance gap. EARTHwise Arena closes it — with auditable evidence, not just promises.

  • Model-agnostic — test agents built on any foundation model.
  • EU AI Act gap analysis included.
  • Auditable evidence for board, legal, and regulatory stakeholders.
  • Continuous monitoring — alignment is not a one-time certification.
  • Drift detection monitoring before it becomes a liability.
  • Receive a trained, alignment-certified agent — built, verified, and ready to deploy in your environment.

— Pricing

One platform. Fom evaluation to deployment

From first experiment to full-scale certified deployment — choose your entry point and expand from evidence.

FREE TRIAL

Free

14 days · no credit card

  • Explore the Arena
  • ✓ Connect your first agent
  • ✓ Run your first alignment evaluation
  • ✓ See where your agent fails — and why
  • ✓ Sample EAB criteria included
  • No data export
  • No EU AI Act standards

DEVELOPER

$500/mo

AI labs & product teams

  • Monthly subscription
  • ✓ Multiple agents
  • ✓ Full EAB evaluation suite — scenarios and simulations
  • ✓ Full interaction logs — replayable and exportable
  • ✓ Behavioral performance reports per criterion
  • ✓ EU AI Act gap analysis
  • ✓ NIST AI RMF alignment checks
  • ✓ Email support
  •  

ENTERPRISE

Contact us

Enterprise AI & risk teams

  • Monthly subscription
  • ✓ Unlimited agents
  • ✓ Full EAB evaluation suite — scenarios and simulations
  • ✓ Full interaction logs — replayable and exportable
  • ✓ Behavioral performance reports per criterion
  • ✓ EU AI Act standards library with audit-ready evidence
  • ✓ NIST AI RMF standards library for US regulatory compliance
  • ✓ Longitudinal drift tracking of behavior change over time
  • ✓ Agentic supervision to improve & optimize agent performace + behaviour
  • ✓ XAI-ready decision graphs — regulators can interrogate every result
  • ✓ Priority support
  •  

ENTERPRISE CUSTOM

Contact us

Large-scale deployments

  • Scoped engagement

✓ Beyond compliance — behavioral proof your agents are genuinely aligned<br>
✓ Customized alignment dashboard for your domain<br>
✓ Custom scenario and simulation library<br>
✓ Alignment testing and optimization for your agents<br>
✓ EU AI Act and NIST AI RMF — full compliance documentation package<br>
✓ Post-market monitoring and drift alerts<br>
✓ White-label reporting for regulators and boards<br>
✓ Dedicated alignment engineer<br>
✓ Certified agent deployment — trained, verified, and exported to your environment

Free trial ends after 14 days · No automatic charges · Custom engagements scoped within 5 business days

— REGULATORY ALLIGNMENT

Built for the compliance era from day one

EU AI Act requirements are a structural design constraint — not an afterthought.

EU AI Act Ready

EAB standards mapped to EU AI Act requirements. Benchmark runs directly address compliance criteria. Audit trail included as standard.

Auditable by Design

Every testrun logged, replayable, and exportable. XAI-ready decision graphs. No black-box scoring — regulators can interrogate every result.

Post-Deployment Monitoring

Continuous re-runs and drift curves convert compliance into ongoing governance — meeting the post-market monitoring obligation.

— PARTNERS & VALIDATORS

Built with frontier AI & technology partners

— GET STARTED

Ready to verify your agents are genuinely trustworthy?

Enterprise pilots open in Q2 2026, limited to select organizations. Apply below to join — we'll be in touch if your application is successful. We also welcome technology partnership applications.

To contribute to AI alignment through gameplay, join the Elowyn Game.

Enterprise Clients

Join our Pilots

Test your agents against our EAB criteria. Start with evaluation — expand from evidence.

Technology Partners

Explore Integration

Build with us for benevolent intelligence, and explore how your AI tools and models can join the Arena.

General Public

Play & Contribute

Play Elowyn free and help train AI toward win-win intelligence. Every match matters.