Is prompt engineering dead in 2026?

No, but it stopped being the center. The prompt remains one layer, alongside context, memory, tools, hooks, evals and runtime. The main question is no longer “how do I phrase the request?” but “how do I build an environment where AI reliably closes the loop and doesn't lose context?”

How does context engineering differ from prompt engineering?

Prompt engineering asks “how do I say this?”; context engineering asks “what should the model know right now?” In the DataHub survey, 82% of IT leaders said prompt engineering alone is no longer enough for AI at scale, and 95% consider context engineering important for scaling agents.

Should every task be turned into an AI agent?

No. There's a new risk of overrating agents the way prompts were once overrated. Many “agent use cases” are really workflows with a known sequence of steps, where a function call suffices. Known path → workflow, unknown path → agent, high risk → human gate.

Why has agentic AI work become more expensive?

Because an agent burns loops, not “messages”: it reads the repo, runs tests, holds long context, does subagent analysis and repeats. Bad context = more tokens, bad evals = more manual review, bad memory = every run from scratch. Hence the end of the all-you-can-eat subscription illusion for heavy agent workflows.

After the Prompt: The Birth of an Army of Agentic Loops

Just six months ago, the main hero of working with AI was the user searching for the right wording. In May 2026 that’s no longer true. The winner isn’t the one who writes a prettier prompt, but the one who closes the loop faster: context, agent, tools, artifact, verification, memory, repetition. This is a new operational logic — and a very familiar one.

The chat no longer looks like a chat

Not long ago, interacting with AI looked simple.

You opened ChatGPT or Claude. You typed a request. You waited. You copied the text. You fixed it. You typed again. You waited again. You copied again.

It was the era of the prompt.

In it, the central hero was the user trying to find the right wording. “Write like a senior developer.” “Act as a McKinsey consultant.” “Ask me 10 clarifying questions.” “Don’t hallucinate.” “Be precise.” “Write without fluff.” “Think step by step.”

It really worked. For a while.

But in May 2026 it became clear that this logic is already becoming obsolete. Not because prompt engineering disappeared — it remained as one of the layers. But because it stopped being the center of the system.

Now the main question is no longer “how do I phrase the request?”

It’s a different one: how do I build an environment where AI reliably closes the loop and doesn’t lose context?

This is no longer a chat. It's an **operating system**.

On 5 May, OpenAI rolled out memory sources in ChatGPT — the ability to see exactly which saved memories, past chats, custom instructions, files, or connected Gmail influenced an answer. Memory stops being a mystical black box and becomes a visible working layer. ^[1]

Nine days later, OpenAI rolled out Codex in the ChatGPT mobile app. And here the “mobile” part isn’t what matters. What matters is something else: Codex can now be run like a live remote employee. From your phone you can see active threads, project context, screenshots, terminal output, diffs, test results, and approvals, switch models, or kick off a new task. Over 4 million people already use Codex weekly. ^[2]

This is a radical change in UX.

AI no longer sits in a chat waiting for one perfect request. It works in an environment. It sees files. It runs tests. It writes diffs. It waits for approval. It has hooks. It connects over Remote SSH. It leaves traces.

The prompt becomes just a launch button.

The work happens in a loop.

The end of the magic prompt

In the old era, prompt engineering was like shamanism.

People collected “perfect prompts.” Collections of commands. Roles. Secret formulas. Markdown templates. “Ultimate Claude Code prompt.” “Best ChatGPT prompt for developers.” “One prompt to build your SaaS.”

It was natural. When a tool is new, people try to control it with language.

But over time it became clear: one giant prompt is a bad way to manage complex work.

A big prompt bloats easily. It contradicts itself. It mixes rules, context, goals, exceptions, style, technical constraints, history, security, and answer format. It becomes not an instruction, but a trash can.

And worst of all: it doesn’t scale.

One prompt can help write a text, fix a function, or explain an error. But it holds a long process poorly:

research a topic;
gather sources;
create a structure;
write code;
deploy to a server;
verify;
get feedback;
make edits;
update memory;
ship the next version.

In such a process, the prompt is just one layer. Alongside it are context, memory, tools, hooks, permissions, runtime, evals, logs, sandboxes, human approvals.

DataHub put it bluntly in April: prompt engineering optimizes how you phrase the instruction, while context engineering manages the entire information environment in which the model works. In their study, 82% of IT and data leaders said prompt engineering alone is no longer enough for AI at scale, and 95% consider context engineering important for scaling agents. ^[3]

The difference is enormous.

In the first case, a person polishes the wording. In the second, they build a system for delivering the right context at the right moment.

It’s like the difference between a beautiful order and proper logistics. A general can write a perfect order. But if the map is old, comms are broken, the unit doesn’t know the terrain, the fuel didn’t arrive, and HQ can’t see the front — the order is worthless.

It’s the same with AI. The model can be very smart. The prompt can be beautiful. But if the context is bad, the memory is noisy, the tools are disordered, and approval logic is undefined — the system will break.

The new formula of power: not the model, but the loop

The old AI logic thought in platforms.

GPT. Claude. Gemini. Llama. Grok. DeepSeek. (Which exact frontier model to pick for which task is a separate article, because the landscape changes fast.)

The new logic thinks in loops.

intent
  → context
  → agent
  → tools
  → artifact
  → verification
  → feedback
  → memory
  → next iteration

In the old logic you asked: “Which model is best?”

In the new logic you have to ask:

how does the agent get context?
which tools can it call?
where is it allowed to write?
when should it stop?
what will it log?
who approves risky actions?
how does the result turn into the next loop?
what of this gets saved to memory?
which errors become eval tests?

This is no longer the magic of the answer. This is the engineering of repetition.
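As a rough illustration of that engineering of repetition, the loop can be written as a small driver with an explicit stop condition. This is a minimal sketch, not any vendor's runtime: `LoopState`, `run_loop`, and the `act` and `verify` callables are hypothetical names standing in for the agent-plus-tools step and the verification step.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Callable, Optional


@dataclass
class LoopState:
    intent: str
    memory: list[str] = field(default_factory=list)  # feedback carried between iterations
    iterations: int = 0


def run_loop(
    state: LoopState,
    act: Callable[[str, list[str]], str],       # agent + tools: produces an artifact
    verify: Callable[[str], tuple[bool, str]],  # returns (passed, feedback)
    max_iterations: int = 5,                    # answers "when should it stop?"
) -> Optional[str]:
    """Repeat act -> verify until the artifact passes or the budget runs out."""
    while state.iterations < max_iterations:
        state.iterations += 1
        artifact = act(state.intent, state.memory)
        passed, feedback = verify(artifact)
        if passed:
            return artifact
        state.memory.append(feedback)  # failed feedback shapes the next iteration
    return None  # budget exhausted: hand the task back to a human
```

The point of the sketch is that the stop condition and the memory update live in code, outside the prompt. A real loop would also log each iteration and persist `state`.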

LangChain, in its State of Agent Engineering report, writes that 57.3% of respondents already have agents in production, and another 30.4% are actively developing agents with plans to deploy. The biggest production blocker is quality, named by 32% of respondents. Observability has already been adopted by 89% of organizations, while only 52.4% have evals. ^[4]

These numbers show one thing: agents are no longer a demo. They have become working infrastructure. And now the main headache isn’t “how to make the model write something,” but how to make its work reliable.

Codex in your pocket: the agent as a remote employee

OpenAI named the May release simply: “Work with Codex from anywhere.”

Formally, it’s mobile access to Codex in the ChatGPT app. But culturally it’s something else entirely — it’s the first mass-market image of an AI coding agent as a process that doesn’t end with an answer in a chat.

You launch a task on a laptop, Mac mini, or devbox. The agent works in your environment. It sees the project context. It runs commands. It produces terminal output. It creates a diff. It takes screenshots. It runs tests. It asks for permission.

At that moment you can be in a taxi, on a walk, at the gym, or between calls — and give it a decision from your phone.

This is not “writing code on your phone.” It’s managing an execution loop from your phone.

OpenAI writes it plainly: a small check-in can keep the work from stalling, avoid unnecessary rework, or help Codex move with the right context. Hence the set of actions: review outputs, approve commands, change models, start something new. ^[2]

This is a very important moment for everyone who works with AI.

A person no longer sits in front of the model as an operator of a text field. The person becomes a dispatcher of long tasks.

human:
  sets the intent
  gives constraints
  makes decisions
  verifies the result

agent:
  reads the code
  calls tools
  tries options
  creates an artifact
  returns evidence

This is a new rhythm. Not “one request — one answer.” But “one task — many micro-interventions.”

In this sense, the smartphone becomes not a device for consuming AI, but a remote control for the agent.

Claude Code and the problem of long loops

Anthropic, too, is moving not just toward a smarter model, but toward a longer execution loop.

On 6 May, Anthropic doubled the five-hour Claude Code rate limits for Pro, Max, Team, and seat-based Enterprise plans, removed the peak-hours limit reduction for Claude Code on Pro and Max, and substantially raised the API limits for Opus models. The same release announced a partnership with SpaceX that grants access to over 300 MW of new capacity and over 220,000 NVIDIA GPUs over the course of a month. ^[5]

These numbers sound like infrastructure news. But in reality it’s news about UX.

Why do AI coding tools hit limits so quickly?

Because agentic development doesn’t burn “messages.” It burns loops.

The agent reads the repository. It searches for files. It tries a patch. It runs tests. It gets an error. It reads the log. It rewrites. It runs again. It makes a diff. It gives a summary. It waits for the human. It continues.

This is a long session. It can last minutes or hours.

LangChain describes it like this: long-running agents need durable execution, memory, multi-tenancy, human-in-the-loop, and observability. An agent can work for minutes or hours, wait for human approval, survive a deploy or crash, and not lose progress. ^[6]

So the real bottleneck isn’t only intelligence.

The real bottleneck is the duration and reliability of the loop.

If an agent can’t work for a long time, it stays autocomplete.

If it can work for a long time, save state, ask for permission, recover, and leave an audit trail — it becomes a worker in the system.
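One way to picture “save state and recover” is a checkpoint written after every completed step. The sketch below is an assumption-laden toy, not LangChain's durable execution API: `run_steps` is a hypothetical helper, the checkpoint is a local JSON file, and every step result is assumed to be JSON-serializable.

```python
from __future__ import annotations

import json
from pathlib import Path
from typing import Callable


def run_steps(steps: list[tuple[str, Callable[[], object]]], checkpoint: Path) -> dict:
    """Run named steps in order, persisting progress after each one."""
    # Load earlier progress if a previous run was interrupted.
    done = json.loads(checkpoint.read_text()) if checkpoint.exists() else {}
    for name, fn in steps:
        if name in done:
            continue  # finished before the crash or redeploy: do not repeat it
        done[name] = fn()
        checkpoint.write_text(json.dumps(done))  # durable record of this step
    return done
```

If a step raises halfway through, the next call with the same checkpoint path resumes from the first unfinished step. It does not start from scratch.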

Control plane: where the new war is really being fought

On 15 May, VentureBeat very precisely named the next front: not the model war, but the agent control plane.

The idea is simple: companies are no longer just choosing which model answers better. They’re choosing where the operational machine of AI will live: in the Microsoft stack, the OpenAI API layer, the Anthropic managed runtime, an open framework, or a hybrid mix.

Per VB Pulse, in February 2026 Microsoft Copilot Studio and Azure AI Studio had 38.6% primary-platform adoption among enterprise agent orchestration respondents, OpenAI Assistants and Responses API had 25.7%, and Anthropic tool use and workflows had 5.7% (the sample is small, so VB explicitly cautions against over-reading it). ^[7]

But even with that caution, the signal is strong.

A model can be swapped. A control plane is harder to swap. Because that’s where these live:

permissions;
memory;
tools;
approvals;
logs;
auditability;
sandboxing;
integrations;
cost controls;
security policies;
workflow state.

In the old AI logic, vendor lock-in was at the model level. In the new one, it’s at the runtime level. That’s much deeper.

If your team keeps workflows, permissions, memory, hooks, and agent tasks in one environment, you’re no longer just “using a model.” You’re building an operational fabric around it.

Not “which LLM is the smartest?” But “where does my work live?”

RAG is no longer enough

Another shift: classic RAG stops being the universal answer.

A few years ago it was fashionable to say: “We’ll connect documents to a vector database, and the agent will know everything.”

But agentic workflows quickly exposed the weakness of this approach.

When an agent works in a long loop, it needs not just a search of documents. It needs a compiled structure of knowledge: what the source of truth is, how entities are related, what the permissions are, which data is stale, what format is needed for the next tool call, what can be thrown out of the context window.

On 4 May, VentureBeat described this as a transition from a RAG pipeline to a compilation-stage knowledge layer. In the Pinecone Nexus example, one financial analysis task that previously consumed 2.8M tokens was completed with 4,000 tokens — a claimed reduction of 98% (this is Pinecone’s internal benchmark, not yet customer-validated). ^[8]

Even if you treat the number cautiously, the direction is obvious.

The future isn’t about throwing a bigger context window at the model. The future is about giving it a smaller, cleaner, more structured context.

bad:
  all the documents
  the entire history
  all the instructions
  all the noise

better:
  relevant facts
  current state
  clear constraints
  the needed tools
  short memory
  evidence links

A large context without discipline isn’t power. It’s trash with a big limit.

And if noise gets into that memory, the agent degrades **even before** it runs out of tokens.
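The “smaller, cleaner context” idea can be sketched as a selection step that runs before anything reaches the model. Everything here is illustrative: `Snippet`, `build_context`, the relevance scores, and the four-characters-per-token estimate are assumptions, not part of any product mentioned above.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class Snippet:
    text: str
    relevance: float    # 0..1, assumed to come from a retriever or a human
    stale: bool = False  # outdated data is dropped regardless of relevance


def estimate_tokens(text: str) -> int:
    # Crude assumption: about 4 characters per token. Use a real tokenizer in practice.
    return max(1, len(text) // 4)


def build_context(snippets: list[Snippet], budget: int) -> list[str]:
    """Keep the most relevant, non-stale snippets that fit the token budget."""
    chosen: list[str] = []
    used = 0
    for s in sorted(snippets, key=lambda s: s.relevance, reverse=True):
        if s.stale:
            continue
        cost = estimate_tokens(s.text)
        if used + cost > budget:
            continue  # too large for the remaining budget; try smaller snippets
        chosen.append(s.text)
        used += cost
    return chosen
```

The budget is a deliberate constraint, not a limit imposed by the model: the function refuses to fill the window just because the window is large.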

GitHub showed how not to breed agents

One of the best practical examples of the week is the GitHub accessibility agent.

On 15 May, GitHub described an experiment with a general-purpose accessibility agent. Its job is to answer accessibility questions in the Copilot CLI and VS Code integration, and also to catch and automatically fix simple, objective accessibility issues before production. The agent has already reviewed 3,535 pull requests and has a 68% resolution rate. ^[9]

But that’s not the most interesting part. The most interesting part is the architecture.

GitHub initially had a monolithic agent, but it quickly hit its limits. The team moved to a sub-agent architecture. Many guides advise building a whole zoo of agents, but GitHub found this works worse. They kept only two:

a passive reviewer / researcher;
an active implementer.

They are sandboxed and don’t pass content directly to one another. Instead, each creates structured, templatized output that the parent orchestrating agent consumes, validates, and routes.

orchestrator
  → reviewer
  → structured findings
  → orchestrator validates
  → implementer
  → changes or guidance
  → re-audit

Here you can see the new culture of AI workflow.

Agents don’t need to “communicate like humans.” It sounds cute, but it quickly creates chaos. They need to pass structured artifacts.

GitHub writes plainly that without a template schema, agents would start communicating arbitrarily, which creates higher token expenditure, hallucinations, unnecessary code changes, and a nearly impossible audit. ^[9]
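A structured handoff of this kind might look like the following. This is a hedged sketch and not GitHub's actual template: the `Finding` fields, the severity values, and `validate_findings` are hypothetical and chosen only to show the orchestrator rejecting free text.

```python
from __future__ import annotations

from dataclasses import dataclass

ALLOWED_SEVERITIES = {"low", "medium", "high"}  # assumed scale
REQUIRED_FIELDS = {"file", "rule", "severity", "evidence"}


@dataclass(frozen=True)
class Finding:
    file: str
    rule: str      # e.g. an accessibility criterion id; the format is an assumption
    severity: str
    evidence: str


def validate_findings(raw: list[dict]) -> list[Finding]:
    """Orchestrator-side check: reject anything that does not fit the template."""
    findings = []
    for item in raw:
        if set(item) != REQUIRED_FIELDS:
            raise ValueError(f"unexpected fields: {sorted(item)}")
        if not all(isinstance(v, str) and v.strip() for v in item.values()):
            raise ValueError("every field must be a non-empty string")
        if item["severity"] not in ALLOWED_SEVERITIES:
            raise ValueError(f"unknown severity: {item['severity']}")
        findings.append(Finding(**item))
    return findings
```

Only validated `Finding` objects would be routed to the implementer, which keeps the audit trail to a list of typed records.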

Even more important — GitHub introduced **complexity-based behavior**. If the code is too complex, the agent is not allowed to generate changes. It switches to guidance-only mode or escalates to a human. There are also high-risk patterns where the agent is **forbidden to write code**: drag and drop, toasts, rich text editors, tree views, data grids.

This is mature agent design. AI shouldn’t always act. Sometimes the best thing an agent can do is stop.
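The stop-or-act decision can be expressed as a small policy function. This is a sketch under stated assumptions: the pattern names come from the list above, but the numeric complexity score, both thresholds, and the `choose_mode` name are invented for illustration and are not taken from GitHub's implementation.

```python
from __future__ import annotations

from enum import Enum


class Mode(Enum):
    FIX = "fix"                      # agent may generate changes
    GUIDANCE_ONLY = "guidance_only"  # agent explains, writes no code
    ESCALATE = "escalate"            # hand over to a human


# Patterns listed above as zones where the agent must not write code.
HIGH_RISK_PATTERNS = {"drag and drop", "toast", "rich text editor", "tree view", "data grid"}


def choose_mode(patterns: set[str], complexity: int,
                fix_limit: int = 10, guidance_limit: int = 25) -> Mode:
    """Pick the agent's behavior; both limits are hypothetical thresholds."""
    if patterns & HIGH_RISK_PATTERNS:
        return Mode.GUIDANCE_ONLY  # never generate code in a high-risk zone
    if complexity <= fix_limit:
        return Mode.FIX
    if complexity <= guidance_limit:
        return Mode.GUIDANCE_ONLY
    return Mode.ESCALATE
```

The high-risk check runs first, so even a trivially simple change to a data grid never reaches `Mode.FIX`.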

The limit of automation: 36% won’t yield

In the same material, GitHub gives another strong number.

Of the 55 WCAG level A and AA Success Criteria, only 35 can be detected by deterministic automated code checkers. That means roughly 36% of the criteria require manual evaluation. ^[9]

This isn’t just an accessibility fact. It’s a model of reality for any AI workflow.

In every complex field there’s a part that can be checked automatically. And there’s a part where human judgment is needed.

automatable:
  syntax
  tests
  obvious errors
  format
  repeatable patterns
  part of compliance

needs judgment:
  UX
  reputational risk
  ethical ambiguity
  client context
  strategic trade-off
  semantic quality

The problem with many AI systems is that they behave as if 100% of reality can be turned into a tool call. That’s not true.

A strong AI architecture doesn’t deny human judgment. It places it at the right point in the loop.

Human-in-the-loop isn’t QA at the end

The old idea of human-in-the-loop looked like this: AI does something, the human checks it, approves or edits. This is a weak model.

A stronger model: the human doesn’t just check the output. The human shapes the trajectory.

LangChain describes the agent improvement loop as a process in which a team quickly creates a first version of an agent, runs it in a production-like environment, gathers data, analyzes outputs and eval scores, and human feedback influences context engineering and the next iterations. ^[10]

So the human isn’t an editor after the model. The human is the trainer of the loop.

They see where the agent gets confused. Which sources are missing. Where the context needs to be compressed. Where to add an example. Where to forbid an action. Where an escalation gate is needed. Where an error needs to be turned into a test.

run
  → failure
  → human judgment
  → eval case
  → context patch
  → workflow patch
  → next run

This is the key to reducing wasted iterations. Not asking the model to “be better.” But taking every failure and turning it into a new element of the system.
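Turning a failure into an element of the system can be as small as adding one named check to a regression list. The sketch below assumes hypothetical names (`EvalCase`, `run_evals`) and treats the agent as a plain function from task text to output text.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Callable


@dataclass
class EvalCase:
    name: str
    task: str
    check: Callable[[str], bool]  # True when the output is acceptable


def run_evals(agent: Callable[[str], str], cases: list[EvalCase]) -> list[str]:
    """Return the names of the cases the agent currently fails."""
    return [c.name for c in cases if not c.check(agent(c.task))]


# A failure seen once in review becomes a permanent case:
# "the agent left a TODO in generated code".
REGRESSIONS = [
    EvalCase("no-todo-left", "write a function", lambda out: "TODO" not in out),
]
```

Running `run_evals` before every context or workflow patch shows whether the patch fixed the failure without reintroducing an older one.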

Hooks: rules move from the prompt into the runtime

In the Codex release, OpenAI emphasized: Hooks are now generally available on all plans. They can be used for secret scanning, validators, conversation logging, memory creation, or repo-specific behavior customization. ^[2]

Claude Code is following the same logic. The Claude Code documentation from April–May shows a whole wave of runtime primitives: Routines, /usage, /ultrareview, effort levels, hooks, monitor tools, permission logic, sandbox rules. Routines run templated cloud agents on a schedule, a GitHub event, or an API call. /usage shows exactly what’s consuming the limits. /ultrareview runs parallel multi-agent analysis and an adversarial critique pass for code review. ^[13]

This means rules are increasingly moving from the prompt into the runtime.

You don’t need to write in the prompt:

“Please don’t delete important files, don’t push secrets, don’t change production configs, don’t run dangerous Bash commands, don’t generate code in high-risk zones.”

This needs to be coded into hooks, policies, deny-lists, validators, and approval gates.
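A rule that lives in the runtime, not in the prompt, can be sketched as a check that runs before any shell command executes. This is not the hook API of Codex or Claude Code. `check_command`, the `Decision` values, and the regular expressions are hypothetical, and the patterns are deliberately crude.

```python
from __future__ import annotations

import re
from enum import Enum


class Decision(Enum):
    ALLOW = "allow"
    ASK_HUMAN = "ask_human"  # approval gate
    DENY = "deny"            # deny-list hit


# Illustrative patterns only; a real policy would be reviewed per repository.
DENY_PATTERNS = [r"\brm\s+-rf\s+/", r"\bgit\s+push\s+.*--force"]
APPROVAL_PATTERNS = [r"\bgit\s+push\b", r"\bdeploy\b", r"production"]


def check_command(command: str) -> Decision:
    """Classify a command before it runs; deny rules win over approval rules."""
    if any(re.search(p, command) for p in DENY_PATTERNS):
        return Decision.DENY
    if any(re.search(p, command) for p in APPROVAL_PATTERNS):
        return Decision.ASK_HUMAN
    return Decision.ALLOW
```

The model can ignore a polite request in a prompt. It cannot ignore a check that sits between it and the shell.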

This is a fundamental difference.

On 13 May, GitHub also rolled out the Agent tasks REST API for the Copilot cloud agent in public preview. Copilot Business and Enterprise users can programmatically launch cloud agent tasks. The agent works in its own development environment, can make and validate code changes, and then open a pull request. GitHub gives scenarios: fan out refactors across repositories, one-click repo setup from an internal developer portal, weekly release preparation with release notes. ^[12]

This is another step from chat to infrastructure. When an agent is launched via a REST API, it becomes not an assistant in a window, but a part of the pipeline.

Sakana Conductor: a small manager stronger than a big genius

The most intellectual signal of recent weeks is Sakana AI’s work on Conductor.

The idea is almost elegant: don’t train yet another model that decides everything itself. Teach a small model to manage other models.

Sakana describes a 7B Conductor model, trained with reinforcement learning, that orchestrates a pool of frontier models: GPT-5, Gemini, Claude, and open-source models. It doesn’t write code directly. It decides: whom to call, which subtask to assign, what context to show, how to assemble the workflow. For simple factual questions it might call one model. For complex coding problems, it creates a planner-executor-verifier pipeline. ^[14]

The results are strong: in the paper, Conductor shows 83.93 on LiveCodeBench, 93.3 on AIME25, 87.5 on GPQA-Diamond, and an average of 77.27, exceeding the individual worker models in this setup. ^[15]

This is a very important metaphor for the entire AI era.

The future may belong not to the biggest model. But to the best coordinator.

big model:
  solves the task itself

conductor:
  breaks down the task
  picks the agents
  limits the context
  triggers verification
  assembles the final result

This is like a team. The strongest leader isn’t necessarily the best designer, programmer, analyst, and editor themselves. Their strength is knowing whom to bring in when, what to assign to whom, what information to give, when to stop, and how to assemble the result.

AI is starting to learn not only to answer. AI is starting to learn management.

But not every workflow needs an agent

Here it’s important not to fall into the opposite foolishness.

If prompt engineering was overrated, now there’s a risk of overrating agents.

On 14 May, Martin Fowler published James Pritchard’s view: many “agent use cases” are really just workflows — known sequences of steps where one or two steps involve an LLM. If the workflow is known, autonomy is often not needed. A function call is. ^[16]

This is painful, but correct. Not everything needs to be turned into an agent.

If a process is stable — code the process. If the steps are known — make a pipeline. If you need to extract data, classify, reformat, validate a template — that’s often a function with an LLM call inside.
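“A function with an LLM call inside” might look like this. It is a sketch only: `classify_ticket`, the label set, and the `llm` callable are hypothetical, and the callable stands in for whichever model client a team actually uses.

```python
from __future__ import annotations

from typing import Callable

LABELS = ("bug", "feature", "question")  # assumed label set


def classify_ticket(text: str, llm: Callable[[str], str]) -> str:
    """One known step with a model inside: no planning, no tools, no autonomy."""
    prompt = (
        f"Classify the ticket as one of: {', '.join(LABELS)}. "
        f"Reply with the label only.\n\n{text}"
    )
    answer = llm(prompt).strip().lower()
    # Anything outside the expected labels is routed to a person.
    return answer if answer in LABELS else "needs_review"
```

The surrounding code decides what happens next, so the step can be unit-tested with a fake `llm` like any other function.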

An agent is needed where there is: uncertainty, search, branching, tool use, long context, intermediate decisions, a need for human approval, a variable trajectory.

known path → workflow
unknown path → agent
high risk → human gate
repeatable pattern → automation

A simple matrix, but it saves you from over-agenting.
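Written down as code, the matrix is only a few branches. The `Task` fields and the `plan` function are hypothetical names for this sketch, and the returned strings simply mirror the four rows above.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class Task:
    steps_known: bool       # is the path known in advance?
    high_risk: bool = False
    repeats: bool = False   # does the same pattern recur?


def plan(task: Task) -> list[str]:
    """Map a task to the building blocks suggested by the matrix."""
    blocks = ["workflow" if task.steps_known else "agent"]
    if task.repeats:
        blocks.append("automation")
    if task.high_risk:
        blocks.append("human gate")
    return blocks
```

Note that the rows are not mutually exclusive: an unknown path with high risk gets both an agent and a human gate.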

The economics of agents: subscriptions are no longer bottomless

Another unpleasant but important signal is billing.

On 14 May, Zed explained that, effective 15 June, Anthropic splits Claude subscription billing into two pools: first-party Claude tools and third-party agent / SDK usage. For third-party agent usage through ACP, claude -p, and other tools, an Agent SDK credit is introduced: $20 for Pro, $100 for Max 5x, $200 for Max 20x. Once the credit is exhausted, usage is billed at API rates or requests stop. ^[17]

This isn’t just pricing drama. It’s the end of the all-you-can-eat illusion for heavy agent workflows.

When a person writes 30 messages in a chat, that’s one economics. When an agent launches dozens of tool calls, reads a repository, runs subagent analysis, holds long context, and repeats tests — that’s an entirely different economics.

Agentic work costs a lot, because it’s not “an answer.” It’s a compute loop.

Bad context = more tokens.
Bad prompt = more retries.
Bad tools = more erroneous actions.
Bad evals = more human review.
Bad runtime = more interruptions.
Bad memory = every run from scratch.

All of this costs. Not metaphorically. Literally.
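The gap between chat economics and loop economics is easy to see with back-of-the-envelope arithmetic. The function below is a sketch, and every number fed into it is an assumption: real prices and token counts vary by model and workload.

```python
from __future__ import annotations


def loop_cost(iterations: int, input_tokens: int, output_tokens: int,
              usd_per_m_input: float, usd_per_m_output: float) -> float:
    """USD cost of a loop in which every iteration re-sends its context."""
    per_iteration = (
        input_tokens * usd_per_m_input + output_tokens * usd_per_m_output
    ) / 1_000_000
    return iterations * per_iteration


# Hypothetical prices: 3 USD per million input tokens, 15 USD per million output tokens.
chat = loop_cost(30, 2_000, 500, 3.0, 15.0)        # 30 short chat turns
agent = loop_cost(200, 60_000, 1_500, 3.0, 15.0)   # 200 tool-using iterations
```

Under these made-up inputs the chat session comes to roughly 0.40 USD and the agent run to roughly 40 USD, a hundredfold difference driven mostly by how much context each iteration re-reads.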

Voice-to-artifact: the next natural form of work

The most interesting thing is that this logic is already starting to look very natural in real work.

A person speaks into a microphone. Claude Code or Codex gets the task. It creates an HTML page, a landing page, a script, a database migration, a Telegram bot, a research document. It uploads it to the server. The person looks at the result. By voice, they give edits. The AI changes it. Deploy again. Feedback again. Everything is documented in .md files, project memory, agent instructions, and a changelog.

This is already a normal working mode for people who live in fast iteration.

voice
  → agent
  → artifact
  → deploy
  → inspect
  → correction
  → memory
  → next version

Thinking stops being separated from production.

Previously, between an idea and an artifact there was a lot of friction: sit down, formulate, write a spec, hand it to a developer, wait, receive it, explain the edits, wait again.

Now the voice interface compresses this loop. The idea moves into the product almost directly.

But that’s exactly why structure becomes critical. If you don’t formalize this loop, it quickly turns into chaos: different sessions, different agents, lost context, duplicates, poorly recorded decisions, “why did we do this?”, “where’s the latest version?”, “which prompt worked?”

So the new stack must have memory. Not romantic. Technical:

/project.md      what it is, the goal, users, domain, deploy
/decisions.md    key decisions, why, what not to do
/workflows.md    how to launch, deploy, verify, roll back
/agents.md       roles, constraints, tools, escalation rules
/evals.md        typical errors, acceptance criteria, regressions

This isn’t bureaucracy. It’s a way not to lose speed.
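Setting up that memory layer can be automated so that every new project starts with the same files. This is a minimal sketch: the file names follow the list above, while `scaffold_memory` and the placeholder contents are assumptions.

```python
from __future__ import annotations

from pathlib import Path

# File name -> what belongs in it, taken from the list above.
MEMORY_FILES = {
    "project.md": "what it is, the goal, users, domain, deploy",
    "decisions.md": "key decisions, why, what not to do",
    "workflows.md": "how to launch, deploy, verify, roll back",
    "agents.md": "roles, constraints, tools, escalation rules",
    "evals.md": "typical errors, acceptance criteria, regressions",
}


def scaffold_memory(root: Path) -> list[Path]:
    """Create any missing memory files; never overwrite existing notes."""
    root.mkdir(parents=True, exist_ok=True)
    created = []
    for name, purpose in MEMORY_FILES.items():
        path = root / name
        if not path.exists():
            path.write_text(f"# {name}\n\n<!-- {purpose} -->\n", encoding="utf-8")
            created.append(path)
    return created
```

Because existing files are left untouched, the function is safe to call at the start of every session.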

Why wasted iterations are the main enemy

This whole topic comes down to one thing: reduce the number of wasted iterations.

Not just “get a better answer.” But reach the right result in fewer loops.

A bad AI workflow looks like this:

prompt
  → not it
  → explanation
  → not it
  → clarification
  → not it
  → irritation
  → manual edit

A good AI workflow looks like this:

spec
  → context
  → agent run
  → artifact
  → validation
  → focused correction
  → memory update
  → reusable template

The difference isn’t in the “smartness of the model.” The difference is that the second loop learns. After each error it becomes better. The first one just burns nerves.

That’s exactly why the artifact-first approach is so strong. Don’t ask the AI to “explain what you did.” Ask it to create an artifact that can be verified: a diff, a test result, a deployed page, JSON, a checklist, a PR, a changelog, a screenshot, a log, a report.

The worst anti-patterns of 2026

In the new AI reality, work is most often broken not by models, but by bad patterns.

1. A giant prompt instead of a system. When all the rules, style, context, history, and constraints live in one canvas of text — the system becomes fragile. Better: a short core prompt, context separately, tools separately, policies in hooks, managed memory, explicit evals.

2. An agent without boundaries. If an agent can do everything, sooner or later it will do something unnecessary. Better: read-only by default, write only with scope, dangerous actions require approval, high-risk zones blocked.

3. Free text between agents. Without a schema you get hallucinations, token bloat, and audit hell. Better: structured handoff, template schema, explicit fields, parent orchestrator validates.

4. No memory of decisions. Every new session starts from scratch. The human explains the same thing again. Better: decisions.md, project.md, known constraints, what not to do.

5. No eval loop. Errors are fixed manually but don’t become tests. Better: failure → captured → classified → added to eval → prevents regression.

A new profession: architect of agentic loops

From this a new role is born.

Not a prompt engineer in the old sense. But an agent workflow architect.

A person who can:

break processes into stages;
determine where a model is needed and where ordinary code is;
design the context flow;
configure memory;
spell out agent roles;
create structured handoffs;
build approval gates;
introduce evals;
control costs;
make workflows portable between vendors.

This isn’t one profession on LinkedIn. It’s a skill that will permeate the work of a founder, a CTO, a product manager, an operations lead, an analyst, an editor, a developer.

In 2024 it was valuable “to be able to prompt.” In 2026 it’s valuable to be able to build loops.

Bottom line

The era of the magic prompt is ending not because prompts became unnecessary. It’s ending because the work became longer than one prompt.

AI now writes code, runs tests, edits files, works in a devbox, waits for approval, remembers decisions, reads mail, connects tools, creates PRs, launches via API, gets hooks, and falls under governance.

This is no longer a “text generator.” It’s a new execution machine.

And in this machine, what matters most isn’t who writes the prettiest prompt.

What matters is who closes the loop faster:

see
  → formulate
  → launch
  → verify
  → fix
  → remember
  → repeat

Just as in Madyar’s war the winner isn’t a single platform but the speed of the sensor loop — in modern AI work the winner isn’t a single model but the speed of the agentic loop.

The prompt was a command.

The loop becomes an army.

Sources

1. OpenAI: Memory Sources release for ChatGPT, 5 May 2026. https://openai.com/index/
2. OpenAI: “Work with Codex from anywhere” mobile launch + Hooks GA, 14 May 2026. https://openai.com/index/
3. DataHub: Context Engineering survey (82% IT leaders, 95% on importance), April 2026. https://datahub.com/
4. LangChain: State of Agent Engineering report, 2026 edition. https://blog.langchain.com/
5. Anthropic: Claude Code rate-limit doubling + SpaceX 300 MW partnership, 6 May 2026. https://www.anthropic.com/news/
6. LangChain: Durable execution for production deep agents. https://blog.langchain.com/
7. VentureBeat: Agent Control Plane analysis (Microsoft 38.6%, OpenAI 25.7%, Anthropic 5.7%), 15 May 2026. https://venturebeat.com/ai/
8. VentureBeat: Pinecone Nexus compilation-stage knowledge layer benchmark (2.8M → 4K tokens), 4 May 2026. https://venturebeat.com/ai/
9. GitHub Engineering: Accessibility Agent architecture, sub-agent pattern, 36% manual-only threshold, 15 May 2026. https://github.blog/engineering/
10. LangChain: Agent improvement loop & context engineering patterns. https://blog.langchain.com/
11. GitHub: Copilot CLI in JetBrains IDEs, Ask Question tool, .agent.md support, 13 May 2026. https://github.blog/
12. GitHub: Agent tasks REST API public preview, 13 May 2026. https://github.blog/
13. Anthropic Claude Code documentation: Routines, /usage, /ultrareview, hooks, sandbox rules (April–May 2026). https://docs.claude.com/
14. Sakana AI: Conductor paper: 7B model orchestrating frontier models via RL. https://sakana.ai/
15. Sakana AI: Conductor benchmark results (LiveCodeBench 83.93, AIME25 93.3, GPQA-D 87.5). https://sakana.ai/
16. Martin Fowler & James Pritchard: “Workflows vs agents” distinction, 14 May 2026. https://martinfowler.com/
17. Zed: Anthropic Agent SDK credit split ($20/$100/$200), effective 15 June 2026, 14 May 2026. https://zed.dev/blog/