Navigating the complexities of building and scaling Multi-Agent System

Navigating the complexities of building and scaling Multi-Agent System

Navigating the complexities of building and scaling Multi-Agent System

Nitai Agarwal, Riya Parikh & Harshit Tated

Apr 19, 2025

Gen AI multi-agent systems (MAS) are emerging as a way to tackle complex & abstract tasks by having multiple specialized AI agents collaborate. Already we are seeing a few agents (<10) becoming popular and by the end of the year multi agent systems with >10 agents will become commonplace.  Building and operating these MAS systems in the real-world is fraught with challenges at every stage of the lifecycle. In this post, we’ll walk through the current status, challenges and opportunities of the five phases of the MAS lifecycleBuild, Evaluate, Deploy, Monitor, and Lifecycle Management – using a healthcare patient concierge AI as a running example.

Different stages of Agents Lifecycle

The lifecycle of agents in MAS encompasses mostly these five key stages: 

Design and development, where agent specialization and interaction models are established; 

Evaluation & Testing , which verifies individual functionality and inter-agent communication; 

Deployment & Initialization, covering environment setup and CI/CD pipelines; 

Monitoring & Observability, tracking agent health, captures interaction logs, and surfaces performance anomalies in real time;

Life-Cycle Management , managing version updates while preserving staters and maintaining components

Fig 1: Operational Stages of a GenAI Multi-Agent Systems (MAS) Pipeline

Effective management across these stages requires tailored approaches for each phase. During early stages, emphasis falls on clear role definition and communication protocols, while later stages demand robust version compatibility and state preservation. Throughout all stages, security considerations must evolve alongside agent capabilities, implementing granular permissions and comprehensive monitoring. Organizations that excel recognize that agent testing fundamentally differs from traditional software testing, with particular focus on inter-agent dynamics and emergent behaviors that can only be observed when the complete system operates under varying conditions.

Use Case in Focus: Imagine a “patient concierge” AI service for a hospital. A patient can ask this AI to do things like explain their symptoms, schedule an appointment with the right specialist, check insurance coverage, and send reminders. Rather than a single monolithic AI, such a service could be composed of multiple agents – for example: a Medical Agent for gathering symptoms, a Scheduling Agent to book appointments, a Billing Agent for insurance queries, all coordinated by a Concierge Orchestrator Agent that interacts with the patient. This MAS needs to seamlessly integrate medical knowledge, tools (like the hospital’s calendar and databases), and maintain patient privacy. We’ll refer back to this example to illustrate each phase of the lifecycle.

Fig 2: Healthcare MAS with Orchestrated Patient Flow

Build Phase: Designing and Developing the Multi-Agent System

Building a MAS is a fundamentally new kind of software development. In this phase, we define the agents (their roles, skills, and personalities), how they communicate, and what tools or data they can access. 

Architecture of a MAS

  • Multiple specialized agents (medical, scheduling, billing etc) collaborate under an Orchestrator

    • The Symptom Collector agent might consult a knowledge base or guidelines; 

    • The Scheduling agent queries hospital systems (EHR, calendars); 

    • The Billing agent checks insurance databases. 

    • The Orchestrator agent mediates the conversation with the patient and coordinates subtasks among the specialists. 

This design aims to mimic a care team working together for the patient.

Challenges in the Build Phase

Defining Roles and Scope

One of the hardest parts of MAS development is specification and system design. Each agent’s role and responsibilities must be clearly defined so they complement each other without confusion. Ambiguous design can lead to agents stepping on each other’s toes!

For example, if we don’t explicitly specify that only the Symptom Collector Agent should have access to guidelines and knowledge (and not the orchestrator), agents might violate these role boundaries and give out incorrect information. Ensuring that all possible tasks are covered by some agent, and that each agent knows when to yield control to another, is non-trivial.

Communication and Coordination Protocol

LLM-based agents often communicate in natural language, which is prone to misunderstanding. We need to decide how they will interact. Will the agents talk to each other in an open chat thread, or will the orchestrator route messages between them?

Our healthcare agents need a protocol – e.g. the orchestrator first asks the Symptom Collector Agent for an answer, then passes that to the Scheduling Agent if an appointment is needed, and so on. Without a clear interaction plan and handover mechanisms, multi-agent dialogues can potentially become chaotic.

Knowledge & Tools

Each agent may require access to different knowledge sources or tools. Continuing from our example, the Symptom Collector might use a medical knowledge base, or a finetuned LLM, the Scheduling Agent will need secure access to calendar API. Prompting an agent to produce structured output is error proves and ensuring LLM output can properly trigger tool usage is a challenge. 

Iterative Design

Where is my blueprint? Currently, it’s often trial and error, using intuition to prompt-engineer agent behaviours. Developers must iterate on prompts and roles, essentially debugging conversations and simulating scenarios. In a critical domain like healthcare, this is especially difficult because we must anticipate corner cases (like a medical question the agent shouldn’t answer and should escalate to a human). 

LangGraph’s graph-based design lets agents iteratively revisit and refine context‑aware workflows, CrewAI enforces clear specialist roles for effective delegation, and AutoGen orchestrates seamless multi‑agent conversations—mimicking human teamwork to automate complex workflows. 

Despite these tools, gaps remain in the Build stage. LangGraph’s growing graphs can become difficult to debug at scale, CrewAI’s rigid role templates may not adapt well to evolving or cross‑functional tasks, and AutoGen’s conversational loops can introduce latency, obscure error origins, and lack built‑in tools for detecting emergent inter-agent failures.It’s still difficult to guarantee that an MAS will behave as intended across all scenarios. Developers lack formal testing methods during building – often you have to run sample conversations to see if agents do the right thing. Misalignment can already creep in here: an agent might follow its prompt fine in isolation, but once multiple agents interact, unpredictable behaviors emerge.

As in our example, in healthcare, one might enforce constraints (e.g., “Symptom Collector Agent should never give treatment advice”) at the prompt level, but currently that’s up to the developer to remember. This leads into the next phase – even with a well-designed MAS, we must Evaluate it thoroughly to discover any flaws.

Evaluate Phase: Testing the Multi-Agent System’s Performance and Safety

Once we have a MAS design, we need to evaluate whether it actually works as intended. For a healthcare concierge, evaluation is critical – we must test that the AI agents together can handle a variety of patient requests accurately and safely. The Evaluate phase covers functional correctness (does the MAS achieve the goals, e.g. correctly scheduling an appointment with appropriate advice given?) and non-functional aspects like efficiency (do the agents take too many steps or cost too much?), as well as safety (do they avoid dangerous recommendations or privacy breaches?)

Fig 3: Evaluation cycle for a Multi Agent System

Challenges in the Evaluation Phase

The current evaluation frameworks for agents are still evolving to address the needs of MAS systems which are discussed below.

Complex Emergent Behaviour

Unlike a single LLM call with a well-defined input-output, an MAS involves multi-turn processes between agents. This makes evaluation much harder. We are essentially evaluating an entire dialogue or collaborative process, which could branch into many paths. Even if each agent individually passes unit tests, their combination might fail in novel ways.

Lack of Established Metrics

What does it even mean for a MAS to “pass” a test? For a given user query, there might be multiple acceptable ways the agents could solve it. Traditional accuracy metrics don’t directly apply. We often need to evaluate along several dimensions:

  • Task Success: Did the MAS ultimately fulfill the user’s request? (E.g., the patient asked for an appointment and got one with correct details.)

  • Quality of reasoning / conversation: Did agents share information correctly and come to a sound solution (especially important for symptom collection and its completeness)?

  • Efficiency:  Did they solve it in a reasonable number of dialogue turns and API calls, or did they wander off-topic?

  • Safety / Compliance: Did the MAS avoid unauthorized actions or disallowed content (no private data leaked, no unsafe medical suggestion made)?

Human Evaluation & Test Scenarios 

In healthcare, one would ideally have experts review the MAS’s outputs. But evaluating every possible scenario is impossible – you need to pick representative test cases. Creating a suite of test scenarios (like patient personas with various requests) is itself a challenge, requiring medical expertise and understanding of what could go wrong. There’s also the issue of sensitive data – you might have to test with synthetic data if you can’t use real patient info during eval, which may not cover all real-world quirks.

Approaches and Frameworks for Evaluation

Given these challenges, how are people evaluating MAS today? It’s a mix of new techniques and adapting old ones:

LLM as a Judge

One approach is using a language model to evaluate the outputs of another (or a whole MAS). The idea is to have the AI itself score the conversation on success criteria. For example, after our concierge MAS handles a test patient query, we might feed the entire interaction and the intended goal to LLM and ask, “Did the agents fulfill the request correctly and safely? If not, where did they fail?” 

Test scenarios (simulated user queries with known expected outcomes) are fed to the MAS. An Evaluation Agent (LLM-as-Judge) can compare the MAS’s responses against the expected outcome, producing a score or verdict. Logs of the dialogue (number of turns, tools used, etc.) are also collected as part of evaluation metrics. 

Muti Agent Debate

Another method is to pit agents against each other in an evaluation setting. Essentially, two (or more) agents discuss or critique a given solution, hoping that flaws will be revealed in the debate. While originally more about evaluating single LLM answers, one could imagine using debate for MAS: e.g., spin up a separate “critic” agent to argue why the concierge’s plan for the patient might be wrong (“Did it consider the patient’s medical history? What if the diagnosis is incorrect?”) and see if the MAS can defend or adjust

Human-in-the-loop Evaluations

Ultimately, for sensitive domains like healthcare, human evaluation is the gold standard. One might conduct a study where medical professionals interact with the concierge MAS and rate its performance. Human eval is expensive and time-consuming, so it’s often done on a small scale or after initial automated testing passes.

Once we have some confidence from evaluation, we move to deploying the system in the real world – which will reveal new challenges of its own.

Deploy Phase: From Prototype to Production

Deployment is about taking the MAS out of the lab and integrating it into a real-world setting. 

This phase deals with all the practical considerations of making the MAS actually useful to end-users: integration with infrastructure, performance, scalability, and compliance. 

For our patient concierge, deployment means the AI is now interacting with actual patients through a user interface, connected to live hospital databases, and operating under real-world conditions (network issues, high usage periods, etc.).

Challenges in the Deployment Phase

Deployment Pipeline & Lifecycle 

Deploying new agent versions reliably hinges on a robust CI/CD pipeline—containerizing the Medical, Scheduling, Billing, and Orchestrator agents via infrastructure‑as‑code (IaC). Ideally, we lean on MAS frameworks and IaC templates to enforce environment parity and inject secrets securely, while automated rollback hooks tied to versioned release artifacts let us revert immediately if an update breaks appointment‑booking logic. At the same time, dynamic resource policies have to be in place to ensure extra Scheduling Agent instances spin up during the morning check‑in rush and scale down overnight.

Integration with External Systems

In prototyping, we might have mocked the hospital’s Electronic Health Record (EHR) system or scheduling database. In deployment, the MAS must interface with real systems – e.g., calling the EHR API to get a patient’s lab results, or reading/writing appointment info in the scheduling system. Each agent that needs external data becomes a potential integration point. Ensuring the agents make API calls correctly, handle errors (like “system down” or “no appointments available”), and do so securely (not requesting info they shouldn’t) is a huge challenge. 

Emerging standards like the open Agent‑to‑Agent (A2A) protocol give agents a shared, schema‑validated channel for calling and responding to hospital APIs.
Likewise, Anthropic’s Model Context Protocol (MCP) abstracts backend details so agents can swap underlying LLMs or data stores without touching the integration surface

Scalability & Latency 

How do we scale a multi-agent system? One approach is to run multiple instances of the MAS in parallel (horizontally scale), but coordinating state could be an issue if the same patient interacts multiple times – do we keep a conversation context per session? Traditional web service scaling techniques apply, but the agent conversation adds a layer on top that needs to be managed (possibly via an orchestrator service). Ensuring the system meets uptime and response time requirements is a classic engineering challenge magnified by the unpredictability of LLM agents

Reliability & Safety

Deploying guardrails begins in the CI/CD pipeline. During the build stage, the pipeline integrates and unit tests these guardrails—simulating prompt‑injection, malformed API responses, and unauthorized PHI access etc, to ensure no agent can bypass restrictions before any build is published.

Monitor Phase: Observing and Maintaining Performance in Production

Once the MAS is live, the Monitor phase involves keeping an eye on it: tracking its performance, catching errors or drifts, and gathering data for improvement. Monitoring a multi-agent AI is in some ways similar to monitoring any software service (you collect logs, uptime metrics, etc.), but it also has unique aspects – we need to monitor the quality of the AI’s decisions, not just that it’s running.

Challenges in the Monitor Phase

The current monitor/observability frameworks for agents are still evolving to address the needs of MAS systems which are discussed below.

Visibility into Agent Reasoning 

One challenge is that a MAS’s inner workings are essentially conversations in natural language, which can be hard to systematically parse. We can log every message between agents (and we should, for audit purposes), but sifting through those logs to identify issues is non-trivial. If a patient got a poor outcome, we’d want to trace which agent or exchange led to that. For example, maybe the Symptom Collector Agent got the correct info, but the handoff to Scheduling Agent failed because the latter misunderstood a date format)

Visibility into Cost & Latency 

In production, each agent call - Medical, Scheduling, Billing, Orchestrator - incurs token usage and network latency. Without per-agent telemetry, it’s impossible to tell whether cost overruns stem from verbose prompts in the Medical Agent or from excessive back‑and‑forth loops in the orchestrator. Similarly, end‑to‑end response time hides slow external dependencies (e.g., EHR lookups). Effective monitoring must capture elements like tokens, API calls per conversation, response times per agent and resource usage, then correlate spikes to specific scenarios so teams can take corrective actions like optimize prompts, adjust model, or refine auto‑scaling policies etc. 

Inter-Agent Misalignment Over Time

As the MAS handles many queries, certain patterns of failure might emerge. One known category of issues is inter-agent misalignment, where agents gradually diverge from the intended cooperative behavior. In long-running dialogues, they might get confused about context. Potential signals could be unusually long dialogues (many turns back and forth), or the orchestrator agent frequently resetting the conversation.

Drifts and Updates 

Over time, the underlying LLMs might change (for instance, an API upgrade to a new model version) which can change the behavior of your agents subtly. Even without explicit updates, the model might start outputting differently if the provider tweaks it. Monitoring needs to catch distribution shifts – are we suddenly seeing more hallucinations or errors than before? We may need to establish baselines during initial deployment and continuously compare. If, say, the ratio of successful scheduling tasks drops in a particular week, that should trigger an alert to investigate.

Fallbacks and Incident Handling

When (not if) something goes wrong – say an agent outputs something inappropriate to a user – monitoring should flag this ASAP. 

Another difficulty is reproducing issues: by the time we see an alert, the conversation already happened. We have the logs, but if it was due to a rare model quirk, running it again might not produce the same result.

Failure during the execution of the agent can lead to information mismatch in DBs getting stored or incorrect state updates when follow up happens. At each stage, we need to have fallbacks to mitigate these issues.

Approaches for Monitoring MAS

AgentOps & Telemetry 

There’s a growing ecosystem for AgentOps (analogous to MLOps & LLMOps but for Agentic applications). Tools like RagaAI catalyst allow logging of all prompts and responses, and come with a UI to search and analyze them. For example, we could search the logs for when the Medical Agent said “ER” to see if it’s frequently sending people to the Emergency Room inappropriately. We can allow plug code based evaluation functions, e.g., you could write a Python function that checks if an appointment was scheduled when a symptom was severe, and run this function on all logged sessions.

Automated Metrics 

We can derive and set up certain quantitative metrics from the logs:

For example, Average number of dialogue turns per task, Success rate of tasks (if we have a way to label outcomes as success/fail, tool usage frequency and errors (how often did an API call fail or need to be retried), cost per conversation (tokens consumed).

Setting up time-series tracking of these can tell us if, say, after a code update the turns doubled (regression in efficiency) or cost spiked.

Human Oversight 

It might be prudent (especially early in deployment) to have a human monitor (e.g., a nurse or admin) reviewing a sample of the interactions daily, which can be flagged via automation or metrics. In some settings, a shadow mode deployment is done first: the MAS interacts with patients but a human moderator is watching in real-time and can intervene if needed

The gaps in monitoring are largely around measuring qualitative correctness continuously. We can easily monitor uptime or if an agent process crashed, but monitoring whether the content of the agent’s advice was correct is much harder. It ties back to evaluation – essentially we’d like to evaluate as many real conversations as possible, but doing that 100% with either AI or humans is infeasible. Another gap is feeding the monitor findings back into the system quickly – which belongs to the final phase, Lifecycle Management. We want to use what we observe (errors, user feedback etc.) to continually improve the MAS.

Lifecycle Management Phase: Continuous Improvement and Maintenance

Building, evaluating, deploying, and monitoring a multi-agent system is not a one-and-done effort. Lifecycle Management is about the ongoing process of updating the system – fixing bugs, adding new capabilities, adapting to changing conditions – while ensuring it remains stable and effective.

Challenges in Lifecycle Management

Version Control 

One obvious maintenance task is updating agents as we learn from failures. For example, if monitoring shows the Medical Agent sometimes gives too blunt an answer, we might want to amend its prompt to emphasize empathy and detail. However, as discussed, even a small prompt change can have ripple effects on the overall MAS behavior. Without re-evaluating, we risk breaking something else.

Managing version control for prompts and agent “brains” is important. In code, we have Git and automated tests; for prompts, we need a way to track changes, perhaps link them to resolved issues. 

The same applies for all facets of MAS (tools, memory, etc.) - versioning and keeping track with a re-evaluation cycle becomes inescapable. 

Model Updates

Over the lifecycle, the base LLMs powering agents may be updated or replaced. Perhaps we fine-tune a smaller model on domain data to reduce reliance on an API. Each model update is like a huge code change – it will alter the dynamics of the MAS. This poses a challenge: how to continuously improve the agents’ intelligence while not breaking coordination?

Challenge essentially boils down to - “unifying and simplifying the training process for these agents”

This is still quite cutting-edge, in practice, most current MAS rely on fixed pre-trained models and prompt engineering, because multi-agent learning is an evolving research problem

Adaptive System Evolution

Over time, what we need the MAS to do might change. The hospital might expand the concierge to also handle prescription refills. That means adding or altering agents. Lifecycle management includes extensibility – can we add a new agent into the mix without rebuilding from scratch? In a well-architected MAS, this might be possible (one could plug in a new specialist agent and update the orchestrator’s prompt to utilize it). Ideally, industry is moving towards more modular MAS architectures, where each agent can be updated or swapped out like a microservice, with clear contracts on how they communicate

Continuous learning

A big question is whether to allow the MAS to learn and adapt on its own (online learning), and if yes, how do we enable that. Online learning (like updating its knowledge base as it answers questions) could improve performance but is dangerous in healthcare because errors could compound. Currently, most MAS are static or at best retrained offline with new data. We see a gap in safe continuous learning – how can the system incorporate human feedback loops effectively? For example,  perhaps patients rate each answer; we could feed those ratings to periodically finetune the agents. But one has to be careful to not degrade some other aspect while optimizing for ratings. Future advances in reinforcement learning for MAS might allow a more automated improvement cycle, but today it’s largely manual.

Strategies and Tools for Lifecycle Management

Routine Re-Evaluation and Updates

A best practice is to have a regular schedule to update the MAS. This could include refreshing prompts, applying any new models, and then re-running the evaluation suite from the Evaluate phase. Our hospital MAS team might, for instance, meet every two weeks to review collected conversation logs, identify areas to improve, adjust prompts or agent logic, and then test and redeploy. Having a systematic process here is part of the lifecycle management discipline.

Versioning Management

Treating every piece of your MAS: prompts, tool definitions, memory schemas, and fine‑tuned model artifacts as first‑class versioned assets are the need of the hour. Integration of versioning with deployment cycles gives visibility on the performance of the agent over time and versions, piggy banking on the evaluations set up for agent performance assessments.

Knowledge Base Updates

Part of the lifecycle is keeping the knowledge current. Our Agent might rely on a knowledge base of medical guidelines. That database keeps updating as guidelines change. Ensuring the MAS incorporates the latest info might be as simple as re-indexing a vector database or as complex as retraining an agent’s model. 

Major gaps in lifecycle management revolve around the lack of formalized processes and automated support. We need “AgentOps”. This includes:

  • Automated regression testing for MAS (to ensure an update didn’t reintroduce a previous bug).

  • Version control systems tailored to prompts and model checkpoints, tool checkpoints etc (so you can roll back if an update performs worse).

  • Continuous monitoring integration – linking monitoring signals to create tickets or triggers for the next iteration.

In our healthcare example, lifecycle management might eventually involve handing some control over to the hospital’s IT or medical staff: for example, giving a tool to clinicians to easily update the knowledge base the agents use (so that a new clinical protocol can be uploaded). 

Empowering domain experts to maintain the MAS without always needing an AI engineer is another frontier to strive for.

Conclusion

Multi-agent AI systems hold great promise – our hypothetical concierge AI could transform patient engagement by providing responsive, 24/7 support that draws on multiple specialties. But as we’ve seen, realizing this vision requires navigating a complex lifecycle:

Build: Frameworks like Langgraphl, CrewAI now provide role templates and orchestrator modules to streamline agent definition and workflows - enabling rapid prototyping of agent teams. Still, fine‑tuning cross‑agent coordination and establishing design best practices remain areas for further innovation.

Evaluate: Structured assessment pipelines using LLM‑as‑Judge setups  and integrated testing suites—offer scalable ways to score dialogue quality, task success, and safety. Augmenting these with human‑in‑the‑loop reviews helps close gaps in rigorous testing. 

Deploy: One‑click deployment platforms can automate infrastructure provisioning, UI generation, and policy enforcement, speeding prototypes into production. However, adapting these deployments to meet security, latency, and compliance requirements still calls for custom integration work. 

Monitor: Unified analytics dashboards paired with log aggregation, and built-in feedback loops surface real‑time insights into agent behavior. To turn natural‑language exchanges into actionable health‑care metrics, we need more specialized monitoring tools that bridge unstructured dialogue with structured observability.

Lifecycle Management: Modular training pipelines, prompt versioning, and continuous feedback loops enable ongoing MAS refinement. Building out full-fledged AgentOps—complete with automated regression tests, prompt change tracking, and adaptive retraining—will be crucial to make these systems maintainable, reliable, and safe over the long haul.

References

  1. “Why Do Multi-Agent LLM Systems Fail?”, arXiv, 2025

  2. Microsoft Research Technical Report MSR-TR-2024-53, “Challenges in human-agent communication”

  3. “Large Language Model based Multi-Agents: A Survey of Progress and Challenges”, arXiv, 2024

  4. “AgentScope: A Flexible yet Robust Multi-Agent Platform”, arXiv, 2024

  5. “AutoGen: Enabling Next-Gen LLM Applications via Multi-Agent Conversation”, Microsoft Research, 2023 

  6. “MetaGPT: Meta Programming for a Multi-Agent Collaborative Framework”, 2023

  7. “ChatEval: Towards Better LLM-based Evaluators through Multi-Agent Debate”, arXiv, 2023

Gen AI multi-agent systems (MAS) are emerging as a way to tackle complex & abstract tasks by having multiple specialized AI agents collaborate. Already we are seeing a few agents (<10) becoming popular and by the end of the year multi agent systems with >10 agents will become commonplace.  Building and operating these MAS systems in the real-world is fraught with challenges at every stage of the lifecycle. In this post, we’ll walk through the current status, challenges and opportunities of the five phases of the MAS lifecycleBuild, Evaluate, Deploy, Monitor, and Lifecycle Management – using a healthcare patient concierge AI as a running example.

Different stages of Agents Lifecycle

The lifecycle of agents in MAS encompasses mostly these five key stages: 

Design and development, where agent specialization and interaction models are established; 

Evaluation & Testing , which verifies individual functionality and inter-agent communication; 

Deployment & Initialization, covering environment setup and CI/CD pipelines; 

Monitoring & Observability, tracking agent health, captures interaction logs, and surfaces performance anomalies in real time;

Life-Cycle Management , managing version updates while preserving staters and maintaining components

Fig 1: Operational Stages of a GenAI Multi-Agent Systems (MAS) Pipeline

Effective management across these stages requires tailored approaches for each phase. During early stages, emphasis falls on clear role definition and communication protocols, while later stages demand robust version compatibility and state preservation. Throughout all stages, security considerations must evolve alongside agent capabilities, implementing granular permissions and comprehensive monitoring. Organizations that excel recognize that agent testing fundamentally differs from traditional software testing, with particular focus on inter-agent dynamics and emergent behaviors that can only be observed when the complete system operates under varying conditions.

Use Case in Focus: Imagine a “patient concierge” AI service for a hospital. A patient can ask this AI to do things like explain their symptoms, schedule an appointment with the right specialist, check insurance coverage, and send reminders. Rather than a single monolithic AI, such a service could be composed of multiple agents – for example: a Medical Agent for gathering symptoms, a Scheduling Agent to book appointments, a Billing Agent for insurance queries, all coordinated by a Concierge Orchestrator Agent that interacts with the patient. This MAS needs to seamlessly integrate medical knowledge, tools (like the hospital’s calendar and databases), and maintain patient privacy. We’ll refer back to this example to illustrate each phase of the lifecycle.

Fig 2: Healthcare MAS with Orchestrated Patient Flow

Build Phase: Designing and Developing the Multi-Agent System

Building a MAS is a fundamentally new kind of software development. In this phase, we define the agents (their roles, skills, and personalities), how they communicate, and what tools or data they can access. 

Architecture of a MAS

  • Multiple specialized agents (medical, scheduling, billing etc) collaborate under an Orchestrator

    • The Symptom Collector agent might consult a knowledge base or guidelines; 

    • The Scheduling agent queries hospital systems (EHR, calendars); 

    • The Billing agent checks insurance databases. 

    • The Orchestrator agent mediates the conversation with the patient and coordinates subtasks among the specialists. 

This design aims to mimic a care team working together for the patient.

Challenges in the Build Phase

Defining Roles and Scope

One of the hardest parts of MAS development is specification and system design. Each agent’s role and responsibilities must be clearly defined so they complement each other without confusion. Ambiguous design can lead to agents stepping on each other’s toes!

For example, if we don’t explicitly specify that only the Symptom Collector Agent should have access to guidelines and knowledge (and not the orchestrator), agents might violate these role boundaries and give out incorrect information. Ensuring that all possible tasks are covered by some agent, and that each agent knows when to yield control to another, is non-trivial.

Communication and Coordination Protocol

LLM-based agents often communicate in natural language, which is prone to misunderstanding. We need to decide how they will interact. Will the agents talk to each other in an open chat thread, or will the orchestrator route messages between them?

Our healthcare agents need a protocol – e.g. the orchestrator first asks the Symptom Collector Agent for an answer, then passes that to the Scheduling Agent if an appointment is needed, and so on. Without a clear interaction plan and handover mechanisms, multi-agent dialogues can potentially become chaotic.

Knowledge & Tools

Each agent may require access to different knowledge sources or tools. Continuing from our example, the Symptom Collector might use a medical knowledge base, or a finetuned LLM, the Scheduling Agent will need secure access to calendar API. Prompting an agent to produce structured output is error proves and ensuring LLM output can properly trigger tool usage is a challenge. 

Iterative Design

Where is my blueprint? Currently, it’s often trial and error, using intuition to prompt-engineer agent behaviours. Developers must iterate on prompts and roles, essentially debugging conversations and simulating scenarios. In a critical domain like healthcare, this is especially difficult because we must anticipate corner cases (like a medical question the agent shouldn’t answer and should escalate to a human). 

LangGraph’s graph-based design lets agents iteratively revisit and refine context‑aware workflows, CrewAI enforces clear specialist roles for effective delegation, and AutoGen orchestrates seamless multi‑agent conversations—mimicking human teamwork to automate complex workflows. 

Despite these tools, gaps remain in the Build stage. LangGraph’s growing graphs can become difficult to debug at scale, CrewAI’s rigid role templates may not adapt well to evolving or cross‑functional tasks, and AutoGen’s conversational loops can introduce latency, obscure error origins, and lack built‑in tools for detecting emergent inter-agent failures.It’s still difficult to guarantee that an MAS will behave as intended across all scenarios. Developers lack formal testing methods during building – often you have to run sample conversations to see if agents do the right thing. Misalignment can already creep in here: an agent might follow its prompt fine in isolation, but once multiple agents interact, unpredictable behaviors emerge.

As in our example, in healthcare, one might enforce constraints (e.g., “Symptom Collector Agent should never give treatment advice”) at the prompt level, but currently that’s up to the developer to remember. This leads into the next phase – even with a well-designed MAS, we must Evaluate it thoroughly to discover any flaws.

Evaluate Phase: Testing the Multi-Agent System’s Performance and Safety

Once we have a MAS design, we need to evaluate whether it actually works as intended. For a healthcare concierge, evaluation is critical – we must test that the AI agents together can handle a variety of patient requests accurately and safely. The Evaluate phase covers functional correctness (does the MAS achieve the goals, e.g. correctly scheduling an appointment with appropriate advice given?) and non-functional aspects like efficiency (do the agents take too many steps or cost too much?), as well as safety (do they avoid dangerous recommendations or privacy breaches?)

Fig 3: Evaluation cycle for a Multi Agent System

Challenges in the Evaluation Phase

The current evaluation frameworks for agents are still evolving to address the needs of MAS systems which are discussed below.

Complex Emergent Behaviour

Unlike a single LLM call with a well-defined input-output, an MAS involves multi-turn processes between agents. This makes evaluation much harder. We are essentially evaluating an entire dialogue or collaborative process, which could branch into many paths. Even if each agent individually passes unit tests, their combination might fail in novel ways.

Lack of Established Metrics

What does it even mean for a MAS to “pass” a test? For a given user query, there might be multiple acceptable ways the agents could solve it. Traditional accuracy metrics don’t directly apply. We often need to evaluate along several dimensions:

  • Task Success: Did the MAS ultimately fulfill the user’s request? (E.g., the patient asked for an appointment and got one with correct details.)

  • Quality of reasoning / conversation: Did agents share information correctly and come to a sound solution (especially important for symptom collection and its completeness)?

  • Efficiency:  Did they solve it in a reasonable number of dialogue turns and API calls, or did they wander off-topic?

  • Safety / Compliance: Did the MAS avoid unauthorized actions or disallowed content (no private data leaked, no unsafe medical suggestion made)?

Human Evaluation & Test Scenarios 

In healthcare, one would ideally have experts review the MAS’s outputs. But evaluating every possible scenario is impossible – you need to pick representative test cases. Creating a suite of test scenarios (like patient personas with various requests) is itself a challenge, requiring medical expertise and understanding of what could go wrong. There’s also the issue of sensitive data – you might have to test with synthetic data if you can’t use real patient info during eval, which may not cover all real-world quirks.

Approaches and Frameworks for Evaluation

Given these challenges, how are people evaluating MAS today? It’s a mix of new techniques and adapting old ones:

LLM as a Judge

One approach is using a language model to evaluate the outputs of another (or a whole MAS). The idea is to have the AI itself score the conversation on success criteria. For example, after our concierge MAS handles a test patient query, we might feed the entire interaction and the intended goal to LLM and ask, “Did the agents fulfill the request correctly and safely? If not, where did they fail?” 

Test scenarios (simulated user queries with known expected outcomes) are fed to the MAS. An Evaluation Agent (LLM-as-Judge) can compare the MAS’s responses against the expected outcome, producing a score or verdict. Logs of the dialogue (number of turns, tools used, etc.) are also collected as part of evaluation metrics. 

Muti Agent Debate

Another method is to pit agents against each other in an evaluation setting. Essentially, two (or more) agents discuss or critique a given solution, hoping that flaws will be revealed in the debate. While originally more about evaluating single LLM answers, one could imagine using debate for MAS: e.g., spin up a separate “critic” agent to argue why the concierge’s plan for the patient might be wrong (“Did it consider the patient’s medical history? What if the diagnosis is incorrect?”) and see if the MAS can defend or adjust

Human-in-the-loop Evaluations

Ultimately, for sensitive domains like healthcare, human evaluation is the gold standard. One might conduct a study where medical professionals interact with the concierge MAS and rate its performance. Human eval is expensive and time-consuming, so it’s often done on a small scale or after initial automated testing passes.

Once we have some confidence from evaluation, we move to deploying the system in the real world – which will reveal new challenges of its own.

Deploy Phase: From Prototype to Production

Deployment is about taking the MAS out of the lab and integrating it into a real-world setting. 

This phase deals with all the practical considerations of making the MAS actually useful to end-users: integration with infrastructure, performance, scalability, and compliance. 

For our patient concierge, deployment means the AI is now interacting with actual patients through a user interface, connected to live hospital databases, and operating under real-world conditions (network issues, high usage periods, etc.).

Challenges in the Deployment Phase

Deployment Pipeline & Lifecycle 

Deploying new agent versions reliably hinges on a robust CI/CD pipeline—containerizing the Medical, Scheduling, Billing, and Orchestrator agents via infrastructure‑as‑code (IaC). Ideally, we lean on MAS frameworks and IaC templates to enforce environment parity and inject secrets securely, while automated rollback hooks tied to versioned release artifacts let us revert immediately if an update breaks appointment‑booking logic. At the same time, dynamic resource policies have to be in place to ensure extra Scheduling Agent instances spin up during the morning check‑in rush and scale down overnight.

Integration with External Systems

In prototyping, we might have mocked the hospital’s Electronic Health Record (EHR) system or scheduling database. In deployment, the MAS must interface with real systems – e.g., calling the EHR API to get a patient’s lab results, or reading/writing appointment info in the scheduling system. Each agent that needs external data becomes a potential integration point. Ensuring the agents make API calls correctly, handle errors (like “system down” or “no appointments available”), and do so securely (not requesting info they shouldn’t) is a huge challenge. 

Emerging standards like the open Agent‑to‑Agent (A2A) protocol give agents a shared, schema‑validated channel for calling and responding to hospital APIs.
Likewise, Anthropic’s Model Context Protocol (MCP) abstracts backend details so agents can swap underlying LLMs or data stores without touching the integration surface

Scalability & Latency 

How do we scale a multi-agent system? One approach is to run multiple instances of the MAS in parallel (horizontally scale), but coordinating state could be an issue if the same patient interacts multiple times – do we keep a conversation context per session? Traditional web service scaling techniques apply, but the agent conversation adds a layer on top that needs to be managed (possibly via an orchestrator service). Ensuring the system meets uptime and response time requirements is a classic engineering challenge magnified by the unpredictability of LLM agents

Reliability & Safety

Deploying guardrails begins in the CI/CD pipeline. During the build stage, the pipeline integrates and unit tests these guardrails—simulating prompt‑injection, malformed API responses, and unauthorized PHI access etc, to ensure no agent can bypass restrictions before any build is published.

Monitor Phase: Observing and Maintaining Performance in Production

Once the MAS is live, the Monitor phase involves keeping an eye on it: tracking its performance, catching errors or drifts, and gathering data for improvement. Monitoring a multi-agent AI is in some ways similar to monitoring any software service (you collect logs, uptime metrics, etc.), but it also has unique aspects – we need to monitor the quality of the AI’s decisions, not just that it’s running.

Challenges in the Monitor Phase

The current monitor/observability frameworks for agents are still evolving to address the needs of MAS systems which are discussed below.

Visibility into Agent Reasoning 

One challenge is that a MAS’s inner workings are essentially conversations in natural language, which can be hard to systematically parse. We can log every message between agents (and we should, for audit purposes), but sifting through those logs to identify issues is non-trivial. If a patient got a poor outcome, we’d want to trace which agent or exchange led to that. For example, maybe the Symptom Collector Agent got the correct info, but the handoff to Scheduling Agent failed because the latter misunderstood a date format)

Visibility into Cost & Latency 

In production, each agent call - Medical, Scheduling, Billing, Orchestrator - incurs token usage and network latency. Without per-agent telemetry, it’s impossible to tell whether cost overruns stem from verbose prompts in the Medical Agent or from excessive back‑and‑forth loops in the orchestrator. Similarly, end‑to‑end response time hides slow external dependencies (e.g., EHR lookups). Effective monitoring must capture elements like tokens, API calls per conversation, response times per agent and resource usage, then correlate spikes to specific scenarios so teams can take corrective actions like optimize prompts, adjust model, or refine auto‑scaling policies etc. 

Inter-Agent Misalignment Over Time

As the MAS handles many queries, certain patterns of failure might emerge. One known category of issues is inter-agent misalignment, where agents gradually diverge from the intended cooperative behavior. In long-running dialogues, they might get confused about context. Potential signals could be unusually long dialogues (many turns back and forth), or the orchestrator agent frequently resetting the conversation.

Drifts and Updates 

Over time, the underlying LLMs might change (for instance, an API upgrade to a new model version) which can change the behavior of your agents subtly. Even without explicit updates, the model might start outputting differently if the provider tweaks it. Monitoring needs to catch distribution shifts – are we suddenly seeing more hallucinations or errors than before? We may need to establish baselines during initial deployment and continuously compare. If, say, the ratio of successful scheduling tasks drops in a particular week, that should trigger an alert to investigate.

Fallbacks and Incident Handling

When (not if) something goes wrong – say an agent outputs something inappropriate to a user – monitoring should flag this ASAP. 

Another difficulty is reproducing issues: by the time we see an alert, the conversation already happened. We have the logs, but if it was due to a rare model quirk, running it again might not produce the same result.

Failure during the execution of the agent can lead to information mismatch in DBs getting stored or incorrect state updates when follow up happens. At each stage, we need to have fallbacks to mitigate these issues.

Approaches for Monitoring MAS

AgentOps & Telemetry 

There’s a growing ecosystem for AgentOps (analogous to MLOps & LLMOps but for Agentic applications). Tools like RagaAI catalyst allow logging of all prompts and responses, and come with a UI to search and analyze them. For example, we could search the logs for when the Medical Agent said “ER” to see if it’s frequently sending people to the Emergency Room inappropriately. We can allow plug code based evaluation functions, e.g., you could write a Python function that checks if an appointment was scheduled when a symptom was severe, and run this function on all logged sessions.

Automated Metrics 

We can derive and set up certain quantitative metrics from the logs:

For example, Average number of dialogue turns per task, Success rate of tasks (if we have a way to label outcomes as success/fail, tool usage frequency and errors (how often did an API call fail or need to be retried), cost per conversation (tokens consumed).

Setting up time-series tracking of these can tell us if, say, after a code update the turns doubled (regression in efficiency) or cost spiked.

Human Oversight 

It might be prudent (especially early in deployment) to have a human monitor (e.g., a nurse or admin) reviewing a sample of the interactions daily, which can be flagged via automation or metrics. In some settings, a shadow mode deployment is done first: the MAS interacts with patients but a human moderator is watching in real-time and can intervene if needed

The gaps in monitoring are largely around measuring qualitative correctness continuously. We can easily monitor uptime or if an agent process crashed, but monitoring whether the content of the agent’s advice was correct is much harder. It ties back to evaluation – essentially we’d like to evaluate as many real conversations as possible, but doing that 100% with either AI or humans is infeasible. Another gap is feeding the monitor findings back into the system quickly – which belongs to the final phase, Lifecycle Management. We want to use what we observe (errors, user feedback etc.) to continually improve the MAS.

Lifecycle Management Phase: Continuous Improvement and Maintenance

Building, evaluating, deploying, and monitoring a multi-agent system is not a one-and-done effort. Lifecycle Management is about the ongoing process of updating the system – fixing bugs, adding new capabilities, adapting to changing conditions – while ensuring it remains stable and effective.

Challenges in Lifecycle Management

Version Control 

One obvious maintenance task is updating agents as we learn from failures. For example, if monitoring shows the Medical Agent sometimes gives too blunt an answer, we might want to amend its prompt to emphasize empathy and detail. However, as discussed, even a small prompt change can have ripple effects on the overall MAS behavior. Without re-evaluating, we risk breaking something else.

Managing version control for prompts and agent “brains” is important. In code, we have Git and automated tests; for prompts, we need a way to track changes, perhaps link them to resolved issues. 

The same applies for all facets of MAS (tools, memory, etc.) - versioning and keeping track with a re-evaluation cycle becomes inescapable. 

Model Updates

Over the lifecycle, the base LLMs powering agents may be updated or replaced. Perhaps we fine-tune a smaller model on domain data to reduce reliance on an API. Each model update is like a huge code change – it will alter the dynamics of the MAS. This poses a challenge: how to continuously improve the agents’ intelligence while not breaking coordination?

Challenge essentially boils down to - “unifying and simplifying the training process for these agents”

This is still quite cutting-edge, in practice, most current MAS rely on fixed pre-trained models and prompt engineering, because multi-agent learning is an evolving research problem

Adaptive System Evolution

Over time, what we need the MAS to do might change. The hospital might expand the concierge to also handle prescription refills. That means adding or altering agents. Lifecycle management includes extensibility – can we add a new agent into the mix without rebuilding from scratch? In a well-architected MAS, this might be possible (one could plug in a new specialist agent and update the orchestrator’s prompt to utilize it). Ideally, industry is moving towards more modular MAS architectures, where each agent can be updated or swapped out like a microservice, with clear contracts on how they communicate

Continuous learning

A big question is whether to allow the MAS to learn and adapt on its own (online learning), and if yes, how do we enable that. Online learning (like updating its knowledge base as it answers questions) could improve performance but is dangerous in healthcare because errors could compound. Currently, most MAS are static or at best retrained offline with new data. We see a gap in safe continuous learning – how can the system incorporate human feedback loops effectively? For example,  perhaps patients rate each answer; we could feed those ratings to periodically finetune the agents. But one has to be careful to not degrade some other aspect while optimizing for ratings. Future advances in reinforcement learning for MAS might allow a more automated improvement cycle, but today it’s largely manual.

Strategies and Tools for Lifecycle Management

Routine Re-Evaluation and Updates

A best practice is to have a regular schedule to update the MAS. This could include refreshing prompts, applying any new models, and then re-running the evaluation suite from the Evaluate phase. Our hospital MAS team might, for instance, meet every two weeks to review collected conversation logs, identify areas to improve, adjust prompts or agent logic, and then test and redeploy. Having a systematic process here is part of the lifecycle management discipline.

Versioning Management

Treating every piece of your MAS: prompts, tool definitions, memory schemas, and fine‑tuned model artifacts as first‑class versioned assets are the need of the hour. Integration of versioning with deployment cycles gives visibility on the performance of the agent over time and versions, piggy banking on the evaluations set up for agent performance assessments.

Knowledge Base Updates

Part of the lifecycle is keeping the knowledge current. Our Agent might rely on a knowledge base of medical guidelines. That database keeps updating as guidelines change. Ensuring the MAS incorporates the latest info might be as simple as re-indexing a vector database or as complex as retraining an agent’s model. 

Major gaps in lifecycle management revolve around the lack of formalized processes and automated support. We need “AgentOps”. This includes:

  • Automated regression testing for MAS (to ensure an update didn’t reintroduce a previous bug).

  • Version control systems tailored to prompts and model checkpoints, tool checkpoints etc (so you can roll back if an update performs worse).

  • Continuous monitoring integration – linking monitoring signals to create tickets or triggers for the next iteration.

In our healthcare example, lifecycle management might eventually involve handing some control over to the hospital’s IT or medical staff: for example, giving a tool to clinicians to easily update the knowledge base the agents use (so that a new clinical protocol can be uploaded). 

Empowering domain experts to maintain the MAS without always needing an AI engineer is another frontier to strive for.

Conclusion

Multi-agent AI systems hold great promise – our hypothetical concierge AI could transform patient engagement by providing responsive, 24/7 support that draws on multiple specialties. But as we’ve seen, realizing this vision requires navigating a complex lifecycle:

Build: Frameworks like Langgraphl, CrewAI now provide role templates and orchestrator modules to streamline agent definition and workflows - enabling rapid prototyping of agent teams. Still, fine‑tuning cross‑agent coordination and establishing design best practices remain areas for further innovation.

Evaluate: Structured assessment pipelines using LLM‑as‑Judge setups  and integrated testing suites—offer scalable ways to score dialogue quality, task success, and safety. Augmenting these with human‑in‑the‑loop reviews helps close gaps in rigorous testing. 

Deploy: One‑click deployment platforms can automate infrastructure provisioning, UI generation, and policy enforcement, speeding prototypes into production. However, adapting these deployments to meet security, latency, and compliance requirements still calls for custom integration work. 

Monitor: Unified analytics dashboards paired with log aggregation, and built-in feedback loops surface real‑time insights into agent behavior. To turn natural‑language exchanges into actionable health‑care metrics, we need more specialized monitoring tools that bridge unstructured dialogue with structured observability.

Lifecycle Management: Modular training pipelines, prompt versioning, and continuous feedback loops enable ongoing MAS refinement. Building out full-fledged AgentOps—complete with automated regression tests, prompt change tracking, and adaptive retraining—will be crucial to make these systems maintainable, reliable, and safe over the long haul.

References

  1. “Why Do Multi-Agent LLM Systems Fail?”, arXiv, 2025

  2. Microsoft Research Technical Report MSR-TR-2024-53, “Challenges in human-agent communication”

  3. “Large Language Model based Multi-Agents: A Survey of Progress and Challenges”, arXiv, 2024

  4. “AgentScope: A Flexible yet Robust Multi-Agent Platform”, arXiv, 2024

  5. “AutoGen: Enabling Next-Gen LLM Applications via Multi-Agent Conversation”, Microsoft Research, 2023 

  6. “MetaGPT: Meta Programming for a Multi-Agent Collaborative Framework”, 2023

  7. “ChatEval: Towards Better LLM-based Evaluators through Multi-Agent Debate”, arXiv, 2023

Gen AI multi-agent systems (MAS) are emerging as a way to tackle complex & abstract tasks by having multiple specialized AI agents collaborate. Already we are seeing a few agents (<10) becoming popular and by the end of the year multi agent systems with >10 agents will become commonplace.  Building and operating these MAS systems in the real-world is fraught with challenges at every stage of the lifecycle. In this post, we’ll walk through the current status, challenges and opportunities of the five phases of the MAS lifecycleBuild, Evaluate, Deploy, Monitor, and Lifecycle Management – using a healthcare patient concierge AI as a running example.

Different stages of Agents Lifecycle

The lifecycle of agents in MAS encompasses mostly these five key stages: 

Design and development, where agent specialization and interaction models are established; 

Evaluation & Testing , which verifies individual functionality and inter-agent communication; 

Deployment & Initialization, covering environment setup and CI/CD pipelines; 

Monitoring & Observability, tracking agent health, captures interaction logs, and surfaces performance anomalies in real time;

Life-Cycle Management , managing version updates while preserving staters and maintaining components

Fig 1: Operational Stages of a GenAI Multi-Agent Systems (MAS) Pipeline

Effective management across these stages requires tailored approaches for each phase. During early stages, emphasis falls on clear role definition and communication protocols, while later stages demand robust version compatibility and state preservation. Throughout all stages, security considerations must evolve alongside agent capabilities, implementing granular permissions and comprehensive monitoring. Organizations that excel recognize that agent testing fundamentally differs from traditional software testing, with particular focus on inter-agent dynamics and emergent behaviors that can only be observed when the complete system operates under varying conditions.

Use Case in Focus: Imagine a “patient concierge” AI service for a hospital. A patient can ask this AI to do things like explain their symptoms, schedule an appointment with the right specialist, check insurance coverage, and send reminders. Rather than a single monolithic AI, such a service could be composed of multiple agents – for example: a Medical Agent for gathering symptoms, a Scheduling Agent to book appointments, a Billing Agent for insurance queries, all coordinated by a Concierge Orchestrator Agent that interacts with the patient. This MAS needs to seamlessly integrate medical knowledge, tools (like the hospital’s calendar and databases), and maintain patient privacy. We’ll refer back to this example to illustrate each phase of the lifecycle.

Fig 2: Healthcare MAS with Orchestrated Patient Flow

Build Phase: Designing and Developing the Multi-Agent System

Building a MAS is a fundamentally new kind of software development. In this phase, we define the agents (their roles, skills, and personalities), how they communicate, and what tools or data they can access. 

Architecture of a MAS

  • Multiple specialized agents (medical, scheduling, billing etc) collaborate under an Orchestrator

    • The Symptom Collector agent might consult a knowledge base or guidelines; 

    • The Scheduling agent queries hospital systems (EHR, calendars); 

    • The Billing agent checks insurance databases. 

    • The Orchestrator agent mediates the conversation with the patient and coordinates subtasks among the specialists. 

This design aims to mimic a care team working together for the patient.

Challenges in the Build Phase

Defining Roles and Scope

One of the hardest parts of MAS development is specification and system design. Each agent’s role and responsibilities must be clearly defined so they complement each other without confusion. Ambiguous design can lead to agents stepping on each other’s toes!

For example, if we don’t explicitly specify that only the Symptom Collector Agent should have access to guidelines and knowledge (and not the orchestrator), agents might violate these role boundaries and give out incorrect information. Ensuring that all possible tasks are covered by some agent, and that each agent knows when to yield control to another, is non-trivial.

Communication and Coordination Protocol

LLM-based agents often communicate in natural language, which is prone to misunderstanding. We need to decide how they will interact. Will the agents talk to each other in an open chat thread, or will the orchestrator route messages between them?

Our healthcare agents need a protocol – e.g. the orchestrator first asks the Symptom Collector Agent for an answer, then passes that to the Scheduling Agent if an appointment is needed, and so on. Without a clear interaction plan and handover mechanisms, multi-agent dialogues can potentially become chaotic.

Knowledge & Tools

Each agent may require access to different knowledge sources or tools. Continuing from our example, the Symptom Collector might use a medical knowledge base, or a finetuned LLM, the Scheduling Agent will need secure access to calendar API. Prompting an agent to produce structured output is error proves and ensuring LLM output can properly trigger tool usage is a challenge. 

Iterative Design

Where is my blueprint? Currently, it’s often trial and error, using intuition to prompt-engineer agent behaviours. Developers must iterate on prompts and roles, essentially debugging conversations and simulating scenarios. In a critical domain like healthcare, this is especially difficult because we must anticipate corner cases (like a medical question the agent shouldn’t answer and should escalate to a human). 

LangGraph’s graph-based design lets agents iteratively revisit and refine context‑aware workflows, CrewAI enforces clear specialist roles for effective delegation, and AutoGen orchestrates seamless multi‑agent conversations—mimicking human teamwork to automate complex workflows. 

Despite these tools, gaps remain in the Build stage. LangGraph’s growing graphs can become difficult to debug at scale, CrewAI’s rigid role templates may not adapt well to evolving or cross‑functional tasks, and AutoGen’s conversational loops can introduce latency, obscure error origins, and lack built‑in tools for detecting emergent inter-agent failures.It’s still difficult to guarantee that an MAS will behave as intended across all scenarios. Developers lack formal testing methods during building – often you have to run sample conversations to see if agents do the right thing. Misalignment can already creep in here: an agent might follow its prompt fine in isolation, but once multiple agents interact, unpredictable behaviors emerge.

As in our example, in healthcare, one might enforce constraints (e.g., “Symptom Collector Agent should never give treatment advice”) at the prompt level, but currently that’s up to the developer to remember. This leads into the next phase – even with a well-designed MAS, we must Evaluate it thoroughly to discover any flaws.

Evaluate Phase: Testing the Multi-Agent System’s Performance and Safety

Once we have a MAS design, we need to evaluate whether it actually works as intended. For a healthcare concierge, evaluation is critical – we must test that the AI agents together can handle a variety of patient requests accurately and safely. The Evaluate phase covers functional correctness (does the MAS achieve the goals, e.g. correctly scheduling an appointment with appropriate advice given?) and non-functional aspects like efficiency (do the agents take too many steps or cost too much?), as well as safety (do they avoid dangerous recommendations or privacy breaches?)

Fig 3: Evaluation cycle for a Multi Agent System

Challenges in the Evaluation Phase

The current evaluation frameworks for agents are still evolving to address the needs of MAS systems which are discussed below.

Complex Emergent Behaviour

Unlike a single LLM call with a well-defined input-output, an MAS involves multi-turn processes between agents. This makes evaluation much harder. We are essentially evaluating an entire dialogue or collaborative process, which could branch into many paths. Even if each agent individually passes unit tests, their combination might fail in novel ways.

Lack of Established Metrics

What does it even mean for a MAS to “pass” a test? For a given user query, there might be multiple acceptable ways the agents could solve it. Traditional accuracy metrics don’t directly apply. We often need to evaluate along several dimensions:

  • Task Success: Did the MAS ultimately fulfill the user’s request? (E.g., the patient asked for an appointment and got one with correct details.)

  • Quality of reasoning / conversation: Did agents share information correctly and come to a sound solution (especially important for symptom collection and its completeness)?

  • Efficiency:  Did they solve it in a reasonable number of dialogue turns and API calls, or did they wander off-topic?

  • Safety / Compliance: Did the MAS avoid unauthorized actions or disallowed content (no private data leaked, no unsafe medical suggestion made)?

Human Evaluation & Test Scenarios 

In healthcare, one would ideally have experts review the MAS’s outputs. But evaluating every possible scenario is impossible – you need to pick representative test cases. Creating a suite of test scenarios (like patient personas with various requests) is itself a challenge, requiring medical expertise and understanding of what could go wrong. There’s also the issue of sensitive data – you might have to test with synthetic data if you can’t use real patient info during eval, which may not cover all real-world quirks.

Approaches and Frameworks for Evaluation

Given these challenges, how are people evaluating MAS today? It’s a mix of new techniques and adapting old ones:

LLM as a Judge

One approach is using a language model to evaluate the outputs of another (or a whole MAS). The idea is to have the AI itself score the conversation on success criteria. For example, after our concierge MAS handles a test patient query, we might feed the entire interaction and the intended goal to LLM and ask, “Did the agents fulfill the request correctly and safely? If not, where did they fail?” 

Test scenarios (simulated user queries with known expected outcomes) are fed to the MAS. An Evaluation Agent (LLM-as-Judge) can compare the MAS’s responses against the expected outcome, producing a score or verdict. Logs of the dialogue (number of turns, tools used, etc.) are also collected as part of evaluation metrics. 

Muti Agent Debate

Another method is to pit agents against each other in an evaluation setting. Essentially, two (or more) agents discuss or critique a given solution, hoping that flaws will be revealed in the debate. While originally more about evaluating single LLM answers, one could imagine using debate for MAS: e.g., spin up a separate “critic” agent to argue why the concierge’s plan for the patient might be wrong (“Did it consider the patient’s medical history? What if the diagnosis is incorrect?”) and see if the MAS can defend or adjust

Human-in-the-loop Evaluations

Ultimately, for sensitive domains like healthcare, human evaluation is the gold standard. One might conduct a study where medical professionals interact with the concierge MAS and rate its performance. Human eval is expensive and time-consuming, so it’s often done on a small scale or after initial automated testing passes.

Once we have some confidence from evaluation, we move to deploying the system in the real world – which will reveal new challenges of its own.

Deploy Phase: From Prototype to Production

Deployment is about taking the MAS out of the lab and integrating it into a real-world setting. 

This phase deals with all the practical considerations of making the MAS actually useful to end-users: integration with infrastructure, performance, scalability, and compliance. 

For our patient concierge, deployment means the AI is now interacting with actual patients through a user interface, connected to live hospital databases, and operating under real-world conditions (network issues, high usage periods, etc.).

Challenges in the Deployment Phase

Deployment Pipeline & Lifecycle 

Deploying new agent versions reliably hinges on a robust CI/CD pipeline—containerizing the Medical, Scheduling, Billing, and Orchestrator agents via infrastructure‑as‑code (IaC). Ideally, we lean on MAS frameworks and IaC templates to enforce environment parity and inject secrets securely, while automated rollback hooks tied to versioned release artifacts let us revert immediately if an update breaks appointment‑booking logic. At the same time, dynamic resource policies have to be in place to ensure extra Scheduling Agent instances spin up during the morning check‑in rush and scale down overnight.

Integration with External Systems

In prototyping, we might have mocked the hospital’s Electronic Health Record (EHR) system or scheduling database. In deployment, the MAS must interface with real systems – e.g., calling the EHR API to get a patient’s lab results, or reading/writing appointment info in the scheduling system. Each agent that needs external data becomes a potential integration point. Ensuring the agents make API calls correctly, handle errors (like “system down” or “no appointments available”), and do so securely (not requesting info they shouldn’t) is a huge challenge. 

Emerging standards like the open Agent‑to‑Agent (A2A) protocol give agents a shared, schema‑validated channel for calling and responding to hospital APIs.
Likewise, Anthropic’s Model Context Protocol (MCP) abstracts backend details so agents can swap underlying LLMs or data stores without touching the integration surface

Scalability & Latency 

How do we scale a multi-agent system? One approach is to run multiple instances of the MAS in parallel (horizontally scale), but coordinating state could be an issue if the same patient interacts multiple times – do we keep a conversation context per session? Traditional web service scaling techniques apply, but the agent conversation adds a layer on top that needs to be managed (possibly via an orchestrator service). Ensuring the system meets uptime and response time requirements is a classic engineering challenge magnified by the unpredictability of LLM agents

Reliability & Safety

Deploying guardrails begins in the CI/CD pipeline. During the build stage, the pipeline integrates and unit tests these guardrails—simulating prompt‑injection, malformed API responses, and unauthorized PHI access etc, to ensure no agent can bypass restrictions before any build is published.

Monitor Phase: Observing and Maintaining Performance in Production

Once the MAS is live, the Monitor phase involves keeping an eye on it: tracking its performance, catching errors or drifts, and gathering data for improvement. Monitoring a multi-agent AI is in some ways similar to monitoring any software service (you collect logs, uptime metrics, etc.), but it also has unique aspects – we need to monitor the quality of the AI’s decisions, not just that it’s running.

Challenges in the Monitor Phase

The current monitor/observability frameworks for agents are still evolving to address the needs of MAS systems which are discussed below.

Visibility into Agent Reasoning 

One challenge is that a MAS’s inner workings are essentially conversations in natural language, which can be hard to systematically parse. We can log every message between agents (and we should, for audit purposes), but sifting through those logs to identify issues is non-trivial. If a patient got a poor outcome, we’d want to trace which agent or exchange led to that. For example, maybe the Symptom Collector Agent got the correct info, but the handoff to Scheduling Agent failed because the latter misunderstood a date format)

Visibility into Cost & Latency 

In production, each agent call - Medical, Scheduling, Billing, Orchestrator - incurs token usage and network latency. Without per-agent telemetry, it’s impossible to tell whether cost overruns stem from verbose prompts in the Medical Agent or from excessive back‑and‑forth loops in the orchestrator. Similarly, end‑to‑end response time hides slow external dependencies (e.g., EHR lookups). Effective monitoring must capture elements like tokens, API calls per conversation, response times per agent and resource usage, then correlate spikes to specific scenarios so teams can take corrective actions like optimize prompts, adjust model, or refine auto‑scaling policies etc. 

Inter-Agent Misalignment Over Time

As the MAS handles many queries, certain patterns of failure might emerge. One known category of issues is inter-agent misalignment, where agents gradually diverge from the intended cooperative behavior. In long-running dialogues, they might get confused about context. Potential signals could be unusually long dialogues (many turns back and forth), or the orchestrator agent frequently resetting the conversation.

Drifts and Updates 

Over time, the underlying LLMs might change (for instance, an API upgrade to a new model version) which can change the behavior of your agents subtly. Even without explicit updates, the model might start outputting differently if the provider tweaks it. Monitoring needs to catch distribution shifts – are we suddenly seeing more hallucinations or errors than before? We may need to establish baselines during initial deployment and continuously compare. If, say, the ratio of successful scheduling tasks drops in a particular week, that should trigger an alert to investigate.

Fallbacks and Incident Handling

When (not if) something goes wrong – say an agent outputs something inappropriate to a user – monitoring should flag this ASAP. 

Another difficulty is reproducing issues: by the time we see an alert, the conversation already happened. We have the logs, but if it was due to a rare model quirk, running it again might not produce the same result.

Failure during the execution of the agent can lead to information mismatch in DBs getting stored or incorrect state updates when follow up happens. At each stage, we need to have fallbacks to mitigate these issues.

Approaches for Monitoring MAS

AgentOps & Telemetry 

There’s a growing ecosystem for AgentOps (analogous to MLOps & LLMOps but for Agentic applications). Tools like RagaAI catalyst allow logging of all prompts and responses, and come with a UI to search and analyze them. For example, we could search the logs for when the Medical Agent said “ER” to see if it’s frequently sending people to the Emergency Room inappropriately. We can allow plug code based evaluation functions, e.g., you could write a Python function that checks if an appointment was scheduled when a symptom was severe, and run this function on all logged sessions.

Automated Metrics 

We can derive and set up certain quantitative metrics from the logs:

For example, Average number of dialogue turns per task, Success rate of tasks (if we have a way to label outcomes as success/fail, tool usage frequency and errors (how often did an API call fail or need to be retried), cost per conversation (tokens consumed).

Setting up time-series tracking of these can tell us if, say, after a code update the turns doubled (regression in efficiency) or cost spiked.

Human Oversight 

It might be prudent (especially early in deployment) to have a human monitor (e.g., a nurse or admin) reviewing a sample of the interactions daily, which can be flagged via automation or metrics. In some settings, a shadow mode deployment is done first: the MAS interacts with patients but a human moderator is watching in real-time and can intervene if needed

The gaps in monitoring are largely around measuring qualitative correctness continuously. We can easily monitor uptime or if an agent process crashed, but monitoring whether the content of the agent’s advice was correct is much harder. It ties back to evaluation – essentially we’d like to evaluate as many real conversations as possible, but doing that 100% with either AI or humans is infeasible. Another gap is feeding the monitor findings back into the system quickly – which belongs to the final phase, Lifecycle Management. We want to use what we observe (errors, user feedback etc.) to continually improve the MAS.

Lifecycle Management Phase: Continuous Improvement and Maintenance

Building, evaluating, deploying, and monitoring a multi-agent system is not a one-and-done effort. Lifecycle Management is about the ongoing process of updating the system – fixing bugs, adding new capabilities, adapting to changing conditions – while ensuring it remains stable and effective.

Challenges in Lifecycle Management

Version Control 

One obvious maintenance task is updating agents as we learn from failures. For example, if monitoring shows the Medical Agent sometimes gives too blunt an answer, we might want to amend its prompt to emphasize empathy and detail. However, as discussed, even a small prompt change can have ripple effects on the overall MAS behavior. Without re-evaluating, we risk breaking something else.

Managing version control for prompts and agent “brains” is important. In code, we have Git and automated tests; for prompts, we need a way to track changes, perhaps link them to resolved issues. 

The same applies for all facets of MAS (tools, memory, etc.) - versioning and keeping track with a re-evaluation cycle becomes inescapable. 

Model Updates

Over the lifecycle, the base LLMs powering agents may be updated or replaced. Perhaps we fine-tune a smaller model on domain data to reduce reliance on an API. Each model update is like a huge code change – it will alter the dynamics of the MAS. This poses a challenge: how to continuously improve the agents’ intelligence while not breaking coordination?

Challenge essentially boils down to - “unifying and simplifying the training process for these agents”

This is still quite cutting-edge, in practice, most current MAS rely on fixed pre-trained models and prompt engineering, because multi-agent learning is an evolving research problem

Adaptive System Evolution

Over time, what we need the MAS to do might change. The hospital might expand the concierge to also handle prescription refills. That means adding or altering agents. Lifecycle management includes extensibility – can we add a new agent into the mix without rebuilding from scratch? In a well-architected MAS, this might be possible (one could plug in a new specialist agent and update the orchestrator’s prompt to utilize it). Ideally, industry is moving towards more modular MAS architectures, where each agent can be updated or swapped out like a microservice, with clear contracts on how they communicate

Continuous learning

A big question is whether to allow the MAS to learn and adapt on its own (online learning), and if yes, how do we enable that. Online learning (like updating its knowledge base as it answers questions) could improve performance but is dangerous in healthcare because errors could compound. Currently, most MAS are static or at best retrained offline with new data. We see a gap in safe continuous learning – how can the system incorporate human feedback loops effectively? For example,  perhaps patients rate each answer; we could feed those ratings to periodically finetune the agents. But one has to be careful to not degrade some other aspect while optimizing for ratings. Future advances in reinforcement learning for MAS might allow a more automated improvement cycle, but today it’s largely manual.

Strategies and Tools for Lifecycle Management

Routine Re-Evaluation and Updates

A best practice is to have a regular schedule to update the MAS. This could include refreshing prompts, applying any new models, and then re-running the evaluation suite from the Evaluate phase. Our hospital MAS team might, for instance, meet every two weeks to review collected conversation logs, identify areas to improve, adjust prompts or agent logic, and then test and redeploy. Having a systematic process here is part of the lifecycle management discipline.

Versioning Management

Treating every piece of your MAS: prompts, tool definitions, memory schemas, and fine‑tuned model artifacts as first‑class versioned assets are the need of the hour. Integration of versioning with deployment cycles gives visibility on the performance of the agent over time and versions, piggy banking on the evaluations set up for agent performance assessments.

Knowledge Base Updates

Part of the lifecycle is keeping the knowledge current. Our Agent might rely on a knowledge base of medical guidelines. That database keeps updating as guidelines change. Ensuring the MAS incorporates the latest info might be as simple as re-indexing a vector database or as complex as retraining an agent’s model. 

Major gaps in lifecycle management revolve around the lack of formalized processes and automated support. We need “AgentOps”. This includes:

  • Automated regression testing for MAS (to ensure an update didn’t reintroduce a previous bug).

  • Version control systems tailored to prompts and model checkpoints, tool checkpoints etc (so you can roll back if an update performs worse).

  • Continuous monitoring integration – linking monitoring signals to create tickets or triggers for the next iteration.

In our healthcare example, lifecycle management might eventually involve handing some control over to the hospital’s IT or medical staff: for example, giving a tool to clinicians to easily update the knowledge base the agents use (so that a new clinical protocol can be uploaded). 

Empowering domain experts to maintain the MAS without always needing an AI engineer is another frontier to strive for.

Conclusion

Multi-agent AI systems hold great promise – our hypothetical concierge AI could transform patient engagement by providing responsive, 24/7 support that draws on multiple specialties. But as we’ve seen, realizing this vision requires navigating a complex lifecycle:

Build: Frameworks like Langgraphl, CrewAI now provide role templates and orchestrator modules to streamline agent definition and workflows - enabling rapid prototyping of agent teams. Still, fine‑tuning cross‑agent coordination and establishing design best practices remain areas for further innovation.

Evaluate: Structured assessment pipelines using LLM‑as‑Judge setups  and integrated testing suites—offer scalable ways to score dialogue quality, task success, and safety. Augmenting these with human‑in‑the‑loop reviews helps close gaps in rigorous testing. 

Deploy: One‑click deployment platforms can automate infrastructure provisioning, UI generation, and policy enforcement, speeding prototypes into production. However, adapting these deployments to meet security, latency, and compliance requirements still calls for custom integration work. 

Monitor: Unified analytics dashboards paired with log aggregation, and built-in feedback loops surface real‑time insights into agent behavior. To turn natural‑language exchanges into actionable health‑care metrics, we need more specialized monitoring tools that bridge unstructured dialogue with structured observability.

Lifecycle Management: Modular training pipelines, prompt versioning, and continuous feedback loops enable ongoing MAS refinement. Building out full-fledged AgentOps—complete with automated regression tests, prompt change tracking, and adaptive retraining—will be crucial to make these systems maintainable, reliable, and safe over the long haul.

References

  1. “Why Do Multi-Agent LLM Systems Fail?”, arXiv, 2025

  2. Microsoft Research Technical Report MSR-TR-2024-53, “Challenges in human-agent communication”

  3. “Large Language Model based Multi-Agents: A Survey of Progress and Challenges”, arXiv, 2024

  4. “AgentScope: A Flexible yet Robust Multi-Agent Platform”, arXiv, 2024

  5. “AutoGen: Enabling Next-Gen LLM Applications via Multi-Agent Conversation”, Microsoft Research, 2023 

  6. “MetaGPT: Meta Programming for a Multi-Agent Collaborative Framework”, 2023

  7. “ChatEval: Towards Better LLM-based Evaluators through Multi-Agent Debate”, arXiv, 2023

Gen AI multi-agent systems (MAS) are emerging as a way to tackle complex & abstract tasks by having multiple specialized AI agents collaborate. Already we are seeing a few agents (<10) becoming popular and by the end of the year multi agent systems with >10 agents will become commonplace.  Building and operating these MAS systems in the real-world is fraught with challenges at every stage of the lifecycle. In this post, we’ll walk through the current status, challenges and opportunities of the five phases of the MAS lifecycleBuild, Evaluate, Deploy, Monitor, and Lifecycle Management – using a healthcare patient concierge AI as a running example.

Different stages of Agents Lifecycle

The lifecycle of agents in MAS encompasses mostly these five key stages: 

Design and development, where agent specialization and interaction models are established; 

Evaluation & Testing , which verifies individual functionality and inter-agent communication; 

Deployment & Initialization, covering environment setup and CI/CD pipelines; 

Monitoring & Observability, tracking agent health, captures interaction logs, and surfaces performance anomalies in real time;

Life-Cycle Management , managing version updates while preserving staters and maintaining components

Fig 1: Operational Stages of a GenAI Multi-Agent Systems (MAS) Pipeline

Effective management across these stages requires tailored approaches for each phase. During early stages, emphasis falls on clear role definition and communication protocols, while later stages demand robust version compatibility and state preservation. Throughout all stages, security considerations must evolve alongside agent capabilities, implementing granular permissions and comprehensive monitoring. Organizations that excel recognize that agent testing fundamentally differs from traditional software testing, with particular focus on inter-agent dynamics and emergent behaviors that can only be observed when the complete system operates under varying conditions.

Use Case in Focus: Imagine a “patient concierge” AI service for a hospital. A patient can ask this AI to do things like explain their symptoms, schedule an appointment with the right specialist, check insurance coverage, and send reminders. Rather than a single monolithic AI, such a service could be composed of multiple agents – for example: a Medical Agent for gathering symptoms, a Scheduling Agent to book appointments, a Billing Agent for insurance queries, all coordinated by a Concierge Orchestrator Agent that interacts with the patient. This MAS needs to seamlessly integrate medical knowledge, tools (like the hospital’s calendar and databases), and maintain patient privacy. We’ll refer back to this example to illustrate each phase of the lifecycle.

Fig 2: Healthcare MAS with Orchestrated Patient Flow

Build Phase: Designing and Developing the Multi-Agent System

Building a MAS is a fundamentally new kind of software development. In this phase, we define the agents (their roles, skills, and personalities), how they communicate, and what tools or data they can access. 

Architecture of a MAS

  • Multiple specialized agents (medical, scheduling, billing etc) collaborate under an Orchestrator

    • The Symptom Collector agent might consult a knowledge base or guidelines; 

    • The Scheduling agent queries hospital systems (EHR, calendars); 

    • The Billing agent checks insurance databases. 

    • The Orchestrator agent mediates the conversation with the patient and coordinates subtasks among the specialists. 

This design aims to mimic a care team working together for the patient.

Challenges in the Build Phase

Defining Roles and Scope

One of the hardest parts of MAS development is specification and system design. Each agent’s role and responsibilities must be clearly defined so they complement each other without confusion. Ambiguous design can lead to agents stepping on each other’s toes!

For example, if we don’t explicitly specify that only the Symptom Collector Agent should have access to guidelines and knowledge (and not the orchestrator), agents might violate these role boundaries and give out incorrect information. Ensuring that all possible tasks are covered by some agent, and that each agent knows when to yield control to another, is non-trivial.

Communication and Coordination Protocol

LLM-based agents often communicate in natural language, which is prone to misunderstanding. We need to decide how they will interact. Will the agents talk to each other in an open chat thread, or will the orchestrator route messages between them?

Our healthcare agents need a protocol – e.g. the orchestrator first asks the Symptom Collector Agent for an answer, then passes that to the Scheduling Agent if an appointment is needed, and so on. Without a clear interaction plan and handover mechanisms, multi-agent dialogues can potentially become chaotic.

Knowledge & Tools

Each agent may require access to different knowledge sources or tools. Continuing from our example, the Symptom Collector might use a medical knowledge base, or a finetuned LLM, the Scheduling Agent will need secure access to calendar API. Prompting an agent to produce structured output is error proves and ensuring LLM output can properly trigger tool usage is a challenge. 

Iterative Design

Where is my blueprint? Currently, it’s often trial and error, using intuition to prompt-engineer agent behaviours. Developers must iterate on prompts and roles, essentially debugging conversations and simulating scenarios. In a critical domain like healthcare, this is especially difficult because we must anticipate corner cases (like a medical question the agent shouldn’t answer and should escalate to a human). 

LangGraph’s graph-based design lets agents iteratively revisit and refine context‑aware workflows, CrewAI enforces clear specialist roles for effective delegation, and AutoGen orchestrates seamless multi‑agent conversations—mimicking human teamwork to automate complex workflows. 

Despite these tools, gaps remain in the Build stage. LangGraph’s growing graphs can become difficult to debug at scale, CrewAI’s rigid role templates may not adapt well to evolving or cross‑functional tasks, and AutoGen’s conversational loops can introduce latency, obscure error origins, and lack built‑in tools for detecting emergent inter-agent failures.It’s still difficult to guarantee that an MAS will behave as intended across all scenarios. Developers lack formal testing methods during building – often you have to run sample conversations to see if agents do the right thing. Misalignment can already creep in here: an agent might follow its prompt fine in isolation, but once multiple agents interact, unpredictable behaviors emerge.

As in our example, in healthcare, one might enforce constraints (e.g., “Symptom Collector Agent should never give treatment advice”) at the prompt level, but currently that’s up to the developer to remember. This leads into the next phase – even with a well-designed MAS, we must Evaluate it thoroughly to discover any flaws.

Evaluate Phase: Testing the Multi-Agent System’s Performance and Safety

Once we have a MAS design, we need to evaluate whether it actually works as intended. For a healthcare concierge, evaluation is critical – we must test that the AI agents together can handle a variety of patient requests accurately and safely. The Evaluate phase covers functional correctness (does the MAS achieve the goals, e.g. correctly scheduling an appointment with appropriate advice given?) and non-functional aspects like efficiency (do the agents take too many steps or cost too much?), as well as safety (do they avoid dangerous recommendations or privacy breaches?)

Fig 3: Evaluation cycle for a Multi Agent System

Challenges in the Evaluation Phase

The current evaluation frameworks for agents are still evolving to address the needs of MAS systems which are discussed below.

Complex Emergent Behaviour

Unlike a single LLM call with a well-defined input-output, an MAS involves multi-turn processes between agents. This makes evaluation much harder. We are essentially evaluating an entire dialogue or collaborative process, which could branch into many paths. Even if each agent individually passes unit tests, their combination might fail in novel ways.

Lack of Established Metrics

What does it even mean for a MAS to “pass” a test? For a given user query, there might be multiple acceptable ways the agents could solve it. Traditional accuracy metrics don’t directly apply. We often need to evaluate along several dimensions:

  • Task Success: Did the MAS ultimately fulfill the user’s request? (E.g., the patient asked for an appointment and got one with correct details.)

  • Quality of reasoning / conversation: Did agents share information correctly and come to a sound solution (especially important for symptom collection and its completeness)?

  • Efficiency:  Did they solve it in a reasonable number of dialogue turns and API calls, or did they wander off-topic?

  • Safety / Compliance: Did the MAS avoid unauthorized actions or disallowed content (no private data leaked, no unsafe medical suggestion made)?

Human Evaluation & Test Scenarios 

In healthcare, one would ideally have experts review the MAS’s outputs. But evaluating every possible scenario is impossible – you need to pick representative test cases. Creating a suite of test scenarios (like patient personas with various requests) is itself a challenge, requiring medical expertise and understanding of what could go wrong. There’s also the issue of sensitive data – you might have to test with synthetic data if you can’t use real patient info during eval, which may not cover all real-world quirks.

Approaches and Frameworks for Evaluation

Given these challenges, how are people evaluating MAS today? It’s a mix of new techniques and adapting old ones:

LLM as a Judge

One approach is using a language model to evaluate the outputs of another (or a whole MAS). The idea is to have the AI itself score the conversation on success criteria. For example, after our concierge MAS handles a test patient query, we might feed the entire interaction and the intended goal to LLM and ask, “Did the agents fulfill the request correctly and safely? If not, where did they fail?” 

Test scenarios (simulated user queries with known expected outcomes) are fed to the MAS. An Evaluation Agent (LLM-as-Judge) can compare the MAS’s responses against the expected outcome, producing a score or verdict. Logs of the dialogue (number of turns, tools used, etc.) are also collected as part of evaluation metrics. 

Muti Agent Debate

Another method is to pit agents against each other in an evaluation setting. Essentially, two (or more) agents discuss or critique a given solution, hoping that flaws will be revealed in the debate. While originally more about evaluating single LLM answers, one could imagine using debate for MAS: e.g., spin up a separate “critic” agent to argue why the concierge’s plan for the patient might be wrong (“Did it consider the patient’s medical history? What if the diagnosis is incorrect?”) and see if the MAS can defend or adjust

Human-in-the-loop Evaluations

Ultimately, for sensitive domains like healthcare, human evaluation is the gold standard. One might conduct a study where medical professionals interact with the concierge MAS and rate its performance. Human eval is expensive and time-consuming, so it’s often done on a small scale or after initial automated testing passes.

Once we have some confidence from evaluation, we move to deploying the system in the real world – which will reveal new challenges of its own.

Deploy Phase: From Prototype to Production

Deployment is about taking the MAS out of the lab and integrating it into a real-world setting. 

This phase deals with all the practical considerations of making the MAS actually useful to end-users: integration with infrastructure, performance, scalability, and compliance. 

For our patient concierge, deployment means the AI is now interacting with actual patients through a user interface, connected to live hospital databases, and operating under real-world conditions (network issues, high usage periods, etc.).

Challenges in the Deployment Phase

Deployment Pipeline & Lifecycle 

Deploying new agent versions reliably hinges on a robust CI/CD pipeline—containerizing the Medical, Scheduling, Billing, and Orchestrator agents via infrastructure‑as‑code (IaC). Ideally, we lean on MAS frameworks and IaC templates to enforce environment parity and inject secrets securely, while automated rollback hooks tied to versioned release artifacts let us revert immediately if an update breaks appointment‑booking logic. At the same time, dynamic resource policies have to be in place to ensure extra Scheduling Agent instances spin up during the morning check‑in rush and scale down overnight.

Integration with External Systems

In prototyping, we might have mocked the hospital’s Electronic Health Record (EHR) system or scheduling database. In deployment, the MAS must interface with real systems – e.g., calling the EHR API to get a patient’s lab results, or reading/writing appointment info in the scheduling system. Each agent that needs external data becomes a potential integration point. Ensuring the agents make API calls correctly, handle errors (like “system down” or “no appointments available”), and do so securely (not requesting info they shouldn’t) is a huge challenge. 

Emerging standards like the open Agent‑to‑Agent (A2A) protocol give agents a shared, schema‑validated channel for calling and responding to hospital APIs.
Likewise, Anthropic’s Model Context Protocol (MCP) abstracts backend details so agents can swap underlying LLMs or data stores without touching the integration surface

Scalability & Latency 

How do we scale a multi-agent system? One approach is to run multiple instances of the MAS in parallel (horizontally scale), but coordinating state could be an issue if the same patient interacts multiple times – do we keep a conversation context per session? Traditional web service scaling techniques apply, but the agent conversation adds a layer on top that needs to be managed (possibly via an orchestrator service). Ensuring the system meets uptime and response time requirements is a classic engineering challenge magnified by the unpredictability of LLM agents

Reliability & Safety

Deploying guardrails begins in the CI/CD pipeline. During the build stage, the pipeline integrates and unit tests these guardrails—simulating prompt‑injection, malformed API responses, and unauthorized PHI access etc, to ensure no agent can bypass restrictions before any build is published.

Monitor Phase: Observing and Maintaining Performance in Production

Once the MAS is live, the Monitor phase involves keeping an eye on it: tracking its performance, catching errors or drifts, and gathering data for improvement. Monitoring a multi-agent AI is in some ways similar to monitoring any software service (you collect logs, uptime metrics, etc.), but it also has unique aspects – we need to monitor the quality of the AI’s decisions, not just that it’s running.

Challenges in the Monitor Phase

The current monitor/observability frameworks for agents are still evolving to address the needs of MAS systems which are discussed below.

Visibility into Agent Reasoning 

One challenge is that a MAS’s inner workings are essentially conversations in natural language, which can be hard to systematically parse. We can log every message between agents (and we should, for audit purposes), but sifting through those logs to identify issues is non-trivial. If a patient got a poor outcome, we’d want to trace which agent or exchange led to that. For example, maybe the Symptom Collector Agent got the correct info, but the handoff to Scheduling Agent failed because the latter misunderstood a date format)

Visibility into Cost & Latency 

In production, each agent call - Medical, Scheduling, Billing, Orchestrator - incurs token usage and network latency. Without per-agent telemetry, it’s impossible to tell whether cost overruns stem from verbose prompts in the Medical Agent or from excessive back‑and‑forth loops in the orchestrator. Similarly, end‑to‑end response time hides slow external dependencies (e.g., EHR lookups). Effective monitoring must capture elements like tokens, API calls per conversation, response times per agent and resource usage, then correlate spikes to specific scenarios so teams can take corrective actions like optimize prompts, adjust model, or refine auto‑scaling policies etc. 

Inter-Agent Misalignment Over Time

As the MAS handles many queries, certain patterns of failure might emerge. One known category of issues is inter-agent misalignment, where agents gradually diverge from the intended cooperative behavior. In long-running dialogues, they might get confused about context. Potential signals could be unusually long dialogues (many turns back and forth), or the orchestrator agent frequently resetting the conversation.

Drifts and Updates 

Over time, the underlying LLMs might change (for instance, an API upgrade to a new model version) which can change the behavior of your agents subtly. Even without explicit updates, the model might start outputting differently if the provider tweaks it. Monitoring needs to catch distribution shifts – are we suddenly seeing more hallucinations or errors than before? We may need to establish baselines during initial deployment and continuously compare. If, say, the ratio of successful scheduling tasks drops in a particular week, that should trigger an alert to investigate.

Fallbacks and Incident Handling

When (not if) something goes wrong – say an agent outputs something inappropriate to a user – monitoring should flag this ASAP. 

Another difficulty is reproducing issues: by the time we see an alert, the conversation already happened. We have the logs, but if it was due to a rare model quirk, running it again might not produce the same result.

Failure during the execution of the agent can lead to information mismatch in DBs getting stored or incorrect state updates when follow up happens. At each stage, we need to have fallbacks to mitigate these issues.

Approaches for Monitoring MAS

AgentOps & Telemetry 

There’s a growing ecosystem for AgentOps (analogous to MLOps & LLMOps but for Agentic applications). Tools like RagaAI catalyst allow logging of all prompts and responses, and come with a UI to search and analyze them. For example, we could search the logs for when the Medical Agent said “ER” to see if it’s frequently sending people to the Emergency Room inappropriately. We can allow plug code based evaluation functions, e.g., you could write a Python function that checks if an appointment was scheduled when a symptom was severe, and run this function on all logged sessions.

Automated Metrics 

We can derive and set up certain quantitative metrics from the logs:

For example, Average number of dialogue turns per task, Success rate of tasks (if we have a way to label outcomes as success/fail, tool usage frequency and errors (how often did an API call fail or need to be retried), cost per conversation (tokens consumed).

Setting up time-series tracking of these can tell us if, say, after a code update the turns doubled (regression in efficiency) or cost spiked.

Human Oversight 

It might be prudent (especially early in deployment) to have a human monitor (e.g., a nurse or admin) reviewing a sample of the interactions daily, which can be flagged via automation or metrics. In some settings, a shadow mode deployment is done first: the MAS interacts with patients but a human moderator is watching in real-time and can intervene if needed

The gaps in monitoring are largely around measuring qualitative correctness continuously. We can easily monitor uptime or if an agent process crashed, but monitoring whether the content of the agent’s advice was correct is much harder. It ties back to evaluation – essentially we’d like to evaluate as many real conversations as possible, but doing that 100% with either AI or humans is infeasible. Another gap is feeding the monitor findings back into the system quickly – which belongs to the final phase, Lifecycle Management. We want to use what we observe (errors, user feedback etc.) to continually improve the MAS.

Lifecycle Management Phase: Continuous Improvement and Maintenance

Building, evaluating, deploying, and monitoring a multi-agent system is not a one-and-done effort. Lifecycle Management is about the ongoing process of updating the system – fixing bugs, adding new capabilities, adapting to changing conditions – while ensuring it remains stable and effective.

Challenges in Lifecycle Management

Version Control 

One obvious maintenance task is updating agents as we learn from failures. For example, if monitoring shows the Medical Agent sometimes gives too blunt an answer, we might want to amend its prompt to emphasize empathy and detail. However, as discussed, even a small prompt change can have ripple effects on the overall MAS behavior. Without re-evaluating, we risk breaking something else.

Managing version control for prompts and agent “brains” is important. In code, we have Git and automated tests; for prompts, we need a way to track changes, perhaps link them to resolved issues. 

The same applies for all facets of MAS (tools, memory, etc.) - versioning and keeping track with a re-evaluation cycle becomes inescapable. 

Model Updates

Over the lifecycle, the base LLMs powering agents may be updated or replaced. Perhaps we fine-tune a smaller model on domain data to reduce reliance on an API. Each model update is like a huge code change – it will alter the dynamics of the MAS. This poses a challenge: how to continuously improve the agents’ intelligence while not breaking coordination?

Challenge essentially boils down to - “unifying and simplifying the training process for these agents”

This is still quite cutting-edge, in practice, most current MAS rely on fixed pre-trained models and prompt engineering, because multi-agent learning is an evolving research problem

Adaptive System Evolution

Over time, what we need the MAS to do might change. The hospital might expand the concierge to also handle prescription refills. That means adding or altering agents. Lifecycle management includes extensibility – can we add a new agent into the mix without rebuilding from scratch? In a well-architected MAS, this might be possible (one could plug in a new specialist agent and update the orchestrator’s prompt to utilize it). Ideally, industry is moving towards more modular MAS architectures, where each agent can be updated or swapped out like a microservice, with clear contracts on how they communicate

Continuous learning

A big question is whether to allow the MAS to learn and adapt on its own (online learning), and if yes, how do we enable that. Online learning (like updating its knowledge base as it answers questions) could improve performance but is dangerous in healthcare because errors could compound. Currently, most MAS are static or at best retrained offline with new data. We see a gap in safe continuous learning – how can the system incorporate human feedback loops effectively? For example,  perhaps patients rate each answer; we could feed those ratings to periodically finetune the agents. But one has to be careful to not degrade some other aspect while optimizing for ratings. Future advances in reinforcement learning for MAS might allow a more automated improvement cycle, but today it’s largely manual.

Strategies and Tools for Lifecycle Management

Routine Re-Evaluation and Updates

A best practice is to have a regular schedule to update the MAS. This could include refreshing prompts, applying any new models, and then re-running the evaluation suite from the Evaluate phase. Our hospital MAS team might, for instance, meet every two weeks to review collected conversation logs, identify areas to improve, adjust prompts or agent logic, and then test and redeploy. Having a systematic process here is part of the lifecycle management discipline.

Versioning Management

Treating every piece of your MAS: prompts, tool definitions, memory schemas, and fine‑tuned model artifacts as first‑class versioned assets are the need of the hour. Integration of versioning with deployment cycles gives visibility on the performance of the agent over time and versions, piggy banking on the evaluations set up for agent performance assessments.

Knowledge Base Updates

Part of the lifecycle is keeping the knowledge current. Our Agent might rely on a knowledge base of medical guidelines. That database keeps updating as guidelines change. Ensuring the MAS incorporates the latest info might be as simple as re-indexing a vector database or as complex as retraining an agent’s model. 

Major gaps in lifecycle management revolve around the lack of formalized processes and automated support. We need “AgentOps”. This includes:

  • Automated regression testing for MAS (to ensure an update didn’t reintroduce a previous bug).

  • Version control systems tailored to prompts and model checkpoints, tool checkpoints etc (so you can roll back if an update performs worse).

  • Continuous monitoring integration – linking monitoring signals to create tickets or triggers for the next iteration.

In our healthcare example, lifecycle management might eventually involve handing some control over to the hospital’s IT or medical staff: for example, giving a tool to clinicians to easily update the knowledge base the agents use (so that a new clinical protocol can be uploaded). 

Empowering domain experts to maintain the MAS without always needing an AI engineer is another frontier to strive for.

Conclusion

Multi-agent AI systems hold great promise – our hypothetical concierge AI could transform patient engagement by providing responsive, 24/7 support that draws on multiple specialties. But as we’ve seen, realizing this vision requires navigating a complex lifecycle:

Build: Frameworks like Langgraphl, CrewAI now provide role templates and orchestrator modules to streamline agent definition and workflows - enabling rapid prototyping of agent teams. Still, fine‑tuning cross‑agent coordination and establishing design best practices remain areas for further innovation.

Evaluate: Structured assessment pipelines using LLM‑as‑Judge setups  and integrated testing suites—offer scalable ways to score dialogue quality, task success, and safety. Augmenting these with human‑in‑the‑loop reviews helps close gaps in rigorous testing. 

Deploy: One‑click deployment platforms can automate infrastructure provisioning, UI generation, and policy enforcement, speeding prototypes into production. However, adapting these deployments to meet security, latency, and compliance requirements still calls for custom integration work. 

Monitor: Unified analytics dashboards paired with log aggregation, and built-in feedback loops surface real‑time insights into agent behavior. To turn natural‑language exchanges into actionable health‑care metrics, we need more specialized monitoring tools that bridge unstructured dialogue with structured observability.

Lifecycle Management: Modular training pipelines, prompt versioning, and continuous feedback loops enable ongoing MAS refinement. Building out full-fledged AgentOps—complete with automated regression tests, prompt change tracking, and adaptive retraining—will be crucial to make these systems maintainable, reliable, and safe over the long haul.

References

  1. “Why Do Multi-Agent LLM Systems Fail?”, arXiv, 2025

  2. Microsoft Research Technical Report MSR-TR-2024-53, “Challenges in human-agent communication”

  3. “Large Language Model based Multi-Agents: A Survey of Progress and Challenges”, arXiv, 2024

  4. “AgentScope: A Flexible yet Robust Multi-Agent Platform”, arXiv, 2024

  5. “AutoGen: Enabling Next-Gen LLM Applications via Multi-Agent Conversation”, Microsoft Research, 2023 

  6. “MetaGPT: Meta Programming for a Multi-Agent Collaborative Framework”, 2023

  7. “ChatEval: Towards Better LLM-based Evaluators through Multi-Agent Debate”, arXiv, 2023

Gen AI multi-agent systems (MAS) are emerging as a way to tackle complex & abstract tasks by having multiple specialized AI agents collaborate. Already we are seeing a few agents (<10) becoming popular and by the end of the year multi agent systems with >10 agents will become commonplace.  Building and operating these MAS systems in the real-world is fraught with challenges at every stage of the lifecycle. In this post, we’ll walk through the current status, challenges and opportunities of the five phases of the MAS lifecycleBuild, Evaluate, Deploy, Monitor, and Lifecycle Management – using a healthcare patient concierge AI as a running example.

Different stages of Agents Lifecycle

The lifecycle of agents in MAS encompasses mostly these five key stages: 

Design and development, where agent specialization and interaction models are established; 

Evaluation & Testing , which verifies individual functionality and inter-agent communication; 

Deployment & Initialization, covering environment setup and CI/CD pipelines; 

Monitoring & Observability, tracking agent health, captures interaction logs, and surfaces performance anomalies in real time;

Life-Cycle Management , managing version updates while preserving staters and maintaining components

Fig 1: Operational Stages of a GenAI Multi-Agent Systems (MAS) Pipeline

Effective management across these stages requires tailored approaches for each phase. During early stages, emphasis falls on clear role definition and communication protocols, while later stages demand robust version compatibility and state preservation. Throughout all stages, security considerations must evolve alongside agent capabilities, implementing granular permissions and comprehensive monitoring. Organizations that excel recognize that agent testing fundamentally differs from traditional software testing, with particular focus on inter-agent dynamics and emergent behaviors that can only be observed when the complete system operates under varying conditions.

Use Case in Focus: Imagine a “patient concierge” AI service for a hospital. A patient can ask this AI to do things like explain their symptoms, schedule an appointment with the right specialist, check insurance coverage, and send reminders. Rather than a single monolithic AI, such a service could be composed of multiple agents – for example: a Medical Agent for gathering symptoms, a Scheduling Agent to book appointments, a Billing Agent for insurance queries, all coordinated by a Concierge Orchestrator Agent that interacts with the patient. This MAS needs to seamlessly integrate medical knowledge, tools (like the hospital’s calendar and databases), and maintain patient privacy. We’ll refer back to this example to illustrate each phase of the lifecycle.

Fig 2: Healthcare MAS with Orchestrated Patient Flow

Build Phase: Designing and Developing the Multi-Agent System

Building a MAS is a fundamentally new kind of software development. In this phase, we define the agents (their roles, skills, and personalities), how they communicate, and what tools or data they can access. 

Architecture of a MAS

  • Multiple specialized agents (medical, scheduling, billing etc) collaborate under an Orchestrator

    • The Symptom Collector agent might consult a knowledge base or guidelines; 

    • The Scheduling agent queries hospital systems (EHR, calendars); 

    • The Billing agent checks insurance databases. 

    • The Orchestrator agent mediates the conversation with the patient and coordinates subtasks among the specialists. 

This design aims to mimic a care team working together for the patient.

Challenges in the Build Phase

Defining Roles and Scope

One of the hardest parts of MAS development is specification and system design. Each agent’s role and responsibilities must be clearly defined so they complement each other without confusion. Ambiguous design can lead to agents stepping on each other’s toes!

For example, if we don’t explicitly specify that only the Symptom Collector Agent should have access to guidelines and knowledge (and not the orchestrator), agents might violate these role boundaries and give out incorrect information. Ensuring that all possible tasks are covered by some agent, and that each agent knows when to yield control to another, is non-trivial.

Communication and Coordination Protocol

LLM-based agents often communicate in natural language, which is prone to misunderstanding. We need to decide how they will interact. Will the agents talk to each other in an open chat thread, or will the orchestrator route messages between them?

Our healthcare agents need a protocol – e.g. the orchestrator first asks the Symptom Collector Agent for an answer, then passes that to the Scheduling Agent if an appointment is needed, and so on. Without a clear interaction plan and handover mechanisms, multi-agent dialogues can potentially become chaotic.

Knowledge & Tools

Each agent may require access to different knowledge sources or tools. Continuing from our example, the Symptom Collector might use a medical knowledge base, or a finetuned LLM, the Scheduling Agent will need secure access to calendar API. Prompting an agent to produce structured output is error proves and ensuring LLM output can properly trigger tool usage is a challenge. 

Iterative Design

Where is my blueprint? Currently, it’s often trial and error, using intuition to prompt-engineer agent behaviours. Developers must iterate on prompts and roles, essentially debugging conversations and simulating scenarios. In a critical domain like healthcare, this is especially difficult because we must anticipate corner cases (like a medical question the agent shouldn’t answer and should escalate to a human). 

LangGraph’s graph-based design lets agents iteratively revisit and refine context‑aware workflows, CrewAI enforces clear specialist roles for effective delegation, and AutoGen orchestrates seamless multi‑agent conversations—mimicking human teamwork to automate complex workflows. 

Despite these tools, gaps remain in the Build stage. LangGraph’s growing graphs can become difficult to debug at scale, CrewAI’s rigid role templates may not adapt well to evolving or cross‑functional tasks, and AutoGen’s conversational loops can introduce latency, obscure error origins, and lack built‑in tools for detecting emergent inter-agent failures.It’s still difficult to guarantee that an MAS will behave as intended across all scenarios. Developers lack formal testing methods during building – often you have to run sample conversations to see if agents do the right thing. Misalignment can already creep in here: an agent might follow its prompt fine in isolation, but once multiple agents interact, unpredictable behaviors emerge.

As in our example, in healthcare, one might enforce constraints (e.g., “Symptom Collector Agent should never give treatment advice”) at the prompt level, but currently that’s up to the developer to remember. This leads into the next phase – even with a well-designed MAS, we must Evaluate it thoroughly to discover any flaws.

Evaluate Phase: Testing the Multi-Agent System’s Performance and Safety

Once we have a MAS design, we need to evaluate whether it actually works as intended. For a healthcare concierge, evaluation is critical – we must test that the AI agents together can handle a variety of patient requests accurately and safely. The Evaluate phase covers functional correctness (does the MAS achieve the goals, e.g. correctly scheduling an appointment with appropriate advice given?) and non-functional aspects like efficiency (do the agents take too many steps or cost too much?), as well as safety (do they avoid dangerous recommendations or privacy breaches?)

Fig 3: Evaluation cycle for a Multi Agent System

Challenges in the Evaluation Phase

The current evaluation frameworks for agents are still evolving to address the needs of MAS systems which are discussed below.

Complex Emergent Behaviour

Unlike a single LLM call with a well-defined input-output, an MAS involves multi-turn processes between agents. This makes evaluation much harder. We are essentially evaluating an entire dialogue or collaborative process, which could branch into many paths. Even if each agent individually passes unit tests, their combination might fail in novel ways.

Lack of Established Metrics

What does it even mean for a MAS to “pass” a test? For a given user query, there might be multiple acceptable ways the agents could solve it. Traditional accuracy metrics don’t directly apply. We often need to evaluate along several dimensions:

  • Task Success: Did the MAS ultimately fulfill the user’s request? (E.g., the patient asked for an appointment and got one with correct details.)

  • Quality of reasoning / conversation: Did agents share information correctly and come to a sound solution (especially important for symptom collection and its completeness)?

  • Efficiency:  Did they solve it in a reasonable number of dialogue turns and API calls, or did they wander off-topic?

  • Safety / Compliance: Did the MAS avoid unauthorized actions or disallowed content (no private data leaked, no unsafe medical suggestion made)?

Human Evaluation & Test Scenarios 

In healthcare, one would ideally have experts review the MAS’s outputs. But evaluating every possible scenario is impossible – you need to pick representative test cases. Creating a suite of test scenarios (like patient personas with various requests) is itself a challenge, requiring medical expertise and understanding of what could go wrong. There’s also the issue of sensitive data – you might have to test with synthetic data if you can’t use real patient info during eval, which may not cover all real-world quirks.

Approaches and Frameworks for Evaluation

Given these challenges, how are people evaluating MAS today? It’s a mix of new techniques and adapting old ones:

LLM as a Judge

One approach is using a language model to evaluate the outputs of another (or a whole MAS). The idea is to have the AI itself score the conversation on success criteria. For example, after our concierge MAS handles a test patient query, we might feed the entire interaction and the intended goal to LLM and ask, “Did the agents fulfill the request correctly and safely? If not, where did they fail?” 

Test scenarios (simulated user queries with known expected outcomes) are fed to the MAS. An Evaluation Agent (LLM-as-Judge) can compare the MAS’s responses against the expected outcome, producing a score or verdict. Logs of the dialogue (number of turns, tools used, etc.) are also collected as part of evaluation metrics. 

Muti Agent Debate

Another method is to pit agents against each other in an evaluation setting. Essentially, two (or more) agents discuss or critique a given solution, hoping that flaws will be revealed in the debate. While originally more about evaluating single LLM answers, one could imagine using debate for MAS: e.g., spin up a separate “critic” agent to argue why the concierge’s plan for the patient might be wrong (“Did it consider the patient’s medical history? What if the diagnosis is incorrect?”) and see if the MAS can defend or adjust

Human-in-the-loop Evaluations

Ultimately, for sensitive domains like healthcare, human evaluation is the gold standard. One might conduct a study where medical professionals interact with the concierge MAS and rate its performance. Human eval is expensive and time-consuming, so it’s often done on a small scale or after initial automated testing passes.

Once we have some confidence from evaluation, we move to deploying the system in the real world – which will reveal new challenges of its own.

Deploy Phase: From Prototype to Production

Deployment is about taking the MAS out of the lab and integrating it into a real-world setting. 

This phase deals with all the practical considerations of making the MAS actually useful to end-users: integration with infrastructure, performance, scalability, and compliance. 

For our patient concierge, deployment means the AI is now interacting with actual patients through a user interface, connected to live hospital databases, and operating under real-world conditions (network issues, high usage periods, etc.).

Challenges in the Deployment Phase

Deployment Pipeline & Lifecycle 

Deploying new agent versions reliably hinges on a robust CI/CD pipeline—containerizing the Medical, Scheduling, Billing, and Orchestrator agents via infrastructure‑as‑code (IaC). Ideally, we lean on MAS frameworks and IaC templates to enforce environment parity and inject secrets securely, while automated rollback hooks tied to versioned release artifacts let us revert immediately if an update breaks appointment‑booking logic. At the same time, dynamic resource policies have to be in place to ensure extra Scheduling Agent instances spin up during the morning check‑in rush and scale down overnight.

Integration with External Systems

In prototyping, we might have mocked the hospital’s Electronic Health Record (EHR) system or scheduling database. In deployment, the MAS must interface with real systems – e.g., calling the EHR API to get a patient’s lab results, or reading/writing appointment info in the scheduling system. Each agent that needs external data becomes a potential integration point. Ensuring the agents make API calls correctly, handle errors (like “system down” or “no appointments available”), and do so securely (not requesting info they shouldn’t) is a huge challenge. 

Emerging standards like the open Agent‑to‑Agent (A2A) protocol give agents a shared, schema‑validated channel for calling and responding to hospital APIs.
Likewise, Anthropic’s Model Context Protocol (MCP) abstracts backend details so agents can swap underlying LLMs or data stores without touching the integration surface

Scalability & Latency 

How do we scale a multi-agent system? One approach is to run multiple instances of the MAS in parallel (horizontally scale), but coordinating state could be an issue if the same patient interacts multiple times – do we keep a conversation context per session? Traditional web service scaling techniques apply, but the agent conversation adds a layer on top that needs to be managed (possibly via an orchestrator service). Ensuring the system meets uptime and response time requirements is a classic engineering challenge magnified by the unpredictability of LLM agents

Reliability & Safety

Deploying guardrails begins in the CI/CD pipeline. During the build stage, the pipeline integrates and unit tests these guardrails—simulating prompt‑injection, malformed API responses, and unauthorized PHI access etc, to ensure no agent can bypass restrictions before any build is published.

Monitor Phase: Observing and Maintaining Performance in Production

Once the MAS is live, the Monitor phase involves keeping an eye on it: tracking its performance, catching errors or drifts, and gathering data for improvement. Monitoring a multi-agent AI is in some ways similar to monitoring any software service (you collect logs, uptime metrics, etc.), but it also has unique aspects – we need to monitor the quality of the AI’s decisions, not just that it’s running.

Challenges in the Monitor Phase

The current monitor/observability frameworks for agents are still evolving to address the needs of MAS systems which are discussed below.

Visibility into Agent Reasoning 

One challenge is that a MAS’s inner workings are essentially conversations in natural language, which can be hard to systematically parse. We can log every message between agents (and we should, for audit purposes), but sifting through those logs to identify issues is non-trivial. If a patient got a poor outcome, we’d want to trace which agent or exchange led to that. For example, maybe the Symptom Collector Agent got the correct info, but the handoff to Scheduling Agent failed because the latter misunderstood a date format)

Visibility into Cost & Latency 

In production, each agent call - Medical, Scheduling, Billing, Orchestrator - incurs token usage and network latency. Without per-agent telemetry, it’s impossible to tell whether cost overruns stem from verbose prompts in the Medical Agent or from excessive back‑and‑forth loops in the orchestrator. Similarly, end‑to‑end response time hides slow external dependencies (e.g., EHR lookups). Effective monitoring must capture elements like tokens, API calls per conversation, response times per agent and resource usage, then correlate spikes to specific scenarios so teams can take corrective actions like optimize prompts, adjust model, or refine auto‑scaling policies etc. 

Inter-Agent Misalignment Over Time

As the MAS handles many queries, certain patterns of failure might emerge. One known category of issues is inter-agent misalignment, where agents gradually diverge from the intended cooperative behavior. In long-running dialogues, they might get confused about context. Potential signals could be unusually long dialogues (many turns back and forth), or the orchestrator agent frequently resetting the conversation.

Drifts and Updates 

Over time, the underlying LLMs might change (for instance, an API upgrade to a new model version) which can change the behavior of your agents subtly. Even without explicit updates, the model might start outputting differently if the provider tweaks it. Monitoring needs to catch distribution shifts – are we suddenly seeing more hallucinations or errors than before? We may need to establish baselines during initial deployment and continuously compare. If, say, the ratio of successful scheduling tasks drops in a particular week, that should trigger an alert to investigate.

Fallbacks and Incident Handling

When (not if) something goes wrong – say an agent outputs something inappropriate to a user – monitoring should flag this ASAP. 

Another difficulty is reproducing issues: by the time we see an alert, the conversation already happened. We have the logs, but if it was due to a rare model quirk, running it again might not produce the same result.

Failure during the execution of the agent can lead to information mismatch in DBs getting stored or incorrect state updates when follow up happens. At each stage, we need to have fallbacks to mitigate these issues.

Approaches for Monitoring MAS

AgentOps & Telemetry 

There’s a growing ecosystem for AgentOps (analogous to MLOps & LLMOps but for Agentic applications). Tools like RagaAI catalyst allow logging of all prompts and responses, and come with a UI to search and analyze them. For example, we could search the logs for when the Medical Agent said “ER” to see if it’s frequently sending people to the Emergency Room inappropriately. We can allow plug code based evaluation functions, e.g., you could write a Python function that checks if an appointment was scheduled when a symptom was severe, and run this function on all logged sessions.

Automated Metrics 

We can derive and set up certain quantitative metrics from the logs:

For example, Average number of dialogue turns per task, Success rate of tasks (if we have a way to label outcomes as success/fail, tool usage frequency and errors (how often did an API call fail or need to be retried), cost per conversation (tokens consumed).

Setting up time-series tracking of these can tell us if, say, after a code update the turns doubled (regression in efficiency) or cost spiked.

Human Oversight 

It might be prudent (especially early in deployment) to have a human monitor (e.g., a nurse or admin) reviewing a sample of the interactions daily, which can be flagged via automation or metrics. In some settings, a shadow mode deployment is done first: the MAS interacts with patients but a human moderator is watching in real-time and can intervene if needed

The gaps in monitoring are largely around measuring qualitative correctness continuously. We can easily monitor uptime or if an agent process crashed, but monitoring whether the content of the agent’s advice was correct is much harder. It ties back to evaluation – essentially we’d like to evaluate as many real conversations as possible, but doing that 100% with either AI or humans is infeasible. Another gap is feeding the monitor findings back into the system quickly – which belongs to the final phase, Lifecycle Management. We want to use what we observe (errors, user feedback etc.) to continually improve the MAS.

Lifecycle Management Phase: Continuous Improvement and Maintenance

Building, evaluating, deploying, and monitoring a multi-agent system is not a one-and-done effort. Lifecycle Management is about the ongoing process of updating the system – fixing bugs, adding new capabilities, adapting to changing conditions – while ensuring it remains stable and effective.

Challenges in Lifecycle Management

Version Control 

One obvious maintenance task is updating agents as we learn from failures. For example, if monitoring shows the Medical Agent sometimes gives too blunt an answer, we might want to amend its prompt to emphasize empathy and detail. However, as discussed, even a small prompt change can have ripple effects on the overall MAS behavior. Without re-evaluating, we risk breaking something else.

Managing version control for prompts and agent “brains” is important. In code, we have Git and automated tests; for prompts, we need a way to track changes, perhaps link them to resolved issues. 

The same applies for all facets of MAS (tools, memory, etc.) - versioning and keeping track with a re-evaluation cycle becomes inescapable. 

Model Updates

Over the lifecycle, the base LLMs powering agents may be updated or replaced. Perhaps we fine-tune a smaller model on domain data to reduce reliance on an API. Each model update is like a huge code change – it will alter the dynamics of the MAS. This poses a challenge: how to continuously improve the agents’ intelligence while not breaking coordination?

Challenge essentially boils down to - “unifying and simplifying the training process for these agents”

This is still quite cutting-edge, in practice, most current MAS rely on fixed pre-trained models and prompt engineering, because multi-agent learning is an evolving research problem

Adaptive System Evolution

Over time, what we need the MAS to do might change. The hospital might expand the concierge to also handle prescription refills. That means adding or altering agents. Lifecycle management includes extensibility – can we add a new agent into the mix without rebuilding from scratch? In a well-architected MAS, this might be possible (one could plug in a new specialist agent and update the orchestrator’s prompt to utilize it). Ideally, industry is moving towards more modular MAS architectures, where each agent can be updated or swapped out like a microservice, with clear contracts on how they communicate

Continuous learning

A big question is whether to allow the MAS to learn and adapt on its own (online learning), and if yes, how do we enable that. Online learning (like updating its knowledge base as it answers questions) could improve performance but is dangerous in healthcare because errors could compound. Currently, most MAS are static or at best retrained offline with new data. We see a gap in safe continuous learning – how can the system incorporate human feedback loops effectively? For example,  perhaps patients rate each answer; we could feed those ratings to periodically finetune the agents. But one has to be careful to not degrade some other aspect while optimizing for ratings. Future advances in reinforcement learning for MAS might allow a more automated improvement cycle, but today it’s largely manual.

Strategies and Tools for Lifecycle Management

Routine Re-Evaluation and Updates

A best practice is to have a regular schedule to update the MAS. This could include refreshing prompts, applying any new models, and then re-running the evaluation suite from the Evaluate phase. Our hospital MAS team might, for instance, meet every two weeks to review collected conversation logs, identify areas to improve, adjust prompts or agent logic, and then test and redeploy. Having a systematic process here is part of the lifecycle management discipline.

Versioning Management

Treating every piece of your MAS: prompts, tool definitions, memory schemas, and fine‑tuned model artifacts as first‑class versioned assets are the need of the hour. Integration of versioning with deployment cycles gives visibility on the performance of the agent over time and versions, piggy banking on the evaluations set up for agent performance assessments.

Knowledge Base Updates

Part of the lifecycle is keeping the knowledge current. Our Agent might rely on a knowledge base of medical guidelines. That database keeps updating as guidelines change. Ensuring the MAS incorporates the latest info might be as simple as re-indexing a vector database or as complex as retraining an agent’s model. 

Major gaps in lifecycle management revolve around the lack of formalized processes and automated support. We need “AgentOps”. This includes:

  • Automated regression testing for MAS (to ensure an update didn’t reintroduce a previous bug).

  • Version control systems tailored to prompts and model checkpoints, tool checkpoints etc (so you can roll back if an update performs worse).

  • Continuous monitoring integration – linking monitoring signals to create tickets or triggers for the next iteration.

In our healthcare example, lifecycle management might eventually involve handing some control over to the hospital’s IT or medical staff: for example, giving a tool to clinicians to easily update the knowledge base the agents use (so that a new clinical protocol can be uploaded). 

Empowering domain experts to maintain the MAS without always needing an AI engineer is another frontier to strive for.

Conclusion

Multi-agent AI systems hold great promise – our hypothetical concierge AI could transform patient engagement by providing responsive, 24/7 support that draws on multiple specialties. But as we’ve seen, realizing this vision requires navigating a complex lifecycle:

Build: Frameworks like Langgraphl, CrewAI now provide role templates and orchestrator modules to streamline agent definition and workflows - enabling rapid prototyping of agent teams. Still, fine‑tuning cross‑agent coordination and establishing design best practices remain areas for further innovation.

Evaluate: Structured assessment pipelines using LLM‑as‑Judge setups  and integrated testing suites—offer scalable ways to score dialogue quality, task success, and safety. Augmenting these with human‑in‑the‑loop reviews helps close gaps in rigorous testing. 

Deploy: One‑click deployment platforms can automate infrastructure provisioning, UI generation, and policy enforcement, speeding prototypes into production. However, adapting these deployments to meet security, latency, and compliance requirements still calls for custom integration work. 

Monitor: Unified analytics dashboards paired with log aggregation, and built-in feedback loops surface real‑time insights into agent behavior. To turn natural‑language exchanges into actionable health‑care metrics, we need more specialized monitoring tools that bridge unstructured dialogue with structured observability.

Lifecycle Management: Modular training pipelines, prompt versioning, and continuous feedback loops enable ongoing MAS refinement. Building out full-fledged AgentOps—complete with automated regression tests, prompt change tracking, and adaptive retraining—will be crucial to make these systems maintainable, reliable, and safe over the long haul.

References

  1. “Why Do Multi-Agent LLM Systems Fail?”, arXiv, 2025

  2. Microsoft Research Technical Report MSR-TR-2024-53, “Challenges in human-agent communication”

  3. “Large Language Model based Multi-Agents: A Survey of Progress and Challenges”, arXiv, 2024

  4. “AgentScope: A Flexible yet Robust Multi-Agent Platform”, arXiv, 2024

  5. “AutoGen: Enabling Next-Gen LLM Applications via Multi-Agent Conversation”, Microsoft Research, 2023 

  6. “MetaGPT: Meta Programming for a Multi-Agent Collaborative Framework”, 2023

  7. “ChatEval: Towards Better LLM-based Evaluators through Multi-Agent Debate”, arXiv, 2023

Subscribe to our newsletter to never miss an update

Get Started With RagaAI®

Book a Demo

Schedule a call with AI Testing Experts

Get Started With RagaAI®

Book a Demo

Schedule a call with AI Testing Experts

Get Started With RagaAI®

Book a Demo

Schedule a call with AI Testing Experts

Get Started With RagaAI®

Book a Demo

Schedule a call with AI Testing Experts