This post is co-written with Lee Rehwinkel from Planview.

Businesses today face numerous challenges in managing intricate projects and programs, deriving valuable insights from massive data volumes, and making timely decisions. These hurdles frequently lead to productivity bottlenecks for program managers and executives, hindering their ability to drive organizational success efficiently.

Planview, a leading provider of connected work management solutions, embarked on an ambitious plan in 2023 to revolutionize how 3 million global users interact with their project management applications. To realize this vision, Planview developed an AI assistant called Planview Copilot, using a multi-agent system powered by Amazon Bedrock.

Developing this multi-agent system posed several challenges:

Reliably routing tasks to appropriate AI agents
Accessing data from various sources and formats
Interacting with multiple application APIs
Enabling the self-serve creation of new AI skills by different product teams

To overcome these challenges, Planview developed a multi-agent architecture built using Amazon Bedrock. Amazon Bedrock is a fully managed service that provides API access to foundation models (FMs) from Amazon and other leading AI startups. This allows developers to choose the FM that is best suited for their use case. This approach is both architecturally and organizationally scalable, enabling Planview to rapidly develop and deploy new AI skills to meet the evolving needs of their customers.

This post focuses primarily on the first challenge: routing tasks and managing multiple agents in a generative AI architecture. We explore Planview’s approach to this challenge during the development of Planview Copilot, sharing insights into the design decisions that provide efficient and reliable task routing.

We describe customized home-grown agents in this post because this project was implemented before Amazon Bedrock Agents was generally available. However, Amazon Bedrock Agents is now the recommended solution for organizations looking to use AI-powered agents in their operations. Amazon Bedrock Agents can retain memory across interactions, offering more personalized and seamless user experiences. You can benefit from improved recommendations and recall of prior context where required, enjoying a more cohesive and efficient interaction with the agent. We share our learnings in our solution to help you understanding how to use AWS technology to build solutions to meet your goals.

Solution overview

Planview’s multi-agent architecture consists of multiple generative AI components collaborating as a single system. At its core, an orchestrator is responsible for routing questions to various agents, collecting the learned information, and providing users with a synthesized response. The orchestrator is managed by a central development team, and the agents are managed by each application team.

The orchestrator comprises two main components called the router and responder, which are powered by a large language model (LLM). The router uses AI to intelligently route user questions to various application agents with specialized capabilities. The agents can be categorized into three main types:

Help agent – Uses Retrieval Augmented Generation (RAG) to provide application help
Data agent – Dynamically accesses and analyzes customer data
Action agent – Runs actions within the application on the user’s behalf

After the agents have processed the questions and provided their responses, the responder, also powered by an LLM, synthesizes the learned information and formulates a coherent response to the user. This architecture allows for a seamless collaboration between the centralized orchestrator and the specialized agents, which provides users an accurate and comprehensive answers to their questions. The following diagram illustrates the end-to-end workflow.

Technical overview

Planview used key AWS services to build its multi-agent architecture. The central Copilot service, powered by Amazon Elastic Kubernetes Service (Amazon EKS), is responsible for coordinating activities among the various services. Its responsibilities include:

Managing user session chat history using Amazon Relational Database Service (Amazon RDS)
Coordinating traffic between the router, application agents, and responder
Handling logging, monitoring, and collecting user-submitted feedback

The router and responder are AWS Lambda functions that interact with Amazon Bedrock. The router considers the user’s question and chat history from the central Copilot service, and the responder considers the user’s question, chat history, and responses from each agent.

Application teams manage their agents using Lambda functions that interact with Amazon Bedrock. For improved visibility, evaluation, and monitoring, Planview has adopted a centralized prompt repository service to store LLM prompts.

Agents can interact with applications using various methods depending on the use case and data availability:

Existing application APIs – Agents can communicate with applications through their existing API endpoints
Amazon Athena or traditional SQL data stores – Agents can retrieve data from Amazon Athena or other SQL-based data stores to provide relevant information
Amazon Neptune for graph data – Agents can access graph data stored in Amazon Neptune to support complex dependency analysis
Amazon OpenSearch Service for document RAG – Agents can use Amazon OpenSearch Service to perform RAG on documents

The following diagram illustrates the generative AI assistant architecture on AWS.

Router and responder sample prompts

The router and responder components work together to process user queries and generate appropriate responses. The following prompts provide illustrative router and responder prompt templates. Additional prompt engineering would be required to improve reliability for a production implementation.

First, the available tools are described, including their purpose and sample questions that can be asked of each tool. The example questions help guide the natural language interactions between the orchestrator and the available agents, as represented by tools.

tools = ”’
<tool>
<toolName>applicationHelp</toolName>
<toolDescription>
Use this tool to answer application help related questions.
Example questions:
How do I reset my password?
How do I add a new user?
How do I create a task?
</toolDescription>
</tool>
<tool>
<toolName>dataQuery</toolName>
<toolDescription>
Use this tool to answer questions using application data.
Example questions:
Which tasks are assigned to me?
How many tasks are due next week?
Which task is most at risk?
</toolDescription>
</tool>

Next, the router prompt outlines the guidelines for the agent to either respond directly to user queries or request information through specific tools before formulating a response:

system_prompt_router = f”’
<role>
Your job is to decide if you need additional information to fully answer the User’s
questions.
You achieve your goal by choosing either ‘respond’ or ‘callTool’.
You have access to your chat history in <chatHistory></chatHistory> tags.
You also have a list of available tools to assist you in <tools></tools> tags.
</role>
<chatHistory>
{chatHistory}
</chatHistory>
<tools>
{tools}
</tools>
<rules>
– If the chat history contains sufficient information to answer the User’s questions,
choose the ‘respond’ action.
– To gather more information before responding, choose the ‘callTool’ action.
– You many only choose from the tools in the <tools></tools> tags.
– If no tool can assist with the question, choose the ‘respond’ action.
– Place your chosen action within <action></action> tags.
– When you chose the ‘callTool’ action, provide the <toolName> and the <toolQuestion> you
would like to ask.
– Your <toolQuestion> should be verbose and avoid using pronouns.
– Start by providing your step-by-step thinking in <thinking></thinking> tags.
– Then you will give your answer in <answer></answer> tags.
– Your answer should follow the format of one of these three examples:
When choosing the ‘respond’ action, your answer should follow the below example EXACTLY:
<example>
<answer>
<action>respond</action>
</answer>
</example>
When choosing the ‘callTool’ action for a single Tool:
<example>
<answer>
<action>callTool</action>
<tools>
<tool>
<toolName>applicationHelp</toolName>
<toolQuestion>How do I reset my password?</toolQuestion>
</tool>
</tools>
</answer>
</example>
Executing the above, would produce the following result:
You can also call multiple Tools using this format:
<example>
<answer>
<action>callTool</action>
<tools>
<tool>
<toolName>dataQuery</toolName>
<toolQuestion>How many tasks are assigned to me?</toolQuestion>
</tool>
<tool>
<toolName>applicationHelp</toolName>
<toolQuestion>How do I add a new task?</toolQuestion>
</tool>
</tools>
</answer>
</example>
</rules>
”’

The following is a sample response from the router component that initiates the dataQuery tool to retrieve and analyze task assignments for each user:

<thinking>
To determine who has the most tasks assigned, I will need to query the application data. The “dataQuery” tool seems most appropriate for this question.
</thinking>

<answer>
<action>callTool</action>
<tools>
<tool>
<toolName>dataQuery</toolName>
<toolQuestion>Which user has the most tasks currently assigned to them? </toolQuestion>
</tool>
</tools>
</answer>

The following is a sample response from the responder component that uses the dataQuery tool to fetch information about the user’s assigned tasks. It reports that the user has five tasks assigned to them.

<thinking>
Based on the chat history, I previously called the dataQuery tool to ask “How many tasks are currently assigned to the user?”. The tool responded that the user has 5 tasks assigned to them.
</thinking>

<answer>
According to the data I queried previously, you have 5 tasks assigned to you.
</answer>

Model evaluation and selection

Evaluating and monitoring generative AI model performance is crucial in any AI system. Planview’s multi-agent architecture enables assessment at various component levels, providing comprehensive quality control despite the system’s complexity. Planview evaluates components at three levels:

Prompts – Assessing LLM prompts for effectiveness and accuracy
AI agents – Evaluating complete prompt chains to maintain optimal task handling and response relevance
AI system – Testing user-facing interactions to verify seamless integration of all components

The following figure illustrates the evaluation framework for prompts and scoring.

To conduct these evaluations, Planview uses a set of carefully crafted test questions that cover typical user queries and edge cases. These evaluations are performed during the development phase and continue in production to track the quality of responses over time. Currently, human evaluators play a crucial role in scoring responses. To aid in the evaluation, Planview has developed an internal evaluation tool to store the library of questions and track the responses over time.

To assess each component and determine the most suitable Amazon Bedrock model for a given task, Planview established the following prioritized evaluation criteria:

Quality of response – Assuring accuracy, relevance, and helpfulness of system responses
Time of response – Minimizing latency between user queries and system responses
Scale – Making sure the system can scale to thousands of concurrent users
Cost of response – Optimizing operational costs, including AWS services and generative AI models, to maintain economic viability

Based on these criteria and the current use case, Planview selected Anthropic’s Claude 3 Sonnet on Amazon Bedrock for the router and responder components.

Results and impact

Over the past year, Planview Copilot’s performance has significantly improved through the implementation of a multi-agent architecture, development of a robust evaluation framework, and adoption of the latest FMs available through Amazon Bedrock. Planview saw the following results between the first generation of Planview Copilot developed mid-2023 and the latest version:

Accuracy – Human-evaluated accuracy has improved from 50% answer acceptance to now exceeding 95%
Response time – Average response times have been reduced from over 1 minute to 20 seconds
Load testing – The AI assistant has successfully passed load tests, where 1,000 questions were submitted simultaneous with no noticeable impact on response time or quality
Cost-efficiency – The cost per customer interaction has been slashed to one tenth of the initial expense
Time-to-market – New agent development and deployment time has been reduced from months to weeks

Conclusion

In this post, we explored how Planview was able to develop a generative AI assistant to address complex work management process by adopting the following strategies:

Modular development – Planview built a multi-agent architecture with a centralized orchestrator. The solution enables efficient task handling and system scalability, while allowing different product teams to rapidly develop and deploy new AI skills through specialized agents.
Evaluation framework – Planview implemented a robust evaluation process at multiple levels, which was crucial for maintaining and improving performance.
Amazon Bedrock integration – Planview used Amazon Bedrock to innovate faster with broad model choice and access to various FMs, allowing for flexible model selection based on specific task requirements.

Planview is migrating to Amazon Bedrock Agents, which enables the integration of intelligent autonomous agents within their application ecosystem. Amazon Bedrock Agents automate processes by orchestrating interactions between foundation models, data sources, applications, and user conversations.

As next steps, you can explore Planview’s AI assistant feature built on Amazon Bedrock and stay updated with new Amazon Bedrock features and releases to advance your AI journey on AWS.

About Authors

Sunil Ramachandra is a Senior Solutions Architect enabling hyper-growth Independent Software Vendors (ISVs) to innovate and accelerate on AWS. He partners with customers to build highly scalable and resilient cloud architectures. When not collaborating with customers, Sunil enjoys spending time with family, running, meditating, and watching movies on Prime Video.

Benedict Augustine is a thought leader in Generative AI and Machine Learning, serving as a Senior Specialist at AWS. He advises customer CxOs on AI strategy, to build long-term visions while delivering immediate ROI.As VP of Machine Learning, Benedict spent the last decade building seven AI-first SaaS products, now used by Fortune 100 companies, driving significant business impact. His work has earned him 5 patents.

Lee Rehwinkel is a Principal Data Scientist at Planview with 20 years of experience in incorporating AI & ML into Enterprise software. He holds advanced degrees from both Carnegie Mellon University and Columbia University. Lee spearheads Planview’s R&D efforts on AI capabilities within Planview Copilot. Outside of work, he enjoys rowing on Austin’s Lady Bird Lake.

Categorized in: