Improving factuality and reasoning in language models through multiagent debate

Written by Rohan Sethi, M1 @ Loyola Stritch School of Medicine

Abstract Link: Improving factuality and reasoning in language models through multiagent debate

tldr;

The application of AI is transforming many industries, including medicine. However, the practice of medicine owes patients a higher standard of trust and responsibility, and AI should not be exempt from that standard. Large Language Models (LLMs) are especially prone to hallucinations and fallacious answers that can be difficult to address manually at larger scales. This paper describes a simple yet powerful technique to improve the quality and factuality of LLM output: facilitating a debate between multiple LLM instances (whether or not they share the same base model/architecture) by first having each participating LLM answer a prompt and then iteratively having each LLM adjust its answer given access to the thoughts of its digital “colleagues”. This approach and a few baselines were evaluated on question-and-answer, mathematical, and chess challenges. See results below!

Methods

  • Baselines

    • Single Agent: a single LLM instance is prompted once and its response is used as the final answer (zero-shot)

    • Single Agent Reflection: a single LLM instance is prompted for an initial answer and then prompted again to refine that answer into a final one (similar to chain of thought)

    • Multi-agent Majority: multiple LLM instances are prompted once each and the most common answer is chosen as the final one (similar to ensemble ML techniques like random forests)

  • Debate Model/Experimental Approach

    1. Each LLM instance (ChatGPT, Bard, or both) is prompted for an initial answer

    2. The answers were then concatenated into a context blurb that was provided to every agent, which was asked to adjust its initial answer in light of the other models' responses

    3. The debate ended when the model instances arrived at the same answer (convergence) or was otherwise forcibly cut off after a set number of rounds, with the majority answer taken as the final one (see the sketch after the Methods list)

  • Reasoning Tasks: basically, evaluating the logic behind arriving at the right answer

    • Arithmetic: addition, subtraction, multiplication, division

    • Grade School Math: word problems from the GSM8K dataset (not described in detail, but considered harder than the arithmetic task)

    • Chess: given the first n moves of a game, the model must predict the best next move; moves were extracted from grandmaster games

  • Factuality Tasks: basically, evaluating the truthfulness of the model's outputs

    • Biographies: models were asked to describe the lives of well-known computer scientists in bullet points (the evaluation method is not entirely clear, but perhaps perplexity or semantic evaluators like ROUGE/BERTScore were used)

    • MMLU: simple factual questions

    • Chess Move Validity: given the rules of the game, checking whether the model chooses a valid move from a small set of candidate moves
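
To make the debate procedure described above (under Debate Model/Experimental Approach) concrete, here is a minimal Python sketch of the loop as I understand it; this is not the authors' code. Each agent is assumed to be a callable that takes a prompt string and returns an answer string (wrapping whatever chat API is used, e.g. ChatGPT or Bard), and the prompt wording, round limit, and `majority_vote` helper are illustrative assumptions.

```python
from collections import Counter


def majority_vote(answers):
    """Pick the most common answer (the multi-agent majority baseline and the debate fallback)."""
    return Counter(answers).most_common(1)[0][0]


def multiagent_debate(question, agents, max_rounds=3):
    """Sketch of the debate loop: independent answers, rounds of revision, then majority vote."""
    # Step 1: every agent answers the question independently.
    answers = [agent(question) for agent in agents]

    for _ in range(max_rounds):
        # Step 3 (convergence): stop early once all agents agree.
        if len(set(answers)) == 1:
            return answers[0]

        # Step 2: show each agent the other agents' answers and ask it to reconsider.
        new_answers = []
        for i, agent in enumerate(agents):
            others = "\n\n".join(
                f"Agent {j + 1} said:\n{ans}"
                for j, ans in enumerate(answers)
                if j != i
            )
            prompt = (
                f"{question}\n\nHere are answers from the other agents:\n{others}\n\n"
                "Using these responses as additional information, give your updated answer."
            )
            new_answers.append(agent(prompt))
        answers = new_answers

    # Debate forcibly ended after max_rounds: fall back to the majority answer.
    return majority_vote(answers)
```

Note how the baselines fall out of the same pieces: with a single agent the loop converges immediately, reproducing the single-agent baseline, and calling `majority_vote` on the independent first-round answers gives the multi-agent majority baseline.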

Results

  • Reasoning

  • Factuality

In both reasoning and factuality, the multiagent debate approach outperforms the baselines it was compared against. The statistical significance of these results is not reported, but standard deviations are shown, so variability may have been taken into account.

  • Example Debate

Interestingly, there were some instances where none of the agents initially got the right answer, but through debate they were able to reach a consensus on the correct one, showing how this approach can be much more powerful than the baselines described above.

  • Effects of Debate Hyperparameters:

    • Number of agents

    • Rounds of debate

    • Prompts (prompting for short or long debates)

    • Debate summaries (either directly copying the answers from the other models or summarizing them before exposure; the summarization approach is not described, but perhaps another LLM is used; see the sketch below)

    • Multiple types of LLMs
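
The debate-summary variant in particular is easy to picture in code. The sketch below is my own guess at one plausible setup, not the paper's implementation: an extra LLM call (the assumed `summarizer` callable, with the same prompt-in/text-out shape as the agents in the earlier sketch) compresses the other agents' answers before they are fed into the next round, which also helps with the context-length concern noted in the discussion.

```python
def build_debate_context(question, other_answers, summarizer=None):
    """Assemble the prompt for the next debate round, optionally summarizing first."""
    combined = "\n\n".join(
        f"Agent {i + 1} said:\n{ans}" for i, ans in enumerate(other_answers)
    )
    if summarizer is not None:
        # Debate-summary variant: compress the other agents' answers with another
        # LLM call so the next round's prompt stays short.
        combined = summarizer(
            f"Briefly summarize the key claims and final answers below:\n\n{combined}"
        )
    return (
        f"{question}\n\nResponses from the other agents (possibly summarized):\n{combined}\n\n"
        "Considering these responses, give your updated answer."
    )
```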

Discussion and Related Work

  • Multiagent debate increased accuracy and other performance metrics across the host of tasks described above

  • Multiagent debate helped initially incorrect models arrive at the correct answer, and in some instances brought all models to the right answer even when every one of them was initially wrong

  • Increasing each debate hyperparameter led to better performance, though the gains plateau after a certain point

  • The approach is more expensive in cost and time given the need for multiple agents and multiple rounds before arriving at an answer; the accumulated context can also be difficult for agents to process, since the reasoning LLMs produce can be unbounded in token length

  • Various related approaches are cited, including verification of LLM responses using other LLMs, reinforcement learning from human feedback (the approach used by OpenAI to align model outputs with appropriate responses), etc.