
From zero-shot prompting to RAG - Part 6: Self-consistency prompting


In part 6 of our series, we once again look at an advanced prompting technique: self-consistency prompting. It is based on a simple idea: why generate just one answer when the model can try out different ways of thinking? Classic chain-of-thought (CoT) prompting asks the model to reveal its reasoning process. Self-consistency goes one step further and collects several chains of thought at once.

Instead of sending a prompt and working with a single answer, you request several samples. Each sample produces its own chain of thought and a candidate answer at the end. You then count which answer occurs most frequently - and return exactly that as the final output.

This way, you benefit twice: first, random outliers or reasoning errors are suppressed because they carry little statistical weight. Second, you use the "collective intelligence" of the model, which explores different paths and thus arrives at the most robust solution. A practical example: on math problems, the model sometimes returns "42", sometimes "43" and sometimes "44", depending on which calculation path it takes. Instead of relying on the first answer, you generate 30 samples and find that "43" appears in 18 cases, "42" in only 7 and "44" in 5. Voilà: your final answer is "43".
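
To make the procedure concrete, here is a minimal sketch of the sample-and-vote loop in Python. It assumes the OpenAI Python SDK as the backend; the model name, the temperature, the number of samples and the "Answer: <number>" output format are illustrative choices, not prescribed by the technique itself:

```python
import re
from collections import Counter

from openai import OpenAI  # assumes the OpenAI Python SDK (>= 1.0) is installed

client = OpenAI()  # reads OPENAI_API_KEY from the environment

PROMPT = (
    "Lisa has 12 apples. She gives 3 to her brother and eats 2 herself. "
    "How many apples does Lisa have left? "
    "Think step by step, then end with 'Answer: <number>'."
)

def self_consistency(prompt: str, n_samples: int = 30) -> str:
    # Sample several independent chains of thought in a single request.
    # A temperature > 0 is essential: with greedy decoding every sample
    # would follow the same reasoning path and the vote would be pointless.
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model name
        messages=[{"role": "user", "content": prompt}],
        temperature=0.8,
        n=n_samples,  # n independent completions for the same prompt
    )

    # Pull the candidate answer out of each chain of thought.
    answers = []
    for choice in response.choices:
        match = re.search(r"Answer:\s*(-?\d+)", choice.message.content)
        if match:
            answers.append(match.group(1))

    # Majority vote: the most frequent candidate wins.
    return Counter(answers).most_common(1)[0][0]

print(self_consistency(PROMPT))  # expected consensus: 7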

Another example with a single prompt and three sampled attempts:

Attempt 1: Right ✅
User: Lisa has 12 apples. She gives 3 to her brother and eats 2 herself. How many apples does Lisa have left?
ChatGPT: Lisa still has 7 apples, because 12 - 3 - 2 = 7.

Attempt 2: Wrong ❌
User: Lisa has 12 apples. She gives 3 to her brother and eats 2 herself. How many apples does Lisa have left?
ChatGPT: Lisa has 9 apples left, as she gives away 3.

Attempt 3: Right ✅
User: Lisa has 12 apples. She gives 3 to her brother and eats 2 herself. How many apples does Lisa have left?
ChatGPT: Lisa has 7 apples left, as she eats 2 herself and gives 3 to her brother. So 7 of the 12 apples remain.

We can see that in this example, 7 seems to be the correct answer, as it occurs most frequently.
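
The vote over these three attempts is easy to automate. Here is a minimal sketch, assuming the three outputs have already been collected as strings and using a naive heuristic (the first number in each output) to extract the candidate answers:

```python
import re
from collections import Counter

# The three model outputs from the dialogue above.
attempts = [
    "Lisa still has 7 apples, because 12 - 3 - 2 = 7.",
    "Lisa has 9 apples left, as she gives away 3.",
    "Lisa has 7 apples left, as she eats 2 herself and gives 3 to her brother.",
]

# Naive extraction: take the first number in each output as the candidate.
candidates = [re.search(r"\d+", text).group() for text in attempts]

votes = Counter(candidates)       # Counter({'7': 2, '9': 1})
answer, count = votes.most_common(1)[0]
print(answer, count)              # 7 2 -> the consensus answer is 7
```

With only three samples the vote is fragile; in practice you would draw considerably more, as in the 30-sample example above.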

In practice, self-consistency prompting pays off particularly for tasks where reliability and robustness matter - for example in complex arithmetic problems, legal text analyses or medical classifications. As soon as a single outlier could have serious consequences (e.g. when a model assists with diagnoses), or when several valid solution paths exist, multiple sampling provides clear added value. The situation is different if you just want to retrieve simple facts, such as "Who was German Chancellor in 1999?" or "What is the largest lake in the world?". Here, the overhead of repeated sampling and vote aggregation is simply excessive: the query almost always succeeds in a single pass with the other techniques, and the additional time and cost are not worth it. In real-time applications with low latency requirements - such as chatbots that have to respond within milliseconds - the multiple model calls can also introduce noticeable delays. In such cases, it is better to fall back on leaner decoding strategies and use self-consistency selectively, where the gain in precision justifies the performance cost.
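
One mitigation for the latency problem: the samples are independent, so the model calls can run concurrently instead of sequentially. A sketch, again assuming the OpenAI Python SDK (here its async client); with providers that support it, a single request with n samples, as in the earlier sketch, avoids the extra round trips entirely:

```python
import asyncio

from openai import AsyncOpenAI  # async variant of the OpenAI client

client = AsyncOpenAI()

async def sample_once(prompt: str) -> str:
    response = await client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model name
        messages=[{"role": "user", "content": prompt}],
        temperature=0.8,
    )
    return response.choices[0].message.content

async def sample_concurrently(prompt: str, n_samples: int = 10) -> list[str]:
    # All n_samples requests are in flight at the same time, so the
    # wall-clock latency is roughly one model call, not n_samples of them.
    return await asyncio.gather(*(sample_once(prompt) for _ in range(n_samples)))

# samples = asyncio.run(sample_concurrently("... your chain-of-thought prompt ..."))
```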

💡
Self-consistency generates multiple chain-of-thought outputs with a high sampling temperature. The final answer is determined by majority vote across the collected candidates. Errors from outliers are statistically averaged out. Complex tasks benefit from multiple reasoning paths and collective intelligence. Implementation only requires adjusting the decoding parameters; no new model is needed.

Self-consistency prompting is an elegant and effective method for improving the accuracy of language models without much additional effort. Whether for tricky math problems or complex classifications - with self-consistency you get the best out of your model. Try it out and be surprised how often the consensus is clearer than a single answer!


Sources:

Self-Consistency - Nextra
A Comprehensive Overview of Prompt Engineering
Self-Consistency Improves Chain of Thought Reasoning in Language Models