AI Learning Hub Roadmap Resources Blog Projects Paths

Chain-of-Thought Prompting: Techniques for Better LLM Reasoning

· 6 min read · AI Learning Hub

Learning Objectives

  • Understand why chain-of-thought improves LLM accuracy
  • Apply zero-shot and few-shot CoT techniques
  • Use self-consistency to improve reliability
  • Implement Tree-of-Thought for complex problems
  • Know when to use CoT and when it's overkill

What Is Chain-of-Thought?

Chain-of-thought (CoT) prompting guides an LLM to show its reasoning step by step before reaching a conclusion. This significantly improves accuracy on math, logic, and multi-step reasoning tasks.

Without CoT:

Q: If a store sells 3 shirts for $45, how much do 7 shirts cost? A: $95 ❌

With CoT:

Q: If a store sells 3 shirts for $45, how much do 7 shirts cost? Let's think step by step. A: First, find the price per shirt: $45 ÷ 3 = $15. Then multiply by 7: $15 × 7 = $105. The answer is $105. ✅


Technique 1: Zero-Shot CoT

Simply add "Let's think step by step." to any prompt. Surprisingly effective — no examples needed.

from openai import OpenAI

client = OpenAI()

def zero_shot_cot(question: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": "You are a careful reasoning assistant."},
            {"role": "user",   "content": f"{question}\n\nLet's think step by step."},
        ],
        temperature=0,
    )
    return response.choices[0].message.content


# Math
print(zero_shot_cot("A train travels 60 mph. It departs at 2 PM and arrives at 5:30 PM. How far did it travel?"))

# Logic
print(zero_shot_cot("All programmers drink coffee. Some coffee drinkers are introverts. Are all programmers introverts?"))

# Code
print(zero_shot_cot("A Python list contains [3, 1, 4, 1, 5, 9, 2, 6]. What is the output of sorted(set(lst))?"))

Other effective triggers:

  • "Let's work through this carefully."
  • "First, let me identify what we know..."
  • "I'll break this down into steps."
  • "Step 1:"

Technique 2: Few-Shot CoT

Provide worked examples that demonstrate the reasoning pattern:

FEW_SHOT_COT_EXAMPLES = """
Q: Tom has 3 times as many marbles as Jerry. Together they have 48 marbles. How many does Tom have?
A: Let Jerry have x marbles. Then Tom has 3x. Together: x + 3x = 4x = 48. So x = 12. Tom has 3 × 12 = 36 marbles.

Q: A recipe needs 2.5 cups of flour for 12 cookies. How much flour for 30 cookies?
A: Flour per cookie = 2.5 ÷ 12 = 0.2083 cups. For 30 cookies: 30 × 0.2083 = 6.25 cups of flour.

Q: {question}
A:"""


def few_shot_cot(question: str) -> str:
    prompt = FEW_SHOT_COT_EXAMPLES.format(question=question)
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return response.choices[0].message.content


print(few_shot_cot("A car dealership sold 40% of its inventory in March and 25% of what remained in April. If it started with 200 cars, how many are left?"))

Few-shot CoT works best when:

  • The problem type is consistent (all math, all logic, etc.)
  • Examples are high quality and representative
  • The reasoning pattern generalizes to new problems

Technique 3: Self-Consistency

Generate multiple reasoning paths and take the majority answer. Dramatically improves accuracy on hard problems.

from collections import Counter
import re

def self_consistency(question: str, n_samples: int = 5) -> str:
    answers = []

    for i in range(n_samples):
        response = client.chat.completions.create(
            model="gpt-4o-mini",
            messages=[
                {"role": "system", "content": "Solve step by step. End with 'Final answer: [number/answer]'"},
                {"role": "user",   "content": question},
            ],
            temperature=0.7,  # non-zero temperature for diverse paths
        )
        text = response.choices[0].message.content

        # Extract final answer
        match = re.search(r'[Ff]inal answer[:\s]+(.+?)(?:\n|$)', text)
        if match:
            answers.append(match.group(1).strip())
        else:
            # Fall back to last number mentioned
            numbers = re.findall(r'\b\d+(?:\.\d+)?\b', text)
            if numbers:
                answers.append(numbers[-1])

    # Return most common answer
    counter = Counter(answers)
    majority_answer, count = counter.most_common(1)[0]

    print(f"Answers: {dict(counter)}")
    print(f"Majority ({count}/{n_samples}): {majority_answer}")
    return majority_answer


# Works especially well for math problems where LLMs make arithmetic errors
result = self_consistency(
    "A box has 15 red and 10 blue balls. You pick 3 balls without replacement. "
    "What's the probability all 3 are red?",
    n_samples=5,
)

When to use: High-stakes reasoning where accuracy matters more than cost. 5 samples is usually enough.


Technique 4: Step-Back Prompting

For complex questions, first ask a more general question to retrieve background knowledge:

def step_back_prompting(specific_question: str) -> str:
    # Step 1: Generate a more general "step-back" question
    step_back_response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": "Generate a more general question that provides background knowledge helpful for answering the specific question."},
            {"role": "user",   "content": f"Specific question: {specific_question}\n\nMore general question:"},
        ],
        temperature=0,
    )
    general_q = step_back_response.choices[0].message.content

    # Step 2: Answer the general question
    general_response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": general_q}],
        temperature=0,
    )
    background = general_response.choices[0].message.content

    # Step 3: Answer original with background context
    final_response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": f"Use this background knowledge:\n\n{background}"},
            {"role": "user",   "content": specific_question},
        ],
        temperature=0,
    )
    return final_response.choices[0].message.content


print(step_back_prompting("Why does PyTorch use dynamic computation graphs instead of static ones?"))

Technique 5: Tree of Thoughts (ToT)

Explore multiple reasoning paths simultaneously and prune dead ends. Best for creative/exploratory problems.

def tree_of_thoughts(problem: str, n_thoughts: int = 3, depth: int = 2) -> str:
    """Simplified Tree of Thoughts implementation."""

    def generate_thoughts(context: str, n: int) -> list[str]:
        response = client.chat.completions.create(
            model="gpt-4o-mini",
            messages=[
                {"role": "system", "content": f"Generate {n} different approaches or next steps. Number them 1-{n}."},
                {"role": "user",   "content": context},
            ],
            temperature=0.8,
        )
        text = response.choices[0].message.content
        thoughts = re.split(r'\n\d+\.', text)
        return [t.strip() for t in thoughts if t.strip()][:n]

    def evaluate_thought(thought: str, problem: str) -> float:
        response = client.chat.completions.create(
            model="gpt-4o-mini",
            messages=[
                {"role": "system", "content": "Rate the promise of this approach for solving the problem (1-10). Reply with just a number."},
                {"role": "user",   "content": f"Problem: {problem}\n\nApproach: {thought}"},
            ],
            temperature=0,
        )
        try:
            return float(re.search(r'\d+', response.choices[0].message.content).group())
        except:
            return 5.0

    # Generate and evaluate initial thoughts
    initial_thoughts = generate_thoughts(f"Problem: {problem}\n\nList {n_thoughts} approaches:", n_thoughts)
    scored_thoughts = [(t, evaluate_thought(t, problem)) for t in initial_thoughts]
    best_thought = max(scored_thoughts, key=lambda x: x[1])

    # Expand best thought
    expanded = generate_thoughts(
        f"Problem: {problem}\n\nBest approach so far: {best_thought[0]}\n\nExpand this approach:",
        n=2,
    )

    # Final synthesis
    context = f"Problem: {problem}\n\nBest approach: {best_thought[0]}\n\nExpanded ideas: {'; '.join(expanded)}"
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": "Synthesize the best ideas into a complete solution."},
            {"role": "user",   "content": context},
        ],
    )
    return response.choices[0].message.content

When to Use Which Technique

Technique Best For Cost Complexity
Zero-shot CoT Quick improvement on any reasoning task Low None
Few-shot CoT Consistent problem types (math, code) Low Medium (writing examples)
Self-consistency High-accuracy math/logic Medium (5× API calls) Low
Step-back Complex multi-concept questions Low Low
Tree of Thoughts Creative/exploratory problems High (many calls) High

Measuring CoT Effectiveness

def evaluate_cot(test_cases: list[tuple], n_runs: int = 3) -> dict:
    """Compare plain prompting vs CoT."""
    plain_correct = 0
    cot_correct   = 0

    for question, expected in test_cases:
        # Plain
        for _ in range(n_runs):
            r = client.chat.completions.create(
                model="gpt-4o-mini",
                messages=[{"role": "user", "content": question}],
                temperature=0,
            )
            if expected.lower() in r.choices[0].message.content.lower():
                plain_correct += 1

        # CoT
        for _ in range(n_runs):
            r = client.chat.completions.create(
                model="gpt-4o-mini",
                messages=[{"role": "user", "content": f"{question}\nLet's think step by step."}],
                temperature=0,
            )
            if expected.lower() in r.choices[0].message.content.lower():
                cot_correct += 1

    total = len(test_cases) * n_runs
    return {
        "plain_accuracy": plain_correct / total,
        "cot_accuracy":   cot_correct / total,
        "improvement":    cot_correct / total - plain_correct / total,
    }

Troubleshooting

CoT gives wrong final answer despite correct reasoning → Add "Your final answer must be X format." Add explicit verification step: "Now double-check your answer."

Model skips steps → Use few-shot examples that demonstrate granular step-by-step reasoning. Add "Show every step, do not skip."

CoT is too verbose → Add length guidance: "Reason in at most 150 words, then give your answer."


FAQ

Does CoT work with all models? Most reliably with GPT-4, Claude, and larger models. Smaller models (< 7B) benefit less. The reasoning capability must exist in the model for CoT to help — it can't compensate for missing knowledge.

Is CoT always worth the extra tokens? No. For simple factual questions or classification, CoT adds cost without benefit. Reserve it for multi-step reasoning, math, logic, and code analysis.


What to Learn Next

← Back to all articles