## The Idea in Brief

The multi-armed bandit (MAB) problem models decision-making under uncertainty. Imagine a row of slot machines, each with an unknown chance of payout. A gambler wants to win as much as possible by pulling levers. Should they keep playing the machine that seems best so far (exploitation), or try others to gather more information (exploration)? This simple set-up captures a core difficulty in statistics, economics, and machine learning: balancing learning about the options with maximising reward from them.

---

## Key Concepts

### 1. Exploration vs Exploitation

- **Exploration**: testing new actions to learn their reward probabilities.
- **Exploitation**: choosing the option currently believed to yield the highest return.
- The tension between these two strategies is central to the problem.

### 2. Problem Formulation

- Each machine (or “arm”) has an unknown probability distribution of rewards.
- At each step, the player chooses one arm and observes only that arm's outcome.
- Objective: maximise cumulative reward over many trials.

### 3. Solution Approaches

- **ε-greedy**: with probability ε, pick an arm at random (explore); otherwise pull the arm with the best observed average reward (exploit).
- **Upper Confidence Bound (UCB)**: pull the arm with the highest optimistic estimate, i.e. its observed average plus a bonus that is larger for rarely tried arms, so uncertainty itself drives exploration.
- **Thompson Sampling**: a Bayesian approach that maintains a posterior distribution over each arm's reward rate, draws one sample per arm, and pulls the arm whose sample is highest.

A short runnable sketch of all three strategies appears after the Implications section below.

### 4. Applications

- **Online advertising**: deciding which ad to show a user.
- **Clinical trials**: allocating patients to treatments while learning their effectiveness.
- **Recommendation systems**: personalising content while testing alternatives.
- **Operations research**: adaptive resource-allocation problems.

---

## Implications

The multi-armed bandit problem distils the essence of learning under uncertainty. Its algorithms underpin modern systems that adapt in real time, from ad targeting to drug testing. Despite its simplicity, it remains one of the most influential problems in statistics and artificial intelligence.
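
To make the three solution approaches concrete, here is a minimal sketch on a simulated Bernoulli bandit, where each arm pays 1 with a fixed unknown probability and 0 otherwise. The class and function names (`BernoulliBandit`, `epsilon_greedy`, `ucb1`, `thompson_sampling`), the arm probabilities, and all parameter values are illustrative assumptions rather than a reference implementation.

```python
import math
import random


class BernoulliBandit:
    """K arms; pulling arm i returns 1 with probability true_probs[i], else 0."""

    def __init__(self, true_probs):
        self.true_probs = true_probs

    def pull(self, arm):
        return 1 if random.random() < self.true_probs[arm] else 0


def epsilon_greedy(bandit, n_rounds, epsilon=0.1):
    k = len(bandit.true_probs)
    counts = [0] * k      # pulls per arm
    values = [0.0] * k    # running mean reward per arm
    total = 0
    for _ in range(n_rounds):
        if random.random() < epsilon:          # explore: random arm
            arm = random.randrange(k)
        else:                                  # exploit: best observed average
            arm = max(range(k), key=lambda i: values[i])
        reward = bandit.pull(arm)
        counts[arm] += 1
        values[arm] += (reward - values[arm]) / counts[arm]  # incremental mean
        total += reward
    return total


def ucb1(bandit, n_rounds):
    k = len(bandit.true_probs)
    counts = [0] * k
    values = [0.0] * k
    total = 0
    for t in range(1, n_rounds + 1):
        if t <= k:                             # play each arm once first
            arm = t - 1
        else:                                  # observed mean + confidence bonus
            arm = max(range(k),
                      key=lambda i: values[i] + math.sqrt(2 * math.log(t) / counts[i]))
        reward = bandit.pull(arm)
        counts[arm] += 1
        values[arm] += (reward - values[arm]) / counts[arm]
        total += reward
    return total


def thompson_sampling(bandit, n_rounds):
    k = len(bandit.true_probs)
    successes = [1] * k   # Beta(1, 1) uniform prior on each arm's reward rate
    failures = [1] * k
    total = 0
    for _ in range(n_rounds):
        samples = [random.betavariate(successes[i], failures[i]) for i in range(k)]
        arm = max(range(k), key=lambda i: samples[i])  # act on the sampled belief
        reward = bandit.pull(arm)
        if reward:
            successes[arm] += 1
        else:
            failures[arm] += 1
        total += reward
    return total


if __name__ == "__main__":
    random.seed(0)
    bandit = BernoulliBandit([0.2, 0.5, 0.75])  # arm 2 is best, unknown to the player
    rounds = 10_000
    print("epsilon-greedy:", epsilon_greedy(bandit, rounds))
    print("UCB1:          ", ucb1(bandit, rounds))
    print("Thompson:      ", thompson_sampling(bandit, rounds))
```

Running the script compares cumulative reward for the three strategies on the same arm probabilities; with these assumed settings, all three concentrate most of their pulls on the best arm as the round count grows, differing mainly in how quickly they stop exploring the weaker arms.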