A multi-armed bandit is a dynamic experiment that tries to maximize success across a set of options with unknown probabilities of success. As it gathers information, it makes increasingly informed choices.
For a web experiment, this means dynamic optimization: present the most successful variations most often. This contrasts with an A/B test, which might show users a poorly performing variation 50% of the time for the lifetime of the experiment.
The beta distribution is a distribution over probabilities, and it requires only two parameters. We can model each option's probability of success by counting its trials with success and its trials without success.
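As a minimal sketch of this counting approach, the snippet below models one variation's success probability with a Beta posterior. The counts (20 successes, 80 failures) are hypothetical, and the `+ 1` on each parameter assumes a uniform Beta(1, 1) prior:

```python
from scipy.stats import beta

# Hypothetical data for one variation: 20 successes, 80 failures.
successes, failures = 20, 80

# Beta(successes + 1, failures + 1) is the posterior over the success
# probability, assuming a uniform Beta(1, 1) prior.
posterior = beta(successes + 1, failures + 1)

# The posterior mean is (successes + 1) / (successes + failures + 2).
print(posterior.mean())  # ≈ 0.206
```

With only two counts per option, updating the model after each trial is just incrementing one integer.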
Using confidence bounds, we can produce reasonable optimistic and pessimistic estimates for the probability of each option. The best candidate is the option with the highest upper bound. However, if its lower bound overlaps with the upper bounds of other options, we don't yet have enough confidence that it's truly the best option and need more data to reach that confidence level.
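This overlap check can be sketched with 95% credible intervals from each option's Beta posterior. The trial counts below are hypothetical, and the uniform-prior `+ 1` convention is an assumption carried over from the counting model:

```python
from scipy.stats import beta

# Hypothetical trial counts per variation: (successes, failures).
options = {"A": (60, 40), "B": (55, 45), "C": (30, 70)}

bounds = {}
for name, (s, f) in options.items():
    # 95% interval of the Beta posterior (uniform prior assumed).
    low, high = beta(s + 1, f + 1).interval(0.95)
    bounds[name] = (low, high)

# The leading candidate is the option with the highest upper bound.
best = max(bounds, key=lambda name: bounds[name][1])

# If any other option's upper bound exceeds the leader's lower bound,
# the intervals overlap: we need more data before declaring a winner.
overlaps = any(hi > bounds[best][0]
               for name, (lo, hi) in bounds.items() if name != best)
print(best, overlaps)
```

Here "A" leads, but "B"'s interval still reaches above "A"'s lower bound, so the experiment should keep collecting data rather than stop.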
With Thompson sampling we can choose which option to present to users in a way that balances exploitation (choosing the option we currently believe is best) and exploration (learning more about how good the other options are).
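A minimal sketch of one Thompson sampling round follows: draw one sample from each option's Beta posterior, present the option with the largest sample, then update its counts. The counts, the 12% "true" conversion rate, and the helper name `choose_option` are all hypothetical:

```python
import random
from scipy.stats import beta

# Hypothetical per-option counts: (successes, failures).
counts = {"A": (10, 90), "B": (12, 88), "C": (3, 97)}

def choose_option(counts):
    # Sample once from each option's Beta posterior (uniform prior
    # assumed) and pick the option whose sample is largest. Options
    # believed to be better win more often (exploitation), but
    # uncertain options still win sometimes (exploration).
    samples = {name: beta(s + 1, f + 1).rvs()
               for name, (s, f) in counts.items()}
    return max(samples, key=samples.get)

# One round: present the chosen option and record the outcome.
chosen = choose_option(counts)
s, f = counts[chosen]
success = random.random() < 0.12  # hypothetical true conversion rate
counts[chosen] = (s + 1, f) if success else (s, f + 1)
print(chosen, counts[chosen])
```

Because the selection is a posterior draw rather than a fixed 50/50 split, traffic shifts toward the better options automatically as evidence accumulates.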