## Multi-armed Bandit

This is a particular case of a multi-armed bandit problem. I say a particular case because generally we don't know any of the probabilities of heads (in this case we know one of the coins has probability 0.5).

The question you raise is known as the exploration vs. exploitation dilemma: do you explore the other options, or do you stick with what you think is the best? There is an immediate optimal solution assuming you knew all the probabilities: just choose the coin with the highest probability of winning. The trouble, as you have alluded to, is that we are uncertain about what the true probabilities are.

There is lots of literature on the subject, and there are many deterministic algorithms, but since you tagged this bayesian, I'd like to tell you about my personal favorite solution: the Bayesian Bandit!

## The Bayesian Bandit Solution

The Bayesian approach to this problem is very natural. We are interested in answering "What is the probability that coin X is the better of the two?".

A priori, assuming we have observed no coin flips yet, we have no idea what the probability of coin B's heads might be; denote this unknown $p_B$. So we should assign a uniform prior distribution to this unknown probability. Alternatively, our prior (and posterior) for coin A is trivially concentrated entirely at 1/2.

As you have stated, we observe 2 tails and 1 heads from coin B, so we need to update our posterior distribution. Assuming a uniform prior, and that the flips are Bernoulli coin-flips, our posterior is a $Beta(1 + 1, 1 + 2)$. Comparing the posterior distributions of A and B now:
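To make the update concrete, here is a quick sketch in pure Python (no external libraries) that computes the posterior mean of $p_B$ and the posterior probability that coin B beats coin A's known 0.5, by integrating the Beta(2, 3) density:

```python
# Posterior for coin B after a uniform Beta(1, 1) prior and
# observing 1 heads and 2 tails: Beta(1 + 1, 1 + 2) = Beta(2, 3).
heads, tails = 1, 2
a, b = 1 + heads, 1 + tails

# Posterior mean of p_B: a / (a + b)
mean_pB = a / (a + b)  # 0.4

# P(p_B > 0.5): integrate the Beta(2, 3) density 12 * p * (1 - p)**2
# over (0.5, 1) with a simple midpoint rule
def beta23_pdf(p):
    return 12 * p * (1 - p) ** 2

n = 100_000
step = 0.5 / n
prob_B_better = sum(beta23_pdf(0.5 + (i + 0.5) * step) * step for i in range(n))

print(mean_pB)                   # 0.4
print(round(prob_B_better, 4))   # 0.3125
```

So even though B's posterior mean (0.4) is below A's known 0.5, there is still a roughly 31% chance that B is the better coin.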

### Finding an approximately optimal strategy

Now that we have the posteriors, what to do? We are interested in answering "What is the probability coin B is the better of the two?" (Remember, from our Bayesian perspective, although there is a definite answer to which one is better, we can only speak in probabilities):

$$w_B = P(p_B > 0.5)$$

The approximately optimal solution is to choose B with probability $w_B$ and A with probability $1 - w_B$. This scheme maximizes our expected gains. $w_B$ can be computed numerically, as we know the posterior distribution, but an interesting way is the following:

```
1. Sample P_B from the posterior of coin B
2. If P_B > 0.5, choose coin B, else choose coin A.
```
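A minimal sketch of this sampling rule, using only the Python standard library (`random.betavariate` draws from a Beta distribution):

```python
import random

def choose_coin(heads_B, tails_B):
    """One step of the sampling rule: draw p_B from coin B's
    Beta(1 + heads, 1 + tails) posterior and compare it to
    coin A's known probability of 0.5."""
    p_B = random.betavariate(1 + heads_B, 1 + tails_B)
    return "B" if p_B > 0.5 else "A"

# With 1 heads and 2 tails observed, this picks B with probability
# w_B = P(p_B > 0.5) = 0.3125 under the Beta(2, 3) posterior
print(choose_coin(1, 2))
```

Note that this single draw automatically picks B with exactly probability $w_B$, with no need to compute $w_B$ explicitly.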

This scheme is also self-updating. When we observe the outcome of choosing coin B, we update our posterior with this new information, and choose again. This way, if coin B is truly bad we will choose it less, and if coin B is in fact truly good, we will choose it more often. Of course, we are Bayesians, hence we can never be absolutely certain coin B is better. Choosing probabilistically like this is the most natural **solution** to the exploration-exploitation dilemma.
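The full self-updating loop can be sketched as below. The true bias `true_p_B` is of course unknown in practice; it is an assumption here purely so we can simulate flips and watch the scheme adapt:

```python
import random

def run_bandit(true_p_B, n_rounds=1000, seed=0):
    """Self-updating Bayesian bandit: coin A pays off with known
    probability 0.5; coin B's bias is unknown and tracked with a
    Beta posterior that is updated after every observed flip."""
    rng = random.Random(seed)
    heads_B, tails_B = 1, 2   # the counts already observed in the question
    picks_B = 0
    for _ in range(n_rounds):
        # Sampling step: draw from B's posterior, compare to A's known 0.5
        p_B_sample = rng.betavariate(1 + heads_B, 1 + tails_B)
        if p_B_sample > 0.5:
            picks_B += 1
            # Observe B's flip and update the posterior counts
            if rng.random() < true_p_B:
                heads_B += 1
            else:
                tails_B += 1
        # Choosing A yields no new information about B
    return picks_B / n_rounds

# If coin B is actually good, it ends up chosen most of the time;
# if it is actually bad, the posterior sinks and we mostly stick with A.
print(run_bandit(true_p_B=0.7))
print(run_bandit(true_p_B=0.3))
```

A nice property of this loop is that the probability of trying B never hits zero, so a genuinely good coin B can always recover from an unlucky early streak.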

This is a particular instance of Thompson Sampling. More information, and cool applications to online advertising, can be found in Google's research paper and Yahoo's research paper. I love this stuff!
