1 Introduction
This paper presents new analyses of Bayesian-flavored strategies for sequential resource allocation in an unknown, stochastic environment modeled as a multi-armed bandit. A stochastic multi-armed bandit model is a set of probability distributions $\nu = (\nu_1, \dots, \nu_K)$, called arms, with which an agent interacts in a sequential way. At round $t$, the agent, who does not know the arms' distributions, chooses an arm $A_t$. The draw of this arm produces an independent sample $Y_t$ from the associated probability distribution $\nu_{A_t}$, often interpreted as a reward. Indeed, the arms can be viewed as those of different slot machines, also called one-armed bandits, generating rewards according to some underlying probability distribution.
In several applications that range from the motivating example of clinical trials [38] to the more modern motivation of online advertisement (e.g., [16]), the goal of the agent is to adjust its strategy $\mathcal{A} = (A_t)_{t \ge 1}$, also called a bandit algorithm, in order to maximize the rewards accumulated during its interaction with the bandit model. The adopted strategy has to be sequential, in the sense that the next arm to play is chosen based on past observations: letting $\mathcal{F}_t$ be the $\sigma$-field generated by the observations up to round $t$, $A_{t+1}$ is $\sigma(\mathcal{F}_t, U_{t+1})$-measurable, where $U_{t+1}$ is a uniform random variable independent from $\mathcal{F}_t$ (as algorithms may be randomized). More precisely, the goal is to design a sequential strategy maximizing the expectation of the sum of rewards up to some horizon $T$. If $\mu_1, \dots, \mu_K$ denote the means of the arms and $\mu^\star = \max_a \mu_a$, this is equivalent to minimizing the regret, defined as the expected difference between the reward accumulated by an oracle strategy always playing the best arm and the reward accumulated by the strategy $\mathcal{A}$:
(1)  $R_T(\mathcal{A}) = \mu^\star T - \mathbb{E}\!\left[\sum_{t=1}^{T} Y_t\right].$
The expectation is taken with respect to the randomness in the sequence of successive rewards from each arm $a$, denoted by $(Y_{a,s})_{s \ge 1}$, and the possible randomization of the algorithm, $(U_t)_{t \ge 1}$. We denote by $N_a(t)$ the number of draws from arm $a$ at the end of round $t$, so that $R_T(\mathcal{A}) = \sum_{a=1}^{K} (\mu^\star - \mu_a)\, \mathbb{E}[N_a(T)]$.
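As a quick numerical illustration of the regret decomposition over suboptimal arms, here is a minimal sketch (function and variable names are ours, not the paper's): the regret equals the sum over arms of the gap between the best mean and the arm's mean, weighted by the expected number of draws.

```python
def regret(mu, counts):
    """Regret decomposition: R_T = sum_a (mu_star - mu_a) * E[N_a(T)].

    `mu` lists the arm means, `counts[a]` the (expected) number of
    draws of arm a up to the horizon T.
    """
    mu_star = max(mu)
    return sum((mu_star - m) * n for m, n in zip(mu, counts))
```

For instance, with means (0.5, 0.3) and 10 expected draws of the suboptimal arm, the regret is 0.2 * 10 = 2, regardless of how often the optimal arm was played.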
This paper focuses on good strategies in parametric bandit models, in which the distribution of arm $a$ depends on some parameter $\theta_a$: we write $\nu_a = \nu_{\theta_a}$. As in every parametric model, two different views can be adopted. In the frequentist view, $\theta = (\theta_1, \dots, \theta_K)$ is an unknown parameter. In the Bayesian view, $\theta$ is a random variable, drawn from a prior distribution $\pi$. More precisely, we define $\mathbb{P}_\theta$ (resp. $\mathbb{E}_\theta$) as the probability (resp. expectation) under the probabilistic model in which, for all $a$, the sequence $(Y_{a,s})_{s \ge 1}$ is i.i.d. with distribution $\nu_{\theta_a}$, and $\mathbb{P}^\pi$ (resp. $\mathbb{E}^\pi$) as the probability (resp. expectation) under the probabilistic model in which, for all $a$, $(Y_{a,s})_{s \ge 1}$ is i.i.d. conditionally on $\theta$ with conditional distribution $\nu_{\theta_a}$, and $\theta \sim \pi$. The expectation in (1) can thus be taken under either of these two probabilistic models. In the first case, this leads to the notion of frequentist regret, which depends on $\theta$:
(2)  $R_\theta(T, \mathcal{A}) = \mu^\star T - \mathbb{E}_\theta\!\left[\sum_{t=1}^{T} Y_t\right] = \sum_{a=1}^{K} \left(\mu^\star - \mu_a\right) \mathbb{E}_\theta\!\left[N_a(T)\right].$
In the second case, this leads to the notion of Bayesian regret, sometimes called Bayes risk in the literature (see [27]), which depends on the prior distribution $\pi$:
(3)  $R^\pi(T, \mathcal{A}) = \int R_\theta(T, \mathcal{A})\, \mathrm{d}\pi(\theta).$
The first bandit strategy was introduced by Thompson in 1933 [38] in a Bayesian framework, and a large part of the early work on bandit models adopts the same perspective [10, 7, 19, 8]. Indeed, as Bayes risk minimization has an exact, yet often intractable, solution, finding ways to efficiently compute this solution has been an important line of research. Since 1985 and the seminal work of Lai and Robbins [28], there has also been a precise characterization of good bandit algorithms in a frequentist sense. They show that for any uniformly efficient policy (i.e., one whose regret is $o(T^\alpha)$ for all $\theta$ and all $\alpha > 0$), the number of draws of any suboptimal arm $a$ (one with $\mu_a < \mu^\star$) is asymptotically lower bounded as follows:
(4)  $\displaystyle \liminf_{T \to \infty} \frac{\mathbb{E}_\theta[N_a(T)]}{\log T} \;\ge\; \frac{1}{\mathrm{KL}\!\left(\nu_{\theta_a}, \nu_{\theta^\star}\right)},$
where $\mathrm{KL}(\nu, \nu')$ denotes the Kullback-Leibler divergence between the distributions $\nu$ and $\nu'$, and $\nu_{\theta^\star}$ is the distribution of an optimal arm. From (2), this yields a lower bound on the regret. This result holds for simple parametric bandit models, including the exponential family bandit models presented in Section 2 that will be our main focus in this paper. It paved the way for a new line of research, aimed at building asymptotically optimal strategies, that is, strategies matching the lower bound (4) for some classes of distributions. Most of the algorithms proposed since then belong to the family of index policies
, which compute at each round one index per arm, depending on the history of rewards observed from this arm only, and select the arm with the largest index. More precisely, they are UCB-type algorithms, building confidence intervals for the means of the arms and choosing as an index for each arm the associated Upper Confidence Bound (UCB). The design of the confidence intervals has been successively improved [27, 1, 6, 5, 4, 21, 14] so as to obtain simple index policies for which non-asymptotic upper bounds on the regret can be given. Among them, the kl-UCB algorithm [14] matches the lower bound (4) for exponential family bandit models. As they use confidence intervals on unknown parameters, all these index policies are based on frequentist tools. Nevertheless, it is interesting to note that the first index policy was introduced by Gittins in 1979 [19] to solve a Bayesian multi-armed bandit problem and is based on Bayesian tools, i.e., on exploiting the posterior distribution on the parameter of each arm. However, tools and objectives can be separated: one can compute the Bayes risk of an algorithm based on frequentist tools, or the (frequentist) regret of an algorithm based on Bayesian tools. In this paper, we focus on the latter and advocate the use of index policies inspired by Bayesian tools for minimizing regret, in particular the Bayes-UCB algorithm [24], which is based on quantiles of the posterior distributions of the means. Our main contribution is to prove that this algorithm is asymptotically optimal, i.e., that it matches the lower bound (4), for any exponential family bandit model and for a large class of prior distributions. Our analysis relies on two new ingredients: tight bounds on the tails of posterior distributions (Lemma 4), and a self-normalized deviation inequality featuring an exploration rate that decreases with the number of observations (Lemma 5). This last tool also allows us to prove the asymptotic optimality of two variants of kl-UCB, called kl-UCB$^+$ and kl-UCB-H$^+$, which display improved empirical performance. Interestingly, the alternative exploration rate used by these two algorithms is already suggested by asymptotic approximations of the exact Bayesian solution or of the Finite-Horizon Gittins indices.
The paper is structured as follows. Section 2 introduces the class of exponential family bandit models that we consider in the rest of the paper, and the associated frequentist and Bayesian tools. In Section 3, we present the Bayes-UCB algorithm and give a proof of its asymptotic optimality. We introduce kl-UCB$^+$ and kl-UCB-H$^+$ in Section 4, in which we prove their asymptotic optimality and also exhibit connections with existing Bayesian policies. In Section 5, we illustrate numerically the good performance of our three asymptotically optimal, Bayesian-flavored index policies in terms of regret. We also investigate their ability to attain an optimal rate in terms of Bayes risk. Some proofs are provided in the supplemental paper [23].
Notation
Recall that $N_a(t)$ is the number of draws from arm $a$ at the end of round $t$. Letting $\hat\mu_{a,s}$ be the empirical mean of the first $s$ rewards from arm $a$, the empirical mean of arm $a$ after $t$ rounds of the bandit algorithm, $\hat\mu_a(t)$, satisfies $\hat\mu_a(t) = \hat\mu_{a, N_a(t)}$ if $N_a(t) > 0$, and $\hat\mu_a(t) = 0$ otherwise.
2 (Bayesian) exponential family bandit models
In the rest of the paper, we consider the important class of exponential family bandit models, in which the arms belong to a one-parameter canonical exponential family.
2.1 Exponential family bandit model
A one-parameter canonical exponential family is a set of probability distributions, indexed by a real parameter $\theta$ called the natural parameter, that is defined by
$\mathcal{P} = \left\{ \nu_\theta,\ \theta \in \Theta :\ \nu_\theta \text{ has density } f_\theta(x) = \exp\!\left(\theta x - b(\theta)\right) \text{ with respect to } \xi \right\},$
where $\Theta$ is an open interval, $b$ a twice-differentiable and convex function (called the log-partition function), and $\xi$ a reference measure. Examples of such distributions include Bernoulli distributions, Gaussian distributions with known variance, Poisson distributions, and Gamma distributions with known shape parameter.
If $X \sim \nu_\theta$, it can be shown that $\mathbb{E}[X] = \dot b(\theta)$ and $\mathrm{Var}[X] = \ddot b(\theta)$, where $\dot b$ (resp. $\ddot b$) is the first (resp. second) derivative of $b$ with respect to the natural parameter $\theta$. Thus there is a one-to-one mapping between the natural parameter $\theta$ and the mean $\mu = \dot b(\theta)$, and distributions in an exponential family can alternatively be parametrized by their means. Letting $I = \dot b(\Theta)$, for $\mu \in I$ we denote by $\nu^\mu$ the distribution in $\mathcal{P}$ that has mean $\mu$: $\nu^\mu = \nu_{\dot b^{-1}(\mu)}$. The variance of the distribution $\nu^\mu$ is related to its mean in the following way:
(5)  $\mathrm{Var}_{X \sim \nu^\mu}[X] = \ddot b\!\left(\dot b^{-1}(\mu)\right) =: V(\mu).$
In the sequel, we fix an exponential family $\mathcal{P}$ and consider a bandit model $(\nu_{\theta_1}, \dots, \nu_{\theta_K})$, where each $\theta_a$ belongs to $\Theta$ and $\nu_{\theta_a}$ has mean $\mu_a$. When considering Bayesian bandit models, we restrict our attention to product prior distributions, such that each mean $\mu_a$ is drawn from a prior distribution that has density $\pi_a$ with respect to the Lebesgue measure. We let $\pi_{a,t}$ be the posterior distribution on $\mu_a$ after the first $t$ rounds of the bandit game. With a slight abuse of notation, we will identify $\pi_{a,t}$ with its density, for which a more precise expression is provided in Section 2.3.
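For concreteness, the Bernoulli family illustrates these mappings: its log-partition function is $b(\theta) = \log(1 + e^\theta)$, so the mean map $\dot b$ is the logistic function and the variance function is $V(\mu) = \mu(1-\mu)$. A minimal numerical sketch (function names are ours):

```python
import math

def b(theta):
    """Log-partition function of the Bernoulli family in natural form."""
    return math.log(1.0 + math.exp(theta))

def mean(theta):
    """mu = b'(theta): one-to-one map from natural parameter to mean
    (here, the logistic function)."""
    return 1.0 / (1.0 + math.exp(-theta))

def variance(theta):
    """b''(theta) = V(mu) = mu * (1 - mu): the variance function of (5)."""
    m = mean(theta)
    return m * (1.0 - m)
```

A finite-difference check confirms numerically that `mean` is indeed the derivative of `b`, and at $\theta = 0$ one recovers $\mu = 1/2$ and $V(\mu) = 1/4$.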
2.2 KullbackLeibler divergence and confidence intervals
For distributions that belong to a one-parameter exponential family, the large deviation rate function has a simple and explicit form, featuring the Kullback-Leibler (KL) divergence, and one can build tight confidence intervals on their means. The KL divergence between two distributions $\nu_\theta$ and $\nu_{\theta'}$ in an exponential family has a closed-form expression as a function of the natural parameters $\theta$ and $\theta'$, given by
(6)  $\mathrm{KL}\!\left(\nu_\theta, \nu_{\theta'}\right) = b(\theta') - b(\theta) - \dot b(\theta)\left(\theta' - \theta\right).$
We also introduce $d(\mu, \mu')$ as the KL divergence between the distributions of means $\mu$ and $\mu'$: $d(\mu, \mu') := \mathrm{KL}\!\left(\nu^\mu, \nu^{\mu'}\right).$
Applying the Cramér-Chernoff method (see, e.g., [9]) in an exponential family yields an explicit deviation inequality featuring this divergence function: if $\hat\mu_s$ is the empirical mean of $s$ samples from $\nu^\mu$ and $x > \mu$, one has $\mathbb{P}\left(\hat\mu_s \ge x\right) \le e^{-s\, d(x, \mu)}$. This inequality can be used to build a confidence interval for $\mu$ based on a fixed number of observations $s$. Inside a bandit algorithm, computing a confidence interval on the mean of an arm requires taking into account the random number of observations available at round $t$. Using a self-normalized deviation inequality (see [14] and references therein), one can show that, at any round of a bandit game, the kl-UCB index, defined as
(7)  $u_a(t) = \sup\left\{ q :\ N_a(t)\, d\!\left(\hat\mu_a(t), q\right) \le \log t + c \log\log t \right\},$
where $c$ is a real parameter, satisfies $u_a(t) \ge \mu_a$ with high probability and is thus an upper confidence bound on $\mu_a$. The exploration rate, which is here $\log t + c \log\log t$, controls the coverage probability of the interval.
Closed-form expressions for the divergence function $d$ in the most common examples of exponential families are available (see [14]). Using the fact that $q \mapsto d\!\left(\hat\mu_a(t), q\right)$ is increasing for $q > \hat\mu_a(t)$, an approximation of $u_a(t)$ can then be obtained using, for example, binary search.
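In the Bernoulli case, where $d$ is the binary relative entropy, the binary-search computation of the index can be sketched as follows (a simplified illustration with our own names and tolerances, not the reference implementation of [14]):

```python
import math

def kl_bern(p, q):
    """Bernoulli KL divergence d(p, q), clamped away from {0, 1}."""
    eps = 1e-12
    p = min(max(p, eps), 1 - eps)
    q = min(max(q, eps), 1 - eps)
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))

def kl_ucb_index(mu_hat, n_pulls, t, c=0.0, tol=1e-6):
    """Largest q >= mu_hat with n_pulls * d(mu_hat, q) <= log t + c log log t,
    found by binary search (d(mu_hat, .) is increasing above mu_hat)."""
    threshold = (math.log(t) + c * math.log(max(math.log(t), 1.0))) / n_pulls
    lo, hi = mu_hat, 1.0
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        if kl_bern(mu_hat, mid) <= threshold:
            lo = mid
        else:
            hi = mid
    return lo
```

As expected, the index shrinks toward the empirical mean as the arm accumulates observations, since the exploration rate is divided by the number of pulls.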
2.3 Posterior distributions in Bayesian exponential family bandits
It is well-known that the posterior distribution on the mean of a distribution that belongs to an exponential family depends on two sufficient statistics: the number of observations and the empirical mean of these observations. With $\pi_a$ the density of the prior distribution on the mean of arm $a$, the density of the posterior distribution on $\mu_a$ after $t$ rounds of the bandit game can be written
$\pi_{a,t}(\mu) \;\propto\; \pi_a(\mu)\, \exp\!\left( N_a(t)\left[\dot b^{-1}(\mu)\, \hat\mu_a(t) - b\!\left(\dot b^{-1}(\mu)\right)\right]\right).$
While our analysis holds for any choice of prior distribution, in practice one may want to exploit the existence of families of conjugate priors (e.g., Beta distributions for Bernoulli rewards, Gaussian distributions for Gaussian rewards, Gamma distributions for Poisson rewards). With a prior distribution chosen in such a family, the associated posterior distribution is well-known and its quantiles are easy to compute, which is of particular interest for the Bayes-UCB algorithm, described in the next section.
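For the three conjugate pairs just mentioned, the posterior update after one reward takes a simple closed form. A minimal sketch (function names are ours; the Gaussian case assumes a known reward variance):

```python
def beta_update(prior, reward):
    """Beta(a, b) prior + Bernoulli reward r -> Beta(a + r, b + 1 - r)."""
    a, b = prior
    return (a + reward, b + 1 - reward)

def gaussian_update(prior, reward, var=1.0):
    """N(m, s2) prior + Gaussian reward with known variance `var`:
    posterior precision adds, posterior mean is precision-weighted."""
    m, s2 = prior
    post_s2 = 1.0 / (1.0 / s2 + 1.0 / var)
    post_m = post_s2 * (m / s2 + reward / var)
    return (post_m, post_s2)

def gamma_update(prior, reward):
    """Gamma(shape, rate) prior + Poisson reward x -> Gamma(shape + x, rate + 1)."""
    shape, rate = prior
    return (shape + reward, rate + 1.0)
```

In each case the posterior stays in the prior's family, so tracking it inside a bandit loop only requires updating two scalars per arm.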
Finally, we give below a rewriting of the posterior distribution that will be very useful in the sequel to obtain tight bounds on its tails.
Lemma 1.
Proof
3 Bayes-UCB: a simple and optimal Bayesian index policy
3.1 Algorithm and main result
The Bayes-UCB algorithm is an index policy that was introduced by [24] in the context of parametric bandit models. Given a prior distribution on the parameters of the arms, the index used for each arm is a well-chosen quantile of the (marginal) posterior distribution of its mean. For exponential family bandit models, given a product prior distribution on the means, the Bayes-UCB index is
$q_a(t) = Q\!\left(1 - \frac{1}{t (\log t)^c};\ \pi_{a,t}\right),$
where $Q(\alpha; \pi)$ is the quantile of order $\alpha$ of the distribution $\pi$ (that is, $\mathbb{P}_{X \sim \pi}\left(X \le Q(\alpha; \pi)\right) = \alpha$) and $c$ is a real parameter. In the particular case of bandit models with Gaussian arms, [33] have introduced a variant of Bayes-UCB with a slightly different tuning of the confidence level, under the name UCL (for Upper Credible Limit).
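As a concrete illustration for Bernoulli rewards with a uniform (Beta(1,1)) prior and the simplest tuning $c = 0$, the index is the quantile of order $1 - 1/t$ of a Beta posterior. Since Python's standard library has no Beta quantile function, this sketch (names and sample size are ours) estimates the quantile from posterior samples; an exact routine such as SciPy's `beta.ppf` could replace the sampling step.

```python
import random

def bayes_ucb_index(successes, failures, t, n_samples=2000, rng=None):
    """Bayes-UCB index for a Bernoulli arm under a Beta(1, 1) prior:
    the quantile of order 1 - 1/t of the Beta(s + 1, f + 1) posterior,
    estimated here from sorted posterior samples."""
    rng = rng or random.Random(0)
    samples = sorted(rng.betavariate(successes + 1, failures + 1)
                     for _ in range(n_samples))
    level = 1.0 - 1.0 / t
    idx = min(int(level * n_samples), n_samples - 1)
    return samples[idx]
```

The index exceeds the posterior mean (it is an upper quantile) and increases with $t$, reflecting the growing confidence level $1 - 1/t$.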
While the efficiency of Bayes-UCB has been demonstrated even beyond bandit models with independent arms, regret bounds are available only in very limited cases. For Bernoulli bandit models, asymptotic optimality is established by [24] when a uniform prior distribution on the mean of each arm is used. For Gaussian bandit models, [33] give a logarithmic regret bound when an uninformative prior is used. In this section, we provide new finite-time regret bounds that hold in general exponential family bandit models, showing that a slight variant of Bayes-UCB is asymptotically optimal for a large class of prior distributions.
We fix an exponential family, characterized by its log-partition function $b$ and the interval $\Theta$ of possible natural parameters. We let $\underline\theta$ and $\bar\theta$ denote the endpoints of $\Theta$ ($\underline\theta$ may be equal to $-\infty$ and $\bar\theta$ to $+\infty$). We analyze Bayes-UCB for exponential family bandit models satisfying the following assumption.
Assumption 2.
There exists and such that
For Poisson or Exponential distributions, this assumption requires that the means of all arms are different from zero, while they should be included in $(0, 1)$ for Bernoulli distributions. We now introduce a regularized version of the Bayes-UCB index, which relies on the knowledge of these bounds, as
(8)
Note that the truncation levels can be chosen arbitrarily close to the endpoints, in which case the regularized index often coincides with the original Bayes-UCB index.
Theorem 3.
From Theorem 3, taking the $\limsup$ and letting $\varepsilon$ go to zero show that (this slight variant of) Bayes-UCB satisfies
$\limsup_{T \to \infty} \frac{\mathbb{E}\left[N_a(T)\right]}{\log T} \;\le\; \frac{1}{\mathrm{KL}\!\left(\nu_{\theta_a}, \nu_{\theta^\star}\right)}.$
Thus this index policy is asymptotically optimal, as it matches Lai and Robbins' lower bound (4). As we shall see in Section 5, from a practical point of view Bayes-UCB outperforms kl-UCB and performs comparably (sometimes slightly better, sometimes slightly worse) to Thompson Sampling, another popular Bayesian algorithm that we now discuss.
3.2 Posterior quantiles versus posterior samples
Over the past few years, another Bayesian algorithm, Thompson Sampling, has become increasingly popular for its good empirical performance, and we explain here how Bayes-UCB is related to this alternative, randomized, Bayesian approach.
The Thompson Sampling algorithm, which draws each arm according to its posterior probability of being optimal, was introduced in 1933 as the very first bandit algorithm [38] and was rediscovered recently for its good empirical performance [36, 16]. Thompson Sampling can be implemented in virtually any Bayesian bandit model in which one can sample the posterior distribution, by drawing one sample from the posterior on each arm and selecting the arm that yields the largest sample. In any such case, Bayes-UCB can be implemented as well, and may appear as a more robust alternative, as the quantiles can be estimated based on several samples in case there is no efficient algorithm to compute them. Our experiments of Section 5 show that Bayes-UCB, as well as the other Bayesian-flavored index policies presented in Section 4, is competitive with Thompson Sampling in general one-dimensional exponential families. Compared to Bayes-UCB, the theoretical understanding of Thompson Sampling is more limited: this algorithm is known to be asymptotically optimal in exponential family bandit models, yet only for specific choices of prior distributions [25, 3, 26].
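For Bernoulli arms with Beta(1, 1) priors, Thompson Sampling can be sketched in a few lines using only the standard library (function names and the simulation loop are ours, for illustration only):

```python
import random

def thompson_step(successes, failures, rng):
    """One round of Thompson Sampling for Bernoulli arms with Beta(1, 1)
    priors: draw one sample from each posterior, play the argmax."""
    draws = [rng.betavariate(s + 1, f + 1)
             for s, f in zip(successes, failures)]
    return max(range(len(draws)), key=draws.__getitem__)

def run(mu, horizon, seed=0):
    """Simulate Thompson Sampling on Bernoulli arms with means `mu`;
    returns the number of pulls of each arm after `horizon` rounds."""
    rng = random.Random(seed)
    K = len(mu)
    succ, fail = [0] * K, [0] * K
    for _ in range(horizon):
        a = thompson_step(succ, fail, rng)
        if rng.random() < mu[a]:
            succ[a] += 1
        else:
            fail[a] += 1
    return [s + f for s, f in zip(succ, fail)]
```

On a well-separated instance, the pull counts concentrate on the best arm, in line with the logarithmic regret guarantees discussed above.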
In more complex bandit models, there are situations in which Bayes-UCB is indeed used over Thompson Sampling. When there is a potentially infinite number of arms and the mean reward function is assumed to be drawn from a Gaussian process, the GP-UCB algorithm of [37], which coincides with Bayes-UCB, is very popular in the Bayesian optimization community [11].
3.3 Tail bounds for posterior distributions
Just like the analysis of [24], the analysis of Bayes-UCB that we give in the next section relies on tight bounds on the tails of posterior distributions, which make it possible to control their quantiles. These bounds are expressed with the Kullback-Leibler divergence function $d$. Therefore, an additional tool in the proof is the control of the deviations of the empirical mean rewards from the true mean reward, measured with this divergence function, which follows from the work of [14].
In the particular case of Bernoulli bandit models, Bayes-UCB uses quantiles of Beta posterior distributions. In that case a specific argument, namely the fact that a Beta$(a, b)$ distribution with integer parameters is the distribution of the $a$-th order statistic among $a + b - 1$ i.i.d. uniform random variables, relates a Beta distribution (and its tails) to a Binomial distribution (and its tails). This 'Beta-Binomial trick' is also used extensively in the analyses of Thompson Sampling for Bernoulli bandits proposed by [2, 25, 3]. Note that this argument can only be used for Beta distributions with integer parameters, which rules out many possible prior distributions. The analysis of [33] in the Gaussian case also relies on specific tail bounds for the Gaussian posterior distributions. For exponential family bandit models, an upper bound on the tail of the posterior distribution was obtained by [26] using the Jeffreys prior. Lemma 4 below presents more general results that hold for any class of exponential family bandit models and any prior distribution with a density that is positive on $I$. For such (proper) prior distributions, we give deterministic upper and lower bounds on the corresponding posterior probabilities. Compared to the result of [26], which is not presented in this deterministic way, Lemma 4 is based on a different rewriting of the posterior distribution, given in Lemma 1.
Lemma 4.
Let be defined in Assumption 2.

There exist two positive constants and such that for all that satisfy , for all , for all ,

There exists a constant such that for all that satisfy , for all , for all ,
The constants depend on the quantities introduced in Assumption 2 and on the prior densities.
This result permits us in particular to show that the quantile defined in (8) satisfies , with
Hence, despite their Bayesian nature, the indices used in Bayes-UCB are strongly related to frequentist kl-UCB-type indices. However, compared to the index defined in (7), the exploration rate that appears in these indices also features the current number of draws $N_a(t)$. Lai gives in [27] an asymptotic analysis of index strategies of the above form with an exploration function $g$, where $g(t) \sim \log t$ when $t$ goes to infinity. Yet neither index is exactly of that form, and we propose below a finite-time analysis that relies on new, non-asymptotic tools.

3.4 Finite-time analysis
We give here the proof of Theorem 3. To ease the notation, assume that arm 1 is an optimal arm, and let $a$ be a suboptimal arm.
We introduce a truncated version of the KL divergence, and let $(\varepsilon_t)$ be a decreasing sequence to be specified later.
Using the fact that, by definition of the algorithm, if arm $a$ is played at round $t$, its index is in particular at least as large as that of arm 1, one has
This yields
The posterior bounds established in Lemma 4 allow us to further upper bound the two sums on the right-hand side of the above inequality. With the constants defined in Lemma 4, we introduce the quantity defined by
On the one hand, for ,
since by the lower bound in the second statement of Lemma 4,
Now using the lower bound in the first statement of Lemma 4,
On the other hand,
(9)  
Using Lemma 4, the first sum in (9) is upper bounded by
The third inequality follows from exchanging the two sums and using that each summand is smaller than 1. The last inequality uses that if , and . Then by the Chernoff inequality,
Still using the Chernoff inequality, the second sum in (9) is upper bounded by
Putting things together, we have shown that there exists some constant such that
Term T1 is shown below to be of order , as the empirical mean cannot be too far from the true mean. Note however that the deviation is expressed with in place of the traditional , which makes the proof of Lemma 5 more intricate. In particular, Lemma 5 applies to a specific sequence $(\varepsilon_t)$ defined therein, and a similar result could not be obtained for the choice , unlike for Lemma 6 below.
Lemma 5.
Let be such that . If , for all , if is larger than ,
From Lemma 5, one has
The following lemma gives an upper bound on Term T2.
Lemma 6.
Let be three functions such that
with and non-increasing for large enough.
For all there exists a (problem-dependent) constant such that for all ,
with , where the variance function $V$ is defined in (5).
4 A Bayesian insight on alternative exploration rates
The kl-UCB index of an arm, introduced in (7), uses the exploration rate