1 Introduction
Balancing exploration with exploitation is a well-known and important problem in reinforcement learning (Sutton and Barto, 2018). If the behaviour policy focuses too much on exploration rather than exploitation, it can hurt performance in an online setting; furthermore, on-policy algorithms such as SARSA or TD(λ) might not converge to a good policy. On the other hand, if the exploration policy focuses too much on exploitation rather than exploration, the state space might not be explored sufficiently and an optimal policy will not be found.
Historically, numerous exploration policies have been proposed for addressing the exploration-exploitation trade-off in model-free reinforcement learning, including Boltzmann exploration and epsilon-greedy (McFarlane, 2018). There, this trade-off is often controlled by one or more tuning parameters, such as ε in epsilon-greedy or the temperature parameter in Boltzmann exploration. However, these parameters typically have to be hand-crafted or tuned for each task in order to obtain good performance. This motivates the design of exploration algorithms that adapt their behaviour according to some measure of the learning progress. The simplest approaches adapt the tuning parameters of a fixed class of exploration policies such as epsilon-greedy (Tokic, 2010). Other methods, such as count-based exploration (Thrun, 1992; Bellemare et al., 2016; Ostrovski et al., 2017) and Bayesian Q-learning (Dearden et al., 1998), use specialized techniques to develop new classes of exploration policies.
However, despite the recent developments in exploration strategies, epsilon-greedy is still often the exploration approach of choice (Vermorel and Mohri, 2005; Heidrich-Meisner, 2009; Mnih et al., 2015; Van Hasselt et al., 2016). Epsilon-greedy is both intuitive and simpler to tune than other approaches, since it is completely parameterized by one parameter, ε. Another benefit of this policy is that it can be easily combined with more sophisticated frameworks, such as options (Bacon et al., 2017). Unfortunately, the performance of epsilon-greedy in practice is highly sensitive to the choice of ε, and existing methods for adapting ε from data are ad-hoc and offer little theoretical justification.
In this paper, we take a fully Bayesian perspective on adapting ε based on return data. Recent work has demonstrated the strong potential of a Bayesian approach for parameter tuning in model-free reinforcement learning (Downey and Sanner, 2010). Another key advantage of a fully Bayesian approach over heuristics is the ability to specify priors on parameters, such as the predictive inverse variance of returns in this work, which are more robust to noise or temporary digressions in the learning process. In addition, our approach can be combined with other exploration policies such as Boltzmann exploration (Tokic and Palm, 2011). Specifically, we contribute ε-BMC, a novel, efficient and general approach to adapting the exploration parameter ε in epsilon-greedy policies. Empirically, we evaluate the performance of ε-BMC on domains with discrete and continuous state spaces, using tabular and approximate RL methods, and show that our algorithm can outperform exploration strategies that fix or anneal ε based on time, and even existing adaptive algorithms.
2 Related Work
Our paper falls within the scope of adaptive epsilon-greedy algorithms. Perhaps the most similar approach to our work is the Value Differences Based Exploration (VDBE) algorithm of Tokic (2010), in which ε was modelled using a moving average and updated according to the Bellman (TD) error. However, that algorithm was presented for stationary-reward multi-armed bandits. Tokic and Palm (2011) combined the ideas of VDBE with Boltzmann exploration to create the VDBE-Softmax algorithm. dos Santos Mignon and da Rocha (2017) later developed a similar heuristic algorithm that worked on non-stationary multi-armed bandit problems. However, all these approaches are heuristic in nature; our paper approaches the problem from a Bayesian perspective.
Modelling Q-values using normal-gamma priors, as done in our paper, is a cornerstone of Bayesian Q-learning (Dearden et al., 1998). However, that paper is fundamentally different from ours, in that it addresses the problem of exploration by adding a bonus to the Q-values that estimates the myopic value of perfect information. Our paper, on the other hand, applies the normal-gamma prior only to model the variance of the returns, while the exploration is handled using the epsilon-greedy policy with ε modelled using a Beta distribution.
3 Preliminaries
3.1 Markov Decision Processes
In this paper, we denote a Markov decision process (MDP) as a tuple (S, A, P, R, γ), where S is a set of states, A is a finite set of actions, P is a transition function for the system state, R is a bounded reward function, and γ ∈ [0, 1) is a discount factor.
Randomized exploration policies are sequences of mappings π_t from states to probability distributions over actions. Given an MDP, for each state-action pair (s, a) and policy π, we define the expected return

Q^π(s, a) = E[ Σ_{t=0}^∞ γ^t r_{t+1} | s_0 = s, a_0 = a ],

where actions a_t are sampled from π, successor states s_{t+1} are sampled from P, and r_{t+1} is the reward received at step t. The associated value function is V^π(s) = E_{a∼π}[Q^π(s, a)], and the objective is to learn an optimal policy π* that attains the supremum of V^π over all policies. Puterman (2014) contains a more detailed treatment of this subject.
3.2 Reinforcement Learning
In the reinforcement learning setting, neither P nor R are known, so optimal policies are learned from experience, defined as sequences of transitions (s_t, a_t, r_{t+1}, s_{t+1}) broken up into episodes. Here, states and rewards are sampled from the environment, and actions follow some exploration policy π_t. Given an estimate Ĝ_t of the expected return at time t starting from state s_t and taking action a_t, temporal difference (TD) learning updates the Q-values as follows:

Q_{t+1}(s_t, a_t) = Q_t(s_t, a_t) + α_t (Ĝ_t − Q_t(s_t, a_t)),

where α_t ∈ (0, 1] is a problem-dependent learning rate parameter.
Typically, Ĝ_t is computed by bootstrapping from the current Q-values, in order to reduce variance. Two of the most popular bootstrapping algorithms are Q-learning and SARSA, given respectively as:

(1) Ĝ_t^Q = r_{t+1} + γ max_{a' ∈ A} Q_t(s_{t+1}, a')

(2) Ĝ_t^S = r_{t+1} + γ Q_t(s_{t+1}, a_{t+1})
Qlearning is an example of an offpolicy algorithm, whereas SARSA is onpolicy. Under relatively mild conditions and in tabular settings, Qlearning has been shown to converge to the optimal policy with probability one (Watkins and Dayan, 1992).
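To make the two bootstraps concrete, the following minimal sketch implements one tabular TD update with either target. The function name and array layout are ours for illustration, not from the paper.

```python
import numpy as np

def td_update(Q, s, a, r, s_next, a_next=None,
              alpha=0.1, gamma=0.9, method="qlearning"):
    """One tabular TD update; Q is a (num_states, num_actions) array."""
    if method == "qlearning":        # off-policy bootstrap, eq. (1)
        target = r + gamma * np.max(Q[s_next])
    elif method == "sarsa":          # on-policy bootstrap, eq. (2)
        target = r + gamma * Q[s_next, a_next]
    else:
        raise ValueError(method)
    Q[s, a] += alpha * (target - Q[s, a])
    return Q
```

For example, starting from a zero table, a transition with reward 1.0 moves Q(s, a) a fraction alpha of the way toward the bootstrapped target.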
One additional algorithm that is important in this work is expected SARSA,

(3) Ĝ_t^E = r_{t+1} + γ Σ_{a' ∈ A} π_t(a' | s_{t+1}) Q_t(s_{t+1}, a'),
which is similar to SARSA, but in which the uncertainty of the next action with respect to the policy π_t is averaged out. This results in considerable variance reduction as compared to SARSA; the theoretical properties of this algorithm are detailed in Van Seijen et al. (2009). Sutton and Barto (2018) provide a comprehensive treatment of reinforcement learning methods.
3.3 EpsilonGreedy Policy
In this paper, exploration is carried out using ε-greedy policies, defined formally as

(4) π_t^ε(a | s) = ε / |A| + (1 − ε) · 1[a = argmax_{a'} Q_t(s, a')].

In other words, π_t^ε samples a random action from A with probability ε, and otherwise selects the greedy action according to Q_t. As a result, ε can be interpreted as the relative importance placed on exploration.
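As a small sketch of this action-selection rule (helper name ours, not from the paper):

```python
import numpy as np

def epsilon_greedy_action(Q_row, epsilon, rng):
    """Sample an action from the epsilon-greedy policy (4) for one state.
    Q_row holds the Q-values of that state, one entry per action."""
    if rng.random() < epsilon:
        return int(rng.integers(len(Q_row)))   # explore: uniform over A
    return int(np.argmax(Q_row))               # exploit: greedy action
```

With epsilon = 0 the rule is purely greedy; with epsilon = 1 it is purely uniform.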
The optimal value of the parameter ε is typically problem-dependent, and found through experimentation. Often, ε is annealed over time in order to favor exploration at the beginning, and exploitation closer to convergence (Sutton and Barto, 2018). However, such approaches are not adaptive, since they do not take into account the learning process of the agent. In this paper, our main objective is to derive a data-driven tuning strategy for ε that depends on current learning progress rather than trial and error.
4 Adaptive EpsilonGreedy
In this section, we show how the expected return under epsilon-greedy policies can be written as an average of two return models weighted by ε. There are two relevant Bayesian methods for combining multiple models based on evidence: Bayesian model averaging (BMA) and Bayesian model combination (BMC). Generally, the Bayesian model combination approach is preferred to model averaging, since it provides a richer space of hypotheses and reduced variance (Minka, 2002). By interpreting ε as a random variable whose posterior distribution can be updated on the basis of observed data, BMC naturally leads to a method for ε adaptation.

4.1 A Bayesian Interpretation of Expected SARSA With the Epsilon-Greedy Policy
We begin by combining the definition of expected SARSA (3) with the ε-greedy policy (4). For every transition (s_t, a_t, r_{t+1}, s_{t+1}) we have

(5) Ĝ_t^E = ε Ĝ_t^U + (1 − ε) Ĝ_t^Q,

where Ĝ_t^Q is the Q-learning bootstrap (1) and

(6) Ĝ_t^U = r_{t+1} + γ (1/|A|) Σ_{a' ∈ A} Q_t(s_{t+1}, a')

is an estimate that uniformly averages over all the action-values, which we call the uniform model. This leads to the following important observation: expected SARSA can be viewed as a probability-weighted average of two models, the greedy model that trusts the current Q-value estimates and acts optimally with respect to them, and the uniform model that completely distrusts the current Q-value estimates and consequently places a uniform belief over them. Under this interpretation, ε and 1 − ε are the posterior beliefs assigned to the two aforementioned models, respectively. In the following subsections, we verify this simple fact algebraically in the context of Bayesian model combination. We also develop a method for maintaining the (approximate) posterior belief state efficiently, with a computational cost that is constant in both space and time, and with provable convergence guarantees.
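The decomposition of expected SARSA into the greedy and uniform models can be checked numerically with a short sketch (function name and layout ours):

```python
import numpy as np

def bootstraps(Q_next, r, gamma, epsilon):
    """Greedy (1), uniform (6) and expected-SARSA (3) bootstraps
    under the epsilon-greedy policy (4); Q_next holds Q(s', .)."""
    g_q = r + gamma * np.max(Q_next)       # greedy model
    g_u = r + gamma * np.mean(Q_next)      # uniform model
    n = len(Q_next)
    pi = np.full(n, epsilon / n)           # eps-greedy probabilities
    pi[np.argmax(Q_next)] += 1.0 - epsilon
    g_e = r + gamma * pi @ Q_next          # expected SARSA
    return g_q, g_u, g_e
```

For any Q_next, r, gamma and epsilon, g_e equals epsilon * g_u + (1 - epsilon) * g_q, which is exactly equation (5).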
4.2 Bayesian QLearning
In order to facilitate tractable learning and inference, we assume that the return observation G_t at time t, given the model M ∈ {Q, U}, is normally distributed:

(7) G_t | M, τ ∼ N(Ĝ_t^M, 1/τ),

where the means Ĝ_t^Q and Ĝ_t^U are given in (1) and (6), respectively, and τ is the inverse of the variance, or precision. This assumption can be justified by viewing the return as a discounted sum of future (random) reward observations, and appealing to the central limit theorem when γ is close to 1 and the MDP is ergodic (Dearden et al., 1998).

There are two special cases of interest in this work. In the first case, τ is allowed to be constant across all state-action pairs and models, which naturally leads to a state-independent ε adaptation. This approach is particularly advantageous when it is costly or impossible to maintain independent statistics per state, such as when the state space is very large or continuous in nature. In the second case, independent statistics are maintained per state and lead to state-dependent exploration.
In order to update τ, we consider the standard normal-gamma model:

(8) μ | τ ∼ N(μ₀, (λ₀ τ)⁻¹),   τ ∼ Gamma(α₀, β₀),

where the returns G₁, G₂, … are i.i.d. given μ and τ. Since the returns in different state-action pairs are dependent, this assumption is likely to be violated in practice. However, it leads to a compact learning representation necessary for tractable Bayesian inference, and has been used effectively in the existing literature in similar forms (Dearden et al., 1998). Furthermore, (8) is not used to model the Q-values directly in our paper, but rather to facilitate robust estimation of ε, as we now show.

Given data D_t of previously observed returns, the joint posterior distribution of μ and τ with likelihood (7) and prior (8) is also normal-gamma distributed, and so the marginal posterior distribution of τ, p(τ | D_t), is gamma distributed with parameters:

(9) α_t = α₀ + t/2,   β_t = β₀ + (1/2)(t σ̂_t² + λ₀ t (Ḡ_t − μ₀)² / (λ₀ + t)),

where Ḡ_t and σ̂_t² are the sample mean and variance of the returns in D_t, respectively (Bishop, 2006). These quantities can be updated online after each new observation in constant time (Welford, 1962).
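The online bookkeeping can be sketched as follows, combining Welford's running mean/variance recursion with the gamma posterior parameters of (9). The class name and default prior values are illustrative assumptions, not the paper's implementation.

```python
class ReturnStats:
    """Online mean/variance of observed returns (Welford, 1962) and the
    induced gamma posterior (9) on the precision tau."""

    def __init__(self, mu0=0.0, lam0=1.0, alpha0=500.0, beta0=500.0):
        self.mu0, self.lam0 = mu0, lam0
        self.alpha0, self.beta0 = alpha0, beta0
        self.t, self.mean, self.m2 = 0, 0.0, 0.0

    def update(self, g):
        """Incorporate one return observation in O(1) time."""
        self.t += 1
        d = g - self.mean
        self.mean += d / self.t
        self.m2 += d * (g - self.mean)   # running sum of squared deviations

    def posterior(self):
        """Gamma posterior (alpha_t, beta_t) on tau, as in (9)."""
        var = self.m2 / self.t if self.t else 0.0   # sample variance
        alpha = self.alpha0 + self.t / 2.0
        beta = self.beta0 + 0.5 * (self.t * var
               + self.lam0 * self.t * (self.mean - self.mu0) ** 2
               / (self.lam0 + self.t))
        return alpha, beta
```

Each `update` call costs constant time and memory, so no return history needs to be cached.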
Finally, for each model M ∈ {Q, U}, we marginalize over the uncertainty in τ, using (7) and (9) as follows:

P(G_t | M, D_{t−1}) = ∫₀^∞ N(G_t; Ĝ_t^M, 1/τ) Gamma(τ; α_t, β_t) dτ,

which yields:

(10) G_t | M, D_{t−1} ∼ St(Ĝ_t^M, α_t/β_t, 2α_t),

where St(μ, λ, ν) is the three-parameter Student t-distribution with mean μ, precision λ and ν degrees of freedom (Bishop, 2006). Therefore, marginalizing over the unknown precision leads to a t-distributed likelihood function. Alternatively, one could simply use the Gaussian likelihood in equation (7) and treat τ as a problem-dependent tuning parameter. However, the heavy-tail property of the t-distribution is advantageous in the non-stationary setting typically encountered in reinforcement learning applications, where Q-values change during the learning phase. We now show how to link this update with the expected SARSA decomposition (5) to derive an adaptive epsilon-greedy policy.
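The predictive density of (10) is easy to evaluate directly; a minimal sketch of the three-parameter Student t log-density (following Bishop's parameterization; the function name is ours) is:

```python
from math import lgamma, log, pi

def student_t_logpdf(x, mu, lam, nu):
    """Log-density of the three-parameter Student t St(x; mu, lam, nu)
    with mean mu, precision lam and nu degrees of freedom."""
    return (lgamma((nu + 1) / 2) - lgamma(nu / 2)
            + 0.5 * (log(lam) - log(pi * nu))
            - (nu + 1) / 2 * log(1 + lam * (x - mu) ** 2 / nu))
```

Under the posterior (9), the predictive for model M is evaluated as `student_t_logpdf(g, g_hat_M, alpha_t / beta_t, 2 * alpha_t)`. As a sanity check, St(0; 0, 1, 1) is the standard Cauchy density at zero, 1/π.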
4.3 ε-BMC: Adapting Epsilon Using Bayesian Model Combination
In the general setting of Bayesian model combination, we model the uncertainty in the Q-values for each state-action pair as random variables with posterior distribution p(w | D_{t−1}) over the model weight w. The expected posterior return can be written as an average over all possible combinations of the greedy and uniform models,

(11) E[G_t | D_{t−1}] = ∫₀¹ (w Ĝ_t^U + (1 − w) Ĝ_t^Q) p(w | D_{t−1}) dw,

where w is the weight assigned to the uniform model and 1 − w is the weight assigned to the greedy model (Monteith et al., 2011). As will be verified shortly, the expectation of this weight given the past return data will turn out to be a Bayesian interpretation of ε.
The belief over w is maintained as a posterior distribution p(w | D_{t−1}). Continuing from (11):

E[G_t | D_{t−1}] = E[w | D_{t−1}] Ĝ_t^U + (1 − E[w | D_{t−1}]) Ĝ_t^Q,

which is exactly (5), except that now ε = E[w | D_{t−1}]. We have thus shown that the expected SARSA bootstrap with a data-driven ε can be viewed in terms of Bayesian model combination. We denote this new estimate Ĝ_t^{BMC}.
The posterior distribution is updated recursively by Bayes' rule:

p(w | D_t) ∝ P(G_t | w, D_{t−1}) p(w | D_{t−1}) = (w P(G_t | U, D_{t−1}) + (1 − w) P(G_t | Q, D_{t−1})) p(w | D_{t−1}),

for every new observation G_t. Since the number of terms in p(w | D_t) grows exponentially in t, it is necessary to use posterior approximation techniques to compute ε_t = E[w | D_{t−1}].
One approach to address the intractability of computing an exact posterior p(w | D_t) is to sample directly from the distribution. However, such an approach is inherently noisy and inefficient in practice. Instead, we apply the Dirichlet moment-matching technique (Hsu and Poupart, 2016; Gimelfarb et al., 2018) that was shown to be effective at differentiating between good and bad models from data, easy to implement, and efficient. In particular, we apply the approach in Gimelfarb et al. (2018) with the models Ĝ_t^Q and Ĝ_t^U: we maintain a Beta(α_t^ε, β_t^ε) approximation of p(w | D_t), match the first and second moments of the updated posterior to those of the beta distribution, and solve the resulting system of equations for α_{t+1}^ε and β_{t+1}^ε. The closed-form solution is:

(12) m₁ = α_t^ε ((α_t^ε + 1) p_t^U + β_t^ε p_t^Q) / ((α_t^ε + β_t^ε + 1)(α_t^ε p_t^U + β_t^ε p_t^Q))

(13) m₂ = α_t^ε (α_t^ε + 1)((α_t^ε + 2) p_t^U + β_t^ε p_t^Q) / ((α_t^ε + β_t^ε + 1)(α_t^ε + β_t^ε + 2)(α_t^ε p_t^U + β_t^ε p_t^Q))

(14) r_t = (m₁ − m₂) / (m₂ − m₁²)

(15) α_{t+1}^ε = m₁ r_t

(16) β_{t+1}^ε = (1 − m₁) r_t

where p_t^U = P(G_t | U, D_{t−1}) and p_t^Q = P(G_t | Q, D_{t−1}) are the respective probabilities of observing a return G_t under the Student t distributions (10). It follows that

(17) ε_{t+1} = E[w | D_t] = α_{t+1}^ε / (α_{t+1}^ε + β_{t+1}^ε).
All quantities, including Ḡ_t, σ̂_t², α_t^ε and β_t^ε, can be computed online in constant time and with minimal storage overhead, without caching D_t. We call this approach ε-BMC and present the corresponding pseudocode in Algorithm 1. Therein, lines with a * indicate additions to the ordinary expected SARSA algorithm.
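The beta moment-matching step of the update, equations (12) through (17), can be sketched in a few lines (the function name and argument order are ours, not the paper's Algorithm 1):

```python
def epsilon_bmc_update(a, b, p_u, p_q):
    """One eps-BMC posterior update: given the current Beta(a, b) belief
    over the uniform-model weight w, and the predictive densities p_u, p_q
    of the observed return under the uniform and greedy models (10),
    return the moment-matched (a, b) and the new epsilon (17)."""
    z = a * p_u + b * p_q                                   # normalizer
    m1 = a * ((a + 1) * p_u + b * p_q) / ((a + b + 1) * z)  # first moment
    m2 = (a * (a + 1) * ((a + 2) * p_u + b * p_q)           # second moment
          / ((a + b + 1) * (a + b + 2) * z))
    r = (m1 - m2) / (m2 - m1 * m1)                          # matched a + b
    a_new, b_new = m1 * r, (1.0 - m1) * r
    return a_new, b_new, a_new / (a_new + b_new)
```

Note that when the two models explain the observation equally well (p_u = p_q), the belief is unchanged, and when the greedy model explains it better (p_q > p_u), epsilon decreases, matching the monotonicity argument of Theorem 1.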
In Algorithm 1, the expected SARSA return was used to update both the posterior on w and the Q-values. However, it is possible to update the Q-values in line 11 using a different estimator of the future return, including Q-learning (1), SARSA (2), or other approaches. The Q-values could also be approximated using a deep neural network or other function approximator. The algorithm can also be run offline by caching the return observations for an entire episode and then updating p(w | D_t). State-dependent exploration can be easily implemented by maintaining the posterior statistics independently for each state, or approximating them using a neural network.

A final advantage of our algorithm is a provable convergence guarantee under fairly general assumptions.
Theorem 1 (Monotone Convergence of ε-BMC).
Suppose ε₀ ≤ 1/2. Then ε_{t+1} ≤ ε_t for all t ≥ 0, and therefore ε_t converges as t → ∞.
Proof.
Let p_t^U and p_t^Q denote the predictive densities of the observed return under the uniform and greedy models, and write a = α_t^ε, b = β_t^ε. Observe that:

(18) ε_{t+1} − ε_t = m₁ − a/(a + b) = a b (p_t^U − p_t^Q) / ((a + b)(a + b + 1)(a p_t^U + b p_t^Q)),

where the first equality uses (15)–(17) and the second uses (12). Then, since a, b > 0:

sign(ε_{t+1} − ε_t) = sign(p_t^U − p_t^Q).

Now, if ε_t ≤ 1/2, then the observed return G_t = ε_t Ĝ_t^U + (1 − ε_t) Ĝ_t^Q satisfies |G_t − Ĝ_t^Q| = ε_t |Ĝ_t^U − Ĝ_t^Q| ≤ (1 − ε_t) |Ĝ_t^U − Ĝ_t^Q| = |G_t − Ĝ_t^U|. Since the two Student t densities in (10) share the same precision and degrees of freedom and decrease with the distance from their means, this implies that p_t^Q ≥ p_t^U, and from (18) we conclude that ε_{t+1} ≤ ε_t ≤ 1/2. The first statement of the theorem follows from the assumption ε₀ ≤ 1/2 and a standard induction argument. The second statement follows from the monotone convergence theorem (see, e.g., Rudin (1976), pg. 56). ∎
The convergence of ε-BMC holds using any value function representation, including neural networks. It is important to note that convergence of ε-BMC can only be guaranteed when ε₀ is initialized in (0, 1/2]. However, this is not a concern in practice, since it has been found that there is no significant gain in using values of ε larger than 0.5 (dos Santos Mignon and da Rocha, 2017).
5 Empirical Evaluation
To demonstrate the ability of Algorithm 1 to adapt ε in a variety of environments, we consider a deterministic, finite-state gridworld domain, the continuous-state cart-pole control problem, and a stochastic, discrete-state supply-chain problem. The third domain was chosen to show how our algorithm performs when the action space is large and the problem is stochastic. We considered two different reinforcement learning algorithms: on-policy tabular expected SARSA (Sutton and Barto, 2018), and off-policy DQN with experience replay (Mnih et al., 2015); as noted in the previous section, we only need to replace line 11 of Algorithm 1 with the DQN update. The parameter settings are listed in Tables 1 and 2 in the supplementary materials; the code and supplementary materials can be found at https://github.com/mikegimelfarb/bayesianepsilongreedy. All experiments were run independently 100 times, and mean curves with shaded standard error are reported.
In the empirical evaluation of Algorithm 1, our goal is to quantify the added value of adapting the parameter ε in epsilon-greedy policies using a Bayesian approach, rather than to compare the performance of epsilon-greedy policies against other approaches, which has been investigated in the literature in various settings (Vermorel and Mohri, 2005; Tijsma et al., 2016). Therefore, Algorithm 1 is compared against different annealing schedules for ε, broken down into the following categories:

constant: ε_t = ε₀, for several fixed values of ε₀;

geometric: ε_t = ε₀ δ^t, where δ ∈ (0, 1) and t is the episode number;

power: ε_t = ε₀ t^{−p}, where p > 0 and t is the episode number;

adaptive: VDBE (Tokic, 2010), for several settings of its inverse sensitivity parameter σ.
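The three fixed schedules above can be sketched compactly. The default parameter values here are illustrative only; the exact grids used in the experiments are given in the supplementary materials, not reproduced here.

```python
def epsilon_schedule(kind, t, eps0=1.0, decay=0.99, power=1.0):
    """Baseline epsilon schedules from Section 5 (t = episode number >= 1)."""
    if kind == "constant":
        return eps0                  # fixed epsilon
    if kind == "geometric":
        return eps0 * decay ** t     # geometric decay
    if kind == "power":
        return eps0 / t ** power     # polynomial decay
    raise ValueError(kind)
```

All three ignore the agent's learning progress entirely, which is precisely the shortcoming that ε-BMC and VDBE aim to address.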
We do not compare to Tokic and Palm (2011), since that paper falls outside the scope of epsilon-greedy policies. However, we reiterate that it is a trivial matter to interchange VDBE and ε-BMC in that framework.
5.1 GridWorld
The first benchmark problem is the discrete, deterministic 5-by-5 gridworld navigation problem with subgoals presented in Ng et al. (1999). Valid moves incur a cost of 0.1, and invalid moves incur a cost of 0.2, in order to encourage the agent to solve the task in the least amount of time. Testing consists of running a single episode, starting from the same initial state, using the greedy policy at the end of each episode. The results are shown in Figures 1 and 2.
5.2 CartPole
The second problem is the continuous, deterministic cart-pole control problem. A reward of 1.0 is provided at each step until the pole falls, to encourage the agent to keep the pole upright. To run the tabular expected SARSA algorithm and VDBE, we discretize the four-dimensional state space into equally-sized regions. Since the initial position is randomized, testing consists of evaluating the greedy policy on 10 independent episodes and averaging the returns. Since overfitting was a significant concern for DQN, we stop training as soon as perfect test performance (the pole has not been dropped) was observed over four consecutive episodes. The results are shown in Figures 3 and 4.
5.3 SupplyChain Management
This supply-chain management problem was described in Kemmer et al. (2018), and consists of a factory and a number of warehouses. The agent must decide how much inventory to produce at the factory and how much inventory to ship to the warehouse(s) to meet the demand, which follows a Poisson distribution. We also impose a transportation limit per period, and assume that unfulfilled demand is not backlogged, but lost forever. The initial state is always (10, 0) for training and testing (i.e., 10 items initially at the factory and 0 at the warehouse). Like cart-pole, testing is done by averaging the returns of 10 independent trials using the greedy policy. The results are illustrated in Figures 5 and 6.

5.4 Discussion
Overall, we see that ε-BMC consistently outperformed, or performed similarly to, all other types of annealing strategies, including VDBE. ε-BMC converged slightly later than VDBE on the gridworld domain, and later than the fixed annealing strategy on the supply-chain problem, when using tabular expected SARSA; however, in the former case, ε-BMC outperformed all fixed tuning strategies, and in the latter case, it outperformed VDBE by a large margin. These observations are related to the speed of convergence; asymptotically, ε-BMC approached the performance of the best policy that was attained (for gridworld, this is indeed the optimal policy).
While it performed well on the simple gridworld domain, VDBE performed considerably worse than ε-BMC on the more complex supply-chain problem. We believe that the Bayesian approach of ε-BMC smooths out the noise in the return signals better than VDBE and other ad-hoc approaches for adapting ε. This also suggests why our algorithm performed better with DQN.
Furthermore, we see that no single family of annealing strategies worked consistently well across all domains and algorithms. For instance, geometric decay strategies worked well on the gridworld domain, while performing poorly on the supply-chain problem using tabular SARSA. The power decay strategies worked well on the supply-chain problem using tabular SARSA, but failed to match the performance of other strategies when switching to DQN. Also, the performance of VDBE was highly sensitive to the choice of its σ parameter: a lower value of σ worked well for gridworld and cart-pole, but higher values of σ worked better for supply-chain. The performance of ε-BMC was relatively insensitive to the choice of prior parameters for μ and τ (μ₀, λ₀, α₀, β₀), so we were able to use the same values in all our experiments. However, unsurprisingly, it was more sensitive to the strength of the prior on ε (α₀^ε, β₀^ε). Since we can always set β₀^ε ≈ α₀^ε, so that ε₀ ≈ 1/2, this effectively reduces to the problem of selecting a single parameter that controls the strength of the prior on ε. This is considerably easier than selecting both a good annealing method and its tuning parameter(s).
6 Conclusion
In this paper, we proposed a novel Bayesian approach to the exploration-exploitation problem in general model-free reinforcement learning, in the form of an adaptive epsilon-greedy policy. Our algorithm, ε-BMC, tunes the ε parameter automatically from return observations using Bayesian model combination and approximate moment-matching inference. It was argued to be general, efficient, robust, and theoretically grounded, and was shown empirically to outperform fixed annealing schedules for ε and even a state-of-the-art adaptation scheme.
In future work, it would be interesting to evaluate the performance of ε-BMC combined with Boltzmann exploration (Tokic and Palm, 2011), as well as the state-dependent version of our algorithm. We believe that it is possible to obtain a Bayesian interpretation of VDBE by placing priors over the Bellman errors and updating them using data, but we have not investigated this approach. It would also be interesting to extend our approach to handle options.
Acknowledgements
We thank the reviewers for their insightful comments and suggestions.
References

Bacon, P.-L., Harb, J., and Precup, D. (2017). The option-critic architecture. In Thirty-First AAAI Conference on Artificial Intelligence.

Bellemare, M. G., Srinivasan, S., Ostrovski, G., Schaul, T., Saxton, D., and Munos, R. (2016). Unifying count-based exploration and intrinsic motivation. In Advances in Neural Information Processing Systems, pp. 1471–1479.

Bishop, C. M. (2006). Pattern Recognition and Machine Learning. Springer.

Dearden, R., Friedman, N., and Russell, S. (1998). Bayesian Q-learning. In AAAI/IAAI, pp. 761–768.

dos Santos Mignon, A. and da Rocha, R. L. A. (2017). An adaptive implementation of ε-greedy in reinforcement learning. Procedia Computer Science 109, pp. 1146–1151.

Downey, C. and Sanner, S. (2010). Temporal difference Bayesian model averaging: a Bayesian perspective on adapting lambda. In International Conference on Machine Learning, pp. 311–318.

Gimelfarb, M., Sanner, S., and Lee, C.-G. (2018). Reinforcement learning with multiple experts: a Bayesian model combination approach. In Advances in Neural Information Processing Systems, pp. 9549–9559.

Heidrich-Meisner, V. (2009). Interview with Richard S. Sutton. KI 23(3), pp. 41–43.

Hsu, W.-S. and Poupart, P. (2016). Online Bayesian moment matching for topic modeling with unknown number of topics. In Advances in Neural Information Processing Systems, pp. 4536–4544.

Kemmer, L., et al. (2018). Reinforcement learning for supply chain optimization. In European Workshop on Reinforcement Learning 14.

McFarlane, R. (2018). A survey of exploration strategies in reinforcement learning. Unpublished.

Minka, T. P. (2002). Bayesian model averaging is not model combination. Unpublished.

Mnih, V., et al. (2015). Human-level control through deep reinforcement learning. Nature 518(7540), pp. 529–533.

Monteith, K., Carroll, J. L., Seppi, K., and Martinez, T. (2011). Turning Bayesian model averaging into Bayesian model combination. In Neural Networks (IJCNN), pp. 2657–2663.

Ng, A. Y., Harada, D., and Russell, S. (1999). Policy invariance under reward transformations: theory and application to reward shaping. In International Conference on Machine Learning, Vol. 99, pp. 278–287.

Ostrovski, G., Bellemare, M. G., van den Oord, A., and Munos, R. (2017). Count-based exploration with neural density models. In Proceedings of the 34th International Conference on Machine Learning, Volume 70, pp. 2721–2730.

Puterman, M. L. (2014). Markov Decision Processes: Discrete Stochastic Dynamic Programming. John Wiley & Sons.

Rudin, W. (1976). Principles of Mathematical Analysis. Third edition, McGraw-Hill, New York.

Sutton, R. S. and Barto, A. G. (2018). Reinforcement Learning: An Introduction. MIT Press.

Thrun, S. B. (1992). Efficient exploration in reinforcement learning. Technical report CMU-CS-92-102, School of Computer Science, Carnegie Mellon University.

Tijsma, A. D., Drugan, M. M., and Wiering, M. A. (2016). Comparing exploration strategies for Q-learning in random stochastic mazes. In 2016 IEEE Symposium Series on Computational Intelligence (SSCI), pp. 1–8.

Tokic, M. and Palm, G. (2011). Value-difference based exploration: adaptive control between epsilon-greedy and softmax. In Annual Conference on Artificial Intelligence, pp. 335–346.

Tokic, M. (2010). Adaptive ε-greedy exploration in reinforcement learning based on value differences. In Annual Conference on Artificial Intelligence, pp. 203–210.

Van Hasselt, H., Guez, A., and Silver, D. (2016). Deep reinforcement learning with double Q-learning. In Thirtieth AAAI Conference on Artificial Intelligence.

Van Seijen, H., van Hasselt, H., Whiteson, S., and Wiering, M. (2009). A theoretical and empirical analysis of expected SARSA. In IEEE Symposium on Adaptive Dynamic Programming and Reinforcement Learning (ADPRL'09), pp. 177–184.

Vermorel, J. and Mohri, M. (2005). Multi-armed bandit algorithms and empirical evaluation. In European Conference on Machine Learning, pp. 437–448.

Watkins, C. J. C. H. and Dayan, P. (1992). Q-learning. Machine Learning 8(3–4), pp. 279–292.

Welford, B. P. (1962). Note on a method for calculating corrected sums of squares and products. Technometrics 4(3), pp. 419–420.
Supplementary Material
Tables 1 and 2 describe the parameter settings used in the experiments for SARSA and DQN, respectively.
PARAMETER / DESCRIPTION  GRIDWORLD  CARTPOLE  SUPPLYCHAIN

Table initialization  uniform on [0, 0.1]  zeros  uniform on [0, 0.1]
Learning rate ( episode #)  0.7  0.6
Max. episode length  200  200  200
Prior parameter μ₀ in (8)  0  0  0
Prior parameter λ₀ in (8)  1  1  1
Prior parameter α₀ in (8)  500  500  500
Prior parameter β₀ in (8)  500  500  500
Prior parameter α₀^ε for ε  1  10  1000
Prior parameter β₀^ε for ε  1 + 0.01  10 + 0.01  1000 + 0.01
PARAMETER / DESCRIPTION  GRIDWORLD  CARTPOLE  SUPPLYCHAIN

Network initialization  Glorot uniform  Glorot uniform  Glorot uniform
Network topology  16-25-25-4  4-12-12-2  102-100-100-100
Hidden activation  ReLU  ReLU  ReLU
Regularization  none  L2  none
State encoding  one-hot  none  one-hot
Learning rate  0.001  0.0005  0.001
Replay buffer size  2000  2000  3000
Batch size  24  32  64
Training epochs per batch  5  3  2
Max. episode length  200  200  200
Prior parameter μ₀ in (8)  0  0  0
Prior parameter λ₀ in (8)  1  1  1
Prior parameter α₀ in (8)  500  500  500
Prior parameter β₀ in (8)  500  500  500
Prior parameter α₀^ε for ε  1  5  25
Prior parameter β₀^ε for ε  1 + 0.01  5 + 0.01  25 + 0.01