
Tuesday, December 19, 2017

Q-learning is a model-free reinforcement learning technique. Specifically, Q-learning can be used to find an optimal action-selection policy for any given (finite) Markov decision process (MDP). It works by learning an action-value function, often denoted Q(s, a), which gives the expected utility of taking a given action a in a given state s and following an optimal policy thereafter. A policy, often denoted π, is a rule that the agent follows in selecting actions, given the state it is in. Once such an action-value function has been learned, the optimal policy can be constructed by simply selecting the action with the highest value in each state. One of the strengths of Q-learning is that it can compare the expected utility of the available actions without requiring a model of the environment. Additionally, Q-learning can handle problems with stochastic transitions and rewards without requiring any adaptations. It has been proven that for any finite MDP, Q-learning eventually finds an optimal policy, in the sense that the expected value of the total reward over all successive steps, starting from the current state, is the maximum achievable.

Algorithm



The problem model consists of an agent, a set of states S, and a set of actions per state A. By performing an action a ∈ A, the agent can move from state to state. Executing an action in a specific state provides the agent with a reward (a numerical score). The goal of the agent is to maximize its total (future) reward. It does this by learning which action is optimal for each state: the action that has the highest long-term reward. This reward is a weighted sum of the expected values of the rewards of all future steps starting from the current state, where the weight for a step Δt steps into the future is γ^Δt. Here, γ is a number between 0 and 1 (0 ≤ γ ≤ 1) called the discount factor, which trades off the importance of sooner versus later rewards. γ may also be interpreted as the likelihood to succeed (or survive) at each step.
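
To illustrate how the discount factor weights future rewards, the short Python sketch below computes a discounted return for a made-up reward sequence; the rewards and the values of γ are illustrative assumptions, not values from the text.

  # Minimal sketch: computing a discounted return for an example reward sequence.
  # The rewards and the gamma values below are illustrative assumptions.

  def discounted_return(rewards, gamma):
      # Sum of gamma**t * r_t over the reward sequence.
      return sum((gamma ** t) * r for t, r in enumerate(rewards))

  rewards = [1.0, 0.0, 0.0, 10.0]               # rewards received at steps 0, 1, 2, 3
  print(discounted_return(rewards, gamma=0.9))  # 1 + 0.9**3 * 10 = 8.29
  print(discounted_return(rewards, gamma=0.5))  # 1 + 0.5**3 * 10 = 2.25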

The algorithm, therefore, has a function that calculates the quality of a state-action combination:

Q : S × A → ℝ.

Before learning has started, Q is initialized to a possibly arbitrary fixed value (chosen by the programmer). Then, at each time t, the agent selects an action a_t, observes a reward r_t, and enters a new state s_{t+1} (which may depend on both the previous state s_t and the selected action), and Q is updated. The core of the algorithm is a simple value iteration update, using the weighted average of the old value and the new information:

Q(s_t, a_t) ← (1 − α) · Q(s_t, a_t) + α · ( r_t + γ · max_a Q(s_{t+1}, a) )

The first term is the old value weighted by (1 − α); the term in parentheses is the learned value, consisting of the reward plus the discounted estimate of the optimal future value, max_a Q(s_{t+1}, a),

where r_t is the reward observed for the current state s_t, and α is the learning rate (0 < α ≤ 1).

An episode of the algorithm ends when state s_{t+1} is a final or terminal state. However, Q-learning can also learn in non-episodic tasks. If the discount factor is lower than 1, the action values are finite even if the problem can contain infinite loops.

Note that for all final states s_f, Q(s_f, a) is never updated, but is set to the reward value r observed for state s_f. In most cases, Q(s_f, a) can be taken to equal zero.
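
The update rule above translates directly into a tabular implementation. The following Python sketch shows one way an episodic, tabular Q-learning loop might look; the environment interface (env.reset(), env.step()), the ε-greedy exploration strategy, and the hyperparameter values are illustrative assumptions rather than part of the original description.

  import random
  from collections import defaultdict

  # Minimal sketch of tabular Q-learning. The environment object `env`, its
  # reset()/step() interface, the epsilon-greedy exploration, and all
  # hyperparameter values are assumed for illustration.

  def q_learning(env, actions, episodes=500, alpha=0.1, gamma=0.9, epsilon=0.1):
      Q = defaultdict(float)  # Q-table mapping (state, action) pairs to values, initialized to 0

      for _ in range(episodes):
          state = env.reset()
          done = False
          while not done:
              # Epsilon-greedy action selection: explore with probability epsilon.
              if random.random() < epsilon:
                  action = random.choice(actions)
              else:
                  action = max(actions, key=lambda a: Q[(state, a)])

              next_state, reward, done = env.step(action)

              # Core update: move Q(s, a) towards reward + gamma * max_a Q(s', a).
              best_next = 0.0 if done else max(Q[(next_state, a)] for a in actions)
              Q[(state, action)] += alpha * (reward + gamma * best_next - Q[(state, action)])

              state = next_state
      return Q

The incremental form Q += α·(target − Q) used here is algebraically identical to the weighted average (1 − α)·Q + α·target shown above; terminal states contribute a future value of zero, matching the convention described in the preceding paragraph.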

Influence of variables on the algorithm



Learning rate

The learning rate or step size determines to what extent newly acquired information overrides old information. A factor of 0 makes the agent learn nothing, while a factor of 1 makes the agent consider only the most recent information. In fully deterministic environments, a learning rate of α_t = 1 is optimal. When the problem is stochastic, the algorithm still converges under some technical conditions on the learning rate that require it to decrease to zero. In practice, a constant learning rate is often used, such as α_t = 0.1 for all t.
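
As a concrete example of a learning rate that decreases to zero, one common practical choice is to decay α with the number of times a state-action pair has been visited; the 1/(1 + visits) schedule below is an assumed illustration, not a schedule prescribed by the text.

  from collections import defaultdict

  # Illustrative sketch: a per-(state, action) learning rate that decays towards
  # zero as that pair is visited more often. The 1/(1 + visits) schedule is an
  # assumed example of a step size satisfying the usual decreasing conditions.

  visit_counts = defaultdict(int)

  def learning_rate(state, action):
      visit_counts[(state, action)] += 1
      return 1.0 / (1.0 + visit_counts[(state, action)])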

Discount factor

The discount factor γ determines the importance of future rewards. A factor of 0 makes the agent "myopic" (or short-sighted) by considering only current rewards, while a factor approaching 1 makes it strive for a long-term high reward. If the discount factor meets or exceeds 1, the action values may diverge. For γ = 1, without a terminal state, or if the agent never reaches one, all environment histories become infinitely long, and utilities with additive, undiscounted rewards are generally infinite. Even with a discount factor only slightly lower than 1, Q-learning leads to propagation of errors and instabilities when the value function is approximated with an artificial neural network. In that case, starting with a lower discount factor and increasing it towards its final value is known to accelerate learning.

Initial conditions (Q0)

Since Q-learning is an iterative algorithm, it implicitly assumes an initial condition before the first update occurs. High initial values, also known as "optimistic initial conditions", can encourage exploration: no matter which action is selected, the update rule will cause it to have a lower value than the other alternatives, thus increasing their choice probability. It has recently been suggested that the first reward r could be used to reset the initial conditions. According to this idea, the first time an action is taken, the reward is used to set the value of Q. This allows immediate learning in the case of fixed deterministic rewards. Surprisingly, this resetting-of-initial-conditions (RIC) approach appears to be consistent with human behaviour in repeated binary choice experiments.
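
A minimal sketch of how optimistic initial values might be set up in a tabular implementation is shown below; the particular optimistic value (5.0) and the defaultdict-based table are illustrative assumptions.

  from collections import defaultdict

  # Illustrative sketch of "optimistic initial conditions": every unseen
  # (state, action) pair starts at a value higher than any realistically
  # achievable return, so untried actions keep looking attractive until updates
  # pull their values down. The value 5.0 is an arbitrary example and should
  # exceed the maximum possible discounted return of the problem at hand.

  OPTIMISTIC_VALUE = 5.0
  Q = defaultdict(lambda: OPTIMISTIC_VALUE)

  # Greedy selection over such a table naturally favours actions that have not
  # yet been tried, since their values are still at the optimistic default.
  def greedy_action(state, actions):
      return max(actions, key=lambda a: Q[(state, a)])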

Implementation



Q-learning at its simplest uses tables to store data. This quickly loses viability as the state/action space of the system it is monitoring/controlling grows. One solution to this problem is to use an (adapted) artificial neural network as a function approximator, as demonstrated by Tesauro in his backgammon-playing temporal difference learning research.

More generally, Q-learning can be combined with function approximation. This makes it possible to apply the algorithm to larger problems, even when the state space is continuous and therefore infinitely large. Additionally, it may speed up learning in finite problems, because the algorithm can generalize earlier experiences to previously unseen states.
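
As a sketch of what Q-learning with function approximation can look like, the snippet below uses a linear approximator over state features rather than the neural-network setup referenced above; the feature dimensions, action count, and hyperparameters are assumptions for illustration.

  import numpy as np

  # Illustrative sketch: Q-learning with a linear function approximator.
  # Q(s, a) is approximated as the dot product of a per-action weight vector
  # with a feature vector phi(s). phi_s and phi_next are 1-D NumPy arrays.

  N_FEATURES = 8
  N_ACTIONS = 4
  weights = np.zeros((N_ACTIONS, N_FEATURES))

  def q_value(phi_s, action):
      return weights[action] @ phi_s

  def update(phi_s, action, reward, phi_next, done, alpha=0.01, gamma=0.9):
      # Bootstrapped target: reward plus discounted best estimate for the next state.
      best_next = 0.0 if done else max(q_value(phi_next, a) for a in range(N_ACTIONS))
      td_error = reward + gamma * best_next - q_value(phi_s, action)
      # Semi-gradient update of the weights for the taken action only.
      weights[action] += alpha * td_error * phi_s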

Early study



Q-learning was first introduced by Watkins in 1989. The convergence proof was presented later by Watkins and Dayan in 1992.

The problem Watkins was solving was named “Learning from delayed rewards”, which was the title of his PhD thesis. Eight years earlier, in 1981, the same problem, under the name “delayed reinforcement learning”, was solved by a system named Crossbar Adaptive Array (CAA); an initial published report was given in 1982. The memory matrix W(a,s) of the presented CAA architecture is exactly the same as the Q-table of Q-learning. The architecture shown in a figure introduced the term “state evaluation” in reinforcement learning research. The crossbar learning algorithm, written in mathematical pseudocode in the paper, performs the following computation in each iteration: 1) in state s perform action a; 2) receive consequence state s’; 3) compute state evaluation v(s’); 4) update crossbar value W’(a,s) = W(a,s) + v(s’). The term “secondary reinforcement” is borrowed from animal learning theory to model state-value backpropagation: the state value v(s’) of the consequence situation is backpropagated to the previously encountered situation s. In a crossbar fashion, CAA computes state values vertically and actions horizontally. Demonstration graphs showing delayed reinforcement learning contained states represented by emoticons (desirable, undesirable, and neutral states), which were computed by the state evaluation function. In 1997, this learning system was recognized as a forerunner of the Q-learning algorithm.

Variants



A recent application of Q-learning to deep learning, by Google DeepMind, titled "deep reinforcement learning" or "deep Q-networks", has been successful at playing some Atari 2600 games at expert human levels. Preliminary results were presented in 2014, with a paper published in February 2015 in Nature.

Because the maximum approximated action value is used in the Q-learning update, Q-learning can sometimes overestimate action values in noisy environments, slowing learning. A recent variant called Double Q-learning was proposed to correct this. This algorithm was later combined with deep learning, as in the DQN algorithm (see above), resulting in Double DQN, which was shown to outperform the original DQN algorithm.
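
A minimal sketch of the tabular Double Q-learning idea follows: two value tables are kept, and on each update one table chooses the maximizing next action while the other supplies its value estimate, which counteracts the overestimation caused by taking a maximum over noisy estimates. The table structure and hyperparameters are illustrative assumptions.

  import random
  from collections import defaultdict

  # Illustrative sketch of tabular Double Q-learning with two tables, QA and QB.

  QA = defaultdict(float)
  QB = defaultdict(float)

  def double_q_update(state, action, reward, next_state, actions,
                      done, alpha=0.1, gamma=0.9):
      # Randomly choose which table to update on this step.
      learn, evaluate = (QA, QB) if random.random() < 0.5 else (QB, QA)

      if done:
          target = reward
      else:
          # The table being learned picks the action; the other table values it.
          best_action = max(actions, key=lambda a: learn[(next_state, a)])
          target = reward + gamma * evaluate[(next_state, best_action)]

      learn[(state, action)] += alpha * (target - learn[(state, action)])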

Delayed Q-learning is an alternative implementation of the online Q-learning algorithm with probably approximately correct (PAC) learning guarantees.

Greedy GQ is a variant of Q-learning for use in combination with (linear) function approximation. The advantage of Greedy GQ is that convergence guarantees can be given even when function approximation is used to estimate the action values.

Q-learning may suffer from a slow rate of convergence, especially when the discount factor γ is close to one. Speedy Q-learning, a newer variant of the Q-learning algorithm, addresses this problem and achieves a slightly better rate of convergence than model-based methods such as value iteration.

See also



  • Reinforcement learning
  • Temporal difference learning
  • SARSA
  • Iterated prisoner's dilemma
  • Game theory

References



External links



  • Watkins, C.J.C.H. (1989). Learning from Delayed Rewards. PhD thesis, Cambridge University, Cambridge, England.
  • Strehl, Li, Wiewiora, Langford, Littman (2006). PAC model-free reinforcement learning.
  • Reinforcement Learning: An Introduction by Richard Sutton and Andrew S. Barto, an online textbook. See "6.5 Q-Learning: Off-Policy TD Control".
  • Piqle: a Generic Java Platform for Reinforcement Learning
  • Reinforcement Learning Maze, a demonstration of guiding an ant through a maze using Q-learning.
  • Q-learning work by Gerald Tesauro
  • Q-learning work by Tesauro Citeseer Link
  • Q-learning algorithm implemented in processing.org language
  • Solution for the pole balancing problem with Q(lambda) / SARSA(lambda) and the Fourier basis in JavaScript


 