DRL - 01 Introduction
ML 23-1 deep reinforcement learning
scenario of deep reinforcement learning
- learning to play Go
- Supervised vs Reinforcement
applications
Universe: https://openai.com/blog/universe/
difficulties of reinforcement learning
reward delay: actions that bring no immediate reward can look useless right now, but they influence what happens later and help obtain rewards in the future.
the agent's actions affect the subsequent data it receives: the agent needs to explore, trying both good and bad behaviors.
outline
Policy-based Approach - Learning an Actor
- machine learning $\approx$ looking for a function
the three steps of finding a function
DRL
neural network as actor
input: a vector or matrix, e.g. pixels
output: the probability of taking each action; the actor is stochastic (a minimal sketch follows)
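A minimal sketch of a network-as-actor (PyTorch; the sizes and the three-action layout are hypothetical): an observation vector goes in, a probability per action comes out, and the action is sampled rather than taken greedily.

```python
import torch
import torch.nn as nn

class Actor(nn.Module):
    """Policy network: observation in, action probabilities out.
    The sizes (obs_dim=128, n_actions=3) are hypothetical placeholders."""
    def __init__(self, obs_dim: int = 128, n_actions: int = 3):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, 64),
            nn.ReLU(),
            nn.Linear(64, n_actions),
        )

    def forward(self, obs: torch.Tensor) -> torch.Tensor:
        # softmax turns logits into a distribution over actions
        return torch.softmax(self.net(obs), dim=-1)

actor = Actor()
obs = torch.randn(1, 128)             # dummy observation (e.g. flattened pixels)
probs = actor(obs)                    # e.g. P(left), P(right), P(fire)
action = torch.multinomial(probs, 1)  # stochastic: sample, don't argmax
```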
goodness of function
supervised learning vs DRL
- pick the best
- gradient ascent
- add a baseline (spelled out after this list)
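Spelled out (following the usual policy-gradient notation, where $\tau^n$ is the $n$-th sampled trajectory, $R(\tau^n)$ its total reward, and $b$ the baseline):

$$
\theta \leftarrow \theta + \eta \nabla \bar{R}_\theta,
\qquad
\nabla \bar{R}_\theta \approx \frac{1}{N}\sum_{n=1}^{N}\sum_{t=1}^{T_n}\bigl(R(\tau^n)-b\bigr)\,\nabla \log p_\theta(a_t^n \mid s_t^n)
$$

Without the baseline, when rewards are always positive every sampled action's probability gets pushed up, only by different amounts; subtracting $b$ makes below-average trajectories less likely.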
critic
a critic does not pick actions; it evaluates how good an observation (state) is when a given actor is used
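One common way to make "evaluates an observation" concrete is the state value function; the discount factor $\gamma$ is an assumption here (the lecture may simply sum the rewards):

$$
V^\pi(s) = \mathbb{E}\!\left[\,\sum_{k\ge 0}\gamma^{k}\,r_{t+k}\;\middle|\;s_t = s,\ \text{actions sampled from }\pi\right]
$$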
Actor-Critic
ML 23-2 policy gradient (Supplementary Explanation)
ML 23-3 RL
interact with environments
the behaviors the machine learns affect what happens next, so all the actions in an episode are treated as a whole.
components
the environment and the reward function cannot be controlled; only the actor's behavior can be adjusted
critic
estimating the critic:
Monte-Carlo (MC):
Temporal difference (TD):
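The two estimates can be written as regression targets, where $G_a$ is the observed cumulated reward after visiting state $s_a$ and $(s_t, a_t, r_t, s_{t+1})$ is a single transition (discounting omitted):

$$
\text{MC: } V^\pi(s_a) \longleftrightarrow G_a
\qquad
\text{TD: } V^\pi(s_t) - V^\pi(s_{t+1}) \longleftrightarrow r_t
$$

MC must wait until the episode ends and $G_a$ has high variance; TD can learn from every single step but depends on how accurate $V^\pi(s_{t+1})$ already is.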
Q function: $Q^\pi(s, a)$ evaluates taking action $a$ at state $s$ and then following actor $\pi$ afterwards
if picking the best action requires $\arg\max_a Q(s, a)$ and the actions cannot be enumerated (e.g. continuous actions), this blows up; use PDPG instead (sketched under the next heading)
pathwise derivative policy gradient
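PDPG sidesteps the $\arg\max$ by treating the critic as differentiable and pushing the actor along $\partial Q/\partial a$; this is, in spirit, the actor update of DDPG. A minimal PyTorch sketch with hypothetical names and sizes:

```python
import torch
import torch.nn as nn

obs_dim, act_dim = 8, 2  # hypothetical sizes

# critic Q(s, a): scores a state-action pair
critic = nn.Sequential(nn.Linear(obs_dim + act_dim, 64), nn.ReLU(), nn.Linear(64, 1))
# deterministic actor a = pi(s): outputs a continuous action directly
actor = nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU(), nn.Linear(64, act_dim), nn.Tanh())
opt = torch.optim.Adam(actor.parameters(), lr=1e-3)

s = torch.randn(32, obs_dim)            # a batch of states
a = actor(s)                            # actor proposes actions
q = critic(torch.cat([s, a], dim=-1))   # critic scores them

# pathwise derivative: no argmax over actions; instead backpropagate
# dQ/da through the actor and do gradient ascent on Q (minimize -Q).
# Only the actor's parameters are updated here; the critic is trained
# separately (e.g. with a TD target).
loss = -q.mean()
opt.zero_grad()
loss.backward()
opt.step()
```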
A3C (Asynchronous Advantage Actor-Critic): multiple copies of the actor-critic interact with their own environments in parallel and update the shared parameters asynchronously
imitation learning
similar to GAN: in inverse reinforcement learning the actor plays the role of the generator and the learned reward function plays the role of the discriminator