DRL - 01 Introduction
ML 23-1 deep reinforcement learning
scenario of deep reinforcement learning
- learning to play Go
- Supervised vs Reinforcement
applications
Universe: https://openai.com/blog/universe/
difficulties of reinforcement learning
reward delay: actions that bring no immediate reward can look useless right now, but they influence what happens later and help obtain rewards in the future.
the agent's actions affect the subsequent data it receives: the agent needs to explore, trying both good and bad behaviors.
outline
Policy-based Approach - Learning an Actor
- machine learning $\approx$ looking for a function
the three steps of finding a function
DRL
neural network as actor
input: a vector or matrix, e.g. pixels
output: the probability of taking each action; the actor is stochastic (a minimal sketch follows)
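A minimal sketch of a network-as-actor (PyTorch; the sizes and the three-action layout are hypothetical): an observation vector goes in, a probability per action comes out, and the action is sampled rather than taken greedily.

```python
import torch
import torch.nn as nn

class Actor(nn.Module):
    """Policy network: observation in, action probabilities out.
    The sizes (obs_dim=128, n_actions=3) are hypothetical placeholders."""
    def __init__(self, obs_dim: int = 128, n_actions: int = 3):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, 64),
            nn.ReLU(),
            nn.Linear(64, n_actions),
        )

    def forward(self, obs: torch.Tensor) -> torch.Tensor:
        # softmax turns logits into a distribution over actions
        return torch.softmax(self.net(obs), dim=-1)

actor = Actor()
obs = torch.randn(1, 128)             # dummy observation (e.g. flattened pixels)
probs = actor(obs)                    # e.g. P(left), P(right), P(fire)
action = torch.multinomial(probs, 1)  # stochastic: sample, don't argmax
```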
goodness of function
supervised learning vs DRL
- pick the best
- gradient ascent
- add a baseline (spelled out after this list)
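Spelled out (following the usual policy-gradient notation, where $\tau^n$ is the $n$-th sampled trajectory, $R(\tau^n)$ its total reward, and $b$ the baseline):

$$
\theta \leftarrow \theta + \eta \nabla \bar{R}_\theta,
\qquad
\nabla \bar{R}_\theta \approx \frac{1}{N}\sum_{n=1}^{N}\sum_{t=1}^{T_n}\bigl(R(\tau^n)-b\bigr)\,\nabla \log p_\theta(a_t^n \mid s_t^n)
$$

Without the baseline, when rewards are always positive every sampled action's probability gets pushed up, only by different amounts; subtracting $b$ makes below-average trajectories less likely.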
critic
a critic does not pick actions; it evaluates how good an observation (state) is when a given actor is used
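One common way to make "evaluates an observation" concrete is the state value function; the discount factor $\gamma$ is an assumption here (the lecture may simply sum the rewards):

$$
V^\pi(s) = \mathbb{E}\!\left[\,\sum_{k\ge 0}\gamma^{k}\,r_{t+k}\;\middle|\;s_t = s,\ \text{actions sampled from }\pi\right]
$$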
Actor-Critic
ML 23-2 policy gradient (Supplementary Explanation)
ML 23-3 RL
interact with environments
the behaviors the machine learns affect what happens next, so all the actions in an episode are treated as a whole.
components
the environment and the reward function cannot be controlled; only the actor's behavior can be adjusted
critic
estimating the critic:
Monte-Carlo (MC):
Temporal difference (TD):
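The two estimates can be written as regression targets, where $G_a$ is the observed cumulated reward after visiting state $s_a$ and $(s_t, a_t, r_t, s_{t+1})$ is a single transition (discounting omitted):

$$
\text{MC: } V^\pi(s_a) \longleftrightarrow G_a
\qquad
\text{TD: } V^\pi(s_t) - V^\pi(s_{t+1}) \longleftrightarrow r_t
$$

MC must wait until the episode ends and $G_a$ has high variance; TD can learn from every single step but depends on how accurate $V^\pi(s_{t+1})$ already is.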
Q function: $Q^\pi(s, a)$ evaluates taking action $a$ at state $s$ and then following actor $\pi$ afterwards
if picking the best action requires $\arg\max_a Q(s, a)$ and the actions cannot be enumerated (e.g. continuous actions), this blows up; use PDPG instead (sketched under the next heading)
pathwise derivative policy gradient
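PDPG sidesteps the $\arg\max$ by treating the critic as differentiable and pushing the actor along $\partial Q/\partial a$; this is, in spirit, the actor update of DDPG. A minimal PyTorch sketch with hypothetical names and sizes:

```python
import torch
import torch.nn as nn

obs_dim, act_dim = 8, 2  # hypothetical sizes

# critic Q(s, a): scores a state-action pair
critic = nn.Sequential(nn.Linear(obs_dim + act_dim, 64), nn.ReLU(), nn.Linear(64, 1))
# deterministic actor a = pi(s): outputs a continuous action directly
actor = nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU(), nn.Linear(64, act_dim), nn.Tanh())
opt = torch.optim.Adam(actor.parameters(), lr=1e-3)

s = torch.randn(32, obs_dim)            # a batch of states
a = actor(s)                            # actor proposes actions
q = critic(torch.cat([s, a], dim=-1))   # critic scores them

# pathwise derivative: no argmax over actions; instead backpropagate
# dQ/da through the actor and do gradient ascent on Q (minimize -Q).
# Only the actor's parameters are updated here; the critic is trained
# separately (e.g. with a TD target).
loss = -q.mean()
opt.zero_grad()
loss.backward()
opt.step()
```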
A3C (Asynchronous Advantage Actor-Critic): multiple copies of the actor-critic interact with their own environments in parallel and update the shared parameters asynchronously
imitation learning
similar to GAN: in inverse reinforcement learning the actor plays the role of the generator and the learned reward function plays the role of the discriminator