#noindex
AKA '''마코프 의사결정 과정, 마르코프 결정 프로세스'''

Sub:
partially-observable Markov decision process // [[partial_observability]]
 Ggl:POMDP 
 Ggl:"partially-observable Markov decision process"

MKL
Among [[마르코프_과정,Markov_process]] (Markov processes), ...
[[마르코프_연쇄,Markov_chain]]
[[강화학습,reinforcement_learning]]
[[마르코프_성질,Markov_property]]
DQN and [[Q학습,Q-learning]] Q-learning Ggl:Q-learning Naver:Q-learning
{
topics, mkl/del.

REL
[[마르코프_결정과정,Markov_decision_process,MDP]]

timestep
[[상태,state]]
[[partial_observability]] =,partial_observability . 
 {
 '''partial observability'''
 '''partially observable''' adj.

 "partial observability"
 Ggl:"partial observability"

 Up: [[observability]] =,observability . observability
  {
  Korean renderings: 관찰 가능성 | 관측 가능성, 가관찰성, 가관측성, ... TBD... Ndict:observability Naver:observability Ggl:observability Ggl:"definition of observability"

  REL [[관측,observation]] and/or [[관찰,observation]]

  Sub:
  full observability (the usual term; "total observability" is not standard)
  [[partial_observability]]
  }
 }

See Ggl:"Deep Recurrent Q-Learning for Partially Observable MDPs"

}

----
Defined as a 4-tuple, commonly written $(S, A, P, R)$: states, actions, transition probabilities, rewards.

----
$\{T, S, A_s, p_t(\cdot | s, a), r_t(s, a): t \in T, s \in S, a \in A_s\}$

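$T \subseteq [0,\infty)$ : set of decision epochs
The times at which the decision maker makes a decision.
The problem is finite-horizon or infinite-horizon depending on whether $T$ is finite or infinite. (Korean for horizon: 지평선?)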

$S$ : state space, set of states

$A_s$ : set of actions that are possible when the state of the system is $s\in S$
$A = \bigcup_{s\in S} A_s$ : action space, set of all actions

$p_t(\cdot|s,a)$ : transition probabilities
Specify how the state of the system moves from one decision epoch to the next (given that $T$ is discrete).

$r_t(s,a)$ : rewards

(from ㄷㅎㅈ lecture notes)
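
A minimal sketch of the tuple as a data structure, to make the five components concrete. It is not from the lecture notes: the class name `MDP`, the method `is_well_formed`, and the two-state "machine" example (states `ok`/`broken`, its probabilities and rewards) are all hypothetical choices for illustration, and $T$ is assumed discrete and finite.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class MDP:
    epochs: List[int]                                        # T: decision epochs
    states: List[str]                                        # S: state space
    actions: Dict[str, List[str]]                            # A_s: actions allowed in state s
    transition: Callable[[int, str, str], Dict[str, float]]  # p_t(. | s, a)
    reward: Callable[[int, str, str], float]                 # r_t(s, a)

    def is_well_formed(self, tol: float = 1e-9) -> bool:
        """Check that every p_t(. | s, a) is a probability distribution over S."""
        for t in self.epochs:
            for s in self.states:
                for a in self.actions[s]:
                    dist = self.transition(t, s, a)
                    if any(s2 not in self.states or p < 0 for s2, p in dist.items()):
                        return False
                    if abs(sum(dist.values()) - 1.0) > tol:
                        return False
        return True


# Hypothetical toy instance: a machine that is either "ok" or "broken".
P = {
    ("ok", "run"): {"ok": 0.5, "broken": 0.5},
    ("ok", "maintain"): {"ok": 1.0},
    ("broken", "repair"): {"ok": 1.0},
}
R = {("ok", "run"): 4.0, ("ok", "maintain"): 3.0, ("broken", "repair"): -2.0}

mdp = MDP(
    epochs=[0, 1],
    states=["ok", "broken"],
    actions={"ok": ["run", "maintain"], "broken": ["repair"]},
    transition=lambda t, s, a: P[(s, a)],  # time-homogeneous here: t is ignored
    reward=lambda t, s, a: R[(s, a)],
)
```

Here `transition` and `reward` take `t` as their first argument only to mirror the subscripts in $p_t$ and $r_t$; the toy instance ignores it.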
----
Decision rule $\delta_t$
Tells the decision maker how to choose an action at decision epoch $t$.

Policy $\pi$
A [[수열,sequence]] of decision rules, one for every decision epoch: (δ,,1,,, δ,,2,,, …)
If the same decision rule is used at every decision epoch, π is called ''stationary''.
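
A small sketch of these two definitions, assuming the simplest case where a decision rule is deterministic and depends only on the current state (decision rules can be more general than this). The type aliases, the function `is_stationary`, and the example state/action names are hypothetical.

```python
from typing import Dict, List

DecisionRule = Dict[str, str]  # delta_t: maps each state to the action to take
Policy = List[DecisionRule]    # pi = (delta_1, delta_2, ...), one rule per epoch


def is_stationary(policy: Policy) -> bool:
    """A policy is stationary if the same decision rule is used at every epoch."""
    return all(rule == policy[0] for rule in policy)


always_run: DecisionRule = {"ok": "run", "broken": "repair"}
careful: DecisionRule = {"ok": "maintain", "broken": "repair"}

stationary_policy: Policy = [always_run, always_run, always_run]
changing_policy: Policy = [careful, always_run]
```

With these definitions `is_stationary(stationary_policy)` is true and `is_stationary(changing_policy)` is false.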

Value function $v_t(s_t)$
Maximum total expected reward starting in state $s_t$, from decision epoch $t$ onward
$v_t(s_t)=\max_{a_t\in A_{s_t}}\left\lbrace r_t(s_t,a_t)+E[v_{t+1}(s_{t+1}) \mid s_t, a_t]\right\rbrace$
The best action $a_t$ is the one that maximizes the sum of
 $r_t(\cdots)$ - the immediate reward expected in period $t$
 $E[\cdots]$ - the expected maximum total reward-to-go over periods t+1, t+2, ...
CHK

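The lecture mentions the Bellman optimality equation. Is it the equation above? (The recursion for $v_t$ above has the standard form of a finite-horizon Bellman optimality equation.)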
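
The recursion for $v_t$ can be evaluated by backward induction: start from a terminal value and work back one decision epoch at a time. The following is a sketch under stated assumptions, not the lecture's method: finite state and action sets, epochs $0, 1, \dots, N-1$, terminal value 0 unless given, and a hypothetical two-state example. The function name `backward_induction` and all example numbers are made up for illustration.

```python
def backward_induction(horizon, states, actions, p, r, terminal=None):
    """Return (v, rules): v[t][s] is the optimal value, rules[t][s] a maximizing action.

    p(t, s, a) -> dict mapping next state to probability; r(t, s, a) -> reward.
    """
    terminal = terminal or {}
    v = {horizon: {s: terminal.get(s, 0.0) for s in states}}
    rules = {}
    for t in range(horizon - 1, -1, -1):
        v[t], rules[t] = {}, {}
        for s in states:
            best_action, best_value = None, float("-inf")
            for a in actions[s]:
                # r_t(s, a) + E[v_{t+1}(s_{t+1}) | s, a]
                q = r(t, s, a) + sum(prob * v[t + 1][s2] for s2, prob in p(t, s, a).items())
                if q > best_value:
                    best_action, best_value = a, q
            v[t][s], rules[t][s] = best_value, best_action
    return v, rules


# Hypothetical toy problem: a machine that is either "ok" or "broken".
P = {
    ("ok", "run"): {"ok": 0.5, "broken": 0.5},
    ("ok", "maintain"): {"ok": 1.0},
    ("broken", "repair"): {"ok": 1.0},
}
R = {("ok", "run"): 4.0, ("ok", "maintain"): 3.0, ("broken", "repair"): -2.0}

v, rules = backward_induction(
    horizon=2,
    states=["ok", "broken"],
    actions={"ok": ["run", "maintain"], "broken": ["repair"]},
    p=lambda t, s, a: P[(s, a)],
    r=lambda t, s, a: R[(s, a)],
)
print(v[0]["ok"], rules[0]["ok"], rules[1]["ok"])  # 7.0 maintain run
```

In this toy problem the maximizing action in state `ok` differs between epoch 0 and epoch 1, so the resulting policy is not stationary.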

= tmp bmks ko =
https://jrc-park.tistory.com/293
 At a given [[시간,time]] ''t'', an MDP represents the [[상태,state]] as a [[확률변수,random_variable]] ''X,,t,,''.
 [[마르코프_연쇄,Markov_chain]] = [[마르코프_과정,Markov_process]], and an MDP is nothing more than an MC (= MP) with actions added ''([[액션,action]]. [[행동,action]]? Is [[작용,action]] the translation used in physics?)''. ''(In the name, it is [[결정,decision]] that is added.)''
 Written in terms of the [[상태전이분포,state_transition_distribution]] ([[상태전이,state_transition]] ([[상태,state]] [[전이,transition]]) [[분포,distribution]]), it is ... , and as can be seen there, if the action is independent of the state, the MDP is identical to an MC.
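
A related standard observation, sketched here as an illustration and not taken from the linked post: once the action in every state is fixed by a stationary decision rule, the next-state distribution depends only on the current state, so what remains is an ordinary Markov chain. The function name `induced_chain` and the example numbers are hypothetical, and the transition probabilities are assumed not to depend on time.

```python
def induced_chain(states, p, rule):
    """Transition matrix of the Markov chain obtained by fixing rule[s] in each state.

    p(s, a) -> dict mapping next state to probability.
    Returns chain[s][s2] = p(s2 | s, rule[s]).
    """
    return {s: {s2: p(s, rule[s]).get(s2, 0.0) for s2 in states} for s in states}


# Hypothetical toy MDP: a machine that is either "ok" or "broken".
P = {
    ("ok", "run"): {"ok": 0.5, "broken": 0.5},
    ("ok", "maintain"): {"ok": 1.0},
    ("broken", "repair"): {"ok": 1.0},
}
states = ["ok", "broken"]

chain = induced_chain(states, lambda s, a: P[(s, a)], {"ok": "run", "broken": "repair"})
print(chain["ok"])      # {'ok': 0.5, 'broken': 0.5}
print(chain["broken"])  # {'ok': 1.0, 'broken': 0.0}
```

Each row of `chain` sums to 1, and no action appears in it any more.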

----
Twins:
[[Zeta:마르코프_결정_과정]]
[[WpKo:마르코프_결정_과정]]

https://itwiki.kr/w/마르코프_결정_프로세스

tmp twins en:
https://developers.google.com/machine-learning/glossary?hl=ko#markov-decision-process-mdp

tmp bmks ko:
https://daheekwon.github.io/MDP/

Up: [[마르코프_과정,Markov_process]] [[결정,decision]] [[과정,process]] [[결정과정,decision_process]]