REINFORCEMENT LEARNING

REINFORCEMENT LEARNING
March 10, 1998
조동연 (Cho Dong-Yeon)

INTRODUCTION (1)
• Agent, state, actions, policy
• Topic: the agent's goal is defined by a reward function
• Control policy: choose the actions that yield the maximum cumulative reward from any initial state
• Examples: manufacturing optimization problems, sequential scheduling problems

INTRODUCTION (2)

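INTRODUCTION (3)
• Function approximation problems: $\pi : S \to A$, $a = \pi(s)$
• Delayed reward: training examples do not have the form $\langle s, \pi(s) \rangle$. The reward is given for a sequence of actions, so the temporal credit assignment problem arises.
• Exploration: exploration of unknown states and actions (gains new information) versus exploitation of states and actions already learned (maximizes cumulative reward)
• Partially observable states: in practice the agent cannot obtain complete information about the environment, so it must also consider earlier observations when choosing an action.
• Life-long learning: the agent is required to learn several related tasks.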

THE LEARNING TASK(1)
• The problem of learning sequential control strategies
  - Agent's actions: deterministic or nondeterministic
  - Trained by: an expert or itself
• Markov decision process (MDP)
  - The agent can perceive a set of states $S$ and has a set of actions $A$.
  - $s_{t+1} = \delta(s_t, a_t)$, $r_t = r(s_t, a_t)$
  - $\delta$ and $r$ belong to the environment, and the agent does not need to know them.
  - $\delta(s_t, a_t)$ and $r(s_t, a_t)$ depend only on the current state and action, not on earlier states or actions.

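THE LEARNING TASK(2)
• Task of the agent: learn a policy $\pi : S \to A$, $\pi(s_t) = a_t$
• Discounted cumulative reward: $V^{\pi}(s_t) = r_t + \gamma r_{t+1} + \gamma^2 r_{t+2} + \cdots = \sum_{i=0}^{\infty} \gamma^i r_{t+i}$, with $0 \le \gamma < 1$
• Optimal policy: $\pi^* = \arg\max_{\pi} V^{\pi}(s)$ for all $s$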
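As a small illustration of the discounted cumulative reward named on this slide, the sketch below sums a finite reward sequence with discount factor gamma. The function name `discounted_return` and the sample rewards are assumptions made for this example, not part of the original slides.

```python
def discounted_return(rewards, gamma):
    """Sum of gamma**i * rewards[i] over a finite reward sequence."""
    total = 0.0
    # Work backwards so each step applies G_t = r_t + gamma * G_{t+1}.
    for r in reversed(rewards):
        total = r + gamma * total
    return total


# Hypothetical episode: no reward for two steps, then 100 at the goal.
print(discounted_return([0, 0, 100], gamma=0.9))
```

With gamma = 0.9 the reward of 100 received two steps later is worth about 81 at the start, which is why a policy that reaches the goal sooner is preferred.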

THE LEARNING TASK(3)

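Q LEARNING (1)
• Training examples do not have the form $\langle s, a \rangle$; only the rewards $r(s_i, a_i)$ are available, so it is difficult to learn $\pi : S \to A$ directly.
• Evaluation function $V^*$: $\pi^*(s) = \arg\max_a [\, r(s, a) + \gamma V^*(\delta(s, a)) \,]$
  - Usable only when $\delta$ and $r$ are perfectly known
  - In real problems the outcomes of these functions cannot be predicted exactly (e.g., robot control)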

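Q LEARNING (2)
The Q Function
• Evaluation function: $Q(s, a) = r(s, a) + \gamma V^*(\delta(s, a))$
• Optimal action: $\pi^*(s) = \arg\max_a Q(s, a)$
• The optimal action can be selected even when $\delta$ and $r$ are unknown.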

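Q LEARNING (3)
An Algorithm for Learning Q
• Iterative approximation: $V^*(s) = \max_{a'} Q(s, a')$, so $Q(s, a) = r(s, a) + \gamma \max_{a'} Q(\delta(s, a), a')$
• Training rule: $\hat{Q}(s, a) \leftarrow r + \gamma \max_{a'} \hat{Q}(s', a')$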

Q LEARNING (4)
Q learning algorithm
• For each $s, a$ initialize the table entry $\hat{Q}(s, a)$ to zero
• Observe the current state $s$
• Do forever:
  • Select an action $a$ and execute it
  • Receive immediate reward $r$
  • Observe the new state $s'$
  • Update the table entry for $\hat{Q}(s, a)$ as follows: $\hat{Q}(s, a) \leftarrow r + \gamma \max_{a'} \hat{Q}(s', a')$
  • $s \leftarrow s'$
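The table-based algorithm on this slide can be sketched as follows, assuming the standard deterministic training rule Q(s,a) <- r + gamma * max over a' of Q(s',a'). The five-state line world, the reward of 100, the purely random action choice, and the split into episodes (instead of "do forever") are all assumptions made for this illustration.

```python
import random
from collections import defaultdict

# Hypothetical world: states 0..4 on a line; state 4 is an absorbing goal.
N_STATES, GOAL = 5, 4
ACTIONS = (-1, +1)  # move left / move right


def delta(s, a):
    """Deterministic transition: move one step, clipped to the line."""
    if s == GOAL:
        return GOAL
    return min(max(s + a, 0), N_STATES - 1)


def reward(s, a):
    """Reward 100 for entering the goal, 0 for every other transition."""
    return 100 if s != GOAL and delta(s, a) == GOAL else 0


def q_learning(episodes=200, gamma=0.9, seed=0):
    rng = random.Random(seed)
    q = defaultdict(float)  # every table entry starts at zero
    for _ in range(episodes):
        s = rng.randrange(GOAL)  # random non-goal start state
        while s != GOAL:
            a = rng.choice(ACTIONS)  # random exploration
            r, s_next = reward(s, a), delta(s, a)
            # Deterministic training rule from the slide.
            q[(s, a)] = r + gamma * max(q[(s_next, b)] for b in ACTIONS)
            s = s_next
    return q


q = q_learning()
# Greedy policy read off the learned table: best action in each non-goal state.
policy = {s: max(ACTIONS, key=lambda a: q[(s, a)]) for s in range(GOAL)}
print(policy)
```

Note that the agent never calls `delta` or `reward` to plan ahead; it only observes their results after acting, which matches the slide's point that the two functions need not be known to the agent.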

Q LEARNING (5)
An Illustrative Example
• Two properties

Q LEARNING (6)
Convergence
• Convergence conditions
  - The system is a deterministic MDP.
  - The immediate reward values are bounded.
  - The agent selects actions in such a fashion that it visits every possible state-action pair infinitely often.
• Theorem: $\hat{Q}_n(s, a)$ converges to $Q(s, a)$ as $n \to \infty$, for all $s, a$.

Q LEARNING (7) Proof.

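Q LEARNING (8)
Experimentation Strategies
• How the agent selects actions
  - Select the action $a$ that maximizes $\hat{Q}(s, a)$: the agent then never tries the other actions.
  - Probabilistic method: $P(a_i \mid s) = \dfrac{k^{\hat{Q}(s, a_i)}}{\sum_j k^{\hat{Q}(s, a_j)}}$
  - A large $k$ favors exploitation, a small $k$ favors exploration.
  - $k$ can also be varied with the number of iterations.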
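The probabilistic selection method can be sketched as below, assuming the usual form in which action a_i is chosen with probability proportional to k raised to the power Q(s, a_i). The function names and sample Q values are hypothetical.

```python
import random


def action_probabilities(q_values, k):
    """P(a_i | s) = k**Q(s, a_i) / sum_j k**Q(s, a_j), for k > 0."""
    # Caveat: k**q can overflow for large Q values; rescale Q in real use.
    weights = [k ** q for q in q_values]
    total = sum(weights)
    return [w / total for w in weights]


def select_action(q_values, k, rng=random):
    """Sample an action index according to the probabilities above."""
    probs = action_probabilities(q_values, k)
    return rng.choices(range(len(q_values)), weights=probs)[0]


q_values = [1.0, 2.0, 3.0]  # hypothetical Q estimates for three actions
print(action_probabilities(q_values, k=1))   # uniform: pure exploration
print(action_probabilities(q_values, k=10))  # mass shifts to the best action
```

Raising k over the course of training moves the agent gradually from exploration toward exploitation, as the slide suggests.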

Q LEARNING (9)
Updating Sequence
• Training efficiency can be improved by changing the order of the training examples.
• Update in reverse order: every transition on the path to the goal can be updated in a single pass (requires additional storage).
• Store past state-action transitions and their rewards, then retrain on them periodically.
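A minimal sketch of the effect of update order, assuming the deterministic training rule and a stored episode of (s, a, r, s') tuples. The three-step episode and the function name `replay` are made up for this example.

```python
from collections import defaultdict


def replay(q, transitions, actions, gamma=0.9):
    """Apply the deterministic Q update to stored (s, a, r, s') in order."""
    for s, a, r, s_next in transitions:
        q[(s, a)] = r + gamma * max(q[(s_next, b)] for b in actions)
    return q


ACTIONS = (-1, 1)
# Hypothetical stored episode: three moves to the right, reward at the end.
episode = [(0, 1, 0, 1), (1, 1, 0, 2), (2, 1, 100, 3)]

forward = replay(defaultdict(float), episode, ACTIONS)
backward = replay(defaultdict(float), list(reversed(episode)), ACTIONS)

print(forward[(0, 1)], backward[(0, 1)])
```

Replayed in the original order, only the last transition receives a non-zero value. Replayed in reverse, the reward propagates back along the whole path in one pass, at the cost of storing the episode.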

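NONDETERMINISTIC REWARDS AND ACTIONS (1)
• $\delta(s, a)$ and $r(s, a)$ are determined by probability distributions.
• Nondeterministic MDP: the distributions depend only on the current $s$ and $a$, not on earlier states or actions.
• Expected value: $V^{\pi}(s_t) = E\left[ \sum_{i=0}^{\infty} \gamma^i r_{t+i} \right]$, $Q(s, a) = E[r(s, a)] + \gamma \sum_{s'} P(s' \mid s, a) \max_{a'} Q(s', a')$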

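NONDETERMINISTIC REWARDS AND ACTIONS (2)
• Training rule: $\hat{Q}_n(s, a) \leftarrow (1 - \alpha_n) \hat{Q}_{n-1}(s, a) + \alpha_n [\, r + \gamma \max_{a'} \hat{Q}_{n-1}(s', a') \,]$, where $\alpha_n = \dfrac{1}{1 + \mathit{visits}_n(s, a)}$
• Theorem: $\hat{Q}_n(s, a)$ converges to $Q(s, a)$ as $n \to \infty$, for all $s, a$.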
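The training rule for the nondeterministic case can be sketched as a running weighted average, assuming the learning rate alpha_n = 1 / (1 + visits_n(s, a)), where visits_n counts visits to the pair up to and including the current one. The class name and the sample transition are hypothetical.

```python
from collections import defaultdict


class NondeterministicQ:
    """Q table updated with a learning rate that decays per state-action pair."""

    def __init__(self, actions, gamma=0.9):
        self.actions = actions
        self.gamma = gamma
        self.q = defaultdict(float)
        self.visits = defaultdict(int)

    def update(self, s, a, r, s_next):
        # Count this visit, then derive alpha_n = 1 / (1 + visits_n(s, a)).
        self.visits[(s, a)] += 1
        alpha = 1.0 / (1 + self.visits[(s, a)])
        target = r + self.gamma * max(self.q[(s_next, b)] for b in self.actions)
        # Blend the old estimate with the new sample instead of overwriting it.
        self.q[(s, a)] = (1 - alpha) * self.q[(s, a)] + alpha * target
        return self.q[(s, a)]


learner = NondeterministicQ(actions=(0, 1))
# Hypothetical transition into a state whose Q values are still zero.
print(learner.update("s0", 1, 10, "end"))
print(learner.update("s0", 1, 10, "end"))
```

Because alpha shrinks as a pair is revisited, a single noisy reward moves the estimate less and less, which is what allows the estimates to settle when rewards and transitions are random.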

TEMPORAL DIFFERENCE LEARNING
• Q learning is a special case of a general class of temporal difference algorithms.
• TD($\lambda$) by Sutton (1988)

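GENERALIZING FROM EXAMPLES
• Two assumptions are unrealistic: that the target function is represented as an explicit lookup table, and that every possible state-action pair is visited (consider infinite spaces, or actions that are costly to execute).
• Integration with other methods is therefore needed.
• Replace the lookup table with a neural network, integrating a function approximation algorithm such as BACKPROPAGATION into the Q learning algorithm.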
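As a rough sketch of replacing the lookup table with a function approximator, the code below uses a linear model over hand-made features rather than the neural network and BACKPROPAGATION named on the slide; this is a deliberate simplification. The feature vectors, step size, and function names are assumptions for illustration.

```python
def q_hat(w, x):
    """Linear estimate of Q for a state-action pair with feature vector x."""
    return sum(wi * xi for wi, xi in zip(w, x))


def td_update(w, x, r, next_xs, gamma=0.9, alpha=0.1):
    """One gradient step moving q_hat(w, x) toward r + gamma * max q_hat(w, x').

    next_xs holds one feature vector per action available in the next state;
    an empty list means the next state is terminal.
    """
    target = r + gamma * max((q_hat(w, nx) for nx in next_xs), default=0.0)
    error = target - q_hat(w, x)
    return [wi + alpha * error * xi for wi, xi in zip(w, x)]


w = [0.0, 0.0, 0.0]
x = [1.0, 0.5, 1.0]  # hypothetical features of the visited (s, a) pair
w = td_update(w, x, r=10, next_xs=[])
print(q_hat(w, x))
```

Each update changes the shared weights, so the estimate also shifts for state-action pairs that were never visited. That is the source of generalization, and also the reason the table-based convergence guarantee no longer applies directly.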

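RELATIONSHIP TO DYNAMIC PROGRAMMING
• Dynamic programming assumes perfect background knowledge (of $\delta$ and $r$).
• Its primary goal is to minimize computation.
• It works by internal simulation, without direct interaction with the environment (offline).
• Bellman's equation: $(\forall s \in S)\; V^*(s) = E[\, r(s, \pi(s)) + \gamma V^*(\delta(s, \pi(s))) \,]$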
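For contrast with Q learning, here is a minimal value-iteration sketch for the deterministic case, in which the transition and reward functions are fully known and the values are computed offline by repeatedly applying the Bellman backup. Value iteration is one standard dynamic programming method, chosen here for illustration; the line world and all names are hypothetical.

```python
def value_iteration(states, actions, delta, reward, gamma=0.9, tol=1e-9):
    """Sweep V(s) <- max_a [r(s, a) + gamma * V(delta(s, a))] until stable."""
    v = {s: 0.0 for s in states}
    while True:
        change = 0.0
        for s in states:
            best = max(reward(s, a) + gamma * v[delta(s, a)] for a in actions)
            change = max(change, abs(best - v[s]))
            v[s] = best
        if change < tol:
            return v


GOAL = 4  # hypothetical absorbing goal at the right end of a 5-state line


def delta(s, a):
    return GOAL if s == GOAL else min(max(s + a, 0), GOAL)


def reward(s, a):
    return 100 if s != GOAL and delta(s, a) == GOAL else 0


v = value_iteration(range(5), (-1, 1), delta, reward)
print(v)
```

Here the program calls `delta` and `reward` directly for every state and action, with no acting in the environment, which is exactly the knowledge a Q learning agent is not assumed to have.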

SUMMARY
• Reinforcement learning addresses the problem of learning control strategies for autonomous agents.
• The reinforcement learning algorithms addressed in this chapter fit a problem setting known as a Markov decision process.
• Q learning is one form of reinforcement learning in which the agent learns an evaluation function over states and actions.
• Q learning can be proven to converge to the correct Q function under certain assumptions.
• Q learning is a member of a more general class of algorithms, called temporal difference algorithms.
• Reinforcement learning is closely related to dynamic programming approaches to Markov decision processes.