Modeling one measurement variable against another: Regression analysis, Chapter 12

Presentation transcript:

Regression analysis
In biology we often want to know the relationships between two or more measurement variables. Examples: body weight and blood pressure; female size and number of offspring; drug dosage level and a particular physiological response. Two analyses address such relationships: regression analysis (Chapter 14) and correlation analysis (Chapter 15).

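Regression versus correlation
Correlation analysis is used to determine whether an association exists between two variables and how strong that association is. If an association exists, a change in one variable is accompanied by a similar change in the other. Correlation, however, does not assume a cause-and-effect relationship between the two variables. Example: for the relationship between exam scores and homework scores, it is more accurate to say that both variables are driven by third variables (attendance, study time, time spent on preparation and review) than that one causes the other. Correlation analysis tells us only how strongly the two variables are correlated.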

Regression versus correlation
Regression analysis is used when a cause-and-effect relationship between the two variables is expected, that is, when one variable (the response variable) is explained by the other (the predictor variable). These are also called the dependent variable and the independent variable. Example: for caffeine intake and heart rate, if a relationship exists, we can say that heart rate (dependent variable) changes with the amount of caffeine consumed (independent variable). The reverse cannot be assumed: a change in heart rate does not determine caffeine intake.

Regression versus correlation
In most regression analyses the independent variable is not a normally distributed random variable. Its values are set by the investigator (a fixed-effects design, not a random effect). In biological research the choice between regression and correlation analysis is often blurred: some investigators distinguish the two strictly, while others use them interchangeably.

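An example of a correlation problem
A random sample of 11 female iguanas was taken, and each individual's body weight after egg laying and the number of eggs laid at one time were measured. The question is the relationship between body weight and number of eggs. Regression or correlation? No independent variable was designated, so body weight was arbitrarily plotted as the x-variable and number of eggs as the y-variable (Figure 12.1). In correlation analysis the choice of x and y axes does not affect the result.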

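An example of a correlation problem
From Figure 12.1 one might hastily conclude that larger female iguanas lay more eggs. But the number of eggs cannot be the cause of body weight, and body weight need not be the cause of differences in the number of eggs. Both variables are probably determined by a third factor. Note that no line is drawn on the graph.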

An example of a correlation problem
The third variable could be age or nutrition. If a well-fed iguana grows larger and the amount of food eaten also determines the number of eggs, then both variables (size and number of eggs) are controlled by the third variable, nutrition.

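An example of a regression problem
Ex. 12.2, a regression problem. A snake physiologist wants to know the effect of temperature on the heart rate of pythons. Nine pythons of the same age, size, and sex were selected. Each snake was placed in a cage at a preselected temperature (2°C to 18°C, in 2°C steps), and its heart rate was measured after it had equilibrated with the ambient temperature. Regression or correlation? Which is the independent variable? Results: Table 12.2 and Figure 12.2.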

Here a line is drawn on the graph.

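An example of a regression problem
Ex. 12.2, a regression problem. In this case the temperature is controlled by the investigator, so temperature is not a random variable. In the figure, temperature is on the x-axis and heart rate on the y-axis. The reverse does not hold, because heart rate does not influence body temperature. A line is drawn in the figure; once its equation is determined, it can be used to estimate heart rate at various temperatures.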

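An example of a regression problem
Ex. 12.3, another regression problem. In Ex. 12.1 the iguanas were selected at random to study the relationship between body weight and number of eggs. Here, instead, we assume that the number of eggs a female lays is influenced by her body weight, and we select the experimental animals not at random but on the basis of the independent variable (body weight): iguanas matching predetermined body weights are chosen, and the number of eggs of each is counted. In this case regression analysis is used, and the equation of the line can be used to estimate the number of eggs from body weight.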

An example of a regression problem
The analysis to use is therefore determined by the question being asked and by the experimental design. In Ex. 12.1 the female iguanas were a random sample, so both variables are expected to be normally distributed and neither is controlled by the investigator. Thus correlation analysis is used, and the temptation to use the more sophisticated regression analysis should be resisted. In Ex. 12.3 the animals were selected on the basis of size, so size is not a random variable. Size is the independent (fixed) variable, and regression analysis can be used.

Simple linear regression fundamentals
Regression analysis assumes a cause-and-effect relationship between the two variables. The independent variable is under the investigator's control (fixed-effects experimental design). There is a functional relationship that allows the value of the dependent variable (y) to be predicted from the value of the independent variable (x); mathematically, $y = f(x)$. The functional relationship in simple linear regression is $\mu_y = \alpha + \beta x$.

Simple linear regression fundamentals
$\mu_y$ is the population mean of y at a given x, $\alpha$ is the population y-intercept, and $\beta$ is the population slope. An individual value of y deviates from its expected value $\mu_y$ because of unexplained variation, the residual (error term, $e$): $y_i = \alpha + \beta x_i + e_i$.

Simple linear regression fundamentals
Regression analysis serves several purposes.
1. To estimate the equation describing the linear relationship between the two variables (the regression equation or regression function). $\alpha$ and $\beta$ are estimated from the sample.
2. To draw a line from this equation: the least squares regression line. Why least squares?
3. To predict the value of the dependent variable (y) that corresponds to a value of the independent variable (x).

Estimating the regression function and the regression line
Regression analysis estimates the parametric regression function (regression equation) $\mu_y = \alpha + \beta x$. The intercept estimated from the sample is $a$, and the slope estimated from the sample is $b$ (the regression coefficient). The line described by the estimated equation best fits the regression function. The regression line always passes through one point, the mean of x and the mean of y: $(\bar{x}, \bar{y})$. Ex. 12.2 (heart rate of snakes at different temperatures) is used for illustration.

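Estimating the regression function and the regression line
Figure 12.3 shows a horizontal line through this point (the means of x and y). The vertical lines are the deviations of each y value from the horizontal line, $y_i - \bar{y}$. These deviations sum to 0, so they are squared and then summed: $\sum (y_i - \bar{y})^2$, the sum of squares for y. Its value, 804.89, is very large. This sum of squares does not take x (temperature) into account: when temperature is ignored, heart rate shows very large variance. The line in Figure 12.3 is then rotated about the center point $(\bar{x}, \bar{y})$, where $\bar{x} = 10$ and $\bar{y} = 19.89$.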

Figure 12.3 and Table 12.4: change in python heart rate with temperature (mean x = 10, mean y = 19.89).

Estimating the regression function and the regression line
The line is rotated to minimize the differences from the observed y values. The best fit to our data is the line for which the sum of squares is minimized. This is the origin of the term "least squares" (least squares regression line).
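The minimization described above can be written out as a short derivation. This is the standard least squares argument rather than something shown on the slides; the symbol $Q$ for the quantity being minimized is an assumption introduced here.

```latex
% Q(a, b) is the sum of squared vertical distances from the line y = a + b x
\begin{align*}
Q(a, b) &= \sum_{i=1}^{n} \left( y_i - a - b x_i \right)^2 \\
\frac{\partial Q}{\partial a} &= -2 \sum_{i=1}^{n} \left( y_i - a - b x_i \right) = 0
  \quad\Rightarrow\quad a = \bar{y} - b \bar{x} \\
\frac{\partial Q}{\partial b} &= -2 \sum_{i=1}^{n} x_i \left( y_i - a - b x_i \right) = 0
  \quad\Rightarrow\quad
  b = \frac{\sum x_i y_i - \left(\sum x_i\right)\left(\sum y_i\right)/n}
           {\sum x_i^2 - \left(\sum x_i\right)^2/n}
\end{align*}
```

The first condition, $a = \bar{y} - b\bar{x}$, is also the reason the fitted line always passes through $(\bar{x}, \bar{y})$.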

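Estimating the regression function and the regression line
The observed y values do not always lie on the line. A y value on the line is written $\hat{y}$ (y hat); it is the y value calculated from the regression equation. The difference between the observed and the calculated value is the residual (error term): $e_i = y_i - \hat{y}_i$. Squaring and summing the residuals gives the sum of squares for y when temperature is taken into account. Taking temperature into account greatly reduces the sum of squares for y (Table 12.4): from 804.89 to 48.74.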

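Estimating the regression function and the regression line
What does the reduction in the sum of squares for y mean? Example: when python heart rate is measured with no information about the effect of temperature, the best available prediction is the mean heart rate, 19.89, and the variance is large. Once the effect of temperature on heart rate is known, a more accurate prediction is possible, e.g. 5.69 beats at 2°C and 34.09 beats at 18°C. Taking the heart rate predicted at each temperature into account reduces the variance.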
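The two predicted values quoted above can be reproduced with a few lines of Python. This is a minimal sketch that takes the coefficients a = 2.14 and b = 1.775 from the estimated regression equation reported in these slides. The function name `predict_heart_rate` is a hypothetical helper, not part of the source.

```python
# Coefficients of the estimated regression equation quoted in the slides.
A_INTERCEPT = 2.14   # estimated y-intercept a
B_SLOPE = 1.775      # estimated slope b


def predict_heart_rate(temp_c, a=A_INTERCEPT, b=B_SLOPE):
    """Return the heart rate predicted by y_hat = a + b * x (hypothetical helper)."""
    return a + b * temp_c


# The experiment covered 2 to 18 degrees C; predictions outside that
# range would be extrapolation and are not supported by the data.
for temp in (2, 18):
    print(f"{temp} C -> {predict_heart_rate(temp):.2f} beats")
```

Without temperature information the only prediction available would be the overall mean, 19.89, for every snake.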

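Calculating the estimated regression equation
Estimated slope of the regression equation: $b = \dfrac{\sum xy - (\sum x)(\sum y)/n}{\sum x^2 - (\sum x)^2/n} = \dfrac{2216 - 90 \times 179/9}{1140 - 8100/9} = \dfrac{426}{240} = 1.775$. Estimated y-intercept: $a = \bar{y} - b\bar{x} = 19.89 - (1.775 \times 10) = 2.14$. Estimated regression equation: $\hat{y} = 2.14 + 1.775x$.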
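The slope and intercept can be checked from the summary sums alone. The sketch below assumes the sums quoted in these slides (Σy = 179, Σxy = 2216) and rebuilds Σx and Σx² from the stated temperatures (2°C to 18°C in 2°C steps). The individual heart rates of Table 12.2 are not reproduced in the transcript, so they are not used. The function name `fit_line` is a hypothetical helper.

```python
# Temperatures used in Example 12.2: 2, 4, ..., 18 degrees C.
temps = list(range(2, 20, 2))
n = len(temps)                       # 9 snakes
sum_x = sum(temps)                   # 90
sum_x2 = sum(t * t for t in temps)   # 1140

# Sums involving heart rate, as quoted in the slides.
sum_y = 179
sum_xy = 2216


def fit_line(n, sum_x, sum_y, sum_x2, sum_xy):
    """Return (a, b) of the least squares line from raw sums (hypothetical helper)."""
    sp_xy = sum_xy - sum_x * sum_y / n   # corrected sum of cross products
    ss_x = sum_x2 - sum_x ** 2 / n       # corrected sum of squares for x
    b = sp_xy / ss_x                     # slope
    a = sum_y / n - b * sum_x / n        # intercept: y_bar - b * x_bar
    return a, b


a, b = fit_line(n, sum_x, sum_y, sum_x2, sum_xy)
print(f"b = {b:.3f}, a = {a:.2f}")
```

Working from sums keeps the example faithful to the numbers on the slides. With the raw data available, the same result would come from any standard linear regression routine.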

Testing the significance of the regression equation
The slope $b$ is an estimate, calculated from the sample, of the parametric slope $\beta$. If $\beta = 0$, y does not depend on x. Even when $\beta = 0$, chance alone can produce $b \neq 0$. To test whether $\beta \neq 0$, the null hypothesis is set to $\beta = 0$ and an ANOVA test is performed, using the total sum of squares (df = n - 1), the regression sum of squares (df = 1), and the error sum of squares (df = n - 2). This is a test of whether y depends on x.

Testing the significance of the regression equation
Total sum of squares: $SS_t = \sum y^2 - (\sum y)^2/n = 4365 - (179)^2/9 = 804.89$.
Regression sum of squares: $SS_r = b \left( \sum xy - (\sum x)(\sum y)/n \right) = 1.775 \times (2216 - 90 \times 179/9) = 756.15$.
Error sum of squares: $SS_e = SS_t - SS_r = 804.89 - 756.15 = 48.74$.
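The partition of the total sum of squares can be verified numerically. This sketch assumes only the sums and the slope quoted in these slides; the variable names are illustrative.

```python
# Summary values quoted in the slides for Example 12.2.
n = 9
sum_x, sum_y = 90, 179
sum_y2 = 4365     # sum of squared heart rates
sum_xy = 2216     # sum of cross products
b = 1.775         # estimated slope

# Total variation in y, ignoring temperature.
ss_total = sum_y2 - sum_y ** 2 / n
# Variation in y explained by the regression on temperature.
ss_regression = b * (sum_xy - sum_x * sum_y / n)
# Variation left unexplained (residual).
ss_error = ss_total - ss_regression

print(f"SSt = {ss_total:.2f}")
print(f"SSr = {ss_regression:.2f}")
print(f"SSe = {ss_error:.2f}")
```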

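Testing the significance of the regression equation
ANOVA table for the data in Table 12.3. The calculated F value is $F = MS_r / MS_e = (756.15/1) / (48.74/7) = 108.60$. The critical F value (df = 1, 7; significance level 0.05) from Table A6 is 5.59. The calculated F value is far larger than the critical value, so the null hypothesis $\beta = 0$ is rejected. Conclusion: y depends on x.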
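The F test can be sketched in a few lines. The sums of squares and the critical value 5.59 are taken as quoted in these slides. The critical value is not recomputed here, and the variable names are illustrative.

```python
# Sums of squares and sample size quoted in the slides.
ss_regression = 756.15
ss_error = 48.74
n = 9
f_critical = 5.59   # Table A6 value quoted in the slides (df = 1, 7; 0.05)

# Degrees of freedom for simple linear regression.
df_regression = 1
df_error = n - 2

# Mean squares are sums of squares divided by their degrees of freedom.
ms_regression = ss_regression / df_regression
ms_error = ss_error / df_error

f_value = ms_regression / ms_error
reject_null = f_value > f_critical

print(f"MSe = {ms_error:.2f}, F = {f_value:.2f}, reject H0: {reject_null}")
```

The error mean square computed here (about 6.96) is the same quantity used for the standard error of the slope.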

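The confidence interval for β
Because $b$ is an estimate of $\beta$, a confidence interval for $\beta$ can be calculated. Standard error of the slope: $S_b = \sqrt{MS_e / \left( \sum x^2 - (\sum x)^2/n \right)}$. The 95% confidence interval for $\beta$: $b \pm S_b \times t_{(0.05,\, n-2)}$.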

The confidence interval for β
$S_b = \sqrt{6.96 / (1140 - 8100/9)} = 0.1703$ and $t_{(0.05,\, 7)} = 2.365$. The 95% confidence interval for $\beta$ is $1.775 \pm (0.1703 \times 2.365) = 1.775 \pm 0.403$, that is, 1.372 to 2.178.
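The interval can be reproduced as follows. This sketch assumes the error mean square (6.96), the sums for x, and the tabulated t value (2.365) exactly as quoted in these slides. The variable names are illustrative.

```python
import math

# Values quoted in the slides for Example 12.2.
b = 1.775          # estimated slope
ms_error = 6.96    # error mean square from the ANOVA
sum_x = 90
sum_x2 = 1140
n = 9
t_crit = 2.365     # t(0.05, 7), as quoted in the slides

# Corrected sum of squares for x.
ss_x = sum_x2 - sum_x ** 2 / n

# Standard error of the slope and the half-width of the 95% interval.
s_b = math.sqrt(ms_error / ss_x)
half_width = s_b * t_crit
lower, upper = b - half_width, b + half_width

print(f"Sb = {s_b:.4f}")
print(f"95% CI for beta: {lower:.3f} to {upper:.3f}")
```

The interval excludes 0, which agrees with the rejection of the null hypothesis $\beta = 0$ in the F test.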

The coefficient of determination ($r^2$)
$r^2$ measures how strongly y depends on x. If y depended completely on x, all y values would lie on the regression line and there would be no error variance. To find what proportion of the variance in y is explained by its dependence on x, calculate $r^2 = SS_r / SS_t = 756.15 / 804.89 = 0.939$. Thus 93.9% of the variance in y is explained by x, and knowing x reduces the uncertainty in y by 93.9%. The remaining 6.1% (100 - 93.9) is unexplained: the error term.
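As a quick check, $r^2$ can be computed in two equivalent ways from the sums of squares quoted in these slides: as the explained share, or as one minus the unexplained share. The variable names are illustrative.

```python
# Sums of squares quoted in the slides.
ss_total = 804.89
ss_regression = 756.15
ss_error = 48.74

# Explained proportion of the variance in y.
r_squared = ss_regression / ss_total
# Equivalent form: one minus the unexplained proportion.
r_squared_alt = 1 - ss_error / ss_total

unexplained_percent = (1 - r_squared) * 100

print(f"r^2 = {r_squared:.3f}")
print(f"unexplained = {unexplained_percent:.1f}%")
```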