Statistical inference I

Statistical inference I: Chapter 7

Statistical inference I
In biology it is very important to draw conclusions about a population from samples. This activity is called statistical inference. It has two broad categories:
1. Estimating a population parameter (estimation)
2. Testing a statistical hypothesis (hypothesis testing)

Statistical inference I: Estimation
Ex. 7.1: A random sample of 10 bluegill sunfish has a sample mean length of 159.40 mm. How well does this sample mean estimate the population mean? This general category of statistical inference is called estimation.

Statistical inference I: Hypothesis testing
Ex. 7.2: Vitamin Y is an essential nutrient but is harmful when consumed in large amounts, so the FDA (Food and Drug Administration) requires each vitamin pill to contain an average of 100 units of vitamin Y. A manufacturer measured the vitamin content of a random sample of 100 pills and obtained a mean of 100.5 units and a standard deviation of 2.19 units. Could a sample mean of 100.5 units have come from a population whose mean is 100? Answering this kind of question is called hypothesis testing (covered in the next chapter).

Sampling distribution
Inference about a population must rely on sample statistics such as the sample mean or the sample variance. We therefore need to understand another type of probability distribution, the sampling distribution. A random sample drawn from a population is only one of the very many samples that could have been selected from that population.

Sampling distribution
If samples of the same size are drawn repeatedly from a normally distributed population and the mean of each sample is calculated, the means differ from one another but tend to cluster around a central value. In other words, the sample means have their own mean and standard deviation and are themselves normally distributed. This probability distribution is called the sampling distribution, and any single sample mean is one observation from it.
Mean of the sampling distribution = μ (the population mean)
Standard deviation of the sampling distribution = σ / √n (n = sample size)
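The two properties above can be checked by simulation. The following is a minimal sketch, not part of the original slides: it assumes a normal population with the bluegill values used later in the chapter (μ = 152.10, σ = 19.64) and n = 10, and the function name `simulate_sample_means` is made up for illustration.

```python
import random
import statistics


def simulate_sample_means(mu, sigma, n, repeats, seed=0):
    """Draw `repeats` samples of size n from Normal(mu, sigma) and return their means."""
    rng = random.Random(seed)  # fixed seed so the run is reproducible
    return [
        statistics.fmean(rng.gauss(mu, sigma) for _ in range(n))
        for _ in range(repeats)
    ]


mu, sigma, n = 152.10, 19.64, 10
means = simulate_sample_means(mu, sigma, n, repeats=20_000)

# The mean of the sample means should be close to mu,
# and their standard deviation close to sigma / sqrt(n).
print("mean of sample means:", round(statistics.fmean(means), 2))
print("sd of sample means:  ", round(statistics.stdev(means), 2))
print("sigma / sqrt(n):     ", round(sigma / n ** 0.5, 2))
```

With enough repeats, the simulated standard deviation of the means should settle near σ / √n rather than near σ itself.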

The central limit theorem
Sample means of random samples drawn from a normally distributed population are normally distributed regardless of sample size. Even when the population is not normally distributed, the sampling distribution is approximately normal if the sample size is large. As a practical rule of thumb, a sample size of 30 or more is treated as large enough.

The central limit theorem
Figure 7.2 illustrates the central limit theorem. The original data, city sizes, follow a skewed distribution because most cities are small. When samples of size n are drawn repeatedly and the frequency distribution of their sample means is plotted, the distribution comes closer to a normal distribution as the sample size increases.

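Usefulness of the central limit theorem
When biological phenomena are quantified, their distributions are often not exactly normal. According to the central limit theorem, the sampling distribution of the sample mean approaches a normal distribution as the sample size n grows, whatever the distribution of the population. Therefore, even if the population is not normally distributed, a large sample (n of about 30 or more) can be analyzed using the normal distribution.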
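A small simulation can make the central limit theorem concrete. This is only a sketch under stated assumptions: the skewed population is an exponential distribution chosen for illustration (it is not the city-size data of Figure 7.2), and `skewness` and `sample_mean_skewness` are hypothetical helper names. Skewness is 0 for a symmetric distribution, so values shrinking toward 0 indicate that the distribution of sample means is becoming more symmetric, as a normal distribution is.

```python
import random
import statistics


def skewness(values):
    """Skewness: mean of cubed standardized deviations (0 for a symmetric distribution)."""
    m = statistics.fmean(values)
    sd = statistics.pstdev(values)
    return statistics.fmean(((v - m) / sd) ** 3 for v in values)


def sample_mean_skewness(n, repeats=10_000, seed=1):
    """Skewness of the means of `repeats` samples of size n from a skewed population."""
    rng = random.Random(seed)
    means = [
        statistics.fmean(rng.expovariate(1.0) for _ in range(n))
        for _ in range(repeats)
    ]
    return skewness(means)


# Assumed sample sizes; n = 1 reproduces the skewed population itself.
skew_by_n = {n: sample_mean_skewness(n) for n in (1, 5, 30)}
for n, skew in skew_by_n.items():
    print(f"n = {n:>2}: skewness of sample means = {skew:.2f}")
```

The skewness should fall steadily as n increases, which is the pattern the figure describes.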

Sampling distribution
Another property of the sampling distribution is that the distribution of sample means becomes narrower as the sample size n increases, so most sample means lie closer to the true mean (μ). A larger sample size therefore gives a more precise estimate. Note that precision is not the same as accuracy.
$\sigma_{\bar{x}} = \sigma / \sqrt{n}$ (n = sample size)
A single sample mean is one of the many means that make up the normally distributed sampling distribution.
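To see how quickly precision improves, the sketch below evaluates σ / √n for a few sample sizes. It assumes σ = 19.64 mm (the bluegill value used later in the chapter); the sample sizes and the function name `standard_error` are chosen only for illustration.

```python
import math


def standard_error(sigma, n):
    """Standard deviation of the sampling distribution of the mean."""
    return sigma / math.sqrt(n)


sigma = 19.64  # population standard deviation (mm)
for n in (10, 40, 160):
    print(f"n = {n:>3}: standard error = {standard_error(sigma, n):.2f} mm")
```

Because of the square root, quadrupling the sample size only halves the spread of the sample means.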

Estimating a population mean: Standard error of the mean
Ex. 7.3: Bluegill sunfish lengths
Population mean (μ) = 152.10 mm
Population standard deviation (σ) = 19.64 mm
Ten individuals are randomly sampled from the population and their mean is calculated: sample mean = 159.40 mm.
Question 1: What is the probability that a sample mean is 159.40 mm or greater?
For a single observation, the z score is $z = (x - \mu)/\sigma$ (Chapter 6).
Because the sample mean follows the sampling distribution, its z score is $z = (\bar{x} - \mu)/\sigma_{\bar{x}}$, where $\sigma_{\bar{x}} = \sigma/\sqrt{n}$.

Estimating a population mean: Standard error of the mean
The standard deviation of the sampling distribution is called the standard error of the mean ($\sigma_{\bar{x}}$).
$\sigma_{\bar{x}} = \sigma/\sqrt{n} = 19.64/\sqrt{10} = 6.21$
$z = (\bar{x} - \mu)/\sigma_{\bar{x}} = (159.40 - 152.10)/6.21 = 1.18$
From Table A.1, the area for a z score of 1.18 is 0.3810. The probability that a sample mean is greater than 159.40 mm is therefore 0.5 - 0.3810 = 0.1190 (about 12%).
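The same calculation can be sketched with Python's standard library instead of Table A.1. The variable names are illustrative. Because the code does not round z to 1.18 before looking up the area, its result differs from the table-based 0.1190 in the fourth decimal place.

```python
import math
from statistics import NormalDist

mu, sigma, n = 152.10, 19.64, 10   # population values from Ex. 7.3
x_bar = 159.40                     # observed sample mean

se = sigma / math.sqrt(n)          # standard error of the mean
z = (x_bar - mu) / se              # z score of the sample mean
p_upper = 1 - NormalDist().cdf(z)  # P(sample mean >= x_bar)

print(f"standard error = {se:.2f}")
print(f"z = {z:.2f}")
print(f"P(sample mean >= {x_bar}) = {p_upper:.3f}")
```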

Standard error of the mean
Question 2: Find the range that contains 95% of the sample means.
Population mean (μ) = 152.10 mm
Population standard deviation (σ) = 19.64 mm
Sample size: n = 10

Standard error of the mean
0.95/2 = 0.475, and from Table A.1 the z score for a probability of 0.475 is 1.96. The sample means corresponding to z = 1.96 and z = -1.96 follow from rearranging $z = (\bar{x} - \mu)/\sigma_{\bar{x}}$ into $\bar{x} = \mu + z\,\sigma_{\bar{x}}$.
For z = 1.96: sample mean = 152.10 + (1.96 × 6.21) = 164.27
For z = -1.96: sample mean = 152.10 + (-1.96 × 6.21) = 139.93
The range containing 95% of the sample means is therefore 139.93 to 164.27 mm. The observed sample mean of 159.40 mm falls within this range.
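As a cross-check, the range can be read directly from the sampling distribution, modelled here as a normal distribution with mean μ and standard deviation σ / √n. This is a sketch using `statistics.NormalDist`; the variable names are illustrative.

```python
import math
from statistics import NormalDist

mu, sigma, n = 152.10, 19.64, 10
se = sigma / math.sqrt(n)

# Sampling distribution of the mean: Normal(mu, se)
sampling = NormalDist(mu=mu, sigma=se)

# The central 95% lies between the 2.5th and 97.5th percentiles.
lower = sampling.inv_cdf(0.025)
upper = sampling.inv_cdf(0.975)

print(f"95% of sample means fall between {lower:.2f} and {upper:.2f} mm")
```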

Confidence interval of μ when σ is known
If the sample mean and the population standard deviation are known, we can calculate a range within which the population mean is likely to lie. This range is called the confidence interval of the mean. The probability that such an interval includes the true population value is called the confidence level, usually 0.95 (95%) or 0.99 (99%).
Assumptions of the test
1. The sample is a random sample.
2. The measurement is on an interval or ratio scale and the variable is continuous.
3. The variable is approximately normally distributed.
4. The population standard deviation is known.

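Confidence interval of μ when σ is known
Ex. 7.1: A random sample of 10 bluegill sunfish lengths has a sample mean of 159.40 mm. The population standard deviation (σ) is 19.64 mm. What is the confidence interval of the mean at the 95% confidence level?
The interval is defined by an upper limit (UL) and a lower limit (LL):
$UL_{0.95} = \bar{x} + 1.96\,\sigma_{\bar{x}}$
$LL_{0.95} = \bar{x} - 1.96\,\sigma_{\bar{x}}$
1.96 = z score for a probability of 0.475
$\sigma_{\bar{x}} = \sigma/\sqrt{n} = 19.64/\sqrt{10} = 6.21$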

Confidence interval of μ when σ is known
UL0.95 = 159.40 + (1.96 × 6.21) = 171.57
LL0.95 = 159.40 + (-1.96 × 6.21) = 147.23
Therefore the 95% CI for μ is 159.40 ± 12.17 mm, that is, 147.23 to 171.57 mm.
99% confidence interval (0.99/2 = 0.495; z = 2.576, the z score for a probability of 0.495 from Table A.1):
UL0.99 = 159.40 + (2.576 × 6.21) = 175.40
LL0.99 = 159.40 + (-2.576 × 6.21) = 143.40
99% CI for μ = 159.40 ± 16.00 mm
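The two intervals can be reproduced with a small helper. This is a sketch, and `z_confidence_interval` is a name made up for illustration. It takes the z score from `statistics.NormalDist` rather than from Table A.1, so the results agree with the table-based values to two decimal places.

```python
import math
from statistics import NormalDist


def z_confidence_interval(x_bar, sigma, n, level=0.95):
    """Confidence interval for the population mean when sigma is known."""
    se = sigma / math.sqrt(n)                  # standard error of the mean
    z = NormalDist().inv_cdf(0.5 + level / 2)  # e.g. about 1.96 for level = 0.95
    return x_bar - z * se, x_bar + z * se


for level in (0.95, 0.99):
    ll, ul = z_confidence_interval(159.40, 19.64, 10, level)
    print(f"{level:.0%} CI: {ll:.2f} to {ul:.2f} mm")
```

The 99% interval is wider than the 95% interval: a higher confidence level is bought with a less specific range.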

Confidence interval of μ when σ is known: Caution
The confidence level (95% or 99%) is the probability that an interval calculated from a sample includes the true population value, that is, the probability that the confidence interval includes the parameter. It is not the probability that the true value falls within a particular calculated range, because the true value is already fixed. The statement "there is a 95% probability that the parameter lies between the upper and lower limits" is incorrect.
Ex.: A population with known μ is sampled 20 times, and a 95% confidence interval is calculated for each sample. In Figure 7.6, 19 of the 20 confidence intervals (95%) include the true value.
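The repeated-sampling interpretation can be illustrated by simulation. The sketch below assumes a normal population with the bluegill parameters (μ = 152.10, σ = 19.64) and n = 10; the function name `coverage` and the number of repeats are illustrative choices. It draws many samples, builds a 95% interval from each, and counts how often the fixed true mean is covered.

```python
import math
import random
from statistics import NormalDist, fmean


def coverage(mu, sigma, n, level=0.95, repeats=5_000, seed=42):
    """Fraction of simulated confidence intervals that include the true mean mu."""
    rng = random.Random(seed)
    z = NormalDist().inv_cdf(0.5 + level / 2)
    half_width = z * sigma / math.sqrt(n)  # same width for every sample (sigma known)
    hits = 0
    for _ in range(repeats):
        x_bar = fmean(rng.gauss(mu, sigma) for _ in range(n))
        # mu never changes; only the interval moves from sample to sample.
        if x_bar - half_width <= mu <= x_bar + half_width:
            hits += 1
    return hits / repeats


rate = coverage(152.10, 19.64, 10)
print(f"Intervals that include the true mean: {rate:.1%}")
```

The proportion should come out close to 95%, which mirrors the 19 out of 20 intervals in the figure.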

Confidence interval of μ when σ is unknown: the t distribution
In practice the population standard deviation (σ) is usually unknown, so the sample standard deviation must be used to estimate the population parameter. In that case the t distribution (Student's t distribution) is used; its developer published it under the pseudonym "Student". Student's t distribution resembles the normal distribution but has an additional parameter, the degrees of freedom (n - 1).
Normal distribution: defined by the mean and standard deviation
Student's t distribution: defined by the mean, standard deviation, and degrees of freedom

Confidence interval of μ when σ is unknown: the t distribution
A t value behaves much like a z score, but the t distribution changes with sample size, which is why the degrees of freedom (n - 1) are added.
Table A.2 (t distribution) gives the t value that excludes a specified proportion of the distribution. A heading of 0.05 in Table A.2 means that 5% is excluded and 95% is included. The shaded portion is the proportion lying outside a given ±t value. Unlike Table A.1, Table A.2 covers both tails. The numbers in the body of the table are t values.

Confidence interval of μ when σ is unknown: the t distribution
Ex.: What t value excludes 0.05 of the t distribution with 4 degrees of freedom? The answer is 2.776, so t values of ±2.776 exclude 0.05 of the t distribution (or include 0.95). See Figure 7.7.

Confidence interval of μ when σ is unknown: the t distribution
Ex.: With 4 degrees of freedom, what t value excludes only the upper 0.05 of the t distribution?
Table A.2 gives t values that split the stated proportion equally between the two tails: one half in the upper tail and one half in the lower tail. When only one tail is of interest, double the proportion looked up in Table A.2. The t value for a proportion of 0.10 (rather than 0.05) is the one that excludes the upper 0.05: 2.132 (Figure 7.8).

Confidence interval of μ when σ is unknown: the t distribution
Figure 7.8: t value = 2.132; shaded proportion = 0.05; unshaded proportion = 0.95.

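Confidence interval of μ when σ is unknown: the t distribution
The t value decreases as the degrees of freedom increase. With infinite degrees of freedom, the t value that excludes 0.05 is 1.960, the same as the z value that excludes 0.05 (includes 0.95) of the standard normal distribution, because the t distribution approaches the normal distribution as the degrees of freedom increase. When the degrees of freedom are very large (that is, the sample size is large), the normal distribution (Table A.1) can be used instead.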
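The table look-ups above can be approximated numerically without Table A.2. The following is a rough sketch using only the standard library. It is not how the table was produced: it integrates the t density with Simpson's rule and finds the critical value by bisection. The function names (`t_pdf`, `t_upper_tail`, `t_critical`), the step count, and the search bounds are all illustrative assumptions.

```python
import math


def t_pdf(t, df):
    """Density of Student's t distribution with df degrees of freedom."""
    log_coeff = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
                 - 0.5 * math.log(df * math.pi))
    return math.exp(log_coeff - (df + 1) / 2 * math.log1p(t * t / df))


def t_upper_tail(t, df, steps=1000):
    """P(T > t) for t >= 0: 0.5 minus the area on [0, t] by Simpson's rule."""
    h = t / steps
    total = t_pdf(0.0, df) + t_pdf(t, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    return 0.5 - total * h / 3


def t_critical(excluded, df, two_tailed=True):
    """t value excluding `excluded` of the distribution (both tails by default)."""
    target = excluded / 2 if two_tailed else excluded
    lo, hi = 0.0, 200.0  # assumed search range
    for _ in range(60):  # bisection on the upper-tail probability
        mid = (lo + hi) / 2
        if t_upper_tail(mid, df) > target:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


print(f"df = 4, both tails 0.05:  {t_critical(0.05, 4):.3f}")
print(f"df = 4, upper tail 0.05:  {t_critical(0.05, 4, two_tailed=False):.3f}")
print(f"df = 1000, both tails 0.05: {t_critical(0.05, 1000):.3f}")
```

The first two results should match the table values 2.776 and 2.132, and the value for a large number of degrees of freedom should be close to the z value 1.960.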

Confidence interval of μ when σ is unknown: the t distribution
Ex. 7.4: The total length (mm) was determined for a random sample of 20 male mosquito fish.
Sample mean ($\bar{x}$): 21.0 mm
Sample standard deviation (s): 1.76 mm
Fish length is approximately normally distributed.
What is the 95% confidence interval of the population mean? Both the population mean and the population standard deviation must be estimated from the sample values.

Confidence interval of μ when σ is unknown: the t distribution
Assumptions of the test
1. The sample is a random sample.
2. The measurement is on an interval or ratio scale.
3. The variable is continuous (if it is discrete, its range must be wide).
4. The variable is approximately normally distributed.

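Confidence interval of μ when σ is unknown: the t distribution
The upper limit for the 95% confidence interval: $UL_{0.95} = \bar{x} + t_{(0.05,\,n-1)}\, s_{\bar{x}}$
The lower limit for the 95% confidence interval: $LL_{0.95} = \bar{x} - t_{(0.05,\,n-1)}\, s_{\bar{x}}$
$t_{(0.05,\,n-1)}$: the t value with n - 1 degrees of freedom that includes 0.95 of the t distribution (the 0.05 column of Table A.2)
$s_{\bar{x}}$ = standard error of the mean ($= s/\sqrt{n}$)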

Confidence interval of μ when σ is unknown: the t distribution
n = 20, sample mean = 21.0, s = 1.76, t(0.05, df = 19) = 2.093
UL0.95 = 21.0 + (2.093 × 1.76/√20) = 21.0 + (2.093 × 0.394) = 21.825
LL0.95 = 21.0 - (2.093 × 1.76/√20) = 21.0 - (2.093 × 0.394) = 20.175
Therefore the 95% CI for μ is 21.0 ± 0.825 mm, or 20.175 < μ < 21.825.
The probability that the range 20.175 mm to 21.825 mm includes the population mean is 95% (this is not the probability that the population mean falls within this range).
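The calculation for Ex. 7.4 can be sketched in a few lines. The t value 2.093 is taken from Table A.2 as in the slide rather than computed, and `t_confidence_interval` is a hypothetical helper name. Because the code does not round the standard error to 0.394, its half-width is about 0.824 rather than the slide's 0.825.

```python
import math


def t_confidence_interval(x_bar, s, n, t_value):
    """Confidence interval for the mean when sigma is unknown.

    t_value is the two-tailed critical value for n - 1 degrees of freedom,
    looked up in a t table (here: 2.093 for 95% and df = 19).
    """
    se = s / math.sqrt(n)  # estimated standard error of the mean
    half_width = t_value * se
    return x_bar - half_width, x_bar + half_width


lower, upper = t_confidence_interval(x_bar=21.0, s=1.76, n=20, t_value=2.093)
print(f"95% CI: {lower:.3f} to {upper:.3f} mm")
```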

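Reporting a sample mean and its variation
How should a sample mean be reported in a scientific presentation?
1. Mean ± standard deviation ($\bar{x} \pm s$), e.g. 100.5 ± 2.19. This gives some information about the variation of the measured variable in the sampled population, but none about how well the sample mean estimates the population mean.
2. Mean ± standard error ($\bar{x} \pm s_{\bar{x}}$). This is the most common way of reporting a sample mean. It reflects the sample size and gives a rough indication of how well the sample mean estimates the population mean, though it is less informative than a CI.
3. Mean ± 95% (or 99%) confidence interval ($\bar{x} \pm$ CI). This lets readers evaluate the data most accurately.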
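To compare the three reporting styles on one data set, the sketch below formats them for the mosquito fish example (mean 21.0 mm, s = 1.76 mm, n = 20, table t value 2.093). The function name `report_styles` and the dictionary keys are made up for illustration.

```python
import math


def report_styles(x_bar, s, n, t_value):
    """Format a sample mean with SD, SE, and CI half-width (t_value from a t table)."""
    se = s / math.sqrt(n)
    return {
        "mean ± SD": f"{x_bar:.2f} ± {s:.2f}",
        "mean ± SE": f"{x_bar:.2f} ± {se:.2f}",
        "mean ± CI": f"{x_bar:.2f} ± {t_value * se:.2f}",
    }


styles = report_styles(x_bar=21.0, s=1.76, n=20, t_value=2.093)
for label, text in styles.items():
    print(f"{label}: {text}")
```

The three numbers after ± differ a great deal, so a report should always state which one is being shown.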

Reporting a sample mean and its variation
Sample means are often shown in graphs, usually as mean ± standard error: a point for the mean and vertical lines (error bars) for the standard error.

Exercises
Exercise 1: Number of eggs laid by 11 female green iguanas:
33 50 46 33 53 57 44 31 60 40 50
Find the 95% and 99% confidence intervals of the population mean.
Sample standard deviation: 9.968

Exercises
Exercise 1, solution (mean and standard error):
Mean: 45.18
SS = Σx² - (Σx)²/n; variance s² = SS/(n - 1); standard deviation = √variance
SS = 23449 - 247009/11 = 993.6364
Variance = 993.6364/10 = 99.36364
Standard deviation = 9.968
SE = 9.968/√11 = 3.005, rounded to 3

Exercises
Exercise 1, solution (confidence intervals):
Mean: 45.18, SE: 3
95% CI: t value (0.05, df = 10) = 2.228
45.18 ± 2.228 × 3 = 45.18 ± 6.684, that is, 38.496 to 51.864
99% CI: t value (0.01, df = 10) = 3.169
45.18 ± 3.169 × 3 = 45.18 ± 9.507, that is, 35.673 to 54.687
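The whole exercise can be checked with a short script. This is a sketch: the t values are the table values quoted above (df = 10), and the variable names are illustrative. It keeps the unrounded standard error (about 3.005 instead of 3), so its intervals are slightly wider than the hand-calculated ones.

```python
import math

eggs = [33, 50, 46, 33, 53, 57, 44, 31, 60, 40, 50]
n = len(eggs)

mean = sum(eggs) / n
ss = sum(x * x for x in eggs) - sum(eggs) ** 2 / n  # sum of squares
variance = ss / (n - 1)                             # sample variance s^2
sd = math.sqrt(variance)                            # sample standard deviation
se = sd / math.sqrt(n)                              # standard error of the mean

# Two-tailed t values for df = 10, taken from the t table (not computed here).
T_TABLE = {0.95: 2.228, 0.99: 3.169}
intervals = {
    level: (mean - t * se, mean + t * se) for level, t in T_TABLE.items()
}

print(f"mean = {mean:.2f}, SS = {ss:.4f}, s = {sd:.3f}, SE = {se:.3f}")
for level, (lower, upper) in intervals.items():
    print(f"{level:.0%} CI: {lower:.2f} to {upper:.2f}")
```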