An Analysis of Sources of Error and an Estimation of Reliability in Science Performance Assessment Using Generalizability Theory

Ki-Young Lee¹, Hui-Soo An²

¹Teacher, Hansung Science High School

²Professor, Seoul National University

ⓒ Copyright 2004, Korea Institute for Curriculum and Evaluation. This is an Open-Access article distributed under the terms of the Creative Commons Attribution NonCommercial-ShareAlike License (http://creativecommons.org/licenses/by-nc-sa/4.0) which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited.

Received: Apr 15, 2004 ; Revised: May 18, 2004 ; Accepted: Jun 7, 2004

Published Online: Jun 30, 2004

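Summary

This study used generalizability theory to analyze the sources of error in science performance assessment and, on that basis, to estimate generalizability (reliability). Two generalizability studies (G studies) were conducted each semester on science performance assessment data from 90 students in two tenth-grade classes at a high school in Seoul. In the p × (i:r) design, whose facets are essay-type items (i) and raters (r), rater-related variance components were estimated to contribute more to error than item-related ones, and rater-related variance components differed considerably between classes. Rater-related variance components also decreased markedly from the first semester to the second, which was judged to be a training effect: the raters' scoring ability improved over the semesters. In the p × (t:r) design, whose facets are performance tasks (t) and raters (r), task-related variance components were estimated to be larger than rater-related ones, and rater-related variance components again differed considerably between classes. The two G studies showed that a measured score can differ depending on which rater group an examinee is assigned to and which task the examinee performs. In decision studies (D studies) run with the same designs as the G studies, the generalizability coefficient fell short of the acceptable level of 0.8 in most cases, and reaching that level was found to require more raters, items, and performance tasks. As a more fundamental remedy, raising the generalizability of science performance assessment calls for in-depth teacher training to reduce differences among raters, together with efforts to reduce differences among essay-type items and among performance tasks.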

ABSTRACT

In this study, we used generalizability theory to analyze the sources of error in science performance assessment and to estimate its reliability from the results. Science performance assessment data were obtained from two tenth-grade classes in Seoul, and two generalizability studies (G studies), one with a p × (i:r) design and one with a p × (t:r) design, were conducted in both the first and second semesters. Here the facets i, t, and r denote essay-type item, performance task, and rater, respectively.
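For orientation, the variance decomposition usually associated with a p × (i:r) design in generalizability theory can be written as below. This is a sketch in standard textbook notation; the symbols σ², n_r (number of raters), and n_i (number of items per rater) are introduced here for illustration and are not taken from the article. The p × (t:r) design has the same form with t in place of i.

```latex
% p x (i:r): items (i) nested within raters (r), crossed with persons (p).
% Observed-score variance splits into five estimable components;
% the last one confounds the pi:r interaction with residual error (e).
\sigma^2(X_{pir}) = \sigma^2_p + \sigma^2_r + \sigma^2_{i:r}
                  + \sigma^2_{pr} + \sigma^2_{pi:r,e}

% Generalizability coefficient for relative decisions,
% assuming n_r raters and n_i items per rater:
E\rho^2 = \frac{\sigma^2_p}
               {\sigma^2_p + \dfrac{\sigma^2_{pr}}{n_r}
                           + \dfrac{\sigma^2_{pi:r,e}}{n_i\, n_r}}
```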

In the p × (i:r) G studies, the variance components associated with raters (r) were larger than those associated with items (i). The rater-related variance components also differed considerably from class to class, and they decreased markedly in the second semester compared with the first. This decrease appears to reflect a training effect, namely an improvement in the raters' scoring ability by the second semester.

In the p × (t:r) G studies, by contrast, the variance components associated with performance tasks (t) were larger than those associated with raters (r), and the rater-related variance components again differed considerably from class to class.

The findings from the two G studies indicate that a person's score may differ depending on the rater group and on the performance task.

The results of the decision studies (D studies) indicate that, in most cases, the generalizability coefficients were lower than the acceptable level (0.8).

We conclude that the numbers of raters, items, and performance tasks should be increased to reach an acceptable generalizability coefficient and, more fundamentally, that teacher training in rating and efforts to reduce the differences among essay-type items and among performance tasks are needed to improve generalizability.
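The effect of adding raters and items on the generalizability coefficient can be illustrated with a small D-study calculation for the p × (i:r) design. The following is a minimal sketch, not the authors' procedure: the class and function names are hypothetical, and the variance components are made-up illustrative numbers, not estimates reported in this study. It uses the textbook error-variance formulas for a design with items nested within raters.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class VarianceComponents:
    """G-study variance components for a p x (i:r) design (hypothetical container)."""
    p: float       # persons (universe-score variance)
    r: float       # raters
    i_r: float     # items nested within raters
    pr: float      # person x rater interaction
    pi_r_e: float  # person x item-within-rater interaction, confounded with residual


def relative_error(vc: VarianceComponents, n_r: int, n_i: int) -> float:
    """Relative error variance for n_r raters and n_i items per rater."""
    return vc.pr / n_r + vc.pi_r_e / (n_i * n_r)


def absolute_error(vc: VarianceComponents, n_r: int, n_i: int) -> float:
    """Absolute error variance: relative error plus rater and item main effects."""
    return relative_error(vc, n_r, n_i) + vc.r / n_r + vc.i_r / (n_i * n_r)


def g_coefficient(vc: VarianceComponents, n_r: int, n_i: int) -> float:
    """Generalizability coefficient (for relative decisions)."""
    return vc.p / (vc.p + relative_error(vc, n_r, n_i))


def phi_coefficient(vc: VarianceComponents, n_r: int, n_i: int) -> float:
    """Dependability index (for absolute decisions)."""
    return vc.p / (vc.p + absolute_error(vc, n_r, n_i))


# ASSUMPTION: illustrative values only, not taken from the article.
vc = VarianceComponents(p=0.50, r=0.10, i_r=0.05, pr=0.20, pi_r_e=0.40)

if __name__ == "__main__":
    # Tabulate both coefficients over a small grid of D-study sample sizes.
    for n_r in (1, 2, 4):
        for n_i in (2, 4, 8):
            g = g_coefficient(vc, n_r, n_i)
            phi = phi_coefficient(vc, n_r, n_i)
            print(f"n_r={n_r} n_i={n_i}  G={g:.3f}  Phi={phi:.3f}")
```

In this sketch the person-by-rater term is divided only by the number of raters, so when rater-related variance is large, adding items alone does little and the coefficient rises mainly by adding raters.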

Keywords: science performance assessment; reliability; generalizability theory; rater; essay-type item; performance task