A Note on Homogeneity of Variance of Scores and Ranks (1996). Journal of Experimental Education, 4, 351-362.
When any two or more sets of scores with unequal variances are combined and ranked together as one set, the corresponding sets of ranks inherit the unequal variances. This fact is well known in the theory of nonparametric statistics, but in practice researchers and applied statisticians frequently overlook its implications. Because of this property, familiar nonparametric rank tests cannot overcome effects of heterogeneous variances of treatment groups in statistical significance testing. A simulation study demonstrates explicitly that transformation of scores to ranks reduces variance heterogeneity, although not enough to prevent gross distortion of the probabilities of Type I and Type II errors of statistical significance tests, including the t test, the Wilcoxon-Mann-Whitney test, and the van der Waerden, or normal scores test. The present note also focuses attention on an aspect of the problem that is neglected in the literature: The equivalence of various nonparametric tests and their parametric counterparts performed on ranks, or the rank transformation concept, provides a rationale for the influence of unequal variances on test statistics calculated from ranks.
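The inheritance of unequal variances by ranks can be illustrated with a minimal sketch. The sample size, the standard deviations of 1 and 3, the seed, and the function name `rank_variances` are illustrative assumptions, not values taken from the paper.

```python
import random
import statistics

def rank_variances(n=1000, sd_a=1.0, sd_b=3.0, seed=1):
    """Return (variance of ranks in group A, variance of ranks in group B)."""
    rng = random.Random(seed)
    a = [rng.gauss(0.0, sd_a) for _ in range(n)]
    b = [rng.gauss(0.0, sd_b) for _ in range(n)]
    # Pool the scores, remembering group membership, and rank them jointly.
    pooled = sorted([(x, "a") for x in a] + [(x, "b") for x in b])
    ranks = {"a": [], "b": []}
    for rank, (_, group) in enumerate(pooled, start=1):
        ranks[group].append(rank)
    return statistics.variance(ranks["a"]), statistics.variance(ranks["b"])

var_a, var_b = rank_variances()
print(f"variance of ranks: group A = {var_a:.0f}, group B = {var_b:.0f}")
print(f"ratio of rank variances = {var_b / var_a:.2f} (score variance ratio = 9)")
```

In this sketch the group with the larger score variance should also show the larger rank variance, while the ratio of rank variances should fall below the ratio of score variances, in line with the statement that ranking reduces but does not remove heterogeneity.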
Is a Final Theory Conceivable? (1996). Psychological Record, 46, 423-438.
Some physicists believe that a final theory, which will unify separate branches of theoretical physics, including quantum field theory and general relativity, is imminent. The physicists who expect such a theory typically assume that ultimate natural laws will be expressed by the same mathematical formalism which is associated with present-day physics. This view is questionable, because mathematics itself evolves and is not currently in a finished form. Historically, a succession of discoveries in physics has unveiled new laws of nature, each stage being made possible by the development of new mathematics. Evolutionary biology and interbehavioral psychology, ordinarily overlooked by the philosophy of science, reveal further inconsistencies inherent in the concept of a terminal stage of scientific discovery. Another troublesome implication of this notion has come to light recently: Finality of a conceptual scheme comprised of a finite number of natural laws is incompatible with recent developments in mathematical logic and the theory of computational irreducibility.
Some Properties of Preliminary Tests of Equality of Variances in the Two-Sample Location Problem (1996). Journal of General Psychology, 123, 217-231.
A simulation study was conducted to examine probabilities of Type I errors of the two-sample Student t test, the Wilcoxon-Mann-Whitney test, and the Welch separate-variances t test under violation of homogeneity of variance. Two-stage procedures in which the choice of a significance test in the second stage is determined by the outcome of a preliminary test of equality of variances in the first stage were also examined. Type I error rates of both the t test and the Wilcoxon test were severely biased by unequal population variances combined with unequal sample sizes. The two-stage procedures were not only ineffective; they actually distorted the significance level of the test of location. Furthermore, the distortion was greatest when the discrepancy between variances was slight rather than extreme. Unconditional substitution of the Welch separate-variances t test for the Student t test whenever sample sizes were unequal was the most effective way to counteract modification of the significance level. Conditional substitution of the Welch test, depending on the outcome of a preliminary test, was far less effective.
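For reference, the two t statistics compared in this study can be written in their standard textbook forms. The notation (sample means, sample variances, sample sizes n_1 and n_2, degrees of freedom nu) is assumed here and is not quoted from the paper.

```latex
\[
t_{\text{pooled}} = \frac{\bar{x}_1 - \bar{x}_2}{s_p\sqrt{\dfrac{1}{n_1} + \dfrac{1}{n_2}}},
\qquad
s_p^2 = \frac{(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2}{n_1 + n_2 - 2},
\qquad
\nu = n_1 + n_2 - 2
\]
\[
t_{\text{Welch}} = \frac{\bar{x}_1 - \bar{x}_2}{\sqrt{\dfrac{s_1^2}{n_1} + \dfrac{s_2^2}{n_2}}},
\qquad
\nu \approx \frac{\left(\dfrac{s_1^2}{n_1} + \dfrac{s_2^2}{n_2}\right)^2}
{\dfrac{(s_1^2/n_1)^2}{n_1 - 1} + \dfrac{(s_2^2/n_2)^2}{n_2 - 1}}
\]
```

The pooled estimate weights each sample variance by its degrees of freedom, which is why unequal sample sizes combined with unequal variances affect the Student statistic but not the Welch statistic in the same way.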
Properties of the Spearman Correction for Attenuation for Normal and Realistic Non-Normal Distributions (with Richard H. Williams, 1997). Applied Psychological Measurement, 3, 253-270.
Results are presented of a computer simulation study of the Spearman correction for attenuation using a design originally suggested by Spearman (1904). Two parallel measures were generated for each of two variables with predetermined distribution shapes, correlations between true scores, and population reliability coefficients. The resulting data consisted of error scores, observed scores, sample reliability coefficients, and sample validity coefficients. The correction for attenuation was performed, and means, variances, and relative frequency distributions of both the uncorrected and corrected validity coefficients were analyzed for normal and non-normal distributions. For varying sample sizes and for all population distributions, the means of the corrected sample correlations were very close to the correlation between true scores, provided that the population reliability coefficients were fairly high. The variability of the corrected sample correlations was substantial, even for larger sample sizes. For lower reliability values, there was pronounced overcorrection, combined with extreme variability, especially for smaller sample sizes. Under these conditions, corrections exceeding 1.00 were frequent. The correction for attenuation appears to be useful only if the reliability coefficients of both measures are relatively high and sample size is relatively large. The properties of the correction for attenuation appear to be independent of the shape of the population distribution of test scores, at least for distributions commonly encountered in psychological and educational research.
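The correction itself is the familiar Spearman formula, in which the observed validity coefficient is divided by the square root of the product of the two reliability coefficients. A minimal sketch follows; the function name and the numerical inputs are illustrative assumptions, not results from the study.

```python
import math

def correct_for_attenuation(r_xy, r_xx, r_yy):
    """Spearman correction: observed correlation divided by the geometric
    mean of the two reliability coefficients."""
    if r_xx <= 0 or r_yy <= 0:
        raise ValueError("reliability coefficients must be positive")
    return r_xy / math.sqrt(r_xx * r_yy)

# Illustrative values only (not taken from the paper).
high_reliability = correct_for_attenuation(0.56, 0.80, 0.80)
low_reliability = correct_for_attenuation(0.50, 0.40, 0.50)
print(round(high_reliability, 3), round(low_reliability, 3))
```

With low reliabilities in the denominator, a moderate observed correlation can yield a corrected value above 1.00, which is the kind of overcorrection the abstract describes.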
A Geometric Interpretation of the Validity and Reliability of Difference Scores (1997). British Journal of Mathematical and Statistical Psychology, 50, 73-80.
For several decades, psychometricians have been concerned about the unreliability and meagre validity of difference scores and gain scores. The present paper explores a geometric interpretation, in which observed scores are identified with vectors in a function space of random variables, and true and error components of scores are identified with orthogonal projections onto complementary subspaces. This point of view provides a simple and easily visualized geometric explanation for the widespread belief that differences are inherently unreliable. Furthermore, it discloses conditions under which difference scores are highly reliable and have substantial correlations with other measures.
A Note on Interpretation of the Paired-Samples t Test (1997). Journal of Educational and Behavioral Statistics, 22, 349-360.
Explanations of advantages and disadvantages of paired-samples experimental designs in textbooks in education and psychology frequently overlook the change in Type I error probability which occurs when an independent-samples t test is performed on correlated observations. This alteration of the significance level can be extreme even if the correlation is small. By comparison, the loss of power of the paired-samples t test on difference scores due to reduction of degrees of freedom, which typically is emphasized, is relatively slight. Although paired-samples designs are appropriate and widely used when there is a natural correspondence or pairing of scores, researchers have not often considered the implications of undetected correlation between supposedly independent samples in the absence of explicit pairing.
Invalidation of Parametric and Nonparametric Statistical Tests by Concurrent Violation of Two Assumptions (1998). Journal of Experimental Education, 67, 55-68.
To provide counterexamples to some commonly held generalizations about the benefits of nonparametric tests, the author concurrently violated in a simulation study two assumptions of parametric statistical significance tests: normality and homogeneity of variance. For various combinations of non-normal distribution shapes and degrees of variance heterogeneity, the Type I error probability of a nonparametric rank test, the Wilcoxon-Mann-Whitney test, was found to be biased to a far greater extent than that of its parametric counterpart, the Student t test. The Welch-Satterthwaite separate-variances version of the t test, together with a preliminary outlier detection and downweighting procedure, protected the significance level more consistently than the nonparametric test did. Those findings reveal that nonparametric methods are not always acceptable substitutes for parametric methods such as the t test and F test in research studies when parametric assumptions are not satisfied. They also indicate that multiple violation of assumptions can produce anomalous effects not observed in separate violations.
Reliability of Gain Scores under Realistic Assumptions about Properties of Pre-test and Post-test Scores (with Richard H. Williams, 1998). British Journal of Mathematical and Statistical Psychology, 51, 343-351.
For many years, psychometricians working in the context of classical test theory have questioned the reliability of measures of gains and growth. Many tables and figures in textbooks and journal articles have suggested that the reliability of these measures is far below that of the pre-test and post-test scores from which they are determined. Previous findings reported by the present authors have shown that under special conditions, gain scores can be much more reliable, especially when variances of pre-test and post-test scores are unequal and at the same time the reliability coefficients of pre-test and post-test scores are unequal with the same directionality. The present paper explores a novel approach to derivation of formulae for the reliability of gain scores in classical test theory. This approach reveals not only that gain scores can be reliable, but also that their reliability coefficients are intermediate between those of the pre-test and the post-test in a large proportion of practical testing applications.
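The contrast between the textbook case and the case of unequal variances and reliabilities can be seen with the classical formula for the reliability of a difference score, which assumes uncorrelated errors. This is the standard formula rather than the novel derivation the paper describes, and the function name and input values are illustrative assumptions.

```python
def difference_score_reliability(sd_x, sd_y, rel_x, rel_y, r_xy):
    """Classical reliability of D = Y - X, assuming uncorrelated errors."""
    cov = r_xy * sd_x * sd_y
    numerator = sd_x**2 * rel_x + sd_y**2 * rel_y - 2 * cov
    denominator = sd_x**2 + sd_y**2 - 2 * cov
    if denominator <= 0:
        raise ValueError("difference scores have no variance")
    return numerator / denominator

# Equal variances and equal reliabilities: the familiar low value.
equal = difference_score_reliability(10, 10, 0.80, 0.80, 0.70)
# Unequal variances with the more reliable test also the more variable one.
unequal = difference_score_reliability(10, 20, 0.70, 0.90, 0.50)
print(round(equal, 3), round(unequal, 3))
```

In the second case the reliability of the difference lies between the pre-test and post-test reliabilities of .70 and .90, which is the pattern the abstract reports for many practical applications.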
How Should Classical Test Theory Have Defined Validity? (1998). Social Indicators Research, 45, 233-251.
Classical test theory defined the predictive validity of a test as the ordinary Pearson correlation between scores on the test and scores on a validation criterion. For some purposes this definition is satisfactory, but for others it leads to complications, because derivation of familiar equations relating validity and reliability requires an independent assumption of uncorrelated errors of measurement. The present paper proposes an alternate definition of validity that avoids difficulties arising from correlated error scores and is more consistent with standard definitions of true score, error score, and reliability in the classical theory.
Type I Error Probabilities of the Wilcoxon-Mann-Whitney Test and Student t Test Altered by Heterogeneous Variances and Equal Sample Sizes (1999). Perceptual and Motor Skills, 88, 556-558.
The Student t test maintains its significance level more consistently than the Wilcoxon-Mann-Whitney test when variances of treatment groups are unequal and sample sizes are equal. The probability of a Type I error of the nonparametric test systematically increases above the nominal significance level as the ratio of population standard deviations increases, while that of the t test remains fairly stable.
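A rough Monte Carlo sketch of this comparison is given below. It is not the author's program: the group size of 25, the standard deviation ratio of 4, the number of replications, the seed, and all function names are assumptions chosen for illustration, and the large-sample normal approximation is used for the Wilcoxon-Mann-Whitney test.

```python
import math
import random
import statistics

N_PER_GROUP = 25
Z_CRIT = 1.959964  # two-sided .05 critical value, standard normal
T_CRIT = 2.0106    # two-sided .05 critical value, t with 48 df (2 * 25 - 2)

def wmw_z(a, b):
    """Large-sample z statistic for the Wilcoxon-Mann-Whitney test (no ties)."""
    n1, n2 = len(a), len(b)
    pooled = sorted([(x, 0) for x in a] + [(x, 1) for x in b])
    rank_sum_a = sum(r for r, (_, g) in enumerate(pooled, start=1) if g == 0)
    u = rank_sum_a - n1 * (n1 + 1) / 2
    return (u - n1 * n2 / 2) / math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)

def pooled_t(a, b):
    """Student pooled-variances t statistic."""
    n1, n2 = len(a), len(b)
    sp2 = ((n1 - 1) * statistics.variance(a)
           + (n2 - 1) * statistics.variance(b)) / (n1 + n2 - 2)
    return (statistics.fmean(a) - statistics.fmean(b)) / math.sqrt(sp2 * (1 / n1 + 1 / n2))

def type1_rates(sd_ratio=4.0, reps=4000, seed=7):
    """Return empirical (t test, WMW) rejection rates when both means are 0."""
    rng = random.Random(seed)
    t_rejections = wmw_rejections = 0
    for _ in range(reps):
        a = [rng.gauss(0.0, 1.0) for _ in range(N_PER_GROUP)]
        b = [rng.gauss(0.0, sd_ratio) for _ in range(N_PER_GROUP)]
        t_rejections += abs(pooled_t(a, b)) > T_CRIT
        wmw_rejections += abs(wmw_z(a, b)) > Z_CRIT
    return t_rejections / reps, wmw_rejections / reps

t_rate, wmw_rate = type1_rates()
print(f"t test: {t_rate:.3f}   Wilcoxon-Mann-Whitney: {wmw_rate:.3f}")
```

Both populations are normal with equal means, so every rejection is a Type I error. Under these assumptions the rank test's rate would be expected to rise above .05 while the t test's rate stays closer to it.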
Restriction of Range and Correlation in Outlier-Prone Distributions (with Richard H. Williams, 2000). Applied Psychological Measurement, 24, 267-280.
Statistical theory indicates that restriction of the range of possible values of normally distributed variables, and many non-normal variables, reduces correlations in unrestricted populations. Contrary to this typical outcome, results of a simulation study show that range restriction sometimes increased the correlation between variables having outlier-prone distributions. This result occurred in the case of exponential and ex-Gaussian distributions, which are encountered in experimental studies involving response times. It did not occur in truncated versions of the same densities. Chance occurrence of outliers in contaminated-normal, or mixed-normal, distributions reduced the correlation found between samples from uncontaminated populations. Conversely, detection and downweighting of outliers increased the magnitude of sample correlations, and a similar result occurred for many other outlier-prone distributions. Practical implications of these findings are discussed.
The Effect of Selection of Samples for Homogeneity on Type I Error Rate (2001). Interstat.
The Type I error probability of the two-sample Student t test is known to deviate from the statistical significance level when variances are unequal and, at the same time, sample sizes are unequal. The present simulation study indicates that, for various sample sizes and ratios of population standard deviations, the conditional probability of a Type I error, under the condition that the ratio of sample standard deviations falls in a narrow interval close to 1.00, is larger than the unconditional probability of a Type I error. This result emphasizes that it is essential to distinguish between homogeneity of population variances and homogeneity of sample variances. If population variances are heterogeneous and sample sizes are unequal, a Student t test will be invalid, whether or not sample variances of treatment groups happen to be equal. Accordingly, it is not possible for researchers to protect the significance level of the test by explicit selection of homogeneous samples.
The Geometry of Probability, Statistics, and Test Theory (with Bruno D. Zumbo, 2001). International Journal of Testing, 1(3&4), 283-303.
The model of tests and measurements outlined in this paper identifies test scores with Hilbert space vectors and true and error components of scores with linear operators. The collection of all observed scores associated with a test procedure is represented by a function space comprised of all random variables defined on a probability space, and the collection of all true scores is a Hilbert subspace of this function space. The collection of all error scores is the orthogonal complement of the subspace of true scores. This geometric formalism simplifies derivations in test theory and brings to light relations among concepts in probability, statistics, and measurement that are not otherwise apparent. Test reliability, test validity, error of measurement, parallel tests, and other familiar concepts can be studied in this framework, making their mathematical properties and their interrelations with one another more obvious.
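The core of this geometric picture can be stated compactly. The sketch below assumes centered scores and uses covariance as the inner product; the symbols X, T, E and the angle theta are notation chosen here for illustration, not necessarily that of the paper.

```latex
\[
X = T + E, \qquad \langle T, E \rangle = \operatorname{Cov}(T, E) = 0
\quad\Longrightarrow\quad
\|X\|^2 = \|T\|^2 + \|E\|^2
\]
\[
\rho_{XX'} = \frac{\sigma_T^2}{\sigma_X^2} = \frac{\|T\|^2}{\|X\|^2} = \cos^2\theta ,
\qquad \theta = \text{angle between } X \text{ and its projection } T
\]
```

Read this way, reliability is the squared cosine of the angle between an observed score and its true-score projection, and the decomposition of observed variance into true and error variance is the Pythagorean theorem.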
A Warning about Statistical Significance Tests Performed on Large Samples of Nonindependent Observations (2002). Perceptual and Motor Skills, 94, 259-263.
When sample observations are not independent, the variance estimate in the denominator of the Student t statistic is altered, inflating the value of the test statistic and resulting in far too many Type I errors. Furthermore, how much the Type I error probability exceeds the nominal significance level is an increasing function of sample size. If N is quite large, in the range of 100 or 200 or larger, small apparently inconsequential correlations that are unknown to a researcher, such as .01 or .02, can have substantial effects and lead to false reports of statistical significance when effect size is zero.
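A worked equation suggests why the effect grows with sample size. It assumes a simple equicorrelated model, in which every pair of observations in a sample of size n has correlation rho. The abstract does not state this model, so it is an assumption made for illustration.

```latex
\[
\operatorname{Var}(\bar{x}) = \frac{\sigma^2}{n}\bigl[1 + (n - 1)\rho\bigr],
\qquad
E\!\left[s^2\right] = \sigma^2 (1 - \rho)
\]
\[
\frac{\text{true variance of } \bar{x}}{\text{variance assumed by the } t \text{ statistic}}
= \frac{1 + (n - 1)\rho}{1 - \rho}
\]
\[
n = 200,\ \rho = .02: \qquad
\frac{1 + 199(.02)}{1 - .02} = \frac{4.98}{0.98} \approx 5.08,
\qquad \sqrt{5.08} \approx 2.25
\]
```

Under this model the standard error in the denominator is too small by a factor of about 2.25 when n = 200 and rho = .02, and the factor increases with n because of the (n - 1) rho term.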
Bias in Estimation and Hypothesis Testing of Correlation (with Bruno D. Zumbo and Richard H. Williams, 2003). Psicologica, 24, 133-158.
This study examined bias in the sample correlation coefficient, r, and its correction by unbiased estimators. Computer simulations revealed that the expected value of correlation coefficients in samples from a normal population is slightly less than the population correlation and that the bias is almost eliminated by an estimator suggested by R.A. Fisher and is more completely eliminated by a related estimator recommended by Olkin and Pratt. Transformation of initial scores to ranks and calculation of the Spearman rank correlation produces somewhat greater bias. Type I error probabilities of significance tests of zero correlation based on the Student t statistic and exact tests based on critical values of the Spearman correlation obtained from permutations remain fairly close to the significance level for normal and several non-normal distributions. However, significance tests of non-zero values of correlation based on the r to Z transformation are grossly distorted for distributions that violate bivariate normality. Also, significance tests of non-zero values of the Spearman correlation based on the r to Z transformation are distorted even for normal distributions.
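The r to Z transformation referred to above is Fisher's standard variance-stabilizing transformation. The notation below, including the hypothesized value rho_0, is assumed for illustration.

```latex
\[
z_r = \tfrac{1}{2}\ln\!\frac{1 + r}{1 - r} = \operatorname{artanh}(r),
\qquad
\operatorname{SE}(z_r) \approx \frac{1}{\sqrt{n - 3}}
\]
\[
Z = \bigl(z_r - z_{\rho_0}\bigr)\sqrt{n - 3} \;\sim\; N(0, 1)
\quad \text{approximately, under bivariate normality}
\]
```

The approximate standard error depends on the bivariate normality assumption, which is consistent with the finding that tests of non-zero correlation based on this transformation are distorted when that assumption fails.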
A New Look at the Influence of Guessing on the Reliability of Multiple-Choice Tests (with Richard H. Williams, 2003). Applied Psychological Measurement, 27, 357-371.
Previous studies have established that chance success due to guessing contributes to error variance and diminishes the reliability of multiple-choice tests and true-false tests. However, the practical usefulness of these theoretical results remains doubtful. Equations that have been derived have not often been used in practical work in testing and test construction. One reason is that relatively little is known about how guessing combines with other sources of error variance that determine test reliability and what proportion of the total variance of test scores is accounted for by guessing. The present paper derives explicit formulas which allow for combinations of error variance due to guessing and other sources of error. These formulas disclose an interaction between test length and number of item choices in determining the reliability of multiple-choice tests. They provide a more realistic guide as to how much improvement in reliability can be expected by altering parameters such as number of test items, number of item choices, and the means and variances of examinees' observed scores.
A Warning about the Large-Sample Wilcoxon-Mann-Whitney Test (2003). Understanding Statistics, 2, 267-280.
It is known that the Wilcoxon-Mann-Whitney test is strongly influenced by unequal variances of treatment groups combined with unequal sample sizes. The present simulation study indicates that, for various continuous and discrete distributions, the discrepancy between the empirical Type I error rate and the nominal significance level is large even when sample sizes are equal. In some cases, it exceeds the similar discrepancy characteristic of the Student t test. Furthermore, for some distributions, the discrepancy becomes increasingly more extreme as sample sizes increase. When sample sizes are relatively large, so that the normal-approximation form of the Wilcoxon-Mann-Whitney statistic is appropriate, minor and usually undetected differences in variability of treatment groups can substantially inflate the Type I error rate. For several distributions, including some that occur frequently in psychological research, ratios of population standard deviations as small as 1.1 or 1.2 have sizeable effects.
Conditional Probabilities of Rejecting Null Hypotheses by Pooled and Separate Variances t Tests Given Heterogeneity of Sample Variances (2004). Communications in Statistics: Simulation and Computation, 33, 69-81.
It is known that the Type I error probability of the Student t test is spuriously elevated or depressed by unequal variances combined with unequal sample sizes and that the Welch separate-variances version of the t test usually eliminates these effects. The present study found conditional probabilities of rejecting the null hypothesis, for both significance tests, given various conditions on the sample variances. The conditional probability of a Type I error, given that sample variances are nearly equal, is also elevated or depressed, sometimes to an even greater extent than the unconditional probability. For various combinations of sample sizes and variance heterogeneity, similar results characterize the Welch t test. These findings imply that researchers cannot protect the significance level and power of the t test by deciding whether to use a pooled-variances or separate-variances version based solely on inspection of sample data.
Inflation of Type I Error Rates by Unequal Variances Associated with Parametric, Nonparametric, and Rank-Transformation Tests (2004). Psicologica, 25, 103-133.
It is well known that the two-sample Student t test fails to maintain its significance level when the variances of treatment groups are unequal and, at the same time, sample sizes are unequal. It is generally believed, however, that tests of location are robust to variance heterogeneity when sample sizes are equal. The present study discloses that, for a wide variety of non-normal distributions, especially skewed distributions, the Type I error probabilities of both the t test and the Wilcoxon-Mann-Whitney test are inflated by heterogeneous variances, even when sample sizes are equal. The Type I error rate of the t test performed on ranks replacing the scores (rank-transformed data) is inflated in the same way and always corresponds quite closely to that of the Wilcoxon-Mann-Whitney test. For many probability densities, the distortion of the significance level is far greater after transformation to ranks and, contrary to known asymptotic properties, the magnitude of the inflation is an increasing function of sample size. For symmetric distributions these effects are minimal. Apparently, the Wilcoxon-Mann-Whitney test and rank-transformation tests are insensitive to differences in shape that are not related to skewness and not accompanied by differences in the means of ranks.
A Note on Preliminary Tests of Equality of Variances (2004). British Journal of Mathematical and Statistical Psychology, 57, 173-181.
Preliminary tests of equality of variances used before a test of location are no longer widely recommended by statisticians, although they persist in some textbooks and software packages. The present study extended the findings of previous studies and provided further reasons for discontinuing the use of preliminary tests. The study found Type I error rates of a two-stage procedure, consisting of a preliminary Levene test on samples of different sizes with unequal variances, followed by either a Student pooled-variances t test or a Welch separate-variances t test. Simulations disclosed that the two-stage procedure fails to protect the significance level and usually makes the situation worse. Earlier studies have shown that preliminary tests often adversely affect the size of the test, and also that the Welch test is superior to the usual t test when variances are unequal. The present simulations reveal that changes in Type I error rates are greater when sample sizes are smaller, when the difference in variances is slight rather than extreme, and when the significance level is more stringent. Furthermore, the validity of the Welch test deteriorates if it is used only on those occasions where a preliminary test indicates that it is needed. Optimum protection is assured by using a separate-variances test unconditionally whenever sample sizes are unequal.
Inflated Statistical Significance of Student's t Test Associated with Small Intersubject Correlation (2004). Journal of Statistical Computation and Simulation, 74, 691-696.
The independence assumption in statistical significance testing becomes increasingly crucial and unforgiving as sample size increases. Seemingly inconsequential violations of this assumption can substantially increase the probability of a Type I error if sample sizes are large. In the case of Student's t test, it is found that correlations within samples in a range from .01 to .05 can lead to rejection of a true null hypothesis with high probability if N is 50, 100, or larger.
Louis Guttman's Contributions to Classical Test Theory (with Richard H. Williams, Bruno D. Zumbo, and Donald Ross, 2005). International Journal of Testing, 5(1), 81-95.
This article focuses on Louis Guttman's contributions to the classical theory of educational and psychological tests, one of the lesser known of his many contributions to quantitative methods in the social sciences. Guttman's work in this field provided a rigorous mathematical basis for ideas that, for many decades after Spearman's initial work, had been somewhat ambiguous and not adequately formalized. It anticipated later developments that are more widely known in education and psychology.
Can Percentiles Replace Raw Scores in the Statistical Analysis of Test Data? (with Bruno D. Zumbo, 2005). Educational and Psychological Measurement, 65, 616-638.
Textbooks in educational and psychological testing typically warn that it is inappropriate to perform arithmetic operations and subsequent statistical analysis on percentiles in place of raw scores. This recommendation appears to be inconsistent with the well-established finding that transformation of scores to ranks and use of nonparametric methods often improves the validity and power of significance tests for non-normal distributions. The present study compared the Student t test performed on raw scores, on the ranks of scores, and on percentiles of the same scores obtained from larger populations, for normal and various skewed and symmetric non-normal distributions. Using percentiles in place of raw scores protects the Type I error rate of the t test, just like using ranks in place of raw scores, for all the distributions studied. Furthermore, using percentiles markedly increases the power of the t test for skewed distributions, more so than using ranks, and percentiles are nearly as effective as ranks for symmetric distributions. These findings have practical relevance to experimental designs involving test scores and other measures when both raw scores and percentiles are available.
Increasing Power in Paired-Samples Designs by Correcting the Student t Statistic for Correlation (2005). Interstat.
In order to circumvent the influence of correlation in paired-samples and repeated measures experimental designs, researchers typically perform a one-sample Student t test on difference scores. This procedure entails some loss of power, because it employs n - 1 degrees of freedom instead of the 2n - 2 degrees of freedom of the independent-samples t test. In the case of non-normal distributions, researchers typically substitute the Wilcoxon signed-ranks test for the one-sample t test. The present study explored an alternate strategy, using a modified two-samples t test with a correction for correlation. For non-normal distributions, the same modified t test was performed on rank-transformed data. Simulations disclosed that this procedure, which retains 2n - 2 degrees of freedom, protects the Type I error rate for moderate and large samples, maintains power for normal distributions, and substantially increases power for many non-normal distributions.
Two Separate Effects of Variance Heterogeneity on the Validity and Power of Significance Tests of Location (2006). Statistical Methodology, 3, 341-394.
Heterogeneity of variances of treatment groups influences the validity and power of significance tests of location in two distinct ways. First, if sample sizes are unequal, the Type I error rate and power are depressed if a larger variance is associated with a larger sample size, and elevated if a larger variance is associated with a smaller sample size. This well-established effect, which occurs in t and F tests, and to a lesser degree in nonparametric rank tests, results from unequal contributions of pooled estimates of error variance in the computation of test statistics. It is observed in samples from normal distributions, as well as non-normal distributions of various shapes. Second, transformation of scores from skewed distributions with unequal variances to ranks produces differences in the means of the ranks assigned to the respective groups, even if the means of the initial groups are equal, and a subsequent inflation of Type I error rates and power. This effect occurs for all sample sizes, equal and unequal. For the t test, the discrepancy diminishes, and for the Wilcoxon-Mann-Whitney test, it becomes larger, as sample size increases. The Welch separate-variances t test overcomes the first effect but not the second. Because of interaction of these separate effects, the validity and power of both parametric and nonparametric tests performed on samples of any size from unknown distributions with possibly unequal variances can be distorted in unpredictable ways.
Correction for Attenuation with Biased Reliability Estimates and Correlated Errors in Populations and Samples (2007). Educational and Psychological Measurement, 67, 920-939.
Properties of the Spearman correction for attenuation were investigated using Monte Carlo methods, under conditions where a correlation between error scores exists as a population parameter, and also where correlated errors arise by chance in random sampling. Equations allowing for all possible dependence among true and error scores on two tests at both the sample and population levels were derived and compared to simulation results. The additional influence of biased estimates of reliability at both levels was examined. Research settings under which the correction for attenuation can be useful in data analysis and those under which it is inaccurate and extremely variable were distinguished.
The Reliability of Difference Scores in Populations and Samples (2009). Journal of Educational Measurement, 46, 19-42.
This study investigated the relation between the reliability of difference scores, considered as a parameter characterizing a population of examinees, and the reliability estimates obtained from random samples from the population. The parameters in familiar equations for the reliability of difference scores were redefined in such a way that determinants of reliability in both populations and samples become more transparent. Computer simulation using MATHEMATICA found sample values and plotted frequency distributions of various correlations and variance ratios relevant to the reliability of differences. The results give a more complete picture of conditions under which difference scores and gain scores are likely to be useful in research.
Inheritance of Properties of Normal and Non-Normal Distributions After Transformation of Scores to Ranks. Psicologica (in press).
This study investigated how population parameters representing heterogeneity of variance, skewness, kurtosis, bimodality, and outlier-proneness, drawn from normal and eleven non-normal distributions, also characterized the ranks corresponding to independent samples of scores. When the parameters of population distributions from which samples were drawn were different, the ranks corresponding to the same pairs of samples of scores inherited similar differences. This finding explains some known results concerning Type I error probabilities and the relative power of parametric and nonparametric tests for various non-normal densities.
Power Comparisons of Significance Tests of Location Using Scores, Ranks, and Modular Ranks. British Journal of Mathematical and Statistical Psychology (in press).
The Type I error probability and the power of the independent-samples Student t test, performed directly on the ranks of scores in combined samples in place of the original scores, are known to be the same as those of the nonparametric Wilcoxon-Mann-Whitney test. In the present study, simulations revealed that these probabilities remain essentially unchanged when the number of ranks is reduced by assigning the same rank to multiple ordered scores. For example, if 200 ranks are reduced to as few as 20, or 10, or 5 ranks by replacing sequences of consecutive ranks by a single number, the Type I error probability and power stay about the same. Significance tests performed on these modular ranks consistently reproduce familiar findings about the comparative power of the t test and the Wilcoxon-Mann-Whitney test for normal and various non-normal distributions. Similar results are obtained for modular ranks used in comparing the one-sample Student t test and the Wilcoxon signed-ranks test.
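One way to picture the reduction of ranks is to map each run of consecutive ranks to a single number. The sketch below is an assumed reading of that idea; the function name `modular_ranks` and the equal-width blocking rule are illustrative choices and may differ from the scheme used in the paper.

```python
import math

def modular_ranks(ranks, n_levels):
    """Collapse ranks 1..N into n_levels values by mapping each run of
    consecutive ranks to a single number in 1..n_levels."""
    block = math.ceil(len(ranks) / n_levels)  # ranks per run, assuming max rank = N
    return [math.ceil(r / block) for r in ranks]

ranks = list(range(1, 201))           # 200 ranks from a combined sample
for levels in (20, 10, 5):
    reduced = modular_ranks(ranks, levels)
    print(levels, "levels:", reduced[:3], "...", reduced[-3:])
```

A t test would then be computed on the reduced values in place of the full set of ranks.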
Sampling Variability and Axioms of Classical Test Theory. Journal of Educational and Behavioral Statistics (in press).
Many familiar formulas in classical test theory are mathematical identities in populations of individuals, but do not have the same status in samples from those populations, because assumptions made in derivation of the formulas are not necessarily satisfied in samples. The present study derived equations relating test scores and components of scores that are identities in all samples and that reduce to the corresponding population equations when various correlation coefficients are zero. Simulations determined the accuracy of both the familiar and the revised equations when applied to samples of various sizes from populations with known reliability coefficients. The programs also determined the variances of sample values, as well as the mean and variance of discrepancies between sample and population values. The findings revealed that extensive inaccuracy and variability in characteristics of test scores resulted not only from ordinary sampling error, but also from the fact that assumptions made in the mathematical derivation of equations often were not satisfied in small samples.
A Simple and Effective Decision Rule for Choosing a Significance Test to Protect Against Non-Normality. British Journal of Mathematical and Statistical Psychology (in press).
There is no formal and generally accepted procedure for choosing an appropriate significance test for sample data when the assumption of normality is doubtful. Various tests of normality that have been proposed over the years have been found to have limited usefulness, and sometimes a preliminary test makes the situation worse. The present paper investigates a specific and easily applied rule for choosing between a parametric and nonparametric test that does not require a preliminary significance test of normality. Simulations revealed that the rule, which can be applied to sample data automatically by computer software, protects the Type I error rate and increases power for various sample sizes, significance levels, and non-normal distribution shapes.
A Note on Consistency of Nonparametric Rank Tests and Related Rank Transformations. British Journal of Mathematical and Statistical Psychology (in press).
The extent to which rank transformations result in the same statistical decisions as their nonparametric counterparts was investigated. The study performed simulations using the Wilcoxon-Mann-Whitney test, the Wilcoxon signed-ranks test, and the Kruskal-Wallis test, together with the rank transformations and t and F tests corresponding to each of those nonparametric methods. In addition to the Type I error rates and power found from 50,000 iterations, the study also examined the consistency of the outcomes of the two methods on each individual sample. The results disclosed how acceptance or rejection of the null hypothesis and differences in p-values of the test statistics depend in a regular and predictable way on sample size, significance level, and differences between means, for normal and various non-normal distributions.
Inheritance of Properties of Normal and Non-Normal Distributions After Transformation of Scores to Ranks (2011). Psicologica, 32, 65-85.
This study investigated how population parameters representing heterogeneity of variance, skewness, kurtosis, bimodality, and outlier-proneness, drawn from normal and eleven non-normal distributions, also characterized the ranks corresponding to independent samples of scores. When the parameters of population distributions from which samples were drawn were different, the ranks corresponding to the same pairs of samples of scores inherited similar differences. This finding explains some known results concerning Type I error probabilities and the relative power of parametric and nonparametric tests for various non-normal densities.
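The inheritance of unequal variances can be illustrated with a small simulation. The sketch below is only an illustration under assumed settings (two normal populations with equal means and standard deviations of 1 and 3, 1000 scores each, a fixed seed); it is not the design used in the study, and the helper name `combined_ranks` is hypothetical.

```python
import random
import statistics


def combined_ranks(a, b):
    """Rank the pooled scores 1..N and return the ranks of each group.

    Scores are continuous here, so ties are not handled.
    """
    pooled = [(x, 0) for x in a] + [(x, 1) for x in b]
    pooled.sort(key=lambda pair: pair[0])
    ranks = ([], [])
    for rank, (_, group) in enumerate(pooled, start=1):
        ranks[group].append(rank)
    return ranks


rng = random.Random(1)  # fixed seed so the run is repeatable
group_1 = [rng.gauss(0.0, 1.0) for _ in range(1000)]
group_2 = [rng.gauss(0.0, 3.0) for _ in range(1000)]

ranks_1, ranks_2 = combined_ranks(group_1, group_2)

score_ratio = statistics.variance(group_2) / statistics.variance(group_1)
rank_ratio = statistics.variance(ranks_2) / statistics.variance(ranks_1)
print("variance ratio of scores:", round(score_ratio, 2))
print("variance ratio of ranks: ", round(rank_ratio, 2))
```

The group with the larger score variance occupies more of the extreme ranks, so its ranks remain more variable even though the ratio is smaller than for the original scores.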
Power Comparisons of Significance Tests of Location Using Scores, Ranks, and Modular Ranks (2011). British Journal of Mathematical and Statistical Psychology, 64, 233-243.
The Type I error probability and the power of the independent-samples Student t test, performed directly on the ranks of scores in combined samples in place of the original scores, are known to be the same as those of the nonparametric Wilcoxon-Mann-Whitney test. In the present study, simulations revealed that these probabilities remain essentially unchanged when the number of ranks is reduced by assigning the same rank to multiple ordered scores. For example, if 200 ranks are reduced to as few as 20, or 10, or 5 ranks by replacing sequences of consecutive ranks by a single number, the Type I error probability and power stay about the same. Significance tests performed on these modular ranks consistently reproduce familiar findings about the comparative power of the t test and the Wilcoxon-Mann-Whitney test for normal and various non-normal distributions. Similar results are obtained for modular ranks used in comparing the one-sample Student t test and the Wilcoxon signed-ranks test.
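One way to read the reduction from 200 ranks to 20 is sketched below: each block of 10 consecutive ranks is replaced by a single value. The block rule, the helper names `collapse_rank` and `pooled_t`, and the simulated data are assumptions made for illustration; the abstract does not specify the exact procedure.

```python
import random
import statistics


def collapse_rank(rank, block_size):
    """Map consecutive ranks to one shared value: 1..block_size -> 1, and so on."""
    return (rank - 1) // block_size + 1


def pooled_t(a, b):
    """Independent-samples Student t statistic with pooled variance."""
    na, nb = len(a), len(b)
    pooled_var = ((na - 1) * statistics.variance(a)
                  + (nb - 1) * statistics.variance(b)) / (na + nb - 2)
    return (statistics.mean(a) - statistics.mean(b)) / (pooled_var * (1 / na + 1 / nb)) ** 0.5


rng = random.Random(7)  # fixed seed so the run is repeatable
# Hypothetical data: two normal samples of 100 scores, means half an SD apart.
scores = [(rng.gauss(0.0, 1.0), 0) for _ in range(100)]
scores += [(rng.gauss(0.5, 1.0), 1) for _ in range(100)]
scores.sort(key=lambda pair: pair[0])

full = ([], [])      # ranks 1..200
modular = ([], [])   # the same ranks collapsed to 20 values
for rank, (_, group) in enumerate(scores, start=1):
    full[group].append(rank)
    modular[group].append(collapse_rank(rank, block_size=10))

t_full = pooled_t(full[0], full[1])
t_modular = pooled_t(modular[0], modular[1])
print("t on 200 ranks:       ", round(t_full, 3))
print("t on 20 modular ranks:", round(t_modular, 3))
```

A single sample only shows that the two statistics are computed the same way; the Type I error rates and power reported in the abstract come from repeating such a comparison over many simulated samples.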
Sampling Variability and Axioms of Classical Test Theory (2011). Journal of Educational and Behavioral Statistics, 35, 586-615.
Many familiar formulas in classical test theory are mathematical identities in populations of individuals, but do not have the same status in samples from those populations, because assumptions made in derivation of the formulas are not necessarily satisfied in samples. The present study derived equations relating test scores and components of scores that are identities in all samples and that reduce to the corresponding population equations when various correlation coefficients are zero. Simulations determined the accuracy of both the familiar and the revised equations when applied to samples of various sizes from populations with known reliability coefficients. The programs also determined the variances of sample values, as well as the mean and variance of discrepancies between sample and population values. The findings revealed that extensive inaccuracy and variability in characteristics of test scores resulted not only from ordinary sampling error, but also from the fact that assumptions made in the mathematical derivation of equations often were not satisfied in small samples.
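A simple textbook case shows the distinction between population and sample identities. In a population, observed-score variance equals true-score variance plus error variance because true scores and errors are uncorrelated; in a sample their covariance is rarely exactly zero, so the decomposition holds only when a covariance term is included. The sketch below checks this on simulated data. It illustrates the general idea only and does not reproduce the equations derived in the paper; the population values (true-score SD 10, error SD 5, hence reliability .80) and the sample size are assumptions.

```python
import random
import statistics


def covariance(u, v):
    """Sample covariance with an n - 1 denominator."""
    mean_u, mean_v = statistics.mean(u), statistics.mean(v)
    return sum((a - mean_u) * (b - mean_v) for a, b in zip(u, v)) / (len(u) - 1)


rng = random.Random(3)  # fixed seed so the run is repeatable
n = 20                  # a small sample, where the discrepancy is most visible
true_scores = [rng.gauss(50.0, 10.0) for _ in range(n)]
errors = [rng.gauss(0.0, 5.0) for _ in range(n)]
observed = [t + e for t, e in zip(true_scores, errors)]

var_x = statistics.variance(observed)
var_t = statistics.variance(true_scores)
var_e = statistics.variance(errors)
cov_te = covariance(true_scores, errors)

population_form = var_t + var_e               # exact only if cov_te == 0
sample_identity = var_t + var_e + 2 * cov_te  # exact in every sample

print("observed variance:     ", round(var_x, 3))
print("population-form value: ", round(population_form, 3))
print("sample-identity value: ", round(sample_identity, 3))
print("sample var_t / var_x:  ", round(var_t / var_x, 3), "(population value 0.8)")
```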
A Simple and Effective Decision Rule for Choosing a Significance Test to Protect Against Non-Normality (2011). British Journal of Mathematical and Statistical Psychology, 64, 388-409.
There is no formal and generally accepted procedure for choosing an appropriate significance test for sample data when the assumption of normality is doubtful. Various tests of normality that have been proposed over the years have been found to have limited usefulness, and sometimes a preliminary test makes the situation worse. The present paper investigates a specific and easily applied rule for choosing between a parametric and nonparametric test, the Student t test and the Wilcoxon-Mann-Whitney test, that does not require a preliminary significance test of normality. Simulations reveal that the rule, which can be applied to sample data automatically by computer software, protects the Type I error rate and increases power for various sample sizes, significance levels, and non-normal distribution shapes. Limitations of the procedure in the case of heterogeneity of variance are discussed.
Abstracts of recent manuscripts in press (accepted for publication but not yet in print) or under review (submitted to journals but not yet accepted for publication)
Abstracts of selected earlier publications (before 1996)