Normative Data Project for Libraries

Anomalies in Datasets or Data Anomalies

There is an interesting literature on data anomalies. It provides clues on how to vet data, and it is a reminder of a difficulty faced in collecting and reporting data: anomalies, or obviously wrong values, creep into datasets. What follows is an overview of articles that deal with anomalies. This project is large and complex, and as it grows and evolves, tracking the data and their anomalies becomes similarly complex. The point of analyzing anomalies and understanding the structure of the data is to ensure that reported values that are impossible or doubtful do not confuse the results of an analysis. The details of the anomalies we have found, and what we are doing about them, are treated separately.

We have set up a process through these pages and the blog to find anomalies, investigate them, and then inform users of what has been found and what has been done about it. As a long-run strategy, we are also developing reports to send to member libraries so that they can be involved and, where appropriate, fix their records, thus improving the dataset over time.

Anomalous and wrong values are a well-known aspect of working with data. William Kruskal, in Statistics in Society: Problems Unsolved and Unformulated, said: "A reasonably perceptive person, with some common sense and a head for figures, can sit down with almost any structured and substantial data set or statistical compilation and find strange-looking numbers in less than an hour. Some of those strange looking numbers may turn out to be right after all; others will, upon investigation, turn out to be mistakes of one kind or another."

The article includes some of these anomalies and cites sources of others, for instance Ansley J. Coale and Frederick F. Stephan's The Case of the Indians and the Teen-Age Widows.

Another work Kruskal cites is Oskar Morgenstern's On the Accuracy of Economic Observations, a book devoted to an analysis of economic variables. Some of these variables pose inherent theoretical problems of measurement, while others are conceptually easy but technically difficult to measure. If you believe what you read about consumer prices or other economic statistics, you might pick up this book.

What marks this article by Kruskal, though, is not so much the amusing statistics he cites as his attempt to develop a taxonomy of the kinds of errors that exist. Kruskal suggests a taxonomy of "kinds of strangeness" in large datasets: "smoothness," "logical inconsistencies," and "transgressions of general knowledge."

Problems with smoothness occur when there are big jumps in the data, say from year to year.
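
A smoothness check of this kind can be sketched in a few lines of Python. This is only an illustration and not the NDP's actual procedure: the function name find_jumps, the 50 percent threshold, and the sample circulation counts are all assumptions made up for the example.

```python
def find_jumps(series, threshold=0.5):
    """Return (year, previous, current) for each year-to-year change
    whose relative size exceeds `threshold` (0.5 means 50 percent).

    `series` maps a year to a reported value. Years are compared in
    sorted order, so a missing year is simply skipped over.
    """
    years = sorted(series)
    jumps = []
    for prev_year, year in zip(years, years[1:]):
        prev, cur = series[prev_year], series[year]
        if prev == 0:
            # Relative change is undefined; flag any move away from zero.
            if cur != 0:
                jumps.append((year, prev, cur))
            continue
        if abs(cur - prev) / abs(prev) > threshold:
            jumps.append((year, prev, cur))
    return jumps


# Hypothetical annual circulation counts for one library.
circulation = {2001: 10200, 2002: 10450, 2003: 104900, 2004: 10700}
for year, prev, cur in find_jumps(circulation):
    print(f"{year}: {prev} -> {cur}")
```

In this made-up series both the jump into 2003 and the drop out of it are flagged, which points at 2003 as the value to investigate. A flagged jump is only a candidate: as Kruskal notes, some strange-looking numbers turn out to be right after all.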

Logical inconsistencies occur when relationships among values are impossible, say, if the average of some set of figures is greater than the maximum. This kind of thing can happen if a formula is entered incorrectly. Checking for these kinds of errors will also be an ongoing process, particularly when we improve the NDP by adding new features. In a project of this size, as Kruskal points out, these errors do occur.
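
The average-versus-maximum example can be expressed as a simple consistency check. The sketch below is hypothetical: the field names minimum, average, and maximum, the function name check_summary, and the sample rows are assumptions for illustration and are not taken from the NDP.

```python
def check_summary(row):
    """Return a list of messages describing logical inconsistencies
    among the reported minimum, average, and maximum in `row`."""
    problems = []
    lo, avg, hi = row["minimum"], row["average"], row["maximum"]
    if lo > hi:
        problems.append("minimum exceeds maximum")
    if avg > hi:
        problems.append("average exceeds maximum")
    if avg < lo:
        problems.append("average is below minimum")
    return problems


# Hypothetical summary rows; the second has an impossible average.
rows = [
    {"name": "A", "minimum": 2, "average": 14.5, "maximum": 40},
    {"name": "B", "minimum": 1, "average": 75.0, "maximum": 60},
]
for row in rows:
    for problem in check_summary(row):
        print(f"{row['name']}: {problem}")
```

Unlike a smoothness check, a failure here is never a judgment call: an average above the maximum cannot be a real value, so the record or the formula behind it must be wrong.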

Transgressions of general knowledge will come to light from both informal sources, such as common knowledge of the world, and formal sources, such as the literature that discusses some of the variables. Tilling the copious literature on library use studies, for instance, will provide an ongoing source of such information. For an overview of these studies see the review of use studies elsewhere.

As a general observation, the percentage of anomalies in the NDP dataset is rather small, probably because catalogers' accuracy is rather high. Still, the dataset is quite large, so even a small percentage amounts to a number of anomalous values.

In working on a set of data, we should expect to find anomalies and curiosities. Some will, upon investigation, prove to be real values, some will not, and some will not be resolvable. In fact, if you consult the details, we have anomalies that may or may not be accurate, but there are also impossible values in the dataset that are among Kruskal's "transgressions of general knowledge." It is generally a sound principle to follow Stubbs and Buxton's admonition in the Cumulated ARL University Library Statistics, 1962-63 through 1978-79. After converting the data of the Association of Research Libraries from their paper originals to electronic form, they, of course, also found anomalies and concluded: "these vagaries in data collection over the years stand here as they appeared in the original..."

This statement is what might be called a Hippocratic Oath of Data: "First, do no harm." However, there are clear cases where something must be done to ensure that users of the data do not get misleading results. These matters are discussed in detail.


Footnotes

William Kruskal, Statistics in Society: Problems Unsolved and Unformulated, Journal of the American Statistical Association, Volume 76, Number 375, pp. 505-515, September 1981.

Ansley J. Coale and Frederick F. Stephan, The Case of the Indians and the Teen-Age Widows, Journal of the American Statistical Association, Volume 57, Number 298, pp. 338-347, June 1962.

Oskar Morgenstern, On the Accuracy of Economic Observations, Princeton: Princeton University Press, 1963.

Kendon Stubbs and David Buxton, Cumulated ARL University Library Statistics, 1962-63 through 1978-79, Washington: Association of Research Libraries, 1981, v.
