Anomalies in Datasets or Data Anomalies
There is an interesting literature on data anomalies that provides clues to how to vet data and serves as a reminder of a difficulty faced in collecting and reporting data: anomalies, or obviously wrong values, creep into datasets. What follows is an overview of articles that deal with anomalies. This project is large and complex, and as it grows and evolves, tracking the data and their anomalies becomes similarly complex. The point of analyzing anomalies and understanding the structure of the data is to ensure that impossible or doubtful reported values do not confound the results when the data are analyzed.
We have set up a process, through these pages and the blog, to find anomalies, investigate them, and then inform users of what has been found and what has been done. As a long-run strategy, we are also developing reports to send to member libraries so that they can be involved and, where appropriate, correct their records, thus improving the dataset over time.
Anomalous and wrong values are a well-known aspect of working with data. William Kruskal, in Statistics in Society: Problems Unsolved and Unformulated, said: "A reasonably perceptive person, with some common sense and a head for figures, can sit down with almost any structured and substantial data set or statistical compilation and find strange-looking numbers in less than an hour. Some of those strange looking numbers may turn out to be right after all; others will, upon investigation, turn out to be mistakes of one kind or another."
The article includes some of these anomalies and cites sources of others. For instance:
- The 1960 Census of Population reported that 62 "ever-married females" between the ages of 15 and 19 in 1960 had 12 or more children.
- In the 1970 Census, there were 54 New York residents who commuted to their jobs in England by car.
- And in the 1950 Census of Population, Coale and Stephan report, there were 1,670 widowed males aged 14. Their famous article traced this and other anomalies to the miskeying of Census punch cards: in a small number of cases, a value keyed one column away from where it belonged produced the anomalously high figure.
Another work Kruskal cites is Oskar Morgenstern's On the Accuracy of Economic Observations, a book devoted to an analysis of the errors in economic variables. Some aspects of an economy pose inherent theoretical problems of measurement, while others are conceptually easy but technically difficult to measure. If you are inclined to take what you read about consumer prices or other economic statistics at face value, this book is a useful corrective.
What marks Kruskal's article, though, is not so much the amusing statistics he cites as his attempt to develop a taxonomy of the kinds of errors that exist. He suggests a taxonomy of "kinds of strangeness" in large datasets: problems of "smoothness," "logical inconsistencies," and "transgressions of general knowledge."
Problems with smoothness occur when there are big jumps in the data, say from one year to the next, that the underlying phenomenon cannot plausibly explain.
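A check of this kind is straightforward to mechanize. The following is a minimal sketch, assuming annual observations held as a {year: value} mapping; the variable names, example figures, and the 50% threshold are all illustrative, not part of the NDP's actual procedures.

```python
# Sketch of a year-over-year "smoothness" check: flag any change that
# exceeds a fractional threshold relative to the prior year's value.
def flag_jumps(series, threshold=0.5):
    """Return (year, previous_value, new_value) for suspicious jumps."""
    flags = []
    years = sorted(series)
    for prev, curr in zip(years, years[1:]):
        old, new = series[prev], series[curr]
        # Skip a zero prior value to avoid dividing by zero.
        if old and abs(new - old) / abs(old) > threshold:
            flags.append((curr, old, new))
    return flags

# Hypothetical volume counts: 2020 quadruples, then falls back in 2021,
# so both transitions are flagged for a human to investigate.
volumes = {2018: 100_000, 2019: 104_000, 2020: 410_000, 2021: 108_000}
print(flag_jumps(volumes))
```

A flagged jump is only a prompt for investigation, not proof of error; a merger of two collections, for instance, could legitimately double a count in one year.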
Logical inconsistencies occur when relationships within the data are wrong: say, if the average of some set of figures is greater than its maximum. This kind of error can arise when a formula is entered incorrectly. Checking for these kinds of errors will also be an ongoing process, particularly as we improve the NDP by adding new features. In a project of this size, as Kruskal points out, these errors do occur.
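Consistency rules like the one above can be expressed directly as code. Here is a minimal sketch; the record fields ("minimum", "average", "maximum") and the example values are hypothetical stand-ins, not the NDP's actual schema.

```python
# Sketch of a logical-consistency check for one summary record:
# a reported average can never exceed the reported maximum, and the
# minimum can never exceed the maximum.
def check_summary(record):
    """Return a list of messages describing any inconsistencies found."""
    problems = []
    if record["average"] > record["maximum"]:
        problems.append("average exceeds maximum")
    if record["minimum"] > record["maximum"]:
        problems.append("minimum exceeds maximum")
    return problems

# An impossible record, of the kind a miskeyed formula might produce.
rec = {"minimum": 3, "average": 250, "maximum": 120}
print(check_summary(rec))
```

Each new feature added to the dataset would bring its own rules of this form, which is why such checking is an ongoing process rather than a one-time pass.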
Transgressions of general knowledge come both from informal sources, such as common knowledge of the world, and from formal sources, such as the literature that discusses particular variables. Tilling the copious literature on library use studies, for instance, will provide an ongoing source of such information. For an overview of these studies, see the review of use studies elsewhere.
As a general observation, the percentage of anomalies in the NDP dataset is rather small, probably because catalogers' accuracy is quite high. Still, the dataset is quite large, so even a small percentage of errors amounts to a fair number of anomalous observations.
In working with a set of data, we should expect to find anomalies and curiosities: some will, upon investigation, prove to be real values; some will not; and some will not be resolvable. Indeed, the details show anomalies that may or may not be accurate, but the dataset also contains impossible values that fall among Kruskal's "transgressions of general knowledge." It is generally a sound principle to follow Stubbs and Buxton's admonition in the Cumulated ARL University Library Statistics, 1962-63 through 1978-79. After converting the Association of Research Libraries data from their paper originals to electronic form, they too, of course, found anomalies, and they concluded: "these vagaries in data collection over the years stand here as they appeared in the original..."
This statement is what might be called a Hippocratic Oath of Data: "First, do no harm." There are, however, clear cases where something must be done to ensure that users of the data do not draw misleading conclusions. These matters are discussed in detail.