Vetting the Data in NDP
The purpose of this Web page is to give a general outline of the processes we follow to assure the accuracy of the data reported in the Normative Data Project (NDP). This page presents a general, conceptual background; two related pages should be reviewed if you want a fuller explanation. One page presents the details, that is, the actual execution of these concepts. While those details change as we gain a better understanding of the data in the NDP, this outline will be less dynamic because it discusses the problem at a conceptual level. The second related page, anomalies, discusses some of the literature from the field of statistics that deals with anomalous and obviously incorrect data. These articles offer hints on how to proceed, as discussed below, and also provide droll reminders of aspects of the human condition.
The Normative Data Project (NDP) presents new sets of data collected from public libraries in the US and Canada. Collecting such data on each circulation transaction and each book held at the member libraries would have been impossible until recent developments in computer storage created the capacity to hold the necessary databases. The NDP includes data collected from libraries on their holdings and transactions, as well as data collected from public libraries and reported by the US National Center for Education Statistics (NCES) through its Library Statistics Program (http://nces.ed.gov/surveys/libraries/).
When we start working with a new series of data, how do we know if the data are correctly or accurately representing the processes being measured?
In a few cases, we have data from published sources, such as the public library data published by the NCES, that overlap with the NDP. The two sets of data differ in several ways. For one, the NCES data are system-level, while the NDP data are outlet-level; the NDP data can, however, be aggregated to the system level for comparison with the NCES data where they overlap.
The second way the two series differ is the time period measured. The NDP represents calendar quarters, while the NCES data represent fiscal years. To complicate matters, fiscal years vary across the states, and even within states, so matching the two sets of data involves several steps: matching variables, then matching reporting periods by library.
Where the times do not overlap, we look for improbable changes between the older NCES data and the newer NDP data. What "improbable" means is a matter of judgment and will vary as we get a better idea of the characteristics of these data.
The following variables can be matched between the two sets of data:
Library Name and FSCSKEY are in all datasets. The FSCSKEY is an NCES alphanumeric key variable that is easier to manipulate than library names, which often vary from year to year.
Counts of service outlets (central libraries, branches, and bookmobiles) are comparable between the NDP and NCES data.
The NDP presents a breakdown into more categories of types of materials, but the two sets overlap in audio and video materials. Each has a count of books and serials, but the NCES data report volumes while the NDP reports titles. Any comparison between these two variables has to bear this important difference in mind. For ratios where the NDP title count is in the denominator, such as turnover (the number of times a year that the average item circulates), the NDP figure would likely be higher than the comparable NCES ratio.
Each has total circulation and children's circulation. NCES defines children's circulation as "Total annual circulation (including renewals) of all children's materials in all formats to all users." By structuring the search correctly, a similar number can be calculated from the NDP data.
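The matching and comparison described above can be sketched in a few lines. This is only an illustration: the dictionaries, field names, and figures below are invented, and the real comparison runs against the full NDP and NCES datasets.

```python
# Sketch of matching NDP and NCES records on the FSCSKEY key variable
# and comparing turnover ratios. All field names and figures here are
# hypothetical; the real datasets differ in that the NDP counts titles
# while NCES counts volumes, so the NDP-based turnover runs higher.

ndp = {"CA0001": {"circulation": 120_000, "titles": 30_000}}
nces = {"CA0001": {"circulation": 125_000, "volumes": 45_000}}

def turnover(circulation, holdings):
    """Average number of circulations per held item."""
    return circulation / holdings

# Match on FSCSKEY rather than library name, which varies year to year.
for key in ndp.keys() & nces.keys():
    ndp_turn = turnover(ndp[key]["circulation"], ndp[key]["titles"])
    nces_turn = turnover(nces[key]["circulation"], nces[key]["volumes"])
    # Because a title count is never larger than the matching volume
    # count, the NDP-based turnover is usually the higher of the two.
    print(key, round(ndp_turn, 2), round(nces_turn, 2))
```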
Librarians often use ratios. Their most popular ratios are the per capitas found in the Summary Rank Order tables, which NCES calculates by dividing the (now) 22 variables by the unduplicated population of the legal service area of each library.
"Unduplicated" population is a variable used to compare states in which a person may be in the legal service area of more than one library with states where each person can be in only one library's service area. The per capita ratios of the two methods are affected by this difference and, hence, so are comparisons between these two types of states. That is why we have the estimate of "unduplicated" population. In FY 2002 there were 22 states in which a person could be in the service area of more than one library.
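The effect of the unduplicated denominator can be illustrated with invented figures; nothing below comes from the actual NCES data.

```python
# In states where a person may be in more than one service area, summing
# the raw legal-service-area populations double-counts residents, which
# deflates per capita figures. All numbers here are invented.

total_circulation = 500_000
duplicated_population = 130_000    # sum of overlapping service areas
unduplicated_population = 100_000  # each resident counted only once

per_capita_dup = total_circulation / duplicated_population
per_capita_undup = total_circulation / unduplicated_population

# The unduplicated denominator yields the larger ratio, and it is the
# one that makes the two kinds of states comparable.
print(round(per_capita_dup, 2), round(per_capita_undup, 2))
```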
Other tests are more subjective and come from the article by William Kruskal cited in the literature from Statistics page. Kruskal suggests a taxonomy of "kinds of strangeness" in large datasets: "smoothness," "logical inconsistencies," and "transgressions of general knowledge."
Problems with smoothness would occur if there are big jumps in the data, say from one year to the next.
Logical inconsistencies occur when relationships are wrong, say, if the average of some set of figures is greater than the maximum, which can happen if a formula is entered incorrectly. Checking for these kinds of errors will also be an ongoing process, particularly as we improve the NDP by adding new features.
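A minimal sketch of such a consistency check follows; the function name and figures are hypothetical, but the test is exactly the kind described above: a reported mean can never exceed the reported maximum.

```python
# A minimal logical-consistency check of the kind Kruskal describes:
# summary figures that contradict each other signal a formula or
# data-entry error. Function and field names are our own invention.

def check_summary(values, reported_mean, reported_max):
    """Return a list of inconsistencies between reported summary
    figures and the underlying data (empty list means no problem)."""
    problems = []
    if reported_mean > reported_max:
        problems.append("mean exceeds maximum")
    if max(values) != reported_max:
        problems.append("reported maximum does not match the data")
    return problems

# A mis-entered formula might report a mean larger than the maximum:
print(check_summary([2, 4, 6], reported_mean=9.0, reported_max=6))
```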
Transgressions of general knowledge will come from both informal sources, such as common knowledge of the world, and formal sources, such as the literature that discusses some of the variables. Tilling the copious literature on library use studies, for instance, will provide an ongoing source of such information. For an overview of these studies, see the overview of use studies.
In a project of this size, as Kruskal points out, these errors do occur. If you find one, please email us at questions. As anomalies are discovered, they will be discussed on the blog, and as they are resolved, they will be documented on this site where appropriate.
1. Compare NDP with any other related, published statistics
Preliminary tests have shown discrepancies between published NCES circulation data and NDP data for the same libraries and the same period. We are trying to resolve them, but it appears that not all circulations are run through the SirsiDynix ILS at NDP member libraries, the current source of the NDP data. Another discrepancy is that the NCES data report volumes held, while the NDP reports titles held. This difference will affect ratios, such as turnover, used for comparison.
The NCES data are reported for fiscal years, and the NDP data are reported for calendar quarters. Thirteen states have fiscal years that match the calendar year, while 25 have a fiscal year running from July 1 to June 30. The remainder have varying fiscal years. In most cases, it should be possible to map an individual library's NCES data to the corresponding NDP periods. A more detailed discussion of this matter will appear here as the code making these comparisons is written and results are obtained.
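The quarter-mapping step can be sketched as follows. The function name is our own, and the sketch assumes the fiscal year begins on a quarter boundary, as both the calendar-year and July 1 fiscal years do.

```python
# Sketch of mapping a library's fiscal year to the four NDP calendar
# quarters it spans, so a fiscal-year NCES total can be compared with
# the sum of the matching quarterly NDP figures. Quarters are labeled
# (year, quarter) and the fiscal year is named by its ending year.

def fiscal_year_quarters(end_year, fy_start_month):
    """Return the four (year, quarter) pairs covered by a fiscal year.
    Assumes fy_start_month falls on a quarter boundary (1, 4, 7, 10)."""
    if fy_start_month == 1:  # calendar-year fiscal year (13 states)
        return [(end_year, q) for q in (1, 2, 3, 4)]
    start_q = (fy_start_month - 1) // 3 + 1
    quarters, year, q = [], end_year - 1, start_q
    for _ in range(4):
        quarters.append((year, q))
        q += 1
        if q > 4:
            q, year = 1, year + 1
    return quarters

# A July 1 - June 30 fiscal year ending in 2003 (the 25-state case):
print(fiscal_year_quarters(2003, 7))
# -> [(2002, 3), (2002, 4), (2003, 1), (2003, 2)]
```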
The following comparisons will be done:
- Same variables for the same times, when possible. Unfortunately, there is not much overlap between the NCES and NDP data.
- Quarterly, basic stats on all NDP data. Univariate statistics are the basic descriptive statistics computed on a single variable at a time: mean, median, standard deviation, maximum, minimum, and so on, the characteristics of each variable's distribution. A short list of these variables is found at ncesuni.html for the NCES variables and at ndpuni.html for the NDP variables. The data come from the SAS proc univariate procedure, as will the more detailed univariate statistics. This procedure will be run on most of the raw data as a baseline.
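For illustration only, the same basic statistics can be computed in a few lines of Python on invented figures; the project itself uses SAS proc univariate.

```python
# The basic univariate statistics described above, computed with the
# standard statistics module on invented quarterly circulation figures
# for five outlets. This only illustrates the baseline computation.
import statistics

circulation = [1200, 1350, 1100, 5000, 1280]  # one quarter, five outlets

summary = {
    "n": len(circulation),
    "mean": statistics.mean(circulation),
    "median": statistics.median(circulation),
    "stdev": statistics.stdev(circulation),
    "min": min(circulation),
    "max": max(circulation),
}

# A maximum far above the median (5000 vs 1280) is the kind of value
# the anomaly checks described below would flag for inspection.
print(summary)
```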
Percent changes from period to period
Percent changes will be calculated for NDP variables from period to period. Initially, we will not know what to expect, that is, what kind of change from one period to the next is improbable and a sign of a data problem. Basic univariate statistics will be calculated on each variable's percent change each quarter. As we gain a better understanding of these changes, we will be able to build more sophisticated programs that report unusual changes from one period to the next.
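A sketch of such a screen follows. The 50% threshold is an invented placeholder; as noted above, what counts as an improbable change will only become clear with experience.

```python
# Sketch of the period-to-period percent-change screen. The threshold
# and the sample figures are invented for illustration.

def percent_changes(series):
    """Percent change from each period to the next."""
    return [100.0 * (b - a) / a for a, b in zip(series, series[1:])]

def flag_improbable(series, threshold=50.0):
    """Return indices of periods whose change from the prior period
    exceeds the threshold in either direction."""
    return [i + 1 for i, pc in enumerate(percent_changes(series))
            if abs(pc) > threshold]

quarterly_circ = [10_000, 10_400, 9_900, 22_000]  # invented quarterly data
print(flag_improbable(quarterly_circ))  # flags the jump to 22,000
```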
As indicated, a few variables are slightly mismatched: the NDP counts titles, while NCES counts volumes. This difference will affect ratios, such as turnover, in uncertain ways that must also be investigated.
2. National Commission on Libraries and Information Science (NCLIS)/National Center for Education Statistics (NCES)
The public library data from NCES are the most carefully edited series of published library data. These data can generally be accepted as they are. The data from NCLIS are a longitudinal recompilation of the NCES public library data. The NCES data were issued as annual compilations, and the data from each of the years were recompiled into one file, thus allowing trend analysis. These NCLIS data are to be included in the NDP back to the 1990 fiscal year, the first year all states reported data to NCES. The data are documented at the NCLIS link above. Data are also available in a number of other formats at that site. These data are referred to here as the NCES/NCLIS data and represent a comprehensive set of data on all public libraries in the US since FY 1990.
As mentioned above, basic univariate statistics on 38 of the numeric variables reported for FY 2002 are available at ncesuni.html. The year-to-year changes are essentially an analytic problem, not a vetting problem, and the domain of time series analysis.
3. What to do when we find a discrepancy between the NDP and a published source, or when something seems improbable?
We may have to develop a mechanism for reporting apparent errors to the NDP member libraries. Many libraries, however, have enough trouble cataloging the books they purchase, much less recataloging, so it may take time to ferret out all errors. This matter will be ongoing as the NDP grows.