This page describes how the Normative Data Project (NDP) data are going to be examined for accuracy and for consistency with other data. It will grow in detail and will eventually become complex as more is learned about these data. A general overview is available elsewhere for anyone not interested in the level of detail here. The intention of this page is to be systematic and detailed and to discuss the many data issues involved in the organization of a dataset of this complexity.
This page is, thus, part of the vetting process because it allows readers to point out tests missed, and it also is a part of the dataset's documentation.
The problem at hand is to examine a new dataset – one with little precedent in the literature – to learn what the data should look like, that is, what is reasonable. It follows, then, that we will start out as best we can and, as we gain a handle on the datasets, increase our understanding of what is possible. That understanding will be folded into the analysis and will improve subsequent testing. Vetting is partially a matter of doing the right procedures to expose problems in the data, such as logical inconsistencies, but also a matter of looking around for other incongruities. We will pursue what we find by appropriate means. Running the problems down makes the data better as we ferret out sources of error and develop filters for the various types of errors when new data are added. And better data lead to better analysis.
The goal is to provide a useful set of data that will provide insights into the condition of libraries in spite of the occasional anomalous or obviously wrong value. These data are remarkably free of such peculiarities, but they are there, as Professor Kruskal warned us.
We are guided by what might be called the Hippocratic Oath of Data: first, do no harm. On the other hand, there are values in the dataset wrong by such amounts that they might yield incorrect, and potentially serious, results when the data are manipulated.
What follows is a discussion of anomalous values we have noted so far and what, if anything, we have done about them. They come from the actual records of the member libraries, and we are addressing this matter in several ways that are discussed here.
We could ignore these few problems and proceed blithely along. However, even though the number of these anomalies is small, the NDP interface permits – indeed, invites – a level of granular analysis that before now was the province only of skilled programmers. Now crosstabs on smaller and smaller subsets of data are so effortless that users might easily end up with wrong values that throw off the analysis. Given that financial decisions are to be made based on these data, we must take this problem seriously. For instance, if you are trying to analyze the age of a collection by using the Average Publication Year, a 9999 can lead you to an incorrect conclusion about the age of a collection, resulting, perhaps, in the misallocation of resources.
The data are dynamic, so we have included estimates of the numbers based on current numbers as of the discussion.
The date of publication is recorded by member libraries' catalogs. There are impossible numbers in this series as discussed in the definitions. Briefly, there are slightly fewer than 1,300 titles putatively published after 2005 and three around -14,000. In addition, there are numbers like "99." This is a possible number, but it is more likely the result of truncation, for example, "1999" is replaced with "99." There are 255 cases of "9999," probably indicating the date was not known. Filling a field with 9s is occasionally used for values that are not known.
What we have done is to treat publication dates later than 2005 and before 1000 as "unknown."
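The rule just described can be sketched as a simple filter. This is a hypothetical illustration; the function and its handling of raw values are ours, not the NDP's actual load code:

```python
def clean_pub_year(raw):
    """Treat implausible publication years as unknown.

    Implements the rule described above: years after 2005 or before
    1000 (which catches filler values like 9999, truncations like 99,
    and negative years) are mapped to None, i.e. "unknown".
    """
    try:
        year = int(raw)
    except (TypeError, ValueError):
        return None
    if year > 2005 or year < 1000:
        return None
    return year

# Examples drawn from the discussion above:
print(clean_pub_year("1999"))    # kept: 1999
print(clean_pub_year("9999"))    # filler -> None
print(clean_pub_year("99"))      # likely truncation of 19xx -> None
print(clean_pub_year("-14000"))  # impossible -> None
```

Note that treating "99" as unknown rather than restoring "1999" is deliberate: the truncation is only a likelihood, not a certainty.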
Note that there are more active things we could do at this stage. For instance, on a quick search of Amazon, it appeared that about 80% of the 1,300 or so (as of this writing) titles with impossible publication dates can be identified unambiguously, so we could write programs to change these as they are brought into the NDP.
We did an analysis of the top 1,000 titles ever circulated and found that 13.4% of the titles in that list were in ephemeral categories, such as "Adult Paperback," "Ephemeral Checkout Donated Paperback--Adult," and "ONTHEFLY." These categories reduced the value of the lists of popular titles so we have removed them from those lists. The information is still in the system so "Adult Paperbacks" will still appear as a paperback Item Type but the intent is that those lists will include only recognizable titles. The analysis also disclosed a very few residual cases where like titles were not together: The Two Towers appeared separately from The Lord of the Rings: The Two Towers, a fact which reflects local cataloging and, likely, different editions. The intent is to group like titles together for these general, summary lists. If you want to examine the Top 100 DVDs circulated at some subset of libraries, the list you see will be DVDs but if you ask for a general list of titles, the intent is for movies, books, and so on, irrespective of format, to be listed.
On one hand, it is disappointing not to be able to examine those ephemeral titles; on the other, their presence reflects the fact that ILSs are flexible enough to accommodate on-the-fly circulations.
There are about 34,000 titles (of 35 million as of this writing) with initial "Zs" before the title: ZZZBoys Life. Some of these titles are legitimate: "ZZZZZZ" was an episode of The Outer Limits that clearly attracted a lot of buzz because so many libraries have it. The band, ZZ Top, is also represented.
However, there is a mixed bag of other entries. The largest single group – and it is quite large – is periodicals, as in the example given above. For this subset, the initial ZZZs indicate a special format or location. Similarly, a few entries appear to be interlibrary loans because they begin with "ZZZ ILL," which is then a recognizable marker. Other entries are not so clear, such as "ZZZBookbag." But some are decipherable, such as "ZZZWatch Out Ronald Morgan." What about the no-doubt useful "ZZZHip-Hop Pronounciation Videorecording"? Live or Memorex?
It is a tricky matter, and we have not yet decided what to do with these, because it is clear that precipitously stripping out the initial Zs will destroy information. An argument can be made to strip out the ZZZs in the periodical titles because it can be done relatively easily, on a large subset of observations in this class of anomaly, and it would result in a more useful database as a location marker now separates the same titles. That is, ZZZBoys Life is separated from Boys Life.
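If we do decide to strip the ZZZs from periodical titles, the operation itself is simple; the sketch below assumes a hypothetical `is_periodical` flag, and identifying periodicals reliably is the real work:

```python
def strip_zzz(title, is_periodical):
    """Strip the initial "ZZZ" marker from periodical titles only.

    Limiting the change to records flagged as periodicals avoids
    destroying information: legitimate titles such as "ZZZZZZ"
    (the Outer Limits episode) or entries for ZZ Top are untouched.
    """
    if is_periodical and title.startswith("ZZZ"):
        return title[3:].lstrip()
    return title

print(strip_zzz("ZZZBoys Life", is_periodical=True))   # Boys Life
print(strip_zzz("ZZZZZZ", is_periodical=False))        # ZZZZZZ
```

This would reunite ZZZBoys Life with Boys Life while leaving the ambiguous cases, such as "ZZZBookbag," for later decisions.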
For a (maybe?) related anomaly, see cases where city names have initial ZZs in the NCES public library data. The initial ZZs were removed from the NCES/NCLIS data. There the solution was obvious and easy to implement.
Our reports on use of materials by language and holdings by language show that English is the overwhelming language by use and by holdings. The list of languages held in the NDP libraries, however, is quite long, and more than a few of them are barely credible – in fact, many are pretty clearly cataloging errors, probably resulting from something similar to Coale and Stephan's teen-age widows. We are investigating these cases and will likely contact the member libraries so they can adjust their records.
There are 467 possible languages listed. This list comes from the Library of Congress. Of these, about 270 are actually listed as being in NDP member libraries. Quite a few appear to be cataloging errors. For instance, one title at one library in a dead language in a non-Western script is not a likely title for a public library. The numbers are so small that this matter will not require immediate attention.
There are four categories of language that are legitimate, if curious. The Library of Congress language page (search on "Special Codes") lists these categories: Undetermined, Multiple Languages, Unknown, and (blank).
In addition, there is a "miscellaneous" category reflected in NDP member libraries' catalogs that is listed on other LC pages but is not rigorously defined. All of these language categories appear in the dataset. Two of the most popular languages in public libraries in the NDP are Unknown and Undetermined.
It would be useful to disaggregate these to their proper languages, but it seems a daunting problem at this stage.
Data are available on collections by Dewey and Library of Congress (LC) classification systems. We have mapped the two systems to each other using the 050 and 082 MARC record fields so that either system can be used to gain insights into these collections.
The Dewey numbers breakdowns in the NDP are the second expansion, so that one can examine, say, the circulation patterns of materials in the 640-649 range. The underlying dataset has the third expansion so that the circulation patterns of the 641s are recorded. Eventually, one might be able to examine the use of materials in 641.5, for instance. At this time, the performance hit from including these data is substantial.
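The relationship between the expansion levels is a simple aggregation: third-expansion counts roll up into second-expansion ranges. A sketch, with invented circulation counts for illustration:

```python
from collections import defaultdict

# Hypothetical circulation counts by third-expansion Dewey class:
circ_by_third = {"640": 120, "641": 950, "642": 80, "649": 310, "741": 400}

def rollup_to_second_expansion(counts):
    """Aggregate third-expansion counts (e.g. 641) into
    second-expansion ranges (e.g. 640-649)."""
    totals = defaultdict(int)
    for dewey, n in counts.items():
        lower = int(dewey) // 10 * 10
        totals[f"{lower}-{lower + 9}"] += n
    return dict(totals)

print(rollup_to_second_expansion(circ_by_third))
# {'640-649': 1460, '740-749': 400}
```

The NDP interface exposes the second expansion while the underlying dataset records the third, so a rollup of this kind stands between the two.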
Note the "Unclassed" category. As of this writing, it holds 4.3 million titles in 15.2 million copies (of 12 million titles and 34.3 million copies overall). This large category includes materials of many types, such as fiction (classed variously as "F," "Fic," etc.), juvenile materials ("J," "Juv"), reference ("R," "Ref"), and so on. This is another group we may look to disaggregate, but that would be a complex undertaking, indeed.
We have removed the price variable and a derivative, an estimate of the value of collections, from the reports. We have kept the values of each in the system but have decided to think further about how to treat the anomalies.
There are anomalies that have not yet been addressed. These anomalies will affect computations. Caution is in order here. Summary data of this variable will be higher than the true number. Some subsets and crosstabs will return bizarre results.
Mean = $31.57
Median = $16.95
There are seven books with prices listed as $9,999,999.99, 54 over $1 million, and 249 over $10,000. At the other end of the scale, 224 are below $0.01, and 10,663 are below $1.00.
The maximum figure is a curious number and it looks rather like someone filled up the field with 9s. That practice is occasionally used as a means of indicating the unknown, but it seems odd here. There are other places in the dataset, though, where there are all 9s in fields, as noted above with dates. Local practices certainly vary.
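Since we are not yet altering this variable, one interim option is to flag suspect prices rather than change them. A sketch, with thresholds taken from the counts above and category names that are ours (the distinction between "over $1 million" and "over $10,000" is collapsed here for simplicity):

```python
def flag_price(price):
    """Classify a list price per the anomalies discussed above:
    all-9s filler, implausibly high values, and near-zero values."""
    if abs(price - 9_999_999.99) < 0.005:
        return "filler (all 9s)"
    if price > 10_000:
        return "implausibly high"
    if price < 0.01:
        return "near zero"
    return "ok"

print(flag_price(9_999_999.99))  # filler (all 9s)
print(flag_price(2_000_000))     # implausibly high
print(flag_price(16.95))         # ok
```

Flagging preserves the original values (first, do no harm) while letting reports exclude or annotate the outliers.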
We discovered another problem in the handling of the price of sets that we do not understand sufficiently yet. Until we do, we are not touching this variable.
With the release of the NCES FY 2003 national data, it is possible to compare figures between libraries reporting to NCES with those available from the NDP.
Given that fiscal years vary – not only between states but also within them – comparing the two estimates of circulation requires care. Of the 52 systems currently in the NDP, 36 have data for periods covered by NCES's most recent release of data.
The comparison made here is between NDP's "Checkouts and Renewals" and NCES's variable "totcir" for the same periods.
Of the 36 libraries, 18 have NDP circulation numbers greater than the NCES number, 17 have the NCES number greater, and one is exactly the same. Of the 36 pairs, 19 are within plus or minus 5%, and 27 are within 10%. So, even though we are measuring the same thing, we sometimes get different numbers. The totals show that the NDP measures about 2 million more transactions than are reported to NCES.
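The comparison behind these counts is straightforward arithmetic. A sketch with illustrative figures only; taking NDP as the base of the percentage is our assumption, not a documented choice:

```python
def pct_diff(ndp, nces):
    """Percent by which the NCES figure differs from the NDP figure,
    with NDP as the base (an assumption for illustration)."""
    return (nces - ndp) / ndp * 100.0

# Illustrative pair, not any real library's figures:
ndp, nces = 1_000_000, 950_000
d = pct_diff(ndp, nces)
print(f"{d:+.1f}%")    # -5.0%
print(abs(d) <= 5.0)   # True: within the plus-or-minus 5% band
```

A negative result corresponds to the "NCES less NDP" direction used in the discrepancy lists that follow.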
The largest negative (NCES less NDP) discrepancies are:
|Lubbock Public Library||-1.7 million||(-65%)|
|Pikes Peak Public Library District, Colorado||-700K||(-11%)|
|Kansas City, Kansas Public Library||-469K||(-35%)|
|Calcasieu Parish Public Library, Louisiana||-174K||(-25%)|
|Waukesha Public Library, Wisconsin||-127K||(-8%)|
And a few where the NCES figures are greater:
|Montgomery County Public Library, Maryland||484K||(4%)|
|Allen County Public Library, Indiana||322K||(7%)|
|Huntsville-Madison Public Library, Alabama||222K||(11%)|
Isn't it curious that the two measures of the same thing are not the same? In one case, we think we understand what is going on, and it may illustrate the path to resolving the discrepancies. Let's look at Pikes Peak's experience.
Modern ILSs are complex, with many options. These options are their strength because they make the ILSs configurable to local requirements, as a number of the entries on this page show. An unintended consequence of this flexibility is that enforcing comparability on reported data is difficult. The Federal State Cooperative System for Public Library Data is a group that, among other things, spends a great deal of time maintaining consistent definitions across the country. It is a difficult task, but having watched the group in operation, I think it works well; it has, however, had a number of years and the attention of dedicated people to make it so. The NDP is just starting out.
Pikes Peak Library District's (PPLD) NDP figure reflects what the ILS measured as circulation transactions. PPLD has a floating collection – that is, materials are owned by the system as a whole, and a title requested at one branch but held at another stays, after use, at the requesting branch rather than returning. The Unicorn system allows local administrators to configure the ILS for local conditions, and these internal transactions were separately recorded. They show up in the NDP under the "User Profile" "Used for Library Functions." For the period in question, this category has about 700K transactions. When reporting to NCES, the folks at Pikes Peak subtract this amount from the total.
What happens if we subtract "Used for Library Functions" from all NDP reported data? Alas, only for the PPLD is this the seeming explanation for the differences, and in cases where the NCES number is larger, subtracting this amount makes the differences greater. It appears that we are going to be digging deeper into this matter and that one size will not fit all.
A result of these extra 700K transactions in the NDP circulation figures is that they appear as titles circulated, in which case they may well be double-counted.
"Unduplicated Population" is an estimate of the non-overlapping population served by each library system. In FY 2003, 24 states and DC do not have overlapping areas served but 26 do. In those states with overlapping areas served, the population served figures could be greater than the population of the state because one person could be counted more than once because he or she might be in more than one library's service area.
Given that the State Rank Order Tables are based on per capita measures, a state's ranking in these tables could well be affected by these varying methods of calculating the population of the service area, so a new measure of "unduplicated" population was created. It is an estimate based on calculations. If one totals the unduplicated population of all public library systems using the NCES Public Library Data File – the one with imputations – the result is 280,718,256. The same calculation done using the NCES State Summary Data yields 280,718,181. The difference between the two, 75 people, is a result of rounding in the calculations and is spread among the states, a few here or there.
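The rounding effect can be illustrated with invented numbers: rounding each state's estimate before summing need not equal rounding the summed estimates.

```python
# Invented per-state estimates, before rounding:
estimates = [100.4, 200.4, 300.4]

sum_then_round = round(sum(estimates))             # 601
round_then_sum = sum(round(e) for e in estimates)  # 600

print(sum_then_round, round_then_sum)
# The one-person gap here is the same mechanism, writ small, as the
# 75-person difference between the two NCES-based totals above.
```

Across 50-odd states, such per-state gaps of a person or two accumulate into exactly this kind of small total discrepancy.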