5 Error Rates
On two occasions I have been asked, “Pray, Mr. Babbage, if you put into the machine wrong figures, will the right answers come out?” … I am not able rightly to apprehend the kind of confusion of ideas that could provoke such a question.
— Charles Babbage (1864), Passages from the Life of a Philosopher
Babbage, widely considered the “father of the computer,” clearly recognized the importance of reliable input data for generating reliable output or reports (Figure 5.1). The more recent computer science concept of garbage in, garbage out (GIGO) is germane to museum database creation and management (Figure 5.2). The principle holds for paper and electronic documentation alike: if the records are a mess, they are not very useful. Given this, it is generally prudent to ensure that poor-quality data do not enter a database in the first place. When poor-quality data do enter the database, however, there must be mechanisms for discovering and resolving those inaccuracies. The long-standing and well-founded concern over GIGO raises the question: what are the error rates of data entry at the University Museum?
In an effort to better understand error rates during data entry, Craig systematically evaluated PastPerfect database records produced by a single undergraduate student who created inventory entries for 791 objects between 2021-05-27 and 2021-07-08. All of the objects are located in the Museum basement. The error rate analysis here focuses exclusively on two fields that are arguably among the most complex but also the most important logged: “Object Name” (objname) (Section 2.3.1), which implements Nomenclature, and “Description (UM)” (udf21) (Section 2.2.1), which is the custom description field.
5.1 Object Name (objname)
In terms of Object Name (objname), Craig found incorrect assignment of Nomenclature terms in 66.25% (n=524) of the 791 records. While this rate is higher than those of several other individuals who created records in the database, across the board there seemed to be a general misunderstanding of how Nomenclature works. Unfortunately, in this case a large portion of the object descriptions lacked sufficient information or a diagnostic photograph to take corrective measures without reviewing the objects themselves. Craig was not able to revisit the objects and make those additional corrections. The rate of incorrect designations is therefore almost certainly higher than the 66.25% captured here.
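As a rough illustration of how such problems could be caught earlier, the R sketch below checks Object Name entries against an authority list of Nomenclature terms and flags anything outside the lexicon for human review. The file names and the `objectid` identifier column are assumptions for the sake of the example, not features of the Museum’s actual PastPerfect export, and a term can of course be in the lexicon yet still be the wrong term for the object.

```r
# Sketch: flag Object Name entries that are not in the Nomenclature term list.
# Assumes a CSV export of records ("records.csv") with `objectid` and `objname`
# columns and a plain-text authority file with one preferred term per line.
records <- read.csv("records.csv", stringsAsFactors = FALSE)
terms   <- readLines("nomenclature_terms.txt")

bad <- !(records$objname %in% terms)       # TRUE where the term is not in the lexicon
round(100 * mean(bad), 2)                  # percent of records flagged
records[bad, c("objectid", "objname")]     # flagged records for manual review
```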
5.2 Description (UM) (udf21)
In terms of Description (UM) (udf21), Craig found that the student entered text riddled with typographical errors, with inconsistent capitalization, and often composed of incomplete sentences. Generally, the descriptions were thin and undiagnostic. In fact, the descriptions more closely resembled brief “tweets” or text messages clumsily entered from a phone than the formal object descriptions one would expect in a museum setting (Figure 5.3, see column udf21.pre). By and large, Craig was not able to resolve these issues because doing so would involve essentially re-cataloging the entire sample of 791 records.1 Bringing those record descriptions to even a minimum standard of acceptability entailed trying to ferret out spelling errors and ensuring that descriptions were at least complete sentences, even if those descriptions remained thin and non-diagnostic. In making these rudimentary revisions, Craig had to modify 92.04% (n=728) of the 791 records created by this student. This is obviously unacceptable.
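Screening for this kind of problem can be partly automated. The sketch below is one hedged possibility: it assumes the `records` data frame from the previous sketch, a `udf21` character column, and the hunspell package, and it will also flag legitimate proper nouns, so it is a screen for human review rather than a verdict.

```r
# Sketch: basic quality checks on Description (UM) text.
# Assumes `records` has a `udf21` character column; requires the hunspell package.
library(hunspell)

desc <- records$udf21
misspelled <- lengths(hunspell(desc)) > 0    # any words the dictionary rejects
no_period  <- !grepl("[.!?]\\s*$", desc)     # no terminal punctuation
lowercase  <- grepl("^[a-z]", desc)          # starts with a lowercase letter

flagged <- misspelled | no_period | lowercase
round(100 * mean(flagged), 2)                # share of descriptions needing review
```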
In Philippica XII, Cicero (43 BC) wrote that “anyone can err, but only the fool persists in his fault.”2 While an undergraduate student should clearly have the sense that misspellings and prose lacking punctuation are unacceptable, it is also on the shoulders of the Museum staff to figure out how to ensure that information going into the database meets minimum standards of quality. It can be extremely difficult to stay on top of everyone who is creating data, but figuring out methods to get feedback on data entry is crucial to the health of the database.
5.3 Possible Strategies for Addressing High Data Entry Error Rates
In reflecting on 20 years of designing and implementing digital object recording systems, Craig (2000; Craig and Aldenderfer 2001; Klarich and Craig 2001; Craig, Aldenderfer, and Moyes 2006) finds that when entering information into computer database forms, students and professionals alike frequently lack an appreciation for the fact that another person will be reading those records in the future. Instead, there is a tendency to uncritically feed “information” into a machine with little thought to what one will find a year or more later when someone seeks to retrieve those records. When this happens, subsequent users often find themselves frustrated with the data’s poor quality. Data entry must be done with long-term thought about the people who will use that information later in time; “time binding” (Korzybski 1949; Montagu 1953), or passing information through time, is the point of logging the data in the first place.
It is important to help people doing data entry understand that what goes into a canonical museum database becomes part of the institution’s official record. This means the information must be clear, detailed, and accurate so that other people down the road can read and understand what an object is. This matters for staff trying to relocate an object and for members of the general public seeking to better understand a museum’s collection from an online catalog.
In Craig’s experience, when it comes to digital data entry, having individuals review their own work can be helpful for teaching conceptual change. This is part of a general process of creating feedback loops to catch errors early on (Figure 5.4); this principle is relevant to just about any data collection procedure. Without some form of check, low-quality data can accumulate very quickly, to the point that database systems can be rendered of little use. Thus regular data validation is always necessary. This is true for experienced professionals, but it is particularly acute for new students just learning how to inventory objects.
Recommendations:
- Individuals who are creating new object inventory records should first write information out on paper and have this information checked–before entering it into the permanent canonical database. Craig understood that this was already standard practice in the Museum, but it appears that many errors were not caught.
- When it comes time to enter information into the database, type up descriptions in an external text editor and paste that text into the description field. Doing this allows the person entering data to get immediate feedback on incomplete sentences, grammar issues, and typographical errors–before entering it into the permanent canonical database.
- Individuals entering data should routinely query their records from the database and evaluate the quality of the work done; a minimal sketch of such a self-review follows this list. These incremental reports should be shared with Museum staff for review. Peer review of such records could make for a useful classroom activity in a museum studies class. Reviewing, critiquing, and revising existing records would also make for a productive course activity.
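One way a cataloger might operationalize that last recommendation is a short script run against their own records before handing an incremental report to staff. The column names below (`objectid`, `udf21`, `catalogedby`), the export file name, and the 40-character minimum are assumptions chosen only to make the sketch concrete.

```r
# Sketch: a cataloger's routine self-review of newly created records.
# Assumes a CSV export with `objectid`, `udf21`, and `catalogedby` columns.
records <- read.csv("records.csv", stringsAsFactors = FALSE)
mine    <- subset(records, catalogedby == "Student Name")   # restrict to one's own entries

issues <- data.frame(
  objectid   = mine$objectid,
  empty_desc = !nzchar(trimws(mine$udf21)),       # blank description
  short_desc = nchar(mine$udf21) < 40,            # arbitrary minimum-length screen
  no_period  = !grepl("[.!?]\\s*$", mine$udf21)   # missing terminal punctuation
)
subset(issues, empty_desc | short_desc | no_period)  # records to revisit before review
```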
5.4 String Comparison Resources
5.4.1 Links
- How Do You Compare Two Strings in R? - Stack Overflow
- How to Compare Strings in R with examples | R-bloggers
- stringr - Compare two strings and look for differences and display them for easy viewing in R (similar to git diff)? - Stack Overflow
- How to compare two strings word by word in R - Stack Overflow
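For reference, the kind of comparison those links describe can also be done with base R alone, for example to quantify how many descriptions changed between the original and revised text. The column names `udf21.pre` and `udf21.post` follow the labeling in Figure 5.3 but are otherwise assumptions about how the before-and-after text would be exported.

```r
# Sketch: compare pre- and post-revision description text.
# Assumes a data frame with `udf21.pre` and `udf21.post` character columns.
revisions <- read.csv("revisions.csv", stringsAsFactors = FALSE)

changed <- revisions$udf21.pre != revisions$udf21.post
sum(changed)                       # how many descriptions were modified
round(100 * mean(changed), 2)      # as a percentage of the sample

# Levenshtein (edit) distance gives a rough measure of how much each description changed.
dist <- mapply(adist, revisions$udf21.pre, revisions$udf21.post)
summary(dist[changed])
```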