6 Duplicate Identification

¡TL/DR!

6.1 Purpose

This document outlines how to check for duplicate accession and object identification numbers in PastPerfect. Mixed use of two and four digit encoding of year in accession number and object identification number resulted in duplicate records. This document outlines an approach to identifying and reporting these kinds of duplicate records. The document’s purpose is to support duplicate record merging in PastPerfect or other similar situations. The general approach to duplicate record identification could be applied in a variety of database deduping contexts.

6.2 Problem

Information about museum objects is tracked by object numbers (Simmons and Kiser 2020; NPS Information Management Team (IMT) and NPS Museum Management Program (MMP) 2019; Chenhall 1975). Museum accession numbers are generally based on a binomial following the formula year.accesion while object numbers are based on a trinomal system using the following formula year.acquisition.object (Simmons and Kiser 2020, pt. 4). This system encodes key pieces of information about accessions and objects. Each object number should be completely unique, and the museum database should contain no duplicate records.

Unfortunately, over time different individuals registering objects in PastPerfect inconsistently entered year values in both accession and object id fields. In earlier cases, year was entered as a two digit value (i.e. ##). Now year is entered as a four digit value (i.e. ####). Additionally, several individuals recently added new records into the database without first checking to see if there were existing records logged with two digit year encoding. As a result, now there are several duplicate records in the database, early records have two digit year encoding and the subsequent duplicates have four digits. Each record contains independent linked information like location and conservation history. It is valuable to merge this information into a single comprehensive canonical record. To get there, one must first identify the duplicate records.

For the future

So that this matter does not arise again, identify all records with two digit year encoding and modify the accessions and accession numbers in PastPerfect.

6.3 Methods

The following reads the data, collapses four digit year encoding writing it to a new column, establishes counts by this new collapsed number, identifies ids with multiple entries, isolates these in a new table, and joins them to the complete record. That product gets reported as a reactive DT table.

6.3.1 Read and Clean Data

This probably loads an unnecessary number of libraries, but deploys verbs I’m familiar with. For example, I believe base tapply could substitute the group by and summarize dplyr functions.

# load libraries
library(readxl)
library(dplyr)
library(janitor)
library(stringr)
library(forcats)

Data were first exported from PastPerfect (Section 3.1). This export file is read into R. From there, field types are assigned and names are cleaned consistent with R conventions. A separate script is called to do some standard wrangling.

# Read data
df <- read_excel(here::here("data/inventory/inventory_2023_02_23.xlsx")) %>% janitor::clean_names()

# Clean columns
source(here::here("scripts/past_perfect_clean_fields.R"))

6.3.2 Create New Object Number w. Truncated Year

To compare two digit and four digit year encoding, four digit year encoding were stripped to two digits. Doing this relies on a regular expression or regex. To build the proper expression, I had to fiddle a bit. I worked my way to it by making a minimum reproducible example that isolated the problem (see Section 6.5).

# Add column composed of two digit objectid
df$objectid2 <- str_remove_all(df$objectid , "^19|^20")

6.3.3 Summarize New Object Number w. Truncated Year

With two digit year encoding in place, group by the new object id field and summarize based on that grouping.

# Get duplicates
df_objectid2 <- df |> 
  group_by(objectid2) |> 
  summarise(n = n())

6.3.4 Isolate Duplicates

Filter only those object numbers whose counts are greater than 1. This results in a list of duplicate records.

# Isolate duplicates
df_dupes <- df_objectid2 |> 
  filter(n>1)

6.3.5 Join Duplicate Table with Complete Record

Join the duplicate record list to the full set of object attributes.

# Join table of duplicates with full record info
df_joined <- left_join(df_dupes, df, by = "objectid2", multiple = "all")

6.4 Results

There are 44 duplicate records in the database.

6.4.1 Plot Reactive Table Using JavaScript

Note

2023-03-01 flagdate is not reporting properly. Need to track that down.

# Other fields: catby, invnby, flagdate

df_joined |> 
  select(objectid2, catdate, invndate, flagnotes, flagreason, objname) |> 
  DT::datatable()

Show entries

Search:

	objectid2	catdate	invndate	flagnotes	flagreason	objname
1	69.03.03	2016-05-24		Duplicate of 1969.03.03	Quality Check	Bag
2	69.03.03	2022-03-30	2022-03-30			Bag
3	73.02.10	2019-06-11				Hat
4	73.02.10	2022-09-15				Hat
5	75.30.02	2019-07-31				Dress
6	75.30.02	2000-06-06				Dress
7	77.25.100	2018-07-17				Candleholder
8	77.25.100	2021-12-06	2021-12-06			Candleholder
9	77.25.11	2018-07-16				Mug
10	77.25.11	2021-07-01	2021-07-01			Pitcher

Showing 1 to 10 of 44 entries

Previous1 2 3 4 5Next

6.5 Reproducible Example

The following is a reproducible example illustrating the key pieces of the procedure above. Making a small example that isolates the problem is a really good way to finding a solution.

Code

x <- c("1991.45.03",
       "91.45.03",
       "2002.34.55",
       "2002.34.75")

Return only numbers that begin with 19, using a stringr verbs.

Code

str_detect(x, "^19")

[1]  TRUE FALSE FALSE FALSE

Code

str_starts(x, "19")

[1]  TRUE FALSE FALSE FALSE

Use stringr::str_remove_all with the regex to strip four digit year encoding to two digits.

Code

# Removes all when objectid starts with 19 or 20
str_remove_all(x, "^19|^20")

[1] "91.45.03" "91.45.03" "02.34.55" "02.34.75"