I attended the American Library Association Midwinter meeting in Denver in February 2018 and would like to highlight the session I Know Very Well How I Got My Name: Linked Data Authority Projects. Here's a summary of three different linked data name authority projects.
Name Reconciliation Work and Management of Digitized Special Collections
The first speaker was Myung-Ja (“MJ”) K. Han from the University of Illinois at Urbana-Champaign. Her team created a project, funded by the Andrew W. Mellon Foundation, to answer two questions:
- How can libraries recognize and reconcile name entities already described in other established linked open data sources?
- How can libraries best manage unique names that are found in local special collections, but not found anywhere else?
Many of their digitized special collections sit in silos, hidden from anyone who starts their search on the web with a search engine. They wanted to better connect these collections to the web and to find out how hard it would be to do that with Linked Open Data. The objectives of the project were to map legacy metadata schemas to linked-open-data-compliant schemas and to actively link to and from DBpedia, VIAF, Wikidata, and other related web resources. The benefits of this work include:
- Knowledge Cards on digital collection search results pages (information from Wikipedia and related works)
- On item pages, providing contextual information and links to other linked data sources
- Implementing data visualization that shows relationships between entities
One of the digital collections they used for the project was the Motley Collection of Theatre & Costume Design, which contains 5,000 images of costume and set designs. Names in this collection include actors (there are about 3,500 pictures of actors), authors, dancers, directors, and others who worked on productions. These are the types of names not found in other traditional authority data, such as the Library of Congress Name Authority File.
They consulted many name authority sources during the metadata cleanup phase, including the Library of Congress Name Authority File, the Virtual International Authority File, the Internet Movie Database, and Wikipedia. To find URIs for names, they consulted several other linked data sources, such as the Internet Broadway Database and WorldCat Identities, along with web resources such as Theatricalia. Person URIs were found through a manual look-up process, as were theater and play/performance names. For the Motley collection, they found URIs for 624 of 984 person names, 52 of 59 theater names, and 105 of 127 play/performance names.
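One way a look-up like this can be semi-automated is with VIAF's public AutoSuggest endpoint. The sketch below is an assumption-laden illustration, not the project's actual workflow: the endpoint URL is real, but the exact response fields (`result`, `nametype`, `viafid`) are based on VIAF's published JSON shape and may need adjusting.

```python
# Hedged sketch of semi-automated URI look-up against VIAF AutoSuggest.
# The response-parsing details are assumptions about VIAF's JSON shape.
import json
import urllib.parse
import urllib.request

VIAF_AUTOSUGGEST = "https://viaf.org/viaf/AutoSuggest?query="

def autosuggest_url(name: str) -> str:
    """Build the AutoSuggest request URL for a name string."""
    return VIAF_AUTOSUGGEST + urllib.parse.quote(name)

def pick_person_uri(payload):
    """Return the VIAF URI of the first personal-name suggestion, if any."""
    for hit in payload.get("result") or []:
        if hit.get("nametype") == "personal":
            return "http://viaf.org/viaf/" + hit["viafid"]
    return None

# A curator would fetch candidates and then eyeball them, e.g.:
# with urllib.request.urlopen(autosuggest_url("Gielgud, John")) as resp:
#     print(pick_person_uri(json.load(resp)))
```

Even with a helper like this, the presenters' point stands: a human still has to review the candidates, since many theater names are absent from the big authority files.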
They also looked at their Kolb-Proust Archive for this project. The names in this collection include family members, friends, journalists, and more. They already had a local name database for this collection, and they found URIs for 1,953 of its 5,727 people. For this collection, they decided to encode and publish the existing local name database as Linked Open Data (using schema.org).
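Publishing a local name database with schema.org might look something like the following. This is an illustrative sketch, not the Kolb-Proust project's actual output; the field choices and the placeholder URI are mine.

```python
# Illustrative sketch: one local name-database entry rendered as a
# schema.org Person in JSON-LD. Field choices are assumptions, not the
# Kolb-Proust project's actual mapping.
import json

def person_jsonld(name, alternate_names=(), same_as=()):
    """Render a local name record as a schema.org Person."""
    record = {
        "@context": "https://schema.org",
        "@type": "Person",
        "name": name,
    }
    if alternate_names:
        record["alternateName"] = list(alternate_names)
    if same_as:  # URIs found during reconciliation (VIAF, Wikidata, etc.)
        record["sameAs"] = list(same_as)
    return json.dumps(record, indent=2)

print(person_jsonld(
    "Proust, Marcel, 1871-1922",
    alternate_names=["Marcel Proust"],
    same_as=["http://example.org/viaf-uri"],  # placeholder, not a real VIAF URI
))
```

The `sameAs` property is what ties a local record back to the wider linked data web: it is only populated for the roughly one-third of people for whom a URI was found.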
Observations shared included:
- Additional name authority sources are needed for special collections. Many of the traditional sources focus on authors, and no single source has every name.
- When searching manually for URIs, they found it easiest to start with WorldCat Identities, followed by a Google web search.
- Manual metadata cleanup and manual searching help surface variant spellings, maiden names, nicknames, etc.
It took a graduate student six months, working 10-12 hours a week (240 hours total), to do this work for the two digital collections. The tasks included metadata cleanup, enhancement, reconciliation, and link collection.
Western Name Authority File: Linking People and Corporate Bodies
Anna Neatrour and Jeremy Myntti, from the University of Utah J. Willard Marriott Library, discussed their project to “explore and pilot a system for developing a collaborative, regional authority file for personal names and corporate bodies from digital collection metadata.” This is a two-year IMLS planning grant project, and they are just wrapping up the second year. Mountain West Digital Library (MWDL) is an aggregator of digital collections from several states and a Digital Public Library of America hub. When you aggregate data from many places, you can end up with many variations on how cultural heritage organizations enter names; the example shared had 12 variations on the spelling of a single person's name! Vendor digital asset management systems usually lack the authority control features found in traditional MARC-based library catalogs.

The MWDL partner institutions use Library of Congress Authorities and Vocabularies when possible, but not all metadata creators are trained to create and add new names to the Library of Congress Name Authority File. Many of MWDL's large academic libraries host digital collections for smaller institutions, where some control over cataloging practices is lost. Libraries also often consult other regional name sources, which don't always line up with Library of Congress Authorities and Vocabularies. To help solve these larger authority control issues, they sought a grant to develop the Western Name Authority File. They identified four project phases: Investigation, Evaluation, Implementation, and Assessment.
In the first phase of the project, they reviewed several models and came up with a core set of fields they wanted to capture (for example, preferred form of name, alternate forms of name, etc.). They looked at data models such as SKOS, OWL, BIBFRAME Authorities/Agent/Role, and EAC-CPF (Encoded Archival Context for Corporate Bodies, Persons, and Families). They chose EAC-CPF as the model that best fit their needs. EAC-CPF is a data model that captures relationships between people and the collections that document them. It is the same data model used in the SNAC (Social Networks and Archival Context) project.
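To give a feel for the model, here is a sketch that builds a skeletal EAC-CPF person record. The namespace URN and the `entityType`/`nameEntry`/`part` elements come from the EAC-CPF schema, but this is far from a complete record (a real one needs a `control` section with a record identifier, among other things), and the example name is my own invention.

```python
# Minimal sketch of an EAC-CPF person record; real records also require
# a <control> section, record identifiers, and source citations.
import xml.etree.ElementTree as ET

EAC_NS = "urn:isbn:1-931666-33-4"  # EAC-CPF namespace
ET.register_namespace("", EAC_NS)

def eac_person(name_forms):
    """Build a skeletal EAC-CPF record from a list of name forms."""
    root = ET.Element(f"{{{EAC_NS}}}eac-cpf")
    desc = ET.SubElement(root, f"{{{EAC_NS}}}cpfDescription")
    identity = ET.SubElement(desc, f"{{{EAC_NS}}}identity")
    ET.SubElement(identity, f"{{{EAC_NS}}}entityType").text = "person"
    for form in name_forms:  # first form = preferred, rest = variants
        entry = ET.SubElement(identity, f"{{{EAC_NS}}}nameEntry")
        ET.SubElement(entry, f"{{{EAC_NS}}}part").text = form
    return ET.tostring(root, encoding="unicode")

# Hypothetical example: a preferred form plus one variant.
record = eac_person(["Young, Brigham, 1801-1877", "Brigham Young"])
```

Capturing preferred and variant forms side by side is exactly what a regional authority file needs when twelve spellings of one name arrive from different partners.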
Another big part of the project involved collecting data from all the partners. The most common formats included JSON-LD, tab-delimited text files exported from CONTENTdm collections, and plain lists of names. They compiled all the data into one massive spreadsheet of about half a million names, with many duplicates. After deduping, they got the list down to about 75,000 names. They then had to reconcile those names with the Library of Congress Name Authority File, using OpenRefine.
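The core of a dedupe pass like this is a normalization key: strip accents, case, punctuation, and stray whitespace, then group raw strings that collapse to the same key. The sketch below shows the general technique; the specific normalization rules are illustrative, not the project's actual ones.

```python
# Sketch of normalization-based name de-duplication. The normalization
# rules here are illustrative assumptions, not the project's actual ones.
import re
import unicodedata
from collections import defaultdict

def norm(name):
    """Collapse case, accents, punctuation, and whitespace for matching."""
    s = unicodedata.normalize("NFKD", name)
    s = "".join(c for c in s if not unicodedata.combining(c))
    s = re.sub(r"[^\w\s]", " ", s.lower())
    return re.sub(r"\s+", " ", s).strip()

def dedupe(names):
    """Group raw name strings that normalize to the same key."""
    clusters = defaultdict(list)
    for n in names:
        clusters[norm(n)].append(n)
    return clusters

raw = ["Young, Brigham", "young, brigham", "Young , Brigham.", "Smith, John"]
clusters = dedupe(raw)  # three of the four variants collapse into one cluster
```

A pass like this only catches mechanical variants; maiden names, initials, and genuinely different forms still need the kind of human review and OpenRefine reconciliation the presenters described.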
In the second phase of the project, they investigated and tested many open source tools/frameworks that could be used to make the Western Name Authority File available, including xEAC, Apache Jena, TemaTres, and CollectiveAccess. They created an evaluation matrix to compare the features of each tool. They initially chose CollectiveAccess for the pilot implementation, but later found it would not give them an easy discovery layer for the Western Name Authority File, so they switched to Omeka S. They had already been using Omeka S for some of their digital collection exhibits, and it had much of the functionality they were looking for: custom vocabularies, an API to support reconciliation, the ability to publish data as JSON-LD, and a search/discovery layer.
They are currently in the pilot implementation phase of the project, and still have the assessment phase to complete. You can find more information about the project on the website.
Introducing Cedar: A Linked Data Authority Service at the University of Houston Libraries
The last presentation featured Anne Washington and Xiping Liu from the University of Houston Libraries. Library staff recently created a local linked data thesaurus named Cedar. The thesaurus is published using SKOS and includes subjects, personal and organization names, place names, and time periods, all drawn from their digital collections and their electronic theses and dissertations (ETD) collection.
Under the hood, Cedar is a Ruby on Rails web application built on the iQvoc SKOS vocabulary gem. They chose iQvoc as their vocabulary manager because it has a full-featured, responsive graphical user interface. They have also made local customizations so they can mint their own unique identifiers for local names. The presenters described how they now use the thesaurus in their digital creation workflows and for authority control when cataloging their ETDs. They plan to continue refining these workflows and to have more conversations about making even more use of the vocabulary in the future. They also mentioned a recent NISO report, Issues in Vocabulary Management, which includes recommendations for documentation around vocabularies; documentation is what they plan to work on next. Cedar can be viewed here.
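For readers unfamiliar with SKOS, a single published concept boils down to a preferred label, alternate labels, and a minted URI. The sketch below serializes one such concept as Turtle; the URI pattern and labels are hypothetical, invented for illustration, and are not Cedar's actual data.

```python
# Hedged sketch of a SKOS concept in Turtle. The URI pattern and labels
# are hypothetical illustrations, not Cedar's actual records.
def skos_concept_ttl(uri, pref_label, alt_labels=()):
    """Serialize one concept as SKOS Turtle."""
    lines = [
        "@prefix skos: <http://www.w3.org/2004/02/skos/core#> .",
        "",
        f"<{uri}> a skos:Concept ;",
        f'    skos:prefLabel "{pref_label}"@en',
    ]
    for alt in alt_labels:
        lines[-1] += " ;"
        lines.append(f'    skos:altLabel "{alt}"@en')
    return "\n".join(lines) + " ."

print(skos_concept_ttl(
    "https://example.edu/cedar/123",  # hypothetical minted identifier
    "Houston (Tex.)",
    alt_labels=["Houston, Texas"],
))
```

The ability to mint a stable URI like this for names that exist nowhere else is what makes a local SKOS vocabulary useful for authority control across workflows.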