AI can recognize handwriting in historical documents to streamline transcription, can aggregate data to simplify the research process and can change the way learners interact with primary documents. However, AI models might not have permission to use those documents and might misinterpret those documents without the proper safeguards.
PAPER ARCHIVES
Nowadays, most materials universities produce are born digital, meaning they are created in a digital format like a PDF. But older, analog materials have to be converted into a digital medium to remain useful, according to Justin McHenry, university archivist at the American Public University System (APUS). There’s no way around manually scanning them and adding descriptive metadata such as title, author, date, location and owner, so most institutions have at least some backlog.
Once a document is scanned, though, AI has shown proficiency in recognizing complex patterns in handwritten and printed documents that humans may struggle with due to older writing conventions, difficult historical contexts or damage to the document.
FromThePage, a historical document transcription website that used to crowdsource transcription services for education and museum clients, is pivoting to include AI word recognition in its platform. Now, when a transcriber pulls up a document, they can generate a paragraph of what the AI predicts the document says off to the side, or overlay it on top of the document to compare the original document with the generated text word by word.
“Think of it as having someone look over your shoulder and give you a second opinion: ‘What’s this word? I wonder what this word is. Do you think it could be this or this?’” Sara Brumfield, co-founder of FromThePage, said. “It's a way to do that in an automated fashion.”
The AI transcription is time-saving and can also keep humans from engaging repeatedly with records of historical atrocities. The U.S. Holocaust Memorial Museum, which works with FromThePage, used AI transcription to process Nazi-era records and locate information on people without requiring a human to read large volumes of potentially upsetting material.
AI can also summarize transcribed paper documents, change the style for different grade levels or create a podcast. But generative AI’s bias mitigation practices can clash with records of historical injustice, with messy ramifications such as omitting mention of key, if potentially unpleasant, events.
“We’ve seen a lot of the ethical guardrails put on systems like ChatGPT that limit the kind of research that it’s possible to use these tools for,” Ben Brumfield, co-founder of FromThePage, said. “If they don’t want to talk about slavery or antisemitism or genocide or violence, our historical texts are full of those.”
The Brumfields have taken measures to keep humans centered in the transcription process. On the platform, emojis with an unsure expression pop up next to words the AI is less confident in, and paragraph disclaimers remind transcribers that the generated text likely contains errors.
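As a rough illustration of the flagging idea described above, the sketch below marks any word whose confidence score falls under a cutoff. It is not FromThePage's implementation: the `Word` type, the `annotate` function, the 0.80 threshold and the choice of emoji are all assumptions made for the example, and the sample words are invented.

```python
from dataclasses import dataclass

# Hypothetical cutoff; the article does not say how the platform decides
# which words count as uncertain.
LOW_CONFIDENCE = 0.80
UNSURE_MARK = "\U0001F615"  # "confused face" emoji, chosen for illustration


@dataclass
class Word:
    text: str
    confidence: float  # assumed model score between 0.0 and 1.0


def annotate(words, threshold=LOW_CONFIDENCE):
    """Return the draft text with a marker after each low-confidence word."""
    parts = []
    for word in words:
        if word.confidence < threshold:
            parts.append(f"{word.text} {UNSURE_MARK}")
        else:
            parts.append(word.text)
    return " ".join(parts)


# Invented sample data: the last word is the one the model is unsure about.
draft = [Word("Dear", 0.98), Word("Sister", 0.95), Word("Amelia", 0.42)]
print(annotate(draft))
```

The marker only points a human transcriber toward words worth a second look; the decision about what the word actually says stays with the person.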
BORN DIGITAL ARCHIVES
Unlike paper documents, perhaps, born digital archives cannot simply be stored and forgotten. They won’t evolve on their own as digital formats grow and change, so it is up to an archivist to keep track of how to convert different file types and to maintain an archival database, the virtual version of a climate-controlled room. Like the scanning process, this portion of the work remains mostly manual, and AI is adding to it by making it easier to create text, photos and videos, born-digital work that might then need to be cataloged in a university archive.
“It’s a lot harder to stay on top of everything, because there's so much digital information that's being created, and it's also being created in a format that's not easily preservable,” McHenry said.
AI can be helpful in processing searches, an area archivists and historians typically help with. Today a researcher might have to look separately for pertinent resources on the websites of each branch of the federal government; AI might be able to aggregate that data into a single search, McHenry said. Similarly, the Brumfields worked with one researcher studying Civil War-era terrorism, mapping the newer term, unusual then, to relevant older words like “brigandage” or “raids” to locate pertinent records and avoid search bias.
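The term mapping described above can be sketched as a simple query expansion. This is an illustrative toy, not the method the Brumfields or the researcher used: the `PERIOD_TERMS` table, the `expand_query` and `search` functions and the sample records are all hypothetical, and only “brigandage” and “raids” come from the article.

```python
# Hypothetical mapping from a modern search term to period vocabulary.
# Only "brigandage" and "raids" are taken from the article.
PERIOD_TERMS = {
    "terrorism": ["brigandage", "raids"],
}


def expand_query(term, mapping=PERIOD_TERMS):
    """Return the term plus any older equivalents, without duplicates."""
    key = term.lower()
    return [key] + [t for t in mapping.get(key, []) if t != key]


def search(documents, term, mapping=PERIOD_TERMS):
    """Return titles of documents whose text mentions any expanded term."""
    terms = expand_query(term, mapping)
    return [
        title
        for title, text in documents.items()
        if any(t in text.lower() for t in terms)
    ]


# Invented sample records, for illustration only.
records = {
    "Letter, 1863": "Reports of brigandage along the county road.",
    "Ledger, 1864": "Grain prices rose again this season.",
}
print(search(records, "terrorism"))
```

A plain search for the modern word would miss the 1863 letter entirely, which is the kind of search bias the expansion is meant to reduce.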
TRAINING MODELS
While AI models can be useful for archival work, Lisa Macklem, a legal scholar and lecturer at the University of Western Ontario, said generative AI tools are often built on massive data sets that can lead to copyright issues, misrepresentation of historical facts and a lack of transparency.
While it is common practice to cite academic sources like archives, many AI tools do not follow the same citation procedures. So, for example, a chatbot’s response could cite an archive for a point that doesn’t align with the facts of the work or the archivist’s intent. The New York Times is in an ongoing lawsuit with OpenAI and Microsoft, alleging not only that the newspaper’s archives were used without permission, but that incorrect or misleading reproductions of those archives dilute trust in the organization, Macklem explained.
“Particularly for archives, it’s important, I think, now to start putting up safeguards,” she said. “If the barn door is already open and the material has been scraped, how do you get it back?”