- A computer scientist at the Library of Congress is using machine learning to isolate historic images from digital newspaper archives.
- The project, called Newspaper Navigator, uses optical character recognition (OCR) algorithms to turn printed or handwritten characters into searchable text, with machine learning automating the process.
- You can read the team's preprint paper (meaning it has not yet been peer-reviewed), posted to the preprint server arXiv on May 4.
In July 1848, L'illustration, a French weekly, printed the first photo to appear alongside a story. It depicted Parisian barricades set up during the city's June Days uprising. Nearly two centuries later, photojournalism has endowed libraries with legions of archival pictures that tell stories of our past. But without a methodical approach to curating them, these historical images could get lost in endless mounds of data.
That's why the Library of Congress in Washington, D.C. is undertaking an experiment. Researchers are using specialized algorithms to extract historic images from newspapers. Digital scans alone can capture the pages, but these algorithms can also analyze, catalog, and archive each image, creating a massive database covering 16 million newspaper pages that can be queried with a simple search.
Ben Lee, innovator-in-residence at the Library of Congress and a graduate student studying computer science at the University of Washington, is spearheading the project, called Newspaper Navigator. His dataset comes from an existing project called Chronicling America, which compiles digitized newspaper pages published between 1789 and 1963.
He noticed that the library had already embarked on a crowdsourcing journey to turn some of those newspaper pages into a searchable database, with a focus on content relating to World War I. Volunteers could mark up and transcribe the digital newspaper pages, something that computers aren't always so great at. In effect, what they had built was a perfect set of training data for a machine learning algorithm that could automate all of that grueling, laborious work.
"Volunteers were asked to draw the bounding boxes such that they included things like titles and captions, and so then the system would...identify that text," Lee tells Popular Mechanics. "I thought, let's try to see how we can use some emerging computer science tools to augment our abilities and how we use collections."
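The bounding boxes Lee describes are exactly the kind of labels an object-detection model learns from. A minimal sketch of how such crowdsourced annotations might be grouped into training examples (the field names and file name here are hypothetical, not the Library's actual schema):

```python
# Sketch: turning crowdsourced bounding-box annotations into training
# examples for an object-detection model. The annotation fields below
# are hypothetical, not the Library of Congress's actual schema.

def to_training_example(page_id, annotations):
    """Group volunteer annotations for one newspaper page into the
    (image, boxes, labels) triple a detector typically trains on."""
    boxes = [a["box"] for a in annotations]     # [x_min, y_min, x_max, y_max]
    labels = [a["label"] for a in annotations]  # e.g. "headline", "photograph"
    return {"image": page_id, "boxes": boxes, "labels": labels}

annotations = [
    {"box": [10, 20, 300, 60], "label": "headline"},
    {"box": [50, 80, 250, 400], "label": "photograph"},
]
example = to_training_example("page_0001.jpg", annotations)
```

Each volunteer-drawn box becomes one labeled region, so thousands of marked-up pages add up to a detection dataset without any extra annotation effort.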
In total, it took about 19 days' worth of processing time for the system to sift through all 16,358,041 newspaper pages. Of those, the system only failed to process 383 pages.
What Is Optical Character Recognition?
Newspaper Navigator builds upon the same technology that engineers used to create Google Books. It's called optical character recognition, or OCR for short, and it's a class of machine learning algorithms that can translate images of typed or handwritten symbols, like words on a scanned magazine page, into digital, machine-readable text.
At Popular Mechanics, we have an archive of almost all of our magazines on Google Books, dating back to January 1905. Because Google has used OCR to optimize those digital scans, it's simple to go through and search our entire archive for mentions of, say, "spies," and get instant results.
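Once OCR has converted scanned pages into plain text, making an archive keyword-searchable is conceptually simple. A toy sketch (with made-up page text) using an inverted index, the same basic structure behind full-text search:

```python
# Sketch: after OCR turns scanned pages into plain text, an inverted
# index maps each word to the pages that contain it, making the whole
# archive keyword-searchable. Page ids and text here are invented.
from collections import defaultdict

def build_index(pages):
    """Map each lowercased word to the set of page ids containing it."""
    index = defaultdict(set)
    for page_id, text in pages.items():
        for word in text.lower().split():
            index[word.strip(".,!?\"'")].add(page_id)
    return index

pages = {
    "1905-01": "German spies arrested near the harbor",
    "1905-02": "New engine design for automobiles",
}
index = build_index(pages)
hits = index["spies"]  # every page mentioning "spies"
```

Real search engines add ranking, stemming, and fuzzy matching on top, but the core lookup works just like this.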
But images are something else, entirely.
Using deep learning, Lee built an object detection model that could isolate seven different types of content: photographs, illustrations, maps, comics, editorial cartoons, headlines, and advertisements. So if you want to find photos specifically of soldiers in trenches, you might search "trenches" in Newspaper Navigator and get results instantly.
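A query like that combines two things the system produces: the predicted content class of each region and the OCR'd text around it. A simplified sketch of that kind of lookup, over invented records rather than Lee's actual data format:

```python
# Sketch: a Newspaper Navigator-style query filters detected regions by
# their predicted content class, then matches a keyword against the
# OCR'd caption text. The records below are invented for illustration.

def search(records, content_type, keyword):
    """Return ids of regions of the given class whose caption mentions
    the keyword (case-insensitive)."""
    return [
        r["id"] for r in records
        if r["type"] == content_type and keyword in r["caption"].lower()
    ]

records = [
    {"id": "img1", "type": "photograph", "caption": "Soldiers in the trenches"},
    {"id": "img2", "type": "map", "caption": "Trenches along the front"},
    {"id": "img3", "type": "photograph", "caption": "Parade downtown"},
]
results = search(records, "photograph", "trenches")  # photos only, not maps
```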
Before, you'd have to sift through potentially thousands of pages' worth of data. This breakthrough will be extremely empowering for archivists, and Lee has open-sourced all of the code that he used to build his deep-learning model.
"Our hope is actually that people who have collections of newspapers...might be able to use the code that I'm releasing, or do their own version of this at different scales," Lee says. One day your local library could use this sort of technology to help digitize and archive the history of your local community.
Libraries of the Future?
This is not to say that the system is perfect. "There definitely are cases in which the system will especially miscategorize say, an illustration as a cartoon or something like that," Lee says. But he has accounted for these errors with confidence scores that indicate the likelihood that a given piece of media is, say, a cartoon or a photograph.
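Confidence scores let a curator decide how much uncertainty to tolerate: raise the threshold and you keep only the detections the model is surest about. A minimal sketch of that filtering step, with made-up scores and a threshold chosen purely for illustration:

```python
# Sketch: each detection carries a confidence score; filtering on a
# threshold trades recall for precision. Scores and the 0.8 threshold
# are illustrative, not values from the actual Newspaper Navigator model.

def filter_detections(detections, threshold=0.8):
    """Keep only detections the model scored at or above the threshold."""
    return [d for d in detections if d["score"] >= threshold]

detections = [
    {"label": "cartoon", "score": 0.95},       # a confident prediction
    {"label": "illustration", "score": 0.55},  # ambiguous: maybe a cartoon
]
confident = filter_detections(detections)  # only the 0.95 cartoon survives
```

Low-scoring detections aren't discarded by the archive; the score simply flags them for a human to review rather than letting the model make the final call.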
Lee also says that, despite his best efforts, these kinds of systems will always encode some human bias. But to reduce any heavy-handedness, Lee focused on the classes of images—cartoon versus advertisement—rather than what's actually shown in the images themselves. Lee believes this should reduce the instances of the system attempting to make judgment calls about the dataset. Those calls should be left to the curator, he says.
"I think a lot of these questions are very, very important ones to consider and one of my goals is to use this project as an opportunity to highlight some of the issues around algorithmic bias," Lee says. "It's easy to assume that machine learning solves all the problems—that's a fantasy—but in this project, I think it's a real opportunity to emphasize that we need to be careful how we use these tools."