Science Success Story
Collaboration Develops AI Tool for "Long Tail" Stamp Recognition in Japanese Historic Documents
XSEDE systems and expertise power a study of business documents that promises better automated analysis of datasets with numerous rare items, a key limitation in artificial intelligence
By Ken Chiacchia, Pittsburgh Supercomputing Center
Number of stamp images collected per class, showing the start of the "long tail" of stamps with sparse representation. Only classes with more than 20 images are shown; rarer classes would extend far beyond the right edge of the page.
The MiikeMineStamps dataset of rubber-stamp impressions provides a unique window into the workings of a large Japanese corporation, opening unprecedented possibilities for researchers in the humanities and social sciences. But some of the stamps in this archive only appear in a small number of instances. This makes for a "long tail" distribution that poses particular challenges for AI learning, including fields in which AI has experienced serious failures. A collaboration between scientists at the University of Pittsburgh (Pitt), the Pittsburgh Supercomputing Center (PSC), DeepMap Inc. of California and Carnegie Mellon University (CMU) took up this challenge, using XSEDE's Extended Collaborative Support Service (ECSS) and the XSEDE-allocated Bridges and Bridges-2 systems at PSC to build a new deep-learning (DL)-based tool for analyzing "long tail" distributions.
Why It's Important
Important documents like contracts, loans and other legal or financial instruments are usually signed. In East Asia, personalized stamps have traditionally taken the place of signatures, though electronic signatures have recently become more common. Whether a handwritten signature, a stamp or an electronic signature, these instruments of certification are meant to attest that a document is authentic. But all of them can be forged. Our entire legal and financial system assumes that signatures, stamps and electronic signatures are irreproducible, even though we know this isn't the case.
If all of these instruments of certification are forgeable, why have they been used for so long? And what is the historical process that caused societies to trust that documents that are certified with a signature, a stamp or an electronic signature are authentic?
...until now, it was pretty much impossible to easily index tens of thousands of stamps in an archive of documents, especially when these documents are all in a language like Japanese, which uses thousands of different Chinese characters. This project makes that possible.—Raja Adal, University of Pittsburgh
Raja Adal, associate professor at Pitt, has been working on these and other questions related to instruments of writing, reproduction and certification. In his latest project, he is analyzing stamps in what is probably the largest business archive in modern Japan. The Mitsui Company's Mi'ike Mine archive spans half a century and includes tens of thousands of pages of documents bearing close to 100,000 stamps. This trove of documents can help uncover the role that stamps have played in modern Japanese society. But creating a database of so many stamps has until recently been impossible: it would require a huge team of research assistants, and those assistants would need to be able to decipher the highly stylized Kanji characters used in the stamps – a level of expertise that isn't realistic.
Examples of collected images of stamps in the MiikeMineStamps dataset.
To create an index of the stamps in this archive, Adal turned to the ECSS, XSEDE computing systems, and colleagues at PSC, DeepMap Inc. of California (now NVIDIA) and CMU.
How XSEDE Helped
Adal's team had their work cut out for them, needing both expertise and computing power. XSEDE had both on offer, starting with ECSS experts Paola Buitrago, director of AI and Big Data at PSC and first author of the study; Rajanie Prabha, PSC machine-learning research scientist; and Julian Uran, PSC machine-learning research engineer.
This type of project is typically tackled with DL, a type of AI. DL works by first having the computer train a "model" on a limited dataset in which human experts have labeled the "right answers." Once training is complete, the researchers test the accuracy of the model on a portion of the data that was not used during training. After repeated cycles of training and successful testing, the AI is ready to be applied to the full, unlabeled dataset.
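The train-then-test cycle described above can be sketched in a few lines. This is a minimal, illustrative example using a toy 1-nearest-neighbor "model" on made-up two-dimensional points – the data, split ratio and classifier are all assumptions for demonstration, not details of the actual stamp study.

```python
# Minimal sketch of the train/test cycle: fit on labeled data,
# then measure accuracy on held-out examples the model never saw.
import random

random.seed(0)

# Labeled dataset: (feature_vector, class_label) pairs, two clusters.
data = [((random.gauss(c, 0.5), random.gauss(c, 0.5)), c)
        for c in (0, 3) for _ in range(50)]
random.shuffle(data)

# Hold out a portion of the labeled data for testing.
split = int(0.8 * len(data))
train, test = data[:split], data[split:]

def predict(x):
    """Classify x by the label of its nearest training example."""
    nearest = min(train, key=lambda item: (item[0][0] - x[0]) ** 2
                                        + (item[0][1] - x[1]) ** 2)
    return nearest[1]

# Accuracy on data that was not used during training.
correct = sum(predict(x) == y for x, y in test)
accuracy = correct / len(test)
print(f"held-out accuracy: {accuracy:.2f}")
```

In a real DL workflow the nearest-neighbor lookup would be replaced by a trained neural network, but the evaluation logic – score the model only on held-out data – is the same.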
Problems come up, though, when the initial training dataset isn't varied enough to capture the complexity of the full dataset. Infamous failures of AI in the real world – for example, AIs that pick up prejudiced language or hire in a biased way – often originate in a biased or incomplete training dataset. The problem with the MiikeMineStamps dataset is that there are many, many people whose stamps are represented very few times. This means that there's a "long tail" of rare stamps in the full dataset that is extremely difficult to capture in a training dataset.
Machine learning learns by examples. It's all about having enough well-represented examples … When you have so many possible classes and limited, [rare] instances in each class … you need to [approach the problem] differently from what you normally would assume for a regular machine-learning task.—Paola Buitrago, XSEDE/PSC
To overcome that problem, the team applied a new, two-step method of DL. In the first step, they used three different datasets of general images – not stamps – to train the AI to detect and classify generic objects. Once it had done that, they used a classification model that grouped the stamps into broad classes, including the rare ones. They then sampled stamps from those classes to build a stamp-training dataset. This approach, known as active learning, enabled their AI to become flexible enough to recognize the long-tail stamps in the full dataset.
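The active-learning cycle can be sketched as a loop: train, score the unlabeled pool, send the least-confident items to a human labeler, retrain. The toy "model" below scores confidence by how often a class has already been labeled – an assumption made purely to keep the sketch self-contained; the actual study used deep networks pretrained on generic image datasets.

```python
# Hedged sketch of an active-learning loop: repeatedly label the
# items the current model is least confident about, so rare classes
# get picked up instead of being drowned out by common ones.
import random
from collections import Counter

random.seed(2)

# Unlabeled pool with hidden true labels, skewed toward class "A".
pool = [random.choice("AAAAABBBCD") for _ in range(200)]
labeled = []                      # examples a human has labeled so far

def confidence(item, freqs, total):
    """Toy confidence: how often has this class been labeled already?"""
    return freqs.get(item, 0) / total if total else 0.0

for cycle in range(5):
    freqs = Counter(labeled)
    total = len(labeled)
    # Rank unlabeled items by model confidence, lowest first.
    ranked = sorted(pool, key=lambda x: confidence(x, freqs, total))
    # "Label" the 10 least-confident items (simulating the human expert).
    batch, pool = ranked[:10], ranked[10:]
    labeled.extend(batch)

print(sorted(set(labeled)))
```

Because unseen classes always score zero confidence, the loop prioritizes them in the next round – exactly the behavior needed to pull rare long-tail items into the training set.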
Using the AI-oriented graphics processing units (GPUs) of the XSEDE-allocated Bridges research computer – and transitioning to the newer XSEDE-allocated Bridges-2 when Bridges retired – the team showed that repeated cycles of training improved the AI's average precision from 44.7% to 84.3%. That's just a beginning; now that they've published this proof-of-concept study, they'd like to improve the AI's performance and use it to study other datasets that address a broad range of questions. The researchers presented their findings in a paper at the International Conference on Document Analysis and Recognition (ICDAR 2021) on Sept. 5-10, 2021.
This work used the Extreme Science and Engineering Discovery Environment (XSEDE), which is supported by National Science Foundation grant number ACI-1548562. Specifically, it used the Bridges and Bridges-2 systems at PSC, which are supported by NSF award numbers ACI-1445606 and ACI-1928147. The work was made possible through the XSEDE ECSS program.
At a Glance:
Despite its promise, artificial intelligence (AI) can fail – infamously – when its training dataset isn't fully representative of the real world.
The MiikeMineStamps dataset offers a historical and social-science goldmine for studying business practices in Japan and elsewhere, but its "long tail" of rare stamps poses a big training-dataset challenge.
Using XSEDE's Extended Collaborative Support Service and XSEDE-allocated computation, scientists created a new active-learning-based AI that leverages a training dataset with good representation of the MiikeMineStamps data.
The method holds promise for addressing training-dataset-driven failures of AI.