Machine enrichment of content

Find out how our text and image processing pipeline can apply text recognition, natural language processing and machine learning techniques to make new connections in your content and enable new forms of navigation and discovery.

Many of our projects involve enrichment: the creation of new content, though both human and machine actions, from existing digitised images. This means providing full-text, tagging, image descriptions and so on. These are the materials from which we can then build richer search and discovery experiences.

Our enrichment pipeline is used extensively in a project like The Indigenous Digital Archive. Starting with the sorting of masses of digitised content into meaningful archival items, through Optical Character Recognition (OCR) of the text in those items, where available, and on to named entity extraction. Humans contribute more through additional tagging, transcription creation and correction, and comments.

Even if the only enrichment step is OCR to provide textual annotations and full-text search, the Enrichment Pipeline still drives the process. This can be seen in the RCVS case study. Here, the process starts with the basic hosted IIIF Images, with the pipeline generating text annotations that are later used by our search server to deliver the IIIF Search API, with word-level hit highlighting and full-text transcriptions for each page.

The Pipeline: IIIF in, Annotations out.

The output of an enrichment pipeline step might be things like:

  • OCR a page of text and save the results
  • Identify images in text and describe their contents
  • Identify figures and tables in text and generate machine-readable versions
  • Look for themes and sentiments in text and identify text ranges
  • Classify images by colour palette
  • ...in fact, any kind of automated generation of new knowledge from web resources you can think of!

Diagram showing steps in enrichment pipeline

Each step in the pipeline can inspect the IIIF resources flowing through it, and create new annotations in response to those resources. This might be tagging names in text, or classifying photographs in a newspaper.

The pipeline doesn't need to understand what happens inside each enrichment component. It provides the framework for content to flow through and be processed. The responsibility of an enrichment pipeline component is to accept and send notifications (so that it knows what to work on and when to work on it) and be able to save its outputs in standard annotation form.

The internal implementation of a processing step is hidden to the pipeline. In our projects, an enrichment module often talks to external services, like Google Vision or AWS Textract; or internal modules (such as a Tesseract wrapper, or our IIIF-to-NLP component, Montague).

With IIIF-in and Annotations-out, you can plug in any conceivable processing step that can analyse the content of a IIIF Canvas (even AV content), generate some new information, and save that new information in the form of a Web Annotation.