REPERTORIUM’s MIR tools

REPERTORIUM employs OCR alignment and image matching techniques for chant identification and manuscript similarity detection.

Aligning inaccurate Optical Character Recognition (OCR) output with a known transcript is not a new problem. In the literature, it has traditionally been used to create databases for training and evaluating OCR systems. Within REPERTORIUM, these techniques are being employed to determine whether an input manuscript photo (image) contains a chant that has already been indexed and transcribed in chant databases (text). For historical handwritten documents, transcript alignment typically relies on Hidden Markov Models (HMMs) or dynamic programming methods such as Dynamic Time Warping (DTW) and global sequence alignment.
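
To illustrate the dynamic-programming flavour of these methods, the following is a minimal DTW sketch over two character sequences; the unit substitution cost is an assumption chosen for illustration, not the cost function of any system cited here.

```python
# Minimal Dynamic Time Warping (DTW) over two symbol sequences.
# The 0/1 substitution cost is an illustrative assumption.

def dtw_distance(a: str, b: str) -> float:
    """Accumulated DTW cost between sequences a and b."""
    n, m = len(a), len(b)
    INF = float("inf")
    D = [[INF] * (m + 1) for _ in range(n + 1)]
    D[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0.0 if a[i - 1] == b[j - 1] else 1.0
            # Each cell extends the cheapest of the three warping moves.
            D[i][j] = cost + min(D[i - 1][j],      # step in a only
                                 D[i][j - 1],      # step in b only
                                 D[i - 1][j - 1])  # step in both
    return D[n][m]

print(dtw_distance("gloria", "glorja"))  # small cost for one OCR confusion
```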

For example, methods in [1,2] utilise HMM-based sequence alignment to assess the similarity between the output of an OCR system and a ground-truth transcript. In [3], the OCR system OCRopus [4], trained on a small data set, is used to create an initial transcription of a medieval manuscript image.

This transcription is expected to misidentify some characters. Subsequently, the Needleman-Wunsch algorithm, a dynamic-programming method for global sequence alignment, is applied to align this noisy OCR transcript with an existing, accurate transcript.
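
Below is a minimal sketch of Needleman-Wunsch global alignment between a noisy OCR string and a reference transcript; the scoring parameters (match = +1, mismatch = −1, gap = −1) are illustrative assumptions, not the values used in [3].

```python
# Needleman-Wunsch global alignment score between a noisy OCR string
# and a reference transcript. Scoring values are illustrative only.

def needleman_wunsch(ocr: str, ref: str,
                     match: int = 1, mismatch: int = -1, gap: int = -1) -> int:
    n, m = len(ocr), len(ref)
    # F[i][j] = best score aligning ocr[:i] with ref[:j].
    F = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        F[i][0] = i * gap
    for j in range(1, m + 1):
        F[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if ocr[i - 1] == ref[j - 1] else mismatch
            F[i][j] = max(F[i - 1][j - 1] + s,  # (mis)match
                          F[i - 1][j] + gap,    # gap in the reference
                          F[i][j - 1] + gap)    # gap in the OCR output
    return F[n][m]

print(needleman_wunsch("Kyrie eleyson", "Kyrie eleison"))  # one OCR error
```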

The method is designed to be adaptable to any type of manuscript and allows for the utilisation of existing OCR models with minimal effort in terms of training and data preparation, as there is no strict requirement for the OCR transcript to be highly accurate. In a similar fashion to [1], a numerical value can be computed to measure the similarity between the OCR output and the candidate chant text.
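
As a sketch of how such a value might be used, the snippet below (building on the `needleman_wunsch` function above) normalises the alignment score by sequence length and ranks candidate chant texts; the normalisation and the candidate strings are hypothetical illustrations, not the scheme of [1] or [3].

```python
# Hypothetical ranking of candidate chant texts by a normalised
# alignment score; continues from the needleman_wunsch sketch above.

def similarity(ocr: str, ref: str) -> float:
    """Alignment score scaled by the longer sequence length."""
    return needleman_wunsch(ocr, ref) / max(len(ocr), len(ref), 1)

candidates = {
    "chant_A": "Kyrie eleison",
    "chant_B": "Gloria in excelsis Deo",
}
ocr_output = "Kyrie eleyson"
best = max(candidates, key=lambda k: similarity(ocr_output, candidates[k]))
print(best)  # chant_A
```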

Using similar methods, the University of Jaén has previously worked on searching musical score databases to find the score that matches a given input audio query [5].

Another approach, which operates independently of OCR, involves direct image matching between the input manuscript photo and candidate images in a database. These techniques are useful for detecting similar manuscripts. Matching images that depict the same scene or object typically relies on local features: sparse keypoints, each linked to a descriptor of its local appearance. The typical image-matching pipeline involves four steps: feature detection, description, matching, and outlier rejection.
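
The first two steps can be sketched with OpenCV’s SIFT implementation, as below; the file name is a placeholder, and SIFT stands in here for whichever detector-descriptor a given system actually uses.

```python
# Feature detection and description with OpenCV's SIFT: the first two
# steps of the classical matching pipeline. The file name is a placeholder.
import cv2

img = cv2.imread("manuscript_photo.jpg", cv2.IMREAD_GRAYSCALE)
sift = cv2.SIFT_create()
# keypoints: sparse image locations; descriptors: 128-D vectors
# summarising the local appearance around each keypoint.
keypoints, descriptors = sift.detectAndCompute(img, None)
print(len(keypoints), descriptors.shape)
```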

Recent research has predominantly concentrated on developing Convolutional Neural Networks (CNNs) to detect and describe local features [6]. Given two such sets of features, they are then matched using a nearest neighbour search in descriptor space, typically filtered with Lowe’s ratio test [7].
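
A sketch of this matching step with OpenCV follows, assuming SIFT features as above; the 0.75 ratio threshold is the conventional illustrative value, and the image names are placeholders.

```python
# Nearest-neighbour matching in descriptor space with Lowe's ratio test [7].
import cv2

img1 = cv2.imread("query.jpg", cv2.IMREAD_GRAYSCALE)
img2 = cv2.imread("candidate.jpg", cv2.IMREAD_GRAYSCALE)
sift = cv2.SIFT_create()
kp1, desc1 = sift.detectAndCompute(img1, None)
kp2, desc2 = sift.detectAndCompute(img2, None)

matcher = cv2.BFMatcher(cv2.NORM_L2)
# For each query descriptor, retrieve its two nearest neighbours.
knn = matcher.knnMatch(desc1, desc2, k=2)
# Keep a match only if it is clearly better than the runner-up.
good = [m for m, n in knn if m.distance < 0.75 * n.distance]
print(len(good), "tentative matches")
```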

Next, outlier matches are rejected using heuristic algorithms such as RANSAC. More recently, deep neural networks have been trained to simultaneously match local features and reject outliers when given an input image pair. Examples of this are the transformer-based SuperGlue and its lightweight successor, LightGlue [8].
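
Continuing the sketch above, outliers can be rejected heuristically by fitting a homography with RANSAC; `kp1`, `kp2`, and `good` come from the previous snippet, and the 5.0-pixel reprojection threshold is an illustrative assumption.

```python
# Rejecting outlier matches by fitting a homography with RANSAC.
# Continues from the matching sketch above (needs at least 4 matches).
import cv2
import numpy as np

src = np.float32([kp1[m.queryIdx].pt for m in good]).reshape(-1, 1, 2)
dst = np.float32([kp2[m.trainIdx].pt for m in good]).reshape(-1, 1, 2)
# Matches inconsistent with a single homography are flagged as outliers.
H, inlier_mask = cv2.findHomography(src, dst, cv2.RANSAC, 5.0)
inliers = [m for m, keep in zip(good, inlier_mask.ravel()) if keep]
print(len(inliers), "matches survive RANSAC")
```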

Other works have proposed end-to-end network architectures that perform image matching in a single forward pass instead of dividing the process into the steps above [9].
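
As a hedged sketch of such a detector-free matcher, the snippet below runs LoFTR [9] as packaged in the kornia library; the random tensors merely stand in for real manuscript photos, and the exact API (checked against kornia 0.7) may vary across versions.

```python
# Detector-free, single-pass matching with LoFTR [9] via kornia.
# API details are assumptions based on kornia 0.7.
import torch
import kornia.feature as KF

matcher = KF.LoFTR(pretrained="outdoor")
# LoFTR expects normalised grayscale tensors of shape (B, 1, H, W).
img0 = torch.rand(1, 1, 480, 640)  # stand-in for the query photo
img1 = torch.rand(1, 1, 480, 640)  # stand-in for the candidate image
with torch.no_grad():
    out = matcher({"image0": img0, "image1": img1})
print(out["keypoints0"].shape)  # matched coordinates in image0
```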

References

[1] S. Feng and R. Manmatha, “A Hierarchical, HMM-based Automatic Evaluation of OCR Accuracy for a Digital Library of Books,” in Proceedings of the Joint Conference on Digital Libraries, Chapel Hill, NC, 2006.

[2] V. Romero-Gómez, A. H. Toselli, V. Bosch, J. A. Sánchez, and E. Vidal, “Automatic Alignment of Handwritten Images and Transcripts for Training Handwritten Text Recognition Systems,” in Proceedings of the 13th IAPR International Workshop on Document Analysis Systems, 2018, pp. 328–333.

[3] T. de Reuse and I. Fujinaga, “Robust transcript alignment on medieval chant manuscripts,” in Proceedings of the 2nd International Workshop on Reading Music Systems, Delft, The Netherlands, 2019.

[4] T. M. Breuel, “The OCRopus Open Source OCR System,” in Proceedings of the Society of Photo-Optical Instrumentation Engineers, 2008.

[5] A. J. Muñoz Montoro, R. Cortina, S. Galan, E. Combarro, and J. Ranilla, “A score identification parallel system based on audio-to-score alignment,” The Journal of Supercomputing, vol. 76, 2020.

[6] K. M. Yi, E. Trulls, V. Lepetit, and P. Fua, “LIFT: Learned Invariant Feature Transform,” in ECCV, 2016, pp. 467–483.

[7] D. G. Lowe, “Distinctive image features from scale-invariant keypoints,” International Journal of Computer Vision, vol. 60, no. 2, pp. 91–110, 2004.

[8] P. Lindenberger, P.-E. Sarlin, and M. Pollefeys, “LightGlue: Local Feature Matching at Light Speed,” in ICCV, 2023.

[9] J. Sun, Z. Shen, Y. Wang, H. Bao, and X. Zhou, “LoFTR: Detector-free local feature matching with transformers,” in CVPR, 2021, pp. 8922–8931.