2023 (English) Licentiate thesis, comprehensive summary (Other academic)
Abstract [en]
Historical documents are a valuable source of cultural knowledge and can provide information about past events, societies, beliefs, and cultures. They can serve as an excellent source for research in various fields, including history, literature, linguistics, and anthropology. Their preservation and analysis pose significant challenges due to the unique characteristics of handwritten scripts, their variability, and document degradation. With the rise of the Deep Learning era, enormous amounts of annotated data are required to train large models that can efficiently perform tasks on unseen data. Nowadays, digital libraries provide high-quality digitized images for the analysis and processing of historical documents. However, collecting and annotating the provided data is an expensive task that requires considerable expertise from historians and other humanities scholars. Hence, generating synthetic data to enhance the performance of Deep Learning frameworks is a common approach in Computer Vision and, specifically in this thesis, in Document Image Analysis and Recognition (DIAR).
This thesis focuses on leveraging generative models to facilitate DIAR tasks on historical and handwritten documents by generating realistic synthetic images that resemble a real distribution and enhance the training of downstream DIAR tasks. The contributions of the thesis include a systematic literature review, a comparative evaluation, and a newly developed method for handwriting generation.
First, a systematic literature review of existing historical document image datasets provides summarized information on 65 studies, focusing on different aspects such as statistics, document type, language, and visual and annotation aspects. The study discusses limitations, namely the limited dataset size, the absence of benchmarks, and the lack of standardization in data formats and evaluation schemes, and identifies promising resources for future research.
A subsequent contribution is the integration of generated data in a historical document font classification task. Semi-synthetic data are generated with DocCreator, an open-source software tool, from which different document degradation augmentations are used. A conditional Generative Adversarial Network (GAN) is used to generate fully synthetic data conditioned on a specific sample. The data generated by the two methods are integrated as additional samples in the training of several Convolutional Neural Network classifiers, and the effect on performance is examined.
The final contribution of the thesis introduces a new method for generating styled handwritten text images based on Denoising Diffusion Probabilistic Models (DDPM), an approach previously unexplored in DIAR. The method captures the stylistic and content characteristics of a standard multi-writer handwriting dataset and achieves improved performance in enhancing writer identification and handwriting text recognition compared to GAN-based methods. The results demonstrate the potential of the generative method for enabling deep document image analysis and pave the way for further research.
As a future direction, this work will aim to progress from generating word images to generating sentence and full document images by conditioning on the content, style, and layout of historical documents. Another future step will be to extend the proposed method to operate in a few-shot scheme for the writer style condition in order to generate unseen styles. Furthermore, future work will aim to leverage important features from pre-training with synthetic and real data in order to generalize to historical documents, which are a scarce resource, and to adjust the text encoding parts to different languages and scripts. Finally, the ultimate goal is to generate a massive synthetic historical document image database to fill the existing benchmark gap.
Place, publisher, year, edition, pages
Luleå: Luleå University of Technology, 2023
Series
Licentiate thesis / Luleå University of Technology, ISSN 1402-1757
National Category
Other Electrical Engineering, Electronic Engineering, Information Engineering
Research subject
Machine Learning
Identifiers
urn:nbn:se:ltu:diva-96361 (URN)
978-91-8048-303-2 (ISBN)
978-91-8048-304-9 (ISBN)
Presentation
2023-06-07, A117, Luleå tekniska universitet, Luleå, 10:00 (English)
Opponent
Supervisors
2023-04-12 2023-04-11 2024-03-22 Bibliographically approved