Handwritten Text Generation (HTG) has emerged as a promising remedy for the data scarcity that limits the training of Deep Learning (DL) models for Document Image Analysis and Recognition (DIAR) tasks. Current HTG pipelines are mostly based on adversarial training, which can suffer from issues such as mode collapse, limiting variability and, in turn, reducing the practical usefulness of the generated data. Moreover, existing evaluation protocols prioritize visual realism over usefulness, obscuring the link between generative quality and downstream task performance.
These limitations raise several questions: which HTG methods truly benefit DIAR tasks, which real datasets should serve as a foundation for synthetic data generation, how Diffusion Models can be adapted for controllable and robust HTG, and whether current evaluation protocols capture downstream utility. Given the success of text-to-image Diffusion Models in generating realistic images from text prompts, their adaptation to handwriting generation offers the potential to produce high-quality, style-aware handwritten data that can directly enhance Handwritten Text Recognition (HTR). The overarching goal of this thesis is to bridge the gap between generation and task-aligned evaluation by showing how controllable diffusion-based HTG, guided by systematic data analysis and evaluated through practical utility, can address these challenges.
The contributions of this thesis are organized along three complementary directions. First, to establish a solid basis for HTG, a comprehensive overview of modern, historical, and synthetic document image datasets and of HTG methods is presented. This results in C1, a systematic overview of dataset resources, and C2, a detailed survey of generative paradigms and evaluation practices in HTG, identifying key data and methodological gaps that motivate the development of diffusion-based models. Second, to overcome the instability and limited variability of adversarial methods, three diffusion-based approaches are proposed: C3 (WordStylist), a latent diffusion model enabling verbatim text and style conditioning; C4 (DiffusionPen), a few-shot extension of WordStylist capable of generalizing to unseen writers through hybrid classification and metric-learning style embeddings; and C5 (Dual Orthogonal Guidance), a sampling-time mechanism that enhances stability while preserving stylistic diversity. These methods demonstrate that diffusion-based models can generate realistic, diverse, and style-consistent handwriting under controllable conditions. Third, recognizing that generative quality should translate into practical utility, C6 introduces a task-aligned evaluation framework that links generation metrics to recognition outcomes. This framework measures content preservation, style preservation, robustness to Out-of-Vocabulary (OOV) content, and variability, providing a practical assessment of whether synthetic handwriting improves DIAR performance. By integrating generation and evaluation, this contribution redefines how HTG success is measured.
In summary, this thesis demonstrates that controllable diffusion-based HTG, grounded in systematic dataset analysis, enabled by robust generative modeling, and evaluated with task-aligned metrics, provides efficient synthetic handwriting pipelines that directly enhance HTR performance. On a broader level, this work bridges the gap between generative modeling and document analysis, setting the stage for future research in task-aware synthesis of document images.
Luleå: Luleå University of Technology, 2025.