Handwritten Text Generation with Diffusion Models: Beyond Visual Quality
Luleå University of Technology, Department of Computer Science, Electrical and Space Engineering, Embedded Internet Systems Lab.
ORCID iD: 0000-0002-9332-3188
2025 (English). Doctoral thesis, monograph (Other academic).
Abstract [en]

Handwritten Text Generation (HTG) has emerged as a promising remedy for the data scarcity that limits the training of Deep Learning (DL) models for Document Image Analysis and Recognition (DIAR) tasks. Current HTG pipelines are mostly based on adversarial training, which can suffer from issues such as mode collapse that limit variability and, in turn, reduce the practical usefulness of the generated data. Moreover, existing evaluation protocols prioritize visual realism over usefulness, obscuring the link between generative quality and downstream task performance.

These limitations raise several questions related to which HTG methods truly benefit DIAR tasks, which real datasets should serve as a foundation for synthetic data generation, how Diffusion Models can be adapted for controllable and robust HTG, and whether current evaluation protocols capture downstream utility. Given the success of text-to-image Diffusion Models in generating realistic images from text prompts, their adaptation for handwriting generation offers the potential to produce high-quality, style-aware handwritten data that can directly enhance Handwritten Text Recognition (HTR). The overarching goal of this thesis is to bridge the gap between generation and task-aligned evaluation by showing how controllable diffusion-based HTG, guided by systematic data analysis and evaluated through practical utility, can address these challenges.
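The text-to-image diffusion paradigm referred to above rests on a simple forward noising process that the learned model inverts. A minimal sketch of that process follows; the schedule values and function names are illustrative defaults for orientation, not the settings used in the thesis.

```python
import math

# Minimal sketch of the DDPM-style forward (noising) process that
# diffusion-based HTG methods build on. Schedule values are common
# illustrative defaults, not the thesis's own hyperparameters.

def linear_beta_schedule(T, beta_start=1e-4, beta_end=0.02):
    """Per-step noise variances beta_t for T diffusion steps."""
    return [beta_start + (beta_end - beta_start) * t / (T - 1) for t in range(T)]

def alphas_cumprod(betas):
    """Cumulative product of alpha_t = 1 - beta_t (how much signal survives)."""
    out, prod = [], 1.0
    for b in betas:
        prod *= (1.0 - b)
        out.append(prod)
    return out

def forward_noise(x0, t, a_bar, eps):
    """Sample x_t ~ q(x_t | x_0) for a single pixel value x0 and noise draw eps."""
    return math.sqrt(a_bar[t]) * x0 + math.sqrt(1.0 - a_bar[t]) * eps
```

At t = 0 almost no noise has been added, while at the final step the sample is nearly pure noise; generation runs the learned denoiser in the opposite direction, conditioned (in the HTG setting) on the target text and writing style.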

The contributions of this thesis are organized along three complementary directions. First, to establish a solid basis for HTG, a comprehensive overview of modern, historical, and synthetic document image datasets and HTG methods is presented. This results in C1, a systematic overview of dataset resources, and C2, a detailed survey of generative paradigms and evaluation practices in HTG, identifying key data and methodological gaps that motivate the development of diffusion-based models. Second, to overcome the instability and limited variability of adversarial methods, three diffusion-based approaches are proposed: C3 (WordStylist), a latent diffusion model enabling verbatim text and style conditioning; C4 (DiffusionPen), a few-shot extension of WordStylist that generalizes to unseen writers through hybrid classification and metric-learning style embeddings; and C5 (Dual Orthogonal Guidance), a sampling-time mechanism that enhances stability while preserving stylistic diversity. Together, these methods demonstrate that diffusion-based models can generate realistic, diverse, and style-consistent handwriting under controllable conditions. Third, recognizing that generative quality should translate into practical utility, C6 introduces a task-aligned evaluation framework that links generation metrics to recognition outcomes. The framework measures content preservation, style preservation, robustness to Out-of-Vocabulary (OOV) content, and variability, providing a practical assessment of whether synthetic handwriting improves DIAR performance. By integrating generation and evaluation, this contribution redefines how HTG success is measured.
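Dual Orthogonal Guidance (C5) is described here only at a high level. One plausible reading of an "orthogonal" combination of guidance signals, sketched purely for illustration (the actual C5 mechanism may well differ), is to project the style-guidance direction onto the subspace orthogonal to the content-guidance direction before combining them, so that raising the style weight cannot cancel the content guidance:

```python
# Illustrative sketch of combining two sampling-time guidance signals.
# This is NOT the thesis's exact Dual Orthogonal Guidance formulation,
# only a generic orthogonal-projection reading of the idea.

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def orthogonalize(g_style, g_content):
    """Strip from g_style its component along g_content."""
    denom = dot(g_content, g_content)
    if denom == 0.0:
        return list(g_style)
    coef = dot(g_style, g_content) / denom
    return [s - coef * c for s, c in zip(g_style, g_content)]

def guided_update(eps_uncond, eps_content, eps_style, w_content=2.0, w_style=1.0):
    """Classifier-free-guidance-style combination with an orthogonalized style term."""
    g_c = [c - u for c, u in zip(eps_content, eps_uncond)]
    g_s = orthogonalize([s - u for s, u in zip(eps_style, eps_uncond)], g_c)
    return [u + w_content * gc + w_style * gs
            for u, gc, gs in zip(eps_uncond, g_c, g_s)]
```

Because the style term is orthogonal to the content term, scaling one guidance weight leaves the other direction's contribution intact, which is one way such a mechanism could trade off stability against stylistic diversity.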

In summary, this thesis demonstrates that controllable diffusion-based HTG, grounded in systematic dataset analysis, enabled by robust generative modeling, and evaluated with task-aligned metrics, provides efficient synthetic handwriting pipelines that directly enhance HTR performance. On a broader level, this work bridges the gap between generative modeling and document analysis, setting the stage for future research in task-aware synthesis of document images.
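The task-aligned evaluation direction ties generation quality to recognition outcomes. A standard building block for the content-preservation side of such an assessment is the Character Error Rate (CER) of an HTR system on the generated images, sketched below in a minimal form; the thesis's actual metric suite (content, style, OOV robustness, variability) is richer than this single measure.

```python
# Minimal Character Error Rate (CER), a standard HTR metric: the
# Levenshtein edit distance between reference and hypothesis strings,
# normalized by the reference length.

def edit_distance(ref, hyp):
    """Levenshtein distance via one-row dynamic programming."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        curr = [i]
        for j, h in enumerate(hyp, 1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]

def cer(reference, hypothesis):
    """Character Error Rate: edit distance normalized by reference length."""
    if not reference:
        return 0.0 if not hypothesis else 1.0
    return edit_distance(reference, hypothesis) / len(reference)
```

A task-aligned protocol in this spirit would compare an HTR model's CER when trained with and without the synthetic data, rather than scoring the generated images on visual realism alone.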

Place, publisher, year, edition, pages
Luleå: Luleå University of Technology, 2025.
Series
Doctoral thesis / Luleå University of Technology, ISSN 1402-1544
Keywords [en]
handwritten text generation, diffusion models
National Category
Computer Vision and Learning Systems
Research subject
Machine Learning
Identifiers
URN: urn:nbn:se:ltu:diva-115100
ISBN: 978-91-8048-923-2 (print)
ISBN: 978-91-8048-924-9 (electronic)
OAI: oai:DiVA.org:ltu-115100
DiVA, id: diva2:2006187
Public defence
2025-12-09, A117, Luleå University of Technology, Luleå, 09:00 (English)
Funder
Swedish Research Council, 363131
Available from: 2025-10-14. Created: 2025-10-13. Last updated: 2025-11-18. Bibliographically approved.

Open Access in DiVA

fulltext (53589 kB)
File information
File name: FULLTEXT02.pdf
File size: 53589 kB
Checksum: SHA-512
972e29015310dd5da36449d8a8f2a30da462eb5be416100fa8457fb520e99013754b4e83ca2158e48be223de55a2b57a774c7028ea386bbefe0b7973f4cadd6b
Type: fulltext. Mimetype: application/pdf.

Authority records

Nikolaidou, Konstantina

