Book2MD Converter Pipeline
Convert Italian and German books (PDF / EPUB) to structured Markdown, extract bibliographic metadata, evaluate conversion quality, and annotate linguistic structure.
Overview
The pipeline converts books in PDF and EPUB format into structured Markdown, extracts
bibliographic and genre metadata, evaluates conversion quality, and produces dependency
parsing annotations. Everything is packaged as an installable Python library (pip install -e .),
exposing a clean public API and a book2md CLI command usable from any directory.
Project Structure
book2md/
├── __init__.py # Public API: from book2md import ConverterPipeline, ...
├── base.py # Abstract PipelineStep base class
├── config.py # Typed configuration dataclasses
├── utils.py # Shared utilities
├── pipeline.py # ConverterPipeline orchestrator
├── cli.py # CLI entry point (book2md command)
├── converters/
│ ├── pdf.py # PDF → Markdown via Qwen3-VL
│ ├── epub.py # EPUB → Markdown via Qwen3
│ └── text.py # Rule-based conversion (no LLM)
├── metadata/
│ └── extractor.py # Author / title / year / genre extraction
├── parsing/
│ └── parser.py # Linguistic annotation with Stanza
└── evaluation/
└── evaluator.py # Quality evaluation (NED, BLEU, MarkdownStructureF1)
books/ # Input: original PDF and EPUB files
output/ # Output: Markdown + eval pages/chunks
scores/ # Per-book score JSON files
pyproject.toml # Package definition (pip install -e .)
setup.sh # System dependency installer
Setup
Local / server
bash setup.sh # system + Python dependencies
bash setup.sh --with-eval # also installs evaluation libraries + Page2MDBench
pip install -e . # installs book2md as a command
Google Colab
!bash setup.sh --with-eval
!pip install -e .
setup.sh installs poppler-utils (required by pdf2image) and all
Python dependencies from requirements.txt. The --with-eval flag
additionally clones Page2MDBench
and installs the evaluation libraries (rapidfuzz, sacrebleu,
mistune, bert-score). The pip install -e . step
registers the book2md command and makes all modules importable from any working directory.
Pipeline Flow
*.pdf  → PDFToMarkdownConverter  → output/{stem}/{stem}.md + {stem}.txt
                                 → eval_pages/{i}.md + {i}.ref.md
*.epub → EpubToMarkdownConverter → output/{stem}/{stem}.md + {stem}.txt
                                 → eval_chunks/{i}.md + {i}.ref.md
**/eval_pages/ → MetadataExtractor → metadata/metadata.csv
**/{stem}.md   → DependencyParser  → parsed/{stem}.conllu
config.py — Typed Configuration
All parameters are organised in typed dataclasses. Module-level singleton instances
(pdf_config, epub_config, etc.) serve as defaults throughout the
codebase. There are three ways to customise them — no source code editing required for
common use cases.
Level 1 — CLI flags
book2md --input /my/books/ --output /results/ convert --pdf
book2md parse --langs it de --format json
Level 2 — Constructor arguments (Python / Colab)
from book2md import ConverterPipeline
pipeline = ConverterPipeline(
input_dir="books/",
pdf_model_id="Qwen/Qwen3-VL-7B-Instruct",
)
Level 3 — Config override (DPI, token limits, prompts)
from book2md.config import pdf_config, epub_config
pdf_config.dpi = 150
pdf_config.max_new_tokens = 2048
epub_config.repetition_penalty = 1.2
Override before instantiating any class; defaults apply otherwise.
Dataclasses
PDFConfig # model_id, dpi, max_new_tokens, repetition_penalty, prompt
EPUBConfig # model_id, max_chunk_chars, max_new_tokens, repetition_penalty, prompt
MetadataConfig # max_new_tokens, biblio_prompt, genre_prompt
EvalConfig # n (pages sampled), enable_prefix_caching
ParseConfig # langs, output_format
PathConfig # input_dir, output_dir, scores_dir, metadata_csv
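A minimal sketch of the dataclass-plus-singleton pattern described above. The field names come from the listing; the default values here are illustrative assumptions, not the package's actual defaults:

```python
from dataclasses import dataclass

@dataclass
class PDFConfig:
    # Field names follow the listing above; defaults are illustrative
    model_id: str = "Qwen/Qwen3-VL-7B-Instruct"
    dpi: int = 300
    max_new_tokens: int = 4096
    repetition_penalty: float = 1.1
    prompt: str = "Convert this page to Markdown."

# Module-level singleton used as the shared default everywhere
pdf_config = PDFConfig()

# Level-3 override: mutate the singleton before instantiating any class
pdf_config.dpi = 150
```

Because every consumer reads the same module-level instance, mutating it once is enough to change the behavior of all later instantiations.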
Why Qwen3? The models are open-weight, available on HuggingFace, and have strong Italian and German support. The VL variant is required for PDF because pages are rasterised to PNG, preserving layout, formulas, and images.
Prefix Caching
ENABLE_PREFIX_CACHING = True enables vLLM Automatic Prefix Caching (APC).
Since all books in a batch share the same system prompt, vLLM caches it in the KV states
and avoids recomputing it per request. All system prompts
(BIBLIO_PROMPT, GENRE_PROMPT, PDF_PROMPT)
are constants to maximize the cache hit rate.
Stderr filtering
_StderrFilter filters non-fatal protobuf/grpc error messages that vLLM
emits on stderr. The underlying incompatibility is pinned at the dependency level
(protobuf<4.0 in requirements.txt).
utils.py — Shared Utilities
pil_to_data_url(img)
Encodes a PIL image as a base64 JPEG data URL (data:image/jpeg;base64,...).
Images are resized to at most 1024 px on the longest side and compressed at JPEG quality 85
before encoding — reducing payload by ~3–5× versus raw PNG. Used to pass page images to the
vision-language model via vLLM's OpenAI-compatible multimodal format.
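The two halves of this utility, the data-URL assembly and the longest-side resize arithmetic, can be sketched without PIL (the actual JPEG encoding is omitted; `jpeg_bytes` stands in for the compressed image):

```python
import base64

def to_data_url(jpeg_bytes: bytes) -> str:
    """Wrap already-encoded JPEG bytes in a base64 data URL."""
    b64 = base64.b64encode(jpeg_bytes).decode("ascii")
    return f"data:image/jpeg;base64,{b64}"

def scaled_size(w: int, h: int, max_side: int = 1024) -> tuple[int, int]:
    """Compute the resized dimensions: at most max_side px on the longest side."""
    if max(w, h) <= max_side:
        return w, h
    scale = max_side / max(w, h)
    return round(w * scale), round(h * scale)
```

In the real utility the resized image would be compressed at JPEG quality 85 before being passed to `to_data_url`.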
sample_indices(total, n=20)
Selects n page indices for evaluation using a hybrid strategy:
- Guaranteed: the first min(10, n) pages are always included (title page, colophon, metadata).
- Stratified: remaining slots are filled by random sampling, roughly 75% from the body and 25% from the back.
Pure uniform sampling over-represents the body and ignores final pages (indexes, colophons). The 75/25 ratio empirically covers typographic variability better.
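The strategy above can be sketched as follows. The exact body/back boundary is not stated in this document, so the 80/20 split of the remaining pages is an assumption, as is the seeded RNG:

```python
import random

def sample_indices(total: int, n: int = 20, seed: int = 0) -> list[int]:
    """Hybrid sampling: guaranteed front pages plus stratified random rest."""
    if total <= n:
        return list(range(total))
    rng = random.Random(seed)
    guaranteed = set(range(min(10, n)))
    remaining = n - len(guaranteed)
    pool = [i for i in range(total) if i not in guaranteed]
    split = int(len(pool) * 0.8)           # assumed body/back boundary
    body, back = pool[:split], pool[split:]
    n_back = max(1, round(remaining * 0.25))
    picked = rng.sample(back, min(n_back, len(back)))      # ~25% from the back
    picked += rng.sample(body, remaining - len(picked))    # rest from the body
    return sorted(guaranteed | set(picked))
```

The guaranteed prefix means short books degrade gracefully: when `total <= n`, every page is evaluated.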
truncate_repetitions(text)
Post-processing guard against LLM repetition loops. Runs two passes:
- Line-level: if a non-trivial line (≥25 chars) reappears within 6 lines of its previous occurrence, the text is truncated before the second occurrence.
- Inline: if a phrase of ≥40 chars reappears within 400 chars of running text, truncation is applied before the second occurrence. This catches loops inside a single paragraph — common with scanned PDFs where the model generates running text without newlines.
Applied inside _clean() of both converters, as a second layer on top of
repetition_penalty in SamplingParams.
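The line-level pass can be sketched like this (function name and exact truncation point are illustrative; the thresholds match the description above):

```python
def truncate_line_repetitions(text: str, min_len: int = 25, window: int = 6) -> str:
    """Truncate before the second occurrence of a non-trivial line
    that reappears within `window` lines of its previous occurrence."""
    lines = text.splitlines()
    recent: dict[str, int] = {}   # stripped line -> index of last occurrence
    for i, line in enumerate(lines):
        key = line.strip()
        if len(key) >= min_len:
            prev = recent.get(key)
            if prev is not None and i - prev <= window:
                return "\n".join(lines[:i])   # cut before the repeat
            recent[key] = i
    return text
```

The inline pass works analogously on phrases of ≥40 chars within a 400-char window of running text.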
md_to_txt(md)
Strips Markdown syntax from a string and returns plain text. Removes heading markers,
bold/italic, images, links, blockquotes, list markers, tables, horizontal rules, and
footnote declarations. Used by all converters to produce a .txt file
alongside every .md output file.
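A reduced sketch of the stripping logic using regular expressions (the real utility handles more constructs, e.g. tables, horizontal rules, and footnote declarations):

```python
import re

def md_to_txt(md: str) -> str:
    """Strip common Markdown syntax, returning plain text."""
    text = re.sub(r"^#{1,6}\s+", "", md, flags=re.M)       # heading markers
    text = re.sub(r"!\[[^\]]*\]\([^)]*\)", "", text)       # images (before links)
    text = re.sub(r"\[([^\]]+)\]\([^)]*\)", r"\1", text)   # links -> link text
    text = re.sub(r"\*\*([^*]+)\*\*|\*([^*]+)\*",          # bold / italic
                  lambda m: m.group(1) or m.group(2), text)
    text = re.sub(r"^>\s?", "", text, flags=re.M)          # blockquotes
    text = re.sub(r"^[-*+]\s+", "", text, flags=re.M)      # list markers
    return text
```

Order matters: images must be removed before links, since the image syntax is a superset of the link syntax.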
suppress_worker_stderr()
Intercepts file descriptor 2 at the OS level (not just sys.stderr) during
LLM initialization. vLLM worker processes are forked and inherit fd 2 directly, so a
Python-level wrapper is not sufficient.
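The fd-level interception can be sketched with `os.dup2` (a context manager here for brevity; the real helper may be structured differently):

```python
import contextlib
import os

@contextlib.contextmanager
def suppress_fd2():
    """Redirect file descriptor 2 to /dev/null at the OS level,
    so forked worker processes inherit the redirection too."""
    saved = os.dup(2)                          # keep a handle to the real stderr
    devnull = os.open(os.devnull, os.O_WRONLY)
    try:
        os.dup2(devnull, 2)                    # fd 2 now points at /dev/null
        yield
    finally:
        os.dup2(saved, 2)                      # restore the original stderr
        os.close(devnull)
        os.close(saved)
```

Reassigning `sys.stderr` would only affect the current interpreter; duplicating over fd 2 affects every process that inherits it.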
pipeline.py — Pipeline Orchestrator
ConverterPipeline exposes three conversion methods:
| Method | Converter | When to use |
|---|---|---|
| run_simple() | DocumentProcessor (rule-based) | Quick draft, no GPU needed |
| run_pdf_llm() | PDFToMarkdownConverter | PDFs with complex layout |
| run_epub_llm() | EpubToMarkdownConverter | EPUB files |
Resume and idempotency: _already_converted(stem) checks
whether output/{stem}/{stem}.md exists. If it does, the book is skipped.
This allows the pipeline to be interrupted and resumed without reprocessing already-converted
books, which is critical when working with large corpora.
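The check itself reduces to a single path existence test, roughly:

```python
from pathlib import Path

def already_converted(output_dir: Path, stem: str) -> bool:
    """Resume support: a book is done iff its final Markdown exists."""
    return (output_dir / stem / f"{stem}.md").exists()
```

Because the sentinel is the final output file itself, a crash mid-conversion leaves no stale flag behind: the book is simply retried on the next run.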
CLI
After pip install -e ., all operations are available via the book2md
command from any directory. Global flags (--input, --output, etc.)
must be placed before the subcommand:
book2md --input books/ --output output/ convert --pdf
book2md --output output/ --scores scores/ evaluate --bertscore
book2md --help # full option list
book2md convert --help # subcommand options
Subcommands: convert (flags: --pdf, --epub, --simple),
metadata, evaluate (flag: --bertscore),
parse (flags: --langs, --format).
converters/text.py — Rule-Based Conversion
DocumentProcessor converts without an LLM using deterministic rules.
Uses PyMuPDF (fitz) in rawdict mode, which exposes font size and flags per span.
| Condition | Markdown output |
|---|---|
| font size ≥ 22 | # H1 |
| font size ≥ 18 | ## H2 |
| font size ≥ 14 | ### H3 |
| font size ≥ 12 + bold | #### H4 |
| flags & 16 (bold) | **text** |
| flags & 2 (italic) | *text* |
| size < 9, starts with †‡§ | > [^fn]: ... |
| matches figura/fig/tabella/tab | *caption* |
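The table above maps naturally onto a cascade of threshold checks per text span. A sketch (function name is illustrative; the flag values 16 for bold and 2 for italic are as given in the table):

```python
BOLD, ITALIC = 16, 2   # PyMuPDF span flag bits, per the table above

def span_to_md(text: str, size: float, flags: int) -> str:
    """Map a text span (font size + style flags) to Markdown."""
    if size >= 22:
        return f"# {text}"
    if size >= 18:
        return f"## {text}"
    if size >= 14:
        return f"### {text}"
    if size >= 12 and flags & BOLD:
        return f"#### {text}"
    if flags & BOLD:
        return f"**{text}**"
    if flags & ITALIC:
        return f"*{text}*"
    return text
```

The thresholds must be checked from largest to smallest, since every H1 span also satisfies the H2 and H3 conditions.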
EPUB
Uses ebooklib to extract HTML chapters and a recursive HTML-to-Markdown walker
that handles: headings, bold/italic, code blocks, lists, GFM tables, footnotes
(epub:type="footnote"), figures with captions, blockquotes, and callout boxes
(detected by CSS class: callout|note|warning|tip).
Limitation: fails on scanned PDFs, complex LaTeX formulas, and multi-column layouts — motivating the LLM-based converters.
converters/pdf.py — PDF to Markdown via LLM
PDFToMarkdownConverter uses Qwen3-VL (vision-language):
- Rasterizes each PDF page to PNG at PDF_DPI=300 dpi.
- Resizes each image to max 1024 px and re-encodes as JPEG (quality 85) before sending to the model.
- Blank pages (<0.1% non-white pixels) are detected and skipped before batch inference.
- Builds a batch of multimodal messages and runs inference with LLM.chat(): all pages in one call.
- Passes each output through _clean() (strips code fences, truncates repetition loops via truncate_repetitions()).
- Saves the full Markdown to {stem}.md and a plain-text version to {stem}.txt (via md_to_txt()).
- Saves sampled page pairs to eval_pages/{i}.md and the PyMuPDF text-layer reference to eval_pages/{i}.ref.md.
Why rasterize? Direct text extraction loses visual structure (columns, tables, equations). Rasterization lets the model see the page exactly as a human reader would.
Batching: vLLM's continuous batching runs parallel inference on all pages in a single forward pass — far more efficient than page-by-page inference.
converters/epub.py — EPUB to Markdown via LLM
EpubToMarkdownConverter uses Qwen3 (text-only):
- Iterates chapters in spine order using ebooklib directly (no pypandoc).
- Skips TOC/navigation documents (toc, nav, ncx, contents by filename, or epub:type="nav" in HTML).
- Extracts images to images/ and rewrites <img> src attributes.
- Splits each chapter's HTML into chunks of at most EPUB_MAX_CHUNK_CHARS=8000 chars at top-level block tags inside <body>.
- Runs batch inference on all chunks and joins results with \n\n.
- Passes each output through _clean(): strips code fences, removes heading-scheme echoes from the prompt, and truncates repetition loops.
- Saves sampled chunks to eval_chunks/{i}.md (LLM output) and {i}.ref.md (HTML converted to Markdown as reference).
Why chunk by HTML tag? Splitting at top-level tags inside <body>
ensures each chunk is semantically coherent. Fixed-size windows would cut mid-sentence.
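Once the top-level blocks are extracted from <body>, packing them into chunks is a greedy accumulation. A sketch (the greedy packing strategy is an assumption; the char limit matches EPUB_MAX_CHUNK_CHARS):

```python
def chunk_blocks(blocks: list[str], max_chars: int = 8000) -> list[str]:
    """Greedily pack whole top-level HTML blocks into chunks of at most
    max_chars characters, never splitting inside a block."""
    chunks: list[str] = []
    current = ""
    for block in blocks:
        if current and len(current) + len(block) > max_chars:
            chunks.append(current)   # flush before overflowing
            current = ""
        current += block
    if current:
        chunks.append(current)
    return chunks
```

A single block larger than `max_chars` still becomes its own (oversized) chunk, which is the price of never cutting mid-block.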
metadata/extractor.py — Metadata Extraction
MetadataExtractor extracts four fields (author, title, year, genre) using
zero-shot prompting on the text model.
Page selection in collect_samples()
| Variable | Pages | Used for |
|---|---|---|
| front_files | 0 – 4 | author, title, year (BIBLIO_PROMPT) |
| body_files | 7 – 9 | genre (GENRE_PROMPT) |
All selected pages belong to the guaranteed first-10 set, so they are always available regardless of book length. Pages 7–9 are past the front matter but still early enough to represent the book's tone and style.
Two separate batch calls in run()
# Call 1: bibliographic info
biblio_outputs = [o.outputs[0].text for o in self.llm.chat(biblio_dataset, sp)]
# Call 2: genre
genre_outputs = [o.outputs[0].text for o in self.llm.chat(genre_dataset, sp)]
Keeping the calls separate maximizes prefix cache hit rate: within each batch, all messages share the same constant system prompt.
CSV resume
If output_csv already exists, existing records are loaded, already-processed
books are filtered out, and only new records are appended. If nothing is new, the method
returns early without rewriting the file (preserving mtime).
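The resume behavior can be sketched like this (function and column names are illustrative, not the package's actual API):

```python
import csv
from pathlib import Path

def append_new(csv_path: Path, records: list[dict], key: str = "book") -> None:
    """Append only records whose key is not already present in the CSV."""
    existing: set[str] = set()
    if csv_path.exists():
        with csv_path.open(newline="", encoding="utf-8") as f:
            existing = {row[key] for row in csv.DictReader(f)}
    new = [r for r in records if r[key] not in existing]
    if not new:
        return  # nothing to do: leave the file (and its mtime) untouched
    write_header = not csv_path.exists()
    with csv_path.open("a", newline="", encoding="utf-8") as f:
        w = csv.DictWriter(f, fieldnames=list(records[0].keys()))
        if write_header:
            w.writeheader()
        w.writerows(new)
```

Appending instead of rewriting means an interrupted metadata run never loses records that were already extracted.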
_parse_json(raw)
Handles imperfect model output: first attempts json.loads() directly; on failure,
recovers by taking the text before the first blank line and appending a missing
} if needed.
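A sketch of that recovery logic (function name is illustrative):

```python
import json

def parse_json_lenient(raw: str) -> dict:
    """Parse model output as JSON; on failure, keep only the text before
    the first blank line and append a closing brace if missing."""
    try:
        return json.loads(raw)
    except json.JSONDecodeError:
        head = raw.strip().split("\n\n", 1)[0].strip()
        if not head.endswith("}"):
            head += "}"
        return json.loads(head)
```

This handles the common failure mode where the model emits valid JSON followed by a blank line and free-form commentary.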
evaluation/evaluator.py — Quality Evaluation
QualityEvaluator uses reference-based metrics from
Page2MDBench
to measure conversion quality. No LLM or GPU is required for evaluation.
Metrics
| Metric | Direction | Description |
|---|---|---|
| NED | lower ↓ | Normalised Edit Distance between reference and prediction |
| BLEU | higher ↑ | n-gram precision score |
| MarkdownStructureF1 | higher ↑ | F1 overlap of structural Markdown elements (headings, lists, tables) |
| BERTScore | higher ↑ | Semantic similarity via BERT embeddings (optional; automatically uses GPU if available, falls back to CPU) |
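As a reference point for the simplest metric above: NED is the Levenshtein distance divided by the longer string's length. The actual implementation uses RapidFuzz; a pure-Python sketch of the definition:

```python
def ned(ref: str, hyp: str) -> float:
    """Normalised Edit Distance: Levenshtein(ref, hyp) / max length, in [0, 1].
    0.0 = identical strings, 1.0 = completely different."""
    if not ref and not hyp:
        return 0.0
    # Standard dynamic-programming Levenshtein, one row at a time
    prev = list(range(len(hyp) + 1))
    for i, a in enumerate(ref, 1):
        cur = [i]
        for j, b in enumerate(hyp, 1):
            cur.append(min(prev[j] + 1,          # deletion
                           cur[j - 1] + 1,       # insertion
                           prev[j - 1] + (a != b)))  # substitution
        prev = cur
    return prev[-1] / max(len(ref), len(hyp))
```

Lower is better, which is why the table marks NED with ↓ while the other metrics use ↑.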
References
Each metric compares a reference against the LLM-generated Markdown:
- PDF: the reference is {i}.ref.md, extracted directly from the PDF text layer via page.get_text("markdown") (PyMuPDF) during conversion, with no image rasterisation involved. Stored in eval_pages/ alongside the LLM output {i}.md.
- EPUB: the reference is {i}.ref.md, produced during conversion by converting each HTML chunk to Markdown via DocumentProcessor._epub_html_to_markdown() and saved to eval_chunks/. Same scheme as PDF.
Why rule-based extraction as reference? Rule-based Markdown preserves document structure (headings, paragraphs, tables) without hallucination, making it a reliable ground-truth proxy. Comparing plain text to Markdown would penalise structural metrics unfairly, so the reference is always Markdown.
Setup
Run bash setup.sh --with-eval (see Setup section) — this
clones Page2MDBench and installs all evaluation dependencies in one step.
The evaluator imports metrics directly if Page2MDBench is on the Python path,
or falls back to inserting the cloned directory into sys.path automatically.
evaluate_pdf(eval_pages_dir, scores_dir)
Reads all {i}.ref.md + {i}.md pairs from eval_pages_dir
and computes metrics for each. Requires books to have been converted with the current version of
PDFToMarkdownConverter (which saves .ref.md during conversion).
evaluate_epub(eval_chunks_dir, scores_dir)
Reads all {i}.ref.md + {i}.md pairs from eval_chunks_dir
and computes metrics. Same scheme as PDF evaluation.
evaluate_all(output_dir, scores_dir)
Iterates over all book folders in output_dir, detects the conversion type
from the presence of eval_pages/ (PDF) or eval_chunks/ (EPUB),
and calls the matching method.
from book2md import QualityEvaluator
evaluator = QualityEvaluator(use_bertscore=False)
evaluator.evaluate_all(output_dir="output/", scores_dir="scores/")
Results are saved as scores/{book_name}_scores.json with the structure:
{
"average": {"ned": 0.12, "bleu": 68.4, "structure_f1": 0.91},
"pages": {"0": {"ned": 0.10, "bleu": 71.2, "structure_f1": 0.93}, ...}
}
Design note: evaluation is fully decoupled from conversion. For PDF, the reference files are saved during conversion; for EPUB, the HTML chunks are sufficient. No access to original PDF/EPUB files is needed at evaluation time.
parsing/parser.py — Linguistic Annotation
DependencyParser applies morphosyntactic dependency analysis to the main
Markdown file of each converted book.
Flow
- run() finds one .md per book by matching output/{stem}/{stem}.md; eval pages and eval chunks are ignored.
- Markdown is stripped to plain text via markdown + BeautifulSoup.
- Language is detected automatically from the first 3000 characters using langdetect (DetectorFactory.seed = 0 for reproducibility).
- Only the matching Stanza pipeline runs: tokenize + MWT + POS + lemma + depparse + NER.
- Output: a single {stem}.conllu / {stem}.json, with no language suffix.
CoNLL-U token fields
id | text | lemma | upos | xpos | feats | head | deprel | deps (unused, _) | MISC
head is the index of the governing token (0 = root).
deprel is the dependency relation type (e.g. nsubj, obj, root).
MISC contains the NER tag in BIO format (e.g. NER=B-PER, NER=I-ORG);
non-entity tokens have _.
JSON NER output
Each sentence object contains a "tokens" array (with a per-token "ner" BIO tag)
and an "entities" array with span-level entries:
{"text": "Eurac Research", "type": "ORG", "start": 12, "end": 26}
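Collapsing per-token BIO tags into those span-level entries works roughly as follows (a sketch over (text, tag) pairs; character offsets are omitted for brevity):

```python
def bio_to_entities(tokens: list[tuple[str, str]]) -> list[dict]:
    """Collapse per-token BIO NER tags into span-level entities."""
    entities: list[dict] = []
    current: dict | None = None
    for text, tag in tokens:
        if tag.startswith("B-"):          # begin a new entity
            if current:
                entities.append(current)
            current = {"text": text, "type": tag[2:]}
        elif tag.startswith("I-") and current and current["type"] == tag[2:]:
            current["text"] += " " + text  # continue the open entity
        else:                              # "O" tag or inconsistent I- tag
            if current:
                entities.append(current)
            current = None
    if current:
        entities.append(current)
    return entities
```

The "start"/"end" fields in the real output would be computed from each token's character offsets in the source text.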
Key Architectural Decisions
| Decision | Discarded alternative | Rationale |
|---|---|---|
| vLLM batching for the entire book | Page-by-page inference | Much higher throughput via continuous batching |
| PDF rasterization to PNG | PyMuPDF text extraction | Preserves visual layout, formulas, and complex tables |
| EPUB chunking by HTML tag | Fixed-size text windows | Semantically coherent chunks, no text split mid-sentence |
| Constant system prompts | Prompts with inline variables | Maximizes prefix cache hit rate with vLLM APC |
| Eval pages saved during conversion | Re-reading original files for eval | Decouples stages; evaluation works without the originals |
| Resume based on .md existence | Flags in a database or JSON | Zero overhead; survives crashes and interruptions |
| Two separate LLMs (VL + text) | A single multimodal model | Text model is faster and uses less VRAM for EPUB and metadata |
| Reference-based metrics (Page2MDBench) for evaluation | LLM-as-judge | No GPU or LLM needed for evaluation; deterministic and reproducible scores |
| Rule-based Markdown as PDF reference (.ref.md) | Plain text extraction or image-only comparison | Preserves document structure so structural metrics (MarkdownStructureF1) are meaningful |
| Save .ref.md during conversion | Extract reference at evaluation time | Original PDF is available during conversion; evaluation is decoupled and needs no original files |
| Automatic language detection per book | Running all pipelines on every book | Avoids redundant computation; one correctly-annotated file per book |
Tests
Tests use pytest's tmp_path fixture for complete isolation (no global state).
LLM models are replaced by stubs (FakeLLM) that return fixed JSON responses,
so all tests run without a GPU.
- test_book_converter.py: tests resume logic (_already_converted) and correct skipping across all three run methods
- test_metadata.py: tests _parse_json (robust parsing), collect_samples (page selection), and run (CSV append/skip behavior)
- test_utils.py: tests sample_indices (first-10 guarantee, stratification) and pil_to_data_url (PNG round-trip)
References
Sources and papers for the core tools and metrics used in this pipeline.
Inference & Models
- vLLM — PagedAttention-based LLM serving engine.
  Kwon et al. (2023). Efficient Memory Management for Large Language Model Serving with PagedAttention. SOSP 2023.
  arxiv.org/abs/2309.06180 · github.com/vllm-project/vllm
- Qwen3-VL — Vision-language model used for PDF page conversion.
  Qwen Team, Alibaba Cloud (2025). Qwen3 Technical Report.
  arxiv.org/abs/2505.09388 · github.com/QwenLM/Qwen3
- Qwen3 (text) — Text-only model used for EPUB conversion and metadata extraction.
  Qwen Team, Alibaba Cloud (2025). Qwen3 Technical Report.
  arxiv.org/abs/2505.09388 · github.com/QwenLM/Qwen3
Evaluation Metrics
- Page2MDBench — Benchmark suite providing NED, BLEU, MarkdownStructureF1, and BERTScore for PDF-to-Markdown evaluation.
  github.com/Hipsterfil998/Page2MDBench
- BLEU — Bilingual Evaluation Understudy; n-gram precision metric for text generation quality.
  Papineni et al. (2002). BLEU: a Method for Automatic Evaluation of Machine Translation. ACL 2002.
  aclanthology.org/P02-1040 · Implementation via sacrebleu (Post, 2018. arxiv.org/abs/1804.08771)
- BERTScore — Semantic similarity metric using contextual BERT embeddings.
  Zhang et al. (2020). BERTScore: Evaluating Text Generation with BERT. ICLR 2020.
  arxiv.org/abs/1904.09675 · github.com/Tiiiger/bert_score
- NED — Normalised Edit Distance; character-level string similarity based on Levenshtein distance, normalised to [0, 1]. Implementation via RapidFuzz.
NLP & Parsing
- Stanza — Neural NLP library for tokenization, POS tagging, lemmatization, and dependency parsing.
  Qi et al. (2020). Stanza: A Python Natural Language Processing Toolkit for Many Human Languages. ACL 2020.
  arxiv.org/abs/2003.07082 · github.com/stanfordnlp/stanza
- langdetect — Language detection library ported from Google's language-detection.
  github.com/Mimino666/langdetect
PDF & EPUB Processing
- PyMuPDF (fitz) — Python bindings for MuPDF; used for PDF rendering and text-layer extraction.
  pymupdf.readthedocs.io · github.com/pymupdf/PyMuPDF
- pdf2image — Converts PDF pages to PIL images via Poppler.
  github.com/Belval/pdf2image
- ebooklib — Python library for reading and writing EPUB2/EPUB3 files.
  github.com/aerkalov/ebooklib
- BeautifulSoup4 — HTML/XML parser used for extracting and cleaning EPUB chapter content.
  crummy.com/software/BeautifulSoup