📖 Technical Reference

Book2MD Converter Pipeline

Convert Italian and German books (PDF / EPUB) to structured Markdown, extract bibliographic metadata, evaluate conversion quality, and annotate linguistic structure.


Overview

The pipeline converts books in PDF and EPUB format into structured Markdown, extracts bibliographic and genre metadata, evaluates conversion quality, and produces dependency parsing annotations. Everything is packaged as an installable Python library (pip install -e .), exposing a clean public API and a book2md CLI command usable from any directory.

Project Structure

directory tree
book2md/
├── __init__.py                    # Public API: from book2md import ConverterPipeline, ...
├── base.py                        # Abstract PipelineStep base class
├── config.py                      # Typed configuration dataclasses
├── utils.py                       # Shared utilities
├── pipeline.py                    # ConverterPipeline orchestrator
├── cli.py                         # CLI entry point (book2md command)
├── converters/
│   ├── pdf.py                     # PDF → Markdown via Qwen3-VL
│   ├── epub.py                    # EPUB → Markdown via Qwen3
│   └── text.py                    # Rule-based conversion (no LLM)
├── metadata/
│   └── extractor.py               # Author / title / year / genre extraction
├── parsing/
│   └── parser.py                  # Linguistic annotation with Stanza
└── evaluation/
    └── evaluator.py               # Quality evaluation (NED, BLEU, MarkdownStructureF1)

books/                             # Input: original PDF and EPUB files
output/                            # Output: Markdown + eval pages/chunks
scores/                            # Per-book score JSON files
pyproject.toml                     # Package definition (pip install -e .)
setup.sh                           # System dependency installer

Setup

Local / server

bash
bash setup.sh            # system + Python dependencies
bash setup.sh --with-eval  # also installs evaluation libraries + Page2MDBench
pip install -e .           # installs book2md as a command

Google Colab

python — Colab cell
!bash setup.sh --with-eval
!pip install -e .

setup.sh installs poppler-utils (required by pdf2image) and all Python dependencies from requirements.txt. The --with-eval flag additionally clones Page2MDBench and installs the evaluation libraries (rapidfuzz, sacrebleu, mistune, bert-score). The pip install -e . step registers the book2md command and makes all modules importable from any working directory.


Pipeline Flow

books/
  *.pdf   ── PDFConverter ──▶   output/{stem}/{stem}.md + {stem}.txt
                                eval_pages/{i}.md + {i}.ref.md
  *.epub  ── EPUBConverter ──▶  output/{stem}/{stem}.md + {stem}.txt
                                eval_chunks/{i}.md + {i}.ref.md

output/
  **/eval_pages/  ── QualityEvaluator ──▶   scores/{book}_scores.json
  **/eval_pages/  ── MetadataExtractor ──▶  metadata/metadata.csv
  **/{stem}.md    ── DependencyParser ──▶   parsed/{stem}.conllu

config.py — Typed Configuration

All parameters are organised in typed dataclasses. Module-level singleton instances (pdf_config, epub_config, etc.) serve as defaults throughout the codebase. There are three ways to customise them — no source code editing required for common use cases.

Level 1 — CLI flags

bash
book2md --input /my/books/ --output /results/ convert --pdf
book2md parse --langs it de --format json

Level 2 — Constructor arguments (Python / Colab)

python
from book2md import ConverterPipeline

pipeline = ConverterPipeline(
    input_dir="books/",
    pdf_model_id="Qwen/Qwen3-VL-7B-Instruct",
)

Level 3 — Config override (DPI, token limits, prompts)

python
from book2md.config import pdf_config, epub_config

pdf_config.dpi = 150
pdf_config.max_new_tokens = 2048
epub_config.repetition_penalty = 1.2

Override before instantiating any class; defaults apply otherwise.

Dataclasses

python
PDFConfig      # model_id, dpi, max_new_tokens, repetition_penalty, prompt
EPUBConfig     # model_id, max_chunk_chars, max_new_tokens, repetition_penalty, prompt
MetadataConfig # max_new_tokens, biblio_prompt, genre_prompt
EvalConfig     # n (pages sampled), enable_prefix_caching
ParseConfig    # langs, output_format
PathConfig     # input_dir, output_dir, scores_dir, metadata_csv

Qwen3 is open-weight, available on HuggingFace, with strong Italian and German support. The VL variant is required for PDF because pages are rasterised to PNG, preserving layout, formulas, and images.

Prefix Caching

ENABLE_PREFIX_CACHING = True enables vLLM Automatic Prefix Caching (APC). Since all books in a batch share the same system prompt, vLLM caches it in the KV states and avoids recomputing it per request. All system prompts (BIBLIO_PROMPT, GENRE_PROMPT, PDF_PROMPT) are constants to maximize the cache hit rate.

Stderr filtering

_StderrFilter filters non-fatal protobuf/grpc error messages that vLLM emits on stderr. The underlying incompatibility is pinned at the dependency level (protobuf<4.0 in requirements.txt).


utils.py — Shared Utilities

pil_to_data_url(img)

Encodes a PIL image as a base64 JPEG data URL (data:image/jpeg;base64,...). Images are resized to at most 1024 px on the longest side and compressed at JPEG quality 85 before encoding — reducing payload by ~3–5× versus raw PNG. Used to pass page images to the vision-language model via vLLM's OpenAI-compatible multimodal format.

sample_indices(total, n=20)

Selects n page indices for evaluation using a hybrid strategy:

  1. Guaranteed: the first min(10, n) pages are always included (title page, colophon, metadata).
  2. Stratified: remaining slots are filled by random sampling — ~75% from the body, ~25% from the back.

Pure uniform sampling over-represents the body and ignores final pages (indexes, colophons). The 75/25 ratio empirically covers typographic variability better.
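A sketch of this strategy (the `seed` parameter and the exact body/back boundary are illustrative assumptions, not the library's exact API):

```python
import random

def sample_indices(total, n=20, seed=0):
    """Hybrid page sampling: guaranteed front pages + stratified body/back."""
    rng = random.Random(seed)
    if total <= n:
        return list(range(total))
    guaranteed = list(range(min(10, n)))      # title page, colophon, metadata
    remaining = n - len(guaranteed)
    body_n = round(remaining * 0.75)          # ~75% from the body
    back_n = remaining - body_n               # ~25% from the back
    # Illustrative boundary: last 20% of the book counts as "back".
    split = max(len(guaranteed), int(total * 0.8))
    body = rng.sample(range(len(guaranteed), split),
                      min(body_n, split - len(guaranteed)))
    back = rng.sample(range(split, total), min(back_n, total - split))
    return sorted(set(guaranteed + body + back))
```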

truncate_repetitions(text)

Post-processing guard against LLM repetition loops. Runs two passes:

  1. Line-level: if a non-trivial line (≥25 chars) reappears within 6 lines of its previous occurrence, the text is truncated before the second occurrence.
  2. Inline: if a phrase of ≥40 chars reappears within 400 chars of running text, truncation is applied before the second occurrence. This catches loops inside a single paragraph — common with scanned PDFs where the model generates running text without newlines.

Applied inside _clean() of both converters, as a second layer on top of repetition_penalty in SamplingParams.
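The first, line-level pass might look like this (thresholds mirror the description above; the function name is illustrative, and the inline second pass is analogous but windowed over characters):

```python
def truncate_line_repeats(text, min_len=25, window=6):
    """Cut the text before the second occurrence of any non-trivial line
    that reappears within `window` lines of its previous occurrence."""
    lines = text.splitlines()
    recent = {}  # stripped line -> index of its last occurrence
    for i, line in enumerate(lines):
        stripped = line.strip()
        if len(stripped) >= min_len:
            prev = recent.get(stripped)
            if prev is not None and i - prev <= window:
                return "\n".join(lines[:i])   # truncate before the repeat
            recent[stripped] = i
    return text
```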

md_to_txt(md)

Strips Markdown syntax from a string and returns plain text. Removes heading markers, bold/italic, images, links, blockquotes, list markers, tables, horizontal rules, and footnote declarations. Used by all converters to produce a .txt file alongside every .md output file.
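A regex-based sketch of such a stripper (illustrative rules, not the library's exact implementation):

```python
import re

def md_to_txt(md):
    """Strip common Markdown syntax, returning plain text."""
    text = re.sub(r"!\[[^\]]*\]\([^)]*\)", "", md)                  # images
    text = re.sub(r"\[([^\]]*)\]\([^)]*\)", r"\1", text)            # links -> label
    text = re.sub(r"^#{1,6}\s*", "", text, flags=re.M)              # heading markers
    text = re.sub(r"^\s*(?:[-*+]|\d+\.)\s+", "", text, flags=re.M)  # list markers
    text = re.sub(r"^>\s?", "", text, flags=re.M)                   # blockquotes
    text = re.sub(r"^(?:-{3,}|\*{3,})\s*$", "", text, flags=re.M)   # horizontal rules
    text = re.sub(r"(\*\*|__|\*|_)(.+?)\1", r"\2", text)            # bold/italic
    return text
```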

suppress_worker_stderr()

Intercepts file descriptor 2 at the OS level (not just sys.stderr) during LLM initialization. vLLM worker processes are forked and inherit fd 2 directly, so a Python-level wrapper is not sufficient.


pipeline.py — Pipeline Orchestrator

ConverterPipeline exposes three conversion methods:

| Method | Converter | When to use |
|---|---|---|
| run_simple() | DocumentProcessor (rule-based) | Quick draft, no GPU needed |
| run_pdf_llm() | PDFToMarkdownConverter | PDFs with complex layout |
| run_epub_llm() | EpubToMarkdownConverter | EPUB files |

Resume and idempotency: _already_converted(stem) checks whether output/{stem}/{stem}.md exists. If it does, the book is skipped. This allows the pipeline to be interrupted and resumed without reprocessing already-converted books, which is critical when working with large corpora.
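A minimal sketch of that check (a hypothetical free-function form of the method described above):

```python
from pathlib import Path

def already_converted(output_dir, stem):
    """A book is considered done iff its main Markdown output exists."""
    return (Path(output_dir) / stem / f"{stem}.md").is_file()
```

Because the check is just a filesystem probe, it costs nothing and survives crashes: re-running the pipeline simply skips every book whose `.md` is already on disk.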

CLI

After pip install -e ., all operations are available via the book2md command from any directory. Global flags (--input, --output, etc.) must be placed before the subcommand:

bash
book2md --input books/ --output output/ convert --pdf
book2md --output output/ --scores scores/ evaluate --bertscore
book2md --help          # full option list
book2md convert --help  # subcommand options

Subcommands: convert (flags: --pdf, --epub, --simple), metadata, evaluate (flag: --bertscore), parse (flags: --langs, --format).


converters/text.py — Rule-Based Conversion

DocumentProcessor converts without an LLM using deterministic rules.

PDF

Uses PyMuPDF (fitz) in rawdict mode, which exposes font size and flags per span.

| Condition | Markdown output |
|---|---|
| font size ≥ 22 | # H1 |
| font size ≥ 18 | ## H2 |
| font size ≥ 14 | ### H3 |
| font size ≥ 12 + bold | #### H4 |
| flags & 16 (bold) | **text** |
| flags & 2 (italic) | *text* |
| size < 9, starts with † ‡ § | > [^fn]: ... |
| matches figura/fig/tabella/tab | *caption* |

EPUB

Uses ebooklib to extract HTML chapters and a recursive HTML-to-Markdown walker that handles: headings, bold/italic, code blocks, lists, GFM tables, footnotes (epub:type="footnote"), figures with captions, blockquotes, and callout boxes (detected by CSS class: callout|note|warning|tip).

Limitation: fails on scanned PDFs, complex LaTeX formulas, and multi-column layouts — motivating the LLM-based converters.


converters/pdf.py — PDF to Markdown via LLM

PDFToMarkdownConverter uses Qwen3-VL (vision-language):

  1. Rasterizes each PDF page to PNG at PDF_DPI=300 dpi.
  2. Resizes each image to max 1024 px and re-encodes as JPEG (quality 85) before sending to the model.
  3. Blank pages (<0.1% non-white pixels) are detected and skipped before batch inference.
  4. Builds a batch of multimodal messages and runs inference with LLM.chat() — all pages in one call.
  5. Passes each output through _clean() (strips code fences, truncates repetition loops via truncate_repetitions()).
  6. Saves the full Markdown to {stem}.md and a plain-text version to {stem}.txt (via md_to_txt()).
  7. Saves sampled page pairs to eval_pages/{i}.md and the PyMuPDF text-layer reference to eval_pages/{i}.ref.md.
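The blank-page check in step 3 can be sketched as a pure function over grayscale pixel values (the real converter operates on rasterised PIL images; names and the white threshold are illustrative):

```python
def is_blank_page(pixels, white=250, max_dark_frac=0.001):
    """Treat a page as blank when fewer than 0.1% of its pixels are
    non-white. `pixels` is a flat sequence of grayscale values (0-255)."""
    if not pixels:
        return True
    dark = sum(1 for p in pixels if p < white)
    return dark / len(pixels) < max_dark_frac
```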
💡 Why rasterize? Direct text extraction loses visual structure (columns, tables, equations). Rasterization lets the model see the page exactly as a human reader would.

Batching: vLLM's continuous batching runs parallel inference on all pages in a single forward pass — far more efficient than page-by-page inference.


converters/epub.py — EPUB to Markdown via LLM

EpubToMarkdownConverter uses Qwen3 (text-only):

  1. Iterates chapters in spine order using ebooklib directly (no pypandoc).
  2. Skips TOC/navigation documents (toc, nav, ncx, contents by filename or epub:type="nav" in HTML).
  3. Extracts images to images/ and rewrites <img> src attributes.
  4. Splits each chapter HTML into chunks of at most EPUB_MAX_CHUNK_CHARS=8000 chars at top-level block tags inside <body>.
  5. Runs batch inference on all chunks and joins results with \n\n.
  6. Passes each output through _clean(): strips code fences, removes heading-scheme echoes from the prompt, and truncates repetition loops.
  7. Saves sampled chunks to eval_chunks/{i}.md (LLM output) and {i}.ref.md (HTML converted to Markdown as reference).
💡 Why chunk by HTML tag? Splitting at top-level tags inside <body> ensures each chunk is semantically coherent. Fixed-size windows would cut mid-sentence.
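Given a list of top-level block strings, the greedy packing into chunks of at most 8000 characters might look like this (block extraction via an HTML parser is omitted; the function name is illustrative):

```python
def pack_chunks(blocks, max_chars=8000):
    """Greedily pack whole blocks into chunks of at most `max_chars`
    characters, never splitting a block mid-way."""
    chunks, current = [], ""
    for block in blocks:
        if current and len(current) + len(block) > max_chars:
            chunks.append(current)
            current = ""
        current += block
    if current:
        chunks.append(current)
    return chunks
```

A block longer than `max_chars` becomes its own oversized chunk rather than being split, preserving semantic coherence at the cost of an occasional long input.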


metadata/extractor.py — Metadata Extraction

MetadataExtractor extracts four fields — author, title, year, genre — using zero-shot prompting on the text model.

Page selection in collect_samples()

| Variable | Pages | Used for |
|---|---|---|
| front_files | 0 – 4 | author, title, year (BIBLIO_PROMPT) |
| body_files | 7 – 9 | genre (GENRE_PROMPT) |

All selected pages belong to the guaranteed first-10 set, so they are always available regardless of book length. Pages 7–9 are past the front matter but still early enough to represent the book's tone and style.

Two separate batch calls in run()

python
# Call 1: bibliographic info
biblio_outputs = [o.outputs[0].text for o in self.llm.chat(biblio_dataset, sp)]

# Call 2: genre
genre_outputs  = [o.outputs[0].text for o in self.llm.chat(genre_dataset, sp)]

Keeping the calls separate maximizes prefix cache hit rate: within each batch, all messages share the same constant system prompt.

CSV resume

If output_csv already exists, existing records are loaded, already-processed books are filtered out, and only new records are appended. If nothing is new, the method returns early without rewriting the file (preserving mtime).
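The resume logic can be sketched with the csv module (the helper name and the `filename` key are illustrative):

```python
import csv
import os

def append_new_records(csv_path, records, key="filename"):
    """Append only records whose key is not already in the CSV.
    Returns the number of rows written; touches nothing if zero."""
    existing = set()
    file_exists = os.path.exists(csv_path)
    if file_exists:
        with open(csv_path, newline="", encoding="utf-8") as f:
            existing = {row[key] for row in csv.DictReader(f)}
    new = [r for r in records if r[key] not in existing]
    if not new:
        return 0                    # early return, file mtime preserved
    with open(csv_path, "a", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=list(new[0].keys()))
        if not file_exists:
            writer.writeheader()
        writer.writerows(new)
    return len(new)
```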

_parse_json(raw)

Handles imperfect model output: first attempts json.loads() directly; on failure, recovers by taking the text before the first blank line and appending a missing } if needed.
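A sketch of that fallback (hypothetical name; behaviour as described above):

```python
import json

def parse_json_lenient(raw):
    """Try strict JSON first; on failure, keep the text before the first
    blank line and append a missing closing brace before retrying."""
    try:
        return json.loads(raw)
    except json.JSONDecodeError:
        head = raw.strip().split("\n\n", 1)[0].strip()
        if not head.endswith("}"):
            head += "}"
        return json.loads(head)
```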


evaluation/evaluator.py — Quality Evaluation

QualityEvaluator uses reference-based metrics from Page2MDBench to measure conversion quality. No LLM or GPU is required for evaluation.

Metrics

| Metric | Direction | Description |
|---|---|---|
| NED | lower ↓ | Normalised Edit Distance between reference and prediction |
| BLEU | higher ↑ | n-gram precision score |
| MarkdownStructureF1 | higher ↑ | F1 overlap of structural Markdown elements (headings, lists, tables) |
| BERTScore | higher ↑ | Semantic similarity via BERT embeddings (optional; automatically uses GPU if available, falls back to CPU) |
References

Each metric compares a reference against the LLM-generated Markdown: the PyMuPDF text-layer reference (eval_pages/{i}.ref.md) for PDFs, and the rule-based HTML-to-Markdown conversion (eval_chunks/{i}.ref.md) for EPUBs.

💡 Why rule-based extraction as reference? Rule-based Markdown preserves document structure (headings, paragraphs, tables) without hallucination, making it a reliable ground-truth proxy. Comparing plain text to Markdown would penalise structural metrics unfairly, so the reference is always Markdown.

Setup

Run bash setup.sh --with-eval (see Setup section) — this clones Page2MDBench and installs all evaluation dependencies in one step.

The evaluator imports metrics directly if Page2MDBench is on the Python path, or falls back to inserting the cloned directory into sys.path automatically.

evaluate_pdf(eval_pages_dir, scores_dir)

Reads all {i}.ref.md + {i}.md pairs from eval_pages_dir and computes metrics for each. Requires books to have been converted with the current version of PDFToMarkdownConverter (which saves .ref.md during conversion).

evaluate_epub(eval_chunks_dir, scores_dir)

Reads all {i}.ref.md + {i}.md pairs from eval_chunks_dir and computes metrics. Same scheme as PDF evaluation.

evaluate_all(output_dir, scores_dir)

Iterates over all book folders in output_dir, detects the conversion type from the presence of eval_pages/ (PDF) or eval_chunks/ (EPUB), and calls the matching method.

python
from book2md import QualityEvaluator

evaluator = QualityEvaluator(use_bertscore=False)
evaluator.evaluate_all(output_dir="output/", scores_dir="scores/")
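The per-book type detection could be sketched as follows (hypothetical helper, not the library's API):

```python
from pathlib import Path

def detect_eval_dirs(output_dir):
    """Map each book folder to its conversion type based on which
    eval directory it contains."""
    jobs = []
    for book in sorted(Path(output_dir).iterdir()):
        if (book / "eval_pages").is_dir():
            jobs.append((book.name, "pdf", book / "eval_pages"))
        elif (book / "eval_chunks").is_dir():
            jobs.append((book.name, "epub", book / "eval_chunks"))
    return jobs
```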

Results are saved as scores/{book_name}_scores.json with the structure:

json
{
  "average": {"ned": 0.12, "bleu": 68.4, "structure_f1": 0.91},
  "pages": {"0": {"ned": 0.10, "bleu": 71.2, "structure_f1": 0.93}, ...}
}

Design note: evaluation is fully decoupled from conversion. For PDF, the reference files are saved during conversion; for EPUB, the HTML chunks are sufficient. No access to original PDF/EPUB files is needed at evaluation time.


parsing/parser.py — Linguistic Annotation

DependencyParser applies morphosyntactic dependency analysis to the main Markdown file of each converted book.

Flow

  1. run() finds one .md per book by matching output/{stem}/{stem}.md — eval pages and eval chunks are ignored.
  2. Markdown is stripped to plain text via markdown + BeautifulSoup.
  3. Language is detected automatically from the first 3000 characters using langdetect (DetectorFactory.seed = 0 for reproducibility).
  4. Only the matching Stanza pipeline runs: tokenize + MWT + POS + lemma + depparse + NER.
  5. Output: a single {stem}.conllu / {stem}.json, no language suffix.

CoNLL-U token fields

conllu
id | form | lemma | upos | xpos | feats | head | deprel | deps | misc

head is the index of the governing token (0 = root). deprel is the dependency relation type (e.g. nsubj, obj, root). MISC contains the NER tag in BIO format (e.g. NER=B-PER, NER=I-ORG); non-entity tokens have _.
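For illustration, a single token line can be parsed like this (a minimal sketch; real code should use a dedicated CoNLL-U library, and multiword-token ranges are not handled here):

```python
def parse_conllu_line(line):
    """Parse one CoNLL-U token line into a dict, extracting the NER
    BIO tag from the MISC column when present."""
    fields = ["id", "form", "lemma", "upos", "xpos",
              "feats", "head", "deprel", "deps", "misc"]
    tok = dict(zip(fields, line.rstrip("\n").split("\t")))
    tok["head"] = int(tok["head"])            # 0 = root
    tok["ner"] = None
    for kv in tok["misc"].split("|"):
        if kv.startswith("NER="):
            tok["ner"] = kv[4:]               # e.g. B-PER, I-ORG
    return tok
```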

JSON NER output

Each sentence object contains a "tokens" array (with a per-token "ner" BIO tag) and an "entities" array with span-level entries:

json
{"text": "Eurac Research", "type": "ORG", "start": 12, "end": 26}

Key Architectural Decisions

| Decision | Discarded alternative | Rationale |
|---|---|---|
| vLLM batching for the entire book | Page-by-page inference | Much higher throughput via continuous batching |
| PDF rasterization to PNG | PyMuPDF text extraction | Preserves visual layout, formulas, and complex tables |
| EPUB chunking by HTML tag | Fixed-size text windows | Semantically coherent chunks, no text split mid-sentence |
| Constant system prompts | Prompts with inline variables | Maximizes prefix cache hit rate with vLLM APC |
| Eval pages saved during conversion | Re-reading original files for eval | Decouples stages; evaluation works without the originals |
| Resume based on .md existence | Flags in a database or JSON | Zero overhead; survives crashes and interruptions |
| Two separate LLMs (VL + text) | A single multimodal model | Text model is faster and uses less VRAM for EPUB and metadata |
| Reference-based metrics (Page2MDBench) | LLM-as-judge | No GPU or LLM needed for evaluation; deterministic and reproducible scores |
| Rule-based Markdown as PDF reference (.ref.md) | Plain text extraction or image-only comparison | Preserves document structure so structural metrics (MarkdownStructureF1) are meaningful |
| Save .ref.md during conversion | Extract reference at evaluation time | Original PDF is available during conversion; evaluation is decoupled and needs no original files |
| Automatic language detection per book | Running all pipelines on every book | Avoids redundant computation; one correctly-annotated file per book |

Tests

Tests use pytest's tmp_path fixture for complete isolation (no global state). LLM models are replaced by stubs (FakeLLM) that return fixed JSON responses, so all tests run without a GPU.


References

Sources and papers for the core tools and metrics used in this pipeline.

Inference & Models

Evaluation Metrics

NLP & Parsing

PDF & EPUB Processing