Book2MD Converter Pipeline
Convert Italian and German books (PDF / EPUB) to structured Markdown, extract bibliographic metadata, evaluate conversion quality, and annotate linguistic structure.
Overview
The pipeline converts books in PDF and EPUB format into structured Markdown, extracts
bibliographic and genre metadata, evaluates conversion quality, and produces dependency
parsing annotations. Everything is packaged as an installable Python library (pip install -e .),
exposing a clean public API and a book2md CLI command usable from any directory.
Project Structure
book2md/
├── __init__.py # Public API: from book2md import ConverterPipeline, ...
├── base.py # Abstract PipelineStep base class
├── config.py # Typed configuration dataclasses
├── utils.py # Shared utilities
├── pipeline.py # ConverterPipeline orchestrator
├── cli.py # CLI entry point (book2md command)
├── converters/
│ ├── pdf.py # PDF → Markdown via Qwen3-VL
│ ├── epub.py # EPUB → Markdown via Qwen3
│ └── text.py # Rule-based conversion (no LLM)
├── metadata/
│ └── extractor.py # Author / title / year / genre extraction
├── parsing/
│ └── parser.py # Linguistic annotation with Stanza
└── evaluation/
└── evaluator.py # Quality evaluation (NED, BLEU, MarkdownStructureF1)
books/ # Input: original PDF and EPUB files
output/ # Output: Markdown + eval pages/chunks
scores/ # Per-book score JSON files
pyproject.toml # Package definition (pip install -e .)
setup.sh # System dependency installer
Setup
Local / server
bash setup.sh # system + Python dependencies
bash setup.sh --with-eval # also installs evaluation libraries + Page2MDBench
pip install -e . # installs book2md as a command
Google Colab
!bash setup.sh --with-eval
!pip install -e .
setup.sh installs poppler-utils (required by pdf2image) and all
Python dependencies from requirements.txt. The --with-eval flag
additionally clones Page2MDBench
and installs the evaluation libraries (rapidfuzz, sacrebleu,
mistune, bert-score). The pip install -e . step
registers the book2md command and makes all modules importable from any working directory.
Pipeline Flow
*.pdf  → PDFToMarkdownConverter  → output/{stem}/{stem}.md + {stem}.txt
                                 → eval_pages/{i}.md + {i}.ref.md
*.epub → EpubToMarkdownConverter → output/{stem}/{stem}.md + {stem}.txt
                                 → eval_chunks/{i}.md + {i}.ref.md
**/eval_pages/ → MetadataExtractor → metadata/metadata.csv
**/{stem}.md   → DependencyParser  → parsed/{stem}.conllu
config.py — Typed Configuration
All parameters are organised in typed dataclasses. Module-level singleton instances
(pdf_config, epub_config, etc.) serve as defaults throughout the
codebase. There are three ways to customise them — no source code editing required for
common use cases.
Level 1 — CLI flags
book2md --input /my/books/ --output /results/ convert --pdf
book2md parse --langs it de --format json
Level 2 — Constructor arguments (Python / Colab)
from book2md import ConverterPipeline
pipeline = ConverterPipeline(
input_dir="books/",
pdf_model_id="Qwen/Qwen3-VL-7B-Instruct",
)
Level 3 — Config override (DPI, token limits, prompts)
from book2md.config import pdf_config, epub_config
pdf_config.dpi = 150
pdf_config.max_new_tokens = 2048
epub_config.repetition_penalty = 1.2
Override before instantiating any class; defaults apply otherwise.
Dataclasses
PDFConfig # model_id, dpi, max_new_tokens, repetition_penalty, prompt
EPUBConfig # model_id, max_chunk_chars, max_new_tokens, repetition_penalty, prompt
MetadataConfig # max_new_tokens, biblio_prompt, genre_prompt
EvalConfig # n (pages sampled), enable_prefix_caching
ParseConfig # langs, output_format
PathConfig # input_dir, output_dir, scores_dir, metadata_csv
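A minimal sketch of the dataclass-plus-singleton pattern described above. The field names come from the listing; the default values here are illustrative assumptions, not the package's actual defaults:

```python
from dataclasses import dataclass

@dataclass
class PDFConfig:
    # Field names follow the listing above; defaults are illustrative
    model_id: str = "Qwen/Qwen3-VL-7B-Instruct"
    dpi: int = 300
    max_new_tokens: int = 4096
    repetition_penalty: float = 1.1
    prompt: str = "Convert this page to Markdown."

# Module-level singleton used as the shared default everywhere
pdf_config = PDFConfig()

# Level-3 override: mutate the singleton before instantiating any class
pdf_config.dpi = 150
```

Because every consumer reads the same module-level instance, mutating it once is enough to change the behavior of all later instantiations.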
Why Qwen3? The models are open-weight, available on HuggingFace, and have strong Italian and German support. The VL variant is required for PDF because pages are rasterised to PNG, preserving layout, formulas, and images.
Prefix Caching
ENABLE_PREFIX_CACHING = True enables vLLM Automatic Prefix Caching (APC).
Since all books in a batch share the same system prompt, vLLM caches it in the KV states
and avoids recomputing it per request. All system prompts
(BIBLIO_PROMPT, GENRE_PROMPT, PDF_PROMPT)
are constants to maximize the cache hit rate.
Stderr filtering
_StderrFilter filters non-fatal protobuf/grpc error messages that vLLM
emits on stderr. The underlying incompatibility is pinned at the dependency level
(protobuf<4.0 in requirements.txt).
utils.py — Shared Utilities
pil_to_data_url(img)
Encodes a PIL image as a base64 JPEG data URL (data:image/jpeg;base64,...).
Images are resized to at most 1024 px on the longest side and compressed at JPEG quality 85
before encoding — reducing payload by ~3–5× versus raw PNG. Used to pass page images to the
vision-language model via vLLM's OpenAI-compatible multimodal format.
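The two halves of this utility, the data-URL assembly and the longest-side resize arithmetic, can be sketched without PIL (the actual JPEG encoding is omitted; `jpeg_bytes` stands in for the compressed image):

```python
import base64

def to_data_url(jpeg_bytes: bytes) -> str:
    """Wrap already-encoded JPEG bytes in a base64 data URL."""
    b64 = base64.b64encode(jpeg_bytes).decode("ascii")
    return f"data:image/jpeg;base64,{b64}"

def scaled_size(w: int, h: int, max_side: int = 1024) -> tuple[int, int]:
    """Compute the resized dimensions: at most max_side px on the longest side."""
    if max(w, h) <= max_side:
        return w, h
    scale = max_side / max(w, h)
    return round(w * scale), round(h * scale)
```

In the real utility the resized image would be compressed at JPEG quality 85 before being passed to `to_data_url`.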
sample_indices(total, n=20)
Selects n page indices for evaluation using a hybrid strategy:
- Guaranteed: the first min(10, n) pages are always included (title page, colophon, metadata).
- Stratified: remaining slots are filled by random sampling, roughly 75% from the body and 25% from the back.
Pure uniform sampling over-represents the body and ignores final pages (indexes, colophons). The 75/25 ratio empirically covers typographic variability better.
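The strategy above can be sketched as follows. The exact body/back boundary is not stated in this document, so the 80/20 split of the remaining pages is an assumption, as is the seeded RNG:

```python
import random

def sample_indices(total: int, n: int = 20, seed: int = 0) -> list[int]:
    """Hybrid sampling: guaranteed front pages plus stratified random rest."""
    if total <= n:
        return list(range(total))
    rng = random.Random(seed)
    guaranteed = set(range(min(10, n)))
    remaining = n - len(guaranteed)
    pool = [i for i in range(total) if i not in guaranteed]
    split = int(len(pool) * 0.8)           # assumed body/back boundary
    body, back = pool[:split], pool[split:]
    n_back = max(1, round(remaining * 0.25))
    picked = rng.sample(back, min(n_back, len(back)))      # ~25% from the back
    picked += rng.sample(body, remaining - len(picked))    # rest from the body
    return sorted(guaranteed | set(picked))
```

The guaranteed prefix means short books degrade gracefully: when `total <= n`, every page is evaluated.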
truncate_repetitions(text)
Post-processing guard against LLM repetition loops. Runs two passes:
- Line-level: if a non-trivial line (≥25 chars) reappears within 6 lines of its previous occurrence, the text is truncated before the second occurrence.
- Inline: if a phrase of ≥40 chars reappears within 400 chars of running text, truncation is applied before the second occurrence. This catches loops inside a single paragraph — common with scanned PDFs where the model generates running text without newlines.
Applied inside _clean() of both converters, as a second layer on top of
repetition_penalty in SamplingParams.
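The line-level pass can be sketched like this (function name and exact truncation point are illustrative; the thresholds match the description above):

```python
def truncate_line_repetitions(text: str, min_len: int = 25, window: int = 6) -> str:
    """Truncate before the second occurrence of a non-trivial line
    that reappears within `window` lines of its previous occurrence."""
    lines = text.splitlines()
    recent: dict[str, int] = {}   # stripped line -> index of last occurrence
    for i, line in enumerate(lines):
        key = line.strip()
        if len(key) >= min_len:
            prev = recent.get(key)
            if prev is not None and i - prev <= window:
                return "\n".join(lines[:i])   # cut before the repeat
            recent[key] = i
    return text
```

The inline pass works analogously on phrases of ≥40 chars within a 400-char window of running text.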
md_to_txt(md)
Strips Markdown syntax from a string and returns plain text. Removes heading markers,
bold/italic, images, links, blockquotes, list markers, tables, horizontal rules, and
footnote declarations. Used by all converters to produce a .txt file
alongside every .md output file.
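A reduced sketch of the stripping logic using regular expressions (the real utility handles more constructs, e.g. tables, horizontal rules, and footnote declarations):

```python
import re

def md_to_txt(md: str) -> str:
    """Strip common Markdown syntax, returning plain text."""
    text = re.sub(r"^#{1,6}\s+", "", md, flags=re.M)       # heading markers
    text = re.sub(r"!\[[^\]]*\]\([^)]*\)", "", text)       # images (before links)
    text = re.sub(r"\[([^\]]+)\]\([^)]*\)", r"\1", text)   # links -> link text
    text = re.sub(r"\*\*([^*]+)\*\*|\*([^*]+)\*",          # bold / italic
                  lambda m: m.group(1) or m.group(2), text)
    text = re.sub(r"^>\s?", "", text, flags=re.M)          # blockquotes
    text = re.sub(r"^[-*+]\s+", "", text, flags=re.M)      # list markers
    return text
```

Order matters: images must be removed before links, since the image syntax is a superset of the link syntax.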
suppress_worker_stderr()
Intercepts file descriptor 2 at the OS level (not just sys.stderr) during
LLM initialization. vLLM worker processes are forked and inherit fd 2 directly, so a
Python-level wrapper is not sufficient.
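The fd-level interception can be sketched with `os.dup2` (a context manager here for brevity; the real helper may be structured differently):

```python
import contextlib
import os

@contextlib.contextmanager
def suppress_fd2():
    """Redirect file descriptor 2 to /dev/null at the OS level,
    so forked worker processes inherit the redirection too."""
    saved = os.dup(2)                          # keep a handle to the real stderr
    devnull = os.open(os.devnull, os.O_WRONLY)
    try:
        os.dup2(devnull, 2)                    # fd 2 now points at /dev/null
        yield
    finally:
        os.dup2(saved, 2)                      # restore the original stderr
        os.close(devnull)
        os.close(saved)
```

Reassigning `sys.stderr` would only affect the current interpreter; duplicating over fd 2 affects every process that inherits it.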
pipeline.py — Pipeline Orchestrator
ConverterPipeline exposes three conversion methods:
| Method | Converter | When to use |
|---|---|---|
| run_simple() | DocumentProcessor (rule-based) | Quick draft, no GPU needed |
| run_pdf_llm() | PDFToMarkdownConverter | PDFs with complex layout |
| run_epub_llm() | EpubToMarkdownConverter | EPUB files |
Resume and idempotency: _already_converted(stem) checks
whether output/{stem}/{stem}.md exists. If it does, the book is skipped.
This allows the pipeline to be interrupted and resumed without reprocessing already-converted
books, which is critical when working with large corpora.
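The check itself reduces to a single path existence test, roughly:

```python
from pathlib import Path

def already_converted(output_dir: Path, stem: str) -> bool:
    """Resume support: a book is done iff its final Markdown exists."""
    return (output_dir / stem / f"{stem}.md").exists()
```

Because the sentinel is the final output file itself, a crash mid-conversion leaves no stale flag behind: the book is simply retried on the next run.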
CLI
After pip install -e ., all operations are available via the book2md
command from any directory. Global flags (--input, --output, etc.)
must be placed before the subcommand:
book2md --input books/ --output output/ convert --pdf
book2md --output output/ --scores scores/ evaluate --bertscore
book2md --help # full option list
book2md convert --help # subcommand options
Subcommands: convert (flags: --pdf, --epub, --simple),
metadata, evaluate (flag: --bertscore),
parse (flags: --langs, --format).
converters/text.py — Rule-Based Conversion
DocumentProcessor converts without an LLM using deterministic rules.
Uses PyMuPDF (fitz) in rawdict mode, which exposes font size and flags per span.
| Condition | Markdown output |
|---|---|
| font size ≥ 22 | # H1 |
| font size ≥ 18 | ## H2 |
| font size ≥ 14 | ### H3 |
| font size ≥ 12 + bold | #### H4 |
| flags & 16 (bold) | **text** |
| flags & 2 (italic) | *text* |
| size < 9, starts with †‡§ | > [^fn]: ... |
| matches figura/fig/tabella/tab | *caption* |
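The table above maps naturally onto a cascade of threshold checks per text span. A sketch (function name is illustrative; the flag values 16 for bold and 2 for italic are as given in the table):

```python
BOLD, ITALIC = 16, 2   # PyMuPDF span flag bits, per the table above

def span_to_md(text: str, size: float, flags: int) -> str:
    """Map a text span (font size + style flags) to Markdown."""
    if size >= 22:
        return f"# {text}"
    if size >= 18:
        return f"## {text}"
    if size >= 14:
        return f"### {text}"
    if size >= 12 and flags & BOLD:
        return f"#### {text}"
    if flags & BOLD:
        return f"**{text}**"
    if flags & ITALIC:
        return f"*{text}*"
    return text
```

The thresholds must be checked from largest to smallest, since every H1 span also satisfies the H2 and H3 conditions.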
EPUB
Uses ebooklib to extract HTML chapters and a recursive HTML-to-Markdown walker
that handles: headings, bold/italic, code blocks, lists, GFM tables, footnotes
(epub:type="footnote"), figures with captions, blockquotes, and callout boxes
(detected by CSS class: callout|note|warning|tip).
Limitation: fails on scanned PDFs, complex LaTeX formulas, and multi-column layouts — motivating the LLM-based converters.
converters/pdf.py — PDF to Markdown via LLM
PDFToMarkdownConverter uses Qwen3-VL (vision-language):
- Rasterizes each PDF page to PNG at PDF_DPI=300 dpi.
- Resizes each image to max 1024 px and re-encodes as JPEG (quality 85) before sending to the model.
- Blank pages (<0.1% non-white pixels) are detected and skipped before batch inference.
- Builds a batch of multimodal messages and runs inference with LLM.chat(): all pages in one call.
- Passes each output through _clean() (strips code fences, truncates repetition loops via truncate_repetitions()).
- Saves the full Markdown to {stem}.md and a plain-text version to {stem}.txt (via md_to_txt()).
- Saves sampled page pairs to eval_pages/{i}.md and the PyMuPDF text-layer reference to eval_pages/{i}.ref.md.
Why rasterize? Direct text extraction loses visual structure (columns, tables, equations). Rasterization lets the model see the page exactly as a human reader would.
Batching: vLLM's continuous batching runs parallel inference on all pages in a single forward pass — far more efficient than page-by-page inference.
converters/epub.py — EPUB to Markdown via LLM
EpubToMarkdownConverter uses Qwen3 (text-only):
- Iterates chapters in spine order using ebooklib directly (no pypandoc).
- Skips TOC/navigation documents (toc, nav, ncx, contents by filename, or epub:type="nav" in HTML).
- Extracts images to images/ and rewrites <img> src attributes.
- Splits each chapter's HTML into chunks of at most EPUB_MAX_CHUNK_CHARS=8000 chars at top-level block tags inside <body>.
- Runs batch inference on all chunks and joins results with \n\n.
- Passes each output through _clean(): strips code fences, removes heading-scheme echoes from the prompt, and truncates repetition loops.
- Saves sampled chunks to eval_chunks/{i}.md (LLM output) and {i}.ref.md (HTML converted to Markdown as reference).
Why chunk by HTML tag? Splitting at top-level tags inside <body>
ensures each chunk is semantically coherent. Fixed-size windows would cut mid-sentence.
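Once the top-level blocks are extracted from <body>, packing them into chunks is a greedy accumulation. A sketch (the greedy packing strategy is an assumption; the char limit matches EPUB_MAX_CHUNK_CHARS):

```python
def chunk_blocks(blocks: list[str], max_chars: int = 8000) -> list[str]:
    """Greedily pack whole top-level HTML blocks into chunks of at most
    max_chars characters, never splitting inside a block."""
    chunks: list[str] = []
    current = ""
    for block in blocks:
        if current and len(current) + len(block) > max_chars:
            chunks.append(current)   # flush before overflowing
            current = ""
        current += block
    if current:
        chunks.append(current)
    return chunks
```

A single block larger than `max_chars` still becomes its own (oversized) chunk, which is the price of never cutting mid-block.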
metadata/extractor.py — Metadata Extraction
MetadataExtractor extracts four fields (author, title, year, genre) using
zero-shot prompting on the text model.
Page selection in collect_samples()
| Variable | Pages | Used for |
|---|---|---|
| front_files | 0 – 4 | author, title, year (BIBLIO_PROMPT) |
| body_files | 7 – 9 | genre (GENRE_PROMPT) |
All selected pages belong to the guaranteed first-10 set, so they are always available regardless of book length. Pages 7–9 are past the front matter but still early enough to represent the book's tone and style.
Two separate batch calls in run()
# Call 1: bibliographic info
biblio_outputs = [o.outputs[0].text for o in self.llm.chat(biblio_dataset, sp)]
# Call 2: genre
genre_outputs = [o.outputs[0].text for o in self.llm.chat(genre_dataset, sp)]
Keeping the calls separate maximizes prefix cache hit rate: within each batch, all messages share the same constant system prompt.
CSV resume
If output_csv already exists, existing records are loaded, already-processed
books are filtered out, and only new records are appended. If nothing is new, the method
returns early without rewriting the file (preserving mtime).
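The resume behavior can be sketched like this (function and column names are illustrative, not the package's actual API):

```python
import csv
from pathlib import Path

def append_new(csv_path: Path, records: list[dict], key: str = "book") -> None:
    """Append only records whose key is not already present in the CSV."""
    existing: set[str] = set()
    if csv_path.exists():
        with csv_path.open(newline="", encoding="utf-8") as f:
            existing = {row[key] for row in csv.DictReader(f)}
    new = [r for r in records if r[key] not in existing]
    if not new:
        return  # nothing to do: leave the file (and its mtime) untouched
    write_header = not csv_path.exists()
    with csv_path.open("a", newline="", encoding="utf-8") as f:
        w = csv.DictWriter(f, fieldnames=list(records[0].keys()))
        if write_header:
            w.writeheader()
        w.writerows(new)
```

Appending instead of rewriting means an interrupted metadata run never loses records that were already extracted.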
_parse_json(raw)
Handles imperfect model output: first attempts json.loads() directly; on failure,
recovers by taking the text before the first blank line and appending a missing
} if needed.
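A sketch of that recovery logic (function name is illustrative):

```python
import json

def parse_json_lenient(raw: str) -> dict:
    """Parse model output as JSON; on failure, keep only the text before
    the first blank line and append a closing brace if missing."""
    try:
        return json.loads(raw)
    except json.JSONDecodeError:
        head = raw.strip().split("\n\n", 1)[0].strip()
        if not head.endswith("}"):
            head += "}"
        return json.loads(head)
```

This handles the common failure mode where the model emits valid JSON followed by a blank line and free-form commentary.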
evaluation/evaluator.py — Quality Evaluation
QualityEvaluator uses reference-based metrics from
Page2MDBench
to measure conversion quality. No LLM or GPU is required for evaluation.
Metrics
| Metric | Direction | Description |
|---|---|---|
| NED | lower ↓ | Normalised Edit Distance between reference and prediction |
| BLEU | higher ↑ | n-gram precision score |
| MarkdownStructureF1 | higher ↑ | F1 overlap of structural Markdown elements (headings, lists, tables) |
| BERTScore | higher ↑ | Semantic similarity via BERT embeddings (optional; automatically uses GPU if available, falls back to CPU) |
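As a reference point for the simplest metric above: NED is the Levenshtein distance divided by the longer string's length. The actual implementation uses RapidFuzz; a pure-Python sketch of the definition:

```python
def ned(ref: str, hyp: str) -> float:
    """Normalised Edit Distance: Levenshtein(ref, hyp) / max length, in [0, 1].
    0.0 = identical strings, 1.0 = completely different."""
    if not ref and not hyp:
        return 0.0
    # Standard dynamic-programming Levenshtein, one row at a time
    prev = list(range(len(hyp) + 1))
    for i, a in enumerate(ref, 1):
        cur = [i]
        for j, b in enumerate(hyp, 1):
            cur.append(min(prev[j] + 1,          # deletion
                           cur[j - 1] + 1,       # insertion
                           prev[j - 1] + (a != b)))  # substitution
        prev = cur
    return prev[-1] / max(len(ref), len(hyp))
```

Lower is better, which is why the table marks NED with ↓ while the other metrics use ↑.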
References
Each metric compares a reference against the LLM-generated Markdown:
- PDF: the reference is {i}.ref.md, extracted directly from the PDF text layer via page.get_text("markdown") (PyMuPDF) during conversion, with no image rasterisation involved. Stored in eval_pages/ alongside the LLM output {i}.md.
- EPUB: the reference is {i}.ref.md, produced during conversion by converting each HTML chunk to Markdown via DocumentProcessor._epub_html_to_markdown() and saved to eval_chunks/. Same scheme as PDF.
Why rule-based extraction as reference? Rule-based Markdown preserves document structure (headings, paragraphs, tables) without hallucination, making it a reliable ground-truth proxy. Comparing plain text to Markdown would penalise structural metrics unfairly, so the reference is always Markdown.
Setup
Run bash setup.sh --with-eval (see Setup section) — this
clones Page2MDBench and installs all evaluation dependencies in one step.
The evaluator imports metrics directly if Page2MDBench is on the Python path,
or falls back to inserting the cloned directory into sys.path automatically.
evaluate_pdf(eval_pages_dir, scores_dir)
Reads all {i}.ref.md + {i}.md pairs from eval_pages_dir
and computes metrics for each. Requires books to have been converted with the current version of
PDFToMarkdownConverter (which saves .ref.md during conversion).
evaluate_epub(eval_chunks_dir, scores_dir)
Reads all {i}.ref.md + {i}.md pairs from eval_chunks_dir
and computes metrics. Same scheme as PDF evaluation.
evaluate_all(output_dir, scores_dir)
Iterates over all book folders in output_dir, detects the conversion type
from the presence of eval_pages/ (PDF) or eval_chunks/ (EPUB),
and calls the matching method.
from book2md import QualityEvaluator
evaluator = QualityEvaluator(use_bertscore=False)
evaluator.evaluate_all(output_dir="output/", scores_dir="scores/")
Results are saved as scores/{book_name}_scores.json with the structure:
{
"average": {"ned": 0.12, "bleu": 68.4, "structure_f1": 0.91},
"pages": {"0": {"ned": 0.10, "bleu": 71.2, "structure_f1": 0.93}, ...}
}
Design note: evaluation is fully decoupled from conversion. For PDF, the reference files are saved during conversion; for EPUB, the HTML chunks are sufficient. No access to original PDF/EPUB files is needed at evaluation time.
parsing/parser.py — Linguistic Annotation
DependencyParser applies morphosyntactic dependency analysis to the main
Markdown file of each converted book.
Flow
- run() finds one .md per book by matching output/{stem}/{stem}.md; eval pages and eval chunks are ignored.
- Markdown is stripped to plain text via markdown + BeautifulSoup.
- Language is detected automatically from the first 3000 characters using langdetect (DetectorFactory.seed = 0 for reproducibility).
- Only the matching Stanza pipeline runs: tokenize + MWT + POS + lemma + depparse + NER.
- Output: a single {stem}.conllu / {stem}.json, with no language suffix.
CoNLL-U token fields
id | text | lemma | upos | xpos | feats | head | deprel | deps (unused, _) | MISC
head is the index of the governing token (0 = root).
deprel is the dependency relation type (e.g. nsubj, obj, root).
MISC contains the NER tag in BIO format (e.g. NER=B-PER, NER=I-ORG);
non-entity tokens have _.
JSON NER output
Each sentence object contains a "tokens" array (with a per-token "ner" BIO tag)
and an "entities" array with span-level entries:
{"text": "Eurac Research", "type": "ORG", "start": 12, "end": 26}
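Collapsing per-token BIO tags into those span-level entries works roughly as follows (a sketch over (text, tag) pairs; character offsets are omitted for brevity):

```python
def bio_to_entities(tokens: list[tuple[str, str]]) -> list[dict]:
    """Collapse per-token BIO NER tags into span-level entities."""
    entities: list[dict] = []
    current: dict | None = None
    for text, tag in tokens:
        if tag.startswith("B-"):          # begin a new entity
            if current:
                entities.append(current)
            current = {"text": text, "type": tag[2:]}
        elif tag.startswith("I-") and current and current["type"] == tag[2:]:
            current["text"] += " " + text  # continue the open entity
        else:                              # "O" tag or inconsistent I- tag
            if current:
                entities.append(current)
            current = None
    if current:
        entities.append(current)
    return entities
```

The "start"/"end" fields in the real output would be computed from each token's character offsets in the source text.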
Key Architectural Decisions
| Decision | Discarded alternative | Rationale |
|---|---|---|
| vLLM batching for the entire book | Page-by-page inference | Much higher throughput via continuous batching |
| PDF rasterization to PNG | PyMuPDF text extraction | Preserves visual layout, formulas, and complex tables |
| EPUB chunking by HTML tag | Fixed-size text windows | Semantically coherent chunks, no text split mid-sentence |
| Constant system prompts | Prompts with inline variables | Maximizes prefix cache hit rate with vLLM APC |
| Eval pages saved during conversion | Re-reading original files for eval | Decouples stages; evaluation works without the originals |
| Resume based on .md existence | Flags in a database or JSON | Zero overhead; survives crashes and interruptions |
| Two separate LLMs (VL + text) | A single multimodal model | Text model is faster and uses less VRAM for EPUB and metadata |
| Reference-based metrics (Page2MDBench) for evaluation | LLM-as-judge | No GPU or LLM needed for evaluation; deterministic and reproducible scores |
| Rule-based Markdown as PDF reference (.ref.md) | Plain text extraction or image-only comparison | Preserves document structure so structural metrics (MarkdownStructureF1) are meaningful |
| Save .ref.md during conversion | Extract reference at evaluation time | Original PDF is available during conversion; evaluation is decoupled and needs no original files |
| Automatic language detection per book | Running all pipelines on every book | Avoids redundant computation; one correctly-annotated file per book |
Tests
Tests use pytest's tmp_path fixture for complete isolation (no global state).
LLM models are replaced by stubs (FakeLLM) that return fixed JSON responses,
so all tests run without a GPU.
- test_book_converter.py: tests resume logic (_already_converted) and correct skipping across all three run methods
- test_metadata.py: tests _parse_json (robust parsing), collect_samples (page selection), and run (CSV append/skip behavior)
- test_utils.py: tests sample_indices (first-10 guarantee, stratification) and pil_to_data_url (PNG round-trip)
References
Sources and papers for the core tools and metrics used in this pipeline.
Inference & Models
- vLLM — PagedAttention-based LLM serving engine.
  Kwon et al. (2023). Efficient Memory Management for Large Language Model Serving with PagedAttention. SOSP 2023.
  arxiv.org/abs/2309.06180 · github.com/vllm-project/vllm
- Qwen3-VL — Vision-language model used for PDF page conversion.
  Qwen Team, Alibaba Cloud (2025). Qwen3 Technical Report.
  arxiv.org/abs/2505.09388 · github.com/QwenLM/Qwen3
- Qwen3 (text) — Text-only model used for EPUB conversion and metadata extraction.
  Qwen Team, Alibaba Cloud (2025). Qwen3 Technical Report.
  arxiv.org/abs/2505.09388 · github.com/QwenLM/Qwen3
Evaluation Metrics
- Page2MDBench — Benchmark suite providing NED, BLEU, MarkdownStructureF1, and BERTScore for PDF-to-Markdown evaluation.
  github.com/Hipsterfil998/Page2MDBench
- BLEU — Bilingual Evaluation Understudy; n-gram precision metric for text generation quality.
  Papineni et al. (2002). BLEU: a Method for Automatic Evaluation of Machine Translation. ACL 2002.
  aclanthology.org/P02-1040 · Implementation via sacrebleu (Post, 2018. arxiv.org/abs/1804.08771)
- BERTScore — Semantic similarity metric using contextual BERT embeddings.
  Zhang et al. (2020). BERTScore: Evaluating Text Generation with BERT. ICLR 2020.
  arxiv.org/abs/1904.09675 · github.com/Tiiiger/bert_score
- NED — Normalised Edit Distance; character-level string similarity based on Levenshtein distance, normalised to [0, 1]. Implementation via RapidFuzz.
NLP & Parsing
- Stanza — Neural NLP library for tokenization, POS tagging, lemmatization, and dependency parsing.
  Qi et al. (2020). Stanza: A Python Natural Language Processing Toolkit for Many Human Languages. ACL 2020.
  arxiv.org/abs/2003.07082 · github.com/stanfordnlp/stanza
- langdetect — Language detection library ported from Google's language-detection.
  github.com/Mimino666/langdetect
PDF & EPUB Processing
- PyMuPDF (fitz) — Python bindings for MuPDF; used for PDF rendering and text-layer extraction.
  pymupdf.readthedocs.io · github.com/pymupdf/PyMuPDF
- pdf2image — Converts PDF pages to PIL images via Poppler.
  github.com/Belval/pdf2image
- ebooklib — Python library for reading and writing EPUB2/EPUB3 files.
  github.com/aerkalov/ebooklib
- BeautifulSoup4 — HTML/XML parser used for extracting and cleaning EPUB chapter content.
  crummy.com/software/BeautifulSoup