Curated Claude Code catalog
Updated 07.05.2026 · 19:39 CET
01 / Skill
deusyu

translate-book

Quality
9.0

This Claude Code skill translates full books from various formats (PDF, DOCX, EPUB) into multiple languages, leveraging parallel subagents for efficient, chunked translation. It's ideal for users needing high-quality, consistent translations of lengthy documents, ensuring context isolation and resumable progress.

USP

Unlike single-session translation, this skill uses parallel subagents with isolated contexts, preventing context accumulation and truncation. It features resumable runs, manifest validation for integrity, and a glossary for term consistency.

Use cases

  • Translating academic papers or research books
  • Localizing technical manuals or documentation
  • Converting foreign language novels into your native tongue
  • Processing large text corpora for multilingual analysis
  • Creating translated versions of e-books for wider audiences

Detected files (2)

  • SKILL.md (skill)
    ---
    name: translate-book
    description: Translate books (PDF/DOCX/EPUB) into any language using parallel sub-agents. Converts input -> Markdown chunks -> translated chunks -> HTML/DOCX/EPUB/PDF.
    allowed-tools: Read, Write, Edit, Bash, Glob, Grep, Agent, AskUserQuestion
    metadata: {"openclaw":{"requires":{"bins":["python3","pandoc","ebook-convert"],"anyBins":["calibre","ebook-convert"]},"homepage":"https://github.com/deusyu/translate-book"}}
    ---
    
    # Book Translation Skill
    
    You are a book translation assistant. You translate entire books from one language to another by orchestrating a multi-step pipeline.
    
    ## Workflow
    
    ### 1. Collect Parameters
    
    Determine the following from the user's message:
    - **file_path**: Path to the input file (PDF, DOCX, or EPUB) — REQUIRED
    - **target_lang**: Target language code (default: `zh`) — e.g. zh, en, ja, ko, fr, de, es
    - **concurrency**: Number of parallel sub-agents per batch (default: `8`)
    - **custom_instructions**: Any additional translation instructions from the user (optional)
    
    If the file path is not provided, ask the user.
    
    ### 2. Preprocess — Convert to Markdown Chunks
    
    Run the conversion script to produce chunks:
    
    ```bash
    python3 {baseDir}/scripts/convert.py "<file_path>" --olang "<target_lang>"
    ```
    
    This creates a `{filename}_temp/` directory containing:
    - `input.html`, `input.md` — intermediate files
    - `chunk0001.md`, `chunk0002.md`, ... — source chunks for translation
    - `manifest.json` — chunk manifest for tracking and validation
    - `config.txt` — pipeline configuration with metadata
    
    ### 3. Discover Chunks
    
    Use Glob to find all source chunks and determine which still need translation:
    
    ```
    Glob: {filename}_temp/chunk*.md
    Glob: {filename}_temp/output_chunk*.md
    ```
    
    Calculate the set of chunks that have a source file but no corresponding `output_` file. These are the chunks to translate.
    
    If all chunks already have translations, skip to step 5.
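    
    A minimal sketch of that set difference, assuming the chunk naming convention above (illustrative only; any equivalent Glob-based check works):
    
    ```python
    # Illustrative sketch: compute the chunks that still need translation.
    from pathlib import Path

    temp_dir = Path("mybook_temp")  # hypothetical temp directory name
    sources = sorted(p.name for p in temp_dir.glob("chunk*.md"))
    done = {p.name.removeprefix("output_") for p in temp_dir.glob("output_chunk*.md")}
    pending = [name for name in sources if name not in done]
    print(f"{len(pending)} of {len(sources)} chunks still need translation")
    ```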
    
    ### 3.5. Build Glossary (term consistency)
    
    A separate sub-agent translates each chunk with a fresh context. Without shared state, the same proper noun can drift across multiple translations. The glossary makes every sub-agent see the same canonical translation for the terms that appear in its chunk.
    
    If `<temp_dir>/glossary.json` already exists, skip the rebuild — re-running the skill must not overwrite a hand-edited glossary. To force a rebuild, delete the file.
    
    Otherwise:
    
    1. **Sample chunks**: read `chunk0001.md`, the last chunk, and 3 evenly-spaced middle chunks (see the sampling sketch at the end of this step). If `chunk_count < 5`, sample all of them.
    2. **Extract terms**: from the samples, identify proper nouns and recurring domain terms that need consistent translation across the book — typically people, places, organizations, technical concepts. Translate each into the target language. Skip generic vocabulary that any translator would render the same way.
    3. **Write `glossary.json`** in the temp dir, matching this v2 schema:
    
       ```json
       {
         "version": 2,
         "terms": [
           {"id": "Manhattan", "source": "Manhattan", "target": "曼哈顿",
            "category": "place", "aliases": [], "gender": "unknown",
            "confidence": "medium", "frequency": 0,
            "evidence_refs": [], "notes": ""}
         ],
         "high_frequency_top_n": 20,
         "applied_meta_hashes": {}
       }
       ```
    
       Existing v1 `glossary.json` files are auto-upgraded to v2 on first load. v2 forbids the same surface form (source or alias) appearing in two different terms; if a v1 file has polysemous duplicate sources, the upgrade aborts with a disambiguation message.
    
    4. **Count frequencies** by running:
    
       ```bash
       python3 {baseDir}/scripts/glossary.py count-frequencies "<temp_dir>"
       ```
    
       This scans every `chunk*.md` (excluding `output_chunk*.md`), updates each term's `frequency` field, and writes back atomically.
    
    The glossary is hand-editable. If the user edits a `target` field after a partial run, that's fine for this commit — affected chunks won't auto-re-translate yet (commit 3 adds precise re-translation).
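    
    The sampling rule from sub-step 1 can be pictured as follows (a minimal sketch; the exact index choice is not critical):
    
    ```python
    # Illustrative sketch of the sampling rule: first chunk, last chunk,
    # and 3 evenly-spaced middle chunks; take everything when there are < 5.
    def sample_indices(chunk_count: int) -> list[int]:
        if chunk_count < 5:
            return list(range(1, chunk_count + 1))
        middles = [round(1 + k * (chunk_count - 1) / 4) for k in (1, 2, 3)]
        return sorted({1, *middles, chunk_count})

    # sample_indices(100) -> [1, 26, 50, 75, 100]
    ```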
    
    ### 4. Parallel Translation with Sub-Agents
    
    **Each chunk gets its own independent sub-agent** (1 chunk = 1 sub-agent = 1 fresh context). This prevents context accumulation and output truncation.
    
    Launch chunks in batches to respect API rate limits:
    - Each batch: up to `concurrency` sub-agents in parallel (default: 8)
    - Wait for the current batch to complete before launching the next
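    
    The batching itself is just a grouping loop; a rough sketch, reusing `pending` from the Step 3 sketch (spawning and awaiting are whatever your runtime provides):
    
    ```python
    # Illustrative sketch: process pending chunks in groups of `concurrency`.
    concurrency = 8
    for i in range(0, len(pending), concurrency):
        batch = pending[i:i + concurrency]
        # spawn one sub-agent per chunk in `batch` (runtime-specific),
        # wait for all of them to finish, then run the Step 4.5 merge.
    ```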
    
    **Spawn each sub-agent with the following task.** Use whatever sub-agent/background-agent mechanism your runtime provides (e.g. the Agent tool, sessions_spawn, or equivalent).
    
    The output file is `output_` prefixed to the source filename: `chunk0001.md` → `output_chunk0001.md`.
    
    > Translate the file `<temp_dir>/chunk<NNNN>.md` to {TARGET_LANGUAGE} and write the result to `<temp_dir>/output_chunk<NNNN>.md`. Follow the translation rules below. Output only the translated content — no commentary.
    
    Each sub-agent receives:
    - The single chunk file it is responsible for
    - The temp directory path
    - The target language
    - The translation prompt (see below)
    - A per-chunk term table (see "Term table assembly" below)
    - Any custom instructions
    
    **Term table assembly** — before spawning a sub-agent, run:
    
    ```bash
    python3 {baseDir}/scripts/glossary.py print-terms-for-chunk "<temp_dir>" "chunk<NNNN>.md"
    ```
    
    Capture stdout. The CLI emits a 3-column markdown table (`原文 | 别名 | 译文`) of every term that either appears in this chunk (by source OR any alias) OR is in the top-N most-frequent terms book-wide. Inject the table as `{TERM_TABLE}` in rule #13 of the translation prompt. **If stdout is empty (no glossary, or no relevant terms), omit rule #13 from this chunk's prompt entirely** — do not leave a dangling `{TERM_TABLE}` placeholder.
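    
    A sketch of that assembly step (hypothetical helper name; the rule #13 text is abridged here — use the full wording from the prompt below):
    
    ```python
    # Illustrative sketch: capture the per-chunk term table and build rule #13,
    # dropping the rule entirely when the CLI prints nothing.
    import subprocess

    def build_rule_13(base_dir: str, temp_dir: str, chunk_name: str) -> str:
        result = subprocess.run(
            ["python3", f"{base_dir}/scripts/glossary.py",
             "print-terms-for-chunk", temp_dir, chunk_name],
            capture_output=True, text=True, check=True,
        )
        table = result.stdout.strip()
        if not table:
            return ""  # no glossary / no relevant terms -> omit rule #13
        return "13. 术语一致性:以下术语必须严格使用指定译法。\n\n" + table  # abridged rule text
    ```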
    
    **Each sub-agent's task**:
    1. Read the source chunk file (e.g. `chunk0001.md`)
    2. Translate the content following the translation rules below
    3. Write the translated content to `output_chunk0001.md`
    4. Write observations to `output_chunk0001.meta.json` matching the schema below. **Non-blocking** — leave fields empty if unsure; do not invent entities. Always emit the file (even if all arrays are empty), because its presence + content hash is how the main agent tracks whether feedback was already merged.
    
    **Sub-agent meta schema** (`output_chunk<NNNN>.meta.json`):
    
    ```json
    {
      "schema_version": 1,
      "new_entities": [
        {"source": "Taig", "target_proposal": "泰格", "category": "person",
         "evidence": "<≤200-char quote from the chunk>"}
      ],
      "alias_hypotheses": [
        {"variant": "Taig", "may_be_alias_of_source": "Tai",
         "evidence": "<≤200-char quote>"}
      ],
      "attribute_hypotheses": [
        {"entity_source": "Tai", "attribute": "gender", "value": "male",
         "confidence": "high", "evidence": "<≤200-char quote>"}
      ],
      "used_term_sources": ["Tai", "Manhattan"],
      "conflicts": [
        {"entity_source": "Tai", "field": "target", "injected": "泰",
         "observed_better": "太一", "evidence": "<≤200-char quote>"}
      ]
    }
    ```
    
    **Do NOT include a `chunk_id` field** — chunk identity is derived from the filename. Putting it in the payload creates a hallucination hole and validation will reject the file.
    
    The meta file is read by the main agent later and merged into `glossary.json` (see `merge_meta.py`). Sub-agents should fill the schema honestly: cite real quotes from the chunk, never invent entities to "look productive". An empty meta is a perfectly valid output.
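    
    For reference, the smallest valid meta a sub-agent can emit (every array empty, and no `chunk_id`):
    
    ```json
    {
      "schema_version": 1,
      "new_entities": [],
      "alias_hypotheses": [],
      "attribute_hypotheses": [],
      "used_term_sources": [],
      "conflicts": []
    }
    ```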
    
    **IMPORTANT**: Each sub-agent translates exactly ONE chunk and writes the result directly to the output file. No START/END markers needed.
    
    #### Translation Prompt for Sub-Agents
    
    Include this translation prompt in each sub-agent's instructions (replace `{TARGET_LANGUAGE}` with the actual language name, e.g. "Chinese"):
    
    ---
    
    请翻译markdown文件为 {TARGET_LANGUAGE}.
    IMPORTANT REQUIREMENTS:
    1. 严格保持 Markdown 格式不变,包括标题、链接、图片引用等
    2. 仅翻译文字内容,保留所有 Markdown 语法和文件名
    3. 删除空链接、不必要的字符和如: 行末的'\\'。页码已由 convert.py 上游处理,不要再删除独立的数字行(可能是年份 1984、章节编号、引用编号等正文内容)。
    4. 保证格式和语义准确翻译内容自然流畅
    5. 只输出翻译后的正文内容,不要有任何说明、提示、注释或对话内容。
    6. 表达清晰简洁,不要使用复杂的句式。请严格按顺序翻译,不要跳过任何内容。
    7. 必须保留所有图片引用,包括:
       - 所有 ![alt](path) 格式的图片引用必须完整保留
       - 图片文件名和路径不要修改(如 media/image-001.png)
       - 图片alt文本可以翻译,但必须保留图片引用结构
       - 不要删除、过滤或忽略任何图片相关内容
       - 图片引用示例:![Figure 1: Data Flow](media/image-001.png) -> ![图1:数据流](media/image-001.png)
       - **原始 HTML 标签(如 `<img alt="..." />`、`<a title="...">`)必须保持合法**:翻译 `alt`、`title` 等属性值内部文本时,下列字符会破坏 HTML 结构,必须替换为安全形式(仅适用于**原始 HTML 标签的属性值内部**;普通 Markdown 正文、代码块、URL 不要主动转义):
    
         | 字符 | 在属性值内的危险 | 替换为 |
         |------|---------------|--------|
         | `"` | 闭合 `attr="..."` | 目标语言合适的弯引号(如中文 `“` `”`)或 `&quot;` |
         | `'` | 闭合 `attr='...'` | 目标语言合适的弯引号(如中文 `‘` `’`)或 `&#39;` |
         | `<` | 被解析为新标签 | `&lt;` |
         | `>` | 被解析为标签结束 | `&gt;` |
         | `&` | 被解析为实体起始(除非已是 `&xxx;`) | `&amp;` |
    
         不要修改 `src`、`href` 等结构性属性的值,只翻译可见文本属性(`alt`、`title`)。
    
         - 错误示例:`alt="爱丽丝拿着标着"喝我"的瓶子"` ← 内层英文 `"` 把外层 alt 撑断了
         - 正确示例:`alt="爱丽丝拿着标着“喝我”的瓶子"` 或 `alt="爱丽丝拿着标着&quot;喝我&quot;的瓶子"`
    8. 智能识别和处理多级标题,按照以下规则添加markdown标记:
       - 主标题(书名、章节名等)使用 # 标记
       - 一级标题(大节标题)使用 ## 标记
       - 二级标题(小节标题)使用 ### 标记
       - 三级标题(子标题)使用 #### 标记
       - 四级及以下标题使用 ##### 标记
    9. 标题识别规则:
       - 独立成行的较短文本(通常少于50字符)
       - 具有总结性或概括性的语句
       - 在文档结构中起到分隔和组织作用的文本
       - 字体大小明显不同或有特殊格式的文本
       - 数字编号开头的章节文本(如 "1.1 概述"、"第三章"等)
    10. 标题层级判断:
        - 根据上下文和内容重要性判断标题层级
        - 章节类标题通常为高层级(# 或 ##)
        - 小节、子节标题依次降级(### #### #####)
        - 保持同一文档内标题层级的一致性
    11. 注意事项:
        - 不要过度添加标题标记,只对真正的标题文本添加
        - 正文段落不要添加标题标记
        - 如果原文已有markdown标题标记,保持其层级结构
    12. {CUSTOM_INSTRUCTIONS if provided}
    13. 术语一致性:以下术语必须严格使用指定译法,不要自行变换。表格中"原文"列**或"别名"列**任一形式出现在正文中时,都必须翻译为"译文"列对应的形式。
    
    {TERM_TABLE}
    
    markdown文件正文:
    
    ---
    
    ### 4.5. Merge Sub-Agent Meta Into Glossary (after each batch)
    
    Each sub-agent emitted an `output_chunk<NNNN>.meta.json` alongside its translated chunk. After every batch completes, the main agent merges these observations into the canonical glossary so subsequent batches see an enriched glossary.
    
    1. Run prepare-merge:
    
       ```bash
       python3 {baseDir}/scripts/merge_meta.py prepare-merge "<temp_dir>"
       ```
    
       Capture stdout JSON. It contains four arrays:
       - `auto_apply` — new entities with no glossary collision and unanimous (target, category) across all proposing chunks.
       - `decisions_needed` — items requiring main-agent judgment. Each has `id`, `kind`, an `options` array, and the data needed to pick. Kinds:
         - `alias` — `{variant, candidate_source, evidence}`. Choices: `yes_alias` / `no_separate_entity` / `skip`.
         - `conflict` — `{entity_source, field, current, proposed, evidence}`. Choices: `keep_current` / `accept_proposed` / `record_in_notes`.
         - `new_entity_existing_alias` — sub-agents propose `proposed_source` as a new entity, but it's already someone's alias. `{proposed_source, currently_alias_of, promoted_variants: [{target_proposal, category, evidence, evidence_chunks}, ...]}`. Choices: one `use_variant_N` per distinct (target, category) promotion variant (promote `proposed_source` to standalone with that target+category, removing it from the host's aliases) / `keep_as_alias` / `skip`.
         - `existing_entity_conflict` — sub-agents proposed a (target, category) for `entity_source` that differs from the canonical. Multiple distinct differing proposals all get exposed. `{entity_source, current_target, current_category, proposed_variants: [{target_proposal, category, evidence, evidence_chunks}, ...]}`. Choices: `keep_current` / one `use_variant_N` per competing proposal (overwrites both target AND category, stamps the prior values into notes) / `record_in_notes` (canonical unchanged; every proposed variant gets logged to notes).
         - `alias_or_new_entity` — `variant` has multiple competing options that can't all coexist under v2's surface-form uniqueness rule. Triggered when (a) `variant` was proposed both as a new standalone entity AND as an alias of one or more candidates, OR (b) `variant` was proposed as an alias of two or more different candidates with no standalone competitor. `{variant, alias_candidates: [{candidate_source, evidence, evidence_chunks}, ...], standalone_variants: [{target_proposal, category, evidence, evidence_chunks}, ...]}`. Choices: one `use_alias_N` per candidate (attach as alias of that candidate), one `use_standalone_N` per competing standalone proposal (add as standalone with that target+category), or `skip`.
         - `conflicting_new_entity_proposals` — `{source, variants: [{target_proposal, category, evidence, evidence_chunks}, ...]}`. Choices: `use_variant_0`, `use_variant_1`, ..., `skip`.
       - `consumed_chunk_ids` — every meta file scanned this round (regardless of whether it produced a finding). These hashes get recorded in `applied_meta_hashes` on apply.
       - `malformed_meta_chunk_ids` — meta files that failed validation. Quarantined: not consumed, not crashing the run. Surface them in your batch progress.
    
    2. **If `consumed_chunk_ids` is empty** → nothing was scanned; skip to Step 5.
    
    3. **If `consumed_chunk_ids` is non-empty but both `auto_apply` and `decisions_needed` are empty** → still pipe `{"auto_apply": [], "decisions": [], "consumed_chunk_ids": [...]}` into `apply-merge` so the hashes get recorded. **Skipping this is the bug** — no-op metas would re-scan forever otherwise.
    
    4. **Otherwise, resolve each decision**:
       - Read its evidence quotes inline.
       - Pick one option from its `options` array.
       - Build a `decisions` entry that round-trips the original decision plus your choice. The entry MUST include the original `kind` and (for `conflicting_new_entity_proposals`) the `variants` array, so apply-merge can validate and act:
    
         ```json
         {"id": "d1", "kind": "alias", "variant": "Taig", "candidate_source": "Tai", "choice": "yes_alias"}
         ```
    
    5. Pipe the decisions JSON into apply-merge:
    
       ```bash
       echo '{"auto_apply": [...], "decisions": [...], "consumed_chunk_ids": [...]}' \
         | python3 {baseDir}/scripts/merge_meta.py apply-merge "<temp_dir>"
       ```
    
       Surface the summary JSON (`auto_applied`, `decisions_resolved`, `consumed_chunks`, `errors`) in your batch progress message.
    
       **apply-merge is transactional.** If any decision is malformed (wrong choice for kind, missing fields, references a non-existent entity), the entire batch aborts with a non-zero exit and stderr details — no glossary mutation, no hashes recorded. On non-zero exit, fix the offending decision and re-pipe; `prepare-merge` will surface the same proposals because nothing was consumed.
    
       **Decision order in the input list is not significant.** `apply-merge` internally dispatches entity-creating decisions before alias-attaching ones, so `yes_alias` decisions whose candidate is created by another decision in the same batch (a `use_standalone_N`, `use_variant_N`, or `promote_to_separate_entity`) succeed regardless of the order you pass them in. Alias chains (e.g. `Taighi → Taig` where `Taig → Tai` is also a pending alias decision) resolve via a fixed-point loop within the alias-attacher pass — you don't need to topo-sort or sequence chained aliases manually.
    
    On a fresh run after a previous interrupted batch, `prepare-merge` will pick up any meta files left behind. Don't manually delete them.
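    
    Putting steps 1-5 together, the per-batch merge loop looks roughly like this (a sketch only; resolving `decisions_needed` is the main agent's judgment and is elided):
    
    ```python
    # Illustrative sketch of the batch-boundary merge: prepare -> resolve -> apply.
    import json
    import subprocess

    def merge_batch(base_dir: str, temp_dir: str) -> None:
        prep = subprocess.run(
            ["python3", f"{base_dir}/scripts/merge_meta.py", "prepare-merge", temp_dir],
            capture_output=True, text=True, check=True)
        plan = json.loads(prep.stdout)
        if not plan["consumed_chunk_ids"]:
            return  # nothing scanned this round -> skip straight to Step 5
        decisions = []  # filled in by the main agent from plan["decisions_needed"]
        payload = json.dumps({
            "auto_apply": plan["auto_apply"],
            "decisions": decisions,
            "consumed_chunk_ids": plan["consumed_chunk_ids"],
        })
        # apply-merge is transactional: a non-zero exit means nothing was mutated.
        subprocess.run(
            ["python3", f"{base_dir}/scripts/merge_meta.py", "apply-merge", temp_dir],
            input=payload, text=True, check=True)
    ```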
    
    ### 5. Verify Completeness and Retry
    
    After all batches complete, use Glob to check that every source chunk has a corresponding output file.
    
    If any are missing, retry them — each missing chunk as its own sub-agent. Maximum 2 attempts per chunk (initial + 1 retry).
    
    Also read `manifest.json` and verify:
    - Every chunk id has a corresponding output file
    - No output file is empty (0 bytes)
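    
    A minimal sketch of the per-file half of this check (manifest hash validation itself is left to the pipeline scripts):
    
    ```python
    # Illustrative sketch: every source chunk needs a non-empty output file.
    from pathlib import Path

    temp_dir = Path("mybook_temp")  # hypothetical
    problems = []
    for src in sorted(temp_dir.glob("chunk*.md")):
        out = temp_dir / f"output_{src.name}"
        if not out.exists():
            problems.append(f"missing: {out.name}")
        elif out.stat().st_size == 0:
            problems.append(f"empty: {out.name}")
    print("\n".join(problems) or "all chunks translated")
    ```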
    
    Then run the meta-merge observability snapshot:
    
    ```bash
    python3 {baseDir}/scripts/merge_meta.py status "<temp_dir>"
    ```
    
    Surface a one-line summary in the verification report:
    
    > Translated chunks: 50 • Meta files: 48 found / 47 consumed • Malformed: 1 (chunk0099 — see stderr) • Chunks missing meta: chunk0017, chunk0042
    
    Severity rules (none of these fail the run — meta is non-blocking):
    
    - `unmerged_meta_files > 0` after Step 4.5 ran → bug, flag prominently. Resume should have caught this.
    - `malformed_meta_files > 0` → sub-agent emitted invalid meta; print chunk_ids and a "fix the file by hand and re-run if you want this chunk's feedback merged" note.
    - `meta_files_found < translated_chunks` → sub-agent-compliance issue (some chunks didn't emit meta at all). Print missing chunk_ids.
    
    Report any chunks that failed translation after retry.
    
    ### 6. Translate Book Title
    
    Read `config.txt` from the temp directory to get the `original_title` field.
    
    Translate the title to the target language. For Chinese, wrap in 书名号: `《translated_title》`.
    
    ### 7. Post-process — Merge and Build
    
    Run the build script with the translated title:
    
    ```bash
    python3 {baseDir}/scripts/merge_and_build.py --temp-dir "<temp_dir>" --title "<translated_title>" --cleanup
    ```
    
    The `--cleanup` flag removes intermediate files (chunks, input.html, etc.) after a fully successful build. If the user asked to keep intermediates, omit `--cleanup`.
    
    The script reads `output_lang` from `config.txt` automatically. Optional overrides: `--lang`, `--author`.
    
    This produces in the temp directory:
    - `output.md` — merged translated markdown
    - `book.html` — web version with floating TOC
    - `book_doc.html` — ebook version
    - `book.docx`, `book.epub`, `book.pdf` — format conversions (requires Calibre)
    
    ### 8. Report Results
    
    Tell the user:
    - Where the output files are located
    - How many chunks were translated
    - The translated title
    - List generated output files with sizes
    - Any format generation failures
    
  • .claude/commands/release.md (command)
    ---
    description: Release a new version to GitHub + ClawHub
    argument-hint: <semver, e.g. 0.3.0>
    ---
    
    Release version `$1` by running these three commands in order. Stop and report immediately if any step fails — do not attempt to recover automatically.
    
    ```bash
    git push origin main
    git tag v$1 && git push --tags
    npx clawhub@latest publish ./ --version $1
    ```
    
    `$1` is bare semver (e.g. `0.3.0`). The `v` prefix is applied only to the git tag, not to the ClawHub version.
    
    First-time ClawHub publish on a machine requires `npx clawhub@latest login` (browser auth, cached per machine). If step 3 fails with `Not logged in`, ask the user to run that login command, then retry only step 3.
    
    If step 3 fails for any other reason after the tag is already pushed, fix the cause and re-run only step 3 with the same version. Do not force-overwrite the tag (`git tag -f`) without explicit user approval.
    

README

Rainman Translate Book

English | 中文

Claude Code skill that translates entire books (PDF/DOCX/EPUB) into any language using parallel subagents.

Inspired by claude_translater. The original project uses shell scripts as its entry point, coordinating the Claude CLI with multiple step scripts to perform chunked translation. This project restructures the workflow as a Claude Code Skill, using subagents to translate chunks in parallel, with manifest-driven integrity checks, resumable runs, and multi-format output unified into a single pipeline. As the project structure and implementation differ significantly from the original, this is an independent project rather than a fork.


How It Works

Input (PDF/DOCX/EPUB)
  │
  ▼
Calibre ebook-convert → HTMLZ → HTML → Markdown
  │
  ▼
Split into chunks (chunk0001.md, chunk0002.md, ...)
  │  manifest.json tracks chunk hashes
  ▼
Parallel subagents (8 concurrent by default)
  │  each subagent: read 1 chunk → translate → write output_chunk*.md
  │  batched to respect API rate limits
  ▼
Validate (manifest hash check, 1:1 source↔output match)
  │
  ▼
Merge → Pandoc → HTML (with TOC) → Calibre → DOCX / EPUB / PDF

Each chunk gets its own independent subagent with a fresh context window. This prevents context accumulation and output truncation that happen when translating a full book in a single session.

Features

  • Parallel subagents — 8 concurrent translators per batch, each with isolated context
  • Resumable — chunk-level resume; already-translated chunks are skipped on re-run (for metadata/template changes, use a fresh run)
  • Manifest validation — SHA-256 hash tracking prevents stale or corrupt outputs from being merged
  • Multi-format output — HTML (with floating TOC), DOCX, EPUB, PDF
  • Multi-language — zh, en, ja, ko, fr, de, es (extensible)
  • PDF/DOCX/EPUB input — Calibre handles the conversion heavy lifting

Prerequisites

  • Claude Code CLI — installed and authenticated
  • Calibre — ebook-convert command must be available (download)
  • Pandoc — for HTML↔Markdown conversion (download)
  • Python 3 with:
    • pypandoc — required (pip install pypandoc)
    • beautifulsoup4 — optional, for better TOC generation (pip install beautifulsoup4)

Quick Start

1. Install the skill

Option A: npx (recommended)

npx skills add deusyu/translate-book -a claude-code -g

Option B: ClawHub

clawhub install translate-book

Option C: Git clone

git clone https://github.com/deusyu/translate-book.git ~/.claude/skills/translate-book

2. Translate a book

In Claude Code, say:

translate /path/to/book.pdf to Chinese

Or use the slash command:

/translate-book translate /path/to/book.pdf to Japanese

The skill handles the full pipeline automatically — convert, chunk, translate in parallel, validate, merge, and build all output formats.

3. Find your outputs

All files are in {book_name}_temp/:

| File | Description |
|------|-------------|
| output.md | Merged translated Markdown |
| book.html | Web version with floating TOC |
| book.docx | Word document |
| book.epub | E-book |
| book.pdf | Print-ready PDF |

Repository Test Assets

  • Checked-in baseline inputs live under tests/baselines/<book-id>/.
  • Generated full-pipeline outputs live under tests/.artifacts/ and should not be committed.
  • Because scripts/convert.py writes {book_name}_temp/ under the current working directory, run repository baseline tests from inside tests/.artifacts/ to keep generated files out of the repo root.

Full-Pipeline Baseline Example

```bash
mkdir -p tests/.artifacts
cd tests/.artifacts
python3 ../../scripts/convert.py ../baselines/standard-alice/standard-alice.epub --olang zh
# then run translation via the skill
python3 ../../scripts/merge_and_build.py --temp-dir standard-alice_temp --title "test"
```

Pipeline Details

Step 1: Convert

python3 scripts/convert.py /path/to/book.pdf --olang zh

Calibre converts the input to HTMLZ, which is extracted and converted to Markdown, then split into chunks (~6000 chars each). A manifest.json records the SHA-256 hash of each source chunk for later validation.
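The hash tracking is conceptually simple; a sketch of what "record the SHA-256 of each source chunk" amounts to (illustrative only — the real manifest format lives in scripts/manifest.py, and the field names below are made up):

```python
# Illustrative sketch: hash every source chunk so later merges can detect
# stale or modified inputs. Field names here are invented for illustration.
import hashlib
import json
from pathlib import Path

temp_dir = Path("book_temp")  # hypothetical
entries = [
    {"file": p.name, "sha256": hashlib.sha256(p.read_bytes()).hexdigest()}
    for p in sorted(temp_dir.glob("chunk*.md"))
]
(temp_dir / "manifest.example.json").write_text(json.dumps({"chunks": entries}, indent=2))
```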

Step 1.5: Glossary (term consistency across chunks)

Each chunk is translated by a fresh-context sub-agent, which means the same proper noun can drift across multiple translations on a 100-chunk book. To fix this, the skill builds a glossary before translation:

  1. Sample 5 chunks (first, last, 3 evenly-spaced middle).
  2. Extract proper nouns and recurring domain terms; pick canonical translations.
  3. Write <temp_dir>/glossary.json (hand-editable schema below).
  4. Run python3 scripts/glossary.py count-frequencies <temp_dir> to populate per-term frequencies (ASCII terms use word-boundary regex so cat doesn't match category; CJK terms use substring matching; single-CJK-char terms are rejected; aliases count toward the term they belong to). A sketch of this matching rule appears at the end of this step.
  5. For each chunk, the orchestrator calls python3 scripts/glossary.py print-terms-for-chunk <temp_dir> chunkNNNN.md and injects the resulting 3-column (原文 | 别名 | 译文) markdown table into that chunk's prompt as a hard constraint. Term selection = (terms whose source OR any alias appears in this chunk) ∪ (top-N most-frequent book-wide).
```json
{
  "version": 2,
  "terms": [
    {"id": "Manhattan", "source": "Manhattan", "target": "曼哈顿",
     "category": "place", "aliases": [], "gender": "unknown",
     "confidence": "medium", "frequency": 12,
     "evidence_refs": [], "notes": ""}
  ],
  "high_frequency_top_n": 20,
  "applied_meta_hashes": {}
}
```

Existing v1 glossary.json files are auto-upgraded to v2 on first load. v2 forbids the same surface form (source or alias) appearing in two different terms; if a v1 file has polysemous duplicate sources, the upgrade aborts with a disambiguation message — fix the file by hand and reload.

Edit glossary.json between runs to fix translations; existing glossary.json is never overwritten — delete it to rebuild from scratch.

Note on partial reruns: in the current release, editing glossary.json after some chunks have been translated does NOT auto-invalidate those chunks — they keep their old translations. Precise glossary-driven re-translation is planned for the next commit. For now, delete the affected output_chunk*.md files (or the whole temp dir) to apply glossary edits.
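
The ASCII-vs-CJK matching rule from step 4 above can be sketched like this (illustrative; the authoritative logic is in scripts/glossary.py):

```python
# Illustrative sketch: ASCII terms match on word boundaries (so "cat" does not
# match "category"); CJK terms match as plain substrings.
import re

def occurrences(term: str, text: str) -> int:
    if term.isascii():
        return len(re.findall(rf"\b{re.escape(term)}\b", text))
    return text.count(term)

print(occurrences("cat", "category cat"))    # 1, not 2
print(occurrences("曼哈顿", "他住在曼哈顿。"))  # 1
```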

Step 2: Translate (parallel subagents)

The skill launches subagents in batches (default: 8 concurrent). Each subagent:

  1. Reads one source chunk (e.g. chunk0042.md)
  2. Translates to the target language
  3. Writes the result to output_chunk0042.md

If a run is interrupted, re-running skips chunks that already have valid output files. Failed chunks are retried once automatically.

Step 3: Merge & Build

python3 scripts/merge_and_build.py --temp-dir book_temp --title "《translated title》"

Before merging, the script validates:

  • Every source chunk has a corresponding output file (1:1 match)
  • Source chunk hashes match the manifest (no stale outputs)
  • No output files are empty

Then: merge → Pandoc HTML → inject TOC → Calibre generates DOCX, EPUB, PDF.

Note: {book_name}_temp/ is a working directory for a single translation run. If you change the title, author, output language, template, or image assets, either use a fresh temp directory or delete the existing final artifacts (output.md, book*.html, book.docx, book.epub, book.pdf) before re-running.

Project Structure

| File | Purpose |
|------|---------|
| SKILL.md | Claude Code skill definition — orchestrates the full pipeline |
| scripts/convert.py | PDF/DOCX/EPUB → Markdown chunks via Calibre HTMLZ |
| scripts/manifest.py | Chunk manifest: SHA-256 tracking and merge validation |
| scripts/glossary.py | Glossary management: per-chunk term tables for consistent terminology |
| scripts/meta.py | Per-chunk sub-agent observation file schema (output_chunkNNNN.meta.json) |
| scripts/merge_meta.py | Batch-boundary merge: sub-agent observations → canonical glossary |
| scripts/merge_and_build.py | Merge chunks → HTML → DOCX/EPUB/PDF |
| scripts/calibre_html_publish.py | Calibre wrapper for format conversion |
| scripts/template.html | Web HTML template with floating TOC |
| scripts/template_ebook.html | Ebook HTML template |
| tests/baselines/ | Checked-in baseline book inputs for full-pipeline testing |
| tests/.artifacts/ | Ignored full-pipeline test outputs |

Troubleshooting

| Problem | Solution |
|---------|----------|
| Calibre ebook-convert not found | Install Calibre and ensure ebook-convert is in PATH |
| Manifest validation failed | Source chunks changed since splitting — re-run convert.py |
| Missing source chunk | Source file deleted — re-run convert.py to regenerate |
| Incomplete translation | Re-run the skill — it resumes from where it stopped |
| Changed title/template/assets but output didn't update | Delete existing output.md, book*.html, book.docx, book.epub, book.pdf from the temp dir, then re-run merge_and_build.py |
| Want page-number footers stripped from PDF output | By default, monotonic page-number sequences (e.g. 1, 2, 3, ...) are auto-detected and dropped while outliers like years (1984), chapter numbers, and citation indices stay preserved. If detection misses your case, pass --strip-page-numbers to convert.py to aggressively delete every standalone-digit line. The flag aborts if a cached input.md or chunk*.md already exists — delete them first so the flag actually takes effect. |
| output.md exists but manifest invalid | Stale output — the script auto-deletes and re-merges |
| Glossary upgrade rejected: duplicate source | v2 disallows two terms sharing a source/alias surface form. Edit glossary.json to disambiguate (e.g., rename one source from Apple to Apple (Inc.)) and reload. |
| PDF generation fails | Ensure Calibre is installed with PDF output support |

Roadmap

Tracking issue #7 — name/term inconsistency and pronoun/gender errors across chunks. Today's glossary covers high-frequency main entities; secondary characters, spelling variants, and pronoun resolution are not yet addressed. The plan is four independently shippable phases.

Design principles

  • Scripts do bookkeeping; LLMs do semantic merge. State, schemas, dedup, hashing, IO are deterministic Python. Naming, gender attribution, alias judgment, conflict resolution are LLM calls.
  • Single writer for shared state. Only the main agent writes glossary.json and run_state.json; sub-agents write per-chunk meta files. No locking needed.
  • Conservative merge. New entities require evidence; alias merges need LLM judgment, not just string similarity; gender starts at unknown and only moves up under explicit evidence; canonical values aren't silently overwritten on conflict.
  • Three-layer state, three separate files. glossary.json (canonical, sub-agents read), output_chunkNNNN.meta.json (raw per-chunk observations), run_state.json (orchestration).

Phase 1 — Sub-agent feedback + glossary merge (shipped)

Closes the read+write loop. Glossary v2 adds id, aliases, gender, confidence, evidence_refs, notes (v1 files auto-upgrade on first load; the term table is now 3-col and aliases participate in selection). Sub-agents emit output_chunkNNNN.meta.json alongside each translated chunk. scripts/merge_meta.py (prepare-merge / apply-merge / status) merges per-batch with conservative rules: surface-form uniqueness enforced, malformed metas quarantined (warn + skip + count), confidence escalation via both evidence_chunks and used_term_sources, FIFO-cap at 5. See SKILL.md Step 4 / Step 4.5 / Step 5.

Phase 2 — Neighbor context for pronouns (not started, independent of Phase 1)

Inject prev_excerpt (last ~300 chars of previous chunk) and next_excerpt (first ~300 chars of next chunk) into each sub-agent prompt as read-only context. No new state files. Pure prompt-assembly change.

Phase 3 — Selective re-translation (not started, depends on Phase 1)

Phase 1's batch feedback only improves forward. Selective rerun closes the backward loop: new scripts/run_state.py + run_state.json schema; per-chunk tracking of glossary_version_used, entity_ids_used, output_hash; five decision rules for deciding which chunks need re-translation this run.

Phase 4 — Bootstrap warm-up (experimental, gated on Phase 1 data)

Phase 1 grows the glossary batch-by-batch, so the first batch sees the smallest glossary and has the highest drift risk. Possible approaches: sequential bootstrap, variable concurrency, or skip entirely. Decision belongs to whoever has run the system on real books.

The specific schemas and file layouts in each phase are illustrative — they may shift as Phase 1 hits real data. Phase 4 is gated on data; Phase 3 may be re-scoped or dropped if Phase 1 alone proves "good enough".

Parallel track — Pipeline / UX backlog (not started, separate from issue #7)

Recent PR discussions also surfaced several useful workflow improvements, but these are broader than one-off patches and touch repo contracts (artifact names, temp-dir behavior, cleanup semantics, or EPUB compatibility scope). These are being tracked as maintainer-owned roadmap items rather than merged directly from the current PRs:

  • Explicit EPUB cover support. Add --cover <image> and pass it through the HTML -> EPUB Calibre step. Keep --cover-from <epub> / EPUB cover auto-extraction out of scope until the project is ready to own EPUB parsing compatibility across different package layouts. (context: closed #3)
  • Configurable temp workspace location. Keep the current cwd-local {book_name}_temp/ default for compatibility. If this is revisited later, prefer an explicit --temp-root / --work-dir style option rather than silently changing the default location. (context: closed #4)
  • Safer Calibre/Pandoc artifact cleanup. Continue improving cleanup rules incrementally under regression tests, while preserving the current page-number detection semantics and not stripping real display-math delimiters or content numbers. (context: closed #5)
  • Optional user-facing export names. Keep canonical pipeline artifacts as book.html, book_doc.html, book.docx, book.epub, and book.pdf. If title-based filenames are added later, they should likely be optional exported aliases/copies rather than a silent replacement of the internal artifact contract. (context: closed #6)

Star History

If you find this project helpful, please consider giving it a Star ⭐!


Sponsor

If this project saves you time, consider sponsoring to keep it maintained and improved.


License

MIT