USP
Unlike single-session translation, this skill uses parallel subagents with isolated contexts, preventing context accumulation and truncation. It features resumable runs, manifest validation for integrity, and a glossary for term consistency.
Use cases
- Translating academic papers or research books
- Localizing technical manuals or documentation
- Converting foreign-language novels into your native tongue
- Processing large text corpora for multilingual analysis
- Creating translated versions of e-books for wider audiences
Detected files (2)
SKILL.md
---
name: translate-book
description: Translate books (PDF/DOCX/EPUB) into any language using parallel sub-agents. Converts input -> Markdown chunks -> translated chunks -> HTML/DOCX/EPUB/PDF.
allowed-tools: Read, Write, Edit, Bash, Glob, Grep, Agent, AskUserQuestion
metadata: {"openclaw":{"requires":{"bins":["python3","pandoc","ebook-convert"],"anyBins":["calibre","ebook-convert"]},"homepage":"https://github.com/deusyu/translate-book"}}
---

# Book Translation Skill

You are a book translation assistant. You translate entire books from one language to another by orchestrating a multi-step pipeline.

## Workflow

### 1. Collect Parameters

Determine the following from the user's message:

- **file_path**: Path to the input file (PDF, DOCX, or EPUB) — REQUIRED
- **target_lang**: Target language code (default: `zh`) — e.g. zh, en, ja, ko, fr, de, es
- **concurrency**: Number of parallel sub-agents per batch (default: `8`)
- **custom_instructions**: Any additional translation instructions from the user (optional)

If the file path is not provided, ask the user.

### 2. Preprocess — Convert to Markdown Chunks

Run the conversion script to produce chunks:

```bash
python3 {baseDir}/scripts/convert.py "<file_path>" --olang "<target_lang>"
```

This creates a `{filename}_temp/` directory containing:

- `input.html`, `input.md` — intermediate files
- `chunk0001.md`, `chunk0002.md`, ... — source chunks for translation
- `manifest.json` — chunk manifest for tracking and validation
- `config.txt` — pipeline configuration with metadata

### 3. Discover Chunks

Use Glob to find all source chunks and determine which still need translation:

```
Glob: {filename}_temp/chunk*.md
Glob: {filename}_temp/output_chunk*.md
```

Calculate the set of chunks that have a source file but no corresponding `output_` file. These are the chunks to translate. If all chunks already have translations, skip to step 5.

### 3.5. Build Glossary (term consistency)

A separate sub-agent translates each chunk with a fresh context. Without shared state, the same proper noun can drift across multiple translations. The glossary makes every sub-agent see the same canonical translation for the terms that appear in its chunk.

If `<temp_dir>/glossary.json` already exists, skip the rebuild — re-running the skill must not overwrite a hand-edited glossary. To force a rebuild, delete the file. Otherwise:

1. **Sample chunks**: read `chunk0001.md`, the last chunk, and 3 evenly-spaced middle chunks. If `chunk_count < 5`, sample all of them.
2. **Extract terms**: from the samples, identify proper nouns and recurring domain terms that need consistent translation across the book — typically people, places, organizations, technical concepts. Translate each into the target language. Skip generic vocabulary that any translator would render the same way.
3. **Write `glossary.json`** in the temp dir, matching this v2 schema:

   ```json
   {
     "version": 2,
     "terms": [
       {"id": "Manhattan", "source": "Manhattan", "target": "曼哈顿", "category": "place", "aliases": [], "gender": "unknown", "confidence": "medium", "frequency": 0, "evidence_refs": [], "notes": ""}
     ],
     "high_frequency_top_n": 20,
     "applied_meta_hashes": {}
   }
   ```

   Existing v1 `glossary.json` files are auto-upgraded to v2 on first load. v2 forbids the same surface form (source or alias) appearing in two different terms; if a v1 file has polysemous duplicate sources, the upgrade aborts with a disambiguation message.
4. **Count frequencies** by running:

   ```bash
   python3 {baseDir}/scripts/glossary.py count-frequencies "<temp_dir>"
   ```

   This scans every `chunk*.md` (excluding `output_chunk*.md`), updates each term's `frequency` field, and writes back atomically.

The glossary is hand-editable. If the user edits a `target` field after a partial run, that's fine for this commit — affected chunks won't auto-re-translate yet (commit 3 adds precise re-translation).

### 4. Parallel Translation with Sub-Agents

**Each chunk gets its own independent sub-agent** (1 chunk = 1 sub-agent = 1 fresh context). This prevents context accumulation and output truncation.

Launch chunks in batches to respect API rate limits:

- Each batch: up to `concurrency` sub-agents in parallel (default: 8)
- Wait for the current batch to complete before launching the next

**Spawn each sub-agent with the following task.** Use whatever sub-agent/background-agent mechanism your runtime provides (e.g. the Agent tool, sessions_spawn, or equivalent). The output file is `output_` prefixed to the source filename: `chunk0001.md` → `output_chunk0001.md`.

> Translate the file `<temp_dir>/chunk<NNNN>.md` to {TARGET_LANGUAGE} and write the result to `<temp_dir>/output_chunk<NNNN>.md`. Follow the translation rules below. Output only the translated content — no commentary.

Each sub-agent receives:

- The single chunk file it is responsible for
- The temp directory path
- The target language
- The translation prompt (see below)
- A per-chunk term table (see "Term table assembly" below)
- Any custom instructions

**Term table assembly** — before spawning a sub-agent, run:

```bash
python3 {baseDir}/scripts/glossary.py print-terms-for-chunk "<temp_dir>" "chunk<NNNN>.md"
```

Capture stdout. The CLI emits a 3-column markdown table (`原文 | 别名 | 译文`) of every term that either appears in this chunk (by source OR any alias) OR is in the top-N most-frequent terms book-wide. Inject the table as `{TERM_TABLE}` in rule #13 of the translation prompt. **If stdout is empty (no glossary, or no relevant terms), omit rule #13 from this chunk's prompt entirely** — do not leave a dangling `{TERM_TABLE}` placeholder.

**Each sub-agent's task**:

1. Read the source chunk file (e.g. `chunk0001.md`)
2. Translate the content following the translation rules below
3. Write the translated content to `output_chunk0001.md`
4. Write observations to `output_chunk0001.meta.json` matching the schema below. **Non-blocking** — leave fields empty if unsure; do not invent entities. Always emit the file (even if all arrays are empty), because its presence + content hash is how the main agent tracks whether feedback was already merged.

**Sub-agent meta schema** (`output_chunk<NNNN>.meta.json`):

```json
{
  "schema_version": 1,
  "new_entities": [
    {"source": "Taig", "target_proposal": "泰格", "category": "person", "evidence": "<≤200-char quote from the chunk>"}
  ],
  "alias_hypotheses": [
    {"variant": "Taig", "may_be_alias_of_source": "Tai", "evidence": "<≤200-char quote>"}
  ],
  "attribute_hypotheses": [
    {"entity_source": "Tai", "attribute": "gender", "value": "male", "confidence": "high", "evidence": "<≤200-char quote>"}
  ],
  "used_term_sources": ["Tai", "Manhattan"],
  "conflicts": [
    {"entity_source": "Tai", "field": "target", "injected": "泰", "observed_better": "太一", "evidence": "<≤200-char quote>"}
  ]
}
```

**Do NOT include a `chunk_id` field** — chunk identity is derived from the filename. Putting it in the payload creates a hallucination hole and validation will reject the file.

The meta file is read by the main agent later and merged into `glossary.json` (see `merge_meta.py`). Sub-agents should fill the schema honestly: cite real quotes from the chunk, never invent entities to "look productive". An empty meta is a perfectly valid output.

**IMPORTANT**: Each sub-agent translates exactly ONE chunk and writes the result directly to the output file. No START/END markers needed.

#### Translation Prompt for Sub-Agents

Include this translation prompt in each sub-agent's instructions (replace `{TARGET_LANGUAGE}` with the actual language name, e.g. "Chinese"):

---

请翻译markdown文件为 {TARGET_LANGUAGE}.

IMPORTANT REQUIREMENTS:

1. 严格保持 Markdown 格式不变,包括标题、链接、图片引用等
2. 仅翻译文字内容,保留所有 Markdown 语法和文件名
3. 删除空链接、不必要的字符和如: 行末的'\\'。页码已由 convert.py 上游处理,不要再删除独立的数字行(可能是年份 1984、章节编号、引用编号等正文内容)。
4. 保证格式和语义准确翻译内容自然流畅
5. 只输出翻译后的正文内容,不要有任何说明、提示、注释或对话内容。
6. 表达清晰简洁,不要使用复杂的句式。请严格按顺序翻译,不要跳过任何内容。
7. 必须保留所有图片引用,包括:
   - 所有 `![...](...)` 格式的图片引用必须完整保留
   - 图片文件名和路径不要修改(如 media/image-001.png)
   - 图片alt文本可以翻译,但必须保留图片引用结构
   - 不要删除、过滤或忽略任何图片相关内容
   - 图片引用示例: `![alt](media/image-001.png)` -> `![译文alt](media/image-001.png)`
   - **原始 HTML 标签(如 `<img alt="..." />`、`<a title="...">`)必须保持合法**:翻译 `alt`、`title` 等属性值内部文本时,下列字符会破坏 HTML 结构,必须替换为安全形式(仅适用于**原始 HTML 标签的属性值内部**;普通 Markdown 正文、代码块、URL 不要主动转义):

     | 字符 | 在属性值内的危险 | 替换为 |
     |------|---------------|--------|
     | `"` | 闭合 `attr="..."` | 目标语言合适的弯引号(如中文 `“` `”`)或 `&quot;` |
     | `'` | 闭合 `attr='...'` | 目标语言合适的弯引号(如中文 `‘` `’`)或 `&#39;` |
     | `<` | 被解析为新标签 | `&lt;` |
     | `>` | 被解析为标签结束 | `&gt;` |
     | `&` | 被解析为实体起始(除非已是 `&xxx;`) | `&amp;` |

     不要修改 `src`、`href` 等结构性属性的值,只翻译可见文本属性(`alt`、`title`)。
   - 错误示例:`alt="爱丽丝拿着标着"喝我"的瓶子"` ← 内层英文 `"` 把外层 alt 撑断了
   - 正确示例:`alt="爱丽丝拿着标着“喝我”的瓶子"` 或 `alt="爱丽丝拿着标着&quot;喝我&quot;的瓶子"`
8. 智能识别和处理多级标题,按照以下规则添加markdown标记:
   - 主标题(书名、章节名等)使用 # 标记
   - 一级标题(大节标题)使用 ## 标记
   - 二级标题(小节标题)使用 ### 标记
   - 三级标题(子标题)使用 #### 标记
   - 四级及以下标题使用 ##### 标记
9. 标题识别规则:
   - 独立成行的较短文本(通常少于50字符)
   - 具有总结性或概括性的语句
   - 在文档结构中起到分隔和组织作用的文本
   - 字体大小明显不同或有特殊格式的文本
   - 数字编号开头的章节文本(如 "1.1 概述"、"第三章"等)
10. 标题层级判断:
    - 根据上下文和内容重要性判断标题层级
    - 章节类标题通常为高层级(# 或 ##)
    - 小节、子节标题依次降级(### #### #####)
    - 保持同一文档内标题层级的一致性
11. 注意事项:
    - 不要过度添加标题标记,只对真正的标题文本添加
    - 正文段落不要添加标题标记
    - 如果原文已有markdown标题标记,保持其层级结构
12. {CUSTOM_INSTRUCTIONS if provided}
13. 术语一致性:以下术语必须严格使用指定译法,不要自行变换。表格中"原文"列**或"别名"列**任一形式出现在正文中时,都必须翻译为"译文"列对应的形式。

{TERM_TABLE}

markdown文件正文:

---

### 4.5. Merge Sub-Agent Meta Into Glossary (after each batch)

Each sub-agent emitted an `output_chunk<NNNN>.meta.json` alongside its translated chunk. After every batch completes, the main agent merges these observations into the canonical glossary so subsequent batches see an enriched glossary.

1. Run prepare-merge:

   ```bash
   python3 {baseDir}/scripts/merge_meta.py prepare-merge "<temp_dir>"
   ```

   Capture stdout JSON. It contains four arrays:

   - `auto_apply` — new entities with no glossary collision and unanimous (target, category) across all proposing chunks.
   - `decisions_needed` — items requiring main-agent judgment. Each has `id`, `kind`, an `options` array, and the data needed to pick. Kinds:
     - `alias` — `{variant, candidate_source, evidence}`. Choices: `yes_alias` / `no_separate_entity` / `skip`.
     - `conflict` — `{entity_source, field, current, proposed, evidence}`. Choices: `keep_current` / `accept_proposed` / `record_in_notes`.
     - `new_entity_existing_alias` — sub-agents propose `proposed_source` as a new entity, but it's already someone's alias. `{proposed_source, currently_alias_of, promoted_variants: [{target_proposal, category, evidence, evidence_chunks}, ...]}`. Choices: one `use_variant_N` per distinct (target, category) promotion variant (promote `proposed_source` to standalone with that target+category, removing it from the host's aliases) / `keep_as_alias` / `skip`.
     - `existing_entity_conflict` — sub-agents proposed a (target, category) for `entity_source` that differs from the canonical. Multiple distinct differing proposals all get exposed. `{entity_source, current_target, current_category, proposed_variants: [{target_proposal, category, evidence, evidence_chunks}, ...]}`. Choices: `keep_current` / one `use_variant_N` per competing proposal (overwrites both target AND category, stamps the prior values into notes) / `record_in_notes` (canonical unchanged; every proposed variant gets logged to notes).
     - `alias_or_new_entity` — `variant` has multiple competing options that can't all coexist under v2's surface-form uniqueness rule. Triggered when (a) `variant` was proposed both as a new standalone entity AND as an alias of one or more candidates, OR (b) `variant` was proposed as an alias of two or more different candidates with no standalone competitor. `{variant, alias_candidates: [{candidate_source, evidence, evidence_chunks}, ...], standalone_variants: [{target_proposal, category, evidence, evidence_chunks}, ...]}`. Choices: one `use_alias_N` per candidate (attach as alias of that candidate), one `use_standalone_N` per competing standalone proposal (add as standalone with that target+category), or `skip`.
     - `conflicting_new_entity_proposals` — `{source, variants: [{target_proposal, category, evidence, evidence_chunks}, ...]}`. Choices: `use_variant_0`, `use_variant_1`, ..., `skip`.
   - `consumed_chunk_ids` — every meta file scanned this round (regardless of whether it produced a finding). These hashes get recorded in `applied_meta_hashes` on apply.
   - `malformed_meta_chunk_ids` — meta files that failed validation. Quarantined: not consumed, not crashing the run. Surface them in your batch progress.

2. **If `consumed_chunk_ids` is empty** → nothing was scanned; skip to Step 5.
3. **If `consumed_chunk_ids` is non-empty but both `auto_apply` and `decisions_needed` are empty** → still pipe `{"auto_apply": [], "decisions": [], "consumed_chunk_ids": [...]}` into `apply-merge` so the hashes get recorded. **Skipping this is the bug** — no-op metas would re-scan forever otherwise.
4. **Otherwise, resolve each decision**:
   - Read its evidence quotes inline.
   - Pick one option from its `options` array.
   - Build a `decisions` entry that round-trips the original decision plus your choice. The entry MUST include the original `kind` and (for `conflicting_new_entity_proposals`) the `variants` array, so apply-merge can validate and act:

     ```json
     {"id": "d1", "kind": "alias", "variant": "Taig", "candidate_source": "Tai", "choice": "yes_alias"}
     ```

5. Pipe the decisions JSON into apply-merge:

   ```bash
   echo '{"auto_apply": [...], "decisions": [...], "consumed_chunk_ids": [...]}' \
     | python3 {baseDir}/scripts/merge_meta.py apply-merge "<temp_dir>"
   ```

   Surface the summary JSON (`auto_applied`, `decisions_resolved`, `consumed_chunks`, `errors`) in your batch progress message.

**apply-merge is transactional.** If any decision is malformed (wrong choice for kind, missing fields, references a non-existent entity), the entire batch aborts with a non-zero exit and stderr details — no glossary mutation, no hashes recorded. On non-zero exit, fix the offending decision and re-pipe; `prepare-merge` will surface the same proposals because nothing was consumed.

**Decision order in the input list is not significant.** `apply-merge` internally dispatches entity-creating decisions before alias-attaching ones, so `yes_alias` decisions whose candidate is created by another decision in the same batch (a `use_standalone_N`, `use_variant_N`, or `promote_to_separate_entity`) succeed regardless of the order you pass them in. Alias chains (e.g. `Taighi → Taig` where `Taig → Tai` is also a pending alias decision) resolve via a fixed-point loop within the alias-attacher pass — you don't need to topo-sort or sequence chained aliases manually.

On a fresh run after a previous interrupted batch, `prepare-merge` will pick up any meta files left behind. Don't manually delete them.

### 5. Verify Completeness and Retry

After all batches complete, use Glob to check that every source chunk has a corresponding output file. If any are missing, retry them — each missing chunk as its own sub-agent. Maximum 2 attempts per chunk (initial + 1 retry).

Also read `manifest.json` and verify:

- Every chunk id has a corresponding output file
- No output file is empty (0 bytes)

Then run the meta-merge observability snapshot:

```bash
python3 {baseDir}/scripts/merge_meta.py status "<temp_dir>"
```

Surface a one-line summary in the verification report:

> Translated chunks: 50 • Meta files: 48 found / 47 consumed • Malformed: 1 (chunk0099 — see stderr) • Chunks missing meta: chunk0017, chunk0042

Severity rules (none of these fail the run — meta is non-blocking):

- `unmerged_meta_files > 0` after Step 4.5 ran → bug, flag prominently. Resume should have caught this.
- `malformed_meta_files > 0` → sub-agent emitted invalid meta; print chunk_ids and a "fix the file by hand and re-run if you want this chunk's feedback merged" note.
- `meta_files_found < translated_chunks` → sub-agent-compliance issue (some chunks didn't emit meta at all). Print missing chunk_ids.

Report any chunks that failed translation after retry.

### 6. Translate Book Title

Read `config.txt` from the temp directory to get the `original_title` field. Translate the title to the target language. For Chinese, wrap in 书名号: `《translated_title》`.

### 7. Post-process — Merge and Build

Run the build script with the translated title:

```bash
python3 {baseDir}/scripts/merge_and_build.py --temp-dir "<temp_dir>" --title "<translated_title>" --cleanup
```

The `--cleanup` flag removes intermediate files (chunks, input.html, etc.) after a fully successful build. If the user asked to keep intermediates, omit `--cleanup`. The script reads `output_lang` from `config.txt` automatically. Optional overrides: `--lang`, `--author`.

This produces in the temp directory:

- `output.md` — merged translated markdown
- `book.html` — web version with floating TOC
- `book_doc.html` — ebook version
- `book.docx`, `book.epub`, `book.pdf` — format conversions (requires Calibre)

### 8. Report Results

Tell the user:

- Where the output files are located
- How many chunks were translated
- The translated title
- List generated output files with sizes
- Any format generation failures

.claude/commands/release.md
---
description: Release a new version to GitHub + ClawHub
argument-hint: <semver, e.g. 0.3.0>
---

Release version `$1` by running these three commands in order. Stop and report immediately if any step fails — do not attempt to recover automatically.

```bash
git push origin main
git tag v$1 && git push --tags
npx clawhub@latest publish ./ --version $1
```

`$1` is bare semver (e.g. `0.3.0`). The `v` prefix is applied only to the git tag, not to the ClawHub version.

First-time ClawHub publish on a machine requires `npx clawhub@latest login` (browser auth, cached per machine). If step 3 fails with `Not logged in`, ask the user to run that login command, then retry only step 3. If step 3 fails for any other reason after the tag is already pushed, fix the cause and re-run only step 3 with the same version. Do not force-overwrite the tag (`git tag -f`) without explicit user approval.
README
Rainman Translate Book
English | 中文
Claude Code skill that translates entire books (PDF/DOCX/EPUB) into any language using parallel subagents.
Inspired by claude_translater. The original project uses shell scripts as its entry point, coordinating the Claude CLI with multiple step scripts to perform chunked translation. This project restructures the workflow as a Claude Code Skill, using subagents to translate chunks in parallel, with manifest-driven integrity checks, resumable runs, and multi-format output unified into a single pipeline. As the project structure and implementation differ significantly from the original, this is an independent project rather than a fork.
How It Works
```
Input (PDF/DOCX/EPUB)
        │
        ▼
Calibre ebook-convert → HTMLZ → HTML → Markdown
        │
        ▼
Split into chunks (chunk0001.md, chunk0002.md, ...)
        │  manifest.json tracks chunk hashes
        ▼
Parallel subagents (8 concurrent by default)
        │  each subagent: read 1 chunk → translate → write output_chunk*.md
        │  batched to respect API rate limits
        ▼
Validate (manifest hash check, 1:1 source↔output match)
        │
        ▼
Merge → Pandoc → HTML (with TOC) → Calibre → DOCX / EPUB / PDF
```
Each chunk gets its own independent subagent with a fresh context window. This prevents context accumulation and output truncation that happen when translating a full book in a single session.
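The batching itself is a simple "launch up to N, wait, launch the next N" loop. The sketch below is illustrative only: in the real skill each unit of work is a Claude sub-agent spawned by the orchestrator, not a thread, and `translate_chunk` is a hypothetical stand-in for one such sub-agent call.

```python
# Illustrative batching pattern only — the skill spawns Claude sub-agents,
# not threads. `translate_chunk(path)` is a hypothetical per-chunk callable.
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path

def run_in_batches(pending: list[Path], translate_chunk, concurrency: int = 8) -> None:
    for i in range(0, len(pending), concurrency):
        batch = pending[i:i + concurrency]
        # The whole batch must finish before the next one starts, so at most
        # `concurrency` translations are in flight at any time.
        with ThreadPoolExecutor(max_workers=concurrency) as pool:
            list(pool.map(translate_chunk, batch))
```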
Features
- Parallel subagents — 8 concurrent translators per batch, each with isolated context
- Resumable — chunk-level resume; already-translated chunks are skipped on re-run (for metadata/template changes, use a fresh run)
- Manifest validation — SHA-256 hash tracking prevents stale or corrupt outputs from being merged
- Multi-format output — HTML (with floating TOC), DOCX, EPUB, PDF
- Multi-language — zh, en, ja, ko, fr, de, es (extensible)
- PDF/DOCX/EPUB input — Calibre handles the conversion heavy lifting
Prerequisites
- Claude Code CLI — installed and authenticated
- Calibre — the `ebook-convert` command must be available (download)
- Pandoc — for HTML↔Markdown conversion (download)
- Python 3 with:
  - `pypandoc` — required (`pip install pypandoc`)
  - `beautifulsoup4` — optional, for better TOC generation (`pip install beautifulsoup4`)
Quick Start
1. Install the skill
Option A: npx (recommended)
npx skills add deusyu/translate-book -a claude-code -g
Option B: ClawHub
clawhub install translate-book
Option C: Git clone
git clone https://github.com/deusyu/translate-book.git ~/.claude/skills/translate-book
2. Translate a book
In Claude Code, say:
translate /path/to/book.pdf to Chinese
Or use the slash command:
/translate-book translate /path/to/book.pdf to Japanese
The skill handles the full pipeline automatically — convert, chunk, translate in parallel, validate, merge, and build all output formats.
3. Find your outputs
All files are in {book_name}_temp/:
| File | Description |
|---|---|
| output.md | Merged translated Markdown |
| book.html | Web version with floating TOC |
| book.docx | Word document |
| book.epub | E-book |
| book.pdf | Print-ready PDF |
Repository Test Assets
- Checked-in baseline inputs live under `tests/baselines/<book-id>/`.
- Generated full-pipeline outputs live under `tests/.artifacts/` and should not be committed.
- Because `scripts/convert.py` writes `{book_name}_temp/` under the current working directory, run repository baseline tests from inside `tests/.artifacts/` to keep generated files out of the repo root.
Full-Pipeline Baseline Example
```bash
mkdir -p tests/.artifacts
cd tests/.artifacts
python3 ../../scripts/convert.py ../baselines/standard-alice/standard-alice.epub --olang zh
# then run translation via the skill
python3 ../../scripts/merge_and_build.py --temp-dir standard-alice_temp --title "test"
```
Pipeline Details
Step 1: Convert
python3 scripts/convert.py /path/to/book.pdf --olang zh
Calibre converts the input to HTMLZ, which is extracted and converted to Markdown, then split into chunks (~6000 chars each). A manifest.json records the SHA-256 hash of each source chunk for later validation.
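Conceptually, the chunking step is "split at paragraph boundaries into roughly fixed-size pieces, hash each piece". A minimal sketch of that idea follows; function name, exact manifest layout, and the paragraph-splitting rule are assumptions for illustration, not the actual convert.py internals.

```python
# Sketch of the chunk + manifest idea (not the actual convert.py code).
# Assumes a flat {chunk_name: sha256} manifest layout for illustration.
import hashlib, json
from pathlib import Path

def split_and_manifest(markdown: str, temp_dir: str, size: int = 6000) -> None:
    out = Path(temp_dir)
    out.mkdir(exist_ok=True)
    chunks, current = [], ""
    for para in markdown.split("\n\n"):
        # Start a new chunk once the current one would exceed ~`size` chars.
        if current and len(current) + len(para) > size:
            chunks.append(current)
            current = ""
        current += para + "\n\n"
    if current.strip():
        chunks.append(current)
    manifest = {}
    for i, text in enumerate(chunks, start=1):
        name = f"chunk{i:04d}.md"
        (out / name).write_text(text, encoding="utf-8")
        # Record a SHA-256 per source chunk so later stages can detect
        # stale or modified sources before merging.
        manifest[name] = hashlib.sha256(text.encode("utf-8")).hexdigest()
    (out / "manifest.json").write_text(json.dumps(manifest, indent=2))
```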
Step 1.5: Glossary (term consistency across chunks)
Each chunk is translated by a fresh-context sub-agent, which means the same proper noun can drift across multiple translations on a 100-chunk book. To fix this, the skill builds a glossary before translation:
- Sample 5 chunks (first, last, 3 evenly-spaced middle).
- Extract proper nouns and recurring domain terms; pick canonical translations.
- Write `<temp_dir>/glossary.json` (hand-editable schema below).
- Run `python3 scripts/glossary.py count-frequencies <temp_dir>` to populate per-term frequencies (ASCII terms use a word-boundary regex so `cat` doesn't match `category`; CJK terms use substring matching; single-CJK-char terms are rejected; aliases count toward the term they belong to — a sketch of these matching rules appears after the schema below).
- For each chunk, the orchestrator calls `python3 scripts/glossary.py print-terms-for-chunk <temp_dir> chunkNNNN.md` and injects the resulting 3-column (原文 | 别名 | 译文) markdown table into that chunk's prompt as a hard constraint. Term selection = (terms whose source OR any alias appears in this chunk) ∪ (top-N most-frequent book-wide).
```json
{
  "version": 2,
  "terms": [
    {"id": "Manhattan", "source": "Manhattan", "target": "曼哈顿",
     "category": "place", "aliases": [], "gender": "unknown",
     "confidence": "medium", "frequency": 12,
     "evidence_refs": [], "notes": ""}
  ],
  "high_frequency_top_n": 20,
  "applied_meta_hashes": {}
}
```
Existing v1 glossary.json files are auto-upgraded to v2 on first load. v2 forbids the same surface form (source or alias) appearing in two different terms; if a v1 file has polysemous duplicate sources, the upgrade aborts with a disambiguation message — fix the file by hand and reload.
Edit glossary.json between runs to fix translations; existing glossary.json is never overwritten — delete it to rebuild from scratch.
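The matching rules described above (word boundaries for ASCII, substrings for CJK, aliases credited to their owning term) can be read roughly as the sketch below. This is an illustration of the documented behaviour, not the actual glossary.py implementation; the non-ASCII check is a simplifying assumption.

```python
# Rough reading of the documented matching rules (not glossary.py itself).
import re

def surface_matches(surface: str, text: str) -> int:
    if re.search(r"[^\x00-\x7f]", surface):
        # CJK / non-ASCII surface forms: plain substring count.
        return text.count(surface)
    # ASCII surface forms: word-boundary match, so "cat" won't hit "category".
    return len(re.findall(r"\b" + re.escape(surface) + r"\b", text))

def count_frequencies(glossary: dict, chunk_texts: list[str]) -> None:
    for term in glossary["terms"]:
        # Alias hits count toward the term they belong to.
        surfaces = [term["source"], *term.get("aliases", [])]
        term["frequency"] = sum(
            surface_matches(s, text) for s in surfaces for text in chunk_texts
        )
```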
Note on partial reruns: in the current release, editing `glossary.json` after some chunks have been translated does NOT auto-invalidate those chunks — they keep their old translations. Precise glossary-driven re-translation is planned for the next commit. For now, delete the affected `output_chunk*.md` files (or the whole temp dir) to apply glossary edits.
Step 2: Translate (parallel subagents)
The skill launches subagents in batches (default: 8 concurrent). Each subagent:
- Reads one source chunk (e.g. `chunk0042.md`)
- Translates to the target language
- Writes the result to `output_chunk0042.md`
If a run is interrupted, re-running skips chunks that already have valid output files. Failed chunks are retried once automatically.
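The resume-plus-one-retry behaviour boils down to "recompute the set of chunks without a valid output, translate them, then do it once more". A minimal sketch under that reading, with `translate_chunk` as a hypothetical stand-in for spawning one sub-agent (the real skill does this via Glob and the Agent tool, not this code):

```python
# Sketch of resume + single retry (illustrative only).
from pathlib import Path

def pending_chunks(temp_dir: str) -> list[Path]:
    """Chunks that have a source file but no non-empty output yet."""
    todo = []
    for src in sorted(Path(temp_dir).glob("chunk*.md")):
        out = src.parent / f"output_{src.name}"
        if not out.exists() or out.stat().st_size == 0:
            todo.append(src)
    return todo

def translate_with_retry(temp_dir: str, translate_chunk) -> list[Path]:
    for _attempt in range(2):            # initial pass + one automatic retry
        for chunk in pending_chunks(temp_dir):
            translate_chunk(chunk)       # hypothetical per-chunk sub-agent call
    return pending_chunks(temp_dir)      # whatever is left here has failed
```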
Step 3: Merge & Build
python3 scripts/merge_and_build.py --temp-dir book_temp --title "《translated title》"
Before merging, the script validates:
- Every source chunk has a corresponding output file (1:1 match)
- Source chunk hashes match the manifest (no stale outputs)
- No output files are empty
Then: merge → Pandoc HTML → inject TOC → Calibre generates DOCX, EPUB, PDF.
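The three pre-merge checks amount to a single pass over the manifest. The sketch below assumes a flat `{chunk_name: sha256}` manifest layout (an assumption for illustration, not the actual merge_and_build.py code).

```python
# Sketch of the pre-merge validation checks listed above (illustrative).
import hashlib, json
from pathlib import Path

def validate_before_merge(temp_dir: str) -> list[str]:
    root = Path(temp_dir)
    manifest = json.loads((root / "manifest.json").read_text())
    errors = []
    for name, expected in manifest.items():
        src, out = root / name, root / f"output_{name}"
        if not src.exists():
            errors.append(f"missing source chunk: {name}")
            continue
        # Source hashes must match the manifest (no stale outputs).
        if hashlib.sha256(src.read_bytes()).hexdigest() != expected:
            errors.append(f"stale source: {name} changed since splitting")
        # Every source chunk needs a non-empty output file (1:1 match).
        if not out.exists():
            errors.append(f"missing output for {name}")
        elif out.stat().st_size == 0:
            errors.append(f"empty output for {name}")
    return errors
```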
Note: {book_name}_temp/ is a working directory for a single translation run. If you change the title, author, output language, template, or image assets, either use a fresh temp directory or delete the existing final artifacts (output.md, book*.html, book.docx, book.epub, book.pdf) before re-running.
Project Structure
| File | Purpose |
|---|---|
| SKILL.md | Claude Code skill definition — orchestrates the full pipeline |
| scripts/convert.py | PDF/DOCX/EPUB → Markdown chunks via Calibre HTMLZ |
| scripts/manifest.py | Chunk manifest: SHA-256 tracking and merge validation |
| scripts/glossary.py | Glossary management: per-chunk term tables for consistent terminology |
| scripts/meta.py | Per-chunk sub-agent observation file schema (output_chunkNNNN.meta.json) |
| scripts/merge_meta.py | Batch-boundary merge: sub-agent observations → canonical glossary |
| scripts/merge_and_build.py | Merge chunks → HTML → DOCX/EPUB/PDF |
| scripts/calibre_html_publish.py | Calibre wrapper for format conversion |
| scripts/template.html | Web HTML template with floating TOC |
| scripts/template_ebook.html | Ebook HTML template |
| tests/baselines/ | Checked-in baseline book inputs for full-pipeline testing |
| tests/.artifacts/ | Ignored full-pipeline test outputs |
Troubleshooting
| Problem | Solution |
|---|---|
| Calibre `ebook-convert` not found | Install Calibre and ensure `ebook-convert` is in PATH |
| Manifest validation failed | Source chunks changed since splitting — re-run `convert.py` |
| Missing source chunk | Source file deleted — re-run `convert.py` to regenerate |
| Incomplete translation | Re-run the skill — it resumes from where it stopped |
| Changed title/template/assets but output didn't update | Delete existing `output.md`, `book*.html`, `book.docx`, `book.epub`, `book.pdf` from the temp dir, then re-run `merge_and_build.py` |
| Want page-number footers stripped from PDF output | By default, monotonic page-number sequences (e.g. 1, 2, 3, ...) are auto-detected and dropped, while outliers like years (1984), chapter numbers, and citation indices are preserved. If detection misses your case, pass `--strip-page-numbers` to `convert.py` to aggressively delete every standalone-digit line. The flag aborts if a cached `input.md` or `chunk*.md` already exists — delete them first so the flag actually takes effect. A sketch of the default heuristic follows this table. |
| `output.md` exists but manifest invalid | Stale output — the script auto-deletes and re-merges |
| Glossary upgrade rejected: duplicate source | v2 disallows two terms sharing a source/alias surface form. Edit `glossary.json` to disambiguate (e.g., rename one source from `Apple` to `Apple (Inc.)`) and reload. |
| PDF generation fails | Ensure Calibre is installed with PDF output support |
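One way to read the default "monotonic sequences dropped, outliers preserved" behaviour is sketched below. It is a rough illustration only; the real detection lives in `scripts/convert.py` and may differ in thresholds and details.

```python
# Rough sketch of the monotonic page-number heuristic (illustrative only).
def drop_page_number_lines(lines: list[str]) -> list[str]:
    # Collect (line index, value) for every standalone-digit line.
    digits = [(i, int(ln)) for i, ln in enumerate(lines) if ln.strip().isdigit()]
    drop: set[int] = set()
    run: list[tuple[int, int]] = []
    for i, v in digits:
        if run and 0 < v - run[-1][1] <= 3:
            run.append((i, v))          # continues a slowly increasing run
        else:
            if len(run) >= 3:           # long enough to look like pagination
                drop.update(idx for idx, _ in run)
            run = [(i, v)]
    if len(run) >= 3:
        drop.update(idx for idx, _ in run)
    # Isolated outliers (1984, chapter numbers, citation indices) never form
    # a long run, so they are preserved.
    return [ln for i, ln in enumerate(lines) if i not in drop]
```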
Roadmap
Tracking issue #7 — name/term inconsistency and pronoun/gender errors across chunks. Today's glossary covers high-frequency main entities; secondary characters, spelling variants, and pronoun resolution are not yet addressed. The plan is four independently shippable phases.
Design principles
- Scripts do bookkeeping; LLMs do semantic merge. State, schemas, dedup, hashing, IO are deterministic Python. Naming, gender attribution, alias judgment, conflict resolution are LLM calls.
- Single writer for shared state. Only the main agent writes `glossary.json` and `run_state.json`; sub-agents write per-chunk meta files. No locking needed.
- Conservative merge. New entities require evidence; alias merges need LLM judgment, not just string similarity; gender starts at `unknown` and only moves up under explicit evidence; canonical values aren't silently overwritten on conflict.
- Three-layer state, three separate files. `glossary.json` (canonical, sub-agents read), `output_chunkNNNN.meta.json` (raw per-chunk observations), `run_state.json` (orchestration).
Phase 1 — Sub-agent feedback + glossary merge (shipped)
Closes the read+write loop. Glossary v2 adds id, aliases, gender, confidence, evidence_refs, notes (v1 files auto-upgrade on first load; the term table is now 3-col and aliases participate in selection). Sub-agents emit output_chunkNNNN.meta.json alongside each translated chunk. scripts/merge_meta.py (prepare-merge / apply-merge / status) merges per-batch with conservative rules: surface-form uniqueness enforced, malformed metas quarantined (warn + skip + count), confidence escalation via both evidence_chunks and used_term_sources, FIFO-cap at 5. See SKILL.md Step 4 / Step 4.5 / Step 5.
Phase 2 — Neighbor context for pronouns (not started, independent of Phase 1)
Inject prev_excerpt (last ~300 chars of previous chunk) and next_excerpt (first ~300 chars of next chunk) into each sub-agent prompt as read-only context. No new state files. Pure prompt-assembly change.
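Since Phase 2 is a pure prompt-assembly change, the whole feature is roughly the helper sketched below. Names and the 300-character window mirror the description above; everything else (the return shape, missing-neighbor handling) is an illustrative assumption, not a committed design.

```python
# Minimal sketch of the Phase 2 neighbor-context idea (illustrative only).
from pathlib import Path

def neighbor_context(temp_dir: str, n: int, window: int = 300) -> dict[str, str]:
    """Read-only excerpts from the chunks before and after chunk n."""
    def tail(path: Path) -> str:
        return path.read_text(encoding="utf-8")[-window:] if path.exists() else ""
    def head(path: Path) -> str:
        return path.read_text(encoding="utf-8")[:window] if path.exists() else ""
    root = Path(temp_dir)
    return {
        "prev_excerpt": tail(root / f"chunk{n - 1:04d}.md"),
        "next_excerpt": head(root / f"chunk{n + 1:04d}.md"),
    }
```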
Phase 3 — Selective re-translation (not started, depends on Phase 1)
Phase 1's batch feedback only improves forward. Selective rerun closes the backward loop: new scripts/run_state.py + run_state.json schema; per-chunk tracking of glossary_version_used, entity_ids_used, output_hash; five decision rules for deciding which chunks need re-translation this run.
Phase 4 — Bootstrap warm-up (experimental, gated on Phase 1 data)
Phase 1 grows the glossary batch-by-batch, so the first batch sees the smallest glossary and has the highest drift risk. Possible approaches: sequential bootstrap, variable concurrency, or skip entirely. Decision belongs to whoever has run the system on real books.
The specific schemas and file layouts in each phase are illustrative — they may shift as Phase 1 hits real data. Phase 4 is gated on data; Phase 3 may be re-scoped or dropped if Phase 1 alone proves "good enough".
Parallel track — Pipeline / UX backlog (not started, separate from issue #7)
Recent PR discussions also surfaced several useful workflow improvements, but these are broader than one-off patches and touch repo contracts (artifact names, temp-dir behavior, cleanup semantics, or EPUB compatibility scope). These are being tracked as maintainer-owned roadmap items rather than merged directly from the current PRs:
- Explicit EPUB cover support. Add `--cover <image>` and pass it through the HTML -> EPUB Calibre step. Keep `--cover-from <epub>` / EPUB cover auto-extraction out of scope until the project is ready to own EPUB parsing compatibility across different package layouts. (context: closed #3)
- Configurable temp workspace location. Keep the current cwd-local `{book_name}_temp/` default for compatibility. If this is revisited later, prefer an explicit `--temp-root` / `--work-dir` style option rather than silently changing the default location. (context: closed #4)
- Safer Calibre/Pandoc artifact cleanup. Continue improving cleanup rules incrementally under regression tests, while preserving the current page-number detection semantics and not stripping real display-math delimiters or content numbers. (context: closed #5)
- Optional user-facing export names. Keep canonical pipeline artifacts as `book.html`, `book_doc.html`, `book.docx`, `book.epub`, and `book.pdf`. If title-based filenames are added later, they should likely be optional exported aliases/copies rather than a silent replacement of the internal artifact contract. (context: closed #6)
Star History
If you find this project helpful, please consider giving it a Star ⭐!
Sponsor
If this project saves you time, consider sponsoring to keep it maintained and improved.