Curated Claude Code catalog
Updated 07.05.2026 · 19:39 CET
01 / Skill
kreuzberg-dev

kreuzberg

Quality
9.0

Kreuzberg is a high-performance document intelligence library built on a Rust core, capable of extracting text, metadata, and code intelligence from over 97 file formats and 305 programming languages. It's ideal for building robust RAG pipelines, AI agents, or any application requiring fast, comprehensive document analysis across diverse data sources.

USP

It offers native bindings for 10+ languages, VLM OCR with 146 LLM providers, and a TOON wire format for token-efficient RAG pipelines. Its extensible architecture supports custom OCR backends and post-processors, making it highly adaptable.

Use cases

  • Building RAG pipelines for LLM applications
  • Automated document processing and content extraction
  • Extracting code intelligence for AI agents and analysis
  • Performing OCR on scanned documents and images
  • Batch processing various file types for data ingestion

Detected files (6)

  • .ai-rulez/skills/extraction-pipeline-patterns/SKILL.md (7050 bytes)
    ---
    description: "Document extraction pipeline architecture and patterns"
    name: extraction-pipeline-patterns
    priority: critical
    ---
    
    # Extraction Pipeline Patterns
    
    **Kreuzberg's format detection -> extraction -> fallback orchestration for 75+ file formats**
    
    ## Core Pipeline Architecture
    
    The extraction pipeline (`crates/kreuzberg/src/core/pipeline.rs`, `crates/kreuzberg/src/extraction/`) orchestrates:
    
    1. **Format Detection** - MIME type inference + extension validation -> select appropriate extractor
    2. **Intelligent Extraction** - Route to format-specific extractors (PDF, DOCX, Excel, HTML, images, archives, etc.)
    3. **Fallback Strategies** - Password-protected PDFs, OCR for images, nested archive handling, corrupted file recovery
    4. **Post-Processing Pipeline** - Validators, quality processing, chunking, custom hooks (see `core/pipeline.rs`)
    
    ## Format Detection Strategy
    
    **Location**: `crates/kreuzberg/src/core/mime.rs`, `crates/kreuzberg/src/core/formats.rs`
    
    Pattern: detect via magic bytes, validate extension alignment (prevent spoofing), route to extractor. Multiple extractors for same format -> choose highest confidence/specificity.
    
    ```rust
    // Pseudocode: core/mime.rs
    match (magic_bytes(content), extension) {
        (Some(fmt), Some(ext)) if aligned(fmt, ext) => Ok(fmt),
        (Some(_fmt), Some(_ext)) => Err(FormatMismatch), // misaligned
        (Some(fmt), None) => Ok(fmt),                    // magic bytes only
        (None, Some(ext)) => Ok(from_extension(ext)),
        _ => Err(UnknownFormat),
    }
    ```
    
    ## Extraction Modules (75 Formats)
    
    | Category     | Extractors                                       | Key Modules                                          |
    | ------------ | ------------------------------------------------ | ---------------------------------------------------- |
    | **Office**   | DOCX, XLSX, XLSM, XLSB, XLS, PPTX, ODP, ODS      | `extraction/{docx,excel,pptx}.rs`                    |
    | **PDF**      | Standard + encrypted, password attempts          | `pdf/` subdirectory (13 files)                       |
    | **Images**   | PNG, JPG, TIFF, WebP, JP2, SVG (OCR-enabled)     | `extraction/image.rs` + `ocr/`                       |
    | **Web**      | HTML, XHTML, XML, SVG (DOM parsing)              | `extraction/html.rs` (67KB - complex table handling) |
    | **Email**    | EML, MSG (headers, body, attachments, threading) | `extraction/email.rs`                                |
    | **Archives** | ZIP, TAR, GZ, 7Z (recursive extraction)          | `extraction/archive.rs` (31KB)                       |
    | **Markdown** | MD, TXT, RST, Org Mode, RTF                      | `extraction/markdown.rs`                             |
    | **Academic** | LaTeX, BibTeX, JATS, Jupyter, DocBook            | `extraction/{structured,xml}.rs`                     |
    
    ## Extraction Dispatcher
    
    ```rust
    // Pseudocode: extraction/mod.rs
    let format = detect_format(source.bytes, source.extension);
    let result = match format {
        Pdf => extract_pdf(source, config),
        Docx => extract_docx(source, config),
        Image => extract_image_with_ocr_fallback(source, config),
        Archive => extract_archive_recursive(source, config),
        _ => extract_with_plugin(format, source, config),
    };
    run_pipeline(result, config)  // post-processing always runs
    ```
    
    ## Fallback Strategies
    
    - **Password-Protected PDFs**: Try primary password -> secondary password list -> return `is_encrypted=true` in metadata on failure
    - **OCR Fallback**: If image text extraction confidence < threshold, trigger OCR backend; return both results with scores
    - **Nested Archives**: Recursive extraction with configurable depth limit; flatten or preserve hierarchy
    - **Corrupted File Recovery**: Stream-based parsing, emit content up to error point, include error location in metadata
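    
    The OCR-fallback rule above can be sketched as a small decision function. This is a hypothetical illustration, not Kreuzberg's API: `ExtractionAttempt` and the `ocr` callable are stand-ins for the real extraction and OCR backends.
    
    ```python
    from dataclasses import dataclass
    
    @dataclass
    class ExtractionAttempt:
        text: str
        confidence: float
        method: str
    
    def extract_with_ocr_fallback(native, threshold=0.6, ocr=None):
        """If native confidence < threshold, also run the OCR backend;
        return all attempts best-first so callers can inspect both scores."""
        attempts = [native]
        if native.confidence < threshold and ocr is not None:
            attempts.append(ocr())  # trigger the OCR backend
        return sorted(attempts, key=lambda a: a.confidence, reverse=True)
    ```
    
    Returning both results with scores (rather than silently replacing one) matches the "return both results with scores" behavior described above.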
    
    ## Configuration Integration
    
    **Location**: `crates/kreuzberg/src/core/config.rs`, `crates/kreuzberg/src/core/config_validation.rs`
    
    `ExtractionConfig` holds format-specific configs (`pdf`, `image`, `html`, `office`), fallback orchestration (`fallback`), and post-processing (`postprocessor`, `chunking`, `keywords`). See struct definition in `config.rs`.
    
    ## Plugin System Integration
    
    **Location**: `crates/kreuzberg/src/plugins/`
    
    - **CustomExtractor**: Override built-in format extractors
    - **PostProcessor**: Modify results after extraction (Early/Middle/Late stages)
    - **Validator**: Fail-fast validation (e.g., minimum text length)
    - **OCRBackend**: Swap OCR engine
    
    Plugin registry loaded at startup, cached for zero-cost lookup.
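    
    The first-registered-wins priority rule (see Critical Rules below) can be sketched as follows; class and method names here are illustrative, not the actual plugin API.
    
    ```python
    class PluginRegistry:
        """First-registered plugin wins for a given MIME type."""
    
        def __init__(self):
            self._extractors = {}  # mime type -> extractor
    
        def register_extractor(self, mime_type, extractor):
            # setdefault keeps the earlier registration, so plugins
            # registered first take priority over later ones.
            self._extractors.setdefault(mime_type, extractor)
    
        def lookup(self, mime_type):
            return self._extractors.get(mime_type)
    ```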
    
    ## Feature Flag Strategy
    
    **Location**: `Cargo.toml` (workspace), `crates/kreuzberg/Cargo.toml`, `FEATURE_MATRIX.md`
    
    20+ features across 9 language bindings. Key feature groups:
    
    | Group    | Features                                                                             | Notes                             |
    | -------- | ------------------------------------------------------------------------------------ | --------------------------------- |
    | OCR      | `tesseract` (default), `tesseract-static`, `ocr-minimal`                             | Pick exactly one (mutually exclusive) |
    | Formats  | `pdf`, `pdf-minimal`, `office`, `office-minimal`                                     |                                   |
    | AI/ML    | `embeddings` (requires ONNX), `keywords-yake`, `keywords-rake`, `language-detection` |                                   |
    | Server   | `api` (Axum), `mcp`, `tokio-runtime`, `lite-runtime`                                 |                                   |
    | Bindings | `python-bindings`, `ruby-bindings`, `php-bindings`, `node-bindings`, `wasm`          |                                   |
    
    Conditional compilation: modules gated with `#[cfg(feature = "...")]`. Runtime `validate_config()` warns if requested feature not compiled in.
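    
    The runtime check reduces to comparing requested capabilities against what was compiled in. A minimal sketch, assuming a flat set of feature names (the function and the feature set here are illustrative, not the real `validate_config()` signature):
    
    ```python
    # Stand-in for the feature set baked in at compile time.
    COMPILED_FEATURES = {"pdf", "office", "tesseract"}
    
    def validate_config(requested_features, compiled=COMPILED_FEATURES):
        """Return a warning per feature that was requested but not compiled in."""
        return [
            f"feature '{f}' requested but not compiled in"
            for f in sorted(requested_features)
            if f not in compiled
        ]
    ```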
    
    ### Feature Flag Critical Rules
    
    1. **Never mix conflicting features** - e.g., `ocr-minimal` + `tesseract` should error at compile time
    2. **Always provide feature diagnostics** - Config validation must warn if feature unavailable
    3. **Default to maximum feature set** - Unless embedded/minimal specifically requested
    4. **Test all feature combinations** - Matrix testing in CI catches regressions
    5. **WASM incompatible** with embeddings, keywords, OCR
    
    ## Critical Rules
    
    1. **Always use format detection** before routing to extractors (prevent confusion attacks)
    2. **Stream-based parsing** for PDFs/archives to handle multi-GB files
    3. **Post-pipeline is mandatory**: All extraction results flow through `run_pipeline()` for validators/hooks
    4. **Plugin overrides are order-dependent**: Plugins registered first take priority
    5. **Fallback timeouts**: Set reasonable OCR/archive extraction timeouts (config-driven)
    6. **Metadata preservation**: Include format detection confidence, extraction method used, any fallbacks applied
    
    ## Related Skills
    
    - **ocr-backend-management** - OCR engine selection and image preprocessing
    - **chunking-embeddings** - Post-extraction text splitting with FastEmbed
    - **api-server-mcp** - Axum endpoint for extraction pipeline exposure and MCP server
    
  • .ai-rulez/skills/format-specific-extraction/SKILL.md (2735 bytes)
    ---
    name: format-specific-extraction
    description: "Format-specific document extraction workflows"
    priority: high
    ---
    
    # Format-Specific Extraction Workflows
    
    ## Office XML (DOCX/PPTX/ODT)
    
    ```text
    ZIP archive → Security validation → XML parsing → Text + tables + metadata
    ```
    
    1. `ZipBombValidator::new(limits).validate(&mut archive)?`
    2. Extract XML files from archive (`word/document.xml`, `ppt/slides/*.xml`, `content.xml`)
    3. Parse with `quick-xml::Reader` (streaming) + `DepthValidator` + `StringGrowthValidator`
    4. Extract metadata via `crate::extraction::office_metadata::extract_metadata()`
    5. See: `extractors/docx.rs`, `extractors/pptx.rs`, `extractors/odt.rs`
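    
    The ZIP -> XML flow above can be sketched in a few lines using Python's stdlib instead of the Rust modules named (security validators omitted for brevity; only the `word/document.xml` text-run extraction is shown):
    
    ```python
    import io
    import zipfile
    import xml.etree.ElementTree as ET
    
    # WordprocessingML namespace for <w:t> text runs.
    W_NS = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"
    
    def docx_text(data: bytes) -> str:
        """Read word/document.xml out of the ZIP and join its text runs."""
        with zipfile.ZipFile(io.BytesIO(data)) as archive:
            xml_bytes = archive.read("word/document.xml")
        root = ET.fromstring(xml_bytes)
        return "".join(node.text or "" for node in root.iter(f"{W_NS}t"))
    ```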
    
    ## PDF
    
    ```text
    Bytes → pdf_oxide → Per-page text + OCR fallback → Tables → Metadata
    ```
    
    1. `pdf_oxide::PdfDocument::from_bytes(content)?`
    2. Check if needs OCR: `config.force_ocr || !has_searchable_text()`
    3. Extract text per page, tables if `config.pages` enabled
    4. Feature-gated: `#[cfg(feature = "pdf")]`
    5. See: `extractors/pdf/mod.rs`
    
    ## Archives (ZIP/TAR/7z/GZIP)
    
    ```text
    Validate → Extract metadata → Extract plaintext files only
    ```
    
    1. `ZipBombValidator` BEFORE any extraction
    2. Extract metadata (file list, sizes)
    3. Extract text content from plaintext files
    4. Use `build_archive_result()` helper
    5. See: `extractors/archive.rs`, `extraction/archive/*.rs`
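    
    Step 1's validate-before-extract rule can be illustrated with a simple zip-bomb heuristic: check declared uncompressed size and compression ratio before reading any entry. This is a sketch, not `ZipBombValidator`'s actual logic or limits.
    
    ```python
    import io
    import zipfile
    
    def validate_zip(data: bytes, max_total: int = 1 << 30, max_ratio: float = 100.0):
        """Reject archives whose declared uncompressed size or
        compression ratio exceeds the configured limits."""
        with zipfile.ZipFile(io.BytesIO(data)) as archive:
            total = sum(info.file_size for info in archive.infolist())
            packed = sum(info.compress_size for info in archive.infolist())
        if total > max_total:
            raise ValueError("archive exceeds total uncompressed size limit")
        if packed and total / packed > max_ratio:
            raise ValueError("suspicious compression ratio (possible zip bomb)")
        return total
    ```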
    
    ## Structured Text (JSON/YAML/TOML/XML)
    
    ```text
    Detect format from MIME → Parse → Pretty-print → Metadata
    ```
    
    Single `StructuredExtractor` handles multiple MIME types. Parse with format-specific library, pretty-print to text.
    See: `extractors/structured.rs`
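    
    The parse -> pretty-print step is trivial per format; the JSON branch, sketched with the stdlib (the real `StructuredExtractor` dispatches on MIME type first):
    
    ```python
    import json
    
    def extract_structured_json(raw: str) -> str:
        """Parse then pretty-print so downstream chunking sees readable text."""
        return json.dumps(json.loads(raw), indent=2, sort_keys=True)
    ```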
    
    ## Email (EML/MSG)
    
    ```text
    Parse headers → Extract body (text/html) → Process attachments
    ```
    
    See: `extraction/email.rs`, `extractors/email.rs`
    
    ## Common Helpers
    
    | Helper                                | Location                    | Purpose                        |
    | ------------------------------------- | --------------------------- | ------------------------------ |
    | `office_metadata::extract_metadata()` | `extraction/office.rs`      | Office XML metadata            |
    | `cells_to_markdown()`                 | `extraction/mod.rs`         | Convert cell grid to GFM table |
    | `build_archive_result()`              | `extraction/archive/mod.rs` | Standard archive result        |
    
    ## Adding a New Format
    
    1. Add MIME type to `EXT_TO_MIME` in `core/mime.rs`
    2. Create extractor implementing `DocumentExtractor` trait
    3. Set `supported_mime_types()` and `priority()` (default: 50)
    4. Register in `extractors/mod.rs` → `register_default_extractors()`
    5. Feature-gate if optional: `#[cfg(feature = "my-format")]`
    6. Apply security validators for user content
    7. Add tests with fixture files
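    
    Steps 2-4 above, sketched in Python terms: an extractor declares its MIME types and priority, and the registry keeps the highest-priority extractor per type. Names are illustrative, not the Rust trait's exact API.
    
    ```python
    class PlainTextExtractor:
        def supported_mime_types(self):
            return ["text/plain"]
    
        def priority(self):
            return 50  # default priority
    
        def extract(self, data: bytes) -> str:
            return data.decode("utf-8", errors="replace")
    
    def register(registry: dict, extractor) -> None:
        """Keep the highest-priority extractor for each MIME type."""
        for mime in extractor.supported_mime_types():
            current = registry.get(mime)
            if current is None or extractor.priority() > current.priority():
                registry[mime] = extractor
    ```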
    
  • .ai-rulez/skills/api-server-mcp/SKILL.md (9419 bytes)
    ---
    description: "REST API server and MCP protocol integration"
    name: api-server-mcp
    priority: critical
    ---
    
    # API Server & MCP Protocol
    
    **Axum server design for document extraction endpoints, middleware, async processing, and Model Context Protocol integration for AI agents**
    
    ## Kreuzberg API Architecture
    
    **Location**: `crates/kreuzberg/src/api/`, `crates/kreuzberg-cli/`
    
    Kreuzberg provides a dual REST API + MCP server built with Axum + Tokio.
    
    ```text
    Request Flow:
    HTTP Client / AI Agent (Claude)
        |
    [Transport Layer]
    ├── REST API (Axum HTTP)
    └── MCP Protocol (HTTP or Stdio)
        |
    [Middleware Layer]
    ├── CORS, Request Logging (TraceLayer)
    ├── Request/Response size limits
    └── Rate limiting (optional)
        |
    [Router]
    ├── REST Endpoints
    │   ├── POST /extract - File upload extraction
    │   ├── POST /extract-url - URL-based extraction
    │   ├── GET /formats - List supported formats
    │   ├── GET /health - Server health check
    │   ├── POST /batch - Batch document processing
    │   ├── GET /cache/stats - Cache statistics
    │   └── DELETE /cache - Clear extraction cache
    ├── MCP Endpoints
    │   ├── POST /mcp/tools - List available tools
    │   ├── POST /mcp/tools/call - Call a tool
    │   ├── GET /mcp/resources - List resources
    │   ├── GET /mcp/resources/:uri - Read resource
    │   ├── GET /mcp/prompts - List prompts
    │   └── GET /mcp/prompts/:name - Get prompt
        |
    [Handler / Tool Layer]
    ├── extract_handler / extract_file tool
    ├── batch_handler / batch_extract tool
    ├── health_handler / get_capabilities tool
    └── format_handler
        |
    [Extraction Core]
    ├── Format detection
    ├── Extraction pipeline
    ├── Post-processing (chunking, embeddings)
    └── Result formatting
        |
    JSON Response / MCP ToolResult
    ```
    
    ## Server Setup & Configuration
    
    **Location**: `crates/kreuzberg/src/api/server.rs`
    
    Server initialization pattern: Create `ApiState` (holds `ExtractionConfig` + `ExtractionCache`), build Axum `Router` with all REST + MCP routes, apply middleware layers (body limits, CORS, tracing), serve via `tokio::net::TcpListener`.
    
    Key middleware layers applied in order:
    
    - `DefaultBodyLimit::max(100MB)` + `RequestBodyLimitLayer` -- configurable via env vars
    - `CorsLayer::permissive()` -- restrict in production via `CORS_ALLOWED_ORIGINS`
    - `TraceLayer::new_for_http()` -- request/response logging
    
    ## Core REST Handlers
    
    **Location**: `crates/kreuzberg/src/api/handlers.rs`
    
    | Handler               | Method            | Description                                                                                            |
    | --------------------- | ----------------- | ------------------------------------------------------------------------------------------------------ |
    | `extract_handler`     | POST /extract     | Multipart upload: parse file + optional config JSON, check cache, call `extract_bytes()`, cache result |
    | `extract_url_handler` | POST /extract-url | Fetch URL via reqwest, extract bytes                                                                   |
    | `batch_handler`       | POST /batch       | Parallel extraction with `Semaphore`-limited concurrency (default: CPU count)                          |
    | `health_handler`      | GET /health       | Report status, version, uptime, feature availability (OCR, embeddings), cache stats                    |
    | `formats_handler`     | GET /formats      | Return supported format categories (office, pdf, images, web, email, archives, academic)               |
    | `cache_stats_handler` | GET /cache/stats  | Hit/miss counts and hit rate                                                                           |
    | `cache_clear_handler` | DELETE /cache     | Clear LRU cache                                                                                        |
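    
    The `batch_handler` concurrency pattern can be sketched with `asyncio` (the real handler uses a Tokio `Semaphore`; `extract` here is any async callable and the default limit stands in for the CPU count):
    
    ```python
    import asyncio
    
    async def batch_extract(paths, extract, limit: int = 4):
        """Run extractions in parallel, at most `limit` in flight."""
        sem = asyncio.Semaphore(limit)
    
        async def one(path):
            async with sem:  # semaphore bounds concurrency
                return await extract(path)
    
        return await asyncio.gather(*(one(p) for p in paths))
    ```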
    
    ## Caching Strategy
    
    **Location**: `crates/kreuzberg/src/cache/mod.rs`
    
    LRU cache keyed by `SHA256(file_content)`, stores `Arc<ExtractionResult>`. Default 1000 entries. Thread-safe via `RwLock`. Tracks hit/miss counters with `AtomicU64` for stats endpoint.
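    
    A minimal sketch of that strategy (single-threaded; the real cache adds `RwLock` guards and `Arc`-wrapped results):
    
    ```python
    import hashlib
    from collections import OrderedDict
    
    class ExtractionCache:
        """LRU cache keyed by SHA-256 of the file bytes, with hit/miss stats."""
    
        def __init__(self, capacity: int = 1000):
            self.capacity = capacity
            self._entries = OrderedDict()
            self.hits = self.misses = 0
    
        @staticmethod
        def key(content: bytes) -> str:
            return hashlib.sha256(content).hexdigest()
    
        def get(self, content: bytes):
            k = self.key(content)
            if k in self._entries:
                self.hits += 1
                self._entries.move_to_end(k)  # mark as recently used
                return self._entries[k]
            self.misses += 1
            return None
    
        def put(self, content: bytes, result) -> None:
            k = self.key(content)
            self._entries[k] = result
            self._entries.move_to_end(k)
            if len(self._entries) > self.capacity:
                self._entries.popitem(last=False)  # evict least recently used
    ```
    
    Keying by content hash rather than file path means identical files uploaded under different names share one cache entry.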
    
    ## Error Handling
    
    **Location**: `crates/kreuzberg/src/api/error.rs`
    
    `ApiError` enum maps to HTTP status codes:
    
    - `MissingFile` -> 400, `FileNotFound` -> 404
    - `OnnxRuntimeMissing` / `TesseractMissing` -> 503 (with remediation message)
    - `PayloadTooLarge` -> 413
    - `ExtractionFailed` / `InvalidConfig` / `UnsupportedFormat` -> 500
    
    ## MCP Server Implementation
    
    **Location**: `crates/kreuzberg/src/mcp/server.rs`
    
    The MCP server allows Claude and other AI agents to call Kreuzberg extraction functions through the Model Context Protocol.
    
    ### MCP Tools (Callable Functions)
    
    Three tools are registered:
    
    | Tool               | Purpose                                                   | Required Params |
    | ------------------ | --------------------------------------------------------- | --------------- |
    | `extract_file`     | Extract text/tables/metadata from documents (75+ formats) | `file_path`     |
    | `batch_extract`    | Extract from multiple documents in parallel               | `file_paths[]`  |
    | `get_capabilities` | List supported formats, features, backends                | (none)          |
    
    **Tool registration pattern** (example: `extract_file`):
    
    ```rust
    // Define Tool with name, description, JSON Schema inputSchema
    // Register with server.register_tool(tool, handler_fn)
    // Handler: parse params -> build ExtractionConfig -> call extract_file() -> return ToolResult as JSON
    ```
    
    `extract_file` optional params: `format`, `extract_tables`, `extract_images`, `ocr_enabled`, `extract_metadata`, `chunking_preset`, `generate_embeddings`.
    
    ### MCP Resources (Static Knowledge)
    
    Three resources provide static information to agents:
    
    - `kreuzberg://formats` -- Supported format list as JSON
    - `kreuzberg://features` -- Cross-binding feature matrix (from `FEATURE_MATRIX.md`)
    - `kreuzberg://api-reference` -- Generated API documentation
    
    ### MCP Prompts (Agent Templates)
    
    Two prompts guide agent extraction workflows:
    
    - `extract_for_rag` -- Document type-specific RAG extraction guidance (research paper, contract, report). Recommends chunking preset and embedding config.
    - `batch_document_processing` -- Optimal concurrency, grouping, and error handling for batch workflows.
    
    ### MCP Transport Protocols
    
    - **HTTP/REST**: MCP routes mounted alongside REST API on separate `/mcp/` prefix
    - **Stdio**: JSON-RPC 2.0 over stdin/stdout for local CLI integration (e.g., Claude Desktop)
    
    ### Integration with Claude Desktop
    
    ```json
    {
      "mcpServers": {
        "kreuzberg": {
          "command": "kreuzberg-mcp",
          "env": {
            "KREUZBERG_API_BASE": "http://localhost:8000",
            "KREUZBERG_MCP_TRANSPORT": "stdio"
          }
        }
      }
    }
    ```
    
    ### MCP Error Handling
    
    `ToolError` variants: `FileNotFound`, `UnsupportedFormat`, `ExtractionFailed`, `OnnxRuntimeMissing`, `TesseractMissing`, `Timeout`. Each maps to an MCP `ToolResultError` with descriptive code and message.
    
    ## Environment Configuration
    
    See `.env.example` for all configurable variables. Key categories:
    
    - **Server**: `KREUZBERG_HOST`, `KREUZBERG_PORT`
    - **Size limits**: `KREUZBERG_MAX_REQUEST_BODY_BYTES` (default 100MB), `KREUZBERG_MAX_MULTIPART_FIELD_BYTES`
    - **Features**: `KREUZBERG_ENABLE_OCR`, `KREUZBERG_ENABLE_EMBEDDINGS`, `KREUZBERG_ENABLE_KEYWORDS`
    - **Cache**: `KREUZBERG_CACHE_ENABLED`, `KREUZBERG_CACHE_SIZE`
    - **CORS**: `CORS_ALLOWED_ORIGINS` (comma-separated)
    - **MCP**: `KREUZBERG_MCP_HOST`, `KREUZBERG_MCP_PORT`, `KREUZBERG_MCP_TRANSPORT` (stdio/http)
    - **Logging**: `RUST_LOG=kreuzberg=info,tower_http=debug`
    
    ## Critical Rules
    
    ### REST API Rules
    
    1. **Always validate multipart file uploads** - Check MIME type, size, magic bytes
    2. **Timeout long-running extractions** - Set per-handler timeout (5 min default)
    3. **Stream large files** - Never buffer entire multi-GB file in memory
    4. **Cache aggressively** - Identical files should return from cache in <1ms
    5. **Parallel extraction is CPU-bound** - Limit workers to CPU count + 1
    6. **Error responses must be actionable** - Include error code and remediation suggestion
    7. **Health checks must verify features** - Report missing dependencies (ONNX, Tesseract)
    8. **Size limits are configurable** - Allow override via env var for large deployments
    9. **CORS is permissive by default** - Restrict in production via env var
    10. **Log all requests** - Track extraction metrics for observability
    
    ### MCP Rules
    
    1. **All tools must have timeout** - Prevent hanging on large files (default 5 min)
    2. **Error responses must be detailed** - Include suggestions for missing dependencies
    3. **Feature gates must be checked** - Return helpful message if feature unavailable (embeddings, OCR)
    4. **Resources should be static** - Don't query external services in resource handlers
    5. **Prompts guide agents** - Provide clear examples and best practices
    6. **Batch tools must support cancellation** - Allow agent to stop long-running batch operations
    7. **Log all tool calls** - Track usage for analytics and debugging
    
    ## Related Skills
    
    - **extraction-pipeline-patterns** - Core extraction called by handlers and MCP tools
    - **chunking-embeddings** - Optional chunking/embedding parameters in extraction
    - **ocr-backend-management** - OCR engine selection and image preprocessing
    
  • .ai-rulez/skills/chunking-embeddings/SKILL.md (5734 bytes)
    ---
    description: "Chunking, embeddings, and RAG pipeline integration"
    name: chunking-embeddings
    priority: critical
    ---
    
    # Chunking & Embeddings
    
    **Text splitting strategies, embedding generation with FastEmbed, RAG pipeline integration**
    
    ## Chunking Architecture Overview
    
    **Location**: `crates/kreuzberg/src/chunking/`, `crates/kreuzberg/src/embeddings.rs`
    
    ```text
    Extracted Text
        |
    [1. Normalization] -> Clean whitespace, remove control chars
        |
    [2. Chunk Strategy Selection] -> Fixed-size, semantic, syntax-aware, recursive
        |
    [3. Overlap Management] -> Control context window overlap
        |
    [4. Optional Embedding] -> Generate vectors with FastEmbed
        |
    Output: Vec<Chunk> with text, vectors, metadata
    ```
    
    ## Chunking Strategies
    
    **Location**: `crates/kreuzberg/src/chunking/mod.rs`
    
    | Strategy                          | Pattern                                                 | Best For                                                           |
    | --------------------------------- | ------------------------------------------------------- | ------------------------------------------------------------------ |
    | **Fixed-Size**                    | Sliding window with configurable overlap                | Uniform chunks for embedding models with fixed token limits        |
    | **Semantic**                      | Split by sentences, merge/split by similarity threshold | Smart context preservation for LLM consumption and semantic search |
    | **Syntax-Aware**                  | Split by paragraph/section/heading/code-block structure | Preserving document structure (sections, code blocks) in RAG       |
    | **Recursive** (LangChain pattern) | Try separators in order: `\n\n`, `\n`, `,`              | Best general-purpose chunking; auto-finds optimal split points     |
    
    Key config fields per strategy (see struct definitions in `chunking/mod.rs`):
    
    - Fixed-Size: `chunk_size`, `overlap`, `trim_whitespace`
    - Semantic: `target_chunk_size`, `min/max_chunk_size`, `semantic_threshold`, `use_sentence_boundaries`
    - Syntax-Aware: `chunk_by` (Paragraph/Section/Heading/Sentence/CodeBlock), `max_chunk_size`, `respect_code_blocks`
    - Recursive: `separators[]`, `chunk_size`, `overlap`
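    
    The Fixed-Size strategy reduces to a sliding window over the text. A character-based sketch (the real implementation may count tokens; the function name is illustrative):
    
    ```python
    def fixed_size_chunks(text: str, chunk_size: int = 512, overlap: int = 50):
        """Sliding window: each chunk starts chunk_size - overlap after
        the previous one, so consecutive chunks share `overlap` characters."""
        if overlap >= chunk_size:
            raise ValueError("overlap must be smaller than chunk_size")
        step = chunk_size - overlap
        return [
            text[i:i + chunk_size]
            for i in range(0, max(len(text) - overlap, 1), step)
        ]
    ```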
    
    ## Chunking Configuration Presets
    
    **Location**: `crates/kreuzberg/src/chunking/mod.rs`
    
    | Preset       | Chunk Size  | Overlap | Strategy   | Use Case               |
    | ------------ | ----------- | ------- | ---------- | ---------------------- |
    | **Balanced** | 512 tokens  | 50      | Semantic   | RAG sweet spot         |
    | **Compact**  | 256 tokens  | 32      | Fixed-Size | Dense vectors          |
    | **Extended** | 1024 tokens | 100     | Recursive  | Full context           |
    | **Minimal**  | 128 tokens  | 16      | (default)  | Lightweight embeddings |
    
    Usage: set `config.chunking.preset = Some("balanced")` in `ExtractionConfig`.
    
    ## Embedding Generation with FastEmbed
    
    **Location**: `crates/kreuzberg/src/embeddings.rs`
    
    ### Model Selection
    
    | Model                               | Dimensions | Notes                            |
    | ----------------------------------- | ---------- | -------------------------------- |
    | `BAAI/bge-small-en-v1.5` (default)  | 384        | Fast, excellent for RAG          |
    | `BAAI/bge-small-zh-v1.5`            | 384        | Chinese optimized                |
    | `BAAI/bge-base-en-v1.5`             | 768        | Better quality, slower           |
    | `jinaai/jina-embeddings-v2-base-en` | 768        | Long context (up to 8192 tokens) |
    | `Custom(path)`                      | varies     | Custom ONNX model path           |
    
    ### Embedding Pattern
    
    `TextEmbeddingManager` provides singleton-cached models per config. Pattern:
    
    1. `get_or_init_model()` -- lazy-loads ONNX model (downloads if needed), caches in `Arc<RwLock<HashMap>>`
    2. `embed_chunks()` -- collects chunk texts, calls `model.embed(texts, batch_size)`, zips results back to `ChunkWithEmbedding`
    
    Default config: `batch_size=256`, `device=CPU`, `parallel_requests=4`.
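    
    The singleton-cached model pattern in step 1 amounts to a lock-guarded lazy map. A sketch where `loader` stands in for the ONNX download/initialization (not the real `TextEmbeddingManager` API):
    
    ```python
    import threading
    
    class ModelCache:
        """Lazy-load each model once; subsequent calls reuse the cached instance."""
    
        def __init__(self, loader):
            self._loader = loader
            self._models = {}
            self._lock = threading.Lock()
            self.loads = 0
    
        def get_or_init_model(self, name: str):
            with self._lock:  # one thread initializes, others reuse
                if name not in self._models:
                    self._models[name] = self._loader(name)
                    self.loads += 1
                return self._models[name]
    ```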
    
    ### ONNX Runtime Requirement
    
    Embeddings require ONNX Runtime. Feature-gated via:
    
    ```toml
    [features]
    embeddings = ["dep:fastembed", "dep:ort"]
    ```
    
    Install: `brew install onnxruntime` (macOS) / `apt install libonnxruntime libonnxruntime-dev` (Linux). Verify: `echo $ORT_DYLIB_PATH`.
    
    ## RAG Integration Pattern
    
    The full extraction-to-RAG pipeline:
    
    1. **Extract**: `extract_file(path, config)` -> `ExtractionResult`
    2. **Chunk**: Apply preset strategy to `result.content` -> `Vec<Chunk>`
    3. **Embed**: If embedding config present, `TextEmbeddingManager::embed_chunks()` -> `Vec<ChunkWithEmbedding>`
    4. **Output**: `RagDocument { file_path, metadata, chunks }` ready for vector DB ingestion
    
    See `ChunkWithEmbedding` struct in `types.rs`: contains `text`, `embedding: Vec<f32>`, `dimensions`, `norm`, `metadata`.
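    
    Steps 1-4 can be wired together as a small pipeline. A hypothetical end-to-end sketch with stand-ins for the extract/chunk/embed stages and `RagDocument` modeled as a plain dict:
    
    ```python
    def build_rag_document(file_path, extract, chunk, embed=None):
        """Extract -> chunk -> (optionally) embed -> assemble RAG document."""
        result = extract(file_path)                        # 1. extract
        texts = chunk(result["content"])                   # 2. chunk
        if embed is not None:                              # 3. optional embed
            chunks = [{"text": t, "embedding": embed(t)} for t in texts]
        else:
            chunks = [{"text": t} for t in texts]
        return {                                           # 4. vector-DB-ready output
            "file_path": file_path,
            "metadata": result["metadata"],
            "chunks": chunks,
        }
    ```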
    
    ## Critical Rules
    
    1. **Chunking is preprocessing** - Always apply before embedding to ensure consistent vector sizes
    2. **Overlap prevents information loss** - Set overlap to 15-20% of chunk size
    3. **Embedding models are stateful** - Lazy load and cache to avoid repeated initialization
    4. **ONNX Runtime is required** - Gracefully degrade if not available (skip embeddings)
    5. **Batch embedding for performance** - Never embed single chunks; batch 50-1000 chunks
    6. **Normalize embeddings for search** - Use L2 norm for cosine similarity
    7. **Cache embedding results** - Don't re-embed identical text chunks
    8. **Model selection impacts quality** - bge-small (384) for speed, bge-base (768) for quality
    
    ## Related Skills
    
    - **extraction-pipeline-patterns** - Text extraction preceding chunking
    - **api-server-mcp** - Endpoint for chunking + embedding operations
    - **ocr-backend-management** - OCR text quality affects chunking success
    
  • skills/kreuzberg/SKILL.md (15343 bytes)
    ---
    name: kreuzberg
    description: >-
      Extract text, tables, metadata, and images from 91+ document formats
      (PDF, Office, images, HTML, email, archives, academic) using Kreuzberg.
      Use when writing code that calls Kreuzberg APIs in Python, Node.js/TypeScript,
      Rust, or CLI. Covers installation, extraction (sync/async), configuration
      (OCR, chunking, output format), batch processing, error handling, and plugins.
    license: Elastic-2.0
    metadata:
      author: kreuzberg-dev
      version: "1.0"
      repository: https://github.com/kreuzberg-dev/kreuzberg
    ---
    
    # Kreuzberg Document Extraction
    
    Kreuzberg is a high-performance document intelligence library with a Rust core and native bindings for Python, Node.js/TypeScript, Ruby, Go, Java, C#, PHP, and Elixir. It extracts text, tables, metadata, and images from 91+ file formats including PDF, Office documents, images (with OCR), HTML, email, archives, and academic formats.
    
    Use this skill when writing code that:
    
    - Extracts text or metadata from documents
    - Performs OCR on scanned documents or images
    - Batch-processes multiple files
    - Configures extraction options (output format, chunking, OCR, language detection)
    - Implements custom plugins (post-processors, validators, OCR backends)
    
    ## Installation
    
    ### Python
    
    ```bash
    pip install kreuzberg
    # Optional OCR backends:
    pip install kreuzberg[easyocr]    # EasyOCR
    ```
    
    ### Node.js
    
    ```bash
    npm install @kreuzberg/node
    ```
    
    ### Rust
    
    ```toml
    # Cargo.toml
    [dependencies]
    kreuzberg = { version = "4", features = ["tokio-runtime"] }
    # features: tokio-runtime (required for sync + batch), pdf, ocr, chunking,
    #           embeddings, language-detection, keywords-yake, keywords-rake
    ```
    
    ### CLI
    
    ```bash
    # Download from GitHub releases, or:
    cargo install kreuzberg-cli
    ```
    
    ## Quick Start
    
    ### Python (Async)
    
    ```python
    from kreuzberg import extract_file
    
    result = await extract_file("document.pdf")
    print(result.content)       # extracted text
    print(result.metadata)      # document metadata
    print(result.tables)        # extracted tables
    ```
    
    ### Python (Sync)
    
    ```python
    from kreuzberg import extract_file_sync
    
    result = extract_file_sync("document.pdf")
    print(result.content)
    ```
    
    ### Node.js
    
    ```typescript
    import { extractFile } from "@kreuzberg/node";
    
    const result = await extractFile("document.pdf");
    console.log(result.content);
    console.log(result.metadata);
    console.log(result.tables);
    ```
    
    ### Node.js (Sync)
    
    ```typescript
    import { extractFileSync } from "@kreuzberg/node";
    
    const result = extractFileSync("document.pdf");
    ```
    
    ### Rust (Async)
    
    ```rust
    use kreuzberg::{extract_file, ExtractionConfig};
    
    #[tokio::main]
    async fn main() -> kreuzberg::Result<()> {
        let config = ExtractionConfig::default();
        let result = extract_file("document.pdf", None, &config).await?;
        println!("{}", result.content);
        Ok(())
    }
    ```
    
    ### Rust (Sync) — requires `tokio-runtime` feature
    
    ```rust
    use kreuzberg::{extract_file_sync, ExtractionConfig};
    
    fn main() -> kreuzberg::Result<()> {
        let config = ExtractionConfig::default();
        let result = extract_file_sync("document.pdf", None, &config)?;
        println!("{}", result.content);
        Ok(())
    }
    ```
    
    ### CLI
    
    ```bash
    kreuzberg extract document.pdf
    kreuzberg extract document.pdf --format json
    kreuzberg extract document.pdf --output-format markdown
    ```
    
    ## Configuration
    
    All languages use the same configuration structure with language-appropriate naming conventions.
    
    ### Python (snake_case)
    
    ```python
    from kreuzberg import (
        ExtractionConfig, OcrConfig, TesseractConfig,
        PdfConfig, ChunkingConfig,
    )
    
    config = ExtractionConfig(
        ocr=OcrConfig(
            backend="tesseract",
            language="eng",
            tesseract_config=TesseractConfig(psm=6, enable_table_detection=True),
        ),
        pdf_options=PdfConfig(passwords=["secret123"]),
        chunking=ChunkingConfig(max_chars=1000, max_overlap=200),
        output_format="markdown",
    )
    
    result = await extract_file("document.pdf", config=config)
    ```
    
    ### Node.js (camelCase)
    
    ```typescript
    import { extractFile, type ExtractionConfig } from "@kreuzberg/node";
    
    const config: ExtractionConfig = {
      ocr: { backend: "tesseract", language: "eng" },
      pdfOptions: { passwords: ["secret123"] },
      chunking: { maxChars: 1000, maxOverlap: 200 },
      outputFormat: "markdown",
    };
    
    const result = await extractFile("document.pdf", null, config);
    ```
    
    ### Rust (snake_case)
    
    ```rust
    use kreuzberg::{ExtractionConfig, OcrConfig, ChunkingConfig, OutputFormat};
    
    let config = ExtractionConfig {
        ocr: Some(OcrConfig {
            backend: "tesseract".into(),
            language: "eng".into(),
            ..Default::default()
        }),
        chunking: Some(ChunkingConfig {
            max_characters: 1000,
            overlap: 200,
            ..Default::default()
        }),
        output_format: OutputFormat::Markdown,
        ..Default::default()
    };
    
    let result = extract_file("document.pdf", None, &config).await?;
    ```
    
    ### Config File (TOML)
    
    ```toml
    output_format = "markdown"
    
    [ocr]
    backend = "tesseract"
    language = "eng"
    
    [chunking]
    max_chars = 1000
    max_overlap = 200
    
    [pdf_options]
    passwords = ["secret123"]
    ```
    
    ```bash
    # CLI: auto-discovers kreuzberg.toml in current/parent directories
    kreuzberg extract doc.pdf
    # or explicit:
    kreuzberg extract doc.pdf --config kreuzberg.toml
    kreuzberg extract doc.pdf --config-json '{"ocr":{"backend":"tesseract","language":"deu"}}'
    ```
    
    ## Batch Processing
    
    ### Python
    
    ```python
    from kreuzberg import batch_extract_files, batch_extract_files_sync
    
    # Async
    results = await batch_extract_files(["doc1.pdf", "doc2.docx", "doc3.xlsx"])
    
    # Sync
    results = batch_extract_files_sync(["doc1.pdf", "doc2.docx"])
    
    for result in results:
        print(f"{len(result.content)} chars extracted")
    ```
    
    ### Node.js
    
    ```typescript
    import { batchExtractFiles } from "@kreuzberg/node";
    
    const results = await batchExtractFiles(["doc1.pdf", "doc2.docx"]);
    ```
    
    ### Rust — requires `tokio-runtime` feature
    
    ```rust
    use kreuzberg::{batch_extract_file, ExtractionConfig};
    
    let config = ExtractionConfig::default();
    let paths = vec!["doc1.pdf", "doc2.docx"];
    let results = batch_extract_file(paths, &config).await?;
    ```
    
    ### CLI
    
    ```bash
    kreuzberg batch *.pdf --format json
    kreuzberg batch docs/*.docx --output-format markdown
    ```
    
    ## OCR
    
    OCR runs automatically for images and scanned PDFs. Tesseract is the default backend (native binding, no external install required).
    
    ### Backends
    
    - **Tesseract** (default): Built-in native binding. All Tesseract languages supported.
    - **EasyOCR** (Python only): `pip install kreuzberg[easyocr]`. Pass `easyocr_kwargs={"gpu": True}`.
    - **PaddleOCR** (Python only): Bundled since 4.8.5, no extra install needed. Pass `paddleocr_kwargs={"use_angle_cls": True}`.
    - **Guten** (Node.js only): Built-in OCR backend via `GutenOcrBackend`.
    
    ### Language Codes
    
    ```python
    config = ExtractionConfig(ocr=OcrConfig(language="eng"))       # English
    config = ExtractionConfig(ocr=OcrConfig(language="eng+deu"))   # Multiple
    config = ExtractionConfig(ocr=OcrConfig(language="all"))       # All installed
    ```
    
    ### Force OCR
    
    ```python
    config = ExtractionConfig(force_ocr=True)  # OCR even if text is extractable
    ```
    
    ## ExtractionResult Fields
    
    | Field        | Python                      | Node.js                    | Rust                        | Description                                   |
    | ------------ | --------------------------- | -------------------------- | --------------------------- | --------------------------------------------- |
    | Text content | `result.content`            | `result.content`           | `result.content`            | Extracted text (str/String)                   |
    | MIME type    | `result.mime_type`          | `result.mimeType`          | `result.mime_type`          | Input document MIME type                      |
    | Metadata     | `result.metadata`           | `result.metadata`          | `result.metadata`           | Document metadata (dict/object/HashMap)       |
    | Tables       | `result.tables`             | `result.tables`            | `result.tables`             | Extracted tables with cells + markdown        |
    | Languages    | `result.detected_languages` | `result.detectedLanguages` | `result.detected_languages` | Detected languages (if enabled)               |
    | Chunks       | `result.chunks`             | `result.chunks`            | `result.chunks`             | Text chunks (if chunking enabled)             |
    | Images       | `result.images`             | `result.images`            | `result.images`             | Extracted images (if enabled)                 |
    | Elements     | `result.elements`           | `result.elements`          | `result.elements`           | Semantic elements (if element_based format)   |
    | Pages        | `result.pages`              | `result.pages`             | `result.pages`              | Per-page content (if page extraction enabled) |
    | Keywords     | `result.keywords`           | `result.keywords`          | `result.keywords`           | Extracted keywords (if enabled)               |
    
    ## Error Handling
    
    ### Python
    
    ```python
    from kreuzberg import (
        extract_file_sync, KreuzbergError, ParsingError,
        OCRError, ValidationError, MissingDependencyError,
    )
    
    try:
        result = extract_file_sync("file.pdf")
    except ParsingError as e:
        print(f"Failed to parse: {e}")
    except OCRError as e:
        print(f"OCR failed: {e}")
    except ValidationError as e:
        print(f"Invalid input: {e}")
    except MissingDependencyError as e:
        print(f"Missing dependency: {e}")
    except KreuzbergError as e:
        print(f"Extraction failed: {e}")
    ```
    
    ### Node.js
    
    ```typescript
    import {
      extractFile,
      KreuzbergError,
      ParsingError,
      OcrError,
      ValidationError,
      MissingDependencyError,
    } from "@kreuzberg/node";
    
    try {
      const result = await extractFile("file.pdf");
    } catch (e) {
      if (e instanceof ParsingError) {
        /* ... */
      } else if (e instanceof OcrError) {
        /* ... */
      } else if (e instanceof ValidationError) {
        /* ... */
      } else if (e instanceof KreuzbergError) {
        /* ... */
      }
    }
    ```
    
    ### Rust
    
    ```rust
    use kreuzberg::{extract_file, ExtractionConfig, KreuzbergError};
    
    let config = ExtractionConfig::default();
    match extract_file("file.pdf", None, &config).await {
        Ok(result) => println!("{}", result.content),
        Err(KreuzbergError::Parsing(msg)) => eprintln!("Parse error: {msg}"),
        Err(KreuzbergError::Ocr(msg)) => eprintln!("OCR error: {msg}"),
        Err(e) => eprintln!("Error: {e}"),
    }
    ```
    
    ## Common Pitfalls
    
    1. **Python ChunkingConfig fields**: Use `max_chars` and `max_overlap`, NOT `max_characters` or `overlap`.
    2. **Rust extract_file signature**: Third argument is `&ExtractionConfig` (a reference), not `Option`. Use `&ExtractionConfig::default()` for defaults.
    3. **Rust feature gates**: `extract_file_sync`, `batch_extract_file`, and `batch_extract_file_sync` all require `features = ["tokio-runtime"]` in Cargo.toml.
    4. **Rust async context**: `extract_file` is async. Use `#[tokio::main]` or call from an async context.
    5. **CLI --format vs --output-format**: `--format` controls CLI output (text/json). `--output-format` controls content format (plain/markdown/djot/html).
    6. **Node.js extractFile signature**: `extractFile(path, mimeType?, config?)` — mimeType is the second arg (pass `null` to skip).
    7. **Python detect_mime_type**: The function for detecting from bytes is `detect_mime_type(data)`. For paths use `detect_mime_type_from_path(path)`.
    8. **Config file field names**: Use snake_case in TOML/YAML/JSON config files (e.g., `max_chars`, `max_overlap`, `pdf_options`).
    
    ## Supported Formats (Summary)
    
    | Category          | Extensions                                                                                                                                                  |
    | ----------------- | ----------------------------------------------------------------------------------------------------------------------------------------------------------- |
    | **PDF**           | `.pdf`                                                                                                                                                      |
    | **Word**          | `.docx`, `.odt`                                                                                                                                             |
    | **Spreadsheets**  | `.xlsx`, `.xlsm`, `.xlsb`, `.xls`, `.xla`, `.xlam`, `.xltm`, `.ods`                                                                                         |
    | **Presentations** | `.pptx`, `.ppt`, `.ppsx`                                                                                                                                    |
    | **eBooks**        | `.epub`, `.fb2`                                                                                                                                             |
    | **Images**        | `.png`, `.jpg`, `.jpeg`, `.gif`, `.webp`, `.bmp`, `.tiff`, `.tif`, `.jp2`, `.jpx`, `.jpm`, `.mj2`, `.jbig2`, `.jb2`, `.pnm`, `.pbm`, `.pgm`, `.ppm`, `.svg` |
    | **Markup**        | `.html`, `.htm`, `.xhtml`, `.xml`                                                                                                                           |
    | **Data**          | `.json`, `.yaml`, `.yml`, `.toml`, `.csv`, `.tsv`                                                                                                           |
    | **Text**          | `.txt`, `.md`, `.markdown`, `.djot`, `.rst`, `.org`, `.rtf`                                                                                                 |
    | **Email**         | `.eml`, `.msg`                                                                                                                                              |
    | **Archives**      | `.zip`, `.tar`, `.tgz`, `.gz`, `.7z`                                                                                                                        |
    | **Academic**      | `.bib`, `.biblatex`, `.ris`, `.nbib`, `.enw`, `.csl`, `.tex`, `.latex`, `.typ`, `.jats`, `.ipynb`, `.docbook`, `.opml`, `.pod`, `.mdoc`, `.troff`           |
    
    See [references/supported-formats.md](references/supported-formats.md) for the complete format reference with MIME types.
    
    ## Additional Resources
    
    Detailed reference files for specific topics:
    
    - **[Python API Reference](references/python-api.md)** — All functions, config classes, plugin protocols, exact signatures
    - **[Node.js API Reference](references/nodejs-api.md)** — All functions, TypeScript interfaces, worker pool APIs
    - **[Rust API Reference](references/rust-api.md)** — All functions with feature gates, structs, Cargo.toml examples
    - **[CLI Reference](references/cli-reference.md)** — All commands, flags, config precedence, exit codes
    - **[Configuration Reference](references/configuration.md)** — TOML/YAML/JSON formats, auto-discovery, env vars, full schema
    - **[Supported Formats](references/supported-formats.md)** — All 85+ formats with file extensions and MIME types
    - **[Advanced Features](references/advanced-features.md)** — Plugins, embeddings, MCP server, API server, security limits
    - **[Other Language Bindings](references/other-bindings.md)** — Go, Ruby, Java, C#, PHP, Elixir, WASM, Docker
    
    Full documentation: <https://docs.kreuzberg.dev>
    GitHub: <https://github.com/kreuzberg-dev/kreuzberg>
    
  • .ai-rulez/skills/plugin-architecture-patterns/SKILL.mdskill
    Show content (3045 bytes)
    ---
    name: plugin-architecture-patterns
    description: "Plugin architecture, registration, and trait patterns"
    priority: critical
    ---
    
    # Plugin Architecture & Registration
    
    ## Plugin Types
    
    | Type               | Trait                       | Location                     |
    | ------------------ | --------------------------- | ---------------------------- |
    | Document Extractor | `DocumentExtractor: Plugin` | `plugins/extractor/trait.rs` |
    | OCR Backend        | `OcrBackend: Plugin`        | `plugins/ocr/trait.rs`       |
    | Post Processor     | `PostProcessor: Plugin`     | `plugins/processor/trait.rs` |
    | Validator          | `Validator: Plugin`         | `plugins/validator/trait.rs` |
    
    ## DocumentExtractor Implementation
    
    ```rust
    use crate::plugins::{DocumentExtractor, Plugin};
    use async_trait::async_trait;
    
    pub struct MyExtractor;
    
    impl Plugin for MyExtractor {
        fn name(&self) -> &str { "my-extractor" }
        fn version(&self) -> String { env!("CARGO_PKG_VERSION").to_string() }
    }
    
    #[async_trait]
    impl DocumentExtractor for MyExtractor {
        async fn extract_bytes(&self, content: &[u8], mime_type: &str, config: &ExtractionConfig)
            -> Result<ExtractionResult> { /* ... */ }
    
        fn supported_mime_types(&self) -> &[&str] { &["application/x-custom"] }
        fn priority(&self) -> i32 { 50 }
    
        // WASM support (optional)
        fn as_sync_extractor(&self) -> Option<&dyn SyncExtractor> { None }
    }
    ```
    
    ## Priority System
    
    | Range  | Use                       |
    | ------ | ------------------------- |
    | 0-25   | Fallback/low-quality      |
    | 26-49  | Alternative extractors    |
    | **50** | **Default (built-in)**    |
    | 51-75  | Premium/enhanced          |
    | 76-100 | Specialized/high-priority |
    
    Registry selects **highest priority** extractor for each MIME type. Override built-ins with priority > 50.
    
    ## Registration
    
    ```rust
    // In extractors/mod.rs → register_default_extractors()
    let registry = get_document_extractor_registry();
    let mut registry = registry.write()
        .map_err(|e| KreuzbergError::Other(format!("Registry lock poisoned: {}", e)))?;
    registry.register(Arc::new(MyExtractor::new()))?;
    ```
    
    ## Feature-Gated Registration
    
    ```rust
    #[cfg(feature = "office")]
    {
        registry.register(Arc::new(DocxExtractor::new()))?;
        registry.register(Arc::new(PptxExtractor::new()))?;
    }
    ```
    
    ## PostProcessor Pattern
    
    ```rust
    impl PostProcessor for MyProcessor {
        async fn process(&self, result: &mut ExtractionResult, config: &ExtractionConfig)
            -> Result<()> {
            result.content = process_content(&result.content);
            Ok(())
        }
        fn stage(&self) -> ProcessorStage { ProcessorStage::Middle }
    }
    ```
    
    Stages: `Early` → `Middle` → `Late`. Failures isolated (don't block others).
    
    ## Critical Rules
    
    1. All plugins **MUST be `Send + Sync`**
    2. Feature gate with `#[cfg(feature = "...")]` for optional formats
    3. Use `#[async_trait]` for `DocumentExtractor`
    4. Initialization via `ensure_initialized()` (lazy, called before first extraction)
    5. Plugin names: kebab-case (e.g., `"pdf-extractor"`)
    

README

Kreuzberg

Linkedin- Banner

Extract text, metadata, and code intelligence from 97+ file formats and 305 programming languages at native speeds without needing a GPU.

Key Features

  • Code intelligence – Extract functions, classes, imports, symbols, and docstrings from 248 programming languages via tree-sitter. Results in ExtractionResult.code_intelligence with semantic chunking
  • Extensible architecture – Plugin system for custom OCR backends, validators, post-processors, document extractors, and renderers
  • Polyglot – Native bindings for Rust, Python, TypeScript/Node.js, Ruby, Go, Java, C#, PHP, Elixir, R, and C
  • 91+ file formats – PDF, Office documents, images, HTML, XML, emails, archives, academic formats across 8 categories
  • LLM intelligence – VLM OCR (GPT-4o, Claude, Gemini, Ollama), structured JSON extraction with schema constraints, and provider-hosted embeddings via 146 LLM providers (including local engines: Ollama, LM Studio, vLLM, llama.cpp) through liter-llm
  • OCR support – Tesseract (all bindings, including Tesseract-WASM for browsers), PaddleOCR (all native bindings), EasyOCR (Python), VLM OCR (146 vision model providers including local engines), extensible via plugin API
  • High performance – Rust core with pure-Rust PDF, SIMD optimizations and full parallelism
  • Flexible deployment – Use as library, CLI tool, REST API server, or MCP server
  • TOON wire format – Token-efficient serialization for LLM/RAG pipelines, ~30-50% fewer tokens than JSON
  • GFM-quality output – Comrak-based rendering with proper fenced code blocks, table nodes, bracket escaping, and cross-format parity (Markdown, HTML, Djot, Plain)
  • HTML passthrough – HTML-to-Markdown conversion uses html-to-markdown output directly, bypassing lossy intermediate round-trips
  • Memory efficient – Streaming parsers for multi-GB files

Complete Documentation | Live Demo | Installation Guides

Installation

Each language binding provides comprehensive documentation with examples and best practices. Choose your platform to get started:

Scripting Languages:

  • Python – PyPI package, async/sync APIs, OCR backends (Tesseract, PaddleOCR, EasyOCR)
  • Ruby – RubyGems package, idiomatic Ruby API, native bindings
  • PHP – Composer package, modern PHP 8.4+ support, type-safe API, async extraction
  • Elixir – Hex package, OTP integration, concurrent processing
  • R – r-universe package, idiomatic R API, extendr bindings

JavaScript/TypeScript:

  • @kreuzberg/node – Native NAPI-RS bindings for Node.js/Bun, fastest performance
  • @kreuzberg/wasm – WebAssembly for browsers/Deno/Cloudflare Workers, full feature parity (PDF, Excel, OCR, archives)

Compiled Languages:

  • Go – Go module with FFI bindings, context-aware async
  • Java – Maven Central, Foreign Function & Memory API
  • C# – NuGet package, .NET 6.0+, full async/await support

Native:

  • Rust – Core library, flexible feature flags, zero-copy APIs
  • C (FFI) – C header + shared library, pkg-config/CMake support, cross-platform

Containers:

  • Docker – Official images with API, CLI, and MCP server modes (Core: ~1.0-1.3GB, Full: ~1.0-1.3GB with OCR + legacy format support)

Command-Line:

  • CLI – Cross-platform binary, batch processing, MCP server mode

All language bindings include precompiled binaries for both x86_64 and aarch64 architectures on Linux and macOS.

Platform Support

Complete architecture coverage across all language bindings:

LanguageLinux x86_64Linux aarch64macOS ARM64Windows x64
Python
Node.js
WASM
Ruby-
R
Elixir
Go
Java
C#
PHP
Rust
C (FFI)
CLI
Docker-

Note: ✅ = Precompiled binaries available with instant installation. WASM runs in any environment with WebAssembly support (browsers, Deno, Bun, Cloudflare Workers). All platforms are tested in CI. MacOS support is Apple Silicon only.

Embeddings Support (Optional)

To use embeddings functionality:

  1. Install ONNX Runtime 1.24+:

  2. Use embeddings in your code - see Embeddings Guide

Note: Kreuzberg requires ONNX Runtime version 1.24+ for embeddings. All other Kreuzberg features work without ONNX Runtime.

Supported Formats

91+ file formats across 8 major categories with intelligent format detection and comprehensive metadata extraction.

Office Documents

CategoryFormatsCapabilities
Word Processing.docx, .docm, .dotx, .dotm, .dot, .odt, .pagesFull text, tables, lists, images, metadata, styles
Spreadsheets.xlsx, .xlsm, .xlsb, .xls, .xla, .xlam, .xltm, .xltx, .xlt, .ods, .numbersSheet data, formulas, cell metadata, charts
Presentations.pptx, .pptm, .ppsx, .potx, .potm, .pot, .keySlides, speaker notes, images, metadata
PDF.pdfText, tables, images, metadata, OCR support
eBooks.epub, .fb2Chapters, metadata, embedded resources
Database.dbfTable data extraction, field type support
Hangul.hwp, .hwpxKorean document format, text extraction

Images (OCR-Enabled)

CategoryFormatsFeatures
Raster.png, .jpg, .jpeg, .gif, .webp, .bmp, .tiff, .tifOCR, table detection, EXIF metadata, dimensions, color space
Advanced.jp2, .jpx, .jpm, .mj2, .jbig2, .jb2, .pnm, .pbm, .pgm, .ppmPure Rust decoders (JPEG 2000, JBIG2), OCR, table detection
Vector.svgDOM parsing, embedded text, graphics metadata

Web & Data

CategoryFormatsFeatures
Markup.html, .htm, .xhtml, .xml, .svgDOM parsing, metadata (Open Graph, Twitter Card), link extraction
Structured Data.json, .yaml, .yml, .toml, .csv, .tsvSchema detection, nested structures, validation
Text & Markdown.txt, .md, .markdown, .djot, .mdx, .rst, .org, .rtfCommonMark, GFM, Djot, MDX, reStructuredText, Org Mode, Rich Text

Email & Archives

CategoryFormatsFeatures
Email.eml, .msgHeaders, body (HTML/plain), attachments, UTF-16 support
Archives.zip, .tar, .tgz, .gz, .7zRecursive extraction, nested archives, metadata

Academic & Scientific

CategoryFormatsFeatures
Citations.bib, .ris, .nbib, .enw, .cslBibTeX/BibLaTeX, RIS, PubMed/MEDLINE, EndNote XML, CSL JSON
Scientific.tex, .latex, .typ, .typst, .jats, .ipynbLaTeX, Typst, JATS journal articles, Jupyter notebooks
Publishing.fb2, .docbook, .dbk, .opmlFictionBook, DocBook XML, OPML outlines
Documentation.pod, .mdoc, .troffPerl POD, man pages, troff

Complete Format Reference →

Code Intelligence (248 Languages)

FeatureDescription
Structure ExtractionFunctions, classes, methods, structs, interfaces, enums
Import/Export AnalysisModule dependencies, re-exports, wildcard imports
Symbol ExtractionVariables, constants, type aliases, properties
Docstring ParsingGoogle, NumPy, Sphinx, JSDoc, RustDoc, and 10+ formats
DiagnosticsParse errors with line/column positions
Syntax-Aware ChunkingSplit code by semantic boundaries, not arbitrary byte offsets

Powered by tree-sitter-language-pack with dynamic grammar download. See TSLP documentation for the full language list.

Key Features

OCR with Table Extraction

Multiple OCR backends (Tesseract, EasyOCR, PaddleOCR) with intelligent table detection and reconstruction. Extract structured data from scanned documents and images with configurable accuracy thresholds.

OCR Backend Documentation →

Batch Processing

Process multiple documents concurrently with configurable parallelism. Optimize throughput for large-scale document processing workloads with automatic resource management.

Batch Processing Guide →

Password-Protected PDFs

Handle encrypted PDFs with single or multiple password attempts. Supports both RC4 and AES encryption with automatic fallback strategies.

PDF Configuration →

Language Detection

Automatic language detection in extracted text using fast-langdetect. Configure confidence thresholds and access per-language statistics.

Language Detection Guide →

Metadata Extraction

Extract comprehensive metadata from all supported formats: authors, titles, creation dates, page counts, EXIF data, and format-specific properties.

Metadata Guide →

AI Coding Assistants

Kreuzberg ships with an Agent Skill that teaches AI coding assistants how to use the library correctly. It works with Claude Code, Codex, Gemini CLI, Cursor, VS Code, Amp, Goose, Roo Code, and any tool supporting the Agent Skills standard.

Install the skill into any project using the Vercel Skills CLI:

npx skills add kreuzberg-dev/kreuzberg

The skill is located at skills/kreuzberg/SKILL.md and is automatically discovered by supported AI coding tools once installed.

Documentation

Contributing

Contributions are welcome! See CONTRIBUTING.md for guidelines.

License

Elastic License 2.0 (ELv2) - see LICENSE for details. See https://www.elastic.co/licensing/elastic-license for the full license text.