Curated Claude Code catalog
Updated 07.05.2026 · 19:39 CET
01 / Skill
kreuzberg-dev

kreuzberg

Quality
9.0

Kreuzberg is a high-performance document intelligence library built on a Rust core, capable of extracting text, metadata, and code intelligence from over 97 file formats and 305 programming languages. It's ideal for building robust RAG pipelines, AI agents, or any application requiring fast, comprehensive document analysis across diverse data sources.

USP

It offers native bindings for 10+ languages, VLM OCR with 146 LLM providers, and a TOON wire format for token-efficient RAG pipelines. Its extensible architecture supports custom OCR backends and post-processors, making it highly adaptable.

Use cases

  • Building RAG pipelines for LLM applications
  • Automated document processing and content extraction
  • Extracting code intelligence for AI agents and analysis
  • Performing OCR on scanned documents and images
  • Batch processing various file types for data ingestion

Detected files (6)

  • .ai-rulez/skills/extraction-pipeline-patterns/SKILL.md (7050 bytes)
    ---
    description: "Document extraction pipeline architecture and patterns"
    name: extraction-pipeline-patterns
    priority: critical
    ---
    
    # Extraction Pipeline Patterns
    
    **Kreuzberg's format detection -> extraction -> fallback orchestration for 75+ file formats**
    
    ## Core Pipeline Architecture
    
    The extraction pipeline (`crates/kreuzberg/src/core/pipeline.rs`, `crates/kreuzberg/src/extraction/`) orchestrates:
    
    1. **Format Detection** - MIME type inference + extension validation -> select appropriate extractor
    2. **Intelligent Extraction** - Route to format-specific extractors (PDF, DOCX, Excel, HTML, images, archives, etc.)
    3. **Fallback Strategies** - Password-protected PDFs, OCR for images, nested archive handling, corrupted file recovery
    4. **Post-Processing Pipeline** - Validators, quality processing, chunking, custom hooks (see `core/pipeline.rs`)
    
    ## Format Detection Strategy
    
    **Location**: `crates/kreuzberg/src/core/mime.rs`, `crates/kreuzberg/src/core/formats.rs`
    
    Pattern: detect via magic bytes, validate extension alignment (prevent spoofing), route to extractor. Multiple extractors for same format -> choose highest confidence/specificity.
    
    ```rust
    // Pseudocode: core/mime.rs
    match (magic_bytes(content), extension) {
        (Some(fmt), Some(ext)) if aligned(fmt, ext) => Ok(fmt),
        (Some(_fmt), Some(_ext)) => Err(FormatMismatch), // misaligned
        (Some(fmt), None) => Ok(fmt),                    // magic bytes only
        (None, Some(ext)) => Ok(from_extension(ext)),
        _ => Err(UnknownFormat),
    }
    ```
    
    ## Extraction Modules (75 Formats)
    
    | Category     | Extractors                                       | Key Modules                                          |
    | ------------ | ------------------------------------------------ | ---------------------------------------------------- |
    | **Office**   | DOCX, XLSX, XLSM, XLSB, XLS, PPTX, ODP, ODS      | `extraction/{docx,excel,pptx}.rs`                    |
    | **PDF**      | Standard + encrypted, password attempts          | `pdf/` subdirectory (13 files)                       |
    | **Images**   | PNG, JPG, TIFF, WebP, JP2, SVG (OCR-enabled)     | `extraction/image.rs` + `ocr/`                       |
    | **Web**      | HTML, XHTML, XML, SVG (DOM parsing)              | `extraction/html.rs` (67KB - complex table handling) |
    | **Email**    | EML, MSG (headers, body, attachments, threading) | `extraction/email.rs`                                |
    | **Archives** | ZIP, TAR, GZ, 7Z (recursive extraction)          | `extraction/archive.rs` (31KB)                       |
    | **Markdown** | MD, TXT, RST, Org Mode, RTF                      | `extraction/markdown.rs`                             |
    | **Academic** | LaTeX, BibTeX, JATS, Jupyter, DocBook            | `extraction/{structured,xml}.rs`                     |
    
    ## Extraction Dispatcher
    
    ```rust
    // Pseudocode: extraction/mod.rs
    let format = detect_format(source.bytes, source.extension);
    let result = match format {
        Pdf => extract_pdf(source, config),
        Docx => extract_docx(source, config),
        Image => extract_image_with_ocr_fallback(source, config),
        Archive => extract_archive_recursive(source, config),
        _ => extract_with_plugin(format, source, config),
    };
    run_pipeline(result, config)  // post-processing always runs
    ```
    
    ## Fallback Strategies
    
    - **Password-Protected PDFs**: Try primary password -> secondary password list -> return `is_encrypted=true` in metadata on failure
    - **OCR Fallback**: If image text extraction confidence < threshold, trigger OCR backend; return both results with scores
    - **Nested Archives**: Recursive extraction with configurable depth limit; flatten or preserve hierarchy
    - **Corrupted File Recovery**: Stream-based parsing, emit content up to error point, include error location in metadata
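    
    The OCR-fallback rule above can be sketched as a small decision function. This is a hypothetical illustration, not Kreuzberg's API: `ExtractionAttempt` and the `ocr` callable are stand-ins for the real extraction and OCR backends.
    
    ```python
    from dataclasses import dataclass
    
    @dataclass
    class ExtractionAttempt:
        text: str
        confidence: float
        method: str
    
    def extract_with_ocr_fallback(native, threshold=0.6, ocr=None):
        """If native confidence < threshold, also run the OCR backend;
        return all attempts best-first so callers can inspect both scores."""
        attempts = [native]
        if native.confidence < threshold and ocr is not None:
            attempts.append(ocr())  # trigger the OCR backend
        return sorted(attempts, key=lambda a: a.confidence, reverse=True)
    ```
    
    Returning both results with scores (rather than silently replacing one) matches the "return both results with scores" behavior described above.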
    
    ## Configuration Integration
    
    **Location**: `crates/kreuzberg/src/core/config.rs`, `crates/kreuzberg/src/core/config_validation.rs`
    
    `ExtractionConfig` holds format-specific configs (`pdf`, `image`, `html`, `office`), fallback orchestration (`fallback`), and post-processing (`postprocessor`, `chunking`, `keywords`). See struct definition in `config.rs`.
    
    ## Plugin System Integration
    
    **Location**: `crates/kreuzberg/src/plugins/`
    
    - **CustomExtractor**: Override built-in format extractors
    - **PostProcessor**: Modify results after extraction (Early/Middle/Late stages)
    - **Validator**: Fail-fast validation (e.g., minimum text length)
    - **OCRBackend**: Swap OCR engine
    
    Plugin registry loaded at startup, cached for zero-cost lookup.
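    
    The first-registered-wins priority rule (see Critical Rules below) can be sketched as follows; class and method names here are illustrative, not the actual plugin API.
    
    ```python
    class PluginRegistry:
        """First-registered plugin wins for a given MIME type."""
    
        def __init__(self):
            self._extractors = {}  # mime type -> extractor
    
        def register_extractor(self, mime_type, extractor):
            # setdefault keeps the earlier registration, so plugins
            # registered first take priority over later ones.
            self._extractors.setdefault(mime_type, extractor)
    
        def lookup(self, mime_type):
            return self._extractors.get(mime_type)
    ```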
    
    ## Feature Flag Strategy
    
    **Location**: `Cargo.toml` (workspace), `crates/kreuzberg/Cargo.toml`, `FEATURE_MATRIX.md`
    
    20+ features across 9 language bindings. Key feature groups:
    
    | Group    | Features                                                                             | Notes                             |
    | -------- | ------------------------------------------------------------------------------------ | --------------------------------- |
    | OCR      | `tesseract` (default), `tesseract-static`, `ocr-minimal`                             | Pick exactly one (mutually exclusive) |
    | Formats  | `pdf`, `pdf-minimal`, `office`, `office-minimal`                                     |                                   |
    | AI/ML    | `embeddings` (requires ONNX), `keywords-yake`, `keywords-rake`, `language-detection` |                                   |
    | Server   | `api` (Axum), `mcp`, `tokio-runtime`, `lite-runtime`                                 |                                   |
    | Bindings | `python-bindings`, `ruby-bindings`, `php-bindings`, `node-bindings`, `wasm`          |                                   |
    
    Conditional compilation: modules gated with `#[cfg(feature = "...")]`. Runtime `validate_config()` warns if requested feature not compiled in.
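    
    The runtime check reduces to comparing requested capabilities against what was compiled in. A minimal sketch, assuming a flat set of feature names (the function and the feature set here are illustrative, not the real `validate_config()` signature):
    
    ```python
    # Stand-in for the feature set baked in at compile time.
    COMPILED_FEATURES = {"pdf", "office", "tesseract"}
    
    def validate_config(requested_features, compiled=COMPILED_FEATURES):
        """Return a warning per feature that was requested but not compiled in."""
        return [
            f"feature '{f}' requested but not compiled in"
            for f in sorted(requested_features)
            if f not in compiled
        ]
    ```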
    
    ### Feature Flag Critical Rules
    
    1. **Never mix conflicting features** - e.g., `ocr-minimal` + `tesseract` should error at compile time
    2. **Always provide feature diagnostics** - Config validation must warn if feature unavailable
    3. **Default to maximum feature set** - Unless embedded/minimal specifically requested
    4. **Test all feature combinations** - Matrix testing in CI catches regressions
    5. **WASM incompatible** with embeddings, keywords, OCR
    
    ## Critical Rules
    
    1. **Always use format detection** before routing to extractors (prevent confusion attacks)
    2. **Stream-based parsing** for PDFs/archives to handle multi-GB files
    3. **Post-pipeline is mandatory**: All extraction results flow through `run_pipeline()` for validators/hooks
    4. **Plugin overrides are order-dependent**: Plugins registered first take priority
    5. **Fallback timeouts**: Set reasonable OCR/archive extraction timeouts (config-driven)
    6. **Metadata preservation**: Include format detection confidence, extraction method used, any fallbacks applied
    
    ## Related Skills
    
    - **ocr-backend-management** - OCR engine selection and image preprocessing
    - **chunking-embeddings** - Post-extraction text splitting with FastEmbed
    - **api-server-mcp** - Axum endpoint for extraction pipeline exposure and MCP server
    
  • .ai-rulez/skills/format-specific-extraction/SKILL.md (2735 bytes)
    ---
    name: format-specific-extraction
    description: "Format-specific document extraction workflows"
    priority: high
    ---
    
    # Format-Specific Extraction Workflows
    
    ## Office XML (DOCX/PPTX/ODT)
    
    ```text
    ZIP archive → Security validation → XML parsing → Text + tables + metadata
    ```
    
    1. `ZipBombValidator::new(limits).validate(&mut archive)?`
    2. Extract XML files from archive (`word/document.xml`, `ppt/slides/*.xml`, `content.xml`)
    3. Parse with `quick-xml::Reader` (streaming) + `DepthValidator` + `StringGrowthValidator`
    4. Extract metadata via `crate::extraction::office_metadata::extract_metadata()`
    5. See: `extractors/docx.rs`, `extractors/pptx.rs`, `extractors/odt.rs`
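    
    The ZIP -> XML flow above can be sketched in a few lines using Python's stdlib instead of the Rust modules named (security validators omitted for brevity; only the `word/document.xml` text-run extraction is shown):
    
    ```python
    import io
    import zipfile
    import xml.etree.ElementTree as ET
    
    # WordprocessingML namespace for <w:t> text runs.
    W_NS = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"
    
    def docx_text(data: bytes) -> str:
        """Read word/document.xml out of the ZIP and join its text runs."""
        with zipfile.ZipFile(io.BytesIO(data)) as archive:
            xml_bytes = archive.read("word/document.xml")
        root = ET.fromstring(xml_bytes)
        return "".join(node.text or "" for node in root.iter(f"{W_NS}t"))
    ```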
    
    ## PDF
    
    ```text
    Bytes → pdf_oxide → Per-page text + OCR fallback → Tables → Metadata
    ```
    
    1. `pdf_oxide::PdfDocument::from_bytes(content)?`
    2. Check if needs OCR: `config.force_ocr || !has_searchable_text()`
    3. Extract text per page, tables if `config.pages` enabled
    4. Feature-gated: `#[cfg(feature = "pdf")]`
    5. See: `extractors/pdf/mod.rs`
    
    ## Archives (ZIP/TAR/7z/GZIP)
    
    ```text
    Validate → Extract metadata → Extract plaintext files only
    ```
    
    1. `ZipBombValidator` BEFORE any extraction
    2. Extract metadata (file list, sizes)
    3. Extract text content from plaintext files
    4. Use `build_archive_result()` helper
    5. See: `extractors/archive.rs`, `extraction/archive/*.rs`
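    
    Step 1's validate-before-extract rule can be illustrated with a simple zip-bomb heuristic: check declared uncompressed size and compression ratio before reading any entry. This is a sketch, not `ZipBombValidator`'s actual logic or limits.
    
    ```python
    import io
    import zipfile
    
    def validate_zip(data: bytes, max_total: int = 1 << 30, max_ratio: float = 100.0):
        """Reject archives whose declared uncompressed size or
        compression ratio exceeds the configured limits."""
        with zipfile.ZipFile(io.BytesIO(data)) as archive:
            total = sum(info.file_size for info in archive.infolist())
            packed = sum(info.compress_size for info in archive.infolist())
        if total > max_total:
            raise ValueError("archive exceeds total uncompressed size limit")
        if packed and total / packed > max_ratio:
            raise ValueError("suspicious compression ratio (possible zip bomb)")
        return total
    ```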
    
    ## Structured Text (JSON/YAML/TOML/XML)
    
    ```text
    Detect format from MIME → Parse → Pretty-print → Metadata
    ```
    
    Single `StructuredExtractor` handles multiple MIME types. Parse with format-specific library, pretty-print to text.
    See: `extractors/structured.rs`
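    
    The parse -> pretty-print step is trivial per format; the JSON branch, sketched with the stdlib (the real `StructuredExtractor` dispatches on MIME type first):
    
    ```python
    import json
    
    def extract_structured_json(raw: str) -> str:
        """Parse then pretty-print so downstream chunking sees readable text."""
        return json.dumps(json.loads(raw), indent=2, sort_keys=True)
    ```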
    
    ## Email (EML/MSG)
    
    ```text
    Parse headers → Extract body (text/html) → Process attachments
    ```
    
    See: `extraction/email.rs`, `extractors/email.rs`
    
    ## Common Helpers
    
    | Helper                                | Location                    | Purpose                        |
    | ------------------------------------- | --------------------------- | ------------------------------ |
    | `office_metadata::extract_metadata()` | `extraction/office.rs`      | Office XML metadata            |
    | `cells_to_markdown()`                 | `extraction/mod.rs`         | Convert cell grid to GFM table |
    | `build_archive_result()`              | `extraction/archive/mod.rs` | Standard archive result        |
    
    ## Adding a New Format
    
    1. Add MIME type to `EXT_TO_MIME` in `core/mime.rs`
    2. Create extractor implementing `DocumentExtractor` trait
    3. Set `supported_mime_types()` and `priority()` (default: 50)
    4. Register in `extractors/mod.rs` → `register_default_extractors()`
    5. Feature-gate if optional: `#[cfg(feature = "my-format")]`
    6. Apply security validators for user content
    7. Add tests with fixture files
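    
    Steps 2-4 above, sketched in Python terms: an extractor declares its MIME types and priority, and the registry keeps the highest-priority extractor per type. Names are illustrative, not the Rust trait's exact API.
    
    ```python
    class PlainTextExtractor:
        def supported_mime_types(self):
            return ["text/plain"]
    
        def priority(self):
            return 50  # default priority
    
        def extract(self, data: bytes) -> str:
            return data.decode("utf-8", errors="replace")
    
    def register(registry: dict, extractor) -> None:
        """Keep the highest-priority extractor for each MIME type."""
        for mime in extractor.supported_mime_types():
            current = registry.get(mime)
            if current is None or extractor.priority() > current.priority():
                registry[mime] = extractor
    ```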
    
  • .ai-rulez/skills/api-server-mcp/SKILL.md (9419 bytes)
    ---
    description: "REST API server and MCP protocol integration"
    name: api-server-mcp
    priority: critical
    ---
    
    # API Server & MCP Protocol
    
    **Axum server design for document extraction endpoints, middleware, async processing, and Model Context Protocol integration for AI agents**
    
    ## Kreuzberg API Architecture
    
    **Location**: `crates/kreuzberg/src/api/`, `crates/kreuzberg-cli/`
    
    Kreuzberg provides a dual REST API + MCP server built with Axum + Tokio.
    
    ```text
    Request Flow:
    HTTP Client / AI Agent (Claude)
        |
    [Transport Layer]
    ├── REST API (Axum HTTP)
    └── MCP Protocol (HTTP or Stdio)
        |
    [Middleware Layer]
    ├── CORS, Request Logging (TraceLayer)
    ├── Request/Response size limits
    └── Rate limiting (optional)
        |
    [Router]
    ├── REST Endpoints
    │   ├── POST /extract - File upload extraction
    │   ├── POST /extract-url - URL-based extraction
    │   ├── GET /formats - List supported formats
    │   ├── GET /health - Server health check
    │   ├── POST /batch - Batch document processing
    │   ├── GET /cache/stats - Cache statistics
    │   └── DELETE /cache - Clear extraction cache
    ├── MCP Endpoints
    │   ├── POST /mcp/tools - List available tools
    │   ├── POST /mcp/tools/call - Call a tool
    │   ├── GET /mcp/resources - List resources
    │   ├── GET /mcp/resources/:uri - Read resource
    │   ├── GET /mcp/prompts - List prompts
    │   └── GET /mcp/prompts/:name - Get prompt
        |
    [Handler / Tool Layer]
    ├── extract_handler / extract_file tool
    ├── batch_handler / batch_extract tool
    ├── health_handler / get_capabilities tool
    └── format_handler
        |
    [Extraction Core]
    ├── Format detection
    ├── Extraction pipeline
    ├── Post-processing (chunking, embeddings)
    └── Result formatting
        |
    JSON Response / MCP ToolResult
    ```
    
    ## Server Setup & Configuration
    
    **Location**: `crates/kreuzberg/src/api/server.rs`
    
    Server initialization pattern: Create `ApiState` (holds `ExtractionConfig` + `ExtractionCache`), build Axum `Router` with all REST + MCP routes, apply middleware layers (body limits, CORS, tracing), serve via `tokio::net::TcpListener`.
    
    Key middleware layers applied in order:
    
    - `DefaultBodyLimit::max(100MB)` + `RequestBodyLimitLayer` -- configurable via env vars
    - `CorsLayer::permissive()` -- restrict in production via `CORS_ALLOWED_ORIGINS`
    - `TraceLayer::new_for_http()` -- request/response logging
    
    ## Core REST Handlers
    
    **Location**: `crates/kreuzberg/src/api/handlers.rs`
    
    | Handler               | Method            | Description                                                                                            |
    | --------------------- | ----------------- | ------------------------------------------------------------------------------------------------------ |
    | `extract_handler`     | POST /extract     | Multipart upload: parse file + optional config JSON, check cache, call `extract_bytes()`, cache result |
    | `extract_url_handler` | POST /extract-url | Fetch URL via reqwest, extract bytes                                                                   |
    | `batch_handler`       | POST /batch       | Parallel extraction with `Semaphore`-limited concurrency (default: CPU count)                          |
    | `health_handler`      | GET /health       | Report status, version, uptime, feature availability (OCR, embeddings), cache stats                    |
    | `formats_handler`     | GET /formats      | Return supported format categories (office, pdf, images, web, email, archives, academic)               |
    | `cache_stats_handler` | GET /cache/stats  | Hit/miss counts and hit rate                                                                           |
    | `cache_clear_handler` | DELETE /cache     | Clear LRU cache                                                                                        |
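    
    The `batch_handler` concurrency pattern can be sketched with `asyncio` (the real handler uses a Tokio `Semaphore`; `extract` here is any async callable and the default limit stands in for the CPU count):
    
    ```python
    import asyncio
    
    async def batch_extract(paths, extract, limit: int = 4):
        """Run extractions in parallel, at most `limit` in flight."""
        sem = asyncio.Semaphore(limit)
    
        async def one(path):
            async with sem:  # semaphore bounds concurrency
                return await extract(path)
    
        return await asyncio.gather(*(one(p) for p in paths))
    ```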
    
    ## Caching Strategy
    
    **Location**: `crates/kreuzberg/src/cache/mod.rs`
    
    LRU cache keyed by `SHA256(file_content)`, stores `Arc<ExtractionResult>`. Default 1000 entries. Thread-safe via `RwLock`. Tracks hit/miss counters with `AtomicU64` for stats endpoint.
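    
    A minimal sketch of that strategy (single-threaded; the real cache adds `RwLock` guards and `Arc`-wrapped results):
    
    ```python
    import hashlib
    from collections import OrderedDict
    
    class ExtractionCache:
        """LRU cache keyed by SHA-256 of the file bytes, with hit/miss stats."""
    
        def __init__(self, capacity: int = 1000):
            self.capacity = capacity
            self._entries = OrderedDict()
            self.hits = self.misses = 0
    
        @staticmethod
        def key(content: bytes) -> str:
            return hashlib.sha256(content).hexdigest()
    
        def get(self, content: bytes):
            k = self.key(content)
            if k in self._entries:
                self.hits += 1
                self._entries.move_to_end(k)  # mark as recently used
                return self._entries[k]
            self.misses += 1
            return None
    
        def put(self, content: bytes, result) -> None:
            k = self.key(content)
            self._entries[k] = result
            self._entries.move_to_end(k)
            if len(self._entries) > self.capacity:
                self._entries.popitem(last=False)  # evict least recently used
    ```
    
    Keying by content hash rather than file path means identical files uploaded under different names share one cache entry.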
    
    ## Error Handling
    
    **Location**: `crates/kreuzberg/src/api/error.rs`
    
    `ApiError` enum maps to HTTP status codes:
    
    - `MissingFile` -> 400, `FileNotFound` -> 404
    - `OnnxRuntimeMissing` / `TesseractMissing` -> 503 (with remediation message)
    - `PayloadTooLarge` -> 413
    - `ExtractionFailed` / `InvalidConfig` / `UnsupportedFormat` -> 500
    
    ## MCP Server Implementation
    
    **Location**: `crates/kreuzberg/src/mcp/server.rs`
    
    The MCP server allows Claude and other AI agents to call Kreuzberg extraction functions through the Model Context Protocol.
    
    ### MCP Tools (Callable Functions)
    
    Three tools are registered:
    
    | Tool               | Purpose                                                   | Required Params |
    | ------------------ | --------------------------------------------------------- | --------------- |
    | `extract_file`     | Extract text/tables/metadata from documents (75+ formats) | `file_path`     |
    | `batch_extract`    | Extract from multiple documents in parallel               | `file_paths[]`  |
    | `get_capabilities` | List supported formats, features, backends                | (none)          |
    
    **Tool registration pattern** (example: `extract_file`):
    
    ```rust
    // Define Tool with name, description, JSON Schema inputSchema
    // Register with server.register_tool(tool, handler_fn)
    // Handler: parse params -> build ExtractionConfig -> call extract_file() -> return ToolResult as JSON
    ```
    
    `extract_file` optional params: `format`, `extract_tables`, `extract_images`, `ocr_enabled`, `extract_metadata`, `chunking_preset`, `generate_embeddings`.
    
    ### MCP Resources (Static Knowledge)
    
    Three resources provide static information to agents:
    
    - `kreuzberg://formats` -- Supported format list as JSON
    - `kreuzberg://features` -- Cross-binding feature matrix (from `FEATURE_MATRIX.md`)
    - `kreuzberg://api-reference` -- Generated API documentation
    
    ### MCP Prompts (Agent Templates)
    
    Two prompts guide agent extraction workflows:
    
    - `extract_for_rag` -- Document type-specific RAG extraction guidance (research paper, contract, report). Recommends chunking preset and embedding config.
    - `batch_document_processing` -- Optimal concurrency, grouping, and error handling for batch workflows.
    
    ### MCP Transport Protocols
    
    - **HTTP/REST**: MCP routes mounted alongside REST API on separate `/mcp/` prefix
    - **Stdio**: JSON-RPC 2.0 over stdin/stdout for local CLI integration (e.g., Claude Desktop)
    
    ### Integration with Claude Desktop
    
    ```json
    {
      "mcpServers": {
        "kreuzberg": {
          "command": "kreuzberg-mcp",
          "env": {
            "KREUZBERG_API_BASE": "http://localhost:8000",
            "KREUZBERG_MCP_TRANSPORT": "stdio"
          }
        }
      }
    }
    ```
    
    ### MCP Error Handling
    
    `ToolError` variants: `FileNotFound`, `UnsupportedFormat`, `ExtractionFailed`, `OnnxRuntimeMissing`, `TesseractMissing`, `Timeout`. Each maps to an MCP `ToolResultError` with descriptive code and message.
    
    ## Environment Configuration
    
    See `.env.example` for all configurable variables. Key categories:
    
    - **Server**: `KREUZBERG_HOST`, `KREUZBERG_PORT`
    - **Size limits**: `KREUZBERG_MAX_REQUEST_BODY_BYTES` (default 100MB), `KREUZBERG_MAX_MULTIPART_FIELD_BYTES`
    - **Features**: `KREUZBERG_ENABLE_OCR`, `KREUZBERG_ENABLE_EMBEDDINGS`, `KREUZBERG_ENABLE_KEYWORDS`
    - **Cache**: `KREUZBERG_CACHE_ENABLED`, `KREUZBERG_CACHE_SIZE`
    - **CORS**: `CORS_ALLOWED_ORIGINS` (comma-separated)
    - **MCP**: `KREUZBERG_MCP_HOST`, `KREUZBERG_MCP_PORT`, `KREUZBERG_MCP_TRANSPORT` (stdio/http)
    - **Logging**: `RUST_LOG=kreuzberg=info,tower_http=debug`
    
    ## Critical Rules
    
    ### REST API Rules
    
    1. **Always validate multipart file uploads** - Check MIME type, size, magic bytes
    2. **Timeout long-running extractions** - Set per-handler timeout (5 min default)
    3. **Stream large files** - Never buffer entire multi-GB file in memory
    4. **Cache aggressively** - Identical files should return from cache in <1ms
    5. **Parallel extraction is CPU-bound** - Limit workers to CPU count + 1
    6. **Error responses must be actionable** - Include error code and remediation suggestion
    7. **Health checks must verify features** - Report missing dependencies (ONNX, Tesseract)
    8. **Size limits are configurable** - Allow override via env var for large deployments
    9. **CORS is permissive by default** - Restrict in production via env var
    10. **Log all requests** - Track extraction metrics for observability
    
    ### MCP Rules
    
    1. **All tools must have timeout** - Prevent hanging on large files (default 5 min)
    2. **Error responses must be detailed** - Include suggestions for missing dependencies
    3. **Feature gates must be checked** - Return helpful message if feature unavailable (embeddings, OCR)
    4. **Resources should be static** - Don't query external services in resource handlers
    5. **Prompts guide agents** - Provide clear examples and best practices
    6. **Batch tools must support cancellation** - Allow agent to stop long-running batch operations
    7. **Log all tool calls** - Track usage for analytics and debugging
    
    ## Related Skills
    
    - **extraction-pipeline-patterns** - Core extraction called by handlers and MCP tools
    - **chunking-embeddings** - Optional chunking/embedding parameters in extraction
    - **ocr-backend-management** - OCR engine selection and image preprocessing
    
  • .ai-rulez/skills/chunking-embeddings/SKILL.md (5734 bytes)
    ---
    description: "Chunking, embeddings, and RAG pipeline integration"
    name: chunking-embeddings
    priority: critical
    ---
    
    # Chunking & Embeddings
    
    **Text splitting strategies, embedding generation with FastEmbed, RAG pipeline integration**
    
    ## Chunking Architecture Overview
    
    **Location**: `crates/kreuzberg/src/chunking/`, `crates/kreuzberg/src/embeddings.rs`
    
    ```text
    Extracted Text
        |
    [1. Normalization] -> Clean whitespace, remove control chars
        |
    [2. Chunk Strategy Selection] -> Fixed-size, semantic, syntax-aware, recursive
        |
    [3. Overlap Management] -> Control context window overlap
        |
    [4. Optional Embedding] -> Generate vectors with FastEmbed
        |
    Output: Vec<Chunk> with text, vectors, metadata
    ```
    
    ## Chunking Strategies
    
    **Location**: `crates/kreuzberg/src/chunking/mod.rs`
    
    | Strategy                          | Pattern                                                 | Best For                                                           |
    | --------------------------------- | ------------------------------------------------------- | ------------------------------------------------------------------ |
    | **Fixed-Size**                    | Sliding window with configurable overlap                | Uniform chunks for embedding models with fixed token limits        |
    | **Semantic**                      | Split by sentences, merge/split by similarity threshold | Smart context preservation for LLM consumption and semantic search |
    | **Syntax-Aware**                  | Split by paragraph/section/heading/code-block structure | Preserving document structure (sections, code blocks) in RAG       |
    | **Recursive** (LangChain pattern) | Try separators in order: `\n\n`, `\n`, `,`              | Best general-purpose chunking; auto-finds optimal split points     |
    
    Key config fields per strategy (see struct definitions in `chunking/mod.rs`):
    
    - Fixed-Size: `chunk_size`, `overlap`, `trim_whitespace`
    - Semantic: `target_chunk_size`, `min/max_chunk_size`, `semantic_threshold`, `use_sentence_boundaries`
    - Syntax-Aware: `chunk_by` (Paragraph/Section/Heading/Sentence/CodeBlock), `max_chunk_size`, `respect_code_blocks`
    - Recursive: `separators[]`, `chunk_size`, `overlap`
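    
    The Fixed-Size strategy reduces to a sliding window over the text. A character-based sketch (the real implementation may count tokens; the function name is illustrative):
    
    ```python
    def fixed_size_chunks(text: str, chunk_size: int = 512, overlap: int = 50):
        """Sliding window: each chunk starts chunk_size - overlap after
        the previous one, so consecutive chunks share `overlap` characters."""
        if overlap >= chunk_size:
            raise ValueError("overlap must be smaller than chunk_size")
        step = chunk_size - overlap
        return [
            text[i:i + chunk_size]
            for i in range(0, max(len(text) - overlap, 1), step)
        ]
    ```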
    
    ## Chunking Configuration Presets
    
    **Location**: `crates/kreuzberg/src/chunking/mod.rs`
    
    | Preset       | Chunk Size  | Overlap | Strategy   | Use Case               |
    | ------------ | ----------- | ------- | ---------- | ---------------------- |
    | **Balanced** | 512 tokens  | 50      | Semantic   | RAG sweet spot         |
    | **Compact**  | 256 tokens  | 32      | Fixed-Size | Dense vectors          |
    | **Extended** | 1024 tokens | 100     | Recursive  | Full context           |
    | **Minimal**  | 128 tokens  | 16      | (default)  | Lightweight embeddings |
    
    Usage: set `config.chunking.preset = Some("balanced")` in `ExtractionConfig`.
    
    ## Embedding Generation with FastEmbed
    
    **Location**: `crates/kreuzberg/src/embeddings.rs`
    
    ### Model Selection
    
    | Model                               | Dimensions | Notes                            |
    | ----------------------------------- | ---------- | -------------------------------- |
    | `BAAI/bge-small-en-v1.5` (default)  | 384        | Fast, excellent for RAG          |
    | `BAAI/bge-small-zh-v1.5`            | 384        | Chinese optimized                |
    | `BAAI/bge-base-en-v1.5`             | 768        | Better quality, slower           |
    | `jinaai/jina-embeddings-v2-base-en` | 768        | Long context (up to 8192 tokens) |
    | `Custom(path)`                      | varies     | Custom ONNX model path           |
    
    ### Embedding Pattern
    
    `TextEmbeddingManager` provides singleton-cached models per config. Pattern:
    
    1. `get_or_init_model()` -- lazy-loads ONNX model (downloads if needed), caches in `Arc<RwLock<HashMap>>`
    2. `embed_chunks()` -- collects chunk texts, calls `model.embed(texts, batch_size)`, zips results back to `ChunkWithEmbedding`
    
    Default config: `batch_size=256`, `device=CPU`, `parallel_requests=4`.
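    
    The singleton-cached model pattern in step 1 amounts to a lock-guarded lazy map. A sketch where `loader` stands in for the ONNX download/initialization (not the real `TextEmbeddingManager` API):
    
    ```python
    import threading
    
    class ModelCache:
        """Lazy-load each model once; subsequent calls reuse the cached instance."""
    
        def __init__(self, loader):
            self._loader = loader
            self._models = {}
            self._lock = threading.Lock()
            self.loads = 0
    
        def get_or_init_model(self, name: str):
            with self._lock:  # one thread initializes, others reuse
                if name not in self._models:
                    self._models[name] = self._loader(name)
                    self.loads += 1
                return self._models[name]
    ```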
    
    ### ONNX Runtime Requirement
    
    Embeddings require ONNX Runtime. Feature-gated via:
    
    ```toml
    [features]
    embeddings = ["dep:fastembed", "dep:ort"]
    ```
    
    Install: `brew install onnxruntime` (macOS) / `apt install libonnxruntime libonnxruntime-dev` (Linux). Verify: `echo $ORT_DYLIB_PATH`.
    
    ## RAG Integration Pattern
    
    The full extraction-to-RAG pipeline:
    
    1. **Extract**: `extract_file(path, config)` -> `ExtractionResult`
    2. **Chunk**: Apply preset strategy to `result.content` -> `Vec<Chunk>`
    3. **Embed**: If embedding config present, `TextEmbeddingManager::embed_chunks()` -> `Vec<ChunkWithEmbedding>`
    4. **Output**: `RagDocument { file_path, metadata, chunks }` ready for vector DB ingestion
    
    See `ChunkWithEmbedding` struct in `types.rs`: contains `text`, `embedding: Vec<f32>`, `dimensions`, `norm`, `metadata`.
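    
    Steps 1-4 can be wired together as a small pipeline. A hypothetical end-to-end sketch with stand-ins for the extract/chunk/embed stages and `RagDocument` modeled as a plain dict:
    
    ```python
    def build_rag_document(file_path, extract, chunk, embed=None):
        """Extract -> chunk -> (optionally) embed -> assemble RAG document."""
        result = extract(file_path)                        # 1. extract
        texts = chunk(result["content"])                   # 2. chunk
        if embed is not None:                              # 3. optional embed
            chunks = [{"text": t, "embedding": embed(t)} for t in texts]
        else:
            chunks = [{"text": t} for t in texts]
        return {                                           # 4. vector-DB-ready output
            "file_path": file_path,
            "metadata": result["metadata"],
            "chunks": chunks,
        }
    ```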
    
    ## Critical Rules
    
    1. **Chunking is preprocessing** - Always apply before embedding to ensure consistent vector sizes
    2. **Overlap prevents information loss** - Set overlap to 15-20% of chunk size
    3. **Embedding models are stateful** - Lazy load and cache to avoid repeated initialization
    4. **ONNX Runtime is required** - Gracefully degrade if not available (skip embeddings)
    5. **Batch embedding for performance** - Never embed single chunks; batch 50-1000 chunks
    6. **Normalize embeddings for search** - Use L2 norm for cosine similarity
    7. **Cache embedding results** - Don't re-embed identical text chunks
    8. **Model selection impacts quality** - bge-small (384) for speed, bge-base (768) for quality
    
    ## Related Skills
    
    - **extraction-pipeline-patterns** - Text extraction preceding chunking
    - **api-server-mcp** - Endpoint for chunking + embedding operations
    - **ocr-backend-management** - OCR text quality affects chunking success
    
  • skills/kreuzberg/SKILL.md (15343 bytes)
    ---
    name: kreuzberg
    description: >-
      Extract text, tables, metadata, and images from 91+ document formats
      (PDF, Office, images, HTML, email, archives, academic) using Kreuzberg.
      Use when writing code that calls Kreuzberg APIs in Python, Node.js/TypeScript,
      Rust, or CLI. Covers installation, extraction (sync/async), configuration
      (OCR, chunking, output format), batch processing, error handling, and plugins.
    license: Elastic-2.0
    metadata:
      author: kreuzberg-dev
      version: "1.0"
      repository: https://github.com/kreuzberg-dev/kreuzberg
    ---
    
    # Kreuzberg Document Extraction
    
    Kreuzberg is a high-performance document intelligence library with a Rust core and native bindings for Python, Node.js/TypeScript, Ruby, Go, Java, C#, PHP, and Elixir. It extracts text, tables, metadata, and images from 91+ file formats including PDF, Office documents, images (with OCR), HTML, email, archives, and academic formats.
    
    Use this skill when writing code that:
    
    - Extracts text or metadata from documents
    - Performs OCR on scanned documents or images
    - Batch-processes multiple files
    - Configures extraction options (output format, chunking, OCR, language detection)
    - Implements custom plugins (post-processors, validators, OCR backends)
    
    ## Installation
    
    ### Python
    
    ```bash
    pip install kreuzberg
    # Optional OCR backends:
    pip install kreuzberg[easyocr]    # EasyOCR
    ```
    
    ### Node.js
    
    ```bash
    npm install @kreuzberg/node
    ```
    
    ### Rust
    
    ```toml
    # Cargo.toml
    [dependencies]
    kreuzberg = { version = "4", features = ["tokio-runtime"] }
    # features: tokio-runtime (required for sync + batch), pdf, ocr, chunking,
    #           embeddings, language-detection, keywords-yake, keywords-rake
    ```
    
    ### CLI
    
    ```bash
    # Download from GitHub releases, or:
    cargo install kreuzberg-cli
    ```
    
    ## Quick Start
    
    ### Python (Async)
    
    ```python
    from kreuzberg import extract_file
    
    result = await extract_file("document.pdf")
    print(result.content)       # extracted text
    print(result.metadata)      # document metadata
    print(result.tables)        # extracted tables
    ```
    
    ### Python (Sync)
    
    ```python
    from kreuzberg import extract_file_sync
    
    result = extract_file_sync("document.pdf")
    print(result.content)
    ```
    
    ### Node.js
    
    ```typescript
    import { extractFile } from "@kreuzberg/node";
    
    const result = await extractFile("document.pdf");
    console.log(result.content);
    console.log(result.metadata);
    console.log(result.tables);
    ```
    
    ### Node.js (Sync)
    
    ```typescript
    import { extractFileSync } from "@kreuzberg/node";
    
    const result = extractFileSync("document.pdf");
    ```
    
    ### Rust (Async)
    
    ```rust
    use kreuzberg::{extract_file, ExtractionConfig};
    
    #[tokio::main]
    async fn main() -> kreuzberg::Result<()> {
        let config = ExtractionConfig::default();
        let result = extract_file("document.pdf", None, &config).await?;
        println!("{}", result.content);
        Ok(())
    }
    ```
    
    ### Rust (Sync) — requires `tokio-runtime` feature
    
    ```rust
    use kreuzberg::{extract_file_sync, ExtractionConfig};
    
    fn main() -> kreuzberg::Result<()> {
        let config = ExtractionConfig::default();
        let result = extract_file_sync("document.pdf", None, &config)?;
        println!("{}", result.content);
        Ok(())
    }
    ```
    
    ### CLI
    
    ```bash
    kreuzberg extract document.pdf
    kreuzberg extract document.pdf --format json
    kreuzberg extract document.pdf --output-format markdown
    ```
    
    ## Configuration
    
    All languages use the same configuration structure with language-appropriate naming conventions.
    
    ### Python (snake_case)
    
    ```python
    from kreuzberg import (
        ExtractionConfig, OcrConfig, TesseractConfig,
        PdfConfig, ChunkingConfig,
    )
    
    config = ExtractionConfig(
        ocr=OcrConfig(
            backend="tesseract",
            language="eng",
            tesseract_config=TesseractConfig(psm=6, enable_table_detection=True),
        ),
        pdf_options=PdfConfig(passwords=["secret123"]),
        chunking=ChunkingConfig(max_chars=1000, max_overlap=200),
        output_format="markdown",
    )
    
    result = await extract_file("document.pdf", config=config)
    ```
    
    ### Node.js (camelCase)
    
    ```typescript
    import { extractFile, type ExtractionConfig } from "@kreuzberg/node";
    
    const config: ExtractionConfig = {
      ocr: { backend: "tesseract", language: "eng" },
      pdfOptions: { passwords: ["secret123"] },
      chunking: { maxChars: 1000, maxOverlap: 200 },
      outputFormat: "markdown",
    };
    
    const result = await extractFile("document.pdf", null, config);
    ```
    
    ### Rust (snake_case)
    
    ```rust
    use kreuzberg::{ExtractionConfig, OcrConfig, ChunkingConfig, OutputFormat};
    
    let config = ExtractionConfig {
        ocr: Some(OcrConfig {
            backend: "tesseract".into(),
            language: "eng".into(),
            ..Default::default()
        }),
        chunking: Some(ChunkingConfig {
            max_characters: 1000,
            overlap: 200,
            ..Default::default()
        }),
        output_format: OutputFormat::Markdown,
        ..Default::default()
    };
    
    let result = extract_file("document.pdf", None, &config).await?;
    ```
    
    ### Config File (TOML)
    
    ```toml
    output_format = "markdown"
    
    [ocr]
    backend = "tesseract"
    language = "eng"
    
    [chunking]
    max_chars = 1000
    max_overlap = 200
    
    [pdf_options]
    passwords = ["secret123"]
    ```
    
    ```bash
    # CLI: auto-discovers kreuzberg.toml in current/parent directories
    kreuzberg extract doc.pdf
    # or explicit:
    kreuzberg extract doc.pdf --config kreuzberg.toml
    kreuzberg extract doc.pdf --config-json '{"ocr":{"backend":"tesseract","language":"deu"}}'
    ```
    
    ## Batch Processing
    
    ### Python
    
    ```python
    from kreuzberg import batch_extract_files, batch_extract_files_sync
    
    # Async
    results = await batch_extract_files(["doc1.pdf", "doc2.docx", "doc3.xlsx"])
    
    # Sync
    results = batch_extract_files_sync(["doc1.pdf", "doc2.docx"])
    
    for result in results:
        print(f"{len(result.content)} chars extracted")
    ```
    
    ### Node.js
    
    ```typescript
    import { batchExtractFiles } from "@kreuzberg/node";
    
    const results = await batchExtractFiles(["doc1.pdf", "doc2.docx"]);
    ```
    
    ### Rust — requires `tokio-runtime` feature
    
    ```rust
    use kreuzberg::{batch_extract_file, ExtractionConfig};
    
    let config = ExtractionConfig::default();
    let paths = vec!["doc1.pdf", "doc2.docx"];
    let results = batch_extract_file(paths, &config).await?;
    ```
    
    ### CLI
    
    ```bash
    kreuzberg batch *.pdf --format json
    kreuzberg batch docs/*.docx --output-format markdown
    ```
    
    ## OCR
    
    OCR runs automatically for images and scanned PDFs. Tesseract is the default backend (native binding, no external install required).
    
    ### Backends
    
    - **Tesseract** (default): Built-in native binding. All Tesseract languages supported.
    - **EasyOCR** (Python only): `pip install kreuzberg[easyocr]`. Pass `easyocr_kwargs={"gpu": True}`.
    - **PaddleOCR** (Python only): Bundled since 4.8.5, no extra install needed. Pass `paddleocr_kwargs={"use_angle_cls": True}`.
    - **Guten** (Node.js only): Built-in OCR backend via `GutenOcrBackend`.
    
    ### Language Codes
    
    ```python
    config = ExtractionConfig(ocr=OcrConfig(language="eng"))       # English
    config = ExtractionConfig(ocr=OcrConfig(language="eng+deu"))   # Multiple
    config = ExtractionConfig(ocr=OcrConfig(language="all"))       # All installed
    ```
    
    ### Force OCR
    
    ```python
    config = ExtractionConfig(force_ocr=True)  # OCR even if text is extractable
    ```
    
    ## ExtractionResult Fields
    
    | Field        | Python                      | Node.js                    | Rust                        | Description                                   |
    | ------------ | --------------------------- | -------------------------- | --------------------------- | --------------------------------------------- |
    | Text content | `result.content`            | `result.content`           | `result.content`            | Extracted text (str/String)                   |
    | MIME type    | `result.mime_type`          | `result.mimeType`          | `result.mime_type`          | Input document MIME type                      |
    | Metadata     | `result.metadata`           | `result.metadata`          | `result.metadata`           | Document metadata (dict/object/HashMap)       |
    | Tables       | `result.tables`             | `result.tables`            | `result.tables`             | Extracted tables with cells + markdown        |
    | Languages    | `result.detected_languages` | `result.detectedLanguages` | `result.detected_languages` | Detected languages (if enabled)               |
    | Chunks       | `result.chunks`             | `result.chunks`            | `result.chunks`             | Text chunks (if chunking enabled)             |
    | Images       | `result.images`             | `result.images`            | `result.images`             | Extracted images (if enabled)                 |
    | Elements     | `result.elements`           | `result.elements`          | `result.elements`           | Semantic elements (if element_based format)   |
    | Pages        | `result.pages`              | `result.pages`             | `result.pages`              | Per-page content (if page extraction enabled) |
    | Keywords     | `result.keywords`           | `result.keywords`          | `result.keywords`           | Extracted keywords (if enabled)               |
    
    ## Error Handling
    
    ### Python
    
    ```python
    from kreuzberg import (
        extract_file_sync, KreuzbergError, ParsingError,
        OCRError, ValidationError, MissingDependencyError,
    )
    
    try:
        result = extract_file_sync("file.pdf")
    except ParsingError as e:
        print(f"Failed to parse: {e}")
    except OCRError as e:
        print(f"OCR failed: {e}")
    except ValidationError as e:
        print(f"Invalid input: {e}")
    except MissingDependencyError as e:
        print(f"Missing dependency: {e}")
    except KreuzbergError as e:
        print(f"Extraction failed: {e}")
    ```
    
    ### Node.js
    
    ```typescript
    import {
      extractFile,
      KreuzbergError,
      ParsingError,
      OcrError,
      ValidationError,
      MissingDependencyError,
    } from "@kreuzberg/node";
    
    try {
      const result = await extractFile("file.pdf");
    } catch (e) {
      if (e instanceof ParsingError) {
        /* ... */
      } else if (e instanceof OcrError) {
        /* ... */
      } else if (e instanceof ValidationError) {
        /* ... */
      } else if (e instanceof KreuzbergError) {
        /* ... */
      }
    }
    ```
    
    ### Rust
    
    ```rust
    use kreuzberg::{extract_file, ExtractionConfig, KreuzbergError};
    
    let config = ExtractionConfig::default();
    match extract_file("file.pdf", None, &config).await {
        Ok(result) => println!("{}", result.content),
        Err(KreuzbergError::Parsing(msg)) => eprintln!("Parse error: {msg}"),
        Err(KreuzbergError::Ocr(msg)) => eprintln!("OCR error: {msg}"),
        Err(e) => eprintln!("Error: {e}"),
    }
    ```
    
    ## Common Pitfalls
    
    1. **Python ChunkingConfig fields**: Use `max_chars` and `max_overlap`, NOT `max_characters` or `overlap`.
    2. **Rust extract_file signature**: Third argument is `&ExtractionConfig` (a reference), not `Option`. Use `&ExtractionConfig::default()` for defaults.
    3. **Rust feature gates**: `extract_file_sync`, `batch_extract_file`, and `batch_extract_file_sync` all require `features = ["tokio-runtime"]` in Cargo.toml.
    4. **Rust async context**: `extract_file` is async. Use `#[tokio::main]` or call from an async context.
    5. **CLI --format vs --output-format**: `--format` controls CLI output (text/json). `--output-format` controls content format (plain/markdown/djot/html).
    6. **Node.js extractFile signature**: `extractFile(path, mimeType?, config?)` — mimeType is the second arg (pass `null` to skip).
    7. **Python detect_mime_type**: The function for detecting from bytes is `detect_mime_type(data)`. For paths use `detect_mime_type_from_path(path)`.
    8. **Config file field names**: Use snake_case in TOML/YAML/JSON config files (e.g., `max_chars`, `max_overlap`, `pdf_options`).
    
    ## Supported Formats (Summary)
    
    | Category          | Extensions                                                                                                                                                  |
    | ----------------- | ----------------------------------------------------------------------------------------------------------------------------------------------------------- |
    | **PDF**           | `.pdf`                                                                                                                                                      |
    | **Word**          | `.docx`, `.odt`                                                                                                                                             |
    | **Spreadsheets**  | `.xlsx`, `.xlsm`, `.xlsb`, `.xls`, `.xla`, `.xlam`, `.xltm`, `.ods`                                                                                         |
    | **Presentations** | `.pptx`, `.ppt`, `.ppsx`                                                                                                                                    |
    | **eBooks**        | `.epub`, `.fb2`                                                                                                                                             |
    | **Images**        | `.png`, `.jpg`, `.jpeg`, `.gif`, `.webp`, `.bmp`, `.tiff`, `.tif`, `.jp2`, `.jpx`, `.jpm`, `.mj2`, `.jbig2`, `.jb2`, `.pnm`, `.pbm`, `.pgm`, `.ppm`, `.svg` |
    | **Markup**        | `.html`, `.htm`, `.xhtml`, `.xml`                                                                                                                           |
    | **Data**          | `.json`, `.yaml`, `.yml`, `.toml`, `.csv`, `.tsv`                                                                                                           |
    | **Text**          | `.txt`, `.md`, `.markdown`, `.djot`, `.rst`, `.org`, `.rtf`                                                                                                 |
    | **Email**         | `.eml`, `.msg`                                                                                                                                              |
    | **Archives**      | `.zip`, `.tar`, `.tgz`, `.gz`, `.7z`                                                                                                                        |
    | **Academic**      | `.bib`, `.biblatex`, `.ris`, `.nbib`, `.enw`, `.csl`, `.tex`, `.latex`, `.typ`, `.jats`, `.ipynb`, `.docbook`, `.opml`, `.pod`, `.mdoc`, `.troff`           |
    
    See [references/supported-formats.md](references/supported-formats.md) for the complete format reference with MIME types.
    
    ## Additional Resources
    
    Detailed reference files for specific topics:
    
    - **[Python API Reference](references/python-api.md)** — All functions, config classes, plugin protocols, exact signatures
    - **[Node.js API Reference](references/nodejs-api.md)** — All functions, TypeScript interfaces, worker pool APIs
    - **[Rust API Reference](references/rust-api.md)** — All functions with feature gates, structs, Cargo.toml examples
    - **[CLI Reference](references/cli-reference.md)** — All commands, flags, config precedence, exit codes
    - **[Configuration Reference](references/configuration.md)** — TOML/YAML/JSON formats, auto-discovery, env vars, full schema
    - **[Supported Formats](references/supported-formats.md)** — All 85+ formats with file extensions and MIME types
    - **[Advanced Features](references/advanced-features.md)** — Plugins, embeddings, MCP server, API server, security limits
    - **[Other Language Bindings](references/other-bindings.md)** — Go, Ruby, Java, C#, PHP, Elixir, WASM, Docker
    
    Full documentation: <https://docs.kreuzberg.dev>
    GitHub: <https://github.com/kreuzberg-dev/kreuzberg>
    
  • .ai-rulez/skills/plugin-architecture-patterns/SKILL.mdskill
    Show content (3045 bytes)
    ---
    name: plugin-architecture-patterns
    description: "Plugin architecture, registration, and trait patterns"
    priority: critical
    ---
    
    # Plugin Architecture & Registration
    
    ## Plugin Types
    
    | Type               | Trait                       | Location                     |
    | ------------------ | --------------------------- | ---------------------------- |
    | Document Extractor | `DocumentExtractor: Plugin` | `plugins/extractor/trait.rs` |
    | OCR Backend        | `OcrBackend: Plugin`        | `plugins/ocr/trait.rs`       |
    | Post Processor     | `PostProcessor: Plugin`     | `plugins/processor/trait.rs` |
    | Validator          | `Validator: Plugin`         | `plugins/validator/trait.rs` |
    
    ## DocumentExtractor Implementation
    
    ```rust
    use crate::plugins::{DocumentExtractor, Plugin};
    use async_trait::async_trait;
    
    pub struct MyExtractor;
    
    impl Plugin for MyExtractor {
        fn name(&self) -> &str { "my-extractor" }
        fn version(&self) -> String { env!("CARGO_PKG_VERSION").to_string() }
    }
    
    #[async_trait]
    impl DocumentExtractor for MyExtractor {
        async fn extract_bytes(&self, content: &[u8], mime_type: &str, config: &ExtractionConfig)
            -> Result<ExtractionResult> { /* ... */ }
    
        fn supported_mime_types(&self) -> &[&str] { &["application/x-custom"] }
        fn priority(&self) -> i32 { 50 }
    
        // WASM support (optional)
        fn as_sync_extractor(&self) -> Option<&dyn SyncExtractor> { None }
    }
    ```
    
    ## Priority System
    
    | Range  | Use                       |
    | ------ | ------------------------- |
    | 0-25   | Fallback/low-quality      |
    | 26-49  | Alternative extractors    |
    | **50** | **Default (built-in)**    |
    | 51-75  | Premium/enhanced          |
    | 76-100 | Specialized/high-priority |
    
    Registry selects **highest priority** extractor for each MIME type. Override built-ins with priority > 50.
    
    ## Registration
    
    ```rust
    // In extractors/mod.rs → register_default_extractors()
    let registry = get_document_extractor_registry();
    let mut registry = registry.write()
        .map_err(|e| KreuzbergError::Other(format!("Registry lock poisoned: {}", e)))?;
    registry.register(Arc::new(MyExtractor::new()))?;
    ```
    
    ## Feature-Gated Registration
    
    ```rust
    #[cfg(feature = "office")]
    {
        registry.register(Arc::new(DocxExtractor::new()))?;
        registry.register(Arc::new(PptxExtractor::new()))?;
    }
    ```
    
    ## PostProcessor Pattern
    
    ```rust
    impl PostProcessor for MyProcessor {
        async fn process(&self, result: &mut ExtractionResult, config: &ExtractionConfig)
            -> Result<()> {
            result.content = process_content(&result.content);
            Ok(())
        }
        fn stage(&self) -> ProcessorStage { ProcessorStage::Middle }
    }
    ```
    
    Stages: `Early` → `Middle` → `Late`. Failures isolated (don't block others).
    
    ## Critical Rules
    
    1. All plugins **MUST be `Send + Sync`**
    2. Feature gate with `#[cfg(feature = "...")]` for optional formats
    3. Use `#[async_trait]` for `DocumentExtractor`
    4. Initialization via `ensure_initialized()` (lazy, called before first extraction)
    5. Plugin names: kebab-case (e.g., `"pdf-extractor"`)
    

README

Kreuzberg

Linkedin- Banner

Extract text, metadata, and code intelligence from 97+ file formats and 305 programming languages at native speeds without needing a GPU.

Key Features

  • Code intelligence – Extract functions, classes, imports, symbols, and docstrings from 248 programming languages via tree-sitter. Results in ExtractionResult.code_intelligence with semantic chunking
  • Extensible architecture – Plugin system for custom OCR backends, validators, post-processors, document extractors, and renderers
  • Polyglot – Native bindings for Rust, Python, TypeScript/Node.js, Ruby, Go, Java, C#, PHP, Elixir, R, and C
  • 91+ file formats – PDF, Office documents, images, HTML, XML, emails, archives, academic formats across 8 categories
  • LLM intelligence – VLM OCR (GPT-4o, Claude, Gemini, Ollama), structured JSON extraction with schema constraints, and provider-hosted embeddings via 146 LLM providers (including local engines: Ollama, LM Studio, vLLM, llama.cpp) through liter-llm
  • OCR support – Tesseract (all bindings, including Tesseract-WASM for browsers), PaddleOCR (all native bindings), EasyOCR (Python), VLM OCR (146 vision model providers including local engines), extensible via plugin API
  • High performance – Rust core with pure-Rust PDF, SIMD optimizations and full parallelism
  • Flexible deployment – Use as library, CLI tool, REST API server, or MCP server
  • TOON wire format – Token-efficient serialization for LLM/RAG pipelines, ~30-50% fewer tokens than JSON
  • GFM-quality output – Comrak-based rendering with proper fenced code blocks, table nodes, bracket escaping, and cross-format parity (Markdown, HTML, Djot, Plain)
  • HTML passthrough – HTML-to-Markdown conversion uses html-to-markdown output directly, bypassing lossy intermediate round-trips
  • Memory efficient – Streaming parsers for multi-GB files

Complete Documentation | Live Demo | Installation Guides

Installation

Each language binding provides comprehensive documentation with examples and best practices. Choose your platform to get started:

Scripting Languages:

  • Python – PyPI package, async/sync APIs, OCR backends (Tesseract, PaddleOCR, EasyOCR)
  • Ruby – RubyGems package, idiomatic Ruby API, native bindings
  • PHP – Composer package, modern PHP 8.4+ support, type-safe API, async extraction
  • Elixir – Hex package, OTP integration, concurrent processing
  • R – r-universe package, idiomatic R API, extendr bindings

JavaScript/TypeScript:

  • @kreuzberg/node – Native NAPI-RS bindings for Node.js/Bun, fastest performance
  • @kreuzberg/wasm – WebAssembly for browsers/Deno/Cloudflare Workers, full feature parity (PDF, Excel, OCR, archives)

Compiled Languages:

  • Go – Go module with FFI bindings, context-aware async
  • Java – Maven Central, Foreign Function & Memory API
  • C# – NuGet package, .NET 6.0+, full async/await support

Native:

  • Rust – Core library, flexible feature flags, zero-copy APIs
  • C (FFI) – C header + shared library, pkg-config/CMake support, cross-platform

Containers:

  • Docker – Official images with API, CLI, and MCP server modes (Core: ~1.0-1.3GB, Full: ~1.0-1.3GB with OCR + legacy format support)

Command-Line:

  • CLI – Cross-platform binary, batch processing, MCP server mode

All language bindings include precompiled binaries for both x86_64 and aarch64 architectures on Linux and macOS.

Platform Support

Complete architecture coverage across all language bindings:

LanguageLinux x86_64Linux aarch64macOS ARM64Windows x64
Python
Node.js
WASM
Ruby-
R
Elixir
Go
Java
C#
PHP
Rust
C (FFI)
CLI
Docker-

Note: ✅ = Precompiled binaries available with instant installation. WASM runs in any environment with WebAssembly support (browsers, Deno, Bun, Cloudflare Workers). All platforms are tested in CI. MacOS support is Apple Silicon only.

Embeddings Support (Optional)

To use embeddings functionality:

  1. Install ONNX Runtime 1.24+:

  2. Use embeddings in your code - see Embeddings Guide

Note: Kreuzberg requires ONNX Runtime version 1.24+ for embeddings. All other Kreuzberg features work without ONNX Runtime.

Supported Formats

91+ file formats across 8 major categories with intelligent format detection and comprehensive metadata extraction.

Office Documents

CategoryFormatsCapabilities
Word Processing.docx, .docm, .dotx, .dotm, .dot, .odt, .pagesFull text, tables, lists, images, metadata, styles
Spreadsheets.xlsx, .xlsm, .xlsb, .xls, .xla, .xlam, .xltm, .xltx, .xlt, .ods, .numbersSheet data, formulas, cell metadata, charts
Presentations.pptx, .pptm, .ppsx, .potx, .potm, .pot, .keySlides, speaker notes, images, metadata
PDF.pdfText, tables, images, metadata, OCR support
eBooks.epub, .fb2Chapters, metadata, embedded resources
Database.dbfTable data extraction, field type support
Hangul.hwp, .hwpxKorean document format, text extraction

Images (OCR-Enabled)

CategoryFormatsFeatures
Raster.png, .jpg, .jpeg, .gif, .webp, .bmp, .tiff, .tifOCR, table detection, EXIF metadata, dimensions, color space
Advanced.jp2, .jpx, .jpm, .mj2, .jbig2, .jb2, .pnm, .pbm, .pgm, .ppmPure Rust decoders (JPEG 2000, JBIG2), OCR, table detection
Vector.svgDOM parsing, embedded text, graphics metadata

Web & Data

CategoryFormatsFeatures
Markup.html, .htm, .xhtml, .xml, .svgDOM parsing, metadata (Open Graph, Twitter Card), link extraction
Structured Data.json, .yaml, .yml, .toml, .csv, .tsvSchema detection, nested structures, validation
Text & Markdown.txt, .md, .markdown, .djot, .mdx, .rst, .org, .rtfCommonMark, GFM, Djot, MDX, reStructuredText, Org Mode, Rich Text

Email & Archives

CategoryFormatsFeatures
Email.eml, .msgHeaders, body (HTML/plain), attachments, UTF-16 support
Archives.zip, .tar, .tgz, .gz, .7zRecursive extraction, nested archives, metadata

Academic & Scientific

CategoryFormatsFeatures
Citations.bib, .ris, .nbib, .enw, .cslBibTeX/BibLaTeX, RIS, PubMed/MEDLINE, EndNote XML, CSL JSON
Scientific.tex, .latex, .typ, .typst, .jats, .ipynbLaTeX, Typst, JATS journal articles, Jupyter notebooks
Publishing.fb2, .docbook, .dbk, .opmlFictionBook, DocBook XML, OPML outlines
Documentation.pod, .mdoc, .troffPerl POD, man pages, troff

Complete Format Reference →

Code Intelligence (248 Languages)

FeatureDescription
Structure ExtractionFunctions, classes, methods, structs, interfaces, enums
Import/Export AnalysisModule dependencies, re-exports, wildcard imports
Symbol ExtractionVariables, constants, type aliases, properties
Docstring ParsingGoogle, NumPy, Sphinx, JSDoc, RustDoc, and 10+ formats
DiagnosticsParse errors with line/column positions
Syntax-Aware ChunkingSplit code by semantic boundaries, not arbitrary byte offsets

Powered by tree-sitter-language-pack with dynamic grammar download. See TSLP documentation for the full language list.

Key Features

OCR with Table Extraction

Multiple OCR backends (Tesseract, EasyOCR, PaddleOCR) with intelligent table detection and reconstruction. Extract structured data from scanned documents and images with configurable accuracy thresholds.

OCR Backend Documentation →

Batch Processing

Process multiple documents concurrently with configurable parallelism. Optimize throughput for large-scale document processing workloads with automatic resource management.

Batch Processing Guide →

Password-Protected PDFs

Handle encrypted PDFs with single or multiple password attempts. Supports both RC4 and AES encryption with automatic fallback strategies.

PDF Configuration →

Language Detection

Automatic language detection in extracted text using fast-langdetect. Configure confidence thresholds and access per-language statistics.

Language Detection Guide →

Metadata Extraction

Extract comprehensive metadata from all supported formats: authors, titles, creation dates, page counts, EXIF data, and format-specific properties.

Metadata Guide →

AI Coding Assistants

Kreuzberg ships with an Agent Skill that teaches AI coding assistants how to use the library correctly. It works with Claude Code, Codex, Gemini CLI, Cursor, VS Code, Amp, Goose, Roo Code, and any tool supporting the Agent Skills standard.

Install the skill into any project using the Vercel Skills CLI:

npx skills add kreuzberg-dev/kreuzberg

The skill is located at skills/kreuzberg/SKILL.md and is automatically discovered by supported AI coding tools once installed.

Documentation

Contributing

Contributions are welcome! See CONTRIBUTING.md for guidelines.

License

Elastic License 2.0 (ELv2) - see LICENSE for details. See https://www.elastic.co/licensing/elastic-license for the full license text.