USP

Kreuzberg offers native bindings for 10+ languages, VLM OCR with 146 LLM providers, and a TOON wire format for token-efficient RAG pipelines. Its extensible architecture supports custom OCR backends and post-processors, making it highly adaptable.
Use cases

- Building RAG pipelines for LLM applications
- Automated document processing and content extraction
- Extracting code intelligence for AI agents and analysis
- Performing OCR on scanned documents and images
- Batch processing various file types for data ingestion
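The RAG-pipeline use case above can be sketched end-to-end in a few lines. This is a hedged, self-contained illustration, not Kreuzberg code: `extract_text` is a stub standing in for a real extraction call, and the chunker is a generic fixed-size sliding window with overlap.

```python
# Minimal sketch of the RAG-ingestion use case: extract -> chunk -> ingest.
# `extract_text` is a stub; a real pipeline would call a document-extraction API.

def extract_text(path: str) -> str:
    """Stub extractor standing in for a real extraction call."""
    return "Lorem ipsum dolor sit amet " * 50

def chunk_fixed(text: str, chunk_size: int = 512, overlap: int = 64) -> list[str]:
    """Sliding window: each chunk shares `overlap` characters with the previous one."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    step = chunk_size - overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]

chunks = chunk_fixed(extract_text("document.pdf"), chunk_size=512, overlap=64)
# Each chunk is at most 512 chars; consecutive chunks overlap by 64 chars,
# so no sentence is lost at a chunk boundary before embedding/ingestion.
```

Character counts are used here for simplicity; production chunkers typically count tokens instead.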
Detected files (6)

`.ai-rulez/skills/extraction-pipeline-patterns/SKILL.md` (skill, 7050 bytes)
---
description: "Document extraction pipeline architecture and patterns"
name: extraction-pipeline-patterns
priority: critical
---

# Extraction Pipeline Patterns

**Kreuzberg's format detection -> extraction -> fallback orchestration for 75+ file formats**

## Core Pipeline Architecture

The extraction pipeline (`crates/kreuzberg/src/core/pipeline.rs`, `crates/kreuzberg/src/extraction/`) orchestrates:

1. **Format Detection** - MIME type inference + extension validation -> select appropriate extractor
2. **Intelligent Extraction** - Route to format-specific extractors (PDF, DOCX, Excel, HTML, images, archives, etc.)
3. **Fallback Strategies** - Password-protected PDFs, OCR for images, nested archive handling, corrupted file recovery
4. **Post-Processing Pipeline** - Validators, quality processing, chunking, custom hooks (see `core/pipeline.rs`)

## Format Detection Strategy

**Location**: `crates/kreuzberg/src/core/mime.rs`, `crates/kreuzberg/src/core/formats.rs`

Pattern: detect via magic bytes, validate extension alignment (prevent spoofing), route to extractor. Multiple extractors for the same format -> choose highest confidence/specificity.
```rust
// Pseudocode: core/mime.rs
match (magic_bytes(content), extension) {
    (Some(fmt), Some(ext)) if aligned => Ok(fmt),
    (Some(fmt), Some(ext)) if misaligned => Err(FormatMismatch),
    (Some(fmt), None) => Ok(fmt), // magic bytes only
    (None, Some(ext)) => Ok(from_extension(ext)),
    _ => Err(UnknownFormat),
}
```

## Extraction Modules (75 Formats)

| Category | Extractors | Key Modules |
| --- | --- | --- |
| **Office** | DOCX, XLSX, XLSM, XLSB, XLS, PPTX, ODP, ODS | `extraction/{docx,excel,pptx}.rs` |
| **PDF** | Standard + encrypted, password attempts | `pdf/` subdirectory (13 files) |
| **Images** | PNG, JPG, TIFF, WebP, JP2, SVG (OCR-enabled) | `extraction/image.rs` + `ocr/` |
| **Web** | HTML, XHTML, XML, SVG (DOM parsing) | `extraction/html.rs` (67KB - complex table handling) |
| **Email** | EML, MSG (headers, body, attachments, threading) | `extraction/email.rs` |
| **Archives** | ZIP, TAR, GZ, 7Z (recursive extraction) | `extraction/archive.rs` (31KB) |
| **Markdown** | MD, TXT, RST, Org Mode, RTF | `extraction/markdown.rs` |
| **Academic** | LaTeX, BibTeX, JATS, Jupyter, DocBook | `extraction/{structured,xml}.rs` |

## Extraction Dispatcher

```rust
// Pseudocode: extraction/mod.rs
let format = detect_format(source.bytes, source.extension);
let result = match format {
    Pdf => extract_pdf(source, config),
    Docx => extract_docx(source, config),
    Image => extract_image_with_ocr_fallback(source, config),
    Archive => extract_archive_recursive(source, config),
    _ => extract_with_plugin(format, source, config),
};
run_pipeline(result, config) // post-processing always runs
```

## Fallback Strategies

- **Password-Protected PDFs**: Try primary password -> secondary password list -> return `is_encrypted=true` in metadata on failure
- **OCR Fallback**: If image text extraction confidence < threshold, trigger the OCR backend; return both results with scores
- **Nested Archives**: Recursive extraction with a configurable depth limit; flatten or preserve hierarchy
- **Corrupted File Recovery**: Stream-based parsing, emit content up to the error point, include the error location in metadata

## Configuration Integration

**Location**: `crates/kreuzberg/src/core/config.rs`, `crates/kreuzberg/src/core/config_validation.rs`

`ExtractionConfig` holds format-specific configs (`pdf`, `image`, `html`, `office`), fallback orchestration (`fallback`), and post-processing (`postprocessor`, `chunking`, `keywords`). See the struct definition in `config.rs`.

## Plugin System Integration

**Location**: `crates/kreuzberg/src/plugins/`

- **CustomExtractor**: Override built-in format extractors
- **PostProcessor**: Modify results after extraction (Early/Middle/Late stages)
- **Validator**: Fail-fast validation (e.g., minimum text length)
- **OCRBackend**: Swap OCR engine

The plugin registry is loaded at startup and cached for zero-cost lookup.

## Feature Flag Strategy

**Location**: `Cargo.toml` (workspace), `crates/kreuzberg/Cargo.toml`, `FEATURE_MATRIX.md`

20+ features across 9 language bindings. Key feature groups:

| Group | Features | Notes |
| --- | --- | --- |
| OCR | `tesseract` (default), `tesseract-static`, `ocr-minimal` | Mutually exclusive recommendation |
| Formats | `pdf`, `pdf-minimal`, `office`, `office-minimal` | |
| AI/ML | `embeddings` (requires ONNX), `keywords-yake`, `keywords-rake`, `language-detection` | |
| Server | `api` (Axum), `mcp`, `tokio-runtime`, `lite-runtime` | |
| Bindings | `python-bindings`, `ruby-bindings`, `php-bindings`, `node-bindings`, `wasm` | |

Conditional compilation: modules are gated with `#[cfg(feature = "...")]`. At runtime, `validate_config()` warns if a requested feature is not compiled in.

### Feature Flag Critical Rules

1. **Never mix conflicting features** - e.g., `ocr-minimal` + `tesseract` should error at compile time
2. **Always provide feature diagnostics** - Config validation must warn if a feature is unavailable
3. **Default to the maximum feature set** - Unless embedded/minimal is specifically requested
4. **Test all feature combinations** - Matrix testing in CI catches regressions
5. **WASM is incompatible** with embeddings, keywords, OCR

## Critical Rules

1. **Always use format detection** before routing to extractors (prevents confusion attacks)
2. **Stream-based parsing** for PDFs/archives to handle multi-GB files
3. **Post-pipeline is mandatory**: All extraction results flow through `run_pipeline()` for validators/hooks
4. **Plugin overrides are order-dependent**: Plugins registered first take priority
5. **Fallback timeouts**: Set reasonable OCR/archive extraction timeouts (config-driven)
6. **Metadata preservation**: Include format detection confidence, the extraction method used, and any fallbacks applied

## Related Skills

- **ocr-backend-management** - OCR engine selection and image preprocessing
- **chunking-embeddings** - Post-extraction text splitting with FastEmbed
- **api-server-mcp** - Axum endpoint for extraction pipeline exposure and MCP server

`.ai-rulez/skills/format-specific-extraction/SKILL.md` (skill, 2735 bytes)
---
name: format-specific-extraction
description: "Format-specific document extraction workflows"
priority: high
---

# Format-Specific Extraction Workflows

## Office XML (DOCX/PPTX/ODT)

```text
ZIP archive → Security validation → XML parsing → Text + tables + metadata
```

1. `ZipBombValidator::new(limits).validate(&mut archive)?`
2. Extract XML files from the archive (`word/document.xml`, `ppt/slides/*.xml`, `content.xml`)
3. Parse with `quick-xml::Reader` (streaming) + `DepthValidator` + `StringGrowthValidator`
4. Extract metadata via `crate::extraction::office_metadata::extract_metadata()`
5. See: `extractors/docx.rs`, `extractors/pptx.rs`, `extractors/odt.rs`

## PDF

```text
Bytes → pdf_oxide → Per-page text + OCR fallback → Tables → Metadata
```

1. `pdf_oxide::PdfDocument::from_bytes(content)?`
2. Check whether OCR is needed: `config.force_ocr || !has_searchable_text()`
3. Extract text per page, and tables if `config.pages` is enabled
4. Feature-gated: `#[cfg(feature = "pdf")]`
5. See: `extractors/pdf/mod.rs`

## Archives (ZIP/TAR/7z/GZIP)

```text
Validate → Extract metadata → Extract plaintext files only
```

1. `ZipBombValidator` BEFORE any extraction
2. Extract metadata (file list, sizes)
3. Extract text content from plaintext files
4. Use the `build_archive_result()` helper
5. See: `extractors/archive.rs`, `extraction/archive/*.rs`

## Structured Text (JSON/YAML/TOML/XML)

```text
Detect format from MIME → Parse → Pretty-print → Metadata
```

A single `StructuredExtractor` handles multiple MIME types: parse with a format-specific library, then pretty-print to text. See: `extractors/structured.rs`

## Email (EML/MSG)

```text
Parse headers → Extract body (text/html) → Process attachments
```

See: `extraction/email.rs`, `extractors/email.rs`

## Common Helpers

| Helper | Location | Purpose |
| --- | --- | --- |
| `office_metadata::extract_metadata()` | `extraction/office.rs` | Office XML metadata |
| `cells_to_markdown()` | `extraction/mod.rs` | Convert cell grid to GFM table |
| `build_archive_result()` | `extraction/archive/mod.rs` | Standard archive result |

## Adding a New Format

1. Add the MIME type to `EXT_TO_MIME` in `core/mime.rs`
2. Create an extractor implementing the `DocumentExtractor` trait
3. Set `supported_mime_types()` and `priority()` (default: 50)
4. Register in `extractors/mod.rs` → `register_default_extractors()`
5. Feature-gate if optional: `#[cfg(feature = "my-format")]`
6. Apply security validators for user content
7. Add tests with fixture files

`.ai-rulez/skills/api-server-mcp/SKILL.md` (skill, 9419 bytes)
---
description: "REST API server and MCP protocol integration"
name: api-server-mcp
priority: critical
---

# API Server & MCP Protocol

**Axum server design for document extraction endpoints, middleware, async processing, and Model Context Protocol integration for AI agents**

## Kreuzberg API Architecture

**Location**: `crates/kreuzberg/src/api/`, `crates/kreuzberg-cli/`

Kreuzberg provides a dual REST API + MCP server built with Axum + Tokio.

```text
Request Flow:

HTTP Client / AI Agent (Claude)
  |
[Transport Layer]
  ├── REST API (Axum HTTP)
  └── MCP Protocol (HTTP or Stdio)
  |
[Middleware Layer]
  ├── CORS, Request Logging (TraceLayer)
  ├── Request/Response size limits
  └── Rate limiting (optional)
  |
[Router]
  ├── REST Endpoints
  │   ├── POST /extract       - File upload extraction
  │   ├── POST /extract-url   - URL-based extraction
  │   ├── GET /formats        - List supported formats
  │   ├── GET /health         - Server health check
  │   ├── POST /batch         - Batch document processing
  │   ├── GET /cache/stats    - Cache statistics
  │   └── DELETE /cache       - Clear extraction cache
  ├── MCP Endpoints
  │   ├── POST /mcp/tools          - List available tools
  │   ├── POST /mcp/tools/call     - Call a tool
  │   ├── GET /mcp/resources       - List resources
  │   ├── GET /mcp/resources/:uri  - Read resource
  │   ├── GET /mcp/prompts         - List prompts
  │   └── GET /mcp/prompts/:name   - Get prompt
  |
[Handler / Tool Layer]
  ├── extract_handler / extract_file tool
  ├── batch_handler / batch_extract tool
  ├── health_handler / get_capabilities tool
  └── format_handler
  |
[Extraction Core]
  ├── Format detection
  ├── Extraction pipeline
  ├── Post-processing (chunking, embeddings)
  └── Result formatting
  |
JSON Response / MCP ToolResult
```

## Server Setup & Configuration

**Location**: `crates/kreuzberg/src/api/server.rs`

Server initialization pattern: create `ApiState` (holds `ExtractionConfig` + `ExtractionCache`), build an Axum `Router` with all REST + MCP routes, apply middleware layers (body limits, CORS, tracing), serve via `tokio::net::TcpListener`.
Key middleware layers, applied in order:

- `DefaultBodyLimit::max(100MB)` + `RequestBodyLimitLayer` -- configurable via env vars
- `CorsLayer::permissive()` -- restrict in production via `CORS_ALLOWED_ORIGINS`
- `TraceLayer::new_for_http()` -- request/response logging

## Core REST Handlers

**Location**: `crates/kreuzberg/src/api/handlers.rs`

| Handler | Method | Description |
| --- | --- | --- |
| `extract_handler` | POST /extract | Multipart upload: parse file + optional config JSON, check cache, call `extract_bytes()`, cache result |
| `extract_url_handler` | POST /extract-url | Fetch URL via reqwest, extract bytes |
| `batch_handler` | POST /batch | Parallel extraction with `Semaphore`-limited concurrency (default: CPU count) |
| `health_handler` | GET /health | Report status, version, uptime, feature availability (OCR, embeddings), cache stats |
| `formats_handler` | GET /formats | Return supported format categories (office, pdf, images, web, email, archives, academic) |
| `cache_stats_handler` | GET /cache/stats | Hit/miss counts and hit rate |
| `cache_clear_handler` | DELETE /cache | Clear LRU cache |

## Caching Strategy

**Location**: `crates/kreuzberg/src/cache/mod.rs`

LRU cache keyed by `SHA256(file_content)`, storing `Arc<ExtractionResult>`. Default 1000 entries. Thread-safe via `RwLock`. Tracks hit/miss counters with `AtomicU64` for the stats endpoint.
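The caching strategy above can be sketched in Python: an LRU cache keyed by the SHA-256 of the file bytes, with hit/miss counters backing a stats endpoint. This is illustrative only; the real cache is Rust (`RwLock` + `AtomicU64`), and the class and method names here are assumptions.

```python
# Sketch of an LRU extraction cache keyed by SHA256(file_content),
# with hit/miss counters for a /cache/stats-style endpoint.

import hashlib
from collections import OrderedDict

class ExtractionCache:
    def __init__(self, max_entries: int = 1000):
        self._entries = OrderedDict()   # insertion order doubles as LRU order
        self.max_entries = max_entries
        self.hits = 0
        self.misses = 0

    @staticmethod
    def key(content: bytes) -> str:
        return hashlib.sha256(content).hexdigest()

    def get(self, content: bytes):
        k = self.key(content)
        if k in self._entries:
            self.hits += 1
            self._entries.move_to_end(k)   # mark as most recently used
            return self._entries[k]
        self.misses += 1
        return None

    def put(self, content: bytes, result) -> None:
        k = self.key(content)
        self._entries[k] = result
        self._entries.move_to_end(k)
        if len(self._entries) > self.max_entries:
            self._entries.popitem(last=False)  # evict least recently used

    def stats(self) -> dict:
        total = self.hits + self.misses
        return {"hits": self.hits, "misses": self.misses,
                "hit_rate": self.hits / total if total else 0.0}
```

Keying on content rather than file path means renamed or re-uploaded duplicates still hit the cache.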
## Error Handling

**Location**: `crates/kreuzberg/src/api/error.rs`

The `ApiError` enum maps to HTTP status codes:

- `MissingFile` -> 400, `FileNotFound` -> 404
- `OnnxRuntimeMissing` / `TesseractMissing` -> 503 (with remediation message)
- `PayloadTooLarge` -> 413
- `ExtractionFailed` / `InvalidConfig` / `UnsupportedFormat` -> 500

## MCP Server Implementation

**Location**: `crates/kreuzberg/src/mcp/server.rs`

The MCP server lets Claude and other AI agents call Kreuzberg extraction functions through the Model Context Protocol.

### MCP Tools (Callable Functions)

Three tools are registered:

| Tool | Purpose | Required Params |
| --- | --- | --- |
| `extract_file` | Extract text/tables/metadata from documents (75+ formats) | `file_path` |
| `batch_extract` | Extract from multiple documents in parallel | `file_paths[]` |
| `get_capabilities` | List supported formats, features, backends | (none) |

**Tool registration pattern** (example: `extract_file`):

```rust
// Define Tool with name, description, JSON Schema inputSchema
// Register with server.register_tool(tool, handler_fn)
// Handler: parse params -> build ExtractionConfig -> call extract_file() -> return ToolResult as JSON
```

`extract_file` optional params: `format`, `extract_tables`, `extract_images`, `ocr_enabled`, `extract_metadata`, `chunking_preset`, `generate_embeddings`.

### MCP Resources (Static Knowledge)

Three resources provide static information to agents:

- `kreuzberg://formats` -- Supported format list as JSON
- `kreuzberg://features` -- Cross-binding feature matrix (from `FEATURE_MATRIX.md`)
- `kreuzberg://api-reference` -- Generated API documentation

### MCP Prompts (Agent Templates)

Two prompts guide agent extraction workflows:

- `extract_for_rag` -- Document type-specific RAG extraction guidance (research paper, contract, report). Recommends a chunking preset and embedding config.
- `batch_document_processing` -- Optimal concurrency, grouping, and error handling for batch workflows.

### MCP Transport Protocols

- **HTTP/REST**: MCP routes mounted alongside the REST API on a separate `/mcp/` prefix
- **Stdio**: JSON-RPC 2.0 over stdin/stdout for local CLI integration (e.g., Claude Desktop)

### Integration with Claude Desktop

```json
{
  "mcpServers": {
    "kreuzberg": {
      "command": "kreuzberg-mcp",
      "env": {
        "KREUZBERG_API_BASE": "http://localhost:8000",
        "KREUZBERG_MCP_TRANSPORT": "stdio"
      }
    }
  }
}
```

### MCP Error Handling

`ToolError` variants: `FileNotFound`, `UnsupportedFormat`, `ExtractionFailed`, `OnnxRuntimeMissing`, `TesseractMissing`, `Timeout`. Each maps to an MCP `ToolResultError` with a descriptive code and message.

## Environment Configuration

See `.env.example` for all configurable variables. Key categories:

- **Server**: `KREUZBERG_HOST`, `KREUZBERG_PORT`
- **Size limits**: `KREUZBERG_MAX_REQUEST_BODY_BYTES` (default 100MB), `KREUZBERG_MAX_MULTIPART_FIELD_BYTES`
- **Features**: `KREUZBERG_ENABLE_OCR`, `KREUZBERG_ENABLE_EMBEDDINGS`, `KREUZBERG_ENABLE_KEYWORDS`
- **Cache**: `KREUZBERG_CACHE_ENABLED`, `KREUZBERG_CACHE_SIZE`
- **CORS**: `CORS_ALLOWED_ORIGINS` (comma-separated)
- **MCP**: `KREUZBERG_MCP_HOST`, `KREUZBERG_MCP_PORT`, `KREUZBERG_MCP_TRANSPORT` (stdio/http)
- **Logging**: `RUST_LOG=kreuzberg=info,tower_http=debug`

## Critical Rules

### REST API Rules

1. **Always validate multipart file uploads** - Check MIME type, size, magic bytes
2. **Timeout long-running extractions** - Set a per-handler timeout (5 min default)
3. **Stream large files** - Never buffer an entire multi-GB file in memory
4. **Cache aggressively** - Identical files should return from cache in <1ms
5. **Parallel extraction is CPU-bound** - Limit workers to CPU count + 1
6. **Error responses must be actionable** - Include an error code and a remediation suggestion
7. **Health checks must verify features** - Report missing dependencies (ONNX, Tesseract)
8. **Size limits are configurable** - Allow override via env var for large deployments
9. **CORS is permissive by default** - Restrict in production via env var
10. **Log all requests** - Track extraction metrics for observability

### MCP Rules

1. **All tools must have a timeout** - Prevent hanging on large files (default 5 min)
2. **Error responses must be detailed** - Include suggestions for missing dependencies
3. **Feature gates must be checked** - Return a helpful message if a feature is unavailable (embeddings, OCR)
4. **Resources should be static** - Don't query external services in resource handlers
5. **Prompts guide agents** - Provide clear examples and best practices
6. **Batch tools must support cancellation** - Allow the agent to stop long-running batch operations
7. **Log all tool calls** - Track usage for analytics and debugging

## Related Skills

- **extraction-pipeline-patterns** - Core extraction called by handlers and MCP tools
- **chunking-embeddings** - Optional chunking/embedding parameters in extraction
- **ocr-backend-management** - OCR engine selection and image preprocessing

`.ai-rulez/skills/chunking-embeddings/SKILL.md` (skill, 5734 bytes)
---
description: "Chunking, embeddings, and RAG pipeline integration"
name: chunking-embeddings
priority: critical
---

# Chunking & Embeddings

**Text splitting strategies, embedding generation with FastEmbed, RAG pipeline integration**

## Chunking Architecture Overview

**Location**: `crates/kreuzberg/src/chunking/`, `crates/kreuzberg/src/embeddings.rs`

```text
Extracted Text
  |
[1. Normalization] -> Clean whitespace, remove control chars
  |
[2. Chunk Strategy Selection] -> Fixed-size, semantic, syntax-aware, recursive
  |
[3. Overlap Management] -> Control context window overlap
  |
[4. Optional Embedding] -> Generate vectors with FastEmbed
  |
Output: Vec<Chunk> with text, vectors, metadata
```

## Chunking Strategies

**Location**: `crates/kreuzberg/src/chunking/mod.rs`

| Strategy | Pattern | Best For |
| --- | --- | --- |
| **Fixed-Size** | Sliding window with configurable overlap | Uniform chunks for embedding models with fixed token limits |
| **Semantic** | Split by sentences, merge/split by similarity threshold | Smart context preservation for LLM consumption and semantic search |
| **Syntax-Aware** | Split by paragraph/section/heading/code-block structure | Preserving document structure (sections, code blocks) in RAG |
| **Recursive** (LangChain pattern) | Try separators in order: `\n\n`, `\n`, `,` | Best general-purpose chunking; auto-finds optimal split points |

Key config fields per strategy (see struct definitions in `chunking/mod.rs`):

- Fixed-Size: `chunk_size`, `overlap`, `trim_whitespace`
- Semantic: `target_chunk_size`, `min/max_chunk_size`, `semantic_threshold`, `use_sentence_boundaries`
- Syntax-Aware: `chunk_by` (Paragraph/Section/Heading/Sentence/CodeBlock), `max_chunk_size`, `respect_code_blocks`
- Recursive: `separators[]`, `chunk_size`, `overlap`

## Chunking Configuration Presets

**Location**: `crates/kreuzberg/src/chunking/mod.rs`

| Preset | Chunk Size | Overlap | Strategy | Use Case |
| --- | --- | --- | --- | --- |
| **Balanced** | 512 tokens | 50 | Semantic | RAG sweet spot |
| **Compact** | 256 tokens | 32 | Fixed-Size | Dense vectors |
| **Extended** | 1024 tokens | 100 | Recursive | Full context |
| **Minimal** | 128 tokens | 16 | (default) | Lightweight embeddings |

Usage: set `config.chunking.preset = Some("balanced")` in `ExtractionConfig`.

## Embedding Generation with FastEmbed

**Location**: `crates/kreuzberg/src/embeddings.rs`

### Model Selection

| Model | Dimensions | Notes |
| --- | --- | --- |
| `BAAI/bge-small-en-v1.5` (default) | 384 | Fast, excellent for RAG |
| `BAAI/bge-small-zh-v1.5` | 384 | Chinese optimized |
| `BAAI/bge-base-en-v1.5` | 768 | Better quality, slower |
| `jinaai/jina-embeddings-v2-base-en` | 768 | Long context (up to 8192 tokens) |
| `Custom(path)` | varies | Custom ONNX model path |

### Embedding Pattern

`TextEmbeddingManager` provides singleton-cached models per config. Pattern:

1. `get_or_init_model()` -- lazy-loads the ONNX model (downloading if needed), caches in `Arc<RwLock<HashMap>>`
2. `embed_chunks()` -- collects chunk texts, calls `model.embed(texts, batch_size)`, zips results back into `ChunkWithEmbedding`

Default config: `batch_size=256`, `device=CPU`, `parallel_requests=4`.

### ONNX Runtime Requirement

Embeddings require ONNX Runtime. Feature-gated via:

```toml
[features]
embeddings = ["dep:fastembed", "dep:ort"]
```

Install: `brew install onnxruntime` (macOS) / `apt install libonnxruntime libonnxruntime-dev` (Linux). Verify: `echo $ORT_DYLIB_PATH`.

## RAG Integration Pattern

The full extraction-to-RAG pipeline:

1. **Extract**: `extract_file(path, config)` -> `ExtractionResult`
2. **Chunk**: Apply the preset strategy to `result.content` -> `Vec<Chunk>`
3. **Embed**: If an embedding config is present, `TextEmbeddingManager::embed_chunks()` -> `Vec<ChunkWithEmbedding>`
4. **Output**: `RagDocument { file_path, metadata, chunks }` ready for vector DB ingestion

See the `ChunkWithEmbedding` struct in `types.rs`: it contains `text`, `embedding: Vec<f32>`, `dimensions`, `norm`, `metadata`.

## Critical Rules

1. **Chunking is preprocessing** - Always apply before embedding to ensure consistent vector sizes
2. **Overlap prevents information loss** - Set overlap to 15-20% of chunk size
3. **Embedding models are stateful** - Lazy-load and cache to avoid repeated initialization
4. **ONNX Runtime is required** - Gracefully degrade if unavailable (skip embeddings)
5. **Batch embedding for performance** - Never embed single chunks; batch 50-1000 chunks
6. **Normalize embeddings for search** - Use L2 norm for cosine similarity
7. **Cache embedding results** - Don't re-embed identical text chunks
8. **Model selection impacts quality** - bge-small (384) for speed, bge-base (768) for quality

## Related Skills

- **extraction-pipeline-patterns** - Text extraction preceding chunking
- **api-server-mcp** - Endpoint for chunking + embedding operations
- **ocr-backend-management** - OCR text quality affects chunking success

`skills/kreuzberg/SKILL.md` (skill, 15343 bytes)
---
name: kreuzberg
description: >-
  Extract text, tables, metadata, and images from 91+ document formats (PDF,
  Office, images, HTML, email, archives, academic) using Kreuzberg. Use when
  writing code that calls Kreuzberg APIs in Python, Node.js/TypeScript, Rust,
  or CLI. Covers installation, extraction (sync/async), configuration (OCR,
  chunking, output format), batch processing, error handling, and plugins.
license: Elastic-2.0
metadata:
  author: kreuzberg-dev
  version: "1.0"
  repository: https://github.com/kreuzberg-dev/kreuzberg
---

# Kreuzberg Document Extraction

Kreuzberg is a high-performance document intelligence library with a Rust core and native bindings for Python, Node.js/TypeScript, Ruby, Go, Java, C#, PHP, and Elixir. It extracts text, tables, metadata, and images from 91+ file formats including PDF, Office documents, images (with OCR), HTML, email, archives, and academic formats.

Use this skill when writing code that:

- Extracts text or metadata from documents
- Performs OCR on scanned documents or images
- Batch-processes multiple files
- Configures extraction options (output format, chunking, OCR, language detection)
- Implements custom plugins (post-processors, validators, OCR backends)

## Installation

### Python

```bash
pip install kreuzberg
# Optional OCR backends:
pip install kreuzberg[easyocr]  # EasyOCR
```

### Node.js

```bash
npm install @kreuzberg/node
```

### Rust

```toml
# Cargo.toml
[dependencies]
kreuzberg = { version = "4", features = ["tokio-runtime"] }
# features: tokio-runtime (required for sync + batch), pdf, ocr, chunking,
# embeddings, language-detection, keywords-yake, keywords-rake
```

### CLI

```bash
# Download from GitHub releases, or:
cargo install kreuzberg-cli
```

## Quick Start

### Python (Async)

```python
from kreuzberg import extract_file

result = await extract_file("document.pdf")
print(result.content)   # extracted text
print(result.metadata)  # document metadata
print(result.tables)    # extracted tables
```

### Python (Sync)

```python
from kreuzberg import extract_file_sync

result = extract_file_sync("document.pdf")
print(result.content)
```

### Node.js

```typescript
import { extractFile } from "@kreuzberg/node";

const result = await extractFile("document.pdf");
console.log(result.content);
console.log(result.metadata);
console.log(result.tables);
```

### Node.js (Sync)

```typescript
import { extractFileSync } from "@kreuzberg/node";

const result = extractFileSync("document.pdf");
```

### Rust (Async)

```rust
use kreuzberg::{extract_file, ExtractionConfig};

#[tokio::main]
async fn main() -> kreuzberg::Result<()> {
    let config = ExtractionConfig::default();
    let result = extract_file("document.pdf", None, &config).await?;
    println!("{}", result.content);
    Ok(())
}
```

### Rust (Sync) — requires `tokio-runtime` feature

```rust
use kreuzberg::{extract_file_sync, ExtractionConfig};

fn main() -> kreuzberg::Result<()> {
    let config = ExtractionConfig::default();
    let result = extract_file_sync("document.pdf", None, &config)?;
    println!("{}", result.content);
    Ok(())
}
```

### CLI

```bash
kreuzberg extract document.pdf
kreuzberg extract document.pdf --format json
kreuzberg extract document.pdf --output-format markdown
```

## Configuration

All languages use the same configuration structure with language-appropriate naming conventions.
### Python (snake_case)

```python
from kreuzberg import (
    ExtractionConfig,
    OcrConfig,
    TesseractConfig,
    PdfConfig,
    ChunkingConfig,
)

config = ExtractionConfig(
    ocr=OcrConfig(
        backend="tesseract",
        language="eng",
        tesseract_config=TesseractConfig(psm=6, enable_table_detection=True),
    ),
    pdf_options=PdfConfig(passwords=["secret123"]),
    chunking=ChunkingConfig(max_chars=1000, max_overlap=200),
    output_format="markdown",
)
result = await extract_file("document.pdf", config=config)
```

### Node.js (camelCase)

```typescript
import { extractFile, type ExtractionConfig } from "@kreuzberg/node";

const config: ExtractionConfig = {
  ocr: { backend: "tesseract", language: "eng" },
  pdfOptions: { passwords: ["secret123"] },
  chunking: { maxChars: 1000, maxOverlap: 200 },
  outputFormat: "markdown",
};
const result = await extractFile("document.pdf", null, config);
```

### Rust (snake_case)

```rust
use kreuzberg::{ExtractionConfig, OcrConfig, ChunkingConfig, OutputFormat};

let config = ExtractionConfig {
    ocr: Some(OcrConfig {
        backend: "tesseract".into(),
        language: "eng".into(),
        ..Default::default()
    }),
    chunking: Some(ChunkingConfig {
        max_characters: 1000,
        overlap: 200,
        ..Default::default()
    }),
    output_format: OutputFormat::Markdown,
    ..Default::default()
};
let result = extract_file("document.pdf", None, &config).await?;
```

### Config File (TOML)

```toml
output_format = "markdown"

[ocr]
backend = "tesseract"
language = "eng"

[chunking]
max_chars = 1000
max_overlap = 200

[pdf_options]
passwords = ["secret123"]
```

```bash
# CLI: auto-discovers kreuzberg.toml in current/parent directories
kreuzberg extract doc.pdf
# or explicit:
kreuzberg extract doc.pdf --config kreuzberg.toml
kreuzberg extract doc.pdf --config-json '{"ocr":{"backend":"tesseract","language":"deu"}}'
```

## Batch Processing

### Python

```python
from kreuzberg import batch_extract_files, batch_extract_files_sync

# Async
results = await batch_extract_files(["doc1.pdf", "doc2.docx", "doc3.xlsx"])

# Sync
results = batch_extract_files_sync(["doc1.pdf", "doc2.docx"])
for result in results:
    print(f"{len(result.content)} chars extracted")
```

### Node.js

```typescript
import { batchExtractFiles } from "@kreuzberg/node";

const results = await batchExtractFiles(["doc1.pdf", "doc2.docx"]);
```

### Rust — requires `tokio-runtime` feature

```rust
use kreuzberg::{batch_extract_file, ExtractionConfig};

let config = ExtractionConfig::default();
let paths = vec!["doc1.pdf", "doc2.docx"];
let results = batch_extract_file(paths, &config).await?;
```

### CLI

```bash
kreuzberg batch *.pdf --format json
kreuzberg batch docs/*.docx --output-format markdown
```

## OCR

OCR runs automatically for images and scanned PDFs. Tesseract is the default backend (native binding, no external install required).

### Backends

- **Tesseract** (default): Built-in native binding. All Tesseract languages supported.
- **EasyOCR** (Python only): `pip install kreuzberg[easyocr]`. Pass `easyocr_kwargs={"gpu": True}`.
- **PaddleOCR** (Python only): Bundled since 4.8.5, no extra install needed. Pass `paddleocr_kwargs={"use_angle_cls": True}`.
- **Guten** (Node.js only): Built-in OCR backend via `GutenOcrBackend`.
### Language Codes

```python
config = ExtractionConfig(ocr=OcrConfig(language="eng"))      # English
config = ExtractionConfig(ocr=OcrConfig(language="eng+deu"))  # Multiple
config = ExtractionConfig(ocr=OcrConfig(language="all"))      # All installed
```

### Force OCR

```python
config = ExtractionConfig(force_ocr=True)  # OCR even if text is extractable
```

## ExtractionResult Fields

| Field | Python | Node.js | Rust | Description |
| --- | --- | --- | --- | --- |
| Text content | `result.content` | `result.content` | `result.content` | Extracted text (str/String) |
| MIME type | `result.mime_type` | `result.mimeType` | `result.mime_type` | Input document MIME type |
| Metadata | `result.metadata` | `result.metadata` | `result.metadata` | Document metadata (dict/object/HashMap) |
| Tables | `result.tables` | `result.tables` | `result.tables` | Extracted tables with cells + markdown |
| Languages | `result.detected_languages` | `result.detectedLanguages` | `result.detected_languages` | Detected languages (if enabled) |
| Chunks | `result.chunks` | `result.chunks` | `result.chunks` | Text chunks (if chunking enabled) |
| Images | `result.images` | `result.images` | `result.images` | Extracted images (if enabled) |
| Elements | `result.elements` | `result.elements` | `result.elements` | Semantic elements (if element_based format) |
| Pages | `result.pages` | `result.pages` | `result.pages` | Per-page content (if page extraction enabled) |
| Keywords | `result.keywords` | `result.keywords` | `result.keywords` | Extracted keywords (if enabled) |

## Error Handling

### Python

```python
from kreuzberg import (
    extract_file_sync,
    KreuzbergError,
    ParsingError,
    OCRError,
    ValidationError,
    MissingDependencyError,
)

try:
    result = extract_file_sync("file.pdf")
except ParsingError as e:
    print(f"Failed to parse: {e}")
except OCRError as e:
    print(f"OCR failed: {e}")
except ValidationError as e:
    print(f"Invalid input: {e}")
except MissingDependencyError as e:
    print(f"Missing dependency: {e}")
except KreuzbergError as e:
    print(f"Extraction failed: {e}")
```

### Node.js

```typescript
import {
  extractFile,
  KreuzbergError,
  ParsingError,
  OcrError,
  ValidationError,
  MissingDependencyError,
} from "@kreuzberg/node";

try {
  const result = await extractFile("file.pdf");
} catch (e) {
  if (e instanceof ParsingError) { /* ... */ }
  else if (e instanceof OcrError) { /* ... */ }
  else if (e instanceof ValidationError) { /* ... */ }
  else if (e instanceof KreuzbergError) { /* ... */ }
}
```

### Rust

```rust
use kreuzberg::{extract_file, ExtractionConfig, KreuzbergError};

let config = ExtractionConfig::default();
match extract_file("file.pdf", None, &config).await {
    Ok(result) => println!("{}", result.content),
    Err(KreuzbergError::Parsing(msg)) => eprintln!("Parse error: {msg}"),
    Err(KreuzbergError::Ocr(msg)) => eprintln!("OCR error: {msg}"),
    Err(e) => eprintln!("Error: {e}"),
}
```

## Common Pitfalls

1. **Python ChunkingConfig fields**: Use `max_chars` and `max_overlap`, NOT `max_characters` or `overlap`.
2. **Rust extract_file signature**: The third argument is `&ExtractionConfig` (a reference), not `Option`. Use `&ExtractionConfig::default()` for defaults.
3. **Rust feature gates**: `extract_file_sync`, `batch_extract_file`, and `batch_extract_file_sync` all require `features = ["tokio-runtime"]` in Cargo.toml.
4. **Rust async context**: `extract_file` is async. Use `#[tokio::main]` or call from an async context.
5. **CLI --format vs --output-format**: `--format` controls CLI output (text/json). `--output-format` controls content format (plain/markdown/djot/html).
6. **Node.js extractFile signature**: `extractFile(path, mimeType?, config?)` — mimeType is the second arg (pass `null` to skip).
7. **Python detect_mime_type**: The function for detecting from bytes is `detect_mime_type(data)`. For paths use `detect_mime_type_from_path(path)`.
8. **Config file field names**: Use snake_case in TOML/YAML/JSON config files (e.g., `max_chars`, `max_overlap`, `pdf_options`).

## Supported Formats (Summary)

| Category | Extensions |
| --- | --- |
| **PDF** | `.pdf` |
| **Word** | `.docx`, `.odt` |
| **Spreadsheets** | `.xlsx`, `.xlsm`, `.xlsb`, `.xls`, `.xla`, `.xlam`, `.xltm`, `.ods` |
| **Presentations** | `.pptx`, `.ppt`, `.ppsx` |
| **eBooks** | `.epub`, `.fb2` |
| **Images** | `.png`, `.jpg`, `.jpeg`, `.gif`, `.webp`, `.bmp`, `.tiff`, `.tif`, `.jp2`, `.jpx`, `.jpm`, `.mj2`, `.jbig2`, `.jb2`, `.pnm`, `.pbm`, `.pgm`, `.ppm`, `.svg` |
| **Markup** | `.html`, `.htm`, `.xhtml`, `.xml` |
| **Data** | `.json`, `.yaml`, `.yml`, `.toml`, `.csv`, `.tsv` |
| **Text** | `.txt`, `.md`, `.markdown`, `.djot`, `.rst`, `.org`, `.rtf` |
| **Email** | `.eml`, `.msg` |
| **Archives** | `.zip`, `.tar`, `.tgz`, `.gz`, `.7z` |
| **Academic** | `.bib`, `.biblatex`, `.ris`, `.nbib`, `.enw`, `.csl`, `.tex`, `.latex`, `.typ`, `.jats`, `.ipynb`, `.docbook`, `.opml`, `.pod`, `.mdoc`, `.troff` |

See [references/supported-formats.md](references/supported-formats.md) for the complete format reference with MIME types.
## Additional Resources

Detailed reference files for specific topics:

- **[Python API Reference](references/python-api.md)** – All functions, config classes, plugin protocols, exact signatures
- **[Node.js API Reference](references/nodejs-api.md)** – All functions, TypeScript interfaces, worker pool APIs
- **[Rust API Reference](references/rust-api.md)** – All functions with feature gates, structs, Cargo.toml examples
- **[CLI Reference](references/cli-reference.md)** – All commands, flags, config precedence, exit codes
- **[Configuration Reference](references/configuration.md)** – TOML/YAML/JSON formats, auto-discovery, env vars, full schema
- **[Supported Formats](references/supported-formats.md)** – All 85+ formats with file extensions and MIME types
- **[Advanced Features](references/advanced-features.md)** – Plugins, embeddings, MCP server, API server, security limits
- **[Other Language Bindings](references/other-bindings.md)** – Go, Ruby, Java, C#, PHP, Elixir, WASM, Docker

Full documentation: <https://docs.kreuzberg.dev>
GitHub: <https://github.com/kreuzberg-dev/kreuzberg>

.ai-rulez/skills/plugin-architecture-patterns/SKILL.md
---
name: plugin-architecture-patterns
description: "Plugin architecture, registration, and trait patterns"
priority: critical
---

# Plugin Architecture & Registration

## Plugin Types

| Type | Trait | Location |
| --- | --- | --- |
| Document Extractor | `DocumentExtractor: Plugin` | `plugins/extractor/trait.rs` |
| OCR Backend | `OcrBackend: Plugin` | `plugins/ocr/trait.rs` |
| Post Processor | `PostProcessor: Plugin` | `plugins/processor/trait.rs` |
| Validator | `Validator: Plugin` | `plugins/validator/trait.rs` |

## DocumentExtractor Implementation

```rust
use crate::plugins::{DocumentExtractor, Plugin};
use async_trait::async_trait;

pub struct MyExtractor;

impl Plugin for MyExtractor {
    fn name(&self) -> &str { "my-extractor" }
    fn version(&self) -> String { env!("CARGO_PKG_VERSION").to_string() }
}

#[async_trait]
impl DocumentExtractor for MyExtractor {
    async fn extract_bytes(
        &self,
        content: &[u8],
        mime_type: &str,
        config: &ExtractionConfig,
    ) -> Result<ExtractionResult> { /* ... */ }

    fn supported_mime_types(&self) -> &[&str] { &["application/x-custom"] }

    fn priority(&self) -> i32 { 50 }

    // WASM support (optional)
    fn as_sync_extractor(&self) -> Option<&dyn SyncExtractor> { None }
}
```

## Priority System

| Range | Use |
| --- | --- |
| 0-25 | Fallback/low-quality |
| 26-49 | Alternative extractors |
| **50** | **Default (built-in)** |
| 51-75 | Premium/enhanced |
| 76-100 | Specialized/high-priority |

Registry selects the **highest priority** extractor for each MIME type. Override built-ins with priority > 50.
## Registration

```rust
// In extractors/mod.rs → register_default_extractors()
let registry = get_document_extractor_registry();
let mut registry = registry.write()
    .map_err(|e| KreuzbergError::Other(format!("Registry lock poisoned: {}", e)))?;
registry.register(Arc::new(MyExtractor::new()))?;
```

## Feature-Gated Registration

```rust
#[cfg(feature = "office")]
{
    registry.register(Arc::new(DocxExtractor::new()))?;
    registry.register(Arc::new(PptxExtractor::new()))?;
}
```

## PostProcessor Pattern

```rust
impl PostProcessor for MyProcessor {
    async fn process(
        &self,
        result: &mut ExtractionResult,
        config: &ExtractionConfig,
    ) -> Result<()> {
        result.content = process_content(&result.content);
        Ok(())
    }

    fn stage(&self) -> ProcessorStage { ProcessorStage::Middle }
}
```

Stages: `Early` → `Middle` → `Late`. Failures are isolated (they don't block other processors).

## Critical Rules

1. All plugins **MUST be `Send + Sync`**
2. Feature gate with `#[cfg(feature = "...")]` for optional formats
3. Use `#[async_trait]` for `DocumentExtractor`
4. Initialization via `ensure_initialized()` (lazy, called before first extraction)
5. Plugin names: kebab-case (e.g., `"pdf-extractor"`)
README
Kreuzberg
Extract text, metadata, and code intelligence from 97+ file formats and 305 programming languages at native speeds without needing a GPU.
Key Features
- Code intelligence – Extract functions, classes, imports, symbols, and docstrings from 248 programming languages via tree-sitter. Results land in `ExtractionResult.code_intelligence`, with semantic chunking
- Extensible architecture – Plugin system for custom OCR backends, validators, post-processors, document extractors, and renderers
- Polyglot – Native bindings for Rust, Python, TypeScript/Node.js, Ruby, Go, Java, C#, PHP, Elixir, R, and C
- 91+ file formats – PDF, Office documents, images, HTML, XML, emails, archives, academic formats across 8 categories
- LLM intelligence – VLM OCR (GPT-4o, Claude, Gemini, Ollama), structured JSON extraction with schema constraints, and provider-hosted embeddings via 146 LLM providers (including local engines: Ollama, LM Studio, vLLM, llama.cpp) through LiteLLM
- OCR support – Tesseract (all bindings, including Tesseract-WASM for browsers), PaddleOCR (all native bindings), EasyOCR (Python), VLM OCR (146 vision model providers including local engines), extensible via plugin API
- High performance – Rust core with a pure-Rust PDF engine, SIMD optimizations, and full parallelism
- Flexible deployment – Use as library, CLI tool, REST API server, or MCP server
- TOON wire format – Token-efficient serialization for LLM/RAG pipelines, ~30-50% fewer tokens than JSON
- GFM-quality output – Comrak-based rendering with proper fenced code blocks, table nodes, bracket escaping, and cross-format parity (Markdown, HTML, Djot, Plain)
- HTML passthrough – HTML-to-Markdown conversion uses html-to-markdown output directly, bypassing lossy intermediate round-trips
- Memory efficient – Streaming parsers for multi-GB files
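The TOON token-savings claim rests on a simple observation: JSON repeats every key in every object, while a tabular layout declares keys once. The sketch below is not the actual TOON wire format (`compact_rows` is an invented helper); it only illustrates, under that assumption, where the savings for uniform row data come from.

```python
import json

def compact_rows(rows: list[dict]) -> str:
    """Illustrative tabular encoding: one header line, then one line per row.

    NOTE: this is NOT the real TOON format, just a sketch of why declaring
    keys once saves tokens versus repeating them in every JSON object.
    """
    keys = list(rows[0])
    lines = [",".join(keys)]
    lines += [",".join(str(row[k]) for k in keys) for row in rows]
    return "\n".join(lines)

rows = [{"page": i, "score": 0.9, "text": f"chunk {i}"} for i in range(50)]
as_json = json.dumps(rows)
as_table = compact_rows(rows)
print(len(as_table) / len(as_json))  # well under 1.0: keys are not repeated per row
```

The real format handles nesting and typing; the size ratio shown here is only indicative of the repeated-key overhead.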
Complete Documentation | Live Demo | Installation Guides
Installation
Each language binding provides comprehensive documentation with examples and best practices. Choose your platform to get started:
Scripting Languages:
- Python – PyPI package, async/sync APIs, OCR backends (Tesseract, PaddleOCR, EasyOCR)
- Ruby – RubyGems package, idiomatic Ruby API, native bindings
- PHP – Composer package, modern PHP 8.4+ support, type-safe API, async extraction
- Elixir – Hex package, OTP integration, concurrent processing
- R – r-universe package, idiomatic R API, extendr bindings
JavaScript/TypeScript:
- @kreuzberg/node – Native NAPI-RS bindings for Node.js/Bun, fastest performance
- @kreuzberg/wasm – WebAssembly for browsers/Deno/Cloudflare Workers, full feature parity (PDF, Excel, OCR, archives)
Compiled Languages:
- Go – Go module with FFI bindings, context-aware async
- Java – Maven Central, Foreign Function & Memory API
- C# – NuGet package, .NET 6.0+, full async/await support
Native:
- Rust – Core library, flexible feature flags, zero-copy APIs
- C (FFI) – C header + shared library, pkg-config/CMake support, cross-platform
Containers:
- Docker – Official images with API, CLI, and MCP server modes (~1.0-1.3GB; the Full variant adds OCR and legacy format support)
Command-Line:
- CLI – Cross-platform binary, batch processing, MCP server mode
All language bindings ship precompiled binaries for x86_64 and aarch64 on Linux, and for Apple Silicon on macOS (Windows x64 where noted in the table below).
Platform Support
Complete architecture coverage across all language bindings:
| Language | Linux x86_64 | Linux aarch64 | macOS ARM64 | Windows x64 |
|---|---|---|---|---|
| Python | ✅ | ✅ | ✅ | ✅ |
| Node.js | ✅ | ✅ | ✅ | ✅ |
| WASM | ✅ | ✅ | ✅ | ✅ |
| Ruby | ✅ | ✅ | ✅ | - |
| R | ✅ | ✅ | ✅ | ✅ |
| Elixir | ✅ | ✅ | ✅ | ✅ |
| Go | ✅ | ✅ | ✅ | ✅ |
| Java | ✅ | ✅ | ✅ | ✅ |
| C# | ✅ | ✅ | ✅ | ✅ |
| PHP | ✅ | ✅ | ✅ | ✅ |
| Rust | ✅ | ✅ | ✅ | ✅ |
| C (FFI) | ✅ | ✅ | ✅ | ✅ |
| CLI | ✅ | ✅ | ✅ | ✅ |
| Docker | ✅ | ✅ | ✅ | - |
Note: ✅ = precompiled binaries available with instant installation. WASM runs in any environment with WebAssembly support (browsers, Deno, Bun, Cloudflare Workers). All platforms are tested in CI. macOS support is Apple Silicon only.
Embeddings Support (Optional)
To use embeddings functionality:
1. Install ONNX Runtime 1.24+:
   - Linux: download from the ONNX Runtime releases page (Debian packages may ship older versions)
   - macOS: `brew install onnxruntime`
   - Windows: download from the ONNX Runtime releases page
2. Use embeddings in your code – see the Embeddings Guide
Note: Kreuzberg requires ONNX Runtime version 1.24+ for embeddings. All other Kreuzberg features work without ONNX Runtime.
Supported Formats
91+ file formats across 8 major categories with intelligent format detection and comprehensive metadata extraction.
Office Documents
| Category | Formats | Capabilities |
|---|---|---|
| Word Processing | .docx, .docm, .dotx, .dotm, .dot, .odt, .pages | Full text, tables, lists, images, metadata, styles |
| Spreadsheets | .xlsx, .xlsm, .xlsb, .xls, .xla, .xlam, .xltm, .xltx, .xlt, .ods, .numbers | Sheet data, formulas, cell metadata, charts |
| Presentations | .pptx, .pptm, .ppsx, .potx, .potm, .pot, .key | Slides, speaker notes, images, metadata |
| PDF | .pdf | Text, tables, images, metadata, OCR support |
| eBooks | .epub, .fb2 | Chapters, metadata, embedded resources |
| Database | .dbf | Table data extraction, field type support |
| Hangul | .hwp, .hwpx | Korean document format, text extraction |
Images (OCR-Enabled)
| Category | Formats | Features |
|---|---|---|
| Raster | .png, .jpg, .jpeg, .gif, .webp, .bmp, .tiff, .tif | OCR, table detection, EXIF metadata, dimensions, color space |
| Advanced | .jp2, .jpx, .jpm, .mj2, .jbig2, .jb2, .pnm, .pbm, .pgm, .ppm | Pure Rust decoders (JPEG 2000, JBIG2), OCR, table detection |
| Vector | .svg | DOM parsing, embedded text, graphics metadata |
Web & Data
| Category | Formats | Features |
|---|---|---|
| Markup | .html, .htm, .xhtml, .xml, .svg | DOM parsing, metadata (Open Graph, Twitter Card), link extraction |
| Structured Data | .json, .yaml, .yml, .toml, .csv, .tsv | Schema detection, nested structures, validation |
| Text & Markdown | .txt, .md, .markdown, .djot, .mdx, .rst, .org, .rtf | CommonMark, GFM, Djot, MDX, reStructuredText, Org Mode, Rich Text |
Email & Archives
| Category | Formats | Features |
|---|---|---|
| Email | .eml, .msg | Headers, body (HTML/plain), attachments, UTF-16 support |
| Archives | .zip, .tar, .tgz, .gz, .7z | Recursive extraction, nested archives, metadata |
Academic & Scientific
| Category | Formats | Features |
|---|---|---|
| Citations | .bib, .ris, .nbib, .enw, .csl | BibTeX/BibLaTeX, RIS, PubMed/MEDLINE, EndNote XML, CSL JSON |
| Scientific | .tex, .latex, .typ, .typst, .jats, .ipynb | LaTeX, Typst, JATS journal articles, Jupyter notebooks |
| Publishing | .fb2, .docbook, .dbk, .opml | FictionBook, DocBook XML, OPML outlines |
| Documentation | .pod, .mdoc, .troff | Perl POD, man pages, troff |
Code Intelligence (248 Languages)
| Feature | Description |
|---|---|
| Structure Extraction | Functions, classes, methods, structs, interfaces, enums |
| Import/Export Analysis | Module dependencies, re-exports, wildcard imports |
| Symbol Extraction | Variables, constants, type aliases, properties |
| Docstring Parsing | Google, NumPy, Sphinx, JSDoc, RustDoc, and 10+ formats |
| Diagnostics | Parse errors with line/column positions |
| Syntax-Aware Chunking | Split code by semantic boundaries, not arbitrary byte offsets |
Powered by tree-sitter-language-pack with dynamic grammar download. See TSLP documentation for the full language list.
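Syntax-aware chunking cuts at semantic units rather than arbitrary byte offsets. The real implementation parses with tree-sitter grammars; this regex-based sketch over Python source (function name invented) is a deliberate simplification of the same principle.

```python
import re

def chunk_python_source(source: str) -> list[str]:
    """Split Python source at top-level `def`/`class` boundaries.

    Stand-in for grammar-driven chunking: tree-sitter would give exact
    node boundaries, but the idea -- cut at semantic units, not at
    fixed byte offsets -- is the same.
    """
    boundaries = [m.start() for m in re.finditer(r"(?m)^(def |class )", source)]
    if not boundaries or boundaries[0] != 0:
        boundaries.insert(0, 0)
    boundaries.append(len(source))
    chunks = [source[a:b] for a, b in zip(boundaries, boundaries[1:])]
    return [c for c in chunks if c.strip()]

code = "import os\n\ndef f():\n    return 1\n\nclass C:\n    pass\n"
for chunk in chunk_python_source(code):
    print(repr(chunk))
```

Each chunk stays a complete definition, so an LLM consuming the chunks never sees a function cut in half.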
Key Features
OCR with Table Extraction
Multiple OCR backends (Tesseract, EasyOCR, PaddleOCR) with intelligent table detection and reconstruction. Extract structured data from scanned documents and images with configurable accuracy thresholds.
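A common way to reconstruct tables from OCR output (not necessarily Kreuzberg's exact algorithm) is to cluster word boxes into rows by vertical position, then order each row left-to-right. A minimal sketch of that clustering idea, with invented names:

```python
def rows_from_boxes(boxes: list[tuple[str, float, float]], y_tol: float = 5.0) -> list[list[str]]:
    """Group OCR word boxes (text, x, y) into table rows.

    Words whose y positions fall within `y_tol` of a row's anchor belong
    to that row; each row is then ordered left-to-right by x.
    """
    rows: list[list[tuple[str, float, float]]] = []
    for box in sorted(boxes, key=lambda b: b[2]):  # top-to-bottom
        if rows and abs(rows[-1][0][2] - box[2]) <= y_tol:
            rows[-1].append(box)
        else:
            rows.append([box])
    return [[text for text, _, _ in sorted(row, key=lambda b: b[1])] for row in rows]

boxes = [("Qty", 120, 10), ("Item", 10, 11), ("2", 125, 40), ("Widget", 12, 41)]
print(rows_from_boxes(boxes))  # [['Item', 'Qty'], ['Widget', '2']]
```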
Batch Processing
Process multiple documents concurrently with configurable parallelism. Optimize throughput for large-scale document processing workloads with automatic resource management.
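Bounded parallelism of that kind can be sketched with a thread pool: cap the number of concurrent extractions while preserving input order. The per-document call here is a stub, not a real Kreuzberg function.

```python
from concurrent.futures import ThreadPoolExecutor

def extract_stub(path: str) -> str:
    # Stand-in for a real per-document extraction call; the actual
    # Kreuzberg batch APIs differ per binding.
    return f"text of {path}"

def batch_extract(paths: list[str], max_workers: int = 4) -> list[str]:
    """Extract many documents with bounded parallelism, preserving input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(extract_stub, paths))

print(batch_extract(["a.pdf", "b.docx"], max_workers=2))
# ['text of a.pdf', 'text of b.docx']
```

`max_workers` is the knob for trading throughput against memory and CPU pressure.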
Password-Protected PDFs
Handle encrypted PDFs with single or multiple password attempts. Supports both RC4 and AES encryption with automatic fallback strategies.
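The fallback shape is: try each password in order, and on exhaustion return a result flagged as encrypted instead of raising. The decryptor below is a stand-in callable, not the real PDF engine, and `try_open_pdf` is an invented name.

```python
def try_open_pdf(decrypt, passwords: list[str]) -> dict:
    """Try passwords in order; flag the document as encrypted on failure.

    `decrypt` is any callable returning text on success and raising
    ValueError on a wrong password (stand-in for a real decryptor).
    """
    for pw in passwords:
        try:
            return {"content": decrypt(pw), "is_encrypted": False}
        except ValueError:
            continue
    return {"content": "", "is_encrypted": True}

def fake_decrypt(pw: str) -> str:
    if pw != "s3cret":
        raise ValueError("wrong password")
    return "doc text"

print(try_open_pdf(fake_decrypt, ["first-guess", "s3cret"]))
# {'content': 'doc text', 'is_encrypted': False}
```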
Language Detection
Automatic language detection in extracted text using fast-langdetect. Configure confidence thresholds and access per-language statistics.
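Confidence thresholds and per-language statistics can be pictured as a filter-then-count over per-chunk detections. The `(language, confidence)` pair shape below is assumed for illustration; it is not fast-langdetect's literal return type.

```python
from collections import Counter

def language_stats(detections: list[tuple[str, float]], min_confidence: float = 0.6) -> dict[str, int]:
    """Per-language counts over per-chunk detections, dropping low-confidence guesses."""
    kept = [lang for lang, conf in detections if conf >= min_confidence]
    return dict(Counter(kept))

print(language_stats([("en", 0.99), ("de", 0.95), ("en", 0.40), ("en", 0.88)]))
# {'en': 2, 'de': 1}
```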
Metadata Extraction
Extract comprehensive metadata from all supported formats: authors, titles, creation dates, page counts, EXIF data, and format-specific properties.
AI Coding Assistants
Kreuzberg ships with an Agent Skill that teaches AI coding assistants how to use the library correctly. It works with Claude Code, Codex, Gemini CLI, Cursor, VS Code, Amp, Goose, Roo Code, and any tool supporting the Agent Skills standard.
Install the skill into any project using the Vercel Skills CLI:
```shell
npx skills add kreuzberg-dev/kreuzberg
```
The skill is located at `skills/kreuzberg/SKILL.md` and is automatically discovered by supported AI coding tools once installed.
Documentation
- Installation Guide – Setup and dependencies
- User Guide – Comprehensive usage guide
- API Reference – Complete API documentation
- Format Support – Supported file formats
- OCR Backends – OCR engine setup
- CLI Guide – Command-line usage
- Migration Guides – Upgrading from other libraries
Contributing
Contributions are welcome! See CONTRIBUTING.md for guidelines.
License
Elastic License 2.0 (ELv2) - see LICENSE for details. See https://www.elastic.co/licensing/elastic-license for the full license text.