USP
Daft uniquely combines native multimodal processing with built-in AI operations. It offers a Python-native experience powered by Rust for blazing performance, and it scales seamlessly to distributed clusters without JVM complexity.
Use cases
- Processing images, audio, video, and embeddings at scale
- Running LLM prompts and generating embeddings
- Optimizing User-Defined Functions (UDFs) for GPU inference or async I/O
- Scaling data workflows to distributed Ray or Kubernetes clusters
- ETL for AI and machine learning pipelines
Detected files (4)
.claude/skills/daft-udf-tuning/SKILL.md (skill, 1410 bytes)
--- name: "daft-udf-tuning" description: "Optimize Daft UDF performance. Invoke when user needs GPU inference, encounters slow UDFs, or asks about async/batch processing." --- # Daft UDF Tuning Optimize User-Defined Functions for performance. ## UDF Types | Type | Decorator | Use Case | |---|---|---| | **Stateless** | `@daft.func` | Simple transforms. Use `async` for I/O-bound tasks. | | **Stateful** | `@daft.cls` | Expensive init (e.g., loading models). Supports `gpus=N`. | | **Batch** | `@daft.func.batch` | Vectorized CPU/GPU ops (NumPy/PyTorch). Faster. | ## Quick Recipes ### 1. Async I/O (Web APIs) ```python @daft.func async def fetch(url: str): async with aiohttp.ClientSession() as s: return await s.get(url).text() ``` ### 2. GPU Batch Inference (PyTorch/Models) ```python @daft.cls(gpus=1) class Classifier: def __init__(self): self.model = load_model().cuda() # Run once per worker @daft.method.batch(batch_size=32) def predict(self, images): return self.model(images.to_pylist()) # Run with concurrency df.with_column("preds", Classifier(max_concurrency=4).predict(df["img"])) ``` ## Tuning Keys - **`max_concurrency`**: Total parallel UDF instances. - **`gpus=N`**: GPU request per instance. - **`batch_size`**: Rows per call. Too small = overhead; too big = OOM. - **`into_batches(N)`**: Pre-slice partitions if memory is tight..claude/skills/daft-distributed-scaling/SKILL.mdskillShow content (2227 bytes)
--- name: "daft-distributed-scaling" description: "Scale Daft workflows to distributed Ray clusters. Invoke when optimizing performance or handling large data." --- # Daft Distributed Scaling Scale single-node workflows to distributed execution. ## Core Strategies | Strategy | API | Use Case | Pros/Cons | |---|---|---|---| | **Shuffle** | `repartition(N)` | Light data (e.g. file paths), Joins | **Global balance**. High memory usage (materializes data). | | **Streaming** | `into_batches(N)` | Heavy data (images, tensors) | **Low memory** (streaming). High scheduling overhead if batches too small. | ## Quick Recipes ### 1. Light Data: Repartitioning Best for distributing file paths before heavy reads. ```python # Create enough partitions to saturate workers df = daft.read_parquet("s3://metadata").repartition(100) df = df.with_column("data", read_heavy_data(df["path"])) ``` ### 2. Heavy Data: Streaming Batches Best for processing large partitions without OOM. ```python # Stream 1GB partition in 64-row chunks to control memory df = df.read_parquet("heavy_data").into_batches(64) df = df.with_column("embed", model.predict(df["img"])) ``` ## Advanced Tuning Target: Keep all actors busy without OOM or scheduling bottlenecks. ### Formula 1: Repartitioning (Light Data / Paths) Calculate the **Max Partition Count** to ensure each task has enough data to feed local actors. 1. **Min Rows Per Partition** = `Batch Size * (Total Concurrency / Nodes)` 2. **Max Partitions** = `Total Rows / Min Rows Per Partition` **Example**: - 1M rows, 4 nodes, 16 total concurrency, Batch Size 64. - **Min Rows**: `64 * (16/4) = 256`. - **Max Partitions**: `1,000,000 / 256 ≈ 3906`. - *Recommendation*: Use ~1000 partitions to run multiple batches per task. ```python df = df.repartition(1000) # Balanced fan-out ``` ### Formula 2: Streaming (Heavy Data / Images) Avoid creating tiny partitions. Use `into_batches` to stream data within larger partitions. **Strategy**: Keep partitions large (e.g. 
1GB+), use `into_batches(Batch Size)` to control memory. ```python # Stream batches to control memory usage per actor df = df.into_batches(64).with_column("preds", model(max_concurrency=16).predict(df["img"])) ```.claude/skills/daft-docs-navigation/SKILL.mdskillShow content (1130 bytes)
--- name: "daft-docs-navigation" description: "Navigate Daft documentation. Invoke when user asks general questions about APIs, concepts, or examples, or wants to search the docs." --- # Daft Docs Navigation Navigate Daft's documentation for APIs, concepts, and examples. ## Documentation Structure 1. **Live Site**: [`https://docs.daft.ai`](https://docs.daft.ai) (Primary source, searchable). 2. **Source Repo**: `docs/` directory in the repository. - If `docs/` is missing or empty, clone the repo: `git clone https://github.com/Eventual-Inc/Daft.git` ## Key Locations in `docs/` - **API Reference**: `api/` (e.g., `api/io.md` for reading/writing). - **Optimization**: `optimization/` (e.g., `optimization/partitioning.md`). - **Distributed**: `distributed/` (e.g., `distributed/ray.md`). - **UDFs**: `custom-code/` (e.g., `func.md`, `cls.md`). - **Connectors**: `connectors/` (e.g., S3, Lance). - **Table of Contents**: `docs/SUMMARY.md`. ## Usage **Search for API Usage:** ```bash grep_search(pattern='from_glob', path='docs/') ``` **Browse Topics:** ```bash read(file_path='docs/SUMMARY.md') ```.claude-plugin/marketplace.jsonmarketplaceShow content (327 bytes)
{ "name": "Eventual-Inc", "owner": { "name": "Eventual, Inc.", "email": "yksu@eventualcomputing.com" }, "plugins": [ { "name": "daft", "source": "./", "description": "Skills for working with Daft: UDF tuning, distributed scaling, and docs navigation", "version": "0.1.0" } ] }
README
|Banner|
|CI| |PyPI| |Latest Tag| |Coverage| |Slack|
`Website <https://www.daft.ai>`_ • `Docs <https://docs.daft.ai>`_ • `Installation <https://docs.daft.ai/en/stable/install/>`_ • `Daft Quickstart <https://docs.daft.ai/en/stable/quickstart/>`_ • `Community and Support <https://github.com/Eventual-Inc/Daft/discussions>`_
Daft: High-Performance Data Engine for AI and Multimodal Workloads
==================================================================
|TrendShift|
`Daft <https://www.daft.ai>`_ is a high-performance data engine for AI and multimodal workloads. Process images, audio, video, and structured data at any scale.
- Native multimodal processing: Process images, audio, video, and embeddings alongside structured data in a single framework
- Built-in AI operations: Run LLM prompts, generate embeddings, and classify data at scale using OpenAI, Transformers, or custom models
- Python-native, Rust-powered: Skip the JVM complexity with Python at its core and Rust under the hood for blazing performance
- Seamless scaling: Start local, scale to distributed clusters on `Ray <https://docs.daft.ai/en/stable/distributed/ray/>`_ or `Kubernetes <https://docs.daft.ai/en/stable/distributed/kubernetes/>`_
- Universal connectivity: Access data anywhere (S3, GCS, Iceberg, Delta Lake, Hugging Face, Unity Catalog)
- Out-of-box reliability: Intelligent memory management and sensible defaults eliminate configuration headaches
Getting Started
---------------
Installation
^^^^^^^^^^^^
Install Daft with ``pip install daft``. Daft requires Python 3.10 or higher.
For more advanced installations (e.g. installing from source or with extra dependencies such as Ray and AWS utilities), please see our `Installation Guide <https://docs.daft.ai/en/stable/install/>`_.
Quickstart
^^^^^^^^^^
Get started in minutes with our `Quickstart <https://docs.daft.ai/en/stable/quickstart/>`_ - load a real-world e-commerce dataset, process product images, and run AI inference at scale.
More Resources
^^^^^^^^^^^^^^
- `Examples <https://docs.daft.ai/en/stable/examples/>`_ - see Daft in action with use cases across text, images, audio, and more
- `User Guide <https://docs.daft.ai/en/stable/>`_ - take a deep-dive into each topic within Daft
- `API Reference <https://docs.daft.ai/en/stable/api/>`_ - API reference for public classes/functions of Daft
Benchmarks
----------
|Benchmark Image|
To see the full benchmarks, detailed setup, and logs, check out our `benchmarking page <https://docs.daft.ai/en/stable/benchmarks>`_.
Contributing
------------
We ❤️ developers! To start contributing to Daft, please read `CONTRIBUTING.md <https://github.com/Eventual-Inc/Daft/blob/main/CONTRIBUTING.md>`_. This document describes the development lifecycle and toolchain for working on Daft. It also details how to add new functionality to the core engine and expose it through a Python API.
Here's a list of `good first issues <https://github.com/Eventual-Inc/Daft/issues?q=is%3Aopen+is%3Aissue+label%3A%22good+first+issue%22>`_ to get yourself warmed up with Daft. Comment on an issue to pick it up, and feel free to ask any questions!
Telemetry
---------
To help improve Daft, we collect non-identifiable data via Scarf (https://scarf.sh).
To disable this behavior, set the environment variable ``DO_NOT_TRACK=true``.
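For example, in a POSIX shell, set the variable before launching any Daft workload:

```shell
# Opt out of Scarf telemetry for this shell session
export DO_NOT_TRACK=true
```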
The data that we collect is:
- Non-identifiable: No session IDs or user identifiers are collected
- Metadata-only: We do not collect any of our users’ proprietary code or data
- For development only: We do not buy or sell any user data
Please see our `documentation <https://docs.daft.ai/en/stable/telemetry/>`_ for more details.
.. image:: https://static.scarf.sh/a.png?x-pxid=31f8d5ba-7e09-4d75-8895-5252bbf06cf6
Related Projects
----------------
+---------------------------------------------------+-----------------+---------------+-------------+-----------------+-----------------------------+-------------+
| Engine                                            | Query Optimizer | Multimodal    | Distributed | Arrow Backed    | Vectorized Execution Engine | Out-of-core |
+===================================================+=================+===============+=============+=================+=============================+=============+
| Daft                                              | Yes             | Yes           | Yes         | Yes             | Yes                         | Yes         |
+---------------------------------------------------+-----------------+---------------+-------------+-----------------+-----------------------------+-------------+
| `Pandas <https://github.com/pandas-dev/pandas>`_  | No              | Python object | No          | optional >= 2.0 | Some (NumPy)                | No          |
+---------------------------------------------------+-----------------+---------------+-------------+-----------------+-----------------------------+-------------+
| `Polars <https://github.com/pola-rs/polars>`_     | Yes             | Python object | No          | Yes             | Yes                         | Yes         |
+---------------------------------------------------+-----------------+---------------+-------------+-----------------+-----------------------------+-------------+
| `Modin <https://github.com/modin-project/modin>`_ | Yes             | Python object | Yes         | No              | Some (Pandas)               | Yes         |
+---------------------------------------------------+-----------------+---------------+-------------+-----------------+-----------------------------+-------------+
| `Ray Data <https://github.com/ray-project/ray>`_  | No              | Yes           | Yes         | Yes             | Some (PyArrow)              | Yes         |
+---------------------------------------------------+-----------------+---------------+-------------+-----------------+-----------------------------+-------------+
| `PySpark <https://github.com/apache/spark>`_      | Yes             | No            | Yes         | Pandas UDF/IO   | Pandas UDF                  | Yes         |
+---------------------------------------------------+-----------------+---------------+-------------+-----------------+-----------------------------+-------------+
| `Dask DF <https://github.com/dask/dask>`_         | No              | Python object | Yes         | No              | Some (Pandas)               | Yes         |
+---------------------------------------------------+-----------------+---------------+-------------+-----------------+-----------------------------+-------------+
License
-------
Daft has an Apache 2.0 license - please see the LICENSE file.
.. |Quickstart Image| image:: https://github.com/Eventual-Inc/Daft/assets/17691182/dea2f515-9739-4f3e-ac58-cd96d51e44a8
   :alt: Dataframe code to load a folder of images from AWS S3 and create thumbnails
   :height: 256

.. |Benchmark Image| image:: https://raw.githubusercontent.com/Eventual-Inc/Daft/refs/heads/main/assets/benchmark.png
   :alt: AI Benchmarks

.. |Banner| image:: https://daft.ai/images/diagram.png
   :target: https://www.daft.ai
   :alt: Daft dataframes can load any data such as PDF documents, images, protobufs, csv, parquet and audio files into a table dataframe structure for easy querying

.. |CI| image:: https://github.com/Eventual-Inc/Daft/actions/workflows/pr-test-suite.yml/badge.svg
   :target: https://github.com/Eventual-Inc/Daft/actions/workflows/pr-test-suite.yml?query=branch:main
   :alt: GitHub Actions tests

.. |PyPI| image:: https://img.shields.io/pypi/v/daft.svg?label=pip&logo=PyPI&logoColor=white
   :target: https://pypi.org/project/daft
   :alt: PyPI

.. |Latest Tag| image:: https://img.shields.io/github/v/tag/Eventual-Inc/Daft?label=latest&logo=GitHub
   :target: https://github.com/Eventual-Inc/Daft/tags
   :alt: latest tag

.. |Coverage| image:: https://codecov.io/gh/Eventual-Inc/Daft/branch/main/graph/badge.svg?token=J430QVFE89
   :target: https://codecov.io/gh/Eventual-Inc/Daft
   :alt: Coverage

.. |Slack| image:: https://img.shields.io/badge/slack-@distdata-purple.svg?logo=slack
   :target: https://join.slack.com/t/dist-data/shared_invite/zt-3rh9jr9iv-tmmTNOlQpfvhEy2NTMWS_w
   :alt: slack community

.. |TrendShift| image:: https://trendshift.io/api/badge/repositories/8239
   :target: https://trendshift.io/repositories/8239
   :alt: Eventual-Inc/Daft | Trendshift
   :width: 250px
   :height: 55px