Metadata-Version: 2.4
Name: codablellm
Version: 1.3.2
Summary: A framework for creating and curating high-quality code datasets tailored for large language models
Author-email: Dylan Manuel <dylan.manuel@my.utsa.edu>
Project-URL: Homepage, https://codablellm.readthedocs.io
Project-URL: Bug Tracker, https://github.com/dmanuel64/codablellm/issues
Project-URL: Documentation, https://codablellm.readthedocs.io
Project-URL: GitHub, https://github.com/dmanuel64/codablellm
Keywords: large language models,automation,reverse engineering,software security,dataset generation
Classifier: Development Status :: 4 - Beta
Classifier: Intended Audience :: Developers
Classifier: Topic :: Software Development :: Build Tools
Classifier: Topic :: Software Development :: Code Generators
Classifier: Topic :: Software Development :: Pre-processors
Classifier: Topic :: Software Development :: Version Control
Classifier: Topic :: Security
Classifier: Topic :: Scientific/Engineering :: Artificial Intelligence
Classifier: License :: OSI Approved :: MIT License
Classifier: Operating System :: OS Independent
Classifier: Programming Language :: Python
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.9
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Requires-Python: <3.13,>=3.9
Description-Content-Type: text/markdown
License-File: LICENSE
Requires-Dist: Deprecated>=1.2.18
Requires-Dist: GitPython>=3.1.43
Requires-Dist: jinja2>=3.1.6
Requires-Dist: pandas>=2.2.3
Requires-Dist: prefect[dask]>=3.2.15
Requires-Dist: requests>=2.32.3
Requires-Dist: tree-sitter==0.23.2
Requires-Dist: tree-sitter-c==0.23.4
Requires-Dist: tree-sitter-cpp==0.23.4
Requires-Dist: typer>=0.15.1
Provides-Extra: rust
Requires-Dist: tree-sitter-rust==0.23.2; extra == "rust"
Provides-Extra: javascript
Requires-Dist: tree-sitter-javascript==0.23.1; extra == "javascript"
Provides-Extra: typescript
Requires-Dist: tree-sitter-typescript==0.23.2; extra == "typescript"
Provides-Extra: python
Requires-Dist: tree-sitter-python==0.23.6; extra == "python"
Provides-Extra: java
Requires-Dist: tree-sitter-java==0.23.5; extra == "java"
Provides-Extra: langs
Requires-Dist: tree-sitter-rust==0.23.2; extra == "langs"
Requires-Dist: tree-sitter-javascript==0.23.1; extra == "langs"
Requires-Dist: tree-sitter-typescript==0.23.2; extra == "langs"
Requires-Dist: tree-sitter-python==0.23.6; extra == "langs"
Requires-Dist: tree-sitter-java==0.23.5; extra == "langs"
Provides-Extra: angr
Requires-Dist: angr>=9.2.148; extra == "angr"
Provides-Extra: radare2
Requires-Dist: r2pipe>=1.9.4; extra == "radare2"
Provides-Extra: excel
Requires-Dist: openpyxl>=3.1.5; extra == "excel"
Provides-Extra: markdown
Requires-Dist: tabulate>=0.9.0; extra == "markdown"
Provides-Extra: xml
Requires-Dist: lxml>=5.3.0; extra == "xml"
Provides-Extra: all
Requires-Dist: openpyxl>=3.1.5; extra == "all"
Requires-Dist: tabulate>=0.9.0; extra == "all"
Requires-Dist: lxml>=5.3.0; extra == "all"
Requires-Dist: angr>=9.2.148; extra == "all"
Requires-Dist: r2pipe>=1.9.4; extra == "all"
Requires-Dist: tree-sitter-rust==0.23.2; extra == "all"
Requires-Dist: tree-sitter-javascript==0.23.1; extra == "all"
Requires-Dist: tree-sitter-typescript==0.23.2; extra == "all"
Requires-Dist: tree-sitter-python==0.23.6; extra == "all"
Requires-Dist: tree-sitter-java==0.23.5; extra == "all"
Provides-Extra: dev
Requires-Dist: pytest>=8.3.4; extra == "dev"
Provides-Extra: docs
Requires-Dist: mkdocs>=1.6.1; extra == "docs"
Requires-Dist: mkdocs-gen-files>=0.5.0; extra == "docs"
Requires-Dist: mkdocs-literate-nav>=0.6.1; extra == "docs"
Requires-Dist: mkdocs-material>=9.6.5; extra == "docs"
Requires-Dist: mkdocs-section-index>=0.3.9; extra == "docs"
Requires-Dist: mkdocstrings[python]>=0.26.0; extra == "docs"
Dynamic: license-file

<!-- markdownlint-disable MD041 -->
![Build Status](https://github.com/dmanuel64/codablellm/actions/workflows/test.yml/badge.svg?branch=main)
![Python Version](https://img.shields.io/pypi/pyversions/codablellm)
![PyPI](https://img.shields.io/pypi/v/codablellm)
![Downloads](https://img.shields.io/pypi/dm/codablellm)
![License](https://img.shields.io/github/license/dmanuel64/codablellm)
![Documentation Status](https://readthedocs.org/projects/codablellm/badge/?version=latest)

# CodableLLM

**CodableLLM** is a Python framework for creating and curating high-quality code datasets tailored for training and evaluating large language models (LLMs). It extracts both source code and decompiled code, and its flexible architecture handles multiple languages and integrates with custom LLM prompts.

## Installation

### PyPI

Install CodableLLM directly from PyPI:

```bash
pip install codablellm
```
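The package metadata also declares optional extras (`angr`, `radare2`, `excel`, `markdown`, `xml`, `langs`, and `all`), which can be requested with the usual pip syntax, for example `pip install "codablellm[angr]"`. The sketch below is one way to check which of those optional dependencies are importable in the current environment. The helper name `available_extras` and the mapping from extras to import names are assumptions made for illustration (inferred from the distribution names); they are not part of CodableLLM's API.

```python
from importlib.util import find_spec

# Extra name -> top-level modules that extra is expected to install.
# The import names are assumptions inferred from the distribution names.
OPTIONAL_MODULES = {
    "angr": ["angr"],
    "radare2": ["r2pipe"],
    "excel": ["openpyxl"],
    "markdown": ["tabulate"],
    "xml": ["lxml"],
    "langs": [
        "tree_sitter_rust",
        "tree_sitter_javascript",
        "tree_sitter_typescript",
        "tree_sitter_python",
        "tree_sitter_java",
    ],
}


def available_extras(modules=OPTIONAL_MODULES):
    """Report, per extra, whether all of its modules can be located."""
    return {
        extra: all(find_spec(name) is not None for name in names)
        for extra, names in modules.items()
    }


if __name__ == "__main__":
    for extra, installed in available_extras().items():
        status = "installed" if installed else "missing"
        print(f"{extra}: {status}")
```

`find_spec` only locates a module without importing it, so the check stays cheap even for heavy packages such as angr.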

### Docker Compose (Recommended)

CodableLLM uses [Prefect](https://www.prefect.io/) for orchestration and parallel processing.
Because Prefect relies on a backend database, we recommend using the provided Docker Compose setup, which includes a configured PostgreSQL database.

**Run an example extraction using Docker Compose**:

```bash
docker compose run --rm app \
  codablellm \
  --url https://github.com/dmanuel64/codablellm/raw/refs/heads/main/examples/demo-c-repo.zip \
  /tmp/demo-c-repo \
  ./demo-c-repo.csv \
  /tmp/demo-c-repo \
  --strip \
  --transform my_transform.transform \
  --generation-mode temp-append \
  --build make
```

This command does the following:

- Downloads and extracts a compressed C project archive from the given `--url` to `/tmp/demo-c-repo`.
- Uses `/tmp/demo-c-repo` as both the source of extracted code and the location of compiled binaries.
- Outputs a dataset to `./demo-c-repo.csv` (relative to your host machine).
- Runs the build command (`make`) inside the extracted repo directory to generate binaries.
- Applies transformations using the function defined in `my_transform.py` (i.e., `my_transform.transform`).
- Uses `--generation-mode temp-append`, which appends transformed outputs to the original dataset, preserving both.

> **This uses the `app` service defined in `docker-compose.yml`, giving you access to the full environment including Prefect and PostgreSQL, which are required for managing flows and task state.**
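The `--transform my_transform.transform` argument in the command above names a function inside `my_transform.py`. As a rough illustration of the idea, and not CodableLLM's actual interface, the sketch below assumes a transform is a callable that receives one extracted function and returns a modified copy. The `ExtractedFunction` class and the `append_transformed` helper are hypothetical stand-ins invented for this example; consult the User Guide for the real types and signatures.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class ExtractedFunction:
    """Hypothetical stand-in for an extracted function; not a CodableLLM class."""

    name: str
    definition: str


def transform(function: ExtractedFunction) -> ExtractedFunction:
    """Return a copy whose definition is prefixed with a header comment."""
    header = f"/* Function: {function.name} */\n"
    return replace(function, definition=header + function.definition)


def append_transformed(rows, transform_fn):
    """Hypothetical illustration of append-style generation:
    keep every original row and add its transformed copy after them."""
    originals = list(rows)
    return originals + [transform_fn(row) for row in originals]


if __name__ == "__main__":
    original = ExtractedFunction(
        name="add",
        definition="int add(int a, int b) { return a + b; }",
    )
    for row in append_transformed([original], transform):
        print(row.definition)
        print("---")
```

Returning a new object instead of mutating the input keeps the original entry intact, which matches the "preserving both" behavior described for `temp-append`.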

## Features

- Extracts functions and methods from source code repositories using [tree-sitter](https://github.com/tree-sitter/tree-sitter).
- Integrates easily with LLMs to refine or augment extracted code (e.g., renaming variables or inserting comments).
- Language-agnostic design with support for plugin-based extractor and decompiler extensions.
- Extensible API for building your own workflows and datasets.
- Fast and scalable, using Prefect to orchestrate and parallelize code extraction, transformation, and dataset generation across multiple processes and tasks.

## Documentation

Complete documentation is available on [Read the Docs](https://codablellm.readthedocs.io/):

- [User Guide](https://codablellm.readthedocs.io/en/latest/User%20Guide/)
- [Supported Languages & Decompilers](https://codablellm.readthedocs.io/en/latest/Built-In%20Support/)
- [API Reference](https://codablellm.readthedocs.io/en/latest/documentation/codablellm/)

## Citation

If you use this tool in your research, please cite [the paper](https://arxiv.org/abs/2507.22066) associated with it:

```bibtex
@misc{manuel2025codablellmautomatingdecompiledsource,
      title={CodableLLM: Automating Decompiled and Source Code Mapping for LLM Dataset Generation}, 
      author={Dylan Manuel and Paul Rad},
      year={2025},
      eprint={2507.22066},
      archivePrefix={arXiv},
      primaryClass={cs.SE},
      url={https://arxiv.org/abs/2507.22066}, 
}
```

## Contributing

We welcome contributions from the community! See [CONTRIBUTING.md](https://github.com/dmanuel64/codablellm/blob/main/CONTRIBUTING.md) for guidelines, development setup, and how to get started.
