Check us out at profine.ai
Profile your PyTorch code on real GPUs. Get a transparent rewrite. Ship measured speedups before the multi-hour run.
On Karpathy's minGPT (single A100, measured end-to-end by profine):
| Metric | Baseline | profine | Δ |
|---|---|---|---|
| Step time (ms) | 25.22 | 8.11 | −67.8% (3.11× faster) |
| Peak memory (GB) | 1.43 | 0.45 | −68.7% |
| Correctness (loss curves match) | ✓ | ✓ | within BF16-widened tolerance |
Stack applied: BF16 Mixed Precision + TF32 matmul + torch.compile (max-autotune) + SDPA + Fused AdamW. Reproducible with:
```bash
profine run-all examples/minGPT/projects/chargpt/chargpt.py \
  --hardware 1x_a100 --steps 25 --warmup 10 --seed 42
```

Full artifacts (JSON + Markdown reports for every pipeline step) live in examples/minGPT/profine_output/. Read SUMMARY.md first.
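The five optimizations in that stack map onto stock PyTorch features. A minimal sketch of what the rewritten training loop enables, assuming a CUDA GPU (illustrative only; not the code profine actually generates):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.set_float32_matmul_precision("high")  # TF32 matmul on Ampere+ GPUs

class TinyAttention(nn.Module):
    def __init__(self, dim=64, heads=4):
        super().__init__()
        self.heads = heads
        self.qkv = nn.Linear(dim, 3 * dim)
        self.proj = nn.Linear(dim, dim)

    def forward(self, x):
        B, T, C = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        q, k, v = (t.view(B, T, self.heads, -1).transpose(1, 2) for t in (q, k, v))
        # SDPA dispatches to fused flash / memory-efficient attention kernels
        y = F.scaled_dot_product_attention(q, k, v, is_causal=True)
        return self.proj(y.transpose(1, 2).reshape(B, T, C))

model = torch.compile(TinyAttention().cuda(), mode="max-autotune")  # torch.compile
opt = torch.optim.AdamW(model.parameters(), lr=3e-4, fused=True)    # fused AdamW

x = torch.randn(8, 128, 64, device="cuda")
with torch.autocast(device_type="cuda", dtype=torch.bfloat16):      # BF16 mixed precision
    loss = model(x).square().mean()
loss.backward()
opt.step()
opt.zero_grad(set_to_none=True)
```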
## Installation

```bash
pip install profine
```

Requires:

- A Modal account (GPU execution backend)
- An LLM: OpenAI, Anthropic, or any OpenAI-compatible local server (Ollama, vLLM, LM Studio, llama.cpp, LiteLLM)
```bash
export MODAL_TOKEN_ID=...
export MODAL_TOKEN_SECRET=...

# Pick one LLM:
export OPENAI_API_KEY=...      # OpenAI
export ANTHROPIC_API_KEY=...   # Anthropic
# ...or run a local server (no API key needed) — see "Local LLMs" below

export HF_TOKEN=...            # optional, for gated models
```

## Local LLMs

profine talks to any OpenAI-compatible server. Run with `--provider local`, supply `--model`, and (optionally) `--base-url`.
Ollama (default endpoint `http://localhost:11434/v1`):

```bash
ollama serve &
ollama pull llama3.1:8b
profine run-all path/to/train.py --provider local --model llama3.1:8b
```

vLLM:
```bash
profine run-all path/to/train.py \
  --provider local \
  --model meta-llama/Llama-3.1-8B-Instruct \
  --base-url http://localhost:8000/v1
```

LM Studio / llama.cpp server / LiteLLM: point `--base-url` at the server. The endpoint can also be set via the `PROFINE_LOCAL_BASE_URL` environment variable.
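"OpenAI-compatible" here means the standard chat-completions API. A quick way to sanity-check that your local server is reachable before pointing profine at it, sketched with the openai Python client (endpoint and model name follow the Ollama example above):

```python
from openai import OpenAI

# Ollama's default endpoint; the same value you would pass as --base-url.
client = OpenAI(base_url="http://localhost:11434/v1", api_key="ollama")  # any non-empty key works

resp = client.chat.completions.create(
    model="llama3.1:8b",
    messages=[{"role": "user", "content": 'Reply with exactly the JSON object {"ok": true}'}],
)
print(resp.choices[0].message.content)  # a model fit for the agent loop returns clean JSON
```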
Note: the agent loop expects strong instruction-following and clean JSON output. Smaller open models (≤7B) may struggle on the `interpret` and `suggest` steps; we recommend 70B-class models or higher for end-to-end reliability.
## Pipeline

read → profile → interpret → suggest → edit → benchmark
Each step reads the previous step's output from profine_output/.
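Because every stage persists its record as JSON, intermediate results are easy to inspect or post-process outside the CLI. A small sketch that peeks at whichever artifacts exist (only the paths come from the docs; top-level keys are read generically rather than assuming a schema):

```python
import json
from pathlib import Path

out = Path("profine_output")

# One record per stage; each stage consumes its predecessor's file.
artifacts = {
    "read": out / "read" / "architecture_record.json",
    "profile": out / "profile" / "profile_record.json",
    "interpret": out / "interpret" / "bottleneck_report.json",
    "suggest": out / "suggest" / "suggestion_report.json",
}

for stage, path in artifacts.items():
    if path.exists():
        record = json.loads(path.read_text())
        shape = list(record) if isinstance(record, dict) else f"{len(record)} items"
        print(f"{stage:<10} {path.name}: {shape}")
    else:
        print(f"{stage:<10} not run yet")
```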
Global flags (all commands):

- `--provider {openai,anthropic,local}` (default `openai`)
- `--api-key`
- `--model`
- `--base-url` (for `local`)
- `--seed` (best-effort; makes LLM rankings reproducible)
- `-o`/`--output` (default `profine_output`)
- `--prefs`
### run-all

Run the entire pipeline end-to-end on one script.

```bash
profine run-all examples/minGPT/projects/chargpt/chargpt.py --hardware 1x_a100
```

| Flag | Default | Description |
|---|---|---|
| `--hardware` | `1x_a100` | Hardware preset |
| `--steps` | 60 | Total optimizer steps |
| `--warmup` | 30 | Warmup steps |
| `--timeout` | 900 | Modal container timeout (s) |
| `--warmstart` | off | Reuse deployed Modal app between runs |
| `--top` | all | Apply top N ranked optimizations |
| `--rtol` / `--atol` | 0.01 / 0.0001 | Loss tolerances (auto-widened for precision/quantization) |
Aborts on any failed step. Per-step artifacts land in their usual subdirectories under profine_output/.
### read

Extract model architecture, optimizer, dataloader, precision, and distributed strategy via AST + LLM.

```bash
profine read nanoGPT/train.py
```

No additional flags. Output: profine_output/read/architecture_record.json
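The AST half of the extraction is ordinary static analysis. A simplified sketch of the idea (not profine's actual extractor, which drives its patterns from profine/config/*.yaml): find which optimizer classes a training script constructs.

```python
import ast

KNOWN_OPTIMIZERS = {"SGD", "Adam", "AdamW", "Adagrad", "RMSprop"}

def find_optimizers(source: str) -> list[str]:
    """Return optimizer class names constructed anywhere in the script."""
    found = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Call):
            fn = node.func
            # Handles both torch.optim.AdamW(...) and bare AdamW(...)
            name = fn.attr if isinstance(fn, ast.Attribute) else getattr(fn, "id", None)
            if name in KNOWN_OPTIMIZERS:
                found.append(name)
    return found

print(find_optimizers("import torch\nopt = torch.optim.AdamW(p, lr=1e-3)"))  # ['AdamW']
```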
### profile

Instrument the script and run it on Modal with torch.profiler, collecting step times, kernel breakdown, GPU utilization, and memory.

```bash
profine profile nanoGPT/train.py --hardware 1x_a100
```

| Flag | Default | Description |
|---|---|---|
| `--hardware` | `1x_a100` | Hardware preset name |
| `--steps` | 60 | Total optimizer steps |
| `--warmup` | 30 | Warmup steps (discarded) |
| `--timeout` | 900 | Modal container timeout (s) |
| `--warmstart` | off | Reuse deployed Modal app between runs |
Output: profine_output/profile/profile_record.json
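A minimal standalone torch.profiler harness showing the kind of measurement this step performs (profine's actual instrumentation and schedule will differ):

```python
import torch
from torch.profiler import ProfilerActivity, profile, schedule

model = torch.nn.Linear(1024, 1024).cuda()
opt = torch.optim.SGD(model.parameters(), lr=0.1)
x = torch.randn(64, 1024, device="cuda")

sched = schedule(wait=1, warmup=3, active=5)  # discard warmup, keep 5 measured steps
with profile(activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA],
             schedule=sched, profile_memory=True) as prof:
    for _ in range(9):  # wait + warmup + active = 9 steps
        loss = model(x).sum()
        loss.backward()
        opt.step()
        opt.zero_grad()
        prof.step()  # advance the profiler schedule once per optimizer step

print(prof.key_averages().table(sort_by="cuda_time_total", row_limit=10))
```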
### interpret

Deterministic analysis (cost, memory utilization, per-category kernel times) + LLM bottleneck diagnosis.

```bash
profine interpret --profile-dir profine_output/profile
```

| Flag | Default | Description |
|---|---|---|
| `--profile-dir` | required | Directory containing profile_record.json |
Output: profine_output/interpret/bottleneck_report.json
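The deterministic part reduces to arithmetic over the kernel table. An illustrative sketch of per-category time shares; the kernel records and category names here are hypothetical, not profine's schema:

```python
# Hypothetical kernel list; the real profile_record.json schema may differ.
kernels = [
    {"name": "ampere_sgemm_128x64", "category": "matmul", "cuda_time_ms": 11.2},
    {"name": "elementwise_kernel", "category": "pointwise", "cuda_time_ms": 6.4},
    {"name": "Memcpy HtoD", "category": "memory", "cuda_time_ms": 3.1},
]

totals: dict[str, float] = {}
for k in kernels:
    totals[k["category"]] = totals.get(k["category"], 0.0) + k["cuda_time_ms"]

step_ms = sum(totals.values())
for cat, ms in sorted(totals.items(), key=lambda kv: -kv[1]):
    print(f"{cat:<10} {ms:6.1f} ms  ({100 * ms / step_ms:4.1f}%)")
```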
### suggest

Filter applicable optimizations from the catalog; LLM ranks by ROI.

```bash
profine suggest --interpret-dir profine_output/interpret
```

| Flag | Default | Description |
|---|---|---|
| `--interpret-dir` | required | Directory containing bottleneck_report.json |
| `--arch-dir` | auto | Directory containing architecture_record.json |
| `--profile-dir` | auto | Directory containing profile_record.json |
Output: profine_output/suggest/suggestion_report.json
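Conceptually, the filter is a set of applicability predicates evaluated against the read/profile records, and the LLM only ranks the survivors. A toy sketch with hypothetical catalog entries (the real catalog lives in profine/config/*.yaml):

```python
# Toy catalog with made-up applicability requirements.
catalog = [
    {"id": "torch_compile", "requires": {"min_pytorch": (2, 0)}},
    {"id": "bf16_mixed_precision", "requires": {"gpu_arch": {"ampere", "hopper"}}},
    {"id": "fused_adamw", "requires": {"optimizer": {"AdamW"}}},
]

env = {"pytorch": (2, 3), "gpu_arch": "ampere", "optimizer": "Adam"}

def applicable(entry: dict) -> bool:
    req = entry["requires"]
    if "min_pytorch" in req and env["pytorch"] < req["min_pytorch"]:
        return False
    if "gpu_arch" in req and env["gpu_arch"] not in req["gpu_arch"]:
        return False
    if "optimizer" in req and env["optimizer"] not in req["optimizer"]:
        return False
    return True

survivors = [e["id"] for e in catalog if applicable(e)]
print(survivors)  # ['torch_compile', 'bf16_mixed_precision']; fused_adamw filtered out
```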
### edit

Apply an optimization. Multi-file aware: discovers local modules the entry script imports and edits whichever file owns the code being optimized. Patched library files land under profine_output/edit/files/<rel-path> — your source tree is never modified.

```bash
profine edit nanoGPT/train.py --suggestion-dir profine_output/suggest
profine edit nanoGPT/train.py --suggestion-dir profine_output/suggest --optimization torch_compile
profine edit nanoGPT/train.py --suggestion-dir profine_output/suggest --top 3
```

| Flag | Default | Description |
|---|---|---|
| `--suggestion-dir` | required | Directory containing suggestion_report.json |
| `--optimization` | 1 | Rank (`1`, `2`, ...) or entry ID (`torch_compile`). Ignored when `--top` is set. |
| `--top` | unset | Apply top N ranked optimizations sequentially, stacked. |
With --top N, per-iteration artifacts go in profine_output/edit/01_<entry_id>/, 02_<entry_id>/, etc.; cumulative result at profine_output/edit/edited_train.py. Optimizations the LLM declines are recorded in the manifest's skipped list and the loop continues.
Output: profine_output/edit/edited_train.py, profine_output/edit/files/, profine_output/edit/change_manifest.json
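The invariant worth noting is that edits to imported modules shadow your tree rather than overwrite it. A sketch of that copy-aside pattern (paths follow the layout above; the helper itself is illustrative):

```python
from pathlib import Path

def stage_patched_file(rel_path: str, patched_source: str,
                       out_dir: Path = Path("profine_output/edit/files")) -> Path:
    """Write an edited copy under profine_output/edit/files/<rel-path>,
    leaving the original source tree untouched."""
    dest = out_dir / rel_path
    dest.parent.mkdir(parents=True, exist_ok=True)
    dest.write_text(patched_source)
    return dest

# e.g. for an edit to a local module the entry script imports:
# stage_patched_file("model.py", new_model_source)
```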
### benchmark

Run original and optimized back-to-back on the same hardware. Patched library files in profine_output/edit/files/ are overlaid on the optimized run. Loss tolerance auto-widens for numerics-perturbing classes (BF16/mixed precision: rtol 5%, quantization: rtol 10%).

```bash
# Picks up the editor's most recent output automatically:
profine benchmark nanoGPT/train.py --hardware 1x_a100

# Or point at a specific optimized script:
profine benchmark nanoGPT/train.py --optimized profine_output/edit/edited_train.py
```

| Flag | Default | Description |
|---|---|---|
| `--optimized` | `<output>/edit/edited_train.py` | Path to the optimized script |
| `--hardware` | `1x_a100` | Hardware preset name |
| `--steps` | 60 | Total optimizer steps |
| `--warmup` | 30 | Warmup steps |
| `--rtol` | 0.01 | Relative tolerance for loss check (auto-widened) |
| `--atol` | 0.0001 | Absolute tolerance for loss check (auto-widened) |
| `--edit-dir` | `<output>/edit` | Directory whose files/ subtree is overlaid onto the optimized run (multi-file edits) |
| `--timeout` | 900 | Modal container timeout (s) |
| `--warmstart` | off | Reuse deployed Modal app between runs |
Output: profine_output/benchmark/
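A sketch of the correctness gate with auto-widening, using the default tolerances and the widened rtol values quoted above (the comparison function is illustrative; profine compares loss curves, not just a single value):

```python
import math

WIDENED_RTOL = {"mixed_precision": 0.05, "quantization": 0.10}  # 5% / 10% per the docs

def losses_match(baseline: float, optimized: float,
                 opt_class: str | None = None,
                 rtol: float = 0.01, atol: float = 1e-4) -> bool:
    # Auto-widen rtol for numerics-perturbing optimization classes.
    rtol = max(rtol, WIDENED_RTOL.get(opt_class, 0.0))
    return math.isclose(optimized, baseline, rel_tol=rtol, abs_tol=atol)

print(losses_match(2.310, 2.305))                               # True under default 1% rtol
print(losses_match(2.310, 2.400, opt_class="mixed_precision"))  # True under widened 5% rtol
```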
## Hardware presets

Defined in profine/config/hardware.yaml.

| Preset | GPU | VRAM | Cost/hr |
|---|---|---|---|
| `1x_t4` | T4 | 16 GB | $0.59 |
| `1x_l4` | L4 | 24 GB | $0.80 |
| `1x_a10g` | A10G | 24 GB | $1.10 |
| `1x_a100` | A100 | 80 GB | $2.50 |
| `1x_h100` | H100 | 80 GB | $3.95 |
Prices from modal.com/pricing.
All data tables (hardware, optimization catalog, kernel patterns, extractor patterns) live in profine/config/*.yaml and can be extended without code changes.
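For example, adding a new hardware preset is just a YAML edit. A hedged sketch of what an entry might look like (field names and values are guesses; check profine/config/hardware.yaml for the real schema), validated with a quick load:

```python
import yaml  # pip install pyyaml

# Hypothetical preset entry; the real schema may differ.
new_preset = yaml.safe_load("""
1x_l40s:
  gpu: L40S
  vram_gb: 48
  cost_per_hr: 1.95
""")
print(new_preset)  # {'1x_l40s': {'gpu': 'L40S', 'vram_gb': 48, 'cost_per_hr': 1.95}}
```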
## License

MIT. See LICENSE.