Open source · LLM agent over GraphSAGE

Ask your AWS bill questions. In English.

A natural-language agent that infers resource ownership from CloudTrail, IAM, and cost behaviour using a Graph Neural Network. Answers questions like "why did our bill spike Tuesday?" with specific resources, teams, and dollar amounts.

2.6M
Real Azure VMs in the audit set
13/15
Real-AWS attribution accuracy (87%)
+53%
Lift over best baseline (k-fold)
3
Clouds: AWS · Azure · GCP
01

Try it live

Chat with the agent over a synthetic 68-resource AWS account. The agent has 10 tools — it picks which to chain based on your question, hits the GraphSAGE-attributed scan, and answers in plain English.

~/costdna · costdna chat — live demo · synthetic AWS account · 68 resources · 4 teams


Live demo runs on GPT-4o · ~$0.01 per question · rate-limited to 5/IP/hour

02

The problem

Every FinOps team's most painful metric: the percentage of AWS spend that can't be attributed to any team. Industry estimates put it at 40–60%. Tagging is the standard answer, but tags drift. Resources are created in a hurry. Engineers leave.

Existing FinOps dashboards (CloudHealth, Vantage, Apptio) are only as good as the tags you have — and on most accounts, the tags you have aren't enough. CostDNA is the input layer: a tool that infers the missing tags from behaviour, then lets you ask English questions about the result.

03

The audit story

I had a 97% accuracy result on Microsoft's published 2.6M-VM Azure dataset. I audited it. It was a tautology.

First-cut
97%
Inflated by leak
Honest
6.9%
After audit, 100 classes
Random baseline
1%
12× lift remains
The catch

Across all 33,205 deployments in the Azure trace, every single one mapped to exactly one subscription. So deployment_id, used as a graph edge, was a perfect lookup of subscription_id. LabelProp's "97%" was a graph-database join, not learning.

Most engineers stop when they see a high accuracy number and ship it. I caught the leak by asking "are you sure the data is accurate?" The same audit on Microsoft's Philly DL trace surfaced another partial leak: 85% of users belong to one virtual cluster. Two public datasets, two different shortcuts, one consistent finding.

| Dataset | Resources | First-cut | Audited shortcut | Honest behavioural |
| --- | --- | --- | --- | --- |
| Microsoft Azure | 2.6M VMs | 97% | deployment_id ≡ sub (100% deterministic) | 6.9% (12× rand) |
| Microsoft Philly | 117K jobs | 89% | user_id → vc (85% deterministic) | 14% (2× rand) |
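
The audit itself is small enough to sketch. A minimal version in pandas, assuming a trace DataFrame whose edge and label columns are named as in the table (illustrative, not CostDNA's exact audit code):

python
# Leak check: does a candidate graph edge deterministically determine the label?
# If every deployment_id maps to exactly one subscription_id, a model that uses
# that edge is doing a join, not learning.
import pandas as pd

def leak_ratio(df: pd.DataFrame, feature: str, label: str) -> float:
    """Fraction of feature values that map to exactly one label value."""
    labels_per_value = df.groupby(feature)[label].nunique()
    return float((labels_per_value == 1).mean())

# Azure trace:  leak_ratio(trace, "deployment_id", "subscription_id") -> 1.00
# Philly trace: leak_ratio(trace, "user_id", "vc")                    -> ~0.85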

The methodological finding: production cloud attribution is mostly a metadata-lookup problem. Behavioural fingerprinting matters specifically when metadata is missing or unreliable — exactly the gap CostDNA's synthetic env reproduces.

04

How it works

Three layers, all cloud-agnostic. Only the collector layer (left) changes per cloud — the GNN and agent (right) are identical for AWS, Azure, and GCP.

STEP 1
Collect

Hardened boto3 / azure-mgmt / google-cloud collectors pull CloudTrail (or equivalent), Cost Explorer, IAM roles, VPC flow logs. Throttle-aware retry. AWS production-tested.
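
A hedged sketch of the collector's shape (the production collector adds checkpointing and further hardening; names here are illustrative):

python
# Throttle-aware CloudTrail pull. botocore's "adaptive" retry mode adds a
# client-side rate limiter on top of exponential backoff.
import boto3
from botocore.config import Config

RETRY_CFG = Config(retries={"mode": "adaptive", "max_attempts": 10})

def pull_cloudtrail_events(region: str = "us-east-1", max_pages: int = 50) -> list:
    ct = boto3.client("cloudtrail", region_name=region, config=RETRY_CFG)
    events = []
    for i, page in enumerate(ct.get_paginator("lookup_events").paginate()):
        events.extend(page["Events"])
        if i + 1 >= max_pages:  # cap the pull so one scan can't hammer the API
            break
    return events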

STEP 2
Train

Behavioural features (peak_hour, write_ratio, event_diversity, …) + LLM-derived semantic features (sentence-transformer over IAM names) → 2/4-layer GraphSAGE GNN. Auto-shrinks for small label sets.
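
In code, the model is roughly this shape. A minimal PyTorch Geometric sketch, not CostDNA's exact classifier; dimensions and names are placeholders:

python
# Two SAGEConv layers with a residual connection. Input features concatenate
# behavioural stats with sentence-transformer embeddings of IAM names.
import torch
import torch.nn.functional as F
from torch_geometric.nn import SAGEConv

class ResourceSAGE(torch.nn.Module):
    def __init__(self, in_dim: int, hidden: int, n_teams: int):
        super().__init__()
        self.conv1 = SAGEConv(in_dim, hidden)
        self.conv2 = SAGEConv(hidden, hidden)
        self.head = torch.nn.Linear(hidden, n_teams)

    def forward(self, x, edge_index):
        h = F.relu(self.conv1(x, edge_index))
        h = h + F.relu(self.conv2(h, edge_index))  # residual hop
        return self.head(h)  # per-resource team logits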

STEP 3
Ask

10-tool LLM agent (GPT-4o, function-calling) answers natural-language questions. Tools are pure data lookups against the trained scan output — fast, deterministic, auditable.
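
The loop itself is the standard OpenAI function-calling pattern. A minimal sketch; the tool schemas are elided and run_tool is a hypothetical dispatcher over the 10 lookups listed below:

python
# One turn of the agent: the model picks tools, we execute them as pure data
# lookups, feed results back, and the model writes the plain-English answer.
import json
from openai import OpenAI

client = OpenAI()

def answer(question: str, tools: list, run_tool) -> str:
    messages = [{"role": "user", "content": question}]
    while True:
        resp = client.chat.completions.create(
            model="gpt-4o", messages=messages, tools=tools)
        msg = resp.choices[0].message
        if not msg.tool_calls:  # no more lookups requested: final answer
            return msg.content
        messages.append(msg)
        for call in msg.tool_calls:  # run each requested lookup
            result = run_tool(call.function.name,
                              json.loads(call.function.arguments))
            messages.append({"role": "tool", "tool_call_id": call.id,
                             "content": json.dumps(result)})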

05

10 tools the agent chains

Each tool wraps a piece of the underlying CostDNA pipeline. The LLM decides which to call (or chain) based on the visitor's question.

summarize_account

High-level rollup: total resources, by-team spend and confidence.

attribute_resource

Look up which team owns a specific resource and the why-explanation.

top_spenders

Top resources by total cost, optionally filtered by team.

find_cost_spikes

Largest spikes + Granger-causality attribution to deploys (sketched in code after this list).

find_anomalies

Resources that don't fit any team — investigate manually.

search_resources

Substring match across resource IDs.

signal_history

Recent CloudTrail events + cost samples for one resource.

find_idle

Low-activity resources to consider for cleanup.

compare_teams

Side-by-side comparison: counts, spend, top resources, by type.

find_abandoned

Resources whose activity has collapsed in the recent half of the window — likely abandoned. Sorted by spend.
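
The Granger step behind find_cost_spikes (referenced above) can be sketched with statsmodels, assuming two aligned daily series; the DataFrame and column names are hypothetical:

python
# Does the deploy count help predict cost beyond cost's own history?
# A small p-value at some lag suggests deploys Granger-cause the spike.
import pandas as pd
from statsmodels.tsa.stattools import grangercausalitytests

def deploy_causes_cost(df: pd.DataFrame, max_lag: int = 3) -> dict:
    # statsmodels tests whether the second column Granger-causes the first
    results = grangercausalitytests(df[["cost", "deploys"]], maxlag=max_lag)
    return {lag: res[0]["ssr_ftest"][1] for lag, res in results.items()}  # lag -> p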

06

Real-AWS deployment

Provisioned a labelled AWS environment, ran per-team workload simulators on an always-on EC2 instance to generate authentic CloudTrail signal, then scanned the live account. Same code that powers the live demo.

13 / 15
Per-resource accuracy (87%)
13 / 13
High-confidence (≥0.79) accuracy — 100%
13,402
CloudTrail events processed

Both wrong predictions came back with confidence below 0.7 and were correctly surfaced by find_anomalies for human review — exactly the active-learning workflow the system is designed for. Verification artifacts in docs/real-aws-evidence/.

07

Visual proof — embedding space

GraphSAGE learns an embedding in which same-team resources cluster together and unowned ones sit visibly separate; UMAP projects it to 2D for the plot below.

UMAP embedding of synthetic AWS resources

Synthetic AWS env. The tan unowned cluster (vendor / legacy / orphan / shadow resources) sits visibly apart from the team clusters. The anomaly detector catches them automatically.
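
The projection itself is a couple of lines, assuming the trained node embeddings as an (n_resources × hidden_dim) array; umap-learn is the illustrative choice here:

python
# Project GraphSAGE's hidden-layer embeddings to 2D for plotting.
import umap  # pip install umap-learn

def project_2d(embeddings):
    return umap.UMAP(n_components=2, random_state=42).fit_transform(embeddings)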

08

Multi-cloud architecture

The model + features + agent are cloud-agnostic — only the collector layer is provider-specific. AWS calls cloudtrail:LookupEvents; Azure calls monitor.activity_logs.list; GCP calls cloud_logging.list_entries. Same downstream pipeline.
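
One way to express that boundary, as a hedged sketch of the provider interface (names illustrative, not CostDNA's actual classes):

python
# Each cloud implements the same small surface; everything downstream
# (features, GNN, agent tools) consumes the normalized records.
from typing import Protocol

class Collector(Protocol):
    def collect_events(self) -> list[dict]: ...      # CloudTrail / Activity Log / Cloud Logging
    def collect_costs(self) -> list[dict]: ...       # Cost Explorer / Cost Management / Billing
    def collect_identities(self) -> list[dict]: ...  # IAM roles and policies per provider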

AWS
Production-tested

13/15 = 87% on real AWS · live demo runs against this stack

Azure
Implemented · awaiting validation

Code follows azure-mgmt-resource v25 + activity-logs + cost-management v4 patterns; mocked-shape tests pass

GCP
Implemented · awaiting validation

Code follows google-cloud-asset v4 + cloud-logging protobuf payloads; mocked-shape tests pass

09

Run it yourself

Four usage patterns: CLI, REPL, web UI, and Docker. The first three run against the same agent code that powers the live demo above.

A. One-shot CLI

bash
$ costdna ask "why did our bill spike Tuesday?" \
    --from-dir runs/today

B. Multi-turn REPL

bash
$ costdna chat --from-dir runs/today
[0] ❯ summarize this account
[1] ❯ which 5 resources are spending the most?
[2] ❯ tell me about i-0c4f3230 specifically

C. Web chat UI (Streamlit)

bash
$ pip install 'costdna[ui,agent]'
$ costdna serve
# open http://localhost:8501 → "Chat with the agent" tab

D. Docker (no install)

bash
$ docker run --rm pauti04/costdna scan --synthetic

Setup for the agent commands: pip install 'costdna[agent]' + export OPENAI_API_KEY=...

10

Stack

  • Python 3.11: pandas, numpy, scikit-learn, statsmodels, networkx
  • PyTorch 2.x + PyTorch Geometric: GraphSAGE classifier — 2 to 4 layers, residual
  • sentence-transformers: MiniLM embeddings of IAM role names + resource IDs + tags
  • OpenAI SDK: function-calling agent loop — pluggable LLM backend
  • boto3 (hardened): adaptive retry, throttle-aware CloudTrail lookup_events
  • azure-mgmt-* + google-cloud-*: multi-cloud collectors (lazy-loaded extras)
  • Streamlit + Click + Rich: CLI commands + interactive chat UI
  • Terraform: labelled AWS env with CloudTrail data events + VPC Flow Logs
  • pytest + GitHub Actions: CI on every commit; Docker auto-publish on tag
  • Next.js + Vercel: this landing page + serverless agent endpoint

Open source. Forkable. Ready.

MIT licensed, hardened collectors, multi-cloud architecture, real-AWS numbers in the README. If you're hiring for cloud-cost / FinOps / ML-infra roles — I'd like to do this kind of work full-time.