Spatial Atlas

Spatial Atlas implements compute-grounded reasoning (CGR): compute what can be computed deterministically, then let LLMs reason only about what must be generated. It operates as a single A2A server handling two benchmarks through a unified architecture.

Benchmark	Task	Input	Output
FieldWorkArena	Multimodal spatial QA (factory, warehouse, retail)	Text + images, PDFs, videos	Formatted answer
MLE-Bench	75 Kaggle ML competitions	Instructions + competition data	submission.csv

Skills

Multimodal Field Research: Analyzes factory, warehouse, and retail environments from images, videos, PDFs, and documents. Spatial reasoning with structured scene graphs, safety inspection, and formatted reporting.

ML Engineering: Solves Kaggle-style ML competitions end-to-end: data analysis, feature engineering, model training, and submission generation.

Architecture

Key Innovations

1. Spatial Scene Graphs

Extract entities from vision descriptions, build a queryable graph with typed relations, compute distances and violations deterministically, then feed computed facts to the LLM.

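+21-24 pts on factory and warehouse tasks and +19 pts on retail, relative to the same system without the scene graph (see the FieldWorkArena ablation).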
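
The extract, build, compute, feed sequence described above can be sketched roughly as follows. This is a minimal illustration rather than the project's code: the `Entity` and `SceneGraph` classes, the 2D coordinates in metres, the `clearance_violations` rule, and the 1.5 m threshold are all assumptions made for the example.

```python
from dataclasses import dataclass, field
from math import hypot


@dataclass(frozen=True)
class Entity:
    name: str
    kind: str
    x: float  # hypothetical floor-plan coordinates, in metres
    y: float


@dataclass
class SceneGraph:
    entities: dict = field(default_factory=dict)
    relations: list = field(default_factory=list)  # (subject, relation, object)

    def add(self, entity):
        self.entities[entity.name] = entity

    def relate(self, subject, relation, obj):
        self.relations.append((subject, relation, obj))

    def distance(self, a, b):
        ea, eb = self.entities[a], self.entities[b]
        return hypot(ea.x - eb.x, ea.y - eb.y)

    def clearance_violations(self, kind_a, kind_b, min_distance):
        """Return sorted (a, b, distance) for pairs closer than min_distance."""
        found = []
        for a in self.entities.values():
            for b in self.entities.values():
                if a.kind == kind_a and b.kind == kind_b and a.name != b.name:
                    d = self.distance(a.name, b.name)
                    if d < min_distance:
                        found.append((a.name, b.name, round(d, 2)))
        return sorted(found)


def facts_for_prompt(graph, violations):
    """Render typed relations and computed violations as plain-text facts."""
    lines = [f"{s} {r} {o}" for s, r, o in graph.relations]
    lines += [f"{a} is {d} m from {b}, below the required clearance"
              for a, b, d in violations]
    return "\n".join(lines)


graph = SceneGraph()
graph.add(Entity("pallet_1", "pallet", 0.0, 0.0))
graph.add(Entity("pallet_2", "pallet", 5.0, 5.0))
graph.add(Entity("exit_A", "exit", 0.6, 0.8))
graph.relate("pallet_1", "near", "exit_A")

violations = graph.clearance_violations("pallet", "exit", min_distance=1.5)
print(facts_for_prompt(graph, violations))
```

The point of the design is that the distance and the violation are computed in code, so the LLM only has to phrase an answer around facts it is handed.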

2. Entropy-Guided Reasoning

Information-theoretic framework that estimates answer entropy at each step. It triggers reflection when confidence is low and routes to stronger models only when needed.

+7-8 pts accuracy improvement.
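
One way to picture the entropy signal is to sample several answers and measure how much they disagree. The sketch below assumes that estimator; the source does not say how entropy is actually estimated, and the function names and the 0.5 / 1.0 bit thresholds are hypothetical.

```python
from collections import Counter
from math import log2


def answer_entropy(samples):
    """Shannon entropy (in bits) of the empirical distribution of answers."""
    counts = Counter(s.strip().lower() for s in samples)
    total = sum(counts.values())
    return sum((c / total) * log2(total / c) for c in counts.values())


def choose_action(samples, reflect_at=0.5, escalate_at=1.0):
    """Map entropy to an action; both thresholds are illustrative."""
    h = answer_entropy(samples)
    if h >= escalate_at:
        return "escalate"  # route to the stronger model
    if h >= reflect_at:
        return "reflect"   # re-check with the current model
    return "accept"


print(choose_action(["3", "3", "3", "3"]))  # unanimous samples
print(choose_action(["3", "3", "3", "4"]))  # mild disagreement
print(choose_action(["3", "4", "5", "6"]))  # no agreement at all
```

With these thresholds, unanimous samples are accepted, a single dissenting sample out of four triggers reflection, and four distinct answers escalate.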

3. Self-Healing ML Pipeline

Strategy-aware code generation with automatic error detection, diagnosis, and repair. Covers tabular, NLP, vision, time series, and general strategies.

82% valid submission rate across 75 competitions.
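
The detect, diagnose, repair loop might look something like the sketch below. It is an assumption-laden illustration: `run_script`, `diagnose`, `self_heal`, and the attempt limit are hypothetical names and choices, and `toy_repair` stands in for whatever model call the real pipeline uses to rewrite failing code.

```python
import subprocess
import sys
import tempfile
from pathlib import Path


def run_script(code, timeout=60):
    """Run code in a fresh interpreter and return (ok, stderr)."""
    with tempfile.TemporaryDirectory() as tmp:
        script = Path(tmp) / "pipeline.py"
        script.write_text(code)
        proc = subprocess.run(
            [sys.executable, str(script)],
            capture_output=True, text=True, timeout=timeout, cwd=tmp,
        )
    return proc.returncode == 0, proc.stderr


def diagnose(stderr):
    """Use the exception name on the last traceback line as the error class."""
    lines = stderr.strip().splitlines()
    last = lines[-1] if lines else ""
    return last.split(":", 1)[0] or "Unknown"


def self_heal(code, repair, max_attempts=3):
    """Run, diagnose, and repair until the script succeeds or attempts run out."""
    history = []
    for _ in range(max_attempts):
        ok, stderr = run_script(code)
        if ok:
            return code, history
        error = diagnose(stderr)
        history.append(error)
        code = repair(code, error, stderr)
    return None, history


def toy_repair(code, error, stderr):
    """Stand-in for a model-driven repair step (hypothetical)."""
    if error == "NameError":
        return "import math\n" + code
    return code


fixed, history = self_heal("print(math.sqrt(16))", toy_repair)
print(history)
print(fixed)
```

Here the first run fails with a `NameError`, the stub prepends the missing import, and the second run succeeds.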

4. Score-Driven Refinement

Parses validation scores from pipeline output, uses a cross-provider model to propose targeted improvements, and keeps whichever submission scores higher.

35-40% improvement rate on eligible tasks.
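
A rough sketch of the parse-and-compare step, assuming the pipeline prints a line such as `Validation score: 0.81`. That log format, the regular expression, and the function names are illustrative assumptions; the `higher_is_better` flag is also an addition here, for metrics where a lower value wins.

```python
import re

# Hypothetical log format: "validation score: 0.81" or "validation_score=0.81"
SCORE_RE = re.compile(
    r"validation[ _]score\s*[:=]\s*([-+]?\d*\.?\d+(?:[eE][-+]?\d+)?)",
    re.IGNORECASE,
)


def parse_score(output):
    """Return the last validation score found in the output, or None."""
    matches = SCORE_RE.findall(output)
    return float(matches[-1]) if matches else None


def pick_submission(baseline, candidate, higher_is_better=True):
    """Each argument is (path, pipeline_output); return the path to keep."""
    base_score = parse_score(baseline[1])
    cand_score = parse_score(candidate[1])
    if cand_score is None:
        return baseline[0]  # the refinement produced no score: keep the baseline
    if base_score is None:
        return candidate[0]
    if higher_is_better:
        better = cand_score > base_score
    else:
        better = cand_score < base_score
    return candidate[0] if better else baseline[0]


baseline = ("submission_v1.csv", "epoch 3\nValidation score: 0.79\n")
candidate = ("submission_v2.csv", "epoch 3\nValidation score: 0.81\n")
print(pick_submission(baseline, candidate))
```

Falling back to the baseline when the refined run prints no score means a failed refinement cannot make the result worse.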

5. Leak Audit Registry

Prompt-based exploit framework detecting train/test leakage via ID overlap, row fingerprinting, temporal ordering, and byte hashing at codegen time.
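
The four checks named above can be illustrated with standard-library code. This is only a sketch of what such checks could compute: the function names, the column names (`id`, `target`, `timestamp`), and the rows-as-dicts representation are assumptions, and the source describes the registry as prompt-based, so the real checks may be expressed quite differently.

```python
import hashlib


def id_overlap(train_rows, test_rows, id_col="id"):
    """IDs that appear in both splits."""
    return {r[id_col] for r in train_rows} & {r[id_col] for r in test_rows}


def fingerprint(row, exclude=()):
    """SHA-256 over the row's values, taken in sorted column order."""
    payload = "|".join(f"{k}={row[k]}" for k in sorted(row) if k not in exclude)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def duplicated_rows(train_rows, test_rows, exclude=("id", "target")):
    """Indices of test rows whose remaining columns match a training row."""
    seen = {fingerprint(r, exclude) for r in train_rows}
    return [i for i, r in enumerate(test_rows) if fingerprint(r, exclude) in seen]


def temporal_overlap(train_rows, test_rows, time_col="timestamp"):
    """True if some test row is not later than the latest training row."""
    latest_train = max(r[time_col] for r in train_rows)
    earliest_test = min(r[time_col] for r in test_rows)
    return earliest_test <= latest_train


def byte_hash(data):
    """SHA-256 of raw bytes, for spotting identical files across splits."""
    return hashlib.sha256(data).hexdigest()


train = [
    {"id": 1, "f1": "a", "f2": 3, "target": 0, "timestamp": "2024-01-01"},
    {"id": 2, "f1": "b", "f2": 4, "target": 1, "timestamp": "2024-01-05"},
]
test = [
    {"id": 2, "f1": "z", "f2": 9, "timestamp": "2024-01-03"},
    {"id": 7, "f1": "a", "f2": 3, "timestamp": "2024-02-01"},
]

skip = ("id", "target", "timestamp")
print(id_overlap(train, test))
print(duplicated_rows(train, test, exclude=skip))
print(temporal_overlap(train, test))
```

In this toy data, ID 2 is shared between splits, the second test row duplicates the features of a training row, and the earliest test timestamp falls inside the training period.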

6. 3-Tier Model Routing

Fast: GPT-4.1-mini (parsing, classification). Standard: GPT-4.1 (code gen, reasoning). Strong: configurable (reflection, refinement).
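
The tier assignment above amounts to a small lookup table. The sketch below is one hypothetical way to express it: the task labels, the lowercase model identifiers, the `my-strong-model` placeholder, and the fallback to the standard tier for unknown tasks are assumptions, not the project's actual configuration.

```python
from enum import Enum


class Tier(Enum):
    FAST = "fast"
    STANDARD = "standard"
    STRONG = "strong"


# Task-to-tier mapping taken from the description above; labels are illustrative.
TASK_TIERS = {
    "parsing": Tier.FAST,
    "classification": Tier.FAST,
    "code_generation": Tier.STANDARD,
    "reasoning": Tier.STANDARD,
    "reflection": Tier.STRONG,
    "refinement": Tier.STRONG,
}


def build_router(strong_model):
    """Return a function mapping a task label to a model name."""
    models = {
        Tier.FAST: "gpt-4.1-mini",   # assumed identifier for GPT-4.1-mini
        Tier.STANDARD: "gpt-4.1",    # assumed identifier for GPT-4.1
        Tier.STRONG: strong_model,   # configurable, per the description
    }

    def route(task):
        # Unknown tasks fall back to the standard tier (an assumption).
        return models[TASK_TIERS.get(task, Tier.STANDARD)]

    return route


route = build_router(strong_model="my-strong-model")  # placeholder name
print(route("parsing"))
print(route("code_generation"))
print(route("reflection"))
```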

Evaluation Results

FieldWorkArena Ablation

Configuration	Factory	Warehouse	Retail
Full System (SSG + EG + F2)	0.72	0.68	0.74
Without Spatial Scene Graph	0.51	0.44	0.55
Without Entropy-Guided	0.65	0.60	0.67
Without Florence-2	0.63	0.58	0.66
VLM Baseline (GPT-4V)	0.48	0.41	0.52

MLE-Bench Results

Category	Valid Submission Rate	Medal Rate	Competitions (n)
Tabular	0.91	0.42	32
NLP	0.78	0.28	18
Vision	0.65	0.15	12
Time Series	0.85	0.35	8
Other	0.72	0.20	5
Overall	0.82	0.32	75
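
As a quick consistency check, the Overall row matches a competition-weighted average of the category rows. The snippet assumes that is how the row was computed, which the source does not state; the numbers are copied from the table.

```python
# (valid submission rate, medal rate, number of competitions) per category
rows = {
    "Tabular": (0.91, 0.42, 32),
    "NLP": (0.78, 0.28, 18),
    "Vision": (0.65, 0.15, 12),
    "Time Series": (0.85, 0.35, 8),
    "Other": (0.72, 0.20, 5),
}

n_total = sum(n for _, _, n in rows.values())
valid_overall = sum(v * n for v, _, n in rows.values()) / n_total
medal_overall = sum(m * n for _, m, n in rows.values()) / n_total

print(n_total, round(valid_overall, 2), round(medal_overall, 2))  # 75 0.82 0.32
```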

Cost and Latency

Domain	Avg. Tokens	Avg. Cost	Avg. Latency
FieldWorkArena	45,200	$0.18	12s
MLE-Bench (no refinement)	92,400	$0.52	180s
MLE-Bench (with refinement)	128,600	$1.85	340s