One primitive
The lease covers every resource access — same TTL, same teardown, same fence on failure.
Architecture
One primitive — the lease — gives you distributed memory, storage, compute, networking, and coordination without containers, heartbeats, or cleanup scripts.
Builder → Lease → Handle → Drop. Memory, block, GPU, CPU, network all share the pattern.
Five outcomes name themselves; a contract names the recovery. No silent degradation.
The foundation
Every resource access in grafOS goes through a lease: a cryptographic binding with a TTL and mandatory teardown. When the lease expires, the resource is deterministically reclaimed. If a program crashes, its leases expire and the resources return to the fabric automatically.
The lease TTL is your distributed garbage collector. No heartbeats, no finalizers, no manual cleanup.
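A minimal sketch of that lifecycle, using the memory-lease API shown in the next section (whether operations surface the Failure type directly, as assumed here, depends on the SDK's error signatures):

// Acquire binds the resource and starts the TTL clock
let lease = MemBuilder::new().min_bytes(4096).acquire()?;
lease.mem().write(0, b"state")?;

// Once the TTL elapses, the memory has already returned to the fabric;
// a later access fails with a typed value instead of hanging
match lease.mem().read(0, 5) {
    Ok(data) => { /* lease still live, data holds the bytes read */ }
    Err(Failure::LeaseExpired) => { /* deterministic: re-acquire or degrade per contract */ }
    Err(other) => { /* some other typed failure: handle per contract */ }
}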
Resource types
Every resource follows the same pattern: Builder → Lease → Handle → Drop.
// Memory
let lease = MemBuilder::new().min_bytes(4096).acquire()?;
lease.mem().write(0, b"hello fabric")?;
let data = lease.mem().read(0, 12)?;
// lease drops → memory returns to fabric

// Block storage
let lease = BlockBuilder::new().min_blocks(1024).acquire()?;
lease.block().write_block(0, &sector_data)?;

// GPU compute
let lease = GpuBuilder::new().min_vram(1 << 30).acquire()?;
lease.gpu().submit(kernel, &args)?;

// CPU (WASM) compute
let lease = CpuBuilder::new().acquire()?;
lease.cpu().submit(wasm_bytes, &input)?;

// Network bandwidth
let lease = NetBuilder::new()
    .min_bandwidth(1_000_000_000).acquire()?;

All five resource types work identically on WASM (real fabric) and native Rust (mock, for testing).
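For example, the same memory snippet can run as an ordinary native test against the mock backend (a sketch, assuming the builders resolve to the in-process mock on native targets and that read returns an owned byte buffer):

#[test]
fn mem_lease_roundtrip_on_mock() {
    // Native Rust: no fabric required, the mock backs the lease
    let lease = MemBuilder::new().min_bytes(4096).acquire().unwrap();
    lease.mem().write(0, b"hello fabric").unwrap();
    let data = lease.mem().read(0, 12).unwrap();
    assert_eq!(&data[..], b"hello fabric");
} // lease drops here → mock memory reclaimed, mirroring fabric behavior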
What you can build
Everything below is built, tested, and available in the SDK today. Each one is backed by leased fabric resources — when the lease expires, the data structure, service, or pipeline disappears cleanly.
use grafos_collections::vec::FabricVec;
use grafos_collections::map::FabricHashMap;
use grafos_collections::queue::FabricQueue;
// Growable array in fabric memory
let mut v: FabricVec = FabricVec::new(mem_lease, 1024)?;
v.push(&42)?;
// Hash map (open addressing, linear probing)
let mut m: FabricHashMap = FabricHashMap::new(mem_lease, 256)?;
m.put(&1, &100)?;
// Ring buffer for producer-consumer (bounded SPSC)
let mut q: FabricQueue = FabricQueue::new(mem_lease, 64, 16)?;
q.push(&42)?;

// Distributed key-value store with per-key TTL
let mut kv = KvBuilder::new()
.hot_buckets(64)
.default_ttl_secs(300)
.build()?;
kv.put(b"session:abc", b"user_data")?;
let data = kv.get(b"session:abc")?;
// Key expires when TTL fires — no eviction logic needed

// Partitioned pub/sub topics
let mut mgr = TopicManager::new();
mgr.create("events", TopicConfig {
partitions: 4, ..Default::default()
})?;
let mut producer = Producer::new("events");
producer.send(&mut mgr, b"order.created")?;
let mut consumer = Consumer::new("events", "order-processor");
consumer.assign(&[0, 1, 2, 3]);
let messages = consumer.poll(&mgr, 100)?;

// RPC over shared fabric memory
// With proc macro:
#[grafos_rpc_service]
pub trait Greeter {
fn greet(&self, name: String) -> String;
fn add(&self, a: f64, b: f64) -> f64;
}
// Client call — shared fabric memory, not TCP
let client = GreeterClient::new(transport);
let greeting = client.greet("fabric".to_string())?;

// DAG task scheduling
let mut graph = TaskGraph::new();
let extract = graph.add_task(TaskDef::new("extract", extract_fn)
.output("raw_data"));
let transform = graph.add_task(TaskDef::new("transform", transform_fn)
.input("raw_data").output("clean_data"));
let load = graph.add_task(TaskDef::new("load", load_fn)
.input("clean_data"));
graph.add_dependency(extract, transform)?;
graph.add_dependency(transform, load)?;
let plan = graph.build()?; // Topological sort → 3 waves
let result = Executor::run(plan)?;

// Streaming dataflow pipeline
let result = Pipeline::from_source(VecSource::new(vec![1, 2, 3, 4, 5]))
.map(|x| x * 2)
.filter(|x| *x > 4)
.collect()
.run()?;
// [6, 8, 10]

// Tensors in fabric memory
let a = FabricTensor::from_data(
vec![1.0, 2.0, 3.0, 4.0], vec![2, 2])?;
let b = FabricTensor::from_data(
vec![5.0, 6.0, 7.0, 8.0], vec![2, 2])?;
let c = a.matmul(&b)?;
let d = c.relu()?;

// Filesystem on leased block storage
let mut fs = FabricFs::format(vec![block_lease])?;
let mut fh = fs.create("/model/weights.bin")?;
fs.write(&mut fh, &weight_data)?;
fs.close(fh)?;
let fh = fs.open("/model/weights.bin", OpenFlags::Read)?;
let mut buf = vec![0u8; weight_data.len()];
fs.read(&fh, &mut buf)?;

// Tiered object store
let mut store = TieredObjectStore::new(
hot_store, cold_store, lru_capacity)?;
let uri: FabricUri = "fabric://default/models/v3".parse()?;
store.put(&uri, &model_bytes, None)?;
// Hot objects in memory, cold on block — automatic tiering

// Mutex — RAII guard, auto-release on drop or crash
let mtx = FabricMutex::new(mem_lease, 0, 0u64)?;
{
let mut guard = mtx.lock(owner_id, timeout_ms)?;
*guard += 1;
} // guard drops → lock released
// Barrier — multi-party phase sync
let barrier = FabricBarrier::new(mem_lease, 0, 3)?;
barrier.arrive_and_wait(party_id, timeout_ms)?;

// Service registry
let (mut writer, reader) = RegistryBuilder::new()
.capacity(64).build()?;
writer.register(
ServiceRegistration::new("inference-server", "2.1.0", node_id)
.with_endpoint(ServiceEndpoint::net([10, 0, 0, 5], 8080))
.with_tag("gpu", "h100")
)?;
let services = reader.lookup("inference-server")?;

// Cross-failure-domain replicated resources
let registry = ReplicatedServiceRegistry::new(
LogicalResourceName::new("inference"),
placement_policy(FailureDomain::Region { ... }),
)?;
// Quorum, freshness, and recovery semantics are explicit
let log: ReplicatedLog = ReplicatedLog::with_quorum(
QuorumPolicy::Majority,
FreshnessPolicy::ReadIndex,
)?;
// Failure drills exercise the contract surface
let drill = FailureDrill::AvailabilityZoneLoss;
graph.simulate(drill)?;

Infrastructure
Poll-driven renewal — no timers, no threads. Compatible with bare-metal main loops.
let mut mgr = RenewalManager::new();
mgr.register(lease_id, expiry_ms, RenewalPolicy::default());
// In main loop:
let summary = mgr.tick(now_ms);

Reject stale writes. Fence resources after failed teardown. Monotonic epoch counters for leader election.
let mut guard = FenceGuard::new(FenceEpoch::zero());
let msg = Fenced::new(42, FenceEpoch::zero());
guard.check(msg.epoch())?; // OK
guard.advance();
guard.check(msg.epoch())?; // Err — stale epoch

Metrics, events, distributed tracing (W3C traceparent), structured logging. no_std core — works on bare metal and Linux.
let m = FabricMetrics::global();
m.leases_total.inc();
m.op_latency.observe(250); // microseconds
#[grafos_instrument]
fn process_request(req: &Request) -> Response { ... }

Capacity ledger, admission gating, quota enforcement, preemption, and attestation verification.
let decision = admission.evaluate(&request, &ledger)?;
match decision {
Admit(node_id) => { /* mint token, create lease */ }
Deny(reason) => { /* queue or reject */ }
Preempt(victim) => { /* revoke lower-priority */ }
}

Epoch-based encryption with key rotation. Keys expire with their lease — fail-closed by default.
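As a sketch only (the EpochKeyring type and its methods below are hypothetical names, not the SDK's actual API), the shape of that surface is roughly:

// Hypothetical API, illustrative of epoch-keyed, lease-scoped encryption
let keyring = EpochKeyring::for_lease(&lease)?; // keys derived for the lease's current epoch
let sealed = keyring.seal(&plaintext)?;         // encrypt under the current epoch key
keyring.rotate()?;                               // rotation retires older epoch keys
// After the lease expires, the keys are gone: open() fails closed
let recovered = keyring.open(&sealed)?;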
Lease lifecycle, preemption, and admission events are sealed into a SHA-256-linked chain. The 32-byte head pointer lives on the daemon's narrow durable surface; an upstream collector validates linkage on ingest.
// Audit-chain assembler — sealed, hash-linked records
let assembler = ChainAssembler::new(prior_anchor, signer);
let record = assembler.seal(AuditEvent {
kind: AuditEventKind::LeasePreempted,
identity: workload.identity(),
reason: Some(PreemptionReason::QuotaPressure),
sequence,
timestamp: now_ms,
})?;
// Every committed rewrite seals one record per affected edge.
// The embedded bytes decode back to a typed EdgeRecord so a
// collector can replay the full mutation history.
let edge_record = assembler.seal(AuditEvent {
kind: AuditEventKind::EdgeRewritten,
event_data: Some(AuditEventData::EdgeRewritten {
edge_id,
edge_record_bytes: edge.to_canonical_bytes(),
}),
identity: workload.identity(),
sequence,
timestamp: now_ms,
})?;
// record.current_event_hash links to record.prev_event_hash
// Persist anchor in the daemon's narrow durable surface
durable.set_audit_anchor(edge_record.current_event_hash);
collector.append(edge_record)?;

A readiness state machine gates lease admission until persistence, replay cache, and revocation surfaces are loaded. GET /api/v1/readiness exposes the state. Operator fence list and fence clear commands surface and clear quarantined resources.
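The gating logic is essentially a small state machine; a hypothetical sketch of the states it moves through (the names below are illustrative, not the daemon's actual types — only the /api/v1/readiness path is from the API):

// Hypothetical state names, illustrative only
enum Readiness {
    LoadingPersistence, // durable surface not yet replayed
    WarmingReplayCache, // replay cache still loading
    LoadingRevocations, // revocation surface not yet available
    Ready,              // lease admission open; GET /api/v1/readiness reports this
}

// Lease admission is rejected until the machine reaches Ready
fn admission_open(state: &Readiness) -> bool {
    matches!(state, Readiness::Ready)
}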
Typed failures
Distributed systems fail. The question is whether your software knows how it failed — or just that something went wrong. In grafOS, every failure is a typed value. No silent degradation, no mystery timeouts.
Every lease operation, every edge probe, every resource access returns one of these:
pub enum Failure {
LeaseExpired, // TTL elapsed — deterministic, expected
Revoked, // Explicitly recalled by authority or preemption
Fenced, // Resource quarantined — teardown failed, no new access
Disconnected, // Node or network unreachable
Degraded, // Reduced functionality, partial service
Incompatible, // Resource shape doesn't match the request
RateLimited, // Throttled by quota, admission, or backpressure
InternalError, // Implementation fault — never surfaces unwrapped
}

These flow through the graph — every edge carries its failure reason, every rewrite plan can trigger on specific outcomes.
pub enum EdgeState {
Active,
Failed(Failure), // The reason is part of the state
}
A DegradationContract maps typed failures to behavioral modes. It is attached to graph nodes and validated at construction time.
pub struct DegradationRule {
when: Vec<Failure>, // Which failures trigger this rule
mode: DegradationMode, // How to operate under this failure
surface_upstream: SurfaceUpstream, // What callers see
emit: EmitAction, // What event to emit
}
pub enum DegradationMode {
Normal, // Full read-write
ReadOnly, // Reads served, writes rejected
StaleReadsOk, // Cached/stale data acceptable
DegradedOk, // Reduced functionality, partial results
FailClosed, // All operations rejected — the safe default
}
A cache might tolerate Disconnected (serve stale reads) but FailClosed on Fenced. The contract makes this explicit and machine-checkable.
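As a sketch, the cache contract just described might be declared like this (DegradationRule and its fields are from the struct above; the rule-list shape and the SurfaceUpstream and EmitAction variants are assumptions):

let cache_rules = vec![
    DegradationRule {
        when: vec![Failure::Disconnected],
        mode: DegradationMode::StaleReadsOk,         // keep serving cached entries
        surface_upstream: SurfaceUpstream::Degraded, // assumed variant name
        emit: EmitAction::Event,                     // assumed variant name
    },
    DegradationRule {
        when: vec![Failure::Fenced],
        mode: DegradationMode::FailClosed,           // quarantined backing store: reject everything
        surface_upstream: SurfaceUpstream::Error,    // assumed variant name
        emit: EmitAction::Event,                     // assumed variant name
    },
];
// Attached to the cache node and validated when the graph is constructed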
When a failure occurs, the runtime evaluates the contract and decides what to do — automatically. Only pre-declared recovery actions are permitted: Failover, Throttle, ReplicaAdd, ReplicaRemove, TierDown, TierUp.
// Failure arrives → contract lookup → policy action
let rule = contract.lookup_rule(&failure);
let action = evaluate_contract(&failure, &contract, &graph);
match action {
PolicyAction::Failover { target_edge, .. } => {
// Swap to standby — generate a RewritePlan
}
PolicyAction::Throttle { edge_id } => {
// Reduce traffic to degraded resource
}
PolicyAction::Fence { edge_id } => {
// Quarantine — no new access until manual intervention
}
PolicyAction::FailClosed { reason } => {
// Reject all operations, surface error upstream
}
}

Traditional systems give you a timeout and a log line. grafOS failures are algebraic — each maps to a different recovery path, declared in advance, validated at construction, enforced at runtime.
Design principles
Holding a lease = holding a resource. Crash = lease expires = automatic cleanup.
Every resource follows the same acquisition pattern. No special cases.
Leases free on drop. No manual free() calls, no cleanup scripts.
No threads, no timers. Progress is driven by tick(). Works on bare metal.
Distributed locks and barriers auto-release on lease expiry. No deadlock.
Every failure is a named value — Expired, Revoked, Fenced, Disconnected, Degraded. Recovery is algebraic, not ad-hoc.
All SDK code runs on WASM (real fabric) and native Rust (mock, for testing).
The entire foundation works without an operating system.