lux/docs/PERFORMANCE_AND_TRADEOFFS.md

# Lux Performance Characteristics and Language Tradeoffs

## Executive Summary

Lux is a tree-walking interpreted language with algebraic effects. This document analyzes its performance characteristics, compares it to other languages, and explains the design tradeoffs made.

**Key Performance Characteristics:**
- **Interpretation overhead:** ~100-1000x slower than native compiled languages
- **Tail call optimization:** Effective, prevents stack overflow
- **Effect handling:** ~10-20% overhead per effect operation
- **Memory:** Reference counting for closures, aggressive cloning for collections

---

## Benchmark Results

### Test System
Benchmarks run via tree-walking interpreter in release mode.

### Results Summary

| Benchmark | Time | Operations | Ops/sec | Notes |
|-----------|------|------------|---------|-------|
| Fibonacci (naive, n=30) | 34,980ms | ~1.3M calls | 37K | Exponential recursion |
| Fibonacci (TCO, n=100K) | 498ms | 100K iterations | 200K | Tail-call optimized |
| List operations (10K) | 461ms | 30K ops | 65K | map+filter+fold |
| Pattern matching (32K nodes) | 964ms | 65K matches | 67K | Tree traversal |
| Closures (100K calls) | 538ms | 100K closures | 186K | Closure creation + calls |
| String ops (1K concat) | 457ms | 1K concats | 2.2K | String building |

### Analysis

**Naive Recursion is Expensive:**
- fib(30) takes 35 seconds due to exponential call overhead
- Each function call involves: environment extension, parameter binding, AST traversal
- Compare: Python ~2s, JavaScript ~0.05s, Rust ~0.001s

**TCO is Effective:**
- fib(100,000) completes in 500ms without stack overflow
- Linear time, constant stack space
- The trampoline approach works well

**Collection Operations Have Cloning Overhead:**
- List.map/filter/fold clone the entire list to extract from Value enum
- Pre-allocation in List.map helps but cloning dominates
- Larger lists will show worse performance

---

## Implementation Details

### Evaluation Strategy: Tree-Walking Interpreter

```
Source Code → Lexer → Tokens → Parser → AST → Interpreter → Value
```

**Pros:**
- Simple to implement and debug
- Direct correspondence between AST and execution
- Easy to add new features

**Cons:**
- No optimization passes
- Repeated AST traversal
- No instruction caching
- ~100-1000x slower than bytecode/native

**Comparison:**

| Language | Strategy | Relative Speed |
|----------|----------|----------------|
| Lux | Tree-walking | 1x (baseline) |
| Python | Bytecode VM | 10-50x faster |
| JavaScript (V8) | JIT compiled | 100-500x faster |
| Haskell (GHC) | Native compiled | 500-2000x faster |
| Rust | Native compiled | 1000-5000x faster |

### Value Representation

```rust
pub enum Value {
    Int(i64),                    // Unboxed, 8 bytes
    Float(f64),                  // Unboxed, 8 bytes
    Bool(bool),                  // Unboxed, 1 byte
    String(String),              // Heap-allocated, ~24 bytes + data
    List(Vec<Value>),            // Heap-allocated, ~24 bytes + n*size(Value)
    Function(Rc<Closure>),       // Reference-counted, 8 bytes pointer
    Constructor { ... },         // Tagged union
    ...
}
```

**Memory Overhead:**
- Each `Value` is ~40-80 bytes due to enum discriminant + largest variant
- Lists are `Vec<Value>`, so each element is a full `Value` enum
- No small-value optimization

**Tradeoffs:**

| Aspect | Lux Approach | Alternative | Tradeoff |
|--------|--------------|-------------|----------|
| Primitives | Unboxed in enum | NaN-boxing | Simpler code, more memory |
| Strings | Owned String | Interned/Rc | Simpler, more copying |
| Lists | Vec<Value> | Rc<Vec<Rc<Value>>> | Simpler, expensive clone |
| Closures | Rc<Closure> | Owned | Cheap sharing, GC needed |

### Closure Capture

```rust
pub struct Closure {
    params: Vec<String>,
    body: Expr,
    env: Env,  // Entire lexical environment
}

pub struct Env {
    bindings: Rc<RefCell<HashMap<String, Value>>>,
    parent: Option<Box<Env>>,
}
```

**Characteristics:**
- Closures capture the entire environment chain (lexical scoping)
- Environment lookup is O(depth) - traverses parent chain
- Variable access clones the value (expensive for large values)

**Comparison:**

| Language | Capture Strategy | Lookup Cost |
|----------|------------------|-------------|
| Lux | Scope chain | O(depth) |
| JavaScript | Scope chain | O(depth), optimized |
| Python | Cell references | O(1) after first access |
| Rust | Move/borrow | O(1), compile-time resolved |

### Effect Handling

```rust
fn handle_effect(&mut self, request: EffectRequest) -> Result<Value, RuntimeError> {
    // Linear search through handler stack (LIFO)
    for handler in self.handler_stack.iter().rev() {
        if handler.effect == request.effect {
            // Clone handler environment and execute
            ...
        }
    }
}
```

**Overhead per Effect Operation:**
1. Create `EffectRequest` struct
2. Linear search through handler stack (typically O(1-5))
3. Clone handler environment
4. Execute handler body
5. Return value

**Comparison with Other Approaches:**

| Approach | Overhead | Flexibility |
|----------|----------|-------------|
| Lux (runtime handlers) | ~10-20% | High - dynamic dispatch |
| Koka (evidence passing) | ~1-5% | High - optimized |
| Haskell mtl (transformers) | ~5-10% | Medium - static |
| Rust (traits) | 0% | Low - compile-time only |

### Tail Call Optimization

```rust
pub enum EvalResult {
    Value(Value),
    Effect(EffectRequest),
    TailCall { func, args, span },  // Trampoline marker
}

// Trampoline loop
loop {
    match result {
        EvalResult::Value(v) => return Ok(v),
        EvalResult::TailCall { func, args, span } => {
            result = self.eval_call(func, args, span)?;
        }
    }
}
```

**Characteristics:**
- Explicit tail position tracking via `tail: bool` parameter
- TailCall variant prevents stack growth
- Only function calls in tail position are optimized
- Arguments are always evaluated eagerly before tail call

**Comparison:**

| Language | TCO Support | Mechanism |
|----------|-------------|-----------|
| Lux | Full | Trampoline |
| Scheme | Full | Required by spec |
| Haskell | Full | Lazy evaluation + STG |
| JavaScript | Safari only | Implementation-dependent |
| Python | None | Explicit recursion limit |
| Rust | Limited | LLVM optimization |

---

## Language Tradeoffs

### 1. Safety vs Performance

**Choice: Safety First**

| Decision | Safety Benefit | Performance Cost |
|----------|----------------|------------------|
| Immutable values | No data races | Clone on every modification |
| Explicit effects | No hidden side effects | Handler lookup overhead |
| Type checking | Catch errors early | Compile-time overhead |
| Exhaustive matching | No missed cases | Runtime pattern matching |

### 2. Simplicity vs Optimization

**Choice: Simplicity First**

| Decision | Simplicity Benefit | Lost Optimization |
|----------|-------------------|-------------------|
| Tree-walking | Easy to implement | No bytecode caching |
| Value enum | Uniform handling | No NaN-boxing |
| Clone semantics | Predictable memory | No move optimization |
| No mutation | No aliasing issues | Can't update in place |

### 3. Expressiveness vs Compilation

**Choice: Expressiveness First**

| Feature | Expressiveness Benefit | Compilation Challenge |
|---------|------------------------|----------------------|
| Algebraic effects | Composable side effects | Hard to optimize |
| First-class handlers | Runtime flexibility | Dynamic dispatch |
| Effect polymorphism (planned) | Generic effect code | Complex inference |
| Refinement types (planned) | Precise specifications | SMT solver needed |

### 4. Comparison Matrix

| Aspect | Lux | Koka | Haskell | Rust | TypeScript |
|--------|-----|------|---------|------|------------|
| **Execution** | Interpreted | Compiled | Compiled | Compiled | JIT |
| **Effects** | Algebraic | Algebraic | Monads | Traits | Promises |
| **Memory** | RC + Clone | RC + Reuse | GC | Ownership | GC |
| **Mutability** | Immutable | Immutable | Immutable | Controlled | Mutable |
| **TCO** | Trampoline | Native | Native | LLVM | No |
| **Typing** | HM Inference | HM + Effects | HM + Extensions | Explicit | Structural |

---

## How to Measure Performance

### Running Benchmarks

```bash
# Run a specific benchmark
nix develop --command cargo run --release -- benchmarks/fibonacci.lux

# Time a benchmark
time nix develop --command cargo run --release -- benchmarks/fibonacci_tco.lux

# Run with effect tracing (slower but shows effect operations)
# In REPL: :trace on
```

### Benchmark Suite

| File | Tests | Expected Time |
|------|-------|---------------|
| `fibonacci.lux` | Function call overhead | ~35s (fib 30) |
| `fibonacci_tco.lux` | Tail call optimization | ~0.5s (fib 100K) |
| `list_operations.lux` | Collection performance | ~0.5s (10K elements) |
| `pattern_matching.lux` | ADT matching | ~1s (32K nodes) |
| `effects.lux` | Effect dispatch | ~0.4s (10K effects) |
| `closures.lux` | Closure performance | ~0.5s (100K closures) |
| `strings.lux` | String operations | ~0.5s (1K concats) |

### Key Metrics to Measure

1. **Function calls per second**: Use recursive fibonacci
2. **Effect operations per second**: Use counter effect benchmark
3. **Pattern matches per second**: Use tree traversal
4. **Closure creations per second**: Use makeAdder benchmark
5. **List operations per second**: Use map/filter/fold chain
6. **Memory usage**: Monitor with system tools (not built-in yet)

### Comparison Benchmarks

To compare with other languages, implement the same algorithms:

**Fibonacci (n=30) comparison:**
```
Lux (interpreted):     ~35,000 ms
Python 3:              ~2,000 ms
Node.js:               ~50 ms
Haskell (ghci):        ~200 ms
Haskell (compiled):    ~5 ms
Rust:                  ~1 ms
```

---

## Performance Improvement Opportunities

### Short-term (Interpreter Improvements)

1. **Bytecode compilation**: Convert AST to bytecode for faster dispatch
2. **Value representation**: Use NaN-boxing for primitives
3. **Environment optimization**: Use flat closure representation
4. **List operations**: Avoid cloning by using Rc<Vec<Rc<Value>>>
5. **String interning**: Deduplicate string values

### Medium-term (New Backend)

1. **WASM compilation**: Target WebAssembly for portable native speed
2. **JavaScript emission**: Leverage V8/SpiderMonkey JIT
3. **LLVM backend**: Generate native code via LLVM IR

### Long-term (Advanced Optimizations)

1. **Effect fusion**: Combine adjacent effect operations
2. **Inlining**: Inline small functions
3. **Specialization**: Generate specialized code for monomorphic calls
4. **Escape analysis**: Stack-allocate non-escaping values

### Estimated Speedup Potential

| Optimization | Expected Speedup | Effort |
|--------------|------------------|--------|
| Bytecode VM | 5-10x | Medium |
| NaN-boxing | 1.5-2x | Low |
| Flat closures | 2-3x | Medium |
| WASM backend | 50-100x | High |
| LLVM backend | 100-500x | Very High |

---

## Conclusion

Lux prioritizes **expressiveness, safety, and simplicity** over raw performance. The current interpreter is suitable for:
- Prototyping and development
- Educational purposes
- Small scripts and tools
- Testing effect-based designs

For production workloads requiring high performance, a compilation backend would be necessary. The language design is amenable to efficient compilation - algebraic effects can be compiled using CPS transformation or evidence passing, and the pure functional core can benefit from standard optimizations.

The key insight is that Lux's performance ceiling is set by implementation choices (interpreter vs compiler), not fundamental language limitations. Languages like Koka demonstrate that algebraic effects can be compiled efficiently.