Learn CLER

CLER is designed around simplicity: Blocks are just structs with methods, Channels are type-safe queues, and the framework compiles away to almost nothing.

Understanding CLER

Real Examples, Real Code

Learning DSP frameworks is best done by seeing working code in action. The examples below come directly from our desktop_examples directory — they're not toy code, but real applications demonstrating CLER's core capabilities.

Platform Support

CLER supports both desktop and embedded platforms. Desktop examples use standard threads, while embedded frameworks include bare-metal, Zephyr, FreeRTOS, and ThreadX implementations.

Core Concepts

Blocks Are Just Structs

Every CLER block is a C++ struct with a procedure() method. Simple inheritance from BlockBase with minimal virtual functions only for interface:

struct GainBlock : public cler::BlockBase {
    float gain;
    
    GainBlock(const char* name, float gain_value) 
        : BlockBase(name), gain(gain_value) {}
    
    cler::Result<cler::Empty, cler::Error> procedure(
        cler::ChannelBase<float>* in,
        cler::ChannelBase<float>* out) {
        
        size_t transferable = std::min(in->size(), out->space());
        for (size_t i = 0; i < transferable; ++i) {
            float value;
            in->pop(value);
            out->push(value * gain);
        }
        return cler::Empty{};
    }
};

Design Rationale: Minimal virtual functions (only for interface), very little runtime type erasure, minimal hidden costs. What you write is mostly what gets compiled.

Multiple Input/Output Blocks

Blocks can have any number of inputs and outputs. Here's an adder that takes 2 inputs:

struct AdderBlock : public cler::BlockBase {
    cler::Channel<float> in0;   // First input channel (owned by block)
    cler::Channel<double> in1;  // Second input (different type!)
    
    AdderBlock(const char* name) : BlockBase(name), in0(512), in1(512) {}
    
    cler::Result<cler::Empty, cler::Error> procedure(
        cler::ChannelBase<float>* out) {
        
        size_t transferable = std::min({in0.size(), in1.size(), out->space()});
        for (size_t i = 0; i < transferable; ++i) {
            float value0;
            double value1;
            in0.pop(value0);
            in1.pop(value1);
            out->push(value0 + static_cast<float>(value1));
        }
        return cler::Empty{};
    }
};

Key Design: Input channels are owned by the block, output channels are passed as parameters. This prevents one-to-many connection issues.

Type-Agnostic Channels

Channels work with any type — not just numbers:

// Custom packet type
struct Packet {
    uint32_t id;
    std::vector<uint8_t> data;
};

// Channel that carries packets
cler::Channel<Packet> packet_channel(64);

// Block that processes packets
struct PacketProcessor : public cler::BlockBase {
    cler::Channel<Packet> in;  // Input channel owned by block
    
    PacketProcessor(const char* name) : BlockBase(name), in(64) {}
    
    template<typename... OChannels>
    cler::Result<cler::Empty, cler::Error> procedure(OChannels*... outs) {
        Packet packet;
        while (in.pop(packet)) {
            // Process packet...
            packet.id += 1000;  // Example processing
            // Push to all outputs using fold expression
            ((outs->push(packet)), ...);
        }
        return cler::Empty{};
    }
};

Flexibility: You're not limited to primitive types. Pass complex data structures, protocol packets, or custom objects between blocks.

Two Execution Models

Different applications need different trade-offs. Flowgraph mode is convenient and handles threading for you. Streamlined mode gives you maximum control for embedded systems or when you need every microsecond.

Flowgraph Mode

Each block runs in its own thread, framework manages everything:

// Desktop version
auto flowgraph = cler::make_desktop_flowgraph(
    cler::BlockRunner(&source, &adder.in[0], &adder.in[1]),  // multiple outputs
    cler::BlockRunner(&adder, &gain.in),                     // single output
    cler::BlockRunner(&gain, &sink.in),
    cler::BlockRunner(&sink)                                  // no outputs (sink)
);

// Configure scheduler and performance options
cler::FlowGraphConfig config;
config.scheduler = cler::SchedulerType::ThreadPerBlock;  // Or FixedThreadPool (recommended)
// config.adaptive_sleep = true;  // Optional: for sparse/intermittent data only
flowgraph.run(config);

// Embedded version (FreeRTOS example)
#include "task_policies/cler_freertos_tpolicy.hpp"
auto flowgraph = cler::FlowGraph<cler::FreeRTOSTaskPolicy,
    decltype(cler::BlockRunner(&source, &adder.in[0], &adder.in[1])),
    decltype(cler::BlockRunner(&adder, &gain.in)),
    decltype(cler::BlockRunner(&gain, &sink.in)),
    decltype(cler::BlockRunner(&sink))
>(
    cler::BlockRunner(&source, &adder.in[0], &adder.in[1]),
    cler::BlockRunner(&adder, &gain.in),
    cler::BlockRunner(&gain, &sink.in),
    cler::BlockRunner(&sink)
);

flowgraph.run(config);  // Same config works on all platforms

BlockRunner Syntax: First argument is the block, remaining arguments are its outputs (passed to procedure() as parameters). Same syntax works across all platforms.

Streamlined Mode

You control the loop, manual scheduling (from streamlined.cpp):

while (true) {
    source.procedure(&adder.in[0], &adder.in[1]);  // multiple outputs
    adder.procedure(&gain.in);                      // single output
    gain.procedure(&sink.in);
    sink.procedure();                                // no outputs
}

Use Cases: When blocks are simple, you avoid thread overhead. Perfect for embedded systems or when you need deterministic timing.

Platform Support & Task Policies

CLER supports multiple platforms through its task policy abstraction. Different platforms require different threading models, but the same block code works everywhere.

Desktop Development

#include "cler.hpp"
#include "task_policies/cler_desktop_tpolicy.hpp"

int main() {
    // Create blocks...
    SourceCWBlock<float> source("Source", 1.0f, 10.0f, 1000);
    GainBlock<float> gain("Gain", 2.0f);
    
    // Desktop flowgraph uses std::thread
    auto flowgraph = cler::make_desktop_flowgraph(
        cler::BlockRunner(&source, &gain.in),
        cler::BlockRunner(&gain, &sink.in),
        cler::BlockRunner(&sink)
    );
    
    // Configure scheduler for optimal performance
    cler::FlowGraphConfig config;
    config.scheduler = cler::SchedulerType::ThreadPerBlock;  // Default scheduler scheduler
    // config.adaptive_sleep = true;  // Optional: reduce CPU when idle
    flowgraph.run(config);
    return 0;
}

Embedded Systems

FreeRTOS

#include "cler.hpp"
#include "task_policies/cler_freertos_tpolicy.hpp"

int main() {
    // Same blocks work on any platform
    SourceCWBlock<float> source("Source", 1.0f, 10.0f, 1000);
    GainBlock<float> gain("Gain", 2.0f);
    
    // FreeRTOS flowgraph uses xTaskCreate
    auto flowgraph = cler::FlowGraph<cler::FreeRTOSTaskPolicy,
        decltype(cler::BlockRunner(&source, &gain.in)),
        decltype(cler::BlockRunner(&gain, &sink.in)),
        decltype(cler::BlockRunner(&sink))
    >(
        cler::BlockRunner(&source, &gain.in),
        cler::BlockRunner(&gain, &sink.in),
        cler::BlockRunner(&sink)
    );
    
    flowgraph.run();
    vTaskStartScheduler();  // FreeRTOS scheduler
    return 0;
}

Bare Metal (No Threading)

#include "cler.hpp"
// No task policy needed for bare metal

int main() {
    // Stack-allocated channels for deterministic memory
    struct SimpleBlock : public cler::BlockBase {
        cler::Channel<float, 64> in;  // Fixed size, no heap allocation
        
        SimpleBlock(const char* name) : BlockBase(name) {}
        
        template<typename... OChannels>
        cler::Result<cler::Empty, cler::Error> procedure(OChannels*... outs) {
            // Minimal processing for embedded constraints
            float sample;
            while (in.pop(sample)) {
                ((outs->push(sample * 2.0f)), ...);
            }
            return cler::Empty{};
        }
    };
    
    SimpleBlock block1("Block1"), block2("Block2");
    
    // Streamlined mode only for bare metal
    while (true) {
        block1.procedure(&block2.in);
        block2.procedure();  // No outputs
        // Hardware-specific timing control
    }
}

Platform Abstraction: The same block code works on desktop (std::thread), FreeRTOS (xTaskCreate), ThreadX (tx_thread_create), Zephyr (k_thread_create), or bare metal (no threading). Only the task policy changes.

Cyclic Graphs - Control System Example

This demonstrates CLER's support for cyclic graphs — the controller's output feeds the plant, and the plant's output feeds back to the controller, creating a closed feedback loop. This isn't just signal processing — it's a complete interactive simulation with real-time GUI controls.

The Plant Block

struct PlantBlock : public cler::BlockBase {
    cler::Channel<float> force_in;
    
    PlantBlock(const char* name) : BlockBase(name), force_in(1024) {
        force_in.push(0.0f); // CRITICAL: Initial force for cyclic graph
    }
    
    cler::Result<cler::Empty, cler::Error> procedure(
        cler::ChannelBase<float>* measured_position_out) {
        
        float force;
        force_in.pop(force);
        
        // Mass-spring-damper physics simulation
        float acceleration = (force - K * _x - C * _v) / M;
        _v += acceleration * DT;
        _x += _v * DT + 0.5f * acceleration * DT * DT;
        
        measured_position_out->push(_x);
        return cler::Empty{};
    }
    
private:
    float _x = 0.0, _v = 0.0;  // Position and velocity state
};

Critical Detail: The plant block maintains internal state and connects to the controller in a feedback loop. The force_in.push(0.0f) is crucial — cyclic graphs need initial conditions or they'll deadlock on startup.

Interactive PID Controller with Thread-Safe GUI

struct ControllerBlock : public cler::BlockBase {
    cler::Channel<float> measured_position_in;
    
    // Thread-safe parameters modified by GUI
    std::atomic<float> _kp{2.0f}, _ki{1.0f}, _kd{1.0f};
    std::atomic<float> _target{10.0f};
    std::atomic<bool>  _feed_forward{false};
    
    cler::Result<cler::Empty, cler::Error> procedure(
        cler::ChannelBase<float>* force_out) {
        
        float measured_position;
        measured_position_in.pop(measured_position);
        
        // Read parameters atomically (modified by GUI thread)
        float target = _target.load(std::memory_order_relaxed);
        float kp = _kp.load(std::memory_order_relaxed);
        
        float error = target - measured_position;
        float derivative = (error - _error_prev) / DT;
        _integral += error * DT;
        
        float force = kp * error + _ki.load() * _integral + _kd.load() * derivative;
        
        // Optional feed-forward control
        if (_feed_forward.load()) {
            force += K * target;  // Compensation for spring constant
        }
        
        force_out->push(force);
        _error_prev = error;
        return cler::Empty{};
    }
    
private:
    float _error_prev = 0.0, _integral = 0.0;
};

Thread Safety: The GUI thread modifies PID parameters while the control loop runs at 100Hz. Atomic variables ensure thread-safe updates without locks, maintaining real-time performance. The controller remains responsive even while parameters change.

The Cyclic Flowgraph

auto flowgraph = cler::make_desktop_flowgraph(
    cler::BlockRunner(&controller, &throttle.in),
    cler::BlockRunner(&throttle, &plant.force_in),          // Controller → Plant
    cler::BlockRunner(&plant, &fanout.in),
    cler::BlockRunner(&fanout, &plot.in[0], &controller.measured_position_in)  // Plant → Controller (feedback)
);

BlockRunner Syntax: First argument is the block to run, remaining arguments are the outputs (channels passed to the block's procedure() method).

Feedback Loop: Notice how the data flows Controller → Plant → Fanout → back to Controller, creating a complete feedback loop. The initial force value in the plant prevents deadlock at startup.

Performance Considerations

Choosing between flowgraph and streamlined modes depends on the computational weight of your blocks versus the overhead of orchestration.

When Streamlined Mode Wins

If your blocks do minimal work per sample, the thread synchronization overhead in flowgraph mode can dominate. Consider this anti-pattern:

// Anti-pattern: Simple gain block in flowgraph mode
struct GainBlock : public cler::BlockBase {
    cler::Channel<float> in;  // Input channel owned by block
    float gain;
    
    GainBlock(const char* name, float gain_value) 
        : BlockBase(name), in(512), gain(gain_value) {}
    
    template<typename... OChannels>
    cler::Result<cler::Empty, cler::Error> procedure(OChannels*... outs) {
        // Just multiplying by a constant - very little work!
        float value;
        while (in.pop(value)) {
            // Push to all outputs using fold expression
            ((outs->push(value * gain)), ...);
        }
        return cler::Empty{};
    }
};

The Problem: Each sample requires thread synchronization (pushing/popping from channels). The overhead of synchronization far exceeds the actual computation (one multiplication).

Better Approaches

Option 1: Combine lightweight operations into heavier blocks

// Better: Combine multiple operations
struct ProcessingBlock : public cler::BlockBase {
    cler::Channel<float> in;  // Input channel owned by block
    float gain;
    float offset;
    
    ProcessingBlock(const char* name, float gain_value, float offset_value)
        : BlockBase(name), in(512), gain(gain_value), offset(offset_value) {}
    
    template<typename... OChannels>
    cler::Result<cler::Empty, cler::Error> procedure(OChannels*... outs) {
        float buffer[256];
        size_t count = in.readN(buffer, 256);
        
        // Process in batches - amortizes synchronization cost
        for (size_t i = 0; i < count; ++i) {
            buffer[i] = buffer[i] * gain + offset;
            // Could add filtering, decimation, etc. here
        }
        
        // Write to all outputs using fold expression
        ((outs->writeN(buffer, count)), ...);
        return cler::Empty{};
    }
};

Option 2: Use streamlined mode for simple chains

// Streamlined mode - no thread overhead
while (running) {
    source.procedure(&gain.in);
    gain.procedure(&offset.in);
    offset.procedure(&sink.in);
    sink.procedure();
}

Stack-Allocated Channels

For embedded systems or when you know the exact buffer size needed:

struct FilterBlock : public cler::BlockBase {
    // Stack-allocated channel - no heap allocation!
    cler::Channel<float, 1024> in;  // Template parameter sets size
    
    FilterBlock(const char* name) : BlockBase(name) {}
    
    template<typename... OChannels>
    cler::Result<cler::Empty, cler::Error> procedure(OChannels*... outs) {
        // Process data...
        // Push to all outputs using fold expression when ready
        return cler::Empty{};
    }
};

Benefits: No dynamic allocation, better cache locality, deterministic memory usage — critical for real-time embedded systems.

Scheduler Types & Performance Optimization

CLER provides two highly optimized schedulers designed for different workload characteristics. Choosing the right scheduler can significantly improve throughput and reduce CPU usage.

ThreadPerBlock (Default)

cler::FlowGraphConfig config;
config.scheduler = cler::SchedulerType::ThreadPerBlock;  // Default scheduler
// config.adaptive_sleep = true;  // Only for sparse/intermittent data
flowgraph.run(config);

Best for: Default choice, debugging, simple flowgraphs, guaranteed per-block isolation

Characteristics: One dedicated thread per block (default, simple, debuggable)

Trade-offs: Simple and predictable, but thread overhead grows with block count. Excellent starting point for most applications.

FixedThreadPool (Cache-Optimized)

cler::FlowGraphConfig config;
config.scheduler = cler::SchedulerType::FixedThreadPool;  // Recommended
config.num_workers = 4;  // Number of worker threads (minimum 2, adjust for your CPU)
// config.adaptive_sleep = true;  // Optional: only for sparse data
// config.collect_detailed_stats = false;  // Ultra-high performance mode
flowgraph.run(config);

Best for: Constrained systems, embedded applications, memory-limited environments

Characteristics: Cache-optimized worker threads with platform-aware memory layout

Benefits: Better for constrained systems, automatic cache line detection for x86/ARM/RISC-V, lower thread overhead than ThreadPerBlock.

Adaptive Sleep Configuration

⚠️ Warning: Adaptive sleep reduces CPU usage but can significantly impact throughput. Only use for sparse/intermittent data sources like sensors or network packets with gaps.

// Adaptive sleep works with both scheduler types - USE CAREFULLY
// Can significantly reduce throughput, only use for sparse/intermittent data
config.adaptive_sleep = true;  // CAUTION: reduces CPU usage but may impact throughput
config.adaptive_sleep_multiplier = 1.5;       // How aggressively to increase sleep
config.adaptive_sleep_max_us = 5000.0;        // Maximum sleep time (microseconds)
config.adaptive_sleep_fail_threshold = 10;    // Start sleeping after N failures

// For very sparse data (sensors, network packets)
config.adaptive_sleep_multiplier = 2.0;       // Aggressive backoff
config.adaptive_sleep_max_us = 10000.0;       // Up to 10ms sleep
config.adaptive_sleep_fail_threshold = 5;     // Sleep after 5 failures

Use Cases: Sensor data, network packets, user input processing. Can achieve >90% CPU reduction during idle periods while maintaining low latency when data arrives.

Buffer Access Patterns

CLER provides three main ways to access buffer data. Here's what we recommend:

Simple Recommendation

read_dbf/write_dbf

Best Performance

Try this first - fastest and cleanest for most cases

readN/writeN

Good Performance

Fallback if DBF not available - still very fast

peek/commit

Specialized

Only if you need fine control over buffer management

push/pop

AVOID

Extremely slow - use only for control/config data

1. read_dbf/write_dbf (TRY THIS FIRST)

// Fastest and cleanest - zero-copy direct buffer access
auto [read_ptr, read_size] = in->read_dbf();
auto [write_ptr, write_size] = out->write_dbf();
size_t to_process = std::min(read_size, write_size);

for (size_t i = 0; i < to_process; ++i) {
    write_ptr[i] = read_ptr[i] * gain;  // Direct processing
}

in->commit_read(to_process);
out->commit_write(to_process);

✅ RECOMMENDED - Best performance, clean code.

2. readN/writeN (SOLID FALLBACK)

// Simple and reliable - use when DBF not available
float buffer[256];
size_t count = in->readN(buffer, 256);

for (size_t i = 0; i < count; ++i) {
    buffer[i] *= gain;
}

out->writeN(buffer, count);

✅ GOOD FALLBACK - Still very fast, works everywhere, simple API

3. peek/commit (COMPLEX - NOT RECOMMENDED)

const float *ptr1, *ptr2;
size_t size1, size2;
size_t available = in.peek_read(ptr1, size1, ptr2, size2);

// Process directly from input buffer
for (size_t i = 0; i < size1; ++i) buffer[i] = ptr1[i] * gain;
for (size_t i = 0; i < size2; ++i) buffer[size1+i] = ptr2[i] * gain;

in.commit_read(size1 + size2);
out->writeN(buffer, size1 + size2);

⚠️ Complex implementation - Only ~5% faster than readN/writeN, not worth the complexity

4. push/pop (AVOID)

float sample;
in->pop(sample);
out->push(sample * gain);

⚠️ Orders of magnitude slower - Function call overhead per sample. Use only for control data

Quick Summary

Try DBF first - It's usually the fastest and cleanest option
Use readN/writeN as fallback - Simple and still very fast
DBF requirements - Buffer must be heap-allocated and page-aligned (≥4KB)
Peek/commit is rarely needed - Only for specialized buffer management
Never use push/pop for DSP data - It's extremely slow
Run benchmarks: cd build/performance && ./perf_read_write_techniques

Real-World Applications

CLER shines in embedded systems where clean architecture meets resource constraints. Here are practical applications showing when flowgraphs excel and when manual approaches work better.

Microcontroller Applications

IoT Sensor Hub (STM32)

A environmental monitoring node collecting data from multiple sensors and forwarding to a gateway. RAM requirement: 64-128KB for comfortable operation.

// Manual callback hell approach - hard to maintain
struct SensorManager {
    void (*temp_callback)(float);
    void (*humidity_callback)(float);  
    void (*pressure_callback)(float);
    void (*network_callback)(SensorPacket);
    
    void process_sensors() {
        float temp = read_temperature();
        if (temp_callback) temp_callback(temp);
        
        float humidity = read_humidity();
        if (humidity_callback) humidity_callback(humidity);
        
        // ... multiply by 6 sensors, nested callbacks, error handling
    }
};

// CLER flowgraph approach - clean and testable
struct TemperatureSensor : public cler::BlockBase {
    cler::Channel out;
    
    cler::Result procedure() {
        if (out.space() > 0) {
            float temp = read_temperature_sensor();
            out.push(temp);
        }
        return cler::Empty{};
    }
};

struct SensorAggregator : public cler::BlockBase {
    cler::Channel temp_in, humidity_in, pressure_in;
    
    template
    cler::Result procedure(OChannels*... network_out) {
        if (temp_in.size() && humidity_in.size() && pressure_in.size()) {
            SensorPacket packet;
            temp_in.pop(packet.temperature);
            humidity_in.pop(packet.humidity);
            pressure_in.pop(packet.pressure);
            packet.timestamp = get_system_time();
            
            ((network_out->push(packet)), ...);
        }
        return cler::Empty{};
    }
};

// Clean, testable flowgraph  
auto flowgraph = cler::FlowGraph(
    cler::BlockRunner(&temp_sensor, &aggregator.temp_in),
    cler::BlockRunner(&humidity_sensor, &aggregator.humidity_in),
    cler::BlockRunner(&pressure_sensor, &aggregator.pressure_in),
    cler::BlockRunner(&aggregator, &network_transmitter.in),
    cler::BlockRunner(&network_transmitter)
);

Why Flowgraphs Win: Complex data flows with multiple sensors become manageable. Each sensor can be tested independently, and adding new sensors doesn't require rewiring callbacks.

Real-Time Audio Processing

struct AudioFilterBlock : public cler::BlockBase {
    cler::Channel in;  // 44.1kHz audio samples
    BiquadFilter filter;             // State preserved between calls
    
    template
    cler::Result procedure(OChannels*... outs) {
        int16_t samples[64];
        size_t count = in.readN(samples, 64);
        
        // Process audio block with preserved filter state
        for (size_t i = 0; i < count; ++i) {
            samples[i] = filter.process(samples[i]);
        }
        
        ((outs->writeN(samples, count)), ...);
        return cler::Empty{};
    }
};

// Real-time audio chain: ADC → Filter → Compressor → DAC
auto audio_flowgraph = cler::FlowGraph(
    cler::BlockRunner(&adc_source, &filter.in),
    cler::BlockRunner(&filter, &compressor.in),
    cler::BlockRunner(&compressor, &dac_sink.in),
    cler::BlockRunner(&dac_sink)
);

Real-Time Constraints: Audio processing needs deterministic timing. CLER's static allocation and predictable execution makes this possible on microcontrollers.

Motor Control System

struct EncoderBlock : public cler::BlockBase {
    cler::Channel position_out;
    volatile uint32_t* encoder_register;
    
    cler::Result procedure() {
        if (position_out.space() > 0) {
            uint32_t raw_count = *encoder_register; 
            float position = (raw_count * 2.0f * M_PI) / 4096.0f; // Convert to radians
            position_out.push(position);
        }
        return cler::Empty{};
    }
};

struct PIDController : public cler::BlockBase {
    cler::Channel setpoint_in, feedback_in;
    std::atomic kp{1.0f}, ki{0.1f}, kd{0.01f};
    float integral = 0.0f, previous_error = 0.0f;
    
    template  
    cler::Result procedure(OChannels*... control_out) {
        float setpoint, feedback;
        if (setpoint_in.pop(setpoint) && feedback_in.pop(feedback)) {
            float error = setpoint - feedback;
            integral += error * DT;
            float derivative = (error - previous_error) / DT;
            
            float output = kp.load() * error + ki.load() * integral + kd.load() * derivative;
            previous_error = error;
            
            ((control_out->push(output)), ...);
        }
        return cler::Empty{};
    }
};

// Control loop: Setpoint → PID Controller → Motor Driver
//                      ↑     Encoder ←  Motor ←

Control Systems: Flowgraphs naturally express feedback loops. Thread-safe atomic parameters allow runtime tuning without stopping the control loop.

FPGA Hybrid Designs

CLER excels in FPGA systems where hardware acceleration combines with software flexibility.

SDR with FPGA Preprocessing

// FPGA handles high-rate operations, ARM processes protocols
struct FPGAInterface : public cler::BlockBase {
    cler::Channel, 1024> iq_out;
    volatile uint32_t* fpga_fifo;           // Memory-mapped FPGA interface
    volatile uint32_t* fpga_control;
    
    cler::Result procedure() {
        // Check FPGA FIFO status
        if ((*fpga_control & FIFO_NOT_EMPTY) && (iq_out.space() >= 256)) {
            std::complex samples[256];
            
            // Burst read from FPGA (preprocessed samples)
            for (int i = 0; i < 256; ++i) {
                uint32_t iq_data = *fpga_fifo;
                float i_val = (int16_t)(iq_data >> 16) / 32768.0f;
                float q_val = (int16_t)(iq_data & 0xFFFF) / 32768.0f;
                samples[i] = std::complex(i_val, q_val);
            }
            
            iq_out.writeN(samples, 256);
        }
        return cler::Empty{};
    }
};

struct ProtocolDecoder : public cler::BlockBase {
    cler::Channel, 512> iq_in;
    DemodulatorState demod_state;           // Complex software processing
    
    template
    cler::Result procedure(OChannels*... packet_out) {
        std::complex samples[128];
        size_t count = iq_in.readN(samples, 128);
        
        if (count > 0) {
            // Complex demodulation that benefits from ARM cores
            for (size_t i = 0; i < count; ++i) {
                if (process_sample(&demod_state, samples[i])) {
                    // Packet detected and decoded
                    ((packet_out->push(demod_state.decoded_packet)), ...);
                }
            }
        }
        return cler::Empty{};
    }
};

// Hybrid architecture: FPGA → ARM Processing → Network
auto sdr_flowgraph = cler::FlowGraph(
    cler::BlockRunner(&fpga_interface, &channelizer.in),      // High-rate data from FPGA
    cler::BlockRunner(&channelizer, &decoder1.in, &decoder2.in), // Split to protocols
    cler::BlockRunner(&decoder1, &network1.in),               // Process on ARM cores
    cler::BlockRunner(&decoder2, &network2.in),
    cler::BlockRunner(&network1),
    cler::BlockRunner(&network2)
);

Hybrid Benefits: FPGA handles computationally intensive tasks (filtering, channelization) while ARM cores handle complex protocol processing. CLER coordinates the data flow between hardware and software domains.

Real-Time Image Processing

struct ImageAcquisition : public cler::BlockBase {
    cler::Channel frame_out; // Small buffer for real-time
    CameraInterface* camera;
    
    cler::Result procedure() {
        if (frame_out.space() > 0 && camera->frame_ready()) {
            ImageFrame frame;
            camera->capture_frame(&frame);
            frame_out.push(std::move(frame));
        }
        return cler::Empty{};
    }
};

struct FPGAEdgeDetection : public cler::BlockBase {
    cler::Channel frame_in;
    volatile uint32_t* fpga_processor;      // Hardware accelerated
    
    template
    cler::Result procedure(OChannels*... processed_out) {
        ImageFrame frame;
        if (frame_in.pop(frame)) {
            // DMA transfer to FPGA, hardware processes edges
            setup_fpga_dma(frame.data, frame.width, frame.height);
            start_fpga_processing();
            
            // Wait for completion (or use interrupt-driven approach)
            while (!fpga_processing_complete()) { /* yield or sleep */ }
            
            // Processed frame ready
            ((processed_out->push(std::move(frame))), ...);
        }
        return cler::Empty{};
    }
};

// Computer vision pipeline: Camera → FPGA Edge Detection → ARM Analysis → Display
auto vision_flowgraph = cler::FlowGraph(
    cler::BlockRunner(&camera, &fpga_edge.frame_in),
    cler::BlockRunner(&fpga_edge, &arm_analysis.in),  
    cler::BlockRunner(&arm_analysis, &display.in),
    cler::BlockRunner(&display)
);

Real-Time Vision: Hardware accelerates compute-intensive operations while software handles higher-level analysis. Natural pipeline structure matches hardware capabilities.

When to Use Flowgraphs vs Manual Wiring

✅ Flowgraphs Excel When:

Complex Data Flows: Multiple inputs, outputs, or feedback loops benefit from flowgraph structure.

// Multi-sensor fusion with feedback - naturally expressed as flowgraph
auto sensor_fusion = cler::FlowGraph(
    cler::BlockRunner(&gps, &kalman_filter.gps_in),
    cler::BlockRunner(&imu, &kalman_filter.imu_in),
    cler::BlockRunner(&mag, &kalman_filter.mag_in),
    cler::BlockRunner(&kalman_filter, &navigation.in, &control_feedback.in),
    cler::BlockRunner(&navigation, &display.in),
    cler::BlockRunner(&control_feedback, &actuator.in),
    cler::BlockRunner(&actuator)
);

Rapid Prototyping: Easy to add/remove blocks, change connections, test different configurations.

Testing & Validation: Each block can be unit tested independently. Mock blocks easily replace hardware interfaces.

// Easy testing - replace hardware blocks with test blocks
#ifdef UNIT_TEST
    MockGPSBlock mock_gps("GPS", test_positions);
    auto test_flowgraph = cler::FlowGraph(
        cler::BlockRunner(&mock_gps, &kalman_filter.gps_in),  // Mock instead of real GPS
        // ... rest of flowgraph unchanged
    );
#endif

Parallel Processing: Natural parallelism when blocks can run independently on multiple cores.

❌ Manual Wiring Better When:

Simple Linear Chains: A → B → C with minimal processing overhead benefits from direct calls.

// Simple processing chain - manual is more efficient
while (running) {
    sensor.read_raw(&raw_data);           // Direct hardware access
    filter.process(raw_data, &filtered);  // Inline processing  
    transmit.send(filtered);              // Direct transmission
    delay_ms(10);                         // Simple timing
}

Tight Timing Constraints: Hard real-time systems may need manual control for predictability.

Minimal Resource Usage: Systems with <32KB RAM may benefit from eliminating flowgraph overhead.

// Ultra-constrained system (8KB RAM)
void main_loop() {
    uint16_t adc_value = read_adc();
    uint16_t filtered = simple_filter(adc_value);  // Stack-based, no allocation
    if (filtered > threshold) {
        trigger_output();
    }
}  // Total overhead: ~100 bytes

Single-Threaded Embedded: Baremetal systems without RTOS may prefer direct control loops.

STM32 Practical Guidelines

Memory Requirements

Minimum for flowgraphs: 64KB RAM (simple chains, 2-4 blocks)
Comfortable operation: 128KB+ RAM (complex flowgraphs, 6+ blocks)
Per-block overhead: ~2-4KB (threads, channels, state)
Manual alternative: <32KB systems, use streamlined mode
Optimization: Stack-allocated channels reduce heap pressure

Architecture Decision Framework:
• <3 blocks + simple chain → Manual wiring
• 3-6 blocks + moderate complexity → Consider both approaches
• 6+ blocks OR complex data flow → Flowgraphs excel
• Need testing/modularity → Flowgraphs
• <64KB RAM OR hard real-time → Manual wiring

Architecture Deep Dive

CLER uses C++17 templates to eliminate all runtime overhead while maintaining type safety and flexibility.

Compile-Time Magic

template<typename Block, typename... Channels>
struct BlockRunner {
    Block* block;
    std::tuple<Channels*...> outputs;
    
    // Template deduction guide - compiler figures out types automatically
    template<typename... InputChannels>
    BlockRunner(Block* blk, InputChannels*... outs)
        : block(blk), outputs(static_cast<Channels*>(outs)...) {}
};

Zero Overhead: Channel types are deduced at compile time. Minimal runtime type checking, very few virtual function calls, minimal dynamic allocations. The flowgraph code compiles close to what you'd write by hand — often with template optimizations.

AI-Friendly Core (<1000 LOC)

CLER's core is intentionally small and focused:

<1000 lines: Perfect context window for AI assistance
Clear abstractions: BlockBase, Channel, FlowGraph, Result types
Minimal dependencies: Only standard library, easy to understand
Template-heavy: AI tools excel at understanding and extending template code

This means when you get stuck with complex error messages (CLER's one weakness), any LLM can quickly understand your code and suggest fixes.

Lock-Free Channel Implementation

Based on Dmitry Drogalis's SPSC queue, optimized for DSP workloads:

Cache-aligned: Prevents false sharing between producer/consumer
Memory ordering: Carefully tuned acquire/release semantics for x86/ARM
Configurable sizing: Stack allocation for embedded, heap for desktop
Multiple access patterns: From simple push/pop to zero-copy peek/commit

Embedded-First Design

CLER's design choices address real embedded constraints:

Adaptive performance: Uses advanced techniques like doubly-mapped buffers when available, falls back gracefully on constrained systems
Predictable timing: Manual memory management, no unexpected allocations
Small memory footprint: Perfect for microcontrollers
Type safety: Catch connection errors at compile time, not runtime