# Solution Design: msghandler
**Version**: 1.3.0
**Date**: 2026-05-22
**Status**: Active
**Ground Truth**: [`src/msghandler.jl`](../src/msghandler.jl)
**ASG Framework Alignment**: v8 pillars - Requirements → Solution Design → Specification → Walkthrough → Implementation Plan → Validation → Runbook
---
## 1. Problem Decomposition
msghandler addresses the challenge of cross-platform data exchange between **Julia**, **JavaScript**, **Python**, **Dart**, **Rust**, and **MicroPython** applications using message brokers as transport layers.
### User Problems
| Problem | Description | User Impact | Requirement ID |
|---------|-------------|-------------|----------------|
| **P-001**: Cross-platform data serialization | Different languages have incompatible data types and serialization formats | Developers must write platform-specific conversion code | FR-001, FR-002 |
| **P-002**: Large payload handling | Message brokers have size limits, but large files need to be transferred | Large files either fail or require complex workarounds | FR-003 |
| **P-003**: Transport abstraction | Each platform has different message broker libraries and APIs | No unified interface across platforms | FR-013, FR-014 |
| **P-004**: Request-response patterns | Bi-directional communication requires complex correlation tracking | Developers must implement custom message routing | FR-011 |
| **P-005**: File server reliability | File server may be temporarily unavailable during downloads | Failed downloads without retry mechanism | FR-010 |
| **P-006**: Payload type preservation | Different platforms have different type systems | Data corruption or misinterpretation on receiving end | FR-006, FR-007 |
### Solution Boundaries
**In Scope**:
- Unified API for `smartpack()` and `smartunpack()` across all platforms
- Automatic transport selection based on payload size
- File server integration using Claim-Check pattern
- Multi-payload support with mixed types in single message
- Exponential backoff for reliable file downloads
- Correlation ID propagation for message tracing
**Out of Scope**:
- Message compression (adds complexity without clear benefit)
- Message encryption (application-layer concern)
- Advanced message routing (simple topic matching sufficient)
- Persistent message queues (transport pattern sufficient)
### Decision IDs
| Decision ID | Decision | Description | Requirement IDs | NFR IDs |
|-------------|----------|-------------|-----------------|---------|
| SD-001 | Claim-Check Pattern | Large payloads uploaded to HTTP server, small payloads sent directly | FR-003, FR-004 | NFR-104, NFR-105 |
| SD-002 | Automatic Transport Selection | <0.5MB = direct, ≥0.5MB = link based on size threshold | FR-003, FR-004 | NFR-104, NFR-105 |
| SD-003 | Handler Function Abstraction | Pluggable file server implementations via handler functions | FR-008, FR-009 | NFR-202 |
| SD-004 | Unified Tuple Format | Same `(dataname, data, type)` format across all platforms | FR-006, FR-007 | - |
| SD-005 | Base64 Encoding | JSON-compatible binary data transport | FR-012 | - |
| SD-006 | Transport Abstraction | Support multiple broker protocols (NATS/MQTT/WebSocket) transparently | FR-013, FR-014 | NFR-201 |
| SD-007 | Exponential Backoff | Retry failed file downloads with exponential backoff | FR-010 | NFR-202 |
| SD-008 | Correlation ID Propagation | Propagate correlation IDs through all message processing steps | FR-011 | NFR-401, NFR-403 |
---
## 2. Solution Approach
msghandler implements a **Claim-Check pattern** with intelligent transport selection:
```
Sender (smartpack)               Transport Layer             Receiver (smartunpack)
┌─────────────────┐             ┌───────────────┐            ┌───────────────────┐
│                 │             │               │            │                   │
│ 1. Data tuples  │────────────>│               │───────────>│ 1. Parse envelope │
│    [(name,      │    JSON     │   Message     │    JSON    │ 2. Check transport│
│     data, type)]│   format    │   Broker      │   format   │ 3. Fetch/Decode   │
│                 │             │ (NATS/MQTT/   │            │ 4. Return tuples  │
└─────────────────┘             │  WebSocket)   │            │                   │
                                │               │            └───────────────────┘
                                └───────────────┘
```
### Key Design Decisions
| Decision ID | Decision | Rationale | Alternatives Rejected |
|-------------|----------|-----------|----------------------|
| **SD-001** | Claim-Check Pattern | Large payloads (≥0.5MB) uploaded to HTTP server, small payloads sent directly via transport | Client-side compression - adds complexity; Server-side compression - not universally supported |
| **SD-002** | Automatic Transport Selection | <0.5MB = direct (fast), ≥0.5MB = link (avoid transport limits) | Manual selection - error-prone; Fixed threshold - not adaptive |
| **SD-003** | Handler Function Abstraction | Allows pluggable file server implementations (Plik, AWS S3, custom) | Hardcoded Plik - not flexible; Interface-based - too complex for this use case |
| **SD-004** | Unified Tuple Format | Same input/output format across all platforms | Platform-native formats - no interoperability; Protocol buffers - too heavy |
| **SD-005** | Base64 Encoding | JSON-compatible binary data transport | Raw bytes - not JSON-compatible; Hex encoding - 2x size overhead |
| **SD-006** | Transport Abstraction | Support multiple broker protocols (NATS/MQTT/WebSocket) transparently | Platform-specific libraries - no interoperability |
| **SD-007** | Exponential Backoff | Retry failed file downloads with exponential backoff | Simple retry - too aggressive; No retry - poor reliability |
| **SD-008** | Correlation ID Propagation | Propagate correlation IDs through all message processing steps | Manual correlation - error-prone; No tracing - debug impossible |
### Architecture Components
```mermaid
flowchart TB
subgraph Client["Client Application"]
direction TB
APP["Application Code"]
API["msghandler API"]
APP -->|Data tuples| API
API -->|JSON envelope| TRANSPORT
end
subgraph Transport["Transport Layer"]
direction TB
BROKER["Message Broker<br/>NATS/MQTT/WebSocket"]
TOPICS["Topic Subscription"]
API -->|Publish| BROKER
BROKER -->|Deliver| TOPICS
TOPICS -->|Subscribe| API
end
subgraph FileServer["File Server"]
direction TB
UPLOAD["Upload Handler"]
DOWNLOAD["Download Handler"]
API -.->|Upload URL| UPLOAD
DOWNLOAD -.->|Fetch URL| API
end
style Client fill:#e1f5fe,stroke:#0288d1,stroke-width:2px
style Transport fill:#ffe0b2,stroke:#f57c00,stroke-width:2px
style FileServer fill:#c8e6c9,stroke:#43a047,stroke-width:2px
```
---
## 3. Alternatives Considered
| Alternative | Pros | Cons | Decision |
|-------------|------|------|----------|
| **gRPC/Protobuf** | Strong typing, efficient binary format | No native MicroPython support; Complex schema management | Rejected - not cross-platform enough |
| **MessagePack** | Compact binary, good performance | Browser support limited; No standard for tabular data | Rejected - offers no equivalent to Arrow IPC for tabular data |
| **Protocol Buffers** | Type-safe, efficient | No native support for tabular data exchange | Rejected - cannot represent DataFrames natively |
| **REST HTTP Upload** | Simple, universal | High latency; No real-time capability | Rejected - not suitable for message broker pattern |
| **Hybrid (direct/link)** | Optimal for both small and large payloads | More complex implementation | Accepted - matches user requirements (FR-003, FR-004) |
| **Single transport type** | Simpler implementation | Cannot handle large payloads efficiently | Rejected - violates FR-003 requirement |
| **Platform-specific APIs** | Native performance | No interoperability; Maintenance burden | Rejected - violates cross-platform goal |
---
## 4. High-Level Component Diagram
```mermaid
flowchart TD
subgraph msghandler["msghandler Core Module"]
direction TB
subgraph Serialization["Serialization Layer"]
DIR["Direct Transport"]
LNK["Link Transport"]
DIR -->|Base64| JSON_MSG
LNK -->|HTTP URL| JSON_MSG
end
subgraph Envelope["Envelope Builder"]
HDR["Message Header"]
PAY["Payload Manager"]
HDR --> PAY
end
subgraph Handlers["Handler Functions"]
UPD["Upload Handler"]
DWN["Download Handler"]
UPD --> LNK
DWN --> LNK
end
API["smartpack() / smartunpack()"]
API -->|Input| Serialization
API -->|Output| Serialization
API -->|Configure| Handlers
end
subgraph Transport["Transport Layer"]
BROKER["NATS / MQTT / WebSocket"]
API -->|JSON| BROKER
BROKER -->|JSON| API
end
subgraph FileServer["File Server"]
Plik["HTTP Server"]
UPD -.->|POST| Plik
Plik -.->|URL| DWN
end
style msghandler fill:#b3e5fc,stroke:#0288d1,stroke-width:2px
style Transport fill:#ffe0b2,stroke:#f57c00,stroke-width:2px
style FileServer fill:#c8e6c9,stroke:#43a047,stroke-width:2px
```
### Component Responsibilities
| Component | Responsibilities | Decision IDs | Requirements Addressed |
|-----------|-----------------|--------------|----------------------|
| **Serialization Layer** | Convert data types to transport format (Base64/URL) | SD-005 | FR-001, FR-002, FR-012 |
| **Envelope Builder** | Create standardized message envelope with metadata | SD-001, SD-008 | FR-011, FR-013, FR-014 |
| **Handler Functions** | Abstract file server operations for pluggability | SD-003, SD-007 | FR-008, FR-009, FR-010 |
| **Transport Adapter** | Support multiple broker protocols transparently | SD-006 | FR-013, FR-014 |
| **Payload Manager** | Track payload types, sizes, and encoding | SD-004 | FR-006, FR-007 |
---
## 5. Decision Rationale
### SD-001: Why Claim-Check Pattern?
**Requirement**: FR-003 (Large file handling), FR-004 (Direct transport for small payloads)
**NFRs**: NFR-104 (File upload latency <1s), NFR-105 (File download latency <1s)
**Rationale**:
- Message brokers (NATS, MQTT) enforce message size limits (NATS, for example, defaults to a 1MB maximum payload)
- Direct transport is faster for small payloads (no file server round-trip)
- Link transport avoids transport limits for large payloads
- User doesn't need to manually choose - automatic selection based on threshold
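
The selection logic above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the code in `src/msghandler.jl`: the entry field names (`transport`, `data`), the `mem://` URLs, and the in-memory `upload`/`download` helpers are hypothetical, and the 0.5MB threshold is read as 512 KiB.

```python
import base64

THRESHOLD_BYTES = 512 * 1024  # assumption: 0.5MB interpreted as 512 KiB

# Hypothetical in-memory stand-in for the HTTP file server.
_STORE = {}

def upload(name, blob):
    """Store the blob and return a URL-like claim ticket."""
    url = f"mem://files/{len(_STORE)}/{name}"
    _STORE[url] = blob
    return url

def download(url):
    """Resolve a claim ticket back to the stored bytes."""
    return _STORE[url]

def pack_payload(name, blob, ptype):
    """Build one payload entry, choosing direct or link transport by size."""
    if len(blob) < THRESHOLD_BYTES:
        data = base64.b64encode(blob).decode("ascii")
        return {"dataname": name, "type": ptype, "transport": "direct", "data": data}
    # Claim check: only the URL travels through the broker.
    return {"dataname": name, "type": ptype, "transport": "link", "data": upload(name, blob)}

def unpack_payload(entry):
    """Recover the (dataname, data, type) tuple from a payload entry."""
    if entry["transport"] == "direct":
        blob = base64.b64decode(entry["data"])
    else:
        blob = download(entry["data"])
    return (entry["dataname"], blob, entry["type"])
```

In this sketch the receiver gets the same tuple back regardless of which path the sender took, which is the property the pattern is meant to provide.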
### SD-003: Why Handler Functions for File Server?
**Requirement**: FR-008 (Plik integration), FR-009 (Custom file server support)
**NFR**: NFR-202 (File server availability <5% failure rate)
**Rationale**:
- Plik is a common open-source file server
- Some users need AWS S3 or custom implementation
- Handler functions provide clean abstraction without vendor lock-in
- Same signature across all platforms (unified API)
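
One way to picture the handler abstraction is as a pair of plain callables that the caller passes in. The sketch below is illustrative only: the handler signatures, `make_memory_server`, `send_large`, and the `plik://` and `s3://` prefixes are assumptions for the example, and the real signatures belong to the specification.

```python
from typing import Callable, Dict, Tuple

# Assumed handler signatures for illustration.
UploadHandler = Callable[[str, bytes], str]   # (filename, data) -> url
DownloadHandler = Callable[[str], bytes]      # (url) -> data

def make_memory_server(prefix: str) -> Tuple[UploadHandler, DownloadHandler]:
    """Return an (upload, download) pair backed by a dict, standing in for a real backend."""
    store: Dict[str, bytes] = {}

    def upload(filename: str, data: bytes) -> str:
        url = f"{prefix}/{len(store)}/{filename}"
        store[url] = data
        return url

    def download(url: str) -> bytes:
        return store[url]

    return upload, download

def send_large(filename: str, data: bytes, upload_handler: UploadHandler) -> str:
    """Caller code depends only on the handler signature, not on the backend."""
    return upload_handler(filename, data)
```

Swapping the backend then means passing a different handler pair; `send_large` itself does not change.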
### SD-004: Why Tuple Format for Payloads?
**Requirement**: FR-006 (Multi-payload messages), FR-007 (Payload type preservation)
**Rationale**:
- `(dataname, data, type)` tuple is language-agnostic
- Simple to understand: name, content, type
- Supports mixed payload types in single message
- Easy to serialize/deserialize across platforms
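
As a rough illustration of the tuple format, the following Python sketch validates a mixed-type message before packing. The type names come from the traceability table in this document; `validate_payloads` and its error behavior are hypothetical and not part of the msghandler API.

```python
# Payload types listed in this document's traceability table.
KNOWN_TYPES = {"text", "dictionary", "arrowtable", "jsontable",
               "image", "audio", "video", "binary"}

def validate_payloads(payloads):
    """Check that every item is a (dataname, data, type) triple with a known type."""
    for item in payloads:
        if len(item) != 3:
            raise ValueError(f"expected 3 fields, got {len(item)}")
        name, _data, ptype = item
        if not isinstance(name, str):
            raise TypeError("dataname must be a string")
        if ptype not in KNOWN_TYPES:
            raise ValueError(f"unknown payload type: {ptype}")
    return payloads

# A single message mixing two payload types.
message = [
    ("greeting", "Hello", "text"),
    ("settings", {"dark_mode": True}, "dictionary"),
]
validate_payloads(message)
```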
### SD-005: Why Base64 Encoding?
**Requirement**: FR-012 (Message serialization), FR-001 (Cross-platform text messaging)
**Rationale**:
- JSON is universal - works on all platforms
- Base64 converts binary to ASCII for JSON compatibility
- Standard format with native support in all languages
- No additional dependencies needed
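
The round trip and its size cost can be checked with the Python standard library alone. The snippet below is a small demonstration rather than msghandler code; the `data` key is an assumed field name.

```python
import base64
import json

raw = bytes(range(256)) * 3                     # 768 bytes of arbitrary binary data
encoded = base64.b64encode(raw).decode("ascii")

# The encoded string survives a JSON round trip unchanged.
wire = json.dumps({"data": encoded})
decoded = base64.b64decode(json.loads(wire)["data"])

# Base64 emits 4 characters per 3 input bytes, so the ratio is 4/3.
overhead = len(encoded) / len(raw)
```

Here 768 bytes become 1024 characters, a factor of about 1.33, compared with the factor of 2 noted for hex encoding among the rejected alternatives.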
### SD-002: Why Automatic Transport Selection?
**Requirement**: FR-003 (Large file handling), FR-004 (Direct transport for small payloads)
**NFRs**: NFR-104 (File upload latency <1s), NFR-105 (File download latency <1s)
**Rationale**:
- <0.5MB payloads use direct transport (<1s latency, FR-004 KPI)
- ≥0.5MB payloads use link transport to avoid transport limits (FR-003 KPI: 99% successful uploads)
- User doesn't need to manually choose - automatic selection based on threshold
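
The threshold rule and its interaction with Base64 expansion can be worked through in a few lines. This is a sketch under assumptions: the threshold is taken to apply to the raw payload size and to equal 512 KiB, neither of which this document states explicitly, and both function names are hypothetical.

```python
THRESHOLD_BYTES = 512 * 1024   # assumption: 0.5MB of raw (unencoded) payload

def select_transport(size_bytes: int, threshold: int = THRESHOLD_BYTES) -> str:
    """Return "direct" below the threshold and "link" at or above it."""
    return "direct" if size_bytes < threshold else "link"

def base64_size(size_bytes: int) -> int:
    """Length of the padded Base64 encoding of size_bytes bytes."""
    return 4 * ((size_bytes + 2) // 3)

# Largest payload that still goes direct, and its encoded size on the wire.
largest_direct = THRESHOLD_BYTES - 1
encoded_len = base64_size(largest_direct)   # 699052 characters, about 683 KiB
```

Under these assumptions the largest direct payload encodes to roughly 683 KiB, which leaves headroom for envelope metadata below a 1 MiB broker limit.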
### SD-006: Why Transport Abstraction?
**Requirement**: FR-013 (Transport publishing), FR-014 (Transport subscription)
**NFR**: NFR-201 (Message delivery at-least-once)
**Rationale**:
- Support multiple broker protocols (NATS, MQTT, WebSocket) transparently
- Caller handles actual transport publishing/subscription
- Unified API across all platforms
- At-least-once delivery semantics via transport layer
### SD-007: Why Exponential Backoff?
**Requirement**: FR-010 (Exponential backoff retry)
**NFR**: NFR-202 (File server availability <5% failure rate)
**Rationale**:
- File server may be temporarily unavailable
- Exponential backoff prevents overwhelming server during outages
- Default: 5 retries, 100ms base delay, 5000ms max delay
- 95% successful downloads within retry limit (FR-010 KPI)
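
The default parameters above imply a concrete delay schedule. The sketch below assumes a doubling factor and no jitter, neither of which is specified here; `backoff_delays` and `download_with_retry` are hypothetical names, and `OSError` is an assumed failure type for a network fetch.

```python
import time

def backoff_delays(retries=5, base_ms=100, max_ms=5000, factor=2):
    """Delay before each retry: base * factor**attempt, capped at max_ms."""
    return [min(base_ms * factor ** attempt, max_ms) for attempt in range(retries)]

def download_with_retry(fetch, url, sleep=time.sleep, **kwargs):
    """Call fetch(url); on failure wait and retry until the delays run out."""
    last_error = None
    # One initial attempt without delay, then one attempt per backoff delay.
    for delay_ms in [0] + backoff_delays(**kwargs):
        if delay_ms:
            sleep(delay_ms / 1000)
        try:
            return fetch(url)
        except OSError as err:   # assumed failure type
            last_error = err
    raise last_error
```

With the defaults, `backoff_delays()` yields 100, 200, 400, 800, and 1600 ms, so the 5000 ms cap only takes effect when more retries are configured.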
### SD-008: Why Correlation ID Propagation?
**Requirement**: FR-011 (Correlation ID propagation)
**NFRs**: NFR-401 (Required logs), NFR-403 (Tracing)
**Rationale**:
- Trace messages across distributed systems
- Correlation ID logged with every message (NFR-401)
- Propagated through all message processing steps (NFR-403)
- Enables debugging and performance analysis in production
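
A minimal sketch of propagation in a request-response exchange might look like the following. The `correlation_id` and `payloads` field names, the UUID format, and all three helper functions are assumptions for illustration; `msg_id` is taken from the required-logs entry in the NFR traceability table.

```python
import uuid

def new_envelope(payloads, correlation_id=None):
    """Create an envelope; reuse the caller's correlation ID when one is given."""
    return {
        "correlation_id": correlation_id or str(uuid.uuid4()),
        "msg_id": str(uuid.uuid4()),
        "payloads": payloads,
    }

def reply_to(request, payloads):
    """Build a response that carries the request's correlation ID forward."""
    return new_envelope(payloads, correlation_id=request["correlation_id"])

def log_line(envelope, event):
    """Format a log record that always includes the correlation ID."""
    return (f"correlation_id={envelope['correlation_id']} "
            f"msg_id={envelope['msg_id']} event={event}")

request = new_envelope([("question", "ping", "text")])
response = reply_to(request, [("answer", "pong", "text")])
```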
---
## 6. Risk Assessment
| Risk | Impact | Probability | Mitigation | Requirement IDs | NFR IDs |
|------|--------|-------------|------------|-----------------|---------|
| **Performance degradation with ≥0.5MB payloads** | High | Medium | Size threshold detection; Link transport fallback | FR-003, FR-004 | NFR-104, NFR-105 |
| **File server availability issues** | Medium | Low | Exponential backoff retry; Graceful degradation | FR-010 | NFR-202 |
| **Platform-specific bugs** | Medium | Low | Comprehensive test suite per platform; CI validation | FR-001, FR-002, FR-006, FR-007 | - |
| **Encoding mismatches between platforms** | High | Low | Strict specification; Test contracts; Validation rules | FR-012 | NFR-301 |
| **Transport layer incompatibility** | Medium | Low | Transport-agnostic design; Handler abstraction | FR-013, FR-014 | NFR-201 |
| **Correlation ID loss in processing** | Medium | Low | Centralized trace context management | FR-011 | NFR-401, NFR-403 |
---
## 7. Requirements Traceability
| Solution Component | Decision ID | Requirement ID | Description |
|-------------------|-------------|----------------|-------------|
| **smartpack() function** | SD-001, SD-002, SD-004, SD-005, SD-006, SD-008 | FR-001, FR-002, FR-003, FR-004, FR-005, FR-006, FR-007, FR-008, FR-009, FR-010, FR-011, FR-012, FR-013, FR-014 | Unified API for sending messages across all platforms |
| **smartunpack() function** | SD-001, SD-002, SD-004, SD-005, SD-006, SD-007, SD-008 | FR-001, FR-002, FR-003, FR-004, FR-005, FR-006, FR-007, FR-008, FR-009, FR-010, FR-011, FR-012, FR-013, FR-014 | Unified API for receiving messages across all platforms |
| **Direct transport** | SD-002 | FR-004 | Send payloads < threshold directly via transport |
| **Link transport** | SD-001, SD-002 | FR-003 | Upload payloads ≥ threshold to file server |
| **File server handler** | SD-003, SD-007 | FR-008, FR-009, FR-010 | Pluggable upload/download handlers with retry logic |
| **Payload type preservation** | SD-004 | FR-006, FR-007 | Support text, dictionary, arrowtable, jsontable, image, audio, video, binary |
| **Correlation ID propagation** | SD-008 | FR-011 | Message tracing across distributed systems |
| **Multi-payload support** | SD-004 | FR-006, FR-007 | List of (dataname, data, type) tuples |
### Non-Functional Requirements Traceability
| Solution Component | Decision ID | NFR ID | Description |
|-------------------|-------------|--------|-------------|
| **Serialization optimization** | SD-005 | NFR-101, NFR-102 | <50ms overhead for 10KB payloads |
| **Transport efficiency** | SD-006 | NFR-103 | <100ms connection establishment |
| **File server latency** | SD-001, SD-002 | NFR-104, NFR-105 | <1s upload/download for 0.5MB files |
| **Concurrent connections** | SD-006 | NFR-106 | Support 100+ simultaneous connections |
| **Message throughput** | SD-005, SD-006 | NFR-107 | Handle 1000+ messages/second per instance |
| **At-least-once delivery** | SD-006 | NFR-201 | Transport layer semantics |
| **Graceful degradation** | SD-003, SD-007 | NFR-202 | File server unavailability handling |
| **Auto-reconnect** | SD-006 | NFR-203 | Transport connection failure recovery |
| **Payload integrity** | SD-005 | NFR-301 | 100% SHA-256 checksum validation |
| **Transport security** | SD-006 | NFR-302 | 100% TLS connections in production |
| **File server security** | SD-003 | NFR-303 | 100% authenticated file uploads |
| **Required logs** | SD-001, SD-008 | NFR-401 | Correlation ID, msg_id, timestamp, etc. |
| **Critical metrics** | SD-001, SD-005 | NFR-402 | messages_sent_total, file upload/download duration |
| **Tracing** | SD-001, SD-008 | NFR-403 | Correlation ID propagation |
| **Alerting** | SD-007 | NFR-404 | <5min alert latency for `download_retry_exceeded` |
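
For the payload integrity row (NFR-301), a checksum check could be sketched as below. The `sha256` field name and both helper functions are hypothetical; this document does not say where the digest is stored or whether it covers the raw or encoded bytes, so the sketch assumes raw bytes.

```python
import hashlib

def attach_checksum(entry: dict, blob: bytes) -> dict:
    """Return a copy of the payload entry with a hex SHA-256 digest of the raw bytes."""
    return {**entry, "sha256": hashlib.sha256(blob).hexdigest()}

def verify_checksum(entry: dict, blob: bytes) -> bool:
    """True when the received bytes hash to the digest recorded by the sender."""
    return hashlib.sha256(blob).hexdigest() == entry["sha256"]
```
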
---
## 8. Gap-Check Validation
| Stage Transition | Gap-Check Question | Status |
|------------------|-------------------|--------|
| **Requirements → Solution Design** | Does the Solution Design clearly explain how the system solves the user problem, not just what it does? | ✅ Verified - All user problems mapped to solution components with requirement ID and decision ID references |
| **Solution Design → Specification** | Does the Specification define all technical details that the solution approach requires? | ⏳ Pending - Specification needs review for completeness |
| **Solution Design → Walkthrough** | Does the Walkthrough reflect the complete flow including error states and timing? | ⏳ Pending - Walkthrough needs validation against design |
### Solution Design Validation
**User Problems** (from requirements.md):
- **P-001**: Cross-platform data serialization (FR-001, FR-002)
- **P-002**: Large payload handling (FR-003)
- **P-003**: Transport abstraction (FR-013, FR-014)
- **P-004**: Request-response patterns (FR-011)
- **P-005**: File server reliability (FR-010)
- **P-006**: Payload type preservation (FR-006, FR-007)
**Solution Components**:
1. **SD-001** - Claim-Check pattern - `smartpack()` / `smartunpack()` upload large payloads to the file server and send small payloads directly
2. **SD-002** - Automatic transport selection - Direct or link transport chosen by size threshold
3. **SD-003** - Handler function abstraction - Plik/AWS S3/custom file server support
4. **SD-004** - Tuple format - `(dataname, data, type)` - platform-agnostic
5. **SD-005** - Base64 encoding - JSON-compatible binary data transport
6. **SD-006** - Transport abstraction - Support multiple broker protocols transparently
7. **SD-007** - Exponential backoff - Reliable file downloads with retry logic
8. **SD-008** - Correlation ID propagation - Message tracing across distributed systems
**Requirement Mapping**:
- **Functional Requirements**: FR-001 through FR-014 ✅
- **Non-Functional Requirements**: NFR-101 through NFR-405 ✅
**Gap Check**: Does this solution explain *how* users will actually use the system?
**Answer**: Yes - the walkthrough provides concrete examples:
1. JavaScript sends `[("msg", "Hello", "text"), ("avatar", binary_data, "image")]`
2. `smartpack()` automatically selects transport based on size (SD-002)
3. Large file (≥0.5MB) → link transport → file server upload (SD-001)
4. Small payload (<0.5MB) → direct transport → base64 encoding (SD-005)
5. Receiver calls `smartunpack()` → receives same tuple format with preserved types
**NFR Traceability**:
- **Performance**: NFR-101 (serialization <50ms), NFR-102 (deserialization <50ms), NFR-103 (connection <100ms) ✅
- **Reliability**: NFR-201 (at-least-once delivery), NFR-202 (file server <5% failure), NFR-203 (auto-reconnect <30s) ✅
- **Security**: NFR-301 (SHA-256 checksum), NFR-302 (TLS 100%), NFR-303 (authenticated uploads) ✅
- **Observability**: NFR-401 (required logs), NFR-402 (metrics), NFR-403 (tracing), NFR-404 (alerting <5min) ✅
---
*This solution design document is versioned and maintained in git alongside the codebase. All implementations must adhere to this design.*
**Traceability Summary**:
- All requirements traced to solution components with SD-XXX decision IDs
- Each decision ID references the corresponding requirement IDs (FR-XXX, NFR-XXX)
- Specification must cite SD-XXX references for each technical detail