2026-03-14 00:53:03 +00:00
1 changed files with 108 additions and 199 deletions
--- a/docs/SDD_FRAMEWORK.md
+++ b/docs/SDD_FRAMEWORK.md
@@ -1,279 +1,188 @@
-# SDD + GitOps Documentation Stack
+# SDD + GitOps Documentation Framework
-A comprehensive documentation strategy for modern software development that aligns different types of documentation with their specific purposes, audiences, and tooling.
+## Overview
-## The Big Picture
+The **SDD + GitOps Documentation Framework** is a comprehensive, structured approach to software development documentation that aligns technical work with business outcomes through clear separation of concerns.
-This framework ensures that every piece of documentation serves a clear purpose and reaches the right audience. It emphasizes:
+This framework ensures that every piece of documentation serves a specific purpose, reaches the right audience, and can be measured for effectiveness. It's designed to prevent common pitfalls like feature creep, communication gaps, and operational fragility.
 - **Machine-readable truths** as the foundation for automation
 - **Separation of concerns** between human-facing docs and machine-consumable contracts
 - **GitOps integration** where deployment and configuration are version-controlled
 - **Multi-role audience targeting** from stakeholders to DevOps
 ---
-## Documentation Matrix
+## The Documentation Matrix
-| Document | Purpose ("The Why") | Primary Audience | Format / Tooling | Example (SaaS Context) |
+| Document | Purpose & Rationale (The "Why") | Audience | Format / Content | Measurement (KPI/SLO) | Example (SaaS Context) |
-|----------|---------------------|------------------|------------------|------------------------|
+|----------|----------------------------------|----------|------------------|----------------------|------------------------|
-| **Requirements** | Define business goals & user needs | Stakeholders, PM, Lead Dev | GitHub Issues, Notion | "System must support 5-member teams with real-time sync." |
+| **Requirements** | The Business North Star. Defines exactly what problem the user has and what success looks like. It prevents "feature creep" by setting hard boundaries on what we will NOT build. | Founder, Team, PM | Format: Shared Wiki (Notion/GitHub Wiki). Content: User stories, business constraints, competitive context, and success metrics. | **KPI**: Business Outcomes. Measured by User Retention, Conversion Rates, and Monthly Recurring Revenue (MRR). | "The system must process high-volume math so clients see reports instantly. Goal: 15% increase in daily active users." |
-| **The Spec** | The Contract. Machine-readable truth. | Developers, QA, Machines | OpenAPI, Protobuf, YAML | A `.yaml` file defining `user_id` as a UUID in snake_case. |
+| **Spec** | The Technical Contract. A machine-readable, strictly typed definition of all data interfaces. It is the "Single Source of Truth" that prevents bugs caused by communication gaps between services. | Developers, QA, Automation | Format: OpenAPI/YAML or Protobuf. Content: API endpoints, snake_case key naming, data validation rules, and error response codes. | **SLA/SLO**: System Performance. Measured by API Uptime (99.9%), Response Latency (<100ms), and Error Rates. | A `contract.yaml` defining exactly how Julia sends Arrow data to Node.js. It forces `user_id` to be a UUID. |
-| **Architecture** | High-level structural blueprint | Senior Devs, DevOps | Mermaid.js, IcePanel | Diagram of SvelteKit ↔ NATS ↔ Julia 6-node cluster. |
+| **Architecture** | The Structural Blueprint. A visual map of how the components (services, DBs, networks) fit together. It shows how the data flows through the 6-node cluster and where bottlenecks live. | Senior Devs, DevOps | Format: Diagrams-as-code (Mermaid.js). Content: System Context diagrams, Database ERDs, Network Security Policies, and Infrastructure maps. | **Efficiency Metrics**: Resource utilization. Measured by CPU Load (<70%), RAM per pod, and internal network throughput. | A diagram showing the data path: Caddy (Proxy) → Node.js (API) → NATS (Queue) → Julia (Math Engine). |
-| **Walkthrough** | The Intuition. The "Big Picture" narrative. | New Devs, The Team | Recorded Video, TOUR.md | "Why we use a Claim-Check pattern for large Arrow data." |
+| **Walkthrough** | The Intuition & Logic. A narrative guide that explains the "steps" and "rationale" behind end-to-end flows. It's about building a mental model so devs understand why the sequence matters. | The Team, New Hires | Format: TOUR.md file or Loom Video. Content: Step-by-step traces of core features, explanation of architectural trade-offs, and "The Big Picture" flow. | **Quality**: Developer Velocity. Measured by "Time-to-First-Commit" for new hires and reduction in conceptual bugs. | "End-to-End Trace": 1. UI sends JSON. 2. API wraps it in Claim-Check. 3. Julia pulls it. Rationale: To avoid NATS memory spikes. |
-| **Implementation** | The actual logic & generated code | Developers | SvelteKit, Julia, Node.js | Auto-generated TypeScript types from the OpenAPI spec. |
+| **Implementation** | The Functional Reality. The actual code that does the work. In SDD, the "boring" parts (types/routes) are auto-generated from the Spec to ensure the code never lies. | Developers, Reviewers | Format: Git Repository. Content: Business logic, internal helper functions, Unit Tests, and a README.md for local environment setup. | **Code Health**: Internal Quality. Measured by Test Coverage (90%+), Linting compliance, and Cyclomatic Complexity. | The SvelteKit frontend components and the specific Julia math-processing functions. |
-| **Validation** | Automated "Contract" enforcement | CI/CD Pipelines, QA | GitHub Actions, Prism | A test that fails if the Julia API returns camelCase keys. |
+| **Validation** | The Enforcement Layer. Automated gates that prove the Implementation matches the Spec. It prevents human error (like changing a key name) from reaching production. | CI/CD Pipeline, QA | Format: GitHub Actions / Tests. Content: Contract tests (Dredd/Prism), Integration tests, and Security scans that run on every pull request. | **Compliance**: Safety Metrics. Measured by Build Success Rate and 0 "Contract Violations" in the production logs. | A CI job that blocks a Pull Request because a developer used camelCase in a database field instead of snake_case. |
-| **Runbook** | Deployment, Scaling, & Recovery | DevOps, SRE | K8s Manifests, Flux | `git push` to update the replica count from 3 to 6. |
+| **Maintenance** | The Health & Evolution. Defines how to upgrade dependencies, manage technical debt, and rotate secrets. It's the guide for "future-proofing" the software over time. | The Team, DevOps | Format: MAINTENANCE.md. Content: Dependency update schedules, Secret rotation steps, DB Migration logs, and Tech Debt "Graveyard" tracking. | **Sustainability**: System Longevity. Measured by "Package Age", "Security Vulnerabilities Found", and "Migration Success Rate". | "Steps to upgrade the Julia version across all 6 nodes without downtime using a Blue-Green deployment strategy." |
 | **Runbook** | The Operational Life-Support. The instructions for when the system is alive (or dying). In GitOps, this is the "Desired State" of the infrastructure. | DevOps, SRE, On-call Devs | Format: K8s Manifests (Flux/Argo). Content: Deployment steps, Scaling triggers, Backup/Restore procedures, and "3:00 AM" troubleshooting guides. | **Reliability**: Operational Health. Measured by MTTR (Mean Time to Recovery) and Error-Free Deployments. | A Flux manifest that ensures 6 replicas of the Julia service are always healthy and restarts them if they hit 80% RAM. |
 ---
-## Detailed Explanations
+## Detailed Document Descriptions
 ### 1. Requirements
-**Purpose**: Define business goals & user needs.
+**Purpose**: Establish the Business North Star.
-**Why it matters**: Before writing code, we need to understand *why* we're building something. Requirements capture the business context, user pain points, and success criteria.
+**Why It Matters**: Without clear requirements, teams drift into "feature creep", building things that don't solve the actual problem. This document anchors the project in business outcomes.
-**Primary Audience**:
+**Key Elements**:
- **Stakeholders**: Business owners who need to approve the direction
+- **User Stories**: What the user needs to accomplish
- **Product Managers**: Translate requirements into features
+- **Business Constraints**: Budget, timeline, regulatory requirements
- **Lead Developers**: Understand scope and technical constraints
+- **Competitive Context**: What competitors do and how you differentiate
-
+- **Success Metrics**: Quantifiable goals that define "done"
 **Format / Tooling**:
 - **GitHub Issues**: Simple, version-controlled, integrated with code
 - **Notion**: Rich text, collaborative, good for initial brainstorming
 **Best Practices**:
- Write in user story format: "As a [role], I want [feature] so that [benefit]"
+- Keep it in a shared wiki (Notion, GitHub Wiki) for collaborative editing
- Include acceptance criteria as checklist items
+- Focus on outcomes, not solutions
- Link to related specs and architecture decisions
+- Explicitly state what you will NOT build
 **Example**: "System must support 5-member teams with real-time sync."
 ---
-### 2. The Spec (The Contract)
+### 2. Spec (Specification)
-**Purpose**: Machine-readable truth that defines the API contract.
+**Purpose**: Create a machine-readable technical contract.
-**Why it matters**: The spec is the single source of truth for how systems communicate. It enables code generation, automated testing, and ensures consistency across services.
+**Why It Matters**: Communication gaps between services cause bugs. A strict, typed spec prevents them by serving as the Single Source of Truth for every interface.
-**Primary Audience**:
+**Key Elements**:
- **Developers**: Implement the API according to the spec
+- **API Endpoints**: All routes with HTTP methods
- **QA Engineers**: Create test cases based on the spec
+- **Data Types**: Strict typing with validation rules
- **Machines**: Used for code generation, validation, and documentation
+- **Error Codes**: Comprehensive error response definitions
-
+- **Naming Conventions**: snake_case keys, consistent patterns
 **Format / Tooling**:
 - **OpenAPI (Swagger)**: REST API specifications
 - **Protobuf**: gRPC service definitions
 - **YAML/JSON**: Configuration and data schema definitions
 **Best Practices**:
- Use snake_case for consistency
+- Use OpenAPI (YAML/JSON) for REST APIs or Protobuf for gRPC
- Define all fields with types and constraints
+- Automate generation of client/server code from the spec
- Include examples for complex data structures
+- Run contract tests against the spec in CI/CD
 - Keep specs versioned alongside code
 **Example**: A `.yaml` file defining `user_id` as a UUID in snake_case.
 ---
 ### 3. Architecture
-**Purpose**: High-level structural blueprint showing how components interact.
+**Purpose**: Visualize the system structure and data flow.
-**Why it matters**: Architecture diagrams help everyone understand the system's structure without drowning in implementation details. They're crucial for onboarding, design reviews, and long-term maintainability.
+**Why It Matters**: Complex systems (like your 6-node cluster) need clear maps. Without them, teams can't identify bottlenecks or make informed decisions.
-**Primary Audience**:
+**Key Elements**:
- **Senior Developers**: Design decisions and component responsibilities
+- **System Context Diagram**: Shows the system and its external dependencies
- **DevOps**: Understand deployment topology and service dependencies
+- **Database ERD**: Entity-Relationship diagrams for data model
- **Technical Leads**: Evaluate trade-offs and scalability concerns
+- **Network Security Policies**: Firewall rules, service mesh configs
-
+- **Infrastructure Maps**: Cloud resources, scaling groups
 **Format / Tooling**:
 - **Mermaid.js**: Code-based diagrams that are version-controlled
 - **IcePanel**: Interactive, automated architecture visualization
 - **C4 Model**: Standardized approach to architectural diagrams
 **Best Practices**:
- Focus on *relationships* between components, not implementation details
+- Use Mermaid.js for diagrams-as-code (versionable, diffable)
 - Include technology choices (e.g., NATS vs WebSocket)
 - Show data flow direction with arrows
 - Update diagrams when architecture changes
-
+- Focus on data flow and decision points
 **Example**: Diagram of SvelteKit ↔ NATS ↔ Julia 6-node cluster.
 ---
 ### 4. Walkthrough
-**Purpose**: The intuition and "Big Picture" narrative.
+**Purpose**: Build a mental model through narrative.
-**Why it matters**: Code alone doesn't explain *why* decisions were made. Walkthroughs provide context, historical decisions, and architectural intuition that helps new developers become productive quickly.
+**Why It Matters**: Code doesn't explain *why*. Walkthroughs capture the reasoning behind architectural trade-offs, making onboarding faster and reducing conceptual bugs.
-**Primary Audience**:
+**Key Elements**:
- **New Developers**: Understand the system's philosophy and patterns
+- **Step-by-step traces**: End-to-end flow of user actions
- **The Team**: Share context and reasoning behind design choices
+- **Trade-off explanations**: Why you chose option A over B
- **Code Reviewers**: Evaluate design decisions alongside implementation
+- **The Big Picture**: How components fit together conceptually
 **Format / Tooling**:
 - **Recorded Video**: Personal, engaging, good for complex explanations
 - **TOUR.md**: Markdown file with narrative walk-through of the codebase
 - **Architecture Decision Records (ADRs)**: Formal documentation of key decisions
 **Best Practices**:
- Explain *why* more than *how*
+- Write in a TOUR.md file or record Loom videos
- Include anti-patterns to avoid
+- Focus on intuition, not just mechanics
- Link to related documentation
+- Include "Rationale" sections for each major decision
 - Keep walkthroughs updated with architecture changes
 **Example**: "Why we use a Claim-Check pattern for large Arrow data."
 ---
 ### 5. Implementation
-**Purpose**: The actual logic and generated code.
+**Purpose**: The functional reality, meaning the actual code.
-**Why it matters**: This is the executable truth of the system. Well-structured implementation code should be clear, maintainable, and follow established patterns.
+**Why It Matters**: This is what runs in production. In SDD (spec-driven development), the boring parts such as types and routes are generated from the Spec, so developers can focus on business logic.
-**Primary Audience**:
+**Key Elements**:
- **Developers**: Read, modify, and extend the code
+- **Business Logic**: The unique value you provide
- **Reviewers**: Verify correctness and adherence to standards
+- **Unit Tests**: Covering edge cases and error paths
- **CI/CD**: Run tests and builds
+- **README.md**: Local environment setup instructions
 **Format / Tooling**:
 - **SvelteKit**: Frontend framework with server-side rendering
 - **Julia**: High-performance numerical computing
 - **Node.js**: Backend services and tooling
 **Best Practices**:
- Generate code from specs to ensure consistency
+- Generate boilerplate (types, routes) from the Spec
- Use each language's naming convention consistently (snake_case or camelCase as appropriate)
+- Maintain 90%+ test coverage
- Include unit tests alongside implementation
+- Keep README.md up-to-date for local development
 - Document complex algorithms with inline comments
 **Example**: Auto-generated TypeScript types from the OpenAPI spec.
 ---
 ### 6. Validation
-**Purpose**: Automated "Contract" enforcement.
+**Purpose**: Automated quality gates.
-**Why it matters**: Automated tests ensure that the system behaves as specified and prevent regressions. Validation in CI/CD pipelines catches issues before they reach production.
+**Why It Matters**: Human error happens. Validation layers catch mistakes before they reach production, preventing contract violations and security issues.
-**Primary Audience**:
+**Key Elements**:
- **CI/CD Pipelines**: Run tests automatically on every commit
+- **Contract Tests**: Verify implementation matches spec (Dredd, Prism)
- **QA Engineers**: Verify system behavior against requirements
+- **Integration Tests**: Test service-to-service interactions
- **Developers**: Get immediate feedback on changes
+- **Security Scans**: SAST/SBOM analysis on every PR
 **Format / Tooling**:
 - **GitHub Actions**: Automated testing and validation workflows
 - **Prism (Stoplight)**: OpenAPI spec validation in CI
 - **Jest/Vitest**: JavaScript testing frameworks
 - **Pytest**: Python testing framework
 **Best Practices**:
- Test the contract (spec) not just implementation details
+- Run validation on every pull request
- Use contract testing (Pact) for service-to-service validation
+- Block merges on contract violations
- Fail fast: tests should run quickly and provide clear error messages
+- Track build success rate as a KPI
 - Include negative test cases (invalid inputs, edge cases)
 **Example**: A test that fails if the Julia API returns camelCase keys.
 ---
-### 7. Runbook
+### 7. Maintenance
-**Purpose**: Deployment, scaling, and recovery procedures.
+**Purpose**: Guide for long-term health and evolution.
-**Why it matters**: Runbooks ensure that deployments are consistent, repeatable, and recoverable. In GitOps, the runbook *is* the configuration, version-controlled alongside the code.
+**Why It Matters**: Software decays. Without a maintenance plan, dependency upgrades become risky, secrets go unrotated, and technical debt piles up.
-**Primary Audience**:
+**Key Elements**:
- **DevOps Engineers**: Execute deployments and scaling operations
+- **Dependency Update Schedule**: When and how to upgrade packages
- **SREs**: Manage system reliability and incident response
+- **Secret Rotation Steps**: How to rotate credentials securely
- **Developers**: Deploy feature branches for testing
+- **DB Migration Logs**: History of schema changes
-
+- **Tech Debt "Graveyard"**: Documented technical debt with remediation plans
 **Format / Tooling**:
 - **Kubernetes Manifests**: Declarative deployment configurations
 - **Flux**: GitOps operator for Kubernetes
 - **Helm Charts**: Package management for Kubernetes
 - **Docker Compose**: Local development environments
 **Best Practices**:
- Use Git as the source of truth (GitOps)
+- Document the "how" for common maintenance tasks
- Make deployments idempotent (applying the same change twice has the same effect as applying it once)
+- Track package age and security vulnerabilities
- Include rollback procedures
+- Schedule regular tech debt reviews
 - Document scaling procedures for different load levels
 **Example**: `git push` to update the replica count from 3 to 6.
 ---
-## How the Stack Fits Together
+### 8. Runbook
-```
+**Purpose**: Operational life-support for production systems.
 ┌─────────────────────────────────────────────────────────────┐
 │                    Requirements                             │
 │  (Business goals, user needs)                              │
 └───────────────────┬─────────────────────────────────────────┘
                    │
                    ▼
 ┌─────────────────────────────────────────────────────────────┐
 │                    The Spec                                 │
 │  (Machine-readable contract: OpenAPI, Protobuf)           │
 └───────────────────┬─────────────────────────────────────────┘
                    │
                    ▼
 ┌─────────────────────────────────────────────────────────────┐
 │                    Architecture                             │
 │  (Structural blueprint: Mermaid, IcePanel)                 │
 └───────────────────┬─────────────────────────────────────────┘
                    │
                    ▼
 ┌─────────────────────────────────────────────────────────────┐
 │                    Walkthrough                              │
 │  (Intuition, big picture narrative)                        │
 └───────────────────┬─────────────────────────────────────────┘
                    │
                    ▼
 ┌─────────────────────────────────────────────────────────────┐
 │                    Implementation                           │
 │  (Actual code: SvelteKit, Julia, Node.js)                  │
 └───────────────────┬─────────────────────────────────────────┘
                    │
                    ▼
 ┌─────────────────────────────────────────────────────────────┐
 │                    Validation                             │
 │  (Automated tests: GitHub Actions, Prism)                  │
 └───────────────────┬─────────────────────────────────────────┘
                    │
                    ▼
 ┌─────────────────────────────────────────────────────────────┐
 │                    Runbook                                  │
 │  (Deployment, scaling: K8s, Flux)                          │
 └─────────────────────────────────────────────────────────────┘
 ```
-## Key Principles
+**Why It Matters**: When production is down, teams need clear instructions. In GitOps, the runbook is also the declared "desired state" that the cluster is continuously reconciled toward.
-1. **Machine-Readable Truth**: Specs and configurations should be machine-readable to enable automation
+**Key Elements**:
-2. **Separation of Concerns**: Different audiences need different types of information
+- **Deployment Steps**: How to deploy new versions
-3. **Version Control**: All documentation should be in Git, just like code
+- **Scaling Triggers**: When and how to scale up/down
-4. **Automation-First**: Validation should be automated and integrated into CI/CD
+- **Backup/Restore Procedures**: Disaster recovery steps
-5. **Living Documentation**: Documentation should evolve with the codebase
+- **"3:00 AM" Troubleshooting**: Quick fixes for common failures
-## Getting Started
+**Best Practices**:
 - Store in K8s manifests (Flux/Argo) for GitOps
 - Automate as much as possible
 - Test runbook procedures regularly
-To adopt this stack in your project:
+---
-1. Start with requirements in GitHub Issues or Notion
+## How to Use This Framework
 2. Create a spec file (OpenAPI/Protobuf) as the contract
 3. Add architecture diagrams using Mermaid.js
 4. Write a walkthrough explaining the "why" behind decisions
 5. Implement code following the spec
 6. Add automated tests that validate the spec
 7. Create runbooks for deployment and scaling
-This framework ensures that every piece of documentation serves a clear purpose and reaches the right audience.
+1. **Start with Requirements** - Define the business problem and success criteria
 2. **Create the Spec** - Translate requirements into machine-readable contracts
 3. **Design Architecture** - Visualize how the system will work
 4. **Write Walkthrough** - Document the logic and trade-offs
 5. **Implement** - Build the actual code
 6. **Set up Validation** - Add automated tests and gates
 7. **Document Maintenance** - Plan for long-term health
 8. **Create Runbook** - Define operational procedures
 This framework ensures that every document serves a clear purpose and that your project remains maintainable, scalable, and aligned with business goals.
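
The Validation example in this diff (a CI check that fails when a payload uses camelCase keys instead of snake_case) could be sketched roughly as follows. This is a minimal Python sketch and is not part of the committed file: the function name `non_snake_case_keys` and the sample payload are assumptions made for illustration, and a real pipeline would more likely run a contract-testing tool such as Dredd or Prism against the OpenAPI spec.

```python
import re
from typing import Any, Iterator

# Hypothetical rule: keys are lowercase words separated by single underscores.
SNAKE_CASE = re.compile(r"[a-z][a-z0-9]*(_[a-z0-9]+)*")


def non_snake_case_keys(payload: Any, path: str = "") -> Iterator[str]:
    """Yield the path of every mapping key in `payload` that is not snake_case."""
    if isinstance(payload, dict):
        for key, value in payload.items():
            key_path = f"{path}.{key}" if path else str(key)
            if not SNAKE_CASE.fullmatch(str(key)):
                yield key_path
            # Recurse so nested objects are checked as well.
            yield from non_snake_case_keys(value, key_path)
    elif isinstance(payload, list):
        for index, item in enumerate(payload):
            yield from non_snake_case_keys(item, f"{path}[{index}]")


if __name__ == "__main__":
    # Illustrative response body; the field names are made up for this sketch.
    response = {"user_id": "123", "teamMembers": [{"firstName": "Ada"}]}
    violations = list(non_snake_case_keys(response))
    print(violations)
```

A CI job built on this idea would exit with a non-zero status whenever the list of violations is non-empty, which is what blocks the pull request.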