diff --git a/docs/SDD_FRAMEWORK.md b/docs/SDD_FRAMEWORK.md index bf5eae7..3a2418f 100644 --- a/docs/SDD_FRAMEWORK.md +++ b/docs/SDD_FRAMEWORK.md @@ -1,279 +1,188 @@ -# SDD + GitOps Documentation Stack +# SDD + GitOps Documentation Framework -A comprehensive documentation strategy for modern software development that aligns different types of documentation with their specific purposes, audiences, and tooling. +## Overview -## The Big Picture +The **SDD + GitOps Documentation Framework** is a comprehensive, structured approach to software development documentation that aligns technical work with business outcomes through clear separation of concerns. -This framework ensures that every piece of documentation serves a clear purpose and reaches the right audience. It emphasizes: - -- **Machine-readable truths** as the foundation for automation -- **Separation of concerns** between human-facing docs and machine-consumable contracts -- **GitOps integration** where deployment and configuration are version-controlled -- **Multi-role audience targeting** from stakeholders to DevOps +This framework ensures that every piece of documentation serves a specific purpose, reaches the right audience, and can be measured for effectiveness. It's designed to prevent common pitfalls like feature creep, communication gaps, and operational fragility. --- -## Documentation Matrix +## The Documentation Matrix -| Document | Purpose ("The Why") | Primary Audience | Format / Tooling | Example (SaaS Context) | -|----------|---------------------|------------------|------------------|------------------------| -| **Requirements** | Define business goals & user needs | Stakeholders, PM, Lead Dev | GitHub Issues, Notion | "System must support 5-member teams with real-time sync." | -| **The Spec** | The Contract. Machine-readable truth. | Developers, QA, Machines | OpenAPI, Protobuf, YAML | A `.yaml` file defining `user_id` as a UUID in snake_case. 
| -| **Architecture** | High-level structural blueprint | Senior Devs, DevOps | Mermaid.js, IcePanel | Diagram of SvelteKit ↔ NATS ↔ Julia 6-node cluster. | -| **Walkthrough** | The Intuition. The "Big Picture" narrative. | New Devs, The Team | Recorded Video, TOUR.md | "Why we use a Claim-Check pattern for large Arrow data." | -| **Implementation** | The actual logic & generated code | Developers | SvelteKit, Julia, Node.js | Auto-generated TypeScript types from the OpenAPI spec. | -| **Validation** | Automated "Contract" enforcement | CI/CD Pipelines, QA | GitHub Actions, Prism | A test that fails if the Julia API returns camelCase keys. | -| **Runbook** | Deployment, Scaling, & Recovery | DevOps, SRE | K8s Manifests, Flux | `git push` to update the replica count from 3 to 6. | +| Document | Purpose & Rationale (The "Why") | Audience | Format / Content | Measurement (KPI/SLO) | Example (SaaS Context) | +|----------|----------------------------------|----------|------------------|----------------------|------------------------| +| **Requirements** | The Business North Star. Defines exactly what problem the user has and what success looks like. It prevents "feature creep" by setting hard boundaries on what we will NOT build. | Founder, Team, PM | Format: Shared Wiki (Notion/GitHub Wiki). Content: User stories, business constraints, competitive context, and success metrics. | **KPI**: Business Outcomes. Measured by User Retention, Conversion Rates, and Monthly Recurring Revenue (MRR). | "The system must process high-volume math so clients see reports instantly. Goal: 15% increase in daily active users." | +| **Spec** | The Technical Contract. A machine-readable, strictly typed definition of all data interfaces. It is the "Single Source of Truth" that prevents bugs caused by communication gaps between services. | Developers, QA, Automation | Format: OpenAPI/YAML or Protobuf. 
Content: API endpoints, snake_case key naming, data validation rules, and error response codes. | **SLA/SLO**: System Performance. Measured by API Uptime (99.9%), Response Latency (<100ms), and Error Rates. | A `contract.yaml` defining exactly how Julia sends Arrow data to Node.js. It forces `user_id` to be a UUID. | +| **Architecture** | The Structural Blueprint. A visual map of how the components (services, DBs, networks) fit together. It shows how the data flows through the 6-node cluster and where bottlenecks live. | Senior Devs, DevOps | Format: Diagrams-as-code (Mermaid.js). Content: System Context diagrams, Database ERDs, Network Security Policies, and Infrastructure maps. | **Efficiency Metrics**: Resource utilization. Measured by CPU Load (<70%), RAM per pod, and internal network throughput. | A diagram showing the data path: Caddy (Proxy) → Node.js (API) → NATS (Queue) → Julia (Math Engine). | +| **Walkthrough** | The Intuition & Logic. A narrative guide that explains the "steps" and "rationale" behind end-to-end flows. It's about building a mental model so devs understand why the sequence matters. | The Team, New Hires | Format: TOUR.md file or Loom Video. Content: Step-by-step traces of core features, explanation of architectural trade-offs, and "The Big Picture" flow. | **Quality**: Developer Velocity. Measured by "Time-to-First-Commit" for new hires and reduction in conceptual bugs. | "End-to-End Trace": 1. UI sends JSON. 2. API wraps it in Claim-Check. 3. Julia pulls it. Rationale: To avoid NATS memory spikes. | +| **Implementation** | The Functional Reality. The actual code that does the work. In SDD, the "boring" parts (types/routes) are auto-generated from the Spec to ensure the code never lies. | Developers, Reviewers | Format: Git Repository. Content: Business logic, internal helper functions, Unit Tests, and a README.md for local environment setup. | **Code Health**: Internal Quality. 
Measured by Test Coverage (90%+), Linting compliance, and Cyclomatic Complexity. | The SvelteKit frontend components and the specific Julia math-processing functions. | +| **Validation** | The Enforcement Layer. Automated gates that prove the Implementation matches the Spec. It prevents human error (like changing a key name) from reaching production. | CI/CD Pipeline, QA | Format: GitHub Actions / Tests. Content: Contract tests (Dredd/Prism), Integration tests, and Security scans that run on every pull request. | **Compliance**: Safety Metrics. Measured by Build Success Rate and 0 "Contract Violations" in the production logs. | A CI job that blocks a Pull Request because a developer used camelCase in a database field instead of snake_case. | +| **Maintenance** | The Health & Evolution. Defines how to upgrade dependencies, manage technical debt, and rotate secrets. It's the guide for "future-proofing" the software over time. | The Team, DevOps | Format: MAINTENANCE.md. Content: Dependency update schedules, Secret rotation steps, DB Migration logs, and Tech Debt "Graveyard" tracking. | **Sustainability**: System Longevity. Measured by "Package Age", "Security Vulnerabilities Found", and "Migration Success Rate". | "Steps to upgrade the Julia version across all 6 nodes without downtime using a Blue-Green deployment strategy." | +| **Runbook** | The Operational Life-Support. The instructions for when the system is alive (or dying). In GitOps, this is the "Desired State" of the infrastructure. | DevOps, SRE, On-call Devs | Format: K8s manifests in Git, reconciled by Flux or Argo CD. Content: Deployment steps, Scaling triggers, Backup/Restore procedures, and "3:00 AM" troubleshooting guides. | **Reliability**: Operational Health. Measured by MTTR (Mean Time to Recovery) and Error-Free Deployments. | A Flux-reconciled manifest that keeps 6 replicas of the Julia service running and restarts any pod whose health check fails (for example, at 80% RAM). | --- -## Detailed Explanations +## Detailed Document Descriptions ### 1. 
Requirements -**Purpose**: Define business goals & user needs. +**Purpose**: Establish the Business North Star. -**Why it matters**: Before writing code, we need to understand *why* we're building something. Requirements capture the business context, user pain points, and success criteria. +**Why It Matters**: Without clear requirements, teams drift into "feature creep" - building things that don't solve the actual problem. This document anchors the project in business outcomes. -**Primary Audience**: -- **Stakeholders**: Business owners who need to approve the direction -- **Product Managers**: Translate requirements into features -- **Lead Developers**: Understand scope and technical constraints - -**Format / Tooling**: -- **GitHub Issues**: Simple, version-controlled, integrated with code -- **Notion**: Rich text, collaborative, good for initial brainstorming +**Key Elements**: +- **User Stories**: What the user needs to accomplish +- **Business Constraints**: Budget, timeline, regulatory requirements +- **Competitive Context**: What competitors do and how you differentiate +- **Success Metrics**: Quantifiable goals that define "done" **Best Practices**: -- Write in user story format: "As a [role], I want [feature] so that [benefit]" -- Include acceptance criteria as checklist items -- Link to related specs and architecture decisions - -**Example**: "System must support 5-member teams with real-time sync." +- Keep it in a shared wiki (Notion, GitHub Wiki) for collaborative editing +- Focus on outcomes, not solutions +- Explicitly state what you will NOT build --- -### 2. The Spec (The Contract) +### 2. Spec (Specification) -**Purpose**: Machine-readable truth that defines the API contract. +**Purpose**: Create a machine-readable technical contract. -**Why it matters**: The spec is the single source of truth for how systems communicate. It enables code generation, automated testing, and ensures consistency across services. 
+**Why It Matters**: Communication gaps between services cause bugs. A strict, typed spec prevents these by being the Single Source of Truth. -**Primary Audience**: -- **Developers**: Implement the API according to the spec -- **QA Engineers**: Create test cases based on the spec -- **Machines**: Used for code generation, validation, and documentation - -**Format / Tooling**: -- **OpenAPI (Swagger)**: REST API specifications -- **Protobuf**: gRPC service definitions -- **YAML/JSON**: Configuration and data schema definitions +**Key Elements**: +- **API Endpoints**: All routes with HTTP methods +- **Data Types**: Strict typing with validation rules +- **Error Codes**: Comprehensive error response definitions +- **Naming Conventions**: snake_case keys, consistent patterns **Best Practices**: -- Use snake_case for consistency -- Define all fields with types and constraints -- Include examples for complex data structures -- Keep specs versioned alongside code - -**Example**: A `.yaml` file defining `user_id` as a UUID in snake_case. +- Use OpenAPI (YAML/JSON) for REST APIs or Protobuf for gRPC +- Automate generation of client/server code from the spec +- Run contract tests against the spec in CI/CD --- ### 3. Architecture -**Purpose**: High-level structural blueprint showing how components interact. +**Purpose**: Visualize the system structure and data flow. -**Why it matters**: Architecture diagrams help everyone understand the system's structure without drowning in implementation details. They're crucial for onboarding, design reviews, and long-term maintainability. +**Why It Matters**: Complex systems (like your 6-node cluster) need clear maps. Without them, teams can't identify bottlenecks or make informed decisions. 
-**Primary Audience**: -- **Senior Developers**: Design decisions and component responsibilities -- **DevOps**: Understand deployment topology and service dependencies -- **Technical Leads**: Evaluate trade-offs and scalability concerns - -**Format / Tooling**: -- **Mermaid.js**: Code-based diagrams that are version-controlled -- **IcePanel**: Interactive, automated architecture visualization -- **C4 Model**: Standardized approach to architectural diagrams +**Key Elements**: +- **System Context Diagram**: Shows the system and its external dependencies +- **Database ERD**: Entity-Relationship diagrams for data model +- **Network Security Policies**: Firewall rules, service mesh configs +- **Infrastructure Maps**: Cloud resources, scaling groups **Best Practices**: -- Focus on *relationships* between components, not implementation details -- Include technology choices (e.g., NATS vs WebSocket) -- Show data flow direction with arrows +- Use Mermaid.js for diagrams-as-code (versionable, diffable) - Update diagrams when architecture changes - -**Example**: Diagram of SvelteKit ↔ NATS ↔ Julia 6-node cluster. +- Focus on data flow and decision points --- ### 4. Walkthrough -**Purpose**: The intuition and "Big Picture" narrative. +**Purpose**: Build a mental model through narrative. -**Why it matters**: Code alone doesn't explain *why* decisions were made. Walkthroughs provide context, historical decisions, and architectural intuition that helps new developers become productive quickly. +**Why It Matters**: Code doesn't explain *why*. Walkthroughs capture the reasoning behind architectural trade-offs, making onboarding faster and reducing conceptual bugs. 
-**Primary Audience**: -- **New Developers**: Understand the system's philosophy and patterns -- **The Team**: Share context and reasoning behind design choices -- **Code Reviewers**: Evaluate design decisions alongside implementation - -**Format / Tooling**: -- **Recorded Video**: Personal, engaging, good for complex explanations -- **TOUR.md**: Markdown file with narrative walk-through of the codebase -- **Architecture Decision Records (ADRs)**: Formal documentation of key decisions +**Key Elements**: +- **Step-by-step traces**: End-to-end flow of user actions +- **Trade-off explanations**: Why you chose option A over B +- **The Big Picture**: How components fit together conceptually **Best Practices**: -- Explain *why* more than *how* -- Include anti-patterns to avoid -- Link to related documentation -- Keep walkthroughs updated with architecture changes - -**Example**: "Why we use a Claim-Check pattern for large Arrow data." +- Write in a TOUR.md file or record Loom videos +- Focus on intuition, not just mechanics +- Include "Rationale" sections for each major decision --- ### 5. Implementation -**Purpose**: The actual logic and generated code. +**Purpose**: The functional reality - the actual code. -**Why it matters**: This is the executable truth of the system. Well-structured implementation code should be clear, maintainable, and follow established patterns. +**Why It Matters**: This is what runs in production. In SDD, the spec-driven approach ensures boring parts are generated automatically, so developers focus on business logic. 
-**Primary Audience**: -- **Developers**: Read, modify, and extend the code -- **Reviewers**: Verify correctness and adherence to standards -- **CI/CD**: Run tests and builds - -**Format / Tooling**: -- **SvelteKit**: Frontend framework with server-side rendering -- **Julia**: High-performance numerical computing -- **Node.js**: Backend services and tooling +**Key Elements**: +- **Business Logic**: The unique value you provide +- **Unit Tests**: Covering edge cases and error paths +- **README.md**: Local environment setup instructions **Best Practices**: -- Generate code from specs to ensure consistency -- Use consistent naming conventions (snake_case, camelCase appropriately) -- Include unit tests alongside implementation -- Document complex algorithms with inline comments - -**Example**: Auto-generated TypeScript types from the OpenAPI spec. +- Generate boilerplate (types, routes) from the Spec +- Maintain 90%+ test coverage +- Keep README.md up-to-date for local development --- ### 6. Validation -**Purpose**: Automated "Contract" enforcement. +**Purpose**: Automated quality gates. -**Why it matters**: Automated tests ensure that the system behaves as specified and prevent regressions. Validation in CI/CD pipelines catches issues before they reach production. +**Why It Matters**: Human error happens. Validation layers catch mistakes before they reach production, preventing contract violations and security issues. 
-**Primary Audience**: -- **CI/CD Pipelines**: Run tests automatically on every commit -- **QA Engineers**: Verify system behavior against requirements -- **Developers**: Get immediate feedback on changes - -**Format / Tooling**: -- **GitHub Actions**: Automated testing and validation workflows -- **Prism (ReadMe)**: OpenAPI spec validation in CI -- **Jest/Vitest**: JavaScript testing framework -- **Pytest**: Python testing framework +**Key Elements**: +- **Contract Tests**: Verify implementation matches spec (Dredd, Prism) +- **Integration Tests**: Test service-to-service interactions +- **Security Scans**: SAST/SBOM analysis on every PR **Best Practices**: -- Test the contract (spec) not just implementation details -- Use contract testing (PACT) for service-to-service validation -- Fail fast: tests should run quickly and provide clear error messages -- Include negative test cases (invalid inputs, edge cases) - -**Example**: A test that fails if the Julia API returns camelCase keys. +- Run validation on every pull request +- Block merges on contract violations +- Track build success rate as a KPI --- -### 7. Runbook +### 7. Maintenance -**Purpose**: Deployment, scaling, and recovery procedures. +**Purpose**: Guide for long-term health and evolution. -**Why it matters**: Runbooks ensure that deployments are consistent, repeatable, and recoverable. In GitOps, the runbook *is* the configuration, version-controlled alongside the code. +**Why It Matters**: Software decays. Without a maintenance plan, dependency upgrades become risky, secrets accumulate, and technical debt piles up. 
-**Primary Audience**: -- **DevOps Engineers**: Execute deployments and scaling operations -- **SREs**: Manage system reliability and incident response -- **Developers**: Deploy feature branches for testing - -**Format / Tooling**: -- **Kubernetes Manifests**: Declarative deployment configurations -- **Flux**: GitOps operator for Kubernetes -- **Helm Charts**: Package management for Kubernetes -- **Docker Compose**: Local development environments +**Key Elements**: +- **Dependency Update Schedule**: When and how to upgrade packages +- **Secret Rotation Steps**: How to rotate credentials securely +- **DB Migration Logs**: History of schema changes +- **Tech Debt "Graveyard"**: Documented technical debt with remediation plans **Best Practices**: -- Use Git as the source of truth (GitOps) -- Make deployments idempotent (running twice has same effect) -- Include rollback procedures -- Document scaling procedures for different load levels - -**Example**: `git push` to update the replica count from 3 to 6. +- Document the "how" for common maintenance tasks +- Track package age and security vulnerabilities +- Schedule regular tech debt reviews --- -## How the Stack Fits Together +### 8. 
Runbook -``` -┌─────────────────────────────────────────────────────────────┐ -│ Requirements │ -│ (Business goals, user needs) │ -└───────────────────┬─────────────────────────────────────────┘ - │ - ▼ -┌─────────────────────────────────────────────────────────────┐ -│ The Spec │ -│ (Machine-readable contract: OpenAPI, Protobuf) │ -└───────────────────┬─────────────────────────────────────────┘ - │ - ▼ -┌─────────────────────────────────────────────────────────────┐ -│ Architecture │ -│ (Structural blueprint: Mermaid, IcePanel) │ -└───────────────────┬─────────────────────────────────────────┘ - │ - ▼ -┌─────────────────────────────────────────────────────────────┐ -│ Walkthrough │ -│ (Intuition, big picture narrative) │ -└───────────────────┬─────────────────────────────────────────┘ - │ - ▼ -┌─────────────────────────────────────────────────────────────┐ -│ Implementation │ -│ (Actual code: SvelteKit, Julia, Node.js) │ -└───────────────────┬─────────────────────────────────────────┘ - │ - ▼ -┌─────────────────────────────────────────────────────────────┐ -│ Validation │ -│ (Automated tests: GitHub Actions, Prism) │ -└───────────────────┬─────────────────────────────────────────┘ - │ - ▼ -┌─────────────────────────────────────────────────────────────┐ -│ Runbook │ -│ (Deployment, scaling: K8s, Flux) │ -└─────────────────────────────────────────────────────────────┘ -``` +**Purpose**: Operational life-support for production systems. -## Key Principles +**Why It Matters**: When production is down, teams need clear instructions. In GitOps, the runbook is the "desired state" that the system constantly works toward. -1. **Machine-Readable Truth**: Specs and configurations should be machine-readable to enable automation -2. **Separation of Concerns**: Different audiences need different types of information -3. **Version Control**: All documentation should be in Git, just like code -4. **Automation-First**: Validation should be automated and integrated into CI/CD -5. 
**Living Documentation**: Documentation should evolve with the codebase +**Key Elements**: +- **Deployment Steps**: How to deploy new versions +- **Scaling Triggers**: When and how to scale up/down +- **Backup/Restore Procedures**: Disaster recovery steps +- **"3:00 AM" Troubleshooting**: Quick fixes for common failures -## Getting Started +**Best Practices**: +- Store the desired state as K8s manifests in Git, reconciled by Flux or Argo CD +- Automate as much as possible +- Test runbook procedures regularly -To adopt this stack in your project: +--- -1. Start with requirements in GitHub Issues or Notion -2. Create a spec file (OpenAPI/Protobuf) as the contract -3. Add architecture diagrams using Mermaid.js -4. Write a walkthrough explaining the "why" behind decisions -5. Implement code following the spec -6. Add automated tests that validate the spec -7. Create runbooks for deployment and scaling +## How to Use This Framework -This framework ensures that every piece of documentation serves a clear purpose and reaches the right audience. \ No newline at end of file +1. **Start with Requirements** - Define the business problem and success criteria +2. **Create the Spec** - Translate requirements into machine-readable contracts +3. **Design Architecture** - Visualize how the system will work +4. **Write Walkthrough** - Document the logic and trade-offs +5. **Implement** - Build the actual code +6. **Set up Validation** - Add automated tests and gates +7. **Document Maintenance** - Plan for long-term health +8. **Create Runbook** - Define operational procedures + +This framework ensures that every document serves a clear purpose and that your project remains maintainable, scalable, and aligned with business goals. \ No newline at end of file
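The Validation row above describes a CI gate that rejects camelCase keys, and the Spec row requires `user_id` to be a UUID. A minimal Python sketch of such a check follows. The function name `find_contract_violations` and the sample payloads are hypothetical and not part of this change; a real pipeline would more likely validate responses against the OpenAPI spec with a tool such as Dredd or Prism.

```python
import re
import uuid

# Lowercase words joined by single underscores, e.g. "user_id" or "team_size".
SNAKE_CASE = re.compile(r"^[a-z][a-z0-9]*(_[a-z0-9]+)*$")


def find_contract_violations(payload: dict) -> list[str]:
    """Return a list of violations; an empty list means the payload passes.

    Only top-level keys are checked in this sketch.
    """
    violations = []
    for key in payload:
        if not SNAKE_CASE.match(key):
            violations.append(f"key '{key}' is not snake_case")

    # The contract requires user_id to be present and parseable as a UUID.
    try:
        uuid.UUID(str(payload.get("user_id")))
    except ValueError:
        violations.append("user_id is not a valid UUID")
    return violations


if __name__ == "__main__":
    sample = {"userId": "abc"}  # hypothetical non-compliant response
    for problem in find_contract_violations(sample):
        print(problem)
```

In a CI job, a non-empty result would fail the build and block the pull request.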