From 437ca81e76b12fcfe4ba54405a1984cf3f98c7b9 Mon Sep 17 00:00:00 2001
From: narawat
Date: Fri, 13 Mar 2026 09:47:10 +0700
Subject: [PATCH] update

---
 docs/SDD_FRAMEWORK.md | 305 ++++++++++++++++++++++++++++--------------
 1 file changed, 206 insertions(+), 99 deletions(-)

diff --git a/docs/SDD_FRAMEWORK.md b/docs/SDD_FRAMEWORK.md
index 3a2418f..099a17b 100644
--- a/docs/SDD_FRAMEWORK.md
+++ b/docs/SDD_FRAMEWORK.md
@@ -2,187 +2,294 @@

 ## Overview

-The **SDD + GitOps Documentation Framework** is a comprehensive, structured approach to software development documentation that aligns technical work with business outcomes through clear separation of concerns.
+The **SDD (Software Design Documentation) + GitOps Documentation Framework** is a comprehensive, structured approach to software development documentation that aligns technical work with business outcomes through clear separation of concerns.

-This framework ensures that every piece of documentation serves a specific purpose, reaches the right audience, and can be measured for effectiveness. It's designed to prevent common pitfalls like feature creep, communication gaps, and operational fragility.
+This framework ensures that every piece of documentation serves a specific purpose, reaches the right audience, and is measurable through clear KPIs and SLOs.

 ---

 ## The Documentation Matrix

 | Document | Purpose & Rationale (The "Why") | Audience | Format / Content | Measurement (KPI/SLO) | Example (SaaS Context) |
-|----------|----------------------------------|----------|------------------|----------------------|------------------------|
-| **Requirements** | The Business North Star. Defines exactly what problem the user has and what success looks like. It prevents "feature creep" by setting hard boundaries on what we will NOT build. | Founder, Team, PM | Format: Shared Wiki (Notion/GitHub Wiki). Content: User stories, business constraints, competitive context, and success metrics. | **KPI**: Business Outcomes. Measured by User Retention, Conversion Rates, and Monthly Recurring Revenue (MRR). | "The system must process high-volume math so clients see reports instantly. Goal: 15% increase in daily active users." |
-| **Spec** | The Technical Contract. A machine-readable, strictly typed definition of all data interfaces. It is the "Single Source of Truth" that prevents bugs caused by communication gaps between services. | Developers, QA, Automation | Format: OpenAPI/YAML or Protobuf. Content: API endpoints, snake_case key naming, data validation rules, and error response codes. | **SLA/SLO**: System Performance. Measured by API Uptime (99.9%), Response Latency (<100ms), and Error Rates. | A `contract.yaml` defining exactly how Julia sends Arrow data to Node.js. It forces `user_id` to be a UUID. |
-| **Architecture** | The Structural Blueprint. A visual map of how the components (services, DBs, networks) fit together. It shows how the data flows through the 6-node cluster and where bottlenecks live. | Senior Devs, DevOps | Format: Diagrams-as-code (Mermaid.js). Content: System Context diagrams, Database ERDs, Network Security Policies, and Infrastructure maps. | **Efficiency Metrics**: Resource utilization. Measured by CPU Load (<70%), RAM per pod, and internal network throughput. | A diagram showing the data path: Caddy (Proxy) → Node.js (API) → NATS (Queue) → Julia (Math Engine). |
-| **Walkthrough** | The Intuition & Logic. A narrative guide that explains the "steps" and "rationale" behind end-to-end flows. It's about building a mental model so devs understand why the sequence matters. | The Team, New Hires | Format: TOUR.md file or Loom Video. Content: Step-by-step traces of core features, explanation of architectural trade-offs, and "The Big Picture" flow. | **Quality**: Developer Velocity. Measured by "Time-to-First-Commit" for new hires and reduction in conceptual bugs. | "End-to-End Trace": 1. UI sends JSON. 2. API wraps it in Claim-Check. 3. Julia pulls it. Rationale: To avoid NATS memory spikes. |
-| **Implementation** | The Functional Reality. The actual code that does the work. In SDD, the "boring" parts (types/routes) are auto-generated from the Spec to ensure the code never lies. | Developers, Reviewers | Format: Git Repository. Content: Business logic, internal helper functions, Unit Tests, and a README.md for local environment setup. | **Code Health**: Internal Quality. Measured by Test Coverage (90%+), Linting compliance, and Cyclomatic Complexity. | The SvelteKit frontend components and the specific Julia math-processing functions. |
-| **Validation** | The Enforcement Layer. Automated gates that prove the Implementation matches the Spec. It prevents human error (like changing a key name) from reaching production. | CI/CD Pipeline, QA | Format: GitHub Actions / Tests. Content: Contract tests (Dredd/Prism), Integration tests, and Security scans that run on every pull request. | **Compliance**: Safety Metrics. Measured by Build Success Rate and 0 "Contract Violations" in the production logs. | A CI job that blocks a Pull Request because a developer used camelCase in a database field instead of snake_case. |
-| **Maintenance** | The Health & Evolution. Defines how to upgrade dependencies, manage technical debt, and rotate secrets. It's the guide for "future-proofing" the software over time. | The Team, DevOps | Format: MAINTENANCE.md. Content: Dependency update schedules, Secret rotation steps, DB Migration logs, and Tech Debt "Graveyard" tracking. | **Sustainability**: System Longevity. Measured by "Package Age", "Security Vulnerabilities Found", and "Migration Success Rate". | "Steps to upgrade the Julia version across all 6 nodes without downtime using a Blue-Green deployment strategy." |
-| **Runbook** | The Operational Life-Support. The instructions for when the system is alive (or dying). In GitOps, this is the "Desired State" of the infrastructure. | DevOps, SRE, On-call Devs | Format: K8s Manifests (Flux/Argo). Content: Deployment steps, Scaling triggers, Backup/Restore procedures, and "3:00 AM" troubleshooting guides. | **Reliability**: Operational Health. Measured by MTTR (Mean Time to Recovery) and Error-Free Deployments. | A Flux manifest that ensures 6 replicas of the Julia service are always healthy and restarts them if they hit 80% RAM. |
+|----------|---------------------------------|----------|------------------|----------------------|------------------------|
+| **Requirements** | The Business North Star. Defines exactly what problem the user has and what success looks like. It prevents "feature creep" by setting hard boundaries on what we will NOT build. | Founder, Team, PM | Format: Shared Wiki (Notion/GitHub Wiki). Content: User stories, business constraints, competitive context, and success metrics. | KPI: Business Outcomes. Measured by User Retention, Conversion Rates, and Monthly Recurring Revenue (MRR). | "The system must process high-volume math so clients see reports instantly. Goal: 15% increase in daily active users." |
+| **Spec** | The Technical Contract. A machine-readable, strictly typed definition of all data interfaces. It is the "Single Source of Truth" that prevents bugs caused by communication gaps between services. | Developers, QA, Automation | Format: OpenAPI/YAML or Protobuf. Content: API endpoints, snake_case key naming, data validation rules, and error response codes. | SLA/SLO: System Performance. Measured by API Uptime (99.9%), Response Latency (<100ms), and Error Rates. | A `contract.yaml` defining exactly how Julia sends Arrow data to Node.js. It forces `user_id` to be a UUID. |
+| **Architecture** | The Structural Blueprint. A visual map of how the components (services, DBs, networks) fit together. It shows how the data flows through the 6-node cluster and where bottlenecks live. | Senior Devs, DevOps | Format: Diagrams-as-code (Mermaid.js). Content: System Context diagrams, Database ERDs, Network Security Policies, and Infrastructure maps. | Efficiency Metrics: Resource utilization. Measured by CPU Load (<70%), RAM per pod, and internal network throughput. | A diagram showing the data path: Caddy (Proxy) → Node.js (API) → NATS (Queue) → Julia (Math Engine). |
+| **Walkthrough** | The Intuition & Logic. A narrative guide that explains the "steps" and "rationale" behind end-to-end flows. It's about building a mental model so devs understand why the sequence matters. | The Team, New Hires | Format: TOUR.md file or Loom Video. Content: Step-by-step traces of core features, explanation of architectural trade-offs, and "The Big Picture" flow. | Quality: Developer Velocity. Measured by "Time-to-First-Commit" for new hires and reduction in conceptual bugs. | "End-to-End Trace:" 1. UI sends JSON. 2. API wraps it in Claim-Check. 3. Julia pulls it. Rationale: To avoid NATS memory spikes. |
+| **Implementation** | The Functional Reality. The actual code that does the work. In SDD, the "boring" parts (types/routes) are auto-generated from the Spec to ensure the code never lies. | Developers, Reviewers | Format: Git Repository. Content: Business logic, internal helper functions, Unit Tests, and a README.md for local environment setup. | Code Health: Internal Quality. Measured by Test Coverage (90%+), Linting compliance, and Cyclomatic Complexity. | The SvelteKit frontend components and the specific Julia math-processing functions. |
+| **Validation** | The Enforcement Layer. Automated gates that prove the Implementation matches the Spec. It prevents human error (like changing a key name) from reaching production. | CI/CD Pipeline, QA | Format: GitHub Actions / Tests. Content: Contract tests (Dredd/Prism), Integration tests, and Security scans that run on every pull request. | Compliance: Safety Metrics. Measured by Build Success Rate and 0 "Contract Violations" in the production logs. | A CI job that blocks a Pull Request because a developer used camelCase in a database field instead of snake_case. |
+| **Maintenance** | The Health & Evolution. Defines how to upgrade dependencies, manage technical debt, and rotate secrets. It's the guide for "future-proofing" the software over time. | The Team, DevOps | Format: MAINTENANCE.md. Content: Dependency update schedules, Secret rotation steps, DB Migration logs, and Tech Debt "Graveyard" tracking. | Sustainability: System Longevity. Measured by "Package Age," "Security Vulnerabilities Found," and "Migration Success Rate." | "Steps to upgrade the Julia version across all 6 nodes without downtime using a Blue-Green deployment strategy." |
+| **Runbook** | The Operational Life-Support. The instructions for when the system is alive (or dying). In GitOps, this is the "Desired State" of the infrastructure. | DevOps, SRE, On-call Devs | Format: K8s Manifests (Flux/Argo). Content: Deployment steps, Scaling triggers, Backup/Restore procedures, and "3:00 AM" troubleshooting guides. | Reliability: Operational Health. Measured by MTTR (Mean Time to Recovery) and Error-Free Deployments. | A Flux manifest that ensures 6 replicas of the Julia service are always healthy and restarts them if they hit 80% RAM. |

 ---

-## Detailed Document Descriptions
+## Detailed Breakdown of Each Document Type

 ### 1. Requirements

-**Purpose**: Establish the Business North Star.
+**Purpose**: Establish the Business North Star

-**Why It Matters**: Without clear requirements, teams drift into "feature creep" - building things that don't solve the actual problem. This document anchors the project in business outcomes.
+The Requirements document is your anchor point. It answers the fundamental question: "What problem are we solving, and how do we know we've succeeded?"

-**Key Elements**:
-- **User Stories**: What the user needs to accomplish
-- **Business Constraints**: Budget, timeline, regulatory requirements
-- **Competitive Context**: What competitors do and how you differentiate
-- **Success Metrics**: Quantifiable goals that define "done"
+**Key Characteristics**:
+- **Business-Focused**: Written in business terms, not technical jargon
+- **Boundary-Setting**: Explicitly defines what we will NOT build
+- **Outcome-Oriented**: Focuses on user outcomes, not features

 **Best Practices**:
-- Keep it in a shared wiki (Notion, GitHub Wiki) for collaborative editing
-- Focus on outcomes, not solutions
-- Explicitly state what you will NOT build
+- Include user stories that describe the user's perspective
+- Document business constraints (regulatory, legal, compliance)
+- Define competitive context and market positioning
+- Establish clear success metrics from day one
+
+**Common Pitfalls to Avoid**:
+- Vague descriptions like "improve user experience"
+- Changing requirements without updating the document
+- Not defining what's out of scope

 ---

 ### 2. Spec (Specification)

-**Purpose**: Create a machine-readable technical contract.
+**Purpose**: Create the Technical Contract

-**Why It Matters**: Communication gaps between services cause bugs. A strict, typed spec prevents these by being the Single Source of Truth.
+The Spec serves as the Single Source of Truth for all data interfaces. It's a machine-readable definition that ensures consistency across services.

-**Key Elements**:
-- **API Endpoints**: All routes with HTTP methods
-- **Data Types**: Strict typing with validation rules
-- **Error Codes**: Comprehensive error response definitions
-- **Naming Conventions**: snake_case keys, consistent patterns
+**Key Characteristics**:
+- **Machine-Readable**: Can be parsed by tools for validation and code generation
+- **Strictly Typed**: Enforces data types and validation rules
+- **Comprehensive**: Covers all endpoints, request/response formats, and error codes

 **Best Practices**:
-- Use OpenAPI (YAML/JSON) for REST APIs or Protobuf for gRPC
-- Automate generation of client/server code from the spec
-- Run contract tests against the spec in CI/CD
+- Use OpenAPI/Swagger for REST APIs or Protobuf for gRPC
+- Enforce consistent naming conventions (e.g., snake_case)
+- Define validation rules for all data fields
+- Document all possible error responses
+
+**Common Pitfalls to Avoid**:
+- Letting the spec diverge from the implementation
+- Incomplete error handling documentation
+- Not versioning the API spec

 ---

 ### 3. Architecture

-**Purpose**: Visualize the system structure and data flow.
+**Purpose**: Visualize the System Structure

-**Why It Matters**: Complex systems (like your 6-node cluster) need clear maps. Without them, teams can't identify bottlenecks or make informed decisions.
+The Architecture document provides a visual map of how components fit together. It helps identify bottlenecks and understand data flow.

-**Key Elements**:
-- **System Context Diagram**: Shows the system and its external dependencies
-- **Database ERD**: Entity-Relationship diagrams for data model
-- **Network Security Policies**: Firewall rules, service mesh configs
-- **Infrastructure Maps**: Cloud resources, scaling groups
+**Key Characteristics**:
+- **Visual**: Uses diagrams to represent complex relationships
+- **Comprehensive**: Covers system context, data flow, and infrastructure
+- **Living Document**: Updated as the system evolves

 **Best Practices**:
-- Use Mermaid.js for diagrams-as-code (versionable, diffable)
-- Update diagrams when architecture changes
-- Focus on data flow and decision points
+- Use Mermaid.js for diagrams-as-code (versionable in Git)
+- Include multiple views: System Context, C4 model, ERDs, network topology
+- Document trade-offs and architectural decisions
+- Show data flow through the system
+
+**Common Pitfalls to Avoid**:
+- Over-engineering diagrams with unnecessary detail
+- Not updating diagrams when the architecture changes
+- Using static images instead of diagrams-as-code

 ---

 ### 4. Walkthrough

-**Purpose**: Build a mental model through narrative.
+**Purpose**: Build Mental Models

-**Why It Matters**: Code doesn't explain *why*. Walkthroughs capture the reasoning behind architectural trade-offs, making onboarding faster and reducing conceptual bugs.
+The Walkthrough document explains the "why" behind the "how." It helps developers understand the rationale behind design decisions.

-**Key Elements**:
-- **Step-by-step traces**: End-to-end flow of user actions
-- **Trade-off explanations**: Why you chose option A over B
-- **The Big Picture**: How components fit together conceptually
+**Key Characteristics**:
+- **Narrative-Driven**: Tells a story about how the system works
+- **Context-Rich**: Explains trade-offs and decisions
+- **End-to-End**: Traces flows from user input to system output

 **Best Practices**:
-- Write in a TOUR.md file or record Loom videos
-- Focus on intuition, not just mechanics
-- Include "Rationale" sections for each major decision
+- Document step-by-step traces of core features
+- Explain architectural trade-offs and why you chose them
+- Include "The Big Picture" context
+- Use real examples and data flows
+
+**Common Pitfalls to Avoid**:
+- Only documenting the happy path
+- Assuming developers will figure out the "why"
+- Not explaining the rationale behind decisions

 ---

 ### 5. Implementation

-**Purpose**: The functional reality - the actual code.
+**Purpose**: The Functional Reality

-**Why It Matters**: This is what runs in production. In SDD, the spec-driven approach ensures boring parts are generated automatically, so developers focus on business logic.
+The Implementation is the actual code that does the work. In SDD, the "boring" parts are auto-generated from the Spec to ensure consistency.

-**Key Elements**:
-- **Business Logic**: The unique value you provide
-- **Unit Tests**: Covering edge cases and error paths
-- **README.md**: Local environment setup instructions
+**Key Characteristics**:
+- **Machine-Generated**: Types and routes auto-generated from Spec
+- **Human-Written**: Business logic and helper functions
+- **Tested**: Includes unit and integration tests

 **Best Practices**:
-- Generate boilerplate (types, routes) from the Spec
-- Maintain 90%+ test coverage
-- Keep README.md up-to-date for local development
+- Auto-generate boring parts (types, routes) from the Spec
+- Keep business logic separate from boilerplate
+- Maintain comprehensive test coverage
+- Document the local development setup
+
+**Common Pitfalls to Avoid**:
+- Hand-writing types that should be auto-generated
+- Inconsistent code style
+- Insufficient test coverage

 ---

 ### 6. Validation

-**Purpose**: Automated quality gates.
+**Purpose**: Enforce the Contract

-**Why It Matters**: Human error happens. Validation layers catch mistakes before they reach production, preventing contract violations and security issues.
+The Validation layer provides automated gates that ensure the Implementation matches the Spec. It prevents human error from reaching production.

-**Key Elements**:
-- **Contract Tests**: Verify implementation matches spec (Dredd, Prism)
-- **Integration Tests**: Test service-to-service interactions
-- **Security Scans**: SAST/SBOM analysis on every PR
+**Key Characteristics**:
+- **Automated**: Runs on every commit/Pull Request
+- **Comprehensive**: Covers contract tests, integration tests, and security scans
+- **Blocking**: Prevents merges that violate the contract

 **Best Practices**:
-- Run validation on every pull request
-- Block merges on contract violations
-- Track build success rate as a KPI
+- Use contract testing tools (Dredd, Prism) to validate API contracts
+- Run integration tests on every commit
+- Include security scans in the CI pipeline
+- Fail builds on contract violations
+
+**Common Pitfalls to Avoid**:
+- Not running tests on every commit
+- Allowing manual overrides of validation gates
+- Not updating tests when the Spec changes

 ---

 ### 7. Maintenance

-**Purpose**: Guide for long-term health and evolution.
+**Purpose**: Ensure Long-Term Health

-**Why It Matters**: Software decays. Without a maintenance plan, dependency upgrades become risky, secrets accumulate, and technical debt piles up.
+The Maintenance document defines how to upgrade dependencies, manage technical debt, and rotate secrets. It's the guide for "future-proofing" the software.

-**Key Elements**:
-- **Dependency Update Schedule**: When and how to upgrade packages
-- **Secret Rotation Steps**: How to rotate credentials securely
-- **DB Migration Logs**: History of schema changes
-- **Tech Debt "Graveyard"**: Documented technical debt with remediation plans
+**Key Characteristics**:
+- **Procedural**: Step-by-step instructions for common tasks
+- **Scheduled**: Includes regular maintenance windows
+- **Documented**: Tracks technical debt and migration history

 **Best Practices**:
-- Document the "how" for common maintenance tasks
-- Track package age and security vulnerabilities
-- Schedule regular tech debt reviews
+- Document dependency update schedules
+- Create secret rotation procedures
+- Track technical debt in a "Graveyard"
+- Document migration history and rollback procedures
+
+**Common Pitfalls to Avoid**:
+- Ad-hoc upgrades without documentation
+- Ignoring technical debt until it becomes critical
+- Not testing upgrades in staging first

 ---

 ### 8. Runbook

-**Purpose**: Operational life-support for production systems.
+**Purpose**: Operational Life-Support

-**Why It Matters**: When production is down, teams need clear instructions. In GitOps, the runbook is the "desired state" that the system constantly works toward.
+The Runbook provides instructions for when the system is alive (or dying). In GitOps, this is the "Desired State" of the infrastructure.

-**Key Elements**:
-- **Deployment Steps**: How to deploy new versions
-- **Scaling Triggers**: When and how to scale up/down
-- **Backup/Restore Procedures**: Disaster recovery steps
-- **"3:00 AM" Troubleshooting**: Quick fixes for common failures
+**Key Characteristics**:
+- **Action-Oriented**: Step-by-step instructions for common operations
+- **Automated**: Infrastructure as code defines the desired state
+- **Crisis-Ready**: Includes "3:00 AM" troubleshooting guides

 **Best Practices**:
-- Store in K8s manifests (Flux/Argo) for GitOps
-- Automate as much as possible
-- Test runbook procedures regularly
+- Document deployment procedures
+- Define scaling triggers and procedures
+- Include backup and restore procedures
+- Create troubleshooting guides for common issues
+
+**Common Pitfalls to Avoid**:
+- Not documenting procedures for common issues
+- Not testing runbook procedures
+- Not versioning runbooks with the infrastructure

 ---

-## How to Use This Framework
+## How to Use This Approach Effectively

-1. **Start with Requirements** - Define the business problem and success criteria
-2. **Create the Spec** - Translate requirements into machine-readable contracts
-3. **Design Architecture** - Visualize how the system will work
-4. **Write Walkthrough** - Document the logic and trade-offs
-5. **Implement** - Build the actual code
-6. **Set up Validation** - Add automated tests and gates
-7. **Document Maintenance** - Plan for long-term health
-8. **Create Runbook** - Define operational procedures
+### Phase 1: Foundation (Week 1-2)

-This framework ensures that every document serves a clear purpose and that your project remains maintainable, scalable, and aligned with business goals.
\ No newline at end of file
+1. **Create Requirements Document**
+   - Define the Business North Star
+   - Establish success metrics
+   - Define out-of-scope items
+
+2. **Write the Spec**
+   - Define all data interfaces
+   - Establish naming conventions
+   - Document validation rules
+
+3. **Design Architecture**
+   - Create system diagrams
+   - Document data flow
+   - Identify potential bottlenecks
+
+### Phase 2: Development (Week 3+)
+
+4. **Write Walkthrough**
+   - Document end-to-end flows
+   - Explain architectural trade-offs
+   - Create mental models for developers
+
+5. **Implement Code**
+   - Auto-generate boring parts from Spec
+   - Write business logic
+   - Implement tests
+
+### Phase 3: Quality Assurance
+
+6. **Set Up Validation**
+   - Configure CI/CD pipeline
+   - Set up contract testing
+   - Configure security scans
+
+7. **Create Runbook**
+   - Document deployment procedures
+   - Define scaling triggers
+   - Create troubleshooting guides
+
+### Phase 4: Maintenance
+
+8. **Document Maintenance**
+   - Create dependency update schedule
+   - Document secret rotation
+   - Track technical debt
+
+---
+
+## Key Principles for Success
+
+1. **Separation of Concerns**: Keep business concerns separate from technical concerns
+2. **Machine-Readable Contracts**: Use OpenAPI/Protobuf for specs to enable automation
+3. **Automation**: Automate boring parts and validation to reduce human error
+4. **Measurability**: Every document should have measurable outcomes
+5. **Version Control**: Keep all documentation in Git for history and collaboration
+6. **Living Documents**: Update documentation as the system evolves
+7. **Audience-Focused**: Write for the intended audience's needs and knowledge level
+
+---
+
+## Conclusion
+
+The SDD + GitOps Documentation Framework provides a comprehensive, structured approach to software development documentation. By following this framework, teams can ensure that:
+
+- Business goals are clearly defined and measurable
+- Technical contracts are machine-readable and enforced
+- System architecture is visualized and understood
+- Developers have clear mental models of the system
+- Code quality is maintained through automation
+- Operations are reliable and repeatable
+
+This framework is not just about documentation—it's about creating a shared understanding across the entire team and ensuring that every decision is aligned with business goals.
\ No newline at end of file
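
Review note: the Spec and Validation rows in the patched document describe two concrete rules, snake_case keys and a UUID-valued `user_id`, enforced by a CI gate. A minimal sketch of such a gate is below. The function name `find_contract_violations` and the flat, top-level-only payload shape are assumptions for illustration, not something the patch defines; a real pipeline would more likely run the contract tools the document names (Dredd, Prism) against the OpenAPI file.

```python
import re
import uuid

# snake_case: lowercase words separated by single underscores.
SNAKE_CASE = re.compile(r"^[a-z][a-z0-9]*(_[a-z0-9]+)*$")


def find_contract_violations(payload: dict) -> list[str]:
    """Return readable violations for one JSON-like payload (hypothetical helper).

    Only top-level keys are checked; nested objects are out of scope here.
    """
    violations = []
    for key in payload:
        if not SNAKE_CASE.match(key):
            violations.append(f"key '{key}' is not snake_case")

    user_id = payload.get("user_id")
    if user_id is not None:
        try:
            uuid.UUID(str(user_id))  # raises ValueError on malformed input
        except ValueError:
            violations.append("user_id is not a valid UUID")
    return violations
```

A CI step could call this on sample payloads and exit non-zero when the returned list is non-empty, which is one way to get the "blocking" behavior the Validation section asks for.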