cfn-stack-updater/.kiro/specs/one-click-cfn-stack-updater/design.md
Vijaya Manne 632ac9e328 Initial commit: One-Click CloudFormation Stack Updater
- Python CLI tool for rolling updates of CL-AppPipe-* and CL-SvcPipe-* stacks
- Async update engine with configurable concurrency (asyncio.Semaphore)
- Exponential backoff retry for API throttling
- Dry-run mode for safe preview
- IAM permission pre-validation
- Comprehensive test suite (80 tests: 11 property-based + 69 unit)
- Full spec documentation (requirements, design, tasks)
2026-05-29 14:56:59 -04:00

Design Document: One-Click CloudFormation Stack Updater

Overview

The One-Click CFN Stack Updater is a Python CLI tool that automates the rolling update of all CL-AppPipe-* CloudFormation stacks in an audit account. It discovers stacks dynamically by prefix, validates IAM permissions, and updates each stack with its existing parameters, so that the only change is the nested template URL being re-resolved to the latest template version. The tool supports concurrency control, dry-run mode, and exponential backoff on throttling, and it produces a structured summary report.

The tool is implemented as a single Python package using boto3 for AWS interactions. Python is chosen because it is widely used for AWS automation tooling, boto3 exposes the CloudFormation API directly, and the operator audience is already familiar with Python-based AWS scripts.

Architecture

The system follows a pipeline architecture with four sequential phases:

flowchart LR
    A[Permission\nValidation] --> B[Stack\nDiscovery]
    B --> C[Stack\nUpdate Engine]
    C --> D[Report\nGenerator]

  1. Permission Validation: Verifies the executing role has the required IAM permissions before any work begins.
  2. Stack Discovery: Lists all CloudFormation stacks matching the CL-AppPipe- prefix and filters to updatable states.
  3. Stack Update Engine: Updates stacks concurrently (bounded by Concurrency_Limit) with retry logic for throttling errors.
  4. Report Generator: Aggregates results and produces the final summary.
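The four phases can be sketched as a function that runs them in order and stops early when permissions are missing. This is a minimal illustration rather than the tool's actual orchestration: `run_pipeline` and its four injected callables are hypothetical names, and the string return value stands in for the real report.

```python
from typing import Callable, Sequence


def run_pipeline(
    validate: Callable[[], Sequence[str]],
    discover: Callable[[], Sequence[str]],
    update: Callable[[Sequence[str]], Sequence[str]],
    report: Callable[[Sequence[str]], str],
) -> str:
    """Run the four phases in order (hypothetical orchestration sketch)."""
    # Phase 1: stop before any other work if a permission is missing.
    missing = validate()
    if missing:
        return "missing permissions: " + ", ".join(missing)
    # Phase 2: find the target stacks.
    stacks = discover()
    # Phase 3: update them and collect per-stack results.
    results = update(stacks)
    # Phase 4: turn the results into a summary.
    return report(results)
```

Passing the phases in as callables keeps each one replaceable by a stub in tests.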

Concurrency Model

The update engine uses a semaphore-based concurrency model with asyncio to run up to Concurrency_Limit stack updates in parallel. Each update is an independent coroutine that:

  1. Fetches current stack parameters
  2. Calls UpdateStack with existing parameters and the nested template URL
  3. Polls DescribeStacks until the update completes or fails
  4. Records the result

flowchart TD
    S[Semaphore: Concurrency_Limit] --> U1[Update Stack 1]
    S --> U2[Update Stack 2]
    S --> U3[Update Stack N]
    U1 --> R[Result Collector]
    U2 --> R
    U3 --> R
    R --> Report[Summary Report]
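A minimal sketch of the semaphore-bounded model follows. `run_bounded` and `fake_update` are hypothetical names, and the worker only sleeps instead of calling CloudFormation. Because boto3 calls are blocking, a real worker would have to hand them off to a thread (for example with `asyncio.to_thread`); the design does not say how this is done, so that part is an assumption.

```python
import asyncio
from typing import Awaitable, Callable


async def run_bounded(
    names: list[str],
    worker: Callable[[str], Awaitable[str]],
    limit: int,
) -> list[str]:
    """Run worker(name) for every name with at most `limit` in flight."""
    semaphore = asyncio.Semaphore(limit)

    async def guarded(name: str) -> str:
        # Each coroutine waits here until a slot is free.
        async with semaphore:
            return await worker(name)

    # gather() returns results in the same order as `names`.
    return list(await asyncio.gather(*(guarded(name) for name in names)))


async def demo() -> tuple[list[str], int]:
    """Drive run_bounded with a fake worker and record peak concurrency."""
    in_flight = 0
    peak = 0

    async def fake_update(name: str) -> str:
        nonlocal in_flight, peak
        in_flight += 1
        peak = max(peak, in_flight)
        await asyncio.sleep(0.01)  # stands in for UpdateStack plus polling
        in_flight -= 1
        return f"{name}: succeeded"

    names = [f"CL-AppPipe-{i}" for i in range(6)]
    results = await run_bounded(names, fake_update, limit=2)
    return results, peak
```

Calling `asyncio.run(demo())` returns six results and a peak that never exceeds the limit of 2, which is the invariant the design relies on.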

Components and Interfaces

CLI Entry Point (cli.py)

Parses command-line arguments and orchestrates the pipeline.

def main(
    prefix: str = "CL-AppPipe-",
    concurrency: int = 5,
    dry_run: bool = False,
    region: str | None = None,
) -> int:
    """
    Entry point. Returns 0 on full success, 1 if any stack failed,
    2 if permission validation failed.
    """

Arguments:

| Flag | Type | Default | Description |
| --- | --- | --- | --- |
| --prefix | str | CL-AppPipe- | Stack name prefix to match |
| --concurrency | int | 5 | Max parallel updates |
| --dry-run | bool | False | Preview mode, no updates |
| --region | str | SDK default | AWS region override |
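The flags above could be declared with `argparse` roughly as follows. This is a sketch under assumptions: the program name, the help strings, and the `positive_int` validator are illustrative and not taken from the design.

```python
import argparse


def positive_int(text: str) -> int:
    """Parse a strictly positive integer (hypothetical validator)."""
    value = int(text)
    if value < 1:
        raise argparse.ArgumentTypeError("must be a positive integer")
    return value


def build_parser() -> argparse.ArgumentParser:
    """Declare the four documented flags with their documented defaults."""
    parser = argparse.ArgumentParser(prog="cfn-stack-updater")  # assumed name
    parser.add_argument("--prefix", default="CL-AppPipe-",
                        help="Stack name prefix to match")
    parser.add_argument("--concurrency", type=positive_int, default=5,
                        help="Max parallel updates")
    parser.add_argument("--dry-run", action="store_true",
                        help="Preview mode, no updates")
    parser.add_argument("--region", default=None,
                        help="AWS region override (SDK default if omitted)")
    return parser
```

For example, `build_parser().parse_args(["--dry-run", "--concurrency", "3"])` yields `dry_run=True` and `concurrency=3`, with the other values left at their defaults.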

Permission Validator (permissions.py)

def validate_permissions(cfn_client) -> list[str]:
    """
    Checks required IAM permissions by performing read-only probe API calls.
    Returns a list of missing permission names. Empty list means all OK.
    """

Validates by attempting:

  • cloudformation:ListStacks: calls list_stacks with a narrow filter
  • cloudformation:DescribeStacks: calls describe_stacks with a non-existent stack name; a ValidationError ("stack does not exist") means the permission is present, while an access-denied error means it is missing
  • cloudformation:UpdateStack: cannot be probed without side effects, so the pre-check uses IAM policy simulation via iam:SimulatePrincipalPolicy when that call is available; otherwise validation is deferred to the first real update
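The two read-only probes can be sketched as below. Several details are assumptions made for illustration: the set of access-denied error codes, the probe stack name, and the helper names `error_code` and `is_denied`. Exceptions are inspected through the `response["Error"]["Code"]` structure that botocore's `ClientError` carries, so the sketch runs against any client object that raises errors of that shape. The UpdateStack pre-check is left out.

```python
ACCESS_DENIED_CODES = frozenset({"AccessDenied", "AccessDeniedException"})  # assumed set
PROBE_STACK_NAME = "permission-probe-nonexistent-stack"  # hypothetical name


def error_code(exc: Exception) -> str:
    """Extract the AWS error code from a ClientError-shaped exception."""
    response = getattr(exc, "response", None) or {}
    return response.get("Error", {}).get("Code", "")


def is_denied(call) -> bool:
    """Return True if the probe call is rejected for lack of permission."""
    try:
        call()
    except Exception as exc:
        code = error_code(exc)
        if code in ACCESS_DENIED_CODES:
            return True
        if not code:
            raise  # not a service error (e.g. network failure): surface it
        # Any other service error (such as ValidationError for the missing
        # probe stack) means the request was authorized.
    return False


def validate_permissions(cfn_client) -> list[str]:
    """Return the names of probed permissions that appear to be missing."""
    probes = {
        "cloudformation:ListStacks": lambda: cfn_client.list_stacks(
            StackStatusFilter=["CREATE_COMPLETE"]
        ),
        "cloudformation:DescribeStacks": lambda: cfn_client.describe_stacks(
            StackName=PROBE_STACK_NAME
        ),
    }
    return [action for action, call in probes.items() if is_denied(call)]
```

Treating every non-access-denied service error as "permission present" matches the expected-error approach described for DescribeStacks.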

Stack Discovery (discovery.py)

from dataclasses import dataclass

@dataclass
class DiscoveredStack:
    name: str
    status: str
    updatable: bool

def discover_stacks(cfn_client, prefix: str) -> list[DiscoveredStack]:
    """
    Lists all stacks matching the prefix. Paginates through all results.
    Marks each stack as updatable or not based on its status.
    """

Non-updatable statuses:

  • ROLLBACK_COMPLETE
  • ROLLBACK_IN_PROGRESS
  • UPDATE_ROLLBACK_IN_PROGRESS
  • UPDATE_ROLLBACK_COMPLETE_CLEANUP_IN_PROGRESS
  • DELETE_IN_PROGRESS
  • DELETE_COMPLETE
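A possible shape for `discover_stacks` is shown below. It uses the boto3 `list_stacks` paginator and filters by prefix on the client side, since ListStacks has no name-prefix filter. The body is a sketch under the interface given above, not the tool's actual code.

```python
from dataclasses import dataclass

NON_UPDATABLE_STATUSES = frozenset({
    "ROLLBACK_COMPLETE",
    "ROLLBACK_IN_PROGRESS",
    "UPDATE_ROLLBACK_IN_PROGRESS",
    "UPDATE_ROLLBACK_COMPLETE_CLEANUP_IN_PROGRESS",
    "DELETE_IN_PROGRESS",
    "DELETE_COMPLETE",
})


@dataclass
class DiscoveredStack:
    name: str
    status: str
    updatable: bool


def discover_stacks(cfn_client, prefix: str) -> list[DiscoveredStack]:
    """List every stack whose name starts with `prefix`, across all pages."""
    found = []
    paginator = cfn_client.get_paginator("list_stacks")
    for page in paginator.paginate():
        for summary in page["StackSummaries"]:
            name = summary["StackName"]
            if not name.startswith(prefix):
                continue  # prefix filtering happens client-side
            status = summary["StackStatus"]
            found.append(
                DiscoveredStack(name, status, status not in NON_UPDATABLE_STATUSES)
            )
    return found
```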

Stack Updater (updater.py)

from dataclasses import dataclass
from typing import Literal

@dataclass
class StackUpdateResult:
    stack_name: str
    status: Literal["succeeded", "failed", "skipped", "no-update-needed"]
    error: str | None = None
    duration_seconds: float = 0.0

async def update_stack(
    cfn_client,
    stack_name: str,
    template_url: str,
    max_retries: int = 3,
) -> StackUpdateResult:
    """
    Updates a single stack. Handles 'No updates' response, throttling retries,
    and non-updatable state detection.
    """

async def update_all_stacks(
    cfn_client,
    stacks: list[DiscoveredStack],
    template_url: str,
    concurrency: int = 5,
    max_retries: int = 3,
) -> list[StackUpdateResult]:
    """
    Updates all stacks with bounded concurrency using asyncio.Semaphore.
    """

Retry logic:

  • Triggered on Throttling or RequestLimitExceeded error codes
  • Exponential backoff: base_delay * 2^attempt (base_delay = 1s)
  • Maximum 3 retries per stack
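As a worked example of the backoff formula, the helper below lists the delay that precedes each retry. `backoff_delays` is a hypothetical name used only for illustration.

```python
def backoff_delays(max_retries: int = 3, base_delay: float = 1.0) -> list[float]:
    """Delay in seconds before each retry: base_delay * 2**attempt."""
    return [base_delay * (2 ** attempt) for attempt in range(max_retries)]


# With the defaults the waits are 1s, 2s, and 4s, so a stack that is
# throttled on every attempt spends 7 seconds waiting before it is
# marked as failed.
print(backoff_delays())       # [1.0, 2.0, 4.0]
print(sum(backoff_delays()))  # 7.0
```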

Report Generator (report.py)

from dataclasses import dataclass
from datetime import datetime

@dataclass
class UpdateRunReport:
    start_time: datetime
    end_time: datetime
    total_found: int
    succeeded: int
    failed: int
    skipped: int
    no_update_needed: int
    results: list[StackUpdateResult]

def generate_report(
    results: list[StackUpdateResult],
    total_found: int,
    start_time: datetime,
    end_time: datetime,
) -> UpdateRunReport:
    """
    Aggregates results into a summary report.
    """

def format_report(report: UpdateRunReport) -> str:
    """
    Formats the report as a human-readable string for console output.
    """
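One way to implement the aggregation is to count statuses with `collections.Counter`. The sketch below redeclares the two dataclasses so that it runs on its own, and it uses a plain `str` for the status field for brevity; the function body is an assumption about how `generate_report` might work.

```python
from collections import Counter
from dataclasses import dataclass
from datetime import datetime
from typing import Optional


@dataclass
class StackUpdateResult:
    stack_name: str
    status: str  # "succeeded", "failed", "skipped", or "no-update-needed"
    error: Optional[str] = None
    duration_seconds: float = 0.0


@dataclass
class UpdateRunReport:
    start_time: datetime
    end_time: datetime
    total_found: int
    succeeded: int
    failed: int
    skipped: int
    no_update_needed: int
    results: list[StackUpdateResult]


def generate_report(
    results: list[StackUpdateResult],
    total_found: int,
    start_time: datetime,
    end_time: datetime,
) -> UpdateRunReport:
    """Count each status once; a missing status counts as zero."""
    counts = Counter(result.status for result in results)
    return UpdateRunReport(
        start_time=start_time,
        end_time=end_time,
        total_found=total_found,
        succeeded=counts["succeeded"],
        failed=counts["failed"],
        skipped=counts["skipped"],
        no_update_needed=counts["no-update-needed"],
        results=list(results),
    )
```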

Data Models

DiscoveredStack

| Field | Type | Description |
| --- | --- | --- |
| name | str | CloudFormation stack name |
| status | str | Current stack status (e.g., CREATE_COMPLETE) |
| updatable | bool | Whether the stack is in an updatable state |

StackUpdateResult

| Field | Type | Description |
| --- | --- | --- |
| stack_name | str | Name of the stack |
| status | Literal["succeeded", "failed", "skipped", "no-update-needed"] | Outcome of the update attempt |
| error | str \| None | Error message if failed |
| duration_seconds | float | Time taken for this stack's update |

UpdateRunReport

| Field | Type | Description |
| --- | --- | --- |
| start_time | datetime | When the Update_Run started |
| end_time | datetime | When the Update_Run ended |
| total_found | int | Total Target_Stacks discovered |
| succeeded | int | Count of successfully updated stacks |
| failed | int | Count of failed stacks |
| skipped | int | Count of skipped (non-updatable) stacks |
| no_update_needed | int | Count of stacks with no changes |
| results | list[StackUpdateResult] | Per-stack results |

Configuration Constants

TEMPLATE_URL = "https://s3.amazonaws.com/solutions-reference/centralized-logging-with-opensearch/latest/AppLogS3Buffer.template"
DEFAULT_PREFIX = "CL-AppPipe-"
DEFAULT_CONCURRENCY = 5
MAX_RETRIES = 3
BASE_RETRY_DELAY = 1.0  # seconds
NON_UPDATABLE_STATUSES = frozenset({
    "ROLLBACK_COMPLETE",
    "ROLLBACK_IN_PROGRESS",
    "UPDATE_ROLLBACK_IN_PROGRESS",
    "UPDATE_ROLLBACK_COMPLETE_CLEANUP_IN_PROGRESS",
    "DELETE_IN_PROGRESS",
    "DELETE_COMPLETE",
})

Correctness Properties

A property is a characteristic or behavior that should hold true across all valid executions of a system — essentially, a formal statement about what the system should do. Properties serve as the bridge between human-readable specifications and machine-verifiable correctness guarantees.

Property 1: Discovery returns exactly prefix-matched stacks with correct count

For any list of CloudFormation stacks with arbitrary names, the discovery function should return exactly those stacks whose names start with the configured prefix, and the reported count should equal the length of that filtered list.

Validates: Requirements 1.1, 1.3

Property 2: All updatable stacks are attempted

For any set of discovered stacks marked as updatable, the update engine should produce exactly one update result per updatable stack — no stack is silently dropped and no stack is attempted twice.

Validates: Requirements 2.2

Property 3: Concurrency limit invariant

For any positive concurrency limit and any list of stacks, at no point during an Update_Run should the number of concurrently in-progress stack updates exceed the specified concurrency limit.

Validates: Requirements 2.3, 3.3

Property 4: Update call preserves existing parameters and uses correct template URL

For any stack with any set of existing parameters, the UpdateStack API call should include exactly those same parameter keys with UsePreviousValue=True, and the TemplateURL argument should equal the configured Nested_Template_URL.

Validates: Requirements 3.1, 3.2
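Property 4 can be illustrated by a helper that builds the UpdateStack arguments from a stack's current parameters. `build_update_kwargs` is a hypothetical name; the input is assumed to have the shape of the `Parameters` list returned by `describe_stacks`, and the output uses the `StackName`, `TemplateURL`, and `Parameters` arguments of boto3's `update_stack`.

```python
def build_update_kwargs(
    stack_name: str,
    current_parameters: list[dict],
    template_url: str,
) -> dict:
    """Build update_stack arguments that reuse every existing parameter."""
    return {
        "StackName": stack_name,
        "TemplateURL": template_url,
        "Parameters": [
            # Only the key is sent; UsePreviousValue tells CloudFormation
            # to keep whatever value the stack already has.
            {"ParameterKey": param["ParameterKey"], "UsePreviousValue": True}
            for param in current_parameters
        ],
    }
```

Because no `ParameterValue` is ever sent, the helper cannot change a parameter by accident, which is what the property test checks.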

Property 5: "No updates" response maps to no-update-needed status

For any stack where CloudFormation returns a "No updates are to be performed" error, the resulting StackUpdateResult should have status no-update-needed (not failed).

Validates: Requirements 3.4
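A sketch of the status mapping is shown below. CloudFormation reports this case as a `ValidationError` whose message contains "No updates are to be performed"; the function name and the choice to match on a message substring are assumptions made for illustration.

```python
NO_UPDATES_MESSAGE = "No updates are to be performed"


def classify_update_error(code: str, message: str) -> str:
    """Map an UpdateStack error to a StackUpdateResult status."""
    if code == "ValidationError" and NO_UPDATES_MESSAGE in message:
        # The stack already matches the template: not a failure.
        return "no-update-needed"
    return "failed"
```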

Property 6: Non-updatable stacks are skipped

For any stack whose CloudFormation status is in the set of non-updatable statuses (e.g., ROLLBACK_COMPLETE, DELETE_IN_PROGRESS), the result should have status skipped and no UpdateStack API call should be made for that stack.

Validates: Requirements 4.2

Property 7: Fault isolation — failures do not block remaining stacks

For any list of N updatable stacks where K of them fail (including after retry exhaustion), the update engine should still produce results for all N stacks, and the number of attempted updates should equal N.

Validates: Requirements 4.1, 4.4
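Fault isolation can be sketched by wrapping each update so that an exception becomes a failed result instead of propagating. The names `isolated` and `update_all`, and the use of plain tuples in place of `StackUpdateResult`, are simplifications for this example.

```python
import asyncio
from typing import Awaitable, Callable, Optional

Result = tuple[str, str, Optional[str]]  # (stack name, status, error)


async def isolated(name: str, worker: Callable[[str], Awaitable[None]]) -> Result:
    """Run one update and convert any exception into a failed result."""
    try:
        await worker(name)
    except Exception as exc:
        return (name, "failed", str(exc))
    return (name, "succeeded", None)


async def update_all(
    names: list[str],
    worker: Callable[[str], Awaitable[None]],
) -> list[Result]:
    """Attempt every stack; one failure never cancels the others."""
    return list(await asyncio.gather(*(isolated(name, worker) for name in names)))
```

Since every coroutine handed to `gather` catches its own exceptions, the result list always has one entry per input stack.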

Property 8: Throttling triggers exponential backoff retries

For any stack that receives throttling errors, the system should retry up to MAX_RETRIES times, and the delay before retry i (counting retries from 0) should be at least BASE_RETRY_DELAY * 2^i seconds.

Validates: Requirements 4.3

Property 9: Report aggregation and exit code correctness

For any list of StackUpdateResult values, the generated report's succeeded, failed, skipped, and no_update_needed counts should equal the actual counts of each status in the input list, and the exit code should be non-zero if and only if failed > 0.

Validates: Requirements 5.2, 5.3

Property 10: Dry-run performs no updates and lists all discovered stacks

For any set of discovered stacks, when dry-run mode is enabled, zero UpdateStack API calls should be made, and the output should contain the name and current status of every discovered stack.

Validates: Requirements 6.2, 6.3

Property 11: Permission validation correctness

For any subset of required permissions that are missing, the permission validator should return exactly those missing permissions, and when any permissions are missing, the Update_Run should terminate without making any UpdateStack API calls.

Validates: Requirements 7.1, 7.2

Error Handling

Error Categories and Responses

| Error | Source | Response |
| --- | --- | --- |
| Missing IAM permissions | Permission validation phase | Report missing permissions, exit with code 2, no updates attempted |
| No stacks found | Discovery phase | Log warning, exit with code 0 (not an error) |
| Stack in non-updatable state | Update phase | Skip stack, log warning, record as skipped |
| "No updates are to be performed" | CloudFormation UpdateStack API | Treat as success, record as no-update-needed |
| Throttling / RequestLimitExceeded | CloudFormation API | Retry with exponential backoff (max 3 retries) |
| Throttling after max retries | CloudFormation API | Mark stack as failed, continue with remaining stacks |
| UpdateStack failure (other) | CloudFormation API | Log error details, mark as failed, continue with remaining stacks |
| Boto3 connection error | Network / SDK | Mark stack as failed, log error, continue |
| Invalid CLI arguments | Argument parsing | Print usage, exit with non-zero code |

Retry Strategy

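import asyncio

from botocore.exceptions import ClientError

RETRYABLE_CODES = ("Throttling", "RequestLimitExceeded")

async def retry_with_backoff(func, max_retries=3, base_delay=1.0):
    """Await func(), retrying throttling errors with exponential backoff."""
    for attempt in range(max_retries + 1):
        try:
            return await func()
        except ClientError as e:
            code = e.response["Error"]["Code"]
            if code in RETRYABLE_CODES and attempt < max_retries:
                # Waits 1s, 2s, then 4s with the default arguments.
                delay = base_delay * (2 ** attempt)
                await asyncio.sleep(delay)
            else:
                # Non-throttling error, or retries exhausted: propagate.
                raise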

Exit Codes

| Code | Meaning |
| --- | --- |
| 0 | All stacks updated successfully (or no stacks found, or dry-run) |
| 1 | One or more stacks failed to update |
| 2 | Permission validation failed |
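The exit-code rules can be written as a small pure function. `exit_code` is a hypothetical helper; it assumes that permission failures take precedence because no updates are attempted in that case.

```python
def exit_code(missing_permissions: list[str], failed: int) -> int:
    """Map the outcome of a run to the documented process exit code."""
    if missing_permissions:
        return 2  # permission validation failed; nothing was updated
    if failed > 0:
        return 1  # at least one stack failed to update
    return 0      # success, no stacks found, or dry-run
```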

Testing Strategy

Testing Framework

  • Unit tests: pytest
  • Property-based tests: hypothesis (a widely used Python property-based testing library)
  • Mocking: unittest.mock and botocore.stub.Stubber for AWS API mocking

Property-Based Tests

Each correctness property from the design maps to a single property-based test. All property tests run at least 100 examples, configured through Hypothesis settings (max_examples).

| Property | Test Description | Key Generators |
| --- | --- | --- |
| P1 | Discovery prefix filtering | Random stack name lists (some with prefix, some without) |
| P2 | All updatable stacks attempted | Random lists of DiscoveredStack with mixed updatable flags |
| P3 | Concurrency limit invariant | Random concurrency values (1-20), random stack counts (1-50) |
| P4 | Parameter preservation and template URL | Random parameter key-value dicts |
| P5 | "No updates" status mapping | Random stacks with mocked "no updates" responses |
| P6 | Non-updatable stack skipping | Random stacks with statuses drawn from updatable and non-updatable sets |
| P7 | Fault isolation | Random stack lists with random failure injection |
| P8 | Exponential backoff retries | Random retry counts (0-3), verify delay sequence |
| P9 | Report aggregation | Random lists of StackUpdateResult with random statuses |
| P10 | Dry-run no-op | Random discovered stacks, verify zero update calls |
| P11 | Permission validation | Random subsets of required permissions marked as missing |

Each test must be tagged with a comment:

# Feature: one-click-cfn-stack-updater, Property 9: Report aggregation and exit code correctness

Unit Tests

Unit tests complement property tests by covering:

  • Specific examples: Known stack names, known parameter sets, expected API responses
  • Edge cases: Empty stack list (Req 1.4), concurrency of 1, all stacks failing, all stacks already up-to-date
  • Integration points: CLI argument parsing, boto3 Stubber-based API interaction tests
  • Error conditions: Malformed API responses, unexpected exception types

Test Organization

tests/
├── test_discovery.py       # P1, P6 property tests + unit tests
├── test_updater.py         # P2, P3, P4, P5, P7, P8 property tests + unit tests
├── test_report.py          # P9 property tests + unit tests
├── test_dry_run.py         # P10 property tests + unit tests
├── test_permissions.py     # P11 property tests + unit tests
└── test_cli.py             # CLI argument parsing unit tests