Vijaya Manne 632ac9e328 Initial commit: One-Click CloudFormation Stack Updater

- Python CLI tool for rolling updates of CL-AppPipe-* and CL-SvcPipe-* stacks
- Async update engine with configurable concurrency (asyncio.Semaphore)
- Exponential backoff retry for API throttling
- Dry-run mode for safe preview
- IAM permission pre-validation
- Comprehensive test suite (80 tests: 11 property-based + 69 unit)
- Full spec documentation (requirements, design, tasks)

2026-05-29 14:56:59 -04:00


Design Document: One-Click CloudFormation Stack Updater

Overview

The One-Click CFN Stack Updater is a Python CLI tool that automates the rolling update of all CL-AppPipe-* CloudFormation stacks in an audit account. It discovers stacks dynamically by prefix, validates IAM permissions, and updates each stack with its existing parameters, so the only change is that the nested template URL is re-resolved to the latest version. The tool supports concurrency control, dry-run mode, and exponential backoff on throttling, and it produces a structured summary report.

The tool is implemented as a single Python package using boto3 for AWS interactions. Python is chosen because it is widely used for AWS automation tooling, boto3 provides first-class CloudFormation support, and the operator audience is already familiar with Python-based AWS scripts.

Architecture

The system follows a pipeline architecture with four sequential phases:

flowchart LR
    A[Permission\nValidation] --> B[Stack\nDiscovery]
    B --> C[Stack\nUpdate Engine]
    C --> D[Report\nGenerator]

Permission Validation — Verifies the executing role has the required IAM permissions before any work begins.
Stack Discovery — Lists all CloudFormation stacks matching the CL-AppPipe- prefix and filters to updatable states.
Stack Update Engine — Updates stacks concurrently (bounded by Concurrency_Limit) with retry logic for throttling errors.
Report Generator — Aggregates results and produces the final summary.

Concurrency Model

The update engine uses a semaphore-based concurrency model with asyncio to run up to Concurrency_Limit stack updates in parallel. Each update is an independent coroutine that:

1. Fetches current stack parameters
2. Calls UpdateStack with existing parameters and the nested template URL
3. Polls DescribeStacks until the update completes or fails
4. Records the result

flowchart TD
    S[Semaphore: Concurrency_Limit] --> U1[Update Stack 1]
    S --> U2[Update Stack 2]
    S --> U3[Update Stack N]
    U1 --> R[Result Collector]
    U2 --> R
    U3 --> R
    R --> Report[Summary Report]
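
The semaphore model above can be sketched in a few lines. This is a minimal illustration, not the tool's actual engine: `run_bounded` and `fake_update` are hypothetical names, and the fake update stands in for the real UpdateStack and polling sequence. The `peak` counter is only there to make the concurrency bound observable.

```python
import asyncio


async def run_bounded(names, concurrency, update_one):
    """Run update_one(name) for every name, at most `concurrency` at a time."""
    semaphore = asyncio.Semaphore(concurrency)
    in_flight = 0
    peak = 0

    async def guarded(name):
        nonlocal in_flight, peak
        async with semaphore:
            in_flight += 1
            peak = max(peak, in_flight)
            try:
                return await update_one(name)
            finally:
                in_flight -= 1

    # gather() returns results in the same order as the input names.
    results = await asyncio.gather(*(guarded(n) for n in names))
    return results, peak


async def fake_update(name):
    # Hypothetical stand-in for the real UpdateStack + polling sequence.
    await asyncio.sleep(0.01)
    return (name, "succeeded")


results, peak = asyncio.run(
    run_bounded([f"CL-AppPipe-{i}" for i in range(10)], 3, fake_update)
)
```

All ten coroutines are created up front, but only three hold the semaphore at any moment, so `peak` never exceeds the limit.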

Components and Interfaces

CLI Entry Point (`cli.py`)

Parses command-line arguments and orchestrates the pipeline.

def main(
    prefix: str = "CL-AppPipe-",
    concurrency: int = 5,
    dry_run: bool = False,
    region: str | None = None,
) -> int:
    """
    Entry point. Returns 0 on success, 1 if any stack failed,
    2 if permission validation failed.
    """

Arguments:

Flag	Type	Default	Description
`--prefix`	`str`	`CL-AppPipe-`	Stack name prefix to match
`--concurrency`	`int`	`5`	Max parallel updates
`--dry-run`	`bool`	`False`	Preview mode, no updates
`--region`	`str`	SDK default	AWS region override
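
One way to wire these flags is with `argparse` from the standard library. The sketch below is an assumption about how the parser could look: the program name `cfn-stack-updater` and the `positive_int` validator are hypothetical and do not come from the design.

```python
import argparse


def positive_int(text):
    # Hypothetical validator: the design only says the limit is positive.
    value = int(text)
    if value < 1:
        raise argparse.ArgumentTypeError("concurrency must be at least 1")
    return value


def build_parser():
    parser = argparse.ArgumentParser(prog="cfn-stack-updater")  # name is hypothetical
    parser.add_argument("--prefix", default="CL-AppPipe-",
                        help="Stack name prefix to match")
    parser.add_argument("--concurrency", type=positive_int, default=5,
                        help="Max parallel updates")
    parser.add_argument("--dry-run", action="store_true",
                        help="Preview mode, no updates")
    parser.add_argument("--region", default=None,
                        help="AWS region override (SDK default if omitted)")
    return parser


args = build_parser().parse_args(["--dry-run", "--concurrency", "8"])
```

Note that `argparse` itself exits with status 2 on invalid arguments, which may be worth keeping in mind when assigning other meanings to that exit code.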

Permission Validator (`permissions.py`)

def validate_permissions(cfn_client) -> list[str]:
    """
    Checks required IAM permissions by issuing read-only probe calls.
    Returns a list of missing permission names. Empty list means all OK.
    """

Validates by attempting:

cloudformation:ListStacks — calls list_stacks with a narrow filter
cloudformation:DescribeStacks — calls describe_stacks with a non-existent stack name (expects specific error)
cloudformation:UpdateStack — validated implicitly during updates; pre-check uses IAM policy simulation via iam:SimulatePrincipalPolicy if available, otherwise deferred
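
The probing approach for the first two permissions might look like the sketch below. It makes several assumptions: the set of access-denied error codes, the probe stack name, and the helper names are all illustrative. To stay runnable without AWS credentials, it reads the error code from the exception's `response` attribute (the shape botocore's `ClientError` uses) and exercises the logic with a fake client.

```python
# Assumed set of codes that indicate a denied call; adjust to what the API returns.
ACCESS_DENIED_CODES = {"AccessDenied", "AccessDeniedException", "UnauthorizedOperation"}


def error_code(exc):
    response = getattr(exc, "response", None) or {}
    return response.get("Error", {}).get("Code", "")


def is_denied(call):
    """Return True if the probe call is rejected for lack of permission."""
    try:
        call()
    except Exception as exc:
        code = error_code(exc)
        if code in ACCESS_DENIED_CODES:
            return True
        if not code:
            raise  # not an API error (e.g. network failure): do not guess
        # Any other API error (such as ValidationError for a missing
        # stack) means the request was authorized.
    return False


def validate_permissions(cfn_client):
    probes = [
        ("cloudformation:ListStacks",
         lambda: cfn_client.list_stacks(StackStatusFilter=["CREATE_COMPLETE"])),
        ("cloudformation:DescribeStacks",
         lambda: cfn_client.describe_stacks(StackName="permission-probe-nonexistent")),
    ]
    return [name for name, call in probes if is_denied(call)]


class FakeError(Exception):
    def __init__(self, code):
        super().__init__(code)
        self.response = {"Error": {"Code": code}}


class FakeClient:
    def list_stacks(self, **kwargs):
        raise FakeError("AccessDenied")

    def describe_stacks(self, **kwargs):
        raise FakeError("ValidationError")


missing = validate_permissions(FakeClient())
```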

Stack Discovery (`discovery.py`)

@dataclass
class DiscoveredStack:
    name: str
    status: str
    updatable: bool

def discover_stacks(cfn_client, prefix: str) -> list[DiscoveredStack]:
    """
    Lists all stacks matching the prefix. Paginates through all results.
    Marks each stack as updatable or not based on its status.
    """

Non-updatable statuses:

ROLLBACK_COMPLETE
ROLLBACK_IN_PROGRESS
UPDATE_ROLLBACK_IN_PROGRESS
UPDATE_ROLLBACK_COMPLETE_CLEANUP_IN_PROGRESS
DELETE_IN_PROGRESS
DELETE_COMPLETE
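
A minimal sketch of discovery, assuming the boto3 `list_stacks` paginator is used and the prefix filter is applied client-side (ListStacks filters by status, not by name). The fake client and paginator classes are hypothetical test stand-ins.

```python
from dataclasses import dataclass

NON_UPDATABLE_STATUSES = frozenset({
    "ROLLBACK_COMPLETE",
    "ROLLBACK_IN_PROGRESS",
    "UPDATE_ROLLBACK_IN_PROGRESS",
    "UPDATE_ROLLBACK_COMPLETE_CLEANUP_IN_PROGRESS",
    "DELETE_IN_PROGRESS",
    "DELETE_COMPLETE",
})


@dataclass
class DiscoveredStack:
    name: str
    status: str
    updatable: bool


def discover_stacks(cfn_client, prefix):
    found = []
    # Walk every page so stacks beyond the first response are not missed.
    for page in cfn_client.get_paginator("list_stacks").paginate():
        for summary in page["StackSummaries"]:
            name = summary["StackName"]
            if name.startswith(prefix):
                status = summary["StackStatus"]
                found.append(
                    DiscoveredStack(name, status, status not in NON_UPDATABLE_STATUSES)
                )
    return found


class FakePaginator:
    def __init__(self, pages):
        self.pages = pages

    def paginate(self):
        return iter(self.pages)


class FakeClient:
    def __init__(self, pages):
        self.pages = pages

    def get_paginator(self, operation_name):
        return FakePaginator(self.pages)


pages = [
    {"StackSummaries": [
        {"StackName": "CL-AppPipe-a", "StackStatus": "UPDATE_COMPLETE"},
        {"StackName": "Other-stack", "StackStatus": "CREATE_COMPLETE"},
    ]},
    {"StackSummaries": [
        {"StackName": "CL-AppPipe-b", "StackStatus": "ROLLBACK_COMPLETE"},
    ]},
]
stacks = discover_stacks(FakeClient(pages), "CL-AppPipe-")
```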

Stack Updater (`updater.py`)

@dataclass
class StackUpdateResult:
    stack_name: str
    status: Literal["succeeded", "failed", "skipped", "no-update-needed"]
    error: str | None = None
    duration_seconds: float = 0.0

async def update_stack(
    cfn_client,
    stack_name: str,
    template_url: str,
    max_retries: int = 3,
) -> StackUpdateResult:
    """
    Updates a single stack. Handles 'No updates' response, throttling retries,
    and non-updatable state detection.
    """

async def update_all_stacks(
    cfn_client,
    stacks: list[DiscoveredStack],
    template_url: str,
    concurrency: int = 5,
    max_retries: int = 3,
) -> list[StackUpdateResult]:
    """
    Updates all stacks with bounded concurrency using asyncio.Semaphore.
    """

Retry logic:

Triggered on Throttling or RequestLimitExceeded error codes
Exponential backoff: base_delay * 2^attempt (base_delay = 1s)
Maximum 3 retries per stack
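
To make the schedule concrete, the delays implied by the formula can be computed directly. `backoff_delays` is a hypothetical helper used only for illustration.

```python
def backoff_delays(max_retries=3, base_delay=1.0):
    """Delay in seconds before each retry: base_delay * 2^attempt."""
    return [base_delay * (2 ** attempt) for attempt in range(max_retries)]


delays = backoff_delays()
total_wait = sum(delays)
```

With the defaults this gives waits of 1 s, 2 s, and 4 s, so a stack that is throttled on every attempt spends 7 s sleeping before it is marked as failed.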

Report Generator (`report.py`)

@dataclass
class UpdateRunReport:
    start_time: datetime
    end_time: datetime
    total_found: int
    succeeded: int
    failed: int
    skipped: int
    no_update_needed: int
    results: list[StackUpdateResult]

def generate_report(
    results: list[StackUpdateResult],
    total_found: int,
    start_time: datetime,
    end_time: datetime,
) -> UpdateRunReport:
    """
    Aggregates results into a summary report.
    """

def format_report(report: UpdateRunReport) -> str:
    """
    Formats the report as a human-readable string for console output.
    """

Data Models

DiscoveredStack

Field	Type	Description
`name`	`str`	CloudFormation stack name
`status`	`str`	Current stack status (e.g., `CREATE_COMPLETE`)
`updatable`	`bool`	Whether the stack is in an updatable state

StackUpdateResult

Field	Type	Description
`stack_name`	`str`	Name of the stack
`status`	`Literal["succeeded", "failed", "skipped", "no-update-needed"]`	Outcome of the update attempt
`error`	`str | None`	Error message if failed
`duration_seconds`	`float`	Time taken for this stack's update

UpdateRunReport

Field	Type	Description
`start_time`	`datetime`	When the Update_Run started
`end_time`	`datetime`	When the Update_Run ended
`total_found`	`int`	Total Target_Stacks discovered
`succeeded`	`int`	Count of successfully updated stacks
`failed`	`int`	Count of failed stacks
`skipped`	`int`	Count of skipped (non-updatable) stacks
`no_update_needed`	`int`	Count of stacks with no changes
`results`	`list[StackUpdateResult]`	Per-stack results
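
The aggregation performed by `generate_report` can be sketched with `collections.Counter`. This is one possible implementation under the data models above, not necessarily the one in `report.py`; the output wording of `format_report` is an assumption.

```python
from collections import Counter
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional


@dataclass
class StackUpdateResult:
    stack_name: str
    status: str  # "succeeded", "failed", "skipped", or "no-update-needed"
    error: Optional[str] = None
    duration_seconds: float = 0.0


@dataclass
class UpdateRunReport:
    start_time: datetime
    end_time: datetime
    total_found: int
    succeeded: int
    failed: int
    skipped: int
    no_update_needed: int
    results: list


def generate_report(results, total_found, start_time, end_time):
    # Tally each outcome; statuses that never occur count as zero.
    counts = Counter(r.status for r in results)
    return UpdateRunReport(
        start_time=start_time,
        end_time=end_time,
        total_found=total_found,
        succeeded=counts["succeeded"],
        failed=counts["failed"],
        skipped=counts["skipped"],
        no_update_needed=counts["no-update-needed"],
        results=list(results),
    )


def format_report(report):
    elapsed = (report.end_time - report.start_time).total_seconds()
    return (
        f"Found {report.total_found} stacks in {elapsed:.1f}s: "
        f"{report.succeeded} succeeded, {report.failed} failed, "
        f"{report.skipped} skipped, {report.no_update_needed} no update needed"
    )


report = generate_report(
    [
        StackUpdateResult("CL-AppPipe-a", "succeeded"),
        StackUpdateResult("CL-AppPipe-b", "failed", error="throttled"),
        StackUpdateResult("CL-AppPipe-c", "no-update-needed"),
    ],
    total_found=3,
    start_time=datetime(2026, 1, 1, 12, 0, 0, tzinfo=timezone.utc),
    end_time=datetime(2026, 1, 1, 12, 0, 30, tzinfo=timezone.utc),
)
summary = format_report(report)
```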

Configuration Constants

TEMPLATE_URL = "https://s3.amazonaws.com/solutions-reference/centralized-logging-with-opensearch/latest/AppLogS3Buffer.template"
DEFAULT_PREFIX = "CL-AppPipe-"
DEFAULT_CONCURRENCY = 5
MAX_RETRIES = 3
BASE_RETRY_DELAY = 1.0  # seconds
NON_UPDATABLE_STATUSES = frozenset({
    "ROLLBACK_COMPLETE",
    "ROLLBACK_IN_PROGRESS",
    "UPDATE_ROLLBACK_IN_PROGRESS",
    "UPDATE_ROLLBACK_COMPLETE_CLEANUP_IN_PROGRESS",
    "DELETE_IN_PROGRESS",
    "DELETE_COMPLETE",
})

Correctness Properties

A property is a characteristic or behavior that should hold true across all valid executions of a system — essentially, a formal statement about what the system should do. Properties serve as the bridge between human-readable specifications and machine-verifiable correctness guarantees.

Property 1: Discovery returns exactly prefix-matched stacks with correct count

For any list of CloudFormation stacks with arbitrary names, the discovery function should return exactly those stacks whose names start with the configured prefix, and the reported count should equal the length of that filtered list.

Validates: Requirements 1.1, 1.3

Property 2: All updatable stacks are attempted

For any set of discovered stacks marked as updatable, the update engine should produce exactly one update result per updatable stack — no stack is silently dropped and no stack is attempted twice.

Validates: Requirements 2.2

Property 3: Concurrency limit invariant

For any positive concurrency limit and any list of stacks, at no point during an Update_Run should the number of concurrently in-progress stack updates exceed the specified concurrency limit.

Validates: Requirements 2.3, 3.3

Property 4: Update call preserves existing parameters and uses correct template URL

For any stack with any set of existing parameters, the UpdateStack API call should include exactly those same parameter keys with UsePreviousValue=True, and the TemplateURL argument should equal the configured Nested_Template_URL.

Validates: Requirements 3.1, 3.2
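
As an illustration of this property, the UpdateStack arguments could be derived from the `Parameters` list that DescribeStacks returns. `build_update_kwargs` is a hypothetical helper and the template URL in the example is a placeholder.

```python
def build_update_kwargs(stack_name, existing_parameters, template_url):
    """Build UpdateStack keyword arguments that reuse every current parameter value."""
    return {
        "StackName": stack_name,
        "TemplateURL": template_url,
        "Parameters": [
            # Send only the key; CloudFormation keeps the stored value.
            {"ParameterKey": p["ParameterKey"], "UsePreviousValue": True}
            for p in existing_parameters
        ],
    }


existing = [
    {"ParameterKey": "LogBucket", "ParameterValue": "my-bucket"},
    {"ParameterKey": "ShardCount", "ParameterValue": "2"},
]
kwargs = build_update_kwargs(
    "CL-AppPipe-a", existing, "https://example.invalid/placeholder.template"
)
```

The resulting dictionary could then be passed as `cfn_client.update_stack(**kwargs)`. A real call may also need a `Capabilities` argument if the template creates IAM resources; the design does not specify this.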

Property 5: "No updates" response maps to no-update-needed status

For any stack where CloudFormation returns a "No updates are to be performed" error, the resulting StackUpdateResult should have status no-update-needed (not failed).

Validates: Requirements 3.4
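
A sketch of the status mapping, assuming the error arrives as a `ValidationError` whose message contains the quoted text. `classify_update_error` is a hypothetical helper; in real code the code and message would come from `ClientError.response["Error"]`.

```python
NO_UPDATES_MARKER = "No updates are to be performed"


def classify_update_error(code, message):
    """Map an UpdateStack error to a StackUpdateResult status."""
    if code == "ValidationError" and NO_UPDATES_MARKER in message:
        return "no-update-needed"
    return "failed"


no_change = classify_update_error("ValidationError", "No updates are to be performed.")
real_error = classify_update_error("ValidationError", "Parameter X is required")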

Property 6: Non-updatable stacks are skipped

For any stack whose CloudFormation status is in the set of non-updatable statuses (e.g., ROLLBACK_COMPLETE, DELETE_IN_PROGRESS), the result should have status skipped and no UpdateStack API call should be made for that stack.

Validates: Requirements 4.2

Property 7: Fault isolation — failures do not block remaining stacks

For any list of N updatable stacks where K of them fail (including after retry exhaustion), the update engine should still produce results for all N stacks, and the number of attempted updates should equal N.

Validates: Requirements 4.1, 4.4

Property 8: Throttling triggers exponential backoff retries

For any stack that receives throttling errors, the system should retry up to MAX_RETRIES times, and the delay before retry i (counting from i = 0) should be at least BASE_RETRY_DELAY * 2^i seconds, i.e. 1 s, 2 s, and 4 s with the default settings.

Validates: Requirements 4.3

Property 9: Report aggregation and exit code correctness

For any list of StackUpdateResult values, the generated report's succeeded, failed, skipped, and no_update_needed counts should equal the actual counts of each status in the input list, and the exit code should be non-zero if and only if failed > 0.

Validates: Requirements 5.2, 5.3

Property 10: Dry-run performs no updates and lists all discovered stacks

For any set of discovered stacks, when dry-run mode is enabled, zero UpdateStack API calls should be made, and the output should contain the name and current status of every discovered stack.

Validates: Requirements 6.2, 6.3
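
One simple way to satisfy this property is to branch before the update engine is ever invoked. The sketch below is an assumption about structure: `run`, its `update_fn` callback, and the output wording are hypothetical.

```python
def run(stacks, dry_run, update_fn, out=print):
    """stacks is a list of (name, status) pairs; update_fn performs one update."""
    if dry_run:
        # Report every discovered stack and return before any update call.
        for name, status in stacks:
            out(f"[dry-run] would update {name} (current status: {status})")
        return []
    return [update_fn(name) for name, _status in stacks]


calls = []
lines = []
stacks = [("CL-AppPipe-a", "UPDATE_COMPLETE"), ("CL-AppPipe-b", "ROLLBACK_COMPLETE")]
outcome = run(stacks, dry_run=True, update_fn=calls.append, out=lines.append)
```

Recording calls in a list, as done here, is also how a test could verify that zero updates were attempted.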

Property 11: Permission validation correctness

For any subset of required permissions that are missing, the permission validator should return exactly those missing permissions, and when any permissions are missing, the Update_Run should terminate without making any UpdateStack API calls.

Validates: Requirements 7.1, 7.2

Error Handling

Error Categories and Responses

Error	Source	Response
Missing IAM permissions	Permission validation phase	Report missing permissions, exit with non-zero code, no updates attempted
No stacks found	Discovery phase	Log warning, exit with code 0 (not an error)
Stack in non-updatable state	Update phase	Skip stack, log warning, record as `skipped`
"No updates to be performed"	CloudFormation UpdateStack API	Treat as success, record as `no-update-needed`
Throttling / RequestLimitExceeded	CloudFormation API	Retry with exponential backoff (max 3 retries)
Throttling after max retries	CloudFormation API	Mark stack as `failed`, continue with remaining stacks
UpdateStack failure (other)	CloudFormation API	Log error details, mark as `failed`, continue with remaining stacks
Boto3 connection error	Network / SDK	Mark stack as `failed`, log error, continue
Invalid CLI arguments	Argument parsing	Print usage, exit with non-zero code

Retry Strategy

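import asyncio

from botocore.exceptions import ClientError

THROTTLING_CODES = ("Throttling", "RequestLimitExceeded")

async def retry_with_backoff(func, max_retries=3, base_delay=1.0):
    # One initial attempt plus up to max_retries retries.
    for attempt in range(max_retries + 1):
        try:
            return await func()
        except ClientError as e:
            code = e.response["Error"]["Code"]
            if code in THROTTLING_CODES and attempt < max_retries:
                # Wait 1s, 2s, 4s, ... before the next attempt.
                delay = base_delay * (2 ** attempt)
                await asyncio.sleep(delay)
            else:
                # Non-throttling error, or retries exhausted: propagate.
                raise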

Exit Codes

Code	Meaning
`0`	All stacks updated successfully (or no stacks found, or dry-run)
`1`	One or more stacks failed to update
`2`	Permission validation failed
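
The table can be expressed as a small function. `compute_exit_code` is a hypothetical name, and the precedence of code 2 over code 1 follows from permission validation running before any update.

```python
def compute_exit_code(missing_permissions, failed_count):
    """Map the outcome of a run to the process exit code."""
    if missing_permissions:
        return 2  # permission validation failed, nothing was attempted
    if failed_count > 0:
        return 1  # at least one stack failed to update
    return 0      # full success, no stacks found, or dry-run


codes = [
    compute_exit_code([], 0),
    compute_exit_code([], 3),
    compute_exit_code(["cloudformation:UpdateStack"], 0),
]
```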

Testing Strategy

Testing Framework

Unit tests: pytest
Property-based tests: hypothesis (a widely used property-based testing library for Python)
Mocking: unittest.mock and botocore.stub.Stubber for AWS API mocking

Property-Based Tests

Each correctness property from the design maps to a single property-based test. Every property test runs at least 100 examples, set through Hypothesis `settings(max_examples=...)`.

Property	Test Description	Key Generators
P1	Discovery prefix filtering	Random stack name lists (some with prefix, some without)
P2	All updatable stacks attempted	Random lists of DiscoveredStack with mixed updatable flags
P3	Concurrency limit invariant	Random concurrency values (1–20), random stack counts (1–50)
P4	Parameter preservation and template URL	Random parameter key-value dicts
P5	"No updates" status mapping	Random stacks with mocked "no updates" responses
P6	Non-updatable stack skipping	Random stacks with statuses drawn from updatable and non-updatable sets
P7	Fault isolation	Random stack lists with random failure injection
P8	Exponential backoff retries	Random retry counts (0–3), verify delay sequence
P9	Report aggregation	Random lists of StackUpdateResult with random statuses
P10	Dry-run no-op	Random discovered stacks, verify zero update calls
P11	Permission validation	Random subsets of required permissions marked as missing

Each test must be tagged with a comment:

# Feature: one-click-cfn-stack-updater, Property 9: Report aggregation and exit code correctness

Unit Tests

Unit tests complement property tests by covering:

Specific examples: Known stack names, known parameter sets, expected API responses
Edge cases: Empty stack list (Req 1.4), concurrency of 1, all stacks failing, all stacks already up-to-date
Integration points: CLI argument parsing, boto3 Stubber-based API interaction tests
Error conditions: Malformed API responses, unexpected exception types

Test Organization

tests/
├── test_discovery.py       # P1, P6 property tests + unit tests
├── test_updater.py         # P2, P3, P4, P5, P7, P8 property tests + unit tests
├── test_report.py          # P9 property tests + unit tests
├── test_dry_run.py         # P10 property tests + unit tests
├── test_permissions.py     # P11 property tests + unit tests
└── test_cli.py             # CLI argument parsing unit tests
