AWS Spot Instances: Running Interruptible Workloads at 90% Discount

Vigilare Engineering

Platform Team · December 4, 2025 · 8 min read

Spot Instances are the most dramatic discount AWS offers — up to 90% below On-Demand pricing for spare EC2 capacity — with one significant constraint: AWS can reclaim them with two minutes' warning when the capacity is needed elsewhere. That constraint makes spot unsuitable for some workloads (user-facing API servers where interruption means dropped connections) and perfect for others (batch processing, training jobs, CI/CD build workers, stateless compute that can checkpoint and resume).

The organizations that use spot well run a meaningful portion of their compute at a fraction of the cost, with interruption handling that's invisible to users and upstream systems. This guide covers the mechanics and architectural patterns that make that possible.

How Spot Pricing Works

Spot prices are set by AWS for each instance type in each Availability Zone and move gradually with longer-term supply and demand. This replaced the old auction model, where you bid against other customers and prices could spike sharply. You can specify a maximum price (the default maximum is the On-Demand price), and your instances run as long as the spot price stays at or below that maximum. When the price exceeds your maximum, or when AWS needs the capacity back, AWS sends a two-minute interruption notice and then reclaims the instance (by default, by terminating it).
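The arithmetic behind the discount and the maximum-price check can be sketched in a few lines. The prices below are made-up placeholders rather than real quotes, and both function names are hypothetical.

```python
def spot_savings_pct(on_demand_hourly: float, spot_hourly: float) -> float:
    """Percentage saved by paying the spot price instead of On-Demand."""
    if on_demand_hourly <= 0:
        raise ValueError("on_demand_hourly must be positive")
    return round((1 - spot_hourly / on_demand_hourly) * 100, 1)


def within_max_price(spot_hourly: float, max_price: float) -> bool:
    """True while the current spot price does not exceed your maximum."""
    return spot_hourly <= max_price


# Hypothetical hourly prices in USD, for illustration only.
ON_DEMAND = 0.40
SPOT_NOW = 0.12

print(spot_savings_pct(ON_DEMAND, SPOT_NOW))
print(within_max_price(SPOT_NOW, max_price=ON_DEMAND))
```

Leaving the maximum at the On-Demand price, as in the last line, means price alone rarely causes an interruption; capacity reclaims remain the main risk.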

In practice, most spot instance types run for hours or days without interruption. Interruption frequency varies by instance type, region, and Availability Zone, and less contended capacity pools tend to have lower interruption rates. The EC2 Spot Instance Advisor (a public page on the AWS website) shows historical interruption frequency ranges and typical savings by instance type and region, which should inform your instance type selection.

Instance Selection for Spot

The most important principle for spot instance selection: request multiple instance types in multiple AZs. Don't commit to a single instance type when using spot. AWS's capacity pools for individual instance types can be exhausted temporarily, leading to unavailability. By requesting across multiple types with similar capabilities, you dramatically increase the probability of finding available capacity at any given time.

Configure your Auto Scaling groups or Spot Fleets with a capacity-optimized allocation strategy (which selects from pools with the most available capacity, reducing interruption risk) rather than lowest-price (which maximizes discount but concentrates instances in the cheapest pools, which are often the most contended). AWS also offers a price-capacity-optimized strategy that weighs both price and available capacity. The savings difference between these strategies is usually small; the interruption rate difference can be significant.

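Use instance type diversity: for an application needing 8 vCPUs and 16 GB RAM, configure a fleet that will accept m5.2xlarge, m5a.2xlarge, m4.2xlarge, m6i.2xlarge, m6a.2xlarge, r5.2xlarge, r5a.2xlarge. Each of these provides 8 vCPUs and at least 32 GiB of memory (the r5 xlarge sizes have only 4 vCPUs, so they would not meet the requirement). This breadth of compatible instances gives the fleet many capacity pools to draw from.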
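One way to express this diversity is a mixed instances policy on an Auto Scaling group. The sketch below only builds the policy dictionary, in the shape the EC2 Auto Scaling CreateAutoScalingGroup API accepts. The launch template name and the helper name are hypothetical, and the instance list is a subset of the types named above.

```python
def mixed_instances_policy(template_name, instance_types,
                           allocation_strategy="capacity-optimized"):
    """Build a MixedInstancesPolicy dict for an all-spot Auto Scaling group."""
    return {
        "LaunchTemplate": {
            "LaunchTemplateSpecification": {
                "LaunchTemplateName": template_name,  # hypothetical template
                "Version": "$Latest",
            },
            # One override per acceptable instance type.
            "Overrides": [{"InstanceType": t} for t in instance_types],
        },
        "InstancesDistribution": {
            # Zero On-Demand base and zero On-Demand percentage: all spot.
            "OnDemandBaseCapacity": 0,
            "OnDemandPercentageAboveBaseCapacity": 0,
            "SpotAllocationStrategy": allocation_strategy,
        },
    }


policy = mixed_instances_policy(
    "build-worker-template",
    ["m5.2xlarge", "m5a.2xlarge", "m4.2xlarge", "m6i.2xlarge", "m6a.2xlarge"],
)
print(len(policy["LaunchTemplate"]["Overrides"]))

# With boto3 installed and credentials configured, the policy could be passed as:
#   boto3.client("autoscaling").create_auto_scaling_group(
#       AutoScalingGroupName="build-workers", MinSize=0, MaxSize=20,
#       VPCZoneIdentifier="subnet-aaa,subnet-bbb",  # subnets in several AZs
#       MixedInstancesPolicy=policy)
```

Listing subnets from several Availability Zones in the group is what extends the diversity across AZs as well as instance types.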

Handling Spot Interruptions

The two-minute interruption notice arrives as both an EC2 instance metadata event (accessible from the instance at http://169.254.169.254/latest/meta-data/spot/termination-time, which returns HTTP 404 until a notice has been issued) and an EventBridge event ("EC2 Spot Instance Interruption Warning"). A properly designed spot workload polls this endpoint every few seconds and responds to the termination notice by gracefully checkpointing state, finishing the current unit of work, or signaling to the orchestration layer that the instance is being reclaimed.
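A minimal polling sketch for the metadata endpoint follows, using only the standard library. It assumes the instance uses IMDSv2 (token-based access). The function names, the five-second interval, and the injectable `fetch` and `sleep` parameters are choices made here for illustration and testability.

```python
import time
import urllib.error
import urllib.request
from datetime import datetime, timezone

IMDS_BASE = "http://169.254.169.254/latest"


def imds_get(path, timeout=1.0):
    """Return (status, body) for an IMDSv2 GET. Only works on an EC2 instance."""
    token_req = urllib.request.Request(
        f"{IMDS_BASE}/api/token", method="PUT",
        headers={"X-aws-ec2-metadata-token-ttl-seconds": "60"})
    with urllib.request.urlopen(token_req, timeout=timeout) as resp:
        token = resp.read().decode()
    req = urllib.request.Request(
        f"{IMDS_BASE}/meta-data/{path}",
        headers={"X-aws-ec2-metadata-token": token})
    try:
        with urllib.request.urlopen(req, timeout=timeout) as resp:
            return resp.status, resp.read().decode()
    except urllib.error.HTTPError as err:
        return err.code, ""  # 404 means no interruption is scheduled


def termination_time(fetch=imds_get):
    """Return the scheduled termination time (UTC), or None if there is none."""
    status, body = fetch("spot/termination-time")
    if status != 200:
        return None
    parsed = datetime.strptime(body.strip(), "%Y-%m-%dT%H:%M:%SZ")
    return parsed.replace(tzinfo=timezone.utc)


def wait_for_notice(on_notice, fetch=imds_get, interval=5.0,
                    sleep=time.sleep, max_polls=None):
    """Poll until a notice appears, then call on_notice(termination_time)."""
    polls = 0
    while max_polls is None or polls < max_polls:
        when = termination_time(fetch)
        if when is not None:
            on_notice(when)
            return when
        polls += 1
        sleep(interval)
    return None
```

On an instance, `wait_for_notice(start_shutdown)` would typically run in a background thread, with `start_shutdown` standing in for whatever checkpoint or drain routine the workload needs.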

For batch processing jobs, the practical response to an interruption notice is: finish the current item in the batch (if it'll complete within 2 minutes), checkpoint the queue position, flush any in-memory state to persistent storage, and exit gracefully. AWS's managed services, such as EMR, ECS, and Batch, have built-in support for spot interruptions when configured for spot capacity, though some of it is opt-in (ECS on EC2, for example, needs Spot instance draining enabled on the container agent). Building on these services is simpler than implementing interruption handling yourself.
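The batch response described above can be sketched as a loop that checks for an interruption between items and records its position before exiting. This assumes each item finishes well inside the two-minute window. `process`, `interrupted`, and `save_checkpoint` are hypothetical callables supplied by the workload.

```python
def run_batch(items, process, interrupted, save_checkpoint, start=0):
    """Process items from `start`, stopping cleanly if an interruption is seen.

    Returns the position of the first unprocessed item.
    """
    position = start
    while position < len(items):
        if interrupted():          # checked between items, never mid-item
            break
        process(items[position])
        position += 1
    save_checkpoint(position)      # persist where a replacement should resume
    return position


# Simulated run: the "interruption" arrives after three items.
done = []
saved = []
stopped_at = run_batch(list(range(5)), done.append,
                       lambda: len(done) >= 3, saved.append)
print(stopped_at, done, saved)
```

A replacement instance would read the saved position and call `run_batch` again with `start` set to it.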

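For stateless web workers behind a load balancer, the load balancer removes the instance from rotation once it stops passing health checks, but new requests can still be routed to it until those checks fail. Deregistering the instance as soon as the notice arrives avoids this and leaves the rest of the two-minute window for in-flight requests to complete. Ensure your application server's graceful shutdown finishes active requests within that window, and check the target group's deregistration delay: the Application Load Balancer default of 300 seconds is longer than the notice period.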
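As a rough sketch of draining, an application can track in-flight requests, refuse new ones once shutdown begins, and wait for the count to reach zero before a deadline. The class name and the 110-second default are assumptions made here, not part of any framework.

```python
import threading


class InFlightTracker:
    """Counts active requests so shutdown can wait for them to finish."""

    def __init__(self):
        self._in_flight = 0
        self._accepting = True
        self._cond = threading.Condition()

    def start_request(self):
        """Register a request; returns False once draining has begun."""
        with self._cond:
            if not self._accepting:
                return False
            self._in_flight += 1
            return True

    def finish_request(self):
        with self._cond:
            self._in_flight -= 1
            self._cond.notify_all()

    def drain(self, deadline_seconds=110.0):
        """Stop accepting work; return True if all requests finished in time."""
        with self._cond:
            self._accepting = False
            return self._cond.wait_for(lambda: self._in_flight == 0,
                                       timeout=deadline_seconds)
```

A request handler would call `start_request()` on entry and `finish_request()` in a `finally` block, while the interruption handler calls `drain()` and exits once it returns.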

Spot for Specific Use Cases

CI/CD build workers: Build jobs are inherently resumable (re-run the build from the beginning) and don't have user-facing latency requirements. Running build fleets entirely on spot is safe and delivers substantial cost savings for high-volume CI/CD environments.

ML training: Training jobs can checkpoint model state periodically. With SageMaker Managed Spot Training, AWS handles the interruption and restart automatically, and training resumes from the last checkpoint rather than from the beginning, provided your training script writes checkpoints and reloads them on startup. For training jobs that run hours or days, spot pricing plus checkpointing can reduce training costs by 70-80%.
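The checkpoint-and-resume pattern for training can be illustrated with a toy loop. The "model" here is a single number and the update is a stand-in for a real training step; the file naming scheme and the `stop_after` parameter (which simulates an interruption) are assumptions for illustration. On SageMaker the checkpoint directory would typically be the local checkpoint path, /opt/ml/checkpoints by default.

```python
import json
import os
import tempfile


def latest_step(checkpoint_dir):
    """Return the highest checkpointed step in the directory, or None."""
    steps = [int(name[len("step-"):-len(".json")])
             for name in os.listdir(checkpoint_dir)
             if name.startswith("step-") and name.endswith(".json")]
    return max(steps, default=None)


def save_state(checkpoint_dir, step, state):
    """Write a checkpoint atomically: temp file first, then rename."""
    fd, tmp_path = tempfile.mkstemp(dir=checkpoint_dir)
    with os.fdopen(fd, "w") as f:
        json.dump(state, f)
    os.replace(tmp_path, os.path.join(checkpoint_dir, f"step-{step}.json"))


def train(checkpoint_dir, total_steps, every=100, stop_after=None):
    """Run (or resume) training; returns the step reached."""
    step = latest_step(checkpoint_dir)
    if step is None:
        step, state = 0, {"weight": 0.0}
    else:
        with open(os.path.join(checkpoint_dir, f"step-{step}.json")) as f:
            state = json.load(f)
    while step < total_steps:
        if stop_after is not None and step >= stop_after:
            return step                      # simulated interruption
        state["weight"] += 0.5               # stand-in for a training step
        step += 1
        if step % every == 0 or step == total_steps:
            save_state(checkpoint_dir, step, state)
    return step
```

Work done after the last checkpoint is repeated on resume, so the checkpoint interval is a trade-off between write overhead and the amount of training lost per interruption.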

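Batch data processing: ETL pipelines, data transformation jobs, and analysis workloads that process data in chunks are natural fits for spot. Use SQS or Kinesis as the work queue so that an interrupted consumer does not lose work. With SQS, a message the consumer never deleted becomes visible again after its visibility timeout and another instance picks it up. With Kinesis, a replacement consumer resumes from the last checkpointed position in the shard.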
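For the SQS case, a consumer sketch might look like the following. The `sqs` argument is expected to behave like a boto3 SQS client (`receive_message` and `delete_message` are real client methods), while `handle`, `should_stop`, and the timeout values are hypothetical choices for this example.

```python
def consume_batch(sqs, queue_url, handle, should_stop, max_messages=10):
    """Receive one batch and process it; returns how many messages were completed.

    Messages are deleted only after `handle` succeeds. Anything left undeleted
    becomes visible to other consumers once its visibility timeout expires.
    """
    response = sqs.receive_message(
        QueueUrl=queue_url,
        MaxNumberOfMessages=max_messages,
        WaitTimeSeconds=10,        # long polling
        VisibilityTimeout=120,     # should exceed the time needed per message
    )
    completed = 0
    for message in response.get("Messages", []):
        if should_stop():          # e.g. an interruption notice was seen
            break
        handle(message["Body"])
        sqs.delete_message(QueueUrl=queue_url,
                           ReceiptHandle=message["ReceiptHandle"])
        completed += 1
    return completed


# With boto3 installed and credentials configured, usage could look like:
#   import boto3
#   consume_batch(boto3.client("sqs"), queue_url, process_item, notice_seen)
```

Because delivery is at-least-once, `handle` should be idempotent: an item processed just before an interruption may be delivered again if its delete did not go through.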

Stateless microservices: Microservices that are horizontally scalable and don't hold user session state can run on spot with appropriate load balancing and auto-recovery. The key requirement is that individual instances can terminate without affecting user experience — session state must live externally (Redis, DynamoDB), and the load balancer must handle instance removal gracefully.

Mixing Spot with On-Demand

A mixed capacity strategy — base capacity on On-Demand or Reserved Instances, bursting on Spot — provides cost efficiency without all-or-nothing spot reliance. Configure Auto Scaling groups with a base On-Demand capacity (minimum instances that are always on-demand) and spot for additional scaling. AWS Auto Scaling groups with mixed instance policies support this directly, including the ability to set what percentage of scaled capacity should be spot versus on-demand.
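In a mixed instances policy these two knobs are the `OnDemandBaseCapacity` and `OnDemandPercentageAboveBaseCapacity` settings. The sketch below estimates how a desired capacity would split between the two purchase options. Rounding fractional On-Demand capacity up is an assumption made here, so treat the result as an estimate rather than a guarantee of what the service launches.

```python
import math


def capacity_split(desired, on_demand_base, on_demand_pct_above_base):
    """Estimate (on_demand, spot) instance counts for a desired capacity."""
    base = min(desired, on_demand_base)      # the base is filled first
    above_base = desired - base
    # Assumption: fractional On-Demand capacity is rounded up.
    on_demand_above = math.ceil(above_base * on_demand_pct_above_base / 100)
    on_demand = base + on_demand_above
    return on_demand, desired - on_demand


print(capacity_split(20, on_demand_base=10, on_demand_pct_above_base=50))
print(capacity_split(8, on_demand_base=2, on_demand_pct_above_base=0))
```

A base of two with zero percent above it, as in the second call, keeps a small always-on floor while every scaled-out instance is spot.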

FAQ

Can I use Spot Instances for production databases?

Generally no, unless your database is designed for interruption: distributed databases like Cassandra or ScyllaDB that can tolerate node loss, or databases with automated failover where the standby can take over within seconds. Amazon RDS does not offer spot capacity at all, and self-managed single-node MySQL or Postgres instances and Redis without replication are not good spot candidates. The two-minute warning is not enough time for most database failover processes to complete gracefully.

How do I know if a spot instance was interrupted or crashed?

Check the instance termination reason in the EC2 console or via the API. An instance terminated due to spot interruption shows "Server.SpotInstanceShutdown" or "Server.SpotInstanceTermination" as the reason. This distinction matters for monitoring and alerting — spot interruptions are normal and expected; non-spot terminations are anomalies worth investigating.
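Via the API, the reason appears in the `StateReason` field of each instance returned by DescribeInstances. A small classifier over that structure could look like this; the function name and the two labels are hypothetical, and terminated instances remain visible to the API only for a limited time.

```python
SPOT_REASON_CODES = {
    "Server.SpotInstanceShutdown",
    "Server.SpotInstanceTermination",
}


def classify_termination(instance):
    """Label an instance dict (DescribeInstances shape) by why it went away."""
    code = instance.get("StateReason", {}).get("Code", "")
    if code in SPOT_REASON_CODES:
        return "spot-interruption"   # expected, no alert needed
    return "other"                   # worth investigating


example = {
    "InstanceId": "i-0123456789abcdef0",   # placeholder ID
    "State": {"Name": "terminated"},
    "StateReason": {"Code": "Server.SpotInstanceTermination",
                    "Message": "placeholder message"},
}
print(classify_termination(example))

# With boto3, instance dicts come from:
#   for r in boto3.client("ec2").describe_instances()["Reservations"]:
#       for inst in r["Instances"]: ...
```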

What's the difference between Spot Instances and EC2 Auto Scaling with spot?

Spot Instances can be launched directly (like On-Demand) or through Auto Scaling groups and Spot Fleet. Using Auto Scaling with spot is strongly recommended over direct launch — Auto Scaling handles interruption replacement automatically, manages multiple instance type configurations, and integrates with load balancers for seamless capacity management. Direct spot launch requires you to implement all of this yourself.
