
Stephen Bridwell
Senior Applied Scientist - CX Foundations, Amazon
Stephen has 10+ years of data science and ML experience, including 7+ years at Amazon. He currently architects advanced GenAI systems using AWS Bedrock and Claude that process billions of customer interactions. Previously, he led data science teams at Amazon DROID Analytics, Alexa AI, and Consumer FP&A, managing petabyte-scale Redshift clusters and deploying production ML systems.
- Amazon Bedrock
Amazon Bedrock is a fully managed service that makes high-performing foundation models from leading AI companies available through a unified API. It provides serverless access to models for text generation, image generation, embeddings, and more — without managing infrastructure. Bedrock also includes tools for building complete GenAI applications: Knowledge Bases for RAG, Guardrails for safety, and Agents for task automation.
When I first started building GenAI systems at Amazon, the landscape was fragmented. Different teams used different model APIs, each with its own authentication, rate limiting, and billing. Bedrock changed that by providing a single interface to multiple foundation models with enterprise-grade controls.
Here's what makes Bedrock different from calling model APIs directly: one API across every provider, security controls that stay inside your AWS account, and serverless infrastructure with no rate-limit plumbing to maintain.
The platform has evolved rapidly since its 2023 launch. Today, it includes latency-optimized inference for real-time applications, cross-region inference for global deployments, and prompt caching to reduce costs for repetitive workloads.
Bedrock isn't just model access — it's a platform for building production GenAI applications with enterprise security, multiple model options, and built-in tools for RAG, safety, and automation.
I've built GenAI systems at Amazon using both direct API access and Bedrock. For enterprise applications, Bedrock wins on three dimensions: security, flexibility, and operational simplicity.
Security and Compliance
When you call the OpenAI API, your data leaves your infrastructure and travels to OpenAI's servers. For many enterprise use cases — especially those involving customer data, financial information, or proprietary business logic — that's a non-starter.
Bedrock keeps everything within your AWS account:
- VPC Endpoints: Traffic never traverses the public internet
- IAM Integration: Fine-grained access control using existing AWS policies
- Data Encryption: At rest and in transit, with customer-managed keys via KMS
- Audit Logging: CloudTrail integration for compliance and governance
- Data Residency: Choose which AWS regions process your data
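As a concrete sketch, IAM integration means model access is just another policy. The example below restricts a role to invoking a single model family; the region, account placement, and model ID pattern are illustrative, so adapt them to your environment:

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "AllowClaudeInvokeOnly",
      "Effect": "Allow",
      "Action": [
        "bedrock:InvokeModel",
        "bedrock:InvokeModelWithResponseStream"
      ],
      "Resource": "arn:aws:bedrock:us-east-1::foundation-model/anthropic.claude-3-5-sonnet-*"
    }
  ]
}
```

Attach a policy like this to the execution role of whatever calls Bedrock, and deny-by-default handles the rest.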
At Amazon, these weren't nice-to-haves — they were requirements. When you're processing billions of customer interactions, you need guarantees about where data flows and who can access it.
For regulated industries (healthcare, finance), Bedrock's BAA (Business Associate Agreement) eligibility and SOC/PCI compliance can be the deciding factor. Make sure to enable CloudTrail logging and use customer-managed KMS keys from day one.
Model Flexibility
One of the most valuable lessons I've learned: the best model for your use case today won't be the best model in six months.
Bedrock's unified API means you can:
- A/B test models: Compare Claude vs. Llama vs. Mistral on your actual workloads
- Use different models for different tasks: Fast model for classification, powerful model for generation
- Upgrade seamlessly: When Claude 4 releases, switch models without rewriting code
- Avoid vendor lock-in: No single-provider dependency
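The upgrade path above is concrete because the Converse API uses the same request shape for every provider, so switching models is a one-line change. A minimal sketch, assuming AWS credentials are configured; the model IDs shown are illustrative, so confirm availability in your region:

```python
def build_messages(question: str) -> list:
    """Converse API message payload -- identical shape for every provider."""
    return [{"role": "user", "content": [{"text": question}]}]

def ask(model_id: str, question: str) -> str:
    """Send one prompt to any Bedrock model via the unified Converse API."""
    import boto3  # imported here so the pure helpers above work without AWS
    client = boto3.client("bedrock-runtime")
    resp = client.converse(
        modelId=model_id,
        messages=build_messages(question),
        inferenceConfig={"maxTokens": 512, "temperature": 0.2},
    )
    return resp["output"]["message"]["content"][0]["text"]

# Same code, different provider -- only the model ID changes:
# ask("anthropic.claude-3-5-sonnet-20240620-v1:0", "Summarize our returns policy.")
# ask("meta.llama3-1-70b-instruct-v1:0", "Summarize our returns policy.")
```

This is exactly what makes A/B testing across providers cheap: the test harness varies one string.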
| Capability | Direct API (OpenAI/Anthropic) | AWS Bedrock |
|---|---|---|
| Model Access | Single provider | 15+ providers, unified API |
| Security | External data transfer | VPC endpoints, IAM, KMS |
| Infrastructure | Self-managed rate limits | Serverless, auto-scaling |
| RAG Support | Build your own | Knowledge Bases built-in |
| Safety | Build your own | Guardrails built-in |
| Billing | Separate per provider | Consolidated AWS billing |
Operational Simplicity
Building production GenAI is more than model inference. You need:
- Rate limiting and retry logic: Bedrock handles this automatically
- Cost monitoring: Integrated with AWS Cost Explorer
- Logging and debugging: CloudWatch integration out of the box
- Team access control: IAM policies you already know
When I was managing ML infrastructure at Amazon, half the work was operational overhead. Bedrock eliminates most of that, letting you focus on the application logic.
Choose Bedrock when security, model flexibility, and operational simplicity matter more than having the absolute latest model version. For most enterprise use cases, that's the right tradeoff.
Bedrock gives you access to a growing roster of foundation models. Here's how I think about choosing between them.
The Major Players
Anthropic Claude
Claude is my go-to for most enterprise text tasks. Strong reasoning, excellent at following complex instructions, and handles long contexts well (up to 200K tokens). Claude 3.5 Sonnet offers the best balance of capability and cost.
- Best for: Complex reasoning, document analysis, code generation, agentic workflows
- Pricing: $3.00 / 1M input tokens and $15.00 / 1M output tokens for Sonnet; Haiku sits well below that, Opus well above
Meta Llama
Open-weight models with strong performance and lower costs. Good option when you want more control or need to run inference on your own infrastructure later.
- Best for: Cost-sensitive applications, fine-tuning, on-prem deployment planning
- Pricing: Competitive with Claude Haiku tier
Amazon Titan
Amazon's own models, optimized for Bedrock. Titan Text for generation, Titan Embeddings for vector search. Strong integration with other AWS services.
- Best for: AWS-native workflows, embeddings, cost-conscious deployments
- Pricing: Generally lower than third-party models
Amazon Nova
Amazon's newest model family with multimodal capabilities (text, image, video). Nova 2 Lite supports vision and is available globally.
- Best for: Multimodal applications, vision tasks, global availability
Mistral AI
European-based models with strong performance on reasoning tasks. Good option for EU data residency requirements.
- Best for: European deployments, cost-effective reasoning
Claude 3.5 Sonnet Strengths
- Excellent reasoning and instruction following
- 200K token context window
- Strong at code generation and analysis
- Good balance of capability and speed
- Supports tool use and agentic workflows
Claude 3.5 Sonnet Limitations
- Higher cost than Haiku or Llama alternatives
- No image generation (text and vision only)
- May have lower availability during peak times
- Slower than Haiku for simple tasks
My Model Selection Framework
Here's the decision tree I use:
1. What's the task complexity?
   - Simple classification/extraction → Claude Haiku or Llama
   - Complex reasoning/generation → Claude Sonnet or Opus
   - Multi-step agentic workflows → Claude Sonnet with tool use
2. What's the latency requirement?
   - Real-time (< 1 second) → Haiku or Mistral with latency optimization
   - Near-real-time (< 5 seconds) → Sonnet
   - Batch processing → Any model with batch mode
3. What's the cost constraint?
   - Cost-sensitive → Llama or Haiku
   - Balanced → Sonnet
   - Quality-first → Opus
4. What are the compliance requirements?
   - EU data residency → Mistral or models in EU regions
   - Specific certifications → Check model-specific compliance docs
Don't default to the most powerful model. I've seen teams use Claude Opus for tasks that Haiku handles perfectly at 1/10th the cost. Always benchmark on your actual workload before choosing.
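The framework above can be sketched as a tiny routing function. The category labels and return values here are illustrative shorthand, not real Bedrock model IDs:

```python
def pick_model(complexity: str, latency: str, cost: str) -> str:
    """Toy encoding of the selection framework: hard constraints (cost,
    latency) are checked first, then task complexity decides the tier."""
    if cost == "cost-sensitive":
        return "llama-or-haiku"
    if latency == "real-time":
        return "haiku"
    if complexity == "simple":
        return "haiku"
    if cost == "quality-first":
        return "opus"
    return "sonnet"  # balanced default for complex work
```

The ordering is the point: a real router should apply budget and latency ceilings before quality preferences, or the "best" model wins every time.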
Start with Claude 3.5 Sonnet for complex tasks, Claude Haiku for simple tasks, and Llama for cost-sensitive applications. Benchmark on your specific use case — model performance varies significantly by task type.
I've spent thousands of hours working with Claude on Bedrock. Here's what I've learned about getting the most out of it.
Why Claude for Enterprise
Claude has become my default choice for enterprise GenAI because of its reliable instruction following, 200K-token context window, and mature tool use.
Prompt Engineering Patterns That Scale
When you're processing billions of interactions, prompt engineering becomes a discipline, not an afterthought.
Always request structured output (JSON, XML) for programmatic consumption. Claude is reliable at generating valid JSON when explicitly instructed.
Analyze the following customer interaction and return your analysis as JSON with the following structure:
{
"intent": "one of [support, sales, feedback, complaint]",
"sentiment": "one of [positive, neutral, negative]",
"urgency": "one of [low, medium, high]",
"key_topics": ["array of topic strings"]
}
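On the consuming side, validate what comes back rather than trusting it. A minimal parser for the schema above; the prose-stripping heuristic is a pragmatic assumption, since models occasionally wrap JSON in commentary even when told not to:

```python
import json

ALLOWED_INTENTS = {"support", "sales", "feedback", "complaint"}

def parse_analysis(model_text: str) -> dict:
    """Extract and validate the JSON object the prompt asked Claude to return."""
    # Take the outermost {...} span, discarding any surrounding prose.
    start, end = model_text.find("{"), model_text.rfind("}")
    if start == -1 or end == -1:
        raise ValueError("no JSON object in model output")
    data = json.loads(model_text[start : end + 1])
    if data.get("intent") not in ALLOWED_INTENTS:
        raise ValueError(f"unexpected intent: {data.get('intent')!r}")
    return data
```

Failing loudly on schema violations is deliberate: a rejected response can be retried, while a silently malformed one corrupts downstream data.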
Don't just show typical examples — show the edge cases you care about. This dramatically improves handling of unusual inputs.
For multi-step reasoning, explicitly request step-by-step thinking. This improves accuracy and gives you debuggable intermediate outputs.
Anticipate failure modes and handle them in the prompt. "If the input is unclear or insufficient, respond with 'INSUFFICIENT_DATA' rather than guessing."
The difference between a demo and production is handling the 5% of inputs that don't fit your mental model. Spend 80% of your prompt engineering time on edge cases.
Cost Optimization
Claude pricing on Bedrock:
| Model | Input (per 1M tokens) | Output (per 1M tokens) |
|---|---|---|
| Claude 3.5 Sonnet | $3.00 | $15.00 |
| Claude 3 Opus | $15.00 | $75.00 |
| Claude 3 Haiku | $0.25 | $1.25 |
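These rates make per-request costs easy to estimate before you commit to an architecture. A small helper, with rates passed in explicitly since list prices change:

```python
def request_cost(input_tokens: int, output_tokens: int,
                 input_per_m: float, output_per_m: float) -> float:
    """Dollar cost of one request, given per-million-token rates."""
    return (input_tokens * input_per_m + output_tokens * output_per_m) / 1_000_000

# A 2,000-token ticket classified with a 100-token Haiku reply:
# request_cost(2000, 100, 0.25, 1.25) -> 0.000625 dollars
```

Multiply by daily request volume and the Haiku-vs-Opus decision usually makes itself.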
Claude on Bedrock is the enterprise workhorse — reliable instruction following, long context, and mature tool use. Invest in prompt engineering for edge cases, use prompt caching for repeated context, and right-size your model choice for each task.
Moving from prototype to production is where most GenAI projects fail. Here's how to do it right.
Architecture Patterns
For real-time applications (chatbots, live analysis), use the InvokeModel API with streaming for better perceived latency.
User Request → API Gateway → Lambda → Bedrock InvokeModel → Response
Key considerations:
- Lambda timeout: Set to 30+ seconds for complex generations
- Streaming: Use InvokeModelWithResponseStream for chat interfaces
- Error handling: Implement exponential backoff for throttling
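That last point deserves code. A generic sketch of exponential backoff with jitter; in production you would narrow `retryable` to throttling errors only (for boto3, a `ClientError` whose error code is `ThrottlingException`):

```python
import random
import time

def with_backoff(call, max_attempts=5, base_delay=0.5, retryable=(Exception,)):
    """Retry `call` with exponentially growing sleeps plus jitter."""
    for attempt in range(max_attempts):
        try:
            return call()
        except retryable:
            if attempt == max_attempts - 1:
                raise  # out of attempts; surface the error to the caller
            time.sleep(base_delay * (2 ** attempt) + random.uniform(0, 0.1))

# Usage: result = with_backoff(lambda: client.converse(...))
```

The jitter matters at scale: without it, throttled Lambdas retry in lockstep and re-throttle each other.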
For batch workloads, use SQS + Lambda or Step Functions for orchestration.
Input Queue → Lambda → Bedrock → Output Queue/S3
Key considerations:
- Dead letter queues for failed requests
- Batch API for high-volume, non-urgent processing
- S3 for storing inputs and outputs
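A sketch of the Lambda stage, with the Bedrock call and S3 write injected as plain functions so the handler logic stays unit-testable. The event shape follows SQS; the S3 key layout and field names are illustrative:

```python
import json

def handle_batch(event, invoke_model, put_object):
    """Process each SQS record: one model call, one result object written to S3.
    `invoke_model(prompt) -> str` and `put_object(key, body)` wrap boto3 calls."""
    written = []
    for record in event["Records"]:
        body = json.loads(record["body"])
        output = invoke_model(body["prompt"])
        key = f"results/{record['messageId']}.json"
        put_object(key, json.dumps({"prompt": body["prompt"], "output": output}))
        written.append(key)
    return written
```

Keying results by `messageId` also makes reprocessing from the dead letter queue idempotent.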
For complex multi-step tasks, use Bedrock Agents or build custom orchestration.
User Query → Agent → [Knowledge Base | Tool | Model] → Response
Key considerations:
- Define clear tool schemas
- Implement guardrails for agent actions
- Log all intermediate steps for debugging
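For the first point, here is a tool definition in the shape the Converse API accepts; the `lookup_order` tool and its fields are hypothetical examples:

```python
# Tool definition in the Bedrock Converse API's toolSpec shape.
# The lookup_order tool and its parameters are illustrative.
ORDER_LOOKUP_TOOL = {
    "toolSpec": {
        "name": "lookup_order",
        "description": "Fetch the current status of a customer order by ID.",
        "inputSchema": {
            "json": {
                "type": "object",
                "properties": {
                    "order_id": {"type": "string", "description": "The order ID"}
                },
                "required": ["order_id"],
            }
        },
    }
}

# Passed at request time:
# client.converse(..., toolConfig={"tools": [ORDER_LOOKUP_TOOL]})
```

Tight `required` lists and specific descriptions are what "clear tool schemas" means in practice: vague schemas produce vague tool calls.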
Error Handling and Resilience
Production systems fail. Plan for it:
- Retries with exponential backoff for throttling errors
- Fallback models when your primary is unavailable
- Timeouts and circuit breakers so one slow call doesn't cascade
- Graceful degradation paths for when no model responds
Monitoring and Observability
You can't improve what you can't measure:
- Token usage and cost per request (CloudWatch and Cost Explorer)
- Latency percentiles, not just averages
- Error and throttling rates by model and endpoint
- Output quality metrics from your evaluation framework
Production GenAI requires the same engineering discipline as any production system: retry logic, fallbacks, monitoring, and testing. The model is just one component — the surrounding infrastructure determines reliability.
- Retrieval Augmented Generation (RAG)
RAG is a pattern that enhances LLM responses by retrieving relevant information from external data sources and including it in the prompt. This grounds the model's responses in your specific data, reducing hallucinations and enabling domain-specific knowledge.
Bedrock Knowledge Bases provide managed RAG infrastructure. Here's how to use them effectively.
How Knowledge Bases Work
- Data Ingestion: Upload documents to S3 (PDF, HTML, Word, etc.)
- Chunking: Bedrock splits documents into searchable chunks
- Embedding: Each chunk is converted to a vector using Titan Embeddings or Cohere
- Storage: Vectors are stored in OpenSearch Serverless, Aurora, or Pinecone
- Retrieval: User queries are embedded and matched against stored vectors
- Augmentation: Retrieved chunks are injected into the prompt for the LLM
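Knowledge Bases handle steps 5 and 6 for you (for example through the RetrieveAndGenerate API), but the augmentation step is worth seeing explicitly. A minimal sketch of what gets built behind the scenes:

```python
def augment_prompt(question: str, chunks: list[str]) -> str:
    """Inject retrieved chunks into the prompt so answers stay grounded.
    Numbered context blocks make it easy to ask the model to cite sources."""
    context = "\n\n".join(f"[{i + 1}] {c}" for i, c in enumerate(chunks))
    return (
        "Answer using ONLY the context below. "
        "If the context is insufficient, say INSUFFICIENT_DATA.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )
```

The "ONLY the context" instruction is the hallucination guard: it gives the model permission to refuse instead of invent.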
When to Use Knowledge Bases
Good fits:
- Customer support with product documentation
- Internal knowledge assistants
- Document Q&A systems
- Policy compliance checking
Poor fits:
- Real-time data that changes frequently
- Highly structured data better served by SQL
- Tasks requiring precise numerical computation
Best Practices
The quality of your RAG system depends more on your data preparation than your model choice. Clean, well-structured documents with good metadata will outperform a better model on messy data.
Knowledge Bases simplify RAG implementation, but data quality and chunking strategy determine success. Invest in document preparation and build evaluation frameworks before scaling.
Let me walk you through a real system I built using Bedrock at Amazon — automated rule generation for customer experience protection.
The Problem
Amazon handles billions of customer interactions. Some of those interactions come from automated traffic (bots) that can degrade customer experience and platform integrity. We needed a system to:
- Analyze patterns in customer interaction data
- Identify behavioral signals that distinguish automated from legitimate traffic
- Generate detection rules that balance protection with customer accessibility
- Validate rules before deployment to avoid false positives
The scale: billions of interactions, hundreds of behavioral features, rules that needed to be interpretable by operations teams.
The Solution Architecture
Data Pipeline → Feature Engineering → Bedrock (Claude) → Rule Validation → Deployment
We built a feature engineering pipeline that extracts behavioral signals from interaction data — timing patterns, navigation sequences, technical fingerprints. These features feed into our ML models and provide context for LLM-based rule generation.
Here's where Bedrock comes in. We use Claude to:
- Analyze feature importance rankings and generate human-readable explanations
- Translate statistical patterns into detection rule logic
- Generate rule documentation for operations teams
- Suggest variations and edge case handling
The key insight: Claude excels at translating between technical signal representations and human-interpretable rules. Our prompt engineering focused on maintaining accuracy while producing output that non-technical stakeholders could understand and validate.
Every generated rule goes through automated validation:
- Backtesting against historical data
- False positive rate estimation
- Impact simulation
Claude helped here too — explaining why certain rules triggered, identifying potential false positive scenarios, and suggesting modifications.
Lessons Learned
The goal wasn't to replace human judgment — it was to augment it. Claude generates candidate rules at a pace humans can't match. Humans provide the judgment that Claude can't match. Together, we protect customer experience at scale.
Enterprise GenAI shines when it augments human expertise rather than replacing it. LLMs can translate between technical signals and human-readable logic, but validation and judgment remain human responsibilities.
The three major enterprise GenAI platforms each have distinct strengths.
| Dimension | AWS Bedrock | OpenAI API | Azure OpenAI |
|---|---|---|---|
| Model Access | 15+ providers (Claude, Llama, Titan) | GPT-4, DALL-E, Whisper | GPT-4 (same as OpenAI) |
| Best For | AWS-native enterprises | Startups, standalone apps | Azure enterprises |
| Security | VPC, IAM, KMS native | API keys, external data | Azure AD, VNet native |
| RAG/Knowledge | Knowledge Bases built-in | Assistants API | Azure AI Search integration |
| Model Freshness | Depends on provider release | Latest GPT versions first | Slight delay from OpenAI |
| Pricing | Per-token, multiple tiers | Per-token, usage caps | Per-token, commitment options |
When to Choose Bedrock
- Your infrastructure is primarily AWS
- You need access to multiple model providers
- Enterprise security (VPC, IAM) is mandatory
- You want Claude specifically (Claude on Azure is limited)
When to Choose OpenAI API
- You're building a standalone application
- You want the latest GPT models immediately
- Simpler integration outweighs enterprise features
- You're early-stage and not locked into cloud vendors
When to Choose Azure OpenAI
- Your infrastructure is primarily Azure
- You need GPT-4 with Azure's security model
- You want Microsoft's enterprise support
- You're already using Azure AI services
The Hybrid Approach
Many enterprises use multiple platforms:
- Bedrock for production workloads requiring Claude and enterprise security
- OpenAI API for experimentation with latest GPT releases
- Custom evaluation to determine which models work best for specific tasks
Choose your GenAI platform based on your cloud infrastructure and security requirements, not just model performance. For AWS enterprises, Bedrock's security integration and multi-model access make it the natural choice.
After years of building and reviewing GenAI systems, here are the mistakes I see most often.
- Treating prompts as throwaway code — they need version control, testing, and review
- Using the most powerful model for every task instead of right-sizing
- Skipping evaluation frameworks and relying on vibes for quality
- Ignoring latency requirements until after architecture is set
- Building RAG without investing in data quality and chunking strategy
- Deploying without fallback mechanisms for model unavailability
- Underestimating the importance of prompt caching for cost control
- Not involving domain experts in prompt engineering and validation
Mistake Deep Dive: Treating Prompts as Throwaway
I've seen teams iterate on prompts in Jupyter notebooks, find something that works, and copy-paste it into production. Six months later, nobody knows why the prompt has that weird clause in paragraph three, and nobody wants to touch it.
Treat prompts as production code instead:
- Version control with meaningful commit messages
- Code review for significant changes
- Automated testing against evaluation datasets
- Documentation explaining the reasoning behind key instructions
Mistake Deep Dive: Skipping Evaluation
"It seems to work" is not an evaluation framework. Without systematic evaluation, you can't:
- Measure improvement from prompt changes
- Detect regression when models are updated
- Compare models objectively
- Build confidence for stakeholders
The fix is a lightweight evaluation loop:
- Create a test dataset with ground truth answers
- Define metrics that matter for your use case
- Run evaluation automatically on prompt changes
- Track metrics over time
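The loop above needs surprisingly little code to start. A minimal harness, using exact-match accuracy as the metric; swap in whatever metric actually fits your task:

```python
def evaluate(predict, dataset):
    """Run `predict` over (input, expected) pairs; report accuracy and the
    failing cases, so every prompt change produces a number, not a vibe."""
    failures = []
    for text, expected in dataset:
        got = predict(text)
        if got != expected:
            failures.append({"input": text, "expected": expected, "got": got})
    return {"accuracy": 1 - len(failures) / len(dataset), "failures": failures}
```

Returning the failures alongside the score is deliberate: the failing cases are where the next prompt iteration comes from.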
The difference between demo and production GenAI is engineering discipline. Treat prompts as code, build evaluation frameworks, and plan for failure modes from the start.
1. Bedrock provides unified access to 15+ foundation models with enterprise-grade security — VPC endpoints, IAM, KMS encryption
2. Choose Claude for complex reasoning, Haiku for simple tasks, and Llama for cost-sensitive applications — benchmark on your workload
3. Production GenAI requires engineering discipline: retry logic, fallbacks, monitoring, and testing
4. Knowledge Bases simplify RAG, but data quality and chunking strategy determine success
5. Prompt engineering at scale means version control, testing, and treating prompts as production code
6. The best model today won't be the best model in six months — Bedrock's unified API makes switching easy
7. Enterprise GenAI augments human judgment rather than replacing it — build validation and human review into your workflows
What is AWS Bedrock?
Amazon Bedrock is a fully managed service that provides access to foundation models from leading AI providers (Anthropic, Meta, Mistral, Amazon, and others) through a unified API. It includes serverless infrastructure, enterprise security features, and tools for building complete GenAI applications like Knowledge Bases for RAG, Guardrails for safety, and Agents for task automation.
How much does AWS Bedrock cost?
Bedrock offers on-demand pricing (pay per token with no commitment), batch mode (50% discount for non-urgent workloads), and reserved capacity. For Claude 3.5 Sonnet: $3.00 per million input tokens, $15.00 per million output tokens on-demand. Prompt caching can reduce costs by up to 90% for repeated context. Additional tools like Guardrails ($0.15-0.75 per 1K text units) and Knowledge Bases have separate pricing.
What models are available on AWS Bedrock?
Bedrock provides access to models from 15+ providers: Anthropic (Claude 3.5 Sonnet, Claude 3 Opus, Claude 3 Haiku), Meta (Llama 3.1, Llama 3.2), Amazon (Titan, Nova), Mistral AI, Cohere, AI21 Labs, DeepSeek, and more. The model roster continues to expand, and you can also import custom models trained on other platforms.
Is AWS Bedrock better than OpenAI?
It depends on your requirements. Bedrock excels for AWS-native enterprises needing VPC endpoints, IAM integration, and access to multiple model providers including Claude. OpenAI API offers simpler integration and earlier access to GPT updates. Many enterprises choose Bedrock for security and compliance rather than just model performance.
What is a Knowledge Base in Bedrock?
Knowledge Bases provide managed RAG (Retrieval Augmented Generation) infrastructure. You upload documents to S3, Bedrock chunks and embeds them, stores vectors in OpenSearch or Aurora, and retrieves relevant content when users query. This grounds LLM responses in your specific data, reducing hallucinations.
What are Bedrock Guardrails?
Guardrails are configurable safety filters that block harmful content, detect PII, and enforce topic restrictions on both inputs and outputs. They help enterprises deploy GenAI safely by preventing inappropriate responses and protecting sensitive data. Pricing is based on text units processed.
How do I choose between Claude, Llama, and Titan?
Use Claude for complex reasoning, instruction following, and code generation. Use Llama for cost-sensitive applications or when you plan future on-prem deployment. Use Titan for AWS-native workflows and embeddings. Always benchmark on your specific use case — model performance varies significantly by task type.