RenewableOps AI Assistant - Part 2 - Model Economics

This blog post briefly covers how we evaluated the different Amazon Nova models for each type of agent interaction.

In Part 1 of this series, we explored the architecture of our multi-agent RenewableOps AI Assistant, built using the AWS Agent Squad framework with Amazon Nova models. We demonstrated how different Nova models (Micro, Lite, and Pro) were strategically assigned to various agents based on their specific capabilities and cost-performance characteristics. Now, in Part 2, we'll dive deep into the model economics and evaluation methodology that guided our model selection decisions.

The Economics of Model Selection

When building production AI systems, the choice of foundation model isn't just about accuracy; it's about finding the optimal balance between cost, latency, and performance. Our evaluation of Amazon Nova models revealed compelling economic advantages that make them well suited to enterprise applications.

Amazon Nova Pricing Analysis

The Nova family offers industry-leading price-performance across three tiers (a worked per-query cost comparison follows the list):

Amazon Nova Micro - The most cost-effective option:

  • Input tokens: $0.035 per 1M tokens
  • Output tokens: $0.14 per 1M tokens
  • Speed: 195+ tokens per second
  • Context window: 128K tokens
  • Use case: Text-only tasks requiring ultra-low latency

Amazon Nova Lite - Balanced multimodal capabilities:

  • Input tokens: $0.06 per 1M tokens
  • Output tokens: $0.24 per 1M tokens
  • Speed: 146+ tokens per second
  • Context window: 300K tokens
  • Use case: Image, video, and text processing

Amazon Nova Pro - Premium performance:

  • Input tokens: $0.80 per 1M tokens
  • Output tokens: $3.20 per 1M tokens
  • Speed: 90+ tokens per second
  • Context window: 300K tokens
  • Use case: Complex reasoning and multimodal analysis
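
To make these prices concrete, the sketch below compares per-query cost across the three tiers for a hypothetical workload of 2,000 input tokens and 500 output tokens per query (illustrative numbers, not our production averages):

# Illustrative per-query cost comparison across the Nova tiers.
# Prices are per 1M tokens as listed above; token counts are hypothetical.
PRICES = {  # (input $/1M tokens, output $/1M tokens)
    "nova-micro": (0.035, 0.14),
    "nova-lite": (0.06, 0.24),
    "nova-pro": (0.80, 3.20),
}

def cost_per_query(model: str, input_tokens: int = 2_000, output_tokens: int = 500) -> float:
    in_price, out_price = PRICES[model]
    return (input_tokens / 1_000_000) * in_price + (output_tokens / 1_000_000) * out_price

for model in PRICES:
    print(f"{model}: ${cost_per_query(model):.6f} per query")
# nova-micro: $0.000140, nova-lite: $0.000240, nova-pro: $0.003200

At this token mix, Nova Pro costs roughly 23x Nova Micro per query, which is why routing simpler queries away from Pro drives most of the savings discussed later in this post.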

DeepEval for Model Evaluation

Selecting the right evaluation framework was crucial for systematically comparing model performance. We chose DeepEval for several compelling reasons:

Comprehensive Evaluation Metrics

DeepEval provides a rich suite of evaluation metrics that align with our use case requirements (a short usage sketch follows the list):

  • Traditional NLP Metrics:
    • ROUGE-L for summarization quality
    • BLEU scores for translation tasks
    • Semantic similarity measures
  • LLM-Specific Metrics:
    • Answer relevancy
    • Faithfulness and hallucination detection
    • Contextual precision and recall
    • G-Eval for custom criteria
  • RAG-Specific Metrics:
    • Contextual relevancy
    • Retrieval quality assessment
    • Answer groundedness
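
As an illustration of the LLM-judged metrics in this list, the snippet below scores one test case with answer relevancy and a custom G-Eval criterion. This is a minimal sketch against DeepEval's public API; the example question, criterion text, and thresholds are illustrative, and the judge model defaults to an OpenAI model unless you pass a custom one (such as the Nova wrapper sketched in the next subsection).

# Minimal sketch: LLM-judged DeepEval metrics on a single test case.
# Example content, criterion, and thresholds are illustrative.
from deepeval import evaluate
from deepeval.metrics import AnswerRelevancyMetric, GEval
from deepeval.test_case import LLMTestCase, LLMTestCaseParams

test_case = LLMTestCase(
    input="What is the rated capacity of the V150 turbine?",
    actual_output="The V150 has a rated capacity of 4.2 MW.",
    expected_output="The V150 turbine is rated at 4.2 MW.",
)

relevancy = AnswerRelevancyMetric(threshold=0.7)
technical_accuracy = GEval(
    name="Technical Accuracy",
    criteria="Check that turbine specifications in the actual output match the expected output.",
    evaluation_params=[LLMTestCaseParams.ACTUAL_OUTPUT, LLMTestCaseParams.EXPECTED_OUTPUT],
    threshold=0.5,
)

evaluate(test_cases=[test_case], metrics=[relevancy, technical_accuracy])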

Integration with Multiple Frameworks

DeepEval integrates seamlessly with our existing tech stack (a sketch of wiring a Nova model into DeepEval as a custom model follows the list):

  • LangChain integration for agent workflows
  • OpenAI SDK compatibility for Nova models
  • Custom metric support for domain-specific evaluation
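
One way to point DeepEval's LLM-judged metrics at Nova rather than the default OpenAI judge is a custom model wrapper. The sketch below is a minimal implementation of DeepEval's DeepEvalBaseLLM interface on top of the Bedrock Converse API via boto3; the default model ID and the omission of inference parameters are assumptions, and your account may require a regional inference-profile prefix (for example, us.).

# Sketch: exposing a Nova model to DeepEval as a custom evaluation/judge model.
# Assumes boto3 credentials are configured; the model ID may need an inference-profile prefix.
import boto3
from deepeval.models import DeepEvalBaseLLM

class NovaModel(DeepEvalBaseLLM):
    def __init__(self, model_id: str = "amazon.nova-lite-v1:0"):
        self.model_id = model_id
        self.client = boto3.client("bedrock-runtime")

    def load_model(self):
        return self.client

    def generate(self, prompt: str) -> str:
        response = self.client.converse(
            modelId=self.model_id,
            messages=[{"role": "user", "content": [{"text": prompt}]}],
        )
        return response["output"]["message"]["content"][0]["text"]

    async def a_generate(self, prompt: str) -> str:
        # No async Bedrock client here; reuse the synchronous call.
        return self.generate(prompt)

    def get_model_name(self) -> str:
        return self.model_id

Any LLM-judged metric can then accept the wrapper, for example AnswerRelevancyMetric(threshold=0.7, model=NovaModel()).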

Sample Implementation of Rouge Metric with DeepEval

from deepeval.scorer import Scorer
from deepeval.metrics import BaseMetric
from deepeval.test_case import LLMTestCase

class RougeMetric(BaseMetric):
    def __init__(self, threshold: float = 0.5):
        self.threshold = threshold
        self.scorer = Scorer()

    def measure(self, test_case: LLMTestCase):
        # Score the actual output against the reference; swap in "rougeL"
        # to match the Rouge-L figures reported later in this post.
        self.score = self.scorer.rouge_score(
            prediction=test_case.actual_output,
            target=test_case.expected_output,
            score_type="rouge1"
        )
        self.success = self.score >= self.threshold
        return self.score

    # Async implementation of measure(). If an async version of the
    # scoring method does not exist, just reuse the measure method.
    async def a_measure(self, test_case: LLMTestCase):
        return self.measure(test_case)

    def is_successful(self):
        return self.success

    @property
    def __name__(self):
        return "Rouge Metric"
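
Running the custom metric on a single test case is then straightforward; the test case content below is illustrative:

# Illustrative usage of the custom RougeMetric defined above.
test_case = LLMTestCase(
    input="Summarize the turbine maintenance log.",
    actual_output="Gearbox oil was replaced and blade sensors were recalibrated.",
    expected_output="The gearbox oil was changed and the blade sensors were recalibrated.",
)

metric = RougeMetric(threshold=0.5)
metric.measure(test_case)
print(metric.score, metric.is_successful())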

Langfuse: The Complete Evaluation Ecosystem

Langfuse served as our central platform for experiment tracking, dataset management, and performance monitoring. Its comprehensive feature set made it ideal for managing complex multi-model evaluations.

  • Dataset Management and Experimentation: Langfuse's dataset functionality enabled systematic, repeatable model comparisons (see the sketch after this list).
  • LLM-as-a-Judge Integration: Langfuse's LLM-as-a-Judge feature automated quality assessment of model outputs against our evaluation criteria.
  • Cost and Performance Tracking: Langfuse provided granular visibility into model economics:
    • Cost Tracking:
      • Per-model inference costs
      • Token usage patterns
      • Cost per successful interaction
    • Performance Monitoring:
      • Response latency distributions
      • Throughput metrics
      • Error rates and failure patterns
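
A minimal sketch of the dataset workflow referenced above, using the Langfuse Python SDK, is shown below. The dataset name, items, and the run_agent stub are illustrative, and the exact helpers for linking each run's trace and scores back to a dataset item vary by SDK version, so that step is left as a comment.

# Sketch: building a Langfuse dataset for repeatable model comparisons.
# Dataset name, items, and the run_agent stub are illustrative.
from langfuse import Langfuse

langfuse = Langfuse()  # reads the LANGFUSE_* environment variables

langfuse.create_dataset(name="renewableops-turbine-qa")
langfuse.create_dataset_item(
    dataset_name="renewableops-turbine-qa",
    input={"question": "What is the cut-in wind speed for turbine model X?"},
    expected_output="Turbine model X has a cut-in wind speed of 3 m/s.",
)

def run_agent(payload: dict) -> str:
    # Placeholder for the Nova-backed agent call described in Part 1.
    return "Turbine model X has a cut-in wind speed of 3 m/s."

dataset = langfuse.get_dataset("renewableops-turbine-qa")
for item in dataset.items:
    prediction = run_agent(item.input)
    # Link the resulting trace to the item and attach DeepEval scores as
    # Langfuse scores using the dataset-run helpers for your SDK version.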

Model Selection Strategy and Results

Based on our comprehensive evaluation, we implemented a tiered model selection strategy (a configuration sketch follows the agent list below):

Agent-Specific Model Assignment

  • Classification Agent (Nova Pro):
    • Rationale: Requires sophisticated reasoning to route queries correctly
    • Performance: 94% routing accuracy
    • Cost Impact: Higher per-query cost justified by system-wide efficiency
  • Wind Turbine Catalog Agent (Nova Lite):
    • Rationale: Balances technical accuracy with cost efficiency
    • Performance: Rouge-L F1 of 0.78, 2.3s average response time
    • Cost Impact: 75% cost reduction vs. Nova Pro with minimal accuracy loss
  • Solar Panel Insights Agent (Nova Pro):
    • Rationale: Complex calculations and multi-step reasoning required
    • Performance: Rouge-L F1 of 0.85, high accuracy on cost analysis
    • Cost Impact: Premium pricing justified by revenue-generating insights
  • Image Analysis Agents (Nova Lite):
    • Rationale: Multimodal capabilities at reasonable cost
    • Performance: 87% accuracy on defect detection
    • Cost Impact: 60% cost reduction vs. specialized vision models
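
As a configuration sketch of this tiered assignment, the mapping below pairs each agent role with a Nova model and routes calls through the Bedrock Converse API. The agent labels are illustrative stand-ins for the roles above (plus the simple-query path served by Nova Micro), and the model IDs may require a regional inference-profile prefix such as us. in your account.

# Sketch: tiered Nova model assignment per agent role (labels are illustrative).
import boto3

AGENT_MODELS = {
    "classification_agent":       "amazon.nova-pro-v1:0",    # query routing
    "wind_turbine_catalog_agent": "amazon.nova-lite-v1:0",   # catalog Q&A
    "solar_panel_insights_agent": "amazon.nova-pro-v1:0",    # multi-step cost analysis
    "image_analysis_agent":       "amazon.nova-lite-v1:0",   # multimodal defect detection
    "simple_query_agent":         "amazon.nova-micro-v1:0",  # low-latency text-only queries
}

bedrock = boto3.client("bedrock-runtime")

def invoke_agent_model(agent_name: str, prompt: str) -> str:
    response = bedrock.converse(
        modelId=AGENT_MODELS[agent_name],
        messages=[{"role": "user", "content": [{"text": prompt}]}],
    )
    return response["output"]["message"]["content"][0]["text"]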

Economic Impact Analysis

Our model selection strategy delivered measurable business value:

  • Cost Optimization:
    • Overall cost reduction: 42% compared to single-model (Nova Pro) approach
    • Peak cost efficiency: Nova Micro for simple queries reduced costs by 88%
    • Scalability: Linear cost scaling with usage patterns
  • Performance Optimization:
    • Latency improvement: 35% faster average response times
    • Accuracy maintenance: Less than 3% accuracy degradation on critical tasks
    • Throughput increase: 60% higher concurrent query handling

Lessons Learned and Future Optimizations

Key Insights

  1. Model Economics Are Task-Dependent: No single model excels across all use cases. Strategic model assignment based on task complexity and cost sensitivity delivers optimal results.
  2. Evaluation Framework Matters: DeepEval's comprehensive metrics and Langfuse's experiment tracking were instrumental in making data-driven decisions.
  3. Rouge-L Provides Nuanced Assessment: Unlike fixed n-gram matching, Rouge-L scores the longest common subsequence between output and reference, capturing the sentence-level structure that matters for technical documentation evaluation.

Conclusion

The strategic evaluation and selection of Amazon Nova models for our RenewableOps AI Assistant demonstrates the importance of model economics in production AI systems. By leveraging DeepEval's comprehensive evaluation framework and Langfuse's experiment tracking capabilities, we achieved:

  • 42% cost reduction through strategic model assignment
  • 35% latency improvement via optimized model selection
  • Maintained accuracy across critical business functions
  • Scalable architecture that adapts to varying demand patterns

The combination of Nova's industry-leading price-performance, DeepEval's robust evaluation metrics, and Langfuse's comprehensive tracking created a powerful framework for building cost-effective, high-performance AI systems.

As AI adoption continues to accelerate, the ability to make data-driven decisions about model selection will become increasingly crucial for sustainable AI operations. Our approach provides a blueprint for evaluating and optimizing model economics in complex, multi-agent AI systems.

The lessons learned from this implementation can be applied across industries where AI systems must balance performance requirements with economic constraints.
