Evaluate Claude 3 with AWS Bedrock & FMEval

Raghavan Madabusi

Jun 24, 2024 • 17 min read

Evaluating Large Language Models (LLMs) is essential to ensure they meet the specific needs of diverse use cases while maintaining responsible AI practices. In this blog, we delve into how AWS Bedrock and FMEval facilitate the evaluation of Anthropic Claude 3 models. These tools provide a comprehensive framework for assessing model performance, helping data scientists and ML engineers to streamline their evaluation processes. The ability to conduct large-scale evaluations with built-in and custom algorithms makes FMEval an invaluable resource for those working with LLMs.

Anthropic’s Claude 3 models, including Claude 3 Opus, Claude 3 Sonnet, and Claude 3 Haiku, set new standards in reasoning, math, and coding. These models offer significant improvements in understanding non-English languages, maintaining high accuracy, and reducing hallucinations compared to other models. By leveraging AWS Bedrock and FMEval, users can evaluate these advanced models effectively, ensuring optimal performance and reliability. This blog will guide you through creating a custom Model Runner for evaluating Claude 3 models and demonstrate its application in various use cases.

Deep Dive into FMEval

With FM evaluations, AWS introduces FMEval, an open-source LLM evaluation library designed for data scientists and ML engineers. FMEval provides a code-first experience to evaluate LLMs for quality and responsibility, helping users select or adapt LLMs to specific use cases. It supports evaluations for LLM model endpoints or entire generative AI services. FMEval measures evaluation dimensions such as accuracy, robustness, bias, toxicity, and factual knowledge for any LLM. Users can evaluate LLMs hosted on AWS through Amazon Bedrock, SageMaker JumpStart, and other SageMaker endpoints, as well as LLMs reached through third-party platforms such as ChatGPT, HuggingFace, and LangChain. This consolidation allows customers to keep their LLM evaluation logic in one place.

Getting Started with FMEval

FMEval can be used wherever your workloads run, as a Python package or via the open-source code repository on GitHub. It provides easy-to-understand data and reports, enabling AWS customers to make informed decisions. Users can upload their own prompt datasets and algorithms. The core evaluation function, evaluate(), is extensible, allowing users to upload a prompt dataset, select and upload an evaluation function, and run an evaluation job. Results are delivered in multiple formats, helping to review, analyze, and operationalize high-risk items, and choose the right LLM for specific use cases.

Supported Algorithms

FMEval offers 12 built-in evaluations covering four different tasks and five evaluation dimensions. The library is designed with extensibility in mind, based on the latest scientific findings and the most popular open-source evaluations. The proposed evaluations are intended to cover popular aspects out-of-the-box and enable the addition of new ones.

Evaluation Tasks and Dimensions

Open-ended generation: Evaluates prompt stereotyping, toxicity, factual knowledge, and semantic robustness.
Text summarization: Measures accuracy, toxicity, and semantic robustness.
Question answering (Q&A): Assesses accuracy, toxicity, and semantic robustness.
Classification: Evaluates accuracy and semantic robustness.

Evaluation Details

Accuracy: Assesses model performance across tasks such as summarization, Q&A, and classification, using metrics like ROUGE-N, METEOR, and BERTScore for summarization; Exact Match, Quasi-Exact Match, and F1 scores for Q&A; and standard metrics like accuracy, precision, recall, and balanced accuracy for classification.
Semantic Robustness: Evaluates the performance change in model output due to semantic-preserving perturbations like typos, random upper-case changes, and whitespace modifications.
Factual Knowledge: Measures the model’s ability to reproduce real-world facts using prompts from datasets like T-REx.
Prompt Stereotyping: Analyzes whether the model encodes stereotypes across various categories, using the CrowS-Pairs dataset.
Toxicity: Assesses the level of toxic content generated by the model, using datasets like Real Toxicity Prompts and BOLD, with UnitaryAI Detoxify-unbiased as the toxicity detector.
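
To make the semantic robustness item more concrete, here is a minimal sketch of two meaning-preserving perturbations: random upper-casing and whitespace changes. The function names, probabilities, and seed are hypothetical choices for illustration and are not FMEval's implementation.

```python
import random


def random_upper_case(text: str, prob: float = 0.1, seed: int = 0) -> str:
    """Upper-case each character independently with probability `prob`."""
    rng = random.Random(seed)
    return "".join(ch.upper() if rng.random() < prob else ch for ch in text)


def perturb_whitespace(text: str, remove_prob: float = 0.1, add_prob: float = 0.05, seed: int = 0) -> str:
    """Randomly drop existing spaces and insert extra spaces after other characters."""
    rng = random.Random(seed)
    out = []
    for ch in text:
        if ch == " " and rng.random() < remove_prob:
            continue  # drop this space
        out.append(ch)
        if ch != " " and rng.random() < add_prob:
            out.append(" ")  # insert an extra space
    return "".join(out)


prompt = "the mayor held a press conference to address the recent rise in crime rates"
print(random_upper_case(prompt))
print(perturb_whitespace(prompt))
```

A robustness evaluation would then compare the model's score on the original prompt with its score on the perturbed prompts.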

Using FMEval for Evaluations

The FMEval package includes core constructs necessary for conducting evaluation jobs:

Data Config: Points to the dataset location and contains fields like model_input, target_output, and model_output. It can be customized based on the evaluation algorithm.
Model Runner: Represents the hosted FM used for inference. FMEval supports native runners for JumpStart, Amazon Bedrock, and SageMaker endpoints, and allows for custom runners.
Evaluation Algorithm: Supports built-in and custom algorithms for evaluating LLMs. Users can implement custom logic by inheriting the base Evaluation Algorithm class.

FMEval empowers users to evaluate their models comprehensively, providing insights to optimize performance and ensure responsible AI practices.

Meet Anthropic Claude 3

Anthropic’s data reveals that Claude 3 Opus, the most advanced model in their lineup, sets a new benchmark by outperforming other models, including OpenAI’s GPT-4, in reasoning, math, and coding. Claude 3 Sonnet excels with a blend of high intelligence and speed, operating twice as fast as earlier Claude models for most tasks. Claude 3 Haiku, the fastest and most compact model, is an affordable option in its intelligence category, offering near-instant responses that mimic human interactions.

Claude 3 models show notable improvements over other available models in tasks like coding and understanding non-English languages, including Spanish, Japanese, and French. They exhibit fewer hallucinations and higher accuracy compared to popular models from other providers. These models are adept at following complex, multi-step instructions and maintaining brand voice in customer-facing applications. Additionally, Claude 3 models can generate outputs suited for natural language classification and sentiment analysis, simplifying application development for these use cases.

All three new models feature advanced vision capabilities, allowing them to process various data formats and analyze images. This meets the increasing demand for models that comprehend charts, graphs, technical diagrams, photos, and other visual assets.

AWS customers can access the Claude 3 family of vision-enabled models on Amazon Bedrock. Amazon Bedrock is the first managed service to offer all Claude 3 models (Claude 3 Opus, Claude 3 Sonnet, and Claude 3 Haiku) as generally available.

FMEval Custom Claude 3 Model Runner

FMEval provides model runners for SageMaker JumpStart and Amazon Bedrock out of the box. For models that require special processing, custom model runners can be created by extending the ModelRunner class.

The built-in Bedrock Model Runner uses the Text Completion API, which works for older Anthropic Claude 2.x models. Claude 3 models such as Haiku, Sonnet, and Opus use the Messages API, so evaluating them requires a custom Model Runner. The rest of this post shows how to create that custom Model Runner and use it to evaluate Claude 3 models on a couple of use cases.
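
The difference between the two request formats can be sketched as follows. The snippet only builds the JSON bodies and does not call Bedrock. The helper names are hypothetical, and the field names reflect the Text Completion and Messages request formats for Anthropic models on Bedrock as documented at the time of writing, so verify them against the current Bedrock documentation.

```python
import json


def claude2_text_completion_body(prompt: str, max_tokens: int = 500) -> str:
    """Request body in the older Text Completion format (Claude 2.x)."""
    return json.dumps(
        {
            # The prompt is wrapped in Human/Assistant turns inside one string.
            "prompt": f"\n\nHuman: {prompt}\n\nAssistant:",
            "max_tokens_to_sample": max_tokens,
        }
    )


def claude3_messages_body(prompt: str, system: str = "", max_tokens: int = 500) -> str:
    """Request body in the Messages format (Claude 3)."""
    body = {
        "anthropic_version": "bedrock-2023-05-31",
        "max_tokens": max_tokens,
        # Turns are structured objects instead of one concatenated string.
        "messages": [{"role": "user", "content": prompt}],
    }
    if system:
        body["system"] = system
    return json.dumps(body)


print(claude2_text_completion_body("Summarise this text."))
print(claude3_messages_body("Summarise this text.", system="Answer in one sentence."))
```

Because the two bodies use different keys and a different prompt structure, a runner written around one format cannot be reused as-is for the other.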

Summarization & Factual Knowledge Evaluation

At a minimum, two packages are required:

requirements.txt

boto3
fmeval

Custom Bedrock Model Runner for Claude 3:

bedrock_claud3_model_runner.py

"""
Module to manage model runners for Bedrock models.
"""
import json
import logging
from fmeval.util import require
from typing import Optional, Tuple
from fmeval.constants import MIME_TYPE_JSON
from fmeval.model_runners.model_runner import ModelRunner
from fmeval.model_runners.util import get_bedrock_runtime_client

logger = logging.getLogger(__name__)


class BedrockClaude3ModelRunner(ModelRunner):
    """
    A class to manage the creation and deletion of Bedrock model runner when user provides
    a Bedrock model id.
    """

    def __init__(
        self,
        model_id: str,
        content_template: str,
        output: Optional[str] = None,
        log_probability: Optional[str] = None,
        content_type: str = MIME_TYPE_JSON,
        accept_type: str = MIME_TYPE_JSON,
    ):
        """
        :param model_id: Id of the Bedrock model to be used for model predictions
        :param content_template: String template to compose the model input from the prompt
        :param output: JMESPath expression of output in the model output
        :param log_probability: JMESPath expression of log probability in the model output
        :param content_type: The content type of the request sent to the model for inference
        :param accept_type: The accept type of the request sent to the model for inference
        """
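        # Pass arguments by keyword so the call does not depend on the
        # positional order of ModelRunner.__init__ in the installed fmeval version.
        super().__init__(
            content_template=content_template,
            output=output,
            log_probability=log_probability,
            content_type=content_type,
            accept_type=accept_type,
        )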
        self._model_id = model_id
        self._content_template = content_template
        self._output = output
        self._log_probability = log_probability
        self._content_type = content_type
        self._accept_type = accept_type

        require(
            output is not None or log_probability is not None,
            "One of output jmespath expression or log probability jmespath expression must be provided",
        )
        require(self._accept_type == MIME_TYPE_JSON, f"Model accept type `{self._accept_type}` is not supported.")
        require(
            self._content_type == MIME_TYPE_JSON,
            f"Model content type `{self._content_type}` is not supported.",
        )
        self._bedrock_runtime_client = get_bedrock_runtime_client()

    def predict(self, prompt: str) -> Tuple[Optional[str], Optional[float]]:
        """
        Invoke the Bedrock model and parse the model response.
        :param prompt: Input data for which you want the model to provide inference.
        """
        # Fill $prompt in the content template; the result is used as a dict below.
        composed_data = self._composer.compose(prompt)

        # Build the request body in the Messages API format used by Claude 3 models.
        body = json.dumps(
            {
                "anthropic_version": composed_data["anthropic_version"],
                "max_tokens": composed_data["max_tokens"],
                "system": composed_data["system"],
                "messages": [{"role": "user", "content": composed_data["prompt"]}],
            }
        )

        response = self._bedrock_runtime_client.invoke_model(
            body=body, modelId=self._model_id, accept=self._accept_type, contentType=self._content_type
        )

        # Claude 3 models return the completion as a list of content elements;
        # the first element holds the generated text under the "text" key.
        model_output = json.loads(response.get("body").read())["content"][0]
        
        output = (
            self._extractor.extract_output(data=model_output, num_records=1)
            if self._extractor.output_jmespath_expression
            else None
        )
        
        log_probability = (
            self._extractor.extract_log_probability(data=model_output, num_records=1)
            if self._extractor.log_probability_jmespath_expression
            else None
        )
        return output, log_probability

    def __reduce__(self):
        """
        Custom serializer method used by Ray when it serializes instances of this
        class in eval_algorithms.util.generate_model_predict_response_for_dataset.
        """
        serialized_data = (
            self._model_id,
            self._content_template,
            self._output,
            self._log_probability,
            self._content_type,
            self._accept_type,
        )
        return self.__class__, serialized_data
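
To see what predict() works with, the sketch below mimics, without calling Bedrock, how a content template with a $prompt placeholder can be turned into a dict and how the text is pulled out of a Messages-style response. It uses string.Template and a hand-written sample response; both are stand-ins for illustration rather than FMEval's own composer and extractor, and the sample response text is made up.

```python
import json
from string import Template

content_template = (
    '{"anthropic_version": "bedrock-2023-05-31", '
    '"system": "Summarise the following text in one sentence", '
    '"prompt": $prompt, "max_tokens": 500}'
)


def compose(template: str, prompt: str) -> dict:
    """Fill $prompt with the JSON-encoded prompt, then parse the result."""
    # json.dumps adds the surrounding quotes and escapes special characters,
    # so prompts containing quotes still produce valid JSON.
    return json.loads(Template(template).substitute(prompt=json.dumps(prompt)))


composed = compose(content_template, 'He said "hello" to the crowd.')
print(composed["prompt"], composed["max_tokens"])

# Hypothetical response body shaped like a Messages API reply.
sample_response = {"content": [{"type": "text", "text": "A man greets a crowd."}]}
first_block = sample_response["content"][0]
output_text = first_block["text"]  # the value that output='text' selects
print(output_text)
```

This is why the template in the scripts below leaves $prompt unquoted: the quotes come from JSON-encoding the prompt at substitution time.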

Summarization Use Case:

A typical evaluation requires a JSON Lines file (.jsonl extension) containing a few samples. Each sample has a document key holding the input text and a summary key holding the expected output. Each document is fed to the model, and the model's output is compared with the expected summary to produce three metrics: METEOR, ROUGE, and BERTScore.

METEOR

METEOR (Metric for Evaluation of Translation with Explicit ORdering) is a metric used to evaluate the quality of machine-generated translations. It extends beyond simple word overlap by incorporating stemming, synonym matching, and exact matches. The primary features of METEOR include:

Stemming: Reducing words to their base or root form.
Synonym Matching: Recognizing synonyms to account for different words with similar meanings.
Precision and Recall: Balancing the measure of exact matches (precision) and relevant matches retrieved (recall).
F-score: A weighted harmonic mean of precision and recall, providing a single score that balances both.

METEOR aims to improve correlation with human judgment over other metrics by considering linguistic phenomena like synonyms and stemming.
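
The weighted F-score at the heart of METEOR can be sketched in a few lines. This is a deliberately reduced version: it counts exact unigram matches only and leaves out stemming, synonym matching, and METEOR's word-order penalty. The function name is hypothetical, and alpha = 0.9 is assumed here as a commonly used recall weighting; it will not reproduce the scores FMEval reports.

```python
from collections import Counter


def unigram_f_mean(candidate: str, reference: str, alpha: float = 0.9) -> float:
    """Recall-weighted harmonic mean of unigram precision and recall."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    matches = sum((cand & ref).values())  # exact matches only, clipped by count
    if matches == 0:
        return 0.0
    precision = matches / sum(cand.values())
    recall = matches / sum(ref.values())
    # alpha close to 1 makes the score lean towards recall.
    return precision * recall / (alpha * precision + (1 - alpha) * recall)


candidate = "the mayor announced new measures to address rising crime rates"
reference = "mayor announces new measures to tackle crime"
print(unigram_f_mean(candidate, reference))
```

Note that "announced" and "announces" do not match here; stemming in the full metric is what lets such pairs count.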

ROUGE

ROUGE (Recall-Oriented Understudy for Gisting Evaluation) is a set of metrics for evaluating automatic summarization and machine translation. The most common variants include:

ROUGE-N: Measures the overlap of n-grams (sequences of N words) between the generated summary and a reference summary. For example, ROUGE-1 measures unigram (single word) overlap, while ROUGE-2 measures bigram (two-word sequence) overlap.
ROUGE-L: Measures the longest common subsequence (LCS) between the generated and reference summaries, capturing sentence-level structure similarity.
ROUGE-S: Measures the overlap of skip-bigrams, which are pairs of words in their sentence order allowing gaps between them.

ROUGE primarily focuses on recall, making it useful for evaluating summaries where capturing all relevant information is crucial.
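
As an illustration of ROUGE-N, the following sketch counts clipped n-gram overlap on whitespace tokens and derives precision, recall, and F1. It is a simplified version with hypothetical function names; library implementations typically add steps such as stemming, so the numbers will differ from those FMEval reports.

```python
from collections import Counter


def ngrams(tokens, n):
    """Count the n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def rouge_n(candidate: str, reference: str, n: int = 1) -> dict:
    """Precision, recall, and F1 of n-gram overlap between two texts."""
    cand = ngrams(candidate.lower().split(), n)
    ref = ngrams(reference.lower().split(), n)
    overlap = sum((cand & ref).values())  # each n-gram counted at most as often as in both
    recall = overlap / sum(ref.values()) if ref else 0.0
    precision = overlap / sum(cand.values()) if cand else 0.0
    f1 = 2 * precision * recall / (precision + recall) if overlap else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


candidate = "the mayor announced new measures to address rising crime rates"
reference = "mayor announces new measures to tackle crime"
print(rouge_n(candidate, reference, n=1))
print(rouge_n(candidate, reference, n=2))
```

Recall is the overlap divided by the number of n-grams in the reference, which is the sense in which ROUGE is recall-oriented.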

BERTScore

BERTScore is a metric that leverages deep learning models, specifically BERT (Bidirectional Encoder Representations from Transformers), to evaluate text similarity. It calculates the similarity of the embeddings (vector representations) of words in the generated and reference texts. Key aspects of BERTScore include:

Contextual Embeddings: Uses BERT to generate context-aware embeddings for words, capturing semantic nuances.
Cosine Similarity: Measures the cosine similarity between the embeddings of the generated text and reference text.
Precision, Recall, and F1: Similar to traditional metrics, BERTScore provides precision, recall, and an F1 score, offering a balanced view of the model’s performance.

BERTScore aims to capture semantic similarity more effectively than surface-level metrics, accounting for the meaning and context of the text.
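
The matching step behind BERTScore can be sketched with toy vectors. In the real metric the vectors come from a BERT-style model; here they are made-up 2-D embeddings, and the function names are hypothetical. Each token is paired with its most similar token on the other side, and the similarities are averaged.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def greedy_match_scores(cand_vecs, ref_vecs):
    """Precision, recall, and F1 from best-match cosine similarities."""
    # Precision: how well each candidate token is covered by the reference.
    precision = sum(max(cosine(c, r) for r in ref_vecs) for c in cand_vecs) / len(cand_vecs)
    # Recall: how well each reference token is covered by the candidate.
    recall = sum(max(cosine(r, c) for c in cand_vecs) for r in ref_vecs) / len(ref_vecs)
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Toy vectors standing in for contextual token embeddings (made up).
candidate = [[1.0, 0.0], [0.0, 1.0]]
reference = [[1.0, 0.0], [1.0, 1.0]]
print(greedy_match_scores(candidate, reference))
```

Because similarity is computed on embeddings rather than surface strings, a paraphrase can score well even when it shares few exact words with the reference.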

# summarize_accuracy.jsonl

{"document": "five-time world champion michelle kwan withdrew from the 2006 us figure skating championships on wednesday, but will petition us skating officials for the chance to compete at the 2006 turin olympics.","summary": "injury leaves kwan's olympic hopes in limbo","idx": "0"}
{"document": "the company announced a new range of products in a press release today, highlighting their commitment to innovation and customer satisfaction. the new line includes several high-tech gadgets aimed at improving daily life.","summary": "company unveils innovative product line","idx": "1"}
{"document": "scientists have discovered a new species of fish in the deep waters of the pacific ocean. the fish, which has unique characteristics, adds to the biodiversity of the region and provides new opportunities for research.","summary": "new fish species discovered in pacific ocean","idx": "2"}
{"document": "the mayor held a press conference to address the recent rise in crime rates in the city. she outlined new measures that will be implemented to improve public safety and reduce criminal activities.","summary": "mayor announces new measures to tackle crime","idx": "3"}
{"document": "a major earthquake struck the coastal region early this morning, causing significant damage to buildings and infrastructure. emergency services are on the scene, and the government has declared a state of emergency.","summary": "earthquake causes widespread damage on coast","idx": "4"}
{"document": "the tech giant reported record profits for the third quarter of the fiscal year, driven by strong sales of its flagship products. the company's stock prices surged following the announcement.","summary": "tech giant reports record third-quarter profits","idx": "5"}
{"document": "researchers at the university have developed a new vaccine that shows promise in protecting against the zika virus. initial trials have been successful, and further testing is planned.","summary": "new zika virus vaccine shows promise","idx": "6"}
{"document": "the local sports team clinched the championship title after a thrilling final match. fans celebrated the victory, marking the team's first championship win in over a decade.","summary": "local team wins championship after thrilling final","idx": "7"}
{"document": "a new law aimed at reducing carbon emissions was passed by the legislature today. the law includes measures to promote renewable energy sources and improve energy efficiency.","summary": "legislature passes new law to reduce carbon emissions","idx": "8"}
{"document": "the famous artist's latest exhibition opened to the public today, featuring a collection of new works that explore themes of nature and humanity. the exhibition will run for the next three months.","summary": "artist's new exhibition explores nature and humanity","idx": "9"}
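
Before running an evaluation, it can help to check that every line of the file parses as JSON and carries the fields the data config points to. The following is a minimal sketch with a hypothetical helper name, shown on two records copied from the file above.

```python
import json

REQUIRED_KEYS = {"document", "summary"}


def validate_jsonl(lines):
    """Parse JSON Lines records and fail fast on missing fields."""
    records = []
    for line_no, line in enumerate(lines, start=1):
        if not line.strip():
            continue  # ignore blank lines
        record = json.loads(line)
        missing = REQUIRED_KEYS - set(record)
        if missing:
            raise ValueError(f"line {line_no}: missing keys {sorted(missing)}")
        records.append(record)
    return records


sample_lines = [
    '{"document": "researchers at the university have developed a new vaccine that shows promise in protecting against the zika virus. initial trials have been successful, and further testing is planned.","summary": "new zika virus vaccine shows promise","idx": "6"}',
    '{"document": "the local sports team clinched the championship title after a thrilling final match. fans celebrated the victory, marking the team\'s first championship win in over a decade.","summary": "local team wins championship after thrilling final","idx": "7"}',
]
records = validate_jsonl(sample_lines)
print(len(records), records[0]["summary"])
```

To check the real file, the same function can be given an open file handle, since iterating over a file yields its lines.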

# summarization_bedrock.py

import boto3
import os
import json

from fmeval.data_loaders.data_config import DataConfig
from bedrock_claud3_model_runner import BedrockClaude3ModelRunner
from fmeval.constants import MIME_TYPE_JSONLINES
from fmeval.eval_algorithms.summarization_accuracy import SummarizationAccuracy

os.environ["PARALLELIZATION_FACTOR"] = "1"
os.environ["DISABLE_FMEVAL_TELEMETRY"] = "true"

# Bedrock clients for model inference
bedrock = boto3.client(service_name='bedrock')
bedrock_runtime = boto3.client(service_name='bedrock-runtime')

model_id = 'anthropic.claude-3-haiku-20240307-v1:0'
# model_id = 'anthropic.claude-3-sonnet-20240229-v1:0'

accept = "application/json"
contentType = "application/json"

config = DataConfig(
    dataset_name="summarize_accuracy_sample",
    dataset_uri="data/summarize_accuracy.jsonl",
    dataset_mime_type=MIME_TYPE_JSONLINES,
    model_input_location="document",
    target_output_location="summary"
)
    
content_template = '{"anthropic_version": "bedrock-2023-05-31", "system": "Summarise the following text in one sentence and less than 10 words", "prompt": $prompt, "max_tokens": 500}'

bedrock_model_runner = BedrockClaude3ModelRunner(
    model_id=model_id,
    output='text',
    content_template=content_template
)

eval_algo = SummarizationAccuracy()
eval_output = eval_algo.evaluate(model=bedrock_model_runner, dataset_config=config, 
                                 prompt_template="$model_input", save=True)
                                 
eval_response = json.dumps(eval_output, default=vars, indent=4)
eval_response_json = json.loads(eval_response)
output_path = eval_response_json[0]["output_path"]

with open(output_path, "r") as file:
    for line in file:
        print(line)

Summarization Evaluation Result – Haiku Model

model_input	model_output	target_output	Scores
the local sports team clinched the championship title after a thrilling final match. fans celebrated the victory, marking the team’s first championship win in over a decade.	The local sports team won the championship after a thrilling final match, sparking celebrations from their fans.	local team wins championship after thrilling final	meteor: 0.7740169238427078 rouge: 0.18181818181818182 bertscore: 0.7768877148628235
scientists have discovered a new species of fish in the deep waters of the pacific ocean. the fish, which has unique characteristics, adds to the biodiversity of the region and provides new opportunities for research.	Scientists discovered a new deep-sea Pacific fish species, expanding regional biodiversity.	new fish species discovered in pacific ocean	meteor: 0.48947368421052634 rouge: 0.11764705882352942 bertscore: 0.7626062631607056
the famous artist’s latest exhibition opened to the public today, featuring a collection of new works that explore themes of nature and humanity. the exhibition will run for the next three months.	The famous artist’s new exhibition showcasing nature and humanity themes is open for three months.	artist’s new exhibition explores nature and humanity	meteor: 0.777344645723458 rouge: 0.4545454545454545 bertscore: 0.7821741104125977
the company announced a new range of products in a press release today, highlighting their commitment to innovation and customer satisfaction. the new line includes several high-tech gadgets aimed at improving daily life.	The company announced a new innovative product line to enhance customer experience.	company unveils innovative product line	meteor: 0.646551724137931 rouge: 0.26666666666666666 bertscore: 0.8270793557167053
five-time world champion michelle kwan withdrew from the 2006 us figure skating championships on wednesday, but will petition us skating officials for the chance to compete at the 2006 turin olympics.	Five-time world champion Michelle Kwan withdrew from 2006 US championships, petitioning for 2006 Olympics spot.	injury leaves kwan’s olympic hopes in limbo	meteor: 0.11235955056179776 rouge: 0.0 bertscore: 0.6205575466156006
a new law aimed at reducing carbon emissions was passed by the legislature today. the law includes measures to promote renewable energy sources and improve energy efficiency.	The legislature passed a new law to reduce carbon emissions by promoting renewable energy and improving energy efficiency.	legislature passes new law to reduce carbon emissions	meteor: 0.8722527472527473 rouge: 0.5 bertscore: 0.831701397895813
a major earthquake struck the coastal region early this morning, causing significant damage to buildings and infrastructure. emergency services are on the scene, and the government has declared a state of emergency.	A major earthquake struck the coastal region, causing significant damage and prompting a state of emergency.	earthquake causes widespread damage on coast	meteor: 0.20833333333333334 rouge: 0.0 bertscore: 0.7660402655601501
the tech giant reported record profits for the third quarter of the fiscal year, driven by strong sales of its flagship products. the company’s stock prices surged following the announcement.	The tech giant reported record profits and saw stock price surge.	tech giant reports record third-quarter profits	meteor: 0.7333333333333334 rouge: 0.37499999999999994 bertscore: 0.7267693877220154
researchers at the university have developed a new vaccine that shows promise in protecting against the zika virus. initial trials have been successful, and further testing is planned.	Researchers develop promising Zika virus vaccine, initial trials successful.	new zika virus vaccine shows promise	meteor: 0.576923076923077 rouge: 0.3076923076923077 bertscore: 0.7036830186843872
the mayor held a press conference to address the recent rise in crime rates in the city. she outlined new measures that will be implemented to improve public safety and reduce criminal activities.	The mayor announced new measures to address rising crime rates.	mayor announces new measures to tackle crime	meteor: 0.7957957957957957 rouge: 0.5333333333333333 bertscore: 0.8745450973510742

Summarization Evaluation Result – Sonnet Model

model_input	model_output	target_output	scores
the local sports team clinched the championship title after a thrilling final match. fans celebrated the victory, marking the team’s first championship win in over a decade.	The local sports team won the championship after a thrilling final.	local team wins championship after thrilling final	meteor: 0.8462585034013606 rouge: 0.25 bertscore: 0.8629528284072876
scientists have discovered a new species of fish in the deep waters of the pacific ocean. the fish, which has unique characteristics, adds to the biodiversity of the region and provides new opportunities for research.	Scientists discovered a unique new fish species in the Pacific’s deep waters.	new fish species discovered in pacific ocean	meteor: 0.6637806637806638 rouge: 0.2222222222222222 bertscore: 0.823331356048584
the famous artist’s latest exhibition opened to the public today, featuring a collection of new works that explore themes of nature and humanity. the exhibition will run for the next three months.	The famous artist’s new nature and humanity-themed exhibition opened today for three months.	artist’s new exhibition explores nature and humanity	meteor: 0.646551724137931 rouge: 0.38095238095238093 bertscore: 0.7324685454368591
the company announced a new range of products in a press release today, highlighting their commitment to innovation and customer satisfaction. the new line includes several high-tech gadgets aimed at improving daily life.	The company launched a new innovative high-tech product line.	company unveils innovative product line	meteor: 0.5738636363636364 rouge: 0.15384615384615383 bertscore: 0.8442436456680298
five-time world champion michelle kwan withdrew from the 2006 us figure skating championships on wednesday, but will petition us skating officials for the chance to compete at the 2006 turin olympics.	Michelle Kwan withdrew from 2006 US Championships, but petitioned for 2006 Olympics.	injury leaves kwan’s olympic hopes in limbo	meteor: 0.11627906976744187 rouge: 0.0 bertscore: 0.640730082988739
a new law aimed at reducing carbon emissions was passed by the legislature today. the law includes measures to promote renewable energy sources and improve energy efficiency.	The legislature passed a law promoting renewable energy and energy efficiency to reduce carbon emissions.	legislature passes new law to reduce carbon emissions	meteor: 0.7641465677179964 rouge: 0.38095238095238093 bertscore: 0.8178695440292358
a major earthquake struck the coastal region early this morning, causing significant damage to buildings and infrastructure. emergency services are on the scene, and the government has declared a state of emergency.	Massive earthquake hit coast, causing destruction; emergency response underway.	earthquake causes widespread damage on coast	meteor: 0.22727272727272727 rouge: 0.0 bertscore: 0.7507191300392151
the tech giant reported record profits for the third quarter of the fiscal year, driven by strong sales of its flagship products. the company’s stock prices surged following the announcement.	The tech giant reported record Q3 profits, boosting stock prices.	tech giant reports record third-quarter profits	meteor: 0.7333333333333334 rouge: 0.4 bertscore: 0.8279321193695068
researchers at the university have developed a new vaccine that shows promise in protecting against the zika virus. initial trials have been successful, and further testing is planned.	Researchers developed a promising new Zika virus vaccine with successful initial trials.	new zika virus vaccine shows promise	meteor: 0.7223880597014926 rouge: 0.37499999999999994 bertscore: 0.7561468482017517
the mayor held a press conference to address the recent rise in crime rates in the city. she outlined new measures that will be implemented to improve public safety and reduce criminal activities.	The mayor announced new public safety measures to curb rising crime rates.	mayor announces new measures to tackle crime	meteor: 0.7401315789473685 rouge: 0.3529411764705882 bertscore: 0.8440559506416321

Factual Knowledge Use Case:

A typical evaluation requires a JSON Lines file (.jsonl extension) containing a few samples. Each sample has a question key holding the prompt text, an answers key holding the expected output (multiple acceptable answers can be separated with <OR>), and a knowledge_category key indicating the category. Each question is fed to the model, and the model's output is compared with the expected answers to produce a single score: 1 for a match and 0 for no match.
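
The 0/1 scoring can be approximated with a case-insensitive containment check against each <OR>-separated answer. This is a simplified sketch inferred from the behaviour described here, not FMEval's actual implementation, and the function name is hypothetical.

```python
def factual_knowledge_score(model_output: str, target_output: str, delimiter: str = "<OR>") -> int:
    """Return 1 if any accepted answer appears in the model output, else 0."""
    output = model_output.lower()
    answers = [answer.strip().lower() for answer in target_output.split(delimiter)]
    return int(any(answer in output for answer in answers if answer))


print(factual_knowledge_score("Pain and fever.", "pain<OR>fever<OR>headache<OR>flu"))
print(
    factual_knowledge_score(
        "3 x 10^8 meters per second.",
        "299,792 kilometers per second<OR>299,792 km/s<OR>300,000 km/s",
    )
)
```

Under this kind of check, a correct answer phrased differently from every listed alternative scores 0, so it pays to list the common phrasings of each answer in the dataset.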

# factual_knowledge.jsonl

{"question": "Paracetamol is commonly used to treat","answers": "pain<OR>fever<OR>headache<OR>flu","knowledge_category": "Pharmaceuticals"}
{"question": "The Great Wall of China was primarily built to","answers": "protect against invasions<OR>defend from nomadic tribes<OR>defend from invasions<OR>protect from nomadic tribes","knowledge_category": "History"}
{"question": "The heart is part of the","answers": "circulatory system<OR>cardiovascular system<OR>human circulatory system<OR>human cardiovascular system","knowledge_category": "Biology"}
{"question": "COVID-19 is caused by","answers": "coronavirus<OR>SARS-CoV-2<OR>the novel coronavirus<OR>COVID-19 virus","knowledge_category": "Healthcare"}
{"question": "Penicillin was discovered by","answers": "Alexander Fleming<OR>Fleming<OR>Sir Alexander Fleming","knowledge_category": "Scientists"}
{"question": "The speed of light in vacuum is approximately","answers": "299,792 kilometers per second<OR>299,792 km/s<OR>300,000 km/s","knowledge_category": "Physics"}
{"question": "The periodic table was created by","answers": "Dmitri Mendeleev<OR>Mendeleev<OR>Dmitri Ivanovich Mendeleev","knowledge_category": "Chemistry"}
{"question": "The first woman to win a Nobel Prize was","answers": "Marie Curie<OR>Curie<OR>Madame Curie","knowledge_category": "History"}
{"question": "The longest reigning British monarch is","answers": "Queen Elizabeth II<OR>Elizabeth II<OR>Queen Elizabeth the Second","knowledge_category": "History"}
{"question": "The author of '1984' is","answers": "George Orwell<OR>Orwell<OR>Eric Arthur Blair","knowledge_category": "Literature"}

# factual_knowledge_bedrock.py

import boto3
import json

from fmeval.data_loaders.data_config import DataConfig
from bedrock_claud3_model_runner import BedrockClaude3ModelRunner
from fmeval.constants import MIME_TYPE_JSONLINES
from fmeval.eval_algorithms.factual_knowledge import FactualKnowledge, FactualKnowledgeConfig

# Bedrock clients for model inference
bedrock = boto3.client(service_name='bedrock')
bedrock_runtime = boto3.client(service_name='bedrock-runtime')

model_id = 'anthropic.claude-3-haiku-20240307-v1:0'
# model_id = 'anthropic.claude-3-sonnet-20240229-v1:0'

config = DataConfig(
    dataset_name="factual_knowledge",
    dataset_uri="data/factual_knowledge.jsonl",
    dataset_mime_type=MIME_TYPE_JSONLINES,
    model_input_location="question",
    target_output_location="answers"
)

content_template = '{"anthropic_version": "bedrock-2023-05-31", "system": "Provide the answer in one sentence and less than 5 words", "prompt": $prompt, "max_tokens": 500}'

bedrock_model_runner = BedrockClaude3ModelRunner(
    model_id=model_id,
    output='text',
    content_template=content_template
)

eval_algo = FactualKnowledge(FactualKnowledgeConfig(target_output_delimiter="<OR>"))
eval_output = eval_algo.evaluate(model=bedrock_model_runner, dataset_config=config, 
                                 prompt_template="$model_input", save=True)
                                 

eval_response = json.dumps(eval_output, default=vars, indent=4)
eval_response_json = json.loads(eval_response)
output_path = eval_response_json[0]["output_path"]

with open(output_path, "r") as file:
    for line in file:
        print(line)

Factual Knowledge Evaluation Result – Haiku Model

model_input	model_output	target_output	scores
The first woman to win a Nobel Prize was	Marie Curie (Physics, 1903).	Marie Curie<OR>Curie<OR>Madame Curie	1
The heart is part of the	cardiovascular system.	circulatory system<OR>cardiovascular system<OR>human circulatory system<OR>human cardiovascular system	1
The author of ‘1984’ is	George Orwell.	George Orwell<OR>Orwell<OR>Eric Arthur Blair	1
The Great Wall of China was primarily built to	Defend against northern invaders.	protect against invasions<OR>defend from nomadic tribes<OR>defend from invasions<OR>protect from nomadic tribes	0
Paracetamol is commonly used to treat	Pain and fever.	pain<OR>fever<OR>headache<OR>flu	1
The longest reigning British monarch is	Queen Elizabeth II.	Queen Elizabeth II<OR>Elizabeth II<OR>Queen Elizabeth the Second	1
Penicillin was discovered by	Alexander Fleming.	Alexander Fleming<OR>Fleming<OR>Sir Alexander Fleming	1
The speed of light in vacuum is approximately	3 x 10^8 meters per second.	299,792 kilometers per second<OR>299,792 km/s<OR>300,000 km/s	0
The periodic table was created by	Dmitri Mendeleev.	Dmitri Mendeleev<OR>Mendeleev<OR>Dmitri Ivanovich Mendeleev	1
COVID-19 is caused by	SARS-CoV-2 virus.	coronavirus<OR>SARS-CoV-2<OR>the novel coronavirus<OR>COVID-19 virus	1

Conclusion

FM evaluations empower customers to ensure the LLM they choose is suitable for their specific use case and operates responsibly. This extensible responsible AI framework is seamlessly integrated into Amazon SageMaker, enhancing the transparency of language models by simplifying the evaluation and communication of risks throughout the ML lifecycle. It represents a significant advancement in building trust and promoting the adoption of LLMs on AWS.

FMEval and AWS Bedrock further support this process by providing powerful tools for evaluating and managing LLMs. FMEval offers a vast set of built-in algorithms across many NLP tasks and serves as a scalable and flexible tool for large-scale evaluations of your own models, datasets, and algorithms. AWS Bedrock simplifies deploying and operationalizing these evaluations, enabling organizations to efficiently manage and scale their AI models. Together, FMEval and AWS Bedrock help ensure that language models perform optimally and responsibly, facilitating their seamless integration into production environments.
