Setting up Experiments
1. Set up your test-cases
Create a list of test-cases that models the expected behavior of your application. Add these test-cases to a `src/evals/<test-group>/index.ts` file. You can set up test-cases by creating a static list or by fetching them from external data-sources. You have access to various built-in metrics to measure the performance of your LLM application, and you can also create your own custom metrics. Learn more about setting up test-cases.
2. Run an evaluation
Run evaluations against different variations of your application in Studio using App Config. Navigate to Studio > Experiments, create a new experiment, and run evaluations against different versions of your application.
3. Analyze results
Analyze results across different evaluations in Studio > Experiments > Notebooks. Notebooks give you various data-analysis tools to help you analyze the results.
Defining Test Cases
Create test groups by creating a folder within the `evals` directory. Each test group should have an `index.ts` file that exports a `TestDatasetFN` function. This function should return an array of test-cases.
Here’s an example of statically defining test-cases:
src/evals/history/index.ts
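For illustration, here is a minimal sketch of such a file. The test-case fields (`input`, `expectedOutput`) and the exact signature of `TestDatasetFN` are assumptions; match them to the library's actual types.

```typescript
// NOTE: the TestCase fields and the signature of TestDatasetFN are
// assumptions for illustration; adjust them to the real interface.
interface TestCase {
  input: string;          // prompt sent to your application
  expectedOutput: string; // reference answer the metrics compare against
}

type TestDatasetFN = () => TestCase[];

// Statically defined test-cases for the "history" test group.
export const testDataset: TestDatasetFN = () => [
  {
    input: "Who was the first president of the United States?",
    expectedOutput: "George Washington",
  },
  {
    input: "In which year did World War II end?",
    expectedOutput: "1945",
  },
];
```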
Test Cases from External Data-Source
You can also fetch test-cases from external data-sources. Here's an example fetching test-cases from a database:
src/evals/history/index.ts
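A hedged sketch of what such a file might look like. The database query is stubbed with an in-memory stand-in so the example is self-contained; the async form of `TestDatasetFN` and the row field names (`prompt`, `answer`) are assumptions.

```typescript
// NOTE: the database client is stubbed in-memory so the example runs
// standalone; the async TestDatasetFN form and row fields are assumptions.
interface TestCase {
  input: string;
  expectedOutput: string;
}

type TestDatasetFN = () => Promise<TestCase[]>;

// Stand-in for a real database query (e.g. via pg, mysql2, or Prisma).
async function queryTestCases(): Promise<{ prompt: string; answer: string }[]> {
  return [
    {
      prompt: "Who wrote the Declaration of Independence?",
      answer: "Thomas Jefferson",
    },
  ];
}

export const testDataset: TestDatasetFN = async () => {
  const rows = await queryTestCases();
  // Map database rows into the test-case shape the evaluator expects.
  return rows.map((row) => ({
    input: row.prompt,
    expectedOutput: row.answer,
  }));
};
```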
Metrics
We provide a set of metrics out of the box, but you can also create your own custom metrics. Here's the list of metrics we provide:

| Metric Name | Description |
|---|---|
| `containsAllEvalMetric` | Checks if the response contains all of the provided substrings |
| `containsAnyEvalMetric` | Checks if the response contains any of the provided substrings |
| `exactMatchEvalMetric` | Checks if the response exactly matches the expected string |
| `levensteinEvalMetric` | NLP similarity metric based on the Levenshtein distance |
| `rougeLCSSimilarityEvalMetric` | Similarity metric for sentence structure (Learn more) |
| `rougeNGramSimilarityEvalMetric` | Similarity metric at the word and phrase level (Learn more) |
| `rougeSkipBigramSimilarityEvalMetric` | Captures more flexible phrase structures, reflecting coherence in non-contiguous word patterns (Learn more) |
System Metrics
You can measure system metrics such as latency, cost, and more. To measure a specific metric, you need to add it to your agent's `response.metadata`. Here's an example tracking total cost in a `Chat` handler:
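A minimal sketch of such a handler, assuming the listed `ResponseMetadataKey` members; the enum's string values, the response shape, and the flat per-token price are illustrative assumptions, and the model call is stubbed so the snippet is self-contained.

```typescript
// NOTE: the enum string values, response shape, and per-token price are
// illustrative assumptions; the model call is stubbed for self-containment.
enum ResponseMetadataKey {
  TotalCost = "totalCost",
  InputTokens = "inputTokens",
  OutputTokens = "outputTokens",
  TotalTokens = "totalTokens",
}

interface ChatResponse {
  content: string;
  metadata: Partial<Record<ResponseMetadataKey, number>>;
}

const COST_PER_TOKEN = 0.000002; // assumed flat pricing, for illustration

// A chat handler that records token usage and total cost in response.metadata.
async function chatHandler(prompt: string): Promise<ChatResponse> {
  // Stand-in for the real LLM call; returns text plus usage counts.
  const completion = { text: `Echo: ${prompt}`, inputTokens: 12, outputTokens: 5 };
  const totalTokens = completion.inputTokens + completion.outputTokens;

  return {
    content: completion.text,
    metadata: {
      [ResponseMetadataKey.InputTokens]: completion.inputTokens,
      [ResponseMetadataKey.OutputTokens]: completion.outputTokens,
      [ResponseMetadataKey.TotalTokens]: totalTokens,
      [ResponseMetadataKey.TotalCost]: totalTokens * COST_PER_TOKEN,
    },
  };
}
```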
Available metadata keys:
- `ResponseMetadataKey.TotalCost`
- `ResponseMetadataKey.InputTokens`
- `ResponseMetadataKey.OutputTokens`
- `ResponseMetadataKey.TotalTokens`
- Execution Time
Custom Metrics
You can create your own metrics by providing an `EvalMetric` object. Here's an example of a custom metric that checks if the response length is within a specific range:
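A sketch of what that could look like, assuming an `EvalMetric` shape with `name` and `evaluate` fields; the real interface may differ, so adapt the field names accordingly.

```typescript
// NOTE: the EvalMetric shape (name + evaluate) is an assumption; adapt the
// field names to the actual interface expected by the framework.
interface EvalMetric {
  name: string;
  evaluate: (response: string) => { score: number; passed: boolean };
}

// Custom metric: passes when the response length falls within [min, max].
function responseLengthMetric(min: number, max: number): EvalMetric {
  return {
    name: `responseLengthBetween(${min}, ${max})`,
    evaluate: (response) => {
      const passed = response.length >= min && response.length <= max;
      return { score: passed ? 1 : 0, passed };
    },
  };
}
```

The factory-style constructor keeps the bounds configurable per test group.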