Setting up Experiments
1. Set up your test-cases
Create a list of test-cases that models the expected behavior of your application, and add them to the src/evals/<test-group>/index.ts file. You can set up test-cases by creating a static list or by fetching them from external data-sources. You have access to various built-in metrics to measure the performance of your LLM application, and you can also create your own custom metrics. Learn more about setting up test-cases below.
2. Run an evaluation
Run evaluations against different variations of your application using App Configs in Studio. Navigate to Studio > Experiments, create a new experiment, and run evaluations against different versions of your application.
3. Analyze results
Analyze results across different evaluations in Studio > Experiments > Notebooks. Notebooks give you various data-analysis tools to help you analyze the results.
Defining Test Cases
Create test groups by creating a folder within the src/evals directory. Each test group should have an index.ts file that exports a TestDatasetFN function. This function should return an array of test-cases.
Here’s an example of statically defining test-cases in src/evals/history/index.ts:
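A minimal sketch of what such a file might look like. The `TestCase` shape and the `TestDatasetFN` signature below are assumptions based on this guide, not the framework’s exact types — import the real ones from the framework:

```typescript
// Hypothetical shapes; the real TestDatasetFN type comes from the framework.
interface TestCase {
  input: { userMessage: string };
  tags: Record<string, string>;
  metrics: unknown[]; // e.g. containsAnyEvalMetric([...])
}

type TestDatasetFN = () => TestCase[] | Promise<TestCase[]>;

const testCases: TestDatasetFN = () => [
  {
    input: { userMessage: "Who won the 2020 World Series?" },
    tags: { intent: "sports-history" },
    metrics: [],
  },
  {
    input: { userMessage: "What did I ask you last?" },
    tags: { intent: "conversation-history" },
    metrics: [],
  },
];

export default testCases;
```

Tags are useful later when filtering results in Notebooks; metrics are attached per test-case.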
Test Cases from External Data-Source
You can also fetch test-cases from external data-sources. Here’s an example that fetches test-cases from a database in src/evals/history/index.ts:
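A sketch of the runtime-fetch pattern. `fetchRows` here is a placeholder for your actual database client (Prisma, Knex, a raw SQL driver, etc.):

```typescript
// Sketch of building test-cases from a database at runtime. `fetchRows`
// is a placeholder for your real database query.
interface EvalRow {
  question: string;
  expectedAnswer: string;
}

async function fetchRows(): Promise<EvalRow[]> {
  // Replace with a real query, e.g. SELECT question, expected_answer FROM eval_cases;
  return [{ question: "What is 2 + 2?", expectedAnswer: "4" }];
}

const testCases = async () => {
  const rows = await fetchRows();
  return rows.map((row) => ({
    input: { userMessage: row.question },
    tags: { source: "database" },
    metrics: [], // e.g. attach a metric built from row.expectedAnswer
  }));
};

export default testCases;
```

Because the dataset function can be async, the test-cases are re-fetched each time the evaluation runs, so the dataset stays in sync with the source.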
Metrics
We provide a set of metrics out of the box, but you can also create your own custom metrics. Here’s a list of the metrics we provide:

| Metric Name | Description |
|---|---|
| containsAllEvalMetric | Checks if the response contains all of the provided substrings |
| containsAnyEvalMetric | Checks if the response contains any of the provided substrings |
| exactMatchEvalMetric | Checks if the response exactly matches the provided string |
| levensteinEvalMetric | NLP similarity metric based on the Levenshtein distance |
| rougeLCSSimilarityEvalMetric | Similarity metrics for sentence structure (Learn more) |
| rougeNGramSimilarityEvalMetric | Similarity metrics at words and phrase levels (Learn more) |
| rougeSkipBigramSimilarityEvalMetric | Captures more flexible phrase structures, reflecting coherence in non-contiguous word patterns (Learn more) |
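To make the substring metrics concrete, here is an illustrative local re-implementation of their checks. The framework ships its own versions; this only shows the semantics, with 1 as a pass and 0 as a fail:

```typescript
// Illustrative only: what the substring metrics check, scored 1 (pass) or 0 (fail).
const containsAll = (subs: string[]) => (response: string): number =>
  subs.every((s) => response.includes(s)) ? 1 : 0;

const containsAny = (subs: string[]) => (response: string): number =>
  subs.some((s) => response.includes(s)) ? 1 : 0;

const response = "The capital of France is Paris.";
containsAll(["France", "Paris"])(response); // 1: both substrings are present
containsAny(["London", "Paris"])(response); // 1: at least one substring is present
```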
System Metrics
You can measure system metrics such as latency, cost, and more. To measure a specific metric, add it to your agent’s response.metadata. Here’s an example tracking total cost:
Chat handler:
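A sketch of a handler populating response.metadata with cost figures. The string values behind the keys and the per-token prices are placeholders — use the ResponseMetadataKey enum the framework exports and your provider’s real rates:

```typescript
// Placeholder key values; use the framework's ResponseMetadataKey enum instead.
const ResponseMetadataKey = {
  TotalCost: "total_cost",
  InputTokens: "input_tokens",
  OutputTokens: "output_tokens",
  TotalTokens: "total_tokens",
} as const;

interface ChatResponse {
  message: string;
  metadata: Record<string, number>;
}

// Hypothetical pricing: $0.50 per 1M input tokens, $1.50 per 1M output tokens.
function buildResponse(
  message: string,
  inputTokens: number,
  outputTokens: number
): ChatResponse {
  const totalCost = (inputTokens * 0.5 + outputTokens * 1.5) / 1_000_000;
  return {
    message,
    metadata: {
      [ResponseMetadataKey.InputTokens]: inputTokens,
      [ResponseMetadataKey.OutputTokens]: outputTokens,
      [ResponseMetadataKey.TotalTokens]: inputTokens + outputTokens,
      [ResponseMetadataKey.TotalCost]: totalCost,
    },
  };
}
```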
The following metadata keys are available:
- ResponseMetadataKey.TotalCost
- ResponseMetadataKey.InputTokens
- ResponseMetadataKey.OutputTokens
- ResponseMetadataKey.TotalTokens
Execution Time
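Execution time can be reported the same way. A sketch, assuming a millisecond-valued metadata key (the key name here is made up):

```typescript
// Record wall-clock latency around the model call and attach it to metadata.
async function timedHandler(): Promise<{ metadata: Record<string, number> }> {
  const start = Date.now();
  await new Promise((resolve) => setTimeout(resolve, 25)); // stand-in for the model call
  return { metadata: { execution_time_ms: Date.now() - start } };
}
```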
Custom Metrics
You can create your own metrics by providing an EvalMetric object. Here’s an example of a custom metric that checks if the response length is within a specific range:
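A sketch of such a metric. The `EvalMetric` shape here (a name plus an `evaluate` function returning a score) is an assumption based on this guide — check the framework’s actual type:

```typescript
// Assumed EvalMetric shape: a name plus an evaluate function returning a score.
interface EvalMetric {
  name: string;
  evaluate: (response: string) => { score: number };
}

// Scores 1 when the response length falls within [min, max], else 0.
const responseLengthMetric = (min: number, max: number): EvalMetric => ({
  name: "response-length-range",
  evaluate: (response) => ({
    score: response.length >= min && response.length <= max ? 1 : 0,
  }),
});

const metric = responseLengthMetric(10, 200);
metric.evaluate("A concise, helpful answer."); // score: 1 (26 characters)
```

Attach the metric to a test-case’s metrics array alongside the built-in metrics.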
