Experiments
To improve the performance of your application, you need an iterative loop that helps you systematically measure and improve it. Palico helps you and your team set up this loop with Experiments.
Setting up Experiments
Set up your test-cases
Create a list of test-cases that model the expected behavior of your application, and add them to the `src/evals/<test-group>/index.ts` file.
You can define test-cases as a static list or fetch them from external data-sources. You have access to a variety of built-in metrics for measuring the performance of your LLM application, and you can also create your own custom metrics. Learn more about setting up test-cases.
Run an evaluation
Run evaluations against different variations of your application in Studio using App Config. Navigate to Studio > Experiments, create a new experiment, and run evaluations against each version of your application.
Analyze results
Analyze results across different evaluations in Studio > Experiments > Notebooks. Notebooks give you various data-analysis tools to help you compare and interpret the results.
Defining Test Cases
Create test groups by creating a folder within the `evals` directory. Each test group should have an `index.ts` file that exports a `TestDatasetFN` function, which returns an array of test-cases.
Here’s an example of statically defining test-cases:
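The sketch below shows the general shape. The package name `@palico-ai/app` and the test-case fields (`input`, `tags`, `metrics`) are assumptions based on typical usage, so check the test-case reference for the exact types your version exposes.

```ts
// src/evals/math-questions/index.ts
// Minimal sketch of a statically defined test group.
// Assumption: the package name and test-case field names may differ slightly.
import { TestDatasetFN, containsAnyEvalMetric } from '@palico-ai/app';

const testCases: TestDatasetFN = async () => {
  return [
    {
      input: {
        userMessage: 'What is 2 + 2?',
      },
      tags: {
        category: 'arithmetic',
      },
      metrics: [
        // Passes if the response mentions the expected answer
        containsAnyEvalMetric({
          substrings: ['4', 'four'],
        }),
      ],
    },
  ];
};

export default testCases;
```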
Test Cases from External Data-Source
You can also fetch test-cases from external data-sources. Here’s an example fetching test-cases from a database:
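The sketch below pulls rows from Postgres with the `pg` client; the table name, column names, and metric options are hypothetical placeholders for whatever data-source you actually use.

```ts
// src/evals/support-questions/index.ts
// Sketch: build test-cases from rows stored in a database.
// The table/column names and metric options here are hypothetical.
import { TestDatasetFN, containsAllEvalMetric } from '@palico-ai/app';
import { Pool } from 'pg';

const pool = new Pool({ connectionString: process.env.EVAL_DB_URL });

const testCases: TestDatasetFN = async () => {
  const { rows } = await pool.query(
    'SELECT question, expected_keywords FROM eval_cases'
  );
  return rows.map((row) => ({
    input: {
      userMessage: row.question,
    },
    metrics: [
      // expected_keywords is assumed to be a text[] column
      containsAllEvalMetric({
        substrings: row.expected_keywords,
      }),
    ],
  }));
};

export default testCases;
```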
Metrics
We provide a set of metrics out of the box, and you can also create your own custom metrics. The built-in metrics are:
| Metric Name | Description |
| --- | --- |
| containsAllEvalMetric | Checks if the response contains all of the provided substrings |
| containsAnyEvalMetric | Checks if the response contains any of the provided substrings |
| exactMatchEvalMetric | Checks if the response exactly matches the provided string |
| levensteinEvalMetric | NLP similarity metric based on Levenshtein distance |
| rougeLCSSimilarityEvalMetric | Similarity metric for sentence structure (Learn more) |
| rougeNGramSimilarityEvalMetric | Similarity metric at the word and phrase level (Learn more) |
| rougeSkipBigramSimilarityEvalMetric | Captures more flexible phrase structures, reflecting coherence in non-contiguous word patterns (Learn more) |
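For illustration, similarity metrics can be combined on a single test-case so an answer is graded on both character-level closeness and sentence structure. The option shape passed to each metric (`expected`) is an assumption; check each metric’s reference for its exact parameters.

```ts
// Hypothetical mix of built-in similarity metrics for one test-case.
// The `expected` option name is an assumption.
import {
  levensteinEvalMetric,
  rougeLCSSimilarityEvalMetric,
} from '@palico-ai/app';

const expected = 'Refunds are available within 30 days of purchase.';

export const similarityMetrics = [
  // Character-level closeness to the expected answer
  levensteinEvalMetric({ expected }),
  // Sentence-structure similarity, more tolerant of rephrasing
  rougeLCSSimilarityEvalMetric({ expected }),
];
```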
System Metrics
You can measure system metrics such as latency, cost, and more. To track a specific metric, add it to your agent’s `response.metadata`. Here’s an example tracking total cost:
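The sketch below is illustrative only: the handler signature, the OpenAI call, and the per-token prices are assumptions; the parts taken from the docs are the `metadata` field and `ResponseMetadataKey.TotalCost`.

```ts
// Sketch of a Chat handler that reports total cost in response.metadata.
// Assumptions: the handler parameter/return shapes and the pricing constants.
import { ResponseMetadataKey } from '@palico-ai/app';
import OpenAI from 'openai';

const openai = new OpenAI();

// Hypothetical per-token prices, used only to make the example concrete
const INPUT_TOKEN_COST = 0.15 / 1_000_000;
const OUTPUT_TOKEN_COST = 0.6 / 1_000_000;

const handler = async ({ userMessage }: { userMessage: string }) => {
  const completion = await openai.chat.completions.create({
    model: 'gpt-4o-mini',
    messages: [{ role: 'user', content: userMessage }],
  });
  const inputTokens = completion.usage?.prompt_tokens ?? 0;
  const outputTokens = completion.usage?.completion_tokens ?? 0;
  return {
    message: completion.choices[0].message.content ?? '',
    metadata: {
      [ResponseMetadataKey.TotalCost]:
        inputTokens * INPUT_TOKEN_COST + outputTokens * OUTPUT_TOKEN_COST,
    },
  };
};

export default handler;
```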
You can add the following system metrics as part of your `Chat` handler:
ResponseMetadataKey.TotalCost
ResponseMetadataKey.InputTokens
ResponseMetadataKey.OutputTokens
ResponseMetadataKey.TotalTokens
The following system metrics are automatically tracked:
Execution Time
Custom Metrics
You can create your own metrics by providing an `EvalMetric` object. Here’s an example of a custom metric that checks if the response length is within a specific range:
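A minimal sketch is below, assuming an `EvalMetric` with a `label` and an async `evaluate` method that returns a score; the exact interface (method name, arguments, and result shape) may differ in your Palico version.

```ts
// Sketch of a custom EvalMetric that scores response length.
// Assumption: the label/evaluate shape of EvalMetric shown here.
import { EvalMetric } from '@palico-ai/app';

const responseLengthMetric = (min: number, max: number): EvalMetric => ({
  label: 'response_length_in_range',
  async evaluate(response) {
    const length = response.message?.length ?? 0;
    // Score 1 when the response length falls inside [min, max], else 0
    return {
      score: length >= min && length <= max ? 1 : 0,
    };
  },
});

// Usage inside a test-case's metrics array (hypothetical):
// metrics: [responseLengthMetric(50, 400)]
export default responseLengthMetric;
```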