βš–οΈUnderstanding Bulk Runner and Evaluation

Why do you need Bulk Runner and Evaluation?

When building your Gooey.AI workflows, you will often have to tweak settings to ensure the responses are consistent, grounded, and verifiable.

There are several components to test:

  • testing prompts

  • ensuring the synthetic data retrieval works

  • checking the suitability of the language model and its advanced settings

  • Latency of generated answers

  • evaluation of the final AI Copilot to produce the Golden Answers

  • evaluation of the price per run

  • regression tests

How can you do this at scale?

This is where Gooey.AI’s Bulk Runner and Evaluation features shine!

Features of Bulk Runner and Evaluation

  • Run several models in one click

  • Run several iterations of your workflows at scale

  • Choose any of the API Response Outputs to populate your test

  • Get output in CSV for further data analysis

  • Built-in evaluation tool for quick analysis

  • Use CSV or Google Sheets as input

Quickstart

Here are the quickstart guides for Bulk Runner and Evaluation:

How Does Bulk Runner Work?

Bulk Run Overview

This diagram details the process of generating AI-driven answers to a set of test questions using a large language model (LLM) with Retrieval-Augmented Generation (RAG) capabilities.

1. Test Question Set

The process begins with a curated set of questions. Examples of such questions include:

- "What is the lipsync tool's API?"

- "What is the step-by-step method to make a good animation?"

2. Bulk Run

Your β€œSaved” AI Copilot run processes the entire question set, generating an answer for each question individually.

3. Generated Output Texts

The generated answers are compiled into an output table. Each question is paired with its respective AI-generated response. For example, the answer to "What is the lipsync tool's API?" provides detailed information regarding the API's functionality and integration methods.
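
To make the flow concrete, here is a minimal Python sketch of the same three steps, assuming a hypothetical `ask_copilot` helper that stands in for your β€œSaved” Copilot run (it is not part of any Gooey.AI SDK):

```python
# Conceptual sketch of a bulk run: each row of the test-question CSV is sent
# to the saved Copilot run, and the answers are collected into an output table.
import csv

def ask_copilot(question: str) -> str:
    """Hypothetical placeholder for a call to your saved Copilot run."""
    return f"(generated answer for: {question})"

with open("test_questions.csv", newline="") as src, \
     open("bulk_run_output.csv", "w", newline="") as dst:
    reader = csv.DictReader(src)
    writer = csv.DictWriter(dst, fieldnames=["question", "generated_answer"])
    writer.writeheader()
    for row in reader:
        # One Copilot call per question; the output pairs each question with its answer.
        writer.writerow({"question": row["question"],
                         "generated_answer": ask_copilot(row["question"])})
```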

How Does Evaluation Work?

Comparison and Evaluation Overview

This section details the process for comparing and evaluating generated answers against a set of golden answers to assess their semantic and technical accuracy.

  1. Input: Questions and Golden Answers

    1. Test Question Set: A curated set of questions to be answered, such as:

      - "What is the lipsync tool's API?" - "What is the step-by-step method to make a good animation?"

    2. Golden Answers Set: Expert-generated answers serve as the benchmark for evaluation.

  2. Bulk Runs: The test question set undergoes multiple bulk runs with differently configured Copilot Runs. For example, you can have various runs where you have tweaked the prompts, or where you want to test which LLM answers your questions best. The test questions are processed to produce a corresponding set of Generated Answers.

  3. Compare and Evaluate: The generated answer sets from each bulk run are compared against the golden answers to evaluate their accuracy.

    Scoring: Each generated answer set is scored based on its semantic and technical accuracy relative to the golden answers.

In this example:

  • Generated Answer Set 1: Scores 0.8, indicating it is 80% close in accuracy.

  • Generated Answer Set 2: Scores 1.0, indicating perfect alignment with the golden answers.

  • Generated Answer Set 3: Scores 0.6, indicating 60% accuracy.

The iterative bulk runs and systematic comparison provide a framework for improving AI-driven answer generation.
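
How the closeness score itself is computed is internal to the Evaluation workflow; as a rough, self-contained illustration of the idea (not Gooey.AI’s actual scorer), a toy 0-1 β€œcloseness” score can be computed and ranked like this:

```python
# Toy illustration of scoring generated answers against a golden answer.
# Real evaluators use an LLM or embeddings; this sketch uses a simple
# bag-of-words cosine similarity so the 0-1 scoring idea stays visible.
import math
from collections import Counter

def closeness(generated: str, golden: str) -> float:
    """Cosine similarity between word-count vectors, in [0, 1]."""
    a, b = Counter(generated.lower().split()), Counter(golden.lower().split())
    dot = sum(a[w] * b[w] for w in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

golden = "Export the report as a CSV file from the downloads tab."
generated_sets = {                      # made-up sample answers from three bulk runs
    "generated_set_1": "You can export the report as a CSV from the downloads tab.",
    "generated_set_2": "Export the report as a CSV file from the downloads tab.",
    "generated_set_3": "Reports are emailed to you every week.",
}
scores = {name: closeness(answer, golden) for name, answer in generated_sets.items()}
for name, score in sorted(scores.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name}: {score:.2f}")       # higher means closer to the golden answer
```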

When do you use Bulk vs Bulk+Eval vs Eval?

  • Bulk workflow only - If you want to test your Copilot’s functionality for regression tests, monitoring and observability, and bugs.

  • Bulk and Eval - If you are testing improvements on your prompts, or updating your documents, want to consider A/B testing.

  • Eval workflow only- If you already have test data and want to use β€œLLM as Judge” to evaluate it

Common terms

  • Golden Answer: Most suitable and accurate answers provided by humans with expertise on the subject

  • Semantic Closeness: Since LLM will not output the same answer every time, the evaluation will check for how semantically close the output of the LLM is to your β€œGolden Answer”

  • Score and Rank: For each generated answer the Evaluation workflow will give a β€œscore” between 0 and 1, and rank the best answer.

  • Reasoning: Evaluation LLM will share a short "reasoning" of how the score was given

  • Chart: Based on the aggregate score, the Evaluation workflow will create a compare chart that
