Understanding Bulk Runner and Evaluation

Why do you need bulk runner and evaluations?

When building your Gooey.AI workflows, you will often need to tweak settings to ensure the responses are consistent, grounded, and verifiable.

There are several components to test:

  • testing prompts

  • ensuring the synthetic data retrieval works

  • checking the suitability of the language model and its advanced settings

  • latency of generated answers

  • evaluation of the final AI Copilot's responses against the Golden Answers

  • evaluation of the price per run

  • regression tests

How can you do this at scale?

This is where Gooey.AI’s Bulk and Evaluation features shine!

Features of Bulk Runner and Evaluation

  • Run several models in one click

  • Run several iterations of your workflows at scale

  • Choose any of the API Response Outputs to populate your test

  • Get output in CSV for further data analysis

  • Built-in evaluation tool for quick analysis

  • Use CSV or Google Sheets as input
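
For example, a test set for the Bulk Runner can be prepared as a simple CSV or Google Sheet. The sketch below uses Python and pandas to build such a file; the column names (question, golden_answer) are illustrative assumptions, so map your own headers to the workflow’s inputs when you configure the Bulk Runner.

```python
import pandas as pd

# Illustrative test set: one row per test question, with an optional
# human-written "golden answer" for later evaluation.
# Column names here are assumptions -- map them to your workflow's
# input fields when you configure the Bulk Runner.
test_set = pd.DataFrame(
    [
        {
            "question": "What is the lipsync tool's API?",
            "golden_answer": "Placeholder golden answer written by a subject-matter expert ...",
        },
        {
            "question": "What is the step-by-step method to make a good animation?",
            "golden_answer": "Placeholder golden answer written by a subject-matter expert ...",
        },
    ]
)

# Save as a CSV (or put the same columns into a Google Sheet).
test_set.to_csv("copilot_test_set.csv", index=False)
print(test_set)
```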

Quickstart

Here are the quickstart guides for Bulk Runner and Evaluation:

  • How to set up Bulk Runner?

  • How to set up Evaluation?

How Does Bulk Runner Work?

Bulk Run Overview

This diagram details the process of generating AI-driven answers to a set of test questions using a Large Language Model (LLM) with Retrieval-Augmented Generation (RAG) capabilities.

1. Test Question Set

The process begins with a curated set of questions. Examples of such questions include:

- "What is the lipsync tool's API?"

- "What is the step-by-step method to make a good animation?"

2. Bulk Run

Your “Saved” AI Copilot run processes the entire question set. Each question is processed individually to generate its corresponding answer.

3. Generated Output Texts

The generated answers are compiled into an output table. Each question is paired with its respective AI-generated response. For example, the answer to "What is the lipsync tool's API?" provides detailed information regarding the API's functionality and integration methods.
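
The output table can be exported as a CSV for further analysis. As a minimal sketch (assuming the export contains a question column and a generated-answer column; the actual headers depend on which API Response Outputs you selected), you could inspect it with pandas:

```python
import pandas as pd

# Load the CSV exported from the Bulk Run.
# "question" and "generated_answer" are assumed column names; the real
# headers depend on the API Response Outputs you chose to populate.
results = pd.read_csv("bulk_run_output.csv")

# Pair each test question with its AI-generated response.
for _, row in results.iterrows():
    print(f"Q: {row['question']}")
    print(f"A: {row['generated_answer']}\n")

# Quick sanity checks: empty answers, answer length distribution, etc.
print("Empty answers:", results["generated_answer"].isna().sum())
print("Average answer length:", results["generated_answer"].str.len().mean())
```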

How Does Evaluation Work?

Comparison and Evaluation Overview

This section details the process for comparing and evaluating generated answers against a set of golden answers to assess their semantic and technical accuracy.

  1. Input: Questions and Golden Answers

    1. Test Question Set: A curated set of questions to be answered, such as:

      - "What is the lipsync tool's API?" - "What is the step-by-step method to make a good animation?"

    2. Golden Answers Set: Expert-generated answers that serve as the benchmark for evaluation.

  2. Bulk Runs: The test question set undergoes multiple bulk runs with differently configured Copilot Runs. For example, you can have various runs where you have tweaked the prompts, or where you want to test which LLM answers your questions best. The test questions are processed to produce a corresponding set of Generated Answers.

  3. Compare and Evaluate: The generated answer sets from each bulk run are compared against the golden answers to evaluate their accuracy.

    Scoring: Each generated answer set is scored based on its semantic and technical accuracy relative to the golden answers.

In this example:

  • Generated Answer Set 1: Scores 0.8, indicating 80% alignment with the golden answers.

  • Generated Answer Set 2: Scores 1.0, indicating perfect alignment with the golden answers.

  • Generated Answer Set 3: Scores 0.6, indicating 60% alignment with the golden answers.

The iterative bulk runs and systematic comparison provide a framework for improving AI-driven answer generation.
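
To illustrate how per-question scores roll up into the per-set scores above, here is a small, purely conceptual sketch (not Gooey.AI’s internal implementation): it averages the 0–1 scores of each generated answer set and ranks the sets, mirroring the 0.8 / 1.0 / 0.6 example. The individual numbers are made up for illustration.

```python
# Conceptual sketch only: aggregate per-question scores (0..1) for each
# generated answer set and rank the sets, as in the example above.
# The per-question numbers below are illustrative.
per_question_scores = {
    "Generated Answer Set 1": [0.9, 0.7, 0.8],  # averages to 0.8
    "Generated Answer Set 2": [1.0, 1.0, 1.0],  # averages to 1.0
    "Generated Answer Set 3": [0.5, 0.7, 0.6],  # averages to 0.6
}

aggregate = {
    name: sum(scores) / len(scores)
    for name, scores in per_question_scores.items()
}

# Rank sets from best to worst by aggregate score.
ranking = sorted(aggregate.items(), key=lambda kv: kv[1], reverse=True)

for rank, (name, score) in enumerate(ranking, start=1):
    print(f"#{rank}: {name} -> {score:.2f}")
```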

When do you use Bulk vs Bulk+Eval vs Eval?

  • Bulk workflow only: if you want to test your Copilot’s functionality for regression tests, monitoring and observability, and bugs.

  • Bulk and Eval: if you are testing improvements to your prompts, or updating your documents, and want to consider A/B testing.

  • Eval workflow only: if you already have test data and want to use “LLM as Judge” to evaluate it.

Common terms

  • Golden Answer: The most suitable and accurate answers, provided by humans with expertise on the subject.

  • Semantic Closeness: Since an LLM will not output the same answer every time, the evaluation checks how semantically close the LLM’s output is to your “Golden Answer”.

  • Score and Rank: For each generated answer, the Evaluation workflow gives a “score” between 0 and 1 and ranks the best answer.

  • Reasoning: The Evaluation LLM shares a short “reasoning” explaining how the score was given.

  • Chart: Based on the aggregate scores, the Evaluation workflow creates a comparison chart of the evaluated runs.
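
To make these terms concrete, here is a minimal, hypothetical “LLM as Judge” sketch: it builds the kind of prompt an evaluation LLM could be given so that it returns a 0–1 score plus a short reasoning. This is an illustration of the idea, not Gooey.AI’s actual evaluation prompt or implementation; the JSON field names are assumptions, and the actual judge-LLM call is omitted in favour of a hard-coded example reply.

```python
import json

# Hypothetical judge prompt, shown only to illustrate "LLM as Judge".
JUDGE_PROMPT = """You are an evaluator. Compare the generated answer to the
golden answer and rate how semantically close they are.

Question: {question}
Golden answer: {golden_answer}
Generated answer: {generated_answer}

Respond with JSON: {{"score": <float between 0 and 1>, "reasoning": "<one or two sentences>"}}
"""

def build_judge_prompt(question: str, golden_answer: str, generated_answer: str) -> str:
    """Fill the judge prompt template for one question/answer pair."""
    return JUDGE_PROMPT.format(
        question=question,
        golden_answer=golden_answer,
        generated_answer=generated_answer,
    )

def parse_judge_response(raw: str) -> tuple[float, str]:
    """Parse the judge LLM's JSON reply into (score, reasoning)."""
    data = json.loads(raw)
    return float(data["score"]), data["reasoning"]

# Example with a hard-coded, illustrative judge reply, since the real
# LLM call depends on whichever judge model is used:
example_reply = '{"score": 0.8, "reasoning": "Covers the API behaviour but omits some integration details."}'
score, reasoning = parse_judge_response(example_reply)
print(score, reasoning)
```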
