Prepare Synthetic Data

Learn how to add AI-synthesized data to improve your AI Copilot's results.

Why do you need synthetic data for AI Copilot?

Often AI Copilots must respond succinctly and answer in FAQ style.

When you use direct transcriptions of video tutorials and recordings as your Knowledge Base Documents, you will need some synthetic data to ensure the user's query is recognized and answered correctly. Video and audio tutorials often contain very long sentences and have a casual, conversational tone. This can hinder the AI Copilot's ability to search and summarize. To address this, we recommend using our Synthetic Document Extractor workflow.

Features of Synthetic Document Extractor

Extract information from videos and PDFs.
Collect lists of YouTube videos or PDFs for data crunching.
Update a Google Sheet as the workflow progresses.

LINK TO WORKFLOW: https://gooey.ai/doc-extract

Step 1: Create a New Google Sheet

Create a new, empty Google Sheet to store your extracted data. Set the access permissions to "Anyone with the link can edit."

Step 2: Enter Raw Data Links

What will work:

Hosted video and audio links
YouTube links
PDFs (PDFs that need OCR and PDFs with tabulated data will both work)

PRO TIP: If you paste the link to the Google Drive folder that holds your docs and PDFs, you should immediately see all the files in the folder.
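Before pasting a long list of links, it can help to sort them by the input types listed above. The Python sketch below is illustrative only: the function name `classify_link`, the category labels, and the list of media file extensions are assumptions and are not part of the workflow itself, which may accept or reject links by different rules.

```python
from urllib.parse import urlparse

# Hypothetical list of extensions treated as hosted audio or video files.
MEDIA_EXTENSIONS = (".mp4", ".mov", ".webm", ".mp3", ".wav", ".m4a")
YOUTUBE_HOSTS = ("youtube.com", "www.youtube.com", "m.youtube.com", "youtu.be")


def classify_link(url: str) -> str:
    """Return a rough label for a raw-data link (labels are illustrative)."""
    parsed = urlparse(url)
    host = parsed.netloc.lower()
    path = parsed.path.lower()

    if host in YOUTUBE_HOSTS:
        return "youtube"
    if host == "drive.google.com" and "/folders/" in path:
        return "google-drive-folder"
    if path.endswith(".pdf"):
        return "pdf"
    if path.endswith(MEDIA_EXTENSIONS):
        return "hosted-media"
    return "unknown"


# Placeholder links; replace them with your own.
links = [
    "https://www.youtube.com/watch?v=VIDEO_ID",
    "https://example.com/lessons/intro.mp4",
    "https://example.com/handouts/syllabus.pdf",
    "https://drive.google.com/drive/folders/FOLDER_ID",
]
for link in links:
    print(classify_link(link), link)
```

Links labeled "unknown" are worth checking by hand before you submit them.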

Step 3: Add instructions

Open the Settings tab and add the relevant instructions for the synthetic data conversion. An example is shown below:

You are a JavaScript tutor. Read the video training transcripts and create properly formatted output with sections under the following headings:
Provide a short and succinct title, with an additional delimiter at the title's end.
Description: provide a short summary of the video as a description.
Facts: succinctly and accurately list all the facts from the transcription that will be useful to the students; don't self-reference the video.
FAQs: think about the questions that students would ask for this JavaScript course based on the transcripts.
Remember the notes below:
make a comprehensive set of questions and answers based on the transcript
avoid repetitions
avoid self-referencing the course
don't make up questions and answers beyond the contents of the transcript
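If you maintain Copilots for several courses, you may want to reuse the example instructions above with a different subject. The following Python sketch shows one way to do that with a plain string template. The function name `build_instructions` and the `subject` parameter are hypothetical; the workflow only needs the final text pasted into the instructions field.

```python
# Notes copied from the example instructions; adjust them to your own needs.
NOTES = [
    "make a comprehensive set of questions and answers based on the transcript",
    "avoid repetitions",
    "avoid self referencing the course",
    "don't make up questions and answers beyond contents of the transcript",
]


def build_instructions(subject: str) -> str:
    """Fill the example instructions in for a given course subject."""
    lines = [
        f"You are a {subject} tutor. Read the video training transcripts and "
        "create output with sections under the following headings:",
        "Provide a short and succinct title, with an additional delimiter "
        "at the title's end.",
        "Description: provide a short summary of the video as a description.",
        "Facts: succinctly and accurately list all the facts from the "
        "transcription that will be useful to the students; don't self "
        "reference the video.",
        f"FAQs: think about the questions that students would ask for this "
        f"{subject} course based on the transcripts.",
        "Remember the notes below:",
        *NOTES,
    ]
    return "\n".join(lines)


# Print the instructions for a different course, ready to paste.
print(build_instructions("Python"))
```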

Step 4: Select a Model

Choose a language model for the synthetic data extraction from the available options.

Step 5: Select an ASR Model

Choose the ASR (automatic speech recognition) model that will work best for your recordings.

Step 6: Hit SUBMIT

Hit "Submit." The tool will prepare the sheet and update it in real time, auto-populating all the needed information along with a transcription.

Harnessing Additional Functions

The Synthetic Document Extractor workflow lets you process every video in a YouTube playlist from a single playlist link, which produces transcriptions for the entire playlist. Alternatively, you can enter a hand-picked list of videos for specific transcription tasks.
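To tell a playlist link from a single-video link before submitting, you can check for the `list` query parameter that YouTube playlist URLs carry. This is a minimal sketch; the helper name `playlist_id` is hypothetical, and whether the workflow detects playlists the same way is not stated here.

```python
from typing import Optional
from urllib.parse import parse_qs, urlparse


def playlist_id(url: str) -> Optional[str]:
    """Return the value of the `list` query parameter, or None if absent."""
    values = parse_qs(urlparse(url).query).get("list")
    return values[0] if values else None


# Placeholder IDs; replace them with your own links.
print(playlist_id("https://www.youtube.com/playlist?list=PLAYLIST_ID"))
print(playlist_id("https://www.youtube.com/watch?v=VIDEO_ID"))
```

Note that a single-video link opened from inside a playlist can also carry a `list` parameter, so treat the result as a hint and not a guarantee.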

Note: Adding new data to the same sheet may overwrite the saved information.

The tool works best for content shorter than 30 to 40 minutes because of word limit restrictions in Google Sheets.
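A rough back-of-the-envelope check can show whether a recording is likely to fit. The sketch below assumes that the restriction in question is Google Sheets' cap of 50,000 characters per cell, that speech runs at about 150 words per minute, and that a word averages 6 characters including the trailing space. All three figures are assumptions for illustration, and the function names are hypothetical.

```python
# Assumption: the relevant limit is 50,000 characters in a single cell.
SHEETS_CELL_CHAR_LIMIT = 50_000


def estimated_transcript_chars(
    minutes: float, words_per_minute: int = 150, chars_per_word: int = 6
) -> int:
    """Estimate transcript length in characters from the recording length."""
    return round(minutes * words_per_minute * chars_per_word)


def fits_in_one_cell(minutes: float) -> bool:
    """True if the estimated transcript stays within the assumed cell limit."""
    return estimated_transcript_chars(minutes) <= SHEETS_CELL_CHAR_LIMIT


for minutes in (30, 40, 60):
    print(minutes, estimated_transcript_chars(minutes), fits_in_one_cell(minutes))
# 30 27000 True
# 40 36000 True
# 60 54000 False
```

Under these assumptions a 40-minute recording leaves some headroom, while an hour-long one does not, which is in line with the guidance above. Fast speakers or dense content will reach the limit sooner.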

Transcription Bonus: Extract Data from PDFs

This tool also supports extracting data from PDFs. Simply paste the link to an accessible PDF into the input and hit "Submit." As with videos, it will extract the important data from your document while updating a Google Sheet in real time.

Tutorial available here:
