
AI Verification Guide

Understanding AI Content and Process

Understanding a few basics about how Generative Artificial Intelligence works is crucial to identifying student writing created with it.

AI Datasets

Generative AIs are trained on datasets. There are datasets for the Large Language Model (LLM) AIs such as ChatGPT and datasets for the image AIs such as DALL-E.

Most of the AIs of a given type that a student might use:

  • Share the same dataset - that's right, they have the same content but use it differently.
  • Have content that largely comes from a crawl of free internet material and Wikipedia.
  • Are limited in currency to the date the crawl was performed.
  • Don't necessarily have reliable content.
  • May have an intellectual property problem.
  • Cannot get past the firewalls protecting private data or paid subscription databases (YET).

- Update from April 2024 - 
The game is changing: some LLMs are now purchasing content from paid datasets. Most search engines now include some sort of AI (the quality varies), and soon there will be an AI presence in library resources as well.


A pie chart of ChatGPT's training dataset sources.


A Generative AI system first breaks text into small pieces of information called tokens; this splitting is called tokenization. It then builds a response by predicting, one token at a time, which token is statistically most likely to come next.

Just because a token seems to match does not mean that it correctly matches.


The ball rolled down the __________? What would you say?


Obviously it should be "hill," right? But what if "hill" wasn't right?
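The fill-in-the-blank exercise above can be sketched in code. This is a minimal toy model, not how a real LLM works: the tiny corpus, the `predict_next` function, and splitting on spaces are all invented for illustration (real systems use subword tokenizers and neural networks, not simple frequency counts). Still, it shows the core idea that the "answer" is just the statistically most common continuation, which may or may not be correct for a particular sentence.

```python
from collections import Counter, defaultdict

# A tiny, made-up training corpus.
corpus = [
    "the ball rolled down the hill",
    "the ball rolled down the stairs",
    "the ball rolled down the hill quickly",
    "the cart rolled down the hill",
]

# Count which token follows each pair of tokens (a toy trigram model).
trigrams = defaultdict(Counter)
for sentence in corpus:
    tokens = sentence.split()  # crude "tokenization" by whitespace
    for a, b, c in zip(tokens, tokens[1:], tokens[2:]):
        trigrams[(a, b)][c] += 1

def predict_next(a, b):
    """Return the token most often seen after the pair (a, b), or None."""
    counts = trigrams[(a, b)]
    return counts.most_common(1)[0][0] if counts else None

# "The ball rolled down the ___?" -> the most frequent continuation.
print(predict_next("down", "the"))  # -> "hill" (3 of 4 times, not always)
```

Note that the model answers "hill" only because "hill" is the most frequent continuation in its data; if the sentence being completed actually ended in "stairs," the prediction would simply be wrong, with no way for the model to know.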

Mari Kermit-Canfield
Creative Learning Librarian and Coordinator of Research Services