Training Data

Simple Definition

Training data is the dataset an AI model learns from. During training, the model processes millions or billions of examples drawn from this data and repeatedly adjusts its internal parameters so that its predictions improve.
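As a rough illustration of what "adjusting parameters" means, the sketch below fits a single parameter w so that w * x matches a handful of toy examples. The data, the function name train, and the learning rate are assumptions made for this example. Real models have vastly more parameters and data, but the loop has the same basic shape.

```python
# Toy training data: each pair is (input, desired output). Here y = 2x.
training_data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]


def train(data, epochs=200, learning_rate=0.05):
    """Fit y = w * x by gradient descent on mean squared error."""
    w = 0.0  # the model's single internal parameter, starting uninformed
    for _ in range(epochs):
        # Average gradient of the squared error over all examples.
        grad = sum(2 * (w * x - y) * x for x, y in data) / len(data)
        # Nudge the parameter in the direction that reduces the error.
        w -= learning_rate * grad
    return w


w = train(training_data)
print(f"learned w = {w:.4f}")
```

The learned value ends up very close to 2, the pattern hidden in the examples. If the examples were changed, the learned parameter would change with them.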

For a language model like GPT-4 or Claude, training data includes billions of words from books, websites, code repositories, and other text sources.

Why Training Data Matters

The data shapes everything about a model:

  • What it knows — on its own, a model can only draw on what was in its training data
  • What it does well — more examples of a task generally lead to better performance on that task
  • Its biases — patterns in the data become patterns in the model’s outputs
  • Its knowledge cutoff — the training data stops at a fixed date, so the model doesn’t know about events after it
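The points above can be seen in miniature with a word-pair (bigram) counter, which is far simpler than an LLM but depends on its data in the same way. This is a sketch under stated assumptions: the corpus and the function names train_bigram and predict_next are made up for illustration.

```python
from collections import Counter, defaultdict


def train_bigram(corpus):
    """Count which word follows which across all sentences."""
    counts = defaultdict(Counter)
    for sentence in corpus:
        words = sentence.lower().split()
        for prev, nxt in zip(words, words[1:]):
            counts[prev][nxt] += 1
    return counts


def predict_next(model, word):
    """Return the most frequent follower of `word`, or None if unseen."""
    followers = model.get(word.lower())
    if not followers:
        return None  # the word never appeared with a follower in training
    return followers.most_common(1)[0][0]


# Hypothetical toy corpus; "cat" follows "the" more often than any other word.
corpus = [
    "the cat sat on the mat",
    "the cat chased the mouse",
    "the dog sat on the rug",
]
model = train_bigram(corpus)

print(predict_next(model, "the"))    # "cat": the most common pattern wins
print(predict_next(model, "piano"))  # None: never seen in training
```

The model favors "cat" only because the corpus does, and it has nothing to say about "piano" because that word never appeared.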

Types of Training Data for LLMs

  • Web text — scraped websites (Common Crawl, etc.)
  • Books — broad vocabulary and reasoning patterns
  • Code — improves coding ability
  • Conversations — helps the model respond naturally
  • Curated human feedback — used in RLHF to align the model with human preferences
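One common way to combine text sources like these is a weighted mixture, where each training example is drawn from a source with some probability. The sketch below assumes made-up weights and source names purely for illustration; actual mixtures used by model developers are generally not published in detail. Human feedback is left out because it is typically used in a separate stage rather than mixed into the main text corpus.

```python
import random

# Hypothetical mixture weights, chosen only for illustration.
MIXTURE = {
    "web_text": 0.55,
    "books": 0.20,
    "code": 0.15,
    "conversations": 0.10,
}


def sample_sources(mixture, n, seed=0):
    """Pick the source of each of n training examples according to the weights."""
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    names = list(mixture)
    weights = [mixture[name] for name in names]
    return rng.choices(names, weights=weights, k=n)


batch = sample_sources(MIXTURE, 1000)
counts = {name: batch.count(name) for name in MIXTURE}
print(counts)
```

Raising a source's weight means the model sees more of that kind of text, which ties back to the idea that more examples of a task tend to improve performance on it.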

Data Quality vs. Quantity

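More data isn’t always better. Research on data curation suggests that a smaller, high-quality dataset can outperform a larger but noisier one, since duplicated or low-quality text adds training cost without teaching the model much. This is why newer models focus as much on data curation as on data scale.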
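A minimal sketch of what curation can look like in practice: drop duplicates, very short snippets, and text that is mostly symbols. The function name curate and the thresholds are assumptions for this example. Production pipelines use more elaborate methods, such as near-duplicate detection and learned quality classifiers.

```python
def curate(documents, min_words=5, max_symbol_ratio=0.3):
    """Keep documents that pass simple, illustrative quality checks."""
    seen = set()
    kept = []
    for doc in documents:
        text = " ".join(doc.split())  # collapse runs of whitespace
        if not text:
            continue  # skip empty documents
        key = text.lower()
        if key in seen:
            continue  # exact duplicate after normalization
        seen.add(key)
        if len(text.split()) < min_words:
            continue  # too short to carry much information
        symbols = sum(1 for ch in text if not (ch.isalnum() or ch.isspace()))
        if symbols / len(text) > max_symbol_ratio:
            continue  # mostly punctuation or markup debris
        kept.append(text)
    return kept


raw = [
    "Training data is the dataset a model learns from.",
    "training data is the dataset a model   learns from.",  # duplicate
    "Click here!!!",                                         # too short
    "$$$ ### !!! >>> buy now <<< !!! ### $$$",               # mostly symbols
    "Gradient descent adjusts parameters to reduce prediction error.",
]
clean = curate(raw)
print(len(raw), "->", len(clean))
```

Here five raw documents shrink to two, and the two that remain are the ones that actually carry information.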

Training Data and Bias

If training data over-represents certain viewpoints, demographics, or writing styles, the model will tend to reproduce that imbalance in its outputs. This is one of the core challenges in building fair and reliable AI systems.
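A first step toward spotting over-representation is simply measuring it. The sketch below computes each value's share of a dataset for one labeled attribute. The "language" field, the example records, and the function name representation_shares are hypothetical; real datasets often lack such labels and need them inferred.

```python
from collections import Counter


def representation_shares(examples, attribute):
    """Return the fraction of examples carrying each value of `attribute`."""
    counts = Counter(example[attribute] for example in examples)
    total = sum(counts.values())
    return {value: count / total for value, count in counts.items()}


# Hypothetical labeled examples: 8 English, 1 Spanish, 1 Hindi.
examples = (
    [{"text": "sample", "language": "en"}] * 8
    + [{"text": "ejemplo", "language": "es"}]
    + [{"text": "udaharan", "language": "hi"}]
)

shares = representation_shares(examples, "language")
for language, share in sorted(shares.items(), key=lambda item: -item[1]):
    print(f"{language}: {share:.0%}")
```

A skew like this one, with one value far ahead of the rest, is the kind of pattern a model trained on the data would be likely to inherit.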

Related Terms

  • Machine Learning — the process that uses training data to build models
  • Fine-Tuning — additional training on a smaller, specialized dataset
  • Bias in AI — how unbalanced training data produces skewed outputs
  • LLM — large language models trained on massive text datasets
