AI and LLM terms, explained

A

A/B testing

A/B testing compares two variants to measure impact.

Access control

Access control limits who can use data or models.

Adapters

Adapters are small modules inserted into a model for specialization.

Adversarial testing

Adversarial testing probes models for edge cases.

Agent

An agent is a system that plans and acts to achieve a goal using tools and models.

Agent sandboxing

Agent sandboxing restricts tools and permissions.

Agentic workflow

An agentic workflow breaks a goal into steps and executes them with tool calls.

Alignment

Alignment is the work of making model behavior match human intent.

API rate limit

An API rate limit caps how many requests you can send in a time window.

Approximate nearest neighbor

Approximate nearest neighbor (ANN) search finds similar vectors quickly by trading a little accuracy for speed.

Assistant message

An assistant message is the model's response in a chat exchange.

Attention head

An attention head is one parallel attention mechanism inside a transformer layer.

Audit log

An audit log records system activity for review.

Autoregressive model

An autoregressive model predicts the next token given previous tokens.

B

Batching

Batching combines multiple requests into one call.

Beam search

Beam search explores multiple candidate sequences in parallel.

Benchmark

A benchmark is a standard test set for comparing models.

Bias

Bias is systematic unfairness or skew in model outputs.

BM25

BM25 is a keyword ranking algorithm for search.
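As a rough illustration of how BM25 ranks documents, the sketch below scores a few documents against a query using the commonly cited form of the formula. The function name `bm25_scores`, the whitespace tokenizer, and the defaults `k1=1.5` and `b=0.75` are assumptions chosen for the example, not part of any particular search engine.

```python
import math
from collections import Counter


def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Return one BM25 score per document (hypothetical helper)."""
    if not docs:
        return []
    # Naive tokenizer: lowercase and split on whitespace (an assumption).
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(doc) for doc in tokenized) / n_docs

    # Document frequency: in how many documents each term appears.
    doc_freq = Counter()
    for doc in tokenized:
        doc_freq.update(set(doc))

    scores = []
    for doc in tokenized:
        term_freq = Counter(doc)
        score = 0.0
        for term in query.lower().split():
            tf = term_freq[term]
            if tf == 0:
                continue
            # Rare terms get a higher inverse document frequency.
            idf = math.log((n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5) + 1)
            # Term frequency saturates, and long documents are penalized.
            length_norm = k1 * (1 - b + b * len(doc) / avg_len)
            score += idf * tf * (k1 + 1) / (tf + length_norm)
        scores.append(score)
    return scores


docs = [
    "the cat sat on the mat",
    "dogs chase cats",
    "vector search with embeddings",
]
print(bm25_scores("cat mat", docs))
```

Only the first document contains the exact terms "cat" and "mat", so it is the only one with a non-zero score. "cats" does not match "cat" because this sketch does no stemming.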

C

Caching

Caching stores results to avoid repeated computation.

Catastrophic forgetting

Catastrophic forgetting is loss of earlier knowledge after new training.

Chain-of-thought

Chain-of-thought is step-by-step reasoning that a model writes out before giving its final answer.

Checkpoint

A checkpoint is a saved snapshot of model weights.

Chunk overlap

Chunk overlap repeats some text across chunks.

Chunking

Chunking splits content into smaller pieces for retrieval.
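Chunking and chunk overlap are easier to see in code. The following is a minimal sketch that splits a list of words into fixed-size chunks where each chunk repeats the last few words of the previous one. The name `chunk_words` and the word-based sizing are assumptions for illustration; real pipelines often measure chunks in tokens or characters.

```python
def chunk_words(words, chunk_size, overlap):
    """Split `words` into chunks of `chunk_size` that share `overlap` words."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")

    step = chunk_size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(words[start:start + chunk_size])
        if start + chunk_size >= len(words):
            break  # the last chunk already reaches the end of the text
    return chunks


words = "one two three four five six seven eight nine ten".split()
for chunk in chunk_words(words, chunk_size=4, overlap=1):
    print(" ".join(chunk))
```

With a chunk size of 4 and an overlap of 1, the last word of each chunk reappears as the first word of the next, so a sentence that straddles a boundary is less likely to be cut off from its context.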

Circuit breaker

A circuit breaker stops requests to a failing service.
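One possible shape for a circuit breaker is sketched below. The class name, the `failure_threshold` and `reset_timeout` parameters, and the injectable `clock` are assumptions made for the example; production libraries differ in their details.

```python
import time


class CircuitBreaker:
    """Stop calling a failing service until a cool-down period has passed."""

    def __init__(self, failure_threshold=3, reset_timeout=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.clock = clock  # injectable so the behavior can be tested
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, func, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_timeout:
                raise RuntimeError("circuit open: request skipped")
            # Cool-down elapsed: allow one trial call. A single failure
            # during this trial reopens the circuit immediately.
            self.opened_at = None
            self.failures = self.failure_threshold - 1
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = self.clock()
            raise
        self.failures = 0  # a success resets the failure count
        return result
```

Wrapping each outbound request in `breaker.call(...)` lets callers fail fast while the downstream service recovers, instead of piling more requests onto it.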

Citations

Citations link claims to sources.

Code interpreter

A code interpreter executes code to solve tasks.

Compliance

Compliance ensures systems meet legal or policy requirements.

Concurrency limit

A concurrency limit caps the number of simultaneous requests.

Content filter

A content filter blocks or removes unsafe content.

Context compression

Context compression shortens input while preserving key information.

Context reranking

Context reranking reorders retrieved chunks for relevance.

Context truncation

Context truncation drops older tokens when the context limit is reached.

Context window

The context window is the maximum amount of text a model can consider at once.

Continual learning

Continual learning updates models over time without full retraining.

Conversation state

Conversation state is the accumulated context of a chat.

Cosine similarity

Cosine similarity measures the cosine of the angle between two vectors, ignoring their magnitudes.
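A small sketch of the calculation, using plain Python lists as vectors. The function name `cosine_similarity` is chosen for the example; numerical libraries provide faster equivalents.

```python
import math


def cosine_similarity(a, b):
    """Return the dot product of a and b divided by the product of their lengths."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))  # perpendicular vectors
print(cosine_similarity([1.0, 2.0], [2.0, 4.0]))  # same direction, different length
```

Perpendicular vectors score 0 and vectors pointing in the same direction score approximately 1, regardless of how long they are.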

Cost optimization

Cost optimization reduces spend without hurting quality.

Cost per token

Cost per token is the price paid for input and output tokens.
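As a worked example of the arithmetic, the sketch below computes the cost of one request. It assumes prices are quoted per million tokens and that input and output tokens are priced separately; the prices used are made up for illustration and do not describe any real provider.

```python
def request_cost(input_tokens, output_tokens,
                 input_price_per_million, output_price_per_million):
    """Return the cost of one request in the same currency as the prices."""
    input_cost = input_tokens * input_price_per_million
    output_cost = output_tokens * output_price_per_million
    return (input_cost + output_cost) / 1_000_000


# Hypothetical prices: 3.00 per million input tokens, 15.00 per million output tokens.
cost = request_cost(2_000, 500, 3.00, 15.00)
print(f"{cost:.4f}")
```

Here 2,000 input tokens cost 0.0060 and 500 output tokens cost 0.0075, for a total of 0.0135.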

Critic model

A critic model reviews outputs and flags issues.

D

Data augmentation

Data augmentation creates variations of training data.

Data residency

Data residency defines where data is stored geographically.

De-identification

De-identification removes identifiers from data.

Decoder-only model

Decoder-only models generate text from left to right using a single stack.

Delegation

Delegation assigns subtasks to specialized tools or models.

Delimiters

Delimiters clearly separate parts of a prompt.

Deterministic decoding

Deterministic decoding removes randomness from generation.

Direct preference optimization

Direct preference optimization aligns models using preference pairs.

Distillation

Distillation transfers knowledge from a large model to a smaller one.

Dot product similarity

Dot product similarity multiplies corresponding vector components and sums the results.

Draft model

A draft model is a smaller model used to propose tokens.

Drift detection

Drift detection finds changes in model behavior over time.

E

Early stopping

Early stopping halts training when validation stops improving.

Edge inference

Edge inference runs models close to the user instead of a central cloud.

Embeddings

Embeddings are vector representations of text or images.

Encoder-decoder

Encoder-decoder models use separate networks for input and output.

Encryption at rest

Encryption at rest protects stored data.

Encryption in transit

Encryption in transit protects data over networks.

Evaluation

Evaluation measures model quality on tasks or benchmarks.

F

Fairness

Fairness aims for equitable model behavior across groups.

Feature flag

A feature flag toggles functionality on or off.

Few-shot learning

Few-shot learning uses a small number of examples in the prompt.

Fine-tuning

Fine-tuning adapts a pretrained model to a specific task or domain.

Flash attention

Flash attention is an optimized attention algorithm for GPUs.

Foundation model

A foundation model is a large pretrained model adaptable to many tasks.

Frequency penalty

Frequency penalty discourages repeated tokens in proportion to how often they have already appeared.

Function calling

Function calling is a structured way for a model to request tool execution.

Function schema

A function schema defines arguments for tool calls.
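For illustration, a function schema is often written in a JSON Schema style like the one below. The tool name `get_weather`, its fields, and the helper `missing_arguments` are hypothetical, and the exact envelope around the schema varies between providers.

```python
# Hypothetical schema for a weather lookup tool, in JSON Schema style.
GET_WEATHER_SCHEMA = {
    "name": "get_weather",
    "description": "Look up the current weather for a city.",
    "parameters": {
        "type": "object",
        "properties": {
            "city": {"type": "string", "description": "City name"},
            "unit": {"type": "string", "enum": ["celsius", "fahrenheit"]},
        },
        "required": ["city"],
    },
}


def missing_arguments(schema, arguments):
    """Return the required argument names that the model did not supply."""
    required = schema["parameters"].get("required", [])
    return [name for name in required if name not in arguments]


print(missing_arguments(GET_WEATHER_SCHEMA, {"unit": "celsius"}))
```

Checking a model's proposed arguments against the schema before running the tool catches incomplete calls early. In this example the call omits `city`, so the helper reports it as missing.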

G

Greedy decoding

Greedy decoding always picks the most probable next token.
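The sketch below shows greedy decoding over a toy lookup table. It is deliberately simplified: the table conditions only on the last token, whereas a real model conditions on the whole preceding sequence. The table contents, the `<eos>` stop marker, and the function name are assumptions for the example.

```python
def greedy_decode(next_token_probs, start, max_tokens=10, stop="<eos>"):
    """Repeatedly append the single most probable next token."""
    tokens = [start]
    for _ in range(max_tokens):
        candidates = next_token_probs.get(tokens[-1])
        if not candidates:
            break  # the toy table has no continuation for this token
        best = max(candidates, key=candidates.get)
        if best == stop:
            break
        tokens.append(best)
    return tokens


# Toy next-token probabilities keyed by the previous token.
table = {
    "the": {"cat": 0.6, "dog": 0.4},
    "cat": {"sat": 0.7, "<eos>": 0.3},
    "sat": {"<eos>": 1.0},
}
print(greedy_decode(table, "the"))
```

Because the highest-probability token is chosen at every step, the same input always yields the same output, here `['the', 'cat', 'sat']`.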

Grounding

Grounding ties model outputs to verifiable sources.

Guardrails

Guardrails are rules that constrain model behavior.

H

Hallucination

A hallucination is a confident but incorrect model output.

HNSW

HNSW (Hierarchical Navigable Small World) is a graph-based ANN index.

Hybrid search

Hybrid search combines keyword and semantic search.

I

Image captioning

Image captioning generates descriptions of images.

Inference

Inference is running a trained model to produce outputs.

Instruction hierarchy

Instruction hierarchy resolves conflicts between system and user instructions.

Instruction tuning

Instruction tuning trains a model to follow instructions more reliably.

IVF

IVF (inverted file index) clusters vectors so that a search only scans the nearest clusters.

J

Jailbreak

A jailbreak tries to bypass model safety constraints.

JSON mode

JSON mode constrains output to valid JSON.

K

Key rotation

Key rotation regularly replaces secrets and keys.

Keyword search

Keyword search matches exact terms in documents.

KV cache

A KV cache stores key and value tensors for previous tokens so they are not recomputed at each generation step.

L

Latency

Latency is the time between request and response.

Least privilege

Least privilege grants only necessary permissions.

LLMOps

LLMOps is the practice of deploying and monitoring LLM systems.

Local LLM

A local LLM runs on your own machine instead of a cloud API.

Log probabilities

Log probabilities are the log-likelihoods of candidate tokens.

Logit bias

Logit bias adjusts token probabilities up or down.

Logits

Logits are the raw scores a model assigns to next-token candidates.

Long-term memory

Long-term memory persists information across sessions.

LoRA

LoRA (low-rank adaptation) adds low-rank adapters for efficient fine-tuning.

M

Memory

Memory stores information an agent can reuse across steps or sessions.

Metadata filtering

Metadata filtering limits search by attributes.

Mixture of experts

Mixture of experts (MoE) routes inputs to specialized sub-models.

Moderation

Moderation detects and filters harmful or unsafe content.

Moltbot (formerly Clawdbot)

Moltbot is a community-known name for a local-first AI assistant, previously referred to as Clawdbot.

Multi-head attention

Multi-head attention runs several attention mechanisms in parallel.

Multi-tenant isolation

Multi-tenant isolation keeps customer data separated.

Multimodal model

Multimodal models handle more than one data type, such as text and images.

O

Observability

Observability is the ability to understand system behavior from logs and metrics.

OCR

OCR (optical character recognition) extracts text from images or PDFs.

On-device AI

On-device AI runs models directly on user hardware.

OpenAI-compatible API

An OpenAI-compatible API follows the same request and response format.

Orchestration

Orchestration coordinates multiple tools, models, and steps in a workflow.

Output schema

An output schema defines the required response structure.

Overfitting

Overfitting occurs when a model memorizes training data and fails to generalize to new data.

P

Parameter count

Parameter count is the number of learned weights in a model.

Passage retrieval

Passage retrieval finds specific passages instead of full documents.

PEFT

PEFT stands for parameter-efficient fine-tuning.

Perplexity

Perplexity measures how well a model predicts text.

PII

PII stands for personally identifiable information.

PII detection

PII detection finds personal identifiers in text.

Planner

A planner produces a structured plan for completing a task.

Planning agent

A planning agent creates a multi-step plan before acting.

Positional encoding

Positional encoding injects token order information into embeddings.

Preference dataset

A preference dataset contains ranked or paired responses.

Presence penalty

Presence penalty discourages reuse of any token that has already appeared, regardless of how often.

Pretraining

Pretraining is the large-scale training phase on broad data.

Prompt

A prompt is the input text that guides the model's response.

Prompt chaining

Prompt chaining splits a task into multiple prompt steps.

Prompt engineering

Prompt engineering is crafting prompts to improve output quality.

Prompt evaluation

Prompt evaluation measures output quality across prompt variants.

Prompt injection

Prompt injection is a malicious attempt to override system instructions.

Prompt monitoring

Prompt monitoring tracks prompt usage and performance.

Prompt registry

A prompt registry stores prompts and metadata.

Prompt template

A prompt template is a reusable prompt with placeholders.
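A minimal sketch of a prompt template using Python's standard `string.Template`. The template text, the placeholder names, and the helper `render_prompt` are invented for the example.

```python
from string import Template

# Placeholders are written as $name and filled in at render time.
SUMMARY_TEMPLATE = Template(
    "Summarize the following $doc_type in $max_sentences sentences:\n\n$text"
)


def render_prompt(template, **values):
    """Fill every placeholder; raises KeyError if one is missing."""
    return template.substitute(**values)


prompt = render_prompt(
    SUMMARY_TEMPLATE,
    doc_type="article",
    max_sentences=2,
    text="LLMs predict tokens.",
)
print(prompt)
```

Using `substitute` rather than `safe_substitute` is a deliberate choice here: a missing value fails loudly instead of sending a prompt with an unfilled placeholder to the model.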

Prompt versioning

Prompt versioning tracks prompt changes over time.

Q

Quantization

Quantization reduces the numerical precision of model weights to shrink memory use and speed up inference.

Query embedding

Query embedding represents a query as a vector.

Query expansion

Query expansion adds related terms to a query.

R

Ralph loop

Ralph loop is an agentic workflow that resets context between iterations to reduce drift.

ReAct

ReAct interleaves reasoning with actions.

Red teaming

Red teaming tests models with adversarial prompts.

Redaction

Redaction removes or masks sensitive data.

Reflex agent

A reflex agent selects actions based on immediate inputs.

Regularization

Regularization constrains models to reduce overfitting.

Relevance feedback

Relevance feedback uses user signals to improve rankings.

Reranking

Reranking reorders retrieved results using a stronger model.

Retries and backoff

Retries and backoff handle transient failures by retrying requests with increasing delays.

Retrieval-augmented generation

Retrieval-augmented generation (RAG) combines search with generation to answer questions using external data.

Reward model

A reward model scores outputs by quality.

RLAIF

RLAIF is Reinforcement Learning from AI Feedback.

RLHF

RLHF stands for Reinforcement Learning from Human Feedback.

Role prompting

Role prompting assigns a role or persona to guide responses.

Rotary position embedding

Rotary position embedding (RoPE) encodes positions by rotating vectors.

S

Safety classifier

A safety classifier labels content risk or policy categories.

Safety policy

A safety policy defines allowed and disallowed content.

Sampling

Sampling chooses output tokens probabilistically.

Sampling seed

A sampling seed fixes the random sequence used in generation.

Schema validation

Schema validation checks output against a defined structure.

Secret management

Secret management stores and controls access to credentials.

Self-attention

Self-attention lets a model weigh relationships between tokens in a sequence.

Self-consistency

Self-consistency samples multiple outputs and selects the most common final answer.
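The selection step can be sketched as a simple majority vote over sampled answers. The function name is hypothetical, and the sampled answers are hard-coded here in place of real model calls.

```python
from collections import Counter


def self_consistent_answer(answers):
    """Return the final answer that appears most often among the samples."""
    if not answers:
        raise ValueError("at least one sampled answer is required")
    return Counter(answers).most_common(1)[0][0]


# Stand-ins for final answers extracted from several sampled responses.
samples = ["42", "41", "42", "42", "40"]
print(self_consistent_answer(samples))
```

The vote works best when answers can be normalized to a comparable form, such as a number or a multiple-choice label, before counting.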

Self-reflection

Self-reflection is a review step where a model critiques its output.

Semantic caching

Semantic caching reuses responses for similar queries.

Semantic search

Semantic search finds results based on meaning, not just keywords.

Session

A session is a bounded interaction period with a user.

Short-term memory

Short-term memory holds recent context for the current task.

Sliding window attention

Sliding window attention limits attention to a moving context window.

Sparse attention

Sparse attention computes attention only for selected token pairs.

Sparse MoE

Sparse MoE activates only a subset of experts per token.

Speculative decoding

Speculative decoding uses a draft model to propose tokens quickly.

Speech-to-text

Speech-to-text converts audio into written text.

Stop sequence

A stop sequence tells the model when to stop generating.

Streaming

Streaming sends partial output as it is generated.

Structured output

Structured output enforces a specific response format.

Supervised fine-tuning

Supervised fine-tuning trains on labeled input-output pairs.

Supervisor agent

A supervisor agent coordinates sub-agents.

Synthetic data

Synthetic data is model-generated training data.

System prompt

A system prompt sets the overall behavior and rules for the model.

T

Task decomposition

Task decomposition splits a goal into smaller steps.

Temperature

Temperature controls randomness in text generation.

Temperature scaling

Temperature scaling adjusts the sharpness of token probabilities by dividing the logits by the temperature before the softmax.
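To make the effect of temperature concrete, here is a small sketch of a softmax with temperature. The function name and the example logits are assumptions for illustration.

```python
import math


def softmax_with_temperature(logits, temperature=1.0):
    """Convert logits to probabilities after dividing by the temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [logit / temperature for logit in logits]
    # Subtract the maximum before exponentiating for numerical stability.
    peak = max(scaled)
    exps = [math.exp(value - peak) for value in scaled]
    total = sum(exps)
    return [value / total for value in exps]


logits = [2.0, 1.0, 0.0]
for t in (0.5, 1.0, 2.0):
    probs = softmax_with_temperature(logits, t)
    print(t, [round(p, 3) for p in probs])
```

A temperature below 1 concentrates probability on the top token, and a temperature above 1 spreads it more evenly across candidates. A temperature of exactly 0 is not handled by this formula; implementations usually treat it as greedy decoding.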

Text-to-speech

Text-to-speech converts text into spoken audio.

Thread

A thread is a sequence of messages for a single topic.

Throughput

Throughput is the amount of work completed per unit time.

Time to first token

Time to first token (TTFT) measures how fast the first token arrives.

Time to last token

Time to last token (TTLT) measures total response time.

Token accounting

Token accounting tracks input and output token usage.

Token budget

A token budget is the maximum tokens allowed for prompt and response.

Tokenization

Tokenization is the process of splitting text into tokens.

Tokens

Tokens are the basic units of text that models process.

Tokens per second

Tokens per second measures generation speed.

Tool router

A tool router selects the best tool for a given request.

Tool sandbox

A tool sandbox runs tools in a restricted environment.

Tool use

Tool use lets a model call external capabilities such as search, code execution, or APIs.

Top-k

Top-k sampling limits generation to the k most probable tokens.

Top-p

Top-p (nucleus sampling) limits generation to the smallest set of most probable tokens whose cumulative probability reaches p.
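The difference between top-k and top-p is easiest to see side by side. The sketch below filters a toy probability distribution both ways and renormalizes what remains. The function names and the example distribution are assumptions for illustration.

```python
def _renormalize(pairs):
    """Rescale the kept probabilities so they sum to 1."""
    total = sum(prob for _, prob in pairs)
    return {token: prob / total for token, prob in pairs}


def top_k_filter(probs, k):
    """Keep the k most probable tokens."""
    ranked = sorted(probs.items(), key=lambda item: item[1], reverse=True)
    return _renormalize(ranked[:k])


def top_p_filter(probs, p):
    """Keep the smallest set of top tokens whose cumulative probability reaches p."""
    ranked = sorted(probs.items(), key=lambda item: item[1], reverse=True)
    kept, cumulative = [], 0.0
    for token, prob in ranked:
        kept.append((token, prob))
        cumulative += prob
        if cumulative >= p:
            break
    return _renormalize(kept)


probs = {"a": 0.5, "b": 0.3, "c": 0.15, "d": 0.05}
print(sorted(top_k_filter(probs, k=2)))
print(sorted(top_p_filter(probs, p=0.9)))
```

Top-k always keeps a fixed number of candidates, while top-p keeps more candidates when the distribution is flat and fewer when one token dominates. Sampling then draws from the filtered, renormalized distribution.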

Toxicity

Toxicity is harmful, abusive, or offensive content.

Tracing

Tracing records the path of a request through a system.

Transformer

The transformer is the neural network architecture behind most modern LLMs.

U

Usage metering

Usage metering tracks API usage and cost.

User prompt

A user prompt is the instruction or question from the user.

V

Vector database

A vector database stores embeddings and supports similarity search.

Vector similarity

Vector similarity measures closeness between embeddings.

Vision-language model

A vision-language model processes images and text together.

Visual question answering

Visual question answering answers questions about images.

W

Web search tool

A web search tool fetches current information from the internet.

Z

Zero-shot learning