Glossary
AI and LLM terms, explained
Short, practical definitions for the concepts behind modern AI assistants, chatbots, and local AI clients.
A
A/B testing compares two variants to measure impact.
Access control limits who can use data or models.
Adapters are small modules inserted into a model for specialization.
Adversarial testing probes models for edge cases.
An agent is a system that plans and acts to achieve a goal using tools and models.
Agent sandboxing restricts tools and permissions.
An agentic workflow breaks a goal into steps and executes them with tool calls.
Alignment ensures model behavior matches human intent.
An API rate limit caps how many requests you can send in a time window.
ANN (approximate nearest neighbor) search finds similar vectors efficiently, trading a little accuracy for speed.
An assistant message is the model's response in a chat exchange.
An attention head is one parallel attention mechanism inside a transformer layer.
An audit log records system activity for review.
An autoregressive model predicts the next token given previous tokens.
B
Batching combines multiple requests into one call.
Beam search explores multiple candidate sequences in parallel.
A benchmark is a standard test set for comparing models.
Bias is systematic unfairness or skew in model outputs.
BM25 is a keyword ranking algorithm for search.
C
Caching stores results to avoid repeated computation.
Catastrophic forgetting is loss of earlier knowledge after new training.
Chain-of-thought is the step-by-step reasoning a model may use.
A checkpoint is a saved snapshot of model weights.
Chunk overlap repeats some text across chunks.
Chunking splits content into smaller pieces for retrieval.
A circuit breaker stops requests to a failing service.
Citations link claims to sources.
A code interpreter executes code to solve tasks.
Compliance ensures systems meet legal or policy requirements.
Concurrency limits cap simultaneous requests.
A content filter blocks or removes unsafe content.
Context compression shortens input while preserving key information.
Context reranking reorders retrieved chunks for relevance.
Context truncation drops older tokens when the context limit is reached.
The context window is the maximum amount of text a model can consider at once.
Continual learning updates models over time without full retraining.
Conversation state is the accumulated context of a chat.
Cosine similarity measures the cosine of the angle between two vectors, ignoring their magnitudes.
Cost optimization reduces spend without hurting quality.
Cost per token is the price paid for input and output tokens.
A critic model reviews outputs and flags issues.
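As an illustration of the chunking and chunk overlap entries above, here is a minimal sketch that splits a list of words into fixed-size chunks where neighboring chunks share some words. The function name `chunk_text` and its parameters are hypothetical, and real pipelines often chunk by tokens or sentences rather than whitespace-separated words.

```python
def chunk_text(words, chunk_size, overlap):
    """Split a list of words into chunks that share `overlap` words."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(words[start:start + chunk_size])
        if start + chunk_size >= len(words):
            break  # the last chunk already reaches the end of the text
    return chunks


words = "a b c d e f g h".split()
chunks = chunk_text(words, chunk_size=4, overlap=2)
for chunk in chunks:
    print(" ".join(chunk))
```

With a chunk size of 4 and an overlap of 2, each chunk repeats the last two words of the previous one, so a sentence that straddles a boundary still appears whole in at least one chunk.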
D
Data augmentation creates variations of training data.
Data residency defines where data is stored geographically.
De-identification removes identifiers from data.
Decoder-only models generate text from left to right using a single stack.
Delegation assigns subtasks to specialized tools or models.
Delimiters clearly separate parts of a prompt.
Deterministic decoding removes randomness from generation.
Direct preference optimization aligns models using preference pairs.
Distillation transfers knowledge from a large model to a smaller one.
Dot product similarity sums the products of corresponding vector components.
A draft model is a smaller model used to propose tokens.
Drift detection finds changes in model behavior over time.
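The two vector similarity measures defined in this glossary, dot product similarity and cosine similarity, can be sketched in a few lines of plain Python. The helper names `dot` and `cosine_similarity` are illustrative only; production systems typically rely on a numerical library or the vector database itself.

```python
import math


def dot(a, b):
    """Sum of the products of corresponding components."""
    return sum(x * y for x, y in zip(a, b))


def cosine_similarity(a, b):
    """Dot product divided by the product of the vector lengths."""
    norm_a = math.sqrt(dot(a, a))
    norm_b = math.sqrt(dot(b, b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for zero vectors")
    return dot(a, b) / (norm_a * norm_b)


a = [1.0, 2.0, 3.0]
b = [2.0, 4.0, 6.0]  # same direction as a, twice the length

print(dot(a, b))
print(cosine_similarity(a, b))
```

Because `b` points in the same direction as `a`, the cosine similarity is approximately 1 even though the dot product grows with vector length. On vectors normalized to unit length the two measures give the same ranking.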
E
Early stopping halts training when validation stops improving.
Edge inference runs models close to the user instead of a central cloud.
Embeddings are vector representations of text or images.
Encoder-decoder models use separate networks for input and output.
Encryption at rest protects stored data.
Encryption in transit protects data over networks.
Evaluation measures model quality on tasks or benchmarks.
F
Fairness aims for equitable model behavior across groups.
A feature flag toggles functionality on or off.
Few-shot learning uses a small number of examples in the prompt.
Fine-tuning adapts a pretrained model to a specific task or domain.
Flash attention is an optimized attention algorithm for GPUs.
A foundation model is a large pretrained model adaptable to many tasks.
Frequency penalty discourages repeated tokens in proportion to how often they have already appeared.
Function calling is a structured way for a model to request tool execution.
A function schema defines arguments for tool calls.
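To make the function calling and function schema entries above concrete, here is a minimal sketch under stated assumptions. The tool `get_weather`, its schema, and the helper `validate_arguments` are all hypothetical. The schema is written in a JSON Schema style, and the exact format a given provider expects may differ.

```python
# Hypothetical tool schema, written in a JSON Schema style.
GET_WEATHER_SCHEMA = {
    "name": "get_weather",
    "description": "Look up the current weather for a city.",
    "parameters": {
        "type": "object",
        "properties": {
            "city": {"type": "string"},
            "unit": {"type": "string", "enum": ["celsius", "fahrenheit"]},
        },
        "required": ["city"],
    },
}

# Only the JSON types this sketch needs.
PYTHON_TYPES = {"string": str, "number": (int, float), "boolean": bool}


def validate_arguments(schema, arguments):
    """Return a list of problems; an empty list means the call looks valid."""
    params = schema["parameters"]
    problems = []
    for name in params["required"]:
        if name not in arguments:
            problems.append(f"missing required argument: {name}")
    for name, value in arguments.items():
        spec = params["properties"].get(name)
        if spec is None:
            problems.append(f"unknown argument: {name}")
            continue
        if not isinstance(value, PYTHON_TYPES[spec["type"]]):
            problems.append(f"wrong type for {name}")
        elif "enum" in spec and value not in spec["enum"]:
            problems.append(f"invalid value for {name}: {value}")
    return problems


def get_weather(city, unit="celsius"):
    # Stand-in implementation; a real tool would call a weather service.
    return f"Weather for {city} in {unit}"


# A function call as the application might receive it from a model.
call = {"name": "get_weather", "arguments": {"city": "Oslo", "unit": "celsius"}}
problems = validate_arguments(GET_WEATHER_SCHEMA, call["arguments"])
result = get_weather(**call["arguments"]) if not problems else None
print(result)
```

The model only proposes the call. The application checks the arguments against the schema and decides whether to execute the tool.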
G
H
I
Image captioning generates descriptions of images.
Inference is running a trained model to produce outputs.
Instruction hierarchy resolves conflicts between system and user instructions.
Instruction tuning trains a model to follow instructions more reliably.
IVF (inverted file index) clusters vectors so that a search scans only the nearest clusters.
J
K
L
Latency is the time between request and response.
Least privilege grants only necessary permissions.
LLMOps is the practice of deploying and monitoring LLM systems.
A local LLM runs on your own machine instead of a cloud API.
Log probabilities are the log-likelihoods of candidate tokens.
Logit bias adjusts token probabilities up or down.
Logits are the raw scores a model assigns to next-token candidates.
Long-term memory persists information across sessions.
LoRA (low-rank adaptation) adds low-rank adapters for efficient fine-tuning.
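The relationship between logits, log probabilities, and logit bias can be sketched with a toy three-token vocabulary. The token names and scores below are made up for illustration, and `log_softmax` and `apply_logit_bias` are hypothetical helper names.

```python
import math


def log_softmax(logits):
    """Convert raw scores into log probabilities (numerically stable)."""
    m = max(logits.values())
    log_total = m + math.log(sum(math.exp(v - m) for v in logits.values()))
    return {tok: v - log_total for tok, v in logits.items()}


def apply_logit_bias(logits, bias):
    """Add a per-token offset to the raw scores before normalization."""
    return {tok: v + bias.get(tok, 0.0) for tok, v in logits.items()}


logits = {"cat": 2.0, "dog": 1.0, "fish": 0.0}  # made-up raw scores
logprobs = log_softmax(logits)
biased = log_softmax(apply_logit_bias(logits, {"fish": 5.0}))

print(max(logprobs, key=logprobs.get))
print(max(biased, key=biased.get))
```

Exponentiating the log probabilities gives values that sum to 1. A large positive bias on one token makes it the most likely candidate, and a large negative bias effectively bans it.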
M
Memory stores information an agent can reuse across steps or sessions.
Metadata filtering limits search by attributes.
Mixture of experts (MoE) routes inputs to specialized sub-models.
Moderation detects and filters harmful or unsafe content.
Moltbot is a community-known name for a local-first AI assistant, previously referred to as Clawdbot.
Multi-head attention runs several attention mechanisms in parallel.
Multi-tenant isolation keeps customer data separated.
Multimodal models handle more than one data type, such as text and images.
O
Observability is the ability to understand system behavior from logs and metrics.
OCR (optical character recognition) extracts text from images or PDFs.
On-device AI runs models directly on user hardware.
An OpenAI-compatible API follows the same request and response format.
Orchestration coordinates multiple tools, models, and steps in a workflow.
An output schema defines the required response structure.
Overfitting occurs when a model memorizes training data and fails to generalize to new data.
P
Parameter count is the number of learned weights in a model.
Passage retrieval finds specific passages instead of full documents.
PEFT stands for parameter-efficient fine-tuning.
Perplexity measures how well a model predicts text.
PII stands for personally identifiable information.
PII detection finds personal identifiers in text.
A planner produces a structured plan for completing a task.
A planning agent creates a multi-step plan before acting.
Positional encoding injects token order information into embeddings.
A preference dataset contains ranked or paired responses.
Presence penalty discourages reuse of any token that has already appeared, regardless of how often.
Pretraining is the large-scale training phase on broad data.
A prompt is the input text that guides the model's response.
Prompt chaining splits a task into multiple prompt steps.
Prompt engineering is crafting prompts to improve output quality.
Prompt evaluation measures output quality across prompt variants.
Prompt injection is a malicious attempt to override system instructions.
Prompt monitoring tracks prompt usage and performance.
A prompt registry stores prompts and metadata.
A prompt template is a reusable prompt with placeholders.
Prompt versioning tracks prompt changes over time.
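A prompt template, as defined above, can be as simple as a string with named placeholders. This sketch uses `string.Template` from the Python standard library. The template text and the placeholder names (`doc_type`, `max_words`, `text`) are illustrative assumptions.

```python
from string import Template

# Reusable prompt with three placeholders.
SUMMARY_TEMPLATE = Template(
    "Summarize the following $doc_type in at most $max_words words:\n$text"
)

prompt = SUMMARY_TEMPLATE.substitute(
    doc_type="email",
    max_words=50,
    text="Meeting moved to Friday.",
)
print(prompt)
```

`substitute` raises `KeyError` when a placeholder is left unfilled, which catches incomplete prompts before they reach the model.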
Q
R
A Ralph loop is an agentic workflow that resets context between iterations to reduce drift.
ReAct interleaves reasoning with actions.
Red teaming tests models with adversarial prompts.
Redaction removes or masks sensitive data.
A reflex agent selects actions based on immediate inputs.
Regularization constrains models to reduce overfitting.
Relevance feedback uses user signals to improve rankings.
Reranking reorders retrieved results using a stronger model.
Retries and backoff handle transient failures.
RAG (retrieval-augmented generation) combines search with generation to answer questions using external data.
A reward model scores outputs by quality.
RLAIF stands for Reinforcement Learning from AI Feedback.
RLHF stands for Reinforcement Learning from Human Feedback.
Role prompting assigns a role or persona to guide responses.
Rotary position embedding (RoPE) encodes positions by rotating vectors.
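The retries and backoff entry above is commonly implemented as exponential backoff with jitter. The sketch below is one such variant under stated assumptions: the function `retry_with_backoff`, its parameters, and the simulated `flaky_request` are hypothetical, and only `ConnectionError` is treated as transient.

```python
import random
import time


def retry_with_backoff(func, max_attempts=4, base_delay=0.5, sleep=time.sleep):
    """Call func(); after a failure wait base_delay * 2**attempt plus jitter."""
    for attempt in range(max_attempts):
        try:
            return func()
        except ConnectionError:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the error to the caller
            delay = base_delay * (2 ** attempt)
            # Small random jitter keeps many clients from retrying in lockstep.
            sleep(delay + random.uniform(0, delay / 10))


attempts = {"count": 0}


def flaky_request():
    """Simulated call that fails twice, then succeeds."""
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise ConnectionError("temporary failure")
    return "ok"


waits = []  # record the delays instead of actually sleeping
result = retry_with_backoff(flaky_request, sleep=waits.append)
print(result, attempts["count"], len(waits))
```

Passing `sleep=waits.append` makes the example run instantly. In real use the default `time.sleep` applies, and the delay roughly doubles after each failed attempt.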
S
A safety classifier labels content risk or policy categories.
A safety policy defines allowed and disallowed content.
Sampling chooses output tokens probabilistically.
A sampling seed fixes the random sequence used in generation.
Schema validation checks output against a defined structure.
Secret management stores and controls access to credentials.
Self-attention lets a model weigh relationships between tokens in a sequence.
Self-consistency samples multiple outputs and selects the answer that appears most often.
Self-reflection is a review step where a model critiques its output.
Semantic caching reuses responses for similar queries.
Semantic search finds results based on meaning, not just keywords.
A session is a bounded interaction period with a user.
Short-term memory holds recent context for the current task.
Sliding window attention limits attention to a moving context window.
Sparse attention computes attention only for selected token pairs.
Sparse MoE activates only a subset of experts per token.
Speculative decoding uses a draft model to propose tokens quickly.
Speech-to-text converts audio into written text.
A stop sequence tells the model when to stop generating.
Streaming sends partial output as it is generated.
Structured output enforces a specific response format.
Supervised fine-tuning trains on labeled input-output pairs.
A supervisor agent coordinates sub-agents.
Synthetic data is model-generated training data.
A system prompt sets the overall behavior and rules for the model.
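Self-consistency, defined above, is commonly implemented as a majority vote over the final answers extracted from several sampled outputs. The sketch below assumes the answers have already been sampled and extracted. The list `samples` is made-up data and `self_consistent_answer` is a hypothetical helper name.

```python
from collections import Counter


def self_consistent_answer(sampled_answers):
    """Return the most common answer and the share of samples that agree."""
    counts = Counter(sampled_answers)
    answer, votes = counts.most_common(1)[0]
    return answer, votes / len(sampled_answers)


# Final answers from five hypothetical samples of the same prompt.
samples = ["42", "42", "41", "42", "40"]
answer, agreement = self_consistent_answer(samples)
print(answer, agreement)
```

The agreement share can serve as a rough confidence signal: a low value suggests the samples disagree and the result may deserve a second look.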
T
Task decomposition splits a goal into smaller steps.
Temperature controls randomness in text generation.
Temperature scaling divides logits by the temperature to adjust the sharpness of token probabilities.
Text-to-speech converts text into spoken audio.
A thread is a sequence of messages for a single topic.
Throughput is the amount of work completed per unit time.
TTFT (time to first token) measures how long the first token takes to arrive.
TTLT (time to last token) measures total response time.
Token accounting tracks input and output token usage.
A token budget is the maximum tokens allowed for prompt and response.
Tokenization is the process of splitting text into tokens.
Tokens are the basic units of text that models process.
Tokens per second measures generation speed.
A tool router selects the best tool for a given request.
A tool sandbox runs tools in a restricted environment.
Tool use lets a model call external capabilities such as search, code execution, or APIs.
Top-k sampling limits generation to the k most probable tokens.
Top-p (nucleus sampling) limits output to the smallest set of most probable tokens whose cumulative probability reaches p.
Toxicity is harmful, abusive, or offensive content.
Tracing records the path of a request through a system.
The transformer is the neural network architecture behind most modern LLMs.
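The temperature, top-k sampling, and top-p entries above describe steps that are often applied together when picking the next token. The sketch below shows one common ordering on a toy four-token vocabulary. The tokens, logits, and helper names are made up for illustration, and real inference engines may combine these steps differently.

```python
import math
import random


def softmax(logits, temperature=1.0):
    """Turn logits into probabilities; temperature must be greater than 0."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def renormalize(probs, keep):
    """Zero out tokens not in `keep` and rescale the rest to sum to 1."""
    total = sum(probs[i] for i in keep)
    return [probs[i] / total if i in keep else 0.0 for i in range(len(probs))]


def top_k_filter(probs, k):
    """Keep only the k most probable tokens."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    return renormalize(probs, order[:k])


def top_p_filter(probs, p):
    """Keep the smallest set of top tokens whose cumulative probability reaches p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    keep, cumulative = [], 0.0
    for i in order:
        keep.append(i)
        cumulative += probs[i]
        if cumulative >= p:
            break
    return renormalize(probs, keep)


tokens = ["the", "a", "cat", "zebra"]  # toy vocabulary
logits = [2.0, 1.0, 0.5, -1.0]         # made-up scores

probs = softmax(logits, temperature=0.7)
top2 = top_k_filter(probs, k=2)
nucleus = top_p_filter(probs, p=0.95)

rng = random.Random(0)  # fixed sampling seed for repeatable output
choice = rng.choices(tokens, weights=nucleus, k=1)[0]
print(choice)
```

Lower temperatures sharpen the distribution toward the top token and higher temperatures flatten it. Here both filters remove the unlikely "zebra" before sampling, so it can never be chosen.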
U
V
A vector database stores embeddings and supports similarity search.
Vector similarity measures closeness between embeddings.
A vision-language model processes images and text together.
Visual question answering answers questions about images.