"The brain does not store memories like files on a disk — it stores them as patterns of connection strengths, built up gradually by experience and retrieved by partial cues."- Claude 2026
How biological brains and artificial networks both change through experience — and what the similarities and gaps between them reveal about how memory works.
By the end of this page you should be able to:
The story of learning in the brain starts at the synapse, the tiny gap between two neurons across which a signal passes. When a synapse is used repeatedly and effectively, it grows stronger: the transmitting neuron releases more signal, or the receiving neuron grows more receptors to catch it. When it is used rarely, it weakens. This activity-dependent change in connection strength, called synaptic plasticity, is widely regarded as the physical basis of memory. Donald Hebb proposed the idea in 1949, and it is now usually summarized as Hebb's rule: neurons that fire together wire together. If neuron A repeatedly helps trigger neuron B, the connection between them strengthens.
Hebb's rule (1949): neurons that fire together wire together. If neuron A repeatedly helps trigger neuron B, the synapse between them strengthens. This principle recurs, in different forms, in several of the learning models on this page.
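Hebb's rule can be written as a one-line weight update. The sketch below is a minimal illustration, not a biophysical model: the function name `hebbian_update`, the learning rate `lr`, and the toy activity vectors are all assumptions introduced here.

```python
import numpy as np

def hebbian_update(w, pre, post, lr=0.1):
    """One Hebbian step: w[i, j] grows by lr * post[i] * pre[j]."""
    return w + lr * np.outer(post, pre)

pre = np.array([1.0, 0.0, 1.0])   # presynaptic activity (the "A" neurons)
post = np.array([1.0, 0.0])       # postsynaptic activity (the "B" neurons)
w = np.zeros((2, 3))              # w[i, j]: synapse from pre neuron j onto post neuron i

# Repeated co-activation strengthens only the synapses whose two ends are both active.
for _ in range(5):
    w = hebbian_update(w, pre, post)

print(w)
```

Only the synapses linking co-active neurons change. Note that this plain form of the rule can only increase weights, so practical models pair it with some form of decay or normalization.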
The most studied form of synaptic strengthening is long-term potentiation (LTP), first documented in 1973. Its counterpart, long-term depression (LTD), is a lasting weakening of the synapse. Whether a synapse potentiates or depresses depends on its recent activity: heavily used synapses tend toward LTP; rarely used ones toward LTD. Together they give the brain a mechanism for writing experience into connection strengths, which is roughly what a network's training algorithm does numerically.
Long-term potentiation (LTP): persistent strengthening of a synapse following repeated, rapid activation. The transmitting neuron releases more signal or the receiving neuron grows more receptors. The "volume" stays turned up for hours to a lifetime, which is why LTP is regarded as the physical substrate of long-term memory formation.
Long-term depression (LTD): persistent weakening of a synapse following low or ineffective activity. It is the reverse of LTP: it prunes connections that carry little useful signal, which keeps the memory system from becoming saturated and preserves selectivity.
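The push and pull between LTP and LTD can be caricatured with a single threshold rule: activity above a threshold strengthens a synapse, activity below it weakens the synapse. This is a toy sketch under stated assumptions; `plasticity_step`, the threshold `theta`, and the firing rates are hypothetical and not fitted to any data.

```python
import numpy as np

def plasticity_step(w, rate, theta=0.5, lr=0.1):
    """Toy rule: rate above theta potentiates (LTP-like), below theta depresses (LTD-like)."""
    # Weights are clipped to [0, 1] so they cannot grow or shrink without bound.
    return np.clip(w + lr * (rate - theta), 0.0, 1.0)

w = np.array([0.5, 0.5])        # two synapses that start out equally strong
rates = np.array([0.9, 0.1])    # synapse 0 is heavily used, synapse 1 rarely used

for _ in range(10):
    w = plasticity_step(w, rates)

print(w)
```

The heavily used synapse ends up stronger than it started and the rarely used one weaker, which is the qualitative pattern described above.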
Biological memory is not one thing. Cognitive neuroscience distinguishes several systems that operate differently and depend on different brain structures.
Artificial neural networks conflate most of these into a single substrate — the weight matrix — but computational models have begun to represent each system separately, as the final section of this page explores.
A neural network "learns" by adjusting its connection weights until its outputs match a target. The mechanism that makes this adjustment is called backpropagation — short for backward propagation of errors — combined with an optimization strategy called gradient descent. Understanding these two ideas is essential because they are the closest artificial analog to the synaptic plasticity described above.
Think of the network's total error as a surface in a high-dimensional space, with a valley at the minimum. Gradient descent navigates toward that valley by stepping opposite to the steepest upward slope. The size of each step is the learning rate — too large and it overshoots; too small and it stalls. Backpropagation supplies the gradients needed: using the chain rule of calculus, it propagates the output error backward through each layer, apportioning blame to each weight in proportion to its contribution to the mistake. Each training iteration runs in three stages:
1. Forward pass: input flows through the network layer by layer, producing a prediction.
2. Loss computation: a loss function measures how far the prediction is from the target.
3. Backward pass and update: gradients propagate backward, and each weight is nudged to reduce the error.
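The effect of the learning rate is easiest to see on a one-dimensional error surface. The sketch below minimizes f(w) = (w - 3)^2, a function chosen here purely for illustration; the three learning rates are likewise assumptions picked to show stalling, convergence, and overshooting.

```python
def gradient_descent(lr, steps=50, w=0.0):
    """Minimize f(w) = (w - 3)**2, whose gradient is 2 * (w - 3)."""
    for _ in range(steps):
        grad = 2.0 * (w - 3.0)
        w -= lr * grad          # step opposite to the upward slope
    return w

for lr in (0.001, 0.1, 1.1):
    print(f"lr={lr}: w after 50 steps = {gradient_descent(lr):.4f}")
```

With the smallest rate the weight has barely moved toward the minimum at w = 3, with the middle rate it has essentially arrived, and with the largest rate each step overshoots by more than the last, so the iterate diverges.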
In batch gradient descent, the gradient is averaged over the full training set before any weights are updated: stable but slow and memory-heavy. In stochastic (online) gradient descent, weights are updated after every single example: fast and memory-light but noisy. Mini-batch gradient descent splits the difference, averaging over small random subsets, and is by far the most common practice in deep learning.
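The three variants differ only in how many examples feed each update, so a single `batch_size` parameter can express all of them. The sketch below fits a one-parameter linear model to synthetic data; the data, the `train` helper, and the hyperparameters are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 1))
y = 2.0 * X[:, 0] + 0.1 * rng.normal(size=200)   # synthetic data, true slope is 2

def train(batch_size, epochs=50, lr=0.05):
    """Fit y ~ w * x by gradient descent on mean squared error."""
    w = 0.0
    n = len(X)
    for _ in range(epochs):
        order = rng.permutation(n)                # reshuffle every epoch
        for start in range(0, n, batch_size):
            idx = order[start:start + batch_size]
            err = w * X[idx, 0] - y[idx]
            grad = 2.0 * np.mean(err * X[idx, 0])  # gradient averaged over this batch
            w -= lr * grad
    return w

# batch_size = full set -> batch GD; 1 -> stochastic GD; in between -> mini-batch
for name, bs in [("batch", 200), ("stochastic", 1), ("mini-batch", 32)]:
    print(f"{name:10s} w = {train(bs):.3f}")
```

All three settings should land near the true slope. They differ in the number of updates per epoch (1, 200, and 7 respectively) and in how noisy each individual update is.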
Overfitting occurs when a network memorizes the training data rather than learning its underlying pattern — it performs well on training examples but poorly on new ones. The biological parallel is rote memorization without generalization. Regularization techniques (adding a penalty for large weights, or randomly disabling neurons during training via dropout) push the network toward simpler, more generalizable solutions.
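Both regularizers mentioned above are small modifications to the training loop. The following is a minimal sketch, assuming a weight-decay coefficient and drop probability chosen only for illustration; `l2_penalized_grad` and `dropout` are hypothetical helper names, and the dropout shown is the common "inverted" variant that rescales surviving units.

```python
import numpy as np

rng = np.random.default_rng(0)

def l2_penalized_grad(grad, w, weight_decay=1e-2):
    """Gradient of loss + (weight_decay / 2) * ||w||^2: pulls weights toward zero."""
    return grad + weight_decay * w

def dropout(a, p_drop=0.5, training=True):
    """Inverted dropout: zero each unit with probability p_drop, rescale the rest."""
    if not training:
        return a                               # no units are dropped at test time
    mask = rng.random(a.shape) >= p_drop       # True for units that survive
    return a * mask / (1.0 - p_drop)

a = np.ones((2, 8))
print(dropout(a))                    # roughly half the units zeroed, the rest scaled up
print(dropout(a, training=False))    # unchanged
```

Rescaling by 1 / (1 - p_drop) keeps the expected activation the same in training and at test time, so no adjustment is needed when dropout is switched off.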
The code below trains a small two-layer network from scratch on a classic non-linear problem: the XOR function, which a single-layer network cannot solve. Only NumPy is used — no deep learning library — so every step of the gradient computation is visible.
import numpy as np

# XOR inputs and targets
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y = np.array([[0], [1], [1], [0]])

np.random.seed(0)
W1 = np.random.randn(2, 4)   # input -> hidden weights
b1 = np.zeros((1, 4))
W2 = np.random.randn(4, 1)   # hidden -> output weights
b2 = np.zeros((1, 1))

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def sigmoid_deriv(z):
    s = sigmoid(z)
    return s * (1 - s)

lr = 0.5
for epoch in range(10000):
    # Forward pass
    z1 = X @ W1 + b1
    a1 = sigmoid(z1)
    z2 = a1 @ W2 + b2
    a2 = sigmoid(z2)

    # Loss: mean squared error
    loss = np.mean((y - a2) ** 2)

    # Backward pass: apply the chain rule layer by layer
    d_a2 = -2 * (y - a2) / len(y)
    d_z2 = d_a2 * sigmoid_deriv(z2)
    d_W2 = a1.T @ d_z2
    d_b2 = d_z2.sum(axis=0, keepdims=True)
    d_a1 = d_z2 @ W2.T
    d_z1 = d_a1 * sigmoid_deriv(z1)
    d_W1 = X.T @ d_z1
    d_b1 = d_z1.sum(axis=0, keepdims=True)

    # Gradient descent update
    W2 -= lr * d_W2
    b2 -= lr * d_b2
    W1 -= lr * d_W1
    b1 -= lr * d_b1

    if epoch % 2000 == 0:
        print(f'epoch {epoch:5d} loss {loss:.4f}')

print('predictions:', np.round(a2.T, 2))
After 10,000 epochs the predictions converge to near [0, 1, 1, 0]. The loop structure — forward pass, compute error, backward pass, update — is identical in principle to the iterative weight updates happening in any deep learning framework, just without the engineering scaffolding that makes it fast at scale.
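One way to check that a backward pass like the one above is correct is to compare it against finite differences: perturb each weight slightly, measure the change in loss, and confirm that the result matches the chain-rule gradient. The sketch below does this for the first-layer weights of a bias-free version of the same network; dropping the biases and the helper names `loss_fn` and `backprop_grad_W1` are simplifications assumed here.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def loss_fn(W1, W2, X, y):
    a1 = sigmoid(X @ W1)
    a2 = sigmoid(a1 @ W2)
    return np.mean((y - a2) ** 2)

def backprop_grad_W1(W1, W2, X, y):
    """Chain-rule gradient of the loss with respect to W1."""
    a1 = sigmoid(X @ W1)
    a2 = sigmoid(a1 @ W2)
    d_z2 = (-2.0 * (y - a2) / y.size) * a2 * (1 - a2)
    d_z1 = (d_z2 @ W2.T) * a1 * (1 - a1)
    return X.T @ d_z1

rng = np.random.default_rng(0)
X = np.array([[0., 0.], [0., 1.], [1., 0.], [1., 1.]])
y = np.array([[0.], [1.], [1.], [0.]])
W1, W2 = rng.normal(size=(2, 4)), rng.normal(size=(4, 1))

analytic = backprop_grad_W1(W1, W2, X, y)

# Central finite differences: one weight at a time
numeric = np.zeros_like(W1)
eps = 1e-6
for idx in np.ndindex(*W1.shape):
    Wp, Wm = W1.copy(), W1.copy()
    Wp[idx] += eps
    Wm[idx] -= eps
    numeric[idx] = (loss_fn(Wp, W2, X, y) - loss_fn(Wm, W2, X, y)) / (2 * eps)

print("max abs difference:", np.max(np.abs(analytic - numeric)))
```

The two gradients should agree to many decimal places. Backpropagation gets the same numbers in a single backward sweep instead of two forward passes per weight.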
Standard feedforward networks trained by backpropagation are powerful classifiers, but they do not model memory as cognitive science understands it. Three families of model have been developed specifically to bridge that gap, each grounded in a different aspect of what neuroscience has found out about how memory works.
A Hopfield network is a fully connected recurrent network proposed by John Hopfield in 1982, designed to model associative (content-addressable) memory — the ability to retrieve a complete memory from a partial or noisy cue, the way hearing the first few bars of a song brings back the whole melody. Patterns are stored as stable states of the network. The network is given a partial or corrupted input and iteratively updates its neurons — each checking whether flipping its state would lower a global energy function — until it settles into the nearest stored pattern. The biological parallel is the brain's ability to "fill in" a degraded percept from stored experience, a phenomenon psychologists call pattern completion.
Hopfield networks have a limited storage capacity: roughly 0.14 × N random patterns for a network of N neurons (Hopfield's original estimate was about 0.15 × N). They can also produce spurious states, stable configurations that were never stored, analogous to a false memory. Both limits have biological parallels and have been extensively studied as models of memory failure.
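Storage and recall in a Hopfield network fit in a few lines. The sketch below is a minimal illustration, assuming the standard Hebbian outer-product storage rule with ±1 neurons; the two 8-bit patterns and the function names are invented for the example.

```python
import numpy as np

def train_hopfield(patterns):
    """Hebbian outer-product rule for +/-1 patterns, with no self-connections."""
    n = patterns.shape[1]
    W = patterns.T @ patterns / n
    np.fill_diagonal(W, 0.0)
    return W

def energy(W, s):
    """Global energy; stored patterns sit at local minima."""
    return -0.5 * s @ W @ s

def recall(W, s, sweeps=5):
    """Asynchronous updates: each neuron in turn aligns with its input field."""
    s = s.copy()
    for _ in range(sweeps):
        for i in range(len(s)):
            s[i] = 1 if W[i] @ s >= 0 else -1
    return s

patterns = np.array([
    [1,  1, 1,  1, -1, -1, -1, -1],
    [1, -1, 1, -1,  1, -1,  1, -1],
])
W = train_hopfield(patterns)

cue = patterns[0].copy()
cue[0] = -cue[0]                  # corrupt one bit of the first pattern
restored = recall(W, cue)

print("cue:     ", cue, "energy", energy(W, cue))
print("restored:", restored, "energy", energy(W, restored))
```

The corrupted cue settles back into the stored pattern, and its energy drops as it does so. Two patterns in eight neurons is close to the capacity limit discussed above, so adding many more patterns to this toy network would start to produce recall errors.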
One of the most influential models bridging neuroscience and machine learning is the Complementary Learning Systems (CLS) theory, introduced by McClelland, McNaughton, and O'Reilly in 1995. It addresses a fundamental question: how does the brain learn new things quickly without overwriting what it already knows? A standard neural network suffers from catastrophic forgetting — training it intensively on a new task degrades its performance on old ones, because the same weights serve both. The brain avoids this.
CLS proposes that two systems with different learning dynamics work together to solve this. Memories are initially encoded in the hippocampus, then gradually transferred to neocortex through replay during sleep and rest. CLS has been an influential inspiration for continual-learning methods and is often cited as a motivation for experience replay in modern deep reinforcement learning.
Hippocampus (fast learner): one-shot learning. Sparse, non-overlapping representations keep new memories distinct (pattern separation). Acts as a short-term buffer; damage disrupts recent but not remote memory.
Replay and consolidation: the hippocampus replays recent memories during sleep, each replay nudging neocortical weights a little. The result is a gradual transfer from the fast short-term store to the slow long-term store.
Neocortex (slow learner): integrates regularities across many exposures into distributed, overlapping representations. Supports semantic knowledge and generalization (pattern completion).
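The benefit of interleaving replayed memories with new learning shows up even in a two-weight linear model. The sketch below is a toy illustration of catastrophic forgetting and replay, not an implementation of the CLS model itself; the two "memories", the `train` helper, and the learning rate are assumptions chosen so that the memories share an input feature and therefore interfere.

```python
import numpy as np

x_old, y_old = np.array([1.0, 0.0]), 1.0   # previously learned memory
x_new, y_new = np.array([1.0, 1.0]), 0.0   # new memory sharing the first input feature

def sgd_step(w, x, y, lr=0.1):
    """One gradient step on squared error for a linear unit."""
    return w - lr * (w @ x - y) * x

def train(w, examples, steps=1000):
    """Cycle through the given examples, one update per step."""
    for t in range(steps):
        x, y = examples[t % len(examples)]
        w = sgd_step(w, x, y)
    return w

w_old = train(np.zeros(2), [(x_old, y_old)])                  # learn the old memory
w_seq = train(w_old, [(x_new, y_new)])                        # then only the new one
w_replay = train(w_old, [(x_new, y_new), (x_old, y_old)])     # new one interleaved with replay

print("error on old memory without replay:", abs(w_seq @ x_old - y_old))
print("error on old memory with replay:   ", abs(w_replay @ x_old - y_old))
```

Training on the new memory alone drags the shared weight away from the value the old memory needs. Interleaving the old example lets the model find weights that satisfy both, which is the role CLS assigns to hippocampal replay.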
The Long Short-Term Memory (LSTM) architecture, introduced by Hochreiter and Schmidhuber in 1997, mitigates the vanishing-gradient problem of standard RNNs with an explicit cell state, a memory register that in its now-standard form is controlled by three learned gates (the forget gate was added in later work). The gates loosely model the selective maintenance and updating of working memory:
Input gate: decides which new information from the current input is worth storing in the cell state. Analogous to encoding a new experience into working memory.
Forget gate: decides which existing content to keep and which to discard. Mirrors the selective maintenance of relevant context in working memory while clearing outdated information.
Output gate: controls what portion of the cell state is exposed as the output at this step, that is, what the network "pays attention to" from its stored context right now.
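The three gates described above combine into a single update step. The sketch below follows the commonly used LSTM formulation with randomly initialized, untrained weights. The stacked weight layout (input, forget, output, candidate) and the sizes `D` and `H` are conventions assumed here for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h_prev, c_prev, W, b):
    """One LSTM step. W has shape (4 * H, D + H); b has shape (4 * H,)."""
    H = h_prev.shape[0]
    z = W @ np.concatenate([x, h_prev]) + b
    i = sigmoid(z[0:H])            # input gate: how much new content to write
    f = sigmoid(z[H:2 * H])        # forget gate: how much old content to keep
    o = sigmoid(z[2 * H:3 * H])    # output gate: how much of the cell to expose
    g = np.tanh(z[3 * H:4 * H])    # candidate content computed from the input
    c = f * c_prev + i * g         # updated cell state
    h = o * np.tanh(c)             # output (hidden state) for this step
    return h, c

rng = np.random.default_rng(0)
D, H = 3, 4                                   # input size and cell size
W = rng.normal(scale=0.1, size=(4 * H, D + H))
b = np.zeros(4 * H)

h, c = np.zeros(H), np.zeros(H)
for x in rng.normal(size=(5, D)):             # a sequence of five inputs
    h, c = lstm_step(x, h, c, W, b)

print("h:", h)
print("c:", c)
```

Because the cell state is updated additively (`f * c_prev + i * g`), a forget gate near 1 and an input gate near 0 carry the stored content forward almost unchanged. This is what lets gradients survive across many time steps.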
The three models each address a different memory function: Hopfield networks model pattern-completion retrieval from partial cues; CLS theory models the long-term consolidation of experience from fast hippocampal encoding to slow cortical generalization; and LSTMs model the active maintenance and updating of short-term context. A complete computational account of memory would need all three — and the brain appears to use analogs of all three simultaneously.