Course Overview
This comprehensive study guide covers the first five lectures of Stanford's CS224N course on Natural Language Processing with Deep Learning. The course explores modern approaches to understanding and generating human language using neural networks and deep learning techniques.
Lecture 1: Introduction and Word Vectors
Natural Language Processing (NLP) is a field at the intersection of computer science, artificial intelligence, and linguistics. The goal is to enable computers to understand, interpret, and generate human language in a valuable way. Typical applications include:
- Understanding text and speech
- Machine translation
- Question answering systems
- Sentiment analysis
- Information extraction
- Text generation and summarization
Human language is incredibly complex and nuanced. Several factors make NLP particularly challenging:
- Ambiguity: Words and sentences can have multiple meanings depending on context
- Context Dependence: Meaning changes based on surrounding words and discourse
- World Knowledge: Understanding often requires external knowledge beyond the text
- Variability: Infinite ways to express the same idea
- Compositionality: Meaning of phrases depends on word combinations
Traditional approaches used discrete symbols (one-hot encoding), but this fails to capture semantic relationships. Modern approaches use distributed representations.
"You shall know a word by the company it keeps" - J.R. Firth
Words that appear in similar contexts tend to have similar meanings
Word2Vec is a framework for learning word embeddings using neural networks. Two main architectures:
- Skip-gram: Predicts context words given a center word
- CBOW (Continuous Bag of Words): Predicts center word given context words
Maximize the average log-likelihood:
J(θ) = (1/T) Σ_{t=1}^{T} Σ_{-m ≤ j ≤ m, j ≠ 0} log P(w_{t+j} | w_t)
where T is the number of tokens in the corpus and m is the context window size
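As a concrete illustration, here is a minimal NumPy sketch of the full-softmax skip-gram probability P(o | c) and the resulting average log-likelihood; the vocabulary size, embedding dimension, and (center, outside) pairs are toy values chosen for illustration, not the course's actual training setup.

```python
import numpy as np

np.random.seed(0)
V, d = 10, 4                      # toy vocabulary size and embedding dimension
U = np.random.randn(V, d) * 0.1   # "outside" (context) vectors u_w
Vc = np.random.randn(V, d) * 0.1  # "center" vectors v_w

def skipgram_prob(center, outside):
    """P(outside | center) = softmax over u_w . v_center for all words w (naive full softmax)."""
    scores = U @ Vc[center]                # u_w^T v_c for every word w
    probs = np.exp(scores - scores.max())  # numerically stabilized softmax
    probs /= probs.sum()
    return probs[outside]

# average log-likelihood of a tiny "corpus" of (center, outside) pairs
pairs = [(1, 2), (1, 3), (4, 5)]
J = np.mean([np.log(skipgram_prob(c, o)) for c, o in pairs])
print(J)
```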
Word vectors capture semantic and syntactic relationships through vector arithmetic:
vec("king") - vec("man") + vec("woman") ≈ vec("queen")
vec("Paris") - vec("France") + vec("Italy") ≈ vec("Rome")
Key Takeaways
- Word vectors represent words as dense, continuous vectors of a few hundred dimensions, far smaller than one-hot vectors of vocabulary size
- Distributional semantics: words in similar contexts have similar meanings
- Word2Vec learns embeddings by predicting context words
- Vector arithmetic can capture semantic relationships
Lecture 2: Word Vectors and Language Models
Training word vectors requires optimizing the objective function using gradient descent.
- Gradient descent updates parameters in direction of steepest descent
- Stochastic Gradient Descent (SGD) uses mini-batches for efficiency
- Learning rate controls step size
- Negative sampling reduces computational complexity
θ_new = θ_old - α ∇_θ J(θ)
where α is the learning rate
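A minimal sketch of this update rule on a toy quadratic objective; in stochastic gradient descent the gradient would come from a mini-batch of training examples rather than the full objective.

```python
import numpy as np

def sgd_step(theta, grad, lr=0.05):
    """One gradient-descent update: theta <- theta - lr * grad."""
    return theta - lr * grad

# minimize the toy objective J(theta) = ||theta||^2, whose gradient is 2 * theta
theta = np.array([3.0, -2.0])
for _ in range(100):
    theta = sgd_step(theta, 2 * theta)   # with SGD, `grad` would come from a mini-batch
print(theta)   # approaches the minimum at [0, 0]
```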
Efficient training techniques make Word2Vec practical for large corpora:
- Negative Sampling: Sample k negative examples instead of computing the full softmax (sketched after this list)
- Hierarchical Softmax: Use binary tree structure for efficient computation
- Subsampling: Reduce frequency of common words
- Context Window: Dynamic window size for varied context
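The negative-sampling objective from the list above can be written as a per-pair loss. This is a simplified sketch using random toy vectors; it omits the unigram^0.75 sampling distribution used to draw the negatives.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sgns_loss(v_c, u_o, U_neg):
    """Skip-gram negative-sampling loss for one (center, outside) pair.

    v_c:   center word vector
    u_o:   true outside word vector
    U_neg: (k, d) matrix of k sampled negative outside vectors
    """
    pos = -np.log(sigmoid(u_o @ v_c))              # push the true pair together
    neg = -np.sum(np.log(sigmoid(-U_neg @ v_c)))   # push sampled negative pairs apart
    return pos + neg

d, k = 8, 5
rng = np.random.default_rng(0)
print(sgns_loss(rng.normal(size=d), rng.normal(size=d), rng.normal(size=(k, d))))
```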
GloVe combines global matrix factorization and local context window methods.
J = Σ_{i,j=1}^{V} f(X_ij) (w_i^T w̃_j + b_i + b̃_j - log X_ij)²
where X_ij is the co-occurrence count of words i and j, and f is a weighting function
- Leverages global word-word co-occurrence statistics
- Fast training and good performance with smaller corpora
- Captures both semantic and syntactic patterns
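A rough NumPy sketch of the GloVe objective above, assuming a small dense co-occurrence matrix; the weighting function uses the commonly cited x_max = 100 and α = 0.75 values, and real implementations work with sparse counts and train the parameters rather than just evaluating the loss.

```python
import numpy as np

def glove_loss(W, W_tilde, b, b_tilde, X, x_max=100, alpha=0.75):
    """Weighted least-squares GloVe objective over a dense co-occurrence matrix X."""
    f = np.minimum((X / x_max) ** alpha, 1.0)   # weighting function f(X_ij)
    f[X == 0] = 0.0                             # skip zero co-occurrences
    logX = np.log(np.where(X > 0, X, 1.0))
    resid = W @ W_tilde.T + b[:, None] + b_tilde[None, :] - logX
    return np.sum(f * resid ** 2)

V, d = 6, 4
rng = np.random.default_rng(0)
X = rng.integers(0, 20, size=(V, V)).astype(float)
print(glove_loss(rng.normal(size=(V, d)), rng.normal(size=(V, d)),
                 rng.normal(size=V), rng.normal(size=V), X))
```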
Two main approaches to evaluate word embeddings:
- Intrinsic Evaluation: Word similarity, analogy tasks (e.g., king:queen :: man:?)
- Extrinsic Evaluation: Performance on downstream tasks (NER, sentiment analysis)
- Intrinsic is faster but may not correlate with real-world performance
- Extrinsic is more meaningful but computationally expensive
Language models assign probabilities to sequences of words.
P(w_1, w_2, ..., w_n) = P(w_1) P(w_2 | w_1) ... P(w_n | w_1, ..., w_{n-1})
- Core task: predict next word given previous words
- Applications: speech recognition, machine translation, text generation
- Perplexity is the standard evaluation metric (lower is better)
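Since perplexity is the exponentiated average negative log-probability of the chain-rule factors above, it can be computed directly from per-token probabilities:

```python
import numpy as np

def perplexity(token_probs):
    """Perplexity = exp of the average negative log-probability per token.

    `token_probs` are the model's probabilities P(w_i | w_1, ..., w_{i-1}),
    i.e. the factors of the chain rule above.
    """
    return float(np.exp(-np.mean(np.log(token_probs))))

# a model that assigns probability 0.25 to every token has perplexity 4
print(perplexity([0.25, 0.25, 0.25, 0.25]))   # 4.0
```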
Comparison of traditional and neural approaches:
- N-gram Models: Use count-based statistics with Markov assumption
- Neural Models: Learn distributed representations, better generalization
- N-grams suffer from data sparsity and curse of dimensionality
- Neural models handle longer contexts and unseen word combinations
Key Takeaways
- Efficient training techniques like negative sampling enable Word2Vec at scale
- GloVe combines global statistics with local context for better embeddings
- Language models predict word sequences and are fundamental to NLP
- Neural language models outperform traditional n-gram approaches
Lecture 3: Backpropagation and Neural Networks
Neural networks are composed of layers of interconnected neurons that transform inputs to outputs.
- Input Layer: Receives raw data (e.g., word vectors)
- Hidden Layers: Learn hierarchical representations
- Output Layer: Produces final predictions
- Activation Functions: Introduce non-linearity (sigmoid, tanh, ReLU)
h = f(w^T x + b)
where f is the activation function, w the weight vector, and b the bias
Forward propagation computes the output of the network given an input.
h^(1) = f(W^(1) x + b^(1))
h^(2) = f(W^(2) h^(1) + b^(2))
ŷ = softmax(W^(3) h^(2) + b^(3))
- Data flows from input to output layer
- Each layer applies linear transformation followed by non-linearity
- Final layer often uses softmax for classification
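A minimal NumPy sketch of this forward pass with two hidden layers; ReLU is chosen here as the activation f, and all layer sizes and weights are toy values.

```python
import numpy as np

def relu(x):
    return np.maximum(0, x)

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def forward(x, params):
    """Two hidden layers plus a softmax output, mirroring the equations above."""
    W1, b1, W2, b2, W3, b3 = params
    h1 = relu(W1 @ x + b1)         # h^(1) = f(W^(1) x + b^(1))
    h2 = relu(W2 @ h1 + b2)        # h^(2) = f(W^(2) h^(1) + b^(2))
    return softmax(W3 @ h2 + b3)   # y_hat

rng = np.random.default_rng(0)
sizes = [(20, 50), (20,), (10, 20), (10,), (5, 10), (5,)]
params = [rng.normal(scale=0.1, size=s) for s in sizes]
print(forward(rng.normal(size=50), params))   # 5-way class probabilities
```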
Backpropagation efficiently computes gradients using the chain rule of calculus.
- Compute loss at output layer
- Propagate error backwards through network
- Calculate gradient for each parameter
- Update parameters using gradient descent
∂L/∂w = (∂L/∂ŷ)(∂ŷ/∂h)(∂h/∂w)
Gradients flow backwards through computational graph
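A small self-contained example of these steps for a softmax classifier with cross-entropy loss, where the chain rule yields the familiar gradients: ŷ - y at the output, an outer product with the input for the weights, and W^T times the upstream gradient for the input.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def forward_backward(x, y, W, b):
    """Cross-entropy loss and its gradients for a softmax classifier z = Wx + b."""
    z = W @ x + b
    y_hat = softmax(z)
    loss = -np.log(y_hat[y])
    dz = y_hat.copy()
    dz[y] -= 1.0                 # dL/dz = y_hat - y (chain rule through softmax + cross-entropy)
    dW = np.outer(dz, x)         # dL/dW = (dL/dz) x^T
    db = dz                      # dL/db = dL/dz
    dx = W.T @ dz                # dL/dx = W^T (dL/dz), the gradient passed downstream
    return loss, dW, db, dx

rng = np.random.default_rng(0)
loss, dW, db, dx = forward_backward(rng.normal(size=4), 1,
                                    rng.normal(size=(3, 4)), np.zeros(3))
print(loss, dW.shape, dx.shape)
```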
Computing gradients efficiently is crucial for training deep networks.
- Local Gradients: Derivative of operation with respect to inputs
- Upstream Gradients: Gradient flowing from layers above
- Downstream Gradients: Product of local and upstream gradients
- Computational graph helps visualize gradient flow
Matrix calculus enables efficient computation of gradients for vectorized operations.
For z = Wx, the Jacobian is ∂z/∂x = W, so the downstream gradient is ∂L/∂x = W^T (∂L/∂z)
For z = Wx, the weight gradient is ∂L/∂W = (∂L/∂z) x^T, an outer product of the upstream gradient and the input
∂(x^T W x)/∂x = (W + W^T) x
- Jacobian matrix contains all partial derivatives
- Chain rule extends to matrix operations
- Vectorization improves computational efficiency
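As a quick sanity check of the z = Wx identities above, one can compare the analytic gradient with finite differences; the vector v below stands in for a hypothetical upstream gradient so that the loss is a scalar.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(3, 4))
x = rng.normal(size=4)
v = rng.normal(size=3)           # stands in for the upstream gradient dL/dz

L = lambda x: v @ (W @ x)        # scalar loss with z = Wx
analytic = W.T @ v               # dL/dx = W^T v, from the identities above

eps = 1e-6
numeric = np.array([(L(x + eps * np.eye(4)[i]) - L(x - eps * np.eye(4)[i])) / (2 * eps)
                    for i in range(4)])
print(np.allclose(analytic, numeric))   # True
```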
Advanced optimization methods improve training speed and convergence.
- SGD: Stochastic Gradient Descent with mini-batches
- Momentum: Accumulates velocity to accelerate convergence
- Adam: Adaptive learning rates per parameter
- Learning Rate Scheduling: Decay learning rate over time
m_t = β_1 m_{t-1} + (1 - β_1) g_t
v_t = β_2 v_{t-1} + (1 - β_2) g_t²
θ_t = θ_{t-1} - α · m̂_t / (√v̂_t + ε)
where g_t is the gradient at step t, and m̂_t = m_t / (1 - β_1^t), v̂_t = v_t / (1 - β_2^t) are bias-corrected estimates
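A sketch of one Adam update implementing the equations above (including the bias correction), applied to the same toy quadratic as before; the hyperparameters are the commonly used defaults.

```python
import numpy as np

def adam_step(theta, grad, m, v, t, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8):
    """One Adam update following the equations above, with bias correction."""
    m = b1 * m + (1 - b1) * grad          # first-moment (mean) estimate
    v = b2 * v + (1 - b2) * grad ** 2     # second-moment (uncentered variance) estimate
    m_hat = m / (1 - b1 ** t)             # bias-corrected moments
    v_hat = v / (1 - b2 ** t)
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v

# minimize J(theta) = ||theta||^2 with gradient 2 * theta
theta = np.array([3.0, -2.0])
m = v = np.zeros_like(theta)
for t in range(1, 2001):
    theta, m, v = adam_step(theta, 2 * theta, m, v, t, lr=0.05)
print(theta)   # close to [0, 0]
```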
Key Takeaways
- Backpropagation uses chain rule to compute gradients efficiently
- Forward pass computes outputs, backward pass computes gradients
- Matrix calculus enables vectorized gradient computation
- Advanced optimizers like Adam improve training convergence
Lecture 4: Dependency Parsing
Dependency grammar represents sentence structure as directed graphs showing word relationships.
- Words are nodes, dependencies are directed edges
- Each word has exactly one head (except root)
- Captures grammatical relations (subject, object, modifier)
- More flexible than phrase structure grammar
- Works well across different languages
Common dependency relations capture grammatical functions:
- nsubj: Nominal subject
- dobj: Direct object
- amod: Adjectival modifier
- prep: Prepositional modifier
- det: Determiner
- aux: Auxiliary verb
Example: "The cat sat on the mat"
sat (root) → cat (nsubj), sat → on (prep), on → mat (pobj), cat → The (det), mat → the (det)
Transition-based parsers use a sequence of actions to build dependency trees.
- Configuration: (Stack, Buffer, Dependencies)
- SHIFT: Move word from buffer to stack
- LEFT-ARC: Add dependency from top to second-top, remove second-top
- RIGHT-ARC: Add dependency from second-top to top, remove top
Initial configuration: ([ROOT], [w_1, ..., w_n], ∅)
Terminal configuration: ([ROOT], [], A), where A is the set of dependency arcs
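A toy walk-through of these transitions for "The cat sat", driven by a hand-written action sequence rather than a learned classifier; arcs are stored as (head, label, dependent) triples.

```python
def shift(stack, buffer):
    stack.append(buffer.pop(0))

def left_arc(stack, arcs, label):
    # head = top of stack, dependent = second-from-top (which is removed)
    dep = stack.pop(-2)
    arcs.append((stack[-1], label, dep))

def right_arc(stack, arcs, label):
    # head = second-from-top, dependent = top of stack (which is removed)
    dep = stack.pop()
    arcs.append((stack[-1], label, dep))

stack, buffer, arcs = ["ROOT"], ["The", "cat", "sat"], []
for action, label in [("SHIFT", None), ("SHIFT", None), ("LEFT-ARC", "det"),
                      ("SHIFT", None), ("LEFT-ARC", "nsubj"), ("RIGHT-ARC", "root")]:
    if action == "SHIFT":
        shift(stack, buffer)
    elif action == "LEFT-ARC":
        left_arc(stack, arcs, label)
    else:
        right_arc(stack, arcs, label)

print(stack, buffer)  # ['ROOT'] []  -- terminal configuration
print(arcs)           # (head, label, dependent) triples
```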
Neural networks learn to predict parsing actions from parser configurations.
- Extract features from stack, buffer, and existing dependencies
- Use word embeddings, POS embeddings, dependency label embeddings
- Feed-forward network predicts next action
- Much faster and more accurate than traditional parsers
x = [e_{w1}; e_{w2}; ...; e_{t1}; e_{t2}; ...; e_{l1}; ...] (concatenated word, POS-tag, and arc-label embeddings)
h = ReLU(W_1 x + b_1)
p = softmax(W_2 h + b_2)
Parser performance is measured by attachment scores:
- UAS (Unlabeled Attachment Score): % of words with correct head
- LAS (Labeled Attachment Score): % of words with correct head and label
- LAS is stricter and more informative
- Modern neural parsers achieve >95% UAS on English
UAS = (# correct heads) / (# total words)
LAS = (# correct head+label) / (# total words)
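Both scores can be computed directly from per-word (head, label) pairs; the gold and predicted values below are made-up numbers for illustration.

```python
def attachment_scores(gold, predicted):
    """gold / predicted: lists of (head, label) pairs, one per word, in order."""
    assert len(gold) == len(predicted)
    correct_head = sum(g[0] == p[0] for g, p in zip(gold, predicted))
    correct_both = sum(g == p for g, p in zip(gold, predicted))
    n = len(gold)
    return correct_head / n, correct_both / n   # (UAS, LAS)

gold = [(2, "det"), (3, "nsubj"), (0, "root")]
pred = [(2, "det"), (3, "dobj"), (0, "root")]
print(attachment_scores(gold, pred))   # (1.0, 0.666...) -- one label is wrong
```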
Key Takeaways
- Dependency parsing captures grammatical relationships between words
- Transition-based parsing builds trees through sequence of actions
- Neural parsers learn to predict actions from parser state
- UAS and LAS measure parser accuracy
Lecture 5: Recurrent Neural Networks
Recurrent Neural Networks process sequential data by maintaining hidden state across time steps.
- Handle variable-length sequences
- Share parameters across time steps
- Maintain memory of previous inputs
- Natural fit for language modeling, machine translation, speech
h_t = tanh(W_hh h_{t-1} + W_xh x_t + b_h)
y_t = W_hy h_t + b_y
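A minimal NumPy sketch of unrolling this recurrence over a toy input sequence; sizes and weights are arbitrary illustrative values.

```python
import numpy as np

def rnn_step(x_t, h_prev, W_hh, W_xh, b_h):
    """One step of a vanilla RNN: h_t = tanh(W_hh h_{t-1} + W_xh x_t + b_h)."""
    return np.tanh(W_hh @ h_prev + W_xh @ x_t + b_h)

d_in, d_h, T = 8, 16, 5
rng = np.random.default_rng(0)
W_hh = rng.normal(scale=0.1, size=(d_h, d_h))
W_xh = rng.normal(scale=0.1, size=(d_h, d_in))
b_h = np.zeros(d_h)

h = np.zeros(d_h)                         # initial hidden state
for x_t in rng.normal(size=(T, d_in)):    # unroll over a toy sequence
    h = rnn_step(x_t, h, W_hh, W_xh, b_h)
print(h.shape)   # (16,)
```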
Training RNNs on long sequences faces gradient instability issues.
- Vanishing Gradients: Gradients shrink exponentially, preventing learning of long-term dependencies
- Exploding Gradients: Gradients grow exponentially, causing numerical instability
- Caused by repeated multiplication of weight matrix
- Solutions: gradient clipping, better architectures (LSTM, GRU)
∂h_t/∂h_0 = Π_{k=1}^{t} diag(tanh′(z_k)) W_hh, where z_k is the pre-activation at step k
The product of many nearly identical Jacobians causes exponential growth or decay of the gradient
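One of the remedies listed above, gradient clipping, can be sketched as rescaling all gradients by their global norm; the threshold of 5.0 below is an arbitrary illustrative choice.

```python
import numpy as np

def clip_by_global_norm(grads, max_norm=5.0):
    """Rescale all gradients if their combined L2 norm exceeds max_norm."""
    total_norm = np.sqrt(sum(np.sum(g ** 2) for g in grads))
    if total_norm > max_norm:
        grads = [g * (max_norm / total_norm) for g in grads]
    return grads

rng = np.random.default_rng(0)
grads = [rng.normal(scale=50.0, size=(4, 4)), rng.normal(scale=50.0, size=4)]
clipped = clip_by_global_norm(grads)
print(np.sqrt(sum(np.sum(g ** 2) for g in clipped)))   # <= 5.0
```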
LSTMs use gating mechanisms to control information flow and maintain long-term memory.
- Forget Gate: Decides what information to discard from cell state
- Input Gate: Controls what new information to add
- Output Gate: Determines what to output based on cell state
- Cell state acts as highway for gradient flow
f_t = σ(W_f [h_{t-1}, x_t] + b_f) (forget gate)
i_t = σ(W_i [h_{t-1}, x_t] + b_i) (input gate)
c̃_t = tanh(W_c [h_{t-1}, x_t] + b_c) (candidate)
c_t = f_t ⊙ c_{t-1} + i_t ⊙ c̃_t (cell state)
o_t = σ(W_o [h_{t-1}, x_t] + b_o) (output gate)
h_t = o_t ⊙ tanh(c_t) (hidden state)
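A compact NumPy sketch of one LSTM step; for brevity the four weight matrices W_f, W_i, W_o, W_c are stacked into a single matrix W acting on [h_{t-1}; x_t], and all values are toy choices.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x_t, h_prev, c_prev, W, b):
    """One LSTM step; W maps [h_{t-1}; x_t] to the stacked (f, i, o, candidate) pre-activations."""
    z = W @ np.concatenate([h_prev, x_t]) + b
    f, i, o, g = np.split(z, 4)
    f, i, o = sigmoid(f), sigmoid(i), sigmoid(o)   # forget, input, output gates
    c = f * c_prev + i * np.tanh(g)                # new cell state
    h = o * np.tanh(c)                             # new hidden state
    return h, c

d_in, d_h = 8, 16
rng = np.random.default_rng(0)
W = rng.normal(scale=0.1, size=(4 * d_h, d_h + d_in))
b = np.zeros(4 * d_h)
h, c = np.zeros(d_h), np.zeros(d_h)
h, c = lstm_step(rng.normal(size=d_in), h, c, W, b)
print(h.shape, c.shape)   # (16,) (16,)
```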
GRUs simplify LSTM architecture while maintaining similar performance.
- Combines forget and input gates into single update gate
- Merges cell state and hidden state
- Fewer parameters than LSTM
- Often performs comparably to LSTM with faster training
z_t = σ(W_z [h_{t-1}, x_t]) (update gate)
r_t = σ(W_r [h_{t-1}, x_t]) (reset gate)
h̃_t = tanh(W [r_t ⊙ h_{t-1}, x_t]) (candidate)
h_t = (1 - z_t) ⊙ h_{t-1} + z_t ⊙ h̃_t
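A corresponding sketch of one GRU step, with biases omitted as in the equations above and toy random weights.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_step(x_t, h_prev, W_z, W_r, W_h):
    """One GRU step following the equations above (biases omitted for brevity)."""
    hx = np.concatenate([h_prev, x_t])
    z = sigmoid(W_z @ hx)                                       # update gate
    r = sigmoid(W_r @ hx)                                       # reset gate
    h_tilde = np.tanh(W_h @ np.concatenate([r * h_prev, x_t]))  # candidate state
    return (1 - z) * h_prev + z * h_tilde

d_in, d_h = 8, 16
rng = np.random.default_rng(0)
shape = (d_h, d_h + d_in)
h = gru_step(rng.normal(size=d_in), np.zeros(d_h),
             rng.normal(scale=0.1, size=shape),
             rng.normal(scale=0.1, size=shape),
             rng.normal(scale=0.1, size=shape))
print(h.shape)   # (16,)
```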
Bidirectional RNNs process sequences in both forward and backward directions.
- Forward RNN processes left-to-right
- Backward RNN processes right-to-left
- Concatenate forward and backward hidden states
- Captures context from both past and future
- Useful for tasks where full sequence is available (e.g., NER, POS tagging)
h_t = [→h_t ; ←h_t]
where →h_t is the forward hidden state and ←h_t the backward hidden state
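A small sketch of the bidirectional idea: run two independently parameterized RNNs over the sequence and its reverse, then concatenate their hidden states per time step (toy random weights).

```python
import numpy as np

def run_rnn(xs, W_hh, W_xh):
    """Return the hidden state at every time step of a vanilla RNN (zero initial state)."""
    h, hs = np.zeros(W_hh.shape[0]), []
    for x_t in xs:
        h = np.tanh(W_hh @ h + W_xh @ x_t)
        hs.append(h)
    return hs

d_in, d_h, T = 8, 16, 5
rng = np.random.default_rng(0)
xs = rng.normal(size=(T, d_in))
fwd = run_rnn(xs, rng.normal(scale=0.1, size=(d_h, d_h)), rng.normal(scale=0.1, size=(d_h, d_in)))
bwd = run_rnn(xs[::-1], rng.normal(scale=0.1, size=(d_h, d_h)), rng.normal(scale=0.1, size=(d_h, d_in)))[::-1]
states = [np.concatenate([f, b]) for f, b in zip(fwd, bwd)]   # h_t = [->h_t ; <-h_t]
print(states[0].shape)   # (32,)
```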
RNNs and their variants power many NLP applications:
- Language Modeling: Predict next word in sequence
- Machine Translation: Encoder-decoder architectures
- Sentiment Analysis: Classify text sentiment
- Named Entity Recognition: Tag entities in text
- Speech Recognition: Convert audio to text
- Text Generation: Generate coherent text sequences
Key Takeaways
- RNNs process sequences by maintaining hidden state over time
- Vanilla RNNs suffer from vanishing/exploding gradients
- LSTMs and GRUs use gates to control information flow
- Bidirectional RNNs capture context from both directions
- RNN variants enable many sequence-to-sequence applications
Quick Reference Formulas
J(θ) = (1/T) Σ_{t=1}^{T} Σ_{-m ≤ j ≤ m, j ≠ 0} log P(w_{t+j} | w_t) (skip-gram objective)
P(w_j | w_i) = exp(u_j^T v_i) / Σ_{k=1}^{V} exp(u_k^T v_i) (softmax word probability)
J = Σ_{i,j=1}^{V} f(X_ij) (w_i^T w̃_j + b_i + b̃_j - log X_ij)² (GloVe objective)
θ_new = θ_old - α ∇_θ J(θ) (gradient descent update)
∂L/∂w = (∂L/∂ŷ)(∂ŷ/∂h)(∂h/∂w) (chain rule)
h_t = tanh(W_hh h_{t-1} + W_xh x_t + b_h) (RNN hidden state)
c_t = f_t ⊙ c_{t-1} + i_t ⊙ c̃_t (LSTM cell state)
L = -Σ_i y_i log(ŷ_i) (cross-entropy loss)
Study Resources
- Video Lectures
- Additional Materials
- Stanford AI Programs
Study Tips
- Review lecture videos and take detailed notes
- Practice implementing concepts in code
- Work through problem sets and assignments
- Use flashcards for memorizing key definitions and formulas
- Test yourself regularly with the practice quizzes
- Join study groups to discuss challenging concepts
- Apply concepts to real-world NLP projects