Course Overview
This comprehensive study guide covers the first five lectures of Stanford's CS224N course on Natural Language Processing with Deep Learning. The course explores modern approaches to understanding and generating human language using neural networks and deep learning techniques.
Lecture 1: Introduction and Word Vectors
Natural Language Processing (NLP) is a field at the intersection of computer science, artificial intelligence, and linguistics. The goal is to enable computers to understand, interpret, and generate human language in a valuable way. Typical applications include:
- Understanding text and speech
- Machine translation
- Question answering systems
- Sentiment analysis
- Information extraction
- Text generation and summarization
Human language is incredibly complex and nuanced. Several factors make NLP particularly challenging:
- Ambiguity: Words and sentences can have multiple meanings depending on context
- Context Dependence: Meaning changes based on surrounding words and discourse
- World Knowledge: Understanding often requires external knowledge beyond the text
- Variability: Infinite ways to express the same idea
- Compositionality: Meaning of phrases depends on word combinations
Traditional approaches used discrete symbols (one-hot encoding), but this fails to capture semantic relationships. Modern approaches use distributed representations.
"You shall know a word by the company it keeps" - J.R. Firth
Words that appear in similar contexts tend to have similar meanings
Word2Vec is a framework for learning word embeddings using neural networks. Two main architectures:
- Skip-gram: Predicts context words given a center word
- CBOW (Continuous Bag of Words): Predicts center word given context words
Maximize the average log-likelihood:
J(θ) = (1/T) Σ_{t=1}^{T} Σ_{-m ≤ j ≤ m, j ≠ 0} log P(w_{t+j} | w_t)
where T is the number of tokens in the corpus and m is the context window size
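As a concrete illustration, here is a minimal NumPy sketch of the full-softmax skip-gram probability P(o | c) and the resulting average log-likelihood; the vocabulary size, embedding dimension, and (center, outside) pairs are toy values chosen for illustration, not the course's actual training setup.

```python
import numpy as np

np.random.seed(0)
V, d = 10, 4                      # toy vocabulary size and embedding dimension
U = np.random.randn(V, d) * 0.1   # "outside" (context) vectors u_w
Vc = np.random.randn(V, d) * 0.1  # "center" vectors v_w

def skipgram_prob(center, outside):
    """P(outside | center) = softmax over u_w . v_center for all words w (naive full softmax)."""
    scores = U @ Vc[center]                # u_w^T v_c for every word w
    probs = np.exp(scores - scores.max())  # numerically stabilized softmax
    probs /= probs.sum()
    return probs[outside]

# average log-likelihood of a tiny "corpus" of (center, outside) pairs
pairs = [(1, 2), (1, 3), (4, 5)]
J = np.mean([np.log(skipgram_prob(c, o)) for c, o in pairs])
print(J)
```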
Word vectors capture semantic and syntactic relationships through vector arithmetic:
vec("king") - vec("man") + vec("woman") ≈ vec("queen")
vec("Paris") - vec("France") + vec("Italy") ≈ vec("Rome")
Key Takeaways
- Word vectors represent words as dense, continuous vectors of a few hundred dimensions, far smaller than one-hot vectors of vocabulary size
- Distributional semantics: words in similar contexts have similar meanings
- Word2Vec learns embeddings by predicting context words
- Vector arithmetic can capture semantic relationships
Lecture 2: Word Vectors and Language Models
Training word vectors requires optimizing the objective function using gradient descent.
- Gradient descent updates parameters in direction of steepest descent
- Stochastic Gradient Descent (SGD) uses mini-batches for efficiency
- Learning rate controls step size
- Negative sampling reduces computational complexity
θ_new = θ_old - α ∇_θ J(θ)
where α is the learning rate
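A minimal sketch of this update rule on a toy quadratic objective; in stochastic gradient descent the gradient would come from a mini-batch of training examples rather than the full objective.

```python
import numpy as np

def sgd_step(theta, grad, lr=0.05):
    """One gradient-descent update: theta <- theta - lr * grad."""
    return theta - lr * grad

# minimize the toy objective J(theta) = ||theta||^2, whose gradient is 2 * theta
theta = np.array([3.0, -2.0])
for _ in range(100):
    theta = sgd_step(theta, 2 * theta)   # with SGD, `grad` would come from a mini-batch
print(theta)   # approaches the minimum at [0, 0]
```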
Efficient training techniques make Word2Vec practical for large corpora:
- Negative Sampling: Sample k negative examples instead of computing the full softmax (sketched after this list)
- Hierarchical Softmax: Use binary tree structure for efficient computation
- Subsampling: Reduce frequency of common words
- Context Window: Dynamic window size for varied context
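The negative-sampling objective from the list above can be written as a per-pair loss. This is a simplified sketch using random toy vectors; it omits the unigram^0.75 sampling distribution used to draw the negatives.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sgns_loss(v_c, u_o, U_neg):
    """Skip-gram negative-sampling loss for one (center, outside) pair.

    v_c:   center word vector
    u_o:   true outside word vector
    U_neg: (k, d) matrix of k sampled negative outside vectors
    """
    pos = -np.log(sigmoid(u_o @ v_c))              # push the true pair together
    neg = -np.sum(np.log(sigmoid(-U_neg @ v_c)))   # push sampled negative pairs apart
    return pos + neg

d, k = 8, 5
rng = np.random.default_rng(0)
print(sgns_loss(rng.normal(size=d), rng.normal(size=d), rng.normal(size=(k, d))))
```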
GloVe combines global matrix factorization and local context window methods.
J = Σ_{i,j=1}^{V} f(X_ij) (w_i^T w̃_j + b_i + b̃_j - log X_ij)²
where X_ij is the co-occurrence count of words i and j, and f is a weighting function
- Leverages global word-word co-occurrence statistics
- Fast training and good performance with smaller corpora
- Captures both semantic and syntactic patterns
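A rough NumPy sketch of the GloVe objective above, assuming a small dense co-occurrence matrix; the weighting function uses the commonly cited x_max = 100 and α = 0.75 values, and real implementations work with sparse counts and train the parameters rather than just evaluating the loss.

```python
import numpy as np

def glove_loss(W, W_tilde, b, b_tilde, X, x_max=100, alpha=0.75):
    """Weighted least-squares GloVe objective over a dense co-occurrence matrix X."""
    f = np.minimum((X / x_max) ** alpha, 1.0)   # weighting function f(X_ij)
    f[X == 0] = 0.0                             # skip zero co-occurrences
    logX = np.log(np.where(X > 0, X, 1.0))
    resid = W @ W_tilde.T + b[:, None] + b_tilde[None, :] - logX
    return np.sum(f * resid ** 2)

V, d = 6, 4
rng = np.random.default_rng(0)
X = rng.integers(0, 20, size=(V, V)).astype(float)
print(glove_loss(rng.normal(size=(V, d)), rng.normal(size=(V, d)),
                 rng.normal(size=V), rng.normal(size=V), X))
```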
Two main approaches to evaluate word embeddings:
- Intrinsic Evaluation: Word similarity, analogy tasks (e.g., king:queen :: man:?)
- Extrinsic Evaluation: Performance on downstream tasks (NER, sentiment analysis)
- Intrinsic is faster but may not correlate with real-world performance
- Extrinsic is more meaningful but computationally expensive
Language models assign probabilities to sequences of words.
P(w_1, w_2, ..., w_n) = P(w_1) P(w_2 | w_1) ... P(w_n | w_1, ..., w_{n-1})
- Core task: predict next word given previous words
- Applications: speech recognition, machine translation, text generation
- Perplexity is the standard evaluation metric (lower is better)
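Since perplexity is the exponentiated average negative log-probability of the chain-rule factors above, it can be computed directly from per-token probabilities:

```python
import numpy as np

def perplexity(token_probs):
    """Perplexity = exp of the average negative log-probability per token.

    `token_probs` are the model's probabilities P(w_i | w_1, ..., w_{i-1}),
    i.e. the factors of the chain rule above.
    """
    return float(np.exp(-np.mean(np.log(token_probs))))

# a model that assigns probability 0.25 to every token has perplexity 4
print(perplexity([0.25, 0.25, 0.25, 0.25]))   # 4.0
```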
Comparison of traditional and neural approaches:
- N-gram Models: Use count-based statistics with Markov assumption
- Neural Models: Learn distributed representations, better generalization
- N-grams suffer from data sparsity and curse of dimensionality
- Neural models handle longer contexts and unseen word combinations
Key Takeaways
- Efficient training techniques like negative sampling enable Word2Vec at scale
- GloVe combines global statistics with local context for better embeddings
- Language models predict word sequences and are fundamental to NLP
- Neural language models outperform traditional n-gram approaches
Lecture 3: Backpropagation and Neural Networks
Neural networks are composed of layers of interconnected neurons that transform inputs to outputs.
- Input Layer: Receives raw data (e.g., word vectors)
- Hidden Layers: Learn hierarchical representations
- Output Layer: Produces final predictions
- Activation Functions: Introduce non-linearity (sigmoid, tanh, ReLU)
h = f(w^T x + b)
where f is the activation function, w the weight vector, and b the bias
Forward propagation computes the output of the network given an input.
h^(1) = f(W^(1) x + b^(1))
h^(2) = f(W^(2) h^(1) + b^(2))
ŷ = softmax(W^(3) h^(2) + b^(3))
- Data flows from input to output layer
- Each layer applies linear transformation followed by non-linearity
- Final layer often uses softmax for classification
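A minimal NumPy sketch of this forward pass with two hidden layers; ReLU is chosen here as the activation f, and all layer sizes and weights are toy values.

```python
import numpy as np

def relu(x):
    return np.maximum(0, x)

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def forward(x, params):
    """Two hidden layers plus a softmax output, mirroring the equations above."""
    W1, b1, W2, b2, W3, b3 = params
    h1 = relu(W1 @ x + b1)         # h^(1) = f(W^(1) x + b^(1))
    h2 = relu(W2 @ h1 + b2)        # h^(2) = f(W^(2) h^(1) + b^(2))
    return softmax(W3 @ h2 + b3)   # y_hat

rng = np.random.default_rng(0)
sizes = [(20, 50), (20,), (10, 20), (10,), (5, 10), (5,)]
params = [rng.normal(scale=0.1, size=s) for s in sizes]
print(forward(rng.normal(size=50), params))   # 5-way class probabilities
```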
Backpropagation efficiently computes gradients using the chain rule of calculus.
- Compute loss at output layer
- Propagate error backwards through network
- Calculate gradient for each parameter
- Update parameters using gradient descent
∂L/∂w = (∂L/∂ŷ)(∂ŷ/∂h)(∂h/∂w)
Gradients flow backwards through computational graph
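A small self-contained example of these steps for a softmax classifier with cross-entropy loss, where the chain rule yields the familiar gradients: ŷ - y at the output, an outer product with the input for the weights, and W^T times the upstream gradient for the input.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def forward_backward(x, y, W, b):
    """Cross-entropy loss and its gradients for a softmax classifier z = Wx + b."""
    z = W @ x + b
    y_hat = softmax(z)
    loss = -np.log(y_hat[y])
    dz = y_hat.copy()
    dz[y] -= 1.0                 # dL/dz = y_hat - y (chain rule through softmax + cross-entropy)
    dW = np.outer(dz, x)         # dL/dW = (dL/dz) x^T
    db = dz                      # dL/db = dL/dz
    dx = W.T @ dz                # dL/dx = W^T (dL/dz), the gradient passed downstream
    return loss, dW, db, dx

rng = np.random.default_rng(0)
loss, dW, db, dx = forward_backward(rng.normal(size=4), 1,
                                    rng.normal(size=(3, 4)), np.zeros(3))
print(loss, dW.shape, dx.shape)
```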
Computing gradients efficiently is crucial for training deep networks.
- Local Gradients: Derivative of operation with respect to inputs
- Upstream Gradients: Gradient flowing from layers above
- Downstream Gradients: Product of local and upstream gradients
- Computational graph helps visualize gradient flow
Matrix calculus enables efficient computation of gradients for vectorized operations.
For z = Wx, the Jacobian is ∂z/∂x = W, so the downstream gradient is ∂L/∂x = W^T (∂L/∂z)
For z = Wx, the weight gradient is ∂L/∂W = (∂L/∂z) x^T, an outer product of the upstream gradient and the input
∂(x^T W x)/∂x = (W + W^T) x
- Jacobian matrix contains all partial derivatives
- Chain rule extends to matrix operations
- Vectorization improves computational efficiency
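As a quick sanity check of the z = Wx identities above, one can compare the analytic gradient with finite differences; the vector v below stands in for a hypothetical upstream gradient so that the loss is a scalar.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(3, 4))
x = rng.normal(size=4)
v = rng.normal(size=3)           # stands in for the upstream gradient dL/dz

L = lambda x: v @ (W @ x)        # scalar loss with z = Wx
analytic = W.T @ v               # dL/dx = W^T v, from the identities above

eps = 1e-6
numeric = np.array([(L(x + eps * np.eye(4)[i]) - L(x - eps * np.eye(4)[i])) / (2 * eps)
                    for i in range(4)])
print(np.allclose(analytic, numeric))   # True
```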
Advanced optimization methods improve training speed and convergence.
- SGD: Stochastic Gradient Descent with mini-batches
- Momentum: Accumulates velocity to accelerate convergence
- Adam: Adaptive learning rates per parameter
- Learning Rate Scheduling: Decay learning rate over time
m_t = β_1 m_{t-1} + (1 - β_1) g_t
v_t = β_2 v_{t-1} + (1 - β_2) g_t²
θ_t = θ_{t-1} - α · m̂_t / (√v̂_t + ε)
where g_t is the gradient at step t, and m̂_t = m_t / (1 - β_1^t), v̂_t = v_t / (1 - β_2^t) are bias-corrected estimates
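A sketch of one Adam update implementing the equations above (including the bias correction), applied to the same toy quadratic as before; the hyperparameters are the commonly used defaults.

```python
import numpy as np

def adam_step(theta, grad, m, v, t, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8):
    """One Adam update following the equations above, with bias correction."""
    m = b1 * m + (1 - b1) * grad          # first-moment (mean) estimate
    v = b2 * v + (1 - b2) * grad ** 2     # second-moment (uncentered variance) estimate
    m_hat = m / (1 - b1 ** t)             # bias-corrected moments
    v_hat = v / (1 - b2 ** t)
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v

# minimize J(theta) = ||theta||^2 with gradient 2 * theta
theta = np.array([3.0, -2.0])
m = v = np.zeros_like(theta)
for t in range(1, 2001):
    theta, m, v = adam_step(theta, 2 * theta, m, v, t, lr=0.05)
print(theta)   # close to [0, 0]
```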
Key Takeaways
- Backpropagation uses chain rule to compute gradients efficiently
- Forward pass computes outputs, backward pass computes gradients
- Matrix calculus enables vectorized gradient computation
- Advanced optimizers like Adam improve training convergence
Lecture 4: Dependency Parsing
Dependency grammar represents sentence structure as directed graphs showing word relationships.
- Words are nodes, dependencies are directed edges
- Each word has exactly one head (except root)
- Captures grammatical relations (subject, object, modifier)
- More flexible than phrase structure grammar
- Works well across different languages
Common dependency relations capture grammatical functions:
- nsubj: Nominal subject
- dobj: Direct object
- amod: Adjectival modifier
- prep: Prepositional modifier
- det: Determiner
- aux: Auxiliary verb
Example: "The cat sat on the mat"
sat (root) → cat (nsubj), sat → on (prep), on → mat (pobj), cat → The (det), mat → the (det)
Transition-based parsers use a sequence of actions to build dependency trees.
- Configuration: (Stack, Buffer, Dependencies)
- SHIFT: Move word from buffer to stack
- LEFT-ARC: Add dependency from top to second-top, remove second-top
- RIGHT-ARC: Add dependency from second-top to top, remove top
Initial configuration: ([ROOT], [w_1, ..., w_n], ∅)
Terminal configuration: ([ROOT], [], A), where A is the set of dependency arcs
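A toy walk-through of these transitions for "The cat sat", driven by a hand-written action sequence rather than a learned classifier; arcs are stored as (head, label, dependent) triples.

```python
def shift(stack, buffer):
    stack.append(buffer.pop(0))

def left_arc(stack, arcs, label):
    # head = top of stack, dependent = second-from-top (which is removed)
    dep = stack.pop(-2)
    arcs.append((stack[-1], label, dep))

def right_arc(stack, arcs, label):
    # head = second-from-top, dependent = top of stack (which is removed)
    dep = stack.pop()
    arcs.append((stack[-1], label, dep))

stack, buffer, arcs = ["ROOT"], ["The", "cat", "sat"], []
for action, label in [("SHIFT", None), ("SHIFT", None), ("LEFT-ARC", "det"),
                      ("SHIFT", None), ("LEFT-ARC", "nsubj"), ("RIGHT-ARC", "root")]:
    if action == "SHIFT":
        shift(stack, buffer)
    elif action == "LEFT-ARC":
        left_arc(stack, arcs, label)
    else:
        right_arc(stack, arcs, label)

print(stack, buffer)  # ['ROOT'] []  -- terminal configuration
print(arcs)           # (head, label, dependent) triples
```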
Neural networks learn to predict parsing actions from parser configurations.
- Extract features from stack, buffer, and existing dependencies
- Use word embeddings, POS embeddings, dependency label embeddings
- Feed-forward network predicts next action
- Much faster and more accurate than traditional parsers
x = [e_{w1}; e_{w2}; ...; e_{t1}; e_{t2}; ...; e_{l1}; ...] (concatenated word, POS-tag, and arc-label embeddings)
h = ReLU(W_1 x + b_1)
p = softmax(W_2 h + b_2)
Parser performance is measured by attachment scores:
- UAS (Unlabeled Attachment Score): % of words with correct head
- LAS (Labeled Attachment Score): % of words with correct head and label
- LAS is stricter and more informative
- Modern neural parsers achieve >95% UAS on English
UAS = (# correct heads) / (# total words)
LAS = (# correct head+label) / (# total words)
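Both scores can be computed directly from per-word (head, label) pairs; the gold and predicted values below are made-up numbers for illustration.

```python
def attachment_scores(gold, predicted):
    """gold / predicted: lists of (head, label) pairs, one per word, in order."""
    assert len(gold) == len(predicted)
    correct_head = sum(g[0] == p[0] for g, p in zip(gold, predicted))
    correct_both = sum(g == p for g, p in zip(gold, predicted))
    n = len(gold)
    return correct_head / n, correct_both / n   # (UAS, LAS)

gold = [(2, "det"), (3, "nsubj"), (0, "root")]
pred = [(2, "det"), (3, "dobj"), (0, "root")]
print(attachment_scores(gold, pred))   # (1.0, 0.666...) -- one label is wrong
```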
Key Takeaways
- Dependency parsing captures grammatical relationships between words
- Transition-based parsing builds trees through sequence of actions
- Neural parsers learn to predict actions from parser state
- UAS and LAS measure parser accuracy
Lecture 5: Recurrent Neural Networks
Recurrent Neural Networks process sequential data by maintaining hidden state across time steps.
- Handle variable-length sequences
- Share parameters across time steps
- Maintain memory of previous inputs
- Natural fit for language modeling, machine translation, speech
h_t = tanh(W_hh h_{t-1} + W_xh x_t + b_h)
y_t = W_hy h_t + b_y
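A minimal NumPy sketch of unrolling this recurrence over a toy input sequence; sizes and weights are arbitrary illustrative values.

```python
import numpy as np

def rnn_step(x_t, h_prev, W_hh, W_xh, b_h):
    """One step of a vanilla RNN: h_t = tanh(W_hh h_{t-1} + W_xh x_t + b_h)."""
    return np.tanh(W_hh @ h_prev + W_xh @ x_t + b_h)

d_in, d_h, T = 8, 16, 5
rng = np.random.default_rng(0)
W_hh = rng.normal(scale=0.1, size=(d_h, d_h))
W_xh = rng.normal(scale=0.1, size=(d_h, d_in))
b_h = np.zeros(d_h)

h = np.zeros(d_h)                         # initial hidden state
for x_t in rng.normal(size=(T, d_in)):    # unroll over a toy sequence
    h = rnn_step(x_t, h, W_hh, W_xh, b_h)
print(h.shape)   # (16,)
```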
Training RNNs on long sequences faces gradient instability issues.
- Vanishing Gradients: Gradients shrink exponentially, preventing learning of long-term dependencies
- Exploding Gradients: Gradients grow exponentially, causing numerical instability
- Caused by repeated multiplication of weight matrix
- Solutions: gradient clipping, better architectures (LSTM, GRU)
∂h_t/∂h_0 = Π_{k=1}^{t} diag(tanh′(z_k)) W_hh, where z_k is the pre-activation at step k
The product of many nearly identical Jacobians causes exponential growth or decay of the gradient
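One of the remedies listed above, gradient clipping, can be sketched as rescaling all gradients by their global norm; the threshold of 5.0 below is an arbitrary illustrative choice.

```python
import numpy as np

def clip_by_global_norm(grads, max_norm=5.0):
    """Rescale all gradients if their combined L2 norm exceeds max_norm."""
    total_norm = np.sqrt(sum(np.sum(g ** 2) for g in grads))
    if total_norm > max_norm:
        grads = [g * (max_norm / total_norm) for g in grads]
    return grads

rng = np.random.default_rng(0)
grads = [rng.normal(scale=50.0, size=(4, 4)), rng.normal(scale=50.0, size=4)]
clipped = clip_by_global_norm(grads)
print(np.sqrt(sum(np.sum(g ** 2) for g in clipped)))   # <= 5.0
```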
LSTMs use gating mechanisms to control information flow and maintain long-term memory.
- Forget Gate: Decides what information to discard from cell state
- Input Gate: Controls what new information to add
- Output Gate: Determines what to output based on cell state
- Cell state acts as highway for gradient flow
f_t = σ(W_f [h_{t-1}, x_t] + b_f) (forget gate)
i_t = σ(W_i [h_{t-1}, x_t] + b_i) (input gate)
c̃_t = tanh(W_c [h_{t-1}, x_t] + b_c) (candidate)
c_t = f_t ⊙ c_{t-1} + i_t ⊙ c̃_t (cell state)
o_t = σ(W_o [h_{t-1}, x_t] + b_o) (output gate)
h_t = o_t ⊙ tanh(c_t) (hidden state)
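A compact NumPy sketch of one LSTM step; for brevity the four weight matrices W_f, W_i, W_o, W_c are stacked into a single matrix W acting on [h_{t-1}; x_t], and all values are toy choices.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x_t, h_prev, c_prev, W, b):
    """One LSTM step; W maps [h_{t-1}; x_t] to the stacked (f, i, o, candidate) pre-activations."""
    z = W @ np.concatenate([h_prev, x_t]) + b
    f, i, o, g = np.split(z, 4)
    f, i, o = sigmoid(f), sigmoid(i), sigmoid(o)   # forget, input, output gates
    c = f * c_prev + i * np.tanh(g)                # new cell state
    h = o * np.tanh(c)                             # new hidden state
    return h, c

d_in, d_h = 8, 16
rng = np.random.default_rng(0)
W = rng.normal(scale=0.1, size=(4 * d_h, d_h + d_in))
b = np.zeros(4 * d_h)
h, c = np.zeros(d_h), np.zeros(d_h)
h, c = lstm_step(rng.normal(size=d_in), h, c, W, b)
print(h.shape, c.shape)   # (16,) (16,)
```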
GRUs simplify LSTM architecture while maintaining similar performance.
- Combines forget and input gates into single update gate
- Merges cell state and hidden state
- Fewer parameters than LSTM
- Often performs comparably to LSTM with faster training
z_t = σ(W_z [h_{t-1}, x_t]) (update gate)
r_t = σ(W_r [h_{t-1}, x_t]) (reset gate)
h̃_t = tanh(W [r_t ⊙ h_{t-1}, x_t]) (candidate)
h_t = (1 - z_t) ⊙ h_{t-1} + z_t ⊙ h̃_t
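A corresponding sketch of one GRU step, with biases omitted as in the equations above and toy random weights.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_step(x_t, h_prev, W_z, W_r, W_h):
    """One GRU step following the equations above (biases omitted for brevity)."""
    hx = np.concatenate([h_prev, x_t])
    z = sigmoid(W_z @ hx)                                       # update gate
    r = sigmoid(W_r @ hx)                                       # reset gate
    h_tilde = np.tanh(W_h @ np.concatenate([r * h_prev, x_t]))  # candidate state
    return (1 - z) * h_prev + z * h_tilde

d_in, d_h = 8, 16
rng = np.random.default_rng(0)
shape = (d_h, d_h + d_in)
h = gru_step(rng.normal(size=d_in), np.zeros(d_h),
             rng.normal(scale=0.1, size=shape),
             rng.normal(scale=0.1, size=shape),
             rng.normal(scale=0.1, size=shape))
print(h.shape)   # (16,)
```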
Bidirectional RNNs process sequences in both forward and backward directions.
- Forward RNN processes left-to-right
- Backward RNN processes right-to-left
- Concatenate forward and backward hidden states
- Captures context from both past and future
- Useful for tasks where full sequence is available (e.g., NER, POS tagging)
h_t = [→h_t ; ←h_t]
where →h_t is the forward hidden state and ←h_t the backward hidden state
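A small sketch of the bidirectional idea: run two independently parameterized RNNs over the sequence and its reverse, then concatenate their hidden states per time step (toy random weights).

```python
import numpy as np

def run_rnn(xs, W_hh, W_xh):
    """Return the hidden state at every time step of a vanilla RNN (zero initial state)."""
    h, hs = np.zeros(W_hh.shape[0]), []
    for x_t in xs:
        h = np.tanh(W_hh @ h + W_xh @ x_t)
        hs.append(h)
    return hs

d_in, d_h, T = 8, 16, 5
rng = np.random.default_rng(0)
xs = rng.normal(size=(T, d_in))
fwd = run_rnn(xs, rng.normal(scale=0.1, size=(d_h, d_h)), rng.normal(scale=0.1, size=(d_h, d_in)))
bwd = run_rnn(xs[::-1], rng.normal(scale=0.1, size=(d_h, d_h)), rng.normal(scale=0.1, size=(d_h, d_in)))[::-1]
states = [np.concatenate([f, b]) for f, b in zip(fwd, bwd)]   # h_t = [->h_t ; <-h_t]
print(states[0].shape)   # (32,)
```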
RNNs and their variants power many NLP applications:
- Language Modeling: Predict next word in sequence
- Machine Translation: Encoder-decoder architectures
- Sentiment Analysis: Classify text sentiment
- Named Entity Recognition: Tag entities in text
- Speech Recognition: Convert audio to text
- Text Generation: Generate coherent text sequences
Key Takeaways
- RNNs process sequences by maintaining hidden state over time
- Vanilla RNNs suffer from vanishing/exploding gradients
- LSTMs and GRUs use gates to control information flow
- Bidirectional RNNs capture context from both directions
- RNN variants enable many sequence-to-sequence applications
Quick Reference Formulas
J(θ) = (1/T) Σ_{t=1}^{T} Σ_{-m ≤ j ≤ m, j ≠ 0} log P(w_{t+j} | w_t) (skip-gram objective)
P(w_j | w_i) = exp(u_j^T v_i) / Σ_{k=1}^{V} exp(u_k^T v_i) (softmax word probability)
J = Σ_{i,j=1}^{V} f(X_ij) (w_i^T w̃_j + b_i + b̃_j - log X_ij)² (GloVe objective)
θ_new = θ_old - α ∇_θ J(θ) (gradient descent update)
∂L/∂w = (∂L/∂ŷ)(∂ŷ/∂h)(∂h/∂w) (chain rule)
h_t = tanh(W_hh h_{t-1} + W_xh x_t + b_h) (RNN hidden state)
c_t = f_t ⊙ c_{t-1} + i_t ⊙ c̃_t (LSTM cell state)
L = -Σ_i y_i log(ŷ_i) (cross-entropy loss)
Study Resources
- Video Lectures
- Additional Materials
- Stanford AI Programs
Study Tips
- Review lecture videos and take detailed notes
- Practice implementing concepts in code
- Work through problem sets and assignments
- Use flashcards for memorizing key definitions and formulas
- Test yourself regularly with the practice quizzes
- Join study groups to discuss challenging concepts
- Apply concepts to real-world NLP projects