Course Overview

This comprehensive study guide covers the first five lectures of Stanford's CS224N course on Natural Language Processing with Deep Learning. The course explores modern approaches to understanding and generating human language using neural networks and deep learning techniques.

  • 5 Lectures: Comprehensive coverage of foundational NLP concepts
  • 20+ Quizzes: Practice questions with instant feedback
  • 30+ Flashcards: Interactive study cards for key concepts

Lecture 1: Introduction and Word Vectors

What is Natural Language Processing?

Natural Language Processing (NLP) is a field at the intersection of computer science, artificial intelligence, and linguistics. The goal is to enable computers to understand, interpret, and generate human language in a valuable way.

  • Understanding text and speech
  • Machine translation
  • Question answering systems
  • Sentiment analysis
  • Information extraction
  • Text generation and summarization
Why is NLP Challenging?

Human language is incredibly complex and nuanced. Several factors make NLP particularly challenging:

  • Ambiguity: Words and sentences can have multiple meanings depending on context
  • Context Dependence: Meaning changes based on surrounding words and discourse
  • World Knowledge: Understanding often requires external knowledge beyond the text
  • Variability: Infinite ways to express the same idea
  • Compositionality: Meaning of phrases depends on word combinations
Word Meaning Representations

Traditional approaches represented words as discrete symbols (one-hot encodings), but these representations fail to capture semantic relationships between words. Modern approaches use distributed representations.

Distributional Hypothesis:
"You shall know a word by the company it keeps" - J.R. Firth
Words that appear in similar contexts tend to have similar meanings
Word2Vec: Skip-gram and CBOW

Word2Vec is a framework for learning word embeddings using neural networks. Two main architectures:

  • Skip-gram: Predicts context words given a center word
  • CBOW (Continuous Bag of Words): Predicts center word given context words
Skip-gram Objective:
Maximize: J(θ) = (1/T) Σ_{t=1..T} Σ_{-m ≤ j ≤ m, j ≠ 0} log P(w_{t+j} | w_t)
Where T is the number of words in the corpus and m is the context window size
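
As a concrete illustration, here is a minimal NumPy sketch of the quantity inside the objective, the softmax probability P(w_{t+j} | w_t) computed from "outside" (context) and "center" vectors. The vocabulary size, embedding dimension, and random initialization are illustrative assumptions, not values from the lecture.

```python
import numpy as np

rng = np.random.default_rng(0)
V, d = 10, 4                              # toy vocabulary size and embedding dimension (assumed)
U = rng.normal(scale=0.1, size=(V, d))    # "outside" (context) vectors u_k
W = rng.normal(scale=0.1, size=(V, d))    # "center" vectors v_c

def p_context_given_center(o, c):
    """Softmax probability P(w_o | w_c) used inside the skip-gram objective."""
    scores = U @ W[c]                     # u_k^T v_c for every word k in the vocabulary
    scores -= scores.max()                # subtract max for numerical stability
    probs = np.exp(scores) / np.exp(scores).sum()
    return probs[o]

# One term of the average log-likelihood J(θ): log P(w_{t+j} | w_t)
print(np.log(p_context_given_center(o=3, c=7)))
```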
Word Vector Properties

Word vectors capture semantic and syntactic relationships through vector arithmetic:

Famous Examples:
vec("king") - vec("man") + vec("woman") ≈ vec("queen")
vec("Paris") - vec("France") + vec("Italy") ≈ vec("Rome")

Key Takeaways

  • Word vectors represent words as dense, continuous vectors in high-dimensional space
  • Distributional semantics: words in similar contexts have similar meanings
  • Word2Vec learns embeddings by predicting context words
  • Vector arithmetic can capture semantic relationships

Lecture 2: Word Vectors and Language Models

Optimization Basics for Word Vectors

Training word vectors requires optimizing the objective function using gradient descent.

  • Gradient descent updates parameters in direction of steepest descent
  • Stochastic Gradient Descent (SGD) uses mini-batches for efficiency
  • Learning rate controls step size
  • Negative sampling reduces computational complexity
Gradient Descent Update:
θ_new = θ_old - α ∇_θ J(θ)
Where α is the learning rate
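
A minimal sketch of this update rule applied to a one-parameter quadratic loss; the loss function, learning rate, and number of steps are illustrative assumptions.

```python
def loss(theta):            # J(θ) = (θ - 3)^2, an illustrative objective
    return (theta - 3.0) ** 2

def grad(theta):            # ∇_θ J(θ) = 2(θ - 3)
    return 2.0 * (theta - 3.0)

theta, alpha = 0.0, 0.1     # initial parameter and learning rate (assumed)
for _ in range(100):
    theta = theta - alpha * grad(theta)   # θ_new = θ_old - α ∇_θ J(θ)
print(round(theta, 4), round(loss(theta), 8))   # converges toward θ = 3
```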
Word2Vec Training Details

Efficient training techniques make Word2Vec practical for large corpora:

  • Negative Sampling: Sample k negative examples instead of computing the full softmax (see the sketch after this list)
  • Hierarchical Softmax: Use binary tree structure for efficient computation
  • Subsampling: Reduce frequency of common words
  • Context Window: Dynamic window size for varied context
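
The sketch below illustrates the skip-gram negative-sampling loss for one (center, context) pair: the true pair is pushed toward a high sigmoid score and k sampled "negative" words toward low scores. The vocabulary size, k, and the uniform negative sampler are illustrative assumptions (Word2Vec actually samples negatives from a smoothed unigram distribution).

```python
import numpy as np

rng = np.random.default_rng(0)
V, d, k = 50, 8, 5                       # vocab size, dimension, negatives per pair (assumed)
U = rng.normal(scale=0.1, size=(V, d))   # "outside" vectors
W = rng.normal(scale=0.1, size=(V, d))   # "center" vectors

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def neg_sampling_loss(center, context):
    negatives = rng.integers(0, V, size=k)            # uniform sampling, for illustration only
    pos = np.log(sigmoid(U[context] @ W[center]))     # log σ(u_o^T v_c) for the true pair
    neg = np.log(sigmoid(-U[negatives] @ W[center])).sum()   # Σ log σ(-u_k^T v_c) for negatives
    return -(pos + neg)                               # minimize the negative log-likelihood

print(neg_sampling_loss(center=7, context=3))
```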
GloVe: Global Vectors for Word Representation

GloVe combines global matrix factorization and local context window methods; a sketch of its weighted least-squares loss follows the list below.

GloVe Objective:
J = Σ_{i,j=1..V} f(X_ij) (w_i^T w̃_j + b_i + b̃_j - log X_ij)^2
Where X_ij is the co-occurrence count and f is a weighting function
  • Leverages global word-word co-occurrence statistics
  • Fast training and good performance with smaller corpora
  • Captures both semantic and syntactic patterns
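
A minimal sketch of the GloVe loss computed over the nonzero entries of a toy co-occurrence matrix. The matrix itself, x_max = 100, and the exponent α = 0.75 in the weighting function follow commonly used defaults and are assumptions here.

```python
import numpy as np

rng = np.random.default_rng(0)
V, d = 6, 4
X = rng.integers(0, 20, size=(V, V)).astype(float)   # toy co-occurrence counts (assumed)
W  = rng.normal(scale=0.1, size=(V, d))   # word vectors w_i
Wt = rng.normal(scale=0.1, size=(V, d))   # context vectors w̃_j
b, bt = np.zeros(V), np.zeros(V)          # biases b_i and b̃_j

def f(x, x_max=100.0, alpha=0.75):
    """Weighting function: caps the influence of very frequent co-occurrences."""
    return np.where(x < x_max, (x / x_max) ** alpha, 1.0)

def glove_loss():
    total = 0.0
    for i in range(V):
        for j in range(V):
            if X[i, j] > 0:               # only nonzero co-occurrences contribute
                diff = W[i] @ Wt[j] + b[i] + bt[j] - np.log(X[i, j])
                total += f(X[i, j]) * diff ** 2
    return total

print(glove_loss())
```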
Evaluating Word Vectors

Two main approaches to evaluate word embeddings:

  • Intrinsic Evaluation: Word similarity, analogy tasks (e.g., king:queen :: man:?)
  • Extrinsic Evaluation: Performance on downstream tasks (NER, sentiment analysis)
  • Intrinsic is faster but may not correlate with real-world performance
  • Extrinsic is more meaningful but computationally expensive
Language Modeling Fundamentals

Language models assign probabilities to sequences of words.

Language Model Objective:
P(w_1, w_2, ..., w_n) = P(w_1) P(w_2 | w_1) ... P(w_n | w_1, ..., w_{n-1})
  • Core task: predict next word given previous words
  • Applications: speech recognition, machine translation, text generation
  • Perplexity used as evaluation metric
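
A minimal sketch of how perplexity follows from this factorization: average the per-token negative log probabilities the model assigns and exponentiate. The probabilities here are made-up numbers standing in for a trained model's outputs.

```python
import numpy as np

# P(w_1) P(w_2 | w_1) ... P(w_n | w_1, ..., w_{n-1}) for one sentence,
# with each conditional probability taken from a hypothetical language model.
token_probs = [0.20, 0.05, 0.30, 0.10, 0.25]

log_prob = np.sum(np.log(token_probs))              # log P(w_1, ..., w_n)
perplexity = np.exp(-log_prob / len(token_probs))   # exp of the average negative log-likelihood
print(round(perplexity, 3))                         # lower is better
```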
N-gram vs Neural Language Models

Comparison of traditional and neural approaches:

  • N-gram Models: Use count-based statistics with Markov assumption
  • Neural Models: Learn distributed representations, better generalization
  • N-gram models suffer from data sparsity and the curse of dimensionality
  • Neural models handle longer contexts and unseen word combinations

Key Takeaways

  • Efficient training techniques like negative sampling enable Word2Vec at scale
  • GloVe combines global statistics with local context for better embeddings
  • Language models predict word sequences and are fundamental to NLP
  • Neural language models outperform traditional n-gram approaches

Lecture 3: Backpropagation and Neural Networks

Neural Network Fundamentals

Neural networks are composed of layers of interconnected neurons that transform inputs to outputs.

  • Input Layer: Receives raw data (e.g., word vectors)
  • Hidden Layers: Learn hierarchical representations
  • Output Layer: Produces final predictions
  • Activation Functions: Introduce non-linearity (sigmoid, tanh, ReLU)
Single Neuron:
h = f(w^T x + b)
Where f is the activation function, w the weight vector, and b the bias
Forward Propagation

Forward propagation computes the output of the network given an input; a NumPy sketch of the forward pass follows the list below.

Multi-layer Forward Pass:
h^(1) = f(W^(1) x + b^(1))
h^(2) = f(W^(2) h^(1) + b^(2))
ŷ = softmax(W^(3) h^(2) + b^(3))
  • Data flows from input to output layer
  • Each layer applies linear transformation followed by non-linearity
  • Final layer often uses softmax for classification
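
A minimal NumPy sketch of the two-hidden-layer forward pass written out above; the layer sizes and the tanh non-linearity are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_h1, d_h2, n_classes = 6, 8, 8, 3        # layer sizes (assumed)
W1, b1 = rng.normal(size=(d_h1, d_in)), np.zeros(d_h1)
W2, b2 = rng.normal(size=(d_h2, d_h1)), np.zeros(d_h2)
W3, b3 = rng.normal(size=(n_classes, d_h2)), np.zeros(n_classes)

def softmax(z):
    z = z - z.max()
    return np.exp(z) / np.exp(z).sum()

x  = rng.normal(size=d_in)
h1 = np.tanh(W1 @ x  + b1)        # h^(1) = f(W^(1) x + b^(1))
h2 = np.tanh(W2 @ h1 + b2)        # h^(2) = f(W^(2) h^(1) + b^(2))
y_hat = softmax(W3 @ h2 + b3)     # ŷ = softmax(W^(3) h^(2) + b^(3))
print(y_hat, y_hat.sum())         # class probabilities summing to 1
```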
Backpropagation Algorithm

Backpropagation efficiently computes gradients using the chain rule of calculus.

  • Compute loss at output layer
  • Propagate error backwards through network
  • Calculate gradient for each parameter
  • Update parameters using gradient descent
Chain Rule:
∂L/∂w = (∂L/∂ŷ)(∂ŷ/∂h)(∂h/∂w)
Gradients flow backwards through computational graph
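
A minimal sketch of backpropagation through a one-hidden-layer classifier with a softmax output and cross-entropy loss, applying the chain rule one layer at a time; the shapes, tanh activation, and single training example are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_h, n_classes = 4, 5, 3
W1, b1 = rng.normal(scale=0.1, size=(d_h, d_in)), np.zeros(d_h)
W2, b2 = rng.normal(scale=0.1, size=(n_classes, d_h)), np.zeros(n_classes)
x, y = rng.normal(size=d_in), 2                  # one example with gold class 2 (assumed)

# Forward pass
h = np.tanh(W1 @ x + b1)
scores = W2 @ h + b2
probs = np.exp(scores - scores.max()); probs /= probs.sum()
loss = -np.log(probs[y])                         # cross-entropy loss

# Backward pass: chain rule, layer by layer
d_scores = probs.copy(); d_scores[y] -= 1.0      # ∂L/∂scores for softmax + cross-entropy
dW2, db2 = np.outer(d_scores, h), d_scores       # ∂L/∂W2, ∂L/∂b2
d_h = W2.T @ d_scores                            # upstream gradient into the hidden layer
d_pre = d_h * (1.0 - h ** 2)                     # through tanh: local gradient is 1 - tanh^2
dW1, db1 = np.outer(d_pre, x), d_pre             # ∂L/∂W1, ∂L/∂b1
print(round(float(loss), 4), dW1.shape, dW2.shape)
```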
Gradient Computation

Computing gradients efficiently is crucial for training deep networks.

  • Local Gradients: Derivative of operation with respect to inputs
  • Upstream Gradients: Gradient flowing from layers above
  • Downstream Gradients: Product of local and upstream gradients
  • Computational graph helps visualize gradient flow
Matrix Calculus for Deep Learning

Matrix calculus enables efficient computation of gradients for vectorized operations; a numerical check of one of the identities follows the list below.

Key Matrix Derivatives:
∂(Wx)/∂x = W (the Jacobian of a linear map)
∂L/∂W = δ x^T for a scalar loss L, where δ = ∂L/∂(Wx)
∂(x^T W x)/∂x = (W + W^T) x
  • Jacobian matrix contains all partial derivatives
  • Chain rule extends to matrix operations
  • Vectorization improves computational efficiency
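
A quick numerical sanity check of the quadratic-form identity ∂(x^T W x)/∂x = (W + W^T) x using central finite differences; the matrix size is arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 4
W = rng.normal(size=(n, n))
x = rng.normal(size=n)

analytic = (W + W.T) @ x                 # closed-form gradient of f(x) = x^T W x

eps = 1e-6
numeric = np.zeros(n)
for i in range(n):                       # central finite differences, one coordinate at a time
    e = np.zeros(n); e[i] = eps
    numeric[i] = ((x + e) @ W @ (x + e) - (x - e) @ W @ (x - e)) / (2 * eps)

print(np.allclose(analytic, numeric, atol=1e-4))   # True
```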
Optimization Techniques

Advanced optimization methods improve training speed and convergence.

  • SGD: Stochastic Gradient Descent with mini-batches
  • Momentum: Accumulates velocity to accelerate convergence
  • Adam: Adaptive learning rates per parameter
  • Learning Rate Scheduling: Decay learning rate over time
Adam Update:
m_t = β_1 m_{t-1} + (1 - β_1) g_t
v_t = β_2 v_{t-1} + (1 - β_2) g_t^2
θ_t = θ_{t-1} - α · m̂_t / (√v̂_t + ε)
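
A minimal sketch of the Adam update on a toy quadratic loss, including the bias-corrected estimates m̂_t and v̂_t; the hyperparameters are the commonly used defaults and an assumption here.

```python
import numpy as np

def grad(theta):                    # gradient of the toy loss J(θ) = (θ - 3)^2
    return 2.0 * (theta - 3.0)

theta = 0.0
alpha, beta1, beta2, eps = 0.01, 0.9, 0.999, 1e-8   # typical defaults (assumed)
m, v = 0.0, 0.0
for t in range(1, 2001):
    g = grad(theta)
    m = beta1 * m + (1 - beta1) * g          # first-moment estimate m_t
    v = beta2 * v + (1 - beta2) * g ** 2     # second-moment estimate v_t
    m_hat = m / (1 - beta1 ** t)             # bias correction
    v_hat = v / (1 - beta2 ** t)
    theta -= alpha * m_hat / (np.sqrt(v_hat) + eps)
print(round(theta, 3))                       # approaches 3
```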

Key Takeaways

  • Backpropagation uses chain rule to compute gradients efficiently
  • Forward pass computes outputs, backward pass computes gradients
  • Matrix calculus enables vectorized gradient computation
  • Advanced optimizers like Adam improve training convergence

Lecture 4: Dependency Parsing

Syntactic Structure and Dependency Grammar

Dependency grammar represents sentence structure as directed graphs showing word relationships.

  • Words are nodes, dependencies are directed edges
  • Each word has exactly one head (except root)
  • Captures grammatical relations (subject, object, modifier)
  • More flexible than phrase structure grammar
  • Works well across different languages
Dependency Trees and Relations

Common dependency relations capture grammatical functions:

  • nsubj: Nominal subject
  • dobj: Direct object
  • amod: Adjectival modifier
  • prep: Prepositional modifier
  • det: Determiner
  • aux: Auxiliary verb

Example: "The cat sat on the mat"
sat → cat (nsubj), sat → on (prep), on → mat (pobj), cat → The (det), mat → the (det)

Transition-Based Parsing

Transition-based parsers use a sequence of actions to build dependency trees.

  • Configuration: (Stack, Buffer, Dependencies)
  • SHIFT: Move word from buffer to stack
  • LEFT-ARC: Add dependency from top to second-top, remove second-top
  • RIGHT-ARC: Add dependency from second-top to top, remove top
Arc-Standard Algorithm:
Initial: ([ROOT], [w_1, ..., w_n], ∅)
Terminal: ([ROOT], [], A), where A is the set of dependencies
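
A minimal sketch of the arc-standard transition system, driven here by a hand-written action sequence for "The cat sat"; a real parser would predict each action from the current configuration instead of following a fixed list.

```python
def parse(words, actions):
    """Arc-standard parsing: the configuration is (stack, buffer, arcs)."""
    stack, buffer, arcs = ["ROOT"], list(words), []
    for action in actions:
        if action == "SHIFT":            # move the next buffer word onto the stack
            stack.append(buffer.pop(0))
        elif action == "LEFT-ARC":       # top is head of second-from-top; pop the dependent
            head, dep = stack[-1], stack.pop(-2)
            arcs.append((head, dep))
        elif action == "RIGHT-ARC":      # second-from-top is head of top; pop the dependent
            dep, head = stack.pop(), stack[-1]
            arcs.append((head, dep))
    return stack, buffer, arcs

# Hand-written action sequence for "The cat sat" (a trained parser predicts these).
actions = ["SHIFT", "SHIFT", "LEFT-ARC", "SHIFT", "LEFT-ARC", "RIGHT-ARC"]
print(parse(["The", "cat", "sat"], actions))
# (['ROOT'], [], [('cat', 'The'), ('sat', 'cat'), ('ROOT', 'sat')])
```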
Neural Dependency Parsers

Neural networks learn to predict parsing actions from parser configurations.

  • Extract features from stack, buffer, and existing dependencies
  • Use word embeddings, POS embeddings, dependency label embeddings
  • Feed-forward network predicts next action
  • Much faster and more accurate than traditional parsers
Neural Parser Architecture:
x = [e_w1; e_w2; ...; e_t1; e_t2; ...; e_l1; ...]
h = ReLU(W_1 x + b_1)
p = softmax(W_2 h + b_2)
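
A minimal sketch of this scoring step: concatenate word, POS-tag, and dependency-label embeddings extracted from the current configuration, apply one ReLU hidden layer, and take a softmax over the transition actions. All sizes and feature counts are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
d_word, d_pos, d_label = 50, 12, 12                       # embedding sizes (assumed)
n_word_feats, n_pos_feats, n_label_feats = 18, 18, 12     # feature slots (assumed)
d_hidden, n_actions = 200, 3                              # SHIFT, LEFT-ARC, RIGHT-ARC

d_in = n_word_feats * d_word + n_pos_feats * d_pos + n_label_feats * d_label
W1, b1 = rng.normal(scale=0.01, size=(d_hidden, d_in)), np.zeros(d_hidden)
W2, b2 = rng.normal(scale=0.01, size=(n_actions, d_hidden)), np.zeros(n_actions)

def predict_action(word_embs, pos_embs, label_embs):
    x = np.concatenate([word_embs.ravel(), pos_embs.ravel(), label_embs.ravel()])
    h = np.maximum(0.0, W1 @ x + b1)      # ReLU hidden layer
    scores = W2 @ h + b2
    probs = np.exp(scores - scores.max()); probs /= probs.sum()
    return probs                          # distribution over parser actions

# Random feature embeddings standing in for a real parser configuration.
probs = predict_action(rng.normal(size=(n_word_feats, d_word)),
                       rng.normal(size=(n_pos_feats, d_pos)),
                       rng.normal(size=(n_label_feats, d_label)))
print(probs)
```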
Evaluation Metrics

Parser performance measured by attachment scores:

  • UAS (Unlabeled Attachment Score): % of words with correct head
  • LAS (Labeled Attachment Score): % of words with correct head and label
  • LAS is stricter and more informative
  • Modern neural parsers achieve >95% UAS on English
Metrics:
UAS = (# correct heads) / (# total words)
LAS = (# correct head+label) / (# total words)
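
A minimal sketch of computing UAS and LAS from gold and predicted (head, label) pairs; the tiny example parse is made up.

```python
# Each entry is (head_index, dependency_label) for one word; head index 0 means ROOT.
gold = [(2, "det"), (3, "nsubj"), (0, "root"), (3, "prep"), (6, "det"), (4, "pobj")]
pred = [(2, "det"), (3, "nsubj"), (0, "root"), (2, "prep"), (6, "det"), (4, "dobj")]

correct_heads = sum(g[0] == p[0] for g, p in zip(gold, pred))   # head only
correct_both = sum(g == p for g, p in zip(gold, pred))          # head and label

uas = correct_heads / len(gold)
las = correct_both / len(gold)
print(f"UAS={uas:.2f}  LAS={las:.2f}")   # UAS=0.83  LAS=0.67
```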

Key Takeaways

  • Dependency parsing captures grammatical relationships between words
  • Transition-based parsing builds trees through sequence of actions
  • Neural parsers learn to predict actions from parser state
  • UAS and LAS measure parser accuracy

Lecture 5: Recurrent Neural Networks

RNN Architecture and Motivation

Recurrent Neural Networks process sequential data by maintaining hidden state across time steps.

  • Handle variable-length sequences
  • Share parameters across time steps
  • Maintain memory of previous inputs
  • Natural fit for language modeling, machine translation, speech
RNN Update Equations:
h_t = tanh(W_hh h_{t-1} + W_xh x_t + b_h)
y_t = W_hy h_t + b_y
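
A minimal sketch of unrolling this update over a short input sequence; the sizes, random weights, and sequence length are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_hidden, d_out = 5, 8, 3
W_hh = rng.normal(scale=0.1, size=(d_hidden, d_hidden))
W_xh = rng.normal(scale=0.1, size=(d_hidden, d_in))
W_hy = rng.normal(scale=0.1, size=(d_out, d_hidden))
b_h, b_y = np.zeros(d_hidden), np.zeros(d_out)

xs = rng.normal(size=(4, d_in))      # a length-4 input sequence (assumed)
h = np.zeros(d_hidden)               # h_0
for x_t in xs:
    h = np.tanh(W_hh @ h + W_xh @ x_t + b_h)   # h_t = tanh(W_hh h_{t-1} + W_xh x_t + b_h)
    y = W_hy @ h + b_y                         # y_t = W_hy h_t + b_y
print(h.shape, y.shape)
```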
Vanishing and Exploding Gradient Problems

Training RNNs on long sequences faces gradient instability issues.

  • Vanishing Gradients: Gradients shrink exponentially, preventing learning of long-term dependencies
  • Exploding Gradients: Gradients grow exponentially, causing numerical instability
  • Caused by repeated multiplication by the same recurrent weight matrix
  • Solutions: gradient clipping, better architectures (LSTM, GRU)
Gradient Flow:
∂h_t/∂h_0 = Π_{k=1..t} W_hh^T · diag(tanh'(·))
The product of many such matrices causes exponential growth or decay
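
A minimal sketch of gradient clipping by global norm, the standard remedy for exploding gradients mentioned above; the threshold of 5.0 is an illustrative choice.

```python
import numpy as np

def clip_by_global_norm(grads, max_norm=5.0):
    """Rescale a list of gradient arrays so their combined L2 norm is at most max_norm."""
    total_norm = np.sqrt(sum(float(np.sum(g ** 2)) for g in grads))
    if total_norm > max_norm:
        grads = [g * (max_norm / total_norm) for g in grads]
    return grads, total_norm

rng = np.random.default_rng(0)
grads = [rng.normal(scale=10.0, size=(4, 4)), rng.normal(scale=10.0, size=(4,))]
clipped, norm_before = clip_by_global_norm(grads)
norm_after = np.sqrt(sum(float(np.sum(g ** 2)) for g in clipped))
print(round(norm_before, 2), round(norm_after, 2))   # the clipped norm is at most 5.0
```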
Long Short-Term Memory (LSTM) Networks

LSTMs use gating mechanisms to control information flow and maintain long-term memory.

  • Forget Gate: Decides what information to discard from cell state
  • Input Gate: Controls what new information to add
  • Output Gate: Determines what to output based on cell state
  • Cell state acts as highway for gradient flow
LSTM Equations:
f_t = σ(W_f [h_{t-1}, x_t] + b_f) (forget gate)
i_t = σ(W_i [h_{t-1}, x_t] + b_i) (input gate)
c̃_t = tanh(W_c [h_{t-1}, x_t] + b_c) (candidate)
c_t = f_t ⊙ c_{t-1} + i_t ⊙ c̃_t (cell state)
o_t = σ(W_o [h_{t-1}, x_t] + b_o) (output gate)
h_t = o_t ⊙ tanh(c_t) (hidden state)
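
A minimal sketch of a single LSTM step following these equations, with each weight matrix acting on the concatenation [h_{t-1}, x_t]; the sizes and random initialization are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_hidden = 5, 8
def new_W():                                  # one gate's weight matrix over [h_{t-1}, x_t]
    return rng.normal(scale=0.1, size=(d_hidden, d_hidden + d_in))
W_f, W_i, W_c, W_o = new_W(), new_W(), new_W(), new_W()
b_f, b_i, b_c, b_o = (np.zeros(d_hidden) for _ in range(4))

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(h_prev, c_prev, x_t):
    z = np.concatenate([h_prev, x_t])         # [h_{t-1}, x_t]
    f = sigmoid(W_f @ z + b_f)                # forget gate
    i = sigmoid(W_i @ z + b_i)                # input gate
    c_tilde = np.tanh(W_c @ z + b_c)          # candidate cell state
    c = f * c_prev + i * c_tilde              # new cell state
    o = sigmoid(W_o @ z + b_o)                # output gate
    h = o * np.tanh(c)                        # new hidden state
    return h, c

h, c = np.zeros(d_hidden), np.zeros(d_hidden)
for x_t in rng.normal(size=(4, d_in)):        # unroll over a short sequence
    h, c = lstm_step(h, c, x_t)
print(h.shape, c.shape)
```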
Gated Recurrent Units (GRU)

GRUs simplify LSTM architecture while maintaining similar performance.

  • Combines the forget and input gates into a single update gate
  • Merges cell state and hidden state
  • Fewer parameters than LSTM
  • Often performs comparably to LSTM with faster training
GRU Equations:
z_t = σ(W_z [h_{t-1}, x_t]) (update gate)
r_t = σ(W_r [h_{t-1}, x_t]) (reset gate)
h̃_t = tanh(W [r_t ⊙ h_{t-1}, x_t]) (candidate)
h_t = (1 - z_t) ⊙ h_{t-1} + z_t ⊙ h̃_t
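
A matching sketch of one GRU step; note the single update gate in place of the LSTM's separate forget and input gates. Sizes and weights are assumed as before.

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_hidden = 5, 8
def new_W():
    return rng.normal(scale=0.1, size=(d_hidden, d_hidden + d_in))
W_z, W_r, W_h = new_W(), new_W(), new_W()

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_step(h_prev, x_t):
    z = sigmoid(W_z @ np.concatenate([h_prev, x_t]))            # update gate
    r = sigmoid(W_r @ np.concatenate([h_prev, x_t]))            # reset gate
    h_tilde = np.tanh(W_h @ np.concatenate([r * h_prev, x_t]))  # candidate state
    return (1 - z) * h_prev + z * h_tilde                       # mix old and candidate states

h = np.zeros(d_hidden)
for x_t in rng.normal(size=(4, d_in)):
    h = gru_step(h, x_t)
print(h.shape)
```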
Bidirectional RNNs

Bidirectional RNNs process sequences in both forward and backward directions.

  • Forward RNN processes left-to-right
  • Backward RNN processes right-to-left
  • Concatenate forward and backward hidden states
  • Captures context from both past and future
  • Useful for tasks where full sequence is available (e.g., NER, POS tagging)
Bidirectional Output:
h_t = [→h_t; ←h_t]
Where →h_t is the forward and ←h_t the backward hidden state
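
A minimal sketch of a bidirectional pass using a simple tanh RNN cell in each direction and concatenating the per-timestep states; sizes and weights are assumed.

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_hidden, T = 5, 6, 4
W_f = rng.normal(scale=0.1, size=(d_hidden, d_hidden + d_in))   # forward-direction weights
W_b = rng.normal(scale=0.1, size=(d_hidden, d_hidden + d_in))   # backward-direction weights

def run_rnn(xs, W):
    h, states = np.zeros(d_hidden), []
    for x_t in xs:
        h = np.tanh(W @ np.concatenate([h, x_t]))
        states.append(h)
    return states

xs = rng.normal(size=(T, d_in))
fwd = run_rnn(xs, W_f)                     # →h_1 ... →h_T (left to right)
bwd = run_rnn(xs[::-1], W_b)[::-1]         # ←h_1 ... ←h_T (right to left, re-aligned)
h_concat = [np.concatenate([f, b]) for f, b in zip(fwd, bwd)]   # h_t = [→h_t; ←h_t]
print(len(h_concat), h_concat[0].shape)    # T states, each of size 2 * d_hidden
```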
Applications in NLP

RNNs and their variants power many NLP applications:

  • Language Modeling: Predict next word in sequence
  • Machine Translation: Encoder-decoder architectures
  • Sentiment Analysis: Classify text sentiment
  • Named Entity Recognition: Tag entities in text
  • Speech Recognition: Convert audio to text
  • Text Generation: Generate coherent text sequences

Key Takeaways

  • RNNs process sequences by maintaining hidden state over time
  • Vanilla RNNs suffer from vanishing/exploding gradients
  • LSTMs and GRUs use gates to control information flow
  • Bidirectional RNNs capture context from both directions
  • RNN variants enable many sequence-to-sequence applications

Practice Quizzes

Test your understanding with these practice questions.

Interactive Flashcards

Flashcards covering the key concepts and definitions from the lectures.

Concept Mind Map

Visual representation of key NLP concepts and their relationships across all lectures.

Quick Reference Formulas

Word2Vec Skip-gram Objective
J(θ) = (1/T) Σ_{t=1..T} Σ_{-m ≤ j ≤ m, j ≠ 0} log P(w_{t+j} | w_t)
Softmax Function
P(w_j | w_i) = exp(u_j^T v_i) / Σ_{k=1..V} exp(u_k^T v_i)
GloVe Objective
J = Σ_{i,j=1..V} f(X_ij) (w_i^T w̃_j + b_i + b̃_j - log X_ij)^2
Gradient Descent Update
θ_new = θ_old - α ∇_θ J(θ)
Chain Rule (Backpropagation)
∂L/∂w = (∂L/∂ŷ)(∂ŷ/∂h)(∂h/∂w)
RNN Hidden State Update
h_t = tanh(W_hh h_{t-1} + W_xh x_t + b_h)
LSTM Cell State Update
c_t = f_t ⊙ c_{t-1} + i_t ⊙ c̃_t
Cross-Entropy Loss
L = -Σ_i y_i log(ŷ_i)

Glossary of Terms

Word Embedding: Dense vector representation of words that captures semantic meaning.
Distributional Semantics: The principle that words appearing in similar contexts have similar meanings.
Skip-gram: Word2Vec model that predicts context words given a center word.
CBOW (Continuous Bag of Words): Word2Vec model that predicts center word from context words.
Negative Sampling: Training technique that samples negative examples instead of computing full softmax.
Backpropagation: Algorithm for computing gradients in neural networks using chain rule.
Gradient Descent: Optimization algorithm that iteratively updates parameters in direction of steepest descent.
Vanishing Gradient: Problem where gradients become exponentially small, preventing learning.
Dependency Parsing: Task of analyzing grammatical structure by identifying word dependencies.
LSTM (Long Short-Term Memory): RNN variant with gating mechanisms for long-term dependencies.
GRU (Gated Recurrent Unit): Simplified LSTM with fewer parameters but similar performance.
Perplexity: Evaluation metric for language models; lower is better.
UAS (Unlabeled Attachment Score): Percentage of words with correct head in dependency parsing.
LAS (Labeled Attachment Score): Percentage of words with correct head and dependency label.
Attention Mechanism: Technique allowing models to focus on relevant parts of input sequence.

Study Resources

Video Lectures

  • Lecture 1: Intro and Word Vectors
  • Lecture 2: Word Vectors and Language Models
  • Lecture 3: Backpropagation and Neural Networks
  • Lecture 4: Dependency Parsing
  • Lecture 5: Recurrent Neural Networks

Additional Materials

Stanford AI Programs

Study Tips

  • Review lecture videos and take detailed notes
  • Practice implementing concepts in code
  • Work through problem sets and assignments
  • Use flashcards for memorizing key definitions and formulas
  • Test yourself regularly with the practice quizzes
  • Join study groups to discuss challenging concepts
  • Apply concepts to real-world NLP projects