10 lectures

This series consists of 10 sessions, each lasting 2 hours, focused on the mathematics of machine learning. It outlines the primary concepts without delving into the intricacies of the proofs. Clicking on the title of each session provides access to the transcript. Additionally, there are basic notes available to guide you through the structure and progression of the content.

**Content:**

- Introduction and motivation
- Gradients, Jacobians, Hessians
- Gradient descent and acceleration
- Stochastic Gradient Descent (SGD)
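
As a small illustration of the gradient descent topic above, here is a minimal sketch of the fixed-step iteration on a toy quadratic. The function names, the quadratic, and the step size are assumptions chosen for this example and are not taken from the course notebooks.

```python
def gradient_descent(grad, x0, step, n_iters):
    """Iterate x <- x - step * grad(x) and return the final point."""
    x = list(x0)
    for _ in range(n_iters):
        g = grad(x)
        x = [xi - step * gi for xi, gi in zip(x, g)]
    return x


# Illustrative quadratic f(x) = 0.5 * (x0^2 + 10 * x1^2) - x0 - x1.
# Its gradient is (x0 - 1, 10 * x1 - 1) and its minimizer is (1, 0.1).
def grad_f(x):
    return [x[0] - 1.0, 10.0 * x[1] - 1.0]


# The gradient is 10-Lipschitz here, so step = 1 / 10 is a safe choice.
x_star = gradient_descent(grad_f, x0=[0.0, 0.0], step=0.1, n_iters=500)
```

The poorly conditioned coordinate (`x1`) dictates the step size, while the well conditioned one (`x0`) converges slowly; this is the behavior that acceleration aims to improve.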

**Materials:**

- Notebook on Regression
- Notebook on Classification
- My course notes: Optimization for Machine Learning
- Exercises Sheet

**Bibliography:**

- Convex Optimization, by Boyd and Vandenberghe
- Introduction to Nonlinear Optimization: Theory, Algorithms, and Applications, by Amir Beck

**Content:**

- Proofs of gradient descent and acceleration
- Linear models and regularization
- Ridge versus Lasso
- ISTA Algorithm
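
To make the ISTA item above concrete, the following is a minimal sketch for the Lasso problem min_x 0.5 * ||A x - y||^2 + lam * ||x||_1, written with plain Python lists. The helper names and the toy data are assumptions for illustration; the step size is assumed to be at most 1 / L, where L is the largest eigenvalue of A^T A.

```python
def soft_threshold(v, t):
    """Proximal operator of t * |.| applied to a scalar."""
    if v > t:
        return v - t
    if v < -t:
        return v + t
    return 0.0


def ista(A, y, lam, step, n_iters):
    """Sketch of ISTA for min_x 0.5 * ||A x - y||^2 + lam * ||x||_1."""
    n_rows, n_cols = len(A), len(A[0])
    x = [0.0] * n_cols
    for _ in range(n_iters):
        # Residual r = A x - y.
        r = [sum(A[i][j] * x[j] for j in range(n_cols)) - y[i]
             for i in range(n_rows)]
        # Gradient of the smooth part: A^T r.
        g = [sum(A[i][j] * r[i] for i in range(n_rows))
             for j in range(n_cols)]
        # Gradient step followed by soft thresholding.
        x = [soft_threshold(x[j] - step * g[j], step * lam)
             for j in range(n_cols)]
    return x


# Toy problem with A = identity, where the Lasso minimizer is the
# soft thresholding of y at level lam, that is (2, 0).
A = [[1.0, 0.0], [0.0, 1.0]]
y = [3.0, 0.5]
x_hat = ista(A, y, lam=1.0, step=0.5, n_iters=100)
```

The second coordinate is set exactly to zero, which is the sparsity effect that distinguishes Lasso from ridge regression.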

**Materials:**

- Notebook on Linear Regression (specifically the Lasso part)
- Notebook on Interior Point Methods
- My course notes: The Mathematical Tours of Signal Processing

**Bibliography:**

- Course Notes on Convexity by Vincent Duval
- Introduction to Nonlinear Optimization: Theory, Algorithms, and Applications, by Amir Beck

**Content:**

- Examples of non-smooth functionals (Lasso, TV regularization, constraints)
- Subgradient and proximal operators
- Forward-backward splitting, connection with FISTA
- ADMM, Douglas-Rachford (DR), Primal-Dual
- Compressive sensing theory

**Materials:**

- My course notes: The Mathematical Tours of Signal Processing
- Notebook on Douglas-Rachford Proximal Method
- Proximal Operators Repository (including Python code)
- Non-Smooth Optimization Slides
- Compressed Sensing Slides

**Bibliography:**

- A Mathematical Introduction to Compressive Sensing, by Simon Foucart and Holger Rauhut (advanced)
- Convex Optimization, by Boyd and Vandenberghe
- Proximal Algorithms, by N. Parikh and S. Boyd

**Content:**

- Transition from ridge regression to kernels
- Multilayer Perceptron (MLP)
- Convolutional Neural Networks (CNN)
- ResNet architecture
- Transformer models

**Materials:**

- Slides on deep learning
- My course notes on Optimization for Machine Learning
- Notebook on Multilayer Perceptron and Autograd

**Bibliography:**

- The Elements of Statistical Learning, by Trevor Hastie, Robert Tibshirani, and Jerome H. Friedman
- Machine Learning: A Probabilistic Perspective, by Kevin Patrick Murphy (covers the theory of ML)

**Content:**

- Review of MLP and its variants (CNN, ResNet)
- Theoretical framework of two-layer MLPs
- Gradient and Jacobians in neural networks
- Introduction to backpropagation

**Materials:**

- Slides on deep learning
- Slides on automatic differentiation
- My course notes on Optimization for Machine Learning
- Notebook on deep learning
- Notebook on texture synthesis with deep networks

**Content:**

- Recap on Gradient and Jacobian
- Forward and reverse mode automatic differentiation
- Introduction to PyTorch
- The adjoint method in computational mathematics
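
The reverse-mode item above can be illustrated with a tiny scalar autodiff sketch. The `Var` class, `sin`, and `backward` are hypothetical names written for this example; they are not the PyTorch API and only mimic the idea of recording local derivatives during the forward pass and accumulating adjoints in a backward sweep.

```python
import math


class Var:
    """Scalar node in a computation graph, with a value and an adjoint."""

    def __init__(self, value, parents=()):
        self.value = value
        self.parents = parents  # pairs (parent_node, local_derivative)
        self.grad = 0.0

    def __add__(self, other):
        return Var(self.value + other.value, ((self, 1.0), (other, 1.0)))

    def __mul__(self, other):
        return Var(self.value * other.value,
                   ((self, other.value), (other, self.value)))


def sin(x):
    return Var(math.sin(x.value), ((x, math.cos(x.value)),))


def backward(output):
    """Reverse sweep: propagate adjoints in reverse topological order."""
    order, seen = [], set()

    def visit(node):
        if id(node) not in seen:
            seen.add(id(node))
            for parent, _ in node.parents:
                visit(parent)
            order.append(node)

    visit(output)
    output.grad = 1.0
    for node in reversed(order):
        for parent, local in node.parents:
            parent.grad += local * node.grad


x, y = Var(2.0), Var(3.0)
z = x * y + sin(x)
backward(z)
# Analytically, dz/dx = y + cos(x) and dz/dy = x.
```

A single backward sweep yields the derivatives with respect to all inputs, which is why reverse mode suits functions with many inputs and a scalar output such as a training loss.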

**Materials:**

- Slides on automatic differentiation
- My course notes on Optimization for Machine Learning
- Code example: Multilayer perceptron and autograd

**Content:**

- Refresher on Stochastic Gradient Descent (SGD)
- Introduction to Langevin dynamics
- Overview of diffusion models
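
As a rough illustration of the Langevin dynamics item above, here is a minimal sketch of the unadjusted Langevin iteration x <- x - step * V'(x) + sqrt(2 * step) * noise, run as independent chains. The function name, the Gaussian target, and all parameter values are assumptions made for this example.

```python
import math
import random


def langevin_samples(grad_potential, x0, step, n_steps, n_chains, seed=0):
    """Run independent unadjusted Langevin chains; return final states."""
    rng = random.Random(seed)
    noise_scale = math.sqrt(2.0 * step)
    samples = []
    for _ in range(n_chains):
        x = x0
        for _ in range(n_steps):
            # Gradient step on the potential plus Gaussian noise.
            x = x - step * grad_potential(x) + noise_scale * rng.gauss(0.0, 1.0)
        samples.append(x)
    return samples


# Illustrative target: standard Gaussian, with potential V(x) = x^2 / 2,
# so that V'(x) = x.
samples = langevin_samples(lambda x: x, x0=3.0, step=0.05,
                           n_steps=200, n_chains=2000)
mean = sum(samples) / len(samples)
var = sum((s - mean) ** 2 for s in samples) / len(samples)
```

Without the noise term this is plain gradient descent on the potential; with it, the chains forget their starting point and their empirical mean and variance approach those of the target, up to a small bias caused by the finite step size.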

**Materials:**

- Numerical tour on diffusion models
- Course notes on Diffusion Models

**Content:**

- Overview of different generative model concepts
- Introduction to generative models (VAE, GANs, U-Net, diffusion)
- Semi-supervised learning and next token prediction
- Tokenizers
- Transformer architectures, Flash attention
- State space models

**Bibliography:**

- Andrej Karpathy’s video on tokenization
- Byte Pair Encoding (Wikipedia)
- Online tokenizer demo by OpenAI
- Rotary Position Embedding paper
- Codes: Flash attention, xFormers, Triton
- Theory paper on Flash Attention
- Mamba paper, Blog on Mamba (SSM), Parallel Prefix Sum algorithm

**Content:**

- Understanding generative models as density fitting techniques
- Basics of Maximum Likelihood Estimation and f-divergences
- Gaussian mixtures and the Expectation-Maximization algorithm
- Variational Autoencoders (VAE)
- Introduction to Normalizing Flows
- Generative Adversarial Networks (GANs), Wasserstein GANs (WGANs)
- Diffusion Models

**Content:**

- Introduction to Monge and Kantorovich formulations
- The Sinkhorn algorithm
- Training of generative models
- Duality and Wasserstein GANs
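
The Sinkhorn algorithm listed above can be sketched in a few lines for discrete entropic optimal transport. This version uses plain Python lists and is not the POT toolbox API; the function name, the toy marginals, the cost matrix, and the regularization value `eps` are assumptions for illustration.

```python
import math


def sinkhorn(a, b, C, eps, n_iters):
    """Sketch of Sinkhorn iterations for entropic optimal transport.

    a, b: marginal weights (each summing to 1); C: cost matrix.
    Returns the coupling P = diag(u) K diag(v) with K = exp(-C / eps).
    """
    n, m = len(a), len(b)
    K = [[math.exp(-C[i][j] / eps) for j in range(m)] for i in range(n)]
    u, v = [1.0] * n, [1.0] * m
    for _ in range(n_iters):
        # Rescale rows to match a, then columns to match b.
        u = [a[i] / sum(K[i][j] * v[j] for j in range(m)) for i in range(n)]
        v = [b[j] / sum(K[i][j] * u[i] for i in range(n)) for j in range(m)]
    return [[u[i] * K[i][j] * v[j] for j in range(m)] for i in range(n)]


a = [0.5, 0.5]
b = [0.25, 0.75]
C = [[0.0, 1.0], [1.0, 0.0]]
P = sinkhorn(a, b, C, eps=0.5, n_iters=200)
row_sums = [sum(row) for row in P]
col_sums = [sum(P[i][j] for i in range(2)) for j in range(2)]
```

After convergence, the row and column sums of `P` match `a` and `b`. Smaller values of `eps` bring the coupling closer to the unregularized Kantorovich solution but slow convergence and can cause numerical underflow in `K`, which is usually handled with log-domain updates.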

**Materials:**

- Slides on Optimal Transport
- Notebook on Linear Programming for Optimal Transport
- Notebook on the Sinkhorn algorithm

**Bibliography:**

- Computational Optimal Transport, by Gabriel Peyré and Marco Cuturi
- Optimal Transport for Applied Mathematicians, by Filippo Santambrogio (advanced)
- Python POT (Python Optimal Transport) toolbox