# Foundations of deep learning

September 29, 2023

- resurgence in neural networks
  - driven by data, hardware, and software

- architecture
  - perceptron
    - building block of neural networks
  - multi-output perceptrons
  - linked via dense layers (all inputs connected to all outputs) and hidden layers

- perceptron
  - algorithm
    - forward pass / propagation
      - input values (including bias)
      - weights
      - weighted sum
      - activation function
        - sigmoid, hyperbolic tangent, rectified linear unit (ReLU)
        - introduces non-linearities into the network, letting it approximate arbitrarily complex functions
      - output value
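The forward pass above can be sketched in plain Python (function and variable names are illustrative, not from the lecture); here a weighted sum plus bias is passed through a sigmoid activation:

```python
import math

def perceptron_forward(inputs, weights, bias):
    """Single-perceptron forward pass: weighted sum of inputs plus bias,
    squashed to (0, 1) by a sigmoid activation."""
    z = bias + sum(x * w for x, w in zip(inputs, weights))
    return 1.0 / (1.0 + math.exp(-z))  # sigmoid activation

# example: two inputs, two weights, one bias
y = perceptron_forward([1.0, 2.0], [0.5, -0.25], bias=0.1)
```

Swapping the last line of the function for `math.tanh(z)` or `max(0.0, z)` gives the tanh and ReLU variants.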

- objective function / loss function
  - measures the cost incurred from incorrect predictions
  - types
    - empirical loss (total loss over the dataset)
    - binary cross-entropy loss (for models that output probabilities between 0 and 1)
    - mean squared error loss (for regression models that output continuous real numbers)
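The two named losses, averaged over a dataset as in the empirical loss, can be sketched as (a minimal plain-Python version, not the lecture's code):

```python
import math

def binary_cross_entropy(y_true, y_pred):
    """Mean binary cross-entropy; y_pred are probabilities in (0, 1)."""
    return -sum(t * math.log(p) + (1 - t) * math.log(1 - p)
                for t, p in zip(y_true, y_pred)) / len(y_true)

def mean_squared_error(y_true, y_pred):
    """Mean squared error for continuous-valued predictions."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)
```

Both are averages over examples, i.e. empirical losses; perfect predictions drive each toward 0.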
- backward pass / backpropagation
  - computing gradients via the chain rule
  - how does a small change in one weight affect the final loss?
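A one-neuron sketch of the chain rule (names and the tiny setup are illustrative): the loss is decomposed as L(a(z(w))), and the gradient is the product of the local derivatives.

```python
import math

def loss_and_grad(x, w, y):
    """Chain rule on one neuron: z = w*x, a = sigmoid(z), L = (a - y)^2.
    dL/dw = dL/da * da/dz * dz/dw."""
    z = w * x
    a = 1.0 / (1.0 + math.exp(-z))
    L = (a - y) ** 2
    dL_da = 2 * (a - y)
    da_dz = a * (1 - a)  # derivative of the sigmoid
    dz_dw = x
    return L, dL_da * da_dz * dz_dw
```

A finite-difference check confirms the analytic gradient matches "how much a small change in w moves L".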

- training
  - goal is to find the network weights that achieve the lowest loss (i.e. the loss is a function of the weights)
  - pick initial weights (usually random, though there are better initialisation schemes)
  - compute the gradient of the loss
  - update the weights (take a small step in the opposite direction of the gradient)
  - repeat until convergence
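The loop above is gradient descent; a minimal sketch on a toy one-weight loss (the function and the example loss are assumptions for illustration):

```python
def gradient_descent(grad, w0, lr=0.1, steps=100):
    """Start from w0 and repeatedly step opposite the gradient."""
    w = w0
    for _ in range(steps):
        w -= lr * grad(w)  # small step against the gradient
    return w

# minimise L(w) = (w - 3)^2, whose gradient is 2*(w - 3)
w_star = gradient_descent(lambda w: 2 * (w - 3), w0=0.0)
```

For this convex loss the loop converges to the minimum at w = 3; a real network does the same update per weight, with the gradients coming from backpropagation.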

- in practice
  - optimisation
    - learning rate
      - too small: converges slowly, gets stuck in local minima
      - too big: overshoots, becomes unstable and diverges
    - setting the right rate
      - iterate through different rates
      - adaptive learning rate, based on:
        - how large the gradient is
        - how fast learning is happening
        - the size of particular weights
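The too-small / too-big behaviour can be seen directly on L(w) = w², where the update is w ← w − lr·2w (a toy illustration, not from the lecture):

```python
def run(lr, steps=50, w0=10.0):
    """Gradient descent on L(w) = w^2 (gradient 2w) at a fixed rate."""
    w = w0
    for _ in range(steps):
        w -= lr * 2 * w
    return w

small = run(lr=0.001)  # barely moves: still far from the minimum at 0
good = run(lr=0.1)     # converges close to 0
big = run(lr=1.5)      # |1 - 2*lr| > 1, so each step overshoots and diverges
```

Each step multiplies w by (1 − 2·lr), which makes the convergence/divergence condition explicit for this loss.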

- gradient descent optimisers
  - SGD, Adam, Adadelta, Adagrad, RMSProp

- batches and mini-batches
  - more accurate estimate of the gradient
  - smoother convergence
  - allows for a larger learning rate
  - faster training (computation can be parallelised for speed increases)
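Mini-batching itself is just shuffling and slicing the dataset; a plain-Python sketch (the helper name is an assumption):

```python
import random

def minibatches(data, batch_size):
    """Shuffle the dataset and yield it in mini-batches; a gradient averaged
    over a batch is a less noisy estimate than one from a single example."""
    data = list(data)
    random.shuffle(data)
    for i in range(0, len(data), batch_size):
        yield data[i:i + batch_size]
```

In training, each batch produces one gradient estimate and one weight update; the last batch may be smaller when the batch size doesn't divide the dataset evenly.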

- regularisation
  - technique to constrain the optimisation problem and discourage complex models
  - improves generalisation of the model on unseen data
  - underfitting
    - model is too simple and doesn't learn the data
  - overfitting
    - model is too complex, has extra parameters, does not generalise well
  - techniques
    - dropout
      - during training, randomly set some activations to 0
      - forces the network not to rely on any one node
    - early stopping
      - stop training before the model starts overfitting
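Dropout on one layer's activations can be sketched as follows (the inverted-dropout scaling by 1/(1−p) is a common convention, assumed here rather than taken from the lecture):

```python
import random

def dropout(activations, p=0.5):
    """Inverted dropout: zero each activation with probability p during
    training, scaling survivors by 1/(1-p) to preserve the expected value."""
    return [0.0 if random.random() < p else a / (1 - p) for a in activations]
```

At test time dropout is disabled and the layer is used as-is; the scaling during training keeps the expected activation magnitude the same in both modes.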


References:

- MIT Introduction to Deep Learning | 6.S191 | Foundations of Deep Learning