Foundations of deep learning

September 29, 2023

  • resurgence in neural networks
    • data, hardware, software
  • architecture
    • perceptron
      • building block of neural networks
      • includes multi-output perceptrons
      • stacked into dense layers (all inputs connected to all outputs); layers between input and output are hidden layers
  • algorithm
    • forward pass / propagation (see the forward-pass sketch after this outline)
      • input values (including bias)
      • weights
      • sum
      • activation function
        • sigmoid, hyperbolic tangent, rectified linear unit
        • introduce non-linearities into the network, approximate arbitrarily complex functions
      • output value
    • objective function / loss function (see the loss-function sketch after this outline)
      • types
        • empirical loss (total loss over the dataset)
        • binary cross entropy loss (models that output probabilities between 0 and 1)
        • mean squared error loss (regression models that output continuous real numbers)
      • measures the cost incurred from incorrect predictions
    • backward pass / backpropagation
      • computing gradients via the chain rule
      • how does a small change in one weight affect the final loss
  • training
    • goal is to find the network weights that achieve the lowest loss, i.e. minimise the loss as a function of the weights (see the training-loop sketch after this outline)
    • pick initial weights (usually random, but there are optimisations that can be made)
    • compute gradient
    • update weights (take small step in opposite direction of gradient)
    • repeat until convergence
  • in practice
    • optimisation
      • learning rate
        • too small (converges slowly, gets stuck in local minima)
        • too big (overshoots, becomes unstable and diverges)
        • setting the right rate
          • iterate through different rates
          • adaptive learning rate
            • how large is the gradient
            • how fast is learning happening
            • size of particular weights
      • gradient descent optimisers (see the optimiser update sketch after this outline)
        • SGD, Adam, Adadelta, Adagrad, RMSProp
    • batches and mini-batches (see the mini-batch sketch after this outline)
      • more accurate estimate of gradient
      • smoother convergence
      • allows for larger learning rate
      • faster training (parallelize computation, achieve speed increases)
    • regularisation
      • technique to constrain optimisation problem to discourage complex models
      • improve generalisation of model on unseen data
        • underfitting
          • model is too simple to capture the underlying structure of the data
        • overfitting
          • model is too complex (too many parameters), fits noise in the training data, does not generalise well
      • techniques
        • dropout (see the dropout sketch after this outline)
          • during training, randomly set some activations to 0
          • forces the network not to rely on any single node
        • early stopping (see the early-stopping sketch after this outline)
          • stop training before the model starts to overfit
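
A minimal NumPy sketch of the forward pass described above: a weighted sum of the inputs plus a bias, pushed through a non-linear activation. The layer size (3 inputs, 2 outputs) and the names dense_forward / sigmoid / relu are illustrative assumptions, not from the notes.

```python
import numpy as np

def sigmoid(z):
    # squashes any real number into (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

def relu(z):
    # rectified linear unit: keeps positive values, zeroes out negative ones
    return np.maximum(0.0, z)

def dense_forward(x, W, b, activation=sigmoid):
    # weighted sum of the inputs plus bias, then a non-linear activation
    z = W @ x + b
    return activation(z)

# toy example: 3 inputs -> 2 outputs (a multi-output perceptron / dense layer)
rng = np.random.default_rng(0)
x = rng.normal(size=3)        # input values
W = rng.normal(size=(2, 3))   # weights: every input connected to every output
b = np.zeros(2)               # bias terms
print(dense_forward(x, W, b))
```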
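
Hedged sketches of the two loss functions named above, with the empirical loss taken as the loss averaged over the whole dataset; the function names and the tiny example arrays are made up for illustration.

```python
import numpy as np

def binary_cross_entropy(y_true, y_pred, eps=1e-12):
    # for models that output probabilities between 0 and 1
    y_pred = np.clip(y_pred, eps, 1 - eps)
    return -np.mean(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))

def mean_squared_error(y_true, y_pred):
    # for regression models that output continuous real numbers
    return np.mean((y_true - y_pred) ** 2)

# empirical loss: the per-example loss averaged over the dataset
y_true = np.array([1.0, 0.0, 1.0])
y_prob = np.array([0.9, 0.2, 0.7])
print(binary_cross_entropy(y_true, y_prob))
print(mean_squared_error(np.array([2.0, -1.0]), np.array([1.5, -0.5])))
```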
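
A toy training loop tying together the forward pass, the chain-rule backward pass, and the weight update (small step against the gradient, repeated until convergence). As an assumption it uses a single sigmoid unit (logistic regression) on synthetic data rather than a full network; the learning rate and step count are arbitrary example values.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(1)

# toy dataset: 100 examples, 2 features, binary labels
X = rng.normal(size=(100, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(float)

# pick initial weights (random) and bias
w = rng.normal(size=2)
b = 0.0
lr = 0.5  # learning rate

for step in range(200):
    # forward pass
    y_hat = sigmoid(X @ w + b)

    # backward pass: for sigmoid + binary cross entropy the chain rule
    # collapses to dL/dz = y_hat - y, which is pushed back onto w and b
    error = y_hat - y
    grad_w = X.T @ error / len(y)   # dL/dw
    grad_b = error.mean()           # dL/db

    # update: take a small step in the opposite direction of the gradient
    w -= lr * grad_w
    b -= lr * grad_b

# final empirical loss after training
y_hat = sigmoid(X @ w + b)
loss = -np.mean(y * np.log(y_hat + 1e-12) + (1 - y) * np.log(1 - y_hat + 1e-12))
print("final loss:", loss, "weights:", w)
```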
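
A sketch contrasting a plain SGD update (one fixed learning rate for every weight) with an Adam-style adaptive update, where the effective step size per weight depends on how large recent gradients have been. The hyperparameter values are commonly quoted defaults, not values from the notes.

```python
import numpy as np

def sgd_update(w, grad, lr=0.01):
    # plain SGD: same fixed step size for every weight, every step
    return w - lr * grad

def adam_update(w, grad, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    # adaptive update: running estimates of the gradient's mean (m) and
    # squared magnitude (v) scale the step size separately for each weight
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad ** 2
    m_hat = m / (1 - beta1 ** t)   # bias correction for the first steps
    v_hat = v / (1 - beta2 ** t)
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps)
    return w, m, v

# usage: t counts update steps starting at 1; m and v start as zero arrays
w = np.array([0.5, -0.3])
m, v = np.zeros_like(w), np.zeros_like(w)
grad = np.array([0.1, -0.2])
w, m, v = adam_update(w, grad, m, v, t=1)
print(w)
```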
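
One possible way to split a dataset into shuffled mini-batches, each of which gives a cheaper (if noisier) estimate of the full gradient; compute_gradients in the commented usage is a hypothetical placeholder for the forward + backward pass from the training-loop sketch.

```python
import numpy as np

def minibatches(X, y, batch_size, rng):
    # shuffle once per epoch, then yield successive slices of the dataset
    idx = rng.permutation(len(X))
    for start in range(0, len(X), batch_size):
        batch = idx[start:start + batch_size]
        yield X[batch], y[batch]

# usage inside a training loop (compute_gradients stands in for the
# forward + backward pass shown in the training-loop sketch):
#
# for epoch in range(num_epochs):
#     for X_b, y_b in minibatches(X, y, batch_size=32, rng=rng):
#         grad_w, grad_b = compute_gradients(w, b, X_b, y_b)
#         w -= lr * grad_w
#         b -= lr * grad_b
```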
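
An illustrative "inverted dropout" sketch: during training a random fraction of activations is set to 0 and the survivors are rescaled so the expected value is unchanged, while at inference time activations pass through untouched. The drop probability of 0.5 is just an example value.

```python
import numpy as np

def dropout(activations, p_drop, rng, training=True):
    # randomly zero a fraction p_drop of the activations during training,
    # so the network cannot rely on any single node
    if not training or p_drop == 0.0:
        return activations
    keep_mask = (rng.random(activations.shape) >= p_drop).astype(activations.dtype)
    return activations * keep_mask / (1.0 - p_drop)

rng = np.random.default_rng(0)
h = rng.normal(size=8)   # activations from some hidden layer
print(dropout(h, p_drop=0.5, rng=rng))
```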
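
A simple early-stopping sketch; train_one_epoch and validation_loss are hypothetical callables supplied by the caller, and patience (how many epochs to wait without improvement) is an assumed knob, not something from the notes.

```python
def train_with_early_stopping(train_one_epoch, validation_loss,
                              max_epochs=100, patience=5):
    # stop once validation loss has not improved for `patience` epochs,
    # i.e. before the model starts to overfit the training set
    best = float("inf")
    epochs_without_improvement = 0
    for epoch in range(max_epochs):
        train_one_epoch()
        loss = validation_loss()
        if loss < best:
            best = loss
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                break
    return best
```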
