LLM Instability

  1. Algorithmic (training): divergence/oscillation, getting stuck in local minima, vanishing gradients
  2. Numerical: overflow, underflow (either of which may in turn cause divergence)

Causes and Fixes

Here’s an explanation of the causes and potential fixes for algorithmic instability and numerical instability in large language models (LLMs):


1. Algorithmic Instability

Causes:

  • Divergence/Oscillation:
    • Caused by a poorly chosen learning rate. If the learning rate is too high, the optimization steps may overshoot the minima, leading to divergence or oscillation.
    • A poorly conditioned loss landscape or insufficient regularization can exacerbate the instability.
  • Stuck in Local Minima:
    • Optimization algorithms may converge to a suboptimal solution, particularly in highly non-convex loss landscapes.
    • Insufficient exploration during optimization (e.g., small learning rates or lack of noise in the system) can prevent finding the global minimum.
  • Gradient Vanishing:
    • Gradients become too small as they backpropagate through deep layers, especially when activation functions like sigmoid or tanh are used.
    • This prevents effective weight updates for the earlier layers (a small demo follows this list).
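
As a quick, hedged illustration of the vanishing-gradient point above, the sketch below compares the gradient magnitude reaching the first layer of a deep sigmoid stack versus a ReLU stack. The depth, width, and batch size are arbitrary choices for the demo, not values from this note.

```python
import torch
import torch.nn as nn

def first_layer_grad_norm(activation: nn.Module, depth: int = 30, width: int = 64) -> float:
    # Build a deep stack of identical Linear + activation layers.
    layers = []
    for _ in range(depth):
        layers += [nn.Linear(width, width), activation]
    net = nn.Sequential(*layers)

    # One forward/backward pass on random data.
    net(torch.randn(8, width)).sum().backward()

    # Gradient norm of the earliest layer's weights.
    return net[0].weight.grad.norm().item()

print("sigmoid:", first_layer_grad_norm(nn.Sigmoid()))  # typically orders of magnitude smaller
print("relu:   ", first_layer_grad_norm(nn.ReLU()))
```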

Potential Fixes:

  • Divergence/Oscillation:
    • Use an adaptive optimizer like Adam, RMSprop, or Adagrad that adjusts learning rates dynamically.
    • Implement learning rate schedules (e.g., warm-up or decay); a minimal schedule sketch follows this list.
  • Stuck in Local Minima:
    • Employ advanced optimizers such as stochastic gradient descent with momentum.
    • Introduce noise to the system through dropout or stochastic regularization.
    • Use techniques like simulated annealing or restart mechanisms.
  • Gradient Vanishing:
    • Use activation functions like ReLU, GELU, or variants that mitigate vanishing gradients.
    • Apply normalization (e.g., batch or layer normalization) to stabilize gradient scales.
    • Initialize weights carefully using methods like Xavier or He initialization.
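
As a concrete illustration of the warm-up/decay bullet above, here is a minimal sketch of a linear warm-up plus cosine decay schedule built on PyTorch's AdamW and LambdaLR. The model, peak learning rate, and step counts are illustrative assumptions, not values prescribed by this note.

```python
import math
import torch

# Illustrative assumptions: a toy model and arbitrary step counts.
model = torch.nn.Linear(512, 512)
warmup_steps, total_steps = 2_000, 100_000

optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4, weight_decay=0.1)

def lr_lambda(step: int) -> float:
    # Linear warm-up from 0 to the peak LR, then cosine decay toward 0.
    if step < warmup_steps:
        return step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * (1.0 + math.cos(math.pi * progress))

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)

# Training loop sketch: step the optimizer, then the scheduler.
# for batch in data:
#     loss = compute_loss(model, batch)   # hypothetical helper
#     loss.backward()
#     optimizer.step(); scheduler.step(); optimizer.zero_grad()
```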

2. Numerical Instability

Causes:

  • Overflow:
    • Occurs when computations produce values that exceed the representable range of floating-point numbers (e.g., exponential calculations in softmax or loss functions).
    • Often caused by unnormalized inputs, extreme weight values, or large learning rates.
  • Underflow:
    • Happens when computations result in values too small to be represented, effectively becoming zero.
    • Common in calculations involving probabilities (e.g., when multiplying small likelihoods) or deep layers with repeated multiplications.
  • Divergence Due to Overflow or Underflow:
    • Overflow or underflow propagating through the network can destabilize training and cause divergence.

Potential Fixes:

  • Overflow:
    • Use gradient clipping to cap large values during backpropagation.
    • Normalize inputs and outputs at every layer (e.g., through layer normalization or batch normalization).
    • Replace problematic operations (e.g., use the log-sum-exp trick instead of direct exponentiation in softmax); a sketch follows this list.
  • Underflow:
    • Use higher precision (e.g., float64 instead of float32) for critical computations.
    • Work with probabilities in log space (log-transformed operations) so they do not round to zero.
    • Rescale input data to an appropriate range (e.g., standardization or min-max scaling).
  • Divergence Prevention:
    • Combine overflow and underflow prevention techniques.
    • Monitor intermediate values (e.g., gradients and activations) to detect issues early and adjust hyperparameters.
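
To make the overflow/underflow bullets concrete, below is a sketch of a numerically stable cross-entropy that uses the log-sum-exp trick (shift by the row maximum so exp() cannot overflow, stay in log space so tiny probabilities do not round to zero), followed by PyTorch's standard gradient-clipping utility. The tensor shapes and the max_norm value are illustrative.

```python
import torch

def stable_log_softmax(logits: torch.Tensor) -> torch.Tensor:
    # Log-softmax via log-sum-exp: subtract the row max so exp() never overflows.
    shifted = logits - logits.max(dim=-1, keepdim=True).values
    return shifted - shifted.exp().sum(dim=-1, keepdim=True).log()

def stable_cross_entropy(logits: torch.Tensor, targets: torch.Tensor) -> torch.Tensor:
    # Stay in log space: tiny probabilities remain representable as large negative logs.
    log_probs = stable_log_softmax(logits.float())  # upcast if logits are fp16/bf16
    return -log_probs.gather(-1, targets.unsqueeze(-1)).squeeze(-1).mean()

# Deliberately large logits that would overflow a naive exp()-based softmax in float32.
logits = (torch.randn(4, 32000) * 50).requires_grad_()
targets = torch.randint(0, 32000, (4,))

loss = stable_cross_entropy(logits, targets)
loss.backward()

# Cap the gradient norm (the overflow fix mentioned above); 1.0 is a common but arbitrary choice.
torch.nn.utils.clip_grad_norm_([logits], max_norm=1.0)
```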

By addressing both algorithmic and numerical issues through a combination of proper model design, hyperparameter tuning, and numerically stable formulations, LLM training can become more stable and robust.

Here’s a table summarizing the causes and potential fixes for algorithmic instability and numerical instability in large language models:

| Type of Instability | Causes | Potential Fixes |
| --- | --- | --- |
| Algorithmic: Divergence/Oscillation | High learning rate causing overshooting; poorly conditioned loss landscape or lack of regularization. | Use adaptive optimizers (e.g., Adam, RMSprop); implement learning rate schedules (e.g., warm-up, decay). |
| Algorithmic: Stuck in Local Minima | Highly non-convex loss landscapes; insufficient exploration during optimization (e.g., small learning rates). | Use stochastic gradient descent with momentum; introduce noise (e.g., dropout, stochastic regularization); employ techniques like simulated annealing or restarts. |
| Algorithmic: Gradient Vanishing | Gradients become too small when backpropagating through deep layers with sigmoid/tanh activations; poor weight initialization. | Use ReLU, GELU, or other activations that mitigate vanishing gradients; apply normalization to stabilize gradients; initialize weights with Xavier or He initialization. |
| Numerical: Overflow | Values exceed the representable range during computation (e.g., exponentiation in softmax); large inputs or weights. | Use gradient clipping to cap large values; normalize inputs and outputs (e.g., batch/layer normalization); replace direct exponentiation with log-sum-exp. |
| Numerical: Underflow | Values become too small to be represented (e.g., repeated multiplication of small probabilities in deep networks). | Use higher precision (e.g., float64) for critical computations; rescale inputs (e.g., standardization, min-max scaling); work in log space to avoid tiny values. |
| Numerical: Divergence due to Overflow/Underflow | Overflow or underflow propagates through the network, destabilizing training. | Combine overflow and underflow prevention (e.g., gradient clipping, normalization); monitor intermediate values (gradients, activations) and adjust hyperparameters. |


Training Instability Fixes

  1. Label smoothing
  2. Z-loss (a log-sum-exp penalty on the logits); see the sketch after this list
  3. (Learning rate) Observe the training trend to adjust the learning rate? How?
  4. Shortcut (skip) connections; what about massive shortcuts?
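
A minimal sketch of items 1 and 2, assuming a PyTorch setup: label smoothing via the built-in label_smoothing argument of cross_entropy, plus a z-loss auxiliary term that penalizes the squared log-partition logsumexp(logits) so the logits stay in a numerically safe range. The smoothing factor and z-loss coefficient below are illustrative values, not recommendations from this note.

```python
import torch
import torch.nn.functional as F

def loss_with_smoothing_and_zloss(
    logits: torch.Tensor,            # (batch, vocab)
    targets: torch.Tensor,           # (batch,)
    smoothing: float = 0.1,          # label-smoothing factor (illustrative)
    z_loss_coef: float = 1e-4,       # z-loss weight (illustrative)
) -> torch.Tensor:
    # 1. Label smoothing: soften the one-hot targets so the model is not pushed
    #    toward infinitely confident (and numerically extreme) logits.
    ce = F.cross_entropy(logits, targets, label_smoothing=smoothing)

    # 2. Z-loss: penalize log Z = logsumexp(logits) so the softmax normalizer
    #    stays close to 1 and the logits stay in a safe numeric range.
    log_z = torch.logsumexp(logits, dim=-1)
    z_loss = z_loss_coef * (log_z ** 2).mean()

    return ce + z_loss

logits = torch.randn(8, 32000, requires_grad=True)
targets = torch.randint(0, 32000, (8,))
loss_with_smoothing_and_zloss(logits, targets).backward()
```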

Overflow/Underflow Fixes

  1. Tools to track overflow and underflow (inf, NaN); a hook-based sketch follows this list
  2. For scale-invariant or linear operators, insert a linear rescaling factor (>1 or <1), or a nonlinear rescaling (log/exp)
  3. Use clipping for overflow
  4. Use dual numbers for underflow? What about overflow?
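
For item 1, one possible way to track overflow/underflow in PyTorch is sketched below: forward hooks that flag non-finite activations, plus a post-backward check on gradients. (torch.autograd.set_detect_anomaly(True) is the heavier built-in alternative.) The toy model and shapes are illustrative.

```python
import torch
import torch.nn as nn

def install_finite_checks(model: nn.Module) -> None:
    # Flag any layer whose output contains inf/NaN during the forward pass.
    def check_output(module, inputs, output):
        if isinstance(output, torch.Tensor) and not torch.isfinite(output).all():
            print(f"non-finite activation in {module.__class__.__name__}")
    for module in model.modules():
        module.register_forward_hook(check_output)

def report_nonfinite_grads(model: nn.Module) -> None:
    # Call after loss.backward() to catch overflowing or NaN gradients.
    for name, param in model.named_parameters():
        if param.grad is not None and not torch.isfinite(param.grad).all():
            print(f"non-finite gradient in {name}")

# Usage sketch (toy model; real training would use your own model and loss):
model = nn.Sequential(nn.Linear(16, 16), nn.ReLU(), nn.Linear(16, 1))
install_finite_checks(model)
loss = model(torch.randn(4, 16)).sum()
loss.backward()
report_nonfinite_grads(model)
```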

Source