Worked Examples: Gradient Descent Method#
These worked solutions correspond to the exercises on the Gradient Descent Method page.
How to use this notebook:
Try each exercise yourself first before looking at the solution
The code cells show both the code and its output
Download this notebook if you want to run and experiment with the code yourself
Your solution might look different - that’s fine as long as it gives the correct answer!
Exercise: Fixed Step Size Gradient Descent#
Problem: Implement gradient descent with a fixed step size to find the minimum of the harmonic potential energy surface. We first need to write a function to calculate the first derivative of the harmonic potential:
$$U'(r) = \frac{\mathrm{d}U}{\mathrm{d}r} = k(r - r_0)$$
Then implement the gradient descent algorithm using the fixed-step update rule:
$$r_{i+1} = r_i - \Delta r \cdot \mathrm{sign}\left(U'(r_i)\right)$$
We’ll start from \(r = 1.0\) Å and explore how different step sizes affect convergence.
Setup: Define functions#
First, we’ll define our harmonic potential and its derivative:
import numpy as np
import matplotlib.pyplot as plt
%config InlineBackend.figure_format='retina'
def harmonic_potential(r, k, r_0):
"""Calculate the harmonic potential energy.
Args:
r (float): Bond length (Å). Can be a float or np.ndarray.
k (float): Force constant (eV Å^-2).
r_0 (float): Equilibrium bond length (Å).
Returns:
Potential energy (eV). Type matches input r.
"""
return 0.5 * k * (r - r_0) ** 2
def harmonic_gradient(r, k, r_0):
"""Calculate the first derivative of the harmonic potential.
Args:
        r (float): Bond length (Å). Can be a float or np.ndarray.
k (float): Force constant (eV Å^-2).
r_0 (float): Equilibrium bond length (Å).
Returns:
First derivative dU/dr (eV Å^-1). Type matches input r.
"""
return k * (r - r_0)
# Parameters for H₂
k = 36.0 # eV Å^-2
r_0 = 0.74 # Å
# Test the gradient function
r_test = 1.0 # Å
gradient = harmonic_gradient(r_test, k, r_0)
print(f"Gradient at r = {r_test} Å: U'(r) = {gradient:.3f} eV Å^-1")
Gradient at r = 1.0 Å: U'(r) = 9.360 eV Å^-1
Part 1: Fixed step size Δr = 0.01 Å#
Let’s implement the gradient descent algorithm with a small step size and see how it performs:
# Gradient descent parameters
r_start = 1.0 # Å
delta_r = 0.01 # Å (step size)
max_iterations = 50
convergence_threshold = 0.001 # eV Å⁻¹
# Initialize
r_current = r_start
r_history = [r_current]
gradient_history = []
print(f"Starting gradient descent with Δr = {delta_r} Å\n")
print("Iteration | Position (Å) | Gradient (eV Å⁻¹)")
print("-" * 50)
converged = False
for iteration in range(max_iterations):
# Calculate gradient at current position
gradient = harmonic_gradient(r_current, k, r_0)
gradient_history.append(gradient)
print(f"{iteration:9d} | {r_current:12.6f} | {gradient:18.6f}")
# Check for convergence
if abs(gradient) < convergence_threshold:
print(f"\nConverged after {iteration} iterations!")
print(f"Final position: r = {r_current:.6f} Å")
print(f"Analytical solution: r₀ = {r_0} Å")
print(f"Error: {abs(r_current - r_0):.6f} Å")
converged = True
break
# Update position (gradient descent step)
r_current = r_current - delta_r * np.sign(gradient)
r_history.append(r_current)
if not converged:
print(f"\nDid not converge within {max_iterations} iterations.")
print(f"Final position: r = {r_current:.6f} Å")
print(f"Final gradient: {gradient:.6f} eV Å⁻¹")
Starting gradient descent with Δr = 0.01 Å
Iteration | Position (Å) | Gradient (eV Å⁻¹)
--------------------------------------------------
0 | 1.000000 | 9.360000
1 | 0.990000 | 9.000000
2 | 0.980000 | 8.640000
3 | 0.970000 | 8.280000
4 | 0.960000 | 7.920000
5 | 0.950000 | 7.560000
6 | 0.940000 | 7.200000
7 | 0.930000 | 6.840000
8 | 0.920000 | 6.480000
9 | 0.910000 | 6.120000
10 | 0.900000 | 5.760000
11 | 0.890000 | 5.400000
12 | 0.880000 | 5.040000
13 | 0.870000 | 4.680000
14 | 0.860000 | 4.320000
15 | 0.850000 | 3.960000
16 | 0.840000 | 3.600000
17 | 0.830000 | 3.240000
18 | 0.820000 | 2.880000
19 | 0.810000 | 2.520000
20 | 0.800000 | 2.160000
21 | 0.790000 | 1.800000
22 | 0.780000 | 1.440000
23 | 0.770000 | 1.080000
24 | 0.760000 | 0.720000
25 | 0.750000 | 0.360000
26 | 0.740000 | -0.000000
Converged after 26 iterations!
Final position: r = 0.740000 Å
Analytical solution: r₀ = 0.74 Å
Error: 0.000000 Å
Explanation:
With \(\Delta r = 0.01\) Å, the algorithm converges successfully in 26 iterations. Key observations:
Steady convergence: The position moves consistently towards the minimum at \(r_0 = 0.74\) Å.
Fixed step size: Each iteration moves exactly 0.01 Å in the direction indicated by the gradient sign:
When \(r > r_0\): gradient is positive, so we step left (decrease \(r\))
When \(r < r_0\): gradient is negative, so we step right (increase \(r\))
Final accuracy: The algorithm stops when \(|U'(r)| < 0.001\) eV Å⁻¹, which here happens at \(r = 0.740000\) Å, the analytical minimum.
Why 26 iterations?: Starting at \(r = 1.0\) Å, we need to travel \(0.26\) Å to reach \(r_0 = 0.74\) Å. With steps of 0.01 Å, this requires about \(0.26 / 0.01 = 26\) steps.
The method is straightforward but slow—each small step requires a function evaluation.
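As a quick check on this estimate, here is a minimal sketch (it reuses r_0 from the setup cell; the start position and step size are written in directly):
# Rough iteration estimate for fixed-step descent: distance from the
# starting point to the minimum, divided by the step size.
estimated_steps = round(abs(1.0 - r_0) / 0.01)
print(f"Estimated iterations: {estimated_steps}")   # 26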
Part 2: Fixed step size Δr = 0.1 Å#
Now let’s try a larger step size to see if we can converge faster:
# Try larger step size
r_start = 1.0 # Å
delta_r = 0.1 # Å (larger step size)
max_iterations = 50
convergence_threshold = 0.001 # eV Å⁻¹
# Initialize
r_current = r_start
r_history_large = [r_current]
gradient_history_large = []
print(f"Starting gradient descent with Δr = {delta_r} Å\n")
print("Iteration | Position (Å) | Gradient (eV Å⁻¹)")
print("-" * 50)
converged = False
for iteration in range(max_iterations):
# Calculate gradient at current position
gradient = harmonic_gradient(r_current, k, r_0)
gradient_history_large.append(gradient)
print(f"{iteration:9d} | {r_current:12.6f} | {gradient:18.6f}")
# Check for convergence
if abs(gradient) < convergence_threshold:
print(f"\nConverged after {iteration} iterations!")
print(f"Final position: r = {r_current:.6f} Å")
converged = True
break
# Update position
r_current = r_current - delta_r * np.sign(gradient)
r_history_large.append(r_current)
if not converged:
print(f"\nDid not converge within {max_iterations} iterations.")
Starting gradient descent with Δr = 0.1 Å
Iteration | Position (Å) | Gradient (eV Å⁻¹)
--------------------------------------------------
0 | 1.000000 | 9.360000
1 | 0.900000 | 5.760000
2 | 0.800000 | 2.160000
3 | 0.700000 | -1.440000
4 | 0.800000 | 2.160000
5 | 0.700000 | -1.440000
6 | 0.800000 | 2.160000
7 | 0.700000 | -1.440000
8 | 0.800000 | 2.160000
9 | 0.700000 | -1.440000
10 | 0.800000 | 2.160000
11 | 0.700000 | -1.440000
12 | 0.800000 | 2.160000
13 | 0.700000 | -1.440000
14 | 0.800000 | 2.160000
15 | 0.700000 | -1.440000
16 | 0.800000 | 2.160000
17 | 0.700000 | -1.440000
18 | 0.800000 | 2.160000
19 | 0.700000 | -1.440000
20 | 0.800000 | 2.160000
21 | 0.700000 | -1.440000
22 | 0.800000 | 2.160000
23 | 0.700000 | -1.440000
24 | 0.800000 | 2.160000
25 | 0.700000 | -1.440000
26 | 0.800000 | 2.160000
27 | 0.700000 | -1.440000
28 | 0.800000 | 2.160000
29 | 0.700000 | -1.440000
30 | 0.800000 | 2.160000
31 | 0.700000 | -1.440000
32 | 0.800000 | 2.160000
33 | 0.700000 | -1.440000
34 | 0.800000 | 2.160000
35 | 0.700000 | -1.440000
36 | 0.800000 | 2.160000
37 | 0.700000 | -1.440000
38 | 0.800000 | 2.160000
39 | 0.700000 | -1.440000
40 | 0.800000 | 2.160000
41 | 0.700000 | -1.440000
42 | 0.800000 | 2.160000
43 | 0.700000 | -1.440000
44 | 0.800000 | 2.160000
45 | 0.700000 | -1.440000
46 | 0.800000 | 2.160000
47 | 0.700000 | -1.440000
48 | 0.800000 | 2.160000
49 | 0.700000 | -1.440000
Did not converge within 50 iterations.
Explanation:
With \(\Delta r = 0.1\) Å, the algorithm fails to converge and instead oscillates around the minimum. Here’s why:
Overshooting: Starting at \(r = 1.0\) Å, the minimum is at \(r_0 = 0.74\) Å, a distance of only 0.26 Å. A step size of 0.1 Å is too large relative to this distance.
The oscillation pattern:
Iteration 0: \(r = 1.0\) Å → gradient positive → step left
Iterations 1 and 2: \(r = 0.9\) Å, then \(r = 0.8\) Å → gradient still positive → keep stepping left
Iteration 3: \(r = 0.7\) Å → we have now overshot the minimum, so the gradient turns negative → step right
Iteration 4: \(r = 0.8\) Å → back on the right-hand side of the minimum → step left again
From here the algorithm bounces back and forth between \(r = 0.8\) Å and \(r = 0.7\) Å
Why no convergence?: The fixed 0.1 Å steps are too large to “land” close enough to the minimum where \(|U'(r)| < 0.001\) eV Å⁻¹. The algorithm perpetually overshoots.
This demonstrates a fundamental limitation of fixed step size gradient descent: the step size must be carefully chosen, and what works for one starting position may not work for another.
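To make this concrete, here is a small sketch (reusing harmonic_gradient, k and r_0 from the setup cell; the scanned step sizes are just illustrative choices) that checks which fixed step sizes reach the 0.001 eV Å⁻¹ threshold within 50 iterations when starting from \(r = 1.0\) Å:
# Scan a few fixed step sizes and report whether each reaches
# |U'(r)| < 0.001 eV Å⁻¹ within 50 iterations from r = 1.0 Å.
for delta_r_trial in (0.001, 0.01, 0.05, 0.1):
    r = 1.0
    converged_at = None
    for i in range(50):
        g = harmonic_gradient(r, k, r_0)
        if abs(g) < 0.001:
            converged_at = i
            break
        r -= delta_r_trial * np.sign(g)
    if converged_at is None:
        print(f"Δr = {delta_r_trial:>5} Å: did not converge in 50 iterations")
    else:
        print(f"Δr = {delta_r_trial:>5} Å: converged after {converged_at} iterations")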
Visualizing the oscillation#
Let’s plot both convergence paths to see the difference clearly:
# First, recreate the successful run with small step size for plotting
r_current = 1.0
r_history_small = [r_current]
delta_r_small = 0.01
for iteration in range(26):
gradient = harmonic_gradient(r_current, k, r_0)
if abs(gradient) < 0.001:
break
r_current = r_current - delta_r_small * np.sign(gradient)
r_history_small.append(r_current)
# Plot the small step size (converges)
plt.plot(r_history_small, 'o-', markersize=4, linewidth=1.5, label='Position vs iteration')
plt.axhline(r_0, color='r', linestyle='--', label=f'Minimum at r₀ = {r_0} Å')
plt.xlabel('Iteration')
plt.ylabel('Position, $r$ (Å)')
plt.title(f'Gradient Descent with Δr = 0.01 Å (Converges)')
plt.legend()
plt.show()
# Plot the large step size (oscillates)
plt.plot(r_history_large, 'go-', markersize=4,
label='Position vs iteration')
plt.axhline(r_0, color='r', linestyle='--', label=f'Minimum at r₀ = {r_0} Å')
plt.xlabel('Iteration')
plt.ylabel('Position, $r$ (Å)')
plt.title(f'Gradient Descent with Δr = 0.1 Å (Oscillates)')
plt.legend()
plt.show()
Exercise: Adaptive Step Size Gradient Descent#
The fixed step size method showed us that choosing an appropriate step size is crucial but difficult. An adaptive approach rescales the step size based on the local gradient magnitude:
$$r_{i+1} = r_i - \alpha U'(r_i)$$
where \(\alpha\) is the learning rate parameter. This approach naturally takes smaller steps near the minimum (where the gradient is small) and larger steps far from the minimum (where the gradient is large).
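As a one-step illustration of this rule (reusing harmonic_gradient, k and r_0 from the setup cell above; alpha, r and step are local illustrative names), the first adaptive update from \(r = 1.0\) Å with \(\alpha = 0.01\) looks like:
# A single adaptive update: the step length scales with the local gradient.
alpha = 0.01
r = 1.0  # Å
step = -alpha * harmonic_gradient(r, k, r_0)   # -0.01 × 9.36 = -0.0936 Å
print(f"Step: {step:.4f} Å, new position: {r + step:.4f} Å")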
Part 1: Learning rate α = 0.01#
Let’s implement adaptive step size gradient descent:
def gradient_descent_adaptive(r_start, alpha, k, r_0, max_iterations=50, threshold=0.001):
"""Perform adaptive step size gradient descent.
Args:
r_start (float): Starting position (Å).
alpha (float): Learning rate.
k (float): Force constant (eV Å⁻²).
r_0 (float): Equilibrium bond length (Å).
max_iterations (int): Maximum number of iterations.
threshold (float): Convergence threshold for |gradient|.
Returns:
tuple: (r_history, gradient_history, converged, num_iterations).
"""
r_current = r_start
r_history = [r_current]
gradient_history = []
print(f"\nAdaptive gradient descent with α = {alpha}\n")
print("Iteration | Position (Å) | Gradient (eV Å⁻¹) | Δr (Å)")
print("-" * 65)
converged = False
for iteration in range(max_iterations):
# Calculate gradient
gradient = harmonic_gradient(r_current, k, r_0)
gradient_history.append(gradient)
# Calculate step size (adaptive)
step = -alpha * gradient
print(f"{iteration:9d} | {r_current:12.6f} | {gradient:17.6f} | {step:9.6f}")
# Check convergence
if abs(gradient) < threshold:
print(f"\nConverged after {iteration} iterations!")
print(f"Final position: r = {r_current:.6f} Å")
print(f"Analytical solution: r_0 = {r_0} Å")
print(f"Error: {abs(r_current - r_0):.6f} Å")
converged = True
break
# Update position
r_current = r_current + step
r_history.append(r_current)
if not converged:
print(f"\nDid not converge within {max_iterations} iterations.")
return r_history, gradient_history, converged, iteration
# Run with α = 0.01
r_history_01, grad_history_01, converged_01, iters_01 = gradient_descent_adaptive(
1.0, 0.01, k, r_0
)
Adaptive gradient descent with α = 0.01
Iteration | Position (Å) | Gradient (eV Å⁻¹) | Δr (Å)
-----------------------------------------------------------------
0 | 1.000000 | 9.360000 | -0.093600
1 | 0.906400 | 5.990400 | -0.059904
2 | 0.846496 | 3.833856 | -0.038339
3 | 0.808157 | 2.453668 | -0.024537
4 | 0.783621 | 1.570347 | -0.015703
5 | 0.767917 | 1.005022 | -0.010050
6 | 0.757867 | 0.643214 | -0.006432
7 | 0.751435 | 0.411657 | -0.004117
8 | 0.747318 | 0.263461 | -0.002635
9 | 0.744684 | 0.168615 | -0.001686
10 | 0.742998 | 0.107913 | -0.001079
11 | 0.741918 | 0.069065 | -0.000691
12 | 0.741228 | 0.044201 | -0.000442
13 | 0.740786 | 0.028289 | -0.000283
14 | 0.740503 | 0.018105 | -0.000181
15 | 0.740322 | 0.011587 | -0.000116
16 | 0.740206 | 0.007416 | -0.000074
17 | 0.740132 | 0.004746 | -0.000047
18 | 0.740084 | 0.003037 | -0.000030
19 | 0.740054 | 0.001944 | -0.000019
20 | 0.740035 | 0.001244 | -0.000012
21 | 0.740022 | 0.000796 | -0.000008
Converged after 21 iterations!
Final position: r = 0.740022 Å
Analytical solution: r_0 = 0.74 Å
Error: 0.000022 Å
Explanation:
With \(\alpha = 0.01\), the algorithm converges successfully. Let’s examine how the adaptive step sizes work:
Key observations:
Adaptive step sizes: Notice how the “Δr (Å)” column changes:
Early iterations: When far from the minimum, the gradient is large (e.g., 9.36 eV Å⁻¹ initially), so steps are large (−0.0936 Å)
Later iterations: As we approach the minimum, the gradient decreases, and steps become progressively smaller
Near convergence: Steps become very small as the gradient approaches zero
Natural damping: The step size automatically decreases as we approach the minimum because the gradient magnitude decreases. This prevents oscillation.
Efficiency: The adaptive method allows large initial steps when far from the minimum, then automatically refines with smaller steps near convergence, without needing to manually tune the step size.
Why it works: For the harmonic potential, \(U'(r) = k(r - r_0)\), so the gradient is directly proportional to the distance from the minimum. The adaptive step size therefore automatically scales with this distance.
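To make this concrete, substituting the harmonic gradient into the adaptive update rule shows that the distance from the minimum shrinks by a constant factor each iteration:
$$r_{i+1} - r_0 = r_i - \alpha k (r_i - r_0) - r_0 = (1 - \alpha k)(r_i - r_0)$$
With \(\alpha = 0.01\) and \(k = 36\) eV Å⁻², this factor is \(1 - 0.36 = 0.64\), which matches the ratio of successive gradients in the table above (for example, \(5.9904 / 9.36 = 0.64\)).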
Part 2: Learning rate α = 0.001#
What happens with a smaller learning rate?
# Run with α = 0.001 (smaller)
r_history_001, grad_history_001, converged_001, iters_001 = gradient_descent_adaptive(
1.0, 0.001, k, r_0
)
Adaptive gradient descent with α = 0.001
Iteration | Position (Å) | Gradient (eV Å⁻¹) | Δr (Å)
-----------------------------------------------------------------
0 | 1.000000 | 9.360000 | -0.009360
1 | 0.990640 | 9.023040 | -0.009023
2 | 0.981617 | 8.698211 | -0.008698
3 | 0.972919 | 8.385075 | -0.008385
4 | 0.964534 | 8.083212 | -0.008083
5 | 0.956450 | 7.792217 | -0.007792
6 | 0.948658 | 7.511697 | -0.007512
7 | 0.941147 | 7.241276 | -0.007241
8 | 0.933905 | 6.980590 | -0.006981
9 | 0.926925 | 6.729289 | -0.006729
10 | 0.920195 | 6.487034 | -0.006487
11 | 0.913708 | 6.253501 | -0.006254
12 | 0.907455 | 6.028375 | -0.006028
13 | 0.901426 | 5.811353 | -0.005811
14 | 0.895615 | 5.602145 | -0.005602
15 | 0.890013 | 5.400468 | -0.005400
16 | 0.884613 | 5.206051 | -0.005206
17 | 0.879406 | 5.018633 | -0.005019
18 | 0.874388 | 4.837962 | -0.004838
19 | 0.869550 | 4.663795 | -0.004664
20 | 0.864886 | 4.495899 | -0.004496
21 | 0.860390 | 4.334046 | -0.004334
22 | 0.856056 | 4.178021 | -0.004178
23 | 0.851878 | 4.027612 | -0.004028
24 | 0.847850 | 3.882618 | -0.003883
25 | 0.843968 | 3.742844 | -0.003743
26 | 0.840225 | 3.608101 | -0.003608
27 | 0.836617 | 3.478210 | -0.003478
28 | 0.833139 | 3.352994 | -0.003353
29 | 0.829786 | 3.232286 | -0.003232
30 | 0.826553 | 3.115924 | -0.003116
31 | 0.823438 | 3.003751 | -0.003004
32 | 0.820434 | 2.895616 | -0.002896
33 | 0.817538 | 2.791374 | -0.002791
34 | 0.814747 | 2.690884 | -0.002691
35 | 0.812056 | 2.594012 | -0.002594
36 | 0.809462 | 2.500628 | -0.002501
37 | 0.806961 | 2.410605 | -0.002411
38 | 0.804551 | 2.323823 | -0.002324
39 | 0.802227 | 2.240166 | -0.002240
40 | 0.799987 | 2.159520 | -0.002160
41 | 0.797827 | 2.081777 | -0.002082
42 | 0.795745 | 2.006833 | -0.002007
43 | 0.793739 | 1.934587 | -0.001935
44 | 0.791804 | 1.864942 | -0.001865
45 | 0.789939 | 1.797804 | -0.001798
46 | 0.788141 | 1.733083 | -0.001733
47 | 0.786408 | 1.670692 | -0.001671
48 | 0.784737 | 1.610547 | -0.001611
49 | 0.783127 | 1.552568 | -0.001553
Did not converge within 50 iterations.
Explanation:
With \(\alpha = 0.001\), the algorithm does not converge within 50 iterations. After 49 iterations, the gradient is still 1.55 eV Å⁻¹, well above the convergence threshold of 0.001 eV Å⁻¹.
The problem:
Too conservative: With such a small learning rate, each step is very cautious:
Iteration 0: Step is only −0.00936 Å (compared to −0.0936 Å with \(\alpha = 0.01\))
After 49 iterations, we’ve only reached \(r = 0.783\) Å, still 0.043 Å away from the minimum at \(r_0 = 0.74\) Å
Progress is too slow: Over the 50 iterations the gradient falls only from 9.36 eV Å⁻¹ to 1.55 eV Å⁻¹, still far above the 0.001 eV Å⁻¹ convergence threshold.
Many more iterations needed: Converging with this learning rate would take roughly 250 iterations, because the gradient shrinks by a factor of only \(1 - \alpha k = 0.964\) per iteration (see the sketch below).
The learning rate is too small: Whilst smaller learning rates are more stable (less likely to diverge), if they’re too small, convergence becomes impractically slow. This learning rate of 0.001 is simply too conservative for efficient optimisation.
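As a rough check on that estimate, here is a minimal sketch using the closed-form contraction factor \((1 - \alpha k)\) of the harmonic gradient (it reuses k from the setup cell; alpha, threshold, g_initial, contraction and n_required are illustrative names introduced here):
import numpy as np

# Iterations needed for the gradient to fall from 9.36 eV Å⁻¹ to below
# 0.001 eV Å⁻¹ when it shrinks by a factor of (1 - alpha*k) each iteration.
alpha, threshold, g_initial = 0.001, 0.001, 9.36
contraction = 1 - alpha * k                  # 0.964 for k = 36 eV Å⁻²
n_required = int(np.ceil(np.log(threshold / g_initial) / np.log(contraction)))
print(f"Iterations required with α = {alpha}: {n_required}")   # ≈ 250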
Part 3: Learning rate α = 0.1#
What happens with a larger learning rate?
# Run with α = 0.1 (larger)
r_history_1, grad_history_1, converged_1, iters_1 = gradient_descent_adaptive(
1.0, 0.1, k, r_0, max_iterations=50
)
Adaptive gradient descent with α = 0.1
Iteration | Position (Å) | Gradient (eV Å⁻¹) | Δr (Å)
-----------------------------------------------------------------
0 | 1.000000 | 9.360000 | -0.936000
1 | 0.064000 | -24.336000 | 2.433600
2 | 2.497600 | 63.273600 | -6.327360
3 | -3.829760 | -164.511360 | 16.451136
4 | 12.621376 | 427.729536 | -42.772954
5 | -30.151578 | -1112.096794 | 111.209679
6 | 81.058102 | 2891.451663 | -289.145166
7 | -208.087065 | -7517.774325 | 751.777432
8 | 543.690368 | 19546.213244 | -1954.621324
9 | -1410.930957 | -50820.154435 | 5082.015444
10 | 3671.084487 | 132132.401532 | -13213.240153
11 | -9542.155666 | -343544.243982 | 34354.424398
12 | 24812.268732 | 893215.034353 | -89321.503435
13 | -64509.234703 | -2322359.089319 | 232235.908932
14 | 167726.674229 | 6038133.632229 | -603813.363223
15 | -436086.688994 | -15699147.443794 | 1569914.744379
16 | 1133828.055385 | 40817783.353866 | -4081778.335387
17 | -2947950.280001 | -106126236.720050 | 10612623.672005
18 | 7664673.392004 | 275928215.472131 | -27592821.547213
19 | -19928148.155209 | -717413360.227540 | 71741336.022754
20 | 51813187.867545 | 1865274736.591605 | -186527473.659161
21 | -134714285.791616 | -4849714315.138174 | 484971431.513817
22 | 350257145.722201 | 12609257219.359253 | -1260925721.935925
23 | -910668576.213724 | -32784068770.334057 | 3278406877.033406
24 | 2367738300.819682 | 85238578802.868561 | -8523857880.286857
25 | -6156119579.467175 | -221620304887.458282 | 22162030488.745831
26 | 16005910909.278656 | 576212792707.391602 | -57621279270.739166
27 | -41615368361.460510 | -1498153261039.218262 | 149815326103.921844
28 | 108199957742.461334 | 3895198478701.967773 | -389519847870.196777
29 | -281319890127.735474 | -10127516044625.117188 | 1012751604462.511719
30 | 731431714334.776245 | 26331541716025.304688 | -2633154171602.530762
31 | -1901722457267.754395 | -68462008461665.796875 | 6846200846166.580078
32 | 4944478388898.826172 | 178001222000331.093750 | -17800122200033.109375
33 | -12855643811134.283203 | -462803177200860.875000 | 46280317720086.093750
34 | 33424673908951.812500 | 1203288260722238.750000 | -120328826072223.875000
35 | -86904152163272.062500 | -3128549477877820.500000 | 312854947787782.062500
36 | 225950795624510.000000 | 8134228642482333.000000 | -813422864248233.375000
37 | -587472068623723.375000 | -21148994470454068.000000 | 2114899447045407.000000
38 | 1527427378421683.500000 | 54987385623180576.000000 | -5498738562318058.000000
39 | -3971311183896374.500000 | -142967202620269504.000000 | 14296720262026952.000000
40 | 10325409078130578.000000 | 371714726812700800.000000 | -37171472681270080.000000
41 | -26846063603139504.000000 | -966458289713022208.000000 | 96645828971302224.000000
42 | 69799765368162720.000000 | 2512791553253857792.000000 | -251279155325385792.000000
43 | -181479389957223072.000000 | -6533258038460030976.000000 | 653325803846003072.000000
44 | 471846413888780032.000000 | 16986470899996082176.000000 | -1698647089999608320.000000
45 | -1226800676110828288.000000 | -44164824339989815296.000000 | 4416482433998981632.000000
46 | 3189681757888153600.000000 | 114828543283973521408.000000 | -11482854328397352960.000000
47 | -8293172570509199360.000000 | -298554212538331168768.000000 | 29855421253833117696.000000
48 | 21562248683323916288.000000 | 776240952599660986368.000000 | -77624095259966095360.000000
49 | -56061846576642179072.000000 | -2018226476759118512128.000000 | 201822647675911864320.000000
Did not converge within 50 iterations.
Explanation:
With \(\alpha = 0.1\), the algorithm diverges! The position oscillates with increasing amplitude and never converges.
Why this happens:
Overcorrection: The first step is \(\Delta r = -\alpha\, U'(r) = -0.1 \times 9.36 = -0.936\) Å, which moves us from \(r = 1.0\) Å to \(r = 0.064\) Å, overshooting the minimum at \(r_0 = 0.74\) Å by a large margin.
Growing oscillations: Each subsequent step overshoots even more dramatically:
The gradient on the far side of the minimum has opposite sign and large magnitude
The large \(\alpha\) amplifies this, causing increasingly wild swings
The position alternates sides but moves further from the minimum each time
Instability criterion: For the harmonic potential with \(U'(r) = k(r - r_0)\), the algorithm is stable only when
$$\alpha < \frac{2}{k}$$
For our case with \(k = 36.0\) eV Å⁻², this requires \(\alpha < \frac{2}{36} \approx 0.056\). Our \(\alpha = 0.1\) exceeds this threshold, causing divergence.
This demonstrates that whilst adaptive step sizes help, the learning rate parameter must still be chosen carefully. Too small and convergence is slow; too large and the algorithm diverges.
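As a quick numerical check of this criterion, here is a minimal sketch comparing the contraction factor \(|1 - \alpha k|\) for the three learning rates tried above (k is reused from the setup cell; alpha_trial, factor and verdict are illustrative names):
# Contraction factor |1 - alpha*k|: the distance to the minimum shrinks each
# iteration if this is below 1 and grows (divergence) if it is above 1.
for alpha_trial in (0.001, 0.01, 0.1):
    factor = abs(1 - alpha_trial * k)
    verdict = "converges" if factor < 1 else "diverges"
    print(f"α = {alpha_trial:>5}: |1 - αk| = {factor:.3f} → {verdict}")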
Summary: Choosing the learning rate#
Adaptive vs. fixed: Adaptive step sizes (\(r_{i+1} = r_i - \alpha U'(r_i)\)) can outperform fixed step sizes because they automatically adjust based on proximity to the minimum.
The learning rate determines convergence speed: There’s a “sweet spot” for \(\alpha\):
Too small: stable but slow
Just right: fast convergence
Too large: divergence
Problem-specific tuning: The optimal \(\alpha\) depends on the problem (specifically, the curvature \(k\) for harmonic potentials). For different force constants, different learning rates would be optimal.
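For example, for a purely harmonic potential the contraction factor \(1 - \alpha k\) vanishes when \(\alpha = 1/k\), so the adaptive update reaches the minimum in a single step. A minimal check of this (reusing harmonic_gradient, k and r_0 from the setup cell; alpha_opt is an illustrative name):
# With alpha = 1/k the adaptive step lands on the harmonic minimum in one move.
alpha_opt = 1 / k                     # ≈ 0.028 for k = 36 eV Å⁻²
r = 1.0                               # Å, same starting point as above
r = r - alpha_opt * harmonic_gradient(r, k, r_0)
print(f"α = 1/k = {alpha_opt:.4f}: after one step, r = {r:.6f} Å")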