Optimising Code with Loop Unrolling

Developers often need to optimise their code to improve performance, use resources efficiently, reduce costs and facilitate scalability.

Loop unrolling is a classic optimisation technique employed by programmers and compilers. It involves increasing the number of iterations in a loop body and decreasing the loop control overhead. But why does this lead to performance improvements? Let's delve into the intricacies of loop unrolling and see its impact on code performance.

  1. Reduced Loop Control Overhead: every loop has a control overhead - initialisation, checking the termination condition, and updating the loop counter. By unrolling the loop, we execute the loop body fewer times, reducing this overhead.
  2. Increased Instruction Level Parallelism: modern processors can execute multiple instructions simultaneously. Unrolling can increase the number of independent instructions in each loop iteration, allowing more to be executed in parallel.
  3. Better Cache Utilisation: with fewer loop control instructions, there's a higher chance that the actual computation instructions remain in the CPU cache, leading to faster execution.
  4. Predictable Branch Prediction: modern CPUs use branch prediction to guess the path a program will take. Unrolling can make loops more predictable, improving branch prediction accuracy.

To demonstrate the benefits of loop unrolling, let's consider a simple task: summing the elements of two large arrays.

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

int main() {
    int n = 1000000000;   
    int *array1 = malloc(n * sizeof(int));
    int *array2 = malloc(n * sizeof(int));
    int *result = malloc(n * sizeof(int));


    for (int i = 0; i < n; i++) {
        array1[i] = i;
        array2[i] = n - i;
    }

    clock_t start, end;
    double cpu_time_used;

    // Baseline
    start = clock();
    for (int i = 0; i < n; i++) {
        result[i] = array1[i] + array2[i];
    }
    end = clock();
    cpu_time_used = ((double) (end - start)) / CLOCKS_PER_SEC;
    printf("Baseline: %f seconds\n", cpu_time_used);

    // Unroll x2
    start = clock();
    for (int i = 0; i < n; i+=2) {
        result[i] = array1[i] + array2[i];
        result[i+1] = array1[i+1] + array2[i+1];
    }
    end = clock();
    cpu_time_used = ((double) (end - start)) / CLOCKS_PER_SEC;
    printf("Unroll x2: %f seconds\n", cpu_time_used);

    // Unroll x4
    start = clock();
    for (int i = 0; i < n; i+=4) {
        result[i] = array1[i] + array2[i];
        result[i+1] = array1[i+1] + array2[i+1];
        result[i+2] = array1[i+2] + array2[i+2];
        result[i+3] = array1[i+3] + array2[i+3];
    }
    end = clock();
    cpu_time_used = ((double) (end - start)) / CLOCKS_PER_SEC;
    printf("Unroll x4: %f seconds\n", cpu_time_used);

    // Unroll x8
    start = clock();
    for (int i = 0; i < n; i+=8) {
        result[i] = array1[i] + array2[i];
        result[i+1] = array1[i+1] + array2[i+1];
        result[i+2] = array1[i+2] + array2[i+2];
        result[i+3] = array1[i+3] + array2[i+3];
        result[i+4] = array1[i+4] + array2[i+4];
        result[i+5] = array1[i+5] + array2[i+5];
        result[i+6] = array1[i+6] + array2[i+6];
        result[i+7] = array1[i+7] + array2[i+7];
    }
    end = clock();
    cpu_time_used = ((double) (end - start)) / CLOCKS_PER_SEC;
    printf("Unroll x8: %f seconds\n", cpu_time_used);

    free(array1);
    free(array2);
    free(result);

    return 0;
}
C

On an Intel(R) Xeon(R) Gold 6130, icc version 18.0.1, we get the following time usage:

Baseline: 4.99 seconds
Unroll x2: 1.82 seconds
Unroll x4: 1.17 seconds
Unroll x8: 1.10 seconds
Bash

Categories: