Implementing exponential backoff

This page explains how to use truncated exponential backoff to ensure your devices do not generate excessive load.

When devices retry calls without waiting, they can produce a heavy load on the ClearBlade IoT Core servers. ClearBlade IoT Core automatically limits projects that generate excessive load. Even a small fraction of overactive devices can trigger limits that affect all devices in the same Google Cloud project.

You are strongly encouraged to implement truncated exponential backoff with jitter to avoid triggering these limits. If you have questions or would like to discuss the specifics of your algorithm, send an email to iotcore@clearblade.com with this information:

IoT Core registry name
Number of devices connected
Industry

Truncated exponential backoff is a standard error-handling strategy for network applications. In this approach, a client periodically retries a failed request with increasing delays between requests. Clients should use truncated exponential backoff for all requests to ClearBlade IoT Core that return HTTP 5xx or 429 response codes, and when reconnecting after a disconnection from the MQTT server.
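As a minimal sketch of the HTTP rule above, the check for whether a response should be retried with backoff could look like the following. The function name should_retry is hypothetical and not part of any ClearBlade API; MQTT disconnections are detected by your MQTT client library and are not covered here.

```python
def should_retry(status_code):
    """Return True for HTTP status codes that call for a backoff retry.

    Hypothetical helper: covers only the HTTP case (429 and any 5xx).
    """
    return status_code == 429 or 500 <= status_code <= 599
```

Other 4xx responses are excluded because the text above names only 5xx and 429 as the codes to retry with backoff.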

Example algorithm

An exponential backoff algorithm retries requests, exponentially increasing the waiting time between retries up to a maximum backoff time. For example:

Make a request to ClearBlade IoT Core.
If the request fails, wait 1 second + random_number_milliseconds and retry the request.
If the request fails, wait 2 seconds + random_number_milliseconds and retry the request.
If the request fails, wait 4 seconds + random_number_milliseconds and retry the request.
And so on, up to a maximum_backoff time.
Continue waiting and retrying up to some maximum number of retries, but do not increase the waiting period between retries.

where:

The wait time is min((2^n seconds + random_number_milliseconds), maximum_backoff), with n starting at 0 and incremented by 1 for each iteration (request).
random_number_milliseconds is a random number of milliseconds less than or equal to 1000. This jitter helps to avoid cases in which many clients become synchronized and all retry at once, sending requests in synchronized waves. The random_number_milliseconds value is recalculated after each retry request.
maximum_backoff is typically 32 or 64 seconds. The appropriate value depends on the use case.
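The wait-time formula can be sketched in Python as follows. This is an illustration under stated assumptions, not a reference implementation: the function name backoff_delay and the rng parameter are hypothetical, n is taken to start at 0 for the first retry, and maximum_backoff defaults to 64 seconds.

```python
import random


def backoff_delay(n, maximum_backoff=64.0, rng=random):
    """Return the wait in seconds before retry number n (n = 0 for the first retry).

    Hypothetical helper; rng only needs a randint(a, b) method.
    """
    # Jitter: a random number of milliseconds in [0, 1000], recalculated on
    # every call and converted to seconds before it is added to 2^n.
    random_number_milliseconds = rng.randint(0, 1000)
    return min((2 ** n) + random_number_milliseconds / 1000.0, maximum_backoff)


# Print one possible schedule; the values differ on each run because of jitter.
for n in range(8):
    print(f"retry {n + 1}: wait {backoff_delay(n):.3f} s")
```

With these defaults the wait grows from between 1 and 2 seconds to the 64-second cap, which is reached on the seventh retry (n = 6) and held from then on.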

The client can continue retrying after it has reached the maximum_backoff time. Retries after this point do not need to continue increasing backoff time. For example, suppose a client uses a maximum_backoff time of 64 seconds. After reaching this value, the client can retry every 64 seconds. However, clients should not retry indefinitely; stop after some maximum number of retries.
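One way to combine the capped wait with a retry limit is sketched below. The names call_with_backoff, request, and max_retries are hypothetical, and the sleep parameter exists only so the wait can be replaced in tests. The sketch also assumes that every exception raised by request() is retryable; a real client would more likely catch only the errors that correspond to 5xx or 429 responses and MQTT disconnections.

```python
import random
import time


def call_with_backoff(request, max_retries=8, maximum_backoff=64.0, sleep=time.sleep):
    """Call request() until it succeeds or max_retries retries have failed.

    Hypothetical helper: treats any exception from request() as retryable.
    """
    for n in range(max_retries + 1):
        try:
            return request()
        except Exception:
            if n == max_retries:
                # Retry budget exhausted: re-raise instead of retrying forever.
                raise
            # Jitter of up to 1000 ms, recalculated for every retry.
            jitter_seconds = random.randint(0, 1000) / 1000.0
            # Exponential wait, truncated at maximum_backoff seconds.
            sleep(min((2 ** n) + jitter_seconds, maximum_backoff))
```

Once 2^n exceeds maximum_backoff, every further wait equals maximum_backoff, so the client keeps retrying at a fixed interval until max_retries is reached and the last error is re-raised to the caller.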

The wait time between retries and the number of retries depend on your use case and network conditions.