Error Handling & Retry Logic for MySQL Binary Log Archiving and PITR Automation

Binary log archiving is the immutable backbone of Point-in-Time Recovery (PITR), and the failure-handling layer is what decides whether that backbone survives contact with production. A transient network partition, an object-storage rate limit, a truncated local segment, or an expired IAM credential will interrupt the upload stream sooner or later — the only question is whether the pipeline recovers deterministically or silently drops a segment and fractures the recovery timeline. This page defines the retry and error-handling layer for Automated Binlog Archiving to Object Storage: how to classify every failure into retryable, fatal, or integrity classes, apply exponential backoff with full jitter, enforce cryptographic idempotency so a retry can never overwrite a good archive with a corrupt one, and route unrecoverable segments to a dead-letter queue instead of blocking the loop. Naive approaches fail here because they treat all errors identically: an unbounded while True retry hammers a throttled bucket into a longer outage, a blind re-PUT overwrites a verified object with a partial one, and a fatal 403 gets retried five times before anyone is paged. The result is the worst outcome in database reliability — a gap in the GTID chain that nobody discovers until a recovery drill fails.

Core Concept & Prerequisites

Resilient archiving rests on three non-negotiable primitives: exponential backoff with full jitter, strict idempotency via checksum verification, and explicit error categorization. Miss any one and the pipeline degrades under exactly the conditions it exists to survive.

MySQL rotates binary logs continuously based on max_binlog_size or explicit FLUSH BINARY LOGS, and an archiver must never assume a file is fully written or safely closed before initiating transfer. Only closed segments — every row above the last in SHOW BINARY LOGS — are stable enough to hash and upload. The active tail is still being appended and its final size, checksum, and terminating Rotate event are not yet fixed. Detecting closed segments correctly is the domain of Rotation Scheduling & Cron Automation; this page assumes segments arrive already sealed and focuses on making their transfer survivable.

Version and environment constraints:

MySQL 8.0.22+ / 8.4: performance_schema.log_status and SHOW BINARY LOG STATUS (the replacement for SHOW MASTER STATUS, which 8.4 removes) are used to reconcile archived segments against the server’s live GTID coordinate. On 8.0 use SHOW MASTER STATUS.
Row-based logging: PITR replay is only deterministic when binlog_format = ROW; the reasoning is detailed in ROW vs STATEMENT vs MIXED Formats. The retry layer treats a non-ROW server as a fatal precondition, not a retryable error.
Python 3.10+: the classifier uses structural pattern matching (match/case) to route exceptions without nested if/elif chains.
Libraries: tenacity for declarative retry policies, mysql-connector-python for a pooled control connection to the server, and boto3 for the S3-compatible transport. All are already referenced across this platform.

Idempotency is the primitive most implementations skip. If a retry controller blindly re-uploads a file that already exists in the target bucket, it wastes egress, triggers needless lifecycle transitions, and risks overwriting a verified archive with a corrupted duplicate. Every retry must first query remote metadata, compare checksums, and short-circuit when parity is confirmed — the retry becomes a verified no-op instead of a destructive rewrite. This same guarantee protects the downstream AWS S3 & GCS Sync Pipelines, where duplicate deliveries would otherwise corrupt ordered objects.
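
The pre-flight decision described above can be sketched as a pure function that runs before any bytes are sent. This is a minimal illustration, not the module's code: the `Action` enum, the `decide_action` name, and the `CONFLICT` outcome (refusing to overwrite an object whose recorded digest differs) are hypothetical choices made for the example.

```python
from enum import Enum
from typing import Mapping, Optional


class Action(Enum):
    SKIP = "skip"          # remote object already matches: verified no-op
    UPLOAD = "upload"      # nothing is stored under this key yet
    CONFLICT = "conflict"  # something is stored, but its digest differs


def decide_action(local_sha256: str,
                  remote_metadata: Optional[Mapping[str, str]]) -> Action:
    """Decide what a retry should do before it sends a single byte.

    remote_metadata is None when the key does not exist; otherwise it is
    the user-metadata mapping returned by a HEAD request on the object.
    """
    if remote_metadata is None:
        return Action.UPLOAD
    if remote_metadata.get("sha256") == local_sha256:
        return Action.SKIP
    # Hypothetical policy: never overwrite silently, let an operator decide.
    return Action.CONFLICT
```

Keeping the decision separate from the transport makes the short-circuit easy to unit-test without a bucket.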

Production-Grade Python Implementation

The module below is a complete, runnable retry controller. It uses tenacity for the backoff policy, typed dataclasses for configuration and results, and structured JSON logging so a single segment can be followed from discovery through commit; the mysql-connector-python connection pool for the reconciliation query appears under Validation & Verification Gates. Compression and encryption run before this layer engages: re-encrypting on each retry would either reuse a GCM nonce or, with a fresh nonce, produce different ciphertext and break idempotency. The input is therefore an already-sealed payload from Compression & Encryption Workflows.

# Python 3.10+  |  deps: boto3, tenacity, mysql-connector-python
import sys
import time
import json
import hashlib
import logging
from dataclasses import dataclass
from datetime import datetime, timezone
from pathlib import Path

import boto3
from botocore.exceptions import ClientError, BotoCoreError
from tenacity import (
    Retrying, retry_if_exception_type, stop_after_attempt,
    wait_random_exponential, before_sleep_log,
)


class ArchiverError(Exception):
    """Base exception for binlog archiving failures."""

class TransientError(ArchiverError):
    """Network or API errors that warrant a bounded retry."""

class PermanentError(ArchiverError):
    """Auth failures, policy violations, or corrupt data — never retried."""


class JsonLogFormatter(logging.Formatter):
    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "ts": datetime.now(timezone.utc).isoformat(),
            "level": record.levelname,
            "event": record.getMessage(),
        }
        payload |= getattr(record, "fields", {})
        return json.dumps(payload)

_handler = logging.StreamHandler(sys.stdout)
_handler.setFormatter(JsonLogFormatter())
logger = logging.getLogger("mysql_binlog_archiver")
logger.addHandler(_handler)
logger.setLevel(logging.INFO)


@dataclass(slots=True)
class RetryConfig:
    max_attempts: int = 5
    base_delay: float = 1.0     # seconds — first backoff ceiling
    max_delay: float = 60.0     # seconds — cap on any single wait
    dry_run: bool = False


@dataclass(slots=True)
class UploadResult:
    success: bool
    remote_key: str
    checksum: str
    attempts: int
    duration_ms: float


def compute_sha256(path: Path) -> str:
    """Stream the digest so a multi-GB segment never lands fully in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        while chunk := fh.read(1 << 16):   # walrus: 64 KiB windows
            digest.update(chunk)
    return digest.hexdigest()


def classify_error(exc: Exception) -> ArchiverError:
    """Route a raw exception into the transient/permanent taxonomy."""
    match exc:
        case ClientError() as e:
            code = e.response["Error"]["Code"]
            if code in ("SlowDown", "RequestTimeout", "Throttling",
                        "ServiceUnavailable", "InternalError",
                        "503", "502", "500"):
                return TransientError(f"transient_api:{code}")
            if code in ("AccessDenied", "InvalidAccessKeyId", "ExpiredToken",
                        "SignatureDoesNotMatch", "403", "401"):
                return PermanentError(f"auth_policy:{code}")
            return TransientError(f"unclassified_client:{code}")
        case BotoCoreError():
            return TransientError(f"transport:{type(exc).__name__}")
        case FileNotFoundError() | PermissionError():
            return PermanentError(f"local_fs:{type(exc).__name__}")
        case ValueError() as e if "checksum" in str(e).lower():
            return PermanentError(f"integrity:{e}")
        case _:
            return TransientError(f"unexpected:{type(exc).__name__}")


def _attempt_upload(s3, bucket: str, path: Path, key: str, sha256: str, size: int) -> None:
    """One upload attempt: idempotent pre-check, put, post-verify. Raises on failure."""
    # Pre-flight idempotency: skip if an object with a matching digest already exists.
    try:
        head = s3.head_object(Bucket=bucket, Key=key)
        if head.get("Metadata", {}).get("sha256") == sha256:
            logger.info("idempotent_skip", extra={"fields": {"key": key}})
            return
    except ClientError as e:
        if e.response["Error"]["Code"] not in ("404", "NoSuchKey"):
            raise  # a non-404 head failure is a real signal — let the classifier see it

    with open(path, "rb") as body:
        s3.put_object(
            Bucket=bucket, Key=key, Body=body, ContentLength=size,
            Metadata={"sha256": sha256,
                      "archived_at": datetime.now(timezone.utc).isoformat()},
        )

    # Post-upload verification: the object now stored under this key must carry our digest.
    # This checks the metadata we wrote; it does not re-hash the stored bytes.
    verify = s3.head_object(Bucket=bucket, Key=key)
    if verify.get("Metadata", {}).get("sha256") != sha256:
        raise ValueError("remote checksum mismatch after upload")


def upload_binlog(s3, bucket: str, path: Path, key: str, cfg: RetryConfig) -> UploadResult:
    """Idempotent, bounded-retry upload with full-jitter backoff and dead-letter routing."""
    started = time.monotonic()
    sha256 = compute_sha256(path)
    size = path.stat().st_size

    if cfg.dry_run:
        logger.info("dry_run_skip", extra={"fields": {"key": key, "sha256": sha256, "bytes": size}})
        return UploadResult(True, key, sha256, 0, 0.0)

    attempts = 0
    try:
        for attempt in Retrying(
            retry=retry_if_exception_type(TransientError),
            stop=stop_after_attempt(cfg.max_attempts),
            wait=wait_random_exponential(multiplier=cfg.base_delay, max=cfg.max_delay),
            before_sleep=before_sleep_log(logger, logging.WARNING),
            reraise=True,
        ):
            with attempt:
                attempts = attempt.retry_state.attempt_number
                try:
                    _attempt_upload(s3, bucket, path, key, sha256, size)
                except Exception as exc:                      # noqa: BLE001
                    # PermanentError fails the retry predicate and escapes at once
                    # for dead-lettering; TransientError is retried by tenacity.
                    raise classify_error(exc) from exc
    except PermanentError:
        logger.error("permanent_failure", extra={"fields": {"key": key, "attempts": attempts}})
        raise
    except TransientError:
        logger.critical("retries_exhausted", extra={"fields": {"key": key, "attempts": attempts}})
        raise

    return UploadResult(True, key, sha256, attempts, (time.monotonic() - started) * 1000)

wait_random_exponential implements full jitter: each wait is drawn uniformly from [0, min(base·2^n, max)] rather than a fixed exponential, which decorrelates a fleet of archivers so they do not synchronize into retry storms against the same bucket prefix. The classifier raises PermanentError back through tenacity’s retry_if_exception_type(TransientError) predicate, so a fatal error escapes on the first attempt while a transient one loops — the single most important behavior in the module. The caller wraps upload_binlog and routes the escaped PermanentError (or exhausted TransientError) to the dead-letter queue described below.
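
To make the interval concrete, the sketch below reproduces the full-jitter arithmetic with the standard library only. It is an illustration of the formula, not tenacity's implementation; the function names are hypothetical, and it assumes n starts at 0 for the first retry, so the ceiling for retry number `attempt` is min(base·2^(attempt-1), max).

```python
import random
from typing import Optional


def backoff_ceiling(attempt: int, base: float = 1.0, cap: float = 60.0) -> float:
    """Upper bound of the wait before retry number `attempt` (1-based)."""
    return min(cap, base * 2 ** (attempt - 1))


def full_jitter_delay(attempt: int, base: float = 1.0, cap: float = 60.0,
                      rng: Optional[random.Random] = None) -> float:
    """Draw the actual wait uniformly from [0, ceiling]."""
    rng = rng or random.Random()
    return rng.uniform(0.0, backoff_ceiling(attempt, base, cap))


# Ceilings for the defaults in RetryConfig (base_delay=1.0, max_delay=60.0).
ceilings = [backoff_ceiling(a) for a in range(1, 8)]
```

With those defaults the ceilings grow 1, 2, 4, 8, 16, 32 and then stay pinned at the 60-second cap, while each actual wait lands anywhere between zero and the ceiling.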

Configuration Reference

The server-side variables that determine how much slack the retry layer has — how long a segment survives locally before the server is entitled to purge it, and how cleanly failures surface — matter as much as the client code. These are the settings the archiver reads at startup and refuses to run against if they are unsafe.

Variable	Type	Default	Recommended	PITR impact
`binlog_expire_logs_seconds`	integer (s)	`2592000` (30d)	≥ 3× worst-case archive lag	Local retention deadline the uploader races. Too short + a slow retry = the server purges a segment before it is archived, leaving a hole. Tune per Binlog Retention Boundaries.
`max_binlog_size`	integer (bytes)	`1073741824` (1G)	256M–1G	Sets segment granularity. Smaller files retry faster and fail smaller, but multiply request count and throttling pressure.
`sync_binlog`	integer	`1`	`1`	`1` fsyncs every commit so a crash cannot leave a partially durable segment the archiver would hash inconsistently.
`binlog_error_action`	enum	`ABORT_SERVER`	`ABORT_SERVER`	On a binlog write error the server aborts rather than silently continuing without logging — a silent gap would be invisible to the archiver until recovery.
`binlog_checksum`	enum	`CRC32`	`CRC32`	Enables per-event checksums so `mysqlbinlog` decode can detect corruption the transport-layer SHA-256 would not catch.
`log_bin`	string	(8.0: on)	on, explicit path	Binary logging must be enabled; the archiver treats `OFF` (surfaced as ERROR 1381) as a fatal precondition.

# my.cnf — MySQL 8.0.22+ / 8.4 : safe defaults for a race-free archive window
[mysqld]
log_bin                     = /var/lib/mysql/mysql-bin
binlog_format               = ROW
binlog_expire_logs_seconds  = 259200      # 3 days of local slack for the uploader
max_binlog_size             = 268435456   # 256 MiB segments: faster, smaller retries
sync_binlog                 = 1
binlog_error_action         = ABORT_SERVER
binlog_checksum             = CRC32
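
A startup check along the lines described above might look like the following sketch. It assumes the caller has already loaded the relevant rows of SHOW GLOBAL VARIABLES into a name-to-value mapping; `check_preconditions` and `worst_case_lag_s` (an operator-supplied estimate of the slowest expected archive lag) are hypothetical names, not part of the module.

```python
from typing import Mapping


def check_preconditions(variables: Mapping[str, str],
                        worst_case_lag_s: int) -> list[str]:
    """Return a list of violations; an empty list means it is safe to start."""
    problems: list[str] = []
    if variables.get("log_bin", "OFF").upper() != "ON":
        problems.append("log_bin is not ON")
    if variables.get("binlog_format", "").upper() != "ROW":
        problems.append("binlog_format is not ROW")
    if variables.get("sync_binlog") != "1":
        problems.append("sync_binlog is not 1")

    raw_expire = variables.get("binlog_expire_logs_seconds")
    if raw_expire is None:
        problems.append("binlog_expire_logs_seconds is unknown")
    else:
        expire = int(raw_expire)
        # 0 disables automatic purging, so there is no deadline to race.
        if expire != 0 and expire < 3 * worst_case_lag_s:
            problems.append("binlog_expire_logs_seconds < 3x worst-case lag")
    return problems
```

Returning every violation at once, rather than raising on the first, lets the archiver log the full picture before it refuses to start.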

Validation & Verification Gates

Retry logic is only trustworthy if every attempt ends in a proven-consistent state. The pipeline enforces four gates before it advances the recovery manifest:

Pre-flight digest check: head_object compares the remote sha256 metadata against the freshly computed local digest; a match short-circuits the upload as an idempotent no-op.
Post-upload verification: after every put_object, the archiver re-reads the stored metadata and raises a PermanentError on mismatch rather than trusting the 200 OK. Because the sha256 value is metadata the client supplied, this proves the object landed under the right key with the right digest recorded; detecting bytes corrupted in transit additionally requires a checksum the provider computes itself.
GTID-set reconciliation: periodically diff the GTID ranges recorded in archived object metadata against the server’s live Executed_Gtid_Set to prove contiguity. Aligning these coordinates depends on a gap-free GTID Tracking & Enforcement pipeline.
Dry-run replay: before deploying policy changes, run the controller with dry_run=True to validate path resolution, IAM permissions, and checksum computation without mutating remote state.
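
The GTID-set reconciliation gate can be illustrated with a small interval diff. This is a sketch under stated assumptions: it handles classic untagged GTID sets of the form `uuid:1-5:7-9,uuid2:3`, and the helper names `parse_gtid_set` and `missing_ranges` are hypothetical.

```python
def parse_gtid_set(text: str) -> dict[str, list[tuple[int, int]]]:
    """Parse 'uuid:1-5:7-9,uuid2:3' into {uuid: [(1, 5), (7, 9)], ...}."""
    result: dict[str, list[tuple[int, int]]] = {}
    for part in text.replace("\n", "").split(","):
        part = part.strip()
        if not part:
            continue
        uuid, *intervals = part.split(":")
        ranges = result.setdefault(uuid.lower(), [])
        for interval in intervals:
            start, _, end = interval.partition("-")
            ranges.append((int(start), int(end or start)))
    return result


def missing_ranges(executed: str, archived: str) -> dict[str, list[tuple[int, int]]]:
    """Return transaction ranges present in `executed` but absent from `archived`."""
    have = parse_gtid_set(archived)
    gaps: dict[str, list[tuple[int, int]]] = {}
    for uuid, wanted in parse_gtid_set(executed).items():
        covered = sorted(have.get(uuid, []))
        holes: list[tuple[int, int]] = []
        for start, end in sorted(wanted):
            cursor = start  # first transaction number not yet accounted for
            for c_start, c_end in covered:
                if c_end < cursor or c_start > end:
                    continue  # no overlap with the part still being checked
                if c_start > cursor:
                    holes.append((cursor, c_start - 1))
                cursor = max(cursor, c_end + 1)
            if cursor <= end:
                holes.append((cursor, end))
        if holes:
            gaps[uuid] = holes
    return gaps
```

Transactions in the still-active segment have not been archived yet, so a range at the very tail of the executed set is expected to show up here; only holes behind the newest archived segment indicate a broken chain.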

The reconciliation gate uses a pooled control connection so the scan never opens a fresh handshake per run:

# Python 3.10+  |  reconcile archived segments against the live server state
from mysql.connector import pooling

_pool = pooling.MySQLConnectionPool(
    pool_name="binlog_ctrl", pool_size=3,
    host="127.0.0.1", user="binlog_archiver", database="mysql",
)

def local_segments() -> list[tuple[str, int]]:
    """Return (log_name, file_size) for every segment the server still retains."""
    conn = _pool.get_connection()
    try:
        cur = conn.cursor()
        cur.execute("SHOW BINARY LOGS;")          # -- MySQL 8.0+
        rows = cur.fetchall()
        cur.close()
        # Drop the active tail (last row) — only closed segments are archivable.
        return [(name, size) for name, size, *_ in rows[:-1]]
    finally:
        conn.close()   # returns the connection to the pool, does not disconnect

def find_archive_gaps(archived_keys: set[str]) -> list[str]:
    """Any closed local segment absent from object storage is a recovery gap."""
    return [name for name, _ in local_segments()
            if f"{name}.zst.enc" not in archived_keys]

Any name returned by find_archive_gaps is an actionable alert: a closed segment the server still holds but the archive does not, i.e. a segment that will become an unrecoverable hole the moment binlog_expire_logs_seconds elapses.

Error Handling & Failure Modes

The taxonomy the classifier encodes maps directly onto operator response. Storage-side and MySQL-side failures both appear here because a binlog archiver straddles both.

Error class	Representative codes	Root cause	Recovery procedure
Transient	HTTP 500/502/503, `SlowDown`, `Throttling`, TCP RST, DNS timeout	Provider partition saturated or a network blip	Full-jitter backoff, bounded retries. Sustained throttling is handled in depth under Handling S3 Throttling During High-Throughput Binlog Archiving.
Permanent (auth/policy)	HTTP 401/403, `AccessDenied`, `InvalidAccessKeyId`, `SignatureDoesNotMatch`	IAM drift, expired credential, immutable-bucket policy	Fail on first attempt, dead-letter, page. Retrying only extends the outage.
Integrity	Local `ValueError: checksum...`, `mysqlbinlog` decode failure, truncated segment	Corrupt or partial local file	Abort, quarantine the local file, re-derive from the server if still present.
MySQL ERROR 1236	`ER_MASTER_FATAL_ERROR_READING_BINLOG`	Segment purged before capture — the retention race was lost	Non-retryable at the source; widen `binlog_expire_logs_seconds` and alert on the gap.
MySQL ERROR 1381	`ER_NO_BINARY_LOGGING`	`log_bin` is OFF	Fatal precondition — the archiver must refuse to start.
OS errno 28 / MySQL client error 2013	`ENOSPC`, `CR_SERVER_LOST` (Lost connection to MySQL server during query)	Local disk exhausted or control connection dropped mid-scan	Treat disk-full as fatal (paging), a dropped connection as transient (pooled reconnect).
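
The MySQL-side and OS-side rows of the table could be encoded next to the storage classifier roughly as follows. This is a hedged sketch: `ErrorClass`, `classify_mysql_errno`, and `classify_os_error` are hypothetical names, and treating unknown MySQL error numbers as permanent is a conservative assumption made for the example, not something the module prescribes.

```python
import errno
from enum import Enum


class ErrorClass(Enum):
    TRANSIENT = "transient"
    PERMANENT = "permanent"


# MySQL error numbers taken from the table above.
_MYSQL_ERRNO_CLASS = {
    1236: ErrorClass.PERMANENT,  # segment purged before capture
    1381: ErrorClass.PERMANENT,  # binary logging is disabled
    2013: ErrorClass.TRANSIENT,  # lost connection: reconnect through the pool
}


def classify_mysql_errno(code: int) -> ErrorClass:
    """Unknown numbers default to PERMANENT (assumption: page rather than loop)."""
    return _MYSQL_ERRNO_CLASS.get(code, ErrorClass.PERMANENT)


def classify_os_error(exc: OSError) -> ErrorClass:
    """Disk-full is fatal; other OS errors are left to the bounded retry."""
    if exc.errno == errno.ENOSPC:
        return ErrorClass.PERMANENT
    return ErrorClass.TRANSIENT
```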

The controlling insight is that ERROR 1236 is not a client bug — it is the retention race made visible. If the archiver ever observes it while reading a segment, the local retention window was too tight relative to archive lag, and the fix lives in configuration, not in more retries. Every escaped PermanentError carries the original file path, checksum, and exception payload into a dead-letter queue (DLQ) that preserves the segment for manual intervention without blocking the primary loop. The DLQ is drained by an operator or an automated remediation job; it must never be silently discarded, because each entry is a potential hole in the recovery chain.
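
One simple way to realize such a dead-letter queue is an append-only JSON Lines file. The sketch below is illustrative only: the `DeadLetter` record, the `dead_letter` and `pending` helpers, and the file-based storage are assumptions for the example, and a production deployment might use a message queue or a database table instead.

```python
import json
from dataclasses import asdict, dataclass
from datetime import datetime, timezone
from pathlib import Path


@dataclass(frozen=True)
class DeadLetter:
    segment_path: str
    sha256: str
    error: str
    failed_at: str


def dead_letter(queue_file: Path, segment_path: Path, sha256: str,
                exc: BaseException) -> DeadLetter:
    """Append one JSON line describing the failed segment to `queue_file`."""
    entry = DeadLetter(
        segment_path=str(segment_path),
        sha256=sha256,
        error=f"{type(exc).__name__}: {exc}",
        failed_at=datetime.now(timezone.utc).isoformat(),
    )
    queue_file.parent.mkdir(parents=True, exist_ok=True)
    with open(queue_file, "a", encoding="utf-8") as fh:
        fh.write(json.dumps(asdict(entry)) + "\n")
    return entry


def pending(queue_file: Path) -> list[DeadLetter]:
    """Read back every entry that still awaits manual or automated remediation."""
    if not queue_file.exists():
        return []
    with open(queue_file, encoding="utf-8") as fh:
        return [DeadLetter(**json.loads(line)) for line in fh if line.strip()]
```

A caller would wrap the upload in a try/except, call `dead_letter` for any escaped failure, and continue with the next segment so the primary loop is never blocked.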

Observability & Alerting

Every retry cycle emits structured metrics so archiving lag is measurable, not anecdotal. Track at minimum:

binlog_upload_attempts_total (counter, labeled by outcome): retry pressure per segment.
binlog_upload_duration_seconds (histogram): tail latency that predicts backpressure.
binlog_permanent_failures_total (counter): anything reaching the DLQ; alert on any nonzero rate.
binlog_archive_lag_seconds (gauge): time elapsed since the oldest closed segment that is still unarchived was sealed; the number that actually protects your RPO.

On the server side, watch the binlog cache so the archiver’s read pattern does not induce I/O pressure on the primary:

-- MySQL 8.0+ : binlog cache pressure and current archive coordinate
SELECT VARIABLE_NAME, VARIABLE_VALUE
FROM performance_schema.global_status
WHERE VARIABLE_NAME IN ('Binlog_cache_disk_use', 'Binlog_cache_use');

-- MySQL 8.0.22+ : one-shot snapshot of file, position, and gtid_executed
SELECT * FROM performance_schema.log_status\G

A rising Binlog_cache_disk_use means transactions are spilling the in-memory cache to disk, a signal to throttle archiver concurrency before it competes with the write path. Alert thresholds: page when binlog_archive_lag_seconds exceeds one third of binlog_expire_logs_seconds (the 3× retention margin recommended in the Configuration Reference is used up), when binlog_permanent_failures_total increases at all, and when the gap scan from the verification gate returns any segment. These three alarms cover the full spectrum from slow-but-working to actively-losing-data.
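
The three paging conditions can be evaluated in one place. The following is a minimal sketch, assuming the metric values have already been scraped; the `ArchiveHealth` container and `alerts_to_page` function are hypothetical names introduced for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ArchiveHealth:
    archive_lag_s: float            # binlog_archive_lag_seconds
    expire_logs_s: int              # binlog_expire_logs_seconds
    permanent_failures_delta: int   # increase in binlog_permanent_failures_total
    gap_segments: tuple[str, ...]   # names returned by the gap scan


def alerts_to_page(health: ArchiveHealth) -> list[str]:
    """Return the reasons to page; an empty list means all three alarms are quiet."""
    reasons: list[str] = []
    # A retention of 0 means the server never purges, so lag has no deadline.
    if health.expire_logs_s > 0 and health.archive_lag_s > health.expire_logs_s / 3:
        reasons.append("archive_lag_exceeds_one_third_of_retention")
    if health.permanent_failures_delta > 0:
        reasons.append("permanent_failures_increased")
    if health.gap_segments:
        reasons.append("archive_gap:" + ",".join(health.gap_segments))
    return reasons
```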

Frequently Asked Questions

Why full jitter instead of plain exponential backoff or a fixed delay?

Plain exponential backoff keeps every retrying client aligned to the same schedule, so a fleet of archivers all wake at 1s, 2s, 4s… and slam the throttled prefix in synchronized waves — the retries themselves prolong the outage. Full jitter draws each wait uniformly from [0, min(base·2^n, max)], spreading requests across the whole interval and decorrelating the fleet. tenacity’s wait_random_exponential implements exactly this, which is why the module uses it rather than a hand-rolled sleep.

How does the retry layer avoid overwriting a good archive with a corrupt one?

Two gates. Before uploading, head_object checks whether an object with a matching SHA-256 already exists and short-circuits if so, so a retry of an already-succeeded upload is a verified no-op. After uploading, the archiver re-reads the digest stored in the object’s metadata and raises a PermanentError on any mismatch instead of trusting the 200 OK, so a write that did not land as expected is dead-lettered rather than committed to the manifest.

Should compression and encryption run inside the retry loop?

No. They must run once, before the retry controller engages. Re-encrypting a segment on each attempt forces a bad choice: reusing the GCM nonce under the same key is a serious cryptographic weakness, while drawing a fresh nonce produces different ciphertext, and therefore a different checksum, on every retry, which breaks idempotency. Seal the segment once in the Compression & Encryption Workflows stage, then hand the immutable payload to this layer so every retry uploads identical bytes.

What do I do when the archiver reports MySQL ERROR 1236?

Stop treating it as a retryable client error — it means the server already purged a segment the archiver needed, so no amount of retrying can recover it from the source. The real fix is configuration: widen binlog_expire_logs_seconds so local retention comfortably exceeds worst-case archive lag, and add a gap-scan alert so the shortfall is caught long before purge. If the segment still exists in a downstream replica’s relay logs, re-derive it there; otherwise the interval is a permanent hole and the recovery documentation must record it.

Related

Handling S3 Throttling During High-Throughput Binlog Archiving — adaptive retry, prefix sharding, and backpressure for sustained SlowDown storms.
AWS S3 & GCS Sync Pipelines — the conditional-write transport this retry layer wraps.
Async Processing & Queue Management — the ordered worker pool where dead-letter routing and per-queue retry policies live.
Compression & Encryption Workflows — the seal-once transform that must precede any retry to preserve idempotency.
