Fallback Routing Strategies for MySQL Binary Log Archiving and PITR Automation

A binary log archiving pipeline that only knows how to read from one source is a single network partition away from a recovery gap. When the primary stream degrades — an I/O stall, a purged log the archiver had not yet copied, a replica whose relay log write fails — every second the pipeline waits on a dead source is a second of committed transactions that may never reach durable storage, widening the Recovery Point Objective (RPO) silently until a failover proves it. Fallback routing is the control-plane discipline of detecting that break, selecting an alternate source that provably still holds the transactions you are missing, and repointing the archiver without duplicating, skipping, or reordering a single event. Naive approaches fail in two predictable ways: they reroute on file-and-position coordinates that are meaningless on a different server, or they switch sources without first proving the new source has not already purged the exact Global Transaction Identifiers (GTIDs) the archive needs — turning a transient outage into a permanent hole. This page builds a routing controller that treats GTID sets, not byte offsets, as the only valid handoff currency, and that degrades to Base Backup Integration for PITR replay only when no live source qualifies.

Core Concept & Prerequisites

Fallback routing rests on one invariant: a source is only a valid failover target if the set of transactions the archive still needs is a subset of what that source can still serve. In GTID terms, the needed set is everything the source primary executed that the archive has not yet captured, and a candidate can serve it only if that set is contained in GTID_SUBTRACT(@@GLOBAL.gtid_executed, @@GLOBAL.gtid_purged) on the candidate — the transactions it has executed and has not yet purged from its binary logs. Routing on SOURCE_LOG_FILE / SOURCE_LOG_POS cannot express this, because a byte offset on mysql-bin.000041 of the primary points at unrelated data on a replica. This is why the entire strategy depends on a gap-free GTID Tracking & Enforcement foundation: without globally unique identifiers, there is no server-independent way to ask “do you still have what I’m missing?”
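To see why the invariant is interval arithmetic rather than string matching, here is a minimal client-side sketch of GTID set containment. It is illustrative only: the helper names are hypothetical, it assumes untagged GTIDs of the form uuid:start-end, and in production the server’s own GTID_SUBSET and GTID_SUBTRACT remain the authority.

```python
from __future__ import annotations

# Illustrative model only; the server-side GTID functions are authoritative.
Intervals = list[tuple[int, int]]  # inclusive (start, end) pairs


def _merge(spans: Intervals) -> Intervals:
    """Sort and coalesce overlapping or adjacent intervals."""
    merged: Intervals = []
    for lo, hi in sorted(spans):
        if merged and lo <= merged[-1][1] + 1:
            merged[-1] = (merged[-1][0], max(merged[-1][1], hi))
        else:
            merged.append((lo, hi))
    return merged


def parse_gtid_set(text: str) -> dict[str, Intervals]:
    """Parse 'uuid:1-5:7,uuid2:3' into {uuid: [(1, 5), (7, 7)], ...}."""
    result: dict[str, Intervals] = {}
    for part in text.replace("\n", "").split(","):
        part = part.strip()
        if not part:
            continue
        uuid, *ranges = part.split(":")
        spans = result.setdefault(uuid.lower(), [])
        for rng in ranges:
            lo, _, hi = rng.partition("-")
            spans.append((int(lo), int(hi or lo)))
    return {uuid: _merge(spans) for uuid, spans in result.items()}


def is_subset(needed: str, servable: str) -> bool:
    """True when every needed interval lies inside one servable interval."""
    have = parse_gtid_set(servable)
    for uuid, spans in parse_gtid_set(needed).items():
        for lo, hi in spans:
            if not any(s <= lo and hi <= e for s, e in have.get(uuid, [])):
                return False
    return True
```

Note that "uuid:6-9" is contained in "uuid:1-20" even though neither string is a substring of the other, which is exactly the case a textual comparison gets wrong.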

The format of the stream matters at handoff time. ROW-based events are self-contained and replay deterministically on any target, which is what makes a mid-stream source switch safe; STATEMENT and MIXED streams carry session-context dependencies that can replay differently after a reroute, so a fallback pipeline must validate format continuity across the switch. The trade-offs are covered in ROW vs STATEMENT vs MIXED Formats; this controller assumes binlog_format = ROW.
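A format-continuity check can be sketched as a small pure function. The names below are hypothetical, and the inputs are assumed to be the @@GLOBAL.binlog_format values read from the failing source (or its last known value) and from the candidate.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FormatCheck:
    allowed: bool
    reason: str


def format_continuity_gate(current_format: str, candidate_format: str) -> FormatCheck:
    """Allow a handoff only when both sides of the seam log in ROW format."""
    cur = current_format.strip().upper()
    cand = candidate_format.strip().upper()
    if cur != cand:
        # A switch across a format boundary is rejected outright.
        return FormatCheck(False, f"format boundary: {cur} -> {cand}")
    if cand != "ROW":
        # Same format on both sides, but not a self-contained one.
        return FormatCheck(False, f"{cand} events may replay differently after a reroute")
    return FormatCheck(True, "ROW on both sides of the seam")
```

Returning a reason alongside the verdict keeps the rejection explainable in the routing log.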

Prerequisites:

MySQL 8.0.23+ on the archiving replica and every candidate source, so the controller can issue CHANGE REPLICATION SOURCE TO ... SOURCE_AUTO_POSITION = 1 (CHANGE MASTER TO is deprecated as of 8.0.23 and survives only as an alias). Auto-position lets the server negotiate the correct starting GTID instead of a brittle file/position pair.
gtid_mode = ON and enforce_gtid_consistency = ON on all sources — the routing math is undefined without globally unique transaction IDs.
Python 3.10+ for the controller (match statements for error-code dispatch, slots=True dataclasses, the walrus operator in polling loops).
mysql-connector-python 8.0+ for pooled connections and parameterized calls to the server-side GTID set functions, and tenacity for declarative retry policy on transient connection faults.
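A discovery source for candidates (a static topology file or a service catalog), plus scoped credentials following Security & Access Frameworks: REPLICATION CLIENT, REPLICATION SLAVE, and SELECT on mysql.gtid_executed on the sources, and REPLICATION_SLAVE_ADMIN on the archiving replica so the controller can run STOP REPLICA and CHANGE REPLICATION SOURCE TO; never SUPER.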
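The server-side prerequisites above can be checked before a source is admitted to the candidate pool. The sketch below is one possible preflight, not part of the controller itself; the function names and the shape of the settings dict are assumptions, with the values expected to come from @@version, @@GLOBAL.gtid_mode, @@GLOBAL.enforce_gtid_consistency and @@GLOBAL.binlog_format.

```python
from __future__ import annotations

import re

MIN_VERSION = (8, 0, 23)


def parse_version(version: str) -> tuple[int, ...]:
    """Extract (major, minor, patch) from strings such as '8.0.36-commercial'."""
    match = re.match(r"(\d+)\.(\d+)\.(\d+)", version)
    if match is None:
        raise ValueError(f"unrecognised version string: {version!r}")
    return tuple(int(part) for part in match.groups())


def preflight(settings: dict[str, str]) -> list[str]:
    """Return a list of problems; an empty list means the source is eligible."""
    problems: list[str] = []
    if parse_version(settings["version"]) < MIN_VERSION:
        problems.append(f"version {settings['version']} is older than 8.0.23")
    expected = {
        "gtid_mode": "ON",
        "enforce_gtid_consistency": "ON",
        "binlog_format": "ROW",
    }
    for name, want in expected.items():
        got = str(settings.get(name, "")).upper()
        if got != want:
            problems.append(f"{name} is {got or 'unset'}, expected {want}")
    return problems
```

Collecting every problem instead of failing on the first one gives an operator the full list of what to fix on a rejected source.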

Production-Grade Python Implementation

The controller is a stateful, idempotent module. It computes the needed GTID set once per break, qualifies every candidate against that set using MySQL’s own GTID_SUBSET and GTID_SUBTRACT (never a naive string comparison — GTID set arithmetic is not substring matching), reroutes to the first qualifying source via auto-position, and records the handoff so a re-run of the same break is a no-op rather than a second switch. A candidate that cannot serve the full needed set is rejected outright rather than partially followed, because a partial reroute is exactly how gaps are born.

#!/usr/bin/env python3
"""GTID-anchored fallback routing controller for MySQL binlog archiving.
Targets: MySQL 8.0.23+, Python 3.10+. Requires: mysql-connector-python, tenacity."""
from __future__ import annotations

import logging
from dataclasses import dataclass, field
from enum import Enum

import mysql.connector
from mysql.connector import Error, pooling
from tenacity import (
    retry,
    retry_if_exception_type,
    stop_after_attempt,
    wait_exponential,
)

logging.basicConfig(
    format="%(asctime)s | %(levelname)s | %(name)s | %(message)s", level=logging.INFO
)
logger = logging.getLogger("binlog.fallback_router")


class RouteOutcome(Enum):
    REROUTED = "rerouted"
    ALREADY_HANDLED = "already_handled"
    DEGRADED_TO_BACKUP = "degraded_to_backup"
    NO_ACTION = "no_action"


@dataclass(slots=True, frozen=True)
class Candidate:
    name: str
    host: str
    port: int = 3306


@dataclass(slots=True)
class HandoffRecord:
    break_signature: str          # (source, last error, needed-set hash) — idempotency key
    chosen_source: str
    needed_gtids: str
    outcome: RouteOutcome
    applied: set[str] = field(default_factory=set)


class FallbackRouter:
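    def __init__(
        self,
        pool: pooling.MySQLConnectionPool,
        candidates: list[Candidate],
        credentials: dict[str, str] | None = None,
    ):
        self.pool = pool
        self.candidates = candidates
        # Scoped account for candidate probes (keys "user" and "password").
        self.credentials = credentials or {}
        self._handoffs: dict[str, HandoffRecord] = {}

    @retry(
        retry=retry_if_exception_type(Error),
        stop=stop_after_attempt(3),
        wait=wait_exponential(multiplier=1, min=2, max=10),
        reraise=True,
    )
    def _connect(self, host: str, port: int) -> mysql.connector.MySQLConnection:
        # Pass the scoped credentials; without them the probe has no account to log in with.
        return mysql.connector.connect(
            host=host, port=port, connection_timeout=5, **self.credentials
        )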

    def _global_gtids(self, conn) -> tuple[str, str]:
        """Return (gtid_executed, gtid_purged) from a source."""
        with conn.cursor(dictionary=True) as cur:
            cur.execute(
                "SELECT @@GLOBAL.gtid_executed AS executed, "
                "@@GLOBAL.gtid_purged AS purged"
            )
            row = cur.fetchone() or {}
            return row.get("executed", ""), row.get("purged", "")

    def _servable_set(self, conn, executed: str, purged: str) -> str:
        """Transactions this source can still stream = executed MINUS purged.
        Uses server-side set arithmetic; string ops are NOT valid on GTID sets."""
        with conn.cursor() as cur:
            cur.execute("SELECT GTID_SUBTRACT(%s, %s)", (executed, purged))
            (servable,) = cur.fetchone()
            return servable or ""

    def _is_subset(self, conn, needed: str, servable: str) -> bool:
        """True when GTID_SUBSET(needed, servable) — the candidate holds everything."""
        if not needed:
            return True
        with conn.cursor() as cur:
            cur.execute("SELECT GTID_SUBSET(%s, %s) AS ok", (needed, servable))
            (ok,) = cur.fetchone()
            return bool(ok)

    def qualify(self, needed: str) -> Candidate | None:
        """First candidate whose still-servable set contains the needed set."""
        for cand in self.candidates:
            try:
                conn = self._connect(cand.host, cand.port)
            except Error as exc:
                logger.warning("candidate %s unreachable: %s", cand.name, exc)
                continue
            try:
                executed, purged = self._global_gtids(conn)
                servable = self._servable_set(conn, executed, purged)
                if self._is_subset(conn, needed, servable):
                    logger.info("candidate %s qualifies for needed set", cand.name)
                    return cand
                logger.info("candidate %s has purged part of the needed set", cand.name)
            finally:
                conn.close()
        return None

    def reroute(
        self, replica_conn, needed: str, break_signature: str, *, dry_run: bool = False
    ) -> RouteOutcome:
        """Idempotent: repoint the archiving replica at a qualifying source."""
        if (rec := self._handoffs.get(break_signature)) is not None:
            logger.info("break %s already handled -> %s", break_signature, rec.outcome.value)
            return RouteOutcome.ALREADY_HANDLED

        if (chosen := self.qualify(needed)) is None:
            logger.error("no live source holds needed set; degrading to base backup")
            self._handoffs[break_signature] = HandoffRecord(
                break_signature, "<none>", needed, RouteOutcome.DEGRADED_TO_BACKUP
            )
            return RouteOutcome.DEGRADED_TO_BACKUP

        if dry_run:
            logger.info("[dry-run] would reroute to %s (auto-position)", chosen.name)
            return RouteOutcome.NO_ACTION

        with replica_conn.cursor() as cur:
            cur.execute("STOP REPLICA")
            # 8.0.23+ spelling; SOURCE_AUTO_POSITION negotiates the start GTID.
            cur.execute(
                "CHANGE REPLICATION SOURCE TO "
                "SOURCE_HOST = %s, SOURCE_PORT = %s, SOURCE_AUTO_POSITION = 1",
                (chosen.host, chosen.port),
            )
            cur.execute("START REPLICA")
        replica_conn.commit()

        self._handoffs[break_signature] = HandoffRecord(
            break_signature, chosen.name, needed, RouteOutcome.REROUTED
        )
        logger.info("rerouted archiving stream to %s", chosen.name)
        return RouteOutcome.REROUTED

The break_signature is the idempotency key: a stable hash of the failed source, the terminating error code, and the needed-set digest. Re-running reroute for the same signature short-circuits, so a repeated call for the same break never fires a second CHANGE REPLICATION SOURCE. That guarantee survives a controller restart only if the handoff records are persisted; the listing keeps _handoffs in process memory and writes the record after the switch, so a production controller should journal the signature durably before it issues STOP REPLICA. The needed set itself is computed by the caller as GTID_SUBTRACT of the failed source’s gtid_executed and the archived GTID set, then verified against each candidate’s still-servable set. Subtracting gtid_purged on the candidate is the crucial step: it stops the controller from repointing at a source that executed the transactions but has already rotated them away. Retention windows on those candidates therefore bound how long fallback stays possible; align them with Binlog Retention Boundaries.
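One way to build the signature and make it survive a restart is sketched below. Everything here is an assumption layered on the listing above: the 16-character truncation, the HandoffJournal class, its JSON file layout, and the "pending" / "complete" state names are all hypothetical. The digest also assumes the needed set arrives in the canonical text form the server returns, since two differently written but equal sets would hash differently.

```python
from __future__ import annotations

import hashlib
import json
import os
from pathlib import Path


def break_signature(failed_source: str, errno: int, needed_gtids: str) -> str:
    """Stable key for one break: source + error code + digest of the needed set."""
    normalised = "".join(needed_gtids.split()).lower()
    digest = hashlib.sha256(normalised.encode()).hexdigest()
    key = f"{failed_source}|{errno}|{digest}".encode()
    return hashlib.sha256(key).hexdigest()[:16]


class HandoffJournal:
    """File-backed record of handoffs: mark 'pending' before the switch and
    'complete' after it, so a restarted controller can see both states."""

    def __init__(self, path: Path):
        self.path = path

    def _load(self) -> dict[str, dict]:
        if not self.path.exists():
            return {}
        return json.loads(self.path.read_text())

    def _store(self, data: dict[str, dict]) -> None:
        tmp = self.path.with_suffix(".tmp")
        with open(tmp, "w") as fh:
            json.dump(data, fh)
            fh.flush()
            os.fsync(fh.fileno())   # force the bytes to disk before renaming
        os.replace(tmp, self.path)  # swap the new file into place in one step

    def state(self, signature: str) -> str | None:
        return self._load().get(signature, {}).get("state")

    def mark(self, signature: str, state: str, chosen_source: str) -> None:
        data = self._load()
        data[signature] = {"state": state, "chosen_source": chosen_source}
        self._store(data)
```

A controller using this would call mark(sig, "pending", ...) before STOP REPLICA and mark(sig, "complete", ...) after START REPLICA; on restart, a "pending" entry signals that the switch may already have been issued and the replica’s current source should be inspected before anything is repeated.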

Configuration Reference

The settings below shape whether a source can be a fallback target and how much slack the controller has before a break becomes unrecoverable. Values assume MySQL 8.0.23+. The lowercase names are system variables (replica_net_timeout is the 8.0.26+ name; earlier 8.0 releases call it slave_net_timeout). SOURCE_CONNECT_RETRY, SOURCE_RETRY_COUNT, and SOURCE_HEARTBEAT_PERIOD are options of CHANGE REPLICATION SOURCE TO, set on the archiving replica rather than in the server configuration.

Setting	Type	Default	Recommended	PITR / fallback impact
`gtid_mode`	enum	`OFF`	`ON`	Mandatory: server-independent routing is impossible without GTIDs.
`enforce_gtid_consistency`	enum	`OFF`	`ON`	Rejects statements that cannot be logged safely under GTIDs, keeping the set arithmetic valid.
`binlog_format`	enum	`ROW`	`ROW`	Self-contained events replay identically after a source switch; `STATEMENT` may diverge.
`binlog_expire_logs_seconds`	integer	`2592000`	Size to exceed max archiver outage	Sets how long a candidate keeps logs; too short and `gtid_purged` overtakes the needed set before failover completes.
`SOURCE_CONNECT_RETRY`	integer (seconds)	`60`	`10`	Shorter retry interval on the replica shortens time-to-detect a dead source.
`SOURCE_RETRY_COUNT`	integer	`86400`	`3`	Fail fast to the controller instead of the replica silently retrying a dead source for days.
`SOURCE_HEARTBEAT_PERIOD`	decimal (seconds)	half of `replica_net_timeout`	`2.0`	Faster heartbeat loss detection so the break signal reaches the controller sooner.
`replica_net_timeout`	integer (seconds)	`60`	`15`	Caps how long a stalled read blocks before the I/O thread errors and routing can trigger.

The single most important tuning relationship is between binlog_expire_logs_seconds on the candidates and the worst-case archiver outage: the retention horizon must comfortably exceed the longest plausible break, or a fallback target will have purged the needed GTIDs by the time the controller reaches it.
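As a rough worked example of that relationship, the helper below computes the headroom left after reserving a multiple of the worst-case outage. The function name and the safety factor of 2 are illustrative assumptions, not recommended values.

```python
def retention_margin_seconds(
    binlog_expire_logs_seconds: int,
    worst_case_outage_seconds: int,
    safety_factor: float = 2.0,
) -> float:
    """Headroom between candidate retention and the outage budget.

    A positive result means the candidate should still hold the needed
    GTIDs after the longest plausible break; zero or negative means the
    retention horizon is too short for the chosen safety factor.
    """
    return binlog_expire_logs_seconds - worst_case_outage_seconds * safety_factor


# Default 30-day retention against a 6-hour worst-case outage.
margin = retention_margin_seconds(2_592_000, 6 * 3600)
```

With the default 30-day retention and a six-hour outage budget the margin is large; with one-day retention and a twelve-hour budget it drops to zero, which is the point at which a candidate can no longer be relied on.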

Validation & Verification Gates

A reroute is not “done” when START REPLICA returns — it is done when the new stream provably continues the old one without a gap. Each gate below runs before the controller declares the handoff healthy and resumes normal monitoring.

Pre-switch subset gate. GTID_SUBSET(needed, servable) on the chosen candidate must return 1. This is the load-bearing check: it is the difference between a valid failover and a permanent hole, and it runs before any STOP REPLICA.
Purge-race gate. Re-read the candidate’s gtid_purged immediately before repointing and re-run the subtraction. A candidate that purged part of the needed set between qualification and handoff must be dropped, not followed.
Post-switch continuity diff. After START REPLICA, poll SHOW REPLICA STATUS until Retrieved_Gtid_Set on the replica joins contiguously to the last archived GTID. A discontinuity means the auto-position negotiation started later than expected — halt and degrade.
Dry-run replay. For the archive segments written just after the switch, run a bounded mysqlbinlog --include-gtids=<boundary> dry-run to confirm the seam replays cleanly before the segments are trusted; the coordinate mechanics are shared with Timestamp Targeting Strategies.
Manifest reconciliation. The handoff record’s applied set must reconcile with the archive manifest so an operator can prove, after the incident, exactly which source served which GTID range.

-- MySQL 8.0.23+: prove a candidate holds the needed set before rerouting.
-- Returns 1 only if every needed GTID is still servable (executed AND not purged).
SELECT GTID_SUBSET(
    @needed_set,
    GTID_SUBTRACT(@@GLOBAL.gtid_executed, @@GLOBAL.gtid_purged)
) AS candidate_qualifies;
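The post-switch continuity diff can be approximated client-side once the retrieved set has been parsed into intervals. The sketch below is a simplification under stated assumptions: it handles a single source UUID, the function name is hypothetical, and the intervals are assumed to have been extracted from Retrieved_Gtid_Set by the caller.

```python
from __future__ import annotations


def seam_is_contiguous(
    last_archived_gno: int, retrieved: list[tuple[int, int]]
) -> bool:
    """Check that retrieved intervals continue the archive without a hole.

    `retrieved` holds inclusive (start, end) transaction-number ranges for
    one source UUID. The seam is contiguous when the ranges, taken in
    order, cover every number from last_archived_gno + 1 up to the highest
    number received so far. An empty list returns True: nothing has been
    received yet, so the caller should keep polling.
    """
    expected = last_archived_gno + 1
    for lo, hi in sorted(retrieved):
        if lo > expected:
            return False          # transactions expected..lo-1 are missing
        expected = max(expected, hi + 1)
    return True
```

Overlap with already-archived transactions is tolerated here, because re-received GTIDs are duplicates that can be skipped, whereas a missing range is a gap that cannot be repaired from this stream.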

Error Handling & Failure Modes

Fallback is driven by error codes, so mapping each terminating signal to the correct route is the core of the controller. The table below classes each failure as reroute (a live source can rescue it), degrade (no live source can, so go to base backup), or transient (retry the same source before routing at all). Rows marked deterministic are replay-harness faults listed for completeness; they are not routing events.

Symptom / signal	Root cause	Class	Route
`ERROR 1236 (HY000): ... master has purged binary logs containing GTIDs the slave requires`	Source’s `gtid_purged` overtook the needed set	Degrade unless another source still serves the set	`qualify()`; if none, base-backup replay
`ERROR 1236 (HY000): Could not find first log file name in binary log index file`	Requested log purged/missing on a non-GTID or misconfigured source	Reroute	Switch to a source with auto-position and the needed set
`ER_SLAVE_RELAY_LOG_WRITE_FAILURE (1595)`	Local relay log disk full or corrupt on the archiving replica	Transient then reroute	Free space / recover relay log; if unrecoverable, rebuild from a qualifying source
`ER_SLAVE_RELAY_LOG_READ_FAILURE (1594)`	Corrupt relay log segment	Reroute	Reset relay log and re-fetch via auto-position from a qualifying source
`ER_NET_READ_ERROR (1158)` / `replica_net_timeout` elapsed	Network partition or stalled source I/O	Transient then reroute	Retry within timeout budget, then `qualify()` an alternate
`ER_CANT_SET_GTID_NEXT_WHEN_OWNING_GTID (1790)`	Aborted transaction left `gtid_next` owned during manual replay	Deterministic	Reset session `gtid_next`; never auto-reroute around a stuck replay
`ER_GTID_NEXT_TYPE_UNDEFINED_GROUP (1837)`	Applying events with `gtid_next` in an inconsistent state	Deterministic	Fix replay harness; not a routing event

def classify(errno: int) -> RouteOutcome:
    """Map a terminating MySQL error to a default routing decision (8.0.23+ codes)."""
    match errno:
        case 1236:
            # 1236 covers both purge-loss and a missing log file, and only a
            # qualify() pass can tell whether another source still serves the
            # needed set. Degrade is the conservative default until it does.
            return RouteOutcome.DEGRADED_TO_BACKUP
        case 1594 | 1595 | 1158 | 2013:  # relay-log read/write, net read, lost connection
            return RouteOutcome.REROUTED
        case 1790 | 1837:                # gtid_next state faults, not routable
            return RouteOutcome.NO_ACTION
        case _:
            return RouteOutcome.NO_ACTION

ERROR 1236 is the signal that defines this entire discipline: when a source reports it has purged binary logs containing GTIDs the archive still needs, no amount of retrying that source will help — the controller must find a different source that still holds the set, and only when none exists does it degrade to base-backup replay. Deeper break-recovery playbooks for asynchronous topologies live in Designing Fallback Routing for Async Replication Breaks, and the transport-side backoff and dead-letter policy is shared with Error Handling & Retry Logic.
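The decision described above, including the retry-first behaviour for transient faults, can be sketched as a classifier separate from RouteOutcome. This is an illustration under assumptions: FailureClass, classify_failure and the "ERROR_1236" label are hypothetical names, the keys are the symbolic error names used in the table, and translating an error number into its symbol is left to the caller.

```python
from __future__ import annotations

from enum import Enum


class FailureClass(Enum):
    REROUTE = "reroute"              # a qualifying live source can rescue it
    DEGRADE = "degrade"              # no live source can; go to base backup
    TRANSIENT = "transient"          # retry the same source first
    DETERMINISTIC = "deterministic"  # not a routing event; needs a fix


_RETRY_FIRST = {"ER_SLAVE_RELAY_LOG_WRITE_FAILURE", "ER_NET_READ_ERROR"}
_REROUTE = {"ER_SLAVE_RELAY_LOG_READ_FAILURE"}
_NOT_ROUTABLE = {
    "ER_CANT_SET_GTID_NEXT_WHEN_OWNING_GTID",
    "ER_GTID_NEXT_TYPE_UNDEFINED_GROUP",
}


def classify_failure(
    symbol: str, *, candidate_qualifies: bool, retries_left: int = 0
) -> FailureClass:
    """Pick a failure class from the error symbol and the current context."""
    if symbol in _NOT_ROUTABLE:
        return FailureClass.DETERMINISTIC
    if symbol in _RETRY_FIRST and retries_left > 0:
        return FailureClass.TRANSIENT
    if symbol == "ERROR_1236" or symbol in _RETRY_FIRST or symbol in _REROUTE:
        # Rerouting is only valid when some candidate holds the needed set.
        if candidate_qualifies:
            return FailureClass.REROUTE
        return FailureClass.DEGRADE
    # Unknown errors are never routed around automatically.
    return FailureClass.DETERMINISTIC
```

Passing candidate_qualifies in as an argument keeps the function pure: the caller runs the qualification pass once and the classifier only decides what that result implies.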

Observability & Alerting

The controller must make the decision surface observable, not just the outcome — an operator needs to see how close a break came to being unrecoverable. Emit one structured log line per routing decision with stable fields: break_signature, failed_source, chosen_source, needed_gtid_count, qualify_ms, outcome, and purge_margin_seconds (how much retention headroom the winning candidate had left).
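A minimal sketch of such a log line, using only the standard library, might look like this. The function name is hypothetical; the field names are the ones listed above, and chosen_source and purge_margin_seconds are allowed to be None for the degrade case, which is an assumption about how a missing winner would be reported.

```python
from __future__ import annotations

import json


def routing_decision_line(
    *,
    break_signature: str,
    failed_source: str,
    chosen_source: str | None,
    needed_gtid_count: int,
    qualify_ms: float,
    outcome: str,
    purge_margin_seconds: float | None,
) -> str:
    """Serialise one routing decision as a single JSON log line."""
    return json.dumps(
        {
            "break_signature": break_signature,
            "failed_source": failed_source,
            "chosen_source": chosen_source,
            "needed_gtid_count": needed_gtid_count,
            "qualify_ms": qualify_ms,
            "outcome": outcome,
            "purge_margin_seconds": purge_margin_seconds,
        },
        sort_keys=True,  # stable key order keeps lines easy to diff and grep
    )
```

Keyword-only arguments make it hard to emit a line with a field missing or swapped, and the result can be passed straight to logger.info.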

-- MySQL 8.0.23+: replica-side view of stream health and the last break.
SELECT SERVICE_STATE, LAST_ERROR_NUMBER, LAST_ERROR_MESSAGE, LAST_ERROR_TIMESTAMP
FROM performance_schema.replication_connection_status;

-- MySQL 8.0.23+: transactions received but not yet applied (apply lag as a GTID set).
SELECT
    @@GLOBAL.gtid_executed AS applied_set,
    RECEIVED_TRANSACTION_SET AS retrieved_set,
    GTID_SUBTRACT(RECEIVED_TRANSACTION_SET, @@GLOBAL.gtid_executed) AS received_not_applied
FROM performance_schema.replication_connection_status;

Alert thresholds worth wiring:

Purge margin collapse — if the winning candidate’s purge_margin_seconds drops toward zero across successive breaks, retention is too tight for your outage profile; page before a future break finds no qualifying source.
Degrade-to-backup events — any DEGRADED_TO_BACKUP outcome is a paging event, not a dashboard trend: it means the pipeline had to fall back to a slower physical restore path.
Reroute frequency — a rising rate of reroutes against one source signals a flapping or dying instance that should be pulled from the candidate pool.
Handoff latency — qualify_ms climbing means candidates are slow or unreachable; the parallelism and queue-depth signals that pair with this live in Async Processing & Queue Management.
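The four thresholds above can be evaluated over a window of recent routing decisions. The sketch below is only one way to wire them: RoutingWindow, evaluate_alerts, the alert names, and every default threshold are hypothetical and would need tuning against a real outage profile.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass(frozen=True)
class RoutingWindow:
    """Routing decisions observed in one evaluation window, oldest first."""
    purge_margins_s: list[float] = field(default_factory=list)
    outcomes: list[str] = field(default_factory=list)
    reroutes_by_source: dict[str, int] = field(default_factory=dict)
    qualify_ms: list[float] = field(default_factory=list)


def evaluate_alerts(
    window: RoutingWindow,
    *,
    min_margin_s: float = 3600.0,
    max_reroutes: int = 3,
    max_qualify_ms: float = 2000.0,
) -> list[str]:
    """Return the names of the alerts that should fire for this window."""
    alerts: list[str] = []
    margins = window.purge_margins_s
    # Purge margin collapse: shrinking across breaks and now below the floor.
    if len(margins) >= 2 and margins[-1] < margins[0] and margins[-1] < min_margin_s:
        alerts.append("purge_margin_collapse")
    # Any degrade-to-backup outcome pages immediately.
    if "degraded_to_backup" in window.outcomes:
        alerts.append("degraded_to_backup")
    # Repeated reroutes away from one source suggest a flapping instance.
    for source, count in sorted(window.reroutes_by_source.items()):
        if count > max_reroutes:
            alerts.append(f"flapping_source:{source}")
    # Slow qualification: the most recent pass exceeded the latency budget.
    if window.qualify_ms and window.qualify_ms[-1] > max_qualify_ms:
        alerts.append("slow_qualification")
    return alerts
```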

Frequently Asked Questions

Why can’t fallback routing use SOURCE_LOG_FILE and SOURCE_LOG_POS?

Because a file name and byte offset are meaningful only on the server that wrote them. mysql-bin.000041 position 10495 on the primary points at completely unrelated data on a replica, so repointing an archiving stream by file/position after a break will read the wrong transactions — or fail outright. GTIDs are globally unique and server-independent, so SOURCE_AUTO_POSITION = 1 lets the new source negotiate exactly which transactions the archive still needs. File/position routing across sources is how silent gaps are introduced.

What does ERROR 1236 about purged GTIDs actually mean for fallback?

It means the source you tried to stream from has already rotated away binary logs containing GTIDs your archive has not yet captured — its gtid_purged overtook your needed set. Retrying that source is futile. The correct response is to qualify every other candidate with GTID_SUBSET(needed, GTID_SUBTRACT(gtid_executed, gtid_purged)) and reroute to one that still holds the full set. Only if no live source qualifies do you degrade to a base-backup restore plus binlog replay, which is slower and widens RTO.

How does the controller avoid a double reroute if it restarts mid-handoff?

Every break is reduced to a stable break_signature (failed source + terminating error + needed-set digest) that is checked at the top of reroute. For this to survive a restart, the signature has to be written to durable storage before the switch: a controller that crashes after CHANGE REPLICATION SOURCE but before recording completion then sees the same signature on restart and short-circuits rather than issuing a second switch. The reference listing keeps its records in process memory and writes them after the switch, so on its own it covers repeated calls within one process only. SOURCE_AUTO_POSITION, which is itself idempotent about where the stream resumes, makes the handoff safe to retry.

Is fallback routing safe with STATEMENT or MIXED binlog format?

It is risky. ROW events carry the exact before/after image and replay identically on any target, so a mid-stream source switch is deterministic. STATEMENT and MIXED events depend on session context (variables, non-deterministic functions) that may not reproduce on the new source, so a reroute can apply subtly different data. If you must run mixed formats, add a format-continuity gate that rejects a handoff across a format boundary; standardizing on ROW removes the hazard entirely.

Related

Designing Fallback Routing for Async Replication Breaks — the detailed topology playbook for recovering asynchronous replication when a source dies mid-stream.
GTID Tracking & Enforcement — the gap-free identifier foundation every routing decision on this page depends on.
Binlog Retention Boundaries — how long a candidate stays a valid fallback target before gtid_purged overtakes the needed set.
Base Backup Integration for PITR — the graceful-degradation path when no live source can serve the needed transactions.
