Tracking Validation Failures with MongoDB Atlas Alerts

When a $jsonSchema validator runs in validationAction: "warn" mode, MongoDB accepts every non-compliant write and records the violation only in the mongod diagnostic log — so the failure never reaches your driver, your application logs, or your on-call rotation. This page is the alerting layer of the async validation monitoring dashboards workflow, and within the broader Automated Schema Enforcement & Monitoring framework it answers one precise operational question: how do you turn a silent, log-only validation warning into a routed, deduplicated MongoDB Atlas alert before drift corrupts a downstream pipeline? The outcome you leave with is a log-based Atlas alert keyed on message id 51803, a copy-paste diagnostic for confirming the signal, and a rollback path that never restarts a node.

Operational Mechanics and Write-Path Impact

The alert you build depends entirely on which validationAction the collection runs, because that single setting determines whether a failure produces a driver-visible WriteError or a log-only warning. There is no server metric named “validation failures”; Atlas cannot alert on a number that MongoDB does not emit. The only authoritative source for warn-mode violations is the mongod diagnostic log, and the system.profile collection does not capture them. The table below fixes exactly where the signal lives in each mode.

`validationAction`	Write outcome	Driver signal	Diagnostic-log entry	Where the alert must source from
`error` (default)	Rejected	`WriteError` code `121`, `DocumentValidationFailure`	Not written for the rejection	Application error telemetry, or the same log id if you also ingest logs
`warn`	Accepted	None — driver sees success	`"s": "W"`, `"id": 51803`, `"msg": "Document failed validation"`	`mongod` log ingestion only
Not set / `strict` level	Rejected on every write	`WriteError` code `121`	Not written	Application error telemetry

The consequence is that a team running warn during a phased rollout — the safe promotion path described in implementing collection-level validators — is completely blind unless it alerts on log id 51803. Because warn acknowledges the write, latency and error-rate dashboards stay green while non-compliant documents accumulate. The alert has to be built on log ingestion, not on any write-success metric, and it should be scoped to the exact namespace under migration so a noisy neighbour collection cannot mask the signal you care about.

Exact Diagnostic Fingerprints and Fast Resolution

The canonical warn-mode signature is a mongod log line at severity W carrying "id": 51803 and "msg": "Document failed validation". Its attr.validationErrors array names the failing operatorName and the offending BSON path, which is what lets you correlate a spike to a specific schema change:

{
  "t": {"$date": "2026-05-28T08:42:11.302Z"},
  "s": "W",
  "c": "STORAGE",
  "id": 51803,
  "ctx": "conn48291",
  "msg": "Document failed validation",
  "attr": {
    "ns": "analytics.events_v2",
    "validationErrors": [
      {"operatorName": "required", "specifiedAs": {"required": ["metadata.tenant_id"]}, "missingProperties": ["metadata.tenant_id"]},
      {"operatorName": "bsonType", "specifiedAs": {"bsonType": "date"}, "reason": "type did not match", "consideredType": "string"}
    ]
  }
}

To confirm the signal on a log export before you wire any alert, filter for the id and the namespace with jq. This is the fastest way to prove the failures are real and to see which operator is firing:

jq -c 'select(.id == 51803 and .attr.ns == "analytics.events_v2")
       | {ts: .t["$date"], errs: [.attr.validationErrors[].operatorName]}' \
  mongod.log

Expected output is one compact line per warn-mode violation, for example {"ts":"2026-05-28T08:42:11.302Z","errs":["required","bsonType"]}. If jq returns nothing while you know documents are non-compliant, the collection is almost certainly in error mode — in that case the failures surface as pymongo.errors.WriteError (code 121) at the driver instead, and you should reach for the taxonomy in categorizing schema validation errors rather than a log alert. The three fingerprints most teams misread: NumberLong vs double coercion drift, an ObjectId arriving as a string, and a $set partial update that omits a required nested field, since MongoDB evaluates the validator against the full post-update document.

Step-by-Step Playbook

The following sequence stands up a log-based Atlas alert and a parser that categorizes the failures behind it. Every step is verifiable on its own.

Confirm the collection is in warn mode. In mongosh, run db.getCollectionInfos({name: "events_v2"})[0].options.validationAction. Expected output: warn. If it returns error or undefined, warn-mode alerting does not apply — those failures are already reaching the driver.
Create the log-based alert in Atlas. In the Atlas UI go to Alerts → Create Alert → Log-Based Alert and paste the exact filter below. Scoping to the namespace keeps a single migrating collection from being drowned out:
```
logId:51803 AND attr.ns:"analytics.events_v2"
```
Set the window and threshold. Use an evaluation window of 5 minutes and a threshold of >= 1 occurrence, routed to a dedicated incident channel at severity P2. One occurrence is deliberate: during a rollout a single warn-mode failure means the new contract is already rejecting live traffic. For steady-state collections, raise the threshold to suppress fatigue. Webhook payload and deduplication rules are documented in the MongoDB Atlas Alerts reference.

Parse the exported log stream. Point an asynchronous consumer at the Atlas log export (or your Datadog/Splunk stream) so each alert arrives with the failing operator already attached. Avoid synchronous parsing in any request path:

import json
import logging
from typing import Any

logger = logging.getLogger(__name__)


def parse_warn_mode_violations(log_lines: list[str], target_ns: str) -> list[dict[str, Any]]:
    """
    Parse mongod diagnostic log lines exported from Atlas to extract warn-mode
    validation failures (log message id 51803) for a given namespace.
    """
    failures: list[dict[str, Any]] = []
    for line in log_lines:
        try:
            entry = json.loads(line)
        except json.JSONDecodeError:
            continue

        if entry.get("id") != 51803:
            continue
        attr = entry.get("attr", {})
        if attr.get("ns") != target_ns:
            continue

        for err in attr.get("validationErrors", []):
            failures.append({
                "operator": err.get("operatorName"),
                "missing": err.get("missingProperties"),
                "reason": err.get("reason"),
                "timestamp": entry.get("t", {}).get("$date"),
            })
    return failures

Verify end to end. Insert one deliberately non-compliant document into the namespace, then confirm the jq filter above returns it and the Atlas alert fires within one evaluation window. When failures exceed your SLO, hand the parsed records to a fallback validation chain that quarantines them to a staging collection instead of paging a human for every drift event.

Failure Modes & Rollback

Each step has a distinct way of failing silently. The alert filter drops every hit if the namespace string is misspelled — Atlas returns no error, just zero matches, so always validate the filter against a known bad insert. A threshold set too high on a fast-drifting collection produces a green dashboard over a growing corruption backlog; a threshold of >= 1 on a busy steady-state collection produces alert storms. And a consumer that parses logs synchronously in the write path reintroduces exactly the blocking I/O the async model exists to avoid.

If a validation spike is actively corrupting downstream data, contain it without a restart:

Contain. Promote the namespace from warn to error so further drift is rejected at the driver rather than merely logged: db.runCommand({collMod: "events_v2", validator: existingValidator, validationAction: "error"}). This takes no exclusive collection lock long enough to drop connections and requires no node restart.
Quarantine. Route incoming writes to a shadow collection at the application layer while you patch, preserving the original namespace for reads.
Reconcile. Run a background aggregation that patches non-compliant documents with explicit type casting, then, once the failure rate falls below 0.1%, revert with validationAction: "warn" and resume normal routing.

To roll the alert itself back, delete the log-based alert in Atlas → Alerts or disable its notifier; because the alert never mutates the database, removing it is a pure control-plane change with no data impact. Time to recover: the collMod containment applies in under 60 seconds, and the alert teardown is effectively instant. Automate the containment collMod as an infrastructure-as-code template so the promotion is one reviewed command rather than an ad-hoc console edit during an incident.

Frequently Asked Questions

Why can't I alert on a MongoDB metric instead of parsing logs?

Because no such metric exists. In warn mode the server acknowledges the write as a success, so there is nothing for a write-error or latency metric to count. The violation is emitted only as a mongod diagnostic-log line with "id": 51803. A log-based Atlas alert on that id is the only way to see warn-mode failures without changing validationAction.

Does log id 51803 also fire when the collection is in error mode?

No. In validationAction: "error" the write is rejected before it commits and the rejection surfaces as a driver-side WriteError with code 121; the server does not write a 51803 warning for it. If your log alert goes quiet after a promotion to error, that is expected — switch your monitoring to the driver exception path at that point.

How do I stop a single occurrence per 5 minutes from paging me constantly?

Keep the tight threshold only for collections actively under migration, where one failure is genuinely actionable. For steady-state namespaces, raise the occurrence count or widen the window, and route sub-threshold volume into a fallback validation chain that quarantines the documents automatically instead of paging a human.

Async validation monitoring dashboards — the parent workflow that produces the telemetry these alerts are built on.
Automated Schema Enforcement & Monitoring — the domain architecture spanning validators, error routing, fallback, and observability.
Categorizing schema validation errors — the taxonomy for the driver-side code 121 failures you get once a collection is promoted to error.
Building fallback validation chains — the quarantine target for documents an alert flags above your SLO.
Graceful degradation for legacy document formats — routing historical payloads that trip these same validators during a migration.