Python Integration for Schema Checks

Within the broader Automated Schema Enforcement & Monitoring framework, Python is the control plane that turns a declarative $jsonSchema contract into safe, reversible database operations. This guide is a complete implementation workflow for data engineers and platform teams: it shows how to read a collection’s live validator from PyMongo, gate every change on a real compliance count, apply collMod idempotently with bounded retries, and route the resulting WriteError failures deterministically instead of losing them in an application log. Schema drift is a silent failure vector — it compounds quietly across ingestion pipelines, microservices, and analytical jobs until an incident surfaces it. The deliverable by the end of this page is a reusable PyMongo automation surface you can drop into a CI/CD job, a Kubernetes operator reconcile loop, or a one-shot migration runner, plus the exact diagnostics and rollback commands to operate it in production.

This page is the orchestration layer that sits above the individual enforcement primitives. Where collection-level validators define the synchronous gate at the storage engine and fallback validation chains catch what that gate rejects, the Python layer decides when and how the gate itself changes shape.

Architectural Context & Enforcement Boundaries

The Python automation layer never sits on the hot write path — it is a control plane, not a data plane. Its job is to translate a version-controlled schema definition into three database actions: read the current enforcement state, compare it to the target, and transition safely. Every application write still flows through the collection’s native $jsonSchema validator; Python only governs the validator’s configuration and the routing of documents the validator rejects.

That separation matters for latency and blast radius. A synchronous validator adds cost to every insert and update, so the control plane must never widen that gate carelessly — flipping validationAction from warn to error on a collection full of legacy data turns a routine deploy into a write outage. The Python workflow therefore front-loads a dry-run compliance count before any enforcement change, and it hashes the schema to guarantee a repeated run is a genuine no-op rather than a redundant collMod that still takes an exclusive metadata lock. For the exact keyword semantics the schema itself must satisfy, see understanding MongoDB $jsonSchema syntax; for the tradeoff between the two enforcement levels this layer sets, see strict vs moderate validation levels.

The control loop is a deterministic five-stage sequence, and every stage is observable:

1. Fetch current state: read the active validator and validationLevel from the collection’s options via listCollections.
2. Compute structural diff: hash the normalized target and active schemas; equal hashes short-circuit the whole loop.
3. Dry-run compliance count: count documents that would fail the target schema using $nor + $jsonSchema, without mutating cluster state.
4. Apply safe transition: invoke collMod with explicit validationLevel and validationAction, with retry and backoff.
5. Emit structured telemetry: log the schema hash, diff metrics, and outcome for audit and alerting.

The control plane runs the same five observable stages on every deploy — stages 1–3 are read-only; only stage 4 mutates cluster state, and only when the diff demands it.
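The structural diff in stage 2 can be sketched in a few lines. This is a minimal illustration, not the guide's full class: the helper name `schema_hash` and the three sample schemas are hypothetical, and the normalization is the same sorted-key JSON rendering used later in this guide.

```python
import hashlib
import json


def schema_hash(schema: dict) -> str:
    """SHA-256 over a key-order-normalized JSON rendering of the schema."""
    normalized = json.dumps(schema, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


# Hypothetical schemas: same content, different key order.
active = {"bsonType": "object", "required": ["sku", "price"]}
target = {"required": ["sku", "price"], "bsonType": "object"}

# A genuine change: one more required field.
changed = {"bsonType": "object", "required": ["sku", "price", "currency"]}

print(schema_hash(active) == schema_hash(target))   # True: key order is normalized away
print(schema_hash(active) == schema_hash(changed))  # False: the contract really changed
```

Sorting keys normalizes object key order only. Array order, such as the order of entries in `required`, still affects the hash, so a registry that reorders arrays would register as a change.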

Prerequisites & Operational Requirements

The workflow below targets a supported production topology. Confirm each of the following before pointing the automation at a live collection.

MongoDB version: 5.0 or later. The rich details.schemaRulesNotSatisfied object in validation errors was introduced in 5.0; on 4.x you receive only code: 121 with a generic message and cannot pinpoint the failing JSON path from Python.
Driver: PyMongo 4.x — pin it (pip install "pymongo>=4.6,<5") so collMod argument handling and the WriteError / BulkWriteError class hierarchy stay stable across CI images.
Client-side validation library: the jsonschema package (pip install "jsonschema>=4.0") for pre-flight checks that reject payloads before they reach the driver. Note that jsonschema speaks JSON Schema draft types, not BSON types, so it validates shape client-side while the server validator remains the authority on bsonType.
Permissions: the automation principal needs the collMod action on the target collection (granted by the built-in dbAdmin role) plus find (read) for the dry-run count. Nothing more — do not run schema automation with cluster-admin credentials.
Environment: a replica set, not a standalone. collMod propagates through the oplog, so the runner should be aware of secondary replication lag and run against the primary.
Schema source of truth: the target schema comes from a version-controlled registry, never a hand-edit on the server. Pinning schema versions in the registry is what makes each deployment auditable and aligns the automation with your schema versioning strategies.

The two dials the automation sets are validationLevel (which documents are checked) and validationAction (what happens on failure). Their combined behavior is the entire safety surface of a rollout:

`validationAction`	`validationLevel`	Checks applied to	On failure
`warn`	`moderate`	Inserts + updates to already-valid docs	Logs a warning; write succeeds
`warn`	`strict`	All inserts and updates	Logs a warning; write succeeds
`error`	`moderate`	Inserts + updates to already-valid docs	Rejects with `WriteError` code 121
`error`	`strict`	All inserts and updates	Rejects with `WriteError` code 121

moderate is the migration-safe setting and the default this automation passes (the server's own default validationLevel is strict): it enforces the contract on new and already-conforming documents while leaving pre-existing non-compliant documents writable, which is exactly what you want when hardening a collection that predates the schema.
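The table can also be read as a small decision function. The sketch below is a plain-Python model of the behavior described above, written for reasoning about rollouts; the function names `is_checked` and `outcome` are hypothetical and nothing here talks to a server.

```python
def is_checked(level: str, op: str, doc_was_valid: bool) -> bool:
    """Would the validator evaluate this write at all?"""
    if level == "off":
        return False
    if level == "strict" or op == "insert":
        return True
    # moderate: an update is checked only if the existing document already conforms.
    return doc_was_valid


def outcome(level: str, action: str, op: str,
            doc_was_valid: bool, new_doc_valid: bool) -> str:
    """Model the result of one write under a given pair of dials."""
    if not is_checked(level, op, doc_was_valid) or new_doc_valid:
        return "accepted"
    return "rejected (121)" if action == "error" else "accepted with warning"


# An update to a legacy, non-compliant document:
print(outcome("moderate", "error", "update", doc_was_valid=False, new_doc_valid=False))
print(outcome("strict", "error", "update", doc_was_valid=False, new_doc_valid=False))
```

The two calls differ only in validationLevel, which is the whole difference between a quiet migration and a rejection spike on legacy data.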

Idempotent Implementation Workflow

A production deployment must be deterministic and repeatable: running it twice must not mutate collection metadata twice, and it must never take an exclusive lock it does not need. Follow this sequence, verifying each step before the next.

Read the live validator. The validator lives in the collection’s options, not in collStats (which returns storage metrics only). From mongosh:
```
db.runCommand({ listCollections: 1, filter: { name: "orders" } })
  .cursor.firstBatch[0].options
```
Diff by structural hash. Serialize both the target and the active validator with normalized key ordering, then SHA-256 each. A hash match means the deployment is a no-op — skip collMod entirely and avoid the metadata lock.
Dry-run the compliance count. $jsonSchema is a valid query operator, so $nor finds documents that would fail the proposed schema with no validator active:
```
db.orders.countDocuments({ $nor: [ { $jsonSchema: <target-schema> } ] })
```
The validate command does not help here — it checks BSON and index integrity, not schema compliance, so it cannot substitute for this count.
Apply conditionally with collMod. Only when the hashes (or the dials) diverge, issue the change with explicit validationLevel and validationAction and bounded backoff.
Phase the rollout and emit telemetry. Deploy first with validationAction: "warn", watch the rejection rate land in your async validation monitoring dashboards, and promote to error only once that rate falls below your threshold. Documents that fail the strict gate should be diverted through fallback validation chains rather than dropped.

Idempotency is the safety mechanism: when the hash and both dials match, the run exits early with no lock acquired. Only when the schema or the enforcement settings truly change does collMod run and take the exclusive MODE_X metadata lock.
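The dry-run count from the sequence above translates directly to PyMongo. The following is a minimal sketch, assuming `collection` is a PyMongo Collection (anything exposing `count_documents(filter)` works); the helper names and the `max_failures` threshold are hypothetical.

```python
from typing import Any, Dict


def noncompliance_filter(target_schema: Dict[str, Any]) -> Dict[str, Any]:
    """Build a query matching documents that would FAIL target_schema."""
    return {"$nor": [{"$jsonSchema": target_schema}]}


def safe_to_enforce(collection: Any, target_schema: Dict[str, Any],
                    max_failures: int = 0) -> bool:
    """Read-only gate: True when at most max_failures documents would be rejected."""
    failing = collection.count_documents(noncompliance_filter(target_schema))
    return failing <= max_failures


# Usage (hypothetical database and collection names):
#   orders = client["shop"]["orders"]
#   if safe_to_enforce(orders, target_schema):
#       ...promote validationAction from "warn" to "error"...
```

Because the gate only issues a count, it can run on every deploy without touching collection metadata.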

Production-Ready Automation Implementation

The following PyMongo class performs the full control loop: it verifies the connection, reads the live validator, hashes both sides for idempotency, runs a client-side dry-run against a document sample, and applies the change with bounded exponential backoff. It is safe to run repeatedly from a pipeline or a reconcile loop. Teams commonly extend it with the payload-level pre-flight and dead-letter routing covered in the PyMongo validation wrapper scripts.

import hashlib
import json
import logging
import time
from typing import Any, Dict, List, Optional

import pymongo
from jsonschema import Draft7Validator
from pymongo.errors import (
    ConfigurationError,
    OperationFailure,
    PyMongoError,
    ServerSelectionTimeoutError,
)

logger = logging.getLogger(__name__)


class SchemaValidator:
    """Control-plane wrapper for idempotent $jsonSchema deployment via PyMongo."""

    def __init__(self, client: pymongo.MongoClient, db_name: str, collection_name: str):
        self.db = client[db_name]
        self.collection = self.db[collection_name]
        self._ensure_connection()

    def _ensure_connection(self) -> None:
        try:
            self.db.command("ping")
        except ServerSelectionTimeoutError as exc:
            logger.critical("MongoDB connection failed: %s", exc)
            # Chain the original exception so the root cause stays in the traceback.
            raise ConfigurationError("Unable to reach MongoDB for schema validation.") from exc

    def _get_current_validator(self) -> Optional[Dict[str, Any]]:
        """Read the active $jsonSchema from the collection's options.

        Uses listCollections because the validator lives in options, not in
        collStats (which returns storage metrics only).
        """
        try:
            info = self.db.command("listCollections", filter={"name": self.collection.name})
            batch = info["cursor"]["firstBatch"]
            if not batch:
                return None
            validator = batch[0].get("options", {}).get("validator", {})
            return validator.get("$jsonSchema")
        except OperationFailure as exc:
            logger.error("Failed to fetch validator metadata: %s", exc)
            return None

    @staticmethod
    def _compute_schema_hash(schema: Optional[Dict[str, Any]]) -> Optional[str]:
        """Deterministic SHA-256 over a normalized schema for idempotency checks."""
        if not schema:
            return None
        normalized = json.dumps(schema, sort_keys=True, separators=(",", ":"), default=str)
        return hashlib.sha256(normalized.encode("utf-8")).hexdigest()

    def validate_sample(
        self, target_schema: Dict[str, Any], sample_size: int = 100
    ) -> List[Dict[str, Any]]:
        """Client-side dry-run: validate a random sample against the target schema.

        Returns a list of {_id, errors} for documents that would be rejected.
        """
        violations: List[Dict[str, Any]] = []
        try:
            documents = list(self.collection.aggregate([{"$sample": {"size": sample_size}}]))
            checker = Draft7Validator(target_schema)
            for doc in documents:
                errors = [e.message for e in checker.iter_errors(doc)]
                if errors:
                    violations.append({"_id": str(doc.get("_id")), "errors": errors})
        except OperationFailure as exc:
            logger.warning("Sample extraction failed: %s", exc)
        return violations

    def apply_validator(
        self,
        target_schema: Dict[str, Any],
        validation_level: str = "moderate",
        validation_action: str = "warn",
        max_retries: int = 3,
    ) -> Dict[str, Any]:
        """Idempotently apply a $jsonSchema validator with bounded backoff."""
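        current_schema = self._get_current_validator()
        target_hash = self._compute_schema_hash(target_schema)
        current_hash = self._compute_schema_hash(current_schema)

        # Compare the two dials as well as the schema. Without this, promoting
        # warn -> error with an unchanged schema would be skipped as a no-op.
        info = self.db.command("listCollections", filter={"name": self.collection.name})
        batch = info["cursor"]["firstBatch"]
        options = batch[0].get("options", {}) if batch else {}
        # When unset, the server defaults are strict / error.
        dials_match = (
            options.get("validationLevel", "strict") == validation_level
            and options.get("validationAction", "error") == validation_action
        )

        if target_hash == current_hash and dials_match:
            logger.info("Validator on %s already up-to-date; skipping collMod.", self.collection.name)
            return {"applied": False, "status": "no-op", "hash": target_hash}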

        for attempt in range(1, max_retries + 1):
            try:
                self.db.command(
                    "collMod",
                    self.collection.name,
                    validator={"$jsonSchema": target_schema},
                    validationLevel=validation_level,
                    validationAction=validation_action,
                )
                logger.info(
                    "Applied validator to %s (level=%s, action=%s) on attempt %d.",
                    self.collection.name, validation_level, validation_action, attempt,
                )
                return {"applied": True, "status": "success", "hash": target_hash}
            except OperationFailure as exc:
                # Authorization failures (code 13) do not heal on retry, so fail fast.
                if exc.code == 13 or attempt == max_retries:
                    logger.critical("collMod failed on attempt %d of %d: %s", attempt, max_retries, exc)
                    raise
                backoff = 2 ** attempt
                logger.warning("collMod attempt %d failed; retrying in %ds.", attempt, backoff)
                time.sleep(backoff)
            except PyMongoError as exc:
                # ServerSelectionTimeoutError and other connection errors are
                # PyMongoError subclasses and land here.
                logger.error("Non-retryable error during collMod: %s", exc)
                raise

        return {"applied": False, "status": "failed", "hash": target_hash}

Four safeguards make this safe to automate: structural hashing prevents a redundant collMod from taking an exclusive lock when the payload is byte-for-byte identical; bounded retries absorb transient OperationFailure states while letting connection and auth errors fail fast; the client-side sample surfaces the exact violating documents before enforcement flips; and decoupled level/action let you ship the schema in warn mode and change enforcement later without touching the schema definition.
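When sizing a CI job timeout it helps to know the worst-case time spent sleeping between attempts. The sketch below reproduces the `2 ** attempt` schedule from the class above as a pure function; the name `backoff_schedule` is hypothetical.

```python
from typing import List


def backoff_schedule(max_retries: int, base: int = 2) -> List[int]:
    """Seconds slept after each failed, non-final attempt (base ** attempt)."""
    return [base ** attempt for attempt in range(1, max_retries)]


schedule = backoff_schedule(3)
print(schedule, sum(schedule))  # [2, 4] 6
```

With the default of three attempts the runner sleeps at most six seconds in total, in addition to the time each collMod call itself takes.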

Diagnostic Fingerprints & Fast Resolution

When the server validator rejects a write, PyMongo raises pymongo.errors.WriteError (or BulkWriteError for batch operations) carrying code: 121 and errmsg: "Document failed validation". On MongoDB 5.0+, exc.details["errInfo"]["details"]["schemaRulesNotSatisfied"] isolates the exact failing rule and path:

{
  "code": 121,
  "codeName": "DocumentValidationFailure",
  "errmsg": "Document failed validation",
  "errInfo": {
    "failingDocumentId": { "$oid": "..." },
    "details": {
      "operatorName": "$jsonSchema",
      "schemaRulesNotSatisfied": [
        {
          "operatorName": "bsonType",
          "specifiedAs": { "bsonType": "double" },
          "reason": "type did not match",
          "consideredType": "string",
          "consideredValue": "19.99"
        }
      ]
    }
  }
}

Parse it in Python rather than logging the raw string, so failures can be routed by rule. Cleanly categorizing schema validation errors lets you separate a transient application bug from a legacy-data migration gap:

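import logging

from pymongo.errors import WriteError

logger = logging.getLogger(__name__)

# `collection` is a pymongo Collection and `doc` is the document being written.
try:
    collection.insert_one(doc)
except WriteError as exc:
    if exc.code == 121:
        # errInfo is only populated on MongoDB 5.0+; fall back to an empty list.
        err_info = (exc.details or {}).get("errInfo", {})
        rules = err_info.get("details", {}).get("schemaRulesNotSatisfied", [])
        if not rules:
            logger.error("Document failed validation (no rule detail): %s", exc)
        for rule in rules:
            logger.error("Rule %s failed: %s", rule.get("operatorName"), rule)
    else:
        raise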
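Routing by rule is easier once the rule list is flattened into (path, operator, reason) rows. The sketch below is one possible approach, not a definitive parser: it assumes that per-field failures are nested under `propertiesNotSatisfied` entries carrying `propertyName` and `details`, and the helper name `flatten_rules`, the field names `price` and `sku`, and the sample payload are all hypothetical.

```python
from typing import Any, Dict, Iterator, List, Tuple


def flatten_rules(rules: List[Dict[str, Any]],
                  path: str = "") -> Iterator[Tuple[str, str, str]]:
    """Yield (path, operatorName, reason) for each leaf rule.

    Assumption: nested failures appear under "propertiesNotSatisfied";
    any other rule shape is reported as a leaf at the current path.
    """
    for rule in rules:
        nested = rule.get("propertiesNotSatisfied")
        if nested:
            for prop in nested:
                name = prop["propertyName"]
                child = f"{path}.{name}" if path else name
                yield from flatten_rules(prop.get("details", []), child)
        else:
            yield (path or "<root>", rule.get("operatorName", "?"), rule.get("reason", ""))


# Hypothetical payload in the shape assumed above.
sample_rules = [
    {"operatorName": "properties", "propertiesNotSatisfied": [
        {"propertyName": "price", "details": [
            {"operatorName": "bsonType", "specifiedAs": {"bsonType": "double"},
             "reason": "type did not match", "consideredType": "string",
             "consideredValue": "19.99"},
        ]},
    ]},
    {"operatorName": "required", "specifiedAs": {"required": ["sku"]},
     "missingProperties": ["sku"]},
]

rows = list(flatten_rules(sample_rules))
for row in rows:
    print(row)
```

Each row can then be used as a routing key, for example sending every `bsonType` failure on one path to a migration queue while treating a missing required field as an application bug.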

Copy-paste diagnostics for the common fingerprints:

# Surface the exact rules the current documents violate, one sample per failure.
mongosh "$MONGO_URI" --quiet --eval '
  const s = db.getCollectionInfos({name:"orders"})[0].options.validator.$jsonSchema;
  db.orders.aggregate([{ $match: { $nor: [ { $jsonSchema: s } ] } }, { $limit: 5 }]).toArray()'

code: 121 on a document you believe is valid is almost always a bsonType mismatch: PyMongo encodes a Python int as BSON int or long and a Python float as double, so a schema demanding "double" rejects an integer-valued field. Use bsonType: ["double", "int", "long", "decimal"] when numeric width is not part of the contract.
NotWritablePrimary (surfaced by PyMongo as NotPrimaryError) or ServerSelectionTimeoutError during collMod usually means the runner targeted a secondary or could not find a primary. DDL must run against the primary.
pymongo.errors.OperationFailure: not authorized on <db> to execute command { collMod: ... } means the principal lacks the collMod action; grant dbAdmin on that database, not cluster-wide privileges.

Edge Cases, Gotchas & Known Limitations

$merge and $out can bypass the validator. Aggregation stages that write to a collection are checked against the destination's $jsonSchema by default, but an aggregate command run with bypassDocumentValidation: true skips that check. After a migration pipeline that used the bypass, or that ran while the validator was in warn mode, run an independent compliance count (count_documents({"$nor": [{"$jsonSchema": schema}]})) on the destination to find anything the server validator let through.
bulk_write(ordered=True) halts on the first failure. The default ordered mode discards remaining operations after one 121, and PyMongo raises BulkWriteError. Pass ordered=False for validation pipelines so valid documents still land and failures isolate to specific array indices.
bypass_document_validation=True silently disables the gate. It is easy to leave enabled in a migration helper and thereby write non-compliant documents that later break strict readers. Declare it False explicitly in automation.
Client-side jsonschema and server-side $jsonSchema disagree on types. The jsonschema library validates JSON draft types and will happily pass a value the server rejects on bsonType. Treat the client-side pre-flight as a fast filter, never as the authority.
collMod takes an exclusive (MODE_X) lock. The metadata change is milliseconds, but on a hot collection even a brief exclusive lock stalls concurrent writes — which is exactly why the idempotency hash exists: never take the lock when nothing actually changed.
moderate does not retroactively validate. It leaves legacy non-compliant documents writable, so switching an existing collection to strict starts rejecting updates to documents that were previously writable, which is the most common cause of a sudden 121 spike after a routine deploy. Run the dry-run count first.
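For the unordered bulk case, the per-operation failures can be separated from other errors by inspecting the `writeErrors` list on the exception's details. The sketch below is a minimal illustration operating on a plain dict; the helper name `split_bulk_errors` and the sample payload are hypothetical.

```python
from typing import Any, Dict, List, Tuple

VALIDATION_FAILURE = 121


def split_bulk_errors(details: Dict[str, Any]) -> Tuple[List[int], List[Dict[str, Any]]]:
    """Partition the writeErrors of a BulkWriteError's details.

    Returns (indices of operations rejected by the validator, all other errors).
    """
    rejected: List[int] = []
    other: List[Dict[str, Any]] = []
    for err in details.get("writeErrors", []):
        if err.get("code") == VALIDATION_FAILURE:
            rejected.append(err["index"])
        else:
            other.append(err)
    return rejected, other


# Hypothetical details payload from an unordered batch of four inserts.
sample_details = {
    "writeErrors": [
        {"index": 1, "code": 121, "errmsg": "Document failed validation"},
        {"index": 3, "code": 11000, "errmsg": "E11000 duplicate key error"},
    ],
    "nInserted": 2,
}

rejected, other = split_bulk_errors(sample_details)
print(rejected, [e["code"] for e in other])

# Typical call site (assumes a pymongo Collection named `collection`):
#   try:
#       collection.bulk_write(ops, ordered=False)
#   except BulkWriteError as exc:
#       rejected, other = split_bulk_errors(exc.details)
```

The returned indices refer to positions in the original operations list, so the rejected documents can be forwarded to a fallback chain while the remaining errors are raised or retried separately.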

Verification & Rollback Procedures

Confirm the change landed exactly as intended before walking away:

// 1. Verify the active validator, action, and level.
db.getCollectionInfos({ name: "orders" })[0].options
// 2. Confirm enforcement is live with a known-bad insert (expect WriteError 121 whenever
//    validationAction is "error"; inserts are checked at both strict and moderate).
db.orders.insertOne({ intentionallyMissing: true })

If the deployment misbehaves — an unexpected rejection spike, or a downstream service that cannot yet meet the contract — rollback is a single reversible collMod. Dropping to warn restores write availability instantly while you keep collecting compliance telemetry:

// Soft rollback: keep the schema, stop rejecting (time-to-recover: seconds).
db.runCommand({ collMod: "orders", validationAction: "warn" })

// Hard rollback: remove the validator entirely.
db.runCommand({ collMod: "orders", validator: {}, validationLevel: "off" })

Both are metadata-only and take effect immediately on the primary, propagating to secondaries through the oplog. For the connection modes and pipeline-specific validation behavior referenced above, see the official MongoDB Schema Validation documentation.
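The same two rollbacks can be issued from the Python control plane. This is a minimal sketch, assuming `db` is a PyMongo Database (anything exposing `command(name, value, **kwargs)` works); the function names are hypothetical, and the arguments mirror the mongosh commands above.

```python
from typing import Any, Dict


def soft_rollback(db: Any, collection_name: str) -> Dict[str, Any]:
    """Keep the schema but stop rejecting writes."""
    return db.command("collMod", collection_name, validationAction="warn")


def hard_rollback(db: Any, collection_name: str) -> Dict[str, Any]:
    """Remove the validator entirely."""
    return db.command("collMod", collection_name, validator={}, validationLevel="off")


# Usage (hypothetical names):
#   soft_rollback(client["shop"], "orders")
```

Keeping both paths as named functions makes the rollback something a pipeline can call on a failed health check rather than a command an operator has to remember under pressure.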

Frequently Asked Questions

How do I read a collection's active validator from PyMongo?

Run db.command("listCollections", filter={"name": "orders"}) and read cursor.firstBatch[0]["options"]["validator"]["$jsonSchema"]. The validator is stored in the collection's options — it is not returned by collStats, which only reports storage metrics. There is no dedicated "get validator" command.

Should client-side jsonschema validation replace the server validator?

No. The jsonschema library is a fast pre-flight filter that keeps malformed payloads off the write path, but it speaks JSON Schema draft types, not BSON types, so it cannot enforce bsonType distinctions like objectId or decimal. Keep the collection's $jsonSchema validator as the source of truth and treat the client check as a performance optimization.

How do I make a collMod deployment idempotent?

Hash the normalized target schema and the active validator with SHA-256 and compare. Only call collMod when the hashes (or the validationLevel / validationAction dials) differ. This prevents a repeated pipeline run from taking an exclusive MODE_X metadata lock when nothing has actually changed.

Why did a document written by $merge skip validation?

Aggregation write stages ($merge and $out) are validated against the destination collection's $jsonSchema by default, so a skipped check usually means the aggregate command ran with bypassDocumentValidation: true, or the validator was in warn mode at the time. After any migration pipeline that may have bypassed enforcement, run an independent compliance count, count_documents({"$nor": [{"$jsonSchema": schema}]}), against the destination to detect anything that slipped through.

Which PyMongo exception do I catch for a validation rejection?

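pymongo.errors.WriteError for single-document writes and pymongo.errors.BulkWriteError for batches. For a WriteError, check exc.code == 121, then read exc.details["errInfo"]["details"]["schemaRulesNotSatisfied"] on MongoDB 5.0+ to reach the exact failing rule and path. For a BulkWriteError, iterate exc.details["writeErrors"] and apply the same check to each entry's code and errInfo, because the per-operation failures live in that list rather than on the exception itself.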

Automated Schema Enforcement & Monitoring — the parent architecture this Python control plane orchestrates, spanning validators, middleware, and pre-flight checks.
PyMongo validation wrapper scripts — payload-level pre-flight, BulkWriteError parsing, and dead-letter routing that extend this automation surface.
Implementing collection-level validators — the synchronous storage-engine gate this layer deploys and tunes.
Building fallback validation chains — where documents rejected by the strict gate go so enforcement never means data loss.
Async validation monitoring dashboards — surfacing rejection rates and drift so you know when it is safe to promote from warn to error.
Categorizing schema validation errors — turning a parsed code-121 payload into a routing decision between bug, migration gap, and intentional change.
