Cleaning Your Schema: Tools to Detect and Remove Orphaned XSDs

What “orphaned XSDs” are

Orphaned XSDs are XML Schema Definition files that are no longer referenced by any XML document, by other XSDs (via include/import), or by build/tooling configuration in your project. They add maintenance burden, make it harder to tell which schemas are actually in use, and can mask the fact that a schema still in use is missing or outdated.

When to run a cleanup

  • After major refactors or package/module moves
  • Before releases or repository archiving
  • When onboarding new contributors to reduce noise
  • Periodically in large monorepos (quarterly or per sprint)

Tools & approaches

| Approach | Tools / Commands | Notes |
| --- | --- | --- |
| Static reference scanning | grep, ripgrep (rg), ag | Fast, simple. Search for filename, namespace URIs, or schemaLocation strings. Misses generated or indirect references. |
| XML-aware analysis | xmllint, xmlstarlet, Xerces-based validators | Can resolve includes/imports and validate XSDs against each other. Useful to trace explicit schemaLocation links. |
| Dependency graphing | custom scripts (Python lxml, Java DOM/SAX), Graphviz | Build directed graph of XSD→XSD and XML→XSD references to identify nodes with zero in-degree. |
| Build-tool integration | Maven plugin (maven-dependency-plugin/custom), Gradle tasks | Integrate checks into CI; can fail builds on detected orphans. |
| Repository-wide search | git ls-files + scripting, ripgrep across repo | Combine file lists with reference scans to detect unreferenced files. |
| Heuristics & metadata | Check timestamps, package/module manifests, and documentation | Helps avoid deleting intentionally-unused templates or archived schemas. |

Quick detection recipe (practical, cross-platform)

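  1. Generate the list of XSD files: git ls-files '*.xsd' > xsd_list.txt
  2. Search for references: rg --files-with-matches --fixed-strings -f xsd_list.txt || true (the || true keeps the step from failing when rg finds no match; schemaLocation values are often relative, so adjust the patterns to bare file names or schemaLocation strings as needed)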
  3. Build graph with Python (lxml) to parse imports/includes and record edges.
  4. Identify XSDs with zero incoming edges and not referenced by XML files.
  5. Cross-check with recent commit history and documentation before removal.
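
As a rough illustration of steps 1 and 2, the scan can also be done in pure Python. This is a minimal sketch, not a definitive implementation: it assumes that walking the working tree is an acceptable stand-in for git ls-files, that a mention of the bare file name anywhere in another file counts as a reference, and the function name find_unmentioned_xsds is hypothetical.

```python
from pathlib import Path


def find_unmentioned_xsds(root: Path) -> list[Path]:
    """Return XSDs whose file name does not appear in any other file under root."""
    files = [
        p for p in root.rglob("*")
        if p.is_file() and ".git" not in p.relative_to(root).parts
    ]
    xsds = [p for p in files if p.suffix.lower() == ".xsd"]

    # Read every file once; undecodable bytes are dropped instead of raising.
    contents = {p: p.read_text(encoding="utf-8", errors="ignore") for p in files}

    unmentioned = []
    for xsd in xsds:
        # Plain substring match: "a.xsd" also matches inside "data.xsd",
        # so this errs on the side of keeping files.
        mentioned = any(
            xsd.name in text for path, text in contents.items() if path != xsd
        )
        if not mentioned:
            unmentioned.append(xsd)
    return sorted(unmentioned)
```

The result is only a candidate list. A file name match says nothing about whether the reference is still live, so steps 3 to 5 still apply.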

Sample Python approach (concept)

  • Parse each XSD for xs:include and xs:import schemaLocation attributes.
  • Parse project XML files for schemaLocation or namespace hints.
  • Create graph, compute in-degree, list nodes with in-degree == 0.
    (Use lxml or xml.etree.ElementTree; ensure namespace handling.)
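
The bullets above could be sketched as follows, using xml.etree.ElementTree from the standard library. Treat it as a sketch under stated assumptions: only relative, file-based schemaLocation values are resolved, only the root element of each XML file is inspected, xs:redefine is followed in addition to xs:include and xs:import, and all function names are hypothetical.

```python
import xml.etree.ElementTree as ET
from pathlib import Path

XS = "http://www.w3.org/2001/XMLSchema"
XSI = "http://www.w3.org/2001/XMLSchema-instance"


def xsd_targets(xsd: Path) -> set[Path]:
    """Resolve schemaLocation on xs:include, xs:import and xs:redefine."""
    root = ET.parse(xsd).getroot()
    targets = set()
    for tag in ("include", "import", "redefine"):
        for el in root.iter(f"{{{XS}}}{tag}"):
            loc = el.get("schemaLocation")
            if loc and "://" not in loc:  # assumption: remote URLs are skipped
                targets.add((xsd.parent / loc).resolve())
    return targets


def xml_targets(xml: Path) -> set[Path]:
    """Resolve xsi:schemaLocation and xsi:noNamespaceSchemaLocation on the root element."""
    root = ET.parse(xml).getroot()
    # xsi:schemaLocation holds "namespace location" pairs; keep the locations.
    locs = root.get(f"{{{XSI}}}schemaLocation", "").split()[1::2]
    no_ns = root.get(f"{{{XSI}}}noNamespaceSchemaLocation")
    if no_ns:
        locs.append(no_ns)
    return {(xml.parent / loc).resolve() for loc in locs if "://" not in loc}


def zero_in_degree(root: Path) -> list[Path]:
    """Return XSDs under root that no other XSD or XML file points to."""
    in_degree = {p.resolve(): 0 for p in root.rglob("*.xsd")}
    sources = [(p, xsd_targets) for p in in_degree]
    sources += [(p, xml_targets) for p in root.rglob("*.xml")]
    for path, extract in sources:
        try:
            targets = extract(path)
        except ET.ParseError:
            continue  # malformed files contribute no edges
        for target in targets:
            if target in in_degree and target != path:
                in_degree[target] += 1
    return sorted(p for p, degree in in_degree.items() if degree == 0)
```

Entry-point schemas that are loaded only from code (for example by a validator configured with a path) will also show up with in-degree 0, so the output needs the same manual cross-check as any other candidate list.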

Safeguards before deletion

  • Move candidates to a temporary “quarantine” folder or branch.
  • Run full test/validation suites and CI.
  • Check commit history for recent changes referencing the file.
  • Notify team and keep backups for at least one release cycle.
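
The quarantine step might look like the sketch below. It assumes a plain folder move is enough; in a Git repository, a dedicated branch or git mv may be preferable. The function name quarantine and its parameters are hypothetical.

```python
import shutil
from pathlib import Path


def quarantine(candidates: list[Path], root: Path, quarantine_dir: Path) -> list[Path]:
    """Move each candidate under quarantine_dir, keeping its path relative to root."""
    moved = []
    for src in candidates:
        dest = quarantine_dir / src.relative_to(root)
        dest.parent.mkdir(parents=True, exist_ok=True)
        shutil.move(str(src), str(dest))
        moved.append(dest)
    return moved
```

Keeping the relative path intact makes it straightforward to restore a file to its original location if tests or CI fail afterwards.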

Automating in CI

  • Implement detection script as a CI job that warns on orphans, then promote to failure after a review window.
  • Keep a whitelist file for intentionally unused schemas.
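
One possible shape for such a CI check is sketched here. The whitelist format (one glob pattern per line, with # comments) and the names load_whitelist, unexpected_orphans, and ci_exit_code are assumptions made for illustration; the orphan list is expected to come from whatever detection script you use.

```python
import fnmatch
import sys
from pathlib import Path


def load_whitelist(path: Path) -> list[str]:
    """Read glob patterns, one per line; blank lines and '#' comments are skipped."""
    if not path.exists():
        return []
    lines = (line.strip() for line in path.read_text(encoding="utf-8").splitlines())
    return [line for line in lines if line and not line.startswith("#")]


def unexpected_orphans(orphans: list[str], patterns: list[str]) -> list[str]:
    """Drop orphans matching any whitelist pattern (fnmatch: '*' also matches '/')."""
    return [o for o in orphans if not any(fnmatch.fnmatch(o, p) for p in patterns)]


def ci_exit_code(orphans: list[str], patterns: list[str], fail_on_orphans: bool) -> int:
    """Report remaining orphans on stderr; return 1 only in fail mode."""
    remaining = unexpected_orphans(orphans, patterns)
    for path in remaining:
        print(f"orphaned XSD: {path}", file=sys.stderr)
    return 1 if (remaining and fail_on_orphans) else 0
```

Running with fail_on_orphans=False gives the warning-only phase; flipping it to True after the review window turns the same job into a hard failure.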

Quick decision guide

  • If referenced by XML or other XSDs → keep.
  • If only referenced in docs or examples → consider moving to docs area.
  • If untouched for long and unreferenced → quarantine, test, then delete.
