Cleaning Your Schema: Tools to Detect and Remove Orphaned XSDs
What “orphaned XSDs” are
Orphaned XSDs are XML Schema Definition files that are no longer referenced by any XML document, by other XSDs (via include/import), or by build/tooling configuration in your project. They increase maintenance burden, confuse contributors, and can obscure which schemas are actually missing or outdated.
When to run a cleanup
- After major refactors or package/module moves
- Before releases or repository archiving
- When onboarding new contributors to reduce noise
- Periodically in large monorepos (quarterly or per sprint)
Tools & approaches
| Approach | Tools / Commands | Notes |
|---|---|---|
| Static reference scanning | grep, ripgrep (rg), ag | Fast, simple. Search for filename, namespace URIs, or schemaLocation strings. Misses generated or indirect references. |
| XML-aware analysis | xmllint, xmlstarlet, Xerces-based validators | Can resolve includes/imports and validate XSDs against each other. Useful to trace explicit schemaLocation links. |
| Dependency graphing | custom scripts (Python lxml, Java DOM/SAX), Graphviz | Build directed graph of XSD→XSD and XML→XSD references to identify nodes with zero in-degree. |
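| Build-tool integration | Maven (custom plugin or a script invoked from the build), Gradle tasks | Integrate checks into CI; can fail builds on detected orphans. maven-dependency-plugin analyzes artifact dependencies rather than schema references, so a custom check is still needed. |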
| Repository-wide search | git ls-files + scripting, ripgrep across repo | Combine file lists with reference scans to detect unreferenced files. |
| Heuristics & metadata | Check timestamps, package/module manifests, and documentation | Helps avoid deleting intentionally-unused templates or archived schemas. |
Quick detection recipe (practical, cross-platform)
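- Generate a list of XSD files:
  git ls-files '*.xsd' > xsd_list.txt
- Search for references:
  rg --files-with-matches -F -f xsd_list.txt || true
  (-F treats each line as a literal string rather than a regex. The list holds repository-relative paths, while references often use only a basename or a relative path, so adjust the patterns or search for schemaLocation strings instead.)
- Build a graph with Python (lxml) that parses imports/includes and records edges.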
- Identify XSDs with zero incoming edges and not referenced by XML files.
- Cross-check with recent commit history and documentation before removal.
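The first two steps of the recipe can also be sketched in plain Python, which avoids depending on git or ripgrep being installed. This is a minimal sketch, not a definitive implementation: the function name `unreferenced_by_name` and the `REFERRER_SUFFIXES` set are assumptions made for illustration, and matching is by basename substring only, so it errs on the side of treating a file as referenced.

```python
from pathlib import Path

# Extensions treated as possible referrers. This set is an assumption;
# extend it to cover whatever build or config files your project uses.
REFERRER_SUFFIXES = {".xml", ".xsd", ".wsdl", ".xjb", ".gradle", ".properties"}


def unreferenced_by_name(root):
    """Return XSD files whose basename appears in no other scanned file."""
    root = Path(root)
    files = [p for p in root.rglob("*") if p.is_file() and ".git" not in p.parts]
    xsds = sorted(p for p in files if p.suffix == ".xsd")
    # Read every candidate referrer once, ignoring undecodable bytes.
    texts = {
        p: p.read_text(encoding="utf-8", errors="ignore")
        for p in files
        if p.suffix in REFERRER_SUFFIXES
    }
    return [
        xsd
        for xsd in xsds
        # A file mentioning its own name does not count as a reference.
        if not any(xsd.name in text for path, text in texts.items() if path != xsd)
    ]
```

Because the match is a plain substring test, a schema named `a.xsd` is counted as referenced whenever any file mentions `data.xsd`. Treat the output as a list of candidates to verify, not as a deletion list.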
Sample Python approach (concept)
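- Parse each XSD for the schemaLocation attributes of xs:include, xs:import, and xs:redefine.
- Parse project XML files for xsi:schemaLocation and xsi:noNamespaceSchemaLocation attributes, or for namespace hints.
- Create the graph, compute in-degree, and list nodes with in-degree == 0.
(Use lxml or xml.etree.ElementTree. Match elements by the XML Schema namespace URI, http://www.w3.org/2001/XMLSchema, rather than by the xs: prefix, which can differ between files.)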
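The approach above could look roughly like the following, using only xml.etree.ElementTree from the standard library. It is a sketch under stated assumptions: the function names (`xsd_targets`, `xml_targets`, `find_orphans`) are hypothetical, only relative or local schemaLocation values are resolved (remote URLs are skipped), and schemas loaded by application code or XML catalogs are not detected.

```python
import xml.etree.ElementTree as ET
from pathlib import Path

XS = "{http://www.w3.org/2001/XMLSchema}"
XSI = "{http://www.w3.org/2001/XMLSchema-instance}"


def xsd_targets(xsd_path):
    """Local files named by xs:include / xs:import / xs:redefine in one XSD."""
    root = ET.parse(xsd_path).getroot()
    targets = []
    for tag in ("include", "import", "redefine"):
        for el in root.iter(XS + tag):
            loc = el.get("schemaLocation")
            if loc and "://" not in loc:  # skip remote URLs
                targets.append((xsd_path.parent / loc).resolve())
    return targets


def xml_targets(xml_path):
    """Local files named by xsi schema-location hints in one XML document."""
    root = ET.parse(xml_path).getroot()
    targets = []
    for el in root.iter():
        # xsi:schemaLocation holds "namespace location" pairs: keep the locations.
        locs = (el.get(XSI + "schemaLocation") or "").split()[1::2]
        no_ns = el.get(XSI + "noNamespaceSchemaLocation")
        if no_ns:
            locs.append(no_ns)
        targets += [(xml_path.parent / loc).resolve() for loc in locs if "://" not in loc]
    return targets


def find_orphans(root_dir):
    """Return XSD files under root_dir with zero incoming references."""
    root_dir = Path(root_dir).resolve()
    in_degree = {p.resolve(): 0 for p in root_dir.rglob("*.xsd")}
    sources = [(p, xsd_targets) for p in in_degree]
    sources += [(p.resolve(), xml_targets) for p in root_dir.rglob("*.xml")]
    for path, extract in sources:
        try:
            targets = extract(path)
        except ET.ParseError:
            continue  # unparseable file: contributes no edges
        for target in targets:
            if target in in_degree and target != path:
                in_degree[target] += 1
    return sorted(p for p, degree in in_degree.items() if degree == 0)
```

Entry-point schemas that are loaded only from code will show up with in-degree 0 here, so the result is a candidate list that still needs the cross-checks described in the recipe.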
Safeguards before deletion
- Move candidates to a temporary “quarantine” folder or branch.
- Run full test/validation suites and CI.
- Check commit history for recent changes referencing the file.
- Notify team and keep backups for at least one release cycle.
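The quarantine step can be scripted so that candidates keep their relative layout and are easy to restore. The sketch below is illustrative only: the `quarantine` function and the default folder name are assumptions, and in a git repository you may prefer `git mv` on a dedicated branch so that history follows the files.

```python
import shutil
from pathlib import Path


def quarantine(candidates, repo_root, quarantine_dir="quarantine"):
    """Move candidate files under repo_root/quarantine_dir, keeping relative paths.

    Returns a list of (original, new_location) pairs so the move can be undone.
    """
    repo_root = Path(repo_root).resolve()
    dest_root = repo_root / quarantine_dir
    moved = []
    for src in candidates:
        src = Path(src).resolve()
        # relative_to raises ValueError if a candidate lies outside the repo.
        dest = dest_root / src.relative_to(repo_root)
        dest.parent.mkdir(parents=True, exist_ok=True)
        shutil.move(str(src), str(dest))
        moved.append((src, dest))
    return moved
```

Keeping the returned pairs (for example, written to a log file) makes it straightforward to move a schema back if the test suite or a teammate shows it was still needed.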
Automating in CI
- Run the detection script as a CI job that initially only warns on orphans, then switch it to fail the build after a review window.
- Keep a whitelist file for intentionally unused schemas.
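A minimal sketch of the warn-then-fail behavior with a whitelist file might look like this. The file format (one repository-relative path per line, with `#` comments), the function names, and the `fail_on_orphans` switch are all assumptions made for illustration. The list of orphans is expected to come from whatever detection script you already run.

```python
from pathlib import Path


def load_whitelist(path):
    """Read one path per line; blank lines and '#' comments are ignored."""
    p = Path(path)
    if not p.exists():
        return set()
    lines = (line.strip() for line in p.read_text(encoding="utf-8").splitlines())
    return {line for line in lines if line and not line.startswith("#")}


def check_orphans(orphans, whitelist, fail_on_orphans=False):
    """Report orphans not on the whitelist and return (exit_code, unexpected).

    exit_code is 1 only when fail_on_orphans is True and something was found,
    so the same job can start as a warning and later be promoted to a failure.
    """
    unexpected = sorted(set(orphans) - set(whitelist))
    level = "ERROR" if fail_on_orphans else "WARNING"
    for path in unexpected:
        print(f"{level}: orphaned XSD candidate: {path}")
    exit_code = 1 if (unexpected and fail_on_orphans) else 0
    return exit_code, unexpected
```

In a CI script, the exit code would typically be passed to `sys.exit`, with `fail_on_orphans` driven by an environment variable or command-line flag so the promotion from warning to failure is a configuration change, not a code change.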
Quick decision guide
- If referenced by XML or other XSDs → keep.
- If only referenced in docs or examples → consider moving to docs area.
- If unreferenced and untouched for a long time → quarantine, test, then delete.