🕰️ Web Archives¶
Web archiving services like the Wayback Machine periodically snapshot and store copies of websites. These archives can reveal content that has since been removed — old pages, deprecated API endpoints, leaked credentials, source code comments, and forgotten subdomains. This is a purely passive reconnaissance technique.
1️⃣ The Wayback Machine¶
The Wayback Machine (https://web.archive.org/) is operated by the Internet Archive and contains billions of web pages archived since 1996.
Browsing Manually¶
- Navigate to https://web.archive.org/
- Enter the target URL (e.g.,
example.com). - Browse the calendar to view snapshots from different dates.
What to Look For¶
| Content | Why It Matters |
|---|---|
| Old login pages | May reveal legacy authentication systems or forgotten admin panels. |
| Removed pages | Content removed for security reasons (e.g., exposed config files, debug pages). |
| Old JavaScript files | May contain hardcoded API keys, internal endpoints, or debug flags. |
| Previous site structure | Reveals how the application evolved — old paths may still work. |
| Staff pages / contact info | Employee names and emails useful for OSINT and social engineering. |
| Technology changes | Seeing that a site migrated from PHP to Node.js helps narrow your testing approach. |
2️⃣ Wayback Machine CLI Tools¶
waybackurls (by tomnomnom)¶
Extracts all URLs that the Wayback Machine has indexed for a given domain:
# Install
go install github.com/tomnomnom/waybackurls@latest
# Usage
echo "example.com" | waybackurls > wayback_urls.txt
This can return thousands of URLs, including:
- Old API endpoints (/api/v1/users).
- Backup files (/backup.zip, /db_dump.sql).
- Admin panels (/admin/, /dashboard/).
- Files with interesting extensions (.env, .bak, .conf).
Filtering Results¶
# Find only JavaScript files (may contain secrets)
cat wayback_urls.txt | grep "\.js$" | sort -u
# Find potential config/backup files
cat wayback_urls.txt | grep -iE "\.(env|bak|sql|conf|old|zip|tar|gz)$" | sort -u
# Find API endpoints
cat wayback_urls.txt | grep -i "/api/" | sort -u
3️⃣ Other Web Archive Services¶
| Service | URL | Notes |
|---|---|---|
| Wayback Machine | https://web.archive.org/ | The largest and most well-known archive. |
| Archive.today | https://archive.ph/ | Focuses on preserving individual pages on demand. |
| Google Cache | cache:example.com in Google search |
Shows Google's most recent cached version of a page. |
| CachedView | https://cachedview.nl/ | Aggregates cached versions from multiple sources. |
| CommonCrawl | https://commoncrawl.org/ | A massive open dataset of web crawl data (petabytes). |
4️⃣ Checking If Removed Content Still Exists¶
A URL found in the Wayback Machine might still be live on the target server, even if it's been removed from the site's navigation.
# Check each wayback URL against the live site
cat wayback_urls.txt | httpx -status-code -content-length -title -o live_urls.txt
Tip
Many organizations "remove" pages by delinking them from navigation but forget to actually delete the files from the server. waybackurls + httpx is a powerful combo for finding these orphaned pages.
5️⃣ Wayback Machine Diffing¶
Compare two snapshots of the same page to see what changed: 1. Open two different snapshots of the same URL in the Wayback Machine. 2. Use the built-in "Changes" feature (calendar view → click on a date → compare).
This can reveal: - Security patches that were applied (which imply a vulnerability existed before). - Content that was removed for legal or security reasons. - Changes to JavaScript files that may indicate new features or fixed bugs.
6️⃣ Defensive Recommendations¶
- Request Removal: If sensitive content appears in the Wayback Machine, submit an exclusion request at https://web.archive.org/web/removals.html
- Use
robots.txt: AddUser-agent: ia_archiver Disallow: /to prevent future archiving by the Internet Archive. - Rotate Secrets: If API keys, credentials, or tokens were ever exposed publicly (even briefly), assume they are compromised and rotate them immediately.
- Audit Before Removing: When decommissioning pages, actually delete the files from the server — don't just remove the navigation links.
Warning
Viewing archived content is completely passive and legal. However, using discovered credentials, API keys, or endpoints to access live systems without authorization is illegal.