Output Structure
Smippo creates structured mirrors that preserve the original URL hierarchy and add the metadata needed for offline browsing and later updates.
Default Output
When you run:
smippo https://example.com --depth 2
Smippo creates:
site/
├── example.com/
│   ├── index.html          # Rendered homepage
│   ├── about/
│   │   └── index.html      # About page
│   ├── blog/
│   │   ├── index.html      # Blog listing
│   │   └── post-1/
│   │       └── index.html  # Blog post
│   └── assets/
│       ├── style.css       # Stylesheet
│       ├── app.js          # JavaScript
│       └── logo.png        # Image
├── cdn.example.com/        # External assets (if --external-assets)
│   └── fonts/
│       └── inter.woff2
├── .smippo/                # Metadata directory
│   ├── manifest.json       # Capture metadata
│   ├── cache.json          # Cache data
│   ├── network.har         # HTTP archive
│   └── log.txt             # Capture log
└── index.html              # Entry point redirect
Directory Structure
Domain Directories
Each domain gets its own directory:
site/
├── example.com/ # Main site
├── cdn.example.com/ # CDN assets
└── fonts.googleapis.com/ # Google Fonts
This prevents filename collisions and makes it clear where each file originated.
Path Preservation
URL paths are preserved as directory structures:
| URL | Local Path |
|---|---|
| https://example.com/ | example.com/index.html |
| https://example.com/about | example.com/about/index.html |
| https://example.com/blog/post | example.com/blog/post/index.html |
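The mapping in the table can be sketched in a few lines of Python. This is an illustration inferred from the three rows above, not Smippo's actual implementation; the function name url_to_local_path is hypothetical, and the sketch covers HTML pages only (query strings and asset filenames are ignored).

```python
from pathlib import PurePosixPath
from urllib.parse import urlsplit


def url_to_local_path(url: str) -> str:
    """Map a page URL to a mirror-relative path, following the table above."""
    parts = urlsplit(url)
    # Drop empty segments so "/" and "/about/" behave like "" and "/about".
    segments = [s for s in parts.path.split("/") if s]
    # Assumption: every page is stored as index.html inside its own directory.
    return str(PurePosixPath(parts.netloc, *segments, "index.html"))


for url in (
    "https://example.com/",
    "https://example.com/about",
    "https://example.com/blog/post",
):
    print(url, "->", url_to_local_path(url))
```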
File Naming
- HTML pages: index.html in their directory
- Assets: Original filename preserved
- Query strings: Encoded in filename or omitted
The .smippo Directory
The .smippo directory contains metadata crucial for:
- Resuming interrupted captures
- Updating existing mirrors
- Debugging issues
manifest.json
Contains capture metadata:
{
  "version": "0.0.1",
  "created": "2024-01-15T10:30:00Z",
  "updated": "2024-01-15T11:45:00Z",
  "rootUrl": "https://example.com",
  "options": {
    "depth": 3,
    "scope": "domain",
    "workers": 8,
    "externalAssets": true
  },
  "stats": {
    "pagesCapt": 42,
    "assetsCapt": 156,
    "totalSize": 15728640,
    "duration": 180000,
    "errors": 2
  },
  "pages": [
    {
      "url": "https://example.com/",
      "localPath": "example.com/index.html",
      "status": 200,
      "captured": "2024-01-15T10:30:05Z",
      "size": 45678,
      "title": "Example Domain"
    }
  ],
  "assets": [
    {
      "url": "https://example.com/style.css",
      "localPath": "example.com/style.css",
      "mimeType": "text/css",
      "size": 12345
    }
  ],
  "errors": [
    {
      "url": "https://example.com/broken",
      "error": "404 Not Found",
      "time": "2024-01-15T10:32:00Z"
    }
  ]
}
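Because the manifest is plain JSON, it is easy to inspect with a script. The sketch below summarizes a capture using only the field names from the sample above. It assumes totalSize is in bytes and duration is in milliseconds, which the sample values suggest but the source does not state; the helper names summarize_manifest and load_manifest are hypothetical.

```python
import json
from pathlib import Path


def load_manifest(site_dir: str) -> dict:
    """Read .smippo/manifest.json from a mirror directory."""
    path = Path(site_dir, ".smippo", "manifest.json")
    return json.loads(path.read_text(encoding="utf-8"))


def summarize_manifest(manifest: dict) -> str:
    """Build a short human-readable summary of a capture."""
    stats = manifest.get("stats", {})
    # Assumption: totalSize is bytes and duration is milliseconds.
    size_mib = stats.get("totalSize", 0) / (1024 * 1024)
    seconds = stats.get("duration", 0) / 1000
    lines = [
        f"root: {manifest.get('rootUrl', '?')}",
        f"pages: {stats.get('pagesCapt', 0)}, assets: {stats.get('assetsCapt', 0)}",
        f"size: {size_mib:.1f} MiB, duration: {seconds:.0f} s",
    ]
    # List each failed URL recorded in the "errors" array.
    for err in manifest.get("errors", []):
        lines.append(f"error: {err['url']} ({err['error']})")
    return "\n".join(lines)
```

To try it on a real mirror, call print(summarize_manifest(load_manifest("./site"))).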
cache.json
Stores HTTP cache data for efficient updates:
{
  "etags": {
    "https://example.com/style.css": "\"abc123\"",
    "https://example.com/logo.png": "\"def456\""
  },
  "lastModified": {
    "https://example.com/": "Mon, 01 Jan 2024 00:00:00 GMT"
  },
  "contentTypes": {
    "https://example.com/api/data": "application/json"
  }
}
This allows smippo update to skip unchanged files.
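The source does not describe how smippo update uses these values internally. The usual way to skip unchanged files with this kind of data is an HTTP conditional request: send the stored ETag as If-None-Match and the stored date as If-Modified-Since, and treat a 304 Not Modified response as "unchanged". A minimal sketch of building those headers from a cache.json-shaped dictionary follows; conditional_headers is a hypothetical helper, not part of Smippo.

```python
def conditional_headers(cache: dict, url: str) -> dict:
    """Return conditional-request headers for url based on cached validators."""
    headers = {}
    etag = cache.get("etags", {}).get(url)
    if etag:
        # The server answers 304 if its current ETag still matches.
        headers["If-None-Match"] = etag
    last_modified = cache.get("lastModified", {}).get(url)
    if last_modified:
        # The server answers 304 if the resource has not changed since this date.
        headers["If-Modified-Since"] = last_modified
    return headers


sample_cache = {
    "etags": {"https://example.com/style.css": '"abc123"'},
    "lastModified": {"https://example.com/": "Mon, 01 Jan 2024 00:00:00 GMT"},
}
print(conditional_headers(sample_cache, "https://example.com/style.css"))
```

A URL with no cached validators yields an empty dictionary, so the request falls back to a normal full download.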
network.har
HTTP Archive file containing all network requests:
- Request/response headers
- Timing information
- Response bodies
- Useful for debugging and replay
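For debugging, a HAR file can be loaded as ordinary JSON. The sketch below lists requests that returned an error status, relying only on the standard HAR 1.2 layout (log.entries, each with request.url and response.status). The function name failed_requests is hypothetical, and whether Smippo's HAR output contains every field is not confirmed by the source.

```python
def failed_requests(har: dict, min_status: int = 400) -> list[tuple[int, str]]:
    """Return (status, url) pairs for entries whose status is >= min_status."""
    failures = []
    for entry in har.get("log", {}).get("entries", []):
        status = entry["response"]["status"]
        if status >= min_status:
            failures.append((status, entry["request"]["url"]))
    return failures


# Tiny hand-written HAR fragment for illustration.
sample_har = {
    "log": {
        "entries": [
            {"request": {"url": "https://example.com/"}, "response": {"status": 200}},
            {"request": {"url": "https://example.com/broken"}, "response": {"status": 404}},
        ]
    }
}
print(failed_requests(sample_har))
```

For a real capture, load the file first with json.load and pass the resulting dictionary in.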
log.txt
Plain text log of the capture session:
[10:30:00] Starting capture of https://example.com
[10:30:02] Captured: https://example.com/ (45KB)
[10:30:05] Captured: https://example.com/about (32KB)
[10:30:08] Error: https://example.com/broken - 404 Not Found
[10:30:15] Capture complete: 42 pages, 156 assets
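Since the log is line-oriented, error entries can be pulled out with a regular expression. The pattern below is inferred from the sample lines above and may need adjusting if the real log format differs; parse_errors is a hypothetical helper.

```python
import re

# Assumed line shape: "[HH:MM:SS] Error: <url> - <reason>"
ERROR_LINE = re.compile(r"^\[(\d{2}:\d{2}:\d{2})\] Error: (\S+) - (.+)$")


def parse_errors(log_text: str) -> list[dict]:
    """Extract time, URL, and reason from every error line in the log."""
    errors = []
    for line in log_text.splitlines():
        match = ERROR_LINE.match(line.strip())
        if match:
            time, url, reason = match.groups()
            errors.append({"time": time, "url": url, "reason": reason})
    return errors


sample_log = """\
[10:30:02] Captured: https://example.com/ (45KB)
[10:30:08] Error: https://example.com/broken - 404 Not Found
[10:30:15] Capture complete: 42 pages, 156 assets
"""
print(parse_errors(sample_log))
```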
Structure Options
Original Structure (Default)
smippo https://example.com --structure original
Preserves URL paths exactly:
site/
└── example.com/
    ├── index.html
    ├── about/
    │   └── index.html
    └── blog/
        └── post/
            └── index.html
Flat Structure
smippo https://example.com --structure flat
All files in one directory with unique names:
site/
├── index.html
├── about-index.html
├── blog-post-index.html
├── style-abc123.css
└── logo-def456.png
Useful for:
- Single-page archives
- Avoiding deep directory nesting
- Simple file listings
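The page names in the flat listing look like the URL path segments joined with hyphens. The sketch below reproduces that pattern for pages only. It is a guess based on the example listing, not Smippo's documented algorithm, and flat_page_name is a hypothetical name. The suffixes on asset names (style-abc123.css) are left out because the source does not say how they are derived.

```python
def flat_page_name(url_path: str) -> str:
    """Join URL path segments with hyphens and append index.html."""
    segments = [s for s in url_path.split("/") if s]
    return "-".join(segments + ["index.html"])


for path in ("/", "/about", "/blog/post"):
    print(path, "->", flat_page_name(path))
```

Note that a scheme this simple can collide: /a-b and /a/b would both map to a-b-index.html, so a real implementation needs some way to keep names unique.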
Domain Structure
smippo https://example.com --structure domain
Organized by domain without path nesting:
site/
├── example.com/
│   ├── index.html
│   ├── about.html
│   └── blog-post.html
└── cdn.example.com/
    └── style.css
Entry Point
Every capture includes an index.html entry point at the root:
<!DOCTYPE html>
<html>
<head>
<meta http-equiv="refresh" content="0; url=example.com/index.html">
<title>Redirecting...</title>
</head>
<body>
<a href="example.com/index.html">Click here</a>
</body>
</html>
This allows you to open ./site/index.html directly in a browser.
Serving the Output
Built-in Server
smippo serve ./site --open
The built-in server:
- Sets correct MIME types
- Handles directory browsing
- Supports CORS
- Auto-finds available ports
Other Servers
The output works with any static file server:
# Python
python -m http.server -d ./site
# Node
npx serve ./site
# PHP
php -S localhost:8000 -t ./site
File Types Captured
HTML Pages
- Fully rendered DOM (after JavaScript execution)
- Links rewritten for offline viewing
- Optional: JavaScript stripped (--static)
Stylesheets
- CSS files with rewritten url() references
- External fonts (with --external-assets)
JavaScript
- Script files preserved (unless --static)
- Required for interactive functionality
Images & Media
- All images (PNG, JPEG, GIF, WebP, SVG)
- Videos and audio (unless filtered)
- Optimized through browser's natural handling
Fonts
- Web fonts (WOFF, WOFF2, TTF)
- From CDNs (with --external-assets)
Other Assets
- PDFs and documents
- JSON data files
- Any resource fetched by the page
Best Practices
For Offline Viewing
smippo https://example.com \
--static \
--external-assets \
--depth 5
For Archiving
smippo https://example.com \
--depth 10 \
--structure original \
--har
For Development
smippo https://example.com \
--structure flat \
--no-har
Next Steps
- Link Rewriting — How links are transformed
- Serve Command — View captured sites
- Continue Command — Resume captures