Output Structure

Smippo creates structured mirrors that preserve the original URL hierarchy and add the metadata needed for offline browsing and future updates.

Default Output

When you run:

smippo https://example.com --depth 2

Smippo creates:

site/
├── example.com/
│   ├── index.html           # Rendered homepage
│   ├── about/
│   │   └── index.html       # About page
│   ├── blog/
│   │   ├── index.html       # Blog listing
│   │   └── post-1/
│   │       └── index.html   # Blog post
│   └── assets/
│       ├── style.css        # Stylesheet
│       ├── app.js           # JavaScript
│       └── logo.png         # Image
├── cdn.example.com/         # External assets (if --external-assets)
│   └── fonts/
│       └── inter.woff2
├── .smippo/                  # Metadata directory
│   ├── manifest.json        # Capture metadata
│   ├── cache.json           # Cache data
│   ├── network.har          # HTTP archive
│   └── log.txt              # Capture log
└── index.html               # Entry point redirect

Directory Structure

Domain Directories

Each domain gets its own directory:

site/
├── example.com/       # Main site
├── cdn.example.com/   # CDN assets
└── fonts.googleapis.com/  # Google Fonts

This prevents filename collisions and makes it clear where each file originated.

Path Preservation

URL paths are preserved as directory structures:

URL	Local Path
`https://example.com/`	`example.com/index.html`
`https://example.com/about`	`example.com/about/index.html`
`https://example.com/blog/post`	`example.com/blog/post/index.html`

File Naming

HTML pages: index.html in their directory
Assets: Original filename preserved
Query strings: Encoded in filename or omitted
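
The mapping in the table above can be sketched in a few lines of Python. This is an illustration only, not Smippo's implementation. The function name url_to_local_path is hypothetical, and the sketch assumes that a last path segment with a file extension is an asset, that anything else is a page, and that query strings are dropped (one of the two behaviors listed above).

```python
from posixpath import splitext
from urllib.parse import urlsplit


def url_to_local_path(url: str) -> str:
    """Map a URL to a mirror-relative path (illustrative sketch only)."""
    parts = urlsplit(url)  # the query string ends up in parts.query and is ignored
    path = parts.path.strip("/")
    # Assumption: a last segment with an extension is an asset and keeps its name.
    if path and splitext(path)[1]:
        return f"{parts.hostname}/{path}"
    # Otherwise treat the URL as a page stored as index.html in its own directory.
    segments = [parts.hostname] + ([path] if path else []) + ["index.html"]
    return "/".join(segments)


for url in (
    "https://example.com/",
    "https://example.com/about",
    "https://example.com/blog/post",
    "https://example.com/assets/style.css",
):
    print(url, "->", url_to_local_path(url))
```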

The .smippo Directory

The .smippo directory holds the metadata needed for:

Resuming interrupted captures
Updating existing mirrors
Debugging issues

manifest.json

Contains capture metadata:

{
  "version": "0.0.1",
  "created": "2024-01-15T10:30:00Z",
  "updated": "2024-01-15T11:45:00Z",
  "rootUrl": "https://example.com",
  "options": {
    "depth": 3,
    "scope": "domain",
    "workers": 8,
    "externalAssets": true
  },
  "stats": {
    "pagesCapt": 42,
    "assetsCapt": 156,
    "totalSize": 15728640,
    "duration": 180000,
    "errors": 2
  },
  "pages": [
    {
      "url": "https://example.com/",
      "localPath": "example.com/index.html",
      "status": 200,
      "captured": "2024-01-15T10:30:05Z",
      "size": 45678,
      "title": "Example Domain"
    }
  ],
  "assets": [
    {
      "url": "https://example.com/style.css",
      "localPath": "example.com/style.css",
      "mimeType": "text/css",
      "size": 12345
    }
  ],
  "errors": [
    {
      "url": "https://example.com/broken",
      "error": "404 Not Found",
      "time": "2024-01-15T10:32:00Z"
    }
  ]
}
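
Because the manifest is plain JSON, it is easy to inspect from a script. The sketch below pulls a few headline numbers out of a parsed manifest. The helper names summarize_manifest and load_manifest are hypothetical, the sample is a trimmed copy of the example above, and the sketch assumes totalSize is a byte count.

```python
import json
from pathlib import Path


def load_manifest(site_dir: str) -> dict:
    """Read .smippo/manifest.json from a mirror directory."""
    path = Path(site_dir, ".smippo", "manifest.json")
    return json.loads(path.read_text(encoding="utf-8"))


def summarize_manifest(manifest: dict) -> dict:
    """Collect a few headline numbers from a parsed manifest."""
    return {
        "root": manifest["rootUrl"],
        "pages": len(manifest.get("pages", [])),
        "assets": len(manifest.get("assets", [])),
        "failed_urls": [entry["url"] for entry in manifest.get("errors", [])],
        # Assumption: totalSize is in bytes.
        "total_mib": round(manifest["stats"]["totalSize"] / 2**20, 1),
    }


# Trimmed copy of the example manifest shown above.
sample = {
    "rootUrl": "https://example.com",
    "stats": {"totalSize": 15728640},
    "pages": [{"url": "https://example.com/", "localPath": "example.com/index.html"}],
    "assets": [{"url": "https://example.com/style.css", "localPath": "example.com/style.css"}],
    "errors": [{"url": "https://example.com/broken", "error": "404 Not Found"}],
}
print(summarize_manifest(sample))
```

For a real mirror, summarize_manifest(load_manifest("./site")) would do the same against the file on disk.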

cache.json

Stores HTTP cache data for efficient updates:

{
  "etags": {
    "https://example.com/style.css": "\"abc123\"",
    "https://example.com/logo.png": "\"def456\""
  },
  "lastModified": {
    "https://example.com/": "Mon, 01 Jan 2024 00:00:00 GMT"
  },
  "contentTypes": {
    "https://example.com/api/data": "application/json"
  }
}

This allows smippo update to skip unchanged files.
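
Stored ETag and Last-Modified values are what standard HTTP conditional requests are built from: the client sends them back as If-None-Match and If-Modified-Since, and a 304 Not Modified response means the local copy is still current. The sketch below shows that header construction. It is an assumption about how the cache could be used, not a description of Smippo's internals, and conditional_headers is a hypothetical name.

```python
def conditional_headers(cache: dict, url: str) -> dict:
    """Build conditional request headers for `url` from cached validators."""
    headers = {}
    etag = cache.get("etags", {}).get(url)
    if etag:
        headers["If-None-Match"] = etag
    last_modified = cache.get("lastModified", {}).get(url)
    if last_modified:
        headers["If-Modified-Since"] = last_modified
    return headers


# Sample shaped like the cache.json example above.
cache = {
    "etags": {"https://example.com/style.css": '"abc123"'},
    "lastModified": {"https://example.com/": "Mon, 01 Jan 2024 00:00:00 GMT"},
}

print(conditional_headers(cache, "https://example.com/style.css"))
print(conditional_headers(cache, "https://example.com/"))
print(conditional_headers(cache, "https://example.com/new-page"))
```

A URL with no cached validators yields an empty dict, so the request goes out unconditionally and the resource is downloaded in full.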

network.har

HTTP Archive file containing all network requests:

Request/response headers
Timing information
Response bodies
Useful for debugging and replay
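
HAR is a JSON format, so a capture's network activity can be queried with a short script. The sketch below relies only on fields defined by HAR 1.2 (log.entries, request.url, response.status). The function names are hypothetical and the inline sample stands in for a real .smippo/network.har file.

```python
from collections import Counter


def status_counts(har: dict) -> Counter:
    """Count responses by HTTP status code."""
    return Counter(entry["response"]["status"] for entry in har["log"]["entries"])


def failed_requests(har: dict) -> list:
    """List (url, status) pairs for responses with status 400 or higher."""
    return [
        (entry["request"]["url"], entry["response"]["status"])
        for entry in har["log"]["entries"]
        if entry["response"]["status"] >= 400
    ]


# Minimal stand-in for a parsed HAR file, e.g. json.load(open("network.har")).
har = {
    "log": {
        "entries": [
            {"request": {"url": "https://example.com/"}, "response": {"status": 200}},
            {"request": {"url": "https://example.com/about"}, "response": {"status": 200}},
            {"request": {"url": "https://example.com/broken"}, "response": {"status": 404}},
        ]
    }
}

print(status_counts(har))
print(failed_requests(har))
```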

log.txt

Plain text log of the capture session:

[10:30:00] Starting capture of https://example.com
[10:30:02] Captured: https://example.com/ (45KB)
[10:30:05] Captured: https://example.com/about (32KB)
[10:30:08] Error: https://example.com/broken - 404 Not Found
[10:30:15] Capture complete: 42 pages, 156 assets
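
Since each log line follows a timestamp-then-message layout, errors can be pulled out with a regular expression. This sketch assumes every line matches the format shown in the sample above. The real log may contain other message types, which the code simply skips. The name errors_in_log is hypothetical.

```python
import re

# "[HH:MM:SS] message"
LINE = re.compile(r"^\[(?P<time>\d{2}:\d{2}:\d{2})\] (?P<message>.*)$")
# "Error: <url> - <reason>"
ERROR = re.compile(r"^Error: (?P<url>\S+) - (?P<reason>.+)$")


def errors_in_log(text: str) -> list:
    """Return (time, url, reason) tuples for every error line in the log text."""
    found = []
    for raw in text.splitlines():
        line = LINE.match(raw)
        if not line:
            continue  # skip anything that is not a timestamped line
        error = ERROR.match(line["message"])
        if error:
            found.append((line["time"], error["url"], error["reason"]))
    return found


LOG = """\
[10:30:00] Starting capture of https://example.com
[10:30:02] Captured: https://example.com/ (45KB)
[10:30:05] Captured: https://example.com/about (32KB)
[10:30:08] Error: https://example.com/broken - 404 Not Found
[10:30:15] Capture complete: 42 pages, 156 assets
"""

print(errors_in_log(LOG))
```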

Structure Options

Original Structure (Default)

smippo https://example.com --structure original

Preserves URL paths exactly:

site/
└── example.com/
    ├── index.html
    ├── about/
    │   └── index.html
    └── blog/
        └── post/
            └── index.html

Flat Structure

smippo https://example.com --structure flat

All files in one directory with unique names:

site/
├── index.html
├── about-index.html
├── blog-post-index.html
├── style-abc123.css
└── logo-def456.png

Useful for:

Single-page archives
Avoiding deep directory nesting
Simple file listings
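
One way to arrive at names like those in the listing above is to join page path segments with hyphens and to give assets a short hash suffix so that same-named files from different directories do not collide. The sketch below is a guess at such a scheme, not Smippo's actual algorithm: the function name flat_name is hypothetical, and the choice of a 6-character SHA-1 prefix of the URL is an assumption made for illustration.

```python
import hashlib
from posixpath import splitext


def flat_name(local_path: str, url: str) -> str:
    """Flatten a domain-relative path such as 'blog/post/index.html'."""
    if local_path.endswith(".html"):
        # Pages: keep every path segment, joined with hyphens.
        return local_path.replace("/", "-")
    # Assets: keep the filename and add a short digest of the source URL
    # (assumed scheme) so identical filenames stay distinct.
    stem, ext = splitext(local_path.rsplit("/", 1)[-1])
    digest = hashlib.sha1(url.encode("utf-8")).hexdigest()[:6]
    return f"{stem}-{digest}{ext}"


print(flat_name("index.html", "https://example.com/"))
print(flat_name("about/index.html", "https://example.com/about"))
print(flat_name("blog/post/index.html", "https://example.com/blog/post"))
print(flat_name("assets/style.css", "https://example.com/assets/style.css"))
```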

Domain Structure

smippo https://example.com --structure domain

Organized by domain without path nesting:

site/
├── example.com/
│   ├── index.html
│   ├── about.html
│   └── blog-post.html
└── cdn.example.com/
    └── style.css

Entry Point

Every capture includes an index.html entry point at the root that redirects to the captured site's homepage:

<!DOCTYPE html>
<html>
<head>
  <meta http-equiv="refresh" content="0; url=example.com/index.html">
  <title>Redirecting...</title>
</head>
<body>
  <a href="example.com/index.html">Click here</a>
</body>
</html>

This allows you to open ./site/index.html directly in a browser.
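
A redirect page like the one above is simple to regenerate, for example after moving a mirror or pointing the entry point at a different start page. The sketch below builds the same markup for an arbitrary relative target. The function name entry_point_html is hypothetical and the template simply copies the example above.

```python
from html import escape


def entry_point_html(target: str) -> str:
    """Return a minimal redirect page pointing at `target` (a relative path)."""
    href = escape(target, quote=True)  # keep the attribute values well-formed
    return f"""<!DOCTYPE html>
<html>
<head>
  <meta http-equiv="refresh" content="0; url={href}">
  <title>Redirecting...</title>
</head>
<body>
  <a href="{href}">Click here</a>
</body>
</html>
"""


print(entry_point_html("example.com/index.html"))
```

The visible link is a fallback for browsers that ignore the meta refresh.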

Serving the Output

Built-in Server

smippo serve ./site --open

The built-in server:

Sets correct MIME types
Handles directory browsing
Supports CORS
Auto-finds available ports

Other Servers

The output works with any static file server:

# Python (3.7 or later for -d)
python -m http.server -d ./site

# Node
npx serve ./site

# PHP
php -S localhost:8000 -t ./site

File Types Captured

HTML Pages

Fully rendered DOM (after JavaScript execution)
Links rewritten for offline viewing
Optional: JavaScript stripped (--static)

Stylesheets

CSS files with rewritten url() references
External fonts (with --external-assets)
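
To make the idea of rewritten url() references concrete, the sketch below resolves each reference against the stylesheet's own URL and replaces it with a path relative to the stylesheet's location in the mirror. It is a simplified illustration, not Smippo's rewriting logic: the function name rewrite_css_urls is hypothetical, the regular expression does not cover every CSS edge case, and the sketch assumes each resource is stored at domain/path as in the default structure.

```python
import re
from posixpath import dirname, relpath
from urllib.parse import urljoin, urlsplit

# Matches url(...) with optional single or double quotes around the reference.
URL_REF = re.compile(r"url\(\s*(['\"]?)(?P<ref>[^'\")]+)\1\s*\)")


def rewrite_css_urls(css: str, css_url: str) -> str:
    """Rewrite url() references so they resolve inside the local mirror."""
    css_parts = urlsplit(css_url)
    css_dir = dirname(f"{css_parts.hostname}{css_parts.path}")

    def replace(match: re.Match) -> str:
        ref = match["ref"].strip()
        if ref.startswith("data:"):
            return match.group(0)  # inline data needs no rewriting
        target = urlsplit(urljoin(css_url, ref))
        # Assumption: resources live at "<domain>/<path>" in the mirror.
        local = f"{target.hostname}{target.path}"
        return f'url("{relpath(local, css_dir)}")'

    return URL_REF.sub(replace, css)


css = (
    'body { background: url("/assets/logo.png"); }\n'
    "@font-face { src: url(https://cdn.example.com/fonts/inter.woff2); }\n"
)
print(rewrite_css_urls(css, "https://example.com/assets/style.css"))
```

The font reference ends up pointing two levels up and into the cdn.example.com directory, which is why per-domain directories sit side by side under the mirror root.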

JavaScript

Script files preserved (unless --static)
Required for interactive functionality

Images & Media

All images (PNG, JPEG, GIF, WebP, SVG)
Videos and audio (unless filtered)
Optimized through browser's natural handling

Fonts

Web fonts (WOFF, WOFF2, TTF)
From CDNs (with --external-assets)

Other Assets

PDFs and documents
JSON data files
Any resource fetched by the page

Best Practices

For Offline Viewing

smippo https://example.com \
  --static \
  --external-assets \
  --depth 5

For Archiving

smippo https://example.com \
  --depth 10 \
  --structure original \
  --har

For Development

smippo https://example.com \
  --structure flat \
  --no-har

Next Steps

Link Rewriting — How links are transformed
Serve Command — View captured sites
Continue Command — Resume captures