How It Works

Smippo is fundamentally different from traditional web crawlers. Instead of fetching raw HTML, it renders pages in a real browser—capturing exactly what you see.

The Problem with Traditional Crawlers

Traditional website copiers were built for a simpler web:

Traditional Crawler:
  1. Fetch HTML
  2. Parse for links
  3. Download linked files
  4. Done ❌ (misses dynamic content)

This approach fails on modern websites because:

JavaScript execution — Most content is rendered client-side
SPA navigation — Links don't correspond to actual files
CSS-in-JS — Styles are generated at runtime
Lazy loading — Images and content load on scroll/interaction
Dynamic APIs — Data fetched from backend APIs
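
To make the failure concrete, here is a minimal sketch of the "fetch HTML, parse for links" step applied to a single-page-app shell. The `LinkExtractor` class, the `extract_links` helper, and the `SPA_SHELL` markup are hypothetical illustrations, not taken from any particular crawler:

```python
from html.parser import HTMLParser


class LinkExtractor(HTMLParser):
    """Collects href/src values the way a raw-HTML crawler would."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name in ("href", "src") and value:
                self.links.append(value)


# Hypothetical SPA shell: the real content only exists after
# /app.js has run inside a browser.
SPA_SHELL = """
<html>
  <head><script src="/app.js"></script></head>
  <body><div id="root"></div></body>
</html>
"""


def extract_links(html):
    parser = LinkExtractor()
    parser.feed(html)
    return parser.links


print(extract_links(SPA_SHELL))  # ['/app.js']
```

The only thing the parser can discover is the script itself. Every page, image, and API call that the script would produce at runtime is invisible to it.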

Smippo's Approach

Smippo controls a real browser under the hood.

This means you get the page exactly as you'd see it in your browser, including:

Fully rendered React/Vue/Angular apps
CSS-in-JS styles
Dynamically loaded content
Web fonts from CDNs
API responses

The Vacuum Architecture

Smippo's parallel worker architecture is designed for speed. Multiple browser instances work simultaneously.

By default, Smippo runs 8 parallel workers, each controlling a browser tab, so it can capture up to 8 pages at once while still respecting rate limits.
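
The idea of a fixed pool of workers can be sketched with a semaphore that caps how many captures run at once. This is an illustration only, not Smippo's implementation: `capture` is a stand-in that sleeps instead of driving a browser tab, all names are hypothetical, and rate limiting is not modeled.

```python
import asyncio

WORKERS = 8  # the default worker count mentioned above


async def capture(url, limiter, stats):
    # Stand-in for "render the page in a browser tab".
    async with limiter:
        stats["active"] += 1
        stats["peak"] = max(stats["peak"], stats["active"])
        await asyncio.sleep(0.01)  # simulated page load
        stats["active"] -= 1
        return url


async def capture_all(urls, workers=WORKERS):
    limiter = asyncio.Semaphore(workers)
    stats = {"active": 0, "peak": 0}
    done = await asyncio.gather(*(capture(u, limiter, stats) for u in urls))
    return done, stats["peak"]


def run_demo(n=20):
    urls = [f"https://example.com/page/{i}" for i in range(n)]
    return asyncio.run(capture_all(urls))


if __name__ == "__main__":
    captured, peak = run_demo()
    print(len(captured), "pages captured, peak concurrency", peak)
```

However many URLs are queued, the peak number of in-flight captures never exceeds the worker count.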

Capture Flow

Here's what happens when you run smippo https://example.com --depth 2:

[Diagram: capture flow]
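
A depth-limited capture of this kind can be pictured as a breadth-first walk over the site's links. The sketch below runs on a hypothetical in-memory link graph (`SITE`) and assumes the start page counts as depth 0. Both are assumptions made for illustration and may not match how Smippo interprets --depth.

```python
from collections import deque

# Hypothetical link graph standing in for a real site.
SITE = {
    "/": ["/about", "/blog"],
    "/about": ["/team"],
    "/blog": ["/blog/post-1"],
    "/team": ["/team/alice"],
    "/blog/post-1": [],
    "/team/alice": [],
}


def crawl(start, max_depth, links=SITE):
    """Breadth-first walk that stops following links past max_depth."""
    seen = {start}
    order = []
    queue = deque([(start, 0)])
    while queue:
        page, depth = queue.popleft()
        order.append(page)  # "capture" the page
        if depth == max_depth:
            continue  # do not follow links any deeper
        for target in links.get(page, []):
            if target not in seen:
                seen.add(target)
                queue.append((target, depth + 1))
    return order


print(crawl("/", 2))
# ['/', '/about', '/blog', '/team', '/blog/post-1']
```

With a depth of 2, /team/alice is never captured because it sits three links away from the start page.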

Link Rewriting

After capture, Smippo rewrites all links to work offline:

Original HTML:

<link href="https://cdn.example.com/style.css" rel="stylesheet">
<img src="/images/logo.png">
<a href="https://example.com/about">About</a>

Rewritten HTML:

<link href="./cdn.example.com/style.css" rel="stylesheet">
<img src="./images/logo.png">
<a href="./about/index.html">About</a>

Rewriting covers:

<a href> links
<link href> stylesheets
<script src> scripts
<img src> and <img srcset> images
CSS url() and @import
Inline styles
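
The mapping in the example above can be approximated with a small function. The rules here are inferred from that example only, and `rewrite_url` is a hypothetical name rather than Smippo's API. The sketch also assumes the referencing page sits at the root of the mirror, so a real rewriter would have to compute paths relative to each page's own location.

```python
import posixpath
from urllib.parse import urljoin, urlsplit


def rewrite_url(url, page_url):
    """Map a URL found on page_url to a relative offline path.

    Assumed rules, inferred from the example above:
      - same-host URLs keep their path
      - other hosts get the hostname as a leading directory
      - paths without a file extension become <path>/index.html
    """
    absolute = urlsplit(urljoin(page_url, url))
    page_host = urlsplit(page_url).netloc
    path = absolute.path.strip("/")
    if not posixpath.splitext(path)[1]:
        path = posixpath.join(path, "index.html")
    if absolute.netloc != page_host:
        path = posixpath.join(absolute.netloc, path)
    return "./" + path


page = "https://example.com/"
for original in (
    "https://cdn.example.com/style.css",
    "/images/logo.png",
    "https://example.com/about",
):
    print(original, "->", rewrite_url(original, page))
```

Run against the three URLs from the example, this prints ./cdn.example.com/style.css, ./images/logo.png, and ./about/index.html.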

Output Structure

Every capture creates a structured output:

site/
├── example.com/
│   ├── index.html          # Rendered page
│   ├── about/
│   │   └── index.html
│   └── assets/
│       ├── style.css
│       └── logo.png
├── cdn.example.com/        # External assets
│   └── fonts/
│       └── inter.woff2
├── .smippo/
│   ├── manifest.json       # Capture metadata
│   ├── cache.json          # ETags, last-modified
│   ├── network.har         # Full HAR file
│   └── log.txt             # Capture log
└── index.html              # Entry point

The .smippo directory contains metadata for:

Resuming interrupted captures
Updating existing mirrors
Debugging with HAR files
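
As an example of the debugging use case, a HAR file is plain JSON, so failed requests can be listed with a few lines of code. The sketch below relies only on the standard HAR fields `log.entries[].request` and `log.entries[].response.status`. The sample data and the `failed_requests` helper are made up for illustration.

```python
import json

# Tiny hand-written HAR fragment; a real file has many more fields.
SAMPLE_HAR = json.dumps({
    "log": {
        "entries": [
            {"request": {"method": "GET", "url": "https://example.com/"},
             "response": {"status": 200}},
            {"request": {"method": "GET", "url": "https://example.com/api/items"},
             "response": {"status": 500}},
            {"request": {"method": "GET", "url": "https://cdn.example.com/missing.png"},
             "response": {"status": 404}},
        ]
    }
})


def failed_requests(har_text):
    """Return (status, method, url) for every entry with status >= 400."""
    har = json.loads(har_text)
    failures = []
    for entry in har["log"]["entries"]:
        status = entry["response"]["status"]
        if status >= 400:
            request = entry["request"]
            failures.append((status, request["method"], request["url"]))
    return failures


for status, method, url in failed_requests(SAMPLE_HAR):
    print(status, method, url)
```

To inspect a real capture, pass the text of .smippo/network.har to `failed_requests` instead of the sample.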

Smippo vs Traditional Tools

Feature               Traditional Tools   Smippo
JavaScript execution  ❌ No               ✅ Full browser
SPA support           ❌ Limited          ✅ Native
CSS-in-JS             ❌ No               ✅ Yes
Dynamic content       ❌ No               ✅ Yes
HAR generation        ❌ No               ✅ Yes
Parallel crawling     ✅ Connections      ✅ Browser tabs
Device emulation      ❌ No               ✅ Yes
Screenshot/PDF        ❌ No               ✅ Yes

Learn More

Capture Command — Full command reference
Output Structure — Detailed output explanation
Link Rewriting — How links are transformed