How It Works

Smippo works differently from traditional web crawlers. Instead of fetching raw HTML, it renders each page in a real browser and captures the result as you would see it.

The Problem with Traditional Crawlers

Traditional website copiers were built for a simpler web:

Traditional Crawler:
  1. Fetch HTML
  2. Parse for links
  3. Download linked files
  4. Done ❌ (misses dynamic content)

This approach fails on modern websites because:

  • JavaScript execution — Most content is rendered client-side
  • SPA navigation — Links don't correspond to actual files
  • CSS-in-JS — Styles are generated at runtime
  • Lazy loading — Images and content load on scroll/interaction
  • Dynamic APIs — Data fetched from backend APIs
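
To make the failure concrete, here is a minimal sketch of step 2 ("Parse for links") run against the kind of HTML a single-page app typically serves. The LinkExtractor class, the extract_links helper, and both sample pages are illustrative assumptions and are not part of Smippo.

```python
from html.parser import HTMLParser


class LinkExtractor(HTMLParser):
    """Collects href values from <a> tags, as a static crawler would."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def extract_links(html: str) -> list[str]:
    """Return every <a href> found in the raw, unrendered HTML."""
    parser = LinkExtractor()
    parser.feed(html)
    parser.close()
    return parser.links


# Hypothetical static page: the links are present in the markup.
STATIC_PAGE = '<html><body><a href="/about">About</a><a href="/docs">Docs</a></body></html>'

# Hypothetical SPA shell: navigation is built by app.js at runtime,
# so the raw HTML contains no <a> tags at all.
SPA_SHELL = '<html><body><div id="root"></div><script src="/app.js"></script></body></html>'

if __name__ == "__main__":
    print(extract_links(STATIC_PAGE))
    print(extract_links(SPA_SHELL))
```

The static page yields its two links, while the SPA shell yields none, because its links only exist after the script has run in a browser.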

Smippo's Approach

Smippo controls a real browser under the hood:

[Diagram: Smippo driving a real browser]

This means you get the page exactly as you'd see it in your browser, including:

  • Fully rendered React/Vue/Angular apps
  • CSS-in-JS styles
  • Dynamically loaded content
  • Web fonts from CDNs
  • API responses

The Vacuum Architecture

Smippo's parallel worker architecture is designed for speed. Multiple browser instances work simultaneously:

[Diagram: parallel browser workers]

By default, Smippo runs 8 parallel workers, each controlling a browser tab, so up to 8 pages are captured at the same time while rate limits are still respected.
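
The worker model can be pictured as a fixed pool pulling URLs from a shared queue. The sketch below is an assumption about the general pattern, not Smippo's actual code: capture is a stand-in that sleeps instead of rendering a page, the names worker and crawl are hypothetical, and rate limiting is left out.

```python
import asyncio

WORKERS = 8  # matches the default worker count stated above


async def capture(url: str, stats: dict) -> str:
    """Stand-in for rendering one page in a browser tab (hypothetical)."""
    stats["active"] += 1
    stats["peak"] = max(stats["peak"], stats["active"])
    await asyncio.sleep(0.01)  # simulates page load and render time
    stats["active"] -= 1
    return url


async def worker(queue: asyncio.Queue, results: list, stats: dict) -> None:
    """Pulls URLs off the shared queue until the task is cancelled."""
    while True:
        url = await queue.get()
        try:
            results.append(await capture(url, stats))
        finally:
            queue.task_done()


async def crawl(urls: list[str], workers: int = WORKERS) -> tuple[list[str], int]:
    """Capture all URLs with a fixed-size pool; returns results and peak concurrency."""
    queue: asyncio.Queue = asyncio.Queue()
    for url in urls:
        queue.put_nowait(url)
    results: list[str] = []
    stats = {"active": 0, "peak": 0}
    tasks = [asyncio.create_task(worker(queue, results, stats)) for _ in range(workers)]
    await queue.join()  # wait until every queued URL has been processed
    for task in tasks:
        task.cancel()  # workers are idle now, so shut them down
    await asyncio.gather(*tasks, return_exceptions=True)
    return results, stats["peak"]


if __name__ == "__main__":
    pages = [f"https://example.com/page/{i}" for i in range(20)]
    done, peak = asyncio.run(crawl(pages))
    print(len(done), "pages captured, peak concurrency", peak)
```

Because the pool size is fixed, the number of pages in flight never exceeds the worker count, however many URLs are queued.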

Capture Flow

Here's what happens when you run smippo https://example.com --depth 2:

[Diagram: capture flow]
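
One way to think about a depth limit is as a breadth-first traversal that stops following links once a page is the maximum number of hops from the start URL. The sketch below assumes that --depth counts link hops in this way, which this page does not confirm. The SITE link graph and the crawl_to_depth function are hypothetical.

```python
from collections import deque

# Hypothetical link graph: page -> links discovered after rendering.
SITE = {
    "/": ["/about", "/docs"],
    "/about": ["/team"],
    "/docs": ["/docs/install", "/"],
    "/team": ["/team/alice"],
    "/docs/install": [],
    "/team/alice": [],
}


def crawl_to_depth(start: str, max_depth: int) -> dict[str, int]:
    """Breadth-first traversal; returns each reached page with its depth."""
    depth_of = {start: 0}
    queue = deque([start])
    while queue:
        page = queue.popleft()
        if depth_of[page] == max_depth:
            continue  # at the limit: keep the page but do not follow its links
        for link in SITE.get(page, []):
            if link not in depth_of:  # skip pages already queued or visited
                depth_of[link] = depth_of[page] + 1
                queue.append(link)
    return depth_of


if __name__ == "__main__":
    for page, depth in crawl_to_depth("/", 2).items():
        print(depth, page)
```

With a limit of 2, /team/alice is never reached because it sits three hops from the start page.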

After capture, Smippo rewrites all links to work offline:

Original HTML:

<link href="https://cdn.example.com/style.css" rel="stylesheet">
<img src="/images/logo.png">
<a href="https://example.com/about">About</a>

Rewritten HTML:

<link href="./cdn.example.com/style.css" rel="stylesheet">
<img src="./images/logo.png">
<a href="./about/index.html">About</a>

Link rewriting covers the following references:

  • <a href> links
  • <link href> stylesheets
  • <script src> scripts
  • <img src> and <img srcset> images
  • CSS url() and @import
  • Inline styles
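
The mapping in the example above can be sketched as a small URL-to-path function. The rules here are inferred from that example only and are not taken from Smippo's source: same-host URLs keep their path, other hosts go under a directory named after the host, and paths without a file extension become index.html pages. The local_path name is hypothetical. The result is relative to the mirror root, so it is only correct for the top-level page, and query strings are ignored.

```python
import posixpath
from urllib.parse import urljoin, urlsplit


def local_path(url: str, page_url: str) -> str:
    """Map a URL found on page_url to a relative path in the mirror.

    Assumed layout, inferred from the example above: same-host URLs keep
    their path, other hosts get a directory named after the host, and
    extensionless paths are stored as <path>/index.html.
    """
    page = urlsplit(page_url)
    target = urlsplit(urljoin(page_url, url))  # resolve relative references
    path = target.path or "/"
    if not posixpath.splitext(path)[1]:
        # No file extension: treat it as a page and store it as index.html.
        path = posixpath.join(path, "index.html")
    if target.netloc != page.netloc:
        # External asset: place it under a directory named after its host.
        path = "/" + target.netloc + path
    return "." + path


if __name__ == "__main__":
    page = "https://example.com/"
    for ref in ("https://cdn.example.com/style.css", "/images/logo.png", "https://example.com/about"):
        print(ref, "->", local_path(ref, page))
```

A fuller version would compute the path relative to the directory of the page being rewritten, adding ../ segments for nested pages.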

Output Structure

Every capture creates a structured output:

site/
├── example.com/
│   ├── index.html          # Rendered page
│   ├── about/
│   │   └── index.html
│   └── assets/
│       ├── style.css
│       └── logo.png
├── cdn.example.com/        # External assets
│   └── fonts/
│       └── inter.woff2
├── .smippo/
│   ├── manifest.json       # Capture metadata
│   ├── cache.json          # ETags, last-modified
│   ├── network.har         # Full HAR file
│   └── log.txt             # Capture log
└── index.html              # Entry point

The .smippo directory contains metadata for:

  • Resuming interrupted captures
  • Updating existing mirrors
  • Debugging with HAR files
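
Stored ETags and last-modified dates are what make cheap mirror updates possible: a client sends them back as If-None-Match and If-Modified-Since headers, and a 304 Not Modified response means the saved copy is still current. The sketch below shows that idea only. The layout of SAMPLE_CACHE, the field names etag and lastModified, and the conditional_headers function are assumptions, since the real schema of cache.json is not described here.

```python
def conditional_headers(cache: dict, url: str) -> dict[str, str]:
    """Build HTTP revalidation headers from a cached entry, if one exists."""
    entry = cache.get(url, {})
    headers = {}
    if entry.get("etag"):
        headers["If-None-Match"] = entry["etag"]
    if entry.get("lastModified"):
        headers["If-Modified-Since"] = entry["lastModified"]
    return headers


# Hypothetical parsed cache contents; the real cache.json schema may differ.
SAMPLE_CACHE = {
    "https://example.com/": {
        "etag": '"abc123"',
        "lastModified": "Wed, 21 Oct 2015 07:28:00 GMT",
    }
}

if __name__ == "__main__":
    print(conditional_headers(SAMPLE_CACHE, "https://example.com/"))
    print(conditional_headers(SAMPLE_CACHE, "https://example.com/new"))
```

A URL with no cache entry gets no conditional headers, so it is fetched in full.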

Smippo vs Traditional Tools

Feature                 Traditional Tools    Smippo
JavaScript execution    ❌ No                ✅ Full browser
SPA support             ❌ Limited           ✅ Native
CSS-in-JS               ❌ No                ✅ Yes
Dynamic content         ❌ No                ✅ Yes
HAR generation          ❌ No                ✅ Yes
Parallel crawling       ✅ Connections       ✅ Browser tabs
Device emulation        ❌ No                ✅ Yes
Screenshot/PDF          ❌ No                ✅ Yes
