How It Works
Smippo is fundamentally different from traditional web crawlers. Instead of fetching raw HTML, it renders pages in a real browser—capturing exactly what you see.
The Problem with Traditional Crawlers
Traditional website copiers were built for a simpler web:
Traditional Crawler:
1. Fetch HTML
2. Parse for links
3. Download linked files
4. Done ❌ (misses dynamic content)
This approach fails on modern websites because:
- JavaScript execution — Most content is rendered client-side
- SPA navigation — Links don't correspond to actual files
- CSS-in-JS — Styles are generated at runtime
- Lazy loading — Images and content load on scroll/interaction
- Dynamic APIs — Data fetched from backend APIs
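To make the limitation concrete, here is a minimal sketch of the "parse for links" step using Python's standard `html.parser`. The `SPA_SHELL` markup and the `LinkCollector` class are hypothetical illustrations, not part of any real crawler. Because the navigation link and image only exist after the script runs, a static parse never sees them.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collects href/src attribute values the way a static crawler would."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name in ("href", "src") and value:
                self.links.append(value)


def static_links(html):
    """Return every href/src found in the raw, unexecuted HTML."""
    parser = LinkCollector()
    parser.feed(html)
    parser.close()
    return parser.links


# Hypothetical single-page-app shell: the real content is injected by JavaScript.
SPA_SHELL = """
<html><body>
  <div id="root"></div>
  <script src="/bundle.js"></script>
  <script>
    document.getElementById("root").innerHTML =
      '<a href="/about">About</a><img src="/images/logo.png">';
  </script>
</body></html>
"""

found = static_links(SPA_SHELL)
print(found)  # only the script bundle is discovered
```

The parser treats the script body as opaque text, so `/about` and `/images/logo.png` are never queued for download.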
Smippo's Approach
Smippo controls a real browser under the hood.
This means you get the page exactly as you'd see it in your browser, including:
- Fully rendered React/Vue/Angular apps
- CSS-in-JS styles
- Dynamically loaded content
- Web fonts from CDNs
- API responses
The Vacuum Architecture
Smippo's parallel worker architecture is designed for speed. Multiple browser instances work simultaneously.
By default, Smippo runs 8 parallel workers, each controlling a browser tab. That lets it capture up to 8 pages simultaneously while respecting rate limits.
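The worker-pool idea can be sketched with `asyncio`. This is an illustration under stated assumptions, not Smippo's implementation: `capture` is a hypothetical stand-in for rendering a page in a tab, the URLs are made up, and rate limiting is not modeled. The sketch also records the peak number of captures in flight to show that concurrency is capped by the worker count.

```python
import asyncio


async def capture(url):
    """Hypothetical stand-in for rendering one page in a browser tab."""
    await asyncio.sleep(0.01)  # simulate page load time
    return f"captured {url}"


async def run_pool(urls, workers=8):
    """Drain a shared queue of URLs with a fixed number of workers."""
    queue = asyncio.Queue()
    for url in urls:
        queue.put_nowait(url)

    results = []
    in_flight = 0
    peak = 0

    async def worker():
        nonlocal in_flight, peak
        while True:
            try:
                url = queue.get_nowait()
            except asyncio.QueueEmpty:
                return  # nothing left to capture
            in_flight += 1
            peak = max(peak, in_flight)
            try:
                results.append(await capture(url))
            finally:
                in_flight -= 1

    await asyncio.gather(*(worker() for _ in range(workers)))
    return results, peak


pages = [f"https://example.com/page/{i}" for i in range(20)]
captured, peak = asyncio.run(run_pool(pages))
print(len(captured), peak)
```

A shared queue keeps every worker busy until the work runs out, which matters when page load times vary widely.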
Capture Flow
Here's what happens when you run smippo https://example.com --depth 2:
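As a rough illustration of how a depth limit bounds a crawl, the sketch below runs a breadth-first traversal over a hypothetical in-memory link graph. It assumes the start page counts as depth 0 and that `--depth 2` means following links at most two hops away; whether Smippo counts depth this way is an assumption, and `LINKS` and `crawl` are invented for the example.

```python
from collections import deque

# Hypothetical link graph standing in for a real site.
LINKS = {
    "/": ["/about", "/blog"],
    "/about": ["/team"],
    "/blog": ["/blog/post-1"],
    "/team": ["/team/alice"],
    "/blog/post-1": [],
    "/team/alice": [],
}


def crawl(start, max_depth):
    """Visit pages breadth-first, never following links past max_depth."""
    seen = {start}
    order = []
    queue = deque([(start, 0)])
    while queue:
        page, depth = queue.popleft()
        order.append((page, depth))  # a real tool would capture the page here
        if depth == max_depth:
            continue  # do not follow links from pages at the depth limit
        for link in LINKS.get(page, []):
            if link not in seen:
                seen.add(link)
                queue.append((link, depth + 1))
    return order


visited = crawl("/", 2)
for page, depth in visited:
    print(depth, page)
```

In this graph `/team/alice` sits three hops from the start page, so it is never visited.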
Link Rewriting
After capture, Smippo rewrites all links to work offline:
Original HTML:
<link href="https://cdn.example.com/style.css" rel="stylesheet">
<img src="/images/logo.png">
<a href="https://example.com/about">About</a>
Rewritten HTML:
<link href="./cdn.example.com/style.css" rel="stylesheet">
<img src="./images/logo.png">
<a href="./about/index.html">About</a>
This includes:
- <a href> links
- <link href> stylesheets
- <script src> scripts
- <img src> and <img srcset> images
- CSS url() and @import
- Inline styles
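The mapping in the example above can be approximated with a small helper. This is a sketch that reproduces only those three rewrites, assuming the referring page sits at the site root; `to_local_path` and its rules (external hosts become a directory, extensionless paths become `index.html`) are inferred from the example rather than taken from Smippo's source.

```python
import posixpath
from urllib.parse import urljoin, urlsplit


def to_local_path(url, page_url):
    """Map a URL found on page_url to a relative path for offline use."""
    absolute = urlsplit(urljoin(page_url, url))  # resolve root-relative URLs
    page_host = urlsplit(page_url).netloc
    path = absolute.path.lstrip("/")

    # Assumption: a path without a file extension is a page, stored as a directory index.
    if not posixpath.splitext(path)[1]:
        path = posixpath.join(path, "index.html") if path else "index.html"

    # Assumption: assets from other hosts are stored under a directory named after the host.
    if absolute.netloc != page_host:
        path = posixpath.join(absolute.netloc, path)

    return "./" + path


PAGE = "https://example.com/"
for original in (
    "https://cdn.example.com/style.css",
    "/images/logo.png",
    "https://example.com/about",
):
    print(original, "->", to_local_path(original, PAGE))
```

A complete rewriter would also compute the correct number of `../` segments for pages nested below the root, which this sketch leaves out.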
Output Structure
Every capture creates a structured output:
site/
├── example.com/
│ ├── index.html # Rendered page
│ ├── about/
│ │ └── index.html
│ └── assets/
│ ├── style.css
│ └── logo.png
├── cdn.example.com/ # External assets
│ └── fonts/
│ └── inter.woff2
├── .smippo/
│ ├── manifest.json # Capture metadata
│ ├── cache.json # ETags, last-modified
│ ├── network.har # Full HAR file
│ └── log.txt # Capture log
└── index.html # Entry point
The .smippo directory contains metadata for:
- Resuming interrupted captures
- Updating existing mirrors
- Debugging with HAR files
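For the debugging use case, a HAR file is plain JSON, so failed requests can be listed with a few lines of Python. The sketch below relies only on the standard HAR layout (`log.entries[].request` and `log.entries[].response`); `SAMPLE_HAR` is made-up data and `failed_requests` is a hypothetical helper, not a Smippo API.

```python
import json


def load_har(path):
    """Read a HAR file from disk, for example site/.smippo/network.har."""
    with open(path, encoding="utf-8") as handle:
        return json.load(handle)


def failed_requests(har):
    """Return (status, method, url) for every response with a 4xx or 5xx status."""
    failures = []
    for entry in har["log"]["entries"]:
        status = entry["response"]["status"]
        if status >= 400:
            request = entry["request"]
            failures.append((status, request["method"], request["url"]))
    return failures


# Made-up HAR fragment with only the fields this sketch reads.
SAMPLE_HAR = {
    "log": {
        "entries": [
            {
                "request": {"method": "GET", "url": "https://example.com/"},
                "response": {"status": 200},
            },
            {
                "request": {"method": "GET", "url": "https://example.com/missing.png"},
                "response": {"status": 404},
            },
        ]
    }
}

for status, method, url in failed_requests(SAMPLE_HAR):
    print(status, method, url)
```

Against a real capture, `failed_requests(load_har("site/.smippo/network.har"))` would list the assets that did not download.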
Smippo vs Traditional Tools
| Feature | Traditional Tools | Smippo |
|---|---|---|
| JavaScript execution | ❌ No | ✅ Full browser |
| SPA support | ❌ Limited | ✅ Native |
| CSS-in-JS | ❌ No | ✅ Yes |
| Dynamic content | ❌ No | ✅ Yes |
| HAR generation | ❌ No | ✅ Yes |
| Parallel crawling | ✅ Connections | ✅ Browser tabs |
| Device emulation | ❌ No | ✅ Yes |
| Screenshot/PDF | ❌ No | ✅ Yes |
Learn More
- Capture Command — Full command reference
- Output Structure — Detailed output explanation
- Link Rewriting — How links are transformed