Scope Control
Scope options determine which links Smippo follows when crawling. Proper scope configuration prevents runaway crawls while ensuring you capture what you need.
Understanding Scope
When Smippo encounters a link, it must decide: should I follow this link?
The --scope option defines the boundary:
smippo https://www.example.com --scope <type>
Scope Types
Subdomain Scope (Strictest)
smippo https://www.example.com --scope subdomain
Only follows links on the exact same subdomain:
| URL | Followed? |
|---|---|
| https://www.example.com/page | ✅ Yes |
| https://www.example.com/docs/intro | ✅ Yes |
| https://docs.example.com/ | ❌ No |
| https://example.com/ | ❌ No |
| https://other.com/ | ❌ No |
Use when: You want a specific subdomain only (e.g., just www or just docs).
Domain Scope (Default)
smippo https://www.example.com --scope domain
Follows links on the same domain and any subdomain:
| URL | Followed? |
|---|---|
| https://www.example.com/page | ✅ Yes |
| https://docs.example.com/ | ✅ Yes |
| https://api.example.com/ | ✅ Yes |
| https://example.com/ | ✅ Yes |
| https://other.com/ | ❌ No |
Use when: You want the entire site including all subdomains.
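To make the table above concrete, here is a minimal Python sketch of a domain-scope check. It is not Smippo's implementation: the function names are hypothetical, and treating the last two hostname labels as the "domain" is a simplifying assumption that breaks for suffixes such as co.uk, where a real crawler would typically consult the Public Suffix List.

```python
from urllib.parse import urlsplit


def registrable_domain(url: str) -> str:
    """Naive guess at the registrable domain: the last two hostname labels.

    Assumption for illustration only; wrong for multi-label suffixes like co.uk.
    """
    host = urlsplit(url).hostname or ""
    return ".".join(host.split(".")[-2:])


def same_domain(start_url: str, link: str) -> bool:
    """True when both URLs share the same (naively computed) domain."""
    return registrable_domain(start_url) == registrable_domain(link)


if __name__ == "__main__":
    start = "https://www.example.com"
    for link in ("https://docs.example.com/", "https://example.com/", "https://other.com/"):
        print(link, same_domain(start, link))
```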
TLD Scope
smippo https://www.example.com --scope tld
Follows links to any site on the same top-level domain (for example, any .com host when starting from www.example.com). Not recommended, because it can capture unrelated sites.
All Scope (Most Permissive)
smippo https://example.com --scope all
Follows ALL links, regardless of domain:
| URL | Followed? |
|---|---|
| https://example.com/page | ✅ Yes |
| https://docs.example.com/ | ✅ Yes |
| https://other.com/ | ✅ Yes |
| https://anything.com/ | ✅ Yes |
Warning: With --scope all and no limits, the crawl can keep following links from site to site with no natural stopping point. Always combine it with --depth and --max-pages:
smippo https://example.com --scope all --depth 2 --max-pages 100
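Why both limits matter can be seen in a small Python sketch of a breadth-first crawl over an in-memory link graph. This is an illustration under stated assumptions, not Smippo's crawler: `bounded_crawl`, its parameters, and the rule that the start page sits at depth 0 are all hypothetical.

```python
from collections import deque


def bounded_crawl(start, links, max_depth, max_pages):
    """Breadth-first traversal that stops at max_depth hops or max_pages pages.

    `links` maps each URL to the URLs it links to (a stand-in for fetching).
    """
    seen = {start}
    visited = []
    queue = deque([(start, 0)])
    while queue and len(visited) < max_pages:
        url, depth = queue.popleft()
        visited.append(url)
        if depth == max_depth:
            continue  # depth limit reached: do not expand this page's links
        for nxt in links.get(url, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, depth + 1))
    return visited


if __name__ == "__main__":
    graph = {"a": ["b", "c"], "b": ["d"], "c": ["e"], "d": ["f"]}
    print(bounded_crawl("a", graph, max_depth=2, max_pages=100))
    print(bounded_crawl("a", graph, max_depth=2, max_pages=3))
```

The depth limit bounds how far the crawl wanders from the start page, while the page limit bounds total work even when each page links to many others.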
Directory Restriction
Stay in Directory
smippo https://example.com/docs/ --stay-in-dir
Only follows links within the same directory path:
| URL | Followed? |
|---|---|
| https://example.com/docs/intro | ✅ Yes |
| https://example.com/docs/guide/start | ✅ Yes |
| https://example.com/blog/ | ❌ No |
| https://example.com/ | ❌ No |
Use when: Capturing a specific section of a site (documentation, blog category).
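The directory rule can be pictured as a path-prefix test. The Python sketch below is an assumption about how such a check might work, not Smippo's code; `in_start_directory` is a hypothetical name, and the sketch treats everything up to the last slash of the start URL's path as the directory.

```python
from urllib.parse import urlsplit


def in_start_directory(start_url: str, link: str) -> bool:
    """True when `link` is on the same host and under the start URL's directory."""
    start, target = urlsplit(start_url), urlsplit(link)
    if start.hostname != target.hostname:
        return False
    # "/docs/" -> "/docs/", "/docs/index.html" -> "/docs/", "" -> "/"
    base = start.path.rpartition("/")[0] + "/"
    return target.path.startswith(base)


if __name__ == "__main__":
    start = "https://example.com/docs/"
    for link in ("https://example.com/docs/guide/start", "https://example.com/blog/"):
        print(link, in_start_directory(start, link))
```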
Combining with Scope
smippo https://docs.example.com/v2/ --scope subdomain --stay-in-dir
This captures only:
- Same subdomain (docs.example.com)
- Same directory tree (/v2/*)
External Assets
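By default, Smippo captures only pages and assets that fall within scope. Pages often load assets (images, CSS, JS) from other hosts such as CDNs and font services, and those are skipped unless you enable external assets.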
Enable External Assets
smippo https://example.com --external-assets
This captures assets from CDNs and external domains:
| Resource | Without Flag | With Flag |
|---|---|---|
| https://cdn.example.com/style.css | ❌ Skip | ✅ Capture |
| https://fonts.googleapis.com/ | ❌ Skip | ✅ Capture |
| https://example.com/logo.png | ✅ Capture | ✅ Capture |
Use when: You want a complete offline copy with all fonts, images, and styles.
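The table above reduces to a single rule, sketched here in Python. This is only a model of the documented behavior, not Smippo's implementation: `should_capture_asset` is a hypothetical name, and treating "external" as "served from a different hostname than the start URL" is an assumption chosen to match the table.

```python
from urllib.parse import urlsplit


def should_capture_asset(start_url: str, asset_url: str, external_assets: bool = False) -> bool:
    """Capture same-host assets always; capture other hosts only when the flag is on."""
    same_host = urlsplit(start_url).hostname == urlsplit(asset_url).hostname
    return same_host or external_assets


if __name__ == "__main__":
    start = "https://example.com"
    for asset in ("https://cdn.example.com/style.css", "https://example.com/logo.png"):
        print(asset, should_capture_asset(start, asset), should_capture_asset(start, asset, True))
```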
Practical Example
For a fully offline documentation site:
smippo https://docs.example.com \
--depth 5 \
--scope subdomain \
--external-assets \
--static
This captures:
- All pages on docs.example.com
- External CSS, fonts, and images
- Static HTML (no JavaScript needed)
Scope Decision Tree
New link discovered: https://target.com/page
│
├─ Is it the same subdomain?
│   └─ Yes → Follow (any scope)
│
├─ Is it the same domain (different subdomain)?
│   ├─ scope = subdomain → Don't follow
│   └─ scope = domain/tld/all → Follow
│
└─ Is it a different domain?
    ├─ scope = subdomain/domain → Don't follow
    ├─ scope = tld → Follow only if the top-level domain matches
    └─ scope = all → Follow
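The decision tree can also be written as one function. The Python sketch below is a hedged approximation rather than Smippo's actual logic: `should_follow` is a hypothetical name, the domain is naively taken as the last two hostname labels, and the tld branch assumes that "same top-level domain" means the final label matches.

```python
from urllib.parse import urlsplit

SCOPES = ("subdomain", "domain", "tld", "all")


def should_follow(start_url: str, link: str, scope: str = "domain") -> bool:
    """Walk the decision tree for one discovered link."""
    if scope not in SCOPES:
        raise ValueError(f"unknown scope: {scope!r}")
    start = urlsplit(start_url).hostname or ""
    target = urlsplit(link).hostname or ""
    if scope == "all" or start == target:
        return True  # same subdomain is followed under every scope
    if scope == "subdomain":
        return False
    s, t = start.split("."), target.split(".")
    if s[-2:] == t[-2:]:
        return True  # same domain, different subdomain (domain or tld scope)
    if scope == "tld":
        return s[-1] == t[-1]  # assumption: only the final label must match
    return False


if __name__ == "__main__":
    start = "https://www.example.com"
    for scope in SCOPES:
        print(scope, should_follow(start, "https://docs.example.com/", scope))
```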
Common Configurations
Documentation Site
smippo https://docs.framework.com \
--scope subdomain \
--depth 10 \
--external-assets
Company Website
smippo https://www.company.com \
--scope domain \
--depth 5 \
--exclude "*/careers/*" \
--exclude "*/press/*"
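To see what patterns like those in the command above would match, here is a small Python sketch using shell-style globs. It assumes that --exclude patterns behave like glob patterns applied to the full URL, which this page does not confirm; `is_excluded` is a hypothetical helper.

```python
from fnmatch import fnmatchcase


def is_excluded(url: str, patterns: list[str]) -> bool:
    """True when the URL matches any glob pattern (here `*` also matches `/`)."""
    return any(fnmatchcase(url, pattern) for pattern in patterns)


if __name__ == "__main__":
    patterns = ["*/careers/*", "*/press/*"]
    for url in ("https://www.company.com/careers/open-roles", "https://www.company.com/about"):
        print(url, is_excluded(url, patterns))
```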
Blog Archive
smippo https://blog.example.com/posts/ \
--scope subdomain \
--stay-in-dir \
--depth 3
Multi-Site Crawl (Careful!)
smippo https://hub.example.com \
--scope all \
--depth 2 \
--max-pages 500 \
--max-time 600
Troubleshooting
"Capture takes forever"
Your scope is probably too broad. Narrow it and cap the crawl:
smippo https://example.com \
--scope subdomain \
--max-pages 200 \
--max-time 300
"Missing pages"
Your scope may be too narrow. If the missing pages live on another subdomain, widen the scope:
smippo https://www.example.com --scope domain
"Missing images/fonts"
You need external assets:
smippo https://example.com --external-assets
Next Steps
- Filtering — Fine-tune what gets captured
- Performance — Speed up large captures
- Options Reference — All options explained