SpyWeb — CDP Documentation

Overview

Spyweb does not bundle a browser. Unlike tools that download a 300MB+ Chromium binary you don't need, Spyweb uses whatever browser you already have installed to render JavaScript-heavy pages. You don't download anything extra. You don't configure anything special. It just works.

If you need JS rendering in a hook (to scrape a React site, bypass a Cloudflare challenge, click buttons, wait for DOM elements), you use the cdp module. That's it.

All hooks are optional. CDP is only needed when you require JavaScript rendering. For simple HTML pages, the built-in HTTP client is faster and uses zero extra resources.

How It Works

Spyweb can get a browser two ways:

Launch one — cdp.launch() finds Chrome, Edge, or Brave on your system, starts it in headless mode, and connects to it automatically. Zero config. You can also pass { executable = "/path/to/browser" } to use a specific binary.
Connect to an existing one — cdp.connect("ws://...") attaches to any browser or CDP-compatible server already running (useful for Lightpanda, Obscura, remote debugging, etc.).

Once connected, you call browser:attach() to get a page/tab, then drive it with simple commands like page:open(url), page:wait_for_selector(), page:click(), and page:content() to get the rendered HTML.

Under the Hood

When you call cdp.launch(), Spyweb spawns the browser as a child process with --remote-debugging-port=0, which makes the browser pick a random free port. It reads the DevTools WebSocket URL from the browser's stderr output, connects to it, and gives you a Browser object. Every command and response travels as a JSON-RPC message over that WebSocket, the same protocol Chrome's own DevTools uses.
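
To make the stderr step concrete, here is a minimal sketch of how the WebSocket URL could be picked out of a log line. It is not Spyweb's actual implementation. It assumes the "DevTools listening on ws://..." line that Chromium-based browsers print, and parse_devtools_url and parse_port are hypothetical helper names.

```lua
-- Extract the DevTools WebSocket URL from one line of browser stderr.
-- Chromium-based browsers print a line of the form:
--   DevTools listening on ws://127.0.0.1:<port>/devtools/browser/<id>
local function parse_devtools_url(line)
    -- %S+ captures everything up to the next whitespace character.
    return line:match("DevTools listening on (wss?://%S+)")
end

-- Pull the port out of a ws:// URL so it can be logged or reused.
local function parse_port(ws_url)
    local port = ws_url:match("^wss?://[^/:]+:(%d+)")
    return port and tonumber(port) or nil
end

-- Sample line; the port and id are made up for illustration.
local sample = "DevTools listening on ws://127.0.0.1:41237/devtools/browser/0f3c-example"
local url = parse_devtools_url(sample)
print(url, parse_port(url))
```

The same URL format is what you would pass to cdp.connect() when attaching to a browser you started yourself.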

Each page/tab gets its own dedicated WebSocket connection. This is why browser:attach() doesn't just return a reference — it actually opens a new WebSocket to the page's specific endpoint.
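
The JSON-RPC exchange mentioned above can be pictured with a small model. The id, method, params, result, and error fields are the CDP wire format. The Connection table and its methods are hypothetical names used only for illustration, and they say nothing about how Spyweb is structured internally. Because each page has its own WebSocket, a sketch like this would keep one id counter per connection.

```lua
-- Minimal model of CDP request/response correlation on one connection.
local Connection = {}
Connection.__index = Connection

local function new_connection()
    return setmetatable({ next_id = 0, pending = {} }, Connection)
end

-- Build the request envelope; a real transport would JSON-encode and send it.
function Connection:request(method, params)
    self.next_id = self.next_id + 1
    self.pending[self.next_id] = method
    return { id = self.next_id, method = method, params = params or {} }
end

-- Classify an incoming (already decoded) message.
function Connection:dispatch(msg)
    if msg.id ~= nil then
        -- A message with an id answers an earlier request.
        local method = self.pending[msg.id]
        self.pending[msg.id] = nil
        if msg.error then
            return "error", method, msg.error.message
        end
        return "result", method, msg.result
    end
    -- Messages without an id are events pushed by the browser.
    return "event", msg.method, msg.params
end

local conn = new_connection()
local req = conn:request("Page.navigate", { url = "https://example.com" })
print(req.id, req.method)
print(conn:dispatch({ id = req.id, result = { frameId = "F1" } }))
print(conn:dispatch({ method = "Page.loadEventFired", params = { timestamp = 1 } }))
```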

By default, closing the browser kills the process and cleans up the profile. You can override this with keep_alive = true to keep the browser warm across multiple hook calls.

Integration

Basic Usage in `override_fetch`

CDP is available globally via the cdp table. While it can be used in any hook, it is most commonly used in override_fetch to replace the default HTTP client with a real browser.

function override_fetch(request)
    -- 1. Launch a temporary browser
    local browser = cdp.launch({})

    -- 2. Attach to a page (tab)
    local page = browser:attach()

    -- 3. Navigate safely and check for errors
    local ok, err = page:open(request.url)
    if not ok then
        browser:close()
        return { error = "Navigation failed: " .. tostring(err) }
    end

    -- 4. Wait for content before extraction
    local found, wait_err = page:wait_for_selector(".dynamic-content", 10000)
    if not found then
        browser:close()
        return { error = "Selector timeout: " .. tostring(wait_err) }
    end

    -- 5. Get the rendered HTML and clean up
    local html = page:content()
    browser:close()

    -- 6. Return the HTML to the pipeline, where extraction or an after_fetch hook picks it up
    return {
        status = 200,
        body = html,
        url = request.url
    }
end

Persistent Browser Pattern

Launching a new browser for every fetch is slow. Because Spyweb persists the Lua state for each job, you can keep a browser instance "warm" across multiple scraper runs. Store the browser in a global variable (without local) to keep it alive between iterations.

-- This runs once when the job is loaded
if not browser then
    print("[CDP] Launching persistent browser...")
    browser = cdp.launch({
        headless = true,
        keep_alive = true
    })
end

function override_fetch(request)
    local page = browser:attach()
    local ok, err = page:open(request.url)
    if not ok then
        page:close()
        return { error = "Failed to load " .. request.url .. ": " .. tostring(err) }
    end
    if page:wait_for_selector(".item", 5000) then
        local html = page:content()
        page:close()

        -- Return the HTML to the pipeline, where extraction or an after_fetch hook picks it up
        return {
            status = 200,
            body = html,
            url = request.url
        }
    else
        page:close()
        return { error = "Content timeout" }
    end
end

keep_alive prevents the browser from being killed when the Lua variable goes out of scope. Without it, the browser closes when the variable is garbage collected.

Browser API Reference

These methods are provided directly by the Spyweb Core on the Browser object returned by cdp.launch() or cdp.connect().

browser:attach([opts]) async
Returns a Page object. Accepts an optional options table.

Option	Type	Default	Description
`opts.url`	string	`"about:blank"`	Initial URL for the new tab.
`opts.reuse`	boolean	`true`	Reuse an existing blank tab if available. Only applies to the default context.
`opts.browserContextId`	string	none	Attach to a specific isolated context created via `browser:new_context()`.

-- Attach with a specific URL
local page = browser:attach({ url = "https://example.com" })

-- Attach within an isolated context
local ctx = browser:new_context()
local page = ctx:attach("https://example.com")

browser:new_context() async
Creates an isolated browser context (similar to an incognito profile, with its own cookies and storage). Returns a Context object.

local ctx = browser:new_context()
local page = ctx:attach("https://example.com")
-- Cookies/storage in this context are isolated from the default context
ctx:close()

browser:close() sync
Closes the browser and kills the underlying process. Also releases the profile lock file.

browser:close()

browser:get_user_data_dir() sync
Returns the path to the browser's profile directory, or nil if the browser was connected via WebSocket (no local profile).

local dir = browser:get_user_data_dir()
print("Profile at: " .. tostring(dir))

browser:call(method, params) async
Sends a raw CDP command at the browser level. Use this for browser-scoped domains like Target, Browser, etc.

-- List all targets (tabs/pages) in the browser
local targets = browser:call("Target.getTargets", {})
for _, t in ipairs(targets.targetInfos) do
    print(t.targetId, t.url)
end

browser:wait_event(event, [timeout_ms], [predicate]) async
Waits for a browser-level CDP event. Accepts an optional timeout in milliseconds and an optional predicate function.

-- Wait for a new target to be created (up to 5 seconds)
local event = browser:wait_event("Target.targetCreated", 5000)
print("New target: " .. event.params.targetInfo.targetId)

browser:attach_session(target_id) async
Attaches to an existing target by ID and returns a session ID for scoped CDP communication.

local result = browser:attach_session(target_id)
local session_id = result.sessionId

browser:call_session(session_id, method, params) async
Sends a CDP command scoped to a specific session (target) obtained via attach_session.

local result = browser:call_session(sid, "Page.navigate", {
    url = "https://example.com"
})

browser:wait_session_event(session_id, event, [opts]) async
Waits for a CDP event scoped to a specific session. opts can include timeout_ms and predicate.

local ev = browser:wait_session_event(sid, "Page.loadEventFired", {
    timeout_ms = 10000
})

Page API (Native)

The Page object combines native transport methods with high-level Lua helpers. It is returned by browser:attach() and context:attach().

Native Methods

Low-level CDP methods available directly on every page object.

page:call(method, params) async
Raw page-level CDP command. Sends a JSON-RPC message to the page's dedicated WebSocket.

local result = page:call("Runtime.evaluate", {
    expression = "document.title",
    returnByValue = true
})
print(result.result.value)

page:call_save(method, params, path) async
Like call but optimized for binary responses. Decodes the base64 data field and saves it to path. Returns the JSON result without the massive data string.

-- Render the current page to PDF and save it to disk
page:call_save("Page.printToPDF", {}, "/tmp/report.pdf")
-- Result includes { saved_to = "/tmp/report.pdf" } instead of the raw data

page:wait_event(event, ...) async
Waits for a page-level CDP event. Accepts variadic arguments: timeout (number), predicate (function), or a table with timeout_ms and predicate.

-- Wait for page load with default 30s timeout
local ev = page:wait_event("Page.loadEventFired")

-- With predicate and timeout
local ev = page:wait_event("Network.responseReceived", 5000, function(params)
    return params.response.status == 200
end)

page:close() async
Closes the specific tab/page. Does not affect the browser process or other tabs.

page:close()

Page API (Helpers)

Playwright-like convenience methods injected by cdp.lua. These are plain Lua functions — you can override or extend them on any page object.

page:open(url, [wait_until], [timeout_ms]) async
Navigate to a URL. wait_until is a CDP event name (default "Page.loadEventFired"). Pass false to skip waiting. Returns true or nil, error.

-- Default: wait for full load (30s timeout)
local ok, err = page:open("https://example.com")

-- Wait for DOM content instead of full load
page:open("https://example.com", "Page.domContentEventFired")

-- Custom timeout
page:open("https://example.com", "Page.loadEventFired", 15000)

-- Fire-and-forget: navigate without waiting
page:open("https://example.com", false)
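
Because page:open() reports failure as nil plus an error instead of raising, it is easy to wrap. The following is a minimal retry sketch. open_with_retry is a hypothetical helper, not part of the cdp module, and the fake_page table exists only so the snippet runs outside Spyweb.

```lua
-- Retry page:open() a few times before giving up.
-- Relies only on the documented contract: true on success, or nil plus an error.
local function open_with_retry(page, url, attempts, timeout_ms)
    attempts = attempts or 3
    local last_err
    for attempt = 1, attempts do
        local ok, err = page:open(url, "Page.loadEventFired", timeout_ms or 30000)
        if ok then
            return true, attempt
        end
        last_err = err
    end
    return nil, "gave up after " .. attempts .. " attempts: " .. tostring(last_err)
end

-- Stand-in page used only to exercise the helper outside Spyweb:
-- it fails twice with a timeout, then succeeds.
local calls = 0
local fake_page = {}
function fake_page:open(url)
    calls = calls + 1
    if calls < 3 then
        return nil, "timeout"
    end
    return true
end

print(open_with_retry(fake_page, "https://example.com", 5))
```

In a hook you would pass the real page object from browser:attach() instead of fake_page.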

page:wait_for_selector(selector, [opts]) async
Poll the DOM until a CSS selector exists. Supports shadow DOM and iframe traversal. opts can be a number (timeout_ms) or a table.

Option	Type	Default	Description
`timeout_ms`	number	`10000`	Max wait time in milliseconds.
`poll_ms`	number	`100`	Poll interval in milliseconds.
`visible`	boolean	`false`	Only return `true` if the element is visible.
`scroll`	boolean	`true`	Scroll to the element or scroll down if not found.

-- Simple timeout
local found, info = page:wait_for_selector(".content", 5000)

-- Full options
local found, info = page:wait_for_selector(".product", {
    timeout_ms = 15000,
    poll_ms = 200,
    visible = true,
    scroll = true
})
if found then
    print("Tag: " .. info.tagName)
    print("Text: " .. info.text)
end

page:wait_for_url(pattern, timeout_ms) async
Polls location.href until it matches the string pattern. Returns the URL or nil, error.

-- Wait until the URL contains "/success"
local url = page:wait_for_url("/success", 10000)
if url then
    print("Redirected to: " .. url)
end

page:wait_for_response([predicate], [timeout_ms]) async
Waits for Network.responseReceived. The optional predicate function receives the event params and should return true when the desired response is found.

-- Wait for any response
local resp = page:wait_for_response(nil, 5000)

-- Wait for a specific API response
local resp = page:wait_for_response(function(params)
    return params.response.url:find("/api/data") and params.response.status == 200
end, 10000)
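
One thing to watch in predicates: string.find treats its argument as a Lua pattern, so URL fragments containing characters such as "?" or "." may not match literally. Below is a small sketch of a predicate factory that uses plain-text matching. response_matcher is a hypothetical helper, and it assumes the params.response.url and params.response.status fields shown in the example above.

```lua
-- Build a predicate for page:wait_for_response().
-- url_part is matched as plain text, so characters such as "." "?" and "-"
-- are not interpreted as Lua pattern syntax.
local function response_matcher(url_part, status)
    return function(params)
        local response = params and params.response
        if not response or type(response.url) ~= "string" then
            return false
        end
        -- Fourth argument true turns off pattern matching.
        if not response.url:find(url_part, 1, true) then
            return false
        end
        return status == nil or response.status == status
    end
end

local is_search_ok = response_matcher("/api/search?q=", 200)
print(is_search_ok({ response = { url = "https://example.com/api/search?q=lua", status = 200 } }))
print(is_search_ok({ response = { url = "https://example.com/api/search?q=lua", status = 500 } }))
```

Usage would look like page:wait_for_response(response_matcher("/api/data", 200), 10000).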

page:wait_for_idle([timeout_ms], [quiet_ms]) async
Waits until the network has been quiet for quiet_ms (default 500). Uses performance.getEntriesByType("resource") to check for in-flight requests.

-- Wait up to 10s for the page to be idle
page:open("https://example.com", false)
page:wait_for_idle(10000, 500)

page:scroll([opts]) async
Scrolls the page with configurable behavior. Useful for infinite-scroll pages.

Option	Type	Default	Description
`max_scrolls`	number	`20`	Maximum number of scroll steps.
`step`	number	80% of viewport	Pixels to scroll per step.
`delay_ms`	number	`250`	Delay between scroll steps.
`until_selector`	string	none	Stop scrolling when this selector appears.
`until_bottom`	boolean	`true`	Stop when the bottom of the page is reached.

-- Scroll until an element appears or bottom is hit
page:scroll({
    max_scrolls = 50,
    delay_ms = 500,
    until_selector = ".load-more"
})

page:content() async
Returns the full rendered HTML (document.documentElement.outerHTML).

local html = page:content()
print(string.sub(html, 1, 500)) -- First 500 chars

page:evaluate(js) async
Evaluates JavaScript in the page context and returns the result. Uses Runtime.evaluate with returnByValue = true.

local title = page:evaluate("document.title")
local count = page:evaluate("document.querySelectorAll('.item').length")
local data = page:evaluate("JSON.stringify(window.__INITIAL_STATE__)")
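
Since page:evaluate() takes a JavaScript string, values coming from Lua need to be quoted before they are concatenated into the expression. Here is a minimal sketch. js_quote and count_expression are hypothetical helpers, not part of the cdp module.

```lua
-- Quote a Lua string as a JavaScript string literal so it can be embedded
-- safely in an expression passed to page:evaluate().
local function js_quote(s)
    -- Escape backslashes, double quotes, and line breaks.
    local escaped = s:gsub('[\\"\n\r]', {
        ["\\"] = "\\\\",
        ['"'] = '\\"',
        ["\n"] = "\\n",
        ["\r"] = "\\r",
    })
    return '"' .. escaped .. '"'
end

-- Build an expression that counts the elements matching a selector.
local function count_expression(selector)
    return "document.querySelectorAll(" .. js_quote(selector) .. ").length"
end

print(count_expression('a[href="/cart"]'))
```

In a hook this would be used as page:evaluate(count_expression('a[href="/cart"]')).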

page:click(selector, [opts]) async
Clicks the element matching a CSS selector. By default uses el.click() in JavaScript. opts.real = true dispatches mouse events through the browser via Input.dispatchMouseEvent instead.

-- Standard click (JS dispatch)
page:click(".submit-btn")

-- Mouse events dispatched by the browser via Input.dispatchMouseEvent (harder for anti-bot checks to flag)
page:click(".buy-now", { real = true })

page:type(selector, text, [opts]) async
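Types text into the element matching a selector. By default sets el.value and dispatches input/change events. opts.real = true uses Input.insertText, which inserts the text through the browser's input pipeline rather than assigning el.value. Note that Input.insertText does not emit per-key keydown/keyup events.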

-- Standard input
page:type("#search", "hello world")

-- Browser-level text insertion via Input.insertText
page:type("#username", "admin", { real = true })

page:screenshot(path, [opts]) async
Takes a screenshot and saves it to path. Returns the path on success.

Option	Type	Default	Description
`format`	string	`"png"`	Image format: `"png"` or `"jpeg"`.
`quality`	number	none	JPEG quality (1-100). Only applies to JPEG.
`full_page`	boolean	`false`	Capture the full page (not just viewport).
`fullPage`	boolean	`false`	Alias for `full_page`.
`fromSurface`	boolean	`true`	Capture from the surface (set to `false` for DPI-aware capture).

-- Basic screenshot (PNG)
page:screenshot("debug.png")

-- Full-page JPEG
page:screenshot("fullpage.jpg", {
    format = "jpeg",
    quality = 80,
    full_page = true
})

page:set_extra_headers(headers) async
Sets extra HTTP headers for all subsequent requests via Network.setExtraHTTPHeaders.

page:set_extra_headers({
    ["Accept-Language"] = "en-US",
    ["X-Custom"] = "my-value"
})

page:set_user_agent(user_agent, [opts]) async
Overrides the User-Agent via Network.setUserAgentOverride. opts can include accept_language and platform.

page:set_user_agent("Mozilla/5.0 ...", {
    accept_language = "en-US,en;q=0.9",
    platform = "Linux"
})

page:cookies([urls]) async
Returns all cookies. If urls is provided, filters cookies by the given URLs. Uses Network.getAllCookies (without URLs) or Network.getCookies (with URLs).

-- All cookies
local all = page:cookies()

-- Cookies for specific URLs
local filtered = page:cookies({ "https://example.com" })
for _, c in ipairs(filtered) do
    print(c.name, c.value)
end

page:set_cookies(cookies) async
Sets cookies via Network.setCookies. Each cookie should have name, value, and optionally domain, path, etc.

page:set_cookies({
    { name = "session", value = "abc123", domain = "example.com", path = "/" },
    { name = "theme", value = "dark", domain = "example.com" }
})
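
A common use of these two helpers together is carrying a session from one page or browser to another. Cookies read from the browser carry read-only fields that the setter does not need, so a filtering step can help. The sketch below assumes page:cookies() returns cookies with the standard CDP field names (httpOnly, session, expires, size). to_settable is a hypothetical helper and the sample data is made up.

```lua
-- Fields copied from a cookie returned by page:cookies() into the shape
-- accepted by page:set_cookies(). Read-only fields such as size are dropped.
local COPY_FIELDS = { "name", "value", "domain", "path", "secure", "httpOnly", "sameSite" }

local function to_settable(cookies)
    local out = {}
    for _, cookie in ipairs(cookies) do
        local copy = {}
        for _, field in ipairs(COPY_FIELDS) do
            copy[field] = cookie[field]
        end
        -- Keep the expiry only for persistent cookies; session cookies get none.
        if not cookie.session and type(cookie.expires) == "number" and cookie.expires > 0 then
            copy.expires = cookie.expires
        end
        out[#out + 1] = copy
    end
    return out
end

-- Sample input shaped like CDP cookie objects (values are illustrative).
local from_browser = {
    { name = "session", value = "abc123", domain = "example.com", path = "/",
      expires = -1, size = 13, session = true, httpOnly = true, secure = true },
    { name = "theme", value = "dark", domain = "example.com", path = "/",
      expires = 1893456000, size = 9, session = false, httpOnly = false, secure = false },
}
local settable = to_settable(from_browser)
print(#settable, settable[1].expires, settable[2].expires)
```

With real pages this might read other_page:set_cookies(to_settable(page:cookies({ "https://example.com" }))).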

page:block_resources(types) async
Blocks network requests matching resource types or URL patterns. Accepts type names like "image", "font", "media", "stylesheet", "script", or custom URL patterns.

-- Block images and fonts
page:block_resources({ "image", "font" })

-- Block specific URL patterns
page:block_resources({ "*.analytics.js", "*.tracking.com/*" })

Specialized Browsers & Advanced Patterns

While standard Chromium-based browsers are the most compatible, they are resource-intensive: a single Chrome instance can consume hundreds of megabytes of RAM. For a lighter approach, consider these specialized alternatives designed for scraping and automation. They speak the same protocol (CDP), so cdp.connect() works with them.

Lightpanda

A high-performance, lightweight browser written in Zig.

Launch the server:
```
lightpanda serve --port 9222
```

Connect in Lua:

local browser = cdp.connect("ws://127.0.0.1:9222")

Obscura

A headless browser built for AI agents and anti-detection.

Launch the server:
```
obscura serve --port 9222 --stealth
```

Connect in Lua:

local browser = cdp.connect("ws://127.0.0.1:9222/devtools/browser")

Compatibility Note

These browsers implement CDP but may not support every method. Based on our assessment:

Helper	Lightpanda	Obscura
`page:screenshot()`	Returns a placeholder, not a real capture	❌ Not supported
`page:block_resources()`	❌ Not supported	❌ Not supported
`page:cookies()` (no args)	✅ Works	❌ Use `page:cookies({url})`
`page:type(selector, text, {real=true})`	✅ Works	❌ Not supported

There may be other gaps we haven't caught. Test your hooks thoroughly and fall back to a standard browser if something doesn't work.
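
One way to test for such gaps at runtime is to wrap optional helpers in pcall. This is a sketch under assumptions: it guesses that an unsupported method either raises a Lua error or returns nil plus a message, and it only suits helpers that return a value on success (such as page:screenshot(), which returns the path). try_helper is a hypothetical name.

```lua
-- Call a page helper that a lightweight browser may not implement.
-- Returns true plus the helper's first result, or false plus a reason.
local function try_helper(page, name, ...)
    local fn = page[name]
    if type(fn) ~= "function" then
        return false, "helper not available: " .. name
    end
    local ok, result, err = pcall(fn, page, ...)
    if not ok then
        -- The helper raised; result holds the error value.
        return false, tostring(result)
    end
    if result == nil or result == false then
        return false, tostring(err or "helper returned no result")
    end
    return true, result
end

-- Stand-in page whose screenshot helper fails, as a limited browser might.
local limited_page = {}
function limited_page:screenshot(path)
    error("Page.captureScreenshot is not supported")
end

print(try_helper(limited_page, "screenshot", "debug.png"))
```

When the first return value is false, the hook can skip the optional step or retry the request with a standard browser.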

Hybrid Human-in-the-Loop

For sites with aggressive bot detection (Cloudflare, CAPTCHAs, etc.), you can switch from a fast headless browser to a visible Chrome window when a block is detected.

Normal Mode: Scrape headlessly for maximum speed.
Detection: In override_fetch, check for a "block" selector (e.g., #captcha-container).
Transition: Close the headless browser and launch a visible Chrome instance using the same user_data_dir.
Notification: Use notify() to alert a human operator.
Intervention: Wait in a while loop, polling for a "success" selector.
Handback: Capture the HTML, close the visible browser, and return to headless mode.
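
The transition, notification, intervention, and handback steps above can be sketched roughly as follows. It assumes the headless browser has already been closed by the caller. Several names here are assumptions and not confirmed API: the user_data_dir launch option, the single string argument to notify(), and the opts fields success_selector, max_wait_ms, and poll_slice_ms. Dependencies are passed in through deps so the flow can be exercised with stand-ins.

```lua
-- Sketch of the human-in-the-loop handoff.
--   deps.cdp    : the global cdp table
--   deps.notify : the notify() function (string argument is an assumption)
local function recover_with_human(deps, url, opts)
    -- Transition: launch a visible browser on the same profile.
    local browser = deps.cdp.launch({
        headless = false,
        user_data_dir = opts.user_data_dir, -- assumed option name
    })
    local page = browser:attach()
    local ok, err = page:open(url)
    if not ok then
        browser:close()
        return nil, "navigation failed: " .. tostring(err)
    end

    -- Notification: tell the operator that manual action is needed.
    deps.notify("Manual action needed for " .. url)

    -- Intervention: poll for the success selector in short slices.
    local slice = opts.poll_slice_ms or 5000
    local waited = 0
    local html
    while waited < opts.max_wait_ms do
        if page:wait_for_selector(opts.success_selector, slice) then
            html = page:content()
            break
        end
        waited = waited + slice
    end

    -- Handback: close the visible browser; the caller resumes headless mode.
    browser:close()
    if not html then
        return nil, "operator did not finish within " .. opts.max_wait_ms .. " ms"
    end
    return html
end
```

Inside override_fetch this might be called as recover_with_human({ cdp = cdp, notify = notify }, request.url, { ... }) once the block selector has been detected.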

See examples/hybrid-recovery/hooks.lua for a complete implementation.
