Puppeteer Screenshot Timeouts and Memory Leaks: A Debugging Guide

A comprehensive debugging guide for Puppeteer screenshot issues — covering navigation timeouts, memory leaks, zombie Chrome processes, and production error handling patterns.

Puppeteer works great in development. You write a script, it opens Chrome, it takes a screenshot, and everything is fine. Production is where screenshots fail — navigation timeouts on slow pages, Chrome eating all your RAM, zombie processes accumulating at 3am, and cryptic protocol errors that give you nothing useful to debug. This is a systematic guide to diagnosing and fixing every category of Puppeteer screenshot failure in production.

Part 1: Timeouts

Timeouts are the most common Puppeteer failure in production. Your script works on fast pages with clean markup, then falls apart when it encounters a real-world site with third-party scripts, redirect chains, and content that loads in unpredictable waves.

The error looks like this:

TimeoutError: Navigation timeout of 30000ms exceeded

This means Puppeteer called page.goto() and the page didn't reach the expected load state within the timeout window. The default timeout is 30 seconds, which sounds generous until you realize what's happening behind the scenes.

Common causes:

  • Slow third-party resources. The page itself loads in 2 seconds, but it pulls in analytics scripts, ad networks, and chat widgets that keep network connections open. If you're waiting for networkidle0 (zero connections for 500ms), a single long-polling analytics endpoint will cause a timeout every time.
  • Redirect chains. A URL redirects to a login page, which redirects to an OAuth provider, which redirects back. Each redirect restarts parts of the page load cycle. Three or four redirects can eat 10-15 seconds before any content renders.
  • Hanging resources. A CSS file on a CDN that's having issues. A font file that takes 20 seconds to respond. A single slow resource can block rendering if it's in the critical path.
  • Server-side rendering delays. The server itself takes a long time to generate the HTML. Common with dynamically generated pages, heavy database queries, or servers under load.
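If you suspect a redirect chain, you can inspect the hops Puppeteer records on the final response. The helper below is a small sketch around Puppeteer's `redirectChain()` API; `response` is whatever `page.goto()` resolved to:

```javascript
// Lists the redirect hops that led to the final response.
// `response` is the HTTPResponse returned by page.goto().
function getRedirectChain(response) {
  return response.request().redirectChain().map((req) => req.url());
}

// Usage after navigation:
// const response = await page.goto('https://example.com', { waitUntil: 'networkidle2' });
// console.log('Redirects:', getRedirectChain(response));
```

If this logs three or four hops, the redirects themselves may be consuming most of your timeout budget before any content loads.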

Solution 1: Use networkidle2 instead of networkidle0.

The networkidle0 setting waits until there are zero network connections for 500ms. This is extremely strict — analytics beacons, WebSocket connections, and long-polling requests will prevent it from ever resolving. networkidle2 allows up to 2 connections to remain active, which covers most background noise:

const puppeteer = require('puppeteer');

const browser = await puppeteer.launch();
const page = await browser.newPage();

await page.goto('https://example.com', {
  waitUntil: 'networkidle2',
  timeout: 30000
});

const screenshot = await page.screenshot({ type: 'png' });
await page.close();

For many sites, switching from networkidle0 to networkidle2 eliminates timeouts entirely without any visible difference in screenshot quality.

Solution 2: Increase the timeout.

Sometimes the page is legitimately slow, and you just need to wait longer. Be deliberate about this — don't set the timeout to 120 seconds and forget about it. That means each failed request ties up a Chrome tab for two full minutes:

await page.goto('https://example.com', {
  waitUntil: 'networkidle2',
  timeout: 60000
});

Set the page-level default timeout as well, so that subsequent operations like waitForSelector inherit a reasonable value:

page.setDefaultTimeout(60000);
page.setDefaultNavigationTimeout(60000);

Solution 3: Abort slow resources.

If you know certain resource types aren't needed for your screenshot, intercept requests and abort the ones that tend to hang. This is especially useful for blocking third-party scripts, analytics, and ads:

const puppeteer = require('puppeteer');

const browser = await puppeteer.launch();
const page = await browser.newPage();

await page.setRequestInterception(true);

const blockedDomains = [
  'google-analytics.com',
  'googletagmanager.com',
  'facebook.net',
  'doubleclick.net',
  'hotjar.com',
  'intercom.io'
];

page.on('request', (request) => {
  const url = request.url();
  const resourceType = request.resourceType();

  // Block known slow third-party domains
  const isBlocked = blockedDomains.some((domain) => url.includes(domain));

  // Optionally block resource types you don't need
  const blockedTypes = ['media', 'websocket'];
  const isBlockedType = blockedTypes.includes(resourceType);

  if (isBlocked || isBlockedType) {
    request.abort();
  } else {
    request.continue();
  }
});

await page.goto('https://example.com', {
  waitUntil: 'networkidle2',
  timeout: 30000
});

const screenshot = await page.screenshot({ type: 'png' });
await page.close();

Be careful about blocking stylesheets and fonts — the page will look noticeably different without them. Block analytics and tracking scripts first, then selectively add other categories if you still have timeout issues.

For more aggressive optimization when you only need a rough capture of the page structure, you can block images and fonts too:

page.on('request', (request) => {
  const resourceType = request.resourceType();
  const blockedTypes = ['image', 'font', 'media', 'websocket'];

  if (blockedTypes.includes(resourceType)) {
    request.abort();
  } else {
    request.continue();
  }
});

This dramatically reduces page load time but produces a stripped-down screenshot. Useful for generating thumbnails or testing page structure, not for producing user-facing images.

waitForSelector Timeout

The error:

TimeoutError: Waiting for selector `.hero-banner` failed: timeout 30000ms exceeded

This happens when you're waiting for a specific element to appear before taking the screenshot, and it never shows up. Common causes:

  • The selector is wrong. A class name changed, the element was renamed, or the markup is different than expected.
  • The element is hidden. It exists in the DOM but has display: none, visibility: hidden, or zero dimensions. By default, waitForSelector waits for the element to be present in the DOM but not necessarily visible.
  • The element loads conditionally. It only appears for certain users, in certain regions, or after a specific interaction.
  • JavaScript error prevents rendering. A script error earlier in the page lifecycle stops the component from rendering entirely.

Debugging approach: inspect what's actually on the page.

Before assuming the element doesn't exist, check what the page actually contains:

const puppeteer = require('puppeteer');

const browser = await puppeteer.launch();
const page = await browser.newPage();

// Register error listeners before navigation — they only capture
// events fired after they are attached
const errors = [];
page.on('pageerror', (error) => errors.push(error.message));
page.on('console', (msg) => {
  if (msg.type() === 'error') {
    errors.push(msg.text());
  }
});

await page.goto('https://example.com', {
  waitUntil: 'networkidle2',
  timeout: 30000
});

// Check if the element exists in the DOM at all
const elementExists = await page.$$eval('.hero-banner', (elements) => {
  return {
    count: elements.length,
    details: elements.map((el) => ({
      tagName: el.tagName,
      className: el.className,
      visible: el.offsetHeight > 0 && el.offsetWidth > 0,
      display: getComputedStyle(el).display,
      visibility: getComputedStyle(el).visibility,
      textContent: el.textContent.substring(0, 100)
    }))
  };
});

console.log('Element check:', JSON.stringify(elementExists, null, 2));

// List all elements with class names containing 'hero'
const heroElements = await page.$$eval('[class*="hero"]', (elements) => {
  return elements.map((el) => ({
    tagName: el.tagName,
    className: el.className,
    id: el.id
  }));
});

console.log('Hero-related elements:', JSON.stringify(heroElements, null, 2));

// Any JavaScript errors captured during page load
console.log('Page errors:', errors);

await browser.close();

Once you know what's actually on the page, you can fix the selector or adjust your wait strategy. If the element loads conditionally, use a timeout with a fallback:

try {
  await page.waitForSelector('.hero-banner', {
    visible: true,
    timeout: 10000
  });
} catch (e) {
  // Element didn't appear — take the screenshot anyway
  console.log('Hero banner not found, capturing page as-is');
}

const screenshot = await page.screenshot({ type: 'png' });

The visible: true option is important — it waits not just for the element to exist in the DOM, but for it to have non-zero dimensions and not be hidden by CSS. This catches the common case where the element is present but display: none until a script runs.

Protocol Timeout

The error:

ProtocolError: Runtime.callFunctionOn timed out

Protocol errors are different from navigation timeouts. They mean the Chrome DevTools Protocol itself failed to respond — Puppeteer sent a command to Chrome and didn't get an answer back in time. This usually means Chrome is overwhelmed.

Common causes:

  • Chrome is out of memory. The page is so large or complex that Chrome's rendering engine can't process it. This is common with pages that have thousands of DOM nodes, heavy SVGs, or massive canvases.
  • Too many concurrent operations. You're running multiple pages simultaneously and Chrome can't keep up.
  • Heavy JavaScript execution. A page with computationally expensive scripts can block Chrome's main thread.

Solutions:

Increase the protocol timeout (available in Puppeteer 19+):

const browser = await puppeteer.launch({
  protocolTimeout: 60000
});

Simplify what Chrome has to render by removing heavy elements before the screenshot:

await page.evaluate(() => {
  // Remove heavy iframes
  document.querySelectorAll('iframe').forEach((el) => el.remove());

  // Remove video elements
  document.querySelectorAll('video').forEach((el) => el.remove());

  // Remove canvases (maps, charts)
  document.querySelectorAll('canvas').forEach((el) => el.remove());
});

const screenshot = await page.screenshot({ type: 'png' });

If protocol errors happen consistently with a specific page, the page is likely too heavy for a single Chrome instance. Consider using a browser pool (covered in Part 2) to spread the load.

Adaptive Timeout Strategies

Rather than picking a single timeout value, use a tiered approach that starts fast and escalates. This keeps most requests quick while allowing extra time for genuinely slow pages:

const puppeteer = require('puppeteer');

async function screenshotWithAdaptiveTimeout(browser, url) {
  const timeouts = [15000, 30000, 60000];

  for (const timeout of timeouts) {
    const page = await browser.newPage();
    try {
      await page.goto(url, {
        timeout,
        waitUntil: 'networkidle2'
      });

      const screenshot = await page.screenshot({ type: 'png' });
      return screenshot;
    } catch (e) {
      if (e.name !== 'TimeoutError') {
        throw e; // Non-timeout errors should not be retried
      }
      console.log(`Timeout at ${timeout}ms for ${url}, retrying with higher timeout`);
    } finally {
      await page.close();
    }
  }

  throw new Error(`Failed to capture ${url} after all timeout tiers`);
}

const browser = await puppeteer.launch();
const screenshot = await screenshotWithAdaptiveTimeout(browser, 'https://example.com');
await browser.close();

With this approach, most requests complete on the first 15-second tier. The handful that need more time get it, but they don't slow down the fast path. Each retry creates a fresh page to avoid contamination from the previous attempt's partial load state.

You can make this smarter by tracking timeout patterns per domain. If a particular domain consistently hits the 15-second tier, start with 30 seconds for that domain:

const domainTimeouts = new Map();

function getTimeoutsForUrl(url) {
  const domain = new URL(url).hostname;
  const baseline = domainTimeouts.get(domain) || 15000;
  return [baseline, baseline * 2, baseline * 4].map((t) => Math.min(t, 120000));
}

function recordTimeout(url, successfulTimeout) {
  const domain = new URL(url).hostname;
  const current = domainTimeouts.get(domain) || 15000;
  // Exponential moving average
  const updated = Math.round(current * 0.7 + successfulTimeout * 0.3);
  domainTimeouts.set(domain, updated);
}

This is a lightweight optimization that avoids wasting time on first-tier attempts that are known to fail for certain slow sites.

Part 2: Memory Leaks

Chrome is a memory-intensive application. A single headless Chrome instance with one empty tab uses around 50-80 MB of RAM. Load a complex web page and that climbs to 200-500 MB. Run a screenshot service handling concurrent requests and you can easily consume gigabytes of memory in minutes. The challenge isn't just how much memory Chrome uses — it's making sure it gives memory back when it's done.

Unclosed Pages

This is the number one cause of memory leaks in Puppeteer screenshot services. Every call to browser.newPage() allocates significant memory — DOM structures, JavaScript heap, rendering buffers. If you don't call page.close(), that memory is never released.

The problem is that errors can interrupt your code before page.close() is reached.

The wrong way:

async function takeScreenshot(browser, url) {
  const page = await browser.newPage();
  await page.goto(url, { waitUntil: 'networkidle2' });
  const screenshot = await page.screenshot({ type: 'png' });
  await page.close(); // Never reached if goto or screenshot throws
  return screenshot;
}

If page.goto() throws a timeout error, page.close() is never called. That page stays open, consuming memory, until the browser itself is closed. After a few hundred failed requests, you have hundreds of orphaned pages eating all your RAM.

The right way:

async function takeScreenshot(browser, url) {
  const page = await browser.newPage();
  try {
    await page.goto(url, {
      waitUntil: 'networkidle2',
      timeout: 30000
    });
    const screenshot = await page.screenshot({ type: 'png' });
    return screenshot;
  } finally {
    await page.close();
  }
}

The finally block ensures page.close() is called regardless of whether the operation succeeded or threw an error. This is the single most important pattern for preventing memory leaks. Every browser.newPage() in your codebase must have a corresponding page.close() in a finally block. No exceptions.

Browser Lifecycle Patterns

How you manage the browser instance itself has a major impact on memory behavior. There are three common patterns, each with different tradeoffs.

Pattern 1: New browser per request.

The safest approach for memory. Each request gets a completely fresh Chrome instance, and closing it releases all associated memory:

const puppeteer = require('puppeteer');

async function takeScreenshot(url) {
  const browser = await puppeteer.launch({
    args: ['--no-sandbox', '--disable-setuid-sandbox']
  });
  try {
    const page = await browser.newPage();
    try {
      await page.goto(url, {
        waitUntil: 'networkidle2',
        timeout: 30000
      });
      return await page.screenshot({ type: 'png' });
    } finally {
      await page.close();
    }
  } finally {
    await browser.close();
  }
}

The downside is performance. Launching Chrome takes 1-3 seconds depending on the system. If you're handling many requests, this adds up significantly. Use this pattern when you're processing a small number of requests (under 10 per minute) or when memory isolation is critical.

Pattern 2: Shared browser, new page per request.

A good balance between performance and memory management. One Chrome instance stays running and handles all requests through separate pages:

const puppeteer = require('puppeteer');

let browser = null;

async function getBrowser() {
  if (!browser || !browser.connected) {
    browser = await puppeteer.launch({
      args: [
        '--no-sandbox',
        '--disable-setuid-sandbox',
        '--disable-dev-shm-usage'
      ]
    });
  }
  return browser;
}

async function takeScreenshot(url) {
  const browser = await getBrowser();
  const page = await browser.newPage();
  try {
    await page.goto(url, {
      waitUntil: 'networkidle2',
      timeout: 30000
    });
    return await page.screenshot({ type: 'png' });
  } finally {
    await page.close();
  }
}

This eliminates the Chrome startup overhead. The risk is that Chrome's internal memory usage grows over time even when pages are properly closed — caches, compiled code, and internal data structures accumulate. We'll address this with periodic restarts later in this section.

Pattern 3: Browser pool.

For high-throughput services, a pool of browser instances provides the best balance of performance, memory management, and concurrency control. The generic-pool library is a solid foundation:

const puppeteer = require('puppeteer');
const genericPool = require('generic-pool');

const browserPool = genericPool.createPool(
  {
    create: async () => {
      const browser = await puppeteer.launch({
        args: [
          '--no-sandbox',
          '--disable-setuid-sandbox',
          '--disable-dev-shm-usage',
          '--disable-gpu'
        ]
      });
      console.log(`Browser created. Pool size: ${browserPool.size}`);
      return browser;
    },
    destroy: async (browser) => {
      await browser.close();
      console.log(`Browser destroyed. Pool size: ${browserPool.size}`);
    },
    validate: async (browser) => {
      // Check that the DevTools connection is alive and responsive
      try {
        await browser.pages(); // Throws if the connection is dead
        return browser.connected;
      } catch (e) {
        return false;
      }
    }
  },
  {
    min: 2,          // Keep at least 2 browsers warm
    max: 5,          // Never exceed 5 concurrent browsers
    acquireTimeoutMillis: 30000,
    idleTimeoutMillis: 300000,   // Close idle browsers after 5 minutes
    testOnBorrow: true           // Validate before lending to a request
  }
);

async function takeScreenshot(url) {
  const browser = await browserPool.acquire();
  try {
    const page = await browser.newPage();
    try {
      await page.goto(url, {
        waitUntil: 'networkidle2',
        timeout: 30000
      });
      return await page.screenshot({ type: 'png' });
    } finally {
      await page.close();
    }
  } finally {
    await browserPool.release(browser);
  }
}

// Clean up the pool on shutdown
async function shutdown() {
  await browserPool.drain();
  await browserPool.clear();
}

The pool manages browser lifecycle automatically. Idle browsers are closed after 5 minutes. If a browser becomes unresponsive, testOnBorrow catches it and a replacement is created. The max setting prevents you from launching so many Chrome instances that the system runs out of memory.

Set the max value based on your available RAM. Each Chrome instance uses 100-500 MB depending on workload. On a server with 8 GB of RAM, 5 concurrent browsers is a reasonable ceiling.

Detecting Memory Leaks

Before you can fix a memory leak, you need to confirm one exists. The symptoms are typically: memory usage climbs steadily over hours or days, eventually causing the process to crash or the system to start swapping.

Monitor Node.js memory:

function logMemoryUsage() {
  const usage = process.memoryUsage();
  console.log({
    timestamp: new Date().toISOString(),
    rss: `${Math.round(usage.rss / 1024 / 1024)} MB`,
    heapTotal: `${Math.round(usage.heapTotal / 1024 / 1024)} MB`,
    heapUsed: `${Math.round(usage.heapUsed / 1024 / 1024)} MB`,
    external: `${Math.round(usage.external / 1024 / 1024)} MB`
  });
}

// Log every 30 seconds
setInterval(logMemoryUsage, 30000);

The key metrics to watch:

  • RSS (Resident Set Size): Total memory allocated by the process, including Chrome child processes. This is what your operating system reports and what matters for capacity planning.
  • heapUsed: Memory used by JavaScript objects in the Node.js heap. If this grows unboundedly, you have a JavaScript-level memory leak (event listeners not removed, closures holding references, arrays accumulating data).
  • external: Memory allocated by C++ objects bound to JavaScript objects. This includes Buffers. If your screenshot data is accumulating instead of being garbage collected, this will grow.

The distinction matters. If heapUsed is growing, your JavaScript code has a leak. If RSS is growing but heapUsed is stable, Chrome itself is leaking — its internal caches or rendering buffers aren't being freed.
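That decision rule is easy to encode. `classifyLeak` is a hypothetical helper comparing two of the snapshots you're already logging; the 100 MB threshold is an arbitrary starting point, not a standard value:

```javascript
// Compares two memory snapshots ({ rss, heapUsed }, both in MB) and
// guesses where the growth lives. Threshold is arbitrary — tune it.
function classifyLeak(first, last, thresholdMb = 100) {
  const heapGrowth = last.heapUsed - first.heapUsed;
  const rssGrowth = last.rss - first.rss;
  if (heapGrowth > thresholdMb) return 'javascript-leak';
  if (rssGrowth > thresholdMb) return 'chrome-leak';
  return 'stable';
}
```

Run it against snapshots taken an hour apart: a `javascript-leak` result points at your own code, a `chrome-leak` result points at the browser's internal allocations.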

Monitor Chrome's memory:

Puppeteer exposes Chrome's internal metrics through the page.metrics() API:

async function logChromeMetrics(page) {
  const metrics = await page.metrics();
  console.log({
    timestamp: new Date().toISOString(),
    jsHeapUsedSize: `${Math.round(metrics.JSHeapUsedSize / 1024 / 1024)} MB`,
    jsHeapTotalSize: `${Math.round(metrics.JSHeapTotalSize / 1024 / 1024)} MB`,
    documents: metrics.Documents,
    frames: metrics.Frames,
    jsEventListeners: metrics.JSEventListeners,
    nodes: metrics.Nodes
  });
}

Watch the Documents, Nodes, and JSEventListeners counts. If these grow after you close pages, Chrome isn't properly cleaning up. The JSHeapUsedSize should drop after page.close() — if it doesn't, there's a leak in Chrome's renderer.

A simple monitoring script:

Here's a standalone script that runs a batch of screenshots and reports memory trends:

const puppeteer = require('puppeteer');

async function memoryLeakTest(urls, iterations) {
  const browser = await puppeteer.launch({
    args: ['--no-sandbox', '--disable-dev-shm-usage']
  });

  const snapshots = [];

  for (let i = 0; i < iterations; i++) {
    const url = urls[i % urls.length];
    const page = await browser.newPage();

    try {
      await page.goto(url, {
        waitUntil: 'networkidle2',
        timeout: 30000
      });
      await page.screenshot({ type: 'png' });
    } catch (e) {
      console.log(`Error on iteration ${i}: ${e.message}`);
    } finally {
      await page.close();
    }

    // Record memory after each iteration
    const usage = process.memoryUsage();
    const pages = await browser.pages();
    snapshots.push({
      iteration: i,
      rss: Math.round(usage.rss / 1024 / 1024),
      heapUsed: Math.round(usage.heapUsed / 1024 / 1024),
      openPages: pages.length
    });

    if (i % 10 === 0) {
      console.log(`Iteration ${i}: RSS=${snapshots[i].rss}MB, Heap=${snapshots[i].heapUsed}MB, Pages=${snapshots[i].openPages}`);
    }
  }

  await browser.close();

  // Report trend
  const firstRss = snapshots[0].rss;
  const lastRss = snapshots[snapshots.length - 1].rss;
  const growth = lastRss - firstRss;
  console.log(`\nMemory growth over ${iterations} iterations: ${growth}MB`);
  console.log(`Average per iteration: ${(growth / iterations).toFixed(2)}MB`);

  if (growth / iterations > 1) {
    console.log('WARNING: Significant memory growth detected — likely a leak.');
  }

  return snapshots;
}

const testUrls = [
  'https://example.com',
  'https://news.ycombinator.com',
  'https://github.com'
];

memoryLeakTest(testUrls, 100);

Run this on your server. If RSS grows by more than 1 MB per iteration on average, you have a leak. If it levels off after an initial growth phase, that's Chrome's caches filling up — not a leak per se, but something you'll need to manage with periodic restarts.

Chrome-Specific Memory Issues

Chrome has several memory-related behaviors that are particularly relevant in headless screenshot environments.

The /dev/shm issue.

Chrome uses shared memory (/dev/shm) for inter-process communication. In Docker containers, /dev/shm defaults to 64 MB, which is far too small for Chrome. When Chrome runs out of shared memory, it crashes or produces corrupted screenshots.

The fix is the --disable-dev-shm-usage flag, which tells Chrome to use /tmp instead:

const browser = await puppeteer.launch({
  args: [
    '--disable-dev-shm-usage',
    '--disable-gpu',
    '--no-sandbox',
    '--disable-setuid-sandbox',
    '--disable-extensions',
    '--disable-background-networking',
    '--single-process'
  ]
});

Here's what each flag does:

  • --disable-dev-shm-usage: Uses /tmp instead of /dev/shm. Essential in Docker.
  • --disable-gpu: Disables GPU acceleration. Necessary in most headless environments where there's no GPU.
  • --no-sandbox: Disables Chrome's sandbox. Required when running as root (common in Docker). Not ideal for security, but necessary in most container environments.
  • --disable-setuid-sandbox: Disables the setuid sandbox helper. Companion to --no-sandbox.
  • --disable-extensions: No extensions to load means less memory.
  • --disable-background-networking: Stops Chrome from making background network requests (update checks, safe browsing).
  • --single-process: Runs Chrome in a single process instead of multi-process. Reduces memory overhead but sacrifices crash isolation. Use with caution — if Chrome's renderer crashes, the entire browser goes down.

Alternatively, increase the shared memory size in Docker:

docker run --shm-size=1gb your-screenshot-service

This is often simpler and more reliable than --disable-dev-shm-usage, but requires control over how the container is launched.
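If you launch via docker-compose rather than docker run, the equivalent setting (assuming a service named screenshot-service) is:

```yaml
services:
  screenshot-service:
    image: your-screenshot-service
    shm_size: '1gb'
```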

Chrome's internal caches.

Chrome aggressively caches compiled JavaScript, decoded images, font data, and layout information. These caches grow over time and are not fully released when pages close. In a long-running screenshot service, this means Chrome's memory footprint creeps upward even when you're closing pages correctly.

There's no flag to completely disable these caches. The practical solution is periodic browser restarts (covered below).

Node.js Memory Limits

Node.js has a default heap size limit (around 1.5 GB on 64-bit systems, varies by version). If your JavaScript code accumulates data — screenshot buffers, logging arrays, response caches — you'll hit this limit and crash with a FATAL ERROR: CALL_AND_RETRY_LAST Allocation failed - JavaScript heap out of memory error.

You can increase the limit:

node --max-old-space-size=4096 server.js

But increasing the limit is often the wrong fix. If memory is growing without bound, you have a leak, and a larger limit just delays the crash. Investigate with the monitoring techniques above first.

When you genuinely need more heap space — for example, if you're processing large screenshot buffers or holding results in memory before uploading — then increase the limit to match your server's available RAM minus what Chrome needs. On a server with 8 GB RAM and 5 Chrome instances, leave at least 4 GB for Chrome and set Node.js to 2-3 GB:

node --max-old-space-size=3072 server.js

Periodic Browser Restart

Even with perfect page cleanup, Chrome's memory usage drifts upward over time. The most practical solution is to restart Chrome on a schedule or after a certain number of requests. This trades a small amount of latency (the restart takes 1-3 seconds) for bounded memory usage:

const puppeteer = require('puppeteer');

const MAX_REQUESTS_PER_BROWSER = 100;
const CHROME_ARGS = [
  '--no-sandbox',
  '--disable-setuid-sandbox',
  '--disable-dev-shm-usage',
  '--disable-gpu'
];

let browser = null;
let requestCount = 0;

async function getBrowser() {
  if (!browser || !browser.connected || requestCount >= MAX_REQUESTS_PER_BROWSER) {
    if (browser) {
      console.log(`Recycling browser after ${requestCount} requests`);
      const oldBrowser = browser;
      browser = null;
      requestCount = 0;

      // Close old browser asynchronously to avoid blocking
      oldBrowser.close().catch((e) => {
        console.error('Error closing old browser:', e.message);
      });
    }

    browser = await puppeteer.launch({ args: CHROME_ARGS });
    requestCount = 0;
    console.log('New browser launched');
  }

  requestCount++;
  return browser;
}

async function takeScreenshot(url) {
  const browser = await getBrowser();
  const page = await browser.newPage();
  try {
    await page.goto(url, {
      waitUntil: 'networkidle2',
      timeout: 30000
    });
    return await page.screenshot({ type: 'png' });
  } finally {
    await page.close();
  }
}

The MAX_REQUESTS_PER_BROWSER value depends on your workload. If you're screenshotting heavy pages (news sites, social media), use a lower value like 50. For simple pages, 200-500 is fine. Monitor your memory usage and adjust.

The key detail is closing the old browser asynchronously. You don't want to block incoming requests while the old Chrome shuts down. Launch the new browser first, then clean up the old one in the background. There's a brief window where two browsers coexist, but it only lasts until the old Chrome finishes shutting down.

You can also restart based on memory usage instead of request count, which adapts to variable workloads:

async function getBrowser() {
  const memUsage = process.memoryUsage();
  const rssInMb = memUsage.rss / 1024 / 1024;

  if (!browser || !browser.connected || rssInMb > 1500) {
    if (browser) {
      console.log(`Recycling browser at ${Math.round(rssInMb)}MB RSS`);
      const oldBrowser = browser;
      browser = null;
      oldBrowser.close().catch(() => {});
    }

    browser = await puppeteer.launch({ args: CHROME_ARGS });
    console.log('New browser launched');
  }

  return browser;
}

This approach caps your process's total memory at roughly 1.5 GB, restarting Chrome whenever it drifts above that threshold.

Part 3: Zombie Processes

Zombie Chrome processes are processes that outlive your Node.js application. When your server crashes, restarts, or exits without properly closing the browser, Chrome keeps running in the background. Each zombie consumes memory and CPU, and over time they accumulate until the system runs out of resources.

Detection

First, check if you have zombie Chrome processes:

# Find all Chrome/Chromium processes
ps aux | grep -i 'chrome\|chromium' | grep -v grep

# Count them
ps aux | grep -i 'chrome\|chromium' | grep -v grep | wc -l

# Show memory usage of Chrome processes
ps aux | grep -i 'chrome\|chromium' | grep -v grep | awk '{sum += $6} END {print "Total Chrome memory: " sum/1024 " MB"}'

If you see Chrome processes that don't correspond to any running Node.js application, they're zombies.

Here's a detection script you can run periodically:

const { execSync } = require('child_process');

function detectZombieChrome() {
  try {
    const output = execSync(
      'ps aux | grep -i "chrome\\|chromium" | grep -v grep',
      { encoding: 'utf-8' }
    );

    const processes = output.trim().split('\n').filter(Boolean);

    if (processes.length === 0) {
      console.log('No Chrome processes found.');
      return [];
    }

    const parsed = processes.map((line) => {
      const parts = line.trim().split(/\s+/);
      return {
        pid: parts[1],
        cpu: parts[2],
        mem: parts[3],
        rss: `${Math.round(parseInt(parts[5]) / 1024)}MB`,
        command: parts.slice(10).join(' ').substring(0, 80)
      };
    });

    console.log(`Found ${parsed.length} Chrome processes:`);
    parsed.forEach((p) => {
      console.log(`  PID ${p.pid}: CPU=${p.cpu}%, MEM=${p.mem}%, RSS=${p.rss}`);
    });

    return parsed;
  } catch (e) {
    // grep returns exit code 1 when no matches found
    if (e.status === 1) {
      console.log('No Chrome processes found.');
      return [];
    }
    throw e;
  }
}

detectZombieChrome();

Prevention

The key to preventing zombies is handling process signals. When your Node.js process receives a termination signal, close Chrome before exiting:

const puppeteer = require('puppeteer');

let browser = null;

async function launchBrowser() {
  browser = await puppeteer.launch({
    args: ['--no-sandbox', '--disable-dev-shm-usage']
  });
  return browser;
}

async function gracefulShutdown(signal) {
  console.log(`Received ${signal}. Shutting down gracefully...`);

  if (browser) {
    try {
      await browser.close();
      console.log('Browser closed successfully.');
    } catch (e) {
      console.error('Error closing browser:', e.message);
      // Force kill Chrome if graceful close fails
      try {
        browser.process().kill('SIGKILL');
      } catch (killError) {
        console.error('Error force-killing browser:', killError.message);
      }
    }
  }

  process.exit(0);
}

// Handle termination signals
process.on('SIGINT', () => gracefulShutdown('SIGINT'));
process.on('SIGTERM', () => gracefulShutdown('SIGTERM'));

// Handle uncaught exceptions and unhandled rejections
process.on('uncaughtException', async (error) => {
  console.error('Uncaught exception:', error);
  await gracefulShutdown('uncaughtException');
});

process.on('unhandledRejection', async (reason) => {
  console.error('Unhandled rejection:', reason);
  await gracefulShutdown('unhandledRejection');
});

// Handle process exit (synchronous — can't await here)
process.on('exit', () => {
  if (browser && browser.process()) {
    browser.process().kill('SIGKILL');
  }
});

A few important details:

  • SIGINT is sent when you press Ctrl+C. SIGTERM is sent by process managers (systemd, Docker, Kubernetes) when stopping a service.
  • SIGKILL cannot be caught — if your process receives SIGKILL, Chrome will be orphaned. This is why the exit handler uses a synchronous kill() as a last resort.
  • The uncaughtException and unhandledRejection handlers catch programming errors that would otherwise crash the process without cleanup.
  • The exit event handler is synchronous — you can't use await in it. The browser.process().kill('SIGKILL') is a synchronous kill that works without awaiting.

Docker --init

If you're running your screenshot service in Docker, this is one of the most important things to get right. By default, your Node.js process runs as PID 1 inside the container. PID 1 has a special responsibility in Linux: it's supposed to reap zombie child processes. But Node.js doesn't do this — it's not designed to be an init system.

When Chrome's child processes exit, they become zombies waiting to be reaped. Without a proper init system, they accumulate indefinitely:
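A cheap startup guard catches this misconfiguration early. A minimal sketch (warnIfPid1 is an illustrative name, not a library API):

```javascript
// If Node is PID 1, nothing will reap Chrome's exited children.
function warnIfPid1(pid = process.pid) {
  if (pid === 1) {
    console.warn(
      'Running as PID 1: zombie Chrome processes will not be reaped. ' +
        'Start the container with --init, or use tini as the entrypoint.'
    );
    return true;
  }
  return false;
}

warnIfPid1();
```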

# Without --init: zombie processes accumulate
docker run your-screenshot-service

# With --init: tini reaps zombie processes automatically
docker run --init your-screenshot-service

The --init flag injects tini as PID 1, which properly reaps child processes. In a Docker Compose file:

services:
  screenshot-service:
    build: .
    init: true
    shm_size: '1gb'

If you can't use --init, install tini in your Dockerfile:

FROM node:20-slim

RUN apt-get update && apt-get install -y tini

ENTRYPOINT ["tini", "--"]
CMD ["node", "server.js"]

Without this, a long-running screenshot service in Docker will accumulate zombie processes until it runs out of PIDs (the default limit is 32,768) or memory.

Health Checks

A health check endpoint verifies that Chrome is actually responsive — not just that your Node.js process is running. This is essential for load balancers, container orchestrators, and monitoring systems:

const express = require('express');
const puppeteer = require('puppeteer');

const app = express();
let browser = null;

async function getBrowser() {
  if (!browser || !browser.connected) {
    browser = await puppeteer.launch({
      args: ['--no-sandbox', '--disable-dev-shm-usage']
    });
  }
  return browser;
}

app.get('/health', async (req, res) => {
  const startTime = Date.now();

  try {
    const browser = await getBrowser();
    const page = await browser.newPage();

    try {
      await page.goto('about:blank', { timeout: 5000 });
      await page.title(); // protocol round-trip confirms Chrome is responsive
    } finally {
      await page.close();
    }

    const duration = Date.now() - startTime;
    const memUsage = process.memoryUsage();

    res.json({
      status: 'ok',
      browser: 'connected',
      responseTime: `${duration}ms`,
      memory: {
        rss: `${Math.round(memUsage.rss / 1024 / 1024)}MB`,
        heapUsed: `${Math.round(memUsage.heapUsed / 1024 / 1024)}MB`
      }
    });
  } catch (e) {
    const duration = Date.now() - startTime;

    res.status(503).json({
      status: 'unhealthy',
      error: e.message,
      responseTime: `${duration}ms`
    });

    // If the browser is broken, force a restart on next request
    if (browser) {
      browser.close().catch(() => {});
      browser = null;
    }
  }
});

A few design choices here. The health check navigates to about:blank rather than a real URL — this tests Chrome's ability to open a page and respond to protocol commands without depending on external network connectivity. It sets a short timeout (5 seconds) because a health check should be fast — if Chrome can't open a blank page in 5 seconds, something is seriously wrong.

If the health check fails, it proactively sets browser to null so the next real request will create a fresh Chrome instance. This self-healing behavior prevents a stuck Chrome from blocking all subsequent requests.

Configure your load balancer or container orchestrator to hit this endpoint every 10-30 seconds. If it returns 503 three times in a row, the service should be restarted.

Circuit Breaker Pattern

When Chrome is unhealthy, continuing to accept screenshot requests just makes things worse. Each request opens a new page in an already struggling browser, consuming more memory and producing more timeouts. The circuit breaker pattern stops accepting requests when the error rate exceeds a threshold, giving Chrome time to recover:

class CircuitBreaker {
  constructor(options = {}) {
    this.failureThreshold = options.failureThreshold || 5;
    this.resetTimeout = options.resetTimeout || 30000;
    this.failureCount = 0;
    this.lastFailureTime = null;
    this.state = 'closed'; // closed = accepting requests, open = rejecting
  }

  recordSuccess() {
    this.failureCount = 0;
    this.state = 'closed';
  }

  recordFailure() {
    this.failureCount++;
    this.lastFailureTime = Date.now();

    if (this.failureCount >= this.failureThreshold) {
      this.state = 'open';
      console.log(`Circuit breaker OPEN after ${this.failureCount} failures`);
    }
  }

  canExecute() {
    if (this.state === 'closed') {
      return true;
    }

    // Check if enough time has passed to try again (half-open state)
    const timeSinceLastFailure = Date.now() - this.lastFailureTime;
    if (timeSinceLastFailure >= this.resetTimeout) {
      this.state = 'half-open';
      console.log('Circuit breaker HALF-OPEN, allowing test request');
      return true;
    }

    return false;
  }
}

const breaker = new CircuitBreaker({
  failureThreshold: 5,    // Open after 5 consecutive failures
  resetTimeout: 30000     // Try again after 30 seconds
});

async function takeScreenshot(url) {
  if (!breaker.canExecute()) {
    throw new Error('Circuit breaker is open — screenshot service is temporarily unavailable');
  }

  try {
    const browser = await getBrowser();
    const page = await browser.newPage();
    try {
      await page.goto(url, {
        waitUntil: 'networkidle2',
        timeout: 30000
      });
      const screenshot = await page.screenshot({ type: 'png' });
      breaker.recordSuccess();
      return screenshot;
    } finally {
      await page.close();
    }
  } catch (e) {
    breaker.recordFailure();
    throw e;
  }
}

The three states work like this:

  • Closed: Everything is normal. Requests are processed. Failures are counted.
  • Open: Too many failures. All requests are immediately rejected with a clear error message. This prevents pile-up.
  • Half-open: After the reset timeout, one request is allowed through as a test. If it succeeds, the circuit closes. If it fails, the circuit opens again for another timeout period.
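To see those transitions concretely, here is the same breaker condensed (logging stripped, identical logic), followed by a short walkthrough:

```javascript
class CircuitBreaker {
  constructor({ failureThreshold = 5, resetTimeout = 30000 } = {}) {
    this.failureThreshold = failureThreshold;
    this.resetTimeout = resetTimeout;
    this.failureCount = 0;
    this.lastFailureTime = null;
    this.state = 'closed';
  }

  recordSuccess() {
    this.failureCount = 0;
    this.state = 'closed';
  }

  recordFailure() {
    this.failureCount++;
    this.lastFailureTime = Date.now();
    if (this.failureCount >= this.failureThreshold) this.state = 'open';
  }

  canExecute() {
    if (this.state === 'closed') return true;
    if (Date.now() - this.lastFailureTime >= this.resetTimeout) {
      this.state = 'half-open';
      return true;
    }
    return false;
  }
}

const demo = new CircuitBreaker({ failureThreshold: 3, resetTimeout: 1000 });

for (let i = 0; i < 3; i++) demo.recordFailure();
console.log(demo.state, demo.canExecute()); // prints "open false": requests rejected

demo.lastFailureTime = Date.now() - 2000; // pretend the reset timeout elapsed
console.log(demo.canExecute(), demo.state); // prints "true half-open": one test request

demo.recordSuccess();
console.log(demo.state); // prints "closed": back to normal
```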

When the circuit breaker opens, you should also trigger a browser restart. The combination of stopping new requests and recycling Chrome is usually enough to recover from any transient issue:

async function takeScreenshot(url) {
  if (!breaker.canExecute()) {
    // While the circuit is open, restart the browser
    if (browser) {
      browser.close().catch(() => {});
      browser = null;
    }
    throw new Error('Circuit breaker is open — screenshot service is temporarily unavailable');
  }

  // ... rest of implementation
}

Part 4: Error Handling

Production screenshot services encounter a wide variety of errors. The difference between a service that's reliable and one that's flaky comes down to how you classify, handle, and recover from these errors.

Error Classification

Not all errors are equal. Some are transient and will resolve on retry. Some indicate a fundamental problem with the request. Some mean Chrome itself needs to be restarted. Treating them all the same — either retrying everything or failing everything — leads to wasted time or missed opportunities for recovery.

function classifyError(error) {
  const message = error.message || '';
  const name = error.name || '';

  // Retryable errors — the same request might succeed on a second attempt
  if (name === 'TimeoutError') {
    return { type: 'retryable', action: 'retry', reason: 'timeout' };
  }

  if (message.includes('net::ERR_CONNECTION_RESET')) {
    return { type: 'retryable', action: 'retry', reason: 'connection_reset' };
  }

  if (message.includes('net::ERR_CONNECTION_REFUSED')) {
    return { type: 'retryable', action: 'retry', reason: 'connection_refused' };
  }

  if (message.includes('net::ERR_NETWORK_CHANGED')) {
    return { type: 'retryable', action: 'retry', reason: 'network_changed' };
  }

  if (message.includes('net::ERR_CONNECTION_TIMED_OUT')) {
    return { type: 'retryable', action: 'retry', reason: 'connection_timeout' };
  }

  // Browser restart errors — Chrome is in a bad state
  if (name === 'ProtocolError' || message.includes('Protocol error')) {
    return { type: 'browser_error', action: 'restart_browser', reason: 'protocol_error' };
  }

  if (message.includes('Target closed') || message.includes('Session closed')) {
    return { type: 'browser_error', action: 'restart_browser', reason: 'target_closed' };
  }

  if (message.includes('Browser disconnected') || message.includes('browser has disconnected')) {
    return { type: 'browser_error', action: 'restart_browser', reason: 'browser_disconnected' };
  }

  if (message.includes('crashed')) {
    return { type: 'browser_error', action: 'restart_browser', reason: 'browser_crashed' };
  }

  // Permanent errors — retrying won't help
  if (message.includes('net::ERR_NAME_NOT_RESOLVED')) {
    return { type: 'permanent', action: 'fail', reason: 'dns_resolution_failed' };
  }

  if (message.includes('net::ERR_CERT_')) {
    return { type: 'permanent', action: 'fail', reason: 'ssl_error' };
  }

  if (message.includes('net::ERR_INVALID_URL') || message.includes('Invalid URL')) {
    return { type: 'permanent', action: 'fail', reason: 'invalid_url' };
  }

  if (message.includes('net::ERR_ABORTED') && message.includes('404')) {
    return { type: 'permanent', action: 'fail', reason: 'page_not_found' };
  }

  // Unknown errors — treat as retryable with limited attempts
  return { type: 'unknown', action: 'retry', reason: 'unknown', maxRetries: 1 };
}

Using this classification in your screenshot function:

async function takeScreenshotWithClassification(browser, url) {
  const page = await browser.newPage();
  try {
    await page.goto(url, {
      waitUntil: 'networkidle2',
      timeout: 30000
    });
    return await page.screenshot({ type: 'png' });
  } catch (error) {
    const classification = classifyError(error);
    error.classification = classification;
    throw error;
  } finally {
    await page.close();
  }
}

Retry Strategies

For retryable errors, exponential backoff with jitter prevents thundering herd problems. When multiple requests fail at the same time (e.g., a network blip), you don't want them all retrying at exactly the same moment:

async function sleep(ms) {
  return new Promise((resolve) => setTimeout(resolve, ms));
}

function getBackoffDelay(attempt, baseDelay = 1000, maxDelay = 30000) {
  // Exponential backoff: 1s, 2s, 4s, 8s, 16s...
  const exponentialDelay = baseDelay * Math.pow(2, attempt);

  // Add random jitter (0-50% of the delay)
  const jitter = Math.random() * exponentialDelay * 0.5;

  return Math.min(exponentialDelay + jitter, maxDelay);
}

async function takeScreenshotWithRetry(url, options = {}) {
  const maxRetries = options.maxRetries || 3;
  const errors = [];

  for (let attempt = 0; attempt <= maxRetries; attempt++) {
    try {
      // Call getBrowser() inline: a local `const browser` here would shadow
      // the module-level variable that the catch block below needs to reset
      const result = await takeScreenshotWithClassification(await getBrowser(), url);
      return result;
      return result;
    } catch (error) {
      const classification = error.classification || classifyError(error);
      errors.push({
        attempt,
        error: error.message,
        classification
      });

      // Permanent errors: don't retry
      if (classification.action === 'fail') {
        console.log(`Permanent error for ${url}: ${classification.reason}`);
        throw error;
      }

      // Browser errors: restart browser before retrying
      if (classification.action === 'restart_browser') {
        console.log(`Browser error for ${url}: ${classification.reason}. Restarting browser.`);
        if (browser) {
          browser.close().catch(() => {});
          browser = null;
        }
      }

      // Check if we have retries left
      const effectiveMaxRetries = classification.maxRetries !== undefined
        ? classification.maxRetries
        : maxRetries;

      if (attempt >= effectiveMaxRetries) {
        console.log(`Max retries (${effectiveMaxRetries}) exceeded for ${url}`);
        error.allAttempts = errors;
        throw error;
      }

      // Wait before retrying
      const delay = getBackoffDelay(attempt);
      console.log(`Retrying ${url} in ${Math.round(delay)}ms (attempt ${attempt + 1}/${maxRetries})`);
      await sleep(delay);
    }
  }
}

The jitter is crucial. Without it, if 100 requests all time out at the same moment, they all retry at exactly baseDelay later, causing another spike. Jitter spreads the retries across a time window, smoothing the load.
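You can see the spread by sampling the delay function a few times per attempt:

```javascript
// Same backoff function as above, reproduced so this snippet runs standalone
function getBackoffDelay(attempt, baseDelay = 1000, maxDelay = 30000) {
  const exponentialDelay = baseDelay * Math.pow(2, attempt);
  const jitter = Math.random() * exponentialDelay * 0.5;
  return Math.min(exponentialDelay + jitter, maxDelay);
}

// Each attempt's retries land anywhere in a window half as wide as the
// exponential delay: attempt 0 in [1000, 1500), attempt 1 in [2000, 3000),
// attempt 2 in [4000, 6000), attempt 3 in [8000, 12000).
for (let attempt = 0; attempt < 4; attempt++) {
  const samples = Array.from({ length: 5 }, () =>
    Math.round(getBackoffDelay(attempt))
  );
  console.log(`attempt ${attempt}: ${samples.join('ms, ')}ms`);
}
```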

Structured Logging

When debugging screenshot failures in production, you need context. Which URL? How long did it take? How much memory was in use? What was the error? Structured logging captures all of this in a format that's easy to search and aggregate:

function createRequestLogger(requestId) {
  const startTime = Date.now();
  const events = [];

  return {
    log(event, data = {}) {
      const entry = {
        requestId,
        timestamp: new Date().toISOString(),
        elapsed: Date.now() - startTime,
        event,
        ...data
      };
      events.push(entry);
      console.log(JSON.stringify(entry));
    },

    summary() {
      const memUsage = process.memoryUsage();
      return {
        requestId,
        totalDuration: Date.now() - startTime,
        events: events.length,
        memory: {
          rss: Math.round(memUsage.rss / 1024 / 1024),
          heapUsed: Math.round(memUsage.heapUsed / 1024 / 1024)
        }
      };
    }
  };
}

async function takeScreenshotWithLogging(url, options = {}) {
  const requestId = `req_${Date.now()}_${Math.random().toString(36).substring(2, 8)}`;
  const logger = createRequestLogger(requestId);

  logger.log('screenshot_start', { url, options });

  const browser = await getBrowser();
  const page = await browser.newPage();

  try {
    logger.log('navigation_start', { url });
    await page.goto(url, {
      waitUntil: 'networkidle2',
      timeout: options.timeout || 30000
    });
    logger.log('navigation_complete');

    const metrics = await page.metrics();
    logger.log('page_metrics', {
      jsHeapUsedSize: Math.round(metrics.JSHeapUsedSize / 1024 / 1024),
      documents: metrics.Documents,
      nodes: metrics.Nodes
    });

    logger.log('screenshot_start_capture');
    const screenshot = await page.screenshot({
      type: options.format || 'png',
      quality: options.quality
    });
    logger.log('screenshot_complete', { sizeBytes: screenshot.length });

    const summary = logger.summary();
    logger.log('request_complete', summary);

    return screenshot;
  } catch (error) {
    logger.log('screenshot_error', {
      error: error.message,
      errorName: error.name,
      classification: classifyError(error)
    });
    throw error;
  } finally {
    await page.close();
    logger.log('page_closed');
  }
}

This produces log lines like:

{"requestId":"req_1708876543_a1b2c3","timestamp":"2026-02-25T12:00:00.000Z","elapsed":0,"event":"screenshot_start","url":"https://example.com"}
{"requestId":"req_1708876543_a1b2c3","timestamp":"2026-02-25T12:00:02.340Z","elapsed":2340,"event":"navigation_complete"}
{"requestId":"req_1708876543_a1b2c3","timestamp":"2026-02-25T12:00:02.890Z","elapsed":2890,"event":"screenshot_complete","sizeBytes":284573}

The requestId ties all log entries for a single request together. When debugging, you can filter by requestId to see the full lifecycle. When analyzing trends, you can aggregate by event to find patterns — which URLs are slowest, which produce the most errors, which consume the most memory.
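For example, a small aggregation pass over collected log lines can surface per-event counts and the slowest requests. A sketch, assuming the field names produced by the logger above (aggregateLogs is an illustrative helper, not a library function):

```javascript
// Group structured log lines by event and rank completed requests by duration.
function aggregateLogs(lines) {
  const byEvent = {};
  const durations = [];

  for (const line of lines) {
    let entry;
    try {
      entry = JSON.parse(line);
    } catch {
      continue; // skip non-JSON lines mixed into the stream
    }
    byEvent[entry.event] = (byEvent[entry.event] || 0) + 1;
    if (entry.event === 'request_complete') {
      durations.push({ requestId: entry.requestId, duration: entry.totalDuration });
    }
  }

  durations.sort((a, b) => b.duration - a.duration);
  return { byEvent, slowest: durations.slice(0, 5) };
}

const report = aggregateLogs([
  '{"requestId":"req_a","event":"navigation_complete"}',
  '{"requestId":"req_a","event":"request_complete","totalDuration":2890}'
]);
console.log(report.byEvent); // { navigation_complete: 1, request_complete: 1 }
```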

Alerting Thresholds

Not every error warrants waking someone up at 3am. Define clear thresholds based on aggregate metrics rather than individual failures:

class AlertingMonitor {
  constructor() {
    this.windowSize = 60000; // 1-minute window
    this.requests = [];
  }

  record(result) {
    const now = Date.now();
    this.requests.push({
      timestamp: now,
      success: result.success,
      duration: result.duration,
      memory: result.memory
    });

    // Prune old entries
    this.requests = this.requests.filter((r) => now - r.timestamp < this.windowSize);

    this.checkThresholds();
  }

  checkThresholds() {
    const total = this.requests.length;
    if (total < 5) return; // Not enough data

    // Error rate threshold: alert if > 10% of requests fail
    const failures = this.requests.filter((r) => !r.success).length;
    const errorRate = failures / total;
    if (errorRate > 0.1) {
      this.alert('high_error_rate', {
        errorRate: `${(errorRate * 100).toFixed(1)}%`,
        failures,
        total
      });
    }

    // Memory threshold: alert if RSS > 80% of available
    const latestMemory = this.requests[this.requests.length - 1].memory;
    const memoryLimitMb = 4096; // Adjust for your server
    if (latestMemory > memoryLimitMb * 0.8) {
      this.alert('high_memory', {
        currentMb: latestMemory,
        limitMb: memoryLimitMb,
        percentage: `${((latestMemory / memoryLimitMb) * 100).toFixed(1)}%`
      });
    }

    // Response time threshold: alert if p95 > 30s
    const durations = this.requests.map((r) => r.duration).sort((a, b) => a - b);
    const p95Index = Math.ceil(durations.length * 0.95) - 1;
    const p95 = durations[p95Index];
    if (p95 > 30000) {
      this.alert('high_latency', {
        p95: `${(p95 / 1000).toFixed(1)}s`,
        p50: `${(durations[Math.ceil(durations.length * 0.5) - 1] / 1000).toFixed(1)}s`
      });
    }
  }

  alert(type, data) {
    // Rate limit alerts: don't fire the same alert type more than once per 5 minutes
    const now = Date.now();

    if (!this.lastAlerts) this.lastAlerts = new Map();
    const lastAlert = this.lastAlerts.get(type);

    if (lastAlert && now - lastAlert < 300000) return;
    this.lastAlerts.set(type, now);

    console.error(JSON.stringify({
      level: 'alert',
      type,
      ...data,
      timestamp: new Date().toISOString()
    }));

    // In production, send to PagerDuty, Slack, email, etc.
  }
}

const monitor = new AlertingMonitor();

These thresholds serve as a starting point. Adjust based on your traffic patterns. A service handling 10 requests per minute needs different thresholds than one handling 1,000. The key principle is to alert on sustained degradation, not individual failures.

Putting It All Together

Here's a complete production-ready screenshot function that incorporates all the patterns from this guide — adaptive timeouts, memory management, error classification, retry with backoff, circuit breaking, and structured logging:

const puppeteer = require('puppeteer');

const CHROME_ARGS = [
  '--no-sandbox',
  '--disable-setuid-sandbox',
  '--disable-dev-shm-usage',
  '--disable-gpu',
  '--disable-extensions',
  '--disable-background-networking'
];

const MAX_REQUESTS_PER_BROWSER = 100;
const TIMEOUT_TIERS = [15000, 30000, 60000];

let browser = null;
let requestCount = 0;

async function getBrowser() {
  if (!browser || !browser.connected || requestCount >= MAX_REQUESTS_PER_BROWSER) {
    if (browser) {
      const old = browser;
      browser = null;
      old.close().catch(() => {});
    }
    browser = await puppeteer.launch({ args: CHROME_ARGS });
    requestCount = 0;
  }
  requestCount++;
  return browser;
}

function classifyError(error) {
  const msg = error.message || '';
  if (error.name === 'TimeoutError') return 'retryable';
  if (msg.includes('net::ERR_CONNECTION')) return 'retryable';
  if (msg.includes('Protocol error') || msg.includes('Target closed')) return 'browser';
  if (msg.includes('ERR_NAME_NOT_RESOLVED') || msg.includes('Invalid URL')) return 'permanent';
  return 'retryable';
}

async function captureScreenshot(url, options = {}) {
  const startTime = Date.now();
  const maxRetries = options.maxRetries || 2;
  const format = options.format || 'png';
  let lastError = null;

  for (let attempt = 0; attempt <= maxRetries; attempt++) {
    // Select timeout tier based on attempt number
    const timeout = TIMEOUT_TIERS[Math.min(attempt, TIMEOUT_TIERS.length - 1)];
    const currentBrowser = await getBrowser();
    const page = await currentBrowser.newPage();

    try {
      // Block known-slow third-party resources
      await page.setRequestInterception(true);
      page.on('request', (req) => {
        const blockedDomains = ['google-analytics.com', 'googletagmanager.com', 'facebook.net'];
        const isBlocked = blockedDomains.some((d) => req.url().includes(d));
        isBlocked ? req.abort() : req.continue();
      });

      // Set viewport
      await page.setViewport({
        width: options.width || 1280,
        height: options.height || 720
      });

      // Navigate with tiered timeout
      await page.goto(url, { waitUntil: 'networkidle2', timeout });

      // Optional: wait for a specific selector
      if (options.waitForSelector) {
        await page.waitForSelector(options.waitForSelector, {
          visible: true,
          timeout: 10000
        });
      }

      // Capture
      const screenshot = await page.screenshot({
        type: format,
        quality: format === 'png' ? undefined : (options.quality || 80),
        fullPage: options.fullPage || false
      });

      const duration = Date.now() - startTime;
      console.log(JSON.stringify({
        event: 'screenshot_success',
        url,
        attempt,
        timeout,
        duration,
        sizeBytes: screenshot.length
      }));

      return screenshot;
    } catch (error) {
      lastError = error;
      const errorType = classifyError(error);

      console.log(JSON.stringify({
        event: 'screenshot_error',
        url,
        attempt,
        timeout,
        errorType,
        error: error.message
      }));

      // Permanent errors: fail immediately
      if (errorType === 'permanent') throw error;

      // Browser errors: restart Chrome
      if (errorType === 'browser') {
        if (browser) {
          browser.close().catch(() => {});
          browser = null;
        }
      }

      // Wait before retrying (exponential backoff with jitter)
      if (attempt < maxRetries) {
        const delay = 1000 * Math.pow(2, attempt) + Math.random() * 1000;
        await new Promise((resolve) => setTimeout(resolve, delay));
      }
    } finally {
      await page.close().catch(() => {});
    }
  }

  throw lastError;
}

// Graceful shutdown
process.on('SIGINT', async () => {
  if (browser) await browser.close().catch(() => {});
  process.exit(0);
});
process.on('SIGTERM', async () => {
  if (browser) await browser.close().catch(() => {});
  process.exit(0);
});

module.exports = { captureScreenshot };

This is roughly 140 lines of well-commented code. It handles the common failure modes: timeout escalation across retries, request interception for slow third-party scripts, error classification to avoid retrying permanent failures, browser restart on protocol errors, periodic browser recycling to prevent memory growth, and graceful shutdown to prevent zombie processes.

It's also incomplete. A true production screenshot service needs concurrency control (what happens when 50 requests arrive simultaneously?), queuing (how do you handle bursts?), caching (why re-render the same page twice?), storage (where do the screenshots go?), monitoring (is the service healthy?), and CDN distribution (how do users access the images?). Each of these is its own engineering project.
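Concurrency control is a good example of the extra work involved. A hypothetical sketch of a counting semaphore that caps simultaneous jobs (Semaphore, gatedScreenshot, and the limit of 5 are illustrative; captureScreenshot is the function from the module above):

```javascript
// Hypothetical concurrency gate: cap the number of screenshot jobs in flight.
class Semaphore {
  constructor(max) {
    this.max = max;
    this.active = 0;
    this.queue = [];
  }

  async acquire() {
    if (this.active < this.max) {
      this.active++;
      return;
    }
    // Wait until release() hands us a slot; the active count stays constant
    await new Promise((resolve) => this.queue.push(resolve));
  }

  release() {
    const next = this.queue.shift();
    if (next) {
      next(); // transfer the slot directly to a waiter
    } else {
      this.active--;
    }
  }
}

const gate = new Semaphore(5); // illustrative limit: 5 concurrent pages

async function gatedScreenshot(url) {
  await gate.acquire();
  try {
    return await captureScreenshot(url); // from the module above
  } finally {
    gate.release();
  }
}
```

Handing the slot directly to a waiter in release() avoids a race where a new caller sneaks in between the decrement and the waiter resuming.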

When to Let Someone Else Debug This

Every pattern in this guide is a solved problem. Navigation timeouts, memory leaks, zombie processes, error handling, retry logic, circuit breakers — these are all well-understood issues with known solutions. The challenge isn't knowing what to do. It's implementing all of it correctly, keeping it running reliably, and maintaining it as Puppeteer and Chrome evolve.

If screenshots are a supporting feature in your product — social cards, link previews, documentation images, PDF generation — the time spent debugging Chrome memory issues is time not spent on your actual product.

RenderScreenshot handles all of this infrastructure. Timeouts, memory management, zombie process cleanup, error handling, retry logic, caching, and CDN distribution — all managed automatically. A single API call replaces everything in this guide:

curl "https://api.renderscreenshot.com/v1/screenshot?url=https://example.com&width=1280&height=720" \
  -H "Authorization: Bearer rs_live_..."

No Chrome instances to manage. No memory monitoring. No zombie process cleanup scripts. No Docker /dev/shm debugging. The screenshot is captured on edge infrastructure, cached on a global CDN, and returned in under 3 seconds for most pages.

You can sign up for free and get 50 credits to try it out — enough to test your workflow and see if the results match what you need.

For more on scaling self-managed Puppeteer infrastructure, see our scaling guide, which covers concurrency, Docker optimization, and performance tuning.


Have questions about Puppeteer debugging? Check our documentation or reach out at [email protected].