Large Offset Query Slowness — The Export Job That Takes 6 Hours

Introduction

When you need to process all rows in a large table — for exports, data migrations, report generation, or ETL pipelines — the naive approach is to paginate with OFFSET. It works for the first few thousand rows, but each subsequent batch has a larger OFFSET, and a larger OFFSET means more rows read and discarded. By the final batches, the database is reading millions of rows just to return a single thousand-row page.

The Problem

// ❌ OFFSET pagination for bulk processing — gets slower every iteration
async function exportAllOrders() {
  const pageSize = 1000
  let page = 0
  let processed = 0

  while (true) {
    const offset = page * pageSize

    // Batch 1 (offset 0): reads 1,000 rows → 50ms
    // Batch 100 (offset 99,000): reads 100,000 rows, discards 99,000 → 2s
    // Batch 1000 (offset 999,000): reads 1,000,000 rows, discards 999,000 → 45s
    // Batch 10000 (offset 9,999,000): reads 10M rows → 8+ minutes per batch!
    const result = await db.query(`
      SELECT * FROM orders ORDER BY id LIMIT $1 OFFSET $2
    `, [pageSize, offset])

    if (result.rows.length === 0) break

    await processRows(result.rows)
    processed += result.rows.length
    page++
  }
}

Fix 1: Keyset Pagination (Cursor-Based) for Bulk Processing

// ✅ Cursor-based: always constant time per batch
async function exportAllOrdersCursor() {
  const pageSize = 1000
  let lastId: string | null = null
  let processed = 0

  while (true) {
    const result = lastId
      ? await db.query(`
          SELECT * FROM orders
          WHERE id > $1  -- Only reads the NEXT 1,000 rows, no scanning
          ORDER BY id
          LIMIT $2
        `, [lastId, pageSize])
      : await db.query(`
          SELECT * FROM orders ORDER BY id LIMIT $1
        `, [pageSize])

    if (result.rows.length === 0) break

    await processRows(result.rows)
    lastId = result.rows[result.rows.length - 1].id
    processed += result.rows.length

    console.log(`Processed ${processed} orders`)
    // Every batch takes ~50ms regardless of position — constant time!
  }
}

The WHERE id > lastId clause, combined with an index on id, tells PostgreSQL exactly where to start — no rows are read and discarded.

Fix 2: PostgreSQL Server-Side Cursor (For Very Large Exports)

For truly massive exports, use a server-side cursor to stream rows without holding them all in memory:

import { Pool, PoolClient } from 'pg'

async function streamLargeTable(
  pool: Pool,
  query: string,
  processRow: (row: any) => Promise<void>
): Promise<void> {
  const client = await pool.connect()

  try {
    // Open a server-side cursor — PostgreSQL holds the result set
    await client.query('BEGIN')
    await client.query(`DECLARE export_cursor CURSOR FOR ${query}`)

    let done = false
    while (!done) {
      // Fetch 500 rows at a time from the server cursor
      const result = await client.query('FETCH FORWARD 500 FROM export_cursor')

      if (result.rows.length === 0) {
        done = true
      } else {
        for (const row of result.rows) {
          await processRow(row)
        }
      }
    }

    await client.query('CLOSE export_cursor')
    await client.query('COMMIT')
  } catch (err) {
    await client.query('ROLLBACK')
    throw err
  } finally {
    client.release()
  }
}

// Usage
await streamLargeTable(
  pool,
  'SELECT * FROM orders ORDER BY created_at',
  async (order) => {
    await writeToExportFile(order)
  }
)

A server-side cursor is ideal here for three reasons: the surrounding transaction keeps the snapshot consistent for the duration of the export, rows are fetched in small chunks without the OFFSET re-scanning problem, and the full result set never has to fit in Node.js memory.

Fix 3: Parallel Chunking by ID Range

For maximum throughput, split the table by ID range and process chunks in parallel:

async function parallelExport(parallelism = 4) {
  // Count rows to size the chunks
  const countResult = await db.query(`SELECT COUNT(*) AS total FROM orders`)
  const total = parseInt(countResult.rows[0].total, 10)
  const chunkSize = Math.ceil(total / parallelism)

  // ROW_NUMBER over the ordered ids computes chunk boundaries — this works
  // for sequential integer ids and UUIDs alike, at the cost of one full
  // index scan to number the rows
  const chunks = await db.query(`
    SELECT
      MIN(id) AS chunk_start,
      MAX(id) AS chunk_end
    FROM (
      SELECT id, ((ROW_NUMBER() OVER (ORDER BY id) - 1) / $1) AS chunk_num
      FROM orders
    ) numbered
    GROUP BY chunk_num
    ORDER BY chunk_num
  `, [chunkSize])

  // Process chunks in parallel
  await Promise.all(
    chunks.rows.map(({ chunk_start, chunk_end }) =>
      processChunk(chunk_start, chunk_end)
    )
  )
}

async function processChunk(startId: string, endId: string) {
  // Note: this loads the whole chunk into memory at once — fine for modest
  // chunks; combine with keyset batching or a cursor for very large ones
  const rows = await db.query(`
    SELECT * FROM orders WHERE id >= $1 AND id <= $2 ORDER BY id
  `, [startId, endId])

  for (const row of rows.rows) {
    await processRow(row)
  }
}

Fix 4: COPY for Maximum Export Speed

When you need to export the entire table as fast as possible:

import { createWriteStream } from 'fs'
import { pipeline } from 'stream/promises'
import { to as copyTo } from 'pg-copy-streams'

async function exportToCsv(outputPath: string) {
  const client = await pool.connect()

  try {
    // COPY is the fastest way to read all rows — it bypasses row-by-row
    // overhead. A plain client.query() resolves to a result object, not a
    // stream, so pg-copy-streams is needed to expose COPY ... TO STDOUT
    // as a readable stream
    const copyStream = client.query(copyTo(
      `COPY (SELECT * FROM orders ORDER BY created_at) TO STDOUT WITH (FORMAT CSV, HEADER true)`
    ))

    const outputStream = createWriteStream(outputPath)
    await pipeline(copyStream, outputStream)

    console.log(`Export complete: ${outputPath}`)
  } finally {
    client.release()
  }
}

PostgreSQL's COPY command is 5-10x faster than equivalent SELECT queries for bulk exports — it bypasses much of the query parsing and row serialization overhead.

Large Offset Checklist

  • ✅ Replace OFFSET with keyset pagination (WHERE id > lastId) for bulk processing
  • ✅ Use server-side cursors for memory-efficient streaming of large result sets
  • ✅ Use COPY TO for maximum export throughput
  • ✅ Split large tables into ID-range chunks for parallel processing
  • ✅ Add created_at or id index if doing sequential table scans
  • ✅ Log progress — large jobs should report row count and ETA
  • ✅ Run bulk exports as background jobs, not in API request handlers

Conclusion

OFFSET pagination is a hidden performance trap for bulk processing jobs. As the offset grows, each batch gets linearly slower, so the total runtime grows quadratically with table size — an export that processes 10 million rows with OFFSET might take hours while cursor-based pagination would take minutes. The fix is to replace ORDER BY id LIMIT N OFFSET M with WHERE id > lastId ORDER BY id LIMIT N. The WHERE clause, backed by an index, makes every batch constant-time regardless of position. For maximum throughput, use PostgreSQL's native COPY TO command or a server-side cursor with FETCH FORWARD.