Improper Sharding Strategy — When Your "Scalable" Database Isn't

Introduction

Sharding splits your database across multiple nodes to scale writes and storage beyond what a single database can handle. But the sharding key determines everything — choose wrong and you get hot spots (some shards overwhelmed, others idle), cross-shard queries (expensive joins across nodes), or uneven growth (the current time shard fills up while historical shards sit empty).

The Hot Shard Problem

Sharding by user_id modulo 8 (8 shards):
- Shard 0: user_ids 0, 8, 16, 24, …
- Shard 1: user_ids 1, 9, 17, 25, …
- ...

Problem: Your top 100 enterprise customers (each with 10,000 employees)
all happen to land on shard 3 because their IDs share the same residue mod 8.
- Shard 3: 100% CPU, 95% storage
- Other shards: 30% CPU, 40% storage

This isn't hypothetical — with modulo sharding, any clustering
in the data distribution creates hot shards.
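A toy simulation makes the skew concrete. The tenant counts and row volumes below are illustrative assumptions, not measurements; the point is that one shared residue class is enough to overload a shard:

```typescript
// Toy simulation: 100 enterprise tenants whose sequentially allocated IDs all
// fall in the same residue class mod 8, plus a long tail of small tenants.
const SHARDS = 8
const load: number[] = new Array(SHARDS).fill(0)

// Enterprise tenants: IDs 3, 11, 19, ...: every one maps to shard 3 under mod 8
for (let i = 0; i < 100; i++) load[(3 + i * 8) % SHARDS] += 10_000

// Long tail: 10,000 small tenants with evenly spread IDs, 10 rows each
for (let id = 0; id < 10_000; id++) load[id % SHARDS] += 10

console.log(load)
// → [12500, 12500, 12500, 1012500, 12500, 12500, 12500, 12500]
```

Shard 3 ends up carrying roughly 80x the rows of its neighbors, even though seven of eight shards are nearly idle.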

Choosing the Right Shard Key

Good shard keys:
- UUID v4 (random) — effectively uniform distribution
- High-cardinality IDs with random component
- Hash of customer_id (if customer IDs are sequential, hash them first)

Bad shard keys:
- Sequential integer IDs (mod N → uneven for non-uniform write patterns)
- Status column ('pending', 'complete') — all new data hits 'pending' shard
- Date/timestamp — all writes hit current time shard
- Tenant size — small tenants on shard A, large tenants on shard B → uneven
- Country code — most traffic comes from the US, so the US shard is hot
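The "hash sequential IDs first" fix can be sketched in a few lines. The function name and shard count here are illustrative; any stable hash works, since it only needs to scramble allocation order, not be cryptographically strong:

```typescript
import * as crypto from 'crypto'

// Sketch: if your IDs are sequential, hash them before taking the modulo.
// The hash scrambles the bits, so adjacent IDs land on unrelated shards and
// any block structure in how IDs were allocated is broken up.
function hashShard(customerId: number, shards = 8): number {
  const digest = crypto.createHash('md5').update(String(customerId)).digest()
  return digest.readUInt32BE(0) % shards
}

// Sequential IDs 1000..1007: raw modulo assigns them to shards 0..7 in order
// (preserving allocation patterns); the hashed assignment is scattered.
for (let id = 1000; id < 1008; id++) {
  console.log(id, id % 8, hashShard(id))
}
```

Note the limit of this trick: hashing fixes clustering that comes from ID allocation, but it cannot split a single oversized tenant whose rows all share one key.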

Consistent Hashing — Shard Remapping Without Full Rebalancing

import * as crypto from 'crypto'

class ConsistentHashRing {
  private ring = new Map<number, string>()  // hash position → shard name
  private sortedKeys: number[] = []
  private virtualNodes = 150  // more virtual nodes = more even distribution

  addShard(shardId: string): void {
    for (let i = 0; i < this.virtualNodes; i++) {
      const virtualNodeKey = `${shardId}:${i}`
      const hash = this.hash(virtualNodeKey)
      this.ring.set(hash, shardId)
    }
    this.sortedKeys = [...this.ring.keys()].sort((a, b) => a - b)
  }

  removeShard(shardId: string): void {
    for (let i = 0; i < this.virtualNodes; i++) {
      const hash = this.hash(`${shardId}:${i}`)
      this.ring.delete(hash)
    }
    this.sortedKeys = [...this.ring.keys()].sort((a, b) => a - b)
  }

  getShard(key: string): string {
    if (this.ring.size === 0) throw new Error('No shards available')

    const hash = this.hash(key)
    // Find first shard clockwise from this hash
    const idx = this.sortedKeys.findIndex((k) => k >= hash)
    const position = idx === -1 ? this.sortedKeys[0] : this.sortedKeys[idx]
    return this.ring.get(position)!
  }

  private hash(key: string): number {
    const buf = crypto.createHash('md5').update(key).digest()
    return buf.readUInt32BE(0)
  }
}

const ring = new ConsistentHashRing()
ring.addShard('db-1')
ring.addShard('db-2')
ring.addShard('db-3')

// Consistent: same userId always goes to same shard
// Adding shard 4: only ~25% of keys need to move (vs ~75% with naive modulo)
ring.addShard('db-4')

function getShardForUser(userId: string): string {
  return ring.getShard(userId)
}

Avoiding Cross-Shard Joins

The biggest operational cost of sharding is queries that need data from multiple shards:

// ❌ Cross-shard join — requires querying all shards, merging results
async function getUserOrdersAndProfile(userId: string) {
  const userShard = getShardForUser(userId)

  // The user row lives on exactly one known shard
  const user = await shards[userShard].query('SELECT * FROM users WHERE id = $1', [userId])

  // But orders might be sharded differently!
  // If sharded by order_id (not user_id), a user's orders are scattered across all shards
  const orders = await Promise.all(
    Object.values(shards).map(shard =>
      shard.query('SELECT * FROM orders WHERE user_id = $1', [userId])
    )
  )
  // Merges all shard results — expensive, complex

  return { user: user.rows[0], orders: orders.flatMap(r => r.rows) }
}

// ✅ Co-locate related data on the same shard
// Shard orders by user_id (same key as users table)
// Then user AND their orders are on the same shard — no cross-shard join needed
async function getUserOrders(userId: string) {
  const shard = getShardForUser(userId)

  // Both queries go to the same shard
  const [user, orders] = await Promise.all([
    shards[shard].query('SELECT * FROM users WHERE id = $1', [userId]),
    shards[shard].query('SELECT * FROM orders WHERE user_id = $1', [userId]),
  ])

  return { user: user.rows[0], orders: orders.rows }
}

Directory-Based Sharding (Most Flexible)

// Lookup table: entity → shard assignment
// Allows explicit rebalancing and handles uneven data without algorithm changes

class ShardDirectory {
  async getShard(entityType: string, entityId: string): Promise<string> {
    // Check lookup table first
    const entry = await metaDb.query(
      'SELECT shard_id FROM shard_directory WHERE entity_type = $1 AND entity_id = $2',
      [entityType, entityId]
    )

    if (entry.rows.length > 0) {
      return entry.rows[0].shard_id
    }

    // New entity — assign to least-loaded shard.
    // ON CONFLICT guards against two concurrent requests assigning different
    // shards: whichever insert wins, both callers read back the same row.
    const shard = await this.assignToLeastLoadedShard(entityType)
    const inserted = await metaDb.query(
      `INSERT INTO shard_directory (entity_type, entity_id, shard_id) VALUES ($1, $2, $3)
       ON CONFLICT (entity_type, entity_id) DO UPDATE SET shard_id = shard_directory.shard_id
       RETURNING shard_id`,
      [entityType, entityId, shard]
    )
    return inserted.rows[0].shard_id
  }

  private async assignToLeastLoadedShard(entityType: string): Promise<string> {
    const loads = await metaDb.query(
      'SELECT shard_id, COUNT(*) as entity_count FROM shard_directory WHERE entity_type = $1 GROUP BY shard_id',
      [entityType]
    )
    // Return shard with fewest entities. Note: COUNT(*) comes back as a string
    // in node-postgres, and a brand-new shard with zero entries won't appear in
    // this result at all; seed new shards with a row or union in the full list.
    return loads.rows.sort((a, b) => Number(a.entity_count) - Number(b.entity_count))[0]?.shard_id ?? 'shard-1'
  }
}
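The payoff of the directory approach is cheap rebalancing. This in-memory sketch (class and shard names are hypothetical, and it ignores persistence and concurrency) shows the core mechanic: moving a hot tenant is one pointer update plus a data copy, with no change to any hashing algorithm:

```typescript
// In-memory sketch of a shard directory. Rebalancing a hot tenant is one
// directory update plus a data copy; no keys elsewhere are affected.
class InMemoryDirectory {
  private assignments = new Map<string, string>()

  // Sticky assignment: new entities go to the currently least-loaded shard
  assign(entityId: string, shards: string[]): string {
    const existing = this.assignments.get(entityId)
    if (existing) return existing
    const counts = new Map(shards.map((s): [string, number] => [s, 0]))
    for (const s of this.assignments.values()) counts.set(s, (counts.get(s) ?? 0) + 1)
    const target = shards.reduce((a, b) => (counts.get(a)! <= counts.get(b)! ? a : b))
    this.assignments.set(entityId, target)
    return target
  }

  // In production: copy the tenant's rows first, then flip this pointer
  move(entityId: string, toShard: string): void {
    this.assignments.set(entityId, toShard)
  }
}

const dir = new InMemoryDirectory()
const shardList = ['shard-1', 'shard-2']
dir.assign('tenant-a', shardList)  // → 'shard-1'
dir.assign('tenant-b', shardList)  // → 'shard-2' (least loaded)
dir.move('tenant-a', 'shard-2')    // hot tenant relocated without resharding
```

The tradeoff is an extra lookup on every request, so the directory usually sits behind a cache, and the metadata store becomes a component you must keep highly available.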

Monitoring for Hot Shards

// Check per-shard query rates and storage
async function checkShardBalance() {
  const stats = await Promise.all(
    Object.entries(shards).map(async ([name, pool]) => {
      const result = await pool.query(`
        SELECT
          pg_database_size(current_database()) AS size_bytes,
          (SELECT sum(n_live_tup) FROM pg_stat_user_tables) AS row_count
      `)
      return { shard: name, ...result.rows[0] }
    })
  )

  const avgSize = stats.reduce((s, r) => s + Number(r.size_bytes), 0) / stats.length

  for (const stat of stats) {
    const ratio = Number(stat.size_bytes) / avgSize
    if (ratio > 1.5) {
      console.warn(`Hot shard detected: ${stat.shard} is ${ratio.toFixed(1)}x average size`)
    }
  }
}

Sharding Checklist

  • ✅ Choose shard key that distributes writes evenly (prefer UUIDs over sequential IDs)
  • ✅ Co-locate related data on the same shard — avoid cross-shard joins
  • ✅ Use consistent hashing for online shard addition without full rebalancing
  • ✅ Monitor per-shard query rates and storage size — alert on imbalance
  • ✅ Consider directory-based sharding for fine-grained control
  • ✅ Test shard distribution with production-like data before going live
  • ✅ Plan for rebalancing from day one — you will need it eventually
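The "test shard distribution before going live" item can be a short script. This sketch (the shard function and the synthetic key sample are stand-ins; replay real production keys instead) replays keys through the chosen shard function and flags any shard holding noticeably more than its fair share:

```typescript
import * as crypto from 'crypto'

// Pre-launch sanity check: replay a sample of keys through the shard function
// and warn if any shard holds well over its fair share.
function shardOf(key: string, n: number): number {
  return crypto.createHash('md5').update(key).digest().readUInt32BE(0) % n
}

function distributionReport(keys: string[], n: number): number[] {
  const counts: number[] = new Array(n).fill(0)
  for (const k of keys) counts[shardOf(k, n)]++
  return counts
}

// Stand-in for a production key sample (e.g. last month's real order IDs)
const sample = Array.from({ length: 8_000 }, (_, i) => `order-${i}`)
const counts = distributionReport(sample, 8)
const fairShare = sample.length / 8
for (const [shard, count] of counts.entries()) {
  if (count > fairShare * 1.25) {
    console.warn(`Shard ${shard} holds ${count} keys (fair share: ${fairShare})`)
  }
}
```

Run this against real key samples, not synthetic ones: synthetic keys are uniform by construction, while production keys carry exactly the clustering you are trying to detect.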

Conclusion

Sharding is the right answer when you've exhausted vertical scaling and read replicas. But a bad sharding key turns a scaling solution into a hot-spot generator. The safest shard key is a uniformly distributed UUID. The most important constraint is co-locating data your queries join — sharding users by user_id means orders should also shard by user_id so they end up on the same node. Use consistent hashing to allow adding shards online, monitor for hot shards from day one, and have a rebalancing plan before you need one.