Improper Sharding Strategy — When Your "Scalable" Database Isn't

Introduction

Sharding splits your database across multiple nodes to scale writes and storage beyond what a single database can handle. But the sharding key determines everything — choose wrong and you get hot spots (some shards overwhelmed, others idle), cross-shard queries (expensive joins across nodes), or uneven growth (the current time shard fills up while historical shards sit empty).

The Hot Shard Problem

Sharding by user_id modulo 8 (8 shards):
- Shard 0: user_ids 0, 8, 16, 24, …
- Shard 1: user_ids 1, 9, 17, 25, …
- ...

Problem: Your top 100 enterprise customers (each with 10,000 employees)
all happen to land on shard 3 because their IDs share the same residue mod 8.
- Shard 3: 100% CPU, 95% storage
- Other shards: 30% CPU, 40% storage

This isn't hypothetical — with modulo sharding, any clustering
in the data distribution creates hot shards.
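A toy simulation makes the skew concrete. The tenant counts and row volumes below are illustrative assumptions, not measurements; the point is that one shared residue class is enough to overload a shard:

```typescript
// Toy simulation: 100 enterprise tenants whose sequentially allocated IDs all
// fall in the same residue class mod 8, plus a long tail of small tenants.
const SHARDS = 8
const load: number[] = new Array(SHARDS).fill(0)

// Enterprise tenants: IDs 3, 11, 19, ...: every one maps to shard 3 under mod 8
for (let i = 0; i < 100; i++) load[(3 + i * 8) % SHARDS] += 10_000

// Long tail: 10,000 small tenants with evenly spread IDs, 10 rows each
for (let id = 0; id < 10_000; id++) load[id % SHARDS] += 10

console.log(load)
// → [12500, 12500, 12500, 1012500, 12500, 12500, 12500, 12500]
```

Shard 3 ends up carrying roughly 80x the rows of its neighbors, even though seven of eight shards are nearly idle.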

Choosing the Right Shard Key

Good shard keys:
- UUID v4 (random) — effectively uniform distribution
- High-cardinality IDs with random component
- Hash of customer_id (if customer IDs are sequential, hash them first)

Bad shard keys:
- Sequential integer IDs (mod N → uneven for non-uniform write patterns)
- Status column ('pending', 'complete') — all new data hits 'pending' shard
- Date/timestamp — all writes hit current time shard
- Tenant size — small tenants on shard A, large tenants on shard B → uneven
- Country code — most traffic comes from the US, so the US shard is hot
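The "hash sequential IDs first" fix can be sketched in a few lines. The function name and shard count here are illustrative; any stable hash works, since it only needs to scramble allocation order, not be cryptographically strong:

```typescript
import * as crypto from 'crypto'

// Sketch: if your IDs are sequential, hash them before taking the modulo.
// The hash scrambles the bits, so adjacent IDs land on unrelated shards and
// any block structure in how IDs were allocated is broken up.
function hashShard(customerId: number, shards = 8): number {
  const digest = crypto.createHash('md5').update(String(customerId)).digest()
  return digest.readUInt32BE(0) % shards
}

// Sequential IDs 1000..1007: raw modulo assigns them to shards 0..7 in order
// (preserving allocation patterns); the hashed assignment is scattered.
for (let id = 1000; id < 1008; id++) {
  console.log(id, id % 8, hashShard(id))
}
```

Note the limit of this trick: hashing fixes clustering that comes from ID allocation, but it cannot split a single oversized tenant whose rows all share one key.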

Consistent Hashing — Shard Remapping Without Full Rebalancing

import * as crypto from 'crypto'

class ConsistentHashRing {
  private ring = new Map<number, string>()  // hash position → shard name
  private sortedKeys: number[] = []
  private virtualNodes = 150  // more virtual nodes = more even distribution

  addShard(shardId: string): void {
    for (let i = 0; i < this.virtualNodes; i++) {
      const virtualNodeKey = `${shardId}:${i}`
      const hash = this.hash(virtualNodeKey)
      this.ring.set(hash, shardId)
    }
    this.sortedKeys = [...this.ring.keys()].sort((a, b) => a - b)
  }

  removeShard(shardId: string): void {
    for (let i = 0; i < this.virtualNodes; i++) {
      const hash = this.hash(`${shardId}:${i}`)
      this.ring.delete(hash)
    }
    this.sortedKeys = [...this.ring.keys()].sort((a, b) => a - b)
  }

  getShard(key: string): string {
    if (this.ring.size === 0) throw new Error('No shards available')

    const hash = this.hash(key)
    // Find first shard clockwise from this hash
    const idx = this.sortedKeys.findIndex((k) => k >= hash)
    const position = idx === -1 ? this.sortedKeys[0] : this.sortedKeys[idx]
    return this.ring.get(position)!
  }

  private hash(key: string): number {
    const buf = crypto.createHash('md5').update(key).digest()
    return buf.readUInt32BE(0)
  }
}

const ring = new ConsistentHashRing()
ring.addShard('db-1')
ring.addShard('db-2')
ring.addShard('db-3')

// Consistent: same userId always goes to same shard
// Adding shard 4: only ~25% of keys need to move (vs ~75% with naive modulo)
ring.addShard('db-4')

function getShardForUser(userId: string): string {
  return ring.getShard(userId)
}

Avoiding Cross-Shard Joins

The biggest operational cost of sharding is queries that need data from multiple shards:

// ❌ Cross-shard join — requires querying all shards, merging results
async function getUserOrdersAndProfile(userId: string) {
  const userShard = getShardForUser(userId)

  // The user row lives on exactly one known shard
  const user = await shards[userShard].query('SELECT * FROM users WHERE id = $1', [userId])

  // But orders might be sharded differently!
  // If sharded by order_id (not user_id), a user's orders are scattered across all shards
  const orders = await Promise.all(
    Object.values(shards).map(shard =>
      shard.query('SELECT * FROM orders WHERE user_id = $1', [userId])
    )
  )
  // Merges all shard results — expensive, complex

  return { user: user.rows[0], orders: orders.flatMap(r => r.rows) }
}

// ✅ Co-locate related data on the same shard
// Shard orders by user_id (same key as users table)
// Then user AND their orders are on the same shard — no cross-shard join needed
async function getUserOrders(userId: string) {
  const shard = getShardForUser(userId)

  // Both queries go to the same shard
  const [user, orders] = await Promise.all([
    shards[shard].query('SELECT * FROM users WHERE id = $1', [userId]),
    shards[shard].query('SELECT * FROM orders WHERE user_id = $1', [userId]),
  ])

  return { user: user.rows[0], orders: orders.rows }
}

Directory-Based Sharding (Most Flexible)

// Lookup table: entity → shard assignment
// Allows explicit rebalancing and handles uneven data without algorithm changes

class ShardDirectory {
  async getShard(entityType: string, entityId: string): Promise<string> {
    // Check lookup table first
    const entry = await metaDb.query(
      'SELECT shard_id FROM shard_directory WHERE entity_type = $1 AND entity_id = $2',
      [entityType, entityId]
    )

    if (entry.rows.length > 0) {
      return entry.rows[0].shard_id
    }

    // New entity — assign to least-loaded shard.
    // ON CONFLICT guards against two concurrent requests assigning different
    // shards: whichever insert wins, both callers read back the same row.
    const shard = await this.assignToLeastLoadedShard(entityType)
    const inserted = await metaDb.query(
      `INSERT INTO shard_directory (entity_type, entity_id, shard_id) VALUES ($1, $2, $3)
       ON CONFLICT (entity_type, entity_id) DO UPDATE SET shard_id = shard_directory.shard_id
       RETURNING shard_id`,
      [entityType, entityId, shard]
    )
    return inserted.rows[0].shard_id
  }

  private async assignToLeastLoadedShard(entityType: string): Promise<string> {
    const loads = await metaDb.query(
      'SELECT shard_id, COUNT(*) as entity_count FROM shard_directory WHERE entity_type = $1 GROUP BY shard_id',
      [entityType]
    )
    // Return shard with fewest entities. Note: COUNT(*) comes back as a string
    // in node-postgres, and a brand-new shard with zero entries won't appear in
    // this result at all; seed new shards with a row or union in the full list.
    return loads.rows.sort((a, b) => Number(a.entity_count) - Number(b.entity_count))[0]?.shard_id ?? 'shard-1'
  }
}
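The payoff of the directory approach is cheap rebalancing. This in-memory sketch (class and shard names are hypothetical, and it ignores persistence and concurrency) shows the core mechanic: moving a hot tenant is one pointer update plus a data copy, with no change to any hashing algorithm:

```typescript
// In-memory sketch of a shard directory. Rebalancing a hot tenant is one
// directory update plus a data copy; no keys elsewhere are affected.
class InMemoryDirectory {
  private assignments = new Map<string, string>()

  // Sticky assignment: new entities go to the currently least-loaded shard
  assign(entityId: string, shards: string[]): string {
    const existing = this.assignments.get(entityId)
    if (existing) return existing
    const counts = new Map(shards.map((s): [string, number] => [s, 0]))
    for (const s of this.assignments.values()) counts.set(s, (counts.get(s) ?? 0) + 1)
    const target = shards.reduce((a, b) => (counts.get(a)! <= counts.get(b)! ? a : b))
    this.assignments.set(entityId, target)
    return target
  }

  // In production: copy the tenant's rows first, then flip this pointer
  move(entityId: string, toShard: string): void {
    this.assignments.set(entityId, toShard)
  }
}

const dir = new InMemoryDirectory()
const shardList = ['shard-1', 'shard-2']
dir.assign('tenant-a', shardList)  // → 'shard-1'
dir.assign('tenant-b', shardList)  // → 'shard-2' (least loaded)
dir.move('tenant-a', 'shard-2')    // hot tenant relocated without resharding
```

The tradeoff is an extra lookup on every request, so the directory usually sits behind a cache, and the metadata store becomes a component you must keep highly available.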

Monitoring for Hot Shards

// Check per-shard query rates and storage
async function checkShardBalance() {
  const stats = await Promise.all(
    Object.entries(shards).map(async ([name, pool]) => {
      const result = await pool.query(`
        SELECT
          pg_database_size(current_database()) AS size_bytes,
          (SELECT sum(n_live_tup) FROM pg_stat_user_tables) AS row_count
      `)
      return { shard: name, ...result.rows[0] }
    })
  )

  const avgSize = stats.reduce((s, r) => s + Number(r.size_bytes), 0) / stats.length

  for (const stat of stats) {
    const ratio = Number(stat.size_bytes) / avgSize
    if (ratio > 1.5) {
      console.warn(`Hot shard detected: ${stat.shard} is ${ratio.toFixed(1)}x average size`)
    }
  }
}

Sharding Checklist

  • ✅ Choose shard key that distributes writes evenly (prefer UUIDs over sequential IDs)
  • ✅ Co-locate related data on the same shard — avoid cross-shard joins
  • ✅ Use consistent hashing for online shard addition without full rebalancing
  • ✅ Monitor per-shard query rates and storage size — alert on imbalance
  • ✅ Consider directory-based sharding for fine-grained control
  • ✅ Test shard distribution with production-like data before going live
  • ✅ Plan for rebalancing from day one — you will need it eventually
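The "test shard distribution before going live" item can be a short script. This sketch (the shard function and the synthetic key sample are stand-ins; replay real production keys instead) replays keys through the chosen shard function and flags any shard holding noticeably more than its fair share:

```typescript
import * as crypto from 'crypto'

// Pre-launch sanity check: replay a sample of keys through the shard function
// and warn if any shard holds well over its fair share.
function shardOf(key: string, n: number): number {
  return crypto.createHash('md5').update(key).digest().readUInt32BE(0) % n
}

function distributionReport(keys: string[], n: number): number[] {
  const counts: number[] = new Array(n).fill(0)
  for (const k of keys) counts[shardOf(k, n)]++
  return counts
}

// Stand-in for a production key sample (e.g. last month's real order IDs)
const sample = Array.from({ length: 8_000 }, (_, i) => `order-${i}`)
const counts = distributionReport(sample, 8)
const fairShare = sample.length / 8
for (const [shard, count] of counts.entries()) {
  if (count > fairShare * 1.25) {
    console.warn(`Shard ${shard} holds ${count} keys (fair share: ${fairShare})`)
  }
}
```

Run this against real key samples, not synthetic ones: synthetic keys are uniform by construction, while production keys carry exactly the clustering you are trying to detect.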

Conclusion

Sharding is the right answer when you've exhausted vertical scaling and read replicas. But a bad sharding key turns a scaling solution into a hot-spot generator. The safest shard key is a uniformly distributed UUID. The most important constraint is co-locating data your queries join — sharding users by user_id means orders should also shard by user_id so they end up on the same node. Use consistent hashing to allow adding shards online, monitor for hot shards from day one, and have a rebalancing plan before you need one.