Rewrite vs Refactor — The Decision That Defines the Next Two Years of Your Team

Introduction

"Let's rewrite it from scratch" is the most seductive phrase in software engineering. The existing system has accumulated years of complexity, workarounds, and decisions that made sense at the time. A rewrite offers the fantasy of a clean slate. But rewrites are among the most reliably underestimated projects in software — they routinely take 3x longer than estimated, often fail to ship, and frequently miss behavior that the existing system handles correctly (because nobody documented it). The refactor path is slower but safer. Choosing correctly requires an honest assessment of both the codebase and the team.

The Rewrite Trap
Fix 1: The Decision Framework
Fix 2: The Strangler Fig — Rewrite Without Feature Freeze
Fix 3: Incremental Refactor Roadmap
Fix 4: The Honest Rewrite Estimate
Fix 5: Signs the Refactor Is Working
Rewrite vs Refactor Checklist
Conclusion

The Rewrite Trap

Why rewrites fail (Joel Spolsky's "Things You Should Never Do" pattern):

1. The existing system contains years of bug fixes that aren't documented
   → Rewrite doesn't know about them
   → They're rediscovered in production, one at a time

2. Rewrite estimate doesn't account for feature parity
   → "We'll match existing functionality in 3 months"
   → Existing system has 5 years of edge case handling
   → Actually takes 18 months to match, by which time the old system has moved on

3. During rewrite, team stops shipping features
   → "We'll continue maintaining old system while building new"
   → Reality: all engineers are on the rewrite, old system gets no love
   → Business falls behind competitors for the duration

4. New system inherits old problems
   → The complexity wasn't in the language — it was in the domain
   → Domain complexity travels to the new system
   → Plus new complexity from the migration itself

Classic rewrite failures:
→ Netscape 6 (the from-scratch rewrite took about three years to ship, during which Internet Explorer took the market)
→ Borland's Quattro Pro rewrite (another of Spolsky's examples: it shipped with far fewer features than users expected)
→ Countless internal systems rewritten "in 3 months" that shipped in 18

Fix 1: The Decision Framework

// When to rewrite vs when to refactor
interface RewriteVsRefactorSignals {
  rewriteIndicators: string[]
  refactorIndicators: string[]
  contextFactors: string[]
}

const decisionFramework: RewriteVsRefactorSignals = {
  rewriteIndicators: [
    'Technology is fundamentally incompatible with requirements (language, runtime)',
    'The existing system cannot safely accept new features without catastrophic failures',
    'Cost to maintain > cost to rewrite + migration (with honest estimates)',
    'Team has no ability to understand or modify the existing code',
    'The system has no test coverage AND domain experts who wrote it are gone',
    'The system is < 12 months old and was written before requirements were understood',
  ],

  refactorIndicators: [
    'The system works (even if it\'s painful to change)',
    'Existing test coverage, even partial',
    'Team members who understand the domain, even if not the code',
    'Features are being requested for the system (active, not legacy)',
    'Incremental improvement is possible (not "change everything to change anything")',
    'Business cannot afford feature freeze for the rewrite duration',
  ],

  contextFactors: [
    'Can you do a partial rewrite? (rewrite one module, not the whole system)',
    'Is the pain in the domain complexity or the implementation quality?',
    'What is the minimum viable rewrite vs full rewrite?',
    'Can the strangler fig pattern work? (new system alongside old, migrating gradually)',
  ],
}
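
One way to put the framework above to use is to count how many indicators the team agrees actually apply, then see which side dominates. The sketch below is only an illustration: the `recommend` function, the `Assessment` shape, and the tie-breaking rule are assumptions, not part of the framework. It deliberately leans toward the incremental options, in line with the argument of this article.

```typescript
type Recommendation = 'refactor' | 'strangler-fig' | 'consider-rewrite'

// Hypothetical summary of a team discussion about the indicators above.
interface Assessment {
  rewriteSignals: number        // how many rewrite indicators apply
  refactorSignals: number       // how many refactor indicators apply
  stranglerFigPossible: boolean // can new code run alongside the old system?
}

function recommend(a: Assessment): Recommendation {
  // Refactor signals at least as strong: keep improving in place.
  if (a.refactorSignals >= a.rewriteSignals) return 'refactor'
  // Rewrite signals dominate, but gradual replacement is possible.
  if (a.stranglerFigPossible) return 'strangler-fig'
  // Only when nothing incremental works is a full rewrite on the table.
  return 'consider-rewrite'
}

console.log(
  recommend({ rewriteSignals: 1, refactorSignals: 4, stranglerFigPossible: true })
) // refactor
```

A simple count is crude. Treat the result as a prompt for discussion rather than a verdict, since a single indicator (for example, an incompatible runtime) can outweigh several others.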

Fix 2: The Strangler Fig — Rewrite Without Feature Freeze

// Strangler fig: gradually replace the old system by building alongside it
// Migrate endpoints/modules one at a time, never the whole thing at once

// Old: monolith handles everything
// New: extracted services take over one domain at a time

// Phase 1: Add routing layer in front of both systems
// Traffic starts 100% to old system

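import { Router } from 'express'
const router = Router()

// `featureFlags` and `proxy` are assumed helpers defined elsewhere:
// a flag client with an async get(), and a function that forwards the request.

// Feature flag controls routing:
router.use('/api/payments', async (req, res, next) => {
  // Default to the old system if the flag lookup fails. Express 4 does not
  // catch rejected promises in async handlers, so handle the error here.
  let useNewPayments = false
  try {
    useNewPayments = await featureFlags.get('new-payment-service')
  } catch {
    // Flag service unavailable: keep traffic on the monolith
  }

  if (useNewPayments) {
    return proxy(req, res, { target: 'http://new-payment-service' })
  }

  next()  // Falls through to old monolith
})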

// Phase 2: Extract and migrate one module at a time
// Month 1: Payments module → new service (most isolated, well-tested)
// Month 2: User profiles → new service
// Month 3: Product catalog → new service
// ...

// Phase 3: Retire old system only after all traffic is migrated
// Gradual migration preserves velocity throughout
// Each phase is independently shippable and rollback-able
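
A boolean flag moves all traffic at once. A gradual migration usually wants to move a stable slice of users first and widen it over time. The sketch below shows one possible way to do that with deterministic bucketing. The `chooseTarget` helper, the `Target` labels, and the choice of hash are assumptions for illustration, not something the routing layer above requires.

```typescript
// FNV-1a-style 32-bit hash over UTF-16 code units: small, deterministic,
// and dependency-free. Any stable hash would do here.
function fnv1a(input: string): number {
  let hash = 0x811c9dc5
  for (let i = 0; i < input.length; i++) {
    hash ^= input.charCodeAt(i)
    hash = Math.imul(hash, 0x01000193)
  }
  return hash >>> 0 // force unsigned 32-bit
}

type Target = 'old-monolith' | 'new-service'

// Hypothetical helper: send a stable percentage of users to the new service.
function chooseTarget(userId: string, rolloutPercent: number): Target {
  const bucket = fnv1a(userId) % 100 // 0..99, always the same for a given user
  return bucket < rolloutPercent ? 'new-service' : 'old-monolith'
}

console.log(chooseTarget('user-42', 0))   // old-monolith
console.log(chooseTarget('user-42', 100)) // new-service
```

Because the bucket depends only on the user ID, raising the percentage never moves a user back to the old system, and setting it to 0 acts as an immediate rollback.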

Fix 3: Incremental Refactor Roadmap

// When refactoring: sequence matters more than speed
// Wrong order: change everything simultaneously
// Right order: establish safety, then change

interface RefactorPhase {
  name: string
  goal: string
  deliverables: string[]
  timeEstimate: string
  unblocks: string[]
}

const incrementalRefactorRoadmap: RefactorPhase[] = [
  {
    name: 'Phase 1: Characterization',
    goal: 'Understand what exists and make it safely changeable',
    deliverables: [
      'Test suite for most critical paths (not 100% coverage — critical paths)',
      'System documentation: what does each module do?',
      'Dependency map: what calls what?',
    ],
    timeEstimate: '2-4 weeks',
    unblocks: ['Changes to the covered paths are now much safer to make'],
  },
  {
    name: 'Phase 2: Structural cleanup',
    goal: 'Make the code easier to understand without changing behavior',
    deliverables: [
      'Rename unclear variables/functions',
      'Extract large functions into smaller, named ones',
      'Move misplaced code to appropriate modules',
    ],
    timeEstimate: '2-4 weeks',
    unblocks: ['Phase 3 — behavioral changes are now reviewable'],
  },
  {
    name: 'Phase 3: Architectural improvement',
    goal: 'Address the structural issues that cause ongoing pain',
    deliverables: [
      'Introduce domain boundaries (if needed)',
      'Replace problematic patterns (N+1 queries, shared mutable state)',
      'Modernize specific components that are bottlenecks',
    ],
    timeEstimate: '4-8 weeks (per major area)',
    unblocks: ['New features are easier to add correctly'],
  },
]
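
Phase 1 depends on characterization tests: tests that pin down what the code does today, whether or not that behavior is "correct". The sketch below illustrates the idea under stated assumptions. `legacyShippingCost` is a made-up stand-in for real legacy code, and `recordBaseline` and `findRegressions` are hypothetical helpers, not a testing library API.

```typescript
// Hypothetical legacy function whose exact rules nobody fully remembers.
function legacyShippingCost(weightKg: number, express: boolean): number {
  let cost = weightKg <= 1 ? 5 : 5 + (weightKg - 1) * 2
  if (express) cost = cost * 1.5
  if (weightKg > 20) cost += 10 // undocumented surcharge
  return cost
}

type ShippingFn = (weightKg: number, express: boolean) => number
type Case = [number, boolean]

// Record what the code does today for a set of representative inputs.
function recordBaseline(fn: ShippingFn, cases: Case[]): number[] {
  return cases.map(([w, e]) => fn(w, e))
}

// After a change, list the inputs whose output no longer matches the baseline.
function findRegressions(fn: ShippingFn, cases: Case[], baseline: number[]): Case[] {
  return cases.filter(([w, e], i) => fn(w, e) !== baseline[i])
}

const cases: Case[] = [[1, false], [3, true], [25, false]]
const baseline = recordBaseline(legacyShippingCost, cases)

// A "cleaned up" version that silently drops the undocumented surcharge.
const refactored: ShippingFn = (w, e) => {
  const base = w <= 1 ? 5 : 5 + (w - 1) * 2
  return e ? base * 1.5 : base
}

console.log(findRegressions(refactored, cases, baseline)) // [ [ 25, false ] ]
```

The baseline catches the dropped surcharge even though nobody documented it. That is the kind of behavior a rewrite tends to rediscover in production.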

Fix 4: The Honest Rewrite Estimate

If you decide to rewrite, estimate honestly.

Common factors that make rewrites longer:
- Feature parity: every feature in the old system, including ones nobody documented
- Migration: data migration, user communication, cutover strategy
- Parallel operation: old and new systems running simultaneously during migration
- Unexpected domain complexity: the pain wasn't the code — it was the problem space
- External integrations: every third-party integration needs to be retested
- Regression: bugs in the old system that users have worked around or come to depend on, which the new system must either reproduce or deliberately change
- Rollback: if the new system has issues, can you go back?

Estimation correction factor: 2.5x-3x the engineering team's initial estimate

Example:
Team says: "6 months to rewrite the payment system"
Honest estimate: 15-18 months
During that time: no new payment features, old system maintained

Alternative: strangler fig refactor
Month 1-2: payment service extracted alongside old code (features continue shipping)
Month 3: all new payment logic in new service (old code maintained for fallback)
Month 4: old payment code removed
Outcome: improved architecture AND continued feature velocity
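
The arithmetic behind the example above is simple enough to write down. This is a minimal sketch of the 2.5x-3x correction factor. The function name and the `EstimateRange` shape are illustrative assumptions.

```typescript
interface EstimateRange {
  lowMonths: number
  highMonths: number
}

// Apply the 2.5x-3x correction factor to the team's initial estimate.
function honestRewriteEstimate(teamEstimateMonths: number): EstimateRange {
  return {
    lowMonths: teamEstimateMonths * 2.5,
    highMonths: teamEstimateMonths * 3,
  }
}

console.log(honestRewriteEstimate(6)) // { lowMonths: 15, highMonths: 18 }
```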

Fix 5: Signs the Refactor Is Working

// Track these metrics during a multi-month refactor
// to confirm it's delivering value and stay accountable

interface RefactorHealthMetrics {
  // Feature velocity: are we shipping faster over time?
  velocityTrend: 'improving' | 'stable' | 'degrading'

  // Incident rate: are we having fewer incidents?
  incidentTrend: 'improving' | 'stable' | 'degrading'

  // Test coverage: are we safer to change?
  testCoveragePercent: number

  // Code complexity: is it getting cleaner?
  averageCyclomaticComplexity: number

  // Engineer sentiment: does the team feel better about the codebase?
  teamSatisfactionScore: number  // 1-5 scale, quarterly survey
}

// If velocity is not improving after 3 months of refactoring:
// → Refactor scope may be too large (not delivering value)
// → Refactor approach may be wrong (reorganizing without addressing root cause)
// → Business requirements changed (original pain point is no longer the bottleneck)
// Recalibrate — don't continue a refactor that isn't working
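
The three-month rule above can be checked mechanically if the team records a velocity trend each month. The sketch below is one possible interpretation, not a standard metric. `MonthlySnapshot`, `needsRecalibration`, and the `graceMonths` default are hypothetical names chosen for this example.

```typescript
type Trend = 'improving' | 'stable' | 'degrading'

// Hypothetical monthly record kept during the refactor.
interface MonthlySnapshot {
  month: number
  velocityTrend: Trend
}

// True when none of the most recent `graceMonths` snapshots show improvement.
function needsRecalibration(history: MonthlySnapshot[], graceMonths = 3): boolean {
  if (history.length < graceMonths) return false // too early to judge
  const recent = history.slice(-graceMonths)
  return recent.every(s => s.velocityTrend !== 'improving')
}

const history: MonthlySnapshot[] = [
  { month: 1, velocityTrend: 'degrading' }, // early dip while tests are added
  { month: 2, velocityTrend: 'stable' },
  { month: 3, velocityTrend: 'stable' },
]

console.log(needsRecalibration(history)) // true
```

A `true` result is a signal to revisit scope and approach. It does not mean the refactor has failed.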

Rewrite vs Refactor Checklist

✅ Honest assessment: is the pain in the code or in the domain complexity?
✅ If rewriting: estimate includes feature parity, migration, and parallel operation
✅ If refactoring: characterization tests before any changes
✅ Strangler fig considered for large-scale changes (no feature freeze required)
✅ Refactor sequenced: safety first, then structure, then architecture
✅ Progress metrics tracked: is velocity actually improving?
✅ Regular check-in: is the refactor delivering value, or should approach change?

Conclusion

The rewrite vs refactor decision comes down to one honest question: is the complexity in the implementation, or is it in the domain? If it's in the domain, a rewrite will not remove it; the complexity travels to the new system. If it's in the implementation, refactoring (carefully, sequentially, starting with test coverage) will fix it with less disruption. If it's genuinely in the implementation AND the implementation is so broken that incremental improvement isn't possible, the strangler fig offers a rewrite path that preserves velocity. A pure big-bang rewrite is almost never the right answer. The history of software is full of teams that discovered this the hard way, eighteen months after the rewrite project started.