Kafka

learnings

Programming

Events

Kafka

Kafka is a distributed event streaming platform built for high-throughput, fault-tolerant messaging. My experience working with Kafka in event-driven systems is that it can be quite different from traditional message queues in ways that make it powerful for certain use cases.

What is Kafka?

Think of Kafka as a distributed commit log. It stores streams of records in categories called topics, and unlike traditional message queues that delete messages after consumption, Kafka retains everything for a configurable period. This durability is what makes Kafka valuable, it’s not just a messaging system, it’s also a source of truth for event streams.

The retention means you can replay events, which can be very useful for debugging production issues or when I needed to reprocess data with new logic. This characteristic alone changes how you architect systems.

Core Components

Topics and Partitions

Topics are logical channels where you publish records. Each topic splits into partitions—ordered, immutable sequences of records. Understanding partitions is crucial because they’re how Kafka achieves parallelism and scalability.

When you publish a message, Kafka assigns it to a specific partition based on the message key (if you provide one) or uses round-robin distribution. Messages within a partition are strictly ordered by their offset—a sequential ID that marks the message’s position in the partition.

  graph TD
    A[Topic: user-events] --> B[Partition 0]
    A --> C[Partition 1]
    A --> D[Partition 2]
    B --> E[Offset 0]
    B --> F[Offset 1]
    B --> G[Offset 2]
    C --> H[Offset 0]
    C --> I[Offset 1]
    D --> J[Offset 0]

Partitions enable several things: parallel consumption (multiple consumers can read from different partitions simultaneously), ordering guarantees within each partition, horizontal scaling by distributing partitions across brokers, and fault tolerance through replication.

The partition count decision is important. Too few and you can’t parallelize enough. Too many and you add overhead on the brokers. I’ve learned this through experience—you can’t easily change partition counts later without careful planning.

Brokers and Clusters

Brokers are servers that store data and serve clients. They’re organized into clusters, with one broker acting as the controller managing partition assignments and leader elections.

Each partition has one leader broker and zero or more follower brokers. The leader handles all reads and writes for that partition while followers replicate the data. When the leader fails, one of the in-sync replicas gets promoted to leader. This replication is what gives Kafka its high availability.

Producers

Producers publish records to topics. They choose which partition to send records to, typically based on the record’s key. The configuration options here affect both performance and durability guarantees.

The acknowledgment (ack) setting is critical:

acks=0: Producer doesn’t wait for acknowledgment. Fast but risky.
acks=1: Leader writes to its local log but doesn’t wait for replicas. Balanced.
acks=all (or -1): Leader waits for all in-sync replicas. Slow but safe.

I usually use acks=all in production because data loss is worse than slightly higher latency.

Consumers and Consumer Groups

Consumers read records from topics. The consumer group concept is elegant—when consumers work together in a group, Kafka automatically distributes partitions among them. This provides horizontal scalability: need more processing capacity? Add more consumers to the group.

Each partition is assigned to exactly one consumer within a group, but a consumer might handle multiple partitions. When a consumer fails, Kafka rebalances the partitions among remaining consumers.

Consumer offsets track which records have been processed. Kafka stores these in an internal topic (__consumer_offsets), letting consumers resume from where they left off after restarts or failures.

  sequenceDiagram
    participant P as Producer
    participant B as Broker (Leader)
    participant R as Replica
    participant C as Consumer

    P->>B: Produce message
    B->>B: Write to local log
    B->>R: Replicate message
    R->>B: Acknowledge replication
    B->>P: Acknowledge (acks=all)
    C->>B: Fetch request
    B->>C: Return messages
    C->>C: Process messages
    C->>B: Commit offset

How Kafka Works Under the Hood

Message Durability and Acknowledgments

Kafka’s durability comes from replication. When you produce a message with acks=all, Kafka ensures the message is written to the leader and all in-sync replicas before acknowledging. This prevents data loss even when brokers fail.

The in-sync replica (ISR) set is crucial. A replica is in-sync if it’s caught up with the leader within a configurable lag threshold. Only ISR members can become the new leader during failover, ensuring no committed data is lost.

Here’s how I typically set up a producer with franz-go:

package main

import (
    "context"
    "fmt"
    "log"

    "github.com/twmb/franz-go/pkg/kgo"
)

func main() {
    client, err := kgo.NewClient(
        kgo.SeedBrokers("localhost:9092"),
        kgo.RequiredAcks(kgo.AllISRAcks()), // acks=all
    )
    if err != nil {
        log.Fatal(err)
    }
    defer client.Close()

    ctx := context.Background()
    
    record := &kgo.Record{
        Topic: "user-events",
        Key:   []byte("user-123"),
        Value: []byte(`{"action": "login", "timestamp": "2024-01-15T10:30:00Z"}`),
    }

    // Produce synchronously
    if err := client.ProduceSync(ctx, record).FirstErr(); err != nil {
        log.Fatal(err)
    }

    fmt.Println("Message produced successfully")
}

The producer batches messages for efficiency, but the acks setting ensures durability. With AllISRAcks(), the producer waits for confirmation that all in-sync replicas have written the message.

Consumer Offset Management

Consumers must track which messages they’ve processed to avoid reprocessing or skipping messages. Kafka offers automatic and manual offset commit strategies.

Automatic commits are convenient—the consumer periodically commits offsets in the background. But this can cause problems. If the consumer crashes before committing, you get duplicates. If offsets are committed before processing completes, you lose messages.

I prefer manual commits because they give precise control. You commit offsets only after successfully processing messages, ensuring at-least-once delivery:

package main

import (
    "context"
    "fmt"
    "log"

    "github.com/twmb/franz-go/pkg/kgo"
)

func main() {
    client, err := kgo.NewClient(
        kgo.SeedBrokers("localhost:9092"),
        kgo.ConsumeTopics("user-events"),
        kgo.ConsumerGroup("analytics-service"),
        kgo.DisableAutoCommit(), // Manual offset management
    )
    if err != nil {
        log.Fatal(err)
    }
    defer client.Close()

    ctx := context.Background()

    for {
        fetches := client.PollFetches(ctx)
        if errs := fetches.Errors(); len(errs) > 0 {
            log.Printf("Fetch errors: %v", errs)
            continue
        }

        fetches.EachPartition(func(p kgo.FetchTopicPartition) {
            for _, record := range p.Records {
                fmt.Printf("Received: key=%s, value=%s, offset=%d\n",
                    string(record.Key), string(record.Value), record.Offset)
                
                // Process the message
                if err := processMessage(record); err != nil {
                    log.Printf("Error processing message: %v", err)
                    continue
                }
            }
        })

        // Commit offsets after successful processing
        if err := client.CommitUncommittedOffsets(ctx); err != nil {
            log.Printf("Failed to commit offsets: %v", err)
        }
    }
}

func processMessage(record *kgo.Record) error {
    // ... some business logic here
    return nil
}

This pattern ensures at-least-once delivery. If processing fails, the offset isn’t committed, and the consumer will reprocess the message after restarting.

Rebalancing

When consumers join or leave a consumer group, Kafka triggers a rebalance to redistribute partitions. During rebalancing, consumption pauses briefly while Kafka reassigns partitions.

The newer incremental cooperative rebalancing reduces disruption by avoiding the stop-the-world pause, moving partitions incrementally between consumers instead. This has made consumer group management much smoother in recent Kafka versions.

When to Use Kafka

I reach for Kafka in specific scenarios:

Event-Driven Architecture: When services need to communicate asynchronously without tight coupling. Services publish events to topics, other services subscribe and react. This decoupling makes systems more maintainable and resilient.

Real-Time Data Pipelines: Moving data between systems with low latency. Kafka becomes the central nervous system, ingesting from various sources and feeding multiple destinations.

Stream Processing: Processing data as it flows through the system. Kafka Streams or ksqlDB let you transform, aggregate, and analyze event streams in real-time.

Log Aggregation: Collecting logs from multiple services centrally. Kafka’s durability and ordering guarantees make it reliable for log transport.

Change Data Capture (CDC): Capturing database changes as events. Tools like Debezium publish database change logs to Kafka, letting other systems react to data changes.

The ability to replay events is often what tips the scale toward Kafka. This has been invaluable for debugging production issues and reprocessing data with new logic.

Kafka in Event-Driven Architecture

In event-driven systems, Kafka acts as the event backbone. Services publish domain events when significant state changes occur—a user registers, an order is placed, a payment processes. Other services subscribe to these events and update their own state.

This architecture provides real benefits. Services stay loosely coupled, knowing nothing about downstream consumers. You can add new consumers without modifying producers. The system becomes more resilient—if a consumer is down, events remain in Kafka for processing when it recovers. You get a complete audit trail of all events in your system.

  graph LR
    A[Order Service] -->|order-placed| K[Kafka]
    B[Payment Service] -->|payment-completed| K
    K -->|order-placed| C[Inventory Service]
    K -->|order-placed| D[Notification Service]
    K -->|payment-completed| C
    K -->|payment-completed| E[Analytics Service]

The challenge is managing message schemas and ensuring backward compatibility as events evolve. Schema registries like Confluent Schema Registry help enforce schema contracts between producers and consumers.

Characteristics and Limitations

Strengths

High Throughput: Kafka handles millions of messages per second by leveraging sequential disk I/O, batching, and compression. The architecture is optimized for write-heavy workloads.

Durability: With replication and configurable acknowledgments, messages aren’t lost. The append-only log structure and filesystem caching make writes fast and reliable.

Scalability: You scale horizontally by adding more brokers and partitions. Consumer groups provide parallel consumption without complex coordination logic.

Ordering Guarantees: Within a partition, messages are strictly ordered. Using message keys ensures related events are processed in order.

Fault Tolerance: Replication and automatic leader election mean Kafka continues operating even when brokers fail. Consumers resume from their last committed offset after crashes.

Limitations

Complexity: Running and maintaining a Kafka cluster requires expertise. You need to manage brokers, monitor replication lag, handle rebalancing, and tune various configuration parameters. It’s not a “set it and forget it” system.

Operational Overhead: Kafka requires monitoring, capacity planning, and occasional manual intervention during failures. The operational burden is real.

Not a Database: While Kafka stores data, it’s not designed for random access or complex queries. You can only read messages sequentially from offsets.

Partition Constraints: Each partition can only be consumed by one consumer in a group. If you need more parallelism than you have partitions, you must create more partitions—but this can’t be done transparently and requires careful planning.

Latency Considerations: Kafka’s throughput is excellent, but end-to-end latency can be higher than simple RPC or message queues, especially with strong durability guarantees. The batching that enables high throughput adds delay.

Exactly-Once is Tricky: Kafka supports exactly-once semantics within Kafka Streams, but achieving end-to-end exactly-once delivery across arbitrary systems requires careful design, often involving idempotent consumers.

Performance Considerations

Kafka’s performance depends heavily on configuration. The number of partitions affects parallelism—more partitions enable more concurrent consumers but increase overhead on the broker. The replication factor impacts durability and availability but requires more disk space and network bandwidth.

Producer batching significantly improves throughput. The linger.ms setting controls how long the producer waits to batch messages. Higher values increase throughput but add latency. Compression (snappy, lz4, or zstd) reduces network and disk usage at the cost of CPU.

Consumer fetch sizes determine how much data consumers retrieve per request. Larger fetches improve throughput but increase memory usage. The max.poll.records setting prevents consumers from being overwhelmed with too many messages at once.

Example implementation

Here’s a complete example showing a producer-consumer pattern with proper error handling and offset management:

package main

import (
    "context"
    "fmt"
    "log"
    "os"
    "os/signal"
    "sync"
    "syscall"
    "time"

    "github.com/twmb/franz-go/pkg/kgo"
)

func main() {
    ctx, cancel := context.WithCancel(context.Background())
    defer cancel()

    // Handle graceful shutdown
    sigChan := make(chan os.Signal, 1)
    signal.Notify(sigChan, syscall.SIGINT, syscall.SIGTERM)

    var wg sync.WaitGroup

    // Start producer
    wg.Add(1)
    go func() {
        defer wg.Done()
        runProducer(ctx)
    }()

    // Start consumer
    wg.Add(1)
    go func() {
        defer wg.Done()
        runConsumer(ctx)
    }()

    <-sigChan
    log.Println("Shutting down...")
    cancel()
    wg.Wait()
}

func runProducer(ctx context.Context) {
    client, err := kgo.NewClient(
        kgo.SeedBrokers("localhost:9092"),
        kgo.RequiredAcks(kgo.AllISRAcks()),
    )
    if err != nil {
        log.Fatal(err)
    }
    defer client.Close()

    ticker := time.NewTicker(2 * time.Second)
    defer ticker.Stop()

    for {
        select {
        case <-ctx.Done():
            return
        case <-ticker.C:
            record := &kgo.Record{
                Topic: "events",
                Key:   []byte(fmt.Sprintf("key-%d", time.Now().Unix())),
                Value: []byte(fmt.Sprintf("Event at %s", time.Now().Format(time.RFC3339))),
            }

            if err := client.ProduceSync(ctx, record).FirstErr(); err != nil {
                log.Printf("Failed to produce: %v", err)
            } else {
                log.Printf("Produced: %s", record.Value)
            }
        }
    }
}

func runConsumer(ctx context.Context) {
    client, err := kgo.NewClient(
        kgo.SeedBrokers("localhost:9092"),
        kgo.ConsumeTopics("events"),
        kgo.ConsumerGroup("example-group"),
        kgo.DisableAutoCommit(),
    )
    if err != nil {
        log.Fatal(err)
    }
    defer client.Close()

    for {
        select {
        case <-ctx.Done():
            return
        default:
            fetches := client.PollFetches(ctx)
            if errs := fetches.Errors(); len(errs) > 0 {
                for _, err := range errs {
                    log.Printf("Fetch error: %v", err)
                }
                continue
            }

            fetches.EachRecord(func(record *kgo.Record) {
                log.Printf("Consumed: key=%s, value=%s, partition=%d, offset=%d",
                    string(record.Key), string(record.Value), record.Partition, record.Offset)
            })

            if err := client.CommitUncommittedOffsets(ctx); err != nil {
                log.Printf("Failed to commit: %v", err)
            }
        }
    }
}

Once you understand the core concepts—topics, partitions, offsets, and replication—Kafka becomes a powerful tool for building resilient and scalable event-driven systems. The complexity is real, but the capabilities it provides are worth it for the right use cases.