Debezium: CDC for the event-driven architecture

Intro

In one of the web’s rabbit holes I came across the “Debezium postgres connector”. Why is this good? Even better question: what is it? The short answer: it turns your database’s row-level changes into Kafka messages. And this is powerful. It can satisfy practically any event-based software need, as it basically streams your database in real time as it changes. Not to mention that the stream can be consumed from any point in time, thanks to the persisted snapshot created from all the captured database schemas.

dee-BEE-zee-uhm and Change Data Capture (CDC)

Change Data Capture (CDC) is a term for a monitoring system that captures data changes so other systems can react to them. Debezium is an open source project that implements such a system for a variety of databases.

Postgres implements logical decoding, which allows capturing the changes from the database: it makes it possible to read the changes persisted in the transaction logs. Debezium hooks into this to know what changes are happening in the database. In Postgres the contents of the WAL (Write-Ahead Log) are read and decoded, which is why data integrity is guaranteed in the Kafka messages as well. To enable this feature the Postgres instance should run with wal_level set to logical.
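
A quick way to verify the setting is to ask Postgres itself. A minimal sketch, assuming the pg npm package and default local credentials (not part of the demo):

/* check-wal-level.ts -- hypothetical sanity check */
import { Client } from 'pg';

const client = new Client({
    host: 'localhost',
    port: 5432,
    user: 'postgres',
    password: 'postgres', /* assumed credentials, adjust to your setup */
});

await client.connect();
const { rows } = await client.query('SHOW wal_level');
console.log(rows[0].wal_level); /* must print "logical" for Debezium to work */
await client.end();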

Use Case

I’m a big fan of demos and examples, so this blogpost will demonstrate a real-time data processing pipeline: a picture is uploaded, stored in a database, and then processed further by an AWS Lambda function. The link between the upload and the Lambda is, as you might assume already: Debezium and the generated Kafka messages.

System skeleton

In order to make this work you need to set up a few things. Typically these consist of the Debezium connector for the database you’re using. In my case it’s a Postgres database, but Debezium supports a variety of databases, like MySQL, Oracle, SQL Server, MongoDB, Db2, etc.

Then you need to set up a Kafka cluster, which is a message broker that allows you to send and receive messages between different systems.

Once these are done, Kafka-Connect will start to stream the changes from the database to the Kafka topic to which you can subscribe your services. My service consumes the messages and passes their content to an AWS Lambda function, which processes them further.

Debezium System Architecture

Overall setup

In the demo there is a simple user interface where you can upload a picture for a customer. The picture is passed to a server and then stored in a Postgres database as base64.
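
The storing step could look something like the sketch below. This is a hypothetical version; the demo’s actual handler code and table schema may differ:

/* store-photo.ts -- hypothetical sketch of the storing step */
import { Client } from 'pg';

export async function storePhoto(db: Client, image: Buffer): Promise<void> {
    /* customer_photos is the table the connector will watch (see table.include.list below) */
    await db.query(
        'INSERT INTO customer_photos (photo) VALUES ($1)',
        [image.toString('base64')], /* the picture is stored as base64 */
    );
}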

Once the picture is uploaded, the database is updated and the Debezium connector captures the change and streams it to a Kafka topic.

The change is known because of the connector registered in the Kafka-Connect service. The registration of a connector is done by sending a JSON config file to the Kafka-Connect API. Note the table.include.list property in the JSON file, which defines the data tables the connector should capture changes from.

/* register-postgres.json */
{
    "name": "photo-connector",
    "config": {
        "connector.class": "io.debezium.connector.postgresql.PostgresConnector",
        "database.hostname": "postgres",
        "database.port": "5432",
        /* other db related config */
        "table.include.list": "public.customer_photos",
        /* ... */
    }
}

The JSON is then sent to the Kafka-Connect POST /connectors endpoint to register the connector.

curl -i -X POST -H "Accept:application/json" -H "Content-Type:application/json" \
    http://localhost:8083/connectors/ -d @register-postgres.json
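
To double-check the registration, Kafka-Connect also exposes a GET /connectors endpoint that lists the registered connectors. A tiny sketch, assuming Node 18+ for the global fetch:

/* list-connectors.ts -- sanity check */
const res = await fetch('http://localhost:8083/connectors');
console.log(await res.json()); /* should include "photo-connector" */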

After the connector is registered you can see the generated messages in Kafka. I also spun up a Kafka UI container to see the messages right after an image is uploaded.

Kafka UI

The structure of the message contains the following information:

/* message example */
{
    "before": null,
    "after": {
        "id": 5,
        "photo": "base64 encoded image"
    },
    "op": "c" /* c for create */
}

From this point I only required a service to consume the messages from the Kafka topic and pass them to the AWS Lambda function. This is the NodeJS service named NodeJS Consumer on the figure below. The consumer utilizes the npm packages kafkajs and @aws-sdk/client-lambda to consume the messages from Kafka and pass them to the Lambda function.

Debezium System Architecture

/* photo-consumer.ts */
import { Kafka } from 'kafkajs';

const kafka = new Kafka({
    brokers: ['localhost:9092'],
    clientId: 'my-app',
});
const consumer = kafka.consumer({ groupId: 'photo-consumer' }); /* any consumer group id works */
/* ... */

await consumer.connect();
await consumer.subscribe({ 
    topic: 'postgres.public.customer_photos', /* the concatenated <prefix>.<schema>.<table> name defined in the kafka connector */
    fromBeginning: true 
});

The consumer is then ready to consume the messages from Kafka and pass them to the Lambda function.
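
A minimal sketch of that last step (the Lambda function name and region are placeholders, not the demo’s actual values):

/* photo-consumer.ts (continued) -- hypothetical processing loop */
import { LambdaClient, InvokeCommand } from '@aws-sdk/client-lambda';

const lambda = new LambdaClient({ region: 'eu-west-1' }); /* assumed region */

await consumer.run({
    eachMessage: async ({ message }) => {
        if (!message.value) return;
        const event = JSON.parse(message.value.toString());
        if (event.op !== 'c') return; /* only react to newly created rows */
        /* hand the base64 encoded photo over to the Lambda function */
        await lambda.send(new InvokeCommand({
            FunctionName: 'photo-processor', /* hypothetical function name */
            Payload: Buffer.from(JSON.stringify(event.after)),
        }));
    },
});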

Conclusion

This was a simplified example of how to use Debezium to capture changes from a database and stream them to a Kafka topic. As a result, a real-time data processing pipeline could be achieved. The biggest power I see is in taking the responsibility of generating Kafka messages away from the various microservices and handing it down to the database level. This is mighty. All your services can know instantly about data changes and react to them accordingly.
