Setting Up pg_partman and pg_cron in Docker
With the popularity of PostgreSQL, it's often assumed that most problems are solved, documented, and shared on platforms like Stack Overflow or GitHub. Yet, when diving into the specifics of setting up pg_partman and pg_cron within a containerized PostgreSQL environment, it quickly becomes evident that up-to-date, comprehensive guides are surprisingly scarce. This article aims to fill that gap, detailing the challenges faced and the solutions found along the way. I spent a day getting all of this working, so that, hopefully, you won't have to. All code is available on GitHub.
Understanding Table Partitioning
Table partitioning is a database technique that enhances performance by dividing large tables into smaller, more manageable segments while still treating them as a single entity. This division can be based on criteria like date ranges or identifiers. By doing so, queries can run faster as they access fewer data blocks, backups become more efficient, and older data can be archived or purged seamlessly. Partitioning allows databases to maintain optimal performance as data volumes grow. Partitioning is particularly powerful in modern "NewSQL" tools like AWS's Aurora DB or Google's AlloyDB.
The Magic of pg_partman and pg_cron
PostgreSQL supports native partitioning, but it is not easy to set up, maintain, or automate. This is where pg_partman and pg_cron come in. pg_partman is a PostgreSQL extension that provides automated management of partitioned tables, including time-based and serial-based partition sets. pg_partman works on a declarative model: you define how to partition the data, and pg_partman converts that definition into the appropriate PostgreSQL statements. The one piece pg_partman does not provide on its own is continuous enforcement of this desired state.
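For contrast, here is a minimal sketch of what native partitioning looks like on its own (the measurements table is hypothetical): every child partition has to be created, named, and bounded by hand, forever.
-- Hypothetical example: native range partitioning without pg_partman.
-- Every child partition must be created, named, and bounded by hand.
CREATE TABLE measurements (
    id bigserial,
    logged_at timestamp NOT NULL
) PARTITION BY RANGE (logged_at);

CREATE TABLE measurements_2023_08 PARTITION OF measurements
    FOR VALUES FROM ('2023-08-01') TO ('2023-09-01');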
On the other hand, pg_cron is a job scheduler for PostgreSQL, allowing database administrators to schedule tasks like periodic data rollups, data retention policies, or even routine maintenance tasks directly from the database.
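For example, scheduling a nightly VACUUM is a one-liner (the job name and schedule here are only illustrative):
-- Illustrative job: run VACUUM every night at 03:00
SELECT cron.schedule('nightly-vacuum', '0 3 * * *', 'VACUUM');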
When combined, pg_partman and pg_cron form a powerful duo: pg_partman manages the partitions, and pg_cron provides the continuous enforcement.
Why Docker? The Case for Local Development
It's often recommended to leverage managed services like RDS for production environments. They come with built-in scalability, automated backups, and maintenance, freeing teams from the operational overhead of managing databases. Heck, they even manage extensions like pg_partman and pg_cron for you! However, you cannot run RDS locally, or for free. Using Docker for local development, developers can experiment with these tools without fearing data loss. Need to create or delete partitions? Go ahead. Want to test out a new partitioning strategy? Docker makes it easy to reset to a default state. This sandboxed environment allows developers to get hands-on experience with tools like pg_partman and pg_cron, ensuring they're well prepared before deploying them in a production environment.
The Hurdles
Alpine
The standard PostgreSQL image is built on Alpine. Alpine does not have any packages for pg_partman or pg_cron, so they have to be built from source. Building the C extensions requires quite a few tools to be installed. From what I could see, this is the list:
RUN apk add --no-cache \
    autoconf \
    automake \
    g++ \
    clang15 \
    llvm15 \
    libtool \
    libxml2-dev \
    make
Build from source
Since the releases are only gzipped snapshots of the source code, we need to build from source. The steps are:
- Download
- Extract
- make and make install
- cleanup
wget -O pg_partman.tar.gz "https://github.com/pgpartman/pg_partman/archive/$PG_PARTMAN_VERSION.tar.gz" \
    && mkdir -p /usr/src/pg_partman \
    && tar \
        --extract \
        --file pg_partman.tar.gz \
        --directory /usr/src/pg_partman \
        --strip-components 1 \
    && rm pg_partman.tar.gz \
    && cd /usr/src/pg_partman \
    && make \
    && make install \
    && rm -rf /usr/src/pg_partman
Repeat this for pg_cron (a sketch follows below). Note: at this point, you can already start using pg_partman; pg_cron needs some extra configuration first.
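For pg_cron the dance is the same; a sketch, assuming a PG_CRON_VERSION build argument and the citusdata/pg_cron GitHub archive:
# Same steps as pg_partman, with the pg_cron source tarball
wget -O pg_cron.tar.gz "https://github.com/citusdata/pg_cron/archive/$PG_CRON_VERSION.tar.gz" \
    && mkdir -p /usr/src/pg_cron \
    && tar \
        --extract \
        --file pg_cron.tar.gz \
        --directory /usr/src/pg_cron \
        --strip-components 1 \
    && rm pg_cron.tar.gz \
    && cd /usr/src/pg_cron \
    && make \
    && make install \
    && rm -rf /usr/src/pg_cron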
pg_cron config
If you run the container now and try this:
CREATE EXTENSION pg_cron;
You will see this error:
ERROR: unrecognized configuration parameter "cron.database_name"
pg_cron has a unique requirement: it can only be created in a specific database, the one named by the cron.database_name configuration parameter. This necessitates additional environment variables and configuration tweaks to ensure compatibility. At first, it is tempting to do this:
RUN echo "shared_preload_libraries = 'pg_cron'" >> /var/lib/postgresql/data/postgresql.conf
RUN echo "cron.database_name = '${PG_CRON_DB:-pg_cron}'" >> /var/lib/postgresql/data/postgresql.conf
But this has two problems. First, the PG_CRON_DB variable is now fixed at build time. More importantly, the standard PostgreSQL init script will see that /var/lib/postgresql/data already exists, assume there is already a database, and fail on startup with this error:
initdb: error: directory "/var/lib/postgresql/data" exists but is not empty
initdb: hint: If you want to create a new database system, either remove or empty the directory "/var/lib/postgresql/data" or run initdb with an argument other than "/var/lib/postgresql/data".
The only solution I've found consists of these steps:
- Let PostgreSQL start normally
- Shut it down
- Add the pg_cron config
- Start PostgreSQL again
Here is my init script:
#!/bin/sh

set -e

echo "Starting custom entrypoint..."

# Initialize the database and start postgres in the background,
# with TCP listening disabled (-h '')
docker-entrypoint.sh postgres -h '' &
PID=$!

# Wait for the initialization to complete
echo "Waiting for PostgreSQL to initialize..."
until pg_isready; do
    sleep 1
done

# Stop the temporary PostgreSQL process
echo "Stopping temporary PostgreSQL process..."
kill -s TERM $PID
wait $PID

# Modify the PostgreSQL configuration
echo "Modifying PostgreSQL configuration..."
echo "shared_preload_libraries = 'pg_cron'" >> /var/lib/postgresql/data/postgresql.conf
echo "cron.database_name = '${POSTGRES_DB:-postgres}'" >> /var/lib/postgresql/data/postgresql.conf

echo "Starting PostgreSQL..."
# Has to run as the postgres user
exec su - postgres -c "postgres -D /var/lib/postgresql/data"
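To wire this up, the Dockerfile copies the script in and makes it the entrypoint. A sketch, assuming the script above is saved as custom-entrypoint.sh next to the Dockerfile:
# Hypothetical wiring, assuming the script is saved as custom-entrypoint.sh
COPY custom-entrypoint.sh /usr/local/bin/custom-entrypoint.sh
RUN chmod +x /usr/local/bin/custom-entrypoint.sh
ENTRYPOINT ["/usr/local/bin/custom-entrypoint.sh"]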
Create extension
There we have it. Everything should be working now. To test:
docker build -t partman .
docker run -e POSTGRES_USER=test -e POSTGRES_PASSWORD=test -e POSTGRES_DB=postgres -d -p 5432:5432 --name partman partman
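If you prefer the CLI, a plain psql connection against the published port should work (the credentials match the docker run flags above):
psql "host=localhost port=5432 user=test password=test dbname=postgres"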
Connect to the DB using your favourite tool or the psql command above, and run this:
CREATE SCHEMA partman;
CREATE EXTENSION pg_partman WITH SCHEMA partman;
CREATE EXTENSION pg_cron;
No more errors!
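As a quick sanity check, you can also list pg_cron's job table, which should exist and be empty at this point:
SELECT jobid, schedule, command FROM cron.job;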
Playing around
Now you can play around with pg_cron and pg_partman. If you mess anything up, it's as easy as this to start clean:
docker stop partman; docker rm partman; docker run -e POSTGRES_USER=test -e POSTGRES_PASSWORD=test -e POSTGRES_DB=postgres -d -p 5432:5432 --name partman partman
Let's create a table called my_data:
CREATE TABLE public.my_data (
    id SERIAL,
    my_date timestamp
) PARTITION BY RANGE (my_date);
And then partition it by minutes to make sure it works:
SELECT partman.create_parent(
    p_parent_table => 'public.my_data',
    p_control => 'my_date',
    p_type => 'native',
    p_interval => '1 minute',
    p_premake => 1,
    p_start_partition => '2023-08-11 14:00:00'
);
Now check how many partitions there are:
SELECT count(*) FROM partman.show_partitions('public.my_data');
This should be more than 1. If not, play around with p_start_partition and p_premake.
At this point, the partition set has only been created once. We need to configure it to keep creating partitions indefinitely:
UPDATE partman.part_config
SET infinite_time_partitions = true,
    retention = NULL, -- NULL = infinite
    retention_keep_table = true
WHERE parent_table = 'public.my_data';
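You can verify the update by inspecting the config row (column names as in pg_partman 4.x; adjust if your version differs):
SELECT parent_table, partition_interval, premake, infinite_time_partitions
FROM partman.part_config
WHERE parent_table = 'public.my_data';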
If you want to tell pg_partman to enforce its desired state:
CALL partman.run_maintenance_proc();
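To confirm rows actually get routed into child partitions, a quick check (this assumes maintenance has already created a partition covering now()):
-- tableoid::regclass reveals which child partition each row landed in
INSERT INTO public.my_data (my_date) VALUES (now());
SELECT tableoid::regclass AS partition, id, my_date FROM public.my_data;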
Rather than calling it manually, here is the magic: schedule it with pg_cron!
SELECT cron.schedule('* * * * *', $$CALL partman.run_maintenance_proc()$$);
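Recent pg_cron versions also record every run, so you can watch maintenance fire once a minute (cron.job_run_details may not exist on older releases):
SELECT jobid, status, return_message, start_time
FROM cron.job_run_details
ORDER BY start_time DESC
LIMIT 5;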
Conclusion
Setting up pg_partman and pg_cron in a Dockerized PostgreSQL environment was more complex than initially anticipated. However, with persistence, the challenges were overcome. Now you can play around and get familiar with these tools in the safety of Docker on localhost before going to prod.