Building a PostgreSQL Extension, Line by Line

2024-06-14

I’ve been working on a particular problem - how to offer logical replication for a very large number of Postgres databases, where I don’t have the ability (or capacity) to liaise with the users of the databases one-to-one. In order to mitigate some of the limitations, I wanted to implement a flexible extension to help guard against two of the larger pitfalls - DDL and large objects.

There are some other interesting problems here, including enforcing primary keys on all tables and ensuring REPLICA IDENTITY FULL is set for all tables. For now, I’m going to focus on building an extension for DDL and large object guardrails. Extensions are probably my favourite Postgres feature.

Contents

Extension Overview

The extension, ddl_guard, has the following features:

The “sentinel mode” was mostly inspired by AWS’ RDS Blue/Green deployments, where a notification is sent to the customer if they have performed an incompatible action.

Here is a quick tree of the directory structure:

.
├── LICENSE
├── Makefile
├── README.md
├── bin
│   └── test
├── ddl_guard--1.0.0.sql
├── ddl_guard.c
├── ddl_guard.control
└── test
    ├── data
    │   └── test.data
    ├── ddl_guard.conf
    ├── regular
    │   ├── expected
    │   │   ├── ddl.out
    │   │   └── lobject.out
    │   └── sql
    │       ├── ddl.sql
    │       └── lobject.sql
    └── superuser
        ├── expected
        │   ├── ddl.out
        │   └── lobject.out
        └── sql
            ├── ddl.sql
            └── lobject.sql

Extension Implementation

Control File

When building an extension, particularly in C, generally the first thing to do is create a control file. This file is used by Postgres to understand the extension and how to load it. More on the definitions here: Extension Files.

# ddl_guard.control
# The version of the extension. Semver is not enforced, but good to follow.
default_version = '1.0.0'
# Initially applied on first install.
comment = 'Prevents DDL execution by non-superusers when a specific GUC is set'
# Prevent or allow moving the extension and associated objects
# to another schema after install.
relocatable = false
# Used when invoking 'MODULE_PATHNAME' when installing C extensions.
module_pathname = '$libdir/ddl_guard'
# Forces the extension to be installable only by superusers.
superuser = true

Makefile

Next, we create a Makefile to compile the extension. ddl_guard has a very simple Makefile:

MODULES = ddl_guard # Name of the built module.
EXTENSION = ddl_guard # Name of the extension.
DATA = ddl_guard--1.0.0.sql # The install SQL script.

# The directory containing the test files.
# Can be overidden by setting the INPUTDIR environment variable.
INPUTDIR ?= test/regular
# The test files. Uses Makefile wildcard expansion.
TESTS = $(wildcard $(INPUTDIR)/sql/*.sql)
# The test names. Uses Makefile pattern substitution
# to extract the test suite names.
REGRESS = $(patsubst $(INPUTDIR)/sql/%.sql,%,$(TESTS))
# The test options.
# --inputdir: The directory containing the test files.
# --load-extension: The extension to load as part of the test run.
# --temp-config: The temporary configuration file to use for the test.
# --use-existing: Use an existing database for the test.
# See: https://www.postgresql.org/docs/current/extend-pgxs.html#EXTEND-PGXS
REGRESS_OPTS = --inputdir=$(INPUTDIR) --load-extension=$(EXTENSION) --temp-config=test/ddl_guard.conf --use-existing

# Set a compiler flag when built under different architectures.
OPTFLAGS = -march=native
ifeq ($(shell uname -s), Darwin)
	ifeq ($(shell uname -p), arm)
		# no difference with -march=armv8.5-a
		OPTFLAGS =
	endif
endif

# Set some additional compilter flags.
# The most important one for us here is -Werror - treat warnings as errors.
PG_CFLAGS += $(OPTFLAGS) -Werror -ftree-vectorize -fassociative-math -fno-signed-zeros -fno-trapping-math

# Get the pg_config binary.
# Can be overridden by setting the PG_CONFIG environment variable.
PG_CONFIG ?= pg_config
# Include the PostgreSQL extension makefile.
PGXS := $(shell $(PG_CONFIG) --pgxs)
include $(PGXS)

Install SQL

The install SQL is run on invoking CREATE EXTENSION. In our case, the install script is very simple, just installing the event trigger and relevant function:

-- ddl_guard--1.0.sql
CREATE OR REPLACE FUNCTION ddl_guard_check()
    -- The function returns an event trigger.
    RETURNS event_trigger
    -- The function is run with the privileges of the user invoking it.
    SECURITY INVOKER
     -- Restrict search path to pg_catalog and pg_temp as a security measure.
    SET search_path = 'pg_catalog, pg_temp'
    -- Specify that this is a C extension
    LANGUAGE C
    -- The module path and the function name exported from the C shared object.
    AS 'MODULE_PATHNAME', 'ddl_guard_check';

-- Create the event trigger.
CREATE EVENT TRIGGER ddl_guard_trigger
    -- Trigger on the start of a DDL command.
    ON ddl_command_start
    -- Execute the earlier function.
    EXECUTE FUNCTION ddl_guard_check();

C Extension

The C extension is where the bulk of the work happens. The full code is in the linked repo, but I’ll break it down in chunks here.

// Imports
#include "postgres.h"                   // Main Postgres headers
#include "fmgr.h"                       // Function manager headers
#include "utils/fmgrtab.h"              // Function manager table headers
#include "storage/fd.h"                 // File descriptor headers
#include "utils/guc.h"                  // GUC headers
#include "commands/event_trigger.h"     // Event trigger headers
#include "miscadmin.h"                  // Miscellaneous admin headers
#include "pgstat.h"                     // Postgres statistics headers
#include "catalog/objectaccess.h"       // Object access headers
#include "catalog/pg_largeobject.h"     // Large object headers

#include <stdio.h>  // Standard I/O headers
#include <unistd.h> // POSIX headers

Imports are relatively straight forward - we’re importing the Postgres headers, as well as some standard C headers for file descriptor handling and POSIX headers for file handling. Ordering is important here - some headers expect others to be included first, hence the non-alphabetical ordering.

#ifdef PG_MODULE_MAGIC
PG_MODULE_MAGIC;
#endif

#define DDL_SENTINEL_FILE PG_STAT_TMP_DIR "/ddl_guard_ddl_sentinel"
#define LO_SENTINEL_FILE PG_STAT_TMP_DIR "/ddl_guard_lo_sentinel"

void		_PG_init(void);
void		_PG_fini(void);
void		write_sentinel_file(const char *filename);
Datum		ddl_guard_check(PG_FUNCTION_ARGS);
static void lob_object_access_hook(ObjectAccessType access, Oid classId, Oid objectId, int subId, void *arg);

static bool ddl_guard_enabled = false;
static bool ddl_guard_ddl_sentinel = false;
static bool ddl_guard_lo_sentinel = false;
static object_access_hook_type next_object_access_hook = NULL;

Lets get our definitions out of the way, including some constants. PG_MODULE_MAGIC is a macro that Postgres uses to ensure the extension was built correctly, and is not, lets say, been built for a different version of Postgres. Generally, for dynamic libraries this should always be called and called once. See Dynamic Loading for more.

We also define some static variables to hold the state of the extension, as well as some function prototypes. PG_STAT_TMP_DIR is provided by Postgres, and is the temporary directory for statistics files.

/* large_object_funcs contains all the create/update/destroy lobject funcs */
static const char *large_object_funcs[] = {
	"lo_create",
	"lo_creat",
	"lo_truncate",
	"lo_truncate64",
	"lo_unlink",
	"lowrite",
	"lo_from_bytea",
	"be_lowrite",
	"be_lo_create",
	"be_lo_creat",
	"be_lo_unlink",
	"be_lo_truncate",
	"be_lo_from_bytea",
	"be_lo_truncate64",
	"be_lo_import",
	"be_lo_import_with_oid",
};
static const int lobject_funcs_count = sizeof(large_object_funcs) / sizeof(large_object_funcs[0]);
static Oid *lobject_func_oids;
static int	max_reserved_oid = 0;
static int	min_reserved_oid = 9000;

We define an array of all the functions that are “write” operations for large objects, large_object_funcs. We also store the size of the array.

lobject_func_oids is an array of OIDs for the given functions, and we set and upper and lower bound for the OIDs we’re interested in - this will become more relevant later.

PG_FUNCTION_INFO_V1(ddl_guard_check);

PG_FUNCTION_INFO_V1 is a macro that defines a Postgres-callable function with version-1 calling conventions. This is the function that is exported and callable from Postgres.

A brief note on C formatting: Postgres uses a particular style of formatting, BSD-like, that is enforced using the pgindent tool. It is broadly similar to K&R style. I don’t personally like it, but convention is king.

static const FmgrBuiltin *
fmgr_lookupByName(const char *name)
{
	int			i;

	for (i = 0; i < fmgr_nbuiltins; i++)
	{
		/* switched to strncmp */
		/* current max len is 22 in large_object_funcs */
		if (strncmp(name, fmgr_builtins[i].funcName, 22) == 0)
			return fmgr_builtins + i;
	}

	return NULL;
}

fmgr_lookupByName is a helper function that looks up a function by name in the function manager. This is used to find the OIDs of the functions we’re interested in. It is ripped directly from Postgres’ fmgr.c, as it is unexported. The original function uses strcmp, but we switch to strncmp as a defensive measure.

fmgr_nbuiltins is the number of built-in functions in Postgres, which are the ones we are interested in for the purposes of this extension. Exported from fmgrtab.h.

void
write_sentinel_file(const char *filename)
{
	FILE	   *fp = NULL;

	fp = AllocateFile(filename, "w");
	if (fp == NULL)
		ereport(ERROR,
				(errcode(ERRCODE_INTERNAL_ERROR),
				 errmsg("could not create sentinel file \"%s\": %m", filename)));

	if (FreeFile(fp))
		ereport(ERROR,
				(errcode(ERRCODE_INTERNAL_ERROR),
				 errmsg("could not write sentinel file \"%s\": %m", filename)));
}

write_sentinel_file is a helper function that writes a file to disk, with no content. It uses AllocateFile and FreeFile to hande the file creation and closing. We use AllocateFile over fopen, mostly for safety reasons.

All is pretty straightforward so far.

static bool
set_lobject_func_oids()
{
	const FmgrBuiltin *entry;
	int			i;

	lobject_func_oids = (Oid *) palloc(lobject_funcs_count * sizeof(Oid));
	if (lobject_func_oids == NULL)
	{
		return false;
	}

	for (i = 0; i < lobject_funcs_count; i++)
	{
		if ((entry = fmgr_lookupByName(large_object_funcs[i])) != NULL)
		{
			lobject_func_oids[i] = entry->foid;
			if (entry->foid < min_reserved_oid)
			{
				min_reserved_oid = entry->foid;
			}
			else if (entry->foid > max_reserved_oid)
			{
				max_reserved_oid = entry->foid;
			}
		}
	}

	return true;
}

On initialisation, we need to find and store the OIDs of the large object functions. set_lobject_func_oids does this, by iterating over the large_object_funcs array and looking up the function by name. If the function is found, we store the OID and update the min and max reserved OIDs.

static void
lob_object_access_hook(ObjectAccessType access, Oid classId, Oid objectId, int subId, void *arg)
{
	int			i;
	const FmgrBuiltin *entry;

    /* check if the extension is enabled and we're not a superuser */
	if (ddl_guard_enabled && !superuser())
	{
        /* check if we're in sentinel mode */
		if (ddl_guard_lo_sentinel)
		{
			switch (access)
			{
                /* if we're executing a function */
				case OAT_FUNCTION_EXECUTE:
                    /* if the function is in the range of large object functions */
					if (objectId >= min_reserved_oid && objectId <= max_reserved_oid)
					{
						for (i = 0; i < lobject_funcs_count; i++)
						{
                            /* if the OID is in the large object OID array */
							if (lobject_func_oids[i] == objectId)
							{
                                /* if we successfully look up the function */
								if ((entry = fmgr_lookupByName(large_object_funcs[i])) != NULL)
								{
                                    /* write a sentinel file */
									write_sentinel_file(LO_SENTINEL_FILE);
                                    /* log a warning */
									ereport(WARNING, errmsg("lo_guard: lobject \"%s\" function call, sentinel file written", entry->funcName));
								}
							}
						}
					}
				default:
					break;
			}
		}
	}
    /* call the next object access hook */
	if (next_object_access_hook)
		(*next_object_access_hook) (access, classId, objectId, subId, arg);
}

lob_object_access_hook is the meat of our detection for large object writes. It is an object access hook, which is a way to intercept and modify object access in Postgres. For more information on this type of hook, see unofficial hook documentation We check if the extension is enabled and we’re not a superuser, and if we’re in sentinel mode. If we are, we check if the function being executed is in the range of large object functions, and if it is, we write a sentinel file and log a warning.

Datum
ddl_guard_check(PG_FUNCTION_ARGS)
{
	if (!CALLED_AS_EVENT_TRIGGER(fcinfo))
		ereport(ERROR,
				(errcode(ERRCODE_INTERNAL_ERROR),
				 errmsg("ddl_guard_check: not fired by event trigger manager")));

	if (ddl_guard_enabled && !superuser())
	{
		if (ddl_guard_ddl_sentinel)
		{
			write_sentinel_file(DDL_SENTINEL_FILE);
			ereport(WARNING, (errmsg("ddl_guard: ddl detected, sentinel file written")));
			PG_RETURN_VOID();
		}
		ereport(ERROR,
				(errcode(ERRCODE_INSUFFICIENT_PRIVILEGE),
				 errmsg("Non-superusers are not allowed to execute DDL statements"),
				 errhint("ddl_guard.enabled is set.")));
	};

	PG_RETURN_VOID();
}

ddl_guard_check is the function that is called when the event trigger is fired. It checks if the extension is enabled and we’re not a superuser, and if we’re in sentinel mode. If we are, we write a sentinel file and log a warning. If we’re not in sentinel mode, we return an error and prevent the operation from being completed. We check we’re being called from within an event trigger to prevent users from calling it directly.

void
_PG_init(void)
{
	DefineCustomBoolVariable("ddl_guard.enabled", "Enable or disable DDL Guard",
							 NULL, &ddl_guard_enabled, false, PGC_SUSET, 0, NULL,
							 NULL, NULL);

	DefineCustomBoolVariable("ddl_guard.ddl_sentinel", "Write sentinel file for DDL statements",
							 NULL, &ddl_guard_ddl_sentinel, false, PGC_SUSET, 0, NULL,
							 NULL, NULL);

	DefineCustomBoolVariable("ddl_guard.lo_sentinel", "Write sentinel file for pg_largeobject modifications",
							 NULL, &ddl_guard_lo_sentinel, false, PGC_SUSET, 0, NULL,
							 NULL, NULL);

	unlink(DDL_SENTINEL_FILE);
	unlink(LO_SENTINEL_FILE);

	if (set_lobject_func_oids())
	{
		next_object_access_hook = object_access_hook;
		object_access_hook = lob_object_access_hook;
	}
	else
	{
		ereport(ERROR, errmsg("We beefed it, chief."));
	}

	EmitWarningsOnPlaceholders("ddl_guard");
}

_PG_init is an initialisation function called immediately when a dynamic library is loaded. In it, we do a couple of things. First, we define some custom GUCs for the extension, which are used to control the behaviour of the extension. We also unlink any existing sentinel files, in case the service is restarted. Finally, we set our object_access_hook to our lob_object_access_hook function, which will intercept large object writes.

EmitWarningsOnPlaceholders is mostly a prophylactic measure to ensure folks don’t dynamically define GUCs within our namespace. It doesn’t quite work like that, though - recent work on Postgres has created a new function for this purpose.

void
_PG_fini(void)
{
	if (lobject_func_oids != NULL)
	{
		pfree(lobject_func_oids);
	}

	object_access_hook = next_object_access_hook;
}

_PG_fini is a bit of an oddball - this is called if a module is unloaded. This isn’t currently implemented in Postgres, but it is good practice to clean up after yourself. We free the memory we allocated for the large object function OIDs, and reset the object_access_hook to its original value.

Tests

Testing is important for any extension author, and we are no exception. We use the standard Postgres regression testing approach, which executes a series of SQL files and compares the output to expected output. We have two sets of tests - one for superusers, and one for regular users. The superuser tests are the same as the regular tests, but with the superuser flag set.

-- DDL tests
SELECT current_setting('ddl_guard.enabled', true) = 'on' AS ddl_guard_enabled;
SELECT current_setting('is_superuser', true) = 'on' AS is_superuser;
CREATE TABLE IF NOT EXISTS foobar (id serial PRIMARY KEY);
CREATE OR REPLACE FUNCTION foobar_trigger() RETURNS TRIGGER AS $$
BEGIN
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;
DROP TABLE IF EXISTS foobar;
DROP FUNCTION IF EXISTS foobar_trigger;
SELECT current_setting('ddl_guard.enabled', true) = 'on' AS ddl_guard_enabled;
 ddl_guard_enabled
-------------------
 t
(1 row)

SELECT current_setting('is_superuser', true) = 'on' AS is_superuser;
 is_superuser
--------------
 f
(1 row)

CREATE TABLE IF NOT EXISTS foobar (id serial PRIMARY KEY);
WARNING:  ddl_guard: ddl detected, sentinel file written
CREATE OR REPLACE FUNCTION foobar_trigger() RETURNS TRIGGER AS $$
BEGIN
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;
WARNING:  ddl_guard: ddl detected, sentinel file written
DROP TABLE IF EXISTS foobar;
WARNING:  ddl_guard: ddl detected, sentinel file written
DROP FUNCTION IF EXISTS foobar_trigger;
WARNING:  ddl_guard: ddl detected, sentinel file written

There is also a set for large object tests, which are more comprehensive:

SET bytea_output TO escape;
SELECT lo_create(42);
\lo_unlink 42

CREATE TABLE lotest_stash_values (loid oid, fd integer);
INSERT INTO lotest_stash_values (loid) SELECT lo_creat(42);
-- lobs require xacts
BEGIN;
UPDATE lotest_stash_values SET fd = lo_open(loid, CAST(x'20000' | x'40000' AS integer));
SELECT lowrite(fd, '
My hair is grey, but not with years,
Nor grew it white
    In a single night,
[snip]
') FROM lotest_stash_values;

SELECT lo_close(fd) FROM lotest_stash_values;
END;

SELECT lo_from_bytea(0, lo_get(loid)) AS newloid FROM lotest_stash_values
\gset

BEGIN;
UPDATE lotest_stash_values SET fd=lo_open(loid, CAST(x'20000' | x'40000' AS integer));
SELECT lo_truncate(fd, 11) FROM lotest_stash_values;
SELECT lo_close(fd) FROM lotest_stash_values;
END;
-- and more

A helper script is defined in bin/test to make it easier to invoke make installcheck with the appropriate arguments.

Considerations

Fortuantely, this extension is quite simple - a hook and an event trigger. However, there are some considerations to be made when building an extension:

Putting It All Together

The extension is built by running make in the extension directory. This will compile the extension and create a shared object. The extension can then be installed by running make install and make installcheck. It can be uninstalled with make uninstall.

For packaging, some folks opt for releasing it on PGXN. This involves packaging the repository according to a specific set of conventions - more on this can be found here.

Alternatively, you might wish to package your extension as a Debian package or similar. Typically here the shared object is packaged and the install script installs in the correct directory. As a consequence, the package needs to be built for each Postgres version you support, and declare its dependencies in case you depend on other libraries (such as json-c).