You can't check if primary keys are unique in Redshift
Redshift can lie to you if you’re not careful with your primary keys. This is in the documentation if you know where to look, but what to do about it is not.
The contract around primary keys
When a table has a primary key specified, Redshift will cheerfully allow you to insert multiple rows with the same primary key.
If all your data came straight from an RDBMS that enforces their uniqueness, then you won’t encounter that behavior at all, and you can read this with schadenfreude toward other Redshift users.
If you’re one of the unlucky ones imposing primary keys on arbitrary data, then the responsibility of ensuring their uniqueness falls on you.
Failing to do so could lead to problems worse than not being able to identify a record unambiguously: the results of your queries might not reflect the actual state of the table. And because any uniqueness check is itself a query, the obvious way to verify the uniqueness of primary keys is wrong.
The wrong way to ensure uniqueness
For these examples, let’s suppose you have a table, my_table, where id is the primary key:
CREATE TABLE my_table
(
id VARCHAR,
value VARCHAR,
PRIMARY KEY (id)
);
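To see what that means in practice, here’s a quick demonstration with made-up values. Both statements succeed, even though the rows share a primary key:
-- Redshift accepts both rows without complaint.
INSERT INTO my_table VALUES ('1', 'first');
INSERT INTO my_table VALUES ('1', 'second');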
The obvious solution is to query the table, grouping by the primary key column and returning groupings that are found in more than one row. If any are returned, then those are your duplicates, and you can figure out what to do about them… right?
-- Incorrect query to identify duplicated PKs (`id`).
-- Don't do this!
SELECT id
FROM my_table
GROUP BY id
HAVING COUNT(*) > 1;
Not necessarily. According to the documentation on defining constraints, Redshift’s planner “assumes that all keys in Amazon Redshift tables are valid as loaded. If your application allows invalid foreign keys or primary keys, some queries could return incorrect results.”
So the problem with this query is that it might work or it might not, which is worse than it never working at all!
It’s like undefined behavior in C. The language specification describes certain things that are allowed by the syntax but that programmers are expected never to do (for example, dividing by zero). Compilers, in turn, are free to handle non-conforming code in any way in their efforts to produce efficient machine code.
I’d imagine Redshift’s non-enforcement of this constraint yields significant performance benefits. Since Redshift is often used to query large amounts of data originating in other RDBMSes, it’s a reasonable design tradeoff.
Okay, but does it really matter?
In my experience, the wrong way works – or, rather, has worked, at least sometimes, because I can’t tell when it doesn’t.
Even if you insert the same record multiple times, Redshift might find it appropriate, for performance reasons or otherwise, to return zero rows for the query above. And if it doesn’t today, it might tomorrow. And if it does tomorrow, it might always do it or only in certain cases. And if it does in certain cases, those cases might be rare. All of these are valid behaviors according to the documentation. How would you like to debug that in production?
You can’t check if primary keys are unique in Redshift because duplicating them is undefined behavior: as soon as you’ve done it, you’ve lost the ability to make sense of your data, including finding out whether primary keys are unique. If you’re checking for uniqueness at that point, it’s already too late.
The right way
A reliable way to ensure your tables don’t have duplicate primary keys is to check the uniqueness of candidate primary keys instead.
Before inserting records into your table, insert them into a “staging” table with the same schema as the real one but without a primary key constraint:
CREATE TABLE my_table_staging
(
id VARCHAR, -- Not a primary key in this table!
value VARCHAR
);
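How rows end up in the staging table depends on your pipeline; a COPY from S3 is typical. The bucket path and IAM role below are placeholders, not real values:
-- Hypothetical load into the staging table.
COPY my_table_staging
FROM 's3://your-bucket/path/to/staged-rows/'
IAM_ROLE 'arn:aws:iam::123456789012:role/YourRedshiftCopyRole'
FORMAT AS CSV;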
The SELECT statement from before, when run against my_table_staging, will always return correct results. If Redshift tells you no values of id are present more than once in that table, you can trust it.
-- Identify duplicated values of `id`.
-- This works because `id` is not a PK.
SELECT id
FROM my_table_staging
GROUP BY id
HAVING COUNT(*) > 1;
Then, if my_table is empty, you can go ahead and insert all the rows from the staging table into it and be confident you’re not running afoul of Redshift’s baked-in assumptions.
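Under those conditions (an empty my_table and a staging table that has passed the duplicate check), the load is a single statement:
-- Safe only because `my_table` is empty and
-- `my_table_staging` has been checked for duplicates.
INSERT INTO my_table (id, value)
SELECT id, value
FROM my_table_staging;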
That’ll work the first time, but in the general case, where my_table already contains data, it’s not enough.
Suppose a particular value of id is unique in my_table_staging, and the same value is a (truly) unique primary key in my_table. If you insert all rows from my_table_staging into my_table, then that primary key will end up duplicated, and you’ll be in the same situation as before.
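To make the trap concrete, here’s a hypothetical sequence in which every per-table check passes and the bulk insert still breaks the constraint:
-- `my_table` already holds id '42' as a genuinely unique PK.
INSERT INTO my_table VALUES ('42', 'old');
-- The same id arrives in staging, where it is also unique, so
-- the duplicate check on `my_table_staging` returns no rows.
INSERT INTO my_table_staging VALUES ('42', 'new');
-- Running the bulk INSERT from before now leaves two rows with
-- id = '42', and `my_table` is back in undefined-behavior territory.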
So you need to look at all actual and candidate primary keys together. Here’s a verbose example:
-- This CTE consists of candidate PKs you haven't checked
-- yet and could contain duplicates.
WITH candidate_pks AS
(
SELECT id
FROM my_table_staging
),
-- This CTE never contains duplicates, as long as you always
-- use a query like this one to find them and you never
-- insert them into `my_table`.
actual_pks AS
(
SELECT id
FROM my_table
),
-- This CTE could contain duplicates. They could be
-- duplicated in `candidate_pks` only, or they could be
-- duplicated between `candidate_pks` and `actual_pks`.
candidate_and_actual_pks AS
(
SELECT id
FROM candidate_pks
UNION ALL
SELECT id
FROM actual_pks
)
SELECT id
FROM candidate_and_actual_pks
GROUP BY id
HAVING COUNT(*) > 1;
If, as part of your ingestion process, you always run a query like this one and weed out duplicates before inserting records into your “real” table, then you can be sure your primary keys will always be unique. And then you can confidently do interesting things with them, like including them in this query!
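Putting it all together, one possible shape for that ingestion pass is below. The policy here (keep one arbitrary row per duplicated id, skip ids already present in my_table, and clear the staging table afterward) is an assumption; yours may differ:
-- Sketch of a full ingestion pass. DELETE is used instead of
-- TRUNCATE because TRUNCATE commits immediately in Redshift.
BEGIN;

INSERT INTO my_table (id, value)
SELECT id, value
FROM (
    -- Keep one arbitrary row per candidate `id`.
    SELECT id,
           value,
           ROW_NUMBER() OVER (PARTITION BY id ORDER BY value) AS rn
    FROM my_table_staging
) deduped
WHERE rn = 1
  -- Skip ids that already exist in `my_table`.
  AND NOT EXISTS (SELECT 1 FROM my_table t WHERE t.id = deduped.id);

DELETE FROM my_table_staging;

COMMIT;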
Another option: keep it simple
I haven’t touched on uniqueness or foreign key constraints at all, but the same caveat applies to them, and you need to come up with similar solutions if you’re not sure they’re valid for your data.
If you can’t or don’t want to do all that, you have another option: just don’t use constraints.
Depending on your use case, it might make sense to tolerate a few duplicate or ambiguous records. Or, if your initial understanding of the data wasn’t perfect, you might find it doesn’t have a logical key and you shouldn’t have specified one at all.
But one thing you definitely shouldn’t do is trust Redshift when you ask it if things it assumes to be unique are unique. It might only tell you what you want to hear.