A simple change to (some) monthly meetings

This tip isn’t exactly life-changing but just seemed to me like a good application of common sense.

I work with a lot of junior engineers, and recently I decided it would be a good idea to have recurring one-on-ones with each of them. I thought Thursday would be a good day and monthly would be a good cadence. I would spread them out over the month so I could meet with a few people each week.

As I created the first meeting series, I wondered: should I meet with this person on the first Thursday of the month? The fourth Thursday? The last?

I realized that by tying the schedule to calendar months, I’d end up with either four or five weeks between meetings. Four was probably okay, but five sounded like too many. I thought of a group I’m a part of that meets on the first Saturday of each month. When five weeks have passed since the last meeting, it feels like too long.

And it would cause problems with my schedule. If I always met with some people on the fourth Thursday and some on the last Thursday (whether the fourth or fifth), then on months with four Thursdays I’d have a day with twice as many one-on-ones. But if I chose never to meet with anyone on the fifth Thursday, then from time to time I’d have a week with no one-on-ones, which seemed like a waste.

In fact, there was no reason to couple these meetings to some decisions the Romans made 2,000 years ago about how many days each month should have. So I decided to meet with each coworker every four weeks instead.

There are some reasons you might not want to do that:

  • If the purpose of your meeting is to talk about things that happened over a period of about a month, it’s easier to remember what that period is if it’s a calendar month, which implies that you should meet exactly once each calendar month.
  • If your meeting is modeled more in people’s heads than in calendar software – like my group’s weekend get-together – then “the first Saturday of the month” beats “four weeks since the last meeting.”
  • If you meet every two or three months, then an extra week or two between occurrences doesn’t make as much of a difference.

But if none of those cases apply, reduce the coupling between your monthly meetings and the vagaries of the calendar by meeting once every four (or five) weeks – whatever makes sense for you.

You can’t check if primary keys are unique in Redshift

Redshift can lie to you if you’re not careful with your primary keys. This is in the documentation if you know where to look, but what to do about it is not.

The contract around primary keys

When a table has a primary key specified, Redshift will cheerfully allow you to insert multiple rows with the same primary key.

If all your data came straight from an RDBMS that enforces primary key uniqueness, then you won’t encounter that behavior at all, and you can read this with schadenfreude toward other Redshift users.

If you’re one of the unlucky ones imposing primary keys on arbitrary data, then the responsibility of ensuring their uniqueness falls on you.

Failing to do so could lead to problems worse than not being able to identify a record unambiguously: the results of your queries might not reflect the actual state of the table. And as a result, the obvious way to ensure the uniqueness of primary keys is wrong.

The wrong way to ensure uniqueness

For these examples, let’s suppose you have a table, my_table, where id is the primary key:

CREATE TABLE my_table
(
    id    VARCHAR,
    value VARCHAR,
    PRIMARY KEY (id)
);
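The lack of enforcement is easy to see for yourself. Both of the following inserts succeed without complaint (the values are made up for illustration):

-- Despite the PRIMARY KEY constraint, both of these succeed,
-- leaving two rows with id = 'a' in the table.
INSERT INTO my_table (id, value) VALUES ('a', 'first');
INSERT INTO my_table (id, value) VALUES ('a', 'second');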

The obvious solution is to query the table, grouping by the primary key column and returning values that appear in more than one row. If any are returned, then those are your duplicates, and you can figure out what to do about them… right?

-- Incorrect query to identify duplicated PKs (`id`).
-- Don't do this!
SELECT id
FROM my_table
GROUP BY id
HAVING COUNT(*) > 1;

Not necessarily. According to the documentation on defining constraints, Redshift’s planner

assumes that all keys in Amazon Redshift tables are valid as loaded. If your application allows invalid foreign keys or primary keys, some queries could return incorrect results.

So the problem with this query is that it might work or it might not, which is worse than it never working at all!

It’s like undefined behavior in C. The language specification describes certain things that are allowed by the syntax but that programmers are expected never to do (for example, dividing by zero). Compilers, in turn, are free to handle non-conforming code in any way in their efforts to produce efficient machine code.

I’d imagine Redshift’s non-enforcement of this constraint yields significant performance benefits. Since Redshift is often used to query large amounts of data originating in other RDBMSes, it’s a reasonable design tradeoff.

Okay, but does it really matter?

In my experience, the wrong way works – or, rather, has worked, at least sometimes, because I can’t tell when it doesn’t.

Even if you insert the same record multiple times, Redshift might find it appropriate, for performance reasons or otherwise, to return zero rows for the query above. And if it doesn’t today, it might tomorrow. And if it does tomorrow, it might always do it or only in certain cases. And if it does in certain cases, those cases might be rare. All of these are valid behaviors according to the documentation. How would you like to debug that in production?

You can’t check if primary keys are unique in Redshift because duplicating them is undefined behavior: as soon as you’ve done it, you’ve lost the ability to make sense of your data, including finding out whether primary keys are unique. If you’re checking for uniqueness at that point, it’s already too late.

The right way

A reliable way to ensure your tables don’t have duplicate primary keys is to check the uniqueness of candidate primary keys instead.

Before inserting records into your table, insert them into a “staging” table with the same schema as the real one but without a primary key constraint:

CREATE TABLE my_table_staging
(
    id    VARCHAR, -- Not a primary key in this table!
    value VARCHAR
);

The SELECT statement from before, when run against my_table_staging, will always return correct results. If Redshift tells you no values of id are present more than once in that table, you can trust it.

-- Identify duplicated values of `id`.
-- This works because `id` is not a PK.
SELECT id
FROM my_table_staging
GROUP BY id
HAVING COUNT(*) > 1;

Then, if my_table is empty, you can go ahead and insert all the rows from the staging table into it and be confident you’re not running afoul of Redshift’s baked-in assumptions.
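For that first load, a plain insert-select is all you need – a minimal sketch:

-- Safe only because `my_table` is empty and we've already
-- verified that `id` is unique in the staging table.
INSERT INTO my_table (id, value)
SELECT id, value
FROM my_table_staging;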

That’ll work the first time, but in the general case, where my_table already contains data, it’s not enough.

Suppose a particular value of id is unique in my_table_staging, and the same value is a (truly) unique primary key in my_table. If you insert all rows from my_table_staging into my_table, then that primary key will end up duplicated, and you’ll be in the same situation as before.

So you need to look at all actual and candidate primary keys together. Here’s a verbose example:

-- This CTE consists of candidate PKs you haven't checked
-- yet and could contain duplicates.
WITH candidate_pks AS
(
    SELECT id
    FROM my_table_staging
),
-- This CTE never contains duplicates, as long as you always
-- use a query like this one to find them and you never
-- insert them into `my_table`.
actual_pks AS
(
    SELECT id
    FROM my_table
),
-- This CTE could contain duplicates. They could be
-- duplicated in `candidate_pks` only, or they could be
-- duplicated between `candidate_pks` and `actual_pks`.
candidate_and_actual_pks AS
(
    SELECT id
    FROM candidate_pks
    UNION ALL
    SELECT id
    FROM actual_pks
)
SELECT id
FROM candidate_and_actual_pks
GROUP BY id
HAVING COUNT(*) > 1;

If, as part of your ingestion process, you always run a query like this one and weed out duplicates before inserting records into your “real” table, then you can be sure your primary keys will always be unique. And then you can confidently do interesting things with them, like including them in this query!
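What “weeding out” means depends on your data, but as a hypothetical sketch, here’s one possible policy: load only the staging rows whose id isn’t duplicated anywhere, and set the rest aside for someone to investigate. The subquery is the same duplicate check as above, and it assumes id is never NULL:

-- One possible policy (a sketch, not the only option):
-- insert only staging rows whose `id` is not duplicated,
-- either within staging or against `my_table`.
INSERT INTO my_table (id, value)
SELECT s.id, s.value
FROM my_table_staging s
WHERE s.id NOT IN
(
    SELECT id
    FROM
    (
        SELECT id FROM my_table_staging
        UNION ALL
        SELECT id FROM my_table
    ) candidate_and_actual_pks
    GROUP BY id
    HAVING COUNT(*) > 1
);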

Another option: keep it simple

I haven’t touched on uniqueness or foreign key constraints at all, but the same caveat applies to them, and you need to come up with similar solutions if you’re not sure they’re valid for your data.
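For foreign keys, a similar pre-load check might be an anti-join that looks for orphaned references before you insert anything. The child table and column here are hypothetical:

-- Hypothetical example: find staging rows whose `parent_id`
-- doesn't match any `id` in `my_table`. Any rows returned
-- are orphans you'd need to deal with before loading.
SELECT s.parent_id
FROM my_child_staging s
LEFT JOIN my_table p ON p.id = s.parent_id
WHERE p.id IS NULL;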

If you can’t or don’t want to do all that, you have another option: just don’t use constraints.

Depending on your use case, it might make sense to tolerate a few duplicate or ambiguous records. Or, if your initial understanding of the data wasn’t perfect, you might find it doesn’t have a logical key and you shouldn’t have specified one at all.

But one thing you definitely shouldn’t do is trust Redshift when you ask it if things it assumes to be unique are unique. It might only tell you what you want to hear.