Nathan Robinson's blog

Introduction to PostgreSQL Aggregate Functions

Nathan Robinson — Sat, 11 May 2024 18:04:13 GMT

Aggregation is one of the most useful tools in the SQL arsenal: it lets us ask more sophisticated questions of our data than row-by-row queries allow. This is a slightly more advanced topic, so be sure to brush up on basic SQL syntax before jumping into this article.

We'll be using the following data model for the examples.

Count

Here is a relatively simple example of when COUNT can be useful. We're calculating the number of expensive locations, which are those with a per-encounter cost of over $100. Here the WHERE clause is filtering out the cheap rows before our aggregation function operates against the resulting record set and produces a scalar value.

select count(*)
    from locations
    where cost > 100
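
To make the order of operations concrete, here is a minimal self-contained sketch that can be run without the real tables. The sample_locations rows, names, and costs are made up for illustration and simply stand in for the locations table.

```sql
-- sample_locations is a hypothetical stand-in for the locations table
with sample_locations (id, name, cost) as (
    values (1, 'Downtown', 250),
           (2, 'Eastside', 80),
           (3, 'Airport', 120)
)
select count(*) as expensive_locations
    from sample_locations
    where cost > 100;   -- the Eastside row is removed before count(*) runs
-- returns a single row: 2
```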

Group by

Group by allows us to slightly modify the behavior of our aggregation function in an interesting way. Consider the below example where we've chosen to group our encounters by patient ID and then apply the COUNT aggregation function. Because the aggregation is happening for each group, the output here is the number of encounters a given patient has had.

select patientid, count(*)
    from encounters
    group by patientid
order by patientid
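
The same idea can be tried on a few invented rows. In this sketch, sample_encounters and its values are assumptions made up for the example; only the patientid column name comes from the query above.

```sql
-- sample_encounters is a hypothetical stand-in for the encounters table
with sample_encounters (id, patientid) as (
    values (1, 10), (2, 10),
           (3, 11),
           (4, 12), (5, 12), (6, 12)
)
select patientid, count(*) as encounter_count
    from sample_encounters
    group by patientid      -- one output row per distinct patientid
order by patientid;
-- patientid 10 -> 2, 11 -> 1, 12 -> 3
```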

Sum

In this example, we are calculating the total revenue for each of our business's locations. First we join our locations and encounters tables to produce an intermediate record set that lists every encounter along with the per-encounter cost of its location. We then group these rows per location, and SUM adds up the cost values within each group.

select name, sum(cost) as revenue
    from locations loc
    inner join encounters enc
        on loc.id = enc.locid
    group by name
order by revenue
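
One thing worth knowing about the query above is that an inner join drops any location that has no encounters at all, so it never appears in the revenue list. If such locations should show up with zero revenue, a left join is one option. The sketch below is a possible variation, not the only way to do it, and all sample rows are invented for illustration.

```sql
-- both CTEs are hypothetical stand-ins for the real tables
with sample_locations (id, name, cost) as (
    values (1, 'Downtown', 250),
           (2, 'Eastside', 80),
           (3, 'Airport', 120)      -- no encounters recorded
),
sample_encounters (id, locid) as (
    values (1, 1), (2, 1), (3, 2)
)
select loc.name,
       -- only count the cost when the row actually matched an encounter
       sum(case when enc.id is not null then loc.cost else 0 end) as revenue
    from sample_locations loc
    left join sample_encounters enc
        on loc.id = enc.locid
    group by loc.name
order by revenue;
-- Airport -> 0, Eastside -> 80, Downtown -> 500
```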

Extract

If we want to calculate the number of cancelled encounters per location in April 2024, we could do it this way.

select locid, count(*)
    from encounters
    where starttime >= '2024-04-01'
            and starttime < '2024-05-01'
            and status = 'cancelled'
    group by locid
order by count(*) desc

However, we could also make use of the extract function instead. The extract function lets us pull out individual pieces of a PostgreSQL timestamp, such as the year or the month.

select locid, count(*)
    from encounters
    where extract(month from starttime) = 4
            and extract(year from starttime) = 2024
            and status = 'cancelled'
    group by locid
order by count(*) desc
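
To see what extract returns on its own, it can be applied to a literal timestamp with no table involved. The timestamp below is an arbitrary example value.

```sql
select extract(year  from timestamp '2024-04-15 09:30:00') as yr,
       extract(month from timestamp '2024-04-15 09:30:00') as mon,
       extract(day   from timestamp '2024-04-15 09:30:00') as dy;
-- yr = 2024, mon = 4, dy = 15
```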

Here is a slightly more sophisticated usage of extract, where we seek to calculate total encounters per location per month in 2024.

select locid, extract(month from starttime) as q_month, count(*)
    from encounters
    where extract(year from starttime) = 2024
    group by locid, q_month
order by locid, q_month

Having

The HAVING clause differs from WHERE because it is applied after the grouping has already occurred. In other words, HAVING filters the output of the aggregation function, while WHERE filters what goes into the aggregation function. This example shows how we would output only locations that have more than 100 total encounters.

select locid, count(*) as "Total Encounters"
    from encounters
    group by locid
    having count(*) > 100
order by locid
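
WHERE and HAVING can also appear in the same query, which makes the difference easier to see. The following is a small sketch on invented rows; the sample_encounters data and the 'completed' status value are assumptions for the example.

```sql
-- sample_encounters is a hypothetical stand-in for the encounters table
with sample_encounters (id, locid, status) as (
    values (1, 1, 'completed'), (2, 1, 'completed'), (3, 1, 'cancelled'),
           (4, 2, 'completed'), (5, 2, 'cancelled'),
           (6, 3, 'completed')
)
select locid, count(*) as completed_encounters
    from sample_encounters
    where status = 'completed'      -- filters rows before grouping
    group by locid
    having count(*) >= 2            -- filters groups after aggregation
order by locid;
-- only locid 1 is returned, with a count of 2
```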

Here is another example where we list locations that have less than $1000 of revenue.

select name, sum(cost) as revenue
    from locations loc
    inner join encounters enc
        on loc.id = enc.locid
    group by loc.name
    having sum(cost) < 1000
order by revenue

Just for science, here is an alternative query that uses a subquery instead of HAVING. Note that in the subquery version we only have to write the aggregate expression, sum(cost), once, while in the HAVING query we have to write it twice: once in the SELECT list and again in the HAVING clause. This is because, unlike some other flavors of SQL, PostgreSQL does not allow column aliases in the HAVING clause. Modifying our previous example to read HAVING revenue < 1000 therefore results in an error.

select name, revenue
    from (select name, sum(cost) as revenue
            from locations loc
            inner join encounters enc
                on loc.id = enc.locid
            group by loc.name
        ) as sub
    where sub.revenue < 1000
order by revenue

Conclusion

In this post we covered aggregation functions like COUNT and SUM alongside grouping tools like GROUP BY and HAVING. We also covered EXTRACT, which makes aggregating data based on dates more manageable. Combined, these tools allow us to ask more detailed questions about our data and retrieve grouped and aggregated metrics from our tables.

Beginner's guide to modifying data in PostgreSQL

Nathan Robinson — Sat, 27 Apr 2024 16:14:17 GMT

There are three primary modification operations we can perform in SQL: insert, update, and delete. In this blog post, we'll cover examples and explanations of each.

We'll be using the following data model for the examples.

Inserting Data

Here's an example of inserting two new rows into our patients table.

insert into patients (id, firstname, lastname, phone)
values (1, 'Terry', 'Avila', '540-923-0088'),
       (2, 'Dorothy', 'Buckler', '334-422-6232')

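Technically, the column list is optional here since I'm providing values for every column in table order (example below), but I recommend including it to future-proof your inserts: if a column is later added or reordered, an insert without a column list can fail or put values in the wrong columns.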

insert into patients
values (1, 'Terry', 'Avila', '540-923-0088'),
       (2, 'Dorothy', 'Buckler', '334-422-6232')

One final note is that PostgreSQL has a column pseudo-type that we are not taking advantage of here called SERIAL. It attaches a sequence to the column, so each new row gets the next incrementing integer automatically when no value is supplied. I recommend using this (or the newer, standard GENERATED ... AS IDENTITY syntax) on your primary key field where possible to simplify your inserts.
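
As a rough sketch of what that looks like, the example below creates a throwaway table with a SERIAL key and inserts the same two patients without supplying ids. The table name patients_demo and the column types are assumptions for illustration, since the real table definition isn't shown here.

```sql
-- patients_demo is a hypothetical table; the column types are assumed
create temporary table patients_demo (
    id        serial primary key,   -- filled in from a sequence
    firstname text not null,
    lastname  text not null,
    phone     text
);

-- no id column in the insert; PostgreSQL assigns 1 and 2
insert into patients_demo (firstname, lastname, phone)
values ('Terry', 'Avila', '540-923-0088'),
       ('Dorothy', 'Buckler', '334-422-6232')
returning id, firstname;
```

The RETURNING clause is optional; it is included so the generated ids are visible right away.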

Updating Data

Below is an example of updating a single row's values for last name and phone number. Be very cautious when modifying existing data in SQL, as simply forgetting the WHERE clause will result in every row in the table being modified. Performing routine backups of your database is essential, and you should always try your updates in a safe, isolated environment before making changes on a live system.

update patients
set lastname = 'Bucklar',
    phone = '334-123-9372'
where id = 2;
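
One way to add a safety net, sketched here against the same patients table, is to run the update inside a transaction and inspect the affected rows before deciding whether to keep the change. This is a suggested habit rather than something the update itself requires.

```sql
begin;

update patients
set lastname = 'Bucklar',
    phone = '334-123-9372'
where id = 2
returning id, firstname, lastname, phone;   -- shows exactly which rows changed

-- if more rows came back than expected, undo everything:
rollback;
-- if the result looks right, run commit; instead of rollback;
```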

Here is a contrived example where we wish to set the last name of one patient equal to the last name of a different patient. Notice how we're performing a scalar subquery here to retrieve the new value we wish to apply.

update patients pats
set lastname = (
    select lastname from patients where id = 0
)
where pats.id = 1;

PostgreSQL also has a bonus feature not present in standard SQL, a FROM clause on UPDATE, that allows you to perform the same update in a more readable way.

update patients pats1
set lastname = pats2.lastname
from (select * from patients where id = 0) pats2
where pats1.id = 1;

Deleting Data

Let's consider a scenario where we want to clear out some of our unnecessary data by deleting all patients who have no encounter associated with them.

delete from patients
where id not in (
    select patientid
        from encounters
)

This performs a subquery on encounters, pulling out the patientid value recorded on each encounter. For every patient, we then check whether that patient's id appears in the result set. If it doesn't, the patient row is deleted.

Here is an alternative way to perform this query, which uses a correlated subquery.

delete from patients
where not exists (
    select 1
        from encounters
    where encounters.patientid = patients.id
)

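These two queries can have different performance characteristics, although the query planner is free to rewrite either one. In PostgreSQL, the NOT EXISTS form can typically be planned as an anti-join, so the correlated subquery here is not necessarily the slower option; comparing the plans with EXPLAIN is the reliable way to find out. The two forms also differ in how they treat NULLs: if any patientid in encounters is NULL, the NOT IN version deletes nothing at all, while NOT EXISTS still deletes the patients without encounters.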
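
The NULL behaviour of NOT IN versus NOT EXISTS can be checked without touching real data. The sketch below counts the rows each condition would match instead of deleting them; the sample rows are invented, and the NULL patientid is included deliberately.

```sql
-- both CTEs are hypothetical stand-ins for the real tables
with sample_patients (id) as (
    values (1), (2), (3)
),
sample_encounters (patientid) as (
    values (1), (null)      -- one encounter has no patient recorded
)
select
    (select count(*)
        from sample_patients p
        where p.id not in (select patientid from sample_encounters)
    ) as not_in_matches,
    (select count(*)
        from sample_patients p
        where not exists (
            select 1
                from sample_encounters e
            where e.patientid = p.id
        )
    ) as not_exists_matches;
-- not_in_matches = 0, not_exists_matches = 2
```

With the NULL present, a comparison like 2 not in (1, null) evaluates to NULL rather than true, so the NOT IN condition never matches, while NOT EXISTS matches patients 2 and 3.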