Wednesday, June 14, 2023

The time adding a limit to a Postgres query made it slower

This is something I ran into at work.  I wouldn't usually post stuff like this here, but I thought it was pretty interesting.  It involves optimizations to SQL queries, so best to stop reading here if that's not something you're interested in.  I've cleaned up the example below and made it generic.


Our database is Postgres 13 and we have two tables.  One is named offers, which has a primary key of offer_id and a foreign key column named plan_id.  The other table is named plans and has a plan_id primary key, plus a column named plan_name.  Many offers can share the same plan_id.  There are indexes on the primary and foreign key columns.  We want to find all the offers that have a particular plan_name.
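
For reference, here's a minimal sketch of the schema (the column types and the index name are my own guesses, and the real tables have more columns than this):

create table plans (
  plan_id   bigint primary key,  -- primary keys get an index automatically
  plan_name text
);

create table offers (
  offer_id  bigint primary key,
  plan_id   bigint references plans (plan_id)
);

-- foreign key columns do not get an index automatically
create index offers_plan_id_idx on offers (plan_id);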


This was the starting query:
select * from offers where plan_id in (
  select plan_id from plans where plan_name = 'Example Plan'
)
order by offer_id desc
limit 5


It took about 200 seconds to run.

If I removed the order by and ran it again, it only took 14 ms for me.  So far so good: sorting can be slow.  Then again, offer_id is the primary key and has an index, so sorting by it shouldn't be that slow.
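
That is, something like this (assuming the limit stays and only the order by goes):

select * from offers where plan_id in (
  select plan_id from plans where plan_name = 'Example Plan'
)
limit 5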

Here's where things get interesting though.  If I removed the limit and kept the order by:
select * from offers where plan_id in (
  select plan_id from plans where plan_name = 'Example Plan'
)
order by offer_id desc

That only took 323 ms.

To be clear: Adding a limit to this query causes it to go from less than a second to 3 minutes!

It turns out this is a thing.  If you google "Postgres query slower with limit" you'll find lots of confused Stack Overflow posts.  The problem shows up when Postgres's statistics are wrong: the query planner assumes the limit will make one approach faster, but there are actually far fewer rows meeting your criteria than it thought, and the query takes much longer.
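
You can see the misestimate with explain analyze, which prints the planner's estimated row counts next to the actual ones:

explain analyze
select * from offers where plan_id in (
  select plan_id from plans where plan_name = 'Example Plan'
)
order by offer_id desc
limit 5

In the output, compare each node's estimated rows value against its actual rows; a big gap on the plan_id filter is the smoking gun here.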

In this case, Postgres thinks "oh I have an index on offer_id, let me just sort by that column and scan through all those rows until I find 5 rows that match that plan_id filter, that'll be fast cause there's plenty of rows that match that filter and I need to sort them anyway, so I'm accomplishing 2 things with one sort" (Postgres has a problem with run-on sentences).  However, Postgres only thinks that's a good plan because it misestimated how many rows match the filter, thinking there are thousands when there are only a dozen or so.  So it wastes a lot of time scanning through rows in reverse chronological order until it finds 5.  A better plan is to use the index on plan_id to find all the rows that match, then sort just those and take the newest 5.  But that's only a better plan because we know how few rows actually match the filter, and Postgres's estimate for that is way off.

The "solution" is to force Postgres to not use that plan.  Postgres does not have query plan hints, so we have to use hacks to force Postgres to not take the slow approach.  There are various suggestions to use subqueries or CTEs and put the order by in the subquery, and the limit on the outer query.  So something like this
with all_results as (
  select * from offers
  join plans using (plan_id)
  where plan_name = 'Example Plan'
  order by offer_id desc
)
select * from all_results
limit 5

But it seems Postgres 13 is smarter than that, and still optimizes the combined query.  It does seem a bit faster than the first query, but it still took 2 minutes to run.
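
In hindsight that makes sense: since Postgres 12, CTEs are no longer an optimization fence by default and get inlined into the outer query.  The documented way to force the old fencing behavior is the materialized keyword, so a variant like this might work (a sketch, I can't vouch for it against this data):

with all_results as materialized (
  select * from offers
  join plans using (plan_id)
  where plan_name = 'Example Plan'
  order by offer_id desc
)
select * from all_results
limit 5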

However, a much hackier solution, which also seems to be the fastest and the simplest to understand, is to "modify" the column you are ordering on so that the planner can't use its index for the sort.  So, bizarrely, this query:
select * from offers where plan_id in (
  select plan_id from plans where plan_name = 'Example Plan'
)
order by offer_id + 0 desc
limit 5

Ran in 272 ms.  Removing the + 0 causes it to go back to 3+ minutes.  The + 0 is arithmetically a no-op, but the planner can't match the expression offer_id + 0 to the plain index on offer_id, which is exactly what we want.

This article goes into more detail, and is where I got the + 0 idea.

I should also mention that the "real" solution to this is to increase the statistics target on the columns Postgres is having a hard time estimating, but I tested that and it didn't seem to fix this case.  Removing the index on offer_id would also work, although that's not an option here since it's the primary key.  It also feels like a bad idea to remove an index entirely rather than just forcing Postgres not to use it in the one query where we know it's a bad choice.
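
For reference, the statistics bump looks something like this (which columns to target is a judgment call; these two are the ones involved in the misestimate, and 1000 is just an example, the default target being 100):

alter table offers alter column plan_id set statistics 1000;
alter table plans alter column plan_name set statistics 1000;

-- re-analyze so the new targets actually take effect
analyze offers;
analyze plans;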