Replacing DISTINCT ON with Postgres Lateral Subquery
Until 2018, Naya used CouchDB as its storage option, Neo4j for executing graph queries and Elasticsearch for search and caching queries. It worked great and was extremely fast! However, it came with a lot of moving parts and maintaining it demanded a lot of time and resources.
So in 2018, Naya decided to move away from NoSQL+Graph database infrastructure. We chose Postgres to replace all of the moving parts to simplify the stack. While Postgres is not as "flashy" and "trendy" as the new databases that pop up every other week, it is an extremely powerful database. Nevertheless, you still need to do it right and write efficient queries. Small tweaks can really improve the speed and performance of your queries. For example, last week Naya released Naya News's web version. Naya News is a news aggregator with a very simple UI that makes reading news from multiple sources extremely easy. But one of its queries in particular was extremely slow. The culprit was the following query (simplified).
SELECT * FROM ( SELECT DISTINCT ON (clusterid) clusterid AS cluster_id, * FROM ( SELECT (SELECT id FROM cluster WHERE news.id = ANY(cluster.news_ids) LIMIT 1) AS clusterid, news.*, news.popularity AS ranking, ... FROM news WHERE news.timestamp < 1614921375170 AND ... ORDER BY ranking DESC, news.timestamp DESC ) news WHERE clusterid IS NOT NULL ORDER BY clusterid DESC, ranking DESC, news.timestamp DESC ) news ORDER BY ranking DESC, news.timestamp DESC LIMIT 12 OFFSET 0;
This query pulls a unique list of news based on the ranking and timestamp. It also only selects the first row of a group (cluster). For the following table structure and around 100k rows, this query took around 6 seconds! That is just too slow!
cluster: id news_ids 1 {1, 4, 5} 2 {2} 3 {3, 6} news: id ... 1 2 3 ...
Lateral subqueries to the rescue
Waiting 6 seconds just for the database to perform every single list fetch would be excruciatingly slow. Running the query with EXPLAIN ANALYZE shows that DISTINCT ON with ORDER BY was consuming a lot of resources. We decided to replace this with Postgres's LATERAL subquery. LATERAL was introduced in PostgreSQL 9.3 and it allows subqueries to reference columns provided by preceding FROM items. Here's the updated query.
SELECT cluster.id AS cluster_id, news.* FROM cluster, LATERAL( SELECT news.*, news.popularity AS ranking, ... FROM news WHERE news.timestamp < 1614921375170 AND news.id = ANY(cluster.news_ids) AND ... ORDER BY ranking DESC, news.timestamp DESC LIMIT 1 ) news ORDER BY ranking DESC, news.timestamp DESC LIMIT 12 OFFSET 0;
This dramatically reduced the query time to under 100 milliseconds. That's 60x improvement! Furthermore, we did not have to change the query too much. It also provides some flexibility over separately ranking news and clusters.