Default Settings Nightmare

We build new software by leveraging existing libraries, frameworks, and tools. Most of the time, we use them with their default settings. This works for many projects without issue, as these defaults are generally optimized for common use cases. However, while it may not be necessary to tweak settings from the start, we should at least be aware of the configuration options available. Understanding these options helps us grasp system behavior better and identify the root cause of incidents faster. Here, I’ll show how using the default settings of a PostgreSQL database can lead to a catastrophic issue where your entire system goes down.

The Idle Transaction Problem

In PostgreSQL, a transaction can stay open indefinitely without any activity. To mitigate this, PostgreSQL provides a setting called idle_in_transaction_session_timeout, which defines how long a session can keep a transaction open without executing another query, committing, or rolling back. For example, in AWS RDS the default value for this setting is 24 hours.
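
You can check the current value and tighten it for your own session with standard SQL; the 5-minute value below is just an illustration:

    SHOW idle_in_transaction_session_timeout;

    -- Abort any session on this connection that stays idle inside an open
    -- transaction for more than 5 minutes
    SET idle_in_transaction_session_timeout = '5min';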

While the backend application opening the transaction could close it earlier, a transaction could theoretically stay open for a full 24 hours. On its own, this may not seem like a critical issue, but when combined with other conditions, it can bring down an entire system. Let’s explore how this happens.

  1. Transaction T1 runs a simple SELECT query on Table1. This acquires an AccessShareLock on the table, and then the session goes idle inside the transaction, for some unknown reason, without committing or rolling back.
  2. Transaction T2 arrives: a migration that adds a new nullable column to Table1. This requires an AccessExclusiveLock, which conflicts with every other lock type on Table1, including T1's AccessShareLock. So T2 cannot run and has to wait for T1 to finish and release its lock.
  3. Meanwhile, in a highly loaded application, many other transactions (T3…TN) keep arriving that also touch Table1. None of them can run either, because they queue up behind T2's pending lock request, so they all wait for T2 (see the sketch after this list).
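
Here is a minimal sketch of the scenario; the table and column names are made up for illustration:

    -- Session 1 (T1): reads the table, then sits idle inside the open transaction
    BEGIN;
    SELECT * FROM table1 LIMIT 1;   -- acquires an AccessShareLock on table1
    -- ... no COMMIT or ROLLBACK; the session is now "idle in transaction" ...

    -- Session 2 (T2): the migration, blocked by T1's lock
    BEGIN;
    ALTER TABLE table1 ADD COLUMN new_col text;   -- needs an AccessExclusiveLock, waits

    -- Sessions 3..N (T3…TN): ordinary queries, blocked behind T2's pending lock
    SELECT * FROM table1 WHERE id = 42;           -- waits as well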

Since T1 is idle and not releasing its lock, T2 remains blocked. As a result, all subsequent transactions (T3…TN) queue up behind it, consuming all available database connections. Eventually, the system reaches its limit, and the entire application goes down due to a lack of available connections.
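
One quick way to see this building up (assuming PostgreSQL 9.6 or newer, where pg_stat_activity exposes wait events) is:

    -- How many backends are currently stuck waiting on a lock?
    SELECT count(*) FROM pg_stat_activity WHERE wait_event_type = 'Lock';

    -- How close is that to the connection limit?
    SHOW max_connections;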

At this point, you’re dealing with a P0 incident with no immediate clues about what went wrong. Situations like this force us to learn the hard way—when a simple understanding of PostgreSQL’s settings could have helped us diagnose and prevent the issue much faster.

In this specific case, we could find the PID of T1 by querying the pg_locks and pg_stat_activity views and terminate that backend to let T2 proceed. Because adding a nullable column is a fairly fast operation, T2 would finish quickly and the other transactions would hardly be blocked at all.
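
A sketch of what that investigation could look like; the table name and PID are hypothetical:

    -- Who holds or is waiting for locks on Table1, and what are they doing?
    SELECT l.pid, l.mode, l.granted, a.state, a.query,
           pg_blocking_pids(l.pid) AS blocked_by
    FROM pg_locks l
    JOIN pg_stat_activity a ON a.pid = l.pid
    WHERE l.relation = 'table1'::regclass;

    -- Once the idle-in-transaction PID is identified, terminate it
    SELECT pg_terminate_backend(12345);   -- 12345 is a hypothetical PID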

Knowing what we know now, we could also have prevented the issue by setting a lock timeout for T2 with SET lock_timeout = '5s'. If T2 cannot acquire its lock within 5 seconds, it fails instead of queuing, the incoming transactions T3…TN can proceed, and the system keeps working; the migration can simply be retried later.
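
For example, the migration could set the timeout only for its own transaction (the 5-second value is just an illustration):

    BEGIN;
    SET LOCAL lock_timeout = '5s';   -- applies only to this transaction
    ALTER TABLE table1 ADD COLUMN new_col text;
    COMMIT;

If the ALTER TABLE cannot get its lock within 5 seconds, the whole transaction aborts and can be retried later, instead of holding up every query behind it.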

All that aside, do we really need to keep a session that is idle in transaction for 24 hours? No, we don't!
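
A much tighter limit can be applied per database rather than per session; for example (mydb and the 5-minute value are illustrative, pick what fits your workload):

    ALTER DATABASE mydb SET idle_in_transaction_session_timeout = '5min';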

Conclusion

Before using a new library, framework, or tool in production, it’s critical to understand its default settings and configuration options. A quick review of PostgreSQL’s locking behavior and transaction settings could have prevented this outage entirely.

By proactively tuning these settings, we can avoid system-wide failures and diagnose issues faster—rather than learning through painful outages.