High LWLock Contention During Concurrent HNSW Index Scans
Problem
I’ve been running into LWLock contention issues with HNSW indexes during concurrent workloads, and I wanted to see if anyone has insights or suggestions for improving this. The problem becomes noticeable at 32+ DB connections: database load becomes dominated by LWLock:LockManager wait events. At lower concurrency, QPS scales well and there’s minimal lock contention, but as we add connections, contention spikes and throughput saturates.

Observations

Lock Behavior - Based on the code in hnswscan.c, the search process takes `LockPage(..., HNSW_SCAN_LOCK, ShareLock)` to protect access to the adjacency graph during traversal. Each lock is held only briefly, but acquisitions pile up when multiple queries hit the same graph structures.

Scaling Issue - With ~32 workers (1 connection per worker), QPS is great and lock contention is low. When we push from 32 to 100 or more workers, contention grows sharply, and LWLock:LockManager comes to dominate database load.

What I’ve Tried

- Concurrency tuning: Sticking to ~32 workers seems to work best, but we’d like to scale further if possible.
- Instance scaling: Larger instances don’t help much, because the bottleneck is lock contention, not compute or I/O.
- Yet to try: prepared statements.

Ideas

1. Finer-grained locking: Is there a way to make HNSW_SCAN_LOCK finer-grained, so that multiple queries can traverse the graph without so much contention?
2. Asynchronous …
Fix
Implement Finer-Grained Locking for HNSW Index Scans
The high LWLock contention during concurrent HNSW index scans is primarily due to the use of a single lock (HNSW_SCAN_LOCK) for protecting access to the adjacency graph. When multiple queries attempt to traverse the same graph structure simultaneously, they contend for this lock, leading to increased wait times and reduced throughput as the number of concurrent connections rises.
1. Analyze Locking Strategy
Review the current locking strategy in hnswscan.c to identify opportunities for finer-grained locking. Consider breaking down the HNSW_SCAN_LOCK into multiple locks that can protect smaller sections of the adjacency graph, allowing for concurrent access.
2. Implement Fine-Grained Locks
Modify the HNSW index implementation to use multiple locks instead of a single lock. This could involve creating locks for individual nodes or clusters within the graph, allowing multiple queries to traverse different parts of the graph simultaneously without contention.
3. Test Locking Changes
Run performance tests with varying levels of concurrency (32, 64, 100+ connections) to measure the impact of the new locking strategy on LWLock contention and overall throughput. Monitor the LockManager wait events to ensure they are reduced.
4. Optimize Query Patterns
Review and optimize the query patterns so that they are not excessively re-locking the same graph structures. Prepared statements help here: they avoid re-parsing and re-planning on every execution, which trims both per-query CPU overhead and planning-time lock traffic.
```sql
-- HNSW indexes accelerate nearest-neighbor ORDER BY ... LIMIT queries,
-- not arbitrary distance predicates. Table/column names are illustrative.
PREPARE stmt (vector, int) AS
  SELECT * FROM items ORDER BY embedding <-> $1 LIMIT $2;
```

5. Monitor and Adjust
After deploying the changes, continuously monitor the system for any signs of contention or performance degradation. Be prepared to further adjust the locking strategy or query patterns based on real-world usage and performance metrics.
Validation
To confirm the fix worked, compare the LWLock:LockManager wait events and overall QPS before and after implementing the changes. A significant reduction in wait events and an increase in throughput under high concurrency should indicate success.
Submitted by
Alex Chen